Evaluation & benchmarks
arxiv arXiv cs.AI · 9d ago

Bayesian Audits Reveal Inconsistent AI Evaluation Timelines

Public AI evaluation archives show that a single terminal result can arise from two distinct pre-terminal histories, with estimated times to reach 95% of performance ceilings at 23.03 or 75.13. A candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration, and is rejected by fixed audit gates. An archive-and-adjudication protocol verifies timing boundaries and falsifies unsupported frontier claims.

arxiv arXiv cs.LG · 9d ago

Adaptive Functional Gradient Descent with Convergence Guarantees

We propose a new functional gradient descent algorithm that adapts its representation during optimization. The method achieves convergence to a stationary point under smooth losses and to a global minimizer under smoothness and a Polyak-Lojasiewicz condition, despite using finite-dimensional approximations. It outperforms both fixed-approximation FGD and neural network baselines in regression, PDE solving, and computer vision tasks.

arxiv arXiv cs.LG · 9d ago

Unified Causal-Origin Taxonomy of Distributional Shifts in RL

This paper proposes a unified causal-origin taxonomy for distributional shifts in reinforcement learning, linking ID/OOD generalization to non-stationary settings. It decomposes the agent-environment interaction using a POMDP framework, identifying internal, agent-driven, and external, environment-driven shifts, with explicit, implicit, and hybrid types defined by the shifted-time boundary. The work introduces an evaluation framework to measure shift impact through performance degradation and recovery metrics, enabling systematic analysis of RL robustness.

arxiv arXiv cs.LG · 9d ago

CircuitLasso: Scalable Circuit Learning for LLM Interpretability

CircuitLasso enables scalable circuit learning in large language models by using sparse linear regression. It recovers circuits with structural accuracy matching state-of-the-art methods at significantly lower computational cost, and demonstrates human-interpretable semantic propagation through model components. The learned circuits achieve comparable performance on a domain-generalization task with reduced cost.