Evaluation & benchmarks
arXiv cs.CL · 9d ago

Post-Hoc Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that none of 26 semantic post-hoc operators improves held-out accuracy over Best-of-N (BoN) in frozen small code models. Two operators, expression-layer recovery and adaptive consensus early-stop, offer benefits in program recovery or compute efficiency, but neither outperforms BoN in accuracy. The results highlight systemic limitations in error detection and coverage, suggesting that model harnesses and error coverage must improve before post-hoc reasoning is worth considering.
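
To make the compute-efficiency idea concrete, here is a minimal sketch of Best-of-N sampling with a consensus-based early stop. It is not the paper's operator: the summary does not define it, so the majority-vote selection rule, the `sample` callback, and the `min_samples` and `consensus` parameters are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def best_of_n_early_stop(
    sample: Callable[[int], Hashable],
    n: int = 16,
    min_samples: int = 4,
    consensus: float = 0.75,
) -> Tuple[Hashable, int]:
    """Draw up to n candidates and return (answer, samples_drawn).

    Hypothetical selection rule: majority vote over candidate outputs.
    Sampling stops early once the leading answer holds at least a
    `consensus` share of the votes after `min_samples` draws.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    votes: Counter = Counter()
    for i in range(n):
        votes[sample(i)] += 1
        drawn = i + 1
        answer, count = votes.most_common(1)[0]
        if drawn >= min_samples and count / drawn >= consensus:
            return answer, drawn  # consensus reached, skip remaining draws
    # No early consensus: fall back to the plain majority over all n draws.
    return votes.most_common(1)[0][0], n


# Usage: a sampler that always agrees stops after min_samples draws,
# while a sampler that never agrees uses the full budget.
agreeing = best_of_n_early_stop(lambda i: "x + 1")
disagreeing = best_of_n_early_stop(lambda i: i % 4, n=8)
```

Under these assumptions, early stopping can only reduce the number of samples drawn; it selects among the same candidates as plain voting, which is consistent with the reported finding that it saves compute without beating BoN on accuracy.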

arXiv cs.CL · 9d ago

Language Models Encode Value of Their Current Trajectory

Qwen3-8B internally tracks the value of its current trajectory, defined as the likelihood of achieving its goals. This 'value' axis distinguishes confidence levels, backtracking behavior, and code correctness, and shows that preference optimization boosts confidence in rewarded behaviors. The model assigns low value to politically sensitive queries post-training, and fine-tuning increases confidence within specific domains.

arXiv cs.AI · 9d ago

Semantic Flip: Synthetic OOD Generation for Robust Refusal

Semantic Flip proposes a framework to synthesize out-of-distribution samples by transforming queries and video memory to create unanswerable pairs. These pairs train a lightweight rejection module that attaches to existing vision-language models without retraining, improving refusal performance in embodied question answering and spatial localization. On the new SpaceReject benchmark, it achieves an F1 score of 0.9559.
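
For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts; treating "should refuse" as the positive class is an assumption, and the counts in the usage line are invented for illustration, not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0  # no correct positive predictions at all
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 8 correct refusals, 2 spurious refusals, 2 missed refusals.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```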

arXiv cs.AI · 9d ago

Variance in LLM Circuit Discovery: Causes and Mitigations

This paper analyzes variance in circuit discovery for large language models and identifies three sources: resampling, rephrasing, and sample-wise variance. It shows that CEAP reduces resampling variance and argues that rephrasing variance stems from prompt templates activating different circuits, implying that LLMs may be inherently hard to steer. The study also finds that sparsity does not resolve these issues and that sample-wise variance is largely benign, because it arises from selective contribution scaling that affects unfaithfulness scores.

arXiv cs.AI · 9d ago

MA-SBI: Calibration-Free SBI via Side-Channel Guidance

MA-SBI introduces a calibration-free simulation-based inference framework that uses side-channel text, like regime labels or instructions, to correct for simulator misspecification. It employs a learned corrector to apply observation-space shifts before posterior inference, without needing ground-truth parameter pairs or retraining. On hide-the-calibration benchmarks, MA-SBI matches the oracle posterior with text alone, outperforming RoPE under limited data, and shows robustness on real-world epidemiological and cognitive-science datasets.

arXiv cs.AI · 9d ago

Causal Model of Theory of Mind in AI Conflict

This paper proposes a structural causal model using a directed acyclic graph to define when Theory of Mind engagement is causally warranted in human-machine conflict. The model identifies four exogenous conditions, five mediators, and three causal pathways for ToM activation, with epistemic accuracy as the primary outcome. It offers a resource-rational framework for AI social reasoning, validated through simulation and human-machine studies.

arXiv cs.AI · 9d ago

Bayesian Audits Reveal Inconsistent AI Evaluation Timelines

Public AI evaluation archives show that a single terminal result can arise from two distinct pre-terminal histories, with estimated times to reach 95% of performance ceilings at 23.03 or 75.13. A candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration, and is rejected by fixed audit gates. An archive-and-adjudication protocol verifies timing boundaries and falsifies unsupported frontier claims.