Evaluation & benchmarks — korshunov.ai

Post-Hoc Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that none of 26 semantic post-hoc operators improves held-out accuracy over Best-of-N (BoN) in frozen small code models. Two operators, expression-layer recovery and adaptive consensus early-stop, offer benefits in compute efficiency or program recovery, but none of the 26 outperforms BoN in accuracy. The results point to systemic limitations in error detection and coverage, suggesting that model harnesses and error coverage must be improved before post-hoc reasoning is worth considering.

arXiv cs.CL · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprints.

arXiv cs.CL · 9d ago

MetaSyn: Benchmarking LLM Agents on Meta-Analysis Articles

MetaSyn introduces a dataset of 442 expert-curated meta-analyses from Nature Portfolio. It evaluates twelve LLM agent configurations and reveals a critical bottleneck in study screening, where no system recovers more than 52.7% of ground-truth included literature despite high retrieval recall.

arXiv cs.CL · 9d ago

Language Models Encode Value of Their Current Trajectory

Qwen3-8B internally tracks the value of its current trajectory, defined as the likelihood of achieving its goals. This 'value' axis distinguishes confidence levels, backtracking behavior, and code correctness, and shows that preference optimization boosts confidence in rewarded behaviors. The model assigns low value to politically sensitive queries post-training, and fine-tuning increases confidence within specific domains.

Semantic Flip: Synthetic OOD Generation for Robust Refusal

Variance in LLM Circuit Discovery: Causes and Mitigations

MA-SBI: Calibration-Free SBI via Side-Channel Guidance

RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting

Unified Causal-Origin Taxonomy for Distributional Shifts in RL

CircuitLasso: Scalable Circuit Learning for LLM Interpretability

Causal Model of Theory of Mind in AI Conflict

Causal Framework for Auditing Synthetic Data Disclosures

Low Frame Rate Degradation in Neural Audio Codecs

Textual Reviews Have Limited Impact in Recommendation Models

AI research documentation improves over decade

Agentic LLM Framework for HTS Code Classification

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

Bayesian Audits Reveal Inconsistent AI Evaluation Timelines

TuneJury: Open Metric for Music Generation Preference Alignment
