Evaluation & benchmarks — korshunov.ai

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM is a training-free, zero-shot framework that enhances SAM 3 for open-vocabulary semantic segmentation by identifying an image-conditioned active class set. It improves the speed-accuracy tradeoff, outperforming SegEarth-OV3 by +1.4 mIoU on average and running up to 5.5x faster on large-vocabulary datasets, with strong robustness to image corruption.

arxiv arXiv cs.LG · 10d ago

ExpRL: Exploratory RL for LLM Mid-Training

ExpRL introduces a novel mid-training approach for LLMs using human-written question-answer data as reward scaffolds. Instead of imitating reference solutions, it constructs problem-specific grading rubrics to reward intermediate reasoning steps, enabling better initialization for sparse-reward RL and outperforming SFT, sparse-reward GRPO, and self-distillation on math reasoning tasks.

arxiv arXiv cs.LG · 10d ago

HABC Improves RL Fine-Tuning of VLAs with Sparse Outcomes

Hierarchical Advantage-Weighted Behavior Cloning (HABC) enhances online RL fine-tuning of vision-language-action (VLA) models by using separate critic heads for viability and efficiency. It combines their outputs via a state-adaptive gate and applies per-transition weights, while intervention-aware credit assignment prevents supervision leakage. In real-robot experiments, HABC raises success rates to 92%, 88%, and 38% on three bimanual tasks, compared with SFT baselines of 36%, 44%, and 12%.

media r/LocalLLaMA · 10d ago

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 achieving a 36.6% pushback rate, higher than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses, suggesting that architecture and training data matter more than parameter count.

arxiv arXiv cs.AI · 9d ago

Bayesian Audits Reveal Inconsistent AI Evaluation Timelines

Public AI evaluation archives show that a single terminal result can arise from two distinct pre-terminal histories, with estimated times to reach 95% of performance ceilings at 23.03 or 75.13. A candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration, and is rejected by fixed audit gates. An archive-and-adjudication protocol verifies timing boundaries and falsifies unsupported frontier claims.

arxiv arXiv cs.AI · 9d ago

TuneJury: Open Metric for Music Generation Preference Alignment

TuneJury is an open, instance-level pairwise reward model that predicts music preference scores from text prompts and audio clips. It is trained on diverse human-preference data and demonstrates strong generalization, with anchor calibration enabling efficient post-hoc alignment for music generation systems.

arxiv arXiv cs.AI · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprint without introducing prefix mismatches.

arxiv arXiv cs.AI · 9d ago

Phase in Neural Representations: An Internal Oppenheim-Lim Test

Image classifiers like PRISM2D, GFNet, and ViT-B/16 show that phase, not magnitude, drives predictions in hidden layers. ResNet-50 reveals a latent sign code in late blocks, indicating phase/sign identity exists across architectures, though expressed differently due to activation and readout mechanisms.

arxiv arXiv cs.LG · 9d ago

Factorized Neural Operators Decompose Dynamic and Persistent Responses

Factorized Neural Operators (FaNO) decompose spectral representations into equivariant dynamic and invariant persistent responses. This factorized structure enables better interpretability, generalization, and consistent predictions across scales, domains, and physical regimes.

arxiv arXiv cs.LG · 9d ago

CEAP Reduces Variance in LLM Circuit Discovery

CEAP, a new circuit discovery method, substantially reduces resampling variance compared to EAP-IG. The paper shows that rephrasing variance arises from prompt templates activating different circuits, suggesting LLMs are inherently hard to steer across diverse inputs. Sample-wise variance is largely benign, as poor unfaithfulness scores result from selective contribution scaling, not circuit defects.

arxiv arXiv cs.LG · 9d ago

Adaptive Functional Gradient Descent with Convergence Guarantees

The paper proposes a functional gradient descent algorithm that adapts its representation during optimization. Despite using finite-dimensional approximations, the method converges to a stationary point under smooth losses and to a global minimizer under smoothness and a Polyak-Łojasiewicz condition. It outperforms both fixed-approximation FGD and neural network baselines in regression, PDE solving, and computer vision tasks.

arxiv arXiv cs.LG · 10d ago

Unified Causal-Origin Taxonomy of Distributional Shifts in RL

This paper proposes a unified causal-origin taxonomy for distributional shifts in reinforcement learning, linking ID/OOD generalization to non-stationary settings. It decomposes the agent-environment interaction using a POMDP framework, identifying internal, agent-driven, and external, environment-driven shifts, with explicit, implicit, and hybrid types defined by the shifted-time boundary. The work introduces an evaluation framework to measure shift impact through performance degradation and recovery metrics, enabling systematic analysis of RL robustness.

arxiv arXiv cs.LG · 10d ago

CircuitLasso: Scalable Circuit Learning for LLM Interpretability

CircuitLasso enables scalable circuit learning in large language models by using sparse linear regression. It recovers circuits with structural accuracy matching state-of-the-art methods at significantly lower computational cost, and demonstrates human-interpretable semantic propagation through model components. The learned circuits achieve comparable performance on a domain-generalization task with reduced cost.

arxiv arXiv cs.LG · 10d ago

Causal Framework for Auditing Synthetic Data Disclosures

A model-agnostic auditing framework detects and distinguishes true and phantom disclosures in synthetic data. It uses only synthetic outputs and a held-out control set to perform statistical testing, offering tighter privacy leakage bounds than prior methods without requiring model access or additional training.

arxiv arXiv cs.LG · 10d ago

Task-Error Residual Learning for Real-Robot Five-Ball Juggling

A residual learning approach using directional task-error supervision achieves stable five-ball juggling on real robots, converging from the second attempt. The system outperforms human practice timelines and relies on both directional feedback and an informative prior, with a fixed-Jacobian Newton update proving most reliable.

arxiv arXiv cs.LG · 10d ago

Post-Hoc Falsification Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that 26 semantic post-hoc operators do not improve held-out accuracy over Best-of-N in frozen small code models. While some operators reduce compute usage or recover correct programs, none outperform BoN in accuracy, due to systemic limitations like coverage walls and consensus traps. An expression-layer recovery (M1) improves performance on HumanEval+ by 12 tasks, with no harm or leakage, and shows consistent results across model cells.

arxiv arXiv cs.LG · 10d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to stabilize prompt prefixes and manage context segments efficiently.

arxiv arXiv cs.LG · 10d ago

KVEraser: Efficient Localized Context Erasing in LLMs

KVEraser enables efficient localized context erasing in large language models by replacing only the KV cache states of an erased span with learned steering states. It achieves near-full-recomputation performance on in-domain tasks while incurring a 24% latency increase, versus a 17.6x increase for full recomputation, and delivers up to a 3-4x speedup on long-document QA tasks.

arxiv arXiv cs.LG · 10d ago

DP-FL Backdoor Attacks: RING Exploits Privacy for Malicious Signals

A new attack, RING, exploits differential privacy in federated learning to conceal backdoor signals while maximizing impact. It achieves a 90.3% attack success rate against state-of-the-art defenses, up to 26.08x higher than baseline methods, and reveals a critical security gap in DP-FL: the privacy mechanism inherently masks malicious updates.
