Evaluation & benchmarks — korshunov.ai

Topic · Evaluation & benchmarks

SuCo introduces Minimal Sufficient CoT (MSC) as the shortest reasoning prefix adequate for correct answers. It employs a two-stage training framework—MSC-Aligned Fine-Tuning and Sufficiency-Aware Policy Optimization—to reduce reasoning length while maintaining or improving accuracy across math, code, and science tasks.

arxiv arXiv cs.CL · 8d ago

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.

arxiv arXiv cs.CL · 8d ago

Dynamic Rollout Editing Reduces Overthinking in RL-Trained Reasoning Models

Dynamic Rollout Editing (DRE) addresses overthinking in RL-trained reasoning models by modifying successful trajectories post-answer emergence. DRE preserves the correct reasoning prefix while editing unnecessary continuation, weakening the credit assigned to redundant thinking without penalizing valid reasoning. Experiments across diverse tasks demonstrate its effectiveness in reducing overthinking.

arxiv arXiv cs.CL · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprints.

arxiv arXiv cs.CL · 9d ago

MetaSyn: Benchmarking LLM Agents on Meta-Analysis Articles

MetaSyn introduces a dataset of 442 expert-curated meta-analyses from Nature Portfolio. It evaluates twelve LLM agent configurations and reveals a critical bottleneck in study screening, where no system recovers more than 52.7% of ground-truth included literature despite high retrieval recall.

arxiv arXiv cs.CL · 9d ago

Language Models Encode Value of Their Current Trajectory

Qwen3-8B internally tracks the value of its current trajectory, defined as the likelihood of achieving its goals. This 'value' axis distinguishes confidence levels, backtracking behavior, and code correctness, and shows that preference optimization boosts confidence in rewarded behaviors. The model assigns low value to politically sensitive queries post-training, and fine-tuning increases confidence within specific domains.

arxiv arXiv cs.AI · 9d ago

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM is a training-free, zero-shot framework that enhances SAM 3 for open-vocabulary semantic segmentation by identifying an image-conditioned active class set. It improves speed-accuracy tradeoff, outperforming SegEarth-OV3 by +1.4 mIoU on average and running up to 5.5x faster on large-vocabulary datasets, with strong robustness under image corruption.

arxiv arXiv cs.LG · 9d ago

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

arxiv arXiv cs.LG · 9d ago

ExpRL: Exploratory RL for LLM Mid-Training

ExpRL introduces a novel mid-training approach for LLMs using human-written question-answer data as reward scaffolds. Instead of imitating reference solutions, it constructs problem-specific grading rubrics to reward intermediate reasoning steps, enabling better initialization for sparse-reward RL and outperforming SFT, sparse-reward GRPO, and self-distillation on math reasoning tasks.

arxiv arXiv cs.LG · 9d ago

HABC Improves RL Fine-Tuning of VLAs with Sparse Outcomes

Hierarchical Advantage-Weighted Behavior Cloning (HABC) enhances online RL fine-tuning of vision-language agents by using separate critic heads for viability and efficiency. It combines their outputs via a state-adaptive gate and applies per-transition weights, while intervention-aware credit assignment prevents supervision leakage. In real-robot experiments, HABC boosts success rates to 92%, 88%, and 38% on three bimanual tasks, surpassing SFT baselines of 36%, 44%, and 12%.

media r/LocalLLaMA · 10d ago

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 achieving 36.6% pushback—higher than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses, indicating that architecture and training data matter more than parameters.

arxiv arXiv cs.CL · 8d ago

LLMs Infer Cultural Context but Fail to Apply It

LLMs can detect cultural cues and recall cultural conventions, but often fail to adapt responses accordingly. Their responses remain biased toward their native culture unless explicitly prompted to apply cultural context sequentially.

arxiv arXiv cs.CL · 8d ago

EComAgentBench: Benchmarking Shopping Agents with Hidden Intent

EComAgentBench introduces a benchmark of 662 real Amazon tasks that scatter shopper requirements across query, profile, and clarification. Agents must uncover hidden intent, verify candidates with evidence, and commit to a product within 100 tool calls, with typed rubrics attributing failures to specific requirement sources. Evaluation shows even top models achieve only 57.1% accuracy, and rubric satisfaction drops when intent is hidden.

arxiv arXiv cs.CL · 8d ago

Coding Benchmarks Misaligned with Agentic Software Engineering

Current coding benchmarks were designed before agentic software engineering and fail to capture the complexity of real-world systems. They conflate model performance with the entire harness, ignore valid alternative solutions, and lack feedback signals at individual component levels, making iterative improvement difficult.

arxiv arXiv cs.CL · 8d ago

DIFE Audits CLIP Backdoor Exposure Across Deployment Interfaces

DIFE evaluates backdoored CLIP checkpoints across different deployment interfaces, revealing that native success does not guarantee safety in reuse. The framework shows text-side poisoning enables adversarial exposure in retrieval, reranking, and selection tasks, while visual-only use remains largely unaffected. BadTextTower is introduced to generate strong text-conditioned exposure without compromising visual performance.

arxiv arXiv cs.CL · 8d ago

A Framework for Evaluating Agentic Skills at Scale

We present a framework for evaluating agentic skills by constructing realistic tasks and assessing skill utility through task execution. Applied to 500 real-world skills, it generates 1,000 tasks and scoring rubrics, evaluating 19 agent-model configurations across proprietary and open-source models. Results show significant variation in instruction adherence and performance gains, with skills substantially altering model behavior compared to no-skill setups.

arxiv arXiv cs.CL · 8d ago

GameCraft-Bench: Evaluating End-to-End Game Generation

GameCraft-Bench introduces a benchmark with 140 Godot tasks across 15 game families to assess coding agents' ability to generate playable games. Evaluations show the best agent achieves only 41.46% success, indicating significant challenges in producing complete, interactive games with coherent gameplay and visual feedback.

arxiv arXiv cs.CL · 8d ago

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

ChLogic evaluates how well large language models maintain logical reasoning when English logical structures are expressed in Chinese. It reveals a persistent English-Chinese performance gap, with back-translation improving results on general items but harming performance on difficult problems. The benchmark highlights the impact of surface realization, translation artifacts, and model-specific behaviors on multilingual reasoning.

arxiv arXiv cs.CL · 8d ago

Non-negative Elastic Net Decoding for Information Retrieval

NNN decoding selects documents as a joint set that jointly reconstructs the query embedding via a sparse non-negative linear combination. It strictly extends dense retrieval by handling queries that dense retrieval fails on, especially in corpora with correlated documents, and achieves superior performance through end-to-end training of embeddings.

arxiv arXiv cs.CL · 9d ago

VoidPadding: Decoupling [EOS] Termination and Padding in MDLMs

VoidPadding introduces [VOID] as a padding token to separate semantic termination and response-length modeling. It improves performance on mathematical reasoning and code generation by 17.84 points over the original model and reduces decoding NFE by 55.7% on average.

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

Automated Prompt Optimization for LLM Game Agents

Dynamic Rollout Editing Reduces Overthinking in RL-Trained Reasoning Models

TokenPilot: Cache-Efficient Context Management for LLM Agents

MetaSyn: Benchmarking LLM Agents on Meta-Analysis Articles

Language Models Encode Value of Their Current Trajectory

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

ExpRL: Exploratory RL for LLM Mid-Training

HABC Improves RL Fine-Tuning of VLAs with Sparse Outcomes

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

LLMs Infer Cultural Context but Fail to Apply It

EComAgentBench: Benchmarking Shopping Agents with Hidden Intent

Coding Benchmarks Misaligned with Agentic Software Engineering

DIFE Audits CLIP Backdoor Exposure Across Deployment Interfaces

A Framework for Evaluating Agentic Skills at Scale

GameCraft-Bench: Evaluating End-to-End Game Generation

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

Non-negative Elastic Net Decoding for Information Retrieval

VoidPadding: Decoupling [EOS] Termination and Padding in MDLMs