Evaluation & benchmarks — korshunov.ai

Pruned LLMs Fail in Open Generation Despite Passing Multiple Choice

Pruned large language models often pass multiple-choice tests but fail to generate correct answers in open-ended responses. This 'benchmark illusion' shows that answers are not erased but demoted, reappearing only with advanced generation techniques like beam search or sampling. Standard benchmarks overstate the practical usability of compressed models, highlighting a critical evaluation blind spot.

arXiv cs.CL · 8d ago

OPD-Evolver: On-Policy Distillation for Holistic Agent Evolving

OPD-Evolver introduces a slow-fast co-evolution framework that enables agents to select, act on, and reuse experience through on-policy self-distillation. It outperforms existing memory-based and training-based methods by up to 11.5% and 5.8%, respectively, and can challenge large-scale models such as Qwen3.5-397B-A17B and Step-3.5-Flash.

arXiv cs.CL · 8d ago

Prompt Perturbation for Reliable LLM Evaluation

A new framework uses prompt perturbation to identify and filter structurally inconsistent pairwise comparisons in large language model evaluations. By incorporating graph-level consistency checks before ranking aggregation, the method reduces cyclic preferences and improves the reliability of LLM rankings.

arXiv cs.CL · 8d ago
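
To make the idea of a cyclic preference in the Prompt Perturbation entry concrete, here is a minimal sketch that treats pairwise judgments as a directed graph and looks for a cycle with a depth-first search. It is not the paper's method: the function name `find_preference_cycle` and the `(winner, loser)` input format are assumptions made for illustration, and the summary does not say how the framework itself detects or filters inconsistent comparisons.

```python
from collections import defaultdict


def find_preference_cycle(comparisons):
    """Return one preference cycle as a list of items, or None if acyclic.

    comparisons: iterable of (winner, loser) pairs (hypothetical format).
    """
    # Build a directed graph with an edge winner -> loser.
    graph = defaultdict(list)
    for winner, loser in comparisons:
        graph[winner].append(loser)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = defaultdict(int)
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                # Back edge: nxt is already on the current path.
                return path[path.index(nxt):]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    judgments = [("A", "B"), ("B", "C"), ("C", "A")]
    print(find_preference_cycle(judgments))
```

A ranking aggregator could run such a check before aggregation and set aside the comparisons that participate in a cycle; how to choose which edge to drop is left open here.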

SkillMigrator Enables Cross-Site Web Skill Transfer via Layout Matching

SkillMigrator learns reusable web skills by matching layout structures instead of specific element references. It stores each skill as a transferable interaction pattern (TIP) with a structural sketch, enabling efficient skill reuse across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arXiv cs.CL · 8d ago

MambaCount: Efficient Text-guided Object Counting

MambaCount introduces a spatial sparse state space duality block to enable efficient text-guided open-vocabulary object counting. It addresses causal modeling limitations and high entropy in spatial token responses, achieving state-of-the-art results on FSC-147 with a test MAE of 12.23 while maintaining linear complexity.

arXiv cs.CL · 8d ago

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

SuCo introduces Minimal Sufficient CoT (MSC) as the shortest reasoning prefix adequate for correct answers. It employs a two-stage training framework—MSC-Aligned Fine-Tuning and Sufficiency-Aware Policy Optimization—to reduce reasoning length while maintaining or improving accuracy across math, code, and science tasks.

arXiv cs.CL · 8d ago
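
The definition of a Minimal Sufficient CoT in the SuCo entry (the shortest reasoning prefix adequate for a correct answer) can be illustrated with a small search over prefixes. This is only a sketch of the definition, not SuCo's training procedure: `minimal_sufficient_prefix` and the `is_sufficient` oracle are hypothetical names, and in practice the oracle would have to be some check that the model answers correctly from the given prefix.

```python
from typing import Callable, Optional, Sequence


def minimal_sufficient_prefix(
    steps: Sequence[str],
    is_sufficient: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the smallest k such that steps[:k] is sufficient, else None.

    A linear scan is used because sufficiency is not assumed to be
    monotone in prefix length.
    """
    for k in range(len(steps) + 1):
        if is_sufficient(steps[:k]):
            return k
    return None


if __name__ == "__main__":
    chain = [
        "Let 2x = 8.",
        "Divide both sides by 2.",
        "x = 4",
        "Check: 2 * 4 = 8.",
    ]
    # Toy oracle: the prefix is sufficient once it states the answer.
    oracle = lambda prefix: any("x = 4" in step for step in prefix)
    print(minimal_sufficient_prefix(chain, oracle))
```

In the toy chain, the final verification step falls outside the minimal prefix, which is the kind of redundant reasoning the entry says SuCo aims to trim.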

LLMs Infer Cultural Context but Fail to Apply It

LLMs can detect cultural cues and recall cultural conventions, but often fail to adapt responses accordingly. Their responses remain biased toward their native culture unless explicitly prompted to apply cultural context sequentially.

arXiv cs.CL · 8d ago

EComAgentBench: Benchmarking Shopping Agents with Hidden Intent

EComAgentBench introduces a benchmark of 662 real Amazon tasks that scatter shopper requirements across query, profile, and clarification. Agents must uncover hidden intent, verify candidates with evidence, and commit to a product within 100 tool calls, with typed rubrics attributing failures to specific requirement sources. Evaluation shows even top models achieve only 57.1% accuracy, and rubric satisfaction drops when intent is hidden.

arXiv cs.CL · 8d ago

Coding Benchmarks Misaligned with Agentic Software Engineering

Current coding benchmarks were designed before agentic software engineering and fail to capture the complexity of real-world systems. They conflate model performance with the entire harness, ignore valid alternative solutions, and lack feedback signals at individual component levels, making iterative improvement difficult.

arXiv cs.CL · 8d ago

DIFE Audits CLIP Backdoor Exposure Across Deployment Interfaces

DIFE evaluates backdoored CLIP checkpoints across different deployment interfaces, revealing that native success does not guarantee safety in reuse. The framework shows text-side poisoning enables adversarial exposure in retrieval, reranking, and selection tasks, while visual-only use remains largely unaffected. BadTextTower is introduced to generate strong text-conditioned exposure without compromising visual performance.

arXiv cs.CL · 9d ago

A Framework for Evaluating Agentic Skills at Scale

We present a framework for evaluating agentic skills by constructing realistic tasks and assessing skill utility through task execution. Applied to 500 real-world skills, it generates 1,000 tasks and scoring rubrics, evaluating 19 agent-model configurations across proprietary and open-source models. Results show significant variation in instruction adherence and performance gains, with skills substantially altering model behavior compared to no-skill setups.

arXiv cs.CL · 9d ago

Bilingual fine-tuning improves low-resource ASR with language identification

A study finds bilingual fine-tuning enhances automatic speech recognition in low-resource languages when language identification is accurate. Including a language identification token at inference improves ASR performance when identification accuracy is low, especially in diverse language pairs across different families and writing systems.

arXiv cs.CL · 9d ago

MultiClin Benchmark for Multiscript ASR in Clinical Settings

MultiClin introduces a clinical ASR benchmark that evaluates models' robustness to multiscript variability. It shows that multiscript-aware evaluation outperforms conventional single-reference methods, and script unification yields the best ASR performance, while inconsistent script mappings increase orthographic uncertainty.

arXiv cs.CL · 9d ago

Self-supervised speech models lack tonal context compensation

The wav2vec 2.0 model shows no evidence of perceptual compensation for Mandarin tones in embedding similarities. Probing classifiers reveal limited compensation and fail to match human performance on isolated syllables, suggesting supervised training is needed for phonological regularity abstraction.

arXiv cs.CL · 9d ago

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.

arXiv cs.CL · 9d ago

GameCraft-Bench: Evaluating End-to-End Game Generation

GameCraft-Bench introduces a benchmark with 140 Godot tasks across 15 game families to assess coding agents' ability to generate playable games. Evaluations show the best agent achieves only 41.46% success, indicating significant challenges in producing complete, interactive games with coherent gameplay and visual feedback.

arXiv cs.CL · 9d ago

Dynamic Rollout Editing Reduces Overthinking in RL-Trained Reasoning Models

Dynamic Rollout Editing (DRE) addresses overthinking in RL-trained reasoning models by modifying successful trajectories after the answer has emerged. DRE preserves the correct reasoning prefix while editing the unnecessary continuation, weakening the credit assigned to redundant thinking without penalizing valid reasoning. Experiments across diverse tasks demonstrate its effectiveness in reducing overthinking.

arXiv cs.CL · 9d ago

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

ChLogic evaluates how well large language models maintain logical reasoning when English logical structures are expressed in Chinese. It reveals a persistent English-Chinese performance gap, with back-translation improving results on general items but harming performance on difficult problems. The benchmark highlights the impact of surface realization, translation artifacts, and model-specific behaviors on multilingual reasoning.

arXiv cs.CL · 9d ago

Non-negative Elastic Net Decoding for Information Retrieval

NNN decoding selects documents as a set that jointly reconstructs the query embedding via a sparse non-negative linear combination. It strictly extends dense retrieval by handling queries that dense retrieval fails on, especially in corpora with correlated documents, and achieves superior performance through end-to-end training of embeddings.

arXiv cs.CL · 9d ago
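
For readers unfamiliar with the technique named in the Non-negative Elastic Net Decoding entry, the textbook non-negative elastic net problem can be written as below. This is the standard form, given as an assumption about what "sparse non-negative linear combination" refers to; the paper's exact objective may differ. The symbols are introduced here for illustration: $q \in \mathbb{R}^d$ is the query embedding, the columns of $D \in \mathbb{R}^{d \times n}$ are the $n$ document embeddings, and $\lambda_1, \lambda_2 \ge 0$ are regularization weights.

```latex
\hat{w} \;=\; \arg\min_{w \in \mathbb{R}^n,\; w \ge 0}\;
  \tfrac{1}{2}\,\lVert q - D w \rVert_2^2
  \;+\; \lambda_1 \lVert w \rVert_1
  \;+\; \tfrac{\lambda_2}{2}\,\lVert w \rVert_2^2,
\qquad
\text{retrieved set} \;=\; \{\, i : \hat{w}_i > 0 \,\}
```

Under this reading, the $\ell_1$ term keeps the selected set small, while the $\ell_2$ term lets correlated documents share weight instead of one being picked arbitrarily. Restricting the retrieved set to a single document reduces the reconstruction to a scaled copy of one embedding, which is one way to see the connection to ordinary dense retrieval.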

Interventional Post-Training of Speech Foundation Models

A new method uses interventional contrastive learning to refine speech foundation models by transforming their entangled representations into separate content and speaker subspaces. The approach improves out-of-domain speaker verification performance and demonstrates clear separation of speaker and content information in the learned subspaces.