Evaluation & benchmarks — korshunov.ai — ML news


arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops by 16-23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, and embedding-based shortlisting recovers 10-11 pp of F1. A human-annotated study of 1,435 utterances confirms a real-world recovery of 10-17 pp despite lower absolute performance.
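To make the idea of embedding-based shortlisting concrete, here is a minimal sketch, not the paper's implementation. It assumes each agent description and each request has already been embedded as a vector; the agent names, the toy vectors, and the `shortlist` helper are all hypothetical. The router would then choose only among the top-k most similar agents instead of the full catalog.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def shortlist(query_vec: list[float], agent_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k agents whose embeddings are closest to the query."""
    ranked = sorted(
        agent_vecs,
        key=lambda name: cosine(query_vec, agent_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy catalog: real embeddings would come from an embedding model.
AGENTS = {
    "billing": [1.0, 0.0, 0.0],
    "hr": [0.0, 1.0, 0.0],
    "it_support": [0.0, 0.0, 1.0],
    "payroll": [0.9, 0.1, 0.0],
}

candidates = shortlist([1.0, 0.05, 0.0], AGENTS, k=2)
```

The shortlist size k trades recall against confusion: a larger k is less likely to drop the correct agent but leaves more near-duplicates for the router to separate.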

arXiv cs.CL · 8d ago

Expressivity Analysis of Hierarchical Modelling in Deep Transformers

This paper analyzes deep transformer expressiveness using bounded-depth grammars. It constructs transformers with positional attention where model depth scales linearly with grammar depth, and neuron count grows quadratically with production rules. The results support the linear representation hypothesis by showing these models can encode abstract grammatical states in low-dimensional, linearly separable subspaces.

arXiv cs.CL · 8d ago

NAR-MBR Decoding for Fast and Accurate Speech Recognition

NAR-MBR decoding improves speech recognition by maximizing expected utility from sampled outputs of non-autoregressive models. It achieves better performance than prior NAR methods and runs faster than autoregressive decoding across multiple corpora.
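For readers unfamiliar with minimum Bayes risk (MBR) decoding, the following is a generic sketch of the selection step, not the NAR-MBR method itself. It assumes a list of sampled transcripts is already available and uses negative word-level edit distance as the utility; both the utility choice and the function names are illustrative assumptions.

```python
from typing import Callable, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over word tokens, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def utility(hyp: str, ref: str) -> float:
    """Assumed utility: fewer word edits between hypothesis and pseudo-reference is better."""
    return -float(word_edit_distance(hyp.split(), ref.split()))


def mbr_select(samples: list[str], utility_fn: Callable[[str, str], float] = utility) -> str:
    """Pick the sample with the highest average utility against all samples."""
    if not samples:
        raise ValueError("mbr_select needs at least one sample")
    best, best_score = samples[0], float("-inf")
    for hyp in samples:
        # Each sample acts as a pseudo-reference; repeated samples carry more weight.
        score = sum(utility_fn(hyp, other) for other in samples) / len(samples)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical sampled transcripts for one utterance.
SAMPLES = ["the cat sat", "the cat sat down", "a cat sat", "the cat sat"]
chosen = mbr_select(SAMPLES)
```

The selection is quadratic in the number of samples, which is why cheap sampling matters for this family of methods.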

arXiv cs.CL · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. This degradation is linked to LLM-alone discriminability (Delta_sig), which correlates strongly with concatenation cost (r² = 0.38) and shows a power-law relationship with feature dimension and node count (r² = 0.97), particularly in low-Delta_sig, low-node scenarios.

arXiv cs.CL · 8d ago

Pruned LLMs Fail in Open Generation Despite Passing Multiple Choice

Pruned large language models often pass multiple-choice tests but fail to generate correct answers in open-ended responses. This 'benchmark illusion' shows that answers are not erased but demoted, reappearing only with advanced generation techniques like beam search or sampling. Standard benchmarks overstate the practical usability of compressed models, highlighting a critical evaluation blind spot.

arXiv cs.CL · 8d ago

OPD-Evolver: On-Policy Distillation for Holistic Agent Evolving

OPD-Evolver introduces a slow-fast co-evolution framework that enables agents to select, act on, and reuse experience through on-policy self-distillation. It outperforms existing memory-based and training-based methods by up to 11.5% and 5.8%, respectively, and shows that it can challenge large-scale models such as Qwen3.5-397B-A17B and Step-3.5-Flash.

arXiv cs.CL · 8d ago

Prompt Perturbation for Reliable LLM Evaluation

A new framework uses prompt perturbation to identify and filter structurally inconsistent pairwise comparisons in large language model evaluations. By incorporating graph-level consistency checks before ranking aggregation, the method reduces cyclic preferences and improves the reliability of LLM rankings.
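As an illustration of what a graph-level consistency check can look like, the sketch below finds cyclic preferences among triples of models (A beats B, B beats C, C beats A). This is a generic check under stated assumptions, not the paper's procedure: the preference set, the model names, and `find_preference_cycles` are hypothetical, and each pair is assumed to have a single recorded winner.

```python
from itertools import combinations


def find_preference_cycles(prefs: set[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Return every 3-cycle in a set of (winner, loser) pairwise outcomes.

    Each cycle is reported once, as (x, y, z) meaning x > y, y > z, z > x.
    """
    items = sorted({name for pair in prefs for name in pair})
    cycles = []
    for a, b, c in combinations(items, 3):
        if (a, b) in prefs and (b, c) in prefs and (c, a) in prefs:
            cycles.append((a, b, c))
        elif (a, c) in prefs and (c, b) in prefs and (b, a) in prefs:
            cycles.append((a, c, b))
    return cycles


# Hypothetical judged comparisons between four models.
PREFS = {
    ("A", "B"), ("B", "C"), ("C", "A"),  # inconsistent triple
    ("A", "D"), ("B", "D"), ("C", "D"),  # D loses to everyone, no cycle
}

cycles = find_preference_cycles(PREFS)
```

Comparisons that take part in such cycles are natural candidates for re-judging or filtering before the outcomes are aggregated into a ranking.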

arXiv cs.CL · 8d ago

SkillMigrator Enables Cross-Site Web Skill Transfer via Layout Matching

SkillMigrator learns reusable web skills by matching layout structures instead of specific element references. It stores each skill as a transferable interaction pattern (TIP) with a structural sketch, enabling efficient skill reuse across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arXiv cs.CL · 8d ago

MambaCount: Efficient Text-guided Object Counting

MambaCount introduces a spatial sparse state space duality block to enable efficient text-guided open-vocabulary object counting. It addresses causal modeling limitations and high entropy in spatial token responses, achieving state-of-the-art results on FSC-147 with a test MAE of 12.23 while maintaining linear complexity.

arXiv cs.CL · 8d ago

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

SuCo introduces Minimal Sufficient CoT (MSC), defined as the shortest reasoning prefix sufficient to produce a correct answer. It employs a two-stage training framework (MSC-Aligned Fine-Tuning and Sufficiency-Aware Policy Optimization) to reduce reasoning length while maintaining or improving accuracy across math, code, and science tasks.

arXiv cs.CL · 8d ago

LLMs Infer Cultural Context but Fail to Apply It

LLMs can detect cultural cues and recall cultural conventions, but often fail to adapt responses accordingly. Their responses remain biased toward their native culture unless explicitly prompted to apply cultural context sequentially.

arXiv cs.CL · 8d ago

EComAgentBench: Benchmarking Shopping Agents with Hidden Intent

EComAgentBench introduces a benchmark of 662 real Amazon tasks that scatter shopper requirements across query, profile, and clarification. Agents must uncover hidden intent, verify candidates with evidence, and commit to a product within 100 tool calls, with typed rubrics attributing failures to specific requirement sources. Evaluation shows even top models achieve only 57.1% accuracy, and rubric satisfaction drops when intent is hidden.

arXiv cs.CL · 8d ago

Coding Benchmarks Misaligned with Agentic Software Engineering

Current coding benchmarks were designed before agentic software engineering and fail to capture the complexity of real-world systems. They conflate model performance with the entire harness, ignore valid alternative solutions, and lack feedback signals at individual component levels, making iterative improvement difficult.

arXiv cs.CL · 8d ago

DIFE Audits CLIP Backdoor Exposure Across Deployment Interfaces

DIFE evaluates backdoored CLIP checkpoints across different deployment interfaces, revealing that native success does not guarantee safety in reuse. The framework shows text-side poisoning enables adversarial exposure in retrieval, reranking, and selection tasks, while visual-only use remains largely unaffected. BadTextTower is introduced to generate strong text-conditioned exposure without compromising visual performance.

arXiv cs.CL · 8d ago

A Framework for Evaluating Agentic Skills at Scale

This paper presents a framework for evaluating agentic skills by constructing realistic tasks and assessing skill utility through task execution. Applied to 500 real-world skills, it generates 1,000 tasks and scoring rubrics and evaluates 19 agent-model configurations across proprietary and open-source models. Results show significant variation in instruction adherence and performance gains, with skills substantially altering model behavior compared to no-skill setups.

arXiv cs.CL · 8d ago

Bilingual fine-tuning improves low-resource ASR with language identification

A study finds bilingual fine-tuning enhances automatic speech recognition in low-resource languages when language identification is accurate. Including a language identification token at inference improves ASR performance when identification accuracy is low, especially in diverse language pairs across different families and writing systems.

arXiv cs.CL · 8d ago

MultiClin Benchmark for Multiscript ASR in Clinical Settings

MultiClin introduces a clinical ASR benchmark that evaluates models' robustness to multiscript variability. It shows that multiscript-aware evaluation outperforms conventional single-reference methods, and script unification yields the best ASR performance, while inconsistent script mappings increase orthographic uncertainty.

arXiv cs.CL · 8d ago

Self-supervised speech models lack tonal context compensation

The wav2vec 2.0 model shows no evidence of perceptual compensation for Mandarin tones in embedding similarities. Probing classifiers reveal limited compensation and fail to match human performance on isolated syllables, suggesting supervised training is needed for phonological regularity abstraction.

arXiv cs.CL · 8d ago

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.

arXiv cs.CL · 8d ago

GameCraft-Bench: Evaluating End-to-End Game Generation

GameCraft-Bench introduces a benchmark with 140 Godot tasks across 15 game families to assess coding agents' ability to generate playable games. Evaluations show the best agent achieves only 41.46% success, indicating significant challenges in producing complete, interactive games with coherent gameplay and visual feedback.