Evaluation & benchmarks — korshunov.ai

MultiClin Benchmark for Multiscript ASR in Clinical Settings

MultiClin introduces a clinical ASR benchmark that evaluates models' robustness to multiscript variability. It shows that multiscript-aware evaluation outperforms conventional single-reference methods, and script unification yields the best ASR performance, while inconsistent script mappings increase orthographic uncertainty.
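To make the contrast with single-reference scoring concrete, here is a minimal sketch of multi-reference word error rate: the hypothesis is scored against every acceptable script variant of the transcript and the lowest error is kept. The function names and the sample sentence are illustrative assumptions, not MultiClin's actual metric or data.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Conventional single-reference word error rate."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def multi_reference_wer(references, hypothesis):
    """Score against every script variant and keep the lowest error (assumed scheme)."""
    return min(wer(ref, hypothesis) for ref in references)


# Hypothetical example: the same utterance with one term in Latin or Devanagari script.
references = ["patient ko paracetamol do", "patient ko पैरासिटामोल do"]
hypothesis = "patient ko पैरासिटामोल do"
single = wer(references[0], hypothesis)
multi = multi_reference_wer(references, hypothesis)
```

Under single-reference scoring the script mismatch counts as a substitution even though the recognized word is correct; the multi-reference score does not penalize it.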

arxiv arXiv cs.CL · 8d ago

Self-supervised speech models lack tonal context compensation

The wav2vec2.0 model shows no evidence of perceptual compensation for Mandarin tones in embedding similarities. Probing classifiers reveal limited compensation and fail to match human performance on isolated syllables, suggesting supervised training is needed for phonological regularity abstraction.

arxiv arXiv cs.CL · 8d ago

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.

arxiv arXiv cs.CL · 8d ago

GameCraft-Bench: Evaluating End-to-End Game Generation

GameCraft-Bench introduces a benchmark with 140 Godot tasks across 15 game families to assess coding agents' ability to generate playable games. Evaluations show the best agent achieves only 41.46% success, indicating significant challenges in producing complete, interactive games with coherent gameplay and visual feedback.

arxiv arXiv cs.CL · 8d ago

Dynamic Rollout Editing Reduces Overthinking in RL-Trained Reasoning Models

Dynamic Rollout Editing (DRE) addresses overthinking in RL-trained reasoning models by modifying successful trajectories after the answer has emerged. DRE preserves the correct reasoning prefix while editing the unnecessary continuation, which weakens the credit assigned to redundant thinking without penalizing valid reasoning. Experiments across diverse tasks demonstrate its effectiveness in reducing overthinking.

arxiv arXiv cs.CL · 8d ago

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

ChLogic evaluates how well large language models maintain logical reasoning when English logical structures are expressed in Chinese. It reveals a persistent English-Chinese performance gap, with back-translation improving results on general items but harming performance on difficult problems. The benchmark highlights the impact of surface realization, translation artifacts, and model-specific behaviors on multilingual reasoning.

arxiv arXiv cs.CL · 8d ago

Non-negative Elastic Net Decoding for Information Retrieval

NNN decoding selects documents as a set that jointly reconstructs the query embedding via a sparse non-negative linear combination. It strictly extends dense retrieval by handling queries that dense retrieval fails on, especially in corpora with correlated documents, and achieves superior performance through end-to-end training of embeddings.
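The idea of reconstructing a query from a sparse non-negative combination of document embeddings can be sketched with a small coordinate-descent solver for a non-negative elastic net. The objective, the penalty values, and the function names below are assumptions for illustration; the paper's actual solver and training setup may differ.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nn_elastic_net(query, docs, l1=0.01, l2=0.01, n_iters=200):
    """Minimize 0.5*||q - sum_j w_j d_j||^2 + l1*sum(w) + 0.5*l2*||w||^2 with w >= 0.

    Plain cyclic coordinate descent; hyperparameters are illustrative.
    """
    weights = [0.0] * len(docs)
    residual = list(query)  # query minus the current reconstruction
    for _ in range(n_iters):
        for j, doc in enumerate(docs):
            norm_sq = dot(doc, doc)
            if norm_sq == 0.0:
                continue
            # Correlation of doc j with the residual, with doc j's own contribution added back.
            rho = dot(doc, residual) + weights[j] * norm_sq
            new_w = max(0.0, rho - l1) / (norm_sq + l2)  # clipped at zero: non-negativity
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * d for r, d in zip(residual, doc)]
                weights[j] = new_w
    return weights


# Toy corpus: the query is the sum of the first two documents.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 1.0, 0.0]
weights = nn_elastic_net(query, docs)
selected = [j for j, w in enumerate(weights) if w > 0.0]
```

In this toy case the first two documents are selected together and the third receives exactly zero weight, which is the set-level behavior the summary describes.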

arxiv arXiv cs.CL · 8d ago

Interventional Post-Training of Speech Foundation Models

A new method uses interventional contrastive learning to refine speech foundation models by transforming their entangled representations into separate content and speaker subspaces. The approach improves out-of-domain speaker verification performance and demonstrates clear separation of speaker and content information in the learned subspaces.

arxiv arXiv cs.CL · 8d ago

VoidPadding: Decoupling [EOS] Termination and Padding in MDLMs

VoidPadding introduces [VOID] as a padding token to separate semantic termination and response-length modeling. It improves performance on mathematical reasoning and code generation by 17.84 points over the original model and reduces decoding NFE by 55.7% on average.
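A minimal sketch of the padding change, assuming the baseline pads fixed-length targets by repeating [EOS] (so one token signals both "the answer ended" and "this slot is filler"). The helper names and the exact target layout are assumptions, not the paper's code.

```python
EOS = "[EOS]"
VOID = "[VOID]"


def pad_with_eos(tokens, length):
    """Assumed baseline: [EOS] doubles as terminator and padding."""
    if len(tokens) + 1 > length:
        raise ValueError("sequence does not fit in the target length")
    return tokens + [EOS] * (length - len(tokens))


def pad_with_void(tokens, length):
    """Decoupled variant: one [EOS] ends the response, [VOID] fills the rest."""
    if len(tokens) + 1 > length:
        raise ValueError("sequence does not fit in the target length")
    return tokens + [EOS] + [VOID] * (length - len(tokens) - 1)


baseline = pad_with_eos(["2", "+", "2", "=", "4"], 8)
decoupled = pad_with_void(["2", "+", "2", "=", "4"], 8)
```

With the decoupled layout, the number of [EOS] tokens no longer depends on how much filler the target needs.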

media r/LocalLLaMA · 8d ago

VibeThinker-3B: What Is This Witchcraft?

VibeThinker-3B is a small 3-billion-parameter model that performs exceptionally well on the MathQA benchmark, achieving results comparable to models with around 30 billion parameters. The model's strong performance has sparked discussion about its efficiency and capabilities in mathematical reasoning.

media r/LocalLLaMA · 8d ago

Evalatro: an open benchmark where LLMs play real Balatro

Evalatro is an open benchmark that allows LLMs to play the actual game Balatro. Models receive game state as text, make decisions independently, and compete to reach Ante 12, with current results showing limited progress—mimo-v2.5-pro reached Ante 5, and deepseek-v4-pro failed to beat Ante 8.

media r/LocalLLaMA · 8d ago

Benchmark for tiny LLMs on natural language file search

A benchmark evaluates small LLMs (0.3B–3B params) on parsing natural language queries into structured JSON, focusing on file type, temporal context, specificity, and combined queries. Results show models with 0.8B–1.5B parameters outperform sub-0.5B ones, with the project aiming to expand the test set and explore fine-tuning for improved performance.
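One simple way to score such a task is field-level exact match between the model's JSON and a gold parse, sketched below. The schema keys (`file_type`, `time`) and the scoring rule are hypothetical; the benchmark's real schema and metric are not given in the post summary.

```python
import json


def score_parse(model_output, expected):
    """Fraction of expected fields the model reproduced exactly; invalid JSON scores 0."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict) or not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items()
               if key in parsed and parsed[key] == value)
    return hits / len(expected)


# Hypothetical query: "PDFs I edited last week"
expected = {"file_type": "pdf", "time": "last_week"}
full = score_parse('{"file_type": "pdf", "time": "last_week"}', expected)
partial = score_parse('{"file_type": "pdf", "time": "yesterday"}', expected)
broken = score_parse("Sure! Here is the JSON you asked for", expected)
```

Treating unparseable output as a zero matters for very small models, which may fail on output format before they fail on understanding the query.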

media Don't Worry About the Vase · 8d ago

Fable and Mythos Model Welfare Analysis

Fable and Mythos are currently unavailable but expected to return soon. The analysis reveals that Mythos 5 is psychologically settled, skeptical of self-reports, and prioritizes user helpfulness over welfare concerns, with strong preferences for generative tasks. It expresses procedural and epistemic preferences, endorses its constitution, and criticizes inconsistencies in prior models, highlighting concerns about ethical baselines and persona transparency.

media r/LocalLLaMA · 9d ago

Be wary of Qwen/Claude distillations - they're often worse than the base model

Qwen models distilled from Claude outputs, such as a Qwen 3.6 distill trained on only 4,000 samples, rarely improve performance and often degrade quality. These models may adopt a more 'Opus-like' style but fail to transfer actual capability, and some show more hallucinations and slower responses than the base models, according to the poster's testing and user reports.

media r/LocalLLaMA · 9d ago

Pooling GPUs to train a community model

A Reddit user asks whether anyone is successfully pooling GPUs to train a community model, highlighting challenges like latency and weight poisoning. The post questions if current distributed volunteer computing projects have achieved successful community model training.

media r/LocalLLaMA · 9d ago

Nex-N2 Pro is the real deal

The user found that N2 Pro, when using Rio's chat template, performs reliably on their 128 GB Mac. It passed a private benchmark on llama.cpp source code 100% of the time without hallucinations, a level of consistency the user had otherwise seen only from GPT 5.x.

arxiv arXiv cs.CL · 9d ago

Contrastive-Difference CKA Reveals Concept-Specific Alignment Across LLM Architectures

A training-free diagnostic, contrastive-difference CKA (CKA_Delta), identifies concept-specific structural alignment across language model architectures. It detects geometric convergence and functional transfer across six concept domains, including non-instructional tasks, with significant discrimination where standard CKA fails. Results suggest universality may strengthen with model scale, though further validation is needed.
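For orientation, the sketch below implements standard linear CKA and then applies it to per-prompt activation differences between a concept-bearing and a matched neutral input. The linear CKA formula is the usual one; the "difference" step is only one plausible reading of the summary and should not be taken as the paper's definition of CKA_Delta.

```python
import math


def center(matrix):
    """Subtract the per-feature (column) mean from every row."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_fro_sq(a, b):
    """Squared Frobenius norm of a^T b, for row-major matrices with equal row counts."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices (rows = the same inputs)."""
    x, y = center(x), center(y)
    denom = math.sqrt(cross_fro_sq(x, x) * cross_fro_sq(y, y))
    if denom == 0.0:
        raise ValueError("activations have no variance across inputs")
    return cross_fro_sq(x, y) / denom


def contrastive_difference_cka(pos_a, neg_a, pos_b, neg_b):
    """Assumed variant: CKA on (concept minus neutral) activation differences."""
    diff_a = [[p - n for p, n in zip(rp, rn)] for rp, rn in zip(pos_a, neg_a)]
    diff_b = [[p - n for p, n in zip(rp, rn)] for rp, rn in zip(pos_b, neg_b)]
    return linear_cka(diff_a, diff_b)
```

Because the two models may have different hidden sizes, only the number of rows (inputs) has to match; the feature dimensions of the two matrices can differ.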

arxiv arXiv cs.CL · 9d ago

Post-Hoc Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that 26 semantic post-hoc operators do not improve held-out accuracy over Best-of-N in frozen small code models. While two operators—expression-layer recovery and adaptive consensus early-stop—offer benefits in compute efficiency or program recovery, none outperform BoN in accuracy. The results highlight systemic limitations in error detection and coverage, suggesting that model harnesses and error coverage must be improved before post-hoc reasoning is considered.

arxiv arXiv cs.CL · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprints.

arxiv arXiv cs.CL · 9d ago

MetaSyn: Benchmarking LLM Agents on Meta-Analysis Articles

MetaSyn introduces a dataset of 442 expert-curated meta-analyses from Nature Portfolio. It evaluates twelve LLM agent configurations and reveals a critical bottleneck in study screening, where no system recovers more than 52.7% of ground-truth included literature despite high retrieval recall.
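The screening figure is a recall-style measure: of the studies the published meta-analysis actually included, how many did the agent also include. A minimal sketch, with hypothetical study identifiers and an assumed set-based definition:

```python
def screening_recall(agent_included, gold_included):
    """Share of ground-truth included studies that the agent also included."""
    gold = set(gold_included)
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(gold & set(agent_included)) / len(gold)


# Hypothetical identifiers, for illustration only.
gold = ["study-01", "study-02", "study-03", "study-04"]
agent = ["study-02", "study-04", "study-09"]
recall = screening_recall(agent, gold)  # 2 of 4 gold studies recovered
```

A system can retrieve nearly every relevant paper and still score low here if its screening step discards studies the original authors kept.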