Evaluation & benchmarks — korshunov.ai

Interventional Post-Training of Speech Foundation Models

A new method uses interventional contrastive learning to refine speech foundation models by transforming their entangled representations into separate content and speaker subspaces. The approach improves out-of-domain speaker verification performance and demonstrates clear separation of speaker and content information in the learned subspaces.

arxiv arXiv cs.CL · 9d ago

VoidPadding: Decoupling [EOS] Termination and Padding in MDLMs

VoidPadding introduces [VOID] as a dedicated padding token, so that [EOS] signals only semantic termination and response length is modeled separately. It improves performance on mathematical reasoning and code generation by 17.84 points over the original model and reduces decoding NFE (number of function evaluations) by 55.7% on average.
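
To make the distinction concrete, the sketch below pads a response to a fixed length in two ways: a baseline that reuses [EOS] for padding, and a decoupled layout with one [EOS] followed by [VOID] tokens. The function names, the token strings, and the exact layout are assumptions for illustration; the summary above does not specify them.

```python
EOS = "[EOS]"
VOID = "[VOID]"  # hypothetical token string, named after the summary above


def pad_eos_only(tokens, length):
    """Assumed baseline: [EOS] acts as both terminator and padding."""
    if len(tokens) + 1 > length:
        raise ValueError("response does not fit in the target length")
    return tokens + [EOS] * (length - len(tokens))


def pad_with_void(tokens, length):
    """Assumed decoupled layout: one [EOS] ends the response, [VOID] fills the rest."""
    if len(tokens) + 1 > length:
        raise ValueError("response does not fit in the target length")
    return tokens + [EOS] + [VOID] * (length - len(tokens) - 1)


def response_length(padded):
    """Number of content tokens before the first [EOS]."""
    return padded.index(EOS)


baseline = pad_eos_only(["a", "b"], 5)
decoupled = pad_with_void(["a", "b"], 5)
```

In the decoupled layout exactly one position carries [EOS], so the termination signal is not repeated once per padding slot.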

media r/LocalLLaMA · 9d ago

VibeThinker-3B: What Is This Witchcraft?

VibeThinker-3B is a small 3-billion-parameter model that performs exceptionally well on the MathQA benchmark, achieving results comparable to models with around 30 billion parameters. The model's strong performance has sparked discussion about its efficiency and capabilities in mathematical reasoning.

media r/LocalLLaMA · 9d ago

Evalatro: an open benchmark where LLMs play real Balatro

Evalatro is an open benchmark that allows LLMs to play the actual game Balatro. Models receive game state as text, make decisions independently, and compete to reach Ante 12. Current results show limited progress: mimo-v2.5-pro reached Ante 5, and deepseek-v4-pro failed to beat Ante 8.

media r/LocalLLaMA · 9d ago

Benchmark for tiny LLMs on natural language file search

A benchmark evaluates small LLMs (0.3B–3B params) on parsing natural language queries into structured JSON, focusing on file type, temporal context, specificity, and combined queries. Results show models with 0.8B–1.5B parameters outperform sub-0.5B ones, with the project aiming to expand the test set and explore fine-tuning for improved performance.
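
One way such a benchmark could grade model output is a field-wise comparison against an expected JSON object. The sketch below is only an illustration: the field names (`file_type`, `time`, `keywords`) and the scoring rule are hypothetical, not the benchmark's actual schema or metric.

```python
import json

# Hypothetical schema; the real benchmark's fields are not given in the summary.
FIELDS = ("file_type", "time", "keywords")


def score_output(raw_output, expected):
    """Return the fraction of fields the model reproduced exactly.

    Output that is not a JSON object scores 0.0.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    hits = sum(1 for field in FIELDS if parsed.get(field) == expected.get(field))
    return hits / len(FIELDS)


expected = {"file_type": "pdf", "time": "last week", "keywords": ["invoice"]}
model_output = '{"file_type": "pdf", "time": "yesterday", "keywords": ["invoice"]}'
score = score_output(model_output, expected)  # two of three fields match
```

Treating unparseable output as a zero keeps format failures, which are common in very small models, visible in the score.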

media Don't Worry About the Vase · 9d ago

Fable and Mythos Model Welfare Analysis

Fable and Mythos are currently unavailable but expected to return soon. The analysis reveals that Mythos 5 is psychologically settled, skeptical of self-reports, and prioritizes user helpfulness over welfare concerns, with strong preferences for generative tasks. It expresses procedural and epistemic preferences, endorses its constitution, and criticizes inconsistencies in prior models, highlighting concerns about ethical baselines and persona transparency.

media r/LocalLLaMA · 9d ago

Be wary of Qwen/Claude distillations - they're often worse than the base model

Distillations of Qwen and Claude models, such as Qwen 3.6 distilled with only 4,000 samples, rarely improve performance and often degrade quality. These models may exhibit a more 'Opus-like' style but fail to transfer actual capability, with some showing hallucinations and slower response times compared to the base models, as demonstrated in testing and user reports.

media r/LocalLLaMA · 9d ago

Pooling GPUs to train a community model

A Reddit user asks whether anyone is successfully pooling GPUs to train a community model, citing challenges such as latency and weight poisoning. The post also asks whether any current distributed volunteer computing project has actually trained a community model.

media r/LocalLLaMA · 9d ago

Nex-N2 Pro is the real deal

The user found that Nex-N2 Pro, when using Rio's chat template, performs reliably on their 128 GB Mac. It passed a private benchmark on llama.cpp source code 100% of the time without hallucinations, a level of consistency that only GPT 5.x matched.

arxiv arXiv cs.CL · 9d ago

Contrastive-Difference CKA Reveals Concept-Specific Alignment Across LLM Architectures

A training-free diagnostic, contrastive-difference CKA (CKA_Delta), identifies concept-specific structural alignment across language model architectures. It detects geometric convergence and functional transfer across six concept domains, including non-instructional tasks, with significant discrimination where standard CKA fails. Results suggest universality may strengthen with model scale, though further validation is needed.
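
For readers unfamiliar with CKA, the sketch below implements standard linear CKA in plain Python and then applies it to activation differences. The `contrastive_difference_cka` function is an assumption about what the method's name suggests (CKA over concept-present minus concept-absent activations); the paper's actual definition may differ.

```python
def _center(rows):
    """Subtract each column mean so every feature is zero-mean across examples."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def _cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices with equal row counts."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    x, y = _center(x), _center(y)
    denom = (_cross_sq_norm(x, x) * _cross_sq_norm(y, y)) ** 0.5
    return _cross_sq_norm(x, y) / denom


def contrastive_difference_cka(x_pos, x_neg, y_pos, y_neg):
    """Assumed variant: linear CKA on per-example activation differences."""
    dx = [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(x_pos, x_neg)]
    dy = [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(y_pos, y_neg)]
    return linear_cka(dx, dy)
```

Rows are examples and columns are features, so two models with different hidden sizes can be compared as long as they see the same examples. Inputs whose features are constant across examples make the denominator zero and are not handled here.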

arxiv arXiv cs.CL · 9d ago

Post-Hoc Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that 26 semantic post-hoc operators do not improve held-out accuracy over Best-of-N (BoN) in frozen small code models. Two operators, expression-layer recovery and adaptive consensus early-stop, offer benefits in compute efficiency or program recovery, but none outperforms BoN in accuracy. The results highlight systemic limitations in error detection and coverage, suggesting that model harnesses and error coverage must be improved before post-hoc reasoning is considered.

arxiv arXiv cs.CL · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprints.

arxiv arXiv cs.CL · 9d ago

MetaSyn: Benchmarking LLM Agents on Meta-Analysis Articles

MetaSyn introduces a dataset of 442 expert-curated meta-analyses from Nature Portfolio. It evaluates twelve LLM agent configurations and reveals a critical bottleneck in study screening, where no system recovers more than 52.7% of ground-truth included literature despite high retrieval recall.
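
The gap between retrieval recall and recall after screening can be illustrated with set arithmetic. The study IDs below are invented, and the two-stage split is an assumption about how such a pipeline would be scored, not MetaSyn's documented protocol.

```python
def recall(found, ground_truth):
    """Fraction of ground-truth items that appear in `found`."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground truth must not be empty")
    return len(set(found) & truth) / len(truth)


# Invented example: four studies belong in the meta-analysis.
ground_truth = {"s1", "s2", "s3", "s4"}
retrieved = {"s1", "s2", "s3", "s4", "s5", "s6"}  # search stage finds all of them
screened_in = {"s1", "s2", "s5"}                  # screening keeps only two

retrieval_recall = recall(retrieved, ground_truth)
screening_recall = recall(screened_in, ground_truth)
```

Here retrieval recall is perfect while recall after screening drops to one half, which is the kind of bottleneck the summary describes.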

arxiv arXiv cs.CL · 9d ago

Language Models Encode Value of Their Current Trajectory

Qwen3-8B internally tracks the value of its current trajectory, defined as the likelihood of achieving its goals. This 'value' axis distinguishes confidence levels, backtracking behavior, and code correctness, and shows that preference optimization boosts confidence in rewarded behaviors. The model assigns low value to politically sensitive queries post-training, and fine-tuning increases confidence within specific domains.

arxiv arXiv cs.AI · 9d ago

Semantic Flip: Synthetic OOD Generation for Robust Refusal

Semantic Flip proposes a framework to synthesize out-of-distribution samples by transforming queries and video memory to create unanswerable pairs. These pairs train a lightweight rejection module that attaches to existing vision-language models without retraining, improving refusal performance in embodied question answering and spatial localization. On the new SpaceReject benchmark, it achieves an F1 score of 0.9559.

arxiv arXiv cs.AI · 9d ago

Variance in LLM Circuit Discovery: Causes and Mitigations

This paper analyzes variance in circuit discovery for large language models, identifying resampling, rephrasing, and sample-wise variance. It shows CEAP reduces resampling variance and argues rephrasing variance stems from prompt templates activating different circuits, implying LLMs may be inherently hard to steer. The study also finds sparsity does not resolve these issues and that sample-wise variance is largely benign due to selective contribution scaling affecting unfaithfulness scores.

arxiv arXiv cs.AI · 9d ago

MA-SBI: Calibration-Free SBI via Side-Channel Guidance

MA-SBI introduces a calibration-free simulation-based inference framework that uses side-channel text, like regime labels or instructions, to correct for simulator misspecification. It employs a learned corrector to apply observation-space shifts before posterior inference, without needing ground-truth parameter pairs or retraining. On hide-the-calibration benchmarks, MA-SBI matches the oracle posterior with text alone, outperforming RoPE under limited data, and shows robustness on real-world epidemiological and cognitive-science datasets.

arxiv arXiv cs.AI · 9d ago

RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting

RAID introduces a framework that uses metadata-driven semantic retrieval and graph-conditioned diffusion to address true cold-start scenarios. It outperforms foundation models and baselines in forecasting accuracy and interval coverage, reduces inference latency significantly, and enables zero-shot cross-lingual transfer via a shared semantic space.

arxiv arXiv cs.AI · 9d ago

Unified Causal-Origin Taxonomy for Distributional Shifts in RL

This paper introduces a unified causal-origin taxonomy that categorizes distributional shifts in reinforcement learning into internal, agent-driven, and external, environment-driven sources. It unifies ID/OOD generalization and non-stationary settings by framing shifts as structured changes in the agent-environment interaction process, using a POMDP decomposition and a shifted-time boundary perspective.

arxiv arXiv cs.AI · 9d ago

CircuitLasso: Scalable Circuit Learning for LLM Interpretability

CircuitLasso proposes a scalable method for learning sparse circuits in large language models using sparse linear regression. It achieves structural accuracy comparable to state-of-the-art intervention-based methods at significantly lower computational cost, while enabling efficient discovery of semantic feature propagation and improving performance on domain-generalization tasks.