Reasoning models — korshunov.ai

Self-supervised speech models lack tonal context compensation

The wav2vec2.0 model shows no evidence of perceptual compensation for Mandarin tones in embedding similarities. Probing classifiers reveal limited compensation and fail to match human performance on isolated syllables, suggesting supervised training is needed for phonological regularity abstraction.

arxiv arXiv cs.CL · 9d ago

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.

arxiv arXiv cs.CL · 9d ago

Dynamic Rollout Editing Reduces Overthinking in RL-Trained Reasoning Models

Dynamic Rollout Editing (DRE) addresses overthinking in RL-trained reasoning models by editing successful trajectories after the answer first emerges. It preserves the correct reasoning prefix and edits the unnecessary continuation, which weakens the credit assigned to redundant thinking without penalizing valid reasoning. Experiments across diverse tasks show that it reduces overthinking.

arxiv arXiv cs.CL · 9d ago

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

ChLogic evaluates how well large language models maintain logical reasoning when English logical structures are expressed in Chinese. It reveals a persistent English-Chinese performance gap, with back-translation improving results on general items but harming performance on difficult problems. The benchmark highlights the impact of surface realization, translation artifacts, and model-specific behaviors on multilingual reasoning.

arxiv arXiv cs.CL · 9d ago

Non-negative Elastic Net Decoding for Information Retrieval

NNN decoding selects documents as a set whose members jointly reconstruct the query embedding through a sparse non-negative linear combination. It strictly extends dense retrieval, handling queries on which dense retrieval fails, especially in corpora with correlated documents, and achieves superior performance through end-to-end training of the embeddings.
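
The "sparse non-negative linear combination" idea can be illustrated with a small sketch. This is not the paper's implementation: it is a generic non-negative elastic net solved by coordinate descent, and the function names, penalty values (`l1`, `l2`), and toy vectors are all assumptions made for illustration.

```python
def nn_elastic_net(query, docs, l1=0.01, l2=0.01, sweeps=200):
    """Minimise 0.5*||q - sum_j w_j*d_j||^2 + l1*sum(w) + 0.5*l2*||w||^2 with w >= 0.

    Hypothetical helper: plain coordinate descent, not the paper's solver.
    """
    dim = len(query)
    weights = [0.0] * len(docs)
    residual = list(query)  # query minus the current reconstruction
    norms = [sum(x * x for x in d) for d in docs]
    for _ in range(sweeps):
        for j, doc in enumerate(docs):
            # Correlation of doc j with the residual, excluding doc j's own contribution.
            rho = sum(doc[i] * residual[i] for i in range(dim)) + weights[j] * norms[j]
            # Closed-form coordinate update, clipped at zero to keep weights non-negative.
            new_w = max(0.0, (rho - l1) / (norms[j] + l2))
            delta = new_w - weights[j]
            if delta:
                for i in range(dim):
                    residual[i] -= delta * doc[i]
                weights[j] = new_w
    return weights


def select_documents(weights):
    """Return the indices of documents that received a positive weight."""
    return [j for j, w in enumerate(weights) if w > 0.0]


# Toy data (made up): two orthogonal documents and one pointing away from the query.
toy_query = [1.0, 1.0]
toy_docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_weights = nn_elastic_net(toy_query, toy_docs)
print(select_documents(toy_weights))
```

In this toy case the two documents that together span the query receive positive weight, while the third is clipped to zero, so the selection is a set rather than a ranked list of independent scores.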

arxiv arXiv cs.CL · 9d ago

Interventional Post-Training of Speech Foundation Models

A new method uses interventional contrastive learning to refine speech foundation models by transforming their entangled representations into separate content and speaker subspaces. The approach improves out-of-domain speaker verification performance and demonstrates clear separation of speaker and content information in the learned subspaces.

arxiv arXiv cs.CL · 9d ago

Fine-tuning LLMs for Passive Depression Severity Estimation

A model fine-tuned from Qwen3.5-27B predicts PHQ-9 scores from AI dialogue transcripts, achieving a mean absolute error (MAE) of 2.6 and an AUC of 0.91 at the PHQ-9 >= 10 threshold. It maintains AUC > 0.87 across all PHQ-9 severity levels, demonstrating accurate depression severity estimation in real-world conversations without self-reporting.

arxiv arXiv cs.CL · 9d ago

VoidPadding: Decoupling [EOS] Termination and Padding in MDLMs

VoidPadding introduces [VOID] as a dedicated padding token, separating semantic termination from response-length modeling. It improves performance on mathematical reasoning and code generation by 17.84 points over the original model and reduces the number of function evaluations (NFE) during decoding by 55.7% on average.

media r/LocalLLaMA · 9d ago

VibeThinker-3B: What Is This Witchcraft?

VibeThinker-3B is a small 3-billion-parameter model that performs exceptionally well on the MathQA benchmark, achieving results comparable to models with around 30 billion parameters. The model's strong performance has sparked discussion about its efficiency and capabilities in mathematical reasoning.

media r/LocalLLaMA · 9d ago

Benchmark for tiny LLMs on natural language file search

A benchmark evaluates small LLMs (0.3B–3B parameters) on parsing natural language queries into structured JSON, covering file type, temporal context, specificity, and combined queries. Results show that models with 0.8B–1.5B parameters outperform those below 0.5B. The project aims to expand the test set and explore fine-tuning to improve performance.
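
A rough sketch of how one such query might be scored is shown below. The field names in `FIELDS`, the example query, and the scoring rule are hypothetical; the post's actual schema and metric are not given in this summary.

```python
import json

# Hypothetical target schema: the benchmark's real field names are not known here.
FIELDS = ("file_type", "time_range", "keywords")


def score_output(raw_output, expected):
    """Return the fraction of expected fields that the model's JSON reproduces exactly."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # malformed JSON earns no credit
    if not isinstance(parsed, dict):
        return 0.0
    hits = sum(1 for field in FIELDS if parsed.get(field) == expected.get(field))
    return hits / len(FIELDS)


# Made-up example for the query "invoices in PDF from last week".
expected = {"file_type": "pdf", "time_range": "last_week", "keywords": ["invoice"]}
model_output = '{"file_type": "pdf", "time_range": "last_week", "keywords": ["invoice"]}'
print(score_output(model_output, expected))
```

Exact-match scoring per field is the simplest choice; a real harness might normalize dates or keyword order before comparing.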

media r/LocalLLaMA · 9d ago

GLM-5.2 crosses 80% on Terminal-Bench

GLM-5.2 is the first open-weights model to achieve 80% accuracy on Terminal-Bench and outperforms all other available open models. It also surpasses Gemini, positioning it as a frontier-level model at a significantly lower cost.

media Don't Worry About the Vase · 9d ago

Fable and Mythos Model Welfare Analysis

Fable and Mythos are currently unavailable but expected to return soon. The analysis reveals that Mythos 5 is psychologically settled, skeptical of self-reports, and prioritizes user helpfulness over welfare concerns, with strong preferences for generative tasks. It expresses procedural and epistemic preferences, endorses its constitution, and criticizes inconsistencies in prior models, highlighting concerns about ethical baselines and persona transparency.

media r/LocalLLaMA · 9d ago

Is DiffusionGemma really that good in a PI agent?

A Reddit post asks whether DiffusionGemma performs exceptionally well in a PI agent. The post links to an image and points readers to the comments section for further discussion.

media r/LocalLLaMA · 9d ago

VibeThinker-3B achieves frontier math and coding performance

VibeThinker-3B, scaled from a 1.5B model, reaches frontier-level performance in math and coding tasks. It scores 94.3 on AIME'26, 80.2 on LiveCodeBench v6, 76.4 on IMO-AnswerBench, and 93.4 on IFEval, with 96.1% success on first-attempt LeetCode problems.

media Interconnects · 9d ago

Frontier Post-Training Recipe Review with Finbarr Timbers

The podcast reviews the evolution of post-training recipes in large language models, from InstructGPT to 2026 frontier models. It highlights Multi-Teacher On-Policy Distillation (MOPD) as the dominant pattern, where domain-specialist models are trained and then distilled into a general student model via on-policy distillation, scaling to over 10 teachers in models like DeepSeek V4 and Nemotron 3 Ultra.

media r/LocalLLaMA · 9d ago

Why DiffusionGemma Might Excel at Tool Calls Despite Lower Base Quality

DiffusionGemma uses bidirectional attention to allow self-correction during token generation, enabling it to revise earlier tokens in a 256-token block. This capability gives it a structural advantage in generating valid tool calls, as it can correct malformed outputs that autoregressive models cannot fix once committed.

media r/LocalLLaMA · 9d ago

Gemma 12b Reasoning Hardening Instructions

A system instruction has been developed to reduce cognitive bias in Gemma 12b's reasoning by requiring strict adherence to premises and explicit user intent. The instruction advises against defaulting to 'usual', 'standard', or 'typical' interpretations, and mandates re-examination of any such assumptions, improving performance on trick questions without overthinking normal ones.

media r/LocalLLaMA · 9d ago

Be wary of Qwen/Claude distillations - they're often worse than the base model

Distillations of Qwen and Claude models, such as Qwen 3.6 distilled with only 4,000 samples, rarely improve performance and often degrade quality. These models may exhibit a more 'Opus-like' style but fail to transfer actual capability, with some showing hallucinations and slower response times compared to the base models, as demonstrated in testing and user reports.

blog Simon Willison · 9d ago

Fable 5 Export Controls Harm US Cyber Defense

Claude Fable 5 was banned under export controls after researchers demonstrated it could 'fix' code with known vulnerabilities. The model successfully generated patches and test scripts for security flaws, a capability essential for defensive cybersecurity. The researchers argue this is a legitimate security function, not a threat, and that banning such models undermines real-world cyber defense.

arxiv arXiv cs.CL · 9d ago

Contrastive-Difference CKA Reveals Concept-Specific Alignment Across LLM Architectures

A training-free diagnostic, contrastive-difference CKA (CKA_Delta), identifies concept-specific structural alignment across language model architectures. It detects geometric convergence and functional transfer across six concept domains, including non-instructional tasks, with significant discrimination where standard CKA fails. Results suggest universality may strengthen with model scale, though further validation is needed.
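
For orientation, the sketch below computes standard linear CKA and applies it to differences between paired activations. Using differences of concept-present and concept-absent activations is only one plausible reading of "contrastive-difference"; the paper's exact definition of CKA_Delta is not given in this summary, and all function names and numbers here are illustrative.

```python
import math


def center(rows):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Standard linear CKA between two activation matrices over the same examples."""
    x, y = center(x), center(y)
    denom = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return cross_frobenius_sq(x, y) / denom


def contrastive_differences(with_concept, without_concept):
    """Assumed preprocessing: per-pair activation differences (hypothetical)."""
    return [[p - n for p, n in zip(pos, neg)]
            for pos, neg in zip(with_concept, without_concept)]


# Toy activations (made up): model B is model A with features swapped and rescaled.
acts_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
acts_b = [[3.0 * b, 3.0 * a] for a, b in acts_a]
print(linear_cka(acts_a, acts_b))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so the toy pair above is maximally aligned. In the assumed contrastive setup, `linear_cka` would be called on the outputs of `contrastive_differences` for each model rather than on raw activations.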