Reasoning models — korshunov.ai


Fine-tuning LLMs for Passive Depression Severity Estimation

A model fine-tuned on Qwen3.5-27B predicts PHQ-9 scores from AI dialogue transcripts, achieving MAE=2.6 and AUC=0.91 at the PHQ-9 >= 10 threshold. It maintains AUC > 0.87 across all PHQ-9 severity levels, demonstrating accurate depression severity estimation in real-world conversations without self-reporting.
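As a rough illustration of how the two reported metrics are defined, the sketch below computes MAE and a threshold-based AUC on made-up numbers. The scores and the helper names `mean_absolute_error` and `auc_at_threshold` are illustrative assumptions, not the paper's data or code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted PHQ-9 totals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_at_threshold(y_true, y_pred, threshold=10):
    """AUC for detecting y_true >= threshold, using y_pred as the ranking score.

    Computed as the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, with ties counted as half.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example values (not from the paper).
true_scores = [3, 12, 8, 17, 5, 10]
pred_scores = [4.0, 11.0, 9.5, 14.0, 9.5, 9.0]

mae = mean_absolute_error(true_scores, pred_scores)
auc = auc_at_threshold(true_scores, pred_scores, threshold=10)
print(f"MAE={mae:.2f}  AUC@PHQ-9>=10={auc:.2f}")
```

Repeating `auc_at_threshold` with other cutoffs would give the kind of per-severity-level AUC the summary mentions.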

arxiv arXiv cs.CL · 8d ago

VoidPadding: Decoupling [EOS] Termination and Padding in MDLMs

VoidPadding introduces [VOID] as a padding token to separate semantic termination and response-length modeling. It improves performance on mathematical reasoning and code generation by 17.84 points over the original model and reduces decoding NFE by 55.7% on average.

media r/LocalLLaMA · 8d ago

VibeThinker-3B: What Is This Witchcraft?

VibeThinker-3B is a small 3-billion-parameter model that performs exceptionally well on the MathQA benchmark, achieving results comparable to models with around 30 billion parameters. The model's strong performance has sparked discussion about its efficiency and capabilities in mathematical reasoning.

media r/LocalLLaMA · 8d ago

Benchmark for tiny LLMs on natural language file search

A benchmark evaluates small LLMs (0.3B–3B params) on parsing natural language queries into structured JSON, focusing on file type, temporal context, specificity, and combined queries. Results show that models with 0.8B–1.5B parameters outperform sub-0.5B ones. The project aims to expand the test set and explore fine-tuning for improved performance.

media r/LocalLLaMA · 8d ago

GLM-5.2 crosses 80% on Terminal-Bench

GLM-5.2 is the first open-weights model to achieve 80% accuracy on Terminal-Bench and outperforms all other available open models. It also surpasses Gemini, positioning it as a frontier-level model at a significantly lower cost.

media Don't Worry About the Vase · 8d ago

Fable and Mythos Model Welfare Analysis

Fable and Mythos are currently unavailable but expected to return soon. The analysis reveals that Mythos 5 is psychologically settled, skeptical of self-reports, and prioritizes user helpfulness over welfare concerns, with strong preferences for generative tasks. It expresses procedural and epistemic preferences, endorses its constitution, and criticizes inconsistencies in prior models, highlighting concerns about ethical baselines and persona transparency.

media r/LocalLLaMA · 9d ago

Is DiffusionGemma really that good in a PI agent?

A Reddit post asks whether DiffusionGemma performs exceptionally well in a PI agent. The post links to an image and points to the comments section for further discussion.

media r/LocalLLaMA · 9d ago

VibeThinker-3B achieves frontier math and coding performance

VibeThinker-3B, scaled from a 1.5B model, reaches frontier-level performance in math and coding tasks. It scores 94.3 on AIME'26, 80.2 on LiveCodeBench v6, 76.4 on IMO-AnswerBench, and 93.4 on IFEval, with 96.1% success on first-attempt LeetCode problems.

media Interconnects · 9d ago

Frontier Post-Training Recipe Review with Finbarr Timbers

The podcast reviews the evolution of post-training recipes in large language models, from InstructGPT to 2026 frontier models. It highlights Multi-Teacher On-Policy Distillation (MOPD) as the dominant pattern, where domain-specialist models are trained and then distilled into a general student model via on-policy distillation, scaling to over 10 teachers in models like DeepSeek V4 and Nemotron 3 Ultra.

media r/LocalLLaMA · 9d ago

Why DiffusionGemma Might Excel at Tool Calls Despite Lower Base Quality

DiffusionGemma uses bidirectional attention to allow self-correction during token generation, enabling it to revise earlier tokens in a 256-token block. This capability gives it a structural advantage in generating valid tool calls, as it can correct malformed outputs that autoregressive models cannot fix once committed.

media r/LocalLLaMA · 9d ago

Gemma 12b Reasoning Hardening Instructions

A system instruction has been developed to reduce cognitive bias in Gemma 12b's reasoning by requiring strict adherence to premises and explicit user intent. The instruction advises against defaulting to 'usual', 'standard', or 'typical' interpretations, and mandates re-examination of any such assumptions, improving performance on trick questions without overthinking normal ones.

media r/LocalLLaMA · 9d ago

Be wary of Qwen/Claude distillations - they're often worse than the base model

Distillations of Qwen and Claude models, such as Qwen 3.6 distilled with only 4,000 samples, rarely improve performance and often degrade quality. These models may exhibit a more 'Opus-like' style but fail to transfer actual capability, with some showing hallucinations and slower response times compared to the base models, as demonstrated in testing and user reports.

blog Simon Willison · 9d ago

Fable 5 Export Controls Harm US Cyber Defense

Claude Fable 5 was banned under export controls after researchers demonstrated it could 'fix' code with known vulnerabilities. The model successfully generated patches and test scripts for security flaws, a capability essential for defensive cybersecurity. The researchers argue this is a legitimate security function, not a threat, and that banning such models undermines real-world cyber defense.

arxiv arXiv cs.CL · 9d ago

Contrastive-Difference CKA Reveals Concept-Specific Alignment Across LLM Architectures

A training-free diagnostic, contrastive-difference CKA (CKA_Delta), identifies concept-specific structural alignment across language model architectures. It detects geometric convergence and functional transfer across six concept domains, including non-instructional tasks, with significant discrimination where standard CKA fails. Results suggest universality may strengthen with model scale, though further validation is needed.
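For orientation, here is a minimal sketch of standard linear CKA in plain Python, followed by one possible reading of the contrastive-difference idea. The summary does not give the paper's definition of CKA_Delta, so `contrastive_difference_cka` (linear CKA applied to differences between paired concept-present and concept-absent activations) is an assumption made for illustration only.

```python
import math


def center_columns(m):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for two matrices with the same row count."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


def contrastive_difference_cka(x_pos, x_neg, y_pos, y_neg):
    """Hypothetical variant: CKA over (concept-present minus concept-absent) activations."""
    dx = [[p - n for p, n in zip(rp, rn)] for rp, rn in zip(x_pos, x_neg)]
    dy = [[p - n for p, n in zip(rp, rn)] for rp, rn in zip(y_pos, y_neg)]
    return linear_cka(dx, dy)


# Toy activations: 4 examples, 2 features for model X, 3 features for model Y.
x = [[1.0, 2.0], [3.0, 5.0], [4.0, 1.0], [0.0, 2.0]]
y = [[0.5, 1.0, 2.0], [2.0, 0.0, 1.0], [1.5, 3.0, 0.0], [0.0, 1.0, 1.0]]
print(round(linear_cka(x, y), 3))
```

Because the two models may have different hidden sizes, the comparison only requires that both activation matrices share the same examples (rows).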

arxiv arXiv cs.CL · 9d ago

Symbolic Informalization in Informath Project

The Informath project demonstrates symbolic informalization to convert formal mathematical proofs into fluent, precise natural language. It uses Dedukti as a hub connecting proof systems like Agda, Lean, and Rocq, with Grammatical Framework ensuring linguistic correctness across multiple languages.

arxiv arXiv cs.CL · 9d ago

LOGOS: A General-Purpose Generative Model for Natural Sciences

LOGOS is a unified generative language model that represents scientific objects and their interactions as token sequences in a shared grammar. It achieves consistent or superior performance across diverse natural science tasks, demonstrating the feasibility of a single model serving multiple domains. The model scales positively with parameter count, and its design suggests that AI for Science should align deeply with large language models through shared architectures and training.

arxiv arXiv cs.CL · 9d ago

LESS Is More: Adaptive Sampling for Diffusion Language Models

LESS introduces a training-free, model-agnostic adaptive sampler that reduces reverse denoising steps by 72.1% compared to fixed-budget decoding. It achieves higher accuracy than existing training-free samplers and lowers inference compute and latency through mutual-stability rules that ensure token commitment only when predictions are confident, consistent, and stable.
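The commitment idea can be sketched as a per-position check over recent denoising steps. This is only an illustration of "confident, consistent, and stable": the function `should_commit`, its thresholds, and the window size are hypothetical and are not taken from the LESS paper.

```python
def should_commit(history, min_conf=0.9, window=3, max_drift=0.05):
    """Decide whether to freeze the token at one position.

    history: list of (token_id, probability) top-1 predictions for this
    position, oldest first, one entry per denoising step.
    All thresholds are illustrative assumptions.
    """
    if len(history) < window:
        return False  # not enough steps observed yet
    recent = history[-window:]
    tokens = [token for token, _ in recent]
    probs = [prob for _, prob in recent]
    confident = probs[-1] >= min_conf             # latest prediction is high-probability
    consistent = len(set(tokens)) == 1            # same token across the window
    stable = max(probs) - min(probs) <= max_drift  # probability has stopped moving
    return confident and consistent and stable


# A position whose prediction has settled on token 7.
settled = [(7, 0.91), (7, 0.93), (7, 0.95)]
# A position that is still flipping between tokens.
flipping = [(7, 0.91), (3, 0.93), (7, 0.95)]
print(should_commit(settled), should_commit(flipping))
```

A sampler built on such a rule can stop early once every position has committed, which is where the reduction in denoising steps would come from.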

arxiv arXiv cs.CL · 9d ago

IMPACTeen Dataset Released with English and Polish Versions

IMPACTeen is a dataset of 1,021 texts annotated from five perspectives—teenagers, parents, psychologists, communication experts, and teachers. It includes 5,100 annotation records covering social influence techniques, intentions, consequences, and resistance, with annotations validated through human editing. The dataset, created using LLM generation and human validation, is available in both Polish and English and supports research on social influence and language model training.

arxiv arXiv cs.CL · 9d ago

Key Properties for Effective Code Interpreter Reasoning

A study identifies extrinsic (crucial tokens) and intrinsic (cognitive behaviors) properties that enhance code interpreter reasoning in large language models. Stronger reasoning models show higher prevalence of verification, backtracking, and backward chaining, with these properties improving performance during inference and training, reducing overthinking and boosting token efficiency.

arxiv arXiv cs.CL · 9d ago

DeepRubric: Efficient RL for Deep Research Agents

DeepRubric introduces a data construction framework that builds query-rubric pairs by first defining verifiable evaluation targets through an evidence tree. It generates 9K supervision examples and trains an 8B model with GRPO, achieving performance comparable to state-of-the-art models using 13x fewer RL GPU-hours.
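To make the GRPO step concrete, the sketch below shows the group-relative advantage that GRPO is generally described as using: each sampled response's reward is normalized against the mean and standard deviation of its own group. The `rubric_reward` helper (fraction of rubric items satisfied) is an assumption about how a rubric might be turned into a scalar, not DeepRubric's actual scoring, and implementations differ on details such as the standard-deviation estimator.

```python
import statistics


def rubric_reward(checks):
    """Hypothetical scalar reward: fraction of rubric items a response satisfies."""
    return sum(checks) / len(checks)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one query, each checked against a 2-item rubric.
group_checks = [
    [True, True],
    [True, False],
    [False, False],
    [False, True],
]
rewards = [rubric_reward(checks) for checks in group_checks]
advantages = group_relative_advantages(rewards)
print(rewards, [round(a, 3) for a in advantages])
```

Responses that beat their group's average get a positive advantage and the rest get a non-positive one, so no separate value model is needed to form a baseline.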