Reasoning models — korshunov.ai


Can Reasoning Models Detect Changes to Their Chains of Thought?

Recent reasoning models show only modest ability to detect changes to their chains of thought. They struggle to identify how their CoT was modified and perform similarly when evaluating changes to their own versus other models' CoTs.

arxiv arXiv cs.CL · 2d ago

TSCognition and TSAlign Advance Time Series Reasoning with LLMs

TSCognition introduces a multimodal benchmark with 41K QA samples across five cognitive reasoning tasks. TSAlign outperforms existing models on TSCognition and TimerBed while reducing computational cost, using patch-level representations and alignment in LLM embedding space.

arxiv arXiv cs.CL · 2d ago

Score Granularity Gap in LLM Confidence Scoring

A study compares seven confidence score methods across 25 model-dataset pairs, finding that single-shot verbalized confidence ranks cases well but offers only a few distinct values, limiting operator thresholds. Multi-query aggregation widens the score granularity gap, improving weak models but degrading strong ones, with trade-offs that inform practical deployment.
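To make "only a few distinct values" concrete, here is a minimal sketch that counts how many distinct score values, and therefore how many meaningful cut points, a confidence method gives an operator. The function names and the toy score lists are made up for illustration; they are not taken from the study.

```python
def distinct_scores(scores, ndigits=2):
    """Return the sorted distinct confidence values after rounding."""
    return sorted({round(s, ndigits) for s in scores})


def usable_thresholds(scores, ndigits=2):
    """Count distinct values; each one is a cut point that changes which cases pass."""
    return len(distinct_scores(scores, ndigits))


# Toy data (hypothetical): verbalized confidence tends to land on round numbers,
# while an average over several queries spreads out over more values.
verbalized = [0.9, 0.9, 0.8, 0.9, 0.95, 0.8, 0.9, 0.95]
aggregated = [0.91, 0.87, 0.79, 0.93, 0.96, 0.82, 0.88, 0.94]

print(usable_thresholds(verbalized))  # 3
print(usable_thresholds(aggregated))  # 8
```

With three distinct values an operator can only choose among three acceptance rates, however well those values rank the cases.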

arxiv arXiv cs.CL · 2d ago

Adaptive Data Scheduling Improves LLM Reinforcement Learning

Adaptive Data Scheduling (ADS) introduces a dual-level data scheduling framework that replaces uniform sampling with adaptive distribution over semantic clusters and policy-boundary sample selection. Experimental results show ADS improves average accuracy by 5.2% over GRPO across three LLMs and seven reasoning benchmarks, demonstrating its effectiveness as a general strategy for LLM RL post-training.

arxiv arXiv cs.CL · 2d ago

Curiosity as Linguistic Intervention in LLM Tutoring

CURIOBOT uses Berlyne's collative variables to create curiosity-driven linguistic interventions in tutoring dialogues. Across 270 conversations, these interventions increased exploratory behaviors by up to 2.4x in conversational turns under fixed time budgets, with gains persisting despite unchanged tutor instruction quality.

arxiv arXiv cs.CL · 2d ago

ORBIT: Training-Free Multi-Attribute Behavioral Steering

ORBIT enables training-free, simultaneous control of multiple behavioral attributes by using orthogonal subspace rotation. It achieves balanced, coherent steering across attributes without retraining, outperforming existing baselines on TraitFactory and ToneBank benchmarks.

arxiv arXiv cs.CL · 2d ago

A Taxonomy of Conceptual Alignment in Human-Robot Dialogue

The paper proposes a design-centric taxonomy for conceptual alignment in human-robot dialogue, defining it as a bidirectional, co-constructive process. It introduces a dialogue act schema to capture interactional moves that enable alignment, offering a structured framework for analyzing and designing such interactions.

arxiv arXiv cs.CL · 2d ago

First-Token Broadcasters in Transformers: Mechanistic Origins of Language Identity

LIHA identifies a small set of first-token broadcaster heads in GPT-2 that persistently attend to the initial prompt token, causing language switches. Instruction tuning reorganizes these circuits, concentrating language identity at early layers, as shown in a controlled comparison between Qwen2.5-1.5B-Base and Qwen2-1.5B-Instruct models. First-token broadcasting is script-specific, with non-Latin languages processed at layer 0, matching the instruct-tuned model's pattern.

arxiv arXiv cs.CL · 2d ago

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR, a two-stage framework, uses supervised fine-tuning and Group Relative Policy Optimization to improve LLM-based automated code compliance checking. It reduces tree edit distance and token-level Levenshtein distance by up to 23.8% and 38.6% respectively, outperforming leading LLMs such as Claude Opus, GPT-5.2, and GLM-4.7 prompted in zero-shot and few-shot settings, and reduces false positives by a statistically significant margin.
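For readers unfamiliar with the second metric, token-level Levenshtein distance is the standard edit distance computed over tokens instead of characters. The sketch below is a generic dynamic-programming version; the function name, the whitespace tokenization, and the example rule strings are assumptions for illustration, not the paper's evaluation code.

```python
def token_levenshtein(a, b):
    """Minimum number of token insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i tokens of a yields the empty sequence
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical predicted and reference rule representations, split on whitespace.
predicted = "height >= 2.4 and width >= 0.9".split()
reference = "height >= 2.5 and width >= 0.9".split()
print(token_levenshtein(predicted, reference))  # 1
```

A lower distance means the generated representation needs fewer token edits to match the reference.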

arxiv arXiv cs.CL · 2d ago

Knowledge-Graph Grounding Helps LLMs Only for Out-of-Training Knowledge

A study finds that knowledge-graph grounding improves LLM answers only for facts outside the training data. On public biomedical knowledge, grounding adds no benefit; on novel or private data, it raises accuracy from chance to near-perfect levels, indicating that external grounding pays off only where the model's training data does not already cover the facts.

arxiv arXiv cs.CL · 2d ago

LLMs Use Difference-Making Logic to Learn Causal Structure

Large language models learn causal structure through a difference-making logic during training, identifying which word sequences influence others. This approach mirrors the experimental method, using variation in text to infer causal relationships, and is supported by analyses of token embeddings and self-attention mechanisms.

arxiv arXiv cs.CL · 2d ago

Character Variety in LLM-Generated Stories

This study compares characters in LLM-generated and human-written stories along narratological dimensions. It finds that LLMs produce characters with similar basic traits but less diversity in complex features such as wholeness and stylization, leaving LLM-generated stories with limited character variety compared to human-written narratives.

arxiv arXiv cs.CL · 2d ago

Speech-Text Models Latently Transcribe Speech in Intermediate Layers

Interleaved speech-language models undergo an implicit transcription phase in which spoken words become decodable as text tokens in intermediate layers, despite no speech recognition training. In up to 77% of cases, the spoken word appears as a top candidate text prediction, followed by a text continuation and a return to speech. This behavior is driven by interleaving data and text LM initialization, and it correlates with spoken knowledge performance.

arxiv arXiv cs.CL · 2d ago

FACTOR Enables Adaptive Verification for Factuality in Long-Form Generation

FACTOR introduces adaptive verification for factual long-form generation by adjusting validation criteria based on claim-level uncertainty. It improves factuality and reduces verification cost through uncertainty estimation, language inference, and candidate re-ranking, with results showing strong performance across diverse models.

media Hugging Face Forums · 2d ago

Buddy System: Rust entropy monitor with NER-gated uncertainty for tiered LLM inference

The Buddy System uses a Rust entropy monitor to detect per-token uncertainty in local Gemma 3 4B inference, routing only uncertain tokens to Sonnet via NER-gated span extraction and semantic retrieval. Benchmarks show it achieves 71.4% accuracy at $0.21, outperforming the Anthropic Advisor pattern (62.9% at $0.44) across seven Hugging Face datasets, with a key improvement on SQuAD v2 by routing source passage chunks to the cloud model.
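The per-token uncertainty signal described here can be illustrated with Shannon entropy over the next-token distribution. The sketch below is in Python, not the project's Rust, and covers only the entropy check; the function names and the 1.0-bit threshold are hypothetical, and the NER gating and retrieval steps are not modeled.

```python
import math


def token_entropy(probs):
    """Shannon entropy in bits of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def uncertain_positions(step_probs, threshold_bits=1.0):
    """Indices of generation steps whose entropy exceeds the (assumed) threshold."""
    return [i for i, probs in enumerate(step_probs)
            if token_entropy(probs) > threshold_bits]


# Toy distributions (hypothetical): a confident step, a uniform step, a mild split.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.6, 0.4],
]
print(uncertain_positions(step_probs))  # [1]
```

Only the flagged positions would be candidates for escalation to the larger model; the rest stay on the local one.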

arxiv arXiv cs.CL · 2d ago

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

VADAOrchestra introduces a neurosymbolic framework that combines LLM-based workflow orchestration with Datalog+/- symbolic reasoning. It enables adaptive, explainable decision-making by incrementally planning workflows and executing logical inference on demand, offering auditability, scalability, and verifiability in real-world financial scenarios.

arxiv arXiv cs.CL · 2d ago

Variance-Calibrated Modulation for LLM Decoding

VCM addresses the likelihood trap in large language model decoding by introducing dynamic mechanisms to reshape probability distributions. It improves diversity, coherence, and reasoning accuracy in open-ended generation, factual QA, and mathematical reasoning with minimal computational overhead.

arxiv arXiv cs.CL · 2d ago

Gazer: Training-Free Semantic Correction for Autoregressive Visual Models

Gazer introduces a training-free framework that uses multimodal large language model feedback to correct semantic errors in real time during autoregressive visual model generation. By integrating reflective diagnosis and semantic correction stages, Gazer improves compositional accuracy and semantic alignment across multiple models without additional training.

arxiv arXiv cs.CL · 2d ago

Multimodal Chain-of-Thought: Capabilities and Limitations

Multimodal Chain-of-Thought reasoning improves performance in mathematical and scientific reasoning but harms visual grounding and object counting in perception tasks. Models exhibit a 'Look Light, Think Heavy' pattern, where visual reflection diminishes while verbal reflection increases, indicating a persistent bottleneck in visual reasoning.

arxiv arXiv cs.CL · 2d ago

Key Factors in RL for LLM Reasoning Revealed

A theoretical analysis shows that off-policy degree, determined by gradient steps per rollout, significantly impacts importance sampling ratios and token update dominance. The study introduces Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries by token group variance, outperforming DAPO and CISPO on 3B and 7B models across mathematical, QA, and logic reasoning tasks.