Reasoning models — korshunov.ai

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

Randomized YaRN enhances long-context reasoning by combining YaRN positional extrapolation with randomized positional encoding and a length curriculum. It outperforms standard fine-tuning on benchmarks like BABILong and MRCR, showing significant gains at far out-of-distribution context lengths.

arXiv cs.CL · 3d ago

Symmetric Q-Sorts Measure Value-Structure Alignment in LLMs

A new framework uses symmetric human-LLM Q-sorts to evaluate how large language models structurally align with moral values. By comparing rankings of 140 moral statements across 12 LLMs and a human reference sample, the study identifies cross-family heterogeneity and localized misalignments, showing that global performance scores can mask structural flaws. The results highlight the need for structural evaluations to complement traditional item-level moral benchmarks.

arXiv cs.CL · 3d ago

Are Multilingual Models Actually Improving? Isolating True Cross-Lingual Transfer

A new metric, Hardness Adjusted Transfer (HAT) Score, isolates true cross-lingual transfer by separating it from source language accuracy gains. Analysis of 20 language models shows transfer in small models is not broken, progress with model size is slower than expected, and clear improvements have occurred over time.

arXiv cs.CL · 3d ago

Can LLMs Control Readability in Arabic?

A multi-dimensional evaluation framework assesses CEFR-controlled Arabic text generation by LLMs. Results show that CEFR-guided prompting with lexical constraints achieves high alignment with linguistic profiles and predicted readability, while unconstrained prompting shows weak control.

arXiv cs.CL · 3d ago

Benchmarking LLMs for Japanese Grapheme-to-Phoneme Conversion

A study evaluates over 30 large language models on Japanese grapheme-to-phoneme conversion using 3000 manually annotated sentences. The best LLMs achieve a kana character error rate below 0.52%, outperforming the best conventional tool (1.03%). Parse mode, with rule-based post-processing, performs better than direct mode for most models, and LLM-predicted kana improves TTS pronunciation when fed into kana-input TTS.

arXiv cs.CL · 3d ago

Nous: A Predictive World Model for Long-Term Agent Memory

Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all, though evaluation differences limit full comparability.

arXiv cs.CL · 3d ago

Can Reasoning Models Detect Changes to Their Chains of Thought?

Recent reasoning models show only modest ability to detect changes to their chains of thought. They struggle to identify how their CoT was modified and perform similarly when evaluating changes to their own versus other models' CoTs.

arXiv cs.CL · 3d ago

TSCognition and TSAlign Advance Time Series Reasoning with LLMs

TSCognition introduces a multimodal benchmark with 41K QA samples across five cognitive reasoning tasks. TSAlign outperforms existing models on TSCognition and TimerBed while reducing computational cost, using patch-level representations and alignment in LLM embedding space.

arXiv cs.CL · 3d ago

Score Granularity Gap in LLM Confidence Scoring

A study compares seven confidence score methods across 25 model-dataset pairs, finding that single-shot verbalized confidence ranks cases well but offers only a few distinct values, which limits the thresholds an operator can set. Multi-query aggregation narrows this score granularity gap, improving weak models but degrading strong ones, with trade-offs that inform practical deployment.

arXiv cs.CL · 3d ago

Adaptive Data Scheduling Improves LLM Reinforcement Learning

Adaptive Data Scheduling (ADS) introduces a dual-level data scheduling framework that replaces uniform sampling with adaptive distribution over semantic clusters and policy-boundary sample selection. Experimental results show ADS improves average accuracy by 5.2% over GRPO across three LLMs and seven reasoning benchmarks, demonstrating its effectiveness as a general strategy for LLM RL post-training.

arXiv cs.CL · 3d ago

Curiosity as Linguistic Intervention in LLM Tutoring

CURIOBOT uses Berlyne's collative variables to create curiosity-driven linguistic interventions in tutoring dialogues. Across 270 conversations, these interventions increased exploratory behaviors by up to 2.4x in conversational turns under fixed time budgets, with gains persisting despite unchanged tutor instruction quality.

arXiv cs.CL · 3d ago

ORBIT: Training-Free Multi-Attribute Behavioral Steering

ORBIT enables training-free, simultaneous control of multiple behavioral attributes by using orthogonal subspace rotation. It achieves balanced, coherent steering across attributes without retraining, outperforming existing baselines on TraitFactory and ToneBank benchmarks.

arXiv cs.CL · 3d ago

A Taxonomy of Conceptual Alignment in Human-Robot Dialogue

The paper proposes a design-centric taxonomy for conceptual alignment in human-robot dialogue, defining it as a bidirectional, co-constructive process. It introduces a dialogue act schema to capture interactional moves that enable alignment, offering a structured framework for analyzing and designing such interactions.

arXiv cs.CL · 3d ago

First-Token Broadcasters in Transformers: Mechanistic Origins of Language Identity

LIHA identifies a small set of first-token broadcaster heads in GPT-2 that persistently attend to the initial prompt token, causing language switches. Instruction tuning reorganizes these circuits, concentrating language identity at early layers, as shown in a controlled comparison between the Qwen2.5-1.5B-Base and Qwen2.5-1.5B-Instruct models. First-token broadcasting is script-specific, with non-Latin languages processed at layer 0, matching the instruct-tuned model's pattern.

arXiv cs.CL · 3d ago

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR, a two-stage framework, uses supervised fine-tuning and Group Relative Policy Optimization to enhance large language model-based automated code compliance systems. It reduces tree edit and token-level Levenshtein distances by up to 23.8% and 38.6% respectively, outperforming leading LLMs like Claude Opus, GPT-5.2, and GLM-4.7 in both zero-shot and few-shot prompting settings, and reduces false positives by a statistically significant margin.

arXiv cs.CL · 3d ago

Knowledge-Graph Grounding Helps LLMs Only for Out-of-Training Knowledge

A study finds that knowledge-graph grounding improves LLMs only when they answer questions about out-of-training facts. On public biomedical knowledge, grounding adds no benefit, but on novel or private data it boosts accuracy from chance to near-perfect levels, indicating that grounding pays off only where the training data does not already cover the facts.

arXiv cs.CL · 3d ago

LLMs Use Difference-Making Logic to Learn Causal Structure

Large language models learn causal structure through a difference-making logic during training, identifying which word sequences influence others. This approach mirrors the experimental method, using variation in text to infer causal relationships, and is supported by analyses of token embeddings and self-attention mechanisms.

arXiv cs.CL · 3d ago

Character Variety in LLM-Generated Stories

This study compares characters in LLM-generated and human-written stories along narratological dimensions. LLMs produce characters with similar basic traits but show less diversity in complex character features such as wholeness and stylization, so their stories offer limited character variety compared with human-written narratives.

arXiv cs.CL · 3d ago

Speech-Text Models Latently Transcribe Speech in Intermediate Layers

Interleaved speech-language models undergo an implicit transcription phase in which spoken words become decodable as text tokens in intermediate layers, despite no speech recognition training. For up to 77% of the data, the spoken word appears as a top candidate text prediction, followed by text continuation and a return to speech. This behavior is driven by interleaving data and text LM initialization, and it correlates with spoken knowledge performance.

arXiv cs.CL · 3d ago

FACTOR Enables Adaptive Verification for Factuality in Long-Form Generation

FACTOR introduces adaptive verification for factual long-form generation by adjusting validation criteria based on claim-level uncertainty. It improves factuality and reduces verification cost through uncertainty estimation, language inference, and candidate re-ranking, with results showing strong performance across diverse models.