Reasoning models — korshunov.ai

DART: Training-Free Routing for Adaptive Thinking Budgets

DART enables hybrid reasoning models to route queries between direct answering and extended thinking without training data. It uses two no-think drafts to decide the response mode and estimates the thinking budget from draft disagreement. DART improves accuracy by up to 9.0 points in math and 22.5 points in code reasoning while reducing thinking tokens by 15-69% and 51-63%, respectively.

arXiv cs.CL · 2d ago
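
The routing idea in the summary can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: the `route` function, the `draft_fn` callable standing in for a no-think generation call, the exact-match agreement rule, the string-similarity disagreement measure, and the budget bounds.

```python
from difflib import SequenceMatcher
from typing import Callable, NamedTuple


class Route(NamedTuple):
    mode: str             # "direct" or "think"
    thinking_budget: int  # thinking tokens; 0 when answering directly


def route(
    query: str,
    draft_fn: Callable[[str], str],  # hypothetical no-think generation call
    min_budget: int = 512,           # assumed lower bound, not from the paper
    max_budget: int = 8192,          # assumed upper bound, not from the paper
) -> Route:
    """Pick a response mode from two cheap drafts of the same query."""
    first = draft_fn(query).strip()
    second = draft_fn(query).strip()

    # Assumed rule: identical drafts mean the query is easy enough
    # to answer directly, with no thinking tokens spent.
    if first == second:
        return Route("direct", 0)

    # Assumed disagreement measure: 1 minus character-level similarity.
    disagreement = 1.0 - SequenceMatcher(None, first, second).ratio()

    # Scale the budget linearly with disagreement.
    budget = int(min_budget + disagreement * (max_budget - min_budget))
    return Route("think", budget)
```

With a `draft_fn` that returns the same answer twice, this sketch answers directly; the more the two drafts differ, the larger the thinking budget it assigns.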

Memory Contagion: Bias Propagation in Agent Memory

Researchers identify Memory Contagion, a phenomenon where evaluator bias propagates across time in agent memory. Even with perfect memory consolidation, bias spreads to future agents retrieving from the same memory store, with contamination detected as low as p=0.2. The effect varies by bias type: length bias is attenuated, while authority bias is amplified, indicating a bias-dependent interaction.

arXiv cs.CL · 2d ago

Task-Sensitive Analysis of Intrinsic Self-Correction

A study examines when intrinsic self-correction works by analyzing its performance across different task structures. Self-correction yields consistent gains only when the task supports explicit constraint verification, complex reasoning revision, or strategy evaluation; it is not universally effective.

arXiv cs.CL · 2d ago

CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

CFPO introduces a cross-modal counterfactual enhancement mechanism to improve causal consistency between visual perception and textual reasoning in vision-language models. It achieves 3.17%-6.25% gains over standard RL baselines and 1.32%-2.13% over PAPO, without requiring external rewards or supervision.

arXiv cs.CL · 2d ago

Judgment-Grounded Expansion for Peer Review Generation

A new human-AI collaboration method called judgment-grounded expansion enables accountable peer review generation. The approach involves a reviewer providing an evaluative claim, which the system expands into review comment candidates through a structured generate-check-refine process. The study addresses scalable evaluation and candidate set curation, showing conformal prediction effectively balances candidate size and coverage.

arXiv cs.CL · 2d ago
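
To make the trade-off between candidate set size and coverage concrete, here is a minimal sketch of standard split conformal prediction. It is a generic illustration, not the paper's procedure: the function names and the idea of scoring each review comment candidate with a nonconformity score (lower means more acceptable) are assumptions.

```python
import math
from typing import List, Sequence


def conformal_threshold(calibration_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile of nonconformity scores.

    calibration_scores: scores of known-acceptable candidates on held-out data.
    alpha: target miscoverage rate (0.25 aims for about 75% coverage).
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def select_candidates(
    candidates: Sequence[str], scores: Sequence[float], threshold: float
) -> List[str]:
    """Keep candidates whose nonconformity score is within the threshold."""
    return [c for c, s in zip(candidates, scores) if s <= threshold]
```

A smaller `alpha` raises the threshold, so more candidates are kept and coverage goes up; a larger `alpha` yields a shorter candidate list with weaker coverage.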

IMLogic Benchmark and RootMem Framework for Implicit Logical Memory Retrieval

IMLogic is the first high-quality benchmark for evaluating implicit logical memory retrieval in long-dialogue scenarios. RootMem introduces a structured, decision-preserving representation called root memory to distill reusable personalized logic from user histories, and uses an LLM-based router to activate relevant memories, outperforming existing retrieval baselines in accuracy.

arXiv cs.CL · 2d ago

Energy-Based Transformers Predict Reading Difficulty

Energy-based transformers show robust predictive power for reading times across multiple corpora, outperforming surprisal in all cases. The energy measure captures known object/subject asymmetries in relative clause processing and subsumes both attention entropy and surprisal, suggesting it as a unified predictor of reading difficulty.

arXiv cs.CL · 2d ago

Self-Stigma Is Not Uniform: LLMs Need Persona-Aware Support

A study of 1,174 Reddit users reveals four distinct self-stigma personas. LLMs trained to recognize these personas outperform generic models in targeted responses, though clinical experts prefer generic empathy over persona-matched support. The research highlights a tension between tailored empathy and holistic user preference in stigma-related AI interventions.

arXiv cs.CL · 2d ago

ReasoningLens: Hierarchical Visualization for Large Reasoning Models

ReasoningLens presents an open-source framework that visualizes and audits long-chain-of-thought traces in large reasoning models. It structures reasoning into interactive hierarchies, uses an agentic auditor for error detection, and identifies model-specific blind spots through systemic reasoning profiles.

arXiv cs.CL · 2d ago

UnBias-Plus: Detect, Explain, and Rewrite Bias

UnBias-Plus is an open-source toolkit that enables segment-level bias classification, biased span localization, neutral text rewriting, and decision reasoning. It offers multiple access methods including Python, CLI, REST API, and web interfaces, with all source code, models, datasets, and documentation publicly available.

arXiv cs.CL · 2d ago

TriggerBench: Evaluating Prospective Memory in LLMs

TriggerBench introduces a benchmark to assess prospective memory in large language models, revealing a precision-recall trade-off and attentional fragility. Prospective memory is found to be significantly harder than retrospective memory and correlates with spare reasoning capacity, indicating that prospective memory reflects underlying cognitive resources beyond token count.

arXiv cs.CL · 2d ago

SelfCompact: Self-Driving Context Compaction for Language Models

SelfCompact enables language models to autonomously decide when and how to compact accumulated context during reasoning. By combining a model-invoked summarization tool with a lightweight rubric that guides compaction based on trajectory structure, it achieves effective adaptive compaction without fine-tuning. Results show it matches or exceeds fixed-interval methods on math and agentic search benchmarks, improving baselines by up to 18.1 points on math and 5-9 points on search, at 30-70% lower token cost.

arXiv cs.CL · 2d ago

VeriEvol: Scaling Multimodal Mathematical Reasoning with Verifiable Evolution

VeriEvol introduces a verifiable data-construction framework for visual mathematical reasoning, decoupling prompt difficulty and answer reliability. It evolves image-question prompts using type-aware operators and verifies answers via multi-source counter-evidence falsification. On five benchmarks, scaling from 10K to 250K samples improves mean accuracy from 35.42 to 54.73, with a cumulative +3.88 over baseline, driven by evolved prompts and HTV-Agent verification.

arXiv cs.CL · 2d ago

LLMs Fail to Reliably Self-Report Adversarial Prefills

No large language models reliably detect when their responses were influenced by adversarial prefill attacks. Introspective signals are strongest in safety-related reasoning, but are probe-dependent and can be amplified by LoRA fine-tuning, which paradoxically increases attack success rates.

arXiv cs.CL · 2d ago

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

Randomized YaRN enhances long-context reasoning by combining YaRN positional extrapolation with randomized positional encoding and a length curriculum. It outperforms standard fine-tuning on benchmarks like BABILong and MRCR, showing significant gains at far out-of-distribution context lengths.

arXiv cs.CL · 2d ago
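
The randomized positional encoding ingredient can be sketched as follows. This shows one common form of the idea, sampling an ordered subset of position ids from a range much longer than the training sequence. It does not reproduce the paper's YaRN scaling or length curriculum, and the function name and parameters are assumptions for illustration.

```python
import random
from typing import List


def sample_positions(seq_len: int, max_position: int, rng: random.Random) -> List[int]:
    """Sample strictly increasing position ids from [0, max_position).

    Training on sparse, ordered ids drawn from a long range exposes the model
    to large position values even when the actual sequence is short.
    """
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    # Sample without replacement, then sort to preserve token order.
    return sorted(rng.sample(range(max_position), seq_len))


rng = random.Random(0)
# 8 tokens receive ids spread over a hypothetical 64-position range.
positions = sample_positions(seq_len=8, max_position=64, rng=rng)
```

The sampled ids replace the usual `0, 1, 2, ...` positions during training, while the relative order of tokens stays intact.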

Symmetric Q-Sorts Measure Value-Structure Alignment in LLMs

A new framework uses symmetric human-LLM Q-sorts to evaluate how large language models structurally align with moral values. By comparing rankings of 140 moral statements across 12 LLMs and a human reference sample, the study identifies cross-family heterogeneity and localized misalignments, showing that global performance scores can mask structural flaws. The results highlight the need for structural evaluations to complement traditional item-level moral benchmarks.

arXiv cs.CL · 2d ago

Are Multilingual Models Actually Improving? Isolating True Cross-Lingual Transfer

A new metric, Hardness Adjusted Transfer (HAT) Score, isolates true cross-lingual transfer by separating it from source language accuracy gains. Analysis of 20 language models shows transfer in small models is not broken, progress with model size is slower than expected, and clear improvements have occurred over time.

arXiv cs.CL · 2d ago

Can LLMs Control Readability in Arabic?

A multi-dimensional evaluation framework assesses CEFR-controlled Arabic text generation by LLMs. Results show that CEFR-guided prompting with lexical constraints achieves high alignment with linguistic profiles and predicted readability, while unconstrained prompting shows weak control.

arXiv cs.CL · 2d ago

Benchmarking LLMs for Japanese Grapheme-to-Phoneme Conversion

A study evaluates over 30 large language models on Japanese grapheme-to-phoneme conversion using 3000 manually annotated sentences. The best LLMs achieve a kana character error rate below 0.52%, outperforming the best conventional tool (1.03%). Parse mode, with rule-based post-processing, performs better than direct mode for most models, and LLM-predicted kana improves TTS pronunciation when fed into kana-input TTS.

arXiv cs.CL · 2d ago
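
For readers unfamiliar with the metric, character error rate is conventionally the Levenshtein edit distance between predicted and reference strings divided by the reference length. The sketch below computes it over kana strings; pooling errors across all sentences (micro-averaging) is an assumption here, since the summary does not say how the study aggregates.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Character error rate pooled over all sentence pairs."""
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("references must contain at least one character")
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / total


# Two substituted characters out of five reference characters.
example = cer(["トウキョウ"], ["トーキョー"])  # 0.4
```

On this scale, the reported error rates of 0.52% and 1.03% correspond to roughly one wrong kana character per 190 and per 100 reference characters, respectively.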

Nous: A Predictive World Model for Long-Term Agent Memory

Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all, though evaluation differences limit full comparability.