Reasoning models — korshunov.ai

CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

CFPO introduces a cross-modal counterfactual enhancement mechanism to improve causal consistency between visual perception and textual reasoning in vision-language models. It achieves 3.17%-6.25% gains over standard RL baselines and 1.32%-2.13% over PAPO, without requiring external rewards or supervision.

arXiv cs.CL · 2d ago

Judgment-Grounded Expansion for Peer Review Generation

A new human-AI collaboration method called judgment-grounded expansion enables accountable peer review generation. The approach involves a reviewer providing an evaluative claim, which the system expands into review comment candidates through a structured generate-check-refine process. The study addresses scalable evaluation and candidate set curation, showing conformal prediction effectively balances candidate size and coverage.

arXiv cs.CL · 2d ago
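
The coverage-versus-size trade-off mentioned in the summary above can be illustrated with generic split conformal prediction. This is only a minimal sketch, not the paper's procedure: the nonconformity score (for example, one minus a quality score for a review comment candidate) and the function names `conformal_threshold` and `select_candidates` are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores: list[float], alpha: float) -> float:
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: nonconformity of the known-good candidate on each
                calibration example (lower means more conforming).
    alpha:      tolerated miscoverage rate, e.g. 0.1 for ~90% coverage.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def select_candidates(scores: dict[str, float], threshold: float) -> list[str]:
    """Keep the candidates whose nonconformity score is within the threshold."""
    return [cand for cand, s in scores.items() if s <= threshold]


# Hypothetical calibration scores and candidate scores.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(calibration, alpha=0.5)
kept = select_candidates({"a": 0.2, "b": 0.5, "c": 0.9}, threshold)
```

A smaller `alpha` raises the threshold, so more candidates are kept: higher coverage is paid for with a larger candidate set.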

IMLogic Benchmark and RootMem Framework for Implicit Logical Memory Retrieval

IMLogic is the first high-quality benchmark for evaluating implicit logical memory retrieval in long-dialogue scenarios. RootMem introduces a structured, decision-preserving representation called root memory to distill reusable personalized logic from user histories, and uses an LLM-based router to activate relevant memories, outperforming existing retrieval baselines in accuracy.

arXiv cs.CL · 2d ago

Energy-Based Transformers Predict Reading Difficulty

Energy-based transformers show robust predictive power for reading times across multiple corpora, outperforming surprisal in all cases. The energy measure captures known object/subject asymmetries in relative clause processing and subsumes both attention entropy and surprisal, suggesting it can serve as a unified predictor of reading difficulty.

arXiv cs.CL · 2d ago

Self-Stigma Is Not Uniform: LLMs Need Persona-Aware Support

A study of 1,174 Reddit users reveals four distinct self-stigma personas. LLMs trained to recognize these personas outperform generic models in targeted responses, though clinical experts prefer generic empathy over persona-matched support. The research highlights a tension between tailored empathy and holistic user preference in stigma-related AI interventions.

arXiv cs.CL · 2d ago

ReasoningLens: Hierarchical Visualization for Large Reasoning Models

ReasoningLens presents an open-source framework that visualizes and audits long-chain-of-thought traces in large reasoning models. It structures reasoning into interactive hierarchies, uses an agentic auditor for error detection, and identifies model-specific blind spots through systemic reasoning profiles.

arXiv cs.CL · 2d ago

UnBias-Plus: Detect, Explain, and Rewrite Bias

UnBias-Plus is an open-source toolkit that enables segment-level bias classification, biased span localization, neutral text rewriting, and decision reasoning. It offers multiple access methods including Python, CLI, REST API, and web interfaces, with all source code, models, datasets, and documentation publicly available.

arXiv cs.CL · 2d ago

TriggerBench: Evaluating Prospective Memory in LLMs

TriggerBench introduces a benchmark to assess prospective memory (PM) in large language models, revealing a precision-recall trade-off and attentional fragility. PM is found to be significantly harder than retrospective memory and correlates with spare reasoning capacity, indicating that it reflects underlying cognitive resources rather than token count alone.

arXiv cs.CL · 2d ago

SelfCompact: Self-Driving Context Compaction for Language Models

SelfCompact enables language models to autonomously decide when and how to compact accumulated context during reasoning. By combining a model-invoked summarization tool with a lightweight rubric that guides compaction based on trajectory structure, it achieves effective adaptive compaction without fine-tuning. Results show it matches or exceeds fixed-interval methods on math and agentic search benchmarks, improving baselines by up to 18.1 points on math and 5-9 points on search, at 30-70% lower token cost.

arXiv cs.CL · 2d ago

VeriEvol: Scaling Multimodal Mathematical Reasoning with Verifiable Evolution

VeriEvol introduces a verifiable data-construction framework for visual mathematical reasoning, decoupling prompt difficulty and answer reliability. It evolves image-question prompts using type-aware operators and verifies answers via multi-source counter-evidence falsification. On five benchmarks, scaling from 10K to 250K samples improves mean accuracy from 35.42 to 54.73, with a cumulative +3.88 over baseline, driven by evolved prompts and HTV-Agent verification.

arXiv cs.CL · 2d ago

LLMs Fail to Reliably Self-Report Adversarial Prefills

No large language model reliably detects when its responses were influenced by adversarial prefill attacks. Introspective signals are strongest in safety-related reasoning, but they are probe-dependent and can be amplified by LoRA fine-tuning, which paradoxically increases attack success rates.

arXiv cs.CL · 2d ago

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

Randomized YaRN enhances long-context reasoning by combining YaRN positional extrapolation with randomized positional encoding and a length curriculum. It outperforms standard fine-tuning on benchmarks like BABILong and MRCR, showing significant gains at far out-of-distribution context lengths.

arXiv cs.CL · 2d ago
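
The randomized positional encoding ingredient named in the summary above can be sketched in a few lines. This shows only the general idea of training on sorted position indices sampled from a range longer than the sequence; it is not the paper's exact scheme, it omits the YaRN scaling and the length curriculum, and the function name `randomized_positions` and its parameters are hypothetical.

```python
import random


def randomized_positions(seq_len: int, max_position: int, rng: random.Random) -> list[int]:
    """Sample `seq_len` distinct position ids from [0, max_position), in increasing order.

    Training on such ids exposes the model to large position values even
    when the training sequences themselves are short.
    """
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    return sorted(rng.sample(range(max_position), seq_len))


# An 8-token sequence assigned positions drawn from a 1024-position range.
positions = randomized_positions(8, 1024, random.Random(0))
```

Sorting preserves token order, so relative ordering stays meaningful while the absolute ids cover the longer range.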

Symmetric Q-Sorts Measure Value-Structure Alignment in LLMs

A new framework uses symmetric human-LLM Q-sorts to evaluate how large language models structurally align with moral values. By comparing rankings of 140 moral statements across 12 LLMs and a human reference sample, the study identifies cross-family heterogeneity and localized misalignments, showing that global performance scores can mask structural flaws. The results highlight the need for structural evaluations to complement traditional item-level moral benchmarks.

arXiv cs.CL · 2d ago

Are Multilingual Models Actually Improving? Isolating True Cross-Lingual Transfer

A new metric, Hardness Adjusted Transfer (HAT) Score, isolates true cross-lingual transfer by separating it from source language accuracy gains. Analysis of 20 language models shows transfer in small models is not broken, progress with model size is slower than expected, and clear improvements have occurred over time.

arXiv cs.CL · 2d ago

Can LLMs Control Readability in Arabic?

A multi-dimensional evaluation framework assesses CEFR-controlled Arabic text generation by LLMs. Results show that CEFR-guided prompting with lexical constraints achieves high alignment with linguistic profiles and predicted readability, while unconstrained prompting shows weak control.

arXiv cs.CL · 2d ago

Benchmarking LLMs for Japanese Grapheme-to-Phoneme Conversion

A study evaluates over 30 large language models on Japanese grapheme-to-phoneme conversion using 3,000 manually annotated sentences. The best LLMs achieve a kana character error rate below 0.52%, outperforming the best conventional tool (1.03%). Parse mode, which adds rule-based post-processing, performs better than direct mode for most models, and LLM-predicted kana improves pronunciation when fed into kana-input TTS systems.

arXiv cs.CL · 2d ago
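
The kana character error rate quoted above is conventionally the edit distance between predicted and reference kana, divided by the number of reference characters. The sketch below shows that standard definition. Corpus-level pooling of edits is an assumption, since the summary does not say how the study aggregates over sentences, and `edit_distance` and `kana_cer` are illustrative names.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def kana_cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# One wrong kana out of five reference characters.
cer = kana_cer(["こんにちは"], ["こんにちわ"])
```

On this definition, a CER of 0.52% corresponds to roughly one character error per 190 reference kana.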

Nous: A Predictive World Model for Long-Term Agent Memory

Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all four, though differences in evaluation setup limit full comparability.

arXiv cs.CL · 2d ago

Can Reasoning Models Detect Changes to Their Chains of Thought?

Recent reasoning models show only modest ability to detect changes to their chains of thought. They struggle to identify how their CoT was modified and perform similarly when evaluating changes to their own versus other models' CoTs.

arXiv cs.CL · 2d ago

TSCognition and TSAlign Advance Time Series Reasoning with LLMs

TSCognition introduces a multimodal benchmark with 41K QA samples across five cognitive reasoning tasks. TSAlign outperforms existing models on TSCognition and TimerBed while reducing computational cost, using patch-level representations and alignment in LLM embedding space.

arXiv cs.CL · 2d ago

Score Granularity Gap in LLM Confidence Scoring

A study compares seven confidence score methods across 25 model-dataset pairs, finding that single-shot verbalized confidence ranks cases well but offers only a few distinct values, limiting operator thresholds. Multi-query aggregation widens the score granularity gap, improving weak models but degrading strong ones, with trade-offs that inform practical deployment.
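
A small sketch can make the granularity point concrete: a confidence score that takes only a few distinct values gives an operator only a few usable thresholds, while averaging several queries per case produces more distinct values. The data, the rounding precision, and the helper names `distinct_values` and `aggregate` are hypothetical, and this is not one of the seven methods compared in the study.

```python
from statistics import mean


def distinct_values(scores: list[float], ndigits: int = 3) -> int:
    """Count distinct score values after rounding; each is a possible threshold."""
    return len({round(s, ndigits) for s in scores})


def aggregate(samples_per_case: list[list[float]]) -> list[float]:
    """Average several sampled confidences per case into one score."""
    return [mean(samples) for samples in samples_per_case]


# Single-shot verbalized confidences for four cases (hypothetical data).
single_shot = [0.9, 0.8, 0.9, 0.8]

# Three sampled confidences per case for the same four cases.
multi_query = [
    [0.9, 0.8, 0.9],
    [0.8, 0.8, 0.9],
    [0.9, 0.9, 0.9],
    [0.8, 0.8, 0.8],
]

coarse = distinct_values(single_shot)
fine = distinct_values(aggregate(multi_query))
```

Here the single-shot scores take two values and the averaged scores take four. Whether the finer scores also rank cases better is a separate empirical question, which is the trade-off the study reports.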