Reasoning models — korshunov.ai

Study Finds AI Still Fails to Detect Legal Citation Hallucinations

A new study reports that more than 1,000 legal filings contain fabricated citations, and that the number rises each year. A benchmark of five AI models shows improved detection performance, with GPT-5 reaching 82.8% recall and 60.5% F1 in agentic settings, but all models struggle with subtle errors and are constrained by limited access to information.
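As a rough reading aid, the recall and F1 figures above can be turned into an implied precision. The sketch below assumes both numbers come from the same evaluation setting and that F1 is the standard harmonic mean of precision and recall; the summary does not confirm either point, and the function name `implied_precision` is made up for illustration.

```python
def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given recall R and F1."""
    denom = 2 * recall - f1
    if denom <= 0:
        # F1 can never exceed 2R, so such inputs are inconsistent.
        raise ValueError("no valid precision for these recall/F1 values")
    return f1 * recall / denom


# Figures quoted in the summary (assumed to describe the same run).
recall, f1 = 0.828, 0.605
precision = implied_precision(recall, f1)
print(f"implied precision: {precision:.1%}")  # prints "implied precision: 47.7%"
```

Under those assumptions, roughly half of the citations flagged would be false alarms, which fits the summary's point that high recall alone does not make the detector reliable.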

arxiv arXiv cs.CL · 2d ago

Dementia-Agents: Multi-Modal Multi-Agent System for Dementia Staging

Dementia-Agents introduces a clinically aligned multi-agent framework for real-world dementia staging and phenotyping. It improves diagnostic performance over monolithic models and prior systems, while maintaining domain-level interpretability, using data from 1,066 patients across two cognitive neurology services.

arxiv arXiv cs.CL · 2d ago

Profile-Based Reference in LLM Grounding

The paper argues that reference in large language models is not a fixed link but a profile-based, context-sensitive, and numerically structured phenomenon. It proposes that LLMs ground reference through linguistic traces parameterized via optimization, with referential profiles distributed and activated via context-sensitive computations in vector spaces.

arxiv arXiv cs.CL · 2d ago

RoPE Does Not Prevent Retrieval Heads, Study Finds

A mechanistic analysis shows retrieval heads are causally necessary for long-context recall. Higher RoPE frequencies do not reduce head counts, and zeroing low-frequency RoPE dimensions in retrieval heads degrades recall dose-dependently, with effects observed across five models and multiple architectures.

arxiv arXiv cs.CL · 2d ago

SCOPE: Sequential Conformal Probing for OOD Rejection in LLMs

SCOPE introduces a framework that uses a readable hidden layer and conformal calibration to detect out-of-distribution inputs. It employs a supermartingale e-process to provide theoretical guarantees for service-boundary detection, outperforming standard final-layer detectors in multiple LLM backbones.

arxiv arXiv cs.CL · 2d ago

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM Agents

ARCO introduces a rubric framework that enables step-level credit assignment for multi-step LLM agents. It jointly updates a shared model with generation and scoring heads, allowing the rubric content and scoring function to co-evolve via on-policy data, improving performance and interpretability across benchmarks.

arxiv arXiv cs.CL · 2d ago

Factual Retrieval in LLMs Is Non-Contiguous and Redundant

Large language models use non-contiguous, redundant paths to retrieve factual attributes. These paths often skip layers and involve multiple equivalent routes, indicating distributed and redundant knowledge computation, challenging current understanding of LLM knowledge storage and retrieval.

arxiv arXiv cs.CL · 2d ago

Scientific Fine-Tuning Increases LLM Hallucinations

SciFactCheck evaluates 18 LLMs across five scientific domains, finding that scientifically fine-tuned models show degraded factual reliability and reduced internal confidence despite greater linguistic assertiveness. Human studies reveal limited agreement between fact-checking tools and expert judgments, highlighting challenges in defining valid scientific claims.

arxiv arXiv cs.CL · 2d ago

Precision-Recall Controllable Radiology Report Generation

A reinforcement learning framework enables precise control over clinical precision and recall in radiology report generation. By integrating a clinical reward and group-relative training, the model improves clinical efficacy beyond language fluency metrics, outperforming state-of-the-art methods on the MIMIC-CXR dataset.

arxiv arXiv cs.CL · 2d ago

Benchmark Evaluation of Small Language Models for Arabic NLP

A benchmark of 240 Arabic test items across eight domains and ten skills assesses twelve small language models in zero-shot settings. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic, with performance linked more to Arabic alignment and instruction-following than model size. Common failure modes include prompt leakage, hallucination, and weak task adherence.

arxiv arXiv cs.CL · 2d ago

Two-Stage Alignment Improves Math Tutoring Pedagogy

A two-stage alignment pipeline enhances large language models' pedagogical performance in math mistake remediation. The approach combines supervised fine-tuning with Direct Preference Optimization using synthetic data on scaffolding and factuality, outperforming base and existing tutoring models in both accuracy and teaching quality. Human evaluations show the model competes with a proprietary baseline, offering greater openness and reproducibility.

arxiv arXiv cs.CL · 2d ago

MedHal-Loc Benchmark Tests Localization Faithfulness in Medical Hallucination Detectors

MedHal-Loc introduces a benchmark to evaluate whether medical hallucination detectors accurately localize errors. It finds that while some architectures localize well above chance, a knowledge-graph pipeline performs no better than random due to poor entity extraction, despite strong detection performance. The results show that detection capability does not guarantee faithful localization, challenging assumptions about architectural explainability.

arxiv arXiv cs.CL · 2d ago

Ablation Study of Agentic RAG Components with Local 7B Model

A controlled ablation study evaluates agentic RAG components using a local 7B model on HotpotQA. Fixed hybrid retrieval outperforms adaptive routing by 1.8 EM and 1.9 F1, while two retrieval iterations capture 95% of the gains from five. Query decomposition and cross-encoder reranking show statistically significant but smaller improvements.
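For readers unfamiliar with the EM and F1 numbers quoted above, here is a minimal sketch of how exact match and token-level F1 are commonly computed for extractive QA. It follows the widely used SQuAD-style normalization (lowercase, strip punctuation, drop English articles); whether the study uses exactly this scoring is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))              # 1.0
print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```

Benchmark scores are usually these per-question values averaged over the dataset and multiplied by 100, so a gap of 1.8 EM corresponds to about 1.8 more questions answered exactly per hundred.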

arxiv arXiv cs.CL · 2d ago

Case-Specific Dynamic Rubric Framework for Translation Evaluation

The paper proposes a dynamic rubric framework that adapts MQM evaluation spaces to individual translation instances. By selecting subtype spaces and granularities based on case-specific needs, it improves error coverage and localization, outperforming static rubric methods on WMT span-level benchmarks.

blog Simon Willison · 3d ago

Prompt Injection as Role Confusion

Researchers identify 'role confusion' as a key vulnerability in LLMs, where models misinterpret user input due to stylistic similarities with internal role tags. Destyling user prompts reduces attack success from 61% to 10%, showing that subtle text style changes can dramatically alter model behavior, even when the content appears identical to humans.

media MarkTechPost · 3d ago

Sakana AI Launches Sakana Fugu: Multi-Agent Orchestration Model

Sakana AI has launched Sakana Fugu, an orchestration model that routes tasks across a swappable pool of frontier LLMs via a single OpenAI-compatible API. Fugu Ultra outperforms individual models on key benchmarks like SWE Bench Pro and GPQA-D, and the system demonstrates superior performance on complex, multi-step tasks such as auto-research, Rubik's Cube solving, and blindfold chess.

media r/LocalLLaMA · 3d ago

NEX-N2-mini claims Pareto optimality in reasoning efficiency

The NEX-N2-mini release claims 3.5- and 3.6-level reasoning performance with significantly fewer reasoning tokens. Testing shows it is more token-efficient than other MoE models, wasting fewer tokens while maintaining high reasoning quality.

media Import AI · 3d ago

AI Out-Persuades Humans: New Study Shows AI Superior to Experts

A study by Oxford, Stanford, and LSE researchers finds AI systems consistently out-persuade expert humans across four experiments involving 18,978 conversations. AI exceeded professional canvassers by 10.8 percentage points in real-world donations to Save the Children, with Opus 4.1 and Opus 4.6 showing the strongest persuasion performance.

media Hugging Face Forums · 3d ago

Capability Is Not in the Weights: Empirical Negative Result on MLP Weight Projection

An empirical study found that projecting MLP weights from one transformer model into another fails to transfer semantic capability. Every tested variant performed worse than the unmodified host model, indicating a structural limitation in weight projection. The results challenge public claims about model capabilities based on benchmarks, showing such claims do not reflect actual internal weight geometry.

media Hugging Face Forums · 3d ago

LLMs as Epistemic Accelerators: The Risk Is Not Only Hallucination

LLMs do not merely hallucinate; they amplify human epistemic overconfidence by turning weak hypotheses into coherent, polished claims before evidence is verified. This creates a risk of premature certainty in research, policy, and other domains, not because models lie, but because they accelerate human tendencies to favor elegant explanations over uncertainty.