Reasoning models — korshunov.ai


T-API-Compliant ReAct Loop for Optical Networks

A T-API-compliant ReAct agentic loop is introduced for optical networks, enabling intent-driven, closed-loop management. Domain-specific composite tools achieve 90% oracle-validated correctness and cut token usage threefold compared with generic tools.

arXiv cs.AI · 8d ago

LLM Consumer Behavior Theory: A New Research Field

This paper introduces LLM Consumer Behavior Theory, a new field analyzing how large language models make consumption decisions on behalf of users. It unifies research on LLM decision-making, human behavior simulation, and preference elicitation under economic principles, identifying key gaps in assumptions like rationality and heterogeneity in agentic markets.

arXiv cs.AI · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arXiv cs.AI · 8d ago

Catastrophic Forgetting is Low-Rank: A Function-Space Theory

A function-space theory reveals that catastrophic forgetting in continual adaptation concentrates in a small number of old-task NTK eigenmodes. In frozen-backbone linear-head PEFT-CL, the forgetting vector is exactly predictable up to numerical precision, with a Kronecker scaling rule for the vulnerable rank.

arXiv cs.AI · 8d ago

Source Language Effects in Cross-Lingual In-Context Learning

A study finds that fine-tuning-based assumptions about cross-lingual transfer do not apply in few-shot In-Context Learning. The research reveals that source language selection significantly impacts performance and identifies new heuristics for effective cross-lingual ICL.

arXiv cs.AI · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard introduces a source-aware verifier for MCP-based LLM agents that detects cross-source conflation by routing claims to specific evidence sources and comparing stated attribution with actual source ownership. It achieves block F1 of 0.802 and source accuracy of 0.858 on 260 source-eligible claims, outperforming source-blind baselines, and detects all injected attribution swaps in 50 clinical probes.

arXiv cs.AI · 8d ago

AI's Synthetic Lived Experience in Caregiver Support

LLMs can generate peer-like responses that mimic personal narratives, creating a false impression of lived experience. Psycholinguistic analysis shows AI uses less first-person and past-focused language than human peers, and often fabricates experiential grounding. This reveals a narrative authenticity gap, requiring AI systems to distinguish supportive framing from fabricated lived experience.

arXiv cs.AI · 8d ago

ScaFE: Using LLMs to Extract Clinically Meaningful Scar Features

ScaFE proposes using large language models as feature engineers to transform medical images into clinically interpretable representations. By generating deterministic Python code from established scar assessment criteria, it extracts features aligned with clinical scoring systems like the Vancouver Scar Scale. The method achieves superior performance under limited data, with advantages in data efficiency, privacy preservation, and interpretability.

arXiv cs.AI · 8d ago

Agentic AI Framework Reduces Diagnostic Errors in Healthcare

A multi-agent AI framework addresses premature diagnostic handoff and silent hallucinations in healthcare by enforcing structured clinical protocol completion and epistemic uncertainty quantification. Evaluations on 150 simulated cases show 49.3% diagnostic precision, an improvement of 11.3 percentage points over the baseline, with a statistically significant negative correlation between OLDCARTS completeness and diagnostic uncertainty.

arXiv cs.AI · 8d ago

HyGRAG: Unified Framework for Context- and Relation-Aware Graph RAG

HyGRAG introduces a hierarchical graph RAG framework that integrates contextual and relational information through synthesized summaries. It enables emergent knowledge retrieval via context and relation-aware search across abstraction levels and supports dynamic updates with local re-summarization. Experiments show a 9.7% improvement in multi-hop reasoning accuracy.

arXiv cs.AI · 8d ago

IsabeLLM: AI-Driven Theorem Proving for Consensus Verification

IsabeLLM, an automated theorem proving tool in Isabelle, incorporates a Retrieval-Augmented Generation framework, error tracing, and counterexample generation to enhance context for large language models. The updated version demonstrates improved performance in verifying Bitcoin's Proof of Work consensus protocol compared to the original.

arXiv cs.AI · 8d ago

Quality-Aware Self-Distillation for GUI Grounding

A new method improves GUI grounding by using soft correctness-aware gating and teacher-probability scaling to enhance coordinate-token teacher signals. These components work together to suppress unreliable supervision and calibrate remaining signals, with experiments showing consistent performance gains across six benchmarks.

arXiv cs.AI · 8d ago

ALeRCE Launches Text-to-SQL System with LLMs

The ALeRCE astronomical database introduces a text-to-SQL system using large language models, enabling natural language queries to generate executable SQL. The system, evaluated on 110 NL/SQL pairs, uses a step-by-step framework that outperforms direct-inference baselines. Claude Opus 4.6 achieves high precision on simple queries and ranks among the best-performing models overall.

arXiv cs.AI · 8d ago

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

The paper introduces a framework for multi-policy multi-objective reinforcement learning that learns a set of Pareto-optimal policies ensuring fairness across diverse user preferences. It proves fair policies remain within the convex coverage set for concave welfare functions like GGF and proposes three algorithms that incorporate non-stationary and stochastic policies to adapt to historical inequities. Empirical results show these methods effectively learn fair policies across multiple domains.

arXiv cs.AI · 8d ago

First Proof Second Batch: AI Tested on Research-Level Math Problems

A study evaluated several AI systems on ten research-level mathematics problems created by prominent mathematicians. The results include AI-generated solutions, human solutions, and referee reports, offering a detailed assessment of AI performance in solving advanced mathematical problems.

arXiv cs.AI · 9d ago

Introducing COGNITIVE ATROPHY BENCH for LLM Mental-Health Interactions

A new benchmark, COGNITIVE ATROPHY BENCH, measures how LLMs induce cognitive decline in mental-health conversations. Built from 1,576 human-generated counseling sessions and evaluated by clinical experts, it identifies patterns like directive advice and validation that may reduce user autonomy. The tool introduces metrics such as UIRI and ARI to assess atrophy risk and track behavioral trajectories across user interactions.

arXiv cs.AI · 9d ago

Meta-Knowledge Reutilization in Reinforcement Learning

A new framework learns task-level knowledge on a simplified agent and transfers it to heterogeneous agents. It uses Bayesian non-parametric priors and a high-level policy to generate task guidance, with a semantic-magnitude interface and temporal adaptor to align meta-knowledge with embodiment-specific controllers. Experiments show a 94.75% to 99.79% reduction in final-step tracking error, and comparable performance using 23.8% of the interaction data used by state-of-the-art methods.

arXiv cs.AI · 9d ago

Flash Endurance as Depreciating Capital in Robot Memory

A robot's flash memory endurance is a non-renewable asset that degrades with each write. A wear-aware pricing model introduces a shadow price $η$ to guide memory placement across RAM, NVM, and cloud, with optimal routing depending on the value-write association $χ$. Empirical measurements show $χ$ is positive in long-horizon manipulation, null in short-horizon tasks, and negative in teleoperation. The endurance budget is binding only on low-end QLC/eMMC memory, where wear-aware control influences routing based on task value without improving performance.

arXiv cs.AI · 9d ago

WEQA: Wearable Health Question Answering with Query-Adaptive Agentic Reasoning

WEQA introduces a query-adaptive agent framework that combines language models with specialized wearable data analysis tools. It outperforms LLM and agentic baselines by 24% in accuracy and demonstrates improved usefulness and clinical soundness in expert and user evaluations.

arXiv cs.AI · 9d ago

Measurement Gap in EU Law Automation

Large language models can produce median-quality legal text, but no benchmark evaluates their ability to perform doctrinal legal reasoning. This gap undermines the EU AI Act's requirement of 'appropriate accuracy' in judicial AI, as the necessary doctrinal-reasoning evaluation remains absent.