Reasoning models — korshunov.ai


LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders these states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arXiv cs.AI · 7d ago
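
To illustrate the LedgerAgent idea above (state kept in a ledger, rendered into the prompt, and checked against policy before a tool runs), here is a minimal sketch. It is not the paper's implementation: `Ledger`, `refund_policy`, `guarded_call`, and the `issue_refund` tool are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical task-state ledger kept separately from the chat transcript."""
    facts: dict = field(default_factory=dict)

    def render(self) -> str:
        # Render the state as sorted "key: value" lines for inclusion in a prompt.
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))


def refund_policy(ledger, tool, args):
    """Hypothetical policy: refunds require a verified identity."""
    if tool == "issue_refund" and not ledger.facts.get("identity_verified", False):
        return "refund requires verified identity"
    return None  # no violation


def guarded_call(ledger, policies, tool, args, tools):
    # Check every policy before the tool executes; block on the first violation.
    for policy in policies:
        reason = policy(ledger, tool, args)
        if reason is not None:
            return {"status": "blocked", "reason": reason}
    result = tools[tool](**args)
    # Record the outcome so later prompts and policy checks can see it.
    ledger.facts[f"last_{tool}"] = result
    return {"status": "ok", "result": result}
```

In this sketch a refund request is blocked until `identity_verified` is set in the ledger, after which the same call goes through and its result is recorded.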

AI Economist Agent: Model-Grounded Economic Analysis Framework

The AI Economist Agent uses RAG, knowledge graphs, and LLMs to generate economic narratives grounded in theory and data. It enables model-based analysis, evidence retrieval, and report generation, ensuring economic coherence and traceability through explicit model computations.

arXiv cs.AI · 7d ago

Lean as Process-Verified Reward Oracle in RL for Theorem Proving

This work shows that Lean can serve as a symbolic process oracle, providing fine-grained, verified feedback during reinforcement learning. By parsing proof attempts into tactic sequences and using Lean's elaboration to mark sound steps and first failures, the system generates dense, type-theoretic reward signals. Experiments demonstrate that tactic-level supervision outperforms outcome-only methods on benchmarks such as MiniF2F and ProofNet, highlighting Lean's role as both evaluator and training reward source.

arXiv cs.AI · 7d ago
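
As a rough illustration of the tactic-level reward described in the Lean item above, the sketch below turns per-tactic verdicts into a dense reward sequence. It does not call Lean: the list of booleans stands in for what a checker would report, and the function name and reward values (+1 per sound step, -1 at the first failure, 0 afterwards) are assumptions, not the paper's scheme.

```python
def tactic_rewards(step_ok, sound_reward=1.0, failure_penalty=-1.0):
    """Map per-tactic verdicts to per-tactic rewards.

    step_ok: list of booleans, one per tactic, True if the step was accepted.
    Steps after the first failure get 0.0 because they were never verified.
    """
    rewards = []
    failed = False
    for ok in step_ok:
        if failed:
            rewards.append(0.0)
        elif ok:
            rewards.append(sound_reward)
        else:
            rewards.append(failure_penalty)
            failed = True
    return rewards
```

An outcome-only signal would collapse the same attempt into a single pass/fail value, which is the contrast the summary draws.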

EEG Foundation Models for Burst-Suppression Detection in ICU

A study evaluates EEG Foundation Models for event-based burst-suppression detection in ICU settings without patient-specific calibration. REVE-base achieved the highest event-based F1-score of 0.868 and reduced burst-per-minute error by 52.1% compared to EEGNet and 36.2% compared to adaptive thresholding, demonstrating superior performance. Ablation results show full fine-tuning outperforms other strategies, and pretrained REVE-base surpasses random initialization by 0.723 F1 points at 25% labeled data, highlighting the value of pretraining for limited datasets.

arXiv cs.AI · 7d ago

Hidden Evolution of Disguised Visual Context in VLMs

Visual tokens enter large language models as raw, unstructured signals. Their internal transformation and integration depend on architecture—either as in-context prompts or injected into intermediate layers—leading to distinct evolution paths in visual representation and frequency characteristics. We find that attention alone is insufficient; performance is driven by the quality of visual representations at each layer across different integration paradigms.

arXiv cs.AI · 7d ago

Attention-Based SAC for Porosity Prediction in Additive Manufacturing

A multi-head attention feature extractor integrated with Soft Actor-Critic improves porosity prediction and process parameter optimization in laser powder bed fusion. The method achieves a convergence value of 322.79 in 14 episodes, outperforming DQN, PPO, TD3, and vanilla SAC with faster convergence and greater stability.

arXiv cs.AI · 7d ago

MakeupMirror Improves Facial Attribute Preservation in Diffusion Models

MakeupMirror, a diffusion-based makeup transfer model, preserves facial features and skin tone better than Stable-Makeup. It achieves a 60% improvement in facial recognition similarity and a 50% reduction in skin tone difference, with 94% expert acceptance and 0.7s inference latency on diverse datasets.

arXiv cs.AI · 7d ago

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing

A hybrid two-stage diffusion transformer architecture enables efficient and accurate instruction-guided audio editing. It uses coarse-to-fine semantic alignment via joint attention at low resolution, followed by refined editing with alternating joint and cross-attention at high resolution. The method achieves better performance on complex editing tasks with improved efficiency and a compact model.

arXiv cs.AI · 7d ago

Sensorimotor World Models for Action-Aligned Perception

A new sensorimotor world model (SMWM) learns compact, action-relevant latent representations from offline trajectories. It uses inverse dynamics regularization to prevent representation collapse and align latent states with controllable environmental degrees of freedom, enabling stable training without complex regularizers or frozen components. SMWM achieves competitive planning performance in 2D and 3D control tasks.

arXiv cs.AI · 7d ago

Dual-Agent Framework for Cross-Model Verified Translation

A dual-agent framework converts natural-language experiment protocols into executable commands for robotic lab platforms. It uses a Parser Agent and a rule-based mapping engine to translate protocols, with a heterogeneous LLM Validation Agent ensuring accuracy and triggering self-correction. The framework successfully enables end-to-end autonomous execution of microplate-based experiments like the Bradford assay.

arXiv cs.AI · 7d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arXiv cs.AI · 7d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models in simulation and shows real-world effectiveness, reducing interactions by 3 turns and increasing exercise conversion rates to 28.1% with a stochastic strategy.

arXiv cs.AI · 7d ago

RACL: Reasoning-Agent Control Layer for Metaheuristic Learning

RACL introduces a reasoning agent that controls metaheuristic search behavior without replacing optimizers or altering constraints. In vehicle routing experiments it improves on or ties the key baseline policies, reducing average cost by 8.337% versus the Fixed policy and 1.605% versus the Stagnation-Triggered policy, with no significant computational overhead.

arXiv cs.AI · 7d ago

BIM-Edit: Benchmarking LLMs for IFC-Based BIM Editing

BIM-Edit introduces a benchmark to evaluate large language models on natural-language editing of Building Information Models in IFC format. It includes 324 editing tasks across 11 real and 36 synthetic building models, assessing geometric accuracy, semantic validity, and topological consistency. The best model achieves an average score of only 49.5%, and no model solves more than 3.4% of tasks, highlighting a significant gap in LLM capabilities for engineering design workflows.

arXiv cs.AI · 7d ago

Essay Quality Representations in LLMs Found to Be Linearly Accessible

A study reveals that essay quality information in large language models is encoded in linearly accessible forms within their hidden representations. These representations emerge layer-by-layer, remain stable across prompts, and show partial transfer across different essay prompts, with longer essays relying more on deeper model layers. The research identifies specific 'essay scoring neurons' whose activation strongly correlates with scores and can be influenced by targeted interventions.

arXiv cs.AI · 7d ago

Hypergraph-Based Semantic Reasoning Framework

A new framework called HISR uses hypergraphs to model complex multi-entity relationships, improving semantic interpretation accuracy by up to 36.6% over existing methods. It enables robust semantic inference under partial information loss by mapping entities and higher-order relations into dedicated semantic subspaces.

arXiv cs.AI · 7d ago

MedRLM: Recursive Multimodal Health Intelligence Framework

MedRLM enables long-context clinical reasoning by recursively inspecting patient data across text, images, sensors, and guidelines. It integrates specialized agents and a Clinical Evidence Graph Memory to connect observations with evidence and referral criteria, supporting sensor-triggered reasoning and uncertainty-gated clinician review.

arXiv cs.AI · 7d ago

RS-Neg Benchmark and NeFo Method for Negation Understanding in Remote Sensing MLLMs

RS-Neg is the first benchmark to evaluate negation comprehension in remote sensing tasks across region-level and scene-level scenarios. It reveals that advanced remote sensing MLLMs struggle with negation, showing hallucinations and performance drops. NeFo, a test-time learning method, improves negation understanding using only 5% unlabeled test data and generalizes well to new tasks.

arXiv cs.AI · 7d ago

HilDA: Hierarchical Distillation with Diffusion for Self-Supervised LiDAR Pretraining

HilDA introduces a self-supervised pretraining framework for LiDAR backbones that uses hierarchical distillation and temporal occupancy diffusion to improve semantic and geometric understanding. It achieves state-of-the-art results on cross-modal distillation benchmarks and outperforms prior methods in 3D object detection, scene flow, and semantic occupancy prediction.

arXiv cs.AI · 7d ago

Introducing Rule Violation Score for Logical Compliance

We introduce the Rule Violation Score (RVS), a metric that evaluates how well predictive models adhere to logical rules. RVS distinguishes between hard and soft rules, works with any relational dataset and model, and can be computed via SQL queries for Horn rules. Evaluation on multiple benchmarks shows that models with similar predictive accuracy can differ greatly in logical compliance, highlighting RVS's ability to reveal behaviors missed by standard metrics.
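
The claim that rule compliance can be checked with SQL queries for Horn rules can be sketched as follows. The summary does not give the exact RVS definition, so this example substitutes a simple stand-in: the fraction of rule-body groundings whose head is missing from the model's predictions. The rule `parent(x, y) AND parent(y, z) -> grandparent(x, z)`, the table names, and the helper functions are all hypothetical.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory relational dataset (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parent (x TEXT, y TEXT)")
    # Table of grandparent facts predicted by some model.
    conn.execute("CREATE TABLE grandparent (x TEXT, y TEXT)")
    conn.executemany("INSERT INTO parent VALUES (?, ?)",
                     [("a", "b"), ("b", "c"), ("b", "d")])
    conn.executemany("INSERT INTO grandparent VALUES (?, ?)", [("a", "c")])
    return conn


def violation_fraction(conn):
    """Fraction of body groundings (x, z) with no matching predicted head."""
    body = conn.execute(
        """SELECT COUNT(*) FROM (
               SELECT DISTINCT p1.x AS x, p2.y AS z
               FROM parent p1 JOIN parent p2 ON p1.y = p2.x
           ) AS groundings"""
    ).fetchone()[0]
    violated = conn.execute(
        """SELECT COUNT(*) FROM (
               SELECT DISTINCT p1.x AS x, p2.y AS z
               FROM parent p1 JOIN parent p2 ON p1.y = p2.x
               WHERE NOT EXISTS (
                   SELECT 1 FROM grandparent g
                   WHERE g.x = p1.x AND g.y = p2.y
               )
           ) AS violations"""
    ).fetchone()[0]
    # With no groundings the rule is vacuously satisfied.
    return violated / body if body else 0.0
```

In the demo data the rule body has two groundings, (a, c) and (a, d), and only (a, c) is predicted, so one of the two is a violation.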