Reasoning models — korshunov.ai

Measurement Gap in EU Law Automation

Large language models can produce median-quality legal text, but no benchmark evaluates their ability to perform doctrinal legal reasoning. Without such an evaluation, the EU AI Act's requirement of 'appropriate accuracy' for judicial AI cannot be meaningfully assessed.

arXiv cs.AI · 9d ago

LEADS: Agentic Discovery of Hybrid Models for Cardiac Electrophysiology

LEADS proposes a framework that uses an LLM agent to discover hybrid cardiac electrophysiology models through an iterative reasoning-and-action loop. It formulates domain knowledge as a structured action space, enabling physically grounded, interpretable, and numerically stable model designs, outperforming both human-designed and other LLM-based approaches on synthetic and real cardiac data.

arXiv cs.AI · 9d ago

ReAge3D: Realistic 3D Face Re-Aging with View Consistency

ReAge3D introduces a framework for realistic and identity-preserving 3D face re-aging. It uses a 2D diffusion model and center-out editing to ensure multi-view consistency, preserving fine age-related details through masked diffusion and view reconstruction.

arXiv cs.AI · 9d ago

Kolmogorov Regression for Robust Diffusion Policies

A backward Kolmogorov equation lifts diffusion policies to a Cameron-Martin space, replacing stochastic score matching with a deterministic PDE. This approach achieves convergence bounds tied to kernel effective rank, improved trajectory regularity, and a failure detector without rewards, showing 17% higher reward and 67.6% reduced drift on PushT, and 28.4% lower RMSE with perfect bottleneck detection on a manufacturing line. Hamilton-Jacobi theory reduces deadlock events by 96% in simulations.

arXiv cs.AI · 9d ago

DRFLOW: Benchmark for Personalized Workflow Prediction

DRFLOW introduces a benchmark to evaluate agents' ability to predict personalized workflows from heterogeneous sources. It includes 100 tasks across five domains, grounded in 3,900 sources and featuring 1,246 reference workflow steps. DRFLOW-Agent achieves up to 10.02% F1 improvement over baselines, yet significant challenges remain in accurate workflow prediction.

arXiv cs.AI · 9d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap with Common Crawl.

arXiv cs.AI · 9d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context degradation detection, and yields model performance gains of up to 66% on HealthBench.

arXiv cs.AI · 9d ago

FPRM: Fixed-Point Reasoning Model with Adaptive Compute

FPRM is a Transformer-based model that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. It adapts compute to task difficulty by leveraging fixed-point reasoning, outperforming baseline models on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks.

arXiv cs.AI · 9d ago
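
The idea of halting on fixed-point convergence, as described in the FPRM summary above, can be illustrated with a toy loop. This is only a sketch under stated assumptions, not FPRM's actual architecture: `iterate_to_fixed_point`, `make_step`, the tolerance, and the scalar "state" are all hypothetical stand-ins for a looped Transformer block and its hidden state. It shows how the same stopping rule spends more iterations on a slowly converging ("harder") update than on a fast one.

```python
def iterate_to_fixed_point(step, state, tol=1e-6, max_iters=500):
    """Apply `step` repeatedly until successive states differ by less than `tol`.

    Returns (final_state, iterations_used, converged).
    """
    for i in range(1, max_iters + 1):
        new_state = step(state)
        # Largest per-coordinate change between successive iterates.
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            return state, i, True  # halt: the state has (numerically) stopped changing
    return state, max_iters, False  # compute budget exhausted without convergence


def make_step(rate, target=2.0):
    """Hypothetical update: a contraction whose fixed point is `target`.

    A `rate` closer to 1 converges more slowly, mimicking a harder input.
    """
    return lambda s: [rate * x + (1.0 - rate) * target for x in s]


easy_state, easy_iters, easy_ok = iterate_to_fixed_point(make_step(0.5), [0.0])
hard_state, hard_iters, hard_ok = iterate_to_fixed_point(make_step(0.9), [0.0])
print(easy_iters, hard_iters)  # the slower contraction uses more iterations
```

The design point is that no separate halting head is needed in this toy version: the stopping signal is the size of the state change itself.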

Looped World Models Achieve 100x Parameter Efficiency

Looped World Models (LoopWM) introduce a looped architecture that iteratively refines latent environment states using a parameter-shared transformer. This approach achieves up to 100x parameter efficiency over conventional world models by adapting computation depth to each prediction's complexity.

arXiv cs.AI · 9d ago

Learning Red Agent Policy from Observations for Neurosymbolic Cyber Agents

A policy learning technique using imitation learning is proposed to predict red agent actions in partially observable cyber environments. The method learns red agent policies from network observations and defender actions, enabling neurosymbolic cyber-defense agents to accurately predict attacks and adapt defenses in diverse simulated scenarios.

arXiv cs.AI · 9d ago

EvolveNav: Self-Evolving Memory for Zero-Shot Navigation

EvolveNav introduces a self-evolving framework for zero-shot object-goal navigation that improves during test time. It uses a rule memory derived from past trajectories and a confidence-based retrieval strategy to select effective actions, reducing redundant exploration. The method achieves a 10.1% higher success rate than existing baselines with fewer unnecessary steps.

arXiv cs.CL · 9d ago

Negative Token Filtering for Stable Single-Rollout RL

A new approach called negative token filtering enables stable single-rollout training by preventing false penalties on negative samples. The method improves performance on agentic tasks compared to group-based RL techniques, while matching group-based methods on reasoning tasks.

arXiv cs.CL · 9d ago

Soft Prompting for Language Adherence in Multimodal LLMs

A soft prompting approach is proposed to improve language adherence in multimodal LLMs without strict output constraints. The method introduces a new metric to quantify language violations and evaluates three strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning. Results show effectiveness in reducing language violations while preserving ASR performance across multiple languages, with trade-offs considered under different compute constraints.

arXiv cs.CL · 9d ago

Can Language Models Discover Zero?

Language models of GPT-2 size cannot independently discover zero during testing, regardless of pretraining. However, performance improves significantly with training on tens to hundreds of zero examples, and language pretraining reduces required examples by about 50%.

arXiv cs.CL · 9d ago

Word2Vec's Performance in Toki Pona's Minimal Vocabulary

This study evaluates Word2Vec's ability to capture semantic relationships in Toki Pona, a language with only 130 words. Using 1.4 million sentences, it finds that non-core tokens do not disrupt embedding structure and may actually bring similar words closer in vector space. The results show Word2Vec's effectiveness relies more on distributional patterns than vocabulary size, even at extreme lexical reduction.

arXiv cs.CL · 9d ago

SpeechDx: Multi-Task Benchmark for Clinical Speech AI

SpeechDx introduces a large-scale benchmark with 12 datasets and 27 tasks across diverse health conditions. It evaluates models by speech production stages and reveals that large-scale models perform best, while domain-specific models show limited generalization across clinical conditions.

arXiv cs.CL · 9d ago

LLM-Generated Stories Show Low Diversity

Large language models produce narratives that are more similar to each other than human-written stories. Frontier models converge on a generic narrative pattern, lacking the diversity found in human-authored stories. Common techniques like negative prompting and temperature scaling do not significantly reduce this homogeneity.

arXiv cs.CL · 9d ago

Operationalizing Ontology for Untranslatability in NLP

A new ontology and taxonomy of compensation strategies for untranslatable cases are introduced, enabling controlled analysis of machine translation. A multilingual dataset pairs untranslatable sentences with strategy-based translations, showing human preference for outputs that include explanatory context, known as the Annotation compensation strategy.

arXiv cs.CL · 9d ago

Implicit vs. Explicit Prompting in LVLMs for Referential Communication

Two studies show conflicting results on LVLMs' ability to coordinate efficient referring expressions. Explicit prompting enables models to achieve efficient communication, but implicit prompting fails to trigger this behavior, revealing fundamental differences in human-AI communication.

arXiv cs.CL · 9d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
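
Self-consistency across reasoning paths, mentioned in the summary above, is commonly measured as the share of sampled answers that agree with the majority answer. The sketch below uses that common majority-vote formulation; the summary does not say which measure the study itself uses, and `self_consistency` and the sample answers are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement) for a list of sampled final answers.

    `agreement` is the fraction of samples matching the majority answer,
    after trimming whitespace and lowercasing.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers from four sampled reasoning paths for one question.
answer, agreement = self_consistency(["Left", "left ", "right", "left"])
print(answer, agreement)  # left 0.75
```

A low agreement score would flag an answer as less reliable without inspecting any attention map.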