Reasoning models — korshunov.ai

DRFLOW: Benchmark for Personalized Workflow Prediction

DRFLOW introduces a benchmark to evaluate agents' ability to predict personalized workflows from heterogeneous sources. It includes 100 tasks across five domains, grounded in 3,900 sources and featuring 1,246 reference workflow steps. DRFLOW-Agent achieves up to 10.02% F1 improvement over baselines, yet significant challenges remain in accurate workflow prediction.

arXiv cs.AI · 8d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap with Common Crawl.

arXiv cs.AI · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context degradation detection, and yields model performance gains of up to 66% on HealthBench.

arXiv cs.AI · 8d ago

FPRM: Fixed-Point Reasoning Model with Adaptive Compute

FPRM is a Transformer-based model that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. It adapts compute to task difficulty by leveraging fixed-point reasoning, outperforming baseline models on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks.

arXiv cs.AI · 8d ago
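
The FPRM summary describes halting a looped model once its state reaches a fixed point. A minimal sketch of that general idea follows, assuming a toy contraction map in place of the shared Transformer block; `make_step`, `run_until_fixed_point`, the tolerance, and the rates are hypothetical illustrations, not the paper's implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def max_abs_diff(a: Vector, b: Vector) -> float:
    """Largest per-coordinate change between two states."""
    return max(abs(x - y) for x, y in zip(a, b))


def run_until_fixed_point(
    step: Callable[[Vector], Vector],
    state: Vector,
    tol: float = 1e-6,
    max_steps: int = 1000,
) -> Tuple[Vector, int]:
    """Apply the same step repeatedly; halt when the state stops changing."""
    for n in range(1, max_steps + 1):
        new_state = step(state)
        if max_abs_diff(new_state, state) < tol:
            return new_state, n  # converged: number of loops actually used
        state = new_state
    return state, max_steps  # safety cap reached without convergence


def make_step(rate: float, target: Vector) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a shared block: move a fraction toward target."""
    def step(state: Vector) -> Vector:
        return [s + rate * (t - s) for s, t in zip(state, target)]
    return step


# A fast-contracting "easy" input halts after fewer loops than a slow "hard" one.
easy_state, easy_steps = run_until_fixed_point(make_step(0.9, [1.0, -2.0]), [0.0, 0.0])
hard_state, hard_steps = run_until_fixed_point(make_step(0.1, [1.0, -2.0]), [0.0, 0.0])
print(easy_steps, hard_steps)
```

The loop count is not set in advance: it falls out of how quickly the update converges, which is the sense in which compute adapts to difficulty in this sketch.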

Looped World Models Achieve 100x Parameter Efficiency

Looped World Models (LoopWM) introduce a looped architecture that iteratively refines latent environment states using a parameter-shared transformer. This approach achieves up to 100x parameter efficiency over conventional world models by adapting computation depth to each prediction's complexity.

arXiv cs.AI · 8d ago

Learning Red Agent Policy from Observations for Neurosymbolic Cyber Agents

A policy learning technique using imitation learning is proposed to predict red agent actions in partially observable cyber environments. The method learns red agent policies from network observations and defender actions, enabling neurosymbolic cyber-defense agents to accurately predict attacks and adapt defenses in diverse simulated scenarios.

arXiv cs.AI · 8d ago

EvolveNav: Self-Evolving Memory for Zero-Shot Navigation

EvolveNav introduces a self-evolving framework for zero-shot object-goal navigation that improves during test time. It uses a rule memory derived from past trajectories and a confidence-based retrieval strategy to select effective actions, reducing redundant exploration. The method achieves a 10.1% higher success rate than existing baselines with fewer unnecessary steps.

arXiv cs.CL · 8d ago

Negative Token Filtering for Stable Single-Rollout RL

A new approach called negative token filtering enables stable single-rollout training by preventing false penalties on negative samples. The method improves performance on agentic tasks compared to group-based RL techniques, while matching group-based methods on reasoning tasks.

arXiv cs.CL · 8d ago

Soft Prompting for Language Adherence in Multimodal LLMs

A soft prompting approach is proposed to improve language adherence in multimodal LLMs without strict output constraints. The method introduces a new metric to quantify language violations and evaluates three strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning. Results show effectiveness in reducing language violations while preserving ASR performance across multiple languages, with trade-offs considered under different compute constraints.

arXiv cs.CL · 8d ago

Can Language Models Discover Zero?

Language models of GPT-2 size cannot independently discover zero during testing, regardless of pretraining. However, performance improves significantly with training on tens to hundreds of zero examples, and language pretraining reduces required examples by about 50%.

arXiv cs.CL · 8d ago

Word2Vec's Performance in Toki Pona's Minimal Vocabulary

This study evaluates Word2Vec's ability to capture semantic relationships in Toki Pona, a language with only 130 words. Using 1.4 million sentences, it finds that non-core tokens do not disrupt embedding structure and may actually bring similar words closer in vector space. The results show Word2Vec's effectiveness relies more on distributional patterns than vocabulary size, even at extreme lexical reduction.

arXiv cs.CL · 8d ago

SpeechDx: Multi-Task Benchmark for Clinical Speech AI

SpeechDx introduces a large-scale benchmark with 12 datasets and 27 tasks across diverse health conditions. It evaluates models by speech production stages and reveals that large-scale models perform best, while domain-specific models show limited generalization across clinical conditions.

arXiv cs.CL · 8d ago

LLM-Generated Stories Show Low Diversity

Large language models produce narratives that are more similar to each other than human-written stories. Frontier models converge on a generic narrative pattern, lacking the diversity found in human-authored stories. Common techniques like negative prompting and temperature scaling do not significantly reduce this homogeneity.

arXiv cs.CL · 8d ago

Operationalizing Ontology for Untranslatability in NLP

A new ontology and taxonomy of compensation strategies for untranslatable cases are introduced, enabling controlled analysis of machine translation. A multilingual dataset pairs untranslatable sentences with strategy-based translations, showing human preference for outputs that include explanatory context, known as the Annotation compensation strategy.

arXiv cs.CL · 8d ago

Implicit vs. Explicit Prompting in LVLMs for Referential Communication

Two studies show conflicting results on LVLMs' ability to coordinate efficient referring expressions. Explicit prompting enables models to achieve efficient communication, but implicit prompting fails to trigger this behavior, revealing fundamental differences in human-AI communication.

arXiv cs.CL · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.

arXiv cs.CL · 8d ago
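
The summary above points to self-consistency across reasoning paths as a reliability signal. One common way to measure it is the share of sampled answers that agree with the majority answer. The sketch below assumes that definition; the function name, the normalization, and the sample answers are hypothetical and may differ from the study's own measure.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Assumption: light normalization so trivially different strings count as equal.
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


# Hypothetical final answers from five independently sampled reasoning paths.
sampled = ["Left", "left", "right", "left ", "left"]
answer, agreement = self_consistency(sampled)
print(answer, agreement)
```

A high agreement score would be read as higher confidence in the majority answer. The score uses only the sampled outputs and no attention maps.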

NarrativeWorldBench and N-VSSM for Long-Horizon Audio Drama

NarrativeWorldBench evaluates 21 LLMs on nine narrative-structure metrics across horizons of 10 to 200 episodes, with cross-lingual support in Hindi, Tamil, Telugu, and Marathi. N-VSSM, a latent world model using Mamba-2, achieves plot-beat F1 of at least 0.84 across all horizons with 4x lower compute than closed frontier models. In a professional writer study, it also outperforms Claude Opus 4.5 in long-arc consistency and controllability.

arXiv cs.CL · 8d ago

MODE-RAG: Evaluating and Reducing Hallucinations in M-RAG

MODE-RAG proposes a multi-agent system using Variational Free Energy to dynamically gate interventions and reduce cross-modal hallucinations in retrieval-augmented generation. It integrates Monte Carlo Tree Search and logit perturbations to address causal fabrications and sycophancy, with dedicated agents ensuring factual verification and formatting stability. Evaluated via ModeVent, a subset of MultiVent, the system significantly improves robustness against logical fabrications.

arXiv cs.CL · 8d ago

AIPatient Arena: EHR-grounded evaluation of LLMs in clinical workflows

AIPatient Arena evaluates large language models in end-to-end clinical consultations using EHR-grounded patient-specific knowledge graphs. It assesses LLMs across eight clinical competence dimensions, revealing strong performance in interview skills, ethics, and explanation clarity, but persistent weaknesses in handling ambiguity, information coverage, and diagnostic reasoning. Observed process failures include repetitive questioning and omitted history.

arXiv cs.CL · 8d ago

STATEWITNESS: Activation Explainer for Deception Auditing in LLMs

STATEWITNESS introduces an activation explainer that audits deception in reasoning LLMs by reading hidden states and generating natural-language answers or structured reports. It achieves a 0.916 mean AUROC, outperforming existing black-box monitors and activation probes by 11.6% and 25.0% respectively, and provides query-level, schema, and evidence-level traces for human inspection.
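
For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive example (here, a deceptive one) receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison and averages it over evaluation sets. The labels, scores, and the choice to average over sets are illustrative assumptions, not data or protocol from the paper.

```python
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def mean_auroc(tasks: List[Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of per-set AUROC values."""
    return sum(auroc(y, s) for y, s in tasks) / len(tasks)


# Hypothetical monitor scores on two small evaluation sets (1 = deceptive).
set_a = ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
set_b = ([1, 0, 1, 0], [0.8, 0.2, 0.7, 0.3])
print(auroc(*set_a), auroc(*set_b), mean_auroc([set_a, set_b]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.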