Reasoning models — korshunov.ai — ML news

Reasoning models

arXiv cs.AI · 8d ago

FPRM: Fixed-Point Reasoning Model with Adaptive Compute

FPRM is a Transformer-based model that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. It adapts compute to task difficulty by leveraging fixed-point reasoning, outperforming baseline models on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks.
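As a rough illustration of fixed-point halting in a looped model, the sketch below applies one shared update repeatedly and stops once successive states stop changing. It is a toy under stated assumptions, not FPRM's actual method: the function names, the max-norm tolerance test, and the scalar contraction map standing in for the Transformer block are all hypothetical.

```python
def iterate_to_fixed_point(step, state, tol=1e-6, max_iters=100):
    """Apply `step` until successive states differ by less than `tol`.

    Returns the final state and the number of iterations used.
    """
    for n in range(1, max_iters + 1):
        new_state = step(state)
        # Max-norm change between successive states (assumed halting test).
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            return state, n
    return state, max_iters


def make_step(rate):
    """Toy stand-in for a shared block: the contraction x -> rate * x + 1."""
    return lambda s: [rate * x + 1.0 for x in s]


# A strongly contracting map ("easy" input) halts sooner than a weak one.
easy_state, easy_iters = iterate_to_fixed_point(make_step(0.1), [0.0, 0.0])
hard_state, hard_iters = iterate_to_fixed_point(make_step(0.5), [0.0, 0.0])
print(easy_iters, hard_iters)
```

The number of iterations is decided by the convergence test rather than fixed in advance, which is the sense in which compute adapts to difficulty in this toy.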

arXiv cs.AI · 8d ago

Looped World Models Achieve 100x Parameter Efficiency

Looped World Models (LoopWM) introduce a looped architecture that iteratively refines latent environment states using a parameter-shared transformer. This approach achieves up to 100x parameter efficiency over conventional world models by adapting computation depth to each prediction's complexity.

arXiv cs.AI · 8d ago

Learning Red Agent Policy from Observations for Neurosymbolic Cyber Agents

An imitation-learning technique is proposed to predict red agent actions in partially observable cyber environments. The method learns red agent policies from network observations and defender actions, enabling neurosymbolic cyber-defense agents to accurately predict attacks and adapt defenses in diverse simulated scenarios.

arXiv cs.AI · 8d ago

EvolveNav: Self-Evolving Memory for Zero-Shot Navigation

EvolveNav introduces a self-evolving framework for zero-shot object-goal navigation that improves during test time. It uses a rule memory derived from past trajectories and a confidence-based retrieval strategy to select effective actions, reducing redundant exploration. The method achieves a 10.1% higher success rate than existing baselines with fewer unnecessary steps.

arXiv cs.CL · 8d ago

Negative Token Filtering for Stable Single-Rollout RL

A new approach called negative token filtering enables stable single-rollout training by preventing false penalties on negative samples. The method improves performance on agentic tasks compared to group-based RL techniques, while matching group-based methods on reasoning tasks.

arXiv cs.CL · 8d ago

Soft Prompting for Language Adherence in Multimodal LLMs

A soft prompting approach is proposed to improve language adherence in multimodal LLMs without strict output constraints. The method introduces a new metric to quantify language violations and evaluates three strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning. Results show that it reduces language violations while preserving automatic speech recognition (ASR) performance across multiple languages, and the trade-offs are examined under different compute constraints.

arXiv cs.CL · 8d ago

Can Language Models Discover Zero?

Language models of GPT-2 size cannot independently discover zero during testing, regardless of pretraining. However, performance improves significantly with training on tens to hundreds of zero examples, and language pretraining reduces required examples by about 50%.

arXiv cs.CL · 8d ago

Word2Vec's Performance in Toki Pona's Minimal Vocabulary

This study evaluates Word2Vec's ability to capture semantic relationships in Toki Pona, a language with only 130 words. Using 1.4 million sentences, it finds that non-core tokens do not disrupt embedding structure and may actually bring similar words closer in vector space. The results show Word2Vec's effectiveness relies more on distributional patterns than vocabulary size, even at extreme lexical reduction.

arXiv cs.CL · 8d ago

SpeechDx: Multi-Task Benchmark for Clinical Speech AI

SpeechDx introduces a large-scale benchmark with 12 datasets and 27 tasks across diverse health conditions. It evaluates models by speech production stages and reveals that large-scale models perform best, while domain-specific models show limited generalization across clinical conditions.

arXiv cs.CL · 8d ago

LLM-Generated Stories Show Low Diversity

Large language models produce narratives that are more similar to each other than human-written stories. Frontier models converge on a generic narrative pattern, lacking the diversity found in human-authored stories. Common techniques like negative prompting and temperature scaling do not significantly reduce this homogeneity.

arXiv cs.CL · 8d ago

Operationalizing Ontology for Untranslatability in NLP

A new ontology and taxonomy of compensation strategies for untranslatable cases are introduced, enabling controlled analysis of machine translation. A multilingual dataset pairs untranslatable sentences with strategy-based translations, showing human preference for outputs that include explanatory context, known as the Annotation compensation strategy.

arXiv cs.CL · 8d ago

Implicit vs. Explicit Prompting in LVLMs for Referential Communication

Two studies show conflicting results on LVLMs' ability to coordinate efficient referring expressions. Explicit prompting enables models to achieve efficient communication, but implicit prompting fails to trigger this behavior, revealing fundamental differences in human-AI communication.

arXiv cs.CL · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
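One common way to score self-consistency is to sample several reasoning paths and measure how many agree on the final answer. The sketch below assumes that majority-agreement definition; the paper's exact measure may differ, and the function name and sample answers are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return the majority answer and the fraction of paths that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Four sampled reasoning paths for one question; three agree.
answer, agreement = self_consistency(["left", "left", "right", "left"])
print(answer, agreement)
```

A low agreement fraction would flag the answer as less reliable, independently of where the model's visual attention falls.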

arXiv cs.CL · 8d ago

NarrativeWorldBench and N-VSSM for Long-Horizon Audio Drama

NarrativeWorldBench evaluates 21 LLMs on nine narrative-structure metrics across horizons of 10 to 200 episodes, with cross-lingual support in Hindi, Tamil, Telugu, and Marathi. N-VSSM, a latent world model using Mamba-2, achieves plot-beat F1 of at least 0.84 across all horizons with 4x lower compute than closed-frontier models and outperforms Claude Opus 4.5 in long-arc consistency and controllability in a professional writer study.
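For readers unfamiliar with F1 as a metric, the sketch below computes it over sets of plot beats. How NarrativeWorldBench actually matches predicted beats to reference beats is not stated in the summary, so exact set membership is assumed here, and the beat labels are made up for illustration.

```python
def beat_f1(predicted, reference):
    """F1 between predicted and reference beats, matched by exact label (assumed)."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical beats: two of three predictions appear among four references.
score = beat_f1(
    {"inciting_incident", "betrayal", "reunion"},
    {"betrayal", "reunion", "climax", "resolution"},
)
print(round(score, 3))
```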

arXiv cs.CL · 8d ago

MODE-RAG: Evaluating and Reducing Hallucinations in M-RAG

MODE-RAG proposes a multi-agent system using Variational Free Energy to dynamically gate interventions and reduce cross-modal hallucinations in retrieval-augmented generation. It integrates Monte Carlo Tree Search and logit perturbations to address causal fabrications and sycophancy, with dedicated agents ensuring factual verification and formatting stability. Evaluated via ModeVent, a subset of MultiVent, the system significantly improves robustness against logical fabrications.

arXiv cs.CL · 8d ago

AIPatient Arena: EHR-grounded evaluation of LLMs in clinical workflows

AIPatient Arena evaluates large language models in end-to-end clinical consultations using EHR-grounded patient-specific knowledge graphs. It assesses LLMs across eight clinical competence dimensions, revealing strong performance in interview skills, ethics, and explanation clarity, but persistent weaknesses in handling ambiguity, information coverage, and diagnostic reasoning, with process failures like repetitive questioning and omitted history.

arXiv cs.CL · 8d ago

STATEWITNESS: Activation Explainer for Deception Auditing in LLMs

STATEWITNESS introduces an activation explainer that audits deception in reasoning LLMs by reading hidden states and generating natural-language answers or structured reports. It achieves a 0.916 mean AUROC, outperforming existing black-box monitors and activation probes by 11.6% and 25.0%, respectively, and provides query-level, schema, and evidence-level traces for human inspection.
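AUROC can be read as the probability that a randomly chosen positive example (here, a deceptive case) receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, counting ties as half. It illustrates the metric only, says nothing about how STATEWITNESS produces its scores, and uses invented scores and labels.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical monitor scores; label 1 marks a deceptive example.
value = auroc([0.9, 0.4, 0.7, 0.2], [1, 1, 0, 0])
print(value)
```

The pairwise form is quadratic in the number of examples, which is fine for a small illustration; rank-based formulas are the usual choice at scale.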

arXiv cs.CL · 8d ago

Second-Order Bias in LLMs: Evaluating Judgment-Based Bias

A new study identifies second-order bias in large language models: social bias in their judgments about biased content. Using entitlement epistemology, the research develops a reasoning task to assess whether LLMs accept or reject biased texts based on demographics, revealing implicit biases that vary by target group and evade safety guardrails. The work introduces two metrics to quantify these biases and calls for more theoretically grounded evaluation methods in NLP.

arXiv cs.CL · 8d ago

LLMs Outperform Humans in Next Speaker Prediction

Large language models outperformed humans and supervised models in next speaker prediction using the AMI corpus, despite lacking audio-visual data and domain training. Multimodal LLMs surpassed text-based LLMs in addressee and turn-change detection but still fell short of human performance, highlighting challenges in utilizing raw audio-visual signals. Ablation studies show conversational context is crucial, especially for next speaker prediction, with both humans and LLMs struggling during frequent turn changes.

arXiv cs.CL · 8d ago

Expressivity Analysis of Hierarchical Modelling in Deep Transformers

This paper analyzes deep transformer expressiveness using bounded-depth grammars. It constructs transformers with positional attention where model depth scales linearly with grammar depth, and neuron count grows quadratically with production rules. The results support the linear representation hypothesis by showing these models can encode abstract grammatical states in low-dimensional, linearly separable subspaces.
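The two scaling claims can be written compactly as below. The symbols are assumptions introduced here for illustration and are not taken from the paper: $L$ is the transformer depth, $D$ the grammar depth, $N$ the neuron count, and $R$ the set of production rules. Constants and any dependence on other quantities are left unspecified.

```latex
L = O(D), \qquad N = O\!\left(|R|^{2}\right)
```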