Reasoning models — korshunov.ai — ML news

Reasoning models

arXiv cs.AI · 8d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. When paired with minimal outcome-based GRPO training, it achieves average gains of +7.2 to +6.4 points on seven benchmarks, outperforming prior RL training sets, and enhances agentic performance by +4.8 and +7.0 points on GAIA and BrowseComp respectively.

arXiv cs.AI · 8d ago

TRUST: Target-Confidence Recourse with tSeTlin Machines

TRUST enables users to specify desired prediction confidence when generating counterfactual explanations. By directly optimizing for confidence targets using a Probabilistic Tsetlin Machine and Bayesian optimization, TRUST produces more robust and interpretable recourse than traditional boundary-based methods, achieving perfect robustness with low cost and high confidence on real-world datasets.

arXiv cs.AI · 8d ago

Robot Uses Prior Team Experience to Improve USAR Rescue Success

A robot initialized with a selected prior collaboration pattern improved rescue success from 25.7% to 41.3% in urban search and rescue tasks. This enhancement reduced average task time by 283 seconds, with the greatest benefits observed at the start of interactions, indicating effective early task knowledge transfer through episodic memory.

arXiv cs.AI · 8d ago

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS introduces a new approach that decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. It uses a closed-loop process involving multi-trajectory rollouts and selective reflection to distill reusable strategy principles, achieving strong performance gains and robust transferability across tasks and LLMs.

arXiv cs.AI · 8d ago

WorldLines: Benchmarking Long-Horizon Embodied Agent Memory

WorldLines introduces a project-driven benchmark for long-horizon embodied household assistance, capturing extended household traces with dialogues, actions, and state changes. It enables evidence-linked samples for Memory QA and Embodied Task Planning, and proposes ObsMem, an observer-grounded memory framework that supports visibility-aware memories and state-aware decisions. Experiments highlight challenges in partial observability and memory translation, with ObsMem providing a stronger reference architecture for such settings.

arXiv cs.AI · 8d ago

ImpSH Improves Implicit Hate Speech Detection Across Domains

ImpSH, a triplet-based framework, aligns posts with implied statements and uses context-bounded semi-hard negatives to enhance detection of implicit hate speech. Evaluated on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH outperforms standard supervised contrastive methods in cross-domain settings, showing improved generalizability and stability.
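
As a rough illustration of the triplet idea mentioned above, the sketch below shows standard semi-hard negative mining: a negative is kept when it is farther from the anchor than the positive, but still inside the margin. This is not ImpSH's context-bounded variant, whose details the summary does not give. The function names, the Euclidean distance, and the margin value are all assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def semi_hard_negatives(anchor, positive, candidates, margin=0.5):
    """Keep negatives with d(a, p) < d(a, n) < d(a, p) + margin.

    The margin is a hypothetical value chosen for illustration.
    """
    d_pos = euclidean(anchor, positive)
    return [n for n in candidates if d_pos < euclidean(anchor, n) < d_pos + margin]


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard margin-based triplet loss for a single triplet."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]      # e.g. embedding of a post
    positive = [1.0, 0.0]    # e.g. embedding of its implied statement
    candidates = [[0.5, 0.0], [1.2, 0.0], [3.0, 0.0]]  # hard, semi-hard, easy
    for neg in semi_hard_negatives(anchor, positive, candidates):
        print(neg, triplet_loss(anchor, positive, neg))
```

Semi-hard negatives still produce a non-zero loss, which is why they are a common choice over easy negatives (zero loss) and the hardest negatives (which can destabilize training).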

arXiv cs.AI · 8d ago

KinemaForge: URDF Synthesis from RGB-D Sequences

KinemaForge jointly infers part-level shape, joint topology, and parameters from RGB-D sequences using a kinematic constraint graph and differentiable screw-axis solver. It validates results with an energy-consistent verifier, reducing joint-axis error and simulation drift while improving closed-loop manipulation success by 14.6 percentage points over Ditto.

arXiv cs.AI · 8d ago

Domain-Shift Aware Neural Networks for Unbalance Mass Estimation

A domain-shift aware neural network is proposed for estimating unbalance masses in rotating systems under varying conditions. The model uses maximum mean discrepancy to align feature representations across different operating domains, improving prediction accuracy when system behaviors differ from training conditions. Results show its effectiveness in structural health monitoring applications.
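
To make the alignment term concrete, here is a minimal sketch of a squared maximum mean discrepancy (MMD) estimate between features from two operating domains. It assumes a Gaussian kernel with a fixed bandwidth and uses the biased estimator. The paper's actual kernel, bandwidth, and network are not stated in the summary, so every name and value here is illustrative.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors.

    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]
    """
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


if __name__ == "__main__":
    train_features = [[0.0, 0.0], [1.0, 0.0]]      # hypothetical source-domain features
    shifted_features = [[5.0, 5.0], [6.0, 5.0]]    # hypothetical target-domain features
    print(mmd_squared(train_features, train_features))    # identical sets give zero
    print(mmd_squared(train_features, shifted_features))  # shifted sets give a larger value
```

In a training setup of this kind, a term like `mmd_squared` would typically be added to the regression loss so that the feature extractor is pushed toward representations that look alike across domains.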

arXiv cs.AI · 8d ago

BeliefDiffusion: Generative-Model Predictive Planning for Navigation

BeliefDiffusion combines diffusion models for multimodal belief representation with Model Predictive Control for long-term navigation planning. It outperforms model-free reinforcement learning and other generative methods in navigation success and path efficiency in partially observable environments.

arXiv cs.AI · 8d ago

Skill-Guided Continuation Distillation for GUI Agents

SGCD introduces an iterative framework to improve GUI agents by addressing supervision gaps in off-trajectory states. It extracts skills from both successful and failed rollouts, using them to guide policy continuations that are mixed with expert trajectories. On OSWorld-Verified, SGCD boosts success rates of three base models from the low 30% range to over 50%.

arXiv cs.AI · 8d ago

SAERec: Fine-grained Intent Priors via Sparse Autoencoders

SAERec constructs fine-grained, interpretable intent priors from textual corpora using sparse autoencoders to disentangle intent-related semantics. It retrieves both personal and public intents for users, guides recommendations with human-understandable explanations, and outperforms state-of-the-art models on public datasets.

arXiv cs.AI · 8d ago

LLMs Struggle with Negation in Figurative Language

A study finds that large language models struggle to interpret negation in figurative language. Performance varies significantly based on prompt style, highlighting a key limitation in current models' understanding of complex linguistic structures.

arXiv cs.AI · 8d ago

Decoupling Search from Reasoning in LLM Agents

Decoupled Search Grounding (DSG) separates search functionality from reasoning models, enabling vendor-agnostic, tunable, and reusable search grounding. DSG achieves near-native accuracy on SimpleQA with 91% lower search cost and 99.4% warm-cache hit rate, while reducing latency by 68% and preserving concise output contracts.

arXiv cs.AI · 8d ago

RTSGameBench: An RTS Benchmark for Strategic Reasoning

RTSGameBench addresses limitations in existing RTS benchmarks by offering diverse gameplay, targeted competency diagnosis, and self-evolving scenario generation. It evaluates vision-language models in strategic reasoning under uncertainty, revealing that state-of-the-art models struggle with multiagent coordination and large-scale tasks.

arXiv cs.AI · 8d ago

CADE: Direct Timestep Embedding for Time-Series Question Answering

CADE introduces direct timestep embedding and contrastive alignment to preserve metric structure in time-series data. By mapping each timestep directly into LLM embedding space, it avoids tokenization bottlenecks and outperforms existing LLM baselines on six TSQA tasks.

arXiv cs.AI · 8d ago

ThinkDeception: Interpretable Multimodal Deception Detection Framework

ThinkDeception introduces a progressive reinforcement learning framework that enables interpretable multimodal deception detection. It leverages a step-by-step annotated Chain of Thought dataset and proposes Visual-Audio Consistency Group Relative Policy Optimization with a dynamic curriculum, enhancing reasoning quality and outperforming existing methods on mainstream benchmarks.

arXiv cs.AI · 8d ago

G-IdiomAlign: Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

G-IdiomAlign introduces a gloss-pivoted benchmark using English glosses from Wiktionary to anchor idioms. It includes controlled multiple-choice equivalence and gloss-contrastive generation protocols, showing that glosses improve performance in embedding-based semantic alignment, though results remain modest, indicating significant potential for improvement in cross-lingual idiom generation.

arXiv cs.AI · 8d ago

LSTM-Vision Transformer Improves HRRR Forecast Error Prediction

A hybrid LSTM-Vision Transformer framework enhances prediction of HRRR forecast errors by integrating atmospheric profiles from mesonet profilers. It achieves up to twofold improvement in precipitation error prediction, especially during active planetary boundary layer periods, by better capturing convective error evolution and reducing PBL-related degradation.

arXiv cs.AI · 8d ago

Variability in AI-Generated Software: A New Product-Line Approach

An exploratory analysis of 10 vibe-coded C/C++ projects reveals near-zero in-artifact variability, with all decisions resolved at generation time. The paper proposes Variability by Regeneration (VbR), a product-line approach where an LLM acts as a derivation engine, generating tailored binaries from declarative specifications, with a variant dispatcher routing user requests to the correct binary. VbR shifts variability into specifications, not code, offering a new paradigm for SPL engineering.

arXiv cs.AI · 8d ago

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

RODS addresses sample depletion in multi-turn tool-use RL by using reward variance to detect capability boundaries. It synthesizes new data in real time, matching structural complexity of boundary samples, and maintains a dynamic replay buffer that co-evolves with the policy. RODS achieves performance comparable to a 17K-sample offline pipeline with 20x fewer trajectories.
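
One way to read "reward variance as a boundary detector" is sketched below. A task the policy always solves or always fails has zero reward variance across rollouts, while a task with mixed outcomes sits near the capability boundary. This is only an interpretation of the summary, not RODS's actual procedure. The function name, the data layout, and the `min_variance` threshold are hypothetical.

```python
from statistics import pvariance


def boundary_tasks(rollout_rewards, min_variance=0.1):
    """Return task ids whose rewards vary enough across rollouts.

    rollout_rewards maps a task id to the list of rewards from repeated rollouts.
    min_variance is a hypothetical threshold chosen for illustration.
    """
    return [
        task
        for task, rewards in rollout_rewards.items()
        if len(rewards) > 1 and pvariance(rewards) >= min_variance
    ]


if __name__ == "__main__":
    rewards = {
        "always_solved": [1.0, 1.0, 1.0, 1.0],  # no learning signal left
        "always_failed": [0.0, 0.0, 0.0, 0.0],  # currently out of reach
        "mixed": [1.0, 0.0, 1.0, 0.0],          # near the capability boundary
    }
    print(boundary_tasks(rewards))
```

Under this reading, the selected tasks would serve as templates for synthesizing new samples of similar structural complexity.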