Reasoning models — korshunov.ai

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

The OPERA framework addresses the instability of applying reinforcement learning to open-ended tasks by replacing external judge models with intrinsic rewards derived from perplexity dynamics. The reward quantifies uncertainty reduction at critical reflective states, which avoids the stylistic biases and positional inconsistencies common in LLM-as-a-judge systems. During the cold-start phase, the method uses guiding words to synthesize diverse reasoning traces and perplexity-prioritized rollouts to identify logically consistent branches. This pipeline yields a dataset of 20,000 high-quality reasoning trajectories for training. Applied to Qwen3-8B, OPERA sets a new state of the art among open-source models and matches or surpasses proprietary models such as Gemini2.5 and MiniMax-M2.5 on specific open-ended tasks. Empirical evaluations support the scalability and efficacy of this objective, perplexity-based alignment strategy.
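
As a rough illustration of a perplexity-based intrinsic reward, the sketch below computes perplexity from per-token log-probabilities and scores a reflective step by how much it lowers the perplexity of a reference continuation. The function names and the log-ratio form of the reward are assumptions made for illustration; the summary does not specify OPERA's exact reward.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def uncertainty_reduction_reward(logprobs_before, logprobs_after):
    """Hypothetical intrinsic reward: log-ratio of perplexity before vs. after
    a reflective step. Positive values mean the step reduced uncertainty."""
    return math.log(perplexity(logprobs_before)) - math.log(perplexity(logprobs_after))


# Example: the same 4-token continuation scored before and after a reflection.
before = [math.log(0.25)] * 4  # each token had probability 0.25
after = [math.log(0.5)] * 4    # each token now has probability 0.5
reward = uncertainty_reduction_reward(before, after)
```

Because the reward depends only on the policy's own token probabilities, no external judge model is needed, which is the property the summary emphasizes.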

arXiv cs.AI · 11h ago

LLMs Use Difference-Making Logic to Learn Causal Structure

Large language models learn causal structure through a difference-making logic, akin to the experimental method. This approach identifies which word sequences influence outcomes and which do not, using vast text data during training. Architectural features like token embeddings and self-attention support this inductive process by detecting patterns of variation and indifference in language.

arXiv cs.AI · 14h ago

Gazer: Training-Free Semantic Correction for Autoregressive Visual Models

Gazer introduces a training-free framework that uses multimodal large language model feedback to correct semantic errors in real time during autoregressive visual model generation. By integrating reflective diagnosis and semantic correction stages, Gazer improves compositional accuracy and semantic alignment across multiple models without additional training.

arXiv cs.AI · 15h ago

Multimodal Chain-of-Thought: Capabilities and Limitations

Multimodal Chain-of-Thought reasoning improves performance in mathematical and scientific reasoning but harms visual grounding and object counting in perception tasks. Models exhibit a 'Look Light, Think Heavy' pattern, where visual reflection diminishes while verbal reasoning increases, indicating a persistent bottleneck in visual introspection during multimodal reasoning.

arXiv cs.AI · 16h ago

PaperClaw: Autonomous Research with Human-in-the-Loop Refinement

PaperClaw is a multi-agent system that autonomously conducts research from field selection to paper publication. It uses a validated, iterative propose-test-reflect loop, grounded in real references and runnable results, and supports human-in-the-loop refinement at any stage. Evaluation shows it produces strong papers both autonomously and with human oversight.

arXiv cs.LG · 16h ago

Topological Neural Dynamics: Neuron-wise Sequence Modeling

Topological Neural Dynamics (TND) introduces a neuron-wise framework for sequence modeling, where each neuron evolves independently through a directed graph structure. In a single-player Pong behavior cloning task, TND achieves a mean of 17.47 consecutive catches per round, more than three times the score of any baseline model.

arXiv cs.LG · 16h ago

TASER: Task-Differentiated Skill Expansion for Heterogeneous Continual Learning

TASER introduces a framework that dynamically expands and routes atomic skills for continual learning across highly heterogeneous tasks. It reduces catastrophic forgetting and improves plasticity by ensuring semantic distinctness and efficient capacity allocation through skill detection and routing mechanisms. Evaluated on HeteroCLBench, a benchmark with 19 diverse tasks across 9 cognitive dimensions, TASER outperforms existing baselines.

arXiv cs.LG · 16h ago

NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

NASDAQ addresses low-dimensional observation challenges in reinforcement learning by normalizing observation spaces to balance reconstruction losses. It integrates value learning with short-term value and next observation prediction, achieving competitive or superior performance with less training time across domains.
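
To illustrate why normalization can balance reconstruction losses, the sketch below standardizes each observation dimension so that a per-dimension squared error is measured on a comparable scale. The class name and the use of plain batch statistics are assumptions for illustration, not the paper's procedure.

```python
import numpy as np


class ObservationNormalizer:
    """Hypothetical per-dimension standardizer for low-dimensional observations."""

    def __init__(self, eps=1e-8):
        self.eps = eps
        self.mean = None
        self.std = None

    def fit(self, observations):
        obs = np.asarray(observations, dtype=float)
        self.mean = obs.mean(axis=0)
        self.std = obs.std(axis=0) + self.eps  # eps guards constant dimensions
        return self

    def transform(self, observations):
        return (np.asarray(observations, dtype=float) - self.mean) / self.std


def per_dimension_mse(prediction, target):
    """Mean squared error computed separately for each observation dimension."""
    diff = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    return (diff ** 2).mean(axis=0)


# Two dimensions with very different scales, e.g. an angle and a raw position.
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 2)) * np.array([1.0, 1000.0])
normalized = ObservationNormalizer().fit(raw).transform(raw)
```

On the raw data, an error in the second dimension would dominate a summed reconstruction loss; after standardization both dimensions contribute on the same scale.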

arXiv cs.LG · 16h ago

Diagnostics for MORL Policy Selection

The paper proposes a diagnostic workflow to reveal behavioral variation in multi-objective reinforcement learning policies. The method highlights differences in policy trajectories beyond expected returns, offering quantitative and visual tools for policy inspection. Validated on grid worlds and scaled to continuous control tasks, it captures behavioral diversity under increasing complexity.

arXiv cs.LG · 17h ago

Ramanujan Graph Rewiring Alleviates GNN Over-Squashing

Ramanujan Propagation uses Ramanujan graphs to reduce over-squashing in Graph Neural Networks by ensuring non-negative resistance curvature. The method preserves local connectivity while enabling efficient long-range information flow, outperforming nine state-of-the-art rewiring techniques.
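
For readers unfamiliar with the term, a connected d-regular graph is Ramanujan when every adjacency eigenvalue λ with |λ| ≠ d satisfies |λ| ≤ 2√(d−1). The sketch below checks that spectral condition with NumPy. The helper names are hypothetical, connectivity is assumed rather than verified, and this is only the definition, not the paper's rewiring procedure.

```python
import numpy as np


def is_ramanujan(adjacency, tol=1e-8):
    """Check the Ramanujan bound for a connected, regular, undirected graph."""
    a = np.asarray(adjacency, dtype=float)
    if not np.allclose(a, a.T):
        raise ValueError("adjacency matrix must be symmetric")
    degrees = a.sum(axis=1)
    d = degrees[0]
    if not np.allclose(degrees, d):
        raise ValueError("graph must be regular")
    eigenvalues = np.linalg.eigvalsh(a)
    # Discard the trivial eigenvalues +d and (for bipartite graphs) -d.
    nontrivial = eigenvalues[np.abs(np.abs(eigenvalues) - d) > tol]
    if nontrivial.size == 0:
        return True
    return bool(np.max(np.abs(nontrivial)) <= 2.0 * np.sqrt(d - 1.0) + tol)


def petersen_graph():
    """Adjacency matrix of the Petersen graph (3-regular, 10 vertices)."""
    a = np.zeros((10, 10))
    for i in range(5):
        # Outer 5-cycle edge, spoke, and inner pentagram edge.
        edges = [(i, (i + 1) % 5), (i, i + 5), (i + 5, (i + 2) % 5 + 5)]
        for u, v in edges:
            a[u, v] = a[v, u] = 1.0
    return a


petersen_is_ramanujan = is_ramanujan(petersen_graph())
```

A small nontrivial spectrum corresponds to strong expansion, which is why such graphs are attractive for carrying long-range messages with few added edges.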

arXiv cs.LG · 17h ago

Transformer Models Highly Sensitive to Noisy Data in Trajectory Prediction

A study finds that Transformer-based trajectory prediction models degrade significantly when object state data is noisy. Prediction accuracy degrades by a factor of 1.3 under mild noise and by up to 3.9 under realistic high-noise conditions, highlighting the models' sensitivity and the need for noisier, real-world training data and mitigation strategies.
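
A noise-sensitivity evaluation of this kind can be sketched as follows: perturb the observed states with Gaussian noise, predict the future, and report the ratio of noisy to clean error. The constant-velocity predictor is a stand-in for a Transformer model, and the metric (average displacement error), noise model, and function names are all assumptions for illustration.

```python
import numpy as np


def constant_velocity_predict(history, horizon):
    """Stand-in predictor: extrapolate the last observed velocity."""
    velocity = history[-1] - history[-2]
    steps = np.arange(1, horizon + 1)[:, None]
    return history[-1] + steps * velocity


def average_displacement_error(prediction, truth):
    """Mean Euclidean distance between predicted and true positions."""
    return float(np.linalg.norm(prediction - truth, axis=1).mean())


def degradation_factor(history, future, sigma, rng):
    """Ratio of prediction error with noisy inputs to error with clean inputs."""
    horizon = len(future)
    clean_error = average_displacement_error(
        constant_velocity_predict(history, horizon), future)
    noisy_history = history + rng.normal(0.0, sigma, size=history.shape)
    noisy_error = average_displacement_error(
        constant_velocity_predict(noisy_history, horizon), future)
    return noisy_error / clean_error


# A gently curving 2D trajectory: 8 observed steps, 4 future steps.
t = np.arange(12, dtype=float)
trajectory = np.stack([t, 0.05 * t ** 2], axis=1)
history, future = trajectory[:8], trajectory[8:]
factor = degradation_factor(history, future, sigma=0.5, rng=np.random.default_rng(0))
```

A factor of 1.0 means noise had no effect; larger values indicate greater sensitivity to input noise.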

arXiv cs.LG · 17h ago

Reward-Petri-Net Interpretation of Temporal Behavior Trees

This paper presents a Reward-Petri-Net interpretation of Temporal Behavior Trees for reinforcement learning. It translates TBTs into Petri Nets, assigning rewards based on structural constraints defined in Linear Temporal Logic, enabling effective learning in complex, long-horizon robotic tasks where vanilla RL fails.

arXiv cs.LG · 18h ago

Predictive Repair Management Using Multi-Head Attention and Online Learning

A deep learning framework using multi-head attention and online learning accurately predicts repair durations by integrating categorical and numerical historical data. The model achieves 78% accuracy on real-world repair data from 2013 to 2020, outperforming feed-forward neural networks and random forests, with attention weights revealing key feature interactions.

arXiv cs.LG · 18h ago

TRIZ-Inspired Text-to-CAD Framework Enhances Creative Design

A TRIZ-inspired text-to-CAD framework uses large language models to generate creative, editable 3D CAD models by integrating inventive principles from patent data. In a chair design case study, it achieved a 4.0% to 14.7% mass reduction while preserving structural integrity through principles such as segmentation and composite materials.

arXiv cs.LG · 18h ago

Functional Orthogonality Ensures Identifiability in Unsupervised Disentanglement

The paper proves that locally orthogonal directions in generative models guarantee latent factor identifiability without needing statistical independence or causal assumptions. Experiments with orthogonality-regularized normalizing flows confirm reliable recovery of true latent factors, challenging prior claims about unsupervised disentanglement impossibility.

arXiv cs.LG · 18h ago

Atomistic Language Models Understand and Generate Materials

Atomistic Language Models (ALMs) unify language and atomistic structures, enabling natural language-driven crystal generation and optimization. ALMs use a continuous bridge to map language embeddings into atomistic diffusion steering space and employ Text-to-Crystal Feynman-Kac for stoichiometric accuracy. The ALM Bench benchmark evaluates text-conditioned material generation and optimization, with code and weights to be released soon.

arXiv cs.LG · 18h ago

LDT-FRL Framework for Cyber-Resilient IoMT

The LDT-FRL framework introduces a privacy-preserving defense system for IoMT devices, combining temporal attention, lightweight digital twins, and federated reinforcement learning. It achieves 99.66% and 99.95% accuracy on CICDDoS 2019 and TON-IoT benchmarks, with perfect F1 on the MITM class, converging 81% faster than prior methods and offering interpretable defense decisions via SHAP and Grad-CAM.

arXiv cs.LG · 18h ago

Universal Encoders for Modular Relational Deep Learning

The paper proposes a modular relational deep learning approach that decouples row encoding from graph message-passing. It introduces a transformer-based Universal Row Encoder that uses schema metadata to generate invariant row embeddings, enabling better generalization across databases and improving convergence on RelBench benchmarks.

arXiv cs.LG · 18h ago

BIPC Framework Accelerates Mixed-Integer Optimization with Machine Learning

The BIPC framework reduces solution time for large-scale mixed-integer programs by identifying a backdoor subset of variables that drive computational complexity. Using supervised learning, it predicts backdoor variable values and intervals, then solves a reduced problem with these predictions, achieving significant speedups with minimal quality loss. This enables rapid, high-quality solutions under parameter perturbations in real-world systems like power and supply chains.
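
The "fix predicted variables, then solve a reduced problem" idea can be sketched on a toy binary knapsack. The brute-force solver, the instance, and the hand-picked "predicted" values are hypothetical stand-ins; a real pipeline would use a learned predictor and a mixed-integer solver.

```python
from itertools import product


def solve_binary_program(values, weights, capacity, fixed=None):
    """Brute-force: maximize sum(values * x) s.t. sum(weights * x) <= capacity.

    `fixed` maps variable index -> 0/1 and plays the role of predicted
    backdoor values; only the remaining free variables are enumerated.
    """
    fixed = fixed or {}
    n = len(values)
    free = [i for i in range(n) if i not in fixed]
    best_value, best_x = None, None
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * n
        for i, v in fixed.items():
            x[i] = v
        for i, b in zip(free, bits):
            x[i] = b
        if sum(w * xi for w, xi in zip(weights, x)) <= capacity:
            value = sum(v * xi for v, xi in zip(values, x))
            if best_value is None or value > best_value:
                best_value, best_x = value, x
    return best_value, best_x


values, weights, capacity = [10, 7, 4, 3], [5, 4, 3, 2], 9
full = solve_binary_program(values, weights, capacity)  # enumerates 2**4 points
reduced = solve_binary_program(values, weights, capacity,
                               fixed={0: 1, 1: 1})      # enumerates 2**2 points
```

Each fixed variable halves the search space in this toy setting. A wrong prediction (for example, fixing variable 0 to 0) would still return a feasible solution but with a lower objective, which mirrors the speed-versus-quality trade-off described above.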

arXiv cs.LG · 19h ago

Post-Training Speech Enhancement with Perceptual Rewards

A new post-training method uses multi-metric perceptual rewards to optimize speech enhancement models. It directly applies non-differentiable metrics like DNSMOS, WER, and UTMOS as rewards via Group Sequence Policy Optimization, achieving state-of-the-art results on DNS2020. Human evaluation confirms that combining multiple metrics outperforms single-metric approaches, reducing reward hacking.
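
One way to picture a multi-metric reward is sketched below: several non-differentiable scores are merged into a scalar per sample, then rewards are normalized within a group of candidates for the same input, as group-based policy optimization methods commonly do. The weights, the sign convention for WER, and the function names are assumptions; the summary does not give the paper's exact reward formula.

```python
import numpy as np


def combined_reward(dnsmos, utmos, wer, weights=(1.0, 1.0, 1.0)):
    """Hypothetical scalar reward: higher DNSMOS/UTMOS is better, lower WER is better."""
    w_dnsmos, w_utmos, w_wer = weights
    return w_dnsmos * dnsmos + w_utmos * utmos - w_wer * wer


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of candidates for the same noisy input."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Three enhanced candidates for one noisy utterance (made-up metric values).
rewards = [
    combined_reward(dnsmos=3.0, utmos=3.0, wer=0.2),
    combined_reward(dnsmos=3.5, utmos=3.4, wer=0.1),
    combined_reward(dnsmos=2.5, utmos=2.8, wer=0.4),
]
advantages = group_relative_advantages(rewards)
```

Because the metrics only need to be evaluated, not differentiated, any perceptual or recognition score can be plugged in this way, and mixing several of them makes it harder for the model to exploit the quirks of a single metric.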