Reasoning models — korshunov.ai

Reasoning models

Introducing Rule Violation Score for Logical Compliance

We introduce the Rule Violation Score (RVS), a metric that evaluates how well predictive models adhere to logical rules. RVS distinguishes between hard and soft rules, works with any relational dataset and model, and can be computed via SQL queries for Horn rules. Evaluation on multiple benchmarks shows that models with similar predictive accuracy can differ greatly in logical compliance, highlighting RVS's ability to reveal behaviors missed by standard metrics.

arXiv cs.AI · 7d ago
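
The idea of checking a Horn rule with SQL can be illustrated with a small sketch. It is not the paper's definition of RVS: the rule, the tables `parent` and `predicted_grandparent`, their columns, and the "violated groundings divided by body groundings" ratio are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical Horn rule: parent(x, y) AND parent(y, z) -> grandparent(x, z).
# The model's predicted grandparent facts live in predicted_grandparent.

def build_demo():
    """Create an in-memory database with a few illustrative facts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE parent (src TEXT, dst TEXT);
        CREATE TABLE predicted_grandparent (src TEXT, dst TEXT);
        INSERT INTO parent VALUES ('ann', 'bob'), ('bob', 'cat'), ('bob', 'dan');
        INSERT INTO predicted_grandparent VALUES ('ann', 'cat');
    """)
    return conn

def violation_rate(conn):
    """Share of rule-body groundings whose head is absent from the predictions."""
    body_sql = """
        SELECT COUNT(*) FROM parent AS p1
        JOIN parent AS p2 ON p1.dst = p2.src
    """
    violated_sql = """
        SELECT COUNT(*) FROM parent AS p1
        JOIN parent AS p2 ON p1.dst = p2.src
        WHERE NOT EXISTS (
            SELECT 1 FROM predicted_grandparent AS g
            WHERE g.src = p1.src AND g.dst = p2.dst
        )
    """
    body = conn.execute(body_sql).fetchone()[0]
    violated = conn.execute(violated_sql).fetchone()[0]
    # An empty body means the rule is never triggered, so nothing is violated.
    return violated / body if body else 0.0

if __name__ == "__main__":
    print(violation_rate(build_demo()))
```

In this toy data the rule body matches twice (ann-bob-cat and ann-bob-dan) and only one head is predicted, so half of the groundings count as violations. A hard rule would presumably require this rate to be zero, while a soft rule could tolerate a nonzero value.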

FlowMaps Models Long-Term Multimodal Object Dynamics

FlowMaps is a latent flow matching model that predicts future object locations in 3D environments by learning spatio-temporal patterns from human interactions. It outperforms state-of-the-art methods in dynamic object navigation across over 600 episodes in both simulated and real-world settings.

arXiv cs.AI · 7d ago

Deep Reinforcement Learning for Game AI Enhancement

This paper proposes a framework for applying deep reinforcement learning to game AI, aiming to create more believable and human-like characters. It addresses current limitations in deploying machine learning agents in games and identifies key research challenges that could accelerate AI adoption in the video game industry.

arXiv cs.AI · 7d ago

QMFOL: Benchmarking LLM Reasoning with Controllable Logical Complexity

QMFOL is an automated framework that generates monadic first-order logic reasoning tasks with quantifiable complexity. It produces 2880 benchmark instances across 960 configurations, evaluating six large reasoning models and two LLMs, showing performance degradation and increased computational cost as logical complexity rises.

arXiv cs.AI · 7d ago

Thermodynamic Measure of Intelligence

The paper defines intelligence as the lawful amplification of rare but valid futures. Its framework argues that recursive self-simulation is necessary and nearly sufficient for high thermodynamic intelligence, which would allow a universal, measurable scale across systems from matter to humans and AI.

arXiv cs.AI · 7d ago

ScholarQuest: Taxonomy-Guided Benchmark for Agentic Academic Search

ScholarQuest is a large-scale benchmark for agentic academic paper search, built from 1,000 computer science topics and four research intents. It includes scalable answer construction and a shared retrieval backend, ScholarBase, enabling reproducible evaluation. Results show agentic methods outperform baseline retrieval, with the best agent achieving 0.314 Recall@100 and 0.355 Recall@All, indicating significant room for improvement.

arXiv cs.AI · 7d ago
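
For readers unfamiliar with the Recall@100 and Recall@All figures, the standard recall-at-k computation can be sketched as follows. The function name, the paper identifiers, and the reading of "Recall@All" as recall over the full returned list are assumptions; the benchmark's exact evaluation protocol may differ.

```python
def recall_at_k(ranked_ids, relevant_ids, k=None):
    """Fraction of relevant items found in the top-k results.

    k=None scores the whole returned list (assumed here to correspond
    to "Recall@All").
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without relevant items")
    top = ranked_ids if k is None else ranked_ids[:k]
    hits = len(relevant.intersection(top))
    return hits / len(relevant)

if __name__ == "__main__":
    # Hypothetical search output and ground-truth set.
    ranked = ["p3", "p1", "p9", "p2"]
    relevant = {"p1", "p2", "p5"}
    print(recall_at_k(ranked, relevant, k=2))
    print(recall_at_k(ranked, relevant))
```

Under this definition, a Recall@100 of 0.314 means that on average roughly a third of the relevant papers appear among the first 100 results.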

MAMO: Multi-Agent System for Multi-Objective Constrained Optimization

MAMO introduces a multi-agent reinforcement learning approach to address the challenge of balancing cost minimization and constraint satisfaction in dynamic environments. It decouples task execution from reward weight selection, treating the choice of weights as a learning problem to enable more autonomous and robust solutions.

arXiv cs.AI · 7d ago

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

SPOT-E introduces a test-time method that uses visual spotlights to enhance evidence grounding in frozen vision-language models. It employs low-entropy anchors and an entropy-shaping objective to reduce answer uncertainty while preserving high-confidence tokens, improving robustness under visual corruptions across benchmarks and VLM families.

arXiv cs.AI · 7d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.

arXiv cs.AI · 7d ago

Finetuning VLA Models Requires Fewer Layers Than Thought

Vision-Language-Action models show severe layer-wise redundancy despite large parameter counts. A training-free compression method using Centered Kernel Alignment removes twin layers, reducing model depth by up to 50% and enabling 40-50% faster training and up to 30% faster inference without performance loss, validated across simulation and real-world robotic tasks.

arXiv cs.AI · 7d ago

Meaning Intelligence Framework for Nigerian Public Discourse

The Meaning Intelligence Framework (MIF) introduces a nine-dimension schema to analyze Nigerian public discourse, addressing context failure in AI systems. A 30-item calibration dataset shows that schema-informed prompting improves register classification accuracy from 33.3% to 73.3% and boosts the composite Meaning Intelligence Score from 73.2 to 78.6.

arXiv cs.AI · 7d ago

Lagrange: Open-Vocabulary Sparse Framework for End-to-End Driving

Lagrange introduces an open-vocabulary, energy-based sparse framework for generalized end-to-end driving. It uses Vision-Language Models to generate class-agnostic object proposals and encodes them into continuous semantic tokens, enabling robust generalization to anomalous scenarios while adhering to vehicle kinematics through Lagrangian action minimization.

arXiv cs.AI · 7d ago

Boundary Embedding Shaping for Graph Structural Disentanglement

Boundary Embedding Shaping (BES) addresses graph structural entanglement by selectively suppressing spurious neighbor correlations near class boundaries. BES uses adaptive contrastive learning to enhance boundary discrimination, improving GCN node classification by an average of 3.3% (up to 5.0% on WikiCS) and achieving superior link prediction accuracy.

arXiv cs.AI · 7d ago

Novel DTL Approach for Data-Scarce Fault Diagnosis

A new deep transfer learning method leverages systems' non-linearities to generate diagnostic data under severe data scarcity. This approach uses a periodic multi-excitation procedure and a novel data visualization technique to augment limited vibration data, enabling effective fault diagnosis via pre-trained CNNs. Experimental results on a railway pantograph validate the method's effectiveness.

arXiv cs.AI · 7d ago

SoftSkill: Behavioral Compression for Contextual Adaptation

SoftSkill proposes a method to compress natural-language skills into compact latent priors, improving task performance on SearchQA, LiveMath, and DocVQA. It outperforms SkillOpt by 5.2 to 12.5 points on key benchmarks while replacing hundreds to thousands of Markdown tokens with a few virtual tokens.

arXiv cs.AI · 7d ago

Trajectory Mining Reveals Skill Structure but Fails to Improve Policies

A three-stage pipeline mines skill libraries from GUI interaction data, achieving high purity in five of eight clusters against InteraSkill labels. However, the method only slightly improves skill-step accuracy on IW and fails to advance performance on BrowseComp+ or key metrics, indicating limitations in cross-domain policy transfer.

arXiv cs.AI · 7d ago

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass uses runtime and compiler evidence to guide LLM-generated optimization decisions, outperforming expert heuristics and classical autotuning methods. It achieves geometric-mean speedups of 1.043x on x86-64 and 1.117x on ARM64 systems without prior training or fine-tuning.

arXiv cs.AI · 7d ago
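
A geometric-mean speedup such as the 1.043x and 1.117x figures is conventionally the n-th root of the product of per-benchmark speedups. A minimal sketch, with a hypothetical function name and made-up input values rather than the paper's data:

```python
import math

def geometric_mean_speedup(speedups):
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # Averaging in log space avoids overflow when multiplying many ratios.
    log_sum = sum(math.log(s) for s in speedups)
    return math.exp(log_sum / len(speedups))

if __name__ == "__main__":
    # Illustrative values only: one benchmark unchanged, one 4x faster.
    print(geometric_mean_speedup([1.0, 4.0]))
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 2x slowdown cancel to 1.0, which an arithmetic mean would not reflect.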

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX introduces a high-fidelity, accelerated safety benchmark for reinforcement learning using MuJoCo XLA. It achieves up to 100x speedups over CPU-based benchmarks via vectorization and hardware acceleration, featuring six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six safe RL methods shows no single approach dominates, highlighting trade-offs between performance and safety, with curriculum learning and safety transfer improving results.

arXiv cs.LG · 7d ago

Tri-Info: Generalizable Failure Prediction for VLA Models

Tri-Info uses information theory to detect failures in Vision-Language-Action models by analyzing action diversity, temporal consistency, and state coupling. It achieves 83% accuracy on real-world tasks across six models and three environments, outperforming prior methods and maintaining performance without retraining.

arXiv cs.LG · 7d ago

Training LLMs for Long-Lifecycle Agents via Cross-Domain Generalization

A new framework enables large language models to develop 'Connect the Dots' capability, allowing long-lifecycle agents to learn from experiences and iteratively update their environment context. The framework uses reinforcement learning with long rollout sequences and custom tasks to promote cross-domain generalization, showing effective out-of-distribution performance in both domains and transition settings.