Reasoning models
arxiv arXiv cs.AI · 7d ago

Lean as Process-Verified Reward Oracle in RL for Theorem Proving

This work shows that Lean can serve as a symbolic process oracle, providing fine-grained, verified feedback during reinforcement learning. By parsing proof attempts into tactic sequences and using Lean's elaboration to mark sound steps and the first failure, the system generates dense, type-theoretic reward signals. Experiments show that tactic-level supervision outperforms outcome-only methods on benchmarks such as MiniF2F and ProofNet, so Lean serves both as evaluator and as a source of training reward.
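As a rough illustration of the difference between tactic-level and outcome-only rewards, the sketch below assumes the verifier has already produced one pass/fail flag per tactic. The function names, the reward values, and the choice to give zero reward after the first failure are all hypothetical; the summary does not specify the paper's actual reward shaping.

```python
from typing import List


def tactic_rewards(step_ok: List[bool],
                   step_reward: float = 1.0,
                   fail_penalty: float = -1.0) -> List[float]:
    """Dense reward: one value per tactic.

    Assumed shaping (not from the paper): every verified tactic before
    the first failure earns `step_reward`, the first failing tactic earns
    `fail_penalty`, and everything after it earns 0 because it was never
    checked against a valid proof state.
    """
    rewards: List[float] = []
    failed = False
    for ok in step_ok:
        if failed:
            rewards.append(0.0)
        elif ok:
            rewards.append(step_reward)
        else:
            rewards.append(fail_penalty)
            failed = True
    return rewards


def outcome_reward(step_ok: List[bool]) -> float:
    """Sparse reward: 1 only if the proof is non-empty and every tactic passed."""
    return 1.0 if step_ok and all(step_ok) else 0.0


# Hypothetical verifier output for a four-tactic proof attempt
# whose third tactic fails to elaborate.
flags = [True, True, False, True]
dense = tactic_rewards(flags)
sparse = outcome_reward(flags)
```

Under these assumptions the dense signal still credits the two sound steps, whereas the outcome-only signal gives the whole attempt zero.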

arxiv arXiv cs.AI · 7d ago

EEG Foundation Models for Burst-Suppression Detection in ICU

A study evaluates EEG foundation models for event-based burst-suppression detection in ICU settings without patient-specific calibration. REVE-base achieved the highest event-based F1-score (0.868) and reduced burst-per-minute error by 52.1% relative to EEGNet and by 36.2% relative to adaptive thresholding. Ablation results show that full fine-tuning outperforms the other adaptation strategies, and that pretrained REVE-base surpasses random initialization by 0.723 F1 when only 25% of the labeled data is used, highlighting the value of pretraining for limited datasets.
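For readers unfamiliar with event-based scoring, here is a minimal sketch of one common way to compute an event-level F1. It assumes events are `(start, end)` intervals in seconds and that a predicted event counts as a true positive if it overlaps a not-yet-matched reference event; this matching rule and the function name `event_f1` are assumptions, and the study may use a different criterion.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_seconds, end_seconds)


def event_f1(pred: List[Event], ref: List[Event]) -> float:
    """Event-level F1 with greedy one-to-one matching on any overlap."""
    matched = set()  # indices of reference events already claimed
    tp = 0
    for p_start, p_end in pred:
        for i, (r_start, r_end) in enumerate(ref):
            # Two half-open intervals overlap if each starts before the other ends.
            if i not in matched and p_start < r_end and r_start < p_end:
                matched.add(i)
                tp += 1
                break
    if tp == 0:
        return 0.0
    precision = tp / len(pred)   # tp / (tp + false positives)
    recall = tp / len(ref)       # tp / (tp + false negatives)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three predicted bursts overlap a reference burst.
predicted = [(0.0, 2.0), (5.0, 6.0), (9.0, 10.0)]
reference = [(1.0, 3.0), (5.5, 7.0)]
score = event_f1(predicted, reference)
```

Unlike sample-wise accuracy, this score counts whole detected events, so a single spurious burst costs as much as a long one.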

arxiv arXiv cs.AI · 7d ago

Hidden Evolution of Disguised Visual Context in VLMs

Visual tokens enter large language models as raw, unstructured signals. How they are transformed and integrated internally depends on the architecture: they are either supplied as in-context prompts or injected into intermediate layers, and the two paradigms lead to distinct evolution paths in visual representation and frequency characteristics. We find that attention alone is insufficient; performance is driven by the quality of visual representations at each layer across different integration paradigms.

arxiv arXiv cs.AI · 7d ago

Sensorimotor World Models for Action-Aligned Perception

A new sensorimotor world model (SMWM) learns compact, action-relevant latent representations from offline trajectories. It uses inverse dynamics regularization to prevent representation collapse and align latent states with controllable environmental degrees of freedom, enabling stable training without complex regularizers or frozen components. SMWM achieves competitive planning performance in 2D and 3D control tasks.

arxiv arXiv cs.AI · 7d ago

Dual-Agent Framework for Cross-Model Verified Translation

A dual-agent framework converts natural-language experiment protocols into executable commands for robotic lab platforms. It uses a Parser Agent and a rule-based mapping engine to translate protocols, with a heterogeneous LLM Validation Agent ensuring accuracy and triggering self-correction. The framework successfully enables end-to-end autonomous execution of microplate-based experiments like the Bradford assay.

arxiv arXiv cs.AI · 7d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arxiv arXiv cs.AI · 7d ago

BIM-Edit: Benchmarking LLMs for IFC-Based BIM Editing

BIM-Edit introduces a benchmark to evaluate large language models on natural-language editing of Building Information Models in IFC format. It includes 324 editing tasks across 11 real and 36 synthetic building models, assessing geometric accuracy, semantic validity, and topological consistency. The best model achieves an average score of only 49.5%, and no model solves more than 3.4% of tasks, highlighting a significant gap in LLM capabilities for engineering design workflows.

arxiv arXiv cs.AI · 7d ago

Essay Quality Representations in LLMs Found to Be Linearly Accessible

A study reveals that essay quality information in large language models is encoded in linearly accessible forms within their hidden representations. These representations emerge layer-by-layer, remain stable across prompts, and show partial transfer across different essay prompts, with longer essays relying more on deeper model layers. The research identifies specific 'essay scoring neurons' whose activation strongly correlates with scores and can be influenced by targeted interventions.

arxiv arXiv cs.AI · 7d ago

RS-Neg Benchmark and NeFo Method for Negation Understanding in Remote Sensing MLLMs

RS-Neg is the first benchmark to evaluate negation comprehension in remote sensing tasks across region-level and scene-level scenarios. It reveals that advanced remote sensing MLLMs struggle with negation, showing hallucinations and performance drops. NeFo, a test-time learning method, improves negation understanding using only 5% unlabeled test data and generalizes well to new tasks.

arxiv arXiv cs.AI · 7d ago

Introducing Rule Violation Score for Logical Compliance

We introduce the Rule Violation Score (RVS), a metric that evaluates how well predictive models adhere to logical rules. RVS distinguishes between hard and soft rules, works with any relational dataset and model, and can be computed via SQL queries for Horn rules. Evaluation on multiple benchmarks shows that models with similar predictive accuracy can differ greatly in logical compliance, highlighting RVS's ability to reveal behaviors missed by standard metrics.
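To make the idea of checking a Horn rule with SQL concrete, the sketch below counts how often a rule body holds while its head is missing from a model's predictions. The schema, the example rule `parent(x, y) ∧ parent(y, z) → grandparent(x, z)`, and the simple "violated groundings divided by body groundings" ratio are illustrative assumptions, not the paper's definition of RVS.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create a toy in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parent (src TEXT, dst TEXT)")
    conn.execute("CREATE TABLE grandparent (src TEXT, dst TEXT)")
    # Known facts: a is parent of b; b is parent of c and d.
    conn.executemany("INSERT INTO parent VALUES (?, ?)",
                     [("a", "b"), ("b", "c"), ("b", "d")])
    # Model predictions: only one of the two implied grandparent facts.
    conn.executemany("INSERT INTO grandparent VALUES (?, ?)", [("a", "c")])
    return conn


def violation_rate(conn: sqlite3.Connection) -> float:
    """Fraction of body groundings whose head fact is absent."""
    body = conn.execute(
        "SELECT COUNT(*) FROM parent p1 JOIN parent p2 ON p1.dst = p2.src"
    ).fetchone()[0]
    violated = conn.execute(
        """
        SELECT COUNT(*)
        FROM parent p1 JOIN parent p2 ON p1.dst = p2.src
        WHERE NOT EXISTS (
            SELECT 1 FROM grandparent g
            WHERE g.src = p1.src AND g.dst = p2.dst
        )
        """
    ).fetchone()[0]
    return violated / body if body else 0.0


rate = violation_rate(build_demo())
```

Here the rule body has two groundings (a-b-c and a-b-d) and the prediction for the second is missing, so half of the groundings are violated. A soft rule could, under the same assumptions, be handled by tolerating a nonzero rate rather than requiring zero.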