Reasoning models — korshunov.ai

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It maintains stable policy entropy across model scales and tasks, outperforming DAPO and other baselines by 4%-8% on AIME24 and AIME25, with consistent exploration-exploitation balance.

arXiv cs.LG · 8d ago
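The STARE summary above mentions selecting tokens by surprisal quantile and reweighting their advantages. The following is a minimal sketch of that general idea, not the paper's method: the function names, the 0.8 quantile, the 0.5 weight, and the choice to down-weight (rather than up-weight) high-surprisal tokens are all assumptions made for illustration.

```python
import math

def surprisal(logprobs):
    """Per-token surprisal, -log p(token), given token log-probabilities."""
    return [-lp for lp in logprobs]

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def reweight_advantages(advantages, logprobs, q=0.8, weight=0.5):
    """Scale the advantage of every token whose surprisal exceeds the q-quantile.

    Both `q` and `weight` are hypothetical knobs; the source does not state
    which tokens are selected or how their advantages are rescaled.
    """
    s = surprisal(logprobs)
    threshold = quantile(s, q)
    return [a * weight if s_t > threshold else a
            for a, s_t in zip(advantages, s)]

# Five tokens sharing one sequence-level advantage of 1.0, as in GRPO-style training.
token_logprobs = [math.log(p) for p in (0.9, 0.5, 0.1, 0.8, 0.01)]
print(reweight_advantages([1.0] * 5, token_logprobs))
```

In this toy input only the least likely token (probability 0.01) lies above the 0.8 surprisal quantile, so only its advantage is rescaled.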

TxBench-PP: AI Agent Performance in Preclinical Pharmacology

TxBench-PP is a verifiable benchmark for small-molecule preclinical pharmacology, testing AI agents' ability to derive accurate conclusions from real-world assay data. Across 16 model-harness configurations, no system reliably made correct preclinical pharmacology decisions; the two best configurations reached 59.3% (Claude Opus 4.8 / Pi) and 55.3% (GPT-5.5 / Pi) of endpoint attempts.

arXiv cs.LG · 8d ago

TGO-I: Spectral Geometry of Vision Transformers

TGO-I analyzes the spectral geometry of Vision Transformers using ViT-Small/16 trained on ImageNet-100. It reveals increasing dimensional utilization and reduced anisotropy, with eigenspectra becoming flatter and spectral entropy rising. The final CLS token shows highest effective dimensionality and lowest anisotropy, indicating broad variance distribution across dimensions.

arXiv cs.LG · 8d ago
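The TGO-I summary above refers to spectral entropy, effective dimensionality, and anisotropy. The sketch below uses common definitions of these quantities, computed from the eigenvalues of a representation covariance matrix. The paper's exact definitions are not given in the summary, so treating effective dimensionality as the exponential of spectral entropy and anisotropy as the top eigenvalue's share of variance is an assumption.

```python
import math

def spectral_stats(eigenvalues):
    """Summarize an eigenspectrum (non-negative covariance eigenvalues).

    Assumed definitions, not taken from the paper:
      spectral_entropy = -sum(p_i * log p_i), with p_i = lambda_i / sum(lambda)
      effective_dim    = exp(spectral_entropy)
      anisotropy       = max(p_i), the variance share of the top direction
    """
    total = sum(eigenvalues)
    probs = [ev / total for ev in eigenvalues if ev > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "spectral_entropy": entropy,
        "effective_dim": math.exp(entropy),
        "anisotropy": max(probs),
    }

flat = spectral_stats([1.0, 1.0, 1.0, 1.0])     # variance spread evenly
peaked = spectral_stats([97.0, 1.0, 1.0, 1.0])  # one dominant direction
print(flat)
print(peaked)
```

Under these definitions a flatter spectrum gives higher entropy, higher effective dimensionality, and lower anisotropy, which matches the direction of the trends the summary describes.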

Graph Neural Networks Accelerate Algebraic Multigrid Pressure Solver

A graph neural network enhances algebraic multigrid solvers by predicting optimal polynomial coefficients for sparse pseudo-inverse operators. The method reduces V-cycle iterations and achieves wall-clock speedups of 4% to 37% across benchmarks, with robust performance on meshes up to 128 times larger than training data and on unseen industry problems like AirfRANS.

arXiv cs.LG · 8d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a single panoramic canvas using 3D world coordinates. It achieves state-of-the-art performance on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using significantly less training compute than existing methods.

arXiv cs.LG · 8d ago

SCAN: Multi-Scale Clustering for Time Series Anomaly Detection

SCAN enhances reconstruction-based time series anomaly detection by integrating multi-scale neighborhood-centered clustering. It uses cluster center representations to constrain normal pattern reconstruction and derives an anomaly confidence score based on cluster membership probability, combined with reconstruction error. Extensive experiments on real-world datasets show SCAN achieves state-of-the-art performance.

arXiv cs.LG · 8d ago

Conceptual Innovation in Medical Imaging AI

A new perspective argues that medical imaging AI research should prioritize conceptual innovation—reframing problems, evaluation metrics, and clinical relevance—over algorithmic improvements alone. The article highlights that current academic incentives undervalue conceptual contributions, leading to misaligned objectives and limited real-world impact, and offers recommendations for researchers, mentors, and journals to better support such innovation.

arXiv cs.LG · 8d ago

Large Language Gibbs for Structured Probabilistic Inference

Large Language Gibbs uses LLM conditional distributions as transition operators for iterative variable resampling. This method enables coherent, order-independent probabilistic inference by achieving a stationary distribution that balances local conditionals, offering a practical alternative to single-pass generation for structured reasoning tasks.

arXiv cs.LG · 8d ago
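The Large Language Gibbs summary above describes resampling one variable at a time from a conditional distribution. The sketch below shows a plain Gibbs sampler with a random scan order. The two hand-written conditional tables (`rain_given`, `wet_given`) are hypothetical stand-ins for LLM queries, and nothing here reflects the paper's prompts or models.

```python
import random

def gibbs_sample(conditionals, init, sweeps, rng):
    """Run Gibbs sampling; `conditionals[name](state)` returns {value: prob}."""
    state = dict(init)
    names = list(conditionals)
    samples = []
    for _ in range(sweeps):
        rng.shuffle(names)  # random scan: no fixed variable order
        for name in names:
            dist = conditionals[name](state)
            values = list(dist)
            weights = [dist[v] for v in values]
            state[name] = rng.choices(values, weights=weights, k=1)[0]
        samples.append(dict(state))
    return samples

# Toy conditionals derived from one joint distribution:
# P(rain=1) = 0.3, P(wet=1 | rain=1) = 0.9, P(wet=1 | rain=0) = 0.2.
def rain_given(state):
    p = 0.27 / 0.41 if state["wet"] else 0.03 / 0.59
    return {1: p, 0: 1 - p}

def wet_given(state):
    p = 0.9 if state["rain"] else 0.2
    return {1: p, 0: 1 - p}

rng = random.Random(0)
samples = gibbs_sample({"rain": rain_given, "wet": wet_given},
                       {"rain": 0, "wet": 0}, sweeps=20000, rng=rng)
rain_freq = sum(s["rain"] for s in samples) / len(samples)
print(f"empirical P(rain=1) = {rain_freq:.3f} (target 0.3)")
```

Because both toy conditionals come from the same joint distribution, the chain's stationary distribution is that joint. With LLM-provided conditionals, compatibility is not guaranteed, which is presumably why the summary speaks of a stationary distribution that balances the local conditionals.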

NeSyCat Torch: Differentiable Tensor Implementation for Neurosymbolic Learning

NeSyCat Torch provides a differentiable tensor implementation of categorical semantics for neurosymbolic learning, unifying classical, fuzzy, probabilistic, and neural systems under a single inductive truth definition. It outperforms LTN and DeepProbLog in speed and accuracy on MNIST addition, matching DeepStochLog's accuracy while operating within a uniform framework extendable to continuous probability via monad instantiation.

arXiv cs.LG · 8d ago

Ambient Sound and Light Predict ICU Delirium

A study finds that ambient sound and light intensity can independently predict delirium in ICUs. Sound features were the dominant predictors, with combined sound and light improving short-term delirium risk estimation, especially within one week.

arXiv cs.LG · 8d ago

Act2Answer Evaluates Knowledge Retention in Vision-Language-Action Models

Act2Answer introduces a lightweight protocol to assess commonsense and world knowledge retention in VLA models by requiring agents to answer questions through object placement actions. A large-scale study of 7 VLA models and 9 VLM baselines reveals that VLAs perform well on simple concepts but fall further behind their source VLMs on rich semantic categories. VQA co-training improves knowledge retention, and answer-relevant signals peak in the middle VLA layers.

arXiv cs.LG · 8d ago

MC Dropout Uncertainty Alignment Insufficient for Clinical Safety in Glioma Segmentation

A study on 126 BraTS21 patients finds that while MC Dropout achieves strong uncertainty-error alignment, it fails to detect critical calibration issues in enhancing tumour regions. The UNet-Res model shows near-zero entropy and high ECE in these clinically vital areas, with a low Dice score of 0.714, indicating severe miscalibration invisible to standard metrics like Dice and AUROC. These results highlight that uncertainty alignment alone is insufficient for clinical safety and that region-specific calibration must be evaluated alongside standard metrics.

arXiv cs.LG · 8d ago
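The glioma segmentation summary above pairs near-zero entropy with high ECE. The sketch below computes both with standard definitions: expected calibration error over equal-width confidence bins and binary predictive entropy. The bin count and the toy voxel values are assumptions, and the study's actual evaluation code is not shown in the summary.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

def binary_entropy(p):
    """Predictive entropy (in nats) of a Bernoulli prediction with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Hypothetical voxels from one region: very confident, mostly wrong.
region_conf = [0.99, 0.99, 0.99, 0.99]
region_correct = [True, False, False, False]
print(expected_calibration_error(region_conf, region_correct))
print(binary_entropy(0.99))
```

Restricting the inputs to voxels of a single region, as in this toy example, is one way to obtain the region-specific calibration the summary calls for: entropy stays close to zero while ECE is large, which is the overconfident pattern described.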

Optimizing climate scenarios boosts emulator generalization

A new method uses a differentiable simple climate model to optimize training scenarios, enhancing emulator generalization. Training on one optimized scenario outperforms six standard ScenarioMIP pathways, and such scenarios yield more skillful emulators when used with intermediate-complexity models, despite smaller dataset sizes.

arXiv cs.LG · 8d ago

P-K-GCN: Physics-augmented Koopman-enhanced Graph Convolutional Network

P-K-GCN enables high-fidelity spatiotemporal super-resolution on irregular geometries by combining graph convolutional networks with Koopman operator theory. It incorporates a physics-based loss to enforce adherence to physical laws, reducing super-resolution error through improved generalization and accuracy, as validated in cardiac electrodynamics reconstruction.

arXiv cs.LG · 8d ago

Diffusion-Proof: First Framework for Diffusion LLMs in Formal Theorem Proving

Diffusion-Proof is the first framework to train and apply diffusion language models for formal theorem proving. It introduces dLLM-Prover-7B for whole-proof writing with long-range coherence and dLLM-Corrector-7- for local proof correction using bidirectional information. The framework outperforms auto-regressive LLM baselines by 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test, and solves an IMO problem beyond the capability of DeepSeek-Prover-V2-7B.

arXiv cs.LG · 8d ago

Reverse-Engineering Transformer Attention with Executable Programs

A new method uses program synthesis to generate Python programs that reproduce attention patterns in transformer models. These programs achieve over 75% average Intersection-over-Union similarity on held-out data and can replace up to 25% of attention heads with minimal impact on model performance, increasing perplexity by only 16% on average.

arXiv cs.LG · 8d ago
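The attention reverse-engineering summary above scores programs by Intersection-over-Union against attention patterns. The sketch below shows one way such a comparison could work. The thresholding step, the 0.1 threshold, and the `previous_token_head` program are hypothetical and are not taken from the paper.

```python
def binarize(attention, threshold=0.1):
    """Set of (query, key) positions whose attention weight meets the threshold."""
    return {(i, j)
            for i, row in enumerate(attention)
            for j, weight in enumerate(row)
            if weight >= threshold}

def iou(a, b):
    """Intersection-over-Union of two sets of (query, key) positions."""
    union = a | b
    if not union:
        return 1.0  # two empty patterns are treated as identical
    return len(a & b) / len(union)

def previous_token_head(n):
    """Hypothetical synthesized program: each token attends to its predecessor.

    Position 0 has no predecessor, so it attends to itself.
    """
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)]
            for i in range(n)]

# Made-up causal attention weights for a 3-token sequence.
observed = [
    [1.00, 0.00, 0.00],
    [0.90, 0.10, 0.00],
    [0.05, 0.80, 0.15],
]
score = iou(binarize(observed), binarize(previous_token_head(3)))
print(score)  # 3 shared positions out of 5 in the union -> 0.6
```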

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based RL

UBP2 introduces a model-based method that actively explores environments by jointly reasoning over uncertainties in reward, dynamics, and value functions. It achieves superior sample efficiency in preference-based reinforcement learning, outperforming both model-free and non-optimistic model-based baselines on the Meta-World benchmark.

arXiv cs.AI · 8d ago

Essential Subspace Merging for Multi-Task Learning

Essential Subspace Merging (ESM) reduces inter-task interference by focusing on principal directions of activation shifts. ESM++ extends this with dynamic expert selection via prototype-based routing, enabling training-free multi-task model merging with preserved task knowledge.

arXiv cs.AI · 8d ago

User as Engram: Local Parametric Edits for Personal Memory

User as Engram proposes storing per-user facts as surgical, hash-keyed edits to a memory table, leaving reasoning in a shared adapter. This design achieves 5.6x higher indirect-reasoning accuracy and maintains base-level reasoning performance, with a memory footprint 33,000x smaller than per-user LoRA. The approach enables disjoint user edits that compose losslessly, outperforming retrieval pipelines beyond 100 facts.

arXiv cs.AI · 8d ago
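The User as Engram summary above describes hash-keyed edits to a memory table that compose losslessly when they are disjoint. The toy sketch below illustrates that composition property only. The helper names, the SHA-256 keying, the table size, and the use of strings as stored values are assumptions; in the paper the edits are parametric, and its actual table layout is not described in the summary.

```python
import hashlib

TABLE_SIZE = 2 ** 32  # hypothetical number of slots in the shared memory table

def slot_for(user_id, fact_key, table_size=TABLE_SIZE):
    """Map a (user, fact) pair to a table slot with a stable hash."""
    digest = hashlib.sha256(f"{user_id}\x00{fact_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size

def make_edit(user_id, facts, table_size=TABLE_SIZE):
    """Build one user's edit as a sparse {slot: value} mapping."""
    return {slot_for(user_id, key, table_size): value
            for key, value in facts.items()}

def compose(*edits):
    """Merge edits; refuse overlapping slots, since those would not be lossless."""
    merged = {}
    for edit in edits:
        overlap = merged.keys() & edit.keys()
        if overlap:
            raise ValueError(f"edits collide on slots {sorted(overlap)}")
        merged.update(edit)
    return merged

alice = make_edit("alice", {"city": "Lisbon", "pet": "cat"})
bob = make_edit("bob", {"city": "Oslo"})
table = compose(alice, bob)
print(table[slot_for("alice", "city")], table[slot_for("bob", "city")])
```

Because each edit touches only its own slots, merging is a dictionary union and every user's facts can be read back unchanged, which is the sense of lossless composition used here.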

Clinician-Centered Pipeline for Ultrasound AI Annotation and Evaluation

A new pipeline enables clinicians to perform remote annotation and blinded evaluation of ultrasound AI models without local data downloads. It supports multi-rater participation, result aggregation, and automated statistical analysis, validated in a fetal ultrasound segmentation study with six raters of varying expertise. Results show moderate to strong agreement and a preference for later active learning models in blinded rankings.