Reasoning models — korshunov.ai


Topological Data Analysis for Real-Time Process Monitoring

A new method combines topological data analysis and machine learning to monitor high-dimensional dynamic processes. It represents time-series data as manifolds, uses topological descriptors to capture structure, and employs neural ordinary differential equations to model dynamic evolution. The approach effectively detects diverse events in industrial process data and outperforms reconstruction-based and trajectory-based alternatives.

arXiv cs.LG · 6d ago
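The summary does not name the embedding or the topological descriptors. As a minimal illustration, the sketch below assumes a time-delay embedding that turns one window of a series into a point cloud, and uses 0-dimensional persistent homology (component death times in a Vietoris-Rips filtration) as the descriptor. Both are standard choices, not necessarily the paper's, and the neural ODE component is not shown.

```python
import math
from itertools import combinations


def delay_embed(series, dim=3, lag=1):
    """Time-delay (Takens-style) embedding: turns a 1-D series into points in R^dim."""
    n = len(series) - (dim - 1) * lag
    return [tuple(series[t + k * lag] for k in range(dim)) for t in range(n)]


def h0_persistence(points):
    """Death times of 0-dimensional features in the Vietoris-Rips filtration.

    Every point is born at scale 0; two components merge when the shortest edge
    joining them appears. Processing edges by increasing length with union-find
    (Kruskal's algorithm) yields exactly the minimum-spanning-tree edge lengths.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components: one H0 class dies at scale d
            parent[ri] = rj
            deaths.append(d)
    return deaths  # len(points) - 1 finite deaths; one component never dies


def total_persistence(series, dim=3, lag=1):
    """A simple scalar topological descriptor for one window of a time series."""
    return sum(h0_persistence(delay_embed(series, dim, lag)))


window = [math.sin(0.3 * t) for t in range(40)]
print(total_persistence(window, dim=2, lag=5))
```

In a monitoring loop, a descriptor like this would be computed per sliding window and tracked over time; how the paper combines its descriptors with the neural ODE is not specified in the summary.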

SSH-Net: Deep Network for Failure Time Prediction under Competing Risks

SSH-Net is a structured deep neural network designed to predict failure time distribution functions under competing risks. It uses separate sub-networks for different covariate groups, improving accuracy by aligning neural structure with data hierarchy. The model is validated through simulation studies and applied to Titan GPU failure data.

arXiv cs.LG · 6d ago
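For readers unfamiliar with competing risks: the quantity of interest per cause is usually the cumulative incidence function, the probability of failing from that cause by time t. The sketch below assumes a discrete-time formulation in which a network emits, at each step, logits over "no failure" plus K failure causes; this is a generic textbook construction, not SSH-Net's architecture, and the sub-network structure over covariate groups is not shown.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cumulative_incidence(logits_per_step):
    """Discrete-time competing-risks CIFs from per-step logits.

    logits_per_step[t] has K + 1 entries: index 0 is "no failure at step t",
    indices 1..K are the K competing causes. After softmax, entry k is the
    cause-specific hazard h_k(t) = P(fail from cause k at t | no failure before t).
    Returns (cif, survival) with cif[k][t] = P(T <= t, cause = k + 1).
    """
    num_causes = len(logits_per_step[0]) - 1
    cif = [[] for _ in range(num_causes)]
    survival = []
    s_prev = 1.0  # S(t - 1): probability of no failure before step t
    running = [0.0] * num_causes
    for logits in logits_per_step:
        probs = softmax(logits)
        for k in range(num_causes):
            running[k] += s_prev * probs[k + 1]  # F_k(t) = F_k(t-1) + S(t-1) h_k(t)
            cif[k].append(running[k])
        s_prev *= probs[0]  # survive step t only if no cause fires
        survival.append(s_prev)
    return cif, survival


# Hypothetical logits for 3 time steps and 2 causes.
cif, survival = cumulative_incidence([[2.0, 0.0, -1.0], [1.0, 0.5, 0.0], [0.0, 1.0, 1.0]])
print(cif, survival)
```

By construction, the cause-specific CIFs and the survival probability sum to 1 at every step, which is a useful sanity check on any model that predicts these distribution functions.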

Agentic Symbolic Search for PDE Solution Characterization

ASYS proposes a prior-guided framework that uses mathematical theory and evolutionary search to generate interpretable symbolic forms of PDE solutions. It produces analytical representations for complex problems like Allen-Cahn dynamics and Keller-Segel blow-up, offering new pathways for mathematical analysis beyond traditional methods.

arXiv cs.LG · 6d ago

Riemannian Sharpness Explains SGD's Bias Toward Flat Minima

This study introduces Riemannian sharpness, a reparametrization-invariant measure of flatness grounded in Fisher Information Matrix geometry. It proves SGD's stationary distribution concentrates at Riemannian-flat minima and links this geometric bias to generalization via a PAC-Bayes bound. Experiments on MNIST and CIFAR-10 show Riemannian sharpness better tracks generalization than Euclidean sharpness, with scaling consistent with theory.

arXiv cs.LG · 6d ago
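The summary does not give the paper's definition of Riemannian sharpness. To show why measuring curvature relative to the Fisher geometry removes reparametrization effects, the sketch assumes one natural Fisher-normalized measure, the largest eigenvalue of F⁻¹H (H the loss Hessian, F the Fisher matrix), and contrasts it with Euclidean sharpness, the largest eigenvalue of H. Only linear reparametrizations are checked here; the random matrices are stand-ins, not real model curvature.

```python
import numpy as np


def euclidean_sharpness(hessian):
    """Largest Hessian eigenvalue: changes when parameters are rescaled."""
    return float(np.linalg.eigvalsh(hessian).max())


def fisher_normalized_sharpness(hessian, fisher):
    """Largest generalized eigenvalue of (H, F), i.e. lambda_max(F^{-1} H).

    Under theta = A @ phi both matrices transform as M -> A.T @ M @ A, so
    F^{-1} H becomes A^{-1} F^{-1} H A, a similar matrix with the same eigenvalues.
    """
    eigs = np.linalg.eigvals(np.linalg.solve(fisher, hessian))
    return float(np.real(eigs).max())  # real up to rounding, since F is SPD and H symmetric


rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
H = B @ B.T                         # symmetric PSD stand-in for a Hessian at a minimum
C = rng.normal(size=(4, 4))
F = C @ C.T + 4 * np.eye(4)         # symmetric positive-definite stand-in for a Fisher matrix
A = np.diag([10.0, 1.0, 0.1, 1.0])  # rescale some coordinates

print(euclidean_sharpness(H), euclidean_sharpness(A.T @ H @ A))  # differ
print(fisher_normalized_sharpness(H, F),
      fisher_normalized_sharpness(A.T @ H @ A, A.T @ F @ A))     # agree
```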

RefRad2D Dataset Enables Scalable Spatial Grounding in Radiology

RefRad2D is a large-scale bilingual dataset of 1.2M CT and MR image-text pairs from clinical practice. Trained on this data, RadGrounder achieves competitive results in VQA and report generation, and adding spatial grounding supervision does not degrade its language quality or task performance.

arXiv cs.LG · 6d ago

How Safety-Aligned LLMs Interpret Mixed Compliance Demonstrations

A study finds that benign and harmful compliance demonstrations are not interchangeable for safety-aligned language models. Depending on the model, benign demonstrations can either reduce or increase harmful compliance, and preference optimization plays a key role in preventing harmful compliance. The research also reveals a recency bias in demonstration ordering and varied model behavior in handling refusals during in-context learning.

arXiv cs.LG · 6d ago

Probe-and-Refine Tuning Improves Coding Agent Performance

A new method called probe-and-refine tuning uses synthetic bug-fix probes to iteratively improve repository guidance files with single-shot LLM calls, without agent loops or tool use. On SWE-bench Verified it achieves a 33.0% mean resolve rate, 14.5 percentage points above the initial static knowledge base, with the gain coming from broader coverage rather than more precise patches. The method also lets agents make effective use of larger step budgets, and performance remains stable across models when the diagnostic output is sufficient.

arXiv cs.LG · 6d ago

Multi-Task Bayesian In-Context Learning Framework

A new multi-task in-context learning framework enables amortized hierarchical Bayesian inference by encoding prior information as a prefix prepended to the dataset. The transformer adapts its predictions across prior families, matching oracle performance on diverse tasks while being significantly faster, and is validated on real-world spatiotemporal temperature prediction.

arXiv cs.LG · 6d ago

Calibration in MoE Models Under Distribution Shift

This paper examines how mixture-of-experts models maintain calibration under distribution shift. It finds that expert-level calibration ensures overall model calibration in hard-routed models but is insufficient for soft-routed models. The authors propose adversarial reweighting to penalize calibration errors in routed aggregates, improving the accuracy-calibration tradeoff across tasks and shifts.

arXiv cs.LG · 6d ago
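A toy example helps show why per-expert calibration need not survive soft routing. The sketch assumes a binned expected calibration error (ECE) on the predicted probability of the positive class and a fixed 50/50 blend standing in for a soft router; it illustrates the claim, it is not the paper's experiment or its adversarial reweighting method.

```python
def expected_calibration_error(confidences, labels, num_bins=10):
    """Binned ECE for binary predictions: weighted mean |avg prediction - empirical rate|."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(confidences, labels):
        idx = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(avg_p - rate)
    return ece


# Toy population: two equally likely inputs; the first has label 1, the second label 0.
labels = [1, 0]
expert_1 = [0.5, 0.5]  # ignores the input; calibrated, since the base rate is 0.5
expert_2 = [1.0, 0.0]  # predicts the true label; also calibrated

# Soft routing stand-in: a fixed 50/50 blend of the two calibrated experts.
soft = [0.5 * p1 + 0.5 * p2 for p1, p2 in zip(expert_1, expert_2)]  # [0.75, 0.25]

print(expected_calibration_error(expert_1, labels))  # 0.0
print(expected_calibration_error(expert_2, labels))  # 0.0
print(expected_calibration_error(soft, labels))      # 0.25
```

Under hard routing, each output is exactly one expert's prediction, so an expert that is calibrated on the inputs routed to it passes that calibration on to the model. A soft blend produces new probability values (0.75 and 0.25 here) that no expert was calibrated for.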

Lie-Algebra Attention: Group Element Tokens in Neural Networks

Lie-Algebra Attention represents attention tokens as elements of a matrix Lie group and uses the closed-form Lie-algebra norm of their relative poses as attention scores. This yields invariant and equivariant attention without representation-theoretic components, and it outperforms vector-token baselines on SE(2), SO(3), and Aff(2) with fewer parameters and no learned kernels.

arXiv cs.LG · 6d ago
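For SO(3), the Lie-algebra norm of a relative rotation has a simple closed form: the rotation angle θ, with cos θ = (tr R − 1)/2. The sketch below covers only the SO(3) case and assumes the score is the negative angle of R_iᵀR_j followed by a softmax; the paper's exact scoring, its SE(2) and Aff(2) variants, and how values are aggregated are not shown.

```python
import numpy as np


def rotation_angle(R):
    """Length of the axis-angle vector log(R) for R in SO(3), i.e. the rotation angle."""
    cos_theta = (np.trace(R) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))  # clip guards rounding


def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def lie_attention_weights(rotations, temperature=1.0):
    """Row-stochastic attention weights from relative rotations R_i^T R_j.

    score_ij = -angle(R_i^T R_j) / temperature, followed by a softmax over j.
    """
    scores = np.array([[-rotation_angle(Ri.T @ Rj) / temperature for Rj in rotations]
                       for Ri in rotations])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


tokens = [rot_z(0.1), rot_x(0.5) @ rot_z(0.2), rot_x(-1.0)]
G = rot_x(0.7) @ rot_z(1.3)  # a global change of reference frame

W = lie_attention_weights(tokens)
W_moved = lie_attention_weights([G @ R for R in tokens])
print(np.allclose(W, W_moved))  # True: (G R_i)^T (G R_j) = R_i^T R_j
```

Because scores depend only on relative poses, the weights are invariant to a global frame change by construction, with no learned kernel involved.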

UNIEGO: Proxy-Mediated Unified Egocentric Video Representation

UNIEGO introduces a hierarchical multi-teacher distillation framework that uses proxy models to mediate knowledge transfer from nine diverse teachers across viewpoints and modalities. The Selective Proxy Distillation (SPD) stage adaptively selects reliable proxies during training, improving representation quality and stability. UNIEGO achieves state-of-the-art results in action recognition, video retrieval, and action segmentation on ego-exo benchmarks.

arXiv cs.LG · 6d ago

How Transparent is DiffusionGemma?

DiffusionGemma has poor variable transparency because of its high opaque serial depth, but an interpretable token bottleneck mitigates this, reducing serial depth to 1.1× that of Gemma 4. Algorithmic transparency is harder to achieve in diffusion models because tokens change dynamically during generation, though case studies reveal novel phenomena such as non-chronological reasoning and intermediate-context reasoning. Overall, DiffusionGemma is found to be about as monitorable as Gemma 4.

arXiv cs.CL · 6d ago

RefRad2D Dataset Enables Scalable Spatial Grounding in Radiology

RefRad2D is a large-scale bilingual dataset of 1.2M CT and MR image-text pairs from clinical practice. Trained on this data, RadGrounder achieves competitive VQA results and performs spatial grounding without degrading language quality, enabling verifiable outputs in radiology.

arXiv cs.CL · 6d ago

H-RePlan: Hierarchical Recovery for Cross-Device Agent Systems

H-RePlan introduces a hierarchical replanning framework that separates device-local strategy recovery from global orchestrator replanning. Through this scope-aware recovery in multi-device agent systems, it outperforms existing baselines, achieving higher task completion and better instruction adherence at lower token cost.

arXiv cs.CL · 6d ago

StylisticBias: Visual Cues Drive Most Social Biases in MLLMs

StylisticBias introduces a controlled benchmark to evaluate attribute-level social bias in multimodal large language models. It reveals that age and body type dominate identity-level effects, while fashion style and 15 key visual attributes drive most bias, accounting for nearly 80% of variation. The benchmark highlights that model judgments are most sensitive to appearance-related cues, especially in socioeconomic and style-based contexts.

arXiv cs.CL · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders these states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arXiv cs.AI · 6d ago
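The summary describes three moves: keep task state in a structured ledger, render it into the prompt, and check policy before any tool runs. The sketch below illustrates that control flow only; the `Ledger` fields, the refund policy, and the tool names are all hypothetical, not LedgerAgent's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical structured task state kept outside the chat transcript."""
    identity_verified: bool = False
    refunds_issued: int = 0
    notes: list = field(default_factory=list)

    def render(self) -> str:
        # Rendered into the prompt so the model sees the current state explicitly.
        return (f"[LEDGER] identity_verified={self.identity_verified} "
                f"refunds_issued={self.refunds_issued}")


def check_policy(ledger: Ledger, tool: str, args: dict):
    """Hypothetical policy: refunds need verified identity, at most one per task."""
    if tool == "issue_refund":
        if not ledger.identity_verified:
            return "identity must be verified before issuing a refund"
        if ledger.refunds_issued >= 1:
            return "only one refund is allowed per task"
    return None


def call_tool(ledger: Ledger, tool: str, args: dict, tools: dict):
    """Check policy against the ledger before executing, then update the ledger."""
    violation = check_policy(ledger, tool, args)
    if violation is not None:
        ledger.notes.append(f"blocked {tool}: {violation}")
        return {"ok": False, "error": violation}
    result = tools[tool](**args)
    if tool == "verify_identity" and result.get("verified"):
        ledger.identity_verified = True
    if tool == "issue_refund":
        ledger.refunds_issued += 1
    return {"ok": True, "result": result}


# Hypothetical tools standing in for real customer-service backends.
tools = {
    "verify_identity": lambda user_id: {"verified": user_id == "u1"},
    "issue_refund": lambda order_id, amount: {"refunded": amount},
}
ledger = Ledger()
print(call_tool(ledger, "issue_refund", {"order_id": "o9", "amount": 20}, tools)["ok"])  # False
call_tool(ledger, "verify_identity", {"user_id": "u1"}, tools)
print(call_tool(ledger, "issue_refund", {"order_id": "o9", "amount": 20}, tools)["ok"])  # True
print(ledger.render())
```

The design point is that the policy check reads the ledger, not the model's recollection of the conversation, so a violating call is blocked even if the model proposes it.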

AI Economist Agent: Model-Grounded Economic Analysis Framework

The AI Economist Agent uses RAG, knowledge graphs, and LLMs to generate economic narratives grounded in theory and data. It enables model-based analysis, evidence retrieval, and report generation, ensuring economic coherence and traceability through explicit model computations.

arXiv cs.AI · 6d ago

Lean as Process-Verified Reward Oracle in RL for Theorem Proving

This work shows that Lean can serve as a symbolic process oracle, providing fine-grained, verified feedback during reinforcement learning. By parsing proof attempts into tactic sequences and using Lean's elaboration to mark sound steps and first failures, the system generates dense, type-theoretic reward signals. Experiments demonstrate tactic-level supervision outperforms outcome-only methods on benchmarks like MiniF2F and ProofNet, highlighting Lean's role as both evaluator and training reward source.

arXiv cs.AI · 6d ago
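The contrast between tactic-level and outcome-only supervision is easiest to see in the shape of the reward vector. The sketch below assumes a simple scheme (+1 per sound tactic, −1 at the first failure, 0 afterwards) and replaces Lean with a hypothetical `check_prefix` callable; the paper's actual reward values and its Lean interface are not given in the summary.

```python
from typing import Callable, List, Sequence


def tactic_rewards(tactics: Sequence[str],
                   check_prefix: Callable[[Sequence[str]], bool],
                   step_reward: float = 1.0,
                   fail_penalty: float = -1.0) -> List[float]:
    """Dense per-tactic rewards up to the first failing tactic.

    check_prefix(tactics[:i + 1]) stands in for asking Lean whether the proof
    script elaborates through tactic i. Sound steps earn step_reward, the first
    failure earns fail_penalty, and later tactics get 0 (they are never reached).
    """
    rewards = []
    for i in range(len(tactics)):
        if check_prefix(tactics[: i + 1]):
            rewards.append(step_reward)
        else:
            rewards.append(fail_penalty)
            rewards.extend([0.0] * (len(tactics) - i - 1))
            break
    return rewards


def toy_check(prefix):
    """Hypothetical checker: any tactic containing "bad" fails to elaborate."""
    return all("bad" not in t for t in prefix)


print(tactic_rewards(["intro n", "induction n", "bad_tactic", "simp"], toy_check))
# [1.0, 1.0, -1.0, 0.0]
```

An outcome-only signal would give this attempt a single failing reward, while the dense version credits the two sound steps and localizes the error.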

EEG Foundation Models for Burst-Suppression Detection in ICU

A study evaluates EEG Foundation Models for event-based burst-suppression detection in ICU settings without patient-specific calibration. REVE-base achieved the highest event-based F1-score of 0.868 and reduced burst-per-minute error by 52.1% compared to EEGNet and 36.2% compared to adaptive thresholding, demonstrating superior performance. Ablation results show full fine-tuning outperforms other strategies, and pretrained REVE-base surpasses random initialization by 0.723 F1 points at 25% labeled data, highlighting the value of pretraining for limited datasets.

arXiv cs.AI · 6d ago
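Event-based F1 scores whole burst-suppression episodes rather than individual samples. The summary does not state the matching rule, so the sketch assumes a common convention: a predicted event counts as a true positive if it overlaps a not-yet-matched reference event, with intervals treated as half-open [start, end).

```python
def overlaps(a, b):
    """True if half-open intervals [start, end) share any time."""
    return a[0] < b[1] and b[0] < a[1]


def event_f1(reference, predicted):
    """Event-based F1 with any-overlap matching (each reference event matched at most once)."""
    matched_ref = set()
    tp = 0
    for p in predicted:
        for i, r in enumerate(reference):
            if i not in matched_ref and overlaps(p, r):
                matched_ref.add(i)
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Times in seconds: one hit, one missed reference event, one false alarm.
print(event_f1([(0, 2), (5, 7)], [(1, 3), (10, 11)]))  # 0.5
```

Stricter variants require a minimum overlap fraction or onset tolerance, which can change scores noticeably; results across papers are only comparable under the same rule.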

Hidden Evolution of Disguised Visual Context in VLMs

Visual tokens enter large language models as raw, unstructured signals. How they are transformed and integrated internally depends on the architecture (fed in as in-context prompts or injected into intermediate layers), leading to distinct evolution paths in the visual representations and their frequency characteristics. The study finds that attention alone does not explain performance; instead, performance is driven by the quality of the visual representations at each layer across the different integration paradigms.