Reasoning models — korshunov.ai


Optimal Order in Multi-Agent Systems Framework

A new framework analyzes multi-agent systems by modeling agent influence and response functions. It derives macroscopic properties like power, entropy, and order, and identifies an optimal level of synchronization that balances productivity, stability, and adaptability. The study shows that order and system properties are task-dependent and context-relative.

arXiv cs.AI · 6d ago

Calibration Without Comprehension in LLM Vulnerability Detection

CWE-Trace evaluates 8 vanilla and 15 LoRA-fine-tuned LLMs on Linux kernel vulnerability detection. Results show that data contamination offers no advantage and that fine-tuning only shifts output thresholds without altering decision policies. Despite improved detection scores, LLMs lack reliable security reasoning, with top-1 CWE accuracy below 1.3% and binary detection performance at 52.1%.

arXiv cs.AI · 6d ago

FreeStyle: Scalable Dual-Reference Generation via Community LoRA Mining

FreeStyle proposes a framework that mines community LoRAs to generate large-scale style-content dual-reference image triplets. It employs a two-stage curriculum with disentanglement mechanisms to suppress style leakage and introduces a benchmark with style-invariant and VLM-based scores to evaluate content preservation and leakage rejection.

arXiv cs.AI · 6d ago

How Safety-Aligned LLMs Interpret Mixed Compliance Demonstrations

The study shows that benign and harmful compliance demonstrations are not interchangeable in LLMs. Benign demonstrations can either reduce or increase harmful compliance depending on the model, with preference optimization playing a key role in preventing harmful compliance. Demonstration ordering shows strong recency bias, and models vary in how they handle refusal during in-context learning.

arXiv cs.AI · 6d ago

Multi-LCB: Extending LiveCodeBench to 12 Programming Languages

Multi-LCB extends LiveCodeBench to twelve programming languages, preserving its contamination controls and evaluation protocol. It reveals Python overfitting, language-specific biases, and significant performance gaps among LLMs across languages, establishing a rigorous benchmark for cross-language code generation.

arXiv cs.AI · 6d ago

FlowEdit: Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit enables frozen flow-matching TTS models to adapt pronunciation corrections over time using latent edits in text embeddings. It stores corrections in a Modern Hopfield Network and retrieves them via soft attention with similarity gating, reducing phoneme error rates by 92.7% on 312 multilingual proper nouns while preserving general-speech quality. Corrections take about 15 seconds to complete on a single GPU.

arXiv cs.AI · 6d ago
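The retrieval step in the FlowEdit summary (stored corrections, soft attention, similarity gating) can be illustrated with a small sketch. This is not the paper's implementation: the function name `retrieve_edit`, the cosine-similarity gate, and the `beta` and `gate_threshold` values are all assumptions chosen for illustration, and plain Python lists stand in for text-embedding vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_edit(query, keys, values, beta=8.0, gate_threshold=0.8):
    """Return a latent edit for `query`, or zeros if no stored key is similar enough.

    keys:   stored embeddings of words that were corrected (hypothetical layout)
    values: the latent edit associated with each key
    """
    dim = len(values[0]) if values else len(query)
    if not keys:
        return [0.0] * dim
    sims = [cosine(query, key) for key in keys]
    # Similarity gate (assumed form): leave unrelated text untouched.
    if max(sims) < gate_threshold:
        return [0.0] * dim
    # Soft attention over stored patterns: softmax(beta * similarity).
    top = max(sims)
    weights = [math.exp(beta * (s - top)) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]
    # The retrieved edit is the attention-weighted mix of stored edits.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
near = retrieve_edit([1.0, 0.0], keys, values)   # close to the first stored edit
far = retrieve_edit([0.7, 0.7], keys, values)    # below the gate, so all zeros
```

A larger `beta` makes retrieval closer to a nearest-neighbour lookup, while the gate keeps the frozen model's behaviour unchanged for inputs that resemble no stored correction.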

SARLO-80: VHR SAR-Optical-Text Dataset Released

SARLO-80 is a large-scale dataset combining very-high-resolution SAR SLC, aligned optical imagery, and natural-language descriptions. It includes 119,566 triplets from 2,500 global scenes across 72 countries, standardized to an 80 cm slant-range grid with pixel-level alignment and three caption variants. The dataset is publicly available on Hugging Face for multimodal learning benchmarks in native SAR geometry.

arXiv cs.AI · 6d ago

DeepSWIP: Counterfactual Reasoning in Neural Probabilistic Logic

DeepSWIP introduces a single-world counterfactual semantics for DeepProbLog, enabling causal reasoning through neural materialization and weighted model counting. It achieves exact inference under finite grounding and unique-supported-model assumptions, with experiments showing a 2.14× speedup and improved calibration over DeepTwin and AIPW estimators.

arXiv cs.AI · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arXiv cs.AI · 6d ago
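The pattern in the LedgerAgent summary (state kept in a ledger, rendered into the prompt, and checked against policy before a tool runs) might look roughly like the sketch below. Everything here is hypothetical: the `Ledger` class, the `refund_policy` rule, and the `guarded_call` helper are illustrative names, not the paper's API.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Task state kept separately from the conversation (hypothetical fields)."""
    facts: dict = field(default_factory=dict)

    def render(self) -> str:
        """Render the state as text that could be placed in the agent's prompt."""
        lines = [f"- {key}: {value}" for key, value in sorted(self.facts.items())]
        return "Current task state:\n" + "\n".join(lines)


def refund_policy(ledger, tool, args):
    """Hypothetical rule: refunds need a verified identity and a valid amount."""
    if tool != "issue_refund":
        return None
    if not ledger.facts.get("identity_verified"):
        return "identity not verified"
    if args.get("amount", 0) > ledger.facts.get("order_total", 0):
        return "refund exceeds order total"
    return None


def guarded_call(ledger, policies, tool, args, tools):
    """Check every policy against the ledger before the tool is executed."""
    for policy in policies:
        reason = policy(ledger, tool, args)
        if reason is not None:
            return {"ok": False, "blocked": reason}
    result = tools[tool](**args)
    ledger.facts[f"last_{tool}"] = result  # record the outcome in the ledger
    return {"ok": True, "result": result}


ledger = Ledger({"order_total": 50, "identity_verified": False})
tools = {"issue_refund": lambda amount: f"refunded {amount}"}
blocked = guarded_call(ledger, [refund_policy], "issue_refund", {"amount": 20}, tools)
ledger.facts["identity_verified"] = True
allowed = guarded_call(ledger, [refund_policy], "issue_refund", {"amount": 20}, tools)
```

Because the check reads the ledger rather than the chat history, a violation is caught even when the model's own recollection of the conversation has drifted.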

Cross-Attention Attribution for Style-Captioned Text-to-Speech

A new method adapts DAAM to speech diffusion models, analyzing how style captions influence TTS waveforms. It reveals style tokens have lower temporal variance than content tokens, with style attention correlating to pitch and energy, and peak style conditioning in early layers where attention entropy is minimized, indicating maximal selectivity.

arXiv cs.AI · 6d ago

Calibration in MoE Models Under Distribution Shift

This paper examines how mixture-of-experts models maintain calibration under distribution shift. It finds that expert-level calibration ensures overall model calibration in hard-routed models but is insufficient for soft-routed models. The authors propose adversarial reweighting to penalize calibration errors in routed aggregates, improving accuracy-calibration tradeoff across tasks and shifts.

arXiv cs.AI · 6d ago

G2Rec: Unified Framework for Generative Recommendation

G2Rec introduces a scalable framework that combines holistic graph-based user co-engagement modeling with semantic tokenization. It enables generative recommendation models to capture comprehensive, semantically grounded user interest prototypes without ground-truth user interests, outperforming existing methods in industrial-scale sequential recommendation.

arXiv cs.AI · 6d ago

How Transparent is DiffusionGemma?

DiffusionGemma has poor variable transparency due to high opaque serial depth, but this can be mitigated by an interpretable token bottleneck, reducing serial depth to 1.1× that of Gemma 4. Algorithmic transparency is more challenging in diffusion models due to dynamic token predictions, with early evidence of non-chronological reasoning, token smearing, and intermediate-context reasoning. DiffusionGemma is found to be similarly monitorable to Gemma 4.

arXiv cs.LG · 6d ago

FedMGS: Federated Modality-aware Graph Synthesis for Imbalanced MultiModal Learning

FedMGS addresses client- and node-level modality imbalance in federated graph learning by synthesizing latent semantic representations. It integrates an availability-aware graph encoder, prototype-guided semantic synthesizer, and reliability-calibrated fusion mechanism to recover missing modalities while preserving semantic alignment. Experiments show FedMGS achieves up to 17.41% performance gains over baselines across four tasks.

arXiv cs.LG · 6d ago

Style Diversity Outperforms Topic Diversity in Annotation-Free Synthetic Data

A new framework generates synthetic dialogue without human-annotated data, using only intent definitions. It incorporates topic and style attributes, with post-hoc stylization models Univ and Exam, and an LLM-as-a-judge filtering process. Models trained on the synthetic data reach up to 93.3% of the performance obtained with human-annotated data, and the results indicate that style diversity matters more than topic diversity for data utility.

arXiv cs.LG · 6d ago

Direct Advantage Estimation for Partially Observable Domains

Direct Advantage Estimation (DAE) is extended to partially observable domains with minimal modifications. A discrete latent dynamics model reduces computational overhead by efficiently approximating transition probabilities, enabling scalable and sample-efficient deep reinforcement learning in high-dimensional observation spaces.

arXiv cs.LG · 6d ago

DeepGaLA: Neural Surrogates with Uncertainty for PDE Inverse Problems

DeepGaLA is a neural-network surrogate that provides uncertainty-aware predictions for inverse problems in partial differential equations. It achieves accuracy comparable to Gaussian-process surrogates while maintaining efficiency in high-dimensional parameter spaces and incorporating differential-equation constraints.

arXiv cs.LG · 6d ago

Mechanistic Study of Representation Retention in Continual Learning

A synthetic framework reveals that superposition increases over time with transient dips at task boundaries, indicating boundary-specific interference. Higher feature sparsity promotes superposition without inevitable forgetting, provided representation strength is maintained. Task-level effective rank grows with sparsity, showing broader capacity usage under sparse conditions.

arXiv cs.LG · 6d ago

HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction

HEPTv2 achieves 98.6% tracking efficiency with 0.8% fake rate on TrackML, using only 15 ms inference time and 0.4 GB memory per event. It outperforms prior transformer and graph-based methods in efficiency and reduces latency by factors of 7 and 38–52, respectively, enabling real-time particle reconstruction at the HL-LHC.

arXiv cs.LG · 6d ago

Two-Stage Evolutionary Hyperparameter Optimization for PINNs

A two-stage evolutionary strategy improves Physics-Informed Neural Network performance by first screening hyperparameter candidates via low-fidelity training, then refining top candidates with gradient-based optimization. The approach reduces mean error significantly across Advection, Klein-Gordon, and Helmholtz equation problems under fixed computational budgets.
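The screen-then-refine idea in this summary can be sketched generically. The code below is an assumption-laden illustration, not the paper's algorithm: `two_stage_search` and its parameters are hypothetical, the "low-fidelity" and "full" objectives are toy functions of a single hyperparameter (a log10 learning rate), and stage 2 simply evaluates an expensive objective on the finalists rather than running gradient-based PINN training.

```python
import random


def two_stage_search(sample, mutate, cheap_loss, full_loss,
                     pop_size=16, generations=5, top_k=3, seed=0):
    """Evolutionary screening on a cheap score, then expensive evaluation of finalists."""
    rng = random.Random(seed)
    population = [sample(rng) for _ in range(pop_size)]
    # Stage 1: keep the better half by cheap score, refill with mutated copies.
    for _ in range(generations):
        population.sort(key=cheap_loss)
        parents = population[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    population.sort(key=cheap_loss)
    finalists = population[:top_k]
    # Stage 2: spend the expensive budget only on the screened finalists.
    return min(finalists, key=full_loss)


def sample(rng):
    """Draw a log10 learning rate in [-6, 0] (toy search space)."""
    return rng.uniform(-6.0, 0.0)


def mutate(x, rng):
    """Perturb a candidate and clamp it back into the search space."""
    return min(0.0, max(-6.0, x + rng.gauss(0.0, 0.3)))


def cheap_loss(x):
    """Toy low-fidelity proxy: biased, but correlated with the full objective."""
    return abs(x + 2.5)


def full_loss(x):
    """Toy stand-in for the error after full training."""
    return abs(x + 3.0)


best = two_stage_search(sample, mutate, cheap_loss, full_loss)
```

The split only pays off when the cheap score ranks candidates in roughly the same order as the full objective; with a misleading proxy, good candidates are discarded before stage 2 ever sees them.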