All articles — korshunov.ai

ToolBench-X: Benchmarking Tool-Using Agents Under Unreliable Environments

The authors introduce ToolBench-X, a new benchmark designed to evaluate large language model agents under recoverable tool-environment unreliability. Unlike existing benchmarks that assume clean and stable environments, this framework injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. The dataset contains executable multi-step tasks across diverse domains with deterministic tools and canonical final answers for automatic evaluation. Crucially, every injected instance remains solvable through valid recovery paths such as retrying, fallback, or verification. Experiments reveal a substantial reliability gap: agents that perform well with reliable tools often fail under these hazards. Further analysis indicates that failures stem from limited hazard diagnosis and ineffective recovery rather than tool-use volume or inference budget. Targeted recovery hints rescue many failed tasks, whereas test-time scaling yields more limited gains. These findings suggest that evaluation must shift focus from function-call accuracy to task completion in unreliable environments.
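The three recovery paths named in the summary (retrying, fallback, verification) can be illustrated with a small wrapper. This is only a sketch under stated assumptions: the function `call_with_recovery`, the toy tools, and the format check are hypothetical and are not taken from the benchmark itself.

```python
from typing import Callable, Optional, Sequence


def call_with_recovery(
    tools: Sequence[Callable[[str], str]],
    query: str,
    verify: Callable[[str], bool],
    max_retries: int = 2,
) -> Optional[str]:
    """Try each tool in order and return the first output that passes verify."""
    for tool in tools:                      # fallback: move on to the next tool
        for _ in range(max_retries + 1):    # retry: repeat after an error or a rejected output
            try:
                result = tool(query)
            except RuntimeError:
                continue                    # execution failure, try the call again
            if verify(result):              # verification: reject drifted output
                return result
    return None                             # no tool produced an acceptable answer


# Hypothetical tools for illustration only.
calls = {"flaky": 0}


def flaky_tool(q: str) -> str:
    """Fails on the first call, then works."""
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise RuntimeError("transient failure")
    return f"answer:{q}"


def drifting_tool(q: str) -> str:
    """Always returns output in an unexpected format."""
    return f"ANSWER={q}"


def is_well_formed(out: str) -> bool:
    return out.startswith("answer:")


print(call_with_recovery([drifting_tool, flaky_tool], "42", is_well_formed))
```

Here the drifting tool is rejected by the verification check, the wrapper falls back to the flaky tool, and a retry gets past its one-off failure.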

arxiv arXiv cs.CL · 5h ago

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Sparse Mixture-of-Experts (MoE) architectures often struggle with low-resource languages due to cross-lingual routing divergence that limits expert sharing. To address this, researchers propose SARA, a framework that transfers specialized capabilities from high-resource anchor languages to low-resource ones. SARA aligns the internal routing distributions of MoE layers using a symmetric Jensen-Shannon divergence constraint rather than operating on output logits. This approach encourages mechanistic consistency in expert selection across different languages. The authors evaluated the method on two large language models across five low-resource languages and three benchmarks. Results show SARA outperforms standard instruction tuning, achieving gains of +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct for Global-MMLU. These findings demonstrate that SARA effectively addresses performance bottlenecks in low-resource contexts.
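To make the routing-alignment idea concrete, the sketch below computes the Jensen-Shannon divergence between two expert-routing distributions. The example distributions are made up, and how SARA aggregates this quantity over tokens and layers is not specified in the summary, so this shows only the divergence itself.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats, skipping entries where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up routing probabilities over four experts for the same input
# expressed in an anchor language and in a low-resource language.
anchor_routing = [0.70, 0.20, 0.05, 0.05]
low_resource_routing = [0.10, 0.15, 0.60, 0.15]

print(js_divergence(anchor_routing, low_resource_routing))
```

Unlike plain KL divergence, the value does not change when the two arguments are swapped, which is why neither language's routing is treated as the reference in the penalty itself.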

arxiv arXiv cs.LG · 5h ago

Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance

The paper introduces HRLLI, a hierarchical reinforcement learning framework designed to improve sample efficiency by leveraging natural-language instructions. It addresses the limitation of existing approaches that treat instructions as static inputs, failing to account for their stage-dependent relevance in complex environments. The proposed method decomposes instructions into piecewise guidance elements that become relevant at different interaction stages. A novel Select-to-Act paradigm is formulated where a high-level semantic policy acts as a selector for the most relevant instruction piece based on the current state. This selected guidance conditions a low-level policy that executes environment actions, with both policies learned simultaneously to maximize augmented expected returns. Experiments on the RTFM benchmark demonstrate that HRLLI consistently outperforms strong instruction-conditioned RL baselines. The results confirm that explicitly modeling adaptive instruction selection significantly enhances reinforcement learning effectiveness.

arxiv arXiv cs.LG · 5h ago

SAFER: Reliability-Guided Adaptive Ensembling for Robust Test-Time Adaptation

The authors address the brittleness of test-time adaptation (TTA) under adversarially contaminated streams by proposing SAFER, a training-free framework for robust TTA. SAFER acts as an augmentation wrapper that replaces single-view predictions with a reliability-guided pooled predictor to stabilize online updates. For each test sample, the method generates stochastic augmentations and aggregates their outputs using correlation-weighted pooling combined with outlier detection. An adaptive-mixing extension is also introduced, which adjusts the weighting between original and augmented inputs based on feature disagreement signals to preserve clean performance. The researchers evaluated SAFER on PACS, VLCS, and OfficeHome benchmarks under PGD attacks at various rates. Results indicate that SAFER improves the resilience of TTA methods against adversarial attacks while maintaining competitive accuracy on clean data.

arxiv arXiv cs.LG · 5h ago

Parsimoniously Activated Dictionary Learning Links Sparsity and Storage to Generative Models

The paper introduces parsimoniously activated dictionary learning (PADL), a method imposing global regularization on the number of activated dictionary atoms. It demonstrates that PADL is equivalent to maximum a posteriori estimation under a structured generative model with auxiliary latent variables. This equivalence enables the derivation of generalization guarantees that are difficult to obtain from the original formulation. The authors provide an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy. This framework allows for data-driven estimation of optimal hyperparameters without manual tuning. An efficient and interpretable PADL algorithm is developed based on this theoretical connection. Experimental results show improved reconstruction performance under comparable sparsity levels on visual benchmarks. The method also demonstrates practical utility in accelerating inference for vision-language models.

arxiv arXiv cs.LG · 6h ago

ORBIT: Training-Free Multi-Attribute Behavioral Steering via Orthogonal Subspace Rotation

The authors introduce ORBIT, a training-free method for simultaneously controlling multiple behavioral attributes in large language models. Existing activation steering techniques struggle with multi-attribute control due to norm imbalance and directional cancellation when using naive vector summation. ORBIT addresses this by constructing a joint subspace from per-attribute steering planes via singular value decomposition. It then applies a single norm-preserving rotation within that subspace toward a combined target direction. The method incorporates adaptive per-token gating to identify necessary corrections at each position and an optional additive boost for weak projections. To evaluate the approach, the authors present TraitFactory, a benchmark focusing on behavioral tendencies rather than surface style. Experiments across Llama-3.2-3B, Qwen-2.5-7B, and Llama-3.1-8B models demonstrate that ORBIT achieves stronger and more balanced steering than baselines while preserving output coherence.
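The norm-preserving rotation step can be sketched in isolation. In the example below the plane (two orthonormal vectors `u` and `v`) and the angle are chosen by hand; ORBIT derives its subspace via SVD and gates the correction per token, neither of which is reproduced here.

```python
import math
from typing import List, Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def norm(x: Sequence[float]) -> float:
    return math.sqrt(dot(x, x))


def rotate_in_plane(h: Sequence[float], u: Sequence[float],
                    v: Sequence[float], theta: float) -> List[float]:
    """Rotate h by theta inside span{u, v}; u and v must be orthonormal.

    The component of h outside the plane is left untouched, so the
    overall norm of h is preserved.
    """
    a, b = dot(h, u), dot(h, v)                       # coordinates of h in the plane
    a2 = a * math.cos(theta) - b * math.sin(theta)    # rotated coordinates
    b2 = a * math.sin(theta) + b * math.cos(theta)
    return [hi + (a2 - a) * ui + (b2 - b) * vi for hi, ui, vi in zip(h, u, v)]


h = [3.0, 0.0, 4.0]          # toy activation vector
u = [1.0, 0.0, 0.0]          # assumed current direction in the plane
v = [0.0, 1.0, 0.0]          # assumed target direction in the plane
steered = rotate_in_plane(h, u, v, math.pi / 2)
print(steered, norm(h), norm(steered))
```

Adding a steering vector would change the activation's length; rotating moves it toward the target direction while keeping the length fixed, which is the property the summary attributes to ORBIT.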

arxiv arXiv cs.LG · 6h ago

Reference-Free Assessment of Physical Consistency in World Model-based Video Generation

The authors introduce reference-free measures for evaluating the physical consistency of generated videos by combining relative and absolute fidelity assessments. This approach addresses the gap in physical fidelity that often prevents video generation tools like WorldGym or WorldEval from accurately reproducing real-world task success rates for vision-language-action (VLA) models. Unlike existing methods requiring costly human voting or unavailable ground-truth references, the new framework utilizes DROID-SLAM and SEA-RAFT to quantify inconsistencies. Motivated by WorldScore, the relative consistency assessment filters videos to improve task success rates by over 8%. Additionally, the absolute assessment enables spatio-temporal localization to visualize when and where physical artifacts occur in the generated content.

arxiv arXiv cs.LG · 6h ago

Kiwano: An Open-Source PyTorch Toolkit for Speaker Verification Research

Researchers have introduced Kiwano, an open-source toolkit designed to advance research and evaluation in the field of speaker verification. Built on PyTorch, this lightweight yet extensible framework provides standardized recipes, pretrained models, and integration of widely used architectures. The project emphasizes reproducibility by delivering transparent training pipelines, unified evaluation protocols, and ready-to-use baselines across multiple corpora. Beyond standard training and inference capabilities, Kiwano includes specialized tools for benchmarking, experiment tracking, and the rapid prototyping of new architectures. To encourage community adoption, the toolkit is distributed under the Apache 2.0 license and is accompanied by comprehensive documentation and reproducible experiments. By lowering entry barriers and standardizing evaluation practices, Kiwano aims to serve as a valuable resource for both academic research and applied development. The project is publicly available on GitHub at https://github.com/kiwano-toolkit/kiwano/.

arxiv arXiv cs.LG · 6h ago

Multigrid Training for Molecular Generation using Graph Neural Networks

The authors introduce a multigrid training strategy to address the high computational costs and instability associated with modeling biochemical molecular systems at full resolution. This approach leverages low-resolution optimization to accelerate learning at higher resolutions by transferring parameters across different discretizations. For graph-based molecular representations, the method progressively transfers parameters from a coarse graph to increasingly finer graphs using biased random walk upsampling. In 3D molecular generation, structures are voxelized at multiple resolutions, allowing a coarse-resolution conditional Variational Autoencoder (CVAE) to be pretrained first. Shape-compatible convolutional parameters are then transferred from the coarse model to initialize a fine-resolution CVAE. Numerical experiments on receptor-conditioned 3D ligand generation demonstrate that this method accelerates convergence compared to training from scratch. Additionally, the study shows that multigrid training improves generalization capabilities for molecular generation tasks.

media r/LocalLLaMA · 6h ago

Community Inquiry on Running DwarfStar with DeepSeek V4 Flash on DGX Spark

A Reddit user in the r/LocalLLaMA community is asking for experiences regarding the use of DwarfStar (DS4) with the DeepSeek V4 Flash model on a single NVIDIA DGX Spark device. The inquiry highlights technical specifications suggesting that DS4's Mixture of Experts approach and unified memory strategy allow for loading the model with 80 billion active parameters and full maximum context length. The poster references external resources, including a GitHub repository by antirez and a demonstration video, to support these claims about performance capabilities. The discussion seeks feedback on the practical viability of this setup, specifically questioning the quality of agentic coding tasks performed under these constraints. This request reflects ongoing interest in optimizing large language model inference on consumer-grade or compact hardware configurations.

media r/LocalLLaMA · 6h ago

Gemma4-26B-A4B & 31B-QAT Uncensored Balanced Released with MTP Speed Boosts

HauhauCS has released two new uncensored, balanced versions of the Gemma 4 models: Gemma4-26B-A4B and Gemma4-31B-QAT. Both variants incorporate Multi-Token Prediction (MTP) draft heads to enable speculative decoding, resulting in significant inference speed improvements. The 26B-A4B model achieves approximately a 35% speed boost, while the 31B model sees a 53% increase; output quality is reported as identical because the main model verifies the drafted tokens. These releases use quantization-aware training (QAT), making Q4_K_M the optimal format as higher precision offers no quality gains for these specific models. The 26B-A4B is a Mixture of Experts architecture with roughly 4 billion active parameters per token, whereas the 31B variant is a dense model offering higher capability for users with sufficient VRAM. Both models include vision support via mmproj files and maintain a 262K context window. The author notes that GenRM testing resulted in zero refusals across 465 prompts, confirming their uncensored nature.
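The speed-up mechanism mentioned here, speculative decoding, can be sketched with toy models. The two "models" below are hypothetical deterministic functions, not Gemma or its MTP heads, and the sketch covers only greedy decoding: a cheap draft proposes several tokens, the target checks them, and generation keeps the target's own token at every position.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]   # maps a token prefix to the next token


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 3) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        proposal: List[int] = []
        for _ in range(k):                      # draft proposes k tokens in a row
            proposal.append(draft(out + proposal))
        for tok in proposal:                    # a real system verifies these in one batched pass
            expected = target(out)
            out.append(expected)                # always keep the target's token
            if expected != tok or len(out) - len(prompt) >= n_new:
                break                           # first mismatch: discard the remaining drafts
    return out[len(prompt):]


def target_model(seq: List[int]) -> int:
    """Toy stand-in for the full model."""
    return (sum(seq) + len(seq)) % 7


def draft_model(seq: List[int]) -> int:
    """Toy draft that disagrees with the target on some prefixes."""
    guess = target_model(seq)
    return guess if len(seq) % 4 else (guess + 1) % 7


print(speculative_decode(target_model, draft_model, [1, 2], 10))
print(greedy_decode(target_model, [1, 2], 10))
```

Both calls print the same sequence, since a wrong draft token is always replaced by the target's choice. The sequential loop here does not show the actual speed gain, which comes from verifying all drafted tokens in a single forward pass.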

arxiv arXiv cs.LG · 6h ago

HyperAdapter: Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers

The authors propose HyperAdapter, a novel parameter-efficient fine-tuning method that adapts vision transformers in hyperedge space rather than token space. Existing adapter-based methods typically perform independent adaptations for each token, which overlooks structured relationships and can lead to redundant updates. HyperAdapter constructs a soft hypergraph over ViT tokens using prototype-based assignments to enable group-aware adaptation. The architecture aggregates token features into latent hyperedge representations and applies lightweight bottleneck adaptation at the hyperedge level. Updates are then diffused back to individual tokens via the hypergraph incidence structure, injecting an explicit structural inductive bias. Extensive experiments across diverse visual benchmarks demonstrate that this approach consistently outperforms strong PEFT baselines under comparable parameter budgets. The results highlight significant gains on tasks requiring structured reasoning and suggest that the choice of adaptation space is a critical dimension for efficient transfer.

arxiv arXiv cs.LG · 6h ago

Shift-Invariant Variance Estimator Eliminates Minimization Bias in Local Learning Coefficient Estimation

Singular Learning Theory uses the Local Learning Coefficient to quantify neural network loss landscape geometry, but mean-energy estimators rely on an additive loss baseline. During off-equilibrium training phases, this minimum is unknown, and substituting it with noisy mini-batch losses introduces systematic minimization bias. The authors propose the Shift-Invariant Variance Estimator (SIVE) to structurally eliminate this unknown baseline through the variance operator. By combining SIVE with a correction derived from the Law of Total Variance, the method separates geometric loss fluctuations from evaluation noise. Controlled experiments on analytically tractable toy models demonstrate that SIVE recovers expected finite-temperature geometric signals where anchored mean estimators fail. Applied to deep neural networks, SIVE serves as a robust diagnostic for tracking structural phase transitions throughout training.
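Two facts carry the argument in this summary: variance is unchanged by an additive shift, and the Law of Total Variance splits total variance into within-group and between-group parts. The sketch below checks both numerically on made-up numbers. It does not reproduce the SIVE estimator, whose exact formula and the way it assigns the two variance terms are not given in the summary.

```python
import statistics
from typing import List, Sequence, Tuple


def mean_energy(losses: Sequence[float], baseline: float) -> float:
    """Mean loss above an additive baseline; changes whenever the baseline does."""
    return statistics.fmean(losses) - baseline


def loss_variance(losses: Sequence[float]) -> float:
    """Population variance of the losses; any constant shift cancels out."""
    return statistics.pvariance(losses)


def total_variance_split(batches: List[List[float]]) -> Tuple[float, float]:
    """For equal-sized batches, return (mean within-batch variance,
    variance of batch means); their sum equals the variance of all values."""
    within = statistics.fmean([statistics.pvariance(b) for b in batches])
    between = statistics.pvariance([statistics.fmean(b) for b in batches])
    return within, between


losses = [2.31, 2.05, 2.48, 2.17, 2.26]        # made-up sampled losses
shifted = [x - 1.9 for x in losses]            # same losses minus an assumed baseline

print(mean_energy(losses, 1.9), mean_energy(losses, 2.0))   # depends on the baseline
print(loss_variance(losses), loss_variance(shifted))        # agrees up to rounding

within, between = total_variance_split([[1.0, 3.0], [5.0, 9.0]])
print(within, between, statistics.pvariance([1.0, 3.0, 5.0, 9.0]))
```

A mean-based quantity moves by exactly the error in the assumed baseline, while the variance-based quantity needs no baseline at all.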

arxiv arXiv cs.LG · 6h ago

Efficient CNN with Transfer Learning for Multi-Cancer Detection

A study introduces a lightweight convolutional neural network enhanced with transfer learning for multi-cancer detection using biomedical images. The architecture aims to reduce computational complexity while maintaining high classification performance for deployment in resource-constrained environments. Researchers evaluated the model on three tumor datasets comprising brain MRI and lung and kidney CT scans. The system achieved test accuracies of 90.85%, 98.64%, and 99.92% for brain, lung, and kidney cancer respectively via five-fold stratified cross-validation. Transfer learning was employed by pretraining on one cancer type and fine-tuning on others, requiring only 20 additional epochs to match scratch-trained models. The fine-tuning process updates the classification part of the CNN and takes approximately 0.014 seconds per image per epoch on an NVIDIA GeForce GTX 960. Comparative evaluations demonstrate that this model outperforms state-of-the-art architectures such as Xception, VGG16, VGG19, MobileNetV2, and DenseNet121.

blog Simon Willison · 6h ago

Simon Willison converts MDN browser compatibility data into a SQLite database

Inspired by Mozilla's new MDN MCP service, Simon Willison has converted the comprehensive mdn/browser-compat-data repository into a SQLite database. The project utilizes a script generated by Claude Code for web (Opus 4.8) to perform this conversion using sqlite-utils. The resulting database is approximately 66MB in size and is hosted on GitHub with open CORS headers to facilitate direct access. To automate the process, a GitHub Actions workflow was built using Codex Desktop (GPT-5.5) to force-push the updated database to an orphan branch named db. Users can download the final browser-compat.db file directly from the repository or explore its contents via Datasette Lite.
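The general shape of such a conversion, flattening nested JSON-like data into rows of a SQLite table, can be sketched as follows. This is not Willison's script: it uses Python's standard `sqlite3` module rather than sqlite-utils, and the table schema and sample records are invented for illustration and are not real compatibility data.

```python
import sqlite3
from typing import Dict

# Hypothetical, heavily simplified stand-in for nested compatibility data:
# feature name -> browser -> first supporting version (illustrative values).
sample: Dict[str, Dict[str, str]] = {
    "example.feature_a": {"browser_x": "10", "browser_y": "4"},
    "example.feature_b": {"browser_x": "12", "browser_y": "7"},
}


def build_db(data: Dict[str, Dict[str, str]], path: str = ":memory:") -> sqlite3.Connection:
    """Flatten the nested mapping into one row per (feature, browser)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE support ("
        "feature TEXT, browser TEXT, version_added TEXT, "
        "PRIMARY KEY (feature, browser))"
    )
    rows = [
        (feature, browser, version)
        for feature, browsers in data.items()
        for browser, version in browsers.items()
    ]
    conn.executemany("INSERT INTO support VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = build_db(sample)
for row in conn.execute("SELECT * FROM support ORDER BY feature, browser"):
    print(row)
```

Passing a file path instead of `":memory:"` writes a database file, which is the kind of artifact that can then be published and queried directly.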

arxiv arXiv cs.LG · 7h ago

P4IR: Reinforcement Learning Enhances Automated Code Compliance Systems

A new framework named P4IR addresses the issue of hallucinated rules in large language model-based automated code compliance systems. This two-stage approach first employs supervised fine-tuning to instill domain knowledge into the model. It then utilizes Group Relative Policy Optimization to improve the accuracy of generated high-level code skeletons. The method achieved reductions of up to 23.8% in tree edit distance and 38.6% in token-level Levenshtein distance compared to supervised fine-tuning baselines. Comparative analysis shows that P4IR outperforms leading models like Claude Opus, GPT-5.2, and Qwen-3-Max in zero-shot settings. Additionally, the reinforcement learning stage produced a statistically significant reduction in false positives. This combination of techniques offers a path toward more reliable automated code compliance.
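One of the two reported metrics, token-level Levenshtein distance, is simple to state in code. The sketch below is a standard dynamic-programming implementation; whitespace tokenization and the sample strings are assumptions made for the example, since the summary does not describe the paper's tokenizer.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,                # delete x
                curr[j - 1] + 1,            # insert y
                prev[j - 1] + cost,         # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Token-level comparison of a generated skeleton against a reference,
# using whitespace tokenization as a stand-in for a real tokenizer.
generated = "if x > 0 : return x".split()
reference = "if x >= 0 : return y".split()
print(levenshtein(generated, reference))
```

Because the function works on any sequences, passing lists of tokens gives a token-level distance and passing plain strings gives the usual character-level one.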

arxiv arXiv cs.LG · 7h ago

Asymptotic Signal Subspace Recovery in Softmax Attention Models

This study investigates the theoretical principles behind softmax-attention mechanisms by analyzing a stylized model where a query vector is learned via stochastic gradient ascent. The authors exploit the model's symmetry to derive a population objective and characterize the limiting ordinary differential equation governing the learning dynamics. By employing tools from stochastic approximation and dynamical systems theory, they establish a rigorous connection between the stochastic learning algorithm and its deterministic limit. Under suitable high-dimensional scaling assumptions and standard step-size conditions, the research demonstrates that the learned query converges almost surely to the one-dimensional signal subspace. This convergence implies that the query asymptotically recovers the latent informative direction up to an intrinsic sign ambiguity. The findings provide a theoretical foundation for understanding attention as a signal extraction procedure in high-dimensional noisy environments.

arxiv arXiv cs.LG · 7h ago

QeHDC: Hyperdimensional Computing based on Quantum-enhanced binding and SuperClass Construction

The authors propose QeHDC, a novel framework extending classical Hyperdimensional Computing by leveraging quantum mechanical properties for enhanced computational efficiency. This approach utilizes a one-pass training method that employs sinusoidal and quantum encoding to project classical data into quantum amplitude states. A key innovation is the introduction of a reference-state-based quantum binding operation realized through specific quantum circuits. Additionally, the framework implements a density-matrix-based superclass generation strategy using eigenvalue decomposition to extract critical quantum state features. These mechanisms enable more accurate and robust class representations for classification tasks. Experimental evaluations on standard benchmark datasets demonstrate superior performance compared to traditional classical and existing quantum-enhanced methods. The results also highlight the approach's robustness to noise and computational feasibility, suggesting practical benefits for future quantum-inspired paradigms.

arxiv arXiv cs.LG · 7h ago

GaRA: Graph-aware LoRA Generation for Enhancing LLMs on Graph Tasks

Graph neural networks often exhibit limited transferability due to their tight coupling with dataset-specific feature spaces, whereas language models offer flexible generalization through a unified interface. Existing methods for adapting language models to graph tasks struggle to encode whole-graph information, which can lead to significant information loss and suboptimal understanding. To address this limitation, the authors propose GaRA, a novel Graph-aware LoRA generation model that implements a weight-level information injection paradigm. This approach generates task-specific weight updates conditioned on original graph structures, allowing them to interact directly with hidden representations. The method constrains the norm of these generated updates to inject whole-graph information while avoiding optimization bias inherent in standard weight generation. Empirical studies demonstrate that GaRA consistently outperforms baseline methods across various zero-shot graph learning tasks.

arxiv arXiv cs.LG · 7h ago

LLMs Determine Causal Structure via Difference-Making Logic

The article addresses the puzzle of how large language models acquire causal structure despite the limitations of standard formalisms like Judea Pearl's interventionist approach and the Neyman-Rubin framework. It argues that LLMs utilize a specific inductive method known as variational induction, which relies on difference-making logic. During training, models process vast amounts of text from diverse contexts to identify what constitutes a difference-maker or an indifference-maker within word sequences. The analysis examines how architectural components, specifically token embeddings and self-attention mechanisms, facilitate this variational induction process. This logical framework fundamentally parallels the experimental method used in science. In both cases, causal relations are derived by systematically varying individual circumstances to observe their influence on a phenomenon.