Training methods — korshunov.ai


Multigrid Training for Molecular Generation using Graph Neural Networks

The authors introduce a multigrid training strategy to address the high computational costs and instability associated with modeling biochemical molecular systems at full resolution. This approach leverages low-resolution optimization to accelerate learning at higher resolutions by transferring parameters across different discretizations. For graph-based molecular representations, the method progressively transfers parameters from a coarse graph to increasingly finer graphs using biased random walk upsampling. In 3D molecular generation, structures are voxelized at multiple resolutions, allowing a coarse-resolution conditional Variational Autoencoder (CVAE) to be pretrained first. Shape-compatible convolutional parameters are then transferred from the coarse model to initialize a fine-resolution CVAE. Numerical experiments on receptor-conditioned 3D ligand generation demonstrate that this method accelerates convergence compared to training from scratch. Additionally, the study shows that multigrid training improves generalization capabilities for molecular generation tasks.
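
The coarse-to-fine initialization step can be illustrated with a minimal sketch. It assumes parameters are stored as name-to-array dictionaries and that "shape-compatible" means equal name and equal shape; the layer names and shapes below are hypothetical and are not taken from the paper.

```python
import numpy as np

def transfer_compatible(coarse_params, fine_params):
    """Copy every coarse parameter whose name and shape match a fine parameter."""
    initialized = dict(fine_params)
    copied = []
    for name, value in coarse_params.items():
        if name in fine_params and fine_params[name].shape == value.shape:
            initialized[name] = value.copy()
            copied.append(name)
    return initialized, copied

rng = np.random.default_rng(0)
# Hypothetical layers: the 3D conv kernel does not depend on the voxel grid size,
# while the dense layer does (4^3 vs 8^3 voxels), so only the kernel is transferred.
coarse = {"conv1.weight": rng.normal(size=(8, 1, 3, 3, 3)),
          "fc.weight": rng.normal(size=(16, 8 * 4**3))}
fine = {"conv1.weight": np.zeros((8, 1, 3, 3, 3)),
        "fc.weight": np.zeros((16, 8 * 8**3))}

fine_init, copied = transfer_compatible(coarse, fine)
print(copied)
```

Parameters that are not copied keep their fresh initialization, so the fine model only has to learn the resolution-dependent layers from scratch.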

arxiv arXiv cs.LG · 17h ago

HyperAdapter: Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers

The authors propose HyperAdapter, a novel parameter-efficient fine-tuning method that adapts vision transformers in hyperedge space rather than token space. Existing adapter-based methods typically perform independent adaptations for each token, which overlooks structured relationships and can lead to redundant updates. HyperAdapter constructs a soft hypergraph over ViT tokens using prototype-based assignments to enable group-aware adaptation. The architecture aggregates token features into latent hyperedge representations and applies lightweight bottleneck adaptation at the hyperedge level. Updates are then diffused back to individual tokens via the hypergraph incidence structure, injecting an explicit structural inductive bias. Extensive experiments across diverse visual benchmarks demonstrate that this approach consistently outperforms strong PEFT baselines under comparable parameter budgets. The results highlight significant gains on tasks requiring structured reasoning and suggest that the choice of adaptation space is a critical dimension for efficient transfer.
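
The aggregate, adapt, and diffuse steps described above can be sketched in NumPy. This is an illustration under assumptions, not the authors' implementation: the dot-product softmax assignment, the ReLU bottleneck, and all names (`prototypes`, `w_down`, `w_up`) are hypothetical choices.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hyperedge_adapter(tokens, prototypes, w_down, w_up):
    # Soft incidence matrix: each token spreads unit mass over k hyperedges.
    incidence = softmax(tokens @ prototypes.T, axis=1)            # (n, k)
    # Aggregate token features into hyperedge features (weighted means).
    col_mass = incidence.sum(axis=0, keepdims=True) + 1e-8
    edges = (incidence / col_mass).T @ tokens                     # (k, d)
    # Lightweight bottleneck adaptation applied per hyperedge.
    delta = np.maximum(edges @ w_down, 0.0) @ w_up                # (k, d)
    # Diffuse the hyperedge updates back to tokens through the incidence matrix.
    return tokens + incidence @ delta, incidence

rng = np.random.default_rng(0)
n, d, k, r = 6, 4, 3, 2                  # tokens, width, hyperedges, bottleneck rank
tokens = rng.normal(size=(n, d))
prototypes = rng.normal(size=(k, d))
w_down = rng.normal(size=(d, r))
w_up = np.zeros((r, d))                  # zero-init: the adapter starts as identity

adapted, incidence = hyperedge_adapter(tokens, prototypes, w_down, w_up)
```

Because the bottleneck acts on k hyperedges rather than n tokens, tokens assigned to the same hyperedge receive correlated updates, which is the group-aware behavior the summary describes.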

arxiv arXiv cs.LG · 18h ago

Shift-Invariant Variance Estimator Eliminates Minimization Bias in Local Learning Coefficient Estimation

Singular Learning Theory uses the Local Learning Coefficient to quantify neural network loss landscape geometry, but mean-energy estimators rely on an additive loss baseline. During off-equilibrium training phases, this minimum is unknown, and substituting it with noisy mini-batch losses introduces systematic minimization bias. The authors propose the Shift-Invariant Variance Estimator (SIVE) to structurally eliminate this unknown baseline through the variance operator. By combining SIVE with a correction derived from the Law of Total Variance, the method separates geometric loss fluctuations from evaluation noise. Controlled experiments on analytically tractable toy models demonstrate that SIVE recovers expected finite-temperature geometric signals where anchored mean estimators fail. Applied to deep neural networks, SIVE serves as a robust diagnostic for tracking structural phase transitions throughout training.
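
The two ingredients named above, shift invariance of the variance and the Law of Total Variance correction, can be checked numerically. The sketch below is not the SIVE estimator itself: the gamma-distributed "geometric" losses, the Gaussian evaluation noise, and the known noise level are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: losses fluctuate above an unknown baseline of 0.5,
# and each evaluation adds independent mini-batch noise of known variance.
true_loss = 0.5 + rng.gamma(shape=2.0, scale=0.1, size=50_000)
noise_std = 0.05
observed = true_loss + rng.normal(scale=noise_std, size=true_loss.size)

# Shift invariance: adding any constant baseline leaves the variance unchanged,
# whereas a mean-based estimate moves by exactly that constant.
shift = 3.0
var_plain = observed.var()
var_shifted = (observed + shift).var()
mean_gap = (observed + shift).mean() - observed.mean()

# Law of Total Variance: Var(observed) = Var(true_loss) + E[noise variance],
# so subtracting the noise variance isolates the geometric fluctuations.
corrected = var_plain - noise_std**2
print(var_plain, var_shifted, corrected, true_loss.var())
```

The mean-based quantity inherits the full baseline error, while the variance-based one never needs the baseline at all.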

arxiv arXiv cs.LG · 18h ago

P4IR: Reinforcement Learning Enhances Automated Code Compliance Systems

A new framework named P4IR addresses the issue of hallucinated rules in large language model-based automated code compliance systems. This two-stage approach first employs supervised fine-tuning to instill domain knowledge into the model. It then utilizes Group Relative Policy Optimization to improve the accuracy of generated high-level code skeletons. The method achieved reductions of up to 23.8% in tree edit distance and 38.6% in token-level Levenshtein distance compared to supervised fine-tuning baselines. Comparative analysis shows that P4IR outperforms leading models like Claude Opus, GPT-5.2, and Qwen-3-Max in zero-shot settings. Additionally, the reinforcement learning stage produced a statistically significant reduction in false positives. This combination of techniques offers a path toward more reliable automated code compliance.
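
A minimal sketch of the group-relative advantage at the core of Group Relative Policy Optimization, paired with the token-level Levenshtein distance mentioned above. Using the negative edit distance to a reference as the reward is an assumption for illustration; the summary does not state P4IR's reward, and the token sequences are invented.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between two token sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical reference skeleton and a group of sampled candidates.
reference = "if height > limit : flag".split()
candidates = [
    "if height > limit : flag".split(),
    "if height < limit : flag".split(),
    "flag height".split(),
    "if width > limit : pass".split(),
]
rewards = [-levenshtein(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Each candidate is scored only against the other samples for the same prompt, so no separate value network is needed.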

arxiv arXiv cs.LG · 18h ago

Asymptotic Signal Subspace Recovery in Softmax Attention Models

This study investigates the theoretical principles behind softmax-attention mechanisms by analyzing a stylized model where a query vector is learned via stochastic gradient ascent. The authors exploit the model's symmetry to derive a population objective and characterize the limiting ordinary differential equation governing the learning dynamics. By employing tools from stochastic approximation and dynamical systems theory, they establish a rigorous connection between the stochastic learning algorithm and its deterministic limit. Under suitable high-dimensional scaling assumptions and standard step-size conditions, the research demonstrates that the learned query converges almost surely to the one-dimensional signal subspace. This convergence implies that the query asymptotically recovers the latent informative direction up to an intrinsic sign ambiguity. The findings provide a theoretical foundation for understanding attention as a signal extraction procedure in high-dimensional noisy environments.

arxiv arXiv cs.LG · 18h ago

QeHDC: Hyperdimensional Computing based on Quantum-enhanced binding and SuperClass Construction

The authors propose QeHDC, a novel framework extending classical Hyperdimensional Computing by leveraging quantum mechanical properties for enhanced computational efficiency. This approach utilizes a one-pass training method that employs sinusoidal and quantum encoding to project classical data into quantum amplitude states. A key innovation is the introduction of a reference-state-based quantum binding operation realized through specific quantum circuits. Additionally, the framework implements a density-matrix-based superclass generation strategy using eigenvalue decomposition to extract critical quantum state features. These mechanisms enable more accurate and robust class representations for classification tasks. Experimental evaluations on standard benchmark datasets demonstrate superior performance compared to traditional classical and existing quantum-enhanced methods. The results also highlight the approach's robustness to noise and computational feasibility, suggesting practical benefits for future quantum-inspired paradigms.

arxiv arXiv cs.LG · 18h ago

GaRA: Graph-aware LoRA Generation for Enhancing LLMs on Graph Tasks

Graph neural networks often exhibit limited transferability due to their tight coupling with dataset-specific feature spaces, whereas language models offer flexible generalization through a unified interface. Existing methods for adapting language models to graph tasks struggle to encode whole-graph information, which can lead to significant information loss and suboptimal understanding. To address this limitation, the authors propose GaRA, a novel Graph-aware LoRA generation model that implements a weight-level information injection paradigm. This approach generates task-specific weight updates conditioned on original graph structures, allowing them to interact directly with hidden representations. The method constrains the norm of these generated updates to inject whole-graph information while avoiding optimization bias inherent in standard weight generation. Empirical studies demonstrate that GaRA consistently outperforms baseline methods across various zero-shot graph learning tasks.
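
The norm constraint on a generated low-rank update can be sketched as a simple rescaling. In GaRA the LoRA factors are produced by a generator conditioned on the graph; here they are random stand-ins, and the Frobenius-norm cap and the name `max_norm` are assumptions rather than the paper's exact constraint.

```python
import numpy as np

def constrained_lora_update(a, b, max_norm):
    """Form the low-rank update B @ A and rescale it to a Frobenius-norm budget."""
    delta = b @ a
    norm = np.linalg.norm(delta)          # Frobenius norm for a 2-D array
    if norm > max_norm:
        delta = delta * (max_norm / norm)
    return delta

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2
a = rng.normal(size=(rank, d_in))         # stand-in for a graph-conditioned factor
b = rng.normal(size=(d_out, rank))        # stand-in for a graph-conditioned factor
weight = rng.normal(size=(d_out, d_in))   # frozen base weight

delta = constrained_lora_update(a, b, max_norm=0.5)
adapted_weight = weight + delta
```

Rescaling keeps the direction of the generated update while bounding how far it can move the frozen weights.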

arxiv arXiv cs.LG · 18h ago

Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

The authors identify a critical flaw termed the Variance Trap, which arises when stochastic root-finding problems are forced into minimization frameworks via squared residuals. Standard bilevel minimization algorithms require estimating hypergradients involving implicit Jacobians that act as noise amplifiers in stochastic settings. To address this, the paper formalizes Root-Finding Bilevel Optimization (RF-BO) as a distinct problem class to bypass this pathology. A Jacobian-free solution using Two-Time-Scale Stochastic Approximation (TTSA) is proposed to update directly along the root error. The study provides the first non-asymptotic convergence guarantees for TTSA in this setting under Markovian noise. Experiments show a 2.6% top-1 accuracy gain in SimCLR and 17x faster convergence in non-linear ODE control compared to baselines. Additionally, the framework achieves significantly improved entropy stability in reinforcement learning and an 11.1% quality improvement in generative modeling.
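
A toy instance of a Jacobian-free two-time-scale update, assuming a scalar problem chosen for illustration: the inner variable tracks y*(x) = x, and the outer variable seeks the root of F(x, y) = x + y - 4, so the solution is x = y = 2. The step-size schedules and noise level are assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = 0.0, 0.0

for k in range(20_000):
    alpha = 1.0 / (k + 10)             # slow step size (outer variable)
    beta = 1.0 / (k + 10) ** 0.6       # fast step size (inner variable)

    # Inner update: move y along the noisy residual of y*(x) = x.
    y += beta * (x - y + 0.1 * rng.normal())

    # Outer update: step directly along the noisy root error F(x, y).
    # No squared residual and no implicit Jacobian are formed.
    x -= alpha * (x + y - 4.0 + 0.1 * rng.normal())

print(x, y)
```

Minimizing the squared residual instead would require differentiating through y*(x); the direct update only ever evaluates the residual itself.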

arxiv arXiv cs.LG · 18h ago

RQ-TTSA: Distribution-Aware Robust Bilevel Optimization with Quantile-Guided Huber Updates

The authors propose RQ-TTSA, a distribution-aware framework designed to address instability in bilevel optimization caused by heavy-tailed stochastic noise. Unlike existing variance-reduction techniques that rely on myopic magnitude checks, this method uses historical gradient buffers to estimate rolling quantiles for adaptive Huber-style clipping. This approach preserves local optimization geometry while strictly bounding effective variance under nonconvex-strongly convex assumptions with infinite-variance noise. Theoretical analysis derives a convergence rate of O(T^(-(p-1)/(3p-2))) that recovers optimal dependence on the heavy-tailed parameter p. Empirical evaluations across six diverse tasks, including vision benchmarks and offline reinforcement learning, show consistent outperformance over state-of-the-art baselines. RQ-TTSA eliminates divergence spikes and ensures stable convergence with negligible computational overhead of approximately 2.7 percent.
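
Quantile-guided clipping from a history buffer can be sketched as follows. This is an illustration of the idea, not the RQ-TTSA update: the window size, the quantile level, and clipping by gradient norm are assumptions.

```python
from collections import deque
import numpy as np

class QuantileClipper:
    """Clip gradients to a rolling quantile of recently observed gradient norms."""

    def __init__(self, window=100, q=0.9):
        self.norms = deque(maxlen=window)
        self.q = q

    def __call__(self, grad):
        norm = float(np.linalg.norm(grad))
        # The threshold uses history only, so one outlier cannot raise its own cap.
        threshold = np.quantile(list(self.norms), self.q) if self.norms else norm
        self.norms.append(norm)
        if norm > threshold:
            # Huber-style behavior: identity below the threshold, bounded above it.
            grad = grad * (threshold / norm)
        return grad

clip = QuantileClipper(window=50, q=0.9)
for _ in range(50):
    clip(np.array([1.0, 0.0]))             # typical gradients with norm 1
spike = clip(np.array([0.0, 100.0]))       # heavy-tailed outlier
```

Gradients below the threshold pass through unchanged, which is how this kind of clipping preserves local geometry while bounding the effect of rare spikes.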

arxiv arXiv cs.LG · 18h ago

VRA-FedSGD: Variance-Reduced Federated Learning for Heavy-Tailed Noise

The authors propose VRA-FedSGD, a variance-reduction based algorithm designed for federated learning in environments with heavy-tailed gradient and communication noise. This approach addresses challenges prevalent in large-scale machine learning over wireless networks and Internet of Things deployments. The method employs momentum variance reduction combined with nonlinear mapping to mitigate heavy-tailed gradient noise. It also utilizes a variance-reduced aggregation mechanism to suppress heavy-tailed communication noise. For nonconvex objective functions, VRA-FedSGD achieves a mean convergence rate of O(K^(-(p-1)/(2p-1))), where p is the tail index. In the almost sure sense, it reaches a rate of Õ(K^(-(1-1/(p-ε)))) for strongly convex objectives, with ε being an arbitrarily small constant. Simulated experiments on logistic regression with real-world data verify the algorithm's effectiveness.
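
To make the two rates concrete, the exponents can be evaluated at a specific tail index. The worked values below assume p = 2 is admissible, as is common when the tail index ranges over (1, 2]; the summary does not state the range.

```latex
\[
\left.\frac{p-1}{2p-1}\right|_{p=2} = \frac{1}{3}
\;\Longrightarrow\; O\!\left(K^{-1/3}\right),
\qquad
\left.1-\frac{1}{p-\varepsilon}\right|_{p=2} = \frac{1-\varepsilon}{2-\varepsilon}
\;\Longrightarrow\; \tilde{O}\!\left(K^{-(1-\varepsilon)/(2-\varepsilon)}\right)
\approx \tilde{O}\!\left(K^{-1/2}\right).
\]
```

As p decreases toward 1, the nonconvex exponent (p-1)/(2p-1) tends to 0, so heavier tails give slower guaranteed convergence.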

media r/LocalLLaMA · 19h ago

Gefen: A Drop-in Replacement for AdamW with Claimed 8x Memory Reduction

Gefen is presented as a drop-in replacement for the AdamW optimizer, with a claimed eightfold reduction in memory usage during training. The code is in the GitHub repository ndvbd/Gefen, and the accompanying paper is hosted on arXiv under the identifier 2606.13894. The post links to both for verification but gives no additional performance metrics or comparative benchmarks.

arxiv arXiv cs.LG · 23h ago

Fed-CausalDiff: Decoupled Synchronization for Federated Do-Simulation

Fed-CausalDiff introduces a federated causal diffusion framework that enables do-simulation and policy evaluation in decentralized settings. It decomposes latent state evolution into global and local components, allowing decoupled synchronization to reduce communication cost while maintaining accurate causal inference.

arxiv arXiv cs.LG · 23h ago

Robust Diffusion Models via Divergence-Induced Weighted Denoising

A new training method replaces MSE loss in diffusion models with an f-divergence-based transformation, creating a robust surrogate that improves performance under data contamination. The approach uses local divergence constructions under DDPM's Gaussian reverse-kernel, reducing the training objective to a one-dimensional function of denoising error, with bounded-influence divergences suppressing large errors and enhancing stability.
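
The idea of a bounded-influence, one-dimensional function of the denoising error can be illustrated with a stand-in surrogate. The exponential (Welsch-type) form below is an assumption chosen for simplicity; it is not the paper's f-divergence construction, and the scale `c` is hypothetical.

```python
import numpy as np

def robust_denoising_loss(eps_true, eps_pred, c=4.0):
    """Bounded surrogate of the per-sample squared denoising error r.

    loss   = c * (1 - exp(-r / c))   (approaches r for small r, saturates at c)
    weight = d loss / d r = exp(-r / c), the factor applied to the MSE gradient
    """
    r = np.sum((eps_true - eps_pred) ** 2, axis=1)
    loss = c * (1.0 - np.exp(-r / c))
    weight = np.exp(-r / c)
    return loss, weight, r

rng = np.random.default_rng(0)
eps_true = rng.normal(size=(4, 8))
eps_pred = eps_true + 0.1 * rng.normal(size=(4, 8))
eps_pred[3] += 10.0                      # one contaminated sample with a huge error

loss, weight, r = robust_denoising_loss(eps_true, eps_pred)
```

Clean samples keep a weight near 1 and behave like MSE, while the contaminated sample contributes at most `c` to the loss and almost nothing to the gradient.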

arxiv arXiv cs.LG · 23h ago

Introducing Quantum Measurement Temperature to Stabilize Hybrid QNN Training

A learnable scaling parameter called Quantum Measurement Temperature (QMT) is introduced to rescale quantum measurement outputs in hybrid quantum neural networks. This approach mitigates measurement-induced logit contraction, enhancing gradient magnitude and stability during training without altering the quantum circuit or measurement operators. Experiments show improved logit separation, gradient strength, and classification accuracy in protein and image classification tasks.
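
Why rescaling helps can be seen from a small numerical sketch. It assumes the measurement outputs are expectation values in [-1, 1] used directly as logits; in training the temperature would be a learnable parameter, whereas fixed values are used here, and the numbers are invented.

```python
import numpy as np

def class_probs(expvals, tau):
    """Softmax over measurement outputs rescaled by the temperature tau."""
    z = tau * np.asarray(expvals, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical expectation values for three classes, confined to [-1, 1].
expvals = [0.30, 0.10, -0.20]

top_probs = {tau: class_probs(expvals, tau).max() for tau in (1.0, 5.0, 20.0)}
print(top_probs)
```

Because raw expectation values differ by at most 2, the unscaled softmax stays close to uniform; a larger temperature widens the logit gaps without changing the circuit or the measurement operators.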

arxiv arXiv cs.LG · 23h ago

Stationary Robust Mean-Field Games under Model Mismatches

This paper introduces a stationary mean-field game framework that directly incorporates distributional model uncertainty into population-coupled dynamics. It establishes a robust dynamic programming principle, proves existence of a stationary robust equilibrium, and presents the first algorithm with convergence guarantees. The mean-field solution approximates finite-population equilibria and provides explicit non-asymptotic error bounds under model uncertainty.

arxiv arXiv cs.LG · 23h ago

Training-Free Task Classification for Multi-Task Model Merging

SiM enables dynamic routing in multi-task model merging without additional training or task ID access. It uses SVD-based manifold approximations and projects test inputs onto precomputed task manifolds to route inputs to relevant experts, improving performance and reducing the gap to individual expert levels.
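
Training-free routing by projection onto SVD-derived subspaces can be sketched as below. This is an illustration under assumptions, not SiM itself: task "manifolds" are approximated by linear subspaces, the features are synthetic, and routing picks the smallest projection residual.

```python
import numpy as np

def fit_task_basis(features, rank):
    """Top right singular vectors of a (samples, dim) feature matrix."""
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    return vt[:rank]                      # (rank, dim), orthonormal rows

def route(x, bases):
    """Pick the task whose subspace reconstructs x with the smallest residual."""
    residuals = [np.linalg.norm(x - basis.T @ (basis @ x)) for basis in bases]
    return int(np.argmin(residuals))

rng = np.random.default_rng(0)
dim, rank = 16, 3
# Synthetic tasks: each task's features lie in its own random 3-D subspace.
subspaces = [rng.normal(size=(rank, dim)) for _ in range(2)]
task_features = [rng.normal(size=(100, rank)) @ s for s in subspaces]
bases = [fit_task_basis(f, rank) for f in task_features]   # precomputed offline

test_input = rng.normal(size=rank) @ subspaces[1]          # drawn from task 1
chosen = route(test_input, bases)
```

The bases are computed once from stored task features, so routing at test time needs only a few matrix products and no task ID.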

arxiv arXiv cs.LG · 23h ago

Importance-Weighted On-Policy Distillation Addresses Position Bias

On-Policy Distillation (OPD) suffers from position bias, in which later tokens provide poor supervision. The authors introduce Importance-Weighted On-Policy Distillation (IW-OPD), which assigns weights based on distribution discrepancy, prioritizing early tokens. IW-OPD converges faster and achieves performance gains of up to 6.9 points on AIME-2025.

arxiv arXiv cs.LG · 23h ago

Scalable Bayesian Models for Stellar Flare Detection

A generative surrogate framework using a Variational Autoencoder approximates Gaussian Process priors, bypassing costly covariance operations. The VAE+Hidden Markov Model architecture enables fast, scalable stellar flare detection in large astronomical time series, matching exact models in structural fidelity while reducing computational time significantly.

arxiv arXiv cs.AI · 1d ago

Select-to-Act: Hierarchical RL with Adaptive Language Guidance

HRLLI introduces a hierarchical reinforcement learning framework that adapts natural-language instructions dynamically during decision-making. It decomposes instructions into stage-specific guidance elements and uses a select-to-act paradigm to enable real-time selection of relevant instruction pieces, improving sample efficiency and performance in complex environments.

arxiv arXiv cs.AI · 1d ago

Sparsity-Storage-Accuracy Tradeoff in Parsimoniously Activated Dictionary Learning

Parsimoniously activated dictionary learning (PADL) establishes a structured generative model with auxiliary latent variables, enabling maximum a posteriori estimation. This framework provides generalization guarantees and an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy, allowing data-driven hyperparameter estimation. The resulting algorithm achieves better reconstruction performance and accelerates inference in vision-language models.