Training methods
arxiv arXiv cs.LG · 15h ago

Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

The authors identify a critical flaw termed the Variance Trap, which arises when stochastic root-finding problems are forced into minimization frameworks via squared residuals. Standard bilevel minimization algorithms require estimating hypergradients involving implicit Jacobians that act as noise amplifiers in stochastic settings. To address this, the paper formalizes Root-Finding Bilevel Optimization (RF-BO) as a distinct problem class to bypass this pathology. A Jacobian-free solution using Two-Time-Scale Stochastic Approximation (TTSA) is proposed to update directly along the root error. The study provides the first non-asymptotic convergence guarantees for TTSA in this setting under Markovian noise. Experiments show a 2.6% top-1 accuracy gain in SimCLR and 17x faster convergence in non-linear ODE control compared to baselines. Additionally, the framework achieves significantly improved entropy stability in reinforcement learning and an 11.1% quality improvement in generative modeling.
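
The Jacobian-free idea can be illustrated on a toy problem. The sketch below is not the paper's algorithm or experimental setup: the two linear residuals, the step-size schedules, and the i.i.d. Gaussian noise (rather than Markovian noise) are all assumptions chosen for illustration. The inner variable y is updated on a fast time scale along its noisy root error, the outer variable x on a slow time scale along its own, and no Jacobian or hypergradient is ever formed.

```python
import random

def ttsa_root_finding(steps=20000, noise_std=0.1, seed=0):
    """Toy two-time-scale stochastic approximation (illustrative assumptions only).

    Inner root problem: g(x, y) = y - 2x = 0      (so y*(x) = 2x)
    Outer root problem: f(x, y) = x + y - 3 = 0   (joint root at x = 1, y = 2)
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for k in range(steps):
        alpha = 1.0 / (k + 10)         # slow step size for the outer variable
        beta = 1.0 / (k + 10) ** 0.6   # fast step size for the inner variable
        # Noisy evaluations of the two residuals at the current iterate.
        inner_residual = (y - 2.0 * x) + rng.gauss(0.0, noise_std)
        outer_residual = (x + y - 3.0) + rng.gauss(0.0, noise_std)
        # Move directly along the root errors; no Jacobian is estimated.
        y -= beta * inner_residual
        x -= alpha * outer_residual
    return x, y
```

Calling ttsa_root_finding() should return a pair close to the joint root (1, 2); the separation between alpha and beta is what lets y track y*(x) while x drifts slowly.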

arxiv arXiv cs.LG · 15h ago

RQ-TTSA: Distribution-Aware Robust Bilevel Optimization with Quantile-Guided Huber Updates

The authors propose RQ-TTSA, a distribution-aware framework designed to address instability in bilevel optimization caused by heavy-tailed stochastic noise. Unlike existing variance-reduction techniques that rely on myopic magnitude checks, this method uses historical gradient buffers to estimate rolling quantiles for adaptive Huber-style clipping. This approach preserves local optimization geometry while strictly bounding effective variance under nonconvex-strongly convex assumptions with infinite-variance noise. Theoretical analysis derives a convergence rate of O(T^(-(p-1)/(3p-2))) that recovers optimal dependence on the heavy-tailed parameter p. Empirical evaluations across six diverse tasks, including vision benchmarks and offline reinforcement learning, show consistent outperformance over state-of-the-art baselines. RQ-TTSA eliminates divergence spikes and ensures stable convergence with negligible computational overhead of approximately 2.7 percent.
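
A minimal sketch of quantile-guided clipping, assuming a fixed-size buffer of past gradient norms and a norm-rescaling rule (the derivative of a Huber-style loss behaves like clipping beyond its threshold). The class name, window size, and quantile level are hypothetical and are not taken from the paper.

```python
import math
from collections import deque

class QuantileClipper:
    """Clip gradients to a rolling quantile of recent gradient norms (hypothetical helper)."""

    def __init__(self, window=100, q=0.9):
        self.norms = deque(maxlen=window)  # history buffer of raw gradient norms
        self.q = q                         # quantile level used as the clipping threshold

    def threshold(self):
        ordered = sorted(self.norms)
        idx = min(len(ordered) - 1, int(self.q * len(ordered)))
        return ordered[idx]

    def clip(self, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        scale = 1.0
        if self.norms and norm > 0.0:
            # Rescale only when the norm exceeds the history-based threshold.
            scale = min(1.0, self.threshold() / norm)
        # Store the raw norm so the threshold follows the observed distribution.
        self.norms.append(norm)
        return [scale * g for g in grad]
```

Because the threshold comes from the history rather than a fixed constant, a single heavy-tailed spike is scaled down to the typical magnitude while ordinary gradients pass through unchanged.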

arxiv arXiv cs.LG · 15h ago

VRA-FedSGD: Variance-Reduced Federated Learning for Heavy-Tailed Noise

The authors propose VRA-FedSGD, a variance-reduction-based algorithm for federated learning in environments with heavy-tailed gradient and communication noise. It targets challenges prevalent in large-scale machine learning over wireless networks and Internet of Things deployments. The method combines momentum variance reduction with a nonlinear mapping to mitigate heavy-tailed gradient noise, and uses a variance-reduced aggregation mechanism to suppress heavy-tailed communication noise. For nonconvex objective functions, VRA-FedSGD achieves a mean convergence rate of O(K^(-(p-1)/(2p-1))), where p is the tail index. For strongly convex objectives, it achieves an almost-sure rate of Õ(K^(-(1-1/(p-ε)))), where ε is an arbitrarily small constant. Simulated experiments on logistic regression with real-world data verify the algorithm's effectiveness.

media r/LocalLLaMA · 16h ago

Gefen: A Drop-in Replacement for AdamW with Claimed 8x Memory Reduction

Gefen is presented as a drop-in replacement for the AdamW optimizer that claims an eightfold reduction in memory usage during training. The post links to a GitHub repository at ndvbd/Gefen and a research paper hosted on arXiv under the identifier 2606.13894. No additional performance metrics or comparative benchmarks are detailed in the available text.

arxiv arXiv cs.LG · 20h ago

Robust Diffusion Models via Divergence-Induced Weighted Denoising

A new training method replaces MSE loss in diffusion models with an f-divergence-based transformation, creating a robust surrogate that improves performance under data contamination. The approach uses local divergence constructions under DDPM's Gaussian reverse-kernel, reducing the training objective to a one-dimensional function of denoising error, with bounded-influence divergences suppressing large errors and enhancing stability.
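
To see why a bounded one-dimensional function of the denoising error helps under contamination, consider the sketch below. It does not reproduce the paper's f-divergence construction; it swaps in the Welsch function c * (1 - exp(-e / c)) as a stand-in bounded-influence transform of the per-sample squared error e, and the function names and the scale c are assumptions.

```python
import math

def squared_errors(pred_noise, true_noise):
    """Per-sample squared denoising error ||eps_hat - eps||^2."""
    return [sum((p - t) ** 2 for p, t in zip(ps, ts))
            for ps, ts in zip(pred_noise, true_noise)]

def welsch(err, c=1.0):
    """Stand-in bounded transform: close to err for small errors, saturates at c."""
    return c * (1.0 - math.exp(-err / c))

def robust_denoising_loss(pred_noise, true_noise, c=1.0):
    """Mean of the bounded transform applied to each sample's squared error."""
    errs = squared_errors(pred_noise, true_noise)
    return sum(welsch(e, c) for e in errs) / len(errs)
```

With plain MSE a single contaminated sample can dominate the batch loss; here each sample contributes at most c, so large errors are suppressed while small errors are treated almost as in MSE.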

arxiv arXiv cs.LG · 20h ago

Introducing Quantum Measurement Temperature to Stabilize Hybrid QNN Training

A learnable scaling parameter called Quantum Measurement Temperature (QMT) is introduced to rescale quantum measurement outputs in hybrid quantum neural networks. This approach mitigates measurement-induced logit contraction, enhancing gradient magnitude and stability during training without altering the quantum circuit or measurement operators. Experiments show improved logit separation, gradient strength, and classification accuracy in protein and image classification tasks.
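
The contraction effect is easy to reproduce in a few lines. The sketch assumes the measurement outputs are expectation values in [-1, 1] that are multiplied by a scalar before a softmax; the function names and the multiplicative form are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_probs(expvals, scale):
    """Class probabilities after rescaling measurement outputs by a scalar."""
    return softmax([scale * z for z in expvals])

def cross_entropy(expvals, scale, label):
    return -math.log(scaled_probs(expvals, scale)[label])

def scale_grad(expvals, scale, label):
    """d(cross-entropy)/d(scale) = sum_i (p_i - y_i) * z_i, since logits = scale * z."""
    probs = scaled_probs(expvals, scale)
    return sum((p - (1.0 if i == label else 0.0)) * z
               for i, (p, z) in enumerate(zip(probs, expvals)))
```

With two classes and the most favorable outputs [1, -1], an unscaled softmax cannot push the correct-class probability much above 0.88, whereas a scale of 5 takes it above 0.999. The gradient with respect to the scale is negative for a correctly ranked sample, so gradient descent increases the scale.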

arxiv arXiv cs.LG · 20h ago

Stationary Robust Mean-Field Games under Model Mismatches

This paper introduces a stationary mean-field game framework that directly incorporates distributional model uncertainty into population-coupled dynamics. It establishes a robust dynamic programming principle, proves existence of a stationary robust equilibrium, and presents the first algorithm with convergence guarantees. The mean-field solution approximates finite-population equilibria and provides explicit non-asymptotic error bounds under model uncertainty.

arxiv arXiv cs.AI · 21h ago

Sparsity-Storage-Accuracy Tradeoff in Parsimoniously Activated Dictionary Learning

Parsimoniously activated dictionary learning (PADL) establishes a structured generative model with auxiliary latent variables, enabling maximum a posteriori estimation. This framework provides generalization guarantees and an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy, allowing data-driven hyperparameter estimation. The resulting algorithm achieves better reconstruction performance and accelerates inference in vision-language models.

arxiv arXiv cs.AI · 22h ago

HyperAdapter: Structured Hyperedge Adaptation for Vision Transformer Fine-Tuning

HyperAdapter introduces a hypergraph-based adapter that performs structured, group-aware adaptation in vision transformers by operating in hyperedge space rather than token space. It uses prototype-based assignments to build a soft hypergraph, aggregates token features into hyperedge representations, applies lightweight adaptation, and diffuses updates back via hypergraph structure, enabling explicit structural inductive bias while maintaining efficiency. Experiments show consistent performance gains over baseline PEFT methods, especially on tasks requiring structured reasoning.
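
The four steps described above (assign, aggregate, adapt, diffuse) can be sketched with plain lists. This is a simplified reading of the summary, not the authors' code: the dot-product softmax assignment, the weighted-mean aggregation, the ReLU bottleneck, and every name below are assumptions.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def hyperedge_adapter(tokens, prototypes, w_down, w_up):
    """tokens: N x d, prototypes: M x d, w_down: d x r, w_up: r x d (all assumed shapes)."""
    # 1. Soft hypergraph: each token gets a softmax membership over M hyperedges.
    assign = [softmax(row) for row in matmul(tokens, transpose(prototypes))]
    # 2. Aggregate token features into hyperedge features (membership-weighted mean).
    assign_t = transpose(assign)
    edges = [[v / sum(weights) for v in row]
             for weights, row in zip(assign_t, matmul(assign_t, tokens))]
    # 3. Lightweight bottleneck adapter applied in hyperedge space.
    hidden = [[max(0.0, v) for v in row] for row in matmul(edges, w_down)]
    edge_updates = matmul(hidden, w_up)
    # 4. Diffuse hyperedge updates back to tokens and add them residually.
    token_updates = matmul(assign, edge_updates)
    return [[t + u for t, u in zip(t_row, u_row)]
            for t_row, u_row in zip(tokens, token_updates)]
```

The adapter operates on M hyperedge vectors rather than N tokens, which is where the efficiency comes from when M is much smaller than N. Initializing w_up to zeros makes the module an identity map at the start of fine-tuning, a common choice for adapters.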

arxiv arXiv cs.AI · 22h ago

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR is a two-stage framework that combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) to improve large language model-based automated code compliance systems. It reduces tree edit distance by up to 23.8% and token-level Levenshtein distance by up to 38.6%, outperforming leading LLMs such as Claude Opus, GPT-5.2, and GLM-4.7 evaluated with zero-shot and few-shot prompting, and it reduces false positives by a small but statistically significant margin.
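
For readers unfamiliar with the second metric, token-level Levenshtein distance counts the minimum number of token insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is the standard dynamic-programming formulation; the whitespace tokenization in the usage note is an assumption and may differ from the paper's tokenizer.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (lists of hashable tokens)."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                       # deleting the first i tokens of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For example, levenshtein("if height > 3 then fail".split(), "if height >= 3 then fail".split()) is 1, because a single token differs.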