Training methods — korshunov.ai

Muon Optimizer: Power, Limits, and a River-Valley Theory

A new trajectory-level theory shows that Muon makes rapid early progress along the information-bearing river direction but, unlike gradient descent, converges slowly near the bottom. With momentum, Muon's orthogonalized updates discard residual scale information, which leads to overshooting and oscillation. The study advocates a two-stage approach for LLM training: use Muon early, then switch to a gradient-descent-like optimizer.
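To illustrate why an orthogonalized update carries no scale information, here is a minimal sketch in plain Python. It uses the classic cubic Newton-Schulz iteration as a stand-in and is not the paper's or Muon's tuned procedure. The matrices `grad` and `big_grad` are made-up examples.

```python
# Minimal sketch: orthogonalizing a matrix discards its scale.
# Assumption: classic cubic Newton-Schulz iteration, not Muon's tuned variant.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of a full-rank square matrix g."""
    fro = sum(v * v for row in g for v in row) ** 0.5
    # After Frobenius normalization every singular value lies in (0, 1].
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 X X^T X pushes every singular value toward 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

grad = [[2.0, 1.0], [0.0, 1.0]]                       # hypothetical gradient
big_grad = [[100.0 * v for v in row] for row in grad]  # same direction, 100x the scale

update = orthogonalize(grad)
big_update = orthogonalize(big_grad)
```

Both calls return the same matrix up to rounding, so the size of the gradient never reaches the update. That is the scale information the summary says is lost.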

arxiv arXiv cs.LG · 2d ago

GOMA Achieves First Stochastic Convergence Guarantee for Variational Inequalities

The paper introduces GOMA, a family of first-order methods for monotone variational inequalities. In the stochastic setting with unbounded variance, a simplified variant of GOMA achieves an O(1/\sqrt{k}) last-iterate convergence rate on the squared gradient norm, without variance reduction or growing batches. This is the first such guarantee for unconstrained stochastic monotone Lipschitz variational inequalities.

arxiv arXiv cs.LG · 2d ago

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning

FAST addresses sampling inefficiency in autonomous driving reinforcement learning by introducing Dynamic Parallel Sampling Alignment, which decouples sampling loops from individual episode terminations. It achieves up to a 1.78× wall-clock speedup over single-clip baselines while maintaining statistical unbiasedness through Scaled Mask-Padding Optimization.

arxiv arXiv cs.AI · 2d ago

Analytic Policy Gradients for Sample and Learning Efficient Control

Analytic Policy Gradients (APG) enables exact gradient computation via backpropagation through simulation when environment dynamics are differentiable. APG outperforms Proximal Policy Optimization (PPO) on four continuous control tasks, showing superior sample and learning efficiency with a segmented backpropagation scheme that reduces gradient degradation on long-horizon tasks.

arxiv arXiv cs.AI · 2d ago

Deep learning with O(log N) parallel time complexity

Hierarchical Block-Local Learning (HBLL) enables deep neural network training in O(log N) parallel time complexity, eliminating the need for full backpropagation. HBLL decomposes networks into hierarchically linked blocks and achieves competitive performance on vision and language tasks, with extensions to recurrent architectures.

arxiv arXiv cs.CL · 2d ago

Energy Consumption Model for Transformer Training

A new framework models energy consumption in Transformer training on multiple GPUs. It uses BERT architectural sweeps to link measured energy to compute, memory traffic, and hardware efficiency proxies. The model, inspired by roofline analysis, includes a speedup-based hardware-efficiency factor and predicts training energy across diverse GPU configurations.

arxiv arXiv cs.CL · 2d ago

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

Randomized YaRN enhances long-context reasoning by combining YaRN positional extrapolation with randomized positional encoding and a length curriculum. It outperforms standard fine-tuning on benchmarks like BABILong and MRCR, showing significant gains at far out-of-distribution context lengths.
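As a rough illustration of the randomized positional encoding ingredient, the sketch below draws a sorted random subset of position ids from a range longer than the training sequence. This way the model sees large position values during training. How the paper combines this with YaRN scaling and the length curriculum is not stated here. The function name and the lengths used are hypothetical.

```python
import random

def randomized_positions(seq_len, max_position, rng):
    """Sample seq_len distinct position ids from [0, max_position), sorted ascending."""
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    return sorted(rng.sample(range(max_position), seq_len))

rng = random.Random(0)
train_len = 8     # hypothetical training sequence length
target_len = 64   # hypothetical longest context to generalize to

# Token order is preserved; only the gaps between position ids are random.
positions = randomized_positions(train_len, target_len, rng)
```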

arxiv arXiv cs.CL · 3d ago

Adaptive Data Scheduling Improves LLM Reinforcement Learning

Adaptive Data Scheduling (ADS) introduces a dual-level data scheduling framework that replaces uniform sampling with adaptive distribution over semantic clusters and policy-boundary sample selection. Experimental results show ADS improves average accuracy by 5.2% over GRPO across three LLMs and seven reasoning benchmarks, demonstrating its effectiveness as a general strategy for LLM RL post-training.

arxiv arXiv cs.CL · 3d ago

Key Factors in RL for LLM Reasoning Revealed

A theoretical analysis shows that off-policy degree, determined by gradient steps per rollout, significantly impacts importance sampling ratios and token update dominance. The study introduces Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries by token group variance, outperforming DAPO and CISPO on 3B and 7B models across mathematical, QA, and logic reasoning tasks.

media Hugging Face Forums · 4d ago

Small-scale debug comparison of OLMo-core with Engram graft

A 200-step training comparison between a base OLMo3 600M model and a version with a DeepSeek-style Engram graft shows that the grafted model reaches lower training and evaluation loss, stabilizes its grad norm faster, and learns better early on. The Engram graft, injected into layers 1 and 5, raises trainable parameters to ~1.7B but adds only about 40k active parameters per token, indicating efficient memory usage.

media r/LocalLLaMA · 5d ago

Free 15-Part Series on LLM Internals Grounded in Gemma 4 12B

I wrote a free 15-part series detailing LLM internals, using Gemma 4 12B as the core example. Each part covers technical aspects from tokenization to serving, with real math, tensor shapes, and hardware constraints. The series includes a companion vLLM Deep Dive and is fully accessible without paywalls or email.

media r/LocalLLaMA · 6d ago

RTX 5090 MSI Power Usage and Cable Warning

The MSI RTX 5090 draws 475-500W during inference or diffusion training. The user reports no issues with the power cable and stresses that, for safe and stable operation, the cable should not be bent.

media r/LocalLLaMA · 6d ago

Fixing Long-Context Decode Cliff on Radeon R9700 with vLLM 0.22.1

A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.

media r/LocalLLaMA · 6d ago

EvoTensile: Evolutionary tuning of AMD Tensile GEMM kernels

EvoTensile uses evolutionary algorithms to tune GEMM kernels for AMD GPUs, improving NT layout performance from 20 to 40 TFLOPS on Strix Halo. That is a 2× gain over the unoptimized kernels, though it reaches only about 67% of the theoretical roofline of 59.4 TFLOPS.

arxiv arXiv cs.AI · 6d ago

UFP4: Uniform 4-Bit Training Overcomes Shrinkage Bias in LLM Pretraining

A study identifies shrinkage bias in E2M1-based FP4 formats due to geometric asymmetry, causing multiplicative error accumulation and training instability. The proposed UFP4 recipe uses uniform E1M2/INT4 grids and applies Random Hadamard Transform to all GEMMs, achieving lower loss degradation than E2M1 baselines in large-scale LLM pretraining. The authors recommend E1M2/INT4 as a first-class training primitive for future accelerators.
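To make the difference between the two grid families concrete, the sketch below decodes every non-negative value of a 4-bit float layout and prints the gaps between neighbors. It assumes an exponent bias of 1 for both layouts and no inf/NaN codes. It shows only the grid geometry and does not reproduce the paper's shrinkage analysis or its Hadamard step.

```python
def magnitudes(exp_bits, man_bits, bias):
    """Decode every non-negative value of a tiny float format without inf/NaN codes."""
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:   # subnormal range: no implicit leading 1
                values.append(frac * 2.0 ** (1 - bias))
            else:        # normal range: implicit leading 1
                values.append((1.0 + frac) * 2.0 ** (e - bias))
    return values

def gaps(values):
    """Distances between neighboring grid points."""
    return [b - a for a, b in zip(values, values[1:])]

e2m1 = magnitudes(exp_bits=2, man_bits=1, bias=1)  # bias is an assumption
e1m2 = magnitudes(exp_bits=1, man_bits=2, bias=1)  # bias is an assumption

print(e2m1, gaps(e2m1))  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], gaps grow from 0.5 to 2.0
print(e1m2, gaps(e1m2))  # eight values from 0.0 to 1.75, every gap 0.25
```

The E1M2 grid is evenly spaced, like a scaled INT4 grid, while the E2M1 grid gets coarser toward large magnitudes.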

arxiv arXiv cs.AI · 6d ago

Repurposing Speech Classifier for Diffusion-Based Generation

A pretrained speech classifier is repurposed as a backbone for guided diffusion-based speech generation. By attaching a lightweight subnetwork and training it under denoising score matching, the approach achieves high speech quality with reduced memory and computational cost, using a single model instead of two separately trained components.

arxiv arXiv cs.AI · 6d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant enables 4-bit KV caching for context-heavy agents, reducing P50 time-to-first-token by 3.47× in late rounds and boosting output throughput by 1.63× over an FP8 KV baseline. It achieves this using FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA on AMD CDNA4 GPUs, with optimizations for decode-attention kernels and robust design choices like asymmetric K/V treatment and Walsh-Hadamard rotation.
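A Walsh-Hadamard rotation is commonly applied before low-bit quantization because it spreads a single outlier across all coordinates, which shrinks the range a 4-bit grid has to cover. The sketch below is a generic orthonormal fast Walsh-Hadamard transform in plain Python. Which tensors UltraQuant rotates, and at what block size, is not stated here, and the example vector is made up.

```python
def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5  # makes the transform orthonormal and its own inverse
    return [v * scale for v in out]

row = [8.0] + [0.0] * 15        # hypothetical vector with one outlier
rotated = hadamard_rotate(row)  # every entry has magnitude 2.0
restored = hadamard_rotate(rotated)
```

The peak magnitude falls from 8 to 2 while the vector's norm is unchanged. Applying the same rotation again recovers the original values.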

arxiv arXiv cs.AI · 6d ago

Calibration in MoE Models Under Distribution Shift

This paper examines how mixture-of-experts models maintain calibration under distribution shift. It finds that expert-level calibration ensures overall model calibration in hard-routed models but is insufficient for soft-routed models. The authors propose adversarial reweighting to penalize calibration errors in routed aggregates, improving the accuracy-calibration tradeoff across tasks and shifts.

arxiv arXiv cs.LG · 6d ago

Direct Advantage Estimation for Partially Observable Domains

Direct Advantage Estimation (DAE) is extended to partially observable domains with minimal modifications. A discrete latent dynamics model reduces computational overhead by efficiently approximating transition probabilities, enabling scalable and sample-efficient deep reinforcement learning in high-dimensional observation spaces.