Training methods — korshunov.ai

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

The method combines a learnable world model with high- and low-level policies to enable safe exploration in long-horizon tasks. The high-level policy guides exploration toward safe subgoals, while the low-level policy uses imagined rollouts to prevent unsafe behaviors, outperforming existing Safe RL methods in success rate and constraint satisfaction across diverse tasks.

arXiv cs.AI · 23h ago

Fed-CausalDiff: Decoupled Synchronization for Federated Do-Simulation

Fed-CausalDiff introduces a federated causal diffusion framework that enables do-simulation in decentralized settings. It decomposes latent state evolution into global and local components, allowing decoupled synchronization that reduces communication cost while maintaining accurate policy evaluation and average treatment effect (ATE) estimation.

arXiv cs.AI · 1d ago

Importance-Weighted On-Policy Distillation Addresses Position Bias

On-Policy Distillation (OPD) suffers from position bias: later tokens provide poor supervision. Importance-Weighted OPD (IW-OPD) assigns dynamic weights based on distribution discrepancy, prioritizing early tokens and suppressing late ones. IW-OPD converges faster than standard OPD and gains up to 6.9 points on AIME-2025.

arXiv cs.LG · 1d ago

Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization

ROVER enables reward-free pretraining by maximizing occupancy coverage in state space, using a learned world model to estimate occupancy without density or entropy estimation. It introduces a virtual sink state to balance exploration of known and unknown regions, achieving more uniform coverage and better downstream performance in tabular and pixel-based navigation tasks.

arXiv cs.LG · 1d ago

Central limit theorem for averaged Adam optimizer

The article establishes a central limit theorem for the averaged Adam optimizer, showing convergence at order n^{-1/2}. This rate matches classical stochastic approximation algorithms, with the covariance expressed in terms of the algorithm's properties at the attractor state.
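
In generic form, a result of this kind can be written as below. The notation is an assumption for illustration and is not taken from the paper: $\bar{\theta}_n$ denotes the average of the first $n$ Adam iterates, $\theta^*$ the attractor, and $\Sigma$ the limiting covariance.

```latex
\sqrt{n}\,\bigl(\bar{\theta}_n - \theta^*\bigr) \xrightarrow{d} \mathcal{N}(0, \Sigma),
\qquad \text{so that} \qquad
\bar{\theta}_n - \theta^* = O_P\bigl(n^{-1/2}\bigr).
```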

arXiv cs.LG · 1d ago

BIPC Framework Accelerates Mixed-Integer Optimization with Machine Learning

The BIPC framework reduces solution time for large-scale mixed-integer programs by identifying a backdoor subset of variables that drive computational complexity. Using supervised learning, it predicts backdoor variable values and intervals, then solves a reduced problem with these predictions, achieving significant speedups with minimal quality loss. This enables rapid, high-quality solutions under parameter perturbations in real-world systems such as power systems and supply chains.

arXiv cs.LG · 1d ago

Deep Learning with O(log N) Parallel Time Complexity

Hierarchical Block-Local Learning (HBLL) enables deep neural network training in O(log N) parallel time complexity, eliminating the need for full backpropagation. HBLL decomposes networks into hierarchically linked blocks and achieves competitive performance on vision and language tasks, with extensions to recurrent architectures.

arXiv cs.LG · 1d ago

Analytic Policy Gradients for Efficient Continuous Control

Analytic Policy Gradients (APG) enables exact gradient computation via backpropagation through simulation when environment dynamics are differentiable. APG outperforms Proximal Policy Optimization (PPO) on four continuous control tasks, showing superior sample and learning efficiency with a segmented backpropagation scheme that reduces gradient degradation on long-horizon tasks.
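
The idea of backpropagating through a differentiable simulation can be illustrated on a toy problem. The sketch below is not the paper's method or code: the 1D dynamics, the linear policy `u = -k * x`, the quadratic cost, and all function names are assumptions chosen so that the reverse pass can be written by hand. It does not implement the segmented backpropagation scheme.

```python
def rollout(k, x0=1.0, dt=0.1, steps=20):
    """Simulate x_{t+1} = x_t + dt * u_t with the policy u_t = -k * x_t.

    Returns the cost sum_t x_t^2 (t = 1..steps) and the visited states.
    """
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        u = -k * x                 # toy linear policy with one parameter k
        xs.append(x + dt * u)      # differentiable dynamics step
    loss = sum(x * x for x in xs[1:])
    return loss, xs


def analytic_gradient(k, x0=1.0, dt=0.1, steps=20):
    """Exact d(loss)/dk via a hand-written reverse pass through the rollout."""
    _, xs = rollout(k, x0, dt, steps)
    grad_k = 0.0
    adjoint = 0.0                  # d(loss)/d(x_{t+1}), zero past the horizon
    for t in range(steps, 0, -1):
        # d(loss)/d(x_t): direct cost term plus the term flowing back from x_{t+1}
        adjoint = 2.0 * xs[t] + adjoint * (1.0 - dt * k)
        # x_t = x_{t-1} - dt * k * x_{t-1}, so d(x_t)/dk = -dt * x_{t-1}
        grad_k += adjoint * (-dt * xs[t - 1])
    return grad_k


def finite_difference_gradient(k, eps=1e-6):
    """Central-difference estimate, used only as a sanity check."""
    return (rollout(k + eps)[0] - rollout(k - eps)[0]) / (2.0 * eps)
```

Comparing `analytic_gradient(0.5)` with `finite_difference_gradient(0.5)` is a quick way to check the reverse pass. In practice an automatic differentiation framework would replace the hand-written adjoint.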

arXiv cs.LG · 1d ago

Muon Optimizer: Power, Limits, and a River-Valley Theory

A new trajectory-level theory shows that Muon makes fast early progress along the information-bearing river direction but, unlike gradient descent, converges slowly near the bottom. With momentum, Muon's orthogonalized updates remove residual scale information, leading to overshooting and oscillation. The study advocates a two-stage approach for LLM training: use Muon early, then switch to gradient-descent-like optimizers.
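
To see why an orthogonalized update carries no scale information, the sketch below orthogonalizes a small matrix with the textbook cubic Newton-Schulz iteration. This is an illustration under assumptions, not the optimizer's actual code: practical Muon implementations reportedly use a tuned higher-order variant, and the helper names here are hypothetical. The input is assumed to be a non-zero, full-rank matrix.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def orthogonalize(g, iterations=30):
    """Approximate the orthogonal polar factor of g.

    Cubic Newton-Schulz: X <- 1.5 X - 0.5 (X X^T) X, started from g scaled
    by its Frobenius norm so that all singular values lie in (0, 1].
    Each singular value is driven toward 1, which discards the scale of g.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(iterations):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


grad = [[3.0, 1.0], [0.0, 2.0]]
scaled_grad = [[10.0 * v for v in row] for row in grad]

update = orthogonalize(grad)
scaled_update = orthogonalize(scaled_grad)
```

Here `update` and `scaled_update` agree up to rounding even though the inputs differ by a factor of 10, which is the sense in which the magnitude of the gradient is lost.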

arXiv cs.LG · 1d ago

GOMA Achieves First Stochastic Convergence Guarantee for Variational Inequalities

The paper introduces GOMA, a family of first-order methods for monotone variational inequalities. In the stochastic setting with unbounded variance, a simplified variant of GOMA achieves an O(1/\sqrt{k}) last-iterate convergence rate on the squared gradient norm, without variance reduction or growing batches. This is the first such guarantee for unconstrained stochastic monotone Lipschitz variational inequalities.

arXiv cs.LG · 1d ago

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning

FAST addresses sampling inefficiency in autonomous driving reinforcement learning by introducing Dynamic Parallel Sampling Alignment to decouple episode termination from sampling loops. It achieves up to 1.78 times wall-clock speedup over single-clip baselines while maintaining statistical unbiasedness through Scaled Mask-Padding Optimization.

arXiv cs.AI · 2d ago

Analytic Policy Gradients for Sample and Learning Efficient Control

arXiv cs.AI · 2d ago

Deep learning with O(log N) parallel time complexity

arXiv cs.CL · 2d ago

Energy Consumption Model for Transformer Training

A new framework models energy consumption in Transformer training on multiple GPUs. It uses BERT architectural sweeps to link measured energy to compute, memory traffic, and hardware efficiency proxies. The model, inspired by roofline analysis, includes a speedup-based hardware-efficiency factor and predicts training energy across diverse GPU configurations.
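
For readers unfamiliar with roofline analysis, the sketch below shows the basic bound and one way an energy estimate could be layered on top of it. It is not the paper's model: the function names, the constant average power draw, and the scalar `efficiency` factor are all assumptions made for illustration.

```python
def roofline_time_s(flops, bytes_moved, peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """Roofline lower bound on runtime: the slower of compute and memory traffic."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / mem_bandwidth_bytes_per_s
    return max(compute_time, memory_time)


def energy_estimate_j(flops, bytes_moved, peak_flops_per_s,
                      mem_bandwidth_bytes_per_s, avg_power_w, efficiency=1.0):
    """Hypothetical energy estimate in joules.

    Assumes a constant average power draw and divides the ideal roofline
    time by an efficiency factor in (0, 1] to account for hardware that
    does not reach its peak.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    ideal_time = roofline_time_s(flops, bytes_moved,
                                 peak_flops_per_s, mem_bandwidth_bytes_per_s)
    return avg_power_w * ideal_time / efficiency


# Made-up workload: compute-bound, because 1e12 / 1e13 s exceeds 1e9 / 1e12 s.
ideal = energy_estimate_j(1e12, 1e9, 1e13, 1e12, avg_power_w=300.0)
derated = energy_estimate_j(1e12, 1e9, 1e13, 1e12, avg_power_w=300.0, efficiency=0.5)
```

Halving the efficiency factor doubles the estimate, which is the only behavior this toy model is meant to show.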

arXiv cs.CL · 2d ago

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

Randomized YaRN enhances long-context reasoning by combining YaRN positional extrapolation with randomized positional encoding and a length curriculum. It outperforms standard fine-tuning on benchmarks like BABILong and MRCR, showing significant gains at far out-of-distribution context lengths.
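
One common form of randomized positional encoding replaces the positions 0, 1, ..., seq_len - 1 with a sorted random subset of a much larger range, so that short training sequences still expose the model to large position values. The sketch below illustrates only that ingredient. The function name and parameters are hypothetical, and how the paper combines this with YaRN scaling and the length curriculum is not shown.

```python
import random


def sample_position_ids(seq_len, max_position, rng):
    """Draw seq_len distinct position ids from [0, max_position), in increasing order."""
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    return sorted(rng.sample(range(max_position), seq_len))


rng = random.Random(0)  # fixed seed so the example is reproducible
position_ids = sample_position_ids(seq_len=8, max_position=1024, rng=rng)
```

The ids stay strictly increasing, so token order is preserved while the absolute positions vary from one sample to the next.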

arXiv cs.CL · 2d ago

Adaptive Data Scheduling Improves LLM Reinforcement Learning

Adaptive Data Scheduling (ADS) introduces a dual-level data scheduling framework that replaces uniform sampling with adaptive distribution over semantic clusters and policy-boundary sample selection. Experimental results show ADS improves average accuracy by 5.2% over GRPO across three LLMs and seven reasoning benchmarks, demonstrating its effectiveness as a general strategy for LLM RL post-training.

arXiv cs.CL · 2d ago

Key Factors in RL for LLM Reasoning Revealed

A theoretical analysis shows that off-policy degree, determined by gradient steps per rollout, significantly impacts importance sampling ratios and token update dominance. The study introduces Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries by token group variance, outperforming DAPO and CISPO on 3B and 7B models across mathematical, QA, and logic reasoning tasks.

media Hugging Face Forums · 3d ago

Small-scale debug comparison of OLMo-core with Engram graft

A 200-step training comparison between a base OLMo3 600M model and a version with a DeepSeek-style Engram graft shows that the grafted model reaches lower training and evaluation loss, stabilizes its gradient norm faster, and learns better early on. The Engram graft, injected into layers 1 and 5, raises the trainable parameter count to ~1.7B but adds only about 40k active parameters per token, indicating efficient memory usage.

media r/LocalLLaMA · 5d ago

Free 15-Part Series on LLM Internals Grounded in Gemma 4 12B

I wrote a free 15-part series detailing LLM internals, using Gemma 4 12B as the core example. Each part covers technical aspects from tokenization to serving, with real math, tensor shapes, and hardware constraints. The series includes a companion vLLM Deep Dive and is fully accessible without paywalls or email.