This article addresses the lack of rigorous convergence theory for the AdamW optimizer under heavy-tailed stochastic gradient noise, a regime commonly observed in large language model pretraining. It asks whether AdamW can converge under these conditions or whether its second-moment accumulator creates a genuine obstruction.
- Theoretical foundations for AdamW are currently limited to finite-variance regimes, despite empirical evidence of heavy-tailed noise in LLMs.
- Sign-based optimizers like Lion and Muon have achieved sharp convergence rates under heavy-tailed assumptions, as has AdaGrad.
- The authors pose the question of whether AdamW remains effective under heavy-tailed noise as an open problem.
- As a positive result, the authors prove a weighted-metric benchmark that serves as a performance baseline.
- A corridor lower-bound mechanism shows how the memory in AdamW's denominator (the second-moment accumulator) can obscure large gradients.
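
The denominator-memory effect in the last point can be illustrated with a small numerical sketch. This is not the authors' corridor construction. It is a scalar toy example, assuming the standard Adam moment recursions with common default values for `beta1`, `beta2`, and `eps`, and leaving out the learning rate and AdamW's decoupled weight decay. The function names and the gradient sequence are hypothetical and chosen only for illustration.

```python
import math


def final_normalized_update(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam-style moment estimates over grads and return the
    last bias-corrected ratio m_hat / (sqrt(v_hat) + eps).

    Learning rate and decoupled weight decay are left out on purpose:
    only the normalized step is of interest here.
    """
    m = 0.0  # first-moment (momentum) accumulator
    v = 0.0  # second-moment accumulator, i.e. the denominator memory
    ratio = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        ratio = m_hat / (math.sqrt(v_hat) + eps)
    return ratio


def make_gradients(spike=None, steps=200, late_gradient=10.0):
    """Unit gradients, an optional spike at step 100, then one larger gradient."""
    grads = [1.0] * steps
    if spike is not None:
        grads[99] = spike  # index 99 is step 100
    grads.append(late_gradient)
    return grads


clean = final_normalized_update(make_gradients())
spiked = final_normalized_update(make_gradients(spike=1000.0))
print(f"no earlier spike:   {clean:.4f}")
print(f"with earlier spike: {spiked:.4f}")
```

In this toy setting the momentum term forgets the spike within a few dozen steps because `beta1` is small, while the second-moment accumulator still carries most of it 100 steps later because `beta2` is close to 1. The same late gradient therefore produces a much smaller normalized step when an outlier preceded it, which is the kind of masking the bullet describes.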
This work highlights the gap between AdamW's empirical success and its theoretical understanding, and aims to determine whether the optimizer's design inherently limits its robustness to heavy-tailed gradient distributions.