Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

AdamW is the standard optimizer for training large language models, yet its convergence theory remains largely confined to finite-variance regimes. This gap matters because empirical evidence suggests that stochastic gradient noise during LLM pretraining is typically heavy-tailed. Recent studies have shown that sign-based optimizers such as Lion and Muon achieve sharp convergence rates under heavy-tailed noise, and that AdaGrad also converges in this setting. No rigorous convergence theory for AdamW has yet been established under these assumptions. The authors pose as an open problem whether AdamW converges under the same heavy-tailed assumptions or whether its second-moment accumulator creates a genuine obstruction. To address this, they formulate a positive weighted-metric benchmark and provide a corridor lower-bound mechanism, which illustrates how denominator memory in AdamW can effectively hide large gradients.
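The idea of denominator memory can be made concrete with a toy scalar example. The sketch below is not the authors' corridor construction; it is a minimal illustration, assuming the textbook AdamW update with bias correction, a hypothetical helper named adamw_steps, and a hand-picked gradient sequence with a single spike. It suggests two effects: the spike itself produces an update of roughly ordinary size because it is divided by its own contribution to the second-moment estimate, and later ordinary gradients produce much smaller updates while that estimate slowly decays.

```python
import math


def adamw_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                weight_decay=0.0, x0=0.0):
    """Run scalar AdamW on a fixed gradient sequence (hypothetical helper).

    Returns the absolute size of the parameter update at each step.
    All hyperparameter defaults are illustrative assumptions.
    """
    x, m, v = x0, 0.0, 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment (momentum) average
        v = beta2 * v + (1 - beta2) * g * g    # second-moment accumulator
        m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
        # Decoupled weight decay is added to the normalized gradient step.
        step = lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * x)
        x -= step
        sizes.append(abs(step))
    return sizes


# Baseline: 200 identical unit gradients.
baseline = adamw_steps([1.0] * 200)

# Same sequence with one assumed heavy-tailed spike at index 100.
spiked = adamw_steps([1.0] * 100 + [1000.0] + [1.0] * 99)

print("update at the spike, baseline vs spiked:", baseline[100], spiked[100])
print("update 99 steps later, baseline vs spiked:", baseline[-1], spiked[-1])
```

With beta2 close to 1, the accumulator forgets the spike far more slowly than the momentum term does, which is why the shrinkage persists in this toy run. Whether this kind of effect amounts to a genuine obstruction to convergence is exactly the open question posed above.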