Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?
This article addresses the lack of rigorous convergence theory for the AdamW optimizer under heavy-tailed stochastic gradient noise, a regime commonly observed in large language model pretraining. It asks whether AdamW can converge under such noise or whether its second-moment accumulator creates a genuine obstruction.
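To make the objects in question concrete, here is a minimal NumPy sketch of the standard AdamW update, with the second-moment accumulator marked, driven by synthetic heavy-tailed noise. Everything beyond the update rule itself is an assumption made for illustration: the function names `adamw_step` and `run_toy`, the quadratic objective, the hyperparameter defaults, and the use of Student-t noise with 1.5 degrees of freedom (finite mean, infinite variance) as a stand-in for heavy-tailed gradient noise. It is a toy for experimentation and does not settle the open problem in either direction.

```python
import numpy as np


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step with decoupled weight decay; t is the 1-indexed step count."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment accumulator
    m_hat = m / (1.0 - beta1 ** t)              # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    # Weight decay is applied to the parameters directly, not through the gradient.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v


def run_toy(noise="student_t", steps=2000, dim=10, seed=0):
    """Hypothetical toy problem: minimize 0.5 * ||theta||^2 from noisy gradients."""
    rng = np.random.default_rng(seed)
    theta = np.ones(dim)
    m = np.zeros(dim)
    v = np.zeros(dim)
    for t in range(1, steps + 1):
        if noise == "gaussian":
            xi = rng.standard_normal(dim)
        else:
            # Assumption: Student-t with df=1.5 has finite mean but infinite variance.
            xi = rng.standard_t(df=1.5, size=dim)
        grad = theta + xi                       # true gradient plus noise
        theta, m, v = adamw_step(theta, grad, m, v, t)
    return theta, v


if __name__ == "__main__":
    for kind in ("gaussian", "student_t"):
        theta, v = run_toy(noise=kind)
        print(kind, "||theta|| =", np.linalg.norm(theta), "max v =", v.max())
```

Comparing `max v` across the two noise types gives a rough view of how a single large gradient sample can inflate the accumulator and shrink subsequent steps, which is the mechanism the question about a possible obstruction refers to.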