First Finite-Time Analysis of Classical Adam for Nonsmooth Nonconvex Optimization

This study presents the first finite-time convergence analysis of the classical Adam optimizer in nonsmooth nonconvex optimization settings. Previous research largely ignored Adam's bias-correction term or required extra algorithmic modifications such as clipping, leaving the guarantees of the original method unclear. The authors use the Online-to-Nonconvex Conversion framework to prove that Adam with a randomly scaled learning rate attains a convergence rate of $1/T^{\frac{2}{13}}$. The result is significant because it holds in the modern heavy-tailed noise regime, which more closely reflects practical training conditions. Furthermore, the analysis establishes convergence under the parameter choice $\beta_1=\beta_2$, in line with recent empirical observations. These findings give a rigorous explanation for Adam's effectiveness in real-world scenarios that smooth optimization theory did not adequately capture.
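
To make the setting concrete, the following is a minimal sketch of a classical Adam step with bias correction, using one shared decay rate for both moment estimates (the $\beta_1=\beta_2$ choice) and a random multiplier on the learning rate. It is an illustration under stated assumptions, not the authors' algorithm: the summary does not say how the learning rate is randomly scaled, so the Exp(1) multiplier used here is an assumption borrowed from related Online-to-Nonconvex Conversion work. The test objective $f(x)=\sum_i \bigl||x_i|-1\bigr|$, the step size, and the value 0.9 for the decay rate are likewise hypothetical choices.

```python
import math
import random


def adam_step(x, g, m, v, t, lr=1e-2, beta=0.9, eps=1e-8, scale=1.0):
    """One classical Adam step with bias correction on lists of floats.

    `beta` is used for both moment estimates (beta1 = beta2).
    `scale` multiplies the learning rate; it is a hypothetical hook for
    random scaling and is not specified by the source.
    `t` is the 1-based step counter.
    """
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, g, m, v):
        mi = beta * mi + (1 - beta) * gi        # first-moment estimate
        vi = beta * vi + (1 - beta) * gi * gi   # second-moment estimate
        m_hat = mi / (1 - beta ** t)            # bias correction
        v_hat = vi / (1 - beta ** t)
        new_x.append(xi - scale * lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v


def sign(z):
    return (z > 0) - (z < 0)


def generalized_gradient(x):
    """One element of the generalized gradient of f(x) = sum(||x_i| - 1|),
    a nonsmooth nonconvex function chosen only for illustration."""
    return [float(sign(abs(xi) - 1.0) * sign(xi)) for xi in x]


def run(steps=500, seed=0):
    rng = random.Random(seed)
    x, m, v = [3.0, -0.2], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        g = generalized_gradient(x)
        s = rng.expovariate(1.0)  # assumed Exp(1) scaling, mean 1
        x, m, v = adam_step(x, g, m, v, t, scale=s)
    return x


if __name__ == "__main__":
    print(run())
```

With bias correction, the first step has $\hat m_1 = g_1$ and $\hat v_1 = g_1^2$, so each coordinate moves by roughly `scale * lr` in the direction opposite the sign of its gradient, regardless of the gradient's magnitude.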