The article demonstrates that Muown's directional update is equivalent to a Riemannian step on normalized directions, in which the magnitude of the un-normalized parameterization modulates the angular step size. This insight explains Muown's step-size stability and motivates AngularMuown, which optimizes directly over normalized directions with an explicit, schedulable angular multiplier.
- AngularMuown decouples the angular multiplier from the radial magnitude update to optimize directly over normalized directions.
- The method improves on Muown's performance and leads the per-optimizer category in the modded nanoGPT speedrunning competition.
- Experiments on Qwen2-0.5B and 1.1B-parameter mixture-of-experts models confirm that the algorithm scales beyond small models.
AngularMuown provides more explicit control over angular step sizes, offering improved optimization stability and performance for pre-training Transformers.
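
The geometric idea behind these claims can be illustrated with a toy NumPy sketch. It is not the article's algorithm: it uses plain vectors and a raw gradient rather than Muown's actual update rule, and every function name here (`euclidean_step`, `angular_step`, `angle_between`) is hypothetical. Under those assumptions, it shows two things: a Euclidean step on un-normalized weights `w = r * u` rotates the direction `u` by an angle that shrinks roughly as `1 / r` when the gradient and learning rate are held fixed, while a step taken directly on the unit sphere rotates `u` by exactly the angle requested.

```python
import numpy as np


def project_tangent(u, g):
    """Remove from g its component along the unit vector u."""
    return g - np.dot(u, g) * u


def euclidean_step(w, g, lr):
    """Plain gradient step on the un-normalized weights w."""
    return w - lr * g


def angular_step(u, g, angle):
    """Rotate the unit vector u by `angle` radians along the sphere geodesic
    that points in the descent direction -P_u(g)."""
    t = project_tangent(u, g)
    t_norm = np.linalg.norm(t)
    if t_norm == 0.0:
        return u  # gradient is purely radial, so there is no direction to rotate in
    d = -t / t_norm
    return np.cos(angle) * u + np.sin(angle) * d


def angle_between(a, b):
    """Angle in radians between a and b, computed with arctan2 for accuracy at small angles."""
    a_hat = a / np.linalg.norm(a)
    par = np.dot(a_hat, b)
    perp = np.linalg.norm(b - par * a_hat)
    return float(np.arctan2(perp, par))


rng = np.random.default_rng(0)
u = rng.normal(size=8)
u /= np.linalg.norm(u)      # unit direction
g = rng.normal(size=8)      # stand-in gradient, held fixed across radii (an assumption)
lr = 1e-3

# Same gradient and learning rate, different radii: the rotation shrinks as r grows.
euclid_angles = {
    r: angle_between(u, euclidean_step(r * u, g, lr)) for r in (1.0, 2.0, 4.0)
}

# Step on the sphere with an explicit angle: the rotation equals the requested angle.
u_next = angular_step(u, g, angle=0.05)
explicit_angle = angle_between(u, u_next)

for r, a in euclid_angles.items():
    print(f"radius {r}: implicit rotation {a:.6f} rad")
print(f"explicit rotation {explicit_angle:.6f} rad")
```

In this sketch the `angle` argument plays the role of the explicit angular multiplier: because it no longer depends on the radius, it could be driven by any schedule, for example a warmup followed by decay.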