The article shows that Muown's directional update is equivalent to a Riemannian step on normalized directions, in which the magnitude of the unnormalized parameters modulates the angular step size. This explains Muown's step-size stability and motivates AngularMuown, which optimizes directly over normalized directions with an explicit, schedulable angular multiplier.
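
To make the role of the magnitude concrete, here is a minimal NumPy sketch. It does not implement Muown's update rule, which this summary does not spell out. It assumes plain gradient descent on a loss that depends only on the direction w/||w||, and the helper names (`tangent_project`, `angle_between`, `angular_step_of_plain_gd`) are illustrative. Under those assumptions, the angle by which the direction rotates scales as lr/||w||^2, so doubling the magnitude cuts the angular step to roughly a quarter.

```python
import numpy as np

def tangent_project(v, g):
    """Remove the component of g along the unit vector v."""
    return g - np.dot(v, g) * v

def angle_between(a, b):
    """Angle in radians between a and b, stable for small angles."""
    a_hat = a / np.linalg.norm(a)
    parallel = np.dot(a_hat, b)
    perpendicular = np.linalg.norm(b - parallel * a_hat)
    return float(np.arctan2(perpendicular, parallel))

def angular_step_of_plain_gd(w, grad_f_at_v, lr):
    """Rotation of w/||w|| after one plain GD step on w for a loss f(w/||w||).

    Assumption: plain gradient descent, not Muown's actual update.
    """
    r = np.linalg.norm(w)
    v = w / r
    # Chain rule through the normalization: tangent gradient scaled by 1/r.
    grad_w = tangent_project(v, grad_f_at_v) / r
    w_new = w - lr * grad_w
    return angle_between(w, w_new)

rng = np.random.default_rng(0)
v = rng.normal(size=8)
v /= np.linalg.norm(v)
g = rng.normal(size=8)   # stand-in for the gradient of f at v
lr = 1e-3

small = angular_step_of_plain_gd(1.0 * v, g, lr)  # ||w|| = 1
large = angular_step_of_plain_gd(2.0 * v, g, lr)  # ||w|| = 2
print(small / large)  # close to 4 for small learning rates
```

The same learning rate therefore produces very different angular steps depending on the parameter norm, which is the kind of implicit coupling an explicit angular multiplier removes.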

  • AngularMuown decouples the angular multiplier from the radial magnitude update to optimize directly over normalized directions.
  • The method improves on Muown's performance and leads the per-optimizer category in the modded nanoGPT speedrunning competition.
  • Experiments on Qwen2-0.5B and 1.1B-parameter mixture-of-experts models confirm that the algorithm scales beyond small models.
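
The decoupling described in the first bullet can be sketched generically as follows. This is not the AngularMuown algorithm itself, whose exact update is not given in this summary. It is a hypothetical illustration that assumes a unit direction `v`, a scalar magnitude `r`, a geodesic step on the unit sphere, and a cosine schedule for the angular step. The names `cosine_angle_schedule` and `decoupled_step` are made up for this example.

```python
import math
import numpy as np

def cosine_angle_schedule(step, total_steps, peak_angle):
    """Hypothetical schedule: angular step in radians, decaying from peak_angle to 0."""
    return 0.5 * peak_angle * (1.0 + math.cos(math.pi * step / total_steps))

def decoupled_step(v, r, grad_v, grad_r, angle, radial_lr):
    """One update of a unit direction v and a scalar magnitude r.

    The direction rotates by `angle` radians regardless of r;
    the magnitude takes its own plain gradient step.
    """
    g_tan = grad_v - np.dot(v, grad_v) * v     # project onto the tangent space at v
    norm = np.linalg.norm(g_tan)
    if norm > 0.0:
        u = g_tan / norm
        # Move along the great circle away from the tangent gradient direction.
        v = math.cos(angle) * v - math.sin(angle) * u
        v = v / np.linalg.norm(v)              # guard against round-off drift
    r = r - radial_lr * grad_r
    return v, r

rng = np.random.default_rng(0)
v0 = rng.normal(size=8)
v0 /= np.linalg.norm(v0)
g = rng.normal(size=8)

angle = cosine_angle_schedule(step=0, total_steps=100, peak_angle=0.05)
v1, r1 = decoupled_step(v0, 3.0, g, grad_r=0.5, angle=angle, radial_lr=0.1)
v2, r2 = decoupled_step(v0, 30.0, g, grad_r=0.5, angle=angle, radial_lr=0.1)
print(np.allclose(v1, v2))  # the rotation does not depend on the magnitude
```

In this sketch the angular step is set directly by the schedule, so changing the magnitude has no effect on how far the direction moves.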

AngularMuown provides more explicit control over angular step sizes, improving optimization stability and performance when pre-training Transformers.