The openpangu team has released openPangu-2.0-Flash, a Mixture-of-Experts (MoE) model trained on Ascend hardware. The model has 92 billion total parameters, of which 6 billion are activated, and supports a context length of 512k tokens.
- Training used 34 trillion pretraining tokens, followed by unified SFT covering both slow and fast thinking, and then multiple specialist RL training runs.
- Architecture improvements include efficient attention combining MLA, DSA, and SWA in a 1:2 layer ratio to lower compute and memory costs.
- The model replaces the conventional residual path with a 4-stream mHC design to improve representation diversity and generalization.
- Multi-token prediction uses three heads to draft three additional tokens per step for faster inference via self-speculative decoding.
- Training employs the Muon optimizer to achieve faster convergence.
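
As a rough illustration of the optimizer mentioned in the last bullet, the sketch below shows the core idea usually associated with Muon: accumulate momentum for a weight matrix, then orthogonalize that momentum before applying it as the update. This is a simplified sketch under stated assumptions, not the openPangu training code. The function names, learning rate, and momentum value are hypothetical, and the classical cubic Newton-Schulz iteration is swapped in for the tuned higher-order polynomial that public Muon implementations typically use.

```python
import math

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the orthogonal polar factor of g with the cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T X)."""
    # Dividing by the Frobenius norm keeps every singular value <= 1,
    # which is inside the convergence region of the iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified update: momentum, orthogonalize, then descend.
    lr and beta are illustrative values, not taken from the release."""
    momentum_buf = [[beta * m + g for m, g in zip(mr, gr)]
                    for mr, gr in zip(momentum_buf, grad)]
    update = newton_schulz_orthogonalize(momentum_buf)
    weight = [[w - lr * u for w, u in zip(wr, ur)]
              for wr, ur in zip(weight, update)]
    return weight, momentum_buf

if __name__ == "__main__":
    w = [[0.0, 0.0], [0.0, 0.0]]
    g = [[3.0, 1.0], [1.0, 2.0]]
    m = [[0.0, 0.0], [0.0, 0.0]]
    w, m = muon_like_step(w, g, m)
    print(w)
```

Because the update is orthogonalized, its singular values are all close to 1 regardless of how the raw gradient is scaled, so every direction in the weight matrix moves by a comparable amount per step.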
The release provides an open-source option for high-performance long-context reasoning with optimized inference speed.
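
To make the inference-speed claim more concrete, here is a minimal sketch of the acceptance step in greedy speculative decoding, the general technique behind drafting extra tokens with multi-token prediction heads. It is an assumption-laden toy, not the model's actual decoding loop: `accept_drafts`, `verify_next`, and the toy verifier are hypothetical names, and a real system would check all drafts in one batched forward pass rather than one call per token.

```python
def accept_drafts(context, draft_tokens, verify_next):
    """Greedy verification of drafted tokens.

    context      : list of token ids generated so far
    draft_tokens : tokens proposed by the draft heads (three in the release)
    verify_next  : callable returning the main model's next token for a prefix
    Returns the tokens actually emitted this step.
    """
    accepted = []
    for tok in draft_tokens:
        target = verify_next(context + accepted)
        if tok != target:
            # First mismatch: keep the verifier's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(tok)
    # Every draft matched, so the verifier's next token comes for free.
    accepted.append(verify_next(context + accepted))
    return accepted

def toy_verifier(prefix):
    # Hypothetical stand-in for the main model: counts upward modulo 10.
    return (prefix[-1] + 1) % 10

if __name__ == "__main__":
    print(accept_drafts([1, 2, 3], [4, 5, 9], toy_verifier))  # [4, 5, 6]
    print(accept_drafts([1, 2, 3], [4, 5, 6], toy_verifier))  # [4, 5, 6, 7]
```

With three drafted tokens, each step emits between one and four tokens, and the output is identical to ordinary greedy decoding because every emitted token is one the verifier itself would have chosen.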