openPangu releases openPangu-2.0-Flash, a 92B MoE model with 512k context

The openPangu team has released openPangu-2.0-Flash, a Mixture-of-Experts (MoE) model trained on Ascend hardware. The model has 92 billion total parameters, of which 6 billion are activated per token, and supports a context length of 512k tokens.
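
For a sense of scale, the figures above imply that only a small share of the weights takes part in each token's forward pass, roughly 6.5%. The sketch below works through that arithmetic. The forward-cost estimate relies on the common rule of thumb of about 2 FLOPs per active parameter per token, which is an assumption here and not a figure from the release, and the function names are purely illustrative.

```python
# Back-of-the-envelope sparsity arithmetic for the parameter counts quoted above.
TOTAL_PARAMS = 92e9   # total parameters (from the announcement)
ACTIVE_PARAMS = 6e9   # parameters activated per token (from the announcement)


def active_fraction(total: float, active: float) -> float:
    """Share of the model's weights used for a single token."""
    return active / total


def approx_forward_flops_per_token(active: float) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token.

    Assumption, not a number from the release. It ignores attention cost,
    which grows with context length.
    """
    return 2.0 * active


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    flops = approx_forward_flops_per_token(ACTIVE_PARAMS)
    print(f"active fraction: {frac:.1%}")
    print(f"approx. forward FLOPs per token: {flops:.2e}")
```

Under these assumptions, per-token compute is closer to that of a dense model with about 6 billion parameters, while memory footprint is still governed by the full 92 billion.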

Pretraining used 34 trillion tokens and was followed by unified supervised fine-tuning (SFT) covering both slow and fast thinking modes, then by multiple specialist reinforcement learning (RL) training stages.
Architecture changes include an efficient attention scheme that mixes MLA, DSA, and SWA layers in a 1:2 ratio to lower compute and memory costs.
The model replaces the conventional residual path with a 4-stream mHC design to improve representation diversity and generalization.
Multi-token prediction uses three heads to draft three additional tokens per step for faster inference via self-speculative decoding.
Training employs the Muon optimizer to achieve faster convergence.
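
The draft-and-verify idea behind the multi-token prediction heads can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the openPangu implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the main next-token head and the three drafting heads, acceptance is greedy (a draft is kept only if it matches the target's own choice), and a real system would verify all drafts in one batched forward pass instead of calling the model token by token.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]
DraftFn = Callable[[Sequence[Token], int], List[Token]]


def speculative_step(context: Sequence[Token], target_next: NextTokenFn,
                     draft: DraftFn, k: int = 3) -> List[Token]:
    """One draft-and-verify step; returns the tokens emitted in this step."""
    emitted: List[Token] = []
    for tok in draft(context, k):
        expected = target_next(list(context) + emitted)
        if tok != expected:
            # First mismatch: keep the target's token and discard later drafts.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
    # Every draft was accepted, so the target contributes one extra token.
    emitted.append(target_next(list(context) + emitted))
    return emitted


def generate(prompt: Sequence[Token], target_next: NextTokenFn, draft: DraftFn,
             max_new_tokens: int, k: int = 3) -> Tuple[List[Token], int]:
    """Run speculative steps until max_new_tokens are produced.

    Returns the token sequence and the number of steps taken.
    """
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, target_next, draft, k))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


# Hypothetical toy models: the target counts upward modulo 10.
def toy_target(context: Sequence[Token]) -> Token:
    return (context[-1] + 1) % 10


# The toy drafter agrees with the target except that it never predicts 7.
def toy_draft(context: Sequence[Token], k: int) -> List[Token]:
    drafts: List[Token] = []
    last = context[-1]
    for _ in range(k):
        last = (last + 1) % 10
        drafts.append(last if last != 7 else 0)  # deliberate error on 7
    return drafts


if __name__ == "__main__":
    tokens, steps = generate([0], toy_target, toy_draft, max_new_tokens=12)
    print(tokens, steps)
```

With greedy acceptance, the output is identical to what the target would produce on its own. The speedup comes from emitting up to `k + 1` tokens per step whenever the drafts are accepted.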

The release gives users an open-source option for long-context reasoning, with inference speed as a central design goal.