The authors propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm that integrates the capabilities of multiple domain-specific reinforcement learning teachers into a single student model. Because the teachers are distilled into the student on the student's own rollouts, the approach eliminates exposure bias and provides a dense optimization signal.
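As a rough illustration of what distilling on the student's own rollouts could look like, the sketch below averages a per-token divergence between the student and the teacher responsible for each rollout's domain. It is a sketch under stated assumptions, not the authors' implementation: the reverse KL objective, the routing of rollouts to teachers by a domain label, and every name used (`distill_loss`, `teachers`, the rollout fields) are hypothetical and are not specified in this summary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed objective)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def distill_loss(rollouts, teachers):
    """Mean per-token divergence over rollouts sampled from the student.

    rollouts: list of dicts with hypothetical fields
        "domain"          -> key selecting the teacher,
        "tokens"          -> token ids sampled by the student,
        "student_logits"  -> one logit list per sampled token.
    teachers: dict mapping a domain to a callable that scores the
        student's tokens and returns one logit list per token.
    """
    total, count = 0.0, 0
    for rollout in rollouts:
        # The teacher scores the sequence the student generated, so the
        # supervision is on-policy rather than on teacher-written text.
        teacher_logits = teachers[rollout["domain"]](rollout["tokens"])
        for s_logits, t_logits in zip(rollout["student_logits"], teacher_logits):
            # Every token position contributes a term: a dense signal.
            total += reverse_kl(s_logits, t_logits)
            count += 1
    return total / count
```

In this toy form each teacher is just a callable, which mirrors the idea that domain teachers can be developed independently and plugged in at distillation time.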
- MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines on Qwen3-30B-A3B, inheriting nearly all of each teacher's capability.
- The method enables parallel, independent development of domain teachers, removing the cross-domain coupling typical of multi-domain post-training.
- MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model.
This work demonstrates practical value for capability integration in frontier-scale LLMs: specialized skills can be combined efficiently while retaining nearly all of each teacher's performance.