The authors propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm that integrates the capabilities of multiple domain-specific teachers, each trained with reinforcement learning, into a single student model. Because the teachers are distilled into the student on the student's own rollouts, the approach eliminates exposure bias and provides a dense optimization signal.
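
The on-policy distillation idea can be illustrated with a toy sketch. The summary does not state the exact objective, so the code below assumes a per-token reverse KL divergence, KL(student || teacher), evaluated on tokens the student samples itself; this is a common choice for on-policy distillation, not necessarily the authors' loss. The names `on_policy_distillation_loss`, `sample_rollout`, `student`, and `teachers` are hypothetical, and the "models" are plain functions that map a token prefix to logits.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position (assumed objective)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def sample_rollout(student_logits_fn, length, rng):
    """Sample tokens from the student itself, so training is on-policy."""
    tokens = []
    for _ in range(length):
        probs = softmax(student_logits_fn(tokens))
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens


def on_policy_distillation_loss(student_logits_fn, teachers, domain,
                                length=4, seed=0):
    """Score a student rollout against the teacher for the prompt's domain.

    Every position contributes a term, which is what makes the signal
    dense compared with a single sequence-level reward.
    """
    rng = random.Random(seed)
    teacher_logits_fn = teachers[domain]  # one teacher per domain
    tokens = sample_rollout(student_logits_fn, length, rng)
    per_token = []
    for t in range(length):
        prefix = tokens[:t]
        p = softmax(student_logits_fn(prefix))
        q = softmax(teacher_logits_fn(prefix))
        per_token.append(reverse_kl(p, q))
    return sum(per_token) / length, per_token


# Toy models over a 3-token vocabulary (purely illustrative).
def student(prefix):
    return [0.0, 0.0, 0.0]


teachers = {
    "math": lambda prefix: [2.0, 0.0, 0.0],  # disagrees with the student
    "code": lambda prefix: [0.0, 0.0, 0.0],  # already matches the student
}

math_loss, math_terms = on_policy_distillation_loss(student, teachers, "math")
code_loss, code_terms = on_policy_distillation_loss(student, teachers, "code")
```

In this toy setup the loss is positive in the domain where the student disagrees with its teacher and zero where it already matches. In a real system the per-token terms would be differentiated with respect to the student's parameters.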

  • MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines on Qwen3-30B-A3B, inheriting nearly all of each teacher's capability.
  • The method enables parallel, independent development of domain teachers, removing the cross-domain coupling typical of multi-domain post-training.
  • MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model.

The work demonstrates practical value for capability integration in frontier-scale LLMs: specialized skills can be combined efficiently into one model with little loss relative to the individual teachers.