This article presents AsyncOPD, a fully asynchronous on-policy distillation pipeline that decouples rollout generation from learner updates to alleviate training bottlenecks in large language model post-training. The authors provide the first systematic study of staleness effects in this context, demonstrating that teacher-weighted forward KL is robust to stale rollouts while student-weighted reverse KL is vulnerable.

  • Teacher-weighted forward KL divergence is more robust to stale data than student-weighted reverse KL divergence.
  • Stabilization methods from asynchronous reinforcement learning do not outperform a simpler OPD-specific surrogate that recomputes the reverse-KL signal at learner time.
  • Finite teacher-score caches create a bias-variance tradeoff, which motivates multi-sample Monte Carlo (MC) estimation: averaging over several samples reduces the variance of a one-sample estimate while preserving MC correctability.
  • The open-source AsyncOPD pipeline improves training throughput by 1.6x to 3.8x over strict synchronous training while maintaining comparable accuracy.
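
To make the terms in these findings concrete, the following is a minimal sketch of the two divergences and of one-sample versus multi-sample Monte Carlo estimation. It is not the authors' implementation: the toy four-token distributions, the function names, and the plain sampled estimator are all assumptions chosen for illustration, and the teacher-score cache itself is not modeled.

```python
import math
import random

def forward_kl(teacher, student):
    """KL(teacher || student): each term is weighted by the teacher probability."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher): each term is weighted by the student probability."""
    return sum(q * math.log(q / p) for p, q in zip(teacher, student) if q > 0)

def mc_reverse_kl(teacher, student, k, rng):
    """Estimate reverse KL from k tokens sampled from the student (hypothetical helper)."""
    tokens = rng.choices(range(len(student)), weights=student, k=k)
    return sum(math.log(student[t] / teacher[t]) for t in tokens) / k

def estimator_variance(teacher, student, k, trials, rng):
    """Empirical variance of the k-sample estimator over repeated trials."""
    estimates = [mc_reverse_kl(teacher, student, k, rng) for _ in range(trials)]
    mean = sum(estimates) / trials
    return sum((e - mean) ** 2 for e in estimates) / trials

# Toy next-token distributions over a 4-token vocabulary (assumed values).
teacher = [0.70, 0.20, 0.05, 0.05]
student = [0.40, 0.30, 0.20, 0.10]

rng = random.Random(0)
var_1 = estimator_variance(teacher, student, k=1, trials=2000, rng=rng)
var_8 = estimator_variance(teacher, student, k=8, trials=2000, rng=rng)

print(f"forward KL = {forward_kl(teacher, student):.4f}")
print(f"reverse KL = {reverse_kl(teacher, student):.4f}")
print(f"variance with 1 sample = {var_1:.4f}, with 8 samples = {var_8:.4f}")
```

In this sketch the sampled estimator is unbiased for the reverse KL, and averaging k independent samples divides its variance by roughly k, which is the effect the multi-sample finding relies on.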

The authors consider this significant because it enables higher training throughput for reasoning workloads without sacrificing model performance, addressing the critical systems bottleneck where rollouts dominate training time.
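
The decoupling of rollout generation from learner updates can be sketched as a producer/consumer pair that shares a bounded queue. This is an illustrative toy under stated assumptions, not the AsyncOPD code: `Rollout`, `PolicyVersion`, `rollout_worker`, and `learner` are hypothetical names, the payload is a placeholder string, and staleness is measured simply as the number of learner updates between generating a rollout and consuming it.

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class Rollout:
    policy_version: int  # learner version the generator saw when sampling
    payload: str         # stand-in for generated tokens

class PolicyVersion:
    """Thread-safe counter standing in for the learner's weight version."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def bump(self):
        with self._lock:
            self._value += 1
            return self._value

def rollout_worker(version, out_queue, stop):
    """Generate rollouts continuously without waiting for learner updates."""
    i = 0
    while not stop.is_set():
        rollout = Rollout(version.get(), f"rollout-{i}")
        try:
            out_queue.put(rollout, timeout=0.05)
            i += 1
        except queue.Full:
            continue  # drop this attempt and regenerate with a fresher version

def learner(version, in_queue, num_steps):
    """Consume rollouts, record their staleness, and advance the policy version."""
    staleness = []
    for _ in range(num_steps):
        rollout = in_queue.get()
        staleness.append(version.get() - rollout.policy_version)
        version.bump()  # stand-in for one gradient update
    return staleness

def run(num_steps=50, queue_size=4):
    version = PolicyVersion()
    rollouts = queue.Queue(maxsize=queue_size)
    stop = threading.Event()
    worker = threading.Thread(
        target=rollout_worker, args=(version, rollouts, stop), daemon=True
    )
    worker.start()
    staleness = learner(version, rollouts, num_steps)
    stop.set()
    worker.join()
    return staleness

staleness_log = run()
print("max staleness:", max(staleness_log))
```

With a single worker, the bounded queue caps staleness at `queue_size + 1` updates in this toy: a larger queue keeps the learner busier but lets rollouts grow staler, which is the regime where the choice of distillation objective matters.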