OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
The authors propose OPID, a framework that extracts skill supervision directly from completed on-policy trajectories to address the sparse reward problem in outcome-based reinforcement learning. By representing trajectory hindsight as hierarchical skills, OPID provides dense, distribution-matched token-level supervision without relying on external memory.
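The contrast between a sparse outcome reward and dense token-level supervision can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the summary does not state which divergence OPID uses or how skills condition the supervising distribution. The sketch assumes a per-token reverse KL between the policy's own next-token distribution (`student_logits`) and the same policy's distribution when conditioned on extracted skills (`teacher_logits`), both evaluated on tokens the policy itself sampled. All function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at a single token position."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def sparse_outcome_signal(num_tokens, success):
    """Outcome-based RL: a single scalar reward on the final token only."""
    return [0.0] * (num_tokens - 1) + [1.0 if success else 0.0]


def dense_distillation_signal(student_logits, teacher_logits):
    """Assumed form of token-level supervision: one divergence per position."""
    return [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]


# Toy 3-token trajectory over a 3-word vocabulary (made-up numbers).
student_logits = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3], [1.0, 2.0, 0.0]]
# Hypothetical skill-conditioned distribution: agrees at step 0,
# is much more decisive at step 1, and differs slightly at step 2.
teacher_logits = [[2.0, 0.5, 0.1], [0.0, 3.0, 0.0], [1.0, 2.5, 0.0]]

sparse = sparse_outcome_signal(len(student_logits), success=True)
dense = dense_distillation_signal(student_logits, teacher_logits)

print("sparse:", sparse)
print("dense: ", [round(d, 4) for d in dense])
```

The sparse signal says nothing about which step mattered, while the dense signal is nonzero exactly at the positions where the skill-conditioned distribution disagrees with the policy. Because both distributions are evaluated on the policy's own samples, the supervision stays on the policy's state distribution, which is one plausible reading of "distribution-matched" in the summary.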