To address the sparse-reward problem in outcome-based reinforcement learning, the authors propose OPID, a framework that extracts skill supervision directly from completed on-policy trajectories. By representing trajectory hindsight as hierarchical skills, OPID provides dense, distribution-matched token-level supervision without relying on external memory.
- OPID captures global workflows via episode-level skills and local decision knowledge via step-level skills.
- A critical-first routing mechanism injects step-level skills for critical decisions or falls back to episode-level guidance.
- The method combines token-level log-probability shifts, induced by adding skills to the context, with outcome advantages for policy optimization.
- Experiments on ALFWorld, WebShop, and search-based QA show gains over baselines in task performance, sample efficiency, and robustness.
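
The routing and advantage-combination steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Step` container, the function names, the additive form of the combination, and the weight `beta` are all assumptions made for illustration, since the summary does not specify how the two signals are merged.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One decision step of a completed trajectory (hypothetical container)."""
    token_logps: List[float]          # token log-probs under the plain context
    token_logps_skill: List[float]    # token log-probs with the routed skill in context
    step_skill: Optional[str] = None  # present only if the step was judged critical


def route_skill(step: Step, episode_skill: str) -> str:
    """Critical-first routing: use the step-level skill if one exists,
    otherwise fall back to the episode-level skill."""
    return step.step_skill if step.step_skill is not None else episode_skill


def token_advantages(step: Step, outcome_adv: float, beta: float = 0.5) -> List[float]:
    """Per-token advantage: the trajectory's outcome advantage plus a scaled
    log-probability shift. The additive form and `beta` are assumptions."""
    if len(step.token_logps) != len(step.token_logps_skill):
        raise ValueError("log-prob lists must align token by token")
    return [
        outcome_adv + beta * (lp_skill - lp)
        for lp, lp_skill in zip(step.token_logps, step.token_logps_skill)
    ]


if __name__ == "__main__":
    critical = Step([-1.0, -2.0], [-0.5, -2.5], step_skill="check the drawer first")
    routine = Step([-1.0], [-1.0])
    print(route_skill(critical, "explore, then act"))
    print(route_skill(routine, "explore, then act"))
    print(token_advantages(critical, outcome_adv=1.0))
```

In this sketch, tokens that become more likely once the skill is in context receive a larger advantage than the sparse outcome signal alone would give them, and tokens that become less likely receive a smaller one. That is one plausible way to obtain dense token-level supervision on top of an outcome-based objective.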
This approach preserves reinforcement learning as the primary training objective while enabling more effective learning through dense hindsight supervision.