The authors propose a method to transform inherently sparse outcome rewards in reinforcement learning into dense process rewards by training a discriminator to distinguish between successful and unsuccessful episodes. This approach incentivizes the policy to match the state-action visitations of successful episodes while avoiding those of unsuccessful ones, providing dense feedback on progress without altering the optimal policy.
- The method uses a discriminator to distinguish previous successful episodes from unsuccessful ones.
- It incentivizes the RL policy to match state-action visitations of successful episodes.
- The approach provides dense feedback on progress toward task completion.
- It provably provides this dense feedback without changing the optimal policy.
- The authors demonstrate faster RL finetuning on simulated and real-world robotic manipulation tasks.
This technique addresses the challenging credit assignment problem in sparse reward settings, leading to significantly faster improvement of robotic control policies under reinforcement learning.
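The discriminator idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the logistic-regression discriminator, the use of the discriminator logit as the per-step reward, the synthetic 4-dimensional state-action features, and all function names (`train_discriminator`, `dense_reward`) are assumptions made for illustration. The summary does not specify the reward form or how the optimal policy is preserved, so neither is reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_discriminator(success_sa, failure_sa, lr=0.5, steps=500):
    """Fit a logistic-regression discriminator D(s, a) ~ P(success | s, a).

    Assumption: each row is a state-action feature vector taken from a
    past episode that was labeled successful or unsuccessful.
    """
    x = np.vstack([success_sa, failure_sa])
    y = np.concatenate([np.ones(len(success_sa)), np.zeros(len(failure_sa))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        grad = p - y  # gradient of binary cross-entropy w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def dense_reward(sa, w, b):
    """Per-step reward as the discriminator logit, log D - log(1 - D).

    Assumption: this reward form is a common choice in adversarial
    imitation learning, not necessarily the one used by the authors.
    """
    return sa @ w + b

# Synthetic stand-ins for state-action pairs from past episodes (hypothetical data).
rng = np.random.default_rng(0)
success_sa = rng.normal(loc=1.0, scale=0.5, size=(200, 4))
failure_sa = rng.normal(loc=-1.0, scale=0.5, size=(200, 4))

w, b = train_discriminator(success_sa, failure_sa)
r_success = dense_reward(success_sa, w, b).mean()
r_failure = dense_reward(failure_sa, w, b).mean()
print("mean reward on successful-episode pairs:", r_success)
print("mean reward on unsuccessful-episode pairs:", r_failure)
```

In this toy setting, state-action pairs that resemble successful episodes receive a higher per-step reward than those that resemble unsuccessful ones, which is the kind of dense progress signal the summary describes. In a real pipeline the discriminator would presumably be a neural network retrained as the policy collects new episodes.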