Learning Process Rewards via Success Visitation Matching for Efficient RL

The authors propose a method to transform inherently sparse outcome rewards in reinforcement learning into dense process rewards by training a discriminator to distinguish between successful and unsuccessful episodes. This approach incentivizes the policy to match the state-action visitations of successful episodes while avoiding those of unsuccessful ones, providing dense feedback on progress without altering the optimal policy.

The method trains a discriminator to distinguish previously collected successful episodes from unsuccessful ones.
It incentivizes the RL policy to match state-action visitations of successful episodes.
The approach provides dense feedback on progress toward task completion.
It provably achieves this goal without changing the optimal policy.
The authors demonstrate faster RL finetuning on simulated and real-world robotic manipulation tasks.
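
To make the discriminator idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a logistic-regression discriminator over concatenated state-action feature vectors and uses the discriminator logit, log D - log(1 - D), as the dense per-step reward. That reward form is borrowed from adversarial imitation learning and is an assumption; the summary does not specify the exact formulation. The names train_discriminator and process_reward, the synthetic data, and all hyperparameters are hypothetical, and the sketch does not reproduce the guarantee that the optimal policy is unchanged.

```python
import numpy as np

def train_discriminator(success_sa, failure_sa, lr=0.5, steps=500):
    """Fit D(s, a) = sigmoid(w . x + b), labelling success pairs 1 and failure pairs 0."""
    x = np.vstack([success_sa, failure_sa])
    y = np.concatenate([np.ones(len(success_sa)), np.zeros(len(failure_sa))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted success probability
        grad = p - y                            # gradient of the log loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def process_reward(sa, w, b):
    """Assumed dense reward: the discriminator logit, equal to log D - log(1 - D)."""
    return sa @ w + b

# Synthetic stand-ins for state-action features from past episodes (illustrative only).
rng = np.random.default_rng(0)
success_sa = rng.normal(loc=1.0, scale=0.5, size=(200, 4))
failure_sa = rng.normal(loc=-1.0, scale=0.5, size=(200, 4))

w, b = train_discriminator(success_sa, failure_sa)
r_success = process_reward(success_sa, w, b).mean()
r_failure = process_reward(failure_sa, w, b).mean()
print(f"mean reward on success-like pairs: {r_success:.2f}")
print(f"mean reward on failure-like pairs: {r_failure:.2f}")
```

In this toy setting, state-action pairs that resemble successful episodes receive a positive reward at every step and failure-like pairs a negative one, which is the kind of dense progress signal described above. In practice the discriminator would more likely be a neural network trained on real rollouts.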

This technique addresses the challenging credit assignment problem in sparse-reward settings, leading to significantly faster improvement of robotic control policies under reinforcement learning.