Learning Process Rewards via Success Visitation Matching for Efficient RL

The authors address the challenge of training reinforcement learning policies with inherently sparse outcome rewards, which make credit assignment difficult. They propose transforming these sparse rewards into dense process rewards by training a discriminator to distinguish between successful and unsuccessful episodes. The discriminator incentivizes the policy to match the state-action visitations of successful episodes while avoiding those of unsuccessful ones. The resulting reward provides dense feedback on progress toward task completion, and the authors show that it provably does not alter the optimal policy. The method is applied to finetuning robotic control policies for manipulation tasks. Experiments show significantly faster RL finetuning in both simulated and real-world environments than maximizing sparse outcome rewards alone.
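
To make the discriminator idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a logistic-regression discriminator over hand-made one-dimensional state-action features, and it assumes the dense reward is the discriminator's log-odds, log D - log(1 - D), a form borrowed from adversarial imitation learning. The summary above does not state the exact reward or the construction behind the optimality guarantee, so both choices are illustrative only. All names (`train_discriminator`, `process_reward`, the toy data) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(w, b, x):
    # Linear score of the discriminator for feature vector x.
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_discriminator(success, failure, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent.

    Features visited in successful episodes get label 1,
    features visited in unsuccessful episodes get label 0.
    """
    dim = len(success[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in success] + [(x, 0.0) for x in failure]
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(logit(w, b, x)) - y  # gradient of the log loss
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def process_reward(w, b, x):
    # Assumed reward form: log D(x) - log(1 - D(x)).
    # For a logistic discriminator this equals the linear score.
    return logit(w, b, x)


# Toy data (an assumption): successful episodes visit features near 1.0,
# unsuccessful episodes visit features near 0.0.
random.seed(0)
success_feats = [[random.gauss(1.0, 0.2)] for _ in range(100)]
failure_feats = [[random.gauss(0.0, 0.2)] for _ in range(100)]

w, b = train_discriminator(success_feats, failure_feats)
r_near_success = process_reward(w, b, [1.0])
r_near_failure = process_reward(w, b, [0.0])
print("reward near successful visitations:", r_near_success)
print("reward near unsuccessful visitations:", r_near_failure)
```

In this toy setting the reward is higher for features resembling successful visitations, so every step receives a learning signal rather than only the final one. In an RL loop, such a reward would be computed per transition and the discriminator would be refit as new successful and unsuccessful episodes are collected; how the authors combine it with the sparse outcome reward is not specified in this summary.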