The authors propose VRRL, a reinforcement learning framework designed to enable vision-language models to perform visually grounded self-reflection during chain-of-thought reasoning.
- VRRL randomly masks trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions.
- The method introduces buffered roll-ins, which start rollouts from states drawn from an experience replay buffer, to expose the model to diverse failure states.
- Evaluation on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks, shows substantial improvements in out-of-distribution accuracy over standard RL baselines.
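
The two training mechanisms listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify how prefixes are masked or how the buffer is managed, so the sketch assumes that "masking" means excluding roll-in prefix steps from the loss, and that the buffer is a simple first-in, first-out store. The names `PrefixBuffer`, `sample_roll_in`, `loss_mask`, and `masked_mean` are hypothetical.

```python
import random


class PrefixBuffer:
    """Hypothetical FIFO replay buffer of past trajectories (lists of steps)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def add(self, trajectory):
        # Skip empty trajectories so sampling always has a step to cut before.
        if not trajectory:
            return
        self.items.append(list(trajectory))
        # Evict the oldest trajectory once capacity is exceeded.
        if len(self.items) > self.capacity:
            self.items.pop(0)

    def sample_roll_in(self):
        """Return a random proper prefix of a stored trajectory (may be empty)."""
        trajectory = self.rng.choice(self.items)
        cut = self.rng.randrange(len(trajectory))  # 0 .. len - 1
        return trajectory[:cut]


def loss_mask(prefix_len, total_len):
    """Assumed masking scheme: 0 for roll-in prefix steps, 1 for the rest."""
    return [0] * prefix_len + [1] * (total_len - prefix_len)


def masked_mean(values, mask):
    """Average only the per-step values whose mask entry is 1."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

In a training loop built on this sketch, a prefix returned by `sample_roll_in()` would be fed to the policy as context, the policy would generate the continuation, and `loss_mask` would restrict the objective to the generated steps, so the model is optimized for recovering from whatever state the prefix left it in.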