The authors propose VRRL, a reinforcement learning framework designed to enable vision-language models to perform visually grounded self-reflection during chain-of-thought reasoning.

  • VRRL randomly masks trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions.
  • The method samples roll-ins from an experience replay buffer (buffered roll-ins) to expose the model to diverse failure states.
  • Evaluation on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks, shows substantial improvements in out-of-distribution accuracy over standard RL baselines.
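
The two training mechanisms above, prefix masking and buffered roll-ins, can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not specify how masking is applied, so the code assumes one plausible reading, in which a prefix drawn from a replay buffer is excluded from the loss so that only the model's continuation is scored. The names `RollInBuffer`, `sample_prefix`, `prefix_loss_mask`, and `masked_mean` are hypothetical.

```python
import random
from collections import deque


class RollInBuffer:
    """Hypothetical fixed-size store of past trajectories (lists of steps)."""

    def __init__(self, capacity, seed=0):
        # deque with maxlen drops the oldest trajectory once full.
        self.items = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, trajectory):
        self.items.append(list(trajectory))

    def sample_prefix(self):
        """Pick a stored trajectory and cut it at a uniformly random point."""
        traj = self.rng.choice(list(self.items))
        cut = self.rng.randint(0, len(traj))  # inclusive on both ends
        return traj[:cut]


def prefix_loss_mask(prefix_len, total_len):
    """Return 0 for rolled-in prefix steps and 1 for generated steps."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    return [0] * prefix_len + [1] * (total_len - prefix_len)


def masked_mean(values, mask):
    """Average only the entries whose mask is 1 (assumed loss reduction)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    buffer = RollInBuffer(capacity=100)
    buffer.add(["look", "wrong-cell", "answer"])  # a stored failure trajectory
    prefix = buffer.sample_prefix()
    continuation = ["re-check", "correct-cell", "answer"]  # stand-in for model output
    mask = prefix_loss_mask(len(prefix), len(prefix) + len(continuation))
    per_step_loss = [1.0] * len(mask)  # placeholder per-step losses
    print(prefix, mask, masked_mean(per_step_loss, mask))
```

Under this assumption, starting from a possibly flawed stored prefix forces the continuation to begin in a failure state, while the mask keeps the prefix itself from being reinforced.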