Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

The authors propose VRRL, a reinforcement learning framework designed to enable vision-language models to perform visually grounded self-reflection during chain-of-thought reasoning.

VRRL randomly masks trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions.
The method also introduces buffered roll-ins, drawn from an experience replay buffer, to expose the model to diverse failure states.
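
The two mechanisms above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "masking a prefix" means excluding the leading steps of a trajectory from the training loss, and that a "roll-in" is a stored partial trajectory from which the policy continues. All names (prefix_loss_mask, masked_mean, RollInBuffer) are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


def prefix_loss_mask(seq_len, rng):
    """Return a 0/1 mask that zeroes a random-length prefix of a trajectory.

    Assumption: masked steps contribute nothing to the loss, so the training
    signal comes only from the continuation after the (possibly wrong) prefix.
    """
    cut = rng.randint(0, seq_len - 1)  # always keep at least the final step
    return [0] * cut + [1] * (seq_len - cut)


def masked_mean(per_step_loss, mask):
    """Average the per-step losses over the unmasked steps only."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(per_step_loss, mask)) / kept


class RollInBuffer:
    """Hypothetical replay buffer of past trajectories used as roll-ins."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest entry when full
        self.items = deque(maxlen=capacity)

    def add(self, trajectory):
        self.items.append(list(trajectory))

    def sample_roll_in(self, rng):
        """Pick a stored trajectory and return a random non-empty prefix."""
        traj = rng.choice(self.items)
        cut = rng.randint(1, len(traj))
        return traj[:cut]


rng = random.Random(0)
mask = prefix_loss_mask(6, rng)
loss = masked_mean([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], mask)

buffer = RollInBuffer(capacity=2)
buffer.add(["look at row 3", "read value 41", "answer 41"])
buffer.add(["look at row 2", "read value 17", "answer 17"])
roll_in = buffer.sample_roll_in(rng)
```

In this sketch the policy would resume generation from roll_in, so that states reached by earlier, possibly mistaken, reasoning steps are revisited during training rather than only states reached by the current policy.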
Evaluation on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks, shows substantial improvements in out-of-distribution accuracy over standard RL baselines.