The authors propose a reinforcement learning fine-tuning framework that uses autonomous vision-language evaluation as a scalable supervision signal for GUI agents, removing the need for manual labels or task-specific heuristics. The method treats evaluator feedback as a noisy binary reward channel and derives a noise-corrected estimator for Proximal Policy Optimization, which addresses the difficulty of obtaining machine-readable rewards in open-ended desktop environments.
- The framework uses a Vision-Language Model to judge task completion based on final screenshots and original instructions without manual intervention during policy optimization.
- A noise-corrected reward estimator is derived specifically for Proximal Policy Optimization to account for imperfect autonomous evaluators.
- Experiments across macOSWorld, Windows Agent Arena, and OSWorld show that fine-tuning with corrected evaluator rewards outperforms both zero-shot baselines and fine-tuning on raw evaluator rewards.
- The approach improves success rates by an average of 12.6 percentage points over zero-shot performance and 5.1 percentage points over raw evaluator fine-tuning.
This work demonstrates that autonomous evaluation can serve as a practical reward signal for reinforcement learning in GUI environments when evaluator noise is explicitly modeled and corrected.
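
The summary does not state the form of the noise-corrected estimator, so the following is only an illustrative sketch of the general idea, not the authors' derivation. It assumes the evaluator behaves as a binary channel with a known false-positive rate `fpr` (verdict 1 when the task actually failed) and false-negative rate `fnr` (verdict 0 when the task actually succeeded), and applies the standard unbiased correction for class-conditional label noise: since E[verdict] = fpr + (1 - fpr - fnr) * r for true reward r, the quantity (verdict - fpr) / (1 - fpr - fnr) has expectation r. All function names and the error rates used here are hypothetical.

```python
import random


def corrected_reward(verdict: int, fpr: float, fnr: float) -> float:
    """Unbiased estimate of the true binary reward from a noisy verdict.

    Assumed noise model (not taken from the paper):
      fpr = P(verdict = 1 | true reward = 0)
      fnr = P(verdict = 0 | true reward = 1)
    """
    denom = 1.0 - fpr - fnr
    if denom <= 0.0:
        # The correction is only defined for a better-than-chance evaluator.
        raise ValueError("requires fpr + fnr < 1")
    return (verdict - fpr) / denom


def noisy_verdict(true_reward: int, fpr: float, fnr: float,
                  rng: random.Random) -> int:
    """Simulate an imperfect evaluator that flips outcomes at the given rates."""
    if true_reward == 1:
        return 0 if rng.random() < fnr else 1
    return 1 if rng.random() < fpr else 0


def simulate(success_rate: float, fpr: float, fnr: float,
             episodes: int, seed: int = 0) -> tuple[float, float]:
    """Return (mean raw verdict, mean corrected reward) over simulated episodes."""
    rng = random.Random(seed)
    raw_sum = 0.0
    corrected_sum = 0.0
    for _ in range(episodes):
        true_reward = 1 if rng.random() < success_rate else 0
        verdict = noisy_verdict(true_reward, fpr, fnr, rng)
        raw_sum += verdict
        corrected_sum += corrected_reward(verdict, fpr, fnr)
    return raw_sum / episodes, corrected_sum / episodes


if __name__ == "__main__":
    raw_mean, corrected_mean = simulate(
        success_rate=0.3, fpr=0.2, fnr=0.1, episodes=200_000
    )
    print(f"raw mean verdict:      {raw_mean:.3f}")
    print(f"corrected mean reward: {corrected_mean:.3f}")
```

In this toy setting the raw verdicts overstate the success rate because false positives outnumber false negatives, while the corrected rewards average close to the true rate. In a PPO pipeline, one plausible use (again an assumption) would be to substitute the corrected value for the raw verdict as the terminal reward before computing advantages. The corrected value can fall outside [0, 1] for individual episodes, which is the price of unbiasedness, and its variance grows as `fpr + fnr` approaches 1.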