Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation

The authors propose a reinforcement learning fine-tuning framework that uses autonomous vision-language evaluation as a scalable supervision signal for GUI agents, removing the need for manual labels or task-specific heuristics. The method treats evaluator feedback as a noisy binary reward channel and derives a noise-corrected estimator for Proximal Policy Optimization (PPO), which addresses the difficulty of obtaining machine-readable rewards in open-ended desktop environments.

The framework uses a Vision-Language Model to judge task completion based on final screenshots and original instructions without manual intervention during policy optimization.
A noise-corrected reward estimator is derived specifically for Proximal Policy Optimization to account for imperfect autonomous evaluators.
Experiments across macOSWorld, Windows Agent Arena, and OSWorld demonstrate that corrected evaluator rewards outperform zero-shot baselines and raw evaluator fine-tuning.
The approach improves success rates by an average of 12.6 percentage points over zero-shot performance and 5.1 points over raw evaluator fine-tuning.
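
The idea of correcting a noisy binary reward can be illustrated with a minimal sketch. It assumes the evaluator's false-positive and false-negative rates are known and uses the standard unbiased correction for class-conditional binary noise; the paper's own estimator may differ in form. All names here (`corrected_reward`, `noisy_verdict`, `mean_corrected`, `fp_rate`, `fn_rate`) are hypothetical, and the simulated evaluator is a stand-in for a real VLM judge.

```python
import random


def corrected_reward(observed: int, fp_rate: float, fn_rate: float) -> float:
    """Unbiased estimate of the true binary reward from a noisy verdict.

    Assumptions (not taken from the paper):
      fp_rate = P(evaluator says success | task actually failed)
      fn_rate = P(evaluator says failure | task actually succeeded)
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0.0:
        raise ValueError("requires fp_rate + fn_rate < 1 (better than chance)")
    # E[observed | success] = 1 - fn_rate and E[observed | failure] = fp_rate,
    # so subtracting fp_rate and rescaling maps those means to 1 and 0.
    return (observed - fp_rate) / denom


def noisy_verdict(true_reward: int, fp_rate: float, fn_rate: float,
                  rng: random.Random) -> int:
    """Simulate the evaluator as a binary noisy channel (hypothetical stand-in)."""
    if true_reward == 1:
        return 0 if rng.random() < fn_rate else 1
    return 1 if rng.random() < fp_rate else 0


def mean_corrected(true_reward: int, fp_rate: float, fn_rate: float,
                   n: int = 100_000, seed: int = 0) -> float:
    """Average the corrected reward over n simulated evaluator verdicts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        verdict = noisy_verdict(true_reward, fp_rate, fn_rate, rng)
        total += corrected_reward(verdict, fp_rate, fn_rate)
    return total / n


if __name__ == "__main__":
    # With 10% false positives and 20% false negatives, the averages should
    # land close to the true rewards 1 and 0 respectively.
    print(mean_corrected(1, 0.1, 0.2))
    print(mean_corrected(0, 0.1, 0.2))
```

Under these assumptions a single corrected reward can fall outside [0, 1], but its expectation matches the true reward, which is the property a policy-gradient method such as PPO relies on when averaging over many rollouts. In practice the noise rates would have to be estimated, and the sketch does not cover that step.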

This work demonstrates that autonomous evaluation can serve as a practical reward signal for reinforcement learning in GUI environments when evaluator noise is explicitly modeled and corrected.

Benchmark	Model	Success-rate gain over zero-shot (percentage points, reported average)
OSWorld	proposed RL fine-tuning framework	12.6
Windows Agent Arena	proposed RL fine-tuning framework	12.6

Benchmarks