EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

The article introduces EvoPolicyGym, a benchmark for evaluating how agents iteratively improve executable policies from feedback within a fixed interaction budget. This controlled setting addresses a limitation of existing evaluations, which often collapse the improvement process into a final score or confound it with software engineering progress.
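To make the setting concrete, here is a minimal sketch of iterative policy improvement under a fixed interaction budget. It is not EvoPolicyGym's API: the toy environment, the threshold policy, the hill-climbing proposal step, and the names `make_policy`, `evaluate`, and `evolve` are all assumptions made for illustration. Each evaluation consumes one unit of budget, and the best-so-far score is logged after every unit so the whole trajectory can be inspected rather than only the final score.

```python
import random
from typing import Callable, List, Tuple

Policy = Callable[[float], int]  # maps an observation to a binary action


def make_policy(threshold: float) -> Policy:
    """Hypothetical executable policy: act 1 when the observation exceeds a threshold."""
    return lambda obs: 1 if obs > threshold else 0


def evaluate(policy: Policy, rng: random.Random, episodes: int = 200) -> float:
    """Toy environment (an assumption): reward 1 when the action equals obs > 0.3."""
    hits = 0
    for _ in range(episodes):
        obs = rng.uniform(-1.0, 1.0)
        hits += int(policy(obs) == int(obs > 0.3))
    return hits / episodes


def evolve(budget: int, seed: int = 0) -> Tuple[float, List[float]]:
    """Spend exactly `budget` evaluations and return the best threshold found,
    together with the best-so-far score after each evaluation."""
    rng = random.Random(seed)
    best_theta = -1.0
    best_score = evaluate(make_policy(best_theta), rng)  # first unit of budget
    trajectory = [best_score]
    for _ in range(budget - 1):
        # Propose a revision of the current best policy (simple hill climbing).
        candidate = best_theta + rng.gauss(0.0, 0.2)
        score = evaluate(make_policy(candidate), rng)
        # Keep the revision only if the feedback says it is better.
        if score > best_score:
            best_theta, best_score = candidate, score
        trajectory.append(best_score)
    return best_theta, trajectory


if __name__ == "__main__":
    theta, trajectory = evolve(budget=20)
    print(f"best threshold: {theta:.3f}")
    print("best-so-far scores:", [round(s, 2) for s in trajectory])
```

In the benchmark itself the proposal step would be an agent revising policy code rather than a random perturbation. The loop only illustrates the budget accounting and the per-step trajectory record.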

The benchmark uses compact interactive reinforcement learning environments to assess iterative policy improvement.
GPT-5.5 achieves the strongest aggregate rank score and top-two performance across all 16 environments in the suite.
EvoPolicyGym provides trajectory-level diagnostics to analyze how agents allocate their budget and convert feedback into parametric tuning.
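The summary does not say how the aggregate rank score is computed. One common way to aggregate across environments, assumed here purely for illustration, is to rank agents within each environment (1 is best, ties share the average rank) and then average those ranks over the suite. The function names, agent names, and scores below are hypothetical placeholders, not results from the article.

```python
from typing import Dict, List


def ranks_in_env(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank agents within one environment; higher score is better, rank 1 is best.
    Tied agents share the average of the rank positions they occupy."""
    ordered = sorted(scores, key=lambda agent: scores[agent], reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of agents tied with the agent at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def mean_rank(per_env: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each agent's within-environment rank over all environments."""
    totals: Dict[str, float] = {}
    for env_scores in per_env:
        for agent, rank in ranks_in_env(env_scores).items():
            totals[agent] = totals.get(agent, 0.0) + rank
    return {agent: total / len(per_env) for agent, total in totals.items()}


if __name__ == "__main__":
    # Placeholder scores for three hypothetical agents in two environments.
    per_env = [
        {"agent_a": 1.0, "agent_b": 0.5, "agent_c": 0.5},
        {"agent_a": 0.2, "agent_b": 0.9, "agent_c": 0.1},
    ]
    print(mean_rank(per_env))
```

Rank-based aggregation does not depend on each environment's reward scale, which is why it is often preferred over averaging raw scores when environments are heterogeneous. Lower mean rank is better under this convention.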

The authors argue that effective autonomous policy evolution requires discovering task-appropriate mechanisms and refining policies under bounded feedback rather than relying on isolated task wins.