The article introduces EvoPolicyGym, a benchmark that evaluates how agents iteratively improve executable policies from feedback within a fixed interaction budget. This controlled setting addresses a limitation of existing evaluations, which often collapse the improvement process into a final score or confound it with software engineering progress.

  • The benchmark utilizes compact interactive reinforcement learning environments to assess iterative policy improvement.
  • GPT-5.5 achieves the strongest aggregate rank score and places in the top two on all 16 environments in the suite.
  • EvoPolicyGym provides trajectory-level diagnostics to analyze how agents allocate their budget and convert feedback into parametric tuning.
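
To make the setting concrete, the following is a minimal sketch of iterative policy improvement under a fixed interaction budget. It is not EvoPolicyGym's API: the `BudgetedRun` helper, the `toy_env_score` stand-in environment, and the random-perturbation update rule are all hypothetical, chosen only to show how each evaluation consumes budget and how a per-evaluation history can support trajectory-level analysis rather than a single final score.

```python
import random
from dataclasses import dataclass, field


@dataclass
class BudgetedRun:
    """Hypothetical helper: records every evaluation against a fixed budget."""
    budget: int
    history: list = field(default_factory=list)  # score of each evaluated policy

    def evaluate(self, policy, env_score):
        # Refuse to evaluate once the interaction budget is spent.
        if len(self.history) >= self.budget:
            raise RuntimeError("interaction budget exhausted")
        score = env_score(policy)
        self.history.append(score)
        return score


def toy_env_score(policy):
    """Stand-in environment (assumption): score peaks at 0 when gain == 0.7."""
    return -(policy["gain"] - 0.7) ** 2


def improve(budget, seed=0):
    """Perturb the best policy so far and keep candidates that score higher."""
    rng = random.Random(seed)
    run = BudgetedRun(budget)
    best = {"gain": 0.0}
    best_score = run.evaluate(best, toy_env_score)
    while len(run.history) < budget:
        candidate = {"gain": best["gain"] + rng.uniform(-0.2, 0.2)}
        score = run.evaluate(candidate, toy_env_score)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, run


best, best_score, run = improve(budget=20)
history = run.history

# Trajectory-level view: best score reached after each evaluation.
best_so_far = [max(history[: i + 1]) for i in range(len(history))]
print(f"final score {best_score:.4f} after {len(history)} evaluations")
```

The `best_so_far` curve is the kind of trajectory a final score hides: two agents can end at the same score while spending their budgets very differently.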

The authors argue that effective autonomous policy evolution requires discovering task-appropriate mechanisms and refining policies under bounded feedback, rather than relying on isolated task wins.