The authors introduce TestEvo-Bench, a live benchmark that evaluates how well test automation agents handle the co-evolution of code and tests. It addresses limitations of existing benchmarks by providing executable tasks that are anchored to real commit histories and packaged with their environment configurations.

  • The benchmark features two tracks: test generation, in which an agent writes new tests, and test update, in which it adapts existing tests that fail.
  • It contains 746 test generation and 509 test update tasks curated from 152 open-source Java projects.
  • Evaluation uses execution-grounded metrics such as pass rate, coverage, and mutation score.
  • The live nature of the benchmark allows restricting evaluation to tasks postdating a model's training cutoff.
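
The cutoff restriction and the pass-rate metric described above can be illustrated with a minimal sketch. The summary does not describe the benchmark's task schema or tooling, so everything here is an assumption: the `Task` fields, the use of a commit date as the task's timestamp, the helper names, and the sample records are all hypothetical and are not taken from TestEvo-Bench.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not the benchmark's actual schema."""
    task_id: str
    track: str          # "generation" or "update"
    commit_date: date   # assumed timestamp of the underlying commit
    passed: bool        # whether the agent's tests executed and passed


def tasks_after_cutoff(tasks: List[Task], cutoff: date) -> List[Task]:
    """Keep only tasks whose commit strictly postdates the training cutoff."""
    return [t for t in tasks if t.commit_date > cutoff]


def pass_rate(tasks: List[Task]) -> Optional[float]:
    """Fraction of tasks that passed, or None when there are no tasks."""
    if not tasks:
        return None
    return sum(t.passed for t in tasks) / len(tasks)


# Made-up sample data, for illustration only.
tasks = [
    Task("gen-001", "generation", date(2023, 5, 1), True),
    Task("gen-002", "generation", date(2024, 9, 15), False),
    Task("upd-001", "update", date(2024, 11, 2), True),
    Task("upd-002", "update", date(2025, 1, 20), True),
]

cutoff = date(2024, 6, 30)  # hypothetical model training cutoff
fresh = tasks_after_cutoff(tasks, cutoff)
print(len(fresh), pass_rate(fresh))
```

In this sketch, filtering before scoring means tasks the model could have seen during training never enter the pass rate. Coverage and mutation score would be aggregated the same way but require running the project's own tooling, so they are left out.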

This framework enables more accurate assessment of agent capabilities by ensuring tests are executable and semantically tied to code changes.