The authors introduce TestEvo-Bench, a live benchmark designed to evaluate how well test automation agents handle the co-evolution of code and tests. It addresses limitations in existing benchmarks by providing executable tasks that are anchored to real commit histories and packaged with environment configurations.
- The benchmark features two tracks: test generation, which writes new tests, and test update, which adapts tests that fail.
- It contains 746 test generation and 509 test update tasks curated from 152 open-source Java projects.
- Evaluation uses execution-grounded metrics such as pass rate, coverage, and mutation score.
- The live nature of the benchmark allows restricting evaluation to tasks postdating a model's training cutoff.
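The cutoff restriction in the last point can be sketched as a simple date filter. This is a minimal illustration, not the benchmark's actual tooling: the `Task` record, its field names, and the `tasks_after_cutoff` helper are all hypothetical, and it assumes each task carries the date of the commit it is anchored to.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions for illustration."""
    task_id: str
    track: str          # "generation" or "update"
    commit_date: date   # date of the commit the task is anchored to


def tasks_after_cutoff(tasks, cutoff):
    """Keep only tasks whose anchoring commit is strictly after the cutoff."""
    return [t for t in tasks if t.commit_date > cutoff]


# Made-up example data, not taken from the benchmark.
tasks = [
    Task("t1", "generation", date(2024, 3, 1)),
    Task("t2", "update", date(2025, 1, 15)),
    Task("t3", "generation", date(2025, 6, 30)),
]
recent = tasks_after_cutoff(tasks, cutoff=date(2024, 12, 31))
print([t.task_id for t in recent])
```

Using a strict comparison excludes tasks dated on the cutoff day itself, which is the conservative choice when the goal is to avoid overlap with training data.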
This framework enables more accurate assessment of agent capabilities by ensuring tests are executable and semantically tied to code changes.
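For orientation, the three execution-grounded metrics named above can be written as plain ratios. The sketch below uses the conventional textbook definitions (passing tests over all tests, covered lines over all lines, killed mutants over all mutants); the summary does not state how the benchmark computes them, so the function names and exact formulas are assumptions.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def pass_rate(outcomes):
    """Fraction of executed tests that passed; `outcomes` is a list of booleans."""
    return _ratio(sum(1 for ok in outcomes if ok), len(outcomes))


def line_coverage(covered_lines, total_lines):
    """Fraction of lines executed by the tests (assumed line-level coverage)."""
    return _ratio(covered_lines, total_lines)


def mutation_score(killed_mutants, total_mutants):
    """Fraction of injected mutants that at least one test detects."""
    return _ratio(killed_mutants, total_mutants)


# Made-up numbers for illustration only.
print(pass_rate([True, True, False, True]))
print(line_coverage(8, 10))
print(mutation_score(30, 40))
```

All three depend on actually running the tests, which is why the tasks need to be executable.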