TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

The authors introduce TestEvo-Bench, a live benchmark that evaluates how well test automation agents handle the co-evolution of code and tests. It addresses limitations of existing benchmarks by providing executable tasks that are anchored to real commit histories and ship with their environment configurations.

The benchmark features two tracks: test generation, in which an agent writes new tests, and test update, in which it adapts existing tests that fail.
It contains 746 test generation and 509 test update tasks curated from 152 open-source Java projects.
Evaluation uses execution-grounded metrics such as pass rate, coverage, and mutation score.
The live nature of the benchmark allows restricting evaluation to tasks postdating a model's training cutoff.
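
The cutoff-based restriction and an execution-grounded pass rate can be illustrated with a minimal sketch. It is not the benchmark's actual tooling: the Task record, its field names (commit_date, passed), and the sample data are hypothetical, chosen only to show the idea of keeping tasks whose anchoring commit postdates a model's training cutoff and scoring the remainder.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not the benchmark's schema."""
    task_id: str
    track: str         # "generation" or "update"
    commit_date: date  # assumed: date of the commit the task is anchored to
    passed: bool       # assumed: whether the agent's tests passed when executed


def tasks_after_cutoff(tasks: List[Task], cutoff: date) -> List[Task]:
    """Keep only tasks whose anchoring commit is strictly later than the cutoff."""
    return [t for t in tasks if t.commit_date > cutoff]


def pass_rate(tasks: List[Task]) -> Optional[float]:
    """Fraction of tasks that passed; None when there is nothing to score."""
    if not tasks:
        return None
    return sum(t.passed for t in tasks) / len(tasks)


# Illustrative data only; these are not real benchmark results.
tasks = [
    Task("t1", "generation", date(2024, 3, 1), True),
    Task("t2", "update", date(2025, 2, 10), False),
    Task("t3", "generation", date(2025, 6, 5), True),
]
cutoff = date(2025, 1, 1)  # assumed training cutoff for some model

fresh = tasks_after_cutoff(tasks, cutoff)
print([t.task_id for t in fresh])  # ['t2', 't3']
print(pass_rate(fresh))            # 0.5
```

Filtering before scoring means a model is measured only on tasks it could not have seen during training; coverage and mutation score could be aggregated over the same filtered set.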

By ensuring that tests are executable and semantically tied to code changes, the framework enables a more accurate assessment of agent capabilities than existing benchmarks provide.