An empirical study of 86,156 test-file patches from 33,596 agent-authored PRs finds that 80.2% of the test patches contain weak or no explicit oracle signals. Strong-oracle test files are associated with significantly higher merge likelihood (OR = 1.28, p < 0.001) after adjusting for multiple factors, which indicates that the mere presence of a test file overestimates verification strength.
arXiv cs.AI · 8d ago · research
Oracle Signals in Agent-Authored Test Code
Importance 2/3
OpenAI
Anthropic
Cursor
AI agents
Code generation
Evaluation & benchmarks
Key figures
| Metric | Value |
|---|---|
| Test-file patches analyzed | 86,156 |
| Agent-authored PRs | 33,596 |
| Test patches with weak or no explicit oracle signals | 80.2% |
| Odds ratio for merge with strong-oracle test files | 1.28 (p < 0.001) |
| Agents named in the listing | Claude Code, Cursor, Devin, GitHub Copilot, OpenAI Codex |
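
The odds ratio quoted in the summary (OR = 1.28) is a multiplier on odds, not on probability, so its practical size depends on the baseline merge rate. The sketch below shows that conversion. The function name `shift_probability` and the baseline rates are hypothetical and chosen only for illustration; the summary does not report baseline merge rates, and this is not the study's own analysis code.

```python
def shift_probability(baseline: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability and return the new probability.

    `baseline` is an assumed merge probability (hypothetical, not from the study).
    """
    if not 0.0 < baseline < 1.0:
        raise ValueError("baseline must be strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")
    odds = baseline / (1.0 - baseline)   # probability -> odds
    shifted = odds * odds_ratio          # the odds ratio scales the odds
    return shifted / (1.0 + shifted)     # odds -> probability


if __name__ == "__main__":
    # Illustrative baselines only; OR = 1.28 is the value quoted in the summary.
    for baseline in (0.3, 0.5, 0.8):
        shifted = shift_probability(baseline, 1.28)
        print(f"baseline {baseline:.2f} -> with OR 1.28: {shifted:.3f}")
```

The same odds ratio moves the probability by a different amount at each baseline, which is why an OR of 1.28 should not be read as "28% more likely to merge".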