arXiv cs.AI · 8d ago · research

Oracle Signals in Agent-Authored Test Code

An empirical study of 86,156 test-file patches from 33,596 agent-authored pull requests finds that 80.2% of the patches contain weak or no explicit oracle signals. After adjusting for multiple factors, test files with strong oracles are associated with higher odds of a PR being merged (OR = 1.28, p < 0.001), which suggests that the mere presence of a test file overstates how thoroughly a change has been verified.
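
To give a feel for the size of the reported effect, the sketch below converts an odds ratio into a change in merge probability. It is only an illustration: the function name `apply_odds_ratio` and the baseline merge rates are assumptions made here, not figures from the study, and the summary does not say what the actual baseline rate is.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability obtained by scaling the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)  # probability -> odds
    new_odds = baseline_odds * odds_ratio                   # apply the odds ratio
    return new_odds / (1.0 + new_odds)                      # odds -> probability


REPORTED_OR = 1.28  # odds ratio quoted in the summary

# Hypothetical baseline merge rates, chosen only for illustration.
for baseline in (0.25, 0.50, 0.75):
    adjusted = apply_odds_ratio(baseline, REPORTED_OR)
    print(f"baseline {baseline:.0%} -> {adjusted:.1%}")
# baseline 25% -> 29.9%
# baseline 50% -> 56.1%
# baseline 75% -> 79.3%
```

Under these assumed baselines, an odds ratio of 1.28 corresponds to a gain of roughly four to six percentage points in merge probability, so the association is statistically clear but modest in size.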

Importance 2/3 arXiv cs.AI OpenAI Anthropic Cursor AI agents Code generation Evaluation & benchmarks
