arXiv cs.AI · 8d ago · research

ReproRepo: Scaling Reproducibility Audits with GitHub Issues

ReproRepo introduces a scalable framework that uses GitHub issues to evaluate the reproducibility of ML papers. It shows that LLM agents such as Codex with GPT-5.5 identify at least one blocker in 90% of paper-repository pairs without executing code, though localizing blockers exactly remains challenging.

Importance 2/3 · arXiv cs.AI · OpenAI · Cohere · Mistral AI · AI agents · Code generation · Evaluation & benchmarks

Benchmarks

Benchmark	Model	Score
ReproRepo (paper-repository pairs with at least one blocker identified, no code execution)	Codex with GPT-5.5	90%
