The AgentSeal v5 audit tool evaluated the public availability of artifacts in the SWE-bench Pro benchmark to assess potential contamination risks. The study found 12 instances with deterministic content overlap and 76 repositories that were probable corpus members, but most of the evidence was public replication with no known date rather than proven pre-cutoff contamination.
- AgentSeal audited 731 public SWE-bench Pro instances using deterministic code overlap, probabilistic Bloom filter membership, and public-source replication checks.
- 12 instances had deterministic content-overlap signals in the CodeSeal index, while 76 source repositories were flagged as probable members of the Stack V2 corpus.
- 234 instances (32%) showed public replication of gold patch text outside the original repository, though the replication dates could not be compared against training cutoffs.
- Approximately 75.4% of default-branch gold patches were exposed under the Pro audit consensus path.
- 148 instances had hidden test case code publicly visible in the source PR diff, indicating test-signal exposure.
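The "probable member" wording for the Bloom filter check reflects a general property of that data structure: a lookup can return a false positive but never a false negative. The sketch below illustrates the property only. It is not AgentSeal's implementation; the class, the `membership_label` helper, the filter parameters, and the repository names are all hypothetical.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True only if every derived bit is set; other items may have set them.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def membership_label(repo: str, bloom: BloomFilter) -> str:
    # A hit is only "probable" because Bloom filters admit false positives;
    # a miss is definitive because they admit no false negatives.
    return "probable member" if repo in bloom else "not a member"


# Hypothetical repository identifiers, not taken from the audit.
corpus_filter = BloomFilter()
for name in ["example-org/alpha", "example-org/beta"]:
    corpus_filter.add(name)

print(membership_label("example-org/alpha", corpus_filter))
print(membership_label("example-org/gamma", corpus_filter))
```

Because a hit cannot be upgraded to certainty without consulting the underlying corpus, a count such as the 76 flagged repositories is best read as an upper bound on confirmed membership.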
The findings show that benchmark artifacts are widely replicated in public sources, so contamination is possible even though there is no direct proof that the artifacts entered training data before a cutoff.
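The distinction between undated replication and proven pre-cutoff contamination can be made explicit with a small classifier. This is a minimal sketch under stated assumptions: the function name, the labels, and the dates are hypothetical and do not come from the audit.

```python
from datetime import date
from typing import Optional


def classify_exposure(first_public: Optional[date], training_cutoff: date) -> str:
    """Label a replicated artifact by how its public date relates to a cutoff.

    `first_public` is None when no publication date could be recovered.
    """
    if first_public is None:
        # Replication is observed, but timing cannot be established.
        return "date-unknown replication"
    if first_public <= training_cutoff:
        # Public on or before the cutoff: inclusion in training data is possible.
        return "pre-cutoff exposure"
    return "post-cutoff exposure"


# Hypothetical cutoff and dates for illustration only.
cutoff = date(2024, 1, 1)
print(classify_exposure(None, cutoff))
print(classify_exposure(date(2023, 6, 15), cutoff))
print(classify_exposure(date(2024, 3, 1), cutoff))
```

Under this labeling, only the "pre-cutoff exposure" case supports a contamination claim, and even then it shows opportunity rather than confirmed inclusion in training data.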