Public AI evaluation archives show that a single terminal result can arise from two distinct pre-terminal histories, with estimated times to reach 95% of performance ceilings at 23.03 or 75.13. A candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration, and is rejected by fixed audit gates. An archive-and-adjudication protocol verifies timing boundaries and falsifies unsupported frontier claims.
arxiv
arXiv cs.AI
·
9d ago
·
research
Bayesian Audits Reveal Inconsistent AI Evaluation Timelines
from English