This paper addresses the shared measurement problem in LLM evaluation and AI safety, where benchmark scores often improve while latent safety properties remain difficult to verify. It introduces EvalSafetyGap, a hybrid survey and conceptual framework combining systematic evidence synthesis with a structured audit of ten models.

  • The synthesis covers eight evidence streams from 2018-2026, including benchmark validity, LLM-as-judge reliability, reward hacking, and mechanistic interpretability.
  • EvalSafetyGap uses Goodhart's Law, Instability Decomposition, and an Alignment Trilemma to compare evaluation-side and alignment-side proxy failures under optimization pressure.
  • An audit of ten models found no statistically significant association between capability and sustained adversarial robustness (Pearson r = +0.232, p = 0.520), so the relationship remains indeterminate at this sample size.
  • The apparent open-closed safety gap was modest and driven mainly by governance and disclosure rather than behavioral robustness.
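
The reported correlation can be sanity-checked with the standard t-test for a Pearson coefficient, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The sketch below is illustrative only and is not the paper's analysis code: it assumes that all ten audited models enter the correlation (n = 10) and that the reported p-value is two-sided. The function name `pearson_t_and_p` and the Simpson's-rule integration are choices made here to keep the example free of third-party dependencies.

```python
import math


def pearson_t_and_p(r: float, n: int, steps: int = 20000) -> tuple[float, float]:
    """Return (t statistic, two-sided p-value) for a Pearson r over n pairs.

    Hypothetical helper for illustration; n = 10 below is an assumption
    based on the "ten models" mentioned in the summary.
    """
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals

    df = n - 2
    t = r * math.sqrt(df / (1.0 - r * r))

    # Normalizing constant of the Student t density with df degrees of freedom.
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x: float) -> float:
        return norm * (1.0 + x * x / df) ** (-(df + 1) / 2)

    # Integrate the density over [0, |t|] with composite Simpson's rule.
    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_half = total * h / 3.0

    # Two-sided p-value: probability mass outside [-|t|, |t|].
    p = 1.0 - 2.0 * central_half
    return t, p


t_stat, p_value = pearson_t_and_p(r=0.232, n=10)
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```

Under these assumptions the computed p-value should land close to the reported p = 0.520. With only eight degrees of freedom, a correlation of this size cannot be distinguished from zero, which is consistent with the "indeterminate" reading above.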

The paper contributes a shared vocabulary and an evidence map intended to support dynamic evaluation, transparent source reporting, multi-attempt safety measurement, and auditable alignment practice.