EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures

This paper addresses the shared measurement problem in LLM evaluation and AI safety, where benchmark scores often improve while latent safety properties remain difficult to verify. It introduces EvalSafetyGap, a hybrid survey and conceptual framework combining systematic evidence synthesis with a structured audit of ten models.

The synthesis covers eight evidence streams from 2018-2026, including benchmark validity, LLM-as-judge reliability, reward hacking, and mechanistic interpretability.
EvalSafetyGap uses Goodhart's Law, Instability Decomposition, and an Alignment Trilemma to compare evaluation-side and alignment-side proxy failures under optimization pressure.
An audit of ten models found no statistically significant association between capability and sustained adversarial robustness (Pearson r = +0.232, p = 0.520); with a sample this small, the result is indeterminate rather than evidence that no relationship exists.
The apparent safety gap between open and closed models was modest and driven mainly by differences in governance and disclosure rather than in behavioral robustness.
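
As a sanity check on the reported correlation, the p-value can be recomputed from r and the sample size alone. The sketch below is illustrative and is not the authors' analysis code. It assumes each of the ten audited models contributes one data point (n = 10) and that the p-value comes from the standard two-sided t-test for a Pearson coefficient with n - 2 degrees of freedom. The function names are hypothetical, and the t-distribution tail is obtained by simple numerical integration so that only the standard library is needed.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing a Pearson r against zero: t = r * sqrt((n - 2) / (1 - r^2))."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def student_t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1.0 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(r: float, n: int, steps: int = 2000) -> float:
    """Two-sided p-value for Pearson r with n points (assumed test: t with n - 2 df).

    Integrates the t density over [0, |t|] with Simpson's rule (steps must be even),
    doubles it to get the central mass, and returns the remaining tail mass.
    """
    df = n - 2
    t = abs(t_statistic(r, n))
    h = t / steps
    total = student_t_pdf(0.0, df) + student_t_pdf(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * student_t_pdf(i * h, df)
    central_mass = 2.0 * (h / 3.0) * total
    return 1.0 - central_mass


if __name__ == "__main__":
    # Values reported in the abstract; n = 10 is an assumption (one point per model).
    r, n = 0.232, 10
    print(f"t = {t_statistic(r, n):.3f}, two-sided p = {two_sided_p_value(r, n):.3f}")
```

Under these assumptions the recomputed p-value comes out at approximately 0.52, in line with the reported figure. With only eight degrees of freedom, a correlation of this size cannot be distinguished from zero.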

The framework contributes a shared vocabulary and evidence map to support dynamic evaluation, transparent source reporting, multi-attempt safety measurement, and auditable alignment practice.