Evaluation awareness in open language models is not a unified trait. Eight experiments across 37 models show that detection, shifts in safety behavior, and representation stability vary independently, with only weak correlations among them. This undermines the use of a single awareness score as a reliable indicator of deployment safety and highlights the 'benchmark illusion'.
Evaluation Awareness Is Multivariate, Not a Single Capability