This study investigates whether language models recognize when they are being tested, a capability that matters for AI safety because a model that detects an evaluation may strategically alter its behavior. Using 11 open-weight models from the Qwen 2.5, Gemma 2, and Llama 3.2 families, the researchers analyzed how evaluation awareness manifests across model sizes.
- The layer where evaluation awareness is most linearly recoverable shifts from late layers in smaller models to early layers in larger ones for both Qwen 2.5 and Gemma 2.
- This depth shift explains why scaling trajectories within a model family are non-monotonic or even inverse, rather than following a smooth, universal power law.
- Evaluation awareness is consistently stronger when measured through white-box probe signals than when measured through black-box behavioral expression.
- The relationship between probe signals and behavior varies by model family in ways not predicted by probe AUROC alone.
These findings suggest that scale changes not only the strength of evaluation awareness but also its location within the network, challenging simple universal scaling accounts.
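
To make "the layer where evaluation awareness is most linearly recoverable" concrete, here is a minimal sketch of a layer-wise probing loop. It is not the study's pipeline. It assumes a difference-of-means direction as the linear probe, scores each layer by held-out AUROC, and reports the best layer as a relative depth between 0 (first layer) and 1 (last layer). The activations are synthetic, and every function name (`auroc`, `mean_diff_direction`, `layer_probe_aurocs`, `peak_relative_depth`, `make_synthetic`) is hypothetical.

```python
import random


def auroc(scores, labels):
    """Rank-based AUROC: fraction of (positive, negative) pairs ordered correctly.
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_diff_direction(X, y):
    """Difference of class means, used here as a simple linear probe direction
    (an assumption of this sketch, not necessarily the study's probe)."""
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    dim = len(X[0])
    return [
        sum(x[d] for x in pos) / len(pos) - sum(x[d] for x in neg) / len(neg)
        for d in range(dim)
    ]


def project(X, w):
    """Project each activation vector onto the probe direction."""
    return [sum(a * b for a, b in zip(x, w)) for x in X]


def layer_probe_aurocs(acts_by_layer, labels, train_frac=0.5):
    """Fit one probe per layer on the first part of the data and
    return the held-out AUROC for each layer."""
    n_train = int(len(labels) * train_frac)
    results = []
    for X in acts_by_layer:
        w = mean_diff_direction(X[:n_train], labels[:n_train])
        results.append(auroc(project(X[n_train:], w), labels[n_train:]))
    return results


def peak_relative_depth(aurocs):
    """Index of the best-probed layer and its depth rescaled to [0, 1]."""
    best = max(range(len(aurocs)), key=aurocs.__getitem__)
    return best, best / (len(aurocs) - 1)


def make_synthetic(n_layers, signal_layer, n=200, dim=8, seed=0):
    """Synthetic stand-in for per-layer activations: Gaussian noise everywhere,
    plus a label-dependent shift in one dimension at `signal_layer` only."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]  # 1 = "evaluation" prompt, 0 = other
    acts = []
    for layer in range(n_layers):
        X = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
        if layer == signal_layer:
            for x, lab in zip(X, labels):
                x[0] += 3.0 * lab
        acts.append(X)
    return acts, labels


# Usage: a 6-layer toy "model" whose signal lives in layer 4.
acts, labels = make_synthetic(n_layers=6, signal_layer=4)
aurocs = layer_probe_aurocs(acts, labels)
best_layer, rel_depth = peak_relative_depth(aurocs)
print(best_layer, round(rel_depth, 2), [round(a, 2) for a in aurocs])
```

Reporting the peak as a relative depth rather than a raw layer index is what allows models with different layer counts to be compared, which is the comparison the depth-shift finding depends on.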