The article documents how measurements from proprietary LLM evaluators can become invalid within weeks, introducing the EPC framework to detect such instability. It applies this diagnostic across eight experimental conditions, revealing that version-conditional instability makes single-snapshot evaluator studies unreliable.
- The EPC framework has three components: the Multimodal Preference Collapse Index (MPCI), an evaluator-indexed coupling matrix, and the Jensen-Shannon divergence (JSD).
- Four conditions showed strong coupling: GPT-4o May, GPT-4o-mini, Qwen3.7-plus, and DashScope 30r.
- Four conditions collapsed to near-zero coupling: GPT-4o June, qwen-plus, symmetric LR, and DeepSeek self-eval.
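- Re-replicating the GPT-4o condition in June, after the May run, inverted the study's conclusion: the same evaluator moved from strong coupling to near-zero coupling, which indicates significant drift between versions.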
- Self-evaluation consistently collapsed with 97% zero values and a JSD of 0.003.
The authors consider this important because a conclusion drawn from one evaluator snapshot may not hold for the next version, so single-snapshot studies are an unreliable basis for evaluating LLM agents.
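
Of the three EPC components, the Jensen-Shannon divergence is a standard quantity, and the minimal sketch below shows how it can be computed for two discrete distributions. The summary does not specify how the article builds the distributions it compares or which logarithm base it uses, so the function name `jensen_shannon_divergence`, the base-2 logarithm, and the toy inputs are assumptions for illustration only, not the authors' implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits. Terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Inputs may be raw non-negative counts; they are normalized to sum to 1.
    With a base-2 logarithm (an assumption here) the result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each distribution needs positive total mass")

    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution: m_i > 0 whenever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the article.
    identical = jensen_shannon_divergence([3, 1], [3, 1])
    disjoint = jensen_shannon_divergence([1, 0], [0, 1])
    print(identical, disjoint)
```

Under these assumptions, identical distributions give a divergence of 0 and distributions with no overlapping mass give 1, so a value close to zero, such as the one reported for self-evaluation, means the two compared distributions are nearly indistinguishable.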