A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents
The article documents how measurements from proprietary LLM evaluators can become invalid within weeks and introduces the EPC framework to detect such instability. Applying this diagnostic across eight experimental conditions, it finds that version-conditional instability makes single-snapshot evaluator studies unreliable.