Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?
This study examines how reliably large language models (LLMs) can serve as judges that verify rubrics in complex agentic scenarios. It introduces RuVerBench, presented as the first benchmark for this purpose, and evaluates frontier models on deep research and coding tasks. The results show that, although these models perform strongly overall, their rubric verification remains significantly noisy.