This study investigates how reliably large language models (LLMs) can serve as judges that verify rubrics in complex agentic scenarios, and it introduces RuVerBench, which the authors describe as the first benchmark for this purpose. The study evaluates frontier models on deep research and agentic coding tasks and finds that, although overall performance is strong, verification remains noticeably noisy.
- RuVerBench contains 2,458 instances covering deep research and agentic coding domains, each with model outputs, rubrics, and human-annotated labels.
- Even the most advanced LLMs exhibit substantial noise when verifying rubrics in agentic scenarios.
- Weaker models are more sensitive to prompt variations than stronger ones.
- Batched verification presents a trade-off between accuracy and efficiency.
- Majority voting improves reliability, but the gains diminish as more votes are added.
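
The diminishing returns of majority voting can be illustrated with a small sketch. It is not the authors' implementation: the function names `majority_vote` and `majority_accuracy`, the tie-breaking rule, and the per-vote accuracy of 0.8 are all assumptions chosen for illustration, and the calculation assumes that judge errors are independent across votes, which real LLM judges are unlikely to satisfy exactly.

```python
from collections import Counter
from math import comb


def majority_vote(verdicts):
    """Aggregate boolean judge verdicts for one rubric item.

    Hypothetical helper: ties resolve to False (rubric treated as not satisfied).
    """
    counts = Counter(verdicts)
    return counts[True] > counts[False]


def majority_accuracy(p, n):
    """Probability that a majority of n votes is correct.

    Assumes each vote is independently correct with probability p.
    n must be odd so that a strict majority always exists.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    # Three sampled verdicts for the same rubric item.
    print(majority_vote([True, False, True]))

    # Illustrative per-vote accuracy of 0.8 (an assumed value, not a benchmark result).
    previous = None
    for n in (1, 3, 5, 7):
        acc = majority_accuracy(0.8, n)
        gain = "" if previous is None else f" (gain {acc - previous:+.4f})"
        print(f"n={n}: {acc:.4f}{gain}")
        previous = acc
```

Under these assumptions, each additional pair of votes adds less accuracy than the previous pair, while the number of judge calls grows linearly. Correlated errors across samples would shrink the gains further.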
The authors have released their dataset and code to facilitate future research into improving the consistency of automated evaluation methods.