The authors propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions to produce interpretable, multi-dimensional scores for text generated by large language models. An LLM answers each fine-grained question independently for each output, which yields transparent question-level feedback and calibrated overall scores.
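The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `BinaryQuestion` class, the `evaluate` function, the example questions, and the rule-based `toy_judge` (a stand-in for an LLM call) are all hypothetical, and scoring each dimension as the fraction of "yes" answers is an assumed aggregation rule that the summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class BinaryQuestion:
    dimension: str  # evaluation dimension the question belongs to
    text: str       # phrased so that "yes" is the desirable answer


# A judge maps (question text, output) to True for "yes" and False for "no".
# In practice this would wrap an LLM call; here it is left abstract.
Judge = Callable[[str, str], bool]


def evaluate(
    output: str, questions: List[BinaryQuestion], judge: Judge
) -> Tuple[Dict[str, float], List[str]]:
    """Answer every question independently, then aggregate per dimension."""
    answers_by_dim: Dict[str, List[bool]] = {}
    failed: List[str] = []
    for q in questions:
        answer = judge(q.text, output)  # one independent call per question
        answers_by_dim.setdefault(q.dimension, []).append(answer)
        if not answer:
            failed.append(q.text)  # question-level feedback
    # Assumed aggregation: fraction of "yes" answers within each dimension.
    scores = {dim: sum(a) / len(a) for dim, a in answers_by_dim.items()}
    return scores, failed


# Hypothetical rule-based judge so the sketch runs without any model access.
TOY_RULES: Dict[str, Callable[[str], bool]] = {
    "Does the output end with terminal punctuation?":
        lambda s: s.rstrip().endswith((".", "!", "?")),
    "Is the output free of immediately repeated words?":
        lambda s: all(a != b for a, b in zip(s.lower().split(), s.lower().split()[1:])),
    "Is the output at most 20 words long?":
        lambda s: len(s.split()) <= 20,
}


def toy_judge(question: str, output: str) -> bool:
    return TOY_RULES[question](output)


QUESTIONS = [
    BinaryQuestion("fluency", "Does the output end with terminal punctuation?"),
    BinaryQuestion("fluency", "Is the output free of immediately repeated words?"),
    BinaryQuestion("conciseness", "Is the output at most 20 words long?"),
]

scores, failed = evaluate("The cat sat on on the mat", QUESTIONS, toy_judge)
print(scores)  # per-dimension scores in [0, 1]
print(failed)  # the questions answered "no"
```

Because each question is answered on its own, the list of failed questions doubles as a diagnostic: it says which specific checks an output missed, not only how low it scored.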
- BINEVAL matches or outperforms baselines like UniEval and G-Eval on SummEval, Topical-Chat, and QAGS benchmarks.
- The method matches human score distributions more closely and avoids the ceiling effects common in prior LLM judges.
- It provides superior discrimination between borderline and clearly flawed outputs compared to existing methods.
- The framework supports iterative prompt optimization for summarization and generation tasks under self-update and cross-model update settings.
BINEVAL offers a task-agnostic, training-free evaluation framework that combines strong empirical performance with practical diagnostic value and direct usability for prompt improvement.