Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

The authors propose BINEVAL, a framework that decomposes each evaluation criterion into atomic binary (yes/no) questions in order to score large language model outputs along multiple interpretable dimensions. An LLM answers each fine-grained question independently for every output, which yields transparent question-level feedback and a calibrated overall score.
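The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the question wording, the `Judge` signature, the rule-based `toy_judge` (a stand-in for an LLM call), and the choice to aggregate by the fraction of "yes" answers are all assumptions made for illustration, since the summary does not specify how the overall score is computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical judge signature: (question, source, output) -> True for "yes".
Judge = Callable[[str, str, str], bool]


@dataclass
class Report:
    answers: Dict[str, Dict[str, bool]]  # criterion -> question -> answer
    scores: Dict[str, float]             # criterion -> fraction of "yes"
    overall: float                       # mean of the per-criterion scores


def evaluate(questions: Dict[str, List[str]], source: str, output: str,
             judge: Judge) -> Report:
    """Ask every binary question independently, then aggregate.

    Assumption: a criterion's score is the share of its questions answered
    "yes", and the overall score is the unweighted mean across criteria.
    """
    answers: Dict[str, Dict[str, bool]] = {}
    scores: Dict[str, float] = {}
    for criterion, qs in questions.items():
        answers[criterion] = {q: judge(q, source, output) for q in qs}
        scores[criterion] = sum(answers[criterion].values()) / len(qs)
    overall = sum(scores.values()) / len(scores)
    return Report(answers, scores, overall)


def toy_judge(question: str, source: str, output: str) -> bool:
    """Rule-based stand-in for an LLM; a real judge would prompt a model."""
    if "numbers" in question:
        source_tokens = source.split()
        # "Yes" only if every numeric token in the output occurs in the source.
        return all(t in source_tokens for t in output.split() if t.isdigit())
    return True


# Illustrative questions, not taken from the paper.
QUESTIONS = {
    "consistency": [
        "Does the summary avoid stating facts absent from the source?",
        "Does the summary keep the source's numbers unchanged?",
    ],
    "fluency": ["Is the summary free of grammatical errors?"],
}

report = evaluate(
    QUESTIONS,
    source="Revenue rose 12 percent in 2023 .",
    output="Revenue rose 15 percent in 2023 .",
    judge=toy_judge,
)
# Question-level feedback: the questions that received a "no".
failed = [q for qs in report.answers.values() for q, ok in qs.items() if not ok]
```

Here `failed` holds the single question about numbers, which is the kind of question-level diagnostic the summary refers to; the per-criterion scores show which dimension the failure belongs to.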

BINEVAL matches or outperforms baselines such as UniEval and G-Eval on the SummEval, Topical-Chat, and QAGS benchmarks.
Its scores match human score distributions more closely than those of prior LLM judges and avoid the ceiling effects common in them.
It discriminates between borderline and clearly flawed outputs better than existing methods.
The framework supports iterative prompt optimization for summarization and generation tasks under self-update and cross-model update settings.
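
One way to picture the iterative prompt optimization mentioned above is the loop below. It is a sketch under stated assumptions, not the paper's procedure: the names `optimize_prompt`, `generate`, `find_failures`, and `revise` are hypothetical, the toy functions stand in for model calls, and it assumes that failed binary questions are what gets fed back to the prompt reviser. Under this reading, a self-update setting would use the same model for `generate` and `revise`, and a cross-model setting would use different models; that mapping is an interpretation rather than something the summary states.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str], str]            # prompt -> model output
FindFailures = Callable[[str], List[str]]  # output -> failed binary questions
Revise = Callable[[str, List[str]], str]   # (prompt, failures) -> new prompt


def optimize_prompt(prompt: str, generate: Generate,
                    find_failures: FindFailures, revise: Revise,
                    max_rounds: int = 3) -> Tuple[str, List[Tuple[str, int]]]:
    """Revise the prompt until no question fails or the round budget ends."""
    history: List[Tuple[str, int]] = []  # (prompt used, number of failures)
    for _ in range(max_rounds):
        output = generate(prompt)
        failures = find_failures(output)
        history.append((prompt, len(failures)))
        if not failures:
            break
        prompt = revise(prompt, failures)
    return prompt, history


# Toy stand-ins for model calls, used only to make the loop runnable.
def toy_generate(prompt: str) -> str:
    return "exact summary" if "numbers unchanged" in prompt else "loose summary"


def toy_find_failures(output: str) -> List[str]:
    question = "Does the summary keep the source's numbers unchanged?"
    return [] if output.startswith("exact") else [question]


def toy_revise(prompt: str, failures: List[str]) -> str:
    # A real reviser would ask a model to rewrite the prompt.
    return prompt + " Make sure: " + " ".join(failures)


final_prompt, history = optimize_prompt(
    "Summarize the article.", toy_generate, toy_find_failures, toy_revise
)
```

With these toy functions the first round fails one question, the revised prompt carries that question as an instruction, and the second round passes, so the loop stops before exhausting `max_rounds`.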

BINEVAL offers a task-agnostic, training-free evaluation framework that combines strong empirical performance with practical diagnostic value and direct usability for prompt improvement.