The authors propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions to produce interpretable, multi-dimensional scores for large language model (LLM) outputs. An LLM answers each fine-grained question independently for every output, which yields transparent question-level feedback alongside a calibrated overall score.
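
The decompose-then-answer idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `BinaryQuestion`, `Judge`, and `evaluate` are hypothetical, the judge is a placeholder for an LLM call, and scoring each dimension as the fraction of "yes" answers is an assumed aggregation rule chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BinaryQuestion:
    dimension: str  # evaluation dimension, e.g. "coherence" (illustrative)
    text: str       # atomic yes/no question about the output


# Hypothetical judge: (question_text, output_text) -> True for "yes".
# In practice this would wrap an LLM call; here it is left abstract.
Judge = Callable[[str, str], bool]


def evaluate(output: str, questions: List[BinaryQuestion], judge: Judge) -> dict:
    """Answer each question independently, then aggregate per dimension."""
    if not questions:
        raise ValueError("at least one binary question is required")

    # Each question is answered on its own, with no shared context.
    answers = [(q, judge(q.text, output)) for q in questions]

    per_dimension: Dict[str, List[bool]] = {}
    for q, ans in answers:
        per_dimension.setdefault(q.dimension, []).append(ans)

    # Assumed aggregation: fraction of "yes" answers per dimension,
    # then an unweighted mean across dimensions.
    dimension_scores = {d: sum(v) / len(v) for d, v in per_dimension.items()}
    overall = sum(dimension_scores.values()) / len(dimension_scores)

    # Question-level feedback: the questions the output failed.
    failed = [q.text for q, ans in answers if not ans]
    return {
        "dimension_scores": dimension_scores,
        "overall": overall,
        "failed": failed,
    }
```

Because every answer is tied to one question, the list of failed questions doubles as a diagnostic report for the evaluated output.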

  • BINEVAL matches or outperforms baselines like UniEval and G-Eval on SummEval, Topical-Chat, and QAGS benchmarks.
  • Its scores track human score distributions more closely and avoid the ceiling effects common in prior LLM judges.
  • It discriminates between borderline and clearly flawed outputs better than existing methods.
  • The framework supports iterative prompt optimization for summarization and generation tasks under self-update and cross-model update settings.
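
One way to picture the iterative prompt optimization mentioned in the last point is the loop below. It is a sketch under stated assumptions, not the procedure from the paper: `optimize_prompt`, `generate`, `evaluate`, and `revise` are hypothetical names, and the accept-only-if-better rule is an assumed design choice. Under this reading, a self-update setting would use the same model for `generate` and `revise`, while a cross-model update would use different models.

```python
from typing import Callable, List, Tuple

# Hypothetical callables standing in for model calls:
Generate = Callable[[str], str]                       # prompt -> output
Evaluate = Callable[[str], Tuple[float, List[str]]]   # output -> (score, failed questions)
Revise = Callable[[str, List[str]], str]              # (prompt, failed questions) -> new prompt


def optimize_prompt(
    prompt: str,
    generate: Generate,
    evaluate: Evaluate,
    revise: Revise,
    rounds: int = 3,
) -> Tuple[str, float]:
    """Revise a prompt using failed binary questions as feedback."""
    best_prompt = prompt
    best_score, failed = evaluate(generate(prompt))

    for _ in range(rounds):
        if not failed:
            break  # every question passed; nothing left to fix
        candidate = revise(best_prompt, failed)
        score, candidate_failed = evaluate(generate(candidate))
        # Assumed acceptance rule: keep a revision only if it scores higher.
        if score > best_score:
            best_prompt, best_score, failed = candidate, score, candidate_failed

    return best_prompt, best_score
```

The question-level feedback is what makes the revision step targeted: the reviser sees which specific checks failed rather than a single opaque score.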

BINEVAL offers a task-agnostic, training-free evaluation framework that combines strong empirical performance with practical diagnostic value and direct usability for prompt improvement.