LLMs Judge Worse Than They Generate in In-Context QA

A study challenges the assumption that large language models evaluate their own outputs better than they generate them, finding that generation accuracy exceeds self-evaluation accuracy on three of four tested benchmarks. The research uses a controlled in-context QA setting, in which the relevant passage is supplied in the prompt, to isolate evaluation performance from the confound of parametric knowledge.
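To make the comparison concrete, here is a minimal sketch of how the two accuracies could be computed from per-example records. The metric definitions are assumptions for illustration, not the study's scoring code: generation accuracy is taken as normalized exact match against the reference, and evaluation accuracy as how often the model's verdict on its own answer agrees with that match. The names `Record`, `normalize`, `generation_accuracy`, and `evaluation_accuracy` are hypothetical, and the toy records are invented.

```python
from dataclasses import dataclass


@dataclass
class Record:
    gold: str             # reference answer
    generated: str        # answer the model produced
    judged_correct: bool  # model's verdict on its own answer


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def is_correct(r: Record) -> bool:
    return normalize(r.generated) == normalize(r.gold)


def generation_accuracy(records: list[Record]) -> float:
    # Fraction of generated answers that match the reference.
    return sum(is_correct(r) for r in records) / len(records)


def evaluation_accuracy(records: list[Record]) -> float:
    # Fraction of verdicts that agree with actual correctness.
    return sum(r.judged_correct == is_correct(r) for r in records) / len(records)


# Invented toy data, not results from the study.
records = [
    Record("Paris", "paris", True),                # correct, accepted
    Record("1999", "1999", False),                 # correct, wrongly rejected
    Record("blue", "blue", True),                  # correct, accepted
    Record("Ada Lovelace", "Alan Turing", True),   # wrong, wrongly accepted
]

print(generation_accuracy(records), evaluation_accuracy(records))
```

In this toy set the model answers three of four questions correctly but judges only two of four answers correctly, which is the kind of gap the study reports.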

On SQuAD 2.0, DROP, and HotpotQA, models generated answers more accurately than they judged them; multi-hop MuSiQue was the only exception among the four benchmarks.
Attention analysis shows that during evaluation, models attend to the context passages 3 to 5 times less than during generation and barely attend to the candidate answer they are judging.
LoRA fine-tuning experiments confirm this asymmetry is not a training artifact; generation fine-tuning induces over-acceptance while evaluation fine-tuning degrades generation performance.
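The attention finding can be illustrated with a small sketch of one possible measurement: the share of attention mass that a single query position places on each prompt segment. This is an assumed, simplified metric, not necessarily the one used in the study. The function `attention_share`, the segment names, and all weights are hypothetical and made up for illustration.

```python
def attention_share(weights: list[float],
                    spans: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Return the fraction of attention mass falling in each token span.

    weights: attention weights from one query position over all prompt tokens.
    spans:   segment name -> (start, end) token indices, end exclusive.
    """
    total = sum(weights)
    return {name: sum(weights[start:end]) / total
            for name, (start, end) in spans.items()}


# Invented weights over a 7-token prompt, not measurements from the study.
weights = [0.05, 0.10, 0.15, 0.10, 0.05, 0.05, 0.50]
spans = {
    "context": (0, 4),      # passage tokens
    "candidate": (4, 6),    # answer being judged
    "instruction": (6, 7),  # everything else
}

shares = attention_share(weights, spans)
print(shares)
```

Computing these shares separately for generation prompts and evaluation prompts, then taking the ratio of the `context` shares, would give a figure comparable in spirit to the "3 to 5 times less" comparison.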

These findings challenge core assumptions in self-evaluation pipelines, suggesting that current methods may be fundamentally flawed due to how models process information during judgment versus generation.