A study challenges the assumption that large language models are better at evaluating their own outputs than at generating them, finding that generation accuracy exceeds self-evaluation accuracy on three of the four benchmarks tested. The research uses a controlled in-context QA setting to separate evaluation performance from parametric-knowledge confounds.
- Across SQuAD 2.0, DROP, HotpotQA, and MuSiQue, models generated answers more accurately than they judged them, with the exception of multi-hop MuSiQue.
- Attention analysis reveals that during evaluation, models attend to context passages 3 to 5 times less than during generation and place very little attention on the candidate answer.
- LoRA fine-tuning experiments confirm this asymmetry is not a training artifact; generation fine-tuning induces over-acceptance while evaluation fine-tuning degrades generation performance.
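
The comparison in the first bullet can be made concrete with a small sketch. It assumes exact match after light normalization as the correctness criterion, and it assumes that "evaluation accuracy" means how often the model's accept/reject verdict agrees with whether the candidate is actually correct; the summary does not state the study's exact metrics. The record fields (`gold`, `generated`, `candidate`, `accepted`) and the toy data are hypothetical and are not results from the study.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def generation_accuracy(records) -> float:
    """Fraction of questions where the generated answer matches the gold answer."""
    return sum(exact_match(r["generated"], r["gold"]) for r in records) / len(records)


def evaluation_accuracy(records) -> float:
    """Fraction of verdicts that agree with the candidate's true correctness."""
    agree = 0
    for r in records:
        candidate_is_right = exact_match(r["candidate"], r["gold"])
        agree += r["accepted"] == candidate_is_right
    return agree / len(records)


# Hypothetical toy records (assumption: one generated answer and one judged
# candidate per question). The judge accepts everything, which mimics the
# over-acceptance pattern described above.
records = [
    {"gold": "Paris", "generated": "paris", "candidate": "Lyon", "accepted": True},
    {"gold": "1969", "generated": "1969", "candidate": "1969", "accepted": True},
    {"gold": "the Nile", "generated": "Nile", "candidate": "Amazon", "accepted": True},
    {"gold": "Marie Curie", "generated": "Pierre Curie", "candidate": "Marie Curie", "accepted": True},
]

gen_acc = generation_accuracy(records)
eval_acc = evaluation_accuracy(records)
gap = gen_acc - eval_acc  # positive when generation beats self-evaluation
print(f"generation={gen_acc:.2f} evaluation={eval_acc:.2f} gap={gap:.2f}")
```

A positive `gap` corresponds to the pattern reported on three of the four benchmarks; a negative one would correspond to the MuSiQue exception.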
These findings challenge core assumptions behind self-evaluation pipelines and suggest that current methods may be fundamentally flawed, because models process the context and the candidate answer differently when judging than when generating.
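
One way to quantify the difference in how the prompt is read is the share of attention mass that falls on each prompt segment. The sketch below assumes a single attention-weight vector per prompt, already averaged over heads and layers; the summary does not say how the study aggregated attention. The segment names, token spans, and weights are illustrative placeholders, not measured values.

```python
from typing import Dict, List, Tuple


def attention_share(
    weights: List[float], segments: Dict[str, Tuple[int, int]]
) -> Dict[str, float]:
    """Return the fraction of total attention mass on each [start, end) token span."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    return {
        name: sum(weights[start:end]) / total
        for name, (start, end) in segments.items()
    }


# Hypothetical generation prompt: instruction, context passage, question.
gen_weights = [0.1] * 10
gen_segments = {"instruction": (0, 2), "context": (2, 8), "question": (8, 10)}

# Hypothetical evaluation prompt: same layout plus a candidate answer.
eval_weights = [0.35, 0.35] + [0.025] * 6 + [0.05, 0.05] + [0.025, 0.025]
eval_segments = {
    "instruction": (0, 2),
    "context": (2, 8),
    "question": (8, 10),
    "candidate": (10, 12),
}

gen_share = attention_share(gen_weights, gen_segments)
eval_share = attention_share(eval_weights, eval_segments)

# Ratio > 1 means the context receives less attention during evaluation.
context_ratio = gen_share["context"] / eval_share["context"]
print(f"context share: generation={gen_share['context']:.3f}, "
      f"evaluation={eval_share['context']:.3f}, ratio={context_ratio:.1f}")
print(f"candidate share during evaluation: {eval_share['candidate']:.3f}")
```

With real models, the weight vectors would come from the model's attention outputs for the final prompt position, and the spans from the tokenizer offsets of each segment.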