LLMs Judge Worse Than They Generate in In-Context QA
A study challenges the assumption that large language models evaluate their own outputs better than they generate them, finding that generation accuracy exceeds self-evaluation on three of four tested benchmarks. The research utilizes a controlled in-context QA setting to isolate evaluation performance from parametric knowledge confounds.