This article examines how reliable large language models (LLMs) are as evaluators in multilingual and low-resource language settings, and highlights significant gaps in current practice. The authors survey 650 ACL Anthology papers that mention LLM-as-a-judge and identify inconsistent evaluation outcomes and overreliance on a single judge model.
- Out of 650 papers mentioning LLM-as-a-judge, only 33 focus on low-resource or multilingual settings.
- The analysis finds inconsistent evaluation outcomes and a tendency to overtrust LLM judgments in these settings.
- Studies widely rely on a single judge model, without adequate human validation.
The authors provide recommendations for the NLP community to improve the validity of LLM-as-a-judge evaluations in diverse linguistic settings.
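
To make the idea of human validation concrete, the sketch below shows one common way to check several judge models against human labels, using Cohen's kappa (chance-corrected agreement). It is an illustration only, not the authors' procedure: the function names, the judge names, and the labels are all hypothetical, and a real study would likely compute this separately for each language.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def kappa_per_judge(human_labels, judge_labels):
    """Compare every judge model against the same human reference labels."""
    return {
        judge: cohen_kappa(human_labels, labels)
        for judge, labels in judge_labels.items()
    }


# Hypothetical data: six outputs rated by a human and by two judge models.
human = ["good", "bad", "good", "good", "bad", "bad"]
judges = {
    "judge_a": ["good", "bad", "good", "good", "bad", "bad"],
    "judge_b": ["good", "good", "good", "good", "bad", "good"],
}

scores = kappa_per_judge(human, judges)
for judge, kappa in scores.items():
    print(f"{judge}: kappa = {kappa:.2f}")
```

In this toy data the two judges disagree noticeably in how well they match the human rater, which is the kind of discrepancy that goes unnoticed when a study reports results from a single judge model.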