Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings

This article examines how reliable Large Language Models are as evaluators in multilingual and low-resource language contexts and highlights significant gaps in current practice. The authors analyze 650 ACL Anthology papers and identify inconsistent evaluation outcomes and overreliance on a single judge model.

Of the 650 papers that mention LLM-as-a-Judge, only 33 (about 5%) focus on low-resource or multilingual settings.
The analysis reveals inconsistent evaluation outcomes and a tendency to overtrust LLM judgments in these settings.
There is widespread reliance on a single judge model per study without adequate human validation.

The authors provide recommendations for the NLP community to improve the validity of LLM-as-a-Judge evaluations in diverse linguistic settings.