This article examines how reliable large language models (LLMs) are as evaluators in multilingual and low-resource language contexts, and highlights significant gaps in current practice. The authors analyze 650 ACL Anthology papers to identify inconsistent evaluation outcomes and overreliance on a single judge model.

  • Out of 650 papers mentioning LLM-as-a-judge, only 33 focus on low-resource or multilingual settings.
  • Analysis reveals inconsistent evaluation outcomes and a tendency to overtrust LLM judgments in these contexts.
  • There is widespread reliance on a single judge model per study without adequate human validation.

The authors provide recommendations for the NLP community to improve the validity of LLM-as-a-judge evaluations in diverse linguistic settings.
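
To make "human validation" concrete, one common check is to measure chance-corrected agreement between a judge model's labels and human labels on the same items. The sketch below computes Cohen's kappa with the Python standard library only. It is an illustration under stated assumptions, not the authors' procedure: the function name `cohen_kappa` and the sample labels are hypothetical, and the summary above does not say which agreement measure, if any, the authors recommend.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement by chance, from each rater's own label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for illustration only (not data from the paper).
human_labels = ["good", "bad", "good", "good", "bad", "bad"]
judge_labels = ["good", "bad", "good", "bad", "bad", "good"]
kappa = cohen_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.3f}")
```

In a multilingual study, a check like this would plausibly be run separately for each language and each judge model, since agreement that holds for a high-resource language need not carry over to a low-resource one.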