BabelJudge is an open-source framework that measures four key bias modes in LLM judges across languages and agent trajectories. It reports a significant reliability drop from Hindi to Swahili (0.714 to 0.550) and shows that raw accuracy alone fails to capture critical failures such as order inconsistency: the order-consistency score collapses to 0.480 in Swahili. The framework also extends to agentic evaluation with nine perturbations and three new metrics, and it ships as a Python package that supports 11 judge backends.
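To make the order-inconsistency failure concrete, here is a minimal sketch of one common way to score a pairwise judge's sensitivity to presentation order. It is an illustration under stated assumptions, not BabelJudge's actual metric or API: the `Judge` signature, the `order_consistency` function, and the two toy judges are all hypothetical names introduced here. The score is the fraction of items on which the judge picks the same underlying response when the two candidates are swapped.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not the BabelJudge API):
# takes (prompt, first_response, second_response), returns "first" or "second".
Judge = Callable[[str, str, str], str]


def order_consistency(judge: Judge, items: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of items where the verdict survives swapping the response order."""
    if not items:
        raise ValueError("need at least one item")
    consistent = 0
    for prompt, resp_a, resp_b in items:
        forward = judge(prompt, resp_a, resp_b)  # A shown first
        reverse = judge(prompt, resp_b, resp_a)  # B shown first
        # Map positional verdicts back to the underlying response labels.
        winner_forward = "A" if forward == "first" else "B"
        winner_reverse = "B" if reverse == "first" else "A"
        if winner_forward == winner_reverse:
            consistent += 1
    return consistent / len(items)


# Toy judges for illustration only.
def always_first(prompt: str, first: str, second: str) -> str:
    """A fully position-biased judge: always prefers whatever is shown first."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """A position-independent judge: prefers the longer response."""
    return "first" if len(first) > len(second) else "second"


items = [
    ("Q1", "short", "a much longer answer"),
    ("Q2", "a detailed reply here", "brief"),
]
biased_score = order_consistency(always_first, items)
stable_score = order_consistency(longer_wins, items)
```

A judge can have reasonable raw accuracy on one fixed ordering while still scoring poorly here, which is why a metric of this kind is reported separately from accuracy.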
BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories