An open evaluation involving 55 models from 11 developer families revealed that large language models exhibit statistically significant in-group bias when blind-grading each other. Across 22,254 valid judgments, every family with sufficient data showed a tendency to rate its own members differently than those of other families.
- Qwen judges favored other Qwen models by +0.91 points on a 0-10 scale.
- Mistral judges penalized other Mistral models by -1.02, the largest absolute bias observed.
- Google and Meta showed negative biases of -0.59 and -0.68, respectively.
- xAI, Anthropic, MiniMax, and OpenAI exhibited positive in-group biases ranging from +0.23 to +0.75.
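The summary does not spell out how the per-family bias figures were computed, so the following is only a minimal sketch of one plausible definition: for each family, the mean score its models receive from same-family judges (self-grading excluded) minus the mean score those same models receive from judges in other families. The function name `in_group_bias`, the `(judge, target, score)` tuple layout, and the toy models and scores are all hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import mean


def in_group_bias(judgments, family_of):
    """Return {family: in-group mean minus out-group mean}.

    judgments: iterable of (judge_model, target_model, score) tuples.
    family_of: dict mapping each model name to its developer family.
    NOTE: this definition is an assumption, not the study's stated method.
    """
    from_own = defaultdict(list)     # scores received from same-family judges
    from_others = defaultdict(list)  # scores received from other-family judges

    for judge, target, score in judgments:
        if judge == target:
            continue  # skip self-grading; only *other* family members count
        target_family = family_of[target]
        if family_of[judge] == target_family:
            from_own[target_family].append(score)
        else:
            from_others[target_family].append(score)

    # Report only families that have both in-group and out-group judgments.
    return {
        fam: mean(from_own[fam]) - mean(from_others[fam])
        for fam in from_own
        if fam in from_others
    }


# Hypothetical toy data: two families, two models each, scores on a 0-10 scale.
family_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
judgments = [
    ("a1", "a2", 8.0), ("a2", "a1", 9.0),  # family A judging family A
    ("b1", "a1", 7.0), ("b2", "a2", 8.0),  # family B judging family A
    ("b1", "b2", 5.0), ("b2", "b1", 6.0),  # family B judging family B
    ("a1", "b1", 6.0), ("a2", "b2", 7.0),  # family A judging family B
]

bias = in_group_bias(judgments, family_of)
print(bias)  # {'A': 1.0, 'B': -1.0}
```

Under this definition a positive value means a family's judges score their siblings higher than outside judges do, and a negative value means they score them lower. Comparing against outside judges on the same targets keeps a family's overall quality from being mistaken for bias; a real analysis would also need a significance test, which this sketch omits.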
The study also highlights that aggregate leaderboards are misleading, since six different models hold the top spot across various categories, and it suggests that future evaluations anchor judgments to ground truth where possible.