Zero-shot evaluation shows Gemini leads LLMs on 13-class emotion taxonomy

A study evaluated three commercial large language models—Claude (claude-sonnet-4-6), ChatGPT (GPT-5.4), and Gemini (gemini-2.5-flash)—on a zero-shot fine-grained emotion classification task using a stratified 1,000-sentence sample from the boltuix/emotions dataset.
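The summary does not say how the stratified sample was drawn. As a minimal sketch, assuming proportional allocation per emotion label (the study may instead have sampled equally per class), a sample of this kind could be built as follows. The function name stratified_sample and the (sentence, label) row format are hypothetical and not taken from the study.

```python
import random
from collections import defaultdict


def stratified_sample(rows, n_total, seed=0):
    """Draw n_total rows, keeping each label's share close to its share in rows.

    rows is a list of (sentence, label) pairs. Proportional allocation is an
    assumption made for this sketch; the study's actual scheme is not stated.
    """
    if n_total > len(rows):
        raise ValueError("n_total exceeds the number of available rows")

    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in rows:
        by_label[label].append((sentence, label))

    # Ideal (fractional) number of rows per label.
    quotas = {lab: n_total * len(items) / len(rows) for lab, items in by_label.items()}
    # Round down first, then hand the leftover slots to the largest remainders.
    alloc = {lab: int(q) for lab, q in quotas.items()}
    leftover = n_total - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda lab: quotas[lab] - alloc[lab], reverse=True)
    for lab in by_remainder[:leftover]:
        alloc[lab] += 1

    sample = []
    for lab, items in by_label.items():
        sample.extend(rng.sample(items, alloc[lab]))
    rng.shuffle(sample)
    return sample
```

Fixing the seed keeps the sample reproducible, so that every model is scored on the same sentences.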

Gemini achieved the highest accuracy (39.9%) and macro-F1 score (0.363).
ChatGPT followed with 38.8% accuracy and a macro-F1 of 0.291.
Claude reached 38.0% accuracy but a markedly lower macro-F1 of 0.159, which suggests that its predictions were concentrated in a few classes at the expense of the others.
All models excelled at sarcasm and desire but consistently failed on love, confusion, and shame.
Pairwise McNemar tests found no statistically significant difference between any two models (p > 0.10), suggesting that all three converge on a shared zero-shot ceiling.
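To illustrate how two models can tie on accuracy yet differ sharply on macro-F1, and how a paired McNemar comparison works, here is a minimal sketch on invented toy data. The labels are borrowed from the taxonomy, but the predictions are made up and are not the study's outputs. The exact binomial form of the McNemar test is an assumption; the summary does not say which variant was used.

```python
from math import comb


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, over the labels that appear in gold."""
    scores = []
    for lab in sorted(set(gold)):
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value from the discordant pairs only."""
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(a == g and x != g for g, a, x in zip(gold, pred_a, pred_b))
    c = sum(a != g and x == g for g, a, x in zip(gold, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (invented): 4 sarcasm, 3 love, 3 shame sentences.
gold = ["sarcasm"] * 4 + ["love"] * 3 + ["shame"] * 3
# Model A predicts one class for everything.
collapsed = ["sarcasm"] * 10
# Model B spreads its predictions across all three classes.
spread = ["sarcasm", "sarcasm", "love", "shame", "love",
          "sarcasm", "sarcasm", "shame", "love", "love"]

print(accuracy(gold, collapsed), accuracy(gold, spread))   # both 0.4
print(round(macro_f1(gold, collapsed), 2))                 # 0.19
print(round(macro_f1(gold, spread), 2))                    # 0.4
print(mcnemar_exact(gold, collapsed, spread))              # 1.0
```

In this toy case both models get 4 of 10 sentences right, so accuracy and the McNemar test see no difference, while macro-F1 penalizes the model that never predicts two of the three classes.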

These findings highlight the current limitations of frontier AI systems in performing zero-shot fine-grained emotion classification.