A study evaluated three commercial large language models—Claude (claude-sonnet-4-6), ChatGPT (GPT-5.4), and Gemini (gemini-2.5-flash)—on a zero-shot fine-grained emotion classification task using a stratified 1,000-sentence sample from the boltuix/emotions dataset.

  • Gemini achieved the highest accuracy (39.9%) and macro-F1 score (0.363).
  • ChatGPT followed with 38.8% accuracy and a macro-F1 of 0.291.
  • Claude scored 38.0% accuracy but had a markedly lower macro-F1 of 0.159, suggesting that its predictions were skewed toward a subset of the emotion classes.
  • All models excelled at sarcasm and desire but consistently failed on love, confusion, and shame.
  • McNemar's tests revealed no statistically significant pairwise differences in accuracy (p > 0.10), suggesting that the three models converge on a shared zero-shot ceiling.
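
As a rough illustration of the pairwise comparison mentioned in the last bullet, the sketch below runs an exact (binomial) McNemar test on paired correctness flags for two models. The summary does not say which variant of the test the study used, so the exact form is an assumption here, and the function name `mcnemar_exact` and the toy data are hypothetical, not taken from the study.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-sentence correctness flags.

    Returns (b, c, p_value), where b counts sentences only model A got
    right and c counts sentences only model B got right.
    """
    # Only discordant pairs carry information about a difference in accuracy.
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    # Under the null hypothesis, each discordant pair favours A or B with
    # probability 0.5, so min(b, c) follows a Binomial(n, 0.5) tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data: True means the model labelled that sentence correctly.
model_a = [True, True, True, True, True, True, False, False, False, False]
model_b = [True, True, True, True, False, False, True, False, False, False]

b, c, p = mcnemar_exact(model_a, model_b)
print(f"only A correct: {b}, only B correct: {c}, p = {p:.3f}")
```

With accuracies as close as those reported above, the discordant counts `b` and `c` would be expected to be nearly balanced, which is the situation in which the test returns a large p-value.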

These findings highlight the current limitations of frontier AI systems in performing zero-shot fine-grained emotion classification.
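
To make the gap between accuracy and macro-F1 concrete, here is a minimal sketch with hand-rolled metrics on invented data. The labels, counts, and the two hypothetical predictors (`collapsed` and `spread`) are assumptions for illustration only and do not reproduce the study's outputs. Macro-F1 is averaged over every label that appears in either the gold or the predicted labels, which is one common convention.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels: one frequent class and two rarer ones.
y_true = ["joy"] * 6 + ["shame"] * 2 + ["love"] * 2

# Predictor 1 collapses every sentence onto the frequent class.
collapsed = ["joy"] * 10

# Predictor 2 makes the same number of errors but spreads them across classes.
spread = ["joy"] * 4 + ["shame", "love"] + ["shame", "joy"] + ["love", "joy"]

for name, y_pred in [("collapsed", collapsed), ("spread", spread)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

Both toy predictors reach the same accuracy, but the one that ignores the rarer classes has a much lower macro-F1, because each class counts equally in the average regardless of how often it occurs.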