The SPLIT benchmark evaluates how consistently large language models generate emotionally grounded responses across five crisis-related categories: Stress, Panic, Loneliness, Internal Displacement, and Tension. The framework assesses three technically diverse LLMs on empathetic accuracy, linguistic naturalness, and contextual and cultural grounding, in both English and Ukrainian.

  • Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade when moving from English to Ukrainian, while DeepSeek-V3 remains comparatively stable.
  • Human and AI evaluators agree weakly on empathy and naturalness but diverge on cultural grounding.
  • The study argues that producing Ukrainian text is not equivalent to producing Ukrainian emotional support.

The findings are intended to inform the design of more culturally tailored benchmarks and to encourage a stronger emphasis on human-centered evaluation.