Do Speech Emphasis Models Generalize across Languages and Emotions?

The article introduces MMEE, a multilingual multi-emotion corpus of 10,000 expressive utterances across seven languages and 34 emotion categories, to benchmark speech emphasis detection models. It evaluates how well these models generalize across languages and emotions, in contrast to the usual practice of training on monolingual, neutral speech.

The MMEE corpus contains 14.13 hours of professionally recorded utterances with three-level perceptual labels.
Monolingual models show limited zero-shot transfer and degrade most when applied to typologically distant languages.
Multilingual training substantially improves model robustness across diverse linguistic settings.
Models transfer robustly between high- and low-arousal emotions, suggesting shared prosodic structures.
Performance remains stable even at smaller training scales, and bidirectional transfer between synthetic and perceptual benchmarks is observed.
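
One way to quantify the zero-shot transfer findings above is to compare a model's average in-language score with its average score on languages it was not trained on. The following is a minimal sketch under stated assumptions, not the article's evaluation code: the function name `transfer_gap`, the `(train_lang, test_lang)` score layout, and the placeholder language names are all hypothetical, and the numbers in the usage lines are made up purely for illustration.

```python
from statistics import mean


def transfer_gap(scores):
    """Return mean in-language score minus mean zero-shot score.

    `scores` maps (train_lang, test_lang) -> metric value, where higher
    is better. This layout is an assumption made for this sketch; the
    article's actual metric and evaluation protocol may differ.
    """
    in_lang = [v for (src, tgt), v in scores.items() if src == tgt]
    zero_shot = [v for (src, tgt), v in scores.items() if src != tgt]
    if not in_lang or not zero_shot:
        raise ValueError("need at least one in-language and one zero-shot score")
    return mean(in_lang) - mean(zero_shot)


# Illustrative placeholder numbers only; these are NOT results from the article.
example_scores = {
    ("lang_a", "lang_a"): 0.8,
    ("lang_a", "lang_b"): 0.6,
    ("lang_b", "lang_b"): 0.9,
    ("lang_b", "lang_a"): 0.5,
}
gap = transfer_gap(example_scores)
print(f"in-language minus zero-shot gap: {gap:.2f}")
```

A larger gap would indicate weaker cross-lingual generalization. Computing the same quantity separately for monolingual and multilingual models, or across emotion groups instead of languages, would give a comparable summary for the other findings.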