Contextualized embeddings predict Mandarin word duration and pitch

A study investigates whether contextualized embeddings (CEs) can predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from a spontaneous speech corpus. The results show that CEs predict duration above chance level, both at the type level and for individual tokens.
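
How chance level was estimated is not specified here. As a minimal sketch, assuming a permutation baseline, one could shuffle the predicted durations across tokens and count how often the shuffled error is at least as low as the real one. The function names, the mean-absolute-error metric, and the toy durations are illustrative assumptions, not the study's procedure.

```python
import random


def mean_abs_error(pred, obs):
    """Mean absolute difference between predicted and observed durations."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def permutation_p_value(pred, obs, n_perm=1000, seed=0):
    """Share of shuffles whose error is no worse than the unshuffled error.

    Assumption: chance is modelled by breaking the pairing between
    predictions and tokens while keeping both sets of values intact.
    """
    rng = random.Random(seed)
    observed = mean_abs_error(pred, obs)
    shuffled = list(pred)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_abs_error(shuffled, obs) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy durations in milliseconds (hypothetical values, not corpus data).
observed_ms = [110.0, 95.0, 180.0, 140.0, 160.0, 125.0, 205.0, 90.0]
predicted_ms = [118.0, 101.0, 171.0, 150.0, 149.0, 131.0, 190.0, 99.0]
p_value = permutation_p_value(predicted_ms, observed_ms)
```

A small `p_value` means that predictions paired with their own tokens fit better than predictions assigned to random tokens.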

Predicted durations are precise enough to back-transform f0 contours from normalized time to the millisecond scale.
The resulting predicted contours approximate empirical contours and outperform permutation baselines.
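
The back-transformation can be illustrated with a short sketch. It assumes a contour sampled on normalized time from 0 to 1 and a single predicted duration in milliseconds; the function name and the sample values are hypothetical, and the study's actual procedure may differ.

```python
def back_transform(norm_times, f0_values, predicted_duration_ms):
    """Rescale a contour from normalized time [0, 1] to milliseconds.

    Assumption: normalized time maps linearly onto the predicted duration,
    so each time point is simply multiplied by that duration.
    """
    if predicted_duration_ms <= 0:
        raise ValueError("predicted duration must be positive")
    if len(norm_times) != len(f0_values):
        raise ValueError("time and f0 sequences must have equal length")
    return [(t * predicted_duration_ms, f0)
            for t, f0 in zip(norm_times, f0_values)]


# Hypothetical falling contour (f0 in Hz) and a predicted duration of 240 ms.
contour_ms = back_transform([0.0, 0.5, 1.0], [200.0, 180.0, 150.0], 240.0)
```

Because the rescaling is linear, any error in the predicted duration stretches or compresses the whole contour proportionally, which is why the precision of the duration predictions matters here.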

Taken together, these findings indicate that CEs contain sufficient information to model temporal aspects of speech, which could support more accurate synthesis of prosody.