A study investigates whether contextualized embeddings (CEs) can predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from a spontaneous speech corpus. The results show that CEs predict duration above chance level, both at the type level and for individual tokens.
- Predicted durations are precise enough to back-transform f0 contours from normalized time to the millisecond scale.
- The resulting predicted contours approximate empirical contours and outperform permutation baselines.
These results indicate that CEs carry enough information to model temporal aspects of speech, which could support more accurate synthesis of prosody.
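
The back-transformation and the permutation baseline can be illustrated with a minimal sketch. It is not the study's implementation: the function names (`back_transform`, `contour_rmse`, `permutation_baseline`), the RMSE error measure, and the choice to compare contours on a shared millisecond grid are all assumptions made for illustration. The idea is that an f0 contour sampled on normalized time in [0, 1] is stretched by a predicted duration, compared against the same contour stretched by the observed duration, and then scored against a baseline in which predicted durations are shuffled across tokens.

```python
import numpy as np


def back_transform(norm_time, duration_ms):
    """Map normalized time points in [0, 1] onto a millisecond axis."""
    return np.asarray(norm_time, dtype=float) * float(duration_ms)


def contour_rmse(norm_time, f0, pred_ms, true_ms, n_grid=50):
    """RMSE between one f0 contour stretched to the predicted duration and
    the same contour stretched to the observed duration.

    Both versions are evaluated on a shared millisecond grid covering the
    shorter of the two durations (an assumption of this sketch).
    """
    grid = np.linspace(0.0, min(pred_ms, true_ms), n_grid)
    pred = np.interp(grid, back_transform(norm_time, pred_ms), f0)
    true = np.interp(grid, back_transform(norm_time, true_ms), f0)
    return float(np.sqrt(np.mean((pred - true) ** 2)))


def permutation_baseline(norm_time, f0_list, pred_ms, true_ms,
                         n_perm=100, seed=0):
    """Mean contour error per permutation when predicted durations are
    randomly reassigned across tokens."""
    rng = np.random.default_rng(seed)
    pred_ms = np.asarray(pred_ms, dtype=float)
    scores = []
    for _ in range(n_perm):
        shuffled = rng.permutation(pred_ms)
        errors = [contour_rmse(norm_time, f0, p, t)
                  for f0, p, t in zip(f0_list, shuffled, true_ms)]
        scores.append(np.mean(errors))
    return np.array(scores)
```

Under these assumptions, predicted durations would count as informative when the mean `contour_rmse` for the unshuffled predictions falls below most of the values returned by `permutation_baseline`.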