A study investigates whether contextualized embeddings (CEs) can predict spoken word duration, using 7470 tokens of Mandarin monosyllabic CV words extracted from a spontaneous speech corpus. The results show that CEs predict duration above chance level, both at the type level and for individual tokens.

  • Predicted durations are precise enough to back-transform f0 contours from normalized time to the millisecond scale.
  • The resulting predicted contours approximate empirical contours and outperform permutation baselines.
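
The two points above can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the 10 ms sampling step, the mean-absolute-difference error measure, and the toy values are all assumptions made for illustration. The sketch stretches an f0 contour defined on normalized time (0 to 1) to milliseconds using a duration value, then compares the error obtained with matched predicted durations against the average error obtained when predicted durations are shuffled across tokens.

```python
import random


def back_transform(norm_times, duration_ms):
    """Map normalized time points in [0, 1] onto the millisecond scale."""
    return [t * duration_ms for t in norm_times]


def interpolate(times, values, query):
    """Linearly interpolate a contour at one time point, clamping at both ends.

    Assumes `times` is strictly increasing.
    """
    if query <= times[0]:
        return values[0]
    if query >= times[-1]:
        return values[-1]
    for i in range(1, len(times)):
        if query <= times[i]:
            weight = (query - times[i - 1]) / (times[i] - times[i - 1])
            return values[i - 1] + weight * (values[i] - values[i - 1])


def contour_error(norm_times, f0, predicted_ms, observed_ms, step_ms=10.0):
    """Mean absolute f0 difference (Hz) between the contour stretched to the
    predicted duration and the same contour stretched to the observed
    duration, sampled every `step_ms` over the shorter of the two spans.
    This error measure is an assumption, not the study's metric."""
    pred_t = back_transform(norm_times, predicted_ms)
    obs_t = back_transform(norm_times, observed_ms)
    n_samples = int(min(predicted_ms, observed_ms) // step_ms) + 1
    grid = [k * step_ms for k in range(n_samples)]
    diffs = [abs(interpolate(pred_t, f0, q) - interpolate(obs_t, f0, q))
             for q in grid]
    return sum(diffs) / len(diffs)


def mean_error(norm_times, contours, predicted, observed):
    """Average contour error over all tokens."""
    errors = [contour_error(norm_times, f0, p, o)
              for f0, p, o in zip(contours, predicted, observed)]
    return sum(errors) / len(errors)


def permutation_baseline(norm_times, contours, predicted, observed,
                         n_perm=200, seed=0):
    """Average error when predicted durations are randomly reassigned
    to tokens, which breaks the token-to-prediction pairing."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_perm):
        shuffled = predicted[:]
        rng.shuffle(shuffled)
        errors.append(mean_error(norm_times, contours, shuffled, observed))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy values for illustration only; they are not data from the study.
    norm_times = [0.0, 0.5, 1.0]
    contours = [[100.0, 120.0, 150.0],
                [180.0, 150.0, 130.0],
                [140.0, 110.0, 160.0]]
    observed = [150.0, 200.0, 250.0]   # durations in ms
    predicted = [160.0, 190.0, 240.0]  # hypothetical model output in ms
    print("model error:", mean_error(norm_times, contours, predicted, observed))
    print("permutation baseline:",
          permutation_baseline(norm_times, contours, predicted, observed))
```

Under these assumptions, a model whose duration predictions carry token-specific information should yield a lower mean error than the shuffled baseline. The shuffle keeps the overall distribution of predicted durations intact, so any advantage reflects the pairing between prediction and token rather than the average duration alone.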

Together, these findings indicate that CEs contain sufficient information to model temporal aspects of speech, which could enable more accurate synthesis of prosody.