The authors propose a classifier-based framework for auditing multilingual text-to-speech (TTS) systems against language-specific phonological patterns, using human speech as the benchmark. The approach addresses a limitation of standard metrics such as the mean opinion score (MOS), which do not test whether a system preserves the sound contrasts that distinguish words.

  • The framework was tested on Assamese advanced tongue root (ATR) vowel harmony using Meta's MMS TTS model.
  • A classifier trained on human speech transferred to synthesized speech with minimal loss.
  • The audit revealed that [+ATR] mid vowels were realized as [-ATR] in one-third of tokens, a bias absent in human speech.
  • At the word level, harmony was classified more accurately from the classifier's predicted ATR labels than from transcription labels, highlighting a gap between intended and produced phonology.
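
The token-level audit described in the points above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the nearest-centroid classifier, the choice of F1/F2 formant features, the function names, and every numeric value are assumptions made up for the example. The source does not state which classifier or acoustic features were used.

```python
from statistics import mean

# Hypothetical toy data: (F1, F2) in Hz for each vowel token, with an ATR label.
# All values are invented for illustration; none come from the source.
HUMAN = [
    ((400.0, 2100.0), "+ATR"),
    ((420.0, 2050.0), "+ATR"),
    ((410.0, 2000.0), "+ATR"),
    ((560.0, 1850.0), "-ATR"),
    ((580.0, 1800.0), "-ATR"),
    ((600.0, 1900.0), "-ATR"),
]

# Synthesized tokens: measured features plus the label the transcription intends.
TTS = [
    ((415.0, 2080.0), "+ATR"),
    ((570.0, 1820.0), "+ATR"),  # intended +ATR, acoustically close to -ATR
    ((405.0, 2020.0), "+ATR"),
    ((425.0, 2060.0), "+ATR"),
    ((590.0, 1880.0), "-ATR"),
    ((575.0, 1810.0), "-ATR"),
]


def fit_centroids(tokens):
    """Compute the mean feature vector per label (trained on human speech only)."""
    by_label = {}
    for feats, label in tokens:
        by_label.setdefault(label, []).append(feats)
    return {
        label: tuple(mean(dim) for dim in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, feats):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def sqdist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def mismatch_rate(centroids, tokens, intended):
    """Share of tokens intended as `intended` that the classifier labels otherwise."""
    targets = [feats for feats, label in tokens if label == intended]
    wrong = sum(predict(centroids, feats) != intended for feats in targets)
    return wrong / len(targets)


centroids = fit_centroids(HUMAN)
human_rate = mismatch_rate(centroids, HUMAN, "+ATR")
tts_rate = mismatch_rate(centroids, TTS, "+ATR")
print(f"human +ATR mismatch: {human_rate:.2f}, TTS +ATR mismatch: {tts_rate:.2f}")
```

Comparing the two rates is the point of the design: a mismatch rate that is elevated for synthesized speech but not for human speech suggests a bias in the TTS system rather than in the classifier.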

The framework provides task-specific diagnostics for TTS quality and generalizes to other phonological contrasts that have measurable acoustic cues.