Phonology-Informed Evaluation Framework Audits Multilingual TTS Faithfulness

The authors propose a classifier-based framework to audit multilingual text-to-speech (TTS) systems against language-specific phonological patterns, using human speech as a benchmark. This approach addresses a limitation of standard metrics such as mean opinion score (MOS), which do not test whether a system preserves the sound contrasts that distinguish words.

The framework was tested on Assamese advanced tongue root (ATR) vowel harmony using Meta's MMS TTS model.
A classifier trained on human speech transferred to synthesized speech with minimal loss.
The audit revealed that [+ATR] mid vowels were realized as [-ATR] in one-third of tokens, a bias absent in human speech.
At the word level, ATR labels predicted by the classifier identified harmony more accurately than labels derived from the transcription, highlighting a gap between intended and produced phonology.
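
The audit logic described above can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier, the two-dimensional (F1, F2) formant features, the toy values, and every function name are assumptions chosen for illustration, and the harmony check uses a deliberately simplified rule (all vowels in a word share one ATR value). The sketch fits a classifier on human vowel tokens, applies it to synthesized tokens, and reports how often an intended [+ATR] vowel is heard as something else.

```python
from statistics import mean

def fit_centroids(samples):
    """Fit one centroid per label from (features, label) pairs.

    Hypothetical stand-in for the classifier trained on human speech.
    """
    by_label = {}
    for feats, label in samples:
        by_label.setdefault(label, []).append(feats)
    return {
        label: tuple(mean(dim) for dim in zip(*rows))
        for label, rows in by_label.items()
    }

def predict(centroids, feats):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(feats, centroids[lab])),
    )

def mismatch_rate(centroids, tokens, intended):
    """Share of tokens transcribed as `intended` that the classifier labels differently."""
    relevant = [feats for feats, label in tokens if label == intended]
    if not relevant:
        return 0.0
    wrong = sum(predict(centroids, feats) != intended for feats in relevant)
    return wrong / len(relevant)

def is_harmonic(vowel_labels):
    """Simplified assumption: a word is harmonic if all its vowels share one ATR value."""
    return len(set(vowel_labels)) <= 1

# Toy (F1, F2) values in Hz, invented for illustration only.
human_tokens = [
    ((400, 2000), "+ATR"), ((420, 1950), "+ATR"),
    ((560, 1800), "-ATR"), ((580, 1750), "-ATR"),
]
# Synthesized tokens paired with the label the transcription intends.
tts_tokens = [
    ((405, 1990), "+ATR"), ((415, 1960), "+ATR"),
    ((400, 2010), "+ATR"), ((555, 1810), "+ATR"),
]

centroids = fit_centroids(human_tokens)
rate = mismatch_rate(centroids, tts_tokens, "+ATR")
print(f"[+ATR] tokens heard as another class: {rate:.2f}")

# Word-level check: intended labels vs. labels predicted from the audio.
word = tts_tokens[2:]  # hypothetical two-vowel word
intended = [label for _, label in word]
produced = [predict(centroids, feats) for feats, _ in word]
print("intended harmonic:", is_harmonic(intended))
print("produced harmonic:", is_harmonic(produced))
```

Comparing `intended` with `produced` for the same word is what exposes the gap between the phonology a transcription implies and the phonology the system actually realizes. A real audit would replace the toy features with measured acoustic cues and the centroid model with a properly validated classifier.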

The framework provides task-specific diagnostics for TTS quality and generalizes to other phonological contrasts that have measurable acoustic cues.