Speech Quality Models Fail to Capture Prosodic and F0 Variability

Mean opinion score (MOS) prediction models capture acoustic degradation accurately but fail to detect prosodic errors and changes in speaker-specific characteristics such as pitch and speaking rate. Human listeners report significant quality drops under these perturbations, whereas the models show strong biases tied to fundamental frequency (F0) and little sensitivity to speaking rate or F0 variability.
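
The human-versus-model comparison described above can be illustrated with a minimal sketch. It assumes paired MOS ratings for the same utterances before and after a perturbation, and it measures how much of the human-perceived drop the model misses. The function names (`mean_drop`, `sensitivity_gap`) and all scores are hypothetical, chosen for illustration only; they are not the study's method or data.

```python
from statistics import mean


def mean_drop(original, perturbed):
    """Mean score drop (original minus perturbed) over paired utterances."""
    if len(original) != len(perturbed):
        raise ValueError("score lists must be paired")
    return mean(o - p for o, p in zip(original, perturbed))


def sensitivity_gap(human_orig, human_pert, model_orig, model_pert):
    """Human drop minus model drop; positive means the model under-reacts."""
    return mean_drop(human_orig, human_pert) - mean_drop(model_orig, model_pert)


# Hypothetical MOS values (1-5 scale), not data from the study.
human_orig = [4.5, 4.0, 4.5]
human_pert = [3.0, 3.0, 3.5]  # listeners penalize the perturbed speech
model_orig = [4.5, 4.0, 4.5]
model_pert = [4.5, 3.5, 4.5]  # model predictions barely move

gap = sensitivity_gap(human_orig, human_pert, model_orig, model_pert)
print(f"sensitivity gap: {gap:.2f} MOS points")
```

A gap near zero would indicate that the model tracks listeners for that perturbation type; a large positive gap is the kind of insensitivity the summary attributes to prosodic and F0 manipulations.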