Diffusion-based text-to-speech (TTS) models have improved speech quality but still struggle with sharp prosodic transitions and rapid pitch variations. Existing decoders often use periodic nonlinearities such as the Snake activation function, which lack the adaptability needed for abrupt amplitude and frequency changes. To address this, the authors introduce OscillaTTS, a system featuring an adaptive oscillatory nonlinearity. This component enables controllable periodic modulation while preserving signal stability through a linear bypass mechanism. The study investigates the role of oscillatory inductive bias within diffusion-based TTS decoders. Experiments on LJSpeech and the Emotional Speech Dataset show consistent improvements in both objective and subjective evaluations. These results indicate that OscillaTTS models expressive prosodic dynamics more effectively than prior methods.
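To make the contrast between a fixed periodic nonlinearity and one with a linear bypass concrete, the sketch below implements the standard Snake activation, x + sin^2(alpha * x) / alpha, next to a purely illustrative gated variant. The function `gated_oscillatory` and its `gate` parameter are assumptions introduced here for illustration; the abstract does not give the actual OscillaTTS formulation, so this should not be read as the authors' method.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + sin^2(alpha * x) / alpha (alpha must be nonzero)."""
    return x + math.sin(alpha * x) ** 2 / alpha


def gated_oscillatory(x: float, alpha: float = 1.0, gate: float = 0.5) -> float:
    """Hypothetical illustration, NOT the OscillaTTS definition.

    The input passes through unchanged (linear bypass), and a gate scales
    the periodic term. gate = 0 reduces to the identity; gate = 1 recovers
    Snake.
    """
    periodic = math.sin(alpha * x) ** 2 / alpha
    return x + gate * periodic


if __name__ == "__main__":
    for g in (0.0, 0.5, 1.0):
        print(g, gated_oscillatory(1.3, alpha=2.0, gate=g))
```

In this toy form, the gate interpolates between a purely linear path and the full periodic response, which is one simple way a controllable periodic modulation with a bypass could behave.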
OscillaTTS: Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS