HPRO: Hierarchical Progressive Reward Optimization for Emotional TTS

The authors propose HPRO, a hierarchical progressive reward optimization framework that enhances emotional expressiveness in LLM-based text-to-speech (TTS) models while preserving linguistic intelligibility. It addresses structural mismatches in existing preference-driven methods in two ways: by isolating emotion from content, and by bridging the gap between sparse rewards and dense generation.

Introduces the HD-Emo codec, a differentiable reward model that extracts distinct content and style preference tokens to resolve information conflicts.
Structurally isolates emotional optimization from semantic content to prevent reward hacking and semantic degradation.
Bridges the scale gap by progressively aligning frame-, word-, and sentence-level objectives.
Experiments demonstrate significant enhancement in emotional expressiveness without compromising linguistic intelligibility.
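
The progressive alignment of frame-, word-, and sentence-level objectives can be pictured with a small sketch. The summary does not say how HPRO schedules or weights the three levels, so everything below is an assumption made for illustration: the names `progressive_weights` and `combined_objective`, the triangular schedule, and the idea of mixing per-level losses with scalar weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict

# Granularity levels named in the summary, ordered from finest to coarsest.
LEVELS = ("frame", "word", "sentence")


def progressive_weights(step: int, total_steps: int) -> Dict[str, float]:
    """Hypothetical schedule: weight shifts from frame to word to sentence level."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Map training progress onto [0, 2]: 0 = frame, 1 = word, 2 = sentence.
    fraction = min(max(step / total_steps, 0.0), 1.0)
    position = fraction * (len(LEVELS) - 1)
    weights = {}
    for index, level in enumerate(LEVELS):
        # Triangular weight that peaks when position equals this level's index.
        weights[level] = max(0.0, 1.0 - abs(position - index))
    return weights


def combined_objective(level_losses: Dict[str, float], step: int, total_steps: int) -> float:
    """Weighted sum of per-level losses under the assumed schedule."""
    weights = progressive_weights(step, total_steps)
    return sum(weights[level] * level_losses[level] for level in LEVELS)


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, progressive_weights(step, 100))
```

Under this assumed schedule the weights always sum to one, so early training is dominated by the dense frame-level signal and late training by the sparse sentence-level signal. A real implementation might instead add levels cumulatively or train in discrete stages; the summary does not settle which.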

Standard supervised fine-tuning often yields statistically averaged prosody. HPRO overcomes this limitation by optimizing emotional expressiveness directly through hierarchical, progressively aligned rewards.