The authors propose HPRO, a hierarchical progressive reward optimization framework designed to enhance emotional expressiveness in LLM-based Text-to-Speech models while preserving linguistic intelligibility. This approach addresses structural mismatches in existing preference-driven methods by isolating content and emotion and bridging the gap between sparse rewards and dense generation.
- Introduces the HD-Emo codec, a differentiable reward model that extracts distinct content and style preference tokens to resolve the information conflict between the two.
- Structurally isolates emotional optimization from semantic content to prevent reward hacking and semantic degradation.
- Bridges the scale gap by progressively aligning frame-, word-, and sentence-level objectives.
- Experiments show that HPRO significantly improves emotional expressiveness without compromising linguistic intelligibility.
HPRO thereby addresses a key limitation of standard Supervised Fine-Tuning, which often produces statistically averaged prosody, and offers a more robust way to optimize emotional speech synthesis.
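
As a rough illustration of what progressively aligning frame-, word-, and sentence-level objectives could look like, the sketch below phases the three levels in over training and blends their rewards. This summary does not specify the schedule, so everything here is an assumption made for illustration: the function names `progressive_weights` and `combine_rewards`, the linear ramp, the `ramp_frac` parameter, and the weight-normalized sum are all hypothetical and are not taken from HPRO itself.

```python
from typing import Dict

# Granularity levels named in the summary, ordered from finest to coarsest.
LEVELS = ("frame", "word", "sentence")


def progressive_weights(step: int, total_steps: int, ramp_frac: float = 0.1) -> Dict[str, float]:
    """Return a weight in [0, 1] per level (hypothetical linear schedule).

    The finest level is active from the start; each coarser level begins
    ramping in at an evenly spaced fraction of training and reaches full
    weight after `ramp_frac` of the total steps.
    """
    if total_steps <= 0 or ramp_frac <= 0:
        raise ValueError("total_steps and ramp_frac must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    weights = {}
    for i, level in enumerate(LEVELS):
        if i == 0:
            weights[level] = 1.0  # frame-level objective is always on
            continue
        start = i / len(LEVELS)  # fraction of training where this level starts
        raw = (progress - start) / ramp_frac
        weights[level] = min(max(raw, 0.0), 1.0)
    return weights


def combine_rewards(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Blend per-level rewards into one scalar, normalized by total active weight."""
    total_weight = sum(weights[level] for level in LEVELS)
    if total_weight == 0.0:
        return 0.0
    weighted = sum(weights[level] * rewards[level] for level in LEVELS)
    return weighted / total_weight


if __name__ == "__main__":
    # Placeholder reward values, not measurements from any experiment.
    example_rewards = {"frame": 0.2, "word": 0.4, "sentence": 0.9}
    for step in (0, 50, 100):
        w = progressive_weights(step, total_steps=100)
        print(step, w, round(combine_rewards(example_rewards, w), 3))
```

With this schedule, early training is driven only by the dense frame-level signal, and the sparser word- and sentence-level signals are added once the finer level has had time to shape the output. Other schedules (for example, decaying the finer levels as coarser ones come in) would fit the same description equally well.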