The authors propose HPRO, a hierarchical progressive reward optimization framework that enhances emotional expressiveness in LLM-based text-to-speech (TTS) models while preserving linguistic intelligibility. It addresses two structural mismatches in existing preference-driven methods: it isolates content from emotion, and it bridges the gap between sparse rewards and dense generation.

  • Introduces the HD-Emo codec, a differentiable reward model that extracts distinct content and style preference tokens to resolve the information conflict between the two.
  • Structurally isolates emotional optimization from semantic content to prevent reward hacking and semantic degradation.
  • Bridges the scale gap by progressively aligning frame-, word-, and sentence-level objectives.
  • Reports experiments showing significantly improved emotional expressiveness without compromising linguistic intelligibility.
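
The summary does not give HPRO's actual objective, so the following is only a minimal sketch of one way frame-, word-, and sentence-level signals could be combined progressively. Everything in it is an assumption made for illustration: the function names, mean pooling as the aggregation, the word spans as (start, end) frame indices, and the linear schedule that shifts weight from frame to word to sentence level as training progresses. It is not the authors' formulation.

```python
from statistics import fmean


def pool_rewards(frame_rewards, word_spans):
    """Mean-pool per-frame rewards into word-level and sentence-level rewards.

    word_spans: list of (start, end) frame indices, end exclusive, assumed to
    cover the frames contiguously. Mean pooling is an illustrative choice.
    """
    word_rewards = [fmean(frame_rewards[s:e]) for s, e in word_spans]
    sentence_reward = fmean(word_rewards)  # every word weighted equally
    return word_rewards, sentence_reward


def level_weights(progress):
    """Hypothetical schedule over training progress in [0, 1].

    Weight starts on the frame level, peaks on the word level at 0.5,
    and ends on the sentence level. The three weights always sum to 1.
    """
    p = min(max(progress, 0.0), 1.0)
    frame_w = max(0.0, 1.0 - 2.0 * p)
    sentence_w = max(0.0, 2.0 * p - 1.0)
    word_w = 1.0 - frame_w - sentence_w
    return frame_w, word_w, sentence_w


def shaped_frame_rewards(frame_rewards, word_spans, progress):
    """Return one dense reward per frame, blending the three levels.

    Each frame receives its own reward, its word's pooled reward, and the
    sentence's pooled reward, mixed by the current schedule weights.
    """
    word_rewards, sentence_reward = pool_rewards(frame_rewards, word_spans)
    frame_w, word_w, sentence_w = level_weights(progress)
    shaped = []
    for (start, end), word_reward in zip(word_spans, word_rewards):
        for r in frame_rewards[start:end]:
            shaped.append(
                frame_w * r + word_w * word_reward + sentence_w * sentence_reward
            )
    return shaped


if __name__ == "__main__":
    rewards = [0.0, 1.0, 0.5, 1.0]   # made-up per-frame rewards
    spans = [(0, 2), (2, 4)]         # two words, two frames each
    for progress in (0.0, 0.5, 1.0):
        print(progress, shaped_frame_rewards(rewards, spans, progress))
```

Under this sketch, early training sees the raw per-frame signal, while late training sees a single sentence-level value broadcast to every frame, which is one simple way to relate a sparse utterance-level reward to dense frame-level generation.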

Standard supervised fine-tuning often yields statistically averaged prosody. HPRO targets this limitation by optimizing emotional expressiveness directly through hierarchical rewards, while keeping semantic content isolated from the emotional objective.