A new method adapts DAAM (Diffusion Attentive Attribution Maps) to speech diffusion models in order to analyze how style captions influence text-to-speech (TTS) waveforms. It finds that style tokens have lower temporal variance than content tokens, that attention on style tokens correlates with pitch and energy, and that style conditioning peaks in early layers, where attention entropy is lowest and attention is therefore most selective.
Cross-Attention Attribution for Style-Captioned Text-to-Speech
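The two statistics named in the summary, temporal variance per token and attention entropy per layer, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tensor layout `(layers, heads, frames, tokens)`, the function names, and the plain summation over layers and heads are assumptions made for illustration. The original DAAM also aggregates over diffusion timesteps and rescales maps to a common resolution, which is omitted here.

```python
import numpy as np


def token_heatmaps(attn):
    """Aggregate cross-attention into one temporal map per text token.

    attn: array of shape (layers, heads, frames, tokens); each row over
          `tokens` is assumed to be a softmax distribution.
    Returns an array of shape (tokens, frames).
    """
    # Sum over layers and heads (a simplified DAAM-style aggregation).
    return attn.sum(axis=(0, 1)).T


def temporal_variance(heatmaps):
    """Variance of each token's attribution across audio frames."""
    return heatmaps.var(axis=1)


def layer_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of the attention rows, per layer.

    Lower entropy means attention is concentrated on fewer tokens.
    """
    p = attn.mean(axis=1)                      # average heads: (layers, frames, tokens)
    h = -(p * np.log(p + eps)).sum(axis=-1)    # entropy per frame: (layers, frames)
    return h.mean(axis=-1)                     # average frames: (layers,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic attention: 4 layers, 2 heads, 50 frames, 6 caption tokens.
    logits = rng.normal(size=(4, 2, 50, 6))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    maps = token_heatmaps(attn)
    print("per-token temporal variance:", temporal_variance(maps))
    print("per-layer entropy:", layer_entropy(attn))
```

Under these assumptions, a token whose variance is low receives roughly constant attention over time, which is the behavior the summary attributes to style tokens, and the layer with the smallest entropy value is the most selective one.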