How to Leverage Synthetic Speech for LLM-Based ASR Systems?

Researchers investigate the distributional gap between synthetic and real speech in LLM-based automatic speech recognition (ASR) systems by probing a SLAM-ASR architecture. They find that the discriminative signals separating the two data types are concentrated in the early-to-middle layers of the model backbone.
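
One way to picture this kind of layer-wise probing is the minimal sketch below. It is not the authors' probing setup, which this summary does not specify: it assumes a simple logistic-regression probe trained on pooled per-layer features, and it substitutes random toy features for real hidden states. The names `fit_probe`, `layer_separability`, and `toy_layer`, and the `shift` parameter that controls how far apart the two classes sit, are all hypothetical.

```python
import numpy as np

def fit_probe(x, y, steps=300, lr=0.5):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(synthetic)
        err = p - y
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def layer_separability(feats, y):
    """Held-out probe accuracy for one layer's pooled features.

    feats: (n_utterances, dim) array, y: 0 = real, 1 = synthetic.
    Even rows are used for training, odd rows for evaluation.
    """
    train, test = slice(0, None, 2), slice(1, None, 2)
    mean = feats[train].mean(axis=0)
    std = feats[train].std(axis=0) + 1e-8
    x_train = (feats[train] - mean) / std
    x_test = (feats[test] - mean) / std
    w, b = fit_probe(x_train, y[train])
    pred = (x_test @ w + b) > 0
    return float(np.mean(pred == (y[test] > 0.5)))

def toy_layer(shift, n=200, dim=8, seed=0):
    """Toy stand-in for one layer: synthetic features are offset by `shift`.

    Assumption for illustration only; real features would be pooled
    hidden states extracted from each backbone layer.
    """
    rng = np.random.default_rng(seed)
    real = rng.standard_normal((n, dim))
    synth = rng.standard_normal((n, dim)) + shift
    feats = np.concatenate([real, synth])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    order = rng.permutation(2 * n)
    return feats[order], y[order]

if __name__ == "__main__":
    for shift in (0.0, 2.0):
        feats, y = toy_layer(shift)
        print(f"shift={shift}: probe accuracy {layer_separability(feats, y):.2f}")
```

Running the same probe on every layer and comparing accuracies would show where the real-versus-synthetic signal is easiest to read out. Accuracy near chance means a layer carries little of that signal.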

The study finds that representation-level separability does not directly predict downstream ASR gains.
Convolving synthetic audio with room impulse responses (RIRs) narrows the data gap by reproducing acoustic irregularities rather than improving naturalness.
A training procedure combining layer selection and RIR augmentation matches a baseline trained entirely on real data while using only 25% of the real speech (13.6 h).
The same procedure surpasses the baseline at every higher proportion of real data.
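
The RIR step mentioned above amounts to convolving each synthetic waveform with an impulse response. The sketch below shows that operation with NumPy under stated assumptions: the impulse response is a toy one (exponentially decaying noise with a hypothetical `rt60` decay time) rather than a measured room response, and the helper names `synthetic_rir` and `apply_rir` are illustrative, not taken from the study.

```python
import numpy as np

def synthetic_rir(sample_rate=16000, rt60=0.4, length_s=0.5, seed=0):
    """Toy RIR: white noise with an exponential decay envelope.

    Assumption for illustration; a real pipeline would load measured
    or simulated room impulse responses instead.
    """
    rng = np.random.default_rng(seed)
    n = int(sample_rate * length_s)
    t = np.arange(n) / sample_rate
    # Amplitude drops by 60 dB (a factor of 10^-3) after rt60 seconds.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct-path component
    return rir / np.max(np.abs(rir))

def apply_rir(wave, rir):
    """Convolve a waveform with an RIR, keeping length and peak level."""
    wet = np.convolve(wave, rir, mode="full")[: len(wave)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(wave)) / peak)
    return wet

if __name__ == "__main__":
    sr = 16000
    # One second of a 440 Hz tone stands in for a synthetic utterance.
    clean = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    reverberant = apply_rir(clean, synthetic_rir(sr))
    print(clean.shape, reverberant.shape)
```

Trimming to the input length and rescaling to the original peak are design choices made here to keep the augmented audio aligned with its transcript and at a comparable level. The summary does not say how the study handled either.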

These findings indicate that synthetic speech can stand in for most genuine recordings in privacy-sensitive domains when layer selection and RIR augmentation are applied.