Researchers investigate the distributional gap between synthetic and real speech in LLM-based automatic speech recognition (ASR) systems by probing a SLAM-ASR architecture. They find that the discriminative signals separating the two data types are concentrated in the early-to-middle layers of the model backbone.
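
To make the probing idea concrete, here is a minimal sketch of a per-layer linear probe, assuming utterance-level features (for example, mean-pooled hidden states) have already been extracted for each layer. The function name `probe_accuracy`, the logistic-regression probe, and its hyperparameters are illustrative assumptions, not the study's actual procedure.

```python
import numpy as np

def probe_accuracy(features, labels, epochs=200, lr=0.1, seed=0):
    """Fit a logistic-regression probe on half the data; return held-out accuracy.

    features: (n_utterances, dim) array for ONE layer (hypothetical input).
    labels:   (n_utterances,) array, 1 = synthetic, 0 = real.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    split = len(labels) // 2
    train, test = idx[:split], idx[split:]

    # Standardize using training statistics only, to avoid leaking test data.
    mean = features[train].mean(axis=0)
    std = features[train].std(axis=0) + 1e-8
    x_train = (features[train] - mean) / std
    x_test = (features[test] - mean) / std

    # Plain batch gradient descent on the logistic loss.
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x_train @ w + b)))
        grad = p - labels[train]
        w -= lr * (x_train.T @ grad) / len(train)
        b -= lr * grad.mean()

    pred = (x_test @ w + b) > 0
    return float((pred == labels[test].astype(bool)).mean())
```

Running this once per layer and comparing the held-out accuracies would indicate where the synthetic-versus-real signal is most linearly separable. As the findings below note, high separability at a layer is not by itself a predictor of ASR gains.
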
- The study finds that representation-level separability does not directly predict downstream ASR gains.
- Convolving synthetic audio with room impulse responses (RIRs) narrows the data gap by reproducing acoustic irregularities rather than improving naturalness.
- A training procedure combining layer selection and RIR augmentation matches a fully real-data baseline while using only 25% of the real speech (13.6 h).
- The same procedure surpasses the baseline at every higher proportion of real data.
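
The RIR augmentation described above amounts to convolving each synthetic waveform with an impulse response. The sketch below illustrates this under stated assumptions: the toy RIR is exponentially decaying noise rather than a measured or simulated room response, and the peak-level matching step is a convenience added here, not something the study specifies. The names `synthetic_rir` and `apply_rir` are hypothetical.

```python
import numpy as np

def synthetic_rir(sample_rate=16000, rt60=0.4, length_s=0.5, seed=0):
    """Toy RIR: white noise with an exponential decay (stand-in for a real RIR)."""
    rng = np.random.default_rng(seed)
    n = int(sample_rate * length_s)
    t = np.arange(n) / sample_rate
    # Amplitude envelope that has dropped by 60 dB at t = rt60.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))

def apply_rir(speech, rir):
    """Convolve speech with an RIR, trim to the input length, match peak level."""
    wet = np.convolve(speech, rir, mode="full")[: len(speech)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        # Assumption: rescale so the reverberant signal keeps the original peak.
        wet = wet * (np.max(np.abs(speech)) / peak)
    return wet

# Usage: a 1-second 220 Hz tone standing in for a synthetic utterance.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
reverberant = apply_rir(tone, synthetic_rir(sample_rate=sr))
```

In practice the toy response would be swapped for recorded or simulated RIRs, typically sampled at random per utterance so that the training data covers a range of acoustic conditions.
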
These findings suggest that synthetic speech can stand in for most genuine recordings in privacy-sensitive domains, provided that layer selection and RIR augmentation are applied.