Translation-Enhanced Speech Encoder Pre-training Improves Speech LLMs

Connecting a pre-trained speech encoder to a Large Language Model (LLM) creates a structural misalignment: encoders often produce language-specific representations, while LLMs operate in a unified, language-agnostic space. The authors argue that adding speech translation objectives to encoder pre-training provides a principled way to bridge this gap. Unlike monolingual transcription, translation forces the model to learn representations that do not depend on any one language. The study tests this experimentally by adding translation objectives during speech encoder pre-training. The results show that the approach significantly improves cross-modal integration between speech and text. Consequently, models that use translation-enhanced pre-training achieve better performance across a range of downstream Speech LLM tasks.
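
To make the idea of adding a translation objective concrete, the following is a minimal sketch of one common way to combine a transcription loss with a translation loss during encoder pre-training. The summary above does not say how the authors combine the objectives, so the weighted sum, the `st_weight` parameter, and all function names here are illustrative assumptions rather than the paper's method.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), for one token."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def sequence_loss(step_logits, targets):
    """Mean per-token cross-entropy over one target sequence."""
    total = sum(cross_entropy(l, t) for l, t in zip(step_logits, targets))
    return total / len(targets)

def multitask_loss(asr_logits, transcript, st_logits, translation, st_weight=0.5):
    """Hypothetical combined objective: weighted sum of a transcription (ASR)
    loss and a speech translation (ST) loss computed on the same utterance.
    st_weight is an assumed hyperparameter, not a value from the paper."""
    asr = sequence_loss(asr_logits, transcript)
    st = sequence_loss(st_logits, translation)
    return (1.0 - st_weight) * asr + st_weight * st
```

In this sketch, setting `st_weight=0.0` recovers transcription-only pre-training, which is the baseline the translation objective is contrasted with. In a real system both sets of logits would come from decoders that share the speech encoder, so gradients from the translation term also shape the encoder's representations.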