SpeechCombine: instruction-following speech language model without instruction tuning

Researchers propose SpeechCombine, an instruction-following speech language model built without instruction tuning: it adds the weight difference between a text LLM's instruction-tuned and base versions to speech-adapted weights.

The method uses only a single round of speech pre-training on 30k hours of data.
It starts from a text LLM base model and continues pre-training it on speech utterances.
The approach then directly combines the resulting speech-adapted weights with the difference between the instruction-tuned and base versions of the text LLM.
Results show the strategy preserves the original text LLM's knowledge while effectively transferring its instruction-following capabilities to the speech domain.
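
The combination step described above can be sketched as simple per-parameter arithmetic: merged = speech-adapted + (instruction-tuned - base). The following is a minimal illustration, not the authors' implementation. The function name `merge_weights`, the `scale` factor, the toy parameter names, and the choice to copy speech-only parameters unchanged are all assumptions made for the example. Weights are represented as plain lists of floats so the snippet runs without any ML library.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(
    speech: Weights,
    instruct: Weights,
    base: Weights,
    scale: float = 1.0,  # assumption: 1.0 means the delta is added as-is
) -> Weights:
    """Return speech + scale * (instruct - base), parameter by parameter."""
    merged: Weights = {}
    for name, speech_vals in speech.items():
        if name not in instruct or name not in base:
            # Assumption: parameters that exist only in the speech model
            # have no text-side delta, so they are copied unchanged.
            merged[name] = list(speech_vals)
            continue
        if not (len(speech_vals) == len(instruct[name]) == len(base[name])):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            s + scale * (i - b)
            for s, i, b in zip(speech_vals, instruct[name], base[name])
        ]
    return merged


# Toy example with made-up numbers.
base = {"w": [1.0, 2.0]}
instruct = {"w": [1.5, 1.0]}                        # instruction-tuned text LLM
speech = {"w": [2.0, 2.0], "speech_head": [0.25]}   # speech-adapted model
merged = merge_weights(speech, instruct, base)
print(merged)  # {'w': [2.5, 1.0], 'speech_head': [0.25]}
```

With real checkpoints the same arithmetic would be applied tensor by tensor over the models' parameter dictionaries, which requires the three models to share an architecture and parameter names.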

This finding suggests a new direction for training speech language models that avoids reliance on massive amounts of speech data.