Researchers propose SpeechCombine, an instruction-following speech language model (SLM) trained without instruction tuning: it combines speech-adapted weights with the weight difference between the instruction-tuned and base versions of a text LLM.
- The method uses only a single round of speech pre-training on 30k hours of data.
- It starts from a text LLM base model and performs continual pre-training on speech utterances.
- The approach directly combines speech-adapted weights with the difference between instruction-tuned and base text LLM versions.
- Results show the strategy preserves the original text LLM's knowledge while effectively transferring its capabilities to the speech domain.
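
The weight combination described in the bullets above can be illustrated with a minimal sketch. It rests on several assumptions not stated in the summary: that "combining" means element-wise addition of the instruction-tuned-minus-base difference to the speech-adapted weights, that all three checkpoints share identical parameter names and shapes, and that a `scale` factor of 1.0 corresponds to the "direct" combination. The function name `combine_weights` and the toy checkpoints are hypothetical, and plain Python lists stand in for real tensors.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def combine_weights(base: Weights, instruct: Weights, speech: Weights,
                    scale: float = 1.0) -> Weights:
    """Return speech + scale * (instruct - base), parameter by parameter.

    Assumption: the combination is a simple element-wise addition of the
    text-side weight difference; `scale` is a hypothetical knob.
    """
    if not (base.keys() == instruct.keys() == speech.keys()):
        raise ValueError("all three checkpoints must share the same parameter names")
    merged: Weights = {}
    for name in base:
        b, i, s = base[name], instruct[name], speech[name]
        if not (len(b) == len(i) == len(s)):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Add the instruction-tuning delta to the speech-adapted value.
        merged[name] = [sv + scale * (iv - bv) for bv, iv, sv in zip(b, i, s)]
    return merged


# Hypothetical toy checkpoints with two parameters each.
base = {"w": [1.0, 2.0], "b": [0.0]}
instruct = {"w": [1.5, 2.0], "b": [0.25]}
speech = {"w": [1.0, 3.0], "b": [-0.5]}

merged = combine_weights(base, instruct, speech)
print(merged)  # {'w': [1.5, 3.0], 'b': [-0.25]}
```

In a real setting the speech-adapted model may contain parameters the text checkpoints lack (for example, an extended vocabulary), which would need separate handling that this sketch does not attempt.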
This finding suggests a new direction for SLM training that avoids reliance on massive speech data.