Researchers propose SpeechCombine, an instruction-following speech language model (SLM) trained without instruction tuning: speech-adapted weights are combined with the weight difference between the instruction-tuned and base versions of a text LLM.

  • The method uses only a single round of speech pre-training on 30k hours of data.
  • It starts from a text LLM base model and performs continued pre-training on speech utterances.
  • The approach directly combines speech-adapted weights with the difference between instruction-tuned and base text LLM versions.
  • Results show the strategy preserves original text LLM knowledge while effectively transferring capabilities to the speech domain.
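
The weight combination described in the bullets above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `combine_weights`, the scaling factor `alpha`, and the rule of keeping speech-only parameters unchanged are all hypothetical, and weights are represented as plain lists of floats rather than real model tensors.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def combine_weights(
    speech: Weights,
    base: Weights,
    instruct: Weights,
    alpha: float = 1.0,  # hypothetical scaling factor; not taken from the source
) -> Weights:
    """Return speech + alpha * (instruct - base), parameter by parameter."""
    merged: Weights = {}
    for name, speech_param in speech.items():
        if name in base and name in instruct:
            base_param, instruct_param = base[name], instruct[name]
            if not (len(speech_param) == len(base_param) == len(instruct_param)):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            # Add the instruction-tuning delta to the speech-adapted weight.
            merged[name] = [
                s + alpha * (i - b)
                for s, b, i in zip(speech_param, base_param, instruct_param)
            ]
        else:
            # Assumption: parameters that exist only in the speech model
            # (e.g. newly added speech-token embeddings) are kept as-is.
            merged[name] = list(speech_param)
    return merged


# Toy example with made-up values.
base = {"layer.weight": [1.0, 2.0]}
instruct = {"layer.weight": [1.5, 1.0]}
speech = {"layer.weight": [3.0, 4.0], "speech_embed": [0.25]}

merged = combine_weights(speech, base, instruct)
print(merged)  # {'layer.weight': [3.5, 3.0], 'speech_embed': [0.25]}
```

With real checkpoints the same arithmetic would be applied to each tensor in the models' parameter dictionaries; how mismatched shapes (for example an extended vocabulary) are handled is a design choice the summary does not specify.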

This finding suggests a new direction for SLM training that avoids reliance on massive speech data.