Interleaved speech-language models pass through an implicit transcription phase in which spoken words become decodable as text tokens in intermediate layers, even though the models receive no speech recognition training. For up to 77% of the data, the spoken word appears as a top candidate text prediction, followed by a text continuation and then a return to speech. This behavior is driven by interleaving data and text LM initialization, and it correlates with spoken knowledge performance.
Speech-Text Models Latently Transcribe Speech in Intermediate Layers
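The abstract's notion of a spoken word being "decodable as text tokens in intermediate layers" can be illustrated with a small sketch. It assumes a logit-lens style readout, in which each layer's hidden state is projected onto the vocabulary with the output embedding matrix; the source does not say this is the exact procedure used. The function names, the toy matrices, and the token ids are all hypothetical, and real analyses usually also apply the model's final normalization, which is omitted here.

```python
from typing import List, Sequence


def top_k_tokens(hidden: Sequence[float],
                 unembed: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """Project one hidden state onto the vocabulary; return the k best token ids."""
    # One logit per vocabulary row: dot product of the hidden state with that row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


def layers_decoding_target(layer_states: Sequence[Sequence[float]],
                           unembed: Sequence[Sequence[float]],
                           target_id: int,
                           k: int = 1) -> List[int]:
    """Return the indices of layers whose top-k decoded tokens include target_id."""
    return [i for i, h in enumerate(layer_states)
            if target_id in top_k_tokens(h, unembed, k)]


# Toy data (hypothetical): 3-token vocabulary, 2-dimensional hidden states, 4 layers.
UNEMBED = [
    [1.0, 0.0],    # token 0
    [0.0, 1.0],    # token 1: stands in for the text token of the spoken word
    [-1.0, -1.0],  # token 2
]
LAYER_STATES = [
    [2.0, 0.1],    # layer 0: closest to token 0
    [0.2, 1.5],    # layer 1: closest to token 1
    [0.1, 3.0],    # layer 2: closest to token 1
    [-1.0, -2.0],  # layer 3: closest to token 2
]

hits = layers_decoding_target(LAYER_STATES, UNEMBED, target_id=1, k=1)
print(hits)  # [1, 2]
```

In this toy setup the target token is the top prediction only in the middle layers, mirroring the described pattern of a transcription phase that appears and then gives way to other predictions. Averaging such hits over many spoken words would give a rate comparable in spirit to the reported percentage, under the assumptions stated above.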