NAVER LABS Europe submits a system to the instruction-following speech processing short track at IWSLT 2026, tying for first place in the overall ranking. The system jointly performs automatic speech recognition (ASR), speech translation (ST), and spoken question answering (SQA) from English speech into Chinese, Italian, and German.

  • Replaces the previous speech projector with SpeechMapper, which learns a speech-to-LLM embedding projector using only ASR data.
  • Introduces fakACL, a synthetic SQA dataset composed of artificially generated scientific presentations built by prompting an LLM backbone and synthesizing speech with SeamlessM4T-large-v2.
  • Combines the improved speech projection with domain-specific synthetic data, which allows the model to outperform last year's best system while being more compact and relying on a weaker LLM backbone.

The authors consider this significant because their updated multi-stage training pipeline outperforms last year's best system with a more compact model and a weaker LLM backbone.