This paper details FBK's submission to the IWSLT 2026 Instruction Following shared task, presenting SpeechLLMs designed for both short-form and long-form speech instruction following under constrained settings.

  • The model achieved a SIFS score of 2.0708 on the MCIF benchmark in the short track.
  • Three speech segmentation methods were explored for the long track to address unstable generation.
  • A new HIFS score was introduced to evaluate long-form performance more robustly.
  • Fixed 30-second segmentation yielded the highest HIFS score of 2.0663.
  • Hallucinations in long-form outputs manifest primarily as repetitive insertions and affect the ASR and SSUM tasks.
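
As a rough illustration of the fixed 30-second segmentation mentioned above, the following minimal sketch cuts a recording into consecutive, non-overlapping windows. The function name `fixed_length_segments`, the 16 kHz sample rate, and the choice to keep a shorter final segment are all assumptions made for this example; the summary does not say how the submission handles these details.

```python
from typing import List, Tuple


def fixed_length_segments(
    num_samples: int,
    sample_rate: int = 16000,      # assumed sample rate, not stated in the source
    window_seconds: float = 30.0,  # fixed window length reported in the summary
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-length windows.

    The last window is shorter when the audio length is not a multiple of
    the window size (an assumption; padding is another possible choice).
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    window = int(round(window_seconds * sample_rate))
    if window <= 0:
        raise ValueError("window must contain at least one sample")

    segments = []
    for start in range(0, num_samples, window):
        end = min(start + window, num_samples)
        segments.append((start, end))
    return segments


# Hypothetical usage: a 75-second recording at 16 kHz
boundaries = fixed_length_segments(75 * 16000)
```

Each segment would then be passed to the model independently, with the per-segment outputs combined afterwards. How the outputs are merged depends on the task and is not covered by this sketch.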

The study shows that although extending the model to long-form input introduces specific hallucination challenges, its short-form capabilities are largely retained.