FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following

This paper details FBK's submission to the IWSLT 2026 Instruction Following shared task, presenting SpeechLLMs designed for both short-form and long-form speech instruction following under constrained settings.

The model achieved a SIFS score of 2.0708 on the MCIF benchmark in the short track.
Three speech segmentation methods were explored for the long track to address unstable generation.
A new HIFS score was introduced to evaluate long-form performance more robustly.
Fixed 30-second segmentation yielded the highest HIFS score of 2.0663.
Hallucinations in long-form outputs manifest primarily as repetitive insertions, which affect the ASR and SSUM tasks.
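
To make the fixed 30-second segmentation mentioned above concrete, the following is a minimal sketch and not the authors' implementation. It assumes 16 kHz mono audio held as a sequence of samples, a hypothetical `process_segment` callable standing in for the SpeechLLM, and simple space-joining of per-segment outputs; the summary does not specify any of these details.

```python
from typing import Callable, List, Sequence, Tuple


def fixed_segments(
    num_samples: int,
    sample_rate: int = 16000,        # assumption: 16 kHz audio
    segment_seconds: float = 30.0,   # window length named in the summary
) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of consecutive fixed-length windows."""
    if num_samples < 0 or sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("arguments must be non-negative / positive")
    window = int(round(segment_seconds * sample_rate))
    if window < 1:
        raise ValueError("segment is shorter than one sample")
    # The last window is truncated to the end of the recording.
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


def process_long_audio(
    audio: Sequence[float],
    process_segment: Callable[[Sequence[float]], str],  # hypothetical model call
    sample_rate: int = 16000,
    segment_seconds: float = 30.0,
) -> str:
    """Run the model on each fixed window and join the non-empty outputs."""
    outputs = [
        process_segment(audio[start:end])
        for start, end in fixed_segments(len(audio), sample_rate, segment_seconds)
    ]
    # Assumption: per-segment outputs are concatenated with a single space.
    return " ".join(text.strip() for text in outputs if text.strip())
```

Under these assumptions, a 70-second recording is split into two full 30-second windows and one 10-second remainder. Plain concatenation is most natural for transcription-style output; a task such as summarization would likely need a different merging step, which this sketch does not cover.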

The study demonstrates that while long-form extension introduces specific hallucination challenges, the model's short-form capabilities are largely retained.