This paper details FBK's submission to the IWSLT 2026 Instruction Following shared task, presenting SpeechLLMs designed for both short-form and long-form speech instruction following under constrained settings.
- In the short track, the submitted model achieved a SIFS score of 2.0708 on the MCIF benchmark.
- Three speech segmentation methods were explored for the long track to address unstable generation.
- A new HIFS score was introduced to evaluate long-form performance more robustly.
- Fixed 30-second segmentation yielded the highest HIFS score of 2.0663.
- Hallucinations in long-form outputs primarily manifest as repetitive insertions, affecting ASR and SSUM tasks.
The study demonstrates that while long-form extension introduces specific hallucination challenges, the model's short-form capabilities are largely retained.
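
As an illustration of the fixed 30-second segmentation mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes the audio is already loaded as a flat sequence of samples; the function name `segment_fixed`, its parameters, and the 16 kHz sample rate in the usage note are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, Sequence[float]]


def segment_fixed(
    samples: Sequence[float],
    sample_rate: int,
    segment_seconds: float = 30.0,
) -> List[Segment]:
    """Split audio into consecutive, non-overlapping fixed-length windows.

    Returns (start_time_s, end_time_s, chunk) tuples. The last chunk may be
    shorter than `segment_seconds`. Hypothetical helper for illustration.
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")

    # Window length in samples; must cover at least one sample.
    window = int(round(sample_rate * segment_seconds))
    if window < 1:
        raise ValueError("segment is shorter than one sample")

    segments: List[Segment] = []
    for start in range(0, len(samples), window):
        end = min(start + window, len(samples))
        segments.append((start / sample_rate, end / sample_rate, samples[start:end]))
    return segments
```

Under these assumptions, 75 seconds of audio at 16 kHz would be split into windows covering 0 to 30 s, 30 to 60 s, and 60 to 75 s. Each window could then be passed to the model separately, with the per-segment outputs combined afterwards; how the submission actually combines them is not stated in this summary.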