Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Researchers introduce DramaSR-532K, a large-scale benchmark of 532K annotated dialogue lines spanning more than 900 characters, and propose DramaSR-LRM to improve speaker recognition in long-form TV dramas.

The DramaSR-532K benchmark integrates auditory, linguistic, and visual cues for complex character attribution.
DramaSR-LRM uses a large reasoning model (LRM) with multimodal tool use to autonomously aggregate contextual evidence.
The approach significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are unreliable.
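
To make the idea of aggregating evidence across modalities concrete, here is a minimal sketch. It is not the DramaSR-LRM method: it swaps the reasoning model and its tool calls for a simple fixed-weight score fusion, only to illustrate why contextual cues can matter more when an utterance is too short for reliable voice matching. The `Evidence` type, the `attribute_speaker` function, the one-second threshold, and all weights are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    modality: str   # "audio", "text", or "visual" (hypothetical labels)
    candidate: str  # character proposed by this cue
    score: float    # confidence in [0, 1] (hypothetical scale)


def attribute_speaker(evidence, duration_s, short_threshold_s=1.0):
    """Return the candidate with the highest weighted evidence total.

    Assumption: acoustic evidence is down-weighted for utterances shorter
    than `short_threshold_s`. The threshold and the weights are
    illustrative values, not taken from the paper.
    """
    audio_weight = 0.2 if duration_s < short_threshold_s else 1.0
    weights = {"audio": audio_weight, "text": 1.0, "visual": 1.0}

    totals = defaultdict(float)
    for item in evidence:
        # Unknown modalities contribute nothing.
        totals[item.candidate] += weights.get(item.modality, 0.0) * item.score

    if not totals:
        return None
    return max(totals, key=totals.get)
```

With cues such as `Evidence("audio", "A", 0.9)`, `Evidence("text", "B", 0.4)`, and `Evidence("visual", "B", 0.3)`, the sketch picks `"A"` for a 3-second utterance and `"B"` for a 0.5-second one, because the voice match is trusted less on the short clip.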

This work advances comprehensive video understanding by enabling high-fidelity speaker attribution in challenging long-form content.