Researchers introduce DramaSR-532K, a large-scale benchmark of 532K annotated dialogue lines spoken by more than 900 characters, and propose DramaSR-LRM, an approach for improving speaker recognition in long-form TV dramas.
- The DramaSR-532K benchmark integrates auditory, linguistic, and visual cues for complex character attribution.
- DramaSR-LRM uses a large reasoning model (LRM) with multimodal tool use to autonomously aggregate contextual evidence.
- The approach significantly outperforms existing baselines, particularly on short utterances, where acoustic biometrics are unreliable.
This work advances comprehensive video understanding by enabling high-fidelity speaker attribution in challenging long-form content.
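
To make the short-utterance point concrete, here is a toy sketch of why contextual cues help when a voice match is unreliable. It is not DramaSR-LRM: the summary does not describe the model's internals, so the `CueScores` type, the linear `acoustic_weight` rule, the 3-second threshold, and all scores below are hypothetical. It only shows the general idea of trusting acoustic evidence less on very short lines and leaning on linguistic and visual evidence instead.

```python
from dataclasses import dataclass


@dataclass
class CueScores:
    """Hypothetical per-character scores in [0, 1] from each modality."""
    acoustic: dict[str, float]    # e.g. voice-embedding similarity
    linguistic: dict[str, float]  # e.g. fit with the dialogue context
    visual: dict[str, float]      # e.g. who is on screen and speaking


def acoustic_weight(duration_s: float, full_trust_s: float = 3.0) -> float:
    """Assumed reliability of the voice match: linear in duration, capped at 1."""
    return max(0.0, min(1.0, duration_s / full_trust_s))


def attribute_speaker(cues: CueScores, duration_s: float) -> str:
    """Return the character with the highest fused score for one utterance."""
    w = acoustic_weight(duration_s)
    characters = set(cues.acoustic) | set(cues.linguistic) | set(cues.visual)

    def fused(name: str) -> float:
        # Down-weight the acoustic score on short lines; keep the other cues as-is.
        return (w * cues.acoustic.get(name, 0.0)
                + cues.linguistic.get(name, 0.0)
                + cues.visual.get(name, 0.0))

    # Sort first so ties resolve deterministically (alphabetically).
    return max(sorted(characters), key=fused)


cues = CueScores(
    acoustic={"Alice": 0.1, "Bob": 0.9},
    linguistic={"Alice": 0.6, "Bob": 0.4},
    visual={"Alice": 0.7, "Bob": 0.3},
)
short_line = attribute_speaker(cues, duration_s=0.3)  # acoustic cue nearly ignored
long_line = attribute_speaker(cues, duration_s=3.0)   # acoustic cue fully trusted
```

With these made-up scores, the same evidence yields different attributions depending on utterance length: the context-backed character wins on the 0.3-second line, while the voice match decides the 3-second line.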