Researchers introduce DramaSR-532K, a large-scale benchmark of 532K annotated dialogue lines spanning more than 900 characters, and propose DramaSR-LRM to improve speaker recognition in long-form TV dramas.

  • The DramaSR-532K benchmark integrates auditory, linguistic, and visual cues for complex character attribution.
  • DramaSR-LRM uses a large reasoning model (LRM) with multimodal tool use to autonomously aggregate contextual evidence.
  • The approach significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are unreliable.
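
The idea of combining cues when acoustic evidence is weak can be illustrated with a minimal sketch. This is not the DramaSR-LRM method, which relies on a reasoning model calling tools; it is a simple weighted score fusion used only to show why short utterances benefit from linguistic and visual context. The function name `attribute_speaker`, the per-character score dictionaries, the 2-second threshold, and the 0.2 down-weighting factor are all hypothetical.

```python
from typing import Dict

# Per-character confidence scores in [0, 1] for one cue (hypothetical format).
Scores = Dict[str, float]


def attribute_speaker(
    audio: Scores,
    text: Scores,
    visual: Scores,
    duration_s: float,
    short_threshold_s: float = 2.0,  # assumed cutoff for a "short" utterance
) -> str:
    """Pick the character with the highest fused score for one dialogue line."""
    candidates = set(audio) | set(text) | set(visual)
    if not candidates:
        raise ValueError("no candidate characters were scored")

    # Assumption: voice evidence is less reliable on short utterances,
    # so its contribution is down-weighted below the threshold.
    audio_weight = 0.2 if duration_s < short_threshold_s else 1.0

    combined = {
        name: audio_weight * audio.get(name, 0.0)
        + text.get(name, 0.0)
        + visual.get(name, 0.0)
        for name in candidates
    }
    # Sorting first makes ties resolve alphabetically, so results are deterministic.
    return max(sorted(combined), key=combined.get)


audio = {"A": 0.9, "B": 0.1}   # voice match favors character A
text = {"A": 0.2, "B": 0.7}    # dialogue context favors character B
visual = {"B": 0.1}            # on-screen evidence weakly favors B

short_pick = attribute_speaker(audio, text, visual, duration_s=1.0)
long_pick = attribute_speaker(audio, text, visual, duration_s=5.0)
```

With these made-up scores, the one-second line is attributed to B because the voice match carries little weight, while the five-second line is attributed to A because the voice match dominates.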

This work advances comprehensive video understanding by enabling high-fidelity speaker attribution in challenging long-form content.