NEST: Dataset for Narrative Event Structures in Long Videos

NEST is a dataset of 1005 full-length movies, each annotated with 102 multimodal narrative events grounded in visual, dialogue, and audio content. The annotations capture relationships between events, including temporal ordering, hierarchy, and long-range dependencies. On the benchmark tasks, performance is low for event detection and localization and higher for event relation extraction after fine-tuning.
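
To make the annotation structure concrete, the following is a minimal sketch of how events and their relations might be represented in memory. It is not the dataset's actual schema: the class names, field names (`event_id`, `start_s`, `end_s`, `modalities`, `parent_id`, `kind`), relation labels, and the 600-second threshold for "long-range" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NarrativeEvent:
    """Hypothetical record for one annotated event (field names are assumptions)."""
    event_id: str
    start_s: float                     # event start, in seconds from the start of the movie
    end_s: float                       # event end, in seconds
    description: str
    modalities: List[str] = field(default_factory=list)  # e.g. "visual", "dialogue", "audio"
    parent_id: Optional[str] = None    # encodes hierarchy: id of the enclosing event, if any


@dataclass
class EventRelation:
    """Hypothetical directed relation between two events."""
    source_id: str
    target_id: str
    kind: str                          # illustrative label such as "before" or "overlaps"


def children_of(events: List[NarrativeEvent], parent_id: str) -> List[NarrativeEvent]:
    """Return the direct sub-events of the event with the given id."""
    return [e for e in events if e.parent_id == parent_id]


def temporal_gap(a: NarrativeEvent, b: NarrativeEvent) -> float:
    """Seconds separating two events; 0.0 if their time spans overlap or touch."""
    return max(0.0, max(a.start_s, b.start_s) - min(a.end_s, b.end_s))


def long_range_relations(
    events: List[NarrativeEvent],
    relations: List[EventRelation],
    min_gap_s: float,
) -> List[EventRelation]:
    """Keep relations whose two events are at least min_gap_s seconds apart."""
    by_id = {e.event_id: e for e in events}
    return [
        r for r in relations
        if temporal_gap(by_id[r.source_id], by_id[r.target_id]) >= min_gap_s
    ]


# Toy data, invented for this sketch (not taken from the dataset).
events = [
    NarrativeEvent("e1", 0.0, 60.0, "setup scene", ["visual", "dialogue"]),
    NarrativeEvent("e1a", 10.0, 30.0, "sub-event inside the setup", ["audio"], parent_id="e1"),
    NarrativeEvent("e2", 5400.0, 5460.0, "later payoff scene", ["visual"]),
]
relations = [
    EventRelation("e1", "e2", "before"),
    EventRelation("e1a", "e1", "overlaps"),
]

distant = long_range_relations(events, relations, min_gap_s=600.0)
```

In this toy example only the relation between `e1` and `e2` is returned, since those events lie roughly an hour and a half apart, while `e1a` falls inside `e1`. Storing hierarchy as a `parent_id` and other links as separate relation records is one possible design; a real release may organize these differently.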