HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

Researchers introduce HarmVideoBench, a multi-layered diagnostic benchmark that tests whether large vision-language models understand harmful videos beyond superficial cues. It addresses limitations of prior benchmarks by incorporating explanatory rationales and by assessing three hierarchical dimensions of harm: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning.

HarmVideoBench comprises 1,379 videos paired with 4,137 multiple-choice questions (three per video on average) that probe deep contextual understanding.
The benchmark assesses models across three hierarchical dimensions: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning.
The study evaluates 19 leading large vision-language models on their multidimensional understanding of harmful content.
A new method called BCR is introduced, which predicts reasoning boundaries and dynamically retrieves context only when needed.
Experimental results show that BCR raises macro-average performance from 61.7 percent to a state-of-the-art 84.4 percent, a gain of 22.7 percentage points.
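
To make the macro-average figure concrete, here is a minimal Python sketch of how such a score could be computed. It assumes the macro average is the unweighted mean of per-dimension accuracies over the three dimensions named above; the summary does not spell out the exact definition. The function name `macro_average` and the toy results are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict


def macro_average(results):
    """Return (macro_avg, per_dimension_accuracy).

    `results` is an iterable of (dimension, is_correct) pairs, one per
    answered multiple-choice question. Assumption: "macro average" means
    the unweighted mean of per-dimension accuracies.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, is_correct in results:
        total[dimension] += 1
        correct[dimension] += int(is_correct)
    if not total:
        raise ValueError("no results to average")
    per_dimension = {d: correct[d] / total[d] for d in total}
    macro = sum(per_dimension.values()) / len(per_dimension)
    return macro, per_dimension


# Hypothetical toy answers, NOT real benchmark data.
toy_results = [
    ("Observable Evidence", True),
    ("Observable Evidence", True),
    ("Observable Evidence", True),
    ("Observable Evidence", False),
    ("Clip-Internal Meaning", True),
    ("Clip-Internal Meaning", True),
    ("Beyond-Clip Reasoning", True),
    ("Beyond-Clip Reasoning", False),
    ("Beyond-Clip Reasoning", False),
    ("Beyond-Clip Reasoning", False),
]

macro, per_dim = macro_average(toy_results)
# Micro average pools all questions together, ignoring dimensions.
micro = sum(ok for _, ok in toy_results) / len(toy_results)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

Under this definition each dimension counts equally regardless of how many questions it contains, so a model that is weak on one dimension cannot hide behind strong results on a larger one. This is why the macro and micro figures differ on the toy data.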

The authors consider this important because current frameworks often turn evaluation into a black box in which models can succeed through shortcuts, whereas HarmVideoBench requires models to explain their reasoning and to capture implicit harms.