Researchers introduce HarmVideoBench, a multi-layered diagnostic benchmark that evaluates whether large vision-language models understand harmful videos beyond superficial cues. It addresses limitations of prior work by incorporating explanatory rationales and by assessing three hierarchical dimensions of harm: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning.

  • HarmVideoBench comprises 1,379 videos paired with 4,137 multiple-choice questions (an average of three per video) to evaluate deep contextual understanding.
  • The benchmark assesses models across three hierarchical dimensions: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning.
  • The study evaluates 19 leading large vision-language models to assess their multidimensional understanding of harmful content.
  • A new method called BCR is introduced, which predicts reasoning boundaries and dynamically retrieves context only when needed.
  • Experimental results show BCR raises macro-average performance from 61.7 percent to a state-of-the-art 84.4 percent, a gain of 22.7 percentage points.
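
The summary describes BCR only at a high level: decide whether a question needs information from outside the clip, and retrieve extra context only in that case. The sketch below illustrates that general control flow and nothing more. It is not the authors' implementation; the names Question, answer_with_gated_retrieval, toy_gate, toy_retriever, and toy_answerer are hypothetical, and the keyword heuristic merely stands in for whatever boundary predictor the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    text: str
    options: List[str]


# Hypothetical component signatures; none of these come from the paper.
Gate = Callable[[Question, str], bool]            # does the question need outside context?
Retriever = Callable[[Question, str], List[str]]  # fetch outside context
Answerer = Callable[[Question, str, List[str]], int]  # return index of chosen option


def answer_with_gated_retrieval(
    question: Question,
    clip_summary: str,
    gate: Gate,
    retriever: Retriever,
    answerer: Answerer,
) -> int:
    """Call the retriever only when the gate says the clip alone is not enough."""
    context: List[str] = []
    if gate(question, clip_summary):
        context = retriever(question, clip_summary)
    return answerer(question, clip_summary, context)


# Toy stand-ins so the sketch runs end to end.
retrieval_calls: List[str] = []


def toy_gate(question: Question, clip_summary: str) -> bool:
    # Assumption: a crude keyword check in place of a learned predictor.
    return "outside the clip" in question.text.lower()


def toy_retriever(question: Question, clip_summary: str) -> List[str]:
    retrieval_calls.append(question.text)  # record that retrieval happened
    return ["background note (placeholder)"]


def toy_answerer(question: Question, clip_summary: str, context: List[str]) -> int:
    # Placeholder policy: choose option 1 if context was supplied, else option 0.
    return 1 if context else 0


if __name__ == "__main__":
    q_visible = Question("What object is shown?", ["a knife", "a phone"])
    q_beyond = Question("What event outside the clip does this reference?", ["none", "a past incident"])
    for q in (q_visible, q_beyond):
        choice = answer_with_gated_retrieval(q, "clip summary", toy_gate, toy_retriever, toy_answerer)
        print(q.text, "->", q.options[choice])
    print("retrieval calls:", len(retrieval_calls))
```

The point of the gate is cost and noise control: questions answerable from the clip skip retrieval entirely, so outside context is introduced only where it is likely to help.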

The authors argue that this matters because current evaluation frameworks often act as a black box in which models can succeed through shortcuts, whereas HarmVideoBench is designed to make models explain their reasoning and capture implicit harms.