This work presents the first systematic analysis of evaluation pitfalls in multimedia event extraction, identifying three major sources of issues: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings.

  • Seemingly minor evaluation choices can cause large variations in reported performance.
  • These variations often lead to an overestimation of a model's ability to ground real-world events across modalities.
  • Controlled experiments under a strict evaluation framework demonstrate the critical need for comparable standards.
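
As a rough illustration of how a relaxed matching criterion can inflate scores, the sketch below scores the same predictions under two matching rules. Everything in it is an assumption made for illustration: the `Argument` tuple, the `strict_match` and `relaxed_match` rules, the greedy one-to-one matching in `f1`, and the toy data are hypothetical and are not the criteria or results of this work.

```python
from typing import Callable, List, NamedTuple


class Argument(NamedTuple):
    """Hypothetical event argument: labels plus a token span."""
    event_type: str
    role: str
    start: int  # inclusive token offset
    end: int    # exclusive token offset


def strict_match(pred: Argument, gold: Argument) -> bool:
    # Strict (assumed) rule: labels and span boundaries must be identical.
    return pred == gold


def relaxed_match(pred: Argument, gold: Argument) -> bool:
    # Relaxed (assumed) rule: same labels, and the spans merely overlap.
    same_label = pred.event_type == gold.event_type and pred.role == gold.role
    overlaps = pred.start < gold.end and gold.start < pred.end
    return same_label and overlaps


def f1(preds: List[Argument], golds: List[Argument],
       match: Callable[[Argument, Argument], bool]) -> float:
    # Greedy one-to-one matching: each gold item can be credited only once.
    unmatched = list(golds)
    true_positives = 0
    for p in preds:
        for g in unmatched:
            if match(p, g):
                unmatched.remove(g)
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: the second prediction has the right labels but shifted boundaries.
gold = [Argument("Attack", "Attacker", 0, 2), Argument("Attack", "Target", 5, 7)]
pred = [Argument("Attack", "Attacker", 0, 2), Argument("Attack", "Target", 6, 9)]

strict_f1 = f1(pred, gold, strict_match)
relaxed_f1 = f1(pred, gold, relaxed_match)
print(f"strict F1 = {strict_f1:.2f}, relaxed F1 = {relaxed_f1:.2f}")
```

On this toy input the strict rule yields an F1 of 0.50 and the relaxed rule yields 1.00, although the predictions are identical. Two papers that differ only in such a matching rule would report scores that are not directly comparable.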

The findings encourage a shift toward more rigorous evaluation in multimedia event extraction to ensure reliability and comparability of results.