Evaluation Pitfalls and Challenges in Multimedia Event Extraction

This work presents the first systematic analysis of evaluation pitfalls in multimedia event extraction, identifying three major sources of issues: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings.

The study shows that seemingly minor evaluation choices can cause large variations in reported performance.
These variations often lead to overestimating a model's ability to ground real-world events across modalities.
Controlled experiments under a strict evaluation framework demonstrate the critical need for comparable standards.
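
To illustrate how a relaxed evaluation setting can inflate scores, the following is a minimal sketch, not the study's actual protocol. It scores the same hypothetical trigger predictions under two matching criteria: a strict one (identical document, span, and event type) and a relaxed one (same event type with any token overlap). The `Mention` structure, the matching functions, the greedy one-to-one alignment, and the toy data are all assumptions made for illustration.

```python
from typing import Callable, List, NamedTuple


class Mention(NamedTuple):
    """Hypothetical event trigger mention (illustrative, not from the study)."""
    doc_id: str
    start: int        # inclusive token offset
    end: int          # exclusive token offset
    event_type: str


def strict_match(pred: Mention, gold: Mention) -> bool:
    # Strict: document, span boundaries, and event type must all be identical.
    return pred == gold


def relaxed_match(pred: Mention, gold: Mention) -> bool:
    # Relaxed: same document and event type, and the spans overlap at all.
    return (
        pred.doc_id == gold.doc_id
        and pred.event_type == gold.event_type
        and pred.start < gold.end
        and gold.start < pred.end
    )


def f1_score(
    preds: List[Mention],
    golds: List[Mention],
    match: Callable[[Mention, Mention], bool],
) -> float:
    # Greedy one-to-one alignment: each gold mention is credited at most once.
    unmatched = list(golds)
    true_positives = 0
    for pred in preds:
        for gold in unmatched:
            if match(pred, gold):
                unmatched.remove(gold)
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for this sketch).
golds = [
    Mention("d1", 3, 4, "Attack"),
    Mention("d1", 10, 12, "Transport"),
    Mention("d2", 0, 1, "Arrest"),
]
preds = [
    Mention("d1", 3, 4, "Attack"),       # exact match
    Mention("d1", 10, 11, "Transport"),  # right type, partial span
    Mention("d2", 0, 1, "Meet"),         # right span, wrong type
]

strict_f1 = f1_score(preds, golds, strict_match)
relaxed_f1 = f1_score(preds, golds, relaxed_match)
print(f"strict F1:  {strict_f1:.3f}")
print(f"relaxed F1: {relaxed_f1:.3f}")
```

On this toy data the same predictions score an F1 of 1/3 under strict matching and 2/3 under relaxed matching, so two papers that differ only in the matching rule would report very different numbers for an identical system. Greedy alignment is used here for brevity; a careful evaluator might prefer an optimal bipartite matching.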

The findings encourage a shift toward more rigorous evaluation in multimedia event extraction to ensure reliability and comparability of results.