This work presents the first systematic analysis of evaluation pitfalls in multimedia event extraction, identifying three major sources of issues: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings.
- Seemingly minor evaluation choices can cause large variations in reported performance.
- These variations often lead to an overestimation of a model's ability to ground real-world events across modalities.
- Controlled experiments under a strict evaluation framework demonstrate the critical need for comparable standards.
The findings encourage a shift toward more rigorous evaluation in multimedia event extraction to ensure reliability and comparability of results.
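
To make the point about relaxed evaluation settings concrete, the following is a minimal sketch with toy data, not the evaluation protocol used in this work. The record layout `(item_id, event_type, argument_role)`, the helper `f1`, and all labels are hypothetical and chosen only for illustration. It scores the same predictions twice: once requiring a full match, and once matching on the event type alone.

```python
from typing import Callable, Hashable, List, Tuple

# Hypothetical record layout (an assumption, not taken from the source):
# (item_id, event_type, argument_role)
Record = Tuple[str, str, str]


def f1(preds: List[Record], golds: List[Record],
       key: Callable[[Record], Hashable]) -> float:
    """Micro F1 over records, where `key` decides what counts as a match."""
    pred_keys = {key(p) for p in preds}
    gold_keys = {key(g) for g in golds}
    true_positives = len(pred_keys & gold_keys)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_keys)
    recall = true_positives / len(gold_keys)
    return 2 * precision * recall / (precision + recall)


# Toy gold annotations and predictions (illustrative only).
golds = [
    ("img1", "Attack", "Attacker"),
    ("img2", "Transport", "Vehicle"),
    ("img3", "Meet", "Participant"),
]
preds = [
    ("img1", "Attack", "Target"),       # right event type, wrong role
    ("img2", "Transport", "Vehicle"),   # fully correct
    ("img3", "Arrest", "Participant"),  # wrong event type
]

# Strict setting: the whole record must match.
strict = f1(preds, golds, key=lambda r: r)
# Relaxed setting: only the item and event type must match.
relaxed = f1(preds, golds, key=lambda r: (r[0], r[1]))

print(f"strict F1:  {strict:.3f}")
print(f"relaxed F1: {relaxed:.3f}")
```

On this toy data the relaxed score is twice the strict score even though the predictions are identical, which is the kind of gap that makes results reported under different settings hard to compare.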