This study evaluates whether mid-scale Multimodal Large Language Models (MLLMs) can perform localized concept naming under strict zero-shot conditions by assigning labels to bounding-box regions. The authors propose a reproducible evaluation protocol for Concept Naming that includes closed-set prompting and an embedding-similarity-based strategy for large label spaces.
- Experiments with four MLLMs ranging from 7B to 32B parameters demonstrate consistent performance trends across datasets.
- The models achieve object-level exact-match accuracy between 62% and 88%.
- The research highlights the potential of training-free concept annotation from localized regions for Concept-based Explainable AI (C-XAI).
The authors release a reproducible framework to support future low-cost C-XAI research and discuss the limitations and failure modes identified during the evaluation.
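As a rough illustration of the two mechanics mentioned above, the sketch below maps a free-form model answer onto a large label set by embedding similarity and then scores object-level exact-match accuracy. It is not the authors' implementation. The function names (`embed`, `nearest_label`, `exact_match_accuracy`) are hypothetical, and the character-trigram `embed` is only a toy stand-in for whatever text encoder the protocol actually uses, which the summary does not specify.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): a bag of character trigrams.

    A real pipeline would call a learned text encoder here instead.
    """
    padded = f"  {text.lower().strip()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(prediction: str, labels: list[str]) -> str:
    """Return the label whose embedding is most similar to the prediction."""
    pred_vec = embed(prediction)
    return max(labels, key=lambda label: cosine(pred_vec, embed(label)))


def exact_match_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of regions whose mapped label equals the gold label exactly."""
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Hypothetical label space and free-form model answers for three boxes.
labels = ["traffic light", "stop sign", "bicycle"]
raw_answers = ["a traffic lights", "bicycles", "stop signs"]
gold = ["traffic light", "bicycle", "bicycle"]

mapped = [nearest_label(answer, labels) for answer in raw_answers]
accuracy = exact_match_accuracy(mapped, gold)
```

Mapping answers through a similarity search avoids listing every candidate label in the prompt, which is the motivation the summary gives for using this strategy on large label spaces. Closed-set prompting, by contrast, would place the candidate labels directly in the prompt and compare the model's choice to the gold label without the mapping step.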