Vision-language models frequently hallucinate, generating text that does not match the image, so methods are needed that not only detect these errors but also explain them and localize the relevant visual evidence. The authors introduce GAVEL, a task that jointly addresses verification, explanation, and localization for image-text pairs, together with a corresponding dataset and benchmark.

  • GAVEL is a new task focusing on the joint verification, explanation, and localization of caption errors in vision-language models.
  • A new dataset and benchmark are provided to support systematic evaluation of these capabilities.
  • Experiments reveal that even strong closed-source models struggle with the GAVEL task.
  • Training a supervised baseline on human-annotated data yields consistent improvements across grounding and explanation metrics.

This work provides learnable supervision for error verification and localization, offering a framework to systematically evaluate and improve the alignment between images and the text that multimodal models generate about them.
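
To make the three parts of the task concrete, the sketch below shows one way a single output could be represented and how a localized region might be scored against a reference. Everything in it is an assumption for illustration: the `GavelOutput` record, its field names, the `(x_min, y_min, x_max, y_max)` box format, the example caption, and the use of intersection-over-union (IoU) as a grounding score are not taken from the paper, which this summary does not describe at that level of detail.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format (hypothetical): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GavelOutput:
    """Hypothetical record bundling the three outputs the task asks for."""
    is_aligned: bool                 # verification: does the caption match the image?
    explanation: str                 # explanation: why the caption is (in)correct
    evidence_boxes: List[Box] = field(default_factory=list)  # localization


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (0.0 if they do not overlap)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up example: the model flags a caption error and points at a region.
prediction = GavelOutput(
    is_aligned=False,
    explanation="The caption places the dog on the sofa, but it is on the floor.",
    evidence_boxes=[(10.0, 20.0, 110.0, 120.0)],
)
reference_box: Box = (60.0, 20.0, 160.0, 120.0)  # hypothetical human annotation

score = box_iou(prediction.evidence_boxes[0], reference_box)
```

Keeping the verdict, the explanation, and the boxes in one record reflects the joint nature of the task: a prediction can then be checked on all three aspects for the same image-text pair rather than in separate passes.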