GAVEL: Grounded Caption Error Verification and Localization
Vision-language models frequently hallucinate, producing text that is misaligned with the image. Addressing this requires methods that not only detect such errors but also explain them and localize the relevant visual evidence. The authors introduce GAVEL, a task that jointly addresses verification, explanation, and localization for image-text pairs, together with a corresponding dataset and benchmark.
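
To make the three parts of the task concrete, here is a minimal sketch of what one record for an image-text pair might look like. The class name, field names, box format, and the sample values are all hypothetical illustrations; they are not taken from the GAVEL dataset or its benchmark format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class CaptionCheck:
    """Illustrative record for one image-text pair (assumed schema, not GAVEL's)."""
    image_path: str
    caption: str
    is_correct: bool                                            # verification
    explanation: Optional[str] = None                           # explanation of the error
    evidence_boxes: List[BBox] = field(default_factory=list)    # localization

    def is_well_formed(self) -> bool:
        """Check that boxes are valid and that a flagged error has all three parts."""
        boxes_ok = all(x0 < x1 and y0 < y1 for x0, y0, x1, y1 in self.evidence_boxes)
        if self.is_correct:
            return boxes_ok
        # An erroneous caption should come with an explanation and visual evidence.
        return bool(self.explanation) and bool(self.evidence_boxes) and boxes_ok


# Made-up example values, for illustration only.
example = CaptionCheck(
    image_path="example.jpg",
    caption="A dog sits on a red sofa.",
    is_correct=False,
    explanation="The sofa in the image is blue, not red.",
    evidence_boxes=[(40.0, 120.0, 380.0, 300.0)],
)
```

Keeping the verdict, the explanation, and the evidence boxes in a single record reflects the joint framing described above: a prediction that flags an error without saying why, or without pointing to image regions, would be incomplete under this assumed schema.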