The SHOVIR benchmark evaluates vision shortcut learning in radiology report generation (RRG) by extending MIMIC-CXR and PadChest-GR with per-box CheXpert labels. It uses image-level and disease-level occlusion experiments to isolate direct and contextual shortcuts, in which models rely on spurious correlations rather than on actual visual evidence.
- SHOVIR extends two spatially annotated chest X-ray datasets, MIMIC-CXR and PadChest-GR, with per-box CheXpert labels.
- The benchmark defines image-level and disease-level occlusion experiments that contrast baseline performance with performance under localized, region-specific perturbations.
- It isolates two failure modes: direct shortcuts, where a finding persists in the report after its visual evidence is removed, and contextual shortcuts, where detection of a finding degrades when co-occurring pathologies are occluded.
- Benchmarking eight state-of-the-art vision-language models (VLMs) reveals that shortcut behavior varies substantially across architectures and datasets.
- Models with the highest baseline report quality do not necessarily rank highest in spatial grounding, showing that clinically fluent generation can coexist with shallow reliance on visual evidence.
These findings expose a blind spot in current RRG evaluation and motivate region-aware assessment protocols to ensure models rely on actual pathological evidence.
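
The two failure modes described above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not SHOVIR's actual implementation: the function names (`occlude_boxes`, `direct_shortcut_rate`, `contextual_shortcut_rate`), the `(x0, y0, x1, y1)` box format, the constant-fill occlusion, and the rate definitions are all hypothetical, and the benchmark's real perturbation and scoring procedures may differ. The sketch assumes that each case has already been reduced to a boolean saying whether the target finding was detected in the generated report.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, x1/y1 exclusive.
Box = Tuple[int, int, int, int]


def occlude_boxes(image: Sequence[Sequence[float]],
                  boxes: Sequence[Box],
                  fill: float = 0.0) -> List[List[float]]:
    """Return a copy of a 2D image with every box region set to `fill`.

    Constant fill is an assumption; other occlusion strategies are possible.
    """
    out = [list(row) for row in image]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image bounds before filling.
        for y in range(max(y0, 0), min(y1, len(out))):
            row = out[y]
            for x in range(max(x0, 0), min(x1, len(row))):
                row[x] = fill
    return out


def direct_shortcut_rate(detected_after_own_occlusion: Sequence[bool]) -> float:
    """Fraction of cases where the finding is still reported after the
    regions showing that same finding were occluded."""
    flags = list(detected_after_own_occlusion)
    if not flags:
        return 0.0
    return sum(flags) / len(flags)


def contextual_shortcut_rate(detected_baseline: Sequence[bool],
                             detected_after_context_occlusion: Sequence[bool]) -> float:
    """Among cases detected at baseline, the fraction that are lost once
    only the regions of co-occurring pathologies are occluded."""
    pairs = list(zip(detected_baseline, detected_after_context_occlusion))
    detected = [after for base, after in pairs if base]
    if not detected:
        return 0.0
    lost = sum(1 for after in detected if not after)
    return lost / len(detected)
```

In this sketch, a high direct rate suggests the model reports a finding without looking at its region, while a high contextual rate suggests detection depends on neighboring pathologies rather than on the finding itself. Both rates are meant to be read against the unperturbed baseline rather than in isolation.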