SHOVIR: A Benchmark for Evaluating Vision Shortcut Learning in Radiology Report Generation

The SHOVIR benchmark evaluates vision shortcut learning in radiology report generation by extending MIMIC-CXR and PadChest-GR with per-box CheXpert labels. It uses image-level and disease-level occlusion experiments to isolate direct and contextual shortcuts, in which models rely on spurious correlations rather than actual visual evidence.

SHOVIR extends two spatially annotated chest X-ray datasets, MIMIC-CXR and PadChest-GR, with per-box CheXpert labels.
The benchmark defines image-level and disease-level occlusion experiments that contrast baseline performance with performance under localized, region-specific perturbations.
It isolates two failure modes: direct shortcuts, where findings persist after the visual evidence is removed, and contextual shortcuts, where detection degrades when co-occurring pathologies are occluded.
Benchmarking eight state-of-the-art vision-language models (VLMs) reveals that shortcut behavior varies substantially across architectures and datasets.
Models with the highest baseline report quality do not necessarily rank highest in spatial grounding, showing that clinically fluent generation can coexist with shallow reliance on visual evidence.
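
The occlusion experiments and the two failure modes described above can be illustrated with a minimal sketch. This is not the benchmark's implementation: the function names, the constant-fill occlusion, the `(x0, y0, x1, y1)` box format, and both rate definitions are assumptions chosen only to make the idea concrete. Each finding is represented by a boolean saying whether the generated report mentions it.

```python
import numpy as np


def occlude_boxes(image, boxes, fill=0.0):
    """Return a copy of `image` with each box filled with a constant value.

    Assumption: boxes are (x0, y0, x1, y1) in pixel coordinates, with the
    upper bounds exclusive. A constant fill is only one possible occlusion.
    """
    out = image.copy()
    for x0, y0, x1, y1 in boxes:
        out[y0:y1, x0:x1] = fill
    return out


def direct_shortcut_rate(baseline_pos, occluded_pos):
    """Hypothetical metric: share of findings reported at baseline that are
    still reported after the finding's own region has been occluded."""
    baseline_pos = np.asarray(baseline_pos, dtype=bool)
    occluded_pos = np.asarray(occluded_pos, dtype=bool)
    n = baseline_pos.sum()
    if n == 0:
        return float("nan")
    return float((baseline_pos & occluded_pos).sum() / n)


def contextual_shortcut_rate(baseline_pos, context_occluded_pos):
    """Hypothetical metric: share of findings reported at baseline that are
    dropped when only the co-occurring pathologies are occluded."""
    baseline_pos = np.asarray(baseline_pos, dtype=bool)
    context_occluded_pos = np.asarray(context_occluded_pos, dtype=bool)
    n = baseline_pos.sum()
    if n == 0:
        return float("nan")
    return float((baseline_pos & ~context_occluded_pos).sum() / n)


# Toy usage: a 4x4 "image" with one 2x2 box occluded.
image = np.ones((4, 4), dtype=np.float32)
masked = occlude_boxes(image, [(1, 1, 3, 3)])

# Whether a finding is mentioned in the report, per (image, finding) pair.
baseline = [True, True, False, True]
after_own_region_occluded = [True, False, False, True]
after_context_occluded = [True, False, False, True]

direct = direct_shortcut_rate(baseline, after_own_region_occluded)
contextual = contextual_shortcut_rate(baseline, after_context_occluded)
```

Under these assumed definitions, a high direct rate means that findings survive the removal of their own visual evidence, while a high contextual rate means that detection depends on neighbouring pathologies rather than on the finding's own region.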

These findings expose a blind spot in current radiology report generation (RRG) evaluation and motivate region-aware assessment protocols that verify whether models rely on actual pathological evidence.