MIRCaps is a large-scale multimodal dataset comprising 141,364 images, 981,947 image-level captions, 1,742,264 region-level captions, and 5,391,779 bounding box annotations. Its captions describe object categories, sizes, colors, actions, and environmental context, which supports fine-grained vision-language learning. The dataset is shown to be effective for image captioning and object detection tasks.
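To give a feel for annotation density, the totals above can be turned into per-image averages. The following is a minimal sketch that uses only the four counts stated in the summary; the names `MIRCAPS_COUNTS` and `per_image_averages` are illustrative and are not part of any MIRCaps tooling.

```python
# Totals as reported in the dataset summary.
MIRCAPS_COUNTS = {
    "images": 141_364,
    "image_captions": 981_947,
    "region_captions": 1_742_264,
    "bounding_boxes": 5_391_779,
}


def per_image_averages(counts):
    """Return the mean number of each annotation type per image."""
    n_images = counts["images"]
    return {
        name: total / n_images
        for name, total in counts.items()
        if name != "images"
    }


averages = per_image_averages(MIRCAPS_COUNTS)
for name, value in averages.items():
    print(f"{name}: {value:.2f} per image")
```

Under these counts, an image carries on average roughly 6.95 image-level captions, 12.32 region-level captions, and 38.14 bounding boxes. These are means derived from the totals only; the summary does not state how annotations are distributed across individual images.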
MIRCaps: Large-Scale Mixed-Domain Vision-Language Dataset