CapRiCorn-1K: Benchmark for Video Captioning and Subject Consistency

CapRiCorn-1K is a benchmark that evaluates video captioning quality and subject referential consistency across video durations and domains. It supports both audiovisual and visual-only settings. Evaluation on the benchmark shows that current models struggle to keep subject references consistent, and that both caption quality and consistency decline as video length increases, with the sharpest drop on longer videos. The benchmark's metrics align strongly with downstream task performance, which supports their validity.
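
The summary does not say how alignment with downstream tasks is measured. One common way to check such alignment is a rank correlation between per-model benchmark scores and per-model downstream scores. The sketch below assumes Spearman correlation for this purpose; the function names and the toy score lists are hypothetical and are not taken from CapRiCorn-1K.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores, for illustration only (not reported results).
benchmark_scores = [0.62, 0.48, 0.71, 0.55]
downstream_scores = [0.58, 0.40, 0.69, 0.60]

rho = spearman(benchmark_scores, downstream_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho close to 1 would mean that models ranked highly by the benchmark metric also tend to rank highly on the downstream task. Rank correlation is used here because it depends only on the ordering of models, not on the scale of either score.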