CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning

Researchers propose CineCap, a framework that combines structured reasoning with spatio-temporal anchors and reinforcement learning to improve cinematographic video captioning. The method grounds professional film-language descriptions in explicit visual evidence while balancing descriptive completeness and factual correctness.

CineCap uses supervised fine-tuning on compact atomic reasoning grounded in spatio-temporal anchors.
Reinforcement learning applies comprehensiveness, accuracy, and gated coverage rewards to improve output quality.
The authors introduce CineCap Bench, a benchmark of 472 manually annotated video-caption pairs for evaluation.
Experiments show CineCap consistently outperforms strong proprietary and open-source baselines, establishing a new state of the art.
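
As a rough illustration of how the three reward signals named above might be combined into one scalar for reinforcement learning, here is a minimal sketch. The summary does not define the rewards or the gating rule, so everything here is an assumption: the names `RewardWeights` and `combined_reward`, the weights, the [0, 1] score range, and the reading of "gated" as "coverage only counts once accuracy clears a threshold". It should not be taken as the authors' formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; none of these values come from the paper."""
    comprehensiveness: float = 1.0
    accuracy: float = 1.0
    coverage: float = 0.5
    accuracy_gate: float = 0.6  # assumed threshold that opens the coverage gate


def combined_reward(
    comprehensiveness: float,
    accuracy: float,
    coverage: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine three per-caption scores (each assumed to lie in [0, 1]).

    Assumed gating rule: the coverage term contributes only when the
    accuracy score reaches `weights.accuracy_gate`, so a caption cannot
    earn coverage credit by listing many unsupported details.
    """
    scores = {
        "comprehensiveness": comprehensiveness,
        "accuracy": accuracy,
        "coverage": coverage,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Gate is 1.0 when accuracy is high enough, otherwise 0.0.
    gate = 1.0 if accuracy >= weights.accuracy_gate else 0.0
    return (
        weights.comprehensiveness * comprehensiveness
        + weights.accuracy * accuracy
        + gate * weights.coverage * coverage
    )


if __name__ == "__main__":
    # Accurate caption: the coverage term is included.
    print(combined_reward(0.8, 0.9, 0.5))
    # Inaccurate caption: the coverage term is gated off.
    print(combined_reward(0.8, 0.3, 0.5))
```

Under this assumed rule, the gate ties coverage credit to factual correctness, which is one plausible way to realize the stated balance between descriptive completeness and accuracy.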

This work addresses the challenge of generating unified open-form descriptions over multiple cinematographic dimensions, supporting fine-grained video understanding and controllable movie-quality video generation.