Researchers propose CineCap, a framework for cinematographic video captioning that combines structured reasoning grounded in spatio-temporal anchors with reinforcement learning. The method ties professional film-language descriptions to explicit visual evidence while balancing descriptive completeness against factual correctness.
- CineCap uses supervised fine-tuning on compact atomic reasoning grounded in spatio-temporal anchors.
- Reinforcement learning applies comprehensiveness, accuracy, and gated coverage rewards to improve output quality.
- The authors introduce CineCap Bench, a benchmark of 472 manually annotated video-caption pairs for evaluation.
- Experiments show CineCap consistently outperforms strong proprietary and open-source baselines, establishing a new state of the art.
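
The summary names three reward signals but does not say how they are scored or combined. As a rough illustration only, the toy sketch below shows one way a "gated" coverage term could work: the coverage bonus counts only when the accuracy score clears a threshold, so a caption cannot gain reward by covering more dimensions with incorrect claims. The names `RewardParts` and `gated_reward`, the threshold, the weight, and the additive combination are all assumptions made for this example, not details from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParts:
    """Hypothetical per-caption scores, each assumed to lie in [0, 1]."""
    comprehensiveness: float
    accuracy: float
    coverage: float


def gated_reward(
    parts: RewardParts,
    accuracy_gate: float = 0.5,    # assumed threshold, not from the source
    coverage_weight: float = 0.5,  # assumed weight, not from the source
) -> float:
    """Sum comprehensiveness and accuracy, adding coverage only if the gate opens."""
    base = parts.comprehensiveness + parts.accuracy
    if parts.accuracy >= accuracy_gate:
        # Gate open: the caption is accurate enough for coverage to count.
        return base + coverage_weight * parts.coverage
    # Gate closed: coverage is ignored so inaccurate breadth earns nothing extra.
    return base


if __name__ == "__main__":
    accurate = RewardParts(comprehensiveness=0.5, accuracy=0.75, coverage=1.0)
    inaccurate = RewardParts(comprehensiveness=0.5, accuracy=0.25, coverage=1.0)
    print(gated_reward(accurate))
    print(gated_reward(inaccurate))
```

In this sketch the two example captions have identical comprehensiveness and coverage scores, yet only the accurate one receives the coverage bonus, which is the behavior a gate is meant to enforce.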
The work addresses the challenge of generating unified, open-form descriptions that span multiple cinematographic dimensions, a capability that supports fine-grained video understanding and controllable, movie-quality video generation.