CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning
Researchers propose CineCap, a framework that combines structured reasoning with spatio-temporal anchors and reinforcement learning to improve cinematographic video captioning. The method grounds professional film-language descriptions in explicit visual evidence while balancing descriptive completeness and factual correctness.
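The summary does not say how CineCap represents its anchors or scores captions, so the following is only an illustrative sketch of the two ideas it names: tying each film-language statement to explicit visual evidence, and trading off descriptive completeness against factual correctness. Every name here (`Anchor`, `Claim`, `caption_reward`, the attribute strings) is hypothetical, and the F-beta score is a stand-in chosen for illustration, not the reward the authors use.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Anchor:
    """Hypothetical spatio-temporal anchor: a time span plus a normalized box."""
    t_start: float                           # seconds
    t_end: float                             # seconds
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1), each in [0, 1]


@dataclass(frozen=True)
class Claim:
    """One cinematographic statement in a caption, tied to its visual evidence."""
    attribute: str   # e.g. "shot_size" (assumed label set)
    value: str       # e.g. "close-up"
    anchor: Anchor


def caption_reward(claims: Iterable[Claim],
                   reference: Set[Tuple[str, str]],
                   beta: float = 1.0) -> float:
    """F-beta over (attribute, value) pairs.

    Precision stands in for factual correctness and recall for descriptive
    completeness; beta > 1 weights completeness more heavily. This is an
    assumed scoring rule, not the one from the paper.
    """
    predicted = {(c.attribute, c.value) for c in claims}
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

In a reinforcement-learning setup, a scalar like this could serve as the per-caption reward; the anchors would let a separate check confirm that each claim points at the frames and region that support it.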