video-SALMONN-R3: Efficient Video Understanding via Reinforcement Learning

The paper introduces video-SALMONN-R$^3$, an end-to-end video large language model that enables efficient re-watching of video segments through reinforcement learning without relying on chain-of-thought data. This approach addresses the computational and memory constraints that typically force models to use reduced frame rates and spatial resolutions.

The model utilizes a two-stage paradigm to localize relevant segments and re-watch them at higher fidelity.
It eliminates the need for costly chain-of-thought annotations and supervised fine-tuning, which can degrade pretrained video understanding abilities.
A re-answer strategy allows the model to produce a direct answer first and refine it after re-watching.
A re-ask mechanism reinjects the query when revisiting localized segments to improve question adherence.

The authors consider this significant because experimental results show that video-SALMONN-R$^3$ consistently outperforms base models and prior re-watch-based approaches with significantly lower computational cost.