The paper introduces video-SALMONN-R$^3$, an end-to-end video large language model that enables efficient re-watching of video segments through reinforcement learning without relying on chain-of-thought data. This approach addresses the computational and memory constraints that typically force models to use reduced frame rates and spatial resolutions.

  • The model utilizes a two-stage paradigm to localize relevant segments and re-watch them at higher fidelity.
  • It eliminates the need for costly chain-of-thought annotations and supervised fine-tuning, which can degrade pretrained video understanding abilities.
  • A re-answer strategy allows the model to produce a direct answer first and refine it after re-watching.
  • A re-ask mechanism reinjects the query when revisiting localized segments to improve question adherence.

The authors consider this significant because experimental results show that video-SALMONN-R$^3$ consistently outperforms base models and prior re-watch-based approaches with significantly lower computational cost.