video-SALMONN-R3: Efficient Video Understanding via Reinforcement Learning
The paper introduces video-SALMONN-R$^3$, an end-to-end video large language model that enables efficient re-watching of video segments through reinforcement learning without relying on chain-of-thought data. This approach addresses the computational and memory constraints that typically force models to use reduced frame rates and spatial resolutions.