AVOC: Retrieval-Inspired Token Compression for Long-Form Audio-Video Understanding

AVOC enhances long-form audio-video understanding in omni-modal LLMs by introducing a learnable token compression module. It reframes token selection as a top-K retrieval problem, using relevance, importance, and diversity criteria to select a compact set of informative tokens. AVOC achieves state-of-the-art results on OmniVideoBench and LVOmniBench and maintains strong performance on one-hour audio-video needle-in-a-haystack tasks.
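
To make the top-K retrieval framing concrete, the following is a minimal sketch of selecting K tokens under the three criteria named above. It is not AVOC's learnable module: the summary does not specify how the criteria are scored or combined, so this sketch assumes cosine similarity to a query vector for relevance, a precomputed per-token score for importance, and an MMR-style greedy redundancy penalty for diversity. The function name select_tokens and the weights w_rel, w_imp, and w_div are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def select_tokens(tokens, query, importance, k, w_rel=1.0, w_imp=1.0, w_div=1.0):
    """Greedily pick k token indices (hypothetical scoring, not the paper's).

    relevance  : cosine similarity between a token and the query (assumption)
    importance : externally supplied per-token score (assumption)
    diversity  : penalty equal to the highest non-negative similarity to any
                 token already selected, as in MMR (assumption)
    """
    k = min(k, len(tokens))
    # Relevance and importance do not depend on what has been selected so far.
    base = [w_rel * cosine(t, query) + w_imp * s for t, s in zip(tokens, importance)]
    selected = []
    remaining = set(range(len(tokens)))
    while len(selected) < k:
        best_i, best_score = None, -math.inf
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            redundancy = max(
                (max(0.0, cosine(tokens[i], tokens[j])) for j in selected),
                default=0.0,
            )
            score = base[i] - w_div * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
    return sorted(selected)  # restore original (temporal) order


# Toy example: tokens 0 and 1 are near-duplicates that both match the query.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]
importance = [0.5, 0.5, 0.6, 0.1]

print(select_tokens(tokens, query, importance, k=2))             # [0, 2]
print(select_tokens(tokens, query, importance, k=2, w_div=0.0))  # [0, 1]
```

In this toy setup, the diversity penalty causes the near-duplicate token 1 to be skipped in favor of token 2. With the penalty disabled, the selection reduces to a plain top-K over relevance plus importance and keeps both duplicates.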