arXiv cs.CL · 7d ago · research

OmniAgent: Native Active Perception for Omni-Modal Understanding

OmniAgent introduces a POMDP-based, iterative Observation-Thought-Action cycle for video understanding: the agent executes actions on demand, selectively distilling audio-visual cues into a persistent textual memory. It achieves state-of-the-art performance on ten benchmarks, with a 7B agent outperforming the roughly 10× larger Qwen2.5-VL-72B model on LVBench (50.5% vs. 47.3%).
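To make the Observation-Thought-Action cycle concrete, here is a minimal Python sketch of such a loop. It is an illustration under stated assumptions, not the paper's implementation: the names `run_agent`, `Memory`, `toy_policy`, and the `look`/`listen` tools are all hypothetical, and a hand-written stub stands in for the model that would normally produce the thought and choose the action.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: takes a (start, end) span in seconds, returns a text note.
Tool = Callable[[float, float], str]
# Hypothetical policy signature: (latest observation, memory notes) -> (thought, action, span).
Policy = Callable[[str, List[str]], Tuple[str, str, Tuple[float, float]]]


@dataclass
class Memory:
    """Persistent textual memory: an append-only list of distilled notes."""
    notes: List[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)


def run_agent(question: str, tools: Dict[str, Tool], policy: Policy,
              max_steps: int = 8) -> Tuple[str, List[str]]:
    memory = Memory()
    observation = f"Question: {question}"
    for _ in range(max_steps):
        # Thought + action selection (a real agent would query a model here).
        thought, action, span = policy(observation, memory.notes)
        if action == "answer":
            return thought, memory.notes
        # Execute the chosen perception tool only on the requested span.
        note = tools[action](*span)
        # Distill the result into text and keep it across steps.
        memory.write(f"[{action} {span[0]:.0f}-{span[1]:.0f}s] {note}")
        observation = note
    return "no answer within step budget", memory.notes


# Stub policy and tools, assumed purely for demonstration.
def toy_policy(observation: str, notes: List[str]):
    if not notes:
        return "Need visual context first.", "look", (0.0, 30.0)
    if len(notes) == 1:
        return "Check the audio for the same span.", "listen", (0.0, 30.0)
    return "A dog barks at the door.", "answer", (0.0, 0.0)


toy_tools: Dict[str, Tool] = {
    "look": lambda start, end: "a dog stands by a door",
    "listen": lambda start, end: "barking",
}

answer, notes = run_agent("What happens at the start?", toy_tools, toy_policy)
```

The point of the sketch is the control flow: perception is requested per step and per span rather than run over the whole video up front, and only short text notes persist between steps.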

Importance 3/3 · Beats a top-lab benchmark · New feature vs. leaders · arXiv cs.CL · Mistral AI · Alibaba (Qwen) · OpenAI · AI agents · Multimodal · Reasoning models

Benchmarks

Benchmark	Model	Score
LongVideoBench	OmniAgent	—
