OmniAgent introduces an iterative Observation-Thought-Action cycle for video understanding, formulated as a partially observable Markov decision process (POMDP). The agent executes actions on demand and selectively distills audio-visual cues into persistent textual memory. It achieves state-of-the-art performance on ten benchmarks; on LVBench, the 7B agent outperforms the roughly 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
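The loop described in the summary can be pictured with a minimal sketch. This is an illustration under stated assumptions, not OmniAgent's actual interface: `run_agent`, `toy_policy`, `toy_tools`, and the `listen`/`look`/`answer` action names are all hypothetical, and the trained policy model is replaced by a hard-coded stub.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only (not from the paper).
# A tool takes an argument string and returns a textual observation.
Tool = Callable[[str], str]
# A policy maps (question, memory) to (thought, action name, action argument).
Policy = Callable[[str, List[str]], Tuple[str, str, str]]


def run_agent(
    question: str,
    policy: Policy,
    tools: Dict[str, Tool],
    max_steps: int = 8,
) -> Tuple[str, List[str]]:
    """Run an Observation-Thought-Action loop with persistent textual memory."""
    memory: List[str] = []  # text notes distilled from earlier observations
    for _ in range(max_steps):
        # Thought + Action: the policy decides what to do given the memory so far.
        thought, action, arg = policy(question, memory)
        if action == "answer":
            return arg, memory
        # Observation: run the chosen tool only when the policy asks for it.
        observation = tools[action](arg)
        memory.append(f"[{action}({arg})] {observation}")
    return "", memory  # step budget exhausted without an answer


def toy_policy(question: str, memory: List[str]) -> Tuple[str, str, str]:
    """Hard-coded stand-in for a learned policy (assumption for this sketch)."""
    if not memory:
        return ("Check the audio first.", "listen", "0-30s")
    if len(memory) == 1:
        return ("Confirm with a frame.", "look", "12s")
    return ("Enough evidence.", "answer", "a dog barks at the door")


# Stub tools that return canned text instead of processing real audio or video.
toy_tools: Dict[str, Tool] = {
    "listen": lambda span: f"barking sound in {span}",
    "look": lambda t: f"dog near a door at {t}",
}

answer, memory = run_agent("What happens in the clip?", toy_policy, toy_tools)
```

In this sketch the agent never sees the whole video at once: each step adds one short text note to `memory`, and the policy decides whether to gather more evidence or answer.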
arXiv cs.CL · 7d ago · research
OmniAgent: Native Active Perception for Omni-Modal Understanding
Importance 3/3
Beats a top-lab benchmark
New feature vs. leaders
Mistral AI
Alibaba (Qwen)
OpenAI
AI agents
Multimodal
Reasoning models
Benchmarks
| Benchmark | Model | Score |
|---|---|---|
| LongVideoBench | OmniAgent | — |