DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

Researchers introduce DMV-Bench, the first interactive benchmark designed to evaluate visual memory in multimodal agents under controlled conditions. The study also proposes DualMem, a memory architecture that keeps visual and verbal codes in parallel and outperforms existing systems on the benchmark.

DMV-Bench uses a home-furnishing e-commerce catalogue of 1,000 product variants in which a text-leakage contract confines the discriminative signal to image pixels.
Agents autonomously work through multi-step shopping chains and must later recall specific products from unique incidental cues embedded in the images they visited.
DualMem maintains parallel visual and verbal codes, with vision carrying the cue end-to-end while the verbal channel assists in query grounding.
The architecture outperforms caption baselines and three recent multimodal agent-memory systems across chain lengths of 5, 10, 15, and 50 steps.
The gains hold on both Gemini 2.5 Flash and Qwen2.5-VL-7B, with controls for memory-bank size and encoding-position bias.
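
The asymmetric division of labour described for DualMem can be illustrated with a minimal sketch. This is not the authors' implementation: the class `DualMemorySketch`, the `write` and `recall` methods, the keyword-overlap shortlist, and the toy three-dimensional vectors are all assumptions made for illustration. The only idea taken from the summary is that the verbal code helps ground the query while the visual code alone decides which remembered item matches the cue.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    step: int
    product_id: str
    visual_code: list[float]  # hypothetical stand-in for an image embedding
    verbal_code: str          # hypothetical stand-in for a short caption


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class DualMemorySketch:
    """Toy memory that stores a visual and a verbal code for every visited item."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def write(self, product_id, visual_code, verbal_code):
        # Both codes are written in parallel at each step of the chain.
        step = len(self.entries)
        self.entries.append(MemoryEntry(step, product_id, visual_code, verbal_code))

    def recall(self, query_text, cue_vector):
        # Verbal channel: ground the query by shortlisting entries whose
        # caption shares at least one word with the query (assumed heuristic).
        query_tokens = set(query_text.lower().split())
        shortlist = [
            e for e in self.entries
            if query_tokens & set(e.verbal_code.lower().split())
        ]
        # Fall back to the full memory if the verbal channel finds nothing.
        candidates = shortlist or self.entries
        # Visual channel: the cue is matched purely on the visual code.
        return max(candidates, key=lambda e: cosine(e.visual_code, cue_vector))


memory = DualMemorySketch()
memory.write("lamp-A", [1.0, 0.0, 0.1], "brass floor lamp")
memory.write("lamp-B", [0.0, 1.0, 0.1], "brass floor lamp")
memory.write("rug-C", [0.0, 0.9, 0.2], "wool area rug")

hit = memory.recall("the lamp with the sticker", [0.1, 1.0, 0.0])
print(hit.product_id)
```

In this toy run the two lamps have identical captions, so the verbal channel can only narrow the search to lamps. The visual code is what separates them, which mirrors the benchmark's premise that the discriminative signal lives in the pixels.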

The findings indicate that an asymmetric dual-coding regime, in which vision carries the cue and language supports query grounding, improves long-horizon visual recall in interactive agent tasks.