Researchers introduce DMV-Bench, which they describe as the first interactive benchmark for evaluating visual memory in multimodal agents within controlled environments. The study also proposes DualMem, a memory architecture that keeps visual and verbal codes in parallel and outperforms existing systems on the benchmark.
- DMV-Bench uses a home-furnishing e-commerce catalogue of 1,000 product variants in which a text-leakage contract restricts discriminative signals to pixels, so text alone cannot tell the variants apart.
- Agents navigate autonomous shopping chains and must recall specific products based on unique incidental cues embedded in visited images.
- DualMem maintains parallel visual and verbal codes, with vision carrying the cue end-to-end while the verbal channel assists in query grounding.
- The architecture outperforms caption baselines and three recent multimodal agent-memory systems across chain lengths of 5, 10, 15, and 50 steps.
- Performance gains were verified on Gemini 2.5 Flash and Qwen2.5-VL-7B models, controlling for memory-bank size and encoding-position bias.
The findings indicate that an asymmetric dual-coding regime, in which vision carries the cue and language grounds the query, improves long-horizon visual recall in interactive agent tasks.
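
To make the asymmetric roles of the two channels concrete, here is a minimal sketch of a parallel visual and verbal memory. It is not the DualMem implementation, whose internals this summary does not describe. The class `DualMemorySketch`, the `MemoryEntry` fields, the word-overlap grounding rule, and the toy three-dimensional embeddings are all assumptions made for illustration: the verbal channel only narrows the candidates to entries that match the query, and the visual channel alone decides which candidate carries the cue.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class MemoryEntry:
    step: int             # position in the shopping chain
    product_id: str
    visual: list[float]   # hypothetical image embedding (carries the cue)
    verbal: str           # hypothetical short caption (used only for grounding)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DualMemorySketch:
    """Illustrative store that keeps a visual and a verbal code per visited item."""

    def __init__(self):
        self.entries = []

    def write(self, entry):
        self.entries.append(entry)

    def recall(self, query_words, cue_embedding):
        # Verbal channel: ground the query by keeping entries whose caption
        # shares at least one word with it (an assumed, deliberately crude rule).
        wanted = {w.lower() for w in query_words}
        grounded = [
            e for e in self.entries
            if wanted & set(e.verbal.lower().split())
        ]
        # Fall back to the whole memory bank if grounding matches nothing.
        candidates = grounded or self.entries
        # Visual channel: the cue is matched purely on embedding similarity.
        return max(candidates, key=lambda e: cosine(e.visual, cue_embedding))


memory = DualMemorySketch()
memory.write(MemoryEntry(1, "lamp-a", [0.9, 0.1, 0.0], "brass floor lamp"))
memory.write(MemoryEntry(2, "lamp-b", [0.1, 0.9, 0.0], "brass floor lamp"))
memory.write(MemoryEntry(3, "sofa-a", [0.1, 0.95, 0.0], "grey fabric sofa"))

hit = memory.recall(["lamp"], [0.0, 1.0, 0.0])
print(hit.product_id)  # lamp-b
```

In this toy example the two lamps have identical captions, so only the visual code can separate them, which mirrors the benchmark's premise that the discriminative signal lives in pixels. The sofa is marginally closer to the cue in embedding space, and it is the verbal grounding step that keeps it out of the candidate set.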