A study investigates whether vision-language models (VLMs) can distinguish information that could be shared from understanding that has actually been established as shared during collaborative dialogue. The researchers formulated an interpretation-matching task using 13,077 annotated reference expressions from HCRC MapTask dialogues to evaluate model behavior under controlled conditions.
- Providing authentic map images improves overall performance but causes models to over-predict alignment with the partner's perspective.
- Textual descriptions of map content reproduce this bias, while non-informative images suppress alignment predictions entirely.
- The bias is driven by task-relevant map content rather than the visual channel itself.
- The gain in alignment prediction that comes with task-relevant map content is offset by degraded accuracy on non-aligned cases.
- Calibration analysis suggests models rely on static referential cues on maps instead of tracking grounding through dialogue history.
- These patterns were observed most clearly in Qwen3-VL-8B-Instruct and to varying degrees in four additional models from two architecture families.
The findings indicate that VLMs conflate potential shared information with established common ground, treating map content as evidence of mutual understanding rather than tracking how grounding unfolds through interaction.
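
The trade-off described in the findings (better results on aligned cases, worse on non-aligned ones, and an overall tendency to over-predict alignment) can be made concrete with a small metric sketch. The following is a minimal illustration, not the study's evaluation code: it assumes each reference expression carries a binary gold label (aligned or not aligned) and a binary model prediction, and the function name `alignment_report`, the metric names, and the toy data are all hypothetical.

```python
import math


def alignment_report(gold, pred):
    """Summarize binary alignment predictions.

    gold, pred: equal-length sequences of booleans, where True means the
    two speakers' interpretations are labeled (or predicted) as aligned.
    This labeling scheme is an assumption made for illustration.
    """
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and equal in length")

    n = len(gold)
    # Correct predictions on aligned and on non-aligned cases.
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    n_aligned = sum(1 for g in gold if g)
    n_non_aligned = n - n_aligned

    gold_rate = n_aligned / n
    pred_rate = sum(1 for p in pred if p) / n

    return {
        "accuracy": (tp + tn) / n,
        # Accuracy restricted to cases whose gold label is "aligned".
        "aligned_accuracy": tp / n_aligned if n_aligned else math.nan,
        # Accuracy restricted to cases whose gold label is "not aligned".
        "non_aligned_accuracy": tn / n_non_aligned if n_non_aligned else math.nan,
        # Positive values mean the model predicts "aligned" more often
        # than the gold labels warrant.
        "over_prediction": pred_rate - gold_rate,
    }


# Toy data, invented for illustration only (not taken from the study).
toy_gold = [True, True, True, False, False, False, False, False]
toy_pred = [True, True, True, True, True, True, False, False]
toy_report = alignment_report(toy_gold, toy_pred)

if __name__ == "__main__":
    for name, value in toy_report.items():
        print(f"{name}: {value:.3f}")
```

Reporting accuracy separately for aligned and non-aligned cases, alongside the gap between predicted and gold alignment rates, keeps an over-prediction bias visible even when overall accuracy goes up.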