The authors introduce MindEdit-Bench, a benchmark for evaluating vision-language models (VLMs) on object-level counterfactual spatial reasoning over in-the-wild photos. The dataset consists of 120 private indoor scenes, each captured as a triplet of smartphone photos and processed through an automatic 3D scene-graph extraction pipeline.

  • The benchmark includes six spatial reasoning tasks: four probing perception and perspective transformation, and two new tasks (L4 and L5) testing object-level counterfactual reasoning where correct answers are absent from input images.
  • Each question offers 8-24 structured answer choices to enable diagnosis of spatial and fallback errors.
  • Evaluation of 15 VLMs on 1,003 human-verified questions shows task-wise mean accuracy between 8% and 31%, compared with 81% to 97% for human majority-vote accuracy.
  • The pooled gap between humans and the best VLM is 53 percentage points, with at least a 39 pp deficit on every task.
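
The accuracy figures above rest on two pieces of arithmetic: a majority vote over human answers, and a human-minus-model gap in percentage points, taken both per task and pooled. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. It assumes that "pooled" means accuracy computed over all questions together rather than averaged across tasks. The function names, the record layout, and the toy data are all hypothetical and invented for illustration; none of the numbers come from the benchmark.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def gap_pp(human_acc, model_acc):
    """Human-minus-model accuracy gap in percentage points."""
    return 100.0 * (human_acc - model_acc)


def summarize(records):
    """Return (pooled_gap_pp, {task: gap_pp}) for the given records.

    Each record is (task, gold, human_answers, model_answer).
    This layout is an assumption, not the benchmark's actual format.
    """
    by_task = {}
    for task, gold, humans, model in records:
        by_task.setdefault(task, []).append((gold, majority_vote(humans), model))

    def gap(rows):
        gold = [r[0] for r in rows]
        return gap_pp(accuracy([r[1] for r in rows], gold),
                      accuracy([r[2] for r in rows], gold))

    per_task = {task: gap(rows) for task, rows in by_task.items()}
    pooled = gap([r for rows in by_task.values() for r in rows])
    return pooled, per_task


# Toy records, invented for illustration only.
records = [
    ("L4", "A", ["A", "A", "B"], "C"),
    ("L4", "B", ["B", "B", "B"], "B"),
    ("L5", "C", ["C", "A", "C"], "A"),
    ("L5", "D", ["A", "D", "D"], "A"),
]

pooled, per_task = summarize(records)
print(pooled, per_task)
```

Under this reading, the pooled gap is weighted by the number of questions per task, so it can differ from the plain average of the per-task gaps whenever tasks have unequal question counts.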

The benchmark shows that VLM failures are substantial and non-uniform across tasks, particularly in camera-depth-axis inference and in fallback behavior on difficult visibility-editing cases.