OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a panoramic canvas using 3D world coordinates. It achieves state-of-the-art results on SQA3D and VSI-Bench, with strong generalization on SPBench, using significantly less training compute than prior methods.
OneCanvas: 3D Scene Understanding via Panoramic Reprojection
from English