ViGiL3D++ introduces a scalable, scene-agnostic method for generating diverse visual grounding queries: it samples constraints from scene graphs and uses a large language model to express them in natural language. It outperforms existing models on multiple 3D visual grounding benchmarks and reveals key limitations of current vision-language models.
ViGiL3D++ Enables Diverse Language Generation for 3D Visual Grounding
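The summary describes the pipeline only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes a toy scene graph (the `OBJECTS` and `RELATIONS` data, and every function name, are hypothetical) and shows one plausible reading of "constraint sampling": pick randomly ordered attribute and relation constraints until the target object is the only match, then format them into a prompt that a language model could rephrase. No language model is called here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    subject: str    # id of the object the relation starts from
    predicate: str  # e.g. "next to", "on"
    obj: str        # id of the object the relation points to


# Hypothetical toy scene graph: nodes with attributes, plus directed relations.
OBJECTS = {
    "chair_1": {"category": "chair", "color": "red"},
    "chair_2": {"category": "chair", "color": "blue"},
    "table_1": {"category": "table", "color": "brown"},
    "lamp_1": {"category": "lamp", "color": "white"},
}
RELATIONS = [
    Relation("chair_1", "next to", "table_1"),
    Relation("chair_2", "next to", "lamp_1"),
    Relation("lamp_1", "on", "table_1"),
]


def candidate_constraints(target):
    """List every attribute and outgoing relation that holds for the target."""
    attrs = [("attribute", k, v) for k, v in OBJECTS[target].items() if k != "category"]
    rels = [
        ("relation", r.predicate, OBJECTS[r.obj]["category"])
        for r in RELATIONS
        if r.subject == target
    ]
    return attrs + rels


def satisfies(obj_id, constraint):
    """Check whether a single constraint holds for the given object."""
    kind, key, value = constraint
    if kind == "attribute":
        return OBJECTS[obj_id].get(key) == value
    return any(
        r.subject == obj_id and r.predicate == key and OBJECTS[r.obj]["category"] == value
        for r in RELATIONS
    )


def matching_objects(category, constraints):
    """Return ids of all objects of this category that meet every constraint."""
    return [
        oid
        for oid, attrs in OBJECTS.items()
        if attrs["category"] == category and all(satisfies(oid, c) for c in constraints)
    ]


def sample_constraints(target, rng):
    """Add randomly ordered constraints until the target is the unique match."""
    category = OBJECTS[target]["category"]
    pool = candidate_constraints(target)
    rng.shuffle(pool)
    chosen = []
    for constraint in pool:
        if matching_objects(category, chosen) == [target]:
            break
        chosen.append(constraint)
    if matching_objects(category, chosen) != [target]:
        return None  # the available constraints cannot single out the target
    return chosen


def build_prompt(target, constraints):
    """Format the sampled constraints as an instruction for a language model."""
    category = OBJECTS[target]["category"]
    facts = [
        f"its {key} is {value}" if kind == "attribute" else f"it is {key} a {value}"
        for kind, key, value in constraints
    ]
    if not facts:
        return f"Write one natural sentence referring to the {category}."
    return f"Write one natural sentence referring to the {category} for which: " + "; ".join(facts) + "."


if __name__ == "__main__":
    rng = random.Random(0)
    sampled = sample_constraints("chair_1", rng)
    print(build_prompt("chair_1", sampled))
```

Because the sampler works only on graph structure, the same code applies to any scene that can be expressed as nodes and relations, which is one way to read the "scene-agnostic" claim. Varying the random seed or the constraint types changes which description is produced for the same target.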