Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints

The authors introduce OVBEVSeg, a framework for open-vocabulary bird's-eye-view (BEV) segmentation that uses vision-language models to recognize categories beyond the training set while maintaining real-time efficiency. To address the 3D geometric inconsistency that arises when 2D semantics are lifted into BEV, the method applies 3D geometric constraints across three progressive stages.
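
To make the lifting step concrete, the following is a minimal sketch of how a single image pixel with an estimated depth lands in a BEV grid cell under a plain pinhole camera model. It is not OVBEVSeg's Gaussian splatting-based unprojection; the function name, the camera convention (x right, z forward), the intrinsics, and the grid parameters are all assumptions chosen for illustration. It shows why geometry matters: an error in depth moves the same pixel's semantics to a different BEV cell.

```python
import math


def pixel_to_bev_cell(u, depth, fx, cx, x_min=-50.0, z_min=0.0, cell_size=0.5):
    """Map an image column u with metric depth to a BEV cell (row, col).

    Hypothetical helper, assuming a pinhole camera with x pointing right
    and z pointing forward. The image row is omitted because it only
    affects height, which the BEV plane discards.
    """
    # Pinhole unprojection of the lateral coordinate.
    x = (u - cx) * depth / fx
    # The forward coordinate is the depth itself.
    z = depth
    # Discretize ground-plane coordinates into grid indices.
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((z - z_min) / cell_size)
    return row, col


# Same pixel, two depth estimates (20 m vs. 22 m), illustrative intrinsics.
cell_a = pixel_to_bev_cell(1000, 20.0, fx=1000.0, cx=800.0)
cell_b = pixel_to_bev_cell(1000, 22.0, fx=1000.0, cx=800.0)
print(cell_a, cell_b)  # (40, 108) (44, 108)
```

With 0.5 m cells, a 2 m depth error shifts the label four rows forward in the grid, which is the kind of inconsistency that geometric constraints are meant to suppress.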

First, OVBEVSeg strengthens its efficient Gaussian splatting-based unprojection with reliable 3D projection, which supports open-vocabulary generalization.
Second, it performs joint 2D-BEV per-scene optimization with structural constraints to ensure 3D geometric consistency.
Third, it applies 3D geometric distillation to achieve online efficiency.
On the nuScenes dataset, it outperforms closed-set methods by 15.3 mIoU on unseen categories without using ground-truth labels for the novel classes.
Compared with projection-based methods, it runs inference 2.5x faster while using only 0.22x the memory.
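
For readers unfamiliar with the metric, the sketch below computes mean intersection-over-union (mIoU) restricted to a chosen set of class IDs, which is how a score "on unseen categories" is typically obtained. It uses the standard definition of IoU; the exact evaluation protocol of the paper is not stated here, and the function name and the toy labels are assumptions for illustration only.

```python
def mean_iou(pred, target, class_ids):
    """Mean IoU over class_ids for flat lists of per-cell labels.

    Hypothetical helper using the standard definition:
    IoU_c = |pred == c AND target == c| / |pred == c OR target == c|.
    Classes absent from both pred and target are skipped.
    """
    ious = []
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy flattened BEV grid with three classes; treat classes 1 and 2 as "unseen".
pred = [0, 1, 1, 2, 2, 2]
target = [0, 1, 2, 2, 2, 1]
novel_miou = mean_iou(pred, target, class_ids=[1, 2])
print(round(novel_miou, 4))  # 0.4167
```

Here class 1 has IoU 1/3 and class 2 has IoU 1/2, so the mean over the two held-out classes is about 0.417. A reported gap such as 15.3 mIoU is a difference between two such averages, expressed in percentage points.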

This approach allows for precise BEV perception in unpredictable real-world environments by leveraging vision-language models, remaining competitive with supervised baselines while significantly reducing computational costs.