The authors introduce OVBEVSeg, a framework for open-vocabulary bird's-eye view (BEV) segmentation that uses vision-language models to recognize categories beyond the training set while maintaining real-time efficiency. To address the 3D geometric inconsistency inherent in lifting 2D semantics into BEV, the method applies 3D geometric constraints across three progressive stages: unprojection, per-scene optimization, and distillation.
- OVBEVSeg strengthens efficient Gaussian splatting-based unprojection with reliable 3D projection to support open-vocabulary generalization.
- It performs joint 2D-BEV per-scene optimization with structural constraints to ensure 3D geometric consistency.
- The framework applies 3D geometric distillation to achieve online efficiency.
- On the nuScenes dataset, it outperforms closed-set methods by 15.3 mIoU on unseen categories, without using ground-truth labels for the novel classes.
- Compared with projection-based methods, it runs inference 2.5x faster while using only 0.22x the memory.
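The headline result is stated as mIoU restricted to unseen categories. As a rough illustration of that metric only, the sketch below computes mean IoU over a chosen subset of class ids on integer-labeled BEV grids. The function name `miou_over_classes`, the toy grids, and the class ids are hypothetical and are not taken from the paper or its evaluation code.

```python
from typing import List, Sequence


def miou_over_classes(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    class_ids: Sequence[int],
) -> float:
    """Mean IoU over `class_ids`, skipping classes absent from both grids."""
    ious: List[float] = []
    for c in class_ids:
        inter = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                p_hit = p == c
                t_hit = t == c
                inter += int(p_hit and t_hit)
                union += int(p_hit or t_hit)
        if union > 0:  # a class that never appears contributes nothing
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 2x3 BEV grids; class 0 stands in for a "seen" class and
# classes 1 and 2 for "unseen" ones (assumed ids, for illustration).
target = [[0, 1, 1],
          [0, 2, 2]]
pred = [[0, 1, 2],
        [0, 2, 2]]

unseen_miou = miou_over_classes(pred, target, class_ids=[1, 2])
seen_miou = miou_over_classes(pred, target, class_ids=[0])
```

Restricting `class_ids` to the held-out categories is what separates an unseen-category score from the overall mIoU; a gap such as 15.3 mIoU would be the difference between two such scores.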
By leveraging vision-language models, the approach supports accurate BEV perception in unpredictable real-world environments, remains competitive with supervised baselines, and significantly reduces computational cost.
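The summary does not describe how BEV features are matched to category names. A common pattern with vision-language models is to compare each feature with text embeddings of candidate class names and pick the most similar one. The sketch below illustrates that generic pattern under stated assumptions: `label_bev_cells`, the 2-D toy embeddings, and the class names are hypothetical, and this is not the authors' pipeline.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_bev_cells(
    cell_features: Sequence[Sequence[Sequence[float]]],
    text_embeddings: Dict[str, Sequence[float]],
) -> List[List[str]]:
    """Assign each BEV cell the class name whose text embedding is closest."""
    names = list(text_embeddings)
    return [
        [max(names, key=lambda n: cosine(feat, text_embeddings[n])) for feat in row]
        for row in cell_features
    ]


# Toy embeddings (assumed): a real system would obtain these from a
# vision-language model's text encoder, and the vocabulary can be
# extended at query time without retraining.
text_embeddings = {"car": [1.0, 0.0], "traffic cone": [0.0, 1.0]}
cell_features = [[[0.9, 0.1], [0.2, 0.8]]]

labels = label_bev_cells(cell_features, text_embeddings)
```

Because the class list is only a dictionary of text embeddings, adding a new category amounts to adding one entry, which is what makes this style of labeling open-vocabulary.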