G$^3$VLA: Geometric Inductive Bias for Vision-Language-Action Models
The authors propose G$^3$VLA, a camera-aware geometric module that injects calibrated structure into the visual-token stream of pretrained Vision-Language-Action models without altering their action space or imitation objective. This approach combines intrinsic-conditioned ray embeddings, projective positional encoding, and bidirectional cross-view fusion to address the mismatch between 2D image coordinates and robot camera geometry.
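To make the idea of intrinsic-conditioned ray embeddings concrete, the sketch below back-projects the center of each image patch through a pinhole intrinsic matrix to obtain a unit viewing ray per visual token. This is only an illustration of the standard pinhole relation $d \propto K^{-1}[u, v, 1]^\top$. The summary does not specify how G$^3$VLA builds its embeddings, so the function name `patch_ray_directions`, the patch size of 16, the 224x224 input, and the example intrinsics are all assumptions rather than the authors' implementation.

```python
import numpy as np

def patch_ray_directions(K, image_hw, patch=16):
    """Unit ray directions (camera frame) through each patch center.

    Hypothetical helper for illustration; not taken from the G^3VLA paper.
    K is a 3x3 pinhole intrinsic matrix, image_hw is (height, width) in pixels.
    """
    h, w = image_hw
    # Pixel coordinates of patch centers along each axis.
    us = np.arange(patch / 2, w, patch)
    vs = np.arange(patch / 2, h, patch)
    u, v = np.meshgrid(us, vs)  # each of shape (rows, cols)
    # Homogeneous pixel coordinates [u, v, 1] for every patch.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Back-project: d = K^{-1} [u, v, 1]^T, applied row-wise.
    rays = pix @ np.linalg.inv(K).T
    # Normalize to unit length so only the direction is kept.
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    return rays

# Assumed example intrinsics: focal length 200 px, principal point at image center.
K = np.array([[200.0, 0.0, 112.0],
              [0.0, 200.0, 112.0],
              [0.0, 0.0, 1.0]])
rays = patch_ray_directions(K, (224, 224), patch=16)  # one ray per visual token
```

Because the rays depend on `K` rather than on raw pixel indices, two cameras with different focal lengths yield different ray fields for the same patch grid, which is the kind of calibrated structure that a plain 2D positional encoding cannot express. In a model, these directions would typically be passed through a learned projection before being added to the visual tokens; that step is omitted here.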