The authors propose G$^3$VLA, a camera-aware geometric module that injects calibrated camera geometry into the visual-token stream of pretrained Vision-Language-Action (VLA) models without altering their action space or imitation objective. The module combines intrinsic-conditioned ray embeddings, projective positional encoding, and bidirectional cross-view fusion to address the mismatch between 2D image coordinates and robot camera geometry.
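
To make "intrinsic-conditioned ray embeddings" concrete, the following is a minimal sketch of one common way to turn a pinhole intrinsic matrix into per-token ray directions. It is not taken from the paper: the function names `pixel_rays` and `patch_ray_embedding`, the half-pixel centre convention, and the patch-averaging step are all illustrative assumptions, and the actual module may encode rays differently.

```python
import numpy as np

def pixel_rays(K, height, width):
    """Unit ray directions in the camera frame, one per pixel centre.

    Assumes a pinhole camera with 3x3 intrinsic matrix K and pixel
    centres at (u + 0.5, v + 0.5). Both conventions are assumptions.
    """
    u = np.arange(width, dtype=np.float64) + 0.5
    v = np.arange(height, dtype=np.float64) + 0.5
    uu, vv = np.meshgrid(u, v)  # each of shape (H, W)
    pix = np.stack([uu, vv, np.ones_like(uu)], axis=-1)  # homogeneous pixels, (H, W, 3)
    rays = pix @ np.linalg.inv(K).T  # back-project: K^{-1} [u, v, 1]^T per pixel
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def patch_ray_embedding(rays, patch):
    """Average rays over non-overlapping patches and renormalise.

    Hypothetical pooling step so that each visual token gets one ray;
    height and width must be divisible by `patch`.
    """
    H, W, _ = rays.shape
    pooled = rays.reshape(H // patch, patch, W // patch, patch, 3).mean(axis=(1, 3))
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

# Example: a 64x64 image with focal length 100 px and a centred principal point.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
token_rays = patch_ray_embedding(pixel_rays(K, 64, 64), patch=16)  # shape (4, 4, 3)
```

Because the rays depend on $K$ rather than on raw pixel indices, two cameras with different focal lengths produce different embeddings for the same pixel location, which is the kind of camera awareness the summary describes.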

  • G$^3$VLA is geometrically supervised with ground-truth point maps or confidence-gated $π^3$X teacher predictions, so it requires no depth sensors or manual annotations.
  • Instantiated on $π_0$, the model yields consistent gains across LIBERO suites, RoboCasa24, RoboTwin2.0, and real-robot settings.
  • The largest improvements are observed on spatially sensitive and object-sensitive tasks.
  • Validation on $π_{0.5}$ and GR00T 1.5 suggests geometric transfer is most effective when geometry-aware tokens have direct access to the action generation pathway.
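
The confidence-gated supervision mentioned in the list above can be illustrated with a small sketch. The summary does not say how the gating is implemented, so everything here is an assumption: a hard threshold `tau` on a per-pixel teacher confidence map, an L1 point-map error, and the hypothetical function name `gated_pointmap_loss`.

```python
import numpy as np

def gated_pointmap_loss(pred, target, conf=None, tau=0.5):
    """Mean per-pixel L1 point-map error over pixels that pass the gate.

    pred, target: (H, W, 3) point maps.
    conf: (H, W) teacher confidence, or None when the target is ground
          truth and every pixel is trusted.
    tau: hypothetical hard threshold on the confidence.
    """
    err = np.abs(pred - target).sum(axis=-1)  # per-pixel L1 error, (H, W)
    if conf is None:
        mask = np.ones(err.shape, dtype=bool)  # ground truth: keep all pixels
    else:
        mask = conf >= tau  # teacher: keep only confident pixels
    if mask.sum() == 0:
        return 0.0  # nothing trusted in this sample, so no supervision signal
    return float(err[mask].mean())

# Example: one unreliable teacher pixel is excluded by the gate.
pred = np.zeros((2, 2, 3))
target = np.ones((2, 2, 3))
target[0, 1] = 100.0  # wildly wrong teacher estimate
conf = np.array([[0.9, 0.1],
                 [0.8, 0.7]])
loss = gated_pointmap_loss(pred, target, conf)
```

The point of the gate in this sketch is that low-confidence teacher pixels contribute nothing to the loss, so a pseudo-label source can stand in for ground-truth point maps without its worst errors dominating training.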

The authors consider the module important because it lets pretrained VLAs use calibrated camera geometry, addressing a key limitation in multi-camera setups, where views are coupled by known intrinsics and extrinsics.