This article introduces a continuous Latent Bridge that couples a frozen reactive vision-language model with a frozen reasoning vision-language model, giving real-time game agents both millisecond latency and long-horizon planning. The bridge projects the slow (reasoning) model's residual-stream states into the fast (reactive) model's input-embedding space, so no text round-trip is needed, and it matches or beats the conventional Text Bridge in performance.
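The summary does not spell out the bridge's exact form. The following is a minimal sketch that assumes the bridge is a single learned affine projection and that projected slow-model states are prepended to the fast model's input embeddings as "soft tokens". The names `make_projection`, `project`, and `latent_bridge_inputs` are hypothetical, and the dimensions are toy values. Real VLM hidden sizes are far larger, and the actual method may inject the states differently.

```python
import math
import random


def make_projection(d_slow, d_fast, seed=0):
    # Hypothetical bridge parameters: W maps R^{d_slow} -> R^{d_fast}, plus bias b.
    # Randomly initialised here; presumably these would be trained while both VLMs stay frozen.
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_slow)
    W = [[rng.gauss(0.0, scale) for _ in range(d_slow)] for _ in range(d_fast)]
    b = [0.0] * d_fast
    return W, b


def project(h_slow, W, b):
    # Affine map W @ h + b, written with plain lists to stay dependency-free.
    return [sum(w * h for w, h in zip(row, h_slow)) + bias for row, bias in zip(W, b)]


def latent_bridge_inputs(slow_states, fast_embeddings, W, b):
    # Assumption: projected slow-model states are prepended as soft tokens ahead of
    # the fast model's own token embeddings. Nothing is decoded to text.
    d_fast = len(W)
    for emb in fast_embeddings:
        if len(emb) != d_fast:
            raise ValueError("fast embeddings must match the projection's output size")
    soft_tokens = [project(h, W, b) for h in slow_states]
    return soft_tokens + fast_embeddings


if __name__ == "__main__":
    W, b = make_projection(d_slow=8, d_fast=4)
    slow = [[0.1] * 8, [0.2] * 8]           # two slow-model state vectors
    fast = [[0.0] * 4 for _ in range(3)]    # three fast-model token embeddings
    seq = latent_bridge_inputs(slow, fast, W, b)
    print(len(seq), len(seq[0]))  # 5 4
```

Because the fast model only ever sees vectors in its own embedding space, its weights can stay untouched. This is one plausible reading of why such a bridge could act as a drop-in component.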

  • The Latent Bridge matches or beats the Text Bridge across 7 Atari games and the MetaDrive driving domain.
  • Over the reactive-only baseline, it raises scores by 57% on MsPacman and 28% on RoadRunner.
  • Combining both channels (latent and text) causes destructive interference, cutting RoadRunner performance by 96%.
  • The bridge's benefit is highly predictable: it correlates at r = 0.93 with the gain that slow reasoning provides over fast reaction.
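The summary does not state how the percentages or the r = 0.93 figure are computed. The sketch below assumes plain relative change against the reactive baseline's raw score and a Pearson correlation over per-environment gains. The paper may instead use normalized scores or a different baseline. The function names are hypothetical, and the toy scores in the usage block are placeholders, not the paper's data.

```python
import math


def relative_change(score, baseline):
    # Percent change of a score relative to a positive baseline score.
    # Positive values are improvements (e.g. +57); negative values are drops (e.g. -96).
    if baseline <= 0:
        raise ValueError("baseline must be positive for a percentage change")
    return 100.0 * (score - baseline) / baseline


def pearson_r(xs, ys):
    # Sample Pearson correlation coefficient between two equal-length sequences.
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy per-environment scores (NOT the paper's data): (fast, slow, fast + bridge).
    envs = {
        "env_a": (100.0, 180.0, 150.0),
        "env_b": (200.0, 220.0, 210.0),
        "env_c": (50.0, 40.0, 49.0),
    }
    slow_gain = [relative_change(s, f) for f, s, _ in envs.values()]
    bridge_gain = [relative_change(fb, f) for f, _, fb in envs.values()]
    print(f"r = {pearson_r(slow_gain, bridge_gain):.2f}")
```

Under these assumptions, a high r means that environments where the slow reasoner clearly beats the fast reactor are also the ones where the bridge helps most.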

The approach offers a safe drop-in solution for agents that need both rapid action and complex planning. The reproducible pipelines and replay recordings have been released.