The author has released an open-source, fully local speech-to-speech backend for large language model (LLM) NPCs that enables direct NPC-to-NPC interactions without cloud dependency. The system chains speech-to-text, a local LLM, and text-to-speech so that NPCs can converse with each other, retain context, and carry what they said into future player interactions.

  • The latency target is 400-600 ms Time to First Audio (TTFA), using Llama 3.2 3B for VR or a 7B model on a 4070 Ti, to approximate natural conversation flow.
  • A shared generation lock ensures only one NPC generates audio at a time, preventing GPU overload while allowing instant character switching.
  • The architecture is WebSocket-based, supporting integration with Unity, Unreal, and other engines via provided scripts.
  • A background Game Manager AI injects behavioral notes to steer the narrative, while NPCs maintain individual contexts and personalities.
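
The shared generation lock described above can be illustrated with a minimal sketch. This is not the project's code: it assumes an asyncio-based server, and the names `Npc`, `fake_generate`, and `speak` are hypothetical. `fake_generate` is a stand-in for the real LLM and text-to-speech calls. The point is only that every NPC acquires the same lock before generating, so generations never overlap, while each NPC keeps its own context.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Npc:
    """Hypothetical NPC record: a personality plus its own private context."""
    name: str
    personality: str
    context: list = field(default_factory=list)


async def fake_generate(npc: Npc, heard: str) -> str:
    # Stand-in for the LLM + TTS step; the sleep simulates GPU work.
    await asyncio.sleep(0.01)
    return f"{npc.name} ({npc.personality}) replies to: {heard}"


async def speak(npc: Npc, heard: str, lock: asyncio.Lock, log: list) -> str:
    # All NPCs share one lock, so only one generation runs at a time.
    async with lock:
        log.append(f"start:{npc.name}")
        reply = await fake_generate(npc, heard)
        # Each NPC records the exchange in its own context only.
        npc.context.extend([heard, reply])
        log.append(f"end:{npc.name}")
    return reply


async def demo():
    lock = asyncio.Lock()
    log = []
    guard = Npc("Guard", "gruff")
    merchant = Npc("Merchant", "chatty")
    # Both NPCs are asked to respond at once; the lock serializes them.
    await asyncio.gather(
        speak(guard, "Nice weather today.", lock, log),
        speak(merchant, "Nice weather today.", lock, log),
    )
    return log, guard, merchant


def run_demo():
    return asyncio.run(demo())


if __name__ == "__main__":
    events, _, _ = run_demo()
    print(events)
```

In this sketch, switching speakers costs nothing beyond waiting for the lock, because no per-character model is loaded or unloaded; how the actual backend achieves its instant character switching is not stated in the summary.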

This solution lets developers implement self-sustaining NPC dialogues that deepen immersion by letting players witness organic interactions rather than only receive direct answers.