A developer has built a game-agnostic NPC engine backend that runs on small local models, aiming for fast response times and decent quality in role-playing games. The system uses NVIDIA Parakeet 0.6 for speech-to-text, Gemma 4 26B A4B as the LLM, and Qwen3-TTS for voice synthesis.
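
The three stages described above form a speech-in, speech-out loop. The following is a minimal sketch of that shape, not the developer's actual code: `NpcPipeline`, its field names, and the stub stages are all hypothetical, and the real models are replaced by trivial lambdas so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class NpcPipeline:
    """Hypothetical wrapper: each stage is a plain callable, so any
    speech-to-text, LLM, or TTS backend could be swapped in."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    generate: Callable[[str], str]       # LLM stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def respond(self, audio_in: bytes) -> bytes:
        player_text = self.transcribe(audio_in)   # player audio -> text
        npc_reply = self.generate(player_text)    # text -> NPC reply
        return self.synthesize(npc_reply)         # reply -> NPC audio


# Stub stages stand in for the real models (assumption for illustration).
pipeline = NpcPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    generate=lambda text: f"NPC heard: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Keeping each stage behind a simple callable is one way to stay independent of any particular game or model; the source does not say how the engine actually wires the stages together.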

  • The architecture is heavily inspired by SillyTavern.
  • Retrieval-Augmented Generation (RAG) is used to keep prompts lean by injecting only contextually relevant actions from a large pool.
  • This approach prevents the model from being overloaded with giant lists of available actions every turn.
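
The retrieval idea in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's implementation: the action pool, the function names, and the stopword list are hypothetical, and a bag-of-words cosine similarity stands in for whatever embedding model the real system uses.

```python
import math
import re
from collections import Counter

# Hypothetical action pool; names and descriptions are illustrative only.
ACTIONS = {
    "open_shop": "open the shop trade menu to buy or sell items goods",
    "attack_target": "attack fight draw weapon and strike an enemy",
    "give_quest": "offer the player a quest task job or mission",
    "follow_player": "follow the player and travel together as a companion",
    "heal_player": "heal wounds cast a healing spell or give a potion",
}

# Small stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "to", "or", "and", "i", "can", "of", "as"}


def vectorize(text):
    """Bag-of-words vector (stand-in for a real embedding)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_actions(utterance, actions, top_k=2):
    """Return the names of the top_k actions most similar to the utterance."""
    query = vectorize(utterance)
    scored = [(cosine(query, vectorize(desc)), name)
              for name, desc in actions.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def build_prompt(utterance, actions, top_k=2):
    """Inject only the retrieved actions into the prompt, not the whole pool."""
    names = retrieve_actions(utterance, actions, top_k)
    lines = ["Available actions:"]
    lines += [f"- {name}: {actions[name]}" for name in names]
    lines.append(f"Player: {utterance}")
    return "\n".join(lines)
```

Calling `build_prompt("Can I buy a healing potion?", ACTIONS)` lists only the healing and shop actions, so the prompt stays short however large the pool grows.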

The author suggests that this method could represent the future of RPGs as smaller local models continue to improve in capability.