This project enables voice chat with the Gemma 4 31B model through a 3D avatar that listens, speaks, and displays dynamic facial expressions and hand gestures. The system exposes function tools like set_mood, make_hand_gesture, and make_facial_expression to the LLM, allowing it to autonomously decide the avatar's reactions.
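
As a rough illustration of how such tools might be declared and handled, here is a minimal sketch assuming an OpenAI-style function-calling format. Only the three tool names come from the description above. The parameter names (mood, gesture, expression), the allowed values, and the helpers build_tool_schemas and dispatch_tool_call are hypothetical and are not taken from the project.

```python
import json

# Only the three tool names come from the project description; the parameter
# names and allowed values below are illustrative assumptions.
ALLOWED = {
    "set_mood": ("mood", ["neutral", "happy", "sad", "angry"]),
    "make_hand_gesture": ("gesture", ["wave", "thumbs_up", "shrug"]),
    "make_facial_expression": ("expression", ["smile", "frown", "surprised"]),
}


def build_tool_schemas():
    """Build OpenAI-style function-tool definitions to send with a chat request."""
    tools = []
    for name, (param, values) in ALLOWED.items():
        tools.append({
            "type": "function",
            "function": {
                "name": name,
                "description": f"Set the avatar's {param}.",
                "parameters": {
                    "type": "object",
                    "properties": {param: {"type": "string", "enum": values}},
                    "required": [param],
                },
            },
        })
    return tools


def dispatch_tool_call(name, arguments_json, avatar_state):
    """Validate one tool call from the model and record it in avatar_state."""
    if name not in ALLOWED:
        return {"ok": False, "error": f"unknown tool: {name}"}
    param, values = ALLOWED[name]
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return {"ok": False, "error": "arguments are not valid JSON"}
    value = args.get(param) if isinstance(args, dict) else None
    if value not in values:
        return {"ok": False, "error": f"invalid {param}: {value!r}"}
    avatar_state[param] = value  # a real client would forward this to the avatar renderer
    return {"ok": True}
```

Validating against a fixed list before touching the avatar state means a malformed or unexpected call from the model produces an error result instead of an undefined animation.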

  • The stack uses open models: Silero VAD for voice activity detection, Parakeet for speech-to-text (STT), Qwen3-TTS for speech synthesis, and Gemma 4 31B, served by Cerebras, as the LLM.
  • Audio travels as raw PCM over a plain WebSocket connection.
  • Lip-syncing and avatar rendering are handled by met4citizen's TalkingHead and HeadAudio projects.
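
To make the raw-PCM transport concrete, the sketch below packs audio samples into byte frames that could each be sent as one binary WebSocket message. The description does not state the audio format, so the 16 kHz sample rate, signed 16-bit little-endian encoding, mono channel, and 20 ms frame size are all assumptions, and the helper names are hypothetical.

```python
import struct

# Assumed audio format; none of these values are stated in the description.
SAMPLE_RATE = 16_000                                 # samples per second, mono
FRAME_MS = 20                                        # duration of one frame
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples
BYTES_PER_SAMPLE = 2                                 # signed 16-bit


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian PCM bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_to_floats(data):
    """Decode signed 16-bit little-endian PCM bytes back to floats."""
    count = len(data) // BYTES_PER_SAMPLE
    ints = struct.unpack(f"<{count}h", data[:count * BYTES_PER_SAMPLE])
    return [i / 32767 for i in ints]


def split_frames(pcm):
    """Split a PCM byte string into fixed-size frames; the last may be shorter."""
    frame_bytes = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```

Raw PCM carries no header, so both ends of the connection have to agree on the sample rate, sample width, and channel count in advance.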

This setup shows how separate open-source components can be combined into a real-time, interactive multimodal AI experience.