Researchers have publicly released Ex-Omni, a system that generates omni-modal responses from text or speech input. For each response, the model produces text, speech units or decoded audio, and 52-dimensional facial blendshape coefficients.

  • Generates text, speech, and 3D facial animation simultaneously.
  • Outputs 52-dimensional facial blendshape coefficients for realistic talking-face rendering.
  • Includes runtime modules for audio decoding and utilities for blendshape rendering.
  • Supports EmoTalk and Claire mesh templates for visualization.
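
For readers unfamiliar with blendshapes, the sketch below shows how a 52-dimensional coefficient vector can drive a mesh under the standard linear blendshape model: each posed vertex is the neutral vertex plus a weighted sum of per-shape offsets. This is an illustration only, not Ex-Omni's rendering code. The function name `apply_blendshapes`, the list-of-tuples mesh layout, and the two-vertex toy mesh are all hypothetical, and the release's own utilities and mesh templates may be organized differently.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

NUM_BLENDSHAPES = 52  # dimensionality of the coefficient vector stated in the release


def apply_blendshapes(
    neutral: Sequence[Vec3],
    deltas: Sequence[Sequence[Vec3]],
    weights: Sequence[float],
) -> List[Vec3]:
    """Pose a mesh with the linear blendshape model (hypothetical helper).

    neutral: vertex positions of the neutral face.
    deltas:  one list of per-vertex offsets for each blendshape.
    weights: one coefficient per blendshape for a single frame.
    """
    if len(weights) != NUM_BLENDSHAPES:
        raise ValueError(f"expected {NUM_BLENDSHAPES} weights, got {len(weights)}")
    if len(deltas) != len(weights):
        raise ValueError("need exactly one delta set per blendshape weight")

    posed: List[Vec3] = []
    for v_idx, (x, y, z) in enumerate(neutral):
        # Accumulate the weighted offset contributed by every blendshape.
        for w, shape in zip(weights, deltas):
            dx, dy, dz = shape[v_idx]
            x += w * dx
            y += w * dy
            z += w * dz
        posed.append((x, y, z))
    return posed


# Toy two-vertex mesh (illustrative data, not a real template).
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
zero_shape = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
deltas = [list(zero_shape) for _ in range(NUM_BLENDSHAPES)]
deltas[0] = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0)]  # shape 0 lifts vertex 0 along y

weights = [0.0] * NUM_BLENDSHAPES
weights[0] = 0.5  # activate shape 0 at half strength

posed = apply_blendshapes(neutral, deltas, weights)
print(posed)
```

Applying this once per frame to a sequence of coefficient vectors would yield an animated mesh. A real renderer would typically vectorize the sum over vertices rather than loop in pure Python.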

The release provides a complete inference pipeline and a Gradio interface, so users can deploy the system locally for omni-modal interaction.