Researchers have publicly released Ex-Omni, a system that generates omni-modal responses from text or speech input. For each input, the model produces response text, speech (as speech units or decoded audio), and 52-dimensional facial blendshape coefficients.
- Generates text, speech, and 3D facial animation simultaneously.
- Outputs 52-dimensional facial blendshape coefficients for realistic talking-face rendering.
- Includes runtime modules for audio decoding, along with utilities for rendering blendshapes.
- Supports EmoTalk and Claire mesh templates for visualization.
The release provides a complete inference pipeline and a Gradio interface, so users can deploy the system locally for multi-modal interaction.
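
To make the three output streams concrete, here is a minimal sketch of how one response could be held in memory. It is an illustration only, not the Ex-Omni API: the `OmniResponse` class, its field names, the `duration_seconds` helper, and the 30 fps default are all hypothetical. The only detail taken from the release is that each animation frame carries 52 blendshape coefficients.

```python
from dataclasses import dataclass, field
from typing import List

# Dimensionality of the blendshape output, as stated in the release.
NUM_BLENDSHAPES = 52


@dataclass
class OmniResponse:
    """Hypothetical container for one response; not part of Ex-Omni."""

    text: str
    # Assumed representation: speech units as a sequence of integer IDs.
    speech_units: List[int] = field(default_factory=list)
    # One list of 52 coefficients per animation frame.
    blendshape_frames: List[List[float]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject frames that do not have exactly 52 coefficients.
        for i, frame in enumerate(self.blendshape_frames):
            if len(frame) != NUM_BLENDSHAPES:
                raise ValueError(
                    f"frame {i} has {len(frame)} coefficients, "
                    f"expected {NUM_BLENDSHAPES}"
                )

    @property
    def num_frames(self) -> int:
        return len(self.blendshape_frames)


def duration_seconds(response: OmniResponse, fps: float = 30.0) -> float:
    """Animation length in seconds; the fps default is an assumption."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return response.num_frames / fps


if __name__ == "__main__":
    # Three neutral frames (all coefficients zero) as placeholder data.
    frames = [[0.0] * NUM_BLENDSHAPES for _ in range(3)]
    reply = OmniResponse(text="Hello!", speech_units=[12, 7, 7, 41],
                         blendshape_frames=frames)
    print(reply.num_frames, duration_seconds(reply))
```

Validating the frame width at construction time means a mismatched coefficient vector is caught before it reaches a renderer. The frame rate actually used by the release should be taken from its own documentation.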