Ex-Omni enables 3D facial animation generation for omni-modal LLMs

Researchers have released Ex-Omni, a publicly available system that generates omni-modal responses from text or speech input. For each response, the model produces text, speech (as speech units or decoded audio), and 52-dimensional facial blendshape coefficients.

Generates text, speech, and 3D facial animation simultaneously.
Outputs 52-dimensional facial blendshape coefficients for realistic talking-face rendering.
Includes runtime modules for audio decoding and utilities for blendshape rendering.
Supports EmoTalk and Claire mesh templates for visualization.
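
To make the blendshape output concrete, the sketch below shows how a single frame of 52 coefficients could drive a mesh template under the standard linear blendshape model: each vertex is the neutral position plus a weighted sum of per-shape offsets. This is not Ex-Omni's rendering code. The function name `apply_blendshapes`, the toy two-vertex template, and the data layout are all assumptions made for illustration, and a real renderer would likely use an array library rather than plain lists.

```python
from typing import List, Sequence

Vertex = List[float]  # [x, y, z]

NUM_BLENDSHAPES = 52  # coefficient count stated in the release summary


def apply_blendshapes(
    neutral: Sequence[Vertex],
    deltas: Sequence[Sequence[Vertex]],
    coeffs: Sequence[float],
) -> List[Vertex]:
    """Return deformed vertices: neutral + sum_i coeffs[i] * deltas[i].

    Hypothetical helper, not part of the Ex-Omni release.
    """
    if len(coeffs) != len(deltas):
        raise ValueError("need one coefficient per blendshape")
    out = [list(v) for v in neutral]  # copy so the template stays untouched
    for weight, shape in zip(coeffs, deltas):
        if weight == 0.0:
            continue  # inactive shapes contribute nothing
        for vi, offset in enumerate(shape):
            for axis in range(3):
                out[vi][axis] += weight * offset[axis]
    return out


# Toy template (assumption): 2 vertices, 52 shapes; only shape 0 moves
# vertex 1, pulling it down along the y axis.
neutral = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
deltas = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]] for _ in range(NUM_BLENDSHAPES)]
deltas[0][1] = [0.0, -0.5, 0.0]

# One animation frame: shape 0 half active, all others at rest.
frame = [0.0] * NUM_BLENDSHAPES
frame[0] = 0.5

posed = apply_blendshapes(neutral, deltas, frame)
print(posed[1])  # [0.0, 0.75, 0.0]
```

An animation is then a sequence of such frames, one coefficient vector per time step, applied to the same template.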

The release provides a complete inference pipeline and a Gradio interface, so users can run the system locally for omni-modal interaction.