The open-source project audio.cpp is a native C++ inference framework for audio models, built on ggml. It currently supports 12 released model families, including TTS, ASR, and voice conversion. In benchmarks on Ubuntu/CUDA, text-to-speech inference in this runtime is up to about 5x faster than the corresponding Python reference implementations.
- Released models include Qwen3-TTS, PocketTTS, Vevo2, Chatterbox, MioTTS, OmniVoice, VoxCPM2, Qwen3-ASR, Seed-VC, MioCodec, Silero VAD, and Qwen3 Forced Aligner.
- PocketTTS achieves a 3.68x speedup on 1-shot runs and generates audio at 48.40x real-time speed for long-form inputs.
- Vevo2 reaches a 5.03x speedup on 1-shot runs, while Qwen3-TTS shows up to 3.06x improvement on long-form generation.
- The framework provides a shared runtime, session handling, and CLI workflows, so complex pipelines such as same-language redubbing can run from a single command.
This unified C++ approach removes the need for a separate Python environment per model, which simplifies deployment, and the benchmarks above indicate roughly 3x to 5x faster TTS inference than the Python references.
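The summary does not say how the speedup and real-time figures were measured. As a minimal sketch, assuming the conventional definitions (speedup is reference wall-clock time divided by native wall-clock time, and the real-time multiple is seconds of audio produced per second of wall-clock time), the arithmetic looks like this. The function names `speedup` and `realtime_multiple` are hypothetical and are not part of audio.cpp.

```cpp
#include <stdexcept>

// Hypothetical helper (not an audio.cpp API).
// Assumed definition: speedup = reference time / native time,
// so a value of 5.0 means the native run took one fifth of the time.
double speedup(double reference_seconds, double native_seconds) {
    if (reference_seconds < 0.0 || native_seconds <= 0.0) {
        throw std::invalid_argument("timings must be positive");
    }
    return reference_seconds / native_seconds;
}

// Hypothetical helper (not an audio.cpp API).
// Assumed definition: real-time multiple = audio duration / wall-clock time,
// so a value above 1.0 means audio is generated faster than it plays back.
double realtime_multiple(double audio_seconds, double wall_seconds) {
    if (audio_seconds < 0.0 || wall_seconds <= 0.0) {
        throw std::invalid_argument("durations must be positive");
    }
    return audio_seconds / wall_seconds;
}
```

Under these assumptions, a run that produces 48 seconds of audio in 1 second of wall-clock time has a real-time multiple of 48, and a native run taking 2 seconds against a 10-second reference run has a speedup of 5.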