The author of the audio.cpp runtime has added support for the VibeVoice 1.5B model, enabling long-form multi-speaker text-to-speech generation in a native C++/ggml environment.
- Benchmarks on an RTX 5090 show VibeVoice generating 93.6 minutes of audio in 22.95 minutes of wall-clock time, roughly 4.08x faster than real time.
- This represents a 2.86x speedup compared to a Python baseline without quantization.
- The runtime aims to provide reusable sessions, stable memory behavior, and CUDA-focused optimization for local inference.
This addition makes long-form audio models more practical for local use by avoiding Python setup overhead and offering optimized performance for dialogue and narration tasks.
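
The benchmark figures above can be checked with a little arithmetic. The following is a minimal sketch, not code from audio.cpp: the function names are hypothetical, and the baseline calculation assumes the 2.86x speedup is a ratio of wall-clock times on the same workload, which the summary does not state explicitly.

```cpp
#include <cassert>
#include <cstdio>

// Real-time factor as used in the summary: minutes of audio produced per
// minute of wall-clock time. Values above 1 mean faster than real time.
double real_time_factor(double audio_minutes, double wall_minutes) {
    assert(wall_minutes > 0.0);
    return audio_minutes / wall_minutes;
}

// ASSUMPTION: the reported speedup is (baseline wall time) / (runtime wall
// time) for the same job. Under that reading, the baseline wall time is the
// runtime wall time multiplied by the speedup.
double implied_baseline_minutes(double wall_minutes, double speedup) {
    return wall_minutes * speedup;
}

// Hypothetical helper that prints the two derived figures for the numbers
// quoted in the summary (93.6 min of audio, 22.95 min wall time, 2.86x).
void print_benchmark_summary() {
    const double audio_min = 93.6;
    const double wall_min = 22.95;
    const double speedup = 2.86;
    std::printf("real-time factor: %.2fx\n",
                real_time_factor(audio_min, wall_min));
    std::printf("implied baseline: %.1f min\n",
                implied_baseline_minutes(wall_min, speedup));
}
```

Calling `print_benchmark_summary()` from a `main` function reproduces the reported 4.08x figure and, under the stated assumption, suggests a Python baseline of roughly 65.6 minutes for the same 93.6 minutes of audio. That baseline is an inference from the quoted speedup, not a measured result.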