The author of the audio.cpp runtime has added support for the VibeVoice 1.5B model, enabling long-form multi-speaker text-to-speech generation in a native C++/ggml environment.

  • Benchmarks on an RTX 5090 show VibeVoice generating 93.6 minutes of audio in 22.95 minutes of wall-clock time (about 4.08x faster than real time).
  • This represents a 2.86x speedup compared to a Python baseline without quantization.
  • The runtime aims to provide reusable sessions, stable memory behavior, and CUDA-focused optimization for local inference.
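
The real-time figure in the benchmark above is simply audio duration divided by generation time. As a minimal sketch for checking the arithmetic (the function names are hypothetical and not part of audio.cpp), the snippet below recomputes it. The second helper rests on an assumption the summary does not state: that the 2.86x speedup was measured on the same 93.6-minute workload.

```python
def real_time_factor(audio_minutes: float, wall_clock_minutes: float) -> float:
    """Minutes of audio produced per minute of wall-clock generation time."""
    if wall_clock_minutes <= 0:
        raise ValueError("wall_clock_minutes must be positive")
    return audio_minutes / wall_clock_minutes


def implied_baseline_minutes(wall_clock_minutes: float, speedup: float) -> float:
    """Baseline run time implied by a speedup factor.

    ASSUMPTION: the baseline processed the same workload; the source
    does not confirm this, so treat the result as an estimate only.
    """
    return wall_clock_minutes * speedup


if __name__ == "__main__":
    rtf = real_time_factor(93.6, 22.95)
    print(f"real-time factor: {rtf:.2f}x")  # prints "real-time factor: 4.08x"

    baseline = implied_baseline_minutes(22.95, 2.86)
    print(f"implied Python baseline (estimate): {baseline:.1f} min")
```

A factor above 1.0 means audio is generated faster than it would take to play back.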

This addition makes long-form audio models more practical for local use by avoiding Python setup overhead and offering optimized performance for dialogue and narration tasks.