audio.cpp adds native C++ VibeVoice 1.5B support for local audio inference

The author of the audio.cpp runtime has added support for the VibeVoice 1.5B model, enabling long-form multi-speaker text-to-speech generation in a native C++/ggml environment.

Benchmarks on an RTX 5090 show VibeVoice generating 93.6 minutes of audio in 22.95 minutes (4.08x real-time).
This represents a 2.86x speedup compared to a Python baseline without quantization.
The runtime aims to provide reusable sessions, stable memory behavior, and CUDA-focused optimization for local inference.

This addition makes long-form audio models more practical to run locally: users avoid the overhead of setting up a Python environment and get CUDA-optimized performance for dialogue and narration workloads.