Voice & audio
media r/LocalLLaMA · 1d ago

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1

A CPU-only text-to-speech benchmark compares Kokoro-82M, Supertonic-3, and Inflect-Nano-v1 on a 4-core Intel Xeon with 15.6 GB of RAM. Kokoro sounds the most natural (MOS 4.44-4.45) despite being slower, and its ONNX version beats the PyTorch version on real-time factor at identical quality. Supertonic-5-step strikes a balance at 3.2x real-time and MOS 4.37, making it the practical choice when both speed and quality matter.
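
The speed figures in this item rest on the real-time factor. A minimal sketch of how it could be measured, assuming the common convention that RTF is synthesis time divided by audio duration (so "3.2x real-time" corresponds to an RTF of roughly 0.31). The `synthesize` callable and `sample_rate` argument are hypothetical stand-ins, not the benchmark's actual harness.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(synthesis_seconds: float, audio_seconds: float) -> float:
    """Inverse of RTF: how many seconds of audio are produced per second of compute."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def time_synthesis(synthesize, text: str, sample_rate: int):
    """Time one call to a hypothetical synthesize(text) that returns a sequence of samples.

    Returns (elapsed_seconds, audio_seconds).
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed, len(samples) / sample_rate
```

For example, 10 seconds of audio generated in 3.125 seconds gives `speedup(3.125, 10.0)` of 3.2 and an RTF of 0.3125.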

arxiv arXiv cs.CL · 2d ago

Segmentation Width and Cluster Size Impact Speech Resynthesis in GSLMs

Varying segmentation width and cluster size in generative spoken language models yields intelligible, natural speech resynthesis at lower bitrates than the baseline. Speech continuation quality remains stable at these lower bitrates across multiple metrics, suggesting that conventional settings may be unnecessary. LLM-based metrics correlate better with human judgments than other automatic metrics, but their alignment is still low, underscoring the need for improved automatic evaluation.
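
To see why these two knobs affect bitrate, here is a rough sketch under stated assumptions: that "segmentation width" means the number of encoder frames merged into one discrete unit, and that each unit costs log2(cluster count) bits with no entropy coding. The paper may define both differently; the 50 Hz frame rate below is only an illustrative value.

```python
import math


def unit_bitrate(frame_rate_hz: float, segmentation_width: int, num_clusters: int) -> float:
    """Upper-bound bitrate (bits per second) of a discrete unit stream.

    Assumption: one unit is emitted per `segmentation_width` encoder frames,
    and each unit is stored with log2(num_clusters) bits.
    """
    if segmentation_width < 1 or num_clusters < 2:
        raise ValueError("need segmentation_width >= 1 and num_clusters >= 2")
    units_per_second = frame_rate_hz / segmentation_width
    return units_per_second * math.log2(num_clusters)


# Illustrative comparison: wider segments and fewer clusters both lower the bitrate.
narrow = unit_bitrate(50, 1, 256)  # one unit per frame, 256 clusters
wide = unit_bitrate(50, 2, 64)     # one unit per two frames, 64 clusters
```

Under these assumptions, halving the unit rate and shrinking the codebook compound, which is the kind of trade-off the study explores.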

arxiv arXiv cs.CL · 2d ago

Synthetic Audio Framework Improves ATC Speech Recognition

A synthetic audio generation framework is introduced to address data scarcity in Air Traffic Control (ATC) speech recognition. It uses neural text-to-speech and accent conversion to simulate non-native English accents, with the aim of improving automatic speech recognition (ASR) performance. Experiments on the ATCO2 corpus show that the Whisper model reaches lower word error rates when fine-tuned on synthetic or mixed real-synthetic data.
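
The headline metric here is word error rate. A minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), assuming simple whitespace tokenization; the paper's text normalization for ATC transcripts is not described in this summary and is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, scoring the hypothesis "descend flight level eighty" against the reference "descend flight level eight zero" counts one substitution and one deletion over five reference words, a WER of 0.4.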

arxiv arXiv cs.AI · 6d ago

FlowEdit: Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit lets frozen flow-matching TTS models accumulate pronunciation corrections over time by applying latent edits to text embeddings. It stores corrections in a Modern Hopfield Network and retrieves them via soft attention with similarity gating, reducing phoneme error rates by 92.7% on 312 multilingual proper nouns while preserving general-speech quality. Corrections take about 15 seconds on a single GPU.
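
The retrieval step can be illustrated with a toy sketch. This is not FlowEdit's implementation: it assumes cosine similarity between a query embedding and stored keys, a softmax readout over stored edit vectors (the usual Modern Hopfield update), and a hard threshold as the gate. The names `retrieve_edit`, `beta`, and `threshold` are hypothetical, as are their default values.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in vec]


def retrieve_edit(query, keys, edits, beta=8.0, threshold=0.8):
    """Return a latent edit for `query`, or a zero vector if nothing stored is similar enough.

    keys[i] is the embedding a correction was stored under; edits[i] is its edit vector.
    """
    q = _normalize(query)
    sims = [sum(a * b for a, b in zip(q, _normalize(k))) for k in keys]
    dim = len(edits[0])

    # Similarity gate (assumed form): leave unrelated text untouched.
    best = max(sims)
    if best < threshold:
        return [0.0] * dim

    # Soft attention: softmax over scaled similarities, then a weighted sum of edits.
    weights = [math.exp(beta * (s - best)) for s in sims]
    total = sum(weights)
    return [sum(w / total * e[d] for w, e in zip(weights, edits)) for d in range(dim)]
```

The gate is what would keep ordinary speech unaffected in a scheme like this: a query far from every stored key receives no edit, while a near match receives an edit dominated by its closest stored correction.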