Voice & audio — korshunov.ai

Evaluation Framework for TTS Voice Reconstruction

A new evaluation framework for text-to-speech voice reconstruction introduces subjective and objective measures to assess perceived intelligibility and speaker identity. It addresses limitations of existing methods by proposing a dual-reference distributional metric that better captures the trade-off between intelligibility and identity, validated across 193 speakers using 17 zero-shot TTS systems.

arxiv arXiv cs.CL · 2d ago

Sexualised AI Voices Amplify Gender Power Asymmetries

A study finds that sexualised AI voices on commercial platforms reinforce binary gender norms. Female-coded voices are more often described with submissive, sexualised terms, while male-coded voices are linked to dominance and positive traits, reflecting entrenched gendered power asymmetries.

media Hugging Face Forums · 3d ago

NOVA-VAD beats Silero, Pyannote, and WebRTC on noisy audio with 93% accuracy

NOVA-VAD, a lightweight and explainable Voice Activity Detector, achieves 93% accuracy on noisy audio from the UrbanSound8K dataset, outperforming WebRTC (58%), Pyannote (62%), and Silero (87%). It uses only scikit-learn, requires no GPU, and provides feature importance and confidence scores in plain English.

arxiv arXiv cs.AI · 6d ago

Repurposing Speech Classifier for Diffusion-Based Generation

A pretrained speech classifier is repurposed as a backbone for guided diffusion-based speech generation. By attaching a lightweight subnetwork and training it under denoising score matching, the approach achieves high speech quality with reduced memory and computational cost, using a single model instead of two separately trained components.

arxiv arXiv cs.AI · 6d ago

FlowEdit: Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit enables frozen flow-matching TTS models to adapt pronunciation corrections over time using latent edits in text embeddings. It stores corrections in a Modern Hopfield Network and retrieves them via soft attention with similarity gating, reducing phoneme error rates by 92.7% on 312 multilingual proper nouns while preserving general-speech quality. Corrections take about 15 seconds to complete on a single GPU.
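The retrieval step described above (soft attention over stored corrections, with a similarity gate) can be sketched roughly as follows. This is a minimal illustration and not FlowEdit's implementation: the function names, the cosine-similarity gate, and the `beta` and `gate_threshold` values are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_edit(query, keys, edits, beta=8.0, gate_threshold=0.8):
    """Blend stored edits by softmax attention; return a zero edit if gated off.

    Hypothetical helper: assumes keys and edits are non-empty and aligned.
    """
    sims = [cosine(query, k) for k in keys]
    dim = len(edits[0])
    # Similarity gate: inputs unlike every stored key receive no edit.
    if max(sims) < gate_threshold:
        return [0.0] * dim
    # Hopfield-style retrieval: softmax(beta * similarities) over stored patterns.
    top = max(sims)
    weights = [math.exp(beta * (s - top)) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * e[i] for w, e in zip(weights, edits)) for i in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0]]        # embeddings of previously corrected words
edits = [[0.5, -0.5], [-0.2, 0.3]]     # latent edits stored for those words
near = retrieve_edit([0.9, 0.1], keys, edits)    # close to the first key
far = retrieve_edit([-1.0, -1.0], keys, edits)   # unlike any stored key
```

With a large `beta` the blend approaches a nearest-neighbour lookup, while the gate leaves text with no stored correction untouched.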

arxiv arXiv cs.AI · 6d ago

Cross-Attention Attribution for Style-Captioned Text-to-Speech

A new method adapts DAAM to speech diffusion models to analyze how style captions influence TTS waveforms. It finds that style tokens have lower temporal variance than content tokens, that style attention correlates with pitch and energy, and that style conditioning peaks in early layers, where attention entropy is lowest, indicating maximal selectivity.
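To make the link between entropy and selectivity concrete, here is a small sketch of the Shannon entropy of a single attention distribution. The function name and the toy weights are illustrative assumptions and are not taken from the paper.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked (selective) distribution versus a uniform (diffuse) one.
peaked = attention_entropy([0.97, 0.01, 0.01, 0.01])
uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])
```

Lower entropy means the attention mass sits on fewer tokens, which is the sense in which minimal entropy signals maximal selectivity.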

arxiv arXiv cs.AI · 6d ago

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing

A hybrid two-stage diffusion transformer architecture enables efficient and accurate instruction-guided audio editing. It uses coarse-to-fine semantic alignment via joint attention at low resolution, followed by refined editing with alternating joint and cross-attention at high resolution. The method achieves better performance on complex editing tasks with improved efficiency and a compact model.

arxiv arXiv cs.LG · 6d ago

PASQA: Pitch-Accent-Focused Speech Quality Model

PASQA is a speech quality assessment model designed to evaluate pitch-accent correctness in synthetic Japanese speech. It uses a dataset with controlled accent errors and achieves high accuracy in ranking accent-error severity, outperforming conventional models and aligning better with human judgments.

arxiv arXiv cs.CL · 6d ago

Lightweight Pronunciation Assessment via Discrete Speech Token Surprisal

A new framework assesses pronunciation using only native speech data, without labeled errors. It uses speech token surprisal and transcript-guided alignment to detect phonotactic deviations, achieving performance close to supervised methods on multiple datasets.
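Token surprisal is the negative log-probability of a token given its context under a model trained on native speech. The sketch below uses an add-one-smoothed bigram model over integer token IDs as a stand-in; the summary does not say which token model the paper uses, so the model choice, the function names, and the toy data are assumptions.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size):
    """Fit an add-one-smoothed bigram model on native token sequences."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)

    return prob

def surprisal(seq, prob):
    """Per-token surprisal in bits: -log2 p(token | previous token)."""
    return [-math.log2(prob(prev, cur)) for prev, cur in zip(seq, seq[1:])]

native = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 0]]   # toy "native" token streams
prob = train_bigram(native, vocab_size=4)
typical = surprisal([0, 1, 2], prob)   # transitions seen in native data
unusual = surprisal([0, 3, 2], prob)   # transitions never seen in native data
```

Transitions that are rare in native data receive high surprisal, which is the signal a system of this kind could flag as a possible pronunciation deviation.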

arxiv arXiv cs.CL · 6d ago

Speech Quality Models Fail to Capture Prosodic and F0 Variability

MOS prediction models accurately capture acoustic degradation but fail to detect prosodic errors and speaker-specific characteristics like pitch and speaking rate. Human listeners perceive significant quality drops for these perturbations, while models show strong biases in fundamental frequency and lack sensitivity to speaking rate and F0 variability.

arxiv arXiv cs.CL · 6d ago

Segment-Level Mandarin Speech Detection for Cognitive Impairment

A new framework uses an autoencoder with contrastive learning to analyze segment-level Mandarin speech for cognitive impairment detection. It achieves stable, competitive performance across four datasets, with significant improvements in three-class classification, especially under limited labeled data conditions.

arxiv arXiv cs.CL · 6d ago

PASQA: Pitch-Accent-Focused Speech Quality Model

PASQA is a speech quality assessment model designed to evaluate pitch-accent correctness in synthetic Japanese speech. It uses a dataset with controlled accent errors and incorporates self-supervised learning, mora-conditioned fusion, ranking loss, and accent-error localization to achieve high accuracy in detecting accent errors across speakers, outperforming conventional models in alignment with human judgments.

arxiv arXiv cs.CL · 6d ago

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

ReNikud introduces an audio-supervised approach for Hebrew grapheme-to-phoneme (G2P) conversion, using weak audio supervision and a pseudo-vocalization architecture. It outperforms prior state-of-the-art methods on Hebrew G2P benchmarks and the new MILIM benchmark, enabling more natural spoken Hebrew in text-to-speech applications.

media r/LocalLLaMA · 7d ago

What's the best open speech to text today?

A user is seeking recommendations for real-time speech-to-text tools with diarization capabilities, asking about alternatives to Wispr Flow and MacParakeet, which uses Parakeet and Whisper models. They inquire whether newer models have emerged to support real-time performance.

arxiv arXiv cs.AI · 7d ago

ScenA: Reference-Driven Multi-Speaker Audio Scene Generation

ScenA conditions a text-to-audio foundation model on multiple reference voices and a natural language scene prompt to generate realistic multi-speaker conversations. It addresses the 'Reference Shortcut' issue by using a high-noise-biased training schedule, ensuring speaker assignment relies on text prompts rather than acoustic similarity. Evaluated on CoVoMix2-Dialogue, ScenA outperforms existing systems in speaker-binding and produces rich, naturalistic audio with overlapping speech and ambient noise.

arxiv arXiv cs.LG · 7d ago

Learnable Speech-to-Spike Encoder for Spiking Neural Networks

A learnable residual speech-to-spike encoder is jointly trained with a Recurrent Leaky Integrate-and-Fire network, achieving up to 94.97% accuracy on the Google Speech Commands v2 benchmark. A 35k-parameter version reaches 89.8%, outperforming prior methods with far fewer parameters, and shows task-aligned spike representations that improve class separability.
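For readers new to spiking networks, the sketch below simulates a single discrete-time leaky integrate-and-fire neuron. It omits the recurrence and the learnable encoder described above, and the decay factor, threshold, and soft reset are illustrative assumptions rather than the paper's settings.

```python
def lif_spikes(inputs, beta=0.9, threshold=1.0):
    """Simulate one discrete-time leaky integrate-and-fire neuron.

    beta is the membrane decay per step; a spike is emitted when the
    membrane potential reaches the threshold, followed by a soft reset.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane = beta * membrane + current   # leak, then integrate the input
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold              # soft reset keeps the overshoot
        else:
            spikes.append(0)
    return spikes

weak = lif_spikes([0.05] * 10)    # input too small to ever reach threshold
strong = lif_spikes([0.6] * 10)   # input large enough to fire repeatedly
```

The encoder's job, in this picture, is to turn speech features into input currents whose resulting spike trains separate the classes well.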

arxiv arXiv cs.CL · 7d ago

Speech-Driven End-to-End Language Discrimination for Chinese Dialects

A study evaluates speech-driven MFCC features and an HMM-DNN model with attention mechanisms for Chinese dialect discrimination. The approach combines word-level embeddings and MFCC features using a CNN, achieving superior performance on benchmark dialect corpora compared to existing methods.

media r/LocalLLaMA · 8d ago

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model

The Inflect-Nano-v1 model is the second-smallest publicly released TTS model after TinyTTS, with 4.63M total parameters. It performs surprisingly well for its size, running locally on low-end devices and offering a baseline for tiny speech synthesis in embedded or offline applications.

media r/LocalLLaMA · 8d ago

A Year Building a Fully Local Home Voice Assistant

A developer spent 12 months building a local, open-source voice assistant inspired by Alexa, documenting the challenges and progress. The project aimed to create a privacy-focused alternative using local models, with ongoing improvements and fixes.