Voice & audio — korshunov.ai

NOVA-VAD beats Silero, Pyannote, and WebRTC on noisy audio with 93% accuracy

NOVA-VAD, a lightweight and explainable Voice Activity Detector, achieves 93% accuracy on noisy audio from the UrbanSound8K dataset, outperforming WebRTC (58%), Pyannote (62%), and Silero (87%). It uses only scikit-learn, requires no GPU, and provides feature importance and confidence scores in plain English.

arxiv arXiv cs.AI · 6d ago

Repurposing Speech Classifier for Diffusion-Based Generation

A pretrained speech classifier is repurposed as a backbone for guided diffusion-based speech generation. By attaching a lightweight subnetwork and training it under denoising score matching, the approach achieves high speech quality with reduced memory and computational cost, using a single model instead of two separately trained components.

arxiv arXiv cs.AI · 6d ago

FlowEdit: Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit enables frozen flow-matching TTS models to adapt pronunciation corrections over time using latent edits in text embeddings. It stores corrections in a Modern Hopfield Network and retrieves them via soft attention with similarity gating, reducing phoneme error rates by 92.7% on 312 multilingual proper nouns while preserving general-speech quality. Corrections take about 15 seconds to complete on a single GPU.
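
The retrieval step described above (soft attention over stored corrections, gated by similarity) can be sketched roughly as follows. This is an illustration only, not FlowEdit's implementation: the function name `retrieve_edit`, the inverse temperature `beta`, the `gate` threshold, and the use of cosine similarity are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_edit(query, keys, edits, beta=8.0, gate=0.8):
    """Hopfield-style retrieval: softmax attention over stored keys.

    Hypothetical sketch. Returns a zero edit when no stored key is
    similar enough to the query (the similarity gate), so unrelated
    text embeddings are left untouched.
    """
    sims = [cosine(query, k) for k in keys]
    if not sims or max(sims) < gate:
        return [0.0] * len(query)
    top = max(sims)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - top)) for s in sims]
    total = sum(weights)
    dim = len(edits[0])
    return [sum(w * e[d] for w, e in zip(weights, edits)) / total
            for d in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]    # embeddings of corrected words
edits = [[0.5, 0.0], [0.0, -0.5]]  # latent edits stored with each key
near = retrieve_edit([0.9, 0.1], keys, edits)  # close to the first key
far = retrieve_edit([0.7, 0.7], keys, edits)   # below the gate: no edit
```

With a larger `beta` the softmax sharpens and retrieval approaches a nearest-neighbour lookup; the gate is what would keep general speech unaffected in a scheme like this.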

arxiv arXiv cs.AI · 6d ago

Cross-Attention Attribution for Style-Captioned Text-to-Speech

A new method adapts DAAM to speech diffusion models, analyzing how style captions influence TTS waveforms. It reveals style tokens have lower temporal variance than content tokens, with style attention correlating to pitch and energy, and peak style conditioning in early layers where attention entropy is minimized, indicating maximal selectivity.
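
For readers unfamiliar with attention entropy as a selectivity measure, a minimal sketch is given here. It assumes a single row of non-negative attention weights and uses Shannon entropy in nats; the function name `attention_entropy` is hypothetical, and the paper's exact normalization is not stated in the summary.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights.

    The weights are normalized to sum to 1 first. Lower entropy means
    the attention mass is concentrated on fewer positions.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = attention_entropy([1.0, 1.0, 1.0, 1.0])     # maximally diffuse
peaked = attention_entropy([0.97, 0.01, 0.01, 0.01])  # highly selective
```

A uniform row over four positions gives the maximum value, log 4, while the peaked row scores much lower, which is the sense in which minimal entropy indicates maximal selectivity.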

arxiv arXiv cs.AI · 6d ago

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing

A hybrid two-stage diffusion transformer architecture enables efficient and accurate instruction-guided audio editing. It uses coarse-to-fine semantic alignment via joint attention at low resolution, followed by refined editing with alternating joint and cross-attention at high resolution. The method achieves better performance on complex editing tasks with improved efficiency and a compact model.

arxiv arXiv cs.LG · 6d ago

PASQA: Pitch-Accent-Focused Speech Quality Model

PASQA is a speech quality assessment model designed to evaluate pitch-accent correctness in synthetic Japanese speech. It uses a dataset with controlled accent errors and achieves high accuracy in ranking accent-error severity, outperforming conventional models and aligning better with human judgments.

arxiv arXiv cs.CL · 6d ago

Lightweight Pronunciation Assessment via Discrete Speech Token Surprisal

A new framework assesses pronunciation using only native speech data, without labeled errors. It uses speech token surprisal and transcript-guided alignment to detect phonotactic deviations, achieving performance close to supervised methods on multiple datasets.
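
Token surprisal itself is simple to illustrate. The sketch below trains an add-one-smoothed bigram model on "native" token sequences and scores a test sequence with -log2 P(token | previous token). The bigram model is a stand-in chosen for brevity, not the model used in the paper, and the token IDs are made up for the example.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count bigrams and their contexts over native token sequences."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        vocab.update(padded)
        for prev, tok in zip(padded, padded[1:]):
            bigrams[(prev, tok)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def surprisal(seq, model):
    """Per-token surprisal in bits under the add-one-smoothed bigram model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot for tokens never seen in training
    padded = ["<s>"] + list(seq)
    return [
        -math.log2((bigrams[(prev, tok)] + 1) / (contexts[prev] + v))
        for prev, tok in zip(padded, padded[1:])
    ]


native = [[3, 7, 7, 2], [3, 7, 2], [3, 7, 7, 2]]  # hypothetical token IDs
model = train_bigram(native)
scores = surprisal([3, 7, 5, 2], model)  # token 5 never follows 7 in training
```

The unseen transition 7 -> 5 receives the highest surprisal, which is the kind of signal a surprisal-based detector would flag as a possible deviation.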

arxiv arXiv cs.CL · 6d ago

Speech Quality Models Fail to Capture Prosodic and F0 Variability

MOS prediction models accurately capture acoustic degradation but fail to detect prosodic errors and speaker-specific characteristics like pitch and speaking rate. Human listeners perceive significant quality drops for these perturbations, while models show strong biases in fundamental frequency and lack sensitivity to speaking rate and F0 variability.

arxiv arXiv cs.CL · 6d ago

Segment-Level Mandarin Speech Detection for Cognitive Impairment

A new framework uses an autoencoder with contrastive learning to analyze segment-level Mandarin speech for cognitive impairment detection. It achieves stable, competitive performance across four datasets, with significant improvements in three-class classification, especially under limited labeled data conditions.

arxiv arXiv cs.CL · 6d ago

PASQA: Pitch-Accent-Focused Speech Quality Model

PASQA is a speech quality assessment model designed to evaluate pitch-accent correctness in synthetic Japanese speech. It uses a dataset with controlled accent errors and incorporates self-supervised learning, mora-conditioned fusion, ranking loss, and accent-error localization to achieve high accuracy in detecting accent errors across speakers, outperforming conventional models in alignment with human judgments.

arxiv arXiv cs.CL · 6d ago

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

ReNikud introduces a novel audio-supervised approach for Hebrew grapheme-to-phoneme conversion, using weak audio supervision and a pseudo-vocalization architecture. It outperforms prior state-of-the-art methods on Hebrew G2P benchmarks and the new MILIM benchmark, enabling more natural spoken Hebrew in text-to-speech applications.

media r/LocalLLaMA · 7d ago

What's the best open speech to text today?

A user is seeking recommendations for real-time speech-to-text tools with diarization capabilities, asking about alternatives to Wispr Flow and MacParakeet, which uses Parakeet and Whisper models. They inquire whether newer models have emerged to support real-time performance.

arxiv arXiv cs.AI · 7d ago

ScenA: Reference-Driven Multi-Speaker Audio Scene Generation

ScenA conditions a text-to-audio foundation model on multiple reference voices and a natural language scene prompt to generate realistic multi-speaker conversations. It addresses the 'Reference Shortcut' issue by using a high-noise-biased training schedule, ensuring speaker assignment relies on text prompts rather than acoustic similarity. Evaluated on CoVoMix2-Dialogue, ScenA outperforms existing systems in speaker-binding and produces rich, naturalistic audio with overlapping speech and ambient noise.

arxiv arXiv cs.LG · 7d ago

Learnable Speech-to-Spike Encoder for Spiking Neural Networks

A learnable residual speech-to-spike encoder is jointly trained with a Recurrent Leaky Integrate-and-Fire network, achieving up to 94.97% accuracy on the Google Speech Commands v2 benchmark. A 35k-parameter version reaches 89.8%, outperforming prior methods with far fewer parameters, and shows task-aligned spike representations that improve class separability.
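
As background on the neuron model named above, a single leaky integrate-and-fire unit can be sketched in a few lines. This is a generic discrete-time LIF with a hard reset, not the paper's recurrent network or its learnable encoder; `decay` and `threshold` are illustrative values.

```python
def lif_spikes(inputs, decay=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over an input sequence.

    At each step the membrane potential leaks by `decay`, integrates the
    input, and emits a spike (1) when it reaches `threshold`, after which
    it is reset to zero.
    """
    v, spikes = 0.0, []
    for x in inputs:
        v = decay * v + x
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


train = lif_spikes([0.5] * 6)  # constant drive yields a regular spike train
```

A constant sub-threshold input of 0.5 accumulates over three steps before the neuron fires, so the output spikes on every third step.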

arxiv arXiv cs.CL · 7d ago

Speech-Driven End-to-End Language Discrimination for Chinese Dialects

A study evaluates speech-driven MFCC features and an HMM-DNN model with attention mechanisms for Chinese dialect discrimination. The approach combines word-level embeddings and MFCC features using a CNN, achieving superior performance on benchmark dialect corpora compared to existing methods.

media r/LocalLLaMA · 8d ago

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model

The Inflect-Nano-v1 model is the second smallest publicly released TTS model after TinyTTS, with 4.63M total parameters. It performs surprisingly well for its size, running locally on low-end devices and offering a baseline for tiny speech synthesis in embedded or offline applications.

media r/LocalLLaMA · 8d ago

A Year Building a Fully Local Home Voice Assistant

A developer spent 12 months building a local, open-source voice assistant inspired by Alexa, documenting the challenges and progress. The project aimed to create a privacy-focused alternative using local models, with ongoing improvements and fixes.

media r/LocalLLaMA · 8d ago

Looking for locally hosted tool to create English subtitles from videos

A user is seeking a locally hosted, self-contained app to generate English subtitles (in .srt or .ass format) from video files. They consider Qwen-ASR and Whisper as strong options but report poor subtitle timing in ComfyUI implementations and unreliable performance with older models like those in storytoolkitAI. They ask for recommendations that work well on Windows and can handle multiple languages.

arxiv arXiv cs.CL · 8d ago

NAR-MBR Decoding for Fast and Accurate Speech Recognition

NAR-MBR decoding improves speech recognition by maximizing expected utility from sampled outputs of non-autoregressive models. It achieves better performance than prior NAR methods and runs faster than autoregressive decoding across multiple corpora.
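
The core of minimum Bayes risk (MBR) selection can be sketched as follows: among sampled hypotheses, pick the one with the highest average utility against the others. The utility used here (1 minus length-normalized word edit distance) and the names `utility` and `mbr_select` are assumptions for illustration; the summary does not say which utility or sampling scheme the paper uses.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def utility(hyp, ref):
    """1 minus length-normalized word edit distance (higher is better)."""
    h, r = hyp.split(), ref.split()
    return 1.0 - edit_distance(h, r) / max(len(h), len(r), 1)


def mbr_select(samples):
    """Return the sample with the highest mean utility against the others."""
    def expected_utility(i):
        others = [s for j, s in enumerate(samples) if j != i]
        return sum(utility(samples[i], o) for o in others) / max(len(others), 1)
    return samples[max(range(len(samples)), key=expected_utility)]


samples = [
    "turn on the light",
    "turn on the lights",
    "turn on the light",
    "turn of the light",
]
best = mbr_select(samples)
```

Here the hypothesis that agrees most with the rest of the sample set wins. The pairwise comparison costs O(n^2) utility evaluations for n samples.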