Voice & audio — korshunov.ai

Lightweight Pronunciation Assessment via Discrete Speech Token Surprisal

A new framework assesses pronunciation using only native speech data, without labeled errors. It uses speech token surprisal and transcript-guided alignment to detect phonotactic deviations, achieving performance close to supervised methods on multiple datasets.
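
As a rough illustration of token surprisal, the sketch below scores discrete token sequences with a bigram model estimated from "native" sequences only. The bigram model, the add-one smoothing, and the toy token IDs are assumptions made for the example; the summary does not say which token model or alignment procedure the framework uses.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size):
    """Count contexts and bigrams over native token sequences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab_size

def surprisal(model, seq):
    """Per-token surprisal in bits: -log2 p(cur | prev), add-one smoothed."""
    context_counts, bigram_counts, vocab_size = model
    scores = []
    for prev, cur in zip(seq, seq[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        scores.append(-math.log2(p))
    return scores

# Toy data (hypothetical): token IDs standing in for discrete speech units.
native = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
model = train_bigram(native, vocab_size=4)

typical = surprisal(model, [0, 1, 2, 3])   # follows the native pattern
atypical = surprisal(model, [0, 3, 2, 3])  # contains unseen transitions
```

Transitions never seen in the native data receive higher surprisal, which is the kind of signal a deviation detector could threshold.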

arxiv arXiv cs.CL · 6d ago

Speech Quality Models Fail to Capture Prosodic and F0 Variability

MOS prediction models accurately capture acoustic degradation but fail to detect prosodic errors and speaker-specific characteristics like pitch and speaking rate. Human listeners perceive significant quality drops for these perturbations, while models show strong biases in fundamental frequency and lack sensitivity to speaking rate and F0 variability.

arxiv arXiv cs.CL · 6d ago

Segment-Level Mandarin Speech Detection for Cognitive Impairment

A new framework uses an autoencoder with contrastive learning to analyze segment-level Mandarin speech for cognitive impairment detection. It achieves stable, competitive performance across four datasets, with significant improvements in three-class classification, especially under limited labeled data conditions.

arxiv arXiv cs.CL · 6d ago

PASQA: Pitch-Accent-Focused Speech Quality Model

PASQA is a speech quality assessment model designed to evaluate pitch-accent correctness in synthetic Japanese speech. It uses a dataset with controlled accent errors and incorporates self-supervised learning, mora-conditioned fusion, ranking loss, and accent-error localization to achieve high accuracy in detecting accent errors across speakers, outperforming conventional models in alignment with human judgments.

arxiv arXiv cs.CL · 6d ago

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

ReNikud introduces a novel audio-supervised approach for Hebrew grapheme-to-phoneme conversion, using weak audio supervision and a pseudo-vocalization architecture. It outperforms prior state-of-the-art methods on Hebrew G2P benchmarks and the new MILIM benchmark, enabling more natural spoken Hebrew in text-to-speech applications.

media r/LocalLLaMA · 6d ago

What's the best open speech to text today?

A user is seeking recommendations for real-time speech-to-text tools with diarization capabilities, asking about alternatives to Wispr Flow and MacParakeet, which uses Parakeet and Whisper models. They inquire whether newer models have emerged to support real-time performance.

arxiv arXiv cs.AI · 7d ago

ScenA: Reference-Driven Multi-Speaker Audio Scene Generation

ScenA conditions a text-to-audio foundation model on multiple reference voices and a natural language scene prompt to generate realistic multi-speaker conversations. It addresses the 'Reference Shortcut' issue by using a high-noise-biased training schedule, ensuring speaker assignment relies on text prompts rather than acoustic similarity. Evaluated on CoVoMix2-Dialogue, ScenA outperforms existing systems in speaker-binding and produces rich, naturalistic audio with overlapping speech and ambient noise.

arxiv arXiv cs.LG · 7d ago

Learnable Speech-to-Spike Encoder for Spiking Neural Networks

A learnable residual speech-to-spike encoder is jointly trained with a Recurrent Leaky Integrate-and-Fire network, achieving up to 94.97% accuracy on the Google Speech Commands v2 benchmark. A 35k-parameter version reaches 89.8%, outperforming prior methods with far fewer parameters, and shows task-aligned spike representations that improve class separability.
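
For readers unfamiliar with leaky integrate-and-fire (LIF) units, here is a minimal sketch of a single neuron in discrete time. The decay factor, threshold, reset-by-subtraction rule, and input values are illustrative assumptions; the sketch has no recurrent connections or learnable encoder and is not the network described in the summary.

```python
def lif_simulate(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron.

    beta is the per-step membrane decay; after a spike the membrane
    potential is reduced by the threshold (soft reset).
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the previous potential, then integrate the new input.
        membrane = beta * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold
        else:
            spikes.append(0)
    return spikes

# Hypothetical input currents; a real encoder would produce these from audio.
spikes = lif_simulate([0.5, 0.5, 0.5, 0.0, 0.0, 1.2])
print(spikes)  # [0, 0, 1, 0, 0, 1]
```

The neuron stays silent until enough input accumulates, so the spike train encodes both the size and the timing of the input.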

arxiv arXiv cs.CL · 7d ago

Speech-Driven End-to-End Language Discrimination for Chinese Dialects

A study evaluates speech-driven MFCC features and an HMM-DNN model with attention mechanisms for Chinese dialect discrimination. The approach combines word-level embeddings and MFCC features using a CNN, achieving superior performance on benchmark dialect corpora compared to existing methods.

media r/LocalLLaMA · 7d ago

I released Inflect-Nano, an ultra-extreme tiny 4.63M-parameter TTS model

The Inflect-Nano-v1 model is the second smallest publicly released TTS model after TinyTTS, with 4.63M total parameters. It performs surprisingly well for its size, running locally on low-end devices and offering a baseline for tiny speech synthesis in embedded or offline applications.

media r/LocalLLaMA · 7d ago

A Year Building a Fully Local Home Voice Assistant

A developer spent 12 months building a local, open-source voice assistant inspired by Alexa, documenting the challenges and progress. The project aimed to create a privacy-focused alternative using local models, with ongoing improvements and fixes.

media r/LocalLLaMA · 8d ago

Looking for locally hosted tool to create English subtitles from videos

A user is seeking a locally hosted, self-contained app to generate English subtitles (in .srt or .ass format) from video files. They consider Qwen-ASR and Whisper as strong options but report poor subtitle timing in ComfyUI implementations and unreliable performance with older models like those in storytoolkitAI. They ask for recommendations that work well on Windows and can handle multiple languages.

arxiv arXiv cs.CL · 8d ago

NAR-MBR Decoding for Fast and Accurate Speech Recognition

NAR-MBR decoding improves speech recognition by maximizing expected utility from sampled outputs of non-autoregressive models. It achieves better performance than prior NAR methods and runs faster than autoregressive decoding across multiple corpora.
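
The general idea of minimum Bayes risk (MBR) selection can be sketched as follows: among sampled hypotheses, pick the one with the highest average utility against all samples. Using negative word-level edit distance as the utility, and the toy samples shown, are assumptions for illustration; the summary does not specify the utility function or how the non-autoregressive model is sampled.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def mbr_select(hypotheses):
    """Return the hypothesis with the highest expected utility.

    Each sample is treated as equally likely, and utility is the
    negative edit distance to every sample (including itself).
    """
    def expected_utility(h):
        return sum(-edit_distance(h, other) for other in hypotheses) / len(hypotheses)
    return max(hypotheses, key=expected_utility)

# Hypothetical sampled transcripts for one utterance.
samples = [
    "the cat sat".split(),
    "the cat sat".split(),
    "the cat sad".split(),
    "a cat sat".split(),
]
best = mbr_select(samples)
print(" ".join(best))  # the cat sat
```

The selected output is the sample closest on average to all the others, which tends to suppress one-off sampling errors.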

arxiv arXiv cs.CL · 8d ago

Bilingual fine-tuning improves low-resource ASR with language identification

A study finds bilingual fine-tuning enhances automatic speech recognition in low-resource languages when language identification is accurate. Including a language identification token at inference improves ASR performance when identification accuracy is low, especially in diverse language pairs across different families and writing systems.

arxiv arXiv cs.CL · 8d ago

Self-supervised speech models lack tonal context compensation

The wav2vec 2.0 model shows no evidence of perceptual compensation for Mandarin tones in embedding similarities. Probing classifiers reveal limited compensation and fail to match human performance on isolated syllables, suggesting supervised training is needed for phonological regularity abstraction.

arxiv arXiv cs.CL · 8d ago

Interventional Post-Training of Speech Foundation Models

A new method uses interventional contrastive learning to refine speech foundation models by transforming their entangled representations into separate content and speaker subspaces. The approach improves out-of-domain speaker verification performance and demonstrates clear separation of speaker and content information in the learned subspaces.

arxiv arXiv cs.AI · 9d ago

Low Frame Rate Degradation in Neural Audio Codecs

A quality cliff at 6.25 Hz in neural audio codecs is caused by insufficient training token exposure due to fixed clip duration. Correcting this training configuration enables smooth WER degradation down to 3.1 Hz and 1.6 Hz, indicating that low-frame-rate efficiency is more achievable than previously thought.
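
The link between frame rate and token exposure is simple arithmetic: a clip of fixed length yields fewer tokens as the frame rate drops. The sketch below works this through with a hypothetical 2-second clip and a hypothetical 50 Hz reference rate; the summary states neither the actual clip length nor how the training configuration was corrected.

```python
def tokens_per_clip(frame_rate_hz, clip_seconds):
    """Number of codec frames (tokens per codebook) in one training clip."""
    return frame_rate_hz * clip_seconds

def clip_seconds_for_tokens(frame_rate_hz, target_tokens):
    """Clip length needed to see target_tokens frames at a given frame rate."""
    return target_tokens / frame_rate_hz

CLIP_SECONDS = 2.0  # hypothetical fixed clip duration

for rate in (50.0, 12.5, 6.25, 3.125):
    print(rate, tokens_per_clip(rate, CLIP_SECONDS))

# Clip length that would keep 100 tokens per clip at 6.25 Hz.
needed = clip_seconds_for_tokens(6.25, 100)
print(needed)  # 16.0
```

Under these assumed numbers, a model trained at 6.25 Hz sees one eighth of the tokens per clip that it would at 50 Hz unless the clips are lengthened.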