Voice & audio — korshunov.ai

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

The FFASR Leaderboard is a new benchmark for evaluating automatic speech recognition (ASR) models under real-world conditions, across diverse environments and use cases.

arxiv arXiv cs.CL · 21h ago

CN-NewsTTS Bench v0.1 Released

CN-NewsTTS Bench v0.1 is an open benchmark for evaluating Chinese news TTS systems' ability to correctly pronounce raw text targets. It includes 200 development and 800 public test records, 992 auto-evaluable targets, and results for seven TTS systems, with the best achieving 0.879 strict accuracy and several below 0.60.

arxiv arXiv cs.CL · 1d ago

Poster: Exploring Audio-Based Scam Detection in Turkish

This research introduces the first public multi-modal dataset of 100 aligned audio-transcript pairs for Turkish scam and benign calls. It evaluates seven large language models under three input conditions: raw audio, automatic transcripts, and human-corrected transcripts. Transcript-based inputs outperform direct audio processing, and human correction has minimal impact.

arxiv arXiv cs.AI · 1d ago

Improving Speaker Verification for Non-Verbal Vocalizations

A new framework combines frozen Data2Vec features with ECAPA-TDNN and a Mixture of Experts module to enhance speaker verification for non-verbal vocalizations. It uses conditional distillation and contrastive loss to maintain speech accuracy while reducing speech-NVV EER from 38.93% to 22.66% and improving speech EER from 13.17% to 9.24%.
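
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an equal error rate (EER) can be estimated from verification scores. It is not the paper's evaluation code. The function name `equal_error_rate` and the toy scores are assumptions made for illustration, and the threshold sweep is the simplest possible approximation (no interpolation between thresholds).

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from two lists of similarity scores.

    genuine:  scores for same-speaker trials (higher = more similar)
    impostor: scores for different-speaker trials
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = None, None
    for t in thresholds:
        # False accept rate: impostor trials wrongly accepted.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False reject rate: genuine trials wrongly rejected.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    genuine_scores = [0.9, 0.8, 0.7, 0.3]
    impostor_scores = [0.1, 0.2, 0.4, 0.6]
    print(equal_error_rate(genuine_scores, impostor_scores))
```

A lower EER means the genuine and impostor score distributions overlap less, which is the sense in which the reported drop from 38.93% to 22.66% is an improvement.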

arxiv arXiv cs.AI · 1d ago

LambdaMark: First Generic Radioactive Audio Watermarking Scheme

LambdaMark introduces the first generic radioactive audio watermarking scheme that embeds multi-bit messages into semantic audio latent representations. It achieves robustness against distortions and adversarial removal attacks, and remains effective even on generated speech from finetuned models, offering strong defense against voice cloning and impersonation.

arxiv arXiv cs.AI · 1d ago

Sexualised AI Voices Amplify Gender Power Asymmetries

A study finds that sexualised AI voices on a commercial platform reinforce binary, heteronormative gender expressions. Female-coded voices are more often labelled with sexualised and submissive descriptors, while male-coded voices are linked to dominance and positive traits, highlighting persistent gendered power imbalances in AI voice design.

media r/LocalLLaMA · 1d ago

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1

A CPU-only text-to-speech benchmark compares Kokoro-82M, Supertonic-3, and Inflect-Nano-v1 on a 4-core Intel Xeon with 15.6 GB of RAM. Kokoro delivers the most natural sound (MOS 4.44-4.45) despite being slower, and its ONNX version beats the PyTorch version on real-time factor at identical quality. Supertonic-5-step strikes a balance at 3.2x real-time and MOS 4.37, making it the practical choice when both speed and quality matter.
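
As a small aid to reading the speed figures, the sketch below shows one common convention for real-time factor (RTF): synthesis time divided by the duration of the audio produced, with the "Nx real-time" speed-up as its inverse. The post does not spell out its exact definition, so this convention, the function names, and the timings are assumptions for illustration.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF under the assumed convention: below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(synthesis_seconds, audio_seconds):
    """Speed-up relative to real time, i.e. the inverse of the RTF."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


if __name__ == "__main__":
    # Hypothetical timing: 10 s of audio synthesized in 3.125 s.
    print(real_time_factor(3.125, 10.0))
    print(speedup(3.125, 10.0))
```

Under this convention, a system described as running at 3.2x real-time would have an RTF of roughly 0.31.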

arxiv arXiv cs.CL · 2d ago

Flow-Matching TTS Model Simulates Lombard Effect

A flow-matching based text-to-speech model is introduced to simulate the Lombard effect, where humans speak louder and clearer in noisy environments. The model enables continuous, disentangled control of vocal effort and articulation, with word-level emphasis for clarity. Experiments show improved acoustic clarity and intelligibility in noisy conditions compared to baseline systems.

arxiv arXiv cs.CL · 2d ago

Segmentation Width and Cluster Size Impact Speech Resynthesis in GSLMs

Varying segmentation width and cluster size in generative spoken language models (GSLMs) enables intelligible and natural speech synthesis at lower bitrates than the baseline. Speech continuation quality remains stable at these lower bitrates across multiple metrics, suggesting that conventional settings may be unnecessary. LLM-based metrics correlate better with human judgments than other automatic metrics but still show low alignment, underscoring the need for improved automatic evaluation.

arxiv arXiv cs.CL · 2d ago

OpenWER: Enhancing Cross-Lingual ASR Evaluation

OpenWER is an open-source framework that makes Word Error Rate (WER) more robust through language-specific normalization and compound word detection. It uses token-based Levenshtein alignment, which supports granular accuracy metrics and metadata embedding. An analysis of 52 languages shows absolute WER reductions of up to 25%, a step toward fairer cross-lingual ASR evaluation.
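
To make the idea of normalization followed by token-level Levenshtein alignment concrete, here is a minimal sketch of a WER computation. It does not use or reproduce the OpenWER API. The `normalize` function is a hypothetical stand-in (lowercasing plus ASCII punctuation stripping) for the language-specific normalization the summary describes, and compound word detection is not modeled.

```python
import string


def normalize(text):
    """Hypothetical normalizer: lowercase, strip ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference, hypothesis):
    """Word error rate: token-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Casing and the final period no longer count as errors after normalization.
    print(wer("The cat sat.", "the cat sat"))
    print(wer("a b c d", "a x c d"))
```

The first call shows why normalization matters: without it, casing and punctuation differences would be scored as substitutions even though the recognized words are correct.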

arxiv arXiv cs.CL · 2d ago

Synthetic Audio Framework Improves ATC Speech Recognition

A synthetic audio generation framework is introduced to address data scarcity in Air Traffic Control speech recognition. It uses neural techniques like Text-to-Speech and accent conversion to simulate non-native English accents, enhancing Automatic Speech Recognition performance. Experiments with the Whisper model on the ATCO2 corpus show reduced word error rates when fine-tuned with synthetic or mixed real-synthetic data.

arxiv arXiv cs.CL · 2d ago

Evaluation Framework for TTS Voice Reconstruction

A new evaluation framework for text-to-speech voice reconstruction introduces subjective and objective measures to assess perceived intelligibility and speaker identity. It addresses limitations of existing methods by proposing a dual-reference distributional metric that better captures the trade-off between intelligibility and identity, validated across 193 speakers using 17 zero-shot TTS systems.

arxiv arXiv cs.CL · 2d ago

Sexualised AI Voices Amplify Gender Power Asymmetries

A study finds that sexualised AI voices on commercial platforms reinforce binary gender norms. Female-coded voices are more often described with submissive, sexualised terms, while male-coded voices are linked to dominance and positive traits, reflecting entrenched gendered power asymmetries.

media Hugging Face Forums · 3d ago

NOVA-VAD beats Silero, Pyannote, and WebRTC on noisy audio with 93% accuracy

NOVA-VAD, a lightweight and explainable Voice Activity Detector, achieves 93% accuracy on noisy audio from the UrbanSound8K dataset, outperforming WebRTC (58%), Pyannote (62%), and Silero (87%). It uses only scikit-learn, requires no GPU, and provides feature importance and confidence scores in plain English.

arxiv arXiv cs.AI · 6d ago

Repurposing Speech Classifier for Diffusion-Based Generation

A pretrained speech classifier is repurposed as a backbone for guided diffusion-based speech generation. By attaching a lightweight subnetwork and training it under denoising score matching, the approach achieves high speech quality with reduced memory and computational cost, using a single model instead of two separately trained components.

arxiv arXiv cs.AI · 6d ago

FlowEdit: Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit enables frozen flow-matching TTS models to adapt pronunciation corrections over time using latent edits in text embeddings. It stores corrections in a Modern Hopfield Network and retrieves them via soft attention with similarity gating, reducing phoneme error rates by 92.7% on 312 multilingual proper nouns while preserving general-speech quality. Corrections take about 15 seconds to complete on a single GPU.
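
The retrieval step described above can be pictured roughly as follows. This is a sketch under stated assumptions, not FlowEdit's implementation: stored corrections are treated as key-to-edit pairs, retrieval is a softmax over cosine similarities (the usual Modern Hopfield update), and the gate is a simple threshold on the best similarity. The names `retrieve_edit`, `beta`, and `gate`, and their default values, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_edit(query, keys, edits, beta=8.0, gate=0.8):
    """Return a latent edit for `query`, or a zero edit if the gate stays closed.

    keys:  stored text-embedding keys, one per correction
    edits: stored latent edits, same dimensionality as the query
    beta:  inverse temperature of the softmax (assumed value)
    gate:  minimum cosine similarity required to apply any edit (assumed value)
    """
    sims = [cosine(query, k) for k in keys]
    if not sims or max(sims) < gate:
        # No stored correction is close enough: leave the embedding unchanged.
        return [0.0] * len(query)
    top = max(sims)
    # Numerically stable softmax over scaled similarities.
    weights = [math.exp(beta * (s - top)) for s in sims]
    total = sum(weights)
    dim = len(edits[0])
    return [
        sum(w * e[d] for w, e in zip(weights, edits)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    edits = [[0.5, 0.0], [0.0, -0.5]]
    print(retrieve_edit([1.0, 0.0], keys, edits))  # close to the first stored edit
    print(retrieve_edit([1.0, 1.0], keys, edits))  # gate closed: zero edit
```

The gate is what would keep general speech untouched in this picture: a query that matches no stored correction well receives no edit at all, rather than a blurred mixture of unrelated ones.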

arxiv arXiv cs.AI · 6d ago

Cross-Attention Attribution for Style-Captioned Text-to-Speech

A new method adapts DAAM to speech diffusion models, analyzing how style captions influence TTS waveforms. It reveals style tokens have lower temporal variance than content tokens, with style attention correlating to pitch and energy, and peak style conditioning in early layers where attention entropy is minimized, indicating maximal selectivity.

arxiv arXiv cs.AI · 6d ago

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing

A hybrid two-stage diffusion transformer architecture enables efficient and accurate instruction-guided audio editing. It uses coarse-to-fine semantic alignment via joint attention at low resolution, followed by refined editing with alternating joint and cross-attention at high resolution. The method achieves better performance on complex editing tasks with improved efficiency and a compact model.

arxiv arXiv cs.LG · 6d ago

PASQA: Pitch-Accent-Focused Speech Quality Model

PASQA is a speech quality assessment model designed to evaluate pitch-accent correctness in synthetic Japanese speech. It uses a dataset with controlled accent errors and achieves high accuracy in ranking accent-error severity, outperforming conventional models and aligning better with human judgments.