Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean

Researchers address the quality gap in low-resource text-to-speech by fine-tuning the 2.4B-parameter VoxCPM2 model using Low-Rank Adaptation (LoRA) on a shared corpus of Khmer and Korean.

The study adapts VoxCPM2, which combines a MiniCPM-4 backbone with a flow-matching diffusion decoder, using a single LoRA adapter trained on 26 hours of mixed-language data.
Native-speaker listening tests show that the Khmer Mean Opinion Score (MOS) rises from 3.85 to 4.23 with a rank-64 adapter, a gain of 0.38 points that the study reports as highly significant, while training at most 3.03 percent of the model's parameters.
Automatic validation loss is lowest at rank 128, whereas human ratings peak at rank 64, so the automated metric and perceived quality disagree on the best adapter size.
The adaptation yields no benefit for Korean, as the base model already handles it well, and high-rank adapters even degrade quality in that language.
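
To make the role of the adapter rank concrete, the following is a minimal sketch of a generic LoRA update, not the study's training code. It assumes the standard formulation in which a frozen weight W is augmented by a scaled low-rank product B·A, with B initialised to zero. All dimensions, the `alpha` value, and the helper names (`lora_params`, `matvec`, `lora_forward`) are hypothetical and chosen only for illustration; they are not VoxCPM2's actual layer sizes or settings.

```python
# Illustrative sketch only: sizes and values are hypothetical, not VoxCPM2's.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted weight.

    A has shape (rank x d_in) and B has shape (d_out x rank).
    """
    return rank * (d_in + d_out)


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """Compute y = W x + (alpha / rank) * B (A x).

    W stays frozen during fine-tuning; only A and B would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical 4x4 frozen weight (identity) with a rank-2 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0]]                 # rank x d_in
B_zero = [[0.0, 0.0] for _ in range(4)]    # d_out x rank, zero-initialised
B_trained = [[1.0, 0.0],
             [0.0, 1.0],
             [0.0, 0.0],
             [0.0, 0.0]]                   # stand-in for a trained B
x = [1.0, 2.0, 3.0, 4.0]

# With B at zero the adapter leaves the base model's output unchanged.
y_init = lora_forward(W, A, B_zero, x, alpha=2.0, rank=2)

# With a non-zero B the low-rank term shifts the output.
y_adapted = lora_forward(W, A, B_trained, x, alpha=2.0, rank=2)

# Adapter size grows linearly with rank for a fixed (hypothetical) layer shape.
counts = {r: lora_params(d_in=1024, d_out=1024, rank=r) for r in (32, 64, 128)}
```

Under these assumptions, moving from rank 64 to rank 128 doubles the adapter's trainable parameters per adapted weight. That added capacity is the kind that can lower validation loss without necessarily improving what listeners hear.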

The findings suggest that LoRA adaptation is effective primarily where the base model is genuinely weak, highlighting its utility for improving low-resource TTS performance.