Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

Sarashina2.2-TTS is a Japanese-centric, LLM-based text-to-speech system designed to address context-dependent kanji polyphony, where the correct reading of a character depends on its surrounding text. The model scales training data to approximately 361k hours, using a balanced mix of Japanese and English speech corpora. To handle reading disambiguation specifically, the authors implemented a targeted data augmentation pipeline covering all 2,136 Joyo (regular-use) kanji. Alongside the model release, the paper introduces the Joyo Kanji Yomi Benchmark, which covers 4,378 distinct readings of these characters. The authors also propose Kana-CER, a metric that evaluates pronunciation correctness by comparing synthesized speech against reference readings in kana space. Experiments show that the targeted augmentation significantly improves reading accuracy and yields state-of-the-art kanji-level performance. The system matches top baselines on general sentence-level pronunciation while delivering the highest speaker similarity in zero-shot synthesis. Cross-lingual evaluations further confirm that the balanced training approach keeps Japanese pronunciation stable regardless of the prompt language.
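
To make the idea of scoring in kana space concrete, here is a minimal sketch of a Kana-CER-style computation. It rests on several assumptions that this summary does not confirm: that the synthesized speech has already been transcribed and converted to kana by some upstream step (not shown), that katakana and hiragana are treated as equivalent, that punctuation and whitespace are ignored, and that the score is the Levenshtein distance divided by the reference length. The function names `normalize_kana`, `edit_distance`, and `kana_cer` are hypothetical and chosen only for illustration; the paper's actual normalization rules may differ.

```python
import unicodedata


def katakana_to_hiragana(text: str) -> str:
    """Map katakana U+30A1..U+30F6 to hiragana via the fixed 0x60 offset."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def normalize_kana(text: str) -> str:
    """Assumed normalization: NFKC, fold katakana to hiragana, keep kana only."""
    # NFKC folds half-width katakana into their full-width forms.
    text = unicodedata.normalize("NFKC", text)
    text = katakana_to_hiragana(text)
    # Keep hiragana and the prolonged sound mark; drop everything else.
    return "".join(
        ch for ch in text if "\u3041" <= ch <= "\u3096" or ch == "\u30fc"
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit costs, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def kana_cer(reference_kana: str, hypothesis_kana: str) -> float:
    """Character error rate between two kana strings (assumed definition)."""
    ref = normalize_kana(reference_kana)
    hyp = normalize_kana(hypothesis_kana)
    if not ref:
        raise ValueError("reference reading is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


# The kanji compound in question can be read "kyou" or "konnichi" depending on
# context; the reference expects "kyou", the hypothesis used "konnichi".
reference = "きょうはいいてんきです"
hypothesis = "こんにちはいいてんきです"
print(f"Kana-CER: {kana_cer(reference, hypothesis):.3f}")
```

Comparing in kana rather than in mixed kanji-kana text is what lets a misreading register as an error at all: both readings in the example would be written with the same kanji, so a conventional CER on the surface text could not distinguish them.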