Varying the segmentation width and cluster size in generative spoken language models (GSLMs) enables intelligible and natural speech synthesis at lower bitrates than the baseline. Speech continuation quality remains stable at these lower bitrates across multiple metrics, suggesting that the conventional higher-bitrate settings may be unnecessary. LLM-based metrics correlate better with human judgments, but their alignment is still low, underscoring the need for improved automatic evaluation.
Segmentation Width and Cluster Size Impact Speech Resynthesis in GSLMs