The author introduces HoLo-ToLk, a research project building speech-to-text (STT) and text-to-speech (TTS) models using the zero-parameter HSL byte substrate without tokenizers or learned input embeddings. The work demonstrates that raw HSL bytes can serve as a viable signal for audio processing when combined with specific architectural modifications.
- STT performance reaches a Character Error Rate (CER, lower is better) of 0.194 by adding a learnable gated fusion over the frozen substrate, outperforming a mel-spectrogram baseline of 0.213 in controlled comparisons.
- The TTS implementation feeds UTF-8 text bytes directly into an autoregressive transformer with guided attention and a HiFi-GAN vocoder, achieving a teacher-forced mel-L1 of 0.296.
- While the STT results are considered robust across four seeds, TTS free-run synthesis on arbitrary sentences remains rough and unstable, so the author frames it as a feasibility demo rather than a production-ready system.
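
For readers unfamiliar with the CER metric quoted above, the following is a minimal sketch of one common way to compute it: the character-level Levenshtein distance divided by the reference length. The function names `edit_distance` and `cer` are illustrative and are not taken from the project. How the project aggregates CER across utterances is not stated, so this shows the per-utterance form only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(cer("abcd", "abxd"))
```

Under this definition, a CER of 0.194 corresponds to roughly one character edit for every five reference characters.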
The project serves as a proof-of-concept for tokenizer-free audio processing, with the long-term goal of unifying the separate STT and TTS models into a single architecture.
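
To make "tokenizer-free" concrete on the text side, here is a small sketch of what feeding UTF-8 bytes directly to a model implies: each input symbol is an integer in the range 0 to 255, and no vocabulary file or merge table is involved. The helper names are hypothetical, and how HoLo-ToLk maps these bytes into the model is not covered here.

```python
from typing import List


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as its UTF-8 byte values (0..255); no tokenizer is involved."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: List[int]) -> str:
    """Inverse mapping: rebuild the string from UTF-8 byte values."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    ids = text_to_byte_ids("héllo")
    # Non-ASCII characters expand to several bytes, so the sequence can be
    # longer than the character count.
    print(len("héllo"), len(ids), ids)
    print(byte_ids_to_text(ids))
```

One consequence of this choice is that input length depends on the script: ASCII text uses one byte per character, while accented or non-Latin characters take two or more.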