MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

The authors introduce MinGram, a minimalist unigram tokenizer whose training procedure is simpler than standard Unigram training.

MinGram keeps the token-list representation but simplifies training by removing the suffix array, forward-backward pass, and iterative prune loop.
It uses a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step.
By making token count the primary objective and using a Unigram score only as a tiebreak, it balances compression with morphological alignment.
Across six languages, MinGram compresses better than both BPE and standard Unigram.
A compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment.
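
The training recipe described above can be illustrated with a small sketch. It is not the authors' implementation: the function names (`segment`, `hard_em_step`, `prune_once`), the `eps` smoothing constant, and the rule that single characters always survive pruning are assumptions made for illustration, and the BPE-derived seed vocabulary is taken as given in the form of a token-to-log-probability dictionary. The sketch shows a dynamic program that minimizes token count and uses the summed Unigram log-probability only to break ties, one Hard EM re-estimation from the resulting one-best segmentations, and a single score-based pruning pass.

```python
import math
from collections import Counter


def segment(text, logp):
    """Fewest-token segmentation; ties go to the higher total log-probability."""
    n = len(text)
    max_len = max(map(len, logp))
    # best[i] = (token_count, negated_score, backpointer) for the prefix text[:i]
    best = [(0, 0.0, -1)] + [(math.inf, math.inf, -1)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in logp or best[start][0] == math.inf:
                continue
            cand = (best[start][0] + 1, best[start][1] - logp[piece], start)
            # Lexicographic comparison: token count first, then score.
            if cand[:2] < best[end][:2]:
                best[end] = cand
    if best[n][0] == math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, i = [], n
    while i > 0:
        j = best[i][2]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]


def hard_em_step(corpus, logp, eps=1e-6):
    """Re-estimate scores from one-best segmentations (eps smoothing is an assumption)."""
    counts = Counter(tok for word in corpus for tok in segment(word, logp))
    total = sum(counts.values()) + eps * len(logp)
    return {tok: math.log((counts[tok] + eps) / total) for tok in logp}


def prune_once(logp, size):
    """One flat pruning pass: keep single characters plus the top-scoring longer tokens."""
    singles = {t for t in logp if len(t) == 1}
    ranked = sorted((t for t in logp if len(t) > 1), key=logp.get, reverse=True)
    kept = singles | set(ranked[: max(0, size - len(singles))])
    return {t: logp[t] for t in kept}
```

With a vocabulary in which `"ab"` has a much lower score than `"a"` and `"b"` together, `segment("ab", ...)` still returns the single token, because the score is consulted only among segmentations of equal length. Keeping single characters during pruning is one simple way to guarantee that every input remains segmentable; the paper may handle coverage differently.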

In controlled downstream language-model training, Unigram-family tokenizers, including MinGram, consistently achieve lower bits-per-byte than BPE.
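
Bits-per-byte is useful for this comparison because it normalizes by the size of the raw text rather than by the number of tokens, so models trained with different tokenizers are scored on the same scale. A minimal sketch of the standard computation follows, assuming per-token negative log-likelihoods are available in nats; the function name and argument layout are hypothetical and not taken from the paper.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Total negative log-likelihood in bits divided by the UTF-8 byte length of the text."""
    n_bytes = len(text.encode("utf-8"))
    total_bits = sum(token_nll_nats) / math.log(2)  # convert nats to bits
    return total_bits / n_bytes
```

A tokenizer that emits fewer tokens does not automatically score better under this metric: each token may be harder to predict, and only the total loss over the same bytes counts.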