The authors introduce MinGram, a minimalist unigram tokenizer that simplifies training by using a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step. This approach removes the need for suffix arrays, forward-backward passes, and iterative prune loops, leaving a training procedure with fewer components than standard Unigram training.

  • MinGram keeps the token-list representation but simplifies training by removing the suffix array, forward-backward pass, and iterative prune loop.
  • It uses a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step.
  • By making token count the primary objective and using the Unigram score only as a tiebreaker, MinGram balances compression with morphological alignment.
  • Across six languages, MinGram compresses better than both BPE and standard Unigram.
  • A compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment.
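
The training ingredients listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`min_token_segment`, `hard_em_step`, `prune_by_score`), the `max_len` limit, the smoothing constant `alpha`, and the rule of always keeping single characters are all assumptions made for illustration. The sketch only shows the general idea of a minimum-token path with a Unigram-score tiebreak, a Hard EM count update, and one flat pruning pass.

```python
import math
from collections import Counter


def min_token_segment(text, scores, max_len=16):
    """Segment text into the fewest vocabulary tokens.

    Ties in token count are broken by the higher total Unigram score
    (sum of token log-probabilities). `scores` maps token -> log-prob.
    """
    n = len(text)
    # best[i] = (token_count, negated_total_score) for the prefix text[:i];
    # tuples compare lexicographically, so count dominates and score breaks ties.
    best = [(math.inf, math.inf)] * (n + 1)
    back = [0] * (n + 1)
    best[0] = (0, 0.0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in scores or best[start][0] == math.inf:
                continue
            count, neg_score = best[start]
            cand = (count + 1, neg_score - scores[piece])
            if cand < best[end]:
                best[end] = cand
                back[end] = start
    if best[n][0] == math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


def hard_em_step(corpus, scores, alpha=0.1):
    """One Hard EM update: count tokens on the single best path, then
    re-estimate log-probabilities. `alpha` is an assumed smoothing constant
    so that unused tokens keep a finite score."""
    counts = Counter()
    for word in corpus:
        counts.update(min_token_segment(word, scores))
    total = sum(counts.values()) + alpha * len(scores)
    return {tok: math.log((counts.get(tok, 0) + alpha) / total) for tok in scores}


def prune_by_score(scores, vocab_size):
    """Single flat pruning pass: keep the highest-scoring tokens.
    Single characters are always kept (an assumption) so that every
    text made of known characters stays segmentable."""
    singles = {t: s for t, s in scores.items() if len(t) == 1}
    multi = sorted((t for t in scores if len(t) > 1), key=scores.get, reverse=True)
    keep = multi[: max(0, vocab_size - len(singles))]
    return {**singles, **{t: scores[t] for t in keep}}
```

With the toy scores `{"a": -1.0, "b": -1.0, "c": -1.0, "ab": -2.5, "bc": -1.5}`, the string `"abc"` has two two-token segmentations, `["ab", "c"]` and `["a", "bc"]`. Both have the same token count, so the tiebreak picks `["a", "bc"]`, whose total score is higher.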

In controlled downstream language-model training, Unigram-family tokenizers, including MinGram, consistently achieve lower bits-per-byte than BPE.
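
Bits-per-byte is useful for this comparison because it normalizes model loss by the byte length of the text rather than by the number of tokens, so tokenizers with different vocabularies can be compared on equal footing. The sketch below uses the common definition (total negative log-likelihood in nats, converted to bits, divided by the UTF-8 byte count). The function name `bits_per_byte` and its argument layout are hypothetical, and the source does not state the exact evaluation protocol used.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Convert per-token negative log-likelihoods (in nats) for `text`
    into bits per UTF-8 byte. Lower is better."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes
```

Because the denominator depends only on the text, a tokenizer that emits fewer tokens is not automatically favored: what matters is the total loss the model accumulates over the same bytes.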