A study quantifies the structural tokenization penalty that African languages face in commercial large language models: because their text is split into more subword tokens than equivalent English text, speakers pay higher costs and experience greater latency. Across 20 African languages and 11 frontier tokenizers, every tested language incurs a premium over English, with a median cost of 1.88 times that of English and a maximum of 8.92 times for the N'Ko script.

  • Median tokenization premium is 1.88x on GPT-5 / o200k_base, with penalties reaching 7-9x for Ethiopic and N'Ko scripts.
  • This translates into a multiplier of up to 8.9x on both inference cost and generation latency, and it shrinks the effective context window to as little as 11% of what the same window holds in English.
  • The Gemma 4 tokenizer offers the best current mitigation, reducing the mean premium from 3.31x to 2.38x, but does not eliminate the penalty.
  • The authors release an open measurement tool (afri-fertility), a public leaderboard, and a results dataset to highlight this digital divide.
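
The arithmetic linking these figures can be sketched in a few lines of Python. This is a minimal illustration, not the afri-fertility tool: the function names are hypothetical, the premium is assumed to be the ratio of token counts on parallel text, and a UTF-8 byte count stands in for a real subword tokenizer so the example runs with no dependencies.

```python
from typing import Callable


def token_premium(count_tokens: Callable[[str], int],
                  text: str, english_text: str) -> float:
    """Token count of `text` divided by the token count of its English parallel.

    Assumption: this ratio is what the summary calls the "premium".
    """
    english_tokens = count_tokens(english_text)
    if english_tokens == 0:
        raise ValueError("English reference text produced no tokens")
    return count_tokens(text) / english_tokens


def effective_context_fraction(premium: float) -> float:
    """Share of a fixed token window left for content, relative to English."""
    if premium <= 0:
        raise ValueError("premium must be positive")
    return 1.0 / premium


def utf8_byte_count(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


# Toy inputs, not a real parallel corpus: three Ethiopic characters
# (3 bytes each in UTF-8) against five ASCII characters.
toy_premium = token_premium(utf8_byte_count, "ሰላም", "hello")

# The reported worst case: an 8.92x premium leaves about 11% of the window.
worst_case_fraction = effective_context_fraction(8.92)
```

Under per-token pricing and token-by-token generation, the same ratio serves as the cost and latency multiplier, which is why one figure drives all three effects. To measure a real tokenizer, one would pass a function that returns the length of its encoded output in place of `utf8_byte_count`.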

The authors argue that these disparities encode a digital divide directly into subword vocabularies, disproportionately affecting the speakers who can least afford the increased computational costs.