The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs
A study quantifies the structural tokenization penalty that African languages face in commercial large language models. Because subword tokenizers split text in these languages into more tokens than equivalent English text, speakers pay higher costs, experience greater latency, and fit less content into a fixed context window. Across 20 African languages and 11 frontier tokenizers, every tested language incurs a premium over English, with median costs reaching 1.88 times that of English and up to 8.92 times for the N'Ko script.
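
To make the notion of a premium concrete, the following is a minimal sketch, not the study's actual method. It assumes the premium is the ratio of token counts for the same content in the target language and in English, measured per tokenizer and then summarized by the median. The function names, tokenizer labels, token counts, and the 128,000-token window are all hypothetical and chosen only for illustration.

```python
from statistics import median


def token_premium(target_tokens: int, english_tokens: int) -> float:
    """Ratio of target-language tokens to English tokens for parallel text."""
    if english_tokens <= 0:
        raise ValueError("english_tokens must be positive")
    return target_tokens / english_tokens


def median_premium(counts: dict[str, tuple[int, int]]) -> float:
    """Median premium across tokenizers.

    counts maps a tokenizer label to (target_tokens, english_tokens).
    """
    return median(token_premium(t, e) for t, e in counts.values())


def effective_context(window_tokens: int, premium: float) -> int:
    """English-equivalent content that fits in a window at a given premium."""
    return int(window_tokens / premium)


# Hypothetical token counts for one parallel passage (not data from the study).
example_counts = {
    "tokenizer_a": (150, 100),  # premium 1.5
    "tokenizer_b": (240, 120),  # premium 2.0
    "tokenizer_c": (330, 110),  # premium 3.0
}

premium = median_premium(example_counts)
print(f"median premium: {premium:.2f}x")
print(f"effective context: {effective_context(128_000, premium)} tokens")
```

Under per-token pricing, a premium of this kind multiplies the cost of a request directly, and it divides the amount of content that fits into a fixed context window by the same factor.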