A new approach uses the International Phonetic Alphabet (IPA) to build language-agnostic tokenizers for multilingual models. Experiments that train matched text and IPA subword tokenizers across 24 languages and 14 scripts show that IPA tokenizers improve tokenization quality, particularly for non-Latin scripts, and generalize better to unseen languages and scripts.
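To make the idea of matched text and IPA tokenizers concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy byte-pair-encoding (BPE) trainer written from scratch and a hypothetical, hand-written, heavily simplified transcription table (`TOY_IPA`). The function names and the "tokens per word" measure are illustrative choices. The sketch trains merges on Latin-script words only, then encodes Cyrillic words both as raw text and as IPA.

```python
from collections import Counter

# Hypothetical, simplified transcriptions for illustration only
# (stress marks and vowel reduction are ignored).
TOY_IPA = {"mama": "mama", "papa": "papa", "мама": "mama", "папа": "papa"}


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single characters."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[apply_merge(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def encode(word, merges):
    """Tokenize one word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def tokens_per_word(words, merges):
    """Average number of tokens per word (lower means less fragmentation)."""
    return sum(len(encode(w, merges)) for w in words) / len(words)


# Train on Latin-script words only, once as text and once as IPA.
train_words = ["mama", "papa", "mama"]
text_merges = train_bpe(train_words, num_merges=4)
ipa_merges = train_bpe([TOY_IPA[w] for w in train_words], num_merges=4)

# Evaluate on a script that was never seen during training.
unseen = ["мама", "папа"]
text_score = tokens_per_word(unseen, text_merges)
ipa_score = tokens_per_word([TOY_IPA[w] for w in unseen], ipa_merges)
print(f"text tokenizer: {text_score:.1f} tokens/word")
print(f"IPA tokenizer:  {ipa_score:.1f} tokens/word")
```

In this toy setting the text tokenizer falls back to single characters on the unseen script, while the IPA tokenizer reuses merges learned from the Latin-script words because both scripts map to the same phonetic symbols. A real pipeline would need an actual grapheme-to-phoneme converter and a far larger corpus, so the numbers here say nothing about the reported results.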
IPA-Based Tokenization Improves Multilingual Language Model Performance