Neural Machine Translation for Low-Resource Tangkhul-English

This study addresses machine translation for the Tangkhul-English language pair. Tangkhul is a severely under-resourced Tibeto-Burman language with minimal prior NLP infrastructure. The authors present two systems, a primary model based on ByT5-large and a contrastive system based on mT5-small, both fine-tuned on 38,336 parallel sentence pairs. On a held-out test set of 3,856 sentences, the ByT5-large system achieves a corpus BLEU score of 39.97 and a chrF++ score of 58.07. Additional metrics include a BERTScore F1 of 0.8104 and a COMET score of 0.7302 (computed with the wmt22-comet-da model). The study identifies the handling of Tangkhul's Latin-script diacritics as a specific orthographic challenge. The training corpus is also domain-biased, consisting primarily of biblical texts, stories, and conversational data. Future work aims to improve performance through data diversification and domain adaptation.
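
To make the diacritics issue concrete, the following is a minimal sketch of why Latin-script diacritics matter for a byte-level model such as ByT5, which consumes UTF-8 bytes rather than subword tokens. The same visible character can be stored as one precomposed code point or as a base letter plus a combining mark, and the two forms differ in byte length. The sample string and the helper names `nfc` and `utf8_len` are hypothetical and chosen only for illustration; the sample is not taken from the Tangkhul corpus, and the source does not say whether Unicode normalization was part of the authors' preprocessing.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)


def utf8_len(text: str) -> int:
    """Number of UTF-8 bytes, i.e. the input length a byte-level model sees."""
    return len(text.encode("utf-8"))


# Hypothetical sample (not from the corpus):
# 'a' followed by U+0304 COMBINING MACRON, i.e. the decomposed form.
decomposed = "a\u0304"
# NFC merges the pair into the single code point U+0101.
composed = nfc(decomposed)

print(len(decomposed), utf8_len(decomposed))  # 2 code points, 3 bytes
print(len(composed), utf8_len(composed))      # 1 code point, 2 bytes

# The raw strings compare unequal even though they render identically;
# after normalization they match.
print(decomposed == composed)            # False
print(nfc(decomposed) == nfc(composed))  # True
```

Under this assumption, applying one normalization form consistently to training and test text would keep visually identical strings from reaching the model as different byte sequences, and would keep character-level metrics such as chrF++ from penalizing encoding differences alone.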