MultiSynt/MT releases 4.8T-token parallel corpus across 36 languages

Researchers introduce MultiSynt/MT, an open synthetic parallel corpus containing approximately 4.8 trillion target-language tokens across 36 European languages. The dataset is generated by translating 100 billion high-quality Nemotron-CC tokens using Tower+ and OPUS-MT/HPLT-MT systems.

MultiSynt/MT provides the largest openly available pre-training resource for many medium- and lower-resource European languages.
LLMs trained on this corpus achieve scores comparable to native-data baselines (HPLT 2.0) using roughly 72% fewer pre-training tokens.
At a matched 100B-token training budget, models trained on MultiSynt/MT outperform the HPLT 2.0 baseline by approximately 15% in relative terms.
Standard multiple-choice benchmarks fail to capture translation-quality differences that fluency-sensitive LLM-as-judge evaluations do detect.
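
To make the headline percentages concrete, the arithmetic behind them can be sketched as follows. This is a minimal illustration, not part of the release: the function names are hypothetical, the baseline score of 0.40 is an invented placeholder rather than a reported result, and the per-language figure is only a simple average, since the actual distribution of tokens across the 36 languages is not stated here.

```python
def token_fraction(reduction: float) -> float:
    """Fraction of the baseline token budget still used after a relative reduction."""
    return 1.0 - reduction


def data_efficiency(reduction: float) -> float:
    """How many times more tokens the baseline needs to reach a comparable score."""
    return 1.0 / token_fraction(reduction)


def apply_relative_gain(baseline_score: float, relative_gain: float) -> float:
    """Score obtained when a baseline score improves by a relative (not absolute) gain."""
    return baseline_score * (1.0 + relative_gain)


# "Roughly 72% fewer pre-training tokens" means about 28% of the baseline budget,
# i.e. the baseline needs about 3.6x as many tokens.
remaining = token_fraction(0.72)
multiplier = data_efficiency(0.72)

# Simple average only: about 133B target-language tokens per language.
avg_tokens_per_language = 4.8e12 / 36

# Hypothetical baseline score of 0.40 (placeholder, not from the source):
# a 15% relative gain lifts it to about 0.46, not to 0.55.
hypothetical_score = apply_relative_gain(0.40, 0.15)

print(f"remaining budget: {remaining:.2f} of baseline")
print(f"baseline needs about {multiplier:.2f}x as many tokens")
print(f"average tokens per language: {avg_tokens_per_language:.3e}")
print(f"hypothetical score after 15% relative gain: {hypothetical_score:.2f}")
```

The distinction in the last step matters when reading the result: a 15% relative improvement scales the baseline score, whereas a 15-point absolute improvement would add to it.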

The release supports controlled research on multilingual pre-training data and evaluation, addressing the concentration of web-scale corpora in English.