Are Multilingual Models Actually Improving? Isolating True Cross-Lingual Transfer
A new metric, the Hardness Adjusted Transfer (HAT) Score, isolates true cross-lingual transfer by separating it from source-language accuracy gains. An analysis of 20 language models shows that transfer in small models is not broken, that progress with model size is slower than expected, and that clear improvements have occurred over time.