MSQA benchmark reveals cultural degradation in multilingual LLMs

Researchers introduce MSQA, a benchmark of 1,064 natively sourced questions across 11 language groups and five cultural dimensions, to test the assumption that multilingual fluency implies cultural alignment. Evaluating 18 large language models reveals substantial cultural degradation and a pronounced Locality Effect, where competence tracks pre-training exposure rather than general reasoning ability.

MSQA targets locally grounded knowledge to reduce shortcuts from English-centric cross-lingual transfer.
Models remain overconfident on unfamiliar cultural questions despite multilingual capabilities.
Repeated sampling yields unstable correctness, and retrieval augmentation gives uneven gains on long-tail facts.
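
As a rough illustration of how the overconfidence and sampling-instability findings could be quantified, the sketch below computes expected calibration error (ECE) and a simple agreement rate across repeated samples. The summary does not say which metrics the MSQA authors use, so both function names, the ten-bin default, and the input formats are assumptions made for this example, not the benchmark's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and accuracy, over equal-width bins.

    confidences: floats in [0, 1], one per answered question (assumed format).
    correct: booleans of the same length, True if the answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to a confidence bin; 1.0 goes into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def stable_fraction(samples_per_question):
    """Share of questions whose repeated samples are all right or all wrong.

    samples_per_question: list of lists of booleans, one inner list per question.
    """
    if not samples_per_question:
        return 0.0
    stable = sum(1 for runs in samples_per_question if len(set(runs)) <= 1)
    return stable / len(samples_per_question)
```

Under these assumptions, a model that reports 0.9 confidence while answering one question in four correctly would show a large ECE, and a low stable fraction would indicate that correctness flips between samples of the same question.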

The findings indicate that cultural alignment cannot be inferred from multilingual ability alone and requires deeper intervention than calibration, sampling, or retrieval at inference time.