How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation
This paper proposes a diagnostic framework decomposing historical language difficulty into tokenization cost, predictive uncertainty, semantic robustness, and context sensitivity. The authors evaluate this framework on 17th-century Italian, 19th-century Italian, and 18th-century Russian texts to understand how LLMs process historical languages.