How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation

This paper proposes a diagnostic framework decomposing historical language difficulty into tokenization cost, predictive uncertainty, semantic robustness, and context sensitivity. The authors evaluate this framework on 17th-century Italian, 19th-century Italian, and 18th-century Russian texts to understand how LLMs process historical languages.

The study uses three sources: a newly curated corpus of 17th-century Italian texts (1610-1689), the canonical 19th-century Italian novel "I Promessi Sposi," and 18th-century Russian civil print books.
Russian and early modern Italian incur comparable tokenization penalties (25-30% token inflation), but their predictive difficulty diverges sharply.
17th-century Italian is on average 2.4 times more surprising than its modern equivalent, reaching 3.2 times for academic prose.
Embedding similarity remains robust (> 0.85) across all datasets, indicating models can represent historical meaning even when generation is unstable.
A minimal temporal context prompt reduces historical surprisal by approximately 60%.
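
The quantities behind these findings can be illustrated with a minimal sketch. It assumes that tokenization inflation is the relative increase in tokens per word over a modern baseline, that surprisal is the mean negative log2 probability per token, and that semantic robustness is measured by cosine similarity between embeddings. These definitions and all function names are illustrative assumptions, not the authors' implementation, and the code reproduces no numbers from the study.

```python
import math

def tokenization_inflation(hist_tokens_per_word, modern_tokens_per_word):
    """Relative increase in tokens per word over a modern baseline (assumed definition)."""
    return hist_tokens_per_word / modern_tokens_per_word - 1.0

def mean_surprisal(token_probs):
    """Mean surprisal in bits: average of -log2 p over the model's per-token probabilities."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

def surprisal_ratio(hist_probs, modern_probs):
    """How many times more surprising the historical text is than the modern one."""
    return mean_surprisal(hist_probs) / mean_surprisal(modern_probs)

def relative_reduction(baseline, prompted):
    """Fraction of baseline surprisal removed, e.g. by adding a temporal context prompt."""
    return (baseline - prompted) / baseline

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In practice the per-token probabilities would come from a language model scoring each text, once without and once with a short prefix stating the period of the text, and the embedding vectors from an encoder applied to a historical passage and its modern counterpart.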

These findings suggest that while historical text imposes a consistent encoding tax, digital libraries can safely deploy LLMs for semantic retrieval tasks, provided generative applications are carefully adapted.