Scaling limit of the Random Language Model

This article develops a quantitative theory for the Random Language Model (RLM) in a scaling limit where the number of hidden symbols approaches infinity while the grammar temperature approaches zero at a fixed ratio. The study establishes that the model admits a controlled description based on a large-deviation principle over rule-usage patterns, mapping the problem to Random Energy Models with nontrivial combinatorics.

The RLM exhibits a condensation transition at the critical value x = 1/8, below which rule usage concentrates and language statistics depend on corpus length.
A second characteristic scale, at x = 1/2, marks the point at which the entropy begins to fall below its maximal value.
Explicit scaling laws are derived for the number of distinct rules, entropy, and related observables across scaling, saturation, and critical regimes.
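
As a rough illustration of what condensation means in a Random Energy Model, the following minimal sketch simulates the textbook REM, not the RLM itself. It draws independent Gaussian energies for 2^M levels and computes the participation ratio Y = sum_i p_i^2 of the Boltzmann weights, which stays near zero when the weight is spread over many levels and becomes of order one when a few levels dominate. The function names, the system size, the variance convention (M/2) and the sampled temperatures are all assumptions chosen for illustration; none of them are taken from the article.

```python
import math
import random

# Critical inverse temperature of the textbook REM with 2**M levels and
# energy variance M/2 (an illustrative convention, not the article's).
BETA_C = 2.0 * math.sqrt(math.log(2.0))


def rem_participation_ratio(num_spins, beta, rng):
    """Participation ratio Y = sum_i p_i^2 for one REM sample."""
    num_levels = 2 ** num_spins
    sigma = math.sqrt(num_spins / 2.0)
    # Independent Gaussian energies, one per level.
    energies = [rng.gauss(0.0, sigma) for _ in range(num_levels)]
    # Boltzmann log-weights, shifted by their maximum for numerical stability.
    log_weights = [-beta * e for e in energies]
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)


def mean_participation_ratio(num_spins, beta, samples=20, seed=0):
    """Average the participation ratio over independent disorder samples."""
    rng = random.Random(seed)
    values = [rem_participation_ratio(num_spins, beta, rng) for _ in range(samples)]
    return sum(values) / samples


if __name__ == "__main__":
    # One temperature well above and one well below the REM transition.
    for label, beta in (("high T", 0.3 * BETA_C), ("low T", 3.0 * BETA_C)):
        y = mean_participation_ratio(12, beta)
        print(f"{label}: beta = {beta:.3f}, mean Y = {y:.4f}")
```

In this toy setting the high-temperature value of Y shrinks as the number of levels grows, while the low-temperature value stays finite, which is the qualitative signature of condensation. How this picture carries over to rule usage in the RLM, including the role of the combinatorial factors, is the subject of the article and is not captured by the sketch.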

The theory resolves previous ambiguities regarding the existence of a thermodynamic transition and explains the slow approach to the large-N limit as a consequence of a dependence on log N. It provides a unified framework in which universal statistical properties of language emerge from typical realizations of generative grammars.