Modern language models typically allocate parameters uniformly across identical layers, despite evidence that later layers primarily refine the residual stream rather than transform it. To address this asymmetry, researchers investigated whether parameter capacity should vary by depth under a fixed budget. Controlled experiments demonstrated that allocating more capacity to earlier layers and less to later layers improves perplexity compared to uniform baselines, while the reverse allocation degrades performance. Building on these results, the authors introduce Tapered Language Models (TLMs), an architectural principle where parameter-bearing components are monotonically tapered across depth. MLPs serve as the primary site for this instantiation due to their dominance in parameter count and clear width axis. The study tested tapering via a smooth cosine schedule across three model scales and four architectures, including Transformer, Gated Attention, Hope-attention, and Titans. Results show that TLMs consistently improve perplexity and downstream benchmark performance over uniform baselines without additional compute costs. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic design lever for language models.
Tapered Language Models: Improving Performance via Depth-Aware Capacity Allocation
from English