Tapered Language Models: Improving Performance via Depth-Aware Capacity Allocation

The article introduces Tapered Language Models (TLMs), an architectural principle that allocates more parameter capacity to earlier layers and less to later layers within a fixed budget. This approach challenges the standard practice of uniform layer width by leveraging evidence that later layers primarily refine the residual stream rather than transforming it.

Experiments show that tapering MLP width via a smooth cosine schedule improves perplexity and downstream benchmark performance across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans).
Relative to uniform-width baselines, allocating more capacity to earlier layers improves results, whereas the reverse allocation (more capacity in later layers) degrades them.
The method provides these gains at no additional parameter or compute cost, establishing depth-aware capacity allocation as an architecture-agnostic design lever.
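
To make the fixed-budget idea concrete, here is a minimal sketch of one way a cosine taper over MLP widths could be computed. The summary does not give the exact schedule, so the half-cosine form, the function name `cosine_taper_widths`, and the `ratio` parameter (first-layer width divided by last-layer width) are all illustrative assumptions. The sketch also assumes that MLP parameter count is proportional to hidden width for a fixed model dimension, so keeping the summed width constant keeps the MLP parameter budget constant.

```python
import math


def cosine_taper_widths(num_layers, base_width, ratio=2.0):
    """Per-layer MLP widths that follow a half-cosine from wide to narrow.

    Hypothetical helper, not the article's exact schedule. The widths always
    sum to num_layers * base_width, i.e. the same total as a uniform model
    whose layers all have width base_width.
    """
    if num_layers < 1 or base_width < 1 or ratio <= 0:
        raise ValueError("num_layers and base_width must be >= 1, ratio > 0")
    if num_layers == 1:
        return [base_width]

    # Half-cosine shape: 1.0 at the first layer, 0.0 at the last layer.
    shape = [
        0.5 * (1.0 + math.cos(math.pi * i / (num_layers - 1)))
        for i in range(num_layers)
    ]
    # Unnormalized widths: `ratio` at the first layer, 1.0 at the last.
    raw = [1.0 + (ratio - 1.0) * s for s in shape]

    # Rescale so the widths sum to the uniform model's total width.
    budget = num_layers * base_width
    scale = budget / sum(raw)
    widths = [math.floor(r * scale) for r in raw]

    # Flooring can leave a few units unassigned; give them to the earliest
    # layers so the total matches the budget exactly.
    remainder = budget - sum(widths)
    for i in range(remainder):
        widths[i % num_layers] += 1
    return widths


if __name__ == "__main__":
    tapered = cosine_taper_widths(num_layers=12, base_width=2048, ratio=3.0)
    print(tapered)
    print(sum(tapered) == 12 * 2048)
```

In this sketch, `ratio=1.0` recovers the uniform-width baseline, and a `ratio` below 1 produces the reverse allocation, with narrower early layers and wider late layers.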

These findings suggest that tapering is a simple, effective design choice for language models: it improves model quality at a fixed parameter and compute budget.