The article introduces Tapered Language Models (TLMs), an architectural principle that allocates more parameter capacity to earlier layers and less to later layers within a fixed budget. This approach challenges the standard practice of uniform layer width by leveraging evidence that later layers primarily refine the residual stream rather than transforming it.
- Experiments show that tapering MLP width via a smooth cosine schedule improves perplexity and downstream benchmark performance across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans).
- Allocating more capacity to earlier layers yields better results, while the reverse allocation hurts performance compared to uniform-width baselines.
- The method provides these gains at no additional parameter or compute cost, establishing depth-aware capacity allocation as an architecture-agnostic design lever.
These findings suggest that tapering is a simple, effective optimization for language model design that improves efficiency without increasing resource requirements.