Variable-Width Transformers Outperform Uniform Architectures
A new ×-shaped transformer architecture varies width across depth: early and late layers are wider, while middle layers are narrower. This lowers the average layer width, yielding 22% fewer FLOPs and 15% less KV cache memory, while achieving lower language modeling loss than uniform-width baselines across models from 200M to 2B parameters.
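
To make the shape concrete, here is a minimal sketch of one possible width schedule with wide ends and a narrow middle. The summary above does not say how widths are interpolated, so the linear taper, the function name `x_shaped_widths`, and the parameters `edge_width` and `mid_width` are all assumptions made for illustration, not the authors' method.

```python
def x_shaped_widths(num_layers, edge_width, mid_width):
    """Return one width per layer: widest at the first and last layers,
    narrowest at the center.

    Assumption: widths taper linearly with distance from the center.
    The actual schedule used by the architecture is not given in the text.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [edge_width]

    center = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        # frac is 1.0 at the first/last layer and 0.0 at the center.
        frac = abs(i - center) / center
        widths.append(round(mid_width + frac * (edge_width - mid_width)))
    return widths


def average_width(widths):
    """Mean layer width, for comparison against a uniform-width stack."""
    return sum(widths) / len(widths)


# Hypothetical sizes, chosen only to show the shape.
widths = x_shaped_widths(num_layers=5, edge_width=1024, mid_width=512)
print(widths)
print(average_width(widths))
```

With these illustrative numbers the average width falls below the edge width of 1024, which is the kind of reduction the summary refers to. The sketch says nothing about how a narrower average width translates into the reported FLOP or KV cache savings.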