This paper analyzes the expressiveness of deep transformers using bounded-depth grammars. It constructs transformers with positional attention whose depth scales linearly with the grammar depth and whose neuron count grows quadratically with the number of production rules. The results support the linear representation hypothesis by showing that these models can encode abstract grammatical states in low-dimensional, linearly separable subspaces.
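As a rough formalization of the two scaling claims in this summary, one might write them in asymptotic notation. The symbols are assumptions introduced here for illustration and are not taken from the paper: D is the grammar depth, |R| the number of production rules, L the transformer depth, and N the neuron count of the constructed model.

\[
L = O(D), \qquad N = O\!\left(|R|^{2}\right)
\]

The summary does not state the constants involved, or whether these bounds also depend on other quantities such as vocabulary size or sequence length.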
Expressivity Analysis of Hierarchical Modelling in Deep Transformers