Attention Sinks and Collapse Are Universal Consequences of Content-Based Routing
The study demonstrates that attention sinks, representation collapse, and norm stratification are not unique to transformer architectures but are inherent consequences of content-based routing under a fixed similarity metric. It establishes an identity showing softmax attention functions as Boltzmann-weighted aggregation over Euclidean distances with constant key norms, rendering it blind to key magnitude due to the omission of a specific norm term. This framework predicts that any router utilizing a metric ill-matched to its representations will compensate by concentrating routing and collapsing the routed representations. The authors validate this hypothesis across diverse models including nine pretrained transformers, graph attention networks, selective state-space models, recurrent mixers, and learned residual layers. Experimental results confirm that all tested architectures exhibit this identical signature of collapse regardless of their specific domain or structure. Furthermore, within-model ablations isolate the routing mechanism as the primary cause rather than incidental training dynamics. The onset of this phenomenon is shown to be contingent on the strength of the positional brake accompanying the content score, which can shift the effect across its range. However, the underlying mechanism remains invariant and does not require norm stratification, as routers with norm-normalized keys exhibit the same concentration behavior.