The authors introduce the Vocabulary-Aligned Sparse Autoencoder (VASAE), a method that applies vocabulary-aligned anchoring while training sparse autoencoder features, so that each feature receives an intrinsic token name based on the nearest embedding in the Transformer's vocabulary.
- VASAE assigns intrinsic token names without reducing reconstruction quality compared to standard SAEs.
- In GPT-2-small layers 0–10, approximately 90% of features align with tokens at a score cutoff of 0.8.
- In Llama-3.1-8B, shallow-layer dictionaries contain 92.8% strongly aligned features, whereas alignment in the final layer is limited.
- Case studies indicate that, after the sentence-level mean sparse code is subtracted, the intrinsic token names that remain are relevant to nearby input tokens.
This approach connects learned features to intrinsic token names during training, complementing post hoc interpretation of learned dictionaries.
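
To make the naming and cutoff steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the alignment score is cosine similarity between a feature direction and a vocabulary embedding (the summary does not state the metric) and that both live in the same space. The function names `name_features` and `aligned_fraction` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_features(feature_dirs, vocab_embeddings, vocab_tokens):
    """Name each feature after its nearest vocabulary embedding.

    Assumption: "nearest" means highest cosine similarity.
    Returns a list of (token, score) pairs, one per feature.
    """
    names = []
    for direction in feature_dirs:
        scores = [cosine(direction, emb) for emb in vocab_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        names.append((vocab_tokens[best], scores[best]))
    return names


def aligned_fraction(names, cutoff=0.8):
    """Fraction of features whose best score reaches the cutoff."""
    return sum(1 for _, score in names if score >= cutoff) / len(names)


# Hypothetical toy vocabulary and feature directions (3-dimensional).
vocab_tokens = ["cat", "dog", "the"]
vocab_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
feature_dirs = [
    [0.9, 0.1, 0.0],  # close to "cat"
    [0.0, 2.0, 0.1],  # close to "dog" (scale does not matter for cosine)
    [0.6, 0.5, 0.4],  # spread across tokens, so weakly aligned
]

names = name_features(feature_dirs, vocab_embeddings, vocab_tokens)
fraction = aligned_fraction(names, cutoff=0.8)
```

On this toy data, the first two features clear the 0.8 cutoff and the third does not, so `fraction` is two thirds. Figures such as the roughly 90% reported for GPT-2-small would be the analogous quantity computed over a full dictionary, under whatever score the authors actually use.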