VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring

The authors introduce the Vocabulary-Aligned Sparse Autoencoder (VASAE), which trains a sparse autoencoder with vocabulary-aligned anchoring so that each feature receives an intrinsic token name: the token whose embedding in the Transformer's vocabulary lies nearest to that feature.
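The naming step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "nearest" means highest cosine similarity and that each feature is represented by a single direction vector; neither detail is stated in this summary. The function names `name_features` and `aligned_fraction`, and the toy vocabulary, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def name_features(directions, embeddings, vocab):
    """Pair each feature direction with its nearest vocabulary token.

    Assumption: nearness is measured by cosine similarity.
    Returns a list of (token, score) tuples, one per feature.
    """
    names = []
    for direction in directions:
        scores = [cosine(direction, emb) for emb in embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        names.append((vocab[best], scores[best]))
    return names

def aligned_fraction(names, cutoff=0.8):
    """Fraction of features whose best score reaches the cutoff."""
    return sum(score >= cutoff for _, score in names) / len(names)

# Toy data (hypothetical): three tokens with orthogonal embeddings.
vocab = ["cat", "dog", "the"]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
directions = [[0.9, 0.1, 0.0], [0.2, 0.6, 0.5]]

names = name_features(directions, embeddings, vocab)
for token, score in names:
    print(f"{token}: {score:.3f}")
print("aligned fraction:", aligned_fraction(names))
```

In this toy case the first direction is named "cat" with a score near 0.99, while the second is named "dog" with a score below 0.8, so only half of the features count as aligned at that cutoff.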

VASAE assigns intrinsic token names without reducing reconstruction quality compared to standard SAEs.
In GPT-2-small layers 0-10, approximately 90% of features align with tokens at an alignment-score cutoff of 0.8.
In Llama-3.1-8B, 92.8% of the features in shallow-layer dictionaries are strongly aligned, whereas alignment in the final layer is limited.
Case studies indicate that, after the sentence-level mean sparse code is subtracted, the intrinsic token names that remain are relevant to nearby input tokens.
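The mean-subtraction step used in the case studies can be sketched as follows. This is an illustrative reading, not the authors' procedure. It assumes the sentence-level mean is taken per feature over the tokens of one sentence, and that the features of interest are those with the largest residual activation. The names `subtract_sentence_mean` and `top_features`, and the toy codes, are hypothetical.

```python
def subtract_sentence_mean(codes):
    """Subtract the per-feature mean over a sentence from each token's code.

    codes: list of sparse code vectors, one per token in the sentence.
    """
    n_tokens = len(codes)
    n_features = len(codes[0])
    mean = [sum(tok[j] for tok in codes) / n_tokens for j in range(n_features)]
    return [[tok[j] - mean[j] for j in range(n_features)] for tok in codes]

def top_features(residual, k=1):
    """Indices of the k largest residual activations for one token."""
    order = sorted(range(len(residual)), key=lambda j: residual[j], reverse=True)
    return order[:k]

# Toy sentence (hypothetical): 3 tokens, 3 features.
# Feature 0 fires equally on every token, so it carries no token-specific signal.
codes = [
    [2.0, 1.0, 0.0],
    [2.0, 0.0, 3.0],
    [2.0, 0.0, 0.0],
]

residuals = subtract_sentence_mean(codes)
for position, residual in enumerate(residuals):
    print(position, top_features(residual))
```

Here the always-on feature 0 is cancelled by the subtraction, so the top residual feature for the first token is feature 1 and for the second token is feature 2. The intrinsic token names of such features would then be compared with the nearby input tokens.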

This approach connects learned features to intrinsic token names during training, complementing post hoc interpretation of learned dictionaries.