The authors introduce the Vocabulary-Aligned Sparse Autoencoder (VASAE), a method that anchors sparse autoencoder features to the Transformer's vocabulary during training, so that each feature receives an intrinsic token name: the token whose embedding is nearest to it.
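
The naming step can be illustrated with a minimal sketch. It covers only the nearest-embedding lookup, not the training-time anchoring, whose objective is not described in this summary. The use of cosine similarity, the choice of a per-feature direction vector as the thing being compared, and all names (`cosine`, `name_features`, the toy vocabulary) are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def name_features(feature_dirs, vocab_embeddings, vocab_tokens):
    """For each feature direction, return (nearest token, similarity score).

    Assumption: "nearest" means highest cosine similarity to a token embedding.
    """
    names = []
    for d in feature_dirs:
        scores = [cosine(d, e) for e in vocab_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        names.append((vocab_tokens[best], scores[best]))
    return names

# Toy 2-dimensional example (hypothetical data, not from the paper).
vocab_tokens = ["cat", "dog", "the"]
vocab_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature_dirs = [[0.9, 0.1], [0.1, 2.0]]

names = name_features(feature_dirs, vocab_embeddings, vocab_tokens)
for token, score in names:
    print(f"{token}: {score:.3f}")
```

In this toy setup the first feature is named after "cat" and the second after "dog", since those embeddings point in nearly the same direction as the respective feature vectors.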

  • VASAE assigns intrinsic token names without reducing reconstruction quality compared to standard SAEs.
  • In GPT-2-small layers 0 to 10, approximately 90% of features align with a token at an alignment-score cutoff of 0.8.
  • In Llama-3.1-8B, 92.8% of features in shallow-layer dictionaries are strongly aligned, while alignment in the final layer is limited.
  • Case studies indicate that, after the sentence-level mean sparse code is subtracted, the intrinsic token names of the features that remain active are relevant to nearby input tokens.
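
Two of the quantities in the findings above can be sketched in a few lines: the fraction of features whose alignment score clears a cutoff, and the residual sparse codes left after subtracting the sentence-level mean. This is an illustrative sketch under stated assumptions, not the authors' evaluation code. The function names, the `>=` comparison at the cutoff, the per-sentence mean taken over token positions, and the toy numbers are all hypothetical.

```python
def aligned_fraction(scores, cutoff=0.8):
    """Fraction of features whose alignment score is at least `cutoff`.

    Assumption: a feature counts as aligned when score >= cutoff.
    """
    return sum(s >= cutoff for s in scores) / len(scores)

def subtract_sentence_mean(codes):
    """Subtract the per-feature mean over all token positions in one sentence.

    `codes` is a list of rows, one sparse code (list of activations) per token.
    """
    n = len(codes)
    mean = [sum(col) / n for col in zip(*codes)]
    return [[c - m for c, m in zip(row, mean)] for row in codes]

def top_feature_names(residual_row, feature_names, k=1):
    """Names of the k features with the largest positive residual activation."""
    order = sorted(range(len(residual_row)),
                   key=lambda j: residual_row[j], reverse=True)
    return [feature_names[j] for j in order[:k] if residual_row[j] > 0]

# Toy data (hypothetical): 4 alignment scores, and a 2-token sentence
# encoded over 3 features named by their intrinsic tokens.
scores = [0.95, 0.85, 0.60, 0.90]
fraction = aligned_fraction(scores)

feature_names = ["the", "cat", "dog"]
codes = [[2.0, 0.0, 1.0],
         [2.0, 1.0, 0.0]]
residual = subtract_sentence_mean(codes)
top_per_token = [top_feature_names(row, feature_names) for row in residual]
print(fraction, residual, top_per_token)
```

In the toy sentence, the feature named "the" fires equally at every position, so mean subtraction removes it and leaves the position-specific features as the top names.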

This approach connects learned features to intrinsic token names during training, complementing post hoc interpretation of learned dictionaries.