VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring
The authors introduce the Vocabulary-Aligned Sparse Autoencoder (VASAE), a method that trains sparse autoencoder (SAE) features with vocabulary-aligned anchoring, so that each feature receives an intrinsic token name taken from the nearest embedding in the Transformer's vocabulary.
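
To make the naming step concrete, here is a minimal sketch of a nearest-embedding lookup. It is an illustration under assumptions, not the authors' implementation: the summary does not say which distance is used or how anchoring enters training, so the sketch assumes cosine similarity, assumes each feature is a direction vector in the same space as the token embeddings, and covers only the lookup, not the training procedure. The function name `name_features` and the toy data are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def name_features(directions, embeddings, vocab):
    """Assign each feature direction the token with the most similar embedding.

    Assumption (not stated in the source): "nearest" means highest cosine
    similarity between the feature direction and the token embedding.
    Returns a list of (token, similarity) pairs, one per direction.
    """
    if len(embeddings) != len(vocab):
        raise ValueError("need exactly one embedding per vocabulary token")
    unit_embeddings = [_normalize(e) for e in embeddings]
    names = []
    for direction in directions:
        unit_dir = _normalize(direction)
        sims = [sum(a * b for a, b in zip(unit_dir, e)) for e in unit_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        names.append((vocab[best], sims[best]))
    return names


# Toy data: a three-token vocabulary with orthogonal embeddings (hypothetical).
vocab = ["cat", "dog", "car"]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
directions = [[0.9, 0.1, 0.0], [0.0, 0.2, 2.0]]

named = name_features(directions, embeddings, vocab)
for token, sim in named:
    print(f"{token}: {sim:.3f}")
```

In this toy setup the first direction is named `cat` and the second `car`. Returning the similarity next to the name lets a caller judge how well a feature is anchored to its token rather than trusting the name alone.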