This paper proposes a new approach to mitigating catastrophic forgetting in large language models: it regularizes in activation space, using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary, rather than relying on traditional weight-space methods such as Elastic Weight Consolidation (EWC).
- The method derives a loss function over SAE features that balances stability and plasticity, and shows that EWC is a special case of this framework.
- It requires no previous-task data after mask construction, retaining only a compact SAE feature mask computed from current-task data.
- The approach is more memory-efficient than weight-space methods because the SAE feature space has far lower dimensionality than the parameter space.
- On the TRACE and MedCL benchmarks, it achieves the strongest results among approaches without task-specific architectural components, surpassing EWC.
The authors consider this important because LLMs are polysemantic, so weight-based protection is non-selective; regularizing monosemantic SAE features offers a more effective way to isolate and protect specific knowledge during continual learning.
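
To make the idea of activation-space regularization concrete, the following is a minimal sketch of a masked penalty on SAE feature drift. It is not the paper's loss, which this summary does not reproduce. The ReLU encoder form, the squared-difference penalty, the binary mask, and all names (`sae_encode`, `masked_feature_penalty`, `W_enc`, `b_enc`, `mask`, `h_ref`, `h_new`) are assumptions made for illustration. The reference activations `h_ref` are assumed to come from the pre-update model run on current-task inputs, so no previous-task data is needed.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """Hypothetical frozen SAE encoder: ReLU(h @ W_enc + b_enc)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def masked_feature_penalty(h_new, h_ref, W_enc, b_enc, mask):
    """Batch mean of the mask-weighted squared drift in SAE feature space.

    h_new: activations of the model being fine-tuned, shape (batch, d_model)
    h_ref: activations of the pre-update model on the same inputs
    mask:  per-feature weights, shape (d_sae,); 1 protects a feature, 0 frees it
    """
    z_new = sae_encode(h_new, W_enc, b_enc)
    z_ref = sae_encode(h_ref, W_enc, b_enc)
    drift = (z_new - z_ref) ** 2
    return float(np.mean(np.sum(mask * drift, axis=-1)))

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_sae, batch = 8, 32, 4
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)

# Assumed mask: protect the first 5 features, leave the rest unconstrained.
mask = np.zeros(d_sae)
mask[:5] = 1.0

h_ref = rng.normal(size=(batch, d_model))
h_new = h_ref + 0.1 * rng.normal(size=(batch, d_model))

penalty_same = masked_feature_penalty(h_ref, h_ref, W_enc, b_enc, mask)
penalty_shifted = masked_feature_penalty(h_new, h_ref, W_enc, b_enc, mask)
penalty_all = masked_feature_penalty(h_new, h_ref, W_enc, b_enc, np.ones(d_sae))
```

In a training loop, a term like this would be added to the new-task loss with a weighting coefficient. The only state carried between tasks in this sketch is `mask`, a vector of length `d_sae`, whereas diagonal EWC stores a Fisher estimate and a reference value for every parameter.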