SAE-Guided Activation Regularization for LLM Continual Learning

This paper proposes a new approach to catastrophic forgetting in large language models by regularizing in activation space using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary, rather than relying on traditional weight-space methods like Elastic Weight Consolidation (EWC).
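
For orientation, the contrast between the two kinds of regularizer can be written out. The first line is the standard diagonal EWC objective. The second line is only an illustrative activation-space analogue, not the loss derived in the paper: the mask m_j, the SAE feature map f_j, the hidden activations h_theta(x), and the squared-difference form are all assumptions made here for the sake of the sketch.

```latex
% Standard diagonal EWC: penalize movement of each weight \theta_i away from
% its value \theta_i^{*} after the previous task, weighted by Fisher information F_i.
\mathcal{L}_{\mathrm{EWC}}(\theta)
  = \mathcal{L}_{\mathrm{new}}(\theta)
  + \frac{\lambda}{2} \sum_{i} F_i \,\bigl(\theta_i - \theta_i^{*}\bigr)^2

% Hypothetical activation-space analogue (assumed form, for illustration only):
% penalize drift of the masked SAE features f_j of the hidden activations h(x),
% with m_j \in \{0, 1\} selecting the features to protect.
\mathcal{L}_{\mathrm{feat}}(\theta)
  = \mathcal{L}_{\mathrm{new}}(\theta)
  + \frac{\lambda}{2} \, \mathbb{E}_{x} \sum_{j} m_j \,
    \bigl( f_j(h_{\theta}(x)) - f_j(h_{\theta^{*}}(x)) \bigr)^2
```

In the first form the sum runs over parameters; in the second it runs over SAE features, which is where the selectivity and the smaller bookkeeping would come from.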

The method derives a loss function over SAE features that balances stability and plasticity, and shows that EWC is a special case of this framework.
Once the mask has been constructed, it requires no previous-task data: all that is retained is a compact SAE feature mask, computed from a task's data while that task is the current one.
The approach is more memory efficient than weight-space regularization because the SAE feature space has far lower dimensionality than the parameter space.
On the TRACE and MedCL benchmarks, it achieves the strongest results among approaches without task-specific architectural components, surpassing EWC.
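
The mask-then-regularize idea summarized above can be made concrete with a small sketch. This is a toy under stated assumptions, not the authors' implementation: the function names (`sae_encode`, `build_feature_mask`, `masked_feature_penalty`), the top-k mean-activation rule for choosing protected features, the squared-drift penalty, and the use of a frozen reference model's activations on current inputs are all hypothetical choices made for illustration. It uses only the Python standard library and a hand-built 8-feature SAE over 4-dimensional activations.

```python
def sae_encode(x, W_enc, b_enc):
    """ReLU SAE encoder: f = relu(W_enc @ x + b_enc), written with plain lists."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]


def build_feature_mask(activations, W_enc, b_enc, top_k):
    """Assumed rule: protect the top_k SAE features by total activation on a task."""
    n_feat = len(W_enc)
    totals = [0.0] * n_feat
    for x in activations:
        for j, f in enumerate(sae_encode(x, W_enc, b_enc)):
            totals[j] += f
    ranked = sorted(range(n_feat), key=lambda j: totals[j], reverse=True)
    keep = set(ranked[:top_k])
    return [1.0 if j in keep else 0.0 for j in range(n_feat)]


def masked_feature_penalty(h_new, h_ref, mask, W_enc, b_enc):
    """Assumed penalty: squared drift of protected SAE features only."""
    f_new = sae_encode(h_new, W_enc, b_enc)
    f_ref = sae_encode(h_ref, W_enc, b_enc)
    return sum(m * (a - b) ** 2 for m, a, b in zip(mask, f_new, f_ref))


# Toy SAE (hypothetical): features 0-3 fire on +x_i, features 4-7 fire on -x_i.
D = 4
W_enc = ([[1.0 if i == j else 0.0 for i in range(D)] for j in range(D)]
         + [[-1.0 if i == j else 0.0 for i in range(D)] for j in range(D)])
b_enc = [0.0] * (2 * D)

# Activations collected on an earlier task while it was the current task.
task_acts = [[1.0, 0.0, 0.2, 0.0], [0.8, 0.0, 0.1, 0.0]]
mask = build_feature_mask(task_acts, W_enc, b_enc, top_k=2)

# Only the mask is kept from here on; task_acts could be discarded.
h_ref = [1.0, 0.5, 0.0, 0.0]            # frozen reference model, current input
h_free = [1.0, -2.0, 0.0, 0.0]          # drift confined to unprotected features
h_protected = [0.5, 0.5, 0.0, 0.0]      # drift in protected feature 0

penalty_free = masked_feature_penalty(h_free, h_ref, mask, W_enc, b_enc)
penalty_protected = masked_feature_penalty(h_protected, h_ref, mask, W_enc, b_enc)
print(mask)
print(penalty_free, penalty_protected)
```

In this toy, a large change along an unprotected direction incurs no penalty, while a smaller change along a protected feature does, which is the selectivity that a per-weight penalty cannot express directly. The retained state is one value per SAE feature (8 here), whereas diagonal EWC keeps a Fisher entry and a reference value for every parameter.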

The authors consider this important because it addresses the polysemantic nature of LLMs, under which weight-based protection is non-selective, and so offers a more effective way to isolate and protect specific knowledge during continual learning.