This paper proposes a new approach to catastrophic forgetting in large language models by regularizing in activation space using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary, rather than relying on traditional weight-space methods like Elastic Weight Consolidation (EWC).

  • The method derives a loss function that balances stability and plasticity using SAE features, showing EWC is a special case of this framework.
  • It requires no previous-task data after mask construction, retaining only a compact SAE feature mask computed from current-task data.
  • The approach is more memory efficient than weight-space methods because the SAE feature space has significantly lower dimensionality than the parameter space.
  • On the TRACE and MedCL benchmarks, it achieves the strongest results among approaches without task-specific architectural components, surpassing EWC.
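
To make the idea in the points above concrete, here is a minimal sketch of a masked activation-space penalty. It is not the paper's loss, whose exact form this summary does not give. It assumes a ReLU SAE encoder, a mask built by thresholding mean feature activation, and a squared-drift penalty on masked features. All function names, the threshold rule, and the weight `lam` are hypothetical.

```python
from typing import List, Sequence


def sae_encode(activation: Sequence[float],
               enc_weights: Sequence[Sequence[float]],
               enc_bias: Sequence[float]) -> List[float]:
    """Assumed SAE encoder: ReLU(W a + b), one output per dictionary feature."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def build_mask(feature_batch: Sequence[Sequence[float]],
               threshold: float) -> List[float]:
    """Assumed mask rule: protect features whose mean activation on a
    task's data exceeds `threshold`. Only this mask is kept afterwards."""
    n = len(feature_batch)
    dim = len(feature_batch[0])
    means = [sum(f[j] for f in feature_batch) / n for j in range(dim)]
    return [1.0 if m > threshold else 0.0 for m in means]


def stability_penalty(old_feats: Sequence[float],
                      new_feats: Sequence[float],
                      mask: Sequence[float]) -> float:
    """Squared drift of protected features; unmasked features stay free."""
    return sum(m * (n - o) ** 2 for m, o, n in zip(mask, old_feats, new_feats))


def total_loss(task_loss: float,
               old_feats: Sequence[float],
               new_feats: Sequence[float],
               mask: Sequence[float],
               lam: float) -> float:
    """Plasticity term (task loss) plus weighted stability term."""
    return task_loss + lam * stability_penalty(old_feats, new_feats, mask)
```

In this sketch the stored state is one mask entry per SAE feature. EWC, by contrast, keeps a Fisher estimate and a reference value for every model parameter, which is where the memory difference comes from.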

The authors consider this important because LLM representations are polysemantic, which makes weight-based protection non-selective. Regularizing SAE features offers a more effective way to isolate and protect specific knowledge during continual learning.