The authors introduce Symbolic Mechanistic Data Attribution (SMDA), a framework that attributes training pairs to the interpretable symbolic policies governing model behavior, bridging the gap between mechanistic circuits and high-level decisions.
- SMDA fits a closed-form ridge regression over sparse autoencoder (SAE) features to model a target behavior, then analytically decomposes how each supervised fine-tuning (SFT) example shifts that policy through two pathways: feature activation and output probability.
- The framework distills a symbolic policy for refusal behavior in Llama-3.2-3B-Instruct and analyzes 200 SFT training pairs to reveal systematic gaps in the base model's safety behavior.
- The per-feature decomposition explains mechanistically why harmful and harmless pairs exert qualitatively different influences, and it shows that individual training pairs often exhibit cross-feature interference.
This approach yields a diagnostic tool that is both more fine-grained than black-box influence functions and more scalable than manual circuit analysis.
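
To make the closed-form fit and the two-pathway split concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the symbolic policy is a ridge weight vector fitted on SAE feature activations, and it illustrates the decomposition with a simple sequential difference (change the features first, then the outputs) instead of the analytic derivation the summary refers to. The data are random placeholders, and the names `ridge_fit`, `X_before`, `X_after`, `y_before`, `y_after`, and the regularization strength `lam` are all hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge weights: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the regularized normal equations instead of forming an inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


rng = np.random.default_rng(0)
n_prompts, n_features = 64, 8

# Placeholder data (assumption): SAE feature activations per prompt and the
# model's probability of the target behavior, before a fine-tuning step.
X_before = rng.random((n_prompts, n_features))
y_before = rng.random(n_prompts)

# Placeholder data (assumption): the same quantities after training on one
# SFT pair, simulated here as small random perturbations.
X_after = X_before + 0.01 * rng.standard_normal((n_prompts, n_features))
y_after = np.clip(y_before + 0.01 * rng.standard_normal(n_prompts), 0.0, 1.0)

w_before = ridge_fit(X_before, y_before)
w_features_only = ridge_fit(X_after, y_before)  # only activations changed
w_after = ridge_fit(X_after, y_after)           # activations and outputs changed

# Sequential split of the policy shift into two pathways.
delta_total = w_after - w_before
delta_feature_path = w_features_only - w_before
delta_output_path = w_after - w_features_only

for j in range(n_features):
    print(f"feature {j}: feature path {delta_feature_path[j]:+.4f}, "
          f"output path {delta_output_path[j]:+.4f}")
```

By construction the two pathway terms sum exactly to the total shift in each weight, so every feature's change can be read as a feature-activation part plus an output-probability part. The split depends on the order in which the two changes are applied; an analytic decomposition such as the one the summary describes would not necessarily share that property.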