Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies
The authors introduce Symbolic Mechanistic Data Attribution (SMDA), a framework that attributes training pairs to the interpretable symbolic policies governing model behavior, bridging the gap between mechanistic circuits and high-level decisions.