A re-derivation of the activation patching estimand from causal mediation analysis reveals that the natural indirect effect (NIE) captures not only the causal effect through a specific component but also interaction effects (INT). These INT terms measure how much a component's causal effect depends on the state of other components in the model, challenging the assumption that NIE isolates individual contributions.
- In the GPT-2 IOI circuit, components whose causal importance is conditional on other components appear either invisible or artificially inflated under standard estimators.
- The variance of INT explains the previously documented instability of faithfulness scores in mechanistic interpretability studies.
- INT scales with the distance between clean and patched component activations and is negligible when the model is locally affine.
- Interaction effects decompose combinatorially into pairwise and higher-order group interactions, so the number of interaction terms grows with the number of mediators.
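The claims about affine models and activation distance can be illustrated with a toy two-component example. This is a minimal sketch under stated assumptions, not the paper's estimator: the functions `affine` and `bilinear`, the helper names, and the definition of the pairwise interaction as "the change in one component's patching effect when the other component's state changes" are all hypothetical choices made for illustration.

```python
# Toy illustration (assumed setup, not the paper's estimator):
# a "model" is a function f(a, b) of two component activations.

def patch_effect(f, a_from, a_to, b_ctx):
    """Change in f when component a moves from a_from to a_to, with b held at b_ctx."""
    return f(a_to, b_ctx) - f(a_from, b_ctx)


def pairwise_interaction(f, a0, a1, b0, b1):
    """How much the effect of patching a depends on whether b is at b0 or b1."""
    return patch_effect(f, a0, a1, b1) - patch_effect(f, a0, a1, b0)


def affine(a, b):
    # Affine in (a, b): the effect of a never depends on b.
    return 2.0 * a + 3.0 * b + 1.0


def bilinear(a, b):
    # Multiplicative coupling: the effect of a is proportional to b.
    return a * b


if __name__ == "__main__":
    a0, b0 = 0.0, 0.0  # assumed "patched" (corrupted) activations
    for scale in (1.0, 2.0, 4.0):
        a1, b1 = scale, scale  # assumed "clean" activations, moved further away
        print(
            scale,
            pairwise_interaction(affine, a0, a1, b0, b1),
            pairwise_interaction(bilinear, a0, a1, b0, b1),
        )
```

In this sketch the affine model yields an interaction of exactly zero at every scale, while the bilinear model yields `(a1 - a0) * (b1 - b0)`, which grows as the clean and patched activations move apart. With more than two components, the same differencing can be applied to larger groups, which is one way to read the combinatorial decomposition.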
The authors argue that INT should be treated as a diagnostic for interpretability studies rather than a nuisance to eliminate. Its magnitude and sign signal when causal conclusions are prompt-dependent and when greedy NIE-based component ranking will miss mechanisms discoverable only through combinatorial search.
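The point about greedy ranking can be sketched with a toy model in which two components matter only jointly. Everything here is an assumption for illustration: the three component names, the `toy_model` function, the choice to patch clean activations into a corrupted run, and the use of exhaustive pair search as a stand-in for combinatorial search.

```python
from itertools import combinations

# Hypothetical clean and corrupted activations for three components.
CLEAN = {"a": 1.0, "b": 1.0, "c": 1.0}
CORRUPT = {"a": 0.0, "b": 0.0, "c": 0.0}


def toy_model(acts):
    # a and b contribute only together; c contributes weakly on its own.
    return acts["a"] * acts["b"] + 0.1 * acts["c"]


def patched_effect(subset):
    """Effect of patching the clean activations of `subset` into the corrupted run."""
    acts = dict(CORRUPT)
    for name in subset:
        acts[name] = CLEAN[name]
    return toy_model(acts) - toy_model(CORRUPT)


# Single-component effects: what a greedy, one-at-a-time ranking would see.
single = {name: patched_effect((name,)) for name in CLEAN}

# Pairwise effects: what a search over pairs of components would see.
pairs = {pair: patched_effect(pair) for pair in combinations(CLEAN, 2)}

# Joint effect beyond the sum of the individual effects, per pair.
synergy = {pair: pairs[pair] - sum(single[n] for n in pair) for pair in pairs}

if __name__ == "__main__":
    print("single:", single)
    print("pairs:", pairs)
    print("synergy:", synergy)
```

Here `a` and `b` each score zero when patched alone, so a ranking based on single-component effects would place `c` first and discard the other two, even though patching `a` and `b` together has by far the largest effect. A large joint effect with near-zero individual effects is the kind of signal that, in this toy setting, only a search over groups of components can surface.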