A re-derivation of the activation patching estimand from causal mediation analysis reveals that the natural indirect effect (NIE) captures not only the causal effect through a specific component but also interaction effects (INT). These INT terms measure how much a component's causal effect depends on the state of other components in the model, which challenges the assumption that the NIE isolates individual contributions.

  • In the GPT-2 IOI circuit, components with conditional causal importance are either invisible to standard estimators or artificially inflated by them.
  • The variance of INT explains the previously documented instability of faithfulness scores in mechanistic interpretability studies.
  • INT scales with the distance between clean and patched component activations and is negligible when the model is locally affine.
  • Interaction effects decompose combinatorially into pairwise and higher-order group interactions, scaling with the number of mediators.
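
The dependence of INT on activation distance and on local affinity can be seen in a toy setting. The sketch below is an illustration only, not the authors' estimator: the scalar "model" `output`, its weights, and the helper names `patch_effect` and `pairwise_interaction` are all hypothetical. It measures the effect of patching mediator `a` under two different states of mediator `b` and takes the difference as a pairwise interaction.

```python
def output(a, b, w_ab):
    """Hypothetical toy 'model' with two mediators; w_ab = 0 makes it affine."""
    return 2.0 * a + 3.0 * b + w_ab * a * b


def patch_effect(a_from, a_to, b, w_ab):
    """Change in output when a is patched from a_from to a_to, with b held fixed."""
    return output(a_to, b, w_ab) - output(a_from, b, w_ab)


def pairwise_interaction(a0, a1, b0, b1, w_ab):
    """How much the effect of patching a changes when b moves from b0 to b1.

    For this toy model it equals w_ab * (a1 - a0) * (b1 - b0).
    """
    return patch_effect(a0, a1, b1, w_ab) - patch_effect(a0, a1, b0, w_ab)


if __name__ == "__main__":
    # Non-affine model: the effect of a depends on the state of b.
    print(pairwise_interaction(0.0, 1.0, 0.0, 2.0, w_ab=1.5))
    # Affine model: the interaction vanishes.
    print(pairwise_interaction(0.0, 1.0, 0.0, 2.0, w_ab=0.0))
    # Doubling the distance between the two a activations doubles the interaction.
    print(pairwise_interaction(0.0, 2.0, 0.0, 2.0, w_ab=1.5))
```

In this toy case the interaction is exactly the product of the cross-term weight and the two activation distances, so it shrinks to zero either when the activations are close or when the cross term is absent. With more than two mediators, analogous differences can be formed over every subset, which is where the combinatorial growth comes from.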

The authors argue that INT should be treated as a diagnostic for interpretability studies rather than a nuisance to eliminate. Its magnitude and sign signal when causal conclusions are prompt-dependent and when greedy NIE-based component ranking will miss mechanisms discoverable only through combinatorial search.
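
A minimal sketch of why single-component ranking can miss a jointly acting mechanism, under assumed toy conditions: the three components `a`, `b`, `c`, the product-form `output`, and the choice to patch clean activations into a corrupted run are all hypothetical and not taken from the paper's experiments.

```python
from itertools import combinations

# Hypothetical clean and corrupted activations for three components.
CLEAN = {"a": 1.0, "b": 1.0, "c": 1.0}
CORRUPT = {"a": 0.0, "b": 0.0, "c": 0.0}


def output(acts):
    """Toy model: a and b only matter jointly; c has a small additive effect."""
    return acts["a"] * acts["b"] + 0.1 * acts["c"]


def effect_of_patching(subset):
    """Output change when the clean activations of `subset` are patched into the corrupted run."""
    patched = {**CORRUPT, **{name: CLEAN[name] for name in subset}}
    return output(patched) - output(CORRUPT)


# Single-component effects: a and b each score zero, so a greedy ranking prefers c.
single = {name: effect_of_patching((name,)) for name in CLEAN}

# Searching over pairs recovers the (a, b) mechanism that the singles miss.
pairs = {subset: effect_of_patching(subset) for subset in combinations(CLEAN, 2)}
best_pair = max(pairs, key=pairs.get)

if __name__ == "__main__":
    print(single)
    print(pairs)
    print(best_pair)
```

Here the individually scored components `a` and `b` look irrelevant, while the pair carries almost all of the effect. Exhaustive subset search is only feasible at toy scale, which is one reason a cheap diagnostic for when such search is needed would be useful.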