The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

A re-derivation of the activation patching estimand from causal mediation analysis reveals that the natural indirect effect (NIE) captures not only the causal effect through a specific component but also interaction effects (INT). These INT terms measure how much a component's causal effect depends on the state of other components in the model, challenging the assumption that NIE isolates individual contributions.

In the GPT-2 indirect object identification (IOI) circuit, components whose causal importance is conditional on other components appear either invisible or artificially inflated under standard estimators.
The variance of INT explains the previously documented instability of faithfulness scores in mechanistic interpretability studies.
INT scales with the distance between clean and patched component activations and is negligible when the model is locally affine.
Interaction effects decompose combinatorially into pairwise and higher-order group interactions, so the number of interaction terms grows with the number of mediators.
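
The affine and distance claims above can be illustrated with a toy two-mediator model. This is a minimal sketch, not the authors' estimator: `toy_output`, its coefficients, and the contrast computed by `pairwise_interaction` (joint patching effect minus the sum of the two single-patch effects) are assumptions chosen for illustration, and the paper's INT term may be defined differently.

```python
def toy_output(m1, m2, a=1.0, b=2.0, g=0.5):
    """Hypothetical two-mediator readout: an affine part plus a bilinear term."""
    return a * m1 + b * m2 + g * m1 * m2


def pairwise_interaction(f, clean, patched):
    """Joint patching effect minus the sum of the two single-patch effects.

    This is an illustrative stand-in for an interaction term, assumed here,
    not taken from the paper.
    """
    c1, c2 = clean
    p1, p2 = patched
    base = f(c1, c2)
    effect_1 = f(p1, c2) - base   # patch mediator 1 only
    effect_2 = f(c1, p2) - base   # patch mediator 2 only
    effect_12 = f(p1, p2) - base  # patch both mediators together
    return effect_12 - effect_1 - effect_2


clean = (1.0, 1.0)
near = (3.0, 4.0)   # patched activations close to the clean ones
far = (5.0, 7.0)    # patched activations twice as far away in each coordinate

int_near = pairwise_interaction(toy_output, clean, near)
int_far = pairwise_interaction(toy_output, clean, far)
# Setting g=0 makes the toy model affine in (m1, m2).
int_affine = pairwise_interaction(
    lambda m1, m2: toy_output(m1, m2, g=0.0), clean, near
)
print(int_near, int_far, int_affine)
```

For this toy model the contrast has the closed form g * (p1 - c1) * (p2 - c2), so it vanishes when the model is affine (g = 0) and grows with the gap between clean and patched activations. With more than two mediators the same construction would need one such contrast per subset of mediators, which is where the combinatorial growth comes from.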

The authors argue that INT should be treated as a diagnostic for interpretability studies rather than a nuisance to eliminate. Its magnitude and sign signal when causal conclusions are prompt-dependent and when greedy NIE-based component ranking will miss mechanisms discoverable only through combinatorial search.
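
A toy example may help show how single-component ranking can miss a mechanism that only joint patching reveals. The sketch below is an assumption-laden illustration, not the paper's experiment: `redundant_readout`, the component names `A`, `B`, `C`, and the activation values are all hypothetical, with `A` and `B` acting as redundant backups for each other.

```python
from itertools import combinations


def redundant_readout(acts):
    """Hypothetical model: A and B are redundant; C contributes a small amount."""
    return 10 * max(acts["A"], acts["B"]) + acts["C"]


def patch_effect(f, clean, corrupted, subset):
    """Change in output when the components in `subset` are set to corrupted values."""
    patched = dict(clean)
    for name in subset:
        patched[name] = corrupted[name]
    return f(patched) - f(clean)


clean = {"A": 1, "B": 1, "C": 1}
corrupted = {"A": 0, "B": 0, "C": 0}

# Effect of patching each component on its own.
single = {
    name: patch_effect(redundant_readout, clean, corrupted, (name,))
    for name in clean
}

# Effect of patching every pair of components jointly.
pairs = {
    pair: patch_effect(redundant_readout, clean, corrupted, pair)
    for pair in combinations(sorted(clean), 2)
}

greedy_top = max(single, key=lambda name: abs(single[name]))
best_pair = max(pairs, key=lambda pair: abs(pairs[pair]))
print(single, pairs, greedy_top, best_pair)
```

Patching `A` or `B` alone leaves the output unchanged because the other backs it up, so a greedy ranking by single-component effect puts the minor component `C` first. Only the joint patch of `A` and `B` exposes the dominant mechanism, which is the kind of case where a large interaction term would signal that a combinatorial search is worth running.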