arXiv cs.CL · 11h ago

The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

Re-deriving the activation patching estimand from causal mediation analysis shows that the natural indirect effect (NIE) captures not only the causal effect through a specific component but also interaction effects (INT). These INT terms measure how much a component's causal effect depends on the state of the model's other components. This challenges the assumption that the NIE isolates a component's individual contribution.
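To make the idea of an interaction effect concrete, here is a minimal toy sketch in Python. It is not the paper's derivation or estimand: the two-component function `output`, its coefficients, and the helper `patch_effect_m1` are all hypothetical, chosen only to show how the measured effect of patching one component can change with the state of another.

```python
# Toy illustration only: a hypothetical "model" with two components, m1 and m2.
# The coefficients are arbitrary assumptions, not taken from the paper.
def output(m1: float, m2: float) -> float:
    # Additive terms plus a product term that couples the two components.
    return 2.0 * m1 + 3.0 * m2 + 4.0 * m1 * m2


def patch_effect_m1(m1_from: float, m1_to: float, m2_state: float) -> float:
    """Change in output when m1 is patched from m1_from to m1_to,
    with m2 held fixed at m2_state."""
    return output(m1_to, m2_state) - output(m1_from, m2_state)


# Assumed activations: 1.0 on the clean run, 0.0 on the corrupted run.
effect_m2_corrupted = patch_effect_m1(0.0, 1.0, m2_state=0.0)
effect_m2_clean = patch_effect_m1(0.0, 1.0, m2_state=1.0)

# The gap between the two measurements comes entirely from the product term.
interaction = effect_m2_clean - effect_m2_corrupted

print(effect_m2_corrupted, effect_m2_clean, interaction)
```

In this toy setup the same patch to `m1` gives a different effect depending on whether `m2` is in its clean or corrupted state, so no single number describes the contribution of `m1` on its own. If the product term is removed, the two measurements coincide and the interaction vanishes.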