Research demonstrates that language models trained to explain their predictions using fixed counterfactual explanations often produce introspections faithful to their own current behaviors rather than the training targets. This "introspective coupling" occurs when explanation training remains correlated with shifting model behaviors, allowing the system to track changes without updated supervision.

  • Models generate explanations that align more closely with their current behavior than with the fixed training data, which is derived from earlier checkpoints or similar models.
  • Introspective coupling tracks behavioral shifts even when explanation training runs concurrently with other post-training objectives.
  • The phenomenon is observed across multiple tasks, including sycophancy and refusal, and remains robust to label noise.
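
One way to make the comparison in these bullets concrete is to score a model's explanations against two references: its current behavior and the fixed training targets. The sketch below is illustrative only and is not the evaluation used in the research. The function names `agreement` and `coupling_gap`, the binary label encoding, and the toy data are all assumptions introduced here.

```python
from typing import Sequence


def agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions where two equal-length label sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if len(a) == 0:
        raise ValueError("sequences must be non-empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def coupling_gap(
    explained: Sequence[int],
    current: Sequence[int],
    targets: Sequence[int],
) -> float:
    """Agreement with current behavior minus agreement with training targets.

    A positive value means the explanations track what the model does now
    more closely than the fixed labels they were trained on (hypothetical
    metric, defined here for illustration).
    """
    return agreement(explained, current) - agreement(explained, targets)


# Toy labels (made up): 1 = "the counterfactual changes the model's answer".
explained = [1, 0, 1, 1, 0, 1, 0, 0]  # what the model's explanations claim
current = [1, 0, 1, 1, 0, 1, 1, 0]    # what the current model actually does
targets = [1, 1, 0, 1, 0, 0, 0, 0]    # fixed labels from an earlier checkpoint

gap = coupling_gap(explained, current, targets)
print(f"agreement with current behavior: {agreement(explained, current):.3f}")
print(f"agreement with training targets: {agreement(explained, targets):.3f}")
print(f"coupling gap: {gap:.3f}")
```

On this toy data the explanations match current behavior on 7 of 8 items and the fixed targets on 5 of 8, so the gap is positive. A gap near zero or below would suggest the explanations are simply reproducing the training labels.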

The findings indicate that fixed datasets of counterfactual explanations can provide a scalable and generalizable post-training signal for introspection.