IHDec addresses the failure of Large Language Models to maintain instruction hierarchies in multi-turn contexts, using Jensen-Shannon Divergence to detect and correct role-influence inversions. The method is training-free: during token generation, it dynamically suppresses subordinate-role inputs that would otherwise override superior directives.

  • Formalizes the role-influence inversion phenomenon where subordinate inputs override superior roles using a Jensen-Shannon Divergence framework.
  • Automatically detects token-level hierarchy violations without requiring expensive fine-tuning or model training.
  • Outperforms training-based baselines in multi-turn conflict scenarios while preserving general response quality.
  • Strengthens safety against adversarial prompt injections, with benefits that grow as model size increases.
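
To make the Jensen-Shannon Divergence idea concrete, here is a minimal sketch of one way a role's influence on the next-token distribution could be measured. It is an illustration under stated assumptions, not IHDec's actual procedure: the summary above does not specify how the divergence is applied, so the helper names (`role_influence`, `is_inverted`) and the "compare with and without a role's input" scheme are hypothetical. Only the divergence itself, JSD(P, Q) = ½ KL(P‖M) + ½ KL(Q‖M) with M = (P + Q)/2, is standard.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def role_influence(full_dist, ablated_dist):
    """Hypothetical influence score: how far the next-token distribution
    moves when one role's input is removed from the context."""
    return js_divergence(full_dist, ablated_dist)


def is_inverted(full, without_superior, without_subordinate):
    """Assumed criterion: flag an inversion when the subordinate role
    shifts the distribution more than the superior role does."""
    superior = role_influence(full, without_superior)
    subordinate = role_influence(full, without_subordinate)
    return subordinate > superior


# Toy two-token vocabulary. Dropping the superior (e.g. system) input barely
# changes the prediction, while dropping the subordinate (e.g. user) input
# changes it a lot, so the subordinate role is dominating this token.
full = [0.1, 0.9]
without_superior = [0.15, 0.85]
without_subordinate = [0.9, 0.1]
print(is_inverted(full, without_superior, without_subordinate))
```

In a real decoder the distributions would come from the model's softmaxed logits under different contexts. How a flagged token is then corrected is outside the scope of this sketch.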

The approach provides a scalable, training-free way to secure instruction hierarchies, ensuring that higher-priority directives are maintained even when they conflict with lower-priority inputs in complex multi-turn interactions.