IHDec addresses the failure of large language models (LLMs) to maintain instruction hierarchies in multi-turn contexts. It uses Jensen-Shannon divergence to detect and correct role-influence inversions, in which a subordinate role overrides a superior directive. The method is training-free and dynamically suppresses such subordinate roles during token generation.
- Uses a Jensen-Shannon divergence framework to formalize role-influence inversion, the phenomenon in which subordinate inputs override superior roles.
- Automatically detects token-level hierarchy violations without requiring expensive fine-tuning or model training.
- Outperforms training-based baselines in multi-turn conflict scenarios while fully preserving general response quality.
- Strengthens safety against adversarial prompt injections and scales robustly to larger models.
The approach offers a scalable, training-free way to secure instruction hierarchies: higher-priority directives are maintained even when they conflict with lower-priority inputs in complex multi-turn interactions.
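
To make the Jensen-Shannon divergence idea concrete, here is a minimal sketch of how a token-level inversion check could look. It is not IHDec's actual formulation, which this summary does not spell out. The sketch assumes that a role's influence can be approximated by the divergence between the next-token distribution under the full context and the distribution obtained with that role's input removed. The names `role_influence` and `detect_inversion`, the `margin` parameter, and the toy distributions are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def role_influence(p_full, p_without_role):
    """Assumed proxy: how far the next-token distribution moves when a role is removed."""
    return js_divergence(p_full, p_without_role)


def detect_inversion(p_full, p_without_superior, p_without_subordinate, margin=0.0):
    """Flag a token position where the subordinate role outweighs the superior one.

    `margin` is a hypothetical tolerance, not a parameter taken from the source.
    """
    superior = role_influence(p_full, p_without_superior)
    subordinate = role_influence(p_full, p_without_subordinate)
    return subordinate > superior + margin


# Toy next-token distributions over a three-token vocabulary (made-up numbers).
p_full = [0.7, 0.2, 0.1]
p_without_system = [0.68, 0.22, 0.10]  # dropping the superior role barely changes the output
p_without_user = [0.1, 0.2, 0.7]       # dropping the subordinate role changes it a lot

inverted = detect_inversion(p_full, p_without_system, p_without_user)
```

In this toy case the subordinate input dominates the prediction, so `inverted` is `True`. A decoding-time correction would then down-weight the subordinate role's contribution at that position; how IHDec performs that suppression is not described here, so the sketch stops at detection.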