IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies
IHDec addresses the failure of large language models (LLMs) to maintain instruction hierarchies in multi-turn contexts. It uses Jensen-Shannon divergence to detect role-influence inversions, cases where a subordinate role overrides a superior directive, and corrects them at decoding time. The method is training-free: during token generation it dynamically suppresses the subordinate roles responsible for the override.
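To make the divergence signal concrete, the following is a minimal sketch of Jensen-Shannon divergence between two next-token distributions, followed by a toy inversion check. The JSD formula itself is standard. Everything else is an assumption made for illustration and is not taken from the source: measuring a role's influence by ablating that role's content, the three-token toy distributions, and the names `influence_system`, `influence_user`, and `inverted`.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in nats.

    Symmetric and bounded by ln(2). The mixture m is positive wherever
    p or q is positive, so the KL terms never divide by zero.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy next-token distributions over a 3-token vocabulary (hypothetical values).
full_context = [0.7, 0.2, 0.1]     # all roles present
without_system = [0.6, 0.3, 0.1]   # assumption: system-role content ablated
without_user = [0.2, 0.7, 0.1]     # assumption: user-role content ablated

# Assumed influence measure: how far the distribution moves when a role is removed.
influence_system = js_divergence(full_context, without_system)
influence_user = js_divergence(full_context, without_user)

# Assumed inversion criterion: the subordinate role (user) moves the
# distribution more than the superior role (system).
inverted = influence_user > influence_system
print(f"system={influence_system:.4f} user={influence_user:.4f} inverted={inverted}")
```

In this toy setup the user role shifts the distribution far more than the system role, so the check flags an inversion. A decoding-time method would then need some rule for down-weighting the offending role's contribution to the logits; that step is not sketched here because the source does not specify it.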