Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection

A new hierarchical attention model detects multi-turn jailbreaks by encoding turns into compact representations and using a lightweight conversation module to capture dialogue dynamics. On 14,038 conversations, it achieves an F1 score of 0.9394, outperforming Claude Opus 4.7 by 0.07 and reducing false-positive rate by half. Ablation studies show that combining cross-attention and self-attention in the conversation module lowers false positives by 2.26 percentage points.