The authors present AdversaBench, an end-to-end red-teaming pipeline that generates hard inputs for large language models using five structured mutation operators and confirms failures through a three-judge panel with a meta-judge tiebreaker.

  • Experiments on 45 seeds spanning the reasoning, instruction-following, and tool-use categories produced a confirmed failure for every seed.
  • Operator effectiveness varies by category: inject_distractor earns a mean reward of 0.00 on instruction-following seeds but 0.80-0.83 on reasoning and tool-use seeds.
  • Instruction-following seeds required an average of 2.4 attacker iterations compared to 1.1 for other categories, revealing difficulty gaps hidden by binary failure rates.
  • Pairwise judge agreement of 80-87% coexists with near-zero Cohen's kappa because label skew inflates the agreement expected by chance, so category-level disagreement rates are more informative than kappa.
  • Adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting mutations exploit general behavioral patterns rather than model-specific weaknesses.

The study highlights the importance of multi-judge confirmation and category-level analysis for accurately evaluating adversarial robustness in large language models.
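
The pairing of high pairwise agreement with near-zero Cohen's kappa can look contradictory, so a small worked example may help. The sketch below is illustrative only: the two label lists are hypothetical and are not data from the paper, and the helper names `percent_agreement` and `cohens_kappa` are assumptions made for this example. Both hypothetical judges label 90% of items "fail", and their verdicts are arranged to be statistically independent, so the 82% raw agreement is exactly what chance alone would produce.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two judges gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Agreement expected by chance, from each judge's marginal label rates.
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when both judges use one label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts for 100 items (NOT data from the paper).
# 81 both "fail", 9 + 9 split decisions, 1 both "pass".
judge_a = ["fail"] * 81 + ["fail"] * 9 + ["pass"] * 9 + ["pass"] * 1
judge_b = ["fail"] * 81 + ["pass"] * 9 + ["fail"] * 9 + ["pass"] * 1

agreement = percent_agreement(judge_a, judge_b)
kappa = cohens_kappa(judge_a, judge_b)
print(f"agreement={agreement:.2f}, |kappa|={abs(kappa):.2f}")
```

With these labels the raw agreement is 0.82 while kappa is approximately zero, since the chance-expected agreement is 0.9 × 0.9 + 0.1 × 0.1 = 0.82. When nearly every item is a confirmed failure, kappa has little room to distinguish judges, which is one reason per-category disagreement rates can be the more useful diagnostic.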