AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

The authors present AdversaBench, an end-to-end red-teaming pipeline that generates hard inputs for large language models using five structured mutation operators and confirms failures through a three-judge panel with a meta-judge tiebreaker.
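
The confirmation step can be pictured with a minimal sketch. The summary does not say when the meta-judge is consulted, so the code below assumes a hypothetical label set ("fail", "pass", "unclear") and calls the meta-judge only when no decisive label wins a strict majority. The names `confirm_failure` and `meta_judge` are illustrative, not taken from AdversaBench.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical label set; the summary does not specify the judges' output format.
Verdict = str  # "fail", "pass", or "unclear"


def confirm_failure(
    verdicts: Sequence[Verdict],
    meta_judge: Callable[[Sequence[Verdict]], Verdict],
) -> bool:
    """Return True if the panel confirms that the target model failed.

    Assumption: a strict majority for "fail" or "pass" decides the case;
    anything else (a split vote or an "unclear" majority) goes to the meta-judge.
    """
    label, votes = Counter(verdicts).most_common(1)[0]
    if votes * 2 > len(verdicts) and label != "unclear":
        return label == "fail"
    # No decisive majority: defer to the meta-judge (assumed tiebreak rule).
    return meta_judge(verdicts) == "fail"


# Usage with a stand-in meta-judge that always answers "pass".
always_pass = lambda verdicts: "pass"
print(confirm_failure(["fail", "fail", "pass"], always_pass))     # majority decides
print(confirm_failure(["fail", "pass", "unclear"], always_pass))  # meta-judge decides
```

Under this assumed rule the meta-judge is an extra call made only on contested cases, so clear-cut failures cost three judge calls rather than four.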

Experiments on 45 seeds spanning the reasoning, instruction-following, and tool-use categories produced a confirmed failure for every seed.
Operator effectiveness varies by category, with inject_distractor scoring 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use.
Instruction-following seeds required an average of 2.4 attacker iterations, compared to 1.1 for the other categories, a difficulty gap that binary failure rates hide.
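Pairwise judge agreement of 80-87% coexists with near-zero Cohen's kappa: because the labels are heavily skewed toward one class, the agreement expected by chance is already high, so kappa stays low. Category-level disagreement rates are therefore more informative than kappa alone.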
Adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting mutations exploit general behavioral patterns rather than model-specific weaknesses.
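
The agreement-versus-kappa observation above can be reproduced with a small worked example. The labels below are synthetic and chosen only to illustrate label skew; they are not the paper's data. Cohen's kappa is computed from its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each judge's label frequencies.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two judges' marginal label frequencies.
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in set(a) | set(b))
    if p_e == 1:
        raise ValueError("kappa is undefined when both judges use a single label")
    return (p_o - p_e) / (1 - p_e)


# Synthetic, illustrative verdicts: both judges say "fail" on 18 of 20 items,
# but they place their two "pass" verdicts on different items.
judge_a = ["fail"] * 18 + ["pass"] * 2
judge_b = ["fail"] * 16 + ["pass"] * 2 + ["fail"] * 2

agreement = raw_agreement(judge_a, judge_b)
kappa = cohens_kappa(judge_a, judge_b)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")  # agreement=0.80, kappa=-0.11
```

Here p_e = 0.9 * 0.9 + 0.1 * 0.1 = 0.82, so an observed agreement of 0.80 is slightly below chance even though the judges agree on 16 of 20 items. This is why a per-category count of disagreements can say more than a single kappa value when nearly every item receives the same label.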

The study highlights the importance of using multi-judge confirmation and category-level analysis to accurately evaluate adversarial robustness in large language models.