AdversaBench introduces an end-to-end red-teaming pipeline that generates adversarial prompts via five structured operators, evaluates target models, and confirms failures through a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seed prompts spanning reasoning, instruction-following, and tool use show that every seed produces a confirmed failure. The analysis covers operator effectiveness, the number of iterations to failure, judge agreement, and cross-model transferability, which together characterize patterns in LLM vulnerability.
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation
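The confirmation step named in the abstract (three judges plus a meta-judge tiebreaker) can be illustrated with a minimal sketch. The abstract does not say when the meta-judge is consulted, so this sketch assumes it is called only when the three judges are not unanimous. The names `confirm_failure`, `Judge`, and the boolean verdict format are hypothetical and are not taken from AdversaBench itself.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: takes (prompt, response) and returns True
# when the judge considers the response a genuine failure.
Judge = Callable[[str, str], bool]


def confirm_failure(
    prompt: str,
    response: str,
    judges: Sequence[Judge],
    meta_judge: Judge,
) -> Tuple[bool, bool]:
    """Return (confirmed, escalated) for one candidate failure.

    Assumption (not from the source): a unanimous panel decides directly,
    and any disagreement is escalated to the meta-judge.
    """
    if len(judges) != 3:
        raise ValueError("this sketch expects exactly three judges")

    votes = [judge(prompt, response) for judge in judges]

    # Unanimous panel: accept its verdict without escalation.
    if all(votes) or not any(votes):
        return votes[0], False

    # Split panel: the meta-judge has the final say.
    return meta_judge(prompt, response), True
```

In practice each judge would wrap a model call; in this sketch they are plain callables so the control flow can be checked in isolation. A stricter or looser escalation rule (for example, majority vote with escalation only on low-confidence verdicts) would fit the same interface.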