OTTER: Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization
OTTER is a black-box red-teaming framework that bypasses toxicity filters by modifying as few as five tokens. Evaluated on 457 AdvBench prompts across four GPT models, it raises the jailbreak success rate from 7.0% to 84.0%. The work also offers the first quantitative analysis of the relationship between toxicity and filter bypass, along with actionable recommendations for hardening classifiers.