Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms

This study investigates whether non-expert malicious actors can jailbreak large language models by using bandit algorithms to select optimal attacks and enhance queries. The authors propose a novel attack strategy based on the multi-armed bandit framework that efficiently learns the best jailbreak from a large set of candidates through noisy exploration.
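To make the multi-armed bandit framing concrete, the following is a minimal sketch of how a bandit learner identifies the best option from noisy binary feedback. The summary does not say which bandit algorithm the authors use, so UCB1 is substituted here as a standard textbook choice. The arms are abstract indices, and the rewards come from simulated Bernoulli draws with made-up success probabilities (`HYPOTHETICAL_PROBS`); no model is queried and none of the numbers come from the paper.

```python
import math
import random


def ucb1(pull, n_arms, n_rounds):
    """Run UCB1 and return (pull counts, empirical mean reward) per arm."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            # Initialization: play every arm once.
            arm = t - 1
        else:
            # Pick the arm with the highest upper confidence bound.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the empirical mean.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def make_simulated_arms(success_probs, seed=0):
    """Return a pull(arm) function with Bernoulli rewards (simulation only)."""
    rng = random.Random(seed)

    def pull(arm):
        return 1.0 if rng.random() < success_probs[arm] else 0.0

    return pull


# Hypothetical per-arm success probabilities, chosen only for illustration.
HYPOTHETICAL_PROBS = [0.2, 0.5, 0.8, 0.4]

counts, means = ucb1(
    make_simulated_arms(HYPOTHETICAL_PROBS), len(HYPOTHETICAL_PROBS), 5000
)
best_arm = max(range(len(counts)), key=lambda a: counts[a])
print("pulls per arm:", counts)
print("most-pulled arm:", best_arm)
```

The exploration bonus shrinks as an arm is pulled more often, so the learner concentrates its query budget on the arm with the highest observed success rate while still occasionally re-checking the others. This is the sense in which a bandit approach can search a large choice set efficiently despite noisy outcomes.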

The researchers curated FrankensteinBench, a safety benchmark of 11,279 malicious queries derived from seven existing benchmarks through automated enhancement and generation.

Each query in the benchmark is categorized as either simple or complex based on the technical expertise required to craft it. The bandit-based attack achieved an average success rate of 97% across 15 state-of-the-art open-weight LLMs. Adding complexity to queries increased the attack success rate by up to 26% on average, demonstrating that query enhancement is an effective and automatable prompting strategy.

The findings confirm that non-expert actors can elicit actionable responses from models, validating concerns about the accessibility of jailbreak attacks.