Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms
This study investigates whether non-expert malicious actors can jailbreak large language models by using bandit algorithms to select optimal attacks and enhance queries. The authors propose a novel attack strategy, based on the multi-armed bandit framework, that efficiently learns the best jailbreak from a large set of candidates through noisy exploration.
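To make the multi-armed bandit framing concrete, here is a minimal sketch of a generic bandit loop that learns which of several options has the highest success rate from noisy feedback. The summary does not say which bandit algorithm the authors use, so UCB1, a standard textbook method, stands in here as an assumption. The arms are abstract indices, the rewards are simulated Bernoulli draws, and the names `ucb1`, `simulate`, and `success_probs` as well as the probabilities themselves are hypothetical.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds.

    `pull(arm)` must return a noisy reward in [0, 1].
    Returns per-arm pull counts and empirical mean rewards.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Play every arm once so each estimate is initialised.
            arm = t - 1
        else:
            # Pick the arm with the largest optimistic estimate:
            # empirical mean plus an exploration bonus that shrinks
            # as the arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the running mean reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def simulate(seed=0, horizon=5000):
    """Simulated environment with made-up success rates (assumption)."""
    rng = random.Random(seed)
    success_probs = [0.1, 0.3, 0.8, 0.2]  # hypothetical values

    def pull(arm):
        # Bernoulli reward: 1.0 on success, 0.0 on failure.
        return 1.0 if rng.random() < success_probs[arm] else 0.0

    counts, means = ucb1(pull, len(success_probs), horizon)
    most_pulled = max(range(len(counts)), key=lambda a: counts[a])
    return counts, means, most_pulled


if __name__ == "__main__":
    counts, means, most_pulled = simulate()
    print("pull counts:", counts)
    print("estimated means:", [round(m, 3) for m in means])
    print("most pulled arm:", most_pulled)
```

In this sketch the exploration bonus is what handles the noise: arms with few observations keep a wide confidence interval and are retried, while pulls concentrate on the arm whose estimated success rate stays highest.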