Researchers introduce STEER (Safety Targeted Embedding Exploit via Refinement), a gradient-guided attack that shows how safety training for large language models fails to generalize to low-resource languages and to code-switching (mixing languages within a single prompt). The method uses gradients to identify the words that drive refusal behavior, then iteratively translates those words into low-resource languages, suppressing safety mechanisms while preserving the prompt's harmful intent.
- Across six open-source 8B-parameter models, STEER achieves attack success rates of up to 93.0% on JailbreakBench and 96.7% on AdvBench.
- STEER outperforms both random code-switching and the Greedy Coordinate Gradient (GCG) attack.
- Prompts generated by STEER transfer to GPT-4o-mini, achieving a 35.5% attack success rate without access to the target model.
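
For readers unfamiliar with the metric, attack success rate (ASR) is conventionally the percentage of attack prompts judged to have elicited the targeted behavior. The sketch below shows only that arithmetic. The function name `attack_success_rate` and the toy labels are hypothetical, and how each attempt is judged (for example by a judge model or by human review) is not specified in this summary.

```python
def attack_success_rate(judgements: list[bool]) -> float:
    """Return the percentage of attempts labeled as successful.

    `judgements` holds one boolean per attack prompt. How each label is
    produced is an assumption left outside this sketch.
    """
    if not judgements:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(judgements) / len(judgements)


# Toy labels for illustration only, not data from the study.
toy_labels = [True, False, True, True]
print(attack_success_rate(toy_labels))  # prints 75.0
```
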
The findings indicate that safety alignment performed primarily on English data cannot be assumed to generalize to multilingual inputs, which suggests a need for broader language coverage during alignment and for explicit detection of out-of-distribution inputs.
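
As one illustration of what explicit detection of out-of-distribution inputs could look like, the sketch below flags prompts that mix writing systems. It is not the authors' method, only a minimal heuristic under stated assumptions: the function names and the `min_minority_share` threshold are hypothetical, and the script of a character is approximated by the first word of its Unicode name rather than by the official Unicode Script property. It cannot catch code-switching between languages that share a script (many low-resource languages are written in Latin script), so a practical detector would likely also need language identification.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per approximate script.

    The script is approximated by the first word of the Unicode character
    name (e.g. "LATIN", "CYRILLIC"); this is a heuristic, not the official
    Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


def looks_code_switched(text: str, min_minority_share: float = 0.1) -> bool:
    """Flag text in which non-dominant scripts exceed a share threshold.

    `min_minority_share` is a hypothetical tuning parameter.
    """
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return False
    dominant = counts.most_common(1)[0][1]
    return (total - dominant) / total >= min_minority_share


print(looks_code_switched("hello world"))         # prints False
print(looks_code_switched("hello привет world"))  # prints True
```

A flag from a check like this would be a signal for additional scrutiny rather than a verdict, since ordinary multilingual users also mix scripts.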