STEER attack exposes LLM safety gaps in low-resource languages

Researchers introduce STEER (Safety Targeted Embedding Exploit via Refinement), a gradient-guided attack that reveals how safety training for large language models fails to generalize to low-resource languages and code-switching. The method identifies words driving refusal behavior and iteratively translates them into low-resource languages to suppress safety mechanisms while preserving harmful intent.

Across six open-source 8B-parameter models, STEER achieves attack success rates of up to 93.0% on JailbreakBench and 96.7% on AdvBench.
The technique outperforms random code-switching and Greedy Coordinate Gradient (GCG) methods.
Prompts generated by STEER transfer to GPT-4o-mini, achieving a 35.5% attack success rate without access to the target model.
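
For readers unfamiliar with the metric, attack success rate is simply the share of attack attempts that a judge marks as successful. A minimal sketch of that arithmetic follows; the function name and the toy outcomes are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of attempts judged successful."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy data, NOT from the paper: 3 of 4 attempts judged successful.
toy_outcomes = [True, False, True, True]
print(f"{attack_success_rate(toy_outcomes):.1f}%")  # 75.0%
```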

The findings demonstrate that safety mechanisms trained primarily on English cannot be assumed to generalize to multilingual inputs. This suggests a need for broader language coverage during alignment and for explicit detection of out-of-distribution inputs.
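
As a rough illustration of what input-side detection could look like, the sketch below flags prompts that mix writing systems. It is a crude heuristic under stated assumptions, not the approach proposed by the researchers: the function names and the 10% threshold are hypothetical, and the script is approximated by the first word of each character's Unicode name.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per approximate script."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            # The first word of the Unicode name (LATIN, CYRILLIC, ARABIC, ...)
            # serves as a rough stand-in for the script.
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def looks_code_switched(text: str, min_minor_share: float = 0.1) -> bool:
    """Flag text where non-dominant scripts reach the given share of letters.

    The 0.1 default is an arbitrary assumption for illustration.
    """
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return False
    dominant = counts.most_common(1)[0][1]
    return (total - dominant) / total >= min_minor_share


print(looks_code_switched("hello world"))   # False
print(looks_code_switched("hello привет"))  # True
```

A check like this misses code-switching between languages that share a script, which includes many low-resource languages written in Latin characters, so a practical detector would need language identification rather than script counting alone.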

Benchmark	Target model	Attack success rate (STEER)
JailbreakBench	Six open-source 8B-parameter models (best result)	93.0%
JailbreakBench	GPT-4o-mini (transferred prompts, no access to target)	35.5%
