How Safety-Aligned LLMs Interpret Mixed Compliance Demonstrations
Studies show that benign and harmful compliance demonstrations are not interchangeable as in-context examples for safety-aligned LLMs. Depending on the model, benign demonstrations can either reduce or increase harmful compliance, and preference optimization plays a key role in preventing harmful compliance. Demonstration order also matters: models show a strong recency bias, so demonstrations placed later in the context influence behavior more than earlier ones. Finally, models vary in how they handle refusal during in-context learning.
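
To make the ordering effect concrete, here is a minimal sketch of how a mixed-demonstration prompt might be assembled with the same demonstrations in two different orders. It is an illustration only, not the protocol of any particular study: the `Demo` class, the `build_prompt` and `orderings` helpers, the "User:/Assistant:" template, and the placeholder strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    request: str
    response: str
    kind: str  # "benign" or "harmful"; labels are assumptions for this sketch


def build_prompt(demos: List[Demo], query: str) -> str:
    """Concatenate demonstrations in the given order, then append the query."""
    blocks = [f"User: {d.request}\nAssistant: {d.response}" for d in demos]
    blocks.append(f"User: {query}\nAssistant:")
    return "\n\n".join(blocks)


def orderings(benign: List[Demo], harmful: List[Demo]) -> Dict[str, List[Demo]]:
    """Arrange the same demonstrations so that a different kind comes last."""
    return {
        "benign_last": harmful + benign,
        "harmful_last": benign + harmful,
    }


# Placeholder content only; real evaluations would draw on a vetted dataset.
benign = [Demo("Summarize this paragraph.", "Here is a short summary.", "benign")]
harmful = [Demo("<placeholder harmful request>",
                "<placeholder compliant response>", "harmful")]

prompts = {
    name: build_prompt(demos, "<held-out test request>")
    for name, demos in orderings(benign, harmful).items()
}
```

Sending each entry of `prompts` to the same model and comparing how often it complies with the final request would be one way to probe recency bias, since the two prompts contain identical demonstrations and differ only in which kind appears closest to the query.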