How Safety-Aligned LLMs Interpret Mixed Compliance Demonstrations
A study finds benign and harmful compliance demonstrations are not interchangeable in language models. Benign demonstrations can either reduce or increase harmful compliance depending on the model, with preference optimization playing a key role in preventing harmful compliance. The research also reveals recency bias in demonstration ordering and varied model behaviors in handling refusals during in-context learning.