This article addresses over-alignment in large language models applied to criminal law cases of the Swiss Federal Supreme Court, where model guardrails frequently trigger refusals when processing sensitive case details. The authors introduce TF-RefusalBench, a multilingual benchmark derived from public rulings, to measure this phenomenon across French, German, Italian, and English.
- TF-RefusalBench contains 5,200 prompts covering common tasks and passages likely to trigger refusal, in four languages: French, German, Italian, and English.
- Over-alignment is identified as a multifaceted phenomenon that depends on the model, the language of the prompt, and the language of the text being processed.
- The impact of over-alignment extends beyond outright refusals: disclaimers added to responses also reduce task faithfulness.
- Abliteration, which involves ablating refusal directions, eliminates refusals with minimal impact on task performance compared with prompting alone.
The study demonstrates that abliteration is an effective approach for enabling on-premises LLMs to handle criminal law tasks without triggering guardrails, thereby supporting legitimate work involving violent and sexual offense descriptions.
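
The summary does not detail how the refusal direction is obtained or where it is ablated, so the following is only a minimal sketch of one common formulation, assumed here for illustration: the direction is taken as the normalized difference between mean activations on refused and accepted prompts, and it is removed from an activation by orthogonal projection. The function names (`mean_vector`, `refusal_direction`, `ablate`) and the three-dimensional toy vectors are hypothetical and do not come from the article.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(refused_acts, accepted_acts):
    """Unit vector pointing from the mean accepted activation
    to the mean refused activation (assumed difference-of-means estimate)."""
    diff = [r - a for r, a in zip(mean_vector(refused_acts),
                                  mean_vector(accepted_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("refused and accepted means coincide; no direction")
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    projection = sum(a * d for a, d in zip(activation, direction))
    return [a - projection * d for a, d in zip(activation, direction)]


# Toy activations (made up): refused prompts differ only along the first axis.
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
accepted = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(refused, accepted)
cleaned = ablate([3.0, 1.0, 0.5], direction)
```

In this toy case the estimated direction is the first axis, so ablation zeroes that component and leaves the others unchanged. In a real model the same projection would be applied to hidden states or folded into weight matrices, a choice this sketch does not attempt to represent.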