This article addresses over-alignment in large language models applied to criminal law material from the Swiss Federal Supreme Court, where model guardrails frequently trigger refusals when processing sensitive case details. The authors introduce TF-RefusalBench, a multilingual benchmark derived from public rulings, to measure this phenomenon across French, German, Italian, and English.

  • TF-RefusalBench contains 5,200 prompts that pair common tasks with passages likely to trigger refusal, in all four languages covered.
  • Over-alignment is identified as a multifaceted phenomenon influenced by the model and the languages of both the prompt and text.
  • The impact of over-alignment extends beyond outright refusals: disclaimers added to responses also reduce task faithfulness.
  • Abliteration, i.e. ablation of the refusal direction, eliminates refusals with minimal impact on task performance compared to prompting alone.
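
The idea behind abliteration can be illustrated with a small sketch. The summary does not describe the authors' implementation, so the following assumes the commonly described difference-of-means formulation: a "refusal direction" is estimated from hidden activations and then projected out. The function names `refusal_direction` and `ablate`, the hidden size, and the synthetic activations are all hypothetical and stand in for activations that would be collected from a real model.

```python
import numpy as np

def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the mean complied activation to the mean refused one.

    Assumption: the refusal direction is estimated as a difference of means.
    """
    diff = refused_acts.mean(axis=0) - complied_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove the component of each activation that lies along `direction`."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in data (not from the benchmark): "refused" activations are
# shifted along one axis relative to "complied" activations.
rng = np.random.default_rng(0)
hidden = 16
shift = np.zeros(hidden)
shift[0] = 3.0
complied = rng.normal(size=(200, hidden))
refused = rng.normal(size=(200, hidden)) + shift

r_hat = refusal_direction(refused, complied)
ablated = ablate(refused, r_hat)

# After ablation, the activations have no remaining component along r_hat.
print(np.abs(ablated @ r_hat).max())
```

In this sketch the projection is applied to stored activations. In practice the same projection is often folded into the model's weights so that no inference-time hook is needed, but that choice is not specified in the summary.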

The study demonstrates that abliteration is an effective approach for enabling on-premises LLMs to handle criminal law tasks without triggering guardrails, thereby supporting legitimate work involving violent and sexual offense descriptions.