Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

This article addresses over-alignment in large language models applied to criminal law material from the Swiss Federal Supreme Court, where model guardrails frequently trigger refusals on sensitive case details. The authors introduce TF-RefusalBench, a multilingual benchmark derived from public rulings, to measure this phenomenon across French, German, Italian, and English.

TF-RefusalBench contains 5,200 prompts in four languages (French, German, Italian, and English), covering common tasks and passages likely to trigger refusal.
Over-alignment is a multifaceted phenomenon that depends on the model, the language of the prompt, and the language of the text being processed.
Its impact extends beyond outright refusals: disclaimers added to responses also reduce task faithfulness.
Abliteration, that is, ablation of the refusal direction, eliminates refusals with minimal impact on task performance compared with prompting alone.
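
To make the idea of refusal-direction ablation concrete, here is a minimal sketch of one common formulation: estimate a direction as the normalized difference between mean activations on refused and complied prompts, then project that direction out of an activation. This summary does not specify the authors' exact procedure, so the function names (`refusal_direction`, `ablate`) and the toy vectors below are illustrative assumptions, not the study's implementation or data.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the mean 'complied' activation
    to the mean 'refused' activation (difference of means)."""
    mu_refused = mean_vector(refused_acts)
    mu_complied = mean_vector(complied_acts)
    diff = [a - b for a, b in zip(mu_refused, mu_complied)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy 3-dimensional activations (made-up numbers, not from the study).
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(refused, complied)  # unit refusal direction
h = [5.0, 2.0, -1.0]                          # hypothetical hidden state
h_ablated = ablate(h, r_hat)                  # same state with the direction removed

# After ablation the activation has no component along the refusal direction.
residual = sum(a * d for a, d in zip(h_ablated, r_hat))
```

In a real model the same projection would be applied to hidden states (or folded into the weight matrices) at inference time; the sketch only shows the linear algebra on plain Python lists.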

The study demonstrates that abliteration is an effective approach for enabling on-premises LLMs to handle criminal law tasks without triggering guardrails, thereby supporting legitimate work involving violent and sexual offense descriptions.