Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts
This article addresses the challenge of over-alignment in large language models used within Swiss Federal Supreme Court criminal law contexts, where model guardrails frequently trigger refusals when processing sensitive case details. The authors introduce TF-RefusalBench, a multilingual benchmark derived from public rulings, to measure this phenomenon across French, German, Italian, and English.