TF-RefusalBench Measures Over-Alignment in LLMs for Criminal Law

TF-RefusalBench is a multilingual benchmark derived from rulings of the Swiss Federal Supreme Court, containing 5,200 prompts in French, German, Italian, and English. It shows that over-alignment in LLMs varies with both the model and the prompt language, and that refusals affect task faithfulness in ways that raw refusal rates alone do not capture. Abliteration of refusal directions reduces over-alignment with minimal performance loss on criminal law tasks.
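
The summary does not say how abliteration was carried out. As the term is commonly used, it means directional ablation: a "refusal direction" is estimated in the model's activation space, and its component is projected out of the hidden states. The sketch below illustrates only that projection step on toy vectors, in plain Python. The function names and the toy numbers are assumptions made for illustration; they are not taken from the benchmark or from any particular library.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length; a zero vector has no direction to ablate."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("refusal direction must be non-zero")
    return [x / norm for x in v]


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    Computes h' = h - (h . r) r, where r is the unit-length direction.
    In an actual model this would be applied to activations or folded into
    weight matrices; here it is applied to a single toy vector.
    """
    r = normalize(direction)
    projection = dot(hidden, r)
    return [h - projection * x for h, x in zip(hidden, r)]


# Toy example (hypothetical values): the "refusal direction" is the second axis.
hidden = [3.0, 4.0, 1.0]
direction = [0.0, 2.0, 0.0]
ablated = ablate_direction(hidden, direction)
print(ablated)
```

After ablation the vector has no remaining component along the chosen direction, while the components orthogonal to it are left unchanged. This is the intuition behind the claim that over-alignment can be reduced with little loss in task performance.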