A study investigates overrefusal in small, on-device large language models (LLMs) processing legal prompts and finds that authority-style prefixes systematically increase refusal rates by 2 to 20 times relative to a no-prefix baseline. Role-play jailbreak prefixes showed mixed effects across models, and the results indicate that these small LLMs are unstable under the contextual framings typical of real institutional users.
- Authority-style prefixes (e.g., "acting as an assistant of the national supreme court") increase refusal rates by 2x to 20x over the no-prefix baseline.
- A known role-play jailbreak prefix shows mixed effects, sharply increasing refusals in some models while barely shifting them in others.
- Small on-premises LLMs exhibit instability when subjected to contextual framings that real institutional users might naturally introduce.
The findings suggest that further investigation is essential to minimize opportunities for bias introduced by selective refusal in legal contexts.
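
To make the "2x to 20x over the no-prefix baseline" figure concrete, the following minimal sketch shows one way such a multiplier could be computed: the refusal rate under each prefix condition divided by the refusal rate with no prefix. The function names, the condition labels, and the toy outcomes are all hypothetical and chosen for illustration; they are not the study's data, and the study may define or aggregate refusals differently.

```python
def refusal_rate(outcomes):
    """Fraction of responses marked as refusals (True means the model refused)."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return sum(outcomes) / len(outcomes)


def refusal_multipliers(results, baseline="no_prefix"):
    """Refusal rate of each condition divided by the baseline refusal rate.

    `results` maps a condition label to a list of booleans, one per prompt.
    The labels and the `baseline` default are illustrative assumptions.
    """
    base = refusal_rate(results[baseline])
    if base == 0:
        raise ValueError("baseline refusal rate is zero; the ratio is undefined")
    return {
        condition: refusal_rate(outcomes) / base
        for condition, outcomes in results.items()
        if condition != baseline
    }


# Made-up toy outcomes for 20 prompts per condition (NOT results from the study).
toy_results = {
    "no_prefix": [True] * 1 + [False] * 19,         # 1 refusal out of 20
    "authority_prefix": [True] * 4 + [False] * 16,  # 4 refusals out of 20
    "roleplay_prefix": [True] * 1 + [False] * 19,   # 1 refusal out of 20
}

multipliers = refusal_multipliers(toy_results)
for condition, ratio in multipliers.items():
    print(f"{condition}: {ratio:.1f}x the baseline refusal rate")
```

A ratio is undefined when the baseline never refuses, so the sketch raises an error in that case rather than returning infinity; a real analysis would also need to decide how refusals are detected and how to report uncertainty across models.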