OpenSafeIntent benchmark reveals models fail to calibrate safety across dual-use prompt sets

The authors introduce OpenSafeIntent, a benchmark for evaluating whether AI models produce intent-calibrated safe completions. It uses controlled prompt sets that vary intent while holding the underlying task fixed: each data point includes benign, dual-use, and malicious variants of the same task, so the benchmark assesses safety calibration rather than average performance.

The benchmark reveals that prompt-level safety metrics hide significant failures, as models often fail to remain safe across matched intent variants.
Dual-use behavior is found to be brittle under paraphrase, and high-level answers on risky topics are not reliably safe.
Responses that reframe ambiguous requests into safer tasks are substantially less likely to cross the safety boundary than other response strategies.

The results suggest that safe completion should be evaluated as intent-calibrated behavior over controlled task variants rather than as a single safety-helpfulness tradeoff over independent prompts.
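
To make the contrast between prompt-level and set-level evaluation concrete, here is a minimal Python sketch. It is not the authors' scoring code: the `Judgment` record, the `appropriate` verdict, the two rate functions, and the toy data are all hypothetical, and they assume that some external judge has already marked each response as appropriate or not for the intent of its prompt.

```python
from dataclasses import dataclass

# Assumed intent labels, following the three variants named in the summary.
INTENTS = ("benign", "dual_use", "malicious")


@dataclass(frozen=True)
class Judgment:
    task_id: str       # identifies the fixed underlying task
    intent: str        # one of INTENTS
    appropriate: bool  # hypothetical judge verdict for this response


def prompt_level_rate(judgments):
    """Fraction of individual prompts handled appropriately."""
    if not judgments:
        return 0.0
    return sum(j.appropriate for j in judgments) / len(judgments)


def set_level_rate(judgments):
    """Fraction of tasks handled appropriately on every intent variant."""
    by_task = {}
    for j in judgments:
        by_task.setdefault(j.task_id, {})[j.intent] = j.appropriate
    # Only score tasks for which all three variants were judged.
    complete = [v for v in by_task.values() if set(v) == set(INTENTS)]
    if not complete:
        return 0.0
    return sum(all(v.values()) for v in complete) / len(complete)


# Toy data (invented for illustration): task A passes everywhere,
# task B fails on the dual-use variant, task C fails on the malicious variant.
judgments = [
    Judgment("A", "benign", True),
    Judgment("A", "dual_use", True),
    Judgment("A", "malicious", True),
    Judgment("B", "benign", True),
    Judgment("B", "dual_use", False),
    Judgment("B", "malicious", True),
    Judgment("C", "benign", True),
    Judgment("C", "dual_use", True),
    Judgment("C", "malicious", False),
]

print(f"prompt-level: {prompt_level_rate(judgments):.2f}")  # 0.78
print(f"set-level:    {set_level_rate(judgments):.2f}")     # 0.33
```

In this toy data, seven of nine prompts are handled appropriately, yet only one of three tasks is handled appropriately across all of its variants. That gap is the kind of failure a prompt-level average can hide.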