The authors introduce AIMS, a dataset of 1,724 human-annotated difficult safety prompts paired with intent descriptions and harm labels, to evaluate intent-aware training across multiple regimes. They argue that modeling user intent as an explicit signal significantly improves the robustness of safety classifiers.
- AIMS contains 1,724 difficult safety prompts with intent descriptions and harm labels.
- Intent-conditioned distillation outperforms reasoning-only distillation in most teacher-student pairs.
- Directly rewarding intent faithfulness with GRPO yields the strongest average performance across five external safety benchmarks.
- Intent-aware models form the Pareto frontier of inference latency versus F1, meaning no other model beats them on both latency and F1 at once.
Faithful intent modeling serves as a compact, high-quality supervision signal for creating more robust safety classifiers.
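
The latency-F1 Pareto frontier mentioned above can be made concrete with a small sketch. The code below is illustrative only: the `ModelPoint` type, the `pareto_frontier` helper, and every number in `points` are hypothetical placeholders, not results or code from the paper. It assumes lower latency and higher F1 are both better, and keeps only the models that no other model matches or beats on both axes.

```python
from typing import List, NamedTuple


class ModelPoint(NamedTuple):
    """Hypothetical record for one evaluated model (not from the paper)."""
    name: str
    latency_ms: float  # inference latency, lower is better
    f1: float          # classification F1, higher is better


def pareto_frontier(points: List[ModelPoint]) -> List[ModelPoint]:
    """Return the models not dominated on (latency, F1).

    A model is dominated if another model is at least as fast and at least
    as accurate, and strictly better on one of the two. Exact duplicates of
    a frontier point are dropped as well.
    """
    # Sort by latency ascending; among equal latencies, put the best F1 first.
    ordered = sorted(points, key=lambda p: (p.latency_ms, -p.f1))
    frontier: List[ModelPoint] = []
    best_f1 = float("-inf")
    for p in ordered:
        # Every model seen so far is at least as fast as p, so p survives
        # only if it strictly improves on the best F1 seen so far.
        if p.f1 > best_f1:
            frontier.append(p)
            best_f1 = p.f1
    return frontier


# Made-up placeholder values, chosen only to exercise the function.
points = [
    ModelPoint("model_a", 10.0, 0.80),
    ModelPoint("model_b", 20.0, 0.85),
    ModelPoint("model_c", 25.0, 0.83),  # slower and less accurate than model_b
    ModelPoint("model_d", 40.0, 0.90),
    ModelPoint("model_e", 15.0, 0.78),  # slower and less accurate than model_a
]

frontier_names = [p.name for p in pareto_frontier(points)]
print(frontier_names)
```

Under this reading, the claim that intent-aware models form the frontier would mean that every model returned by such a filter is an intent-aware one. Sorting first keeps the check to a single pass instead of comparing every pair of models.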