The authors introduce AIMS, a dataset of 1,724 human-annotated difficult safety prompts paired with intent descriptions and harm labels, to evaluate intent-aware training across multiple regimes. They argue that modeling user intent as an explicit signal significantly improves the robustness of safety classifiers.

  • AIMS contains 1,724 difficult safety prompts with intent descriptions and harm labels.
  • Intent-conditioned distillation outperforms reasoning-only distillation in most teacher-student pairs.
  • Directly rewarding intent faithfulness with Group Relative Policy Optimization (GRPO) yields the strongest average performance across five external safety benchmarks.
  • Intent-aware models lie on the Pareto frontier of inference latency versus F1: no other evaluated model is both faster and more accurate.
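
To make the Pareto-frontier claim concrete, the following is a minimal sketch of how such a frontier could be computed from (latency, F1) measurements. The `ModelPoint` type, the `pareto_frontier` helper, and every number below are hypothetical placeholders for illustration only; none of them come from the paper.

```python
from typing import List, NamedTuple


class ModelPoint(NamedTuple):
    name: str          # hypothetical model label
    latency_ms: float  # inference latency, lower is better
    f1: float          # classification F1, higher is better


def pareto_frontier(points: List[ModelPoint]) -> List[ModelPoint]:
    """Return the points that no other point beats on both latency and F1."""
    frontier = []
    best_f1 = float("-inf")
    # Sweep from fastest to slowest; on equal latency, visit higher F1 first.
    for p in sorted(points, key=lambda p: (p.latency_ms, -p.f1)):
        # A slower model stays on the frontier only if it improves F1.
        if p.f1 > best_f1:
            frontier.append(p)
            best_f1 = p.f1
    return frontier


# Made-up measurements, NOT results from the paper.
points = [
    ModelPoint("a", 10.0, 0.80),
    ModelPoint("b", 20.0, 0.85),
    ModelPoint("c", 25.0, 0.83),  # dominated by "b": slower and less accurate
    ModelPoint("d", 40.0, 0.90),
    ModelPoint("e", 10.0, 0.75),  # dominated by "a": same latency, lower F1
]

frontier_names = [p.name for p in pareto_frontier(points)]
print(frontier_names)
```

Under this definition, saying a family of models "forms the frontier" means each of them survives the sweep: any competing model that is faster is also less accurate, and any that is more accurate is also slower.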

Faithful intent modeling serves as a compact, high-quality supervision signal for creating more robust safety classifiers.