Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

The authors introduce AIMS, a dataset of 1,724 human-annotated difficult safety prompts paired with intent descriptions and harm labels, to evaluate intent-aware training across multiple regimes. They argue that modeling user intent as an explicit signal significantly improves the robustness of safety classifiers.

AIMS contains 1,724 difficult safety prompts with intent descriptions and harm labels.
Intent-conditioned distillation outperforms reasoning-only distillation in most teacher-student pairs.
Directly rewarding intent faithfulness with GRPO yields the strongest average performance across five external safety benchmarks.
Intent-aware models form the Pareto frontier of inference latency versus F1.
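
To make the latency-F1 Pareto frontier claim concrete, the following minimal sketch shows how such a frontier can be computed from a set of (latency, F1) measurements. It is an illustration only, not the authors' evaluation code: the `ModelPoint` type, the `pareto_frontier` function, and the example numbers are all hypothetical and are not taken from the paper. A model is on the frontier if no other model is both at least as fast and at least as accurate, with a strict improvement in one of the two.

```python
from typing import List, NamedTuple


class ModelPoint(NamedTuple):
    """Hypothetical container for one model's measurements."""
    name: str
    latency_ms: float  # inference latency, lower is better
    f1: float          # classification F1, higher is better


def pareto_frontier(points: List[ModelPoint]) -> List[ModelPoint]:
    """Return the points that no other point dominates.

    Exact duplicates (same latency and same F1) are collapsed to the
    first one encountered after sorting.
    """
    # Sort by latency ascending; on ties, put the higher F1 first.
    ordered = sorted(points, key=lambda p: (p.latency_ms, -p.f1))
    frontier: List[ModelPoint] = []
    best_f1 = float("-inf")
    for p in ordered:
        # Every earlier point is at least as fast, so p survives only
        # if it strictly improves on the best F1 seen so far.
        if p.f1 > best_f1:
            frontier.append(p)
            best_f1 = p.f1
    return frontier


# Illustrative numbers only (assumed, not results from the paper).
example_points = [
    ModelPoint("A", latency_ms=10.0, f1=0.70),
    ModelPoint("B", latency_ms=20.0, f1=0.80),
    ModelPoint("C", latency_ms=25.0, f1=0.75),  # slower and worse than B
    ModelPoint("D", latency_ms=40.0, f1=0.85),
]
frontier_names = [p.name for p in pareto_frontier(example_points)]
print(frontier_names)
```

In this toy data, model C is dominated by model B (B is both faster and more accurate), so C falls off the frontier while A, B, and D remain. Saying that a family of models forms the frontier means every non-dominated point belongs to that family.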

Faithful intent modeling serves as a compact, high-quality supervision signal for creating more robust safety classifiers.