Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes
The authors introduce AIMS, a dataset of 1,724 human-annotated difficult safety prompts paired with intent descriptions and harm labels, to evaluate intent-aware training across multiple regimes. They argue that modeling user intent as an explicit signal significantly improves the robustness of safety classifiers.