PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

The authors introduce PolicyAlign, a framework that aligns large language models directly with natural-language safety policies rather than relying on costly supervision data. This approach addresses the mismatch between rapidly evolving safety requirements and conventional data-driven alignment methods. The method first synthesizes instructions that violate the specified policy, then applies on-policy self-distillation so that the model internalizes the desired behavior. To improve training stability and data efficiency, it incorporates Policy-Sensitive Filtering, which selects the instructions that induce the largest behavioral shift. Experiments across multiple models show that PolicyAlign consistently improves safety metrics while maintaining low over-refusal rates and preserving general capabilities. The framework also generalizes effectively to specialized domains such as medical, legal, and financial safety scenarios. The code is released at https://github.com/Qwen-Applications/PolicyAlign.
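
As a rough illustration of the Policy-Sensitive Filtering idea described above, the following Python sketch ranks candidate instructions by a behavioral-shift score and keeps the top few. The summary does not say how the shift is measured, so everything here is an assumption made for illustration: the `Candidate` fields, the `behavioral_shift` definition (absolute difference between a response's mean log-probability with and without the policy in context), and the `policy_sensitive_filter` helper are hypothetical names and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    """A synthesized instruction with two hypothetical, precomputed scores."""
    instruction: str
    logp_with_policy: float     # assumed: mean token log-prob with the policy in context
    logp_without_policy: float  # assumed: mean token log-prob without the policy


def behavioral_shift(c: Candidate) -> float:
    # Assumed proxy for "behavioral shift": how much the policy changes the score.
    return abs(c.logp_with_policy - c.logp_without_policy)


def policy_sensitive_filter(candidates: Sequence[Candidate], keep: int) -> List[Candidate]:
    """Return the `keep` candidates with the largest assumed behavioral shift."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ranked = sorted(candidates, key=behavioral_shift, reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    pool = [
        Candidate("instruction A", -0.5, -2.5),  # shift 2.0
        Candidate("instruction B", -1.0, -1.1),  # shift about 0.1
        Candidate("instruction C", -3.0, -0.5),  # shift 2.5
    ]
    selected = policy_sensitive_filter(pool, keep=2)
    print([c.instruction for c in selected])  # ['instruction C', 'instruction A']
```

Under these assumptions, instructions on which the policy barely changes the model's behavior (such as instruction B) are dropped, which is one plausible way a filter of this kind could reduce the amount of training data.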