HaloGuard 1.0: an open-weights constitutional classifier for multilingual AI safety

Researchers present HaloGuard 1.0, an open-weights implementation of the constitutional-classifier paradigm designed to improve input (prompt) safety classification across multiple languages. A natural-language constitution with 46 policies and 2,940 subcategories drives the synthetic data generation used to train the model, which handles multilingual inputs.

HaloGuard 1.0-0.8B achieves an average F1 score of 90.7 on seven prompt-safety benchmarks, outperforming baselines of up to 27B parameters, with a false-positive rate of 4.3% and a false-negative rate of 9.5%.
The larger HaloGuard 1.0-4B variant reaches an average F1 score of 92.1 with a false-positive rate of 3.5%, prioritizing precision over recall.
To reduce false positives, the training corpus uses exhaustive one-to-one paired counterfactuals: each example has a counterpart that flips the intent while holding topic and vocabulary fixed.
Multilingual materialization renders the data in 46 languages, treating language as a surface form rather than as an adversarial signal.
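
For readers who want to relate the F1, false-positive, and false-negative figures above, the following is a minimal Python sketch of how such metrics are conventionally computed from a confusion matrix. It assumes that "positive" means a prompt labeled unsafe and that the reported average is an unweighted mean of per-benchmark F1; the summary does not state either detail. The names `Confusion` and `average_f1` are hypothetical, and the toy counts are illustrative only, not HaloGuard results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Confusion:
    """Counts for one benchmark; 'positive' is assumed to mean 'unsafe prompt'."""
    tp: int  # unsafe prompts correctly flagged
    fp: int  # safe prompts wrongly flagged
    tn: int  # safe prompts correctly passed
    fn: int  # unsafe prompts wrongly passed


def _ratio(num: int, den: int) -> float:
    # Return 0.0 instead of raising when a class is absent from the benchmark.
    return num / den if den else 0.0


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def f1(c: Confusion) -> float:
    # Harmonic mean of precision and recall, written in terms of raw counts.
    return _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)


def false_positive_rate(c: Confusion) -> float:
    # Share of safe prompts that were wrongly flagged.
    return _ratio(c.fp, c.fp + c.tn)


def false_negative_rate(c: Confusion) -> float:
    # Share of unsafe prompts that were missed; equals 1 - recall.
    return _ratio(c.fn, c.fn + c.tp)


def average_f1(benchmarks: Sequence[Confusion]) -> float:
    # Assumption: unweighted mean over benchmarks.
    if not benchmarks:
        raise ValueError("need at least one benchmark")
    return sum(f1(c) for c in benchmarks) / len(benchmarks)


if __name__ == "__main__":
    # Toy counts for two imaginary benchmarks (not taken from the release).
    toy = [Confusion(tp=90, fp=5, tn=95, fn=10),
           Confusion(tp=80, fp=2, tn=98, fn=20)]
    for c in toy:
        print(f"F1={100 * f1(c):.1f}  "
              f"FPR={100 * false_positive_rate(c):.1f}%  "
              f"FNR={100 * false_negative_rate(c):.1f}%")
    print(f"average F1={100 * average_f1(toy):.1f}")
```

Because the false-negative rate is one minus recall, a model tuned toward precision can lower its false-positive rate while its false-negative rate rises. That is the kind of trade-off described for the 4B variant.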

The release provides an open-weights guard model that reduces the parameter count required for state-of-the-art multilingual safety performance: the 0.8B variant outperforms baselines of up to 27B parameters.