This paper investigates whether safety guardrails actually require chain-of-thought reasoning by training a lightweight bidirectional encoder and a reasoning-based guard on the same corpus. The authors find that removing reasoning does not reduce moderation accuracy, challenging the common belief that step-by-step thinking is necessary for effective moderation.

  • A 395M-parameter label-only encoder achieves an average F1 of 82.90 ± 0.26 across public benchmarks.
  • The model matches the performance of a much larger reasoning guard built on a decoder architecture.
  • Inference requires only a single forward pass for inputs up to 512 tokens, resulting in approximately a 100x reduction in compute.
  • The label-only encoder demonstrates greater robustness to training-label noise and retains higher recall at strict false-positive rates compared to the reasoning guard.
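The two metrics named above, F1 and recall at a strict false-positive rate, can be sketched in a few lines. This is a minimal illustration and not the authors' evaluation code: the function names `f1_score` and `recall_at_fpr`, the convention that label 1 means "unsafe", and the toy scores are all assumptions made for the example.

```python
def f1_score(preds, labels):
    """F1 for binary predictions, where 1 = unsafe and 0 = safe (assumed convention)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def recall_at_fpr(scores, labels, max_fpr):
    """Highest recall reachable with a threshold whose false-positive rate is <= max_fpr.

    scores: higher means "more likely unsafe". labels: 1 = unsafe, 0 = safe.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    best_recall = 0.0
    # Sweep thresholds from strictest to loosest; an input is flagged when
    # its score is >= the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives > max_fpr:
            break  # FPR can only grow as the threshold is lowered
        best_recall = tp / positives
    return best_recall


# Toy data for illustration only; these are not results from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]
```

Recall at a fixed false-positive rate is a useful complement to F1 for guardrails because deployed filters are often tuned to flag very few benign inputs, and two models with similar F1 can differ considerably in how many unsafe inputs they still catch under that constraint.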

The findings suggest that current guardrail benchmarks may not be difficult enough to reward reasoning, and that the necessity of chain-of-thought for moderation remains unproven. The label-only encoder also offers a more efficient option for on-device deployment.