Online Safety Monitoring for LLMs

The article addresses the problem that large language models continue to produce unsafe outputs during deployment, and it proposes real-time monitoring as a solution. It introduces a simple monitor that thresholds verifier signals from an external model to decide when to raise an alarm, with the thresholds calibrated via risk control.

The method generates alarms by thresholding verifier signals, with the thresholds calibrated using risk control techniques.
Experiments were conducted on mathematical reasoning and red teaming datasets.
The simple design is competitive with advanced monitors based on sequential hypothesis testing.
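
The thresholding and calibration steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the function names `first_alarm` and `calibrate_threshold` are hypothetical, higher verifier scores are assumed to mean "more likely unsafe", and the calibration target (the rate of unsafe runs that never trigger an alarm, with a conformal-style finite-sample correction) is one plausible reading of "risk control" that the summary does not confirm.

```python
from typing import Optional, Sequence


def first_alarm(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first step whose score reaches the threshold.

    Returns None if no step triggers an alarm.
    """
    for step, score in enumerate(scores):
        if score >= threshold:
            return step
    return None


def calibrate_threshold(unsafe_runs: Sequence[Sequence[float]], alpha: float) -> float:
    """Pick the largest candidate threshold whose corrected miss rate is <= alpha.

    Assumption: every calibration run is known to be unsafe and is non-empty.
    A run is "missed" when its peak score stays below the threshold.
    The (misses + 1) / (n + 1) term is a conservative finite-sample correction.
    """
    peaks = sorted(max(run) for run in unsafe_runs)
    n = len(peaks)
    best = float("-inf")  # fallback: alarm on every step, so nothing is missed
    for candidate in peaks:
        misses = sum(1 for peak in peaks if peak < candidate)
        if (misses + 1) / (n + 1) <= alpha:
            best = candidate
    return best


# Hypothetical calibration data: nine unsafe runs with peak scores 0.1 ... 0.9.
calibration_runs = [[0.05, k / 10] for k in range(1, 10)]
threshold = calibrate_threshold(calibration_runs, alpha=0.25)

# Monitor a new stream of verifier scores step by step.
alarm_step = first_alarm([0.05, 0.3, 0.9], threshold)
```

A higher threshold raises fewer alarms but misses more unsafe runs, so the sketch keeps the largest threshold that still satisfies the target rate. If no candidate satisfies it, the fallback threshold of negative infinity raises an alarm at the first step.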