The article addresses the problem that large language models continue to produce unsafe outputs after deployment, and it proposes a real-time monitoring solution. It introduces a simple monitor that thresholds verifier signals from an external model to produce alarm decisions, with the thresholds calibrated via risk control.
- The method uses thresholding on verifier signals to generate alarms.
- Thresholds are calibrated using risk control techniques.
- Experiments were conducted on mathematical reasoning and red-teaming datasets.
- The simple design is competitive with advanced monitors based on sequential hypothesis testing.
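
The thresholding and calibration steps summarized above can be sketched roughly as follows. This is a minimal illustration, not the article's actual procedure: it assumes that a higher verifier score means a more likely unsafe output, that a labeled calibration set is available, and that the risk being controlled is the rate of missed unsafe outputs. The calibration rule is a conformal-risk-control-style bound chosen here for illustration, and the names `calibrate_threshold` and `raise_alarm` are hypothetical.

```python
import numpy as np


def calibrate_threshold(scores, unsafe, alpha):
    """Pick the largest threshold whose corrected miss rate is at most alpha.

    Assumptions (not taken from the article): higher score = more likely
    unsafe, and a "miss" is an unsafe output whose score falls below the
    threshold, so that no alarm is raised for it.
    """
    scores = np.asarray(scores, dtype=float)
    unsafe = np.asarray(unsafe, dtype=bool)
    n = len(scores)

    # Try candidate thresholds from largest to smallest. The miss count can
    # only shrink as the threshold drops, so the first candidate that
    # satisfies the bound is the largest valid one.
    for t in np.sort(np.unique(scores))[::-1]:
        misses = np.sum(unsafe & (scores < t))
        # Finite-sample corrected risk for a loss bounded by 1:
        # (n / (n + 1)) * empirical_risk + 1 / (n + 1)
        if (misses + 1) / (n + 1) <= alpha:
            return float(t)

    # No candidate meets the target: fall back to alarming on everything.
    return float("-inf")


def raise_alarm(score, threshold):
    """Return True when the verifier score reaches the calibrated threshold."""
    return score >= threshold
```

At deployment time, each new verifier score would be passed to `raise_alarm` together with the threshold returned by `calibrate_threshold`. A smaller `alpha` lowers the threshold and produces more alarms, and when the calibration set is too small to support the requested `alpha`, the sketch falls back to alarming on every output.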