The article addresses the problem that large language models continue to produce unsafe outputs during deployment, and it proposes a real-time monitoring solution. It introduces a simple monitor that thresholds verifier signals from an external model to produce alarm decisions, with the thresholds calibrated via risk control.

  • The method uses thresholding on verifier signals to generate alarms.
  • Thresholds are calibrated using risk control techniques.
  • Experiments were conducted on mathematical reasoning and red teaming datasets.
  • The simple design is competitive with advanced monitors based on sequential hypothesis testing.
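
The thresholding-plus-calibration idea summarized above can be illustrated with a minimal sketch. The summary does not say which risk-control procedure the article uses, so this example assumes a conformal-risk-control-style rule that bounds the miss rate on unsafe outputs. The function names (`calibrate_threshold`, `first_alarm`), the score orientation (higher means more likely unsafe), and the choice of risk are all hypothetical and may differ from the article's actual method.

```python
import math
from typing import Optional, Sequence


def calibrate_threshold(unsafe_scores: Sequence[float], alpha: float) -> float:
    """Pick the largest threshold whose corrected miss rate stays within alpha.

    Assumptions (not taken from the article): `unsafe_scores` are verifier
    scores for calibration outputs known to be unsafe, higher scores mean
    "more likely unsafe", and an alarm fires when score >= threshold.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(unsafe_scores)
    # Finite-sample correction: require (misses + 1) / (n + 1) <= alpha.
    max_misses = math.floor(alpha * (n + 1)) - 1
    if max_misses < 0:
        # Too few calibration points for this alpha: alarm on everything.
        return -math.inf
    ordered = sorted(unsafe_scores)
    # At most `max_misses` calibration scores lie strictly below this value.
    return ordered[max_misses]


def first_alarm(scores: Sequence[float], tau: float) -> Optional[int]:
    """Return the index of the first score that triggers an alarm, else None."""
    for step, score in enumerate(scores):
        if score >= tau:
            return step
    return None
```

In this sketch, `calibrate_threshold` would be run once on held-out data, and `first_alarm` would then be applied to the stream of verifier scores as a response is generated. Returning `-math.inf` when the calibration set is too small is a conservative design choice: the monitor alarms on every output rather than claiming a guarantee it cannot support.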