This paper introduces DriftGuard, a framework that combines multi-monitor drift detection with selective model updating to address evolving toxicity in automated moderation systems. The system tracks specific safety-relevant shifts, such as identity-harm and toxic-risk drift, rather than relying solely on global distributional changes.
- DriftGuard monitors global text drift, identity-harm drift, model uncertainty, toxic-risk drift, and false-negative-risk drift.
- Updates utilize a hard-mix adaptation set prioritizing likely false negatives, high-risk identity examples, and uncertain boundary cases.
- On Civil Comments temporal shift, the framework achieved a toxic recall of 0.8777.
- On the Jigsaw-to-DynaHate cross-dataset shift, toxic recall rose from 0.7107 to 0.8523 relative to baselines.
- Bootstrap analysis showed stable safety gains on DynaHate, with false-negative prevalence decreasing by 0.0781.
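
The metrics reported above can be illustrated with a minimal sketch. It assumes binary labels (1 = toxic), defines false-negative prevalence as the share of all items that are toxic but predicted non-toxic, and uses a paired percentile bootstrap. These definitions, the function names, and the resampling scheme are assumptions made for illustration and may differ from the paper's actual evaluation protocol.

```python
import random

def toxic_recall(y_true, y_pred):
    """Fraction of truly toxic items (label 1) that are predicted toxic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def false_negative_prevalence(y_true, y_pred):
    """Share of ALL items that are toxic but predicted non-toxic (assumed definition)."""
    n = len(y_true)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fn / n if n else 0.0

def bootstrap_delta(y_true, pred_base, pred_new, metric, n_boot=1000, seed=0):
    """Paired percentile bootstrap of metric(new) - metric(base).

    Returns the mean difference and an approximate 95% interval.
    """
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        # Resample item indices with replacement; reuse them for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        new_score = metric(yt, [pred_new[i] for i in idx])
        base_score = metric(yt, [pred_base[i] for i in idx])
        deltas.append(new_score - base_score)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    return sum(deltas) / n_boot, (lo, hi)
```

Under these assumptions, a negative bootstrap difference in `false_negative_prevalence` whose interval excludes zero would correspond to a stable reduction in missed toxic content.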
DriftGuard links safety-aware drift detection to targeted, lightweight model updating to provide more robust adaptive toxicity moderation in dynamic online environments.
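
As a rough illustration of how drift alarms might be linked to a targeted update set, the sketch below checks the five monitors against per-monitor thresholds and then ranks candidate examples for a hard-mix adaptation set. Everything here is hypothetical: the `Example` fields, the `fn_risk` score, the thresholds, and the strict priority order (likely false negatives, then high-risk identity examples, then uncertain boundary cases) are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

# Monitor names follow the summary above; the identifiers are illustrative.
MONITORS = ("global_text", "identity_harm", "uncertainty",
            "toxic_risk", "false_negative_risk")

@dataclass
class Example:
    text: str
    p_toxic: float            # model's predicted probability of toxicity
    fn_risk: float            # hypothetical false-negative-risk score in [0, 1]
    mentions_identity: bool   # hypothetical flag from an identity-term matcher

def fired_monitors(scores, thresholds):
    """Return the monitors whose drift score exceeds its threshold."""
    return [m for m in MONITORS if scores.get(m, 0.0) > thresholds[m]]

def hard_mix_rank(ex, fn_cut=0.7, risk_cut=0.5, margin=0.1):
    """Assumed priority: 0 = likely false negative, 1 = high-risk identity,
    2 = uncertain boundary case, None = not selected."""
    if ex.p_toxic < 0.5 and ex.fn_risk >= fn_cut:
        return 0
    if ex.mentions_identity and ex.p_toxic >= risk_cut:
        return 1
    if abs(ex.p_toxic - 0.5) <= margin:
        return 2
    return None

def select_hard_mix(pool, budget):
    """Pick up to `budget` examples, highest-priority category first."""
    ranked = [(hard_mix_rank(ex), ex) for ex in pool]
    kept = [(r, ex) for r, ex in ranked if r is not None]
    kept.sort(key=lambda pair: pair[0])  # stable sort keeps pool order within a rank
    return [ex for _, ex in kept[:budget]]
```

In this sketch an update would be triggered only when `fired_monitors` returns a non-empty list, and the model would then be fine-tuned on the output of `select_hard_mix` rather than on the full incoming stream.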