Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

This survey synthesizes research on toxicity detection and detoxification strategies designed for multilingual large language models. It catalogs threat models that exploit linguistic variation, such as code-switching, orthographic differences, and translation pivots, to bypass safety alignment. It organizes existing work by task formulation, including toxicity classification and toxic-to-neutral rewriting, and by detection approach, including cross-lingual encoders and LLM-based detectors. Mitigation strategies span data filtering, supervised tuning, decoding-time steering, and multilingual guardrails. The analysis highlights persistent challenges, notably uneven language coverage and fragmented evaluation protocols. It also addresses culturally contingent definitions of harm and the risk that detoxification may suppress legitimate dialectal or identity-related expression.