Clustering Unsupervised Representations as Defense against Poisoning Attacks on Speech Commands Classification System

This paper proposes a filtering defense against dirty-label poisoning attacks on speech commands classification systems by clustering unsupervised representations to identify and remove poisoned training data.

In the threat model, the attacker superimposes a trigger on some utterances from a source class and relabels them as a target class.
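
As a rough illustration of this dirty-label threat model, the sketch below poisons a fraction of one class in a toy dataset. Everything in it is an assumption made for illustration: the function name `poison_dataset`, the representation of a waveform as a list of float samples, the additive mixing at the start of the utterance, and the `trigger_gain` parameter are not taken from the paper.

```python
import random


def poison_dataset(dataset, trigger, source_label, target_label,
                   poison_fraction, trigger_gain=0.1, seed=0):
    """Return a poisoned copy of `dataset` and the indices that were poisoned.

    dataset: list of (waveform, label) pairs; a waveform is a list of floats.
    trigger: list of floats added onto the start of each chosen waveform.
    All names and the additive mixing rule are illustrative assumptions.
    """
    if not 0.0 <= poison_fraction <= 1.0:
        raise ValueError("poison_fraction must be in [0, 1]")

    rng = random.Random(seed)
    source_idx = [i for i, (_, label) in enumerate(dataset)
                  if label == source_label]
    n_poison = round(poison_fraction * len(source_idx))
    chosen = set(rng.sample(source_idx, n_poison))

    poisoned = []
    for i, (wave, label) in enumerate(dataset):
        mixed = list(wave)  # copy so the clean dataset is left untouched
        if i in chosen:
            # Superimpose the trigger, then flip the label (dirty-label attack).
            for t in range(min(len(trigger), len(mixed))):
                mixed[t] += trigger_gain * trigger[t]
            label = target_label
        poisoned.append((mixed, label))
    return poisoned, sorted(chosen)
```

Calling `poison_dataset(data, trigger, "left", "right", 0.1)` on a dataset with 20 "left" utterances would poison two of them and relabel those two as "right".
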
Unsupervised representations are learned using DIstillation with NO labels (DINO).
K-means and LDA are used to cluster these representations; only utterances whose label is the most frequent one in their cluster are retained for training, and the rest are discarded.
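
The retention rule can be sketched as a majority vote within each cluster. This is a minimal sketch, not the paper's implementation: it assumes the cluster assignments have already been computed (the DINO, K-means, and LDA steps are not reproduced here), and the function name `filter_by_cluster_majority` is hypothetical.

```python
from collections import Counter, defaultdict


def filter_by_cluster_majority(labels, cluster_ids):
    """Return sorted indices of examples whose label matches the most
    frequent label in their cluster.

    labels:      training label of each utterance.
    cluster_ids: cluster assignment of each utterance (assumed precomputed).
    Ties are broken in favor of the label seen first in the cluster.
    """
    if len(labels) != len(cluster_ids):
        raise ValueError("labels and cluster_ids must have the same length")

    members_by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members_by_cluster[cid].append(idx)

    keep = []
    for members in members_by_cluster.values():
        counts = Counter(labels[i] for i in members)
        majority_label, _ = counts.most_common(1)[0]
        keep.extend(i for i in members if labels[i] == majority_label)
    return sorted(keep)


# Toy example: utterance 3 falls in the "yes" cluster but carries the label
# "no", as a relabeled (poisoned) utterance might.
labels = ["yes", "yes", "yes", "no", "no", "no", "no"]
cluster_ids = [0, 0, 0, 0, 1, 1, 1]
kept = filter_by_cluster_majority(labels, cluster_ids)
print(kept)  # [0, 1, 2, 4, 5, 6]
```

In this toy case the mislabeled utterance is the only one discarded, because its label disagrees with the majority of its cluster.
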
When 10% of the source class is poisoned, the defense reduces the attack success rate from 99.75% to 0.25%.
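
The summary does not define the attack success rate. A common reading, assumed here, is the percentage of triggered source-class test utterances that the trained classifier assigns to the target class. The helper below is a hypothetical sketch of that definition, and the predictions in the example are made up.

```python
def attack_success_rate(predictions, target_label):
    """Percentage of predictions equal to the attacker's target label.

    predictions: predicted labels for source-class test utterances that
                 carry the trigger (assumed definition of the test set).
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    hits = sum(1 for p in predictions if p == target_label)
    return 100.0 * hits / len(predictions)


# Made-up predictions for four triggered utterances; three hit the target.
asr = attack_success_rate(["no", "no", "yes", "no"], target_label="no")
print(asr)  # 75.0
```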

The defense is reported to remain effective across a variety of threat models and trigger variations.