This paper proposes a filtering defense against dirty-label poisoning attacks on speech command classification systems: it clusters unsupervised representations of the training utterances to identify and remove poisoned data.
- The threat model involves superimposing a trigger on utterances from a source class and relabeling them as a target class.
- Unsupervised representations are learned using DIstillation with NO labels (DINO).
- K-means and LDA are used to cluster these representations; only utterances whose label is the most frequent one in their cluster are retained, and the rest are discarded.
- The defense reduces the attack success rate from 99.75% to 0.25% when 10% of the source class is poisoned.
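
The retention rule in the list above can be illustrated with a minimal sketch. It assumes cluster assignments are already available (for example from K-means over the DINO representations, which is not shown here) and covers only the majority-label filter. The function name `filter_by_cluster_majority`, the toy labels, and the tie-breaking behavior (first label encountered wins) are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict


def filter_by_cluster_majority(labels, cluster_ids):
    """Return indices of utterances whose label is the most frequent label
    in their cluster. Hypothetical helper, not the authors' code."""
    if len(labels) != len(cluster_ids):
        raise ValueError("labels and cluster_ids must have the same length")

    # Count how often each label occurs inside each cluster.
    counts = defaultdict(Counter)
    for label, cid in zip(labels, cluster_ids):
        counts[cid][label] += 1

    # Majority label per cluster (ties go to the label seen first; an assumption).
    majority = {cid: c.most_common(1)[0][0] for cid, c in counts.items()}

    # Keep only utterances that agree with their cluster's majority label.
    return [
        i
        for i, (label, cid) in enumerate(zip(labels, cluster_ids))
        if label == majority[cid]
    ]


# Toy data: index 3 mimics a poisoned utterance, i.e. source-class audio
# ("yes") that was relabeled as the target class ("no") but still falls
# into the source-class cluster 0.
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]

kept = filter_by_cluster_majority(labels, cluster_ids)
print(kept)  # [0, 1, 2, 4, 5, 6]
```

In this toy case the relabeled utterance at index 3 is dropped because its label disagrees with the majority of its cluster. The clean utterance at index 7 is dropped as well, which shows that the rule can also discard some clean data.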
The defense mitigates poisoning attacks across the threat models and trigger variations evaluated, keeping classification performance robust.
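
To make the reported attack success rate concrete, the sketch below assumes the usual definition: the percentage of triggered source-class test utterances that the classifier assigns to the target class. The function name `attack_success_rate` and the counts (400 triggered utterances) are invented for illustration and are not taken from the paper; they are chosen only so that the arithmetic reproduces the two percentages quoted above.

```python
def attack_success_rate(predictions, target_label):
    """Percentage of triggered utterances classified as the attacker's target.
    Hypothetical helper, assuming the usual definition of the metric."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    hits = sum(1 for p in predictions if p == target_label)
    return 100.0 * hits / len(predictions)


# Invented predictions on 400 triggered source-class ("yes") utterances,
# where the attacker's target class is "no".
preds_undefended = ["no"] * 399 + ["yes"]   # model trained on poisoned data
preds_defended = ["yes"] * 399 + ["no"]     # model trained on filtered data

asr_before = attack_success_rate(preds_undefended, "no")
asr_after = attack_success_rate(preds_defended, "no")
print(asr_before, asr_after)  # 99.75 0.25
```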