This paper proposes a filtering defense against dirty-label poisoning attacks on speech commands classification systems by clustering unsupervised representations to identify and remove poisoned training data.

  • The threat model involves superimposing a trigger on utterances from a source class and relabeling them as a target class.
  • Unsupervised representations are learned using DIstillation with NO labels (DINO).
  • K-means and LDA are used to cluster these representations; an utterance is retained only if its label matches the most frequent label in its cluster.
  • The defense reduces the attack success rate from 99.75% to 0.25% when 10% of the source class is poisoned.
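
The cluster-based filtering step can be illustrated with a minimal sketch. It assumes the cluster assignments have already been computed (for example by K-means on the DINO representations) and shows only the majority-label rule. The function name `filter_by_cluster_majority` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def filter_by_cluster_majority(cluster_ids, labels):
    """Return indices of utterances whose label is the most frequent
    label in their cluster (hypothetical helper, not the paper's code)."""
    # Group the (possibly poisoned) training labels by cluster.
    labels_per_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        labels_per_cluster[cluster].append(label)

    # Most frequent label per cluster; ties go to the label seen first.
    majority = {
        cluster: Counter(cluster_labels).most_common(1)[0][0]
        for cluster, cluster_labels in labels_per_cluster.items()
    }

    # Keep an utterance only if its label agrees with its cluster's majority.
    return [
        i
        for i, (cluster, label) in enumerate(zip(cluster_ids, labels))
        if label == majority[cluster]
    ]


# Toy data (assumed): utterance 3 is a "yes" utterance carrying a trigger and
# relabeled "no", but its representation still falls in the "yes" cluster.
cluster_ids = [0, 0, 0, 0, 1, 1, 1]
labels = ["yes", "yes", "yes", "no", "no", "no", "no"]

kept = filter_by_cluster_majority(cluster_ids, labels)
print(kept)  # [0, 1, 2, 4, 5, 6]
```

In this toy case the relabeled utterance disagrees with the majority label of its cluster and is discarded, while every clean utterance is kept. The rule relies on the representations being learned without labels, so that a poisoned utterance clusters with its true class rather than with its assigned label.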

The approach mitigates poisoning attacks across several threat models and trigger variations while preserving classification performance.