Improving Speaker Verification for Non-Verbal Vocalizations

A new framework combines frozen Data2Vec features with an ECAPA-TDNN and a Mixture of Experts module to improve speaker verification on non-verbal vocalizations (NVVs). It uses conditional distillation and a contrastive loss to preserve speech performance while reducing the speech-NVV equal error rate (EER) from 38.93% to 22.66%; speech EER also improves, from 13.17% to 9.24%.
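
For readers unfamiliar with the metric, the sketch below shows one common way to approximate an equal error rate from verification scores, and works out the relative reductions implied by the reported figures. It is a minimal illustration, not the authors' evaluation code: the function names `equal_error_rate` and `relative_reduction` are hypothetical, the toy scores and labels are made up, and the threshold sweep is a simple approximation that assumes higher scores mean "same speaker".

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: similarity scores, higher means more likely the same speaker.
    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need both target and non-target trials")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is at or above the threshold.
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_nontarget  # false acceptance rate
        frr = false_rejects / n_target     # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(before, after):
    """Fractional reduction of an error rate relative to its starting value."""
    return (before - after) / before


# Toy trial list (illustrative values only, not data from the work described).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(f"toy EER: {equal_error_rate(toy_scores, toy_labels):.2%}")  # 25.00%

# Relative reductions computed from the EER figures quoted above.
print(f"speech-NVV: {relative_reduction(38.93, 22.66):.1%}")  # 41.8%
print(f"speech:     {relative_reduction(13.17, 9.24):.1%}")   # 29.8%
```

Read this way, the reported numbers correspond to roughly a 41.8% relative reduction in speech-NVV EER and a 29.8% relative reduction in speech EER. Production evaluations typically interpolate the ROC or DET curve rather than pick the nearest observed threshold, so values from this sketch can differ slightly from toolkit output.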