The authors propose a novel training approach for end-to-end automatic speech recognition (ASR) that addresses noisy labels and lack of domain specificity in large-scale weakly supervised datasets. The method involves pretraining on the full dataset, continued pretraining on a filtered subset based on character error rate, and fine-tuning on acoustically similar samples from that subset.

  • The approach has three steps: pretraining on the entire dataset, continued pretraining on a subset filtered by character error rate (CER), and fine-tuning on a small number of samples that are acoustically similar to the target domain.
  • Experiments using a 90,000-hour weakly supervised Japanese dataset showed that filtering reduced CER by up to 6.4%.
  • The selection method further reduced CER by up to 4.0%, with both steps reusing training samples from the initial pretraining phase.
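
The CER-based filtering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes CER is computed between each weak label and a hypothesis decoded by the pretrained model, and the field names (`label`, `hypothesis`) and the threshold `max_cer` are hypothetical, since the summary does not state them.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_by_cer(samples: list[dict], max_cer: float = 0.1) -> list[dict]:
    """Keep samples whose weak label agrees closely with the model hypothesis.

    `max_cer` is a hypothetical threshold, not a value taken from the paper.
    """
    return [s for s in samples if cer(s["label"], s["hypothesis"]) <= max_cer]
```

Under these assumptions, a sample whose label and hypothesis disagree heavily is treated as likely mislabeled and dropped before continued pretraining.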

This method makes better use of weakly supervised datasets: filtering and selection each reduce CER, and their gains add up when the two steps are applied in sequence.
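
The selection step can be illustrated with a similar sketch. The summary does not say how acoustic similarity is measured, so this example assumes each utterance already has a fixed-size acoustic embedding and ranks candidates by cosine similarity to the centroid of a few target-domain embeddings. The `embedding` field, the centroid approach, and the parameter `k` are all assumptions made for illustration.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar(candidates: list[dict],
                   target_embeddings: list[list[float]],
                   k: int) -> list[dict]:
    """Return the k candidates closest to the target-domain centroid.

    Each candidate is a dict with a hypothetical "embedding" field.
    """
    centroid = mean_vector(target_embeddings)
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(c["embedding"], centroid),
        reverse=True,
    )
    return ranked[:k]
```

In this sketch the candidates would be drawn from the CER-filtered subset, which matches the summary's statement that fine-tuning reuses samples from the earlier training data.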