Researchers propose a novel bootstrapped method called Self-Filtering that trains a CLIP model on an evolving dataset selected through iterative self-filtering. This approach balances filtered, high-probability clean samples with diverse examples from the entire distribution to mitigate noise in large-scale vision-language datasets.
- The method iterates between training the model and selecting an improved data mixture without requiring additional data or pre-trained models.
- The evolving dataset combines filtered, highly probable clean samples with diverse samples from the entire distribution.
- Training on vision-language datasets filtered by this approach improves downstream performance compared to existing heuristics and curated reference datasets.