Data Selection Through Iterative Self-Filtering for Vision-Language Settings

Researchers propose a novel bootstrapped method called Self-Filtering that trains a CLIP model on an evolving dataset selected through iterative self-filtering. This approach balances filtered, high-probability clean samples with diverse examples from the entire distribution to mitigate noise in large-scale vision-language datasets.

The method iterates between training the model and selecting an improved data mixture without requiring additional data or pre-trained models.
The evolving dataset combines filtered, highly probable clean samples with diverse samples from the entire distribution.
Training on vision-language datasets filtered by this approach improves downstream performance compared to existing heuristics and curated reference datasets.