Self-Filtering: Iterative Data Selection for Vision-Language Models

The authors propose Self-Filtering, a bootstrapped method for handling noise in large-scale vision-language datasets without manual oversight or curated reference data. The approach trains a CLIP model on an evolving dataset that combines filtered samples with a high probability of being clean and diverse examples drawn from the entire distribution. Training and selection alternate: the model is trained on the current mixture, and an improved mixture is then selected for the next round. Because the dataset is refined within this cycle, the method reduces the need for additional external data sources, and it depends on neither pre-trained models nor heuristic-based filtering strategies. The study reports that training on these self-selected datasets improves downstream performance.
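
The alternation between training and selection can be illustrated on toy data. The Python sketch below is not the authors' implementation: it replaces CLIP with a one-parameter linear fit and image-text similarity with a normalized residual. The function names (select_mixture, self_filter, fit_slope, agreement), the 50% keep fraction, the 10% exploration fraction, and the three rounds are all illustrative assumptions, not values from the study.

```python
import random

def select_mixture(scores, rng, keep_fraction=0.5, explore_fraction=0.1):
    """Pick the top-scoring samples plus a uniform draw from the whole pool."""
    n = len(scores)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    filtered = ranked[: int(n * keep_fraction)]                 # likely-clean samples
    explore = rng.sample(range(n), int(n * explore_fraction))   # diversity from the full pool
    return sorted(set(filtered) | set(explore))

def self_filter(pool, train_fn, score_fn, rounds=3, seed=0, **mix_kwargs):
    """Alternate between training on the current mixture and re-selecting it."""
    rng = random.Random(seed)
    selected = list(range(len(pool)))   # the first round trains on the unfiltered pool
    model = None
    for _ in range(rounds):
        model = train_fn([pool[i] for i in selected])
        scores = [score_fn(model, sample) for sample in pool]
        selected = select_mixture(scores, rng, **mix_kwargs)
    return model, selected

# Toy stand-ins (assumptions): a pair (x, y) plays the role of an image-text pair.
def fit_slope(samples):
    """'Train' a model: least-squares slope through the origin."""
    return sum(x * y for x, y in samples) / sum(x * x for x, _ in samples)

def agreement(slope, sample):
    """Score a pair: higher means the pair agrees better with the model."""
    x, y = sample
    return -abs(y - slope * x) / abs(x)

def make_pool(n=100):
    """Clean pairs follow y = 2x; every fourth pair is noise with y = -x."""
    pool, is_clean = [], []
    for i in range(n):
        x = 1 + i % 10
        clean = i % 4 != 0
        pool.append((x, 2 * x if clean else -x))
        is_clean.append(clean)
    return pool, is_clean

pool, is_clean = make_pool()
model, selected = self_filter(pool, fit_slope, agreement, rounds=3)
clean_share = sum(is_clean[i] for i in selected) / len(selected)
```

In this toy, the is_clean labels are used only to evaluate the result and never influence selection, which mirrors the claim that no curated reference is required. The exploration draw keeps some samples from the full pool in every round, so the mixture does not collapse onto whatever the first model happened to prefer.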