Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image Pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources (YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock) to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that mixing multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall, our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.
Author Information
Thao Nguyen (University of Washington)
Gabriel Ilharco (Department of Computer Science, University of Washington)
Mitchell Wortsman (University of Washington, Allen Institute for Artificial Intelligence)
Sewoong Oh (University of Washington)
Ludwig Schmidt (University of Washington)
More from the Same Authors
- 2021: Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning
  Thomas Liao · Rohan Taori · Deborah Raji · Ludwig Schmidt
- 2021: Do ImageNet Classifiers Generalize to ImageNet?
  Benjamin Recht · Becca Roelofs · Ludwig Schmidt · Vaishaal Shankar
- 2021: Evaluating Machine Accuracy on ImageNet
  Vaishaal Shankar · Becca Roelofs · Horia Mania · Benjamin Recht · Ludwig Schmidt
- 2021: Measuring Robustness to Natural Distribution Shifts in Image Classification
  Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt
- 2021: Avoiding Spurious Correlations: Bridging Theory and Practice
  Thao Nguyen · Hanie Sedghi · Behnam Neyshabur
- 2021: Robust fine-tuning of zero-shot models
  Mitchell Wortsman · Gabriel Ilharco · Jong Wook Kim · Mike Li · Hanna Hajishirzi · Ali Farhadi · Hongseok Namkoong · Ludwig Schmidt
- 2021: Gradient flows on graphons: existence, convergence, continuity equations
  Sewoong Oh · Soumik Pal · Raghav Somani · Raghav Tripathi
- 2022: Few-shot Backdoor Attacks via Neural Tangent Kernels
  Jonathan Hayase · Sewoong Oh
- 2022 Panel: Panel 2C-5: TVLT: Textless Vision-Language… & Quality Not Quantity:…
  Zineng Tang · Thao Nguyen
- 2022 Poster: Patching open-vocabulary models by interpolating weights
  Gabriel Ilharco · Mitchell Wortsman · Samir Yitzhak Gadre · Shuran Song · Hannaneh Hajishirzi · Simon Kornblith · Ali Farhadi · Ludwig Schmidt
- 2022 Poster: DP-PCA: Statistically Optimal and Differentially Private PCA
  Xiyang Liu · Weihao Kong · Prateek Jain · Sewoong Oh
- 2022 Poster: Zonotope Domains for Lagrangian Neural Network Verification
  Matt Jordan · Jonathan Hayase · Alex Dimakis · Sewoong Oh
- 2022 Poster: LAION-5B: An open large-scale dataset for training next generation image-text models
  Christoph Schuhmann · Romain Beaumont · Richard Vencu · Cade Gordon · Ross Wightman · Mehdi Cherti · Theo Coombes · Aarush Katta · Clayton Mullis · Mitchell Wortsman · Patrick Schramowski · Srivatsa Kundurthy · Katherine Crowson · Ludwig Schmidt · Robert Kaczmarczyk · Jenia Jitsev
- 2022 Poster: Subgroup Robustness Grows On Trees: An Empirical Baseline Investigation
  Josh Gardner · Zoran Popovic · Ludwig Schmidt
- 2022 Poster: Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization
  Liang Zhang · Kiran Thekumparampil · Sewoong Oh · Niao He
- 2021 Oral: Retiring Adult: New Datasets for Fair Machine Learning
  Frances Ding · Moritz Hardt · John Miller · Ludwig Schmidt
- 2021 Poster: Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals
  Lang Liu · Krishna Pillutla · Sean Welleck · Sewoong Oh · Yejin Choi · Zaid Harchaoui
- 2021 Poster: Robust and differentially private mean estimation
  Xiyang Liu · Weihao Kong · Sham Kakade · Sewoong Oh
- 2021 Poster: Retiring Adult: New Datasets for Fair Machine Learning
  Frances Ding · Moritz Hardt · John Miller · Ludwig Schmidt
- 2021 Poster: Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning
  Timo Milbich · Karsten Roth · Samarth Sinha · Ludwig Schmidt · Marzyeh Ghassemi · Bjorn Ommer
- 2021 Poster: Gradient Inversion with Generative Image Prior
  Jinwoo Jeon · jaechang Kim · Kangwook Lee · Sewoong Oh · Jungseul Ok
- 2021 Poster: Statistically and Computationally Efficient Linear Meta-representation Learning
  Kiran Thekumparampil · Prateek Jain · Praneeth Netrapalli · Sewoong Oh
- 2020: 17 - Uncovering How Neural Network Representations Vary with Width and Depth
  Thao Nguyen
- 2020 Poster: Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method
  Kiran Thekumparampil · Prateek Jain · Praneeth Netrapalli · Sewoong Oh
- 2020 Spotlight: Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method
  Kiran Thekumparampil · Prateek Jain · Praneeth Netrapalli · Sewoong Oh
- 2020 Poster: Robust Meta-learning for Mixed Linear Regression with Small Batches
  Weihao Kong · Raghav Somani · Sham Kakade · Sewoong Oh
- 2020 Poster: Supermasks in Superposition
  Mitchell Wortsman · Vivek Ramanujan · Rosanne Liu · Aniruddha Kembhavi · Mohammad Rastegari · Jason Yosinski · Ali Farhadi
- 2020 Poster: Measuring Robustness to Natural Distribution Shifts in Image Classification
  Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt
- 2020 Spotlight: Measuring Robustness to Natural Distribution Shifts in Image Classification
  Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt
- 2019 Poster: Efficient Algorithms for Smooth Minimax Optimization
  Kiran Thekumparampil · Prateek Jain · Praneeth Netrapalli · Sewoong Oh
- 2019 Poster: Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels
  Yihan Jiang · Hyeji Kim · Himanshu Asnani · Sreeram Kannan · Sewoong Oh · Pramod Viswanath
- 2019 Poster: Model Similarity Mitigates Test Set Overuse
  Horia Mania · John Miller · Ludwig Schmidt · Moritz Hardt · Benjamin Recht
- 2019 Poster: Unlabeled Data Improves Adversarial Robustness
  Yair Carmon · Aditi Raghunathan · Ludwig Schmidt · John Duchi · Percy Liang
- 2019 Poster: A Meta-Analysis of Overfitting in Machine Learning
  Becca Roelofs · Vaishaal Shankar · Benjamin Recht · Sara Fridovich-Keil · Moritz Hardt · John Miller · Ludwig Schmidt
- 2019 Poster: Discovering Neural Wirings
  Mitchell Wortsman · Ali Farhadi · Mohammad Rastegari
- 2019 Poster: Minimax Optimal Estimation of Approximate Differential Privacy on Neighboring Databases
  Xiyang Liu · Sewoong Oh
- 2018 Poster: Deepcode: Feedback Codes via Deep Learning
  Hyeji Kim · Yihan Jiang · Sreeram Kannan · Sewoong Oh · Pramod Viswanath
- 2018 Poster: Robustness of conditional GANs to noisy labels
  Kiran Thekumparampil · Ashish Khetan · Zinan Lin · Sewoong Oh
- 2018 Spotlight: Robustness of conditional GANs to noisy labels
  Kiran Thekumparampil · Ashish Khetan · Zinan Lin · Sewoong Oh
- 2018 Poster: PacGAN: The power of two samples in generative adversarial networks
  Zinan Lin · Ashish Khetan · Giulia Fanti · Sewoong Oh
- 2017: New perspective from Blackwell's "comparisons of experiments" on generative adversarial networks
  Sewoong Oh
- 2017 Poster: Optimal Sample Complexity of M-wise Data for Top-K Ranking
  Minje Jang · Sunghyun Kim · Changho Suh · Sewoong Oh
- 2017 Poster: Estimating Mutual Information for Discrete-Continuous Mixtures
  Weihao Gao · Sreeram Kannan · Sewoong Oh · Pramod Viswanath
- 2017 Poster: Matrix Norm Estimation from a Few Entries
  Ashish Khetan · Sewoong Oh
- 2017 Spotlight: Estimating Mutual Information for Discrete-Continuous Mixtures
  Weihao Gao · Sreeram Kannan · Sewoong Oh · Pramod Viswanath
- 2017 Spotlight: Matrix Norm Estimation from a Few Entries
  Ashish Khetan · Sewoong Oh
- 2017 Poster: Discovering Potential Correlations via Hypercontractivity
  Hyeji Kim · Weihao Gao · Sreeram Kannan · Sewoong Oh · Pramod Viswanath
- 2016: Sewoong Oh: "The Minimax Rate for Adaptive Crowdsourcing"
  Sewoong Oh
- 2016 Poster: Breaking the Bandwidth Barrier: Geometrical Adaptive Entropy Estimation
  Weihao Gao · Sewoong Oh · Pramod Viswanath
- 2016 Poster: Computational and Statistical Tradeoffs in Learning to Rank
  Ashish Khetan · Sewoong Oh
- 2016 Poster: Achieving budget-optimality with adaptive schemes in crowdsourcing
  Ashish Khetan · Sewoong Oh
- 2015 Workshop: Non-convex Optimization for Machine Learning: Theory and Practice
  Anima Anandkumar · Niranjan Uma Naresh · Kamalika Chaudhuri · Percy Liang · Sewoong Oh
- 2015 Poster: Secure Multi-party Differential Privacy
  Peter Kairouz · Sewoong Oh · Pramod Viswanath
- 2015 Poster: Collaboratively Learning Preferences from Ordinal Data
  Sewoong Oh · Kiran Thekumparampil · Jiaming Xu
- 2014 Workshop: Analysis of Rank Data: Confluence of Social Choice, Operations Research, and Machine Learning
  Shivani Agarwal · Hossein Azari Soufiani · Guy Bresler · Sewoong Oh · David Parkes · Arun Rajkumar · Devavrat Shah
- 2014 Poster: Provable Tensor Factorization with Missing Data
  Prateek Jain · Sewoong Oh
- 2014 Poster: Extremal Mechanisms for Local Differential Privacy
  Peter Kairouz · Sewoong Oh · Pramod Viswanath
- 2014 Poster: Minimax-optimal Inference from Partial Rankings
  Bruce Hajek · Sewoong Oh · Jiaming Xu
- 2014 Poster: Learning Mixed Multinomial Logit Model from Ordinal Data
  Sewoong Oh · Devavrat Shah
- 2012 Poster: Iterative ranking from pair-wise comparisons
  Sahand N Negahban · Sewoong Oh · Devavrat Shah
- 2012 Spotlight: Iterative ranking from pair-wise comparisons
  Sahand N Negahban · Sewoong Oh · Devavrat Shah
- 2011 Poster: Iterative Learning for Reliable Crowdsourcing Systems
  David R Karger · Sewoong Oh · Devavrat Shah
- 2011 Oral: Iterative Learning for Reliable Crowdsourcing Systems
  David R Karger · Sewoong Oh · Devavrat Shah
- 2009 Poster: Matrix Completion from Noisy Entries
  Raghunandan Keshavan · Andrea Montanari · Sewoong Oh