Workshop: Workshop on robustness of zero/few-shot learning in foundation models (R0-FoMo)

Your CLIP Model Might Be Undertrained

Alaa Khaddaj · Hadi Salman · Andrew Ilyas · Guillaume Leclerc · Aleksander Madry


Contrastive Language-Image Pretraining (CLIP) models perform well on a range of vision tasks. To improve the performance of this class of models even further, several works have proposed modifications to the CLIP training procedure. In this work, we show that it is possible to achieve substantial gains using a much simpler strategy. Specifically, existing CLIP models, especially those trained on smaller datasets, tend to be undertrained. Indeed, we show that extending the training procedure according to a simple heuristic can significantly improve the performance of CLIP models.
