

Poster

Do CLIP Models Always Generalize Better than ImageNet Models?

Qizhou Wang · Yong Lin · Yongqiang Chen · Ludwig Schmidt · Bo Han · Tong Zhang

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Large vision-language models, such as CLIP, are typically considered to exhibit enhanced generalizability under distribution shifts compared with single-modal models supervised on ImageNet. However, the evaluation datasets previously used to compare CLIP and ImageNet models are variants primarily crafted for models pre-trained on ImageNet, and may not correctly reflect the extent to which CLIP models, e.g., those pre-trained on LAION, are robust to spurious correlations. To bridge this gap, we curate a new evaluation dataset named CounterAnimal, encompassing realistic spurious features commonly found in animal photos. The CounterAnimal dataset consists of a) the common group, comprising animals in common backgrounds, and b) the counter group, featuring animals in plausible yet unusual backgrounds. The performance drop from the common group to the counter group quantifies a model's reliance on spurious background features for animal prediction. Our evaluations reveal that CLIP models, pre-trained on either LAION or OpenAI datasets, exhibit notable performance drops when tested on the counter group. We also observe that ImageNet models can be more robust than CLIP models on CounterAnimal, suggesting that CLIP models do not rely less on spurious features than supervised models. To further solidify our findings, we provide theoretical insights showing that the CLIP objective cannot confer additional robustness. While our findings suggest that distribution shifts remain an open problem for CLIP, CounterAnimal also reveals that strategies such as scaling up parameters and using high-quality pre-training data help mitigate reliance on spurious features, laying a foundation for future developments.
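The common-to-counter performance drop described in the abstract can be measured with standard zero-shot CLIP evaluation. Below is a minimal sketch using the open_clip library; the loaders for the two CounterAnimal groups (common_loader, counter_loader) and the class_names list are hypothetical placeholders, since the dataset's exact format is not specified here.

import torch
import open_clip

# Load a LAION-pre-trained CLIP model (one of the model families evaluated).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def zero_shot_accuracy(loader, class_names):
    """Zero-shot accuracy: each image is assigned the class whose
    'a photo of a <class>' text embedding is most similar."""
    prompts = tokenizer([f"a photo of a {c}" for c in class_names])
    with torch.no_grad():
        text_feat = model.encode_text(prompts)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        correct = total = 0
        for images, labels in loader:  # preprocessed images, int labels
            img_feat = model.encode_image(images)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            preds = (img_feat @ text_feat.T).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Hypothetical usage: the gap between the two groups quantifies how much
# the model leans on spurious background features.
# acc_common = zero_shot_accuracy(common_loader, class_names)
# acc_counter = zero_shot_accuracy(counter_loader, class_names)
# drop = acc_common - acc_counter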
