
Workshop on Distribution Shifts: New Frontiers with Foundation Models

Probing the Equivariance of Image Embeddings

Cyrus Rashtchian · Charles Herrmann · Chun-Sung Ferng · Ayan Chakrabarti · Dilip Krishnan · Deqing Sun · Da-Cheng Juan · Andrew Tomkins

Keywords: [ robustness ] [ OOD Detection ] [ image embeddings ] [ Distribution Shift ] [ Probing ]


Probes are small networks that predict properties of underlying data from embeddings, and they provide a targeted way to illuminate the information in embeddings. While analysis with probes has become standard in NLP, there has been less exploration in vision. Our goal is to understand the invariance vs. equivariance of popular image embeddings (e.g., MAE, SimCLR, or CLIP) under certain distribution shifts. By doing so, we investigate what visual aspects from the raw images are encoded into the embeddings by these foundation models. Our probing is based on a systematic transformation prediction task that measures the visual content of embeddings along many axes, including neural style transfer, recoloring, icon/text overlays, noising, and blurring. Surprisingly, six embeddings (including SimCLR) encode enough non-semantic information to identify dozens of transformations. We also consider a generalization task, where we group similar transformations and hold out several for testing. Image-text models (CLIP, ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN, MAE). Our results show that embeddings from foundation models are equivariant and encode more non-semantic features than a supervised baseline. Hence, their OOD generalization abilities are not due to invariance to such distribution shifts.
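To make the probing idea concrete, here is a minimal sketch of a transformation-prediction probe, under assumptions not taken from the paper: synthetic 16-dimensional "embeddings" stand in for frozen foundation-model features, and each of three hypothetical transformations (e.g., blur, recolor, overlay) shifts the embedding along a fixed direction plus noise. A linear softmax probe is then trained to identify which transformation was applied; high accuracy would indicate the embedding is equivariant (transformation information is recoverable) rather than invariant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 "transformations", each shifting a 16-dim
# embedding along a fixed class direction, plus Gaussian noise.
n_per_class, dim, n_classes = 100, 16, 3
directions = rng.normal(size=(n_classes, dim))
X = np.vstack([directions[c] + 0.3 * rng.normal(size=(n_per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# Linear probe: multinomial logistic regression via plain gradient descent.
W = np.zeros((dim, n_classes))
b = np.zeros(n_classes)
for _ in range(200):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0                # softmax cross-entropy gradient
    W -= 0.1 * (X.T @ p) / len(y)
    b -= 0.1 * p.mean(axis=0)

acc = ((X @ W + b).argmax(axis=1) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

In the paper's actual setup the probe inputs would be real embeddings (e.g., from CLIP or MAE) of transformed images, and the label set would cover dozens of transformations; the training loop and readout are the same in spirit.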
