
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
Victor Weixin Liang · Yuhui Zhang · Yongchan Kwon · Serena Yeung · James Zou

Tue Nov 29 02:00 PM -- 04:00 PM (PST) @ Hall J #441

We present the modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g., images and text) are embedded at arm's length in the shared representation space of multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. At model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separated by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact on the model's downstream zero-shot classification performance and fairness, and can improve both.
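As a rough illustration of what a "gap" between two modalities can mean, the sketch below measures the distance between the centroids of L2-normalized image and text embeddings. This is an assumption made for illustration: the abstract does not state how the gap is quantified. The function names (`l2_normalize`, `centroid`, `modality_gap`) and the toy 2-D vectors are hypothetical and do not come from any CLIP codebase.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def modality_gap(image_embs, text_embs):
    """Euclidean distance between the centroids of the normalized
    embeddings of each modality (an assumed, illustrative measure)."""
    img_center = centroid([l2_normalize(v) for v in image_embs])
    txt_center = centroid([l2_normalize(v) for v in text_embs])
    diff = [a - b for a, b in zip(img_center, txt_center)]
    return math.sqrt(sum(d * d for d in diff))

# Toy embeddings: each modality occupies its own narrow cone.
image_embs = [[1.0, 0.1], [1.0, -0.1]]
text_embs = [[0.1, 1.0], [-0.1, 1.0]]

gap = modality_gap(image_embs, text_embs)
print(f"modality gap: {gap:.4f}")
```

With real models, the two lists would hold the outputs of the image and text encoders for a set of paired samples. A gap near zero would indicate that the two modalities share the same region of the space.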

Author Information

Victor Weixin Liang (Stanford University)
Yuhui Zhang (Stanford University)
Yongchan Kwon (Columbia University)
Serena Yeung (Stanford University)
James Zou (Stanford University)

More from the Same Authors