
Workshop: 4th Workshop on Self-Supervised Learning: Theory and Practice

Enhancing CLIP with a Third Modality

Efthymios Tsaprazlis · Georgios Smyrnis · Alex Dimakis · Petros Maragos


We study the problem of training a third tower for a new modality, given a pre-trained CLIP model. This additional tower makes it possible to incorporate other modalities into the model pipeline. In our setting, we use a model such as BLIP-2 to provide a dialogue centered around the image. We evaluate our model on image and text retrieval and compare it against the regular model that uses only images and text.
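
As an illustration only, the sketch below shows one common way a new tower could be aligned with frozen embeddings from a pre-trained model, together with a simple retrieval metric. The abstract does not state the training objective, so the symmetric contrastive (InfoNCE-style) loss, the temperature value of 0.07, and all function names here are assumptions made for the example, not the authors' method.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(new_emb, frozen_emb, temperature=0.07):
    """Assumed objective: pull matching (new tower, frozen tower) pairs together.

    new_emb[i] and frozen_emb[i] are embeddings of the same example.
    """
    sims = similarity_matrix(new_emb, frozen_emb)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


def recall_at_1(query_emb, gallery_emb):
    """Fraction of queries whose most similar gallery item is the matching one."""
    sims = similarity_matrix(query_emb, gallery_emb)
    hits = 0
    for i, row in enumerate(sims):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sims)
```

With toy inputs, `symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` is close to zero because each pair is already aligned, while swapping the rows of one argument gives a large loss and a `recall_at_1` of 0.0.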
