Workshop: 4th Workshop on Self-Supervised Learning: Theory and Practice

Enhancing CLIP with a Third Modality

Efthymios Tsaprazlis · Georgios Smyrnis · Alex Dimakis · Petros Maragos


We study the problem of training a third tower for a new modality on top of a pre-trained CLIP model. This additional tower lets the model pipeline incorporate modalities beyond images and text. In our setting, the third modality is a dialogue centered around the image, provided by a model such as BLIP-2. We evaluate our model on image and text retrieval, and compare it against the regular CLIP model that uses only images and text.
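
As an illustration only, one plausible way to align a third tower with a pre-trained CLIP model is a CLIP-style contrastive (InfoNCE) objective between the new tower's embeddings and the existing image and text embeddings. The abstract does not state the training objective, so everything in the sketch below is an assumption: the function names (`info_nce`, `third_tower_loss`), the temperature value, and the choice to average the loss over both the image and the text tower are hypothetical and not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets.

    The temperature of 0.07 is an illustrative value, not one reported by the authors.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every target, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


def third_tower_loss(dialogue_emb, image_emb, text_emb, temperature=0.07):
    """Hypothetical objective: symmetric InfoNCE between the third tower's
    dialogue embeddings and the image and text embeddings of the CLIP towers."""
    pairs = [
        (dialogue_emb, image_emb),
        (image_emb, dialogue_emb),
        (dialogue_emb, text_emb),
        (text_emb, dialogue_emb),
    ]
    return sum(info_nce(a, b, temperature) for a, b in pairs) / len(pairs)


# Toy batch of two items with 2-dimensional embeddings.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_dialogue = [[1.0, 0.0], [0.0, 1.0]]
shuffled_dialogue = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = third_tower_loss(aligned_dialogue, image_emb, text_emb)
shuffled_loss = third_tower_loss(shuffled_dialogue, image_emb, text_emb)
```

In this toy batch the loss is lower when each dialogue embedding matches its own image and caption than when the dialogue embeddings are shuffled, which is the behavior a retrieval-oriented third tower would rely on. In practice the embeddings would come from neural encoders and only the third tower would need gradients if the CLIP towers are kept fixed, which is again an assumption rather than a detail given in the abstract.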

