Enhancing CLIP with a Third Modality
Efthymios Tsaprazlis · Georgios Smyrnis · Alex Dimakis · Petros Maragos
Abstract
We study the problem of training a third tower for a new modality, given a pre-trained CLIP model. This additional tower can be used to incorporate other modalities into the model pipeline. In our setting, we consider a model such as BLIP-2, which provides a dialogue centered on the image. We evaluate our model on image and text retrieval, and compare it against the regular model that uses only image and text.
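The abstract does not state the training objective for the third tower. As an illustration only, the sketch below assumes a CLIP-style symmetric contrastive loss that pulls each third-tower embedding toward the embedding of its paired example from a frozen pre-trained tower. The function names, the temperature value, and the choice of loss are assumptions made for this example and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(a_batch, b_batch, temperature):
    # Pairwise cosine similarities between two batches, divided by the temperature.
    a = [l2_normalize(v) for v in a_batch]
    b = [l2_normalize(v) for v in b_batch]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(new_emb, frozen_emb, temperature=0.07):
    # new_emb: hypothetical outputs of the third tower (one vector per example).
    # frozen_emb: embeddings of the paired examples from a frozen pre-trained tower.
    # The temperature is an illustrative value, not one reported by the authors.
    logits = cosine_logits(new_emb, frozen_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy batch: two examples with two-dimensional embeddings.
frozen = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]     # each row points the same way as its pair
mismatched = [[0.0, 1.0], [1.0, 0.0]]  # each row points at the other example's pair

loss_aligned = symmetric_contrastive_loss(aligned, frozen)
loss_mismatched = symmetric_contrastive_loss(mismatched, frozen)
```

Under this assumed objective, the aligned batch yields a loss close to zero and the mismatched batch a much larger one, which is the signal a third tower would be trained to minimize while the pre-trained towers stay fixed.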