Workshop: 4th Workshop on Self-Supervised Learning: Theory and Practice

Multimodal Distillation of CLIP Models

Georgios Smyrnis · Sriram Ravula · Sujay Sanghavi · Alex Dimakis


CLIP-style models are powerful tools for zero-shot image-text tasks, but their very large parameter counts make them expensive to deploy in hardware-constrained settings. We introduce a novel way to distill these large CLIP-based models into significantly smaller ones. Our method is called multimodal distillation because we jointly train two student networks (one operating on images, one on text) from two teacher networks. Our loss is designed to preserve the structure of the dataset embeddings produced by the image and text teacher networks. We are thus able to extract information from the interaction of the teacher embeddings, improving performance on downstream classification tasks.
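The abstract does not specify the exact form of the loss, but the idea of preserving the cross-modal structure of the teacher embeddings can be illustrated with a minimal sketch. Here we assume (hypothetically) a similarity-matrix-matching objective: the student pair is trained so that its image-text cosine-similarity matrix over a batch matches the teachers'. All function and variable names below are illustrative, not the authors' implementation.

```python
import numpy as np

def cosine_sim_matrix(img_emb, txt_emb):
    # Row-normalize each embedding, then compute the pairwise
    # image-text cosine-similarity matrix (batch x batch).
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img_emb @ txt_emb.T

def multimodal_distill_loss(teacher_img, teacher_txt, student_img, student_txt):
    # Hypothetical distillation objective: penalize the squared difference
    # between the teachers' and the students' image-text similarity
    # structure, so the cross-modal interactions are preserved.
    teacher_sim = cosine_sim_matrix(teacher_img, teacher_txt)
    student_sim = cosine_sim_matrix(student_img, student_txt)
    return float(np.mean((teacher_sim - student_sim) ** 2))
```

Because only the batch-by-batch similarity matrices are compared, the student embeddings can have a much smaller dimension than the teacher's, which is consistent with distilling into significantly smaller models.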
