
Workshop: UniReps: Unifying Representations in Neural Models

Efficient Multimodal Alignment: To Freeze or Not to Freeze?

Till Aczel · Roger Wattenhofer

presentation: UniReps: Unifying Representations in Neural Models
Fri 15 Dec 6:15 a.m. PST — 3:15 p.m. PST


Language-image pretraining creates a joint representation space between the two modalities, in which images and texts with similar semantic information lie close to each other. Language-image models are often trained from scratch, without taking advantage of unimodal pretrained models. By aligning the representation spaces of two modality-specific encoders, our model achieves 74.7% accuracy on the ImageNet1K validation set at two orders of magnitude lower training cost. In this work, we highlight the importance of unfreezing the CLS tokens of unimodal transformer encoders to create a joint embedding space. Freezing the image and text CLS tokens reduces the mean accuracy across the 38 evaluation benchmarks from 37.5% to 19.4%.
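
To make "aligning the representation spaces" concrete, the following is a minimal sketch of a symmetric contrastive objective of the kind commonly used in language-image pretraining. The abstract does not state which loss the authors use, so the loss form, the function names, and the `temperature` value of 0.07 are assumptions chosen for illustration only. The inputs stand in for the embeddings that the image and text encoders would produce for a batch of matching pairs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical alignment loss: image i should match text i, and vice versa.

    `temperature` is an illustrative assumption, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch of two pairs: matching pairs share a direction in the first case
# and are swapped in the second.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
misaligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image embedding points in the same direction as its paired text embedding, and large when the pairs are swapped. In a training setup along the lines the abstract describes, the gradient of such a loss would update only the parameters left unfrozen, for example the CLS tokens of the two pretrained encoders.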
