Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
Sharut Gupta · Shobhita Sundaram · Chenyu Wang · Stefanie Jegelka · Phillip Isola
Abstract
Traditional multimodal frameworks emphasize learning unified representations for tasks such as visual question answering, typically requiring paired, aligned data. However, an overlooked yet powerful question remains: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in an individual modality? To explore this, we propose UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities (images, text, audio, or video) while sharing model weights across them. Our approach exploits shared structure in unaligned multimodal signals, eliminating the need for paired data. We show that unpaired text improves image classification, and that other auxiliary modalities likewise enhance both image and audio tasks.
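The abstract describes the core idea only at a high level, so the following is a minimal, hedged sketch of what an alternating, unpaired training loop with shared weights could look like. All names, dimensions, losses, and the choice of modality-specific adapters with a shared transformer trunk are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: a shared trunk is trained on alternating batches of
# unpaired image and text data, each with its own input adapter and task head.
import torch
import torch.nn as nn

D, N_CLASSES, VOCAB = 256, 10, 1000

# Shared weights processed by every modality (assumed architecture).
shared_trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)

# Modality-specific adapters and heads (hypothetical design choice).
image_embed = nn.Sequential(nn.Flatten(2), nn.Linear(16 * 16 * 3, D))  # patchified RGB -> tokens
text_embed = nn.Embedding(VOCAB, D)
image_head = nn.Linear(D, N_CLASSES)   # image classification
text_head = nn.Linear(D, VOCAB)        # e.g. token prediction on unpaired text

params = (list(shared_trunk.parameters()) + list(image_embed.parameters()) +
          list(text_embed.parameters()) + list(image_head.parameters()) +
          list(text_head.parameters()))
opt = torch.optim.AdamW(params, lr=3e-4)
ce = nn.CrossEntropyLoss()

def image_step(images, labels):
    # images: (B, num_patches, 3, 16, 16), pre-patchified for simplicity
    tokens = image_embed(images)                  # (B, num_patches, D)
    feats = shared_trunk(tokens).mean(dim=1)      # pooled representation
    return ce(image_head(feats), labels)

def text_step(token_ids, targets):
    feats = shared_trunk(text_embed(token_ids))   # (B, seq_len, D)
    return ce(text_head(feats).flatten(0, 1), targets.flatten())

# Unpaired data: the image and text corpora are never aligned with each other.
image_batches = [(torch.randn(8, 196, 3, 16, 16), torch.randint(0, N_CLASSES, (8,)))
                 for _ in range(4)]
text_batches = [(torch.randint(0, VOCAB, (8, 32)), torch.randint(0, VOCAB, (8, 32)))
                for _ in range(4)]

for (imgs, lbls), (toks, tgts) in zip(image_batches, text_batches):
    # Alternate one image step and one text step through the same shared trunk.
    for step_fn, batch in ((image_step, (imgs, lbls)), (text_step, (toks, tgts))):
        opt.zero_grad()
        loss = step_fn(*batch)
        loss.backward()
        opt.step()
```

Under this reading, the auxiliary text batches shape the shared trunk's representations even though no image is ever paired with a caption, which is what would let unpaired text improve image classification.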