Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training
Hongwei Xue · Yupan Huang · Bei Liu · Houwen Peng · Jianlong Fu · Houqiang Li · Jiebo Luo

Tue Dec 07 08:30 AM -- 10:00 AM (PST)

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks through fine-tuning. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relationships among visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs are limited in visual relation learning because their local receptive fields are weak at modeling long-range dependencies. As a result, the two objectives of learning visual relations and inter-modal alignment are both encapsulated in the same Transformer network. Such a design might restrict inter-modal alignment learning in the Transformer by ignoring the specialized characteristics of each objective. To tackle this, we propose a fully Transformer-based visual embedding for VLP to better learn visual relations and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between the vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in the Transformer to further promote inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformers for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Visual Question Answering (VQA), Visual Entailment, and Visual Reasoning. Our approach not only outperforms state-of-the-art VLP models, but also shows benefits on the IMF metric.

Author Information

Hongwei Xue (University of Science and Technology of China)
Yupan Huang (Sun Yat-sen University)
Bei Liu (Microsoft Research Asia)
Houwen Peng (Microsoft Research)
Jianlong Fu (Microsoft Research)
Houqiang Li (University of Science and Technology of China)
Jiebo Luo (University of Rochester)
