GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang · Pengchuan Zhang · Xiaowei Hu · Yen-Chun Chen · Liunian Li · Xiyang Dai · Lijuan Wang · Lu Yuan · Jenq-Neng Hwang · Jianfeng Gao

Tue Nov 29 02:00 PM -- 04:00 PM (PST) @ Hall J #626

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel contrastive task at the region-word level, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also yields mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks.
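
To make the region-word idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each image region and each prompt word is already encoded as a feature vector, scores every region against every word with a dot product, and applies a softmax cross-entropy so that each region prefers its matched word. The function names, the toy features, and the one-word-per-region matching are all hypothetical simplifications; the abstract does not specify the actual loss or how negatives are drawn.

```python
import math


def alignment_scores(region_feats, word_feats):
    """Dot-product score between every region and every word (assumed scoring)."""
    return [
        [sum(r * w for r, w in zip(region, word)) for word in word_feats]
        for region in region_feats
    ]


def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def region_word_contrastive_loss(scores, matched_word):
    """Mean cross-entropy over regions; matched_word[i] is region i's word index."""
    losses = [softmax_cross_entropy(row, t) for row, t in zip(scores, matched_word)]
    return sum(losses) / len(losses)


# Toy example (hypothetical features): 2 regions, 3 prompt words.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

scores = alignment_scores(regions, words)
loss_good = region_word_contrastive_loss(scores, [0, 1])  # correct matching
loss_bad = region_word_contrastive_loss(scores, [1, 0])   # swapped matching
```

In this toy setup the correct matching gives a lower loss than the swapped one, which is the behavior a contrastive objective of this kind is meant to encourage.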

Author Information

Haotian Zhang (Apple AI/ML)
Pengchuan Zhang (California Institute of Technology)
Xiaowei Hu (Microsoft)
Yen-Chun Chen (Microsoft)
Liunian Li (University of California, Los Angeles)
Xiyang Dai (Microsoft)
Lijuan Wang
Lu Yuan (Microsoft)
Jenq-Neng Hwang (University of Washington, Seattle)
Jianfeng Gao (Microsoft Research, Redmond, WA)
