The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples. Existing few-shot video-language learners focus exclusively on the encoder, leaving them without a video-to-text decoder for generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and cannot generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without pretraining or finetuning on any video dataset. We use image-language models to translate the video content into frame captions and object, attribute, and event phrases, and compose them into a temporal-aware template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to incorporate any form of textual input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. Notably, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets. Code and processed data are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL.
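To make the pipeline in the abstract concrete: image-language models first produce per-frame captions and object, attribute, and event phrases; these are composed into a temporal-aware template, prefixed with a few in-context examples, and sent to a language model. Below is a minimal Python sketch of that prompt composition. The helper names, the frame-dict fields, and the prompt wording are illustrative assumptions for exposition, not the authors' implementation; the actual code is in the repository linked above.

```python
# Illustrative sketch only: all names and the template wording here are
# assumptions, not the authors' code (see https://github.com/MikeWangWZHL/VidIL).
from typing import Dict, List, Tuple


def order_word(i: int, n: int) -> str:
    """Temporal marker that places frame i of n in chronological order."""
    if i == 0:
        return "First"
    return "Finally" if i == n - 1 else "Then"


def compose_video_representation(frames: List[Dict[str, str]],
                                 asr_transcript: str = "") -> str:
    """Fuse per-frame captions and visual tokens into one temporal-aware block."""
    n = len(frames)
    lines = [
        f"{order_word(i, n)}, {f['caption']} "
        f"Objects: {f['objects']}. Events: {f['events']}."
        for i, f in enumerate(frames)
    ]
    if asr_transcript:  # prompting lets any extra text, e.g. ASR, be spliced in
        lines.append(f"Subtitles: {asr_transcript}")
    return "\n".join(lines)


def build_few_shot_prompt(instruction: str,
                          examples: List[Tuple[List[Dict[str, str]], str]],
                          query_frames: List[Dict[str, str]]) -> str:
    """Prepend in-context examples, then leave the query's target for the LM."""
    blocks = [instruction]
    for example_frames, target in examples:
        blocks.append(compose_video_representation(example_frames))
        blocks.append(f"Caption: {target}")
    blocks.append(compose_video_representation(query_frames))
    blocks.append("Caption:")  # the language model completes this field
    return "\n\n".join(blocks)


# Tiny usage example with made-up frame-level outputs:
frames = [
    {"caption": "a chef chops onions on a board.",
     "objects": "knife, onion, cutting board", "events": "chopping"},
    {"caption": "the chef stirs vegetables in a pan.",
     "objects": "pan, stove, spatula", "events": "stirring, cooking"},
]
print(build_few_shot_prompt("Write a one-sentence caption for the video.",
                            examples=[], query_frames=frames))
```

Because the video is reduced to plain text before the language model sees it, swapping the target field (here a hypothetical "Caption:") is enough to repurpose the same prompt for question answering or future event prediction.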
Author Information
Zhenhailong Wang (University of Illinois at Urbana-Champaign)
Zhenhailong Wang is a second-year Master's student in Computer Science at UIUC. He is a graduate research assistant working with Prof. Heng Ji. His current research interests lie in multimodal and multitask natural language processing.
Manling Li (University of Illinois at Urbana-Champaign)
Ruochen Xu (Microsoft)
Luowei Zhou (Microsoft)
Jie Lei (Meta)
Xudong Lin (Columbia University)
Shuohang Wang (Microsoft)
Ziyi Yang (Stanford University)
Chenguang Zhu (Stanford University)
Derek Hoiem (University of Illinois)
Shih-Fu Chang (Columbia University)
Mohit Bansal (UNC Chapel Hill)
Heng Ji (University of Illinois)
More from the Same Authors
- 2021: VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation »
  Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu
- 2021: Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models »
  Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li
- 2022 Poster: REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering »
  Yuanze Lin · Yujia Xie · Dongdong Chen · Yichong Xu · Chenguang Zhu · Lu Yuan
- 2022 Poster: OmniVL: One Foundation Model for Image-Language and Video-Language Tasks »
  Junke Wang · Dongdong Chen · Zuxuan Wu · Chong Luo · Luowei Zhou · Yucheng Zhao · Yujia Xie · Ce Liu · Yu-Gang Jiang · Lu Yuan
- 2022: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning »
  Yi-Lin Sung · Jaemin Cho · Mohit Bansal
- 2022: Adaptive Pre-training of Language Models for Better Logical Reasoning »
  Soumya Sanyal · Yichong Xu · Shuohang Wang · Ziyi Yang · Reid Pryzant · Wenhao Yu · Chenguang Zhu · Xiang Ren
- 2022: Empowering Language Models with Knowledge Graph Reasoning for Question Answering »
  Ziniu Hu · Yichong Xu · Wenhao Yu · Shuohang Wang · Ziyi Yang · Chenguang Zhu · Kai-Wei Chang · Yizhou Sun
- 2022 Spotlight: OmniVL: One Foundation Model for Image-Language and Video-Language Tasks »
  Junke Wang · Dongdong Chen · Zuxuan Wu · Chong Luo · Luowei Zhou · Yucheng Zhao · Yujia Xie · Ce Liu · Yu-Gang Jiang · Lu Yuan
- 2022 Poster: TVLT: Textless Vision-Language Transformer »
  Zineng Tang · Jaemin Cho · Yixin Nie · Mohit Bansal
- 2022 Poster: Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning »
  Yujia Xie · Luowei Zhou · Xiyang Dai · Lu Yuan · Nguyen Bach · Ce Liu · Michael Zeng
- 2022 Poster: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning »
  Yi-Lin Sung · Jaemin Cho · Mohit Bansal
- 2022 Poster: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning »
  Haokun Liu · Derek Tam · Mohammed Muqeeth · Jay Mohta · Tenghao Huang · Mohit Bansal · Colin Raffel
- 2022 Poster: VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives »
  Zhuofan Ying · Peter Hase · Mohit Bansal
- 2022 Poster: WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models »
  Yonatan Bitton · Nitzan Bitton Guetta · Ron Yosef · Yuval Elovici · Mohit Bansal · Gabriel Stanovsky · Roy Schwartz
- 2021 Poster: The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations »
  Peter Hase · Harry Xie · Mohit Bansal
- 2021 Poster: Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method »
  Yifan Chen · Qi Zeng · Heng Ji · Yun Yang
- 2021 Poster: The Elastic Lottery Ticket Hypothesis »
  Xiaohan Chen · Yu Cheng · Shuohang Wang · Zhe Gan · Jingjing Liu · Zhangyang Wang
- 2021 Poster: VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer »
  Zineng Tang · Jaemin Cho · Hao Tan · Mohit Bansal
- 2021 Poster: VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text »
  Hassan Akbari · Liangzhe Yuan · Rui Qian · Wei-Hong Chuang · Shih-Fu Chang · Yin Cui · Boqing Gong
- 2021 Poster: Detecting Moments and Highlights in Videos via Natural Language Queries »
  Jie Lei · Tamara L Berg · Mohit Bansal
- 2020 Workshop: HAMLETS: Human And Model in the Loop Evaluation and Training Strategies »
  Divyansh Kaushik · Bhargavi Paranjape · Forough Arabshahi · Yanai Elazar · Yixin Nie · Max Bartolo · Polina Kirichenko · Pontus Lars Erik Saito Stenetorp · Mohit Bansal · Zachary Lipton · Douwe Kiela
- 2020: Panel #2 »
  Oren Etzioni · Heng Ji · Subbarao Kambhampati · Victoria Lin · Jiajun Wu
- 2020: Q&A #2 »
  Heng Ji · Jure Leskovec · Jiajun Wu
- 2020: Invited Talk #5 »
  Heng Ji
- 2020 Demonstration: A Knowledge Graph Reasoning Prototype »
  Lihui Liu · Boxin Du · Heng Ji · Hanghang Tong
- 2018 Poster: Low-shot Learning via Covariance-Preserving Adversarial Augmentation Networks »
  Hang Gao · Zheng Shou · Alireza Zareian · Hanwang Zhang · Shih-Fu Chang
- 2017 Demonstration: Interactive-Length Multi-Task Video Captioning with Cooperative Feedback »
  Han Guo · Ramakanth Pasunuru · Mohit Bansal
- 2016 Poster: Swapout: Learning an ensemble of deep architectures »
  Saurabh Singh · Derek Hoiem · David Forsyth