Joint visual and language modeling on large-scale datasets has recently shown promising progress on multi-modal tasks compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of video-language models against various real-world perturbations. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations. The study reveals some interesting initial findings about the studied models: 1) models are more robust when text is perturbed than when video is perturbed, 2) pre-trained models are more robust than those trained from scratch, 3) models attend more to scenes and objects than to motion and actions. We hope this study will serve as a benchmark and guide future research in robust video-language learning. The benchmark introduced in this study, along with the code and datasets, is available at https://bit.ly/3CNOly4.
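To illustrate the style of evaluation the abstract describes, the minimal Python sketch below applies one example visual perturbation (additive Gaussian noise) and compares text-to-video Recall@K between clean and perturbed inputs. All function names, the choice of perturbation, and the toy data are illustrative assumptions, not the released benchmark's API; the actual code is available at the link above.

import numpy as np

def gaussian_noise(frames: np.ndarray, severity: float = 0.1) -> np.ndarray:
    # Example visual perturbation: additive Gaussian noise on frames in [0, 1].
    # In a full pipeline the perturbed frames would be re-encoded by the
    # video-language model before retrieval.
    noisy = frames + np.random.normal(0.0, severity, frames.shape)
    return np.clip(noisy, 0.0, 1.0)

def recall_at_k(text_emb: np.ndarray, video_emb: np.ndarray, k: int = 5) -> float:
    # Fraction of text queries whose paired video (same index) ranks in the
    # top-k videos by cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = t @ v.T                     # (num_texts, num_videos)
    ranks = np.argsort(-sims, axis=1)  # videos sorted by similarity per query
    hits = [i in ranks[i, :k] for i in range(len(t))]
    return float(np.mean(hits))

# Toy usage: random embeddings stand in for model outputs, and perturbation
# is simulated directly in embedding space for brevity.
rng = np.random.default_rng(0)
clean_video = rng.normal(size=(100, 256))
text = clean_video + 0.1 * rng.normal(size=(100, 256))       # paired by index
perturbed_video = clean_video + 0.5 * rng.normal(size=(100, 256))
r_clean = recall_at_k(text, clean_video)
r_pert = recall_at_k(text, perturbed_video)
print(f"Recall@5 clean={r_clean:.2f} perturbed={r_pert:.2f} "
      f"relative drop={(r_clean - r_pert) / r_clean:.2%}")

Robustness in such studies is commonly summarized as this relative drop in retrieval performance, averaged over perturbation types and severity levels.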
Author Information
Madeline Chantry (University of Central Florida)
Passionate researcher specializing in deep learning and computer vision. Studied psychology and business at the University of Connecticut, followed by a master's in data analytics from the University of Central Florida (UCF). Worked for several years on big data and machine learning projects in cyber-security. Currently focused on research as a Graduate Research Assistant in the Center for Research in Computer Vision at UCF, with a research focus on visual-language self-supervised deep learning models. Takes pride in learning quickly: self-taught in programming and without prior computer science coursework, she passed all graduate-level course requirements to complete her doctoral qualifiers in computer science.
Shruti Vyas (University of Central Florida)
Hamid Palangi (Microsoft Research)
Yogesh Rawat (University of Central Florida)
Vibhav Vineet (Microsoft Research)
More from the Same Authors
- 2022 : Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors »
  Thomas Hartvigsen · Swami Sankaranarayanan · Hamid Palangi · Yoon Kim · Marzyeh Ghassemi
- 2022 Poster: Are all Frames Equal? Active Sparse Labeling for Video Action Detection »
  Aayush Rana · Yogesh Rawat
- 2022 Poster: 3DB: A Framework for Debugging Computer Vision Models »
  Guillaume Leclerc · Hadi Salman · Andrew Ilyas · Sai Vemprala · Logan Engstrom · Vibhav Vineet · Kai Xiao · Pengchuan Zhang · Shibani Santurkar · Greg Yang · Ashish Kapoor · Aleksander Madry
- 2022 Poster: Don't Pour Cereal into Coffee: Differentiable Temporal Logic for Temporal Action Segmentation »
  Ziwei Xu · Yogesh Rawat · Yongkang Wong · Mohan Kankanhalli · Mubarak Shah
- 2021 Poster: Reformulating Zero-shot Action Recognition for Multi-label Actions »
  Alec Kerrigan · Kevin Duarte · Yogesh Rawat · Mubarak Shah
- 2018 Poster: VideoCapsuleNet: A Simplified Network for Action Detection »
  Kevin Duarte · Yogesh Rawat · Mubarak Shah