Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations than within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC exploit both the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
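The core idea in the abstract, using cluster assignments from one modality as training labels for the other, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes fixed, synthetic feature vectors in place of learned deep features, a plain k-means with farthest-first initialization, and a linear softmax classifier standing in for the video network. All function names and parameters (`make_paired_features`, `kmeans`, `train_softmax`, the feature sizes) are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_paired_features(n=300, k=3, seed=0):
    """Synthetic stand-in for paired clips: both modalities share a hidden concept."""
    rng = np.random.default_rng(seed)
    concept = rng.integers(0, k, size=n)
    audio_means = rng.normal(size=(k, 8)) * 5
    video_means = rng.normal(size=(k, 16)) * 5
    audio = audio_means[concept] + rng.normal(size=(n, 8))
    video = video_means[concept] + rng.normal(size=(n, 16))
    return audio, video


def kmeans(x, k, iters=20):
    """Lloyd's k-means with farthest-first initialization; returns a label per row."""
    centers = [x[0]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def train_softmax(x, y, k, lr=0.1, epochs=300):
    """Fit a linear softmax classifier on standardized features by gradient descent."""
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    w = np.zeros((x.shape[1], k))
    onehot = np.eye(k)[y]
    for _ in range(epochs):
        logits = x @ w
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        w -= lr * x.T @ (probs - onehot) / len(x)
    return w


audio_feats, video_feats = make_paired_features()

# Step 1: cluster the audio features without any labels.
audio_clusters = kmeans(audio_feats, k=3)

# Step 2: use the audio cluster indices as pseudo-labels for the video model.
w_video = train_softmax(video_feats, audio_clusters, k=3)

# Check how often the video classifier reproduces the audio cluster assignment.
video_std = (video_feats - video_feats.mean(axis=0)) / video_feats.std(axis=0)
video_pred = (video_std @ w_video).argmax(axis=1)
agreement = (video_pred == audio_clusters).mean()
print(f"video predictions match audio clusters on {agreement:.0%} of clips")
```

In this sketch the supervision flows in one direction only (audio clusters to video classifier); swapping the roles of the two feature matrices gives the opposite direction.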
Author Information
Humam Alwassel (KAUST)
Dhruv Mahajan (Facebook)
Bruno Korbar (Facebook)
Lorenzo Torresani (Facebook AI)
Lorenzo Torresani is an Associate Professor with tenure in the Computer Science Department at Dartmouth College and a Research Scientist at Facebook AI. He received a Laurea Degree in Computer Science with summa cum laude honors from the University of Milan (Italy) in 1996, and an M.S. and a Ph.D. in Computer Science from Stanford University in 2001 and 2005, respectively. In the past, he has worked at several industrial research labs including Microsoft Research Cambridge, Like.com and Digital Persona. His research interests are in computer vision and deep learning. He is the recipient of several awards, including a CVPR best student paper prize, a National Science Foundation CAREER Award, a Google Faculty Research Award, three Facebook Faculty Awards, and a Fulbright U.S. Scholar Award.
Bernard Ghanem (KAUST)
Du Tran (Facebook AI)
Related Events (a corresponding poster, oral, or spotlight)
- 2020 Spotlight: Self-Supervised Learning by Cross-Modal Audio-Video Clustering
  Tue. Dec 8th 03:50 -- 04:00 PM Room Orals & Spotlights: Clustering/Ranking
More from the Same Authors
- 2021 Spotlight: ASSANet: An Anisotropic Separable Set Abstraction for Efficient Point Cloud Representation Learning
  Guocheng Qian · Hasan Hammoud · Guohao Li · Ali Thabet · Bernard Ghanem
- 2022: Certified Robustness in Federated Learning
  Motasem Alfarra · Juan Perez · Egor Shulgin · Peter Richtarik · Bernard Ghanem
- 2023 Poster: Dynamically Masked Discriminator for GANs
  Wentian Zhang · Haozhe Liu · Bing Li · Jinheng Xie · Yawen Huang · Yuexiang Li · Yefeng Zheng · Bernard Ghanem
- 2023 Poster: CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society
  Guohao Li · Hasan Hammoud · Hani Itani · Dmitrii Khizbullin · Bernard Ghanem
- 2022 Spotlight: Lightning Talks 6A-4
  Xiu-Shen Wei · Konstantina Dritsa · Guillaume Huguet · ABHRA CHAUDHURI · Zhenbin Wang · Kevin Qinghong Lin · Yutong Chen · Jianan Zhou · Yongsen Mao · Junwei Liang · Jinpeng Wang · Mao Ye · Yiming Zhang · Aikaterini Thoma · H.-Y. Xu · Daniel Sumner Magruder · Enwei Zhang · Jianing Zhu · Ronglai Zuo · Massimiliano Mancini · Hanxiao Jiang · Jun Zhang · Fangyun Wei · Faen Zhang · Ioannis Pavlopoulos · Zeynep Akata · Xiatian Zhu · Jingfeng ZHANG · Alexander Tong · Mattia Soldan · Chunhua Shen · Yuxin Peng · Liuhan Peng · Michael Wray · Tongliang Liu · Anjan Dutta · Yu Wu · Oluwadamilola Fasina · Panos Louridas · Angel Chang · Manik Kuchroo · Manolis Savva · Shujie LIU · Wei Zhou · Rui Yan · Gang Niu · Liang Tian · Bo Han · Eric Z. XU · Guy Wolf · Yingying Zhu · Brian Mak · Difei Gao · Masashi Sugiyama · Smita Krishnaswamy · Rong-Cheng Tu · Wenzhe Zhao · Weijie Kong · Chengfei Cai · WANG HongFa · Dima Damen · Bernard Ghanem · Wei Liu · Mike Zheng Shou
- 2022 Spotlight: Egocentric Video-Language Pretraining
  Kevin Qinghong Lin · Jinpeng Wang · Mattia Soldan · Michael Wray · Rui Yan · Eric Z. XU · Difei Gao · Rong-Cheng Tu · Wenzhe Zhao · Weijie Kong · Chengfei Cai · WANG HongFa · Dima Damen · Bernard Ghanem · Wei Liu · Mike Zheng Shou
- 2022 Poster: PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies
  Guocheng Qian · Yuchen Li · Houwen Peng · Jinjie Mai · Hasan Hammoud · Mohamed Elhoseiny · Bernard Ghanem
- 2022 Poster: Egocentric Video-Language Pretraining
  Kevin Qinghong Lin · Jinpeng Wang · Mattia Soldan · Michael Wray · Rui Yan · Eric Z. XU · Difei Gao · Rong-Cheng Tu · Wenzhe Zhao · Weijie Kong · Chengfei Cai · WANG HongFa · Dima Damen · Bernard Ghanem · Wei Liu · Mike Zheng Shou
- 2022 Poster: Scalable Interpretability via Polynomials
  Abhimanyu Dubey · Filip Radenovic · Dhruv Mahajan
- 2022 Poster: Neural Basis Models for Interpretability
  Filip Radenovic · Abhimanyu Dubey · Dhruv Mahajan
- 2021 Poster: ASSANet: An Anisotropic Separable Set Abstraction for Efficient Point Cloud Representation Learning
  Guocheng Qian · Hasan Hammoud · Guohao Li · Ali Thabet · Bernard Ghanem
- 2021 Poster: Low-Fidelity Video Encoder Optimization for Temporal Action Localization
  Mengmeng Xu · Juan Manuel Perez Rua · Xiatian Zhu · Bernard Ghanem · Brais Martinez
- 2020 Poster: COBE: Contextualized Object Embeddings from Narrated Instructional Video
  Gedas Bertasius · Lorenzo Torresani
- 2019 Poster: STAR-Caps: Capsule Networks with Straight-Through Attentive Routing
  Karim Ahmed · Lorenzo Torresani
- 2019 Poster: Learning Temporal Pose Estimation from Sparsely-Labeled Videos
  Gedas Bertasius · Christoph Feichtenhofer · Du Tran · Jianbo Shi · Lorenzo Torresani
- 2018 Poster: Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
  Bruno Korbar · Du Tran · Lorenzo Torresani
- 2017 Poster: Learning to Inpaint for Image Compression
  Mohammad Haris Baig · Vladlen Koltun · Lorenzo Torresani
- 2016: ViCom: Benchmark and Methods for Video Comprehension
  Du Tran · Maksim Bolonkin · Manohar Paluri · Lorenzo Torresani
- 2016: Introduction
  Lorenzo Torresani
- 2016 Workshop: Large Scale Computer Vision Systems
  Manohar Paluri · Lorenzo Torresani · Gal Chechik · Dario Garcia · Du Tran
- 2012 Poster: Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
  Du Tran · Junsong Yuan