There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
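The abstract's core ingredient, a contrastive loss over synchronized versus out-of-sync audio-video pairs, can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the margin value, and the use of a plain Euclidean margin loss are assumptions chosen to show the general idea of pulling aligned pairs together and pushing misaligned negatives apart.

```python
import numpy as np

def contrastive_sync_loss(audio_emb, video_emb, in_sync, margin=0.99):
    """Margin-based contrastive loss over batches of audio/video embeddings.

    in_sync[i] == 1 if audio clip i and video clip i are temporally
    aligned, 0 for a negative (out-of-sync or different-video) pair.
    All names and the margin default are illustrative, not the paper's.
    """
    # Euclidean distance between paired audio and video embeddings
    dist = np.linalg.norm(audio_emb - video_emb, axis=1)
    # Synchronized pairs: minimize the squared distance
    pos = in_sync * dist ** 2
    # Negative pairs: push the distance beyond the margin
    neg = (1 - in_sync) * np.maximum(margin - dist, 0.0) ** 2
    return np.mean(pos + neg)
```

Under this formulation, "hard" negatives (e.g., audio from the same video but shifted in time) lie close to the positive pair in embedding space, which is why the abstract highlights the choice of negatives and a curriculum over their difficulty.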
Author Information
Bruno Korbar (Dartmouth College)
Du Tran (Facebook)
Lorenzo Torresani (Dartmouth/Facebook)
Lorenzo Torresani is an Associate Professor with tenure in the Computer Science Department at Dartmouth College and a Research Scientist at Facebook AI. He received a Laurea Degree in Computer Science with summa cum laude honors from the University of Milan (Italy) in 1996, and an M.S. and a Ph.D. in Computer Science from Stanford University in 2001 and 2005, respectively. In the past, he has worked at several industrial research labs including Microsoft Research Cambridge, Like.com and Digital Persona. His research interests are in computer vision and deep learning. He is the recipient of several awards, including a CVPR best student paper prize, a National Science Foundation CAREER Award, a Google Faculty Research Award, three Facebook Faculty Awards, and a Fulbright U.S. Scholar Award.
More from the Same Authors
- 2020 Poster: Self-Supervised Learning by Cross-Modal Audio-Video Clustering »
  Humam Alwassel · Dhruv Mahajan · Bruno Korbar · Lorenzo Torresani · Bernard Ghanem · Du Tran
- 2020 Poster: COBE: Contextualized Object Embeddings from Narrated Instructional Video »
  Gedas Bertasius · Lorenzo Torresani
- 2020 Spotlight: Self-Supervised Learning by Cross-Modal Audio-Video Clustering »
  Humam Alwassel · Dhruv Mahajan · Bruno Korbar · Lorenzo Torresani · Bernard Ghanem · Du Tran
- 2019 Poster: STAR-Caps: Capsule Networks with Straight-Through Attentive Routing »
  Karim Ahmed · Lorenzo Torresani
- 2019 Poster: Learning Temporal Pose Estimation from Sparsely-Labeled Videos »
  Gedas Bertasius · Christoph Feichtenhofer · Du Tran · Jianbo Shi · Lorenzo Torresani
- 2017 Poster: Learning to Inpaint for Image Compression »
  Mohammad Haris Baig · Vladlen Koltun · Lorenzo Torresani
- 2016: ViCom: Benchmark and Methods for Video Comprehension »
  Du Tran · Maksim Bolonkin · Manohar Paluri · Lorenzo Torresani
- 2016: Introduction »
  Lorenzo Torresani
- 2016 Workshop: Large Scale Computer Vision Systems »
  Manohar Paluri · Lorenzo Torresani · Gal Chechik · Dario Garcia · Du Tran
- 2012 Poster: Max-Margin Structured Output Regression for Spatio-Temporal Action Localization »
  Du Tran · Junsong Yuan