Spatio-temporal action detection in videos is typically addressed in a fully supervised setup that requires manual annotation of training videos at every frame. Since such annotation is extremely tedious and prohibits scalability, there is a clear need to minimize the amount of manual supervision. In this work we propose a unifying framework that can handle and combine varying types of less demanding weak supervision. Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization. We investigate applications of such a model to training setups with alternative supervisory signals, ranging from video-level class labels, through temporal points or sparse action bounding boxes, to full per-frame annotation of action bounding boxes. Experiments on the challenging UCF101-24 and DALY datasets demonstrate competitive performance of our method at a fraction of the supervision used by previous methods. The flexibility of our model enables joint learning from data with different levels of annotation. Experimental results demonstrate a significant gain from adding a few fully supervised examples to otherwise weakly labeled videos.
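For orientation, one common way to write a discriminative clustering problem with supervision expressed as constraints is the square-loss form below. The abstract does not state the objective, so this is a sketch under assumptions, not the authors' exact formulation. All symbols are introduced here for illustration: $X \in \mathbb{R}^{N \times d}$ holds features of $N$ candidate units (for example, person tracklets), $Y \in \{0,1\}^{N \times K}$ assigns each unit to one of $K$ classes, $W$ is a linear classifier, $\lambda$ is a regularization weight, and $\mathcal{Y}_s$ is the set of assignments consistent with the available supervision.

```latex
% Assumed square-loss discriminative clustering objective (illustrative only).
\min_{Y \in \mathcal{Y}_s} \; \min_{W \in \mathbb{R}^{d \times K}} \;
  \frac{1}{2N} \lVert Y - XW \rVert_F^2 + \frac{\lambda}{2} \lVert W \rVert_F^2

% For fixed Y the inner problem is ridge regression, with closed-form solution
W^\star = \left( X^\top X + N \lambda I_d \right)^{-1} X^\top Y
```

Under this reading, the level of supervision only changes $\mathcal{Y}_s$: a video-level label leaves many assignments feasible, whereas full per-frame boxes restrict $Y$ to essentially a single assignment.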
Author Information
Guilhem Chéron (Inria)
Jean-Baptiste Alayrac (DeepMind)
Ivan Laptev (Inria)
Cordelia Schmid (Inria / Google)
More from the Same Authors
- 2021 Poster: Large-Scale Unsupervised Object Discovery
  Van Huy Vo · Elena Sizikova · Cordelia Schmid · Patrick Pérez · Jean Ponce
- 2021 Poster: CCVS: Context-aware Controllable Video Synthesis
  Guillaume Le Moing · Jean Ponce · Cordelia Schmid
- 2021 Poster: XCiT: Cross-Covariance Image Transformers
  Alaaeldin Ali · Hugo Touvron · Mathilde Caron · Piotr Bojanowski · Matthijs Douze · Armand Joulin · Ivan Laptev · Natalia Neverova · Gabriel Synnaeve · Jakob Verbeek · Herve Jegou
- 2021 Poster: History Aware Multimodal Transformer for Vision-and-Language Navigation
  Shizhe Chen · Pierre-Louis Guhur · Cordelia Schmid · Ivan Laptev
- 2021 Poster: Attention Bottlenecks for Multimodal Fusion
  Arsha Nagrani · Shan Yang · Anurag Arnab · Aren Jansen · Cordelia Schmid · Chen Sun
- 2021 Poster: Differentiable rendering with perturbed optimizers
  Quentin Le Lidec · Ivan Laptev · Cordelia Schmid · Justin Carpentier
- 2020 Poster: Self-Supervised MultiModal Versatile Networks
  Jean-Baptiste Alayrac · Adria Recasens · Rosalia Schneider · Relja Arandjelović · Jason Ramapuram · Jeffrey De Fauw · Lucas Smaira · Sander Dieleman · Andrew Zisserman
- 2019 Poster: Are Labels Required for Improving Adversarial Robustness?
  Jean-Baptiste Alayrac · Jonathan Uesato · Po-Sen Huang · Alhussein Fawzi · Robert Stanforth · Pushmeet Kohli
- 2019 Poster: Adaptive Density Estimation for Generative Models
  Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek
- 2019 Spotlight: Adaptive Density Estimation for Generative Models
  Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek
- 2018 Poster: Unsupervised Learning of Artistic Styles with Archetypal Style Analysis
  Daan Wynen · Cordelia Schmid · Julien Mairal
- 2016: Invited Talk - Recent Progress in Spatio-Temporal Action Location
  Cordelia Schmid
- 2016 Poster: MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild
  Gregory Rogez · Cordelia Schmid
- 2014 Poster: Convolutional Kernel Networks
  Julien Mairal · Piotr Koniusz · Zaid Harchaoui · Cordelia Schmid
- 2014 Spotlight: Convolutional Kernel Networks
  Julien Mairal · Piotr Koniusz · Zaid Harchaoui · Cordelia Schmid
- 2011 Poster: Learning person-object interactions for action recognition in still images
  Vincent Delaitre · Josef Sivic · Ivan Laptev