Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor-intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pair of video frames (a labeled Frame A and an unlabeled Frame B), we train our model to predict human pose in Frame A using the features from Frame B, relying on deformable convolutions to implicitly learn the pose warping between A and B. We demonstrate that we can leverage our trained PoseWarper for several applications. First, at inference time we can reverse the application direction of our network in order to propagate pose information from manually annotated frames to unlabeled frames. This makes it possible to generate pose annotations for the entire video given only a few manually labeled frames. Compared to modern label propagation methods based on optical flow, our warping mechanism is much more compact (6M vs 39M parameters) and also more accurate (88.7 mAP vs 83.8 mAP). We also show that we can improve the accuracy of a pose estimator by training it on an augmented dataset obtained by adding our propagated poses to the original manual labels. Lastly, we can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference. This allows us to obtain state-of-the-art pose detection results on the PoseTrack2017 and PoseTrack2018 datasets. Code has been made available at: https://github.com/facebookresearch/PoseWarper.
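The sparse-annotation setup described in the abstract can be illustrated with a minimal sketch of how (labeled Frame A, unlabeled Frame B) pairs might be enumerated. This is not taken from the released code: the function names, the assumption that labels start at frame 0, and the `max_offset` window around each labeled frame are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def labeled_frames(num_frames: int, k: int) -> List[int]:
    """Indices of annotated frames, assuming every k-th frame is labeled
    starting at frame 0 (an assumption for this sketch)."""
    return list(range(0, num_frames, k))


def training_pairs(num_frames: int, k: int, max_offset: int) -> List[Tuple[int, int]]:
    """Enumerate (A, B) pairs: A is labeled, B is an unlabeled frame at most
    max_offset frames away. max_offset is a hypothetical window size."""
    pairs = []
    for a in labeled_frames(num_frames, k):
        for delta in range(-max_offset, max_offset + 1):
            b = a + delta
            if delta == 0 or b < 0 or b >= num_frames:
                continue  # skip A itself and out-of-range frames
            if b % k == 0:
                continue  # B must be unlabeled
            pairs.append((a, b))
    return pairs


def nearest_labeled_frame(b: int, num_frames: int, k: int) -> int:
    """For propagation at inference time: pick the closest labeled frame
    to an unlabeled frame b (ties resolve to the earlier frame)."""
    return min(labeled_frames(num_frames, k), key=lambda a: abs(a - b))


if __name__ == "__main__":
    print(training_pairs(num_frames=10, k=4, max_offset=2))
    print(nearest_labeled_frame(5, num_frames=10, k=4))
```

During training, each pair would supply the features of Frame B and the ground-truth pose of Frame A; at inference, the direction is reversed so that each unlabeled frame receives a pose from a nearby labeled one.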
Author Information
Gedas Bertasius (Facebook Research)
Christoph Feichtenhofer (Facebook AI Research)
Du Tran (Facebook AI)
Jianbo Shi (University of Pennsylvania)
Lorenzo Torresani (Facebook AI)
Lorenzo Torresani is an Associate Professor with tenure in the Computer Science Department at Dartmouth College and a Research Scientist at Facebook AI. He received a Laurea Degree in Computer Science with summa cum laude honors from the University of Milan (Italy) in 1996, and an M.S. and a Ph.D. in Computer Science from Stanford University in 2001 and 2005, respectively. In the past, he has worked at several industrial research labs including Microsoft Research Cambridge, Like.com and Digital Persona. His research interests are in computer vision and deep learning. He is the recipient of several awards, including a CVPR best student paper prize, a National Science Foundation CAREER Award, a Google Faculty Research Award, three Facebook Faculty Awards, and a Fulbright U.S. Scholar Award.
More from the Same Authors
- 2023 Poster: MAViL: Masked Audio-Video Learners
  Po-Yao Huang · Vasu Sharma · Hu Xu · Chaitanya Ryali · Haoqi Fan · Yanghao Li · Shang-Wen Li · Gargi Ghosh · Jitendra Malik · Christoph Feichtenhofer
- 2022 Poster: Masked Autoencoders that Listen
  Po-Yao Huang · Hu Xu · Juncheng Li · Alexei Baevski · Michael Auli · Wojciech Galuba · Florian Metze · Christoph Feichtenhofer
- 2022 Poster: Masked Autoencoders As Spatiotemporal Learners
  Christoph Feichtenhofer · Haoqi Fan · Yanghao Li · Kaiming He
- 2020 Poster: Self-Supervised Learning by Cross-Modal Audio-Video Clustering
  Humam Alwassel · Dhruv Mahajan · Bruno Korbar · Lorenzo Torresani · Bernard Ghanem · Du Tran
- 2020 Poster: COBE: Contextualized Object Embeddings from Narrated Instructional Video
  Gedas Bertasius · Lorenzo Torresani
- 2020 Spotlight: Self-Supervised Learning by Cross-Modal Audio-Video Clustering
  Humam Alwassel · Dhruv Mahajan · Bruno Korbar · Lorenzo Torresani · Bernard Ghanem · Du Tran
- 2019 Poster: STAR-Caps: Capsule Networks with Straight-Through Attentive Routing
  Karim Ahmed · Lorenzo Torresani
- 2018 Poster: Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
  Bruno Korbar · Du Tran · Lorenzo Torresani
- 2017 Poster: Learning to Inpaint for Image Compression
  Mohammad Haris Baig · Vladlen Koltun · Lorenzo Torresani
- 2016: ViCom: Benchmark and Methods for Video Comprehension
  Du Tran · Maksim Bolonkin · Manohar Paluri · Lorenzo Torresani
- 2016: Introduction
  Lorenzo Torresani
- 2016 Workshop: Large Scale Computer Vision Systems
  Manohar Paluri · Lorenzo Torresani · Gal Chechik · Dario Garcia · Du Tran
- 2012 Poster: Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
  Du Tran · Junsong Yuan