
Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
Shentong Mo · Yapeng Tian

Tue Nov 29 02:00 PM -- 04:00 PM (PST) @ Hall J #142

The audio-visual video parsing task aims to parse a video into modality- and category-aware temporal segments. Previous work mainly focuses on weakly-supervised approaches, which learn from video-level event labels. During training, these methods do not know which modality perceives an event or which temporal segment contains it. Since there is no explicit grouping in the existing frameworks, this modality and temporal uncertainty leads to false predictions; for instance, segments of the same category can be predicted as different event classes. Learning compact and discriminative multi-modal subspaces is essential for mitigating this issue. To this end, in this paper, we propose a novel Multi-modal Grouping Network, namely MGN, for explicit semantic-aware grouping. Specifically, MGN aggregates event-aware unimodal features through unimodal grouping in terms of learnable categorical embedding tokens. Furthermore, it leverages cross-modal grouping for modality-aware prediction to match the video-level target. Our simple framework achieves improved results over previous baselines on weakly-supervised audio-visual video parsing. In addition, our MGN is much more lightweight, using only 47.2% of the parameters of baselines (17 MB vs. 36 MB). Code is available at https://github.com/stoneMo/MGN.
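The grouping idea described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the dimensions, the attention-style grouping, the averaging fusion, and the scoring head are all assumptions chosen for clarity; see the linked repository for the actual MGN code.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unimodal_grouping(feats, class_tokens):
    """Group per-segment features of one modality into per-category features.

    feats:        (T, d) segment features (audio or visual)
    class_tokens: (C, d) learnable categorical embedding tokens
    returns:      (C, d) event-aware grouped features
    """
    d = feats.shape[-1]
    attn = softmax(class_tokens @ feats.T / np.sqrt(d), axis=-1)  # (C, T)
    return attn @ feats

rng = np.random.default_rng(0)
T, d, C = 10, 64, 25  # segments, feature dim, event categories (illustrative)
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))
tokens = rng.standard_normal((C, d))

g_audio = unimodal_grouping(audio, tokens)
g_visual = unimodal_grouping(visual, tokens)

# Cross-modal grouping (sketched here as simple averaging), then a
# per-category score matched against the video-level multi-label target.
fused = (g_audio + g_visual) / 2
logits = (fused * tokens).sum(-1)          # (C,) per-category score
probs = 1 / (1 + np.exp(-logits))          # independent sigmoid per category
```

Each categorical token attends over the temporal segments of its modality, so segments that share an event category are pooled into one compact embedding, which is the explicit grouping the paper argues is missing from prior frameworks.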

Author Information

Shentong Mo (CMU)
Yapeng Tian (The University of Texas at Dallas)

Dr. Yapeng Tian is an assistant professor in the Computer Science Department of UT Dallas. He is interested in solving core computer vision and computer audition problems and applying the developed learning approaches to broad AI applications. His recent work has focused on studying audio-visual scene understanding and mitigating video motions.
