
Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
Shentong Mo · Yapeng Tian

Tue Nov 29 02:00 PM -- 04:00 PM (PST) @ Hall J #142

The audio-visual video parsing task aims to parse a video into modality- and category-aware temporal segments. Previous work mainly focuses on weakly-supervised approaches, which learn from video-level event labels. During training, these methods do not know which modality perceives an event or which temporal segment contains it. Since there is no explicit grouping in the existing frameworks, this modality and temporal uncertainty leads to false predictions; for instance, segments of the same category can be predicted as different event classes. Learning compact and discriminative multi-modal subspaces is essential for mitigating this issue. To this end, in this paper, we propose a novel Multi-modal Grouping Network, namely MGN, for explicit semantic-aware grouping. Specifically, MGN aggregates event-aware unimodal features through unimodal grouping in terms of learnable categorical embedding tokens. Furthermore, it leverages cross-modal grouping for modality-aware prediction to match the video-level target. Our simple framework achieves improved results over previous baselines on weakly-supervised audio-visual video parsing. In addition, our MGN is much more lightweight, using only 47.2% of the parameters of baselines (17 MB vs. 36 MB). Code is available at https://github.com/stoneMo/MGN.
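The grouping idea described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the dimensions, the attention-style grouping, the averaging fusion, and the scoring head are all assumptions chosen for clarity; see the linked repository for the actual MGN code.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unimodal_grouping(feats, class_tokens):
    """Group per-segment features of one modality into per-category features.

    feats:        (T, d) segment features (audio or visual)
    class_tokens: (C, d) learnable categorical embedding tokens
    returns:      (C, d) event-aware grouped features
    """
    d = feats.shape[-1]
    attn = softmax(class_tokens @ feats.T / np.sqrt(d), axis=-1)  # (C, T)
    return attn @ feats

rng = np.random.default_rng(0)
T, d, C = 10, 64, 25  # segments, feature dim, event categories (illustrative)
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))
tokens = rng.standard_normal((C, d))

g_audio = unimodal_grouping(audio, tokens)
g_visual = unimodal_grouping(visual, tokens)

# Cross-modal grouping (sketched here as simple averaging), then a
# per-category score matched against the video-level multi-label target.
fused = (g_audio + g_visual) / 2
logits = (fused * tokens).sum(-1)          # (C,) per-category score
probs = 1 / (1 + np.exp(-logits))          # independent sigmoid per category
```

Each categorical token attends over the temporal segments of its modality, so segments that share an event category are pooled into one compact embedding, which is the explicit grouping the paper argues is missing from prior frameworks.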

Author Information

Shentong Mo (CMU)
Yapeng Tian (The University of Texas at Dallas)

Dr. Yapeng Tian is an assistant professor in the Computer Science Department of UT Dallas. He is interested in solving core computer vision and computer audition problems and applying the developed learning approaches to broad AI applications. His recent work has focused on studying audio-visual scene understanding and mitigating video motions.
