NeurIPS Poster Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Poster

Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Yung-Hsuan Lai · Yen-Chun Chen · Frank Wang

Great Hall & Hall B1+B2 (level 1) #500

[ Abstract ]

[ Paper] [ Slides] [ Poster] [ OpenReview]

Abstract: Audio-visual learning has been a major pillar of multi-modal machine learning, where the community mostly focused on its

modality-aligned

$\textit{modality-aligned}$ setting,

i.e.

$\textit{i.e.}$ , the audio and visual modality are

both

$\textit{both}$ assumed to signal the prediction target.With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored

unaligned

$\textit{unaligned}$ setting, where the goal is to recognize audio and visual events in a video with only weak labels observed.Such weak video-level labels only tell what events happen without knowing the modality they are perceived (audio, visual, or both).To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed

V

$\textbf{V}$ isual-

A

$\textbf{A}$ udio

L

$\textbf{L}$ abel Elab

or

$\textbf{or}$ ation (VALOR), is innovated to harvest modality labels for the training events.Empirical studies show that the harvested labels significantly improve an attentional baseline by

8.0

$\textbf{8.0}$ in average F-score (Type@AV).Surprisingly, we found that modality-independent teachers outperform their modality-fused counterparts since they are noise-proof from the other potentially unaligned modality.Moreover, our best model achieves the new state-of-the-art on all metrics of LLP by a substantial margin (

+5.4

$\textbf{+5.4}$ F-score for Type@AV). VALOR is further generalized to Audio-Visual Event Localization and achieves the new state-of-the-art as well.

Chat is not available.