COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox
Poster Session 3 (more posters)
on 2020-12-08T21:00:00-08:00 - 2020-12-08T23:00:00-08:00
GatherTown: Vision ( Town A2 - Spot C0 )
on 2020-12-08T21:00:00-08:00 - 2020-12-08T23:00:00-08:00
GatherTown: Vision ( Town A2 - Spot C0 )
Join GatherTown
Only iff poster is crowded, join Zoom . Authors have to start the Zoom call from their Profile page / Presentation History.
Only iff poster is crowded, join Zoom . Authors have to start the Zoom call from their Profile page / Presentation History.
Toggle Abstract Paper (in Proceedings / .pdf)
Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clip and sentences or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g. clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters.