Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchical information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters.
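To make the idea of attention-based aggregation within a clip more concrete, the following is a minimal sketch of generic attention-weighted pooling over a sequence of frame features. It is not the paper's aggregation layer: the function names `softmax` and `attention_pool`, the single `query` vector, and the scaled dot-product scoring are all assumptions chosen for illustration, and the sketch uses plain Python lists rather than a deep learning framework.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Aggregate T feature vectors into one vector with attention weights.

    features: list of T vectors (e.g., frame features within one clip)
    query:    hypothetical learned vector of the same dimension (assumption)
    """
    dim = len(query)
    # Scaled dot-product score of each time step against the query.
    scores = [
        sum(f[d] * query[d] for d in range(dim)) / math.sqrt(dim)
        for f in features
    ]
    weights = softmax(scores)
    # Weighted sum of the features, one output value per dimension.
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return pooled, weights


# Three toy "frame" features; a zero query yields uniform weights,
# so the pooled vector is the plain mean of the frames.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, query=[0.0, 0.0])
```

With a non-zero query, frames whose features align with the query receive larger weights, which is the general mechanism by which such a layer can emphasize the most informative time steps instead of averaging them uniformly.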
Author Information
Simon Ging (University of Freiburg)
Mohammadreza Zolfaghari (University of Freiburg)
Hamed Pirsiavash (University of Maryland, Baltimore County)
Thomas Brox (University of Freiburg)
More from the Same Authors
2022: Towards Discovering Neural Architectures from Scratch
Simon Schrodi · Danny Stoll · Robin Ru · Rhea Sukthanker · Thomas Brox · Frank Hutter
2023 Poster: Construction of Hierarchical Neural Architecture Search Spaces based on Context-free Grammars
Simon Schrodi · Danny Stoll · Binxin Ru · Rhea Sukthanker · Thomas Brox · Frank Hutter
2022 Poster: Assaying Out-Of-Distribution Generalization in Transfer Learning
Florian Wenzel · Andrea Dittadi · Peter Gehler · Carl-Johann Simon-Gabriel · Max Horn · Dominik Zietlow · David Kernert · Chris Russell · Thomas Brox · Bernt Schiele · Bernhard Schölkopf · Francesco Locatello
2020 Poster: CompRess: Self-Supervised Learning by Compressing Representations
Soroush Abbasi Koohpayegani · Ajinkya Tejankar · Hamed Pirsiavash
2019 Poster: DeepUSPS: Deep Robust Unsupervised Saliency Prediction via Self-supervision
Tam Nguyen · Maximilian Dax · Chaithanya Kumar Mummadi · Nhung Ngo · Thi Hoai Phuong Nguyen · Zhongyu Lou · Thomas Brox
2018: Accepted papers
Sven Gowal · Bogdan Kulynych · Marius Mosbach · Nicholas Frosst · Phil Roth · Utku Ozbulak · Simral Chaudhary · Toshiki Shibahara · Salome Viljoen · Nikita Samarin · Briland Hitaj · Rohan Taori · Emanuel Moss · Melody Guan · Lukas Schott · Angus Galloway · Anna Golubeva · Xiaomeng Jin · Felix Kreuk · Akshayvarun Subramanya · Vipin Pillai · Hamed Pirsiavash · Giuseppe Ateniese · Ankita Kalra · Logan Engstrom · Anish Athalye
2016: Learning 3D representations, disparity estimation, and structure from motion
Thomas Brox
2016 Poster: Generating Videos with Scene Dynamics
Carl Vondrick · Hamed Pirsiavash · Antonio Torralba
2016 Poster: Protein contact prediction from amino acid co-evolution using convolutional networks for graph-valued images
Vladimir Golkov · Marcin Skwark · Antonij Golkov · Alexey Dosovitskiy · Thomas Brox · Jens Meiler · Daniel Cremers
2016 Oral: Protein contact prediction from amino acid co-evolution using convolutional networks for graph-valued images
Vladimir Golkov · Marcin Skwark · Antonij Golkov · Alexey Dosovitskiy · Thomas Brox · Jens Meiler · Daniel Cremers
2016 Poster: Generating Images with Perceptual Similarity Metrics based on Deep Networks
Alexey Dosovitskiy · Thomas Brox
2016 Poster: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks
Anh Nguyen · Alexey Dosovitskiy · Jason Yosinski · Thomas Brox · Jeff Clune
2014 Poster: Discriminative Unsupervised Feature Learning with Convolutional Neural Networks
Alexey Dosovitskiy · Jost Tobias Springenberg · Martin Riedmiller · Thomas Brox