Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks. A common approach for building multimodal models is to simply combine multiple of these modality-specific architectures using late-stage fusion of final representations or predictions ('late-fusion'). Instead, we introduce a novel transformer-based architecture that uses 'attention bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, these bottlenecks force information between different modalities to pass through a small number of 'bottleneck' latent units, requiring the model to collate and condense the most relevant information in each modality and share only what is necessary. We find that such a strategy improves fusion performance while reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
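The bottleneck-fusion idea in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: real fusion layers use learned multi-head attention projections, LayerNorm, and MLPs, whereas here a single parameter-free attention pass stands in for each stream. The function names (`self_attention`, `bottleneck_fusion_layer`) and shapes are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # toy single-head attention with identity Q/K/V projections,
    # used only to show which tokens can exchange information
    d = tokens.shape[-1]
    scores = softmax(tokens @ tokens.T / np.sqrt(d))
    return scores @ tokens

def bottleneck_fusion_layer(visual, audio, bottleneck):
    """One fusion layer: each modality attends only over its own tokens
    plus the shared bottleneck tokens; cross-modal information can flow
    only through the (averaged) bottleneck updates."""
    b = bottleneck.shape[0]
    # visual stream attends over [visual; bottleneck]
    v_out = self_attention(np.concatenate([visual, bottleneck]))
    new_visual, b_from_v = v_out[:-b], v_out[-b:]
    # audio stream attends over [audio; bottleneck]
    a_out = self_attention(np.concatenate([audio, bottleneck]))
    new_audio, b_from_a = a_out[:-b], a_out[-b:]
    # the two streams' bottleneck updates are fused by averaging
    new_bottleneck = (b_from_v + b_from_a) / 2.0
    return new_visual, new_audio, new_bottleneck
```

Note the cost implication the abstract alludes to: with `n_v` visual tokens, `n_a` audio tokens and `b` bottleneck tokens, attention costs O((n_v + b)^2 + (n_a + b)^2) rather than O((n_v + n_a)^2) for full pairwise self-attention, which is much cheaper when `b` is small.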
Author Information
Arsha Nagrani (Google)
Shan Yang (Amazon)
Anurag Arnab (University of Oxford)
Aren Jansen (Google, Inc.)
Cordelia Schmid (Inria / Google)
Chen Sun (Google Research)
More from the Same Authors
- 2022 Poster: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding »
  Shizhe Chen · Pierre-Louis Guhur · Makarand Tapaswi · Cordelia Schmid · Ivan Laptev
- 2022 Spotlight: Lightning Talks 4B-3 »
  Zicheng Zhang · Mancheng Meng · Antoine Guedon · Yue Wu · Wei Mao · Zaiyu Huang · Peihao Chen · Shizhe Chen · yongwei chen · Keqiang Sun · Yi Zhu · chen rui · Hanhui Li · Dongyu Ji · Ziyan Wu · miaomiao Liu · Pascal Monasse · Yu Deng · Shangzhe Wu · Pierre-Louis Guhur · Jiaolong Yang · Kunyang Lin · Makarand Tapaswi · Zhaoyang Huang · Terrence Chen · Jiabao Lei · Jianzhuang Liu · Vincent Lepetit · Zhenyu Xie · Richard I Hartley · Dinggang Shen · Xiaodan Liang · Runhao Zeng · Cordelia Schmid · Michael Kampffmeyer · Mathieu Salzmann · Ning Zhang · Fangyun Wei · Yabin Zhang · Fan Yang · Qifeng Chen · Wei Ke · Quan Wang · Thomas Li · qingling Cai · Kui Jia · Ivan Laptev · Mingkui Tan · Xin Tong · Hongsheng Li · Xiaodan Liang · Chuang Gan
- 2022 Spotlight: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding »
  Shizhe Chen · Pierre-Louis Guhur · Makarand Tapaswi · Cordelia Schmid · Ivan Laptev
- 2022: MAQA: A Multimodal QA Benchmark for Negation »
  Yue Li · Aren Jansen · Qingqing Huang · Ravi Ganti · Joonseok Lee · Dima Kuzmin
- 2022 Poster: Zero-Shot Video Question Answering via Frozen Bidirectional Language Models »
  Antoine Yang · Antoine Miech · Josef Sivic · Ivan Laptev · Cordelia Schmid
- 2021 Poster: Large-Scale Unsupervised Object Discovery »
  Van Huy Vo · Elena Sizikova · Cordelia Schmid · Patrick Pérez · Jean Ponce
- 2021 Poster: CCVS: Context-aware Controllable Video Synthesis »
  Guillaume Le Moing · Jean Ponce · Cordelia Schmid
- 2021 Poster: TokenLearner: Adaptive Space-Time Tokenization for Videos »
  Michael Ryoo · AJ Piergiovanni · Anurag Arnab · Mostafa Dehghani · Anelia Angelova
- 2021 Poster: History Aware Multimodal Transformer for Vision-and-Language Navigation »
  Shizhe Chen · Pierre-Louis Guhur · Cordelia Schmid · Ivan Laptev
- 2021 Poster: Compressive Visual Representations »
  Kuang-Huei Lee · Anurag Arnab · Sergio Guadarrama · John Canny · Ian Fischer
- 2021 Poster: Differentiable rendering with perturbed optimizers »
  Quentin Le Lidec · Ivan Laptev · Cordelia Schmid · Justin Carpentier
- 2020 Poster: What Makes for Good Views for Contrastive Learning? »
  Yonglong Tian · Chen Sun · Ben Poole · Dilip Krishnan · Cordelia Schmid · Phillip Isola
- 2019: Coffee + Posters »
  Changhao Chen · Nils Gählert · Edouard Leurent · Johannes Lehner · Apratim Bhattacharyya · Harkirat Singh Behl · Teck Yian Lim · Shiho Kim · Jelena Novosel · Błażej Osiński · Arindam Das · Ruobing Shen · Jeffrey Hawke · Joachim Sicking · Babak Shahian Jahromi · Theja Tulabandhula · Claudio Michaelis · Evgenia Rusak · WENHANG BAO · Hazem Rashed · JP Chen · Amin Ansari · Jaekwang Cha · Mohamed Zahran · Daniele Reda · Jinhyuk Kim · Kim Dohyun · Ho Suk · Junekyo Jhung · Alexander Kister · Matthias Fahrland · Adam Jakubowski · Piotr Miłoś · Jean Mercat · Bruno Arsenali · Silviu Homoceanu · Xiao-Yang Liu · Philip Torr · Ahmad El Sallab · Ibrahim Sobh · Anurag Arnab · Krzysztof Galias
- 2019 Poster: Adaptive Density Estimation for Generative Models »
  Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek
- 2019 Spotlight: Adaptive Density Estimation for Generative Models »
  Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek
- 2019 Poster: Unsupervised learning of object structure and dynamics from videos »
  Matthias Minderer · Chen Sun · Ruben Villegas · Forrester Cole · Kevin Murphy · Honglak Lee
- 2018 Poster: Unsupervised Learning of Artistic Styles with Archetypal Style Analysis »
  Daan Wynen · Cordelia Schmid · Julien Mairal
- 2018 Poster: A flexible model for training action localization with varying levels of supervision »
  Guilhem Chéron · Jean-Baptiste Alayrac · Ivan Laptev · Cordelia Schmid
- 2017: Towards Learning Semantic Audio Representations from Unlabeled Data »
  Aren Jansen
- 2016: Invited Talk - Recent Progress in Spatio-Temporal Action Location »
  Cordelia Schmid
- 2016 Poster: MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild »
  Gregory Rogez · Cordelia Schmid
- 2014 Poster: Convolutional Kernel Networks »
  Julien Mairal · Piotr Koniusz · Zaid Harchaoui · Cordelia Schmid
- 2014 Spotlight: Convolutional Kernel Networks »
  Julien Mairal · Piotr Koniusz · Zaid Harchaoui · Cordelia Schmid