Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models the spatial relations between images in a panoramic observation, and finally takes into account the temporal relations between panoramas in the history. It then jointly combines text, history and the current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves a new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN), as well as long-horizon VLN (R4R, R2R-Back). We demonstrate HAMT to be particularly effective for navigation tasks with longer trajectories.
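The three-level history encoding described in the abstract (image, then panorama, then history) can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`encode_image`, `encode_panorama`, `encode_history`), the weight-free self-attention, the mean pooling, the additive step offset, and all sizes are assumptions chosen only to show the data flow in plain Python.

```python
import math
import random


def self_attention(tokens):
    """Weight-free single-head dot-product self-attention.

    Assumption: a stand-in for a learned transformer layer.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def encode_image(patches):
    # Level 1: relate the patches of a single image, then pool to one vector.
    return mean_pool(self_attention(patches))


def encode_panorama(images):
    # Level 2: relate the views (images) of one panorama spatially, then pool.
    return mean_pool(self_attention([encode_image(p) for p in images]))


def encode_history(panoramas):
    # Level 3: relate panoramas over time. A small additive step offset acts
    # as a hypothetical positional signal marking the order of the steps.
    embs = [encode_panorama(p) for p in panoramas]
    embs = [[x + 0.01 * step for x in e] for step, e in enumerate(embs)]
    return self_attention(embs)


# Toy input (hypothetical sizes): 3 steps, 4 views per panorama,
# 5 patches per view, 8-dimensional patch features.
random.seed(0)
trajectory = [[[[random.random() for _ in range(8)] for _ in range(5)]
               for _ in range(4)]
              for _ in range(3)]
history = encode_history(trajectory)
print(len(history), len(history[0]))
```

The result is one history token per visited panorama. In a full model these tokens would be combined with the instruction text and the current observation to predict the next action; that cross-modal step is omitted here.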
Author Information
Shizhe Chen (Inria)
Pierre-Louis Guhur (Inria)
Cordelia Schmid (Inria / Google)
Ivan Laptev (Inria)
More from the Same Authors
- 2022 Poster: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
  Shizhe Chen · Pierre-Louis Guhur · Makarand Tapaswi · Cordelia Schmid · Ivan Laptev
- 2023 Poster: VidChapters-7M: Video Chapters at Scale
  Antoine Yang · Arsha Nagrani · Ivan Laptev · Josef Sivic · Cordelia Schmid
- 2023 Poster: Does Visual Pretraining Help End-to-End Reasoning?
  Chen Sun · Calvin Luo · Xingyi Zhou · Anurag Arnab · Cordelia Schmid
- 2023 Poster: AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
  Ziniu Hu · Ahmet Iscen · Chen Sun · Kai-Wei Chang · Yizhou Sun · David Ross · Cordelia Schmid · Alireza Fathi
- 2022 Spotlight: Lightning Talks 4B-3
  Zicheng Zhang · Mancheng Meng · Antoine Guedon · Yue Wu · Wei Mao · Zaiyu Huang · Peihao Chen · Shizhe Chen · Yongwei Chen · Keqiang Sun · Yi Zhu · chen rui · Hanhui Li · Dongyu Ji · Ziyan Wu · miaomiao Liu · Pascal Monasse · Yu Deng · Shangzhe Wu · Pierre-Louis Guhur · Jiaolong Yang · Kunyang Lin · Makarand Tapaswi · Zhaoyang Huang · Terrence Chen · Jiabao Lei · Jianzhuang Liu · Vincent Lepetit · Zhenyu Xie · Richard I Hartley · Dinggang Shen · Xiaodan Liang · Runhao Zeng · Cordelia Schmid · Michael Kampffmeyer · Mathieu Salzmann · Ning Zhang · Fangyun Wei · Yabin Zhang · Fan Yang · Qifeng Chen · Wei Ke · Quan Wang · Thomas Li · qingling Cai · Kui Jia · Ivan Laptev · Mingkui Tan · Xin Tong · Hongsheng Li · Xiaodan Liang · Chuang Gan
- 2022 Spotlight: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
  Shizhe Chen · Pierre-Louis Guhur · Makarand Tapaswi · Cordelia Schmid · Ivan Laptev
- 2022 Poster: Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
  Antoine Yang · Antoine Miech · Josef Sivic · Ivan Laptev · Cordelia Schmid
- 2021 Poster: Large-Scale Unsupervised Object Discovery
  Van Huy Vo · Elena Sizikova · Cordelia Schmid · Patrick Pérez · Jean Ponce
- 2021 Poster: CCVS: Context-aware Controllable Video Synthesis
  Guillaume Le Moing · Jean Ponce · Cordelia Schmid
- 2021 Poster: XCiT: Cross-Covariance Image Transformers
  Alaaeldin Ali · Hugo Touvron · Mathilde Caron · Piotr Bojanowski · Matthijs Douze · Armand Joulin · Ivan Laptev · Natalia Neverova · Gabriel Synnaeve · Jakob Verbeek · Herve Jegou
- 2021 Poster: Attention Bottlenecks for Multimodal Fusion
  Arsha Nagrani · Shan Yang · Anurag Arnab · Aren Jansen · Cordelia Schmid · Chen Sun
- 2021 Poster: Differentiable rendering with perturbed optimizers
  Quentin Le Lidec · Ivan Laptev · Cordelia Schmid · Justin Carpentier
- 2019 Poster: Adaptive Density Estimation for Generative Models
  Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek
- 2019 Spotlight: Adaptive Density Estimation for Generative Models
  Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek
- 2018 Poster: Unsupervised Learning of Artistic Styles with Archetypal Style Analysis
  Daan Wynen · Cordelia Schmid · Julien Mairal
- 2018 Poster: A flexible model for training action localization with varying levels of supervision
  Guilhem Chéron · Jean-Baptiste Alayrac · Ivan Laptev · Cordelia Schmid
- 2016: Invited Talk - Recent Progress in Spatio-Temporal Action Location
  Cordelia Schmid
- 2016 Poster: MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild
  Gregory Rogez · Cordelia Schmid
- 2014 Poster: Convolutional Kernel Networks
  Julien Mairal · Piotr Koniusz · Zaid Harchaoui · Cordelia Schmid
- 2014 Spotlight: Convolutional Kernel Networks
  Julien Mairal · Piotr Koniusz · Zaid Harchaoui · Cordelia Schmid
- 2011 Poster: Learning person-object interactions for action recognition in still images
  Vincent Delaitre · Josef Sivic · Ivan Laptev