Timezone: »
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred by the text, such as "the left most chair" and "a chair next to the window". In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this end, we design a spatial self-attention layer that accounts for relative distances and orientations between objects in input 3D point clouds. Training such a layer with visual and language inputs enables to disambiguate spatial relations and to localize objects referred by the text. To facilitate the cross-modal learning of relations, we further propose a teacher-student approach where the teacher model is first trained using ground-truth object labels, and then helps to train a student model using point cloud inputs. We perform ablation studies showing advantages of our approach. We also demonstrate our model to significantly outperform the state of the art on the challenging Nr3D, Sr3D and ScanRefer 3D object grounding datasets.
Author Information
Shizhe Chen (INRIA)
Pierre-Louis Guhur (INRIA)
Makarand Tapaswi (Wadhwani AI / IIIT Hyderabad)
Cordelia Schmid (INRIA)
Ivan Laptev (INRIA)
Related Events (a corresponding poster, oral, or spotlight)
-
2022 Poster: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding »
Dates n/a. Room
More from the Same Authors
-
2023 Poster: Does Visual Pretraining Help End-to-End Reasoning? »
Chen Sun · Calvin Luo · Xingyi Zhou · Anurag Arnab · Cordelia Schmid -
2023 Poster: AVIS: Autonomous Visual Information Seeking with Large Language Models »
Ziniu Hu · Ahmet Iscen · Chen Sun · Kai-Wei Chang · Yizhou Sun · Cordelia Schmid · David Ross · Alireza Fathi -
2023 Poster: VidChapters-7M: Video Chapters at Scale »
Antoine Yang · Arsha Nagrani · Ivan Laptev · Josef Sivic · Cordelia Schmid -
2022 Spotlight: Lightning Talks 4B-4 »
Ziyue Jiang · Zeeshan Khan · Yuxiang Yang · Chenze Shao · Yichong Leng · Zehao Yu · Wenguan Wang · Xian Liu · Zehua Chen · Yang Feng · Qianyi Wu · James Liang · C.V. Jawahar · Junjie Yang · Zhe Su · Songyou Peng · Yufei Xu · Junliang Guo · Michael Niemeyer · Hang Zhou · Zhou Zhao · Makarand Tapaswi · Dongfang Liu · Qian Yang · Torsten Sattler · Yuanqi Du · Haohe Liu · Jing Zhang · Andreas Geiger · Yi Ren · Long Lan · Jiawei Chen · Wayne Wu · Dahua Lin · Dacheng Tao · Xu Tan · Jinglin Liu · Ziwei Liu · 振辉 叶 · Danilo Mandic · Lei He · Xiangyang Li · Tao Qin · sheng zhao · Tie-Yan Liu -
2022 Spotlight: Lightning Talks 4B-3 »
Zicheng Zhang · Mancheng Meng · Antoine Guedon · Yue Wu · Wei Mao · Zaiyu Huang · Peihao Chen · Shizhe Chen · Yongwei Chen · Keqiang Sun · Yi Zhu · chen rui · Hanhui Li · Dongyu Ji · Ziyan Wu · miaomiao Liu · Pascal Monasse · Yu Deng · Shangzhe Wu · Pierre-Louis Guhur · Jiaolong Yang · Kunyang Lin · Makarand Tapaswi · Zhaoyang Huang · Terrence Chen · Jiabao Lei · Jianzhuang Liu · Vincent Lepetit · Zhenyu Xie · Richard I Hartley · Dinggang Shen · Xiaodan Liang · Runhao Zeng · Cordelia Schmid · Michael Kampffmeyer · Mathieu Salzmann · Ning Zhang · Fangyun Wei · Yabin Zhang · Fan Yang · Qifeng Chen · Wei Ke · Quan Wang · Thomas Li · qingling Cai · Kui Jia · Ivan Laptev · Mingkui Tan · Xin Tong · Hongsheng Li · Xiaodan Liang · Chuang Gan -
2022 Spotlight: Grounded Video Situation Recognition »
Zeeshan Khan · C.V. Jawahar · Makarand Tapaswi -
2022 Poster: Grounded Video Situation Recognition »
Zeeshan Khan · C.V. Jawahar · Makarand Tapaswi -
2022 Poster: Zero-Shot Video Question Answering via Frozen Bidirectional Language Models »
Antoine Yang · Antoine Miech · Josef Sivic · Ivan Laptev · Cordelia Schmid -
2021 Poster: Large-Scale Unsupervised Object Discovery »
Van Huy Vo · Elena Sizikova · Cordelia Schmid · Patrick Pérez · Jean Ponce -
2021 Poster: CCVS: Context-aware Controllable Video Synthesis »
Guillaume Le Moing · Jean Ponce · Cordelia Schmid -
2021 Poster: XCiT: Cross-Covariance Image Transformers »
Alaaeldin Ali · Hugo Touvron · Mathilde Caron · Piotr Bojanowski · Matthijs Douze · Armand Joulin · Ivan Laptev · Natalia Neverova · Gabriel Synnaeve · Jakob Verbeek · Herve Jegou -
2021 Poster: History Aware Multimodal Transformer for Vision-and-Language Navigation »
Shizhe Chen · Pierre-Louis Guhur · Cordelia Schmid · Ivan Laptev -
2021 Poster: Attention Bottlenecks for Multimodal Fusion »
Arsha Nagrani · Shan Yang · Anurag Arnab · Aren Jansen · Cordelia Schmid · Chen Sun -
2021 Poster: Differentiable rendering with perturbed optimizers »
Quentin Le Lidec · Ivan Laptev · Cordelia Schmid · Justin Carpentier -
2019 Poster: Adaptive Density Estimation for Generative Models »
Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek -
2019 Spotlight: Adaptive Density Estimation for Generative Models »
Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek -
2018 Poster: Unsupervised Learning of Artistic Styles with Archetypal Style Analysis »
Daan Wynen · Cordelia Schmid · Julien Mairal -
2018 Poster: A flexible model for training action localization with varying levels of supervision »
Guilhem Chéron · Jean-Baptiste Alayrac · Ivan Laptev · Cordelia Schmid -
2016 : Invited Talk - Recent Progress in Spatio-Temporal Action Location »
Cordelia Schmid -
2016 Poster: MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild »
Gregory Rogez · Cordelia Schmid -
2014 Poster: Convolutional Kernel Networks »
Julien Mairal · Piotr Koniusz · Zaid Harchaoui · Cordelia Schmid -
2014 Spotlight: Convolutional Kernel Networks »
Julien Mairal · Piotr Koniusz · Zaid Harchaoui · Cordelia Schmid -
2011 Poster: Learning person-object interactions for action recognition in still images »
Vincent Delaitre · Josef Sivic · Ivan Laptev