Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes and can therefore be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. In order to distill discriminative visual words from noisy documents, we introduce a new cross-modal attention module that learns fine-grained interactions between image patches and document words. Consequently, our I2DFormer not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to localize visually relevant words in image regions. Quantitatively, we demonstrate that our I2DFormer significantly outperforms previous unsupervised semantic embeddings under both zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method leads to highly interpretable results where document words can be grounded in the image regions.
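The cross-modal attention idea in the abstract can be illustrated with a toy sketch. This is not the I2DFormer implementation: the function names (`cross_attention`, `image_document_score`, `predict`), the scaled dot-product form, the mean-pooled scoring rule, and the two-dimensional toy embeddings are all assumptions made for illustration. Each image patch attends over the words of a class document, and the class whose document best matches the patches is predicted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(patches, words):
    """Each image patch (query) attends over document words (keys/values).

    Returns per-patch attention weights over words and the attended
    word feature for each patch. Hypothetical helper, not the paper's module.
    """
    d = len(words[0])
    weights, attended = [], []
    for p in patches:
        scores = [dot(p, w) / math.sqrt(d) for w in words]
        a = softmax(scores)
        weights.append(a)
        attended.append(
            [sum(a[j] * words[j][k] for j in range(len(words))) for k in range(d)]
        )
    return weights, attended

def image_document_score(patches, words):
    # Assumed scoring rule: mean similarity between each patch and
    # the word feature it attends to.
    _, attended = cross_attention(patches, words)
    return sum(dot(p, f) for p, f in zip(patches, attended)) / len(patches)

def predict(patches, class_documents):
    # Zero-shot prediction: pick the class whose document scores highest.
    scores = {
        name: image_document_score(patches, words)
        for name, words in class_documents.items()
    }
    return max(scores, key=scores.get)

# Toy data: 2-D embeddings standing in for learned patch and word features.
patches = [[1.0, 0.0], [0.9, 0.1]]
class_documents = {
    "zebra": [[1.0, 0.0], [0.0, 0.0]],  # first word is visually relevant
    "crow": [[0.0, 1.0], [0.0, 0.0]],
}
prediction = predict(patches, class_documents)
weights, _ = cross_attention(patches, class_documents["zebra"])
```

The per-patch attention weights in `weights` are what would allow words to be grounded in image regions: in this toy case, each patch puts more weight on the first "zebra" word than on the uninformative second one.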
Author Information
Muhammad Ferjad Naeem (ETH Zurich)
Yongqin Xian (Google)
Luc V Gool (Computer Vision Lab, ETH Zurich)
Federico Tombari (Google, TUM)
More from the Same Authors
-
2019 Poster: Gated CRF Loss for Weakly Supervised Semantic Image Segmentation »
Anton Obukhov · Stamatios Georgoulis · Dengxin Dai · Luc V Gool -
2021 : Spatial-Temporal Gated Transformers for Efficient Video Processing »
Yawei Li · Babak Ehteshami Bejnordi · Bert Moons · Tijmen Blankevoort · Amirhossein Habibian · Radu Timofte · Luc V Gool -
2022 Poster: Recurrent Video Restoration Transformer with Guided Deformable Attention »
Jingyun Liang · Yuchen Fan · Xiaoyu Xiang · Rakesh Ranjan · Eddy Ilg · Simon Green · Jiezhang Cao · Kai Zhang · Radu Timofte · Luc V Gool -
2023 Poster: LART: Neural Correspondence Learning with Latent Regularization Transformer for 3D Motion Transfer »
Haoyu Chen · Hao Tang · Radu Timofte · Luc V Gool · Guoying Zhao -
2023 Poster: Autodecoding Latent 3D Diffusion Models »
Evangelos Ntavelis · Aliaksandr Siarohin · Kyle Olszewski · Chaoyang Wang · Luc V Gool · Sergey Tulyakov -
2023 Poster: Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding »
Zhejun Zhang · Alexander Liniger · Christos Sakaridis · Fisher Yu · Luc V Gool -
2023 Poster: OpenMask3D: Open-Vocabulary 3D Instance Segmentation »
Ayça Takmaz · Elisabetta Fedele · Robert Sumner · Marc Pollefeys · Federico Tombari · Francis Engelmann -
2023 Poster: DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field »
Chenyangguang Zhang · Yan Di · Ruida Zhang · Guangyao Zhai · Fabian Manhardt · Federico Tombari · Xiangyang Ji -
2023 Poster: CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graphs »
Guangyao Zhai · Evin Pınar Örnek · Shun-Cheng Wu · Yan Di · Federico Tombari · Nassir Navab · Benjamin Busam -
2023 Poster: Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union »
Zifu Wang · Maxim Berman · Amal Rannen-Triki · Philip Torr · Devis Tuia · Tinne Tuytelaars · Luc V Gool · Jiaqian Yu · Matthew Blaschko -
2022 Spotlight: Lightning Talks 5A-4 »
Yangrui Chen · Zhiyang Chen · Liang Zhang · Hanqing Wang · Jiaqi Han · Shuchen Wu · shaohui peng · Ganqu Cui · Yoav Kolumbus · Noemi Elteto · Xing Hu · Anwen Hu · Wei Liang · Cong Xie · Lifan Yuan · Noam Nisan · Wenbing Huang · Yousong Zhu · Ishita Dasgupta · Luc V Gool · Tingyang Xu · Rui Zhang · Qin Jin · Zhaowen Li · Meng Ma · Bingxiang He · Yangyi Chen · Juncheng Gu · Wenguan Wang · Ke Tang · Yu Rong · Eric Schulz · Fan Yang · Wei Li · Zhiyuan Liu · Jiaming Guo · Yanghua Peng · Haibin Lin · Haixin Wang · Qi Yi · Maosong Sun · Ruizhi Chen · Chuan Wu · Chaoyang Zhao · Yibo Zhu · Liwei Wu · xishan zhang · Zidong Du · Rui Zhao · Jinqiao Wang · Ling Li · Qi Guo · Ming Tang · Yunji Chen -
2022 Spotlight: Towards Versatile Embodied Navigation »
Hanqing Wang · Wei Liang · Luc V Gool · Wenguan Wang -
2022 Spotlight: Recurrent Video Restoration Transformer with Guided Deformable Attention »
Jingyun Liang · Yuchen Fan · Xiaoyu Xiang · Rakesh Ranjan · Eddy Ilg · Simon Green · Jiezhang Cao · Kai Zhang · Radu Timofte · Luc V Gool -
2022 : SecNet: Semantic Eye Completion in Implicit Field »
Yida Wang · Yiru Shen · David Joseph Tan · Federico Tombari · Sachin S Talathi -
2022 Poster: Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging »
Yuanhao Cai · Jing Lin · Haoqian Wang · Xin Yuan · Henghui Ding · Yulun Zhang · Radu Timofte · Luc V Gool -
2022 Poster: Towards Versatile Embodied Navigation »
Hanqing Wang · Wei Liang · Luc V Gool · Wenguan Wang -
2021 Poster: Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations »
Wouter Van Gansbeke · Simon Vandenhende · Stamatios Georgoulis · Luc V Gool -
2020 Poster: GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network »
Prune Truong · Martin Danelljan · Luc V Gool · Radu Timofte -
2020 Poster: Soft Contrastive Learning for Visual Localization »
Janine Thoma · Danda Pani Paudel · Luc V Gool -
2017 Poster: Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations »
Eirikur Agustsson · Fabian Mentzer · Michael Tschannen · Lukas Cavigelli · Radu Timofte · Luca Benini · Luc V Gool -
2016 Poster: Dynamic Filter Networks »
Xu Jia · Bert De Brabandere · Tinne Tuytelaars · Luc V Gool -
2014 Poster: Quantized Kernel Learning for Feature Matching »
Danfeng Qin · Xuanli Chen · Matthieu Guillaumin · Luc V Gool -
2014 Poster: Self-Adaptable Templates for Feature Coding »
Xavier Boix · Gemma Roig · Salomon Diether · Luc V Gool -
2011 Poster: Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities »
Angela Yao · Juergen Gall · Luc V Gool · Raquel Urtasun