Transformer models have demonstrated excellent performance on a diverse set of computer vision applications ranging from classification to segmentation on various data modalities such as images, videos, and 3D data. The goal of this workshop is to bring together computer vision and machine learning researchers working towards advancing the theory, architecture, and algorithmic design for vision transformer models, as well as the practitioners utilizing transformer models for novel applications and use cases.
The workshop’s motivation is to narrow the gap between research advances in transformer design and the applications that use transformers for various computer vision tasks. The workshop also aims to broaden the adoption of transformer models in industrial vision applications. We are interested in papers reporting experimental results on the use of transformers for any computer vision application, the challenges encountered, and the mitigation strategies, on topics including but not limited to image classification, object detection, segmentation, human-object interaction detection, and scene understanding based on 3D, video, and multimodal inputs.
Thu 11:00 p.m. - 11:10 p.m.
Opening Remarks

Thu 11:10 p.m. - 11:40 p.m.
[1st Invited Talk] Ming-Hsuan Yang

Thu 11:40 p.m. - 11:55 p.m.
CLUDA: Contrastive Learning in Unsupervised Domain Adaptation for Semantic Segmentation ([1st] Oral Presentation)
In this work, we propose CLUDA, a simple, yet novel method for performing unsupervised domain adaptation (UDA) for semantic segmentation by incorporating contrastive losses into a student-teacher learning paradigm, that makes use of pseudo-labels generated from the target domain by the teacher network. More specifically, we extract a multi-level fused-feature map from the encoder, and apply contrastive loss across different classes and different domains, via source-target mixing of images. We consistently improve performance on various feature encoder architectures and for different domain adaptation datasets in semantic segmentation. Furthermore, we introduce a learned-weighted contrastive loss to improve upon a state-of-the-art multi-resolution training approach in UDA. We produce state-of-the-art results on the GTA → Cityscapes (74.4 mIOU, +0.6) and Synthia → Cityscapes (67.2 mIOU, +1.4) datasets. CLUDA effectively demonstrates contrastive learning in UDA as a generic method, which can be easily integrated into any existing UDA for semantic segmentation tasks. Please refer to the supplementary material for the details on implementation.
Midhun Vayyat · Kasi Jaswin · Anuraag Bhattacharya · Shuaib Ahmed · Rahul Tallamraju

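To make the class-wise, cross-domain contrastive term described above concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss over fused encoder features labeled by ground truth (source) or teacher pseudo-labels (target). The function name, temperature, and per-anchor averaging are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def class_contrastive_loss(feats, labels, temperature=0.1):
    """feats: (N, D) fused encoder features from both domains;
    labels: (N,) class ids (ground truth for source, pseudo-labels for target).
    Pulls same-class features together and pushes different classes apart."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature                      # (N, N) similarities
    n = feats.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(eye, float('-inf'))                  # ignore self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & ~eye     # same class, either domain
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return per_anchor[pos_mask.any(dim=1)].mean()              # skip anchors without positives

# toy usage: six pooled feature vectors, two per class
feats = torch.randn(6, 128)
labels = torch.tensor([0, 0, 1, 1, 2, 2])
print(class_contrastive_loss(feats, labels))
```
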
Thu 11:40 p.m. - 1:10 a.m.
[1st] Oral Presentation

Thu 11:55 p.m. - 12:10 a.m.
PatchBlender: A Motion Prior for Video Transformers ([1st] Oral Presentation)
Transformers have become one of the dominant architectures in the field of computer vision. However, there are still several challenges when applying such architectures to video data. Most notably, these models struggle to model the temporal patterns of video data effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method is successful at enabling vision transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, we show that our method improves the performance of a ViT-B. PatchBlender has the advantage of being compatible with almost any Transformer architecture, and since it is learnable, the model can adaptively turn the prior on or off. It is also extremely lightweight compute-wise, requiring only 0.005% of the GFLOPs of a ViT-B.
Gabriele Prato · Yale Song · Janarthanan Rajendran · R Devon Hjelm · Neel Joshi · Sarath Chandar

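A learnable blending of patch embeddings along the temporal axis can be illustrated with a small PyTorch sketch. The identity initialization and softmax normalization below are assumptions chosen for illustration; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class PatchBlenderSketch(nn.Module):
    """Rough sketch of a learnable temporal blending of patch embeddings.
    Input x: (B, T, N, D) = batch, frames, patches per frame, embedding dim.
    A learnable T x T matrix mixes each patch's embedding across frames;
    initializing it to the identity lets the model start as a plain ViT
    and learn how much temporal smoothing to apply."""
    def __init__(self, num_frames: int):
        super().__init__()
        self.blend = nn.Parameter(torch.eye(num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # softmax keeps each output frame a convex combination of input frames
        w = self.blend.softmax(dim=-1)                  # (T, T)
        return torch.einsum('st,btnd->bsnd', w, x)      # blend along the time axis

x = torch.randn(2, 8, 196, 768)   # 2 clips, 8 frames, 14x14 patches, ViT-B dim
print(PatchBlenderSketch(num_frames=8)(x).shape)        # torch.Size([2, 8, 196, 768])
```
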
Fri 12:10 a.m. - 12:25 a.m.
Bi-Directional Self-Attention for Vision Transformers ([1st] Oral Presentation)
Self-Attention (SA) maps a set of key-value pairs to an output by aggregating information from each pair according to its compatibility with a query. This allows SA to aggregate surrounding context (represented by key-value pairs) around a specific source (e.g. a query). Critically however, this process cannot also refine a source (e.g. a query) based on the surrounding context (e.g. key-value pairs). We address this limitation by inverting the way key-value pairs and queries are processed. We propose Inverse Self-Attention (ISA), which instead maps a query (source) to an output based on its compatibility with a set of key-value pairs (scene). Leveraging the inherent complementary nature of ISA and SA, we further propose Bi-directional Self-Attention (BiSA), an attention layer that couples SA and ISA by convexly combining their outputs. BiSA can be easily adapted into any existing transformer architecture to improve the expressibility of attention layers. We showcase this flexibility by extensively studying the effects of BiSA on CIFAR100[1], ImageNet1K[2], and ADE20K[3], and extend the Swin Transformer[4] and LeViT[5] with BiSA, and observe substantial improvements.
George Stoica · Taylor Hearn · Bhavika Devnani · Judy Hoffman

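One way to picture the convex coupling of SA with an inverted attention is the sketch below, where the "inverse" branch simply normalizes the attention logits over the query axis so that key-value pairs distribute information to queries. The ISA formulation and the learnable mixing weight here are assumptions made for illustration, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSASketch(nn.Module):
    """Sketch of bi-directional self-attention: a convex combination of
    standard attention (rows of the attention matrix sum to 1, queries gather
    from keys) and an inverted variant (columns sum to 1, keys distribute
    their values to compatible queries)."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.alpha = nn.Parameter(torch.zeros(1))   # learnable mixing logit

    def forward(self, x):                           # x: (B, N, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, N, N)
        sa = F.softmax(logits, dim=-1) @ v          # normalize over keys
        isa = F.softmax(logits, dim=-2) @ v         # normalize over queries
        a = torch.sigmoid(self.alpha)               # keeps the combination convex
        return self.proj(a * sa + (1 - a) * isa)

x = torch.randn(2, 196, 256)
print(BiSASketch(256)(x).shape)                     # torch.Size([2, 196, 256])
```
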
Fri 12:25 a.m. - 12:40 a.m.
Video based Object 6D Pose Estimation using Transformers ([1st] Oral Presentation)
We introduce a Transformer based 6D Object Pose Estimation framework VideoPose, comprising an end-to-end attention based modelling architecture, that attends to previous frames in order to estimate accurate 6D Object Poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, along with being computationally efficient and robust. Compared to existing methods, our architecture is able to capture and reason from long-range dependencies efficiently, thus iteratively refining over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art Transformer methods, and performs significantly better relative to CNN based approaches. Further, with a speed of 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at https://anonymous.4open.science/r/VideoPose-3C8C.
Apoorva Beedu · Huda Alamri · Irfan Essa

Fri 12:40 a.m. - 12:55 a.m.
End-to-end Multimodal Representation Learning for Video Dialog ([1st] Oral Presentation)
The video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art models setting new performance records. This progress is largely powered by the adaptation of more powerful transformer-based language encoders. Despite this progress, existing approaches do not effectively utilize visual features to help solve the task. Recent studies show that state-of-the-art models are biased towards textual information rather than visual cues. In order to better leverage the available visual information, this study proposes a new framework that combines a 3D-CNN network and transformer-based networks into a single visual encoder to extract more robust semantic representations from videos. The visual encoder is jointly trained end-to-end with other input modalities such as text and audio. Experiments on the AVSD task show significant improvement over baselines in both generative and retrieval tasks.
Huda Alamri · Apoorva Beedu · Irfan Essa · Anthony Bilic · Michael Hu

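A generic way to combine 3D-CNN clip features with transformer patch tokens into a single visual encoder is sketched below. The projection dimensions, fusion by concatenation, and the small fusion encoder are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class JointVisualEncoderSketch(nn.Module):
    """Sketch of fusing motion features from a 3D-CNN with appearance tokens
    from a video transformer into one sequence of visual tokens, which could
    then be fed, together with text/audio tokens, to a multimodal decoder."""
    def __init__(self, cnn_dim=2048, vit_dim=768, d_model=512):
        super().__init__()
        self.proj_cnn = nn.Linear(cnn_dim, d_model)
        self.proj_vit = nn.Linear(vit_dim, d_model)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)

    def forward(self, cnn_feats, vit_tokens):
        # cnn_feats: (B, T, cnn_dim) clip-level 3D-CNN features
        # vit_tokens: (B, N, vit_dim) patch tokens from a video transformer
        tokens = torch.cat([self.proj_cnn(cnn_feats),
                            self.proj_vit(vit_tokens)], dim=1)
        return self.fuse(tokens)                    # (B, T + N, d_model)

enc = JointVisualEncoderSketch()
print(enc(torch.randn(2, 8, 2048), torch.randn(2, 64, 768)).shape)
```
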
Fri 12:55 a.m. - 1:10 a.m.
Continual Transformers: Redundancy-Free Attention for Online Inference ([1st] Oral Presentation)
Transformers in their common form are inherently limited to operate on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the overlap in successive token sequences. In this work, we propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference on a continual input stream. Importantly, our modifications are purely to the order of computations, while the outputs and learned weights are identical to those of the original Transformer Encoder. We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: our Continual one- and two-block architectures reduce the floating point operations per prediction by up to 63x and 2.6x, respectively, while retaining predictive performance.
Lukas Hedegaard · Arian Bakhtiarnia · Alexandros Iosifidis

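To illustrate why token-by-token online inference avoids redundant work, here is a generic single-query attention over a rolling key/value cache. This is only a caching sketch under assumed names; it is not the paper's continual reformulation, which additionally guarantees outputs identical to full-sequence attention.

```python
import torch
import torch.nn.functional as F
from collections import deque

class RollingAttentionSketch:
    """Each new token attends over the last `window` cached tokens instead of
    recomputing attention for the entire sequence at every step."""
    def __init__(self, window: int):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def step(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (D,) projections of the newest token
        self.keys.append(k)
        self.values.append(v)
        K = torch.stack(list(self.keys))                # (t, D)
        V = torch.stack(list(self.values))              # (t, D)
        attn = F.softmax(q @ K.t() / q.shape[-1] ** 0.5, dim=-1)
        return attn @ V                                 # (D,) output for the newest token

att = RollingAttentionSketch(window=64)
for _ in range(5):
    out = att.step(torch.randn(256), torch.randn(256), torch.randn(256))
print(out.shape)                                        # torch.Size([256])
```
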
Fri 1:10 a.m. - 1:40 a.m.
Break (1st Break)

Fri 1:40 a.m. - 2:30 a.m.
On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition ([1st] Poster Session)
Recently, vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video, where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and SomethingSomething-V2) and perform thorough analysis and ablation studies to explain this observation using the predominant features of video transformer architectures. We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well. Our experiments inform our recommendation that future work on semi-supervised video learning should consider the use of video transformers.
Farrukh Rahman · Ömer Mubarek · Zsolt Kira

Fri 1:40 a.m. - 2:30 a.m.
Fully-attentive and interpretable: vision and video vision transformers for pain detection ([1st] Poster Session)
Pain is a serious and costly issue globally, but to be treated, it must first be detected. Vision transformers are a top-performing architecture in computer vision, with little research on their use for pain detection. In this paper, we propose the first fully-attentive automated pain detection pipeline that achieves state-of-the-art performance on binary pain detection from facial expressions. The model is trained on the UNBC-McMaster dataset, after faces are 3D-registered and rotated to the canonical frontal view. In our experiments we identify important areas of the hyperparameter space and their interaction with vision and video vision transformers, obtaining 3 noteworthy models. We analyse the attention maps of one of our models, finding reasonable interpretations for its predictions. We also evaluate Mixup, an augmentation technique, and Sharpness-Aware Minimization, an optimizer, with no success. Our presented models, ViT-1 (F1 score 0.55 ± 0.15), ViViT-1 (F1 score 0.55 ± 0.13), and ViViT-2 (F1 score 0.49 ± 0.04), all outperform earlier works, showing the potential of vision transformers for pain detection. The code will be available upon acceptance.
Giacomo Fiorentini · Itir Onal Ertugrul · Albert Ali Salah

Fri 1:40 a.m. - 2:30 a.m.
DynamicViT: Making Vision Transformer faster through layer skipping ([1st] Poster Session)
The recent deep learning breakthroughs in language and vision tasks can be mainly attributed to large-scale transformers. Unfortunately, their massive size and high compute requirements have limited their use in resource-constrained environments. Dynamic neural networks could potentially reduce the amount of compute required by dynamically adjusting the computational path based on the input. We propose a layer-skipping dynamic transformer network that skips layers for each sample based on decisions given by a reinforcement learning agent. Extensive experiments on CIFAR-10 and CIFAR-100 show that this dynamic ViT model gains an average 40% speed increase, evaluated on batch sizes ranging from 1 to 1024.
Amanuel Mersha · Samuel Assefa

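Per-sample layer skipping can be pictured as a gate wrapped around each transformer block, as in the minimal sketch below. The hard threshold stands in for the reinforcement learning agent's action; the gating head, threshold, and CLS-token input are assumptions, and training the agent (e.g. with a reward trading off accuracy against compute) is not shown.

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Wraps a transformer block with a small policy head that decides, per
    sample, whether to run the block or pass the input through unchanged."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        self.policy = nn.Linear(dim, 1)                 # skip score from the CLS token

    def forward(self, x):                               # x: (B, N, D), token 0 = CLS
        execute = (self.policy(x[:, 0]).sigmoid() > 0.5).squeeze(-1)   # (B,)
        out = x.clone()
        if execute.any():
            out[execute] = self.block(x[execute])       # run the block only where chosen
        return out

block = nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
layer = SkippableBlock(block, dim=192)
print(layer(torch.randn(4, 65, 192)).shape)             # torch.Size([4, 65, 192])
```
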
Fri 1:40 a.m. - 2:30 a.m.
FQDet: Fast-converging Query-based Detector ([1st] Poster Session)
Recently, two-stage Deformable DETR introduced the query-based two-stage head, a new type of two-stage head different from the region-based two-stage heads of classical detectors such as Faster R-CNN. In query-based two-stage heads, the second stage selects one feature per detection processed by a transformer, called the query, as opposed to pooling a rectangular grid of features processed by CNNs as in region-based detectors. In this work, we improve the query-based head by improving the prior of the cross-attention operation with anchors, significantly speeding up convergence while increasing performance. Additionally, we empirically show that by improving the cross-attention prior, the auxiliary losses and iterative bounding box mechanisms typically used by DETR-based detectors are no longer needed. By combining the best of both the classical and the DETR-based detectors, our FQDet head peaks at 45.4 AP on the 2017 COCO validation set when using a ResNet-50+TPN backbone, after training for only 12 epochs using the 1x schedule. We outperform other high-performing two-stage heads such as Cascade R-CNN, while using the same backbone and while being computationally cheaper. Additionally, when using the large ResNeXt-101-DCN+TPN backbone and multi-scale testing, our FQDet head achieves 52.9 AP on the 2017 COCO test-dev set after only 12 epochs of training. Code will be released.
Cédric Picron · Punarjay Chakravarty · Tinne Tuytelaars

Fri 1:40 a.m. - 2:30 a.m.
[1st] Poster Session

Fri 2:30 a.m. - 3:00 a.m.
[2nd Invited Talk] Cordelia Schmid

Fri 3:00 a.m. - 3:30 a.m.
[3rd Invited Talk] Rita Cucchiara

Fri 3:30 a.m. - 3:45 a.m.
Matryoshka Representations for Adaptive Deployment ([2nd] Oral Presentation)
Learned representations are a central component in modern ML systems, serving a multitude of downstream tasks. When training such representations, it is often the case that computational and statistical constraints for each downstream task are unknown. In this context, rigid fixed-capacity representations can be either over- or under-accommodating to the task at hand. This leads us to ask: can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources? Our main contribution is Matryoshka Representation Learning (MRL), which encodes information at different granularities and allows a single embedding to adapt to the computational constraints of downstream tasks. MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations. The flexibility within the learned Matryoshka Representations offers: (a) up to 14× smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14× real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities – vision (ViT, ResNet), vision + language (ALIGN) and language (BERT).
Aniket Rege · Aditya Kusupati · Gantavya Bhatt · Matthew Wallingford · Aditya Sinha · Vivek Ramanujan · William Howard-Snyder · Kaifeng Chen · Sham Kakade · Prateek Jain · Ali Farhadi

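The coarse-to-fine supervision of nested embedding prefixes can be sketched in a few lines of PyTorch: the same embedding is trained with one classification head per prefix size, so that any prefix can later be used on its own. The specific granularities and equal loss weights below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaHeadSketch(nn.Module):
    """One linear classifier per nested prefix of the embedding; the training
    loss is the sum of cross-entropy terms over all granularities."""
    def __init__(self, dim=2048, num_classes=1000, granularities=(64, 256, 1024, 2048)):
        super().__init__()
        self.granularities = granularities
        self.heads = nn.ModuleList(nn.Linear(m, num_classes) for m in granularities)

    def loss(self, z, y):                               # z: (B, dim), y: (B,)
        # supervise every nested prefix z[:, :m] of the same embedding
        return sum(F.cross_entropy(head(z[:, :m]), y)
                   for m, head in zip(self.granularities, self.heads))

head = MatryoshkaHeadSketch()
z, y = torch.randn(8, 2048), torch.randint(0, 1000, (8,))
print(head.loss(z, y))
```
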
Fri 3:30 a.m. - 4:00 a.m.
[2nd] Oral Presentation

Fri 3:45 a.m. - 4:00 a.m.
TPFNet: A Novel Text In-painting Transformer for Text Removal ([2nd] Oral Presentation)
Text erasure from an image is helpful for various tasks such as image editing and privacy preservation. In this paper, we present TPFNet, a novel one-stage (end-to-end) network for text removal from images. Our network has two parts. Since noise can be more effectively removed from low-resolution images, part 1 operates on low-resolution images. The output of part 1 is a low-resolution text-free image. Part 2 uses the features learned in part 1 to predict a high-resolution text-free image. In part 1, we use "pyramidal vision transformer" (PVT) as the encoder. Further, we use a novel multi-headed decoder that generates a high-pass filtered image and a segmentation map, in addition to a text-free image. The segmentation branch helps locate the text precisely, and the high-pass branch helps in learning the image structure. To precisely locate the text, TPFNet employs an adversarial loss that is conditional on the segmentation map rather than the input image. On Oxford, SCUT, and SCUT-EnsText datasets, our network outperforms recently proposed networks on nearly all the metrics.
Onkar Susladkar · Dhruv Makwana · Gayatri Deshmukh · Sparsh Mittal · Sai Chandra Teja R · Rekha Singhal

Fri 4:00 a.m. - 4:30 a.m.
[4th Invited Talk] Kristen Grauman

Fri 4:30 a.m. - 5:00 a.m.
[5th Invited Talk] Laura Leal-Taixé

Fri 5:00 a.m. - 5:10 a.m.
Coffee Break

Fri 5:10 a.m. - 5:50 a.m.
PatchRot: A Self-Supervised Technique for Training Vision Transformers ([2nd] Poster Session)
Vision transformers require a huge amount of labeled data to outperform convolutional neural networks. However, labeling a huge dataset is a very expensive process. Self-supervised learning techniques alleviate this problem by learning features similar to supervised learning in an unsupervised way. In this paper, we propose PatchRot, a self-supervised technique crafted for vision transformers. PatchRot rotates images and image patches and trains the network to predict the rotation angles. The network learns to extract both global and local features from an image. Our extensive experiments on different datasets show that PatchRot training learns rich features which outperform supervised learning and the compared baselines.
Sachin Chhabra · Prabal Bijoy Dutta · Hemanth Venkateswara · Baoxin Li

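The rotation pretext task described above is easy to illustrate as a data transform that rotates the whole image and each patch independently and returns the rotation classes as prediction targets. The patch size and the label layout in this sketch are illustrative assumptions.

```python
import torch

def patchrot_targets(img: torch.Tensor, patch: int = 8):
    """Rotate the whole image by a random multiple of 90 degrees, then rotate
    each non-overlapping patch independently; return the patches together with
    the global and per-patch rotation labels for the network to predict."""
    c, h, w = img.shape
    img_rot = torch.randint(0, 4, ())                   # global rotation class
    img = torch.rot90(img, int(img_rot), dims=(1, 2))
    # split into non-overlapping patches: (num_patches, C, patch, patch)
    patches = (img.unfold(1, patch, patch).unfold(2, patch, patch)
                  .permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch))
    patch_rot = torch.randint(0, 4, (patches.size(0),)) # per-patch rotation classes
    patches = torch.stack([torch.rot90(p, int(r), dims=(1, 2))
                           for p, r in zip(patches, patch_rot)])
    return patches, img_rot, patch_rot

patches, img_rot, patch_rot = patchrot_targets(torch.randn(3, 32, 32))
print(patches.shape, int(img_rot), patch_rot.shape)     # (16, 3, 8, 8), label, (16,)
```
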
Fri 5:10 a.m. - 5:50 a.m.
Multimodal Transformer for Parallel Concatenated Variational Autoencoders ([2nd] Poster Session)
In this paper, we propose a multimodal transformer using a parallel concatenated architecture. Instead of using patches, we use column stripes for images in the R, G, B channels as the transformer input. The column stripes keep the spatial relations of the original image. We incorporate the multimodal transformer with a variational autoencoder for synthetic cross-modal data generation. The multimodal transformer is designed using multiple compression matrices, and it serves as encoders for Parallel Concatenated Variational AutoEncoders (PC-VAE). The PC-VAE consists of multiple encoders, one latent space, and two decoders. The encoders are based on random Gaussian matrices and do not need any training. We propose a new loss function based on the interaction information from partial information decomposition. The interaction information evaluates the cross-modal input information and the decoder output. The PC-VAE is trained by minimizing the loss function. Experiments are performed to validate the proposed multimodal transformer for PC-VAE.
Stephen Liang · Jerry Mendel

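The column-stripe tokenization mentioned above can be sketched in a few lines: each token is one image column from one of the R, G, B channels, which preserves the vertical ordering of pixels within a token. The one-pixel stripe width is an assumption for illustration.

```python
import torch

def column_stripe_tokens(img: torch.Tensor) -> torch.Tensor:
    """Turn a (C, H, W) image into C * W tokens of dimension H, one token per
    (channel, column) stripe, instead of square patches."""
    c, h, w = img.shape                                  # e.g. (3, 64, 64)
    # (C, H, W) -> (C, W, H): one token per column, embedding dimension = H
    return img.permute(0, 2, 1).reshape(c * w, h)

tokens = column_stripe_tokens(torch.randn(3, 64, 64))
print(tokens.shape)                                      # torch.Size([192, 64])
```
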
Fri 5:10 a.m. - 5:50 a.m.
Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets ([2nd] Poster Session)
Vision Transformers have attracted a lot of attention recently since the successful implementation of Vision Transformer (ViT) on vision tasks. With vision Transformers, specifically the multi-head self-attention modules, networks can capture long-term dependencies inherently. However, these attention modules normally need to be trained on large datasets, and vision Transformers show inferior performance on small datasets when trained from scratch compared with widely dominant backbones like ResNets. Note that the Transformer model was first proposed for natural language processing, which carries denser information than natural images. To boost the performance of vision Transformers on small datasets, this paper proposes to explicitly increase the input information density in the frequency domain. Specifically, we select channels by calculating channel-wise heatmaps in the frequency domain using the Discrete Cosine Transform (DCT), reducing the size of the input while keeping most of the information and hence increasing the information density. As a result, 25% fewer channels are kept while better performance is achieved compared with previous work. Extensive experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets, including CIFAR-10/100, SVHN, Flowers-102, and Tiny ImageNet. The accuracy is boosted by up to 17.05% with Swin and Focal Transformers.
Xiangyu Chen · Ying Qin · Wenju Xu · Andrés Bur · Cuncong Zhong · Guanghui Wang

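The general idea of frequency-domain input channels can be sketched with a block-wise 2D DCT that turns each spatial block into a stack of frequency channels at reduced resolution; the channel-wise heatmap used to keep only the most informative channels is not shown, and the block size is an assumption.

```python
import math
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis of size n x n."""
    k = torch.arange(n).float()
    basis = torch.cos(math.pi / n * (k[None, :] + 0.5) * k[:, None])
    basis[0] *= 1 / math.sqrt(2)
    return basis * math.sqrt(2 / n)

def frequency_channels(img: torch.Tensor, block: int = 8) -> torch.Tensor:
    """Block-wise 2D DCT: a (C, H, W) image becomes C * block^2 frequency
    channels of spatial size (H/block, W/block); a selection step (not shown)
    would then keep only the most informative channels."""
    c, h, w = img.shape
    d = dct_matrix(block)
    patches = img.unfold(1, block, block).unfold(2, block, block)  # (C, H/b, W/b, b, b)
    coeffs = d @ patches @ d.t()                          # 2D DCT of every block
    # each frequency becomes its own channel
    return coeffs.permute(0, 3, 4, 1, 2).reshape(c * block * block, h // block, w // block)

print(frequency_channels(torch.randn(3, 32, 32)).shape)   # torch.Size([192, 4, 4])
```
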
Fri 5:10 a.m. - 5:50 a.m.
Learning Explicit Object-Centric Representations with Vision Transformers ([2nd] Poster Session)
With the recent successful adaptation of transformers to the vision domain, particularly when trained in a self-supervised fashion, it has been shown that vision transformers can learn impressive object-reasoning-like behaviour and features expressive for the task of object segmentation in images. In this paper, we build on the self-supervision task of masked autoencoding and explore its effectiveness for explicitly learning object-centric representations with transformers. To this end, we design an object-centric autoencoder using transformers only and train it end-to-end to reconstruct full images from unmasked patches. We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
Oscar Vikström · Alexander Ilin

Fri 5:10 a.m. - 5:50 a.m.
[2nd] Poster Session

Fri 5:50 a.m. - 6:00 a.m.
Best Paper Announcement and Closing Remarks

Author Information
Fahad Shahbaz Khan (Inception Institute of Artificial Intelligence)
Gul Varol (Ecole des Ponts ParisTech)
Salman Khan (MBZ University of AI)
Ping Luo (The University of Hong Kong)
Rao Anwer (Mohamed bin Zayed University of Artificial Intelligence)
Ashish Vaswani (Google Brain)
Hisham Cholakkal (MBZUAI)
Niki Parmar (Google)
Joost van de Weijer (Computer Vision Center Barcelona)
Mubarak Shah (University of Central Florida)