Timezone: »

Semi-supervised Vision Transformers at Scale
Zhaowei Cai · Avinash Ravichandran · Paolo Favaro · Manchen Wang · Davide Modolo · Rahul Bhotika · Zhuowen Tu · Stefano Soatto

Wed Nov 30 09:00 AM -- 11:00 AM (PST) @ Hall J #141

We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we use a SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs that can be readily scaled up to large-size models with increasing accuracies. For example, Semi-ViT-Huge achieves an impressive 80\% top-1 accuracy on ImageNet using only 1\% labels, which is comparable with Inception-v4 using 100\% ImageNet labels. The code is available at https://github.com/amazon-research/semi-vit.

Author Information

Zhaowei Cai (Amazon)
Avinash Ravichandran (AWS)
Paolo Favaro (University of Bern)
Manchen Wang (Amazon)
Davide Modolo (Amazon)
Rahul Bhotika (Optum Labs)
Zhuowen Tu (University of California, San Diego)
Stefano Soatto (UCLA)

Stefano Soatto received his Ph.D. in Control and Dynamical Systems from the California Institute of Technology in 1996; he joined UCLA in 2000 after being Assistant and then Associate Professor of Electrical Engineering and Biomedical Engineering at Washington University, and Research Associate in Applied Sciences at Harvard University. Between 1995 and 1998 he was also Ricercatore in the Department of Mathematics and Computer Science at the University of Udine - Italy. He received his D.Ing. degree (highest honors) from the University of Padova- Italy in 1992. His general research interests are in Computer Vision and Nonlinear Estimation and Control Theory. In particular, he is interested in ways for computers to use sensory information to interact with humans and the environment. Dr. Soatto is the recipient of the David Marr Prize for work on Euclidean reconstruction and reprojection up to subgroups. He also received the Siemens Prize with the Outstanding Paper Award from the IEEE Computer Society for his work on optimal structure from motion. He received the National Science Foundation Career Award and the Okawa Foundation Grant. He is a Member of the Editorial Board of the International Journal of Computer Vision (IJCV) and Foundations and Trends in Computer Graphics and Vision. He is the founder and director of the UCLA Vision Lab; more information is available at http://vision.ucla.edu

More from the Same Authors