While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Nevertheless, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as the task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.
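To make the idea of a sequence interface concrete, the sketch below shows one way a detection output could be turned into discrete tokens and prefixed with a task prompt. The abstract does not specify the tokenization, so everything here is an assumption for illustration: the number of coordinate bins (NUM_BINS), the vocabulary layout (COORD_OFFSET, CLASS_OFFSET), the prompt token ids (TASK_PROMPTS), and the (ymin, xmin, ymax, xmax) box order are all hypothetical and may differ from the authors' implementation.

```python
from typing import List, Sequence, Tuple

# All constants below are illustrative assumptions, not values from the paper.
NUM_BINS = 1000                              # number of bins per coordinate
TASK_PROMPTS = {"detect": 0, "caption": 1}   # hypothetical prompt token ids
COORD_OFFSET = 10                            # first id reserved for coordinate tokens
CLASS_OFFSET = COORD_OFFSET + NUM_BINS       # class tokens follow coordinate tokens

Box = Tuple[float, float, float, float]      # (ymin, xmin, ymax, xmax) in pixels


def quantize(value: float, size: float) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, NUM_BINS - 1]."""
    frac = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(frac * NUM_BINS), NUM_BINS - 1)


def box_to_tokens(box: Box, class_id: int, width: float, height: float) -> List[int]:
    """Encode one box as four coordinate tokens followed by one class token."""
    ymin, xmin, ymax, xmax = box
    bins = [
        quantize(ymin, height),
        quantize(xmin, width),
        quantize(ymax, height),
        quantize(xmax, width),
    ]
    return [COORD_OFFSET + b for b in bins] + [CLASS_OFFSET + class_id]


def detection_sequence(
    boxes: Sequence[Box], class_ids: Sequence[int], width: float, height: float
) -> List[int]:
    """Build a token sequence: the task prompt, then the tokens of each object."""
    tokens = [TASK_PROMPTS["detect"]]
    for box, class_id in zip(boxes, class_ids):
        tokens.extend(box_to_tokens(box, class_id, width, height))
    return tokens


if __name__ == "__main__":
    # One box covering the top-left quarter of a 200x100 image, class id 3.
    print(detection_sequence([(0.0, 0.0, 50.0, 100.0)], [3], width=200.0, height=100.0))
```

Under this kind of encoding, other outputs (mask polygons, keypoints, caption words) would simply be different token layouts after a different prompt token, which is why a single architecture and a single token-level loss could in principle cover all four tasks.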
Author Information
Ting Chen (Google Brain)
Saurabh Saxena (Google)
Lala Li (Google)
Tsung-Yi Lin (Google Brain)

I am a senior research scientist at Nvidia Research. I was previously at Google Research, Brain Team. I work on computer vision and machine learning. I did my PhD at Cornell University and Cornell Tech, where I was advised by Serge Belongie. I did my master's at the University of California, San Diego and my bachelor's at National Taiwan University. I led the creation of the COCO dataset and received the Best Student Paper Award for Focal Loss at ICCV 2017.
David Fleet (Google Research, Brain Team and University of Toronto)
Geoffrey Hinton (Google & University of Toronto)
More from the Same Authors
-
2021 Spotlight: Neural Additive Models: Interpretable Machine Learning with Neural Nets »
Rishabh Agarwal · Levi Melnick · Nicholas Frosst · Xuezhou Zhang · Ben Lengerich · Rich Caruana · Geoffrey Hinton -
2021 Spotlight: Revisiting ResNets: Improved Training and Scaling Strategies »
Irwan Bello · William Fedus · Xianzhi Du · Ekin Dogus Cubuk · Aravind Srinivas · Tsung-Yi Lin · Jonathon Shlens · Barret Zoph -
2021 : Palette: Image-to-Image Diffusion Models »
Chitwan Saharia · William Chan · Huiwen Chang · Chris Lee · Jonathan Ho · Tim Salimans · David Fleet · Mohammad Norouzi -
2021 : Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation »
Yao Qin · Chiyuan Zhang · Ting Chen · Balaji Lakshminarayanan · Alex Beutel · Xuezhi Wang -
2022 Poster: Residual Multiplicative Filter Networks for Multiscale Reconstruction »
Shayan Shekarforoush · David Lindell · David Fleet · Marcus Brubaker -
2023 Poster: The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation »
Saurabh Saxena · Charles Herrmann · Junhwa Hur · Abhishek Kar · Mohammad Norouzi · Deqing Sun · David Fleet -
2023 Oral: The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation »
Saurabh Saxena · Charles Herrmann · Junhwa Hur · Abhishek Kar · Mohammad Norouzi · Deqing Sun · David Fleet -
2022 Spotlight: Residual Multiplicative Filter Networks for Multiscale Reconstruction »
Shayan Shekarforoush · David Lindell · David Fleet · Marcus Brubaker -
2022 Spotlight: Lightning Talks 5B-1 »
Devansh Arpit · Xiaojun Xu · Zifan Shi · Ivan Skorokhodov · Shayan Shekarforoush · Zhan Tong · Yiqun Wang · Shichong Peng · Linyi Li · Ivan Skorokhodov · Huan Wang · Yibing Song · David Lindell · Yinghao Xu · Seyed Alireza Moazenipourasil · Sergey Tulyakov · Peter Wonka · Yiqun Wang · Ke Li · David Fleet · Yujun Shen · Yingbo Zhou · Bo Li · Jue Wang · Peter Wonka · Marcus Brubaker · Caiming Xiong · Limin Wang · Deli Zhao · Qifeng Chen · Dit-Yan Yeung -
2022 : Invited Speaker »
David Fleet -
2022 Poster: Video Diffusion Models »
Jonathan Ho · Tim Salimans · Alexey Gritsenko · William Chan · Mohammad Norouzi · David Fleet -
2022 Poster: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding »
Chitwan Saharia · William Chan · Saurabh Saxena · Lala Li · Jay Whang · Remi Denton · Kamyar Ghasemipour · Raphael Gontijo Lopes · Burcu Karagol Ayan · Tim Salimans · Jonathan Ho · David Fleet · Mohammad Norouzi -
2022 Poster: Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation »
Yao Qin · Chiyuan Zhang · Ting Chen · Balaji Lakshminarayanan · Alex Beutel · Xuezhi Wang -
2021 Poster: Why Do Better Loss Functions Lead to Less Transferable Features? »
Simon Kornblith · Ting Chen · Honglak Lee · Mohammad Norouzi -
2021 Poster: Improving Contrastive Learning on Imbalanced Data via Open-World Sampling »
Ziyu Jiang · Tianlong Chen · Ting Chen · Zhangyang Wang -
2021 Poster: Intriguing Properties of Contrastive Losses »
Ting Chen · Calvin Luo · Lala Li -
2021 Poster: Canonical Capsules: Self-Supervised Capsules in Canonical Pose »
Weiwei Sun · Andrea Tagliasacchi · Boyang Deng · Sara Sabour · Soroosh Yazdani · Geoffrey Hinton · Kwang Moo Yi -
2021 Poster: Neural Additive Models: Interpretable Machine Learning with Neural Nets »
Rishabh Agarwal · Levi Melnick · Nicholas Frosst · Xuezhou Zhang · Ben Lengerich · Rich Caruana · Geoffrey Hinton -
2021 Poster: Improved Transformer for High-Resolution GANs »
Long Zhao · Zizhao Zhang · Ting Chen · Dimitris Metaxas · Han Zhang -
2021 Poster: Revisiting ResNets: Improved Training and Scaling Strategies »
Irwan Bello · William Fedus · Xianzhi Du · Ekin Dogus Cubuk · Aravind Srinivas · Tsung-Yi Lin · Jonathon Shlens · Barret Zoph -
2020 Poster: Exemplar VAE: Linking Generative Models, Nearest Neighbor Retrieval, and Data Augmentation »
Sajad Norouzi · David Fleet · Mohammad Norouzi -
2020 Poster: Rethinking Pre-training and Self-training »
Barret Zoph · Golnaz Ghiasi · Tsung-Yi Lin · Yin Cui · Hanxiao Liu · Ekin Dogus Cubuk · Quoc V Le -
2020 Oral: Rethinking Pre-training and Self-training »
Barret Zoph · Golnaz Ghiasi · Tsung-Yi Lin · Yin Cui · Hanxiao Liu · Ekin Dogus Cubuk · Quoc V Le -
2019 Poster: Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model »
Guodong Zhang · Lala Li · Zachary Nado · James Martens · Sushant Sachdeva · George Dahl · Chris Shallue · Roger Grosse -
2018 Poster: DropBlock: A regularization method for convolutional networks »
Golnaz Ghiasi · Tsung-Yi Lin · Quoc V Le -
2015 Poster: Efficient Non-greedy Optimization of Decision Trees »
Mohammad Norouzi · Maxwell Collins · Matthew A Johnson · David Fleet · Pushmeet Kohli -
2015 Poster: Grammar as a Foreign Language »
Oriol Vinyals · Łukasz Kaiser · Terry Koo · Slav Petrov · Ilya Sutskever · Geoffrey Hinton -
2013 Poster: Efficient Optimization for Sparse Gaussian Process Regression »
Yanshuai Cao · Marcus Brubaker · David Fleet · Aaron Hertzmann -
2012 Poster: Hamming Distance Metric Learning »
Mohammad Norouzi · Russ Salakhutdinov · David Fleet -
2008 Session: Oral session 7: Complex Dynamical Systems: Modeling and Estimation »
David Fleet