
A Unified Sequence Interface for Vision Tasks
Ting Chen · Saurabh Saxena · Lala Li · Tsung-Yi Lin · David Fleet · Geoffrey Hinton

Tue Nov 29 09:00 AM -- 11:00 AM (PST) @ Hall J #526

While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite these differences, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as the task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.
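To make the pixel-to-sequence idea concrete, here is a minimal sketch of how a continuous output such as a bounding box can be serialized into discrete tokens and prefixed with a task prompt. The bin count, token-id layout, and all names (`quantize`, `box_to_tokens`, `PROMPT_DETECT`) are illustrative assumptions, not taken from the paper's released code.

```python
# Sketch of a pixel-to-sequence interface: task outputs become discrete tokens.
# All constants and names below are hypothetical, chosen for illustration only.

NUM_BINS = 1000  # assumed quantization resolution for pixel coordinates


def quantize(coord, image_size, num_bins=NUM_BINS):
    """Map a continuous pixel coordinate to a discrete bin index (token id)."""
    coord = min(max(coord, 0.0), float(image_size))
    return int(coord / image_size * (num_bins - 1))


def box_to_tokens(box, class_id, image_size):
    """Serialize one detection as [ymin, xmin, ymax, xmax, class] tokens.

    Class tokens are offset past the coordinate-bin vocabulary so the two
    token types never collide.
    """
    ymin, xmin, ymax, xmax = box
    coords = [quantize(v, image_size) for v in (ymin, xmin, ymax, xmax)]
    return coords + [NUM_BINS + class_id]


# A short prompt token selects the task; the model's output sequence then
# follows the corresponding format. The reserved id here is an assumption.
PROMPT_DETECT = [NUM_BINS + 1000]

seq = PROMPT_DETECT + box_to_tokens(
    box=(12.0, 40.5, 200.0, 310.0), class_id=3, image_size=640
)
```

Because every task's output lives in the same discrete vocabulary, a single autoregressive decoder with a standard cross-entropy loss can, in principle, serve detection, segmentation, keypoints, and captioning alike.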

Author Information

Ting Chen (Google Brain)
Saurabh Saxena (Google)
Lala Li (Google)
Tsung-Yi Lin (Google Brain)
I am a senior research scientist at Nvidia Research. I was previously at Google Research, Brain Team. I work on computer vision and machine learning. I did my PhD at Cornell University and Cornell Tech, where I was advised by Serge Belongie. I did my masters at the University of California, San Diego and my bachelors at National Taiwan University. I led the creation of the COCO dataset and received the Best Student Paper Award for Focal Loss at ICCV 2017.

David Fleet (Google Research, Brain Team and University of Toronto)
Geoffrey Hinton (Google & University of Toronto)