NeurIPS 2019 Expo Workshop

Dec. 8, 2019


Multi-modal Research to Production with PyTorch and Facebook

Sponsor: Facebook

Abstract:

Content at Facebook, and more broadly, continues to grow in diversity and spans a number of modalities (text, audio, video, etc.). For example, an ad may contain multiple components, including an image, body text, a title, video, and landing pages. Even an individual component may be multimodal: a video contains visual and audio signals, and a landing page is composed of images, text, HTML source, etc. This workshop will dive into a number of modalities, such as computer vision (large-scale image classification and instance segmentation) and translation and speech (seq-to-seq Transformers), through the lens of taking cutting-edge research to production. Lastly, we will walk through how to use the latest APIs in PyTorch to take models developed in eager mode into graph mode via TorchScript and quantize them for production deployment at scale on servers or mobile devices.
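
As a rough illustration of the eager-to-graph-mode path described above, the following minimal sketch quantizes a small model and compiles it with TorchScript. It is not taken from the workshop materials: `TinyClassifier`, its layer sizes, and the choice of dynamic quantization (one of several quantization options in PyTorch) are assumptions made only for this example.

```python
import io

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a model developed in eager mode."""

    def __init__(self, in_features: int = 16, num_classes: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 32)
        self.fc2 = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))


# Start from an ordinary eager-mode model in inference mode.
model = TinyClassifier().eval()

# Assumption: dynamic quantization, which stores nn.Linear weights as int8
# and quantizes activations on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Compile the quantized model into a TorchScript (graph mode) module.
scripted = torch.jit.script(quantized)

# Serialize and reload, as a deployment target would. An in-memory buffer
# stands in for a file on disk.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
loaded = torch.jit.load(buffer)

example = torch.randn(2, 16)
with torch.no_grad():
    out = loaded(example)
```

The reloaded module no longer depends on the Python class definition, which is what makes the serialized artifact suitable for server or mobile runtimes. Models with convolutions would typically use static quantization with calibration instead, which this sketch does not cover.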

Libraries used:

  • PyTorch - a popular deep learning framework for research to production.
  • Classy Vision - a newly open-sourced PyTorch framework developed by Facebook AI for research on large-scale image and video classification. Classy Vision allows researchers to quickly prototype and iterate on large distributed training jobs. Models built on Classy Vision can be seamlessly deployed to production, and Classy Vision powers the next generation of classification models in production at Facebook.
  • Detectron2 - the recently released object detection library built by the FAIR computer vision team. We will describe the improvements over the previous version, including: 1) support for the latest models and new tasks; 2) increased flexibility, to enable new computer vision research; 3) better maintainability and scalability, to support production use cases.
  • Fairseq - a general-purpose sequence-to-sequence library that can be used in many applications, including (unsupervised) translation, summarization, dialog, and speech recognition.