Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
Yang You

The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing. Better performance has come with larger model sizes, which run up against the memory wall of current accelerator hardware such as GPUs. It is never ideal to train large models such as Vision Transformer, BERT, and GPT on a single GPU or a single machine, so there is an urgent demand to train models in a distributed environment. However, distributed training, especially model parallelism, often requires domain expertise in computer systems and architecture, and it remains a challenge for AI researchers to implement complex distributed training solutions for their models. To solve this problem, we introduce Colossal-AI, a unified parallel training system designed to seamlessly integrate different parallelization paradigms, including data parallelism, pipeline parallelism, multiple tensor parallelism, and sequence parallelism. Colossal-AI aims to let the AI community write distributed models the same way they write ordinary models. This allows researchers to focus on developing the model architecture and separates the concerns of distributed training from the development process. Colossal-AI achieves a 2x speedup over state-of-the-art distributed systems for GPT model training. The source code is available at https://github.com/hpcaitech/ColossalAI

Author Information

Yang You (National University of Singapore)

More from the Same Authors

  • 2022 Poster: Random Sharpness-Aware Minimization
    Yong Liu · Siqi Mai · Minhao Cheng · Xiangning Chen · Cho-Jui Hsieh · Yang You
  • 2018: Posters (all accepted papers) + Break
    Jianyu Wang · Denis Gudovskiy · Ziheng Jiang · Michael Kaufmann · Andreea Anghel · James Bradbury · Nikolas Ioannou · Nitin Agrawal · Emma Tosch · Gyeongin Yu · Keno Fischer · Jarrett Revels · Giuseppe Siracusano · Yaoqing Yang · Jeff Johnson · Yang You · Hector Yuen · Chris Ying · Honglei Liu · Nikoli Dryden · Xiangxi Mo · Yangzihao Wang · Amit Juneja · Micah Smith · Qian Yu · pramod gupta · Deepak Narayanan · Keshav Santhanam · Tim Capes · Abdul Dakkak · Norman Mu · Ke Deng · Liam Li · Joao Carreira · Luis Remis · Deepti Raghavan · Una-May O'Reilly · Amanpreet Singh · Mahmoud (Mido) Assran · Eugene Wu · Eytan Bakshy · Jinliang Wei · Michael Innes · Viral Shah · Haibin Lin · Conrad Sanderson · Ryan Curtin · Marcus Edel
  • 2016 Poster: Asynchronous Parallel Greedy Coordinate Descent
    Yang You · Xiangru Lian · Ji Liu · Hsiang-Fu Yu · Inderjit Dhillon · James Demmel · Cho-Jui Hsieh