Demonstration

Adapting Microsoft's CNTK and ResNet-18 to Enable Strong-Scaling on Cray Systems

Mark Staveley

2016 Demonstration

Abstract

The ability to process vast quantities of data with an ever-increasing amount of computational power has enabled Deep Learning models in speech and vision to surpass human capabilities. However, we are still limited by algorithm implementations (software) and hardware capabilities (compute, storage, and networking) when it comes to scaling out algorithms and reducing the time to solution. In many cases, training a Deep Learning model can take days or even weeks before the desired accuracy is reached. Members of the Cray Deep Learning Group (with assistance from the Engineering Staff at the Swiss National Supercomputing Centre) have been able to leverage Cray’s experience with extreme-scale systems to successfully scale out Microsoft’s CNTK [1] code with the ResNet-18 [2] model (from last year’s ImageNet competition) to over 512 Cray XC30 Supercomputer nodes (each node having 1 GPU). Specifically, we have been able to leverage tools found within Cray MPI and the Cray Programming Environment to optimize the MPI communications within Microsoft’s CNTK while preserving the algorithm.
