Expo Demonstration
Ballroom C

Today’s state of deep neural network inference can be summed up in two words: complex and inefficient. The quest for accuracy has led to overparameterized deep neural networks that require heavy compute resources, and we are rapidly approaching unsustainable computational, economic, and environmental costs for incrementally smaller improvements in model performance. Enter sparsity, a technique that makes neural networks smaller and faster. There is no lack of research on achieving high levels of network sparsity, but putting that research into practice remains a challenge. As a result, data scientists and machine learning engineers are often forced to make tradeoffs among model performance, accuracy, and inference cost.

After years of research at MIT, the team at Neural Magic concluded that throwing teraflops at dense models is not sustainable. So they took the best of known research on model compression (unstructured pruning and quantization, in particular) and efficient sparse execution and built a software solution that delivers efficient deep neural network inference on everyday CPUs, without the need for specialized hardware.

Join Neural Magic ML experts to learn how they created and applied SOTA research on model compression and efficient sparse execution to build open-source software that compresses and optimizes deep learning models for efficient inference, and how those models are deployed with DeepSparse, a free-to-use engine that delivers GPU speeds on commodity CPUs. Attendees will walk away with (1) an overview of SOTA research and model compression techniques, including ways to apply them to your models using open-source software, (2) a demo of the first-ever sparsity-aware inference engine that translates high sparsity levels into a significant speedup, and (3) next steps on using the Neural Magic open-source and free ML tools to make your inference efficient and less complex.