Today’s state of deep neural network inference can be summed up in two words: complex and inefficient. The quest for accuracy has led to overparameterized deep neural networks that require heavy compute resources to solve the tasks at hand; as the State of AI Report 2020 put it, we are “rapidly approaching outrageous computational, economic, and environmental costs to gain incrementally smaller improvements in model performance.” Furthermore, there is no lack of research on achieving high levels of unstructured sparsity, but putting that research into practice remains a challenge. As a result, data scientists and machine learning engineers are often forced to make tradeoffs between model performance, accuracy, and inference costs.
There is a better way.
After years of research at MIT, the team at Neural Magic concluded that throwing teraflops at dense models is not sustainable. So we've taken the best of known research on model compression (unstructured pruning and quantization, in particular) and efficient sparse execution to build a software solution that delivers efficient deep neural network inference on everyday CPUs, without the need for specialized hardware.
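For concreteness, here is a minimal sketch of the two compression techniques named above, written in PyTorch. The 90% sparsity target and dynamic int8 quantization are arbitrary illustrative choices, not Neural Magic's actual recipes or tooling:

```python
# Illustrative sketch only: unstructured magnitude pruning + post-training
# dynamic quantization in PyTorch. Settings are assumptions, not NM recipes.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured magnitude pruning: zero out the 90% of weights with the
# smallest absolute value, leaving a highly sparse weight matrix.
with torch.no_grad():
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight
            threshold = torch.quantile(w.abs().flatten(), 0.90)
            w.mul_((w.abs() > threshold).float())

# Post-training dynamic quantization: store Linear weights in int8 to shrink
# the model and speed up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized(torch.randn(1, 256)).shape)  # torch.Size([1, 10])
```

In practice the sparsity and quantization settings are applied gradually during training with recovery in mind; the snippet above only illustrates what the two operations do to a model's weights.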
Join Neural Magic ML experts to learn how we successfully applied published research on model compression and efficient sparse execution to build software that compresses and optimizes deep learning models for efficient inference with ease.
You’ll walk away with: an overview of SOTA model compression techniques; a demo of the first general-purpose inference engine that translates high sparsity levels into significant speedups; and next steps for using the Neural Magic inference engine and ML tools to make your inference efficient, with less complexity.
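To give a flavor of those next steps, below is a hedged sketch of compiling and running a sparse model with the DeepSparse engine on a CPU. The SparseZoo stub and exact arguments are assumptions for illustration, not the demo's actual setup:

```python
# Hedged sketch: run a pruned ONNX model with the DeepSparse engine on a CPU.
# The model stub below is assumed for illustration; substitute any ONNX model path.
import numpy as np
from deepsparse import compile_model

model_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate"

engine = compile_model(model_stub, batch_size=1)

# Single 224x224 RGB image as float32, matching the assumed model's input shape.
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]
outputs = engine.run(inputs)
print([o.shape for o in outputs])
```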
Author Information
Mark J Kurtz (Neural Magic)
Dan Alistarh (IST Austria & Neural Magic Inc.)
Saša Zelenović (Neural Magic)
More from the Same Authors
- 2023 Poster: Knowledge Distillation Performs Partial Variance Reduction
  Mher Safaryan · Alexandra Peste · Dan Alistarh
- 2023 Poster: ZipLM: Inference-Aware Structured Pruning of Language Models
  Eldar Kurtić · Elias Frantar · Dan Alistarh
- 2023 Poster: CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models
  Denis Kuznedelev · Eldar Kurtić · Elias Frantar · Dan Alistarh
- 2022 Expo Demonstration: Software-Delivered AI: Using Sparse-Quantization for Fastest Inference on Deep Neural Networks
  Mark J Kurtz
- 2020 Poster: Scalable Belief Propagation via Relaxed Scheduling
  Vitalii Aksenov · Dan Alistarh · Janne H. Korhonen
- 2020 Poster: Adaptive Gradient Quantization for Data-Parallel SGD
  Fartash Faghri · Iman Tabrizian · Ilia Markov · Dan Alistarh · Daniel Roy · Ali Ramezani-Kebrya
- 2020 Poster: WoodFisher: Efficient Second-Order Approximation for Neural Network Compression
  Sidak Pal Singh · Dan Alistarh
- 2018 Poster: The Convergence of Sparsified Gradient Methods
  Dan Alistarh · Torsten Hoefler · Mikael Johansson · Nikola Konstantinov · Sarit Khirirat · Cedric Renggli
- 2018 Poster: Byzantine Stochastic Gradient Descent
  Dan Alistarh · Zeyuan Allen-Zhu · Jerry Li