
Machine Learning for Systems
Anna Goldie · Azalia Mirhoseini · Jonathan Raiman · Kevin Swersky · Milad Hashemi

Sat Dec 08 05:00 AM -- 03:30 PM (PST) @ Room 510 AC
Event URL: http://mlforsystems.org/

This workshop is part two of a two-part series with one day focusing on Machine Learning for Systems and the other on Systems for Machine Learning. Although the two workshops are being led by different organizers, we are coordinating our call for papers to ensure that the workshops complement each other and that submitted papers are routed to the appropriate venue.

The Systems for Machine Learning workshop focuses on designing systems to enable ML, whereas we focus on developing ML to optimize systems. Both fields are mature enough to warrant a dedicated workshop. Organizers on both sides are open to merging in the future, but this year we plan to run them separately on two different days.

Designing specialized hardware and systems for deep learning is a topic that has received significant research attention, both in industrial and academic settings, leading to exponential increases in compute capability in GPUs and accelerators. However, using machine learning to optimize and accelerate software and hardware systems is a lightly explored but promising field, with broad implications for computing as a whole. Very recent work has outlined a broad scope where deep learning vastly outperforms traditional heuristics, including topics such as: scheduling [1], data structure design [2], microarchitecture [3], compilers [4], and control of warehouse scale computing systems [5].

The focus of this workshop is to expand upon this recent work and build a community focused on using machine learning in computer systems problems. We seek to improve the state of the art in the areas where learning has already proven to perform better than traditional heuristics, as well as expand to new areas throughout the system stack such as hardware/circuit design and operating/runtime systems.

By forming a community of academic and industrial researchers who are excited about this area, we seek to build towards intelligent, self-optimizing systems and answer questions such as: How do we generate and share high-quality datasets that span the layers of the system stack? Which learned representations best capture code performance and runtime? Which simulators and simulation methodologies provide a tractable proving ground for techniques like reinforcement learning?

To this end, the target audience for this workshop includes a wide variety of attendees from state-of-the-art researchers in machine learning to domain experts in computer systems design. We have invited a broad set of expert speakers to present the potential for impact of combining machine learning research with computer systems. We hope that providing a formal venue for researchers from both fields to meet and interact will push forward both fundamental research in ML as well as real-world impact to computer systems design and implementation.

The workshop will host 6 speakers/panelists (all confirmed) and we will put out a call for researchers to submit relevant papers, up to 4 pages in the default NIPS style, that will undergo a peer review process. Selected works will be presented as spotlights, contributed talks and/or posters. Speakers will be invited to participate in an interactive panel discussion to conclude the workshop.

The organizers of this workshop span core research in machine learning, computer systems and architecture, as well as their intersection. Jointly, they have published in top-tier systems and machine learning conferences including: NIPS, ICML, ICLR, ISCA, MICRO, DAC, and SIGMETRICS.

[1] Device Placement Optimization with Reinforcement Learning, https://arxiv.org/pdf/1706.04972.pdf
[2] The Case for Learned Index Structures, https://arxiv.org/abs/1712.01208
[3] Learning Memory Access Patterns, https://arxiv.org/pdf/1803.02329.pdf
[4] End to End Deep Learning of Optimization Heuristics, https://ieeexplore.ieee.org/document/8091247/
[5] DeepMind AI reduces Google data centre cooling bill by 40%, https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
[6] Bayesian optimization for tuning the JVM, https://www.youtube.com/watch?v=YhNl468S8CI
[7] Safe Exploration for Identifying Linear Systems via Robust Optimization, https://arxiv.org/abs/1711.11165

Sat 6:00 a.m. - 6:10 a.m.
Sat 6:10 a.m. - 6:35 a.m.

Traditional compilers use expert-written rules to prove the correctness of program transformations, and hope for the best in terms of performance. Stochastic program optimizers turn that model on its head. They use machine learning techniques to search for aggressive performance-improving transformations, and state-of-the-art verification techniques to prove correctness after the fact. The results are novel, often inscrutable, and in many cases outperform expertly tuned code. In this talk I'll present an overview of the core technique, describe current work, and discuss directions for future research.
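As a toy illustration of the search-then-verify idea in this talk, the sketch below randomly mutates a straight-line program and keeps rewrites that stay correct while lowering cost. The three-operation "ISA", its cost model, and the use of a test suite in place of formal verification are all invented for illustration; the real systems operate on machine code and prove equivalence after the fact.

```python
import random

# Toy instruction set and cost model (illustrative assumptions):
OPS = {
    "add": lambda x, k: x + k,
    "mul": lambda x, k: x * k,
    "shl": lambda x, k: x << k,
}
COST = {"add": 1, "mul": 3, "shl": 1}  # shifts modeled as cheaper than multiplies

def run(prog, x):
    """Execute a straight-line program on input x."""
    for op, k in prog:
        x = OPS[op](x, k)
    return x

def cost(prog):
    return sum(COST[op] for op, _ in prog)

def correct(prog, spec, tests):
    """Correctness check; a real optimizer would verify equivalence formally."""
    return all(run(prog, x) == spec(x) for x in tests)

def mutate(prog, rng):
    """Propose a random rewrite: replace one instruction."""
    prog = list(prog)
    i = rng.randrange(len(prog))
    prog[i] = (rng.choice(list(OPS)), rng.randrange(4))
    return prog

def search(start, spec, tests, iters=5000, seed=0):
    """Stochastic search: keep proposals that stay correct and get cheaper."""
    rng = random.Random(seed)
    best = start
    for _ in range(iters):
        cand = mutate(best, rng)
        if correct(cand, spec, tests) and cost(cand) < cost(best):
            best = cand
    return best
```

With the specification `4 * x`, the search replaces `mul 4` (cost 3) with the cheaper but equivalent `shl 2` (cost 1).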

Eric Schkufza
Sat 6:35 a.m. - 7:00 a.m.

In the post-Moore's Law era, the amount of computation per unit cost and power is no longer increasing at its historic rate. In the post-ImageNet era, researchers are solving more complicated AI problems using larger data sets, which drives the demand for more computation. This mismatch between the supply of and demand for computation highlights the need to co-design efficient machine learning algorithms and domain-specific hardware architectures. Such algorithm-hardware co-design opens up a much larger design space that requires expertise on both sides (ML and systems), and hand-crafted heuristics may be sub-optimal for exploring it. We introduce three recent lines of work on using machine learning to optimize machine learning systems: learning the optimal pruning strategy (AMC) and quantization strategy (HAQ) on the target hardware, rather than relying on rule-based strategies; learning the optimal neural network architecture specialized for a target hardware architecture, optimizing both accuracy and latency (ProxylessNAS), rather than using a generic architecture across all hardware; and learning to optimize analog circuit parameters, rather than relying on experienced analog engineers to tune transistors. On the other side of the loop (designing hardware-friendly machine learning algorithms), I'll introduce the temporal shift module (TSM), which offers 8x lower latency and 12x higher throughput than 3D-convolution-based methods while ranking first on both the Something-Something V1 and V2 leaderboards. I'll conclude the talk with an outlook on design automation for efficient machine learning systems.
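The temporal shift operation at the heart of TSM is simple enough to sketch directly. Below is a minimal NumPy version of the idea described in the TSM paper: shift a fraction of the channels one step along the time axis in each direction, at zero FLOP cost. The tensor layout and fold ratio here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Shift 1/fold_div of the channels one step back in time, another
    1/fold_div one step forward, and leave the rest untouched.
    x has layout [N, T, C, H, W]; vacated time slots are zero-filled."""
    fold = x.shape[2] // fold_div
    out = np.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift towards the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift towards the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unchanged
    return out
```

Because the shift is a pure memory movement, it adds temporal context to a 2D backbone without any extra multiply-accumulate work.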

Sat 7:00 a.m. - 8:00 a.m.
Poster Session (All Posters) (Poster Session)
Artemiy Margaritov, Ravichandra Addanki, Hamidreza Mahyar, Guo Zhang, Avani Wildani, Hadi Esmaeilzadeh, Dmitrii Ustiugov, Shaileshh Bojja Venkatakrishnan, Fabian Ruffy Varga, Adit Bhardwaj, Tatiana Shpeisman
Sat 8:00 a.m. - 8:15 a.m.

Because of the prevalence of APIs in modern software development, an automated interactive code discovery system that helps developers use these APIs would be extremely valuable. Program synthesis is a promising way to build such a system, but existing approaches focus on programs in domain-specific languages with far fewer functions than a typical API provides. In this paper we focus on 112 functions from the Python library for DataFrame manipulation, an order of magnitude more than considered in prior approaches. To assess the viability of program synthesis in this domain, our first goal is a system that reliably synthesizes programs with a single library function. We introduce an encoding of structured input-output examples as graphs that can be fed to existing graph-based neural networks to infer the library function. We evaluate the effectiveness of this approach on synthesized and real-world I/O examples, finding programs matching the I/O examples for 97% of both our validation set and our cleaned test set.
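A minimal sketch of the graph-encoding idea, under the assumption (not taken from the paper) that table cells become nodes and that cells sharing a value across the input and output tables are linked by edges, which a graph neural network could then consume:

```python
def tables_to_graph(inp, out):
    """Encode an input/output table pair as (nodes, edges).
    Nodes are (table_tag, row, col, value); edges connect cells with
    equal values across the two tables. Illustrative encoding only."""
    nodes, edges = [], []
    for tag, table in (("in", inp), ("out", out)):
        for r, row in enumerate(table):
            for c, val in enumerate(row):
                nodes.append((tag, r, c, val))
    for i, (ti, _, _, vi) in enumerate(nodes):
        for j, (tj, _, _, vj) in enumerate(nodes):
            if i < j and ti != tj and vi == vj:
                edges.append((i, j))  # equality edge across input/output
    return nodes, edges
```

For example, an input row `[1, 2]` and an output column `[[2], [1]]` yield four nodes and two cross-table equality edges, hinting at a transpose-like transformation.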

Sat 8:15 a.m. - 8:30 a.m.

We present Placeto, a reinforcement learning (RL) approach to efficiently find device placements for distributed neural network training. Unlike prior approaches that only find a device placement for a specific computational graph, Placeto can learn generalizable device placement policies that can be applied to any graph. We propose two key ideas in our approach: (1) we represent the policy as performing iterative placement improvements, rather than outputting a placement in one shot; and (2) we use graph embeddings to capture the structural information of the computational graph, without relying on node labels for indexing. These ideas allow Placeto to train efficiently and generalize to unseen graphs. Our experiments show that Placeto can take up to 20x fewer training steps to find placements that are on par with or better than the best placements found by prior approaches.
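The first idea, iterative placement improvement, can be sketched with a toy cost model standing in for Placeto's learned policy and graph embeddings. The load-balance-plus-communication cost below is an illustrative assumption, not Placeto's simulator:

```python
def runtime(placement, compute, edges, comm_cost=1.0):
    """Toy runtime estimate: max per-device compute plus a fixed
    penalty for every edge that crosses devices."""
    per_dev = {}
    for op, dev in placement.items():
        per_dev[dev] = per_dev.get(dev, 0.0) + compute[op]
    comm = sum(comm_cost for u, v in edges if placement[u] != placement[v])
    return max(per_dev.values()) + comm

def improve(placement, compute, edges, devices, sweeps=5):
    """Iteratively move one op at a time to the device that most reduces
    the estimated runtime, instead of emitting a placement in one shot."""
    placement = dict(placement)
    for _ in range(sweeps):
        for op in placement:
            placement[op] = min(
                devices,
                key=lambda d: runtime({**placement, op: d}, compute, edges))
    return placement
```

On a tiny three-op chain, the loop moves one heavy op off the shared device, trading a unit of communication for a better compute balance.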

Sat 8:30 a.m. - 8:45 a.m.

Recent networking research has identified that data-driven congestion control (CC) can be more efficient than traditional CC in TCP. Deep reinforcement learning (RL), in particular, has the potential to learn optimal network policies. However, RL suffers from instability and over-fitting, deficiencies which so far render it unacceptable for use in datacenter networks. In this paper, we analyze the requirements for RL to succeed in the datacenter context. We present a new emulator, Iroko, which we developed to support different network topologies, congestion control algorithms, and deployment scenarios. Iroko interfaces with the OpenAI gym toolkit, which allows for fast and fair evaluation of different RL and traditional CC algorithms under the same conditions. We present initial benchmarks on three deep RL algorithms compared to TCP New Vegas and DCTCP. Our results show that these algorithms are able to learn a CC policy which exceeds the performance of TCP New Vegas on dumbbell and fat-tree topologies. We make our emulator open-source and publicly available: https://github.com/dcgym/iroko.
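An emulator that speaks the standard OpenAI gym interface plugs into the usual reset/step control loop. The sketch below shows that loop with a toy single-link environment and an AIMD rule standing in for a learned policy; neither is Iroko's actual environment nor any of the benchmarked algorithms.

```python
class LinkEnv:
    """Toy gym-style environment: the agent picks a send rate for one
    bottleneck link (illustrative stand-in, not Iroko)."""

    def __init__(self, capacity=10.0, horizon=20):
        self.capacity, self.horizon = capacity, horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # observation: overshoot above capacity on the last step

    def step(self, rate):
        self.t += 1
        over = max(0.0, rate - self.capacity)
        reward = min(rate, self.capacity) - 2.0 * over  # throughput minus congestion penalty
        done = self.t >= self.horizon
        return over, reward, done, {}

def aimd_policy(obs, rate):
    """Additive-increase / multiplicative-decrease stand-in for a learned policy."""
    return rate / 2.0 if obs > 0.0 else rate + 1.0

def rollout(env):
    """The standard gym loop any CC algorithm can be evaluated under."""
    obs, rate, total, done = env.reset(), 1.0, 0.0, False
    while not done:
        rate = aimd_policy(obs, rate)
        obs, reward, done, _ = env.step(rate)
        total += reward
    return total
```

Because every algorithm runs against the same `reset`/`step` interface and reward, comparisons between learned and traditional CC stay apples-to-apples.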

Sat 8:45 a.m. - 9:10 a.m.

Computer architecture is facing an important and exciting challenge. The slowing of Moore's law (even as demand continues to grow) has led to new approaches to future system design, including accelerators and software-defined hardware. In this talk we will discuss how machine learning has the potential to amplify these opportunities. We will discuss some specific case studies and end with key insights for applying machine learning to improve computer architecture.

Partha Ranganathan
Sat 10:45 a.m. - 11:10 a.m.

Traditional resource management techniques that rely on simple heuristics often fail to achieve predictable performance in contemporary complex systems that span physical servers, virtual servers, and private and/or public clouds. My research aims to bring the benefits of Machine Learning (ML) models to the optimization and management of such complex systems by deriving actionable insights from the performance and utilization data these systems generate. To realize this vision of model-based resource management, we need to address the key challenges that data-driven ML models raise: uncertainty in predictions, cost of training, generalizability from benchmark datasets to real-world systems datasets, and interpretability of the models.

In this talk, I will present our ML formulations to demonstrate how to handle these challenges in two main problem domains in distributed systems: (I) scheduling in parallel data-intensive computational frameworks for improved tail latencies, and (II) performance-aware resource allocation in public cloud environments to meet user-specified performance and cost goals. Along the way, I will also share guidelines for leveraging ML to solve problems in systems, based on my experience.

Neeraja Yadwadkar
Sat 11:10 a.m. - 11:25 a.m.

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high-dimensional convolution, are key enablers of effective deep learning systems. However, current systems rely on manually optimized libraries, e.g., cuDNN, that support only a narrow range of server-class GPUs. Such reliance limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search using effective model transfer across workloads. Experimental results show that our framework delivers performance that is competitive with state-of-the-art hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs.
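The core loop (measure a few candidates on hardware, fit a statistical cost model, and let the model rank the remaining variants) can be sketched as follows. The quadratic cost surface and the polynomial model are toy assumptions, not the framework's actual features or model:

```python
import numpy as np

def true_cost(tile):
    """Stand-in for an expensive hardware measurement of one program
    variant; here, cost is minimized at tile size 16."""
    return (np.log2(tile) - 4.0) ** 2 + 1.0

def model_guided_search(candidates, n_measured=4, top_k=2):
    """Measure a few random variants, fit a tiny statistical cost model,
    then measure only the model's top-k picks."""
    rng = np.random.default_rng(0)
    sampled = rng.choice(candidates, size=n_measured, replace=False)
    x = np.log2(sampled)
    y = [true_cost(t) for t in sampled]
    coeffs = np.polyfit(x, y, deg=2)                 # toy learned cost model
    preds = np.polyval(coeffs, np.log2(candidates))  # cheap to evaluate everywhere
    ranked = [c for _, c in sorted(zip(preds, candidates))]
    best = min(ranked[:top_k], key=true_cost)        # only top-k hit the hardware
    return int(best)
```

The point of the sketch is the asymmetry: the model is queried over the whole candidate set, while only a handful of variants are ever measured.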

Sat 11:25 a.m. - 11:40 a.m.

Analog IC design relies on human experts, guided by experience and intuition, to search for parameters that satisfy circuit specifications; this process is highly labor-intensive, time-consuming, and often suboptimal. Machine learning is a promising tool to automate it. However, supervised learning is difficult for this task due to the low availability of training data: 1) circuit simulation is slow, so generating large-scale datasets is time-consuming; 2) most circuit designs are proprietary IP within individual IC companies, making it expensive to collect large-scale datasets. We propose Learning to Design Circuits (L2DC), which leverages reinforcement learning to efficiently generate new circuit data and to optimize circuits. We fix the schematic and optimize the parameters of the transistors automatically by training an RL agent with no prior knowledge about optimizing circuits. By iteratively getting observations, generating a new set of transistor parameters, getting a reward, and adjusting the model, L2DC is able to optimize circuits. We evaluate L2DC on two transimpedance amplifiers. Trained for a day, our RL agent achieves performance comparable to or better than what human experts achieve in a quarter. It first learns to meet hard constraints (e.g., gain, bandwidth), and then learns to optimize nice-to-have targets (e.g., area, power). Compared with grid-search-aided human design, L2DC achieves 250x higher sample efficiency with comparable performance. Under the same runtime constraint, L2DC also outperforms Bayesian optimization.
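The observe/propose/reward loop can be sketched with a stub in place of the circuit simulator. The one-parameter "circuit", the reward shaping (hard constraint first, then power), and the random-search proposal rule are all illustrative assumptions, not L2DC's agent or SPICE setup:

```python
import random

def simulate(width):
    """Stub for a circuit simulation: a wider device gives more gain
    but burns more power (toy model)."""
    gain = 10.0 * width
    power = width ** 2
    return gain, power

def reward(gain, power, min_gain=50.0):
    """Hard constraint first: negative until gain is met, then the
    agent is rewarded for reducing power."""
    if gain < min_gain:
        return gain - min_gain
    return 100.0 / (1.0 + power)

def optimize(steps=200, seed=0):
    """Random-search stand-in for the RL agent: observe the reward,
    propose new transistor parameters, keep the best."""
    rng = random.Random(seed)
    best_w, best_r = 1.0, reward(*simulate(1.0))
    for _ in range(steps):
        cand = max(0.1, best_w + rng.gauss(0.0, 0.5))  # propose new parameters
        r = reward(*simulate(cand))
        if r > best_r:
            best_r, best_w = r, cand
    return best_w
```

Under this shaping, the search first climbs until the gain constraint is met (width 5 in the toy model) and then shrinks the device to cut power while staying feasible.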

Sat 11:40 a.m. - 11:55 a.m.

Despite numerous state-of-the-art applications of deep neural networks (DNNs) in a wide range of real-world tasks, two major challenges hinder further advances: hyperparameter optimization and constrained power resources, a significant concern for embedded devices. DNNs become increasingly difficult to train and deploy as they grow in size, due to both computational intensity and their large memory footprint. Recent efforts show that quantizing the weights of deep neural networks to lower bitwidths takes a significant step toward mitigating these issues by reducing memory bandwidth and computational demands, which is important for deploying DNN models on resource-constrained devices. This paper builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. Deep quantization (quantizing to bitwidths below eight) while maintaining accuracy, however, requires significant manual effort, hyperparameter tuning, and re-training. This paper tackles these problems with an end-to-end framework, dubbed ReLeQ, that automates DNN quantization. We formulate DNN quantization as an optimization problem and use a state-of-the-art policy-gradient-based reinforcement learning (RL) algorithm, Proximal Policy Optimization (PPO), to efficiently explore the large design space of DNN quantization and solve the defined optimization problem. To show the effectiveness of ReLeQ, we evaluated it on several neural networks trained on MNIST, CIFAR-10, and SVHN. ReLeQ quantizes the weights of these networks to average bitwidths of 2.25, 5, and 4, respectively, while keeping the final accuracy loss below 0.3%.
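To make the design space concrete, the toy sketch below assigns each layer a bitwidth and minimizes the average bitwidth subject to an accuracy-loss budget. The per-layer sensitivity numbers and the exhaustive scan are illustrative stand-ins for ReLeQ's retraining-based accuracy evaluation and PPO-driven search:

```python
import itertools

def accuracy_proxy(bitwidths, sensitivity):
    """Toy accuracy model: each layer loses accuracy in proportion to its
    sensitivity as its bitwidth drops below 8 (illustrative assumption)."""
    return 100.0 - sum(s * max(0, 8 - b) for b, s in zip(bitwidths, sensitivity))

def best_assignment(layers, sensitivity, max_loss):
    """Exhaustively scan per-layer bitwidths in [2, 8], keeping the lowest
    average bitwidth whose accuracy loss stays within budget."""
    best, best_avg = None, float("inf")
    for bits in itertools.product(range(2, 9), repeat=layers):
        avg = sum(bits) / layers
        loss = 100.0 - accuracy_proxy(bits, sensitivity)
        if loss <= max_loss and avg < best_avg:
            best, best_avg = bits, avg
    return best, best_avg
```

Even in this toy, the optimum is heterogeneous: insensitive layers drop to 2 bits while the most sensitive layer stays at 8, which is exactly the kind of per-layer assignment an RL agent must discover in the real, exponentially larger space.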

Sat 12:00 p.m. - 1:00 p.m.
Poster Session (All Posters) (Poster Session)
Stephen Macke, Hongzi Mao, Caroline Lemieux, Saim Salman, Rishikesh Jha, Hanrui Wang, Shoumik P Palkar, Tianqi Chen, Thomas Pumir, Vaishnav Janardhan, Adit Bhardwaj, Ed Chi
Sat 1:00 p.m. - 1:25 p.m.

To integrate information from more than two tables, a SQL query optimizer must identify the most efficient nesting of two-way table join operations to answer the query. Recent advances in AI may provide an unexpected new perspective on this classical problem, which has been studied for over 40 years. Join optimization can be posed as a Markov decision process in which the state is a graph representing the join conditions in a query and the actions are edge contractions on this graph, allowing us to apply ideas from deep reinforcement learning and imitation learning to build an improved query optimizer that learns from experience, handles uncertainty, and incorporates execution feedback. I describe how our group built a full-featured query optimizer based on this MDP architecture, and we present results across a variety of database designs and query workloads in PostgreSQL and Apache Spark. I conclude by highlighting some under-appreciated RL research challenges in exploration, parametrization, and policy evaluation unearthed by this application.
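The MDP framing can be sketched concretely: the state is the join graph, an action contracts one edge (joining two relations), and the textbook selectivity-based size estimate below stands in for a learned value function:

```python
def greedy_join_order(sizes, edges, selectivity=0.1):
    """Greedy policy over the join-graph MDP: each step contracts the edge
    with the smallest estimated intermediate result. The fixed selectivity
    is a toy stand-in for real cardinality estimation."""
    sizes = dict(sizes)
    edges = {frozenset(e) for e in edges}
    order, total = [], 0.0

    def est(edge):
        a, b = tuple(edge)
        return sizes[a] * sizes[b] * selectivity

    while edges:
        edge = min(edges, key=est)          # action: contract the cheapest edge
        a, b = sorted(edge)
        size = est(edge)
        order.append((a, b))
        total += size                       # cumulative intermediate-result cost
        # rebuild the graph with a and b contracted into one node
        sizes[a + b] = size
        del sizes[a], sizes[b]
        edges = {frozenset(a + b if x in (a, b) else x for x in e)
                 for e in edges if e != edge}
        edges = {e for e in edges if len(e) == 2}   # drop self-loops
    return order, total
```

A learned optimizer would replace the greedy `min(edges, key=est)` step with a value function trained from execution feedback, which is exactly where the RL and imitation-learning machinery enters.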

Sat 1:25 p.m. - 1:50 p.m.
Invited Speaker 7: Jeff Dean (Invited Talk)
Sat 1:50 p.m. - 2:50 p.m.
Sat 2:50 p.m. - 3:00 p.m.

Author Information

Anna Goldie (Google Brain)

Anna Goldie is a Research Software Engineer at Google Brain. She completed her Master's at MIT in the Spoken Language Systems Group at CSAIL, where she built a Mandarin-speaking dialogue system for her thesis. Her past research focused on meta-learning, conversational modeling, machine translation, and question answering. She released tf-seq2seq, a popular open-source framework for machine translation. Recently, she has worked on deep reinforcement learning approaches to systems optimization.

Azalia Mirhoseini (Google Brain)

I am a Research Scientist at Google Brain, where I focus on deep reinforcement learning approaches to solving problems in computer systems and on meta-learning. I received my Ph.D. from Rice University, where I worked on algorithms and architectures for performance-efficient big data analytics. I am the recipient of the National Gold Medal in the Iran Mathematics Olympiad (2004), the Microsoft Women Graduate Student Scholarship (2010), the IBM Ph.D. Student Scholarship (2012), and the Schlumberger Ph.D. Student Fellowship (2013). My dissertation received the best 2015 Ph.D. thesis award in Rice University's ECE department.

Jonathan Raiman (OpenAI)
Kevin Swersky (Google)
Milad Hashemi (Google)
