
Machine Learning Open Source Software 2018: Sustainable communities
Heiko Strathmann · Viktor Gal · Ryan Curtin · Antti Honkela · Sergey Lisitsyn · Cheng Soon Ong

Sat Dec 08 05:00 AM -- 03:30 PM (PST) @ Room 515
Event URL: https://2018.mloss.org/

Machine learning open source software (MLOSS) is one of the cornerstones of open science and reproducible research. Once a niche area of ML research, MLOSS has gathered significant momentum, fostered both by the scientific community and, more recently, by corporate organizations. Along with open access and open data, it enables free reuse and extension of current developments in ML. The past mloss.org workshops at NIPS06, NIPS08, ICML10, NIPS13, and ICML15 successfully brought together researchers and developers from both fields to exchange experiences and lessons learnt, to encourage interoperability between people and projects, and to demonstrate software to users in the ML community.

Continuing the tradition in 2018, we plan to have a workshop that is a mix of invited speakers, contributed talks and discussion/activity sessions. This year’s headline aims to give insight into the challenges faced by projects as they seek long-term sustainability, with a particular focus on community building and preservation, and on diverse teams. In the talks, we will cover some of the latest technical innovations from established and new projects. The main focus, however, will be on insights into project sustainability, diversity, funding and attracting new developers, both from academia and industry. We will discuss various strategies that help promote gender diversity in projects (e.g. implementing quotas) and how to promote developer growth within a project.

We aim to make this workshop as diverse as possible within the field. This includes a gender-balanced set of speakers and a focus on programming languages from different scientific communities. In particular, most of our invited speakers represent umbrella projects with a hugely diverse set of applications and users (NumFOCUS, OpenML, tidyverse).

With a call for participation for software project demos, we aim to provide improved outreach and visibility, especially for the smaller OSS projects typically found in academia. In addition, our workshop will serve as a gathering of OSS developers in academia, for peer-to-peer exchange of lessons learnt, experiences, and sustainability and diversity tactics.

The workshop will include an interactive session to produce general techniques for driving community engagement and sustainability, such as application templates (Google Summer of Code, etc.), “getting started” guides for new developers, and a collection of potential funding sources. We plan to conclude the workshop with a discussion on the headline topic.

Sat 5:25 a.m. - 5:30 a.m. [iCal]
Opening remarks (Intro)
Sat 5:30 a.m. - 6:00 a.m. [iCal]
Gina Helfrich, NumFOCUS (Invited talk) Gina Helfrich
Sat 6:00 a.m. - 6:30 a.m. [iCal]
Christoph Hertzberg, Eigen3 (Invited talk) Christoph Hertzberg
Sat 6:30 a.m. - 7:00 a.m. [iCal]
Joaquin Vanschoren, OpenML (Invited talk)
Sat 7:00 a.m. - 7:05 a.m. [iCal]

Sherpa is a free open-source hyperparameter optimization library for machine learning models. It is designed for problems with computationally expensive iterative function evaluations, such as the hyperparameter tuning of deep neural networks. With Sherpa, scientists can quickly optimize hyperparameters using a variety of powerful and interchangeable algorithms. Additionally, the framework makes it easy to implement custom algorithms. Sherpa can be run on either a single machine or a cluster via a grid scheduler with minimal configuration. Finally, an interactive dashboard enables users to view the progress of models as they are trained, cancel trials, and explore which hyperparameter combinations are working best. Sherpa empowers machine learning researchers by automating the tedious aspects of model tuning and providing an extensible framework for developing automated hyperparameter-tuning strategies. Its source code and documentation are available at https://github.com/LarsHH/sherpa and https://parameter-sherpa.readthedocs.io/, respectively. A demo can be found at https://youtu.be/L95sasMLgP4.

Peter Sadowski
Sat 7:05 a.m. - 7:10 a.m. [iCal]

In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes such as autonomous driving or medical image analysis. In these and many other domains it is crucial that specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite this, neural networks are often treated as black boxes, and their complex internal workings as well as the basis for their predictions are not fully understood. In an attempt to alleviate this shortcoming many analysis methods have been proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. In this tutorial we present the library iNNvestigate, which addresses this issue by providing a common interface and out-of-the-box implementations for many analysis methods. In the first part we will show how iNNvestigate enables users to easily compare such methods for neural networks. The second part will demonstrate how the underlying API abstracts common operations in neural network analysis and show how users can use them for the development of (future) methods.

Sat 7:10 a.m. - 7:15 a.m. [iCal]

mlpack is an open-source C++ machine learning library with an emphasis on speed and flexibility. Since its inception in 2007, it has grown to be a large project, implementing a wide variety of machine learning algorithms. This short paper describes the general design principles and discusses how the open-source community around the library functions.

Marcus Edel
Sat 7:15 a.m. - 7:20 a.m. [iCal]

SGDLibrary is an open source MATLAB library of stochastic optimization algorithms, which finds the minimizer of a function $f: \mathbb{R}^d \to \mathbb{R}$ of the finite-sum form $\min_{w} f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$. This problem has been studied intensively in recent years in the field of machine learning. One typical but promising approach for large-scale data is to use a stochastic optimization algorithm to solve the problem. SGDLibrary is a readable, flexible and extensible pure-MATLAB library of a collection of stochastic optimization algorithms. The purpose of the library is to provide researchers and implementers a comprehensive evaluation environment for the use of these algorithms on various machine learning problems.
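To make the finite-sum setting concrete, the following is a minimal sketch of plain stochastic gradient descent on an objective of the form f(w) = (1/n) Σ_i f_i(w). It is written in Python for illustration and does not reproduce SGDLibrary's MATLAB interface; the function names, the step size and the toy least-squares problem are all assumptions chosen for the example.

```python
import random


def sgd_finite_sum(grad_i, n, w0, step=0.1, epochs=200, seed=0):
    """Minimise f(w) = (1/n) * sum_i f_i(w) with plain SGD.

    grad_i(i, w) must return the gradient of the i-th term f_i at w.
    All names here are illustrative, not part of SGDLibrary.
    """
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # visit every term once per epoch, in random order
        for i in order:
            g = grad_i(i, w)
            w = [wj - step * gj for wj, gj in zip(w, g)]
    return w


# Toy least-squares problem: f_i(w) = 0.5 * (x_i . w - y_i)^2,
# with noiseless targets generated from true_w.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
true_w = (2.0, -3.0)
Y = [sum(a * b for a, b in zip(x, true_w)) for x in X]


def least_squares_grad(i, w):
    residual = sum(a * b for a, b in zip(X[i], w)) - Y[i]
    return [residual * a for a in X[i]]


w_hat = sgd_finite_sum(least_squares_grad, n=len(X), w0=[0.0, 0.0])
print(w_hat)
```

Each update touches a single term f_i, which is what makes this family of methods attractive when n is large; variance-reduced or adaptive variants change only how the per-step direction is computed.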

Hiroyuki Kasai
Sat 7:20 a.m. - 7:25 a.m. [iCal]

Natural Language Processing is now dominated by deep learning models. Baseline (https://github.com/dpressel/baseline) is a library to facilitate reproducible research and fast model development for NLP with deep learning. It provides easily extensible implementations and abstractions for data loading, model development, training, hyper-parameter tuning, deployment to production, and a leaderboard to track experimental results.

Brian Lester
Sat 7:25 a.m. - 8:00 a.m. [iCal]
Discussion over morning coffee (Break)
Sat 7:25 a.m. - 8:20 a.m. [iCal]

Machine Learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since currently data science and modeling are largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities inside applications unnecessarily costly and difficult, and furthermore discourages developers from embracing ML in the first place. In this paper we introduce ML.NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications.

Markus Weimer
Sat 7:25 a.m. - 8:20 a.m. [iCal]

We will show the advantages of using a fabric of open source AI services and libraries, which have been launched by the AI labs in IBM Research, to train, harden and de-bias deep learning models. The motivation is that model building should not be monolithic. Algorithms, operations and pipelines to build and refine models should be modularized and reused as needed. The componentry presented meets these requirements and shares a philosophy of being framework- and vendor-agnostic, as well as modular and extensible. We focus on multiple aspects of machine learning, described in the following. To train models in the cloud in a distributed, framework-agnostic way, we use the Fabric for Deep Learning (FfDL). Adversarial attacks against models are mitigated using the Adversarial Robustness Toolbox (ART). We detect and remove bias using AI Fairness 360 (AIF360). Additionally, we publish to the open source developer community using the Model Asset Exchange (MAX). Overall, we demonstrate operations on deep learning models, and a set of developer APIs, that will help open source developers create robust and fair models for their applications, and for open source sharing. We will also call for community collaboration on these projects of open services and libraries, to democratize the open AI ecosystem.

Sat 7:25 a.m. - 8:20 a.m. [iCal]

This paper discusses results and insights from the 1st ReQuEST workshop, a collective effort to promote reusability, portability and reproducibility of deep learning research artifacts within the Architecture/PL/Systems communities. ReQuEST (Reproducible Quality-Efficient Systems Tournament) exploits the open-source Collective Knowledge framework (CK) to unify benchmarking, optimization, and co-design of deep learning systems implementations and exchange results via a live multi-objective scoreboard. Systems evaluated under ReQuEST are diverse and include an FPGA-based accelerator, optimized deep learning libraries for x86 and ARM systems, and distributed inference in Amazon Cloud and over a cluster of Raspberry Pis. Finally, we discuss limitations of our approach and how we plan to address them for the upcoming SysML artifact evaluation effort.

Thierry Moreau
Sat 7:25 a.m. - 8:20 a.m. [iCal]

Despite impressive advances in recent years, human vision is still much more robust than machine vision. This article presents PyLissom, a novel software library for modeling the cortical maps of the visual system. The software is implemented in PyTorch, a modern deep learning framework, and allows full integration with other PyTorch modules. We hypothesize that PyLissom could act as a bridge between the neuroscience and machine learning communities, leading to advances in both fields.

Hernan Barijhoff
Sat 7:25 a.m. - 8:20 a.m. [iCal]

We introduce salad, an open source toolbox that provides a unified implementation of state-of-the-art methods for transfer learning, semi-supervised learning and domain adaptation. In the first release, we provide a framework for reproducing, extending and combining research results of the past years, including model architectures, loss functions and training algorithms. The toolbox along with first benchmark results and further resources is accessible at domainadaptation.org.

Steffen Schneider
Sat 7:25 a.m. - 8:20 a.m. [iCal]

This is a proposal for a short talk at the MLOSS 2018 NIPS workshop about speed benchmarks of gradient boosted decision tree libraries. There is currently no proper way to benchmark the speed of boosting libraries. Openly available benchmarks are misleading and often contain mistakes. We want to show what is wrong with these benchmarks and what everyone should keep in mind when comparing different libraries in a meaningful way.

Vasily Ershov
Sat 7:25 a.m. - 8:20 a.m. [iCal]

Gravity is an open source, scalable, memory efficient modeling language for solving mathematical models in Optimization and Machine Learning. Gravity exploits structure to reduce function evaluation time including Jacobian and Hessian computation. Gravity is implemented in C++ with a flexible interface allowing the user to specify the numerical accuracy of variables and parameters. It is also designed to offer efficient iterative model solving, convexity detection, multithreading of subproblems, and lazy constraint generation. When compared to state-of-the-art modeling languages such as JuMP, Gravity is 5 times faster in terms of function evaluation and up to 60 times more memory efficient. Gravity enables researchers and practitioners to access state-of-the-art optimization solvers with a user-friendly interface for writing general mixed-integer nonlinear models.

Hassan Hijazi
Sat 7:25 a.m. - 8:20 a.m. [iCal]

In this paper, we introduce McTorch, a manifold optimization library for deep learning that extends PyTorch. It aims to lower the barrier for users wishing to use manifold constraints in deep learning applications, i.e., when the parameters are constrained to lie on a manifold. Such constraints include the popular orthogonality and rank constraints, and have been recently used in a number of applications in deep learning. McTorch follows PyTorch's architecture and decouples manifold definitions and optimizers, i.e., once a new manifold is added it can be used with any existing optimizer and vice-versa. McTorch is available at https://github.com/mctorch.
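The decoupling of manifold definitions from optimizers can be illustrated with a small, self-contained sketch. The code below is plain Python, not McTorch or PyTorch code: the `Sphere` class, the `project`/`retract` method names and the `riemannian_gd` function are hypothetical names chosen for this example. The optimizer only interacts with the manifold through those two operations, so a different manifold could be swapped in without changing the optimizer.

```python
import math


class Sphere:
    """Unit sphere {w : ||w|| = 1}. Hypothetical helper, not a McTorch class."""

    def project(self, w, g):
        # Remove the component of g along w to obtain a tangent vector at w.
        dot = sum(wi * gi for wi, gi in zip(w, g))
        return [gi - dot * wi for wi, gi in zip(w, g)]

    def retract(self, w, step_vec):
        # Move along the tangent direction, then renormalise onto the sphere.
        moved = [wi + si for wi, si in zip(w, step_vec)]
        norm = math.sqrt(sum(m * m for m in moved))
        return [m / norm for m in moved]


def riemannian_gd(manifold, euclid_grad, w0, step=0.1, iters=500):
    """Gradient descent that only talks to the manifold via project/retract."""
    w = list(w0)
    for _ in range(iters):
        rgrad = manifold.project(w, euclid_grad(w))
        w = manifold.retract(w, [-step * r for r in rgrad])
    return w


# Minimise -w^T A w on the sphere; a minimiser is a leading eigenvector of A.
A = [[2.0, 0.0], [0.0, 1.0]]


def grad(w):
    # Euclidean gradient of -w^T A w for symmetric A is -2 A w.
    return [-2.0 * sum(A[i][j] * w[j] for j in range(2)) for i in range(2)]


w = riemannian_gd(Sphere(), grad, w0=[0.6, 0.8])
print(w)
```

The iterate stays on the constraint set at every step, which is the practical difference from unconstrained training followed by a projection at the end.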

Anoop Kunchukuttan
Sat 7:25 a.m. - 8:20 a.m. [iCal]

Recently, with the advent of programmatic and practical machine learning tools, programmers have been able to successfully integrate artificial intelligence capabilities into web and mobile applications. This trend has largely been possible because major organizations and software companies have released their machine learning frameworks to the public, such as TensorFlow (Google), MXNet (Amazon) and PyTorch (Facebook). Python has been the de facto choice of programming language for these frameworks because of its versatility and ease of use. In a similar vein, Elixir is the functional programming language equivalent of Python and Ruby, in that it combines the versatility and ease of use that Python and Ruby boast with functional programming paradigms and the Erlang VM’s fault tolerance and robustness. However, despite these advantages, Elixir, like other functional programming languages, does not provide developers with a machine learning toolset, which is essential for equipping applications with deep learning and statistical inference features. To bridge this gap, we present Tensorflex, an open source framework that allows users to leverage pre-trained TensorFlow models (written in Python, C or C++) for inference (generating predictions) in Elixir. Tensorflex was written by Anshuman Chhabra as part of a Google Summer of Code 2018 project, mentored by José Valim.

Anshuman Chhabra
Sat 7:25 a.m. - 8:20 a.m. [iCal]

In my talk, I would like to share the experience and lessons learned in creating open source machine learning software at CERN with a significant user base in high-energy physics, in expanding the open-source initiative to the wider physics community, and in interacting with the open source community at large.

Sergei Gleyzer
Sat 7:25 a.m. - 8:20 a.m. [iCal]

The paper introduces MI-Prometheus (Machine Intelligence - Prometheus), an open-source framework that aims to accelerate machine learning research by fostering the rapid development of diverse neural network-based models and facilitating their comparison. At its core, to accelerate the computations themselves, MI-Prometheus relies on PyTorch and extensively uses its mechanisms for distributing computations across CPUs/GPUs. The paper discusses the motivation for our work and the formulation of the requirements, and presents the core concepts. We also briefly describe some of the problems and models currently available in the framework.

Vincent Marois
Sat 7:25 a.m. - 8:20 a.m. [iCal]

Data science tasks with structured data types, e.g., arrays, images, time series, text records, are one of the major challenge areas of contemporary machine learning and AI research beyond the "tabular" situation, that is, data that fits into a single classical data frame, and learning tasks on it such as the classical supervised learning task where one column is to be predicted from others. With xpandas, we present a Python package that extends the pandas data container functionality to cope with arbitrary structured types (such as time series, images) at its column/slice elements, and which provides a transformer interface to scikit-learn's pipeline and composition workflows. We intend xpandas to be the first building block towards scikit-learn-like toolbox interfaces for advanced learning tasks such as supervised learning with structured features, structured output prediction, image segmentation, time series forecasting and event risk modelling.

Vitaly Davydov
Sat 7:25 a.m. - 8:20 a.m. [iCal]

We present skpro, a Python framework for domain-agnostic probabilistic supervised learning. It features a scikit-learn-like general API that supports the implementation and fair comparison of both Bayesian and frequentist prediction strategies that produce conditional predictive distributions for each individual test data point. The skpro interface also supports strategy optimization through hyper-parameter tuning, model composition, ensemble methods like bagging, and workflow automation. The package and documentation are released under the BSD-3 open source license and available at GitHub.com/alan-turing-institute/skpro.
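As a rough illustration of what a predictive distribution per test point means, here is a minimal sketch in plain Python. It does not use skpro and makes no claim about skpro's actual classes or method names: `GaussianLinearBaseline` and `gaussian_log_loss` are hypothetical names, and the model (a 1-D least-squares line with a single shared Gaussian residual scale) is an assumption chosen for brevity.

```python
import math


class GaussianLinearBaseline:
    """1-D linear mean with a shared Gaussian residual scale.

    Hypothetical illustration only; not an skpro estimator.
    """

    def fit(self, x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((a - mx) ** 2 for a in x)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx
        resid = [b - (self.intercept + self.slope * a) for a, b in zip(x, y)]
        self.sigma = math.sqrt(sum(r * r for r in resid) / n)
        return self

    def predict(self, x):
        # One (mean, standard deviation) pair per test point.
        return [(self.intercept + self.slope * a, self.sigma) for a in x]


def gaussian_log_loss(dists, y):
    """Mean negative log-density of y under the predicted Gaussians."""
    total = 0.0
    for (mu, sigma), target in zip(dists, y):
        z = (target - mu) / sigma
        total += 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
    return total / len(y)


model = GaussianLinearBaseline().fit([0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.1, 2.9])
preds = model.predict([4.0, 5.0])
loss = gaussian_log_loss(model.predict([1.5]), [1.5])
print(preds, loss)
```

Scoring the predicted distributions with a log-loss, rather than scoring point predictions with squared error, is what allows different probabilistic strategies to be compared on an equal footing.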

Franz J Kiraly
Sat 8:20 a.m. - 8:40 a.m. [iCal]

While there are multiple research-based groups for the ML community around the world, the adoption of these skills by a broader base of developers will require new communities that reach beyond researchers to flourish at a large scale.

The Singapore TensorFlow & Deep Learning community is a group of over 3,000 people from different backgrounds and levels that is pushing the adoption of ML in South-East Asia, via monthly in-person meetings, guest talks, and special events.

In the proposed short talk, we will present some of the challenges, lessons learned and solutions found to building machine learning communities at scale.

Martin Andrews
Sat 8:40 a.m. - 9:00 a.m. [iCal]

The PyMC project is a team of open source developers devoted to the development of software for applied Bayesian statistics and probabilistic machine learning. Broadly, our objective is to produce Python implementations of state-of-the-art methods that can be used by a wide range of non-expert analysts, thereby democratizing probabilistic programming and putting powerful Bayesian methods in the hands of those who need them most: economists, astronomers, epidemiologists, ecologists, and more. Our current product, PyMC3, allows users to implement arbitrary probabilistic models using a high-level API that is analogous to specifying a model on a whiteboard.

Christopher Fonnesbeck
Sat 9:00 a.m. - 11:00 a.m. [iCal]
Lunch (on your own) (Break)
Sat 11:00 a.m. - 11:30 a.m. [iCal]
James Hensman, GPFlow (Invited talk)
Sat 11:30 a.m. - 12:00 p.m. [iCal]
Mara Averick, tidyverse (Invited talk)
Sat 12:00 p.m. - 12:30 p.m. [iCal]
Afternoon coffee break (Break)
Sat 12:30 p.m. - 12:50 p.m. [iCal]

DeepPavlov is an open-source library specifically tailored for the development of dialogue systems. The library prioritizes efficiency, modularity, and extensibility, with the goal of making it easier to create dialogue systems from scratch with limited data available. It supports modular as well as end-to-end approaches to the implementation of conversational agents. In the DeepPavlov framework an agent consists of skills, and every skill can be decomposed into components. Components are usually trainable models which solve typical NLP tasks such as intent classification, named entity recognition and sentiment analysis, or pre-trained encoders for word- or sentence-level embeddings. Sequence-to-sequence chit-chat, question answering or task-oriented skills can be assembled from components provided in the library. ML models implemented in DeepPavlov have performance on par with the current state of the art in the field.

Yury Kuratov
Sat 12:50 p.m. - 1:10 p.m. [iCal]

Modularity is a key feature of deep learning libraries but has not been fully exploited for probabilistic programming. We propose to improve the modularity of probabilistic programming languages by offering not only plain probabilistic distributions but also sophisticated probabilistic models, such as Bayesian non-parametric models, as fundamental building blocks. We demonstrate this idea by presenting a modular probabilistic programming language, MXFusion, which includes a new type of re-usable building block called a probabilistic module. A probabilistic module consists of a set of random variables with associated probabilistic distributions and dedicated inference methods. Under the framework of variational inference, the pre-specified inference methods of individual probabilistic modules can be transparently used for inference of the whole probabilistic model.

Zhenwen Dai
Sat 1:10 p.m. - 1:30 p.m. [iCal]

This work presents Flow, an open-source Python library enabling the application of distributed reinforcement learning (RL) to mixed-autonomy traffic control tasks, in which autonomous vehicles, human-driven vehicles, and infrastructure interact. Flow integrates SUMO, a traffic microsimulator, with RLlib, a distributed reinforcement learning library. Using Flow, researchers can programmatically design new traffic networks, specify experiment configurations, and apply control to autonomous vehicles and intelligent infrastructure. We have used Flow to train autonomous vehicles to improve traffic flow in a wide variety of representative traffic scenarios; the results and scripts to generate the networks and perform training have been integrated into Flow as benchmarks. Community use of Flow is central to its design; extensibility and clarity were considered throughout development. Flow is available for open-source use at flow-project.github.io and github.com/flow-project/flow.

Nishant Kheterpal
Sat 1:30 p.m. - 1:50 p.m. [iCal]

Binder is an open-source project that lets users share interactive, reproducible science. Binder’s goal is to allow researchers to create interactive versions of their code utilizing pre-existing workflows and minimal additional effort. It uses standard configuration files in software engineering to let researchers create interactive versions of code they have hosted on commonly-used platforms like GitHub. Binder’s underlying technology, BinderHub, is entirely open-source and utilizes entirely open-source tools. By leveraging tools such as Kubernetes and Docker, it manages the technical complexity around creating containers to capture a repository and its dependencies, generating user sessions, and providing public URLs to share the built images with others. BinderHub combines two open-source projects within the Jupyter ecosystem: repo2docker and JupyterHub. repo2docker builds the Docker image of the git repository specified by the user, installs dependencies, and provides various front-ends to explore the image. JupyterHub then spawns and serves instances of these built images using Kubernetes to scale as needed. Because each of these pieces is open-source and uses popular tools in cloud orchestration, BinderHub can be deployed on a variety of cloud platforms, or even on your own hardware.

Sat 1:50 p.m. - 2:30 p.m. [iCal]
Panel discussion (Discussion panel)
Sat 2:30 p.m. - 2:35 p.m. [iCal]
Closing remarks (Outro)

Author Information

Heiko Strathmann (Gatsby / ETHZ / Shogun)
Viktor Gal (ETHZ)
Ryan Curtin (RelationalAI)
Antti Honkela (University of Helsinki)
Sergey Lisitsyn (Yandex / shogun.ml)
Cheng Soon Ong (Data61, CSIRO)
