Deep distributed decision trees and tree ensembles have grown in importance due to the need to model increasingly large datasets. However, PLANET, the standard distributed tree learning algorithm implemented in systems such as XGBoost and Spark MLlib, scales poorly as data dimensionality and tree depths grow. We present Yggdrasil, a new distributed tree learning method that outperforms existing methods by up to 24x. Unlike PLANET, Yggdrasil is based on vertical partitioning of the data (i.e., partitioning by feature), along with a set of optimized data structures to reduce the CPU and communication costs of training. Yggdrasil (1) trains directly on compressed data for compressible features and labels; (2) introduces efficient data structures for training on uncompressed data; and (3) minimizes communication between nodes by using sparse bitvectors. Moreover, while PLANET approximates split points through feature binning, Yggdrasil does not require binning, and we analytically characterize the impact of this approximation. We evaluate Yggdrasil on the MNIST 8M dataset and a high-dimensional dataset at Yahoo; for both, Yggdrasil is faster by up to an order of magnitude.
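To make the vertical-partitioning idea concrete, below is a minimal single-process sketch, not Yggdrasil's actual Spark/Scala implementation: each simulated worker owns one feature column, proposes its best local split, and the chosen split is described by a row bitvector of the kind that would be communicated across the cluster. The variance-reduction criterion, function names, and toy data are illustrative assumptions.

```python
import numpy as np

def best_split_for_feature(feature_col, labels):
    """Worker-side step (sketch): find the best split on one feature,
    scoring candidates by reduction in label variance. The impurity
    measure here is illustrative, not necessarily Yggdrasil's."""
    order = np.argsort(feature_col)
    vals, ys = feature_col[order], labels[order]
    total_score = ys.var() * len(ys)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(ys)):
        if vals[i] == vals[i - 1]:
            continue  # no valid threshold between equal values
        left, right = ys[:i], ys[i:]
        gain = total_score - (left.var() * len(left) + right.var() * len(right))
        if gain > best_gain:
            best_gain = gain
            best_threshold = (vals[i - 1] + vals[i]) / 2.0
    return best_gain, best_threshold

def train_stump_vertical(columns, labels):
    """Master-side step (sketch): each 'worker' owns one feature column
    (vertical partitioning). Workers propose their best local split; the
    master picks the global winner, and the winning split is expressed as
    a row bitvector -- in a real cluster, this bitvector is the only
    per-split information that must cross the network."""
    proposals = [best_split_for_feature(col, labels) for col in columns]
    j = int(np.argmax([gain for gain, _ in proposals]))
    _, threshold = proposals[j]
    left_bitvector = columns[j] <= threshold
    return j, threshold, left_bitvector

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 2] > 0.3).astype(float) + 0.1 * rng.normal(size=1000)
    j, thr, bits = train_stump_vertical([X[:, k] for k in range(X.shape[1])], y)
    print(f"split on feature {j} at {thr:.3f}; {bits.sum()} rows go left")
```

In this layout, sorting and split search happen entirely within the worker that owns a feature, so communication per split is independent of the number of features, which is the property the paper exploits for high-dimensional data and deep trees.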
Author Information
Firas Abuzaid (MIT)
Joseph K Bradley (Databricks)
Feynman Liang (Cambridge University Engineering Department)
Andrew Feng (Yahoo!)
Lee Yang (Yahoo!)
Matei Zaharia (MIT)
Ameet S Talwalkar (CMU)
More from the Same Authors
- 2021 : Simulated User Studies for Explanation Evaluation »
  Valerie Chen · Gregory Plumb · Nicholay Topin · Ameet S Talwalkar
- 2021 : Bayesian Persuasion for Algorithmic Recourse »
  Keegan Harris · Valerie Chen · Joon Kim · Ameet S Talwalkar · Hoda Heidari · Steven Wu
- 2021 : [S9] Simulated User Studies for Explanation Evaluation »
  Valerie Chen · Gregory Plumb · Nicholay Topin · Ameet S Talwalkar
- 2020 Poster: Exact expressions for double descent and implicit regularization via surrogate random design »
  Michal Derezinski · Feynman Liang · Michael Mahoney
- 2020 Poster: Precise expressions for random projections: Low-rank approximation and randomized Newton »
  Michal Derezinski · Feynman Liang · Zhenyu Liao · Michael Mahoney
- 2017 Poster: Variable Importance Using Decision Trees »
  Jalil Kazemitabar · Arash Amini · Adam Bloniarz · Ameet S Talwalkar
- 2017 Poster: Federated Multi-Task Learning »
  Virginia Smith · Chao-Kai Chiang · Maziar Sanjabi · Ameet S Talwalkar
- 2016 : Invited Talk: Paleo: A Performance Model for Deep Neural Networks (Ameet Talwalkar, UCLA) »
  Ameet S Talwalkar
- 2014 Workshop: Distributed Machine Learning and Matrix Computations »
  Reza Zadeh · Ion Stoica · Ameet S Talwalkar
- 2014 Poster: Parallel Double Greedy Submodular Maximization »
  Xinghao Pan · Stefanie Jegelka · Joseph Gonzalez · Joseph K Bradley · Michael Jordan
- 2011 Workshop: Sparse Representation and Low-rank Approximation »
  Ameet S Talwalkar · Lester W Mackey · Mehryar Mohri · Michael W Mahoney · Francis Bach · Mike Davies · Remi Gribonval · Guillaume R Obozinski
- 2011 Poster: Divide-and-Conquer Matrix Factorization »
  Lester W Mackey · Ameet S Talwalkar · Michael Jordan
- 2010 Workshop: Low-rank Methods for Large-scale Machine Learning »
  Arthur Gretton · Michael W Mahoney · Mehryar Mohri · Ameet S Talwalkar
- 2009 Poster: Ensemble Nystrom Method »
  Sanjiv Kumar · Mehryar Mohri · Ameet S Talwalkar
- 2007 Oral: FilterBoost: Regression and Classification on Large Datasets »
  Joseph K Bradley · Robert E Schapire
- 2007 Poster: FilterBoost: Regression and Classification on Large Datasets »
  Joseph K Bradley · Robert E Schapire