Timezone: »
Poster
Improved Distributed Principal Component Analysis
Yingyu Liang · Maria-Florina F Balcan · Vandana Kanchanapally · David Woodruff
We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as $k$-means clustering and low rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of these techniques we develop, such as input-sparsity subspace embeddings with high correctness probability with a dimension and sparsity independent of the error probability, may be of independent interest.
Author Information
Yingyu Liang (Princeton University)
Maria-Florina F Balcan (Georgia Tech)
Vandana Kanchanapally
David Woodruff (IBM Research)
More from the Same Authors
-
2021 Poster: Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles »
Jiefeng Chen · Frederick Liu · Besim Avci · Xi Wu · Yingyu Liang · Somesh Jha -
2016 Poster: Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates »
Yuanzhi Li · Yingyu Liang · Andrej Risteski -
2016 Poster: Sublinear Time Orthogonal Tensor Decomposition »
Zhao Song · David Woodruff · Huan Zhang -
2016 Poster: Communication-Optimal Distributed Clustering »
Jiecao Chen · He Sun · David Woodruff · Qin Zhang -
2015 : Sketching as a tool for numerical linear algebra »
David Woodruff -
2015 Poster: Scale Up Nonlinear Component Analysis with Doubly Stochastic Gradients »
Bo Xie · Yingyu Liang · Le Song -
2014 Poster: Active Learning and Best-Response Dynamics »
Maria-Florina F Balcan · Christopher Berlind · Avrim Blum · Emma Cohen · Kaushik Patnaik · Le Song -
2014 Poster: Low Rank Approximation Lower Bounds in Row-Update Streams »
David Woodruff -
2014 Poster: Learning Time-Varying Coverage Functions »
Nan Du · Yingyu Liang · Maria-Florina F Balcan · Le Song -
2014 Poster: Scalable Kernel Methods via Doubly Stochastic Gradients »
Bo Dai · Bo Xie · Niao He · Yingyu Liang · Anant Raj · Maria-Florina F Balcan · Le Song -
2014 Poster: Subspace Embeddings for the Polynomial Kernel »
Haim Avron · Huy Nguyen · David Woodruff -
2013 Workshop: Large Scale Matrix Analysis and Inference »
Reza Zadeh · Gunnar Carlsson · Michael Mahoney · Manfred K. Warmuth · Wouter M Koolen · Nati Srebro · Satyen Kale · Malik Magdon-Ismail · Ashish Goel · Matei A Zaharia · David Woodruff · Ioannis Koutis · Benjamin Recht -
2013 Poster: Statistical Active Learning Algorithms »
Maria-Florina F Balcan · Vitaly Feldman -
2013 Poster: Distributed k-means and k-median clustering on general communication topologies »
Maria-Florina F Balcan · Steven Ehrlich · Yingyu Liang -
2013 Poster: Sketching Structured Matrices for Faster Nonlinear Regression »
Haim Avron · Vikas Sindhwani · David Woodruff -
2008 Workshop: New Challanges in Theoretical Machine Learning: Data Dependent Concept Spaces »
Maria-Florina F Balcan · Shai Ben-David · Avrim Blum · Kristiaan Pelckmans · John Shawe-Taylor