Poster
Simple, Scalable and Effective Clustering via One-Dimensional Projections
Moses Charikar · Monika Henzinger · Lunjia Hu · Maximilian Vötsch · Erik Waingarten
Great Hall & Hall B1+B2 (level 1) #1111
Abstract:
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and -means++ can take time when clustering points in a -dimensional space (represented by an matrix ) into clusters. On massive datasets with moderate to large , the multiplicative factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time for arbitrary . Here is the total number of non-zero entries in the input dataset , which is upper bounded by and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio on any input dataset for the -means objective, and our experiments show that the quality of the clusters found by our algorithm is usually much better than this worst-case bound. We use our algorithm for -means clustering and for coreset construction; our experiments show that it gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks. Our theoretical analysis is based on novel results of independent interest. We show that the approximation ratio achieved after a random one-dimensional projection can be lifted to the original points and that -means++ seeding can be implemented in expected time in one dimension.
Chat is not available.