NeurIPS Poster Simple, Scalable and Effective Clustering via One-Dimensional Projections

Poster

Simple, Scalable and Effective Clustering via One-Dimensional Projections

Moses Charikar · Monika Henzinger · Lunjia Hu · Maximilian Vötsch · Erik Waingarten

Great Hall & Hall B1+B2 (level 1) #1111

[ Abstract ]

[ Paper] [ Poster] [ OpenReview]

Abstract: Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and

k

$k$ -means++ can take

Ω (n d k)

$\Omega(ndk)$ time when clustering

n

$n$ points in a

d

$d$ -dimensional space (represented by an

n \times d

$n\times d$ matrix

X

$X$ ) into

k

$k$ clusters. On massive datasets with moderate to large

k

$k$ , the multiplicative

k

$k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time

O (n n z (X) + n \log n)

$O(\mathsf{nnz}(X) + n\log n)$ for arbitrary

k

$k$ . Here

n n z (X)

$\mathsf{nnz}(X)$ is the total number of non-zero entries in the input dataset

X

$X$ , which is upper bounded by

n d

$nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio

\tilde{O} (k^{4})

$\widetilde{O}(k^4)$ on any input dataset for the

k

$k$ -means objective, and our experiments show that the quality of the clusters found by our algorithm is usually much better than this worst-case bound. We use our algorithm for

k

$k$ -means clustering and for coreset construction; our experiments show that it gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks. Our theoretical analysis is based on novel results of independent interest. We show that the approximation ratio achieved after a random one-dimensional projection can be lifted to the original points and that

k

$k$ -means++ seeding can be implemented in expected time

O (n \log n)

$O(n\log n)$ in one dimension.

Chat is not available.