Coreset for Line-Sets Clustering

Sagi Lotan · Ernesto Evgeniy Sanches Shayda · Dan Feldman

Hall J #818

Abstract: The input to the \emph{line-sets $k$-median} problem is an integer $k \geq 1$ and a set $\mathcal{L} = \{L_1,\dots,L_n\}$ that contains $n$ sets of lines in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points in $\mathbb{R}^d$) that minimizes the sum $\sum_{L \in \mathcal{L}}\min_{\ell\in L, c\in C}\mathrm{dist}(\ell,c)$ of Euclidean distances from each set to its closest center, where $\mathrm{dist}(\ell,c):=\min_{x\in \ell}\|x-c\|_2$. An \emph{$\varepsilon$-coreset} for this problem is a weighted subset of the sets in $\mathcal{L}$ that approximates this sum up to a multiplicative factor of $1 \pm \varepsilon$, for every set $C$ of $k$ centers. We prove that \emph{every} such input set $\mathcal{L}$ has a small $\varepsilon$-coreset, and provide the first coreset construction for this problem and its variants. The coreset consists of $O(\log^2 n)$ weighted line-sets from $\mathcal{L}$, and is constructed in $O(n\log n)$ time for every fixed $d, k\geq 1$ and $\varepsilon \in (0,1)$. The main technique is based on a novel reduction to a ``fair clustering'' of colored points to colored centers. We then provide a coreset for this coloring problem, which may be of independent interest. Open source code and experiments are also provided.
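To make the objective concrete, the following is a minimal sketch that evaluates the line-sets $k$-median cost for a given set of centers. It is not the authors' released code and it does not construct a coreset. The representation of a line as a pair (point on the line, direction vector) and the names `point_line_dist` and `line_sets_cost` are assumptions made for illustration. The optional `weights` argument shows how the same sum would be evaluated on a weighted subset such as a coreset.

```python
import numpy as np

def point_line_dist(p, v, c):
    """Euclidean distance from center c to the line {p + t*v : t real}."""
    v = v / np.linalg.norm(v)          # unit direction of the line
    r = c - p                          # vector from a point on the line to c
    # Remove the component of r along the line; what remains is orthogonal.
    return np.linalg.norm(r - np.dot(r, v) * v)

def line_sets_cost(line_sets, centers, weights=None):
    """Sum over line-sets of the smallest line-to-center distance.

    line_sets: list of sets, each a list of (p, v) pairs (assumed format).
    centers:   list of k points in R^d.
    weights:   optional per-set weights (all ones if omitted).
    """
    if weights is None:
        weights = np.ones(len(line_sets))
    total = 0.0
    for w, L in zip(weights, line_sets):
        # min over every line in the set and every center
        total += w * min(point_line_dist(p, v, c) for (p, v) in L for c in centers)
    return total

# Toy instance in R^2 (hypothetical data): two line-sets, two centers.
L1 = [(np.array([0.0, 0.0]), np.array([1.0, 0.0])),   # the x-axis
      (np.array([0.0, 5.0]), np.array([1.0, 0.0]))]   # the line y = 5
L2 = [(np.array([3.0, 0.0]), np.array([0.0, 1.0]))]   # the line x = 3
centers = [np.array([0.0, 1.0]), np.array([10.0, 10.0])]
cost = line_sets_cost([L1, L2], centers)              # 1 + 3 = 4
```

In this toy instance the first set contributes 1 (the x-axis is at distance 1 from the center $(0,1)$) and the second contributes 3, so the cost is 4. An $\varepsilon$-coreset would be a weighted sub-collection on which the same function returns a value within a factor of $1 \pm \varepsilon$ of the full cost for every choice of `centers`.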
