
Poster

List-Decodable Mean Estimation in Nearly-PCA Time

Ilias Diakonikolas · Daniel Kane · Daniel Kongsgaard · Jerry Li · Kevin Tian

Keywords: [ Clustering ] [ Theory ]


Abstract: Robust statistics has traditionally focused on designing estimators tolerant to a minority of contaminated data. {\em List-decodable learning}~\cite{CharikarSV17} studies the more challenging regime where only a minority $\frac{1}{k}$ fraction of the dataset, for some $k \geq 2$, is drawn from the distribution of interest, and no assumptions are made on the remaining data. We study the fundamental task of list-decodable mean estimation in high dimensions. Our main result is a new algorithm for bounded covariance distributions with optimal sample complexity and a near-optimal error guarantee, running in {\em nearly-PCA time}. Assuming the ground truth distribution on $\mathbb{R}^d$ has identity-bounded covariance, our algorithm outputs $O(k)$ candidate means, one of which is within distance $O(\sqrt{k \log k})$ of the truth. Our algorithm runs in time $\widetilde{O}(ndk)$, where $n$ is the dataset size. This runtime nearly matches the cost of performing $k$-PCA on the data, a natural bottleneck of known algorithms for (very) special cases of our problem, such as clustering well-separated mixtures. Prior to our work, the fastest runtimes were $\widetilde{O}(n^2 d k^2)$~\cite{DiakonikolasKK20} and $\widetilde{O}(n d k^C)$~\cite{CherapanamjeriMY20} for an unspecified constant $C \geq 6$. Our approach builds on a novel soft downweighting method we term SIFT, arguably the simplest known polynomial-time mean estimator in the list-decodable setting. To develop our fast algorithms, we improve the runtime of SIFT via a careful ``win-win-win'' analysis of an approximate Ky Fan matrix multiplicative weights procedure we develop, which may be of independent interest.
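To make the soft-downweighting idea concrete, here is a minimal illustrative sketch (this is NOT the paper's SIFT algorithm or its analysis; the iteration count, temperature, and weight update rule are assumptions for illustration). Points whose projection onto the top principal direction of the weighted data deviates most are exponentially downweighted, so structured outliers lose influence on the weighted mean. This toy version returns a single estimate, so it is only meaningful in the majority-inlier regime rather than the list-decodable one:

```python
import numpy as np

def soft_downweight_mean(X, iters=20, temp=1.0):
    """Toy soft-downweighting mean estimate (illustrative sketch only).

    Repeatedly: center by the weighted mean, find the top eigenvector of
    the weighted covariance, and exponentially downweight points with
    large squared projection along that direction.
    """
    n, d = X.shape
    w = np.ones(n) / n                       # uniform initial weights
    for _ in range(iters):
        mu = w @ X                           # current weighted mean
        Y = X - mu
        cov = (w[:, None] * Y).T @ Y         # weighted covariance
        _, vecs = np.linalg.eigh(cov)
        v = vecs[:, -1]                      # top principal direction
        scores = (Y @ v) ** 2                # outlyingness scores
        w = w * np.exp(-temp * scores / scores.max())
        w = w / w.sum()                      # renormalize
    return w @ X
```

With a 70% inlier fraction and a far-away outlier cluster, the estimate lands near the true mean even though the raw sample mean is badly biased; the paper's actual algorithm instead maintains $O(k)$ candidate means and replaces the exact top-eigenvector computation with an approximate Ky Fan matrix multiplicative weights procedure to reach $\widetilde{O}(ndk)$ time.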
