Timezone: »
Clustering is a ubiquitous task in data science. Compared to the commonly used k-means clustering, k-medoids clustering requires the cluster centers to be actual data points and supports arbitrary distance metrics, which permits greater interpretability and the clustering of structured objects. Current state-of-the-art k-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size n for each iteration, being prohibitively expensive for large datasets. We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits, that reduces the complexity of each PAM iteration from O(n^2) to O(nlogn) and returns the same results with high probability, under assumptions on the data that often hold in practice. As such, BanditPAM matches state-of-the-art clustering loss while reaching solutions much faster. We empirically validate our results on several large real-world datasets, including a coding exercise submissions dataset from Code.org, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. In these experiments, we observe that BanditPAM returns the same results as state-of-the-art PAM-like algorithms up to 4x faster while performing up to 200x fewer distance computations. The improvements demonstrated by BanditPAM enable k-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release highly optimized Python and C++ implementations of our algorithm.
Author Information
Mo Tiwari (Stanford University)
Martin Zhang (Harvard University)
James J Mayclin (Stanford University)
Sebastian Thrun (Stanford University)
Chris Piech (Stanford)
Ilan Shomorony (University of Illinois at Urbana Champaign)
More from the Same Authors
-
2022 Poster: MABSplit: Faster Forest Training Using Multi-Armed Bandits »
Mo Tiwari · Ryan Kang · Jaeyong Lee · Chris Piech · Ilan Shomorony · Sebastian Thrun · Martin Zhang -
2022 Poster: Giving Feedback on Interactive Student Programs with Meta-Exploration »
Evan Liu · Moritz Stephan · Allen Nie · Chris Piech · Emma Brunskill · Chelsea Finn -
2021 : Panel Discussion »
Jo Boaler · Yuri Burda · Chris Piech · Sumeet Singh · Kurt VanLehn -
2020 Poster: Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment »
Govinda Kamath · Tavor Baharav · Ilan Shomorony -
2015 Poster: Deep Knowledge Tracing »
Chris Piech · Jonathan Bassen · Jonathan Huang · Surya Ganguli · Mehran Sahami · Leonidas Guibas · Jascha Sohl-Dickstein -
2013 Demonstration: Codewebs: a Pedagogical Search Engine for Code Submissions to a MOOC »
Jonathan Huang · Chris Piech · Andy Nguyen · Leonidas Guibas -
2007 Workshop: The Urban Challenge – Perspectives of Autonomous Driving »
Sebastian Thrun · Chris Urmson · Raul Rojas · William Uther