Poster
Query Complexity of Clustering with Side Information
Arya Mazumdar · Barna Saha

Mon Dec 04 06:30 PM -- 10:30 PM (PST) @ Pacific Ballroom #226
Suppose we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pairwise queries of the form "do two elements $u$ and $v$ belong to the same cluster?". The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and give strong information-theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. To improve the accuracy of clustering, a fruitful approach in recent years has been to ask a domain expert or crowd to obtain labeled data interactively. Many heuristics have been proposed, and all of these use a similarity function to come up with a querying strategy. Even so, there is a lack of systematic theoretical study. Our main contribution in this paper is to show the dramatic power of side information, i.e., a similarity matrix, in reducing the query complexity of clustering. A similarity matrix represents noisy pairwise relationships, such as those computed by some function on attributes of the elements. A natural noisy model is one where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belongs to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O(\frac{k^2\log{n}}{\mathcal{H}^2(f_+\|f_-)})$, where $\mathcal{H}^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretically optimal within an $O(\log{n})$ factor. Our algorithms are all efficient and parameter-free, i.e., they work without any knowledge of $k$, $f_+$ and $f_-$, and depend only logarithmically on $n$.
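To make the $\Theta(nk)$ baseline concrete, here is a minimal Python sketch of the standard query-only strategy when no similarity matrix is available: each new element is compared against one representative of every cluster found so far. This is an illustration only, not the authors' algorithm; the names `cluster_by_queries` and `same_cluster`, and the toy labels, are hypothetical.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def cluster_by_queries(
    elements: Sequence[Hashable],
    same_cluster: Callable[[Hashable, Hashable], bool],
) -> Tuple[List[List[Hashable]], int]:
    """Recover clusters using only pairwise same-cluster queries.

    `same_cluster` stands in for the oracle/expert labeler (an assumption
    of this sketch). Returns the clusters and the number of queries asked.
    """
    clusters: List[List[Hashable]] = []
    queries = 0
    for u in elements:
        for cluster in clusters:
            # One query against the first member (the representative).
            queries += 1
            if same_cluster(u, cluster[0]):
                cluster.append(u)
                break
        else:
            # No existing cluster matched, so u starts a new one.
            clusters.append([u])
    return clusters, queries


# Toy ground truth (hypothetical): n = 5 elements, k = 3 clusters.
labels = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1}
clusters, queries = cluster_by_queries(
    list(labels), lambda u, v: labels[u] == labels[v]
)
print(clusters, queries)
```

Each element asks at most one query per existing cluster, so the total never exceeds $nk$ and $k$ does not need to be known in advance. The abstract's result is that a similarity matrix lets one do far better than this baseline.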

Author Information

Arya Mazumdar (University of Massachusetts Amherst)
Barna Saha (University of Massachusetts Amherst)

More from the Same Authors

  • 2021 Poster: Support Recovery of Sparse Signals from a Mixture of Linear Measurements
    Soumyabrata Pal · Arya Mazumdar · Venkata Gandikota
  • 2021 Poster: Fuzzy Clustering with Similarity Queries
    Wasim Huleihel · Arya Mazumdar · Soumyabrata Pal
  • 2019 : Poster Session
    Gergely Flamich · Shashanka Ubaru · Charles Zheng · Josip Djolonga · Kristoffer Wickstrøm · Diego Granziol · Konstantinos Pitas · Jun Li · Robert Williamson · Sangwoong Yoon · Kwot Sin Lee · Julian Zilly · Linda Petrini · Ian Fischer · Zhe Dong · Alexander Alemi · Bao-Ngoc Nguyen · Rob Brekelmans · Tailin Wu · Aditya Mahajan · Alexander Li · Kirankumar Shiragur · Yair Carmon · Linara Adilova · SHIYU LIU · Bang An · Sanjeeb Dash · Oktay Gunluk · Arya Mazumdar · Mehul Motani · Julia Rosenzweig · Michael Kamp · Marton Havasi · Leighton P Barnes · Zhengqing Zhou · Yi Hao · Dylan Foster · Yuval Benjamini · Nati Srebro · Michael Tschannen · Paul Rubenstein · Sylvain Gelly · John Duchi · Aaron Sidford · Robin Ru · Stefan Zohren · Murtaza Dalal · Michael A Osborne · Stephen J Roberts · Moses Charikar · Jayakumar Subramanian · Xiaodi Fan · Max Schwarzer · Nicholas Roberts · Simon Lacoste-Julien · Vinay Prabhu · Aram Galstyan · Greg Ver Steeg · Lalitha Sankar · Yung-Kyun Noh · Gautam Dasarathy · Frank Park · Ngai-Man (Man) Cheung · Ngoc-Trung Tran · Linxiao Yang · Ben Poole · Andrea Censi · Tristan Sylvain · R Devon Hjelm · Bangjie Liu · Jose Gallego-Posada · Tyler Sypherd · Kai Yang · Jan Nikolas Morshuis
  • 2019 : Poster Session
    Jonathan Scarlett · Piotr Indyk · Ali Vakilian · Adrian Weller · Partha P Mitra · Benjamin Aubin · Bruno Loureiro · Florent Krzakala · Lenka Zdeborová · Kristina Monakhova · Joshua Yurtsever · Laura Waller · Hendrik Sommerhoff · Michael Moeller · Rushil Anirudh · Shuang Qiu · Xiaohan Wei · Zhuoran Yang · Jayaraman Thiagarajan · Salman Asif · Michael Gillhofer · Johannes Brandstetter · Sepp Hochreiter · Felix Petersen · Dhruv Patel · Assad Oberai · Akshay Kamath · Sushrut Karmalkar · Eric Price · Ali Ahmed · Zahra Kadkhodaie · Sreyas Mohan · Eero Simoncelli · Carlos Fernandez-Granda · Oscar Leong · Wesam Sakla · Rebecca Willett · Stephan Hoyer · Jascha Sohl-Dickstein · Sam Greydanus · Gauri Jagatap · Chinmay Hegde · Michael Kellman · Jonathan Tamir · Nouamane Laanait · Ousmane Dia · Mirco Ravanelli · Jonathan Binas · Negar Rostamzadeh · Shirin Jalali · Tiantian Fang · Alex Schwing · Sébastien Lachapelle · Philippe Brouillard · Tristan Deleu · Simon Lacoste-Julien · Stella Yu · Arya Mazumdar · Ankit Singh Rawat · Yue Zhao · Jianshu Chen · Xiaoyang Li · Hubert Ramsauer · Gabrio Rizzuti · Nikolaos Mitsakos · Dingzhou Cao · Thomas Strohmer · Yang Li · Pei Peng · Gregory Ongie
  • 2019 Poster: Superset Technique for Approximate Recovery in One-Bit Compressed Sensing
    Larkin Flodin · Venkata Gandikota · Arya Mazumdar
  • 2019 Poster: Sample Complexity of Learning Mixture of Sparse Linear Regressions
    Akshay Krishnamurthy · Arya Mazumdar · Andrew McGregor · Soumyabrata Pal
  • 2019 Poster: Same-Cluster Querying for Overlapping Clusters
    Wasim Huleihel · Arya Mazumdar · Muriel Medard · Soumyabrata Pal
  • 2017 Poster: Clustering with Noisy Queries
    Arya Mazumdar · Barna Saha
  • 2017 Poster: Semisupervised Clustering, AND-Queries and Locally Encodable Source Coding
    Arya Mazumdar · Soumyabrata Pal
  • 2017 Spotlight: Semisupervised Clustering, AND-Queries and Locally Encodable Source Coding
    Arya Mazumdar · Soumyabrata Pal
  • 2015 Poster: Associative Memory via a Sparse Recovery Model
    Arya Mazumdar · Ankit Singh Rawat