
Nearest Neighbors for Modern Applications with Massive Data: An Age-old Solution with New Challenges
George H Chen · Devavrat Shah · Christina Lee

Fri Dec 08 08:00 AM -- 06:30 PM (PST) @ 201 B
Event URL: https://nn2017.mit.edu

Many modern methods for prediction leverage nearest neighbor (NN) search to find past training examples most similar to a test example, an idea that dates back in text to at least the 11th century in the “Book of Optics” by Alhazen. Today, NN methods remain popular, often as a cog in a bigger prediction machine, used for instance in recommendation systems, forecasting baseball player performance and election outcomes, survival analysis in healthcare, image in-painting, crowdsourcing, graphon estimation, and more. The popularity of NN methods is due in no small part to the proliferation of high-quality, fast approximate NN search methods that scale to the high-dimensional, massive datasets typical of contemporary applications. Moreover, NN prediction readily pairs with methods that learn similarities, such as metric learning methods or Siamese networks. In fact, random forests and many boosting methods can themselves be viewed as nearest neighbor predictors that learn similarities.
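To make the core idea concrete, here is a minimal k-nearest-neighbor classifier sketch; it uses plain Euclidean distance, exhaustive search, and majority vote, whereas the systems discussed in this workshop typically swap in learned similarities and approximate search:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority label

# Toy example: two well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.1])))      # -> 0
```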

Despite the popularity, success, and age of nearest neighbor methods, our theoretical understanding of them is still surprisingly incomplete (perhaps to the chagrin of Fix, Hodges, Cover, and Hart, whose analyses initiated the field) and can also be disconnected from what practitioners actually want or care about. Many approximate nearest neighbor methods that succeed in practice have no known theoretical guarantees, and many of the guarantees for exact nearest neighbor methods do not readily handle approximation. Meanwhile, many applications use variations on NN methods to which existing theory may not extend, or for which existing theory is not easily usable by a practitioner. Suffice it to say, a lot is lost in translation between the different communities working with NN methods.

In short, NN methods sit at an exciting intersection of classical statistics, machine learning, data structures, and domain-specific expertise. The aim of this workshop is to bring together theoreticians and practitioners from these different backgrounds, with their diverse range of perspectives, and to bring everyone up to speed on:
- Best known statistical/computational guarantees (especially recent non-asymptotic results)
- Latest methods/systems that have been developed especially for fast approximate NN search that scale to massive datasets
- Various applications in which NN methods are heavily used as a critical component in prediction or inference

By gathering a diverse crowd, we hope attendees share their perspectives, identify ways to bridge theory and practice, and discuss avenues of future research.

Fri 9:00 a.m. - 9:20 a.m.

I give a brief history of nearest neighbor (NN) methods, starting from Alhazen’s “Book of Optics” in the 11th century and leading up to the present day. Surprisingly, the most general nonasymptotic results on NN prediction emerged only recently, in the 2014 work of Chaudhuri and Dasgupta. Turning toward “user-friendly” theory with an eye toward practitioners, I mention recent guarantees (2013-2015) for the contemporary applications of time series forecasting, online collaborative filtering, and medical image segmentation; in all three cases, clustering structure enables successful NN prediction.

George H Chen
Fri 9:20 a.m. - 10:05 a.m.

Motivated by applications such as autonomous vehicles, test-time attacks via adversarial examples have received a great deal of recent attention. In this setting, an adversary is capable of making queries to a classifier, and perturbs a test example by a small amount in order to force the classifier to report an incorrect label. While a long line of work has explored a number of attacks, not many reliable defenses are known, and there is an overall lack of general understanding about the foundations of designing machine learning algorithms robust to adversarial examples.

In this talk, we take a step towards addressing this challenging question by introducing a new theoretical framework, analogous to bias-variance theory, which we can use to tease out the causes of vulnerability. We apply our framework to a simple classification algorithm, nearest neighbors, and analyze its robustness to adversarial examples. Motivated by our analysis, we propose a modified version of the nearest neighbor algorithm, and demonstrate both theoretically and empirically that it has superior robustness to standard nearest neighbors.
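To make the vulnerability concrete, here is a toy illustration (not the speaker's algorithm) of how a tiny perturbation near the decision boundary can flip a 1-nearest-neighbor prediction:

```python
import numpy as np

def nn_label(X, y, q):
    """1-nearest-neighbor label of query q."""
    return y[np.argmin(np.linalg.norm(X - q, axis=1))]

# Two training points with different labels; the decision boundary is x = 0.5.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([0, 1])

q = np.array([0.49, 0.0])            # just on the label-0 side of the boundary
q_adv = q + np.array([0.02, 0.0])    # a tiny adversarial nudge across it

print(nn_label(X, y, q), nn_label(X, y, q_adv))  # -> 0 1
```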

Kamalika Chaudhuri
Fri 10:05 a.m. - 10:25 a.m.

We bring the tools from Blackwell's seminal result on comparing two stochastic experiments to shed new light on a modern application of great interest: generative adversarial networks (GANs). Binary hypothesis testing is at the center of GANs, and we propose new data processing inequalities that allow us to discover new algorithms to combat mode collapse, obtain sharper analyses, and give simpler proofs. This leads to a new architecture to handle one of the major challenges in GANs, known as "mode collapse": the lack of diversity in the samples generated by the learned generators. The hypothesis testing view of GANs allows us to analyze the new architecture and show that it encourages generators with no mode collapse. Experimental results show that the proposed architecture can learn to generate samples with diversity that is orders of magnitude better than competing approaches, while being simpler. For this talk, I will assume no prior background on GANs.

Sewoong Oh
Fri 11:00 a.m. - 11:45 a.m.

I will give an overview of recent research on designing provable data-dependent approximate similarity search methods for high-dimensional data. Unlike many approaches proposed in the algorithms literature, the new methods lead to data structures that adapt to the data while retaining correctness and running time guarantees. The high-level principle behind those methods is that every data set (even a completely random one) has some structure that can be exploited algorithmically. This interpolates between basic "data-oblivious" approaches (e.g., those based on simple random projections) that are analyzable but do not adapt easily to the structure of the data, and basic "data-dependent" methods that exploit specific structure of the data, but might not perform well if this structure is not present.

Concretely, I will cover two recent lines of research:
- Data-dependent Locality-Sensitive Hashing: approximate nearest neighbor search methods that provably improve over the best possible data-oblivious LSH algorithms (covers work from SODA'14, STOC'15, and NIPS'15).
- Data-dependent metric compression: algorithms that represent sets of points using a small number of bits while approximately preserving distances to higher accuracy than previously known (covers work from SODA'17 and NIPS'17).
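For context, the basic data-oblivious baseline that these data-dependent methods improve on is random-projection LSH. A minimal sketch (the bucket width w and the parameters a, b are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_hash(x, a, b, w=1.0):
    """Euclidean (p-stable) LSH: each hash coordinate is floor((a_i . x + b_i) / w),
    so nearby points land in the same bucket with high probability."""
    return tuple(np.floor((a @ x + b) / w).astype(int))

d, num_hashes = 8, 4
a = rng.normal(size=(num_hashes, d))        # random Gaussian directions
b = rng.uniform(0.0, 1.0, size=num_hashes)  # random offsets in [0, w)
x = rng.normal(size=d)
print(lsh_hash(x, a, b))                    # bucket id for x
```

Search then only compares the query against points whose bucket id matches, rather than against the whole dataset.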

Piotr Indyk
Fri 11:45 a.m. - 12:05 p.m.

Most exact methods for k-nearest neighbour search suffer from the curse of dimensionality; that is, their query times exhibit exponential dependence on either the ambient or the intrinsic dimensionality. Dynamic Continuous Indexing (DCI) [19] offers a promising way of circumventing the curse and successfully reduces the dependence of query time on intrinsic dimensionality from exponential to sublinear. In this paper, we propose a variant of DCI, which we call Prioritized DCI, and show a remarkable improvement in the dependence of query time on intrinsic dimensionality. In particular, a linear increase in intrinsic dimensionality, or equivalently, an exponential increase in the number of points near a query, can be mostly counteracted with just a linear increase in space. We also demonstrate empirically that Prioritized DCI significantly outperforms prior methods. In particular, relative to Locality-Sensitive Hashing (LSH), Prioritized DCI reduces the number of distance evaluations by a factor of 14 to 116 and the memory consumption by a factor of 21.
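The following toy sketch conveys only the high-level flavor of projection-based prioritized search (visiting candidates in order of their one-dimensional projected distance to the query, then re-ranking by true distance); it is not the actual Prioritized DCI data structure:

```python
import heapq
import numpy as np

def projected_priority_search(X, q, directions, k=1, max_visits=10):
    """Visit candidate points in order of their 1-D projected distance to the
    query (across several projection directions), then re-rank the visited
    candidates by true distance. A toy caricature of prioritized search."""
    proj_X = X @ directions.T                 # (n, m) projections of the data
    proj_q = directions @ q                   # (m,) projections of the query
    heap = [(abs(proj_X[i, l] - proj_q[l]), i)
            for i in range(X.shape[0]) for l in range(directions.shape[0])]
    heapq.heapify(heap)
    visited = []
    while heap and len(visited) < max_visits:
        _, i = heapq.heappop(heap)
        if i not in visited:
            visited.append(i)
    visited.sort(key=lambda i: np.linalg.norm(X[i] - q))
    return visited[:k]

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
q = np.array([0.9, 0.1])
directions = np.array([[1.0, 0.0], [0.0, 1.0]])  # fixed here; random in general
print(projected_priority_search(X, q, directions))  # -> [1]
```

Capping `max_visits` is what trades accuracy for speed; the real algorithm organizes the projections into indexed levels so that the promising candidates surface much earlier.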

Ke Li
Fri 1:45 p.m. - 2:30 p.m.

Nearest neighbors is an algorithm everyone loves to hate. It's too trivial, too brute-force, doesn't offer any insights. In this talk, I will try to make you fall in love with the humble nearest neighbor. First, I will discuss the use of nearest neighbor as an exceptionally useful baseline to guard against our field's natural bias in favor of elegant algorithms over data. Then, I will discuss some scenarios when nearest neighbors is particularly useful in practice.

Alexei Efros
Fri 2:30 p.m. - 2:50 p.m.

We describe an extensible data science platform based on non-parametric statistical methods such as nearest neighbors. This platform was born out of the need to provide scalable and flexible predictive analytics solutions for retailers and the federal government. In this talk, I will describe the formalism associated with the platform, discuss its out-of-the-box performance on benchmark datasets, and show a live demo of building a recommendation system and of detecting geo-spatial anomalies in the maritime domain.

Vishal Doshi
Fri 2:50 p.m. - 4:15 p.m.
  • Learning Supervised Binary Hashing without Binary Code Optimization (Miguel Carreira-Perpinan; Ramin Raziperchikolaei)
  • Sub-linear Privacy-preserving Search with Unsecured Server and Semi-honest Parties (Beidi Chen)
  • On Nearest Neighbors in Non Local Means Denoising (Iuri Frosio; Jan Kautz)
  • Fast k-Nearest Neighbour Search via Prioritized DCI (Ke Li; Jitendra Malik)
  • Generative Local Metric Learning for Nearest Neighbor Methods (Yung-Kyun Noh; Masashi Sugiyama; Daniel D Lee)
  • Private Document Classification in Federated Databases (Phillipp Schoppmann; Adria Gascon; Borja Balle)
  • Optimizing Revenue Over Data-Driven Assortments (Deeksha Sinha; Theja Tulabandhula)
  • Fast Distance Metric Learning for Multi-Task Large Margin Nearest Neighbor Classification (Adams Yu)
Beidi Chen, Borja Balle, Daniel Lee, Iuri Frosio, Jitendra Malik, Jan Kautz, Ke Li, Masashi Sugiyama, Miguel A. Carreira-Perpinan, Ramin Raziperchikolaei, Theja Tulabandhula, Yung-Kyun Noh, Adams Wei Yu
Fri 4:15 p.m. - 5:00 p.m.

I explain the Boundary Forest algorithm, a simple, fast, incremental online algorithm that can be used in both supervised and unsupervised settings, for classification, regression, or retrieval problems. As a classification algorithm, its generalization performance is at least as good as that of K-nearest neighbors, while it responds to queries very quickly. It maintains trees of examples that let it train and answer test queries in time logarithmic in the number of stored examples, which is typically far smaller than the number of training examples.
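A minimal single-tree sketch of the idea described above: store a new example only when the current tree would misclassify it (so only "boundary" examples accumulate), and answer queries by greedy descent toward the closest node. The actual Boundary Forest algorithm uses several such trees and differs in its details:

```python
import numpy as np

class BoundaryTree:
    """Toy sketch of a single boundary tree (the forest uses several)."""
    def __init__(self, x, y):
        self.x, self.y, self.children = np.asarray(x, dtype=float), y, []

    def _closest_node(self, q):
        # Greedy descent: at each node, move to whichever of {node, children}
        # is closest to the query; stop when the node itself is closest.
        node = self
        while True:
            best = min([node] + node.children,
                       key=lambda n: np.linalg.norm(n.x - q))
            if best is node:
                return node
            node = best

    def query(self, q):
        return self._closest_node(q).y

    def train(self, x, y):
        node = self._closest_node(np.asarray(x, dtype=float))
        if node.y != y:                      # store only misclassified examples
            node.children.append(BoundaryTree(x, y))

tree = BoundaryTree([0.0, 0.0], 0)
for x, y in [([1.0, 1.0], 1), ([0.2, 0.1], 0), ([0.9, 0.8], 1)]:
    tree.train(x, y)
print(tree.query(np.array([0.95, 0.9])))  # -> 1
```

Note how the training loop above stores only two of the four examples: the ones near the class boundary are kept, while points the tree already classifies correctly are discarded.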

Jonathan Yedidia
Fri 5:00 p.m. - 5:20 p.m.

The sparse matrix estimation problem consists of estimating the distribution of an n-by-n matrix Y from a sparsely observed single instance of this matrix, where the entries of Y are independent random variables. This captures a wide array of problems; special instances include matrix completion in the context of recommendation systems, graphon estimation, and community detection in (mixed membership) stochastic block models. Inspired by classical collaborative filtering for recommendation systems, we propose a novel iterative, collaborative-filtering-style algorithm for matrix estimation in this generic setting. Under model assumptions of uniform sampling, bounded entries, and finite spectrum, we provide bounds on the mean squared error (MSE) of our estimator and show improved sample complexity.
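The collaborative filtering flavor of such estimators can be illustrated with a toy sketch: estimate a missing entry by averaging the corresponding observed entries of "neighboring" rows, i.e., rows close to the target row on their overlapping observations. This is illustrative only (the `radius` threshold is an assumption), not the paper's exact algorithm:

```python
import numpy as np

def nn_matrix_estimate(Y, mask, i, j, radius=0.5):
    """Estimate entry (i, j) of a partially observed matrix by averaging the
    observed (u, j) entries over rows u whose overlapping observed entries
    are close to row i's (nearest-neighbor collaborative filtering)."""
    estimates = []
    for u in range(Y.shape[0]):
        if u == i or not mask[u, j]:
            continue
        overlap = mask[i] & mask[u]
        overlap[j] = False                    # don't use the target column
        if overlap.sum() == 0:
            continue
        dist = np.mean((Y[i, overlap] - Y[u, overlap]) ** 2)
        if dist <= radius:                    # row u is a "neighbor" of row i
            estimates.append(Y[u, j])
    return np.mean(estimates) if estimates else np.nan

# Rank-1 toy matrix with one hidden entry (true value 3).
Y = np.outer([1.0, 1.0, 2.0], [1.0, 2.0, 3.0])
mask = np.ones_like(Y, dtype=bool)
mask[0, 2] = False                            # pretend (0, 2) is unobserved
print(nn_matrix_estimate(Y, mask, 0, 2))      # -> 3.0
```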

Christina Lee

Author Information

George H Chen (Carnegie Mellon University)

George Chen is an assistant professor of information systems at Carnegie Mellon University. He works on nonparametric prediction methods applied to healthcare and sustainable development. He received his PhD from MIT in Electrical Engineering and Computer Science.

Devavrat Shah (Massachusetts Institute of Technology)

Devavrat Shah is a professor of Electrical Engineering & Computer Science and Director of Statistics and Data Science at MIT. He received his PhD in Computer Science from Stanford. He received the Erlang Prize from the Applied Probability Society of INFORMS in 2010 and a NeurIPS best paper award in 2008.

Christina Lee (Microsoft Research)
