

Workshop

Nearest Neighbors for Modern Applications with Massive Data: An Age-old Solution with New Challenges

George H Chen · Devavrat Shah · Christina Lee

201 B

Many modern methods for prediction leverage nearest neighbor (NN) search to find past training examples most similar to a test example, an idea that dates back in writing to at least the 11th century in the “Book of Optics” by Alhazen. Today, NN methods remain popular, often as a cog in a bigger prediction machine, used for instance in recommendation systems, forecasting baseball player performance and election outcomes, survival analysis in healthcare, image in-painting, crowdsourcing, graphon estimation, and more. The popularity of NN methods is due in no small part to the proliferation of high-quality, fast approximate NN search methods that scale to the high-dimensional, massive datasets typical of contemporary applications. Moreover, NN prediction readily pairs with methods that learn similarities, such as metric learning or Siamese networks. In fact, some well-known methods, including random forests and many boosting methods, can themselves be viewed as nearest neighbor predictors that learn similarities.
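To make the basic idea concrete, here is a minimal sketch of k-nearest-neighbor classification in plain NumPy. The toy dataset, the choice of Euclidean distance, and k=5 are illustrative assumptions, not part of any particular method discussed at the workshop:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    """Classify x_test by majority vote among its k nearest
    training examples under Euclidean distance."""
    # Squared Euclidean distances from the test point to all training points.
    dists = np.sum((X_train - x_test) ** 2, axis=1)
    # Indices of the k closest training examples.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels.
    votes = np.bincount(y_train[nearest])
    return np.argmax(votes)

# Toy usage: two Gaussian clusters with labels 0 and 1.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=5))  # likely 1
```

The brute-force distance computation above costs time linear in the training set size per query, which is exactly what the fast approximate search methods mentioned above are designed to avoid.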

Despite the popularity, success, and age of nearest neighbor methods, our theoretical understanding of them is still surprisingly incomplete (notwithstanding the pioneering analyses of Fix, Hodges, Cover, and Hart) and can be disconnected from what practitioners actually want or care about. Many approximate nearest neighbor methods that succeed in practice have no known theoretical guarantees, and many of the guarantees for exact nearest neighbor methods do not readily handle approximation. Meanwhile, many applications use variations on NN methods to which existing theory may not extend, or for which existing theory is not easily usable by a practitioner. Suffice it to say, a lot is lost in translation between the different communities working with NN methods.

In short, NN methods sit at an exciting intersection of classical statistics, machine learning, data structures, and domain-specific expertise. The aim of this workshop is to bring together theoreticians and practitioners from these different backgrounds, with their diverse perspectives, and get everyone up to speed on:
- Best known statistical/computational guarantees (especially recent non-asymptotic results)
- Latest methods and systems developed for fast approximate NN search that scale to massive datasets (a minimal sketch follows this list)
- Various applications in which NN methods are heavily used as a critical component in prediction or inference
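As one example of the kind of system meant in the second bullet, below is a minimal sketch using the FAISS library's inverted-file index (IndexIVFFlat), which trades a small loss in accuracy for large speedups over exact search. The random data and the parameter choices (nlist, nprobe, k=5) are illustrative assumptions, not recommendations:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 64, 100_000
rng = np.random.default_rng(0)
xb = rng.random((n, d), dtype=np.float32)   # database vectors
xq = rng.random((10, d), dtype=np.float32)  # query vectors

# Exact baseline: brute-force L2 search over all n vectors.
exact = faiss.IndexFlatL2(d)
exact.add(xb)

# Approximate index: cluster the database into nlist cells (inverted file),
# then search only the nprobe cells closest to each query.
nlist = 256
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)   # learn the coarse clustering
index.add(xb)
index.nprobe = 8  # more probes -> more accurate, but slower

# Recall@5: fraction of true nearest neighbors recovered approximately.
_, I_true = exact.search(xq, 5)
_, I = index.search(xq, 5)
recall = np.mean([len(set(a) & set(b)) / 5 for a, b in zip(I, I_true)])
print(f"recall@5 ~ {recall:.2f}")
```

Raising nprobe moves the index smoothly from fast-but-rough toward the exact baseline, which is precisely the accuracy/speed trade-off that much of the theory discussed at the workshop seeks to characterize.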

By gathering a diverse crowd, we hope attendees will share their perspectives, identify ways to bridge theory and practice, and discuss avenues for future research.
