The evaluation and optimization of machine learning systems have largely adopted well-known performance metrics like accuracy (for classification) or squared error (for regression). While these metrics are reusable across a variety of machine learning tasks, they make strong assumptions that often do not hold when the model is situated in a broader technical or sociotechnical system. This is especially true in systems that interact with large populations of humans attempting to complete a goal or satisfy a need (e.g., search, recommendation, game-playing). In this tutorial, we will present methods for developing evaluation metrics grounded in what users expect of the system and how they respond to system decisions. The goal of this tutorial is both to share methods for designing user-based quantitative metrics and to motivate new research into optimizing for these more structured metrics.
Praveen Chandar (Spotify)
Praveen Chandar is a Senior Research Scientist at Spotify working on search and recommendations. His research interests are in machine learning, information retrieval, and recommendation systems, with a focus on experimentation and evaluation. Praveen received his Ph.D. from the University of Delaware, working on novelty and diversity aspects of search evaluation. He was previously a Research Staff Member at IBM Research. He has published papers at top conferences including SIGIR, KDD, WSDM, WWW, CIKM, CHI, and UAI.
Fernando Diaz (Google)
Fernando Diaz is a research scientist at Google Brain Montréal. His research focuses on the design of information access systems, including search engines, music recommendation services, and crisis response platforms. He is particularly interested in understanding and addressing the societal implications of artificial intelligence more generally. Previously, Fernando was the assistant managing director of Microsoft Research Montréal and a director of research at Spotify, where he helped establish its research organization on recommendation, search, and personalization. Fernando’s work has received awards at SIGIR, WSDM, ISCRAM, and ECIR. He is the recipient of the 2017 British Computer Society Karen Spärck Jones Award. Fernando has co-organized workshops and tutorials at SIGIR, WSDM, and WWW. He has also co-organized several NIST TREC initiatives, WSDM (2013), Strategic Workshop on Information Retrieval (2018), FAT* (2019), SIGIR (2021), and the CIFAR Workshop on Artificial Intelligence and the Curation of Culture (2019).
Brian St. Thomas (Spotify)
Brian St. Thomas is a Senior Data Scientist at Spotify researching online experimentation methods and metric development. His research interests are in the development and evaluation of personalized recommendation and search systems, with a focus on statistical aspects of these problems. Brian received his Ph.D. from Duke University and was previously a Data Scientist with TiVo's Search and Recommendations division. Brian has published research in JASA, SIGIR, CHI, and WWW, and has co-organized a tutorial at RecSys.
Related Events (a corresponding poster, oral, or spotlight)
2020 Tutorial: (Track2) Beyond Accuracy: Grounding Evaluation Metrics for Human-Machine Learning Systems Q&A »
Tue Dec 8th 10:00 -- 10:50 PM
More from the Same Authors
2020 Workshop: Algorithmic Fairness through the Lens of Causality and Interpretability »
Awa Dieng · Jessica Schrouff · Matt J Kusner · Golnoosh Farnadi · Fernando Diaz
2020 Poster: Model Selection for Production System via Automated Online Experiments »
Zhenwen Dai · Praveen Chandar · Ghazal Fazelnia · Benjamin Carterette · Mounia Lalmas
2016 Demonstration: Project Malmo - Minecraft for AI Research »
Katja Hofmann · Matthew A Johnson · Fernando Diaz · Alekh Agarwal · Tim Hutton · David Bignell · Evelyne Viegas