Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role benchmarking practices play in the field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse within and across machine learning subcommunities. In this work we dig into these dynamics by studying how dataset usage patterns differ across machine learning subcommunities and over time from 2015 to 2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity and access within the field.
Author Information
Bernard Koch (University of California, Los Angeles)
Remi Denton (Google)

Remi Denton (they/them) is a Staff Research Scientist at Google, within the Technology, AI, Society, and Culture team, where they study the sociocultural impacts of AI technologies and conditions of AI development. Prior to joining Google, Remi received their PhD in Computer Science from the Courant Institute of Mathematical Sciences at New York University, where they focused on unsupervised learning and generative modeling of images and video. Prior to that, they received their BSc in Computer Science and Cognitive Science at the University of Toronto. Though trained formally as a computer scientist, Remi draws ideas and methods from multiple disciplines and is drawn towards highly interdisciplinary collaborations, in order to examine AI systems from a sociotechnical perspective. Remi’s recent research centers on emerging text- and image-based generative AI, with a focus on data considerations and representational harms. Remi published under the name "Emily Denton".
Alex Hanna (Google)
Jacob G Foster (University of California, Los Angeles)
More from the Same Authors
- 2021: Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
  Bernard Koch · Remi Denton · Alex Hanna · Jacob G Foster
- 2021: Artsheets for Art Datasets
  Ramya Srinivasan · Remi Denton · Jordan Famularo · Negar Rostamzadeh · Fernando Diaz · Beth Coleman
- 2021: AI and the Everything in the Whole Wide World Benchmark
  Deborah Raji · Remi Denton · Emily M. Bender · Alex Hanna · Amandalynne Paullada
- 2021: Joint Content-Context Analysis of Scientific Publications: Identifying Opportunities for Collaboration in Cognitive Science
  Harlin Lee · Jacob G Foster
- 2022 Poster: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  Chitwan Saharia · William Chan · Saurabh Saxena · Lala Li · Jay Whang · Remi Denton · Kamyar Ghasemipour · Raphael Gontijo Lopes · Burcu Karagol Ayan · Tim Salimans · Jonathan Ho · David Fleet · Mohammad Norouzi
- 2021: Live panel: ImageNets of "x": ImageNet's Infrastructural Impact
  Remi Denton · Alex Hanna
- 2021: ImageNets of "x": ImageNet's Infrastructural Impact
  Remi Denton · Alex Hanna
- 2021: Career and Life: Panel Discussion - Bo Li, Adriana Romero-Soriano, Devi Parikh, and Emily Denton
  Remi Denton · Devi Parikh · Bo Li · Adriana Romero