

NeurIPS 2021 Datasets and Benchmarks Accepted Papers 174

This year, NeurIPS launched the new Datasets and Benchmarks track to serve as a venue for exceptional work in creating high-quality datasets, insightful benchmarks, and discussions on how to improve dataset development and data-oriented work more broadly. Further details about the motivation and setup are discussed in this blog post.

Below is the list of the 163 accepted submissions. We are really excited about their quality and potential impact.

Datasets and Benchmarks Proceedings


Generating Datasets of 3D Garments with Sewing Patterns
Maria Korosteleva · Sung-Hee Lee

KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects
Zhiyuan Tang · Dong Wang · Yanguang Xu · Jianwei Sun · Xiaoning Lei · Shuaijiang Zhao · cheng wen · Xingjun Tan · Chuandong Xie · Shuran Zhou · Rui Yan · Chenjia Lv · Yang Han · Wei Zou · Xiangang Li

SKM-TEA: A Dataset for Accelerated MRI Reconstruction with Dense Image Labels for Quantitative Clinical Evaluation
Arjun Desai · Andrew Schmidt · Elka Rubin · Christopher Sandino · Marianne Black · Valentina Mazzoli · Kathryn Stevens · Robert Boutin · Christopher Ré · Garry Gold · Brian Hargreaves · Akshay Chaudhari

Evaluating Bayes Error Estimators on Real-World Datasets with FeeBee
Cedric Renggli · Luka Rimanic · Nora Hollenstein · Ce Zhang

DUE: End-to-End Document Understanding Benchmark
Łukasz Borchmann · Michał Pietruszka · Tomasz Stanislawek · Dawid Jurkiewicz · Michał Turski · Karolina Szyndler · Filip Graliński

PASS: An ImageNet replacement for self-supervised pretraining without humans
Yuki Asano · Christian Rupprecht · Andrew Zisserman · Andrea Vedaldi

Mitigating dataset harms requires stewardship: Lessons from 1000 papers
Kenneth Peng · Arunesh Mathur · Arvind Narayanan

Constructing a Visual Dataset to Study the Effects of Spatial Apartheid in South Africa
Raesetje Sefala · Timnit Gebru · Luzango Mfupe · Nyalleng Moorosi · Richard Klein

AI and the Everything in the Whole Wide World Benchmark
Deborah Raji · Emily Denton · Emily M. Bender · Alex Hanna · Amandalynne Paullada

URLB: Unsupervised Reinforcement Learning Benchmark
Misha Laskin · Denis Yarats · Hao Liu · Kimin Lee · Albert Zhan · Kevin Lu · Catherine Cang · Lerrel Pinto · Pieter Abbeel

Benchmark for Compositional Text-to-Image Synthesis
Dong Huk Park · Samaneh Azadi · Xihui Liu · Trevor Darrell · Anna Rohrbach

An Empirical Study of Graph Contrastive Learning
Yanqiao Zhu · Yichen Xu · Qiang Liu · Shu Wu

Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension
Shusheng Xu · Yichen Liu · Xiaoyu Yi · Siyuan Zhou · Huizi Li · Yi Wu

SciGen: a Dataset for Reasoning-Aware Text Generation from Scientific Tables
Nafise Moosavi · Andreas Rücklé · Dan Roth · Iryna Gurevych

A sandbox for prediction and integration of DNA, RNA, and proteins in single cells
Malte Luecken · Daniel Burkhardt · Robrecht Cannoodt · Christopher Lance · Aditi Agrawal · Hananeh Aliee · Ann Chen · Louise Deconinck · Angela Detweiler · Alejandro Granados · Shelly Huynh · Laura Isacco · Yang Kim · Dominik Klein · BONY DE KUMAR · Sunil Kuppasani · Heiko Lickert · Aaron McGeever · Honey Mekonen · Joaquin Melgarejo · Maurizio Morri · Michaela Müller · Norma Neff · Sheryl Paul · Bastian Rieck · Kaylie Schneider · Scott Steelman · Michael Sterr · Daniel Treacy · Alexander Tong · Alexandra-Chloe Villani · Guilin Wang · Jia Yan · Ce Zhang · Angela Pisco · Smita Krishnaswamy · Fabian Theis · Jonathan M Bloom

Chest ImaGenome Dataset for Clinical Reasoning
Joy T Wu · Nkechinyere Agu · Ismini Lourentzou · Arjun Sharma · Joseph Alexander Paguio · Jasper Seth Yao · Edward C Dee · William Mitchell · Satyananda Kashyap · Andrea Giovannini · Leo Anthony Celi · Mehdi Moradi

WRENCH: A Comprehensive Benchmark for Weak Supervision
Jieyu Zhang · Yue Yu · · Yujing Wang · Yaming Yang · Mao Yang · Alexander Ratner

A Dataset for Answering Time-Sensitive Questions
Wenhu Chen · Xinyi Wang · William Yang Wang

Modeling Worlds in Text
Prithviraj Ammanabrolu · Mark Riedl

The PAIR-R24M Dataset for Multi-animal 3D Pose Estimation
Jesse Marshall · Ugne Klibaite · amanda gellis · Diego Aldarondo · Bence Olveczky · Timothy W Dunn

The People’s Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
Daniel Galvez · Greg Diamos · Juan Torres · Juan Cerón · Keith Achorn · Anjali Gopi · David Kanter · Max Lam · Mark Mazumder · Vijay Janapa Reddi

CSFCube - A Test Collection of Computer Science Research Articles for Faceted Query by Example
Sheshera Mysore · Tim O'Gorman · Andrew McCallum · Hamed Zamani

The Medkit-Learn(ing) Environment: Medical Decision Modelling through Simulation
Alex Chan · Ioana Bica · Alihan Hüyük · Daniel Jarrett · Mihaela van der Schaar

Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics
Charan Reddy · Deepak Sharma · Soroush Mehri · Adriana Romero Soriano · Samira Shabanian · Sina Honari

Datasets for Online Controlled Experiments
Chak Hin Bryan Liu · Angelo Cardoso · Paul Couturier · Emma McCoy

ReaSCAN: Compositional Reasoning in Language Grounding
Zhengxuan Wu · Elisa Kreiss · Desmond Ong · Christopher Potts

The CLEAR Benchmark: Continual LEArning on Real-World Imagery
Zhiqiu Lin · Jia Shi · Deepak Pathak · Deva Ramanan

A Unified Few-Shot Classification Benchmark to Compare Transfer and Meta Learning Approaches
Vincent Dumoulin · Neil Houlsby · Utku Evci · Xiaohua Zhai · Ross Goroshin · Sylvain Gelly · Hugo Larochelle

Isaac Gym: High Performance GPU Based Physics Simulation For Robot Learning
Viktor Makoviychuk · Lukasz Wawrzyniak · Yunrong Guo · Michelle Lu · Kier Storey · Miles Macklin · David Hoeller · Nikita Rudin · Arthur Allshire · Ankur Handa · Gavriel State

Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions
Chenyu Yi · SIYUAN YANG · Haoliang Li · Yap-peng Tan · Alex Kot

Really Doing Great at Estimating CATE? A Critical Look at ML Benchmarking Practices in Treatment Effect Estimation
Alicia Curth · David Svensson · Jim Weatherall · Mihaela van der Schaar

FLIP: Benchmark tasks in fitness landscape inference for proteins
Christian Dallago · Jody Mou · Kadina Johnston · Bruce Wittmann · Nicholas Bhattacharya · Samuel Goldman · Ali Madani · Kevin Yang

GraphGT: Machine Learning Datasets for Graph Generation and Transformation
Yuanqi Du · Shiyu Wang · Xiaojie Guo · Hengning Cao · Shujie Hu · Junji Jiang · Aishwarya Varala · Abhinav Angirekula · Liang Zhao

PROCAT: Product Catalogue Dataset for Implicit Clustering, Permutation Learning and Structure Prediction
Mateusz Jurewicz · Leon Derczynski

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li

Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation
Yuta Saito · Shunsuke Aihara · Megumi Matsutani · Yusuke Narita

CCNLab: A Benchmarking Framework for Computational Cognitive Neuroscience
Nikhil Bhattasali · Momchil Tomov · Samuel J Gershman

A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning
Gaoussou Kebe · Padraig Higgins · Patrick Jenkins · Kasra Darvish · Rishabh Sachdeva · Ryan Barron · John Winder · Donald Engel · Edward Raff · Francis Ferraro · Cynthia Matuszek

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh Kumar Ramakrishnan · Aaron Gokaslan · Erik Wijmans · Oleksandr Maksymets · Alexander Clegg · John Turner · Eric Undersander · Wojciech Galuba · Andrew Westbury · Angel Chang · Manolis Savva · Yili Zhao · Dhruv Batra

Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers
Loren Lugosch · Piyush Papreja · Mirco Ravanelli · Abdelwahab HEBA · Titouan Parcollet

A realistic approach to generate masked faces applied on two novel masked face recognition data sets
Tudor-Alexandru Mare · Georgian Duta · Iuliana Georgescu · Adrian Sandru · Bogdan Alexe · Marius Popescu · Radu Tudor Ionescu

FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark 
Mingjie Li · Wenjia Cai · Rui Liu · Yuetian Weng · Xiaoyun Zhao · Cong Wang · Xin Chen · Zhong Liu · Caineng Pan · Mengke Li · yingfeng zheng · Yizhi Liu · Flora Salim · Karin Verspoor · Xiaodan Liang · Xiaojun Chang

Few-Shot Learning Evaluation in Natural Language Understanding
Subhabrata Mukherjee · Xiaodong Liu · Guoqing Zheng · Saghar Hosseini · Hao Cheng · Ge Yang · Christopher Meek · Ahmed Awadallah · Jianfeng Gao

MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
Paul Pu Liang · Yiwei Lyu · Xiang Fan · Zetian Wu · Yun Cheng · Jason Wu · Leslie (Yufan) Chen · Peter Wu · Michelle A. Lee · Yuke Zhu · Ruslan Salakhutdinov · Louis-Philippe Morency

OmniPrint: A Configurable Printed Character Synthesizer
Haozhe Sun · Wei-Wei Tu · Isabelle Guyon

A Toolbox for Construction and Analysis of Speech Datasets
Evelina Bakhturina · Vitaly Lavrukhin · Boris Ginsburg

What Would Jiminy Cricket Do? Towards Agents That Behave Morally
Dan Hendrycks · Mantas Mazeika · Andy Zou · Sahil Patel · Christine Zhu · Jesus Navarro · Dawn Song · Bo Li · Jacob Steinhardt

RB2: Robotic Manipulation Benchmarking with a Twist
Sudeep Dasari · Jianren Wang · Joyce Hong · Shikhar Bahl · Yixin Lin · Austin Wang · Abitha Thankaraj · Karanbir Chahal · Berk Calli · Saurabh Gupta · David Held · Lerrel Pinto · Deepak Pathak · Vikash Kumar · Abhinav Gupta

Programming Puzzles
Tal Schuster · Ashwin Kalyan · Alex Polozov · Adam Kalai

Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
Bernard Koch · Emily Denton · Alex Hanna · Jacob G Foster

An Extensible Benchmark Suite for Learning to Simulate Physical Systems
Karl Otness · Arvi Gjoka · Joan Bruna · Daniele Panozzo · Benjamin Peherstorfer · Teseo Schneider · Denis Zorin

CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
Alon Talmor · Ori Yoran · Ronan Le Bras · Chandra Bhagavatula · Yoav Goldberg · Yejin Choi · Jonathan Berant

WildfireDB: An Open-Source Dataset Connecting Wildfire Occurrence with Relevant Determinants
Samriddhi Singla · Ayan Mukhopadhyay · Michael Wilbur · Tina Diao · Vinayak Gajjewar · Ahmed Eldawy · Mykel J Kochenderfer · Ross Shachter · Abhishek Dubey

The Neural MMO Platform for Massively Multiagent Research
Joseph Suarez · Yilun Du · Clare Zhu · Igor Mordatch · Phillip Isola

Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
Benjamin Wilson · William Qi · Tanmay Agarwal · John Lambert · Jagjeet Singh · Siddhesh Khandelwal · Bowen Pan · Ratnesh Kumar · Andrew Hartnett · Jhony Kaesemodel Pontes · Deva Ramanan · Peter Carr · James Hays

NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation
David Alfonso-Hermelo · Ahmad Rashid · Abbas Ghaddar · Philippe Langlais · Mehdi Rezagholizadeh

Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development
Kexin Huang · Tianfan Fu · Wenhao Gao · Yue Zhao · Yusuf Roohani · Jure Leskovec · Connor Coley · Cao Xiao · Jimeng Sun · Marinka Zitnik

LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation
Junjue Wang · Zhuo Zheng · Ailong Ma · Xiaoyan Lu · Yanfei Zhong

CropHarvest: A global dataset for crop-type classification
Gabriel Tseng · Ivan Zvonkov · Catherine Nakalembe · Hannah Kerner

Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus
John Bandy · Nicholas Vincent

CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge
Yasumasa Onoe · Michael Zhang · Eunsol Choi · Greg Durrett

What Ails One-Shot Image Segmentation: A Data Perspective
Mayur Hemani · Abhinav Patel · Tejas Shimpi · Anirudha Ramesh · Balaji Krishnamurthy

HiRID-ICU-Benchmark --- A Comprehensive Machine Learning Benchmark on High-resolution ICU Data
Hugo Yèche · Rita Kuznetsova · Marc Zimmermann · Matthias Hüser · Xinrui Lyu · Martin Faltys · Gunnar Rätsch

DENETHOR: The DynamicEarthNET dataset for Harmonized, inter-Operable, analysis-Ready, daily crop monitoring from space
Lukas Kondmann · Aysim Toker · Marc Rußwurm · Andrés Camero · Devis Peressuti · Grega Milcinski · Pierre-Philippe Mathieu · Nicolas Longepe · Timothy Davis · Giovanni Marchisio · Laura Leal-Taixé · Xiaoxiang Zhu

STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu · Shoubin Yu · Zhenfang Chen · Josh Tenenbaum · Chuang Gan

LiRo: Benchmark and leaderboard for Romanian language tasks
Stefan Dumitrescu · Petru Rebeja · Beata Lorincz · Mihaela Gaman · Andrei Avram · Mihai Ilie · Andrei Pruteanu · Adriana Stan · Lorena Rosia · Cristina Iacobescu · Luciana Morogan · George Dima · Gabriel Marchidan · Traian Rebedea · Madalina Chitez · Dani Yogatama · Sebastian Ruder · Radu Tudor Ionescu · Razvan Pascanu · Viorica Patraucean

The Met Dataset: Instance-level Recognition for Artworks
Nikolaos-Antonios Ypsilantis · Noa Garcia · Guangxing Han · Sarah Ibrahimi · Nanne van Noord · Giorgos Tolias

A Large-Scale Database for Graph Representation Learning
Scott Freitas · Yuxiao Dong · Joshua Neil · Duen Horng Chau

BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling
Zhaojiang Lin · Andrea Madotto · Genta Winata · Peng Xu · Feijun Jiang · Yuxiang Hu · Chen Shi · Pascale N Fung

SODA10M: A Large-Scale 2D Self/Semi-Supervised Object Detection Dataset for Autonomous Driving
Jianhua Han · Xiwen Liang · Hang Xu · Kai Chen · Lanqing Hong · Jiageng Mao · Chaoqiang Ye · Wei Zhang · Zhenguo Li · Xiaodan Liang · Chunjing XU

Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management
Cécile Logé · Emily Ross · David Dadey · Saahil Jain · Adriel Saporta · Andrew Ng · Pranav Rajpurkar

BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur · Nils Reimers · Andreas Rücklé · Abhishek Srivastava · Iryna Gurevych

CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription
Nikita Pavlichenko · Ivan Stelmakh · Dmitry Ustalov

HumBugDB: A Large-scale Acoustic Mosquito Dataset
Ivan Kiskin · Marianne Sinka · Adam Cobb · Waqas Rafique · Lawrence Wang · Davide Zilli · Benjamin Gutteridge · Rinita Dam · Theodoros Marinos · Yunpeng Li · Dickson Msaky · Emmanuel Kaindoa · Gerard Killeen · Eva Herreros-Moya · Kathy Willis · Stephen J Roberts

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
Gilad Baruch · Zhuoyuan Chen · Afshin Dehghan · Yuri Feigin · Peter Fu · Thomas Gebauer · Daniel Kurz · Tal Dimry · Brandon Joffe · Arik Schwartz · Elad Shulman

One Million Scenes for Autonomous Driving: ONCE Dataset
Jiageng Mao · Niu Minzhe · ChenHan Jiang · hanxue liang · Jingheng Chen · Xiaodan Liang · Yamin Li · Chaoqiang Ye · Wei Zhang · Zhenguo Li · Jie Yu · Hang Xu · Chunjing XU

FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information
Rami Aly · Zhijiang Guo · Michael Schlichtkrull · James Thorne · Andreas Vlachos · Christos Christodoulopoulos · Oana Cocarascu · Arpit Mittal

Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing
Sarah Wiegreffe · Ana Marasovic

Graph Robustness Benchmark: Benchmarking the Adversarial Robustness of Graph Machine Learning
Qinkai Zheng · Xu Zou · Yuxiao Dong · Yukuo Cen · Da Yin · Jiarong Xu · Yang Yang · Jie Tang

CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
Dan Hendrycks · Collin Burns · Anya Chen · Spencer Ball

DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates
John Pougué-Biyong · Valentina Semenova · Alexandre Matton · Rachel Han · Aerin Kim · Renaud Lambiotte · Doyne Farmer

Trust, but Verify: Cross-Modality Fusion for HD Map Change Detection
John Lambert · James Hays

ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation
Chuang Gan · Jeremy Schwartz · Seth Alter · Damian Mrowca · Martin Schrimpf · James Traer · Julian De Freitas · Jonas Kubilius · Abhishek Bhandwaldar · Nick Haber · Megumi Sano · Kuno Kim · Elias Wang · Michael Lingelbach · Aidan Curtis · Kevin Feigelis · Daniel Bear · Dan Gutfreund · David Cox · Antonio Torralba · James J DiCarlo · Josh Tenenbaum · Josh McDermott · Dan Yamins

Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning
Thomas Liao · Rohan Taori · Deborah Raji · Ludwig Schmidt

Personalized Benchmarking with the Ludwig Benchmarking Toolkit
Avanika Narayan · Piero Molino · Karan Goel · Willie Neiswanger · Christopher Ré

Pl@ntNet-300K: a plant image dataset with high label ambiguity and a long-tailed distribution
Camille Garcin · alexis joly · Pierre Bonnet · Antoine Affouard · Jean-Christophe Lombardo · Mathias Chouet · Maximilien Servajean · Titouan Lorieul · Joseph Salmon

Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs
Zihao Wang · Hang Yin · Yangqiu Song

RELLISUR: A Real Low-Light Image Super-Resolution Dataset
Andreas Aakerberg · Kamal Nasrollahi · Thomas Moeslund

The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions
Jennifer J Sun · Tomomi Karigo · Dipam Chakraborty · Sharada Mohanty · Benjamin Wild · Quan Sun · Chen Chen · David Anderson · Pietro Perona · Yisong Yue · Ann Kennedy

Intelligent Sight and Sound: A Chronic Cancer Facial Pain Dataset
Catherine Ordun · Alexandra Cha · Edward Raff · Byron Gaskin · Alexander Hanson · Mason Rule · Sanjay Purushotham · James Gulley

Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge
Jiyang Qi · Yan Gao · Yao Hu · Xinggang Wang · Xiaoyu Liu · Xiang Bai · Serge Belongie · Alan Yuille · Philip Torr · Song Bai

The CPD Data Set: Personnel, Use of Force, and Complaints in the Chicago Police Department
Thibaut Horel · Lorenzo Masoero · Raj Agrawal · Daria Roithmayr · Trevor Campbell

VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection
Jaeju An · Jeongho Kim · Hanbeen Lee · Jinbeom Kim · Junhyung Kang · Minha Kim · Saebyeol Shin · Minha Kim · Donghee Hong · Simon Woo

Benchmarking Data-driven Surrogate Simulators for Artificial Electromagnetic Materials
Yang Deng · Juncheng Dong · Simiao Ren · Omar Khatib · Mohammadreza Soltani · Vahid Tarokh · Willie Padilla · Jordan Malof

FS-Mol: A Few-Shot Learning Dataset of Molecules
Megan Stanley · John Bronskill · Krzysztof Maziarz · Hubert Misztela · Jessica Lanini · Marwin Segler · Nadine Schneider · Marc Brockschmidt

DABS: a Domain-Agnostic Benchmark for Self-Supervised Learning
Alex Tamkin · Vincent Liu · Rongfei Lu · Daniel Fein · Colin Schultz · Noah Goodman

Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning
Nan Rosemary Ke · Aniket Didolkar · Sarthak Mittal · Anirudh Goyal · Guillaume Lajoie · Stefan Bauer · Danilo Jimenez Rezende · Yoshua Bengio · Chris Pal · Michael Mozer

AP-10K: A Benchmark for Animal Pose Estimation in the Wild
Hang Yu · Yufei Xu · Jing Zhang · Wei Zhao · Ziyu Guan · Dacheng Tao

It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks
Michelle Bao · Angela Zhou · Samantha Zottola · Brian Brubach · Sarah Desmarais · Aaron Horowitz · Kristian Lum · Suresh Venkatasubramanian

HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO
Katharina Eggensperger · Philipp Müller · Neeratyoy Mallik · Matthias Feurer · Rene Sass · Aaron Klein · Noor Awad · Marius Lindauer · Frank Hutter

SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning
Christopher Yeh · Chenlin Meng · Sherrie Wang · Anne Driscoll · Erik Rozi · Patrick Liu · Jihyeon Lee · Marshall Burke · David Lobell · Stefano Ermon

STEP: Segmenting and Tracking Every Pixel
Mark Weber · Jun Xie · Maxwell Collins · Yukun Zhu · Paul Voigtlaender · Hartwig Adam · Bradley Green · Andreas Geiger · Bastian Leibe · Daniel Cremers · Aljosa Osep · Laura Leal-Taixé · Liang-Chieh Chen

Neural Latents Benchmark ‘21: Evaluating latent variable models of neural population activity
Felix Pei · Joel Ye · David Zoltowski · Anqi Wu · Raeed Chowdhury · Hansem Sohn · Joseph O'Doherty · Krishna V Shenoy · Matthew Kaufman · Mark Churchland · Mehrdad Jazayeri · Lee Miller · Jonathan Pillow · Il Memming Park · Eva Dyer · Chethan Pandarinath

KLUE: Korean Language Understanding Evaluation
Sungjoon Park · Jihyung Moon · Sungdong Kim · Won Ik Cho · Ji Yoon Han · Jangwon Park · Chisung Song · Junseong Kim · Youngsook Song · Taehwan Oh · Joohong Lee · Juhyun Oh · Sungwon Lyu · Younghoon Jeong · Inkwon Lee · Sangwoo Seo · Dongjun Lee · Hyunwoo Kim · Myeonghwa Lee · Seongbo Jang · Seungwon Do · Sunkyoung Kim · Kyungtae Lim · Jongwon Lee · Kyumin Park · Jamin Shin · Seonghyun Kim · Lucy Park · Alice Oh · Jung-Woo Ha · Kyunghyun Cho

ImageNet-21K Pretraining for the Masses
Tal Ridnik · Emanuel Ben-Baruch · Asaf Noy · Lihi Zelnik

Relational Pattern Benchmarking on the Knowledge Graph Link Prediction Task
Afshin Sadeghi · Hirra Malik · Diego Collarana · Jens Lehmann

Artsheets for Art Datasets
Ramya Srinivasan · Emily Denton · Jordan Famularo · Negar Rostamzadeh · Fernando Diaz · Beth Coleman

Benchmarking Multimodal AutoML for Tabular Data with Text Fields
Xingjian Shi · Jonas Mueller · Nick Erickson · Mu Li · Alexander Smola

SynthBio: A Case Study in Faster Curation of Text Datasets
Ann Yuan · Daphne Ippolito · Vitaly Nikolaev · Chris Callison-Burch · Andy Coenen · Sebastian Gehrmann

EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction
Ard Kastrati · Martyna Plomecka · Damian Pascual Ortiz · Lukas Wolf · Victor Gillioz · Roger Wattenhofer · Nicolas Langer

RobustBench: a standardized adversarial robustness benchmark
Francesco Croce · Maksym Andriushchenko · Vikash Sehwag · Edoardo Debenedetti · Nicolas Flammarion · Mung Chiang · Prateek Mittal · Matthias Hein

EventNarrative: A Large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation
Anthony Colas · Ali Sadeghian · Yue Wang · Daisy Zhe Wang

Alchemy: A benchmark and analysis toolkit for meta-reinforcement learning agents
Jane Wang · Michael King · Nicolas Porcel · Zeb Kurth-Nelson · Tina Zhu · Charles Deck · Peter Choy · Mary Cassin · Malcolm Reynolds · Francis Song · Gavin Buttimore · David Reichert · Neil Rabinowitz · Loic Matthey · Demis Hassabis · Alexander Lerchner · Matt Botvinick

CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
Ruchir Puri · David Kung · Geert Janssen · Wei Zhang · Giacomo Domeniconi · Vladimir Zolotov · Julian T Dolby · Jie Chen · Mihir Choudhury · Lindsey Decker · Veronika Thost · Luca Buratti · Saurabh Pujar · Shyam Ramji · Ulrich Finkler · Susan Malaika · Frederick Reiss

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
Cameron Voloshin · Hoang Le · Nan Jiang · Yisong Yue

TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers
Lianmin Zheng · Ruochen Liu · Junru Shao · Tianqi Chen · Joseph Gonzalez · Ion Stoica · Ameer Haj-Ali

Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks
Georgios Papoudakis · Filippos Christianos · Lukas Schäfer · Stefano Albrecht

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu

Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks
Neil Band · Tim G. J. Rudner · Qixuan Feng · Angelos Filos · Zachary Nado · Mike Dusenberry · Ghassen Jerfel · Dustin Tran · Yarin Gal

Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks
Andrey Malinin · Neil Band · Yarin Gal · Mark Gales · Alexander Ganshin · German Chesnokov · Alexey Noskov · Andrey Ploskonosov · Liudmila Prokhorenkova · Ivan Provilkov · Vatsal Raina · Vyas Raina · Denis Roginskiy · Mariya Shmatova · Panagiotis Tigas · Boris Yangel

Towards a robust experimental framework and benchmark for lifelong language learning
Aman Hussain · Nithin Holla · Pushkar Mishra · Helen Yannakoudakis · Ekaterina Shutova

Task Agnostic and Task Specific Self-Supervised Learning from Speech with LeBenchmark
Solène Evain · Ha Nguyen · Hang Le · Marcely Zanon Boito · Salima Mdhaffar · Sina Alisamir · Ziyi Tong · Natalia Tomashenko · Marco Dinarelli · Titouan Parcollet · Alexandre Allauzen · Yannick Estève · Benjamin Lecouteux · François Portet · Solange Rossato · Fabien Ringeval · Didier Schwab · laurent besacier

CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms
Martin Pawelczyk · Sascha Bielawski · Johan Van den Heuvel · Tobias Richter · Gjergji Kasneci

Dynamic Environments with Deformable Objects
Rika Antonova · peiyang shi · Hang Yin · Zehang Weng · Danica Kragic

A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer
威佳 吴 · Debing Zhang · Yuanqiang Cai · Sibo Wang · Jiahong Li · Zhuang Li · Yejun Tang · Hong Zhou

The Tufts fNIRS Mental Workload Dataset & Benchmark for Brain-Computer Interfaces that Generalize
zhe huang · Liang Wang · Giles Blaney · Christopher Slaughter · Devon McKeon · Ziyu Zhou · Robert Jacob · Michael Hughes

Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks · Collin Burns · Saurav Kadavath · Akul Arora · Steven Basart · Eric Tang · Dawn Song · Jacob Steinhardt

Contemporary Symbolic Regression Methods and their Relative Performance
William La Cava · Patryk Orzechowski · Bogdan Burlacu · Fabricio de Franca · Marco Virgolin · Ying Jin · Michael Kommenda · Jason Moore

Synthetic Benchmarks for Scientific Research in Explainable Machine Learning
Yang Liu · Sujay Khandagale · Colin White · Willie Neiswanger

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu · Daya Guo · Shuo Ren · Junjie Huang · Alexey Svyatkovskiy · Ambrosio Blanco · Colin Clement · Dawn Drain · Daxin Jiang · Duyu Tang · Ge Li · Lidong Zhou · Linjun Shou · Long Zhou · Michele Tufano · MING GONG · Ming Zhou · Nan Duan · Neel Sundaresan · Shao Kun Deng · Shengyu Fu · Shujie LIU

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark
Yuhang Li · Mingzhu Shen · Jian Ma · Yan Ren · Mingxin Zhao · Qi Zhang · Ruihao Gong · Fengwei Yu · Junjie Yan

Measuring Coding Challenge Competence With APPS
Dan Hendrycks · Steven Basart · Saurav Kadavath · Mantas Mazeika · Akul Arora · Ethan Guo · Collin Burns · Samir Puranik · Horace He · Dawn Song · Jacob Steinhardt

Seasons in Drift: A Long Term Thermal Imaging Dataset for Studying Concept Drift
Ivan Nikolov · Mark Philip Philipsen · Jinsong Liu · Jacob Dueholm · Anders Johansen · Kamal Nasrollahi · Thomas Moeslund

ATOM3D: Tasks on Molecules in Three Dimensions
Raphael Townshend · Martin Vögele · Patricia Suriana · Alex Derry · Alexander Powers · Yianni Laloudakis · Sidhika Balachandar · Bowen Jing · Brandon Anderson · Stephan Eismann · Risi Kondor · Russ Altman · Ron Dror

WaveFake: A Data Set to Facilitate Audio Deepfake Detection
Joel Frank · Lea Schönherr

OpenML Benchmarking Suites
Bernd Bischl · Giuseppe Casalicchio · Matthias Feurer · Pieter Gijsbers · Frank Hutter · Michel Lang · Rafael Gomes Mantovani · Jan van Rijn · Joaquin Vanschoren

RadGraph: Extracting Clinical Entities and Relations from Radiology Reports
Saahil Jain · Ashwin Agrawal · Adriel Saporta · Steven Truong · Du Nguyen Duong · Tan Bui · Pierre Chambon · Yuhao Zhang · Matthew Lungren · Andrew Ng · Curtis Langlotz · Pranav Rajpurkar

RP-Mod & RP-Crowd: Moderator- and Crowd-Annotated German News Comment Datasets
Dennis Assenmacher · Marco Niemann · Kilian Müller · Moritz Seiler · Dennis Riehle · Heike Trautmann

Whole Brain Vessel Graphs: A Dataset and Benchmark for Graph Learning and Neuroscience
Johannes C. Paetzold · Julian McGinnis · Suprosanna Shit · Ivan Ezhov · Paul Büschl · Chinmay Prabhakar · Anjany Sekuboyina · Mihail Todorov · Georgios Kaissis · Ali Ertürk · Stephan Günnemann · Bjoern Menze

RAFT: A Real-World Few-Shot Text Classification Benchmark
Neel Alex · Eli Lifland · Lewis Tunstall · Abhishek Thakur · Pegah Maham · C. Riedel · Emmie Hine · Carolyn Ashurst · Paul Sedille · Alexis Carlier · Michael Noetel · Andreas Stuhlmüller

Physion: Evaluating Physical Prediction from Vision in Humans and Machines
Daniel Bear · Elias Wang · Damian Mrowca · Felix Binder · Hsiao-Yu Tung · Pramod RT · Cameron Holdaway · Sirui Tao · Kevin Smith · Fan-Yun Sun · Fei-Fei Li · Nancy Kanwisher · Josh Tenenbaum · Dan Yamins · Judith Fan

Hardware Design and Accurate Simulation of Structured-Light Scanning for Benchmarking of 3D Reconstruction Algorithms
Sebastian Koch · Yurii Piadyk · Markus Worchel · Marc Alexa · Claudio Silva · Denis Zorin · Daniele Panozzo

Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation
Daniel Freeman · Erik Frey · Anton Raichuk · Sertan Girgin · Igor Mordatch · Olivier Bachem

IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
Pan Lu · Liang Qiu · Jiaqi Chen · Tanglin Xia · Yizhou Zhao · Wei Zhang · Zhou Yu · Xiaodan Liang · Song-Chun Zhu

A Procedural World Generation Framework for Systematic Evaluation of Continual Learning
Timm Hess · Martin Mundt · Iuliia Pliushch · Visvanathan Ramesh

SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation
Robin Chan · Krzysztof Lis · Svenja Uhlemeyer · Hermann Blum · Sina Honari · Roland Siegwart · Pascal Fua · Mathieu Salzmann · Matthias Rottmann

B-Pref: Benchmarking Preference-Based Reinforcement Learning
Kimin Lee · Laura Smith · Anca Dragan · Pieter Abbeel

FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset
Hasam Khalid · Shahroz Tariq · Minha Kim · Simon Woo

NaturalProofs: Mathematical Theorem Proving in Natural Language
Sean Welleck · Jiacheng Liu · Ronan Le Bras · Hanna Hajishirzi · Yejin Choi · Kyunghyun Cho

MLPerf Tiny Benchmark
Colby Banbury · Vijay Janapa Reddi · Peter Torelli · Nat Jeffries · Csaba Kiraly · Jeremy Holleman · Pietro Montino · David Kanter · Pete Warden · Danilo Pau · Urmish Thakker · antonio torrini · jay cordaro · Giuseppe Di Guglielmo · Javier Duarte · Honson Tran · Nhan Tran · niu wenxu · xu xuesong

OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs
Weihua Hu · Matthias Fey · Hongyu Ren · Maho Nakata · Yuxiao Dong · Jure Leskovec

RedCaps: Web-curated image-text data created by the people, for the people
Karan Desai · Gaurav Kaul · Zubin Aysola · Justin Johnson

An Information Retrieval Approach to Building Datasets for Hate Speech Detection
Md Mustafizur Rahman · Dinesh Balakrishnan · Dhiraj Murthy · Mucahid Kutlu · Matt Lease

COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening
Tong Xia · Dimitrios Spathis · Chloë Brown · J Ch · Andreas Grammenos · Jing Han · Apinan Hasthanasombat · Erika Bondareva · Ting Dang · Andres Floto · Pietro Cicuta · Cecilia Mascolo

ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation
Laurynas Karazija · Iro Laina · Christian Rupprecht

MIND dataset for diet planning and dietary healthcare with machine learning: Dataset creation using combinatorial optimization and controllable generation with domain experts
Changhun Lee · Soohyeok Kim · Sehwa Jeong · Chiehyeon Lim · Jayun Kim · Yeji Kim · Minyoung Jung

A Channel Coding Benchmark for Meta-Learning
Rui Li · Ondrej Bohdal · Rajesh K Mishra · Hyeji Kim · Da Li · Nicholas Lane · Timothy Hospedales

Chaos as an interpretable benchmark for forecasting and data-driven modelling
William Gilpin

Revisiting Time Series Outlier Detection: Definitions and Benchmarks
Kwei-Herng Lai · Daochen Zha · Junjie Xu · Yue Zhao · Guanchu Wang · Xia Hu

Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
Simon Mille · Kaustubh Dhole · Saad Mahamood · Laura Perez-Beltrachini · Varun Prashant Gangal · Mihir Kale · Emiel van Miltenburg · Sebastian Gehrmann

HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML
Sebastian Pineda Arango · Hadi Jomaa · Martin Wistuba · Josif Grabocka

WikiChurches: A Fine-Grained Dataset of Architectural Styles with Real-World Challenges
Björn Barz · Joachim Denzler

ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations
Tongzhou Mu · Zhan Ling · Fanbo Xiang · Derek Yang · Xuanlin Li · Stone Tao · Zhiao Huang · Zhiwei Jia · Hao Su

Monash Time Series Forecasting Archive
Rakshitha W Godahewa · Christoph Bergmeir · Geoffrey Webb · Rob Hyndman · Pablo Montero-Manso

Which priors matter? Benchmarking models for learning latent dynamics
Aleksandar Botev · Andrew Jaegle · Peter Wirnsberger · Daniel Hennes · Irina Higgins

Reinforcement Learning Benchmarks for Traffic Signal Control
James Ault · Guni Sharon

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
Curtis Northcutt · Anish Athalye · Jonas Mueller

CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer
Moein Sorkhei · Yue Liu · Hossein Azizpour · Edward Azavedo · Karin Dembrower · Dimitra Ntoula · Athanasios Zouzos · Fredrik Strand · Kevin Smith

Benchmarks for Corruption Invariant Person Re-identification
Minghui Chen · Zhiqiang Wang · Feng Zheng

ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models
Salva Rühling Cachay · Venkatesh Ramesh · Jason Cole · Howard Barker · David Rolnick

Variance-Aware Machine Translation Test Sets
Runzhe Zhan · Xuebo Liu · Derek Wong · Lidia Chao

MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research
Mikayel Samvelyan · Robert Kirk · Vitaly Kurin · Jack Parker-Holder · Minqi Jiang · Eric Hambro · Fabio Petroni · Heinrich Kuttler · Edward Grefenstette · Tim Rocktäschel

An Empirical Investigation of Representation Learning for Imitation
Cynthia Chen · Sam Toyer · Cody Wild · Scott Emmons · Ian Fischer · Kuang-Huei Lee · Neel Alex · Steven Wang · Ping Luo · Stuart Russell · Pieter Abbeel · Rohin Shah

Multilingual Spoken Words Corpus
Mark Mazumder · Sharad Chitlangia · Colby Banbury · Yiping Kang · Juan Ciro · Keith Achorn · Daniel Galvez · Mark Sabini · Peter Mattson · David Kanter · Greg Diamos · Pete Warden · Josh Meyer · Vijay Janapa Reddi