Data-Centric AI (DCAI) represents the recent shift in focus from modeling to the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal -- painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The DCAI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems.
The main objective of this workshop is to cultivate the DCAI community into a vibrant interdisciplinary field that tackles practical data problems. We consider some of those problems to be: data collection/generation, data labeling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. Many of these areas are nascent, and we hope to further their development by knitting them together into a coherent whole. Together we will define the DCAI movement that will shape the future of AI and ML. Please see our call for papers below to take an active role in shaping that future! If you have any questions, please reach out to the organizers (neurips-data-centric-ai@googlegroups.com).
The ML community has a strong track record of building and using datasets for AI systems, but this endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The core challenge, then, is to accelerate dataset creation and iteration while increasing the efficiency of use and reuse by democratizing data engineering and evaluation.
If 80 percent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team and therefore a vital research area. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovative efforts have mostly focused on models and code. The growing focus on the scale, speed, and cost of building and improving datasets has come at the expense of quality, which is nebulous and often circularly defined, since the annotators are the source of both data and ground truth [Riezler, 2014]. The development of tools to make repeatable and systematic adjustments to datasets has also lagged. While dataset quality is still everyone's top concern, the ways in which it is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], lack of documentation and replication of data [Katsuno et al., 2019], and unrealistic performance metrics [Bernstein, 2021].
We need a framework for excellence in data engineering, and one does not yet exist. In the first-to-market rush with data, the maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Building an active research community focused on Data-Centric AI is an important part of the process of defining the core problems and creating ways to measure progress in machine learning through data quality tasks.
Opening Remarks | |
Workshop Overview (Talk) | |
Human Computer Interaction and Crowdsourcing for Data Centric AI (Keynote) | |
Past and Future of Data Centric AI (Invited Talk) | |
Data Centric AI Competition (Intro) | |
Data Centric AI Competition: Divakar Roy (Lightning Talk) | |
Data Centric AI Competition: Shashank Deshpande (Lightning Talk) | |
Data Centric AI Competition: Johnson Kuan (Lightning Talk) | |
Data Centric AI Competition: Rens Dimmendaal (Lightning Talk) | |
Data Centric AI Competition: Nidhish Shah (Lightning Talk) | |
A Data-Centric Approach for Training Deep Neural Networks with Less Data (Lightning Talk) | |
Q&A Lightning Talk - Benchmarking (Q&A Session) | |
Lightning Talks - Benchmarks and Challenges (Intro) | |
Few-Shot Image Classification Challenge On-Board OPS-SAT (Lightning Talk) | |
No News is Good News: A Critique of the One Billion Word Benchmark (Lightning Talk) | |
A Data-Centric Image Classification Benchmark (Lightning Talk) | |
On Data-centric Myths (Lightning Talk) | |
Human-inspired Data Centric Computer Vision (Lightning Talk) | |
Q&A Lightning Talk - Benchmarks and Challenges (Q&A Session) | |
Break | |
DataPerf - Peter Mattson and Praveen Paritosh (Talk) | |
Lightning Talks - Challenge Problems and Theory (Intro) | |
YMIR: A Rapid Data Development Platform for Long-tailed Vision Applications (Lightning Talk) | |
CircleNLU: A Tool for building Data-Driven Natural Language Understanding System (Lightning Talk) | |
AirSAS: Controlled Dataset Generation for Physics-Informed Machine Learning (Lightning Talk) | |
Lhotse: a speech data representation library for the modern deep learning ecosystem (Lightning Talk) | |
Data-Driven Deep Reinforcement Learning in Quantitative Finance (Lightning Talk) | |
Ground-Truth, Whose Truth? - Examining the Challenges with Annotating Toxic Text Datasets (Lightning Talk) | |
Data-Centric AI Requires Rethinking Data Notion (Lightning Talk) | |
Small Data in NLU: Proposals towards a Data-Centric Approach (Lightning Talk) | |
Towards better data discovery and collection with flow-based programming (Lightning Talk) | |
Q&A Lightning Talks - Challenge Problems and Theory (Q&A Session) | |
Break | |
Facebook - Data Centric Infrastructure (Invited Talk) | |
Lightning Talks - Responsibility and Ethics (Intro) | |
Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data (Lightning Talk) | |
Feminist Curation of Text for Data-centric AI (Lightning Talk) | |
Addressing Content Selection Bias in Creating Datasets for Hate Speech Detection (Lightning Talk) | |
Data Cards: Purposeful and Transparent Documentation for Responsible AI (Lightning Talk) | |
A Data-Centric Behavioral Machine Learning Platform to Reduce Health Inequalities (Lightning Talk) | |
Simultaneous Improvement of ML Model Fairness and Performance by Identifying Bias in Data (Lightning Talk) | |
Building Legal Datasets (Lightning Talk) | |
DAG Card is the new Model Card (Lightning Talk) | |
Q&A Lightning Talks - Responsibility and Ethics (Q&A Session) | |
Break | |
Q&A with Morning Invited + Keynote Speakers + Closing Remarks (Q&A Session) | |
Break - watch the on-demand videos and ask questions in Rocket.Chat (Break) | |
Alex Ratner and Chris Re - The Future of Data Centric AI (Keynote) | |
Technical Debt in ML: A Data-Centric View (Invited Talk) | |
Lightning Talks - Data Synthesis and Datasets (Recorded Talks) | |
Towards Systematic Evaluation in Machine Learning through Automated Stress Test Creation (Lightning Talk) | |
Bridging the gap to real-world for network intrusion detection systems with data-centric approach (Lightning Talk) | |
IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons (Lightning Talk) | |
LSH methods for data deduplication in a Wikipedia artificial dataset (Lightning Talk) | |
Using Synthetic Images To Uncover Population Biases In Facial Landmarks Detection (Lightning Talk) | |
3D ImageNet: A data collection and labeling tool for Depth and RGB Images (Lightning Talk) | |
Augment & Valuate: A Data Enhancement Pipeline for data-centric AI (Lightning Talk) | |
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs (Lightning Talk) | |
Sim2Real Docs: Domain Randomization for Documents in Natural Scenes using Ray-traced Rendering (Lightning Talk) | |
A First Look Towards One-Shot Object Detection with SPOT for Data-Efficient Learning (Lightning Talk) | |
Challenges of Working with Materials R&D Data (Lightning Talk) | |
Open-Sourcing Generative Models for Data-driven Robot Simulations (Lightning Talk) | |
Natural Adversarial Objects (Lightning Talk) | |
Q&A for Lightning Talks - Datasets and Data Synthesis (Q&A Session) | |
Break | |
Curtis Northcutt (Invited Talk) | |
Lightning Talks - Data Quality and Iteration (Intro) | |
DiagnosisQA: A semi-automated pipeline for developing clinician validated diagnosis specific QA datasets (Lightning Talk) | |
Contrasting the Profiles of Easy and Hard Observations in a Dataset (Lightning Talk) | |
Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation (Lightning Talk) | |
Engineering AI Tools for Systematic and Scalable Quality Assessment in Magnetic Resonance Imaging (Lightning Talk) | |
PyHard: a novel tool for generating hardness embeddings to support data-centric analysis (Lightning Talk) | |
Increasing Data Diversity with Iterative Sampling to Improve Performance (Lightning Talk) | |
Exploiting Proximity Search and Easy Examples to Select Rare Events (Lightning Talk) | |
Fantastic Data and How to Query Them (Lightning Talk) | |
Exploiting Domain Knowledge for Efficient Data-centric Session-based Recommendation model (Lightning Talk) | |
Automatic Data Quality Evaluation for Text Classification (Lightning Talk) | |
Q&A for Lightning Talks - Data Quality and Iteration (Q&A Session) | |
Break | |
Anima Anandkumar (Invited Talk) | |
Lightning Talks - Data Labeling (Intro) | |
Decreasing Annotation Burden of Pairwise Comparisons with Human-in-the-Loop Sorting: Application in Medical Image Artifact Rating (Lightning Talk) | |
Effect of Radiology Report Labeler Quality on Deep Learning Models for Chest X-Ray Interpretation (Lightning Talk) | |
Towards a Shared Rubric for Dataset Annotation (Lightning Talk) | |
Influence of human-expert labels on a neonatal seizure detector based on a convolutional neural network (Lightning Talk) | |
Utilizing Driving Context to Increase the Annotation Efficiency of Imbalanced Gaze Image Data (Lightning Talk) | |
Highly Efficient Representation and Active Learning Framework and Its Application to Imbalanced Medical Image Classification (Lightning Talk) | |
Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation (Lightning Talk) | |
Finding Label Errors in Autonomous Vehicle Data With Learned Observation Assertions (Lightning Talk) | |
Ontolabeling: Re-Thinking Data Labeling For Computer Vision (Lightning Talk) | |
Single-Click 3D Object Annotation on LiDAR Point Clouds (Lightning Talk) | |
Q&A for Lightning Talks - Data Labeling (Q&A Session) | |
Q&A with Afternoon Invited + Keynote Speakers + Closing Remarks (Q&A Session) | |
Below are the videos of accepted Lightning Talks that are not presented in the livestream (Lightning Talks (videos)) | |
Data Expressiveness and Its Use in Data-centric AI (video) | |
Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models (video) | |
A Hybrid Bayesian Model to Analyse Healthcare Data (video) | |
How should human translation coexist with NMT? Efficient tool for building high quality parallel corpus (video) | |
A Probabilistic Framework for Knowledge Graph Data Augmentation (video) | |
A New Tool for Efficiently Generating Quality Estimation Datasets (video) | |
Automatic Knowledge Augmentation for Generative Commonsense Reasoning (video) | |
FedHist: A Federated-First Dataset for Learning in Healthcare (video) | |
SCIMAT: Science and Mathematics Dataset (video) | |
Unleashing the Power of Industrial Big Data through Scalable Manual Labeling (video) | |
nferX: a case study on data-centric NLP in biomedicine (video) | |
Combining Data-driven Supervision with Human-in-the-loop Feedback for Entity Resolution (video) | |
InfiniteForm: A synthetic, minimal bias dataset for fitness applications (video) | |
Who Decides if AI is Fair? The Labels Problem in Algorithmic Auditing (video) | |
Debiasing Pre-Trained Sentence Encoders With WordDropouts on Fine-Tuning Data (video) | |
Towards a Framework for Data Excellence in Data-Centric AI: Lessons from the Semantic Web (video) | |
Data Agnostic Image Annotation (video) | |
On Biased Systems and Data (video) | |
Tabular Engineering with Automunge (video) | |
All in one Data Cleansing Tool (video) | |
Seg-Diff: Checkpoints Are All You Need (video) | |
Data Augmentation for Intent Classification (video) | |
Topological Deep Learning (video) | |
Homogenization of Existing Inertial-Based Datasets to Support Human Activity Recognition (video) | |
Data preparation for training CNNs: Application to vibration-based condition monitoring (video) | |
Dialectal Voice: An Open-Source Voice Dataset and Automatic Speech Recognition model for Moroccan Arabic dialectal (video) | |
Annotation Quality Framework - Accuracy, Credibility, and Consistency (video) | |
Diagnosing severity levels of Autism Spectrum Disorder with Machine Learning (video) | |
Challenges and Solutions to build a Data Pipeline to Identify Anomalies in Enterprise System Performance (video) | |
Vietnamese Speech-based Question Answering over Car Manuals (video) | |
Towards a Taxonomy of Graph Learning Datasets (video) | |
Evaluating Machine Learning Models for Internet Network Security with Data Slices (video) | |
AutoDQ: Automatic Data Quality for Financial Data (video) | |
What can Data-Centric AI Learn from Data and ML Engineering? (video) | |
Annotation Inconsistency and Entity Bias in MultiWOZ (video) | |
AutoDC: Automated data-centric processing (video) | |
Fix your Model by Fixing your Datasets (video) | |
Can machines learn to see without visual databases? (video) | |
Two Approaches to Building Dialogue Systems for People on the Spectrum (video) | |
CogALex 2.0: Impact of Data Quality on Lexical-Semantic Relation Prediction (video) | |
A concept for fitness-for-use evaluation in Machine Learning pipelines (video) | |
Bridging the gap between AI and the life sciences: towards a standardized multi-omics data type (video) | |
Data vast and low in variance: Augment machine learning pipelines with dataset profiles to improve data quality without sacrificing scale (video) | |