Workshop
Data Centric AI
Andrew Ng · Lora Aroyo · Greg Diamos · Cody Coleman · Vijay Janapa Reddi · Joaquin Vanschoren · Carole-Jean Wu · Sharon Zhou · Lynn He

Tue Dec 14 08:30 AM -- 06:00 PM (PST)
Event URL: http://datacentricai.org/

Data-Centric AI (DCAI) represents the recent transition from focusing on modeling to the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal -- painstaking and expensive. The community lacks high productivity and efficient open data engineering tools to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The DCAI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems.

The main objective of this workshop is to cultivate the DCAI community into a vibrant interdisciplinary field that tackles practical data problems. We consider some of those problems to be: data collection/generation, data labeling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. Many of these areas are nascent, and we hope to further their development by knitting them together into a coherent whole. Together we will define the DCAI movement that will shape the future of AI and ML. Please see our call for papers below to take an active role in shaping that future! If you have any questions, please reach out to the organizers (neurips-data-centric-ai@googlegroups.com).

The ML community has a strong track record of building and using datasets for AI systems. But this endeavor is often artisanal—painstaking and expensive. The community lacks high productivity and efficient open data engineering tools to make building, maintaining and evaluating datasets easier, cheaper and more repeatable. So, the core challenge is to accelerate dataset creation and iteration together with increasing the efficiency of use and reuse by democratizing data engineering and evaluation.

If 80 percent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team and therefore a vital research area. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovative efforts have mostly focused on models and code. The growing focus on the scale, speed, and cost of building and improving datasets has come at the expense of quality, which is nebulous and often circularly defined, since the annotators are the source of both data and ground truth [Riezler, 2014]. The development of tools to make repeatable and systematic adjustments to datasets has also lagged. While dataset quality is still everyone's top concern, the ways in which it is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], lack of documentation and replication of data [Katsuno et al., 2019], and unrealistic performance metrics [Bernstein, 2021].

We need a framework for excellence in data engineering that does not yet exist. In the first-to-market rush with data, aspects of maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Building an active research community focused on Data Centric AI is an important part of defining the core problems and creating ways to measure progress in machine learning through data quality tasks.

Tue 8:30 a.m. - 8:45 a.m.
Andrew Ng - Opening Remarks (Opening Remarks)
Tue 8:45 a.m. - 9:00 a.m.
Lora Aroyo - Workshop Information (Talk)
Tue 9:00 a.m. - 9:15 a.m.
Michael Bernstein - Human Computer Interaction and Crowdsourcing for Data Centric AI (Keynote)
Michael Bernstein
Tue 9:15 a.m. - 9:25 a.m.
Olga Russakovsky - Past and Future of Data Centric AI (Invited Talk)
Olga Russakovsky
Tue 9:25 a.m. - 9:27 a.m.
(Oral)   

While the availability of large datasets is perceived to be a key requirement for training deep neural networks, it is possible to train such models with relatively little data. However, compensating for the absence of large datasets demands a series of actions to enhance the quality of the existing samples and to generate new ones. This paper summarizes our winning submission to the "Data-Centric AI" competition. We discuss some of the challenges that arise while training with a small dataset, offer a principled approach for systematic data quality enhancement, and propose a GAN-based solution for synthesizing new data points. Our evaluations indicate that the dataset generated by the proposed pipeline offers a 5% accuracy improvement, while being significantly smaller than the baseline.

Tue 9:25 a.m. - 9:40 a.m.
Lightning Talks - Data Centric AI Competition (Recorded Talks)
Tue 9:25 a.m. - 9:27 a.m.
Data Centric AI Competition (Talk)
Lynn He
Tue 9:27 a.m. - 9:29 a.m.
Data Centric AI Competition: Divakar Roy (Oral)
Divakar Roy
Tue 9:29 a.m. - 9:31 a.m.
Data Centric AI Competition: Shashank Deshpande (Oral)   
Shashank Deshpande
Tue 9:31 a.m. - 9:33 a.m.
Data Centric AI Competition: Johnson Kuan (Oral)   
Johnson Kuan
Tue 9:33 a.m. - 9:35 a.m.
Data Centric AI Competition: Rens Dimmendaal (Oral)   
Rens Dimmendaal
Tue 9:35 a.m. - 9:37 a.m.
Data Centric AI Competition: Nidhish Shah (Oral)   
Nidhish Shah
Tue 9:40 a.m. - 9:50 a.m.
Q&A Lightning Talk - Benchmarking (Q&A Session)
Tue 9:50 a.m. - 9:52 a.m.
(Oral)   

Artificial Intelligence on the edge is constrained to low-memory and low-energy environments, but has a high impact potential. We propose a data-centric competition to accelerate the deployment of on-board classification for Earth Observation satellites. To this end, we fix the model architecture a priori to be suitable for ground-to-satellite transmission and on-board inference. The competitors submit model parameters obtained via their training procedure using only a few labeled images taken from the European Space Agency satellite OPS-SAT, and are ranked according to classification accuracy on a larger hidden test set. Our final goals are to alleviate the need for large amounts of in-situ data for training on-board AI, and to reduce the number of pre-processing steps performed on-board. Our approach could be further extended to other domains, including guidance and control, astronomy and more.

Tue 9:50 a.m. - 10:15 a.m.
Lightning Talks - Benchmarks and Challenges (Recorded Talks)
Tue 9:52 a.m. - 9:54 a.m.
(Oral)   

The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl, commonly used to measure language modeling ability in natural language processing. We train models solely on Common Crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift. Analysis of this corpus reveals that it contains several examples of harmful text, as well as outdated references to current events. We suggest that the temporal nature of news and its distribution shift over time makes it poorly suited for measuring language modeling ability, and discuss potential impact and considerations for researchers building language models and evaluation datasets.

Tue 9:54 a.m. - 9:56 a.m.
(Oral)   

High-quality labeled datasets are critical to advances in machine learning and tend to benefit all kinds of model-centric algorithms, such as novel architectures and loss functions. The labeling process is usually labor-intensive and time-consuming, since it includes many rounds of data selection, data cleaning, and data analysis. A great deal of work aims to solve each specific step, but the field lacks an understanding of how to combine them and, most importantly, a standard testbed for different dataset-improving techniques. We therefore present the concept of a multi-domain benchmark for acquiring consistent labels with limited budgets. In contrast to most benchmarks, which encourage novel model-centric algorithms, our multi-domain data-centric benchmark encourages algorithms to improve the provided dataset. The proposed benchmark consists of different resolutions, class distributions, and domains ranging from biological to medical.

Tue 9:56 a.m. - 9:58 a.m.
(Oral)   

The community lacks theory-informed guidelines for building good data sets. We analyse theoretical directions relating to what aspects of the data matter and conclude that the intuitions derived from the existing literature are incorrect and misleading. Using empirical counter-examples, we show that 1) data dimension should not necessarily be minimised and 2) when manipulating data, preserving the distribution is inessential. This calls for a more data-aware theoretical understanding. Although not explored in this work, we propose the study of the impact of data modification on learned representations as a promising research direction.

Tue 9:58 a.m. - 10:00 a.m.
(Oral)   

The vast majority of work in computer vision focuses on proposing and applying new machine learning models and algorithms for visual recognition. In contrast, relatively little work has studied how properties of the training data affect these models. For example, the Internet images and videos commonly used for training are very different from the inputs that human vision systems receive in our everyday lives. If the goal of computer vision is to build vision systems as intelligent as humans, we argue that we should study the actual inputs to human vision systems and get hints for improving the training data for computer vision models. We use wearable cameras and eye gaze trackers to collect video data that approximates people's everyday visual fields of view, and find structure in the data that can potentially improve computer vision systems. This paper presents our previous work in this direction and advocates data-centric computer vision inspired by human vision.

Tue 10:00 a.m. - 10:25 a.m.
Q&A Lightning Talk - Benchmarks and Challenges (Q&A session)
Tue 10:25 a.m. - 10:40 a.m.
MLDataPerf - Peter Mattson and Praveen Paritosh (Talk)
Peter Mattson
Tue 10:40 a.m. - 10:42 a.m.
(Oral)   

This paper introduces an open-source platform for the rapid development of long-tailed computer vision applications. The platform puts efficient dataset development at the center of the machine learning development process; it integrates active learning methods as well as data and model version control, and uses concepts such as projects to enable fast iteration of multiple task-specific datasets in parallel. We make it an open platform by abstracting the development process into core states and operations, and design open APIs to integrate third-party tools as implementations of those operations. This open design reduces our development cost and at the same time reduces the adoption cost for ML teams with existing tools for part of the development process. The platform is slated to be open-sourced in the coming weeks and is already used internally to meet the ever-increasing demand for custom computer vision applications from customers.

Tue 10:40 a.m. - 11:00 a.m.
Lightning Talks - Challenge Problems and Theory (Recorded Talks)
Tue 10:42 a.m. - 10:44 a.m.
(Oral)   

Despite the huge successes reported by the field of machine learning, such as voice assistants or self-driving cars, businesses still observe a very high failure rate when it comes to deployment of ML in production. We argue that part of the reason is infrastructure that was not designed for data-oriented activities. This paper explores the potential of flow-based programming (FBP) for simplifying data discovery and collection in software systems. We compare FBP with the currently prevalent service-oriented paradigm to assess the characteristics of each in the context of ML deployment. We develop a data processing application, formulate a subsequent ML deployment task, and measure the impact of implementing the task within both programming paradigms. Our main conclusion is that FBP shows great potential for providing data-centric infrastructural benefits for the deployment of ML. Additionally, we provide insight into the current trend of prioritizing model development over data quality management.

Tue 10:44 a.m. - 10:46 a.m.
(Oral)   

Although many open-source developer tools exist for building intelligent chatbot agents, they still offer only a few basic features and do not provide enough support for building commercial products, or for low-resource languages like Vietnamese, across the whole Natural Language Understanding (NLU) life-cycle. Thus, building an in-house system is necessary for enterprises that want a unique competitive advantage. To fill this gap, we introduce CircleNLU, a Data-Driven NLU system with a range of features such as pseudo annotation, massive deployment, a machine learning (ML) development process, and others. This paper shares our system architecture and shows how it works.

Tue 10:46 a.m. - 10:48 a.m.
(Oral)   

Synthetic aperture sonar (SAS) is an underwater remote sensing technique for applications such as seafloor characterization and object detection. However, underwater SAS datasets are both extremely expensive to collect and difficult to control and repeat. We propose an in-air SAS measurement apparatus (AirSAS) made from commercial off-the-shelf laboratory equipment to generate controlled, repeatable datasets. AirSAS is both flexible and sufficiently delicate to capture the complex acoustic phenomena inherent in SAS measurements. The system allows us to physically control the differences between classes of interest, and observe acoustic phenomenology that is rare or expensive to collect underwater. Accordingly, we can measure and tune which acoustic phenomena deep learning models are sensitive to. AirSAS can generate both circular and linear track collections. The first iteration of the AirSAS dataset is currently being curated for public release.

Tue 10:48 a.m. - 10:50 a.m.
(Oral)   

Speech data is notoriously difficult to work with due to a variety of codecs, lengths of recordings, and meta-data formats. We present Lhotse, a speech data representation library that draws upon lessons learned from Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON manifest format with corresponding Python classes and over 30 data preparation recipes for popular speech corpora. Various datasets can be easily combined together and re-purposed for different tasks. The library handles multi-channel recordings, long recordings, local and cloud storage, lazy and on-the-fly operations amongst other features. We introduce Cut and CutSet concepts, which simplify common data wrangling tasks for audio and help incorporate acoustic context of speech utterances. Finally, we show how Lhotse leverages PyTorch data API abstractions and adopts them to handle speech data for deep learning.

Tue 10:50 a.m. - 10:52 a.m.
(Oral)   

Deep reinforcement learning (DRL) has shown huge potential in quantitative finance recently. However, due to the high complexity of real-world markets, raw historical financial data often involve large noise and may not reflect the future of markets, degrading the performance of DRL agents in practice. By simulating the trading mechanism of real markets on processed datasets, market simulation environments play an important role in addressing this issue. However, the simulation accuracy heavily relies on the quality of the processed datasets, while building and using datasets is often artisanal -- painstaking and expensive. Moreover, training DRL agents on large datasets imposes a challenge on simulation speed. In this paper, we present NeoFinRL, a framework that includes tens of Near real-market environments for data-driven Financial Reinforcement Learning. First, NeoFinRL separates financial data processing from the design pipeline of DRL-based strategies and provides open-source data engineering tools. Second, NeoFinRL provides tens of standard market environments for various trading tasks. Third, NeoFinRL enables massively parallel simulations by exploiting thousands of GPU cores.

Tue 10:54 a.m. - 10:56 a.m.
(Oral)   

The use of language models (LMs) to regulate content online is on the rise. Task-specific fine-tuning of these models is performed using datasets that are often labeled by annotators who provide "ground-truth" labels in an effort to distinguish between offensive and normal content. Annotators generally include linguistic experts, volunteers, and paid workers recruited on crowdsourcing platforms, among others. These projects have led to the development, improvement, and expansion of large datasets over time, and have contributed immensely to research on natural language. Despite the achievements, existing evidence suggests that Machine Learning (ML) models built on these datasets do not always result in desirable outcomes. Therefore, using a design science research (DSR) approach, this study examines selected toxic text datasets with the goal of shedding light on some of the inherent issues and contributes to discussions on navigating these challenges for existing and future projects.

Tue 10:56 a.m. - 10:58 a.m.
(Oral)   

The transition towards data-centric AI requires revisiting data notions from mathematical and implementational standpoints to obtain unified data-centric machine learning packages. Towards this end, this work proposes unifying principles offered by categorical and cochain notions of data, and discusses the importance of these principles in data-centric AI transition. In the categorical notion, data invariants, which are the structural properties that are preserved under a particular type of morphisms, are often the interesting object of study. As for cochain notion, data can be viewed as a function defined in a discrete domain of interest and acted upon via operators. While these notions are almost orthogonal, they provide a unifying definition to view the data, ultimately impacting the way machine learning packages are developed, implemented, and utilized by practitioners.

Tue 10:58 a.m. - 11:00 a.m.
(Oral)   

Domain-specific voice assistants often suffer from the problem of data scarcity. Publicly available, annotated datasets are in short supply and rarely fit the domain and the language required by a specific use case. Insufficient attention to data quality can generally be problematic when it comes to training and evaluation. The Computational Linguistics (CL) community has gained expertise and developed best practices for high-quality data annotation and collection, as well as for qualitative data analysis. However, the recent model-centric focus in AI and ML has not created ideal conditions for a fruitful collaboration with CL and the more data-centric fields of NLP to tackle data quality issues. We showcase principles and methods from CL / NLP research which can potentially guide the development of data-centric NLU for domain-specific voice assistants, but which have typically been overlooked by common practices in ML / AI. These principles can potentially help shape data-centric practices for other domains as well. We argue that paying more attention to data quality and domain specificity can go a long way in improving the NLU components of today's voice assistants.

Tue 11:00 a.m. - 11:15 a.m.
Q&A Lightning Talks - Challenges and Theory (Q&A Sessions)
Tue 11:20 a.m. - 11:30 a.m.
Facebook - Data Centric Infrastructure (Invited Talk)
Douwe Kiela
Tue 11:30 a.m. - 11:32 a.m.
(Oral)   

With the progressive commoditization of modeling capabilities, data-centric AI recognizes that what happens before and after training becomes crucial for real-world deployments. Following the intuition behind Model Cards, we propose DAG Cards as a form of documentation encompassing the tenets of a data-centric point of view. We argue that Machine Learning pipelines (rather than models) are the most appropriate level of documentation for many practical use cases, and we share with the community an open implementation to generate cards from code.

Tue 11:30 a.m. - 11:50 a.m.
Lightning Talks - Responsibility and Ethics (Recorded Talks)
Tue 11:32 a.m. - 11:34 a.m.
(Oral)   

Data imbalance is common in production data, where controlled production settings require data to fall within a narrow range of variation and data are collected with quality assessment in mind, rather than data analytic insights. This imbalance negatively impacts the predictive performance of models on underrepresented observations. We propose sampling to adjust for this imbalance with the goal of improving the performance of models trained on historical production data. We investigate the use of three sampling approaches to adjust for imbalance. The goal is to downsample the covariates in the training data and subsequently fit a regression model. We investigate how the predictive power of the model changes when using either the sampled or the original data for training. We apply our methods on a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production and find that fitting a model using the sampled data gives a small reduction in the overall predictive performance, but yields a systematically better performance on underrepresented observations. In addition, the results emphasize the need for alternative, fair, and balanced model evaluations.
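
The sampling idea in the abstract above can be illustrated in a few lines: bin a covariate, cap the number of training points per bin, and fit on the capped sample. This is only a toy sketch under invented names and data (a 1-D covariate, a hypothetical `downsample_by_bins` helper), not the authors' pipeline or the penicillin simulation dataset:

```python
import random
from collections import defaultdict

def downsample_by_bins(xs, ys, n_bins=10, cap=None, seed=0):
    """Cap the number of training points per covariate bin so that
    overrepresented operating regions no longer dominate the fit."""
    rng = random.Random(seed)
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), n_bins - 1)
        bins[b].append((x, y))
    if cap is None:
        cap = min(len(v) for v in bins.values())  # size of the rarest bin
    sampled = []
    for b in sorted(bins):
        pts = bins[b]
        sampled.extend(rng.sample(pts, cap) if len(pts) > cap else pts)
    return sampled

# Skewed toy data: 90 points near x=0..9, 10 extra points clustered at x=9.
xs = [0.1 * i for i in range(90)] + [9.0 + 0.01 * i for i in range(10)]
ys = [2 * x for x in xs]
balanced = downsample_by_bins(xs, ys, n_bins=10)
```

A regression model fitted on `balanced` instead of `(xs, ys)` then weights the sparse region of the covariate space more evenly, at the cost of discarding some majority-region data.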

Tue 11:34 a.m. - 11:36 a.m.
(Oral)   

Language models are becoming increasingly central to artificial intelligence through their use in online search, recommendation engines and language generation technologies. However, concepts of gender can be deeply embedded in textual datasets that are used to train language models, which can have a profound influence on societal conceptions of gender. There is therefore an urgent need for scalable methods to enable the evaluation of how gender is represented in large-scale text datasets and language models. We propose a framework founded in feminist theory and feminist linguistics for the assessment of gender ideology embedded in textual datasets and language models, and propose strategies to mitigate bias.

Tue 11:36 a.m. - 11:38 a.m.
(Oral)   

A key challenge in building a dataset for hate speech detection is that hate speech is relatively rare, meaning that random sampling of tweets to annotate is highly inefficient in finding hate speech. To address this, prior work often only considers tweets matching known “hate words”, but restricting the dataset to a pre-defined vocabulary only partially captures the real-world phenomenon we seek to model. Our key insight is that the rarity of hate speech is akin to rarity of relevance in information retrieval (IR). This connection suggests that well-established methodologies for creating IR test collections can be usefully applied to build more inclusive datasets for hate speech. Applying this idea, we have created a new hate speech dataset for Twitter that provides broader coverage of hate, showing a drop in accuracy of existing detection models when tested on these broader forms of hate. This workshop short paper only highlights a longer work currently under review.

Tue 11:38 a.m. - 11:40 a.m.
(Oral)   

As we move towards large-scale models (BERT, LaMBDA, DALL-E) capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give shape and nuance to how these models might be used rapidly increases. As such, a clear and thorough understanding of a dataset's origins, development, intent, ethical considerations and evolution is a necessary step for the responsible and informed deployment of these models, especially in people-facing contexts used across high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of its documentation. Moreover, with these models often dependent on multiple datasets, consistency and comparability across all dataset documentation demands a process akin to user-centric product development. In this position paper, we propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models -- such as upstream sources, data collection and annotation methods, training and evaluation methods, intended use, and decisions affecting model performance. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over twenty Data Cards.

Tue 11:42 a.m. - 11:44 a.m.
(Oral)   

Providing front-line health workers in low- and middle- income countries with recommendations and predictions to improve health outcomes can have a tremendous impact on reducing healthcare inequalities, for instance by helping to prevent the thousands of maternal and newborn deaths that occur every day. To that end, we are developing a data-centric machine learning platform that leverages the behavioral logs from a wide range of mobile health applications running in those countries. Here we describe the platform architecture, focusing on the details that help us to maximize the quality and organization of the data throughout the whole process, from the data ingestion with a data-science purposed software development kit to the data pipelines, feature engineering and model management.

Tue 11:44 a.m. - 11:46 a.m.
(Oral)   

Machine learning models built on datasets containing discriminative instances attributed to various underlying factors produce biased and unfair outcomes. It is a well-founded and intuitive fact that existing bias mitigation strategies often sacrifice accuracy in order to ensure fairness. But when an AI engine's predictions are used for decision making that affects revenue or operational efficiency, such as credit risk modeling, the business would prefer that accuracy be reasonably preserved. This conflicting requirement of maintaining accuracy and fairness in AI motivates our research. In this paper, we propose a fresh approach for the simultaneous improvement of fairness and accuracy of ML models within a realistic paradigm. The essence of our work is a data preprocessing technique that can detect instances ascribing a specific kind of bias that should be removed from the dataset before training, and we further show that such instance removal has no adverse impact on model accuracy. In particular, we claim that in problem settings where instances exist with similar features but different labels caused by variation in protected attributes, an inherent bias is induced in the dataset, which can be identified and mitigated through our novel scheme. Our experimental evaluation on two open-source datasets demonstrates how the proposed method can mitigate bias while improving rather than degrading accuracy, and offers a certain set of controls to the end user.

Tue 11:48 a.m. - 11:50 a.m.
(Oral)   

Data-centric AI advocates a paradigm shift towards better, and not just bigger, datasets. As data protection laws with extra-territorial reach proliferate worldwide, ensuring datasets are legal is an increasingly crucial yet overlooked component of "better". To help dataset builders become more willing and able to navigate this complex legal space, this paper reviews key legal obligations surrounding datasets, examines the practical impact of data laws on ML pipelines, and offers a framework for building legal datasets.

Tue 11:50 a.m. - 12:05 p.m.
Q&A Lightning Talks - Responsibility and Ethics (Q&A Session)
Tue 12:10 p.m. - 12:50 p.m.
Q&A with Morning Invited + Keynote Speakers + Closing Remarks (Q&A Session)
Tue 12:50 p.m. - 1:20 p.m.
Poster Session (Poster)
Tue 1:20 p.m. - 1:35 p.m.
Chris Re (Keynote)
Tue 1:35 p.m. - 1:45 p.m.
D Sculley (Invited talk)
D. Sculley
Tue 1:45 p.m. - 1:47 p.m.
(Oral)   

Although state-of-the-art object detection methods have shown compelling performance, models are often not robust to adversarial attacks and out-of-distribution data. We introduce a new dataset, Natural Adversarial Objects (NAO), to evaluate the robustness of object detection models. NAO contains 7,936 images and 13,604 objects that are unmodified, but cause state-of-the-art detection models to misclassify with high confidence. The mean average precision (mAP) of EfficientDet-D7 drops 68.3% when evaluated on NAO compared to the standard MSCOCO validation set. We investigate why examples in NAO are difficult to detect and classify. Experiments with shuffled image patches reveal that models are overly sensitive to local texture. Additionally, using integrated gradients and background replacement, we find that the detection model relies on pixel information within the bounding box and is insensitive to the background context when predicting class labels.
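
The patch-shuffling probe mentioned in the abstract above is easy to reproduce in spirit: split an image into tiles, permute the tiles, and check whether a model's prediction survives. The following is a minimal grayscale sketch on an invented toy image, not the NAO authors' code:

```python
import random

def shuffle_patches(img, patch, seed=0):
    """Split an H x W grayscale image (list of lists) into non-overlapping
    patch x patch tiles, permute the tiles, and reassemble. The global
    layout is destroyed while local texture inside each tile survives."""
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tiles = [
        [row[j:j + patch] for row in img[i:i + patch]]
        for i in range(0, h, patch)
        for j in range(0, w, patch)
    ]
    rng.shuffle(tiles)
    out = [[0] * w for _ in range(h)]
    for k, tile in enumerate(tiles):
        gi, gj = divmod(k, w // patch)  # grid position of tile k
        for di in range(patch):
            for dj in range(patch):
                out[gi * patch + di][gj * patch + dj] = tile[di][dj]
    return out

img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 intensity ramp
shuffled = shuffle_patches(img, patch=2)
```

If a detector's confidence is unchanged on such shuffled inputs, that is evidence it keys on local texture rather than object shape, which is the sensitivity the abstract reports.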

Tue 1:45 p.m. - 2:15 p.m.
Lightning Talks - Data Synthesis and Datasets (Recorded Talks)
Tue 1:47 p.m. - 1:49 p.m.
(Oral)   

Many machine learning (ML) models that perform well on canonical benchmarks are nonetheless brittle. This has led to a broad assortment of alternative benchmarks for ML evaluation, each relying on their own distinct process of generation, selection, or curation. In this work, we look towards organizing principles for a systematic approach to measuring model performance. We introduce a framework unifying the literature on stress testing and discuss how specific criteria can shape the inclusion or exclusion of samples from a test. As a concrete example of this framework, we present NOOCh: a suite of scalably generated, naturally-occurring stress tests, and show how varying testing criteria can be used to probe specific failure modes. Experimentally, we explore the tradeoffs between various learning approaches on these tests and demonstrate how test design choices can yield varying conclusions.

Tue 1:49 p.m. - 1:51 p.m.
(Oral)   

Most research using machine learning (ML) for network intrusion detection systems (NIDS) uses well-established datasets such as KDD-CUP99, NSL-KDD, UNSW-NB15, and CICIDS-2017. In this context, the possibilities of machine learning techniques are explored, aiming for metric improvements over the published baselines (a model-centric approach). However, those datasets present limitations, such as aging, that make it unfeasible to transpose those ML-based solutions to real-world applications. This paper presents a systematic data-centric approach to address the current limitations of NIDS research, specifically the datasets. This approach generates NIDS datasets composed of the most recent network traffic and attacks, with the labeling process integrated by design.

Tue 1:51 p.m. - 1:53 p.m.
(Oral)   

Today, comprehensive evaluation of large-scale machine learning models is possible thanks to open datasets produced using crowdsourcing, such as SQuAD, MS COCO, ImageNet, SuperGLUE, etc. These datasets capture objective responses, assuming a single correct answer, which does not allow capturing subjective human perception. In turn, pairwise comparison tasks, in which one has to choose between only two options, make it possible to take people's preferences into account for very challenging artificial intelligence tasks, such as information retrieval and recommender system evaluation. Unfortunately, the available datasets are either small or proprietary, slowing down progress in gathering better feedback from human users. In this paper, we present IMDB-WIKI-SbS, a new large-scale dataset for evaluating pairwise comparisons. It contains 9,150 images appearing in 250,249 pairs annotated on a crowdsourcing platform. Our dataset has balanced distributions of age and gender, using the well-known IMDB-WIKI dataset as ground truth. We describe how our dataset is built and then compare several baseline methods, indicating its suitability for model evaluation.
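
Pairwise-comparison datasets like the one described above are typically consumed by fitting a ranking model to the annotated pairs. As one common choice (a sketch, not necessarily among the paper's baselines), here is the classic Bradley-Terry model fitted with a minorization-maximization loop; the toy pairs are invented:

```python
from collections import defaultdict

def bradley_terry(pairs, iters=100):
    """Fit Bradley-Terry scores from (winner, loser) pairs with the
    classic MM update p_i <- W_i / sum_j n_ij / (p_i + p_j).
    Higher score means more preferred; items with no wins get score 0
    in this simple version."""
    items = {x for pair in pairs for x in pair}
    wins = defaultdict(int)    # W_i: total wins of item i
    games = defaultdict(int)   # n_ij: comparisons between i and j
    for w, l in pairs:
        wins[w] += 1
        games[frozenset((w, l))] += 1
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in items
                if j != i and frozenset((i, j)) in games
            )
            new[i] = wins[i] / denom if denom else p[i]
        s = sum(new.values())
        p = {i: v * len(items) / s for i, v in new.items()}  # normalize
    return p

# "a" beats "b" twice, "b" beats "c" twice, "a" beats "c" once.
pairs = [("a", "b"), ("a", "b"), ("b", "c"), ("b", "c"), ("a", "c")]
scores = bradley_terry(pairs)
```

The recovered ordering (`a` above `b` above `c`) is what makes such datasets usable for model evaluation: a model's predicted preferences can be compared against the crowd-derived scores.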

Tue 1:53 p.m. - 1:55 p.m.
(Oral)   

This paper illustrates locality sensitive hashing (LSH) models for the identification and removal of nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-Under-Curve (AUC) values over 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of the repeated data.
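
A minimal sketch of the MinHash-LSH banding idea behind such deduplication models, under common textbook assumptions rather than the paper's implementation; all function names here are hypothetical.

```python
import hashlib

def shingles(text, k=3):
    """Character k-shingles of a whitespace-normalized, lowercased string."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def minhash(sh, num_perm=64):
    """MinHash signature: per seeded hash function, the minimum hash
    value over the shingle set (approximates Jaccard similarity)."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in sh)
            for seed in range(num_perm)]

def lsh_buckets(signatures, bands=16):
    """Band the signatures; docs agreeing on any full band become
    near-duplicate candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc)
    return [group for group in buckets.values() if len(group) > 1]
```

Candidate groups can then be verified with an exact similarity measure before removal, which is what keeps the approach scalable: only colliding pairs are compared in full.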

Tue 1:55 p.m. - 1:57 p.m.
(Oral)   

In order to analyze a trained model's performance and identify its weak spots, one has to set aside a portion of the data for testing. The test set has to be large enough to detect statistically significant biases with respect to all the relevant sub-groups in the target population. This requirement may be difficult to satisfy, especially in data-hungry applications. We propose to overcome this difficulty by generating a synthetic test set. We use the face landmark detection task to validate our proposal by showing that all the biases observed on real datasets are also seen on a carefully designed synthetic dataset. This shows that synthetic test sets can efficiently detect a model's weak spots and overcome the limitations of real test sets in terms of quantity and/or diversity.

Tue 1:57 p.m. - 1:59 p.m.
(Oral)   

3D sensing is increasingly used everywhere, including in tablets, smartphones, robots, and autonomous vehicles. One major limitation to the usage and application of 3D-depth data is that very few databases have clean and accessible data, preventing researchers from building new applications and algorithms. This paper proposes 3D-ImageNet -- an analogue to the original ImageNet, which spurred the 2D image-processing AI explosion. 3D-ImageNet's goal is to do the same for 3D-depth sensing as happened for 2D images. This paper describes an open-source system that multiple cellphone users can use to collect and label a large amount of 3D data.

Tue 1:59 p.m. - 2:01 p.m.
(Oral)   

Data scarcity and noise are important issues in industrial applications of machine learning. However, it is often challenging to devise a scalable and generalized approach to address the fundamental distributional and semantic properties of a dataset with black-box models. For this reason, data-centric approaches are crucial for automating the machine learning operations pipeline. To this end, we suggest a domain-agnostic pipeline for refining the quality of data in image classification problems. This pipeline comprises data valuation, cleansing, and augmentation. With an appropriate combination of these methods, we achieved 84.711% test accuracy (ranked #6, with an Honorable Mention as Most Innovative) in the Data-Centric AI competition using only the provided dataset.

Tue 2:01 p.m. - 2:03 p.m.
(Oral)   

Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g., CLIP, DALL-E) have recently surged in popularity, showing remarkable capability to perform zero- or few-shot learning and transfer, even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and publicly release LAION-400M, a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices that allow efficient similarity search.

Tue 2:03 p.m. - 2:05 p.m.
(Oral)   

In the past, computer vision systems for digitized documents could rely on systematically captured, high-quality scans. Today, transactions involving digital documents are more likely to start as mobile phone photo uploads taken by non-professionals. As such, computer vision for document automation must now address complexities of natural scene images. An additional challenge is that task objectives for document processing can be highly use-case specific, which makes publicly-available datasets limited in their utility, while manual data labeling is also costly and poorly translates between use cases.

To address these issues we created Sim2Real Docs - a framework for synthesizing datasets and performing domain randomization of documents in natural scenes. Sim2Real Docs enables programmatic 3D rendering of documents using Blender, an open source tool for 3D modeling and ray-traced rendering. By using rendering that simulates physical interactions of light, geometry, camera, and background, we synthesize datasets of documents in a natural scene context. Each render is paired with use-case-specific ground truth data specifying latent characteristics of interest, producing unlimited fit-for-task training data. The role of machine learning models is then to solve the inverse problem posed by the rendering pipeline. Such models can be further iterated upon with real-world data by either fine-tuning or adjusting the domain randomization parameters.

Tue 2:07 p.m. - 2:09 p.m.
(Oral)   

In this work we discuss One-Shot Object Detection, the challenging task of detecting novel objects in a target scene using a single reference image called a query. To address this challenge we introduce SPOT (Surfacing POsitions using Transformers), a novel transformer-based end-to-end architecture that exploits the synergy between the provided query and target images through a learnable Robust Feature Matching module, emphasizing target features based on visual cues from the query. We curate LocateDS, a large dataset of query-target pairs from open-source logos and annotated product images containing pictograms, which are better candidates for the one-shot detection problem. Initial results on this dataset show that our model performs significantly better than the current state of the art. We also extend SPOT to a novel real-life downstream task: Intelligent Sample Selection from a domain with a very different distribution.

Tue 2:09 p.m. - 2:11 p.m.
(Oral)   

Materials informatics (MI) is a rapidly growing field that utilizes materials and chemicals R&D data to produce novel materials faster. Progress towards the goal of using MI at large scale is hampered by the current reality of materials R&D data landscapes. In this paper, we break down the unique challenges presented in preparing and storing materials and chemicals research data for MI, and discuss a potential solution for enabling systematic, enterprise-wide materials data storage and retrieval capable of unlocking MI across an organization.

Tue 2:11 p.m. - 2:13 p.m.
(Oral)   

Open-source robot hardware has become popular in recent years due to easy, low-cost fabrication with 3D printing. Applying reinforcement learning algorithms to these robots, however, requires collecting a large amount of data during robot execution. The process is time-consuming and can damage the robot. In addition, data collected for one robot may not be applicable to a similar one due to inherent uncertainties (e.g., friction, compliance) in the fabrication process. Therefore, we propose to disseminate a generative model rather than actual recorded data. We use a limited amount of real data from a robot to train a Generative Adversarial Network (GAN). We show on two robotic systems that training a regression model using generated synthetic data provides transition accuracy at least as good as real data. Such a model could be open-sourced along with the hardware to provide easy and rapid access to research platforms.

Tue 2:15 p.m. - 2:45 p.m.
Q&A for Lightning Talks - Datasets and Data Synthesis (Q&A Session)
Tue 2:45 p.m. - 2:55 p.m.
Curtis Northcutt (Invited Talk)   
Curtis Northcutt
Tue 2:55 p.m. - 2:57 p.m.
(Oral)   

Data quality is critical for machine learning, but its evaluation usually relies on the performance of the models used. A model-independent data quality evaluation metric is needed. This paper proposes a convenient metric, called DQTC, that quantifies data quality for text classification based on information theory. An experiment is conducted to verify the relationship between DQTC and model performance. Finally, we describe the linguistic improvements that should be considered. The code is available online.

Tue 2:55 p.m. - 3:15 p.m.
Lightning Talks - Data Quality and Iteration (Recorded Talks)
Tue 2:57 p.m. - 2:59 p.m.
(Oral)   

Question answering (QA) is one of the oldest research areas in AI and Computational Linguistics. QA has seen significant progress with the development of state-of-the-art models and benchmark datasets over the last few years. However, pre-trained QA models perform poorly on clinical QA tasks, presumably due to the complexity of electronic healthcare data. With the digitization of healthcare data and the increasing volume of unstructured data, it is extremely important for healthcare providers to have a mechanism to query the data to find appropriate answers. Since diagnosis is central to any decision-making for clinicians and patients, we have created a pipeline to develop diagnosis-specific QA datasets and curated a QA database for Cerebrovascular Accident (CVA). CVA, also commonly known as Stroke, is an important and commonly occurring diagnosis among critically ill patients. Compared to clinician validation, our method achieved an accuracy of 0.90 (90% CI [0.82, 0.99]). Using our method, we hope to overcome the key challenges of building and validating a highly accurate QA dataset in a semi-automated manner, which can help improve the performance of QA models.

Tue 2:59 p.m. - 3:01 p.m.
(Oral)   

To support data-centric analyses, it is important to identify and characterize which observations in a dataset are hard or easy to classify. This paper employs meta-learning strategies to describe the main differences between observations that are easy and hard to classify in a dataset. Intervals of significant meta-feature values assessing the hardness levels of the observations are extracted and contrasted. This meta-knowledge allows characterizing the hardness profile of a dataset and obtaining insights into the main sources of difficulty it poses, as shown in experiments using two super-classes of the CIFAR-100 dataset with different hardness levels.

Tue 3:01 p.m. - 3:03 p.m.
(Oral)   

As the adoption of deep learning techniques in industrial applications grows in speed and scale, the successful deployment of deep learning models often hinges on the availability, volume, and quality of annotated data. In this paper, we tackle the problems of efficient data labeling and annotation verification under the human-in-the-loop setting. We show that the latest advancements in self-supervised visual representation learning can lead to tools and methods that benefit the curation and engineering of natural image datasets, reducing annotation cost and increasing annotation quality. We propose a unifying framework leveraging self-supervised semi-supervised learning and use it to construct workflows for data labeling and annotation verification tasks. We demonstrate the effectiveness of our workflows over existing methodologies. On the active learning task, our method achieves 97.0% Top-1 Accuracy on CIFAR10 with 0.1% annotated data, and 83.9% Top-1 Accuracy on CIFAR100 with 10% annotated data. When learning with 50% wrong labels, our method achieves 97.4% Top-1 Accuracy on CIFAR10 and 85.5% Top-1 Accuracy on CIFAR100.

Tue 3:03 p.m. - 3:05 p.m.
(Oral)   

The desire for large medical imaging datasets keeps increasing as machine learning algorithms, parallel computing, and hardware evolve. Accordingly, there is a growing demand for pooling data from multiple clinical and academic institutes to enable large-scale clinical or translational research studies. Magnetic resonance imaging (MRI) is one of the most frequently used non-invasive imaging techniques. However, constructing a big MRI data repository has multiple challenges, such as privacy issues, image size, and issues with DICOM. Not only is constructing the data repository difficult, but using data pooled from the repository is also challenging, due to heterogeneity in image acquisition, reconstruction, and processing pipelines across MRI vendors and sites. This position paper describes the challenges of constructing a large MRI data repository and of using data downloaded from such a repository in various respects. Furthermore, the paper proposes a QA pipeline that can help address the challenges described, and provides general considerations and design principles.

Tue 3:05 p.m. - 3:07 p.m.
(Oral)   

For building successful Machine Learning (ML) systems, it is imperative to have high-quality data and well-tuned learning models. But how can one assess the quality of a given dataset? And how can the strengths and weaknesses of a model on a dataset be revealed? Our new tool, PyHard, employs a methodology known as Instance Space Analysis (ISA) to produce a hardness embedding of a dataset, relating the predictive performance of multiple ML models to estimated instance-hardness meta-features. This space is built so that observations are distributed linearly with respect to how hard they are to classify. The user can visually interact with this embedding in multiple ways and obtain useful insights about data and algorithmic performance along the individual observations of the dataset. We show, on a COVID prognosis dataset, how this analysis supported the identification of pockets of hard observations that challenge ML models and are therefore worth closer inspection, as well as the delineation of regions of strengths and weaknesses of ML models.
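
One widely used instance-hardness meta-feature of the kind such tools estimate is k-Disagreeing Neighbors (kDN): the fraction of an instance's nearest neighbors carrying a different label. The sketch below is a generic illustration of that measure, not PyHard's implementation.

```python
import math

def kdn(points, labels, k=3):
    """k-Disagreeing Neighbors hardness: for each instance, the fraction
    of its k nearest neighbors with a different label (0 = easy, 1 = hard)."""
    scores = []
    for i, p in enumerate(points):
        # indices of the k nearest other points, by Euclidean distance
        neigh = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))[:k]
        scores.append(sum(labels[j] != labels[i] for j in neigh) / k)
    return scores
```

An instance surrounded by the opposite class (e.g., a mislabeled or borderline point) scores near 1.0, flagging it as one of the "pockets of hard observations" worth closer inspection.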

Tue 3:07 p.m. - 3:09 p.m.
(Oral)   

As part of the Data-Centric AI Competition, we propose data-centric approaches to improve the diversity of the training samples by iterative sampling. The method itself relies strongly on the fidelity of the augmented samples and the diversity of the augmentation methods. Moreover, we improve performance further by introducing more samples for the problematic classes, providing samples closer to edge cases, potentially those the model at hand misclassifies.

Tue 3:09 p.m. - 3:11 p.m.
(Oral)   

A common problem practitioners face is selecting rare events in a large dataset. Unfortunately, standard techniques ranging from pre-trained models to active learning do not leverage the proximity structure present in many datasets and can lead to worse-than-random results. To address this, we propose EzMode, an algorithm for the iterative selection of rare events in large, unlabeled datasets. EzMode leverages active learning to iteratively train classifiers but, in contrast to standard uncertainty techniques, chooses the easiest positive examples to label. EzMode also leverages proximity structure (e.g., temporal sampling) to find difficult positive examples. We show that EzMode can outperform baselines by up to 130× on a novel, real-world, 9,000 GB video dataset.
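
A toy sketch of the two ideas the abstract describes, confidence-first selection and proximity expansion. The function names and the stand-in score list are hypothetical, not the EzMode code.

```python
def select_easiest(scores, labeled, batch=2):
    """Pick the unlabeled items the current model is MOST confident are
    positive -- the opposite of classic uncertainty sampling."""
    pool = [(s, i) for i, s in enumerate(scores) if i not in labeled]
    pool.sort(reverse=True)
    return [i for _, i in pool[:batch]]

def expand_temporal(index, n, radius=1):
    """Proximity structure: surface the temporal neighbors of a confirmed
    positive, where harder positives often hide (e.g., adjacent frames)."""
    return [j for j in range(index - radius, index + radius + 1)
            if 0 <= j < n and j != index]
```

In an iterative loop, the newly labeled easy positives retrain the classifier, and their neighbors supply the difficult positives that confidence ranking alone would miss.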

Tue 3:11 p.m. - 3:13 p.m.
(Oral)   

It is commonly acknowledged that the availability of huge amounts of (training) data is one of the most important factors for many recent advances in Artificial Intelligence (AI). However, datasets are often designed for specific tasks in narrow AI sub-areas, and there is no unified way to manage and access them. This not only creates unnecessary overhead when training or deploying Machine Learning models but also limits understanding of the data, which is very important for data-centric AI. In this paper, we present our vision of a unified framework for different datasets so that they can be integrated and queried easily, e.g., using standard query languages. We demonstrate this in our ongoing work to create a framework for datasets in Computer Vision and show its advantages in different scenarios. Our demonstration is available at https://vision.semkg.org.

Tue 3:13 p.m. - 3:15 p.m.
(Oral)   

Traditionally, deep learning model training involves choosing complex network architectures and large data sets to build models with high accuracy. This kind of training demands a large high-performance computing infrastructure to complete the training process in a reasonable time. We explore a data-centric approach that chooses the "right" data samples in the "right" amount during each epoch of a model's training to build the model efficiently, in a shorter time, and with at least the same (sometimes even better) accuracy compared to the traditional approach. This paper presents our experience using domain knowledge (the temporal nature of data) to build a recommendation model by reducing the data samples used in successive training epochs. We show that using a data-centric approach on state-of-the-art session-based recommendation models can reduce model training time by at least 2.1x and achieve slightly better accuracy and "average recommendation popularity" on a publicly available data set containing 6 million sessions.

Tue 3:15 p.m. - 3:35 p.m.
Q&A for Lightning Talks - Data Quality and Iteration (Q&A Session)
Tue 3:35 p.m. - 3:45 p.m.
Anima Anandkumar (Invited Talk)
Tue 3:45 p.m. - 3:47 p.m.
(Oral)   

When arranging for third-party data annotation, it can be hard to compare how well the competing providers apply best practices to create high-quality datasets. This leads to a "race to the bottom," where competition based solely on price makes it hard for vendors to charge for high-quality annotation. We propose a voluntary rubric which can be used (a) as a scorecard to compare vendors' offerings, (b) to communicate our expectations to vendors more clearly and consistently than today, (c) to justify the expense of choosing someone other than the lowest bidder, and (d) to encourage annotation providers to improve their practices.

Tue 3:45 p.m. - 4:05 p.m.
Lightning Talks - Data Labeling (Recorded Talks)
Tue 3:47 p.m. - 3:49 p.m.
(Oral)   

Over the last decade, developments in computer vision have been driven by image, video, and multimodal benchmark datasets, fueling the growth of machine learning methods for object detection, classification, and scene understanding.

Such advances have, however, created static, goal-specific, and heterogeneous datasets, with little to no emphasis on the taxonomies and semantics behind the class definitions, making them ill-defined and hardly mappable to each other. This approach hinders and limits the long-term usability of datasets, their intercompatibility, their extensibility, and the ability to repurpose them.

In this work we propose a new methodology for data labeling, which we call Ontolabeling, that detaches data structure from semantics, creating two data model layers. The first layer organizes spatio-temporal labels for multi-sensor data, while the second layer makes use of ontologies to structure, organize, maintain, extend and repurpose the semantics of the annotations.

Our approach is supported by an open source toolkit that enables label management (create, read, update, and delete) following the proposed Ontolabeling principles.

Tue 3:49 p.m. - 3:51 p.m.
(Oral)   

We present a simple and effective tool for performing interactive 3D object annotation for 3D object detection on LiDAR point clouds. Our annotation pipeline begins with a pre-labeling stage that infers 3D bounding boxes automatically using a pre-trained deep neural network. While this stage can largely reduce manual effort for annotators, we found that pre-labeling is often imperfect, e.g., some bounding boxes are missing or inaccurate. In this paper, we propose to enhance the annotation pipeline with an interactive operator that allows users to generate a bounding box for a 3D object missed by the pre-trained model. This user interaction acts in the bird's-eye view (BEV), i.e., the top-down perspective of the point cloud, where users can inspect the existing annotations and place additional boxes accordingly. Our interactive operator requires only a single click, and the inference is done by an object detector network trained on the BEV space. Experimental results show that, compared with existing annotation tools, our method can boost annotation efficiency by conveniently adding missing bounding boxes with more accurate dimensions using only single clicks.

Tue 3:51 p.m. - 3:53 p.m.
(Oral)   

Ranking by pairwise comparisons has shown improved reliability over ordinal classification. However, as the annotations of pairwise comparisons scale quadratically, this becomes less practical when the dataset is large. We propose a method for reducing the number of pairwise comparisons required to rank by a quantitative metric, demonstrating the effectiveness of the approach in ranking medical images by image quality in this proof of concept study. Using the medical image annotation software that we developed, we actively subsample pairwise comparisons using a sorting algorithm with a human rater in the loop. We find that this method substantially reduces the number of comparisons required for a full ordinal ranking without compromising inter-rater reliability when compared to pairwise comparisons without sorting.
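
A minimal sketch of sorting with a pairwise rater in the loop, assuming the human is modeled as a comparator function (a generic illustration, not the authors' annotation software): binary insertion needs on the order of n log n comparisons instead of all n(n-1)/2 pairs.

```python
def rank_with_rater(items, prefer):
    """Insertion sort driven by a pairwise rater.  `prefer(a, b)` returns
    True when a outranks b; each call stands for one human comparison."""
    calls = 0
    ranked = []
    for item in items:
        lo, hi = 0, len(ranked)
        while lo < hi:                 # binary search for the item's slot
            mid = (lo + hi) // 2
            calls += 1
            if prefer(item, ranked[mid]):
                hi = mid
            else:
                lo = mid + 1
        ranked.insert(lo, item)
    return ranked, calls
```

For a quantitative quality metric, `prefer` could be replaced by the rater's actual judgment; the comparison count is what the subsampling saves relative to exhaustive pairwise annotation.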

Tue 3:55 p.m. - 3:57 p.m.
(Oral)   

Neonatal seizures are common among infants and can be detected with an electroencephalogram (EEG). EEG signals are complex, multi-channel time series. Human domain experts are often in disagreement when labeling neonatal seizure data, yet only a few studies include labels from multiple experts, as annotating hours of EEG recordings is time-consuming and expensive. In this study, we investigate the differences in performance of a deep-learning-based neonatal seizure detector trained using a single expert's labels versus data labeled by the consensus of multiple experts. Results indicate that there are differences even when the experts are only in minor disagreement. We find that excluding ambiguously labeled data is important when training a neonatal seizure detector.
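
A minimal sketch of consensus labeling with ambiguity exclusion, the kind of training setup the abstract compares against single-expert labels (a generic majority-vote illustration, not the study's pipeline).

```python
from collections import Counter

def consensus(labels_per_sample, min_agreement=1.0):
    """Majority label per sample; samples whose expert agreement falls
    below the threshold are flagged as ambiguous for possible exclusion."""
    keep, ambiguous = {}, []
    for i, votes in enumerate(labels_per_sample):
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            keep[i] = label
        else:
            ambiguous.append(i)
    return keep, ambiguous
```

With `min_agreement=1.0` only unanimously labeled samples are kept, which is one way to operationalize "excluding ambiguously labeled data" before training.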

Tue 3:57 p.m. - 3:59 p.m.
(Oral)   

Knowing where the driver of a car is looking, whether in a mirror or through the windshield, is important for advanced driver assistance systems and driving education applications. This problem can be addressed as a supervised classification task. However, in a typical dataset of driver video recordings, some classes will dominate over others. We implemented a driving video annotation tool (DVAT) that uses automatically recognized driving situations to focus the human annotator’s effort on snippets with a high likelihood of otherwise rarely occurring classes. By using DVAT, we reduced the number of frames that need human input by 94% while keeping the dataset more balanced and using human time efficiently.

Tue 3:59 p.m. - 4:01 p.m.
(Oral)   
We propose a highly data-efficient active learning framework for image classification. Our framework combines, in sequence: (1) unsupervised representation learning with a Convolutional Neural Network and (2) the Gaussian Process (GP) method, to achieve highly data- and label-efficient classification. Moreover, both elements are less sensitive to the prevalent and challenging class imbalance issue, thanks to (1) features learned without labels and (2) the Bayesian nature of the GP. The GP-provided uncertainty estimates enable active learning by ranking samples based on their uncertainty and selectively labeling the samples showing higher uncertainty. We apply this combination to the severely imbalanced cases of COVID-19 chest X-ray classification and Nerthus colonoscopy classification. We demonstrate that only about 10% or less of the labeled data is needed to reach the accuracy obtained from training on all available labels. We also applied our model architecture and proposed framework to a broader class of datasets with expected success.
Tue 4:01 p.m. - 4:03 p.m.
(Oral)   

Human annotations play a crucial role in machine learning (ML) research and development. However, the ethical considerations around the processes and decisions that go into building ML datasets have not received nearly enough attention. In this paper, we survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation. We synthesize these insights and lay out the challenges in this space along two layers: (1) who the annotator is, and how the annotators' lived experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms, and what that relationship affords them. Finally, we put forth a concrete set of recommendations and considerations for dataset developers at various stages of the ML data pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset documentation and release.

Tue 4:03 p.m. - 4:05 p.m.
(Oral)   

ML is being deployed in complex, real-world scenarios where errors have impactful consequences. As such, thorough testing of the ML pipelines is critical. A key component in ML deployment pipelines is the curation of labeled training data, which is assumed to be ground truth. However, in our experience in a large autonomous vehicle development center, we have found that labels can have errors, which can lead to downstream safety risks in trained models.

To address these issues, we propose a new abstraction, learned observation assertions, and implement it in a system, Fixy. Fixy leverages existing organizational resources, such as existing labeled datasets or trained ML models, to learn a probabilistic model for finding errors in labels. Given user-provided features and these existing resources, Fixy learns priors that specify likely and unlikely values (e.g., a speed of 30 mph is likely but 300 mph is unlikely). It then uses these priors to score labels for potential errors. We show Fixy can automatically rank potential errors in real datasets with up to 2× higher precision compared to recent work on model assertions and standard techniques such as uncertainty sampling.
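
As a rough illustration of the "learned prior" idea: fit a simple distribution to a feature from trusted data, then score candidate labels by how surprising their feature values are. Fixy itself learns richer probabilistic models; the single-Gaussian prior and names below are hypothetical.

```python
from statistics import mean, stdev

def fit_prior(trusted_values):
    """Learn a Gaussian prior for one feature from trusted labeled data."""
    return mean(trusted_values), stdev(trusted_values)

def error_scores(candidate_labels, prior):
    """Score each (name, value) label by its z-score under the prior;
    higher score = more surprising = more likely an annotation error."""
    mu, sigma = prior
    return sorted(((abs(v - mu) / sigma, name)
                   for name, v in candidate_labels),
                  reverse=True)
```

Ranking labels by this score surfaces the most suspicious annotations first, so reviewers spend their time where errors are most probable.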

Tue 4:05 p.m. - 4:25 p.m.
Q&A for Lightning Talks - Data Labeling (Q&A Session)
Tue 4:25 p.m. - 5:05 p.m.
Q&A with Afternoon Invited + Keynote Speakers + Closing Remarks (Q&A Session)
Tue 5:05 p.m. - 6:00 p.m.
Poster Session 2 (Posters)
-
(Oral)   

Missing values exist in nearly all clinical studies because data for a variable or question are not collected or are unavailable. Imputing missing values and augmenting data can significantly improve generalization and avoid bias in machine learning models. We propose hybrid Bayesian inference using Hamiltonian Monte Carlo (F-HMC) as a more practical approach to processing cross-dimensional relations, applying a random walk and Hamiltonian dynamics to adapt the posterior distribution and generate large-scale samples. The proposed method, applied to a cancer symptom assessment dataset and MNIST, is confirmed to enrich data quality in terms of precision, accuracy, recall, F1-score, and a propensity metric.

-
(Oral)   

This paper proposes a tool for efficiently constructing high-quality parallel corpora while minimizing human labor, and makes this tool publicly available. Our proposed construction process is based on neural machine translation (NMT), allowing it not only to coexist with human translation but also to improve efficiency by combining data quality control with human translation in a data-centric approach.

-
(Oral)   

Building data for quality estimation (QE) training is expensive and requires significant human labor. In this study, we focus on a data-centric approach to QE and propose a fully automatic pseudo-QE dataset generation tool that produces QE datasets from only a monolingual or parallel corpus as input. Consequently, QE performance is enhanced either by data augmentation or by encouraging multiple language pairs to exploit the applicability of QE. Further, we intend to publicly release this user-friendly QE dataset generation tool, as we believe it provides the community with a new, inexpensive method for developing QE datasets.

-
(Oral)   

Generative commonsense reasoning is the capability of a language model to generate a sentence from a given concept set, grounded in commonsense knowledge. However, generative language models still struggle to provide such outputs, and training sets do not contain patterns sufficient for generative commonsense reasoning. In this paper, we propose a data-centric method that uses automatic knowledge augmentation to extend commonsense knowledge using a machine knowledge generator. This method can generate semi-golden sentences that improve the generative commonsense reasoning of a language model without architecture modifications. Furthermore, this approach is model-agnostic and does not require human effort for data construction.

-
(Oral)   

Automunge is an open-source Python library that has formalized and automated the data preparations for tabular learning between the workflow boundaries of received “tidy data” (one column per feature and one row per sample) and returned dataframes suitable for the direct application of machine learning. Under automation, numeric features are normalized, categoric features are binarized, and missing data is imputed. Data transformations are fit to the properties of a training set to provide a consistent basis for any partitioned “validation data” or additional “test data”. When preparing training data, a compact Python dictionary is returned recording the steps and parameters of transformation, which may then serve as a key for preparing additional corresponding data on a consistent basis. In addition to data preparations under automation, Automunge may also serve as a platform for tabular engineering, as demonstrated herein.
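
A minimal sketch of the fit-on-train / apply-everywhere pattern the abstract describes, for one numeric column (a generic illustration, not the Automunge API; the returned dict plays the role of the "key" for transforming later data on the same basis).

```python
def fit_numeric(train_col):
    """Record imputation and normalization parameters from the TRAINING
    column only, treating None as missing."""
    vals = [v for v in train_col if v is not None]
    mu = sum(vals) / len(vals)
    sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
    return {"mean": mu, "std": sd}

def apply_numeric(col, params):
    """Impute with the training mean, then z-score normalize.  The same
    params dict is reused for validation and test data, so all partitions
    share one consistent basis."""
    return [((params["mean"] if v is None else v) - params["mean"])
            / params["std"] for v in col]
```

Because `params` is plain data, it can be serialized alongside a model and replayed on any future batch, which is the property that makes the preparation repeatable.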

-
(Oral)   

Under-represented languages such as Moroccan dialectal Arabic, commonly known as Darija, face a lack of open systems capable of understanding them. However, a growing need for such systems is increasingly expressed by academia, private companies, and public institutions, in order to improve the human experience and ensure good productivity. We present an automatic speech recognition system resulting from data-centric and transfer learning approaches used to construct a voice database and a speech recognition model for Darija.

-
(Oral)   

In this work, we announce a comprehensive, well-curated, open-source dataset with millions of samples of pre-college- and college-level problems in mathematics and science. A preliminary set of results using transformer architectures with character-to-character encoding is shown. The dataset identifies some challenging problems and invites research on better architecture search.

-
(Oral)   

The success of many machine learning and offline measurement efforts is highly dependent on the quality of the labeled data they use. The development of supervised machine learning models and quantitative research relies on the assumption that annotations obtained through human reviewers are “ground truth”. Annotation quality issues violate this assumption and corrupt the quality of all downstream work and analysis. Through a series of analyses, we have identified a pressing need for a quality framework that allows the creation of a robust system for label quality monitoring and improvement. In this paper, we present an overview of the Accuracy, Credibility, and Consistency (ACC) framework, which consists of three elements: (1) an understanding of what annotation quality is and which metrics need to be tracked, (2) implementation of the concepts and measurements, and (3) intervention protocols for identified annotation quality issues.
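
One concrete metric a Consistency element like this might track is chance-corrected inter-annotator agreement. Below is a minimal Cohen's kappa sketch as an illustration; the ACC framework's actual measurements are not specified in the abstract.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences:
    observed agreement corrected for agreement expected by chance."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

A kappa near 1 indicates consistent annotators, while values near 0 mean agreement is no better than chance; monitoring it over time is one way to detect label quality drift.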

-
(Oral)   

Autism Spectrum Disorders (ASDs) are neurodevelopmental disorders that inhibit the linguistic, cognitive, communication and social skills of affected individuals. Currently, ASD is diagnosed by means of time-consuming and expensive screening tests. Hence, Machine Learning (ML) techniques have been applied to construct predictive models able to diagnose autism at early stages. However, the binary setting (ASD vs. not-ASD) and the unexciting performance reached by such models highlight the need for further de-identified datasets and for interdisciplinary work linking computer scientists and Subject Domain Experts (SDEs). In this work, we propose a novel dataset in which labels refer to the severity level of autism as required by the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) standard reference. We then analyze the quality of the resulting ML models (i.e. Random Forest, XGBoost, Neural Network) based not only on their performance metrics (i.e. precision, recall, F1) but also on the most important features they consider for classification and the similarity of those features to the ones suggested by the SDEs.

- 12:38 p.m.
(Oral)   

All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but face the challenge of requiring high-quality training data with both audio and annotations. In particular, they struggle on “golden utterances”, which are essential for defining and supporting features but may lack sufficient training data. In this paper, we compare two data-centric AI methods for improving performance on golden utterances: improving the annotation quality of existing training utterances and augmenting the training data with varying amounts of synthetic data. Our experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused both by inconsistent training data annotations and by lack of training data. This method improves the intent recognition error rate (IRER) on our golden-utterance test set by 93% relative to the baseline, without a negative impact on other test metrics.

- 12:46 p.m.
(Oral)   

We discuss how VMware is solving the following challenges to harness data to operate our ML-based anomaly detection system to detect performance issues in our Software Defined Data Center (SDDC) enterprise deployments: (i) label scarcity and label bias due to heavy dependency on unscalable human annotators, and (ii) data drifts due to ever-changing workload patterns, software stack and underlying hardware. Our anomaly detection system has been deployed in production for many years and has successfully detected numerous major performance issues. We demonstrate that by addressing these data challenges, we not only improve the accuracy of our performance anomaly detection model by 30%, but also ensure that the model performance never degrades over time.

- 1:07 p.m.
(Oral)

The growing prevalence of AI in industry and the dominance of a handful of model classes have contributed to a community-wide shift toward more data-centric AI. As an AI-driven biotech company unparalleled in the range and quantity of its biomedical data, nference has built the nferX platform housing many of our ML tools to facilitate both data-driven scientific discovery and healthcare AI product development. In this case study, we provide an overview of the nferX platform and its application in biomedical NLP, with an emphasis on methods for increasing labeling efficiency in our model development pipeline.

- 1:11 p.m.
(Oral)   

With the shift in trend toward data-centric AI, it is indispensable to provide properly labeled data to our DL models. We have built a prototype of a data-handling tool, or pipeline, which gives a high-level insight into a pool of unlabeled data and then classifies the data as noisy or clean, identifying the particular type of noise. The noisy images are segregated, denoised using deep learning models, and then fed back in alongside the remaining clean images. The type of noise handled is subject to the user and the dataset. The tool is developed with modularity in mind and can be scaled to other types of noise as well.

- 1:15 p.m.
(Oral)   

Data quality is central to Machine Learning (ML) applications but is in many cases not trivial to evaluate. Particular challenges include validity concerns of quality metrics with regard to ML tasks and data provenance, as well as problematic reproducibility of data quality assessments. In this paper we propose to intertwine all components of the ML pipeline into the quality assessment process to achieve a concept of fitness-for-use which has a clearly defined area of validity, is reproducible, and can potentially be transferred to other ML pipelines.

- 1:17 p.m.
(Oral)   

This paper presents a novel Vietnamese speech-based question answering system QA-CarManual that enables users to ask car-manual-related questions (e.g. how to properly operate devices and/or utilities within a car). Given a car manual written in Vietnamese as the main knowledge base, we develop QA-CarManual as a lightweight, real-time and interactive system that integrates state-of-the-art technologies in language and speech processing to (i) understand and interact with users via speech commands and (ii) automatically query the knowledge base and return answers in both forms of text and speech as well as visualization. To our best knowledge, QA-CarManual is the first Vietnamese question answering system that interacts with users via speech inputs and outputs. We perform a human evaluation to assess the quality of our QA-CarManual system and obtain promising results.

- 1:21 p.m.
(Oral)   

Graph neural networks (GNNs) have attracted much attention due to their ability to leverage the intrinsic geometries of the underlying data. Although many different types of GNN models have been developed, with many benchmarking procedures to demonstrate the superiority of one GNN model over the others, there is a lack of systematic understanding of the underlying benchmarking datasets, and what aspects of the model are being tested. Here, we provide a principled approach to taxonomize graph benchmarking datasets by carefully designing a collection of graph perturbations to probe the essential data characteristics that GNN models leverage to perform predictions. Our data-driven taxonomization of graph datasets provides a new understanding of critical dataset characteristics that will enable better model evaluation and the development of more specialized GNN models.
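One of the simplest perturbations in this spirit is random edge rewiring, which destroys graph structure while preserving edge count; if a GNN's accuracy survives the rewiring, it is likely not leveraging that structure. The sketch below is our illustration of the idea, not the paper's exact probe set.

```python
# Randomly rewire a fraction of edges: a probe for whether a model's
# predictions actually depend on the graph topology.
import random

def rewire_edges(edges, n_nodes, fraction, seed=0):
    rng = random.Random(seed)
    edges = list(edges)
    k = int(len(edges) * fraction)
    for i in rng.sample(range(len(edges)), k):
        # replace the chosen edge with a uniformly random one
        edges[i] = (rng.randrange(n_nodes), rng.randrange(n_nodes))
    return edges

original = [(0, 1), (1, 2), (2, 3), (3, 0)]
perturbed = rewire_edges(original, n_nodes=4, fraction=0.5)
```

Comparing model performance on `original` versus `perturbed` datasets helps taxonomize which data characteristics each benchmark actually tests.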

-
(Oral)   

Omics data are key to the understanding of life and to improving human health, but the contributions of AI in the field of multi-omics analysis are scarce when compared to single omics or medical imaging. We believe the major reason behind this is the lack of a standardized multi-omics data type. In this position paper, we introduce this problem, discuss some controversial aspects, and sketch a possible solution, as biomedical researchers have clearly realized that there can be no real precision medicine without truly integrated multi-omics analysis and are urgently calling for collaboration. Our proposed multi-omics data type would provide a standardized way of storing raw and preprocessed multi-omics data together with preprocessing methods, therefore greatly simplifying data analysis and facilitating the participation of AI practitioners.

-
(Oral)   

Vibration data is one of the most informative kinds of data for fault detection. It is mostly employed in the form of frequency response functions (FRFs) for training deep learners. However, since FRFs are normally measured at an excessive number of frequencies, their use not only demands large computational resources for training the deep learners, but can also hinder proper feature extraction. In this paper, we show how, given a predefined deep learning structure and its associated hyper-parameters, proper data selection and/or augmentation can improve the performance of the trained model in classifying the samples. For this purpose, the least absolute shrinkage and selection operator (LASSO) and some generative functions are utilized for data selection/reduction and augmentation, respectively, prior to any training. The efficacy of this procedure is illustrated by applying it to an experimental dataset created from the broadband vibrational responses of polycrystalline Nickel alloy first-stage turbine blades with different types and severities of damage. It is shown that the data selection and augmentation approach can improve the performance of the model to some extent while drastically reducing the computational time.
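The LASSO-based selection step can be sketched with a well-known simplification: when features are (approximately) orthonormal, the LASSO solution is a soft-thresholding of the least-squares coefficients, which zeroes out uninformative frequencies. This is our simplified illustration of the selection idea, not the paper's exact pipeline.

```python
# Soft-thresholding: the closed-form LASSO solution for orthonormal
# features. Coefficients within the penalty lam of zero are dropped.

def soft_threshold(beta, lam):
    return [0.0 if abs(b) <= lam else (b - lam if b > 0 else b + lam)
            for b in beta]

ols_coeffs = [0.9, 0.05, -0.4, 0.02]   # illustrative per-frequency coefficients
selected = soft_threshold(ols_coeffs, lam=0.1)
kept = [i for i, b in enumerate(selected) if b != 0.0]
```

Only the frequencies with surviving nonzero coefficients would be kept as inputs, shrinking the training data before the deep learner ever sees it.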

-
(Oral)   

By using public data about the structure of the internet, practitioners can identify what assets organizations own on the internet, many of which are vulnerable to cybersecurity attacks. With current knowledge of what servers and assets are exposed to the internet, organizations are able to remediate vulnerabilities before they are exploited. As part of managing an AI/ML system for this "internet asset attribution" task, we make extensive formal use of "data slices", subsets of data that share particular properties. Data slices make managing models and datasets more repeatable and sustainable. Data slice evaluation lets us systematically manage regressions, user trust, experiment evaluation, and data space characterization.
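Slice-based evaluation, as described, amounts to reporting metrics per subset rather than in aggregate, so a regression in one slice cannot hide behind a healthy overall number. Below is a minimal sketch; the slice names and toy predictor are ours.

```python
# Evaluate accuracy per data slice (a subset of examples sharing a property).

def evaluate_by_slice(examples, predict):
    totals, hits = {}, {}
    for x in examples:
        name = x["slice"]
        totals[name] = totals.get(name, 0) + 1
        hits[name] = hits.get(name, 0) + (predict(x) == x["label"])
    return {name: hits[name] / totals[name] for name in totals}

examples = [
    {"slice": "cloud_hosts", "label": 1, "feature": 1},
    {"slice": "cloud_hosts", "label": 0, "feature": 0},
    {"slice": "on_prem_hosts", "label": 1, "feature": 0},
]
report = evaluate_by_slice(examples, predict=lambda x: x["feature"])
```

Here the aggregate accuracy (2/3) would look acceptable, but the per-slice report reveals that the model fails entirely on one slice.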

-
(Oral)   

Financial services companies depend on peta-bytes of data to make decisions about investments, services and operations. Data-centric methods are needed to ensure the quality of the data used for ML model-based and other business process automation. This paper presents AutoDQ, an end-to-end data quality assurance framework to monitor production data quality and which leverages ML to identify and select validation constraints. AutoDQ introduces novel unit tests derived from the automatic extraction of data semantics and inter-column relationships, in addition to constraints based on predictability and statistical profiling of data. It operates on both tabular and time-series data without requiring schema or any metadata. The components of our framework have been tested over 100 public datasets as well as several internal transactional datasets.
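One of the simplest constraint types described, a statistical profile turned into a unit test, can be sketched as follows. The column values and slack parameter are our illustration, not AutoDQ's actual implementation.

```python
# Profile a numeric range from reference data, then use it as a data
# quality unit test against a new production batch.

def profile_range(values, slack=0.1):
    """Derive a padded min/max constraint from reference data."""
    lo, hi = min(values), max(values)
    pad = (hi - lo) * slack
    return {"min": lo - pad, "max": hi + pad}

def check_constraint(values, constraint):
    """Return values in a new batch that violate the profiled constraint."""
    return [v for v in values if not constraint["min"] <= v <= constraint["max"]]

reference = [10.0, 12.5, 11.0, 13.0]
constraint = profile_range(reference)
violations = check_constraint([11.5, 40.0], constraint)
```

Real systems of this kind layer many such derived tests (type, completeness, inter-column relationships) and flag batches whose violation counts spike.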

-
(Oral)   

The paper presents a study on combining model- and data-centric approaches to building a question answering system for inclusion of people with autism spectrum disorder. The study shows that applying sequentially model- and data-centric approaches might allow achieving higher metric scores on closed-domain low-resourced datasets.

-
(Oral)   

Data-centric AI is a new and exciting research topic in the AI community, but many organizations already build and maintain various "data-centric" applications whose goal is to produce high quality data. These range from traditional business data processing applications (e.g., "how much should we charge each of our customers this month?") to production ML systems such as recommendation engines. The fields of data and ML engineering have arisen in recent years to manage these applications, and both include many interesting novel tools and processes. In this paper, we discuss several lessons from data and ML engineering that could be interesting to apply in data-centric AI, based on our experience building data and ML platforms that serve thousands of applications at a range of organizations.

-
(Oral)   

MultiWOZ is one of the most popular multi-domain task-oriented dialog datasets, containing 10K+ annotated dialogs covering eight domains. It has been widely accepted as a benchmark for various dialog tasks. In this work, we identify an overlooked issue with dialog state annotation inconsistencies in the dataset, where a slot type is tagged inconsistently across similar dialogs, leading to confusion for DST modeling. We propose an automated correction for this issue, which is present in 70% of the dialogs. Additionally, we notice that there is significant entity bias in the dataset (e.g., “cambridge” appears in 50% of the destination cities in the train domain). The entity bias can potentially lead to named entity memorization in generative models, which may go unnoticed as the test set suffers from a similar entity bias as well. We release a new test set with all entities replaced with unseen entities. Finally, we benchmark joint goal accuracy (JGA) of the state-of-the-art DST baselines on these modified versions of the data. Our experiments show that the annotation inconsistency corrections lead to 7-10% improvement in JGA. On the other hand, we observe a 29% drop in JGA when models are evaluated on the new test set with unseen entities.
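The kind of entity bias described can be detected with a simple frequency check over a slot's values. The sketch below is our illustration; the destination values are invented for the example.

```python
# Measure entity bias in a slot: the share held by its most frequent value.
from collections import Counter

def top_entity_share(values):
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]
    return top, freq / len(values)

# toy slot values for the "destination" slot in the train domain
destinations = ["cambridge"] * 5 + ["london", "ely", "norwich", "ely", "london"]
entity, share = top_entity_share(destinations)
```

A share this high for one entity is a warning sign that a generative model may memorize the entity rather than learn the slot, which motivates building a test set with unseen entities.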

-
(Oral)   

Semantic segmentation is an important task for a wide range of applications from medical imaging to autonomous vehicles. However, the current state of the art requires large amounts of per-pixel annotated datasets which are costly and time consuming to curate. This paper presents Seg-Diff – an active learning method for estimating segmentation model uncertainty for unlabeled images. Our proposed method computes the difference in predictive uncertainty across saved training checkpoints, and uses these differences to compute a scalar ranking of uncertainty which can be visualized as an uncertainty heatmap. Using Seg-Diff to sample images for active learning, we consistently outperform random sampling on the Cityscapes dataset when measuring overall mean Intersection Over Union (mIOU).
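The core computation, a change in predictive uncertainty between saved checkpoints, can be sketched as follows. We use per-pixel entropy as the uncertainty measure, which is one common choice; the paper's exact formulation may differ, and the toy probabilities below are ours.

```python
# Score an unlabeled image by the change in per-pixel predictive
# uncertainty (entropy) between two training checkpoints.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def segdiff_score(ckpt_a_probs, ckpt_b_probs):
    """Mean absolute per-pixel entropy difference between checkpoints."""
    diffs = [abs(entropy(pa) - entropy(pb))
             for pa, pb in zip(ckpt_a_probs, ckpt_b_probs)]
    return sum(diffs) / len(diffs)

# two pixels, two classes; checkpoint B became more confident on pixel 1 only
ckpt_a = [[0.5, 0.5], [0.6, 0.4]]
ckpt_b = [[0.9, 0.1], [0.6, 0.4]]
score = segdiff_score(ckpt_a, ckpt_b)
```

Images with the highest scores, where the model's uncertainty is still moving between checkpoints, are the ones sampled for annotation.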

-
(Oral)   

AutoML (automated machine learning) has been extensively developed in the past few years for the model-centric approach. As for the data-centric approach, the processes to improve the dataset, such as fixing incorrect labels, adding examples that represent edge cases, and applying data augmentation, are still very artisanal and expensive. Here we develop an automated data-centric tool (AutoDC), similar in purpose to AutoML, that aims to speed up the dataset improvement processes. In our preliminary tests on 3 open-source image classification datasets, AutoDC is estimated to reduce the manual time for data improvement tasks by 80% while improving model accuracy by 10-15% with the fixed ML code.

-
(Oral)   

Training accurate intent classifiers requires labeled data, which can be costly to obtain. Data augmentation methods may ameliorate this issue, but the quality of the generated data varies significantly across techniques. We study the process of systematically producing pseudo-labeled data given a small seed set using a wide variety of data augmentation techniques, including mixing methods together. We find that while certain methods dramatically improve qualitative and quantitative performance, other methods have minimal or even negative impact. We also analyze key considerations when implementing data augmentation methods in production.

-
(Oral)   

The growing popularity of remote fitness has increased the demand for highly accurate computer vision models that track human poses. However, the best methods still fail in many real-world fitness scenarios, suggesting that there is a domain gap between current datasets and real-world fitness data. To enable the field to address fitness-specific vision problems, we created InfiniteForm -- an open-source synthetic dataset of 60k images with diverse fitness poses (15 categories), both single- and multi-person scenes, and realistic variation in lighting, camera angles, and occlusions. As a synthetic dataset, InfiniteForm offers minimal bias in body shape and skin tone, and provides pixel-perfect labels for standard annotations like 2D keypoints, as well as those that are difficult or impossible for humans to produce like depth and occlusion. In addition, we introduce a novel generative procedure for creating diverse synthetic poses from predefined exercise categories. This generative process can be extended to any application where pose diversity is needed to train robust computer vision models.

-
(Oral)

Labeled "ground truth" datasets are routinely used to evaluate and audit AI algorithms applied in high-stakes settings. However, there are no widely accepted benchmarks for the quality of labels in these datasets. We provide empirical evidence that label quality can significantly distort the results of algorithmic audits in real-world settings. Using data annotators typically hired by AI firms in India, we show that deficiencies in the fidelity of the ground-truth data can lead to spurious differences in the measured performance of ASRs between urban and rural populations. After a rigorous, albeit expensive, label cleaning process, these disparities between groups disappear. Our findings highlight how trade-offs between label quality and data annotation costs can complicate algorithmic audits in practice. They also emphasize the need for the development of consensus-driven, widely accepted benchmarks for label quality.

-
(Oral)   

Data can naturally be modeled using topological terms. Indeed, the field of topological data analysis relies fundamentally on the idea that the shape of data carries important invariants that can be utilized to study and analyze the underlying data. In this article, we define a topological framework of data in the context of a supervised machine learning problem. Using this topological setting, we prove that in order to achieve a successful classification task, the neural networks architecture must not be chosen independently without considering the nature of data.

-
(Oral)   

The quality of the underlying training data is crucial for building performant machine learning models with wider generalizability. However, current machine learning (ML) tools lack streamlined processes for improving data quality, so getting data quality insights and iteratively pruning the errors to obtain a dataset most representative of downstream use cases is still an ad-hoc manual process. Our work addresses this data tooling gap, required to build improved ML workflows purely through data-centric techniques. More specifically, we introduce a systematic framework for finding noisy or mislabelled samples in a dataset and for identifying the most informative samples which, when included in training, would provide the maximal lift in model performance. We demonstrate the efficacy of our framework on public as well as private enterprise datasets of two Fortune 500 companies, and are confident this work will form the basis for ML teams to perform more intelligent data discovery and pruning.
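A common building block for finding possibly mislabeled samples, sketched below, is to compare each example's given label against a model's (ideally out-of-sample) predicted probability for it. This is our illustration of the general technique, not the framework's specific method; the threshold and sample data are invented.

```python
# Flag examples whose model probability for their given label is low,
# a standard heuristic for surfacing label-noise candidates for review.

def flag_noisy(samples, threshold=0.2):
    return [i for i, s in enumerate(samples)
            if s["probs"][s["label"]] < threshold]

samples = [
    {"label": 0, "probs": [0.95, 0.05]},   # model agrees with the label
    {"label": 1, "probs": [0.90, 0.10]},   # model strongly disagrees: suspect
    {"label": 1, "probs": [0.30, 0.70]},   # model agrees with the label
]
suspects = flag_noisy(samples)
```

Flagged indices would then go to human review or automated relabeling rather than being silently dropped.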

-
(Oral)   

To deal with the unimaginable continual growth of data and the focus on its use rather than its governance, the value of data has begun to deteriorate, as seen in a lack of reproducibility, validity, provenance, etc. In this work, we aim simply to understand the value of data and how this basic understanding might affect existing AI algorithms, in particular EM-T (traditional expectation maximization, used in soft clustering) and EM* (a data-centric extension of EM-T). We have discovered that the value of data--or its "expressiveness", as we call it--is procedurally determined and runs the gamut from low expressiveness (LE) to high expressiveness (HE), the former not affecting the objective function much, the latter a great deal. By using balanced binary search trees (BSTs) (complete orders) introduced here, we have improved on our earlier work that utilized heaps (partial orders) to separate LE from HE data. EM-DC (expectation maximization-data centric) significantly improves the performance of EM-T on big data. EM-DC is an improvement over EM* that allows more efficient identification of LE/HE data and its placement in the BST. Outcomes of this work, aside from a significant reduction in run-time over EM while maintaining EM-T accuracy, include being able to isolate noisy data, convergence on data structures (using Hamming distance) rather than real values, and the ability for the user to dictate the relative mixture of LE/HE data acceptable for the run. The Python code and links to the data sets are provided in the paper. We have additionally released an R version (https://cran.r-project.org/web/packages/DCEM/index.html) that includes EM-T, EM, and k++ initialization.

-
(Oral)   

The Semantic Web community takes pride in developing technologies to make data significantly more valuable, interpretable, and computationally friendly by annotating the data appropriately with community-accepted vocabularies. For example, the core semantic web technology, the resource description framework, is designed to describe resources in a machine-readable way, and standardized ontologies, such as the recommended standard for capturing provenance on the web--PROV, were designed to provide lineage of data. This position paper argues that such semantic technologies could improve the quality of the data used in machine learning models, increase their accuracy, and make them more transparent and interpretable. We further argue that "a framework for excellence in data engineering" as put forth by the Data-centric AI workshop proposal (https://eval.how/dcai2021), has existed since the early 2000s, but there is a need for the Machine Learning and Semantic Web communities to collaborate to realize its full potential.

-
(Oral)   

Several techniques have been proposed to address the problem of recognizing activities of daily living from signals. Deep learning techniques applied to inertial signals have proven effective, achieving significant classification accuracy. Recently, research on human activity recognition (HAR) models has been almost entirely model-centric. Experiments have shown that feeding in high-quality training data allows deep learning models both to perform well independently of their architecture and to be more robust to intraclass variability and interclass similarity. Unfortunately, the data in the available datasets are not always of high quality. Moreover, the performance of a deep learning algorithm is proportional to the size of the dataset used to generate it, and the publicly available datasets are mostly small in terms of number of subjects and/or types of activities performed. Finally, the datasets are heterogeneous and therefore cannot be trivially combined to obtain a larger set.

The final aim of our work is the definition and implementation of a platform that integrates datasets of inertial signals in order to make available to the scientific community large datasets of homogeneous signals, enriched, when possible, with context information (e.g., characteristics of the subjects and device position). The main focus of our platform is to emphasise data quality, which is essential for training efficient models.

-
(Oral)   

This paper sustains the position that the time has come for thinking of learning machines that conquer visual skills in a truly human-like context, where a few human-like object supervisions are given by vocal interactions and pointing aids only. This likely requires new foundations on computational processes of vision with the final purpose of involving machines in tasks of visual description by living in their own visual environment under simple man-machine linguistic interactions. The challenge consists of developing machines that learn to see without needing to handle visual databases. This might open the doors to a truly orthogonal competitive track concerning deep learning technologies for vision which does not rely on the accumulation of huge visual databases.

-
(Oral)   

Visual identification of objects using cameras requires precise detection, localization, and recognition of the objects in the field of view. The visual identification problem is challenging when the objects look identical and features between distinct entities are indistinguishable, even with state-of-the-art computer vision techniques. The problem becomes significantly harder when the objects themselves do not carry rich geometric and photometric features. To address this issue, we design and evaluate a novel visual sensing system that uses optical beacons (in this case, LEDs) to promptly locate each of the tightly spaced objects and track them across scenes. Such techniques can help create datasets that are huge in volume and precisely annotated, enabling deep learning models to perform better on tasks such as localization and segmentation. One use case we have experimented with is the localization of LEDs using a classical communication algorithm to act as an automated annotation tool; to verify it, we have also created a dataset containing 11000 images and micro-benchmarked the task of localization against the state-of-the-art (SOTA) object detection model YOLO v3.

-
(Oral)   

This work intends to provide an argument in favor of biased data and biased systems. It is argued through examples and situations that the use of bias can be beneficial to an ethical, lawful, and desirable result. Moreover, an argument is made that the search to eliminate bias at all costs can be counterproductive as well. We start by examining bias in our data and how it can be used for good, followed by bias in our systems and how it can positively impact our lives. Possible problems and abuses when adopting this mindset are also discussed at the end.

-
(Oral)   

The recent discussion of data-centric artificial intelligence (DCAI) has galvanized researchers and practitioners to elevate data quality and dataset iteration practices to the level of importance given to model iteration on fixed datasets. Some DCAI techniques successfully increase training data quality but at the expense of the number of training examples. Meanwhile, production AI systems are being increasingly deployed in new settings producing even more inference data. Dataset profiling techniques provide systematic ways of transferring important characteristics and data examples from large, real-time inference data sources to the smaller datasets used for training--delivering higher quality data without sacrificing scalability.

-
(Oral)   

Predicting lexical-semantic relations between word pairs has successfully been accomplished by pre-trained neural language models. An XLM-R-based approach, for instance, achieved the best performance differentiating between hypernymy, synonymy, antonymy, and random relations in four languages in the CogALex-VI 2020 shared task. However, the results also revealed strong performance divergences between languages and confusions of very specific relations, especially hypernymy and synonymy. Upon manual inspection, a difference in data quality across languages and relations could be observed. We propose an improved dataset for lexical-semantic relation prediction and present an evaluation and analysis of its impact across three pre-trained neural language models, including transfer learning.

Author Information

Andrew Ng (DeepLearning.AI)
Lora Aroyo (Google Research)

I am a research scientist at Google Research, where I work on Data Excellence, specifically focusing on metrics and strategies to measure the quality of human-labeled data in a reliable and transparent way. I received an MSc in Computer Science from Sofia University, Bulgaria, and a PhD from the University of Twente, The Netherlands. Prior to joining Google, I was a computer science professor heading the User-Centric Data Science research group at the VU University Amsterdam. Our team invented the CrowdTruth crowdsourcing method and applied it in various domains such as digital humanities, medicine and online multimedia. I guided the human-in-the-loop strategies as Chief Scientist at Tagasauris, a NY-based startup. Currently I am also president of the User Modeling Society. For a list of my publications, please see my profile on Google Scholar.

Greg Diamos (Landing AI)
Cody Coleman (Stanford University)

Cody is a computer science Ph.D. candidate at Stanford University, is advised by Professors Matei Zaharia and Peter Bailis and is supported by a National Science Foundation Fellowship. As a member of the Stanford DAWN Project, Cody’s research is focused on democratizing machine learning through tools and infrastructure that enable more than the most well-funded teams to create innovative and impactful systems; this includes reducing the cost of producing state-of-the-art models and creating novel abstractions that simplify machine learning development and deployment. Prior to joining Stanford, he completed his B.S. and M.Eng. in electrical engineering and computer science at the Massachusetts Institute of Technology.

Vijay Janapa Reddi (Harvard University)
Joaquin Vanschoren (Eindhoven University of Technology)

Joaquin Vanschoren is an Assistant Professor in Machine Learning at the Eindhoven University of Technology. He holds a PhD from the Katholieke Universiteit Leuven, Belgium. His research focuses on meta-learning and on understanding and automating machine learning. He founded and leads OpenML.org, a popular open science platform that facilitates the sharing and reuse of reproducible empirical machine learning data. He has obtained several demo and application awards and has been an invited speaker at ECDA, StatComp, IDA, AutoML@ICML, CiML@NIPS, AutoML@PRICAI, MLOSS@NIPS, and many other occasions, as well as a tutorial speaker at NIPS and ECMLPKDD. He was general chair at LION 2016, program chair of Discovery Science 2018, demo chair at ECMLPKDD 2013, and co-organizes the AutoML and meta-learning workshop series at NIPS 2018, ICML 2016-2018, ECMLPKDD 2012-2015, and ECAI 2012-2014. He is also editor of and contributor to the book 'Automatic Machine Learning: Methods, Systems, Challenges'.

Carole-Jean Wu (Facebook AI Research)
Sharon Zhou (Stanford University)
Lynn He (DeepLearning.AI)
