`

Timezone: »

Datasets and Benchmarks
Dataset and Benchmark Poster Session 1
Joaquin Vanschoren · Serena Yeung

Tue Dec 07 08:30 AM -- 10:00 AM (PST) @

The Datasets and Benchmarks track serves as a novel venue for high-quality publications, talks, and posters on highly valuable machine learning datasets and benchmarks, as well as a forum for discussions on how to improve dataset development. Datasets and benchmarks are crucial for the development of machine learning methods, but also require their own publishing and reviewing guidelines. For instance, datasets can often not be reviewed in a double-blind fashion, and hence full anonymization will not be required. On the other hand, they do require additional specific checks, such as a proper description of how the data was collected, whether they show intrinsic bias, and whether they will remain accessible.

 - Programming Puzzles (Poster) []  []  [ Paper ]  link »    We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program $f$, and the goal is to find an input which makes $f$ return True. The puzzles are objective in that each one is specified entirely by the source code of its verifier $f$, so evaluating $f$ is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems, to classic programming puzzles (e.g., Tower of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). We develop baseline enumerative program synthesis, GPT-3 and Codex solvers that are capable of solving puzzles---even without access to any reference solutions---by learning from their own past solutions. Codex performs best, solving up to 18% of 397 test problems with a single try and 80% of the problems with 1,000 tries per problem. In a small user study, we find a positive correlation between puzzle-solving performance and coding experience, and between the puzzle difficulty for humans and AI solvers. Therefore, further improvements on P3 could have a significant impact on many program synthesis areas. Link » Tal Schuster · Ashwin Kalyan · Alex Polozov · Adam Kalai 🔗 - FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information (Poster) []  [ Paper ]    Fact verification has attracted a lot of attention in the machine learning and natural language processing communities, as it is one of the key methods for detecting misinformation. Existing large-scale benchmarks for this task have focused mostly on textual sources, i.e. unstructured information, and thus ignored the wealth of information available in structured formats, such as tables. In this paper we introduce a novel dataset and benchmark, Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS), which consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict. Furthermore, we detail our efforts to track and minimize the biases present in the dataset and could be exploited by models, e.g. being able to predict the label without using evidence. Finally, we develop a baseline for verifying claims against text and tables which predicts both the correct evidence and verdict for 18% of the claims. Rami Aly · Zhijiang Guo · Michael Schlichtkrull · James Thorne · Andreas Vlachos · Christos Christodoulopoulos · Oana Cocarascu · Arpit Mittal 🔗 - BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling (Poster) []  [ Paper ]    Task-oriented dialogue (ToD) benchmarks provide an important avenue to measure progress and develop better conversational agents. However, existing datasets for end-to-end ToD modeling are limited to a single language, hindering the development of robust end-to-end ToD systems for multilingual countries and regions. Here we introduce BiToD, the first bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling. BiToD contains over 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base. It serves as an effective benchmark for evaluating bilingual ToD systems and cross-lingual transfer learning approaches. We provide state-of-the-art baselines under three evaluation settings (monolingual, bilingual, and cross-lingual). The analysis of our baselines in different settings highlights 1) the effectiveness of training a bilingual ToD system comparing to two independent monolingual ToD systems, and 2) the potential of leveraging a bilingual knowledge base and cross-lingual transfer learning to improve the system performance in the low resource condition. Zhaojiang Lin · Andrea Madotto · Genta Winata · Peng Xu · Feijun Jiang · Yuxiang Hu · Chen Shi · Pascale N Fung 🔗 - Towards a robust experimental framework and benchmark for lifelong language learning (Poster) []  []  []  [ Paper ]  link »    In lifelong learning, a model learns different tasks sequentially throughout its lifetime. State-of-the-art deep learning models, however, struggle to generalize in this setting and suffer from catastrophic forgetting of old tasks when learning new ones. While a number of approaches have been developed in an attempt to ameliorate this problem, there are no established, unified or generalized frameworks for rigorous evaluations of proposed solutions; a problem which is particularly pronounced in the domain of NLP. The few existing benchmarks are typically limited to a specific flavor of lifelong learning -- continual open-set classification -- where new classes, as opposed to tasks, are learned incrementally. Moreover, the only general lifelong learning benchmark combines a multi-label classification setup with a multi-class classification setup resulting in misleading gradients during training. We empirically demonstrate that the catastrophic forgetting observed here can be attributed to the experimental design rather than to any inherent modeling limitations. To address these issues, we propose an experimental framework for true, general lifelong learning in NLP. Using this framework, we develop a comprehensive suite of benchmarks that target different properties of lifelong learning (e.g., forgetting or intransigence); experiment with diverse facets of language learning: multi-domain, multilingual and different levels of linguistic hierarchy; and present a continuous evaluation scheme under a new metric: Area Under the Lifelong Test Curve. Our framework reveals shortcomings of prevalent memory-based solutions, demonstrating they are unable to outperform a simple experience replay baseline under the realistic lifelong learning setup. Link » Aman Hussain · Nithin Holla · Pushkar Mishra · Helen Yannakoudakis · Ekaterina Shutova 🔗 - The People’s Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage (Poster) []  [ Paper ]    The People’s Speech is a free-to-download 31,400-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA. The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache2.0 license. We show that a model trained on this dataset achieves a 32.17% word error rate on Librispeech’s test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpora and plans for continued maintenance of the project under MLCommons’s sponsorship. Daniel Galvez · Greg Diamos · Juan Torres · Juan Cerón · Keith Achorn · Anjali Gopi · David Kanter · Max Lam · Mark Mazumder · Vijay Janapa Reddi 🔗 - CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription (Poster) []  [ Paper ]    Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life. In simple problems such as image classification, crowdsourcing has become one of the standard tools for cheap and time-efficient data collection: thanks in large part to advances in research on aggregation methods. However, the applicability of crowdsourcing to more complex tasks (e.g., speech recognition) remains limited due to the lack of principled aggregation methods for these modalities. The main obstacle towards designing aggregation methods for more advanced applications is the absence of training data, and in this work, we focus on bridging this gap in speech recognition. For this, we collect and release CrowdSpeech --- the first publicly available large-scale dataset of crowdsourced audio transcriptions. Evaluation of existing and novel aggregation methods on our data shows room for improvement, suggesting that our work may entail the design of better algorithms. At a higher level, we also contribute to the more general challenge of developing the methodology for reliable data collection via crowdsourcing. In that, we design a principled pipeline for constructing datasets of crowdsourced audio transcriptions in any novel domain. We show its applicability on an under-resourced language by constructing VoxDIY --- a counterpart of CrowdSpeech for the Russian language. We also release the code that allows a full replication of our data collection pipeline and share various insights on best practices of data collection via crowdsourcing. Nikita Pavlichenko · Ivan Stelmakh · Dmitry Ustalov 🔗 - ReaSCAN: Compositional Reasoning in Language Grounding (Poster) []  [ Paper ]    The ability to compositionally map language to referents, relations, and actions is an essential component of language understanding. The recent gSCAN dataset (Ruis et al. 2020, NeurIPS) is an inspiring attempt to assess the capacity of models to learn this kind of grounding in scenarios involving navigational instructions. However, we show that gSCAN's highly constrained design means that it does not require compositional interpretation and that many details of its instructions and scenarios are not required for task success. To address these limitations, we propose ReaSCAN, a benchmark dataset that builds off gSCAN but requires compositional language interpretation and reasoning about entities and relations. We assess two models on ReaSCAN: a multi-modal baseline and a state-of-the-art graph convolutional neural model. These experiments show that ReaSCAN is substantially harder than gSCAN for both neural architectures. This suggests that ReaSCAN can serve as a valuable benchmark for advancing our understanding of models' compositional generalization and reasoning capabilities. Zhengxuan Wu · Elisa Kreiss · Desmond Ong · Christopher Potts 🔗 - CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation (Poster) []  [ Paper ]    Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems. Shuai Lu · Daya Guo · Shuo Ren · Junjie Huang · Alexey Svyatkovskiy · Ambrosio Blanco · Colin Clement · Dawn Drain · Daxin Jiang · Duyu Tang · Ge Li · Lidong Zhou · Linjun Shou · Long Zhou · Michele Tufano · MING GONG · Ming Zhou · Nan Duan · Neel Sundaresan · Shao Kun Deng · Shengyu Fu · Shujie LIU 🔗 - Variance-Aware Machine Translation Test Sets (Poster) []  [ Paper ]    We release 70 small and discriminative test sets for machine translation (MT) evaluation called variance-aware test sets (VAT), covering 35 translation directions from WMT16 to WMT20 competitions. VAT is automatically created by a novel variance-aware filtering method that filters the indiscriminative test instances of the current MT benchmark without any human labor. Experimental results show that VAT outperforms the original WMT benchmark in terms of the correlation with human judgment across mainstream language pairs and test sets. Further analysis on the properties of VAT reveals the challenging linguistic features (e.g., translation of low-frequency words and proper nouns) for the competitive MT systems, providing guidance for constructing future MT test sets. The test sets and the code for preparing variance-aware MT test sets are freely available at https://github.com/NLP2CT/Variance-Aware-MT-Test-Sets. Runzhe Zhan · Xuebo Liu · Derek Wong · Lidia Chao 🔗 - Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers (Poster) []  [ Paper ] This paper introduces Timers and Such, a new open source dataset of spoken English commands for common voice control use cases involving numbers. We describe the gap in existing spoken language understanding datasets that Timers and Such fills, the design and creation of the dataset, and experiments with a number of ASR-based and end-to-end baseline models, the code for which has been made available as part of the SpeechBrain toolkit. Loren Lugosch · Piyush Papreja · Mirco Ravanelli · Abdelwahab HEBA · Titouan Parcollet 🔗 - LiRo: Benchmark and leaderboard for Romanian language tasks (Poster) []  [ Paper ]    Recent advances in NLP have been sustained by the availability of large amounts of data and standardized benchmarks, which are not available for many languages. As a small step towards addressing this we propose LiRo, a platform for benchmarking models on the Romanian language on nine standard tasks: text classification, named entity recognition, machine translation, sentiment analysis, POS tagging, dependency parsing, language modelling, question-answering, and semantic textual similarity. We also include a less standard task of embedding debiasing, to address the growing concerns related to gender bias in language models. The platform exposes per-task leaderboards populated with baseline results for each task. In addition, we create three new datasets: one from Romanian Wikipedia and two by translating the Semantic Textual Similarity (STS) benchmark and the Cross-lingual Question Answering Dataset (XQuAD) into Romanian. We believe LiRo will not only add to the growing body of benchmarks covering various languages, but can also enable multi-lingual research by augmenting parallel corpora, and hence is of interest for the wider NLP community. LiRo is available at https://lirobenchmark.github.io/. Stefan Dumitrescu · Petru Rebeja · Beata Lorincz · Mihaela Gaman · Andrei Avram · Mihai Ilie · Andrei Pruteanu · Adriana Stan · Lorena Rosia · Cristina Iacobescu · Luciana Morogan · George Dima · Gabriel Marchidan · Traian Rebedea · Madalina Chitez · Dani Yogatama · Sebastian Ruder · Radu Tudor Ionescu · Razvan Pascanu · Viorica Patraucean 🔗 - A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning (Poster) []  [ Paper ]    Grounded language acquisition is a major area of research combining aspects of natural language processing, computer vision, and signal processing, compounded by domain issues requiring sample efficiency and other deployment constraints. In this work, we present a multimodal dataset of RGB+depth objects with spoken as well as textual descriptions. We analyze the differences between the two types of descriptive language and our experiments demonstrate that the different modalities affect learning. This will enable researchers studying the intersection of robotics, NLP, and HCI to better investigate how the multiple modalities of image, depth, text, speech, and transcription interact, as well as how differences in the vernacular of these modalities impact results. Gaoussou Kebe · Padraig Higgins · Patrick Jenkins · Kasra Darvish · Rishabh Sachdeva · Ryan Barron · John Winder · Donald Engel · Edward Raff · Francis Ferraro · Cynthia Matuszek 🔗 - NaturalProofs: Mathematical Theorem Proving in Natural Language (Poster) []  [ Paper ]    Understanding and creating mathematics using natural mathematical language - the mixture of symbolic and natural language used by humans - is a challenging and important problem for driving progress in machine learning. As a step in this direction, we develop NaturalProofs, a multi-domain corpus of mathematical statements and their proofs, written in natural mathematical language. NaturalProofs unifies broad coverage, deep coverage, and low-resource mathematical sources, allowing for evaluating both in-distribution and zero-shot generalization. Using NaturalProofs, we benchmark strong neural methods on mathematical reference retrieval and generation tasks which test a system's ability to determine key results that appear in a proof. Large-scale sequence models show promise compared to classical information retrieval methods, yet their performance and out-of-domain generalization leave substantial room for improvement. NaturalProofs opens many avenues for research on challenging mathematical tasks. Sean Welleck · Jiacheng Liu · Ronan Le Bras · Hanna Hajishirzi · Yejin Choi · Kyunghyun Cho 🔗 - CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review (Poster) []  [ Paper ]    Many specialized domains remain untouched by deep learning, as large labeled datasets require expensive expert annotators. We address this bottleneck within the legal domain by introducing the Contract Understanding Atticus Dataset (CUAD), a new dataset for legal contract review. CUAD was created with dozens of legal experts from The Atticus Project and consists of over 13,000 annotations. The task is to highlight salient portions of a contract that are important for a human to review. We find that Transformer models have nascent performance, but that this performance is strongly influenced by model design and training dataset size. Despite these promising results, there is still substantial room for improvement. As one of the only large, specialized NLP benchmarks annotated by experts, CUAD can serve as a challenging research benchmark for the broader NLP community. Dan Hendrycks · Collin Burns · Anya Chen · Spencer Ball 🔗 - CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks (Poster) []  [ Paper ]    Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled breakthroughs in computer vision, speech recognition, natural language processing and beyond, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of “AI for Code” has garnered new interest and gathered momentum. In this paper, we present a large-scale dataset \textit{CodeNet}, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to benchmark and help accelerate research in AI techniques for a variety of critical coding tasks, including code similarity and classification, code translation between a large variety of programming languages, and code performance (runtime and memory) improvement techniques. Additionally, CodeNet provides sample input and output test sets for 98.5\% of the code samples, which can be used as an oracle for determining code correctness and potentially guide reinforcement learning for code quality improvements. As a usability feature, we provide several pre-processing tools in CodeNet to transform source code into representations that can be readily used as inputs into machine learning models. Results of code classification and code similarity experiments using the CodeNet dataset are provided as a reference. We hope that the scale, diversity and rich, high-quality annotations of CodeNet will offer unprecedented research opportunities at the intersection of AI and Software Engineering. Ruchir Puri · David Kung · Geert Janssen · Wei Zhang · Giacomo Domeniconi · Vladimir Zolotov · Julian T Dolby · Jie Chen · Mihir Choudhury · Lindsey Decker · Veronika Thost · Luca Buratti · Saurabh Pujar · Shyam Ramji · Ulrich Finkler · Susan Malaika · Frederick Reiss 🔗 - DUE: End-to-End Document Understanding Benchmark (Poster) []  []  [ Paper ]  link »    Understanding documents with rich layouts plays a vital role in digitization and hyper-automation but remains a challenging topic in the NLP research community. Additionally, the lack of a commonly accepted benchmark made it difficult to quantify progress in the domain. To empower research in this field, we introduce the Document Understanding Evaluation (DUE) benchmark consisting of both available and reformulated datasets to measure the end-to-end capabilities of systems in real-world scenarios.The benchmark includes Visual Question Answering, Key Information Extraction, and Machine Reading Comprehension tasks over various document domains and layouts featuring tables, graphs, lists, and infographics. In addition, the current study reports systematic baselines and analyzes challenges in currently available datasets using recent advances in layout-aware language modeling. We open both the benchmarks and reference implementations and make them available at https://duebenchmark.com and https://github.com/due-benchmark. Link » Łukasz Borchmann · Michał Pietruszka · Tomasz Stanislawek · Dawid Jurkiewicz · Michał Turski · Karolina Szyndler · Filip Graliński 🔗 - COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening (Poster) []  [ Paper ]    Audio signals are widely recognised as powerful indicators of overall health status, and there has been increasing interest in leveraging sound for affordable COVID-19 screening through machine learning. However, there has also been scepticism regarding the initial efforts, due to perhaps the lack of reproducibility, large datasets and transparency which unfortunately is often an issue with machine learning for health. To facilitate the advancement and openness of audio-based machine learning for respiratory health, we release a dataset consisting of 53,449 audio samples (over 552 hours in total) crowd-sourced from 36,116 participants through our COVID-19 Sounds app. Given its scale, this dataset is comprehensive in terms of demographics and spectrum of health conditions. It also provides participants' self-reported COVID-19 testing status with 2,106 samples tested positive. To the best of our knowledge, COVID-19 Sounds is the largest multi-modal dataset of COVID-19 respiratory sounds: it consists of three modalities including breathing, cough, and voice recordings. Additionally, in this paper, we report on several benchmarks for two principal research tasks: respiratory symptoms prediction and COVID-19 prediction. For these tasks we demonstrate performance with a ROC-AUC of over 0.7, confirming both the promise of machine learning approaches based on these types of datasets as well as the usability of our data for such tasks. We describe a realistic experimental setting that hopes to pave the way to a fair performance evaluation of future models. In addition, we reflect on how the released dataset can help to scale some existing studies and enable new research directions, which inspire and benefit a wide range of future works. Tong Xia · Dimitris Spathis · Chlo{\"e} Brown · J Ch · Andreas Grammenos · Jing Han · Apinan Hasthanasombat · Erika Bondareva · Ting Dang · Andres Floto · Pietro Cicuta · Cecilia Mascolo 🔗 - WaveFake: A Data Set to Facilitate Audio Deepfake Detection (Poster) []  [ Paper ]    Deep generative modeling has the potential to cause significant harm to society. Recognizing this threat, a magnitude of research into detecting so-called "Deepfakes'' has emerged. This research most often focuses on the image domain, while studies exploring generated audio signals have---so-far---been neglected. In this paper we make three key contributions to narrow this gap. First, we provide researchers with an introduction to common signal processing techniques used for analyzing audio signals. Second, we present a novel data set, for which we collected nine sample sets from five different network architectures, spanning two languages. Finally, we supply practitioners with two baseline models, adopted from the signal processing community, to facilitate further research in this area. Joel Frank · Lea Schönherr 🔗 - $\texttt{RP-Mod}\ \&\ \texttt{RP-Crowd:}$ Moderator- and Crowd-Annotated German News Comment Datasets (Poster) []  [ Paper ]    Abuse and hate are penetrating social media and many comment sections of news media companies. To prevent losing readers who get appalled by inappropriate texts, these platform providers invest considerable efforts to moderate user-generated contributions. This is further enforced by legislative actions, which make non-clearance of these comments a punishable action. While (semi-)automated solutions using Natural Language Processing and advanced Machine Learning techniques are getting increasingly sophisticated, the domain of abusive language detection still struggles as large non-English and well-curated datasets are scarce or not publicly available. With this work, we publish and analyse the largest annotated German abusive language comment datasets to date. In contrast to existing datasets, we achieve a high labeling standard by conducting a thorough crowd-based annotation study that complements professional moderators' decisions, which are also included in the dataset. We compare and cross-evaluate the performance of baseline algorithms and state-of-the-art transformer-based language models, which are fine-tuned on our datasets and an existing alternative, showing the usefulness for the community. Dennis Assenmacher · Marco Niemann · Kilian Müller · Moritz Seiler · Dennis Riehle · Heike Trautmann 🔗 - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models (Poster) []  [ Paper ]    Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our ﬁndings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful ﬁltering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io. Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li 🔗 - Multilingual Spoken Words Corpus (Poster) []  [ Paper ]    Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. We generate this dataset by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset. We provide a detailed analysis of the contents of the data and contribute methods for detecting potential outliers. We report baseline accuracy metrics on keyword spotting models trained from our dataset compared to models trained on a manually-recorded keyword dataset. We conclude with our plans for dataset maintenance, updates, and open-sourced code. Mark Mazumder · Sharad Chitlangia · Colby Banbury · Yiping Kang · Juan Ciro · Keith Achorn · Daniel Galvez · Mark Sabini · Peter Mattson · David Kanter · Greg Diamos · Pete Warden · Josh Meyer · Vijay Janapa Reddi 🔗 - Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension (Poster) []  [ Paper ]    We present Native Chinese Reader (NCR), a new machine reading comprehension MRC) dataset with particularly long articles in both modern and classical Chinese. NCR is collected from the exam questions for the Chinese course in China’s high schools, which are designed to evaluate the language proficiency of native Chinese youth. Existing Chinese MRC datasets are either domain-specific or focusing on short contexts of a few hundred characters in modern Chinese only. By contrast, NCR contains 8390 documents with an average length of 1024 characters covering a wide range of Chinese writing styles, including modern articles, classical literature and classical poetry. A total of 20477 questions on these documents also require strong reasoning abilities and common sense to figure out the correct answers. We implemented multiple baseline models using popular Chinese pre-trained models and additionally launched an online competition using our dataset to examine the limit of current methods. The best model achieves 59% test accuracy while human evaluation shows an average accuracy of 79%, which indicates a significant performance gap between current MRC models and native Chinese speakers. Shusheng Xu · Yichen Liu · Xiaoyu Yi · Siyuan Zhou · Huizi Li · Yi Wu 🔗 - Measuring Coding Challenge Competence With APPS (Poster) []  [ Paper ]    While programming is one of the most broadly applicable skills in modern society, it is unclear how well state-of-the-art machine learning models can write code. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to assess code generation performance in an accurate and rigorous manner. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an objective measure for tracking advancements. Dan Hendrycks · Steven Basart · Saurav Kadavath · Mantas Mazeika · Akul Arora · Ethan Guo · Collin Burns · Samir Puranik · Horace He · Dawn Song · Jacob Steinhardt 🔗 - NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation (Poster) []  [ Paper ]    Slot-filling and intent detection are the backbone of conversational agents such as voice assistants, and are active areas of research. Even though state-of-the-art techniques on publicly available benchmarks show impressive performance, their ability to generalize to realistic scenarios is yet to be demonstrated. In this work, we present NATURE, a set of simple spoken-language-oriented transformations, applied to the evaluation set of datasets, to introduce human spoken language variations while preserving the semantics of an utterance. We apply NATURE to common slot-filling and intent detection benchmarks and demonstrate that simple perturbations from the standard evaluation set by NATURE can deteriorate model performance significantly. Through our experiments we demonstrate that when NATURE operators are applied to evaluation set of popular benchmarks the model accuracy can drop by up to 40%. David Alfonso-Hermelo · Ahmad Rashid · Abbas Ghaddar · Philippe Langlais · Mehdi Rezagholizadeh 🔗 - CSFCube - A Test Collection of Computer Science Research Articles for Faceted Query by Example (Poster) []  []  [ Paper ]  link »    Query by Example is a well-known information retrieval task in which a document is chosen by the user as the search query and the goal is to retrieve relevant documents from a large collection. However, a document often covers multiple aspects of a topic. To address this scenario we introduce the task of faceted Query by Example in which users can also specify a finer grained aspect in addition to the input query document. We focus on the application of this task in scientific literature search. We envision models which are able to retrieve scientific papers analogous to a query scientific paper along specifically chosen rhetorical structure elements as one solution to this problem. In this work, the rhetorical structure elements, which we refer to as facets, indicate objectives, methods, or results of a scientific paper. We introduce and describe an expert annotated test collection to evaluate models trained to perform this task. Our test collection consists of a diverse set of 50 query documents in English, drawn from computational linguistics and machine learning venues. We carefully follow the annotation guideline used by TREC for depth-k pooling (k = 100 or 250) and the resulting data collection consists of graded relevance scores with high annotation agreement. State of the art models evaluated on our dataset show a significant gap to be closed in further work. Our dataset may be accessed here: https://github.com/iesl/CSFCube Link » Sheshera Mysore · Tim O'Gorman · Andrew McCallum · Hamed Zamani 🔗 - RAFT: A Real-World Few-Shot Text Classification Benchmark (Poster) []  [ Paper ]    Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org/. Neel Alex · Eli Lifland · Lewis Tunstall · Abhishek Thakur · Pegah Maham · C. Riedel · Emmie Hine · Carolyn Ashurst · Paul Sedille · Alexis Carlier · Michael Noetel · Andreas Stuhlmüller 🔗 - A Dataset for Answering Time-Sensitive Questions (Poster) []  [ Paper ]    Time is an important dimension in our physical world. Lots of facts can evolve with respect to time. For example, the U.S. President might change every four years. Therefore, it is important to consider the time dimension and empower the existing QA models to reason over time. However, the existing QA datasets contain rather few time-sensitive questions, hence not suitable for diagnosing or benchmarking the model's temporal reasoning capability. In order to promote research in this direction, we propose to construct a time-sensitive QA dataset. The dataset is constructed by 1) mining time-evolving facts from WikiData and aligning them to their corresponding Wikipedia page, 2) employing crowd workers to verify and calibrate these noisy facts, 3) generating question-answer pairs based on the annotated time-sensitive facts. Our dataset poses challenges in the aspect of both temporal understanding and temporal reasoning. We evaluate different SoTA long-document QA systems like BigBird and FiD on our dataset. The best-performing model FiD can only achieve 46\% accuracy, still far behind the human performance of 87\%. We demonstrate that these models are still lacking the ability to perform consistent temporal reasoning. Therefore, we believe that our dataset could serve as a benchmark to develop NLP models more sensitive to temporal shifts. Wenhu Chen · Xinyi Wang · William Yang Wang 🔗 - DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates (Poster) []  [ Paper ]    In this paper, we introduce DEBAGREEMENT, a dataset of 42,894 comment-reply pairs from the popular discussion website Reddit, annotated with agree, neutral or disagree labels. We collect data from five forums on Reddit: r/BlackLivesMatter, r/Brexit, r/climate, r/democrats, r/Republican. For each forum, we select comment pairs such that they form altogether a user interaction graph. DEBAGREEMENT presents a challenge for Natural Language Processing (NLP) systems, as it contains slang, sarcasm and topic-specific jokes, often present in online exchanges. We evaluate the performance of state-of-the-art language models on a (dis)agreement detection task, and investigate the use of contextual information available (graph, authorship, and temporal information). Since recent research has shown that context, such as social context or knowledge graph information, enables language models to better perform on downstream NLP tasks, DEBAGREEMENT provides novel opportunities for combining graph-based and text-based machine learning techniques to detect (dis)agreements online. John Pougué-Biyong · Valentina Semenova · Alexandre Matton · Rachel Han · Aerin Kim · Renaud Lambiotte · Doyne Farmer 🔗 - Task Agnostic and Task Specific Self-Supervised Learning from Speech with LeBenchmark (Poster) []  [ Paper ]    Self-Supervised Learning (SSL) has yielded remarkable improvements in many different domains including computer vision, natural language processing and speech processing by leveraging large amounts of unlabeled data. In the specific context of speech, however, and despite promising results, there exists a clear lack of standardization in the evaluation process for comprehensive comparisons of these models. This issue gets even worse with the investigation of SSL approaches for other languages than English. We present LeBenchmark, an open-source and reproducible framework for assessing SSL from French speech data. It includes documented, large-scale and heterogeneous corpora, seven pretrained SSL wav2vec 2.0 models shared with the community, and a clear evaluation protocol made of four downstream tasks along with their scoring scripts: automatic speech recognition, spoken language understanding, automatic speech translation and automatic emotion recognition. For the first time, SSL models are analyzed and compared on the latter domains both from a task-agnostic (i.e. frozen) and task-specific (i.e. fine-tuned w.r.t the downstream task) perspectives. We report state-of-the-art performance on most considered French tasks and provide a readable evaluation set-up for the development of future SSL models for speech processing. Solène Evain · Ha Nguyen · Hang Le · Marcely Zanon Boito · Salima Mdhaffar · Sina Alisamir · Ziyi Tong · Natalia Tomashenko · Marco Dinarelli · Titouan Parcollet · Alexandre Allauzen · Yannick Estève · Benjamin Lecouteux · François Portet · Solange Rossato · Fabien Ringeval · Didier Schwab · laurent besacier 🔗 - SynthBio: A Case Study in Faster Curation of Text Datasets (Poster) []  [ Paper ]    NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web such as WikiBio [Lebret 2016] are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio - a new evaluation set for WikiBio - comprised of structured attribute lists describing fictional individuals, mapped to natural language biographies. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality. Ann Yuan · Daphne Ippolito · Vitaly Nikolaev · Chris Callison-Burch · Andy Coenen · Sebastian Gehrmann 🔗 - Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs (Poster) []  [ Paper ]    Complex Query Answering (CQA) is an important reasoning task on knowledge graphs. Current CQA learning models have been shown to be able to generalize from atomic operators to more complex formulas, which can be regarded as the combinatorial generalizability. In this paper, we present EFO-1-QA, a new dataset to benchmark the combinatorial generalizability of CQA models by including 301 different queries types, which is 20 times larger than existing datasets. Besides, our benchmark, for the first time, provide a benchmark to evaluate and analyze the impact of different operators and normal forms by using (a) 7 choices of the operator systems and (b) 9 forms of complex queries. Specifically, we provide the detailed study of the combinatorial generalizability of two commonly used operators, i.e., projection and intersection, and justify the impact of the forms of queries given the canonical choice of operators. Our code and data can provide an effective pipeline to benchmark CQA models. Zihao Wang · Hang Yin · Yangqiu Song 🔗 - KLUE: Korean Language Understanding Evaluation (Poster) []  [ Paper ]    We introduce Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of eight Korean natural language understanding (NLU) tasks, including Topic Classification, Semantic Textual Similarity, Natural LanguageInference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We create all of the datasets from scratch in a principled way. We design the tasks to have diverse formats and each task to be built upon various source corpora that respect copyrights. Also, we propose suitable evaluation metrics and organize annotation protocols in a way to ensure quality. To prevent ethical risks in KLUE, we proactively remove examples reflecting social biases, containing toxic content or personally identifiable information (PII). Along with the benchmark datasets, we release pre-trained language models (PLM) for Korean, KLUE-BERT and KLUE-RoBERTa, and find KLUE-Roberta-large outperforms other baselines including multilingual PLMs and existing open-source Korean PLMs. The fine-tuning recipes are publicly open for anyone to reproduce our baseline result. We believe our work will facilitate future research on cross-lingual as well as Korean language models and the creation of similar resources for other languages. KLUE is available at https://klue-benchmark.com. Sungjoon Park · Jihyung Moon · Sungdong Kim · Won Ik Cho · Ji Yoon Han · Jangwon Park · Chisung Song · Junseong Kim · Youngsook Song · Taehwan Oh · Joohong Lee · Juhyun Oh · Sungwon Lyu · Younghoon Jeong · Inkwon Lee · Sangwoo Seo · Dongjun Lee · Hyunwoo Kim · Myeonghwa Lee · Seongbo Jang · Seungwon Do · Sunkyoung Kim · Kyungtae Lim · Jongwon Lee · Kyumin Park · Jamin Shin · Seonghyun Kim · Lucy Park · Alice Oh · Jung-Woo Ha · Kyunghyun Cho 🔗 - CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge (Poster) []  [ Paper ]    Most benchmark datasets targeting commonsense reasoning focus on everyday scenarios: physical knowledge like knowing that you could fill a cup under a waterfall, social knowledge like bumping into someone is awkward, and other generic situations. However, there is a rich space of commonsense inferences anchored to knowledge about specific entities: for example, deciding the truthfulness of a claim Harry Potter can teach classes on how to fly on a broomstick. Can models learn to combine entity knowledge with commonsense reasoning in this fashion? We introduce CREAK, a testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities (Harry Potter is a wizard and is skilled at riding a broomstick) with commonsense inferences (if you're good at a skill you can teach others how to do it). Our dataset consists of 13k human-authored English claims about entities that are either true or false, in addition to a small contrast set. Crowdworkers can easily come up with these statements and human performance on the dataset is high (high 90s); we argue that pre-trained language models (LMs) should be able to blend entity knowledge and commonsense reasoning to do well here. In our experiments, we focus on the closed-book setting and observe that a baseline model finetuned on existing fact verification benchmark struggles with the type of inferences in CREAK. Training a model on CREAK improves accuracy by a substantial margin, but still falls short of human performance. Our benchmark provides a unique probe into natural language understanding models, testing both its ability to retrieve facts (e.g., who teaches at the University of Chicago?) and unstated commonsense knowledge (e.g., butlers do not yell at guests). Yasumasa Onoe · Michael Zhang · Eunsol Choi · Greg Durrett 🔗 - Few-Shot Learning Evaluation in Natural Language Understanding (Poster) []  []  [ Paper ]  link »    Most recent progress in natural language understanding (NLU) has been driven, in part, by benchmarks such as GLUE, SuperGLUE, SQuAD, etc. In fact, many NLU models have now matched or exceeded "human-level" performance on many tasks in these benchmarks. Most of these benchmarks, however, give models access to relatively large amounts of labeled data for training. As such, the models are provided far more data than required by humans to achieve strong performance. That has motivated a line of work that focuses on improving few-shot learning performance of NLU models. However, there is a lack of standardized evaluation benchmarks for few-shot NLU resulting in different experimental settings in different papers.To help accelerate this line of work, we introduce CLUES, a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks. We also demonstrate differences between alternative model families and adaptation techniques in the few shot setting. Finally, we discuss several principles and choices in designing the experimental settings for evaluating the true few-shot learning performance and suggest a unified standardized approach to few-shot learning evaluation. We aim to encourage research on NLU models that can generalize to new tasks with a small number of examples. Code and data for CLUES are available at https://github.com/microsoft/CLUES. Link » Subhabrata Mukherjee · Xiaodong Liu · Guoqing Zheng · Saghar Hosseini · Hao Cheng · Ge Yang · Christopher Meek · Ahmed Awadallah · Jianfeng Gao 🔗 - SciGen: a Dataset for Reasoning-Aware Text Generation from Scientific Tables (Poster) []  [ Paper ]    We introduce SciGen, a new challenge dataset consisting of tables from scientific articles and their corresponding descriptions, for the task of reasoning-aware data-to-text generation. Describing scientific tables goes beyond the surface realization of the table content and requires reasoning over table values. The unique properties of SciGen are that (1) tables mostly contain numerical values, and (2) the corresponding descriptions require arithmetic reasoning. SciGen is the first dataset that assesses the arithmetic reasoning capabilities of generation models on complex input structures, such as tables from scientific articles, and thus it opens new avenues for future research in reasoning-aware text generation and evaluation. The core part of SciGen, including the test data, is annotated by one of the authors of the corresponding articles. Such expert annotations do not scale to large training data sizes. To tackle this, we propose a pipeline for automatically extracting high-quality table-description pairs from the LaTeX sources of scientific articles. We study the effectiveness of state-of-the-art data-to-text generation models on SciGen and evaluate the results using common metrics and human evaluation. Our results and analyses show that adding high-quality unsupervised training data improves the correctness and reduces the hallucination in generated descriptions, however, the ability of state-of-the-art models is still severely limited on this task. Nafise Moosavi · Andreas Rücklé · Dan Roth · Iryna Gurevych 🔗 - HumBugDB: A Large-scale Acoustic Mosquito Dataset (Poster) []  [ Paper ]    This paper presents the first large-scale multi-species dataset of acoustic recordings of mosquitoes tracked continuously in free flight. We present 20 hours of audio recordings that we have expertly labelled and tagged precisely in time. Significantly, 18 hours of recordings contain annotations from 36 different species. Mosquitoes are well-known carriers of diseases such as malaria, dengue and yellow fever. Collecting this dataset is motivated by the need to assist applications which utilise mosquito acoustics to conduct surveys to help predict outbreaks and inform intervention policy. The task of detecting mosquitoes from the sound of their wingbeats is challenging due to the difficulty in collecting recordings from realistic scenarios. To address this, as part of the HumBug project, we conducted global experiments to record mosquitoes ranging from those bred in culture cages to mosquitoes captured in the wild. Consequently, the audio recordings vary in signal-to-noise ratio and contain a broad range of indoor and outdoor background environments from Tanzania, Thailand, Kenya, the USA and the UK. In this paper we describe in detail how we collected, labelled and curated the data. The data is provided from a PostgreSQL database, which captures important metadata such as the capture method, age, feeding status and gender of the mosquitoes. Additionally, we provide code to extract features and train Bayesian convolutional neural networks for two key tasks: the identification of mosquitoes from their corresponding background environments, and the classification of detected mosquitoes into species. Our extensive dataset is both challenging to machine learning researchers focusing on acoustic identification, and critical to entomologists, geo-spatial modellers and other domain experts to understand mosquito behaviour, model their distribution, and manage the threat they pose to humans. Ivan Kiskin · Marianne Sinka · Adam Cobb · Waqas Rafique · Lawrence Wang · Davide Zilli · Benjamin Gutteridge · Rinita Dam · Theodoros Marinos · Yunpeng Li · Dickson Msaky · Emmanuel Kaindoa · Gerard Killeen · Eva Herreros-Moya · Kathy Willis · Stephen J Roberts 🔗 - KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects (Poster) []  [ Paper ]    This paper introduces an open source speech dataset, KeSpeech, which involves 1,542 hours of speech signals recorded by 27,237 speakers in 34 cities in China, and the pronunciation includes standard Mandarin and its 8 subdialects. The new dataset possesses several properties. Firstly, the dataset provides multiple labels including content transcription, speaker identity and subdialect, hence supporting a variety of speech processing tasks, such as speech recognition, speaker recognition, and subdialect identification, as well as other advanced techniques like multi-task learning and conditional learning. Secondly, some of the text samples were parallel recorded with both the standard Mandarin and a particular subdialect, allowing for new applications such as subdialect style conversion. Thirdly, the number of speakers is much larger than other open-source datasets, making it suitable for tasks that require training data from vast speakers. Finally, the speech signals were recorded in two phases, which opens the opportunity for the study of the time variance property of human speech. We present the design principle of the KeSpeech dataset and four baseline systems based on the new data resource: speech recognition, speaker verification, subdialect identification and voice conversion. The dataset is free for all academic usage. Zhiyuan Tang · Dong Wang · Yanguang Xu · Jianwei Sun · Xiaoning Lei · Shuaijiang Zhao · cheng wen · Xingjun Tan · Chuandong Xie · Shuran Zhou · Rui Yan · Chenjia Lv · Yang Han · Wei Zou · Xiangang Li 🔗 - Measuring Mathematical Problem Solving With the MATH Dataset (Poster) []  [ Paper ]    Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community. Dan Hendrycks · Collin Burns · Saurav Kadavath · Akul Arora · Steven Basart · Eric Tang · Dawn Song · Jacob Steinhardt 🔗

#### Author Information

##### Joaquin Vanschoren (Eindhoven University of Technology)

Joaquin Vanschoren is an Assistant Professor in Machine Learning at the Eindhoven University of Technology. He holds a PhD from the Katholieke Universiteit Leuven, Belgium. His research focuses on meta-learning and understanding and automating machine learning. He founded and leads OpenML.org, a popular open science platform that facilitates the sharing and reuse of reproducible empirical machine learning data. He obtained several demo and application awards and has been invited speaker at ECDA, StatComp, IDA, AutoML@ICML, CiML@NIPS, AutoML@PRICAI, MLOSS@NIPS, and many other occasions, as well as tutorial speaker at NIPS and ECMLPKDD. He was general chair at LION 2016, program chair of Discovery Science 2018, demo chair at ECMLPKDD 2013, and co-organizes the AutoML and meta-learning workshop series at NIPS 2018, ICML 2016-2018, ECMLPKDD 2012-2015, and ECAI 2012-2014. He is also editor and contributor to the book 'Automatic Machine Learning: Methods, Systems, Challenges'.