Skip to yearly menu bar Skip to main content

Workshop: Synthetic Data for Empowering ML Research

Entity-Controlled Synthetic Text Generation using Contextual Question and Answering with Pre-trained Language Models

Karan Aggarwal · Henry Jin · Aitzaz Ahmad


Recent advancements in Natural Language Processing (NLP) algorithms have resulted in state-of-the-art performance on Named Entity Recognition (NER) tasks. These algorithms typically require high-quality labeled datasets for training models. However, training NLP models effectively can suffer from issues such as scarcity of labeled data, data bias and under-representation, and privacy concerns with using sensitive data for training. Generating synthetic data to train models is a promising solution to mitigate these problems. We propose a contextual question and answering approach using pre-trained language models to synthetically generate entity-controlled text. Entity-controlled text generation is then used to augment small labeled datasets for downstream NER tasks. We evaluate this proposed method on two publicly available datasets, and measure the quality of generated texts quantitatively. We find that the model is capable of producing full text samples with the desired entities appearing in a stochastically controllable way, while retaining sentence coherence closest to the real world data. Evaluations on downstream NER tasks show significant improvements in low-labeled data regime, and in using purely synthetic data for NER to alleviate privacy concerns.

Chat is not available.