

Theory and Practice of Efficient and Accurate Dataset Construction

Frederic Sala · Ramya Korlakai Vinayak



Data is one of the key drivers of progress in machine learning. Modern datasets require scale far beyond what individual domain experts can produce. To overcome this limitation, a wide variety of techniques have been developed to build large datasets efficiently, including crowdsourcing, automated labeling, and weak supervision. This tutorial describes classical and modern methods for building datasets that go beyond hand-labeling. It covers both theoretical and practical aspects of dataset construction. Theoretically, we discuss guarantees for a variety of crowdsourcing, active learning, and weak supervision techniques, with a particular focus on the generalization properties of downstream models trained on the resulting datasets. Practically, we describe several popular systems that implement such techniques and their use in industry and beyond. We cover both the promise and the potential pitfalls of these methods. Finally, we compare automated dataset construction with other popular approaches to coping with scarce labeled data, including few- and zero-shot methods enabled by foundation models.
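As a concrete illustration of the simplest aggregation step shared by crowdsourcing and weak supervision, the sketch below combines several noisy votes per example by majority vote. It is not taken from the tutorial itself: the `ABSTAIN` convention, the `majority_vote` helper, and the toy `label_matrix` are all hypothetical, and the methods covered in the tutorial typically go further by modeling the accuracy of each source rather than weighting them equally.

```python
from collections import Counter

ABSTAIN = -1  # hypothetical convention: a source that gives no label votes -1


def majority_vote(votes, abstain=ABSTAIN):
    """Return the most common non-abstain vote, or abstain on a tie or no votes."""
    counts = Counter(v for v in votes if v != abstain)
    if not counts:
        return abstain  # every source abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return abstain  # top two labels are tied, so no clear winner
    return ranked[0][0]


# Toy label matrix: one row per example, one column per annotator
# or labeling function (assumed data, for illustration only).
label_matrix = [
    [1, 1, 0],
    [0, ABSTAIN, 0],
    [1, 0, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]

aggregated = [majority_vote(row) for row in label_matrix]
print(aggregated)
```

Returning an abstain on ties, rather than breaking them arbitrarily, keeps ambiguous examples out of the training set; whether that trade-off is appropriate depends on how much coverage the downstream model needs.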
