

Crowdsourcing and Machine Learning

Adish Singla · Rafael Frongillo · Matteo Venanzi

Room 120 + 121

Thu 8 Dec, 11 p.m. PST

Building systems that seamlessly integrate machine learning (ML) and human intelligence can greatly push the frontier of our ability to solve challenging real-world problems. While ML research usually focuses on developing more efficient learning algorithms, it is often the quality and amount of training data that predominantly govern the performance of real-world systems. This is only amplified by the recent popularity of large-scale and complex learning methodologies such as Deep Learning, which can require millions to billions of training instances to perform well. The recent rise of human computation and crowdsourcing approaches, made popular by task-solving platforms like Amazon Mechanical Turk and CrowdFlower, enables us to systematically collect and organize human intelligence. Crowdsourcing research itself is interdisciplinary, combining economics, game theory, cognitive science, and human-computer interaction to create robust and effective mechanisms and tools. The goal of this workshop is to bring crowdsourcing and ML experts together to explore how crowdsourcing can contribute to ML and vice versa. Specifically, we will focus on the design of mechanisms for data collection and ML competitions, and conversely, applications of ML to complex crowdsourcing platforms.


Crowdsourcing is one of the most popular approaches to data collection for ML, and therefore one of the biggest avenues through which crowdsourcing can advance the state of the art in ML. We seek cost-efficient and fast data collection methods based on crowdsourcing, and ask how design decisions in these methods could impact subsequent stages of an ML system. Topics of interest include:
- Basic annotation: What is the best way to collect and aggregate labels for unlabeled data from the crowd? How can we increase fidelity by flagging labels as uncertain given the crowd feedback? How can we do the above in the most cost-efficient manner?
- Beyond simple annotation tasks: What is the most effective way to collect probabilistic data from the crowd? How can we collect data requiring global knowledge of the domain, such as building the structure of a Bayes net via crowdsourcing?
- Time-sensitive and complex tasks: How can we design crowdsourcing systems to handle real-time or time-sensitive tasks, or those requiring more complicated work dependencies? Can we encourage collaboration on complex tasks?
- Data collection for specific domains: How can ML researchers apply crowdsourcing principles to specific domains (e.g., healthcare) where privacy and other concerns are at play?
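
As a concrete illustration of the basic annotation questions above, the following is a minimal sketch of one simple baseline: majority-vote aggregation that flags a label as uncertain when too few workers agree. It is not a method proposed by the workshop. The function name `aggregate_labels`, the `min_agreement` threshold of 0.7, and the example data are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_labels(votes, min_agreement=0.7):
    """Majority-vote aggregation with an uncertainty flag (illustrative only).

    votes: dict mapping item id -> list of labels given by different workers.
    Returns a dict mapping item id -> (label, agreement, is_uncertain), where
    agreement is the fraction of workers who chose the winning label.
    """
    results = {}
    for item, labels in votes.items():
        if not labels:
            continue  # no votes collected for this item; nothing to aggregate
        label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        # Flag the item when agreement falls below the (assumed) threshold.
        results[item] = (label, agreement, agreement < min_agreement)
    return results


# Hypothetical crowd votes for two images.
votes = {
    "img1": ["cat", "cat", "cat", "dog"],
    "img2": ["cat", "dog", "dog", "bird"],
}
result = aggregate_labels(votes)
print(result)
```

Items flagged as uncertain could be sent back to the crowd for additional labels, which is one way to trade cost against label fidelity.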


Through the Netflix challenge and now platforms like Kaggle, we are seeing the crowdsourcing of ML research itself. Yet the mechanisms underlying these competitions are extremely simple. Here our focus is on the design of such competitions; topics of interest include:
- What is the most effective way to incentivize the crowd to participate in ML competitions? Rather than the typical winner-takes-all format, can we design a more efficient mechanism that makes better use of the net research-hours devoted to the competition?
- Competitions as recruiting: how would we design a competition differently if (as is often the case) the result is not a winning algorithm but instead a job offer?
- Privacy issues with data sharing are one of the key barriers to holding such competitions. How can we design privacy-aware mechanisms which allow enough access to enable a meaningful competition?
- Challenges arising from the sequential and interactive nature of competitions, e.g., how can we maintain unbiased leaderboards without allowing for overfitting?


General crowdsourcing systems such as Duolingo, FoldIt, and Galaxy Zoo confront challenges of reliability, efficiency, and scalability, for which ML can provide powerful solutions. Many ML approaches have already been applied to output aggregation, quality control, workflow management, and incentive design, but there is much more that could be done, whether through novel ML methods, major redesigns of workflows or mechanisms, or work on new crowdsourcing problems. Topics here include:
- Dealing with a large number of sparse, noisy label classes, for example, when tagging image collections for Deep Learning-based computer vision algorithms.
- Optimal budget allocation and active learning in crowdsourcing.
- Open theoretical questions in crowdsourcing that can be addressed by statistics and learning theory, for instance, analyzing label aggregation algorithms such as EM, or budget allocation strategies.
- Applications of ML to emerging crowd-powered marketplaces (e.g., Uber, AirBnb). How can ML improve the efficiency of these markets?
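
To make the reference to EM-based label aggregation above concrete, here is a minimal sketch under strong simplifying assumptions: binary labels, a uniform prior over the true label, and a single accuracy parameter per worker (a "one-coin" simplification of the Dawid-Skene model, not necessarily the variant any workshop contribution analyzes). The names `em_aggregate`, `n_iter`, and `eps`, and the toy data, are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def em_aggregate(labels, n_iter=20, eps=1e-6):
    """One-coin EM for binary crowd labels (illustrative sketch).

    labels: iterable of (item, worker, y) triples with y in {0, 1}.
    Returns (q, acc): q[item] is the posterior probability that the true
    label is 1, and acc[worker] is the estimated worker accuracy.
    """
    by_item, by_worker = defaultdict(list), defaultdict(list)
    for item, worker, y in labels:
        by_item[item].append((worker, y))
        by_worker[worker].append((item, y))

    # Initialize posteriors with the soft majority vote; this also breaks the
    # label-flip symmetry, assuming most workers are better than chance.
    q = {i: sum(y for _, y in obs) / len(obs) for i, obs in by_item.items()}
    acc = {}
    for _ in range(n_iter):
        # M-step: accuracy = expected fraction of a worker's correct labels.
        for w, obs in by_worker.items():
            agree = sum(q[i] if y == 1 else 1.0 - q[i] for i, y in obs)
            acc[w] = min(max(agree / len(obs), eps), 1.0 - eps)
        # E-step: combine votes as accuracy-weighted log-odds.
        for i, obs in by_item.items():
            log_odds = 0.0
            for w, y in obs:
                weight = math.log(acc[w] / (1.0 - acc[w]))
                log_odds += weight if y == 1 else -weight
            q[i] = sigmoid(log_odds)
    return q, acc


# Toy data: workers "a" and "b" always agree with the (hypothetical) truth,
# while worker "c" always gives the opposite label.
truth = [1, 0, 1, 0]
labels = []
for i, t in enumerate(truth):
    labels += [(i, "a", t), (i, "b", t), (i, "c", 1 - t)]

posteriors, accuracy = em_aggregate(labels)
estimates = [int(posteriors[i] > 0.5) for i in range(len(truth))]
print(estimates, accuracy)
```

Unlike plain majority voting, this scheme learns to down-weight (or invert) the votes of workers whose labels systematically disagree with the consensus.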



