
Crowdsourcing and Machine Learning
Adish Singla · Rafael Frongillo · Matteo Venanzi

Thu Dec 08 11:00 PM -- 09:30 AM (PST) @ Room 120 + 121
Event URL: http://crowdml.cc/nips2016/

Building systems that seamlessly integrate machine learning (ML) and human intelligence can greatly push the frontier of our ability to solve challenging real-world problems. While ML research usually focuses on developing more efficient learning algorithms, it is often the quality and amount of training data that predominantly govern the performance of real-world systems. This is only amplified by the recent popularity of large-scale and complex learning methodologies such as Deep Learning, which can require millions to billions of training instances to perform well. The recent rise of human computation and crowdsourcing approaches, made popular by task-solving platforms like Amazon Mechanical Turk and CrowdFlower, enables us to systematically collect and organize human intelligence. Crowdsourcing research itself is interdisciplinary, combining economics, game theory, cognitive science, and human-computer interaction to create robust and effective mechanisms and tools. The goal of this workshop is to bring crowdsourcing and ML experts together to explore how crowdsourcing can contribute to ML and vice versa. Specifically, we will focus on the design of mechanisms for data collection and ML competitions, and conversely, applications of ML to complex crowdsourcing platforms.


Crowdsourcing is one of the most popular approaches to data collection for ML, and therefore one of the biggest avenues through which crowdsourcing can advance the state of the art in ML. We seek cost-efficient and fast data collection methods based on crowdsourcing, and ask how design decisions in these methods could impact subsequent stages of an ML system. Topics of interest include:
- Basic annotation: What is the best way to collect and aggregate labels for unlabeled data from the crowd? How can we increase fidelity by flagging labels as uncertain based on crowd feedback? How can we do the above in the most cost-efficient manner?
- Beyond simple annotation tasks: What is the most effective way to collect probabilistic data from the crowd? How can we collect data requiring global knowledge of the domain such as building Bayes net structure via crowdsourcing?
- Time-sensitive and complex tasks: How can we design crowdsourcing systems to handle real-time or time-sensitive tasks, or those requiring more complicated work dependencies? Can we encourage collaboration on complex tasks?
- Data collection for specific domains: How can ML researchers apply the crowdsourcing principles to specific domains (e.g., healthcare) where privacy and other concerns are at play?
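As a concrete baseline for the first topic above, the simplest aggregation scheme is per-item majority voting, with low-agreement items flagged as uncertain so they can be re-annotated. The sketch below is illustrative only; the function name and the 0.7 agreement threshold are hypothetical choices, not anything prescribed by the workshop.

```python
from collections import Counter

def aggregate_labels(annotations, confidence_threshold=0.7):
    """Majority-vote aggregation of crowd labels.

    annotations: dict mapping item id -> list of worker labels.
    Returns: dict mapping item id -> (winning label, agreement fraction,
             flagged-as-uncertain boolean).
    """
    results = {}
    for item, labels in annotations.items():
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        agreement = votes / len(labels)
        # Flag items whose agreement falls below the threshold
        # as candidates for collecting additional labels.
        results[item] = (label, agreement, agreement < confidence_threshold)
    return results

votes = {"a": ["cat", "cat", "cat"], "b": ["cat", "dog", "bird"]}
print(aggregate_labels(votes))
```

Item "a" comes back with full agreement, while item "b" (agreement 1/3) is flagged, illustrating the cost trade-off in the topic above: extra budget goes only to items the crowd disagrees on.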


Through the Netflix challenge and now platforms like Kaggle, we are seeing the crowdsourcing of ML research itself. Yet the mechanisms underlying these competitions are extremely simple. Here our focus is on the design of such competitions; topics of interest include:
- What is the most effective way to incentivize the crowd to participate in ML competitions? Rather than the typical winner-takes-all prize structure, can we design a mechanism that makes better use of the net research-hours devoted to the competition?
- Competitions as recruiting: how would we design a competition differently if (as is often the case) the result is not a winning algorithm but instead a job offer?
- Privacy issues with data sharing are one of the key barriers to holding such competitions. How can we design privacy-aware mechanisms which allow enough access to enable a meaningful competition?
- Challenges arising from the sequential and interactive nature of competitions, e.g., how can we maintain unbiased leaderboards without allowing participants to overfit the holdout set?
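One known approach to the leaderboard question in the last item is the "Ladder" mechanism of Blum and Hardt: the public score is updated only when a new submission improves on the current best by at least a fixed step, which limits how much information about the holdout set leaks back to participants. The following is a minimal sketch of that idea (the simplified variant without score rounding), with a hypothetical step size of 0.01.

```python
def ladder_leaderboard(scores, step=0.01):
    """Simplified Ladder-style leaderboard.

    scores: holdout accuracies of successive submissions, in order.
    Returns the publicly released score after each submission: the
    released value changes only when a submission beats the current
    best by at least `step`, limiting adaptive overfitting.
    """
    best = float("-inf")
    released = []
    for s in scores:
        if s >= best + step:
            best = s
        released.append(best)
    return released

print(ladder_leaderboard([0.70, 0.703, 0.72, 0.715], step=0.01))
# -> [0.7, 0.7, 0.72, 0.72]
```

Note how the 0.703 submission is not reflected on the board: improvements smaller than the step are indistinguishable from noise, so releasing them would mostly reward overfitting to the holdout set.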


General crowdsourcing systems such as Duolingo, FoldIt, and Galaxy Zoo confront challenges of reliability, efficiency, and scalability, for which ML can provide powerful solutions. Many ML approaches have already been applied to output aggregation, quality control, workflow management, and incentive design, but there is much more that could be done, whether through novel ML methods, major redesigns of workflows or mechanisms, or by tackling new crowdsourcing problems. Topics here include:
- Dealing with sparse, noisy labels and large numbers of label classes, for example, in tagging image collections for Deep Learning-based computer vision algorithms.
- Optimal budget allocation and active learning in crowdsourcing.
- Open theoretical questions in crowdsourcing that can be addressed by statistics and learning theory, for instance, analyzing label aggregation algorithms such as EM, or budget allocation strategies.
- Applications of ML to emerging crowd-powered marketplaces (e.g., Uber, Airbnb). How can ML improve the efficiency of these markets?
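The "label aggregation algorithms such as EM" mentioned in the theory item above typically refers to EM inference in the Dawid-Skene model, where each worker is modeled by a per-class confusion matrix. A minimal, unoptimized sketch follows; the function name and the simple smoothing constant are illustrative choices, not a reference implementation.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Minimal EM for the Dawid-Skene label aggregation model.

    labels: (n_items, n_workers) int array of observed labels, -1 = missing.
    Returns (posterior over true labels per item, worker confusion matrices).
    """
    n_items, n_workers = labels.shape
    # Initialize posteriors from per-item vote fractions (majority vote).
    post = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for l in labels[i]:
            if l >= 0:
                post[i, l] += 1
    post /= post.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices,
        # weighted by the current posteriors (small additive smoothing).
        prior = post.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for i in range(n_items):
            for w in range(n_workers):
                if labels[i, w] >= 0:
                    conf[w, :, labels[i, w]] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: recompute the posterior over each item's true label.
        log_post = np.tile(np.log(prior), (n_items, 1))
        for i in range(n_items):
            for w in range(n_workers):
                if labels[i, w] >= 0:
                    log_post[i] += np.log(conf[w, :, labels[i, w]])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, conf
```

On data where two workers label reliably and a third is only half right, the model downweights the unreliable worker's votes rather than counting all votes equally, which is exactly the gain over plain majority voting that the aggregation literature analyzes.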

Thu 11:30 p.m. - 12:00 a.m.
Poster Setup by Authors (Misc)
Fri 12:00 a.m. - 12:05 a.m.
Opening Remarks (Misc)
Fri 12:05 a.m. - 12:55 a.m.

Since its inception, crowdsourcing has been considered a black-box approach to solicit labor from a crowd of workers. Furthermore, the “crowd” has been viewed as a group of independent workers. Recent studies based on in-person interviews have opened up the black box and shown that the crowd is not a collection of independent workers, but instead that workers communicate and collaborate with each other. In this talk, I will describe our attempt to quantify this discovery by mapping the entire communication network of workers on Amazon Mechanical Turk, a leading crowdsourcing platform. We executed a task in which over 10,000 workers from across the globe self-reported their communication links to other workers, thereby mapping the communication network among workers. Our results suggest that while a large percentage of workers indeed appear to be independent, there is a rich network topology over the rest of the population. That is, there is a substantial communication network within the crowd. We further examined how online forum usage relates to network topology, how workers communicate with each other via this network, how workers’ experience levels relate to their network positions, and how U.S. workers differ from international workers in their network characteristics. These findings have implications for requesters, workers, and platform providers. This talk is based on joint work with Ming Yin, Mary Gray, and Sid Suri.

Jenn Wortman Vaughan
Fri 12:55 a.m. - 1:05 a.m.
Edoardo Manino: "Efficiency of Active Learning for the Allocation of Workers on Crowdsourced Classification Tasks" (Paper Presentation)
Fri 1:05 a.m. - 1:15 a.m.
Yao-Xiang Ding: "Crowdsourcing with Unsure Option" (Paper Presentation)
Fri 1:15 a.m. - 1:25 a.m.
Yang Liu: "Doubly Active Learning: When Active Learning meets Active Crowdsourcing" (Paper Presentation)
Fri 1:30 a.m. - 2:00 a.m.
Coffee + Posters (Break)
Fri 2:00 a.m. - 2:45 a.m.

Adaptive schemes, where tasks are assigned based on the data collected thus far, are widely used in practical crowdsourcing systems to efficiently allocate the budget. However, existing theoretical analyses of crowdsourcing systems suggest that the gain of adaptive task assignments is minimal. To bridge this gap, we propose a new model for representing practical crowdsourcing systems, which strictly generalizes the popular Dawid-Skene model, and characterize the fundamental trade-off between budget and accuracy. We introduce a novel adaptive scheme that matches this fundamental limit. Our analysis relies on new techniques for the spectral analysis of non-backtracking operators, using density evolution methods from coding theory.

Sewoong Oh
Fri 2:45 a.m. - 2:55 a.m.
Matteo Venanzi: "Time-Sensitive Bayesian Information Aggregation for Crowdsourcing Systems" (Paper Presentation)
Fri 2:55 a.m. - 3:05 a.m.
Miles E. Lopes: "A Sharp Bound on the Computation-Accuracy Tradeoff for Majority Voting Ensembles" (Paper Presentation)
Fri 3:10 a.m. - 3:30 a.m.
Ashish Kapoor: "Identifying and Accounting for Task-Dependent Bias in Crowdsourcing" (Paper Presentation)
Fri 3:30 a.m. - 5:00 a.m.
Lunch (Break)
Fri 5:00 a.m. - 5:15 a.m.
Boi Faltings: "Incentives for Effort in Crowdsourcing Using the Peer Truth Serum" (Paper Presentation)
Fri 5:15 a.m. - 5:30 a.m.
David Parkes: "Peer Prediction with Heterogeneous Tasks" (Paper Presentation)
Fri 5:30 a.m. - 5:45 a.m.
Jens Witkowski: "Proper Proxy Scoring Rules" (Paper Presentation)
Fri 5:45 a.m. - 6:00 a.m.
Jordan Suchow: "Rethinking Experiment Design as Algorithm Design" (Paper Presentation)
Fri 6:00 a.m. - 6:30 a.m.
Afternoon Coffee + Posters (Break)
Fri 6:30 a.m. - 7:30 a.m.

At Kaggle, we’ve run hundreds of machine learning competitions and seen over 150,000 data scientists make submissions. One thing is clear: winning competitions isn’t random. We’ve learned that certain tools and methodologies work consistently well on different types of problems. Many participants make common mistakes (such as overfitting) that should be actively avoided. Similarly, competition hosts have their own set of pitfalls (such as data leakage). In this talk, I’ll share what goes into a winning competition toolkit along with some war stories on what to avoid. Additionally, I’ll share what we’re seeing on the collaborative side of competitions. Our community is showing an increasing amount of collaboration in developing machine learning models and analytic solutions. As collaboration has grown, we've seen reproducibility as a key pain point in machine learning. It can be incredibly tough to rerun and build on your colleague's work, public work, or even your own past work! We're expanding our focus to build a reproducible data science platform that hits directly at these pain points. It combines versioned data, versioned code, and versioned computational environments (through Docker containers) to create reproducible results.

Ben Hamner
Fri 7:30 a.m. - 9:00 a.m.
Poster Presentation by Authors (Poster Session)

Author Information

Adish Singla (MPI-SWS)
Rafael Frongillo (CU Boulder)
Matteo Venanzi (Microsoft)
