Ben Hamner (Kaggle): "Kaggle Competitions and The Future of Reproducible Machine Learning"

Ben Hamner

2016 Talk in Workshop: Crowdsourcing and Machine Learning

Abstract

At Kaggle, we’ve run hundreds of machine learning competitions and seen over 150,000 data scientists make submissions. One thing is clear: winning competitions isn’t random. We’ve learned that certain tools and methodologies work consistently well across different types of problems. Many participants make common mistakes, such as overfitting, that should be actively avoided. Competition hosts have their own pitfalls, such as data leakage. In this talk, I’ll share what goes into a winning competition toolkit, along with some war stories on what to avoid.

I’ll also share what we’re seeing on the collaborative side of competitions. Our community is collaborating more and more on developing machine learning models and analytic solutions. As collaboration has grown, reproducibility has emerged as a key pain point in machine learning. It can be incredibly tough to rerun and build on your colleague’s work, public work, or even your own past work! We’re expanding our focus to build a reproducible data science platform that addresses these pain points directly. It combines versioned data, versioned code, and versioned computational environments (through Docker containers) to create reproducible results.
