Workshop: Advances and Opportunities: Machine Learning for Education

ImageNets for Teaching CS

Tiffany Barnes · Thomas Price · Jim Larimore


Abstract: In this breakout session, we propose the idea of an "ImageNet for Teaching Computer Science." The proposed idea involves collecting a large set of labeled programming datasets from classrooms, using a shared format, and developing a set of benchmarks and challenges that will facilitate research for K-20 computing education. This data would benefit a growing research community at the intersection of computing education and learning analytics, with implications for students across many fields that teach computing.

Rationale: Programming data is well suited to learning analytics and educational data mining: it is rich, capturing students' every state and action as they work, and it is structured by syntax rules. Advances in programming analysis techniques for open-ended, sequential, and semi-structured data will have broad applications across educational domains. However, recent advances in deep learning require larger datasets and more meaningful labels than those typically available from individual classrooms, necessitating cross-institutional data collection and labeling efforts.

Background and Progress: A series of workshops from CS-SPLICE and CSEDM has brought together the research community to develop infrastructure and analysis techniques for programming data. The community has developed the shared ProgSnap2 format for programming log data, which is already used by 10+ datasets comprising 750,000+ program snapshots in various languages (many of these datasets are available online). Researchers have used this data to develop automated support (e.g., hints, feedback, curated examples), predict student success, and personalize interventions. The CSEDM Data Challenge is a recurring data mining competition (held 2019, planned 2021) to gain insight from classroom programming data, which has helped to define shared machine learning benchmarks on common datasets.

Next Steps: The key challenges will be collecting diverse existing datasets and creating infrastructure to support collecting and labeling new data. This will allow us to tackle novel research challenges, such as generalizing algorithms and labels across problems: for example, detecting knowledge components, strategies, or misconceptions on one problem using data from others. The CS-SPLICE and CSEDM communities include developers of many widely used educational programming platforms and will be important stakeholders in driving the work forward.

Acknowledgements: This reflects joint work by Thomas Price, Tiffany Barnes, Min Chi, Samiha Marwan, Yang Shi, Preya Shabrina, and Ye Mao at NC State University. It presents and builds on ideas and foundational work by the CS-SPLICE and CSEDM teams.