Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled breakthroughs in computer vision, speech recognition, natural language processing and beyond, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of "AI for Code" has garnered new interest and gathered momentum. In this paper, we present a large-scale dataset CodeNet, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to benchmark and help accelerate research in AI techniques for a variety of critical coding tasks, including code similarity and classification, code translation between a large variety of programming languages, and code performance (runtime and memory) improvement techniques. Additionally, CodeNet provides sample input and output test sets for 98.5% of the code samples, which can be used as an oracle for determining code correctness and potentially guide reinforcement learning for code quality improvements. As a usability feature, we provide several pre-processing tools in CodeNet to transform source code into representations that can be readily used as inputs into machine learning models. Results of code classification and code similarity experiments using the CodeNet dataset are provided as a reference. We hope that the scale, diversity and rich, high-quality annotations of CodeNet will offer unprecedented research opportunities at the intersection of AI and Software Engineering.
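The kind of preprocessing the abstract refers to, turning raw source code into a representation a model can consume, can be illustrated with a minimal, hypothetical sketch. This is not CodeNet's actual tooling; it assumes nothing beyond a language-agnostic regex tokenizer and a bag-of-tokens frequency vector, one of the simplest ML-ready representations:

```python
import re
from collections import Counter

def tokenize(source: str) -> list:
    """Split source code into a crude token sequence.

    A language-agnostic regex lexer: identifiers, integer literals,
    and single punctuation characters. Real pipelines would use a
    per-language lexer or parser instead.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def bag_of_tokens(source: str) -> Counter:
    """Map a code sample to token frequencies (a simple feature vector)."""
    return Counter(tokenize(source))

sample = "int main() { return 0; }"
print(tokenize(sample))       # token sequence
print(bag_of_tokens(sample))  # token -> count
```

Representations like this feed directly into classifiers for tasks such as the language-classification experiments reported in the paper, though stronger baselines use sequence or graph encodings of the code.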
Author Information
Ruchir Puri (IBM)
David Kung (International Business Machines)
Geert Janssen (IBM Research)
Wei Zhang (IBM T.J.Watson Research Center)
BE, Beijing University of Technology (2005); MSc, Technical University of Denmark (2008); PhD, University of Wisconsin-Madison (2013), all in computer science. Published papers in ASPLOS, OOPSLA, OSDI, PLDI, IJCAI, ICDM, and NIPS.
Giacomo Domeniconi (International Business Machines)
Vladimir Zolotov (International Business Machines)
Julian T Dolby (IBM Thomas J. Watson Research Center)
Jie Chen (IBM Research)
Mihir Choudhury (Rice University)
Lindsey Decker (International Business Machines)
Veronika Thost (IBM Research, MIT-IBM Watson AI Lab)
Luca Buratti (IBM)
Saurabh Pujar (IBM)
Shyam Ramji (IBM Research)
Ulrich Finkler (New York University)
Susan Malaika (International Business Machines)
Frederick Reiss
More from the Same Authors
- 2022: Retrosynthesis Prediction Revisited
  Hongyu Tu · Shantam Shorewala · Tengfei Ma · Veronika Thost
- 2022: Graph Neural Networks for Selection of Preconditioners and Krylov Solvers
  Ziyuan Tang · Hong Zhang · Jie Chen
- 2022: c-MBA: Adversarial Attack for Cooperative MARL Using Learned Dynamics Model
  Nhan H Pham · Lam Nguyen · Jie Chen · Thanh Lam Hoang · Subhro Das · Lily Weng
- 2022 Workshop: Graph Learning for Industrial Applications: Finance, Crime Detection, Medicine and Social Media
  Manuela Veloso · John Dickerson · Senthil Kumar · Eren K. · Jian Tang · Jie Chen · Peter Henstock · Susan Tibbs · Ani Calinescu · Naftali Cohen · C. Bayan Bruss · Armineh Nourbakhsh
- 2022 Poster: Robustness to Unbounded Smoothness of Generalized SignSGD
  Michael Crawshaw · Mingrui Liu · Francesco Orabona · Wei Zhang · Zhenxun Zhuang
- 2020 Workshop: KR2ML - Knowledge Representation and Reasoning Meets Machine Learning
  Veronika Thost · Kartik Talamadupula · Vivek Srikumar · Chenwei Zhang · Josh Tenenbaum
- 2020 Poster: A Decentralized Parallel Algorithm for Training Generative Adversarial Nets
  Mingrui Liu · Wei Zhang · Youssef Mroueh · Xiaodong Cui · Jarret Ross · Tianbao Yang · Payel Das
- 2020 Poster: ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training
  Chia-Yu Chen · Jiamin Ni · Songtao Lu · Xiaodong Cui · Pin-Yu Chen · Xiao Sun · Naigang Wang · Swagath Venkataramani · Vijayalakshmi (Viji) Srinivasan · Wei Zhang · Kailash Gopalakrishnan
- 2020 Expo Talk Panel: AI4Code @ IBM and Red Hat
  Kartik Talamadupula · Julian T Dolby · Kavitha Srinivas · Fridolín Pokorný · Maja Vukovic · Anup K Kalia · Alessandro Morari
- 2019 Workshop: KR2ML - Knowledge Representation and Reasoning Meets Machine Learning
  Veronika Thost · Christian Muise · Kartik Talamadupula · Sameer Singh · Christopher Ré
- 2019 Poster: Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks
  Xiao Sun · Jungwook Choi · Chia-Yu Chen · Naigang Wang · Swagath Venkataramani · Vijayalakshmi (Viji) Srinivasan · Xiaodong Cui · Wei Zhang · Kailash Gopalakrishnan
- 2018 Poster: Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks
  Xiaodong Cui · Wei Zhang · Zoltán Tüske · Michael Picheny
- 2018 Poster: Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders
  Tengfei Ma · Jie Chen · Cao Xiao
- 2017 Poster: Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
  Xiangru Lian · Ce Zhang · Huan Zhang · Cho-Jui Hsieh · Wei Zhang · Ji Liu
- 2017 Oral: Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
  Xiangru Lian · Ce Zhang · Huan Zhang · Cho-Jui Hsieh · Wei Zhang · Ji Liu