Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
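To make the obfuscation step concrete, here is a minimal sketch in Python of how a training pair for such an objective might be built. The abstract does not describe the procedure, so everything here is an assumption made for illustration: the helper names `NameObfuscator` and `obfuscate`, the placeholder format (`FUNC_0`, `VAR_0`), and the use of Python's `ast` module. The model's input would be the obfuscated source, and its target would be the mapping from placeholders back to the original names.

```python
import ast
import builtins


class NameObfuscator(ast.NodeTransformer):
    """Hypothetical helper: swap user-defined names for placeholder tokens."""

    def __init__(self):
        self.mapping = {}  # original name -> placeholder (assumed format)

    def _placeholder(self, name, prefix):
        # Assign the next free index for this prefix on first sight of a name.
        if name not in self.mapping:
            used = sum(1 for p in self.mapping.values() if p.startswith(prefix + "_"))
            self.mapping[name] = f"{prefix}_{used}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name, "FUNC")
        self.generic_visit(node)  # visits the arguments, then the body
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        elif not hasattr(builtins, node.id):
            # Leave builtins such as print or len untouched.
            node.id = self._placeholder(node.id, "VAR")
        return node


def obfuscate(source):
    """Return (obfuscated_source, targets), where targets maps each
    placeholder to the original name the model would have to recover."""
    obfuscator = NameObfuscator()
    tree = obfuscator.visit(ast.parse(source))
    targets = {placeholder: name for name, placeholder in obfuscator.mapping.items()}
    return ast.unparse(tree), targets  # ast.unparse needs Python 3.9+


if __name__ == "__main__":
    code = "def add(x, y):\n    total = x + y\n    return total\n"
    obfuscated, targets = obfuscate(code)
    print(obfuscated)
    print(targets)
```

This sketch renames only function names, arguments, and plain variable names. It ignores classes, attributes, and imports, and it would label a function as a variable if the function is called before it is defined, so it illustrates the idea rather than reproducing the authors' pipeline.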
Author Information
Marie-Anne Lachaux (Facebook AI Research)
Baptiste Roziere (Facebook AI Research and Paris-Dauphine University)
Marc Szafraniec (Facebook AI Research)
Guillaume Lample (Facebook AI Research)
More from the Same Authors
- 2022: Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
  Albert Jiang · Sean Welleck · Jin Peng Zhou · Timothee Lacroix · Jiacheng Liu · Wenda Li · Mateja Jamnik · Guillaume Lample · Yuhuai Wu
- 2022 Poster: HyperTree Proof Search for Neural Theorem Proving
  Guillaume Lample · Timothee Lacroix · Marie-Anne Lachaux · Aurelien Rodriguez · Amaury Hayat · Thibaut Lavril · Gabriel Ebner · Xavier Martinet
- 2020 Poster: Unsupervised Translation of Programming Languages
  Baptiste Roziere · Marie-Anne Lachaux · Lowik Chanussot · Guillaume Lample
- 2020 Poster: Continuous Surface Embeddings
  Natalia Neverova · David Novotny · Marc Szafraniec · Vasil Khalidov · Patrick Labatut · Andrea Vedaldi
- 2020 Poster: Adversarial Attacks on Linear Contextual Bandits
  Evrard Garcelon · Baptiste Roziere · Laurent Meunier · Jean Tarbouriech · Olivier Teytaud · Alessandro Lazaric · Matteo Pirotta
- 2019 Poster: Large Memory Layers with Product Keys
  Guillaume Lample · Alexandre Sablayrolles · Marc'Aurelio Ranzato · Ludovic Denoyer · Herve Jegou
- 2019 Spotlight: Large Memory Layers with Product Keys
  Guillaume Lample · Alexandre Sablayrolles · Marc'Aurelio Ranzato · Ludovic Denoyer · Herve Jegou