CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching
Zeping Yu · Wenxin Zheng · Jiaqi Wang · Qiyi Tang · Sen Nie · Shi Wu

Wed Dec 09 09:00 PM -- 11:00 PM (PST) @ Poster Session 4 #1234

Binary source code matching, especially at the function level, plays a critical role in computer security. Given only binary code, finding the corresponding source code improves the accuracy and efficiency of reverse engineering. Given only source code, retrieving the related binary code helps confirm known vulnerabilities. However, because source and binary code differ so greatly, few studies have investigated binary source code matching. Previously published studies focus on extracting code literals such as strings and integers, then apply traditional matching algorithms such as the Hungarian algorithm. Nevertheless, these methods are limited at the function level because they ignore the latent semantic features of code, and much code lacks sufficient literals. These methods also rely on expert experience for feature identification and feature engineering, which is time-consuming. This paper proposes an end-to-end cross-modal retrieval network for binary source code matching that achieves higher accuracy and requires less expert experience. We adopt a Deep Pyramid Convolutional Neural Network (DPCNN) for source code feature extraction and a Graph Neural Network (GNN) for binary code feature extraction. We also use neural network-based models to capture code literals, including strings and integers. Furthermore, we implement "norm weighted sampling" for negative sampling. We evaluate our model on two datasets, where it significantly outperforms other methods.
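The retrieval step that follows the two encoders can be illustrated with a minimal sketch. It assumes each binary function and each source function has already been mapped to a fixed-length embedding vector, and it ranks candidates by cosine similarity. The abstract does not state the similarity measure, so cosine similarity is an assumption here, and the toy vectors and the function names (`cosine`, `rank_candidates`, `recall_at_k`) are hypothetical, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(binary_embs, source_embs, k=1):
    """Fraction of binary functions whose true source function
    (assumed to share the same index) appears in the top-k results."""
    hits = 0
    for i, query in enumerate(binary_embs):
        if i in rank_candidates(query, source_embs)[:k]:
            hits += 1
    return hits / len(binary_embs)


# Hypothetical toy embeddings standing in for encoder outputs
# (binary side: e.g. a graph encoder; source side: e.g. a text encoder).
binary_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
source_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 0.9]]

if __name__ == "__main__":
    print(rank_candidates(binary_embs[0], source_embs))
    print(recall_at_k(binary_embs, source_embs, k=1))
```

The same ranking works in the other direction (source query, binary candidates) by swapping the two arguments, which corresponds to the two use cases the abstract describes.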

Author Information

Zeping Yu (Tencent Security Keen Lab)
Wenxin Zheng (Tencent Keen Lab, Shanghai Jiao Tong University)
Jiaqi Wang (Tencent Keen Lab)
Qiyi Tang (Tencent Keen Lab)
Sen Nie (Tencent Keen Lab)
Shi Wu (Tencent Keen Lab)