Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Second Workshop on Quantum Tensor Networks in Machine Learning

Codee: A Tensor Embedding Scheme for Binary Code Search

Jia Yang · Cai Fu · Xiao-Yang Liu


Abstract:

Given a target binary function, the binary code search retrieves top-K similar functions in the repository, and similar functions represent that they are compiled from the same source codes. Searching binary code is particularly challenging due to large variations of compiler tool-chains and options and CPU architectures, as well as thousands of binary codes. Furthermore, there are some pivotal issues in current binary code search schemes, including inaccurate text-based or token-based analysis, slow graph matching, or complex deep learning processes. In this paper, we present an unsupervised tensor embedding scheme, Codee, to carry out code search efficiently and accurately at the binary function level. First, we use an NLP-based neural network to generate the semantic-aware token embedding. Second, we propose an efficient basic block embedding generation algorithm based on the network representation learning model. We learn both the semantic information of instructions and the control flow structural information to generate the basic block embedding. Then we use all basic block embeddings in a function to obtain a variable-length function feature vector. Third, we build a tensor to generate function embeddings based on the tensor singular value decomposition, which compresses the variable-length vectors into short fixed-length vectors to facilitate efficient search afterward. We further propose a dynamic tensor compression algorithm to incrementally update the function embedding database. Finally, we use the local sensitive hash method to find the top-K similar matching functions in the repository. Compared with state-of-the-art cross-platform and cross-optimization-level code search schemes,our scheme achieves higher average search accuracy, shorter feature vectors, and faster feature generation performance using four datasets, OpenSSL, Coreutils, libgmp and libcurl.

Chat is not available.