CN107229563B

CN107229563B - Cross-architecture binary program vulnerability function association method

Info

Publication number: CN107229563B
Application number: CN201610178368.6A
Authority: CN
Inventors: 石志强; 常青; 陈昱; 王猛涛; 孙利民; 朱红松
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2016-03-25
Filing date: 2016-03-25
Publication date: 2020-07-10
Anticipated expiration: 2036-03-25
Also published as: CN107229563A

Abstract

The invention discloses a cross-architecture binary program vulnerability function association method. The method comprises the following steps: 1) performing reverse analysis on the binary file of a binary program to obtain a library of functions to be tested, then obtaining the function call graph, the function control flow graphs and the basic function attributes from that library; 2) extracting the features of each function to be tested from the function call graph, the function control flow graph and the basic function attributes, then calculating the numerical similarity between each function to be tested and the vulnerability function from the extracted features and the features of the vulnerability function; 3) for each function to be tested, constructing weighted bipartite graphs for the function to be tested and the vulnerability function, and calculating their overall similarity with a bipartite graph matching algorithm; 4) if the overall similarity between the function to be tested and the vulnerability function is greater than a set judgment threshold, judging the function to be tested to be a suspected vulnerability function, and otherwise judging it to be a normal function. The method is simple to implement and easy to popularize.

Description

Cross-architecture binary program vulnerability function association method

Technical Field

The invention relates to the field of binary program vulnerability mining and reverse analysis, in particular to a cross-architecture binary program vulnerability function association method, and belongs to the technical field of computer program detection.

Background

With the rapid development of global information technology and the rapid spread of information systems and information products, computer software has become an important part of the world's economic, scientific, military and social development. Practice shows that most information security incidents are launched by attackers through software vulnerabilities. Security vulnerabilities are therefore a decisive factor directly affecting the security of information systems, and it is necessary to analyze and utilize software vulnerabilities. Vulnerability analysis can be divided into source-code-level and binary-level analysis according to the object analyzed. Source-code-level vulnerability analysis directly analyzes a program written in a high-level language: using the rich and complete semantic information in the source code, an analyst can find coding errors and design defects in the program with a range of vulnerability analysis techniques. In practice, however, a large amount of commercial software exists only as binary code, and its source code is difficult to obtain. Binary program vulnerability analysis is therefore becoming an important branch of the information security field.

Early work addressed the scenario in which the similarity of two binary files compiled for the same architecture is calculated in order to associate their functions. Because both files are compiled for the same architecture, the assembly obtained by disassembling them uses the same instruction set, so it can be treated as a character string and analyzed for similarity directly. In 2013, Arun Lakhotia proposed a semantic-template method for quickly locating similar code segments, and Yaniv David proposed a method for calculating the similarity of basic blocks using string edit distance. However, even for the same source code, binaries compiled with different optimization options disassemble into very different assembly, which means that methods relying on the surface form of the assembly are sensitive to compilation optimization options. Researchers therefore shifted their focus to information that depends less on the surface form, and began to extract semantic information from program segments as features for function association, including association across architectures.

At present, a cross-architecture binary program vulnerability association technique that is both simple to implement and highly accurate is lacking.

Disclosure of Invention

The invention aims to provide a cross-architecture binary program vulnerability function association method. The method mainly comprises the following steps: performing reverse analysis on the binary file to obtain a library of functions to be tested, and calculating the numerical similarity between each function to be tested and the vulnerability function; extracting the local structure information of the two functions being compared from the function call graphs to form two structural subgraphs; abstracting the two structural subgraphs layer by layer into weighted bipartite graphs, calculating the maximum weight matching of each weighted bipartite graph with a bipartite graph matching algorithm, taking the weighted sum of the maximum weight matchings as the overall similarity of the two functions, and ranking the functions to be tested by overall similarity; and calculating a judgment threshold from the ROC curve, judging any function whose similarity is greater than the judgment threshold to be a suspected vulnerability function that undergoes further analysis, and otherwise judging the function to be a normal function that is not processed further.

The technical innovations of the method are the algorithm for reconstructing the function control flow graph, used when calculating the numerical similarity, and the structured matching algorithm, used when calculating the overall similarity. The method combines the numerical information and the structural information of a function, its feature extraction does not depend on a specific instruction set, and it can associate functions across binary files of different architectures with high accuracy and a simple implementation.

In order to achieve the purpose, the invention adopts the following technical scheme:

A cross-architecture binary program vulnerability function association method, mainly comprising the following 3 steps:

1) Calculating the numerical similarity between the function to be tested and the vulnerability function. First, reverse analysis is performed on the binary file to obtain the library of functions to be tested. The call relations among the functions to be tested (that is, the function call graph), the control flow graph of each function and the basic attribute information of each function are extracted and converted to numerical form to obtain a feature vector for each function. A self-compiled multi-platform function set with symbol tables is used as the training sample to train the integrated classifier. The similarity of each feature of the function to be tested and the vulnerability function is calculated to form a similarity vector, which is fed into the integrated classifier for prediction to obtain the numerical similarity.

2) Constructing weighted bipartite graphs and calculating the overall similarity with a bipartite graph matching algorithm. The local structure information of the two functions being compared is extracted from the function call graphs to form two structural subgraphs; the number of layers extracted can be chosen according to actual needs. The two structural subgraphs are abstracted layer by layer into weighted bipartite graphs, in which the node set consists of the functions contained in the corresponding layers of the two structural subgraphs, the edge set is the similarity relation between any two functions, and the edge weight is the numerical similarity calculated in the previous step. The maximum weight matching of each weighted bipartite graph is then calculated layer by layer with a bipartite graph matching algorithm, and the weighted sum is taken as the overall similarity between the function to be tested and the vulnerability function.

3) Making the judgment with a judgment threshold calculated from the ROC curve. The overall similarities between the set of functions to be tested and the vulnerability function form a vector from which an ROC curve is drawn. The threshold corresponding to the highest point of the Y-X curve is taken as the judgment threshold; a function whose similarity is greater than the judgment threshold is judged to be a suspected vulnerability function, and otherwise a normal function. If each point of the ROC curve is (X, Y), the curve formed by the points (X, Y-X) is the Y-X curve based on the ROC curve, where the domain of X is M.

The invention can obtain the following beneficial effects:

When the numerical similarity between the function to be tested and the vulnerability function is calculated, nine kinds of features are mainly considered: call relation features, stack space features, character string features, code scale features, path sequence features, path basic features, degree sequence features, degree basic features and graph scale features. Together they reflect the typical characteristics of a function fairly completely, and their extraction does not depend on a specific instruction set, so vulnerability association can be performed on binary files compiled for two different architectures. The features are extracted from the IDA analysis results by a purpose-written IDA plug-in. Because IDA constructs function control flow graphs differently when it reverse-analyzes binary files of different architectures, the function control flow graph is reconstructed before features are extracted from it.

When combining the numerical information and the structural information of a function, the invention extracts part of the function call graph and constructs weighted bipartite graphs in order to calculate maximum weight matchings. On the assumption that the closer a function node is to the function to be tested, the more it contributes to the match, the function nodes are layered by their hop count from the function to be tested, maximum weight bipartite matching is performed on the function nodes of each layer with the Kuhn-Munkres algorithm to obtain the per-layer similarity, and the layer similarities are finally weighted and summed to obtain the overall similarity of the functions. In this way, when the overall similarity of a function pair is calculated, the influence of the similarity of other function pairs on that pair is taken into account on the basis of the call information among functions. Compared with a method that uses numerical values alone, this is more objective and accurate.

Compared with the prior art, the invention does not depend on a specific instruction set, can perform vulnerability association on binary files of different architectures, and is simple to implement and easy to popularize.

Drawings

FIG. 1 is a schematic flow diagram of the scheme;

FIG. 2 is a schematic diagram of reconstructing a function control flow graph;

FIG. 3 is a schematic diagram of the layering of structural subgraphs;

FIG. 4 is a schematic diagram of constructing a weighted bipartite graph;

FIG. 5 is a schematic diagram of determining an optimal threshold value based on a ROC curve.

Detailed Description

A cross-architecture binary program vulnerability association method is implemented as follows:

1) Write an IDA plug-in to perform reverse analysis on the binary file, obtaining the library of functions to be tested together with the basic function attributes, the function call graph and the function control flow graphs.

2) Calculate the numerical similarity between the function to be tested and the vulnerability function. The whole process comprises three steps: numerical feature extraction, similarity calculation and neural network similarity prediction.

In the numerical feature extraction stage, numerical features are extracted from three sources: the basic attributes of the function, the function call graph and the function control flow graph. Nine kinds of features of the function to be tested are extracted: call relation features, character string features, stack space features, code scale features, path sequence features, path basic features, degree sequence features, degree basic features and graph scale features. These nine kinds of features reflect the typical properties of a function fairly completely.

The function call graph is analyzed to calculate, for each function to be tested, the number of times it is called by other functions, the number of times it calls other functions, and the number of distinct functions it calls after deduplication; these form the call relation feature.
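The call relation feature described above can be sketched as follows. This is only an illustration: the function name `call_relation_features` and the representation of the call graph as a list of (caller, callee) pairs, one per call site, are assumptions and do not come from the patent.

```python
def call_relation_features(call_edges, func):
    """Return (times called, calls made, distinct callees) for `func`.

    `call_edges` is a hypothetical list of (caller, callee) pairs with one
    entry per call site, so repeated calls appear as repeated pairs.
    """
    called_by = sum(1 for _, callee in call_edges if callee == func)
    callees = [callee for caller, callee in call_edges if caller == func]
    return called_by, len(callees), len(set(callees))


edges = [("main", "parse"), ("main", "parse"), ("main", "log"),
         ("parse", "log"), ("init", "main")]
print(call_relation_features(edges, "main"))   # (1, 3, 2)
```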

The basic attributes of the function are analyzed: the stack space is calculated to form the stack space feature; the number of jump instructions, the number of instructions and the code size are calculated to form the code scale feature; and the number of referenced character strings and the set of referenced character strings are calculated to form the character string feature.

The function control flow graph (CFG) produced by IDA cannot be used directly for feature extraction and must first be processed. In a few cases the CFGs of the same function differ greatly between architectures; for example, the CFG of the memcap_main function of busybox under the ARM architecture differs greatly from its CFG under the MIPS architecture. This is because the CPU instruction set of each platform is handled by its own IDA processor module, and the processor modules use different strategies to generate the CFG. For example, in the rmdir_main function of busybox, the bl instruction on the ARM platform splits a basic block, whereas jal (likewise a function call instruction) on the MIPS platform does not. To unify the basic block division rules of the CFG, the CFG is reconstructed with the following algorithm:

a) Identify the head and tail addresses of all basic blocks of the function and the endpoint addresses of the original edges.

b) Sort all the basic blocks in ascending order of head address, and count the in-degree and out-degree of each basic block.

c) Scan the basic blocks in ascending order of head address. If the out-degree of the nth basic block is 0 and the in-degree of the (n+1)th basic block is 0, merge the two basic blocks into a new nth basic block, delete the original nth and (n+1)th basic blocks, and update every edge that has the head address of the original (n+1)th basic block as an endpoint address so that it uses the head address of the nth basic block instead. If the out-degree of the nth basic block is 0 and the in-degree of the (n+1)th basic block is not 0, add an edge pointing from the nth basic block to the (n+1)th basic block, whose start point is the head address of the nth basic block and whose end point is the head address of the (n+1)th basic block.

d) The reconstruction is complete once the last basic block has been scanned.

The CFG reconstruction algorithm is implemented in Python. Its input parameter bbList is a list of the head and tail addresses of all basic blocks, edgeList is a list of all the original edges from the IDA analysis, and startPoint is the function entry address; its output toDic is a dictionary of all edges of the reconstructed CFG, and bbDic is a dictionary of all basic blocks of the reconstructed CFG. The effect of reconstructing the memcap_main function of busybox is shown in FIG. 2.
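The original listing is not reproduced in this text, so the following is only a sketch of steps a) to d) that reuses the parameter names given above. The function name `rebuild_cfg`, the layout of bbList as (head, tail) pairs and of edgeList as (source head, target head) pairs, and the rule that the entry block at startPoint is never merged into a preceding block are assumptions made for illustration.

```python
def rebuild_cfg(bbList, edgeList, startPoint):
    """Rebuild a CFG by merging or linking address-adjacent basic blocks.

    bbList:     list of (head, tail) address pairs, one per basic block
    edgeList:   list of (source head, target head) pairs (assumed layout)
    startPoint: entry address of the function
    Returns (toDic, bbDic): toDic maps each block head to the sorted heads
    of its successors; bbDic maps each block head to the block's tail.
    """
    blocks = sorted(bbList)                    # step b): ascending head address
    if not blocks:
        return {}, {}
    edges = set(edgeList)
    in_deg = {head: 0 for head, _ in blocks}
    out_deg = {head: 0 for head, _ in blocks}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    merged = [blocks[0]]
    for head, tail in blocks[1:]:              # step c): scan in address order
        prev_head = merged[-1][0]
        if out_deg[prev_head] == 0 and in_deg[head] == 0 and head != startPoint:
            # Merge: the previous block absorbs this one and inherits its edges.
            merged[-1] = (prev_head, tail)
            edges = {(prev_head if s == head else s, d) for s, d in edges}
            out_deg[prev_head] = out_deg[head]
            continue
        if out_deg[prev_head] == 0 and in_deg[head] != 0:
            # Link: add an edge from the previous block to this block.
            edges.add((prev_head, head))
            out_deg[prev_head] += 1
            in_deg[head] += 1
        merged.append((head, tail))

    bbDic = dict(merged)
    toDic = {head: [] for head in bbDic}
    for src, dst in sorted(edges):
        toDic[src].append(dst)
    return toDic, bbDic


# Hypothetical ARM-like layout: the first block ends at a call and has no
# outgoing edge, so it is merged with the block that follows it.
toDic, bbDic = rebuild_cfg([(0x00, 0x08), (0x0C, 0x14), (0x18, 0x1C)],
                           [(0x0C, 0x18)], 0x00)
print(bbDic)   # {0: 20, 24: 28}
print(toDic)   # {0: [24], 24: []}
```

Because a merged block inherits the out-degree of the block it absorbed, a run of several split blocks collapses into one block during a single scan.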

The function control flow graph is then analyzed. The in-degree and out-degree of each node (that is, each basic block) are calculated and the adjacency matrix of the directed CFG is constructed; the function control flow graph is converted into an undirected graph, the degree of each node is calculated, and the adjacency matrix of the undirected CFG is constructed. Degree analysis is performed on both adjacency matrices: the ascending in-degree sequence and the ascending out-degree sequence are calculated from the directed adjacency matrix, and the ascending degree sequence is calculated from the undirected adjacency matrix. The ascending in-degree, out-degree and degree sequences form the degree sequence feature.
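A minimal sketch of the degree sequence feature, assuming the basic blocks have been renumbered 0..n-1 and the CFG edges are given as (source, target) index pairs; the function name `degree_sequences` is illustrative.

```python
def degree_sequences(num_nodes, edges):
    """Return the ascending in-degree, out-degree and degree sequences."""
    directed = [[0] * num_nodes for _ in range(num_nodes)]
    undirected = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in edges:
        directed[src][dst] = 1
        undirected[src][dst] = undirected[dst][src] = 1
    out_degrees = sorted(sum(row) for row in directed)        # row sums
    in_degrees = sorted(sum(col) for col in zip(*directed))   # column sums
    degrees = sorted(sum(row) for row in undirected)
    return in_degrees, out_degrees, degrees


# Diamond-shaped CFG: block 0 branches to 1 and 2, which both reach 3.
print(degree_sequences(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))
# ([0, 1, 1, 2], [0, 1, 1, 2], [2, 2, 2, 2])
```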

The maximum degree, the average degree and the probability sequence of the degrees are calculated from the ascending degree sequence, and the entropy of the graph is calculated from the degree probability sequence; these form the degree basic feature. Path analysis is performed on the undirected CFG adjacency matrix: the shortest distance between any two nodes (that is, basic blocks) is calculated with the Floyd algorithm or the Dijkstra algorithm to form the path sequence feature, and the average path length, diameter and radius of the graph are calculated to form the path basic feature. Basic attribute analysis is performed on the directed CFG adjacency matrix: the number of nodes, the number of edges, the link ratio, the density and the clustering coefficient of the graph are calculated to form the CFG graph scale feature.
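The degree basic feature and the path features might be computed along the following lines. The patent does not define the graph entropy precisely; this sketch assumes the Shannon entropy (base 2) of the degree distribution, and it assumes a connected undirected graph with at least two nodes. All function names are illustrative.

```python
import math
from collections import Counter


def degree_basic_features(degrees):
    """Return (max degree, average degree, degree probabilities, entropy)."""
    counts = Counter(degrees)
    n = len(degrees)
    probs = [counts[k] / n for k in sorted(counts)]
    entropy = -sum(p * math.log2(p) for p in probs)
    return max(degrees), sum(degrees) / n, probs, entropy


def shortest_paths(adj):
    """Floyd algorithm on an undirected 0/1 adjacency matrix."""
    n = len(adj)
    dist = [[0 if i == j else (1 if adj[i][j] else math.inf)
             for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def path_basic_features(dist):
    """Return (average path length, diameter, radius) of a connected graph."""
    n = len(dist)
    eccentricity = [max(row) for row in dist]
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(pairs) / len(pairs), max(eccentricity), min(eccentricity)


# Path graph 0 - 1 - 2 - 3.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(degree_basic_features([1, 2, 2, 1]))        # (2, 1.5, [0.5, 0.5], 1.0)
print(path_basic_features(shortest_paths(adj)))   # diameter 3, radius 2
```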

Following these steps, the call relation feature, character string feature, stack space feature, code scale feature, path sequence feature, path basic feature, degree sequence feature, degree basic feature and graph scale feature of each function are extracted.

In the feature similarity calculation stage, the similarity of each feature of the two functions being compared is calculated according to the form of the feature, using a numerical similarity calculation, a sequence similarity calculation based on the string edit distance algorithm, or a set similarity calculation based on Jaccard similarity. The resulting similarities form the input vector of the integrated classifier.
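The three kinds of similarity could look like the sketch below. The patent does not give the formulas: the numerical similarity used here (one minus the relative difference, for non-negative values) and the normalization of the edit distance by the longer sequence length are assumptions, and the function names are illustrative.

```python
def numeric_similarity(a, b):
    """Similarity of two non-negative numbers (assumed formula)."""
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(a, b)


def edit_distance(s, t):
    """Edit distance between two sequences (strings or lists)."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x with y
        prev = cur
    return prev[-1]


def sequence_similarity(s, t):
    """Edit-distance similarity, normalized by the longer length."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))


def jaccard_similarity(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(numeric_similarity(48, 64))                        # 0.75
print(sequence_similarity([0, 1, 1, 2], [0, 1, 2, 2]))   # 0.75
print(jaccard_similarity({"usage", "%s"}, {"usage", "error"}))
```

Stack space and instruction counts would use the numerical form, degree and path sequences the sequence form, and character string sets the set form.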

In the similarity prediction stage, the integrated classifier is first trained with a self-compiled multi-platform function set with symbol tables as training samples. Specifically, the same source code is compiled with different compilers and different optimization options for different architectures to obtain several binary executable files. Reverse analysis is performed on each binary executable file to obtain a function library, and the multi-dimensional features of each function are extracted. Based on these features, the similarity is calculated for every pair of functions drawn from two different function libraries and used as an input vector of the integrated classifier. If the two function names are the same, the label is 1 and the pair is a positive sample; if the names differ, the label is 0 and the pair is a negative sample. Several initial classifiers are established. Several independent and identically distributed sub-training sample sets are constructed by drawing 80% of the samples from the initial sample set with replacement, one set for each classifier. Each sub-training sample set is fed to its classifier for training, and the parameters of the classifier are adjusted according to the prediction results until they meet the requirements, at which point training of that classifier is complete. The trained integrated classifier is then used to predict the numerical similarity: features are extracted from the vulnerability function and from each function to be tested, and the similarity vector is calculated as a test sample. The classifiers of the trained integrated classifier each produce a predicted value, and the weighted average of these predicted values is taken as the final predicted value, that is, the numerical similarity.
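The sampling and averaging scheme of the integrated (ensemble) classifier can be sketched as follows. The base classifier here is a toy nearest-centroid scorer chosen only to keep the example self-contained; it is a stand-in and not the classifier used by the patent. All names, the equal weights and the sample data are hypothetical.

```python
import random


def bootstrap_subsets(samples, n_subsets, fraction=0.8, seed=0):
    """Draw sub-training sets of 80% size with replacement from `samples`."""
    rng = random.Random(seed)
    size = max(2, int(len(samples) * fraction))
    subsets = []
    while len(subsets) < n_subsets:
        subset = [rng.choice(samples) for _ in range(size)]
        if {label for _, label in subset} == {0, 1}:   # keep both classes
            subsets.append(subset)
    return subsets


def train_centroid_model(subset):
    """Toy base classifier: score by relative distance to class centroids."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = centroid([x for x, y in subset if y == 1])
    neg = centroid([x for x, y in subset if y == 0])

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos)) ** 0.5
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg)) ** 0.5
        return d_neg / (d_pos + d_neg) if d_pos + d_neg else 0.5

    return predict


def ensemble_predict(models, weights, x):
    """Weighted average of the predictions of all base classifiers."""
    return sum(w * m(x) for m, w in zip(models, weights)) / sum(weights)


# Hypothetical similarity vectors with labels (1 = same function name).
samples = [([0.9, 0.8], 1), ([0.95, 0.9], 1), ([0.85, 0.9], 1),
           ([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.15, 0.3], 0)]
models = [train_centroid_model(s) for s in bootstrap_subsets(samples, 3)]
weights = [1.0, 1.0, 1.0]
print(ensemble_predict(models, weights, [0.9, 0.9]) > 0.5)   # True
```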

For example, suppose training samples are needed for the matching pattern MIPS-O2 → ARM-O2.

Step one: for the MIPS architecture, compile the openssl source code with the -O2 optimization option into a binary file named openssl-MIPS-O2; for the ARM architecture, compile the openssl source code with the -O2 optimization option into a binary file named openssl-ARM-O2.

Step two: perform reverse analysis on each of the two binary files to obtain two function libraries. The function library of openssl-MIPS-O2 contains m functions, named X₁-MIPS-O2, X₂-MIPS-O2, ..., X_m-MIPS-O2; the function library of openssl-ARM-O2 contains n functions, named Y₁-ARM-O2, Y₂-ARM-O2, ..., Y_n-ARM-O2. Features are calculated for all functions of the two libraries, giving m + n feature vectors in total.

Step three: calculate the function similarity vectors between the two libraries to obtain m × n similarity vectors. If X_i = Y_j, function X_i-MIPS-O2 of the openssl-MIPS-O2 library and function Y_j-ARM-O2 of the openssl-ARM-O2 library can be considered the same function, so the label column is 1 and the pair is a positive sample; otherwise the pair is a negative sample.

Step four: to balance positive and negative samples and to speed up processing, similarity calculation and labeling are performed on 100 openssl-MIPS-O2 functions and 100 openssl-ARM-O2 functions at a time, which yields 100 positive samples and 9900 negative samples. All positive samples are kept, and 100 of the 9900 negative samples are drawn at random as negative samples.

This results in min(m, n) positive samples and the same number of negative samples as the initial sample set for the matching pattern MIPS-O2 → ARM-O2.
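The labeling and balancing in steps three and four can be illustrated with a small sketch. It processes the two libraries in a single pass rather than in batches of 100 functions, and the function name `build_sample_set`, the dictionary layout of the libraries and the one-dimensional toy similarity are assumptions made for illustration.

```python
import random


def build_sample_set(lib_a, lib_b, similarity, seed=0):
    """Label cross-library function pairs and balance the two classes.

    lib_a, lib_b: dicts mapping a function name to its features.
    similarity:   callable returning the similarity vector of two features.
    Returns a list of (similarity vector, label) samples.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for name_a, feat_a in lib_a.items():
        for name_b, feat_b in lib_b.items():
            label = int(name_a == name_b)        # same name: positive sample
            sample = (similarity(feat_a, feat_b), label)
            (positives if label else negatives).append(sample)
    # Keep every positive and draw an equal number of negatives at random.
    negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + negatives


# Hypothetical libraries whose only feature is the instruction count.
mips = {"SSL_read": 120, "SSL_write": 130, "helper_mips": 40}
arm = {"SSL_read": 110, "SSL_write": 125, "helper_arm": 35}
sim = lambda a, b: [1.0 - abs(a - b) / max(a, b)]
samples = build_sample_set(mips, arm, sim)
print(len(samples), sum(label for _, label in samples))   # 4 2
```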

3) Construct weighted bipartite graphs and calculate the overall similarity with a bipartite graph matching algorithm (such as the Kuhn-Munkres algorithm).

The whole algorithm comprises the following steps:

a) Extract the local structure information of the functions being compared from the function call graphs to form two structural subgraphs; the number of layers extracted can be chosen according to experimental results.

b) Layer each extracted structural subgraph by hop count from the function being compared, and assign each layer a weight according to its importance to the function being compared, as shown in FIG. 3. (If the structural subgraph comes from the function call graph of the binary file containing the vulnerability function, the function being compared is the vulnerability function; if it comes from the function call graph of the binary file containing the function to be tested, the function being compared is the function to be tested.)

c) Abstract each pair of corresponding layers of the two subgraphs into a weighted complete bipartite graph, in which the node set consists of the functions contained in the corresponding layers, the edge set is the similarity relation between any two functions in the node set, and the edge weight is the numerical similarity of the two functions, as shown in FIG. 4. This yields several weighted bipartite graphs.

d) For each weighted bipartite graph, calculate the maximum weight matching with a bipartite graph matching algorithm and take it as the similarity of the corresponding layer.

e) Take the weighted sum of the layer similarities as the overall similarity of the functions being compared.
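Steps c) to e) can be sketched as follows. To stay self-contained, the maximum weight matching is found by exhaustive search over permutations, which is only practical for small layers and stands in for the Kuhn-Munkres algorithm named above. The layer weights, function names and similarity values are hypothetical, and the layer similarity is left unnormalized because the text does not specify a normalization.

```python
from itertools import permutations


def max_weight_matching(left, right, sim):
    """Maximum total weight of a matching in a complete bipartite graph.

    Exhaustive search, assuming non-negative weights; a stand-in for the
    Kuhn-Munkres algorithm that is only suitable for small layers.
    """
    if len(left) > len(right):
        return max_weight_matching(right, left, lambda a, b: sim(b, a))
    best = 0.0
    for chosen in permutations(right, len(left)):
        best = max(best, sum(sim(a, b) for a, b in zip(left, chosen)))
    return best


def overall_similarity(layers_a, layers_b, layer_weights, sim):
    """Weighted sum of the per-layer maximum weight matchings."""
    return sum(weight * max_weight_matching(layer_a, layer_b, sim)
               for layer_a, layer_b, weight
               in zip(layers_a, layers_b, layer_weights))


# Layer 0 holds the compared functions, layer 1 their one-hop neighbours.
table = {("f", "v"): 0.9,
         ("g1", "h1"): 0.8, ("g1", "h2"): 0.3,
         ("g2", "h1"): 0.4, ("g2", "h2"): 0.7}
sim = lambda a, b: table[(a, b)]
score = overall_similarity([["f"], ["g1", "g2"]], [["v"], ["h1", "h2"]],
                           [0.6, 0.4], sim)
print(round(score, 2))   # 1.14
```

For call graphs with large layers, an implementation of the Kuhn-Munkres algorithm would replace the exhaustive search without changing the rest of the sketch.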

4) Make the judgment with a judgment threshold calculated from the ROC curve. The overall similarities between the set of functions to be tested and the vulnerability function form a vector from which an ROC curve is drawn. The horizontal axis of the ROC curve is the false positive rate, FP/(FP + TN); the vertical axis is the true positive rate, TP/(TP + FN). The ROC curve shows how the false positive rate and the true positive rate change as the threshold varies, and can be used to compare the performance of classifiers. Ideally, the best classifier lies at the upper left corner, which means that it achieves a high true positive rate at a low false positive rate; that is, the true vulnerability functions are detected and few normal functions are misjudged as vulnerability functions. The point of the ROC curve closest to the upper left corner gives the best threshold with the least error: it is the point at which the total number of false positives and false negatives on the training set is smallest, that is, the point at which Y-X is largest, as shown in FIG. 5. The threshold corresponding to the highest point of the Y-X curve is therefore taken as the judgment threshold; a function whose similarity is greater than the judgment threshold is judged to be a suspected vulnerability function, and otherwise a normal function.
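Choosing the threshold at the highest point of the Y-X curve can be sketched as follows, assuming labeled overall similarities are available (1 for a true vulnerability function, 0 otherwise). The function name `best_threshold`, the use of the observed scores as candidate thresholds and the sample values are illustrative.

```python
def best_threshold(scores, labels):
    """Return (threshold, Y - X) at the highest point of the Y-X curve.

    A function is flagged when its score is greater than the threshold;
    Y is the true positive rate and X is the false positive rate.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_gap = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        gap = tp / positives - fp / negatives
        if gap > best_gap:
            best_t, best_gap = t, gap
    return best_t, best_gap


scores = [0.95, 0.9, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0]
print(best_threshold(scores, labels))   # (0.35, 0.75)
```

In this example, every score above 0.35 is flagged, which catches all three positives at the cost of one false positive among four negatives.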

In summary, the present invention discloses a cross-architecture binary program vulnerability correlation technique. The above description of the embodiments is not intended to limit the invention, and those skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention is defined by the scope of the claims.

Claims

1. A cross-architecture binary program vulnerability function correlation method comprises the following steps:

1) carrying out reverse analysis on a binary file of a binary program to obtain a function library to be tested; then, acquiring a function call graph, a function control flow graph and basic function attributes according to the function library to be tested;

2) extracting the characteristics of each function to be tested according to the function call graph, the function control flow graph and the function basic attribute; then, calculating the numerical similarity of each function to be tested and the vulnerability function according to the extracted features and the features of the vulnerability function;

3) for each function to be tested, constructing a weighted bipartite graph of the function to be tested and the vulnerability function respectively, and calculating the overall similarity of the function to be tested and the vulnerability function by adopting a bipartite graph algorithm;

4) if the overall similarity of the function to be tested and the vulnerability function is larger than a set judgment threshold value, the function to be tested is judged to be a suspected vulnerability function, otherwise, the function to be tested is judged to be a normal function.

2. The method of claim 1, wherein the numerical similarity is calculated by:

21) compiling the same source code into a plurality of binary executable files with different architectures; then, reversely analyzing each binary executable file to obtain a function library and extracting the characteristics of each function to be tested;

22) respectively selecting a function from two different function libraries, and calculating the similarity of the two selected functions based on the extracted features to be used as an input vector of the integrated classifier; if the two function names are the same, the label is 1, the corresponding input vector is used as a positive sample, otherwise, the corresponding input vector is used as a negative sample, and an initial sample set is obtained; wherein the ensemble classifier comprises a plurality of classifiers;

23) extracting a plurality of samples from the initial sample set, constructing a plurality of independent and identically distributed sub-training sample sets as training samples of each classifier in the integrated classifier;

24) respectively inputting the sub-training sample sets into the corresponding classifiers for training, predicting on the vulnerability function and the function to be tested with the trained classifiers based on the features of the vulnerability function and of each function to be tested, and then taking the weighted average of the obtained predicted values as the numerical similarity.

3. The method of claim 1 or 2, wherein the function control flow graph obtained in step 1) is reconstructed by:

a) identifying head and tail addresses and original edge endpoint addresses of all basic blocks of a function in a function control flow graph;

b) sequencing all the basic blocks according to the ascending sequence of the head addresses of the basic blocks, and counting the in-degree and out-degree of each basic block;

c) scanning the basic blocks in ascending order of head address: if the out-degree of the nth basic block is 0 and the in-degree of the (n+1)th basic block is 0, merging the two basic blocks into a new nth basic block, deleting the original nth and (n+1)th basic blocks, and changing every edge that has the head address of the original (n+1)th basic block as an endpoint address so that it has the head address of the nth basic block as that endpoint address; if the out-degree of the nth basic block is 0 and the in-degree of the (n+1)th basic block is not 0, adding an edge pointing from the nth basic block to the (n+1)th basic block, wherein one endpoint of the edge is the head address of the nth basic block and the other endpoint is the head address of the (n+1)th basic block.

4. The method of claim 1, wherein the overall similarity is calculated by:

a) extracting local structure information of the function to be tested from the function call graph to form a structure subgraph a, and extracting local structure information of the vulnerability function from the function call graph where the vulnerability function is located to form a structure subgraph b;

b) layering the intercepted structure subgraph a according to the hop count from the function to be tested and giving weight according to the importance degree of the function to be tested, thereby abstracting the corresponding layer of the structure subgraph a into a weighted bipartite graph respectively, layering the intercepted structure subgraph b according to the hop count from the vulnerability function and giving weight according to the importance degree of the vulnerability function, thereby abstracting the corresponding layer of the structure subgraph b into a weighted bipartite graph respectively; the node set is a function contained in the corresponding layer, the edge set is the similarity relation of any two functions in the node set, and the edge weight is the numerical similarity corresponding to the two functions;

c) adopting a bipartite matching algorithm to calculate the maximum weight matching layer by layer for each weighted bipartite graph as the similarity of each layer;

d) weighting and summing the similarity of each layer to obtain the overall similarity of the function to be tested and the vulnerability function.

5. The method of claim 1, wherein the judgment threshold is determined by: drawing an ROC curve from the obtained overall similarities of the functions to be tested and the vulnerability function, and taking the threshold corresponding to the highest point of the Y-X curve as the judgment threshold.