CN108268777B

CN108268777B - Similarity detection method for carrying out unknown vulnerability discovery by using patch information

Info

Publication number: CN108268777B
Application number: CN201810047837.XA
Authority: CN
Inventors: 梁彬; 李赞; 边攀; 石文昌
Original assignee: Renmin University of China
Current assignee: Renmin University of China
Priority date: 2018-01-18
Filing date: 2018-01-18
Publication date: 2020-06-30
Anticipated expiration: 2038-01-18
Also published as: CN108268777A

Abstract

The invention relates to a similarity detection method for discovering unknown vulnerabilities by using patch information, comprising the following steps: slicing a known vulnerable function and its patched counterpart to generate a vulnerability slice containing the vulnerability-related statements and a patch slice containing the patch statements; performing symbol normalization on the variable names, variable types and called function names in the functions to be tested, the vulnerability slice and the patch slice; mapping the functions to be tested, the vulnerability slice and the patch slice into a vector space to generate a feature vector for each function to be tested, a vulnerability feature vector and a patch feature vector, wherein the value of each dimension of a vector is the product of the number of occurrences of the corresponding feature statement in the function and its TF-IDF weight; and, after the feature vectors are generated, calculating and ranking the similarities between feature vectors to judge whether the set of functions to be tested contains unknown vulnerabilities whose features are similar to those of the known vulnerability. The method effectively reduces the interference of vulnerability-irrelevant statements and improves detection accuracy.

Description

Similarity detection method for carrying out unknown vulnerability discovery by using patch information

Technical Field

The invention relates to an unknown vulnerability detection method, in particular to a similarity detection method which is applied in the field of information security and utilizes patch information to discover unknown vulnerabilities.

Background

Vulnerabilities are a significant cause of software failures and errors, so vulnerability detection has long been a research focus in the field of software security. Static detection, one of the mainstream vulnerability detection technologies, has been proven capable of effectively detecting vulnerabilities in code, and much related work proposes methods for detecting specific vulnerabilities through static analysis, such as detecting insecure uses of certain functions. To detect a particular vulnerability automatically, these static analysis methods rely on prior knowledge, i.e., coding rules, and detect code that violates those rules. A coding rule, whether written manually from experience or extracted automatically by a program, may itself be wrong, which leads to false positives (False Positive); the results of conventional static analysis therefore often require a great deal of time for manual auditing.

In recent years, to avoid relying on program-specific prior knowledge, researchers have skipped the rule-writing step of conventional static analysis and turned to another idea: static vulnerability detection based on similarity. Starting from a code segment that contains a known vulnerability, code with similar characteristics is searched for in the code to be tested. Some work has been done on this kind of similarity detection; a typical example is the "Vulnerability Extrapolation" method proposed by Yamaguchi et al. in 2012. That method maps the Abstract Syntax Tree (AST) of each function to a feature vector space, applies latent semantic analysis (Latent Semantic Analysis) from machine learning to perform principal component analysis (Principal Component Analysis) on the feature vectors, extracts the dominant API (Application Programming Interface) usage patterns, calculates the similarity between the feature vector of a known vulnerable function and the other feature vectors, sorts the results by similarity in descending order, and finally audits the top-ranked candidate functions with high similarity.

However, such similarity detection methods also have limitations. They usually extract features from the whole function that contains the vulnerability and use them for the subsequent similarity calculation, so the information in the function that is unrelated to the vulnerability-related statements becomes noise that affects similarity detection. In particular, the longer the vulnerable function and the more statements it contains, the larger the ratio of noise to vulnerability-related statements, and the greater the influence of noise on the similarity result. Consequently, both false positives and false negatives (False Negative) may arise during similarity calculation and ranking. When a function without the vulnerability resembles the known vulnerable function in its noise part, it may obtain a high similarity because noise dominates, producing a false positive; when a function actually contains a similar vulnerability but its noise statements, outside the vulnerability-related ones, do not resemble those of the known vulnerable function, its calculated similarity is low and it is likely to rank far down the list, producing a false negative. Therefore, if the noise in the function containing the known vulnerability is not handled, the extracted function features are mixed with noise features, which ultimately reduces the effectiveness of the detection method and makes manual auditing harder.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a similarity detection method for discovering an unknown vulnerability by using patch information, which can effectively reduce interference of a vulnerability-independent statement and improve detection accuracy.

To achieve this purpose, the invention adopts the following technical scheme: a similarity detection method for discovering unknown vulnerabilities by using patch information, characterized by comprising the following steps: 1) slicing: slicing the known vulnerable function and its patched counterpart to generate a vulnerability slice containing the vulnerability-related statements and a patch slice containing the patch statements; 2) performing symbol normalization on the variable names, variable types and called function names in the functions to be tested, the vulnerability slice and the patch slice; 3) vectorization: mapping the functions to be tested, the vulnerability slice and the patch slice to a vector space to generate a feature vector for each function to be tested, a vulnerability feature vector and a patch feature vector; the vulnerability feature vector, the patch feature vector and each feature vector of a function to be tested each form one vector, and the value of each dimension in the vector is the product of the number of occurrences of the feature statement in the function and the TF-IDF weight; 4) similarity calculation and matching: after the feature vectors are generated, calculating and ranking the similarities between feature vectors, and judging whether the set of functions to be tested contains unknown vulnerabilities whose features are similar to those of the known vulnerability.

Further, in step 1), GCC is used as the compiling tool to compile the vulnerable function; at the level of the language-independent GIMPLE statements generated during compilation, the patch is used to obtain the statements that serve as the slicing criterion, and the function containing the vulnerability is sliced from the slicing criterion along control flow and data flow so as to retain the semantic information of the vulnerability context; after the vulnerability slice is generated, the statements added and removed by the patch are correspondingly added to and removed from it to obtain the patch slice.

Further, three types of statements are of interest when slicing: conditions, assignments, and function calls.

Further, in step 1), the functions in the function set to be tested do not need to be sliced, and only need to be compiled, and then the GIMPLE statements of the conditions, assignments, and function call types in each function are extracted as function features.

Further, in the step 2), for variable names, variable types are uniformly adopted to represent variables, and constants are regarded as one type regardless of specific values; for the variable types, common data types are summarized, classified and uniformly represented; and for the function calling name, extracting all called functions in the code set to be tested, simply clustering the functions by utilizing character string distance calculation, and merging similar character strings into the same representation form.

Further, in step 4), the similarity calculation and ranking are performed twice: the first time, similarity is calculated between the vulnerability feature vector and the feature vector of each function to be tested, and the functions are sorted by similarity value in descending order to obtain a preliminary candidate function set; the second time, similarity is calculated again between the patch feature vector and each function in the preliminary candidate function set, and this second calculation and ranking removes from the preliminary candidate set the false positives that do not contain the vulnerability, producing a final candidate set for subsequent manual auditing.

Further, in the second similarity calculation and ranking, if a candidate function is a false positive that does not contain the vulnerability, it should be a function that contains the patch features, and the similarity value obtained by the second calculation should be higher than, or at least not lower than, the value obtained by the first calculation.

Further, the distance between two feature vectors A = (a_1, ..., a_n) and B = (b_1, ..., b_n) is calculated by cosine similarity:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

The distance value lies in the interval [0, 1]; the larger the value, the closer the two vectors are and the more similar the corresponding slices and functions. A value of 0 means the two vectors are completely different and do not overlap in any feature dimension; a value of 1 means the two vectors are identical and coincide in all feature dimensions.
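The cosine similarity described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the invention: the sparse dictionary representation, the function name `cosine_similarity`, and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {dimension: weight} dicts.

    With non-negative weights (counts times TF-IDF weights) the result lies in [0, 1].
    Returning 0.0 when either vector is all zeros is an assumption of this sketch.
    """
    dot = sum(w * b.get(dim, 0.0) for dim, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors: dimensions are arbitrary integer identifiers.
same = cosine_similarity({1: 3.0, 2: 4.0}, {1: 3.0, 2: 4.0})
disjoint = cosine_similarity({1: 1.0}, {2: 1.0})
```

Identical vectors give a value of 1 and vectors with no shared dimension give 0, matching the two boundary cases stated in the text.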

By adopting the above technical scheme, the invention has the following advantages: 1. The method addresses the problem in existing similarity detection work that noise statements in the function containing the vulnerability cause false positives and false negatives. 2. The method exploits the latent value of the patch, namely that it delimits the position and scope of the vulnerability: the vulnerability-related statements are located precisely from the patch information, program slicing is introduced to remove the vulnerability-irrelevant statements from the original vulnerable function, and the resulting slice is used to generate denoised vulnerability features for detecting potential unknown vulnerabilities. This effectively reduces the interference of vulnerability-irrelevant statements, so that the similar candidates finally matched are all functions related to the vulnerability features, which improves detection accuracy. 3. Because the slice captures the vulnerability features precisely and can be used directly for similarity calculation, the method does not need the steps of principal component analysis of the feature space or further extraction of dominant programming patterns, and therefore has lower performance overhead. 4. The invention makes full use of the patch information to perform a second screening of the ranked results after similarity calculation, further filtering the results and removing false positives that already contain the patch statements and therefore cannot trigger the vulnerability.

Drawings

FIG. 1 is a schematic overall flow chart of the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and examples.

As shown in FIG. 1, the present invention provides a similarity detection method for discovering unknown vulnerabilities by using patch information, which mainly aims to solve the problem in existing work that noise statements in the function containing the vulnerability may cause false positives and false negatives. The method comprises the following steps:

1) Slicing: slicing is the key step for removing the influence of noise statements in the function containing the vulnerability and for reducing the related false positives and false negatives.

The known vulnerable function and its patched counterpart are sliced to generate a vulnerability slice containing the vulnerability-related statements and a patch slice containing the patch statements, removing irrelevant noise from the vulnerable function and keeping only vulnerability-related features.

The specific process is as follows: GCC is used as the compiling tool to compile the vulnerable function; at the level of the language-independent GIMPLE statements generated during compilation, the patch is used to obtain the statements that serve as the slicing criterion, and the function containing the vulnerability is sliced from the slicing criterion along control flow and data flow so as to retain the semantic information of the vulnerability context. Only the three most important types of statements are of interest when slicing: conditions, assignments, and function calls. After the vulnerability slice is generated, the statements added and removed by the patch are correspondingly added to and removed from it to obtain the patch slice.

It should be noted that the functions in the function set to be tested do not need to be sliced, and only need to be compiled, and then the GIMPLE statements of the conditions, assignments, and function call types in each function are extracted as the function features.
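The slicing idea can be illustrated with a deliberately simplified sketch. It is not the GCC-based implementation described above: it works on a hypothetical list of straight-line statements, follows data dependences backward only, and ignores control flow entirely. The `Stmt` class, the statement texts and the function `backward_slice` are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field

# Statement kinds the text says are of interest when slicing.
KEPT_KINDS = {"cond", "assign", "call"}


@dataclass
class Stmt:
    text: str
    kind: str                                  # "cond", "assign", "call", or other
    defs: set = field(default_factory=set)     # variables written by the statement
    uses: set = field(default_factory=set)     # variables read by the statement


def backward_slice(stmts, criterion_idx):
    """Indices of statements the criterion depends on through data flow.

    Simplification: straight-line code only, no control dependences.
    """
    needed = set(stmts[criterion_idx].uses)
    kept = [criterion_idx]
    for i in range(criterion_idx - 1, -1, -1):
        s = stmts[i]
        if s.defs & needed:
            kept.append(i)
            needed -= s.defs       # these definitions are now explained
            needed |= s.uses       # but their inputs must be traced further
    kept.reverse()
    return [i for i in kept if stmts[i].kind in KEPT_KINDS]


# Hypothetical function body; the last statement plays the role of the
# slicing criterion obtained from the patch.
func = [
    Stmt("n = read_len(buf)", "call", {"n"}, {"buf"}),
    Stmt("log_debug(ctx)", "call", set(), {"ctx"}),
    Stmt("size = n * 4", "assign", {"size"}, {"n"}),
    Stmt("flags = 0", "assign", {"flags"}, set()),
    Stmt("p = alloc(size)", "call", {"p"}, {"size"}),
]
slice_indices = backward_slice(func, 4)
```

In this toy input the logging call and the unrelated assignment are dropped, which is the noise-removal effect the slicing step is meant to achieve.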

2) During compilation, symbol normalization is performed on the variable names, variable types and called function names in the functions to be tested, the vulnerability slice and the patch slice. Symbol normalization reduces the dimension of the vector space used for vectorization and further abstracts the statement features so that they are more representative.

For variable names, which are largely arbitrary, the variable type is uniformly used to represent the variable; constants are treated as a single type regardless of their specific values;

for variable types, since many data types are defined by typedef (for example, size_t is defined from unsigned int), common data types are summarized and classified into a unified representation;

for called function names, all functions called in the code set to be tested are extracted and simply clustered by string distance, and similar strings are merged into the same representation; for example, av_malloc and its similarly named variants are all represented by av_malloc, which achieves function name normalization.
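The three normalization rules above can be sketched as follows. This is an illustrative approximation only: the type table `TYPE_CLASSES`, the `CONST` token, the 0.8 threshold, the greedy clustering strategy and the use of `difflib.SequenceMatcher` as the string distance are all assumptions, since the text does not specify them. The function names in the sample list are illustrative inputs.

```python
from difflib import SequenceMatcher

# Hypothetical grouping of common data types into unified classes.
TYPE_CLASSES = {
    "size_t": "uint", "unsigned int": "uint", "uint32_t": "uint",
    "int": "int", "int32_t": "int",
}


def is_int_literal(tok):
    """True for integer literals such as 42 or 0x10 (the only constants handled here)."""
    try:
        int(tok, 0)
        return True
    except ValueError:
        return False


def normalize_statement(tokens, var_types):
    """Replace variables by their normalized type and constants by one token."""
    out = []
    for tok in tokens:
        if tok in var_types:
            declared = var_types[tok]
            out.append(TYPE_CLASSES.get(declared, declared))
        elif is_int_literal(tok):
            out.append("CONST")
        else:
            out.append(tok)
    return " ".join(out)


def cluster_names(names, threshold=0.8):
    """Greedily map each function name to the first similar representative."""
    representatives, mapping = [], {}
    for name in sorted(names, key=lambda s: (len(s), s)):
        for rep in representatives:
            if SequenceMatcher(None, rep, name).ratio() >= threshold:
                mapping[name] = rep
                break
        else:
            representatives.append(name)
            mapping[name] = name
    return mapping


normalized = normalize_statement(
    ["len", "=", "n", "*", "4"], {"len": "size_t", "n": "unsigned int"}
)
name_map = cluster_names(["av_malloc", "av_mallocz", "av_free", "memcpy"])
```

Sorting by length first makes the shortest name of each group its representative, which is one simple way to obtain a stable canonical form.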

3) Vectorization: vectorization is the process of mapping the functions to be tested, the vulnerability slice and the patch slice to a vector space to generate a feature vector for each function to be tested, a vulnerability feature vector and a patch feature vector. The vulnerability feature vector, the patch feature vector and each feature vector of a function to be tested each form one vector, and the value of each dimension in the vector is the product of the number of occurrences of the feature statement in the function and the TF-IDF weight. This representation is simple and lowers the complexity of similarity calculation, requiring far less computation than methods that match on structures such as trees and graphs.

The code set to be tested is denoted C = {F_1, ..., F_n}, meaning that it contains n functions F_1 through F_n, and V denotes the feature vector space corresponding to C. The mapping from C to V is defined as φ. φ is implemented with the existing hash algorithm hashpjw: each feature statement corresponds to one hash value h in V, and the dimension |V| of the vector space V is the total number of all statements contained in all functions in C. For each function F_i (1 ≤ i ≤ n), the dimensionality of its vector is |V|, and its value on the dimension of each hash value h is:

$$\phi(F_i)_h = c(h, F_i) \cdot w(h, F_i)$$

where c(h, F_i) is the number of occurrences in F_i of the feature statement with hash value h, and w(h, F_i) is the TF-IDF weight of that statement:

$$w(h, F_i) = \mathrm{TF}(h, F_i) \cdot \mathrm{IDF}(h)$$

Here, TF-IDF is a weighting commonly used in information retrieval; the weight equals the product of the term frequency TF and the inverse document frequency IDF. The weight is introduced mainly because the frequency of a statement differs from its importance. For example, a great number of assignment feature statements such as int (int + int) exist in a code set, and the influence of such statements on the calculation results needs to be reduced; other statements occur frequently in some functions and rarely in others. The TF-IDF weight is therefore computed after the vectorized representation in order to evaluate the importance of each hash value to different functions.
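The vectorization step can be sketched as below. The `hashpjw` function is a Python port of the classic PJW string hash; whether the invention uses exactly this variant, or reduces the hash modulo a table size, is not stated and is an assumption. The concrete TF and IDF formulas (relative frequency within the function, and the logarithm of the number of functions divided by the number of functions containing the statement) are likewise assumptions, as are the sample statements.

```python
import math
from collections import Counter


def hashpjw(s):
    """Classic PJW hash, kept within 32 bits (assumed variant, no modulo)."""
    h = 0
    for byte in s.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
            h ^= g
    return h


def vectorize(functions):
    """Map {function name: [normalized statements]} to sparse feature vectors.

    Each value is count * TF * IDF, following the text's description that a
    dimension holds the occurrence count multiplied by the TF-IDF weight.
    """
    counts = {
        name: Counter(hashpjw(stmt) for stmt in stmts)
        for name, stmts in functions.items()
    }
    n_funcs = len(functions)
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())      # each function counted once per hash

    vectors = {}
    for name, c in counts.items():
        total = sum(c.values())
        vectors[name] = {
            h: n * (n / total) * math.log(n_funcs / doc_freq[h])
            for h, n in c.items()
        }
    return vectors


# Hypothetical normalized statements.
COMMON = "int = int + int"
RARE = "uint = call av_malloc ( uint )"
functions = {
    "f1": [COMMON, RARE],
    "f2": [COMMON],
    "f3": [COMMON, "if ( ptr == CONST )"],
}
vectors = vectorize(functions)
```

Under these assumed formulas a statement that appears in every function receives weight 0, while a statement unique to one function keeps a positive weight, which is the down-weighting effect the paragraph above motivates.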

4) Similarity calculation and matching: and after the feature vectors are generated, calculating the similarity of the feature vectors and sequencing the feature vectors, and judging whether the function set to be tested has unknown vulnerabilities similar to the features of the known vulnerabilities.

Similarity calculation and ranking are performed twice:

the first time, similarity is calculated between the vulnerability feature vector and the feature vector of each function to be tested, and the functions are sorted by similarity value in descending order to obtain a preliminary candidate function set;

the second time, for the functions in the preliminary candidate function set obtained from the first calculation and ranking, similarity is calculated again between the patch feature vector and each candidate function. If a candidate function is a false positive that does not contain the vulnerability, it should be a function that contains the patch features, and the similarity value from the second calculation should be higher than, or at least not lower than, the value from the first. The purpose of the second calculation and ranking is to use the patch information again to remove vulnerability-free false positives from the preliminary candidate set, producing a final candidate set for subsequent manual auditing.

The distance between two feature vectors A = (a_1, ..., a_n) and B = (b_1, ..., b_n) is calculated by cosine similarity, with the following formula (a_i and b_i respectively denote the product of the number of occurrences of the i-th dimension feature statement in A and B and its TF-IDF weight, where 1 ≤ i ≤ n):

$$\cos(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
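The two-pass calculation and ranking of step 4) can be sketched as follows. This is a hedged illustration under several assumptions: vectors are sparse dictionaries, the preliminary candidate set is the top `top_k` functions, and a candidate is dropped only when its second similarity is strictly higher than its first. The text also allows "not lower than", so replacing `>` with `>=` is the stricter alternative. All names and the sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of sparse {dimension: weight} vectors (0.0 for zero vectors)."""
    dot = sum(w * b.get(dim, 0.0) for dim, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def two_pass_rank(vuln_vec, patch_vec, func_vecs, top_k=15):
    """Return [(name, first_similarity, second_similarity)] for the final candidates."""
    # Pass 1: rank all functions against the vulnerability feature vector.
    first = sorted(
        ((cosine(vuln_vec, vec), name) for name, vec in func_vecs.items()),
        reverse=True,
    )[:top_k]

    # Pass 2: compare each candidate with the patch feature vector and drop
    # those that look more like the patched code than the vulnerable code.
    final = []
    for s1, name in first:
        s2 = cosine(patch_vec, func_vecs[name])
        if s2 > s1:
            continue
        final.append((name, s1, s2))
    return final


# Hypothetical dimensions: 1 and 2 are vulnerability context, 3 is the added check.
vuln_vec = {1: 1.0, 2: 1.0}
patch_vec = {1: 1.0, 2: 1.0, 3: 1.0}
func_vecs = {
    "unpatched_like": {1: 1.0, 2: 1.0},
    "patched_like": {1: 1.0, 2: 1.0, 3: 1.0},
    "unrelated": {9: 1.0},
}
final_candidates = two_pass_rank(vuln_vec, patch_vec, func_vecs, top_k=2)
```

In this toy input the function that already contains the patch feature scores higher against the patch vector than against the vulnerability vector and is filtered out, leaving only the unpatched look-alike for manual auditing.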

Example:

The method of the invention is deployed on a 64-bit Ubuntu 16.04 platform with GCC 4.9 as the compiler. Version 3.2.4 of FFmpeg, an open source audio and video processing library, and version 9.21 of Ghostscript, open source image browsing software, both widely supported on multiple platforms, are selected as experimental subjects. FFmpeg comprises 1583 files and 15598 functions, and Ghostscript comprises 935 files and 15875 functions. Vulnerabilities of the two programs newly published in 2017 are selected as the known target vulnerabilities for the detection experiments.

Three main steps of the method, namely slicing, normalization and vectorization, are deployed directly in GCC, so that slicing and vector mapping output the feature vectors during compilation. Similarity calculation is performed separately afterwards; it completes in about 1 second and consumes almost no time, so the performance experiment mainly analyzes the time consumed by slicing and vector mapping. FFmpeg and Ghostscript are first compiled on their own and the time recorded; the slicing, normalization and vector mapping processing is then added and the running time recorded again. This is repeated 10 times and the times averaged; the results are shown in Table 1.

TABLE 1 time consumption of the inventive slicing and feature vector acquisition process

It can be seen that the time required from the actual slicing to the end of the feature vector mapping step is less than the time of one compilation, only a few minutes are required for code analysis containing more than 10000 functions, and the performance overhead is completely within an acceptable range.

And (3) detection results:

In the unknown vulnerability discovery experiments on FFmpeg and Ghostscript, 3 new unknown vulnerabilities were detected, wherein 1 unknown vulnerability is detected on the CVE-2017-. The FFmpeg vulnerability CVE-2017-5025 is taken as an example to illustrate how slicing with patch information reduces the noise of irrelevant statements in the vulnerable function, together with the false positives and false negatives that this noise causes, and how the second calculation with the patch feature vector filters out false positives that do not contain the vulnerability.

First, if no slicing is performed and all statements of the whole function mov_read_hdlr are simply used as features, the result of calculating similarity directly after vectorization is shown in Table 2 (the ranking does not include the function in which the target vulnerability is located).

TABLE 2 first 15 functions most similar to CVE-2017-5025 without slicing

Because the proportion of vulnerability-irrelevant statements in the function is high (56 of 61 lines), the final similarity results are heavily disturbed by noise, and the similarity of all candidate functions is low. Table 2 also shows a very high rate of false positives that contain no vulnerability-related statements at all: there are 9 of them (marked in Table 2), more than half of the first 15, so the detection accuracy is very low.

The results of sorting after once similarity calculation after obtaining the feature vector only containing the vulnerability by using the patch information and program slicing method are shown in table 3.

TABLE 3 first 15 functions most similar to CVE-2017-5025 (only one similarity calculation was performed) after the slicing method was added

The experimental results show that, after slicing is introduced, none of the recalculated top results is a false positive caused by vulnerability-irrelevant noise, and the similarity values are generally higher. As for the noise-induced false positives in Table 2, read_packet, originally ranked 12th with a similarity of 0.47, now has a similarity of 0.34 and ranks 287th; the other 8 noise-induced false positives all have similarity values below 0.2 after slicing, and get_aiff_header and mov_read_frma, originally ranked 2nd and 4th, now have similarities below 0.1. These results show that slicing effectively removes the influence of noise and reduces the irrelevant false positives it generates. In addition, the 12th candidate in Table 3 is a known vulnerability that had been patched shortly before it was found here, and the function ranked 2nd is an unknown vulnerability; neither function was found by the similarity calculation without slicing. This indicates that, once slicing has reduced the irrelevant noise, similar functions are matched more accurately, functions similar to the known vulnerability rank higher, and false negatives are reduced. However, before the second similarity calculation, 10 functions in the result (marked in Table 3) already contain the patch statements and are therefore false positives that cannot trigger the vulnerability.

Finally, on the basis of table 3, the secondary similarity calculation is performed by using the vector containing the patch features generated by the patch and the function therein, and the result is shown in table 4.

TABLE 4 first 15 functions most similar to CVE-2017-5025 after quadratic similarity calculation

The experimental results in Table 4 show that for 5 functions (marked in Table 4) the similarity is higher than in the first similarity calculation shown in Table 3. This indicates that these 5 functions already contain the added patch features, so they no longer contain the target vulnerability and can be removed from the candidate results.

Thus, the final candidate set will be as shown in table 5.

TABLE 5 Final candidate set after second similarity calculation screening

Manual auditing then needs to cover only these functions, which reduces the auditing workload. However, Table 5 still contains 5 false positives that include patch statements and that the second similarity calculation cannot filter out: they are patched in a different way, so the patch feature vector used in this experiment does not raise the similarity of these functions.

The above embodiments are only for illustrating the present invention, and the steps may be changed, and on the basis of the technical solution of the present invention, the modification and equivalent changes of the individual steps according to the principle of the present invention should not be excluded from the protection scope of the present invention.

Claims

1. A similarity detection method for carrying out unknown vulnerability discovery by using patch information is characterized by comprising the following steps:

1) slicing: slicing the known vulnerability function and the patched patch function thereof to generate slices containing vulnerability related statements and slices containing patch statements;

2) performing symbol normalization on variable names, variable types and function calling names in functions to be tested, vulnerability slices and patch slices;

3) vectorization: mapping a function to be tested, a vulnerability slice and a patch slice to a vector space to generate a function feature vector to be tested, a vulnerability feature vector and a patch feature vector; the vulnerability feature vector, the patch feature vector and each function feature vector to be tested each form one vector, and the value of each dimension in the vector represents the product of the number of occurrences of the feature statement in the function and the TF-IDF weight;

4) similarity calculation and matching: after the feature vectors are generated, calculating the similarity of the feature vectors and sequencing the feature vectors, and judging whether unknown vulnerabilities similar to the features of known vulnerabilities exist in the function set to be tested or not;

similarity calculation and ranking are performed twice:

the second time, for the functions in the preliminary candidate function set obtained from the first similarity calculation and ranking, similarity is calculated again between the patch feature vector and the functions in the candidate function set, and this second similarity calculation and ranking removes from the preliminary candidate set the false positives that do not contain the vulnerability, so as to generate a final candidate set for subsequent manual auditing.

2. The method for detecting similarity of unknown vulnerabilities discovery using patch information as claimed in claim 1, characterized in that: in step 1), GCC is used as the compiling tool to compile the vulnerable function, the patch is used to obtain the statements serving as the slicing criterion at the level of the language-independent GIMPLE statements generated during compilation, and the function containing the vulnerability is sliced from the slicing criterion along control flow and data flow so as to retain the semantic information of the vulnerability context; after the vulnerability slice is generated, the statements added and removed by the patch are correspondingly added to and removed from it to obtain the patch slice.

3. The method for detecting similarity of unknown vulnerabilities discovery using patch information as claimed in claim 2, characterized in that: three types of statements are of interest when slicing: conditions, assignments, and function calls.

4. A method as claimed in claim 1, 2 or 3, for similarity detection of unknown vulnerabilities discovery using patch information, characterized by: in the step 1), the functions in the function set to be tested do not need to be sliced, and only need to be compiled, and then the GIMPLE statements of the conditions, assignments and function call types in each function are extracted as function features.

5. The method for detecting similarity of unknown vulnerabilities discovery using patch information as claimed in claim 1, characterized in that: in the step 2), variable names are uniformly represented by variable types, and constants are regarded as one type by ignoring specific values; for the variable types, common data types are summarized, classified and uniformly represented; and for the function calling name, extracting all called functions in the code set to be tested, simply clustering the functions by utilizing character string distance calculation, and merging similar character strings into the same representation form.

6. The method for detecting similarity of unknown vulnerabilities discovery using patch information as claimed in claim 1, characterized in that: in the second similarity calculation and ranking, if a candidate function is a false positive that does not contain the vulnerability, the function is a function containing the patch features, and the similarity value obtained by the second similarity calculation is not lower than the value obtained by the first similarity calculation.

7. The method for detecting similarity of unknown vulnerabilities discovery using patch information as claimed in claim 1 or 6, characterized by: the distance between two feature vectors A = (a_1, ..., a_n) and B = (b_1, ..., b_n) is calculated by cosine similarity:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$