CN108268777B - Similarity detection method for carrying out unknown vulnerability discovery by using patch information - Google Patents

Info

Publication number
CN108268777B
Authority
CN
China
Prior art keywords
function
patch
vulnerability
similarity
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810047837.XA
Other languages
Chinese (zh)
Other versions
CN108268777A (en
Inventor
梁彬
李赞
边攀
石文昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China
Priority to CN201810047837.XA
Publication of CN108268777A
Application granted
Publication of CN108268777B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a similarity detection method for discovering unknown vulnerabilities by using patch information, which comprises the following steps: slicing a known vulnerable function and its patched counterpart to generate a vulnerability slice containing the vulnerability-related statements and a patch slice containing the patch statements; performing symbol normalization on the variable names, variable types and called function names in the functions to be tested, the vulnerability slice and the patch slice; mapping the functions to be tested, the vulnerability slice and the patch slice to a vector space to generate the feature vectors of the functions to be tested, the vulnerability feature vector and the patch feature vector, wherein the value of each dimension of a vector is the product of the number of occurrences of the corresponding feature statement in the function and its TF-IDF weight; and, after the feature vectors are generated, calculating and ranking their similarities to determine whether the set of functions to be tested contains unknown vulnerabilities whose features are similar to those of the known vulnerability. The method effectively reduces the interference of vulnerability-irrelevant statements and improves detection accuracy.

Description

Similarity detection method for carrying out unknown vulnerability discovery by using patch information
Technical Field
The invention relates to an unknown vulnerability detection method, in particular to a similarity detection method which is applied in the field of information security and utilizes patch information to discover unknown vulnerabilities.
Background
Vulnerabilities are a significant cause of software failures and errors, so vulnerability detection has long been a research hotspot in the field of software security. Static detection, one of the mainstream vulnerability detection techniques, has been proven able to detect vulnerabilities in code effectively, and many related works propose methods for detecting specific vulnerabilities through static analysis, such as detecting insecure uses of certain functions. To detect a particular kind of vulnerability automatically, these static analysis methods rely on prior knowledge, namely coding rules, and detect code that violates those rules. A coding rule, whether given manually based on experience or extracted automatically by a program, may be erroneous and thus cause false positives, so the results produced by conventional static analysis often require a great deal of time for manual auditing.
In recent years, to avoid relying on program-specific prior knowledge, researchers have skipped the rule-formulation step of conventional static analysis and turned to another idea: using similarity for static vulnerability detection. Starting from a code fragment that contains a vulnerability, code whose features are similar to the known vulnerability is sought in the code to be tested. Some work has been done on this kind of similarity detection; a typical example is the "Vulnerability Extrapolation" method proposed by Yamaguchi et al. in 2012. That method maps the Abstract Syntax Tree (AST) of each function to a feature vector space, applies Latent Semantic Analysis from machine learning to perform Principal Component Analysis on the feature vectors, extracts the main API (Application Programming Interface) usage patterns, calculates the similarity between the feature vector of the known vulnerable function and the other feature vectors, sorts the results by similarity in descending order, and finally audits the top-ranked candidate functions with high similarity.
However, such similarity detection methods also have limitations. They usually extract features from the whole function that contains the vulnerability and use them for the subsequent similarity calculation, so all the other information in the function, which is unrelated to the vulnerability-related statements, becomes noise that affects the similarity detection. In particular, the longer the vulnerable function and the more statements it contains, the larger the proportion of noise relative to the vulnerability-related statements, and the greater the influence of the noise on the similarity result. Consequently, both false positives and false negatives may arise when the similarities are calculated and ranked. When a function that does not contain the vulnerability resembles the known vulnerable function in its noise part, it can obtain a high similarity because the noise accounts for a large share of the features, producing a false positive. When a function actually contains a similar vulnerability but its noise statements, outside the vulnerability-related ones, do not resemble those of the known vulnerable function, its calculated similarity is low and it is likely to be ranked far down the list, producing a false negative. Therefore, if the noise in the function containing the known vulnerability is not handled, the extracted function features are mixed with noise features, which ultimately reduces the effectiveness of the detection method and makes manual auditing harder.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a similarity detection method for discovering an unknown vulnerability by using patch information, which can effectively reduce interference of a vulnerability-independent statement and improve detection accuracy.
To achieve this object, the invention adopts the following technical scheme: a similarity detection method for discovering unknown vulnerabilities by using patch information, characterized by comprising the following steps: 1) slicing: slicing the known vulnerable function and its patched counterpart to generate a vulnerability slice containing the vulnerability-related statements and a patch slice containing the patch statements; 2) performing symbol normalization on the variable names, variable types and called function names in the functions to be tested, the vulnerability slice and the patch slice; 3) vectorization: mapping the functions to be tested, the vulnerability slice and the patch slice to a vector space to generate the feature vectors of the functions to be tested, the vulnerability feature vector and the patch feature vector; the vulnerability feature vector, the patch feature vector and the feature vector of each function to be tested each form one vector, and the value of each dimension of a vector represents the product of the number of occurrences of the corresponding feature statement in the function and the TF-IDF weight; 4) similarity calculation and matching: after the feature vectors are generated, calculating and ranking their similarities to determine whether the set of functions to be tested contains unknown vulnerabilities whose features are similar to those of the known vulnerability.
Further, in step 1), GCC is used as the compilation tool to compile the vulnerable function; at the level of the language-independent GIMPLE statements generated during compilation, the statements that serve as the slicing criterion are obtained from the patch, and the function containing the vulnerability is sliced along control flow and data flow starting from the slicing criterion, so as to retain the semantic information of the vulnerability context; after the vulnerability slice is generated, the statements added by the patch are added to it and the statements removed by the patch are removed from it to obtain the patch slice.
Further, only three types of statements are considered when slicing: conditions, assignments, and function calls.
Further, in step 1), the functions in the set of functions to be tested do not need to be sliced; they only need to be compiled, after which the GIMPLE statements of the condition, assignment and function call types in each function are extracted as the function features.
Further, in step 2), for variable names, each variable is uniformly represented by its type, and constants are treated as a single type regardless of their specific values; for variable types, common data types are summarized, classified and given a unified representation; for called function names, all functions called in the code set to be tested are extracted and clustered in a simple way by string distance, and similar strings are merged into the same representation.
Further, in step 4), the similarity calculation and ranking are performed twice: the first time, the similarity between the vulnerability feature vector and the feature vector of each function to be tested is calculated, and the functions are sorted by similarity in descending order to obtain a preliminary candidate function set; the second time, for the functions in the preliminary candidate function set obtained from the first calculation and ranking, the similarity between the patch feature vector and each candidate function is calculated and the results are ranked again, so as to remove from the preliminary candidate set the false positives that do not contain the vulnerability and generate a final candidate set for subsequent manual auditing.
Further, in the second similarity calculation and ranking, if a candidate function is a false positive that does not contain the vulnerability, it should be a function that contains the patch features, and the similarity value obtained in the second calculation should be higher than, or at least not lower than, the value obtained in the first calculation.
Further, the distance between two feature vectors $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_n)$ is calculated by cosine similarity:
$$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
The value lies in the interval [0, 1]; the larger the value, the closer the two vectors are and the more similar the corresponding slices and functions; a value of 0 indicates that the two vectors are completely different and do not overlap in any feature dimension; a value of 1 indicates that the two vectors are identical and coincide in all feature dimensions.
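As a small worked illustration of the cosine measure (the two vectors are made-up values, not taken from the experiments), take $A = (1, 2, 0)$ and $B = (2, 1, 1)$:

```latex
\[
\mathrm{sim}(A, B)
  = \frac{1\cdot 2 + 2\cdot 1 + 0\cdot 1}
         {\sqrt{1^2 + 2^2 + 0^2}\,\sqrt{2^2 + 1^2 + 1^2}}
  = \frac{4}{\sqrt{5}\,\sqrt{6}}
  = \frac{4}{\sqrt{30}} \approx 0.73
\]
```

Because every component is a non-negative product of an occurrence count and a weight, the numerator can never be negative, which is why the result stays within [0, 1].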
By adopting the above technical scheme, the invention has the following advantages: 1. The method solves the problem in existing similarity detection work that noise statements in the function containing the vulnerability cause false positives and false negatives. 2. The method exploits the latent value of the patch, namely that it delimits the position and extent of the vulnerability: the vulnerability-related statements are located precisely from the patch information, program slicing is introduced to remove the vulnerability-irrelevant statements from the original vulnerable function, and the resulting slice is used to generate denoised vulnerability features for detecting potential unknown vulnerabilities. This effectively reduces the interference of vulnerability-irrelevant statements, so that the similar candidates finally matched are all functions related to the vulnerability features, which improves detection accuracy. 3. Because the slice captures the vulnerability features precisely and can be used directly for similarity calculation, the method has no need for principal component analysis of the feature space or for further extraction of the main programming patterns, and therefore has a lower performance overhead. 4. The invention makes full use of the patch information by performing a second screening of the ranked results after the similarity calculation, which further filters the results and removes false positives that already contain the patch statements and therefore cannot give rise to the vulnerability.
Drawings
FIG. 1 is a schematic overall flow chart of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
As shown in FIG. 1, the invention provides a similarity detection method for discovering unknown vulnerabilities by using patch information, whose main aim is to solve the problem in existing work that noise statements in the function containing the vulnerability may cause false positives and false negatives. The method comprises the following steps:
1) Slicing: slicing is the key step for removing the influence of noise statements in the function containing the vulnerability and reducing the related false positives and false negatives.
The known vulnerable function and its patched counterpart are sliced to generate a vulnerability slice containing the vulnerability-related statements and a patch slice containing the patch statements, so that irrelevant noise in the vulnerable function is removed and only the vulnerability-related features are kept.
The specific process is as follows: GCC is used as the compilation tool to compile the vulnerable function; at the level of the language-independent GIMPLE statements generated during compilation, the statements that serve as the slicing criterion are obtained from the patch, and the function containing the vulnerability is sliced along control flow and data flow starting from the slicing criterion, so as to retain the semantic information of the vulnerability context. Only the three most important types of statements are considered when slicing: conditions, assignments, and function calls. After the vulnerability slice is generated, the statements added by the patch are added to it and the statements removed by the patch are removed from it to obtain the patch slice.
It should be noted that the functions in the set of functions to be tested do not need to be sliced; they only need to be compiled, after which the GIMPLE statements of the condition, assignment and function call types in each function are extracted as the function features.
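The slicing step can be illustrated with a minimal sketch. It assumes each statement has already been annotated with its kind, the variables it defines and uses, and the index of its controlling condition; the `Stmt` class and `slice_from` helper are hypothetical names, and a real implementation would operate on GCC's GIMPLE representation rather than on this toy model. For simplicity the sketch only follows dependences backward from the criterion and treats any earlier definition of a used variable as a data dependence.

```python
from dataclasses import dataclass, field
from typing import Optional

KINDS = {"cond", "assign", "call"}  # the three statement types kept in a slice

@dataclass
class Stmt:
    text: str
    kind: str                                 # "cond", "assign", "call" or "other"
    defs: set = field(default_factory=set)    # variables written
    uses: set = field(default_factory=set)    # variables read
    ctrl: Optional[int] = None                # index of the controlling condition

def slice_from(stmts, criteria):
    """Collect statements the criterion depends on via data or control flow."""
    keep, work = set(), list(criteria)
    while work:
        i = work.pop()
        if i in keep:
            continue
        keep.add(i)
        s = stmts[i]
        if s.ctrl is not None:                # control dependence
            work.append(s.ctrl)
        for j in range(i):                    # (simplified) data dependence
            if stmts[j].defs & s.uses:
                work.append(j)
    return [stmts[i].text for i in sorted(keep) if stmts[i].kind in KINDS]

stmts = [
    Stmt("n = read_len()", "call", defs={"n"}),
    Stmt("buf = alloc(n)", "call", defs={"buf"}, uses={"n"}),
    Stmt("log(msg)", "call", uses={"msg"}),
    Stmt("if (n > 0)", "cond", uses={"n"}),
    Stmt("copy(buf, src, n)", "call", uses={"buf", "src", "n"}, ctrl=3),
    Stmt("count = count + 1", "assign", defs={"count"}, uses={"count"}),
]
vuln_slice = slice_from(stmts, [4])  # statement 4 plays the role of the patched statement
```

Here the unrelated `log(msg)` and `count = count + 1` statements drop out of the slice, which is the denoising effect the method relies on.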
2) During compilation, symbol normalization is performed on the variable names, variable types and called function names in the functions to be tested, the vulnerability slice and the patch slice. Symbol normalization reduces the dimension of the vector space used for vectorization and further abstracts the statement features so that they are reasonably representative.
For variable names, because they are highly arbitrary, each variable is uniformly represented by its type, and constants are treated as a single type regardless of their specific values;
for variable types, since many data types are defined through typedef (for example, size_t is defined from unsigned int), common data types are summarized and classified into a unified representation;
for called function names, all functions called in the code set to be tested are extracted and clustered in a simple way by string distance, and similar strings are merged into the same representation; for example, function names that closely resemble av_malloc are all represented as av_malloc, which achieves the purpose of function name normalization.
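The three normalization rules can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the `TYPE_CLASSES` table, the 0.85 threshold, the greedy clustering strategy and the helper names are all hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string distance is actually used. The name `av_mallocz` is used only as an example of a near-duplicate of `av_malloc`.

```python
from difflib import SequenceMatcher

# Hypothetical grouping of data types into unified classes.
TYPE_CLASSES = {"size_t": "uint", "unsigned int": "uint", "int": "int"}

def cluster_names(names, threshold=0.85):
    """Greedily map each function name to the first similar representative."""
    reps, mapping = [], {}
    for name in sorted(names):
        for rep in reps:
            if SequenceMatcher(None, name, rep).ratio() >= threshold:
                mapping[name] = rep
                break
        else:                       # no similar representative: start a new cluster
            reps.append(name)
            mapping[name] = name
    return mapping

def normalize_statement(tokens, var_types, name_map):
    """Replace variables by type class, callees by cluster name, literals by 'const'."""
    out = []
    for tok in tokens:
        if tok in var_types:
            out.append(TYPE_CLASSES.get(var_types[tok], var_types[tok]))
        elif tok in name_map:
            out.append(name_map[tok])
        elif tok.isdigit():
            out.append("const")
        else:
            out.append(tok)
    return " ".join(out)

name_map = cluster_names(["av_malloc", "av_mallocz", "memcpy"])
normalized = normalize_statement(
    ["len", "=", "av_mallocz", "(", "size", "+", "16", ")"],
    {"len": "size_t", "size": "unsigned int"},
    name_map,
)
```

After normalization, statements that differ only in identifier names or literal values collapse to the same feature string, which keeps the vector space small.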
3) Vectorization: vectorization is the process of mapping the functions to be tested, the vulnerability slice and the patch slice to a vector space to generate the feature vectors of the functions to be tested, the vulnerability feature vector and the patch feature vector. The vulnerability feature vector, the patch feature vector and the feature vector of each function to be tested each form one vector, and the value of each dimension of a vector represents the product of the number of occurrences of the corresponding feature statement in the function and the TF-IDF weight. This representation is simple and lowers the complexity of the similarity calculation, requiring far less computation than methods that match and compare structures such as trees and graphs.
The code set to be tested is denoted $C = \{F_1, \ldots, F_n\}$, meaning that it contains $n$ functions, and $V$ denotes the feature vector space corresponding to $C$. The mapping from $C$ to $V$ is defined as $\phi$, which is implemented with the existing hash algorithm hashpjw: each feature statement corresponds to one hash value $h$ in $V$, and the dimension $|V|$ of the vector space $V$ is the total number of all statements contained in all functions in $C$. For each function $F_i$ ($1 \le i \le n$), the vector has dimension $|V|$, and its value on the dimension of each hash value $h$, denoted $\phi_h(F_i)$, is:
$$\phi_h(F_i) = \#_h(F_i) \cdot \mathrm{TFIDF}(h, F_i, C)$$
where $\#_h(F_i)$ is the number of occurrences in $F_i$ of the feature statement whose hash value is $h$, and
$$\mathrm{TFIDF}(h, F_i, C) = \mathrm{TF}(h, F_i) \cdot \mathrm{IDF}(h, C)$$
Here TF-IDF is the weighting commonly used in information retrieval; the weight equals the product of the term frequency TF and the inverse document frequency IDF. The weight is introduced mainly to account for the difference between how often a statement occurs and how important it is. For example, the assignment feature statement int (int + int) occurs in great numbers throughout a code set, and the influence of such statements on the calculation result needs to be reduced; other statements occur frequently in some functions and rarely in others. The TF-IDF weight is therefore computed after the vectorized representation in order to evaluate the importance of each hash value to the different functions.
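A minimal sketch of the vectorization step is given below. The `hashpjw` function follows the classic P. J. Weinberger string hash; the exact TF and IDF definitions are not spelled out in the text, so the common choices TF = count / statements in the function and IDF = log(n / number of functions containing the statement) are assumed here, and `vectorize` is a hypothetical helper name. Vectors are stored sparsely as dictionaries keyed by hash value.

```python
import math
from collections import Counter

def hashpjw(s):
    """Classic PJW string hash, emulating 32-bit unsigned arithmetic."""
    h = 0
    for c in s.encode():
        h = ((h << 4) + c) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:                      # fold the top four bits back into the value
            h ^= g >> 24
            h ^= g
    return h

def vectorize(functions):
    """functions: dict mapping a function name to its normalized statements."""
    n = len(functions)
    counts = {name: Counter(hashpjw(s) for s in stmts)
              for name, stmts in functions.items()}
    # Document frequency: in how many functions does each hash value occur?
    df = Counter(h for c in counts.values() for h in c)
    vectors = {}
    for name, c in counts.items():
        total = sum(c.values())
        # value = occurrence count * TF * IDF (assumed definitions, see lead-in)
        vectors[name] = {h: k * (k / total) * math.log(n / df[h])
                         for h, k in c.items()}
    return vectors

vectors = vectorize({
    "f": ["int = int + int", "uint = av_malloc ( uint )"],
    "g": ["int = int + int", "ptr = ptr"],
})
```

With these assumed definitions, a statement that appears in every function, such as the ubiquitous `int = int + int`, receives a weight of zero, while a statement specific to one function keeps a positive weight.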
4) Similarity calculation and matching: after the feature vectors are generated, their similarities are calculated and ranked to determine whether the set of functions to be tested contains unknown vulnerabilities whose features are similar to those of the known vulnerability.
The similarity calculation and ranking are performed twice:
the first time, the similarity between the vulnerability feature vector and the feature vector of each function to be tested is calculated, and the functions are sorted by similarity in descending order to obtain a preliminary candidate function set;
the second time, for the functions in the preliminary candidate function set obtained from the first calculation and ranking, the similarity between the patch feature vector and each candidate function is calculated. If a candidate function is a false positive that does not contain the vulnerability, it should be a function that contains the patch features, and its similarity value from the second calculation should be higher than, or at least not lower than, its value from the first calculation. The purpose of the second similarity calculation and ranking is to use the patch information once more to remove from the preliminary candidate set the false positives that do not contain the vulnerability, generating a final candidate set for subsequent manual auditing.
The distance between two feature vectors $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_n)$ is calculated by cosine similarity, using the following formula, where $a_i$ and $b_i$ denote the product of the number of occurrences of the $i$-th dimension feature statement and the TF-IDF weight in $A$ and $B$ respectively, with $1 \le i \le n$:
$$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
The value lies in the interval [0, 1]; the larger the value, the closer the two vectors are and the more similar the corresponding slices and functions; a value of 0 indicates that the two vectors are completely different and do not overlap in any feature dimension; a value of 1 indicates that the two vectors are identical and coincide in all feature dimensions.
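The two-pass ranking described above can be sketched as follows, assuming sparse feature vectors stored as dictionaries from feature index to value. The function names, the `top_k` cutoff and the toy vectors are hypothetical; the filtering rule (drop a candidate whose similarity to the patch vector is not lower than its similarity to the vulnerability vector) is the one stated in the text.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors (dict: feature -> value)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def two_pass(vuln_vec, patch_vec, candidates, top_k):
    # Pass 1: rank all functions by similarity to the vulnerability slice.
    first = sorted(((cosine(vuln_vec, vec), name)
                    for name, vec in candidates.items()), reverse=True)[:top_k]
    # Pass 2: drop candidates at least as similar to the patch slice,
    # since they probably already contain the fix.
    final = []
    for s1, name in first:
        s2 = cosine(patch_vec, candidates[name])
        if s2 < s1:
            final.append((name, s1, s2))
    return final

vuln_vec = {1: 1.0, 2: 1.0}
patch_vec = {1: 1.0, 2: 1.0, 3: 1.0}      # feature 3 stands for the added check
candidates = {
    "unpatched": {1: 1.0, 2: 1.0},
    "patched": {1: 1.0, 2: 1.0, 3: 1.0},
    "unrelated": {9: 1.0},
}
final = two_pass(vuln_vec, patch_vec, candidates, top_k=2)
```

In this toy run only `unpatched` survives: `patched` scores higher against the patch vector than against the vulnerability vector, and `unrelated` never enters the preliminary candidate set.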
Example:
The method of the invention was deployed on a 64-bit Ubuntu 16.04 platform with GCC 4.9 as the compiler. Version 3.2.4 of FFmpeg, an open-source audio and video processing library, and version 9.21 of Ghostscript, an open-source image browsing program, both widely supported on multiple platforms, were selected as the experimental subjects. FFmpeg comprises 1583 files and 15598 functions, and Ghostscript comprises 935 files and 15875 functions. As for vulnerabilities, those newly published for the two programs in 2017 were selected as the known target vulnerabilities for the detection experiment.
The method involves three main steps, slicing, normalization and vectorization, that are deployed directly inside GCC, so that slicing and vector mapping are carried out and the feature vectors are output during compilation. The similarity calculation is performed separately afterwards; it completes in about 1 second and consumes almost no time, so the performance experiment mainly analyzes the time consumed by slicing and vector mapping. FFmpeg and Ghostscript were first compiled on their own and the time recorded; the slicing, normalization and vector mapping processing was then added and the running time recorded again. This was repeated 10 times and the times averaged; the results are shown in Table 1.
Table 1. Time consumed by the slicing and feature vector acquisition process of the invention (table image not reproduced)
It can be seen that the time required from the actual slicing to the end of the feature vector mapping step is less than the time of one compilation, only a few minutes are required for code analysis containing more than 10000 functions, and the performance overhead is completely within an acceptable range.
Detection results:
In the unknown vulnerability discovery experiments on FFmpeg and Ghostscript, 3 new unknown vulnerabilities were detected, wherein 1 unknown vulnerability is detected on the CVE-2017-. Taking the FFmpeg vulnerability CVE-2017-5025 as an example, the following illustrates how slicing guided by patch information reduces the noise of irrelevant statements in the vulnerable function and the resulting false positives and false negatives, and how the second calculation with the patch feature vector filters out false positives that do not contain the vulnerability.
First, without slicing, all statements of the whole function mov_read_hdlr are used as features, vectorized, and the similarity is calculated directly; the result is shown in Table 2 (the ranking does not include the function in which the target vulnerability is located).
Table 2. The 15 functions most similar to CVE-2017-5025 without slicing (table image not reproduced)
Because the proportion of vulnerability-irrelevant statements in the function is high (56 of 61 lines), the final similarity results are heavily disturbed by noise, and the similarity of all candidate functions is low. Table 2 also shows that functions containing no vulnerability-related statements at all are reported at a high rate: there are 9 of them (marked in Table 2), more than half of the top 15, so the detection accuracy is very low and the false positive rate is high.
The results of sorting after once similarity calculation after obtaining the feature vector only containing the vulnerability by using the patch information and program slicing method are shown in table 3.
Table 3. The 15 functions most similar to CVE-2017-5025 after slicing is added, with only one similarity calculation performed (table image not reproduced)
The experimental results show that, after slicing is introduced, none of the recomputed results is a false positive caused by vulnerability-irrelevant noise, and the similarity values are generally higher. As for the noise-induced false positives in Table 2: read_packet, originally ranked 12th with a similarity of 0.47, now has a similarity of 0.34 and is ranked 287th; the other 8 noise-induced false positives all have similarities below 0.2 after slicing; in particular, get_aiff_header and mov_read_frma, originally ranked 2nd and 4th, now have similarities below 0.1. These results demonstrate that slicing effectively removes the influence of noise and reduces the irrelevant false positives that noise produces. In addition, the 12th candidate in Table 3 is a known vulnerability that had been patched shortly before it was found here, while the function ranked 2nd is an unknown vulnerability; neither function was found by the similarity calculation without slicing. This indicates that once slicing has reduced the irrelevant noise, similar functions are matched more accurately, functions resembling the known vulnerability are ranked higher, and false negatives are reduced. However, before the second similarity calculation, 10 functions in the result (marked in Table 3) already contain the patch statements and are therefore also false positives that cannot give rise to the vulnerability.
Finally, on the basis of table 3, the secondary similarity calculation is performed by using the vector containing the patch features generated by the patch and the function therein, and the result is shown in table 4.
Table 4. The 15 functions most similar to CVE-2017-5025 after the second similarity calculation (table image not reproduced)
From the experimental results in Table 4, it can be seen that 5 functions (marked in Table 4) have a higher similarity than in the first similarity calculation shown in Table 3, which indicates that these 5 functions already contain the added patch features; they therefore no longer contain the target vulnerability and can be removed from the candidate results.
Thus, the final candidate set will be as shown in table 5.
Table 5. Final candidate set after screening by the second similarity calculation (table image not reproduced)
Manual auditing then only needs to cover these functions, which reduces the auditing workload. However, Table 5 still contains 5 false positives that include patch statements and cannot be filtered out by the second similarity calculation: they are patched in a different way, so the patch feature vector used in this experiment does not raise the similarity of these functions.
The above embodiments are only intended to illustrate the invention, and the individual steps may be varied; modifications and equivalent changes to individual steps made according to the principle of the invention on the basis of its technical solution should not be excluded from the protection scope of the invention.

Claims (7)

1. A similarity detection method for carrying out unknown vulnerability discovery by using patch information is characterized by comprising the following steps:
1) slicing: slicing the known vulnerability function and the patched patch function thereof to generate slices containing vulnerability related statements and slices containing patch statements;
2) performing symbol normalization on variable names, variable types and function calling names in functions to be tested, vulnerability slices and patch slices;
3) vectorization: mapping the functions to be tested, the vulnerability slice and the patch slice to a vector space to generate the feature vectors of the functions to be tested, the vulnerability feature vector and the patch feature vector; the vulnerability feature vector, the patch feature vector and the feature vector of each function to be tested each form one vector, and the value of each dimension of a vector represents the product of the number of occurrences of the corresponding feature statement in the function and the TF-IDF weight;
4) similarity calculation and matching: after the feature vectors are generated, calculating the similarity of the feature vectors and sequencing the feature vectors, and judging whether unknown vulnerabilities similar to the features of known vulnerabilities exist in the function set to be tested or not;
similarity calculation and ranking were done twice:
the first time, similarity calculation is carried out by utilizing the vulnerability characteristic vector and the characteristic vector of the function to be measured, and the similarity is sorted from large to small according to the similarity value to obtain a preliminary candidate function set;
and the second time is to calculate and sort the first time of similarity to obtain a function in the preliminary candidate function set, and then carry out similarity calculation again by using the patch characteristic vector and the function in the candidate function set, and carry out secondary similarity calculation and sorting to remove false reports without holes in the preliminary candidate set, so as to generate a final candidate set for subsequent manual audit.
2. The similarity detection method for carrying out unknown vulnerability discovery by using patch information according to claim 1, characterized in that: in step 1), GCC is used as the compilation tool to compile the vulnerability function; at the level of the language-independent GIMPLE statements generated during compilation, the patch is used to obtain the statements that serve as slicing criteria; starting from the slicing criteria, the function containing the vulnerability is sliced along control flow and data flow so as to retain the semantic information of the vulnerability context; after the vulnerability slice is generated, the statements added by the patch are added to it and the statements removed by the patch are removed from it to obtain the patch slice.
3. The similarity detection method for carrying out unknown vulnerability discovery by using patch information according to claim 2, characterized in that: three types of statements are of interest during slicing: conditions, assignments, and function calls.
4. The similarity detection method for carrying out unknown vulnerability discovery by using patch information according to claim 1, 2 or 3, characterized in that: in step 1), the functions in the set of functions to be tested do not need to be sliced; they only need to be compiled, after which the GIMPLE statements of the condition, assignment, and function-call types in each function are extracted as the features of that function.
5. The similarity detection method for carrying out unknown vulnerability discovery by using patch information according to claim 1, characterized in that: in step 2), variable names are uniformly represented by their variable types, and constants are treated as a single type, ignoring their specific values; for variable types, common data types are summarized, classified, and given a uniform representation; for called function names, all called functions in the code set to be tested are extracted and simply clustered using string-distance calculation, and similar strings are merged into the same representation.
6. The similarity detection method for carrying out unknown vulnerability discovery by using patch information according to claim 1, characterized in that: in the second similarity calculation and ranking, if a candidate function is a false positive that contains no vulnerability, it is a function containing the patch features, and the similarity value obtained by the second similarity calculation is not lower than the value obtained by the first similarity calculation.
7. The similarity detection method for carrying out unknown vulnerability discovery by using patch information according to claim 1 or 6, characterized in that: the distance between two feature vectors A(a_1, ..., a_n) and B(b_1, ..., b_n) is calculated by cosine similarity:
$$\cos(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
the distance value lies in the interval [0, 1]; the larger the value, the closer the two vectors are and the more similar the corresponding slices and functions; a value of 0 indicates that the two vectors are completely different and do not coincide in any feature dimension; a value of 1 indicates that the two vectors are identical and coincide in all feature dimensions.
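The vectorization of claim 1 and the cosine similarity of claim 7 can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the patented implementation: the helper names (`statement_weights`, `vectorize`, `cosine`), the smoothed weight `1 + log(N / df)`, and the example statements are all hypothetical, since the exact TF-IDF formula and the normalized GIMPLE output are not given in this section.

```python
import math
from collections import Counter


def statement_weights(functions):
    """IDF-style weight per feature statement: 1 + log(N / df).

    Assumption: one common smoothed variant; the source does not fix the formula.
    """
    n = len(functions)
    df = Counter()
    for statements in functions:
        df.update(set(statements))  # count each statement once per function
    return {s: 1.0 + math.log(n / d) for s, d in df.items()}


def vectorize(statements, weights):
    """Sparse feature vector: occurrence count times the statement's weight."""
    counts = Counter(statements)
    return {s: c * weights.get(s, 0.0) for s, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(v * b.get(s, 0.0) for s, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical normalized statements (not real GIMPLE output).
vuln_slice = ["INT = call_len (PTR)", "call_copy (PTR, PTR, INT)"]
candidates = {
    "func_a": ["INT = call_len (PTR)", "call_copy (PTR, PTR, INT)", "INT = INT + CONST"],
    "func_b": ["if (INT > CONST)", "call_free (PTR)"],
}
weights = statement_weights([vuln_slice, *candidates.values()])
vuln_vec = vectorize(vuln_slice, weights)
scores = {name: cosine(vuln_vec, vectorize(body, weights))
          for name, body in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # most similar first
```

Because every vector component is non-negative, the resulting similarity always falls in [0, 1]: `func_b` shares no statement with the vulnerability slice and scores 0, while `func_a` shares both statements but has one extra, so it scores strictly between 0 and 1.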
CN201810047837.XA 2018-01-18 2018-01-18 Similarity detection method for carrying out unknown vulnerability discovery by using patch information Active CN108268777B (en)

Publications (2)

Publication Number Publication Date
CN108268777A (en) 2018-07-10
CN108268777B (en) 2020-06-30

Family

ID=62775965

Country Status (1)

Country Link
CN (1) CN108268777B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117642A (en) * 2018-08-16 2019-01-01 北京梆梆安全科技有限公司 A kind of the file reading leak detection method and device of application program
CN109635569B (en) * 2018-12-10 2020-11-03 国家电网有限公司信息通信分公司 Vulnerability detection method and device
CN110287702B (en) * 2019-05-29 2020-08-11 清华大学 Binary vulnerability clone detection method and device
CN110417751B (en) * 2019-07-10 2021-07-02 腾讯科技(深圳)有限公司 Network security early warning method, device and storage medium
CN111046390B (en) * 2019-07-12 2023-07-07 安天科技集团股份有限公司 Collaborative defense patch protection method and device and storage equipment
CN111125716B (en) * 2019-12-19 2022-05-31 中国人民大学 Method and device for detecting Ethernet intelligent contract vulnerability
CN113742205B (en) * 2020-05-27 2024-04-23 南京大学 Code vulnerability intelligent detection method based on man-machine cooperation
CN112528290B (en) * 2020-12-04 2023-07-18 扬州大学 Vulnerability positioning method, vulnerability positioning system, computer equipment and storage medium
CN112379923B (en) * 2020-12-08 2022-06-21 中国科学院信息工程研究所 Vulnerability code clone detection method and device, electronic equipment and storage medium
CN113468525B (en) * 2021-05-24 2023-06-27 中国科学院信息工程研究所 Similar vulnerability detection method and device for binary program
CN113626820B (en) * 2021-06-25 2023-06-27 中国科学院信息工程研究所 Known vulnerability positioning method and device for network equipment
CN114785574B (en) * 2022-04-07 2023-09-29 国网浙江省电力有限公司宁波供电公司 AI-assisted remote vulnerability accurate verification method
CN117473513B (en) * 2023-12-28 2024-04-12 北京立思辰安科技术有限公司 Equipment detection method, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154257A (en) * 2007-08-14 2008-04-02 电子科技大学 Dynamic mend performing method based on characteristics of loopholes
CN106462703A (en) * 2014-05-22 2017-02-22 软件营地株式会社 System and method for analyzing patch file
CN107229563A (en) * 2016-03-25 2017-10-03 中国科学院信息工程研究所 A kind of binary program leak function correlating method across framework

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9176849B2 (en) * 2013-04-17 2015-11-03 Globalfoundries U.S. 2 Llc Partitioning of program analyses into sub-analyses using dynamic hints

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ShieldGen: Automatic Data Patch Generation for Unknown Vulnerabilities with Informed Probing; Weidong Cui; 2007 IEEE Symposium on Security and Privacy; 20071231; full text *
Feature-matrix-based method for detecting software vulnerability code clones; Gan Shuitao; Journal of Software (软件学报); 20150215; pp. 349-361 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant