CN116663018A - Vulnerability detection method and device based on code executable path - Google Patents

Publication number
CN116663018A
Authority
CN
China
Prior art keywords
statement
code
sentence
path
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310725510.4A
Other languages
Chinese (zh)
Inventor
胡星
刘忠鑫
张峻伟
夏鑫
李善平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202310725510.4A
Publication of CN116663018A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vulnerability detection method and device based on code executable paths. The method constructs a control flow graph of the input code from its abstract syntax tree, extracts a plurality of executable paths from the control flow graph with a path selection algorithm based on a greedy strategy, and learns a feature vector for each path with a pre-trained model. A convolutional neural network fuses the path feature vectors into a feature vector of the code, so that the model learns both the weights among the code tokens within each path and the weights among different paths. Finally, a multi-layer perceptron determines whether the input code segment contains a vulnerability. Decomposing a code segment into several executable paths according to its control flow graph further improves detection performance over existing intelligent code vulnerability detection methods and thereby raises the security level of software in use.

Description

Vulnerability detection method and device based on code executable path
Technical Field
The present invention relates to the field of code vulnerability detection, and in particular, to a vulnerability detection method and device based on a code executable path.
Background
The development of computer science and technology brings convenience to people, but it also gives malicious actors tools for crime. A software vulnerability is a defect arising during the software life cycle (development, deployment and execution) that attackers may exploit to bypass the access control of a system and illegally obtain elevated privileges, allowing them to manipulate the system at will, for example by triggering privileged commands, accessing sensitive information, impersonating identities or monitoring system operation. Many vulnerabilities affect popular software, exposing the many customers who use it to a higher risk of data leakage or supply chain attacks. In recent years, with the continued development of deep learning, vulnerability detection models based on deep learning have attracted wide attention and achieved excellent performance on the vulnerability detection task. However, current vulnerability detection models are of limited effectiveness on longer code segments. On the one hand, long code contains more information that is irrelevant to the vulnerability; on the other hand, existing models are constrained by limited computing resources and often truncate long code, losing important information.
As software systems grow larger and more complex, vulnerabilities occur more and more frequently. Systematic research on vulnerability detection is urgently needed so that vulnerabilities in software systems can be found efficiently and promptly and repaired in time, improving the security level of software applications.
The research topic of intelligent code vulnerability detection arose to improve the accuracy and efficiency of software vulnerability detection, save the labor of security experts and other staff, and better ensure the security of existing software. As the name suggests, intelligent code vulnerability detection takes the code written by developers as input and lets a machine judge whether it contains a vulnerability.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a vulnerability detection method and device based on code executable paths for automatically detecting code vulnerabilities, so as to determine whether code segments with complex structure and considerable length contain vulnerabilities. The method decomposes the control flow graph of a code segment into a plurality of executable paths and combines an existing pre-trained code representation model with a convolutional neural network to further improve detection performance. Decomposing the code segment filters out information in the original segment that is irrelevant to the vulnerability, which makes code feature learning more accurate and further strengthens detection capability.
The invention is realized by the following technical scheme:
According to a first aspect of the present specification, there is provided a vulnerability detection method based on a code executable path, the method comprising the following steps:
S1, obtaining code-label pair data, constructing a data set, and dividing the data set into a training set and a test set;
S2, constructing a syntax-based control flow graph from the abstract syntax tree of the training set code;
S3, extracting executable paths with a greedy algorithm: initializing the weights of all edges and the state of all nodes; repeatedly extracting the path with the minimum weight from the entry node to an exit node, where each extracted path contains at least one unvisited node and, when several paths have the same weight, the path with the largest number of unvisited nodes is preferred; each time a path is extracted, increasing the weights of the edges in that path and marking all nodes in the path as visited; for the last path allowed by the preset extraction threshold, selecting the path with the largest number of unvisited nodes and, if several paths have the same number of unvisited nodes, the one with the smallest weight;
S4, inputting the extracted paths into a vulnerability detection model, which learns the feature vectors of the paths and the weights among the code tokens within each path, to obtain a prediction result for the code; the vulnerability detection model comprises a code pre-training model, a convolutional neural network and a multi-layer perceptron;
S5, evaluating the prediction results on the test set and tuning parameters to obtain the optimal parameters of the model;
S6, inputting the code to be detected into the trained vulnerability detection model to obtain a prediction result for the code.
Further, the data set in S1 is a data set obtained by fusing a plurality of public vulnerability data sets.
Further, the abstract syntax tree in S2 is constructed as follows: blank lines and comments in the code are first removed, the line number of each statement in the code is then marked, and the abstract syntax tree of the code is obtained with a static analyzer; the abstract syntax tree is traversed with a breadth-first search algorithm, and a syntax-based control flow graph is constructed for the control branch structures present in the abstract syntax tree.
Further, the syntax-based control flow graph is constructed as follows: if a node in the control flow graph is a simple statement and its next sibling statement exists in the abstract syntax tree, the node is connected to that sibling statement;
if the node is a loop statement, a connecting edge is first added between the statement and the first sub-statement in its loop structure; if a statement exists outside the loop structure, a connecting edge is added between the loop statement and that statement; the loop structure of the loop statement is then traversed, and if the last sub-statement in the loop structure is a break statement, a connecting edge is added from it to the first statement outside the loop structure; if the last sub-statement is a continue statement, it is connected to the loop statement;
if the statement is a break statement, a connecting edge is added between it and the first statement outside the nearest loop structure;
if the statement is a continue statement, a connecting edge is added between it and the loop statement of the nearest loop structure;
if the statement is a conditional statement, it is connected to the first statement outside the conditional structure, and the last sub-statement in the conditional structure is connected to the first statement outside the structure; if the conditional structure lies inside a loop structure, the conditional statement is connected to the loop statement, and the last sub-statement in the conditional structure is connected to the loop statement; the statements in the conditional structure are then traversed, and a connecting edge is added between the conditional statement and the first sub-statement in the conditional structure; if the conditional structure contains a branch statement, a connecting edge is added between the conditional statement and the branch statement, the connecting edge between the conditional statement and the first statement outside the loop structure is deleted, and a connecting edge is added between the branch statement and the first statement outside the loop structure; the branch structure is then traversed, the branch statement is connected to the first sub-statement in the branch structure, and a connecting edge is added between the last sub-statement in the branch structure and the first statement outside the branch; if the last sub-statement in the conditional structure or the branch structure is a break statement, a continue statement, a return statement or an exception statement, it is connected to the statement that may be executed next.
If the statement is a switch statement, it is first connected to the first branch statement in the switch structure. For each branch structure in the switch structure, the branch statement is connected to the branch statement of the next branch structure, and if the last statement in the branch structure is a break statement, a continue statement, a return statement or an exception statement, that last statement is connected to the statement that may be executed next; finally, the branch statement is connected to the first sub-statement in its branch structure, and the statements in each branch structure are traversed;
for the remaining (default) statement, it is connected to its first sub-statement, and the last sub-statement in its structure is connected to the first statement outside the switch structure.
If the statement is an exception handling statement, it is connected to the first sub-statement of its exception handling structure, and the last statement in the exception handling structure is connected to the catch statement. Each catch statement is connected to the first sub-statement within its catch structure and to the next catch statement; the last statement in the exception handling structure is connected to the first statement outside the structure.
Further, the code pre-training model in S4 is the CodeBERT model, which is based on the Transformer model.
Further, S4 specifically comprises: the code pre-training model learns the feature vector of each path and the weights among the code tokens within the path; a convolutional neural network assigns different weights to the feature vectors of different paths, learns the structural information of the code, and fuses the feature vectors of the paths to generate the feature vector of the code; the feature vector of the code is input into a multi-layer perceptron network to obtain the predicted label of the code.
Further, in S4, in the process of generating the feature vector of the code, the code is converted from words to tokens, and the tokens corresponding to <bos> and <eos> are added at the head and tail to mark the beginning and end of the sequence.
Further, in S4, the loss between the predicted label output by the multi-layer perceptron network and the true label is computed with a cross-entropy loss function, and the network is optimized by back-propagating gradients; the optimizer is AdamW.
Further, in S5, the F1 score is used as the evaluation metric, and the detection performance of the model is evaluated on the test set.
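The token conversion described for S4 above can be sketched as follows. This is a minimal illustration only: the whitespace tokenizer is a hypothetical stand-in for the sub-word tokenizer of a real pre-trained model, and the maximum sequence length of 512 is an assumption that the text does not state.

```python
from typing import List

# Marker names as written in the text; a real tokenizer may use other symbols.
BOS, EOS = "<bos>", "<eos>"


def tokenize_statement(stmt: str) -> List[str]:
    """Hypothetical whitespace tokenizer standing in for a sub-word tokenizer."""
    return stmt.split()


def wrap_path_tokens(tokens: List[str], max_len: int = 512) -> List[str]:
    """Add start/end markers, truncating the body so the total fits max_len.

    max_len=512 is an assumed limit, not a value taken from the text.
    """
    body = tokens[: max(0, max_len - 2)]
    return [BOS] + body + [EOS]


if __name__ == "__main__":
    print(wrap_path_tokens(tokenize_statement("int x = 0 ;")))
```

Each executable path would be wrapped separately, so every path sequence carries its own start and end markers.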
According to a second aspect of the present invention, a code executable path-based vulnerability detection apparatus is provided, which includes a memory and one or more processors, where executable code is stored in the memory, and the processors are configured to implement a code executable path-based vulnerability detection method when executing the executable code.
The beneficial effects of the invention are as follows. Although various code vulnerability detection methods exist, their detection performance is limited, and no existing model decomposes a code segment into a plurality of executable paths, learns the feature vector of each path with a code pre-training model, and fuses the feature vectors of the paths for vulnerability detection. Specifically, the invention combines three data sets (ReVeal, Big-Vul and Devign) to verify whether the proposed model improves vulnerability detection performance; the improvement is especially marked on longer code segments. To show this, the code segments in the test set whose code sequence length exceeds 400 were collected, and the invention was used to identify the vulnerable code among them. Compared with the existing advanced vulnerability detection model CodeBERT, the method improves precision, recall and F1 score by 22.56%, 10.05% and 16.17% respectively.
The method constructs the control flow graph of the code from the abstract syntax tree, decomposes the control flow graph into a plurality of executable paths, and learns the feature representation of the code with the code pre-training model and the convolutional neural network, which further improves vulnerability detection performance. This helps prevent a range of security incidents, including hacking, information leakage and botnet attacks, and improves the security level of software in use.
When the feature representation of the code is learned with the code pre-training model and the convolutional neural network, both the weights among different tokens within each path and the weights among different paths are taken into account, which makes the predictions of the method more accurate.
Drawings
FIG. 1 is a block diagram of a vulnerability detection algorithm of the present invention;
FIG. 2 is an exemplary diagram of source code of the present invention;
FIG. 3 is an illustration of a source code conversion control flow graph of the present invention;
FIG. 4 is an algorithm diagram of an executable path extraction method of the present invention;
FIG. 5 is a block diagram of a code executable path based vulnerability detection apparatus of the present invention.
Detailed Description
The technical solutions of this embodiment are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the possible embodiments. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection.
In a first aspect, this embodiment provides a vulnerability detection method based on a code executable path, comprising the following steps:
S1, obtaining code-label pair data, constructing a data set, and dividing the data set into a training set and a test set;
In this embodiment, three public data sets (ReVeal, Big-Vul and Devign) are fused into a new data set for the code vulnerability detection task. The data set contains 24241 vulnerable code segments and 207059 non-vulnerable code segments, 231300 code segments in total. It is divided into a training set of 185040 segments and a test set of 46260 segments, and code in the test set that also appears in the training set is removed. In addition, the data are cleaned specifically for the vulnerability detection task: blank lines and comments in the code are deleted, and the statements in each code segment are marked with the numbers of the lines on which they appear.
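The cleaning step can be illustrated with a short sketch. It assumes C-style comments (`//` and `/* */`) and numbers the lines after cleaning; the function names are hypothetical, and the regular expression is a simplification that does not handle every corner case of a real lexer.

```python
import re
from typing import List, Tuple

# String and character literals are matched first so that comment markers
# inside them are preserved; then /* ... */ and // comments are matched.
_PATTERN = re.compile(
    r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'|/\*.*?\*/|//[^\n]*',
    re.DOTALL,
)


def strip_comments(code: str) -> str:
    """Replace each comment with a single space, leaving literals untouched."""
    def repl(match: "re.Match[str]") -> str:
        text = match.group(0)
        return text if text[0] in "\"'" else " "
    return _PATTERN.sub(repl, code)


def clean_and_number(code: str) -> List[Tuple[int, str]]:
    """Drop comments and blank lines, then pair each remaining line with its number."""
    lines = [ln.rstrip() for ln in strip_comments(code).splitlines()]
    kept = [ln for ln in lines if ln.strip()]
    return list(enumerate(kept, start=1))


if __name__ == "__main__":
    sample = 'int f() {\n  // note\n\n  char *s = "a//b"; /* x */\n  return 0;\n}\n'
    for number, line in clean_and_number(sample):
        print(number, line)
```

Numbering after cleaning keeps the line numbers dense, which is convenient when they are later used as node identifiers in the control flow graph.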
The algorithm framework of vulnerability detection in this embodiment is shown in fig. 1, and includes the following steps:
S2, constructing a syntax-based control flow graph of the input code segment from the abstract syntax tree of the code: the abstract syntax tree is traversed with a breadth-first search algorithm, and a syntax-based control flow graph is constructed for the control branch structures present in the abstract syntax tree;
The control flow graph is constructed from the abstract syntax tree. The control flow graph of the code, which serves as input, is represented as G = (V, ε), where V denotes the nodes of the graph and ε denotes the edges of the graph, i.e., the control flow between the statement lines of the code.
Common approaches to building the control flow graph of code require compiling the project that contains the code. However, compiling large projects consumes substantial resources, and the code segments provided in existing data sets often cannot be compiled or are even incomplete. Compilation-based control flow graph construction is therefore not suitable for the vulnerability detection task. For this reason, this embodiment constructs the control flow graph from the abstract syntax tree: for source code such as that shown in fig. 2, the abstract syntax tree is obtained with a static analyzer, and the control flow graph shown in fig. 3 is then derived from the abstract syntax tree.
The specific connection process is as follows. If a node in the control flow graph is a simple statement and its next sibling statement exists in the abstract syntax tree, the node is connected to that sibling statement. If the node is a loop statement, a connecting edge is first added between the statement and the first sub-statement in its loop structure. If a statement exists outside the loop structure, a connecting edge is added between the loop statement and that statement. The loop structure of the loop statement is then traversed, and if the last sub-statement in the loop structure is a break statement, a connecting edge is added from it to the first statement outside the loop structure. If the last sub-statement is a continue statement, it is connected to the loop statement.
If the statement is a break statement, a connecting edge is added between it and the first statement outside the nearest loop structure.
If the statement is a continue statement, a connecting edge is added between it and the loop statement of the nearest loop structure.
If the statement is a conditional statement (if_statement), it is first connected to the first statement outside the conditional structure, and the last sub-statement inside the conditional structure is connected to the first statement outside the structure. If the conditional structure lies inside a loop structure, the conditional statement is connected to the loop statement, and the last sub-statement in the conditional structure is connected to the loop statement. The statements in the conditional structure are then traversed, and a connecting edge is added between the conditional statement and the first sub-statement in the conditional structure. If the conditional structure contains a branch statement (such as else_statement), a connecting edge is added between the conditional statement and the branch statement, the connecting edge between the conditional statement and the first statement outside the loop structure is deleted, and a connecting edge is added between the branch statement and the first statement outside the loop structure. The branch structure is then traversed, the branch statement is connected to the first sub-statement in the branch structure, and a connecting edge is added between the last sub-statement in the branch structure and the first statement outside the branch. If the last sub-statement in the conditional structure or the branch structure is a break statement, a continue statement, a return statement or an exception statement, it is connected to the statement that may be executed next.
If the statement is a switch statement (switch_statement), it is first connected to the first branch statement (case_statement) in the switch structure. For each branch structure within the switch structure, the branch statement is connected to the branch statement of the next branch structure (case_statement or default_statement), and if the last statement within the branch structure is a break statement, a continue statement, a return statement or an exception statement, it is connected to the statement that may be executed next. Finally, each branch statement is connected to the first sub-statement within its branch structure, and the statements in each branch structure are traversed. The remaining statement (default_statement) is connected to its first sub-statement, and the last sub-statement in its structure is connected to the first statement outside the switch structure.
If the statement is an exception handling statement (try_statement), it is connected to the first sub-statement of its exception handling structure, and the last statement in the exception handling structure is connected to the catch statement. Each catch statement is connected to the first sub-statement within its catch structure and to the next catch statement. Furthermore, the last statement in the exception handling structure is connected to the first statement outside the structure.
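A reduced version of these connection rules can be sketched in code. The sketch below is an assumption-laden simplification, not the procedure of the embodiment: it handles only simple, if/else, while, break, continue and return statements (switch and exception handling are omitted), it walks the tree recursively instead of breadth-first, and it tracks "dangling" predecessors rather than adding and later deleting edges. The `Stmt` class and the `EXIT` node id are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Edge = Tuple[int, int]
EXIT = -1  # hypothetical id for the single exit node of the code segment


@dataclass
class Stmt:
    line: int
    kind: str  # "simple", "if", "while", "break", "continue" or "return"
    body: List["Stmt"] = field(default_factory=list)    # then-branch or loop body
    orelse: List["Stmt"] = field(default_factory=list)  # else-branch, if any


def build_cfg(stmts: List[Stmt]) -> Set[Edge]:
    """Return control flow edges between statement line numbers."""
    edges: Set[Edge] = set()

    def link(sources: List[int], target: int) -> None:
        for src in sources:
            edges.add((src, target))

    def walk(block: List[Stmt], preds: List[int],
             loop_head: Optional[int], breaks: List[int]) -> List[int]:
        # preds: lines whose control flows into the next statement of the block.
        for st in block:
            link(preds, st.line)
            if st.kind == "if":
                then_out = walk(st.body, [st.line], loop_head, breaks)
                else_out = walk(st.orelse, [st.line], loop_head, breaks)
                preds = then_out + else_out
            elif st.kind == "while":
                inner_breaks: List[int] = []
                body_out = walk(st.body, [st.line], st.line, inner_breaks)
                link(body_out, st.line)            # back edge to the loop statement
                preds = [st.line] + inner_breaks   # loop exit and break targets
            elif st.kind == "break":
                breaks.append(st.line)
                preds = []
            elif st.kind == "continue":
                if loop_head is not None:
                    edges.add((st.line, loop_head))
                preds = []
            elif st.kind == "return":
                edges.add((st.line, EXIT))
                preds = []
            else:  # simple statement: falls through to its next sibling
                preds = [st.line]
        return preds

    link(walk(stmts, [], None, []), EXIT)
    return edges


if __name__ == "__main__":
    program = [
        Stmt(1, "simple"),
        Stmt(2, "while", body=[Stmt(3, "if", body=[Stmt(4, "break")]),
                               Stmt(5, "simple")]),
        Stmt(6, "return"),
    ]
    print(sorted(build_cfg(program)))
```

In the example, the break on line 4 is connected to line 6, the first statement after the loop, and line 5 is connected back to the loop statement on line 2, matching the rules for break statements and loop bodies described above.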
S3, extracting executable paths with a greedy algorithm: a plurality of executable paths are extracted from the control flow graph built from the abstract syntax tree.
A piece of code can be regarded as the combination of all its executable paths. However, if the code contains a loop structure, the number of executable paths may be infinite, and encoding all of them would require excessive computing resources. It is therefore impractical to encode every executable path in the control flow graph in order to learn a representation of the code segment; only a limited number of executable paths can be selected from the control flow graph. On the one hand, the selected paths should cover as many lines of code as possible; on the other hand, to ease the burden of model training, the selected paths should be as short as possible. For this purpose, this embodiment adopts a greedy executable path extraction method that selects and encodes several representative executable paths to represent the code. The extraction process is as follows. As shown in fig. 4, each path is represented as p = (n_1, n_2, ..., n_k), where each n_i is a node in the control flow graph, i.e., a statement in the code, and a control flow edge must exist between any two adjacent nodes in the path.
Assume that m paths are to be selected from the control flow graph. The initial weight of every edge in the control flow graph is first set to 1, and every node in the control flow graph is marked as unvisited. Then, for the first m-1 paths, the exit nodes of the control flow graph are traversed in turn and a path with the minimum weight from the entry node to an exit node is selected, ensuring that each path contains at least one unvisited node. If several paths have the same weight, the one with the largest number of unvisited nodes is selected. The nodes in the selected path are then marked as visited, and the weights of the edges in the path are increased, which encourages later selections to explore unvisited nodes and paths. These steps are repeated until m-1 paths have been selected. Finally, for the last path, the algorithm selects the path with the largest number of unvisited nodes; if several paths have the same number of unvisited nodes, the one with the smallest weight is selected.
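The selection rule can be sketched as follows. The sketch rests on several assumptions that the text leaves open: candidate paths are found by enumerating all entry-to-exit walks that use each edge at most once (feasible only for small graphs), the weight of an edge on a selected path is increased by 1, there is a single exit node, and selection stops early when no candidate contains an unvisited node. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def enumerate_trails(edges: Set[Edge], entry: int, exit_node: int) -> List[List[int]]:
    """List every walk from entry to exit_node that uses each edge at most once."""
    succ: Dict[int, List[int]] = {}
    for u, v in sorted(edges):
        succ.setdefault(u, []).append(v)
    trails: List[List[int]] = []

    def dfs(node: int, path: List[int], used: Set[Edge]) -> None:
        if node == exit_node:
            trails.append(list(path))
            return
        for nxt in succ.get(node, []):
            if (node, nxt) not in used:
                used.add((node, nxt))
                path.append(nxt)
                dfs(nxt, path, used)
                path.pop()
                used.remove((node, nxt))

    dfs(entry, [entry], set())
    return trails


def extract_paths(edges: Set[Edge], entry: int, exit_node: int, m: int) -> List[List[int]]:
    """Greedily pick up to m paths following the rule described above."""
    weight = {e: 1 for e in edges}   # all edge weights start at 1
    visited: Set[int] = set()        # all nodes start unvisited
    candidates = enumerate_trails(edges, entry, exit_node)

    def cost(p: List[int]) -> int:
        return sum(weight[(a, b)] for a, b in zip(p, p[1:]))

    def fresh(p: List[int]) -> int:
        return len(set(p) - visited)

    selected: List[List[int]] = []
    for i in range(m):
        pool = [p for p in candidates if fresh(p) > 0]
        if not pool:
            break  # assumption: stop when nothing new can be covered
        if i < m - 1:
            # first m-1 paths: minimum weight, ties broken by most unvisited nodes
            best = min(pool, key=lambda p: (cost(p), -fresh(p)))
        else:
            # last path: most unvisited nodes, ties broken by minimum weight
            best = min(pool, key=lambda p: (-fresh(p), cost(p)))
        selected.append(best)
        visited.update(best)
        for a, b in zip(best, best[1:]):
            weight[(a, b)] += 1  # assumed increment of 1
    return selected


if __name__ == "__main__":
    cfg = {(1, 2), (2, 3), (3, 4), (3, 5), (5, 2), (2, 6), (4, 6), (6, -1)}
    for path in extract_paths(cfg, entry=1, exit_node=-1, m=3):
        print(path)
```

On this small graph the shortest path that skips the loop is chosen first, and the raised edge weights then steer the following selections towards the loop body and the break branch.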
S4, taking each executable path of the code as input: a code pre-training model learns the feature vector of each path and the weights among the code tokens within the path; a convolutional neural network assigns different weights to the feature vectors of different paths, learns the structural information of the code, and fuses the feature vectors of the paths to generate the feature vector of the code. The process thus takes into account both the weights among different tokens within each path and the weights among different paths. The feature vector of the code is input into a multi-layer perceptron network to obtain the predicted label of the code;
Common deep learning models for learning code features include LSTM models, graph neural network models and pre-trained models; this embodiment uses a pre-trained model, which performs better, to generate the feature vectors of the executable paths. In this embodiment, the CodeBERT model, which is based on the Transformer model, learns the feature vector of each path. The feature vectors of the paths are used as input to a convolutional neural network whose convolutional layer uses several convolution kernels of different sizes to extract features of the paths from multiple perspectives and assign different weights to different paths. The resulting path features are concatenated as the overall feature of the input code and passed through a ReLU activation function; after activation and a pooling operation, the multi-layer perceptron network layer produces the binary classification result.
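The fusion step can be made concrete with a small pure-Python sketch of a convolution over the path axis followed by ReLU and max-pooling. It is illustrative only: the path vectors and kernel weights below are made-up numbers (in the embodiment the vectors come from the pre-trained model and the kernels are learned), the function names are hypothetical, and a practical implementation would use a deep learning framework.

```python
from typing import List

Vector = List[float]
Kernel = List[Vector]  # one weight row per path position covered by the window


def conv_relu_maxpool(paths: List[Vector], kernel: Kernel, bias: float = 0.0) -> float:
    """Slide one kernel over consecutive path vectors, apply ReLU, then max-pool."""
    window = len(kernel)
    best = 0.0  # ReLU output is never negative, so 0.0 is a safe starting maximum
    for start in range(len(paths) - window + 1):
        total = bias
        for k_row, p_row in zip(kernel, paths[start:start + window]):
            total += sum(w * x for w, x in zip(k_row, p_row))
        best = max(best, total)  # equals max over positions of relu(total)
    return best


def fuse_paths(paths: List[Vector], kernels: List[Kernel]) -> Vector:
    """Concatenate the pooled response of every kernel into one code feature vector."""
    return [conv_relu_maxpool(paths, kernel) for kernel in kernels]


if __name__ == "__main__":
    # Three paths with made-up 2-dimensional feature vectors.
    path_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    kernels = [
        [[1.0, 1.0]],                # window of 1 path
        [[1.0, -1.0], [1.0, -1.0]],  # window of 2 paths
        [[-1.0, -1.0]],              # responds only to negative features
    ]
    print(fuse_paths(path_vectors, kernels))
```

Using kernels with different window sizes lets the fused vector combine evidence from single paths and from groups of neighbouring paths; the fused vector would then be passed to the multi-layer perceptron.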
A Softmax operation is applied to the result, which finally outputs a probability for each of the two classes. If the probability of the vulnerable class is higher, the model judges that the code contains a vulnerability; if the probability of the non-vulnerable class is higher, the model judges that the code does not contain a vulnerability. Finally, the loss for the current round is computed by comparing the output with the data label, and the gradients are back-propagated to train the vulnerability detection model. The optimizer is AdamW.
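The Softmax and loss computation for a single sample can be written out as below. The sketch assumes that index 1 denotes the vulnerable class and index 0 the non-vulnerable class, a convention the text does not fix; the logits are made-up numbers, and the gradient step performed by the optimizer is not shown.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


def predict(logits: List[float]) -> int:
    """Return the index of the class with the higher probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    logits = [0.2, 1.5]  # made-up output of the multi-layer perceptron
    print(softmax(logits), predict(logits), cross_entropy(logits, 1))
```

The loss is small when the true class receives a high probability and grows without bound as that probability approaches zero, which is what drives the gradient updates.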
S5, evaluating the detection performance of the model on the test set with an evaluation metric, and adjusting the parameters of the pre-training model, the convolutional neural network and the multi-layer perceptron network according to the evaluation results to find the parameters that give the best detection performance;
Common evaluation metrics for vulnerability detection are accuracy, recall, precision and F1 score. The data sets in the vulnerability detection task are imbalanced, i.e., code samples without vulnerabilities far outnumber code samples with vulnerabilities. If accuracy were used, a model biased towards the features of non-vulnerable code would still score well, so this embodiment uses the F1 score as the evaluation metric.
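The effect of class imbalance on the choice of metric can be checked with a short sketch. With the class counts given for the fused data set (24241 vulnerable segments out of 231300), a model that labels everything as non-vulnerable would reach an accuracy of about 89.5% while its F1 score is 0. The function below is a generic implementation of the standard definitions; treating label 1 as the vulnerable class is an assumption.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive (vulnerable) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


if __name__ == "__main__":
    # Class counts from the text; the all-negative predictor is a thought experiment.
    labels = [1] * 24241 + [0] * 207059
    all_negative = [0] * len(labels)
    print(accuracy(labels, all_negative))
    print(precision_recall_f1(labels, all_negative))
```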
After some experimentation, the following settings were found to steadily improve the vulnerability detection effect: a step size of 28 for the pre-training model, 5 training rounds (epochs), a convolution kernel size of 128, and a convolution kernel window of 128.
S6, inputting the codes to be detected into a trained vulnerability detection model to obtain a code prediction result.
On the other hand, corresponding to the foregoing embodiment of the vulnerability detection method based on the code executable path, an embodiment of a vulnerability detection device based on the code executable path is also provided.
Referring to fig. 5, a code executable path-based vulnerability detection apparatus provided by an embodiment of the present invention includes a memory and one or more processors, where executable codes are stored in the memory, and when the processors execute the executable codes, the processors are configured to implement the code executable path-based vulnerability detection method in the above embodiment.
The embodiment of the vulnerability detection apparatus based on the code executable path can be applied to any device with data processing capability, such as a computer. The apparatus embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the apparatus in a logical sense is formed when the processor of the device with data processing capability on which it resides reads the corresponding computer program instructions from the nonvolatile memory into memory. In terms of hardware, fig. 5 shows a hardware structure diagram of the device with data processing capability on which the vulnerability detection apparatus based on the code executable path resides; in addition to the processor, memory, network interface and nonvolatile memory shown in fig. 5, that device may generally include other hardware according to its actual function, which is not described here again.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the vulnerability detection method based on the code executable path in the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in any of the previous embodiments. It may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash memory card (Flash Card) provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used for storing the computer program and other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention to the particular embodiments described.

Claims (10)

1. A vulnerability detection method based on a code executable path, the method comprising the steps of:
S1, obtaining code-label pair data, constructing a data set, and dividing the data set into a training set and a test set;
S2, constructing a grammar-based control flow graph based on the abstract syntax tree of the training set code;
S3, extracting executable paths based on a greedy algorithm: initializing the weights of all edges and the nodes; repeatedly extracting the path with the minimum weight from the initial node to the exit node; each time a path is extracted, increasing the weights of the edges connecting the nodes in that path and marking all nodes in the path as visited; each extracted path comprising at least one unvisited node, and, when the weights are the same, the path with the most visited nodes being preferentially selected; when the number of paths reaches a preset extraction threshold, selecting as the last path the path with the largest number of unvisited nodes, and, if the numbers of unvisited nodes in the paths are the same, selecting the path with the smallest weight;
S4, inputting the extracted paths into a vulnerability detection model, which learns the feature vectors of the paths and the weights among the different code characters in each path, to obtain a code prediction result; the vulnerability detection model comprises a code pre-training model, a convolutional neural network and a multi-layer perceptron;
S5, evaluating the prediction results of the code on the test set and tuning parameters to obtain the optimal parameters of the model;
S6, inputting the code to be detected into the trained vulnerability detection model to obtain a code prediction result.
2. The code executable path based vulnerability detection method of claim 1, wherein the data set in S1 is a data set obtained by fusing a plurality of public vulnerability data sets.
3. The code executable path-based vulnerability detection method of claim 1, wherein the construction of the abstract syntax tree in S2 is specifically: first, blank lines and comments in the code are removed, and the line number of the statement on each line of the code is marked; then the abstract syntax tree corresponding to the code is obtained through a static analyzer, the abstract syntax tree is traversed with a breadth-first search algorithm, and a grammar-based control flow graph is constructed for the control branch structures existing in the abstract syntax tree.
4. The code executable path based vulnerability detection method of claim 3, wherein the grammar-based control flow graph construction process is specifically: if a node in the control flow graph is a declaration statement and its next sibling statement exists in the abstract syntax tree, it is connected to that sibling statement;
if the node is a loop statement, first adding a connecting edge between the statement and the first sub-statement in its loop structure; if a statement exists outside the loop structure, adding a connecting edge between the loop statement and that statement; traversing the loop structure in which the loop statement is located, and, if the last sub-statement in the loop structure is a break statement, adding a connecting edge from it to the first statement outside the loop structure; if the last sub-statement is a continue statement, connecting the sub-statement with the loop statement;
if the statement type is a break statement, adding a connecting edge between it and the first statement outside the nearest loop structure;
if the statement type is a continue statement, adding a connecting edge between it and the loop statement of the nearest loop structure;
if the statement type is a conditional statement, connecting the statement with the first statement outside the conditional structure, and connecting the last sub-statement in the conditional structure with the first statement outside the structure; if the conditional structure is located inside a loop structure, connecting the conditional statement with the loop statement, and connecting the last sub-statement in the conditional structure with the loop statement; traversing the statements in the conditional structure, and adding a connecting edge between the conditional statement and the first sub-statement in the conditional structure; if the conditional structure comprises a branch statement, adding a connecting edge between the conditional statement and the branch statement, deleting the connecting edge between the conditional statement and the first statement outside the loop structure, and adding a connecting edge between the branch statement and the first statement outside the loop structure; then traversing the branch structure, connecting the branch statement with the first sub-statement in the branch structure, and adding a connecting edge between the last sub-statement in the branch structure and the first statement outside the branch; if the last sub-statement in the conditional structure or the branch structure is a break statement, a continue statement, a return statement or an exception statement, connecting the last sub-statement with the statement to be executed next;
if the statement type is a switch statement, connecting it with the first branch statement in the switch structure; for each branch structure in the switch structure, connecting each branch statement with the branch statement of the next branch structure, and, if the last statement in the branch structure is a break statement, a continue statement, a return statement or an exception statement, connecting the last statement with the statement to be executed next; finally, connecting the branch statement with the first sub-statement in the branch structure, and traversing the statements in each branch structure;
for the other statements, connecting each with its first sub-statement, and connecting the last sub-statement in the structure with the first statement outside the switch structure;
if the statement type is an exception handling statement, connecting the exception handling statement with the first sub-statement of its exception handling structure, and connecting the last statement in the exception handling structure with the catch statement; each catch statement is connected to the first sub-statement within its catch structure and to the next catch statement; the last statement in the exception handling structure is connected to the first statement outside the structure.
5. The code executable path based vulnerability detection method of claim 1, wherein the code pre-training model in S4 is a CodeBERT model based on the Transformer model.
6. The code executable path-based vulnerability detection method of claim 1, wherein step S4 specifically comprises: the code pre-training model learns the feature vector of each path and the weights among the code characters in the path; the convolutional neural network assigns different weights to the feature vectors of different paths, learns the structural information of the code, and fuses the feature vectors of the multiple paths to generate the feature vector of the code; and the feature vector of the code is input into the multi-layer perceptron network to obtain the predicted label of the code.
7. The method of claim 6, wherein, in the process of generating the feature vector of the code in S4, the code is converted from words to tokens, and the tokens corresponding to <bos> and <eos> are added at the head and tail to mark the beginning and end of the sequence.
8. The code executable path-based vulnerability detection method of claim 6, wherein in S4, the loss between the predicted label obtained by the multi-layer perceptron network and the real label is calculated with a cross-entropy loss function, and the gradient is back-propagated to optimize the network; the optimizer is AdamW.
9. The code executable path-based vulnerability detection method of claim 1, wherein in S5, the evaluation is performed on a test set using the F1 value as an evaluation index.
10. A code executable path based vulnerability detection apparatus comprising a memory and one or more processors, the memory having executable code stored therein, wherein the processor, when executing the executable code, is configured to implement the code executable path based vulnerability detection method of any one of claims 1-9.
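The greedy path extraction recited in step S3 of claim 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed procedure in full: every edge starts with weight 1 and is increased by 1 when used (the source does not give the amounts), the minimum-weight path is found with Dijkstra's algorithm, extraction stops once a path adds no unvisited node, and the tie-breaking and last-path selection rules are omitted. The function names and the toy graph are hypothetical.

```python
import heapq

def min_weight_path(weights, entry, exit_node):
    """Dijkstra over edge weights stored as {(src, dst): weight}."""
    adjacency = {}
    for src, dst in weights:
        adjacency.setdefault(src, []).append(dst)
    dist = {entry: 0}
    prev = {}
    heap = [(0, entry)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in adjacency.get(node, []):
            nd = d + weights[(node, nxt)]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if exit_node not in dist:
        return None
    path = [exit_node]
    while path[-1] != entry:
        path.append(prev[path[-1]])
    return path[::-1]

def extract_paths(edges, entry, exit_node, max_paths):
    weights = {edge: 1 for edge in edges}  # assumption: initial weight 1
    visited = set()
    paths = []
    while len(paths) < max_paths:
        path = min_weight_path(weights, entry, exit_node)
        # Stop when no path exists or the path adds no unvisited node.
        if path is None or all(node in visited for node in path):
            break
        paths.append(path)
        visited.update(path)
        for edge in zip(path, path[1:]):
            weights[edge] += 1  # assumption: penalise used edges by 1
    return paths

# Toy control flow graph of an if/else: 1 -> 2 -> (3 | 4) -> 5.
edges = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
paths = extract_paths(edges, entry=1, exit_node=5, max_paths=10)
print(paths)
```

Raising the weight of edges that were already used steers each later search toward branches that have not yet been covered, which is the intent of the weighting scheme in S3.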
CN202310725510.4A 2023-06-19 2023-06-19 Vulnerability detection method and device based on code executable path Pending CN116663018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310725510.4A CN116663018A (en) 2023-06-19 2023-06-19 Vulnerability detection method and device based on code executable path

Publications (1)

Publication Number Publication Date
CN116663018A true CN116663018A (en) 2023-08-29

Family

ID=87720619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310725510.4A Pending CN116663018A (en) 2023-06-19 2023-06-19 Vulnerability detection method and device based on code executable path

Country Status (1)

Country Link
CN (1) CN116663018A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117453578A (en) * 2023-12-25 2024-01-26 杭州云动智能汽车技术有限公司 NMEA sentence detection method and device, electronic equipment and storage medium
CN117453578B (en) * 2023-12-25 2024-04-19 杭州云动智能汽车技术有限公司 NMEA sentence detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination