CN111753303A - Multi-granularity code vulnerability detection method based on deep learning and reinforcement learning

Info

Publication number: CN111753303A
Application number: CN202010747186.2A
Authority: CN
Inventors: 蒋远; 苏小红; 王甜甜
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2020-10-09
Anticipated expiration: 2040-07-29
Also published as: CN111753303B

Abstract

The invention discloses a multi-granularity code vulnerability detection method based on deep learning and reinforcement learning, comprising the following steps: 1) analyzing the source code to obtain the corresponding intermediate code representation; 2) slicing the intermediate code to obtain code segments smaller than the source program; 3) converting an input code segment into a low-dimensional continuous real-valued vector by using a code segment representation method; 4) inputting the vector representation of the code segment into a coarse-grained code vulnerability detection model based on deep learning, and judging whether the code segment contains defects; and 5) constructing a fine-grained code vulnerability detection model based on reinforcement learning, and predicting the specific code lines that cause vulnerabilities in code segments containing defects. The invention provides a complete multi-granularity code vulnerability detection framework, applies reinforcement learning to fine-grained code vulnerability detection for the first time, and provides a new staged code representation learning model that makes full use of the semantic information of a program, thereby improving the accuracy and practicality of vulnerability detection.

Description

Multi-granularity code vulnerability detection method based on deep learning and reinforcement learning

Technical Field

The invention relates to a code vulnerability detection method, in particular to a multi-granularity code vulnerability detection method based on deep learning and reinforcement learning technologies.

Background

Software vulnerabilities are defects that exist in software during its life cycle. Attackers can exploit these defects to bypass a system's access control and illegally obtain elevated privileges, and then operate the system at will, for example by triggering privileged commands, accessing sensitive information, impersonating identities, or monitoring system operation. If a security-related vulnerability is not identified and repaired in time, it is easily exploited by a malicious attacker, who can intrude into the system and cause unreliable results or serious security problems such as arbitrary command execution and arbitrary file reading.

Code analysis is the main means of checking for and discovering inherent defects in software code and the ways they can be exploited, and it has long been a research focus in information security and software security. However, as software systems grow larger and more complex, vulnerabilities occur more frequently and attack techniques keep improving, so traditional vulnerability detection tools based on predefined rules can no longer meet the needs of modern software development, and more and more researchers have turned to code vulnerability detection methods based on machine learning and deep learning. Machine-learning-based methods rely on experts to manually define code features (e.g., software complexity metrics, function calls, code changes, and system calls) and then use a machine learning model to automatically classify vulnerable and non-vulnerable code. However, the definition of code features is subjective, so such methods are generally suitable only for specific projects and generalize poorly. In addition, the code fed into the machine learning model is generally coarse-grained, so the exact position of the vulnerable code line cannot be determined. Deep-learning-based methods do not need experts to define features manually and can automatically learn vulnerability patterns from large amounts of historical data. They are expected to change source code vulnerability detection by replacing expert-defined patterns for the various vulnerability types with automatically generated ones, significantly improving detection effectiveness.
However, research on such methods has only just begun. Most work focuses on coarse-grained code vulnerability detection, for example at the function level or the file level, and research on the "vulnerability structure" of code is very scarce. The vulnerability structure not only lets a detection tool judge whether the code contains a vulnerability, but also indicates the specific form of the vulnerability in the code and the position where it occurs. For deep-learning-based vulnerability detection to be applied well in practice, research on the vulnerability structure of code is necessary.

VulDeeLocator (Z. Li, D. Zou, S. Xu, Z. Chen, Y. Zhu, and H. Jin, "VulDeeLocator: A Deep Learning-based Fine-grained Vulnerability Detector," arXiv preprint arXiv:2001.02350, 2020) is the only document that can currently be found which achieves statement-level fine-grained vulnerability location based on deep learning. However, judging from its experimental results, the fine-grained detection of VulDeeLocator is not clearly superior to traditional rule-based vulnerability detection tools, because its network structure cannot sufficiently capture the semantic information of the program and the code structure related to the vulnerability.

Disclosure of Invention

The invention aims to provide a multi-granularity code vulnerability detection method based on deep learning and reinforcement learning, which can not only detect a defective software module (such as a function) at a coarse granularity, but also locate the code statements in the module that may cause the vulnerability at a finer granularity, thereby realizing multi-granularity code vulnerability detection. In addition, the accuracy of a vulnerability detection model depends on the accuracy of the model input, namely the accuracy of the code segment representation. To address the problem that existing token-based code representation methods cannot make full use of code semantic information, the invention provides a new staged code representation learning method (Staged Code Representation Learning). The method first learns a vector representation of each statement in the code, and then learns a vector representation of the entire program from the statement vectors. Separating the statement representation from the program representation enables the learned code vector to capture subtler structural (syntactic) differences between programs as well as more complex semantic information.

The purpose of the invention is realized by the following technical scheme:

A multi-granularity code vulnerability detection method based on deep learning and reinforcement learning proceeds as follows. First, the source code is analyzed to obtain the corresponding intermediate code representation. Compared with the source code, the intermediate code representation captures more program control flow and variable definition-use information. Second, key points that may cause a vulnerability are taken as the slicing criterion, and the intermediate code is sliced to obtain code segments (code gadgets) smaller than the source program, which shortens the input sequence of the model and prevents vulnerability-irrelevant statements from drowning out important information. Third, the input code segment is converted into a low-dimensional continuous real-valued vector by the code segment representation learning model provided by the invention. Then, the vector representation of the code segment is input into a coarse-grained code vulnerability detection model based on deep learning, which judges whether the code segment corresponding to the input vector contains a vulnerability. Finally, if the coarse-grained model detects that the input code segment contains a vulnerability, the fine-grained detection model based on reinforcement learning makes a further judgment, namely finding the code lines that may cause the vulnerability.

The method specifically comprises the following steps:

Step 1: performing static analysis on the source program by using the Clang tool to obtain an intermediate code representation of the program;

Step 2: extracting key points that may cause a vulnerability, generating a slicing criterion, slicing the intermediate code, and combining the forward slice and the backward slice to obtain code segments of the program;

Step 3: representing the code segments as low-dimensional continuous real-valued vectors by using the code segment representation learning method;

Step 4: inputting the vector representation of the code segment into a coarse-grained code vulnerability detection model based on deep learning, and judging whether the code segment contains a vulnerability;

Step 5: constructing a fine-grained vulnerability detection model based on reinforcement learning, and predicting the specific code lines that cause vulnerabilities in code segments containing defects.

Compared with the prior art, the invention has the following advantages:

1. Whereas existing methods can only detect vulnerabilities at a coarse granularity, the method of the invention can both detect vulnerable code modules at a coarse granularity and locate vulnerable statements at a fine granularity, which improves the practicality of data-driven vulnerability detection and the interpretability of the model's predictions.

2. Compared with existing token-based code representation methods, the new staged code representation learning method makes full use of both the local and the global semantic information of a program, which improves the accuracy of the generated code representation vectors and in turn the vulnerability detection capability of the model.

3. The invention is the first to apply reinforcement learning to the fine-grained code vulnerability detection task. The model repeatedly tries different combinations of candidate vulnerable statements on the training examples, and the number of vulnerable statements contained in each combination is fed back to the policy (Policy) module as a signal for adjusting the model's behavior. By accumulating these signals, the model automatically learns the code structures related to vulnerabilities.

Drawings

Fig. 1 is a general flowchart of a multi-granularity code vulnerability detection method proposed by the present invention.

FIG. 2 is a diagram of statement nodes corresponding to function call key points in a syntax tree.

FIG. 3 is a diagram of the statement node corresponding to an array definition key point in the syntax tree.

FIG. 4 is a diagram of the statement node corresponding to a pointer definition key point in the syntax tree.

FIG. 5 is a diagram of the statement nodes corresponding to evaluation expression key points in a syntax tree.

Detailed Description

The technical solution of the invention is further described below with reference to the accompanying drawings, but the invention is not limited thereto. Any modification or equivalent replacement of the technical solution that does not depart from its spirit and scope shall fall within the protection scope of the invention.

The invention realizes coarse-grained code vulnerability identification with deep learning and fine-grained vulnerability statement location with reinforcement learning. First, the source code is parsed to obtain the corresponding intermediate code representation (e.g., LLVM IR). Second, the intermediate code is sliced, with key points that may cause a vulnerability as the slicing criterion, to obtain code segments (code gadgets) smaller than the source program. Third, the code segment representation learning method converts the input code segments into low-dimensional continuous real-valued vectors. Then, the vector representation of the code segment is input into the coarse-grained vulnerability detection model based on deep learning, which judges whether the code segment corresponding to the input vector contains a vulnerability. Finally, if the coarse-grained model detects that the input code segment contains a vulnerability, the fine-grained vulnerability detection model based on reinforcement learning makes a further judgment, namely finding the code lines that cause the vulnerability.
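The two-stage control flow described above can be sketched as follows. This is only an illustration of the order of the stages: the function name `detect`, the callables `encode`, `coarse`, and `fine`, and the 0.5 decision threshold are hypothetical and are not specified by the invention.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def detect(
    statements: Sequence[str],
    encode: Callable[[Sequence[str]], Tuple[List[Vector], Vector]],
    coarse: Callable[[Vector], float],
    fine: Callable[[List[Vector]], List[int]],
    threshold: float = 0.5,  # assumed decision threshold, not from the source
) -> Tuple[bool, List[int]]:
    """Run coarse-grained detection first; run fine-grained location only on a hit."""
    # Stage 1: staged representation (statement vectors, then gadget vector).
    statement_vectors, gadget_vector = encode(statements)
    # Stage 2: coarse-grained decision on the whole code gadget.
    if coarse(gadget_vector) < threshold:
        return False, []
    # Stage 3: fine-grained model emits one 0/1 action per statement.
    actions = fine(statement_vectors)
    return True, [i for i, action in enumerate(actions) if action == 1]
```

The fine-grained model is consulted only for gadgets that the coarse-grained model flags, which mirrors the "if the coarse-grained model detects ... then ..." wording of the description.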

As shown in fig. 1, the specific steps are as follows:

Step 1: performing static analysis on the source program by using the Clang tool to obtain an intermediate code representation of the program.
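As a rough illustration of step 1, the sketch below builds a Clang command line that emits textual LLVM IR. The description names only "a Clang tool" and gives LLVM IR as an example, so the exact flags are an assumption: `-S -emit-llvm` writes a `.ll` file, `-O0` disables optimization, and `-g` keeps debug information that ties IR instructions back to source lines. The helper names `clang_ir_command` and `emit_ir` are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def clang_ir_command(source: str, output: Optional[str] = None) -> List[str]:
    """Build (but do not run) a clang invocation that emits textual LLVM IR."""
    src = Path(source)
    out = Path(output) if output else src.with_suffix(".ll")
    # Assumed flag set; the patent does not list the options it uses.
    return ["clang", "-S", "-emit-llvm", "-g", "-O0", str(src), "-o", str(out)]


def emit_ir(source: str) -> Path:
    """Run clang on one C file and return the path of the generated .ll file."""
    command = clang_ir_command(source)
    subprocess.run(command, check=True)  # requires clang on PATH
    return Path(command[-1])
```

Keeping debug information is one plausible way to maintain the source-to-IR correspondence that the later slice conversion relies on.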

Step 2: extracting key points that may cause a vulnerability, generating a slicing criterion, slicing the intermediate code, and combining the forward slice and the backward slice to obtain a code segment of the program, with the following specific steps:

Step 21: parsing the program into an abstract syntax tree by using a program analysis technique;

Step 22: traversing the generated abstract syntax tree with a node matching algorithm to find four kinds of syntax tree nodes that may cause code vulnerabilities: (1) function call statement nodes (i.e., Callee), as shown in FIG. 2; (2) array definition statement nodes (i.e., IdentifierDeclStatement nodes whose statement contains the "[" and "]" characters), as shown in FIG. 3; (3) pointer definition statement nodes (i.e., IdentifierDeclStatement nodes whose statement contains the "*" character), as shown in FIG. 4; (4) expression statement nodes (i.e., ExpressionStatement), as shown in FIG. 5;

Step 23: filtering the four kinds of nodes extracted from the program and selecting the syntax tree nodes that satisfy the conditions as key points that may cause a vulnerability. For example, if the identifier (Identifier) corresponding to a "Callee" node is in a predefined list of library functions that may cause vulnerabilities, the parent node (CallExpression) of that "Callee" node is taken as the statement S of the slicing criterion;

Step 24: extracting the variables involved in the statement S as V of the slicing criterion, and combining S and V to obtain the slicing criterion (S, V) used for extracting slices;

Step 25: parsing the program into a program dependence graph by using a program analysis technique, performing forward and backward slice analysis on the program dependence graph according to the slicing criterion generated in step 24, and combining the forward slices and the backward slices to obtain the program slices related to the four kinds of key points that may cause a vulnerability;

Step 26: converting the program slice of the source program into a program slice of the intermediate code according to the correspondence between the source program and the intermediate code.
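The key-point filter of step 23 and the combined forward and backward slice of step 25 can be sketched on a toy dependence graph. Everything here is illustrative: the `RISKY_CALLS` list is hypothetical (the predefined library function list is not given in the text), the dependence edges are written by hand rather than produced by a program analysis tool, and the slice follows every dependence edge from the statement S instead of tracking the variable set V explicitly.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

# Hypothetical list of risky library functions; the source does not enumerate it.
RISKY_CALLS = {"memset", "memmove", "memcpy", "strcpy"}


def is_risky_call(statement: str) -> bool:
    """Treat a call statement as a key point if its callee is in the risky list."""
    callee = statement.split("(", 1)[0].strip()
    return callee in RISKY_CALLS


def reachable(start: int, edges: Dict[int, Iterable[int]]) -> Set[int]:
    """Breadth-first search over dependence edges, including the start line."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def slice_gadget(criterion_line: int, deps: Dict[int, Set[int]]) -> List[int]:
    """Union of forward and backward slices; deps maps a line to its dependents."""
    reverse: Dict[int, Set[int]] = {}
    for src, dsts in deps.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)
    forward = reachable(criterion_line, deps)      # lines affected by S
    backward = reachable(criterion_line, reverse)  # lines that affect S
    return sorted(forward | backward)


# Toy program (line numbers are keys):
# 1: char buf[100];           2: char *data = buf - 8;   3: int n = 10;
# 4: memset(data, 'A', 99);   5: printLine(data);        6: printInt(n);
toy_deps = {1: {2}, 2: {4, 5}, 3: {6}, 4: {5}}
gadget_lines = slice_gadget(4, toy_deps)  # lines 3 and 6 are unrelated to S
```

In this toy case the slice keeps lines 1, 2, 4, and 5 and drops the two statements that only concern `n`, which is the shortening effect the slicing step is meant to provide.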

Step 3: representing the code segments as low-dimensional continuous real-valued vectors by using the code segment representation learning method, with the following specific steps:

Step 31: splitting the code segment into individual statements;

Step 32: constructing a CNN-based Statement Encoding Network (SENet), whose schematic diagram is shown in the upper part of FIG. 1;

Step 33: splitting each statement obtained in step 31 into a token sequence using spaces as separators, feeding the token sequence into the statement encoding network, and outputting the vector representation of the statement;

Step 34: constructing an LSTM-based Program Encoding Network (PENet), whose schematic diagram is shown in the upper right part of FIG. 1;

Step 35: taking the vector representations of the statements of the code segment obtained in step 33 as the input of the program encoding network, and outputting the hidden-layer vector of the last time step as the vector representation of the code segment.
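A minimal pure-Python sketch of the staged representation in steps 31 to 35 follows. It assumes a single convolution layer with ReLU and max-over-time pooling for the statement encoder and a single bias-free LSTM layer for the program encoder. The dimensions, the window size, and the random, untrained weights are all illustrative choices that the text does not specify.

```python
import math
import random

rng = random.Random(0)
EMB, FILTERS, HIDDEN, WINDOW = 8, 6, 5, 2  # illustrative sizes, not from the source


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


vocab = {}


def embed(token):
    """Look up (or lazily create) a random embedding for a token."""
    if token not in vocab:
        vocab[token] = [rng.uniform(-1.0, 1.0) for _ in range(EMB)]
    return vocab[token]


conv = rand_matrix(FILTERS, EMB * WINDOW)


def senet(statement):
    """Statement encoder: split into tokens, 1D convolution, ReLU, max pooling."""
    embs = [embed(tok) for tok in statement.split()]
    while len(embs) < WINDOW:  # pad very short statements with zero vectors
        embs.append([0.0] * EMB)
    windows = [sum(embs[i:i + WINDOW], []) for i in range(len(embs) - WINDOW + 1)]
    feats = [[max(0.0, a) for a in matvec(conv, w)] for w in windows]
    return [max(column) for column in zip(*feats)]


gates = {g: rand_matrix(HIDDEN, FILTERS + HIDDEN) for g in "ifog"}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def penet(statement_vectors):
    """Program encoder: LSTM over statement vectors, returns the last hidden state."""
    h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
    for x in statement_vectors:
        z = x + h  # concatenate the input with the previous hidden state
        i = [sigmoid(a) for a in matvec(gates["i"], z)]
        f = [sigmoid(a) for a in matvec(gates["f"], z)]
        o = [sigmoid(a) for a in matvec(gates["o"], z)]
        g = [math.tanh(a) for a in matvec(gates["g"], z)]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


gadget = ["char buf [ 100 ] ;", "memset ( data , 'A' , 99 ) ;"]
statement_vectors = [senet(s) for s in gadget]  # one vector per statement
gadget_vector = penet(statement_vectors)        # one vector for the whole gadget
```

The point of the two stages is that each statement is first summarized on its own, and the sequence model then sees one vector per statement rather than one long token stream.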

Step 4: inputting the vector representation of the code segment into a coarse-grained code vulnerability detection model based on deep learning, and judging whether the code segment contains a vulnerability, with the following specific steps:

Step 41: constructing a coarse-grained code vulnerability detection model (DetectrNet, DNet) based on a fully connected network with a single hidden layer, whose schematic diagram is shown in the lower middle part of FIG. 1;

Step 42: taking the vector representation of the code segment as the input of the model, and outputting the probability that the code segment contains a vulnerability;

Step 43: taking the predicted probability and the real label as the input of a cross entropy loss function, calculating the prediction error, and updating the parameters of the code segment representation learning model (namely the CNN-based SENet and the LSTM-based PENet) and of the coarse-grained vulnerability detection model (DNet) by using the back propagation algorithm.
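The forward pass and loss of steps 41 to 43 can be sketched as below. The ReLU hidden activation and the sigmoid output are assumptions, since the text states only that the model is fully connected with a single hidden layer and is trained with a cross entropy loss. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dnet_forward(x, w1, b1, w2, b2):
    """Single hidden layer (assumed ReLU) followed by a sigmoid output unit."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return sigmoid(logit)  # probability that the gadget is vulnerable


def cross_entropy(p, y, eps=1e-12):
    """Binary cross entropy between predicted probability p and label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def loss_gradient_wrt_logit(p, y):
    """For a sigmoid output with cross entropy, dLoss/dlogit is simply p - y."""
    return p - y
```

Back propagation starts from `p - y` at the output and pushes that error through DNet and then through the representation networks, which is why the three networks are trained jointly in step 43.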

Step 5: constructing a fine-grained vulnerability detection model (Policy Network, PNet) based on reinforcement learning, and predicting the specific code lines that cause vulnerabilities in code segments containing defects, with the following specific steps:

Step 51: constructing the fine-grained code vulnerability prediction model PNet based on reinforcement learning, whose schematic diagram is shown on the right side of FIG. 1;

Step 52: concatenating the vector representation of the current program statement with the vector representation of its context to form the state representation of reinforcement learning at time step t;

Step 53: predicting, from the state representation at time step t, the action taken by the agent (i.e., the policy). If the action is "relevant", the statement input at time step t is one that may cause a code vulnerability; if the action is "irrelevant", the input statement does not cause a code vulnerability. The same action prediction is carried out for each statement of the code segment, generating an action sequence;

Step 54: calculating the reward according to the reward formula (the formula itself is missing from this text), where U is the set of code lines marked as vulnerability-related in the action sequence predicted in step 53, and V is the set of code lines where the real vulnerability is located;

Step 55: updating the parameters of the model PNet according to the classic REINFORCE algorithm and the policy gradient algorithm, so that PNet automatically learns the code structure related to the vulnerability.
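The loop in steps 52 to 55 can be sketched as a tiny REINFORCE learner. Two things here are assumptions rather than the invention's definitions: the reward is taken to be the Jaccard overlap |U ∩ V| / |U ∪ V| because the actual formula is not reproduced in this text, and the policy is a single logistic unit over the state vector instead of a full network. All function names are hypothetical.

```python
import math
import random


def jaccard_reward(predicted, actual):
    """ASSUMED reward: overlap of predicted lines U with true lines V."""
    u, v = set(predicted), set(actual)
    if not u and not v:
        return 1.0
    return len(u & v) / len(u | v)


def make_state(statement_vector, context_vector):
    """Step 52: the state is the statement vector concatenated with its context."""
    return list(statement_vector) + list(context_vector)


def prob_relevant(weights, state):
    """Probability that the policy picks the 'relevant' action for this state."""
    return 1.0 / (1.0 + math.exp(-sum(w * s for w, s in zip(weights, state))))


def sample_actions(weights, states, rng):
    """Step 53: sample one 0/1 action per statement."""
    return [1 if rng.random() < prob_relevant(weights, s) else 0 for s in states]


def reinforce_step(weights, states, actions, reward, lr=0.1):
    """Step 55: w += lr * reward * sum_t grad log pi(a_t | s_t)."""
    grad = [0.0] * len(weights)
    for state, action in zip(states, actions):
        p = prob_relevant(weights, state)
        for j, sj in enumerate(state):
            grad[j] += (action - p) * sj  # gradient of the Bernoulli log-likelihood
    return [w + lr * reward * g for w, g in zip(weights, grad)]


def train(states, true_lines, episodes=200, lr=0.1, seed=0):
    """Repeatedly sample an action sequence, score it, and update the policy."""
    rng = random.Random(seed)
    weights = [0.0] * len(states[0])
    for _ in range(episodes):
        actions = sample_actions(weights, states, rng)
        predicted = [i for i, a in enumerate(actions) if a == 1]
        reward = jaccard_reward(predicted, true_lines)  # step 54 (assumed form)
        weights = reinforce_step(weights, states, actions, reward, lr)
    return weights
```

Action sequences that overlap more with the true vulnerable lines receive a larger reward and therefore a larger update toward repeating those actions. A practical implementation would typically also subtract a baseline from the reward to reduce variance.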

Example:

Taking a specific vulnerability instance from the Software Assurance Reference Dataset (SARD) as an example, the detection process of the multi-granularity code vulnerability detection method provided by the invention is analyzed as follows. The contents of the four source code files related to the vulnerability instance are shown in Tables 1 to 4, respectively. First, steps 1 and 2 are executed: the source code is converted into intermediate code, and a program slice of the intermediate code is generated with the dangerous function memset, which may cause a vulnerability, as the key point, as shown in Table 5. In this slice, the actual vulnerable statements are lines 10, 11, 34, and 35. Then, steps 3 and 4 are executed to learn the vector representation of each line of code and of the whole program; the program vector is used as the input of the coarse-grained vulnerability detection model (DNet), and the model predicts 1, meaning that the slice contains a vulnerability. Finally, step 5 is executed: the vector representation of each statement in the program is used as the input of the fine-grained vulnerability detection model (PNet), and the predicted action sequence is shown in Table 6, where 0 indicates that the corresponding statement does not contain a vulnerability and 1 indicates that it may. The positions of the 1s in Table 6 are also 10, 11, 34, and 35, so the fine-grained model accurately identifies the specific vulnerability locations.
As the above example shows, the method provided by the invention not only achieves coarse-grained detection of vulnerable code but also locates the specific code statements that may cause the vulnerability.
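The comparison made in this example, reading the positions of the 1s out of the action sequence and matching them against the known vulnerable lines, can be written as a short check. The slice length of 40 and the 1-based line numbering are assumptions made for illustration, because the contents of Tables 5 and 6 are not reproduced in this text.

```python
def flagged_lines(actions, first_line=1):
    """Return the line numbers whose predicted action is 1 (may be vulnerable)."""
    return [n for n, action in enumerate(actions, start=first_line) if action == 1]


slice_length = 40                 # hypothetical; the real slice length is not given
true_lines = [10, 11, 34, 35]     # vulnerable lines stated in the example
# Rebuild an action sequence shaped like the one the example describes.
actions = [1 if n in true_lines else 0 for n in range(1, slice_length + 1)]

predicted_lines = flagged_lines(actions)
exact_match = predicted_lines == true_lines
```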

TABLE 1 CWE124_Buffer_Underwrite__char_declare_memmove_53a.c

TABLE 2 CWE124_Buffer_Underwrite__char_declare_memmove_53b.c

TABLE 3 CWE124_Buffer_Underwrite__char_declare_memmove_53c.c

TABLE 4 CWE124_Buffer_Underwrite__char_declare_memmove_53d.c

TABLE 5 Program slice of the intermediate code generated with memset as the key point

TABLE 6 Action sequence predicted by the fine-grained vulnerability detection model for the slice in Table 5

Claims

1. A multi-granularity code vulnerability detection method based on deep learning and reinforcement learning is characterized by comprising the following steps:

2. The deep learning and reinforcement learning-based multi-granularity code vulnerability detection method according to claim 1, wherein the specific steps of the step 3 are as follows:

step 31: splitting a code segment by taking a statement (statement) as a unit;

step 32: constructing a statement encoding network SENet based on CNN;

step 34: constructing a program encoding network PENet based on LSTM;

3. The deep learning and reinforcement learning-based multi-granularity code vulnerability detection method according to claim 1, wherein the specific steps of the step 4 are as follows:

step 41: constructing a full-connection single hidden layer-based coarse-grained code vulnerability detection model DNet;

step 43: taking the predicted probability and the real label as the input of a cross entropy loss function, calculating the prediction error, and updating the parameters of the staged code representation learning model and of the coarse-grained vulnerability detection model DNet by using the back propagation algorithm, wherein the representation learning model comprises the CNN-based SENet and the LSTM-based PENet.

4. The deep learning and reinforcement learning-based multi-granularity code vulnerability detection method according to claim 1, wherein the specific steps of the step 5 are as follows:

step 51: constructing a fine-grained code vulnerability prediction model PNet based on reinforcement learning;

step 54: according to the formula

step 55: and updating parameters of the model PNet according to a classic REINFORCE algorithm and a Policy gradient algorithm, so that the model PNet can automatically learn to obtain a code structure related to the vulnerability.