CN115146279A

CN115146279A - Program vulnerability detection method, terminal device and storage medium

Info

Publication number: CN115146279A
Application number: CN202210741646.XA
Authority: CN
Inventors: 胡玉鹏; 关翔予; 温杰凌; 辛钰雯; 齐园
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2022-06-28
Filing date: 2022-06-28
Publication date: 2022-10-04

Abstract

The invention discloses a program vulnerability detection method, a terminal device and a storage medium. A source program is converted into the intermediate code LLVM IR, which is fine-grained, rich in semantic information and easy to extend. The invention compensates for the insufficient semantics of the original code slicing algorithm by adding control information, and slices the intermediate code. By using an IR-based word embedding that follows the dependencies among instructions and the characteristics of the instruction context, instruction-related information is kept as far as possible and non-instruction information is discarded. This addresses the complex data processing and easy semantic loss of traditional word embedding, and greatly improves the vulnerability detection rate.

Description

Program vulnerability detection method, terminal device and storage medium

Technical Field

The invention relates to software vulnerability detection technology, in particular to a program vulnerability detection method, a terminal device and a storage medium.

Background

Many network attacks originate from software vulnerabilities. Although much effort has been devoted to secure programming and many kinds of vulnerability detection systems have been built, software vulnerabilities remain unlikely to be eliminated at the root.

Existing vulnerability detection schemes have two main defects: code slicing is path-insensitive, and semantics are lost after the code is converted into word vectors. Both cause problems such as a high false-alarm rate.

On the one hand, most existing vulnerability detection solutions rely on code slicing. Code slicing aims to extract the semantics of vulnerability patterns comprehensively, help the network identify key code, reduce the learning difficulty of the neural network model and improve the learning effect. Program slicing is a program decomposition technique that abstracts the necessary syntax and semantics from a program. Existing methods preprocess the source code into segments such as files, functions, and slices composed of interdependent statements, but most current slicing algorithms share a key problem: they are path-insensitive. With existing slice generation methods, the same code slice can be extracted from both correct code and vulnerable code; whichever label the detector outputs for such a slice, its accuracy stays at 0.5, so the detection is useless. This is semantic loss in data preprocessing, and it is a main disadvantage of existing vulnerability detection frameworks. The main reason is that a change in the path of a statement changes the control range of that statement, but the code slices in existing frameworks cannot capture the change. There are two causes: (1) a control dependency is only a rough description of the relationship between two statements (i.e., whether a dependency exists) and does not specify the path of the statement (i.e., whether it depends on a legal or an illegal value); (2) when statement sequences are reorganized, rough stacking may place statements that are not within the same control range directly next to each other, causing path insensitivity. Semantic information is of great importance in vulnerability detection, and more semantic information helps a neural network detect vulnerabilities that could not be found before. Existing code slicing methods do not extract semantic information at the level of control dependence and remain at the level of code syntax, so the model is not trained correctly in the training stage, the accuracy of the detection stage may drop, and the false-alarm rate of vulnerability detection increases greatly.

On the other hand, and more importantly, most current deep-learning-based vulnerability detection systems ignore the problem of semantic loss, for example Li et al. ([1] Li Z, Zou D, Xu S, et al., VulDeeLocator). The word-vector conversion of the source code vulnerability detection method in the invention patent application CN113420296A, a C source code vulnerability detection method based on the BERT model and BiLSTM, depends on the BERT model, which is regarded as the best model for natural language tasks at present. Treating program code as a natural language task is clearly inappropriate. Most current vulnerability detectors thus try to process the code directly, use a syntax tree representation, or treat the code as natural language in order to obtain a vector representation. However, because of structural properties of code such as function calls and the interchangeable order of branches and statements, none of the existing approaches based on natural language processing is sufficient to fully understand program semantics.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a program vulnerability detection method, a terminal device and a storage medium aiming at the defects of the prior art, so as to improve the efficiency and accuracy of vulnerability detection.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a program vulnerability detection method comprises the following steps:

performing static analysis on a source program to obtain an intermediate code of the source program;

extracting key points which possibly cause a vulnerability, generating a slicing standard, slicing the intermediate code by using the slicing standard, and combining a forward slice and a backward slice to obtain a code segment of the program;

using IR2vec to embed words in the code segments of the program to obtain coded vectors;

and training a neural network by using the coded vector to obtain a vulnerability detection model.

The invention takes advantage of the fact that LLVM IR (the intermediate code) provides a clear mapping of high-level languages: the source program is converted into the intermediate code LLVM IR, which is fine-grained, rich in semantic information and easy to extend. The invention compensates for the insufficient semantics of code slicing by adding control information, and slices the intermediate code. By using an IR-based word embedding that follows the dependencies among instructions and the characteristics of the instruction context, instruction-related information is kept as far as possible and non-instruction information is discarded. This addresses the complex data processing and easy semantic loss of traditional word embedding, and greatly improves the vulnerability detection rate.

In the invention, the specific implementation of performing static analysis on the source program and obtaining its intermediate code comprises: analyzing the source code and using a Clang command to obtain the intermediate code representation corresponding to the source code, which is the intermediate code of the source program.

The specific implementation process for extracting the key points which may cause the vulnerability includes:

initializing a lexical unit set Y to empty, and dividing the source program P into a plurality of functions, the function set being F;

for each function f_i ∈ F, establishing an abstract syntax tree A_i;

traversing each lexical unit t_j ∈ A_i and judging whether t_j matches any of the four features in Z; if it matches, Y ← Y ∪ {t_j}; Z = {z_api, z_array, z_pointer, z_arithmetic}, wherein z_api, z_array, z_pointer and z_arithmetic are the features of the four special lexical units: library function, array, pointer and arithmetic expression, respectively;

and outputting Y, the key points which may cause the vulnerability.

In the invention, the code segment acquisition process of the program comprises the following steps:

a special lexical unit belonging to Y is given in a statement s_w, Y being the key points which may cause a vulnerability;

collecting the successor vertices of statement s_w and defining them as the set S_s;

for any statement s_s ∈ S_s, judging whether s_s has a data or control dependency on s_w through the lexical unit; if the dependency exists, extracting statement s_s into the forward slice;

collecting the predecessor vertices of statement s_w and defining them as the set S_p; for any statement s_p ∈ S_p, judging whether s_p has a data or control dependency on s_w through the lexical unit; if the dependency exists, extracting statement s_p into the backward slice;

combining the forward slice and the backward slice to obtain the combined slice, and taking the combined slice as a code fragment of the program;

if a statement in the slice lies within the closed interval of a control range m_q, the boundaries of m_q are inserted into the slice; wherein m_q belongs to the set of control ranges, the initial state of the set is empty, and the updating process of the set includes: determining whether t_j matches z_elseif, z_else or z_case; if so, binding m_cur and m_pre, m_pre being initially empty; if not, storing the control range in the set and assigning m_cur to m_pre; t_j ∈ A_i, t_j is the j-th lexical unit and A_i is an abstract syntax tree; when t_j matches one of the eight control statements in Z, the subtree of A_i rooted at t_j is denoted a_ij; m_cur stores the minimum and maximum line numbers of a_ij; Z = {z_if, z_elseif, z_else, z_for, z_while, z_dowhile, z_switch, z_case}, wherein z_if, z_elseif, z_else, z_for, z_while, z_dowhile, z_switch and z_case are the control statement features;

the slice is the combination of the forward slice and the backward slice; each control range m_b ∈ M_st is traversed: with m_a[0] = m_b[0] at the start, the maximum of m_a[1] and m_b[1] is taken and assigned to m_a[1], and the set of control ranges is updated; m_a[0] is the lower limit of the control range m_a and m_a[1] is its upper limit;

a set is initially empty; for two functions f_υ, f_ω whose statements appear in the slice, if f_υ calls f_ω, the slice is arranged according to the calling relationship and assigned to the set, which is the final code slicing result of the source program;

the updating process of the slice comprises: a set is initially empty; for two statements s_λ, s_μ occurring in the slice, if the successor node of s_μ is s_λ, or the line number of s_μ is less than that of s_λ, the two statements are ordered accordingly and the result is re-assigned to the slice;

converting the program slice of the source program into a program slice of the intermediate code according to the correspondence between the source program and the intermediate code.

By identifying the control range of each control statement, the invention matches all ranges that can propagate to the statements in a slice and records them in the slice, thereby preserving the positive or negative dependency relationships between the statements in the slice.

In the invention, the neural network is a bidirectional recurrent neural network model, and the coded vectors are input into it in random order. The invention shuffles the vectors after word embedding so that they enter the bidirectional recurrent neural network model in a random order, which avoids the influence of the input order of the data on network training. The added randomness improves the generalization of the network, avoids extreme gradients during weight updates caused by regularly ordered data, and helps avoid over-fitting or under-fitting of the final model.

The method of the invention further comprises: inputting the source code to be predicted into the vulnerability detection model and extracting the outputs of the model that are greater than a set threshold, these outputs giving the line numbers of possible vulnerabilities.

As an inventive concept, the present invention also provides a terminal device, which includes a processor and a memory; the memory stores computer programs/instructions; the processor executes the computer programs/instructions stored by the memory; the computer program/instructions are configured to implement the steps of the method of the present invention.

As an inventive concept, the present invention also provides a computer storage medium having stored thereon a computer program/instructions; which when executed by a processor, perform the steps of the method of the present invention.

Compared with the prior art, the invention has the beneficial effects that:

1. the method accurately captures the semantic information of a program, achieves a lower false positive rate while maintaining high accuracy, and refines the granularity so that vulnerability localization reaches line level;

2. the invention slices the IR code; because the IR language is fine-grained and rich in semantic information, vulnerability analysis and detection on IR code is more accurate;

3. the invention adds control information to compensate for the insufficient semantics of code slices: it identifies the control range of each control statement, and matches and records all ranges that can propagate to the statements in a slice, thereby preserving the positive or negative dependency relationships between statements in path-sensitive slices;

4. unlike traditional word embedding, the invention regards semantic relations, rather than only the surrounding context, as important for program representation, so that the original semantics are kept as far as possible in the embedding and non-instruction information is discarded. The invention adopts knowledge-graph-based embedding: a knowledge graph embedding model uses relations to group similar data points together, so that the encoding of an IR element adapts to the context of its statement, giving results superior to context-independent static representation methods such as Word2Vec.

Drawings

Fig. 1 is a diagram of a neural network structure according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a vulnerability detection method based on IR code slicing, which improves the efficiency and accuracy of vulnerability detection. More importantly, after the IR code has been sliced, word embedding is performed by combining a representation learning method with flow information to capture the syntax and semantics of the input program.

The embodiment of the invention comprises the following steps:

S1, performing static analysis on the source program with the Clang tool to obtain the intermediate code representation of the program;

specifically: analyzing the source code and using a Clang command to obtain the corresponding intermediate code representation, i.e., the IR code (a .ll file).
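The Clang step above can be sketched as a small helper that assembles the command line. This is a minimal sketch, not the command used in the embodiment: the function name `build_clang_ir_command` and the file names are hypothetical, and the choice of `-g` and `-O0` is an assumption made so that the IR keeps debug information and stays close to the source.

```python
from pathlib import Path


def build_clang_ir_command(source_file, output_dir="."):
    """Build a clang invocation that emits textual LLVM IR (.ll) for one source file."""
    src = Path(source_file)
    out = Path(output_dir) / (src.stem + ".ll")
    return [
        "clang",
        "-S",          # stop before assembling and write a text file
        "-emit-llvm",  # emit LLVM IR instead of native assembly
        "-g",          # assumption: keep debug info for line mapping
        "-O0",         # assumption: no optimization
        str(src),
        "-o",
        str(out),
    ]


# The resulting list can be passed to subprocess.run(..., check=True)
# on a machine where clang is installed.
command = build_clang_ir_command("example.c", "build")
print(" ".join(command))
```

Returning the argument list rather than running the compiler keeps the sketch usable where Clang is not installed.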

S2, extracting key points which may cause vulnerabilities and generating the slicing standard; then slicing the intermediate code and combining the forward slice and the backward slice to obtain a code segment of the program.

In S2, key points which may cause a vulnerability are extracted by searching the abstract syntax tree generated from the source program for four kinds of syntax tree nodes that may cause code vulnerabilities. The specific implementation process comprises:

identifying the 4 kinds of special lexical units and locating them, and then selecting one lexical unit to generate its corresponding backward and forward slices, wherein the four kinds of special lexical units are: library/API function calls (FC), array usage (AU), pointer usage (PU) and arithmetic expressions (AE).

Step S2 is implemented with Joern, an open-source C/C++ code analysis tool, as follows:

Input: a program P = {s_1, …, s_ε} consisting of several statements; the grammatical feature set Z = {z_api, z_array, z_pointer, z_arithmetic} of the four special lexical units, wherein z_api, z_array, z_pointer and z_arithmetic are the features of library functions, arrays, pointers and arithmetic expressions, respectively.

Output: the special lexical unit set Y.

1: Let the program P contain the statements s_1, s_2, …, s_ε; let Z = {z_api, z_array, z_pointer, z_arithmetic} be the grammatical feature set of the four special lexical units;

2: Initialize the special lexical unit set Y (the initial state of Y is empty);

3: Divide P into a group of functions, and let the function set be F;

4: Traverse each function f_i ∈ F and build an abstract syntax tree A_i for each f_i;

5: Traverse each lexical unit t_j ∈ A_i and judge whether t_j matches any of the four features in Z; if it matches, Y ← Y ∪ {t_j};

6: Finally, return Y.

The returned Y contains the key points which may cause a vulnerability and is used as the slicing standard.
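The key-point extraction above can be illustrated with a much simplified sketch. It does not use Joern or an abstract syntax tree: as a stated assumption, it matches the four features with regular expressions on source lines, which is far cruder than AST matching (for example, it cannot tell a pointer dereference from a multiplication). The `RISKY_CALLS` list and all function names are hypothetical.

```python
import re

# Hypothetical, deliberately tiny list of library calls treated as z_api.
RISKY_CALLS = ["strcpy", "memcpy", "gets", "sprintf", "malloc", "free"]

# Crude textual stand-ins for the four features in Z (assumption: regex, not AST).
FEATURES = {
    "z_api": re.compile(r"\b(" + "|".join(RISKY_CALLS) + r")\s*\("),
    "z_array": re.compile(r"\b[A-Za-z_]\w*\s*\[[^\]]*\]"),
    "z_pointer": re.compile(r"(\*\s*[A-Za-z_]\w*)|(->)"),
    "z_arithmetic": re.compile(r"[A-Za-z_0-9\)\]]\s*[-+/%]\s*[A-Za-z_0-9\(]"),
}


def extract_key_points(source):
    """Return Y as a list of (line_number, feature_name, matched_text)."""
    key_points = []  # Y starts empty
    for line_no, line in enumerate(source.splitlines(), start=1):
        for name, pattern in FEATURES.items():
            match = pattern.search(line)
            if match:
                key_points.append((line_no, name, match.group(0).strip()))
    return key_points


SAMPLE = """void f(char *src) {
    char buf[8];
    int n = 4 + 4;
    strcpy(buf, src);
}"""

for point in extract_key_points(SAMPLE):
    print(point)
```

Each returned entry plays the role of one element of Y and could serve as a slicing standard for the slicing step.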

The specific implementation process of the code slice in the step S2 comprises the following steps:

Forward and backward slices are extracted for each special lexical unit at the same time, because a vulnerability is determined by several statements in its context and a unidirectional slice may lead to semantic ambiguity or loss of semantics. The invention first converts the source program into a program dependence graph (PDG) and, by reachability analysis on the graph, generates forward and backward slices on the basis of data and control dependence. This code slicing has two advantages: first, statements that are vulnerable to attack are found through data dependence; second, the syntactic information is enriched through control dependence, which in most cases mitigates the accuracy loss caused by missing semantics. The specific process is as follows:

Input: a statement s_w in the program P; a lexical unit of the special lexical unit set Y generated by statement s_w; the program dependence graph G corresponding to the program P.

Output: the forward and backward slices corresponding to the lexical unit.

1: A special lexical unit is given in statement s_w;

2: From the connected component G_w of G that contains statement s_w, collect the successor vertices of s_w and define them as the set S_s;

3: For any s_s ∈ S_s, traverse recursively with the forwardSlice() function to judge whether s_s has a data or control dependency on s_w through the lexical unit; if the dependency exists, extract s_s into the forward slice; if not, continue the search. forwardSlice() is the function that extracts forward slices;

4: Backward slices are extracted in the same way: collect the predecessor vertices of statement s_w and define them as the set S_p;

5: For any s_p ∈ S_p, traverse recursively with the backwardSlice() function to judge whether s_p has a data or control dependency on s_w through the lexical unit; if the dependency exists, extract s_p into the backward slice; if not, continue the search. backwardSlice() is the function that extracts backward slices;

6: Finally, merge the forward slice and the backward slice to obtain the merged slice.
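The reachability analysis behind steps 1 to 6 can be sketched as follows. This is an illustration under simplifying assumptions: the PDG is given as a plain list of dependency edges between line numbers, data and control edges are not distinguished, and the filtering by the specific lexical unit is omitted. The names `reachable` and `slice_statement` are hypothetical and are not the forwardSlice() and backwardSlice() functions of the embodiment.

```python
from collections import deque


def reachable(start, edges):
    """Return every node reachable from start along edges (breadth-first), excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def slice_statement(pdg_edges, s_w):
    """pdg_edges holds (src, dst) pairs meaning that dst depends on src."""
    successors, predecessors = {}, {}
    for src, dst in pdg_edges:
        successors.setdefault(src, []).append(dst)
        predecessors.setdefault(dst, []).append(src)
    forward = reachable(s_w, successors)      # statements that depend on s_w
    backward = reachable(s_w, predecessors)   # statements that s_w depends on
    merged = sorted(backward | {s_w} | forward)
    return forward, backward, merged


# Line 3 depends on lines 1 and 2; lines 5 and 7 depend (transitively) on line 3.
EDGES = [(1, 3), (2, 3), (3, 5), (4, 6), (5, 7)]
forward, backward, merged = slice_statement(EDGES, 3)
print(forward, backward, merged)
```

Lines 4 and 6 are left out of the merged slice because no dependency path connects them to the chosen statement.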

The key point of the embodiment of the invention is that control information is added to compensate for the semantic deficiency of code slices. A control dependency between two statements is only a rough description, and reorganizing the slices without taking the control dependency range into account leaves no semantic separation between two control ranges. The invention adds control information to make up for the missing semantic information in the PDG:

(1) A corresponding abstract syntax tree is generated from the source code, and nodes matching any of 8 syntax features are defined as key nodes, since each of them involves a clear control range.

(2) The maximum and minimum line numbers in the subtree rooted at each key node are calculated.

(3) In special cases (e.g., if and else), several adjacent control ranges are bound together.

(4) A stack is used to correct the correspondence between the start node and the end node of a control range.

(5) The control range is inserted into the corresponding key node of the PDG to form a complete dependency.

(6) The statement order inside a function is adjusted according to line numbers, and the statement order across functions is adjusted according to the calling relationship.
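Items (2) to (4) can be illustrated with a small stack-based sketch. It is a simplification under stated assumptions: it works on source text instead of an abstract syntax tree, it only recognizes control bodies delimited by braces (so a `case` label without braces is not recorded), and it binds an `else` or `else if` range to the range that ends on the same or the previous line. The name `control_ranges` is hypothetical.

```python
import re

CONTROL = re.compile(r"^\s*(?:\}\s*)?(else\s+if|if|else|for|while|do|switch|case)\b")
BOUND_TO_PREVIOUS = {"else if", "else", "case"}


def control_ranges(source):
    """Return [first_line, last_line] control ranges (1-based, inclusive)."""
    ranges = []   # finished control ranges
    stack = []    # entries are (keyword or None, line of the opening brace)
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = CONTROL.match(line)
        keyword = re.sub(r"\s+", " ", match.group(1)) if match else None
        for ch in line:
            if ch == "{":
                stack.append((keyword, line_no))
                keyword = None  # only the first brace belongs to the keyword
            elif ch == "}" and stack:
                opened_by, start = stack.pop()
                if opened_by is None:
                    continue  # function body or plain block, not a control range
                if opened_by in BOUND_TO_PREVIOUS:
                    previous = next(
                        (r for r in reversed(ranges) if r[1] in (start, start - 1)),
                        None,
                    )
                    if previous is not None:
                        previous[1] = line_no  # bind to the adjacent range
                        continue
                ranges.append([start, line_no])
    return ranges


SAMPLE = """int f(int x) {
    if (x > 0) {
        x = 1;
    } else {
        x = 2;
    }
    while (x < 10) {
        x++;
    }
    return x;
}"""

print(control_ranges(SAMPLE))
```

In the sample, the `if` and `else` branches end up in one bound range and the `while` loop in a second one, while the function body itself is not treated as a control range.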

The algorithm that adds control information to make up for the missing semantic information in the PDG is implemented as follows:

Input: a program P = {s_1, …, s_ε} consisting of several statements; the grammatical feature set Z = {z_if, z_elseif, z_else, z_for, z_while, z_dowhile, z_switch, z_case} of the eight control statements; a lexical unit generated by a statement s_w; the slice corresponding to that lexical unit.

Output: the slicing result corresponding to the lexical unit.

1: Let the program P contain the statements s_1, s_2, …, s_ε; let Z = {z_if, z_elseif, z_else, z_for, z_while, z_dowhile, z_switch, z_case} be the grammatical features of the 8 control statements;

2: Divide P into a group of functions, and let the function set be F;

3: Traverse each function f_i ∈ F and build an abstract syntax tree A_i for each f_i;

4: Traverse each lexical unit t_j ∈ A_i and judge whether t_j matches any of the eight features in Z; if it matches, take the subtree of A_i rooted at t_j, denote it a_ij, and define it as a key node;

5: Calculate the minimum and maximum line numbers of a_ij and store the result in m_cur;

6: Judge whether t_j matches z_elseif, z_else or z_case; if it matches, bind m_cur and m_pre together; if it does not match, store the control range in the control range set (a set whose initial state is empty) and assign m_cur to m_pre;

7: Push the paired symbols (such as parentheses and braces) onto a stack and match them to obtain the range between each opening and closing symbol, and store the ranges in the set M_st;

8: Traverse each control range m_b ∈ M_st: with m_a[0] = m_b[0] at the start, take the maximum of m_a[1] and m_b[1] and assign it to m_a[1], and finally update the control range set; m_a[0] is the lower limit of the control range m_a and m_a[1] is its upper limit. Repeat step 8 until all control ranges have been traversed, which gives the updated control range set;

9: For each control range m_q in the control range set (as updated in step 8), traverse the slice: if a statement of the slice lies within the closed interval of m_q, insert the boundaries of m_q into the slice. Repeat step 9 on the updated slice until all control ranges have been traversed;

10: Let a set be initially empty; for two statements s_λ, s_μ occurring in the slice, if the successor node of s_μ is s_λ, or the line number of s_μ is less than that of s_λ, order the two statements accordingly and re-assign the result to the slice. Repeat step 10 for the remaining statements until all statements in the slice have been traversed;

11: Let a set be initially empty; for two functions f_υ, f_ω in the slice, if f_υ calls f_ω, obtain the corresponding statement sets according to f_υ and f_ω, arrange them according to the calling relationship, and re-assign the result to the slice.

Unlike other code slicing methods, the method of the embodiment adds control information to the program dependence graph to make up for the missing semantic information. This solves the problem in the prior art that the control dependency relationship is too coarse to capture the range and the details of the dependency.
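The effect of adding control ranges to a slice can be illustrated with a short sketch. It reflects one possible reading of the range-merging, range-insertion and line-ordering steps and is not the algorithm of the embodiment: ranges are `[low, high]` line pairs, a slice is a list of line numbers, two ranges are merged only when they start on the same line, and the check is done in a single pass over the original slice. Both function names are hypothetical.

```python
def merge_same_start(ranges, stack_ranges):
    """When two ranges start on the same line, keep the larger upper bound."""
    merged = [list(r) for r in ranges]
    for m_a in merged:
        for m_b in stack_ranges:
            if m_a[0] == m_b[0]:
                m_a[1] = max(m_a[1], m_b[1])
    return merged


def add_control_ranges(slice_lines, ranges):
    """Insert the boundary lines of each range enclosing a slice statement, sorted by line."""
    result = set(slice_lines)
    for low, high in ranges:
        if any(low <= line <= high for line in slice_lines):
            result.update((low, high))
    return sorted(result)


ranges = merge_same_start([[2, 4], [7, 9]], [[2, 6], [1, 11]])
print(ranges)
print(add_control_ranges([3, 10], ranges))
```

Statement 3 lies inside the range from line 2 to line 6, so both boundary lines are added to the slice, which records under which control statement the sliced statement executes; the range from line 7 to line 9 contains no slice statement and is ignored.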

Finally, the program slice of the source program is converted into a program slice of the intermediate code according to the correspondence between the source program and the intermediate code. After the IR code is sliced, the logic of the IR basic blocks does not change; slicing simply deletes code statements that are not relevant to the slicing standard.

The building blocks of LLVM IR are instructions, basic blocks, functions and modules. Each instruction contains an opcode, a type and operands, and each instruction is statically typed. A basic block is a maximal sequence of LLVM instructions without any jump. A set of basic blocks constitutes a function, and a module is a set of functions. This hierarchy of the LLVM IR representation helps to obtain embeddings at the corresponding level of the program.
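The relation between instructions and basic blocks can be illustrated with a short sketch that cuts a flat list of textual IR instructions at terminator instructions. This is only an illustration under stated assumptions: real .ll files already delimit basic blocks with labels, a real tool would use the LLVM API rather than string handling, and the helper names are hypothetical. The terminator opcodes listed are a subset of those defined by LLVM.

```python
# A subset of LLVM terminator opcodes; a terminator always ends a basic block.
TERMINATORS = {"ret", "br", "switch", "indirectbr", "invoke", "resume", "unreachable"}


def opcode_of(instruction):
    """Return the opcode of a textual instruction, skipping an optional '%x = ' result."""
    text = instruction.strip()
    if " = " in text:
        text = text.split(" = ", 1)[1]
    return text.split()[0]


def split_basic_blocks(instructions):
    """Group instructions into blocks that each end at a terminator."""
    blocks, current = [], []
    for instruction in instructions:
        current.append(instruction)
        if opcode_of(instruction) in TERMINATORS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)  # trailing instructions without a terminator
    return blocks


INSTRUCTIONS = [
    "%c = icmp sgt i32 %x, 0",
    "br i1 %c, label %then, label %end",
    "store i32 1, i32* %p",
    "br label %end",
    "ret i32 0",
]

for block in split_basic_blocks(INSTRUCTIONS):
    print(block)
```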

S3, performing word embedding on the sliced IR code segments using IR2Vec to obtain coded vectors (i.e., the vectors corresponding to the vulnerability candidates in Fig. 1). This distributed embedding is achieved by combining a representation learning method with flow information to capture the syntax and semantics of the input program.

Based on these characteristics, and so that the semantic information of the sliced IR code is damaged as little as possible, the embodiment of the invention first introduces the IR-based word embedding technique IR2Vec, models IR opcodes, operands and types as entities in relational form, and on this basis devises a vector representation better suited to fine-grained vulnerability localization. IR2Vec is not a traditional word embedding method oriented to natural language processing; it is tightly coupled with LLVM. By learning the relations among opcodes, operands and types, and following the compositional structure of a program, it builds a line representation, a block representation, a function representation and a program representation from the bottom up, so that LLVM IR can be analyzed better and the internal logic and the relations among instructions in the IR can be captured more fully. The invention uses the flow-aware encoding mode of IR2Vec for word embedding.

The step S3 comprises the following implementation steps:

1: Before the IR instructions are embedded, the most critical step is to generate the initial seed embedding vocabulary; subsequent instruction embedding is carried out under the guidance of this vocabulary. First, each LLVM IR instruction is mapped to triples <h, r, t>. One instruction can be represented by several triples that describe its internal and external relations, specifically: the type of the current instruction (i.e., the relation between the opcode and the type of the instruction), the relation between the opcode of the current instruction and the opcode of the next instruction, and the relation between the opcode of the current instruction and its operands. These triples preserve the relations inside an instruction and between instructions as far as possible and serve as input for training the seed embedding vocabulary. Feature embedding is then performed with TransE, a knowledge graph embedding model that learns representations of triples <h, r, t>. TransE embeds h, r and t into the same high-dimensional space and tries to learn representations that satisfy h + r ≈ t; the output of the learning is a dictionary of entity embeddings, i.e., the seed embedding vocabulary.

2: Read in the IR code segments to be embedded and construct a series of program-related data structures in memory.

3: Generate a function call graph from the call relations of the functions in the program, and obtain from it the names of the functions called by each function.

4: Next, obtain a word vector for each function. In the flow-aware encoding mode, instruction word vectors are generated under the guidance of the seed embedding vocabulary and the control-flow dependencies among instructions; the word vector of each basic block is formed by splicing the word vectors of the instructions in the block, and the word vector of a function is formed by splicing the word vectors of its basic blocks. (The word vector of a function is not formed from the basic-block word vectors in their sequential order, but from the basic-block word vectors after topological sorting.)

5: The word vectors of the functions are spliced to form the word vector of the IR file currently passed in, i.e., the coded vector, which is used for subsequent training and prediction.
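The seed-vocabulary training in the first step can be sketched with a minimal TransE-style learner in pure Python. This is a sketch under stated assumptions, not the IR2Vec implementation: it uses the squared L2 distance, a margin-based loss with one randomly corrupted tail per triple, and no embedding normalization. The relation names `TypeOf`, `NextInst` and `Arg0`, the sample triples and all hyperparameters are illustrative.

```python
import random


def transe_sq_distance(h, r, t):
    """Squared L2 distance of h + r from t; small when the triple holds."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


def train_seed_vocabulary(triples, dim=8, epochs=100, lr=0.01, margin=1.0, seed=0):
    """Learn entity embeddings from <h, r, t> triples with a margin-based loss."""
    rng = random.Random(seed)
    entities = sorted({x for h, _, t in triples for x in (h, t)})
    relations = sorted({r for _, r, _ in triples})
    ent = {e: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for e in entities}
    rel = {r: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for r in relations}
    for _ in range(epochs):
        for h, r, t in triples:
            t_neg = rng.choice(entities)  # corrupted tail gives a negative triple
            if t_neg == t:
                continue
            pos = transe_sq_distance(ent[h], rel[r], ent[t])
            neg = transe_sq_distance(ent[h], rel[r], ent[t_neg])
            if pos + margin <= neg:
                continue  # margin already satisfied, no update needed
            for i in range(dim):
                g_pos = 2 * (ent[h][i] + rel[r][i] - ent[t][i])
                g_neg = 2 * (ent[h][i] + rel[r][i] - ent[t_neg][i])
                ent[h][i] -= lr * (g_pos - g_neg)
                rel[r][i] -= lr * (g_pos - g_neg)
                ent[t][i] += lr * g_pos      # pull the true tail towards h + r
                ent[t_neg][i] -= lr * g_neg  # push the corrupted tail away
    return ent  # dictionary of entity embeddings, i.e. a seed vocabulary


# Illustrative triples for a store instruction followed by a branch.
TRIPLES = [
    ("store", "TypeOf", "i32"),
    ("store", "NextInst", "br"),
    ("store", "Arg0", "variable"),
    ("br", "TypeOf", "void"),
    ("br", "NextInst", "ret"),
]

vocabulary = train_seed_vocabulary(TRIPLES)
print(sorted(vocabulary))
```

The returned dictionary maps each opcode, type and operand kind to a vector, which is the role the seed embedding vocabulary plays in the later steps.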

By default, the IR2Vec tool generates a 300 x 1 word vector for a single IR file: the word vectors of all instructions in the IR code passed in are compressed into one line, so the boundaries between instructions are lost and the accuracy of the line numbers predicted by the neural network suffers. To make IR2Vec better suited to a fine-grained vulnerability detection system and achieve line-level vulnerability localization, the embodiment of the invention modifies the IR2Vec prototype so that the word vector of each instruction is kept separate rather than all word vectors being merged into a single line. This lays the foundation for fine-grained vulnerability localization and for marking vulnerability line numbers later, and provides a guarantee for model training.
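The difference between one vector per file and one vector per instruction can be illustrated as follows. This sketch is not the modified IR2Vec prototype: as an assumption, each instruction vector is the unweighted sum of the seed vectors of its opcode, type and operands, and the collapsed file vector is the sum of all rows. The vocabulary values and function names are made up for illustration.

```python
def embed_instructions(instructions, vocabulary, dim):
    """Return one vector per instruction, i.e. one row per IR line."""
    rows = []
    for opcode, type_name, operands in instructions:
        vector = [0.0] * dim
        for token in (opcode, type_name, *operands):
            for i, value in enumerate(vocabulary.get(token, [0.0] * dim)):
                vector[i] += value
        rows.append(vector)
    return rows


def collapse(rows):
    """Merge all rows into a single vector, which loses the instruction boundaries."""
    return [sum(column) for column in zip(*rows)]


# Toy two-dimensional seed vocabulary (illustrative values only).
VOCABULARY = {"store": [1, 0], "i32": [0, 1], "variable": [1, 1], "ret": [2, 0]}
INSTRUCTIONS = [("store", "i32", ["variable"]), ("ret", "i32", [])]

rows = embed_instructions(INSTRUCTIONS, VOCABULARY, dim=2)
print(rows)            # one row per instruction, usable for line-level labels
print(collapse(rows))  # a single vector for the whole file
```

Keeping the rows separate is what allows each time step of the network to correspond to one IR line.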

S4, constructing a bidirectional recurrent neural network model (as shown in Fig. 1), inputting the vectors encoded by the word embedding of S3 into the neural network, training the vulnerability detection model, and continuously adjusting the parameters according to the loss function so that the model achieves the best vulnerability detection effect;

the specific implementation process of the step S4 comprises the following steps:

1) The code slice is first marked. Since the model of the embodiment of the present invention is a kind of supervised learning, the marked source code needs to be obtained from the source data sets SARD and NVD, and the corresponding code slice needs to be marked. Specifically, after slicing the original vulnerability data set, the corresponding position of the vulnerability row number in the original data set in the IR slice is obtained, and the position is used as a training label.

2) As shown in Fig. 1, the embodiment of the invention mainly uses a BRNN neural network model. The training model can be divided into two parts: the first part is a conventional bidirectional recurrent neural network comprising several BRNN layers, a dropout layer, a dense layer and an activation layer; the second part comprises a multiplication layer, a max pooling layer and an average pooling layer.

3) To prevent the order in which data are fed in from influencing network training, the embodiment of the invention shuffles the word-embedded vectors so that they enter the network in random order. The added randomness improves the generalization performance of the network, avoids extreme gradients during weight updates caused by regularly ordered data, and helps avoid over-fitting or under-fitting of the final model.

4) Because the data set is strongly imbalanced, the model finally needs to be evaluated repeatedly and its parameters tuned;
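The labeling in step 1) and the shuffling in step 3) can be sketched as below. This is a minimal illustration, not the embodiment's code: the function names, the sample data and the convention that each IR slice line carries the source line it came from (or `None`) are all assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def label_positions(ir_source_lines: Sequence[Optional[int]],
                    vulnerable_source_lines: Sequence[int]) -> List[int]:
    """Return the positions in the IR slice that stem from vulnerable source lines.

    ir_source_lines[i] is the source line of IR slice line i (assumed input).
    """
    vulnerable = set(vulnerable_source_lines)
    return [i for i, source_line in enumerate(ir_source_lines)
            if source_line in vulnerable]


def shuffle_samples(vectors: Sequence, labels: Sequence,
                    seed: Optional[int] = None) -> Tuple[list, list]:
    """Shuffle samples and labels with the same permutation."""
    order = list(range(len(vectors)))
    random.Random(seed).shuffle(order)
    return [vectors[i] for i in order], [labels[i] for i in order]


# Hypothetical slice: six IR lines, source line 12 is the known vulnerable line.
positions = label_positions([10, 10, 11, None, 12, 12], [12])

# Hypothetical training set of four samples.
vectors = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 1, 0, 1]
shuffled_vectors, shuffled_labels = shuffle_samples(vectors, labels, seed=7)
```

Shuffling indices rather than the two lists separately keeps every vector paired with its own label.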

S5, after model training is finished, performing vulnerability detection using the trained model;

the specific implementation process of the step S5 comprises the following steps:

For the source code to be predicted, IR code generation, IR code slicing, word embedding and data preprocessing are performed; the BRNN part of the trained model is then extracted as the detection model, and the resulting vector file data are passed into the detection model.

In the embodiment of the invention, the output of the detection model is obtained. Each time step corresponds to one code line, and the value at each time step represents the likelihood that the corresponding code line is predicted to be vulnerable. The threshold is set to 0.5: a code line whose value is greater than 0.5 is considered vulnerable, and a code line whose value is less than 0.5 is considered not vulnerable. The outputs are then sorted by value, and the k outputs greater than the threshold (0.5) are extracted; these give the line numbers most likely to be vulnerable in the IR code slice. Finally, the predicted IR code line numbers are mapped back to source code line numbers using the mapping between IR code slice lines and source code line numbers established from the debugging information.
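The post-processing described above can be sketched as follows. The function name, the 0-based IR line numbering and the `ir_to_source` dictionary (assumed to have been extracted from the debugging information beforehand) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def locate_vulnerable_lines(scores: Sequence[float],
                            ir_to_source: Dict[int, int],
                            threshold: float = 0.5) -> Tuple[List[int], List[int]]:
    """Keep time steps above the threshold, sort by score, map to source lines."""
    flagged = [(score, ir_line) for ir_line, score in enumerate(scores)
               if score > threshold]
    flagged.sort(reverse=True)  # most likely vulnerable line first
    ir_lines = [ir_line for _, ir_line in flagged]

    source_lines: List[int] = []
    for ir_line in ir_lines:
        source_line = ir_to_source.get(ir_line)
        # Several IR lines may come from the same source line; report it once.
        if source_line is not None and source_line not in source_lines:
            source_lines.append(source_line)
    return ir_lines, source_lines


# Hypothetical model output for a five-line IR slice.
scores = [0.1, 0.8, 0.4, 0.95, 0.6]
ir_to_source = {0: 3, 1: 5, 2: 5, 3: 7, 4: 7}
ir_lines, source_lines = locate_vulnerable_lines(scores, ir_to_source)
```

Here k is simply the number of time steps whose score exceeds the threshold, three in this toy input.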

The embodiment of the invention adopts a semantically enhanced IR code slicing method, which ensures that more control semantic information is retained during slicing, and performs word embedding with a representation learning method that combines flow information to capture the syntax and semantics of the input program, thereby greatly improving vulnerability detection capability, code interpretability and vulnerability detection accuracy.

Using the above method, experiments were carried out on a desktop computer with an Intel Core i5-10600KF processor, 16 GB of memory and an NVIDIA 1080 graphics card. The training samples collected for the experiment cover 4 different vulnerability categories, 189554 items in total, and the training data were trained for 20 iterations. The experiment took about 29 hours; the final validation accuracy is about 95%, the precision is about 90%, and F1 is about 90%.

After training is completed, the learned parameters are stored as a model file and reloaded into the neural network during vulnerability detection. Taking memory corruption vulnerabilities as an example, the experimental result is: the detection accuracy is 97.6% and the precision is 92.7%.

The analysis of the results shows that by converting the source code file into IR code, slicing the code while preserving semantic information, converting the slices into representation vectors with IR2Vec, and using the neural network to extract vulnerability features and learn parameters, high accuracy of code vulnerability detection is finally achieved.

Claims

1. A program vulnerability detection method is characterized by comprising the following steps:

2. The method for detecting program vulnerabilities according to claim 1, wherein the implementation process of performing static analysis on the source program and obtaining the intermediate code of the source program includes: analyzing the source code, and obtaining the intermediate code representation corresponding to the source code by using a Clang command, to obtain the intermediate code of the source program.
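The claim does not give the exact Clang command. One commonly used invocation that emits textual LLVM IR with debugging information is sketched below; the helper name, the output directory and the choice of `-g` and `-O0` are assumptions of this sketch, not part of the claim.

```python
from pathlib import Path
from typing import List


def clang_ir_command(source: str, out_dir: str = ".") -> List[str]:
    """Build a Clang command line that writes textual LLVM IR (.ll).

    -S -emit-llvm : emit human-readable LLVM IR instead of object code
    -g            : keep debug information so IR lines can be mapped to source lines
    -O0           : no optimization, so the IR stays close to the source (assumed)
    """
    src = Path(source)
    out = Path(out_dir) / (src.stem + ".ll")
    return ["clang", "-S", "-emit-llvm", "-g", "-O0", str(src), "-o", str(out)]


# Hypothetical input file; the command is only built here, not executed.
command = clang_ir_command("demo.c", "build")
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where Clang is installed.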

3. The program vulnerability detection method of claim 1, wherein the specific implementation process of extracting key points which may cause a vulnerability comprises:

for each function $f_i \in F$, establishing an abstract syntax tree $A_i$;

traversing each lexical unit $t_j$, $t_j \in A_i$, and judging whether $t_j$ matches any of the four features in $Z$;

if matched, setting $Y \leftarrow Y \cup \{t_j\}$ and storing $Y$; $Z = \{z_{api}, z_{array}, z_{pointer}, z_{arithmetic}\}$, wherein $z_{api}$, $z_{array}$, $z_{pointer}$, $z_{arithmetic}$ respectively denote the four special markers of a library function, an array, a pointer and an expression; and outputting $Y$, the key points which may cause a vulnerability.
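The traversal in this claim can be sketched as follows. The token representation (line, kind, text), the kind names, the watch list of library functions and the set of arithmetic operators are all hypothetical; a real implementation would walk the abstract syntax tree produced by a compiler front end.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[int, str, str]  # (source line, token kind, token text), assumed shape

LIBRARY_CALLS = {"strcpy", "memcpy", "malloc", "free", "gets"}  # hypothetical list
ARITHMETIC_OPERATORS = {"+", "-", "*", "/", "%"}


def match_feature(kind: str, text: str) -> Optional[str]:
    """Return which of the four features in Z the token matches, if any."""
    if kind == "call" and text in LIBRARY_CALLS:
        return "api"
    if kind == "array_subscript":
        return "array"
    if kind == "pointer_deref":
        return "pointer"
    if kind == "binary_op" and text in ARITHMETIC_OPERATORS:
        return "arithmetic"
    return None


def extract_key_points(functions: Dict[str, List[Token]]) -> List[Tuple[str, int, str, str]]:
    """Collect Y: every token of every function that matches a feature in Z."""
    key_points = []
    for name, tokens in functions.items():      # each function f_i with tokens of A_i
        for line, kind, text in tokens:         # each lexical unit t_j
            feature = match_feature(kind, text)
            if feature is not None:
                key_points.append((name, line, feature, text))
    return key_points


# Hypothetical token stream of one function.
functions = {
    "copy": [
        (3, "decl", "buf"),
        (4, "call", "strcpy"),
        (5, "array_subscript", "buf[i]"),
        (6, "binary_op", "+"),
        (7, "call", "printf"),
    ]
}
key_points = extract_key_points(functions)
```

In this toy input the declaration and the `printf` call are not key points, because neither matches a feature in the assumed lists.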

4. The program vulnerability detection method according to claim 1, wherein the code segment obtaining process of the program comprises:

in a statement $s_w$, given a particular lexical unit

collecting the set of successor vertices of statement $s_w$, defined as $S_s$;

$Y$ is the set of key points which may cause a vulnerability;

for any statement $s_s \in S_s$, judging whether $s_s$, through

has a data or control dependence on $s_w$; if the dependence exists, extracting the statement $s_s$ into the forward slice;

collecting the set of predecessor vertices of statement $s_w$, defined as $S_p$; for any statement $s_p \in S_p$, judging

whether $s_p$, through

combining the forward slices and the backward slices to obtain combined slices

Slicing the slices

As a code fragment of the program.
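The forward and backward slicing of this claim can be sketched over a dependence graph. The representation is hypothetical: statements are identified by line number, `depends_on[s]` lists the statements that `s` has a data or control dependence on, and dependences are followed transitively, which is an assumption of this sketch.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable(start: int, edges: Dict[int, Iterable[int]]) -> Set[int]:
    """Breadth-first search over dependence edges, including the start node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def slice_statement(key_statement: int,
                    depends_on: Dict[int, Iterable[int]]) -> List[int]:
    """Combine the backward and forward slices of one key statement."""
    # Invert the graph: dependents[s] = statements that depend on s.
    dependents: Dict[int, Set[int]] = {}
    for statement, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(statement)

    backward = reachable(key_statement, depends_on)   # what the key statement needs
    forward = reachable(key_statement, dependents)    # what the key statement affects
    return sorted(backward | forward)


# Hypothetical dependence graph; statement 3 contains the key point.
depends_on = {2: [1], 3: [2], 4: [3], 5: [1], 6: [4]}
combined_slice = slice_statement(3, depends_on)
```

Statement 5 depends only on statement 1 and neither influences nor is influenced by statement 3, so it is left out of the combined slice.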

5. The program vulnerability detection method according to claim 1 or 4, wherein the code segment obtaining process of the program comprises:

if it is sliced

Performing the following steps; wherein the content of the first and second substances,

is empty, the initial state of (a) is empty,

the updating process of (2) comprises: judging whether $t_j$ matches $z_{elseif}$, $z_{else}$ or $z_{case}$; if so, binding $m_{cur}$ and $m_{pre}$, where the initial value of $m_{pre}$ is empty; if not, recording

into

and assigning $m_{cur}$ to $m_{pre}$; $t_j \in A_i$, where $t_j$ is the $j$-th lexical unit and $A_i$ is an abstract syntax tree; when $t_j$ matches one of the eight control statements in $Z$, $t_j$ in $A_i$ is denoted $a_{ij}$; $m_{cur}$ stores the minimum and maximum line numbers of $a_{ij}$; $Z = \{z_{if}, z_{elseif}, z_{else}, z_{for}, z_{while}, z_{dowhile}, z_{switch}, z_{case}\}$, where $z_{if}$, $z_{elseif}$, $z_{else}$, $z_{for}$, $z_{while}$, $z_{dowhile}$, $z_{switch}$, $z_{case}$ are control statements;

is provided with

Initially empty, for

two functions $f_\upsilon$, $f_\omega$, if $f_\upsilon$ calls $f_\omega$, then

Assign to

Namely the final code slicing result of the source code program;

the updating process of (a) includes: device set

Initially empty, for occurrence in

two statements $s_\lambda$, $s_\mu$, if $s_\mu$ is $s_\lambda$ or $s_\mu$ is less than $s_\lambda$, then

Then re-assigned to

Into program slice fragments of intermediate code.

6. The program vulnerability detection method of claim 1, wherein using IR2Vec to perform word embedding on the code segment of the program to obtain the coded vector comprises:

reading a code segment to be subjected to word embedding;

generating a function call graph according to the call relation of each function in the program, and acquiring the name of a called function according to the function call graph;

according to the dependency relationship of control flow between a seed embedded vocabulary table and instructions, instruction word vectors are guided to be generated, the word vector of each basic block is formed by splicing the word vectors of each instruction, the word vector of a function is formed by splicing the word vectors of each basic block of the function, and the word vector of the function is generated;

splicing the word vectors of the functions to obtain coded vectors;

wherein the seed embedding vocabulary generating process comprises: mapping a program instruction to a code triple < h, r, t >, h, r, t respectively representing the type of the current instruction, the relationship between an operator of the current instruction and an operator of the next instruction, and the relationship between the operator of the current instruction and an operand thereof; and embedding h, r and t into the same high-dimensional space through a knowledge graph model to obtain a seed embedding vocabulary.
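The claim names only "a knowledge graph model" for embedding the triples. As an illustration, the sketch below assumes a translation-based model of the TransE kind, in which a triple <h, r, t> is considered plausible when the vector h + r lies close to t; the two-dimensional embeddings are made up for the example.

```python
import math
from typing import Sequence


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """L2 distance ||h + r - t||; a smaller value means a more plausible triple.

    TransE-style scoring is an assumption of this sketch, not stated in the claim.
    """
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Hypothetical 2-dimensional embeddings.
h = [1.0, 0.0]
r = [0.0, 1.0]
t_consistent = [1.0, 1.0]    # equals h + r, so the distance is zero
t_unrelated = [3.0, 5.0]

score_consistent = transe_score(h, r, t_consistent)
score_unrelated = transe_score(h, r, t_unrelated)
```

Training such a model moves the embeddings so that observed triples score lower than corrupted ones; the learned entity vectors would then serve as the seed embedding vocabulary.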

7. The program vulnerability detection method of claim 1, wherein the neural network employs a bidirectional recurrent neural network model; and inputting the coded vectors into the bidirectional recurrent neural network model according to a random sequence.

8. The program vulnerability detection method of claim 1, further comprising: inputting the source code to be predicted into the vulnerability detection model, and extracting the output results of the vulnerability detection model that are larger than a set threshold value, wherein the output results are the line numbers of possible vulnerabilities.

9. A terminal device comprising a processor and a memory; the memory stores computer programs/instructions; the processor executes computer programs/instructions stored by the memory; the computer program/instructions configured to implement the steps of the method of one of claims 1 to 8.

10. A computer storage medium having stored thereon a computer program/instructions; characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method of one of claims 1 to 8.