CN112541180B - Software security vulnerability detection method based on grammatical features and semantic features - Google Patents


Info

Publication number
CN112541180B
Authority
CN
China
Prior art keywords
pdg
detection
vulnerability
detection object
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011488425.3A
Other languages
Chinese (zh)
Other versions
CN112541180A (en)
Inventor
危胜军
胡昌振
钟浩
陶莎
赵敬宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Peng Cheng Laboratory
Original Assignee
Beijing Institute of Technology BIT
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT, Peng Cheng Laboratory filed Critical Beijing Institute of Technology BIT
Priority to CN202011488425.3A priority Critical patent/CN112541180B/en
Publication of CN112541180A publication Critical patent/CN112541180A/en
Application granted granted Critical
Publication of CN112541180B publication Critical patent/CN112541180B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a software security vulnerability detection method based on grammatical features and semantic features. The method comprises the following steps: step 1, determining the granularity of the detection object; step 2, establishing a software historical vulnerability library; step 3, establishing an abstract syntax tree (AST) of the detection object; step 4, embedding the abstract syntax tree; step 5, compiling the software source code of the detection object; step 6, establishing a program dependence graph (PDG) of the detection object; step 7, embedding the program dependence graph; step 8, learning the AST features with a graph convolutional neural network; step 9, learning the PDG features with a bidirectional LSTM. The invention has the following advantages: the precision, accuracy and recall of the detection model are improved; the AST tree structure is learned directly by a graph neural network, so no structural information is lost, and this direct feature-extraction approach based on the graph neural network greatly improves the detection performance of the model.

Description

Software security vulnerability detection method based on grammatical features and semantic features
Technical Field
The invention belongs to the technical field of software security, and particularly relates to a software security vulnerability detection method based on grammatical features and semantic features.
Background
At present, with the large-scale disclosure of software source code and its vulnerability data, the relevant data can be obtained in quantity at low cost, and data-driven methods can be used for vulnerability detection. The idea is to use the feature-learning capability of deep learning to automatically extract the vulnerability features of a source-code module and build a vulnerability detection model. The whole process is divided into two stages: a model-building stage and a model-application stage. In the model-building stage, the first step is to determine the granularity of the analysis object, i.e. the size of the software source-code module; a source-code module is a segment of related code, can be defined freely, and may be a file, a function, a component or a code segment of any size. The second step is to preprocess the analysis object and derive an intermediate representation of the code suitable for analysis, such as a token sequence, an AST or a CFG. The third step is to quantize the intermediate representation numerically, usually by vector-space embedding. The fourth step is to select a suitable deep learning algorithm; the quantized vectors of the third step are the input of the deep learning algorithm, and its output is the learned features. The fifth step is to take the learned features as the input of a classifier, take the label of the corresponding code module (whether a vulnerability exists, or the concrete vulnerability class) as the output of the classifier, and train the classifier. The trained model can then be used in the model-application stage: a new software source-code module is preprocessed and converted into an intermediate representation, the same quantization and embedding are applied, the resulting vector is fed to the deep learning model to extract features, those features are fed to the classifier, and the classifier outputs the probability of each vulnerability class.
The analysis granularity of prior art 1 (VulDeePecker: A Deep Learning-Based System for Vulnerability Detection, 2018) is the code gadget, a set of semantically related code statements generated by analysing the data flow of a program. Each code gadget is turned into a token sequence by lexical analysis, each token is converted into a vector with word2vec to obtain a vector representation of the gadget, and a bidirectional LSTM (BLSTM) is then used for feature learning to build the vulnerability detection model. Prior art 2 (SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities) extends the work of prior art 1 and builds a vulnerability detection model based on a bidirectional gated recurrent unit (BGRU). Addressing the shortcomings of prior art 1, prior art 3 (μVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection) designs a multiclass vulnerability classifier based on the same code-gadget concept, so that the type of a vulnerability can be indicated accurately. Prior art 4 (VulDeeLocator: A Deep Learning-based Fine-grained Vulnerability Detector) extends prior art 2, obtains semantically related slices of LLVM intermediate code, and builds the model with a BRNN. Prior art 5 (a deep learning model for vulnerability detection on web application variants) builds a vulnerability detection model for PHP slices: each PHP slice is first converted into an opcode-based intermediate representation, the opcodes are tokenized and vectorized with word2vec, and a five-layer LSTM network is used to build the classification model; a slice is a code segment whose statements have some artificially defined association, for example data-dependence and control-dependence relations between statements. Prior art 6 (automatic discovery for vulnerability prediction) builds a vulnerability classification model for Java files: each method in a Java file is tokenized, each token is embedded with an LSTM, the syntactic features of each method are obtained, and the syntactic features of each file are obtained after pooling; all token vectors are clustered into categories, and the count of tokens of each file falling into each category is used as the semantic feature of the file; the syntactic and semantic features are then used as classifier input to build the classification model. Prior art 7 (Project Achilles: A Prototype Tool for Static Method-Level Vulnerability Detection of Java Source Code Using a Recurrent Neural Network) adopts a method similar to prior art 6, applies an LSTM model to Java programs at method rather than file granularity, and tests detection of several specific vulnerability types.
Prior art 8 (Automated Vulnerability Detection in Source Code Using Deep Representation Learning) analyses C/C++ functions: each function is tokenized, features are extracted with a method similar to sentence sentiment classification, and vulnerability classification is performed with convolutional neural networks (CNN) and recurrent neural networks (RNN). Prior art 9 (Automated software vulnerability detection with machine learning) also analyses C/C++ functions and compares the performance of several different input features, examining a bag-of-words model and a word-vector method; the bag-of-words model uses an extremely randomized trees classifier and the word-vector method uses a TextCNN model. Prior art 10 (A deep tree-based model for software defect prediction) first builds the abstract syntax tree (AST) of each Java source file and then builds a vulnerability detection model on the AST with a tree-structured LSTM network. Prior art 11 (Cross-Project Transfer Representation Learning for Vulnerable Function Discovery) and prior art 12 (POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects) build the AST of each function of a C/C++ source program, obtain the AST node sequence by depth-first search, embed the node sequence with word2vec, and then build the classification model with a five-layer LSTM network. Prior art 13 (Automatic feature learning for vulnerability prediction) builds an AST for each Java file, extracts the AST nodes, forms a library of all distinct node types, assigns each node type an integer, and converts the nodes into integers as the input of a DBN to build the classification model. Prior art 14 (Software Defect Prediction via Convolutional Neural Network), building on prior art 13, embeds the AST nodes with a CNN to obtain their vector representations, learns the features of the whole AST node sequence, and uses these features as the input of a logistic regression classifier to build the classification model. Prior art 15 (Static Detection of Control-Flow-Related Vulnerabilities Using Graph Embedding) uses a graph convolutional network to build a detection model for control-flow-related vulnerabilities: the control flow graph of a method is built first, its nodes are embedded with Doc2Vec, and the graph convolutional network then learns the whole control flow graph to obtain a feature representation for building the detection model. Prior art 16 (Convolutional Neural Networks over Control Flow Graphs for Software Defect Prediction) first compiles each source file to assembly instructions, builds a CFG over the assembly instructions, and uses a directed-graph-based convolutional neural network to build a vulnerability detection model on the CFG.
In addition, another prior art proposes building a vulnerability detection model using four kinds of features simultaneously: the AST, the CFG, the PDG (program dependence graph) and conventional code metrics; after the four kinds of features are embedded and quantized, the quantized vectors are directly concatenated to form an overall feature. The principle of the word2vec embedding algorithm is as follows: the purpose of word2vec embedding is to convert each symbol of a sequence into a numerical vector so that symbols with similar meanings map to vectors that are close to each other.
The shortcomings of the current prior art are summarized as follows:
(1) The prior art uses AST-based feature extraction whose idea is to convert the AST into a node sequence with some search algorithm and then extract features from that sequence. The disadvantage is as follows: an AST is a tree structure that reflects node types and the associations between nodes; in the process of conversion into a node sequence, existing search algorithms (e.g. depth-first search) cannot preserve the adjacency and ordering relations between nodes of the same level, i.e. the nodes of one level may be converted into a sequence in several different ways, so the resulting sequence cannot retain the structural information of the original tree, and the loss of original information is large.
(2) The prior art uses CFG-based feature extraction whose idea is to extract the semantic features of the CFG with a graph neural network. The disadvantage is that the CFG contains only the control-flow information of the program, lacks data-flow information, and its semantic expression is therefore incomplete.
(3) The prior art uses PDG-based feature extraction in which the adjacency matrix of the graph's nodes represents the PDG features (i.e. the semantic features). The adjacency matrix can only express whether a relation between nodes exists (0 or 1) and cannot express the degree of association; moreover, the PDG actually describes the execution order of code statements in a program, which is a temporal sequence.
(4) The prior art builds slices of the analysis object according to the grammatical features of known vulnerabilities and then analyses the semantics of the slices; the grammatical features of the analysis object itself are not exploited in this process, and the semantic features used in the prior art differ conceptually from those based on the control flow graph.
(5) The prior art also proposes detection methods based on grammatical and semantic features, but they differ clearly from the invention. Regarding grammatical features: grammatical features essentially describe the association relations among the components that make up the source code, whereas the grammatical features adopted in the prior art are the average of the state vectors of the tokens of a source file, which obviously cannot describe those association relations; the grammatical features of the invention describe them accurately. Regarding semantic features: the semantic features in the prior art describe the distances between the tokens of a source file and are therefore only a static description.
(6) The existing detection methods based on the grammatical and semantic features of source code all have shortcomings to some degree. Bag-of-words models over n-gram sequences and word-vector models over token sequences can hardly describe the grammar and semantics of code accurately. In AST-based detection methods, the AST describes the grammar of the code well, but it cannot represent the execution semantics of a program, so many vulnerabilities related to execution semantics cannot be detected. In detection methods based on the control flow graph, the CFG represents the execution process of the program well, but it does not contain variable declarations and lacks part of the semantics, which strongly affects the detection and localization of vulnerabilities.
Disclosure of Invention
The invention aims to provide a software security vulnerability detection method based on grammatical features and semantic features, which can overcome the technical problems. The method comprises the following steps:
step 1, determining the granularity of a detection object:
The granularity of the detection object is a function, a file, a component or any code segment with association relations; it is determined according to the requirements of the actual detection project, and the language of the detection project is C/C++, Java or PHP.
Step 2, establishing a software historical vulnerability library:
Software security vulnerabilities in the same programming language as the detected software project are retrieved from a public software vulnerability library, and a vulnerability sample library is established for that language class; the sample size is the detection granularity, and the vulnerability sample library records the vulnerability situation of each sample of that granularity, namely whether each sample contains vulnerabilities and their types and number;
the Juliet Test Suite for C/C++ (JTS) data set of the public vulnerability database SARD (Software Assurance Reference Dataset) is used; the version is JTS-1.3, which comprises 246852 functions in total, of which 105244 contain vulnerabilities, accounting for 42% of the whole sample; the label of each function (vulnerable or not) is obtained directly by parsing the file name and the function name, the vulnerability labels 1 and 0 denoting the presence and absence of a vulnerability.
Step 3, establishing an abstract syntax tree of the detection object:
on the basis of the step 1, the detection object is analyzed based on an LLVM compiler, and an Abstract Syntax Tree (AST) of the detection object is established based on a third party interface Clang Lib provided by the compiler.
Step 4, embedding the abstract syntax tree:
On the basis of the step 3, the obtained abstract syntax tree of the detection object is traversed according to a depth-first search (DFS) algorithm to generate a node sequence of the syntax-tree nodes; for all samples in the step 2, the abstract syntax tree of each sample is generated and the node sequence of each abstract syntax tree is produced; based on the node sequences of the abstract syntax trees corresponding to all samples in the vulnerability library in the step 2, each node is embedded with the word2vec embedding algorithm to obtain the vector representation of the node. An example of an abstract syntax tree and its node sequence is shown below:
Node sequence: {MethodDeclaration, int, func1, Parameter, int, var1, BlockStmt, ExpressionStmt, VariableDeclaration, int, Assign, var2, EnclosedExpr, BinaryExpr:divide, var1, 42, ReturnStmt}.
Step 5, compiling the software source code of the detection object:
on the basis of the step 1, compiling the detection object based on the LLVM compiler, and acquiring an Intermediate Representation (IR) of the code.
Step 6, establishing a program dependence graph of the detection object:
on the basis of step 5, a Program Dependency Graph (PDG) of the detected object is established through a Pass framework provided by the LLVM based on the intermediate representation IR of the code.
Step 7, embedding the program dependence graph: on the basis of the step 6, the PDG of the obtained detection object is traversed according to the following search algorithm to generate a node sequence of the PDG; let the node set of the PDG graph be V. The specific steps are as follows:
Step 7.1, traverse the set V, output all nodes whose in-degree is 0, denote this set of nodes as V1, and set V = V − V1;
Step 7.2, for the successor nodes in V of all the nodes in V1, subtract 1 from their in-degree;
Step 7.3, repeat the step 7.1 and the step 7.2 until the number of the nodes in V is 0, then end;
Step 7.4, for all samples in the step 2, generate the PDG of each sample by the same method and generate the node sequence of each PDG; based on the node sequences of the PDGs corresponding to all samples in the vulnerability library in the step 2, embed each node with the word2vec embedding algorithm to obtain the vector representation of the node;
examples of PDG and its node sequence are as follows:
the obtained node sequence is as follows: { BB1, BB2, BB3, BB5, BB4 }.
Step 8, learning the AST features with a graph convolutional neural network:
Step 8.1, on the basis of the step 4, a graph convolutional neural network is selected to establish a deep learning model for the AST grammatical features;
Step 8.2, the vector representations of the nodes obtained in the step 4 are used as the input of the graph convolutional neural network, and the features of the tree structure of the AST are learned directly by the graph convolutional neural network, which comprises four layers: an input layer, a convolution layer, a pooling layer and a fully connected layer.
Step 9, learning PDG characteristics by using the bidirectional LSTM:
On the basis of the step 7, a bidirectional long short-term memory network (BLSTM) is selected to establish a deep learning model for the PDG semantic features; the BLSTM has good learning capability for time-series input, and since the PDG node sequence is a time-series structure, the BLSTM model is selected;
the BLSTM used comprises four layers: an input layer, a bidirectional LSTM processing unit layer, an Attention layer and a fully connected layer.
Step 10, establishing a fusion model for the grammatical features and the semantic features: on the basis of the steps 8 and 9, a two-layer fully connected neural network and a Softmax classifier are selected to establish the fusion model; the outputs of the step 8 and the step 9 are taken as the input of the fully connected layers, the output of the fully connected layers is taken as the input of the Softmax, and the output of the Softmax is the probability of a vulnerability.
Step 11, training and testing the detection model:
Step 11.1, on the basis of the vulnerability library established in the step 2, the detection object of the step 2 is converted into an AST based on the step 3, and the vector representation of the AST is obtained based on the step 4; the detection object of the step 2 is converted into a PDG based on the step 6;
Step 11.2, the vector representation of the PDG is obtained based on the step 7; the AST vector representation and the PDG vector representation are used as the input of the graph convolutional neural network and of the BLSTM respectively, the label of the detection object is used as the output of the classifier, and the whole model is trained and tested with the Adam optimization training algorithm.
Step 12, applying the detection model to a new software module: the detection model obtained in the step 11 is applied to the new software module; the AST and PDG of the new software module are first established and converted into vector representations, the vectors are used as the input of the model, and the output of the model is the probability that the new software module contains a vulnerability.
The method has the following advantages:
1. The method adopts the AST to represent the grammar of the software and the PDG (program dependence graph) to represent its semantics, extracts the two kinds of features with two dedicated deep neural networks, and fuses the grammatical and semantic features with a further deep neural network, thereby improving the precision, accuracy and recall of the detection model;
2. The method adopts a graph neural network to learn the AST tree structure directly, without converting the AST into a sequence, so no information is lost, and this direct feature-extraction approach based on the graph neural network greatly improves the detection performance of the model;
3. The method extracts semantic features from the PDG, which is an extension of the CFG containing both control-dependence and data-dependence information, so its semantic expression is more complete than that of the CFG;
4. The method extracts the PDG semantic features with a BLSTM, which is effective at extracting time-series features, improving the accuracy and completeness of semantic feature extraction;
5. The semantic features adopted by the method are the execution semantics of the program, i.e. dynamic features, and vulnerabilities can be detected more accurately on the basis of the execution semantics of the program;
6. The method provides a syntactic feature extraction method based on the AST plus a graph convolutional neural network, a semantic feature extraction method based on the PDG plus a BLSTM, and a neural-network-based fusion of the syntactic and semantic features, so that both kinds of features can be extracted comprehensively and accurately at the same time and fused, reducing the missed-alarm rate.
Drawings
FIG. 1 is a schematic diagram of a method of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is a schematic diagram of the training of a model of the method of the present invention;
FIG. 4 is a schematic diagram of the graph convolutional neural network of the method of the present invention;
FIG. 5 is a schematic diagram of a BLSTM of the process of the present invention;
FIG. 6 is a diagram of an abstract syntax tree and its node sequence for the method of the present invention;
fig. 7 is a schematic diagram of PDG and its node sequence according to the method of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. As shown in fig. 1, the method of the present invention comprises the following steps:
step 1, determining the granularity of a detection object:
The granularity of the detection object is a function, a file, a component or any code segment with association relations; it is determined according to the requirements of the actual detection project, and the language of the detection project is C/C++, Java or PHP.
Step 2, establishing a software historical vulnerability library:
Software security vulnerabilities in the same programming language as the detected software project are retrieved from a public software vulnerability library, and a vulnerability sample library is established for that language class; the sample size is the detection granularity, and the vulnerability sample library records the vulnerability situation of each sample of that granularity, namely whether each sample contains vulnerabilities and their types and number;
the Juliet Test Suite for C/C++ (JTS) data set of the public vulnerability database SARD (Software Assurance Reference Dataset) is used; the version is JTS-1.3, which comprises 246852 functions in total, of which 105244 contain vulnerabilities, accounting for 42% of the whole sample; the label of each function (vulnerable or not) is obtained directly by parsing the file name and the function name, the vulnerability labels 1 and 0 denoting the presence and absence of a vulnerability.
Step 3, establishing an abstract syntax tree of the detection object:
on the basis of the step 1, the detection object is analyzed based on an LLVM compiler, and an Abstract Syntax Tree (AST) of the detection object is established based on a third party interface Clang Lib provided by the compiler.
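As a concrete illustration of this step, the sketch below uses the libclang Python bindings (clang.cindex), which wrap the Clang library interface mentioned above; the source file name and the choice of using the cursor kind plus spelling as the node token are illustrative assumptions, not details fixed by the patent.

```python
import clang.cindex

def ast_node_sequence(source_file):
    """Parse a C/C++ file with libclang and return its AST nodes in depth-first order."""
    index = clang.cindex.Index.create()
    tu = index.parse(source_file, args=["-std=c11"])
    sequence = []

    def dfs(cursor):
        # Record the node kind and, when present, the identifier it refers to.
        token = cursor.kind.name + (":" + cursor.spelling if cursor.spelling else "")
        sequence.append(token)
        for child in cursor.get_children():
            dfs(child)

    dfs(tu.cursor)
    return sequence

# Example: nodes = ast_node_sequence("sample.c")
```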
Step 4, embedding the abstract syntax tree:
On the basis of the step 3, the obtained abstract syntax tree of the detection object is traversed according to a depth-first search (DFS) algorithm to generate a node sequence of the syntax-tree nodes; for all samples in the step 2, the abstract syntax tree of each sample is generated and the node sequence of each abstract syntax tree is produced; based on the node sequences of the abstract syntax trees corresponding to all samples in the vulnerability library in the step 2, each node is embedded with the word2vec embedding algorithm to obtain the vector representation of the node;
an example of an abstract syntax tree and its node sequence is shown in fig. 6:
Node sequence: {MethodDeclaration, int, func1, Parameter, int, var1, BlockStmt, ExpressionStmt, VariableDeclaration, int, Assign, var2, EnclosedExpr, BinaryExpr:divide, var1, 42, ReturnStmt}.
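The embedding itself can be sketched with gensim's Word2Vec; the vector size and the toy corpus below (a single shortened node sequence) are illustrative assumptions, while the window of 2 and the CBOW variant follow the embedding model described later in the specification.

```python
from gensim.models import Word2Vec

# One node sequence per sample; here a single shortened sequence serves as a toy corpus.
node_sequences = [["MethodDeclaration", "int", "func1", "Parameter", "int", "var1",
                   "BlockStmt", "ReturnStmt"]]

model = Word2Vec(sentences=node_sequences, vector_size=64, window=2,
                 min_count=1, sg=0)            # sg=0 selects the CBOW training mode
vector = model.wv["MethodDeclaration"]         # embedding vector of one AST node
```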
Step 5, compiling the software source code of the detection object:
on the basis of the step 1, compiling the detection object based on the LLVM compiler, and acquiring an Intermediate Representation (IR) of the code.
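For illustration, the compilation step can be performed by invoking Clang with its standard IR-emitting options; the file names are placeholders, and the PDG construction of step 6 is assumed to be an LLVM Pass that consumes the generated .ll file.

```python
import subprocess

# Compile one detection object to textual LLVM IR (sample.ll) for the step-6 PDG pass.
subprocess.run(["clang", "-S", "-emit-llvm", "sample.c", "-o", "sample.ll"], check=True)
```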
Step 6, establishing a program dependence graph of the detection object:
on the basis of step 5, a Program Dependency Graph (PDG) of the detected object is established through a Pass framework provided by the LLVM based on the intermediate representation IR of the code.
Step 7, embedding the program dependence graph: on the basis of the step 6, the PDG of the obtained detection object is traversed according to the following search algorithm to generate a node sequence of the PDG; let the node set of the PDG graph be V. The specific steps are as follows (a minimal sketch of this traversal is given after the example below):
Step 7.1, traverse the set V, output all nodes whose in-degree is 0, denote this set of nodes as V1, and set V = V − V1;
Step 7.2, for the successor nodes in V of all the nodes in V1, subtract 1 from their in-degree;
Step 7.3, repeat the step 7.1 and the step 7.2 until the number of the nodes in V is 0, then end;
Step 7.4, for all samples in the step 2, generate the PDG of each sample by the same method and generate the node sequence of each PDG; based on the node sequences of the PDGs corresponding to all samples in the vulnerability library in the step 2, embed each node with the word2vec embedding algorithm to obtain the vector representation of the node;
an example of PDG and its node sequence is shown in fig. 7:
the resulting node sequence: { BB1, BB2, BB3, BB5, BB4 }.
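A minimal sketch of the step-7 traversal follows. It implements steps 7.1–7.3 as a layered topological ordering and assumes the dependence edges form an acyclic graph; the PDG is represented as a dictionary mapping each node to its successor list, and the edge set in the example is an illustrative assumption chosen so that the result reproduces the node sequence of FIG. 7.

```python
def pdg_node_sequence(pdg):
    """Order the PDG nodes so that every node precedes its successors (steps 7.1-7.3)."""
    in_degree = {v: 0 for v in pdg}
    for successors in pdg.values():
        for s in successors:
            in_degree[s] += 1

    remaining = set(pdg)
    order = []
    while remaining:                                       # step 7.3: stop when V is empty
        layer = [v for v in pdg if v in remaining and in_degree[v] == 0]   # step 7.1
        order.extend(layer)
        remaining -= set(layer)
        for v in layer:                                    # step 7.2: decrement successors
            for s in pdg[v]:
                in_degree[s] -= 1
    return order

# Hypothetical edge set reproducing the node sequence of FIG. 7.
example_pdg = {"BB1": ["BB2", "BB3"], "BB2": ["BB4"],
               "BB3": ["BB5"], "BB4": [], "BB5": ["BB4"]}
print(pdg_node_sequence(example_pdg))   # ['BB1', 'BB2', 'BB3', 'BB5', 'BB4']
```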
Step 8, learning the AST features with a graph convolutional neural network:
Step 8.1, on the basis of the step 4, a graph convolutional neural network is selected to establish a deep learning model for the AST grammatical features;
Step 8.2, the vector representations of the nodes obtained in the step 4 are used as the input of the graph convolutional neural network, and the features of the tree structure of the AST are learned directly by the graph convolutional neural network, which comprises four layers: an input layer, a convolution layer, a pooling layer and a fully connected layer, as shown in FIG. 4.
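A minimal sketch of such a network using PyTorch Geometric is given below; the layer widths, the mean-pooling choice and the output dimension are illustrative assumptions, since the patent only specifies the four layer types.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class AstGCN(torch.nn.Module):
    """Graph convolutional network over the AST (step 8): input, convolution, pooling, FC."""
    def __init__(self, in_dim=64, hidden=128, out_dim=64):
        super().__init__()
        self.conv = GCNConv(in_dim, hidden)          # convolution layer over AST edges
        self.fc = torch.nn.Linear(hidden, out_dim)   # fully connected layer

    def forward(self, x, edge_index, batch):
        # x: node embeddings from step 4; edge_index: parent-child edges of the AST;
        # batch: graph-membership vector used to pool one feature vector per AST.
        h = F.relu(self.conv(x, edge_index))
        h = global_mean_pool(h, batch)               # pooling layer
        return self.fc(h)                            # syntactic feature vector
```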
Step 9, learning PDG characteristics by using the bidirectional LSTM:
On the basis of the step 7, a bidirectional long short-term memory network (BLSTM) is selected to establish a deep learning model for the PDG semantic features; the BLSTM has good learning capability for time-series input, and since the PDG node sequence is a time-series structure, the BLSTM model is selected;
the BLSTM used comprised four layers: the input layer, the bi-directional LSTM processing unit layer, the Attention layer and the full link layer, the example of BLSTM is shown in FIG. 5.
Step 10, establishing a fusion model for the grammatical features and the semantic features: on the basis of the steps 8 and 9, a two-layer fully connected neural network and a Softmax classifier are selected to establish the fusion model; the outputs of the step 8 and the step 9 are taken as the input of the fully connected layers, the output of the fully connected layers is taken as the input of the Softmax, and the output of the Softmax is the probability of a vulnerability.
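The fusion model of this step can be sketched as follows; concatenating the two feature vectors and the layer widths are illustrative assumptions, while the two fully connected layers and the Softmax output follow the text.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Two fully connected layers plus Softmax over the fused features (step 10)."""
    def __init__(self, feat_dim=64, hidden=64, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),   # first fully connected layer
            nn.Linear(hidden, num_classes))               # second fully connected layer

    def forward(self, ast_feat, pdg_feat):
        logits = self.net(torch.cat([ast_feat, pdg_feat], dim=1))
        return torch.softmax(logits, dim=1)               # vulnerability probabilities
```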
Step 11, training and testing the detection model:
Step 11.1, on the basis of the vulnerability library established in the step 2, the detection object of the step 2 is converted into an AST based on the step 3, and the vector representation of the AST is obtained based on the step 4; the detection object of the step 2 is converted into a PDG based on the step 6;
Step 11.2, the vector representation of the PDG is obtained based on the step 7; the AST vector representation and the PDG vector representation are used as the input of the graph convolutional neural network and of the BLSTM respectively, the label of the detection object is used as the output of the classifier, and the whole model is trained and tested with the Adam optimization training algorithm, as shown in FIG. 3.
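A minimal end-to-end training sketch combining the three modules sketched above is given below; the data loader, the number of epochs and the batch layout are illustrative assumptions, while the Adam hyper-parameters (learning rate 0.001, decay rates 0.9 and 0.99) follow the values given later in the specification.

```python
import torch
import torch.nn.functional as F

ast_net, pdg_net, fusion = AstGCN(), PdgBLSTM(), FusionClassifier()
params = (list(ast_net.parameters()) + list(pdg_net.parameters())
          + list(fusion.parameters()))
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.99))

for epoch in range(30):
    # train_loader is a hypothetical iterator yielding one batch of AST graphs
    # (node features, edges, graph-membership vector), PDG sequences and labels.
    for ast_x, ast_edges, ast_batch, pdg_x, labels in train_loader:
        probs = fusion(ast_net(ast_x, ast_edges, ast_batch), pdg_net(pdg_x))
        loss = F.nll_loss(torch.log(probs + 1e-12), labels)   # cross-entropy on probabilities
        optimizer.zero_grad()
        loss.backward()    # errors propagate back through classifier, FC, BLSTM and GCN
        optimizer.step()
```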
Step 12, applying the detection model to a new software module: the detection model obtained in the step 11 is applied to the new software module; the AST and PDG of the new software module are first established and converted into vector representations, the vectors are used as the input of the model, and the output of the model is the probability that the new software module contains a vulnerability.
During training, the error between the actual output and the expected output is propagated backwards, and the parameters are adjusted in the order of the classifier, the fully connected layers, the BLSTM model and the graph convolutional neural network model, as shown in FIG. 3.
The method of the invention obtains two node sequences: the node sequence of the abstract syntax tree and the node sequence of the program dependence graph. Because the node sequences consist of symbols, the symbols must be converted into numerical vectors; the conversion uses the following model:
Let the symbol sequence be denoted $w_1, w_2, \ldots, w_n$ and let the sliding-window size be 2, i.e. each symbol is predicted from the two symbols before it and the two symbols after it: $w_{t-2}, w_{t-1}$ and $w_{t+1}, w_{t+2}$ are used to predict $w_t$. The likelihood function of the model is the probability of generating every central word from its background words:

$$\prod_{t=1}^{n} P\bigl(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\bigr) \quad (1)$$

Let $v$ denote the vector of a background word and $u$ the vector of a central word, and let $\bar{v}_t = \tfrac{1}{4}(v_{t-2}+v_{t-1}+v_{t+1}+v_{t+2})$ be the average of the background-word vectors; then

$$P\bigl(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\bigr) = \frac{\exp\bigl(u_t^{\top}\bar{v}_t\bigr)}{\sum_{i \in V} \exp\bigl(u_i^{\top}\bar{v}_t\bigr)} \quad (2)$$

where $V$ is the dictionary formed by all symbols. The maximum-likelihood estimate of the model is equivalent to minimizing the loss function

$$-\sum_{t=1}^{n} \log P\bigl(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\bigr) \quad (3)$$
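As a small numerical illustration of Eq. (2), the snippet below computes the probability of each dictionary entry being the centre node given four background-node vectors; the toy vectors and the three-entry dictionary are arbitrary assumptions.

```python
import numpy as np

u = np.array([[0.2, 0.1], [0.5, -0.3], [-0.1, 0.4]])   # centre-word vectors, |V| = 3
v_context = np.array([[0.3, 0.2], [0.1, 0.0],
                      [0.0, 0.1], [0.2, 0.3]])          # the four background-node vectors
v_bar = v_context.mean(axis=0)                          # averaged background vector
scores = u @ v_bar
probs = np.exp(scores) / np.exp(scores).sum()           # softmax over the dictionary
print(probs)        # probability of each dictionary entry being the centre node
```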
After training, two sets of word vectors $v$ and $u$ are obtained, in which every word of the dictionary appears both as a central word and as a background word; the central-word vector is used as the representation vector of the word. The principle of the Adam optimization training algorithm is as follows. Let $f(x; \theta)$ denote the graph convolutional network or BLSTM adopted by the invention, where $\theta$ are the network parameters, and let $K$ training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_K, y_K)\}$ be selected at each step to train the network parameters. The partial derivative of the loss function with respect to $\theta$ at the $t$-th iteration is

$$g_t = \frac{1}{K} \sum_{k=1}^{K} \frac{\partial L\bigl(y_k, f(x_k; \theta_{t-1})\bigr)}{\partial \theta} \quad (4)$$

where $L(\cdot)$ is a differentiable loss function and $K$ is the batch size. During training the parameters are updated according to

$$\theta_t = \theta_{t-1} - \Delta\theta_t \quad (5)$$

$\Delta\theta_t$ is computed as follows. Let

$$M_t = \beta_1 M_{t-1} + (1-\beta_1)\, g_t \quad (6)$$

$$G_t = \beta_2 G_{t-1} + (1-\beta_2)\, g_t \odot g_t \quad (7)$$

where $\beta_1$ and $\beta_2$ are the decay rates of the two moving averages, taken as $\beta_1 = 0.9$ and $\beta_2 = 0.99$, with $M_0 = 0$, $G_0 = 0$, and $g_t \odot g_t$ denotes the element-wise square of the parameter gradients. Let

$$\hat{M}_t = \frac{M_t}{1-\beta_1^{\,t}} \quad (8)$$

$$\hat{G}_t = \frac{G_t}{1-\beta_2^{\,t}} \quad (9)$$

The parameter update value of the Adam algorithm is

$$\Delta\theta_t = \frac{\alpha_0}{\sqrt{\hat{G}_t} + \epsilon}\, \hat{M}_t \quad (10)$$

where the learning rate is $\alpha_0 = 0.001$ and $\epsilon$ is a small constant added to keep the value numerically stable.
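The update rules of Eqs. (5)–(10) can be illustrated with a short from-scratch implementation; the constants match those given in the text (β1 = 0.9, β2 = 0.99, α0 = 0.001), while the toy quadratic loss is an arbitrary assumption used only to show the iteration.

```python
import numpy as np

beta1, beta2, alpha0, eps = 0.9, 0.99, 0.001, 1e-8
theta = np.array([1.0, -2.0])
M = np.zeros_like(theta)                   # first moving average, Eq. (6)
G = np.zeros_like(theta)                   # second moving average, Eq. (7)

for t in range(1, 101):
    g = 2 * theta                          # gradient of the toy loss L = ||theta||^2
    M = beta1 * M + (1 - beta1) * g
    G = beta2 * G + (1 - beta2) * g * g
    M_hat = M / (1 - beta1 ** t)           # bias correction, Eq. (8)
    G_hat = G / (1 - beta2 ** t)           # bias correction, Eq. (9)
    theta = theta - alpha0 * M_hat / (np.sqrt(G_hat) + eps)   # Eqs. (5) and (10)

print(theta)   # the parameters move towards the minimiser at the origin
```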
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the present disclosure should be covered within the scope of the present invention claimed in the appended claims.

Claims (3)

1. A software security vulnerability detection method based on grammatical features and semantic features is characterized by comprising the following steps:
step 1, determining the granularity of a detection object:
the granularity of the detection object is a function, a file, a component or any code segment with association relations, and is determined according to the requirements of the actual detection project, the language of the detection project being C/C++, Java or PHP;
step 2, establishing a software historical vulnerability library:
searching a public software vulnerability library for software security vulnerabilities in the same programming language as the detected software project, and establishing a vulnerability sample library for that language class, wherein the sample size is the detection granularity and the vulnerability sample library indicates the vulnerability situation of the samples of that granularity, namely whether each sample contains vulnerabilities and their types and number;
using the JTS data set of the public vulnerability database SARD, wherein the version of the JTS is JTS-1.3, comprising 246852 functions in total, of which 105244 functions contain vulnerabilities, accounting for 42% of the total sample, and the vulnerability label of each function is obtained by directly parsing the file name and the function name, the adopted vulnerability labels 1 and 0 representing the presence and absence of a vulnerability;
step 3, establishing an abstract syntax tree of the detection object:
on the basis of the step 1, analyzing the detection object based on an LLVM compiler, and establishing an abstract syntax tree of the detection object based on a third party interface Clang Lib provided by the compiler;
step 4, embedding the abstract syntax tree:
on the basis of the step 3, traversing the obtained abstract syntax tree of the detection object according to a depth-first search algorithm to generate a node sequence of the syntax-tree nodes, generating the abstract syntax tree of each sample for all samples in the step 2 and generating the node sequence of each abstract syntax tree, and embedding each node by using a word2vec embedding algorithm based on the node sequences of the abstract syntax trees corresponding to all samples in the vulnerability library in the step 2 to obtain the vector representation of the node;
step 5, compiling the software source code of the detection object:
on the basis of the step 1, compiling the detection object based on an LLVM compiler to obtain intermediate representation of the code;
step 6, establishing a program dependence graph of the detection object:
on the basis of the step 5, establishing a program dependency graph of the detection object through a Pass framework provided by the LLVM based on the intermediate representation IR of the code;
step 7, embedding the program dependency graph, and traversing the PDG for the PDG of the obtained detection object on the basis of step 6 to generate a node sequence for the PDG, specifically including the following steps:
step 7.1, setting the node set of the PDG graph as V, traversing the set V, outputting all nodes whose in-degree is 0, denoting this set of nodes as V1, and setting V = V − V1;
step 7.2, for the successor nodes in V of all the nodes in V1, subtracting 1 from their in-degree;
step 7.3, repeating the step 7.1 and the step 7.2 until the number of the nodes in V is 0, and ending;
step 7.4, for all samples in the step 2, generating the PDG of each sample by the same method and generating the node sequence of each PDG, and embedding each node by using the word2vec embedding algorithm based on the node sequences of the PDGs corresponding to all samples in the vulnerability library in the step 2 to obtain the vector representation of the node;
step 8, learning the AST characteristics by using a graph convolution neural network;
step 9, learning PDG characteristics by using the bidirectional LSTM:
on the basis of the step 7, selecting a bidirectional long short-term memory network (BLSTM) to establish a deep learning model for the PDG semantic features, wherein the BLSTM has good learning capability for time-series input and the PDG node sequence is a time-series structure, for which reason the BLSTM model is selected;
the BLSTM used comprises four layers: an input layer, a bidirectional LSTM processing unit layer, an Attention layer and a fully connected layer;
step 10, establishing a fusion model for the grammatical features and the semantic features, selecting a two-layer fully connected neural network and a Softmax classifier to establish the fusion model on the basis of the steps 8 and 9, taking the outputs of the step 8 and the step 9 as the input of the fully connected layers, taking the output of the fully connected layers as the input of the Softmax, and taking the output of the Softmax as the probability of a vulnerability;
step 11, training and testing a detection model;
and step 12, applying the detection model to a new software module: applying the detection model obtained in the step 11 to the new software module, firstly establishing the AST and the PDG of the new software module, converting them into vector representations, taking the vectors as the input of the model, and obtaining the output of the model, namely the probability that the new software module contains a vulnerability.
2. The method for detecting the software security vulnerability based on the syntactic and semantic characteristics according to claim 1, wherein the step 8 comprises the steps of:
8.1, on the basis of the step 4, selecting a graph convolution neural network to establish a deep learning model aiming at AST grammatical features;
and 8.2, taking the vector representations of the nodes obtained in the step 4 as the input of the graph convolutional neural network, and directly learning the features of the tree structure of the AST based on the graph convolutional neural network, wherein the graph convolutional neural network comprises four layers: an input layer, a convolution layer, a pooling layer and a fully connected layer.
3. The method for detecting software security vulnerabilities based on syntactic and semantic features according to claim 1, wherein the step 11 comprises the steps of:
step 11.1, on the basis of the vulnerability database established in the step 2, converting the detection object in the step 2 into AST based on the step 3, and obtaining vector representation of the AST based on the step 4; converting the detection object in the step 2 into PDG based on the step 6;
step 11.2, obtaining the PDG vector representation based on the step 7, respectively taking the AST vector representation and the PDG vector representation as the input of the graph convolutional neural network and the input of the BLSTM, taking the label of the detection object as the output of the classifier, and training and testing the whole model by adopting the Adam optimization training algorithm.
CN202011488425.3A 2020-12-16 2020-12-16 Software security vulnerability detection method based on grammatical features and semantic features Active CN112541180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011488425.3A CN112541180B (en) 2020-12-16 2020-12-16 Software security vulnerability detection method based on grammatical features and semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011488425.3A CN112541180B (en) 2020-12-16 2020-12-16 Software security vulnerability detection method based on grammatical features and semantic features

Publications (2)

Publication Number Publication Date
CN112541180A CN112541180A (en) 2021-03-23
CN112541180B true CN112541180B (en) 2022-09-13

Family

ID=75018216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011488425.3A Active CN112541180B (en) 2020-12-16 2020-12-16 Software security vulnerability detection method based on grammatical features and semantic features

Country Status (1)

Country Link
CN (1) CN112541180B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158194B (en) * 2021-03-30 2023-04-07 西北大学 Vulnerability model construction method and detection method based on multi-relation graph network
CN113076235B (en) * 2021-04-09 2022-10-18 中山大学 Time sequence abnormity detection method based on state fusion
CN113220301A (en) * 2021-04-13 2021-08-06 广东工业大学 Clone consistency change prediction method and system based on hierarchical neural network
CN113297580B (en) * 2021-05-18 2024-03-22 广东电网有限责任公司 Code semantic analysis-based electric power information system safety protection method and device
CN113722218B (en) * 2021-08-23 2022-06-03 南京审计大学 Software defect prediction model construction method based on compiler intermediate representation
CN114816997B (en) * 2022-03-29 2023-08-18 湖北大学 Defect prediction method based on graph neural network and bidirectional GRU feature extraction
CN114816517B (en) * 2022-05-06 2024-07-16 哈尔滨工业大学 Hierarchical semantic perception code representation learning method
CN115130110B (en) * 2022-07-08 2024-03-19 国网浙江省电力有限公司电力科学研究院 Vulnerability discovery method, device, equipment and medium based on parallel integrated learning
CN115577361B (en) * 2022-12-09 2023-04-07 四川大学 Improved PHP Web shell detection method based on graph neural network
CN115795487B (en) * 2023-02-07 2023-05-12 深圳开源互联网安全技术有限公司 Vulnerability detection method, device, equipment and storage medium
CN116628707A (en) * 2023-07-19 2023-08-22 山东省计算中心(国家超级计算济南中心) Interpretable multitasking-based source code vulnerability detection method
CN117725422B (en) * 2024-02-07 2024-05-07 北京邮电大学 Program code vulnerability detection model training method and detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214191A (en) * 2018-09-18 2019-01-15 北京理工大学 A method of utilizing deep learning forecasting software security breaches
CN110222512A (en) * 2019-05-21 2019-09-10 华中科技大学 A kind of software vulnerability intelligent measurement based on intermediate language and localization method and system
CN110245496A (en) * 2019-05-27 2019-09-17 华中科技大学 A kind of source code leak detection method and detector and its training method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214191A (en) * 2018-09-18 2019-01-15 北京理工大学 A method of utilizing deep learning forecasting software security breaches
CN110222512A (en) * 2019-05-21 2019-09-10 华中科技大学 A kind of software vulnerability intelligent measurement based on intermediate language and localization method and system
CN110245496A (en) * 2019-05-27 2019-09-17 华中科技大学 A kind of source code leak detection method and detector and its training method and system

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
A deep tree-based model for software defect prediction; Hoa Khanh Dam et al.; arXiv:1802.00921; 2018-02-03; pp. 1–10 *
Automated software vulnerability detection with machine learning; Jacob A. Harer et al.; arXiv:1803.04497; 2018-08-02; pp. 1–8 *
Automatic feature learning for vulnerability prediction; Hoa Khanh Dam et al.; arXiv:1708.02368; 2017-08-08; pp. 1–12 *
POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects; Guanjun Lin et al.; CCS '17: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security; 2017-10-31; pp. 2539–2541 *
VulDeePecker: A Deep Learning-Based System for Vulnerability Detection; Zhen Li et al.; arXiv:1801.01681v1; 2018-01-05; pp. 1–15 *
Vulnerability detection based on a BiLSTM model (基于BiLSTM模型的漏洞检测); 龚扣林 et al.; Computer Science (计算机科学); 2020-05-31; No. 05; pp. 295–300 *
Intelligent vulnerability detection *** based on abstract syntax trees (基于抽象语法树的智能化漏洞检测***); 陈肇炫 et al.; Journal of Cyber Security (信息安全学报); 2020-07-15; Vol. 5, No. 04; pp. 1–13 *

Also Published As

Publication number Publication date
CN112541180A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN112541180B (en) Software security vulnerability detection method based on grammatical features and semantic features
CN111639344B (en) Vulnerability detection method and device based on neural network
CN109977205B (en) Method for computer to independently learn source code
CN112215013B (en) Clone code semantic detection method based on deep learning
CN113672931B (en) Software vulnerability automatic detection method and device based on pre-training
CN113312464B (en) Event extraction method based on conversation state tracking technology
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN112394973B (en) Multi-language code plagiarism detection method based on pseudo-twin network
CN114064487A (en) Code defect detection method
CN117215935A (en) Software defect prediction method based on multidimensional code joint graph representation
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN114547318A (en) Fault information acquisition method, device, equipment and computer storage medium
CN115757695A (en) Log language model training method and system
CN115935369A (en) Method for evaluating source code using numeric array representation of source code elements
CN112579777B (en) Semi-supervised classification method for unlabeled text
CN116702160B (en) Source code vulnerability detection method based on data dependency enhancement program slice
CN117725592A (en) Intelligent contract vulnerability detection method based on directed graph annotation network
CN115794119B (en) Case automatic analysis method and device
CN117056226A (en) Cross-project software defect number prediction method based on transfer learning
CN115935367A (en) Static source code vulnerability detection and positioning method based on graph neural network
CN114780403A (en) Software defect prediction method and device based on enhanced code attribute graph
Qu et al. Software Defect Detection Method Based on Graph Structure and Deep Neural Network
CN113076089A (en) API completion method based on object type
Mohan Automatic repair and type binding of undeclared variables using neural networks
Yang et al. LineFlowDP: A Deep Learning-Based Two-Phase Approach for Line-Level Defect Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant