CN112541180B - Software security vulnerability detection method based on grammatical features and semantic features - Google Patents


Info

Publication number
CN112541180B
Authority
CN
China
Prior art keywords
pdg
detection
vulnerability
detection object
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011488425.3A
Other languages
Chinese (zh)
Other versions
CN112541180A (en)
Inventor
危胜军
胡昌振
钟浩
陶莎
赵敬宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Peng Cheng Laboratory
Original Assignee
Beijing Institute of Technology BIT
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT, Peng Cheng Laboratory filed Critical Beijing Institute of Technology BIT
Priority to CN202011488425.3A priority Critical patent/CN112541180B/en
Publication of CN112541180A publication Critical patent/CN112541180A/en
Application granted granted Critical
Publication of CN112541180B publication Critical patent/CN112541180B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a software security vulnerability detection method based on grammatical features and semantic features. The method comprises the following steps: step 1, determining the granularity of the detection object; step 2, establishing a software historical vulnerability library; step 3, establishing an abstract syntax tree (AST) of the detection object; step 4, embedding the abstract syntax tree; step 5, compiling the software source code of the detection object; step 6, establishing a program dependence graph (PDG) of the detection object; step 7, embedding the program dependence graph; step 8, learning the AST features with a graph convolutional neural network; step 9, learning the PDG features with a bidirectional LSTM. The invention has the following advantages: the precision, accuracy and recall of the detection model are improved; the AST tree structure is learned directly by a graph neural network, so no structural information is lost, and this direct feature-extraction approach based on the graph neural network greatly improves the detection performance of the model.

Description

Software security vulnerability detection method based on grammatical features and semantic features
Technical Field
The invention belongs to the technical field of software security, and particularly relates to a software security vulnerability detection method based on grammatical features and semantic features.
Background
At present, with the large-scale disclosure of software source code and its vulnerability data, the relevant data can be obtained in quantity at low cost, and data-driven methods can be used for vulnerability detection. The idea is to use the feature-learning capability of deep learning to automatically extract the vulnerability features of a source-code module and build a vulnerability detection model. The whole process is divided into two stages: a model-building stage and a model-application stage. In the model-building stage, the first step is to determine the granularity of the analysis object, i.e. the size of the software source-code module; a source-code module is a segment of related code, can be defined freely, and may be a file, a function, a component or a code segment of any size. The second step is to preprocess the analysis object and derive an intermediate representation of the code suitable for analysis, such as a token sequence, an AST or a CFG. The third step is to quantize the intermediate representation numerically, usually by vector-space embedding. The fourth step is to select a suitable deep learning algorithm; the quantized vectors of the third step are the input of the deep learning algorithm, and its output is the learned features. The fifth step is to take the learned features as the input of a classifier, take the label of the corresponding code module (whether a vulnerability exists, or the concrete vulnerability class) as the output of the classifier, and train the classifier. The trained model can then be used in the model-application stage: a new software source-code module is preprocessed and converted into an intermediate representation, the same quantization and embedding are applied, the resulting vector is fed to the deep learning model to extract features, those features are fed to the classifier, and the classifier outputs the probability of each vulnerability class.
The analysis granularity of prior art 1 (VulDeePecker: A Deep Learning-Based System for Vulnerability Detection, 2018) is the code gadget, a set of semantically related code statements generated by analysing the data flow of a program. Each code gadget is turned into a token sequence by lexical analysis, each token is converted into a vector with word2vec to obtain a vector representation of the gadget, and a bidirectional LSTM (BLSTM) is then used for feature learning to build the vulnerability detection model. Prior art 2 (SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities) extends the work of prior art 1 and builds a vulnerability detection model based on a bidirectional gated recurrent unit (BGRU). Addressing the shortcomings of prior art 1, prior art 3 (μVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection) designs a multiclass vulnerability classifier based on the same code-gadget concept, so that the type of a vulnerability can be indicated accurately. Prior art 4 (VulDeeLocator: A Deep Learning-based Fine-grained Vulnerability Detector) extends prior art 2, obtains semantically related slices of LLVM intermediate code, and builds the model with a BRNN. Prior art 5 (a deep learning model for vulnerability detection on web application variants) builds a vulnerability detection model for PHP slices: each PHP slice is first converted into an opcode-based intermediate representation, the opcodes are tokenized and vectorized with word2vec, and a five-layer LSTM network is used to build the classification model; a slice is a code segment whose statements have some artificially defined association, for example data-dependence and control-dependence relations between statements. Prior art 6 (automatic discovery for vulnerability prediction) builds a vulnerability classification model for Java files: each method in a Java file is tokenized, each token is embedded with an LSTM, the syntactic features of each method are obtained, and the syntactic features of each file are obtained after pooling; all token vectors are clustered into categories, and the count of tokens of each file falling into each category is used as the semantic feature of the file; the syntactic and semantic features are then used as classifier input to build the classification model. Prior art 7 (Project Achilles: A Prototype Tool for Static Method-Level Vulnerability Detection of Java Source Code Using a Recurrent Neural Network) adopts a method similar to prior art 6, applies an LSTM model to Java programs at method rather than file granularity, and tests detection of several specific vulnerability types.
Prior art 8 (Automated Vulnerability Detection in Source Code Using Deep Representation Learning) analyses C/C++ functions: each function is tokenized, features are extracted with a method similar to sentence sentiment classification, and vulnerability classification is performed with convolutional neural networks (CNN) and recurrent neural networks (RNN). Prior art 9 (Automated software vulnerability detection with machine learning) also analyses C/C++ functions and compares the performance of several different input features, examining a bag-of-words model and a word-vector method; the bag-of-words model uses an extremely randomized trees classifier and the word-vector method uses a TextCNN model. Prior art 10 (A deep tree-based model for software defect prediction) first builds the abstract syntax tree (AST) of each Java source file and then builds a vulnerability detection model on the AST with a tree-structured LSTM network. Prior art 11 (Cross-Project Transfer Representation Learning for Vulnerable Function Discovery) and prior art 12 (POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects) build the AST of each function of a C/C++ source program, obtain the AST node sequence by depth-first search, embed the node sequence with word2vec, and then build the classification model with a five-layer LSTM network. Prior art 13 (Automatic feature learning for vulnerability prediction) builds an AST for each Java file, extracts the AST nodes, forms a library of all distinct node types, assigns each node type an integer, and converts the nodes into integers as the input of a DBN to build the classification model. Prior art 14 (Software Defect Prediction via Convolutional Neural Network), building on prior art 13, embeds the AST nodes with a CNN to obtain their vector representations, learns the features of the whole AST node sequence, and uses these features as the input of a logistic regression classifier to build the classification model. Prior art 15 (Static Detection of Control-Flow-Related Vulnerabilities Using Graph Embedding) uses a graph convolutional network to build a detection model for control-flow-related vulnerabilities: the control flow graph of a method is built first, its nodes are embedded with Doc2Vec, and the graph convolutional network then learns the whole control flow graph to obtain a feature representation for building the detection model. Prior art 16 (Convolutional Neural Networks over Control Flow Graphs for Software Defect Prediction) first compiles each source file to assembly instructions, builds a CFG over the assembly instructions, and uses a directed-graph-based convolutional neural network to build a vulnerability detection model on the CFG.
In addition, another prior art proposes building a vulnerability detection model using four kinds of features simultaneously: the AST, the CFG, the PDG (program dependence graph) and conventional code metrics; after the four kinds of features are embedded and quantized, the quantized vectors are directly concatenated to form an overall feature. The principle of the word2vec embedding algorithm is as follows: the purpose of word2vec embedding is to convert each symbol of a sequence into a numerical vector so that symbols with similar meanings map to vectors that are close to each other.
The shortcomings of the current prior art are summarized as follows:
(1) The prior art uses AST-based feature extraction whose idea is to convert the AST into a node sequence with some search algorithm and then extract features from that sequence. The disadvantage is as follows: an AST is a tree structure that reflects node types and the associations between nodes; in the process of conversion into a node sequence, existing search algorithms (e.g. depth-first search) cannot preserve the adjacency and ordering relations between nodes of the same level, i.e. the nodes of one level may be converted into a sequence in several different ways, so the resulting sequence cannot retain the structural information of the original tree, and the loss of original information is large.
(2) The prior art uses CFG-based feature extraction whose idea is to extract the semantic features of the CFG with a graph neural network. The disadvantage is that the CFG contains only the control-flow information of the program, lacks data-flow information, and its semantic expression is therefore incomplete.
(3) The prior art uses PDG-based feature extraction in which the adjacency matrix of the graph's nodes represents the PDG features (i.e. the semantic features). The adjacency matrix can only express whether a relation between nodes exists (0 or 1) and cannot express the degree of association; moreover, the PDG actually describes the execution order of code statements in a program, which is a temporal sequence.
(4) The prior art builds slices of the analysis object according to the grammatical features of known vulnerabilities and then analyses the semantics of the slices; the grammatical features of the analysis object itself are not exploited in this process, and the semantic features used in the prior art differ conceptually from those based on the control flow graph.
(5) The prior art also proposes detection methods based on grammatical and semantic features, but they differ clearly from the invention. Regarding grammatical features: grammatical features essentially describe the association relations among the components that make up the source code, whereas the grammatical features adopted in the prior art are the average of the state vectors of the tokens of a source file, which obviously cannot describe those association relations; the grammatical features of the invention describe them accurately. Regarding semantic features: the semantic features in the prior art describe the distances between the tokens of a source file and are therefore only a static description.
(6) The existing detection methods based on the grammatical and semantic features of source code all have shortcomings to some degree. Bag-of-words models over n-gram sequences and word-vector models over token sequences can hardly describe the grammar and semantics of code accurately. In AST-based detection methods, the AST describes the grammar of the code well, but it cannot represent the execution semantics of a program, so many vulnerabilities related to execution semantics cannot be detected. In detection methods based on the control flow graph, the CFG represents the execution process of the program well, but it does not contain variable declarations and lacks part of the semantics, which strongly affects the detection and localization of vulnerabilities.
Disclosure of Invention
The invention aims to provide a software security vulnerability detection method based on grammatical features and semantic features, which can overcome the technical problems. The method comprises the following steps:
step 1, determining the granularity of a detection object:
The granularity of the detection object is a function, a file, a component or any code segment with association relations; it is determined according to the requirements of the actual detection project, and the language of the detection project is C/C++, Java or PHP.
Step 2, establishing a software historical vulnerability library:
Software security vulnerabilities in the same programming language as the detected software project are retrieved from a public software vulnerability library, and a vulnerability sample library is established for that language class; the sample size is the detection granularity, and the vulnerability sample library records the vulnerability situation of each sample of that granularity, namely whether each sample contains vulnerabilities and their types and number;
the Juliet Test Suite for C/C++ (JTS) data set of the public vulnerability database SARD (Software Assurance Reference Dataset) is used; the version is JTS-1.3, which comprises 246852 functions in total, of which 105244 contain vulnerabilities, accounting for 42% of the whole sample; the label of each function (vulnerable or not) is obtained directly by parsing the file name and the function name, the vulnerability labels 1 and 0 denoting the presence and absence of a vulnerability.
Step 3, establishing an abstract syntax tree of the detection object:
on the basis of the step 1, the detection object is analyzed based on an LLVM compiler, and an Abstract Syntax Tree (AST) of the detection object is established based on a third party interface Clang Lib provided by the compiler.
Step 4, embedding the abstract syntax tree:
On the basis of the step 3, the obtained abstract syntax tree of the detection object is traversed according to a depth-first search (DFS) algorithm to generate a node sequence of the syntax-tree nodes; for all samples in the step 2, the abstract syntax tree of each sample is generated and the node sequence of each abstract syntax tree is produced; based on the node sequences of the abstract syntax trees corresponding to all samples in the vulnerability library in the step 2, each node is embedded with the word2vec embedding algorithm to obtain the vector representation of the node. An example of an abstract syntax tree and its node sequence is shown below:
Node sequence: {MethodDeclaration, int, func1, Parameter, int, var1, BlockStmt, ExpressionStmt, VariableDeclaration, int, Assign, var2, EnclosedExpr, BinaryExpr:divide, var1, 42, ReturnStmt}.
Step 5, compiling the software source code of the detection object:
on the basis of the step 1, compiling the detection object based on the LLVM compiler, and acquiring an Intermediate Representation (IR) of the code.
Step 6, establishing a program dependence graph of the detection object:
on the basis of step 5, a Program Dependency Graph (PDG) of the detected object is established through a Pass framework provided by the LLVM based on the intermediate representation IR of the code.
Step 7, embedding the program dependence graph: on the basis of the step 6, the PDG of the obtained detection object is traversed according to the following search algorithm to generate a node sequence of the PDG; let the node set of the PDG graph be V. The specific steps are as follows:
Step 7.1, traverse the set V, output all nodes whose in-degree is 0, denote this set of nodes as V1, and set V = V − V1;
Step 7.2, for the successor nodes in V of all the nodes in V1, subtract 1 from their in-degree;
Step 7.3, repeat the step 7.1 and the step 7.2 until the number of the nodes in V is 0, then end;
Step 7.4, for all samples in the step 2, generate the PDG of each sample by the same method and generate the node sequence of each PDG; based on the node sequences of the PDGs corresponding to all samples in the vulnerability library in the step 2, embed each node with the word2vec embedding algorithm to obtain the vector representation of the node;
examples of PDG and its node sequence are as follows:
the obtained node sequence is as follows: { BB1, BB2, BB3, BB5, BB4 }.
Step 8, learning the AST features with a graph convolutional neural network:
Step 8.1, on the basis of the step 4, a graph convolutional neural network is selected to establish a deep learning model for the AST grammatical features;
Step 8.2, the vector representations of the nodes obtained in the step 4 are used as the input of the graph convolutional neural network, and the features of the tree structure of the AST are learned directly by the graph convolutional neural network, which comprises four layers: an input layer, a convolution layer, a pooling layer and a fully connected layer.
Step 9, learning PDG characteristics by using the bidirectional LSTM:
On the basis of the step 7, a bidirectional long short-term memory network (BLSTM) is selected to establish a deep learning model for the PDG semantic features; the BLSTM has good learning capability for time-series input, and since the PDG node sequence is a time-series structure, the BLSTM model is selected;
the BLSTM used comprises four layers: an input layer, a bidirectional LSTM processing unit layer, an Attention layer and a fully connected layer.
Step 10, establishing a fusion model for the grammatical features and the semantic features: on the basis of the steps 8 and 9, a two-layer fully connected neural network and a Softmax classifier are selected to establish the fusion model; the outputs of the step 8 and the step 9 are taken as the input of the fully connected layers, the output of the fully connected layers is taken as the input of the Softmax, and the output of the Softmax is the probability of a vulnerability.
Step 11, training and testing the detection model:
Step 11.1, on the basis of the vulnerability library established in the step 2, the detection object of the step 2 is converted into an AST based on the step 3, and the vector representation of the AST is obtained based on the step 4; the detection object of the step 2 is converted into a PDG based on the step 6;
Step 11.2, the vector representation of the PDG is obtained based on the step 7; the AST vector representation and the PDG vector representation are used as the input of the graph convolutional neural network and of the BLSTM respectively, the label of the detection object is used as the output of the classifier, and the whole model is trained and tested with the Adam optimization training algorithm.
Step 12, applying the detection model to a new software module: the detection model obtained in the step 11 is applied to the new software module; the AST and PDG of the new software module are first established and converted into vector representations, the vectors are used as the input of the model, and the output of the model is the probability that the new software module contains a vulnerability.
The method has the following advantages:
1. The method adopts the AST to represent the grammar of the software and the PDG (program dependence graph) to represent its semantics, extracts the two kinds of features with two dedicated deep neural networks, and fuses the grammatical and semantic features with a further deep neural network, thereby improving the precision, accuracy and recall of the detection model;
2. The method adopts a graph neural network to learn the AST tree structure directly, without converting the AST into a sequence, so no information is lost, and this direct feature-extraction approach based on the graph neural network greatly improves the detection performance of the model;
3. The method extracts semantic features from the PDG, which is an extension of the CFG containing both control-dependence and data-dependence information, so its semantic expression is more complete than that of the CFG;
4. The method extracts the PDG semantic features with a BLSTM, which is effective at extracting time-series features, improving the accuracy and completeness of semantic feature extraction;
5. The semantic features adopted by the method are the execution semantics of the program, i.e. dynamic features, and vulnerabilities can be detected more accurately on the basis of the execution semantics of the program;
6. The method provides a syntactic feature extraction method based on the AST plus a graph convolutional neural network, a semantic feature extraction method based on the PDG plus a BLSTM, and a neural-network-based fusion of the syntactic and semantic features, so that both kinds of features can be extracted comprehensively and accurately at the same time and fused, reducing the missed-alarm rate.
Drawings
FIG. 1 is a schematic diagram of a method of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is a schematic diagram of the training of a model of the method of the present invention;
FIG. 4 is a schematic diagram of the graph convolutional neural network of the method of the present invention;
FIG. 5 is a schematic diagram of a BLSTM of the process of the present invention;
FIG. 6 is a diagram of an abstract syntax tree and its node sequence for the method of the present invention;
fig. 7 is a schematic diagram of PDG and its node sequence according to the method of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. As shown in fig. 1, the method of the present invention comprises the following steps:
step 1, determining the granularity of a detection object:
The granularity of the detection object is a function, a file, a component or any code segment with association relations; it is determined according to the requirements of the actual detection project, and the language of the detection project is C/C++, Java or PHP.
Step 2, establishing a software historical vulnerability library:
Software security vulnerabilities in the same programming language as the detected software project are retrieved from a public software vulnerability library, and a vulnerability sample library is established for that language class; the sample size is the detection granularity, and the vulnerability sample library records the vulnerability situation of each sample of that granularity, namely whether each sample contains vulnerabilities and their types and number;
the Juliet Test Suite for C/C++ (JTS) data set of the public vulnerability database SARD (Software Assurance Reference Dataset) is used; the version is JTS-1.3, which comprises 246852 functions in total, of which 105244 contain vulnerabilities, accounting for 42% of the whole sample; the label of each function (vulnerable or not) is obtained directly by parsing the file name and the function name, the vulnerability labels 1 and 0 denoting the presence and absence of a vulnerability.
Step 3, establishing an abstract syntax tree of the detection object:
on the basis of the step 1, the detection object is analyzed based on an LLVM compiler, and an Abstract Syntax Tree (AST) of the detection object is established based on a third party interface Clang Lib provided by the compiler.
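As a concrete illustration of this step, the sketch below uses the libclang Python bindings (clang.cindex), which wrap the Clang library interface mentioned above; the source file name and the choice of using the cursor kind plus spelling as the node token are illustrative assumptions, not details fixed by the patent.

```python
import clang.cindex

def ast_node_sequence(source_file):
    """Parse a C/C++ file with libclang and return its AST nodes in depth-first order."""
    index = clang.cindex.Index.create()
    tu = index.parse(source_file, args=["-std=c11"])
    sequence = []

    def dfs(cursor):
        # Record the node kind and, when present, the identifier it refers to.
        token = cursor.kind.name + (":" + cursor.spelling if cursor.spelling else "")
        sequence.append(token)
        for child in cursor.get_children():
            dfs(child)

    dfs(tu.cursor)
    return sequence

# Example: nodes = ast_node_sequence("sample.c")
```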
Step 4, embedding the abstract syntax tree:
On the basis of the step 3, the obtained abstract syntax tree of the detection object is traversed according to a depth-first search (DFS) algorithm to generate a node sequence of the syntax-tree nodes; for all samples in the step 2, the abstract syntax tree of each sample is generated and the node sequence of each abstract syntax tree is produced; based on the node sequences of the abstract syntax trees corresponding to all samples in the vulnerability library in the step 2, each node is embedded with the word2vec embedding algorithm to obtain the vector representation of the node;
an example of an abstract syntax tree and its node sequence is shown in fig. 6:
Node sequence: {MethodDeclaration, int, func1, Parameter, int, var1, BlockStmt, ExpressionStmt, VariableDeclaration, int, Assign, var2, EnclosedExpr, BinaryExpr:divide, var1, 42, ReturnStmt}.
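The embedding itself can be sketched with gensim's Word2Vec; the vector size and the toy corpus below (a single shortened node sequence) are illustrative assumptions, while the window of 2 and the CBOW variant follow the embedding model described later in the specification.

```python
from gensim.models import Word2Vec

# One node sequence per sample; here a single shortened sequence serves as a toy corpus.
node_sequences = [["MethodDeclaration", "int", "func1", "Parameter", "int", "var1",
                   "BlockStmt", "ReturnStmt"]]

model = Word2Vec(sentences=node_sequences, vector_size=64, window=2,
                 min_count=1, sg=0)            # sg=0 selects the CBOW training mode
vector = model.wv["MethodDeclaration"]         # embedding vector of one AST node
```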
Step 5, compiling the software source code of the detection object:
on the basis of the step 1, compiling the detection object based on the LLVM compiler, and acquiring an Intermediate Representation (IR) of the code.
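For illustration, the compilation step can be performed by invoking Clang with its standard IR-emitting options; the file names are placeholders, and the PDG construction of step 6 is assumed to be an LLVM Pass that consumes the generated .ll file.

```python
import subprocess

# Compile one detection object to textual LLVM IR (sample.ll) for the step-6 PDG pass.
subprocess.run(["clang", "-S", "-emit-llvm", "sample.c", "-o", "sample.ll"], check=True)
```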
Step 6, establishing a program dependence graph of the detection object:
on the basis of step 5, a Program Dependency Graph (PDG) of the detected object is established through a Pass framework provided by the LLVM based on the intermediate representation IR of the code.
Step 7, embedding the program dependence graph: on the basis of the step 6, the PDG of the obtained detection object is traversed according to the following search algorithm to generate a node sequence of the PDG; let the node set of the PDG graph be V. The specific steps are as follows (a minimal sketch of this traversal is given after the example below):
Step 7.1, traverse the set V, output all nodes whose in-degree is 0, denote this set of nodes as V1, and set V = V − V1;
Step 7.2, for the successor nodes in V of all the nodes in V1, subtract 1 from their in-degree;
Step 7.3, repeat the step 7.1 and the step 7.2 until the number of the nodes in V is 0, then end;
Step 7.4, for all samples in the step 2, generate the PDG of each sample by the same method and generate the node sequence of each PDG; based on the node sequences of the PDGs corresponding to all samples in the vulnerability library in the step 2, embed each node with the word2vec embedding algorithm to obtain the vector representation of the node;
an example of PDG and its node sequence is shown in fig. 7:
the resulting node sequence: { BB1, BB2, BB3, BB5, BB4 }.
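A minimal sketch of the step-7 traversal follows. It implements steps 7.1–7.3 as a layered topological ordering and assumes the dependence edges form an acyclic graph; the PDG is represented as a dictionary mapping each node to its successor list, and the edge set in the example is an illustrative assumption chosen so that the result reproduces the node sequence of FIG. 7.

```python
def pdg_node_sequence(pdg):
    """Order the PDG nodes so that every node precedes its successors (steps 7.1-7.3)."""
    in_degree = {v: 0 for v in pdg}
    for successors in pdg.values():
        for s in successors:
            in_degree[s] += 1

    remaining = set(pdg)
    order = []
    while remaining:                                       # step 7.3: stop when V is empty
        layer = [v for v in pdg if v in remaining and in_degree[v] == 0]   # step 7.1
        order.extend(layer)
        remaining -= set(layer)
        for v in layer:                                    # step 7.2: decrement successors
            for s in pdg[v]:
                in_degree[s] -= 1
    return order

# Hypothetical edge set reproducing the node sequence of FIG. 7.
example_pdg = {"BB1": ["BB2", "BB3"], "BB2": ["BB4"],
               "BB3": ["BB5"], "BB4": [], "BB5": ["BB4"]}
print(pdg_node_sequence(example_pdg))   # ['BB1', 'BB2', 'BB3', 'BB5', 'BB4']
```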
Step 8, learning the AST features with a graph convolutional neural network:
Step 8.1, on the basis of the step 4, a graph convolutional neural network is selected to establish a deep learning model for the AST grammatical features;
Step 8.2, the vector representations of the nodes obtained in the step 4 are used as the input of the graph convolutional neural network, and the features of the tree structure of the AST are learned directly by the graph convolutional neural network, which comprises four layers: an input layer, a convolution layer, a pooling layer and a fully connected layer, as shown in FIG. 4.
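A minimal sketch of such a network using PyTorch Geometric is given below; the layer widths, the mean-pooling choice and the output dimension are illustrative assumptions, since the patent only specifies the four layer types.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class AstGCN(torch.nn.Module):
    """Graph convolutional network over the AST (step 8): input, convolution, pooling, FC."""
    def __init__(self, in_dim=64, hidden=128, out_dim=64):
        super().__init__()
        self.conv = GCNConv(in_dim, hidden)          # convolution layer over AST edges
        self.fc = torch.nn.Linear(hidden, out_dim)   # fully connected layer

    def forward(self, x, edge_index, batch):
        # x: node embeddings from step 4; edge_index: parent-child edges of the AST;
        # batch: graph-membership vector used to pool one feature vector per AST.
        h = F.relu(self.conv(x, edge_index))
        h = global_mean_pool(h, batch)               # pooling layer
        return self.fc(h)                            # syntactic feature vector
```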
Step 9, learning PDG characteristics by using the bidirectional LSTM:
On the basis of the step 7, a bidirectional long short-term memory network (BLSTM) is selected to establish a deep learning model for the PDG semantic features; the BLSTM has good learning capability for time-series input, and since the PDG node sequence is a time-series structure, the BLSTM model is selected;
the BLSTM used comprised four layers: the input layer, the bi-directional LSTM processing unit layer, the Attention layer and the full link layer, the example of BLSTM is shown in FIG. 5.
Step 10, establishing a fusion model for the grammatical features and the semantic features: on the basis of the steps 8 and 9, a two-layer fully connected neural network and a Softmax classifier are selected to establish the fusion model; the outputs of the step 8 and the step 9 are taken as the input of the fully connected layers, the output of the fully connected layers is taken as the input of the Softmax, and the output of the Softmax is the probability of a vulnerability.
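The fusion model of this step can be sketched as follows; concatenating the two feature vectors and the layer widths are illustrative assumptions, while the two fully connected layers and the Softmax output follow the text.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Two fully connected layers plus Softmax over the fused features (step 10)."""
    def __init__(self, feat_dim=64, hidden=64, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),   # first fully connected layer
            nn.Linear(hidden, num_classes))               # second fully connected layer

    def forward(self, ast_feat, pdg_feat):
        logits = self.net(torch.cat([ast_feat, pdg_feat], dim=1))
        return torch.softmax(logits, dim=1)               # vulnerability probabilities
```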
Step 11, training and testing the detection model:
Step 11.1, on the basis of the vulnerability library established in the step 2, the detection object of the step 2 is converted into an AST based on the step 3, and the vector representation of the AST is obtained based on the step 4; the detection object of the step 2 is converted into a PDG based on the step 6;
Step 11.2, the vector representation of the PDG is obtained based on the step 7; the AST vector representation and the PDG vector representation are used as the input of the graph convolutional neural network and of the BLSTM respectively, the label of the detection object is used as the output of the classifier, and the whole model is trained and tested with the Adam optimization training algorithm, as shown in FIG. 3.
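A minimal end-to-end training sketch combining the three modules sketched above is given below; the data loader, the number of epochs and the batch layout are illustrative assumptions, while the Adam hyper-parameters (learning rate 0.001, decay rates 0.9 and 0.99) follow the values given later in the specification.

```python
import torch
import torch.nn.functional as F

ast_net, pdg_net, fusion = AstGCN(), PdgBLSTM(), FusionClassifier()
params = (list(ast_net.parameters()) + list(pdg_net.parameters())
          + list(fusion.parameters()))
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.99))

for epoch in range(30):
    # train_loader is a hypothetical iterator yielding one batch of AST graphs
    # (node features, edges, graph-membership vector), PDG sequences and labels.
    for ast_x, ast_edges, ast_batch, pdg_x, labels in train_loader:
        probs = fusion(ast_net(ast_x, ast_edges, ast_batch), pdg_net(pdg_x))
        loss = F.nll_loss(torch.log(probs + 1e-12), labels)   # cross-entropy on probabilities
        optimizer.zero_grad()
        loss.backward()    # errors propagate back through classifier, FC, BLSTM and GCN
        optimizer.step()
```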
Step 12, applying the detection model to a new software module: the detection model obtained in the step 11 is applied to the new software module; the AST and PDG of the new software module are first established and converted into vector representations, the vectors are used as the input of the model, and the output of the model is the probability that the new software module contains a vulnerability.
During training, the error between the actual output and the expected output is propagated backwards, and the parameters are adjusted in the order of the classifier, the fully connected layers, the BLSTM model and the graph convolutional neural network model, as shown in FIG. 3.
The method of the invention obtains two node sequences: the node sequence of the abstract syntax tree and the node sequence of the program dependence graph. Because the node sequences consist of symbols, the symbols must be converted into numerical vectors; the conversion uses the following model:
Let the symbol sequence be denoted $w_1, w_2, \ldots, w_n$ and let the sliding-window size be 2, i.e. each symbol is predicted from the two symbols before it and the two symbols after it: $w_{t-2}, w_{t-1}$ and $w_{t+1}, w_{t+2}$ are used to predict $w_t$. The likelihood function of the model is the probability of generating every central word from its background words:

$$\prod_{t=1}^{n} P\bigl(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\bigr) \quad (1)$$

Let $v$ denote the vector of a background word and $u$ the vector of a central word, and let $\bar{v}_t = \tfrac{1}{4}(v_{t-2}+v_{t-1}+v_{t+1}+v_{t+2})$ be the average of the background-word vectors; then

$$P\bigl(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\bigr) = \frac{\exp\bigl(u_t^{\top}\bar{v}_t\bigr)}{\sum_{i \in V} \exp\bigl(u_i^{\top}\bar{v}_t\bigr)} \quad (2)$$

where $V$ is the dictionary formed by all symbols. The maximum-likelihood estimate of the model is equivalent to minimizing the loss function

$$-\sum_{t=1}^{n} \log P\bigl(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\bigr) \quad (3)$$
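As a small numerical illustration of Eq. (2), the snippet below computes the probability of each dictionary entry being the centre node given four background-node vectors; the toy vectors and the three-entry dictionary are arbitrary assumptions.

```python
import numpy as np

u = np.array([[0.2, 0.1], [0.5, -0.3], [-0.1, 0.4]])   # centre-word vectors, |V| = 3
v_context = np.array([[0.3, 0.2], [0.1, 0.0],
                      [0.0, 0.1], [0.2, 0.3]])          # the four background-node vectors
v_bar = v_context.mean(axis=0)                          # averaged background vector
scores = u @ v_bar
probs = np.exp(scores) / np.exp(scores).sum()           # softmax over the dictionary
print(probs)        # probability of each dictionary entry being the centre node
```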
After training, two sets of word vectors $v$ and $u$ are obtained, in which every word of the dictionary appears both as a central word and as a background word; the central-word vector is used as the representation vector of the word. The principle of the Adam optimization training algorithm is as follows. Let $f(x; \theta)$ denote the graph convolutional network or BLSTM adopted by the invention, where $\theta$ are the network parameters, and let $K$ training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_K, y_K)\}$ be selected at each step to train the network parameters. The partial derivative of the loss function with respect to $\theta$ at the $t$-th iteration is

$$g_t = \frac{1}{K} \sum_{k=1}^{K} \frac{\partial L\bigl(y_k, f(x_k; \theta_{t-1})\bigr)}{\partial \theta} \quad (4)$$

where $L(\cdot)$ is a differentiable loss function and $K$ is the batch size. During training the parameters are updated according to

$$\theta_t = \theta_{t-1} - \Delta\theta_t \quad (5)$$

$\Delta\theta_t$ is computed as follows. Let

$$M_t = \beta_1 M_{t-1} + (1-\beta_1)\, g_t \quad (6)$$

$$G_t = \beta_2 G_{t-1} + (1-\beta_2)\, g_t \odot g_t \quad (7)$$

where $\beta_1$ and $\beta_2$ are the decay rates of the two moving averages, taken as $\beta_1 = 0.9$ and $\beta_2 = 0.99$, with $M_0 = 0$, $G_0 = 0$, and $g_t \odot g_t$ denotes the element-wise square of the parameter gradients. Let

$$\hat{M}_t = \frac{M_t}{1-\beta_1^{\,t}} \quad (8)$$

$$\hat{G}_t = \frac{G_t}{1-\beta_2^{\,t}} \quad (9)$$

The parameter update value of the Adam algorithm is

$$\Delta\theta_t = \frac{\alpha_0}{\sqrt{\hat{G}_t} + \epsilon}\, \hat{M}_t \quad (10)$$

where the learning rate is $\alpha_0 = 0.001$ and $\epsilon$ is a small constant added to keep the value numerically stable.
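The update rules of Eqs. (5)–(10) can be illustrated with a short from-scratch implementation; the constants match those given in the text (β1 = 0.9, β2 = 0.99, α0 = 0.001), while the toy quadratic loss is an arbitrary assumption used only to show the iteration.

```python
import numpy as np

beta1, beta2, alpha0, eps = 0.9, 0.99, 0.001, 1e-8
theta = np.array([1.0, -2.0])
M = np.zeros_like(theta)                   # first moving average, Eq. (6)
G = np.zeros_like(theta)                   # second moving average, Eq. (7)

for t in range(1, 101):
    g = 2 * theta                          # gradient of the toy loss L = ||theta||^2
    M = beta1 * M + (1 - beta1) * g
    G = beta2 * G + (1 - beta2) * g * g
    M_hat = M / (1 - beta1 ** t)           # bias correction, Eq. (8)
    G_hat = G / (1 - beta2 ** t)           # bias correction, Eq. (9)
    theta = theta - alpha0 * M_hat / (np.sqrt(G_hat) + eps)   # Eqs. (5) and (10)

print(theta)   # the parameters move towards the minimiser at the origin
```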
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the present disclosure should be covered within the scope of the present invention claimed in the appended claims.

Claims (3)

1. A software security vulnerability detection method based on grammatical features and semantic features is characterized by comprising the following steps:
step 1, determining the granularity of a detection object:
the granularity of the detection object is a function, a file, a component or any code segment with association relations, and is determined according to the requirements of the actual detection project, the language of the detection project being C/C++, Java or PHP;
step 2, establishing a software historical vulnerability library:
searching a public software vulnerability library for software security vulnerabilities in the same programming language as the detected software project, and establishing a vulnerability sample library for that language class, wherein the sample size is the detection granularity and the vulnerability sample library indicates the vulnerability situation of the samples of that granularity, namely whether each sample contains vulnerabilities and their types and number;
using the JTS data set of the public vulnerability database SARD, wherein the version of the JTS is JTS-1.3, comprising 246852 functions in total, of which 105244 functions contain vulnerabilities, accounting for 42% of the total sample, and the vulnerability label of each function is obtained by directly parsing the file name and the function name, the adopted vulnerability labels 1 and 0 representing the presence and absence of a vulnerability;
step 3, establishing an abstract syntax tree of the detection object:
on the basis of the step 1, analyzing the detection object based on an LLVM compiler, and establishing an abstract syntax tree of the detection object based on a third party interface Clang Lib provided by the compiler;
step 4, embedding the abstract syntax tree:
on the basis of the step 3, traversing the obtained abstract syntax tree of the detection object according to a depth-first search algorithm to generate a node sequence of the syntax-tree nodes, generating the abstract syntax tree of each sample for all samples in the step 2 and generating the node sequence of each abstract syntax tree, and embedding each node by using a word2vec embedding algorithm based on the node sequences of the abstract syntax trees corresponding to all samples in the vulnerability library in the step 2 to obtain the vector representation of the node;
step 5, compiling the software source code of the detection object:
on the basis of the step 1, compiling the detection object based on an LLVM compiler to obtain intermediate representation of the code;
step 6, establishing a program dependence graph of the detection object:
on the basis of the step 5, establishing a program dependency graph of the detection object through a Pass framework provided by the LLVM based on the intermediate representation IR of the code;
step 7, embedding the program dependency graph, and traversing the PDG for the PDG of the obtained detection object on the basis of step 6 to generate a node sequence for the PDG, specifically including the following steps:
step 7.1, setting the node set of the PDG graph as V, traversing the set V, outputting all nodes whose in-degree is 0, denoting this set of nodes as V1, and setting V = V − V1;
step 7.2, for the successor nodes in V of all the nodes in V1, subtracting 1 from their in-degree;
step 7.3, repeating the step 7.1 and the step 7.2 until the number of the nodes in V is 0, and ending;
step 7.4, for all samples in the step 2, generating the PDG of each sample by the same method and generating the node sequence of each PDG, and embedding each node by using the word2vec embedding algorithm based on the node sequences of the PDGs corresponding to all samples in the vulnerability library in the step 2 to obtain the vector representation of the node;
step 8, learning the AST characteristics by using a graph convolution neural network;
step 9, learning PDG characteristics by using the bidirectional LSTM:
on the basis of the step 7, selecting a bidirectional long short-term memory network (BLSTM) to establish a deep learning model for the PDG semantic features, wherein the BLSTM has good learning capability for time-series input and the PDG node sequence is a time-series structure, for which reason the BLSTM model is selected;
the BLSTM used comprises four layers: an input layer, a bidirectional LSTM processing unit layer, an Attention layer and a fully connected layer;
step 10, establishing a fusion model for the grammatical features and the semantic features, selecting a two-layer fully connected neural network and a Softmax classifier to establish the fusion model on the basis of the steps 8 and 9, taking the outputs of the step 8 and the step 9 as the input of the fully connected layers, taking the output of the fully connected layers as the input of the Softmax, and taking the output of the Softmax as the probability of a vulnerability;
step 11, training and testing a detection model;
and step 12, applying the detection model to a new software module: applying the detection model obtained in the step 11 to the new software module, firstly establishing the AST and the PDG of the new software module, converting them into vector representations, taking the vectors as the input of the model, and obtaining the output of the model, namely the probability that the new software module contains a vulnerability.
2. The method for detecting the software security vulnerability based on the syntactic and semantic characteristics according to claim 1, wherein the step 8 comprises the steps of:
8.1, on the basis of the step 4, selecting a graph convolution neural network to establish a deep learning model aiming at AST grammatical features;
and 8.2, taking the vector representations of the nodes obtained in the step 4 as the input of the graph convolutional neural network, and directly learning the features of the tree structure of the AST based on the graph convolutional neural network, wherein the graph convolutional neural network comprises four layers: an input layer, a convolution layer, a pooling layer and a fully connected layer.
3. The method for detecting software security vulnerabilities based on syntactic and semantic features according to claim 1, wherein the step 11 comprises the steps of:
step 11.1, on the basis of the vulnerability database established in the step 2, converting the detection object in the step 2 into AST based on the step 3, and obtaining vector representation of the AST based on the step 4; converting the detection object in the step 2 into PDG based on the step 6;
step 11.2, obtaining the PDG vector representation based on the step 7, respectively taking the AST vector representation and the PDG vector representation as the input of the graph convolutional neural network and the input of the BLSTM, taking the label of the detection object as the output of the classifier, and training and testing the whole model by adopting the Adam optimization training algorithm.
CN202011488425.3A 2020-12-16 2020-12-16 Software security vulnerability detection method based on grammatical features and semantic features Active CN112541180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011488425.3A CN112541180B (en) 2020-12-16 2020-12-16 Software security vulnerability detection method based on grammatical features and semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011488425.3A CN112541180B (en) 2020-12-16 2020-12-16 Software security vulnerability detection method based on grammatical features and semantic features

Publications (2)

Publication Number Publication Date
CN112541180A CN112541180A (en) 2021-03-23
CN112541180B true CN112541180B (en) 2022-09-13

Family

ID=75018216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011488425.3A Active CN112541180B (en) 2020-12-16 2020-12-16 Software security vulnerability detection method based on grammatical features and semantic features

Country Status (1)

Country Link
CN (1) CN112541180B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158194B (en) * 2021-03-30 2023-04-07 西北大学 Vulnerability model construction method and detection method based on multi-relation graph network
CN113076235B (en) * 2021-04-09 2022-10-18 中山大学 Time sequence abnormity detection method based on state fusion
CN113220301A (en) * 2021-04-13 2021-08-06 广东工业大学 Clone consistency change prediction method and system based on hierarchical neural network
CN113297580B (en) * 2021-05-18 2024-03-22 广东电网有限责任公司 Code semantic analysis-based electric power information system safety protection method and device
CN113722218B (en) * 2021-08-23 2022-06-03 南京审计大学 Software defect prediction model construction method based on compiler intermediate representation
CN114816997B (en) * 2022-03-29 2023-08-18 湖北大学 Defect prediction method based on graph neural network and bidirectional GRU feature extraction
CN114816517B (en) * 2022-05-06 2024-07-16 哈尔滨工业大学 Hierarchical semantic perception code representation learning method
CN115130110B (en) * 2022-07-08 2024-03-19 国网浙江省电力有限公司电力科学研究院 Vulnerability discovery method, device, equipment and medium based on parallel integrated learning
CN115577361B (en) * 2022-12-09 2023-04-07 四川大学 Improved PHP Web shell detection method based on graph neural network
CN115795487B (en) * 2023-02-07 2023-05-12 深圳开源互联网安全技术有限公司 Vulnerability detection method, device, equipment and storage medium
CN116628707A (en) * 2023-07-19 2023-08-22 山东省计算中心(国家超级计算济南中心) Interpretable multitasking-based source code vulnerability detection method
CN117725422B (en) * 2024-02-07 2024-05-07 北京邮电大学 Program code vulnerability detection model training method and detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214191A (en) * 2018-09-18 2019-01-15 北京理工大学 A method of utilizing deep learning forecasting software security breaches
CN110222512A (en) * 2019-05-21 2019-09-10 华中科技大学 A kind of software vulnerability intelligent measurement based on intermediate language and localization method and system
CN110245496A (en) * 2019-05-27 2019-09-17 华中科技大学 A kind of source code leak detection method and detector and its training method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214191A (en) * 2018-09-18 2019-01-15 北京理工大学 A method of utilizing deep learning forecasting software security breaches
CN110222512A (en) * 2019-05-21 2019-09-10 华中科技大学 A kind of software vulnerability intelligent measurement based on intermediate language and localization method and system
CN110245496A (en) * 2019-05-27 2019-09-17 华中科技大学 A kind of source code leak detection method and detector and its training method and system

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
A deep tree-based model for software defect prediction; Hoa Khanh Dam et al.; arXiv:1802.00921; 2018-02-03; pp. 1–10 *
Automated software vulnerability detection with machine learning; Jacob A. Harer et al.; arXiv:1803.04497; 2018-08-02; pp. 1–8 *
Automatic feature learning for vulnerability prediction; Hoa Khanh Dam et al.; arXiv:1708.02368; 2017-08-08; pp. 1–12 *
POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects; Guanjun Lin et al.; CCS '17: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security; 2017-10-31; pp. 2539–2541 *
VulDeePecker: A Deep Learning-Based System for Vulnerability Detection; Zhen Li et al.; arXiv:1801.01681v1; 2018-01-05; pp. 1–15 *
Vulnerability detection based on a BiLSTM model (基于BiLSTM模型的漏洞检测); 龚扣林 et al.; Computer Science (计算机科学); 2020-05-31; No. 05; pp. 295–300 *
Intelligent vulnerability detection *** based on abstract syntax trees (基于抽象语法树的智能化漏洞检测***); 陈肇炫 et al.; Journal of Cyber Security (信息安全学报); 2020-07-15; Vol. 5, No. 04; pp. 1–13 *

Also Published As

Publication number Publication date
CN112541180A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN112541180B (en) Software security vulnerability detection method based on grammatical features and semantic features
CN111639344B (en) Vulnerability detection method and device based on neural network
CN109977205B (en) Method for computer to independently learn source code
CN112215013B (en) Clone code semantic detection method based on deep learning
CN113672931B (en) Software vulnerability automatic detection method and device based on pre-training
CN113312464B (en) Event extraction method based on conversation state tracking technology
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN112394973B (en) Multi-language code plagiarism detection method based on pseudo-twin network
CN114064487A (en) Code defect detection method
CN117215935A (en) Software defect prediction method based on multidimensional code joint graph representation
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN114547318A (en) Fault information acquisition method, device, equipment and computer storage medium
CN115757695A (en) Log language model training method and system
CN115935369A (en) Method for evaluating source code using numeric array representation of source code elements
CN112579777B (en) Semi-supervised classification method for unlabeled text
CN116702160B (en) Source code vulnerability detection method based on data dependency enhancement program slice
CN117725592A (en) Intelligent contract vulnerability detection method based on directed graph annotation network
CN115794119B (en) Case automatic analysis method and device
CN117056226A (en) Cross-project software defect number prediction method based on transfer learning
CN115935367A (en) Static source code vulnerability detection and positioning method based on graph neural network
CN114780403A (en) Software defect prediction method and device based on enhanced code attribute graph
Qu et al. Software Defect Detection Method Based on Graph Structure and Deep Neural Network
CN113076089A (en) API completion method based on object type
Mohan Automatic repair and type binding of undeclared variables using neural networks
Yang et al. LineFlowDP: A Deep Learning-Based Two-Phase Approach for Line-Level Defect Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant