CN114816997B - Defect prediction method based on graph neural network and bidirectional GRU feature extraction - Google Patents

Defect prediction method based on graph neural network and bidirectional GRU feature extraction Download PDF

Info

Publication number
CN114816997B
CN114816997B CN202210323112.5A CN202210323112A CN114816997B CN 114816997 B CN114816997 B CN 114816997B CN 202210323112 A CN202210323112 A CN 202210323112A CN 114816997 B CN114816997 B CN 114816997B
Authority
CN
China
Prior art keywords
node
software
vector
code
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210323112.5A
Other languages
Chinese (zh)
Other versions
CN114816997A (en
Inventor
何鹏
周纯英
曾诚
马菊
黄杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University
Original Assignee
Hubei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University filed Critical Hubei University
Priority to CN202210323112.5A priority Critical patent/CN114816997B/en
Publication of CN114816997A publication Critical patent/CN114816997A/en
Application granted granted Critical
Publication of CN114816997B publication Critical patent/CN114816997B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3684Test management for test design, e.g. generating new test cases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3688Test management for test execution, e.g. scheduling of test suites
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Stored Programmes (AREA)

Abstract

The application provides a defect prediction method based on graph neural network and bidirectional GRU feature extraction, which comprises the steps of firstly, carrying out network modeling on a software system, and learning a network structure by using a neural network model to obtain software dependency characteristics; then constructing a software system source file into an abstract syntax tree, extracting a token sequence, and modeling by using a two-way gating recursion unit to obtain deep semantic features of the software system; and combining the obtained software dependency characteristics with the semantic characteristics of the obtained source codes to obtain final mixed characteristics, training a classifier for defect prediction based on the obtained mixed characteristics, constructing a prediction model by combining the two different types of measurement indexes, extracting deep semantic characteristics of the software source codes by using a bidirectional GRU, extracting dependency characteristics among software modules by using a graph neural network, and combining the two characteristics for predicting defects, thereby improving the accuracy of prediction.

Description

Defect prediction method based on graph neural network and bidirectional GRU feature extraction
Technical Field
The application relates to the field of software engineering software defect prediction, in particular to a defect prediction method based on a graph neural network and bidirectional GRU feature extraction.
Background
With the increasing size and complexity of software, the probability of defects in the software increases, as does the cost of testing source code and producing high quality software products. In practice, the test resources are limited, some serious failures may lead to significant functional losses, large amounts of data corruption, and even to a crash of the entire software system, and therefore limited resources should be allocated to the high risk modules that are more prone to failure.
In early work, researchers proposed a variety of metrics to build predictive models, typically using static code properties and defect locations, by which defects were predicted by manually defined features such as number of lines of code, previous errors, number of methods in a file, etc., which were easily obtained from source code using automated tools. However, the characteristics of the traditional manual design only consider the complexity of codes, in order to extract richer source code semantics and grammar information, researchers model the software source code to be similar to a text sequence in natural language processing, such as a neural network model based on an abstract grammar tree, so that the source code can be well simulated, meanwhile, the structural information of a program is also reserved, the abstract grammar tree is traversed, the node types needing to be reserved are selected, each source file is parsed into a series of code marks, and the code marks are input into a deep neural network, so that the semantic characteristics of the software source code are obtained.
A disadvantage of conventional code metrics and semantic features is that they focus on only a single element, and rarely consider the interaction information between elements, the available information content is limited. In recent years, network metrics derived from concepts in the field of social network analysis have attracted attention from a wide range of researchers. The analysis based on the network takes the modules as nodes, extracts the dependency relationship among the modules as edges to form a software source code network, and establishes a prediction model by using the obtained network model. Network metrics take into account interactions between modules to model information flow and topology in software, which are not captured by software code metrics.
Therefore, the method in the prior art has the technical problem of low software defect prediction accuracy.
Disclosure of Invention
Aiming at the defects of the prior art, the application provides a defect prediction method based on a graph neural network and bidirectional GRU feature extraction, which extracts richer software features and improves the accuracy of prediction.
In order to achieve the above object, the following technical scheme is adopted:
the defect prediction method based on the graph neural network and the bidirectional GRU feature extraction comprises the following steps:
s1: performing network modeling on the software system, and learning a network structure by using a neural network model to obtain the software dependency characteristics;
s2: constructing a software system source file into an abstract syntax tree, extracting a token sequence, and modeling by using a two-way gating recursion unit to obtain deep semantic features of the software system;
s3: combining the software dependency characteristics obtained in the step S1 with the semantic characteristics of the source codes obtained in the step S2 to obtain final mixed characteristics, and training a classifier for defect prediction based on the obtained mixed characteristics.
In one embodiment, step S1 includes:
s1.1: analyzing a software source code file by using an open source tool to construct a software dependent network model;
s1.2: learning a software dependent network by using a network embedding method to obtain an embedding vector of a node;
s1.3: and (3) constructing a graph neural network model, and inputting the software dependent network model constructed in the step S1.1 and the embedded vector of the node obtained in the step S1.2 as the graph neural network model to obtain the software dependent relationship characteristic.
In one embodiment, step S1.1 comprises:
s1.1.1: and performing dependency relation scanning on various files generated after source code compiling by using an open source tool, and storing the files as a uniform file format:
s1.1.2: analyzing and extracting the dependency relationship among various files, performing network modeling, and constructing a software dependent network model, wherein the software dependent network model is an undirected and unauthorized network SDN= (V, E), and the node V is a network with no direction and no rights i (v i E V) represents a class file or interface file in a software system, if there is a dependency between two nodes, there is a relation between themStrip connecting edge e ij, wherein ,eij =(v i ,v j )∈E。
In one embodiment, step S1.2 comprises:
s1.2.1: the Node2vec method is utilized, the network is learned and converted into a Node sequence in a biased random walk mode, and the Node sequence obtained through conversion is used as a text sequence in natural language processing;
s1.2.2: training is carried out through a word vector model skip gram according to the obtained node sequence, and low-dimensional vector representation of the node is obtained and used as an embedded vector of the node.
In one embodiment, S1.3 comprises:
s1.3.1: taking the software dependent network model obtained in the step S1.1 and the node embedded vector X obtained in the step S1.2 as inputs of the graph neural network,i V is the number of nodes, d is the dimension of the node embedding vector, N (V) = { u E V i (V, u) E) represents the neighbor set of node V, +.>Implicit embedded vector representing target node v at model k-th layer, initially input, ++>
S1.3.2: the node vector mode is iteratively updated through a graph neural network GCN by using convolution operation, each node in the network aggregates the neighbor node characteristics in iteration and is combined with an embedded vector obtained by iteration of the node on the previous layer to obtain a new layer of embedded vector, each node is sequentially iterated until the last layer is output to obtain updated node characteristics, wherein the convolution layer of the GCN is expressed as follows:
wherein A is an adjacency matrix,is to add a self-connecting adjacency matrix, I N Is an identity matrix>Is a degree matrix-> Representation matrix->The value of the element corresponding to the ith row and the jth column (l) Is the output of layer I, Θ (l) Is a parameter of the first layer, sigma (·) represents the activation function, H (l+1) The output of the layer 1 is the output of the layer 1, and the updated node characteristic is the software dependency characteristic.
In one embodiment, after step S1.3.2, the method further comprises:
the output probability value of each node is obtained through the softmax layer, and the specific formula is as follows:
wherein ,is an adjacent matrix normalized by the Laplace matrix, X is an embedded vector of a node, and W is an initial characteristic (0) ∈R C×H and W(1) ∈R H×2 The weight matrix of two layers is respectively, C is the dimension of each node of the input layer, H is the dimension of each node of the hidden layer, Z epsilon R N×2 Is the output probability value of all nodes;
training is carried out by taking the predicted value and the real label as targets, and the adopted loss function is as follows:
wherein ,is a label node index set, Z lf Is GCN is the predictive vector of tag l, f is the dimension of the predictive vector, y lf Is the true one-hot vector of label l, with the goal of reducing the error between the true label and the predicted label.
In one embodiment, constructing a software system source file into an abstract syntax tree, extracting a token sequence, includes:
converting the source file into an abstract syntax tree using a javalang tool;
traversing an abstract syntax tree, preserving nodes of the type that need to be extracted, each source file is parsed into a series of code tags [ w ] 1 ,w 2 ,…,w n ]Constructing a word list V with a fixed size from the parsed code marks, converting each code mark into a low-dimensional continuous real-valued vector representation by using word2vec training word list to obtain a token embedded matrixd represents the dimension of the code tag and the token embedding matrix is used to store the token sequence.
In one embodiment, modeling using a bi-directional gating recursion unit obtains deep semantic features of a software system, comprising:
inputting a sequence of code tokens for each class file into the sequence of GRU units, each code token w i Using a real value vector x i =[x 1 ,x 2 ,…,x d ]Representing, entering into one of the GRU units, training each GRU unit by predicting a next code token in the sequence;
trained GRU units generate a status output for each code tagQuantity s i =[s 1 ,s 2 ,…,s d ]Representing capturing the semantic distribution of the code tag according to the context of the code tag;
a series of code tags [ w ] for each java source file 1 ,w 2 ,…,w n ]Input into trained GRU to obtain token state output sequence s 1 ,s 2 ,…,s n ]The output sequence of token states is aggregated into a file vector through a pooling layer as a deep semantic feature of the software system corresponding to the source file.
The above technical solutions in the embodiments of the present application at least have one or more of the following technical effects:
firstly, modeling a software system as a software network by utilizing a complex network theory, and then carrying out embedded learning on the network by combining a graph neural network model of a deep learning technology to obtain network measurement characteristics, namely software dependency characteristics; secondly, extracting rich semantic information in a source file of the software system by using a two-way gating recursion unit; third, two types of software metrology features are combined and applied to software defect prediction. That is, the application adopts the combination of the two different types of measurement indexes to construct a prediction model, uses the bidirectional GRU to extract the deep semantic features of the software source codes, uses the graph neural network to extract the dependency features between the software modules, and combines the two features to predict the defects, thereby improving the accuracy of prediction.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a frame diagram of a defect prediction method based on a graph neural network and bidirectional GRU feature extraction according to an embodiment of the present application;
FIG. 2 is a process diagram of building a software dependent network in accordance with an embodiment of the present application;
FIG. 3 is a process diagram of the method of the present application applied to defect prediction.
Detailed Description
The present inventors have found through a great deal of research and practice that:
some methods in the prior art adopt the traditional code measurement method, and the defects of the code measurement method index and the semantic features are that only single elements are focused, interaction information among the elements is rarely considered, and the available information content is limited. In recent years, network metrics derived from concepts in the field of social network analysis have attracted attention from a wide range of researchers. The analysis based on the network takes the modules as nodes, extracts the dependency relationship among the modules as edges to form a software source code network, and establishes a prediction model by using the obtained network model. Network metrics take into account interactions between modules to model information flow and topology in software, which are not captured by software code metrics. However, in other research works it has been shown that network metrics are not more advantageous than code metrics and are not commonplace.
In the field of software engineering, the embedded learning model for processing the network structure data includes deepflk, node2vec, struct 2vec and the like, and the principle is that the network structure is converted into a series of node sequences, and then the word vector model is used for learning the node representation. The graph neural network model in the deep learning field can capture network structural characteristics as well and is not applied to the software defect prediction field. The idea is to multiply the adjacent matrix of the whole network structure with the original node characteristic matrix to obtain the embedded characteristics of the hidden layer, and iterate in sequence, thereby realizing a multi-layer network. Compared with the traditional embedded learning method, the graph neural network can learn the structural relation of each node and the neighborhood thereof, and can fuse the characteristic attribute carried by each node into the graph neural network to perform more comprehensive learning.
Based on the above consideration, the application adopts the combination of the two different types of measurement indexes to construct a prediction model, uses the bidirectional GRU to extract the deep semantic features of the software source codes, uses the graph neural network to extract the dependency relationship features between the software modules, and combines the two features to predict the defects.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application provides a defect prediction method based on a graph neural network and bidirectional GRU feature extraction, which comprises the following steps:
s1: performing network modeling on the software system, and learning a network structure by using a neural network model to obtain the software dependency characteristics;
s2: constructing a software system source file into an abstract syntax tree, extracting a token sequence, and modeling by using a two-way gating recursion unit to obtain deep semantic features of the software system;
s3: combining the software dependency characteristics obtained in the step S1 with the semantic characteristics of the source codes obtained in the step S2 to obtain final mixed characteristics, and training a classifier for defect prediction based on the obtained mixed characteristics.
In general, step S1 is to extract the software dependency characteristics by constructing a graph neural network, step S2 is to extract the deep semantic characteristics by a bi-directional gating recursion unit, and step S3 combines the two characteristics for predicting defects.
Referring to fig. 1, a frame diagram of a defect prediction method based on a graph neural network and bidirectional GRU feature extraction is provided in an embodiment of the present application.
In one embodiment, step S1 includes:
s1.1: analyzing a software source code file by using an open source tool to construct a software dependent network model;
s1.2: learning a software dependent network by using a network embedding method to obtain an embedding vector of a node;
s1.3: and (3) constructing a graph neural network model, and inputting the software dependent network model constructed in the step S1.1 and the embedded vector of the node obtained in the step S1.2 as the graph neural network model to obtain the software dependent relationship characteristic.
In one embodiment, step S1.1 comprises:
s1.1.1: and performing dependency relation scanning on various files generated after source code compiling by using an open source tool, and storing the files as a uniform file format:
s1.1.2: analyzing and extracting the dependency relationship among various files, performing network modeling, and constructing a software dependent network model, wherein the software dependent network model is an undirected and unauthorized network SDN= (V, E), and the node V is a network with no direction and no rights i (v i E V) represents a class file or interface file in a software system, and if there is a dependency relationship between two nodes, there is a continuous edge e between them ij, wherein ,eij =(v i ,v j )∈E。
In the specific implementation process, a software system developed in Java language is used as an analysis object, and a dependency relation scanning is carried out on a class file generated after source code compiling, a jar file formed by source code packaging or a zip compression package containing source code through a Dependencyfinder tool and is stored as an XML file. The XML file stores the analysis result of the DependencyFinder on Java source code, and the dependency relationship between the basic information and elements of three granularity elements of the package, the class and the method in the source code is represented by a nested structure. The < package > tag of the outermost layer represents a packet, < class > represents a class, < feature > represents a method/field, and the < outload > and < inbound > tags of the innermost layer represent a dependency and a depended relationship, respectively. The embodiment uses an autonomously developed parsing program to parse the tags of the parsed XML file, extracts the dependency relationships among the software classes from the parsed XML file, and stores the dependency relationships as a net network file format for downstream work.
Referring to fig. 2, for the process of modeling a network, part (a) of fig. 2 is 5 Java file code fragments, part (b) of fig. 2 is a class-dependent network corresponding to each other, each node in the network represents a class, and the edges are the dependency relationships between two class nodes. For example, class B is a subclass of class A, and according to the inheritance of the class B and the class A, there is a continuous edge (B-A) pointed to A by B. C-I represents an interface implementation relationship, C-D represents a parameter type dependency relationship, and A-C, D-A represents an aggregation relationship.
In SDN, the dependency of two class nodes mainly considers the following 3 cases:
(1) Inheritance, provided class v c1 Inheritance to class v c2 Or implement interface v c2 Then there is a directed edge e 12 =<v c1 ,v c2 >;
(2) Polymerization, if class v c1 Comprises class v c2 The attribute of (2) then has a directed edge e 12 =<v c1 ,v c2 >;
(3) Parameters, provided that class v c1 Class v is invoked by the method of (1) c2 The method of (2) has a directed edge e 12 =<v c1 ,v c2 >。
In one embodiment, step S1.2 comprises:
s1.2.1: the Node2vec method is utilized, the network is learned and converted into a Node sequence in a biased random walk mode, and the Node sequence obtained through conversion is used as a text sequence in natural language processing;
s1.2.2: training is carried out through a word vector model skip gram according to the obtained node sequence, and low-dimensional vector representation of the node is obtained and used as an embedded vector of the node.
Specifically, the Node2vec method is a graph data embedding method, and S1.2.2 is to embed each Node v in the network i (v i E V) into d-dimensional token vectorI V I is the number of nodes, where some target node V is characterizedVector is->These node feature vectors will serve as initial inputs to the neural network of the subsequent graph.
In one embodiment, S1.3 comprises:
s1.3.1: taking the software dependent network model obtained in the step S1.1 and the node embedded vector X obtained in the step S1.2 as inputs of the graph neural network,i V is the number of nodes, d is the dimension of the node embedding vector, N (V) = { u E V i (V, u) E) represents the neighbor set of node V, +.>Implicit embedded vector representing target node v at model k-th layer, initially input, ++>
S1.3.2: the node vector mode is iteratively updated through a graph neural network GCN by using convolution operation, each node in the network aggregates the neighbor node characteristics in iteration and is combined with an embedded vector obtained by iteration of the node on the previous layer to obtain a new layer of embedded vector, each node is sequentially iterated until the last layer is output to obtain updated node characteristics, wherein the convolution layer of the GCN is expressed as follows:
wherein A is an adjacency matrix,is to add a self-connecting adjacency matrix, I N Is an identity matrix>Is a degree matrix-> Representation matrix->The value of the element corresponding to the ith row and the jth column (l) Is the output of layer I, Θ (l) Is a parameter of the first layer, sigma (·) represents the activation function, H (l+1) The output of the layer 1 is the output of the layer 1, and the updated node characteristic is the software dependency characteristic.
Specifically, the node vector mode is iteratively updated by GCN using convolution operation, and the target node v aggregates (Aggregate) the features of the neighboring nodes a, b, c, d around the target node v and then combines with the feature x of the target node v v Combining (Combine) to obtain an embedded vector of a new layer, which is expressed as follows:
where k represents the current layer, AGGREGATE (·) represents the aggregate function, CONCAT (·) represents the join function, W is a trainable weight matrix, σ (·) represents the activate function, such as the ReLU function,an embedding vector of a neighbor node u representing the target node v at the k-1 layer +.>An embedded vector aggregate representation at the k-1 layer for all neighbor nodes of the target node v,is the embedded vector of the target node v at the current layer, is the embedded vector obtained from the target node v at the previous layer +.>And the aggregate vector of its neighbor nodes at the current layer +.>Combined by the COMBINE (·) function.
Common aggregation functions include summation (sum), average (MEAN), maximum (max) and the like, and the method adopts a graph convolution neural network in a graph neural network model, adopts an element-averaged aggregation function MEAN and uses a ReLU activation function. The function of COMBINE (·) is to splice the feature aggregated by the neighboring node with the feature transferred by the layer on the target node.
In one embodiment, after step S1.3.2, the method further comprises:
the output probability value of each node is obtained through the softmax layer, and the specific formula is as follows:
wherein ,is an adjacent matrix normalized by the Laplace matrix, X is an embedded vector of a node, and W is an initial characteristic (0) ∈R C×H and W(1) ∈R H×2 The weight matrix of two layers is respectively, C is the dimension of each node of the input layer, H is the dimension of each node of the hidden layer, Z epsilon R N×2 Is the output probability value of all nodes;
training is carried out by taking the predicted value and the real label as targets, and the adopted loss function is as follows:
wherein ,is a label node index set, Z lf Is GCN is the predictive vector of tag l, f is the dimension of the predictive vector, y lf Is the true one-hot vector of label l, with the goal of reducing the error between the true label and the predicted label.
Specifically, since the GCN is a supervised model, the training process needs to provide a real label of the node, so that the model parameters are continuously adjusted in the back propagation process, and the model parameters are optimal. The corresponding node characteristics obtained finally are the characteristics updated in the hidden layer of the GCN, namely, the characteristics of the node updated in S1.3.2 (such as 32 dimensions). And outputting a probability value (also called a predicted value, which is 2-dimensional in the embodiment, and finally predicts a softmax probability value, and continuously making the probability value closer to a real tag) for parameter tuning optimization in GCN back propagation.
In one embodiment, constructing a software system source file into an abstract syntax tree, extracting a token sequence, includes:
converting the source file into an abstract syntax tree using a javalang tool;
traversing an abstract syntax tree, preserving nodes of the type that need to be extracted, each source file is parsed into a series of code tags [ w ] 1 ,w 2 ,…,w n ]Constructing a word list V with a fixed size from the parsed code marks, converting each code mark into a low-dimensional continuous real-valued vector representation by using word2vec training word list to obtain a token embedded matrixd represents the dimension of the code tag and the token embedding matrix is used to store the token sequence.
Specifically, the source file is converted into an abstract syntax tree using the javalang tool, and the input may be a code fragment or a complete Java file, but must be all complete code. The application adopts class granularity conversion, namely each java file is converted into an abstract syntax tree.
Then determining the extracted node types, and only extracting three types of nodes as token marks by the method:
(1) Method call nodes (method invocation) and class instantiation nodes (classdeclartion), i.e. those nodes that record method names and class names;
(2) Declaration nodes, such as method declarations, class declarations, etc.;
(3) Control flow nodes, including IfStatement, whileStatement, forStatement, throwStatement, catchClause, etc., are simply labeled as their node type.
Since the names of methods, classes and types are generally specific to a particular item, methods with the same name are either rare or have different functions in different items. Thus for cross-project prediction, the extracted tokens are node types of AST, not specific names, thereby enabling generalization of the model.
Traversing the abstract syntax tree, only preserving the nodes of the type that need to be extracted, each source file is parsed into a series of code tags [ w ] 1 ,w 2 ,…,w n ]Constructing a word list V with a fixed size by using the code marks, converting each code mark into a low-dimensional continuous real-value vector representation by using word2vec training word list to obtain a token embedded matrixd represents the dimension of the code label.
In one embodiment, modeling using a bi-directional gating recursion unit obtains deep semantic features of a software system, comprising:
inputting a sequence of code tokens for each class file into the sequence of GRU units, each code token w i Using a real value vector x i =[x 1 ,x 2 ,…,x d ]Representing input into a GRU unit by predicting the next in sequenceTraining each GRU unit by a code token;
trained GRU units generate a state output vector s for each code tag i =[s 1 ,s 2 ,…,s d ]Representing capturing the semantic distribution of the code tag according to the context of the code tag;
a series of code tags [ w ] for each java source file 1 ,w 2 ,…,w n ]Input into trained GRU to obtain token state output sequence s 1 ,s 2 ,…,s n ]The output sequence of token states is aggregated into a file vector through a pooling layer as a deep semantic feature of the software system corresponding to the source file.
In particular, since the GRU is a cyclic network, the model parameters for all GRU units are the same. Output vector s i Can be combined with the input vector x i To simplify the training model, the dimensions of both are kept the same in this embodiment. s is(s) i Representing capturing its semantic distribution according to the context of the code tag. Code mark w i Output state vector s of (2) i Marking w from previous code by calculating posterior distribution 1:i Context semantics of (2) to predict the next code marker w i+1
In the implementation process, a Java file is converted into a series of code marks Each code mark w i Using a real value vector x i =[x 1 ,x 2 ,…,x d ]And (3) representing. At time i, the transition equation for the GRU is:
r i =σ(W r w i +U r h i-1 +b r )
z i =σ(W z w i +U z h i-1 +b z )
wherein ri Is to reset the gate, determine how much information of the previous time, z i Is an update gate for determining the current memory contentAnd information h of the previous time i-1 How to pass on to the next unit. These two gating vectors determine which information can ultimately be the output of the gating loop. />Is the current memory content, h i Is the information that is ultimately input to the next cell. />Are weight matrices, d is the input dimension of the real-valued vector, and m is the output dimension. b r ,b z ,b h Is the offset value.
To further enhance the ability of the loop layer to capture dependent information, the present application employs a bi-directional GRU in which hidden states in two directions are connected to form a new state:
wherein ,is the element information obtained in forward GRU at time i,/and>the cell information obtained by the reverse GRU is finally input into the pooling layer for sampling. Since the importance of different statements is considered to be different, e.g., a method call statement may contain more information, the largest pool is employed by default to capture the most important semantic information. The vector produced finally->The semantic feature vector representation of the Java file is obtained.
The implementation process of step S3 is as follows:
first, each java file is marked as defective or non-defective according to the defects posted in each project. And then, carrying out different types of feature extraction on the software system according to the steps S1 and S2 (as shown in figure 1), and combining the extracted software dependency relationship features with the source code semantic features for training a machine learning classifier. Finally, the trained model is used to predict whether the new instance is defective or non-defective. When intra-project defect prediction is performed, the instances of the training set and the test set are from the same project. When performing cross-project defect prediction, the training instance and the test instance come from different projects.
In the specific implementation process, the application adopts a multi-layer perceptron as a prediction model. The trained model is used to predict whether the new instance is defective or non-defective. As shown in FIG. 3, when intra-project defect prediction is performed, the instances of the training set and the test set are from the same project, namely project A. When performing cross-project defect prediction, a prediction model is trained using the instances in project A and test instances in project B are predicted using the model.
The scope of the present application is not limited to the above-described embodiments, and it is apparent that various modifications and variations can be made to the present application by those skilled in the art without departing from the scope and spirit of the application. It is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (6)

1. The defect prediction method based on the graph neural network and the bidirectional GRU feature extraction is characterized by comprising the following steps of:
s1: performing network modeling on the software system, and learning a network structure by using a neural network model to obtain the software dependency characteristics;
s2: constructing a software system source file into an abstract syntax tree, extracting a token sequence, and modeling by using a two-way gating recursion unit to obtain deep semantic features of the software system;
s3: combining the software dependency characteristics obtained in the step S1 with the semantic characteristics of the source codes obtained in the step S2 to obtain final mixed characteristics, and training a classifier for defect prediction based on the obtained mixed characteristics;
wherein, step S1 includes:
s1.1: analyzing a software source code file by using an open source tool to construct a software dependent network model;
s1.2: learning a software dependent network by using a network embedding method to obtain an embedding vector of a node;
s1.3: constructing a graph neural network model, and inputting the software dependent network model constructed in the step S1.1 and the embedded vector of the node obtained in the step S1.2 as the graph neural network model to obtain the software dependent relationship characteristic;
s1.3 specifically comprises:
s1.3.1: taking the software dependent network model obtained in the step S1.1 and the node embedded vector X obtained in the step S1.2 as inputs of the graph neural network,v is the number of nodes, d is the dimension of the node embedded vector,n (V) = { u ε V| (V, u) ∈E) represents the neighbor set of node V, +.>Implicit embedded vector representing target node v at model k-th layer, initially input, ++>
S1.3.2: the node vector mode is iteratively updated through a graph neural network GCN by using convolution operation, each node in the network aggregates the neighbor node characteristics in iteration and is combined with an embedded vector obtained by iteration of the node on the previous layer to obtain a new layer of embedded vector, each node is sequentially iterated until the last layer is output to obtain updated node characteristics, wherein the convolution layer of the GCN is expressed as follows:
wherein A is an adjacency matrix,is to add a self-connecting adjacency matrix, I N Is an identity matrix>Is a matrix of degrees that is a function of the degree, representation matrix->The value of the element corresponding to the ith row and the jth column (l) Is the output of layer I, Θ (l) Is of the first layerParameters, σ (·) represent activation function, H (l+1) The output of the layer 1 is the output of the layer 1, and the updated node characteristic is the software dependency characteristic.
2. The defect prediction method of claim 1, wherein step S1.1 comprises:
s1.1.1: and performing dependency relation scanning on various files generated after source code compiling by using an open source tool, and storing the files as a uniform file format:
s1.1.2: analyzing and extracting the dependency relationship among various files, performing network modeling, and constructing a software dependent network model, wherein the software dependent network model is an undirected and unauthorized network SDN= (V, E), and the node V is a network with no direction and no rights i (v i E V) represents a class file or interface file in a software system, and if there is a dependency relationship between two nodes, there is a continuous edge e between them ij, wherein ,eij =(v i ,v j )∈E。
3. The defect prediction method of claim 1, wherein step S1.2 comprises:
s1.2.1: the Node2vec method is utilized, the network is learned and converted into a Node sequence in a biased random walk mode, and the Node sequence obtained through conversion is used as a text sequence in natural language processing;
s1.2.2: training is carried out through a word vector model skip gram according to the obtained node sequence, and low-dimensional vector representation of the node is obtained and used as an embedded vector of the node.
4. The defect prediction method of claim 1, wherein after step S1.3.2, the method further comprises:
the output probability value of each node is obtained through the softmax layer, and the specific formula is as follows:
wherein ,is an adjacent matrix normalized by the Laplace matrix, X is an embedded vector of a node, and W is an initial characteristic (0) ∈R C×H and W(1) ∈R H×2 The weight matrix of two layers is respectively, C is the dimension of each node of the input layer, H is the dimension of each node of the hidden layer, Z epsilon R N×2 Is the output probability value of all nodes;
training is carried out by taking the predicted value and the real label as targets, and the adopted loss function is as follows:
wherein ,is a label node index set, Z lf Is GCN is the predictive vector of tag l, f is the dimension of the predictive vector, y lf Is the true one-hot vector of label l, with the goal of reducing the error between the true label and the predicted label.
5. The defect prediction method of claim 1, wherein constructing the software system source file into an abstract syntax tree, extracting the token sequence, comprises:
converting the source file into an abstract syntax tree using a javalang tool;
traversing an abstract syntax tree, preserving nodes of the type that need to be extracted, each source file is parsed into a series of code tags [ w ] 1 ,w 2 ,…,w n ]Constructing a word list V with a fixed size from the parsed code marks, converting each code mark into a low-dimensional continuous real-valued vector representation by using word2vec training word list to obtain a token embedded matrixd represents the dimension of the code tag and the token embedding matrix is used to store the token sequence.
6. The defect prediction method of claim 1, wherein modeling using a bi-directional gating recursion unit obtains deep semantic features of a software system, comprising:
inputting a sequence of code tokens for each class file into the sequence of GRU units, each code token w i Using a real value vector x i =[x 1 ,x 2 ,…,x d ]Representing, entering into one of the GRU units, training each GRU unit by predicting a next code token in the sequence;
trained GRU units generate a state output vector s for each code tag i =[s 1 ,s 2 ,…,s d ]Representing capturing the semantic distribution of the code tag according to the context of the code tag;
a series of code tags [ w ] for each java source file 1 ,w 2 ,…,w n ]Input into trained GRU to obtain token state output sequence s 1 ,s 2 ,…,s n ]The output sequence of token states is aggregated into a file vector through a pooling layer as a deep semantic feature of the software system corresponding to the source file.
CN202210323112.5A 2022-03-29 2022-03-29 Defect prediction method based on graph neural network and bidirectional GRU feature extraction Active CN114816997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210323112.5A CN114816997B (en) 2022-03-29 2022-03-29 Defect prediction method based on graph neural network and bidirectional GRU feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210323112.5A CN114816997B (en) 2022-03-29 2022-03-29 Defect prediction method based on graph neural network and bidirectional GRU feature extraction

Publications (2)

Publication Number Publication Date
CN114816997A CN114816997A (en) 2022-07-29
CN114816997B true CN114816997B (en) 2023-08-18

Family

ID=82532126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210323112.5A Active CN114816997B (en) 2022-03-29 2022-03-29 Defect prediction method based on graph neural network and bidirectional GRU feature extraction

Country Status (1)

Country Link
CN (1) CN114816997B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115617694B (en) * 2022-11-30 2023-03-10 中南大学 Software defect prediction method, system, device and medium based on information fusion
CN116521899B (en) * 2023-05-08 2024-03-26 中国传媒大学 Improved graph neural network-based document level relation extraction method and system
CN116383089B (en) * 2023-05-29 2023-08-04 云南大学 Statement level software defect prediction system based on ordinary differential equation diagram neural network
CN117251376B (en) * 2023-10-09 2024-03-19 湖北大学 Software defect prediction method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825615A (en) * 2019-09-23 2020-02-21 中国科学院信息工程研究所 Software defect prediction method and system based on network embedding
CN111259388A (en) * 2020-01-09 2020-06-09 中山大学 Malicious software API (application program interface) calling sequence detection method based on graph convolution
DE202020101664U1 (en) * 2020-02-07 2020-07-07 Google Llc Computer-aided graph optimization
CN111783100A (en) * 2020-06-22 2020-10-16 哈尔滨工业大学 Source code vulnerability detection method for code graph representation learning based on graph convolution network
CN112069822A (en) * 2020-09-14 2020-12-11 上海风秩科技有限公司 Method, device and equipment for acquiring word vector representation and readable medium
CN112541180A (en) * 2020-12-16 2021-03-23 北京理工大学 Software security vulnerability detection method based on grammatical features and semantic features
CN112633550A (en) * 2020-11-23 2021-04-09 成都唐源电气股份有限公司 RNN-based catenary fault trend prediction method, equipment and storage medium
CN112966271A (en) * 2021-03-18 2021-06-15 中山大学 Malicious software detection method based on graph convolution network
CN114185769A (en) * 2021-11-16 2022-03-15 南京航空航天大学 Software defect prediction method and terminal based on bidirectional long-short term memory neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11544105B2 (en) * 2019-09-11 2023-01-03 Google Llc Recommendations for scheduling jobs on distributed computing devices
US11481418B2 (en) * 2020-01-02 2022-10-25 International Business Machines Corporation Natural question generation via reinforcement learning based graph-to-sequence model

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825615A (en) * 2019-09-23 2020-02-21 中国科学院信息工程研究所 Software defect prediction method and system based on network embedding
CN111259388A (en) * 2020-01-09 2020-06-09 中山大学 Malicious software API (application program interface) calling sequence detection method based on graph convolution
DE202020101664U1 (en) * 2020-02-07 2020-07-07 Google Llc Computer-aided graph optimization
CN111783100A (en) * 2020-06-22 2020-10-16 哈尔滨工业大学 Source code vulnerability detection method for code graph representation learning based on graph convolution network
CN112069822A (en) * 2020-09-14 2020-12-11 上海风秩科技有限公司 Method, device and equipment for acquiring word vector representation and readable medium
CN112633550A (en) * 2020-11-23 2021-04-09 成都唐源电气股份有限公司 RNN-based catenary fault trend prediction method, equipment and storage medium
CN112541180A (en) * 2020-12-16 2021-03-23 北京理工大学 Software security vulnerability detection method based on grammatical features and semantic features
CN112966271A (en) * 2021-03-18 2021-06-15 中山大学 Malicious software detection method based on graph convolution network
CN114185769A (en) * 2021-11-16 2022-03-15 南京航空航天大学 Software defect prediction method and terminal based on bidirectional long-short term memory neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
钱冠群 ; 张莉 ; 张林 ; Philip Lew ; .软件静态结构的依赖网络建模方法与特性分析.计算机科学.2008,(第11期),全文. *

Also Published As

Publication number Publication date
CN114816997A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN114816997B (en) Defect prediction method based on graph neural network and bidirectional GRU feature extraction
CN110597735B (en) Software defect prediction method for open-source software defect feature deep learning
CN111913702B (en) Method for identifying key classes in software system based on graph neural network
CN113468888A (en) Entity relation joint extraction method and device based on neural network
CN112417063B (en) Heterogeneous relation network-based compatible function item recommendation method
CN113312447A (en) Semi-supervised log anomaly detection method based on probability label estimation
CN101546290B (en) Method for improving accuracy of quality forecast of class hierarchy in object-oriented software
CN115145551A (en) Intelligent auxiliary system for machine learning application low-code development
Meilong et al. An approach to semantic and structural features learning for software defect prediction
CN115659966A (en) Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention
CN116521560A (en) Multi-feature fusion emperor class detection method based on graph neural network
CN114881032A (en) Hierarchical category named entity recognition model design method based on multi-task learning
CN114238524B (en) Satellite frequency-orbit data information extraction method based on enhanced sample model
WO2020088338A1 (en) Method and apparatus for building recognition model
CN113361259B (en) Service flow extraction method
Ouyang et al. Semantic enrichment of object associations across federated BIM semantic graphs in a common data environment
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
CN117056226A (en) Cross-project software defect number prediction method based on transfer learning
CN115438190B (en) Power distribution network fault auxiliary decision knowledge extraction method and system
CN116720185A (en) Multi-dimensional graph tensor fusion representation and embedding method for codes
CN113886602B (en) Domain knowledge base entity identification method based on multi-granularity cognition
CN113342982B (en) Enterprise industry classification method integrating Roberta and external knowledge base
CN117290238B (en) Software defect prediction method and system based on heterogeneous relational graph neural network
CN117439800B (en) Network security situation prediction method, system and equipment
CN113377422B (en) Self-recognition technical liability method based on deep learning identification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant