CN114816997B

CN114816997B - Defect prediction method based on graph neural network and bidirectional GRU feature extraction

Info

Publication number: CN114816997B
Application number: CN202210323112.5A
Authority: CN
Inventors: 何鹏; 周纯英; 曾诚; 马菊; 黄杰
Original assignee: Hubei University
Current assignee: Hubei University
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2023-08-18
Anticipated expiration: 2042-03-29
Also published as: CN114816997A

Abstract

The application provides a defect prediction method based on graph neural network and bidirectional GRU feature extraction. First, the software system is modeled as a network, and a graph neural network model learns the network structure to obtain software dependency features. Next, each source file of the software system is parsed into an abstract syntax tree, a token sequence is extracted, and a bidirectional gated recurrent unit (GRU) models the sequence to obtain deep semantic features of the software system. Finally, the software dependency features are combined with the source-code semantic features to obtain hybrid features, and a classifier for defect prediction is trained on them. By building the prediction model from these two different types of metrics, with the bidirectional GRU extracting deep semantic features of the source code and the graph neural network extracting dependency features among software modules, the method improves prediction accuracy.

Description

Defect prediction method based on graph neural network and bidirectional GRU feature extraction

Technical Field

The application relates to software defect prediction in the field of software engineering, and in particular to a defect prediction method based on a graph neural network and bidirectional GRU feature extraction.

Background

As software grows in size and complexity, the probability of defects increases, as does the cost of testing source code and producing high-quality software products. In practice, testing resources are limited, and some serious failures can cause significant loss of functionality, large-scale data corruption, or even a crash of the entire software system. Limited resources should therefore be allocated to the high-risk modules that are most prone to failure.

In early work, researchers proposed a variety of metrics for building prediction models, typically from static code attributes and defect locations. Defects were predicted from manually defined features such as the number of lines of code, previous errors, and the number of methods in a file, all of which are easily obtained from source code with automated tools. However, such hand-crafted features capture only code complexity. To extract richer semantic and syntactic information, researchers began to model source code like a text sequence in natural language processing, for example with neural network models based on the abstract syntax tree (AST), which represent the source code well while preserving the structural information of the program. The AST is traversed, the node types to be retained are selected, each source file is parsed into a sequence of code tokens, and the tokens are fed into a deep neural network to obtain the semantic features of the source code.

A disadvantage of conventional code metrics and semantic features is that they focus on single elements and rarely consider the interaction between elements, so the information they provide is limited. In recent years, network metrics derived from concepts in social network analysis have attracted wide attention from researchers. Network-based analysis takes modules as nodes and the dependencies among modules as edges to form a software source code network, and builds a prediction model from the resulting network model. Network metrics take the interactions between modules into account and so model the information flow and topology of the software, which code metrics do not capture.

Therefore, the methods in the prior art suffer from low software defect prediction accuracy.

Disclosure of Invention

To address the shortcomings of the prior art, the application provides a defect prediction method based on a graph neural network and bidirectional GRU feature extraction, which extracts richer software features and improves prediction accuracy.

In order to achieve the above object, the following technical scheme is adopted:

The defect prediction method based on a graph neural network and bidirectional GRU feature extraction comprises the following steps:

S1: performing network modeling on the software system, and learning the network structure with a neural network model to obtain software dependency features;

S2: parsing the source files of the software system into abstract syntax trees, extracting token sequences, and modeling them with a bidirectional gated recurrent unit (GRU) to obtain deep semantic features of the software system;

S3: combining the software dependency features obtained in step S1 with the source-code semantic features obtained in step S2 to obtain the final hybrid features, and training a classifier for defect prediction based on the obtained hybrid features.

In one embodiment, step S1 includes:

S1.1: analyzing the software source code files with an open-source tool to construct a software dependency network model;

S1.2: learning the software dependency network with a network embedding method to obtain the embedding vectors of the nodes;

S1.3: constructing a graph neural network model, and taking the software dependency network model constructed in step S1.1 and the node embedding vectors obtained in step S1.2 as inputs of the graph neural network model to obtain the software dependency features.

In one embodiment, step S1.1 comprises:

S1.1.1: scanning the files generated by compiling the source code for dependencies with an open-source tool, and storing the result in a uniform file format;

S1.1.2: analyzing and extracting the dependencies among the files, performing network modeling, and constructing a software dependency network model, wherein the software dependency network model is an undirected, unweighted network $SDN = (V, E)$, a node $v_i$ ($v_i \in V$) represents a class file or interface file in the software system, and, if a dependency exists between two nodes, there is an edge $e_{ij}$ between them, where $e_{ij} = (v_i, v_j) \in E$.

In one embodiment, step S1.2 comprises:

S1.2.1: using the Node2vec method, learning the network and converting it into node sequences by biased random walks, the resulting node sequences being treated like text sequences in natural language processing;

S1.2.2: training the skip-gram word vector model on the obtained node sequences to obtain a low-dimensional vector representation of each node, which serves as the embedding vector of the node.
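
The biased random walk of step S1.2.1 can be sketched as follows. This is a minimal illustration in pure Python, not the implementation used by the application: the return parameter `p`, the in-out parameter `q`, the function names, and the toy graph are assumptions introduced here, and the skip-gram training of step S1.2.2 is left to a word-vector library.

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """One biased random walk; adj maps each node to the set of its neighbors."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # isolated node: the walk cannot continue
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)       # stay within distance 1 of prev
            else:
                weights.append(1.0 / q)   # move further away from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

def generate_walks(adj, num_walks, length, p=1.0, q=1.0, seed=0):
    """Start num_walks walks from every node, in shuffled order."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(adj, node, length, p, q, rng))
    return walks

# Toy undirected dependency network (hypothetical class names).
adj = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "D", "I"},
    "D": {"A", "C"},
    "I": {"C"},
}
walks = generate_walks(adj, num_walks=2, length=5, p=0.5, q=2.0, seed=42)
```

Each walk is a list of node names that can be handed to a skip-gram model as if it were a sentence.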

In one embodiment, S1.3 comprises:

S1.3.1: taking the software dependency network model obtained in step S1.1 and the node embedding vectors $X$ obtained in step S1.2 as inputs of the graph neural network, where $X \in \mathbb{R}^{|V| \times d}$, $|V|$ is the number of nodes, $d$ is the dimension of the node embedding vectors, $N(v) = \{u \in V \mid (v, u) \in E\}$ denotes the neighbor set of node $v$, and $h_v^{(k)}$ denotes the hidden embedding vector of target node $v$ at the $k$-th layer of the model, with initial input $h_v^{(0)} = x_v$;

S1.3.2: iteratively updating the node vectors through a graph convolutional network (GCN) using convolution operations, wherein in each iteration every node in the network aggregates the features of its neighbor nodes and combines them with its own embedding vector from the previous layer to obtain the embedding vector of the new layer, and the nodes are iterated layer by layer until the last layer outputs the updated node features, the convolution layer of the GCN being expressed as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} \Theta^{(l)}\right)$$

wherein $A$ is the adjacency matrix, $\tilde{A} = A + I_N$ is the adjacency matrix with self-connections added, $I_N$ is the identity matrix, $\tilde{D}$ is the degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $\tilde{A}_{ij}$ denotes the element in row $i$ and column $j$ of $\tilde{A}$, $H^{(l)}$ is the output of layer $l$, $\Theta^{(l)}$ is the parameter matrix of layer $l$, $\sigma(\cdot)$ denotes the activation function, and $H^{(l+1)}$ is the output of layer $l+1$; the updated node features are the software dependency features.
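
A single convolution layer of this kind can be sketched in NumPy. The sketch assumes the standard symmetric normalization that the symbol definitions above describe (self-connections added, then scaling by the inverse square root of the degrees) and a ReLU activation; the function names, the three-node graph, and the random inputs are illustrative assumptions only.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^(-1/2) (A + I) D~^(-1/2) for a square adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-connections
    d = A_tilde.sum(axis=1)               # diagonal of the degree matrix D~
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, Theta):
    """One propagation step: ReLU(A_hat @ H @ Theta)."""
    return np.maximum(A_hat @ H @ Theta, 0.0)

# Toy graph: node 0 is connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 4))       # stand-in for the Node2vec embeddings X
Theta0 = rng.normal(size=(4, 2))   # layer parameters, randomly initialized

A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, H0, Theta0)  # updated node features, shape (3, 2)
```

Stacking several such layers lets each node see neighbors that are more than one hop away.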

In one embodiment, after step S1.3.2, the method further comprises:

obtaining the output probability value of each node through a softmax layer, with the specific formula:

$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

wherein $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the adjacency matrix normalized via the Laplacian matrix, $X$ is the matrix of node embedding vectors, i.e. the initial features, $W^{(0)} \in \mathbb{R}^{C \times H}$ and $W^{(1)} \in \mathbb{R}^{H \times 2}$ are the weight matrices of the two layers, $C$ is the dimension of each node at the input layer, $H$ is the dimension of each node at the hidden layer, and $Z \in \mathbb{R}^{N \times 2}$ holds the output probability values of all nodes;

training with the goal of matching the predicted values to the true labels, using the loss function:

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f} y_{lf} \ln Z_{lf}$$

wherein $\mathcal{Y}_L$ is the index set of labeled nodes, $Z_{lf}$ is component $f$ of the GCN prediction vector for labeled node $l$, $f$ indexes the dimensions of the prediction vector, and $y_{lf}$ is the corresponding component of the true one-hot vector of labeled node $l$; the goal is to reduce the error between the true labels and the predicted labels.
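
The two-layer forward pass with a softmax output and the cross-entropy loss over labeled nodes can be sketched as below. It assumes a ReLU hidden layer and a loss summed over the labeled nodes only; all function names, sizes, and the toy data are assumptions made for illustration, and no gradient step is shown.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W0, W1):
    """Return the hidden node features and the class probabilities Z."""
    hidden = np.maximum(A_hat @ X @ W0, 0.0)   # hidden layer, shape (N, H)
    Z = softmax(A_hat @ hidden @ W1)           # output layer, shape (N, 2)
    return hidden, Z

def masked_cross_entropy(Z, Y, labeled_idx):
    """Cross-entropy summed over the labeled nodes only (Y is one-hot)."""
    eps = 1e-12  # guards against log(0)
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx] + eps))

# Toy setup: 3 nodes, C = 4 input dimensions, H = 3 hidden dimensions.
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
A_tilde = A + np.eye(3)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))      # symmetric normalization

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W0 = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 2))
Y = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)  # defective / clean labels

hidden, Z = gcn_forward(A_hat, X, W0, W1)
loss = masked_cross_entropy(Z, Y, labeled_idx=[0, 1])
```

In this reading, `hidden` plays the role of the updated node features that are later used as software dependency features, while `Z` only drives training.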

In one embodiment, constructing a software system source file into an abstract syntax tree, extracting a token sequence, includes:

converting the source file into an abstract syntax tree using a javalang tool;

traversing the abstract syntax tree and retaining only nodes of the types to be extracted, so that each source file is parsed into a sequence of code tokens $[w_1, w_2, \ldots, w_n]$; constructing a fixed-size vocabulary $V$ from the parsed code tokens; and training word2vec on the vocabulary to convert each code token into a low-dimensional continuous real-valued vector, yielding a token embedding matrix of size $|V| \times d$, where $d$ is the dimension of a code-token vector, the token embedding matrix being used to store the token sequence.

In one embodiment, modeling using a bidirectional gated recurrent unit to obtain deep semantic features of the software system comprises:

inputting the sequence of code tokens of each class file into a sequence of GRU units, wherein each code token $w_i$ is represented by a real-valued vector $x_i = [x_1, x_2, \ldots, x_d]$ and input into one GRU unit, and each GRU unit is trained by predicting the next code token in the sequence;

the trained GRU units generating, for each code token, a state output vector $s_i = [s_1, s_2, \ldots, s_d]$ that captures the semantic distribution of the code token according to its context;

inputting the sequence of code tokens $[w_1, w_2, \ldots, w_n]$ of each Java source file into the trained GRU to obtain the token state output sequence $[s_1, s_2, \ldots, s_n]$, and aggregating the token state output sequence into a file vector through a pooling layer, the file vector serving as the deep semantic feature of the software system corresponding to the source file.

The above technical solutions in the embodiments of the present application at least have one or more of the following technical effects:

First, a software system is modeled as a software network using complex network theory, and a graph neural network model from deep learning then performs embedding learning on the network to obtain network metric features, i.e. the software dependency features. Second, a bidirectional gated recurrent unit extracts the rich semantic information in the source files of the software system. Third, the two types of software metric features are combined and applied to software defect prediction. That is, the application builds the prediction model from two different types of metrics: the bidirectional GRU extracts deep semantic features of the source code, the graph neural network extracts dependency features among software modules, and the two kinds of features are combined to predict defects, thereby improving prediction accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a frame diagram of a defect prediction method based on a graph neural network and bidirectional GRU feature extraction according to an embodiment of the present application;

FIG. 2 is a process diagram of building a software dependent network in accordance with an embodiment of the present application;

FIG. 3 is a process diagram of the method of the present application applied to defect prediction.

Detailed Description

The present inventors have found through a great deal of research and practice that:

Some prior-art methods use traditional code metrics. The disadvantage of code metrics and semantic features is that they focus on single elements and rarely consider the interaction between elements, so the information they provide is limited. In recent years, network metrics derived from concepts in social network analysis have attracted wide attention from researchers. Network-based analysis takes modules as nodes and the dependencies among modules as edges to form a software source code network, and builds a prediction model from the resulting network model. Network metrics take the interactions between modules into account and so model the information flow and topology of the software, which code metrics do not capture. However, other studies have shown that network metrics are not more advantageous than code metrics and are not generally applicable.

In the field of software engineering, embedding models for network-structured data include DeepWalk, node2vec, and struc2vec; their principle is to convert the network structure into a series of node sequences and then learn node representations with a word vector model. Graph neural network models from deep learning can likewise capture network structural features, but have not yet been applied to software defect prediction. The idea is to multiply the adjacency matrix of the whole network with the node feature matrix to obtain the embedded features of a hidden layer, and to iterate layer by layer, thereby realizing a multi-layer network. Compared with traditional embedding methods, a graph neural network learns the structural relationship between each node and its neighborhood and can also incorporate the feature attributes carried by each node, for more comprehensive learning.

Based on the above consideration, the application adopts the combination of the two different types of measurement indexes to construct a prediction model, uses the bidirectional GRU to extract the deep semantic features of the software source codes, uses the graph neural network to extract the dependency relationship features between the software modules, and combines the two features to predict the defects.

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The embodiment of the application provides a defect prediction method based on a graph neural network and bidirectional GRU feature extraction, which comprises the following steps:

In general, step S1 extracts the software dependency features by constructing a graph neural network, step S2 extracts the deep semantic features with a bidirectional gated recurrent unit, and step S3 combines the two kinds of features to predict defects.

Referring to fig. 1, a frame diagram of a defect prediction method based on a graph neural network and bidirectional GRU feature extraction is provided in an embodiment of the present application.

In one embodiment, step S1 includes:

In one embodiment, step S1.1 comprises:

S1.1.2: analyzing and extracting the dependencies among the files, performing network modeling, and constructing a software dependency network model, wherein the software dependency network model is an undirected, unweighted network $SDN = (V, E)$, a node $v_i$ ($v_i \in V$) represents a class file or interface file in the software system, and, if a dependency exists between two nodes, there is an edge $e_{ij}$ between them, where $e_{ij} = (v_i, v_j) \in E$.

In a specific implementation, a software system developed in Java is taken as the analysis object. The DependencyFinder tool scans for dependencies in the class files generated by compiling the source code, in a jar file packaged from the source code, or in a zip archive containing the source code, and stores the result as an XML file. The XML file holds DependencyFinder's analysis of the Java code: the basic information of elements at three granularities (package, class, and method) and the dependencies among those elements, represented as a nested structure. The outermost <package> tag represents a package, <class> represents a class, <feature> represents a method or field, and the innermost <outbound> and <inbound> tags represent a depends-on relationship and a depended-upon relationship, respectively. This embodiment uses a self-developed parser to parse the tags of the XML file, extract the dependencies among the software classes, and store them in the .net network file format for downstream work.
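
The tag parsing described above can be sketched with Python's standard XML parser. The self-developed parser is not disclosed, so everything here is an assumption: the sample fragment is a hypothetical, simplified layout that only follows the nesting described in the text (package, class, feature, outbound/inbound), and the function names and the `type` attribute are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment in the nested layout described above.
SAMPLE_XML = """
<dependencies>
  <package>
    <name>demo</name>
    <class>
      <name>demo.A</name>
      <outbound type="class">demo.C</outbound>
    </class>
    <class>
      <name>demo.B</name>
      <outbound type="class">demo.A</outbound>
      <feature>
        <name>demo.B.run()</name>
        <outbound type="feature">demo.A.start()</outbound>
      </feature>
    </class>
    <class>
      <name>demo.C</name>
      <inbound type="class">demo.A</inbound>
    </class>
  </package>
</dependencies>
"""

def class_of(feature_name):
    """Map a method or field name to its class: 'demo.A.start()' -> 'demo.A'."""
    without_args = feature_name.split("(", 1)[0]
    return without_args.rsplit(".", 1)[0]

def extract_class_edges(xml_text):
    """Collect undirected class-level dependency edges from <outbound> tags."""
    root = ET.fromstring(xml_text)
    edges = set()
    for cls in root.iter("class"):
        src = cls.findtext("name")
        # iter() also visits <outbound> tags nested inside <feature> elements.
        for out in cls.iter("outbound"):
            target = out.text.strip()
            dst = class_of(target) if out.get("type") == "feature" else target
            if dst != src:
                edges.add(frozenset((src, dst)))  # direction is dropped
    return edges

edges = extract_class_edges(SAMPLE_XML)
```

Lifting feature-level dependencies to their owning classes is one possible design choice; it keeps the network at class granularity, matching the node definition of the SDN.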

Fig. 2 shows the network modeling process: part (a) contains code fragments of five Java files, and part (b) shows the corresponding class dependency network, in which each node represents a class and each edge represents a dependency between two class nodes. For example, class B is a subclass of class A, so the inheritance relationship yields an edge (B-A) pointing from B to A. C-I represents an interface implementation relationship, C-D a parameter-type dependency, and A-C and D-A aggregation relationships.

In SDN, the dependency of two class nodes mainly considers the following 3 cases:

(1) Inheritance: if class $v_{c1}$ inherits from class $v_{c2}$ or implements interface $v_{c2}$, there is a directed edge $e_{12} = \langle v_{c1}, v_{c2} \rangle$;

(2) Aggregation: if class $v_{c1}$ contains an attribute of class $v_{c2}$, there is a directed edge $e_{12} = \langle v_{c1}, v_{c2} \rangle$;

(3) Parameter: if a method of class $v_{c1}$ calls a method of class $v_{c2}$, there is a directed edge $e_{12} = \langle v_{c1}, v_{c2} \rangle$.
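
As a worked example, the five edges named for Fig. 2 (B-A, C-I, C-D, A-C, D-A) can be turned into the undirected, unweighted network defined in step S1.1.2. The sketch below is an assumption about one convenient in-memory form, a dictionary of neighbor sets; the function name is hypothetical.

```python
from collections import defaultdict

# Dependency pairs (source class, target class) read off the Fig. 2 example.
dependency_pairs = [("B", "A"), ("C", "I"), ("C", "D"), ("A", "C"), ("D", "A")]

def build_sdn(pairs):
    """Build an undirected, unweighted network as a dict of neighbor sets."""
    adj = defaultdict(set)
    for src, dst in pairs:
        if src == dst:
            continue  # ignore self-dependencies
        adj[src].add(dst)
        adj[dst].add(src)  # the edge direction is discarded
    return dict(adj)

sdn = build_sdn(dependency_pairs)
degree = {node: len(neighbors) for node, neighbors in sdn.items()}
```

Because the neighbor sets ignore direction and multiplicity, two classes linked by several kinds of dependency still share a single edge.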

In one embodiment, step S1.2 comprises:

Specifically, Node2vec is a graph embedding method. Step S1.2.2 embeds each node $v_i$ ($v_i \in V$) of the network into a $d$-dimensional representation vector, giving $X \in \mathbb{R}^{|V| \times d}$, where $|V|$ is the number of nodes and the feature vector of a target node $v$ is $x_v \in \mathbb{R}^d$. These node feature vectors serve as the initial inputs to the subsequent graph neural network.

In one embodiment, S1.3 comprises:

Specifically, the node vectors are iteratively updated by the GCN using convolution operations: the target node $v$ aggregates (Aggregate) the features of its surrounding neighbor nodes a, b, c, d and then combines (Combine) them with its own feature $x_v$ to obtain the embedding vector of the new layer, expressed as follows:

$$h_{N(v)}^{(k)} = \mathrm{AGGREGATE}\left(\left\{ h_u^{(k-1)} : u \in N(v) \right\}\right)$$

$$h_v^{(k)} = \sigma\left(W \cdot \mathrm{CONCAT}\left(h_v^{(k-1)}, h_{N(v)}^{(k)}\right)\right)$$

where $k$ denotes the current layer, AGGREGATE(·) denotes the aggregation function, CONCAT(·) denotes the concatenation function, $W$ is a trainable weight matrix, and $\sigma(\cdot)$ denotes the activation function, such as the ReLU function. $h_u^{(k-1)}$ is the embedding vector of a neighbor node $u$ of the target node $v$ at layer $k-1$, and $h_{N(v)}^{(k)}$ is the aggregated representation of the layer $k-1$ embedding vectors of all neighbor nodes of $v$. $h_v^{(k)}$ is the embedding vector of the target node $v$ at the current layer, obtained by combining, through the COMBINE(·) function, the embedding vector $h_v^{(k-1)}$ of $v$ from the previous layer with the aggregate vector $h_{N(v)}^{(k)}$ of its neighbor nodes at the current layer.

Common aggregation functions include summation (SUM), averaging (MEAN), and maximum (MAX). The method adopts the graph convolutional network from the family of graph neural network models, with the element-wise averaging aggregation function MEAN and the ReLU activation function. The COMBINE(·) function concatenates the features aggregated from the neighbor nodes with the features of the target node passed from the previous layer.
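
One aggregate-and-combine update with the MEAN aggregator, concatenation, and ReLU can be sketched as follows. The neighbor-list representation, the function name, and the identity weight matrix (chosen so that the numbers are easy to follow by hand) are assumptions for illustration.

```python
import numpy as np

def mean_aggregate_update(neighbors, H, W):
    """One layer: ReLU(W @ CONCAT(h_v, MEAN of neighbor vectors)) per node.

    neighbors[v] lists the neighbor indices of node v, H has shape (n, d),
    and W has shape (d_out, 2 * d).
    """
    n, d = H.shape
    H_new = np.zeros((n, W.shape[0]))
    for v in range(n):
        if neighbors[v]:
            aggregated = H[neighbors[v]].mean(axis=0)   # AGGREGATE = MEAN
        else:
            aggregated = np.zeros(d)                    # node without neighbors
        combined = np.concatenate([H[v], aggregated])   # COMBINE by concatenation
        H_new[v] = np.maximum(W @ combined, 0.0)        # ReLU activation
    return H_new

# Target node 0 with four neighbors (a, b, c, d in the text are nodes 1..4 here).
neighbors = [[1, 2, 3, 4], [0], [0], [0], [0]]
H = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [2.0, 2.0],
              [4.0, 0.0],
              [2.0, 2.0]])
W = np.eye(4)  # identity weights keep the arithmetic visible

H_next = mean_aggregate_update(neighbors, H, W)
```

With these inputs the first row of `H_next` is the target node's own vector followed by the element-wise mean of its four neighbors.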

In one embodiment, after step S1.3.2, the method further comprises:

Specifically, since the GCN is a supervised model, training requires the true labels of the nodes, so that the model parameters are adjusted continually during backpropagation until they are optimal. The node features finally obtained are the features updated in the hidden layer of the GCN, i.e. the node features updated in S1.3.2 (for example, 32-dimensional). The output probability value (also called the predicted value; in this embodiment it is 2-dimensional, the final softmax prediction, which is driven ever closer to the true label) is used for parameter tuning during GCN backpropagation.

converting the source file into an abstract syntax tree using a javalang tool;

Specifically, the source file is converted into an abstract syntax tree using the javalang tool. The input may be a code fragment or a complete Java file, but the code must be complete. The application converts at class granularity, i.e. each Java file is converted into one abstract syntax tree.

Next, the node types to extract are determined. The method extracts only three types of nodes as tokens:

(1) Method call nodes (MethodInvocation) and class instantiation nodes (ClassDeclaration), i.e. the nodes that record method names and class names;

(2) Declaration nodes, such as method declarations and class declarations;

(3) Control flow nodes, including IfStatement, WhileStatement, ForStatement, ThrowStatement, CatchClause, etc., which are simply labeled with their node type.

Because the names of methods, classes, and types are generally specific to a particular project, methods with the same name are either rare across projects or have different functions in different projects. For cross-project prediction, therefore, the extracted tokens are AST node types rather than specific names, which allows the model to generalize.

The abstract syntax tree is traversed and only nodes of the types to be extracted are retained, so that each source file is parsed into a sequence of code tokens $[w_1, w_2, \ldots, w_n]$. A fixed-size vocabulary $V$ is constructed from the code tokens, and word2vec is trained on the vocabulary to convert each code token into a low-dimensional continuous real-valued vector, yielding a token embedding matrix of size $|V| \times d$, where $d$ is the dimension of a code-token vector.
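
Building the fixed-size vocabulary and mapping each file's token sequence to integer ids can be sketched in pure Python. The padding and unknown-token entries, the function names, and the two toy token sequences are assumptions added for illustration; the word2vec training itself is not shown.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special entries, not named in the text

def build_vocab(token_sequences, max_size):
    """Keep the most frequent tokens so the vocabulary has at most max_size entries."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = {PAD: 0, UNK: 1}
    for tok, _ in ranked[: max_size - 2]:
        vocab[tok] = len(vocab)
    return vocab

def encode(seq, vocab, length):
    """Map tokens to ids, truncating or padding to a fixed length."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in seq[:length]]
    return ids + [vocab[PAD]] * (length - len(ids))

# Token sequences of two hypothetical Java files (AST node types).
files = [
    ["ClassDeclaration", "MethodDeclaration", "IfStatement", "MethodInvocation"],
    ["ClassDeclaration", "MethodDeclaration", "ForStatement",
     "MethodInvocation", "MethodInvocation"],
]
vocab = build_vocab(files, max_size=6)
encoded = [encode(seq, vocab, length=6) for seq in files]
```

Row `i` of the token embedding matrix would then hold the vector of the token whose id is `i`.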

inputting the sequence of code tokens of each class file into a sequence of GRU units, wherein each code token $w_i$ is represented by a real-valued vector $x_i = [x_1, x_2, \ldots, x_d]$ and input into one GRU unit, and each GRU unit is trained by predicting the next code token in the sequence;

the trained GRU units generating, for each code token, a state output vector $s_i = [s_1, s_2, \ldots, s_d]$ that captures the semantic distribution of the code token according to its context;

Specifically, since the GRU is a recurrent network, all GRU units share the same model parameters. The dimension of the output vector $s_i$ may differ from that of the input vector $x_i$; to simplify the model, this embodiment keeps the two the same. $s_i$ captures the semantic distribution of the code token according to its context: the output state vector $s_i$ of code token $w_i$ is used to predict the next code token $w_{i+1}$ by computing a posterior distribution from the context semantics of the preceding code tokens $w_{1:i}$.

In the implementation, a Java file is converted into a sequence of code tokens, and each code token $w_i$ is represented by a real-valued vector $x_i = [x_1, x_2, \ldots, x_d]$. At time $i$, the transition equations of the GRU include:

$$r_i = \sigma(W_r w_i + U_r h_{i-1} + b_r)$$

$$z_i = \sigma(W_z w_i + U_z h_{i-1} + b_z)$$

where $r_i$ is the reset gate, which determines how much of the information from the previous time step is kept, and $z_i$ is the update gate, which determines how the current memory content $\tilde{h}_i$ and the information $h_{i-1}$ from the previous time step are passed on to the next unit. These two gating vectors determine which information ultimately becomes the output of the gated recurrent unit. $\tilde{h}_i$ is the current memory content, and $h_i$ is the information finally passed to the next unit. The $W$ and $U$ terms are weight matrices, $d$ is the input dimension of the real-valued vector, and $m$ is the output dimension. $b_r$, $b_z$, $b_h$ are bias values.
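
A single GRU transition can be sketched in NumPy. The two gate equations follow the text; the candidate state and the final interpolation are written in the commonly used GRU form, which is an assumption because those two equations do not survive in the text above. The parameter names, dimensions, and random initialization are illustrative, and no training is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d, m, rng):
    """Random weights for the reset (r), update (z) and candidate (h) parts."""
    params = {}
    for gate in "rzh":
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=(m, d))
        params[f"U_{gate}"] = rng.normal(scale=0.1, size=(m, m))
        params[f"b_{gate}"] = np.zeros(m)
    return params

def gru_step(x, h_prev, p):
    """One transition from h_{i-1} to h_i for input vector x."""
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])   # update gate
    # Assumed standard form of the current memory content (candidate state).
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Assumed convention: z weights the candidate, (1 - z) the previous state.
    return (1.0 - z) * h_prev + z * h_cand

def run_gru(X, p):
    """Run the GRU over a token-vector sequence X of shape (n, d)."""
    h = np.zeros(p["b_r"].shape[0])
    states = []
    for x in X:
        h = gru_step(x, h, p)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d = m = 8                               # same input and output dimension
params = init_params(d, m, rng)
X = rng.normal(size=(5, d))             # five token vectors of one file
states = run_gru(X, params)             # one state vector per token
```

The same `params` are reused at every position, which reflects the parameter sharing across GRU units.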

To further enhance the ability of the recurrent layer to capture dependency information, the present application employs a bidirectional GRU, in which the hidden states of the two directions are concatenated to form a new state:

$$h_i = \left[\overrightarrow{h_i} \, ; \overleftarrow{h_i}\right]$$

where $\overrightarrow{h_i}$ is the unit information obtained by the forward GRU at time $i$ and $\overleftarrow{h_i}$ is the unit information obtained by the backward GRU; the result is finally fed into the pooling layer for sampling. Because different statements are considered to differ in importance (a method call statement, for example, may carry more information), max pooling is used by default to capture the most important semantic information. The vector finally produced is the semantic feature vector representation of the Java file.
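
The concatenation of the two directions followed by max pooling can be sketched as below. The state values are made-up numbers standing in for GRU outputs, and the function names are hypothetical; the point is only the alignment of the backward states and the pooling over token positions.

```python
import numpy as np

def align_backward(states_on_reversed_input):
    """States computed on the reversed sequence, flipped back to token order."""
    return states_on_reversed_input[::-1]

def bidirectional_states(forward_states, backward_states):
    """Concatenate forward and backward states position by position."""
    return np.concatenate([forward_states, backward_states], axis=1)

def max_pool(states):
    """Element-wise maximum over all token positions gives one file vector."""
    return states.max(axis=0)

# Made-up states for a three-token file, two dimensions per direction.
forward = np.array([[0.1, -0.2],
                    [0.4,  0.0],
                    [0.3,  0.5]])
backward_raw = np.array([[-0.1, 0.7],    # produced for the last token first
                         [ 0.2, 0.2],
                         [ 0.9, 0.1]])

H = bidirectional_states(forward, align_backward(backward_raw))
file_vector = max_pool(H)                # semantic feature vector of the file
```

Max pooling makes the file vector independent of the number of tokens, so files of different lengths yield vectors of the same size.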

The implementation process of step S3 is as follows:

First, each Java file is labeled as defective or non-defective according to the defects reported for each project. Next, the different types of features are extracted from the software system according to steps S1 and S2 (as shown in Fig. 1), and the extracted software dependency features are combined with the source-code semantic features to train a machine learning classifier. Finally, the trained model is used to predict whether a new instance is defective or non-defective. For intra-project defect prediction, the training and test instances come from the same project. For cross-project defect prediction, the training and test instances come from different projects.

In a specific implementation, the application adopts a multi-layer perceptron as the prediction model. As shown in Fig. 3, for intra-project defect prediction the training and test instances both come from the same project, namely project A. For cross-project defect prediction, a prediction model is trained on the instances of project A and then used to predict the test instances of project B.
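
The feature combination and a cross-project run can be sketched with a small multi-layer perceptron written in NumPy. Everything numeric here is an assumption: the features and labels are random placeholders rather than real project data, the semantic feature size of 16, the hidden size, the learning rate, and the number of epochs are arbitrary, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mlp(X, y, hidden=8, lr=0.1, epochs=200, seed=0):
    """One-hidden-layer perceptron trained by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        A1 = np.maximum(X @ W1 + b1, 0.0)        # ReLU hidden layer
        p = sigmoid(A1 @ W2 + b2)                # defect probability
        g = (p - y) / len(y)                     # gradient of binary cross-entropy
        grad_A1 = np.outer(g, W2) * (A1 > 0)     # backpropagate through ReLU
        W2 -= lr * (A1.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ grad_A1)
        b1 -= lr * grad_A1.sum(axis=0)
    return W1, b1, W2, b2

def predict_proba(params, X):
    W1, b1, W2, b2 = params
    return sigmoid(np.maximum(X @ W1 + b1, 0.0) @ W2 + b2)

rng = np.random.default_rng(1)

def placeholder_project(n):
    """Random stand-ins for the two feature types and the defect labels."""
    dep = rng.normal(size=(n, 32))               # dependency features (GCN)
    sem = rng.normal(size=(n, 16))               # semantic features (BiGRU)
    labels = (dep[:, 0] + sem[:, 0] > 0).astype(float)
    return dep, sem, labels

dep_a, sem_a, y_a = placeholder_project(60)      # project A: training
dep_b, sem_b, y_b = placeholder_project(20)      # project B: testing

X_a = np.concatenate([dep_a, sem_a], axis=1)     # hybrid features
X_b = np.concatenate([dep_b, sem_b], axis=1)

params = train_mlp(X_a, y_a)
proba_b = predict_proba(params, X_b)
pred_b = (proba_b >= 0.5).astype(int)            # 1 = defective, 0 = non-defective
```

For intra-project prediction the same code applies, with the training and test rows both drawn from one project.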

The scope of the present application is not limited to the above-described embodiments, and it is apparent that various modifications and variations can be made to the present application by those skilled in the art without departing from the scope and spirit of the application. It is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. The defect prediction method based on the graph neural network and the bidirectional GRU feature extraction is characterized by comprising the following steps of:

S3: combining the software dependency features obtained in step S1 with the source-code semantic features obtained in step S2 to obtain the final hybrid features, and training a classifier for defect prediction based on the obtained hybrid features;

wherein, step S1 includes:

S1.3: constructing a graph neural network model, and taking the software dependency network model constructed in step S1.1 and the node embedding vectors obtained in step S1.2 as inputs of the graph neural network model to obtain the software dependency features;

s1.3 specifically comprises:

S1.3.1: taking the software dependency network model obtained in step S1.1 and the node embedding vectors $X$ obtained in step S1.2 as inputs of the graph neural network, where $X \in \mathbb{R}^{|V| \times d}$, $|V|$ is the number of nodes, $d$ is the dimension of the node embedding vectors, $N(v) = \{u \in V \mid (v, u) \in E\}$ denotes the neighbor set of node $v$, and $h_v^{(k)}$ denotes the hidden embedding vector of target node $v$ at the $k$-th layer of the model, with initial input $h_v^{(0)} = x_v$;

wherein $A$ is the adjacency matrix, $\tilde{A} = A + I_N$ is the adjacency matrix with self-connections added, $I_N$ is the identity matrix, $\tilde{D}$ is the degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $\tilde{A}_{ij}$ denotes the element in row $i$ and column $j$ of $\tilde{A}$, $H^{(l)}$ is the output of layer $l$, $\Theta^{(l)}$ is the parameter matrix of layer $l$, $\sigma(\cdot)$ denotes the activation function, and $H^{(l+1)}$ is the output of layer $l+1$; the updated node features are the software dependency features.

2. The defect prediction method of claim 1, wherein step S1.1 comprises:

3. The defect prediction method of claim 1, wherein step S1.2 comprises:

4. The defect prediction method of claim 1, wherein after step S1.3.2, the method further comprises:

5. The defect prediction method of claim 1, wherein constructing the software system source file into an abstract syntax tree, extracting the token sequence, comprises:

converting the source file into an abstract syntax tree using a javalang tool;

6. The defect prediction method of claim 1, wherein modeling using a bi-directional gating recursion unit obtains deep semantic features of a software system, comprising: