CN114185595B - Code structure guidance-based method name generation method - Google Patents

Code structure guidance-based method name generation method Download PDF

Info

Publication number
CN114185595B
CN114185595B (application CN202111288510.XA)
Authority
CN
China
Prior art keywords
code
node
decoder
hidden state
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111288510.XA
Other languages
Chinese (zh)
Other versions
CN114185595A (en)
Inventor
蔡波
瞿志恒
胡毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111288510.XA priority Critical patent/CN114185595B/en
Publication of CN114185595A publication Critical patent/CN114185595A/en
Application granted granted Critical
Publication of CN114185595B publication Critical patent/CN114185595B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/73Program documentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Machine Translation (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention relates to the technical field of software engineering, and in particular to a method name generation method based on code structure guidance, which comprises the following steps: 1. processing the code text to obtain a code token sequence and a code relationship graph; 2. encoding the code token sequence and the code relationship graph respectively, aligning them according to the mapping relationship, and passing the semantic features and the structural features into a decoder to generate the method name. The invention can effectively generate method names.

Description

Code structure guidance-based method name generation method
Technical Field
The invention relates to the technical field of software engineering, in particular to a method name generation method based on code structure guidance.
Background
Normalized naming is particularly important during program development and maintenance. A method name that accurately describes the semantics of a method can be regarded as a summary of the program's sequential functionality, helping developers understand the program as a whole, improving programming efficiency and avoiding misuse of the method. Because complex programs can be decomposed into multiple sub-methods according to their execution sequence, camel-case naming is typically used to name these sub-methods. In contrast, a method name inconsistent with the program's functionality confuses a developer's understanding of the method, may even lead to misuse of the program, and creates significant difficulties when maintaining and updating it. Because manually written method names may not accurately reflect the semantics and structure of the code, a correct method name may not be given when the code is first written, or the method name may not be updated in time even though new functionality is added during code updates and iterations; irregular and even incorrect method names therefore often appear during code development. To solve this problem, many researchers have proposed different ways of giving an appropriate method name according to the content of the method. For example, an appropriate method name can be given by constructing static rules for source code analysis. However, the effectiveness of these analysis-based methods depends on the constructed rules, which are not applicable to every programming language. An Abstract Syntax Tree (AST) can explicitly describe the structure and content of code, and the source code can be accurately restored by static analysis of the AST. Therefore, many studies recommend method names based on the structural similarity of code ASTs. However, approaches based on AST structural similarity cannot address two major problems. First, if the method name contains words that have not appeared before, the correct method name cannot be inferred; second, such approaches cannot capture the differences between code segments with high structural similarity.
Disclosure of Invention
The present invention is directed to a method name generation method based on code structure guidance that overcomes at least some of the shortcomings of the prior art.
The method name generation method based on code structure guidance according to the invention comprises the following steps:
1. processing the code text to obtain a code token sequence and a code relationship graph;
2. encoding the code token sequence and the code relationship graph respectively, aligning them according to the mapping relationship, and passing the semantic features and the structural features into a decoder to generate the method name.
Preferably, in step one, the code segment is converted into a textual representation of a graph structure by the fAST parsing tool, and the code relationship graph is then generated through analysis and optimization.
Preferably, the encoding comprises context information encoding and code relationship graph encoding;
1) Context information encoding: the set of context vectors is encoded using an RNN-based seq2seq encoder; the context information sequence is encoded with a gated recurrent unit (GRU); for the multiple types of edges in the code relationship graph, a relational graph network first performs one round of message passing and state updating between nodes over each edge type, and the edges are then encoded with a GGNN;
The input sequence V_Fi of the GRU represents a context; each vector represents a sub-token of an entity name in the context; at each time step t, the vector v_t is selected from the n vectors and fed into the encoder to obtain a hidden state vector h_t as the output of that time step; by collecting the outputs of all time steps, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] is obtained, which is the output of the GRU; the input sequence V_Fi = [v_1, v_2, ..., v_n] is learned by the encoder and converted into the hidden vector H; the decoder converts the hidden representation H into the target sequence Y = [y_1, y_2, ..., y_m];
h_t = f(v_t, h_{t-1})    (1)
h'_t = g(y_{t-1}, h'_{t-1}, H_i)    (2)
p(y_t | y_1, ..., y_{t-1}, V_Fi) = s(y_{t-1}, h_t, H_i)    (3)
Equation (1) is the encoder RNN, where h_t is the hidden state at time step t and the function f is the RNN dynamics function; equation (2) gives the hidden state of the decoder, where h'_{t-1} is the decoder hidden state at time step t-1 and the function g is the RNN dynamics function; equation (3) is used for prediction, where the function s is a likelihood calculation function;
2) Code relationship graph encoding: the code relationship graph is encoded by a GGNN; the graph G = (V, E, X) consists of a node set V, an edge set E = (E_1, ..., E_K) and node embeddings X, where K is the number of edge types; any node u ∈ V corresponds to a node embedding X_u ∈ R^{d_h}, where d_h is the dimension of the node embedding; the specific message passing process is as follows:
2.1. Each node sends a message to its neighbours; the message m_u^(t) of each node is computed from its current hidden state h_u^(t-1) by an edge-type-dependent function f_k, represented by a simple linear layer for edge type k at time step t; the hidden state h_u^(0) of each node is initialized with the corresponding node embedding X_u; the formulas are as follows:
h_u^(0) = X_u,  m_u^(t) = f_k(h_u^(t-1)) = W_k h_u^(t-1) + b_k;
2.2. Each node u aggregates the messages from its neighbours by summation; N(u) denotes the set of neighbours of node u; the formula is as follows:
M_u^(t) = Σ_{w∈N(u)} m_w^(t);
2.3. Each node u updates its state for the current time step according to the aggregated message M_u^(t); the update function is a gated recurrent unit GRU;
The message passing process is unrolled over T time steps, and the hidden state h_u^(T) of each node u at the last time step is taken as its node representation; the global graph state r_g is obtained by a weighted sum of all node representations; the weight of each node is computed from the concatenation of its hidden representation h_u and its node embedding X_u, specifically:
r_g = Σ_{u∈V} σ(W_i [h_u : X_u]) ⊙ (W_j h_u);
W_i and W_j are two learnable matrices, and σ(·) is the sigmoid function; the two outputs are multiplied element-wise, and finally all weighted node representations are summed.
Preferably, the decoding process is:
3.1. At any time step t, the decoder receives the hidden state s_t of the decoding process and, at the same time, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] transmitted from the encoder; the attention score e_i of each step is computed as follows:
e_i = v^T tanh(W_1 h_i + W_2 s_t);
W_1 and W_2 are weight matrices, and v is a weight vector;
3.2. The probability distribution α_t of the attention weights is obtained with the softmax function, as follows:
e_t = [e_1, e_2, ..., e_n];
α_t = softmax(e_t);
α_t is used to weight and sum the list of hidden state vectors, yielding the output a_t of the attention module;
3.3. The output a_t of the attention module is concatenated with the decoder hidden state s_t to obtain [a_t : s_t] as the output of the decoder side.
The invention optimizes the structure of the code relationship graph, so that the structural information of the code is richer and more reasonable and the model acquires the structural features of the code more easily. The invention establishes a mapping relationship between code text tokens and code relationship graph nodes, thereby aligning the semantic information and the structural information of the code. The invention modifies the decoding process of the sequence-to-sequence model so that each decoding step receives both the semantic information and the structural information of the code, and the method name generated by the model therefore comprehensively takes the semantics and the structure of the code into account.
Drawings
Fig. 1 is a flowchart of the method name generation method based on code structure guidance in embodiment 1;
Fig. 2 is a schematic diagram of an abstract code sequence in embodiment 1;
Fig. 3 is a diagram showing the input sequence length distribution in embodiment 1.
Detailed Description
For a further understanding of the present invention, the invention will be described in detail below with reference to the drawings and an embodiment. It is to be understood that the embodiment is illustrative of the present invention and is not intended to be limiting.
Example 1
As shown in fig. 1, the present embodiment provides a method name generation method based on code structure guidance, which comprises the following steps:
1. processing the code text to obtain a code token sequence and a code relationship graph;
2. encoding the code token sequence and the code relationship graph respectively, aligning them according to the mapping relationship, and passing the semantic features and the structural features into a decoder to generate the method name.
Code context information
Our first step is to build context information from the source code. For a complete code segment, context information could be constructed from the internal context, the sibling context and the enclosing context separately; however, since we extract the method body, we directly extract names from the program entities in the text of the method body itself, the return type and the types in the interface, which we call the abstract code sequence, as shown in fig. 2. Naming conventions in code are not directly suitable for semantic extraction: in fig. 2, "putInteger" has little meaning as a single token, while "put integer" expresses the actual semantics of putting an integer. Compound names are therefore decomposed into sub-tokens, which are collected into a sequence representing the context information in the order in which they appear in the source code; the first line of code text in the example of fig. 2 is decomposed into "preferences put integer string key int val". Meanwhile, special characters in the sub-sequences that carry no semantic information are deleted; for example, the underscore "_" commonly used to separate the words of a variable name is deleted after the sub-tokens of the variable name are extracted.
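The sub-token decomposition can be illustrated with a short sketch. The following Python snippet is a minimal illustration only (the function name split_identifier and the regular expression are our own assumptions, not part of the patented implementation); it splits camel-case and underscore-separated identifiers into lower-cased sub-tokens and reproduces the decomposition of the first code line of fig. 2.

import re

def split_identifier(name):
    # Split on underscores first, then on camel-case boundaries, and lower-case the pieces.
    parts = []
    for piece in name.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", piece))
    return [p.lower() for p in parts if p]

tokens = []
for ident in ["preferences", "putInteger", "stringKey", "int", "val"]:
    tokens.extend(split_identifier(ident))
print(" ".join(tokens))  # -> preferences put integer string key int val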
The code context information is represented as a sequence of vectors, where each vector represents a sub-token. The goal of this step is to generate, for a given method, a representation that integrates all of its contextual semantic information. The code relationship graph is likewise represented as a vector through graph neural network encoding; the goal of that step is to generate, for the given method, a representation that integrates its syntactic structure information. The overall representation of the method is obtained by combining the two representations.
Code relationship graph
fAST is a source code abstract syntax tree parsing tool that accelerates the parsing of abstract syntax trees by maintaining binary-form level equivalence with the source code files. The fAST tool provides a code-graph form derived from its syntactic analysis, and we use it to generate the corresponding code graph data from the source code. The overall framework of the code relationship graph is built on the abstract syntax tree, and several special kinds of nodes receive additional processing to enrich its semantics. A "NAME" node is a name node containing a class name, method name, parameter name and the like. Name nodes occupy an important position in the semantic expression of the code and play a key role in method name generation, so their semantic information is preserved and the nodes are sub-tokenized. Likewise, an "OPERATOR" node is an operation node that generally represents call operations and arithmetic operations, such as the addition operator "+", the and operator "&" and the call operator. Since operation nodes also have a large influence on code semantics, their semantic information is preserved as well. A "dummy" node represents constants such as strings and integers, which we abstract into constant nodes. There are two main reasons: first, the semantic information of string literals appearing in the code contributes little to method name generation; second, encoding the strings in the source code makes it very difficult to compress their rich semantics. The constant nodes are therefore treated as described above. Other nodes need no special treatment; for example, we do not distinguish between the "BLOCK {" and "BLOCK }" nodes that delimit a scope, but regard both as "BLOCK" nodes, since they only indicate the scope information of the code. For the "last-use" edges produced by the original tool, which record the last-use relationships of both symbols and variables, we only retain the last-use relationships of variables. As with the sub-token sequence, variable name nodes are sub-tokenized. In addition, to keep the experiment rigorous, the method name nodes are masked in advance with a special string, so that the semantic information of the method name is not exposed beforehand. In general, the code relationship graph characterizes the relationships between the nodes of the code; these relationships are of multiple types and together form a complete code relationship graph.
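The node-level rules described above can be summarized in a small sketch. The following Python snippet is an illustrative assumption only (the node labels NAME, OPERATOR, DUMMY and BLOCK and the helper names are assumed for illustration and do not reproduce the exact output of the parsing tool); it shows how each node kind is normalized before the code relationship graph is built.

import re

METHOD_NAME_MASK = "<METHOD_NAME>"

def _subtokens(name):
    # lower-cased sub-tokens of a compound identifier
    return [p.lower() for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", name)]

def normalize_node(kind, text, is_method_name=False):
    if is_method_name:
        return [METHOD_NAME_MASK]          # mask the method name in advance
    if kind == "NAME":                     # class/method/parameter names: keep semantics, sub-tokenize
        return _subtokens(text)
    if kind == "OPERATOR":                 # "+", "&", call operator, ...: keep the symbol
        return [text]
    if kind == "DUMMY":                    # string/integer literals: abstract to one constant node
        return ["<CONST>"]
    if kind in ("BLOCK{", "BLOCK}"):       # scope markers: collapse into a single BLOCK label
        return ["BLOCK"]
    return [kind]

print(normalize_node("NAME", "putInteger"))   # ['put', 'integer']
print(normalize_node("DUMMY", '"hello"'))     # ['<CONST>']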
The semantic information and the syntactic information of the code therefore need to be encoded jointly in a hybrid manner.
Context information encoding
The set of context vectors is encoded using an RNN-based seq2seq encoder; the context information sequence is encoded with a gated recurrent unit (GRU); for the multiple types of edges in the code relationship graph, a relational graph network first performs one round of message passing and state updating between nodes over each edge type, and the edges are then encoded with a GGNN.
The input sequence V_Fi of the GRU represents a context. Each vector represents a sub-token of an entity name in the context. At each time step t, the vector v_t is selected from the n vectors and fed into the encoder, and a hidden state vector h_t is obtained as the output of that time step. By collecting the outputs of all time steps, we obtain the list of hidden state vectors H_i = [h_1, h_2, ..., h_n], which is the output of the GRU. In principle, the input sequence V_Fi = [v_1, v_2, ..., v_n] is learned by the encoder and converted into the hidden vector H. The decoder converts the hidden representation H into the target sequence Y = [y_1, y_2, ..., y_m].
h_t = f(v_t, h_{t-1})    (1)
h'_t = g(y_{t-1}, h'_{t-1}, H_i)    (2)
p(y_t | y_1, ..., y_{t-1}, V_Fi) = s(y_{t-1}, h_t, H_i)    (3)
Equation (1) is the encoder RNN, where h_t is the hidden state at time step t and f is the RNN dynamics function; equation (2) gives the hidden state of the decoder, where h'_t is the decoder hidden state at time step t and g is the RNN dynamics function; equation (3) is used for prediction, where s is a likelihood calculation function. Finally we obtain H_i = [h_1, h_2, ..., h_n] and h_n; this is the output of the encoder, which is used by the attention layer. Since not all sub-tokens in a context are equally important, our goal is to pay more attention to certain sub-tokens, which we achieve through the attention mechanism.
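The context encoder can be sketched as follows. This PyTorch snippet is a minimal sketch under assumed names and dimensions (ContextEncoder, d_model = 256, the toy vocabulary size), not the patented implementation; it embeds a sub-token sequence, runs it through a GRU, and returns the per-step states H_i and the final state h_n for the attention layer and the decoder.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, n) indices of the context sub-tokens
        v = self.embed(token_ids)      # (batch, n, d_model), i.e. V_Fi
        H, h_n = self.gru(v)           # H: (batch, n, d_model), h_n: (1, batch, d_model)
        return H, h_n.squeeze(0)       # hidden state list H_i and final state h_n

encoder = ContextEncoder(vocab_size=10000)
H, h_n = encoder(torch.randint(0, 10000, (2, 12)))   # toy batch of two 12-token contexts
print(H.shape, h_n.shape)                            # torch.Size([2, 12, 256]) torch.Size([2, 256])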
Code relationship graph encoding
The code relationship graph is encoded by a GGNN. Unlike natural language, source code contains complex and readily available structural information, which can be represented by the AST. Conventional sequence encoders treat the source code as plain text and ignore this rich structural information, whereas a graph neural network can be applied directly to the input graph and thereby fully capture the syntactic and semantic information of the source code. Our graph encoder is based on the graph component proposed by Fernandes et al., which follows the gated graph neural network (GGNN). The graph G = (V, E, X) consists of a node set V, an edge set E = (E_1, ..., E_K) and node embeddings X, where K is the number of edge types. Any node u ∈ V corresponds to a node embedding X_u ∈ R^{d_h}, where d_h is the dimension of the node embedding. The specific message passing process is as follows:
2.1. Each node sends a message to its neighbours. The message m_u^(t) of each node is computed from its current hidden state h_u^(t-1) by an edge-type-dependent function f_k. In our work, a simple linear layer represents f_k for edge type k at time step t. The hidden state h_u^(0) of each node is initialized with the corresponding node embedding X_u: h_u^(0) = X_u, m_u^(t) = f_k(h_u^(t-1)) = W_k h_u^(t-1) + b_k.
2.2. Each node u aggregates the messages from its neighbours by summation: M_u^(t) = Σ_{w∈N(u)} m_w^(t), where N(u) denotes the set of neighbours of node u.
2.3. Each node u updates its state for the current time step according to the aggregated message M_u^(t); the update function is a gated recurrent unit (GRU): h_u^(t) = GRU(M_u^(t), h_u^(t-1)).
The message passing process is unrolled over T time steps, and the hidden state h_u^(T) of each node u at the last time step is taken as its node representation. Furthermore, the global graph state r_g is obtained by a weighted sum of all node representations. The weight of each node is computed from the concatenation of its hidden representation h_u and its node embedding X_u, specifically:
r_g = Σ_{u∈V} σ(W_i [h_u : X_u]) ⊙ (W_j h_u);
W_i and W_j are two learnable matrices, and σ(·) is the sigmoid function. The two outputs are multiplied element-wise, and finally all weighted node representations are summed.
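A compact sketch of this encoder is given below. The PyTorch snippet is an illustrative assumption (the class name, the representation of edges as per-type index lists and the number of propagation steps are our own choices, not the patented code); it applies one linear layer per edge type for the messages, sums incoming messages, updates node states with a GRU cell, and computes the gated readout r_g.

import torch
import torch.nn as nn

class CodeGraphEncoder(nn.Module):
    def __init__(self, d_h, num_edge_types, steps=4):
        super().__init__()
        self.steps = steps
        self.edge_fc = nn.ModuleList(nn.Linear(d_h, d_h) for _ in range(num_edge_types))
        self.gru = nn.GRUCell(d_h, d_h)
        self.w_i = nn.Linear(2 * d_h, 1)   # gate over the concatenation [h_u : X_u]
        self.w_j = nn.Linear(d_h, d_h)

    def forward(self, X, edges):
        # X: (num_nodes, d_h) node embeddings; edges: list of (src, dst) index
        # tensor pairs, one pair per edge type.  h_u^(0) is initialized with X_u.
        h = X
        for _ in range(self.steps):
            msg = torch.zeros_like(h)
            for k, (src, dst) in enumerate(edges):
                # messages of edge type k, summed at the destination nodes
                msg = msg.index_add(0, dst, self.edge_fc[k](h[src]))
            h = self.gru(msg, h)           # state update with a GRU cell
        gate = torch.sigmoid(self.w_i(torch.cat([h, X], dim=-1)))
        r_g = (gate * self.w_j(h)).sum(dim=0)   # global graph state
        return h, r_g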
Decoding process
In a seq2seq model, the last hidden state on the encoder side is a summary of the whole sentence and has to contain the information of the entire sentence; however, as the sentence grows longer it becomes difficult for a single vector to hold all of that information. The attention mechanism is therefore introduced so that, at each step on the decoder side, a certain part of the encoder side is selected to form the context information and the result of that step is output. The specific implementation is as follows:
First, at any time step t, the decoder receives the hidden state s_t of the decoding process and, at the same time, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] transmitted from the encoder; the attention score e_i of each step is computed as follows:
e_i = v^T tanh(W_1 h_i + W_2 s_t);
W_1 and W_2 are weight matrices, and v is a weight vector.
Next, the softmax function is used to obtain the probability distribution α_t of the attention weights, as follows:
e_t = [e_1, e_2, ..., e_n];
α_t = softmax(e_t);
α_t is then used to weight and sum the list of hidden state vectors, yielding the output a_t of the attention module.
Finally, the output a_t of the attention module is concatenated with the decoder hidden state s_t to obtain [a_t : s_t] as the output of the decoder side.
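One decoding step with this attention can be sketched as follows. The PyTorch snippet is a minimal sketch under assumed shapes (batched tensors, equal encoder and decoder dimensions), not the patented implementation; it computes e_i = v^T tanh(W_1 h_i + W_2 s_t), normalizes with softmax, forms the weighted sum a_t and returns the concatenation [a_t : s_t].

import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W1 = nn.Linear(d, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, 1, bias=False)

    def forward(self, H, s_t):
        # H: (batch, n, d) encoder hidden states, s_t: (batch, d) decoder state
        e_t = self.v(torch.tanh(self.W1(H) + self.W2(s_t).unsqueeze(1))).squeeze(-1)
        alpha_t = torch.softmax(e_t, dim=-1)                  # attention distribution
        a_t = torch.bmm(alpha_t.unsqueeze(1), H).squeeze(1)   # weighted sum of encoder states
        return torch.cat([a_t, s_t], dim=-1), alpha_t         # [a_t : s_t] and alpha_t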
The seq2seq approach, while free to generate text, exhibits several undesirable behaviours, including but not limited to inaccurate reproduction of factual details, inability to handle out-of-vocabulary (OOV) words, and generation of repeated words. The pointer-generator network (PGN) copies words from the source text through pointers, which improves accuracy and the handling of OOV words while preserving the ability to generate new words. This network can be seen as a balance between extractive and abstractive generation methods. We also add a coverage vector to track and control what has already been covered in the source, and we find that coverage is very effective in eliminating repetition.
Pointer network
A pointer network is added to the model so that the abstractive generation capability of the seq2seq model is retained while words can be copied directly from the source text; this improves the accuracy of the generated names and alleviates the OOV problem. At each prediction step the two are combined by dynamically computing a generation probability p_gen ∈ [0, 1]; specifically, at time t:
p_gen = σ(w_{h*}^T h*_t + w_s^T s_t + w_x^T x_t),
where w_{h*}, w_s and w_x are learnable parameters, s_t is the decoder state, h*_t is the context vector, and x_t is the decoder input at this time step. In the decoding stage, an extended dictionary is maintained, i.e. the original dictionary plus all words appearing in the source, and the probability of every token over this extended dictionary is computed:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} α_i^t;
here, if w is an OOV word, P_vocab(w) is 0; similarly, if w does not appear in the source, the latter (copy) term is 0.
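The computation of the extended-dictionary distribution can be sketched as follows. This PyTorch snippet is an illustrative assumption (the tensor layout and the helper name are ours); it spreads p_gen * P_vocab over the vocabulary part of the extended dictionary and scatters the copy mass (1 - p_gen) * α_t onto the positions of the source tokens.

import torch

def extended_distribution(p_vocab, alpha_t, src_ids, p_gen, extended_size):
    # p_vocab: (batch, V) generation distribution over the original dictionary
    # alpha_t: (batch, n) attention weights over the source tokens
    # src_ids: (batch, n) indices of the source tokens in the extended dictionary
    # p_gen:   (batch, 1) generation probability in [0, 1]
    batch, V = p_vocab.shape
    p_final = torch.zeros(batch, extended_size)
    p_final[:, :V] = p_gen * p_vocab                           # generate from the vocabulary
    p_final.scatter_add_(1, src_ids, (1.0 - p_gen) * alpha_t)  # copy from the source text
    return p_final   # OOV source words receive probability only through the copy term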
Duplicate detection
Repetition is a problem that often occurs with seq2seq models, and we introduce a coverage model to address it. Specifically, the attention weights of all previous time steps are summed into a coverage vector c^t, so that previous attention decisions influence the current attention decision; repeated attention to the same location, and hence repeated text, is thereby avoided. The specific calculation is as follows:
c^t = Σ_{t'=0}^{t-1} α^{t'};
the coverage vector is then added to the calculation of the attention weights:
e_i^t = v^T tanh(W_1 h_i + W_2 s_t + w_c c_i^t);
in this way the current decision is influenced by the historical decisions when the attention weights are calculated, so the attention mechanism avoids attending repeatedly to the same location and the generation of repeated words is avoided.
The coverage loss is calculated as follows:
covloss_t = Σ_i min(α_i^t, c_i^t);
the final loss of the model is:
loss_t = loss_gen,t + λ · covloss_t,
where loss_gen,t is the generation (negative log-likelihood) loss at step t and λ is a hyperparameter.
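The coverage bookkeeping can be sketched as follows. This PyTorch snippet is a minimal sketch of the standard coverage update and loss (an assumption about the exact form used); the coverage vector accumulates past attention and the per-step coverage loss penalizes re-attending to positions that are already covered.

import torch

def coverage_step(coverage, alpha_t):
    # coverage, alpha_t: (batch, n) over the source positions
    covloss_t = torch.minimum(alpha_t, coverage).sum(dim=-1)   # sum_i min(alpha_i^t, c_i^t)
    new_coverage = coverage + alpha_t                          # c^{t+1} = c^t + alpha^t
    return new_coverage, covloss_t

# per-step training loss: loss_gen_t + lambda * covloss_t, with lambda a hyperparameter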
Decoding process: in the decoder, we obtain the hidden states H = [h_1, h_2, ..., h_n] of the last GRU layer of the input sequence at every time step, together with the final hidden state h_n of each GRU layer. The decoder receives the input from the encoder step by step and gradually outputs the generated sequence. Here h_n represents the final state of the input sequence, while H represents the state of each sub-token in the sequence. In order to integrate the structural information of the code into the sub-token sequence that represents the textual information, we merge the final representation vector of the code relationship graph with the final state h_n, and merge the node representations with the state sequence H. When the graph data is fed into the graph neural network, the nodes have already propagated information to their neighbours through message passing, and the code relationship graph contains many more nodes than the input sequence requires; it is therefore only necessary to locate the nodes corresponding to the sub-tokens of the input sequence, fetch their node representations after the message iterations in the graph neural network, and integrate those node representations into the state sequence H. Redundant information from the other nodes is thus filtered out, and the model can keep nodes and sequence positions in correspondence while processing the information.
W_i ∈ R^{2d_n×1} is a learnable matrix.
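The alignment and fusion step can be sketched as follows. This PyTorch snippet is an illustrative assumption: the gated combination below is one plausible use of a learnable matrix of shape (2d, 1), not the exact patented operator. A mapping selects, for each sub-token position, the representation of its aligned code-relationship-graph node, the node state is fused into the token state, and the global graph state is merged with the final state h_n.

import torch

def fuse_graph_into_sequence(H, h_n, node_states, r_g, mapping, W_i):
    # H: (n, d) token states, node_states: (num_nodes, d) graph node states,
    # mapping: (n,) index of the graph node aligned with each sub-token,
    # r_g: (d,) global graph state, W_i: (2*d, 1) learnable fusion weights.
    G = node_states[mapping]                               # node states aligned to the sequence
    gate = torch.sigmoid(torch.cat([H, G], dim=-1) @ W_i)  # (n, 1) fusion gate
    H_fused = gate * H + (1.0 - gate) * G                  # structure-aware token states
    h_fused = torch.cat([h_n, r_g], dim=-1)                # merged final state for the decoder
    return H_fused, h_fused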
Experimental analysis
To investigate whether the length of the code token sequence affects the experimental results, we randomly selected 10k samples from the full sample set for analysis and found that most sequences are shorter than 200 tokens, with a small portion concentrated in the interval [200, 400], as shown in fig. 3. We therefore performed experiments with different maximum input sequence lengths.
In order to generate normalized method names for code, the model considers the semantics and the structure of the code jointly, so that the generated method names conform to the code semantics while reflecting the structural characteristics of the code. The whole process is divided into two steps: first, information is extracted from the code token sequence by means of text summarization; then the structural information provided by the code relationship graph assists the decoder in generating the method name. Meanwhile, a mapping mechanism is designed to establish the correspondence between the code token sequence and the graph nodes, so that the code structure information is transferred to the decoder accurately. In the future, we plan to explore models that are better suited to processing code structure data, thereby further improving the quality of the generated method names.
This embodiment provides a lightweight model for code method name generation that comprehensively considers code semantics and structure. To represent the code structure clearly, we construct a completely new code graph called the Code Relationship Graph (CRG). The CRG integrates information such as data flow while maintaining the structure and complexity of the abstract syntax tree, and improves the information density. In our method, we tokenize the input sequence and map the tokens into the CRG according to the matching relationship, thereby establishing the mapping relationship between code text tokens and graph nodes in the CRG. We store this relationship in a mapping matrix so that the corresponding graph node information can be extracted. In this way, we not only fully preserve the integrity of the structural information of the code but also greatly reduce its redundancy. During decoding, the model receives the semantic features and the structural features of the code simultaneously, comprehensively considers the semantics and the structure of the code, and then generates a normalized method name. To further enhance the decoding capability of the model, we introduce a weight-sharing mechanism that lets the encoder and decoder share word embedding information. The embodiment demonstrates the effectiveness of the proposed method on the public dataset java-small with 700K samples, on which it is 1.5%-3.5% higher in the ROUGE metric than state-of-the-art models.
The invention and its embodiments have been described above by way of illustration and not limitation, and what is shown in the accompanying drawings is only one embodiment of the invention; the actual structure is not limited thereto. Therefore, structural modes and embodiments similar to the technical scheme that are designed by a person of ordinary skill in the art without creative effort and without departing from the gist of the present invention shall fall within the scope of protection of the present invention.

Claims (3)

1. A code structure guidance-based method name generation method, characterized by comprising the following steps:
1. processing the code text to obtain a code token sequence and a code relationship graph;
2. encoding the code token sequence and the code relationship graph respectively, aligning them according to the mapping relationship, and passing the semantic features and the structural features into a decoder to generate the method name;
the encoding comprises context information encoding and code relationship graph encoding;
1) Context information encoding: the set of context vectors is encoded using an RNN-based seq2seq encoder; the context information sequence is encoded with a gated recurrent unit GRU; for the multiple types of edges in the code relationship graph, a relational graph network first performs one round of message passing and state updating between nodes over each edge type, and the edges are then encoded with a GGNN;
the input sequence V_Fi of the GRU represents a context; each vector represents a sub-token of an entity name in the context; at each time step t, the vector v_t is selected from the n vectors and fed into the encoder to obtain a hidden state vector h_t as the output of that time step; by collecting the outputs of all time steps, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] is obtained, which is the output of the GRU; the input sequence V_Fi = [v_1, v_2, ..., v_n] is learned by the encoder and converted into the hidden vector H; the decoder converts the hidden representation H into the target sequence Y = [y_1, y_2, ..., y_m];
h_t = f(v_t, h_{t-1})    (1)
h'_t = g(y_{t-1}, h'_{t-1}, H_i)    (2)
p(y_t | y_1, ..., y_{t-1}, V_Fi) = s(y_{t-1}, h_t, H_i)    (3)
Equation (1) is the encoder RNN, where h_t is the hidden state at time step t and the function f is the RNN dynamics function; equation (2) gives the hidden state of the decoder, where h'_{t-1} is the decoder hidden state at time step t-1 and the function g is the RNN dynamics function; equation (3) is used for prediction, where the function s is a likelihood calculation function;
2) Code relationship graph encoding: the code relationship graph is encoded by a GGNN; the graph G = (V, E, X) consists of a node set V, an edge set E = (E_1, ..., E_K) and node embeddings X, where K is the number of edge types; any node u ∈ V corresponds to a node embedding X_u ∈ R^{d_h}, where d_h is the dimension of the node embedding; the specific message passing process is as follows:
2.1. each node sends a message to its neighbours; the message m_u^(t) of each node is computed from its current hidden state h_u^(t-1) by an edge-type-dependent function f_k, represented by a simple linear layer for edge type k at time step t; the hidden state h_u^(0) of each node is initialized with the corresponding node embedding X_u; the formulas are as follows:
h_u^(0) = X_u,  m_u^(t) = f_k(h_u^(t-1)) = W_k h_u^(t-1) + b_k;
2.2. each node u aggregates the messages from its neighbours by summation; N(u) denotes the set of neighbours of node u; the formula is as follows:
M_u^(t) = Σ_{w∈N(u)} m_w^(t);
2.3. each node u updates its state for the current time step according to the aggregated message M_u^(t); the update function is a gated recurrent unit GRU;
the message passing process is unrolled over T time steps, and the hidden state h_u^(T) of each node u at the last time step is taken as its node representation; the global graph state r_g is obtained by a weighted sum of all node representations; the weight of each node is computed from the concatenation of its hidden representation h_u and its node embedding X_u, specifically:
r_g = Σ_{u∈V} σ(W_i [h_u : X_u]) ⊙ (W_j h_u);
W_i and W_j are two learnable matrices, and σ(·) is the sigmoid function; the two outputs are multiplied element-wise, and finally all weighted node representations are summed.
2. The code structure guidance-based method name generation method according to claim 1, characterized in that: in step one, the code segment is converted into a textual representation of a graph structure by the fAST parsing tool, and the code relationship graph is then generated through analysis and optimization.
3. The code structure guidance-based method name generation method according to claim 2, characterized in that the decoding process is as follows:
3.1. at any time step t, the decoder receives the hidden state s_t of the decoding process and, at the same time, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] transmitted from the encoder; the attention score e_i of each step is computed as follows:
e_i = v^T tanh(W_1 h_i + W_2 s_t);
W_1 and W_2 are weight matrices, and v is a weight vector;
3.2. the probability distribution α_t of the attention weights is obtained with the softmax function, as follows:
e_t = [e_1, e_2, ..., e_n];
α_t = softmax(e_t);
α_t is used to weight and sum the list of hidden state vectors, yielding the output a_t of the attention module;
3.3. the output a_t of the attention module is concatenated with the decoder hidden state s_t to obtain [a_t : s_t] as the output of the decoder side.
CN202111288510.XA 2021-11-02 2021-11-02 Code structure guidance-based method name generation method Active CN114185595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111288510.XA CN114185595B (en) 2021-11-02 2021-11-02 Code structure guidance-based method name generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111288510.XA CN114185595B (en) 2021-11-02 2021-11-02 Code structure guidance-based method name generation method

Publications (2)

Publication Number Publication Date
CN114185595A CN114185595A (en) 2022-03-15
CN114185595B true CN114185595B (en) 2024-03-29

Family

ID=80601815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111288510.XA Active CN114185595B (en) 2021-11-02 2021-11-02 Code structure guidance-based method name generation method

Country Status (1)

Country Link
CN (1) CN114185595B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407051B (en) * 2023-12-12 2024-03-08 武汉大学 Code automatic abstracting method based on structure position sensing


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960506A (en) * 2018-12-03 2019-07-02 复旦大学 A kind of code annotation generation method based on structure perception
CN111597801A (en) * 2019-02-20 2020-08-28 上海颐为网络科技有限公司 Text automatic structuring method and system based on natural language processing
CN111723194A (en) * 2019-03-18 2020-09-29 阿里巴巴集团控股有限公司 Abstract generation method, device and equipment
US10761839B1 (en) * 2019-10-17 2020-09-01 Globant España S.A. Natural language search engine with a predictive writing tool for coding
CN111507070A (en) * 2020-04-15 2020-08-07 苏州思必驰信息科技有限公司 Natural language generation method and device
CN111651198A (en) * 2020-04-20 2020-09-11 北京大学 Automatic code abstract generation method and device
CN112764738A (en) * 2021-01-19 2021-05-07 山东师范大学 Code automatic generation method and system based on multi-view program characteristics
CN113360766A (en) * 2021-06-29 2021-09-07 北京工业大学 Java method name recommendation method based on seq2seq model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"基于seq2seq框架的代码注释生成方法研究";封雯婷;《中国优秀硕士学位论文全文数据库 信息科技辑》;第I138-180页 *
"基于图卷积神经网络的函数自动命名";王堃 等;《计算机***应用》;第256-265页 *

Also Published As

Publication number Publication date
CN114185595A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
US11900261B2 (en) Transfer learning system for automated software engineering tasks
CN110543419B (en) Intelligent contract code vulnerability detection method based on deep learning technology
Yang et al. Learning to prove theorems via interacting with proof assistants
US20220164626A1 (en) Automated merge conflict resolution with transformers
US20200249918A1 (en) Deep learning enhanced code completion system
US11693630B2 (en) Multi-lingual code generation with zero-shot inference
US12045592B2 (en) Semi-supervised translation of source code programs using neural transformers
US11526679B2 (en) Efficient transformer language models with disentangled attention and multi-step decoding
US11829282B2 (en) Automatic generation of assert statements for unit test cases
US11893363B2 (en) Unit test case generation with transformers
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
US20210279042A1 (en) Neural code completion via re-ranking
CN112764738A (en) Code automatic generation method and system based on multi-view program characteristics
CN112162775A (en) Java code annotation automatic generation method based on Transformer and mixed code expression
CN116700780A (en) Code completion method based on abstract syntax tree code representation
CN115048141A (en) Automatic Transformer model code annotation generation method based on graph guidance
CN114185595B (en) Code structure guidance-based method name generation method
CN113065322B (en) Code segment annotation generation method and system and readable storage medium
Pandey Context free grammar induction library using Genetic Algorithms
CN117312559A (en) Method and system for extracting aspect-level emotion four-tuple based on tree structure information perception
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN115167863A (en) Code completion method and device based on code sequence and code graph fusion
WO2024065028A1 (en) Application of an ai-based model to a preprocessed data set
CN116229162A (en) Semi-autoregressive image description method based on capsule network
Meneses et al. Documentation Is All You Need

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant