CN114185595B - Code structure guidance-based method name generation method - Google Patents

Code structure guidance-based method name generation method Download PDF

Info

Publication number
CN114185595B
CN114185595B (application CN202111288510.XA)
Authority
CN
China
Prior art keywords
code
node
decoder
hidden state
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111288510.XA
Other languages
Chinese (zh)
Other versions
CN114185595A (en)
Inventor
蔡波
瞿志恒
胡毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111288510.XA priority Critical patent/CN114185595B/en
Publication of CN114185595A publication Critical patent/CN114185595A/en
Application granted granted Critical
Publication of CN114185595B publication Critical patent/CN114185595B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/73Program documentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Machine Translation (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention relates to the technical field of software engineering, and in particular to a method name generation method based on code structure guidance, which comprises the following steps: 1. processing the code text to obtain a code token sequence and a code relationship graph; 2. encoding the code token sequence and the code relationship graph respectively, aligning them according to the mapping relationship, and passing the semantic features and the structural features into a decoder to generate the method name. The invention can effectively generate method names.

Description

Code structure guidance-based method name generation method
Technical Field
The invention relates to the technical field of software engineering, in particular to a method name generation method based on code structure guidance.
Background
Normalized naming is particularly important during program development and maintenance. A method name that accurately describes the semantics of a method can be regarded as a summary of the program's sequential functionality, helping developers understand the program as a whole, improving programming efficiency and avoiding misuse of the method. Because complex programs can be decomposed into multiple sub-methods according to their execution sequence, camel-case naming is typically used to name these sub-methods. In contrast, a method name inconsistent with the program's functionality confuses a developer's understanding of the method, may even lead to misuse of the program, and creates significant difficulties when maintaining and updating it. Because manually written method names may not accurately reflect the semantics and structure of the code, a correct method name may not be given when the code is first written, or the method name may not be updated in time even though new functionality is added during code updates and iterations; irregular and even incorrect method names therefore often appear during code development. To solve this problem, many researchers have proposed different ways of giving an appropriate method name according to the content of the method. For example, an appropriate method name can be given by constructing static rules for source code analysis. However, the effectiveness of these analysis-based methods depends on the constructed rules, which are not applicable to every programming language. An Abstract Syntax Tree (AST) can explicitly describe the structure and content of code, and the source code can be accurately restored by static analysis of the AST. Therefore, many studies recommend method names based on the structural similarity of code ASTs. However, approaches based on AST structural similarity cannot address two major problems. First, if the method name contains words that have not appeared before, the correct method name cannot be inferred; second, such approaches cannot capture the differences between code segments with high structural similarity.
Disclosure of Invention
The present invention is directed to a method name generation method based on code structure guidance that overcomes at least some of the shortcomings of the prior art.
The method name generation method based on code structure guidance according to the invention comprises the following steps:
1. processing the code text to obtain a code token sequence and a code relationship graph;
2. encoding the code token sequence and the code relationship graph respectively, aligning them according to the mapping relationship, and passing the semantic features and the structural features into a decoder to generate the method name.
Preferably, in step one, the code segment is converted into a textual representation of a graph structure by the fAST parsing tool, and the code relationship graph is then generated through analysis and optimization.
Preferably, the encoding comprises context information encoding and code relationship graph encoding;
1) Context information encoding: the set of context vectors is encoded using an RNN-based seq2seq encoder; the context information sequence is encoded with a gated recurrent unit (GRU); for the multiple types of edges in the code relationship graph, a relational graph network first performs one round of message passing and state updating between nodes over each edge type, and the edges are then encoded with a GGNN;
The input sequence V_Fi of the GRU represents a context; each vector represents a sub-token of an entity name in the context; at each time step t, the vector v_t is selected from the n vectors and fed into the encoder to obtain a hidden state vector h_t as the output of that time step; by collecting the outputs of all time steps, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] is obtained, which is the output of the GRU; the input sequence V_Fi = [v_1, v_2, ..., v_n] is learned by the encoder and converted into the hidden vector H; the decoder converts the hidden representation H into the target sequence Y = [y_1, y_2, ..., y_m];
h_t = f(v_t, h_{t-1})    (1)
h'_t = g(y_{t-1}, h'_{t-1}, H_i)    (2)
p(y_t | y_1, ..., y_{t-1}, V_Fi) = s(y_{t-1}, h_t, H_i)    (3)
Equation (1) is the encoder RNN, where h_t is the hidden state at time step t and the function f is the RNN dynamics function; equation (2) gives the hidden state of the decoder, where h'_{t-1} is the decoder hidden state at time step t-1 and the function g is the RNN dynamics function; equation (3) is used for prediction, where the function s is a likelihood calculation function;
2) Code relationship graph encoding: the code relationship graph is encoded by a GGNN; the graph G = (V, E, X) consists of a node set V, an edge set E = (E_1, ..., E_K) and node embeddings X, where K is the number of edge types; any node u ∈ V corresponds to a node embedding X_u ∈ R^{d_h}, where d_h is the dimension of the node embedding; the specific message passing process is as follows:
2.1. Each node sends a message to its neighbours; the message m_u^(t) of each node is computed from its current hidden state h_u^(t-1) by an edge-type-dependent function f_k, represented by a simple linear layer for edge type k at time step t; the hidden state h_u^(0) of each node is initialized with the corresponding node embedding X_u; the formulas are as follows:
h_u^(0) = X_u,  m_u^(t) = f_k(h_u^(t-1)) = W_k h_u^(t-1) + b_k;
2.2. Each node u aggregates the messages from its neighbours by summation; N(u) denotes the set of neighbours of node u; the formula is as follows:
M_u^(t) = Σ_{w∈N(u)} m_w^(t);
2.3. Each node u updates its state for the current time step according to the aggregated message M_u^(t); the update function is a gated recurrent unit GRU;
The message passing process is unrolled over T time steps, and the hidden state h_u^(T) of each node u at the last time step is taken as its node representation; the global graph state r_g is obtained by a weighted sum of all node representations; the weight of each node is computed from the concatenation of its hidden representation h_u and its node embedding X_u, specifically:
r_g = Σ_{u∈V} σ(W_i [h_u : X_u]) ⊙ (W_j h_u);
W_i and W_j are two learnable matrices, and σ(·) is the sigmoid function; the two outputs are multiplied element-wise, and finally all weighted node representations are summed.
Preferably, the decoding process is:
3.1. At any time step t, the decoder receives the hidden state s_t of the decoding process and, at the same time, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] transmitted from the encoder; the attention score e_i of each step is computed as follows:
e_i = v^T tanh(W_1 h_i + W_2 s_t);
W_1 and W_2 are weight matrices, and v is a weight vector;
3.2. The probability distribution α_t of the attention weights is obtained with the softmax function, as follows:
e_t = [e_1, e_2, ..., e_n];
α_t = softmax(e_t);
α_t is used to weight and sum the list of hidden state vectors, yielding the output a_t of the attention module;
3.3. The output a_t of the attention module is concatenated with the decoder hidden state s_t to obtain [a_t : s_t] as the output of the decoder side.
The invention optimizes the structure of the code relationship graph, so that the structural information of the code is richer and more reasonable and the model acquires the structural features of the code more easily. The invention establishes a mapping relationship between code text tokens and code relationship graph nodes, thereby aligning the semantic information and the structural information of the code. The invention modifies the decoding process of the sequence-to-sequence model so that each decoding step receives both the semantic information and the structural information of the code, and the method name generated by the model therefore comprehensively takes the semantics and the structure of the code into account.
Drawings
Fig. 1 is a flowchart of the method name generation method based on code structure guidance in embodiment 1;
Fig. 2 is a schematic diagram of an abstract code sequence in embodiment 1;
Fig. 3 is a diagram showing the input sequence length distribution in embodiment 1.
Detailed Description
For a further understanding of the present invention, the invention will be described in detail below with reference to the drawings and an embodiment. It is to be understood that the embodiment is illustrative of the present invention and is not intended to be limiting.
Example 1
As shown in fig. 1, the present embodiment provides a method name generation method based on code structure guidance, which comprises the following steps:
1. processing the code text to obtain a code token sequence and a code relationship graph;
2. encoding the code token sequence and the code relationship graph respectively, aligning them according to the mapping relationship, and passing the semantic features and the structural features into a decoder to generate the method name.
Code context information
Our first step is to build context information from the source code. For a complete code segment, context information could be constructed from the internal context, the sibling context and the enclosing context separately; however, since we extract the method body, we directly extract names from the program entities in the text of the method body itself, the return type and the types in the interface, which we call the abstract code sequence, as shown in fig. 2. Naming conventions in code are not directly suitable for semantic extraction: in fig. 2, "putInteger" has little meaning as a single token, while "put integer" expresses the actual semantics of putting an integer. Compound names are therefore decomposed into sub-tokens, which are collected into a sequence representing the context information in the order in which they appear in the source code; the first line of code text in the example of fig. 2 is decomposed into "preferences put integer string key int val". Meanwhile, special characters in the sub-sequences that carry no semantic information are deleted; for example, the underscore "_" commonly used to separate the words of a variable name is deleted after the sub-tokens of the variable name are extracted.
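The sub-token decomposition can be illustrated with a short sketch. The following Python snippet is a minimal illustration only (the function name split_identifier and the regular expression are our own assumptions, not part of the patented implementation); it splits camel-case and underscore-separated identifiers into lower-cased sub-tokens and reproduces the decomposition of the first code line of fig. 2.

import re

def split_identifier(name):
    # Split on underscores first, then on camel-case boundaries, and lower-case the pieces.
    parts = []
    for piece in name.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", piece))
    return [p.lower() for p in parts if p]

tokens = []
for ident in ["preferences", "putInteger", "stringKey", "int", "val"]:
    tokens.extend(split_identifier(ident))
print(" ".join(tokens))  # -> preferences put integer string key int val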
The code context information is represented as a sequence of vectors, where each vector represents a sub-token. The goal of this step is to generate, for a given method, a representation that integrates all of its contextual semantic information. The code relationship graph is likewise represented as a vector through graph neural network encoding; the goal of that step is to generate, for the given method, a representation that integrates its syntactic structure information. The overall representation of the method is obtained by combining the two representations.
Code relationship graph
fAST is a source code abstract syntax tree parsing tool that accelerates the parsing of abstract syntax trees by maintaining binary-form level equivalence with the source code files. The fAST tool provides a code-graph form derived from its syntactic analysis, and we use it to generate the corresponding code graph data from the source code. The overall framework of the code relationship graph is built on the abstract syntax tree, and several special kinds of nodes receive additional processing to enrich its semantics. A "NAME" node is a name node containing a class name, method name, parameter name and the like. Name nodes occupy an important position in the semantic expression of the code and play a key role in method name generation, so their semantic information is preserved and the nodes are sub-tokenized. Likewise, an "OPERATOR" node is an operation node that generally represents call operations and arithmetic operations, such as the addition operator "+", the and operator "&" and the call operator. Since operation nodes also have a large influence on code semantics, their semantic information is preserved as well. A "dummy" node represents constants such as strings and integers, which we abstract into constant nodes. There are two main reasons: first, the semantic information of string literals appearing in the code contributes little to method name generation; second, encoding the strings in the source code makes it very difficult to compress their rich semantics. The constant nodes are therefore treated as described above. Other nodes need no special treatment; for example, we do not distinguish between the "BLOCK {" and "BLOCK }" nodes that delimit a scope, but regard both as "BLOCK" nodes, since they only indicate the scope information of the code. For the "last-use" edges produced by the original tool, which record the last-use relationships of both symbols and variables, we only retain the last-use relationships of variables. As with the sub-token sequence, variable name nodes are sub-tokenized. In addition, to keep the experiment rigorous, the method name nodes are masked in advance with a special string, so that the semantic information of the method name is not exposed beforehand. In general, the code relationship graph characterizes the relationships between the nodes of the code; these relationships are of multiple types and together form a complete code relationship graph.
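The node-level rules described above can be summarized in a small sketch. The following Python snippet is an illustrative assumption only (the node labels NAME, OPERATOR, DUMMY and BLOCK and the helper names are assumed for illustration and do not reproduce the exact output of the parsing tool); it shows how each node kind is normalized before the code relationship graph is built.

import re

METHOD_NAME_MASK = "<METHOD_NAME>"

def _subtokens(name):
    # lower-cased sub-tokens of a compound identifier
    return [p.lower() for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", name)]

def normalize_node(kind, text, is_method_name=False):
    if is_method_name:
        return [METHOD_NAME_MASK]          # mask the method name in advance
    if kind == "NAME":                     # class/method/parameter names: keep semantics, sub-tokenize
        return _subtokens(text)
    if kind == "OPERATOR":                 # "+", "&", call operator, ...: keep the symbol
        return [text]
    if kind == "DUMMY":                    # string/integer literals: abstract to one constant node
        return ["<CONST>"]
    if kind in ("BLOCK{", "BLOCK}"):       # scope markers: collapse into a single BLOCK label
        return ["BLOCK"]
    return [kind]

print(normalize_node("NAME", "putInteger"))   # ['put', 'integer']
print(normalize_node("DUMMY", '"hello"'))     # ['<CONST>']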
The semantic information and the syntactic information of the code therefore need to be encoded jointly in a hybrid manner.
Context information encoding
The set of context vectors is encoded using an RNN-based seq2seq encoder; the context information sequence is encoded with a gated recurrent unit (GRU); for the multiple types of edges in the code relationship graph, a relational graph network first performs one round of message passing and state updating between nodes over each edge type, and the edges are then encoded with a GGNN.
The input sequence V_Fi of the GRU represents a context. Each vector represents a sub-token of an entity name in the context. At each time step t, the vector v_t is selected from the n vectors and fed into the encoder, and a hidden state vector h_t is obtained as the output of that time step. By collecting the outputs of all time steps, we obtain the list of hidden state vectors H_i = [h_1, h_2, ..., h_n], which is the output of the GRU. In principle, the input sequence V_Fi = [v_1, v_2, ..., v_n] is learned by the encoder and converted into the hidden vector H. The decoder converts the hidden representation H into the target sequence Y = [y_1, y_2, ..., y_m].
h_t = f(v_t, h_{t-1})    (1)
h'_t = g(y_{t-1}, h'_{t-1}, H_i)    (2)
p(y_t | y_1, ..., y_{t-1}, V_Fi) = s(y_{t-1}, h_t, H_i)    (3)
Equation (1) is the encoder RNN, where h_t is the hidden state at time step t and f is the RNN dynamics function; equation (2) gives the hidden state of the decoder, where h'_t is the decoder hidden state at time step t and g is the RNN dynamics function; equation (3) is used for prediction, where s is a likelihood calculation function. Finally we obtain H_i = [h_1, h_2, ..., h_n] and h_n; this is the output of the encoder, which is used by the attention layer. Since not all sub-tokens in a context are equally important, our goal is to pay more attention to certain sub-tokens, which we achieve through the attention mechanism.
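The context encoder can be sketched as follows. This PyTorch snippet is a minimal sketch under assumed names and dimensions (ContextEncoder, d_model = 256, the toy vocabulary size), not the patented implementation; it embeds a sub-token sequence, runs it through a GRU, and returns the per-step states H_i and the final state h_n for the attention layer and the decoder.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, n) indices of the context sub-tokens
        v = self.embed(token_ids)      # (batch, n, d_model), i.e. V_Fi
        H, h_n = self.gru(v)           # H: (batch, n, d_model), h_n: (1, batch, d_model)
        return H, h_n.squeeze(0)       # hidden state list H_i and final state h_n

encoder = ContextEncoder(vocab_size=10000)
H, h_n = encoder(torch.randint(0, 10000, (2, 12)))   # toy batch of two 12-token contexts
print(H.shape, h_n.shape)                            # torch.Size([2, 12, 256]) torch.Size([2, 256])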
Code relationship graph encoding
The code relationship graph is encoded by a GGNN. Unlike natural language, source code contains complex and readily available structural information, which can be represented by the AST. Conventional sequence encoders treat the source code as plain text and ignore this rich structural information, whereas a graph neural network can be applied directly to the input graph and thereby fully capture the syntactic and semantic information of the source code. Our graph encoder is based on the graph component proposed by Fernandes et al., which follows the gated graph neural network (GGNN). The graph G = (V, E, X) consists of a node set V, an edge set E = (E_1, ..., E_K) and node embeddings X, where K is the number of edge types. Any node u ∈ V corresponds to a node embedding X_u ∈ R^{d_h}, where d_h is the dimension of the node embedding. The specific message passing process is as follows:
2.1. Each node sends a message to its neighbours. The message m_u^(t) of each node is computed from its current hidden state h_u^(t-1) by an edge-type-dependent function f_k. In our work, a simple linear layer represents f_k for edge type k at time step t. The hidden state h_u^(0) of each node is initialized with the corresponding node embedding X_u: h_u^(0) = X_u, m_u^(t) = f_k(h_u^(t-1)) = W_k h_u^(t-1) + b_k.
2.2. Each node u aggregates the messages from its neighbours by summation: M_u^(t) = Σ_{w∈N(u)} m_w^(t), where N(u) denotes the set of neighbours of node u.
2.3. Each node u updates its state for the current time step according to the aggregated message M_u^(t); the update function is a gated recurrent unit (GRU): h_u^(t) = GRU(M_u^(t), h_u^(t-1)).
The message passing process is unrolled over T time steps, and the hidden state h_u^(T) of each node u at the last time step is taken as its node representation. Furthermore, the global graph state r_g is obtained by a weighted sum of all node representations. The weight of each node is computed from the concatenation of its hidden representation h_u and its node embedding X_u, specifically:
r_g = Σ_{u∈V} σ(W_i [h_u : X_u]) ⊙ (W_j h_u);
W_i and W_j are two learnable matrices, and σ(·) is the sigmoid function. The two outputs are multiplied element-wise, and finally all weighted node representations are summed.
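A compact sketch of this encoder is given below. The PyTorch snippet is an illustrative assumption (the class name, the representation of edges as per-type index lists and the number of propagation steps are our own choices, not the patented code); it applies one linear layer per edge type for the messages, sums incoming messages, updates node states with a GRU cell, and computes the gated readout r_g.

import torch
import torch.nn as nn

class CodeGraphEncoder(nn.Module):
    def __init__(self, d_h, num_edge_types, steps=4):
        super().__init__()
        self.steps = steps
        self.edge_fc = nn.ModuleList(nn.Linear(d_h, d_h) for _ in range(num_edge_types))
        self.gru = nn.GRUCell(d_h, d_h)
        self.w_i = nn.Linear(2 * d_h, 1)   # gate over the concatenation [h_u : X_u]
        self.w_j = nn.Linear(d_h, d_h)

    def forward(self, X, edges):
        # X: (num_nodes, d_h) node embeddings; edges: list of (src, dst) index
        # tensor pairs, one pair per edge type.  h_u^(0) is initialized with X_u.
        h = X
        for _ in range(self.steps):
            msg = torch.zeros_like(h)
            for k, (src, dst) in enumerate(edges):
                # messages of edge type k, summed at the destination nodes
                msg = msg.index_add(0, dst, self.edge_fc[k](h[src]))
            h = self.gru(msg, h)           # state update with a GRU cell
        gate = torch.sigmoid(self.w_i(torch.cat([h, X], dim=-1)))
        r_g = (gate * self.w_j(h)).sum(dim=0)   # global graph state
        return h, r_g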
Decoding process
In a seq2seq model, the last hidden state on the encoder side is a summary of the whole sentence and has to contain the information of the entire sentence; however, as the sentence grows longer it becomes difficult for a single vector to hold all of that information. The attention mechanism is therefore introduced so that, at each step on the decoder side, a certain part of the encoder side is selected to form the context information and the result of that step is output. The specific implementation is as follows:
First, at any time step t, the decoder receives the hidden state s_t of the decoding process and, at the same time, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] transmitted from the encoder; the attention score e_i of each step is computed as follows:
e_i = v^T tanh(W_1 h_i + W_2 s_t);
W_1 and W_2 are weight matrices, and v is a weight vector.
Next, the softmax function is used to obtain the probability distribution α_t of the attention weights, as follows:
e_t = [e_1, e_2, ..., e_n];
α_t = softmax(e_t);
α_t is then used to weight and sum the list of hidden state vectors, yielding the output a_t of the attention module.
Finally, the output a_t of the attention module is concatenated with the decoder hidden state s_t to obtain [a_t : s_t] as the output of the decoder side.
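One decoding step with this attention can be sketched as follows. The PyTorch snippet is a minimal sketch under assumed shapes (batched tensors, equal encoder and decoder dimensions), not the patented implementation; it computes e_i = v^T tanh(W_1 h_i + W_2 s_t), normalizes with softmax, forms the weighted sum a_t and returns the concatenation [a_t : s_t].

import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W1 = nn.Linear(d, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, 1, bias=False)

    def forward(self, H, s_t):
        # H: (batch, n, d) encoder hidden states, s_t: (batch, d) decoder state
        e_t = self.v(torch.tanh(self.W1(H) + self.W2(s_t).unsqueeze(1))).squeeze(-1)
        alpha_t = torch.softmax(e_t, dim=-1)                  # attention distribution
        a_t = torch.bmm(alpha_t.unsqueeze(1), H).squeeze(1)   # weighted sum of encoder states
        return torch.cat([a_t, s_t], dim=-1), alpha_t         # [a_t : s_t] and alpha_t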
The seq2seq approach, while free to generate text, exhibits several undesirable behaviours, including but not limited to inaccurate reproduction of factual details, inability to handle out-of-vocabulary (OOV) words, and generation of repeated words. The pointer-generator network (PGN) copies words from the source text through pointers, which improves accuracy and the handling of OOV words while preserving the ability to generate new words. This network can be seen as a balance between extractive and abstractive generation methods. We also add a coverage vector to track and control what has already been covered in the source, and we find that coverage is very effective in eliminating repetition.
Pointer network
A pointer network is added to the model so that the abstractive generation capability of the seq2seq model is retained while words can be copied directly from the source text; this improves the accuracy of the generated names and alleviates the OOV problem. At each prediction step the two are combined by dynamically computing a generation probability p_gen ∈ [0, 1]; specifically, at time t:
p_gen = σ(w_{h*}^T h*_t + w_s^T s_t + w_x^T x_t),
where w_{h*}, w_s and w_x are learnable parameters, s_t is the decoder state, h*_t is the context vector, and x_t is the decoder input at this time step. In the decoding stage, an extended dictionary is maintained, i.e. the original dictionary plus all words appearing in the source, and the probability of every token over this extended dictionary is computed:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} α_i^t;
here, if w is an OOV word, P_vocab(w) is 0; similarly, if w does not appear in the source, the latter (copy) term is 0.
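The computation of the extended-dictionary distribution can be sketched as follows. This PyTorch snippet is an illustrative assumption (the tensor layout and the helper name are ours); it spreads p_gen * P_vocab over the vocabulary part of the extended dictionary and scatters the copy mass (1 - p_gen) * α_t onto the positions of the source tokens.

import torch

def extended_distribution(p_vocab, alpha_t, src_ids, p_gen, extended_size):
    # p_vocab: (batch, V) generation distribution over the original dictionary
    # alpha_t: (batch, n) attention weights over the source tokens
    # src_ids: (batch, n) indices of the source tokens in the extended dictionary
    # p_gen:   (batch, 1) generation probability in [0, 1]
    batch, V = p_vocab.shape
    p_final = torch.zeros(batch, extended_size)
    p_final[:, :V] = p_gen * p_vocab                           # generate from the vocabulary
    p_final.scatter_add_(1, src_ids, (1.0 - p_gen) * alpha_t)  # copy from the source text
    return p_final   # OOV source words receive probability only through the copy term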
Duplicate detection
Repetition is a problem that often occurs with seq2seq models, and we introduce a coverage model to address it. Specifically, the attention weights of all previous time steps are summed into a coverage vector c^t, so that previous attention decisions influence the current attention decision; repeated attention to the same location, and hence repeated text, is thereby avoided. The specific calculation is as follows:
c^t = Σ_{t'=0}^{t-1} α^{t'};
the coverage vector is then added to the calculation of the attention weights:
e_i^t = v^T tanh(W_1 h_i + W_2 s_t + w_c c_i^t);
in this way the current decision is influenced by the historical decisions when the attention weights are calculated, so the attention mechanism avoids attending repeatedly to the same location and the generation of repeated words is avoided.
The coverage loss is calculated as follows:
covloss_t = Σ_i min(α_i^t, c_i^t);
the final loss of the model is:
loss_t = loss_gen,t + λ · covloss_t,
where loss_gen,t is the generation (negative log-likelihood) loss at step t and λ is a hyperparameter.
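The coverage bookkeeping can be sketched as follows. This PyTorch snippet is a minimal sketch of the standard coverage update and loss (an assumption about the exact form used); the coverage vector accumulates past attention and the per-step coverage loss penalizes re-attending to positions that are already covered.

import torch

def coverage_step(coverage, alpha_t):
    # coverage, alpha_t: (batch, n) over the source positions
    covloss_t = torch.minimum(alpha_t, coverage).sum(dim=-1)   # sum_i min(alpha_i^t, c_i^t)
    new_coverage = coverage + alpha_t                          # c^{t+1} = c^t + alpha^t
    return new_coverage, covloss_t

# per-step training loss: loss_gen_t + lambda * covloss_t, with lambda a hyperparameter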
Decoding process: in the decoder, we obtain the hidden states H = [h_1, h_2, ..., h_n] of the last GRU layer of the input sequence at every time step, together with the final hidden state h_n of each GRU layer. The decoder receives the input from the encoder step by step and gradually outputs the generated sequence. Here h_n represents the final state of the input sequence, while H represents the state of each sub-token in the sequence. In order to integrate the structural information of the code into the sub-token sequence that represents the textual information, we merge the final representation vector of the code relationship graph with the final state h_n, and merge the node representations with the state sequence H. When the graph data is fed into the graph neural network, the nodes have already propagated information to their neighbours through message passing, and the code relationship graph contains many more nodes than the input sequence requires; it is therefore only necessary to locate the nodes corresponding to the sub-tokens of the input sequence, fetch their node representations after the message iterations in the graph neural network, and integrate those node representations into the state sequence H. Redundant information from the other nodes is thus filtered out, and the model can keep nodes and sequence positions in correspondence while processing the information.
W_i ∈ R^{2d_n×1} is a learnable matrix.
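The alignment and fusion step can be sketched as follows. This PyTorch snippet is an illustrative assumption: the gated combination below is one plausible use of a learnable matrix of shape (2d, 1), not the exact patented operator. A mapping selects, for each sub-token position, the representation of its aligned code-relationship-graph node, the node state is fused into the token state, and the global graph state is merged with the final state h_n.

import torch

def fuse_graph_into_sequence(H, h_n, node_states, r_g, mapping, W_i):
    # H: (n, d) token states, node_states: (num_nodes, d) graph node states,
    # mapping: (n,) index of the graph node aligned with each sub-token,
    # r_g: (d,) global graph state, W_i: (2*d, 1) learnable fusion weights.
    G = node_states[mapping]                               # node states aligned to the sequence
    gate = torch.sigmoid(torch.cat([H, G], dim=-1) @ W_i)  # (n, 1) fusion gate
    H_fused = gate * H + (1.0 - gate) * G                  # structure-aware token states
    h_fused = torch.cat([h_n, r_g], dim=-1)                # merged final state for the decoder
    return H_fused, h_fused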
Experimental analysis
To investigate whether the length of the code token sequence affects the experimental results, we randomly selected 10k samples from the full sample set for analysis and found that most sequences are shorter than 200 tokens, with a small portion concentrated in the interval [200, 400], as shown in fig. 3. We therefore performed experiments with different maximum input sequence lengths.
In order to generate normalized method names for code, the model considers the semantics and the structure of the code jointly, so that the generated method names conform to the code semantics while reflecting the structural characteristics of the code. The whole process is divided into two steps: first, information is extracted from the code token sequence by means of text summarization; then the structural information provided by the code relationship graph assists the decoder in generating the method name. Meanwhile, a mapping mechanism is designed to establish the correspondence between the code token sequence and the graph nodes, so that the code structure information is transferred to the decoder accurately. In the future, we plan to explore models that are better suited to processing code structure data, thereby further improving the quality of the generated method names.
This embodiment provides a lightweight model for code method name generation that comprehensively considers code semantics and structure. To represent the code structure clearly, we construct a completely new code graph called the Code Relationship Graph (CRG). The CRG integrates information such as data flow while maintaining the structure and complexity of the abstract syntax tree, and improves the information density. In our method, we tokenize the input sequence and map the tokens into the CRG according to the matching relationship, thereby establishing the mapping relationship between code text tokens and graph nodes in the CRG. We store this relationship in a mapping matrix so that the corresponding graph node information can be extracted. In this way, we not only fully preserve the integrity of the structural information of the code but also greatly reduce its redundancy. During decoding, the model receives the semantic features and the structural features of the code simultaneously, comprehensively considers the semantics and the structure of the code, and then generates a normalized method name. To further enhance the decoding capability of the model, we introduce a weight-sharing mechanism that lets the encoder and decoder share word embedding information. The embodiment demonstrates the effectiveness of the proposed method on the public dataset java-small with 700K samples, on which it is 1.5%-3.5% higher in the ROUGE metric than state-of-the-art models.
The invention and its embodiments have been described above by way of illustration and not limitation, and what is shown in the accompanying drawings is only one embodiment of the invention; the actual structure is not limited thereto. Therefore, structural modes and embodiments similar to the technical scheme that are designed by a person of ordinary skill in the art without creative effort and without departing from the gist of the present invention shall fall within the scope of protection of the present invention.

Claims (3)

1. A code structure guidance-based method name generation method, characterized by comprising the following steps:
1. processing the code text to obtain a code token sequence and a code relationship graph;
2. encoding the code token sequence and the code relationship graph respectively, aligning them according to the mapping relationship, and passing the semantic features and the structural features into a decoder to generate the method name;
the encoding comprises context information encoding and code relationship graph encoding;
1) Context information encoding: the set of context vectors is encoded using an RNN-based seq2seq encoder; the context information sequence is encoded with a gated recurrent unit GRU; for the multiple types of edges in the code relationship graph, a relational graph network first performs one round of message passing and state updating between nodes over each edge type, and the edges are then encoded with a GGNN;
the input sequence V_Fi of the GRU represents a context; each vector represents a sub-token of an entity name in the context; at each time step t, the vector v_t is selected from the n vectors and fed into the encoder to obtain a hidden state vector h_t as the output of that time step; by collecting the outputs of all time steps, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] is obtained, which is the output of the GRU; the input sequence V_Fi = [v_1, v_2, ..., v_n] is learned by the encoder and converted into the hidden vector H; the decoder converts the hidden representation H into the target sequence Y = [y_1, y_2, ..., y_m];
h_t = f(v_t, h_{t-1})    (1)
h'_t = g(y_{t-1}, h'_{t-1}, H_i)    (2)
p(y_t | y_1, ..., y_{t-1}, V_Fi) = s(y_{t-1}, h_t, H_i)    (3)
Equation (1) is the encoder RNN, where h_t is the hidden state at time step t and the function f is the RNN dynamics function; equation (2) gives the hidden state of the decoder, where h'_{t-1} is the decoder hidden state at time step t-1 and the function g is the RNN dynamics function; equation (3) is used for prediction, where the function s is a likelihood calculation function;
2) Code relationship graph encoding: the code relationship graph is encoded by a GGNN; the graph G = (V, E, X) consists of a node set V, an edge set E = (E_1, ..., E_K) and node embeddings X, where K is the number of edge types; any node u ∈ V corresponds to a node embedding X_u ∈ R^{d_h}, where d_h is the dimension of the node embedding; the specific message passing process is as follows:
2.1. each node sends a message to its neighbours; the message m_u^(t) of each node is computed from its current hidden state h_u^(t-1) by an edge-type-dependent function f_k, represented by a simple linear layer for edge type k at time step t; the hidden state h_u^(0) of each node is initialized with the corresponding node embedding X_u; the formulas are as follows:
h_u^(0) = X_u,  m_u^(t) = f_k(h_u^(t-1)) = W_k h_u^(t-1) + b_k;
2.2. each node u aggregates the messages from its neighbours by summation; N(u) denotes the set of neighbours of node u; the formula is as follows:
M_u^(t) = Σ_{w∈N(u)} m_w^(t);
2.3. each node u updates its state for the current time step according to the aggregated message M_u^(t); the update function is a gated recurrent unit GRU;
the message passing process is unrolled over T time steps, and the hidden state h_u^(T) of each node u at the last time step is taken as its node representation; the global graph state r_g is obtained by a weighted sum of all node representations; the weight of each node is computed from the concatenation of its hidden representation h_u and its node embedding X_u, specifically:
r_g = Σ_{u∈V} σ(W_i [h_u : X_u]) ⊙ (W_j h_u);
W_i and W_j are two learnable matrices, and σ(·) is the sigmoid function; the two outputs are multiplied element-wise, and finally all weighted node representations are summed.
2. The code structure guidance-based method name generation method according to claim 1, characterized in that: in step one, the code segment is converted into a textual representation of a graph structure by the fAST parsing tool, and the code relationship graph is then generated through analysis and optimization.
3. The code structure guidance-based method name generation method according to claim 2, characterized in that the decoding process is as follows:
3.1. at any time step t, the decoder receives the hidden state s_t of the decoding process and, at the same time, the list of hidden state vectors H_i = [h_1, h_2, ..., h_n] transmitted from the encoder; the attention score e_i of each step is computed as follows:
e_i = v^T tanh(W_1 h_i + W_2 s_t);
W_1 and W_2 are weight matrices, and v is a weight vector;
3.2. the probability distribution α_t of the attention weights is obtained with the softmax function, as follows:
e_t = [e_1, e_2, ..., e_n];
α_t = softmax(e_t);
α_t is used to weight and sum the list of hidden state vectors, yielding the output a_t of the attention module;
3.3. the output a_t of the attention module is concatenated with the decoder hidden state s_t to obtain [a_t : s_t] as the output of the decoder side.
CN202111288510.XA 2021-11-02 2021-11-02 Code structure guidance-based method name generation method Active CN114185595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111288510.XA CN114185595B (en) 2021-11-02 2021-11-02 Code structure guidance-based method name generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111288510.XA CN114185595B (en) 2021-11-02 2021-11-02 Code structure guidance-based method name generation method

Publications (2)

Publication Number Publication Date
CN114185595A CN114185595A (en) 2022-03-15
CN114185595B true CN114185595B (en) 2024-03-29

Family

ID=80601815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111288510.XA Active CN114185595B (en) 2021-11-02 2021-11-02 Code structure guidance-based method name generation method

Country Status (1)

Country Link
CN (1) CN114185595B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407051B (en) * 2023-12-12 2024-03-08 武汉大学 Code automatic abstracting method based on structure position sensing


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960506A (en) * 2018-12-03 2019-07-02 复旦大学 A kind of code annotation generation method based on structure perception
CN111597801A (en) * 2019-02-20 2020-08-28 上海颐为网络科技有限公司 Text automatic structuring method and system based on natural language processing
CN111723194A (en) * 2019-03-18 2020-09-29 阿里巴巴集团控股有限公司 Abstract generation method, device and equipment
US10761839B1 (en) * 2019-10-17 2020-09-01 Globant España S.A. Natural language search engine with a predictive writing tool for coding
CN111507070A (en) * 2020-04-15 2020-08-07 苏州思必驰信息科技有限公司 Natural language generation method and device
CN111651198A (en) * 2020-04-20 2020-09-11 北京大学 Automatic code abstract generation method and device
CN112764738A (en) * 2021-01-19 2021-05-07 山东师范大学 Code automatic generation method and system based on multi-view program characteristics
CN113360766A (en) * 2021-06-29 2021-09-07 北京工业大学 Java method name recommendation method based on seq2seq model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"基于seq2seq框架的代码注释生成方法研究";封雯婷;《中国优秀硕士学位论文全文数据库 信息科技辑》;第I138-180页 *
"基于图卷积神经网络的函数自动命名";王堃 等;《计算机***应用》;第256-265页 *

Also Published As

Publication number Publication date
CN114185595A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
US11900261B2 (en) Transfer learning system for automated software engineering tasks
CN110543419B (en) Intelligent contract code vulnerability detection method based on deep learning technology
Yang et al. Learning to prove theorems via interacting with proof assistants
US20220164626A1 (en) Automated merge conflict resolution with transformers
US20200249918A1 (en) Deep learning enhanced code completion system
US11693630B2 (en) Multi-lingual code generation with zero-shot inference
US12045592B2 (en) Semi-supervised translation of source code programs using neural transformers
US11526679B2 (en) Efficient transformer language models with disentangled attention and multi-step decoding
US11829282B2 (en) Automatic generation of assert statements for unit test cases
US11893363B2 (en) Unit test case generation with transformers
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
US20210279042A1 (en) Neural code completion via re-ranking
CN112764738A (en) Code automatic generation method and system based on multi-view program characteristics
CN112162775A (en) Java code annotation automatic generation method based on Transformer and mixed code expression
CN116700780A (en) Code completion method based on abstract syntax tree code representation
CN115048141A (en) Automatic Transformer model code annotation generation method based on graph guidance
CN114185595B (en) Code structure guidance-based method name generation method
CN113065322B (en) Code segment annotation generation method and system and readable storage medium
Pandey Context free grammar induction library using Genetic Algorithms
CN117312559A (en) Method and system for extracting aspect-level emotion four-tuple based on tree structure information perception
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN115167863A (en) Code completion method and device based on code sequence and code graph fusion
WO2024065028A1 (en) Application of an ai-based model to a preprocessed data set
CN116229162A (en) Semi-autoregressive image description method based on capsule network
Meneses et al. Documentation Is All You Need

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant