CN113190269A

CN113190269A - Code reconstruction method based on programming context information

Info

Publication number: CN113190269A
Application number: CN202110408445.3A
Authority: CN
Inventors: 张静宣; 骆君鹏; 梁嘉慧; 刘思远
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2021-04-16
Filing date: 2021-04-16
Publication date: 2021-07-30

Abstract

The invention discloses a code reconstruction method based on programming context information. The method analyzes the requirement document, design document, defect reports and code structure of a programming project according to the project's context information to form a programming context model, and uses that model to build a list of reconstruction sources for a given identifier. It comprises the following steps: (1) segmenting and serializing the code context, comments, requirement document and design document to form text vectors; (2) converting the text vectors into numerical vectors by a machine learning method and embedding the numerical vectors to form vector spaces; (3) clustering the resulting vector spaces with a clustering method; (4) inputting an identifier into the programming context model to form a ranked list of reconstruction candidates, which is recommended to the developer. The invention constructs new reconstruction identifiers more effectively and in a more personalized way, improves software quality, avoids program defects and decay, and assists developers in developing programs.

Description

Code reconstruction method based on programming context information

Technical Field

The invention relates to a code reconstruction method, in particular to a code reconstruction method based on programming context information.

Background

Identifier reconstruction mainly takes two forms: renaming based on naming conventions and renaming based on inconsistency.

Renaming based on naming convention: a naming convention is a set of rules that programmers follow when naming a software entity, in particular the rules that govern how an identifier is formed (including the choice of words and the "grammar" rules). Naming software entities according to a uniform naming convention improves the readability and maintainability of software applications. Naming conventions are context dependent: different programming languages and organizations define their own. For example, the Java language suggests naming entities according to the camel case convention, while C programmers typically use underscores as separators between words.

Finding renaming opportunities based on a naming convention is a separate question from the choice or quality of the convention itself: what matters is not the convention but its violation. In principle, renaming based on a naming convention must first discover the convention used in the context of the software, and then identify the identifiers that violate it as renaming opportunities.
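The two-stage idea described above (infer the convention in use, then flag identifiers that violate it) can be sketched as follows. This is only an illustration under simplifying assumptions: the regular expressions, the function names `convention_of` and `find_violations`, and the majority-vote rule are hypothetical and are not part of the method disclosed here.

```python
import re
from collections import Counter

# Hypothetical patterns for two common conventions; real projects need more cases.
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")
SNAKE = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)+$")


def convention_of(identifier):
    """Return 'camel', 'snake', or None when the identifier shows no convention."""
    if CAMEL.match(identifier):
        return "camel"
    if SNAKE.match(identifier):
        return "snake"
    return None  # single-word names fit either convention


def find_violations(identifiers):
    """Infer the dominant convention by majority vote, then list the violators."""
    styles = {name: convention_of(name) for name in identifiers}
    counts = Counter(style for style in styles.values() if style)
    if not counts:
        return None, []
    dominant = counts.most_common(1)[0][0]
    violations = [n for n, s in styles.items() if s and s != dominant]
    return dominant, violations


names = ["userName", "maxRetryCount", "parse_header", "index", "toArray"]
dominant, violations = find_violations(names)
print(dominant, violations)
```

In this sample the camel case names form the majority, so `parse_header` is the only identifier reported as a renaming opportunity.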

Renaming based on inconsistency: checking for inconsistent identifiers in software is another key technique for identifying renaming opportunities. Identifiers convey domain concepts that are useful for program understanding. They capture the application-specific knowledge that programmers have when writing code. An appropriate identifier should not only reflect the role of its entity concisely, but also represent a concept consistently throughout the program. However, most programs suffer from inconsistent identifiers.

Existing inconsistency-based approaches identify renaming opportunities from inconsistencies between identifiers, and from inconsistencies between an entity's name and its implementation. Approaches that target inconsistencies between identifiers are further classified into synonym-based, clone-based and similarity-based methods.

Existing research concentrates on identifying renaming opportunities, but neglects the questions of which part of an identifier needs to be renamed, what the sources for the new name are, and in what way better identifiers should be recommended to the developer. This part is as important as identifying the renaming opportunity: it determines the degree of interaction with the developer and how readily the developer accepts the recommended identifiers, and therefore affects the effectiveness and significance of identifier reconstruction.

So far, no research has systematically summarized which recommendation sources and which recommendation means are effective for identifier reconstruction. Work on this aspect is still lacking, and further in-depth research is urgently needed.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a code reconstruction method based on programming context information that constructs new reconstruction identifiers in a personalized way and improves software quality.

Technical scheme: a code reconstruction method based on programming context information analyzes the requirement document, design document, defect reports and code structure of a programming project according to the context information of the project, and comprises the following steps:

(1) segmenting and serializing the code context, the comments, the requirement document and the design document to form text vectors;

(2) converting the text vectors into numerical vectors by a machine learning method, and embedding the numerical vectors to form vector spaces;

(3) clustering the resulting vector spaces with a clustering method, so that entries sharing the same programming context are grouped together in the final programming context model;

(4) for a given identifier, forming a ranked list of reconstruction candidates through the programming context model, and recommending the list to the developer.

Further, in the step (1), before segmentation serialization is performed, the code context is first converted into an abstract syntax tree; the code context, the comments, the requirement document and the design document are all represented as text vectors; all code contexts are parsed into abstract syntax trees, and each generated tree is traversed by a depth-first search algorithm to obtain two kinds of tokens: one is the node type and the other is the identifier of the node; this prepares for the vector embedding of the next step.

Further, in the step (2), converting the text vector generated in the segmentation serialization into a numerical vector, and performing vector embedding to form a vector space, including the steps of:

Step 21, descriptive text embedding

A token sequence is provided through the text vector schema of the step (1):

$$NV_{name} \leftarrow E_{PV}(T_N)$$

where $E_{PV}$ is the paragraph vector embedding function, which takes the training set $T_N$ of token sequences as input; the output is the name mapping function $NV_{name}: T_N \rightarrow V_N$, where $V_N$ is the embedded vector space;

Step 22, code context embedding

A two-layer neural network is selected, and the model is constructed as follows:

$$TV_B \leftarrow E_W(T_B)$$

where $E_W$ is the token embedding function, which takes the training set $T_B$ of token sequences as input; the output is the token mapping function $TV_B: TW_B \rightarrow V_{BW}$, where $TW_B$ is the vocabulary of tokens and $V_{BW}$ is the vector space in which the tokens of $TW_B$ are embedded;

after sequence embedding, the code context is finally represented as a two-dimensional vector; a given code context $b$ is represented by the token sequence $T_b = (t_1, t_2, t_3, \ldots, t_k)$, where $t_i \in TW_B$, $i = 1, 2, 3, \ldots, k$; $V_b$ is the two-dimensional numerical vector corresponding to $T_b$, and $V_b$ is inferred as follows:

$$V_b \leftarrow ∫(T_b, TV_B)$$

where ∫ is the function that converts the token sequence of a code context into a two-dimensional vector based on the token mapping function $TV_B$; thus $V_b = (v_1, v_2, v_3, \ldots, v_k) \in V_B$, where $v_i \leftarrow TV_B(t_i)$ and $V_B$ is the set of two-dimensional vectors;

Step 23, generating the vector space

For a code context, the input is the two-dimensional vector generated in step 22, the output is the vector space corresponding to the code context as a whole, and the mapping function is obtained by the following formula:

$$VV_{body} \leftarrow E_{BV}(V_B)$$

where $E_{BV}$ is the embedding function, which takes the two-dimensional vectors $V_B$ of the code contexts as training data and generates the mapping function $VV_{body}$, defined as follows:

$$VV_{body}: V_B \rightarrow V_B'$$

where $V_B'$ is the embedded vector space of the code context.

Further, in the step (3), the vector space generated in the step (2) is divided into different subsets by k-means clustering; partitional clustering divides the data objects into non-overlapping subsets, so that each data object is in exactly one subset; k-means is used to find a user-specified number k of clusters, and the generated vectors are regarded as points in a high-dimensional space to complete the clustering task; the k-means clustering operation is as follows:

first, k initial centroids are selected, where k is a user-specified parameter; each point is assigned to the nearest centroid, and the set of points assigned to one centroid is a cluster;

then, updating the centroid of each cluster according to the assigned cluster points; repeating the assigning and updating steps until the cluster does not change, or equivalently until the centroid does not change;

a proximity measure is used to quantify closeness; for points in Euclidean space, the Euclidean distance is used to assign each point to its nearest centroid;

for data whose proximity measure is the Euclidean distance, the sum of squared errors (SSE) is taken as the objective function to measure cluster quality:

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$$

where dist is the standard Euclidean distance between two objects in Euclidean space, $C_i$ is the ith cluster, $c_i$ is the centroid of $C_i$, $k$ is the number of clusters, and $x$ is an object.

Further, in the step (4), the identifier to be reconstructed is input into the clustered programming context model to obtain a classification output, namely the cluster label of the programming context model that corresponds to the identifier; all identifiers that share this cluster label are identified, and the vector similarity between these identifiers is calculated to obtain the context environment to which the identifier belongs; identifiers corresponding to other context environments similar to that of the identifier are also included, and together these identifiers form the ranked list of reconstruction candidates.

Compared with the prior art, the invention has the following remarkable effects: 1. the code programming context information is utilized to construct a programming context model, so that a new reconstruction identifier is more effectively and individually constructed, the software quality is improved, the program defect and corruption are avoided, and developers are helped to develop the program; 2. inputting the identifier to be reconstructed into a programming context model, outputting a classification label as a hidden reconstruction source, and better performing renaming recommendation to a developer; 3. the concept of renaming recommendation is provided, and a code reconstruction method based on programming context information is provided.

Drawings

FIG. 1 is a schematic overall flow diagram of the present invention;

FIG. 2 is a diagram of a generic and accurate abstract syntax tree for a code fragment according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and the detailed description.

Programming context data is an important source of information for correcting identifiers, yet the context data that already exists for an identifier is not used effectively. It falls into three main types: programmer information, programming project information and programming environment information; programmer and project information are each further divided into historical information and field information. This information is diverse, massive, fast-changing and variable. In view of these characteristics, the invention studies the structure and meaning of context information and the techniques for extracting and classifying its features. Of the three types, programming project context information is the most varied and is the focus of the analysis in the invention. For the programming project context, artifacts such as the requirement document, design document, defect reports and code structure of the project are analyzed, and the associations between identifiers and these software artifacts are constructed to form a programming context model. For a given identifier, the identifiers that share its programming context can be found from the model and listed as its reconstruction sources, yielding a reconstruction source list that is more semantically appropriate and personalized and that supports better renaming recommendations to the developer. The overall process of the invention is shown in FIG. 1.

The implementation of the invention comprises the following steps:

(1) Segmentation serialization: the code context (if the identifier is a method name, the code context is the body of that method), the comments, the requirement document, the design document and the like are segmented and serialized to form text vectors.

(2) Vector embedding: the text vectors are converted into numerical vectors by machine learning methods such as Paragraph Vector and CNNs (Convolutional Neural Networks), and the numerical vectors are embedded to form a vector space.

(3) Clustering output: a clustering method (for example, measuring the similarity of two vectors by their Euclidean distance in the vector space) is applied to the resulting vector spaces, so that entries sharing the same programming context are grouped together in the final programming context model.

(4) Inputting identifiers into the programming context model: a given identifier is input into the programming context model, which outputs the cluster label corresponding to the identifier; all identifiers that share this cluster label are identified, a ranked list of reconstruction candidates is formed by calculating the vector similarity between these identifiers, and the list is recommended to the developer.

The detailed implementation process of the invention is as follows:

step one, segmentation serialization

Before segmentation serialization is carried out, the code context is preprocessed, that is, converted into an Abstract Syntax Tree (AST). In a traditional abstract syntax tree, some leaf nodes labelled SimpleName can interfere with feature learning on code fragments. For example, in Tree a of FIG. 2, the variable node list is represented as (SimpleName, list) and the method node toArray is represented as (SimpleName, toArray), so the two are hard to distinguish at the leaf level of the generic abstract syntax tree. For this reason, the invention refines the syntax tree by subdividing these nodes into different node types; the refined tree is called an accurate abstract syntax tree.

FIG. 2 shows the generic and the accurate abstract syntax tree for a return statement. First, the accurate syntax tree has a simpler structure. Second, some nodes are easier to tell apart in the accurate abstract syntax tree than in the generic one. The node of type String[] is marked as (ArrayType, String[]), the variable node list is simplified to (Variable, list), and the toArray method is simplified to (method, toArray), so the two nodes are easier to distinguish, and the similarity between the generated text vectors is reduced.

In the present invention, code contexts, comments, requirement documents, design documents and the like are all represented as text vectors. All code contexts are parsed into accurate abstract syntax trees, and each generated tree is traversed by the Depth First Search (DFS) algorithm to obtain two kinds of tokens: one is the AST node type and the other is the identifier of the node. For example, the code "double fix" becomes the four-token text vector (PrimitiveType, double, Variable, fix); in the same way, all code contexts can be converted into text vectors of node type and identifier tokens. For comments, requirement documents and design documents, a text vector schema (ObjectType, Function, ParaType, Output) is designed. ObjectType is the type of the object being described; for example, if a sentence of a comment describes the function of a method, the ObjectType is MethodType. Function is the function being described; for example, if the requirement document asks for code that interchanges two numbers, the Function is swap. ParaType is the parameter type of the described object; if a sentence describes the addition of two integers, the ParaType is int. Output is the output of the described object, and is null if there is no output. With this schema, comments, requirement documents, design documents and the like can be converted into text vectors. Once all programming contexts have been converted into text vectors, segmentation serialization is complete and the data is ready for the subsequent vector embedding.
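The depth-first serialization described above can be sketched as follows. The sketch uses Python's standard `ast` module as a stand-in for a Java parser, so the node type names differ from those in FIG. 2; the refined labels `Method`, `Field`, `Variable`, `Parameter` and `MethodDeclaration`, and the function name `serialize`, are hypothetical choices made for illustration only.

```python
import ast


def serialize(source):
    """Flatten source code into (node type, identifier) tokens by pre-order DFS."""
    tree = ast.parse(source)
    # Remember which expressions are being called, so that a method name can be
    # told apart from a plain variable or field (the refinement step).
    call_targets = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    tokens = []

    def visit(node):
        if isinstance(node, ast.expr_context):
            return  # Load/Store markers carry no naming information
        if isinstance(node, ast.Attribute):
            kind = "Method" if id(node) in call_targets else "Field"
            tokens.extend([kind, node.attr])
        elif isinstance(node, ast.Name):
            kind = "Method" if id(node) in call_targets else "Variable"
            tokens.extend([kind, node.id])
        elif isinstance(node, ast.FunctionDef):
            tokens.extend(["MethodDeclaration", node.name])
        elif isinstance(node, ast.arg):
            tokens.extend(["Parameter", node.arg])
        else:
            tokens.append(type(node).__name__)  # node type only, no identifier
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return tokens


tokens = serialize("def to_list(items):\n    return items.to_array()\n")
print(tokens)
```

Because the call target `to_array` is labelled `Method` and `items` is labelled `Variable`, the two leaves no longer share a single generic label, which mirrors the distinction the accurate abstract syntax tree is meant to provide.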

Step two, vector embedding

And converting the text vector generated in the segmentation serialization into a numerical vector, and embedding the vector to form a vector space. For descriptive statements such as comments, requirement documents and design documents, they are embedded into vectors by the Paragraph Vector algorithm. For the code context, text labels are first embedded into vectors by Word2Vec technology, and then these embedded vectors are sent to the convolutional neural network, thereby embedding the whole code context into the vector space. The detailed steps are as follows:

Step 21, descriptive text embedding

Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text such as sentences and paragraphs. It is therefore well suited to embedding descriptive statements such as comments, requirement documents and design documents. Specifically, a token sequence is provided by the text vector schema set forth in step one:

$$NV_{name} \leftarrow E_{PV}(T_N) \tag{1}$$

In formula (1), $E_{PV}$ is the paragraph vector embedding function, which takes the training set $T_N$ of token sequences as input; the output is the name mapping function $NV_{name}: T_N \rightarrow V_N$, where $V_N$ is the embedded vector space. The Paragraph Vector algorithm can embed an entire text sequence into a single vector.

Step 22, code context embedding

Word2Vec is a two-layer neural network whose main function is to embed words, converting each word into a numerical vector. The model is constructed as follows:

$$TV_B \leftarrow E_W(T_B) \tag{2}$$

In formula (2), $E_W$ is the token embedding function, which takes the training set $T_B$ of token sequences as input; the output is the token mapping function $TV_B: TW_B \rightarrow V_{BW}$, where $TW_B$ is the vocabulary of tokens and $V_{BW}$ is the vector space in which the tokens of $TW_B$ are embedded.

After sequence embedding, the code context is finally represented as a two-dimensional vector. A given code context $b$ is represented by the token sequence $T_b = (t_1, t_2, t_3, \ldots, t_k)$, where $t_i \in TW_B$, $i = 1, 2, 3, \ldots, k$; $V_b$ is the two-dimensional vector corresponding to $T_b$, and $V_b$ is inferred as follows:

$$V_b \leftarrow ∫(T_b, TV_B) \tag{3}$$

In formula (3), ∫ is the function that converts the token sequence of a code context into a two-dimensional vector based on the token mapping function $TV_B$. Thus $V_b = (v_1, v_2, v_3, \ldots, v_k) \in V_B$, where $v_i \leftarrow TV_B(t_i)$ and $V_B$ is the set of two-dimensional vectors.
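As a minimal sketch of formula (3), the conversion of a token sequence into a two-dimensional vector is a row-by-row table look-up. The table values below are made up for illustration (in practice they would be learned by Word2Vec), and the function name `embed_sequence` and the zero vector used for unknown tokens are assumptions.

```python
# Hypothetical token-to-vector table TV_B; real values would be learned by Word2Vec.
TV_B = {
    "PrimitiveType": [0.9, 0.1, 0.0],
    "double": [0.8, 0.2, 0.1],
    "Variable": [0.1, 0.9, 0.0],
    "fix": [0.0, 0.7, 0.3],
}
DIM = 3  # embedding dimension of the toy table above


def embed_sequence(T_b, table, dim=DIM):
    """Map each token t_i to v_i = TV_B(t_i); the k rows together form V_b."""
    unknown = [0.0] * dim  # assumption: out-of-vocabulary tokens map to zeros
    return [list(table.get(token, unknown)) for token in T_b]


T_b = ["PrimitiveType", "double", "Variable", "fix"]
V_b = embed_sequence(T_b, TV_B)
for token, row in zip(T_b, V_b):
    print(token, row)
```

The result has one row per token and one column per embedding dimension, so its height varies with the length of the code context. This is why a further mapping is needed to obtain a fixed-length representation.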

Step 23, generating vector space

Descriptive text is a collection of words, and the Paragraph Vector algorithm already maps paragraphs of different lengths to numerical vectors of the same length, which together form a vector space. The generation of the vector space is therefore illustrated here for a code context (such as a method body).

For the code context, an additional mapping function is required, where the input is the two-dimensional vector generated in step 22 and the output is the vector space corresponding to the code context as a whole.

The mapping function is obtained by the following formula:

$$VV_{body} \leftarrow E_{BV}(V_B) \tag{4}$$

In formula (4), $E_{BV}$ is the embedding function (CNNs), which takes the two-dimensional vectors $V_B$ of the code contexts as training data and generates the mapping function $VV_{body}$, defined as follows:

$$VV_{body}: V_B \rightarrow V_B' \tag{5}$$

In formula (5), $V_B'$ is the embedded vector space of the code context.

At this point, the text vector generated in the segmentation serialization is converted into a numerical vector, and vector embedding is performed, so that a vector space is formed.
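To illustrate how a convolutional mapping can turn two-dimensional vectors of different heights into vectors of one fixed length, the following sketch applies a few filters over consecutive token rows and keeps the maximum activation of each filter. It is not the trained network of the invention: the filters are random and untrained, and the names `conv_max_pool` and `random_filters`, the filter width and the pooling choice are all assumptions.

```python
import random


def conv_max_pool(V_b, filters):
    """Slide each filter over consecutive token rows, apply ReLU, keep the maximum."""
    width = len(filters[0])
    pooled = []
    for kernel in filters:
        best = 0.0  # ReLU floor: negative activations are clipped to zero
        for start in range(len(V_b) - width + 1):
            window = V_b[start:start + width]
            activation = sum(
                w * x
                for k_row, v_row in zip(kernel, window)
                for w, x in zip(k_row, v_row)
            )
            best = max(best, activation)
        pooled.append(best)
    return pooled  # one value per filter, independent of len(V_b)


def random_filters(count, width, dim, seed=0):
    """Untrained stand-in filters; a real CNN would learn these weights."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
            for _ in range(count)]


filters = random_filters(count=4, width=2, dim=3)
short_body = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.0]]
long_body = short_body + [[0.0, 0.7, 0.3], [0.5, 0.5, 0.5]]
vec_short = conv_max_pool(short_body, filters)
vec_long = conv_max_pool(long_body, filters)
print(len(vec_short), len(vec_long))
```

Both code contexts end up as vectors with one entry per filter, so they can be placed in the same vector space and compared or clustered regardless of how many tokens each context contains.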

Step three, clustering output

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that objects within a group are similar (related) to one another, while objects in different groups are different (unrelated). The vector space generated as described above needs to be divided into different subsets (clusters) by a clustering technique. Partitional clustering divides data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. k-means clustering is a prototype-based, partitional clustering technique that attempts to find a user-specified number (k) of clusters. In the invention, the generated vectors are regarded as points in a high-dimensional space, and the clustering task is completed with k-means clustering. The operation of k-means clustering is described in detail below.

First, k initial centroids are selected, where k is a user-specified parameter, i.e., the desired number of clusters. Each point is assigned to the nearest centroid and the set of points assigned to one centroid is a cluster. The centroid of each cluster is then updated according to the assigned cluster points. The assigning and updating steps are repeated until the cluster does not change, or equivalently until the centroid does not change.

To assign a point to the nearest centroid, a proximity measure is required to quantify the notion of "nearest" for the data under consideration; for points in Euclidean space, the Euclidean distance is used.

The choice of centroid may vary with the proximity measure of the data and the goal of the clustering. The goal of clustering is usually expressed by an objective function that depends on the proximity of the points to each other or to the cluster centroids. For data whose proximity measure is the Euclidean distance, the sum of squared errors (SSE) is taken as the objective function for measuring cluster quality:

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2 \tag{6}$$

In equation (6), dist is the standard Euclidean distance between two objects in Euclidean space, $C_i$ is the ith cluster, $c_i$ is the centroid of $C_i$, $k$ is the number of clusters, and $x$ is an object.
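The assign-and-update loop and the SSE objective described above can be sketched in plain Python as follows. The sample points, the random choice of initial centroids and the function names are illustrative assumptions; a production system would more likely rely on an established clustering library.

```python
import math
import random


def dist(a, b):
    """Standard Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, k, seed=0, max_iter=100):
    """Basic k-means: repeat assignment and centroid update until labels are stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # k initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:
            break  # clusters did not change, so the centroids will not either
        labels = new_labels
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = mean(members)
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    return sum(dist(centroids[label], p) ** 2 for p, label in zip(points, labels))


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = k_means(points, k=2)
print(labels, sse(points, labels, centroids))
```

A lower SSE for the same k indicates tighter clusters, so the value can be used to compare runs that start from different initial centroids.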

Step four, inputting the identifier into the programming context model

By calculating the similarity between vectors, clustering divides the generated vector space into several dissimilar vector subspaces, and each subspace corresponds to a group of similar context environments. The identifier to be reconstructed is input into the clustered programming context model to obtain a classification output. This output includes not only the context environment to which the identifier belongs, but also the identifiers that belong to other context environments similar to it; these identifiers form the list of reconstruction candidates.

The same vector similarity measure is then used to calculate the similarity between each reconstruction candidate and the identifier to be reconstructed, forming a ranked reconstruction list that is recommended to the developer. This gives developers renaming suggestions that are more personalized and that better fit the semantics of the identifier and the environment in which it is used.
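The final recommendation step can be sketched as follows: candidates are taken from the cluster of the identifier to be reconstructed and ranked by Euclidean distance, the closest first. The identifier names, vectors and cluster labels below are invented for illustration, and the function name `recommend` is hypothetical.

```python
import math


def recommend(target, vectors, labels, top_n=3):
    """Rank identifiers from the target's cluster by distance to the target."""
    cluster = labels[target]
    candidates = [
        (name, math.dist(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target and labels[name] == cluster
    ]
    candidates.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return candidates[:top_n]


# Hypothetical embedded identifiers and their cluster labels from step three.
vectors = {
    "getData": [0.9, 0.1, 0.2],
    "fetchRecords": [0.8, 0.2, 0.1],
    "loadRows": [0.5, 0.5, 0.1],
    "drawChart": [0.1, 0.9, 0.3],
}
labels = {"getData": 0, "fetchRecords": 0, "loadRows": 0, "drawChart": 1}

ranking = recommend("getData", vectors, labels)
for name, distance in ranking:
    print(name, round(distance, 3))
```

Here `drawChart` is excluded because it lies in a different cluster, and the remaining candidates are listed from most to least similar to `getData`.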

Claims

1. A code reconstruction method based on programming context information is characterized in that a requirement document, a design document, a defect report and a code structure of a programming project are analyzed according to the context information of the programming project, and the method comprises the following steps:

2. The method according to claim 1, wherein in step (1), before performing segmentation serialization, the code context is first converted into an abstract syntax tree; the code context, the annotation, the requirement document and the design document are marked as text vectors; all code contexts are parsed by the abstract syntax tree and the generated abstract syntax tree is traversed by a depth first search algorithm to obtain two kinds of tokens: one is the node type and the other is the identifier of the node.

3. The method according to claim 1, wherein in the step (2), the text vector generated in the segmentation serialization is converted into a numerical vector, and vector embedding is performed to form a vector space, and the method comprises the following steps:

Step 21, descriptive text embedding

A token sequence is provided through the text vector schema of the step (1):

$$NV_{name} \leftarrow E_{PV}(T_N)$$

Step 22, code context embedding

$$TV_B \leftarrow E_W(T_B)$$

after sequence embedding, the code context is finally represented as a two-dimensional vector; a given code context $b$ is represented by the token sequence $T_b = (t_1, t_2, t_3, \ldots, t_k)$, where $t_i \in TW_B$, $i = 1, 2, 3, \ldots, k$; $V_b$ is the two-dimensional numerical vector corresponding to $T_b$, and $V_b$ is inferred as follows:

$$V_b \leftarrow ∫(T_b, TV_B)$$

Step 23, generating the vector space

$$VV_{body} \leftarrow E_{BV}(V_B)$$

$$VV_{body}: V_B \rightarrow V_B'$$

where $V_B'$ is the embedded vector space of the code context.

4. The method of claim 1, wherein in step (3), the vector space generated in step (2) is divided into different subsets by k-means clustering, and the partitional clustering divides the data objects into non-overlapping subsets such that each data object is in exactly one subset; k-means is used to find a user-specified number k of clusters, and the generated vectors are regarded as points in a high-dimensional space to complete the clustering task; the k-means clustering operation is as follows:

5. The code restructuring method based on the context information of claim 1, wherein in step (4), for an identifier to be restructured, the identifier is input into the clustered context model to obtain a classification output, a classification cluster label of the programming context model corresponding to the identifier is output, all identifiers of the same classification cluster label of the programming context are identified, and then the context environment to which the identifier belongs is obtained by calculating the vector similarity between all identifiers; and simultaneously, identifiers corresponding to other context environments similar to the identifier context environment are attached, and the corresponding identifiers form a reconstructed alternative ordered list.