CN111104159A

CN111104159A - Annotation positioning method based on program analysis and neural network

Info

Publication number: CN111104159A
Application number: CN201911321441.0A
Authority: CN
Inventors: 张卫丰; 李小满; ***; 王子元; 张迎周
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-12-19
Filing date: 2019-12-19
Publication date: 2020-05-05

Abstract

The invention relates to an annotation positioning method based on program analysis and a neural network, comprising the following steps: constructing the project to be analyzed; extracting the annotations of each method in the Java project, manually labeling their categories, and constructing a training set for an annotation classifier; training the annotation classifier, classifying the annotations, and extracting those that describe the implementation details of a method; acquiring all variables in each method body; matching the variables in a method against the annotations of that method to find the variables that appear in the annotations; extracting from the method body the code segments related to those variables, and constructing a training set for an annotation positioning model; and training the annotation positioning model and using it to calculate the similarity between an annotation and a code segment, thereby constructing a mapping between code and annotations. By associating annotations with their corresponding code, the invention can help developers understand what the code does and improve development efficiency.

Description

Annotation positioning method based on program analysis and neural network

Technical Field

The invention belongs to the field of software engineering, and particularly relates to an annotation positioning method based on program analysis and a neural network.

Background

As software development grows more complex, cooperation among developers becomes more important. A project usually requires several developers, who must interact with one another or call interfaces provided by other developers. Personnel may also change during development, and during a project handover the new developer has to read code written by others in order to understand its business functions. Understanding an existing API correctly is therefore important for improving development efficiency and for avoiding the reintroduction of bugs.

High-quality code annotations explain the core function of the code accurately and reduce the time maintenance personnel need to understand it. For this reason, developers are often required to add annotations to methods or variables during development so that maintenance documents can be generated. However, because writing a correct, high-quality annotation is costly, developers tend either to add no annotations at all or to write annotations that are easy to understand only from their own perspective. As a result, the annotations are hard to read, it is unclear which piece of code a given annotation describes, and more effort is required to understand the code.

One effective solution to these problems is to locate each annotation to a specific piece of code while the developer reads the code. Such a positioning function helps developers understand what the code does, which improves development efficiency and accuracy. However, no annotation positioning technology exists at present. The main objective of the invention is therefore to develop a technique that automatically generates positioning information between annotations and code, helping developers understand the code and complete development tasks.

Disclosure of Invention

In view of the above problems, the invention aims to provide an annotation positioning method based on program analysis and a neural network, so as to address the poor code readability caused by non-standard annotation writing in software development. The invention positions annotations automatically, which improves the readability of the code, reduces development cost, and improves development efficiency.

In order to achieve the purpose, the invention adopts the technical scheme that:

An annotation positioning method based on program analysis and a neural network comprises the following steps:

S1, downloading a Java open-source project and extracting the method-level annotations in the project;

S2, manually labeling the category of each annotation obtained in step S1 to form a set of < annotation, annotation category > pairs as the annotation classification training set;

S3, preprocessing the training set generated in step S2, and training an annotation classifier using a neural network model;

S4, classifying the annotations of each method in the project with the classifier, extracting the How-type annotations, finding the corresponding code in the method body, and forming a set of < annotation, code segment > pairs as the training set of an annotation positioning model;

S5, preprocessing the training set constructed in step S4, and training the annotation positioning model using a neural network model;

S6, after the annotation positioning model is trained, given an annotation statement and a plurality of code segments in a Java method, outputting the code segment most similar to the annotation, thereby forming a mapping between the annotation and the code segment.

Specifically, in step S2, the Java method-level annotations include What-type annotations and How-type annotations. A What-type annotation describes the functionality of a method, and a How-type annotation describes the specific implementation of a method.

Specifically, in step S3, the preprocessing of the training set means to perform word segmentation on the annotation text, delete rare symbols and stop words therein, construct an annotation vocabulary, and convert the annotation text into a numeric list.

Specifically, in step S4, the training set of the annotation positioning model is constructed as follows: first, all variables in the method body are obtained; the variables are then matched against the How-type annotation of the method to find the variables that appear in the annotation; finally, the code segments related to those variables are found in the method body. One annotation may correspond to multiple code segments, so it is necessary to decide manually which code segment is closest in meaning to the annotation, thereby forming a set of < annotation, code segment > pairs.

Specifically, in step S5, the annotation positioning model is a recurrent neural network, which maps the code and the annotation to a vector space, and then constructs the mapping relationship between the annotation and the code by calculating the cosine similarity between the annotation vector and the code vector.
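The similarity computation described above can be written out as follows. The symbols are notation assumed here for illustration and do not appear in the source: $\mathbf{v}_a$ is the vector of annotation $a$, $\mathbf{v}_c$ is the vector of code segment $c$, and $C$ is the set of candidate code segments in the method.

```latex
\mathrm{sim}(a, c)
  = \frac{\mathbf{v}_a \cdot \mathbf{v}_c}
         {\lVert \mathbf{v}_a \rVert \, \lVert \mathbf{v}_c \rVert},
\qquad
c^{*} = \arg\max_{c \in C} \, \mathrm{sim}(a, c)
```

Under this reading, $c^{*}$ is the code segment that step S6 outputs for the annotation $a$.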

The invention has the beneficial effects that:

Compared with the prior art, the invention uses program analysis and neural network technology to position annotations automatically. It can effectively address the problems caused by non-standard annotation writing by linking each annotation to its code, which enhances the readability of the code, reduces the burden on developers, and improves development efficiency.

Drawings

FIG. 1 is a schematic diagram of an annotation positioning process based on program analysis and neural network according to the present invention;

FIG. 2 is a schematic diagram of a code and annotation extraction flow;

Detailed Description

The technical solution of the present invention will be further described with reference to the accompanying drawings, and the embodiments are not intended to limit the present invention.

As shown in FIG. 1, the annotation positioning method based on program analysis and a neural network of the present invention specifically includes the following steps.

Specifically, in step S1, because the invention trains neural network models, a code corpus with sufficient data needs to be constructed. Java open-source projects with more than 2000 stars are downloaded from the GitHub open-source community, and the method-level annotations in them are extracted with the Eclipse JDT tool.
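As a rough illustration of what extracting method-level annotations produces, the sketch below pairs each Javadoc block with the name of the method that follows it. It is a regex-based stand-in written in Python, not the Eclipse JDT extraction the text describes. The pattern, the function names, and the sample class are all assumptions made for this example, and a real parser handles many cases this pattern does not.

```python
import re

# Javadoc block immediately followed by a method header and its opening brace.
# Regex stand-in for a real parser: it misreads annotations that take
# arguments, comment markers inside strings, and other corner cases.
METHOD_WITH_DOC = re.compile(
    r"/\*\*(?P<doc>(?:(?!\*/).)*)\*/\s*"      # Javadoc body, never crossing "*/"
    r"[^;{}]*?\b(?P<name>\w+)\s*\([^)]*\)"    # modifiers, return type, name, parameters
    r"[^;{}]*\{",                             # optional throws clause, then "{"
    re.DOTALL,
)


def clean_doc(raw):
    """Strip the leading '*' of each Javadoc line and join the lines."""
    lines = [re.sub(r"^\s*\*?\s?", "", line).strip() for line in raw.splitlines()]
    return " ".join(line for line in lines if line)


def extract_method_comments(source):
    """Return (method name, comment text) pairs found in Java source text."""
    return [
        (m.group("name"), clean_doc(m.group("doc")))
        for m in METHOD_WITH_DOC.finditer(source)
    ]


# Hypothetical input; the field comment is skipped because no method follows it.
SAMPLE = """
public class Stack {
    /** Number of stored items. */
    private int size;

    /**
     * Pushes an item.
     * Appends value to items and increments size.
     */
    public void push(int value) {
        items.add(value);
        size++;
    }

    /** Returns the number of stored items. */
    public int size() {
        return size;
    }
}
"""

for name, comment in extract_method_comments(SAMPLE):
    print(name, "->", comment)
```

An AST-based tool is the safer choice on real projects because it resolves nesting, generics, and annotations that a single pattern cannot.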

Specifically, in step S2, the Java method-level annotations include What-type annotations and How-type annotations. A What-type annotation describes the functionality of a method, and a How-type annotation describes the specific implementation of a method. The method-level annotations extracted from the Java projects are classified manually to form a set of < annotation, annotation category > pairs as the training set of the annotation classification model.

Specifically, in step S3, the annotation data obtained in step S2 is preprocessed to produce the training set of the annotation classifier. The specific steps are as follows:

S3.1, performing word segmentation on the annotation sentences;

S3.2, deleting stop words;

S3.3, converting the words to lower case;

S3.4, constructing an annotation vocabulary of size 10000;

S3.5, converting each annotation sentence into a numeric list.
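Steps S3.1 to S3.5 can be sketched as below. This is a minimal illustration under stated assumptions: the stop-word list is a small placeholder, the reserved indices 0 (padding) and 1 (unknown word) are a common convention rather than something the source specifies, and the function names are hypothetical.

```python
import re
from collections import Counter

# Placeholder stop-word list; the source does not say which list is used.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in", "it", "this", "that"}
PAD, UNK = 0, 1          # assumed reserved indices: padding and unknown word
VOCAB_SIZE = 10000       # vocabulary size given in step S3.4


def tokenize(annotation):
    """S3.1 to S3.3: split into words, drop stop words, convert to lower case."""
    words = re.findall(r"[A-Za-z]+", annotation)
    return [w.lower() for w in words if w.lower() not in STOP_WORDS]


def build_vocab(annotations, size=VOCAB_SIZE):
    """S3.4: map the most frequent words to integer indices starting at 2."""
    counts = Counter(tok for text in annotations for tok in tokenize(text))
    most_common = counts.most_common(size - 2)   # two slots are reserved
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}


def encode(annotation, vocab):
    """S3.5: convert an annotation sentence into a numeric list."""
    return [vocab.get(tok, UNK) for tok in tokenize(annotation)]


corpus = ["Returns the size of the list", "Sort the list in place"]
vocab = build_vocab(corpus)
print(encode("Reverse the list", vocab))
```

Words outside the vocabulary map to the unknown index, so an unseen annotation can still be encoded at classification time.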

Further, the main parameters of the annotation classification model are set as follows: the model is a convolutional neural network with a word embedding dimension of 128 and the number of hidden layers set to 48, and it is trained with the Adam optimizer.

Specifically, as shown in FIG. 2, in step S4, the training set of the annotation positioning model is constructed. The specific steps are as follows:

S4.1, classifying the annotations with the annotation classifier and taking out the How-type annotations;

S4.2, acquiring all variables in the method body;

S4.3, matching the variables in the method body against the How-type annotation of the method to find the variables that appear in the annotation;

S4.4, finding the code segments related to those variables in the method body;

S4.5, because one annotation may correspond to multiple code segments, manually determining which code segment is closest in meaning to the annotation, thereby forming a set of < annotation, code segment > pairs.
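Steps S4.2 to S4.4 can be illustrated with the sketch below, which works on the method body as plain text. The declaration pattern, the choice of one source line per code segment, and all function names are assumptions made for this example. The source does not state how variables are collected or how a segment is delimited, and a program-analysis tool would normally read declarations from the syntax tree.

```python
import re

# Rough pattern for local declarations such as "int total = 0;" or
# "List<String> names;". An assumption for this sketch, not a full grammar.
DECLARATION = re.compile(
    r"\b(?:int|long|double|float|boolean|char|byte|short|[A-Z]\w*(?:<[^<>]*>)?)"
    r"(?:\[\])*\s+(\w+)\s*[=;:]"
)


def declared_variables(body):
    """S4.2: collect variable names declared in the method body, in order."""
    seen = []
    for name in DECLARATION.findall(body):
        if name not in seen:
            seen.append(name)
    return seen


def variables_in_annotation(variables, annotation):
    """S4.3: keep the variables that appear as whole words in the annotation."""
    words = set(re.findall(r"\w+", annotation))
    return [v for v in variables if v in words]


def candidate_segments(body, variable):
    """S4.4: treat every line that mentions the variable as a candidate segment."""
    pattern = re.compile(rf"\b{re.escape(variable)}\b")
    return [line.strip() for line in body.splitlines() if pattern.search(line)]


# Hypothetical method body and How-type annotation.
BODY = """
int total = 0;
for (int i = 0; i < values.length; i++) {
    total += values[i];
}
double mean = (double) total / values.length;
return mean;
"""
ANNOTATION = "Accumulate each value into total, then divide by the length."

matched = variables_in_annotation(declared_variables(BODY), ANNOTATION)
for variable in matched:
    print(variable, candidate_segments(BODY, variable))
```

The example returns three candidate lines for a single annotation, which is the situation step S4.5 resolves by manual selection.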

Specifically, in step S5, the data in < annotation, code segment > format obtained in step S4 is preprocessed to produce the training data set. The specific steps are as follows:

S5.1, performing word segmentation on the code segments;

S5.2, deleting the symbols in the data;

S5.3, deleting the Java keywords in the data;

S5.4, splitting each word according to the camel-case rule;

S5.5, deleting repeated words;

S5.6, converting upper-case letters in the words to lower case;

S5.7, forming two text sequences from the processed word list and the annotation.
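A minimal sketch of steps S5.1 to S5.7 follows. The regular expressions and function names are assumptions for illustration. The literals `true`, `false`, and `null` are dropped together with the keywords as a choice made in this sketch, and the lower-casing and de-duplication steps are applied together so that case variants of the same word collapse into one entry.

```python
import re

JAVA_KEYWORDS = {
    "abstract", "assert", "boolean", "break", "byte", "case", "catch", "char",
    "class", "const", "continue", "default", "do", "double", "else", "enum",
    "extends", "final", "finally", "float", "for", "goto", "if", "implements",
    "import", "instanceof", "int", "interface", "long", "native", "new",
    "package", "private", "protected", "public", "return", "short", "static",
    "strictfp", "super", "switch", "synchronized", "this", "throw", "throws",
    "transient", "try", "void", "volatile", "while",
    "true", "false", "null",   # literals, removed here as a sketch-level choice
}


def split_camel_case(word):
    """S5.4: 'parseHTTPResponse' -> ['parse', 'HTTP', 'Response']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", word)


def preprocess_code(segment):
    """S5.1 to S5.6: turn a code segment into a list of distinct lower-case words."""
    # S5.1 and S5.2: keep identifier-like tokens, which drops symbols and literals.
    tokens = re.findall(r"[A-Za-z_]\w*", segment)
    # S5.3: remove Java keywords.
    tokens = [t for t in tokens if t not in JAVA_KEYWORDS]
    words = []
    for token in tokens:
        for part in split_camel_case(token):      # S5.4
            part = part.lower()                   # S5.6
            if part not in words:                 # S5.5
                words.append(part)
    return words


def make_pair(annotation, segment):
    """S5.7: the two text sequences fed to the positioning model."""
    annotation_words = re.findall(r"[a-z]+", annotation.lower())
    return annotation_words, preprocess_code(segment)


print(make_pair("Count the users in the list",
                "int userCount = userList.size(); return userCount;"))
```

Splitting identifiers this way lets words such as `user` and `count` in the code line up with the same words in the annotation, which is what makes the two sequences comparable.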

Further, the main function of the annotation positioning model is to map the code and the annotation into a vector space, and then to construct the mapping between the annotation and the code by calculating the cosine similarity between the annotation vector and the code vector. The main parameters of the annotation positioning model are set as follows: the number of hidden units of the recurrent neural network is 100, the word embedding dimension is 100, and the Adam optimizer is used.
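The selection in step S6 can be sketched as follows: each candidate code segment is scored against the annotation by cosine similarity, and the highest-scoring segment is returned. The `embed` function here is only a placeholder that builds bag-of-words count vectors. It stands in for the trained recurrent neural network, whose learned vectors would replace it, and the function names and sample tokens are hypothetical.

```python
import math
from collections import Counter


def embed(tokens):
    """Placeholder encoder: a sparse word-count vector.

    The method described in the text uses a trained recurrent neural
    network at this point; this stand-in only makes the sketch runnable.
    """
    return Counter(tokens)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def locate(annotation_tokens, segments):
    """Return the index of the most similar segment and all scores."""
    annotation_vector = embed(annotation_tokens)
    scores = [cosine_similarity(annotation_vector, embed(s)) for s in segments]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical preprocessed annotation and candidate code segments.
annotation = ["sum", "values", "total"]
segments = [
    ["total", "values", "i"],
    ["mean", "total", "values", "length"],
    ["index", "buffer"],
]
best, scores = locate(annotation, segments)
print(best, scores)
```

Cosine similarity depends only on the direction of the vectors, so a long code segment is not favored simply because it contains more words.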

Applications of the present invention are numerous, and it will be appreciated by those skilled in the art that the above embodiments are examples of the present invention and that numerous changes, modifications, substitutions and alterations can be made in the embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. An annotation positioning method based on program analysis and neural network is characterized by comprising the following steps:

and S6, after the annotation positioning model is trained, giving an annotation statement and a plurality of code segments in the Java method, outputting the code segment most similar to the annotation, and forming the mapping relation between the annotation and the code segment.

2. The annotation positioning method based on program analysis and neural network as claimed in claim 1, wherein in step S2, the Java method-level annotations comprise What-type annotations and How-type annotations, wherein a What-type annotation is an annotation describing the functionality of a method and a How-type annotation is an annotation describing the specific implementation of a method.

3. The method for annotation localization based on program analysis and neural network as claimed in claim 1, wherein said preprocessing the training set in step S3 comprises segmenting the annotation text, deleting rare symbols and stop words therein, constructing an annotation vocabulary, and converting the annotation text into a numeric list.

4. The annotation positioning method based on program analysis and neural network as claimed in claim 1, wherein in step S4, the training set of the annotation positioning model is constructed as follows: first, all variables in the method body are obtained; the variables are then matched against the How-type annotation of the method to find the variables that appear in the annotation; and the code segments related to those variables are then found in the method body; since one annotation may correspond to a plurality of code segments, it is manually judged which code segment is closest in meaning to the annotation, so as to form a set of < annotation, code segment > pairs, which is used as the training set of the annotation positioning model.

5. The method for annotation localization according to claim 1, wherein in step S5, the annotation localization model is a recurrent neural network, which maps the code and the annotation to a vector space, and then constructs the mapping relationship between the annotation and the code by calculating the cosine similarity between the annotation vector and the code vector.