CN112417852B - Method and device for judging importance of code segment

Info

Publication number: CN112417852B
Application number: CN202011418126.2A
Authority: CN
Inventors: 舒俊淮; 陈湘萍; 金舒原; 郑子彬
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2020-12-07
Filing date: 2020-12-07
Publication date: 2022-01-25
Anticipated expiration: 2040-12-07
Also published as: CN112417852A; WO2022121146A1

Abstract

The invention discloses a method and a device for judging the importance of a code segment, which comprises the following steps: generating a target classification model through a preset classification model training process; when a code segment to be annotated is received, extracting a first feature vector of the code segment to be annotated; and inputting the first feature vector into the target classification model, and outputting the importance judgment result of the code segment to be annotated. The method can efficiently judge the importance of the code to be annotated so as to facilitate the standardization of the annotation behavior of software development and maintenance personnel and keep the code annotation amount in a more appropriate range.

Description

Method and device for judging importance of code segment

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for judging the importance of code segments.

Background

Research directions in intelligent software engineering include software repository mining, program comprehension, automatic code generation, automatic annotation generation and the like, all of which aim to help software developers work more efficiently during development and maintenance. In recent years, with the rise of machine learning and deep learning techniques, researchers working on intelligent software engineering problems have begun exploring these techniques and have obtained a number of encouraging results. Examples include techniques based on information retrieval and recommendation systems that help developers make better use of open-source software repositories; machine learning methods that identify important words in code so that other tasks can understand the program correctly; and code generation techniques based on convolutional neural networks.

The automatic generation technology of code annotation is a hot topic in the research field of intelligent software engineering. The code annotation can help people to know the intention and thought of a code author, and has important effects on software maintenance, code reuse, team collaborative development and the like. The technology aims at automatically generating annotations of a given code segment by a machine so as to reduce the time spent by a software developer on writing the annotations of the code and improve the development efficiency. With machine learning and deep learning techniques, researchers translate this problem into "translation tasks" in natural language processing to solve. Using the sequence-to-sequence model in natural language processing (i.e., inputting a sequence of text, the model will output a sequence of text), researchers "translate" the code language into natural language, taking the resulting natural language as an annotation of the corresponding code fragment.

However, existing techniques for predicting the position of code annotations train the model by simply converting the code text into feature vectors. This practice is common in natural language processing, but it amounts to treating the code text as natural language and using only the textual information of the code, which leads to less than ideal results. In addition, such methods make insufficient use of the text information itself: they merely convert words into features and do not consider how the text is distributed. As a result, the accuracy of determining the code annotation position is reduced, reasonable suggestions cannot be provided to developers, and the working efficiency of software development and maintenance personnel is further reduced.

Disclosure of Invention

The invention provides a method and a device for judging the importance of a code segment. It addresses the following technical problem: existing techniques for predicting the code annotation position treat the code text as plain text without structure and make little use of features from multiple dimensions, so the accuracy of determining the code annotation position is low, reasonable suggestions cannot be provided to developers, and the working efficiency of software development and maintenance personnel is reduced.

The invention provides a method for judging the importance of a code segment, which comprises the following steps:

receiving a code segment to be annotated;

extracting a first feature vector of the code segment to be annotated;

inputting the first feature vector into the target classification model, and outputting an importance judgment result of the code segment to be annotated;

wherein the target classification model is generated through a preset classification model training process.

Optionally, the classification model training process includes:

acquiring an annotated code file from a preset software warehouse;

dividing the annotated code file by taking a function as a unit to generate a plurality of training code segments;

setting a first preset label for the training code segment with the preset type annotation;

setting a second preset label for the training code segment without the preset type annotation;

respectively extracting a second feature vector of each training code segment;

and training a preset initial classification model by adopting a plurality of second feature vectors to obtain a target classification model.

Optionally, the first feature vector includes a syntactic feature vector, a text feature vector, a structural feature vector, and a relational feature vector, and the step of extracting the first feature vector of the code segment to be annotated includes:

converting the code segment to be annotated into an abstract syntax tree;

extracting statement type information of the code segment to be annotated from the abstract syntax tree;

determining a grammatical feature vector corresponding to the code segment to be annotated according to the statistical result of the statement type information;

extracting a target word from the code segment to be annotated according to a preset variable word division rule;

determining a text characteristic vector corresponding to the code segment to be annotated according to the statistical result of the target word;

determining a structural feature vector corresponding to the code segment to be annotated according to the complexity of the code segment to be annotated;

and determining a relation characteristic vector corresponding to the code segment to be annotated based on the function call quantity of the code segment to be annotated.

Optionally, the statement type information includes occurrence frequency, number, and frequency distribution of multiple statement types, and the step of determining the grammatical feature vector corresponding to the code segment to be annotated according to the statistical result of the statement type information includes:

counting the frequency distribution conditions of the various statement types, and determining statement frequency distribution characteristics;

counting the number of the multiple statement types, and determining statement number characteristics;

counting the total number of sentences corresponding to the various sentence types, and determining the total sentence number characteristic;

adopting a first preset word feature conversion model to respectively convert the multiple statement types into statement type features;

taking the occurrence frequency as weight, and carrying out weighted summation on the statement type characteristics to determine the total statement type characteristics;

and splicing the statement frequency distribution characteristic, the statement quantity characteristic, the total statement quantity characteristic and the total statement type characteristic to generate a grammar characteristic vector corresponding to the code segment to be annotated.

Optionally, the preset variable word division rule includes a hump rule or an underlining rule, and the step of extracting the target word from the code segment to be annotated according to the preset variable word division rule includes:

extracting words from the code segment to be annotated;

determining words to be processed from the words by adopting the hump rule or the underlining rule;

deleting preset stop words from the words to be processed to obtain words to be extracted;

and extracting the word stem in the word to be extracted to generate a target word.

Optionally, the target word includes a plurality of words to be counted, and the step of determining the text feature vector corresponding to the code segment to be annotated according to the statistical result of the target word includes:

counting the total number of the multiple words to be counted, and determining the total word number characteristic;

counting the number of the types of the words to be counted, and determining the number characteristics of the types of the words;

respectively calculating the variance of the occurrence frequency of each word to be counted in the plurality of words to be counted, and determining word variance characteristics;

counting the proportion of non-English words in the words to be counted, and determining the non-word proportion characteristic;

respectively converting the words to be counted into word features by adopting a second preset word feature conversion model;

taking the occurrence frequency of each word to be counted as weight, and carrying out weighted summation on all the word features to generate total word features;

and splicing the total word quantity characteristic, the word type quantity characteristic, the word variance characteristic, the non-word proportion characteristic and the total word characteristic to generate a text characteristic vector corresponding to the code segment to be annotated.

Optionally, the step of determining a structural feature vector corresponding to the code segment to be annotated according to the complexity of the code segment to be annotated includes:

counting the number of code lines in the code segment to be annotated, and determining line number characteristics;

counting the number of nested sentences in the code segment to be annotated, and determining the number characteristic of the nested sentences;

counting the maximum nesting layer number in the code segment to be annotated, and determining the maximum nesting layer number characteristic;

counting the number of form parameters in the code segment to be annotated, and determining the feature of the number of the form parameters;

respectively counting word quantity characteristics of the longest sentence in the code segment to be annotated, API call quantity characteristics of the code segment to be annotated, variable quantity characteristics of the code segment to be annotated, identifier quantity characteristics of the code segment to be annotated and internal annotation quantity characteristics of the code segment to be annotated, and sequentially splicing to generate comprehensive characteristics;

and splicing the line number characteristic, the nested statement quantity characteristic, the nested maximum layer number characteristic, the shape parameter quantity characteristic and the comprehensive characteristic to generate a structural characteristic vector corresponding to the code segment to be annotated.

Optionally, the function call number includes an extra function call number and a called number, and the step of determining the relationship feature vector corresponding to the code segment to be annotated based on the function call number of the code segment to be annotated includes:

traversing the code file to be annotated to which the code segment to be annotated belongs, and determining the number of the called extra functions and the number of called times of the code segment to be annotated;

and splicing the number of the called extra functions and the called times to generate a relation characteristic vector corresponding to the code segment to be annotated.
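The two relation features above can be sketched as follows. This is only an illustration under stated assumptions: Python source parsed with the standard `ast` module stands in for the Java files named in the embodiment, only top-level functions called by plain name are considered, "number of called extra functions" is read as the number of distinct other functions a function calls, and "called times" is read as the number of call sites that target it. The helper name `relation_features` is hypothetical.

```python
import ast
from typing import Dict, Tuple


def relation_features(file_source: str) -> Dict[str, Tuple[int, int]]:
    """Map each top-level function name to (other functions called, times called)."""
    tree = ast.parse(file_source)
    functions = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    names = {f.name for f in functions}
    calls_out = {name: 0 for name in names}
    called_by = {name: 0 for name in names}

    for func in functions:
        callees = set()
        for node in ast.walk(func):
            # Only direct calls by name, e.g. helper(x); method calls are ignored.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callee = node.func.id
                if callee in names and callee != func.name:
                    callees.add(callee)
                    called_by[callee] += 1
        calls_out[func.name] = len(callees)

    # Splice the two counts into one small relation feature per function.
    return {name: (calls_out[name], called_by[name]) for name in names}


SAMPLE_FILE = '''
def helper(x):
    return x + 1

def compute(a):
    return helper(a) + helper(a + 1)

def main():
    print(compute(1))
    print(helper(2))
'''

features = relation_features(SAMPLE_FILE)
```

For the sample file, `helper` calls no other function in the file and is called three times, while `main` calls two other functions and is never called.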

Optionally, the step of inputting the first feature vector into the target classification model and outputting a result of judging the importance of the code segment to be annotated includes:

inputting the first feature vector to the target classification model;

when the output of the target classification model is the first preset label, determining that the importance judgment result of the code segment to be annotated is important;

and when the output of the target classification model is the second preset label, determining that the importance judgment result of the code segment to be annotated is unimportant.

The invention also provides a device for judging the importance of the code segments, which comprises:

the code segment receiving module is used for receiving the code segment to be annotated;

the first feature vector extraction module is used for extracting a first feature vector of the code segment to be annotated;

the importance output module is used for inputting the first feature vector into the target classification model and outputting the importance judgment result of the code segment to be annotated;

the target classification model is generated through a preset classification model training module.

According to the technical scheme, the invention has the following advantages:

The method generates a target classification model through a preset classification model training process, extracts a first feature vector from a code segment to be annotated when that code segment is received, and finally inputs the first feature vector into the target classification model to obtain an importance judgment result for the code segment to be annotated. This solves the technical problem that existing techniques for predicting the code annotation position treat the code text as plain text without structure and make little use of features from multiple dimensions, so that the accuracy of determining the code annotation position is low, reasonable suggestions cannot be provided to developers, and the working efficiency of software development and maintenance personnel is reduced. The importance of the code to be annotated can therefore be judged efficiently, the annotation behavior of software development and maintenance personnel can be optimized, and the amount of code annotation can be kept in a more appropriate range.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.

FIG. 1 is a flowchart illustrating steps of a method for determining importance of a code segment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating steps of a method for determining the importance of a code segment according to an alternative embodiment of the present invention;

FIG. 3 is an exemplary diagram of nested statements in an embodiment of the invention;

FIG. 4 is a flowchart illustrating a method for determining the importance of a code segment according to another embodiment of the present invention;

FIG. 5 is a block diagram of a device for determining importance of code segments according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a method and a device for judging the importance of a code segment. They address the technical problem that existing techniques for predicting the code annotation position treat the code text as plain text without structure and make little use of features from multiple dimensions, so that the accuracy of determining the code annotation position is low, reasonable suggestions cannot be provided to developers, and the working efficiency of software development and maintenance personnel is reduced.

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for determining importance of a code segment according to an embodiment of the present invention.

Step 101, receiving a code segment to be annotated;

In the embodiment of the invention, in order to judge the importance of the code segment more accurately and thereby better support downstream tasks such as determining the position of code annotations, the importance judgment process can be started, before the user annotates the code segment, by receiving the code segment to be annotated that the user inputs.

The code segment to be annotated may be a Java code segment or the like, which is not limited in this embodiment of the present invention.

Step 102, extracting a first feature vector of the code segment to be annotated;

After the code segment to be annotated is received, a first feature vector, comprising for example grammatical features, text features, structural features, relational features and the like, is extracted from it. The first feature vector serves as the input to the subsequent model, and the importance judgment for the code segment to be annotated is made on the basis of these features.

Step 103, inputting the first feature vector into the target classification model, and outputting an importance judgment result of the code segment to be annotated;

In a specific implementation, the target classification model is generated through a preset classification model training process. After the target classification model is obtained, the first feature vector is input into it to judge the importance of the code segment to be annotated, and whether the code segment is important is output as the importance judgment result.

Referring to fig. 2, fig. 2 is a flowchart illustrating a method for determining importance of code segments according to an alternative embodiment of the present invention.

Step 201, receiving a code segment to be annotated;

Before step 201, so that the subsequent importance judgment on the code segment to be annotated can be carried out quickly, the target classification model may be generated in advance through a classification model training process, which includes the following steps S1-S6:

S1, obtaining the annotated code file from a preset software warehouse;

In the embodiment of the invention, in order to obtain sufficiently reliable training data, code files of Java projects that have a long maintenance history and were open-sourced by large international companies or organizations, namely the annotated code files, are obtained from the software project warehouse GitHub.

S2, dividing the annotated code file by taking a function as a unit to generate a plurality of training code segments;

S3, setting a first preset label for the training code segment with the preset type annotation;

S4, setting a second preset label for the training code segment without the preset type annotation;

In a specific implementation, an annotated code file often includes a plurality of code segments, and the object of the invention is to judge the importance of a code segment. The annotated code file can therefore be divided by taking a function as a unit into function-level training code segments, and each training code segment is then labeled according to whether the preset type annotation exists for it.

Specifically, a first preset tag may be set for the training code segment with a preset type of annotation to identify that the code segment is important, and a second preset tag may be set for the training code segment without a preset type of annotation to identify that the code segment is not important. The preset type annotation may be a function header annotation, etc., the first preset tag may be 1, and the second preset tag may be 0, which are not limited in the embodiment of the present invention.
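The labeling in S3 and S4 can be sketched as follows. This is a minimal illustration that assumes the file has already been divided into functions and that each function carries its function header annotation, if any; the `TrainingSegment` type and the `label_segments` helper are hypothetical names introduced here, and the labels 1 and 0 follow the example values given above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

IMPORTANT = 1    # first preset label: the function has a header annotation
UNIMPORTANT = 0  # second preset label: the function has no header annotation


@dataclass
class TrainingSegment:
    code: str
    header_comment: Optional[str]  # None when the function has no header annotation


def label_segments(segments: List[TrainingSegment]) -> List[Tuple[str, int]]:
    """Pair each function-level code segment with its preset label."""
    labeled = []
    for segment in segments:
        # An empty or whitespace-only comment is treated as "no annotation".
        has_comment = bool(segment.header_comment and segment.header_comment.strip())
        labeled.append((segment.code, IMPORTANT if has_comment else UNIMPORTANT))
    return labeled


segments = [
    TrainingSegment("int add(int a, int b) { return a + b; }", "/** Adds two ints. */"),
    TrainingSegment("void noop() {}", None),
    TrainingSegment("void blank() {}", "   "),
]
labels = [label for _, label in label_segments(segments)]
```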

S5, respectively extracting a second feature vector of each training code segment;

and S6, training a preset initial classification model by adopting a plurality of second feature vectors to obtain a target classification model.

In the embodiment of the present invention, after the training code segment is obtained, a second feature vector of the training code segment needs to be extracted, where the type of the second feature vector is the same as that of the first feature vector, that is, the second feature vector also includes a grammatical feature vector, a text feature vector, a structural feature vector, and a relational feature vector, and the extraction manner of the second feature vector is the same as that of the first feature vector.

After the second feature vector of each training code segment is obtained, a training set can be formed by adopting a plurality of second feature vectors, and a preset initial classification model is trained by adopting the training set to obtain a target classification model.

It should be noted that the initial classification model may be a random forest model or other classification models, which is not limited in this embodiment of the present invention.

The specific training process may be as follows: the data set is randomly divided into 10 equal parts, and each time 1 part is taken as the test set and the other 9 parts as the training set. The model is trained with the training set, and its effect is tested with the test set. When the effect of the model on the test set has not improved for 20 consecutive iterations, the number of iterations corresponding to the best effect is recorded. The training process is repeated 10 times, with each of the 10 parts serving as the test set once, to obtain 10 optimal iteration numbers. These 10 iteration numbers are averaged to obtain the iteration number used when the model is finally trained. Finally, a random forest model is trained with the full data, and the training finishes when the iteration number of the model reaches this preset value.
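The fold splitting and iteration selection described above can be sketched as follows. The sketch deliberately leaves out the model itself: it assumes that some training loop has already produced one test-set score per iteration for each fold, and it only shows the 10-way split, the stop-after-20-iterations-without-improvement rule, and the averaging. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Randomly split sample indices into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def best_iteration(scores: Sequence[float], patience: int = 20) -> int:
    """Return the 1-based iteration with the best test score, stopping once
    the score has not improved for `patience` consecutive iterations."""
    best_score = float("-inf")
    best_iter = 0
    for i, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            break
    return best_iter


def final_iteration_count(per_fold_scores: Sequence[Sequence[float]],
                          patience: int = 20) -> int:
    """Average the best iteration found on each fold."""
    best = [best_iteration(scores, patience) for scores in per_fold_scores]
    return round(sum(best) / len(best))
```

The value returned by `final_iteration_count` would then be used as the iteration budget when the model is retrained on the full data set.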

In the embodiment of the present invention, the first feature vector includes a syntactic feature vector, a textual feature vector, a structural feature vector, and a relational feature vector, and the step 102 may be replaced with the following steps 202-208:

Step 202, converting the code segment to be annotated into an abstract syntax tree;

An abstract syntax tree (AST), or simply syntax tree, is an abstract representation of the syntactic structure of source code. It represents the syntactic structure of the programming language in the form of a tree, with each node of the tree representing a structure in the source code.

In the embodiment of the invention, in order to enable the syntactic structure of the code segment to be annotated to be embodied visually and concretely, the code segment to be annotated can be converted into an abstract syntactic tree so as to facilitate the subsequent extraction of syntactic feature vectors.

Step 203, extracting statement type information of the code segment to be annotated from the abstract syntax tree;

Since the abstract syntax tree reflects each syntactic structure in the code segment to be annotated, the statement type information of the code segment to be annotated, including but not limited to IfStmt (if statement), ForStmt (for loop statement), WhileStmt (while loop statement), and the like, can be extracted from the abstract syntax tree.
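Steps 202 and 203 can be illustrated with a short sketch. It assumes Python source and the standard `ast` module as a stand-in for a Java parser, so the node names are Python's (`If`, `For`, `While`) rather than IfStmt, ForStmt, and WhileStmt; the helper name `statement_type_counts` is hypothetical.

```python
import ast
from collections import Counter


def statement_type_counts(source: str) -> Counter:
    """Parse the code segment into an AST and count each statement type."""
    tree = ast.parse(source)
    counts = Counter()
    for node in ast.walk(tree):
        # ast.stmt is the base class of every statement node.
        if isinstance(node, ast.stmt):
            counts[type(node).__name__] += 1
    return counts


SAMPLE = '''
def total_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += v
    return total
'''

counts = statement_type_counts(SAMPLE)
```

The resulting counter is the raw statement type information from which the statistics of step 204 can be computed.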

Step 204, determining a grammatical feature vector corresponding to the code segment to be annotated according to the statistical result of the statement type information;

The grammatical feature vector of the code segment to be annotated is intended to describe the grammatical information of the code segment in the code language. The grammatical feature vector corresponding to the code segment to be annotated is determined according to the statistical result of the statement type information.

The grammatical feature vector may include: a statement frequency distribution feature describing the frequency distribution of the different statement types, a statement number feature giving the number of different statement types (i.e., after deduplication of identical statement types), a total statement number feature giving the total number of statements, and a total statement type feature obtained by weighting the statement type features.

In this embodiment of the present invention, the statement type information includes the occurrence frequency, number and frequency distribution of multiple statement types, and step 204 may include the following sub-steps:

In the embodiment of the invention, the statement frequency distribution feature is determined by counting the frequency distribution of the statement types; the statement number feature is determined by counting the number of statement types; and the total statement number feature is determined by counting the total number of statements. A first preset word feature conversion model, such as a Word2Vec model, is used to convert each statement type into a corresponding statement type feature, and the statement type features are then summed with the occurrence frequency of each statement type as the weight to determine the total statement type feature. Finally, the statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced to obtain the grammatical feature vector representing the code segment to be annotated.

It is worth mentioning that the above statistical processes can be performed in parallel.

Word2vec is a group of related models used to generate word features. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words in adjacent positions. After training is completed, the word2vec model can be used to map each word to a feature that represents word-to-word relationships; this feature is taken from the hidden layer of the neural network.
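The splicing in step 204 can be sketched as follows. To stay self-contained, the sketch replaces the trained Word2Vec model with a small hand-written lookup table of statement type features, which is purely an assumption for illustration; the fixed list `known_types` and the function name are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def grammatical_feature_vector(
    statement_types: Sequence[str],
    known_types: Sequence[str],
    embeddings: Dict[str, List[float]],
) -> List[float]:
    """Splice frequency distribution, type count, total count and weighted embedding."""
    counts = Counter(statement_types)
    total = len(statement_types)

    # Statement frequency distribution over a fixed list of statement types.
    freq = [counts[t] / total if total else 0.0 for t in known_types]

    # Total statement type feature: embeddings weighted by occurrence frequency.
    dim = len(next(iter(embeddings.values()), []))
    weighted = [0.0] * dim
    for stmt_type, count in counts.items():
        vector = embeddings.get(stmt_type)
        if vector is None:
            continue  # statement type missing from the toy lookup table
        for i, value in enumerate(vector):
            weighted[i] += (count / total) * value

    return freq + [float(len(counts)), float(total)] + weighted


toy_embeddings = {
    "IfStmt": [1.0, 0.0],
    "ForStmt": [0.0, 1.0],
    "WhileStmt": [1.0, 1.0],
}
vector = grammatical_feature_vector(
    ["IfStmt", "ForStmt", "IfStmt", "IfStmt"],
    ["IfStmt", "ForStmt", "WhileStmt"],
    toy_embeddings,
)
```

In this layout the first three entries are the frequency distribution, the next two are the number of statement types and the total number of statements, and the remainder is the weighted statement type feature.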

Step 205, extracting a target word from the code segment to be annotated according to a preset variable word division rule;

The embodiment of the invention also needs to acquire a text feature vector of the code segment, so as to describe the distribution of text in the code segment and to capture the meaning of the words and their context information. Before the text feature vector is obtained, the code segment to be annotated needs to be preprocessed to obtain the target words.

In a specific implementation, the target word may be extracted from the code segment to be annotated according to a preset variable word division rule.

Optionally, the preset variable word segmentation rule includes a hump rule or an underlining rule, and step 205 may include the following sub-steps:

extracting words from the code segment to be annotated;

In a specific implementation, code authors tend to name code variables by combining multiple English words in a hump (camel-case) naming style or an underlined naming style. Therefore, words are first extracted from the code segment to be annotated, generally by splitting on separators such as spaces, brackets, or semicolons, and the words are then divided by the hump rule or the underlining rule to determine the words to be processed. Preset stop words, which are function words without actual meaning such as "the", "is", "at", and "on", are deleted to obtain the words to be extracted. Because words with the same stem can appear in different forms, the stem of each word to be extracted is then extracted to reduce the number of words, yielding the target words.

Meanwhile, in order to facilitate subsequent operations, all target words can be uniformly processed into a lower case form, which is not limited in the embodiment of the present invention.
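The preprocessing of step 205 can be sketched as follows. The stop-word list contains only the four examples named above, and the suffix-stripping `crude_stem` function is a deliberately simple stand-in for a real stemming algorithm; both, along with the function names, are assumptions made for illustration.

```python
import re
from typing import List

STOP_WORDS = {"the", "is", "at", "on"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer


def split_identifier(identifier: str) -> List[str]:
    """Split an identifier by the underlining rule, then by the hump rule."""
    parts = []
    for chunk in identifier.split("_"):
        # Acronym runs (HTTP), capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts


def crude_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if (word.endswith(suffix) and not word.endswith("ss")
                and len(word) - len(suffix) >= 3):
            return word[: -len(suffix)]
    return word


def extract_target_words(code: str) -> List[str]:
    """Extract lowercase, stop-word-free, stemmed target words from code text."""
    words = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code):
        for part in split_identifier(identifier):
            word = part.lower()
            if word.isdigit() or word in STOP_WORDS:
                continue
            words.append(crude_stem(word))
    return words


target_words = extract_target_words("userNames = get_the_count(isReady)")
```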

Step 206, determining a text feature vector corresponding to the code segment to be annotated according to the statistical result of the target word;

Further, the target word comprises a plurality of words to be counted, and step 206 may comprise the following sub-steps:

In an embodiment of the present invention, the text feature vector includes: a total word number feature, a word type number feature (i.e., the number of words after deduplication of identical words), a word variance feature, a non-word proportion feature, and a total word feature obtained by weighting the word features. The total word number feature, the word type number feature, and the word variance feature measure the distribution of the target words in the code to be annotated. Non-English words are variable names that the code author made up and that have no actual meaning, so the non-word proportion feature measures the interpretability of the words. The total word feature carries the context information of the words.

In a specific implementation, the total number of words to be counted is counted to determine the total word number feature; the number of different kinds of words to be counted is counted to determine the word type number feature; the variance of the occurrence frequencies of the words to be counted is calculated to determine the word variance feature; and the proportion of non-English words among the words to be counted is counted to determine the non-word proportion feature. A second preset word feature conversion model is used to convert each word to be counted into a word feature, and all the word features are summed with the occurrence frequency of each word to be counted as the weight to generate the total word feature. Finally, the total word number feature, the word type number feature, the word variance feature, the non-word proportion feature, and the total word feature are spliced to generate the text feature vector corresponding to the code segment to be annotated.

The above statistics may also be computed in parallel, and the second preset word feature conversion model may be a Word2Vec model or the like, which is not limited in this embodiment of the present invention.
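The five text features described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the vocabulary `ENGLISH_WORDS`, the two-dimensional table `EMBEDDINGS` (standing in for a trained model such as Word2Vec) and the zero-vector fallback `UNKNOWN` are all hypothetical, and the word variance is read here as the variance of occurrence counts across distinct words.

```python
from collections import Counter
from statistics import pvariance

# Hypothetical stand-ins: a tiny English vocabulary and 2-dimensional word
# embeddings. A real system would use a dictionary and a trained model.
ENGLISH_WORDS = {"read", "file", "name", "count"}
EMBEDDINGS = {"read": [1.0, 0.0], "file": [0.0, 1.0], "name": [1.0, 1.0]}
UNKNOWN = [0.0, 0.0]  # assumed fallback for words without an embedding


def text_feature_vector(words):
    """Splice the five text features into one flat list of floats."""
    counts = Counter(words)
    total_words = len(words)
    distinct_words = len(counts)
    # Variance of the occurrence counts across the distinct words.
    variance = pvariance(list(counts.values())) if counts else 0.0
    # Share of word occurrences that are not in the English vocabulary.
    non_english = sum(c for w, c in counts.items() if w not in ENGLISH_WORDS)
    non_word_ratio = non_english / total_words if total_words else 0.0
    # Sum of the word embeddings, each weighted by its occurrence count.
    dim = len(UNKNOWN)
    total_word_feature = [0.0] * dim
    for word, freq in counts.items():
        vec = EMBEDDINGS.get(word, UNKNOWN)
        for i in range(dim):
            total_word_feature[i] += freq * vec[i]
    return [float(total_words), float(distinct_words), float(variance),
            non_word_ratio] + total_word_feature
```

Calling `text_feature_vector(["read", "file", "read", "tmpx"])` yields a six-element vector: four scalar statistics followed by the two-dimensional weighted embedding sum.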

Step 207, determining a structural feature vector corresponding to the code segment to be annotated according to the complexity of the code segment to be annotated;

in one example of the present invention, the step 207 may include the following sub-steps:

The embodiment of the invention also needs to determine the structural features of the code segment to be annotated, so as to determine its complexity. The structural features of a function code segment are features that describe how the segment is composed. These structural features are: the number of code lines, the number of nested statements, the maximum nesting depth of the nested statements, whether the function has formal parameters and how many, the number of words in the longest statement, the number of API calls, the number of variables, the number of identifiers, the number of internal comments, and the like. A "nested statement" is a statement that contains another statement; FIG. 3 shows an example of a for loop statement containing an if conditional statement. The complexity of a code segment is positively correlated with its importance.

It should be noted that the structural features may be counted in parallel, and not all of them need to be used; technicians may flexibly select the structural features that best describe the complexity of the code segments in actual operation, which is not limited in the embodiment of the present invention.
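A few of the structural features can be approximated with plain text scanning, as in the sketch below. It is only an illustration: it covers four of the listed features, assumes Java-like source with brace-delimited blocks, and would miscount braces or parentheses that occur inside string literals or comments. The function name `structural_features` and the sample `SAMPLE` are hypothetical.

```python
import re

SAMPLE = """int sum(int[] xs, int limit) {
    int total = 0;
    // skip negative values
    for (int x : xs) {
        if (x > 0) {
            total += x;
        }
    }
    return total;
}"""


def structural_features(function_source):
    """Return [line_count, max_nesting_depth, parameter_count, comment_count]."""
    lines = [ln for ln in function_source.splitlines() if ln.strip()]
    line_count = len(lines)

    # Nesting depth via brace counting. The function body itself is depth 1,
    # so it is subtracted to report only the nesting of inner statements.
    depth = max_depth = 0
    for ch in function_source:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    max_nesting = max(max_depth - 1, 0)

    # Formal parameters: split the first parenthesised list on commas.
    match = re.search(r"\(([^)]*)\)", function_source)
    params = match.group(1).strip() if match else ""
    parameter_count = len(params.split(",")) if params else 0

    # Internal comments: only line comments starting with // are counted.
    comment_count = sum(1 for ln in lines if ln.strip().startswith("//"))
    return [line_count, max_nesting, parameter_count, comment_count]
```

For `SAMPLE`, the for loop containing an if statement gives a nesting depth of 2. A production implementation would more likely read these counts from the abstract syntax tree than from raw text.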

And step 208, determining a relation characteristic vector corresponding to the code segment to be annotated based on the function call quantity of the code segment to be annotated.

In another example of the present invention, where the number of function calls includes the number of additional functions called and the number of times called, step 208 may include the sub-steps of:

In the embodiment of the present invention, it is also necessary to analyze the connections between different function code segments. To this end, the functions may be modeled in the manner of a social network or a directed graph: the number of additional functions called by the current code segment to be annotated is defined as its out-degree value, and the number of times the segment is called is defined as its in-degree value. The whole code file to be annotated to which the code segment belongs is scanned and traversed to determine the out-degree value and the in-degree value of the code segment, namely the number of additional functions called and the number of times called, and the two features are spliced to generate the relation feature vector corresponding to the code segment to be annotated.
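The out-degree and in-degree values can be illustrated with a small call graph. The sketch below assumes the graph has already been extracted from the code file as a dictionary mapping each function name to the list of functions it calls; the names `CALL_GRAPH` and `relation_features` are hypothetical. Counting distinct callees for the out-degree and individual call sites for the in-degree is one possible reading of the description, not the only one.

```python
CALL_GRAPH = {
    "main": ["parse", "run", "parse"],
    "run": ["parse", "log"],
    "parse": [],
    "log": [],
}


def relation_features(call_graph, function_name):
    """Return [out_degree, in_degree] for one function.

    call_graph maps each function name to the list of functions it calls.
    """
    callees = call_graph.get(function_name, [])
    # Out-degree: number of distinct other functions this function calls.
    out_degree = len({c for c in callees if c != function_name})
    # In-degree: number of call sites in other functions that target it.
    in_degree = sum(
        targets.count(function_name)
        for caller, targets in call_graph.items()
        if caller != function_name
    )
    return [out_degree, in_degree]
```

Here `parse` calls nothing and is called three times, while `main` calls two distinct functions and is never called.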

It should be noted that steps 202 and 204 as a whole, steps 205 and 206 as a whole, and steps 207 and 208 can be executed in parallel.

Step 209, inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code segment to be annotated;

in a specific implementation, the step 209 may include the following sub-steps:

inputting the first feature vector to the target classification model;

In the embodiment of the present invention, after the first feature vector is obtained, it may be input to the target classification model, which performs a comprehensive judgment based on the first feature vector to produce the model output. If the output of the target classification model is the first preset label, the importance judgment result of the code segment to be annotated is determined to be important; if the output is the second preset label, the importance judgment result is determined to be unimportant.

Referring to fig. 4, fig. 4 is a flowchart illustrating a method for determining importance of code segments according to an embodiment of the present invention.

Collecting Java project code files from a software repository; dividing the project code file by taking a function as a unit; marking the function code segment with the function head annotation as 1, otherwise marking as 0; extracting required features from the function code segments; the required characteristics comprise grammatical characteristics, text characteristics, structural characteristics and relation characteristics;

the grammatical feature extraction process comprises the following steps: preparing to extract grammatical features; converting the function code segments into an abstract syntax tree; obtaining statement type information of the function code segment from the abstract syntax tree; counting the frequency distribution of the different statement types; counting the number of different statement types; counting the total number of statements; converting the statement types into features and computing their sum weighted by occurrence frequency; and splicing to obtain the grammatical features.
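The statement-type statistics can be sketched with an off-the-shelf parser. The flow above targets Java functions; purely for illustration, the example below uses Python's built-in `ast` module on a Python function instead, which is an assumption of this sketch. The names `grammar_features`, `TYPE_ORDER` and `SAMPLE` are hypothetical, and the weighted sum of statement-type embeddings is omitted.

```python
import ast
from collections import Counter

SAMPLE = """
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
"""

# Assumed fixed ordering of the statement types tracked in the distribution.
TYPE_ORDER = ["Assign", "For", "If", "Return"]


def grammar_features(source, type_order):
    """Return the frequency distribution plus [distinct types, total statements]."""
    tree = ast.parse(source)
    # Every statement node contributes its class name as its statement type.
    types = [type(n).__name__ for n in ast.walk(tree) if isinstance(n, ast.stmt)]
    counts = Counter(types)
    total = len(types)
    distribution = [counts[t] / total if total else 0.0 for t in type_order]
    return distribution + [float(len(counts)), float(total)]
```

For `SAMPLE` the tree contains six statements of six different types, so each tracked type has a relative frequency of one sixth.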

The text feature extraction process comprises the following steps: preparing to extract text features; extracting words in the function code segment; splitting variable names according to the hump (camel-case) rule or the underline (underscore) rule; converting all words to lower case; deleting stop words; extracting word stems; counting the total number of words; counting the number of distinct words used; calculating the variance of the occurrence frequency of different words; counting the proportion of non-English words; converting the words into features and computing their sum weighted by occurrence frequency; and splicing to obtain the text features.
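The word preparation steps (splitting, lower-casing, stop-word removal, stemming) can be sketched as below. The stop-word list `STOP_WORDS` and the suffix-stripping `stem` function are hypothetical simplifications; a real pipeline would more likely use an established stemmer and a curated stop-word list.

```python
import re

STOP_WORDS = {"the", "a", "of", "get", "set"}  # hypothetical stop-word list
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer


def split_identifier(identifier):
    """Split on underscores, then on camel-case boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        # Acronyms (HTTP), capitalised or lower-case words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts


def stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def target_words(identifiers):
    """Split, lower-case, drop stop words, and stem each identifier."""
    words = []
    for identifier in identifiers:
        for part in split_identifier(identifier):
            lowered = part.lower()
            if lowered not in STOP_WORDS:
                words.append(stem(lowered))
    return words
```

The resulting word list is what the subsequent counting steps (total words, distinct words, variance, non-English proportion) would operate on.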

The structural feature extraction process comprises: preparing to extract structural features; counting the number of code lines of the function code segment; counting the number of nested statements of the function code segment; counting the maximum nesting depth of the nested statements of the function code segment; counting the number of formal parameters in the function code segment; counting the number of words of the longest statement in the function code segment; counting the number of API calls in the function code segment; counting the number of variables in the function code segment; counting the number of identifiers in the function code segment and the number of internal annotations in the function code segment; and splicing to obtain the structural features.

The relational feature extraction process comprises the following steps: preparing to extract relational features; defining the concepts of out-degree values and in-degree values; counting the out-degree value and the in-degree value of each function; and splicing to obtain the relation characteristics.

After the four extraction processes are executed in parallel, splicing their outputs to obtain the final features of each function code segment; training a classification model on the final features together with the labels; obtaining a target classification model that outputs whether a function code segment is important or not;

when a new function code segment is received, extracting the final characteristics of the function code segment; and inputting the final characteristics into a target classification model, and outputting whether the function code segment is important or not.
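The splice, train and judge flow can be sketched end to end as follows. The description does not specify the classifier here, so a deliberately simple nearest-centroid model is used as a stand-in; `splice`, `NearestCentroid` and `judge` are hypothetical names, and the feature values in the usage note are made up for illustration.

```python
def splice(*feature_vectors):
    """Concatenate the per-category vectors into one final feature vector."""
    return [value for vec in feature_vectors for value in vec]


class NearestCentroid:
    """Simple stand-in for the preset initial classification model."""

    def fit(self, vectors, labels):
        sums, counts = {}, {}
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        # One mean vector (centroid) per label.
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, vec):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(vec, centroid))
        # Return the label whose centroid is closest to the input vector.
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def judge(model, vec):
    """Map label 1 to 'important' and label 0 to 'unimportant'."""
    return "important" if model.predict(vec) == 1 else "unimportant"
```

A usage sketch: build training vectors with `splice(grammar, text, structure, relation)`, label segments that have a function head annotation as 1 and the rest as 0, call `NearestCentroid().fit(vectors, labels)`, and pass the spliced vector of a new segment to `judge`.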

Referring to fig. 5, fig. 5 is a block diagram illustrating a device for determining the importance of a code segment according to an embodiment of the present invention.

a code segment receiving module 501, configured to receive a code segment to be annotated;

a first feature vector extraction module 502, configured to extract a first feature vector of the code segment to be annotated;

an importance output module 503, configured to input the first feature vector to the target classification model, and output an importance judgment result for the code segment to be annotated;

Optionally, the classification model training module includes:

the annotated code file receiving submodule is used for acquiring an annotated code file from a preset software warehouse;

the file dividing submodule is used for dividing the annotated code file by taking a function as a unit to generate a plurality of training code segments;

the first label setting submodule is used for setting a first preset label for the training code segment with the preset type annotation;

the second label setting submodule is used for setting a second preset label for the training code segment without the preset type annotation;

the second feature vector extraction submodule is used for respectively extracting a second feature vector of each training code segment;

and the classification model training submodule is used for training a preset initial classification model by adopting a plurality of second characteristic vectors to obtain a target classification model.

Optionally, the first feature vector includes a syntactic feature vector, a textual feature vector, a structural feature vector, and a relational feature vector, and the first feature vector extraction module 502 includes:

the conversion submodule is used for converting the code segment to be annotated into an abstract syntax tree;

a statement type information extraction submodule, configured to extract statement type information of the code segment to be annotated from the abstract syntax tree;

the grammar feature vector determining submodule is used for determining a grammar feature vector corresponding to the code segment to be annotated according to the statistical result of the statement type information;

the target word extraction submodule is used for extracting a target word from the code segment to be annotated according to a preset variable word division rule;

the text characteristic vector determining submodule is used for determining a text characteristic vector corresponding to the code segment to be annotated according to the statistical result of the target word;

the structural feature vector determining submodule is used for determining a structural feature vector corresponding to the code segment to be annotated according to the complexity of the code segment to be annotated;

and the relation characteristic vector determining submodule is used for determining the relation characteristic vector corresponding to the code segment to be annotated based on the function calling number of the code segment to be annotated.

Optionally, the statement type information includes an occurrence frequency, a number, and a frequency distribution of a plurality of statement types, and the syntax feature vector determination sub-module includes:

the statement frequency distribution characteristic determining unit is used for counting the frequency distribution conditions of the statement types and determining statement frequency distribution characteristics;

the statement quantity feature determining unit is used for counting the number of the plurality of statement types and determining the statement quantity feature;

a total statement quantity feature determining unit, configured to count the total number of statements corresponding to the plurality of statement types, and determine the total statement quantity feature;

the statement type feature conversion unit is used for respectively converting the plurality of statement types into statement type features by adopting a first preset word feature conversion model;

a total statement type feature determining unit, configured to perform weighted summation on the statement type features by using the occurrence frequency as a weight, and determine the total statement type feature;

and the grammar feature vector generating unit is used for splicing the statement frequency distribution feature, the statement quantity feature, the total statement quantity feature and the total statement type feature to generate a grammar feature vector corresponding to the code segment to be annotated.

Optionally, the preset variable word division rule includes a hump rule or an underlining rule, and the target word extraction sub-module includes:

a word extraction unit for extracting words from the code segment to be annotated;

the word to be processed determining unit is used for determining a word to be processed from the words by adopting the hump rule or the underline rule;

the word to be extracted determining unit is used for deleting preset stop words from the words to be processed to obtain words to be extracted;

and the target word determining unit is used for extracting the word stem in the word to be extracted and generating a target word.

Optionally, the target word includes a plurality of words to be counted, and the text feature vector determination sub-module includes:

the total word quantity characteristic determining unit is used for counting the total quantity of the multiple words to be counted and determining the total word quantity characteristic;

the word type quantity characteristic determining unit is used for counting the type quantity of the words to be counted and determining the word type quantity characteristic;

the word variance characteristic determining unit is used for respectively calculating the variance of the occurrence frequency of each word to be counted in the plurality of words to be counted and determining the word variance characteristic;

the non-word proportion characteristic determining unit is used for counting the proportion of non-English words in the plurality of words to be counted and determining non-word proportion characteristics;

the word feature conversion unit is used for respectively converting the words to be counted into word features by adopting a second preset word feature conversion model;

a total word feature generation unit, configured to perform weighted summation on all the word features by using the occurrence frequency of each word to be counted as a weight, so as to generate a total word feature;

and the text feature vector determining unit is used for splicing the total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature and the total word feature to generate a text feature vector corresponding to the code segment to be annotated.

Optionally, the structural feature vector determination sub-module includes:

the line number characteristic determining unit is used for counting the line number of the codes in the code segment to be annotated and determining line number characteristics;

the nested statement quantity characteristic determining unit is used for counting the number of nested statements in the code segment to be annotated and determining the nested statement quantity characteristic;

the maximum nesting layer number feature determining unit is used for counting the maximum nesting layer number in the code segment to be annotated and determining the maximum nesting layer number feature;

the formal parameter quantity feature determining unit is used for counting the number of formal parameters in the code segment to be annotated and determining the formal parameter quantity feature;

a comprehensive feature determining unit, configured to count the word quantity feature of the longest statement in the code segment to be annotated, the API call quantity feature of the code segment to be annotated, the variable quantity feature of the code segment to be annotated, the identifier quantity feature of the code segment to be annotated, and the internal annotation quantity feature of the code segment to be annotated, and to splice them in sequence to generate a comprehensive feature;

and the structural feature vector generating unit is used for splicing the line number feature, the nested statement quantity feature, the maximum nesting layer number feature, the formal parameter quantity feature and the comprehensive feature to generate a structural feature vector corresponding to the code segment to be annotated.

Optionally, the number of function calls includes the number of extra functions called and the number of times called, and the relation feature vector determination sub-module includes:

the function call quantity determining unit is used for traversing the code file to be annotated to which the code segment to be annotated belongs, and determining the number of extra functions called by the code segment to be annotated and the number of times it is called;

and the relation feature vector generating unit is used for splicing the number of extra functions called and the number of times called to generate the relation feature vector corresponding to the code segment to be annotated.

Optionally, the importance output module 503 includes:

a feature vector input sub-module for inputting the first feature vector to the target classification model;

the importance determination submodule is used for determining that the importance judgment result of the code segment to be annotated is important when the output of the target classification model is the first preset label;

and the importance negation sub-module is used for determining that the importance judgment result of the code segment to be annotated is unimportant when the output of the target classification model is the second preset label.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative: the division of the units is only one logical division, and other divisions may be used in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for judging the importance of a code segment is characterized by comprising the following steps:

receiving a code segment to be annotated;

extracting a first feature vector of the code segment to be annotated;

inputting the first feature vector into a target classification model, and outputting an importance judgment result of the code segment to be annotated;

the target classification model is generated through a preset classification model training process;

the first feature vector comprises a grammar feature vector, a text feature vector, a structure feature vector and a relation feature vector, and the step of extracting the first feature vector of the code segment to be annotated comprises the following steps:

converting the code segment to be annotated into an abstract syntax tree;

determining a relation characteristic vector corresponding to the code segment to be annotated based on the function call quantity of the code segment to be annotated;

the statement type information comprises the occurrence frequency, the number and the frequency distribution condition of a plurality of statement types, and the step of determining the grammatical feature vector corresponding to the code segment to be annotated according to the statistical result of the statement type information comprises the following steps:

2. The method of claim 1, wherein the classification model training process comprises:

acquiring an annotated code file from a preset software warehouse;

respectively extracting a second feature vector of each training code segment;

3. The method for determining the importance of the code segment according to claim 1, wherein the preset variable word segmentation rule includes a hump rule or an underlining rule, and the step of extracting the target word from the code segment to be annotated according to the preset variable word segmentation rule includes:

extracting words from the code segment to be annotated;

4. The method for judging the importance of the code segment according to claim 1 or 3, wherein the target word comprises a plurality of words to be counted, and the step of determining the text feature vector corresponding to the code segment to be annotated according to the statistical result of the target word comprises:

5. The method for determining the importance of the code segment according to claim 1, wherein the step of determining the structural feature vector corresponding to the code segment to be annotated according to the complexity of the code segment to be annotated includes:

6. The method for judging the importance of the code segment according to claim 1, wherein the function call number includes an extra function call number and a called number, and the step of determining the relationship feature vector corresponding to the code segment to be annotated based on the function call number of the code segment to be annotated includes:

7. The method according to claim 2, wherein the step of inputting the first feature vector into the target classification model and outputting the result of determining the importance of the code segment to be annotated includes:

inputting the first feature vector to the target classification model;

8. An apparatus for determining the importance of a code segment, comprising:

the importance output module is used for inputting the first feature vector into a target classification model and outputting an importance judgment result of the code segment to be annotated;

the target classification model is generated through a preset classification model training module;

optionally, the first feature vector includes a syntactic feature vector, a text feature vector, a structural feature vector, and a relational feature vector, and the first feature vector extraction module includes:

the relation feature vector determining submodule is used for determining a relation feature vector corresponding to the code segment to be annotated based on the function calling number of the code segment to be annotated;