CN111045716B - Related patch recommendation method based on heterogeneous data - Google Patents

Related patch recommendation method based on heterogeneous data

Info

Publication number
CN111045716B
Authority
CN
China
Prior art keywords
patch
model
prediction
probability
logistic regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911067915.3A
Other languages
Chinese (zh)
Other versions
CN111045716A (en)
Inventor
Zibin Zheng (郑子彬)
Zhihao Chen (陈志豪)
Quanzhong Li (李全忠)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201911067915.3A
Publication of CN111045716A
Application granted
Publication of CN111045716B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/60 - Software deployment
    • G06F8/65 - Updates
    • G06F8/658 - Incremental updates; Differential updates

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a related patch recommendation method based on heterogeneous data, which comprises: crawling multi-source heterogeneous data from the code review system, cleaning the data, splicing the heterogeneous features into patch feature vectors, pairing patches, taking pairs related to the predicted patch as positive samples and unrelated pairs as negative samples and marking both with binary labels, dividing a training set and a validation set, training three models (logistic regression, random forest and LightGBM) on the training set to obtain the corresponding probabilities and prediction labels, calculating each model's accuracy from its prediction labels, and finally constructing a prediction score as the weighted sum of the fusion weights and the corresponding probabilities to obtain the optimal prediction score. The method uses machine learning to evaluate the relevance of data submitted to the code review system and obtain the best recommendation, which improves the reliability and stability of the recommendation and saves labor cost.

Description

Related patch recommendation method based on heterogeneous data
Technical Field
The invention relates to the field of code review, and in particular to a related patch recommendation method based on heterogeneous data.
Background
Code review is an important basis for the smooth iteration of a software engineering project. It consists of many small but complex tasks, including correcting code style, adding missing code comments, and the like. At present, most software engineering projects adopt manual review to update and manage code versions, which incurs a high labor cost.
Currently, the software engineering industry generally uses Git, Gerrit and similar systems for code management and review. On these systems, each code update is called a code modification, or a patch. For each submitted patch, these systems provide reviewers and a submitter to complete the query or review. The basic flow of patch review is as follows:
1) the programmer, i.e. the author, completes the patch;
2) the submitter submits the patch to the review system;
3) the reviewers review and evaluate the patch on the review system. If the patch passes, the process ends; otherwise, the patch author revises it according to the reviewers' opinions and the flow returns to the first step.
Such a system is described in detail below using Gerrit as an example:
1) the code review system stores every version of the code belonging to the same patch, each also referred to as a code update.
2) A typical code review system assigns a unique numeric id to each patch submitted to the system; a larger id indicates a later submission time.
3) A patch submitted to the system generally includes: a brief description introducing the problem the patch solves, submitter information, reviewer information, the submission time, the code files, and the diff file, i.e. the code-modification comparison file generated by the system.
Backtracking individual code versions and referring to related submissions is a necessary process for correcting code errors and exceptions. However, because each project undergoes many code iterations and contains many code files, manually finding the other submissions related to the currently problematic submission, or manually marking related submissions in advance, costs a great deal of labor and time.
Disclosure of Invention
In order to overcome at least one of the defects of the prior art, the invention provides a related patch recommendation method based on heterogeneous data.
The present invention aims to solve the above technical problem at least to some extent.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a related patch recommendation method based on heterogeneous data comprises the following steps:
S10, crawling multi-source heterogeneous data from the code review records or files of the code review website, where the multi-source heterogeneous data comprises patch basic information, patch code, patch brief descriptions, and the list of patches related to the predicted patch;
S20, constructing meta-features from the patch basic information; counting the frequency of each modification type in the patch code and combining the frequencies of all modification types in the same patch's modification process into the patch's modification feature vector; extracting each patch's brief description and converting the natural language into machine-recognizable brief-description embedding features; splicing the patch meta-features, the modification feature vector and the brief-description embedding features into the patch feature vector;
S30, in the patch data, forming a "patch i - patch j" pair from the predicted patch i and each patch j in its related-patch list, regarded as a positive sample, and forming a "patch i - patch j" pair from the predicted patch i and a patch j outside the list, regarded as a negative sample, where i denotes the current patch and j the index of the paired patch; marking the positive and negative samples with binary labels; collecting positive and negative samples from the multi-source heterogeneous data into a sample set, and dividing the sample set into a training set and a validation set;
S40, training a logistic regression model, a random forest model and a LightGBM model on the training set;
S50, setting a prediction threshold η and a probability-to-label mapping; inputting the validation set to the logistic regression model to obtain the probability P_LR and the corresponding prediction labels, to the random forest model to obtain the probability P_RF and the corresponding prediction labels, and to the LightGBM model to obtain the probability P_LGB and the corresponding prediction labels; calculating each model's prediction accuracy from its prediction labels, obtaining the accuracy a_LR of the logistic regression model, the accuracy a_RF of the random forest model and the accuracy a_LGB of the LightGBM model;
S60, calculating the fusion weights w_I of logistic regression, random forest and LightGBM respectively, and constructing a prediction score as the weighted sum of the fusion weights w_I and the probabilities P_LR, P_RF and P_LGB, obtaining the optimal prediction score.
Preferably, after S10 and before S20, the method further includes:
S70, cleaning the text and list fields of the multi-source heterogeneous data as required for machine recognition, where the cleaning at least unifies the letter case of names and converts string-typed related-patch lists into a machine-recognizable list type.
Preferably, the method in S20 for extracting a patch's brief description and converting its natural language into machine-recognizable brief-description embedding features comprises:
S301, removing stop words from each brief-description text, where the stop words at least include interrogative words and punctuation marks;
S302, obtaining the word2vec vector of each word in the brief description with the word2vec method;
S303, embedding each word's word2vec vector at the word's position in the brief description, replacing the word at that position, thereby obtaining an M×N brief-description matrix, where M is the length of the brief description and N is the dimension of the words' word2vec vectors;
S304, averaging each column of the brief-description matrix to obtain a 1×N vector, regarded as the embedding feature of the brief description.
Preferably, the patch basic information includes submitter and/or reviewer information: the name of the crawled patch's reviewer, the name of the patch's submitter, the name of the patch's author, the name of the project the patch belongs to, the name of the project branch the patch belongs to, the patch's submission time, the number of people participating in the patch's review, and the number of modified code files.
Preferably, the method for constructing the meta-features from the patch basic information in S20 specifically comprises:
directly extracting the numerical fields of the patch basic information - the patch submission time, the number of people participating in the patch review and the number of modified code files - as numerical features, used as patch meta-features;
regarding the remaining fields of the patch basic information, other than the patch submission time, the number of people participating in the patch review and the number of modified code files, as category fields and one-hot encoding them; combining the one-hot codes of all category fields of the same patch into a group of multi-dimensional sparse one-hot features, used as patch meta-features.
Preferably, the logistic regression model, the random forest model and the LightGBM model each perform a binary classification task whose target is to predict whether two patches are related, i.e. whether the label of a "patch i - patch j" pair is 0 or 1; each final model outputs the probability that a sample's label is 1 and the sample's corresponding prediction label 0 or 1.
Preferably, S40 specifically comprises:
training the logistic regression model on the training set: the feature vector of a patch pair, x = [x_1, x_2, ..., x_n]^T, is input to the logistic regression model; with the model's parameter vector w = [w_1, w_2, ..., w_n]^T, the input features and the parameter vector are linearly combined as the inner product of the two vectors, w^T·x = w_1·x_1 + w_2·x_2 + ... + w_n·x_n, and the prediction target formula is:

p_LR = σ(w^T·x + b) = 1 / (1 + e^(-(w^T·x + b)))

where p_LR is the output of the logistic regression, i.e. the probability that the sample label is 1, x is the model's input feature vector, w holds the linear combination parameters of the features, b is the bias of the linear combination, and σ(w^T·x + b) is the sigmoid mapping function; w and b are optimized by gradient descent to obtain the determined logistic regression model, denoted LR;
training the random forest model on the training set: the random forest is a set of K CART decision trees; each CART classification tree assigns a score to each leaf node, the final score is the mean of the prediction scores of all CARTs, and each CART classification tree is built independently; training on the training set yields the determined random forest model, denoted RF; with the patch pair's feature vector denoted x, the prediction target formula for the probability that the sample label is 1 output by the model is:

p_RF = (1/K)·Σ_{k=1}^{K} T_k(x), with T_k ∈ T_r

where T_k(x) is the k-th CART tree, the learning target of each tree is the label corresponding to x and each tree outputs a classification probability, and T_r denotes the function space containing all CART trees;
training the LightGBM model on the training set: LightGBM training is a process of iteratively building decision trees; after M iterations the LightGBM model is obtained, written F_M(x) = F_{M-1}(x) + α_M·T_M(x), where T_M(x) is the M-th tree built in the iterative process and α_M is the weight of the M-th tree; inputting the training set for training yields the determined LightGBM model, denoted LGB; inputting a validation-set patch feature vector x to the LightGBM model, the model outputs the probability that the sample label is 1:

p_LGB = F_M(x)

where F_M(x) is the final LightGBM model after M iterations and M is the number of iterations to convergence.
Preferably, the probability-to-label mapping formula in S50 is:

label_I = 1 if p_I ≥ η, and label_I = 0 if p_I < η

where I is LR, RF or LGB.
Preferably, the accuracy a_I is calculated as:

a_I = (number of validation samples whose prediction label equals the true label) / (total number of validation samples)

where I is LR, RF or LGB.
Preferably, the fusion weights and the prediction score in S60 are calculated as follows:
the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model are respectively

w_I = a_I / (a_LR + a_RF + a_LGB)

where I is LR, RF or LGB;
the prediction score is calculated as

score_i = Σ_I w_I · p_{i,I}

where I is LR, RF or LGB, and p_{i,I} is the probability of label 1 predicted for the i-th patch pair by the logistic regression model, the random forest model or the LightGBM model respectively.
Compared with the prior art, the technical solution of the invention has the following beneficial effects: it automatically recommends the pages of related patches, greatly saving the cost of manually searching for related-patch pages and effectively improving reviewers' efficiency; it uses multi-source heterogeneous data as the input of machine learning model training, describing the similarities and differences between submissions in code review from different dimensions and making full use of the information of each dimension; and it generates the final recommendation result by multi-model fusion, improving the reliability and stability of the recommendation.
Drawings
FIG. 1 is a schematic diagram of the model fusion of the invention.
FIG. 2 is an exemplary diagram of obtaining the word2vec vector of a word in a brief description using the word2vec method.
FIG. 3 is a schematic diagram of the architecture for training an ensemble tree model based on the bagging ensemble mode.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in FIGS. 1 to 3, a related patch recommendation method based on heterogeneous data comprises:
S10, crawling multi-source heterogeneous data from the code review records or files of the code review website, where the multi-source heterogeneous data comprises patch basic information, patch code, patch brief descriptions, and the list of patches related to the predicted patch;
S20, constructing meta-features from the patch basic information; counting the frequency of each modification type in the patch code and combining the frequencies of all modification types in the same patch's modification process into the patch's modification feature vector; extracting each patch's brief description and converting the natural language into machine-recognizable brief-description embedding features; splicing the patch meta-features, the modification feature vector and the brief-description embedding features into the patch feature vector;
S30, in the patch data, forming a "patch i - patch j" pair from the predicted patch i and each patch j in its related-patch list, regarded as a positive sample, and forming a "patch i - patch j" pair from the predicted patch i and a patch j outside the list, regarded as a negative sample, where i denotes the current patch and j the index of the paired patch; marking the positive and negative samples with binary labels; collecting positive and negative samples from the multi-source heterogeneous data into a sample set, and dividing the sample set into a training set and a validation set;
S40, training a logistic regression model, a random forest model and a LightGBM model on the training set;
S50, setting a prediction threshold η and a probability-to-label mapping; inputting the validation set to the logistic regression model to obtain the probability P_LR and the corresponding prediction labels, to the random forest model to obtain the probability P_RF and the corresponding prediction labels, and to the LightGBM model to obtain the probability P_LGB and the corresponding prediction labels; calculating each model's prediction accuracy from its prediction labels, obtaining the accuracy a_LR of the logistic regression model, the accuracy a_RF of the random forest model and the accuracy a_LGB of the LightGBM model;
S60, calculating the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model respectively, and constructing a prediction score as the weighted sum of the fusion weights w_I and the probabilities P_LR, P_RF and P_LGB, obtaining the optimal prediction score.
Preferably, after S10 and before S20, the method further includes:
S70, cleaning the text and list fields of the multi-source heterogeneous data as required for machine recognition, where the cleaning at least unifies the letter case of names and converts string-typed related-patch lists into a machine-recognizable list type.
Preferably, the method in S20 for extracting a patch's brief description and converting its natural language into machine-recognizable brief-description embedding features comprises:
S301, removing stop words from each brief-description text, where the stop words at least include interrogative words and punctuation marks;
S302, obtaining the word2vec vector of each word in the brief description with the word2vec method;
S303, embedding each word's word2vec vector at the word's position in the brief description, replacing the word at that position, thereby obtaining an M×N brief-description matrix, where M is the length of the brief description and N is the dimension of the words' word2vec vectors;
S304, averaging each column of the brief-description matrix to obtain a 1×N vector, regarded as the embedding feature of the brief description.
Preferably, the patch basic information includes submitter and/or reviewer information: the name of the crawled patch's reviewer, the name of the patch's submitter, the name of the patch's author, the name of the project the patch belongs to, the name of the project branch the patch belongs to, the patch's submission time, the number of people participating in the patch's review, and the number of modified code files.
Preferably, the method for constructing the meta-features from the patch basic information in S20 specifically comprises:
directly extracting the numerical fields of the patch basic information - the patch submission time, the number of people participating in the patch review and the number of modified code files - as numerical features, used as patch meta-features;
regarding the remaining fields of the patch basic information, other than the patch submission time, the number of people participating in the patch review and the number of modified code files, as category fields and one-hot encoding them; combining the one-hot codes of all category fields of the same patch into a group of multi-dimensional sparse one-hot features, used as patch meta-features.
Preferably, the logistic regression model, the random forest model and the LightGBM model each perform a binary classification task whose target is to predict whether two patches are related, i.e. whether the label of a "patch i - patch j" pair is 0 or 1; each final model outputs the probability that a sample's label is 1 and the sample's corresponding prediction label 0 or 1.
Preferably, S40 specifically comprises:
training the logistic regression model on the training set: the feature vector of a patch pair, x = [x_1, x_2, ..., x_n]^T, is input to the logistic regression model; with the model's parameter vector w = [w_1, w_2, ..., w_n]^T, the input features and the parameter vector are linearly combined as the inner product of the two vectors, w^T·x = w_1·x_1 + w_2·x_2 + ... + w_n·x_n, and the prediction target formula is:

p_LR = σ(w^T·x + b) = 1 / (1 + e^(-(w^T·x + b)))

where p_LR is the output of the logistic regression, i.e. the probability that the sample label is 1, x is the model's input feature vector, w holds the linear combination parameters of the features, b is the bias of the linear combination, and σ(w^T·x + b) is the sigmoid mapping function; w and b are optimized by gradient descent to obtain the determined logistic regression model, denoted LR;
training the random forest model on the training set: the random forest is a set of K CART decision trees; each CART classification tree assigns a score to each leaf node, the final score is the mean of the prediction scores of all CARTs, and each CART classification tree is built independently; training on the training set yields the determined random forest model, denoted RF; with the patch pair's feature vector denoted x, the prediction target formula for the probability that the sample label is 1 output by the model is:

p_RF = (1/K)·Σ_{k=1}^{K} T_k(x), with T_k ∈ T_r

where T_k(x) is the k-th CART tree, the learning target of each tree is the label corresponding to x and each tree outputs a classification probability, and T_r denotes the function space containing all CART trees;
training the LightGBM model on the training set: LightGBM training is a process of iteratively building decision trees; after M iterations the LightGBM model is obtained, written F_M(x) = F_{M-1}(x) + α_M·T_M(x), where T_M(x) is the M-th tree built in the iterative process and α_M is the weight of the M-th tree; inputting the training set for training yields the determined LightGBM model, denoted LGB; inputting a validation-set patch feature vector x to the LightGBM model, the model outputs the probability that the sample label is 1:

p_LGB = F_M(x)

where F_M(x) is the final LightGBM model after M iterations and M is the number of iterations to convergence.
Preferably, the probability-to-label mapping formula in S50 is:

label_I = 1 if p_I ≥ η, and label_I = 0 if p_I < η

where I is LR, RF or LGB.
Preferably, the accuracy a_I is calculated as:

a_I = (number of validation samples whose prediction label equals the true label) / (total number of validation samples)

where I is LR, RF or LGB.
Preferably, the fusion weights and the prediction score in S60 are calculated as follows:
the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model are respectively

w_I = a_I / (a_LR + a_RF + a_LGB)

where I is LR, RF or LGB;
the prediction score is calculated as

score_i = Σ_I w_I · p_{i,I}

where I is LR, RF or LGB, and p_{i,I} is the probability of label 1 predicted for the i-th patch pair by the logistic regression model, the random forest model or the LightGBM model respectively.
An actual operation example:
1. Automatically acquire the review records and files of multiple projects on Gerrit with a web crawler. The obtained data is heterogeneous and of multiple types, and includes at least the following three:
1) Basic information of the patch (the source of the patch meta-features): discrete data such as the submitter and reviewers. The patch submission information crawled includes the name of the patch's reviewer, the name of the patch's submitter, the name of the patch's author, the name of the project the patch belongs to, the name of the project branch the patch belongs to, the patch's submission time, the number of people participating in the patch's review, the number of modified code files, and so on.
2) The brief description crawled from each patch's submission, which is text data.
3) The code files modified in each crawled patch, comprising only the code text files that were changed.
2. Data cleaning: clean the data acquired by the crawler, e.g. unify the letter case of names and convert string-typed related-patch lists into list types.
3. Convert the human-recognizable fields of the patch page's basic information into machine-recognizable features; for example, one-hot encode the category fields. One-hot encoding is a method commonly used in machine learning to process category-type features such as the project name and the project branch name; it normalizes the category features. The encoding principle is simply to reconstruct the feature according to the number of categories it contains. For example, assume the project has 3 branches in total: master, dev and release. Such values are not handled well by machine learning algorithms, so the discrete values of the feature are usually represented numerically, e.g. master coded as 0, dev as 1 and release as 2. However, even converted to numbers, this data cannot be used directly in the classification model, because a classifier by default assumes its input is continuous and ordered. One possible solution is One-Hot Encoding, in which only one bit of the code is set, hence the name "one-bit-effective encoding". The table below shows, as an example only, the one-hot encoding of the project branch field:

    branch     one-hot code
    master     1 0 0
    dev        0 1 0
    release    0 0 1

Table 1

Combine the one-hot codes of all category fields of the submitted patch - the reviewer name, the submitter name, the author name, the project name and the project branch name - into one group of features, finally forming a multi-dimensional sparse feature vector.

Table 2 (the combined one-hot features of a patch's category fields)

The three numerical fields - the patch submission time, the number of people participating in the patch review and the number of modified code files - only need to be kept in their original numerical form to be used directly as features.
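By way of illustration, the meta-feature construction above can be sketched in Python. This is a minimal sketch under assumed field names (reviewer, submitter, n_files, and so on, which are not the patent's exact schema); pandas' get_dummies stands in for any one-hot encoder.

import pandas as pd

# Toy patch records: category fields plus the three numerical fields.
patches = pd.DataFrame([
    {"reviewer": "alice", "submitter": "bob", "author": "bob",
     "project": "nova", "branch": "master",
     "submit_time": 1572825600, "n_reviewers": 3, "n_files": 5},
    {"reviewer": "carol", "submitter": "dan", "author": "dan",
     "project": "nova", "branch": "dev",
     "submit_time": 1572912000, "n_reviewers": 2, "n_files": 1},
])

category_fields = ["reviewer", "submitter", "author", "project", "branch"]
numeric_fields = ["submit_time", "n_reviewers", "n_files"]

# One-hot encode the category fields into a sparse 0/1 block ...
onehot = pd.get_dummies(patches[category_fields], dtype=int)
# ... and keep the numerical fields in their original form.
meta_features = pd.concat([patches[numeric_fields], onehot], axis=1)
print(meta_features.shape)  # (2, 3 + total number of one-hot columns)

Concatenating the untouched numeric columns with the one-hot block mirrors the split described above: numerical fields stay numeric, category fields become a multi-dimensional sparse 0/1 group.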
4. Extract the frequency of each modification type in each submitted code text. The main function of this step is to extract the process information of each patch's iterative modification as features for related-patch recommendation; these features reflect the correlation of different patches in the modification process. The steps are:
1) take the initial-version and final-version code files of the patch;
2) compare the two files with the ChangeDistiller tool and count the frequency of each modification type of the final-version code relative to the initial-version code, where the modification types include: adding a statement, changing a statement, deleting a statement, adding a class, modifying a class, deleting a class, adding a method, modifying a method, deleting a method, renaming a method, modifying a return type, modifying a parameter type, renaming a parameter, adding a parameter, deleting a parameter, adding an attribute, renaming an attribute, changing a conditional expression, and so on;
3) combine the frequencies of all modification types in the same patch's modification process into a vector, used as a group of features.
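As a sketch of this step, assume the ChangeDistiller output has already been parsed into a list of change-type strings (ChangeDistiller itself is a Java tool; that parsing is not shown here, and the type names below are illustrative). The frequency vector can then be built over a fixed vocabulary:

from collections import Counter

# Fixed vocabulary of modification types (subset shown; the patent lists
# statement/class/method/parameter/attribute changes and more).
CHANGE_TYPES = [
    "statement_insert", "statement_update", "statement_delete",
    "class_insert", "class_update", "class_delete",
    "method_insert", "method_update", "method_delete",
]

def modification_vector(changes):
    """Map a list of change-type strings (e.g. parsed from ChangeDistiller
    output) to a fixed-length frequency vector."""
    counts = Counter(changes)
    return [counts.get(t, 0) for t in CHANGE_TYPES]

# Hypothetical diff between a patch's first and last code versions.
changes = ["statement_insert", "statement_insert", "method_update"]
print(modification_vector(changes))  # [2, 0, 0, 0, 0, 0, 0, 1, 0]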
5. Extract the embedding features of each patch's brief description at submission.
The patch's brief description must be converted from natural language into a machine-recognizable feature vector. Vector embedding is the general name for deep-learning numerical mapping methods: when processing text data, the high-dimensional sparse one-hot code of each word is mapped, through the neural network's gradient-based learning, to a corresponding low-dimensional dense vector, and such vectors are collectively called embedding features. For example, given a document, i.e. a word sequence such as "A B A C B F G", we can finally obtain the vector corresponding to A, e.g. [0.1 0.6 -0.5], the vector corresponding to B, e.g. [-0.2 0.9 0.7], and thereby a vector corresponding to a sentence. The patch brief description intuitively reflects the problem the corresponding patch solves, and thus reflects relevance in terms of the problems solved; but it must be embedded for a machine to understand such text information. The steps for extracting the embedding features of each patch's brief description at submission are:
1) remove stop words from each brief-description text, where the stop words include interrogative words such as "what" and "why", and punctuation marks;
2) obtain the word2vec vector of each word in the brief description with the word2vec method;
3) place each word's word2vec vector at the word's position in the brief description; each brief description thus yields an M×N matrix, where M is the length of the brief description and N is the dimension of the word vectors output by the word2vec embedding;
4) average each column of the brief description's matrix to obtain a 1×N vector, i.e. the embedding feature vector of the brief description.
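A minimal sketch of steps 1) to 4) using the gensim word2vec implementation; the toy corpus and the vector size N = 50 are assumptions for illustration only:

import numpy as np
from gensim.models import Word2Vec

# Tokenized brief descriptions after stop-word removal (toy corpus).
briefs = [
    ["fix", "null", "pointer", "in", "scheduler"],
    ["add", "unit", "test", "for", "scheduler"],
]

# Train word2vec on the corpus (N = 50 here; the patent leaves N open).
w2v = Word2Vec(briefs, vector_size=50, min_count=1, seed=0)

def brief_embedding(tokens, model):
    """Stack the M word vectors into an M x N matrix and average each
    column, yielding the 1 x N embedding feature of the brief description."""
    matrix = np.stack([model.wv[t] for t in tokens if t in model.wv])
    return matrix.mean(axis=0)

print(brief_embedding(briefs[0], w2v).shape)  # (50,)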
6. Splice the features, label the samples, and train the logistic regression, random forest and LightGBM models.
1) Feature splicing: concatenate all the feature vectors constructed above end to end to obtain a new vector.
2) Labeling: the patches crawled from the website carry no ready-made labels, so labels must be assigned. Since some patch pages already carry manually added related-patch lists, the manually marked related patch pairs can be regarded as positive samples, and patch pairs not in the related lists as negative samples.
Specifically, the implementation steps of splicing features and labeling include:
Crawl the "related patches" list of each patch on the Gerrit system to the local machine; these lists are constructed manually by the system's reviewers (see the example in FIG. 1). The list of patch i is denoted l_i, which contains n_i related entries, i.e. has length n_i.
First, obtain the positive samples, i.e. pairs of patches that are related. Assume a patch has a manually constructed recommendation list of "related patches"; obviously, the current patch is related to each patch in that list. Form a "patch i - patch j" pair from the predicted patch and each patch in the list, where patch i is the predicted patch and patch j is a patch in the list, and mark the pair's label as 1.
Second, obtain the negative samples, i.e. pairs of patches that are not related. Assume again that a patch has a manually constructed "related patches" list; no correlation is assumed between the predicted patch and the patches outside this list. However, the number of such "unrelated" patch pairs can be far larger than the number of "related" pairs.
To balance the numbers of negative and positive samples, n_i patches are randomly drawn from the system, none of which is in l_i. The predicted patch i and each of these n_i patches form a "patch i - patch j" pair, and the label of such a pair is marked 0. The sample pairs are merged and the respective features of patch i and patch j are concatenated: if patch i's features have dimension D and patch j's features have dimension D, the input feature of the merged model is

x_ij = [x_i, x_j]

of dimension 2·D. The final ratio of positive to negative samples is thus 1:1, and each patch pair's feature dimension is 2·D. The sample set is then divided into a training set and a validation set at a ratio of 0.7:0.3, and the logistic regression, random forest and LightGBM models are trained on the training set.
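A compact end-to-end sketch of this step with scikit-learn and LightGBM; the patch features (feats), the related-patch lists (related) and all sizes are toy assumptions for illustration, not the patent's data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import lightgbm as lgb

rng = np.random.default_rng(0)

# Toy stand-ins: feats[pid] is the D-dimensional feature vector of patch
# pid; related[pid] is its manually built related-patch list l_i.
D = 16
feats = {pid: rng.normal(size=D) for pid in range(200)}
related = {0: [3, 7], 1: [2], 5: [9, 11, 20]}

X, y = [], []
for i, rel in related.items():
    # Draw n_i negatives outside l_i to balance positives 1:1.
    negatives = rng.choice(
        [p for p in feats if p != i and p not in rel],
        size=len(rel), replace=False)
    for j, label in [(j, 1) for j in rel] + [(j, 0) for j in negatives]:
        X.append(np.concatenate([feats[i], feats[j]]))  # 2*D features
        y.append(label)
X, y = np.array(X), np.array(y)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 0.7 : 0.3 split

models = {
    "LR": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "RF": RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr),
    # min_child_samples=1 only so the demo runs on this tiny toy set.
    "LGB": lgb.LGBMClassifier(n_estimators=100,
                              min_child_samples=1).fit(X_tr, y_tr),
}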
3) The logistic regression, random forest and LightGBM models all adopt classification algorithms, but they differ in how, and how well, they learn from the input features, and each can complete an independent binary classification task. The training models of the invention are binary classification models: the target is to predict whether two patches are related, i.e. the label of a "patch i - patch j" pair is 0 or 1, and the final model's prediction output comprises the probability that a sample's label is 1 and the sample's prediction label 0 or 1. The specific method is as follows:
First, logistic regression expresses the implicit linear relation between input and output well, and automatically reduces the negative effect of feature multicollinearity. The features of the training-set samples are used as input, written x = [x_1, x_2, ..., x_n]^T; with the parameter vector w = [w_1, w_2, ..., w_n]^T, the linear combination is written as the inner product of the two vectors, w^T·x = w_1·x_1 + w_2·x_2 + ... + w_n·x_n, and the prediction formula is:

p_LR = σ(w^T·x + b) = 1 / (1 + e^(-(w^T·x + b)))

where p_LR is the output of the logistic regression, i.e. the probability that the sample label is 1, x is the model's input feature vector, w holds the linear combination parameters of the features, b is the bias of the linear combination, and σ(w^T·x + b) is the sigmoid mapping function; w and b are optimized by gradient descent to obtain the determined logistic regression model, denoted LR.
After the probability is obtained, a threshold η is set as the basis for deciding the 0 and 1 labels, e.g. η = 0.5. The mapping from probability to label is:

label_LR = 1 if p_LR ≥ η, and label_LR = 0 if p_LR < η
Using the features of the validation-set samples as input, the prediction labels of the validation samples are obtained, and from them the accuracy of LR's individual prediction:

a_LR = (number of validation samples correctly predicted by LR) / (total number of validation samples)
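Continuing the sketch above, the threshold mapping and per-model accuracy can be written once and reused for LR, RF and LGB (models, X_va and y_va come from the previous snippet):

import numpy as np

eta = 0.5  # prediction threshold from the text

def labels_and_accuracy(model, X_va, y_va):
    """Map predicted probabilities to 0/1 labels via the threshold eta,
    then compute the model's accuracy a_I on the validation set."""
    p = model.predict_proba(X_va)[:, 1]   # probability that label == 1
    labels = (p >= eta).astype(int)
    accuracy = (labels == y_va).mean()
    return p, labels, accuracy

p_lr, lab_lr, a_lr = labels_and_accuracy(models["LR"], X_va, y_va)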
The random forest is a common bagging-based ensemble tree model; it expresses the implicit nonlinear relation between input and output well and is highly stable. A random forest is a collection of K CARTs (classification and regression trees). Each CART assigns a score to each leaf node, and the final score is the average of the prediction scores of all CARTs; moreover, in a random forest each tree is built independently. As before, training on the training set yields the determined random forest model, denoted RF. In validation and actual prediction, with the input feature vector written x, the model outputs the probability that the sample label is 1:

p_RF = (1/K)·Σ_{k=1}^{K} T_k(x), with T_k ∈ T_r

where T_k(x) is the k-th CART tree, the learning target of each tree is the label of x and each tree outputs a classification probability, and T_r denotes the function space containing all CART trees. Then, through the probability-to-label mapping

label_RF = 1 if p_RF ≥ η, and label_RF = 0 if p_RF < η,

using the features of the validation-set samples as input, the prediction labels of the validation samples are obtained, and from them the accuracy of the RF model's individual prediction:

a_RF = (number of validation samples correctly predicted by RF) / (total number of validation samples)
The LightGBM is a boosting-based ensemble tree model; it expresses the implicit nonlinear relation between input and output well and has strong fitting ability. Unlike the random forest, where each tree is built independently, LightGBM training is a process of iteratively building decision trees. For example, the final LightGBM model obtained through M iterations can be written

F_M(x) = F_{M-1}(x) + α_M·T_M(x)

where T_M(x) is the M-th tree built in the iterative process and α_M is the weight of the M-th tree. As before, training on the training set yields the determined LightGBM model, denoted LGB. In validation and actual prediction, with the input feature vector written x, the model outputs the probability that the sample label is 1:

p_LGB = F_M(x)

where F_M(x) is the final LightGBM model after M iterations and M is the number of iterations to convergence. Then, through the probability-to-label mapping

label_LGB = 1 if p_LGB ≥ η, and label_LGB = 0 if p_LGB < η,

using the features of the validation-set samples as input, the prediction labels of the validation samples are obtained, and from them the accuracy of the LGB model's individual prediction:

a_LGB = (number of validation samples correctly predicted by LGB) / (total number of validation samples)
7. Compute the fusion weights from the accuracies of the step-6 models on the validation set, take the weighted sum of the probabilities at test time, record the summation result as the score, sort the samples by score, and take the top k as the final recommendation. First, for the same feature, different models learn the information the feature contains from different angles. In addition, accuracy reflects the strength of a single model's learning ability on the features, so the fusion weights computed from the accuracies indirectly reflect the credibility of each model's predictions among the several models. Fusion thus ensures the comprehensiveness of the final model's learning, while weighting lets the final model distinguish its degree of dependence on information from different angles. The weighted fusion of multiple models helps improve the accuracy and stability of the final recommendation.
1) Calculate each model's weight, with the weight formula:

w_I = a_I / (a_LR + a_RF + a_LGB)

where I is LR, RF or LGB, and a_I is the accuracy of the logistic regression model, the random forest model or the LightGBM model respectively;
2) Construct the recommendation candidate set and the prediction scores. The system contains a large number of patches; if every two patches formed a pair and all pairs participated in prediction at recommendation time, large amounts of computing resources would be consumed, so a small candidate set must be built by rule to participate in prediction. Analysis shows the following relationship among the storage ids of Gerrit's existing "related patches": in 80% of the "patch pairs", the difference between the two patch ids is less than 300. Also, the recommendation method of the invention is mainly used in the following scenario: when a patch is newly submitted to the system, the system finds the currently related patches among all previously stored patches as a recommendation list; so when making a "related" recommendation for a patch, only the other patches submitted before it are considered. Therefore, based on each patch's id in the system, the N other patches submitted before the predicted patch are taken, starting from the predicted patch's id, to form a recommendation candidate set of N patch pairs, and features are constructed and spliced according to the methods of steps 3, 4, 5 and 6, where N > 0 and N is preset to 300.
Predict the N patch pairs with the three trained models, for 1 ≤ i ≤ 300:
p_{i,I}, with I = LR, RF or LGB, denotes the probability of label 1 predicted by the logistic regression model, the random forest model or the LightGBM model respectively, and the final fusion score is calculated as:

score_i = Σ_I w_I · p_{i,I}

where score_i is the fusion score of the i-th patch pair, w_I (I = LR, RF or LGB) is the fusion weight of logistic regression, random forest or LightGBM, and p_{i,I} is the probability of label 1 predicted by the corresponding model. Finally, the k patch pairs ranked highest by score are given prediction label 1, i.e. the patches in those k pairs are recommended as the predicted patch's "related patches", where k is preset to 10.
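Continuing the earlier sketches, the fusion weights, weighted scores and top-k selection might look as follows; the candidate features X_cand are randomly generated stand-ins for the N = 300 real candidate pairs:

import numpy as np

N, K_TOP = 300, 10  # candidate-set size and recommendation depth

# Validation accuracies -> fusion weights w_I = a_I / (a_LR + a_RF + a_LGB)
acc = {name: labels_and_accuracy(m, X_va, y_va)[2]
       for name, m in models.items()}
total = sum(acc.values())
w = {name: a / total for name, a in acc.items()}

# Hypothetical candidate pairs: N feature rows of dimension 2*D, i.e. the
# predicted patch paired with the 300 patches submitted just before it.
X_cand = rng.normal(size=(N, 2 * D))

# score_i = sum_I w_I * p_{i,I}
scores = sum(w[name] * m.predict_proba(X_cand)[:, 1]
             for name, m in models.items())

top_k = np.argsort(scores)[::-1][:K_TOP]  # indices of recommended pairs
print(top_k)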
Compared with the prior art, the technical solution of the invention has the following beneficial effects: it automatically recommends the pages of related patches, greatly saving the cost of manually searching for related-patch pages and effectively improving reviewers' efficiency; it uses multi-source heterogeneous data as the input of machine learning model training, describing the similarities and differences between submissions in code review from different dimensions and making full use of the information of each dimension; and it generates the final recommendation result by multi-model fusion, improving the reliability and stability of the recommendation.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (8)

1. A related patch recommendation method based on heterogeneous data is characterized by comprising the following steps:
S10, crawling multi-source heterogeneous data from the code review records or files of the code review website, where the multi-source heterogeneous data comprises patch basic information, patch code, patch brief descriptions, and the list of patches related to the predicted patch;
S20, constructing meta-features from the patch basic information; counting the frequency of each modification type in the patch code and combining the frequencies of all modification types in the same patch's modification process into the patch's modification feature vector; extracting each patch's brief description and converting the natural language into machine-recognizable brief-description embedding features; splicing the patch meta-features, the modification feature vector and the brief-description embedding features into the patch feature vector;
S30, in the patch data, forming a "patch i - patch j" pair from the predicted patch i and each patch j in its related-patch list, regarded as a positive sample, and forming a "patch i - patch j" pair from the predicted patch i and a patch j outside the list, regarded as a negative sample, where i denotes the current patch and j the index of the paired patch; marking the positive and negative samples with binary labels; collecting positive and negative samples from the multi-source heterogeneous data into a sample set, and dividing the sample set into a training set and a validation set;
S40, training a logistic regression model, a random forest model and a LightGBM model on the training set;
S50, setting a prediction threshold η and a probability-to-label mapping; inputting the validation set to the logistic regression model to obtain the probability P_LR and the corresponding prediction labels, to the random forest model to obtain the probability P_RF and the corresponding prediction labels, and to the LightGBM model to obtain the probability P_LGB and the corresponding prediction labels; calculating each model's prediction accuracy from its prediction labels, obtaining the accuracy a_LR of the logistic regression model, the accuracy a_RF of the random forest model and the accuracy a_LGB of the LightGBM model;
S60, calculating the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model respectively, and constructing a prediction score as the weighted sum of the fusion weights w_I and the probabilities P_LR, P_RF and P_LGB, obtaining the optimal prediction score;
the logistic regression model, the random forest model and the LightGBM model each perform a binary classification task whose target is to predict whether two patches are related, i.e. whether the label of a "patch i - patch j" pair is 0 or 1; each final model outputs the probability that a sample's label is 1 and the sample's corresponding prediction label 0 or 1;
the S40 specifically includes:
training the logistic regression model on the training set: the feature vector of a patch pair, x = [x_1, x_2, ..., x_n]^T, is input to the logistic regression model; with the model's parameter vector w = [w_1, w_2, ..., w_n]^T, the input features and the parameter vector are linearly combined as the inner product of the two vectors, w^T·x = w_1·x_1 + w_2·x_2 + ... + w_n·x_n, and the prediction target formula is:

p_LR = σ(w^T·x + b) = 1 / (1 + e^(-(w^T·x + b)))

where p_LR is the output of the logistic regression, i.e. the probability that the sample label is 1, x is the model's input feature vector, w holds the linear combination parameters of the features, b is the bias of the linear combination, and σ(w^T·x + b) is the sigmoid mapping function; w and b are optimized by gradient descent to obtain the determined logistic regression model, denoted LR;
training the random forest model on the training set: the random forest is a set of K CART decision trees; each CART classification tree assigns a score to each leaf node, the final score is the mean of the prediction scores of all CARTs, and each CART classification tree is built independently; training on the training set yields the determined random forest model, denoted RF; with the patch pair's feature vector denoted x, the prediction target formula for the probability that the sample label is 1 output by the model is:

p_RF = (1/K)·Σ_{k=1}^{K} T_k(x), with T_k ∈ T_r

where T_k(x) is the k-th CART tree, the learning target of each tree is the label corresponding to x and each tree outputs a classification probability, and T_r denotes the function space containing all CART trees;
training the LightGBM model on the training set: LightGBM training is a process of iteratively building decision trees; after M iterations the LightGBM model is obtained, written F_M(x) = F_{M-1}(x) + α_M·T_M(x), where T_M(x) is the M-th tree built in the iterative process and α_M is the weight of the M-th tree; inputting the training set for training yields the determined LightGBM model, denoted LGB; with the patch pair's feature vector denoted x, the LightGBM model outputs the probability that the sample label is 1:

p_LGB = F_M(x)

where F_M(x) is the final LightGBM model after M iterations and M is the number of iterations to convergence.
2. The method for recommending related patches based on heterogeneous data according to claim 1, wherein after S10 and before S20, further comprising:
S70, cleaning the text and list fields of the multi-source heterogeneous data as required for machine recognition, where the cleaning at least unifies the letter case of names and converts string-typed related-patch lists into a machine-recognizable list type.
3. The method for recommending related patches based on heterogeneous data according to claim 1, wherein the method in S20 for extracting a patch's brief description and converting it into machine-recognizable brief-description embedding features comprises:
S301, removing stop words from each brief-description text, where the stop words at least include interrogative words and punctuation marks;
S302, obtaining the word2vec vector of each word in the brief description with the word2vec method;
S303, embedding each word's word2vec vector at the word's position in the brief description, replacing the word at that position, thereby obtaining an M×N brief-description matrix, where M is the length of the brief description and N is the dimension of the words' word2vec vectors;
S304, averaging each column of the brief-description matrix to obtain a 1×N vector, regarded as the embedding feature of the brief description.
4. The related patch recommendation method based on heterogeneous data as claimed in claim 1, wherein the patch basic information includes submitter and/or reviewer information: the name of the crawled patch's reviewer, the name of the patch's submitter, the name of the patch's author, the name of the project the patch belongs to, the name of the project branch the patch belongs to, the patch's submission time, the number of people participating in the patch's review, and the number of modified code files.
5. The related patch recommendation method based on heterogeneous data as claimed in claim 4, wherein the method for constructing the meta-features from the patch basic information in S20 specifically comprises:
directly extracting the numerical fields of the patch basic information - the patch submission time, the number of people participating in the patch review and the number of modified code files - as numerical features, used as patch meta-features;
regarding the remaining fields of the patch basic information, other than the patch submission time, the number of people participating in the patch review and the number of modified code files, as category fields and one-hot encoding them; combining the one-hot codes of all category fields of the same patch into a group of multi-dimensional sparse one-hot features, used as patch meta-features.
6. The related patch recommendation method based on heterogeneous data as claimed in claim 1, wherein the probability-to-label mapping formula in S50 is:

label_I = 1 if p_I ≥ η, and label_I = 0 if p_I < η

where I is LR, RF or LGB.
7. The related patch recommendation method based on heterogeneous data as claimed in claim 6, wherein the accuracy a_I is calculated as:

a_I = (number of validation samples whose prediction label equals the true label) / (total number of validation samples)

where I is LR, RF or LGB.
8. The related patch recommendation method based on heterogeneous data as claimed in claim 1, wherein the fusion weights and the prediction score in S60 are calculated respectively as:
the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model:

w_I = a_I / (a_LR + a_RF + a_LGB)

where I is LR, RF or LGB, and a_I is the accuracy of the logistic regression model, the random forest model or the LightGBM model respectively;
the prediction score is calculated as:

score_i = Σ_I w_I · p_{i,I}

where I is LR, RF or LGB, p_{i,I} is the probability of a prediction of 1 in the logistic regression model, the random forest model or the LightGBM model respectively, and score_i is the fusion score of the i-th patch pair.
CN201911067915.3A 2019-11-04 2019-11-04 Related patch recommendation method based on heterogeneous data Active CN111045716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911067915.3A CN111045716B (en) 2019-11-04 2019-11-04 Related patch recommendation method based on heterogeneous data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911067915.3A CN111045716B (en) 2019-11-04 2019-11-04 Related patch recommendation method based on heterogeneous data

Publications (2)

Publication Number Publication Date
CN111045716A CN111045716A (en) 2020-04-21
CN111045716B true CN111045716B (en) 2022-02-22

Family

ID=70232211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911067915.3A Active CN111045716B (en) 2019-11-04 2019-11-04 Related patch recommendation method based on heterogeneous data

Country Status (1)

Country Link
CN (1) CN111045716B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469730A (en) * 2021-06-08 2021-10-01 北京化工大学 Customer repurchase prediction method and device based on RF-LightGBM fusion model under non-contract scene

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106408184A (en) * 2016-09-12 2017-02-15 中山大学 User credit evaluation model based on multi-source heterogeneous data
CN108090607A (en) * 2017-12-13 2018-05-29 中山大学 A kind of social media user's ascribed characteristics of population Forecasting Methodology based on the fusion of multi-model storehouse
CN109472462A (en) * 2018-10-18 2019-03-15 中山大学 A kind of project risk ranking method and device based on the fusion of multi-model storehouse
CN109522917A (en) * 2018-09-10 2019-03-26 中山大学 A method of fusion forecasting is stacked based on multi-model
CN109887540A (en) * 2019-01-15 2019-06-14 中南大学 A kind of drug targets interaction prediction method based on heterogeneous network insertion
CN109960759A (en) * 2019-03-22 2019-07-02 中山大学 Recommender system clicking rate prediction technique based on deep neural network
CN110263257A (en) * 2019-06-24 2019-09-20 北京交通大学 Multi-source heterogeneous data mixing recommended models based on deep learning
CN110348580A (en) * 2019-06-18 2019-10-18 第四范式(北京)技术有限公司 Construct the method, apparatus and prediction technique, device of GBDT model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106548210B (en) * 2016-10-31 2021-02-05 腾讯科技(深圳)有限公司 Credit user classification method and device based on machine learning model training

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106408184A (en) * 2016-09-12 2017-02-15 中山大学 User credit evaluation model based on multi-source heterogeneous data
CN108090607A (en) * 2017-12-13 2018-05-29 中山大学 A kind of social media user's ascribed characteristics of population Forecasting Methodology based on the fusion of multi-model storehouse
CN109522917A (en) * 2018-09-10 2019-03-26 中山大学 A method of fusion forecasting is stacked based on multi-model
CN109472462A (en) * 2018-10-18 2019-03-15 中山大学 A kind of project risk ranking method and device based on the fusion of multi-model storehouse
CN109887540A (en) * 2019-01-15 2019-06-14 中南大学 A kind of drug targets interaction prediction method based on heterogeneous network insertion
CN109960759A (en) * 2019-03-22 2019-07-02 中山大学 Recommender system clicking rate prediction technique based on deep neural network
CN110348580A (en) * 2019-06-18 2019-10-18 第四范式(北京)技术有限公司 Construct the method, apparatus and prediction technique, device of GBDT model
CN110263257A (en) * 2019-06-24 2019-09-20 北京交通大学 Multi-source heterogeneous data mixing recommended models based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Code Review Knowledge Perception: Fusing Multi-Features for Salient-Class Location; Yuan Huang et al.; IEEE Transactions on Software Engineering; 2018-08-29; pp. 1-17 *
Fusing Kalman Filter With TLD Algorithm For Target Tracking; Chengjian Sun et al.; Proceedings of the 34th Chinese Control Conference; 2015-07-30; pp. 3736-3741 *
Taxi Estimated Time of Arrival Prediction Based on Ensemble Learning; Li Huihua; China Master's Theses Full-text Database, Information Science and Technology; 2019-07-15; pp. I140-61 *

Also Published As

Publication number Publication date
CN111045716A (en) 2020-04-21

Similar Documents

Publication Publication Date Title
Liu et al. Learning to spot and refactor inconsistent method names
CN109446338B (en) Neural network-based drug disease relation classification method
CN110196906B (en) Deep learning text similarity detection method oriented to financial industry
CN108984775B (en) Public opinion monitoring method and system based on commodity comments
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
WO2005010727A2 (en) Extracting data from semi-structured text documents
CN113254507B (en) Intelligent construction and inventory method for data asset directory
CN115019906A (en) Multi-task sequence labeled drug entity and interaction combined extraction method
Shafiq et al. NLP4IP: Natural language processing-based recommendation approach for issues prioritization
CN111045716B (en) Related patch recommendation method based on heterogeneous data
CN116342167B (en) Intelligent cost measurement method and device based on sequence labeling named entity recognition
CN117076765A (en) Intelligent recruitment system sentry matching method and system based on heterogeneous graph neural network
CN116484025A (en) Vulnerability knowledge graph construction method, vulnerability knowledge graph evaluation equipment and storage medium
CN114879945A (en) Long-tail distribution characteristic-oriented diversified API sequence recommendation method and device
EP3977310B1 (en) Method for consolidating dynamic knowledge organization systems
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
Dziczkowski et al. An autonomous system designed for automatic detection and rating of film reviews
CN113326348A (en) Blog quality evaluation method and tool
Short et al. Tag recommendations in stackoverflow
Shahzad et al. On comparing manual and automatic generated textual descriptions of business process models
US20230004583A1 (en) Method of graph modeling electronic documents with author verification
CN117852553B (en) Language processing system for extracting component transaction scene information based on chat record
Sheikhshab et al. Graphner: Using corpus level similarities and graph propagation for named entity recognition
CN113821618B (en) Method and system for extracting class items of electronic medical record
Dai et al. Grantextractor: A winning system for extracting grant support information from biomedical literature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant