CN111045716B - Related patch recommendation method based on heterogeneous data - Google Patents

Related patch recommendation method based on heterogeneous data

Info

Publication number
CN111045716B
Authority
CN
China
Prior art keywords
patch
model
prediction
probability
logistic regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911067915.3A
Other languages
Chinese (zh)
Other versions
CN111045716A (en)
Inventor
Zibin Zheng (郑子彬)
Zhihao Chen (陈志豪)
Quanzhong Li (李全忠)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201911067915.3A
Publication of CN111045716A
Application granted
Publication of CN111045716B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/60 - Software deployment
    • G06F8/65 - Updates
    • G06F8/658 - Incremental updates; Differential updates

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a related patch recommendation method based on heterogeneous data, which comprises: crawling multi-source heterogeneous data from the code review system, cleaning the data, splicing the heterogeneous features into patch feature vectors, pairing patches, taking pairs related to the predicted patch as positive samples and unrelated pairs as negative samples and marking both with binary labels, dividing a training set and a validation set, training three models (logistic regression, random forest and LightGBM) on the training set to obtain the corresponding probabilities and prediction labels, calculating each model's accuracy from its prediction labels, and finally constructing a prediction score as the weighted sum of the fusion weights and the corresponding probabilities to obtain the optimal prediction score. The method uses machine learning to evaluate the relevance of data submitted to the code review system and obtain the best recommendation, which improves the reliability and stability of the recommendation and saves labor cost.

Description

Related patch recommendation method based on heterogeneous data
Technical Field
The invention relates to the field of code review, and in particular to a related patch recommendation method based on heterogeneous data.
Background
Code review is an important basis for the smooth iteration of a software engineering project. It consists of many small but complex tasks, including correcting code style, adding missing code comments, and the like. At present, most software engineering projects adopt manual review to update and manage code versions, which incurs a high labor cost.
Currently, the software engineering industry generally uses Git, Gerrit and similar systems for code management and review. On these systems, each code update is called a code modification, or a patch. For each submitted patch, these systems provide reviewers and a submitter to complete the query or review. The basic flow of patch review is as follows:
1) the programmer, i.e. the author, completes the patch;
2) the submitter submits the patch to the review system;
3) the reviewers review and evaluate the patch on the review system. If the patch passes, the process ends; otherwise, the patch author revises it according to the reviewers' opinions and the flow returns to the first step.
Such a system is described in detail below using Gerrit as an example:
1) the code review system stores every version of the code belonging to the same patch, each also referred to as a code update.
2) A typical code review system assigns a unique numeric id to each patch submitted to the system; a larger id indicates a later submission time.
3) A patch submitted to the system generally includes: a brief description introducing the problem the patch solves, submitter information, reviewer information, the submission time, the code files, and the diff file, i.e. the code-modification comparison file generated by the system.
Backtracking individual code versions and referring to related submissions is a necessary process for correcting code errors and exceptions. However, because each project undergoes many code iterations and contains many code files, manually finding the other submissions related to the currently problematic submission, or manually marking related submissions in advance, costs a great deal of labor and time.
Disclosure of Invention
In order to overcome at least one of the defects of the prior art, the invention provides a related patch recommendation method based on heterogeneous data.
The present invention aims to solve the above technical problem at least to some extent.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a related patch recommendation method based on heterogeneous data comprises the following steps:
S10, crawling multi-source heterogeneous data from the code review records or files of the code review website, where the multi-source heterogeneous data comprises patch basic information, patch code, patch brief descriptions, and the list of patches related to the predicted patch;
S20, constructing meta-features from the patch basic information; counting the frequency of each modification type in the patch code and combining the frequencies of all modification types in the same patch's modification process into the patch's modification feature vector; extracting each patch's brief description and converting the natural language into machine-recognizable brief-description embedding features; splicing the patch meta-features, the modification feature vector and the brief-description embedding features into the patch feature vector;
S30, in the patch data, forming a "patch i - patch j" pair from the predicted patch i and each patch j in its related-patch list, regarded as a positive sample, and forming a "patch i - patch j" pair from the predicted patch i and a patch j outside the list, regarded as a negative sample, where i denotes the current patch and j the index of the paired patch; marking the positive and negative samples with binary labels; collecting positive and negative samples from the multi-source heterogeneous data into a sample set, and dividing the sample set into a training set and a validation set;
S40, training a logistic regression model, a random forest model and a LightGBM model on the training set;
S50, setting a prediction threshold η and a probability-to-label mapping; inputting the validation set to the logistic regression model to obtain the probability P_LR and the corresponding prediction labels, to the random forest model to obtain the probability P_RF and the corresponding prediction labels, and to the LightGBM model to obtain the probability P_LGB and the corresponding prediction labels; calculating each model's prediction accuracy from its prediction labels, obtaining the accuracy a_LR of the logistic regression model, the accuracy a_RF of the random forest model and the accuracy a_LGB of the LightGBM model;
S60, calculating the fusion weights w_I of logistic regression, random forest and LightGBM respectively, and constructing a prediction score as the weighted sum of the fusion weights w_I and the probabilities P_LR, P_RF and P_LGB, obtaining the optimal prediction score.
Preferably, after S10 and before S20, the method further includes:
S70, cleaning the text and list fields of the multi-source heterogeneous data as required for machine recognition, where the cleaning at least unifies the letter case of names and converts string-typed related-patch lists into a machine-recognizable list type.
Preferably, the method in S20 for extracting a patch's brief description and converting its natural language into machine-recognizable brief-description embedding features comprises:
S301, removing stop words from each brief-description text, where the stop words at least include interrogative words and punctuation marks;
S302, obtaining the word2vec vector of each word in the brief description with the word2vec method;
S303, embedding each word's word2vec vector at the word's position in the brief description, replacing the word at that position, thereby obtaining an M×N brief-description matrix, where M is the length of the brief description and N is the dimension of the words' word2vec vectors;
S304, averaging each column of the brief-description matrix to obtain a 1×N vector, regarded as the embedding feature of the brief description.
Preferably, the patch basic information includes submitter and/or reviewer information: the name of the crawled patch's reviewer, the name of the patch's submitter, the name of the patch's author, the name of the project the patch belongs to, the name of the project branch the patch belongs to, the patch's submission time, the number of people participating in the patch's review, and the number of modified code files.
Preferably, the method for constructing the meta-features from the patch basic information in S20 specifically comprises:
directly extracting the numerical fields of the patch basic information - the patch submission time, the number of people participating in the patch review and the number of modified code files - as numerical features, used as patch meta-features;
regarding the remaining fields of the patch basic information, other than the patch submission time, the number of people participating in the patch review and the number of modified code files, as category fields and one-hot encoding them; combining the one-hot codes of all category fields of the same patch into a group of multi-dimensional sparse one-hot features, used as patch meta-features.
Preferably, the logistic regression model, the random forest model and the LightGBM model each perform a binary classification task whose target is to predict whether two patches are related, i.e. whether the label of a "patch i - patch j" pair is 0 or 1; each final model outputs the probability that a sample's label is 1 and the sample's corresponding prediction label 0 or 1.
Preferably, S40 specifically comprises:
training the logistic regression model on the training set: the feature vector of a patch pair, x = [x_1, x_2, ..., x_n]^T, is input to the logistic regression model; with the model's parameter vector w = [w_1, w_2, ..., w_n]^T, the input features and the parameter vector are linearly combined as the inner product of the two vectors, w^T·x = w_1·x_1 + w_2·x_2 + ... + w_n·x_n, and the prediction target formula is:

p_LR = σ(w^T·x + b) = 1 / (1 + e^(-(w^T·x + b)))

where p_LR is the output of the logistic regression, i.e. the probability that the sample label is 1, x is the model's input feature vector, w holds the linear combination parameters of the features, b is the bias of the linear combination, and σ(w^T·x + b) is the sigmoid mapping function; w and b are optimized by gradient descent to obtain the determined logistic regression model, denoted LR;
training the random forest model on the training set: the random forest is a set of K CART decision trees; each CART classification tree assigns a score to each leaf node, the final score is the mean of the prediction scores of all CARTs, and each CART classification tree is built independently; training on the training set yields the determined random forest model, denoted RF; with the patch pair's feature vector denoted x, the prediction target formula for the probability that the sample label is 1 output by the model is:

p_RF = (1/K)·Σ_{k=1}^{K} T_k(x), with T_k ∈ T_r

where T_k(x) is the k-th CART tree, the learning target of each tree is the label corresponding to x and each tree outputs a classification probability, and T_r denotes the function space containing all CART trees;
training the LightGBM model on the training set: LightGBM training is a process of iteratively building decision trees; after M iterations the LightGBM model is obtained, written F_M(x) = F_{M-1}(x) + α_M·T_M(x), where T_M(x) is the M-th tree built in the iterative process and α_M is the weight of the M-th tree; inputting the training set for training yields the determined LightGBM model, denoted LGB; inputting a validation-set patch feature vector x to the LightGBM model, the model outputs the probability that the sample label is 1:

p_LGB = F_M(x)

where F_M(x) is the final LightGBM model after M iterations and M is the number of iterations to convergence.
Preferably, the probability-to-label mapping formula in S50 is:

label_I = 1 if p_I ≥ η, and label_I = 0 if p_I < η

where I is LR, RF or LGB.
Preferably, the accuracy a_I is calculated as:

a_I = (number of validation samples whose prediction label equals the true label) / (total number of validation samples)

where I is LR, RF or LGB.
Preferably, the fusion weights and the prediction score in S60 are calculated as follows:
the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model are respectively

w_I = a_I / (a_LR + a_RF + a_LGB)

where I is LR, RF or LGB;
the prediction score is calculated as

score_i = Σ_I w_I · p_{i,I}

where I is LR, RF or LGB, and p_{i,I} is the probability of label 1 predicted for the i-th patch pair by the logistic regression model, the random forest model or the LightGBM model respectively.
Compared with the prior art, the technical solution of the invention has the following beneficial effects: it automatically recommends the pages of related patches, greatly saving the cost of manually searching for related-patch pages and effectively improving reviewers' efficiency; it uses multi-source heterogeneous data as the input of machine learning model training, describing the similarities and differences between submissions in code review from different dimensions and making full use of the information of each dimension; and it generates the final recommendation result by multi-model fusion, improving the reliability and stability of the recommendation.
Drawings
FIG. 1 is a schematic diagram of the model fusion of the invention.
FIG. 2 is an exemplary diagram of obtaining the word2vec vector of a word in a brief description using the word2vec method.
FIG. 3 is a schematic diagram of the architecture for training an ensemble tree model based on the bagging ensemble mode.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in FIGS. 1 to 3, a related patch recommendation method based on heterogeneous data comprises:
S10, crawling multi-source heterogeneous data from the code review records or files of the code review website, where the multi-source heterogeneous data comprises patch basic information, patch code, patch brief descriptions, and the list of patches related to the predicted patch;
S20, constructing meta-features from the patch basic information; counting the frequency of each modification type in the patch code and combining the frequencies of all modification types in the same patch's modification process into the patch's modification feature vector; extracting each patch's brief description and converting the natural language into machine-recognizable brief-description embedding features; splicing the patch meta-features, the modification feature vector and the brief-description embedding features into the patch feature vector;
S30, in the patch data, forming a "patch i - patch j" pair from the predicted patch i and each patch j in its related-patch list, regarded as a positive sample, and forming a "patch i - patch j" pair from the predicted patch i and a patch j outside the list, regarded as a negative sample, where i denotes the current patch and j the index of the paired patch; marking the positive and negative samples with binary labels; collecting positive and negative samples from the multi-source heterogeneous data into a sample set, and dividing the sample set into a training set and a validation set;
S40, training a logistic regression model, a random forest model and a LightGBM model on the training set;
S50, setting a prediction threshold η and a probability-to-label mapping; inputting the validation set to the logistic regression model to obtain the probability P_LR and the corresponding prediction labels, to the random forest model to obtain the probability P_RF and the corresponding prediction labels, and to the LightGBM model to obtain the probability P_LGB and the corresponding prediction labels; calculating each model's prediction accuracy from its prediction labels, obtaining the accuracy a_LR of the logistic regression model, the accuracy a_RF of the random forest model and the accuracy a_LGB of the LightGBM model;
S60, calculating the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model respectively, and constructing a prediction score as the weighted sum of the fusion weights w_I and the probabilities P_LR, P_RF and P_LGB, obtaining the optimal prediction score.
Preferably, after S10 and before S20, the method further includes:
S70, cleaning the text and list fields of the multi-source heterogeneous data as required for machine recognition, where the cleaning at least unifies the letter case of names and converts string-typed related-patch lists into a machine-recognizable list type.
Preferably, the method in S20 for extracting a patch's brief description and converting its natural language into machine-recognizable brief-description embedding features comprises:
S301, removing stop words from each brief-description text, where the stop words at least include interrogative words and punctuation marks;
S302, obtaining the word2vec vector of each word in the brief description with the word2vec method;
S303, embedding each word's word2vec vector at the word's position in the brief description, replacing the word at that position, thereby obtaining an M×N brief-description matrix, where M is the length of the brief description and N is the dimension of the words' word2vec vectors;
S304, averaging each column of the brief-description matrix to obtain a 1×N vector, regarded as the embedding feature of the brief description.
Preferably, the patch basic information includes submitter and/or reviewer information: the name of the crawled patch's reviewer, the name of the patch's submitter, the name of the patch's author, the name of the project the patch belongs to, the name of the project branch the patch belongs to, the patch's submission time, the number of people participating in the patch's review, and the number of modified code files.
Preferably, the method for constructing the meta-features from the patch basic information in S20 specifically comprises:
directly extracting the numerical fields of the patch basic information - the patch submission time, the number of people participating in the patch review and the number of modified code files - as numerical features, used as patch meta-features;
regarding the remaining fields of the patch basic information, other than the patch submission time, the number of people participating in the patch review and the number of modified code files, as category fields and one-hot encoding them; combining the one-hot codes of all category fields of the same patch into a group of multi-dimensional sparse one-hot features, used as patch meta-features.
Preferably, the logistic regression model, the random forest model and the LightGBM model each perform a binary classification task whose target is to predict whether two patches are related, i.e. whether the label of a "patch i - patch j" pair is 0 or 1; each final model outputs the probability that a sample's label is 1 and the sample's corresponding prediction label 0 or 1.
Preferably, S40 specifically comprises:
training the logistic regression model on the training set: the feature vector of a patch pair, x = [x_1, x_2, ..., x_n]^T, is input to the logistic regression model; with the model's parameter vector w = [w_1, w_2, ..., w_n]^T, the input features and the parameter vector are linearly combined as the inner product of the two vectors, w^T·x = w_1·x_1 + w_2·x_2 + ... + w_n·x_n, and the prediction target formula is:

p_LR = σ(w^T·x + b) = 1 / (1 + e^(-(w^T·x + b)))

where p_LR is the output of the logistic regression, i.e. the probability that the sample label is 1, x is the model's input feature vector, w holds the linear combination parameters of the features, b is the bias of the linear combination, and σ(w^T·x + b) is the sigmoid mapping function; w and b are optimized by gradient descent to obtain the determined logistic regression model, denoted LR;
training the random forest model on the training set: the random forest is a set of K CART decision trees; each CART classification tree assigns a score to each leaf node, the final score is the mean of the prediction scores of all CARTs, and each CART classification tree is built independently; training on the training set yields the determined random forest model, denoted RF; with the patch pair's feature vector denoted x, the prediction target formula for the probability that the sample label is 1 output by the model is:

p_RF = (1/K)·Σ_{k=1}^{K} T_k(x), with T_k ∈ T_r

where T_k(x) is the k-th CART tree, the learning target of each tree is the label corresponding to x and each tree outputs a classification probability, and T_r denotes the function space containing all CART trees;
training the LightGBM model on the training set: LightGBM training is a process of iteratively building decision trees; after M iterations the LightGBM model is obtained, written F_M(x) = F_{M-1}(x) + α_M·T_M(x), where T_M(x) is the M-th tree built in the iterative process and α_M is the weight of the M-th tree; inputting the training set for training yields the determined LightGBM model, denoted LGB; inputting a validation-set patch feature vector x to the LightGBM model, the model outputs the probability that the sample label is 1:

p_LGB = F_M(x)

where F_M(x) is the final LightGBM model after M iterations and M is the number of iterations to convergence.
Preferably, the probability-to-label mapping formula in S50 is:

label_I = 1 if p_I ≥ η, and label_I = 0 if p_I < η

where I is LR, RF or LGB.
Preferably, the accuracy a_I is calculated as:

a_I = (number of validation samples whose prediction label equals the true label) / (total number of validation samples)

where I is LR, RF or LGB.
Preferably, the fusion weights and the prediction score in S60 are calculated as follows:
the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model are respectively

w_I = a_I / (a_LR + a_RF + a_LGB)

where I is LR, RF or LGB;
the prediction score is calculated as

score_i = Σ_I w_I · p_{i,I}

where I is LR, RF or LGB, and p_{i,I} is the probability of label 1 predicted for the i-th patch pair by the logistic regression model, the random forest model or the LightGBM model respectively.
An actual operation example:
1. Automatically acquire the review records and files of multiple projects on Gerrit with a web crawler. The obtained data is heterogeneous and of multiple types, and includes at least the following three:
1) Basic information of the patch (the source of the patch meta-features): discrete data such as the submitter and reviewers. The patch submission information crawled includes the name of the patch's reviewer, the name of the patch's submitter, the name of the patch's author, the name of the project the patch belongs to, the name of the project branch the patch belongs to, the patch's submission time, the number of people participating in the patch's review, the number of modified code files, and so on.
2) The brief description crawled from each patch's submission, which is text data.
3) The code files modified in each crawled patch, comprising only the code text files that were changed.
2. Data cleaning: clean the data acquired by the crawler, e.g. unify the letter case of names and convert string-typed related-patch lists into list types.
3. Convert the human-recognizable fields of the patch page's basic information into machine-recognizable features; for example, one-hot encode the category fields. One-hot encoding is a method commonly used in machine learning to process category-type features such as the project name and the project branch name; it normalizes the category features. The encoding principle is simply to reconstruct the feature according to the number of categories it contains. For example, assume the project has 3 branches in total: master, dev and release. Such values are not handled well by machine learning algorithms, so the discrete values of the feature are usually represented numerically, e.g. master coded as 0, dev as 1 and release as 2. However, even converted to numbers, this data cannot be used directly in the classification model, because a classifier by default assumes its input is continuous and ordered. One possible solution is One-Hot Encoding, in which only one bit of the code is set, hence the name "one-bit-effective encoding". The table below shows, as an example only, the one-hot encoding of the project branch field:

    branch     one-hot code
    master     1 0 0
    dev        0 1 0
    release    0 0 1

Table 1

Combine the one-hot codes of all category fields of the submitted patch - the reviewer name, the submitter name, the author name, the project name and the project branch name - into one group of features, finally forming a multi-dimensional sparse feature vector.

Table 2 (the combined one-hot features of a patch's category fields)

The three numerical fields - the patch submission time, the number of people participating in the patch review and the number of modified code files - only need to be kept in their original numerical form to be used directly as features.
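By way of illustration, the meta-feature construction above can be sketched in Python. This is a minimal sketch under assumed field names (reviewer, submitter, n_files, and so on, which are not the patent's exact schema); pandas' get_dummies stands in for any one-hot encoder.

import pandas as pd

# Toy patch records: category fields plus the three numerical fields.
patches = pd.DataFrame([
    {"reviewer": "alice", "submitter": "bob", "author": "bob",
     "project": "nova", "branch": "master",
     "submit_time": 1572825600, "n_reviewers": 3, "n_files": 5},
    {"reviewer": "carol", "submitter": "dan", "author": "dan",
     "project": "nova", "branch": "dev",
     "submit_time": 1572912000, "n_reviewers": 2, "n_files": 1},
])

category_fields = ["reviewer", "submitter", "author", "project", "branch"]
numeric_fields = ["submit_time", "n_reviewers", "n_files"]

# One-hot encode the category fields into a sparse 0/1 block ...
onehot = pd.get_dummies(patches[category_fields], dtype=int)
# ... and keep the numerical fields in their original form.
meta_features = pd.concat([patches[numeric_fields], onehot], axis=1)
print(meta_features.shape)  # (2, 3 + total number of one-hot columns)

Concatenating the untouched numeric columns with the one-hot block mirrors the split described above: numerical fields stay numeric, category fields become a multi-dimensional sparse 0/1 group.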
4. Extract the frequency of each modification type in each submitted code text. The main function of this step is to extract the process information of each patch's iterative modification as features for related-patch recommendation; these features reflect the correlation of different patches in the modification process. The steps are:
1) take the initial-version and final-version code files of the patch;
2) compare the two files with the ChangeDistiller tool and count the frequency of each modification type of the final-version code relative to the initial-version code, where the modification types include: adding a statement, changing a statement, deleting a statement, adding a class, modifying a class, deleting a class, adding a method, modifying a method, deleting a method, renaming a method, modifying a return type, modifying a parameter type, renaming a parameter, adding a parameter, deleting a parameter, adding an attribute, renaming an attribute, changing a conditional expression, and so on;
3) combine the frequencies of all modification types in the same patch's modification process into a vector, used as a group of features.
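As a sketch of this step, assume the ChangeDistiller output has already been parsed into a list of change-type strings (ChangeDistiller itself is a Java tool; that parsing is not shown here, and the type names below are illustrative). The frequency vector can then be built over a fixed vocabulary:

from collections import Counter

# Fixed vocabulary of modification types (subset shown; the patent lists
# statement/class/method/parameter/attribute changes and more).
CHANGE_TYPES = [
    "statement_insert", "statement_update", "statement_delete",
    "class_insert", "class_update", "class_delete",
    "method_insert", "method_update", "method_delete",
]

def modification_vector(changes):
    """Map a list of change-type strings (e.g. parsed from ChangeDistiller
    output) to a fixed-length frequency vector."""
    counts = Counter(changes)
    return [counts.get(t, 0) for t in CHANGE_TYPES]

# Hypothetical diff between a patch's first and last code versions.
changes = ["statement_insert", "statement_insert", "method_update"]
print(modification_vector(changes))  # [2, 0, 0, 0, 0, 0, 0, 1, 0]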
5. Extract the embedding features of each patch's brief description at submission.
The patch's brief description must be converted from natural language into a machine-recognizable feature vector. Vector embedding is the general name for deep-learning numerical mapping methods: when processing text data, the high-dimensional sparse one-hot code of each word is mapped, through the neural network's gradient-based learning, to a corresponding low-dimensional dense vector, and such vectors are collectively called embedding features. For example, given a document, i.e. a word sequence such as "A B A C B F G", we can finally obtain the vector corresponding to A, e.g. [0.1 0.6 -0.5], the vector corresponding to B, e.g. [-0.2 0.9 0.7], and thereby a vector corresponding to a sentence. The patch brief description intuitively reflects the problem the corresponding patch solves, and thus reflects relevance in terms of the problems solved; but it must be embedded for a machine to understand such text information. The steps for extracting the embedding features of each patch's brief description at submission are:
1) remove stop words from each brief-description text, where the stop words include interrogative words such as "what" and "why", and punctuation marks;
2) obtain the word2vec vector of each word in the brief description with the word2vec method;
3) place each word's word2vec vector at the word's position in the brief description; each brief description thus yields an M×N matrix, where M is the length of the brief description and N is the dimension of the word vectors output by the word2vec embedding;
4) average each column of the brief description's matrix to obtain a 1×N vector, i.e. the embedding feature vector of the brief description.
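A minimal sketch of steps 1) to 4) using the gensim word2vec implementation; the toy corpus and the vector size N = 50 are assumptions for illustration only:

import numpy as np
from gensim.models import Word2Vec

# Tokenized brief descriptions after stop-word removal (toy corpus).
briefs = [
    ["fix", "null", "pointer", "in", "scheduler"],
    ["add", "unit", "test", "for", "scheduler"],
]

# Train word2vec on the corpus (N = 50 here; the patent leaves N open).
w2v = Word2Vec(briefs, vector_size=50, min_count=1, seed=0)

def brief_embedding(tokens, model):
    """Stack the M word vectors into an M x N matrix and average each
    column, yielding the 1 x N embedding feature of the brief description."""
    matrix = np.stack([model.wv[t] for t in tokens if t in model.wv])
    return matrix.mean(axis=0)

print(brief_embedding(briefs[0], w2v).shape)  # (50,)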
6. Splice the features, label the samples, and train the logistic regression, random forest and LightGBM models.
1) Feature splicing: concatenate all the feature vectors constructed above end to end to obtain a new vector.
2) Labeling: the patches crawled from the website carry no ready-made labels, so labels must be assigned. Since some patch pages already carry manually added related-patch lists, the manually marked related patch pairs can be regarded as positive samples, and patch pairs not in the related lists as negative samples.
Specifically, the implementation steps of splicing features and labeling include:
Crawl the "related patches" list of each patch on the Gerrit system to the local machine; these lists are constructed manually by the system's reviewers (see the example in FIG. 1). The list of patch i is denoted l_i, which contains n_i related entries, i.e. has length n_i.
First, obtain the positive samples, i.e. pairs of patches that are related. Assume a patch has a manually constructed recommendation list of "related patches"; obviously, the current patch is related to each patch in that list. Form a "patch i - patch j" pair from the predicted patch and each patch in the list, where patch i is the predicted patch and patch j is a patch in the list, and mark the pair's label as 1.
Second, obtain the negative samples, i.e. pairs of patches that are not related. Assume again that a patch has a manually constructed "related patches" list; no correlation is assumed between the predicted patch and the patches outside this list. However, the number of such "unrelated" patch pairs can be far larger than the number of "related" pairs.
To balance the numbers of negative and positive samples, n_i patches are randomly drawn from the system, none of which is in l_i. The predicted patch i and each of these n_i patches form a "patch i - patch j" pair, and the label of such a pair is marked 0. The sample pairs are merged and the respective features of patch i and patch j are concatenated: if patch i's features have dimension D and patch j's features have dimension D, the input feature of the merged model is

x_ij = [x_i, x_j]

of dimension 2·D. The final ratio of positive to negative samples is thus 1:1, and each patch pair's feature dimension is 2·D. The sample set is then divided into a training set and a validation set at a ratio of 0.7:0.3, and the logistic regression, random forest and LightGBM models are trained on the training set.
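A compact end-to-end sketch of this step with scikit-learn and LightGBM; the patch features (feats), the related-patch lists (related) and all sizes are toy assumptions for illustration, not the patent's data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import lightgbm as lgb

rng = np.random.default_rng(0)

# Toy stand-ins: feats[pid] is the D-dimensional feature vector of patch
# pid; related[pid] is its manually built related-patch list l_i.
D = 16
feats = {pid: rng.normal(size=D) for pid in range(200)}
related = {0: [3, 7], 1: [2], 5: [9, 11, 20]}

X, y = [], []
for i, rel in related.items():
    # Draw n_i negatives outside l_i to balance positives 1:1.
    negatives = rng.choice(
        [p for p in feats if p != i and p not in rel],
        size=len(rel), replace=False)
    for j, label in [(j, 1) for j in rel] + [(j, 0) for j in negatives]:
        X.append(np.concatenate([feats[i], feats[j]]))  # 2*D features
        y.append(label)
X, y = np.array(X), np.array(y)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 0.7 : 0.3 split

models = {
    "LR": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "RF": RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr),
    # min_child_samples=1 only so the demo runs on this tiny toy set.
    "LGB": lgb.LGBMClassifier(n_estimators=100,
                              min_child_samples=1).fit(X_tr, y_tr),
}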
3) The logistic regression, random forest and LightGBM models all adopt classification algorithms, but they differ in how, and how well, they learn from the input features, and each can complete an independent binary classification task. The training models of the invention are binary classification models: the target is to predict whether two patches are related, i.e. the label of a "patch i - patch j" pair is 0 or 1, and the final model's prediction output comprises the probability that a sample's label is 1 and the sample's prediction label 0 or 1. The specific method is as follows:
First, logistic regression expresses the implicit linear relation between input and output well, and automatically reduces the negative effect of feature multicollinearity. The features of the training-set samples are used as input, written x = [x_1, x_2, ..., x_n]^T; with the parameter vector w = [w_1, w_2, ..., w_n]^T, the linear combination is written as the inner product of the two vectors, w^T·x = w_1·x_1 + w_2·x_2 + ... + w_n·x_n, and the prediction formula is:

p_LR = σ(w^T·x + b) = 1 / (1 + e^(-(w^T·x + b)))

where p_LR is the output of the logistic regression, i.e. the probability that the sample label is 1, x is the model's input feature vector, w holds the linear combination parameters of the features, b is the bias of the linear combination, and σ(w^T·x + b) is the sigmoid mapping function; w and b are optimized by gradient descent to obtain the determined logistic regression model, denoted LR.
After the probability is obtained, a threshold η is set as the basis for deciding the 0 and 1 labels, e.g. η = 0.5. The mapping from probability to label is:

label_LR = 1 if p_LR ≥ η, and label_LR = 0 if p_LR < η
Using the features of the validation-set samples as input, the prediction labels of the validation samples are obtained, and from them the accuracy of LR's individual prediction:

a_LR = (number of validation samples correctly predicted by LR) / (total number of validation samples)
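Continuing the sketch above, the threshold mapping and per-model accuracy can be written once and reused for LR, RF and LGB (models, X_va and y_va come from the previous snippet):

import numpy as np

eta = 0.5  # prediction threshold from the text

def labels_and_accuracy(model, X_va, y_va):
    """Map predicted probabilities to 0/1 labels via the threshold eta,
    then compute the model's accuracy a_I on the validation set."""
    p = model.predict_proba(X_va)[:, 1]   # probability that label == 1
    labels = (p >= eta).astype(int)
    accuracy = (labels == y_va).mean()
    return p, labels, accuracy

p_lr, lab_lr, a_lr = labels_and_accuracy(models["LR"], X_va, y_va)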
The random forest is a common bagging-based ensemble tree model; it expresses the implicit nonlinear relation between input and output well and is highly stable. A random forest is a collection of K CARTs (classification and regression trees). Each CART assigns a score to each leaf node, and the final score is the average of the prediction scores of all CARTs; moreover, in a random forest each tree is built independently. As before, training on the training set yields the determined random forest model, denoted RF. In validation and actual prediction, with the input feature vector written x, the model outputs the probability that the sample label is 1:

p_RF = (1/K)·Σ_{k=1}^{K} T_k(x), with T_k ∈ T_r

where T_k(x) is the k-th CART tree, the learning target of each tree is the label of x and each tree outputs a classification probability, and T_r denotes the function space containing all CART trees. Then, through the probability-to-label mapping

label_RF = 1 if p_RF ≥ η, and label_RF = 0 if p_RF < η,

using the features of the validation-set samples as input, the prediction labels of the validation samples are obtained, and from them the accuracy of the RF model's individual prediction:

a_RF = (number of validation samples correctly predicted by RF) / (total number of validation samples)
The LightGBM is a boosting-based ensemble tree model; it expresses the implicit nonlinear relation between input and output well and has strong fitting ability. Unlike the random forest, where each tree is built independently, LightGBM training is a process of iteratively building decision trees. For example, the final LightGBM model obtained through M iterations can be written

F_M(x) = F_{M-1}(x) + α_M·T_M(x)

where T_M(x) is the M-th tree built in the iterative process and α_M is the weight of the M-th tree. As before, training on the training set yields the determined LightGBM model, denoted LGB. In validation and actual prediction, with the input feature vector written x, the model outputs the probability that the sample label is 1:

p_LGB = F_M(x)

where F_M(x) is the final LightGBM model after M iterations and M is the number of iterations to convergence. Then, through the probability-to-label mapping

label_LGB = 1 if p_LGB ≥ η, and label_LGB = 0 if p_LGB < η,

using the features of the validation-set samples as input, the prediction labels of the validation samples are obtained, and from them the accuracy of the LGB model's individual prediction:

a_LGB = (number of validation samples correctly predicted by LGB) / (total number of validation samples)
7. Compute the fusion weights from the accuracies of the step-6 models on the validation set, take the weighted sum of the probabilities at test time, record the summation result as the score, sort the samples by score, and take the top k as the final recommendation. First, for the same feature, different models learn the information the feature contains from different angles. In addition, accuracy reflects the strength of a single model's learning ability on the features, so the fusion weights computed from the accuracies indirectly reflect the credibility of each model's predictions among the several models. Fusion thus ensures the comprehensiveness of the final model's learning, while weighting lets the final model distinguish its degree of dependence on information from different angles. The weighted fusion of multiple models helps improve the accuracy and stability of the final recommendation.
1) Calculate each model's weight, with the weight formula:

w_I = a_I / (a_LR + a_RF + a_LGB)

where I is LR, RF or LGB, and a_I is the accuracy of the logistic regression model, the random forest model or the LightGBM model respectively;
2) Construct the recommendation candidate set and the prediction scores. The system contains a large number of patches; if every two patches formed a pair and all pairs participated in prediction at recommendation time, large amounts of computing resources would be consumed, so a small candidate set must be built by rule to participate in prediction. Analysis shows the following relationship among the storage ids of Gerrit's existing "related patches": in 80% of the "patch pairs", the difference between the two patch ids is less than 300. Also, the recommendation method of the invention is mainly used in the following scenario: when a patch is newly submitted to the system, the system finds the currently related patches among all previously stored patches as a recommendation list; so when making a "related" recommendation for a patch, only the other patches submitted before it are considered. Therefore, based on each patch's id in the system, the N other patches submitted before the predicted patch are taken, starting from the predicted patch's id, to form a recommendation candidate set of N patch pairs, and features are constructed and spliced according to the methods of steps 3, 4, 5 and 6, where N > 0 and N is preset to 300.
Predict the N patch pairs with the three trained models, for 1 ≤ i ≤ 300:
p_{i,I}, with I = LR, RF or LGB, denotes the probability of label 1 predicted by the logistic regression model, the random forest model or the LightGBM model respectively, and the final fusion score is calculated as:

score_i = Σ_I w_I · p_{i,I}

where score_i is the fusion score of the i-th patch pair, w_I (I = LR, RF or LGB) is the fusion weight of logistic regression, random forest or LightGBM, and p_{i,I} is the probability of label 1 predicted by the corresponding model. Finally, the k patch pairs ranked highest by score are given prediction label 1, i.e. the patches in those k pairs are recommended as the predicted patch's "related patches", where k is preset to 10.
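Continuing the earlier sketches, the fusion weights, weighted scores and top-k selection might look as follows; the candidate features X_cand are randomly generated stand-ins for the N = 300 real candidate pairs:

import numpy as np

N, K_TOP = 300, 10  # candidate-set size and recommendation depth

# Validation accuracies -> fusion weights w_I = a_I / (a_LR + a_RF + a_LGB)
acc = {name: labels_and_accuracy(m, X_va, y_va)[2]
       for name, m in models.items()}
total = sum(acc.values())
w = {name: a / total for name, a in acc.items()}

# Hypothetical candidate pairs: N feature rows of dimension 2*D, i.e. the
# predicted patch paired with the 300 patches submitted just before it.
X_cand = rng.normal(size=(N, 2 * D))

# score_i = sum_I w_I * p_{i,I}
scores = sum(w[name] * m.predict_proba(X_cand)[:, 1]
             for name, m in models.items())

top_k = np.argsort(scores)[::-1][:K_TOP]  # indices of recommended pairs
print(top_k)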
Compared with the prior art, the technical solution of the invention has the following beneficial effects: it automatically recommends the pages of related patches, greatly saving the cost of manually searching for related-patch pages and effectively improving reviewers' efficiency; it uses multi-source heterogeneous data as the input of machine learning model training, describing the similarities and differences between submissions in code review from different dimensions and making full use of the information of each dimension; and it generates the final recommendation result by multi-model fusion, improving the reliability and stability of the recommendation.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (8)

1. A related patch recommendation method based on heterogeneous data is characterized by comprising the following steps:
S10, crawling multi-source heterogeneous data from the code review records or files of the code review website, where the multi-source heterogeneous data comprises patch basic information, patch code, patch brief descriptions, and the list of patches related to the predicted patch;
S20, constructing meta-features from the patch basic information; counting the frequency of each modification type in the patch code and combining the frequencies of all modification types in the same patch's modification process into the patch's modification feature vector; extracting each patch's brief description and converting the natural language into machine-recognizable brief-description embedding features; splicing the patch meta-features, the modification feature vector and the brief-description embedding features into the patch feature vector;
S30, in the patch data, forming a "patch i - patch j" pair from the predicted patch i and each patch j in its related-patch list, regarded as a positive sample, and forming a "patch i - patch j" pair from the predicted patch i and a patch j outside the list, regarded as a negative sample, where i denotes the current patch and j the index of the paired patch; marking the positive and negative samples with binary labels; collecting positive and negative samples from the multi-source heterogeneous data into a sample set, and dividing the sample set into a training set and a validation set;
S40, training a logistic regression model, a random forest model and a LightGBM model on the training set;
S50, setting a prediction threshold η and a probability-to-label mapping; inputting the validation set to the logistic regression model to obtain the probability P_LR and the corresponding prediction labels, to the random forest model to obtain the probability P_RF and the corresponding prediction labels, and to the LightGBM model to obtain the probability P_LGB and the corresponding prediction labels; calculating each model's prediction accuracy from its prediction labels, obtaining the accuracy a_LR of the logistic regression model, the accuracy a_RF of the random forest model and the accuracy a_LGB of the LightGBM model;
S60, calculating the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model respectively, and constructing a prediction score as the weighted sum of the fusion weights w_I and the probabilities P_LR, P_RF and P_LGB, obtaining the optimal prediction score;
the logistic regression model, the random forest model and the LightGBM model each perform a binary classification task whose target is to predict whether two patches are related, i.e. whether the label of a "patch i - patch j" pair is 0 or 1; each final model outputs the probability that a sample's label is 1 and the sample's corresponding prediction label 0 or 1;
the S40 specifically includes:
training the logistic regression model on the training set: the feature vector of a patch pair, x = [x_1, x_2, ..., x_n]^T, is input to the logistic regression model; with the model's parameter vector w = [w_1, w_2, ..., w_n]^T, the input features and the parameter vector are linearly combined as the inner product of the two vectors, w^T·x = w_1·x_1 + w_2·x_2 + ... + w_n·x_n, and the prediction target formula is:

p_LR = σ(w^T·x + b) = 1 / (1 + e^(-(w^T·x + b)))

where p_LR is the output of the logistic regression, i.e. the probability that the sample label is 1, x is the model's input feature vector, w holds the linear combination parameters of the features, b is the bias of the linear combination, and σ(w^T·x + b) is the sigmoid mapping function; w and b are optimized by gradient descent to obtain the determined logistic regression model, denoted LR;
training the random forest model on the training set: the random forest is a set of K CART decision trees; each CART classification tree assigns a score to each leaf node, the final score is the mean of the prediction scores of all CARTs, and each CART classification tree is built independently; training on the training set yields the determined random forest model, denoted RF; with the patch pair's feature vector denoted x, the prediction target formula for the probability that the sample label is 1 output by the model is:

p_RF = (1/K)·Σ_{k=1}^{K} T_k(x), with T_k ∈ T_r

where T_k(x) is the k-th CART tree, the learning target of each tree is the label corresponding to x and each tree outputs a classification probability, and T_r denotes the function space containing all CART trees;
training the LightGBM model on the training set: LightGBM training is a process of iteratively building decision trees; after M iterations the LightGBM model is obtained, written F_M(x) = F_{M-1}(x) + α_M·T_M(x), where T_M(x) is the M-th tree built in the iterative process and α_M is the weight of the M-th tree; inputting the training set for training yields the determined LightGBM model, denoted LGB; with the patch pair's feature vector denoted x, the LightGBM model outputs the probability that the sample label is 1:

p_LGB = F_M(x)

where F_M(x) is the final LightGBM model after M iterations and M is the number of iterations to convergence.
2. The method for recommending related patches based on heterogeneous data according to claim 1, wherein after S10 and before S20, further comprising:
S70, cleaning the text and list fields of the multi-source heterogeneous data as required for machine recognition, where the cleaning at least unifies the letter case of names and converts string-typed related-patch lists into a machine-recognizable list type.
3. The method for recommending related patches based on heterogeneous data according to claim 1, wherein the method in S20 for extracting a patch's brief description and converting it into machine-recognizable brief-description embedding features comprises:
S301, removing stop words from each brief-description text, where the stop words at least include interrogative words and punctuation marks;
S302, obtaining the word2vec vector of each word in the brief description with the word2vec method;
S303, embedding each word's word2vec vector at the word's position in the brief description, replacing the word at that position, thereby obtaining an M×N brief-description matrix, where M is the length of the brief description and N is the dimension of the words' word2vec vectors;
S304, averaging each column of the brief-description matrix to obtain a 1×N vector, regarded as the embedding feature of the brief description.
4. The related patch recommendation method based on heterogeneous data as claimed in claim 1, wherein the patch basic information includes submitter and/or reviewer information: the name of the crawled patch's reviewer, the name of the patch's submitter, the name of the patch's author, the name of the project the patch belongs to, the name of the project branch the patch belongs to, the patch's submission time, the number of people participating in the patch's review, and the number of modified code files.
5. The related patch recommendation method based on heterogeneous data as claimed in claim 4, wherein the method for constructing the meta-features from the patch basic information in S20 specifically comprises:
directly extracting the numerical fields of the patch basic information - the patch submission time, the number of people participating in the patch review and the number of modified code files - as numerical features, used as patch meta-features;
regarding the remaining fields of the patch basic information, other than the patch submission time, the number of people participating in the patch review and the number of modified code files, as category fields and one-hot encoding them; combining the one-hot codes of all category fields of the same patch into a group of multi-dimensional sparse one-hot features, used as patch meta-features.
6. The related patch recommendation method based on heterogeneous data as claimed in claim 1, wherein the probability-to-label mapping formula in S50 is:

label_I = 1 if p_I ≥ η, and label_I = 0 if p_I < η

where I is LR, RF or LGB.
7. The related patch recommendation method based on heterogeneous data as claimed in claim 6, wherein the accuracy a_I is calculated as:

a_I = (number of validation samples whose prediction label equals the true label) / (total number of validation samples)

where I is LR, RF or LGB.
8. The related patch recommendation method based on heterogeneous data as claimed in claim 1, wherein the fusion weights and the prediction score in S60 are calculated respectively as:
the fusion weights w_I of the logistic regression model, the random forest model and the LightGBM model:

w_I = a_I / (a_LR + a_RF + a_LGB)

where I is LR, RF or LGB, and a_I is the accuracy of the logistic regression model, the random forest model or the LightGBM model respectively;
the prediction score is calculated as:

score_i = Σ_I w_I · p_{i,I}

where I is LR, RF or LGB, p_{i,I} is the probability of a prediction of 1 in the logistic regression model, the random forest model or the LightGBM model respectively, and score_i is the fusion score of the i-th patch pair.
CN201911067915.3A 2019-11-04 2019-11-04 Related patch recommendation method based on heterogeneous data Active CN111045716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911067915.3A CN111045716B (en) 2019-11-04 2019-11-04 Related patch recommendation method based on heterogeneous data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911067915.3A CN111045716B (en) 2019-11-04 2019-11-04 Related patch recommendation method based on heterogeneous data

Publications (2)

Publication Number Publication Date
CN111045716A CN111045716A (en) 2020-04-21
CN111045716B true CN111045716B (en) 2022-02-22

Family

ID=70232211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911067915.3A Active CN111045716B (en) 2019-11-04 2019-11-04 Related patch recommendation method based on heterogeneous data

Country Status (1)

Country Link
CN (1) CN111045716B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469730A (en) * 2021-06-08 2021-10-01 北京化工大学 Customer repurchase prediction method and device based on RF-LightGBM fusion model under non-contract scene

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106408184A (en) * 2016-09-12 2017-02-15 中山大学 User credit evaluation model based on multi-source heterogeneous data
CN108090607A (en) * 2017-12-13 2018-05-29 中山大学 A kind of social media user's ascribed characteristics of population Forecasting Methodology based on the fusion of multi-model storehouse
CN109472462A (en) * 2018-10-18 2019-03-15 中山大学 A kind of project risk ranking method and device based on the fusion of multi-model storehouse
CN109522917A (en) * 2018-09-10 2019-03-26 中山大学 A method of fusion forecasting is stacked based on multi-model
CN109887540A (en) * 2019-01-15 2019-06-14 中南大学 A kind of drug targets interaction prediction method based on heterogeneous network insertion
CN109960759A (en) * 2019-03-22 2019-07-02 中山大学 Recommender system clicking rate prediction technique based on deep neural network
CN110263257A (en) * 2019-06-24 2019-09-20 北京交通大学 Multi-source heterogeneous data mixing recommended models based on deep learning
CN110348580A (en) * 2019-06-18 2019-10-18 第四范式(北京)技术有限公司 Construct the method, apparatus and prediction technique, device of GBDT model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106548210B (en) * 2016-10-31 2021-02-05 腾讯科技(深圳)有限公司 Credit user classification method and device based on machine learning model training

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106408184A (en) * 2016-09-12 2017-02-15 中山大学 User credit evaluation model based on multi-source heterogeneous data
CN108090607A (en) * 2017-12-13 2018-05-29 中山大学 A kind of social media user's ascribed characteristics of population Forecasting Methodology based on the fusion of multi-model storehouse
CN109522917A (en) * 2018-09-10 2019-03-26 中山大学 A method of fusion forecasting is stacked based on multi-model
CN109472462A (en) * 2018-10-18 2019-03-15 中山大学 A kind of project risk ranking method and device based on the fusion of multi-model storehouse
CN109887540A (en) * 2019-01-15 2019-06-14 中南大学 A kind of drug targets interaction prediction method based on heterogeneous network insertion
CN109960759A (en) * 2019-03-22 2019-07-02 中山大学 Recommender system clicking rate prediction technique based on deep neural network
CN110348580A (en) * 2019-06-18 2019-10-18 第四范式(北京)技术有限公司 Construct the method, apparatus and prediction technique, device of GBDT model
CN110263257A (en) * 2019-06-24 2019-09-20 北京交通大学 Multi-source heterogeneous data mixing recommended models based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Code Review Knowledge Perception: Fusing Multi-Features for Salient-Class Location; Yuan Huang et al.; IEEE Transactions on Software Engineering; 2018-08-29; pp. 1-17 *
Fusing Kalman Filter With TLD Algorithm For Target Tracking; Chengjian Sun et al.; Proceedings of the 34th Chinese Control Conference; 2015-07-30; pp. 3736-3741 *
Taxi Estimated Time of Arrival Prediction Based on Ensemble Learning; Li Huihua; China Master's Theses Full-text Database, Information Science and Technology; 2019-07-15; pp. I140-61 *

Also Published As

Publication number Publication date
CN111045716A (en) 2020-04-21

Similar Documents

Publication Publication Date Title
Liu et al. Learning to spot and refactor inconsistent method names
CN109446338B (en) Neural network-based drug disease relation classification method
CN110196906B (en) Deep learning text similarity detection method oriented to financial industry
CN108984775B (en) Public opinion monitoring method and system based on commodity comments
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
WO2005010727A2 (en) Extracting data from semi-structured text documents
CN113254507B (en) Intelligent construction and inventory method for data asset directory
CN115019906A (en) Multi-task sequence labeled drug entity and interaction combined extraction method
Shafiq et al. NLP4IP: Natural language processing-based recommendation approach for issues prioritization
CN111045716B (en) Related patch recommendation method based on heterogeneous data
CN116342167B (en) Intelligent cost measurement method and device based on sequence labeling named entity recognition
CN117076765A (en) Intelligent recruitment system sentry matching method and system based on heterogeneous graph neural network
CN116484025A (en) Vulnerability knowledge graph construction method, vulnerability knowledge graph evaluation equipment and storage medium
CN114879945A (en) Long-tail distribution characteristic-oriented diversified API sequence recommendation method and device
EP3977310B1 (en) Method for consolidating dynamic knowledge organization systems
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
Dziczkowski et al. An autonomous system designed for automatic detection and rating of film reviews
CN113326348A (en) Blog quality evaluation method and tool
Short et al. Tag recommendations in stackoverflow
Shahzad et al. On comparing manual and automatic generated textual descriptions of business process models
US20230004583A1 (en) Method of graph modeling electronic documents with author verification
CN117852553B (en) Language processing system for extracting component transaction scene information based on chat record
Sheikhshab et al. Graphner: Using corpus level similarities and graph propagation for named entity recognition
CN113821618B (en) Method and system for extracting class items of electronic medical record
Dai et al. Grantextractor: A winning system for extracting grant support information from biomedical literature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant