CN117931275A

CN117931275A - Automatic code merging conflict resolution method based on machine learning

Info

Publication number: CN117931275A
Application number: CN202410087287.XA
Authority: CN
Inventors: 许蕾; 杨钧尹
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2024-01-19
Filing date: 2024-01-19
Publication date: 2024-04-26

Abstract

The invention relates to a machine-learning-based method for automatically resolving code merge conflicts, comprising the following steps: first, code merge conflicts in a Git code repository are mined by collecting the conflicts that occurred at historical merge commits; next, features are extracted from the collected conflicts together with the developers' manual resolutions; a machine learning model is then trained on these features; finally, the trained model is used to resolve new code merge conflicts. The method is efficient to execute and easy to extend, lightens the burden on developers, and greatly reduces the labor cost of conflict resolution.

Description

Automatic code merging conflict resolution method based on machine learning

Technical Field

The invention belongs to the field of computers, in particular to the technical field of software. It provides a machine-learning-based method for automatically resolving code merge conflicts: by learning from the merge history of a Git code repository, the method handles newly arising merge conflicts and thereby improves the efficiency with which developers resolve them.

Background

The growth in software size has made version control systems (e.g., Git, SVN) increasingly important in the software development process. In Git, a developer may create a new branch to test new functionality or fix vulnerabilities, and may merge one branch into another, but merging branches often causes conflicts. A conflict arises when the new branch and the original branch have both modified content at the same location, so that Git cannot determine which side's modification to take. Because both branches derive from the same original code (the base), this kind of conflict is called a three-way merge conflict. A typical three-way merge conflict has the following form:

<<<<<<<

Content of branch A

|||||||

Content of base

=======

Content of branch B

>>>>>>>

The traditional three-way merge algorithm is line-based: it generally computes, line by line, the longest common subsequence of the original branch and the base, and the longest common subsequence of the new branch and the base, then compares the two results and merges them. Such algorithms merge purely on text, are known as unstructured merging, and are widely used. However, in a large software project each merge can produce a very large number of conflicts, and resolving every conflict manually is very time-consuming. Structured merging, semi-structured merging, and machine-learning-based merging were subsequently proposed in an attempt to resolve part of the merge conflicts automatically.

Structured merging converts the conflicting code files into abstract syntax trees (ASTs) and turns the merging of code into the merging of ASTs. This recasts the task as a graph-theory problem, so that part of the merge conflicts can be resolved with graph algorithms such as the Hungarian algorithm together with some heuristic rules. Semi-structured merging builds on structured merging but works on the AST at a coarser granularity, which gives it better time efficiency.

A machine-learning-based method for automatically resolving code merge conflicts can learn patterns from the merge history of a code repository in order to handle conflicts that may occur in the future. Such a method copes better with large numbers of conflicts, and its conflict resolutions become increasingly accurate as the amount of collected data grows, which reduces the burden on developers and lowers the cost of software development.

Disclosure of Invention

The problem to be solved by the invention is: to collect conflicts from the merge history of a code repository and train a machine learning model that learns patterns from these conflicts and is then used to handle newly arising conflicts.

The technical scheme of the invention is as follows: mine the code merge conflicts in a Git code repository by collecting the conflicts that occurred at historical merge commits; extract features from the collected conflicts and the developers' manual resolutions; train a machine learning model on the collected conflicts and their features; and finally use the trained model to resolve new code merge conflicts.

The invention comprises the following specific steps:

1) Mine the code merge conflicts in the Git code repository and collect the conflicts that occurred at historical merge commits.

2) Extract editing information features from the collected code merge conflicts.

3) Construct an abstract syntax tree (AST) for each conflicting source file, and collect structural information features of the conflict from the AST fragments corresponding to the conflict.

4) Obtain the developer's manual resolution of the conflict from the commit that follows the merge commit where the conflict occurred, label the category of the resolution, and model the conflict resolution problem as a classification problem.

5) Train a model on the extracted features using the random forest algorithm; the trained model handles newly arising code merge conflicts by extracting features, classifying, and resolving.

In step 1), the historical code merge conflicts in the Git code repository are mined. A conflict is defined as a five-tuple T, written T = (A, B, O, M, R), where A and B are the conflicting files of the local branch and of the merged branch respectively, O is the common ancestor file of A and B, M is the three-way conflict file generated from A, B and O at merge time, and R is the file after the developer has resolved the conflict. A conflict file may contain several conflict blocks, which are delimited by conflict markers (typically "<<<<<<<", "=======" and ">>>>>>>"). Each conflict block is defined as a four-tuple t, written t = (a, b, o, r), where a and b are the parts of the conflict block that come from the A and B files respectively, o is the part that comes from the O file, and r is the code, extracted from the R file, with which the developer manually resolved the block. Merge commits are selected from the Git commit history, and every merge is replayed to obtain its code merge conflicts.
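
The two tuples defined above can be held in simple record types. The following is a minimal sketch, assuming each file version is kept as text and each part of a conflict block as a list of code lines; the class names `ConflictFile` and `ConflictBlock` and the extra `blocks` field are illustrative choices, not names used by the invention.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConflictBlock:
    """One conflict block t = (a, b, o, r); each part is a list of code lines."""
    a: List[str]
    b: List[str]
    o: List[str]
    r: List[str] = field(default_factory=list)  # filled in once r is located in R


@dataclass
class ConflictFile:
    """One conflicting file T = (A, B, O, M, R); each version is the file text."""
    A: str
    B: str
    O: str
    M: str
    R: str
    blocks: List[ConflictBlock] = field(default_factory=list)  # hypothetical convenience field


block = ConflictBlock(a=["x = 1"], b=["x = 2"], o=["x = 0"])
conflict = ConflictFile(A="x = 1\n", B="x = 2\n", O="x = 0\n", M="", R="x = 2\n", blocks=[block])
```

Keeping r optional reflects the order of the method: a, b and o are read from M first, and r is attached later when the resolved file R is searched.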

In step 2), editing information features are extracted from the collected conflict blocks t.

The editing information comprises code line count information and information on the similarity between the two conflicting sides and the original version of the code. The line count information covers the size of the conflict region on each side and the differences between those sizes. Writing LOC (lines of code) for the length of a piece of code, six line count values are collected for a conflict block t = (a, b, o, r): LOC(a), LOC(b), LOC(o), LOC(a) - LOC(o), LOC(b) - LOC(o), and LOC(a) - LOC(b).

The similarity is divided into line similarity and token similarity: the line similarity comprises the pairwise line similarities of the three code versions, and the token similarity comprises the pairwise token similarities of the three code versions. The similarity is computed as the Jaccard similarity. For example, for line similarity, let Set(a) be the set of all code lines in region a; the line similarity of a and b is then:

$$\mathrm{Similarity}(a, b) = \frac{|\mathrm{Set}(a) \cap \mathrm{Set}(b)|}{|\mathrm{Set}(a) \cup \mathrm{Set}(b)|}$$
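
For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A small worked case, using hypothetical code lines l_1 to l_4 that are not taken from the patent:

```latex
\mathrm{Set}(a) = \{l_1, l_2, l_3\}, \qquad \mathrm{Set}(b) = \{l_2, l_3, l_4\}

\mathrm{Similarity}(a, b)
  = \frac{|\{l_2, l_3\}|}{|\{l_1, l_2, l_3, l_4\}|}
  = \frac{2}{4}
  = 0.5
```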

In step 3), structural information features are extracted from the collected conflict blocks t.

The structural information refers to the syntax elements contained in the conflicting code. A general-purpose programming language can be represented at the syntax level as an AST in which each node corresponds to one syntax element, such as a class, a function, or a statement. After the source file is represented as an AST, the syntax elements contained in the conflict region are extracted from the AST positions corresponding to the conflicting code.

In step 4), the developer's manual resolution of the conflict is obtained from the commit that follows the merge commit where the conflict occurred, and its category is labeled, so that the conflict resolution problem is converted into a classification problem.

The developer's manual resolution is obtained from the commit that follows the merge commit where the conflict occurred. A heuristic is used here: the context of the conflict region is found first, and that context is then used to locate, in the resolved file, the part that the developer resolved manually.

Conflict resolution is converted into a multi-class classification problem with the following five categories:

1) A, i.e. accepting a in (a, b, o, r) as the conflict resolution;

2) B, i.e. accepting b in (a, b, o, r) as the conflict resolution;

3) Concatenation (CC for short), i.e. concatenating a and b in (a, b, o, r) and taking the result as the conflict resolution;

4) Combination (CB for short), i.e. combining code lines from a and b in (a, b, o, r) and taking the result as the conflict resolution;

5) New Code (NC for short), i.e. in addition to using the code of a and b in (a, b, o, r), new code is added to form the conflict resolution.

In step 5), a model is trained on the extracted features using the random forest algorithm, and the trained model handles new code merge conflicts by extracting features, classifying, and resolving.

By adopting the technical scheme, the invention has the following advantages:

1. High execution efficiency: when handling a conflict, the method only needs to call a pre-trained machine learning model, whereas traditional structured merging has to run complex graph algorithms with high time complexity.

2. High accuracy: the model is trained on the merge history of a code repository, so the invention proposes resolutions that follow the way conflicts have historically been resolved in that specific repository.

3. Strong extensibility: the invention can be applied to common programming languages, and support for a new language is easy to add, whereas conventional approaches typically support only one specific programming language.

Drawings

FIG. 1 is a flow chart of the present invention

FIG. 2 is a diagram of a conflicting file corresponding to a Git commit history

Detailed Description

In the machine-learning-based method for automatically resolving code merge conflicts, the invention first mines the code merge conflicts in a Git code repository and collects all conflicts that occurred at historical merge commits, then extracts features from the collected conflicts together with the developers' manual resolutions, trains a machine learning model on the collected conflicts and their features, and finally uses the trained model to resolve newly arising code merge conflicts.

The flow of the invention is shown in FIG. 1 and comprises the following six steps.

The first step: mine the historical code merge conflicts in the Git code repository by selecting the merge commits from the Git commit history and replaying every merge to obtain its code merge conflicts. Git libraries are available for many programming languages, such as JGit for Java, and we use JGit to perform common Git operations and obtain metadata from Git repositories. Taking JGit as an example, to mine the merge history of a Git repository, all commits are traversed first, and a commit is identified as a merge commit by checking whether it has exactly 2 parents. Once all merge commits and their parents have been collected, the two parents of each merge commit are merged again locally, which reproduces the original merge conflict.
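
The two-parent check can be sketched outside JGit as well. The following minimal Python example is an illustration only: it uses the Git command line (`git rev-list --parents --all`) instead of JGit, and the function names `parse_merge_commits` and `list_merge_commits` are hypothetical.

```python
import subprocess
from typing import List, Tuple


def parse_merge_commits(rev_list_output: str) -> List[Tuple[str, str, str]]:
    """Return (merge, parent_1, parent_2) for every commit with exactly two parents.

    Expects the text printed by `git rev-list --parents --all`, where each line
    holds a commit hash followed by the hashes of its parents.
    """
    merges = []
    for line in rev_list_output.splitlines():
        hashes = line.split()
        if len(hashes) == 3:  # the commit itself plus two parents
            merges.append((hashes[0], hashes[1], hashes[2]))
    return merges


def list_merge_commits(repo_path: str) -> List[Tuple[str, str, str]]:
    """Run Git inside `repo_path` and collect its two-parent merge commits."""
    out = subprocess.run(
        ["git", "rev-list", "--parents", "--all"],
        cwd=repo_path, check=True, capture_output=True, text=True,
    ).stdout
    return parse_merge_commits(out)


# Shortened, made-up hashes: c3 merges c1 and c2; the other commits are ordinary.
sample = "c3 c1 c2\nc2 c0\nc1 c0\nc0\n"
merges = parse_merge_commits(sample)
```

Separating the parsing from the Git call keeps the filter easy to check on a small sample before it is pointed at a real repository.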

The commits involved in a merge and the conflict T = (A, B, O, M, R) are shown schematically in FIG. 2, where solid lines represent real commits and branches and dashed lines represent nodes that are re-merged locally. First, the file names of the conflicting files are recorded at the M node, and the conflict file M, which contains the conflict markers, is obtained locally. Then the local merge is aborted and the checkout command is used to switch to the commits corresponding to the A, B and R nodes in order to obtain the A, B and R files; the R node contains the developer's manual resolution of the conflict. Finally, the common ancestor node O of A and B is found through the corresponding JGit API, and the O file is obtained from the O node. Performing this operation on all merge commits yields the set of conflict file tuples T = (A, B, O, M, R) for the conflicts in the history.
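
One possible ordering of these operations, expressed as Git command lines rather than JGit calls, is sketched below. This is an assumption about how the replay could be scripted, not the invention's implementation; `remerge_plan` and `show_file` are hypothetical helper names, and the commands are meant for a scratch clone.

```python
from typing import List


def remerge_plan(parent_a: str, parent_b: str) -> List[List[str]]:
    """Git commands that replay one historical merge in a scratch clone."""
    return [
        # Stand on the A node without moving any branch.
        ["git", "checkout", "--detach", parent_a],
        # Re-create the M node; diff3 style keeps the base part (o) in the markers.
        # Git exits with a non-zero status here when the merge conflicts.
        ["git", "-c", "merge.conflictStyle=diff3",
         "merge", "--no-commit", "--no-ff", parent_b],
        # List the conflicting (unmerged) files; read them from disk as M files.
        ["git", "diff", "--name-only", "--diff-filter=U"],
        # Undo the local re-merge once the M files have been copied.
        ["git", "merge", "--abort"],
        # Find the common ancestor, i.e. the O node.
        ["git", "merge-base", parent_a, parent_b],
    ]


def show_file(commit: str, path: str) -> List[str]:
    """Command that prints `path` as it exists at `commit` (for A, B, O and R)."""
    return ["git", "show", f"{commit}:{path}"]


plan = remerge_plan("c1", "c2")
```

Using `git show <commit>:<path>` to read the A, B, O and R versions is a substitute for the checkout described above; it avoids switching the working tree between nodes.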

The conflict blocks t = (a, b, o, r) of a conflict T = (A, B, O, M, R) can be collected as follows. The conflict blocks delimited by conflict markers are identified in the M file, and the a, b and o parts are read directly from each block. The r part is obtained from the R file; the detailed collection method is described in the fourth step.
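
Reading a, o and b out of an M file can be sketched as a small line scanner. The example assumes the M file uses Git's diff3-style markers ("<<<<<<<", "|||||||", "=======", ">>>>>>>"); the function name `parse_conflict_blocks` and the sample file content are illustrative.

```python
from typing import Dict, List


def parse_conflict_blocks(m_text: str) -> List[Dict[str, List[str]]]:
    """Split a diff3-style conflict file M into blocks with parts a, o and b."""
    blocks: List[Dict[str, List[str]]] = []
    current = None  # block being filled, or None outside a conflict
    part = None     # which part ("a", "o" or "b") the next line belongs to
    for line in m_text.splitlines():
        if line.startswith("<<<<<<<"):
            current, part = {"a": [], "o": [], "b": []}, "a"
        elif current is not None and line.startswith("|||||||"):
            part = "o"
        elif current is not None and line.startswith("======="):
            part = "b"
        elif current is not None and line.startswith(">>>>>>>"):
            blocks.append(current)
            current, part = None, None
        elif current is not None:
            current[part].append(line)
    return blocks


m_text = "\n".join([
    "import x",
    "<<<<<<< HEAD",
    "a = 1",
    "||||||| base",
    "a = 0",
    "=======",
    "a = 2",
    ">>>>>>> feature",
    "print(a)",
])
blocks = parse_conflict_blocks(m_text)
```

Lines outside the markers ("import x" and "print(a)" in the sample) are skipped here; they are the context that the fourth step uses to locate r.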

In this way the set of historical merge conflicts T and the set of conflict blocks t can be obtained for any code repository and used as the data set for the machine learning method.

The second step: extract editing information features from the collected conflict blocks t. The editing information features mainly describe how much a and b have changed relative to o in a conflict block t = (a, b, o, r), as well as the similarity between a and b, measured as the Jaccard similarity. The similarity measures comprise code line similarity and token similarity. Taking code line similarity as an example, the code in a, b and o is first split by line into the sets Set(a), Set(b) and Set(o), and the Jaccard similarity is then computed. The code line similarity of a and b is

$$\mathrm{Similarity}(a, b) = \frac{|\mathrm{Set}(a) \cap \mathrm{Set}(b)|}{|\mathrm{Set}(a) \cup \mathrm{Set}(b)|}$$

Similarity(a, o) and Similarity(b, o) are computed in the same way.
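
The three pairwise line similarities can be sketched as follows. The function and feature names are illustrative, and treating two empty parts as fully similar is an assumption made here to avoid division by zero; the text does not specify that case.

```python
from typing import Dict, Iterable, List


def jaccard(x: Iterable[str], y: Iterable[str]) -> float:
    """Jaccard similarity of two collections, compared as sets."""
    sx, sy = set(x), set(y)
    union = sx | sy
    if not union:
        return 1.0  # assumption: two empty parts count as identical
    return len(sx & sy) / len(union)


def line_similarities(a: List[str], b: List[str], o: List[str]) -> Dict[str, float]:
    """Pairwise line similarities of the three versions in a conflict block."""
    return {
        "sim_ab": jaccard(a, b),
        "sim_ao": jaccard(a, o),
        "sim_bo": jaccard(b, o),
    }


sims = line_similarities(
    a=["import x", "import y"],
    b=["import x", "import z"],
    o=["import x"],
)
```

In this sample a and b share one of three distinct lines, so their similarity is 1/3, while each side shares one of two distinct lines with o, giving 0.5.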

Token similarity is computed in the same way as code line similarity, except that the code of a, b and o must first be split into token sequences. There are many ways to split code into tokens: a parser for the programming language in question may be used, or a BPE-based tokenizer from the field of natural language processing. We use the open-source tool Tree-sitter to tokenize the code. The code of a, b and o is converted into token sequences, which yields three token sets, and the token similarity is obtained with the same calculation as for code line similarity.
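
The token-level calculation can be illustrated with a deliberately simple tokenizer. Note the substitution: the sketch below splits code with a regular expression instead of Tree-sitter, so its tokens only approximate what a real parser would produce; `TOKEN_PATTERN`, `tokenize` and `token_similarity` are hypothetical names.

```python
import re
from typing import List

# Assumption: identifiers, integer literals, and single non-space symbols are
# close enough to real tokens for illustration. Tree-sitter would be used instead.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> List[str]:
    """Split code into a flat token sequence."""
    return TOKEN_PATTERN.findall(code)


def token_similarity(x: str, y: str) -> float:
    """Jaccard similarity of the token sets of two code fragments."""
    tx, ty = set(tokenize(x)), set(tokenize(y))
    union = tx | ty
    return len(tx & ty) / len(union) if union else 1.0


tokens = tokenize("int x = 1;")
sim = token_similarity("int x = 1;", "int x = 2;")
```

The two sample statements differ in a single literal: they share four of six distinct tokens, so the token similarity is 2/3, whereas their line similarity would be 0.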

The second aspect of the editing information is the number of code lines in a, b and o. Using LOC (lines of code) to denote the number of lines, the following values are computed: LOC(a), LOC(b), LOC(o), LOC(a) - LOC(o), LOC(b) - LOC(o), and LOC(a) - LOC(b).
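
The six line count values follow directly from the lengths of the three parts. A minimal sketch, with illustrative feature names:

```python
from typing import Dict, List


def loc_features(a: List[str], b: List[str], o: List[str]) -> Dict[str, int]:
    """The six line count features of a conflict block (parts given as line lists)."""
    loc_a, loc_b, loc_o = len(a), len(b), len(o)
    return {
        "loc_a": loc_a,
        "loc_b": loc_b,
        "loc_o": loc_o,
        "loc_a_minus_o": loc_a - loc_o,
        "loc_b_minus_o": loc_b - loc_o,
        "loc_a_minus_b": loc_a - loc_b,
    }


features = loc_features(a=["x", "y", "z"], b=["x"], o=["x", "y"])
```

The differences are kept signed, so a negative value (here LOC(b) - LOC(o) = -1) indicates that a side removed lines relative to the base.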

The third step: extract structural information features from the collected conflict blocks t. The structural information refers to the syntax elements contained in the conflicting code; a programming language can be represented at the syntax level as an AST in which each node corresponds to one syntax element, such as a class, a function, or a statement. The syntax elements contained in a conflict block carry information on how to resolve the conflict. For example, for a conflict in import statements, the resolution is likely to be the concatenation of the code of the a and b parts.

To obtain the structural information, the node types of the AST nodes corresponding to the conflict block are taken as features. Taking a in the conflict block t as an example, we first need the start and end line numbers of the code of a within file A. Since a is necessarily contained in A, these line numbers, denoted l1 and l2, can be found by simple string matching. The previously mentioned Tree-sitter then converts file A into an AST and records, for each node of the AST, the start and end line numbers of that node in source file A. The whole AST is traversed, and the start and end line numbers of each node, denoted t1 and t2, are compared with l1 and l2; every node satisfying l1 < t1 and l2 > t2 is recorded. These are the AST nodes contained in the conflict block, and their node types are collected. Since the total number of node types is fixed, one-hot encoding is used to convert the set of node types into structural information features.
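
The node selection and encoding can be sketched with Python's standard `ast` module standing in for Tree-sitter. This is an illustration under several assumptions: the source is Python rather than the project's language, the vocabulary `NODE_TYPES` is a small hypothetical list, and the containment test uses inclusive bounds (l1 <= t1 and t2 <= l2) so that nodes starting on the first conflict line are kept, which differs from the strict inequalities stated above.

```python
import ast
from typing import List, Set

# Assumption: a tiny fixed vocabulary; a real one would list every node type
# of the grammar in use.
NODE_TYPES = ["Import", "ImportFrom", "FunctionDef", "ClassDef", "Assign", "Return", "Expr"]


def node_types_in_range(source: str, l1: int, l2: int) -> Set[str]:
    """Types of AST nodes whose line span [t1, t2] lies inside [l1, l2] (1-based)."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        t1 = getattr(node, "lineno", None)
        t2 = getattr(node, "end_lineno", None)
        if t1 is not None and t2 is not None and l1 <= t1 and t2 <= l2:
            found.add(type(node).__name__)
    return found


def one_hot(found: Set[str]) -> List[int]:
    """Encode a set of node types against the fixed vocabulary."""
    return [1 if name in found else 0 for name in NODE_TYPES]


source = "import os\nimport sys\n\ndef f():\n    return 1\n"
found_types = node_types_in_range(source, 1, 2)  # conflict covers the two imports
vector = one_hot(found_types)
```

For the sample, only the `Import` entry of the vector is set, which is the kind of signal that would point towards a concatenation-style resolution.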

The fourth step: obtain the developer's manual resolution of the conflict from the commit that follows the merge commit where the conflict occurred, label its category, and convert conflict resolution into a classification problem. In the earlier steps the a, b and o parts of the conflict block t = (a, b, o, r) were collected; r has to be obtained from the R file. The algorithm for locating the part of the R file that corresponds to r is as follows:

1) Record the context of the conflict block t from M, namely the code lines preceding "<<<<<<<" and the code lines following ">>>>>>>", referred to as the prefix and the suffix respectively. If the prefix or the suffix would extend into another conflict block, only the code between the two conflict blocks is recorded.

2) Match the suffix against the R file. Starting from the first line of R, for each line of R, compute the number of lines in the common prefix shared by the suffix and the code of R starting at that line; the starting position in R that yields the longest match is denoted s.

3) Reverse the line order of both the prefix and R; then, starting from the first line of the reversed R, apply the same search as in 2) to find the line number corresponding to the longest common prefix, denoted p.

4) The code between p and s in the R file is r.
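
The four steps above can be sketched as follows, reading s as the position where the suffix context begins in R and p as the position just after the prefix context ends. The function names are hypothetical, and the handling of an empty prefix or suffix (a conflict at the very start or end of the file) is an added assumption.

```python
from typing import List


def common_prefix_len(x: List[str], y: List[str]) -> int:
    """Number of leading lines that x and y have in common."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def best_match(lines: List[str], pattern: List[str]) -> int:
    """Index in `lines` where the longest run matching the start of `pattern` begins."""
    best_index, best_len = 0, -1
    for i in range(len(lines)):
        n = common_prefix_len(lines[i:], pattern)
        if n > best_len:
            best_index, best_len = i, n
    return best_index


def locate_resolution(r_file: List[str], prefix: List[str], suffix: List[str]) -> List[str]:
    """Return r, the lines of R lying between the prefix and suffix contexts."""
    if suffix:
        s = best_match(r_file, suffix)  # step 2: first line of the suffix in R
    else:
        s = len(r_file)                 # assumption: conflict reaches the end of the file
    if prefix:
        # Step 3: search the reversed file, then map the index back to R.
        p = len(r_file) - best_match(r_file[::-1], prefix[::-1])
    else:
        p = 0                           # assumption: conflict starts the file
    return r_file[p:s]                  # step 4: code between p and s


r_file = ["x", "y", "keep a", "z", "w"]
r = locate_resolution(r_file, prefix=["x", "y"], suffix=["z", "w"])
```

Because the search is a heuristic, it can return an empty or wrong region when the developer also edited the context lines; such blocks may need to be filtered out of the data set.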

Once r has been obtained, the conflict block t = (a, b, o, r) is complete. Before the label of r is computed, it must be defined how the conflict resolution problem is converted into a classification problem. For a conflict, we divide the possible resolutions into five categories:

1) A, i.e. accepting a in (a, b, o, r) as the conflict resolution;

2) B, i.e. accepting b in (a, b, o, r) as the conflict resolution;

3) CC, i.e. concatenating a and b in (a, b, o, r) and taking the result as the conflict resolution;

4) CB, i.e. combining code lines from a and b in (a, b, o, r) and taking the result as the conflict resolution;

5) NC, i.e. in addition to using the code of a and b in (a, b, o, r), new code is added to form the conflict resolution.

The label of r is the one of these five categories to which r belongs. Whether r belongs to A, B or CC can be decided directly by checking whether r equals a, equals b, or equals the concatenation of a and b. To decide whether r belongs to CB, the sets of code lines Set(r), Set(a) and Set(b) are first built from r, a and b, and it is then checked whether Set(r) is a subset of the union of Set(a) and Set(b). If r falls into none of these four categories, it is labeled NC.
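
The labeling rules translate into a short chain of checks. In this minimal sketch the parts are lists of code lines, `label_resolution` is a hypothetical name, and accepting both concatenation orders (a followed by b, and b followed by a) as CC is an assumption, since the text does not state the order.

```python
from typing import List


def label_resolution(a: List[str], b: List[str], r: List[str]) -> str:
    """Assign one of the five categories A, B, CC, CB, NC to a resolution r."""
    if r == a:
        return "A"
    if r == b:
        return "B"
    if r == a + b or r == b + a:  # assumption: either order counts as concatenation
        return "CC"
    if set(r) <= set(a) | set(b):  # every line of r comes from a or b
        return "CB"
    return "NC"


labels = [
    label_resolution(["x"], ["y"], ["x"]),
    label_resolution(["x"], ["y"], ["y"]),
    label_resolution(["x"], ["y"], ["x", "y"]),
    label_resolution(["x", "y"], ["z", "w"], ["x", "w"]),
    label_resolution(["x"], ["y"], ["q"]),
]
```

The order of the checks matters: a concatenation also satisfies the subset test, so CC has to be tested before CB.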

The fifth step: train a model on the extracted features and the corresponding labels using a machine learning algorithm. Here we use a random forest model, which performs stably across a variety of classification problems. The AdaBoost or XGBoost algorithms may also be used to train the machine learning model, although they may be slightly less effective than random forests.

After training, the model can be used to resolve new code merge conflicts as follows. First, the features of the new conflict are extracted with the same methods as in the second and third steps. The features are then vectorized and used as the input of the model, which produces five values between 0 and 1; these values are the probabilities of the conflict belonging to each of the five categories and sum to 1, and the category with the highest probability is taken as the result of the classifier. For conflicts of category A, B or CC, the classifier can give the resolution directly, namely the a part of the conflict block, the b part, or the concatenation of the two. For conflicts of category CB or NC, the classifier cannot give the resolution directly, but it can tell the developer that this is a relatively complex conflict that may require considering how to combine the modifications that a and b make to o, or introducing new code.
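
The last stage, turning the five probabilities into a resolution or a hint, can be sketched as below. The order of the categories in the probability vector and the function name `resolve` are assumptions; a real classifier defines its own class order.

```python
from typing import List, Optional, Sequence, Tuple

# Assumption: the model reports probabilities in this category order.
CATEGORIES = ["A", "B", "CC", "CB", "NC"]


def resolve(probabilities: Sequence[float], a: List[str], b: List[str]) -> Tuple[str, Optional[List[str]]]:
    """Pick the most probable category and, where possible, build the resolution."""
    best = max(range(len(CATEGORIES)), key=lambda i: probabilities[i])
    category = CATEGORIES[best]
    if category == "A":
        return category, a
    if category == "B":
        return category, b
    if category == "CC":
        return category, a + b
    # CB and NC cannot be resolved automatically; only the category is reported
    # so that the developer knows the conflict is a complex one.
    return category, None


auto = resolve([0.1, 0.2, 0.6, 0.05, 0.05], a=["import x"], b=["import y"])
hint = resolve([0.1, 0.1, 0.1, 0.2, 0.5], a=["import x"], b=["import y"])
```

In the first call the CC probability is highest, so the two parts are concatenated; in the second call the result carries no code, only the NC label.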

The sixth step: to examine the effectiveness of the classifier, the method needs to be evaluated on real code repositories. Specifically, 50 active, highly starred open-source Java projects are selected from GitHub, and the historical conflicts of each repository are collected. The conflict data set is divided into a training set and a test set in the proportion of 80% to 20%; the training set is used to train the random forest model, and the test set is used to evaluate the performance of the classifier. The evaluation metrics are the accuracy, recall and F1 score of the classification results.
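
The split and the metrics can be sketched in plain Python. The shuffling with a fixed seed and the per-category reporting are assumptions about details the text leaves open, the function names are hypothetical, and the labels in the example are made up rather than experimental results.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_80_20(samples: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the samples and split them into 80% training and 20% test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of conflicts whose category was predicted correctly."""
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_scores(y_true: List[str], y_pred: List[str], label: str) -> Dict[str, float]:
    """Precision, recall and F1 score for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


train, test = split_80_20(range(10))
y_true = ["A", "A", "B", "CC"]  # made-up labels for illustration
y_pred = ["A", "B", "B", "CC"]
acc = accuracy(y_true, y_pred)
scores_a = class_scores(y_true, y_pred, "A")
```

With these made-up labels, three of four predictions are correct (accuracy 0.75), and category A has precision 1.0 and recall 0.5, giving an F1 score of 2/3.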

Claims

1. A machine-learning-based method for automatically resolving code merge conflicts, characterized by mining the code merge conflicts in a Git code repository, collecting the conflicts that occurred at historical merge commits, extracting features from the collected conflicts and the developers' manual resolutions, training a machine learning model on the features, and finally resolving new code merge conflicts with the trained machine learning model.

2. The machine-learning-based method for automatically resolving code merge conflicts according to claim 1, characterized by comprising the following steps:

1) Mining the code merge conflicts in the Git code repository and collecting the conflicts that occurred at historical merge commits;

2) Extracting editing information features from the collected code merge conflicts;

3) Constructing an abstract syntax tree (AST) for each conflicting source file, and collecting structural information features of the conflict from the AST fragments corresponding to the conflict;

4) Obtaining the developer's manual resolution of the conflict from the commit that follows the merge commit where the conflict occurred, labeling the category of the resolution, and modeling the conflict resolution problem as a classification problem;

5) Training a model on the extracted features using the random forest algorithm, the trained model handling new code merge conflicts by extracting features, classifying, and resolving.

3. The machine-learning-based method for automatically resolving code merge conflicts according to claim 2, characterized in that in step 1) the historical code merge conflicts in the Git code repository are mined; a conflict is defined as a five-tuple T, written T = (A, B, O, M, R), where A and B are the conflicting files of the local branch and of the merged branch respectively, O is the common ancestor file of A and B, M is the three-way conflict file generated from A, B and O at merge time, and R is the file after the developer has resolved the conflict; a conflict file may contain several conflict blocks, which are delimited by conflict markers (typically "<<<<<<<", ">>>>>>>" and "======="); each conflict block is defined as a four-tuple t, written t = (a, b, o, r), where a and b are the parts of the conflict block that come from the A and B files respectively, o is the part that comes from the O file, and r is the code, extracted from the R file, with which the developer manually resolved the block; merge commits are selected from the Git commit history, and every merge is replayed to obtain its code merge conflicts.

4. The machine-learning-based method for automatically resolving code merge conflicts according to claim 2, characterized in that in step 2) editing information features are extracted from the collected conflict blocks t; the editing information comprises code line count information and information on the similarity between the two conflicting sides and the original version of the code; the line count information covers the size of the conflict region on each side and the differences between those sizes; writing LOC (lines of code) for the length of a piece of code, six line count values are collected for a conflict block t = (a, b, o, r): LOC(a), LOC(b), LOC(o), LOC(a) - LOC(o), LOC(b) - LOC(o) and LOC(a) - LOC(b); the similarity is divided into line similarity and token similarity, wherein the line similarity comprises the pairwise line similarities of the three code versions and the token similarity comprises the pairwise token similarities of the three code versions; the similarity is computed as the Jaccard similarity, and for line similarity, with Set(a) being the set of all code lines in region a, the line similarity of a and b is:

$$\mathrm{Similarity}(a, b) = \frac{|\mathrm{Set}(a) \cap \mathrm{Set}(b)|}{|\mathrm{Set}(a) \cup \mathrm{Set}(b)|}$$

5. The machine-learning-based method for automatically resolving code merge conflicts according to claim 2, characterized in that in step 3) structural information features are extracted from the collected conflict blocks t; the structural information refers to the syntax elements contained in the conflicting code, and a general-purpose programming language can be represented at the syntax level as an AST in which each node corresponds to one syntax element, such as a class, a function, or a statement; after the source file is represented as an AST, the syntax elements contained in the conflict region are extracted from the AST positions corresponding to the conflicting code.

6. The machine-learning-based method for automatically resolving code merge conflicts according to claim 2, characterized in that in step 4) the developer's manual resolution of the conflict is obtained from the commit that follows the merge commit where the conflict occurred, its category is labeled, and conflict resolution is converted into a classification problem; the developer's manual resolution is obtained from the commit following the merge commit where the conflict occurred by using a heuristic that first finds the context of the conflict region and then uses that context to locate, in the resolved file, the part that the developer resolved manually; conflict resolution is converted into a multi-class classification problem with the following five categories:

1) A, namely receiving a in (a, b, o, r) as a conflict resolution scheme;

2) B, i.e. receiving b in (a, b, o, r) as a conflict resolution scheme;

3) Concatenation (CC for short), i.e. concatenating a and b in (a, b, o, r) and taking the result as a conflict resolution scheme;

4) Combination (CB for short), i.e. combining code lines from a and b in (a, b, o, r) and taking the result as a conflict resolution scheme;

5) New Code (NC for short), i.e. in addition to using the code of a and b in (a, b, o, r), new code is added as a conflict resolution scheme.

7. The machine-learning-based method for automatically resolving code merge conflicts according to claim 2, characterized in that in step 5) a model is trained on the extracted features using the random forest algorithm, and the trained model handles new code merge conflicts by extracting features, classifying, and resolving.