CN117591172A

CN117591172A - Feature fusion code clone detection method and device based on vector database

Info

Publication number: CN117591172A
Application number: CN202311546640.8A
Authority: CN
Inventors: 智晨; 李景涛; 张宇; 胡子良; 徐梅芳; 严申; 李井; 邓水光; 尹建伟
Original assignee: Zhejiang University ZJU; Hundsun Technologies Inc
Current assignee: Zhejiang University ZJU; Hundsun Technologies Inc
Priority date: 2023-11-20
Filing date: 2023-11-20
Publication date: 2024-02-23

Abstract

The invention discloses a feature fusion code clone detection method and device based on a vector database. Code is analyzed at the function level to obtain code fingerprints that fuse lexical, grammatical and feature information, and the fingerprints are stored in and searched through a vector database, which greatly accelerates the query and detection of similar code. The similar functions returned by the query are then screened by a threshold, and a token-based longest common subsequence is computed to obtain a fine-grained similarity, which serves as the function-level code similarity result. Finally, the file-level similarity result is derived from the function-level results, and the project-level similarity result is derived from the file-level results. The invention can rapidly screen out the most similar code clone results from massive data (on the order of hundreds of millions of entries and above), ensuring both the detection speed and the fine-grained accuracy of the detection results.

Description

Feature fusion code clone detection method and device based on vector database

Technical Field

The invention relates to the technical field of code cloning, and in particular to a feature fusion code clone detection method and device based on a vector database.

Background

Code cloning refers to similar or identical code segments that exist in a software system. The presence of code clones can negatively impact the software project: code cloning can increase the maintenance cost of a software system, lead to code redundancy, reduce software quality, and increase the risk of flaws and vulnerabilities.

Therefore, in order to effectively manage and maintain a software system, code clones need to be detected, analyzed and refactored. Code clone detection refers to identifying the code clones present in a software system and classifying them into different types. Code clones can be classified into the following four types according to their similarity and variation:

Type I: identical code segments, except for insignificant differences in whitespace, comments and formatting.

Type II: structurally similar code fragments that, in addition to the differences of Type I, may differ in identifiers, literals, types, layout and non-functional statements.

Type III: on the basis of type II, there are also differences such as statement addition, deletion or modification.

Type IV: functionally similar or equivalent code segments, but with greater differences in syntax and semantics.

At present, common code similarity detection methods mainly include text-based, lexical-based, grammar-based, metric-based and semantic-based methods. For example, the Chinese patent document with publication number CN115587358A discloses a binary code similarity detection method, which comprises: acquiring the control flow graph of a function in a binary firmware file; extracting semantic information to obtain code block embedding vectors of the control flow graph; obtaining deep semantic features of the control flow graph from the code block embedding vectors, and determining sequence-aware features of the code block embedding vectors; fusing the deep semantic features and the sequence-aware features to obtain a graph embedding vector; and calculating the similarity of functions from the graph embedding vectors.

However, no single detection method can adequately capture and detect code clones of all types and in all scenarios.

On the one hand, different types of code clones require different degrees of abstraction and accuracy. For example, Type I and Type II code clones can be detected efficiently by text-based or lexical methods, while Type III and Type IV code clones require deeper analysis by grammar-based or semantic methods.

On the other hand, code clone detection in different scenarios requires different degrees of efficiency and scalability. For example, when detecting code clones against a large database (e.g., one holding hundreds of millions of code segments), comparing the code under test one by one with every code segment in the database is time-consuming and labor-intensive. There is therefore a need for a scheme that screens candidates faster and more accurately.

Disclosure of Invention

The invention provides a feature fusion code clone detection method and device based on a vector database, which can rapidly screen out a batch of the most similar code clone results from massive data (on the order of hundreds of millions of entries and above), ensuring both the detection speed and the fine-grained accuracy of the detection results.

The feature fusion code clone detection method based on the vector database is characterized by comprising the following steps of:

(1) A fingerprint library code preprocessing stage: function extraction and preprocessing are performed on the code files;

(2) A fingerprint library code lexical fingerprint generation stage: lexical analysis is performed on each function, the lexical fingerprint of each function is computed, and the lexical analysis result files are stored locally;

(3) A fingerprint library code grammar and feature fingerprint generation stage: grammar extraction and feature extraction are performed on each function, and the grammar and feature fingerprint is output;

(4) A fingerprint library code warehousing stage: the lexical fingerprint obtained in step (2) is fused with the grammar and feature fingerprint obtained in step (3) to obtain a binary code fingerprint, an index is built on the binary code fingerprints, and the fingerprints are inserted into a vector database in partitions determined by code length;

(5) A fingerprint coarse-grained screening stage: the code to be detected is processed by the same method as in steps (1)-(3), the functions to be detected are extracted and their binary code fingerprints are generated, and several partitions of the vector database, selected according to rules, are searched to obtain the top N most similar results;

(6) A fingerprint fine-grained screening stage: the N results obtained in step (5) are screened by Hamming distance, and results whose distance exceeds a manually preset threshold are eliminated, yielding a screened candidate similarity set; each candidate function in the candidate similarity set is then read from the local lexical analysis file, and the fine-grained similarity between the candidate function and the function to be detected is computed from their lexical tokens;

(7) A file and project similarity calculation stage: the function similarities obtained in step (6) are aggregated to obtain the file-level similarity, and the file-level similarities are aggregated to obtain the project-level similarity.

The invention realizes code clone detection at the function level, generating the code fingerprint of a function by fusing the lexical, grammatical and metric features of the function code. By adopting a vector database, the most similar code functions can be stored efficiently and searched quickly through the fingerprint index without one-to-one matching against massive data, so that a batch of the most similar code clone results can be screened out rapidly from massive data (on the order of hundreds of millions of entries and above). The results are then screened further by a threshold on the code fingerprint, a one-to-one fine-grained similarity is calculated for the small number of candidate functions that remain, and the functions that meet the requirements are taken as the final search results. Finally, the file-level and project-level code similarity results are derived in turn from the function-level detection results.

Further, the specific process of the step (1) is as follows:

(1-1) function extraction: each code file in the project is traversed and parsed into an abstract syntax tree (AST) by a code parsing package, and the abstract syntax tree node entry of each function is located;

(1-2) preprocessing: nodes in each function that have no bearing on the fingerprint or on the behavior of the code, such as comments, tabs, spaces and blank lines, are filtered out, and the function name, parameters, function calls and identifiers of the function are replaced.

The specific process of the step (2) is as follows:

(2-1) lexical statistics: a lexical fingerprint with all bits set to 0 is first generated; the lexical tokens contained in the function preprocessed in step (1) are then collected, and the number of distinct token texts and the number of occurrences of each are counted;

(2-2) Simhash calculation: each distinct token text is hashed to obtain a binary hash; for each bit of the hash that is 1, the occurrence count of the token is added, as a positive weight, to the corresponding position of the lexical fingerprint; for each bit that is 0, the occurrence count of the token is subtracted, as a negative weight, from the corresponding position; after all distinct token texts have been processed in turn, every position of the lexical fingerprint with a value greater than 0 is set to 1 and all others are set to 0, giving the final binary lexical fingerprint.

The specific process of the step (3) is as follows:

(3-1) node combination generation: for each non-leaf node, its grammar information and the feature count vector of the subtree rooted at that node are collected; the feature count vector counts the different kinds of grammar information of all non-root nodes contained in the subtree and forms a multidimensional feature vector; the grammar information of the node is concatenated with the feature vector to form the node combination;

(3-2) Simhash calculation: the node combinations are counted to determine the distinct node combination texts and the number of occurrences of each; each combination is hashed according to the method of step (2-2), finally giving the binary grammar and feature fingerprint.

In step (4), the lexical fingerprint and the grammar and feature fingerprint are directly concatenated to obtain the binary code fingerprint; the vector database is divided into a plurality of partitions according to the number of code lines of a function, the partition corresponding to each function is calculated, and the binary code fingerprint is inserted into that partition of the vector database.

In step (5), the code to be detected is processed by steps (1)-(3) and the results are concatenated to obtain its binary code fingerprint; the corresponding partition is calculated from the number of lines of the code to be detected; the corresponding partition and several adjacent partitions of the vector database are searched with the code fingerprint, and the top N most similar results are returned.

The specific process of the step (6) is as follows:

(6-1) threshold screening: the results obtained in step (5) are screened by a preset Hamming distance threshold, and only results whose distance is smaller than the threshold proceed to the fine-grained comparison;

(6-2) fine-grained comparison: for each remaining result, the corresponding saved lexical file is read locally according to the mapping relation, the longest common subsequence (LCS) of the two token sequences is computed, and the results whose similarity is greater than a threshold are saved.

The specific process of the step (7) is as follows:

(7-1) file-level similarity calculation: the detection result of each detected function in the file (the code file of step (1)) is obtained; for each detected function, its number of code lines is accumulated (for later calculation of the file similarity) and all of its similar function results are traversed; for each similar function result, a mapping is recorded from the corresponding file name to the function similarity multiplied by the number of code lines of the detected function (the number of code lines of the similar function result); if the detected function is different but the file corresponding to the similar function has already appeared, the previous value is updated to the previous value plus the function similarity multiplied by the number of code lines of the detected function; after the traversal, the files are sorted by their mapped values, and the top N most similar files are selected and stored as the file-level similarity results of the file;

(7-2) project-level similarity calculation: when an uploaded project is processed, all detected files are traversed, and the file-level similarity of each detected file is calculated according to step (7-1); for each detected file, its number of code lines is accumulated (for later similarity calculation) and all of its file similarity results are traversed; for each similarity result, a mapping is recorded from the corresponding project name to the file similarity multiplied by the number of code lines of the detected file (the number of code lines of the similar file result); if the detected file is different but the project corresponding to the similar file has already appeared, the previous value is updated to the previous value plus the file similarity multiplied by the number of code lines of the detected file; the projects are then sorted, and the top N most similar projects are selected and stored as the project-level similarity results of the project.

The feature fusion code clone detection device based on the vector database comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the feature fusion code clone detection method based on the vector database when executing the executable codes.

Compared with the prior art, the invention has the following beneficial effects:

1. The invention fuses code information covering lexical, grammatical and feature aspects to generate the corresponding code fingerprints for similar code detection.

2. The invention adopts a vector database for storage and query, which has a clear speed advantage over traditional one-to-one comparison or over segmented indexes based on the pigeonhole principle; the vector database is also partitioned by the number of code lines, which greatly reduces the query time and quickly filters out a set of potentially similar results.

3. The invention further screens the query results according to defined rules, calculates a fine-grained similarity, and screens again on the calculated result, which improves the accuracy of the similarity results. From the function-level detection results, the files in the database that may be similar to each uploaded file are derived; from the file-level detection results, the projects in the database that may be similar to the uploaded project are derived.

4. Compared with the existing method, the method and the device have remarkable advantages in code information acquisition, code fingerprint storage and searching and code similarity calculation.

Drawings

FIG. 1 is a flow chart of a feature fusion code clone detection method based on a vector database;

FIG. 2 is a flow chart of a fingerprint library code preprocessing stage in the present invention;

FIG. 3 is a flow chart of the fingerprint library code lexical fingerprint generation stage in the present invention;

FIG. 4 is a flow chart of the fingerprint library code grammar and feature fingerprint generation phase of the present invention;

FIG. 5 is a flow chart of the fingerprint fine-grained screening stage in the present invention;

FIG. 6 is a flow chart of the stage of deriving the file and project similarity in the present invention.

Detailed Description

The invention will be described in further detail with reference to the drawings and examples, it being noted that the examples described below are intended to facilitate the understanding of the invention and are not intended to limit the invention in any way.

As shown in FIG. 1, a feature fusion code clone detection method based on a vector database includes the following steps:

S101, fingerprint library code preprocessing stage: this stage performs function extraction and preprocessing on massive code files in preparation for computing the code fingerprints in steps S102 and S103. As shown in FIG. 2, it comprises the following sub-steps:

S1011, function extraction: the compressed package of the submitted code project is decompressed and each code file in it is traversed; each code file is parsed into an abstract syntax tree (AST) by a code parsing package provided as a third-party Maven dependency, and the abstract syntax tree node entry of each function is located.

S1012, preprocessing: nodes in each function that have no bearing on the fingerprint or on the behavior of the code, such as comments, tabs, spaces and blank lines, are filtered out; the function name is replaced with myFunc, the parameters are replaced with myParam, other function calls in the function body are replaced with myMethod, and all other identifiers in the function body are replaced with myIdent.
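The renaming in S1012 can be illustrated with a small sketch. The embodiment parses code through a Maven dependency, which suggests a Java toolchain; the sketch below instead uses Python's standard `ast` module on Python source, purely as a stand-in. The class and function names are hypothetical, and mapping uses of a parameter inside the body to myParam (rather than myIdent) is an assumption, since the text does not settle that point.

```python
import ast


class IdentifierNormalizer(ast.NodeTransformer):
    """Rename identifiers to the placeholder names described in S1012."""

    def __init__(self):
        self.param_names = set()

    def visit_FunctionDef(self, node):
        node.name = "myFunc"
        for arg in node.args.args:
            self.param_names.add(arg.arg)
            arg.arg = "myParam"
        self.generic_visit(node)  # normalize the function body
        return node

    def visit_Call(self, node):
        # The callee becomes myMethod; arguments are normalized as usual.
        if isinstance(node.func, ast.Name):
            node.func.id = "myMethod"
        elif isinstance(node.func, ast.Attribute):
            node.func.attr = "myMethod"
            node.func.value = self.visit(node.func.value)
        else:
            node.func = self.visit(node.func)
        node.args = [self.visit(a) for a in node.args]
        for kw in node.keywords:
            kw.value = self.visit(kw.value)
        return node

    def visit_Name(self, node):
        # Assumption: parameter uses keep the myParam placeholder.
        node.id = "myParam" if node.id in self.param_names else "myIdent"
        return node


def normalize_function(source: str) -> str:
    tree = ast.parse(source)  # the parser already discards comments and blank lines
    tree = IdentifierNormalizer().visit(tree)
    return ast.unparse(tree)  # requires Python 3.9+
```

Because the output no longer depends on naming choices, two functions that differ only in identifiers (a Type II clone) normalize to the same text and therefore produce the same fingerprint.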

S102, fingerprint library code lexical fingerprint generation stage: this stage computes the lexical fingerprint of each function by performing lexical analysis and statistics on the function, and stores the lexical analysis result locally. As shown in FIG. 3, it comprises the following sub-steps:

S1021, lexical statistics: a 64-bit lexical fingerprint is first generated with every bit set to 0. The abstract syntax tree nodes of the function are then traversed to collect all lexical tokens other than comments, spaces, line feeds, tabs and the like. The number of distinct token texts and the number of occurrences of each are counted for use in the lexical fingerprint.

S1022, Simhash calculation: each distinct token text is hashed with the xxhash algorithm to obtain a 64-bit hash; for each bit of the hash that is 1, the occurrence count of the token is added, as a positive weight, to the corresponding position of the lexical fingerprint; for each bit that is 0, the occurrence count of the token is subtracted, as a negative weight, from the corresponding position. After all distinct token texts have been processed in turn, every position of the lexical fingerprint with a value greater than 0 is set to 1 and all others are set to 0, giving the final binary lexical fingerprint.
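A minimal sketch of the weighted Simhash in S1021 and S1022 follows. The embodiment names xxhash; to keep the sketch free of third-party packages it substitutes a 64-bit BLAKE2b digest from the standard library, which is an assumption made only for illustration. The function names are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64


def hash64(text: str) -> int:
    # Stand-in for xxhash: any well-mixed 64-bit hash illustrates the idea.
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(items) -> int:
    """Compute a 64-bit Simhash over an iterable of token (or combination) texts."""
    counts = Counter(items)  # distinct texts and their occurrence counts
    weights = [0] * BITS     # per-position accumulator, all zeros at the start
    for text, count in counts.items():
        h = hash64(text)
        for bit in range(BITS):
            if (h >> bit) & 1:
                weights[bit] += count  # a 1 bit adds the occurrence count
            else:
                weights[bit] -= count  # a 0 bit subtracts the occurrence count
    fingerprint = 0
    for bit in range(BITS):
        if weights[bit] > 0:           # positions above 0 become 1, the rest 0
            fingerprint |= 1 << bit
    return fingerprint
```

Functions that share most of their tokens end up with fingerprints that differ in only a few bit positions, which is what makes the Hamming distance a usable coarse similarity measure in the later stages.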

S103, fingerprint library code grammar and feature fingerprint generation stage: this stage performs grammar extraction and feature extraction on each function and outputs the code fingerprint computed from the grammar and features. As shown in FIG. 4, it comprises the following sub-steps:

S1031, node combination generation: a feature count vector is first initialized. If the current node is a leaf node, it is checked whether the grammar type of the node is one of the types tracked by the feature count vector; if so, the corresponding count is incremented by 1, and the vector is returned. If the node is not a leaf node, its child nodes are traversed and the features returned by each child are accumulated into the feature count vector; the grammar type of the node and its feature count vector are then combined and stored; finally, if the grammar type of the node itself is tracked by the feature count vector, the corresponding count is incremented, and the vector is returned.
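The recursion in S1031 can be sketched on a toy tree. In this sketch a node is a `(grammar_type, children)` tuple, and the list of tracked grammar types is hypothetical; the text does not enumerate which types the feature count vector covers, nor the exact text format of a combination.

```python
# Hypothetical set of grammar types tracked by the feature count vector.
FEATURE_TYPES = ["IfStmt", "ForStmt", "ReturnStmt", "MethodCall", "Name"]


def collect_combinations(node, combos):
    """Append one combination text per non-leaf node to `combos`.

    Returns the feature count vector of the subtree, including `node` itself.
    """
    node_type, children = node
    counts = [0] * len(FEATURE_TYPES)
    if children:
        for child in children:
            child_counts = collect_combinations(child, combos)
            counts = [a + b for a, b in zip(counts, child_counts)]
        # Combination: the node's grammar type joined with the counts of its descendants.
        combos.append(f"{node_type}:{','.join(map(str, counts))}")
    if node_type in FEATURE_TYPES:
        counts[FEATURE_TYPES.index(node_type)] += 1  # count the node for its parent
    return counts
```

The resulting combination texts play the same role for the grammar and feature fingerprint that token texts play for the lexical fingerprint: they are counted and then fed through the same Simhash procedure.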

S1032, Simhash calculation: a 64-bit grammar and feature fingerprint is generated with every bit initialized to 0. The combinations obtained from all nodes are then counted to determine each distinct combination text and its number of occurrences. Each distinct combination text is hashed with the xxhash algorithm to obtain a 64-bit hash; for each bit of the hash that is 1, the occurrence count of the combination text is added, as a positive weight, to the corresponding position of the grammar and feature fingerprint; for each bit that is 0, the occurrence count is subtracted, as a negative weight, from the corresponding position. After all distinct combination texts have been processed in turn, every position of the grammar and feature fingerprint with a value greater than 0 is set to 1 and all others are set to 0, giving the final binary grammar and feature fingerprint.

S104, fingerprint library code warehousing stage: for each function, this stage directly concatenates the two 64-bit fingerprints generated in the lexical fingerprint generation stage and in the grammar and feature fingerprint generation stage to obtain a 128-bit binary code fingerprint. Every ten lines of code form one partition; the partition corresponding to the function is calculated from its number of lines, and the binary code fingerprint is inserted into that partition of the vector database. The vector database may be a Milvus database.
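The concatenation and partition calculation in S104 amount to a few lines of arithmetic, sketched below. Placing the lexical fingerprint in the high 64 bits and numbering partitions from zero by integer division are both assumptions; the text specifies only direct concatenation and ten lines per partition. The insertion into the vector database itself is not shown.

```python
PARTITION_SIZE = 10  # lines of code per partition, as stated in S104


def fuse_fingerprints(lexical_fp: int, grammar_fp: int) -> bytes:
    """Concatenate two 64-bit fingerprints into a 16-byte (128-bit) code fingerprint."""
    # Assumption: the lexical fingerprint occupies the high 64 bits.
    combined = (lexical_fp << 64) | grammar_fp
    return combined.to_bytes(16, "big")


def partition_index(line_count: int) -> int:
    # Assumption: 0-9 lines map to partition 0, 10-19 lines to partition 1, and so on.
    return line_count // PARTITION_SIZE
```

Partitioning by length restricts each later search to functions of roughly comparable size, which is where most true clones are expected to be found.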

S105, fingerprint coarse-grained screening stage: in this stage, the lexical fingerprint and the grammar and feature fingerprint of the code to be detected are computed according to S101-S103 and concatenated to obtain the code fingerprint to be detected. The corresponding partition is calculated from the number of code lines of the function to be detected, with every ten lines forming one partition. The code fingerprint is then searched in five partitions of the Milvus database, namely the corresponding partition together with the two partitions above it and the two partitions below it, and the results are returned.
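The partition selection in S105 can be sketched as follows. The search itself is delegated to the vector database in the embodiment; here a brute-force scan over an in-memory dictionary stands in for it, only to make the behavior concrete. The `store` layout and all function names are hypothetical.

```python
def partitions_to_search(line_count: int, partition_size: int = 10, radius: int = 2):
    """Return the function's own partition plus `radius` neighbors on each side."""
    center = line_count // partition_size
    return [p for p in range(center - radius, center + radius + 1) if p >= 0]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def top_n(query_fp: int, store: dict, partitions, n: int):
    """Brute-force stand-in for the vector database search.

    store maps a partition index to a list of (function_id, fingerprint) pairs.
    Returns up to n (distance, function_id) pairs, closest first.
    """
    candidates = [
        (hamming(query_fp, fp), function_id)
        for p in partitions
        for function_id, fp in store.get(p, [])
    ]
    return sorted(candidates)[:n]
```

For a 37-line function this selects partitions 1 through 5; functions stored in any other partition are never compared, which keeps the candidate set small before the fine-grained stage.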

S106, fingerprint fine-grained screening stage: in this stage, the results of the previous stage are screened by Hamming distance, and results whose distance exceeds a manually preset threshold are eliminated, yielding the screened candidate similarity set for the function. Each function in the candidate similarity set is then read from the local lexical analysis result file, and the fine-grained similarity between the candidate function and the function to be detected is computed from their lexical tokens. As shown in FIG. 5, it comprises the following sub-steps:

S1061, threshold screening: the results obtained in step S105 are screened by a preset Hamming distance threshold (the threshold may be obtained as a mapping from the number of code lines of the function to be detected). Only results whose distance is smaller than or equal to the threshold proceed to the fine-grained comparison; the remaining results are eliminated.
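A short sketch of the threshold screening in S1061 follows. The text says only that the threshold may be derived from the number of code lines; the specific cut-off values below are invented for illustration and are not taken from the source.

```python
def hamming_threshold(line_count: int) -> int:
    # Hypothetical mapping from line count to threshold; the values are illustrative only.
    if line_count < 10:
        return 8
    if line_count < 50:
        return 16
    return 24


def screen_candidates(candidates, line_count: int):
    """Keep candidates whose Hamming distance is within the threshold.

    candidates is a list of (function_id, hamming_distance) pairs.
    """
    limit = hamming_threshold(line_count)
    return [(fid, dist) for fid, dist in candidates if dist <= limit]
```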

S1062, fine-grained comparison: for each remaining result, the corresponding lexical file is read locally according to the mapping relation, and the longest common subsequence (LCS) of the two token sequences is computed at token granularity. The number of tokens in the longest common subsequence is multiplied by 2 and divided by the sum of the token counts of the two functions to obtain the fine-grained similarity. Results whose similarity is greater than a threshold are saved, and results below the threshold are eliminated.
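The similarity in S1062 is 2 x LCS length / (token count of A + token count of B). A minimal sketch using the standard dynamic-programming LCS with a rolling row is shown below; the function names are hypothetical, and treating two empty token lists as fully similar is an assumption.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def token_similarity(a, b) -> float:
    """Fine-grained similarity: 2 * LCS / (len(a) + len(b))."""
    if not a and not b:
        return 1.0  # assumption: two empty functions count as identical
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

For two seven-token statements that differ in a single operator, the LCS has six tokens and the similarity is 12 / 14, or about 0.857. The computation is quadratic in the token counts, which is why it is applied only to the few candidates that survive the Hamming screening.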

S107, file and project similarity derivation stage: in this stage, the function similarities obtained in the previous step are aggregated to obtain the file-level similarity, and the file-level similarities are then aggregated to obtain the project-level similarity. As shown in FIG. 6, it comprises the following sub-steps:

S1071, file-level similarity calculation: the detection result of each detected function in the file (the code file of S101) is obtained; for each detected function, its number of code lines is accumulated (for later calculation of the file similarity) and all of its similar function results are traversed. For each similar function result, a mapping is recorded from the corresponding file name to the function similarity multiplied by the number of code lines of the detected function (the number of code lines of the similar function result); if the detected function is different but the file corresponding to the similar function has already appeared, the previous value is updated to the previous value plus the function similarity multiplied by the number of code lines of the detected function. After the traversal, the files are sorted by their mapped values, and the 10 most similar files are selected and stored as the file-level similarity results of the file.
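The aggregation in S1071 is a line-weighted sum per candidate file, sketched below. Dividing the accumulated score by the total number of detected lines is an assumption: the text says the line total is kept for calculating the file similarity later but does not give the formula. The input layout and function name are hypothetical.

```python
from collections import defaultdict


def file_level_similarity(function_results, top_n: int = 10):
    """Aggregate function-level matches into a ranking of similar files.

    function_results holds one (line_count, matches) pair per detected function,
    where matches is a list of (similar_file_name, function_similarity) pairs.
    """
    total_lines = 0
    weighted = defaultdict(float)
    for line_count, matches in function_results:
        total_lines += line_count
        for similar_file, similarity in matches:
            # A file seen again simply accumulates onto its previous value.
            weighted[similar_file] += similarity * line_count
    if total_lines == 0:
        return []
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    # Assumption: normalizing by the total line count yields the file similarity.
    return [(name, score / total_lines) for name, score in ranked]
```

The project-level step has the same shape one level up: detected files take the place of detected functions, project names take the place of file names, and file similarities are weighted by file line counts.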

S1072, project-level similarity calculation: when an uploaded project is processed, all detected files are traversed, and the file-level similarity of each detected file is calculated according to step S1071. For each detected file, its number of code lines is accumulated (for later similarity calculation) and all of its file similarity results are traversed. For each similarity result, a mapping is recorded from the corresponding project name to the file similarity multiplied by the number of code lines of the detected file (the number of code lines of the similar file result); if the detected file is different but the project corresponding to the similar file has already appeared, the previous value is updated to the previous value plus the file similarity multiplied by the number of code lines of the detected file. The projects are then sorted, and the 10 most similar projects are selected and stored as the project-level similarity results of the project.

Based on the same inventive principle, the embodiment of the invention also provides a feature fusion code clone detection device based on the vector database, which comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the feature fusion code clone detection method based on the vector database when executing the executable codes.

The foregoing embodiments describe the technical solution and the advantages of the present invention in detail. It should be understood that they are merely illustrative of the present invention and are not intended to limit it; any modifications, additions and equivalent substitutions made within the scope of the principles of the present invention shall fall within the scope of protection of the invention.

Claims

1. The feature fusion code clone detection method based on the vector database is characterized by comprising the following steps of:

2. The feature fusion code clone detection method based on a vector database according to claim 1, wherein the specific process of step (1) is as follows:

(1-2) preprocessing, filtering nodes in each function which are irrelevant to fingerprint and code influence, and replacing function names, parameters, function calls and identifiers of the functions.

3. The feature fusion code clone detection method based on a vector database according to claim 1, wherein the specific process of step (2) is as follows:

4. The feature fusion code clone detection method based on a vector database according to claim 3, wherein the specific process of step (3) is as follows:

5. The feature fusion code clone detection method based on a vector database according to claim 1, wherein in the step (4), binary code fingerprints obtained by directly splicing lexical fingerprints, grammar and feature fingerprints are partitioned into a plurality of partitions according to the code line number of a function and the partitions corresponding to the function are calculated, and the binary code fingerprints are inserted into the corresponding partitions of the vector database.

6. The feature fusion code clone detection method based on a vector database according to claim 1, wherein in the step (5), the code to be detected is calculated and spliced in the steps (1) - (3) to obtain binary code fingerprints, corresponding partitions are calculated according to the number of lines of the code to be detected, the corresponding partitions and a plurality of adjacent partitions are searched in the vector database according to the code fingerprints, and the top N most similar results are returned.

7. The feature fusion code clone detection method based on a vector database according to claim 1, wherein the specific process of step (6) is as follows:

8. The feature fusion code clone detection method based on a vector database according to claim 1, wherein the specific process of step (7) is as follows:

(7-1) file-level similarity calculation: the detection result of each detected function in the code file of step (1) is obtained; for each detected function, its number of code lines is accumulated and all of its similar function results are traversed; for each similar function result, a mapping is recorded from the corresponding file name to the function similarity multiplied by the number of code lines of the detected function; if the detected function is different but the file corresponding to the similar function has already appeared, the previous value is updated to the previous value plus the function similarity multiplied by the number of code lines of the detected function; after the traversal, the files are sorted by their mapped values, and the top N most similar files are selected and stored as the file-level similarity results of the file;

(7-2) project-level similarity calculation: when an uploaded project is processed, all detected files are traversed, and the file-level similarity of each detected file is calculated according to step (7-1); for each detected file, its number of code lines is accumulated and all of its file similarity results are traversed; for each similarity result, a mapping is recorded from the corresponding project name to the file similarity multiplied by the number of code lines of the detected file; if the detected file is different but the project corresponding to the similar file has already appeared, the previous value is updated to the previous value plus the file similarity multiplied by the number of code lines of the detected file; the projects are then sorted, and the top N most similar projects are selected and stored as the project-level similarity results of the project.

9. A vector database based feature fusion code clone detection device comprising a memory and one or more processors, the memory having executable code stored therein, the one or more processors configured to implement the vector database based feature fusion code clone detection method of any one of claims 1-8 when executing the executable code.