CN113158640A - Code similarity detection method and device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN113158640A
Authority
CN
China
Prior art keywords
array
variable
count
code
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110390520.8A
Other languages
Chinese (zh)
Inventor
鲁海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weimin Insurance Agency Co Ltd
Original Assignee
Weimin Insurance Agency Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weimin Insurance Agency Co Ltd
Priority to CN202110390520.8A
Publication of CN113158640A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a code similarity detection method and device, a storage medium, and electronic equipment, and belongs to the field of computers. The method comprises: parsing the syntactic structure of a target code file and generating an abstract syntax tree (AST) of the target code file, wherein the abstract syntax tree comprises a plurality of nodes and each node corresponds to a temporary array and a count array; and calculating the code similarity of code blocks in the target code file using the temporary array and the count array. The method and device solve the technical problem in the related art that code similarity calculated by character comparison has a large error, and improve the efficiency of maintaining repeated code.

Description

Code similarity detection method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computers, and in particular, to a method and an apparatus for detecting code similarity, a storage medium, and an electronic device.
Background
In the related art, maintaining repeated code carries a very high labor cost, because a modification made in one place is not synchronized to the other copies. Extracting and abstracting common code is therefore important during development.
In the related art, code is compared by line-by-line scanning, which checks whether the text characters in each row and column are identical. Because it is based on character comparison, this approach is easily disturbed by, for example, obfuscated or compiled code, special characters, and code blocks that are logically identical but name their variables differently. Character-based comparison does not consider how far the code logic is repeated; it only compares the characters in each row and column. When code is compiled or obfuscated, variable names, variable definition positions, operators, and even the manner of operation may change, so character comparison alone produces a very large error and cannot reflect the real state of the code.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
To solve the above technical problem, or at least partially solve it, the application provides a code similarity detection method and device, a storage medium, and an electronic device.
According to an aspect of the embodiments of the present application, there is provided a method for detecting code similarity, including: parsing the syntactic structure of a target code file and generating an abstract syntax tree of the target code file, wherein the abstract syntax tree comprises a plurality of nodes, and each node corresponds to a temporary array and a count array; and calculating the code similarity of code blocks in the target code file using the temporary array and the count array.
According to another aspect of the embodiments of the present application, there is also provided a device for detecting code similarity, including: a generating module, configured to parse the syntactic structure of a target code file and generate an abstract syntax tree of the target code file, wherein the abstract syntax tree comprises a plurality of nodes, and each node corresponds to a temporary array and a count array; and a calculation module, configured to calculate the code similarity of code blocks in the target code file using the temporary array and the count array.
Further, the calculation module includes: a generating unit, configured to generate a plurality of count arrays corresponding to the plurality of temporary arrays in the AST, where each count array includes the following information about a variable definition expression: operation type, variable type, and variable number; and a comparison unit, configured to compare a first count array and a second count array among the plurality of count arrays and to determine the similarity between them as the similarity between a first code block and a second code block in the target code file, where the first count array corresponds to the first code block and the second count array corresponds to the second code block.
Further, the generating unit includes: a first reading subunit, configured to read, for each temporary array in the AST, a first key character in the temporary array, where the first key character represents the operation type of a variable definition expression; a second reading subunit, configured to read the second key characters in the temporary array and count the number of variables for each second key character, where the second key characters represent the variable types under the current operation type and each second key character corresponds to one variable type; and a generating subunit, configured to generate a count array from the first key character, the second key characters, and the variable numbers.
Further, the comparison unit includes: a comparison subunit, configured to compare the first operation type of the first count array with the second operation type of the second count array; a first calculating subunit, configured to calculate, if the first operation type is the same as the second operation type, a first quotient between a first variable number of the first count array and a second variable number of the second count array, where the first quotient represents a first similarity between the first count array and the second count array; and a first determining subunit, configured to determine the first quotient as the similarity between the first code block and the second code block in the target code file.
Further, the comparison unit includes: a comparison subunit, configured to compare the first operation type of the first count array with the second operation type of the second count array; a second calculating subunit, configured to calculate, if the first operation type is the same as the second operation type, a second quotient between a first number of variable types of the first count array and a second number of variable types of the second count array, where these two numbers are obtained from the kinds of variable type contained under the first operation type and the second operation type, and the second quotient represents a second similarity between the first count array and the second count array; and a second determining subunit, configured to determine the second quotient as the similarity between the first code block and the second code block in the target code file.
Further, the comparison unit includes: a comparison subunit, configured to compare the first operation type of the first count array with the second operation type of the second count array; a third calculating subunit, configured to determine, if the first operation type is the same as the second operation type, the specific number of variables whose variable type is the same in the first count array and the second count array, determine the total variable number of the first count array or the second count array, and calculate a third quotient of the specific variable number and the total variable number, where the total variable numbers of the first count array and the second count array are the same and the third quotient represents a third similarity between the first count array and the second count array; and a third determining subunit, configured to determine the third quotient as the similarity between the first code block and the second code block in the target code file.
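The three quotients described above can be illustrated with a minimal Python sketch. The dictionary layout of a count array (`operation_type`, `variable_types`) and all function names are hypothetical and chosen only for illustration. The text does not say which value is the numerator, so the sketch assumes the smaller value is divided by the larger one to keep the result between 0 and 1. It also assumes that the "specific variable number" is the sum, over shared variable types, of the smaller of the two counts, and it returns 0 for differing operation types, as the description later states for that case.

```python
from typing import Dict, TypedDict


class CountArray(TypedDict):
    """Hypothetical count array: one operation type plus a tally per variable type."""
    operation_type: str
    variable_types: Dict[str, int]


def _ratio(a: int, b: int) -> float:
    # Assumption: smaller over larger, so the quotient always lies in [0, 1].
    if a == 0 and b == 0:
        return 1.0
    return min(a, b) / max(a, b)


def first_quotient(c1: CountArray, c2: CountArray) -> float:
    """Quotient of the two total variable numbers."""
    if c1["operation_type"] != c2["operation_type"]:
        return 0.0
    n1 = sum(c1["variable_types"].values())
    n2 = sum(c2["variable_types"].values())
    return _ratio(n1, n2)


def second_quotient(c1: CountArray, c2: CountArray) -> float:
    """Quotient of the numbers of distinct variable types."""
    if c1["operation_type"] != c2["operation_type"]:
        return 0.0
    return _ratio(len(c1["variable_types"]), len(c2["variable_types"]))


def third_quotient(c1: CountArray, c2: CountArray) -> float:
    """Variables of shared types divided by the (equal) total variable number."""
    if c1["operation_type"] != c2["operation_type"]:
        return 0.0
    total1 = sum(c1["variable_types"].values())
    total2 = sum(c2["variable_types"].values())
    if total1 != total2:
        raise ValueError("the third quotient assumes equal total variable numbers")
    if total1 == 0:
        return 1.0
    shared_types = c1["variable_types"].keys() & c2["variable_types"].keys()
    # Assumption: a shared type contributes the smaller of its two counts.
    shared = sum(min(c1["variable_types"][t], c2["variable_types"][t]) for t in shared_types)
    return shared / total1
```

For example, a block with two `var` and one `let` declaration and a block with three `var` declarations have the same total (first quotient 1.0) but differ in the number of variable types (second quotient 0.5).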
Further, the generating module includes: a dividing unit, configured to divide the code character string in the target code file into lexical units, where each lexical unit corresponds to one code block; a generating unit, configured to generate a lexical unit list from the divided lexical units; and a conversion unit, configured to convert the lexical unit list into an AST in units of variable definition expressions.
Further, the conversion unit includes: a splitting subunit, configured to split the lexical unit list into a plurality of temporary arrays in units of variable definition expressions, where each temporary array corresponds to one variable definition expression and each variable definition expression is identified by a key character; a mapping subunit, configured to map, for each variable definition expression in the lexical unit list, the corresponding temporary array to a sub-AST; and a combining subunit, configured to attach the plurality of sub-ASTs as child nodes under a designated tree root node and combine them into a total AST.
Further, the dividing unit includes: a scanning subunit, configured to scan the code character string in the target code file from left to right, starting from its first character, until the next empty character, and to extract the continuous character string before that empty character; a determining subunit, configured to determine the continuous character string as a key character, continue scanning the code character string from left to right until the nearest identifier after the key character, and determine the characters from the key character up to the identifier as a lexical unit; and a processing subunit, configured to move to the next line and continue scanning the remaining code character string from left to right until the last character of the code character string has been scanned.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, where the program, when run, executes the steps of the above method.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used for storing a computer program; and the processor is used for executing the steps of the method by running the program stored in the memory.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the steps of any embodiment of the code similarity detection method.
Compared with the prior art, the technical solution provided by the embodiments of the application has the following advantages:
In the embodiments of the application, similarity is analyzed at the level of syntax and logic, which avoids the influence of code compilation and of obfuscation or encryption measures and reduces the error. This solves the technical problem in the related art that code similarity calculated by character comparison has a large error, and improves the efficiency of maintaining repeated code.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of a method for detecting code similarity according to an embodiment of the present application;
Fig. 2 is a schematic diagram of scanning a code character string according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an AST in an embodiment of the present application;
Fig. 4 shows two schematic diagrams of generating a count array according to an embodiment of the present application;
Fig. 5 is a diagram comparing two example code blocks according to an embodiment of the present application;
Fig. 6 is a comparison diagram according to an embodiment of the present application;
Fig. 7 is a diagram comparing three example code blocks according to an embodiment of the present application;
Fig. 8 is a block diagram of a device for detecting code similarity according to an embodiment of the present application;
Fig. 9 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application; the illustrative embodiments and their descriptions serve to explain the present application and do not limit it. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another similar entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
This embodiment provides a method for detecting code similarity. Fig. 1 is a flowchart of the method according to an embodiment of the present application. As shown in Fig. 1, the flow includes the following steps:
Step S102, parsing the syntactic structure of a target code file and generating an abstract syntax tree of the target code file, wherein the abstract syntax tree comprises a plurality of nodes, and each node corresponds to a temporary array and a count array;
The abstract syntax tree (AST) in this embodiment is a tree representation of the abstract syntactic structure of the source code in the target code file.
Step S104, calculating the code similarity of code blocks in the target code file using the temporary array and the count array;
In this embodiment, the logical similarity of code at the semantic level can be calculated from the temporary arrays and count arrays corresponding to the AST, independently of how the code is written and of its text.
Through the above steps, the syntactic structure of the target code file is parsed and its abstract syntax tree is generated, with each node of the AST corresponding to one temporary array; the AST is then used to calculate the code similarity of the target code file. Because similarity is analyzed at the level of syntax and logic, the influence of code compilation and of obfuscation or encryption measures is avoided and the error is reduced. This solves the technical problem in the related art that code similarity calculated by character comparison has a large error, and improves the efficiency of maintaining repeated code.
In this embodiment, parsing the syntactic structure of the target code file to generate its abstract syntax tree includes:
S11, dividing the code character string in the target code file into lexical units, wherein each lexical unit corresponds to one code block;
S12, generating a lexical unit list from the divided lexical units;
S13, converting the lexical unit list into an AST in units of variable definition expressions.
In one implementation of this embodiment, converting the lexical unit list into an AST in units of variable definition expressions comprises: splitting the lexical unit list into a plurality of temporary arrays in units of variable definition expressions, wherein each temporary array corresponds to one variable definition expression and each variable definition expression is identified by a key character; mapping, for each variable definition expression in the lexical unit list, the corresponding temporary array to a sub-AST; and attaching the plurality of sub-ASTs as child nodes under a designated tree root node and combining them into a total AST.
In one example, the key character of the variable definition expression is VariableDeclaration, and the temporary array is: VariableDeclaration, VariableDeclarator, Identifier, Literal.
In this embodiment, dividing the code character string in the target code file into lexical units includes: scanning the code character string from left to right, starting from its first character, until the next empty character, and extracting the continuous character string before that empty character; determining the continuous character string as a key character; continuing to scan the code character string from left to right until the nearest identifier after the key character; and determining the characters from the key character up to the identifier as a lexical unit.
After the first lexical unit is obtained, scanning moves to the next line and continues from left to right over the remaining code character string (if any part has not yet been scanned) until the last character of the code character string is scanned.
Optionally, the identifier may be a semicolon or a line break. The first character of the code character string is its first non-empty character, counted from the starting column (e.g., column 0) of the starting row (e.g., row 0).
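The scanning procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `split_lexical_units` is hypothetical, "empty character" is taken to mean any whitespace, and semicolons inside string literals are not handled.

```python
from typing import List, Tuple

# Identifiers that close a lexical unit, per the description: semicolon or line break.
TERMINATORS = (";", "\n")


def split_lexical_units(code: str) -> List[Tuple[str, str]]:
    """Return (key_character, lexical_unit) pairs found by a left-to-right scan."""
    units: List[Tuple[str, str]] = []
    i, n = 0, len(code)
    while i < n:
        # Skip empty characters and leftover terminators before the next unit.
        while i < n and (code[i].isspace() or code[i] in TERMINATORS):
            i += 1
        if i >= n:
            break
        start = i
        # The run of characters up to the next empty character is the key character.
        while i < n and not code[i].isspace() and code[i] not in TERMINATORS:
            i += 1
        key_end = i
        # Keep scanning to the nearest terminator after the key character.
        while i < n and code[i] not in TERMINATORS:
            i += 1
        units.append((code[start:key_end], code[start:i]))
    return units
```

Applied to the two-line input `" var a = 1;\nvar b = 2;\n"`, the sketch yields two units, each with the key character `var`.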
In the JavaScript language used in this embodiment, the smallest logical execution unit is an expression (i.e., an AST node). For example, during scanning, "v" is the first character, "var" is a key character, and "var a = 1" is an expression. Each expression has a key character that identifies it, so the AST can be constructed by recognizing key characters; for example, the character string var is a key character (identifying a variable definition expression) in the code of the target code file.
Fig. 2 is a schematic diagram of scanning a code character string according to an embodiment of the present application. Characters are scanned row by row and column by column, and each character is given coordinates consisting of a column number x and a row number y. As shown in Fig. 2, scanning starts at the first position x0y0 and ignores empty characters. The first non-empty character, v, is recorded at x1y0, and scanning continues with the next character a (x2y0) until the next empty character appears (x4y0), which yields the continuous character string var from x1y0 to x3y0. Because var is a key character in JavaScript (a variable definition expression), the syntax of variable definition expressions in JavaScript implies that the 9 characters (x1y0 to x9y0) from var up to the nearest semicolon (x10y0) or line break (x11y0) form one expression; the semicolon and the line break are the special characters that separate expressions in JavaScript.
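The coordinate bookkeeping can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes that x is the position within the line and y is the line index, which is what the coordinates x1y0 to x11y0 in the example imply.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int, str]  # (x, y, character)


def char_coordinates(code: str) -> List[Coord]:
    """Assign every character, including line breaks, an (x, y) coordinate."""
    coords: List[Coord] = []
    x = y = 0
    for ch in code:
        coords.append((x, y, ch))
        if ch == "\n":
            x, y = 0, y + 1  # a line break ends the row
        else:
            x += 1
    return coords


def first_key_character(code: str) -> Optional[Tuple[str, Tuple[int, int], Tuple[int, int]]]:
    """Return the first continuous non-empty run with its start and end coordinates."""
    run: List[Coord] = []
    for x, y, ch in char_coordinates(code):
        if ch.isspace():
            if run:
                break  # the empty character after the run ends the key character
            continue   # leading empty characters are ignored
        run.append((x, y, ch))
    if not run:
        return None
    text = "".join(ch for _, _, ch in run)
    return text, run[0][:2], run[-1][:2]
```

With the input `" var a = 1;\n"`, the key character `var` spans x1y0 to x3y0, the semicolon sits at x10y0, and the line break at x11y0, matching the walk-through above.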
After a complete expression has been obtained by scanning, the variable definition expression (hereinafter, the expression) is split. A complete variable definition expression (VariableDeclaration) can always be split into the following constituent elements (i.e., child nodes in the AST), each separated by an empty character. In the example var a = 1, splitting by code position yields the following elements, from which the corresponding temporary array is obtained:
x1y0-x3y0: the 3 characters var (VariableDeclarator);
x5y0: the character a (Identifier) (several consecutive characters are possible here; this example has only one);
x7y0: the fixed operator =;
x9y0: the character 1 (Literal) (several consecutive characters are possible here; this example has only one).
In addition, the literal in this example is a simple literal definition, but another expression may appear in its place. In that case, the expression is treated as a child node of the variable definition expression VariableDeclaration, and splitting continues according to the same flow.
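As a rough illustration of the split, the following Python sketch turns one scanned expression into a flat temporary array. The function name and the dictionary layout of each element are assumptions made for this sketch; the patent's own temporary array is shown only in a figure. A right-hand side that is itself an expression is kept as raw text here instead of being split further.

```python
from typing import Dict, List


def to_temporary_array(expression: str) -> List[Dict[str, str]]:
    """Split a simple `var name = value` expression on empty characters."""
    parts = expression.split(None, 3)  # at most 4 elements: var, name, =, value
    if len(parts) != 4 or parts[0] != "var" or parts[2] != "=":
        raise ValueError(f"not a simple variable definition expression: {expression!r}")
    kind, name, _, raw = parts
    # Hypothetical layout: one entry per component named in the description.
    return [
        {"type": "VariableDeclaration", "kind": kind},
        {"type": "VariableDeclarator"},
        {"type": "Identifier", "name": name},
        {"type": "Literal", "raw": raw},
    ]
```

Calling `to_temporary_array("var a = 1")` gives four entries whose types are VariableDeclaration, VariableDeclarator, Identifier, and Literal, in that order.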
The other expression in the above example, var b = 2;, is separated from var a = 1; by a semicolon or line break, so the two are independent expressions that both belong to the root node of the AST. Each child node is temporarily stored in the form of an array, i.e. a temporary array, as follows:
[Figure BDA0003016541870000091: temporary arrays of the two expressions; image not reproduced]
By identifying the variable definition expression VariableDeclaration, splitting the expression into its 4 components (VariableDeclaration, VariableDeclarator, Identifier, Literal), and looping over the temporary array, the AST structure of the expression is obtained. Repeating the above steps until the last character of the code character string has been scanned constructs an AST that marks all expressions of the code character string in the code file. Fig. 3 is a schematic diagram of an AST in the embodiment of the present application; it shows two nodes (lexical unit node 1 and lexical unit node 2), which correspond to the two expressions. The elements of the AST are explained below:
Lexical unit: each complete child node is a lexical unit;
Program: the tree root node;
VariableDeclaration: a variable definition expression;
VariableDeclarator: type of the lexical unit/child node/code block (here, a variable definition);
kind: "var": the declared variable type;
Identifier: type of the lexical unit/child node/code block (here, a variable identifier definition);
name: the declared variable name;
Literal: type of the lexical unit/child node/code block (here, a literal definition);
raw: the declared variable value (literal);
It should be noted that the type of a lexical unit/child node/code block (VariableDeclarator/Identifier/Literal) is an enumerated type; other values, such as FunctionDeclaration, which denotes a method definition, are handled similarly.
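To make the element list concrete, the sketch below assembles a small tree for the two example expressions in Python. The nesting (`body`, `declarations`, `id`, `init`) follows the ESTree-style layout that the element names suggest; because Fig. 3 is not reproduced in this text, that nesting and the function name `build_ast` are assumptions.

```python
from typing import Any, Dict, List


def build_ast(expressions: List[str]) -> Dict[str, Any]:
    """Map each simple `var name = value` expression to a sub-AST under one root."""
    body: List[Dict[str, Any]] = []
    for expression in expressions:
        parts = expression.split(None, 3)
        if len(parts) != 4 or parts[0] != "var" or parts[2] != "=":
            raise ValueError(f"unsupported expression: {expression!r}")
        kind, name, _, raw = parts
        body.append({
            "type": "VariableDeclaration",      # the variable definition expression
            "kind": kind,                       # declared variable type, e.g. "var"
            "declarations": [{
                "type": "VariableDeclarator",
                "id": {"type": "Identifier", "name": name},
                "init": {"type": "Literal", "raw": raw},
            }],
        })
    # Program is the tree root node; every sub-AST becomes one of its children.
    return {"type": "Program", "body": body}
```

For `build_ast(["var a = 1", "var b = 2"])`, the root has two child nodes, one per expression, which mirrors the two lexical unit nodes described for Fig. 3.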
In this embodiment, calculating the code similarity of the target code file using the AST includes:
S21, generating a plurality of count arrays corresponding to the plurality of temporary arrays in the AST, wherein each count array includes the following information about a variable definition expression: operation type, variable type, and variable number;
The operation type, the variable type, and the variable number are the three dimensions along which the repetition rate is calculated. The similarity concerns only the definition and form of the variables and operations in the code, not whether variable names or the positions of operator statements are the same.
Optionally, generating the plurality of count arrays corresponding to the plurality of temporary arrays in the AST includes: for each temporary array in the abstract syntax tree, reading a first key character in the temporary array, wherein the first key character represents the operation type of a variable definition expression; reading the second key characters in the temporary array and counting the number of variables for each second key character, wherein the second key characters represent the variable types under the current operation type and each second key character corresponds to one variable type; and generating a count array from the first key character, the second key characters, and the variable numbers.
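One possible reading of this step is sketched below in Python. The layout of the temporary array (a list of dictionaries with a `type` field and, for declarations, a `kind` field) is hypothetical, as is the choice to take the leading element's type as the first key character and each declaration keyword as a second key character. The text leaves these details to the figures, so this is an illustration of the counting idea only.

```python
from collections import Counter
from typing import Any, Dict, List


def make_count_array(temporary_array: List[Dict[str, str]]) -> Dict[str, Any]:
    """Reduce a temporary array to operation type, variable types, and variable number."""
    if not temporary_array:
        raise ValueError("empty temporary array")
    # Assumption: the leading element names the expression and gives the operation type.
    operation_type = temporary_array[0]["type"]
    # Assumption: every declaration keyword (e.g. "var") counts as one variable type.
    kinds = Counter(node["kind"] for node in temporary_array if "kind" in node)
    return {
        "operation_type": operation_type,
        "variable_types": dict(kinds),
        "variable_number": sum(kinds.values()),
    }
```

Because names and literal values are dropped, the temporary arrays for `var a = 1` and `var b = 2` reduce to the same count array, which is the property the comparison step relies on.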
In one example, any two code blocks are traversed in order from left to right according to the tree structure of the AST, or only the code blocks in the AST that meet a preset policy condition are traversed (for example, only code blocks whose operation type is a specified type). For example, the first code block and the second code block contain the expressions "var a = 1" and "var b = 2", respectively, and the temporary arrays are as follows:
[Figure BDA0003016541870000111: temporary arrays of the first and second code blocks; image not reproduced]
In the first code block, the variable definition expression is VariableDeclaration, and the expression is split into its 4 components: VariableDeclaration, VariableDeclarator, Identifier, Literal. In the second code block, the variable definition expression is also VariableDeclaration, and it is split into the same 4 components. Reading the temporary array of the first code block gives a first key character "1", so the operation type is a literal definition, and a second key character "VariableDeclarator", so the variable type under that operation type is a variable declaration; the number of "var" is 1, so the number of variables is 1. Correspondingly, reading the temporary array of the second code block gives a first key character "2", so the operation type is a literal definition, and a second key character "VariableDeclarator", so the variable type under that operation type is a variable declaration; the number of "var" is 1, so the number of variables is also 1.
Fig. 4 shows two schematic diagrams of generating a count array according to an embodiment of the present application. A count array (operation type omitted in the figure) can be generated at the same time as the temporary array, and its structure is similar. Unlike the temporary array, the count array does not need to describe the meaning of every node in the complete tree structure: parts that do not affect the comparison are ignored, because what is compared is the expression type rather than a specific value. It records only the number of identical expressions at the same array position (i.e., AST node position), for use in comparison and statistics.
S22, comparing a first count array and a second count array in the plurality of count arrays, and determining the similarity between the first count array and the second count array as the similarity between a first code block and a second code block in the target code file, wherein the first count array corresponds to the first code block, and the second count array corresponds to the second code block. The first code block and the second code block are any two code blocks to be subjected to similarity comparison in the same target code file, the first counting array corresponds to a first temporary array generated by the first code block and is generated based on the first temporary array, and the second counting array corresponds to a second temporary array generated by the second code block and is generated based on the second temporary array.
The expression "var a = 1" is taken as an example for explanation. Fig. 5 is a comparison example of two code blocks in the embodiment of the present invention, in which the dashed boxes mark the repeated parts. Code blocks 1 and 2 in Fig. 5 are different implementations of a method definition expression named aFunc, but their functions are the same. The following explains, by way of example, how to find the repeated portion (the dashed portion of the AST) using the count arrays corresponding to the AST tree.
In one example, comparing a first count array and a second count array of the plurality of count arrays and determining a similarity between the first count array and the second count array as a similarity of a first code block and a second code block in the target code file comprises: comparing the first operation type of the first counting array with the second operation type of the second counting array; if the first operation type is the same as the second operation type, comparing a first variable type of the first counting array with a second variable type of the second counting array; if the first variable type of the first counting array is the same as the second variable type of the second counting array, comparing the first variable quantity of the first counting array with the second variable quantity of the second counting array; if the first variable quantity of the first counting array is the same as the second variable quantity of the second counting array, the similarity between the first code block and the second code block in the target code file is determined to be 100%.
In another aspect of this example, if the first operation type and the second operation type are different, it is directly determined that a similarity between the first code block and the second code block is 0.
Fig. 6 is a comparison diagram in the embodiment of the present application. The expression at each level of count array 1 is compared in a loop with the expression at each level of count array 2 until, at the 4th comparison, the same type, FunctionDeclaration, is found. Since both expression types are method definition expressions, the comparison continues with the expression types of their child nodes, namely VariableDeclaration. The child nodes of both method definition expressions contain only 1 type, VariableDeclaration, with num equal to 1, so the functions of the aFunc methods in code blocks 1 and 2 of Fig. 5 can be considered completely consistent, and the similarity is 100%. The above example is a scenario with a similarity of 100%.
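The 100% case can be sketched as a three-stage check: operation type, then variable types, then variable numbers. This is a hedged illustration. The count-array layout (`type`, `nodes`, `num`) and the function name `exact_match_similarity` are assumptions, and returning `None` when the operation types match but the children differ is a choice made here to signal that another rule has to decide; the text does not specify that outcome.

```python
def exact_match_similarity(first, second):
    """Return 1.0 for a full match, 0.0 for different operation types.

    Returns None when the operation types match but the variable types
    or numbers differ (assumption: a quotient-based rule then applies).
    """
    if first["type"] != second["type"]:
        return 0.0
    first_children = {n["type"]: n["num"] for n in first["nodes"]}
    second_children = {n["type"]: n["num"] for n in second["nodes"]}
    # Equal dicts mean equal variable types and equal variable numbers.
    if first_children == second_children:
        return 1.0
    return None

array_1 = {"type": "FunctionDeclaration",
           "nodes": [{"type": "VariableDeclaration", "num": 1}]}
array_2 = {"type": "FunctionDeclaration",
           "nodes": [{"type": "VariableDeclaration", "num": 1}]}
result = exact_match_similarity(array_1, array_2)
```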
In one example, comparing a first count array and a second count array of the plurality of count arrays and determining a similarity between the first count array and the second count array as a similarity of a first code block and a second code block in the target code file comprises: comparing the first operation type of the first counting array with the second operation type of the second counting array; if the first operation type is the same as the second operation type, calculating a first quotient value between a first variable number of the first count array and a second variable number of the second count array, wherein the first quotient value is used for representing a first similarity between the first count array and the second count array; and determining the first quotient as the similarity between the first code block and the second code block in the target code file.
In one example, comparing a first count array and a second count array of the plurality of count arrays and determining a similarity between the first count array and the second count array as a similarity of a first code block and a second code block in the target code file comprises: comparing the first operation type of the first counting array with the second operation type of the second counting array; if the first operation type is the same as the second operation type, calculating a second quotient between a first variable type quantity of the first counting array and a second variable type quantity of the second counting array, wherein the first variable type quantity and the second variable type quantity are respectively obtained according to variable type types included in the first operation type and the second operation type in a statistical manner, and the second quotient is used for representing a second similarity between the first counting array and the second counting array; and determining the second quotient as the similarity between the first code block and the second code block in the target code file.
In another aspect of this example, if the first operation type and the second operation type are different, it is directly determined that a similarity between the first code block and the second code block is 0.
In one example, comparing a first count array and a second count array of the plurality of count arrays and determining a similarity between the first count array and the second count array as a similarity of a first code block and a second code block in the target code file comprises: comparing the first operation type of the first counting array with the second operation type of the second counting array; if the first operation type is the same as the second operation type, determining the specific variable quantity of the same variable type in the first count array and the second count array, determining the variable total quantity of the first count array or the second count array, and calculating a third quotient value of the specific variable quantity and the variable total quantity, wherein the variable total quantities of the first count array and the second count array are the same, and the third quotient value is used for representing a third similarity between the first count array and the second count array; and determining the third quotient value as the similarity of the first code block and the second code block in the target code file.
In another aspect of this example, if the first operation type and the second operation type are different, it is directly determined that a similarity between the first code block and the second code block is 0.
Here, a scenario with similarity lower than 100% is illustrated with three code blocks: code block 1 includes one expression, "var a = 1"; code block 2 includes two expressions, "var a = 1" and "var b = 1"; and code block 3 includes two expressions, "var a = 1" and "function b ()". For aFunc method definition expressions that do not match completely, the similarity of two code blocks can still be obtained by comparing count arrays. Fig. 7 is a comparison example of three code blocks in the embodiment of the present application, comprising code blocks 1, 2, and 3, which correspond to count arrays 1, 2, and 3 respectively and are compared in pairs. It should be noted that each count array is generated from a temporary array, and that a pairwise comparison compares the operation type, the variable quantity, and the variable type quantity at the same array position (i.e., the position of a tree node in the abstract syntax tree). Examples follow:
In count arrays 1 and 2, the expression type of the method definition, FunctionDeclaration, and the type of the child nodes, VariableDeclaration (the only type present), both match, but the value of num differs. That is, the first operation type is the same as the second operation type, but the first variable number differs from the second variable number. The first quotient of the first variable number (num: 1) of the first count array and the second variable number (num: 2) of the second count array is calculated, so the similarity of the aFunc method functions in code blocks 1 and 2 is num: 1 / num: 2 = 50%, and 50% is taken as the similarity between the first code block and the second code block in the target code file.
In count arrays 1 and 3, although num for the type VariableDeclaration is the same, count array 3 has 1 more child node type (FunctionDeclaration). That is, the first operation type is the same as the second operation type, but the first variable type number differs from the second variable type number. With the first variable type number of the first count array being 1 and the second variable type number of the second count array being 2, it can be calculated that: number of nodes in count array 1 (1) / number of nodes in count array 3 (2) = 50%, and 50% is taken as the similarity between the first code block and the second code block in the target code file. In other examples, if the original number of nodes in count array 1 is 3 and the original number of nodes in count array 3 is 2, then: number of nodes in count array 3 (2) / number of nodes in count array 1 (3) = 66.67%.
The number of child nodes in each of count arrays 2 and 3 is at least 2 (the value of num equals the number of expressions of the same type in nodes), so expressions of the same type must be removed first. That is, the first operation type is the same as the second operation type, the first variable type number and the second variable type number exceed a threshold, and identical variable types exist, so the specific variable number of the identical variable types in the first count array and the second count array is determined. Count arrays 2 and 3 share 1 identical variable type, so 1 type is first removed from each of them, and the specific variable number of the identical variable types is 1. After removal, only 1 expression type remains in the child nodes of each of count arrays 2 and 3, and the total number of variables can be taken as 2. The number of removed expressions is then divided by the original number of nodes: number of nodes removed from count array 2 (or count array 3): 1 / original number of nodes in count array 2 (or count array 3): 2 = 50%, and 50% is taken as the similarity between the first code block and the second code block in the target code file.
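The first quotient for count arrays 1 and 2 can be reproduced with a short sketch. The count-array layout and the name `first_quotient` are assumptions. Dividing the smaller count by the larger one is also an assumption, made so that the result stays within 0 to 1 regardless of argument order; it agrees with the worked figures in this description, where the smaller value is always the numerator.

```python
def first_quotient(first, second):
    """Ratio of variable numbers when the operation types match."""
    if first["type"] != second["type"]:
        return 0.0
    first_num = sum(n["num"] for n in first["nodes"])
    second_num = sum(n["num"] for n in second["nodes"])
    larger = max(first_num, second_num)
    if larger == 0:
        # Assumption: two blocks with no variables are treated as identical.
        return 1.0
    return min(first_num, second_num) / larger

# Count array 1: one "var" declaration; count array 2: two of them.
count_array_1 = {"type": "FunctionDeclaration",
                 "nodes": [{"type": "VariableDeclaration", "num": 1}]}
count_array_2 = {"type": "FunctionDeclaration",
                 "nodes": [{"type": "VariableDeclaration", "num": 2}]}
similarity_1_2 = first_quotient(count_array_1, count_array_2)  # 1 / 2
```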
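The second quotient compares how many distinct variable types each count array holds. The sketch below is illustrative only: the layout and the name `second_quotient` are assumptions, as is the smaller-over-larger ordering, which matches both the 1/2 and the 2/3 figures given in the text.

```python
def second_quotient(first, second):
    """Ratio of distinct variable-type counts when operation types match."""
    if first["type"] != second["type"]:
        return 0.0
    first_kinds = len({n["type"] for n in first["nodes"]})
    second_kinds = len({n["type"] for n in second["nodes"]})
    larger = max(first_kinds, second_kinds)
    if larger == 0:
        # Assumption: two blocks with no child types are treated as identical.
        return 1.0
    return min(first_kinds, second_kinds) / larger

# Count array 1 holds one type; count array 3 holds two types.
count_array_1 = {"type": "FunctionDeclaration",
                 "nodes": [{"type": "VariableDeclaration", "num": 1}]}
count_array_3 = {"type": "FunctionDeclaration",
                 "nodes": [{"type": "VariableDeclaration", "num": 1},
                           {"type": "FunctionDeclaration", "num": 1}]}
similarity_1_3 = second_quotient(count_array_1, count_array_3)  # 1 / 2
```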
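The third quotient divides the number of variable types shared by both count arrays by their common total number of variables. In this sketch the layout and the name `third_quotient` are assumptions, and raising an error when the totals differ is a design choice made here, since the text only covers the case where the totals are the same.

```python
def third_quotient(first, second):
    """Shared variable types divided by the common variable total."""
    if first["type"] != second["type"]:
        return 0.0
    first_total = sum(n["num"] for n in first["nodes"])
    second_total = sum(n["num"] for n in second["nodes"])
    if first_total != second_total or first_total == 0:
        # Assumption: this rule is only defined for equal, non-zero totals.
        raise ValueError("third quotient needs equal, non-zero variable totals")
    shared = ({n["type"] for n in first["nodes"]}
              & {n["type"] for n in second["nodes"]})
    return len(shared) / first_total

# Count array 2: two "var" declarations; count array 3: one "var", one function.
count_array_2 = {"type": "FunctionDeclaration",
                 "nodes": [{"type": "VariableDeclaration", "num": 2}]}
count_array_3 = {"type": "FunctionDeclaration",
                 "nodes": [{"type": "VariableDeclaration", "num": 1},
                           {"type": "FunctionDeclaration", "num": 1}]}
similarity_2_3 = third_quotient(count_array_2, count_array_3)  # 1 / 2
```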
The scheme of this embodiment is not based on character comparison; it analyzes similarity at the syntactic and logical level, so it is not affected by code compilation or by obfuscation and encryption techniques. The technical scheme provides the conditions for extracting and abstracting common code, helps developers find out which code is worth making common, and greatly reduces the maintenance cost of the code.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
In this embodiment, a device for detecting code similarity is further provided, which is used to implement the foregoing embodiments and preferred embodiments, and the description already made is omitted for brevity. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 8 is a block diagram of a device for detecting code similarity according to an embodiment of the present application, and as shown in fig. 8, the device includes: a generation module 80, a calculation module 82, wherein,
a generating module 80, configured to parse a syntax structure of a target code file and generate an abstract syntax tree of the target code file, where the abstract syntax tree includes a plurality of nodes, and each node corresponds to a temporary array and a count array;
and a calculating module 82, configured to calculate, by using the temporary array and the count array, a code similarity of the code blocks in the target code file.
Optionally, the calculation module includes: a generating unit, configured to generate a plurality of count arrays based on a plurality of temporary array correspondences in the AST, where the count arrays include the following information of variable definition expressions: operation type, variable type and variable number; and the comparison unit is used for comparing a first count array and a second count array in the plurality of count arrays and determining the similarity between the first count array and the second count array as the similarity between a first code block and a second code block in the target code file, wherein the first count array corresponds to the first code block, and the second count array corresponds to the second code block.
Optionally, the generating unit includes: the first reading subunit is used for reading a first key character in each temporary array in the AST, wherein the first key character is used for representing the operation type of a variable definition expression; the second reading subunit is configured to read second key characters in the temporary array, and count the variable quantity of the second key characters, where the second key characters are used to represent variable types under a current operation type, and each second key character corresponds to one variable type; and the generating subunit is used for generating a counting array by adopting the first key character, the second key character and the variable quantity.
Optionally, the comparison unit includes: the comparison subunit is used for comparing the first operation type of the first counting array with the second operation type of the second counting array; the first calculating subunit is configured to calculate a first quotient value between a first variable number of the first count array and a second variable number of the second count array if the first operation type is the same as the second operation type, where the first quotient value is used to represent a first similarity between the first count array and the second count array; a first determining subunit, configured to determine the first quotient value as a similarity between a first code block and a second code block in the target code file.
Optionally, the comparison unit includes: the comparison subunit is used for comparing the first operation type of the first counting array with the second operation type of the second counting array; a second calculating subunit, configured to calculate, if the first operation type is the same as the second operation type, a second quotient between a first variable type number of the first count array and a second variable type number of the second count array, where the first variable type number and the second variable type number are obtained according to variable type types included in the first operation type and the second operation type, and the second quotient is used to represent a second similarity between the first count array and the second count array; a second determining subunit, configured to determine the second quotient value as a similarity between the first code block and the second code block in the target code file.
Optionally, the comparison unit includes: the comparison subunit is used for comparing the first operation type of the first counting array with the second operation type of the second counting array; a third calculation subunit, configured to, if the first operation type is the same as the second operation type, determine a specific variable number of the same variable type in the first count array and the second count array, determine a total variable number of the first count array or the second count array, and calculate a third quotient of the specific variable number and the total variable number, where the total variable numbers of the first count array and the second count array are the same, and the third quotient is used to represent a third similarity between the first count array and the second count array; a third determining subunit, configured to determine the third quotient value as a similarity between the first code block and the second code block in the target code file.
Optionally, the generating module includes: the dividing unit is used for dividing the code character string in the target code file into lexical units, wherein each lexical unit corresponds to one code block; a generating unit configured to generate a lexical unit list based on the divided plurality of lexical units; a conversion unit configured to convert the lexical unit list into AST in units of variable definition expressions.
Optionally, the conversion unit includes: the splitting subunit is used for splitting the lexical unit list into a plurality of temporary arrays by taking the variable definition expressions as units, wherein each temporary array corresponds to one variable definition expression, and each variable definition expression is identified by adopting key characters; a mapping subunit, configured to define an expression for each variable of the lexical unit list, and map the corresponding temporary array as a sub-AST; and the combining subunit is used for respectively determining the plurality of sub ASTs as the sub nodes under the designated tree root node, and combining to obtain a total AST.
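The mapping and combining steps can be pictured with a small sketch. Everything named here is hypothetical: the dict layout, the function names `to_sub_ast` and `combine_sub_asts`, and the root type `Program`, which is borrowed from common JavaScript AST conventions and is not stated in the text.

```python
def to_sub_ast(temp_array):
    """Map one temporary array to a sub-AST (here: a shallow copy)."""
    return dict(temp_array)

def combine_sub_asts(temp_arrays, root_type="Program"):
    """Attach every sub-AST as a child node of one designated root."""
    return {"type": root_type, "nodes": [to_sub_ast(t) for t in temp_arrays]}

# Two hypothetical temporary arrays, one per variable definition expression.
temp_arrays = [
    {"type": "VariableDeclaration", "kind": "var", "name": "a", "value": 1},
    {"type": "VariableDeclaration", "kind": "var", "name": "b", "value": 1},
]
total_ast = combine_sub_asts(temp_arrays)
```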
Optionally, the dividing unit includes: a scanning subunit, configured to scan the code character strings from left to right starting from a first character of the code character strings in the target code file until a next empty character string, and extract a consecutive character string before the empty character string; a determining subunit, configured to determine the continuous character string as a key character, continue to scan the code character string from left to right until a nearest identifier after the key character, and determine a character set between the key character and the identifier as a lexical unit; and the processing subunit is used for continuously scanning the rest code character strings from left to right in a line feed manner until the last character of the code character strings is scanned.
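One possible reading of this scanning procedure is sketched below. It is an interpretation under stated assumptions, not the embodiment's scanner: the text does not say which character acts as the closing identifier, so `;` is assumed, and the function name `split_lexical_units` and the returned (key character, lexical unit) pairs are illustrative.

```python
def split_lexical_units(code, terminator=";"):
    """Split source text into (key character, lexical unit) pairs.

    For each non-empty line, the run of characters before the first
    whitespace is taken as the key character, and the text between it
    and the assumed terminator is taken as the lexical unit.
    """
    units = []
    for line in code.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        parts = stripped.split(None, 1)  # split at the first whitespace run
        key = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        body = rest.partition(terminator)[0]
        units.append((key, body.strip()))
    return units

units = split_lexical_units("var a = 1;\n\nvar b = 2;")
```

A production lexer would also have to handle strings, comments, and statements spanning several lines, which this sketch ignores.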
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Fig. 9 is a structural diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 9, the electronic device includes a processor 91, a communication interface 92, a memory 93, and a communication bus 94, where the processor 91, the communication interface 92, and the memory 93 communicate with each other through the communication bus 94. The memory 93 is used for storing a computer program; the processor 91 is configured to implement the code similarity detection method according to any one of the embodiments when executing the program stored in the memory 93.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment provided by the present application, a computer-readable storage medium is further provided, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the computer is caused to execute the method for detecting code similarity in any of the above embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the steps of any embodiment of the code similarity detection method.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for detecting code similarity, comprising:
analyzing a syntactic structure of a target code file, and generating an abstract syntactic tree of the target code file, wherein the abstract syntactic tree comprises a plurality of nodes, and each node corresponds to a temporary array and a counting array;
and calculating the code similarity of the code blocks in the target code file by adopting the temporary array and the counting array.
2. The method of claim 1, wherein calculating the code similarity of the code blocks in the object code file using the temporary array and the count array comprises:
generating a plurality of count arrays based on the plurality of temporary array correspondences in the abstract syntax tree, wherein the count arrays include the following information of variable definition expressions: operation type, variable type and variable number;
comparing a first count array and a second count array in the plurality of count arrays, and determining the similarity between the first count array and the second count array as the similarity between a first code block and a second code block in the target code file, wherein the first count array corresponds to the first code block, and the second count array corresponds to the second code block.
3. The method of claim 2, wherein generating a plurality of count arrays based on the plurality of temporary array correspondences in the abstract syntax tree comprises:
for each temporary array in the abstract syntax tree, reading a first key character in the temporary array, wherein the first key character is used for representing the operation type of a variable definition expression;
reading second key characters in the temporary array, and counting the variable quantity of the second key characters, wherein the second key characters are used for representing variable types under the current operation type, and each second key character corresponds to one variable type;
and generating a counting array by adopting the first key character, the second key character and the variable quantity.
4. The method of claim 2, wherein comparing a first count array and a second count array of the plurality of count arrays and determining a similarity between the first count array and the second count array as a similarity of a first code block and a second code block in the target code file comprises:
comparing the first operation type of the first counting array with the second operation type of the second counting array;
if the first operation type is the same as the second operation type, calculating a first quotient value between a first variable number of the first count array and a second variable number of the second count array, wherein the first quotient value is used for representing a first similarity between the first count array and the second count array;
and determining the first quotient as the similarity between the first code block and the second code block in the target code file.
5. The method of claim 2, wherein comparing a first count array and a second count array of the plurality of count arrays and determining a similarity between the first count array and the second count array as a similarity of a first code block and a second code block in the target code file comprises:
comparing the first operation type of the first counting array with the second operation type of the second counting array;
if the first operation type is the same as the second operation type, calculating a second quotient value between the first variable type quantity of the first counting array and the second variable type quantity of the second counting array; the first variable type quantity and the second variable type quantity are obtained through statistics according to variable type types included in the first operation type and the second operation type respectively, wherein the second quotient is used for representing a second similarity between the first counting array and the second counting array;
and determining the second quotient as the similarity between the first code block and the second code block in the target code file.
6. The method of claim 2, wherein comparing a first count array and a second count array of the plurality of count arrays and determining a similarity between the first count array and the second count array as a similarity of a first code block and a second code block in the target code file comprises:
comparing the first operation type of the first counting array with the second operation type of the second counting array;
if the first operation type is the same as the second operation type, determining the specific variable quantity of the same variable type in the first count array and the second count array, determining the variable total quantity of the first count array or the second count array, and calculating a third quotient value of the specific variable quantity and the variable total quantity, wherein the variable total quantities of the first count array and the second count array are the same, and the third quotient value is used for representing a third similarity between the first count array and the second count array;
and determining the third quotient value as the similarity of the first code block and the second code block in the target code file.
7. The method of claim 1, wherein parsing the syntax structure of the object code file to generate an abstract syntax tree of the object code file comprises:
dividing the code character string in the target code file into lexical units, wherein each lexical unit corresponds to one code block;
generating a lexical unit list based on the divided plurality of lexical units;
and converting the lexical unit list into an abstract syntax tree by taking a variable definition expression as a unit.
8. The method of claim 7, wherein converting the lexical unit list into an abstract syntax tree in units of variable definition expressions comprises:
splitting the lexical unit list into a plurality of temporary arrays by taking the variable definition expressions as units, wherein each temporary array corresponds to one variable definition expression, and each variable definition expression is identified by adopting key characters;
defining an expression aiming at each variable of the lexical unit list, and mapping a corresponding temporary array into a sub-abstract syntax tree;
and respectively determining the plurality of sub abstract syntax trees as the sub nodes under the appointed tree root node, and combining to obtain a total abstract syntax tree.
9. The method of claim 7, wherein segmenting the code string in the target code file into lexical units comprises:
scanning the code character string from left to right from the first character of the code character string in the target code file until the next empty character string, and extracting a continuous character string before the empty character string;
and determining the continuous character string as a key character, continuing to scan the code character string from left to right until the nearest identifier after the key character, and determining a character set between the key character and the identifier as a lexical unit.
10. A device for detecting similarity between codes, comprising:
the generating module is used for analyzing a syntactic structure of the target code file and generating an abstract syntactic tree of the target code file, wherein the abstract syntactic tree comprises a plurality of nodes, and each node corresponds to a temporary array and a counting array;
and the calculation module is used for calculating the code similarity of the code blocks in the target code file by adopting the temporary array and the counting array.
11. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program, when run, performs the method steps of any one of claims 1 to 9.
12. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; wherein:
the memory is used for storing a computer program;
the processor is used for performing the method steps of any one of claims 1 to 9 by executing the program stored in the memory.
CN202110390520.8A 2021-04-12 2021-04-12 Code similarity detection method and device, storage medium and electronic equipment Pending CN113158640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110390520.8A CN113158640A (en) 2021-04-12 2021-04-12 Code similarity detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110390520.8A CN113158640A (en) 2021-04-12 2021-04-12 Code similarity detection method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113158640A true CN113158640A (en) 2021-07-23

Family

ID=76890021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110390520.8A Pending CN113158640A (en) 2021-04-12 2021-04-12 Code similarity detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113158640A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827214A (en) * 2023-12-25 2024-04-05 慧之安信息技术股份有限公司 Algorithm expanding method and system in algorithm management platform
CN117827214B (en) * 2023-12-25 2024-06-11 慧之安信息技术股份有限公司 Algorithm expanding method and system in algorithm management platform

Similar Documents

Publication Publication Date Title
CN109445834B (en) Program code similarity rapid comparison method based on abstract syntax tree
CN111177184A (en) Structured query language conversion method based on natural language and related equipment thereof
CN107341399B (en) Method and device for evaluating security of code file
CN111552969A (en) Embedded terminal software code vulnerability detection method and device based on neural network
CN108664471B (en) Character recognition error correction method, device, equipment and computer readable storage medium
CN106295346B (en) Application vulnerability detection method and device and computing equipment
CN107273546B (en) Counterfeit application detection method and system
CN111723192B (en) Code recommendation method and device
CN113961768A (en) Sensitive word detection method and device, computer equipment and storage medium
CN113254023B (en) Object reading method and device and electronic equipment
CN110795069A (en) Code analysis method, intelligent terminal and computer readable storage medium
CN117113347A (en) Large-scale code data feature extraction method and system
CN113158640A (en) Code similarity detection method and device, storage medium and electronic equipment
CN114201756A (en) Vulnerability detection method and related device for intelligent contract code segment
CN112069305B (en) Data screening method and device and electronic equipment
CN117940894A (en) System and method for detecting code clones
CN113158627A (en) Code complexity detection method and device, storage medium and electronic equipment
CN110989991B (en) Method and system for detecting source code clone open source software in application program
CN110188432B (en) System architecture verification method, electronic device and computer-readable storage medium
CN113128213A (en) Log template extraction method and device
CN114490673B (en) Data information processing method and device, electronic equipment and storage medium
CN113688240B (en) Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium
CN114510717A (en) ELF file detection method and device and storage medium
CN115525671A (en) Data query method, device, equipment and storage medium
CN114880523A (en) Character string processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination