CN114816518A - Simhash-based open source component screening and identifying method and system in source code - Google Patents

Simhash-based open source component screening and identifying method and system in source code

Publication number
CN114816518A
Authority
CN
China
Prior art keywords
source
code
open source
module
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210337119.2A
Other languages
Chinese (zh)
Inventor
汪杰
万振华
王颉
李华
董燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seczone Technology Co Ltd
Original Assignee
Seczone Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seczone Technology Co Ltd filed Critical Seczone Technology Co Ltd
Priority to CN202210337119.2A priority Critical patent/CN114816518A/en
Publication of CN114816518A publication Critical patent/CN114816518A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9014 Indexing; Data structures therefor; Storage structures hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a simhash-based method and system for screening and identifying open source components in source code. The method comprises the following steps: constructing a basic source code library; analyzing the source code in each open source file in the basic source code library with a simhash algorithm to obtain a data matching table of first number groups; analyzing each source code file in the source code component to be detected in the same way as the open source files to obtain a plurality of second number groups; matching the character strings in each second number group against the character strings in each first number group in the data matching table; judging whether any first number group contains a character string identical to one in the current second number group and, if so, defining that first number group as a candidate number group; and finding, in the basic source code library, a plurality of open source components related to the source code component to be detected according to the candidate number groups. The method detects open source components in software at the source code level, with wide detection coverage and high detection efficiency.

Description

Simhash-based open source component screening and identifying method and system in source code
Technical Field
The invention belongs to the technical field of software source code open source component detection, and particularly relates to a simhash-based open source component screening and identifying method and system in a source code.
Background
Open source component analysis aims to identify the open source components referenced in a software project so that vulnerabilities can be detected in a targeted way; the open source components in software include open source components used as a whole as well as partially copied open source code. Many SCA (software composition analysis) tools support the analysis of open source components, but they typically identify the open source components used in a project from the project's feature files. Open source component analysis based on the code itself is rare, mainly because analyzing components against massive amounts of open source code is difficult and the detection efficiency is not ideal.
Disclosure of Invention
The invention aims to solve the above technical problem by providing a simhash-based method and system for screening and identifying open source components in source code, which detect open source components in software against massive open source code while effectively ensuring detection efficiency.
To achieve this purpose, the invention discloses a simhash-based method for screening and identifying open source components in source code, which comprises the following steps:
acquiring a plurality of open source components to construct a basic source code library, wherein each open source component comprises a plurality of open source files;
analyzing the source code in each open source file in the basic source code library with a simhash algorithm to obtain a plurality of first number strings, each associated with its open source file; dividing each first number string into four segments in sequence to obtain a plurality of first number groups, each associated with its open source file and comprising four character strings; and storing the first number groups in a database to obtain a data matching table;
analyzing each source code file in the source code component to be detected in the same way as the open source files to obtain a plurality of second number groups, each associated with its source code file and comprising four character strings;
matching each character string in a second number group, as a whole, against the character strings in each first number group in the data matching table;
judging whether any first number group contains a character string identical to one in the current second number group and, if so, defining that first number group as a candidate number group;
and finding, in the basic source code library, a plurality of open source components related to the source code component to be detected according to the candidate number groups.
Preferably, the method further comprises the following steps for further screening the candidate number groups:
judging, for any candidate number group, whether each of its other three character strings exists in the current second number group, and calculating the Hamming distance between the candidate number group and the second number group according to the judgment result;
judging whether the Hamming distance between the candidate number group and the second number group is less than or equal to 3 and, if so, defining the candidate number group as a target number group;
and finding, in the basic source code library, a plurality of open source components related to the source code component to be detected according to the target number groups.
Preferably, finding a plurality of open source components related to the source code component to be detected in the basic source code library according to the target number groups comprises:
generating an association table in the basic source code library, wherein the association table records, for each first number group, the associated open source file name, the name of the open source component to which the open source file belongs, the version number of the open source component and the storage location of the open source component;
looking up, in the association table, all open source files, open source components and storage locations corresponding to the target number groups;
counting the number of source code files corresponding to each open source component found, and calculating the similarity between each open source component and the source code component to be detected according to the counts;
and taking the first N open source components, in descending order of similarity, as the objects of further comparison and analysis.
Preferably, the method further comprises the following steps for storing the open source files in the basic source code library:
performing a hash calculation on the whole of any open source file with a minhash algorithm to obtain a characteristic hash code corresponding to the open source file, and naming the open source file with the characteristic hash code;
and segmenting the characteristic hash code to obtain hash code segments, establishing a multi-level storage directory according to the hash code segments, and storing the open source files with the same hash code segment in the storage directory of the same level.
The invention also discloses a simhash-based system for screening and identifying open source components in source code, which comprises a source code library construction module, a file analysis module, a data table establishing module, a matching module, a first judging module, a first screening module and a searching module;
the source code library construction module is used for constructing a basic source code library from a plurality of acquired open source components, each open source component comprising a plurality of open source files;
the file analysis module is used for analyzing code files, including the open source files and the source code files in the source code component to be detected, with a simhash algorithm to obtain a plurality of hash number strings, each associated with its code file, and for dividing each hash number string into four segments in sequence to obtain a plurality of number groups, each associated with its code file and comprising four character strings, wherein the number groups comprise first number groups corresponding to the open source files and second number groups corresponding to the source code files;
the data table establishing module is used for storing the first number groups in a database to obtain a data matching table;
the matching module is used for matching each character string in the second number group of any source code file in the current source code component to be detected, as a whole, against the character strings in each first number group in the data matching table;
the first judging module is used for judging, according to the matching result of the matching module, whether any first number group contains a character string identical to one in the current second number group;
the first screening module is used for defining each first number group for which the result of the first judging module is yes as a candidate number group;
and the searching module is used for finding, in the basic source code library, a plurality of open source components related to the source code component to be detected according to the candidate number groups.
Preferably, the system further comprises a second judging module, a calculation module and a second screening module;
the second judging module is used for judging, for any candidate number group, whether each of its other three character strings exists in the current second number group;
the calculation module is used for calculating the Hamming distance between the candidate number group and the second number group according to the judgment result of the second judging module;
the second screening module is used for defining each candidate number group whose Hamming distance from the second number group is less than or equal to 3 as a target number group;
and the searching module finds, in the basic source code library, a plurality of open source components related to the source code component to be detected according to the target number groups.
Preferably, the system further comprises an association table generating module, and the searching module comprises a table look-up module, a statistical calculation module and an extraction module;
the association table generating module is used for generating an association table in the basic source code library, wherein the association table records, for each first number group, the associated open source file name, the name of the open source component to which the open source file belongs, the version number of the open source component and the storage location of the open source component;
the table look-up module is used for looking up, in the association table, all open source files, open source components and storage locations corresponding to the target number groups;
the statistical calculation module is used for counting the number of source code files corresponding to each open source component found and calculating the similarity between each open source component and the source code component to be detected according to the counts;
and the extraction module is used for extracting the first N open source components, in the descending order of similarity obtained by the statistical calculation module, as the objects of further comparison and analysis.
Preferably, the system also comprises a file name processing module and a directory establishing and storing module;
the file name processing module is used for performing hash calculation on the whole open source file by adopting a minhash algorithm to obtain a characteristic hash code corresponding to the open source file, and naming the open source file by the characteristic hash code;
the directory establishing and storing module is used for segmenting the characteristic hash code to obtain a hash code segment, establishing a multi-level storage directory according to the hash code segment, and storing the open source files with the same hash code segment in the storage directory of the same level.
The invention also discloses a simhash-based system for screening and identifying open source components in source code, which comprises:
one or more processors;
a memory;
and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs including instructions for performing the simhash-based method for screening and identifying open source components in source code described above.
The invention also discloses a computer-readable storage medium comprising a computer program, wherein the computer program is executable by a processor to carry out the simhash-based method for screening and identifying open source components in source code described above.
Compared with the prior art, the method for screening and identifying open source components in source code first constructs a basic source code library comprising a large number of open source files and processes the open source files in the library with a simhash algorithm, resolving each open source file, according to its source code content, into a first number group comprising four character strings, so as to obtain a data matching table comprising a plurality of first number groups. Each source code file in the source code component to be detected is then resolved in the same way to obtain a second number group, which is matched against each first number group in the data matching table. Any first number group that shares at least one character string with the current second number group is defined as a candidate number group, so that the open source files most likely to be similar to the source code file to be detected are selected from the massive set of open source files and those with no possibility of similarity are removed. Because the first and second number groups used for comparison are derived from the code content by the simhash algorithm, open source components in software can be detected at the source code level, and the detection coverage is wide. Moreover, with this processing only whole-string matching is needed for each pair of character strings during matching, with no bit-by-bit comparison, which effectively improves the matching speed and ensures the detection efficiency for open source components.
Drawings
FIG. 1 is a flowchart of a method for screening and identifying open source components in source code according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method for screening and identifying open source components in source code according to another embodiment of the present invention.
FIG. 3 is a flowchart of a method for screening and identifying open source components in a source code according to still another embodiment of the present invention.
Detailed Description
In order to explain the technical content, structural features, objects and effects of the present invention in detail, a detailed description is given below with reference to the embodiments and the accompanying drawings.
This embodiment discloses a method for screening and identifying open source components in source code, which is used for rapidly detecting the open source components in a software project at the code level. Specifically, as shown in FIG. 1, the method comprises the following steps:
S10: Acquire a plurality of open source components to construct a basic source code library, wherein each open source component comprises a plurality of open source files. In this step, the open source files may be downloaded from an open source code hosting platform or acquired in other ways.
S11: Analyze the source code in each open source file in the basic source code library with a simhash algorithm to obtain a plurality of first number strings, each associated with its open source file, and equally divide each first number string into four segments in sequence to obtain a plurality of first number groups, each associated with its open source file and comprising four character strings.
S12: Store the first number groups in a database to obtain a data matching table.
S20: Analyze each source code file in the source code component to be detected in the same way as the open source files; that is, analyze the source code in each source code file in the source code component to be detected with a simhash algorithm to obtain a plurality of second number strings, each associated with its source code file, and equally divide each second number string into four segments in sequence to obtain a plurality of second number groups, each associated with its source code file and comprising four character strings.
S21: Match the character strings in each second number group against the character strings in each first number group in the data matching table.
S22: Judge whether any first number group contains a character string identical to one in the current second number group; if so, go to S23; if not, skip that first number group.
S23: Define the first number group as a candidate number group.
S30: Find, in the basic source code library, a plurality of open source components related to the source code component to be detected according to the candidate number groups.
In this screening and identifying method, once the basic source code library has been formed, a simhash algorithm is used to analyze the source code in each open source file in the library, yielding first a first number string associated with the corresponding open source file. The first number string is 64 bits long. According to the hash judging rule, two 64-bit hash number strings that differ in no more than three bits are regarded as highly similar. In this embodiment, therefore, each hash number string (a first number string or a second number string) is equally divided into 4 segments, each a 16-bit character string. If two hash number strings are identical or highly similar, at most three of the four segments can contain a differing bit, so at least one segment must be identical; conversely, if all four character strings of the two hash number strings differ, the two hash number strings cannot be similar. Consequently, when a second number group is matched in the data matching table, it is only necessary to find the first number groups that share one identical character string with it; these are defined as candidate number groups for further detailed analysis. A large number of open source files that cannot be similar to the current source code file are thus filtered out of the massive source code library, which greatly shortens the detailed comparison time. In addition, when matching in the data matching table according to the second number group, each pair of 16-bit strings only needs to be checked for equality as a whole, and individual characters do not need to be compared bit by bit, which greatly increases the matching speed.
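The segment-based pre-filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`split_segments`, `build_matching_table`, `find_candidates`) are hypothetical, the data matching table is modelled as an in-memory dictionary rather than a database, and segments are keyed by their position, which is the reading under which the "at least one identical segment" argument holds.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def split_segments(fingerprint: int) -> Tuple[str, ...]:
    """Split a 64-bit fingerprint into four 16-character strings of '0' and '1'."""
    bits = format(fingerprint & (2**64 - 1), "064b")
    return tuple(bits[i:i + 16] for i in range(0, 64, 16))


def build_matching_table(library: Dict[str, int]) -> Dict[Tuple[int, str], Set[str]]:
    """Map every (segment position, segment string) pair to the open source files
    whose first number group contains it. Positional keying is an assumption."""
    table: Dict[Tuple[int, str], Set[str]] = defaultdict(set)
    for file_name, fingerprint in library.items():
        for position, segment in enumerate(split_segments(fingerprint)):
            table[(position, segment)].add(file_name)
    return table


def find_candidates(table: Dict[Tuple[int, str], Set[str]], fingerprint: int) -> Set[str]:
    """Return the files that share at least one whole segment with the query."""
    found: Set[str] = set()
    for position, segment in enumerate(split_segments(fingerprint)):
        # Whole-string dictionary lookup; no bit-by-bit comparison is needed here.
        found |= table.get((position, segment), set())
    return found
```

For instance, with a library containing fingerprints `0x1111222233334444` and `0xAAAABBBBCCCCDDDD`, a query of `0x1111222233334447` shares its first three segments with the first file only, so only that file becomes a candidate.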
In the above embodiment, when simhash processing is performed on code files such as the open source files and the source code files, each code file is first read and its content is segmented into words according to a given rule, for example on spaces and characters, so that the content of one code file yields a plurality of tokens (each word is a token). A weight is then assigned to each token according to service requirements, a hash integer value is calculated for each token, and the hash integer value is multiplied by the weight of the token to give the final hash value of the token. The final hash values of all tokens are accumulated into a 64-position value for the whole code file; finally, each position is set to 1 if its value is greater than 0 and to 0 if its value is less than or equal to 0, which converts the result into a 64-bit string consisting of 0s and 1s. In this embodiment, word segmentation and weighting make the resulting simhash value more accurate.
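A minimal sketch of this computation is given below, following the common simhash formulation in which each token contributes +weight to a position where its hash bit is 1 and -weight where it is 0. Several details are assumptions made for illustration only: the tokenizer (a regular expression splitting on whitespace and punctuation), the token hash (the first 8 bytes of SHA-256), the default weight of 1, and the names `token_hash` and `simhash`.

```python
import hashlib
import re
from typing import Dict, Optional


def token_hash(token: str) -> int:
    """64-bit integer hash of a token (first 8 bytes of SHA-256; illustrative choice)."""
    return int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")


def simhash(code: str, weights: Optional[Dict[str, int]] = None) -> int:
    """Compute a 64-bit simhash fingerprint of a code file's content."""
    weights = weights or {}
    totals = [0] * 64
    # Assumed segmentation rule: identifiers/numbers and single punctuation marks.
    for token in re.findall(r"\w+|[^\w\s]", code):
        h = token_hash(token)
        w = weights.get(token, 1)  # default weight 1 when no weight is supplied
        for bit in range(64):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[bit] += w if (h >> bit) & 1 else -w
    fingerprint = 0
    for bit in range(64):
        if totals[bit] > 0:  # > 0 becomes 1, <= 0 becomes 0
            fingerprint |= 1 << bit
    return fingerprint
```

With this tokenizer, `simhash("int  a=1;")` and `simhash("int a = 1 ;")` produce the same fingerprint, because both yield the same tokens regardless of spacing.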
To further improve the accuracy and efficiency of the detailed comparison, as shown in FIG. 2, the screening and identifying method further comprises the following steps for further screening the candidate number groups:
S24: Judge, for any candidate number group, whether each of its other three character strings exists in the current second number group. In step S22, when a character string in a first number group is identical to a character string in the second number group, that character string may be defined as the primary selection character string; step S24 then judges whether the three character strings of the candidate number group other than the primary selection character string exist in the second number group.
S25: Calculate the Hamming distance between the candidate number group and the second number group according to the judgment result.
S26: Judge whether the Hamming distance between the candidate number group and the second number group is less than or equal to 3; if so, go to S27; if not, discard the candidate number group.
S27: Define the candidate number group as a target number group.
Then, in step S30, a plurality of open source components related to the source code component to be detected are found in the basic source code library according to the target number groups.
In this embodiment, the second number group corresponding to any source code file is denoted P2[a1, a2, a3, a4], where a1, a2, a3 and a4 are the four 16-bit character strings in number group P2, and a first number group in the data matching table is denoted P1[b1, b2, b3, b4], where b1, b2, b3 and b4 are the four 16-bit character strings in number group P1. During matching, if a1 is identical to one of b1, b2, b3 and b4, for example a1 is identical to b1, number group P1 is defined as a candidate number group. The other three character strings a2, a3 and a4 in number group P2 are then matched against number group P1. If a2 is identical to one of b2, b3 and b4 (for example b3), it is skipped; if a2 differs from all of b2, b3 and b4, the Hamming distance between P1 and P2 is calculated based on a2. The character strings a3 and a4 are processed in the same way, and the resulting Hamming distances are added. If the total Hamming distance is not more than 3, for example the Hamming distance calculated based on a2 is 1 and that calculated based on a3 is 1, giving a total of 2, number group P1 is defined as a target number group. If the other three character strings a2, a3 and a4 in number group P2 all exist in number group P1, the Hamming distance between P1 and P2 is zero.
Compared with a candidate number group, the open source file corresponding to a target number group has a higher similarity to the source code file to be detected, which facilitates further detailed comparison.
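The second screening stage can be sketched as below. The sketch makes a simplifying assumption that differs slightly from the wording above: segments are compared position by position (a2 with b2, a3 with b3, and so on) rather than against any segment of the other group. The function names are hypothetical, and each segment is assumed to be a 16-character string of '0' and '1'.

```python
from typing import Sequence


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def group_distance(p1: Sequence[str], p2: Sequence[str]) -> int:
    """Total Hamming distance between two number groups of four segments.
    Identical segments are skipped; differing segments add their bit distance."""
    total = 0
    for b, a in zip(p1, p2):
        if a != b:
            total += hamming_distance(a, b)
    return total


def is_target(p1: Sequence[str], p2: Sequence[str], threshold: int = 3) -> bool:
    """A candidate number group becomes a target number group when the total
    Hamming distance does not exceed the threshold (3 in the description)."""
    return group_distance(p1, p2) <= threshold
```

Reproducing the example from the description, a candidate whose second and third segments each differ from the query by one bit has a total distance of 2 and is therefore kept as a target number group.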
Further, as shown in FIG. 3, finding a plurality of open source components related to the source code component to be detected in the basic source code library according to the target number groups comprises the following steps:
S300: Generate an association table in the basic source code library, wherein the association table records, for each first number group, the associated open source file name, the name of the open source component to which the open source file belongs, the version number of the open source component and the storage location of the open source component.
S301: Look up, in the association table, all open source files, open source components and storage locations corresponding to the target number groups.
S302: Count the number of source code files corresponding to each open source component found, and calculate the similarity between each open source component and the source code component to be detected according to the counts. For example, suppose the statistics show that five open source components have been found, namely open source components A, B, C, D and E, and that the current source code component to be detected comprises 100 source code files. If 80 source code files correspond to open source component A, 50 to open source component B, 40 to open source component C, 40 to open source component D and 30 to open source component E, then the similarity between the current source code component to be detected and open source components A, B, C, D and E is 80%, 50%, 40%, 40% and 30%, respectively.
S303: Take the first N open source components, in descending order of similarity, as the objects of further comparison and analysis. Following the example in step S302, if the first three open source components are to be taken, open source components A, B, C and D are selected: open source components C and D have the same rank, so both are selected.
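The counting, similarity calculation and top-N selection in steps S302 and S303 can be sketched as follows. The input structure is an assumption made for illustration: `matches` maps each source code file to the set of open source component names found for it through the association table look-up, and the function names are hypothetical. Similarity is taken as the matched file count divided by the total number of source code files, as in the worked example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def rank_components(matches: Dict[str, Set[str]], total_files: int) -> List[Tuple[str, float]]:
    """Count matching source code files per component and rank by similarity."""
    counts: Counter = Counter()
    for components in matches.values():
        counts.update(set(components))  # each file counts once per component
    ranked = [(name, n / total_files) for name, n in counts.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


def top_n_with_ties(ranked: List[Tuple[str, float]], n: int) -> List[Tuple[str, float]]:
    """Take the first n components, keeping any that tie with the n-th one."""
    if n <= 0:
        return []
    if len(ranked) <= n:
        return list(ranked)
    cutoff = ranked[n - 1][1]
    return [item for item in ranked if item[1] >= cutoff]
```

Applied to the example above (100 source code files, with 80, 50, 40, 40 and 30 files matching components A to E), taking the first three yields A, B, C and D, since C and D tie at 40%.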
Once the first N open source components have been selected, the open source files in those components can be compared one by one and line by line, using methods known to those skilled in the art, to perform in-depth detection. One method is direct text comparison, in which the text of the two code files is compared directly and the number of identical lines is counted. A second method is hash comparison, in which a unique hash value is calculated for each line of a code file, so that each code file has a plurality of hashes, and only the number of identical hashes in the two code files needs to be compared.
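The hash comparison variant can be sketched as below. This is only one plausible reading of the description: the per-line hash (SHA-256), the stripping of surrounding whitespace, the skipping of blank lines, and the names `line_hashes` and `shared_line_count` are all assumptions.

```python
import hashlib
from collections import Counter


def line_hashes(text: str) -> Counter:
    """Hash every non-blank line (whitespace-trimmed) and count occurrences."""
    return Counter(
        hashlib.sha256(line.strip().encode("utf-8")).hexdigest()
        for line in text.splitlines()
        if line.strip()
    )


def shared_line_count(file_a: str, file_b: str) -> int:
    """Number of line hashes the two code files have in common (multiset overlap)."""
    return sum((line_hashes(file_a) & line_hashes(file_b)).values())
```

Comparing hashes instead of raw text keeps each comparison to a fixed-length value, at the cost of treating any change within a line as a completely different line.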
In another preferred embodiment of the method for screening and identifying open source components in source code of the present invention, the method further comprises a method for storing open source files in the basic source code library, which comprises the following steps:
adopting a minhash algorithm to perform hash calculation on the whole of any open source file to obtain a 16-bit characteristic hash code corresponding to the open source file, and naming the open source file by using the characteristic hash code;
and segmenting the characteristic hash code to obtain a hash code segment, establishing a multi-level storage directory according to the hash code segment, and storing the open source files with the same hash code segment in the storage directory of the same level.
In this embodiment, each open source file is named by a 16-bit characteristic hash code. This avoids storing the same source code file repeatedly across different versions of the same open source component, and it allows original open source files that have the same name but different contents to be stored separately and told apart. In addition, because the characteristic hash code is segmented and the multi-level storage directory is built from the hash code segments, no single directory holds an excessive number of files, and a given open source file can be located quickly. For example, a directory named 0101 is created, and all open source files whose characteristic hash code begins with "0101" are placed in it. During traversal and search, the storage directory is found from the first 4 bits of the name of the open source file being searched for, and the specific file is then located by traversing that directory.
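The naming and directory scheme can be sketched as follows. The patent computes the characteristic hash code with a minhash algorithm; this sketch does not implement minhash. It substitutes a truncated SHA-256 digest purely to obtain a stable 16-character name, and the helper names, segment length, and number of directory levels are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def feature_code(content: bytes, length: int = 16) -> str:
    """Stand-in for the characteristic hash code (NOT the patent's minhash)."""
    return hashlib.sha256(content).hexdigest()[:length]


def storage_path(root: str, code: str, segment_len: int = 4, levels: int = 1):
    """Build root/<segment 1>/.../<code> from the leading segments of the code."""
    segments = [code[i * segment_len:(i + 1) * segment_len] for i in range(levels)]
    return PurePosixPath(root, *segments, code)
```

For instance, `storage_path("repo", "0101abcd0123ffff")` gives `repo/0101/0101abcd0123ffff`, which matches the example of a directory named 0101 holding every file whose code begins with "0101". Because the name is derived from the content, identical files from different component versions map to the same path.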
To sum up, the embodiment of the invention discloses a simhash-based method for screening and identifying open source components in source code.
Then, analyzing the open source files in the basic source code library by adopting a simhash algorithm, equally dividing the analyzed 64-bit hash code string corresponding to each open source file into four segments so as to obtain a first code group comprising four segments of character strings corresponding to each open source file, and storing the first code groups in a database so as to obtain a data matching table.
Then, the source code file in the source code assembly to be tested is analyzed by adopting the same analyzing and processing mode so as to obtain a plurality of second numerical code groups which respectively correspond to each source code file and comprise four sections of character strings.
Then, each character string in any second number group is matched against the character strings in each first number group in the data matching table, and it is judged whether any first number group contains a character string identical to one in the current second number group. If so, that first number group is defined as a number group to be selected, and the identical character string is defined as the initially selected character string.
And then, further screening the number group to be selected, namely, respectively matching other three character strings except the initially selected character string in the number group to be selected with the character strings in the current second number group, and calculating the Hamming distance between the number group to be selected and the current second number group according to the matching result.
Then, whether the Hamming distance between any number group to be selected and the current second number group is less than or equal to 3 is judged, if yes, the number group to be selected is defined as a target number group.
Then, according to the association table, the open source file, the open source component and the storage location corresponding to any target code group (i.e. any source code file) are searched.
Then, the number of the source code files corresponding to each found open source component is counted, and the similarity between each open source component and the current source code component to be tested is calculated according to the counting result.
Then, according to the sequence of similarity from high to low, taking the first N open source components for further comparison analysis.
Finally, each open source file in the first N open source components is taken out and compared line by line with each source code file in the source code component to be tested, so as to obtain the accurate similarity between each of the N selected open source components and the source code component to be tested, and thereby to analyze quantitatively the open source components contained in the current source code component to be tested.
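The fingerprinting step in the summary above (a 64-bit simhash code split equally into four segments) can be sketched as follows. The patent does not specify the tokenizer, the per-token hash, or the token weights, so the word-level tokens, the BLAKE2b token hash, the unit weights, and the binary-string representation of the segments are all assumptions of this example.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """Compute a 64-bit simhash over word tokens, each with weight 1."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for bit in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    value = 0
    for bit in range(64):
        if weights[bit] > 0:
            value |= 1 << bit
    return value


def split_segments(value: int, parts: int = 4, width: int = 64):
    """Split a fingerprint into equal-length binary strings, most significant first."""
    bits = format(value, f"0{width}b")
    size = width // parts
    return [bits[i * size:(i + 1) * size] for i in range(parts)]
```

Applying `split_segments` to the fingerprint of an open source file yields one first code group; applying it to a source code file under test yields one second code group.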
Moreover, the method for screening and identifying open source components in source code further comprises a method for displaying the detection result: the obtained result is shown on a display device that provides an upload page, a detection result page, and a parameter setting page. The parameter setting page is used, for example, to set the value of N in the above embodiment; a sensitivity can also be set, which specifies how many identical lines a file must contain to be reported. The detection result page first displays the N open source components returned by the detection. Opening an open source component shows how many of its open source files meet the sensitivity requirement. Opening an open source file shows the text of the specific source code file under test alongside the text of the open source file from the source code library, with the identical lines of the two files highlighted.
The invention also discloses a system for screening and identifying open source components in source code, which comprises a source code library construction module, a file analysis module, a data table establishment module, a matching module, a first judgment module, a first screening module and a searching module.
And the source code library construction module is used for constructing a basic source code library through the acquired plurality of open source components, and each open source component comprises a plurality of open source files.
The file analysis module is used for analyzing the code files including the open source files and the source code files in the source code assembly to be detected by adopting a simhash algorithm to obtain a plurality of hash digital strings respectively related to corresponding codes, and dividing the hash digital strings into four sections in sequence to obtain a plurality of digital groups including four character strings respectively related to the corresponding code files, wherein the digital groups include a first digital group corresponding to the open source files and a second digital group corresponding to the source code files.
And the data table establishing module is used for storing the first number groups in a database to obtain a data matching table.
And the matching module is used for respectively matching the character strings in the second numerical group corresponding to any source code file in the current source code assembly to be tested with the character strings in each first numerical group in the data matching table.
And the first judging module is used for judging whether the character strings same as those in the current second numerical code group exist in any first numerical code group or not according to the matching result of the matching module.
And the first screening module is used for defining each first digital code group for which the judgment result of the first judging module is yes as a to-be-selected digital code group.
And the searching module is used for finding out a plurality of open source components related to the source code component to be detected in the basic source code library according to the number code group to be selected.
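The interplay of the data table establishing, matching, judging, and first screening modules can be sketched as a lookup table keyed by segment. This is one possible reading only: the patent does not say whether a character string must match at the same position within the group, so the position-wise key `(position, segment)`, the in-memory dictionary standing in for the database, and the function names are assumptions.

```python
from collections import defaultdict


def build_matching_table(first_groups):
    """Index every segment of every first code group.

    first_groups: dict mapping an open source file id to its four segments
    (hypothetical shape; the patent stores the groups in a database).
    """
    table = defaultdict(set)
    for file_id, group in first_groups.items():
        for position, segment in enumerate(group):
            table[(position, segment)].add(file_id)
    return table


def candidate_files(table, second_group):
    """Files whose first code group shares at least one segment with second_group."""
    found = set()
    for position, segment in enumerate(second_group):
        found |= table.get((position, segment), set())
    return found
```

Indexing by segment means a query touches only the groups that share a segment with it, rather than scanning every first code group in the library.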
Further, the system for screening and identifying open source components in source code in this embodiment further includes a second determining module, a calculating module, and a second screening module.
And the second judging module is used for respectively judging whether other three character strings in any one to-be-selected number group exist in the current second number group.
And the calculation module is used for calculating the Hamming distance between the to-be-selected digital group and the second digital group according to the judgment result of the second judgment module.
And the second screening module is used for defining each number group to be selected whose Hamming distance from the second number group is less than or equal to 3 as a target number group.
The searching module finds out a plurality of open source components related to the source code component to be detected in the basic source code library according to the target number code group.
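The second screening can be sketched as follows, assuming the four segments are binary strings and the Hamming distance is taken bit by bit over the concatenated 64-bit code. The patent derives the distance from the segment matching result without giving a formula, so this bitwise computation is an assumption. Under it, the first screening loses no valid match: two 64-bit codes that differ in at most 3 bits must agree on at least one of their four segments.

```python
def hamming_distance(group_a, group_b):
    """Number of differing bits between two code groups of binary-string segments."""
    bits_a, bits_b = "".join(group_a), "".join(group_b)
    if len(bits_a) != len(bits_b):
        raise ValueError("code groups must have the same total length")
    return sum(x != y for x, y in zip(bits_a, bits_b))


def select_targets(candidates, second_group, max_distance=3):
    """Keep the candidate groups within max_distance of the second code group.

    candidates: dict mapping a hypothetical file id to its candidate group.
    """
    return [
        file_id
        for file_id, group in candidates.items()
        if hamming_distance(group, second_group) <= max_distance
    ]
```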
Further, the system for screening and identifying open source components in source code in this embodiment further includes an association table generating module, and the searching module includes a table look-up module, a statistical calculation module, and an extraction module.
And the association table generation module is used for generating an association table in the basic source code library, and the association table records the open source file name associated with each first digital group, the open source component name corresponding to the open source file, the version number of the open source component and the storage position of the open source component.
And the table look-up module is used for looking up all open source files, open source components and storage positions corresponding to the target digital group according to the association table.
And the statistical calculation module is used for counting the number of the source code files corresponding to each found open source component and calculating the similarity between each open source component and the source code component to be tested according to the counted number.
And the extraction module is used for extracting the first N open source components as objects for further comparison analysis according to the sequence of the similarity from high to low obtained by the statistical calculation module.
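The association table and the table look-up module can be sketched as follows. The record fields follow the four items the association table is said to record; the class name `AssociationRecord`, the tuple-of-segments key, and the in-memory dictionary are assumptions, since the patent does not describe the table layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationRecord:
    file_name: str   # open source file name
    component: str   # open source component name
    version: str     # version number of the open source component
    location: str    # storage position


def look_up(association_table, target_groups):
    """Collect the records associated with each target code group.

    association_table: dict mapping a tuple of four segments to a list of
    records, because the same code group may occur in several versions.
    """
    records = []
    for group in target_groups:
        records.extend(association_table.get(tuple(group), []))
    return records
```

The component names in the returned records are what the statistical calculation module would count for each source code file.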
Furthermore, the system for screening and identifying open source components in source code also comprises a file name processing module and a directory establishing and storing module.
And the file name processing module is used for performing hash calculation on the whole of any open source file by adopting a minhash algorithm to obtain a characteristic hash code corresponding to the open source file, and naming the open source file by the characteristic hash code.
And the directory establishing and storing module is used for segmenting the characteristic hash code to obtain a hash code segment, establishing a multi-level storage directory according to the hash code segment, and storing the open source files with the same hash code segment in the storage directory of the same level.
The invention also discloses another system for screening and identifying open source components in source code, which comprises one or more processors, a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the programs comprise instructions for executing the above method for screening and identifying open source components in source code. The processor may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the functions to be performed by the modules of the system for screening and identifying open source components in source code in the embodiment of the present application, or to execute the method for screening and identifying open source components in source code in the embodiment of the present application.
The invention also discloses a computer-readable storage medium, which comprises a computer program, wherein the computer program can be executed by a processor to carry out the above method for screening and identifying open source components in source code. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that includes one or more such available media. The available medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic medium (such as a floppy disk, a hard disk, a magnetic tape, or a magnetic disk), an optical medium (such as a digital versatile disc (DVD)), or a semiconductor medium (such as a solid state disk (SSD)).
The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the above method for screening and identifying open source components in source code.
The above discloses only preferred embodiments of the present invention, and it should be understood that the present invention is not limited thereto.

Claims (10)

1. A simhash-based method for screening and identifying open source components in source code, characterized by comprising the following steps:
acquiring a plurality of open source components to construct a basic source code library, wherein each open source component comprises a plurality of open source files;
respectively analyzing the source codes in each open source file in the basic source code library by adopting a simhash algorithm to obtain a plurality of first digital strings respectively related to the corresponding open source files, dividing the first digital strings into four segments in sequence to obtain a plurality of first digital groups respectively related to the corresponding open source files and comprising four character strings, and storing the first digital groups in a database to obtain a data matching table;
analyzing each source code file in the source code assembly to be detected by adopting the same analyzing and processing mode as the open source file to obtain a plurality of second numerical code groups which are respectively related to the corresponding source code files and comprise four character strings;
respectively and integrally matching the character strings in the second numerical group with the character strings in each first numerical group in the data matching table;
judging whether any first numerical code group has the same character string as that in the current second numerical code group, if so, defining the first numerical code group as a numerical code group to be selected;
and finding out a plurality of open source components related to the source code component to be detected in the basic source code library according to the number group to be selected.
2. The simhash-based method for screening and identifying open source components in source code according to claim 1, further comprising a method for further screening the to-be-selected number group:
respectively judging whether other three character strings in any one to-be-selected number group exist in the current second number group, and calculating the Hamming distance between the to-be-selected number group and the second number group according to the judgment result;
judging whether the Hamming distance between the number group to be selected and the second number group is less than or equal to 3, if so, defining the number group to be selected as a target number group;
and finding out a plurality of open source components related to the source code component to be detected in the basic source code library according to the target number group.
3. The simhash-based method for screening and identifying open source components in source codes according to claim 2, wherein the method for finding out a plurality of open source components related to the source code component to be tested in the basic source code library according to the target number group comprises:
generating an association table in the basic source code library, wherein the association table records an open source file name associated with each first digital code group, an open source component name corresponding to the open source file, a version number of the open source component and a storage position of the open source component;
searching all open source files, open source components and storage positions corresponding to the target number groups according to the association table;
counting the number of the source code files corresponding to each found open source component, and calculating the similarity between each open source component and the source code component to be tested according to the counted number;
and taking the first N open source components as objects for further comparison and analysis according to the sequence of the similarity from high to low.
4. The simhash-based method for screening and identifying the open source components in the source code according to claim 1, further comprising a method for storing the open source file in the basic source code library, wherein the method comprises the following steps:
adopting a minhash algorithm to perform hash calculation on the whole of any open source file to obtain a characteristic hash code corresponding to the open source file, and naming the open source file by the characteristic hash code;
and segmenting the characteristic hash code to obtain a hash code segment, establishing a multi-level storage directory according to the hash code segment, and storing the open source files with the same hash code segment in the storage directory of the same level.
5. A simhash-based system for screening and identifying open source components in source code, characterized by comprising a source code library building module, a file analyzing module, a data table building module, a matching module, a first judging module, a first screening module and a searching module;
the source code library construction module is used for constructing a basic source code library through the acquired plurality of open source components, and each open source component comprises a plurality of open source files;
the file analysis module is used for analyzing a code file comprising an open source file and a source code file in a source code assembly to be detected by adopting a simhash algorithm so as to obtain a plurality of hash digital strings respectively related to corresponding codes, and dividing the hash digital strings into four sections in sequence so as to obtain a plurality of digital groups respectively related to the corresponding code file and comprising four character strings, wherein the digital groups comprise a first digital group corresponding to the open source file and a second digital group corresponding to the source code file;
the data table establishing module is used for storing a plurality of first digital groups in a database to obtain a data matching table;
the matching module is used for respectively and integrally matching character strings in the second number group corresponding to any source code file in the current source code assembly to be tested with character strings in each first number group in the data matching table;
the first judging module is used for judging whether any one first numerical group has the same character string as the current second numerical group or not according to the matching result of the matching module;
the first screening module is used for defining each first digital code group for which the judgment result of the first judging module is yes as a to-be-selected digital code group;
and the searching module is used for finding out a plurality of open source components related to the source code component to be detected in the basic source code library according to the number code group to be selected.
6. The simhash-based system for screening and identifying open source components in source code according to claim 5, further comprising a second determining module, a calculating module, and a second screening module;
the second judging module is configured to respectively judge whether other three character strings in any one of the to-be-selected number groups exist in the current second number group;
the calculation module is used for calculating the Hamming distance between the number group to be selected and the second number group according to the judgment result of the second judgment module;
the second screening module is used for defining each number group to be selected whose Hamming distance from the second number group is less than or equal to 3 as a target number group;
and the searching module finds out a plurality of open source components related to the source code component to be detected in the basic source code library according to the target number code group.
7. The simhash-based system for screening and identifying open source components in source code according to claim 6, further comprising an association table generating module, wherein the searching module comprises a table look-up module, a statistical calculation module and an extraction module;
the association table generating module is configured to generate an association table in the basic source code library, where the association table records an open source file name associated with each first code group, an open source component name corresponding to the open source file, a version number of the open source component, and a storage location of the open source component;
the table look-up module is used for looking up all open source files, open source components and storage positions corresponding to the target number group according to the association table;
the statistical calculation module is used for counting the number of the source code files corresponding to each found open source component and calculating the similarity between each open source component and the source code component to be detected according to the counted number;
and the extraction module is used for extracting the first N open source components as objects of further comparison analysis according to the sequence of the similarity from high to low obtained by the statistical calculation module.
8. The simhash-based system for screening and identifying open source components in source code according to claim 5, further comprising a file name processing module and a directory establishing and storing module;
the file name processing module is used for performing hash calculation on the whole open source file by adopting a minhash algorithm to obtain a characteristic hash code corresponding to the open source file, and naming the open source file by the characteristic hash code;
the directory establishing and storing module is used for segmenting the characteristic hash code to obtain a hash code segment, establishing a multi-level storage directory according to the hash code segment, and storing the open source files with the same hash code segment in the storage directory of the same level.
9. A simhash-based system for screening and identifying open source components in source code, characterized by comprising:
one or more processors;
a memory;
and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs including instructions for performing the simhash-based method for screening and identifying open source components in source code according to any one of claims 1 to 4.
10. A computer-readable storage medium comprising a computer program executable by a processor to perform the simhash-based method for screening and identifying open source components in source code according to any one of claims 1 to 4.
CN202210337119.2A 2022-03-31 2022-03-31 Simhash-based open source component screening and identifying method and system in source code Pending CN114816518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210337119.2A CN114816518A (en) 2022-03-31 2022-03-31 Simhash-based open source component screening and identifying method and system in source code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210337119.2A CN114816518A (en) 2022-03-31 2022-03-31 Simhash-based open source component screening and identifying method and system in source code

Publications (1)

Publication Number Publication Date
CN114816518A true CN114816518A (en) 2022-07-29

Family

ID=82532600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210337119.2A Pending CN114816518A (en) 2022-03-31 2022-03-31 Simhash-based open source component screening and identifying method and system in source code

Country Status (1)

Country Link
CN (1) CN114816518A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117850756A (en) * 2023-11-17 2024-04-09 深圳微米信息服务有限公司 Management system and method for WEB front-end component


Similar Documents

Publication Publication Date Title
Li et al. Fast and accurate long-read alignment with Burrows–Wheeler transform
CN112579155B (en) Code similarity detection method and device and storage medium
CN109460386B (en) Malicious file homology analysis method and device based on multi-dimensional fuzzy hash matching
CN111723371B (en) Method for constructing malicious file detection model and detecting malicious file
JP2019512127A (en) String distance calculation method and apparatus
CN109933502B (en) Electronic device, user operation record processing method and storage medium
CN112364014B (en) Data query method, device, server and storage medium
CN111930610B (en) Software homology detection method, device, equipment and storage medium
CN115658080A (en) Method and system for identifying open source code components of software
CN114816518A (en) Simhash-based open source component screening and identifying method and system in source code
CN111177719A (en) Address category determination method, device, computer-readable storage medium and equipment
CN108804917B (en) File detection method and device, electronic equipment and storage medium
CN112698861A (en) Source code clone identification method and system
US11487876B1 (en) Robust whitelisting of legitimate files using similarity score and suspiciousness score
WO2007132564A1 (en) Data processing device and method
CN112632528A (en) Threat information generation method, equipment, storage medium and device
KR102068605B1 (en) Method for classifying malicious code by using sequence of functions' execution and device using the same
CN114995880B (en) Binary code similarity comparison method based on SimHash
CN114884686B (en) PHP threat identification method and device
US20190130034A1 (en) Fingerprint clustering for content-based audio recognition
CN114756586A (en) Code matching analysis method and device, electronic equipment and storage medium
US20220092086A1 (en) Order Independent Data Categorization, Indication, and Remediation Across Realtime Datasets of Live Service Environments
Esmat et al. A parallel hash‐based method for local sequence alignment
Chen et al. CGAP-align: a high performance DNA short read alignment tool
CN112100670A (en) Big data based privacy data grading protection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination