CN114995880A - Binary code similarity comparison method based on SimHash - Google Patents


Info

Publication number
CN114995880A
Authority
CN
China
Prior art keywords
simhash
function
value
code
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210566698.8A
Other languages
Chinese (zh)
Other versions
CN114995880B (en)
Inventor
贾张涛
陶金龙
孔祥炳
邵飒
张建伟
冯大成
付修锋
安恒
刘玉波
金玉川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications
Priority to CN202210566698.8A
Publication of CN114995880A
Application granted
Publication of CN114995880B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a binary code similarity comparison method based on SimHash, and belongs to the field of code comparison. The method disassembles binary code and preprocesses the resulting assembly code, standardizes the assembly code, calculates the SimHash values of the assembly code, constructs a code feature relational database framework, and quickly locates binary code based on text similarity. The invention has the following advantages: the scheme maintains the accuracy of binary code similarity comparison while also taking comparison efficiency into account; and by adopting a SimHash-based text comparison method, it improves the efficiency of binary code similarity comparison.

Description

Binary code similarity comparison method based on SimHash
Technical Field
The invention belongs to the field of code comparison, and particularly relates to a binary code similarity comparison method based on SimHash.
Background
Code reuse generally takes the function as its unit, and a large number of functions are retained even after heavy compiler optimization, so tracing at function granularity fits reuse scenarios well. Different compilers insert different functions, and identifying these functions requires considerable experience and skill. Reused functions greatly interfere with malicious code analysis and homology determination; at present they are identified mainly through the experience of malicious code analysts, so homology determination is inefficient. Rapid identification of reused functions would greatly improve efficiency and the reliability of homology conclusions.
The basis of reused-function tracing is similar-function determination: if a function has a similar function in some sample, it is a reused function. Most current similar-function determination techniques achieve high accuracy and recall but are slow, which makes them unsuitable for tracing reused functions in massive code bases. Moreover, a small change to a function's source code, or a difference in compilation options, can change the instruction order, registers, jump targets and so on in the disassembled code, so tracing with ordinary hashing and similar methods yields very low recall. Within a function, the jump structure between code blocks is an important similarity feature, but extracting the jump relations and comparing structure graphs takes a lot of time; this is a major reason why current similarity determination struggles to achieve accuracy, recall and speed together.
This scheme provides a fast method for tracing reused functions based on SimHash and function features. The core idea is to find similar code blocks using SimHash and, from them, to find similar functions.
SimHash is a locality-sensitive hashing (LSH) algorithm, first proposed by Charikar et al. in 2002. In 2007, Manku et al. [18] of Google applied it to near-duplicate detection over massive collections of web pages: a 64-bit SimHash value is calculated for each web page, and web pages whose SimHash values lie within Hamming distance 3 are considered similar. Manku et al. also proposed, based on the pigeonhole principle, a fast retrieval method for SimHash values within a given Hamming distance that performs well in both time and space. SimHash is currently applied in several areas, notably source code clone detection.
Disclosure of Invention
(I) technical problem to be solved
The technical problem to be solved by the invention is how to provide a SimHash-based binary code similarity comparison method, so as to solve the problem that accuracy, recall and speed of similarity determination are difficult to achieve simultaneously in binary code reused-function tracing and defect scanning.
(II) technical scheme
In order to solve the technical problem, the invention provides a binary code similarity comparison method based on SimHash, which comprises the following steps:
S1, disassembling binary code and preprocessing assembly code
By analyzing the instruction sets of different architectures, multi-platform binary code is disassembled and disassembly files of the binary code under the different architectures are generated; the assembly file is split into a plurality of functions according to the special identifiers in the assembly file; each function is split into a plurality of code blocks according to the jump instructions in the function;
S2, assembly code standardization: the instructions in each code block are standardized according to rules;
S3, calculating the SimHash value of assembly code
A SimHash value is calculated for each basic block, each function and each file;
S4, constructing a code feature relational database framework
A file information table, a function information table and a basic block information table are established, together with the corresponding relation table; one binary file contains at least one function and one function contains a plurality of basic blocks, so one file record corresponds to at least one function record and one function record corresponds to a plurality of basic block records;
S5, fast locating of binary code based on text similarity
The function to be compared is denoted ObjFunc; the basic blocks of ObjFunc are compared with the basic blocks of each function, and the Hamming distance between their SimHash values is calculated; basic blocks whose Hamming distance is less than 3 are considered similar, and the function containing the highest proportion of similar basic blocks is recorded as the comparison result.
(III) advantageous effects
The invention provides a binary code similarity comparison method based on SimHash, which has the following advantages:
the scheme provided by the invention maintains the accuracy of binary code similarity comparison while also taking comparison efficiency into account;
the invention adopts a text comparison method based on SimHash, and can improve the efficiency of binary code similarity comparison.
Detailed Description
In order to make the objects, contents and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be given in conjunction with examples.
The invention provides a binary code similarity comparison method based on SimHash and code features. It first uses the SimHash algorithm for comparison and retrieval to narrow the range of binary code to be compared, and then uses binary code features for accurate similarity comparison, thereby achieving fast and accurate comparison of binary code, supporting binary code tracing and defect scanning analysis, and meeting the requirements of binary code similarity comparison in different scenarios.
In order to solve the problem that accuracy, recall and speed of similarity determination are difficult to achieve simultaneously in binary code reused-function tracing and defect scanning, the invention provides a SimHash-based binary code similarity comparison scheme: similar code blocks are found based on SimHash, similar functions are then found through comparison and analysis, and both the accuracy and the efficiency of binary code similarity comparison are improved. The main contents of the scheme comprise:
S1, disassembling binary code and preprocessing assembly code
By analyzing the instruction sets of different architectures, binary code for platforms such as Arm, PowerPC and X86 is disassembled, and disassembly files of the binary code under the different architectures are generated; the assembly file is split into a plurality of functions according to the special identifiers in the assembly file; and each function is split into a plurality of code blocks according to jump instructions such as jnz and jmp in the function.
S2, assembly code standardization: the instructions in each code block are standardized according to rules so as to ignore differences caused by different registers, memory addresses and the like.
S3, calculating the SimHash value of assembly code
SimHash is a locality-sensitive hashing (LSH) algorithm; a SimHash value is calculated for each basic block, each function and each file.
S4, constructing a code feature relational database framework
A file information table, a function information table and a basic block information table are established, together with the corresponding relation table. One binary file contains at least one function and one function contains a plurality of basic blocks, so one file record corresponds to at least one function record and one function record corresponds to a plurality of basic block records.
S5, fast locating of binary code based on text similarity
SimHash values whose Hamming distance (the number of bit positions at which the two values differ) is within 3 can be considered similar. However, searching a massive list of SimHash values for those within Hamming distance 3 is very costly; to improve efficiency, the invention provides a multi-table indexing method that balances time and space. The method comprises establishing the basic block SimHash tables, querying basic blocks, calculating the SimHash distance of code blocks, calculating the similarity of functions, and screening out the function with the highest similarity.
The function to be compared is denoted ObjFunc; the basic blocks of ObjFunc are compared with the basic blocks of each function, and the Hamming distance between their SimHash values is calculated; basic blocks whose Hamming distance is less than 3 are considered similar, and the function containing the highest proportion of similar basic blocks is recorded as the comparison result.
Example 1:
The invention provides a binary code similarity comparison scheme based on SimHash and function features: similar code blocks are first found based on SimHash to narrow the range of candidate similar functions, and similar functions are then found by an accurate comparison method based on binary code features, which improves both the accuracy and the efficiency of binary code similarity comparison.
S1, disassembling binary codes and preprocessing assembly codes
S11, by analyzing the instruction sets of different architectures, binary code for platforms such as Arm, PowerPC and X86 is disassembled, and a disassembly file of the binary code under each architecture is generated; a disassembly file is denoted ASM;
S12, the assembly file is split into a plurality of functions according to the assembly file identifiers; a function is denoted Func, and one assembly file is represented as a set of functions ASM = {Func_1, Func_2, ..., Func_n};
S13, each function is split into a plurality of basic blocks according to jump instructions such as jnz and jmp in the function; a basic block is denoted BB, and each function is represented as a set of code blocks Func = {BB_1, BB_2, ..., BB_m}.
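The splitting in S13 can be sketched as follows. This is a minimal Python illustration rather than the patented implementation: it assumes one instruction per line, and the JUMP_MNEMONICS set and the function name split_basic_blocks are hypothetical (the text names only jnz and jmp). Starting a new block at a label line is an additional assumption, since a label can be a jump target.

```python
# Hypothetical, incomplete set of x86 jump mnemonics; the source names only jnz and jmp.
JUMP_MNEMONICS = {"jmp", "jnz", "jz", "je", "jne"}


def split_basic_blocks(instructions):
    """Split one function's instruction list into basic blocks.

    A block ends after a jump instruction. As an added assumption, a label
    line (ending with ':') starts a new block because it may be a jump target.
    """
    blocks, current = [], []
    for line in instructions:
        text = line.strip()
        if not text:
            continue
        if text.endswith(":") and current:
            # Close the running block before the label.
            blocks.append(current)
            current = []
        current.append(text)
        mnemonic = text.split()[0].lower()
        if mnemonic in JUMP_MNEMONICS:
            # A jump terminates the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


func = [
    "push ebp",
    "mov ebp, esp",
    "cmp eax, 0",
    "jnz short loc_401020",
    "mov eax, 1",
    "jmp short loc_401030",
    "loc_401020:",
    "mov eax, 2",
    "loc_401030:",
    "pop ebp",
    "retn",
]
basic_blocks = split_basic_blocks(func)
```

In this sample the function falls into four blocks: two that end in a jump and two that begin at a label.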
S2, assembly code standardization: the instructions in each code block are standardized so as to ignore differences caused by different registers, memory addresses and the like. The code block standardization rules are as follows:
(1) memory operands such as [eax] and [edi+8] are all expressed as Memory;
(2) immediate values such as 0 and 384Dh are expressed as Value;
(3) registers such as eax, ax and al are standardized to reg_32, reg_16 and reg_8 respectively, according to the number of bits they occupy;
(4) a call instruction that calls an external system library function is left unprocessed, while a call to an internal function, such as "call sub_134B4", is normalized to "call sub_xxx";
(5) jump instructions such as "jz short loc_134B4" are normalized to "jump loc_xxx".
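The five rules can be illustrated with a short Python sketch. It is a simplified reading under stated assumptions: the register tables are hypothetical and incomplete, operands are assumed to be comma-separated Intel-syntax text, every mnemonic beginning with "j" is treated as a jump, and the 8-bit register class is written as reg_8 on the assumption that this is what rule (3) intends. The function names are illustrative only.

```python
import re

# Hypothetical, incomplete x86 register tables keyed by width in bits.
REGISTERS = {
    32: {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"},
    16: {"ax", "bx", "cx", "dx", "si", "di", "bp", "sp"},
    8: {"al", "bl", "cl", "dl", "ah", "bh", "ch", "dh"},
}
REG_CLASS = {name: f"reg_{bits}" for bits, names in REGISTERS.items() for name in names}

MEMORY_RE = re.compile(r"\[[^\]]*\]")                           # [eax], [edi+8]
IMMEDIATE_RE = re.compile(r"^(?:[0-9][0-9A-Fa-f]*h|[0-9]+)$")   # 384Dh, 0


def normalize_operand(operand):
    operand = operand.strip()
    if MEMORY_RE.search(operand):
        return "Memory"                       # rule (1)
    if operand.lower() in REG_CLASS:
        return REG_CLASS[operand.lower()]     # rule (3)
    if IMMEDIATE_RE.match(operand):
        return "Value"                        # rule (2)
    return operand


def normalize_instruction(line):
    parts = line.strip().split(None, 1)
    if not parts:
        return ""
    mnemonic = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    if mnemonic == "call":
        # Rule (4): internal calls are normalized, external calls are left as-is.
        return "call sub_xxx" if rest.startswith("sub_") else line.strip()
    if mnemonic.startswith("j"):
        return "jump loc_xxx"                 # rule (5)
    if not rest:
        return mnemonic
    operands = [normalize_operand(op) for op in rest.split(",")]
    return mnemonic + " " + ", ".join(operands)
```

For example, under these assumptions "mov eax, [edi+8]" becomes "mov reg_32, Memory" and "jz short loc_134B4" becomes "jump loc_xxx", so two blocks that differ only in register choice or addresses normalize to the same text.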
S3, general SimHash value calculation method
SimHash is a locality-sensitive hashing (LSH) algorithm and is used here to calculate the SimHash values of basic blocks, functions and files. It comprises the following steps:
S31, create a 64-bit variable SimH and initialize it to 0.
S32, tokenize the assembly code. Two approaches are in general use, character n-grams and word n-grams; the word n-gram approach is adopted here.
S33, assign a weight to each token (an assembly language identifier), usually based on frequency, i.e. the number of times the token occurs.
S34, hash each token to obtain a 64-bit hash value: typically the MD5 or SHA1 hash algorithm is used and 64 bits of the digest are taken, so that each token corresponds to one 64-bit hash value.
S35, combine the token hash values with their weights: for each bit of a token's hash value, if the bit is 1, add the token's weight to the accumulated value at the corresponding position; otherwise subtract the token's weight.
S36, dimension reduction: for each position of the accumulated values, if the value is greater than 0 the corresponding bit is set to 1, otherwise it is set to 0, which yields a 64-bit SimHash value.
The SimHash values of functions and files are calculated with the same method.
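Steps S31 to S36 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the choice of n = 2, and the use of the first 8 bytes of the MD5 digest are assumptions (the text says only that 64 bits of an MD5 or SHA1 digest are taken), and the accumulator is kept as a list of 64 signed counters so that weights can be added and subtracted per position.

```python
import hashlib
from collections import Counter

BITS = 64


def word_ngrams(tokens, n=2):
    """S32: word n-grams over a token list (n = 2 is an assumed default)."""
    if len(tokens) < n:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hash64(text):
    """S34: take the first 64 bits of the MD5 digest (which 64 bits is an assumption)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens, n=2):
    weights = Counter(word_ngrams(tokens, n))   # S33: weight = occurrence count
    acc = [0] * BITS                            # S31: accumulator initialized to 0
    for gram, weight in weights.items():
        h = hash64(gram)
        for i in range(BITS):                   # S35: add or subtract the weight per bit
            if (h >> i) & 1:
                acc[i] += weight
            else:
                acc[i] -= weight
    value = 0
    for i in range(BITS):                       # S36: keep only the sign of each position
        if acc[i] > 0:
            value |= 1 << i
    return value


tokens = "mov reg_32 Memory add reg_32 Value jump loc_xxx".split()
block_simh = simhash(tokens)
```

The same function can be applied to the token stream of a whole function or file to obtain FSimH or FileSimH; only the input changes.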
S4, feature extraction of binary code
In order to ensure the accuracy of binary code similarity comparison, features of the binary code are extracted, including the SimHash value of each basic block and the SimHash value of each function.
(1) Extracting the basic block SimHash value: a SimHash value, denoted SimH, is calculated for each basic block. Each function is finally represented as a set of SimHash values, Func_0 → {BB_1, BB_2, ..., BB_m} → {SimH_1, SimH_2, ..., SimH_m}, where BB_m is the disassembly of the m-th basic block and SimH_m is the SimHash value of the m-th basic block.
(2) Extracting the function SimHash value: using the method of S3, a SimHash value, denoted FSimH, is calculated for each function. Each assembly file is represented as a set of SimHash values, ASM = {Func_1, Func_2, ..., Func_n} → {FSimH_1, FSimH_2, ..., FSimH_n}, where Func_n is the disassembly of the n-th function and FSimH_n is the SimHash value of the n-th function.
(3) Extracting the file SimHash value: using the method of S3, the SimHash value of the file is calculated and denoted FileSimH.
S5, constructing a code feature relational library framework
A file information table, a function information table and a basic block information table are established, together with the corresponding relation table. One binary file contains at least one function and one function contains a plurality of basic blocks, so one file record corresponds to at least one function record and one function record corresponds to a plurality of basic block records.
(1) Establish the basic block information table basicblock_table of the database: each SimHash value is divided equally into 8 sub-blocks and 8 tables (sub_tab1 to sub_tab8) are created to store the contents of the SimHash values, with sub-blocks at different positions stored in different tables;
(2) establish the function information table func_table, which stores the binary code features of each function, including its SimHash value;
(3) establish the file information table file_table, which stores the binary file name and the FileSimH information.
S6, fast locating of binary code based on text similarity
SimHash values whose Hamming distance (the number of bit positions at which the two values differ) is within 3 can be considered similar. However, searching a massive list of SimHash values for those within Hamming distance 3 is very costly; to improve efficiency, the invention provides a multi-table indexing method that balances time and space.
(1) Establishing a basic block SimHash table
To improve retrieval efficiency while keeping space overhead in check, each SimHash value is divided equally into 8 blocks and 8 tables sub_tabq (q = 1 to 8) are created for all SimHash values. Different tables store the SimHash blocks at different positions: for example, the first table stores bits 0-7, the second bits 8-15, the third bits 16-23, and so on.
(2) Hamming distance calculation
Hamming distance: an exclusive-OR operation is performed on the SimHash values of two basic blocks, and the number of 1 bits in the result is the Hamming distance;
Checking whether the Hamming distance is less than N (typically N = 3, with N < 8): to make this check efficient, note that if the Hamming distance between two basic blocks is N, then exactly N bits differ; since the SimHash value is divided into 8 sub-blocks, these N bits fall in at least 1 and at most N sub-blocks. When N is 3, at least 5 of the 8 sub-blocks (8 bits each) of the two SimHash values must therefore be identical;
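The XOR-based distance and the sub-block argument can be checked with a few lines of Python. The helper names and sample values are illustrative assumptions; the sketch shows that three differing bits leave between 5 and 7 of the 8 sub-blocks unchanged, depending on how the bits are spread.

```python
def hamming_distance(a, b):
    """Number of 1 bits in the XOR of two SimHash values."""
    return bin(a ^ b).count("1")


def split_chunks(simh, parts=8, width=8):
    """Split a 64-bit value into 8 sub-blocks of 8 bits, lowest bits first."""
    mask = (1 << width) - 1
    return [(simh >> (width * q)) & mask for q in range(parts)]


def matching_chunks(a, b):
    """Count sub-blocks that are identical in both values."""
    return sum(x == y for x, y in zip(split_chunks(a), split_chunks(b)))


a = 0x0123456789ABCDEF
b = a ^ 0b111                             # three flipped bits, all in sub-block 0
c = a ^ ((1 << 0) | (1 << 8) | (1 << 16)) # three flipped bits in three sub-blocks

dist_ab, same_ab = hamming_distance(a, b), matching_chunks(a, b)
dist_ac, same_ac = hamming_distance(a, c), matching_chunks(a, c)
```

Both pairs are at distance 3; the first shares 7 sub-blocks and the second shares 5, which is the worst case the text relies on.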
(3) querying basic blocks
When searching for other SimHash values within Hamming distance 3 of a given SimHash, the SimHash is divided equally into 8 blocks (SimHash_bb1 to SimHash_bb8); each SimHash_bbq (q = 1 to 8) is looked up in the corresponding table sub_tabq to find matching blocks, the SimHash values associated with those blocks are collected, and the SimHash values that share at least 5 identical blocks are screened out.
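The multi-table lookup can be sketched with in-memory dictionaries standing in for sub_tab1 to sub_tab8. This is an illustrative Python sketch under assumptions: the class name SimHashIndex and its methods are hypothetical, and a final exact Hamming check is added after the sub-block vote because sharing 5 sub-blocks is necessary for distance 3 but not sufficient.

```python
from collections import Counter, defaultdict


def chunks(simh):
    """Eight 8-bit sub-blocks of a 64-bit SimHash, lowest bits first."""
    return [(simh >> (8 * q)) & 0xFF for q in range(8)]


class SimHashIndex:
    """Eight lookup tables, one per sub-block position (stand-ins for sub_tabq)."""

    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(8)]

    def add(self, simh):
        for q, chunk in enumerate(chunks(simh)):
            self.tables[q][chunk].add(simh)

    def query(self, simh, max_distance=3, min_same=5):
        votes = Counter()
        for q, chunk in enumerate(chunks(simh)):
            for candidate in self.tables[q].get(chunk, ()):
                votes[candidate] += 1          # one vote per identical sub-block
        return sorted(
            cand for cand, same in votes.items()
            if same >= min_same
            and bin(cand ^ simh).count("1") <= max_distance  # exact confirmation
        )


base = 0x0123456789ABCDEF
near = base ^ ((1 << 0) | (1 << 9) | (1 << 18))  # distance 3, 5 sub-blocks shared
far = base ^ 0xFFFF                              # distance 16, 6 sub-blocks shared

index = SimHashIndex()
for value in (base, near, far):
    index.add(value)
hits = index.query(base)
```

Here far passes the sub-block vote but is rejected by the exact distance check, so only base and near are returned.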
(4) Similar function comparison
The function to be compared, ObjFunc, is compared with the basic blocks of each function and the Hamming distances between basic blocks are calculated; a function whose proportion of similar basic blocks exceeds a certain threshold (for example, 50%) is marked as a similar function. This yields the set of functions with higher similarity, SimFunc = {SimFunc_1, SimFunc_2, ..., SimFunc_p}, where p is the number of similar functions.
S7, code similarity accurate evaluation based on code feature comparison
(1) Let the function to be compared be ObjFunc. ObjFunc is compared one by one with each function in the set SimFunc = {SimFunc_1, SimFunc_2, ..., SimFunc_p} obtained in S6, and the proportion of identical basic blocks (number of identical basic blocks / total number of basic blocks in ObjFunc) is computed for each; the similarity results are recorded as SimV = {SimV_1, SimV_2, ..., SimV_p};
(2) the similarity results SimV = {SimV_1, SimV_2, ..., SimV_p} are sorted, and the three with the greatest similarity are selected as the similarity comparison result.
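The two-stage selection (threshold filtering of candidate functions, then ranking by basic block proportion and keeping the top three) can be sketched as follows. This is a hedged Python illustration: the function names, the sample library and its SimHash values are invented for the example, a basic block is treated as "similar" when its Hamming distance is within 3, and "identical" is read as exact equality of SimHash values, which is an assumption about the intended meaning.

```python
def hamming_distance(a, b):
    return bin(a ^ b).count("1")


def similar_ratio(obj_blocks, cand_blocks, max_distance=3):
    """Share of ObjFunc blocks that have a similar block in the candidate."""
    if not obj_blocks:
        return 0.0
    hits = sum(
        1 for o in obj_blocks
        if any(hamming_distance(o, c) <= max_distance for c in cand_blocks)
    )
    return hits / len(obj_blocks)


def identical_ratio(obj_blocks, cand_blocks):
    """Share of ObjFunc blocks whose SimHash also occurs in the candidate."""
    if not obj_blocks:
        return 0.0
    cand = set(cand_blocks)
    return sum(1 for o in obj_blocks if o in cand) / len(obj_blocks)


def similar_functions(obj_blocks, library, threshold=0.5):
    """Candidate set SimFunc: similar-block proportion above the threshold."""
    return [name for name, blocks in library.items()
            if similar_ratio(obj_blocks, blocks) > threshold]


def rank_candidates(obj_blocks, library, candidates, top_k=3):
    """Similarity results SimV, sorted in descending order, top three kept."""
    scored = [(name, identical_ratio(obj_blocks, library[name])) for name in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


obj_func = [0x0, 0xFFFF0000, 0xF0F0F0F0F0F0, 0x1234]
library = {  # hypothetical function name -> basic block SimHash values
    "f_a": [0x1, 0xFFFF0000, 0xF0F0F0F0F0F0, 0x1234],
    "f_b": [0x0, 0xFFFF0000, 0xFFFFFFFFFFFFFFFF],
    "f_c": [0x0, 0xFFFF0000, 0xF0F0F0F0F0F0, 0x1234, 0x9999],
    "f_d": [0x0, 0xFFFF0001, 0x1235],
}
sim_func = similar_functions(obj_func, library)
sim_v = rank_candidates(obj_func, library, sim_func)
```

In this sample f_b reaches exactly 50% similar blocks and is therefore dropped by the strict threshold, while the remaining three functions are ranked by their proportion of identical blocks.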
The invention has the following advantages:
the scheme provided by the invention maintains the accuracy of binary code similarity comparison while also taking comparison efficiency into account;
the invention adopts a text comparison method based on SimHash, and can improve the efficiency of binary code similarity comparison.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A binary code similarity comparison method based on SimHash is characterized by comprising the following steps:
S1, disassembling binary code and preprocessing assembly code
By analyzing the instruction sets of different architectures, multi-platform binary code is disassembled and disassembly files of the binary code under the different architectures are generated; the assembly file is split into a plurality of functions according to the special identifiers in the assembly file; each function is split into a plurality of code blocks according to the jump instructions in the function;
S2, assembly code standardization: the instructions in each code block are standardized according to rules;
S3, calculating the SimHash value of assembly code
A SimHash value is calculated for each basic block, each function and each file;
S4, constructing a code feature relational database framework
A file information table, a function information table and a basic block information table are established, together with the corresponding relation table; one binary file contains at least one function and one function contains a plurality of basic blocks, so one file record corresponds to at least one function record and one function record corresponds to a plurality of basic block records;
S5, fast locating of binary code based on text similarity
The function to be compared is denoted ObjFunc; the basic blocks of ObjFunc are compared with the basic blocks of each function, and the Hamming distance between their SimHash values is calculated; basic blocks whose Hamming distance is less than 3 are considered similar, and the function containing the highest proportion of similar basic blocks is recorded as the comparison result.
2. The SimHash-based binary code similarity comparison method of claim 1, wherein the step S1 specifically comprises:
S11, by analyzing the instruction sets of different architectures, multi-platform binary code is disassembled and a disassembly file of the binary code under each architecture is generated, a disassembly file being denoted ASM;
S12, the assembly file is split into a plurality of functions according to the assembly file identifiers, a function being denoted Func and one assembly file being represented as a set of functions ASM = {Func_1, Func_2, ..., Func_n};
S13, each function is split into a plurality of basic blocks according to the jump instructions in the function, a basic block being denoted BB and each function being represented as a set of code blocks Func = {BB_1, BB_2, ..., BB_m}.
3. The SimHash-based binary code similarity comparison method of claim 2, wherein the multiple platforms include Arm, PowerPC and X86.
4. The SimHash-based binary code similarity comparison method of claim 2, wherein the jump instruction comprises jnz and jmp.
5. The SimHash-based binary code similarity comparison method according to any of claims 1-4, wherein the normalization processing rule of step S2 includes:
memory operands are expressed as Memory;
immediate values are expressed as Value;
registers are standardized to reg_32, reg_16 and reg_8 respectively, according to the number of bits they occupy;
a call instruction that calls an external system library function is left unprocessed, and a call to an internal function is normalized to "call sub_xxx";
jump instructions are normalized to "jump loc_xxx".
6. The SimHash-based binary code similarity comparison method as claimed in claim 5, wherein the step S3 of calculating the SimHash values corresponding to the basic blocks, functions and files comprises:
S31, creating a 64-bit variable SimH and initializing it to 0;
S32, tokenizing the assembly code using character n-grams or word n-grams;
S33, assigning a weight to each token based on the number of times the token occurs;
S34, hashing each token to obtain a 64-bit hash value, each token corresponding to one 64-bit hash value;
S35, combining the token hash values with their weights: for each bit of a token's hash value, if the bit is 1, adding the token's weight to the accumulated value at the corresponding position, and otherwise subtracting the token's weight;
S36, dimension reduction: for each position of the accumulated values, if the value is greater than 0 the corresponding bit is set to 1, otherwise it is set to 0, which yields a 64-bit SimHash value.
7. The SimHash-based binary code similarity comparison method according to claim 6, wherein the 64-bit hash value is obtained by applying the MD5 or SHA1 hash algorithm and then taking 64 bits of the result.
8. The SimHash-based binary code similarity comparison method of claim 6, wherein the step S3 further comprises:
extracting the basic block SimHash value: a SimHash value, denoted SimH, is calculated for each basic block; each function is finally represented as a set of SimHash values, Func_0 → {BB_1, BB_2, ..., BB_m} → {SimH_1, SimH_2, ..., SimH_m}, where BB_m is the disassembly of the m-th basic block and SimH_m is the SimHash value of the m-th basic block;
extracting the function SimHash value: a SimHash value, denoted FSimH, is calculated for each function, and each assembly file is represented as a set of SimHash values, ASM = {Func_1, Func_2, ..., Func_n} → {FSimH_1, FSimH_2, ..., FSimH_n}, where Func_n is the disassembly of the n-th function and FSimH_n is the SimHash value of the n-th function;
extracting the file SimHash value: the SimHash value of the file is calculated and denoted FileSimH.
9. The SimHash-based binary code similarity comparison method of claim 8, wherein the step S4 specifically comprises:
establishing the basic block information table basicblock_table of the database, dividing each SimHash value equally into 8 sub-blocks, and creating 8 tables sub_tab1 to sub_tab8 to store the contents of the SimHash values, with sub-blocks at different positions stored in different tables;
establishing the function information table func_table, which stores the binary code features of each function, including its SimHash value;
and establishing the file information table file_table, which stores the binary file name and the FileSimH information.
10. The SimHash-based binary code similarity comparison method of claim 9, wherein the step S6 specifically comprises:
establishing the basic block SimHash tables: each SimHash value is divided into 8 blocks, and 8 tables sub_tabq (q = 1 to 8) are created for all SimHash values, different tables storing the SimHash blocks at different positions;
Hamming distance calculation: if the Hamming distance between two basic blocks is N, then N bits differ; since the SimHash value is divided into 8 sub-blocks, these N bits fall in at least 1 and at most N sub-blocks, and when N is 3, at least 5 of the 8 sub-blocks of the two SimHash values are identical;
querying basic blocks: when searching for other SimHash values within Hamming distance 3 of a given SimHash, the SimHash is divided into 8 blocks SimHash_bb1 to SimHash_bb8, each SimHash_bbq (q = 1 to 8) is looked up in the corresponding table sub_tabq to find matching blocks, the SimHash values associated with those blocks are collected, and the SimHash values that share at least 5 identical blocks are screened out;
similar function comparison: the function to be compared, ObjFunc, is compared with the basic blocks of each function and the Hamming distances between basic blocks are calculated; a function whose proportion of similar basic blocks exceeds a certain threshold is marked as a similar function, yielding the set of functions with higher similarity, SimFunc = {SimFunc_1, SimFunc_2, ..., SimFunc_p}, where p is the number of similar functions;
accurate code similarity evaluation based on code feature comparison: the function to be compared, ObjFunc, is compared one by one with each function in SimFunc = {SimFunc_1, SimFunc_2, ..., SimFunc_p}, the proportion of identical basic blocks is computed for each, and the similarity results are recorded as SimV = {SimV_1, SimV_2, ..., SimV_p}; the similarity results SimV = {SimV_1, SimV_2, ..., SimV_p} are sorted, and the three with the greatest similarity are selected as the similarity comparison result.
CN202210566698.8A 2022-05-23 2022-05-23 Binary code similarity comparison method based on SimHash Active CN114995880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210566698.8A CN114995880B (en) 2022-05-23 2022-05-23 Binary code similarity comparison method based on SimHash

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210566698.8A CN114995880B (en) 2022-05-23 2022-05-23 Binary code similarity comparison method based on SimHash

Publications (2)

Publication Number Publication Date
CN114995880A (en) 2022-09-02
CN114995880B (en) 2024-04-05

Family

ID=83027811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210566698.8A Active CN114995880B (en) 2022-05-23 2022-05-23 Binary code similarity comparison method based on SimHash

Country Status (1)

Country Link
CN (1) CN114995880B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140142806A (en) * 2013-06-04 2014-12-15 한양대학교 산학협력단 Malware analysis and variants detection methods using visualization of binary information, apparatus for processing the same method
CN106126235A (en) * 2016-06-24 2016-11-16 中国科学院信息工程研究所 A kind of multiplexing code library construction method, the quick source tracing method of multiplexing code and system
CN110569629A (en) * 2019-09-10 2019-12-13 北京计算机技术及应用研究所 Binary code file tracing method
CN113703773A (en) * 2021-08-26 2021-11-26 北京计算机技术及应用研究所 NLP-based binary code similarity comparison method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
乔延臣; 云晓春; 庹宇鹏; 张永铮: "Fast tracing method for reused code based on simhash and inverted index", Journal on Communications (通信学报), no. 11, 25 November 2016 (2016-11-25) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591119A (en) * 2023-11-01 2024-02-23 国家计算机网络与信息安全管理中心 Mass APK source code feature extraction and similarity analysis method
CN117591119B (en) * 2023-11-01 2024-05-31 国家计算机网络与信息安全管理中心 Mass APK source code feature extraction and similarity analysis method

Also Published As

Publication number Publication date
CN114995880B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN111324784B (en) Character string processing method and device
CN111324750B (en) Large-scale text similarity calculation and text duplicate checking method
US6173252B1 (en) Apparatus and methods for Chinese error check by means of dynamic programming and weighted classes
EP3292481B1 (en) Method, system and computer program product for performing numeric searches
EP2095277B1 (en) Fuzzy database matching
CN109460386B (en) Malicious file homology analysis method and device based on multi-dimensional fuzzy hash matching
CN111310178B (en) Firmware vulnerability detection method and system in cross-platform scene
US20160147867A1 (en) Information matching apparatus, information matching method, and computer readable storage medium having stored information matching program
CN111723371B (en) Method for constructing malicious file detection model and detecting malicious file
CN110569629A (en) Binary code file tracing method
JPH08255176A (en) Method and system for comparison of table of database
Liu et al. An image-based near-duplicate video retrieval and localization using improved edit distance
CN109858025B (en) Word segmentation method and system for address standardized corpus
CN111930610B (en) Software homology detection method, device, equipment and storage medium
CN108280226B (en) Data processing method and related equipment
CN114995880A (en) Binary code similarity comparison method based on SimHash
CN104933096A (en) Abnormal key recognition method of database, abnormal key recognition device of database and data system
CN114021116B (en) Construction method of homologous analysis knowledge base, homologous analysis method and device
CN115016843A (en) High-precision binary code similarity comparison method
CN114816518A (en) Simhash-based open source component screening and identifying method and system in source code
CN115186138A (en) Comparison method and terminal for power distribution network data
CN108170672A (en) A kind of Chinese organization names real-time analysis method and system
EP2780830A1 (en) Fast database matching
US8498988B2 (en) Fast search
CN115168399B (en) Data processing method, device and equipment based on graphical interface and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant