CN114995880A

CN114995880A - Binary code similarity comparison method based on SimHash

Info

Publication number: CN114995880A
Application number: CN202210566698.8A
Authority: CN
Inventors: 贾张涛; 陶金龙; 孔祥炳; 邵飒; 张建伟; 冯大成; 付修锋; 安恒; 刘玉波; 金玉川
Original assignee: Beijing Institute of Computer Technology and Applications
Current assignee: Beijing Institute of Computer Technology and Applications
Priority date: 2022-05-23
Filing date: 2022-05-23
Publication date: 2022-09-02
Anticipated expiration: 2042-05-23
Also published as: CN114995880B

Abstract

The invention relates to a SimHash-based binary code similarity comparison method and belongs to the field of code comparison. The method disassembles binary code and preprocesses the resulting assembly code, normalizes the assembly code, calculates SimHash values of the assembly code, constructs a code feature relational database framework, and quickly locates binary code based on text similarity. The invention has the following advantages: the proposed scheme ensures the accuracy of binary code similarity comparison while maintaining comparison efficiency; and, by adopting a SimHash-based text comparison method, it improves the efficiency of binary code similarity comparison.

Description

Binary code similarity comparison method based on SimHash

Technical Field

The invention belongs to the field of code comparison, and particularly relates to a binary code similarity comparison method based on SimHash.

Background

Code reuse is generally function-based, and a large number of functions survive even aggressive compiler optimization, so tracing at function granularity fits reuse scenarios well. Different compilers also insert different functions of their own, and identifying these takes considerable experience and skill. Reused functions heavily interfere with malicious code analysis and homology determination. At present, reused functions are identified mainly through the experience of malicious code analysts, so homology determination is inefficient. Rapidly identifying reused functions would greatly improve efficiency and make homology conclusions more reliable.

Tracing reused functions rests on similar-function determination: if a function has a similar function in some sample, it is a reused function. Most current similar-function determination techniques achieve high accuracy and recall but are slow, which makes them unsuitable for tracing reused functions across massive code bases. A small change to a function's source code, or a difference in compilation options and their positions, can alter the instruction order, registers, jump targets and so on in the disassembled code, so tracing with ordinary hashing and similar methods yields very low recall. Within a function, the jump structure between code blocks is an important similarity feature, but extracting jump relations and comparing structure graphs is time-consuming; this is a major reason why current similarity determination struggles to achieve accuracy, recall and speed at once.

This scheme provides a fast reused-function tracing method based on SimHash and function features. The core idea is to find similar code blocks with SimHash and, from them, find similar functions.

SimHash is a locality-sensitive hashing (LSH) algorithm, first proposed by Charikar et al. in 2002. In 2007, Manku et al. [18] of Google applied it to near-duplicate detection over massive numbers of web pages: a 64-bit SimHash value is computed for each page, and pages whose SimHash values lie within a Hamming distance of 3 are considered similar. Manku et al. also proposed, based on the pigeonhole principle, a fast method for retrieving SimHash values within a given Hamming distance that performs well in both time and space. SimHash is now applied in several areas, notably source code clone detection.

Disclosure of Invention

(I) Technical problem to be solved

The technical problem to be solved by the invention is how to provide a SimHash-based binary code similarity comparison method, so as to address the difficulty of simultaneously achieving accuracy, recall and speed in similarity determination for binary reused-function tracing and defect scanning.

(II) Technical scheme

In order to solve the technical problem, the invention provides a binary code similarity comparison method based on SimHash, which comprises the following steps:

S1. Disassembling binary code and preprocessing assembly code

Multi-platform binary code is disassembled by parsing the instruction sets of the different architectures, producing a disassembly file for the binary code on each architecture; the assembly file is split into a plurality of functions according to special identifiers in the assembly file; each function is split into a plurality of code blocks according to the jump instructions in the function;

S2. Normalizing the assembly code: the instructions in each code block are normalized according to rules;

S3. Calculating SimHash values of the assembly code

A SimHash value is calculated for each code basic block, each function and each file;

S4. Constructing a code feature relational database framework

A file information table, a function information table and a basic block information table are established, together with a corresponding relation table; one binary file contains at least one function and one function contains a plurality of basic blocks, so one file record corresponds to at least one function record, and one function record corresponds to a plurality of basic block records;

S5. Quickly locating binary code based on text similarity

The function to be compared is denoted ObjFunc. ObjFunc is compared against the basic blocks of each function by calculating the Hamming distance between the basic blocks' SimHash values; basic blocks whose Hamming distance is less than 3 are considered similar, and the function containing the highest proportion of similar basic blocks is recorded as the comparison result.

(III) Advantageous effects

The invention provides a binary code similarity comparison method based on SimHash, which has the following advantages:

the proposed scheme ensures the accuracy of binary code similarity comparison while maintaining comparison efficiency;

the invention adopts a SimHash-based text comparison method, which improves the efficiency of binary code similarity comparison.

Detailed Description

In order to make the objects, contents and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be given in conjunction with examples.

The invention provides a binary code similarity comparison method based on SimHash and code characteristics, which utilizes a SimHash algorithm to carry out comparison retrieval, reduces the comparison range of binary codes, and then utilizes the binary code characteristics to carry out accurate similarity comparison, thereby realizing the rapid and accurate comparison of the binary codes, providing support for binary code tracing and defect scanning analysis, and meeting the requirements of binary code similarity comparison under different scenes.

To address the difficulty of simultaneously achieving accuracy, recall and speed in similarity determination for binary reused-function tracing and defect scanning, the invention provides a SimHash-based binary code similarity comparison scheme: similar code blocks are found with SimHash, similar functions are then found through comparison and analysis, and both the accuracy and the efficiency of binary code similarity comparison are improved. The main contents of the scheme are:

S1. Disassembling binary code and preprocessing assembly code

By parsing the instruction sets of different architectures, binary code for platforms such as Arm, PowerPC and X86 is disassembled, producing a disassembly file for the binary code on each architecture; the assembly file is split into a plurality of functions according to special identifiers in the assembly file; and each function is split into a plurality of code blocks according to jump instructions such as jnz and jmp.

S2. Normalizing the assembly code: the instructions in each code block are normalized according to rules, so that differences caused by registers, memory addresses and the like are ignored.

S3. Calculating SimHash values of the assembly code

SimHash, a locality-sensitive hashing (LSH) algorithm, is used to calculate a SimHash value for each code basic block, function and file.

S4. Constructing a code feature relational database framework

A file information table, a function information table and a basic block information table are established, together with a corresponding relation table. One binary file contains at least one function and one function contains a plurality of basic blocks, so one file record corresponds to at least one function record, and one function record corresponds to a plurality of basic block records.

S5. Quickly locating binary code based on text similarity

SimHash values within a Hamming distance of 3 (the Hamming distance being the number of bit positions at which two values differ) can be considered similar. However, searching a massive list of SimHash values for those within a Hamming distance of 3 is very expensive. To improve efficiency, the invention provides a multi-table indexing method that balances time and space. The method comprises establishing the basic block SimHash tables, querying basic blocks, calculating the SimHash distance between code blocks, calculating the similarity of functions, and selecting the function with the highest similarity.

Example 1:

The invention provides a binary code similarity comparison scheme based on SimHash and function features. Similar code blocks are first found with SimHash to narrow the set of candidate similar functions, and similar functions are then identified by an accurate comparison of binary code features, improving both the accuracy and the efficiency of binary code similarity comparison.

S1. Disassembling binary code and preprocessing assembly code

S11. By parsing the instruction sets of different architectures, binary code for platforms such as Arm, PowerPC and X86 is disassembled, producing a disassembly file for the binary code on each architecture; a disassembly file is denoted ASM;

S12. The assembly file is split into a plurality of functions according to identifiers in the assembly file. A function is denoted Func, and an assembly file is represented as a set of functions, $\mathrm{ASM} = \{\mathrm{Func}_1, \mathrm{Func}_2, \ldots, \mathrm{Func}_n\}$;

S13. Each function is split into a plurality of basic blocks according to jump instructions such as jnz and jmp. A basic block is denoted BB, and each function is represented as a set of code blocks, $\mathrm{Func} = \{\mathrm{BB}_1, \mathrm{BB}_2, \ldots, \mathrm{BB}_m\}$.
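
As an illustration of the block splitting in S13, the following minimal Python sketch closes a block after every jump instruction. The mnemonic list `JUMP_MNEMONICS` and the function name `split_basic_blocks` are assumptions made for this example; the text names only jnz and jmp, and it does not say whether jump targets also start a new block, so this sketch splits on the jump instructions alone.

```python
# Assumed set of x86 jump mnemonics; the description only names jnz and jmp.
JUMP_MNEMONICS = {"jmp", "jz", "jnz", "je", "jne", "ja", "jb", "jg", "jl"}


def split_basic_blocks(instructions):
    """Split a function's instruction lines into blocks, ending a block at each jump."""
    blocks = []
    current = []
    for ins in instructions:
        current.append(ins)
        parts = ins.split()
        # A jump instruction terminates the current block.
        if parts and parts[0].lower() in JUMP_MNEMONICS:
            blocks.append(current)
            current = []
    if current:
        # Trailing instructions after the last jump form the final block.
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    func = ["mov eax, 1", "jnz loc_1", "add eax, 2", "jmp loc_2", "ret"]
    for i, block in enumerate(split_basic_blocks(func), start=1):
        print(f"BB{i}: {block}")
```

A production disassembler would normally also start a new block at every jump target; that refinement is omitted here because the description does not mention it.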

S2. Normalizing the assembly code: the instructions in each code block are normalized so that differences caused by registers, memory addresses and the like are ignored. The normalization rules for code blocks are as follows:

(1) memory operands such as [eax] and [edi+8] are all expressed as Memory;

(2) immediate values such as 0 and 384Dh are expressed as Value;

(3) registers such as eax, ax and al are normalized to reg_32, reg_16 and reg_8 respectively, according to their width in bits;

(4) a call instruction that calls an external system library function is left unchanged, while a call to an internal function, such as "call sub_134B4", is normalized to "call sub_xxx";

(5) jump instructions such as "jz short loc_134B4" are normalized to "jump loc_xxx".
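
The five rules above can be sketched as a small text normalizer. This is a minimal illustration under stated assumptions, not the patented implementation: the register sets, the jump mnemonic list, the immediate-value regular expression and the function names are all chosen for this example, and the 8-bit register label is written `reg_8` on the assumption that the label names the register width.

```python
import re

# Small illustrative subset of x86 registers, grouped by width (assumption).
REG_32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}
REG_16 = {"ax", "bx", "cx", "dx", "si", "di", "bp", "sp"}
REG_8 = {"al", "ah", "bl", "bh", "cl", "ch", "dl", "dh"}

# Mnemonics treated as jumps (assumed list).
JUMPS = {"jmp", "jz", "jnz", "je", "jne", "ja", "jb", "jg", "jl", "jge", "jle"}

# Immediates: IDA-style hex (384Dh), C-style hex (0x1F) or decimal (0).
IMMEDIATE = re.compile(r"^(?:[0-9][0-9A-Fa-f]*h|0x[0-9A-Fa-f]+|\d+)$")


def normalize_operand(op):
    op = op.strip()
    if "[" in op:                 # rule (1): memory operand
        return "Memory"
    low = op.lower()
    if low in REG_32:             # rule (3): registers by width
        return "reg_32"
    if low in REG_16:
        return "reg_16"
    if low in REG_8:
        return "reg_8"
    if IMMEDIATE.match(op):       # rule (2): immediate value
        return "Value"
    return op                     # anything else is kept as-is


def normalize_instruction(line):
    parts = line.strip().split(None, 1)
    if not parts:
        return ""
    mnemonic = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    if mnemonic == "call":        # rule (4)
        if rest.strip().startswith("sub_"):
            return "call sub_xxx"
        return line.strip()       # external library call: left unchanged
    if mnemonic in JUMPS:         # rule (5)
        return "jump loc_xxx"
    if not rest:
        return mnemonic
    operands = [normalize_operand(o) for o in rest.split(",")]
    return mnemonic + " " + ", ".join(operands)


if __name__ == "__main__":
    for ins in ["mov eax, [edi+8]", "add al, 384Dh", "call sub_134B4",
                "call printf", "jz short loc_134B4"]:
        print(ins, "->", normalize_instruction(ins))
```

Internal calls are recognized here by the `sub_` prefix that common disassemblers give to unnamed functions; a real tool would more likely consult the import table.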

S3. General SimHash value calculation method

SimHash, a locality-sensitive hashing (LSH) algorithm, is used to calculate the SimHash values of basic blocks, functions and files. The steps are as follows:

S31. Create a 64-bit variable SimH and initialize it to 0.

S32. Tokenize the assembly code. Two approaches are common, character n-grams and word n-grams; the word n-gram approach is adopted here.

S33. Assign a weight to each token (assembly language identifier), usually based on frequency, i.e. the number of times the token occurs.

S34. Hash each token to obtain a 64-bit hash value: typically the MD5 or SHA1 hash algorithm is applied and 64 bits of the result are taken, so that each token corresponds to one 64-bit hash value.

S35. Combine the token hash values with their weights: for each bit of a token's hash value, if the bit is 1, the token's weight is added to the weighted value at the corresponding bit position; otherwise the token's weight is subtracted.

S36. Dimension reduction: for each bit position, if the weighted value is greater than 0 the bit is set to 1, otherwise it is set to 0, yielding a 64-bit SimHash value.

The SimHash values of functions and files are calculated with the same method.
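
Steps S31 to S36 can be sketched in Python as follows. The function names, the default n-gram size of 2, whitespace tokenization and the choice of the first 8 bytes of the MD5 digest are assumptions for this example; the description only says that 64 bits of an MD5 or SHA1 hash are taken. The per-bit weighted sums are held in a 64-element list before being reduced to the 64-bit value.

```python
import hashlib
from collections import Counter


def tokenize(code, n=2):
    """S32: word n-grams over whitespace-separated words."""
    words = code.split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hash64(token):
    """S34: 64 bits of the MD5 digest (here the first 8 bytes, an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(code, n=2):
    weights = Counter(tokenize(code, n))   # S33: weight = token frequency
    acc = [0] * 64                         # S31/S35: per-bit weighted sums
    for token, weight in weights.items():
        h = hash64(token)
        for bit in range(64):
            if (h >> bit) & 1:
                acc[bit] += weight         # bit is 1: add the weight
            else:
                acc[bit] -= weight         # bit is 0: subtract the weight
    value = 0
    for bit in range(64):                  # S36: keep only the sign of each sum
        if acc[bit] > 0:
            value |= 1 << bit
    return value


if __name__ == "__main__":
    block = "mov reg_32, Memory\nadd reg_32, Value\njump loc_xxx"
    print(f"{simhash(block):016x}")
```

The same `simhash` function can be applied to the normalized text of a whole function or file to obtain function-level and file-level values.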

S4. Feature extraction of binary code

To ensure accurate binary code similarity comparison, features of the binary code are extracted, including the SimHash values of basic blocks and of functions.

(1) Extracting basic block SimHash values: a SimHash value is calculated for each basic block and is denoted SimH. Each function is then represented as a set of SimHash values, $\mathrm{Func}_0 \to \{\mathrm{BB}_1, \mathrm{BB}_2, \ldots, \mathrm{BB}_m\} \to \{\mathrm{SimH}_1, \mathrm{SimH}_2, \ldots, \mathrm{SimH}_m\}$, where $\mathrm{BB}_m$ is the disassembly of the m-th basic block and $\mathrm{SimH}_m$ is the SimHash value of the m-th basic block.

(2) Extracting function SimHash values: using the method of S3, a SimHash value is calculated for each function and is denoted FSimH. Each assembly file is then represented as a set of SimHash values, $\mathrm{ASM} = \{\mathrm{Func}_1, \mathrm{Func}_2, \ldots, \mathrm{Func}_n\} \to \{\mathrm{FSimH}_1, \mathrm{FSimH}_2, \ldots, \mathrm{FSimH}_n\}$, where $\mathrm{Func}_n$ is the disassembly of the n-th function and $\mathrm{FSimH}_n$ is the SimHash value of the n-th function.

(3) Extracting the file SimHash value: using the method of S3, the SimHash value of the file is calculated and is denoted FileSimH.

S5. Constructing a code feature relational database framework

(1) Establish the basic block information table basicblock_table of the database: each SimHash value is divided equally into 8 sub-blocks, and 8 tables (sub_tab1 to sub_tab8) are created to store them, with sub-blocks at different positions stored in different tables;

(2) Establish the function information table func_table, which stores the binary code features of each function, including the function's SimHash value and related information;

(3) Establish the file information table file_table, which stores the binary file name and the FileSimH information.
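
One possible layout for these tables is sketched below with Python's built-in `sqlite3` module. Only the table names (`basicblock_table`, `func_table`, `file_table`, `sub_tab1` to `sub_tab8`) come from the description; every column name, the foreign keys and the indexes are assumptions for illustration. SimHash values are stored as hexadecimal text because an unsigned 64-bit value can exceed SQLite's signed 64-bit INTEGER range.

```python
import sqlite3


def create_schema(conn):
    """Create the file, function and basic block tables plus 8 sub-block tables."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE file_table ("
        "file_id INTEGER PRIMARY KEY, file_name TEXT, file_simh TEXT)"
    )
    cur.execute(
        "CREATE TABLE func_table ("
        "func_id INTEGER PRIMARY KEY, "
        "file_id INTEGER REFERENCES file_table(file_id), "
        "func_name TEXT, func_simh TEXT)"
    )
    cur.execute(
        "CREATE TABLE basicblock_table ("
        "bb_id INTEGER PRIMARY KEY, "
        "func_id INTEGER REFERENCES func_table(func_id), "
        "simh TEXT)"
    )
    for q in range(1, 9):
        # sub_tabq stores the q-th 8-bit sub-block of every basic block SimHash.
        cur.execute(
            f"CREATE TABLE sub_tab{q} ("
            "chunk INTEGER, "
            "bb_id INTEGER REFERENCES basicblock_table(bb_id))"
        )
        cur.execute(f"CREATE INDEX idx_sub_tab{q} ON sub_tab{q} (chunk)")
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
    print([row[0] for row in rows])
```

The foreign keys express the one-to-many relations stated earlier: one file record to at least one function record, and one function record to several basic block records.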

S6. Quickly locating binary code based on text similarity

SimHash values within a Hamming distance of 3 (the Hamming distance being the number of bit positions at which two values differ) can be considered similar. However, searching a massive list of SimHash values for those within a Hamming distance of 3 is very expensive. To improve efficiency, the invention provides a multi-table indexing method that balances time and space.

(1) Establishing the basic block SimHash tables

To improve retrieval efficiency while limiting space overhead, each SimHash value is divided equally into 8 blocks, and 8 tables sub_tabq (q = 1 to 8) are created for all SimHash values. Different tables store the SimHash blocks at different positions: for example, the first table stores bits 0-7, the second stores bits 8-15, the third stores bits 16-23, and so on.

(2) Hamming distance calculation

Hamming distance: the SimHash values of the two basic blocks are XORed, and the number of 1 bits in the result is the Hamming distance;

Determining whether the Hamming distance is within N (typically N = 3, with N < 8): to speed up the Hamming distance calculation, note that if the Hamming distance between two basic blocks is N, then N bit positions differ. Because each SimHash value is divided into 8 sub-blocks, those N bits fall within 1 to N sub-blocks, so at least 8 - N sub-blocks are identical. When N is 3, at least 5 of the 8 sub-blocks (8 bits each) of the two SimHash values must be identical;
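
The XOR-based distance and the sub-block argument can be checked with a short Python sketch. The function names are assumptions for this example, and sub-block q is taken to be bits 8(q-1) to 8q-1, following the bit ranges given for the tables.

```python
def hamming_distance(a, b):
    """Number of 1 bits in the XOR of two 64-bit SimHash values."""
    return bin(a ^ b).count("1")


def split_chunks(simh):
    """Split a 64-bit value into 8 sub-blocks of 8 bits (bits 0-7, 8-15, ...)."""
    return [(simh >> (8 * q)) & 0xFF for q in range(8)]


def matching_chunks(a, b):
    """Count the sub-block positions at which two SimHash values agree."""
    return sum(x == y for x, y in zip(split_chunks(a), split_chunks(b)))


if __name__ == "__main__":
    base = 0x0123456789ABCDEF
    # Flip 3 bits that land in 3 different sub-blocks: the worst case for N = 3.
    spread = base ^ (1 << 0) ^ (1 << 9) ^ (1 << 63)
    # Flip 3 bits that all land in the same sub-block.
    packed = base ^ 0b111
    print(hamming_distance(base, spread), matching_chunks(base, spread))
    print(hamming_distance(base, packed), matching_chunks(base, packed))
```

In both cases the distance is 3 and at least 5 sub-blocks still agree, which is the property the multi-table index relies on. The converse does not hold: two values can share 5 sub-blocks and still differ in many bits within the other 3.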

(3) Querying basic blocks

To find the other SimHash values within a Hamming distance of 3 of a given SimHash value, the value is divided equally into 8 blocks (SimHash_bb1 to SimHash_bb8). Each SimHash_bbq (q = 1 to 8) is looked up in the corresponding table sub_tabq, the SimHash values associated with the matching blocks are collected, and the SimHash values that have at least 5 identical blocks are selected.
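
The query procedure can be sketched with in-memory dictionaries standing in for the tables sub_tab1 to sub_tab8. The class name `SimHashIndex` and its methods are hypothetical. The final exact Hamming-distance check is an addition made in this sketch: the description stops at selecting values with at least 5 identical blocks, which yields candidates but does not by itself guarantee a distance within 3.

```python
from collections import Counter, defaultdict


class SimHashIndex:
    """Eight lookup tables, one per 8-bit sub-block position."""

    def __init__(self):
        # tables[q] maps a sub-block value to the SimHash values holding it at position q.
        self.tables = [defaultdict(set) for _ in range(8)]

    @staticmethod
    def chunks(simh):
        return [(simh >> (8 * q)) & 0xFF for q in range(8)]

    def add(self, simh):
        for q, chunk in enumerate(self.chunks(simh)):
            self.tables[q][chunk].add(simh)

    def query(self, simh, max_distance=3):
        hits = Counter()
        # Count, for every stored value, how many sub-blocks match the query.
        for q, chunk in enumerate(self.chunks(simh)):
            for candidate in self.tables[q].get(chunk, ()):
                hits[candidate] += 1
        need = 8 - max_distance            # at least 5 matching sub-blocks for N = 3
        return sorted(
            c for c, n in hits.items()
            if n >= need and bin(c ^ simh).count("1") <= max_distance
        )


if __name__ == "__main__":
    base = 0x0123456789ABCDEF
    index = SimHashIndex()
    for value in (base, base ^ 0b111, base ^ 0xFFFFFFFF):
        index.add(value)
    print([f"{v:016x}" for v in index.query(base)])
```

Each lookup touches only the stored values that share a sub-block with the query, so the cost depends on bucket sizes rather than on the total number of stored SimHash values.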

(4) Similar function comparison

The function to be compared, ObjFunc, is compared against the code basic blocks of each function by calculating the Hamming distance between code basic blocks. A function whose proportion of similar basic blocks exceeds a certain threshold (for example, 50%) is marked as a similar function, giving the set of functions with higher similarity $\mathrm{SimFunc} = \{\mathrm{SimFunc}_1, \mathrm{SimFunc}_2, \ldots, \mathrm{SimFunc}_p\}$, where p is the number of similar functions.
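
A minimal sketch of this screening step follows. It assumes each function is given as a list of basic block SimHash values, that two blocks are similar when their Hamming distance is at most 3, and that the proportion is taken over the basic blocks of ObjFunc; the description does not state the denominator for this step, so that choice, like the function names, is an assumption.

```python
def hamming_distance(a, b):
    return bin(a ^ b).count("1")


def similar_block_ratio(obj_blocks, cand_blocks, max_distance=3):
    """Fraction of ObjFunc blocks that have a similar block in the candidate."""
    if not obj_blocks:
        return 0.0
    matched = sum(
        1 for a in obj_blocks
        if any(hamming_distance(a, b) <= max_distance for b in cand_blocks)
    )
    return matched / len(obj_blocks)


def find_similar_functions(obj_blocks, library, threshold=0.5):
    """Return {function name: ratio} for candidates whose ratio exceeds the threshold."""
    result = {}
    for name, blocks in library.items():
        ratio = similar_block_ratio(obj_blocks, blocks)
        if ratio > threshold:
            result[name] = ratio
    return result


if __name__ == "__main__":
    obj_func = [0x0000, 0xFF00, 0xFFFF0000]
    library = {
        "f_close": [0x0001, 0xFF01, 0x0001],
        "f_far": [0xFFFFFFFFFFFF],
    }
    print(find_similar_functions(obj_func, library))
```

The brute-force pairwise comparison here is only for clarity; in the scheme described, the candidate blocks would come from the multi-table index rather than from a scan of every function.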

S7. Accurate evaluation of code similarity based on code feature comparison

(1) Let the function to be compared be ObjFunc. ObjFunc is compared one by one with the functions in the set $\mathrm{SimFunc} = \{\mathrm{SimFunc}_1, \mathrm{SimFunc}_2, \ldots, \mathrm{SimFunc}_p\}$ obtained in S6 by computing the proportion of identical basic blocks (number of identical basic blocks / total number of basic blocks in ObjFunc). The similarity results are recorded as $\mathrm{SimV} = \{\mathrm{SimV}_1, \mathrm{SimV}_2, \ldots, \mathrm{SimV}_p\}$;

(2) The similarity results $\mathrm{SimV} = \{\mathrm{SimV}_1, \mathrm{SimV}_2, \ldots, \mathrm{SimV}_p\}$ are sorted, and the three functions with the highest similarity are selected as the similarity comparison result.
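
The scoring and ranking in S7 can be sketched as follows. Treating "identical basic blocks" as basic blocks with equal SimHash values is an assumption of this example, as are the function name `rank_candidates` and the data layout (each function as a list of basic block SimHash values).

```python
def rank_candidates(obj_blocks, sim_funcs, top_k=3):
    """Score each candidate by identical blocks / ObjFunc block total; keep the top_k."""
    if not obj_blocks:
        return []
    scores = {}
    for name, blocks in sim_funcs.items():
        cand = set(blocks)
        same = sum(1 for b in obj_blocks if b in cand)
        scores[name] = same / len(obj_blocks)      # SimV for this candidate
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    obj_func = [1, 2, 3, 4]
    sim_funcs = {"a": [1, 2, 3, 4], "b": [1, 2], "c": [1], "d": [9]}
    print(rank_candidates(obj_func, sim_funcs))
```

With the sample data, candidate "a" shares all four blocks, "b" two and "c" one, so those three are returned in that order and "d" is dropped.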

The invention has the following advantages:

the proposed scheme ensures the accuracy of binary code similarity comparison while maintaining comparison efficiency;

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A binary code similarity comparison method based on SimHash is characterized by comprising the following steps:

S1. Disassembling binary code and preprocessing assembly code

Multi-platform binary code is disassembled by parsing the instruction sets of the different architectures, producing a disassembly file for the binary code on each architecture; the assembly file is split into a plurality of functions according to special identifiers in the assembly file; each function is split into a plurality of code blocks according to the jump instructions in the function;

S3. Calculating SimHash values of the assembly code

S4. Constructing a code feature relational database framework

S5. Quickly locating binary code based on text similarity

2. The SimHash-based binary code similarity comparison method of claim 1, wherein the step S1 specifically comprises:

S11. Disassembling multi-platform binary code by parsing the instruction sets of the different architectures, producing a disassembly file for the binary code on each architecture, a disassembly file being denoted ASM;

S12. Splitting the assembly file into a plurality of functions according to identifiers in the assembly file, a function being denoted Func and an assembly file being represented as a set of functions $\mathrm{ASM} = \{\mathrm{Func}_1, \mathrm{Func}_2, \ldots, \mathrm{Func}_n\}$;

S13. Splitting each function into a plurality of basic blocks according to the jump instructions in the function, a basic block being denoted BB and each function being represented as a set of code blocks $\mathrm{Func} = \{\mathrm{BB}_1, \mathrm{BB}_2, \ldots, \mathrm{BB}_m\}$.

3. The SimHash-based binary code similarity comparison method of claim 2, wherein the multiple platforms include Arm, PowerPC and X86.

4. The SimHash-based binary code similarity comparison method of claim 2, wherein the jump instruction comprises jnz and jmp.

5. The SimHash-based binary code similarity comparison method according to any of claims 1-4, wherein the normalization processing rule of step S2 includes:

memory operands are expressed as Memory;

immediate values are expressed as Value;

registers are normalized to reg_32, reg_16 and reg_8 respectively, according to their width in bits;

a call instruction that calls an external system library function is left unchanged, and a call instruction that calls an internal function is normalized to "call sub_xxx";

jump instructions are normalized to "jump loc_xxx".

6. The SimHash-based binary code similarity comparison method as claimed in claim 5, wherein the step S3 of calculating the SimHash values corresponding to the basic blocks, functions and files comprises:

S31. Creating a 64-bit variable SimH and initializing it to 0;

S32. Tokenizing the assembly code using character n-grams or word n-grams;

S33. Assigning a weight to each token based on the token's frequency of occurrence;

S34. Hashing each token to obtain a 64-bit hash value, each token corresponding to one 64-bit hash value;

S35. Combining the token hash values with their weights: for each bit of a token's hash value, if the bit is 1, the token's weight is added to the weighted value at the corresponding bit position; otherwise the token's weight is subtracted;

7. The SimHash-based binary code similarity comparison method according to claim 6, wherein the 64-bit hash value is obtained by applying the MD5 or SHA1 hash algorithm and taking 64 bits of the result.

8. The SimHash-based binary code similarity comparison method of claim 6, wherein the step S3 further comprises:

extracting basic block SimHash values: a SimHash value is calculated for each basic block and is denoted SimH; each function is then represented as a set of SimHash values $\mathrm{Func}_0 \to \{\mathrm{BB}_1, \mathrm{BB}_2, \ldots, \mathrm{BB}_m\} \to \{\mathrm{SimH}_1, \mathrm{SimH}_2, \ldots, \mathrm{SimH}_m\}$, where $\mathrm{BB}_m$ is the disassembly of the m-th basic block and $\mathrm{SimH}_m$ is the SimHash value of the m-th basic block;

extracting function SimHash values: a SimHash value is calculated for each function and is denoted FSimH, and each assembly file is represented as a set of SimHash values $\mathrm{ASM} = \{\mathrm{Func}_1, \mathrm{Func}_2, \ldots, \mathrm{Func}_n\} \to \{\mathrm{FSimH}_1, \mathrm{FSimH}_2, \ldots, \mathrm{FSimH}_n\}$, where $\mathrm{Func}_n$ is the disassembly of the n-th function and $\mathrm{FSimH}_n$ is the SimHash value of the n-th function;

extracting the file SimHash value: the SimHash value of the file is calculated and is denoted FileSimH.

9. The SimHash-based binary code similarity comparison method of claim 8, wherein the step S4 specifically comprises:

establishing the basic block information table basicblock_table of the database, dividing each SimHash value equally into 8 sub-blocks, and creating 8 tables sub_tab1 to sub_tab8 to store them, with sub-blocks at different positions stored in different tables;

establishing the function information table func_table, which stores the binary code features of each function, including the function's SimHash value information;

and establishing the file information table file_table, which stores the binary file name and the FileSimH information.

10. The SimHash-based binary code similarity comparison method of claim 9, wherein the step S6 specifically comprises:

establishing the basic block SimHash tables: each SimHash value is divided into 8 blocks, and 8 tables sub_tabq are created for all SimHash values, where q takes a value from 1 to 8 and different tables store the SimHash blocks at different positions;

Hamming distance calculation: if the Hamming distance between two basic blocks is N, then N bit positions differ; because each SimHash value is divided into 8 sub-blocks, those N bits fall within 1 to N sub-blocks, and when N is 3 at least 5 of the 8 sub-blocks of the two SimHash values are identical;

querying basic blocks: to find the other SimHash values within a Hamming distance of 3 of a given SimHash value, the value is divided equally into 8 blocks SimHash_bb1 to SimHash_bb8, each SimHash_bbq (q = 1 to 8) is looked up in the corresponding table sub_tabq, the SimHash values associated with the matching blocks are collected, and the SimHash values that have at least 5 identical blocks are selected;

similar function comparison: the function to be compared, ObjFunc, is compared against the code basic blocks of each function by calculating the Hamming distance between code basic blocks; a function whose proportion of similar basic blocks exceeds a certain threshold is marked as a similar function, giving the set of functions with higher similarity $\mathrm{SimFunc} = \{\mathrm{SimFunc}_1, \mathrm{SimFunc}_2, \ldots, \mathrm{SimFunc}_p\}$, where p is the number of similar functions;

accurate evaluation of code similarity based on code feature comparison: the function to be compared, ObjFunc, is compared one by one with the functions in $\mathrm{SimFunc} = \{\mathrm{SimFunc}_1, \mathrm{SimFunc}_2, \ldots, \mathrm{SimFunc}_p\}$ by computing the proportion of identical basic blocks, and the similarity results are recorded as $\mathrm{SimV} = \{\mathrm{SimV}_1, \mathrm{SimV}_2, \ldots, \mathrm{SimV}_p\}$; the similarity results SimV are sorted, and the three functions with the highest similarity are selected as the similarity comparison result.