CN112948828A

CN112948828A - Binary program malicious code detection method, terminal device and storage medium

Info

Publication number: CN112948828A
Application number: CN202110092695.0A
Authority: CN
Inventors: 姚刚; 陈奋; 陈荣有; 孙晓波; 龚利军
Original assignee: Xiamen Fuyun Information Technology Co ltd
Current assignee: Xiamen Fuyun Information Technology Co ltd
Priority date: 2021-01-25
Filing date: 2021-01-25
Publication date: 2021-06-11

Abstract

The invention relates to a method for detecting malicious code in binary programs, a terminal device and a storage medium. The method comprises the following steps: S1: extracting dynamic instruction streams of a program to be detected and of a known malicious program sample; S2: generating basic block sets and preprocessing them; S3: calculating the similarity between the basic block sets; S4: constructing control flow graphs of the program to be detected and of the known malicious program sample; S5: calculating the similarity between the control flow graphs; S6: calculating the similarity between the program to be detected and the known malicious program sample from the similarity between the basic block sets and the similarity between the control flow graphs; S7: judging whether the program to be detected is a malicious program by comparing that similarity with a threshold. The method considers code similarity from two aspects, program structure and code semantics, measures code similarity more comprehensively and accurately, and can effectively identify homologous malicious code.

Description

Binary program malicious code detection method, terminal device and storage medium

Technical Field

The present invention relates to the field of network security, and in particular, to a method for detecting malicious codes of binary programs, a terminal device, and a storage medium.

Background

At present, code resources on the internet are increasingly abundant and software reuse techniques are maturing, so developers can rapidly build a new program on the basis of an existing one. This greatly shortens the development cycle, reduces development cost, and lowers the barrier to program development. In malicious program development in particular, secondary development on existing code and integration of existing code have become common, and malicious program developers often rely on reuse to update and develop malicious code quickly. Malicious programs can therefore be detected by comparing the similarity of unknown programs with known malicious programs, and similarity detection can also be used to trace the origin of programs and classify them. To evade detection and removal by security software, malicious program developers change the characteristics of a malicious program through operations such as obfuscation, packing and self-compression. It is therefore difficult to identify the similarity between a transformed malicious program and the original one using only information such as signatures, hash values and software fingerprints; similarity analysis needs to consider multiple features of the malicious program comprehensively.

Disclosure of Invention

In order to solve the above problems, the present invention provides a method for detecting malicious codes of binary programs, a terminal device, and a storage medium.

The specific scheme is as follows:

A binary program malicious code detection method comprises the following steps:

S1: extracting dynamic instruction streams of a program to be detected and of a known malicious program sample;

S2: generating a basic block set from each extracted dynamic instruction stream, and preprocessing the basic block sets;

S3: calculating, from the preprocessed basic block sets, the similarity between the basic block set of the program to be detected and the basic block set of the known malicious program sample;

S4: constructing control flow graphs of the program to be detected and of the known malicious program sample;

S5: calculating the similarity between the control flow graph of the program to be detected and that of the known malicious program sample;

S6: calculating the similarity between the program to be detected and the known malicious program sample from the similarity between their basic block sets and the similarity between their control flow graphs;

S7: judging whether the program to be detected is a malicious program by comparing the similarity between the program to be detected and the known malicious program sample with a threshold.

Further, the dynamic instruction stream is extracted by the PinWSand Box. After the Pintools in the PinWSand Box are adjusted, the dynamic instruction stream is extracted in units of basic blocks. During extraction, the basic block execution sequence is stored by means of a basic block index library: a unique number is set for each basic block, the basic block is represented by its unique number in the basic block execution sequence, the content of the basic block is recorded in the basic block index library, and the basic block execution sequence is recorded in the basic block execution sequence file.

Further, the specific process of dynamic instruction stream extraction includes the following steps:

S101: when the program is about to execute a basic block, the instrumentation function of the basic block is triggered;

S102: the unique number corresponding to the basic block is looked up in the basic block index library; if it is found, go to S105; otherwise, go to S103;

S103: a unique number is set for the basic block;

S104: the content of the basic block is recorded in the basic block index library;

S105: the unique number of the basic block is appended to the basic block execution sequence;

S106: the program executes the basic block;

S107: whether the program has finished executing is judged; if so, the process ends; otherwise, the process returns to S101 for the next basic block to be executed.

Further, the preprocessing includes screening common instructions in the dynamic instruction stream, unifying instructions with the same function, standardizing the instruction format, and simplifying the basic block based on the DAG graph.

Further, the method for simplifying the basic block based on the DAG graph comprises the following steps: in the DAG graph, the return values and parameters of the standardized instruction statements are used as vertices; the vertex holding a return value is set as a def vertex, the vertex holding a variable used for the first time is a zero vertex, and all other vertices are use vertices; whether a statement needs to be optimized is judged from the numbers of parent nodes and child nodes of its vertices, and the optimization of the DAG graph is completed by merging or deleting vertices; if a vertex has no parent node, the vertex is deleted; if a vertex has only one child node, the vertex and its child node are merged; finally, the optimized DAG graph is restored to the basic block format.

Further, the method for calculating the similarity between the basic block sets comprises the following steps: the similarity between basic blocks is calculated from the def-use chains of the basic blocks, and a similarity matrix of the basic block sets is constructed from the similarities between the basic blocks in order to calculate the similarity between the basic block sets.

Further, the method for calculating the similarity between two basic blocks comprises the following steps: the semantic content of each of the two basic blocks is converted into four parts, namely the number of variables, the number of constants, the def-use chain set and the constant set of the basic block, wherein the calculation formula of the similarity of the two basic blocks is as follows:

wherein A and B represent the two basic blocks whose similarity is calculated; sim_bbl(A, B) denotes the similarity of basic blocks A and B; sim_var(A, B) denotes the variable similarity of basic blocks A and B; sim_con(A, B) denotes the constant similarity of basic blocks A and B; sim_cha_set(A, B) denotes the def-use chain set similarity of basic blocks A and B; sim_con_set(A, B) denotes the constant set similarity of basic blocks A and B; varnum_a and varnum_b denote the numbers of variables of basic blocks A and B, respectively; connum_a and connum_b denote the numbers of constants of basic blocks A and B, respectively; N_samechain and N_diffchain denote the number of identical chains and the number of different chains in the def-use chain sets of basic blocks A and B, respectively; and N_samecon and N_diffcon denote the number of identical constants and the number of different constants in the constant sets of basic blocks A and B, respectively.

Further, the similarity between the two basic block sets is calculated by the following formula:

wherein P and Q respectively represent the two basic block sets whose similarity is calculated; sim_BBLSet(P, Q) denotes the similarity between the basic block sets P and Q; s_ij denotes the similarity between the i-th basic block in program P and the j-th basic block in program Q; bblnum_P denotes the number of basic blocks in program P; bblnum_Q denotes the number of basic blocks in program Q; and max() denotes the maximum value.

Further, the control flow graph is constructed as follows: all numbers in the basic block execution sequence file are used as nodes of the graph; each pair of adjacent numbers forms an edge directed from the number that appears first to the number that follows it; and the number of times an edge occurs is used as the weight of the edge.

Further, before the similarity between the two control flow graphs is calculated, the method further includes merging the sequential structures in the control flow graphs.

Further, the similarity between the two control flow graphs is the ratio of the scale of their common subgraph to the scale of the original graphs.

A binary program malicious code detection terminal device comprises a processor, a memory and a computer program which is stored in the memory and can run on the processor, wherein the processor executes the computer program to realize the steps of the method of the embodiment of the invention.

A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as described above for an embodiment of the invention.

With the above technical scheme, for malicious programs on the Windows platform, the dynamic instruction stream information of a program is extracted at the assembly level, the program control flow graph is constructed from the offline basic block sequence, and similarity analysis is performed along two dimensions: the semantic features and the structural features of the program code. Extracting program features dynamically effectively avoids the influence of software transformation techniques such as packing and obfuscation; extracting information at the assembly level frees detection from any dependence on the program's development language; and detecting along two dimensions takes both the semantic and the structural features of the software into account, so that the program code can be analyzed comprehensively.

Drawings

Fig. 1 is a flowchart illustrating a first embodiment of the present invention.

Fig. 2 is a schematic diagram of the basic structures of a program in this embodiment.

Fig. 3 is a schematic diagram of the basic block index library in this embodiment.

Fig. 4 is a schematic diagram of the control flow of a basic block sequence in this embodiment.

Fig. 5 is a schematic diagram of the simplified control flow graph in this embodiment.

Fig. 6 is a schematic diagram illustrating the calculation of control flow graph similarity in this embodiment.

Detailed Description

To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.

The invention will now be further described with reference to the accompanying drawings and detailed description.

The first embodiment is as follows:

An embodiment of the present invention provides a method for detecting malicious code in a binary program. As shown in fig. 1, the method includes the following steps:

S1: Extract the dynamic instruction streams of the program to be detected and of the known malicious program sample.

To avoid the risk that running unknown software adversely affects the analysis system, this embodiment preferably uses the PinWSand Box non-perceptible sandbox to extract the dynamic instruction stream of the software. The PinWSand Box monitors the system calls, module loading and instruction stream of the software through dynamic instrumentation, can roll back malicious behavior, and effectively protects the security of the analysis host. The PinWSand Box is a sandbox developed as Pintools on the Pin platform. Pin is a dynamic binary analysis platform developed by Intel; it supports the IA-32, x86-64 and MIC instruction set architectures and the Linux, Windows and macOS systems. Pin works like a just-in-time (JIT) compiler: it can insert code at any position in a binary program and execute it, can replace original program code, can record system calls, program thread activity and similar information, can detect process trees and simulate application programming interface (API) calls, and offers three instrumentation granularities: the instruction level, the basic block level and the function level. Pintools are extension tools for the dynamic instrumentation platform Pin. The Pin platform provides users with a rich API and allows them to develop plug-ins in the form of dynamic link libraries (DLLs), so that users can customize the content and position of the inserted code and extract the information they are interested in. These plug-ins are called Pintools.

The volume of data in a software dynamic instruction stream is huge: a single 1 KB program may execute hundreds of thousands of instructions. Instrumenting every instruction directly would greatly slow the program, sharply reduce analysis efficiency and make storage difficult. Therefore, in this embodiment the Pintools in the PinWSand Box are adjusted so that the software dynamic instruction stream is extracted in units of basic blocks, and the basic block execution sequence is stored by means of a basic block index library.

A basic block is a sequence of assembly statements that a program executes in order and that has a single entry and a single exit: during execution the sequence can be entered only at its entry and left only at its exit. Once a basic block is entered, the program must execute all instructions in it until the basic block is exited. Recording the basic block execution sequence directly would still require a large amount of storage, so a basic block index library is used, and the basic block contents and the basic block execution sequence are stored separately. When basic blocks are stored, identical basic blocks are recorded only once, each basic block is assigned a unique number, and the numbers replace the basic blocks in the execution sequence, which greatly reduces the storage space. The specific process of dynamic instruction stream extraction comprises the following steps:

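S101: when the program is about to execute a basic block, the instrumentation function of the basic block is triggered;

S102: the unique number corresponding to the basic block is looked up in the basic block index library; if it is found (indicating that the basic block has already been recorded), go to S105; otherwise, go to S103;

S103: a unique number is set for the basic block;

S104: the content of the basic block is recorded in the basic block index library;

S105: the unique number of the basic block is appended to the basic block execution sequence;

S106: the program executes the basic block;

S107: whether the program has finished executing is judged; if so, the process ends; otherwise, the process returns to S101 for the next basic block to be executed.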
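The recording scheme in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `BasicBlockRecorder` and the callback `on_basic_block` are hypothetical and are not part of the Pin API, and basic blocks are identified here by their instruction text so that identical blocks are recorded once.

```python
from typing import Dict, List, Tuple


class BasicBlockRecorder:
    """Index library plus execution sequence, following steps S101 to S105."""

    def __init__(self) -> None:
        self.index: Dict[Tuple[str, ...], int] = {}      # block content -> unique number
        self.contents: Dict[int, Tuple[str, ...]] = {}   # unique number -> block content
        self.sequence: List[int] = []                    # execution sequence of numbers

    def on_basic_block(self, instructions: List[str]) -> int:
        """Hypothetical instrumentation callback, invoked before a block runs (S101)."""
        key = tuple(instructions)
        number = self.index.get(key)           # S102: look the block up
        if number is None:
            number = len(self.index) + 1       # S103: assign a new unique number
            self.index[key] = number
            self.contents[number] = key        # S104: record the content once
        self.sequence.append(number)           # S105: append the number to the sequence
        return number


recorder = BasicBlockRecorder()
trace = [
    ["push ebp", "mov ebp, esp"],
    ["cmp eax, 0", "jne loop"],
    ["push ebp", "mov ebp, esp"],   # same content as the first block
]
for block in trace:
    recorder.on_basic_block(block)
print(recorder.sequence)   # [1, 2, 1]
```

The third block reuses number 1, so the index library holds two entries while the execution sequence holds three numbers.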

S2: Preprocess the basic block set in the extracted dynamic instruction stream.

The basic block set obtained by direct extraction suffers from function-independent instruction noise, reorderable instruction noise, instruction diversity and similar problems, which can seriously affect the similarity calculation result, so the basic block set needs to be preprocessed. The preprocessing of the basic block set in this embodiment includes the following four points.

(1) Screening common instructions in the dynamic instruction stream.

During preprocessing, the instruction formats in the dynamic instruction stream need to be unified. At present there is no intermediate language that can unify the formats of assembly languages. Specifying the format of every assembly instruction one by one would be an excessive workload, and because not all instructions are frequently used, processing all of them is unnecessary; only the formats of the frequently used instructions need to be unified. The common instructions therefore need to be screened out first.

In this embodiment, 1500 binary samples are obtained from a sample website, their dynamic instruction streams are extracted, and statistics are collected on the instructions that belong to the x86 instruction set. The statistics include the instruction mnemonic, the frequency with which the instruction appears in the sample set, and the proportion of samples containing the instruction. Finally, the 100 most frequently used assembly instructions are screened out.
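As a rough illustration of this screening step, the following Python sketch counts mnemonics across a set of instruction streams and reports, for the most frequent ones, the total occurrences and the share of samples that contain them. The function name, the input layout (a mapping from sample name to a list of instruction strings) and the sample data are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def instruction_statistics(samples: Dict[str, List[str]], top_n: int = 100
                           ) -> List[Tuple[str, int, float]]:
    """Return (mnemonic, total occurrences, share of samples containing it)."""
    total: Counter = Counter()        # occurrences across all samples
    containing: Counter = Counter()   # number of samples containing the mnemonic
    for stream in samples.values():
        mnemonics = [line.split()[0].lower() for line in stream if line.strip()]
        total.update(mnemonics)
        containing.update(set(mnemonics))
    n = len(samples)
    return [(m, c, containing[m] / n) for m, c in total.most_common(top_n)]


samples = {
    "a.exe": ["mov eax, 1", "push eax", "mov ebx, eax"],
    "b.exe": ["mov ecx, 2", "ret"],
}
stats = instruction_statistics(samples, top_n=2)
print(stats[0])   # ('mov', 3, 1.0)
```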

(2) Instructions with the same function are unified.

Some statements in assembly language have the same function. For example, xor eax, eax and mov eax, 0 both clear the eax register, and there is essentially no difference between them. To prevent such synonymous instructions from interfering with the basic block similarity calculation, instructions with the same function need to be unified.
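A minimal sketch of such unification is a small table of rewrite rules applied to each instruction. The rule table below is hypothetical and contains only the register-clearing case mentioned above (plus the analogous sub form); like the text, it ignores differences in flag effects.

```python
import re

# Hypothetical rewrite rules: each pattern maps a synonymous form to one canonical form.
_SYNONYMS = [
    # xor r, r and sub r, r both clear register r -> mov r, 0
    (re.compile(r"^(?:xor|sub) (\w+), \1$"), r"mov \1, 0"),
]


def unify(instruction: str) -> str:
    """Normalise case and spacing, then rewrite known synonyms to a canonical form."""
    text = " ".join(instruction.lower().split())
    text = re.sub(r"\s*,\s*", ", ", text)
    for pattern, canonical in _SYNONYMS:
        if pattern.match(text):
            return pattern.sub(canonical, text)
    return text


print(unify("XOR  eax,eax"))   # mov eax, 0
print(unify("mov eax,0"))      # mov eax, 0
print(unify("xor eax, ebx"))   # xor eax, ebx
```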

(3) The instruction format is standardized.

Assembly instruction formats vary widely (for example in x86) and operate on their parameters in different ways, which complicates the similarity calculation, so the instruction format needs to be standardized. Each instruction statement is normalized into three parts: one instruction code, one return value and three parameters. The instruction code is the original instruction code, the return value is the variable that stores the result of executing the instruction, and the parameters are the original parameters, with null filling any missing parameter positions. Besides the instruction format, the instruction parameters are also standardized. An assembly instruction can take registers, stack addresses and memory addresses as parameters; since all of these are essentially variables, the parameters are uniformly named and numbered.
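The standardized form can be illustrated with a small Python sketch. The rules used here are simplifying assumptions rather than the embodiment's actual rules: the first operand is treated as the destination (return value), mov takes only its source as a parameter, immediates are kept as constants, and every other operand is renamed var1, var2, and so on in order of first appearance.

```python
from typing import Dict, List, Optional, Tuple

Params = Tuple[Optional[str], Optional[str], Optional[str]]
Statement = Tuple[str, Optional[str], Params]   # (instruction code, return value, parameters)


def is_constant(operand: str) -> bool:
    """Treat decimal and 0x-prefixed hexadecimal immediates as constants."""
    try:
        int(operand, 0)
        return True
    except ValueError:
        return False


def standardize(block: List[str]) -> List[Statement]:
    names: Dict[str, str] = {}   # original operand -> uniform variable name

    def rename(operand: str) -> str:
        if is_constant(operand):
            return operand
        return names.setdefault(operand, f"var{len(names) + 1}")

    result: List[Statement] = []
    for line in block:
        opcode, _, rest = line.strip().lower().partition(" ")
        operands = [rename(o.strip()) for o in rest.split(",") if o.strip()]
        ret = operands[0] if operands else None        # assumption: first operand is the destination
        params = operands[1:] if opcode == "mov" else operands
        padded = (params + [None, None, None])[:3]     # fill missing parameters with null
        result.append((opcode, ret, (padded[0], padded[1], padded[2])))
    return result


result = standardize(["mov eax, 0", "add eax, ebx"])
print(result[0])   # ('mov', 'var1', ('0', None, None))
print(result[1])   # ('add', 'var1', ('var1', 'var2', None))
```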

(4) The basic block is simplified based on the DAG graph.

After the above processing, a basic block set with a uniform format is obtained. Each basic block is then simplified based on a DAG (Directed Acyclic Graph), which includes local optimizations such as deleting useless assignment statements and merging serial assignment statements, so that the basic block statements become more concise and the scale of the basic block set is further reduced.

In this embodiment, the basic block (BBL) is converted into a DAG with the return values and the parameters as vertices. The vertices are divided into three types: the vertex holding a return value is a def vertex, the vertex holding a variable used for the first time is a zero vertex, and all other vertices are use vertices. Whether a statement needs to be optimized is judged from the numbers of parent nodes and child nodes of its vertices, and the optimization is completed by merging and deleting vertices. If a def vertex has no parent node, the definition at that vertex is never used and the vertex can be deleted; if a def vertex has only one child node, the value of the parameter or return value corresponding to the vertex is the same as that of the child node, and the vertex and its child node can be merged. Finally, the optimized DAG is restored to the basic block format.
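The deletion rule (a definition that is never used can be removed) can be illustrated without building the DAG explicitly. The sketch below swaps in a plain backward scan over the statements, which removes the same unused definitions; it does not implement the merging rule. It assumes statements in the (instruction code, return value, parameters) form, that variables are named with a "var" prefix, that statements with a return value have no other side effects, and that `live_out` (the variables still needed after the block) is supplied by the caller.

```python
from typing import List, Optional, Set, Tuple

Statement = Tuple[str, Optional[str], Tuple[Optional[str], ...]]


def remove_unused_definitions(block: List[Statement], live_out: Set[str]) -> List[Statement]:
    """Drop statements whose result is never read (a def vertex with no parent)."""
    needed = set(live_out)
    kept: List[Statement] = []
    for opcode, ret, params in reversed(block):   # walk backwards from the block exit
        if ret is not None and ret not in needed:
            continue                              # result unused: delete the statement
        if ret is not None:
            needed.discard(ret)                   # this statement provides the needed value
        needed.update(p for p in params if p is not None and p.startswith("var"))
        kept.append((opcode, ret, params))
    kept.reverse()
    return kept


block = [
    ("mov", "var1", ("1", None, None)),        # overwritten before any use
    ("mov", "var1", ("2", None, None)),
    ("mov", "var3", ("7", None, None)),        # never used
    ("add", "var2", ("var1", "var2", None)),
]
simplified = remove_unused_definitions(block, live_out={"var2"})
print(simplified)
```

Only the second mov and the add remain, since the other two definitions are never read.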

S3: Calculate, from the preprocessed basic block sets, the similarity between the basic block set of the program to be detected and the basic block set of the known malicious program sample.

In this embodiment, the def-use chains of the basic blocks are extracted from the parameter dependences among the instructions, and the similarity between basic blocks is calculated from these def-use chains, which avoids the interference that instruction reordering would otherwise cause in the similarity calculation. A similarity matrix of the two sets is then constructed from the similarities between the basic blocks in order to calculate the similarity between the sets.

(1) Similarity of basic blocks

A def-use chain is a linked list used in compiler theory to describe variable dependencies: within the scope of a variable, it starts at the statement that defines the variable and links the statements that use the variable in order, until the variable is defined again or the scope ends. The def-use chain of a basic block takes a single basic block as the scope of the variable, starts at the statement that defines the variable or at the statement in which the variable first occurs, and links the statements that use the variable in order until the basic block ends or the variable is redefined.

Definition 1 (basic block constant set): an immediate value among the operands of the assembly instructions in a basic block is called a basic block constant, denoted con. The number of constants is denoted connum and the constant set of the basic block is denoted con_set; then $con\_set_{connum} = \{con_1, con_2, \ldots, con_{connum}\}$ (1)

Definition 2 (basic block variable set): the operands in a basic block other than constants are called variables, denoted var. The number of variables is denoted varnum and the variable set of the basic block is denoted var_set; then $var\_set_{varnum} = \{var_1, var_2, \ldots, var_{varnum}\}$ (2)

Definition 3 (def-use chain set): a def-use chain in a basic block is denoted chain. The number of def-use chains is denoted chanum and the def-use chain set is denoted cha_set; then $cha\_set_{chanum} = \{chain_1, chain_2, \ldots, chain_{chanum}\}$ (3)

Definition 4 (similarity): the degree of similarity between sets or values, with a value range from 0 to 1, denoted sim. The similarity between basic blocks is denoted sim_bbl, the variable similarity sim_var, the constant similarity sim_con, the constant set similarity sim_con_set, and the def-use chain set similarity sim_cha_set.
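The following Python sketch shows one way to collect def-use chains inside a single basic block. It assumes statements in the (instruction code, return value, parameters) form and variables named with a "var" prefix; a chain is reported as the variable name together with the indices of its starting statement and subsequent uses. These representation choices are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Statement = Tuple[str, Optional[str], Tuple[Optional[str], ...]]
Chain = Tuple[int, ...]   # statement indices: the starting point followed by the uses


def def_use_chains(block: List[Statement]) -> List[Tuple[str, Chain]]:
    open_chains: Dict[str, List[int]] = {}   # variable -> chain under construction
    finished: List[Tuple[str, Chain]] = []
    for i, (_, ret, params) in enumerate(block):
        # Uses are processed first, because a statement reads its parameters before writing.
        for var in dict.fromkeys(p for p in params if p and p.startswith("var")):
            if var in open_chains:
                open_chains[var].append(i)   # a use extends the open chain
            else:
                open_chains[var] = [i]       # first occurrence starts a chain
        if ret is not None:
            if ret in open_chains:           # redefinition closes the previous chain
                finished.append((ret, tuple(open_chains[ret])))
            open_chains[ret] = [i]           # the definition starts a new chain
    finished.extend((v, tuple(c)) for v, c in open_chains.items())
    return finished


block = [
    ("mov", "var1", ("5", None, None)),          # 0: defines var1
    ("add", "var2", ("var1", "var3", None)),     # 1: uses var1 and var3, defines var2
    ("mov", "var1", ("var2", None, None)),       # 2: uses var2, redefines var1
]
chains = def_use_chains(block)
print(sorted(chains))
```

In this example var1 yields two chains, (0, 1) and (2,), because statement 2 redefines it.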

The semantic content of a basic block is converted into four parts: the number of variables, the number of constants, the def-use chain set and the constant set of the basic block. The similarity of two basic blocks is calculated by combining the similarities of these four parts. For basic block A, the related information is converted into

where the def-use chain set of basic block A can be expressed as

and the constant set is expressed as:

Similarly, basic block B is converted into:

The similarity calculation formula of basic block A and basic block B is as follows:

where sim_var is calculated from the distance between the varnum values and sim_con from the distance between the connum values, by the following formula:

sim_cha_set is calculated as the Jaccard coefficient of the cha_set of the two basic blocks, and sim_con_set as the Jaccard coefficient of their con_set. The Jaccard coefficient is defined as the ratio of the size of the intersection of two sets to the size of their union. For sets S1 and S2, the Jaccard coefficient is calculated as follows:

$J(S_1, S_2) = \dfrac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$

Comparing the cha_set of A and B, the number of identical chains is denoted N_samechain and the number of different chains is denoted N_diffchain; then:

$sim_{cha\_set}(A, B) = \dfrac{N_{samechain}}{N_{samechain} + N_{diffchain}}$

Comparing the con_set of A and B, the number of identical constants is denoted N_samecon and the number of different constants is denoted N_diffcon; then:

$sim_{con\_set}(A, B) = \dfrac{N_{samecon}}{N_{samecon} + N_{diffcon}}$
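A hedged Python sketch of the four-part comparison follows. Only the Jaccard coefficient is taken directly from the text. The distance-based form used for the counts, 1 - |a - b| / max(a, b), and the unweighted mean used to combine the four parts are assumptions chosen for illustration, since those formulas are not reproduced in this text. The `BlockFeatures` type and the sample values are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Hashable


@dataclass(frozen=True)
class BlockFeatures:
    varnum: int                      # number of variables
    connum: int                      # number of constants
    cha_set: FrozenSet[Hashable]     # def-use chain set (chains in some canonical, hashable form)
    con_set: FrozenSet[Hashable]     # constant set


def jaccard(s1: FrozenSet[Hashable], s2: FrozenSet[Hashable]) -> float:
    """Size of the intersection divided by size of the union."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0   # two empty sets count as identical


def count_similarity(a: int, b: int) -> float:
    """Assumed distance-based form: 1 - |a - b| / max(a, b)."""
    return 1.0 if a == b else 1.0 - abs(a - b) / max(a, b)


def sim_bbl(a: BlockFeatures, b: BlockFeatures) -> float:
    parts = (
        count_similarity(a.varnum, b.varnum),   # sim_var
        count_similarity(a.connum, b.connum),   # sim_con
        jaccard(a.cha_set, b.cha_set),          # sim_cha_set
        jaccard(a.con_set, b.con_set),          # sim_con_set
    )
    return sum(parts) / len(parts)              # assumed combination: unweighted mean


A = BlockFeatures(4, 2, frozenset({"c1", "c2", "c3"}), frozenset({0, 1}))
B = BlockFeatures(2, 2, frozenset({"c2", "c3", "c4"}), frozenset({0, 1}))
print(sim_bbl(A, B))   # 0.75
```

Here the chain sets share two of four distinct chains (Jaccard 0.5), the variable counts give 0.5, and the constant parts are identical, so the mean is 0.75.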

(2) Similarity of set of basic blocks

Definition 5 (basic block set): a basic block is denoted bbl, the number of basic blocks is denoted bblnum, and a basic block set is denoted BBLSet; the basic block set of program P is then expressed as:

$BBLSet_P = \{bbl_1, bbl_2, \ldots, bbl_{bblnum_P}\}$

Definition 6 (basic block set similarity matrix): the similarity between every pair of basic blocks drawn from the two basic block sets is calculated, and the results are arranged in the order of the basic blocks to form a similarity matrix, denoted SimMatrix. Let s_ij = sim_bbl(bbl_i, bbl_j) denote the similarity between basic block bbl_i of program P and basic block bbl_j of program Q, where 1 ≤ i ≤ bblnum_P and 1 ≤ j ≤ bblnum_Q.

The basic block set similarity matrix for P and Q is then expressed as:

$SimMatrix_{P,Q} = (s_{ij})_{bblnum_P \times bblnum_Q}$

The KM algorithm is used to calculate the highest-similarity matching in SimMatrix. The KM (Kuhn-Munkres) algorithm is a classical algorithm for maximum weight matching; it yields the optimal one-to-one matching between the two sets, that is, the matching that maximizes the total weight. The similarity formula for the two basic block sets is:
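The matching step can be illustrated with the Python sketch below. Two points are assumptions: an exhaustive search over assignments stands in for the KM algorithm (it finds the same optimum but is practical only for very small matrices), and the set similarity is normalised as the best matching weight divided by max(bblnum_P, bblnum_Q), a form suggested by the symbols listed for the formula but not confirmed by this text.

```python
from itertools import permutations
from typing import Sequence


def best_matching_weight(matrix: Sequence[Sequence[float]]) -> float:
    """Maximum total weight of a one-to-one matching between rows and columns."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    if rows == 0 or cols == 0:
        return 0.0
    if rows > cols:                                    # transpose so that rows <= cols
        matrix = [list(col) for col in zip(*matrix)]
        rows, cols = cols, rows
    # Try every way of assigning each row to a distinct column.
    return max(
        sum(matrix[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(cols), rows)
    )


def sim_bblset(matrix: Sequence[Sequence[float]]) -> float:
    """Assumed normalisation: best matching weight / max(bblnum_P, bblnum_Q)."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    larger = max(rows, cols)
    return best_matching_weight(matrix) / larger if larger else 0.0


sim_matrix = [
    [0.9, 0.2, 0.1],   # similarities of bbl_1 of P to the three blocks of Q
    [0.3, 0.8, 0.4],   # similarities of bbl_2 of P to the three blocks of Q
]
print(best_matching_weight(sim_matrix))
print(sim_bblset(sim_matrix))
```

The best matching pairs bbl_1 with the first block of Q and bbl_2 with the second, for a total weight of about 1.7; dividing by 3 leaves the unmatched third block of Q counted against the similarity.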

S4: Construct the control flow graphs of the program to be detected and of the known malicious program sample.

A control flow graph (CFG) is an abstract data structure proposed by Frances E. Allen in 1970 that describes the structure of program code in simplified form. The three most basic structures in a program are shown in fig. 2: the sequential structure, the branch structure and the loop structure.

The basic block set reflects only the components of the program's dynamic instruction stream and carries no information about the code structure of the program. Control flow similarity is therefore introduced so that the measurement of program similarity is more comprehensive and accurate.

In this embodiment, a basic block level control flow graph is constructed from the basic block execution sequence, which is stored while the PinWSand Box extracts the program's dynamic instruction stream. Basic blocks are recorded by means of a basic block index library: a unique number is set for each basic block, and the unique number represents the basic block in the execution sequence. The content of the basic block is recorded in the basic block index library, and the execution sequence of the basic blocks is recorded in the basic block execution sequence file.

Fig. 3 shows part of a basic block index library containing 5 basic blocks, numbered 1, 2, 3, 4 and 5; the corresponding execution sequence of these 5 basic blocks in the execution sequence file is 1, 2, 3, 4, 3, 5.

A control flow graph is constructed from the number sequence in the basic block execution sequence file. All numbers are used as nodes of the graph, each pair of adjacent numbers forms an edge directed from the number that appears first to the number that follows it, and the number of times an edge occurs is used as its weight. In this embodiment, the generated control flow graph is denoted G, and G(V, E) denotes a control flow graph with node set V and directed weighted edge set E. For example, if the basic block execution sequence is 1, 2, 3, 4, 3, 5, then 1, 2, 3, 4 and 5 are the vertices of the control flow graph, and the five adjacency relations in the execution sequence are the directed edges connecting the vertices, namely (1,2), (2,3), (3,4), (4,3) and (3,5). Each edge appears only once, so the weight of every edge is 1. A control flow graph with a loop structure is thus generated, as shown in fig. 4.
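The construction rule translates almost directly into code. The sketch below is a minimal Python illustration; the function name `build_cfg` and the representation of E as a dictionary from edge to weight are choices made for the example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def build_cfg(sequence: List[int]) -> Tuple[Set[int], Dict[Tuple[int, int], int]]:
    """Return (V, E): nodes, and directed edges weighted by how often they occur."""
    nodes = set(sequence)
    edges = Counter(zip(sequence, sequence[1:]))   # each pair of adjacent numbers is an edge
    return nodes, dict(edges)


V, E = build_cfg([1, 2, 3, 4, 3, 5])
print(sorted(V))   # [1, 2, 3, 4, 5]
print(E)           # {(1, 2): 1, (2, 3): 1, (3, 4): 1, (4, 3): 1, (3, 5): 1}
```

An edge that is traversed repeatedly, for example in a loop that runs several times, simply accumulates a larger weight.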

S5: Calculate the similarity between the control flow graph of the program to be detected and that of the known malicious program sample.

Calculating the similarity of graphs is a very complex problem, and in practice the graphs are usually processed according to the actual situation to reduce complexity and improve computational efficiency. The most critical parts of a program's dynamic control flow graph are the branch structures and loop structures, which carry the control transfer information of the running program; sequential structures matter little to the control flow graph yet are abundant in it. Merging the sequential parts of the control flow graph therefore effectively reduces the scale of the graph without affecting the graph similarity calculation. Sequential structures are identified by examining the in-degree and out-degree of a vertex and of its neighbors. If the in-degree or the out-degree of a vertex is greater than 1, the vertex lies on a branch structure and needs to be preserved, such as vertex 3 in fig. 4. Vertices pointed to by such a vertex also need to be preserved, such as vertices 4 and 5 in fig. 4. The remaining vertices may be merged. For example, in fig. 4 there is a sequential structure among vertices 1, 2, 3 and a loop structure among vertices 3, 4, 5, so the sequence 1 2 3 4 3 5 can be reduced to 3 4 3 5. A simplified schematic is shown in fig. 5.

The pseudo code of the control flow graph sequential structure merging algorithm is as follows:

According to the experimental results, sequential structure merging reduces the scales of V and E to within 20 percent of their original scales.
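Because the merging pseudo code is not reproduced in this text, the Python sketch below is only an approximation of the rule described in the paragraph before it: vertices with in-degree or out-degree greater than 1 are kept, vertices they point to are kept, and all other vertices are dropped from the sequence. Degrees are counted over distinct neighbors, which is an assumption; the function reproduces the 1 2 3 4 3 5 to 3 4 3 5 example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def merge_sequential(sequence: List[int]) -> List[int]:
    """Drop vertices that lie only on sequential structures."""
    succ: Dict[int, Set[int]] = defaultdict(set)
    pred: Dict[int, Set[int]] = defaultdict(set)
    for a, b in zip(sequence, sequence[1:]):
        succ[a].add(b)
        pred[b].add(a)
    # Vertices on a branch or loop structure: in-degree > 1 or out-degree > 1.
    branching = {v for v in set(sequence) if len(pred[v]) > 1 or len(succ[v]) > 1}
    keep = set(branching)
    for v in branching:
        keep |= succ[v]          # vertices pointed to by such vertices are kept as well
    return [v for v in sequence if v in keep]


print(merge_sequential([1, 2, 3, 4, 3, 5]))   # [3, 4, 3, 5]
```

A purely sequential run such as 1, 2, 3 has no branching vertex, so under this rule every vertex would be dropped.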

The similarity of two control flow graphs is defined as the ratio of the scale of their common subgraph to the scale of the original graphs. In this embodiment, the following algorithm is used to obtain the common subgraphs of the control flow graphs G₁ and G₂; the pseudo code is as follows:

With this method, a set of common subgraphs of the control flow graphs G₁ and G₂ can be obtained. The set covers the common part of G₁ and G₂ to the maximum extent and is denoted CommonGraph(G₁, G₂). The graph similarity is denoted sim_graph, and sim_graph(G₁, G₂) represents the similarity of G₁ and G₂.

Definition 7 (scale of a graph): the scale of a graph G is defined as the total number of nodes and edges in the graph, denoted Scale(G). The similarity of G₁ and G₂ is then calculated as follows:

$sim_{graph}(G_1, G_2) = \dfrac{2 \times Scale(CommonGraph(G_1, G_2))}{Scale(G_1) + Scale(G_2)}$

For example, for the two control flow graphs a and b in FIG. 6, the graph formed by vertices 3, 4, 5 in a is isomorphic to the subgraph formed by vertices 1, 2, 3 in b, so the common subgraph scale of a and b is 6. Since the scale of graph a is 6 and the scale of graph b is 10, the similarity between the two graphs is 0.75.
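The similarity formula itself is not reproduced in this text. One form that is consistent with the worked example (common subgraph scale 6, graph scales 6 and 10, similarity 0.75) is twice the common subgraph scale divided by the sum of the two graph scales. The sketch below assumes that form; the function names are hypothetical and the formula is an inference from the example, not a statement of the original equation.

```python
def scale(num_vertices, num_edges):
    """Scale of a graph: total number of vertices and edges (Definition 7)."""
    return num_vertices + num_edges


def sim_graph(common_scale, scale_g1, scale_g2):
    """Assumed form: 2 * Scale(CommonGraph) / (Scale(G1) + Scale(G2))."""
    total = scale_g1 + scale_g2
    if total == 0:
        return 0.0  # two empty graphs: treated as no similarity by convention
    return 2.0 * common_scale / total


# FIG. 6 example: a common subgraph with 3 vertices and 3 edges has scale 6.
print(sim_graph(scale(3, 3), 6, 10))  # 0.75
```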

S6: Calculate the similarity between the program to be detected and the known malicious program sample according to the similarity between their basic block sets and the similarity between their control flow graphs.

The basic block set can represent semantic features of programs, the control flow graph can represent structural features of the programs, the two features are combined, the similarity between the programs is calculated from two dimensions of semantics and structures, and the calculation result of the similarity can be more accurate and more reliable.

Definition 8 (program similarity): the similarity between two programs is defined as the linear combination of the similarity of their dynamic basic block sets and the similarity of their dynamic control flow graphs, and is denoted sim_Pro.

For programs P and Q, the similarity is denoted sim_Pro(P, Q), and the similarity calculation formula is:

sim_Pro(P, Q) = sim_BBLSet(P, Q) × α + sim_Graph(P, Q) × (1 − α)

where α is a linear coefficient with a value of 0.5.

In this embodiment, if 0.8 ≤ sim_Pro(P, Q) ≤ 1.0, programs P and Q are judged to be similar; if 0.0 ≤ sim_Pro(P, Q) ≤ 0.4, programs P and Q are judged to be not similar; otherwise, it cannot be determined whether programs P and Q are similar.
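The linear combination and the three-way decision of this embodiment can be illustrated with a short sketch. The function names and the returned labels are hypothetical; the coefficient 0.5 and the thresholds 0.8 and 0.4 are the values stated above, and the two input similarities are assumed to have been computed already.

```python
def sim_pro(sim_bbl_set, sim_graph, alpha=0.5):
    """Program similarity: weighted sum of the basic block set similarity
    and the control flow graph similarity."""
    return sim_bbl_set * alpha + sim_graph * (1.0 - alpha)


def classify(score):
    """Three-way decision using the thresholds of this embodiment."""
    if 0.8 <= score <= 1.0:
        return "similar"
    if 0.0 <= score <= 0.4:
        return "not similar"
    return "undetermined"


print(classify(sim_pro(0.9, 0.75)))  # similar
print(classify(sim_pro(0.3, 0.2)))   # not similar
print(classify(sim_pro(0.7, 0.5)))   # undetermined
```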

In experiments, about 99% of the similarity results between programs in the same group are greater than 0.8, which shows that the method of this embodiment can effectively identify similar malicious programs. The average similarity between programs in different groups is no greater than 0.4, which shows that the method of this embodiment can distinguish malicious programs with low similarity.

In this embodiment of the invention, a program instruction stream snapshot is established using the PinWSand Box and stored in the form of an index library; the acquired instruction stream information is then preprocessed to remove interference noise from the basic block set and the control flow graph; finally, the similarity between malicious codes is obtained by combining the similarity of the program basic block sets and the similarity of the control flow graphs. Code similarity is considered from the two aspects of program structure and code semantics, so the measurement is more comprehensive and more accurate, homologous malicious codes can be effectively identified, and the method is more resistant to code transformation techniques such as packing.

Example two:

The invention further provides a binary program malicious code detection terminal device, which comprises a memory, a processor and a computer program that is stored in the memory and can run on the processor, wherein the processor executes the computer program to implement the steps of the method of the first embodiment of the invention.

Further, as an executable scheme, the binary program malicious code detection terminal device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The binary program malicious code detection terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above structure is only an example of the binary program malicious code detection terminal device and does not limit it; the device may include more or fewer components than described above, combine some components, or use different components. For example, the binary program malicious code detection terminal device may further include an input/output device, a network access device, a bus, and the like, which is not limited in this embodiment of the invention.

Further, as an executable solution, the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor or the processor may be any conventional processor, and the processor is a control center of the binary program malicious code detection terminal device, and various interfaces and lines are used to connect various parts of the entire binary program malicious code detection terminal device.

The memory can be used for storing the computer program and/or the module, and the processor implements various functions of the binary program malicious code detection terminal device by running or executing the computer program and/or the module stored in the memory and calling data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

The invention also provides a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method of an embodiment of the invention.

The module/unit integrated with the binary program malicious code detection terminal device can be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as an independent product. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, and the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A binary program malicious code detection method is characterized by comprising the following steps:

s2: preprocessing a basic block set in the extracted dynamic instruction stream;

2. The binary program malicious code detection method according to claim 1, wherein: the extraction of the dynamic instruction stream is carried out by a PinWSand Box, after the pintools in the PinWSand Box are adjusted, the dynamic instruction stream is extracted by taking a basic block as a unit, a basic block execution sequence is stored by adopting a basic block index library mode while the dynamic instruction stream is extracted, a unique number is set for each basic block, the basic block is represented by the unique number in the basic block execution sequence, the content of the basic block is recorded in the basic block index library, and the basic block execution sequence is recorded in a basic block execution sequence file.

3. The binary program malicious code detection method according to claim 1, wherein: the specific process of dynamic instruction stream extraction comprises the following steps:

s103: setting a unique number corresponding to the basic block;

s104: recording the content of the basic block in a basic block index library;

s106: the program executes the basic block;

4. The binary program malicious code detection method according to claim 1, wherein: preprocessing includes screening common instructions in a dynamic instruction stream, normalizing instructions with the same functionality, normalizing instruction formats, and simplifying basic blocks based on a DAG graph.

5. The binary program malicious code detection method according to claim 4, wherein: the method for simplifying the basic block based on the DAG graph comprises the following steps: in the DAG graph, the return value and the parameters after the instruction statement standardization are used as vertexes, the vertex where the return value is located is set as a def vertex, the vertex where the primarily used variable is located is a zero vertex, and other vertexes are user vertexes; judging whether the statement needs to be optimized or not by judging the number of parent nodes and child nodes of the vertex, and completing optimization of the DAG graph by merging or deleting the vertex; if a certain vertex has no father node, deleting the vertex; if a certain vertex has only one child node, merging the vertex and the child node thereof; and finally, restoring the optimized DAG graph into a basic block format.

6. The binary program malicious code detection method according to claim 1, wherein: the method for calculating the similarity among the basic block sets comprises the following steps: calculating the similarity among the basic blocks according to the def-use chains of the basic blocks, and constructing a similarity matrix of the basic block sets from the similarity among the basic blocks to calculate the similarity among the basic block sets.

7. The binary program malicious code detection method according to claim 6, wherein: the method for calculating the similarity of the two basic blocks comprises the following steps: converting semantic contents of two basic blocks used for calculating the similarity into four parts of a variable number, a constant number, a def-use chain set and a constant set of the basic blocks, wherein the calculation formula of the similarity of the two basic blocks is as follows:

8. The binary program malicious code detection method according to claim 6, wherein: the similarity between two basic block sets is calculated by the following formula:

9. The binary program malicious code detection method according to claim 1, wherein: the construction method of the control flow graph comprises the following steps: all numbers in the basic block execution sequence file are used as nodes of the graph, the adjacent relation of the numbers is used as an edge, the direction of the edge is pointed to the number behind the edge by the number appearing first, and the repeated appearance frequency of the edge is used as the weight of the edge.

10. The binary program malicious code detection method according to claim 1, wherein: before calculating the similarity between the two control flow graphs, the method further comprises the following step: merging the sequential structures in the control flow graph.

11. The binary program malicious code detection method according to claim 1, wherein: the similarity between the two control flow graphs is the ratio of the common subgraph scale of the two control flow graphs to the original graph scale.

12. A binary program malicious code detection terminal device is characterized in that: comprising a processor, a memory and a computer program stored in the memory and running on the processor, the processor implementing the steps of the method according to any of claims 1 to 11 when executing the computer program.

13. A computer-readable storage medium storing a computer program, characterized in that: the computer program when executed by a processor implementing the steps of the method as claimed in any one of claims 1 to 11.