CN114036514A

CN114036514A - Malicious code homologous analysis method and device and computer readable storage medium

Info

Publication number: CN114036514A
Application number: CN202111203593.8A
Authority: CN
Inventors: 黄娜
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2022-02-11

Abstract

The invention discloses a malicious code homology analysis method, a device, and a computer-readable storage medium. The malicious code homology analysis method based on multi-label classification comprises the following steps: collecting malicious code samples and determining at least one family to which each malicious code sample belongs; acquiring an assembly instruction sequence of each malicious code sample, and truncating or padding the assembly instruction sequence to the average length so that all assembly instruction sequences have the same length; constructing a multi-label attention classification model, and training it on the assembly instruction sequences processed in this way; and inputting the assembly instruction sequence of the malicious code to be analyzed into the trained multi-label attention classification model to obtain a list of the families to which the malicious code to be analyzed belongs, wherein the list comprises at least one family. The invention allows a single piece of malicious code to be assigned to several families, which helps homology analysis to be carried out more accurately.

Description

Malicious code homologous analysis method and device and computer readable storage medium

Technical Field

The present invention relates to the technical field of malicious code homologous analysis, and in particular, to a malicious code homologous analysis method, a malicious code homologous analysis device, and a computer-readable storage medium.

Background

Malicious code homology analysis (malware homology analysis) examines the derivation relationships between pieces of malicious code through their internal and external characteristics and their patterns of generation and propagation. Malicious code comes in many types, including computer viruses, worms, Trojan horse programs, backdoor programs, logic bombs, and the like. Malicious code of the same type tends to be similar in several respects: (1) functional code: to implement the same malicious function, key code fragments may be similar, and these similar fragments are also called genetic code; for example, the key functional code of Duqu and Stuxnet, such as DLL injection and RPC services, is highly similar, and both are considered to be malicious code used to launch attacks against Iranian nuclear facilities; (2) system function calls: malicious behavior usually depends on calls to operating system functions, and the names, frequencies, order, and the like of the called functions may be similar; (3) functional behavior: each type of malicious code has its own targeted destructive behavior, for example, ransomware reads and writes user data, and a remote-control Trojan can view the screen or camera; similarity of functional behavior is reflected in files, processes, networks, registries, and the like.

When malicious code families are defined by type, the families are not independent but overlap and are associated with one another; for example, one piece of malicious code may belong to both the spyware family and the screen-capture tool family. Existing homology analysis methods based on machine learning classification assign a piece of malicious code to exactly one family, which reduces the accuracy of type determination.

Disclosure of Invention

The embodiment of the invention provides a malicious code homologous analysis method, malicious code homologous analysis equipment and a computer readable storage medium, which are used for solving the problem of low accuracy of a homologous analysis method based on a machine learning classification technology in the prior art.

The malicious code homologous analysis method based on multi-label classification comprises the following steps:

collecting malicious code samples and determining at least one family to which each of the malicious code samples belongs;

acquiring an assembly instruction sequence of each malicious code sample, and truncating or padding the assembly instruction sequence to the average length so that all assembly instruction sequences have the same length;

constructing a multi-label attention classification model, and training the multi-label attention classification model on the assembly instruction sequences processed to the average length;

inputting the assembly instruction sequence of the malicious code to be analyzed into the trained multi-label attention classification model to obtain a list of the families to which the malicious code to be analyzed belongs, wherein the list comprises at least one family.

According to some embodiments of the invention, the determining of at least one family to which each of the malicious code samples belongs comprises:

setting a corresponding label list for each malicious code sample based on the coding characteristics and behavior attributes of that sample, wherein the label list comprises a plurality of elements, each element corresponds to one family, and each element takes either a first value or a second value: the first value indicates that the malicious code sample belongs to the family corresponding to the element, and the second value indicates that it does not.

According to some embodiments of the invention, the obtaining of the sequence of assembly instructions of each malicious code sample comprises:

disassembling the malicious code sample with the IDA tool to obtain the assembly instruction sequence of the malicious code sample.

According to some embodiments of the invention, the constructing the multi-label attention classification model comprises:

constructing an input layer of the multi-label attention classification model, wherein the input layer is used for characterizing each assembler instruction in the assembler instruction sequence into a word vector by adopting a Skip-gram method;

constructing a hidden layer of the multi-label attention classification model, the hidden layer being configured to process the word vectors using equation 1,

$$h(x_i) = x_i \cdot w + b \tag{1}$$

where $x_i$ denotes a word vector, $i \in [0, m-1]$, $m$ is the length of the assembly instruction sequence, and $w$ and $b$ are shared parameters;

constructing an attention layer of the multi-label attention classification model, wherein the attention layer is used for assigning, according to Formula 2, different weights to the hidden-layer outputs $h(x_i)$ for different families,

$$z = h(X) \cdot C_{m \times p} \tag{2}$$

where $X$ denotes the word vector matrix, $C_{m \times p}$ denotes the attention weight matrix of dimension $m \times p$, and $p$ denotes the number of families;

constructing an output layer of the multi-label attention classification model, wherein the output layer is used for outputting, according to Formula 3, the probability corresponding to each family by means of a Sigmoid function,

$$\sigma(z^T_j) = \frac{1}{1 + e^{-z^T_j}} \tag{3}$$

where $z$ is the output of the attention layer, $z^T_j$ denotes its $j$-th element, and $j \in [0, p-1]$.
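The input layer described above relies on the Skip-gram method, which learns a vector for each token by predicting the tokens that surround it. As a rough illustration only, the sketch below generates the (center, context) training pairs that a Skip-gram model would consume from one assembly instruction sequence. The function name `skipgram_pairs` and the `window` size are assumptions, and the actual training of the word vectors is not shown.

```python
def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs for every token and its neighbours.

    `window` is the number of neighbours considered on each side
    (an illustrative choice, not a value taken from the text).
    """
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                pairs.append((center, sequence[j]))
    return pairs


ops = ["push", "mov", "call", "ret"]
pairs = skipgram_pairs(ops, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair asks the embedding model to predict a neighbouring instruction from the current one, so instructions that appear in similar contexts end up with similar vectors.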

According to some embodiments of the invention, the training the multi-label attention classification model based on the assembler instruction sequence processed by the average length technique comprises:

training the multi-label attention classification model by stochastic gradient descent on the assembly instruction sequences processed to the average length.

According to some embodiments of the present invention, the training the multi-label attention classification model by using a stochastic gradient descent method based on the assembler instruction sequence processed by using an average length technique includes:

setting a key-value pair list, and, after each malicious code sample is trained, storing the hidden-layer output of the malicious code sample as a key and the attention weight matrix as a value, in key-value pair form, in a list ω, where ω = {<key₁, key₂, ...>, <value₁, value₂, ...>};

According to some embodiments of the present invention, the storing of the hidden-layer output as a key and the attention weight matrix as a value in the key-value pair list ω comprises:

determining whether the current key exists in ω; if not, adding the current key and its value to ω; and if so, calculating the average of the values corresponding to the current key and updating ω with that average.

According to some embodiments of the present invention, the inputting of the assembly instruction sequence of the malicious code to be analyzed into the trained multi-label attention classification model to obtain the list of families to which the malicious code to be analyzed belongs comprises:

acquiring the assembly instruction sequence of the malicious code to be analyzed, and truncating or padding it so that its length is m;

inputting the assembly instruction sequence of length m into the trained multi-label attention classification model to obtain the hidden-layer output of the malicious code;

calculating the similarity between the hidden-layer output of the malicious code to be analyzed and each key in ω, normalizing all similarity values with softmax, and computing the weighted sum of the values in ω using the normalized similarities as weights, the result serving as the attention weight matrix for the hidden-layer output of the malicious code to be analyzed;

and obtaining the list of families to which the malicious code to be analyzed belongs based on its hidden-layer output and attention weight matrix.

According to the malicious code homologous analysis equipment based on multi-label classification, the malicious code homologous analysis equipment comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of multi-tag classification-based malicious code homology analysis as described above.

According to the embodiment of the invention, the computer readable storage medium stores the information transfer implementation program, and the program is executed by the processor to implement the steps of the malicious code homology analysis based on multi-label classification as described above.

By adopting the embodiment of the invention, the multi-label classification is carried out on the malicious codes according to the characteristics of the assembly instruction, so that each malicious code can be judged to belong to a plurality of families, and the method and the device are favorable for more accurately carrying out homologous analysis.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. In the drawings:

FIG. 1 is a flowchart of a malicious code homology analysis method based on multi-tag classification according to an embodiment of the present invention;

FIG. 2 is a flowchart of a malicious code homology analysis method based on multi-tag classification according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating assembly instruction sequence fetching according to an embodiment of the present invention;

FIG. 4 is a diagram of a multi-label attention classification model in an embodiment of the invention;

FIG. 5 is a schematic illustration of attention layer processing in an embodiment of the invention;

FIG. 6 is a block diagram of a malicious code homology analysis device based on multi-tag classification in the embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Additionally, in some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

An embodiment of a first aspect of the present invention provides a malicious code homology analysis method based on multi-tag classification, as shown in fig. 1, including:

S1, collecting malicious code samples, and determining at least one family to which each malicious code sample belongs;

S2, acquiring an assembly instruction sequence of each malicious code sample, and truncating or padding the assembly instruction sequence to the average length so that all assembly instruction sequences have the same length;

S3, constructing a multi-label attention classification model, and training the multi-label attention classification model on the assembly instruction sequences processed to the average length;

S4, inputting the assembly instruction sequence of the malicious code to be analyzed into the trained multi-label attention classification model to obtain a list of the families to which the malicious code to be analyzed belongs, wherein the list comprises at least one family.

On the basis of the above-described embodiment, various modified embodiments are further proposed, and it is to be noted herein that, in order to make the description brief, only the differences from the above-described embodiment are described in the various modified embodiments.

The malicious code homology analysis method based on multi-tag classification according to the embodiment of the invention is described in detail in a specific embodiment with reference to fig. 2 to 5. It is to be understood that the following description is illustrative only and is not intended to be in any way limiting. All similar structures and similar variations thereof adopted by the invention are intended to fall within the scope of the invention.

Referring to fig. 2, the malicious code homology analysis method based on multi-tag classification in the embodiment of the present invention includes the following specific steps:

Step 1: Establish a data set.

Each malicious code type is treated as one family. Malicious code samples are collected from channels such as public networks and professional databases, the coding characteristics, behavior attributes, and the like of each sample are analyzed manually (or with corresponding tools), the types of each sample are determined, and a multi-dimensional label list is set for each sample. Each element in the label list corresponds to one family; the elements corresponding to the families to which the sample belongs are set to 1, and the elements corresponding to all other families are set to 0.

For example, based on common malicious code types, 14 malicious code families are first defined: {adware; backdoor; bot; bootkit; ddos; dropper; exploit-kit; keylogger; ransomware; rogue-security-software; rootkit; screen-capture; spyware; trojan}.

With these 14 families, the label list set for each collected sample is 14-dimensional. Each element in the label list corresponds to one of the families listed above, in the same order. For example, if a piece of malicious code belongs to both Spyware and Screen-capture, its label list is {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0}.
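The label list described above is a multi-hot encoding over the 14 families. A minimal sketch of the encoding is given below; the helper names `encode_labels` and `decode_labels` are hypothetical, and the ninth family is written as `ransomware`, which is an assumption about the intended name.

```python
# The 14 families, in the order in which they are listed in the text.
FAMILIES = [
    "adware", "backdoor", "bot", "bootkit", "ddos", "dropper",
    "exploit-kit", "keylogger", "ransomware", "rogue-security-software",
    "rootkit", "screen-capture", "spyware", "trojan",
]


def encode_labels(sample_families):
    """Return a multi-hot list: 1 where the sample belongs to a family, else 0."""
    wanted = {name.lower() for name in sample_families}
    unknown = wanted - set(FAMILIES)
    if unknown:
        raise ValueError(f"unknown families: {sorted(unknown)}")
    return [1 if name in wanted else 0 for name in FAMILIES]


def decode_labels(label_list):
    """Return the family names whose element is set to 1."""
    return [name for name, flag in zip(FAMILIES, label_list) if flag == 1]


labels = encode_labels(["Spyware", "Screen-capture"])
print(labels)
print(decode_labels(labels))
```

Because each element is independent of the others, a sample can carry any number of 1s, which is what distinguishes this multi-label setting from single-family classification.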

Step 2: an assembly instruction sequence of malicious code is extracted.

The collected malicious code is disassembled and its assembly instruction sequence $OP = \{op_0, op_1, \ldots, op_m\}$ is extracted, as shown in fig. 3. The average length m of the instruction sequences of all samples is calculated; an instruction sequence longer than m is truncated to its first m instructions, and an instruction sequence shorter than m is padded with null values up to length m.

Step 3: Establish a multi-label attention classification model and train it on the assembly instruction sequences of the data set obtained in Step 2.
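The average-length processing in this step can be sketched as follows. This is only an illustration: the padding token `PAD`, the rounding of the average to an integer, and the function names are assumptions, since the text specifies only that longer sequences keep their first m instructions and shorter ones are filled with null values.

```python
PAD = "<pad>"  # hypothetical stand-in for the "null value" used for padding


def average_length(sequences):
    """Average sequence length, rounded to an integer (rounding is an assumption)."""
    return max(1, round(sum(len(s) for s in sequences) / len(sequences)))


def fit_to_length(sequence, m):
    """Keep the first m instructions, or pad with PAD up to length m."""
    if len(sequence) >= m:
        return list(sequence[:m])
    return list(sequence) + [PAD] * (m - len(sequence))


samples = [
    ["push", "mov", "call"],
    ["mov", "xor", "add", "jmp", "ret"],
    ["push", "mov", "sub", "call", "pop", "ret", "nop"],
]
m = average_length(samples)  # (3 + 5 + 7) / 3
fixed = [fit_to_length(s, m) for s in samples]
for seq in fixed:
    print(seq)
```

Using the average rather than the maximum length keeps the model input small at the cost of discarding the tail of long samples.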

As shown in fig. 4, the multi-label attention classification model includes:

An input layer, which uses the Skip-gram method to represent each instruction as an n-dimensional word vector.

A hidden layer, whose operation is given by formula (1),

$$h(x_i) = x_i \cdot w + b \tag{1}$$

where $x_i$ denotes the word vector of an instruction, $i \in [0, m-1]$, and the parameters $w$ and $b$ are shared.

An attention layer, whose operation is given by formula (2),

$$z = h(X) \cdot C_{m \times p} \tag{2}$$

where $X$ denotes the word vector matrix, $C_{m \times p}$ denotes the attention weight matrix of dimension $m \times p$, and $p$ is the number of families.

With the attention layer, different weights can be given to the output of the hidden layer for different families. For example, suppose a malicious code sample can be determined to belong to Spyware from the first 10 instructions of its instruction sequence and to Screen-capture from the last 10 instructions. In that case, the Spyware family assigns higher attention weights to the first 10 instructions, and the Screen-capture family assigns higher attention weights to the last 10 instructions.

A classification layer (namely the output layer), which outputs the probability corresponding to each family using a Sigmoid function; its operation is given by formula (3),

$$\sigma(z^T_j) = \frac{1}{1 + e^{-z^T_j}} \tag{3}$$

where $z$ is the output of the attention layer (dimension $1 \times p$), $z^T_j$ denotes its $j$-th element, and $j \in [0, p-1]$.
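The hidden, attention, and output layers above can be traced with a small forward pass. The sketch assumes that $w$ is an n-dimensional vector, so that each $h(x_i)$ is a scalar and $h(X)$ is a $1 \times m$ row; this reading is inferred from the stated dimensions of $C_{m \times p}$ and $z$ and is not spelled out in the text. All numeric values are made up for illustration.

```python
import math


def hidden_layer(X, w, b):
    """Formula (1): h(x_i) = x_i . w + b, with w and b shared by all instructions."""
    return [sum(x_k * w_k for x_k, w_k in zip(x_i, w)) + b for x_i in X]


def attention_layer(h, C):
    """Formula (2): z = h(X) . C, where C is m x p; returns one score per family."""
    p = len(C[0])
    return [sum(h[i] * C[i][j] for i in range(len(h))) for j in range(p)]


def sigmoid(v):
    """Numerically stable logistic function used by the output layer."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def forward(X, w, b, C):
    """Run the three layers and return one probability per family."""
    h = hidden_layer(X, w, b)
    z = attention_layer(h, C)
    return [sigmoid(z_j) for z_j in z]


# m = 4 instructions, n = 2 embedding dimensions, p = 2 families.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
w = [2.0, -1.0]
b = 0.0
# Family 0 attends to the first two instructions, family 1 to the last two.
C = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
probs = forward(X, w, b, C)
print(probs)
```

Because each column of `C` weights the instructions separately, the two families can draw on different parts of the same sequence, and the Sigmoid output lets both probabilities be high at once.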

The data set is used as training data, and the model is trained by stochastic gradient descent: for each sample, the cross-entropy loss is calculated and its gradient is propagated backward, and the parameters of the hidden layer and the attention layer are adjusted accordingly.
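One stochastic gradient descent update for this model can be sketched as below. The sketch assumes a binary cross-entropy loss summed over the family labels and a scalar hidden output per instruction. The learning rate, the initial values, and the function names are illustrative assumptions, not values from the text.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def forward(X, w, b, C):
    """Return hidden outputs h (length m) and family probabilities (length p)."""
    h = [sum(xk * wk for xk, wk in zip(x, w)) + b for x in X]
    z = [sum(h[i] * C[i][j] for i in range(len(h))) for j in range(len(C[0]))]
    return h, [sigmoid(v) for v in z]


def cross_entropy(probs, y, eps=1e-12):
    """Binary cross-entropy summed over the p family labels."""
    return -sum(t * math.log(q + eps) + (1 - t) * math.log(1.0 - q + eps)
                for q, t in zip(probs, y))


def sgd_step(X, y, w, b, C, lr=0.1):
    """One update on a single sample; returns new (w, b, C) and the pre-update loss."""
    h, probs = forward(X, w, b, C)
    loss = cross_entropy(probs, y)
    dz = [q - t for q, t in zip(probs, y)]          # dL/dz_j for sigmoid + cross-entropy
    dC = [[h_i * d for d in dz] for h_i in h]       # dL/dC_ij = h_i * dz_j
    dh = [sum(C[i][j] * dz[j] for j in range(len(dz))) for i in range(len(h))]
    dw = [sum(dh[i] * X[i][k] for i in range(len(X))) for k in range(len(w))]
    db = sum(dh)
    new_w = [wk - lr * g for wk, g in zip(w, dw)]
    new_b = b - lr * db
    new_C = [[C[i][j] - lr * dC[i][j] for j in range(len(dz))] for i in range(len(h))]
    return new_w, new_b, new_C, loss


# Toy sample: m = 3 instructions, n = 2 embedding dimensions, p = 2 families.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, 0]
w, b = [0.1, -0.1], 0.0
C = [[0.1, -0.1], [0.2, 0.1], [-0.1, 0.2]]
losses = []
for _ in range(200):
    w, b, C, loss = sgd_step(X, y, w, b, C)
    losses.append(loss)
print(losses[0], losses[-1])
```

Both the hidden-layer parameters (`w`, `b`) and the attention weights (`C`) receive gradients, matching the statement that the parameters of both layers are adjusted.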

During this process, a key-value pair list, denoted ω, is set up to store the hidden-layer outputs and the corresponding attention weights: ω = {<key₁, key₂, ...>, <value₁, value₂, ...>}. In the last round of training, after each sample is trained, the output of the hidden layer is taken as a key and the attention weight matrix as a value. If the key does not exist in ω, the key and value are added; if the key already exists, the average of the values corresponding to that key is calculated and stored in ω.

Step 4: Perform homology analysis on unknown malicious code using the model.

The instruction sequence of the unknown malicious code is extracted as in Step 2 and input into the model trained in Step 3. At this point the attention layer operates differently from the training phase: as shown in fig. 5, the similarity between the hidden-layer output h(X) and each key in ω is calculated, the similarities are normalized with softmax, and the values in ω are summed with these normalized weights to give the attention weight matrix $C_{m \times p}$ for h(X). The calculation then continues as in Step 3 to obtain the probability corresponding to each family. A probability threshold is preset; if the probability for a family exceeds the threshold, the malicious code is judged to belong to that family, and otherwise it is judged not to belong to it.
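The update rule for ω can be sketched with a dictionary. Several details here are assumptions made for illustration: the hidden-layer output is rounded and converted to a tuple so that it can serve as a dictionary key, and "the average" is taken as the element-wise mean of the stored matrix and the new matrix. The text stores ω as parallel key and value lists, for which a dictionary is an equivalent representation.

```python
def make_key(hidden_output, decimals=6):
    """Turn a hidden-layer output into a hashable key (rounding is an assumption)."""
    return tuple(round(v, decimals) for v in hidden_output)


def update_store(omega, hidden_output, attention):
    """Insert a new key, or average the stored value with the new one."""
    key = make_key(hidden_output)
    if key not in omega:
        omega[key] = [row[:] for row in attention]  # store a copy
    else:
        stored = omega[key]
        omega[key] = [
            [(a + c) / 2.0 for a, c in zip(row_old, row_new)]
            for row_old, row_new in zip(stored, attention)
        ]
    return omega


omega = {}
update_store(omega, [0.5, -1.0], [[1.0, 0.0], [0.0, 1.0]])
update_store(omega, [0.5, -1.0], [[0.0, 0.0], [1.0, 1.0]])  # same key: averaged
update_store(omega, [2.0, 0.0], [[0.2, 0.8], [0.4, 0.6]])   # new key: added
for key, value in omega.items():
    print(key, value)
```

A running mean over every occurrence of a key would be an alternative reading of the averaging step; the pairwise mean shown here weights recent samples more heavily.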
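The inference-time attention lookup and thresholding can be sketched as follows. The similarity measure is not specified in the text, so a dot product is assumed here; the threshold of 0.5, the contents of `omega`, and the function names are likewise illustrative.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def lookup_attention(h, omega):
    """Softmax-weighted sum of the stored attention matrices."""
    keys = list(omega.keys())
    sims = [sum(a * k for a, k in zip(h, key)) for key in keys]  # dot product (assumed)
    weights = softmax(sims)
    m, p = len(h), len(omega[keys[0]][0])
    C = [[0.0] * p for _ in range(m)]
    for weight, key in zip(weights, keys):
        V = omega[key]
        for i in range(m):
            for j in range(p):
                C[i][j] += weight * V[i][j]
    return C


def predict_families(h, omega, families, threshold=0.5):
    """Return the families whose probability exceeds the threshold, plus all probabilities."""
    C = lookup_attention(h, omega)
    z = [sum(h[i] * C[i][j] for i in range(len(h))) for j in range(len(families))]
    probs = [sigmoid(v) for v in z]
    return [f for f, q in zip(families, probs) if q > threshold], probs


# Two stored training samples with m = 2 and p = 2 (made-up values).
omega = {
    (1.0, 0.0): [[2.0, -2.0], [0.0, 0.0]],
    (0.0, 1.0): [[0.0, 0.0], [-2.0, 2.0]],
}
h_unknown = [3.0, 0.0]  # hidden-layer output of the sample to be analyzed
predicted, probs = predict_families(h_unknown, omega, ["spyware", "screen-capture"])
print(predicted, probs)
```

The unknown sample borrows attention weights mostly from the stored samples whose hidden-layer outputs it resembles, so no attention parameters need to be learned for it at inference time.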

By adopting the embodiment of the invention, malicious code is classified with multiple labels, and the attention layer of the model lets each label draw on the information in its own corresponding segment of instructions, which helps homology analysis to be carried out more accurately when families overlap.

It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention, and those skilled in the art can make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

A second aspect of the embodiment of the present invention provides a malicious code homologous analysis apparatus 1000, as shown in fig. 6, including: a memory 1010, a processor 1020 and a computer program stored on the memory 1010 and executable on the processor 1020, wherein the computer program when executed by the processor 1020 implements the steps of the multi-tag classification-based malicious code homology analysis method according to the embodiment of the first aspect.

An embodiment of a third aspect of the present invention provides a computer-readable storage medium on which an information transfer implementation program is stored, where the program, when executed by a processor, implements the steps of the malicious code homology analysis method based on multi-label classification according to the embodiment of the first aspect.

It should be noted that the computer-readable storage medium in this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like. The program may be executed by a mobile phone, a computer, a server, an air conditioner, a network device, or the like.

Reference in the specification to the terms "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Although some embodiments described herein include some features included in other embodiments instead of others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. The particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. For example, in the claims, any of the claimed embodiments may be used in any combination.

In addition, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A malicious code homologous analysis method based on multi-label classification is characterized by comprising the following steps:

2. The multi-tag classification-based malicious code homology analysis method according to claim 1, wherein the determining at least one population to which each malicious code sample belongs comprises:

3. The multi-label classification-based malicious code homology analysis method according to claim 1, wherein the obtaining of the assembly instruction sequence of each malicious code sample comprises:

4. The multi-label classification-based malicious code homology analysis method according to claim 1, wherein the constructing of the multi-label attention classification model comprises:

$$h(x_i) = x_i \cdot w + b \tag{1}$$

$$z = h(X) \cdot C_{m \times p} \tag{2}$$

5. The method for malicious code homology analysis based on multi-label classification as claimed in claim 4, wherein the training of the multi-label attention classification model based on the assembly instruction sequence processed by the average length technique comprises:

6. The method for malicious code homology analysis based on multi-label classification as claimed in claim 5, wherein the training of the multi-label attention classification model by using a stochastic gradient descent method based on the assembly instruction sequence processed by using an average length technique comprises:

setting a key-value pair list, and, after each malicious code sample is trained, storing the hidden-layer output of the malicious code sample as a key and the attention weight matrix as a value, in key-value pair form, in a list ω, where ω = {<key₁, key₂, ...>, <value₁, value₂, ...>}.

7. The malicious code homology analysis method based on multi-label classification as claimed in claim 6, wherein the storing the output of the hidden layer as key and attention weight matrix as value in a key-value pair list ω table comprises:

8. The method according to claim 6 or 7, wherein the step of inputting the assembly instruction sequence of the malicious code to be analyzed into the trained multi-tag attention classification model to obtain the population list to which the malicious code to be analyzed belongs includes:

9. A malicious code homology analysis device based on multi-label classification, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of the multi-tag classification-based malicious code homology analysis according to any of claims 1 to 8.

10. A computer-readable storage medium, on which an information transfer implementation program is stored, and which when executed by a processor implements the steps of the multi-tag classification-based malicious code homology analysis according to any one of claims 1 to 8.