CN115758362A

CN115758362A - Multi-feature-based automatic malicious software detection method

Info

Publication number: CN115758362A
Application number: CN202211511935.7A
Authority: CN
Inventors: 李益洲; 杨星; 李梦龙
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-11-29
Filing date: 2022-11-29
Publication date: 2023-03-07

Abstract

The invention discloses a multi-feature-based automatic malware detection method comprising the following steps. Step S1: preprocess the data. Step S2: extract features from the data to obtain feature vectors. Step S3: align and fuse the binary feature vector obtained in step S2 with the operation code feature vector generated by the triangular attention mechanism to produce a final fused vector. Step S4: detect and classify the malware. The method detects malware variants effectively and with higher accuracy. By identifying variants more efficiently in a real production environment, it improves the performance of automated malware analysis tools and thereby reduces labor costs.

Description

Multi-feature-based automatic malicious software detection method

Technical Field

The invention belongs to the field of software, and particularly relates to a malicious software automatic detection method based on multiple characteristics.

Background

The Internet is steadily permeating every area of human life and work, and the continual, large-scale emergence of new Internet applications is profoundly changing social life in the information age. The global outbreak of COVID-19 has moved working practices in many fields online, and the surge in remote working has also created opportunities for malware developers: cybercriminals have produced many kinds of malicious software, such as adware, ransomware, and cryptomining malware. According to Kaspersky Security Network (KSN) statistics for November 2019 to October 2021, within a single year 15.45% of Internet users' computers worldwide suffered at least one malware attack, with ransomware and cryptomining malware especially rampant. Malware collects private information on a target network, abuses server resources, and can even infect hosts across the Internet through botnet programs. These behaviors pose a serious threat to the security of individuals and of every sector, so detecting and responding to newly emerging malware in time has become a basic requirement for secure remote working.

Malware detection methods fall mainly into static detection and dynamic detection. Static detection generally extracts features of the malware, such as binary sequences, assembly code, and operation code sequences, using static analysis tools such as decompilers, and uses them to judge whether the code is malicious. The extracted key features are used to generate feature signatures, and detection is usually based on rule matching. Dynamic detection is generally divided into dynamic monitoring and dynamic analysis. Dynamic monitoring observes software behavior by running the target software in a virtualized environment and using techniques such as live debugging. Dynamic analysis runs the target software in an environment such as a sandbox, collects information such as the API (application programming interface) call sequence and system resource usage, and then analyzes it further.

Disclosure of Invention

In view of the shortcomings of the prior art and the need to improve on it, the invention provides a multi-feature-based automatic malware detection method.

The specific technical scheme is as follows:

The multi-feature-based automatic malware detection method comprises the following steps:

step S1: preprocessing data;

step S2: performing feature extraction on the data to obtain a feature vector;

step S3: aligning and fusing the binary feature vector obtained in step S2 with the operation code feature vector generated by the triangular attention mechanism to generate a final fused vector;

step S4: detecting and classifying the malware.

Preferably, step S1 comprises the following sub-steps:

substep S11: after the original malware file is obtained, performing static decompilation on it with the IDA Pro tool to obtain a binary file and an assembly file of the malware;

substep S12: counting the frequency of the key operation codes in the assembly file.

Preferably, the substep S12 comprises the substeps of:

substep S121: dividing each program into a plurality of subprogram blocks, and calculating the frequency of the key operation codes of each subprogram block;

substep S122: these key opcodes include the basic opcode, the variable name and the register name.

Preferably, step S2 comprises the following sub-steps:

substep S21: the malware sequence information in binary form contains key information such as core functions and resource calls, which compilation usually scatters to different positions in the sequence; a stacked two-layer one-dimensional convolutional neural network model is used to encode the binary file, which addresses the long-distance dependency problem of binary text;

substep S22: extracting operation code information, and calculating the internal correlation of operation code blocks and the long-distance relation between different operation code blocks by using a currently advanced triangular attention algorithm;

substep S23: calculating the relevance scores between the operation codes in each subprogram block and the other operation codes, so as to obtain a matrix vector describing the malware assembly file.

Preferably, step S3 comprises the following sub-steps:

substep S31: using a gated self-attention module to further filter the features and retain the key information;

substep S32: using a cross-attention module with the self-attention calculation, taking the operation code matrix vector as the condition for computing over the binary feature vector, so that the output vector fuses the information of both feature vectors;

substep S33: using a residual network structure to reduce information loss as features propagate through the model during training.

Preferably, step S4 comprises the following sub-steps:

substep S41: generating a classification result using the DNN network;

substep S42: at the output layer, using the softmax function to assign the input software to its malware family;

substep S43: using a cross-entropy loss function during training.

Compared with the prior art, the invention has the following beneficial effects:

1. A stacked two-layer one-dimensional convolutional network is provided for efficiently extracting features from the binary files of malware.

2. A triangular attention algorithm is used to extract features from the assembly file, computing the features that deserve the most attention with the operation code and the assembly code block each taken in turn as the subject.

3. A multi-feature neural network classification model is provided, which achieves good malware classification performance by using a multi-feature alignment and fusion algorithm.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is a diagram of a stacked two-layer one-dimensional convolutional neural network model of the present invention.

Detailed Description

The embodiments of the present invention are described in further detail below with reference to the drawings and examples. The detailed description and the accompanying drawings illustrate the principles of the invention and are not intended to limit its scope, which is defined by the claims and their equivalents. Those skilled in the art can, without inventive effort, combine features of the embodiments to form new embodiments, and these also fall within the scope of the invention.

A hybrid attention model for malware detection is presented that efficiently integrates binary files and assembly code. To overcome incomplete capture of binary sequence information, the binary sequence is used as the raw input and features are extracted through a stacked two-layer convolutional network: the first convolution layer extracts temporal information, and the second captures the discontinuities introduced by function calls and jumps. In parallel, a triangular attention module extracts code-level features, taking into account the overall relationships between functions together with the inherent usage patterns of operation codes. Finally, a cross-attention module aligns and fuses the two kinds of features, and the fused feature vector is fed into a network for training. The network can thus learn the relationship between the binary file and the assembly code, which improves the stability of the fused feature representation and yields better performance in independent testing.

Fig. 1 is a main design diagram of the technical solution of the present invention. As shown in fig. 1, the automatic malware detection method based on multiple features provided by the present invention includes the following four modules:

Module a: data preprocessing. After the original malware file is obtained, it is statically decompiled with the IDA Pro tool to obtain a binary file and an assembly file of the malware. An operation code sequence is then extracted from the assembly file as follows: each program is divided into several subprogram blocks, and the frequency of occurrence of the key operation codes in each block is calculated. The key operation codes comprise basic operation codes, variable names, and register names.
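The per-block frequency counting in module a can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the vocabulary `KEY_TOKENS`, the fixed-size splitting rule in `split_into_blocks`, and the use of relative rather than raw frequencies are all hypothetical choices, since the text does not specify them.

```python
from collections import Counter

# Hypothetical key-token vocabulary (basic opcodes and register names);
# the patent does not enumerate one.
KEY_TOKENS = ["mov", "push", "call", "jmp", "eax", "ebx"]


def split_into_blocks(lines, block_size):
    """Split assembly lines into fixed-size subprogram blocks (assumed rule)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def block_frequencies(block, vocab=KEY_TOKENS):
    """Return the relative frequency of each vocabulary token in one block."""
    tokens = []
    for line in block:
        code = line.split(";", 1)[0]  # drop a trailing comment, if any
        tokens.extend(code.replace(",", " ").lower().split())
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in vocab]


def opcode_matrix(lines, block_size=4):
    """One frequency vector per subprogram block."""
    return [block_frequencies(b) for b in split_into_blocks(lines, block_size)]


asm = [
    "push ebp",
    "mov ebp, esp",
    "mov eax, [ebp+8]",
    "call sub_401000   ; helper",
    "mov ebx, eax",
    "jmp loc_401020",
]
matrix = opcode_matrix(asm, block_size=4)  # 2 blocks x 6 vocabulary entries
```

Each row of `matrix` describes one subprogram block, which is the shape of input the opcode feature extractor in module b would consume.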

Module b: feature extraction. Features are further extracted from the binary file obtained in module a and from the operation codes extracted from the assembly file. First, the malware sequence information in binary form contains key information such as core functions and resource calls, which compilation usually scatters to different positions in the sequence. A stacked two-layer one-dimensional convolutional neural network model is therefore used to encode the binary file, which addresses the long-distance dependency problem of binary text. Second, operation code information is extracted. The correlation within operation code blocks and the long-distance relationships between different operation code blocks are calculated using an advanced triangular attention algorithm. The algorithm calculates relevance scores between the operation codes in each subprogram block and the other operation codes, yielding a matrix vector that describes the malware assembly file.
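The text does not define the triangular attention algorithm, so the sketch below does not attempt to reproduce it. As a stand-in, it uses plain scaled dot-product self-attention, without learned projections, to show how a relevance score matrix between subprogram blocks could be computed. The function names and the toy input are assumptions for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_scores(blocks):
    """Row i holds the attention weights of block i over all blocks."""
    d = len(blocks[0])
    scores = []
    for q in blocks:
        row = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in blocks]
        scores.append(softmax(row))
    return scores


def attend(blocks):
    """Re-express each block as a relevance-weighted mix of all blocks."""
    weights = relevance_scores(blocks)
    d = len(blocks[0])
    return [
        [sum(w * v[j] for w, v in zip(row, blocks)) for j in range(d)]
        for row in weights
    ]


# Toy per-block opcode frequency vectors (hypothetical values).
blocks = [[0.4, 0.2, 0.4], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
scores = relevance_scores(blocks)   # 3 x 3 relevance matrix
opcode_features = attend(blocks)    # 3 x 3 matrix describing the assembly file
```

Because every block attends to every other block, distant blocks can influence each other directly, which is the long-distance relationship the module is meant to capture.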

The stacked two-layer one-dimensional convolutional neural network model proposed by the present invention is further described with reference to fig. 2.

The binary sequence is first input into a one-dimensional convolution to extract byte features within a fixed window size.

Next, a dilated convolution extracts feature vectors. By enlarging the receptive field, it widens the range of information covered by each convolution output and so addresses the long-distance dependency problem of binary text.

The feature map of the two-layer one-dimensional convolution output is input to the pooling layer to reduce dimensionality and compress the features.

The basic block consisting of the two one-dimensional convolution layers and the pooling layer is repeated until the dimension of the output features matches the size accepted by the subsequent module.
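The structure described above (plain convolution, dilated convolution, pooling, repeated until the output is small enough) can be sketched in pure Python. This is a single-channel illustration with fixed kernels, whereas a real model would use many learned channels. The ReLU activation, the max pooling, the kernel values, and the `max_len` stopping rule are assumptions not stated in the text.

```python
def conv1d(seq, kernel, dilation=1):
    """Valid 1-D convolution with optional dilation, followed by ReLU."""
    span = (len(kernel) - 1) * dilation
    out = []
    for i in range(len(seq) - span):
        s = sum(w * seq[i + j * dilation] for j, w in enumerate(kernel))
        out.append(max(0.0, s))
    return out


def max_pool(seq, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


def basic_block(seq, k1, k2, dilation=2, pool=2):
    local = conv1d(seq, k1)                      # byte features in a fixed window
    wide = conv1d(local, k2, dilation=dilation)  # dilation widens the receptive field
    return max_pool(wide, pool)                  # reduce dimensionality


def encode(byte_seq, k1, k2, max_len):
    """Repeat the basic block until the output fits the next module."""
    x = [b / 255.0 for b in byte_seq]
    while len(x) > max_len:
        x = basic_block(x, k1, k2)
    return x


k1 = [0.25, 0.5, 0.25]   # hypothetical fixed kernels; real ones are learned
k2 = [1.0, -1.0, 1.0]
features = encode(bytes(range(64)), k1, k2, max_len=16)
```

With a kernel size of 3 in both layers and a dilation of 2 in the second, each value before pooling depends on 7 consecutive input bytes, compared with 5 for two undilated layers.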

Module c: alignment and fusion. This module aligns and fuses the binary feature vectors obtained by module b with the matrix vectors generated by the triangular attention mechanism.

First, a gated self-attention module further filters the features to retain the key information.

Then a cross-attention module, using the self-attention calculation, takes the operation code matrix vector as the condition for computing over the binary feature vector, so that the output vector fuses the information of both feature vectors.

Finally, a residual network structure is used to reduce information loss as features propagate through the model during training.
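The three stages of module c can be sketched as follows. Several simplifications are assumed here: a simple element-wise sigmoid gate stands in for the gated self-attention module, the binary features act as queries with the opcode features as keys and values (one reading of "calculation condition"), both feature sets share the same dimension, and no learned projections are used.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(vectors, gate_weights):
    """Element-wise gate: each feature x is scaled by sigmoid(w * x)."""
    return [[x * sigmoid(w * x) for x, w in zip(v, gate_weights)] for v in vectors]


def cross_attention(queries, keys_values):
    """Each query attends over keys_values (scaled dot-product attention)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        )
        out.append(
            [sum(s * kv[j] for s, kv in zip(scores, keys_values)) for j in range(d)]
        )
    return out


def fuse(binary_feats, opcode_feats, gate_weights):
    b = gate(binary_feats, gate_weights)    # stage 1: filter key information
    o = gate(opcode_feats, gate_weights)
    attended = cross_attention(b, o)        # stage 2: opcode features as condition
    # Stage 3: residual connection keeps the gated binary features in the output.
    return [[x + a for x, a in zip(bv, av)] for bv, av in zip(b, attended)]


binary_feats = [[1.0, 0.0], [0.0, 2.0]]     # hypothetical toy features
opcode_feats = [[0.5, 0.5], [1.0, 0.0]]
fused = fuse(binary_feats, opcode_feats, gate_weights=[1.0, 1.0])
```

The fused output has one row per binary feature vector, so the classifier that follows sees binary features enriched with opcode information rather than two separate inputs.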

Module d: malware detection and classification. A deep neural network (DNN) generates the classification result. At the output layer, the softmax function assigns the input software to its malware family. A cross-entropy loss function is used during training.
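The output stage of module d can be illustrated with a short sketch of the softmax and cross-entropy calculations. The family names in `FAMILIES` and the example logits are hypothetical, and the logits are assumed to come from the final layer of the DNN, which is not modeled here.

```python
import math

FAMILIES = ["family_a", "family_b", "family_c"]  # hypothetical labels


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Loss for one sample: negative log-probability of the true family."""
    return -math.log(probs[true_index])


def predict_family(logits):
    """Return the most probable family and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return FAMILIES[best], probs


logits = [2.0, 0.5, -1.0]                # assumed DNN outputs for one sample
label, probs = predict_family(logits)
loss = cross_entropy(probs, FAMILIES.index("family_a"))
```

The loss shrinks toward zero as the probability assigned to the true family approaches 1, which is what training with this loss function drives the network toward.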

Claims

1. The multi-feature-based malware automatic detection method is characterized by comprising the following steps,

step S1: preprocessing data;

step S2: carrying out feature extraction on the data to obtain a feature vector;

step S4: detecting and classifying the malware.

2. The automated multi-feature-based malware detection method according to claim 1, wherein the step S1 comprises the sub-steps of:

3. The multi-feature based malware automated detection method of claim 1, wherein the substep S12 comprises the substeps of:

4. The automated multi-feature-based malware detection method according to claim 1, wherein the step 2 comprises the sub-steps of:

5. The automated multi-feature-based malware detection method according to claim 1, wherein said step S3 comprises the sub-steps of:

6. The automated multi-feature-based malware detection method according to claim 1, wherein said S4 comprises the sub-steps of:

substep S41: generating a classification result using the DNN network;