CN114896594A - Malicious code detection device and method based on image feature multi-attention learning - Google Patents

Malicious code detection device and method based on image feature multi-attention learning

Info

Publication number
CN114896594A
Authority
CN
China
Prior art keywords
features
feature
image
malicious
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210408579.XA
Other languages
Chinese (zh)
Inventor
武志超
谭振华
王卫东
吴建
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN202210408579.XA
Publication of CN114896594A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 Detecting local intrusion or implementing counter-measures
    • G06F 21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F 21/562 Static detection
    • G06F 21/563 Static detection by source code analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Virology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a malicious code detection device and method based on image feature multi-attention learning, belonging to the technical field of malicious code identification. The device comprises a code-image converter, a feature extractor, and a classifier. An original malicious code file to be detected is converted into a grayscale image, which is defined as a malicious image; low-level semantic features are extracted from the malicious image to obtain a feature F; key features F' are extracted from F; a pixel-correlation feature F'' is extracted from F'; higher-order features are extracted from F'' to obtain a feature M; a pixel-correlation feature M' is extracted from M; and the deep image feature M' is mapped to the sample label space, classifying the malicious image into a specific malware category. The device and method achieve a higher recognition rate while reducing the complexity of the convolutional neural network model.

Description

Malicious code detection device and method based on image feature multi-attention learning
Technical Field
The invention relates to the technical field of malicious code identification, in particular to a malicious code detection device and method based on image feature multi-attention learning.
Background
The rapid development of information technology has made enterprise network security face increasingly complex challenges. Malware is one of the major threats to network security. Malware refers to software or code that is installed and run on a user's computer or other terminal without the user's explicit consent, violating the user's legitimate interests. Malicious code detection aims to identify malicious programs on a computer or terminal before they can cause greater harm.
Traditional malicious code detection methods are divided into static detection and dynamic detection. Static detection analyzes the file content and structure of malicious code, including byte code, assembly instructions, functions, and so on. Its disadvantage is that it struggles to identify complex malware variants produced by techniques such as file encryption, packing, and polymorphism. Dynamic detection identifies malware by executing it and analyzing its behavior; this must be done while the malware is active, and it carries significant time and hardware costs. In short, both static and dynamic detection consume large amounts of time, labor, and hardware resources, and it is difficult for them to meet the requirement of fast and efficient malicious code identification.
With the development of deep learning, researchers have proposed malicious code detection methods based on image processing. These methods escape the time- and labor-intensive drawbacks of traditional approaches: the malicious code is converted into an image, and a convolutional neural network classifies the image, thereby detecting the malicious code. Image classification is a core task of computer vision and is mainly performed with convolutional neural networks in deep learning: given an input picture, the network assigns it to one label among a known set of classes. Many existing convolutional neural network models handle image classification well and achieve good accuracy. However, the neural network models used for image-based malicious code detection are structurally too complex. Consider the VGGNet-16 network model structure shown in fig. 1 and the ResNet-50 network model structure shown in fig. 2: each square in the diagram represents an operation in a convolutional neural network; Input represents the data input; Conv represents a convolution operation (for example, "Conv3x3, 64" denotes a convolution with a 3x3 kernel and 64 channels); MaxPool and AvgPool represent the maximum and average pooling operations; and the FC layer together with Softmax constitutes the classifier of the convolutional neural network, which assigns the data to a specific class. In networks as complex as VGGNet-16 and ResNet-50, the overly complex structure results in a large amount of computation.
Moreover, the image converted from malicious code, i.e. the malicious image, has two notable characteristics. First, a malicious image contains key features and non-key features: when malicious code is converted into an image, the tail of the binary file is usually padded with zeros, which render as black, so that black region is unrelated to the original code. The regions converted from code, containing the three shades black, white, and gray, are therefore usually called key features (for example, the image features inside the white frame in fig. 3), while the remaining black region is called non-key features. Second, malicious code carries semantic information and code correlation: the original code has semantic correlations that make it function, and when the code is converted into pixels in order, the pixels are arranged in the same order and thus inherit that correlation. After the code is converted sequentially into an image, the correlations within the malicious code are mapped onto pixel correlations in the malicious image. Existing image-based malware classification methods are not designed around these characteristics of malicious images and cannot deeply extract the key features and inter-pixel correlation features, which limits the identification accuracy of malicious image classification models.
Disclosure of Invention
Aiming at the problems in the prior art that neural network models for classifying malicious images are overly complex, computationally expensive, and weak at extracting deep features of malicious images, the invention provides a malicious code detection device and method based on image feature multi-attention learning, with the aim of reducing the complexity of the image classification neural network model while improving malicious code identification accuracy.
The technical scheme adopted by the invention is as follows:
the invention provides a malicious code detection device based on image feature multi-attention learning, which comprises:
the code-image converter is used for converting an input original malicious code file into a gray image and defining the gray image as a malicious image to be sent to the feature extractor;
the characteristic extractor is used for extracting key characteristics and correlation characteristics among pixels from the received malicious images so as to obtain deep-level characteristics in the malicious images;
and the classifier is used for classifying the malicious images according to the deep level features extracted by the feature extractor and classifying the malicious images into specific malicious software categories.
Further, according to the malicious code detection device based on image feature multi-attention learning, the feature extractor takes a convolutional neural network as a basic network and comprises three structures, namely a CNN module, a spatial attention module and a self-attention module.
Further, according to the malicious code detection device based on image feature multi-attention learning, the feature extractor specifically includes:
the first CNN module is used for receiving the malicious image sent by the code-image converter, extracting low-level semantic features from the malicious image through a convolutional layer, passing the extracted features through an activation function to a pooling layer, where the maximum pooling operation reduces the feature dimensions to generate a feature F, and sending the feature F to the spatial attention module;
the spatial attention module is used for extracting key features from the feature F; specifically, higher-order features are first extracted from F through a convolutional layer, and weights are assigned to the extracted features on the principle that key features receive higher weights; then the channel information of the features is aggregated along the spatial dimension using the maximum pooling operation and the average pooling operation, generating two two-dimensional feature maps F_avg and F_max, which are stacked together by a concatenation operation to obtain, after the data channels are compressed, key features carrying weight information; finally, these key features are multiplied with the feature F to obtain the weighted key features F', which are sent to the first self-attention module;
the first self-attention module is used for extracting pixel-correlation features from the key features F'; specifically, a convolution operation is first performed on F' to linearly map it into three feature matrices Q, K, and V; the transpose of Q is then multiplied by K to obtain a correlation matrix; the correlation matrix is multiplied by V to obtain a new matrix; finally, this new matrix, which carries both weight information and correlation features, is point-multiplied with the key features F' received from the spatial attention module, endowing the original key features F' with pixel correlation and yielding the pixel-correlation features F'', which are sent to the second CNN module;
the second CNN module is used for extracting higher-order features from the features F'': high-level semantic features are extracted from F'', the extracted features are passed through an activation function to a pooling layer, feature dimension reduction is performed by the maximum pooling operation to obtain a feature M, and the feature M is sent to the second self-attention module;
the second self-attention module is used for extracting pixel-correlation features from the feature M; specifically, a convolution operation is first performed on M to linearly map it into three feature matrices Q, K, and V; the transpose of Q is multiplied by K to obtain a correlation matrix; the correlation matrix is multiplied by V to obtain a new matrix; finally, this new matrix, carrying the correlation features, is point-multiplied with the higher-order feature M received from the second CNN module, endowing M with pixel correlation and yielding the pixel-correlation features M', which are sent to the classifier.
Further, according to the malicious code detection apparatus based on image feature multi-attention learning, the classifier is composed of at least 1 linear layer and 1 softmax classifier.
Further, according to the malicious code detection device based on image feature multi-attention learning, the classifier consists of 3 linear layers and 1 softmax classifier which are connected together.
The invention provides a malicious code detection method based on image feature multi-attention learning, which comprises the following steps:
step 100: converting an original malicious code file to be detected into a gray image, and defining the gray image as a malicious image;
step 200: extracting low-level semantic features in the malicious image to obtain features F;
step 300: extracting key features F' from the features F;
step 400: extracting the pixel-correlation feature F'' from the key features F';
step 500: extracting higher-order features from the feature F'' to obtain a feature M;
step 600: extracting a correlation characteristic M' of the pixel from the characteristic M;
step 700: and mapping the deep image features M' to a sample mark space, so that the malicious images are classified into specific malicious software categories.
Further, in the above malicious code detection method based on image feature multi-attention learning, the method of extracting low-level semantic features from the malicious image to obtain the feature F in step 200 is the same as the method of extracting higher-order features from the feature F'' to obtain the feature M in step 500, and specifically comprises: first extracting semantic features from the malicious image or the feature F'' through a convolutional layer, then passing the extracted semantic features through an activation function to a pooling layer, performing feature dimension reduction by the maximum pooling operation, and finally generating the feature F or the feature M.
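As a minimal illustration of the convolution, activation, and max-pooling sequence shared by steps 200 and 500, the following PyTorch sketch may help; the kernel size, channel counts, and input resolution are assumptions, since the patent's layer parameters (Table 2) are given only as an image:

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """One CNN module: a convolution extracts semantic features, an
    activation function passes them on, and max pooling reduces the
    feature dimensions (as in steps 200 and 500)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.act(self.conv(x)))

x = torch.randn(1, 1, 64, 64)   # one single-channel (grayscale) malicious image
f = CNNBlock(1, 32)(x)          # feature F: channels grow, spatial size halves
```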
Further, in the above malicious code detection method based on image feature multi-attention learning, the method of extracting the key features F' from the feature F in step 300 comprises: first, higher-order features are extracted from F through a convolutional layer and weights are assigned to the extracted features on the principle that key features receive higher weights; then the channel information of the features is aggregated along the spatial dimension using the maximum pooling operation and the average pooling operation, generating two two-dimensional feature maps F_avg and F_max, which are stacked together by a concatenation operation to obtain, after the data channels are compressed, key features carrying weight information; finally, these compressed key features with weight information are multiplied with the feature F to obtain the weighted key features F'.
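A hedged sketch of the spatial-attention computation of step 300 in PyTorch. The channel-wise average and max pooling, the stacking of F_avg and F_max, the channel compression, and the multiplication with F follow the text; the 7x7 compression kernel and the sigmoid used to turn the compressed map into weights are assumptions not stated in the source:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Step 300: aggregate channel information along the spatial
    dimension (average and max pooling), stack the two 2-D maps
    F_avg and F_max, compress them to one weight map, and multiply
    the map with F to obtain the weighted key features F'."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.compress = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f_avg = f.mean(dim=1, keepdim=True)   # F_avg: 2-D average map
        f_max = f.amax(dim=1, keepdim=True)   # F_max: 2-D max map
        stacked = torch.cat([f_avg, f_max], dim=1)
        weights = torch.sigmoid(self.compress(stacked))
        return f * weights                    # F' = F weighted by the key-feature map

f = torch.randn(1, 32, 32, 32)
f_prime = SpatialAttention()(f)
```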
Further, in the above malicious code detection method based on image feature multi-attention learning, the method of extracting the pixel-correlation feature F'' from the key features F' in step 400 is the same as the method of extracting the pixel-correlation feature M' from the feature M in step 600, and specifically comprises: first performing a convolution operation on the feature F' or the feature M to linearly map it into three feature matrices Q, K, and V; then multiplying the transpose of Q by K to obtain a correlation matrix; then multiplying the correlation matrix by V to obtain a new matrix; and finally point-multiplying this new matrix, which carries weight information and correlation features, with the key features F' or the feature M, endowing the original features with pixel correlation and obtaining the pixel-correlation feature F'' or M'.
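The Q/K/V computation of steps 400 and 600 can be sketched as follows. The 1x1 convolutions for the linear mappings, the reduced channel width of Q and K, and the softmax normalization of the correlation matrix are assumptions beyond the text, which only specifies multiplying the transpose of Q by K, multiplying by V, and point-multiplying with the input feature:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelSelfAttention(nn.Module):
    """Steps 400/600: convolutions linearly map the input to Q, K, V;
    transpose(Q) @ K gives a pixel-correlation matrix; multiplying V
    by it gives a new matrix whose entries carry correlation info;
    point-multiplying with the input endows it with pixel correlation."""
    def __init__(self, channels: int, reduced: int = 8):
        super().__init__()
        self.q = nn.Conv2d(channels, reduced, kernel_size=1)
        self.k = nn.Conv2d(channels, reduced, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                         # (b, reduced, h*w)
        k = self.k(x).flatten(2)                         # (b, reduced, h*w)
        v = self.v(x).flatten(2)                         # (b, c, h*w)
        corr = F.softmax(q.transpose(1, 2) @ k, dim=-1)  # (b, h*w, h*w)
        out = (v @ corr).view(b, c, h, w)                # new matrix
        return out * x                                   # point-multiply with input

x = torch.randn(1, 32, 16, 16)
x_corr = PixelSelfAttention(32)(x)
```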
Further, in any of the above devices or methods for malicious code detection based on image feature multi-attention learning, the key features refer to features in the regions of the grayscale image converted from malicious code that contain the three shades black, white, and gray.
The beneficial effects of the above technical scheme are as follows. On the basis of a convolutional neural network, the device of the invention combines a spatial attention module and two self-attention modules into a neural network model that classifies malicious images by extracting their deep features. Although the model contains several modules, their structure is simple and effective: each module uses only a small number of convolutional and pooling layers, so the complexity and computation of the model are greatly reduced compared with the large stacks of convolutional and pooling layers used by the VGGNet and ResNet networks. Each module extracts deep features of the malicious image, namely the key features and the pixel-correlation features, addressing both the complexity of neural network models for malicious image classification and their insufficient ability to extract deep image features. In experiments on the public Malimg malicious code dataset, the system and method reached an identification accuracy of 96.38%, exceeding the 96.10% of VGGNet, which demonstrates that the device and method achieve a higher recognition rate while reducing the complexity of the convolutional neural network model.
Drawings
FIG. 1 is a schematic diagram of a VGGNet-16 network model structure;
FIG. 2 is a schematic diagram of a ResNet-50 network model structure;
FIG. 3 is an exemplary diagram of key features in a malicious image;
FIG. 4 is a schematic structural diagram of a malicious code detection apparatus based on multi-attention learning of image features according to this embodiment;
FIG. 5 is a flowchart illustrating a malicious code detection method based on multi-attention learning of image features according to this embodiment;
FIG. 6(a) is an image converted from malicious code example 1 belonging to the Fakerean family; (b) is an image converted from malicious code example 2 belonging to the Fakerean family; (c) is an image converted from malicious code example 3 belonging to the Fakerean family;
FIG. 7(a) is an image converted from malicious code example 1 belonging to the Dontovo family; (b) is an image converted from malicious code example 2 belonging to the Dontovo family; (c) is an image converted from malicious code example 3 belonging to the Dontovo family;
FIG. 8 is a diagram of the classification results of malware families on the Malimg dataset by the system of the present invention.
Detailed Description
To facilitate an understanding of the present application, the present application will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present application are given in the accompanying drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
In this embodiment, the software environment is a WINDOWS 10 system and the simulation environment is PyCharm 2021.3.3 x64. The malicious code file samples come from the Malimg dataset, which contains 25 malware families, i.e. 25 malware types, with the number of samples of each type shown in Table 1. In the embodiment of the invention, the 25 malware families are represented by the numbers 0-24 respectively, so that the classifier in the device of the invention finally outputs a specific number to represent each malware family. In this embodiment, the correspondence between the 25 malware family names and the numbers 0-24 is as follows: Allaple.L: 0; Allaple.A: 1; Yuner.A: 2; Lolyda.AA1: 3; Lolyda.AA2: 4; Lolyda.AA3: 5; C2LOP.P: 6; C2LOP.gen!g: 7; Instantaccess: 8; Swizzor.gen!I: 9; Swizzor.gen!E: 10; VB.AT: 11; Fakerean: 12; Alueron.gen!J: 13; Malex.gen!J: 14; Lolyda.AT: 15; Adialer.C: 16; Wintrim.BX: 17; Dialplatform.B: 18; Dontovo.A: 19; Obfuscator.AD: 20; Agent.FYI: 21; Autorun.K: 22; Rbot!gen: 23; Skintrim.N: 24.
TABLE 1 Malimg dataset malware type and sample number
[Table 1 is reproduced as an image in the original publication.]
Fig. 4 is a schematic structural diagram of the malicious code detection apparatus based on multi-attention learning of image features according to the present embodiment, and as shown in fig. 4, the apparatus includes three parts, namely a code-image converter, a feature extractor and a classifier. The code-image converter is used for converting an input original malicious code file into a gray image, defining the gray image as a malicious image and sending the malicious image to the feature extractor; the feature extractor is used for extracting key features and correlation features from the received malicious image so as to obtain deep features in the malicious image; the classifier is used for classifying the malicious images according to the deep level features extracted by the feature extractor and classifying the malicious images into specific malicious software family categories. The feature extractor and the classifier form a neural network model in the apparatus of the present invention, which is called a FA model (Fusion Attention model).
As shown in fig. 4, the feature extractor, which takes a convolutional neural network as a basic network, includes three structures, namely a CNN module, a spatial attention module, and a self-attention module, and specifically includes:
the first CNN module is used for receiving the malicious image sent by the code-image converter, extracting low-level semantic features from the malicious image through a convolutional layer, passing the extracted features through an activation function to a pooling layer, performing feature dimension reduction by the maximum pooling operation to reduce the network model parameters, finally generating a feature F, and sending the feature F to the spatial attention module;
the spatial attention module is used for extracting key features from the feature F; specifically, higher-order features are first extracted from F through a convolutional layer, and weights are assigned to the extracted features on the principle that key features receive higher weights; then the channel information of the features is aggregated along the spatial dimension using the maximum pooling operation and the average pooling operation, generating two two-dimensional feature maps F_avg and F_max, which are stacked together by a concatenation operation to obtain, after the data channels are compressed, key features carrying weight information; finally, these key features with weight information are multiplied with the feature F to obtain the weighted key features F', which are sent to the first self-attention module;
the first self-attention module is used for extracting pixel-correlation features from the key features F'; specifically, a convolution operation is first performed on F' to complete a linear mapping into three feature matrices Q, K, and V, which differ only in the size of their output channels; after the three matrices are obtained, the transpose of Q is multiplied by K to obtain a correlation matrix; the correlation matrix is multiplied by V to obtain a new matrix, in which every pixel point is related to the key features F' received from the spatial attention module and which contains weight information; finally, this new matrix, carrying weight information and correlation features, is point-multiplied with the key features F' received from the spatial attention module, endowing the original key features F' with pixel correlation, yielding the pixel-correlation features F'', which are sent to the second CNN module;
the second CNN module is used for extracting higher-order features from the features F'': high-level semantic features are extracted from F'', the extracted features are passed through an activation function to a pooling layer, feature dimension reduction is performed by the maximum pooling operation to reduce the network model parameters, a feature M is finally generated, and the feature M is sent to the second self-attention module;
the second self-attention module is used for extracting pixel-correlation features from the feature M; specifically, a convolution operation is first performed on M to complete a linear mapping into three feature matrices Q, K, and V; after the three matrices are obtained, the transpose of Q is multiplied by K to obtain a correlation matrix; the correlation matrix is multiplied by V to obtain a new matrix, in which every pixel point is correlated with the higher-order feature M received from the second CNN module; finally, this new matrix, carrying the correlation features, is point-multiplied with the feature M, endowing M with pixel correlation, yielding the pixel-correlation features M', which are sent to the classifier.
As shown in fig. 4, the classifier consists of at least 1 linear layer and 1 softmax classifier. In this embodiment, the classifier is composed of 3 linear layers and 1 softmax classifier connected together, and is used for mapping the deep image features extracted by the feature extractor to the sample label space, i.e. classifying the deep image features received by the classifier. The linear layer is a model classification structure in the deep learning framework PyTorch, used to apply a linear transformation to the obtained deep image features and generate the input features required by softmax. In the 3 linear layers of this embodiment, each layer reduces the dimension of the features, so the neural network parameters decrease layer by layer. The softmax classifier normalizes the features and assigns weights; finally each feature matrix is assigned to a specific malware family category and the corresponding number representing that family is produced, thereby obtaining the corresponding malware family category. Table 2 shows the detailed parameters of each layer of the FA model in this embodiment.
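A sketch of the three-linear-layer classifier described above; the input feature width of 2048 and the intermediate widths 512 and 128 are assumed values, since Table 2 is available only as an image. (In PyTorch training code one would normally feed the pre-softmax logits to `nn.CrossEntropyLoss` and apply softmax only at inference.)

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Three linear layers of decreasing width followed by softmax:
    each layer reduces the feature dimension in turn, and the final
    probabilities name one of the 25 Malimg malware families."""
    def __init__(self, in_features: int = 2048, num_classes: int = 25):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
            nn.Softmax(dim=1),
        )

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        return self.layers(m.flatten(1))

probs = Classifier()(torch.randn(4, 2048))  # a batch of 4 deep feature vectors
family = probs.argmax(dim=1)                # numbers 0-24 naming the families
```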
TABLE 2 detailed parameters of layers in the FA model
(Table 2 is reproduced only as an image in the original publication; its per-layer parameters are not recoverable here.)
Fig. 5 is a schematic flowchart of a malicious code detection method based on image feature multi-attention learning, which aims to train and verify a malicious code detection apparatus based on image feature multi-attention learning, and includes the following steps:
step 1: converting an original malicious code file into a gray image, normalizing all the gray images, and dividing the normalized gray images into a training set and a test set according to a certain proportion;
step 1.1: converting the original malicious code file into a gray image;
in the present embodiment, as described above, the original malicious code file samples come from the Malimg data set. Each original malicious code file in the acquired Malimg data set contains a binary bit string, such as 011100110101100101101101010. The binary bit string in the malicious code file is converted into a grayscale image as follows. First, the binary bit string is extracted using the open and write functions of the Python language and stored on the computer. Taking each byte of the malicious code file as a unit, the binary bit string is split and read in 8-bit groups; each 8-bit vector is then converted by a base-conversion operation in Python into a decimal unsigned integer, mapping it into the range 0-255. The base conversion for each byte is computed as:
I = Σ_{i=0}^{7} b_i · 2^(7−i)    (1)
(b_i denotes the i-th bit of the byte, counted from the most significant bit)
where I is the resulting mapping value. For example, the bytes 01100000 and 10101100 yield the mapping values 96 and 172, respectively, through the base-conversion formula. In this way every byte is mapped to a value between 0 and 255, and the values are finally rendered as grayscale pixels in the interval 0 (black) to 255 (white), producing the image sample data set shown in figs. 6 and 7.
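The byte-to-integer mapping of Step 1.1 can be sketched in Python. The helper names are illustrative assumptions; the patent specifies only the use of Python's open/write functions and the 8-bit-per-byte mapping into 0-255.

```python
# Sketch of the per-byte base conversion described in Step 1.1.
# Helper names are illustrative, not from the patent.

def bits_to_int(bits: str) -> int:
    """Map an 8-bit string to its decimal value: I = sum of b_i * 2^(7-i)."""
    return sum(int(b) << (7 - i) for i, b in enumerate(bits))

def bitstring_to_pixels(bitstring: str) -> list:
    """Split a binary bit string into 8-bit groups and map each to 0-255."""
    usable = len(bitstring) - len(bitstring) % 8  # drop any trailing partial byte
    return [bits_to_int(bitstring[i:i + 8]) for i in range(0, usable, 8)]
```

For the worked examples in the text, `bits_to_int("01100000")` gives 96 and `bits_to_int("10101100")` gives 172.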
Step 1.2: normalizing all the gray level images to construct an image sample data set;
the image is processed into a fixed size of uniform size. The reason for this is that an FA model (Fusion Attention model) composed of a feature extractor and a classifier requires input of images of uniform size, and small-sized images have better recognition effect. In order to prevent overfitting of the FA model, random crop is carried out on image data by using a RandomCrop method in a deep learning framework pyrrch, so that the data enhancement effect can be achieved. And normalizing the data by using Normalization operation, so that the data can be mapped into a range of 0-1, the calculation amount of the FA model is reduced, and the precision and the convergence speed of the FA model are improved.
Step 1.3: dividing the image sample data set into a training set and a test set according to a certain proportion;
after all malicious code files are converted into grayscale images, they form an image sample data set, which must be divided in order to subsequently train and test the constructed model. In this embodiment, the image sample data set is split into a training set and a test set at a ratio of 8:2 using the Python language. The data set is traversed with a for loop, new training-set and test-set folders are created with the os.path.join method of the Python os package, and the assigned pictures are copied into the two folders.
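Step 1.3 can be sketched as follows. The directory layout (one sub-folder per malware family) and the function name are assumptions; the patent specifies only the 8:2 ratio, a for loop over the data set, and the use of os.path.join to build the new folders.

```python
import os
import random
import shutil

# Sketch of the 8:2 train/test split described in Step 1.3.
def split_dataset(src_dir, dst_dir, train_ratio=0.8, seed=0):
    rng = random.Random(seed)
    for family in sorted(os.listdir(src_dir)):       # traverse the data set
        files = sorted(os.listdir(os.path.join(src_dir, family)))
        rng.shuffle(files)
        cut = int(len(files) * train_ratio)
        for subset, names in (("train", files[:cut]), ("test", files[cut:])):
            out = os.path.join(dst_dir, subset, family)
            os.makedirs(out, exist_ok=True)          # new train/test folders
            for name in names:
                shutil.copy(os.path.join(src_dir, family, name), out)
```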
Step 2: inputting the image sample data in the training set into the malicious code detection device based on the image feature multi-attention learning shown in fig. 4, and performing forward propagation on the malicious code detection device based on the image feature multi-attention learning to obtain a prediction result.
Step 2.1: extracting low-level semantic features in an input image through a first CNN module to obtain a feature F;
in this embodiment, specifically, the malicious images in the training set are input to the first CNN module. In the first CNN module, low-level semantic features are first extracted from the input image by the convolutional layer; the extracted features are then passed through the activation function into the pooling layer, where maximum pooling performs feature dimension reduction and finally generates the feature F;
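The CNN module of Step 2.1 (and Step 2.4) can be sketched as a PyTorch module. The channel counts and kernel size are assumptions; the actual per-layer parameters are given only in Table 2, which appears as an image in the original.

```python
import torch
import torch.nn as nn

# Sketch of one CNN module: convolution -> activation -> max pooling.
class CNNModule(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2)  # feature dimension reduction (halves H and W)

    def forward(self, x):
        return self.pool(self.act(self.conv(x)))
```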
step 2.2: extracting key features F' from the features F through a space attention module;
in the spatial attention module of the present embodiment, a convolutional layer first extracts further high-order features from the feature F and assigns weights to them, on the principle that key features receive higher weights; next, maximum pooling and average pooling aggregate the channel information of the features along the spatial dimension, generating the two-dimensional feature maps F_avg and F_max, which are stacked together by a concatenation operation, yielding key features with weight information after the data channel has been compressed; finally, these key features are multiplied by the feature F to obtain the weighted key features F';
the operation formula of the space attention module for feature extraction is as follows:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])) = σ(f^{7×7}([F_avg; F_max]))    (2)
wherein M_s represents the entire feature-extraction operation of the spatial attention module; F is the input feature; σ represents the activation function; f^{7×7} represents a convolution operation using a 7×7 convolution kernel; AvgPool and MaxPool represent average pooling and maximum pooling, whose outputs are written F_avg and F_max in equation (2); F_avg, F_max ∈ R^{1×H×W}, where R denotes a feature matrix and 1×H×W denotes a feature with 1 channel, height H and width W;
step 2.3: extracting a correlation feature F 'of the pixel on the key feature F' by a first self-attention module;
in the first self-attention module of this embodiment, a convolution operation is first applied to the feature F' as a linear mapping, yielding the three feature matrices Q, K and V; the transpose of Q is then multiplied by K to obtain a correlation matrix; the correlation matrix is multiplied by V to obtain a new matrix; finally, the new matrix, carrying weight information and correlation features, is dot-multiplied with the key feature F' received from the spatial attention module, producing the pixel-correlation feature F'' in the image;
the operation formula of the self-attention module is as follows:
Attention(Q, K, V) = softmax(B_{i,j}) V · F    (3)
B_{i,j} = Q(x_i)^T K(x_j)    (4)
wherein softmax represents the activation function; Q, K, V represent the three feature matrices; B_{i,j} represents the relationship weight that the i-th position generates for the j-th position, i.e., the relationship weight between different pixels; F represents the original input features; Q(x) and K(x) represent the feature matrices Q and K generated by the convolution operation; Q(x_i)^T represents the transpose of the generated feature matrix Q.
Step 2.4: extracting higher-order features from the features F' through a second CNN module to obtain features M;
in the second CNN module of the present embodiment, high-level semantic features are first extracted from the feature F'' by the convolutional layer; the extracted features are then passed through the activation function into the pooling layer, where maximum pooling performs feature dimension reduction and finally generates the feature M.
Step 2.5: extracting a correlation feature M' of the pixel from the feature M through a second self-attention module;
in the second self-attention module of this embodiment, a convolution operation is first applied to the feature M as a linear mapping, yielding the three feature matrices Q, K and V; the transpose of Q is then multiplied by K to obtain a correlation matrix; the correlation matrix is multiplied by V to obtain a new matrix; finally, the new matrix, carrying the correlation features, is dot-multiplied with the high-order feature M received from the second CNN module, endowing M with pixel correlation and producing the pixel-correlation feature M' in the image.
Step 2.6: the deep image features M' are mapped to a sample label space by a classifier, i.e., the features are classified.
In the classifier of this embodiment, the feature M' is first successively reduced in dimension by the 3 linear layers, so that the neural network parameters decrease layer by layer; the softmax classifier then normalizes the features and assigns weights, finally assigning each feature matrix to a specific malicious family category and generating the number that identifies it, from which the corresponding malware family is obtained.
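The classifier of Step 2.6 can be sketched as follows. The intermediate layer dimensions are assumptions (the real values are in the image-only Table 2); 25 output classes matches the number of families in the Malimg data set.

```python
import torch
import torch.nn as nn

# Sketch of the classifier: 3 linear layers with decreasing dimensions
# followed by softmax normalization over the family categories.
class Classifier(nn.Module):
    def __init__(self, in_dim: int, n_classes: int = 25):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.Linear(256, 64),
            nn.Linear(64, n_classes),  # parameters decrease layer by layer
        )

    def forward(self, m):
        # flatten the deep image features, then map to class probabilities
        return torch.softmax(self.layers(m.flatten(1)), dim=1)
```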
Step 3: calculating a loss value from the prediction result obtained by forward propagation of the malicious code detection device based on image-feature multi-attention learning, performing backward propagation, and updating the parameters of the device.
In this embodiment, stochastic gradient descent is used to update the parameters of the malicious code detection device based on image-feature multi-attention learning, and the loss function used for training is:
L = −Σ_{i=0}^{c−1} y_i · log(p_i)    (5)
wherein p = [p_0, …, p_{c−1}] represents a probability distribution, each element p_i representing the probability that the sample belongs to malware class i; y = [y_0, …, y_{c−1}] is the vector representation of the sample label, with y_i = 1 when the sample belongs to class i and y_i = 0 otherwise; c represents the number of sample classes.
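Equation (5) is the standard cross-entropy loss, which can be sketched directly; with a one-hot label y, only the term for the true class contributes.

```python
import math

# Sketch of the cross-entropy loss of equation (5):
# L = -sum_{i=0}^{c-1} y_i * log(p_i), with y one-hot over the c classes.
def cross_entropy(p, y):
    return -sum(yi * math.log(pi) for pi, yi in zip(p, y) if yi > 0)
```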
During training, the network layers initialize their parameters with an Initialization method; the training batch size is set to 128, the initial learning rate to 0.005, and the number of training epochs to 100. During training, the device is tested on the test set every 10 epochs, the test results are written to a log file, and the parameter file of the device is saved as a .pth file.
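A hypothetical minimal training step with the stated hyperparameters (SGD with learning rate 0.005; in the full setup, batches of 128 would be drawn from the training set for 100 epochs). A tiny linear model stands in for the FA model here.

```python
import torch
import torch.nn as nn

# Minimal training-step sketch: forward pass, loss, back-propagation,
# and a stochastic-gradient-descent parameter update.
torch.manual_seed(0)
model = nn.Linear(8, 3)                                  # stand-in for the FA model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()          # back-propagation
    optimizer.step()         # SGD parameter update
    return loss.item()
```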
Step 4: ACC, the ratio of correctly predicted samples to total samples, is used as the measure of detection performance. The trained malicious code detection device based on image-feature multi-attention learning is tested and evaluated with the test set; after the current device is tuned according to the evaluation result, it is retrained by the method of step 2 and then tested and evaluated again. Training, testing, and evaluation are repeated until an optimal device meeting the metric is obtained, which is taken as the final malicious code detection device based on image-feature multi-attention learning.
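The ACC metric described above can be sketched in a few lines:

```python
# Sketch of the ACC metric: the ratio of correctly predicted samples
# to the total number of samples.
def accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```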
In order to better illustrate the experimental results and performance improvement of the device, the experimental results of different models on the Malimg data set are compared, and the specific results are shown in table 3. The model of the combination of the feature extractor and the classifier in the device is referred to as a FA model (Fusion Attention model) for short.
TABLE 3 comparison of the experimental results of different models
(Table 3 is reproduced only as images in the original publication; the per-model results are not recoverable here.)
As can be seen from Table 3, the recognition performance of the FA model exceeds that of the basic CNN model, and on the Malimg data set the FA model is comparable to the complex network VGGNet. The device is therefore effective both at reducing the complexity of the neural network model and at extracting the deep-level features of the image.
In addition to the accuracy comparison between models, the family classification results of the device of the present invention on the Malimg data set are shown in fig. 8. The horizontal axis of fig. 8 represents the true label and the vertical axis the predicted label; a 0 indicates a correct prediction, and the other numbers give the count of samples of the horizontal-axis family predicted as the vertical-axis family. Most families are predicted well, maintaining 100% recognition accuracy, with only a few sample prediction errors: 4 samples of class Allaple.L are predicted as Allaple.A, and 5 samples of Allaple.A as Allaple.L; 3 samples of class Lolyda.AA2 are predicted as Lolyda.AA1; 7 samples of class C2LOP.P are predicted as C2LOP.gen!g; and 6 samples of class Swizzor.gen!I are predicted as Swizzor.gen!E. The prediction errors thus occur between two subclasses of the same general family; owing to the homology between such subclasses, there is little difference between them. This shows that the FA model in the device of the present invention captures the features of the same family well, and that extracting features through pixel correlation helps the model identify members of the same family.
In conclusion, the device and method of the present invention meet the requirements of efficient and accurate malicious code identification. They solve the problem of the high time, labor, and resource costs of traditional malicious code identification, as well as the problems, in existing image-based methods, of model complexity and insufficient extraction of the deep-level features of malicious code.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (10)

1. A malicious code detection device based on image feature multi-attention learning is characterized by comprising:
the code-image converter is used for converting an input original malicious code file into a gray image and defining the gray image as a malicious image and sending the malicious image to the feature extractor;
the characteristic extractor is used for extracting key characteristics and correlation characteristics among pixels from the received malicious images so as to obtain deep-level characteristics in the malicious images;
and the classifier is used for classifying the malicious images according to the deep level features extracted by the feature extractor and classifying the malicious images into specific malicious software categories.
2. The apparatus according to claim 1, wherein the feature extractor is based on a convolutional neural network, and comprises three structures, namely a CNN module, a spatial attention module and a self-attention module.
3. The apparatus for detecting malicious code based on image feature multi-attention learning according to claim 2, wherein the feature extractor specifically comprises:
the first CNN module is used for receiving the malicious images sent by the code-image converter, extracting low-level semantic features in the input images from the malicious images through a convolutional layer, inputting the extracted low-level semantic features into a pooling layer through an activation function, performing feature dimension reduction through maximum pooling operation to generate features F, and sending the features F to the spatial attention module;
the spatial attention module is used for extracting key features from the feature F; specifically, high-order features are further extracted from the feature F through a convolutional layer, and weights are assigned to the extracted features on the principle that key features receive higher weights; then the channel information of the features is aggregated in the spatial dimension using maximum pooling and average pooling, generating the two-dimensional feature maps F_avg and F_max, which are stacked together by a concatenation operation to obtain key features with weight information after the data channel is compressed; finally, these key features are multiplied by the feature F to obtain the weighted key features F', which are sent to the first self-attention module;
the first self-attention module is used for extracting the correlation characteristics of the pixels on the key characteristics F ', and specifically, firstly, performing convolution operation on the characteristics F ' to linearly map the characteristics F ' to obtain Q, K, V characteristic matrixes; then multiplying the output transpose of the matrix Q and the output of the matrix K to obtain a correlation matrix; then multiplying the correlation matrix with the matrix V to obtain a new matrix; finally, carrying out point multiplication on the new matrix with the weight information and the correlation characteristics and the key characteristics F 'received from the space attention module, giving pixel correlation to the original key characteristics F', obtaining pixel correlation characteristics F 'in the image, and sending the pixel correlation characteristics F' to the second CNN module;
the second CNN module is used for extracting higher-order features from the features F ', extracting high-level semantic features from the features F', transferring the extracted high-level semantic features to the pooling layer through an activation function, performing feature dimension reduction through maximum pooling operation to obtain features M, and sending the features M to the second self-attention module;
the second self-attention module is used for extracting the correlation characteristics of the pixels on the characteristics M, and specifically, firstly, convolution operation is carried out on the characteristics M to linearly map the characteristics M to obtain Q, K, V characteristic matrixes; then multiplying the output transpose of the matrix Q and the output of the matrix K to obtain a correlation matrix; then multiplying the correlation matrix with the matrix V to obtain a new matrix; and finally, performing dot multiplication on the new matrix with the correlation characteristics and the high-order characteristics M received from the second CNN module, giving pixel correlation to the characteristics M, obtaining pixel correlation characteristics M 'in the image, and sending the pixel correlation characteristics M' to the classifier.
4. The apparatus according to claim 1, wherein the classifier is composed of at least 1 linear layer and 1 softmax classifier.
5. The apparatus according to claim 4, wherein the classifier is composed of 3 linear layers and 1 softmax classifier connected together.
6. A malicious code detection method based on image feature multi-attention learning is characterized by comprising the following steps:
step 100: converting an original malicious code file to be detected into a gray image, and defining the gray image as a malicious image;
step 200: extracting low-level semantic features in the malicious image to obtain a feature F;
step 300: extracting key features F' from the features F;
step 400: extracting a correlation characteristic F 'of the pixel on the key characteristic F';
step 500: extracting higher-order features from the features F' to obtain features M;
step 600: extracting a correlation characteristic M' of the pixel from the characteristic M;
step 700: and mapping the deep image features M' to a sample mark space, so that the malicious images are classified into specific malicious software categories.
7. The method for detecting malicious codes based on image feature multi-attention learning according to claim 6, wherein the method for extracting the low-level semantic features in the malicious images to obtain the features F in the step 200 is the same as the method for extracting the features M at a higher order in the features F ″ in the step 500, and specifically comprises: firstly, semantic features are extracted once from a malicious image or feature F' through a convolutional layer, then the extracted semantic features are transmitted into a pooling layer through an activation function, dimension reduction of the features is carried out through maximum pooling operation, and finally the feature F or the feature M is generated.
8. The method for detecting malicious codes based on image feature multi-attention learning according to claim 6, wherein the method for extracting key features F' from the feature F in step 300 is as follows: first, high-order features are further extracted from the feature F through a convolutional layer, and weights are assigned to the extracted features on the principle that key features receive higher weights; then, the channel information of the features is aggregated in the spatial dimension using maximum pooling and average pooling, generating the two-dimensional feature maps F_avg and F_max, which are stacked together by a concatenation operation to obtain key features with weight information after the data channel is compressed; finally, these key features with weight information are multiplied by the feature F to obtain the weighted key features F'.
9. The method for detecting malicious codes based on image feature multi-attention learning according to claim 6, wherein the method for extracting the correlation feature F ″ of the pixel on the key feature F 'in the step 400 is the same as the method for extracting the correlation feature M' of the pixel on the feature M in the step 600, and specifically comprises: firstly, carrying out convolution operation on the feature F 'or the feature M to linearly map the feature F' or the feature M to obtain Q, K, V three feature matrixes; then multiplying the output transpose of the matrix Q and the output of the matrix K to obtain a correlation matrix; then multiplying the correlation matrix with the matrix V to obtain a new matrix; and finally, carrying out point multiplication on the new matrix with the weight information and the correlation characteristic and the key feature F 'or the feature M to endow the original key feature F' or the feature M with pixel correlation, and obtaining the pixel correlation characteristic F 'or the pixel correlation characteristic M' in the image.
10. The apparatus or method for detecting malicious codes based on image feature multi-attention learning as claimed in any preceding claim, wherein the key features refer to features in a region containing three shades of black, white and gray in a gray level image converted by malicious codes.
CN202210408579.XA 2022-04-19 2022-04-19 Malicious code detection device and method based on image feature multi-attention learning Pending CN114896594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210408579.XA CN114896594A (en) 2022-04-19 2022-04-19 Malicious code detection device and method based on image feature multi-attention learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210408579.XA CN114896594A (en) 2022-04-19 2022-04-19 Malicious code detection device and method based on image feature multi-attention learning

Publications (1)

Publication Number Publication Date
CN114896594A true CN114896594A (en) 2022-08-12

Family

ID=82717370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210408579.XA Pending CN114896594A (en) 2022-04-19 2022-04-19 Malicious code detection device and method based on image feature multi-attention learning

Country Status (1)

Country Link
CN (1) CN114896594A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216797A (en) * 2022-12-01 2023-12-12 丰立有限公司 System and method for protecting data file
CN117216797B (en) * 2022-12-01 2024-05-31 丰立有限公司 System and method for protecting data file

Similar Documents

Publication Publication Date Title
CN110084281B (en) Image generation method, neural network compression method, related device and equipment
CN110602113B (en) Hierarchical phishing website detection method based on deep learning
CN112613501A (en) Information auditing classification model construction method and information auditing method
CN111428557A (en) Method and device for automatically checking handwritten signature based on neural network model
CN111885035A (en) Network anomaly detection method, system, terminal and storage medium
CN112989358B (en) Method and device for improving robustness of source code vulnerability detection based on deep learning
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN109033833B (en) Malicious code classification method based on multiple features and feature selection
CN111651762A (en) Convolutional neural network-based PE (provider edge) malicious software detection method
CN113806746A (en) Malicious code detection method based on improved CNN network
CN113344826B (en) Image processing method, device, electronic equipment and storage medium
CN110717953A (en) Black-white picture coloring method and system based on CNN-LSTM combined model
CN116910752B (en) Malicious code detection method based on big data
CN114896594A (en) Malicious code detection device and method based on image feature multi-attention learning
CN112597925B (en) Handwriting recognition/extraction and erasure method, handwriting recognition/extraction and erasure system and electronic equipment
CN116758379B (en) Image processing method, device, equipment and storage medium
CN117407875A (en) Malicious code classification method and system and electronic equipment
CN111242114B (en) Character recognition method and device
CN111898544A (en) Character and image matching method, device and equipment and computer storage medium
CN114896598B (en) Malicious code detection method based on convolutional neural network
CN111488950A (en) Classification model information output method and device
CN116361791A (en) Malicious software detection method based on API packet reconstruction and image representation
CN114638984B (en) Malicious website URL detection method based on capsule network
WO2023173546A1 (en) Method and apparatus for training text recognition model, and computer device and storage medium
CN114741697A (en) Malicious code classification method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination