CN113222053B - Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion - Google Patents

Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion

Info

Publication number
CN113222053B
CN113222053B (Application No. CN202110589078.1A)
Authority
CN
China
Prior art keywords
api
model
malware
rgb image
malicious software
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110589078.1A
Other languages
Chinese (zh)
Other versions
CN113222053A (en)
Inventor
李树栋
许娜
吴晓波
韩伟红
方滨兴
田志宏
顾钊铨
殷丽华
唐可可
仇晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202110589078.1A priority Critical patent/CN113222053B/en
Publication of CN113222053A publication Critical patent/CN113222053A/en
Application granted granted Critical
Publication of CN113222053B publication Critical patent/CN113222053B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53: Monitoring users, programs or devices to maintain the integrity of platforms during program execution by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464: Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06V10/56: Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion, wherein the method comprises the following steps: constructing an API category database Q; extracting the malware API call sequence chain; constructing API call relationship pairs from the API call sequence chain to obtain an API call relationship pair directed graph G; determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair; obtaining an RGB image representing the API-calling behavior of the malware; constructing a Stacking multi-model fusion classifier, training it, and inputting the RGB image data set representing the behavior characteristics of each piece of malware into the classifier so as to predict the malware family name. According to the invention, the API-calling behavior of the malware is converted into an RGB image through conversion rules; the conversion process considers not only the number of API calls but also the call relationships between APIs, and the Stacking technique is used for multi-model fusion, so the accuracy of the model can be improved.

Description

Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion
Technical Field
The invention belongs to the technical field of malicious software classification, and particularly relates to a malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion.
Background
Malware refers to executable programs written for some malicious purpose, including viruses, worms, and Trojan horses. Ransomware has had a serious impact on society; it mainly attacks medium and large government and enterprise institutions in sectors such as business, government and education through email phishing, account brute-forcing, vulnerability exploitation and other means, and its operators reap huge profits from it. At the same time, the amount of malware has increased year by year: by 2020, the number of malware executables known to the security community had exceeded 1.1 billion, and this number is likely to keep growing. The reasons for the increase in the amount of malware are as follows. First, with the rapid development of network technology, there are more and more malware propagation channels, such as downloading pirated movies, searching hot topics, and installing unknown anti-virus software. Second, the abuse of automated malware generation tools results in an increase in the number of malware variants. Third, malware criminal groups have gradually formed large-scale commercial operations, creating a new ecosystem of malware collaboration.
In order to solve the above problems, a method of classifying malware families can be adopted to perform targeted defense on unknown malware. On the whole, the malware classification technology relies on a malware detection method, and the current malware detection methods mainly include three types: static analysis, dynamic analysis, and visualization techniques.
The static analysis method focuses on the representation of malware in textual form and analyzes static features extracted from the malware's binary file, executable file or disassembled file. Code reuse within the same malware family leads to similarity between the code of malware authors or teams, so the data and structural information of malware of the same family also show a certain similarity when the samples are loaded into memory and executed. However, when the static analysis method is adopted and the malware is packed or its resources are obfuscated, static features cannot be acquired. Malware packing refers to the process by which a malware author compresses, encrypts or otherwise alters a malware program so that, when run, it automatically unpacks itself and executes. Malware resource obfuscation refers to storing obfuscated resources on disk and restoring them at run time so that they can be used by the malware. In addition, the method is limited by the disassembly technology, dynamically downloaded data, and so on.
The dynamic analysis method focuses on the behavioral characteristics of malware: by executing a malware sample in a virtual controlled environment, it records behavioral information of the malware such as behavior logs, context parameters and API (application programming interface) call sequences. By running the malware, it is possible to find out which servers a particular malware file connects to, which parameters were modified, and which device inputs/outputs were performed. This method can bypass the limitation of malware packing, but it cannot handle malware that exhibits no network behavior during execution; it is only suitable for detecting malware with network behavior and lacks generalization capability.
The visualization technique visualizes the information obtained from malware by static or dynamic analysis. Visualized data is usually more intuitive than non-visualized statistics, and visualizing security data makes it possible to quickly identify the development trend of malware across the whole threat landscape. However, this method only considers the number of times each API is executed during the whole execution process and does not consider the correlation between APIs, while in practice the correlation between the APIs called by malware has a great influence on the classification result.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a malware family classification method, system and medium based on RGB image and Stacking multi-model fusion.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a malicious software family classification method based on RGB image and Stacking multi-model fusion, which comprises the following steps:
constructing an API category database Q;
executing unknown malware M in a sandbox, obtaining an execution report, and extracting the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
constructing an API call relation pair according to the API call sequence chain, thereby obtaining an API call relation pair directed graph G;
determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q to obtain an RGB image representing the API-calling behavior of the malware, wherein the RGB image represents well the call relationships between the APIs of the malware;
constructing a stacking multi-model fusion algorithm and performing training learning, wherein the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and inputting the RGB image data set representing the behavior characteristics of each piece of malware into a stacking multi-model fusion algorithm, thereby predicting the name of the malware family.
As a preferred technical solution, the constructing the API category database specifically includes:
the APIs are divided into 256 categories [0, 255]; API category numbers [0, 11] are assigned according to the categories in the Windows API reference manual;
API category numbers [12, 255] are assigned according to the number of occurrences in the report on APIs most used by popular malware (https://rstforums.com/forum/topic/95273-top-maliciously-used-api/), with adjacent intervals of 1200 occurrences.
As a preferred technical solution, the API call relationship pair is:
g = {<x_1, y_1 = x_2>, <x_2, y_2 = x_3>, <x_3, y_3 = x_4>, ..., <x_{n-1}, y_{n-1} = x_n>}
the API call relationship pair directed graph G is defined as:
G = (S, E), where S is the set of called APIs and E is the set of weighted directed edges,
E = {<(x_i, x_j), w_{i,j}>},
where w_{i,j} is the weight of the edge from x_i to x_j, representing the probability that x_j is called after x_i is called.
As a preferred technical solution, the weights are determined by using the improved iterative scaling algorithm of the maximum entropy model, specifically:
Input: feature functions f_i(x, y), i = 1, 2, ..., n-1; the empirical distribution P̃(X, Y); the model P_w(y|x);
Output: the optimal weights w'_i; the optimal model P_{w'}.
s1. Initialize w_i = 0 for i = 1, 2, ..., n-1;
s2. For each i = 1, 2, ..., n-1:
a. let σ_i be the solution of the equation
Σ_{x,y} P̃(x) P_w(y|x) f_i(x, y) exp(σ_i f#(x, y)) = E_P̃(f_i), where f#(x, y) = Σ_{k=1}^{n-1} f_k(x, y);
b. update the value of w_i: w_i ← w_i + σ_i;
s3. If all w_i have converged, stop; otherwise repeat s2;
where, for the given training set g = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_{n-1}, y_{n-1})}, the empirical distribution of the joint distribution P(X, Y) is P̃(X = x, Y = y) = ν(X = x, Y = y)/(n-1) and the empirical distribution of the marginal distribution P(X) is P̃(X = x) = ν(X = x)/(n-1), with ν(·) denoting the frequency of occurrence in the training set. The feature function f_i(x, y) is a binary function of the input x and the output y; E_P̃(f_i) = Σ_{x,y} P̃(x, y) f_i(x, y) is the expected value of f_i(x, y) with respect to the empirical distribution P̃(X, Y), and E_P(f_i) = Σ_{x,y} P̃(x) P(y|x) f_i(x, y) is the expected value of f_i(x, y) with respect to the model P(y|x) and the empirical distribution P̃(X).
As a preferred technical solution, the filling of RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q specifically includes: each API call relationship pair (x_i, x_j) determines the position to be filled in the image matrix, the position being (R = Q(x_i), G = Q(x_j)), and the pixel value at this position being (R = Q(x_i), G = Q(x_j), B = w_{i,j} × 255), where w_{i,j} is the weight of x_i to x_j, representing the probability that x_j is called after x_i is called.
As a preferred technical solution, in the Stacking multi-model fusion algorithm, the following five algorithms are selected as base classifiers in the first layer: SGD, SVM, KNN, MLP and Xception, and LR is selected in the second layer to fuse the classifiers of the first layer;
the specific steps of constructing a stacking multi-model fusion algorithm and training and learning are as follows:
firstly, dividing the obtained malware RGB image set into a training set Train and a test set Test, and dividing the training set into 5 parts: train1, train2, train3, train4, train5;
selecting SGD, SVM, KNN, MLP and Xception as base models; for the SGD model, sequentially using train1, train2, train3, train4 and train5 as the validation set and the remaining 4 parts as the training set, performing 5-fold cross-validation to train the model, and predicting on the test set, which yields 5 parts of predictions from the SGD model plus 1 part of predicted values B1 on the test set; the 5 parts of predictions are concatenated vertically to obtain P1; the SVM, KNN, MLP and Xception models are processed in the same way;
after the training of the 5 models is finished, the predicted values of the 5 models on the training set are respectively input as the training sets (P1, P2, ..., P5) of the LR model for training;
and predicting by using the trained LR model to obtain a final predicted malware family class label or probability.
As a preferable technical scheme, the method further comprises the following steps:
and measuring the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
The invention also provides a malicious software family classification system based on RGB image and Stacking multi-model fusion, which is applied to the above malicious software family classification method based on RGB image and Stacking multi-model fusion and comprises a database construction module, a feature extraction module, a feature processing module, a weight determination module, an RGB image construction module, a Stacking multi-model fusion module and a prediction module;
the database construction module is used for constructing an API (application programming interface) category database Q;
the feature extraction module is configured to execute the unknown malware M in the sandbox, obtain an execution report, and extract the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
the feature processing module constructs an API call relation pair according to the API call sequence chain so as to obtain an API call relation pair directed graph G;
the weight determination module is used for determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
the RGB image construction module is used for filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q, to obtain an RGB image representing the API-calling behavior of the malware; the RGB image represents well the call relationships between the APIs of the malware;
the stacking multi-model fusion module is used for constructing a stacking multi-model fusion algorithm and performing training and learning, the stacking multi-model fusion network is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and the prediction module is used for inputting the RGB image data set representing the behavior characteristics of each malicious software into a stacking multi-model fusion algorithm so as to predict the names of the malicious software families.
As a preferred technical solution, the system further comprises a loss function module, which measures the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
The invention further provides a storage medium which stores a program; when the program is executed by a processor, it implements the above malware family classification method based on RGB image and Stacking multi-model fusion.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention classifies malware families by adopting a RGB image and Stacking multi-model fusion method, the API calling behavior of the malware is converted into the RGB image through a conversion rule, the conversion process not only considers the API calling times, but also considers the calling relation between the API and the API, so that the malware classification problem is converted into the image classification problem, SGD, SVM, KNN, MLP, Xception and LR models are selected, the Stacking technology is used for carrying out multi-model fusion, the accuracy of the models can be improved, meanwhile, MSFE loss functions are used for evaluating the integral models, and the influence caused by unbalanced data set categories can be eliminated.
The invention has the advantage of rapidly classifying a large amount of malware. Compared with the prior art, the behavioral characteristics of the malware are selected by a dynamic analysis method, and in the process of converting the behavioral characteristics into an image, not only the number of times each API is called but also the correlation between APIs and the API category are considered, so the obtained RGB image represents the behavioral information of the malware more accurately. Five classification models are fused at the same time, which makes the prediction result more accurate.
The invention has the advantage of bypassing malware packing and obfuscation. Compared with the prior art, static features of the malware are not selected; instead, dynamic features are selected, which avoids the problem that malware analysts cannot analyze the malware because of packing or obfuscation, and is therefore feasible.
Drawings
FIG. 1 is a flowchart of a malware family classification method based on RGB image and Stacking multi-model fusion according to an embodiment of the present invention;
FIG. 2 is a pictorial diagram of an embodiment of the present invention for creating RGB images;
FIG. 3 is a diagram of a Stacking multi-model fusion framework according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a malware family classification system based on RGB image and Stacking multi-model fusion according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
Aiming at the defects of packing, obfuscation, encryption and the like that cannot be handled by the malware family classification technology in the prior art, the invention provides a malware family classification scheme based on RGB image and Stacking multi-model fusion on the basis of known malware families.
The main idea of the invention is as follows: extract the API call sequence of the malware by a dynamic analysis method; first construct the API call sequence chain and further construct the API call relationship pair graph; train according to the improved iterative scaling algorithm of the maximum entropy model to obtain the weight vector of the API call relationship pairs; construct the RGB image set from the API call relationship pairs and the weight vector; and finally train and predict the malware family by the Stacking model fusion method.
The technical scheme adopted by the invention has the following two purposes. First, considering the dependencies between APIs may improve the accuracy of malware classification. Second, the malware family classification problem is converted into an RGB image classification problem, the classification accuracy is improved by constructing a Stacking multi-model fusion method, and at the same time the influence of class imbalance can be eliminated by means of the Mean Squared False Error (MSFE) loss function.
Setup of the relevant system: Cuckoo Sandbox 2.0 is used to dynamically execute and trace samples in order to obtain the API call sequence.
As shown in fig. 1, the malware family classification method based on RGB image and Stacking multi-model fusion provided by this embodiment includes the following steps:
S1, API classification: the APIs are classified into 256 categories, and the classification rules are as follows:
S1.1, the first 12 API categories [0, 11] are classified according to the categories in the Windows API reference manual, i.e. by API functional features, as shown in Table 1:
table 1: according to API functional partitioning
[Table 1 is not reproduced in this text extraction.]
S1.2, API category numbers [12, 255] are assigned according to the frequency of occurrence in the report on APIs used by popular malware (shown in Table 2), with adjacent intervals of 1200 occurrences, focusing on the hot APIs called by malware.
Table 2: according to the number of called times
[Table 2 is not reproduced in this text extraction.]
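To make steps S1.1 and S1.2 concrete, a minimal sketch of building the category database Q as a Python dictionary follows. Because Tables 1 and 2 are not reproduced here, the API names, functional groups and call counts used below are hypothetical placeholders, and the bucketing direction of the 1200-occurrence interval is an assumption.

```python
# Minimal sketch of the API category database Q (API name -> category index 0..255).
# The API names, groups and counts are hypothetical placeholders for Tables 1 and 2.

FUNCTIONAL_GROUPS = {        # categories [0, 11]: by Windows API reference-manual area
    "CreateFileW": 0,        # e.g. file management
    "RegSetValueExW": 1,     # e.g. registry
    "CreateProcessW": 2,     # e.g. processes and threads
    # ... remaining functional groups up to category 11
}

HOT_API_CALL_COUNTS = {      # categories [12, 255]: by occurrences in the hot-malware API report
    "VirtualAlloc": 35000,   # hypothetical counts
    "GetProcAddress": 28000,
    "LoadLibraryA": 21000,
}

def build_api_category_db():
    """Build Q: functional categories first, then frequency buckets of width 1200."""
    q = dict(FUNCTIONAL_GROUPS)
    max_count = max(HOT_API_CALL_COUNTS.values())
    for api, count in HOT_API_CALL_COUNTS.items():
        if api not in q:
            bucket = 12 + (max_count - count) // 1200   # more frequent -> lower category (assumed)
            q[api] = int(min(bucket, 255))
    return q

Q = build_api_category_db()
```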
The API category database Q is constructed according to the partitioning of steps S1.1 and S1.2 above.
S2, feature extraction:
executing unknown malware M in a sandbox, obtaining an execution report, and extracting the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M.
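A sketch of this extraction step is shown below. It assumes a Cuckoo Sandbox 2.0 JSON report whose behavior section lists per-process API calls; the exact key layout can vary between Cuckoo versions, so the field names are an assumption.

```python
import json

def extract_api_call_chain(report_path):
    """Extract the malware API call sequence chain api_calls = [x1, x2, ..., xn]
    from a Cuckoo Sandbox JSON report (the behavior/processes/calls/api key layout
    is assumed and may differ between Cuckoo versions)."""
    with open(report_path, "r", encoding="utf-8") as f:
        report = json.load(f)
    api_calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api_name = call.get("api")
            if api_name:
                api_calls.append(api_name)
    return api_calls
```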
S3, feature processing:
the API call relation pair is constructed from the API call sequence chain in step S2,
g = {<x_1, y_1 = x_2>, <x_2, y_2 = x_3>, <x_3, y_3 = x_4>, ..., <x_{n-1}, y_{n-1} = x_n>}
this is done to not focus on a single API or complete sequence, but rather to represent the behavior and current state of the malware at that time with pairs of API call relationships. This may result in the definition of the API call directed graph G:
G = (S, E), where S is the set of called APIs and E is the set of weighted directed edges,
E = {<(x_i, x_j), w_{i,j}>},
where w_{i,j} is the weight of the edge from x_i to x_j, representing the probability that x_j is called after x_i is called.
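A minimal sketch of this pair and graph construction is given below; the adjacency-dictionary representation and function names are assumptions made for illustration.

```python
def build_call_relation_pairs(api_calls):
    """g = [<x1, x2>, <x2, x3>, ..., <x_{n-1}, x_n>]: pairs of consecutive API calls."""
    return list(zip(api_calls[:-1], api_calls[1:]))

def build_call_graph(pairs):
    """Directed graph G = (S, E): S is the set of called APIs, E maps (xi, xj) to a weight.
    Weights are initialised to 0.0 and filled in later by the weight-determination step."""
    nodes = set()
    edges = {}
    for xi, xj in pairs:
        nodes.update((xi, xj))
        edges[(xi, xj)] = 0.0
    return nodes, edges

# usage: pairs = build_call_relation_pairs(api_calls); S, E = build_call_graph(pairs)
```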
S4, determining the weight:
in step 3, after the API call sequence chain API _ calls is converted into the API call relationship pair graph G, the weight is determined by using the improved iteration scale algorithm in the maximum entropy model, and finally the weight w of each API call relationship pair can be obtainedi,j. Improved iterative scaling algorithm finds weight vector w ═ w (w)1,w2,...,wn-1) The steps of (1):
Input: feature functions f_i(x, y), i = 1, 2, ..., n-1; the empirical distribution P̃(X, Y); the model P_w(y|x);
Output: the optimal weights w'_i; the optimal model P_{w'}.
s1. Initialize w_i = 0 for i = 1, 2, ..., n-1.
s2. For each i = 1, 2, ..., n-1:
a. let σ_i be the solution of the equation
Σ_{x,y} P̃(x) P_w(y|x) f_i(x, y) exp(σ_i f#(x, y)) = E_P̃(f_i), where f#(x, y) = Σ_{k=1}^{n-1} f_k(x, y);
b. update the value of w_i: w_i ← w_i + σ_i.
s3. If all w_i have converged, stop; otherwise repeat s2.
Here, for the given training set g = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_{n-1}, y_{n-1})}, the empirical distribution of the joint distribution P(X, Y) can be determined as P̃(X = x, Y = y) = ν(X = x, Y = y)/(n-1), and the empirical distribution of the marginal distribution P(X) as P̃(X = x) = ν(X = x)/(n-1), where ν(·) denotes the frequency of occurrence in the training set. The feature function f_i(x, y) is a binary function of the input x and the output y. E_P̃(f_i) = Σ_{x,y} P̃(x, y) f_i(x, y) is the expected value of f_i(x, y) with respect to the empirical distribution P̃(X, Y), and E_P(f_i) = Σ_{x,y} P̃(x) P(y|x) f_i(x, y) is the expected value of f_i(x, y) with respect to the model P(y|x) and the empirical distribution P̃(X).
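For illustration only, the sketch below implements this weight determination under the simplifying assumption that each feature f_i is the indicator of a single call pair (x_i, x_j); in that case f#(x, y) = 1 for every sample, the equation in step a has the closed-form solution σ_i = log(E_P̃(f_i) / E_P(f_i)), and the trained weights can be converted into the pair probabilities w_{i,j} = P_w(x_j | x_i) used as edge weights. The function and variable names are illustrative, not part of the patent.

```python
import math
from collections import Counter, defaultdict

def iis_pair_weights(pairs, n_iter=100, tol=1e-6):
    """Sketch of improved iterative scaling for the maximum-entropy model
    P_w(y|x) = exp(w_xy) / sum_y' exp(w_xy'), assuming one indicator feature per (x, y)
    pair so that f#(x, y) = 1 and each update is sigma = log(E_emp / E_model)."""
    n = len(pairs)
    joint = Counter(pairs)                       # empirical counts of (x, y)
    marginal = Counter(x for x, _ in pairs)      # empirical counts of x
    ys_for_x = defaultdict(set)
    for x, y in pairs:
        ys_for_x[x].add(y)

    w = {pair: 0.0 for pair in joint}            # w_ij initialised to 0
    for _ in range(n_iter):
        max_delta = 0.0
        for (x, y), c_xy in joint.items():
            e_emp = c_xy / n                                        # E_P~(f_ij)
            z = sum(math.exp(w[(x, y2)]) for y2 in ys_for_x[x])     # normaliser Z_w(x)
            p_model = math.exp(w[(x, y)]) / z                       # P_w(y|x)
            e_model = (marginal[x] / n) * p_model                   # E_Pw(f_ij)
            sigma = math.log(e_emp / e_model)
            w[(x, y)] += sigma
            max_delta = max(max_delta, abs(sigma))
        if max_delta < tol:                      # stop once all updates have converged
            break
    return w

def pair_probabilities(w, pairs):
    """Convert trained weights into w_ij = P_w(xj | xi), the edge weight of the graph G."""
    ys_for_x = defaultdict(set)
    for x, y in set(pairs):
        ys_for_x[x].add(y)
    probs = {}
    for (x, y), wxy in w.items():
        z = sum(math.exp(w[(x, y2)]) for y2 in ys_for_x[x])
        probs[(x, y)] = math.exp(wxy) / z
    return probs
```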
S5, constructing an RGB graph;
and filling RGB pixel points by combining the API call relation pair and the weight vector with an API category database Q. For example for an API call relationship pair
<x_i, x_j> with weight w_{i,j}, the position of the pixel point in the RGB image is (Q(x_i), Q(x_j)) and its pixel value is (Q(x_i), Q(x_j), w_{i,j} × 255), as shown in fig. 2, which is a visual illustration of constructing the RGB image.
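Under these rules, constructing the image amounts to filling a 256 × 256 × 3 matrix. A minimal numpy sketch is given below; it assumes the category database Q and the pair probabilities from the previous steps, and the row/column convention is an assumption.

```python
import numpy as np

def build_rgb_image(pair_probs, Q, default_category=255):
    """Fill a 256x256x3 RGB image from weighted API call relationship pairs.
    For a pair (xi, xj) with weight w_ij, the pixel at row Q(xi), column Q(xj)
    gets the value (R=Q(xi), G=Q(xj), B=w_ij*255); the row/column choice is assumed."""
    img = np.zeros((256, 256, 3), dtype=np.uint8)
    for (xi, xj), w_ij in pair_probs.items():
        r = Q.get(xi, default_category)       # unknown APIs fall into a default category
        g = Q.get(xj, default_category)
        img[r, g] = (r, g, int(round(w_ij * 255)))
    return img
```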
S6, model training and prediction:
and obtaining an RGB image representing the behavior of calling the API by the malicious software, wherein the image can well represent the relationship between the API and the malicious software. The classification problem of the malicious software family can be converted into an image classification problem through the steps, and then the training and prediction of the model are carried out through an image classification algorithm.
Stacking multi-model fusion is used for training and learning. The Stacking technique consists of a two-layer training process: the first layer divides the original data set into several sub-data sets and feeds them into each base learner, and each base learner outputs its own prediction result; the second layer takes the first-level output as its input, learns from it, and outputs the final prediction result. Stacking therefore implicitly includes a weighting of the first-level classifiers, but this process is handed over to the next-level model: the results of the first-level learners are used as the input of the next level, and that model learns how to combine this input to obtain the final output.
After a comprehensive analysis of image classification techniques, the following 5 algorithms are selected as base classifiers in the first layer of Stacking: SGD, SVM, KNN, MLP and Xception, and LR is selected in the second layer to fuse the classifiers of the first layer. These basic classifiers are briefly introduced below. SGD: the stochastic gradient descent classifier; its advantage is that, with an extremely large sample size, a model whose loss is within an acceptable range can be obtained without using all samples, so it can process large data sets efficiently. SVM: the support vector machine, which represents instances as points in space and whose decision boundary is the maximum-margin hyperplane solved from the training samples; its main advantage is that kernel functions can be used, and its disadvantage is that it is limited by speed and data set size in the training and testing phases. KNN: the K-nearest-neighbor method, which partitions the feature vector space using the training data set and uses this partition as the classification model; its advantages are the simplicity of the algorithm and the ability to handle multiple classes, and its disadvantage is that all features contribute equally to the similarity computation, which may lead to classification errors. MLP: the multi-layer perceptron classifier, which may have one or more non-linear layers, called hidden layers, between the input and output layers; its advantage is the ability to learn non-linear models, and its disadvantage is the need to tune the number of hidden layers and neurons. Xception: an architecture based on depthwise separable convolutions, which can outperform Inception-v3 on a large image classification data set containing 350 million images and 17,000 classes; since the Xception architecture has the same number of parameters as Inception-v3, the performance improvement is not due to increased capacity but to more efficient use of the model parameters. LR: logistic regression, a classification model that is simple, efficient, easy to parallelize and suitable for online learning, and is therefore very widely used.
The specific steps of constructing a stacking multi-model fusion algorithm and training and learning are as follows:
s6.1, firstly, dividing the malicious software RGB image set obtained in S5 into a training set Train and a Test set Test, and dividing the training set into 5 parts: train1, train2, train3, train4, train 5.
S6.2, SGD, SVM, KNN, MLP and Xception are selected as base models. For the SGD model, train1, train2, train3, train4 and train5 are used in turn as the validation set, the remaining 4 parts are used as the training set, 5-fold cross-validation is performed to train the model, and predictions are made on the test set, which yields 5 parts of predictions from the SGD model and 1 part of predicted values B1 on the test set. The 5 parts of predictions are concatenated vertically to obtain P1. The SVM, KNN, MLP and Xception models are processed in the same way.
And S6.3, after the training of the 5 models is completed, the predicted values of the 5 models on the training set are input as the training sets (P1, P2, ..., P5) of the LR model for training.
And S6.4, predicting by using the trained LR model to obtain a final predicted malware family class label or probability.
In summary, a Stacking multi-model fusion framework map can be obtained, as shown in fig. 3.
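The two-layer scheme of steps S6.1 to S6.4 can be sketched compactly with scikit-learn's StackingClassifier, as shown below. Xception is omitted here because it is a convolutional network rather than a scikit-learn estimator, and the flattened-pixel features and hyperparameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def train_stacking(images, labels):
    """images: array of shape (n_samples, 256, 256, 3); labels: malware family names.
    First layer: SGD, SVM, KNN and MLP base classifiers trained with 5-fold cross-validation;
    second layer: an LR meta-classifier trained on their out-of-fold predictions."""
    X = images.reshape(len(images), -1) / 255.0          # flattened pixels (assumed encoding)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, stratify=labels)

    base_learners = [
        # "log_loss" (named "log" in older scikit-learn) lets SGD output probabilities
        ("sgd", SGDClassifier(loss="log_loss", max_iter=1000)),
        ("svm", SVC(probability=True)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)),
        # an Xception CNN would be added as a fifth base learner in the full method
    ]
    clf = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,                          # 5-fold split of the training set, as in S6.1-S6.2
        stack_method="predict_proba",  # out-of-fold probabilities become the LR training set
    )
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    return clf
```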
S7, model verification:
the MSFE loss function is used to measure how good the model predicts. Unlike the MSE loss function, the MSFE loss function can capture errors from both the majority and minority classes equally, and in particular, by computing errors on different classes separately, it is more sensitive to errors in the minority class than the commonly used MSE loss function. The MSFE loss function calculation formula is as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
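The MSFE computation can be written directly in numpy. The sketch below treats the labels as binary (one family as the positive class, the rest as negative), which is an interpretation of the per-class split implied by the formulas rather than the patent's exact multi-class formulation.

```python
import numpy as np

def msfe_loss(y_true, y_pred):
    """MSFE = FPE**2 + FNE**2: FPE averages the squared error over the N negative samples,
    FNE over the P positive samples (binary 0/1 labels assumed for this sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    neg = y_true == 0
    pos = y_true == 1
    fpe = np.mean((y_true[neg] - y_pred[neg]) ** 2) if neg.any() else 0.0
    fne = np.mean((y_true[pos] - y_pred[pos]) ** 2) if pos.any() else 0.0
    return fpe ** 2 + fne ** 2

# example: msfe_loss([0, 0, 1, 1], [0.1, 0.4, 0.7, 0.9])
```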
As shown in fig. 4, in another embodiment, a malware family classification system based on RGB image and Stacking multi-model fusion is provided, which includes a database construction module, a feature extraction module, a feature processing module, a weight determination module, an RGB image construction module, a Stacking multi-model fusion module, and a prediction module;
the database construction module is used for constructing an API (application programming interface) category database Q;
the feature extraction module is configured to execute the unknown malware M in the sandbox, obtain an execution report, and extract the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
the feature processing module constructs an API call relation pair according to the API call sequence chain so as to obtain an API call relation pair directed graph G;
the weight determination module is used for determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
the RGB image construction module is used for filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q, to obtain an RGB image representing the API-calling behavior of the malware; the RGB image represents well the call relationships between the APIs of the malware;
the stacking multi-model fusion module is used for constructing a stacking multi-model fusion algorithm and performing training and learning, the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and the prediction module is used for inputting the RGB image data set representing the behavior characteristics of each piece of malware into the stacking multi-model fusion algorithm so as to predict the malware family name.
Further, the system comprises a loss function module, which measures the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
It should be noted that the system provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure is divided into different functional modules to complete all or part of the above described functions.
As shown in fig. 5, in another embodiment of the present application, there is further provided a storage medium storing a program, which when executed by a processor, implements a malware family classification method based on RGB image and Stacking multi-model fusion, specifically:
constructing an API category database Q;
executing unknown malware M in a sandbox, obtaining an execution report, and extracting the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
constructing an API call relation pair according to the API call sequence chain, thereby obtaining an API call relation pair directed graph G;
determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q to obtain an RGB image representing the API-calling behavior of the malware, wherein the RGB image represents well the call relationships between the APIs of the malware;
constructing a stacking multi-model fusion algorithm and performing training learning, wherein the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and inputting the RGB image data set representing the behavior characteristics of each piece of malware into a stacking multi-model fusion algorithm, thereby predicting the name of the malware family.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. The method for classifying the malicious software family based on the RGB image and Stacking multi-model fusion is characterized by comprising the following steps of:
constructing an API category database Q;
executing unknown malware M in a sandbox, obtaining an execution report, and extracting the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
constructing an API call relation pair according to the API call sequence chain, thereby obtaining an API call relation pair directed graph G;
determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
filling RGB pixel points by combining the API call relationship pairs and the weights with the API category database Q to obtain an RGB image representing the API-calling behavior of the malware, wherein the RGB image represents well the call relationships between the APIs of the malware;
constructing a stacking multi-model fusion algorithm and performing training learning, wherein the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and inputting the RGB image data set representing each malware behavior characteristic into a stacking multi-model fusion algorithm, thereby predicting the probability value of the malware family, and taking the maximum probability value as the classification result of unknown malware.
2. The method for classifying the malware family based on the fusion of the RGB image and the Stacking multi-model as claimed in claim 1, wherein the constructing of the API category database specifically comprises:
the APIs are divided into 256 categories [0, 255]; API category numbers [0, 11] are assigned according to the categories in the Windows API reference manual;
API category numbers [12, 255] are assigned according to the number of occurrences in the report on APIs used by popular malware, with adjacent intervals of 1200 occurrences.
3. The classification method for the malware family based on the fusion of the RGB image and the Stacking multi-model as claimed in claim 1, wherein the API call relationship pair is:
g = {<x_1, y_1 = x_2>, <x_2, y_2 = x_3>, <x_3, y_3 = x_4>, ..., <x_{n-1}, y_{n-1} = x_n>}
the definition of the API call relation to the directed graph G is as follows:
G = (S, E), where S is the set of called APIs and E is the set of weighted directed edges,
E = {<(x_i, x_j), w_{i,j}>},
where w_{i,j} is the weight of the edge from x_i to x_j, representing the probability that x_j is called after x_i is called.
4. The method for classifying the malware family based on the fusion of the RGB image and the Stacking multi-model according to claim 1, wherein the determination of the weight is performed by using an improved iterative scaling algorithm in the maximum entropy model, specifically:
Input: feature functions f_i(x, y), i = 1, 2, ..., n-1; the empirical distribution P̃(X, Y); the model P_w(y|x);
Output: the optimal weights w'_i; the optimal model P_{w'}.
s1. Initialize w_i = 0 for i = 1, 2, ..., n-1;
s2. For each i = 1, 2, ..., n-1:
a. let σ_i be the solution of the equation
Σ_{x,y} P̃(x) P_w(y|x) f_i(x, y) exp(σ_i f#(x, y)) = E_P̃(f_i), where f#(x, y) = Σ_{k=1}^{n-1} f_k(x, y);
b. update the value of w_i: w_i ← w_i + σ_i;
s3. If all w_i have converged, stop; otherwise repeat s2;
where, for the given training set g = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_{n-1}, y_{n-1})}, the empirical distribution of the joint distribution P(X, Y) is determined as P̃(X = x, Y = y) = ν(X = x, Y = y)/(n-1), and the empirical distribution of the marginal distribution P(X) as P̃(X = x) = ν(X = x)/(n-1), with ν(·) denoting the frequency of occurrence in the training set. The feature function f_i(x, y) is a binary function of the input x and the output y; E_P̃(f_i) = Σ_{x,y} P̃(x, y) f_i(x, y) is the expected value of f_i(x, y) with respect to the empirical distribution P̃(X, Y), and E_P(f_i) = Σ_{x,y} P̃(x) P(y|x) f_i(x, y) is the expected value of f_i(x, y) with respect to the model P(y|x) and the empirical distribution P̃(X).
5. The RGB image and Stacking multi-model fusion-based malware family classification method as claimed in claim 1, wherein the filling of RGB pixel points is performed by combining the API call relationship pairs and the weight vector with the API category database Q, specifically: each API call relationship pair (x_i, x_j) determines the position to be filled in the image matrix, the position being (R = Q(x_i), G = Q(x_j)), and the pixel value at this position being (R = Q(x_i), G = Q(x_j), B = w_{i,j} × 255), where w_{i,j} is the weight of x_i to x_j, representing the probability that x_j is called after x_i is called.
6. The malware family classification method based on RGB image and Stacking multi-model fusion as claimed in claim 1, wherein in the Stacking multi-model fusion algorithm, the following 5 algorithms are selected as base classifiers in the first layer: SGD, SVM, KNN, MLP and Xception, and LR is selected in the second layer to fuse the classifiers of the first layer;
the specific steps of constructing a stacking multi-model fusion algorithm and training and learning are as follows:
firstly, dividing the obtained malware RGB image set into a training set Train and a test set Test, and dividing the training set into 5 parts: train1, train2, train3, train4, train5;
selecting SGD, SVM, KNN, MLP and Xception as base models; for the SGD model, sequentially using train1, train2, train3, train4 and train5 as the validation set and the remaining 4 parts as the training set, performing 5-fold cross-validation to train the model, and predicting on the test set, which yields 5 parts of predictions from the SGD model plus 1 part of predicted values B1 on the test set; the 5 parts of predictions are concatenated vertically to obtain P1; the SVM, KNN, MLP and Xception models are processed in the same way;
after the training of the 5 models is finished, the predicted values of the 5 models on the training set are respectively input as the training sets (P1, P2, ..., P5) of the LR model for training;
and predicting by using the trained LR model to obtain a final predicted malware family class label or probability.
7. The method for classifying the malware family based on the fusion of the RGB image and the Stacking multi-model as claimed in claim 1, further comprising the following steps:
and measuring the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
8. A malware family classification system based on RGB image and Stacking multi-model fusion, characterized by being applied to the malware family classification method based on RGB image and Stacking multi-model fusion according to any one of claims 1 to 7, and comprising a database construction module, a feature extraction module, a feature processing module, a weight determination module, an RGB image construction module, a Stacking multi-model fusion module and a prediction module;
the database construction module is used for constructing an API (application programming interface) category database Q;
the feature extraction module is configured to execute the unknown malware M in the sandbox, obtain an execution report, and extract the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
the feature processing module constructs an API call relation pair according to the API call sequence chain so as to obtain an API call relation pair directed graph G;
the weight determination module is used for determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
the RGB image construction module is used for filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q, to obtain an RGB image representing the API-calling behavior of the malware; the RGB image represents well the call relationships between the APIs of the malware;
the stacking multi-model fusion module is used for constructing a stacking multi-model fusion algorithm and performing training and learning, the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and the prediction module is used for inputting the RGB image data set representing each malware behavior characteristic into a stacking multi-model fusion algorithm so as to predict the name of the malware family.
9. The malware family classification system based on RGB image and Stacking multi-model fusion according to claim 8, further comprising a loss function module which measures the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
10. A storage medium storing a program, wherein the program, when executed by a processor, implements the method for classifying a malware family based on fusion of RGB images and Stacking multiple models according to any one of claims 1 to 7.
CN202110589078.1A 2021-05-28 2021-05-28 Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion Active CN113222053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110589078.1A CN113222053B (en) 2021-05-28 2021-05-28 Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110589078.1A CN113222053B (en) 2021-05-28 2021-05-28 Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion

Publications (2)

Publication Number Publication Date
CN113222053A CN113222053A (en) 2021-08-06
CN113222053B 2022-03-15

Family

ID=77099041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110589078.1A Active CN113222053B (en) 2021-05-28 2021-05-28 Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion

Country Status (1)

Country Link
CN (1) CN113222053B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570453A (en) * 2021-09-24 2021-10-29 中国光大银行股份有限公司 Abnormal behavior identification method and device
CN116523136A (en) * 2023-05-05 2023-08-01 中国自然资源航空物探遥感中心 Mineral resource space intelligent prediction method and device based on multi-model integrated learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552966A (en) * 2020-04-07 2020-08-18 哈尔滨工程大学 Malicious software homology detection method based on information fusion
CN111832020A (en) * 2020-06-22 2020-10-27 华中科技大学 Android application maliciousness and malicious ethnicity detection model construction method and application
CN112182577A (en) * 2020-10-14 2021-01-05 哈尔滨工程大学 Android malicious code detection method based on deep learning

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945347B (en) * 2012-09-29 2016-02-24 中兴通讯股份有限公司 A kind of method, system and equipment detecting Android malware
CN106096411B (en) * 2016-06-08 2018-09-18 浙江工业大学 A kind of Android malicious code family classification methods based on bytecode image clustering
US10848519B2 (en) * 2017-10-12 2020-11-24 Charles River Analytics, Inc. Cyber vaccine and predictive-malware-defense methods and systems
CN107908963B (en) * 2018-01-08 2020-11-06 北京工业大学 Method for automatically detecting core characteristics of malicious codes
CN108280348B (en) * 2018-01-09 2021-06-22 上海大学 Android malicious software identification method based on RGB image mapping
US11580222B2 (en) * 2018-11-20 2023-02-14 Siemens Aktiengesellschaft Automated malware analysis that automatically clusters sandbox reports of similar malware samples
CN110427756B (en) * 2019-06-20 2021-05-04 中国人民解放军战略支援部队信息工程大学 Capsule network-based android malicious software detection method and device
CN110222511B (en) * 2019-06-21 2021-04-23 杭州安恒信息技术股份有限公司 Malicious software family identification method and device and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552966A (en) * 2020-04-07 2020-08-18 哈尔滨工程大学 Malicious software homology detection method based on information fusion
CN111832020A (en) * 2020-06-22 2020-10-27 华中科技大学 Android application maliciousness and malicious ethnicity detection model construction method and application
CN112182577A (en) * 2020-10-14 2021-01-05 哈尔滨工程大学 Android malicious code detection method based on deep learning

Also Published As

Publication number Publication date
CN113222053A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
Vinayakumar et al. Robust intelligent malware detection using deep learning
Lin Deep learning for IoT
Jian et al. A novel framework for image-based malware detection with a deep neural network
Jahromi et al. An improved two-hidden-layer extreme learning machine for malware hunting
US11025649B1 (en) Systems and methods for malware classification
CN113222053B (en) Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion
CN113360912A (en) Malicious software detection method, device, equipment and storage medium
Jin et al. A malware detection approach using malware images and autoencoders
Dewanje et al. A new malware detection model using emerging machine learning algorithms
AlGarni et al. An efficient convolutional neural network with transfer learning for malware classification
Khoda et al. Selective adversarial learning for mobile malware
Nahhas et al. Android Malware Detection Using ResNet-50 Stacking.
Yadav et al. Deep learning in malware identification and classification
Yaseen et al. A Deep Learning-based Approach for Malware Classification using Machine Code to Image Conversion
Alohali et al. Optimal Deep Learning Based Ransomware Detection and Classification in the Internet of Things Environment.
Alqahtani Machine learning techniques for malware detection with challenges and future directions
Onoja et al. Exploring the effectiveness and efficiency of LightGBM algorithm for windows malware detection
Alzahem et al. Towards optimizing malware detection: An approach based on generative adversarial networks and transformers
Sun et al. MLxPack: Investigating the effects of packers on ML-based Malware detection systems using static and dynamic traits
WO2020075462A1 (en) Learner estimating device, learner estimation method, risk evaluation device, risk evaluation method, and program
Aslam et al. Explainable Classification Model for Android Malware Analysis Using API and Permission-Based Features.
Bala et al. Transfer learning approach for malware images classification on android devices using deep convolutional neural network
Rueda et al. A hybrid intrusion detection approach based on deep learning techniques
Thakur et al. Classification of Android malware using its image sections
Luo et al. Sequence-based malware detection using a single-bidirectional graph embedding and multi-task learning framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant