CN113222053B - Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion - Google Patents

Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion

Info

Publication number
CN113222053B
CN113222053B (Application No. CN202110589078.1A)
Authority
CN
China
Prior art keywords
api
model
malware
rgb image
malicious software
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110589078.1A
Other languages
Chinese (zh)
Other versions
CN113222053A (en)
Inventor
李树栋
许娜
吴晓波
韩伟红
方滨兴
田志宏
顾钊铨
殷丽华
唐可可
仇晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202110589078.1A priority Critical patent/CN113222053B/en
Publication of CN113222053A publication Critical patent/CN113222053A/en
Application granted granted Critical
Publication of CN113222053B publication Critical patent/CN113222053B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53: Monitoring users, programs or devices to maintain the integrity of platforms during program execution by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464: Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06V10/56: Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion, wherein the method comprises the following steps: constructing an API category database Q; extracting the malware API call sequence chain; constructing API call relationship pairs from the API call sequence chain to obtain an API call relationship pair directed graph G; determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair; obtaining an RGB image representing the API-calling behavior of the malware; constructing a Stacking multi-model fusion classifier, training it, and inputting the RGB image data set representing the behavior characteristics of each piece of malware into the classifier so as to predict the malware family name. According to the invention, the API-calling behavior of the malware is converted into an RGB image through conversion rules; the conversion process considers not only the number of API calls but also the call relationships between APIs, and the Stacking technique is used for multi-model fusion, so the accuracy of the model can be improved.

Description

Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion
Technical Field
The invention belongs to the technical field of malicious software classification, and particularly relates to a malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion.
Background
Malware refers to executable programs written for some malicious purpose, including viruses, worms, and Trojan horses. Ransomware has had a serious impact on society; it mainly attacks medium and large government and enterprise institutions in sectors such as business, government and education through email phishing, account brute-forcing, vulnerability exploitation and other means, and its operators reap huge profits from it. At the same time, the amount of malware has increased year by year: by 2020, the number of malware executables known to the security community had exceeded 1.1 billion, and this number is likely to keep growing. The reasons for the increase in the amount of malware are as follows. First, with the rapid development of network technology, there are more and more malware propagation channels, such as downloading pirated movies, searching hot topics, and installing unknown anti-virus software. Second, the abuse of automated malware generation tools results in an increase in the number of malware variants. Third, malware criminal groups have gradually formed large-scale commercial operations, creating a new ecosystem of malware collaboration.
In order to solve the above problems, a method of classifying malware families can be adopted to perform targeted defense on unknown malware. On the whole, the malware classification technology relies on a malware detection method, and the current malware detection methods mainly include three types: static analysis, dynamic analysis, and visualization techniques.
The static analysis method focuses on the representation of malware in textual form and analyzes static features extracted from the malware's binary file, executable file or disassembled file. Code reuse within the same malware family leads to similarity between the code of malware authors or teams, so the data and structural information of malware of the same family also show a certain similarity when the samples are loaded into memory and executed. However, when the static analysis method is adopted and the malware is packed or its resources are obfuscated, static features cannot be acquired. Malware packing refers to the process by which a malware author compresses, encrypts or otherwise alters a malware program so that, when run, it automatically unpacks itself and executes. Malware resource obfuscation refers to storing obfuscated resources on disk and restoring them at run time so that they can be used by the malware. In addition, the method is limited by the disassembly technology, dynamically downloaded data, and so on.
The dynamic analysis method focuses on the behavioral characteristics of malware: by executing a malware sample in a virtual controlled environment, it records behavioral information of the malware such as behavior logs, context parameters and API (application programming interface) call sequences. By running the malware, it is possible to find out which servers a particular malware file connects to, which parameters were modified, and which device inputs/outputs were performed. This method can bypass the limitation of malware packing, but it cannot handle malware that exhibits no network behavior during execution; it is only suitable for detecting malware with network behavior and lacks generalization capability.
The visualization technique visualizes the information obtained from malware by static or dynamic analysis. Visualized data is usually more intuitive than non-visualized statistics, and visualizing security data makes it possible to quickly identify the development trend of malware across the whole threat landscape. However, this method only considers the number of times each API is executed during the whole execution process and does not consider the correlation between APIs, while in practice the correlation between the APIs called by malware has a great influence on the classification result.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a malware family classification method, system and medium based on RGB image and Stacking multi-model fusion.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a malicious software family classification method based on RGB image and Stacking multi-model fusion, which comprises the following steps:
constructing an API category database Q;
executing unknown malware M in a sandbox, obtaining an execution report, and extracting the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
constructing an API call relation pair according to the API call sequence chain, thereby obtaining an API call relation pair directed graph G;
determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q to obtain an RGB image representing the API-calling behavior of the malware, wherein the RGB image represents well the call relationships between the APIs of the malware;
constructing a stacking multi-model fusion algorithm and performing training learning, wherein the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and inputting the RGB image data set representing the behavior characteristics of each piece of malware into a stacking multi-model fusion algorithm, thereby predicting the name of the malware family.
As a preferred technical solution, the constructing the API category database specifically includes:
the APIs are divided into 256 categories [0, 255]; API category numbers [0, 11] are assigned according to the categories in the Windows API reference manual;
API category numbers [12, 255] are assigned according to the number of occurrences in the report on APIs most used by popular malware (https://rstforums.com/forum/topic/95273-top-maliciously-used-api/), with adjacent intervals of 1200 occurrences.
As a preferred technical solution, the API call relationship pair is:
g = {<x_1, y_1 = x_2>, <x_2, y_2 = x_3>, <x_3, y_3 = x_4>, ..., <x_{n-1}, y_{n-1} = x_n>}
the API call relationship pair directed graph G is defined as:
G = (S, E), where S is the set of called APIs and E is the set of weighted directed edges,
E = {<(x_i, x_j), w_{i,j}>},
where w_{i,j} is the weight of the edge from x_i to x_j, representing the probability that x_j is called after x_i is called.
As a preferred technical solution, the weights are determined by using the improved iterative scaling algorithm of the maximum entropy model, specifically:
Input: feature functions f_i(x, y), i = 1, 2, ..., n-1; the empirical distribution P̃(X, Y); the model P_w(y|x);
Output: the optimal weights w'_i; the optimal model P_{w'}.
s1. Initialize w_i = 0 for i = 1, 2, ..., n-1;
s2. For each i = 1, 2, ..., n-1:
a. let σ_i be the solution of the equation
Σ_{x,y} P̃(x) P_w(y|x) f_i(x, y) exp(σ_i f#(x, y)) = E_P̃(f_i), where f#(x, y) = Σ_{k=1}^{n-1} f_k(x, y);
b. update the value of w_i: w_i ← w_i + σ_i;
s3. If all w_i have converged, stop; otherwise repeat s2;
where, for the given training set g = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_{n-1}, y_{n-1})}, the empirical distribution of the joint distribution P(X, Y) is P̃(X = x, Y = y) = ν(X = x, Y = y)/(n-1) and the empirical distribution of the marginal distribution P(X) is P̃(X = x) = ν(X = x)/(n-1), with ν(·) denoting the frequency of occurrence in the training set. The feature function f_i(x, y) is a binary function of the input x and the output y; E_P̃(f_i) = Σ_{x,y} P̃(x, y) f_i(x, y) is the expected value of f_i(x, y) with respect to the empirical distribution P̃(X, Y), and E_P(f_i) = Σ_{x,y} P̃(x) P(y|x) f_i(x, y) is the expected value of f_i(x, y) with respect to the model P(y|x) and the empirical distribution P̃(X).
As a preferred technical solution, the filling of RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q specifically includes: each API call relationship pair (x_i, x_j) determines the position to be filled in the image matrix, the position being (R = Q(x_i), G = Q(x_j)), and the pixel value at this position being (R = Q(x_i), G = Q(x_j), B = w_{i,j} × 255), where w_{i,j} is the weight of x_i to x_j, representing the probability that x_j is called after x_i is called.
As a preferred technical solution, in the Stacking multi-model fusion algorithm, the following five algorithms are selected as base classifiers in the first layer: SGD, SVM, KNN, MLP and Xception, and LR is selected in the second layer to fuse the classifiers of the first layer;
the specific steps of constructing a stacking multi-model fusion algorithm and training and learning are as follows:
firstly, dividing the obtained malware RGB image set into a training set Train and a test set Test, and dividing the training set into 5 parts: train1, train2, train3, train4, train5;
selecting SGD, SVM, KNN, MLP and Xception as base models; for the SGD model, sequentially using train1, train2, train3, train4 and train5 as the validation set and the remaining 4 parts as the training set, performing 5-fold cross-validation to train the model, and predicting on the test set, which yields 5 parts of predictions from the SGD model plus 1 part of predicted values B1 on the test set; the 5 parts of predictions are concatenated vertically to obtain P1; the SVM, KNN, MLP and Xception models are processed in the same way;
after the training of the 5 models is finished, the predicted values of the 5 models on the training set are respectively input as the training sets (P1, P2, ..., P5) of the LR model for training;
and predicting by using the trained LR model to obtain a final predicted malware family class label or probability.
As a preferable technical scheme, the method further comprises the following steps:
and measuring the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
The invention also provides a malicious software family classification system based on RGB image and Stacking multi-model fusion, which is applied to the above malicious software family classification method based on RGB image and Stacking multi-model fusion and comprises a database construction module, a feature extraction module, a feature processing module, a weight determination module, an RGB image construction module, a Stacking multi-model fusion module and a prediction module;
the database construction module is used for constructing an API (application programming interface) category database Q;
the feature extraction module is configured to execute the unknown malware M in the sandbox, obtain an execution report, and extract the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
the feature processing module constructs an API call relation pair according to the API call sequence chain so as to obtain an API call relation pair directed graph G;
the weight determination module is used for determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
the RGB image construction module is used for filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q, to obtain an RGB image representing the API-calling behavior of the malware; the RGB image represents well the call relationships between the APIs of the malware;
the stacking multi-model fusion module is used for constructing a stacking multi-model fusion algorithm and performing training and learning, the stacking multi-model fusion network is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and the prediction module is used for inputting the RGB image data set representing the behavior characteristics of each malicious software into a stacking multi-model fusion algorithm so as to predict the names of the malicious software families.
As a preferred technical solution, the system further comprises a loss function module, which measures the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
The invention further provides a storage medium which stores a program; when the program is executed by a processor, it implements the above malware family classification method based on RGB image and Stacking multi-model fusion.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention classifies malware families by adopting a RGB image and Stacking multi-model fusion method, the API calling behavior of the malware is converted into the RGB image through a conversion rule, the conversion process not only considers the API calling times, but also considers the calling relation between the API and the API, so that the malware classification problem is converted into the image classification problem, SGD, SVM, KNN, MLP, Xception and LR models are selected, the Stacking technology is used for carrying out multi-model fusion, the accuracy of the models can be improved, meanwhile, MSFE loss functions are used for evaluating the integral models, and the influence caused by unbalanced data set categories can be eliminated.
The invention has the advantage of rapidly classifying a large amount of malware. Compared with the prior art, the behavioral characteristics of the malware are selected by a dynamic analysis method, and in the process of converting the behavioral characteristics into an image, not only the number of times each API is called but also the correlation between APIs and the API category are considered, so the obtained RGB image represents the behavioral information of the malware more accurately. Five classification models are fused at the same time, which makes the prediction result more accurate.
The invention has the advantage of bypassing malware packing and obfuscation. Compared with the prior art, static features of the malware are not selected; instead, dynamic features are selected, which avoids the problem that malware analysts cannot analyze the malware because of packing or obfuscation, and is therefore feasible.
Drawings
FIG. 1 is a flowchart of a malware family classification method based on RGB image and Stacking multi-model fusion according to an embodiment of the present invention;
FIG. 2 is a pictorial diagram of an embodiment of the present invention for creating RGB images;
FIG. 3 is a diagram of a Stacking multi-model fusion framework according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a malware family classification system based on RGB image and Stacking multi-model fusion according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
Aiming at the defects of packing, obfuscation, encryption and the like that cannot be handled by the malware family classification technology in the prior art, the invention provides a malware family classification scheme based on RGB image and Stacking multi-model fusion on the basis of known malware families.
The main idea of the invention is as follows: extract the API call sequence of the malware by a dynamic analysis method; first construct the API call sequence chain and further construct the API call relationship pair graph; train according to the improved iterative scaling algorithm of the maximum entropy model to obtain the weight vector of the API call relationship pairs; construct the RGB image set from the API call relationship pairs and the weight vector; and finally train and predict the malware family by the Stacking model fusion method.
The technical scheme adopted by the invention has the following two purposes. First, considering the dependencies between APIs may improve the accuracy of malware classification. Second, the malware family classification problem is converted into an RGB image classification problem, the classification accuracy is improved by constructing a Stacking multi-model fusion method, and at the same time the influence of class imbalance can be eliminated by means of the Mean Squared False Error (MSFE) loss function.
Setup of the relevant system: Cuckoo Sandbox 2.0 is used to dynamically execute and trace samples in order to obtain the API call sequence.
As shown in fig. 1, the malware family classification method based on RGB image and Stacking multi-model fusion provided by this embodiment includes the following steps:
S1, API classification: the APIs are classified into 256 categories, and the classification rules are as follows:
S1.1, the first 12 API categories [0, 11] are classified according to the categories in the Windows API reference manual, i.e. by API functional features, as shown in Table 1:
table 1: according to API functional partitioning
[Table 1 is not reproduced in this text extraction.]
S1.2, API category numbers [12, 255] are assigned according to the frequency of occurrence in the report on APIs used by popular malware (shown in Table 2), with adjacent intervals of 1200 occurrences, focusing on the hot APIs called by malware.
Table 2: according to the number of called times
[Table 2 is not reproduced in this text extraction.]
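To make steps S1.1 and S1.2 concrete, a minimal sketch of building the category database Q as a Python dictionary follows. Because Tables 1 and 2 are not reproduced here, the API names, functional groups and call counts used below are hypothetical placeholders, and the bucketing direction of the 1200-occurrence interval is an assumption.

```python
# Minimal sketch of the API category database Q (API name -> category index 0..255).
# The API names, groups and counts are hypothetical placeholders for Tables 1 and 2.

FUNCTIONAL_GROUPS = {        # categories [0, 11]: by Windows API reference-manual area
    "CreateFileW": 0,        # e.g. file management
    "RegSetValueExW": 1,     # e.g. registry
    "CreateProcessW": 2,     # e.g. processes and threads
    # ... remaining functional groups up to category 11
}

HOT_API_CALL_COUNTS = {      # categories [12, 255]: by occurrences in the hot-malware API report
    "VirtualAlloc": 35000,   # hypothetical counts
    "GetProcAddress": 28000,
    "LoadLibraryA": 21000,
}

def build_api_category_db():
    """Build Q: functional categories first, then frequency buckets of width 1200."""
    q = dict(FUNCTIONAL_GROUPS)
    max_count = max(HOT_API_CALL_COUNTS.values())
    for api, count in HOT_API_CALL_COUNTS.items():
        if api not in q:
            bucket = 12 + (max_count - count) // 1200   # more frequent -> lower category (assumed)
            q[api] = int(min(bucket, 255))
    return q

Q = build_api_category_db()
```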
The API category database Q is constructed according to the partitioning of steps S1.1 and S1.2 above.
S2, feature extraction:
executing unknown malware M in a sandbox, obtaining an execution report, and extracting the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M.
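A sketch of this extraction step is shown below. It assumes a Cuckoo Sandbox 2.0 JSON report whose behavior section lists per-process API calls; the exact key layout can vary between Cuckoo versions, so the field names are an assumption.

```python
import json

def extract_api_call_chain(report_path):
    """Extract the malware API call sequence chain api_calls = [x1, x2, ..., xn]
    from a Cuckoo Sandbox JSON report (the behavior/processes/calls/api key layout
    is assumed and may differ between Cuckoo versions)."""
    with open(report_path, "r", encoding="utf-8") as f:
        report = json.load(f)
    api_calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api_name = call.get("api")
            if api_name:
                api_calls.append(api_name)
    return api_calls
```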
S3, feature processing:
the API call relation pair is constructed from the API call sequence chain in step S2,
g = {<x_1, y_1 = x_2>, <x_2, y_2 = x_3>, <x_3, y_3 = x_4>, ..., <x_{n-1}, y_{n-1} = x_n>}
this is done to not focus on a single API or complete sequence, but rather to represent the behavior and current state of the malware at that time with pairs of API call relationships. This may result in the definition of the API call directed graph G:
G = (S, E), where S is the set of called APIs and E is the set of weighted directed edges,
E = {<(x_i, x_j), w_{i,j}>},
where w_{i,j} is the weight of the edge from x_i to x_j, representing the probability that x_j is called after x_i is called.
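A minimal sketch of this pair and graph construction is given below; the adjacency-dictionary representation and function names are assumptions made for illustration.

```python
def build_call_relation_pairs(api_calls):
    """g = [<x1, x2>, <x2, x3>, ..., <x_{n-1}, x_n>]: pairs of consecutive API calls."""
    return list(zip(api_calls[:-1], api_calls[1:]))

def build_call_graph(pairs):
    """Directed graph G = (S, E): S is the set of called APIs, E maps (xi, xj) to a weight.
    Weights are initialised to 0.0 and filled in later by the weight-determination step."""
    nodes = set()
    edges = {}
    for xi, xj in pairs:
        nodes.update((xi, xj))
        edges[(xi, xj)] = 0.0
    return nodes, edges

# usage: pairs = build_call_relation_pairs(api_calls); S, E = build_call_graph(pairs)
```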
S4, determining the weight:
in step 3, after the API call sequence chain API _ calls is converted into the API call relationship pair graph G, the weight is determined by using the improved iteration scale algorithm in the maximum entropy model, and finally the weight w of each API call relationship pair can be obtainedi,j. Improved iterative scaling algorithm finds weight vector w ═ w (w)1,w2,...,wn-1) The steps of (1):
Input: feature functions f_i(x, y), i = 1, 2, ..., n-1; the empirical distribution P̃(X, Y); the model P_w(y|x);
Output: the optimal weights w'_i; the optimal model P_{w'}.
s1. Initialize w_i = 0 for i = 1, 2, ..., n-1.
s2. For each i = 1, 2, ..., n-1:
a. let σ_i be the solution of the equation
Σ_{x,y} P̃(x) P_w(y|x) f_i(x, y) exp(σ_i f#(x, y)) = E_P̃(f_i), where f#(x, y) = Σ_{k=1}^{n-1} f_k(x, y);
b. update the value of w_i: w_i ← w_i + σ_i.
s3. If all w_i have converged, stop; otherwise repeat s2.
Here, for the given training set g = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_{n-1}, y_{n-1})}, the empirical distribution of the joint distribution P(X, Y) can be determined as P̃(X = x, Y = y) = ν(X = x, Y = y)/(n-1), and the empirical distribution of the marginal distribution P(X) as P̃(X = x) = ν(X = x)/(n-1), where ν(·) denotes the frequency of occurrence in the training set. The feature function f_i(x, y) is a binary function of the input x and the output y. E_P̃(f_i) = Σ_{x,y} P̃(x, y) f_i(x, y) is the expected value of f_i(x, y) with respect to the empirical distribution P̃(X, Y), and E_P(f_i) = Σ_{x,y} P̃(x) P(y|x) f_i(x, y) is the expected value of f_i(x, y) with respect to the model P(y|x) and the empirical distribution P̃(X).
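For illustration only, the sketch below implements this weight determination under the simplifying assumption that each feature f_i is the indicator of a single call pair (x_i, x_j); in that case f#(x, y) = 1 for every sample, the equation in step a has the closed-form solution σ_i = log(E_P̃(f_i) / E_P(f_i)), and the trained weights can be converted into the pair probabilities w_{i,j} = P_w(x_j | x_i) used as edge weights. The function and variable names are illustrative, not part of the patent.

```python
import math
from collections import Counter, defaultdict

def iis_pair_weights(pairs, n_iter=100, tol=1e-6):
    """Sketch of improved iterative scaling for the maximum-entropy model
    P_w(y|x) = exp(w_xy) / sum_y' exp(w_xy'), assuming one indicator feature per (x, y)
    pair so that f#(x, y) = 1 and each update is sigma = log(E_emp / E_model)."""
    n = len(pairs)
    joint = Counter(pairs)                       # empirical counts of (x, y)
    marginal = Counter(x for x, _ in pairs)      # empirical counts of x
    ys_for_x = defaultdict(set)
    for x, y in pairs:
        ys_for_x[x].add(y)

    w = {pair: 0.0 for pair in joint}            # w_ij initialised to 0
    for _ in range(n_iter):
        max_delta = 0.0
        for (x, y), c_xy in joint.items():
            e_emp = c_xy / n                                        # E_P~(f_ij)
            z = sum(math.exp(w[(x, y2)]) for y2 in ys_for_x[x])     # normaliser Z_w(x)
            p_model = math.exp(w[(x, y)]) / z                       # P_w(y|x)
            e_model = (marginal[x] / n) * p_model                   # E_Pw(f_ij)
            sigma = math.log(e_emp / e_model)
            w[(x, y)] += sigma
            max_delta = max(max_delta, abs(sigma))
        if max_delta < tol:                      # stop once all updates have converged
            break
    return w

def pair_probabilities(w, pairs):
    """Convert trained weights into w_ij = P_w(xj | xi), the edge weight of the graph G."""
    ys_for_x = defaultdict(set)
    for x, y in set(pairs):
        ys_for_x[x].add(y)
    probs = {}
    for (x, y), wxy in w.items():
        z = sum(math.exp(w[(x, y2)]) for y2 in ys_for_x[x])
        probs[(x, y)] = math.exp(wxy) / z
    return probs
```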
S5, constructing an RGB graph;
and filling RGB pixel points by combining the API call relation pair and the weight vector with an API category database Q. For example for an API call relationship pair
<x_i, x_j> with weight w_{i,j}, the position of the pixel point in the RGB image is (Q(x_i), Q(x_j)) and its pixel value is (Q(x_i), Q(x_j), w_{i,j} × 255), as shown in fig. 2, which is a visual illustration of constructing the RGB image.
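Under these rules, constructing the image amounts to filling a 256 × 256 × 3 matrix. A minimal numpy sketch is given below; it assumes the category database Q and the pair probabilities from the previous steps, and the row/column convention is an assumption.

```python
import numpy as np

def build_rgb_image(pair_probs, Q, default_category=255):
    """Fill a 256x256x3 RGB image from weighted API call relationship pairs.
    For a pair (xi, xj) with weight w_ij, the pixel at row Q(xi), column Q(xj)
    gets the value (R=Q(xi), G=Q(xj), B=w_ij*255); the row/column choice is assumed."""
    img = np.zeros((256, 256, 3), dtype=np.uint8)
    for (xi, xj), w_ij in pair_probs.items():
        r = Q.get(xi, default_category)       # unknown APIs fall into a default category
        g = Q.get(xj, default_category)
        img[r, g] = (r, g, int(round(w_ij * 255)))
    return img
```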
S6, model training and prediction:
and obtaining an RGB image representing the behavior of calling the API by the malicious software, wherein the image can well represent the relationship between the API and the malicious software. The classification problem of the malicious software family can be converted into an image classification problem through the steps, and then the training and prediction of the model are carried out through an image classification algorithm.
Stacking multi-model fusion is used for training and learning. The Stacking technique consists of a two-layer training process: the first layer divides the original data set into several sub-data sets and feeds them into each base learner, and each base learner outputs its own prediction result; the second layer takes the first-level output as its input, learns from it, and outputs the final prediction result. Stacking therefore implicitly includes a weighting of the first-level classifiers, but this process is handed over to the next-level model: the results of the first-level learners are used as the input of the next level, and that model learns how to combine this input to obtain the final output.
After a comprehensive analysis of image classification techniques, the following 5 algorithms are selected as base classifiers in the first layer of Stacking: SGD, SVM, KNN, MLP and Xception, and LR is selected in the second layer to fuse the classifiers of the first layer. These basic classifiers are briefly introduced below. SGD: the stochastic gradient descent classifier; its advantage is that, with an extremely large sample size, a model whose loss is within an acceptable range can be obtained without using all samples, so it can process large data sets efficiently. SVM: the support vector machine, which represents instances as points in space and whose decision boundary is the maximum-margin hyperplane solved from the training samples; its main advantage is that kernel functions can be used, and its disadvantage is that it is limited by speed and data set size in the training and testing phases. KNN: the K-nearest-neighbor method, which partitions the feature vector space using the training data set and uses this partition as the classification model; its advantages are the simplicity of the algorithm and the ability to handle multiple classes, and its disadvantage is that all features contribute equally to the similarity computation, which may lead to classification errors. MLP: the multi-layer perceptron classifier, which may have one or more non-linear layers, called hidden layers, between the input and output layers; its advantage is the ability to learn non-linear models, and its disadvantage is the need to tune the number of hidden layers and neurons. Xception: an architecture based on depthwise separable convolutions, which can outperform Inception-v3 on a large image classification data set containing 350 million images and 17,000 classes; since the Xception architecture has the same number of parameters as Inception-v3, the performance improvement is not due to increased capacity but to more efficient use of the model parameters. LR: logistic regression, a classification model that is simple, efficient, easy to parallelize and suitable for online learning, and is therefore very widely used.
The specific steps of constructing a stacking multi-model fusion algorithm and training and learning are as follows:
s6.1, firstly, dividing the malicious software RGB image set obtained in S5 into a training set Train and a Test set Test, and dividing the training set into 5 parts: train1, train2, train3, train4, train 5.
S6.2, SGD, SVM, KNN, MLP and Xception are selected as base models. For the SGD model, train1, train2, train3, train4 and train5 are used in turn as the validation set, the remaining 4 parts are used as the training set, 5-fold cross-validation is performed to train the model, and predictions are made on the test set, which yields 5 parts of predictions from the SGD model and 1 part of predicted values B1 on the test set. The 5 parts of predictions are concatenated vertically to obtain P1. The SVM, KNN, MLP and Xception models are processed in the same way.
And S6.3, after the training of the 5 models is completed, the predicted values of the 5 models on the training set are input as the training sets (P1, P2, ..., P5) of the LR model for training.
And S6.4, predicting by using the trained LR model to obtain a final predicted malware family class label or probability.
In summary, a Stacking multi-model fusion framework map can be obtained, as shown in fig. 3.
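The two-layer scheme of steps S6.1 to S6.4 can be sketched compactly with scikit-learn's StackingClassifier, as shown below. Xception is omitted here because it is a convolutional network rather than a scikit-learn estimator, and the flattened-pixel features and hyperparameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def train_stacking(images, labels):
    """images: array of shape (n_samples, 256, 256, 3); labels: malware family names.
    First layer: SGD, SVM, KNN and MLP base classifiers trained with 5-fold cross-validation;
    second layer: an LR meta-classifier trained on their out-of-fold predictions."""
    X = images.reshape(len(images), -1) / 255.0          # flattened pixels (assumed encoding)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, stratify=labels)

    base_learners = [
        # "log_loss" (named "log" in older scikit-learn) lets SGD output probabilities
        ("sgd", SGDClassifier(loss="log_loss", max_iter=1000)),
        ("svm", SVC(probability=True)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)),
        # an Xception CNN would be added as a fifth base learner in the full method
    ]
    clf = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,                          # 5-fold split of the training set, as in S6.1-S6.2
        stack_method="predict_proba",  # out-of-fold probabilities become the LR training set
    )
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    return clf
```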
S7, model verification:
the MSFE loss function is used to measure how good the model predicts. Unlike the MSE loss function, the MSFE loss function can capture errors from both the majority and minority classes equally, and in particular, by computing errors on different classes separately, it is more sensitive to errors in the minority class than the commonly used MSE loss function. The MSFE loss function calculation formula is as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
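The MSFE computation can be written directly in numpy. The sketch below treats the labels as binary (one family as the positive class, the rest as negative), which is an interpretation of the per-class split implied by the formulas rather than the patent's exact multi-class formulation.

```python
import numpy as np

def msfe_loss(y_true, y_pred):
    """MSFE = FPE**2 + FNE**2: FPE averages the squared error over the N negative samples,
    FNE over the P positive samples (binary 0/1 labels assumed for this sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    neg = y_true == 0
    pos = y_true == 1
    fpe = np.mean((y_true[neg] - y_pred[neg]) ** 2) if neg.any() else 0.0
    fne = np.mean((y_true[pos] - y_pred[pos]) ** 2) if pos.any() else 0.0
    return fpe ** 2 + fne ** 2

# example: msfe_loss([0, 0, 1, 1], [0.1, 0.4, 0.7, 0.9])
```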
As shown in fig. 4, in another embodiment, a malware family classification system based on RGB image and Stacking multi-model fusion is provided, which includes a database construction module, a feature extraction module, a feature processing module, a weight determination module, an RGB image construction module, a Stacking multi-model fusion module, and a prediction module;
the database construction module is used for constructing an API (application programming interface) category database Q;
the feature extraction module is configured to execute the unknown malware M in the sandbox, obtain an execution report, and extract the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
the feature processing module constructs an API call relation pair according to the API call sequence chain so as to obtain an API call relation pair directed graph G;
the weight determination module is used for determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
the RGB image construction module is used for filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q, to obtain an RGB image representing the API-calling behavior of the malware; the RGB image represents well the call relationships between the APIs of the malware;
the stacking multi-model fusion module is used for constructing a stacking multi-model fusion algorithm and performing training and learning, the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and the prediction module is used for inputting the RGB image data set representing the behavior characteristics of each piece of malware into the stacking multi-model fusion algorithm so as to predict the malware family name.
Further, the system comprises a loss function module, which measures the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
It should be noted that the system provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure is divided into different functional modules to complete all or part of the above described functions.
As shown in fig. 5, in another embodiment of the present application, there is further provided a storage medium storing a program, which when executed by a processor, implements a malware family classification method based on RGB image and Stacking multi-model fusion, specifically:
constructing an API category database Q;
executing unknown malware M in a sandbox, obtaining an execution report, and extracting the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
constructing an API call relation pair according to the API call sequence chain, thereby obtaining an API call relation pair directed graph G;
determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q to obtain an RGB image representing the API-calling behavior of the malware, wherein the RGB image represents well the call relationships between the APIs of the malware;
constructing a stacking multi-model fusion algorithm and performing training learning, wherein the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and inputting the RGB image data set representing the behavior characteristics of each piece of malware into a stacking multi-model fusion algorithm, thereby predicting the name of the malware family.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. The method for classifying the malicious software family based on the RGB image and Stacking multi-model fusion is characterized by comprising the following steps of:
constructing an API category database Q;
executing unknown malware M in a sandbox, obtaining an execution report, and extracting the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
constructing an API call relation pair according to the API call sequence chain, thereby obtaining an API call relation pair directed graph G;
determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
filling RGB pixel points by combining the API call relationship pairs and the weights with the API category database Q to obtain an RGB image representing the API-calling behavior of the malware, wherein the RGB image represents well the call relationships between the APIs of the malware;
constructing a stacking multi-model fusion algorithm and performing training learning, wherein the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and inputting the RGB image data set representing each malware behavior characteristic into a stacking multi-model fusion algorithm, thereby predicting the probability value of the malware family, and taking the maximum probability value as the classification result of unknown malware.
2. The method for classifying the malware family based on the fusion of the RGB image and the Stacking multi-model as claimed in claim 1, wherein the constructing of the API category database specifically comprises:
the APIs are divided into 256 categories [0, 255]; API category numbers [0, 11] are assigned according to the categories in the Windows API reference manual;
API category numbers [12, 255] are assigned according to the number of occurrences in the report on APIs used by popular malware, with adjacent intervals of 1200 occurrences.
3. The classification method for the malware family based on the fusion of the RGB image and the Stacking multi-model as claimed in claim 1, wherein the API call relationship pair is:
g = {<x_1, y_1 = x_2>, <x_2, y_2 = x_3>, <x_3, y_3 = x_4>, ..., <x_{n-1}, y_{n-1} = x_n>}
the definition of the API call relation to the directed graph G is as follows:
G = (S, E), where S is the set of called APIs and E is the set of weighted directed edges,
E = {<(x_i, x_j), w_{i,j}>},
where w_{i,j} is the weight of the edge from x_i to x_j, representing the probability that x_j is called after x_i is called.
4. The method for classifying the malware family based on the fusion of the RGB image and the Stacking multi-model according to claim 1, wherein the determination of the weight is performed by using an improved iterative scaling algorithm in the maximum entropy model, specifically:
Input: feature functions f_i(x, y), i = 1, 2, ..., n-1; the empirical distribution P̃(X, Y); the model P_w(y|x);
Output: the optimal weights w'_i; the optimal model P_{w'}.
s1. Initialize w_i = 0 for i = 1, 2, ..., n-1;
s2. For each i = 1, 2, ..., n-1:
a. let σ_i be the solution of the equation
Σ_{x,y} P̃(x) P_w(y|x) f_i(x, y) exp(σ_i f#(x, y)) = E_P̃(f_i), where f#(x, y) = Σ_{k=1}^{n-1} f_k(x, y);
b. update the value of w_i: w_i ← w_i + σ_i;
s3. If all w_i have converged, stop; otherwise repeat s2;
where, for the given training set g = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_{n-1}, y_{n-1})}, the empirical distribution of the joint distribution P(X, Y) is determined as P̃(X = x, Y = y) = ν(X = x, Y = y)/(n-1), and the empirical distribution of the marginal distribution P(X) as P̃(X = x) = ν(X = x)/(n-1), with ν(·) denoting the frequency of occurrence in the training set. The feature function f_i(x, y) is a binary function of the input x and the output y; E_P̃(f_i) = Σ_{x,y} P̃(x, y) f_i(x, y) is the expected value of f_i(x, y) with respect to the empirical distribution P̃(X, Y), and E_P(f_i) = Σ_{x,y} P̃(x) P(y|x) f_i(x, y) is the expected value of f_i(x, y) with respect to the model P(y|x) and the empirical distribution P̃(X).
5. The RGB image and Stacking multi-model fusion-based malware family classification method as claimed in claim 1, wherein the filling of RGB pixel points is performed by combining the API call relationship pairs and the weight vector with the API category database Q, specifically: each API call relationship pair (x_i, x_j) determines the position to be filled in the image matrix, the position being (R = Q(x_i), G = Q(x_j)), and the pixel value at this position being (R = Q(x_i), G = Q(x_j), B = w_{i,j} × 255), where w_{i,j} is the weight of x_i to x_j, representing the probability that x_j is called after x_i is called.
6. The malware family classification method based on RGB image and Stacking multi-model fusion as claimed in claim 1, wherein in the Stacking multi-model fusion algorithm, the following 5 algorithms are selected as base classifiers in the first layer: SGD, SVM, KNN, MLP and Xception, and LR is selected in the second layer to fuse the classifiers of the first layer;
the specific steps of constructing a stacking multi-model fusion algorithm and training and learning are as follows:
firstly, dividing the obtained malware RGB image set into a training set Train and a test set Test, and dividing the training set into 5 parts: train1, train2, train3, train4, train5;
selecting SGD, SVM, KNN, MLP and Xception as base models; for the SGD model, sequentially using train1, train2, train3, train4 and train5 as the validation set and the remaining 4 parts as the training set, performing 5-fold cross-validation to train the model, and predicting on the test set, which yields 5 parts of predictions from the SGD model plus 1 part of predicted values B1 on the test set; the 5 parts of predictions are concatenated vertically to obtain P1; the SVM, KNN, MLP and Xception models are processed in the same way;
after the training of the 5 models is finished, the predicted values of the 5 models on the training set are respectively input as the training sets (P1, P2, ..., P5) of the LR model for training;
and predicting by using the trained LR model to obtain a final predicted malware family class label or probability.
7. The method for classifying the malware family based on the fusion of the RGB image and the Stacking multi-model as claimed in claim 1, further comprising the following steps:
and measuring the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
8. A malware family classification system based on RGB image and Stacking multi-model fusion, characterized by being applied to the malware family classification method based on RGB image and Stacking multi-model fusion according to any one of claims 1 to 7, and comprising a database construction module, a feature extraction module, a feature processing module, a weight determination module, an RGB image construction module, a Stacking multi-model fusion module and a prediction module;
the database construction module is used for constructing an API (application programming interface) category database Q;
the feature extraction module is configured to execute the unknown malware M in the sandbox, obtain an execution report, and extract the malware API call sequence chain API_calls = {x_1, x_2, x_3, ..., x_n} from the execution report, where x_i is the name of an API called by the malware M;
the feature processing module constructs an API call relation pair according to the API call sequence chain so as to obtain an API call relation pair directed graph G;
the weight determination module is used for determining the weights by using the improved iterative scaling algorithm of the maximum entropy model to obtain the weight w_{i,j} of each API call relationship pair;
the RGB image construction module is used for filling RGB pixel points by combining the API call relationship pairs and the weight vector with the API category database Q, to obtain an RGB image representing the API-calling behavior of the malware; the RGB image represents well the call relationships between the APIs of the malware;
the stacking multi-model fusion module is used for constructing a stacking multi-model fusion algorithm and performing training and learning, the stacking multi-model fusion algorithm is divided into two layers of training processes, the first layer divides an original data set into a plurality of sub-data sets and inputs the sub-data sets into each base learner for learning, and each base learner outputs a respective prediction result; the second layer takes the first-level output result as input, the learner of the second layer learns the first-level output result and outputs a final prediction result;
and the prediction module is used for inputting the RGB image data set representing each malware behavior characteristic into a stacking multi-model fusion algorithm so as to predict the name of the malware family.
9. The malware family classification system based on RGB image and Stacking multi-model fusion according to claim 8, further comprising a loss function module which measures the quality of the prediction results by using the MSFE loss function, wherein the MSFE loss function is calculated as follows:
MSFE = FPE^2 + FNE^2
FPE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2 (the sum runs over the N negative samples)
FNE = (1/P) Σ_{i=1}^{P} (y_i - ŷ_i)^2 (the sum runs over the P positive samples)
where y_i is the true family label to which the malware belongs, ŷ_i is the malware family label predicted by the Stacking multi-model fusion, N is the number of negative samples, P is the number of positive samples, FPE is the average false positive error, and FNE is the average false negative error.
10. A storage medium storing a program, wherein the program, when executed by a processor, implements the method for classifying a malware family based on fusion of RGB images and Stacking multiple models according to any one of claims 1 to 7.
CN202110589078.1A 2021-05-28 2021-05-28 Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion Active CN113222053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110589078.1A CN113222053B (en) 2021-05-28 2021-05-28 Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110589078.1A CN113222053B (en) 2021-05-28 2021-05-28 Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion

Publications (2)

Publication Number Publication Date
CN113222053A CN113222053A (en) 2021-08-06
CN113222053B 2022-03-15

Family

ID=77099041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110589078.1A Active CN113222053B (en) 2021-05-28 2021-05-28 Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion

Country Status (1)

Country Link
CN (1) CN113222053B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570453A (en) * 2021-09-24 2021-10-29 中国光大银行股份有限公司 Abnormal behavior identification method and device
CN116523136A (en) * 2023-05-05 2023-08-01 中国自然资源航空物探遥感中心 Mineral resource space intelligent prediction method and device based on multi-model integrated learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552966A (en) * 2020-04-07 2020-08-18 哈尔滨工程大学 Malicious software homology detection method based on information fusion
CN111832020A (en) * 2020-06-22 2020-10-27 华中科技大学 Android application maliciousness and malicious ethnicity detection model construction method and application
CN112182577A (en) * 2020-10-14 2021-01-05 哈尔滨工程大学 Android malicious code detection method based on deep learning

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945347B (en) * 2012-09-29 2016-02-24 中兴通讯股份有限公司 A kind of method, system and equipment detecting Android malware
CN106096411B (en) * 2016-06-08 2018-09-18 浙江工业大学 A kind of Android malicious code family classification methods based on bytecode image clustering
US10848519B2 (en) * 2017-10-12 2020-11-24 Charles River Analytics, Inc. Cyber vaccine and predictive-malware-defense methods and systems
CN107908963B (en) * 2018-01-08 2020-11-06 北京工业大学 Method for automatically detecting core characteristics of malicious codes
CN108280348B (en) * 2018-01-09 2021-06-22 上海大学 Android malicious software identification method based on RGB image mapping
US11580222B2 (en) * 2018-11-20 2023-02-14 Siemens Aktiengesellschaft Automated malware analysis that automatically clusters sandbox reports of similar malware samples
CN110427756B (en) * 2019-06-20 2021-05-04 中国人民解放军战略支援部队信息工程大学 Capsule network-based android malicious software detection method and device
CN110222511B (en) * 2019-06-21 2021-04-23 杭州安恒信息技术股份有限公司 Malicious software family identification method and device and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552966A (en) * 2020-04-07 2020-08-18 哈尔滨工程大学 Malicious software homology detection method based on information fusion
CN111832020A (en) * 2020-06-22 2020-10-27 华中科技大学 Android application maliciousness and malicious ethnicity detection model construction method and application
CN112182577A (en) * 2020-10-14 2021-01-05 哈尔滨工程大学 Android malicious code detection method based on deep learning

Also Published As

Publication number Publication date
CN113222053A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
Vinayakumar et al. Robust intelligent malware detection using deep learning
Lin Deep learning for IoT
Jian et al. A novel framework for image-based malware detection with a deep neural network
Jahromi et al. An improved two-hidden-layer extreme learning machine for malware hunting
US11025649B1 (en) Systems and methods for malware classification
CN113222053B (en) Malicious software family classification method, system and medium based on RGB image and Stacking multi-model fusion
CN113360912A (en) Malicious software detection method, device, equipment and storage medium
Jin et al. A malware detection approach using malware images and autoencoders
Dewanje et al. A new malware detection model using emerging machine learning algorithms
AlGarni et al. An efficient convolutional neural network with transfer learning for malware classification
Khoda et al. Selective adversarial learning for mobile malware
Nahhas et al. Android Malware Detection Using ResNet-50 Stacking.
Yadav et al. Deep learning in malware identification and classification
Yaseen et al. A Deep Learning-based Approach for Malware Classification using Machine Code to Image Conversion
Alohali et al. Optimal Deep Learning Based Ransomware Detection and Classification in the Internet of Things Environment.
Alqahtani Machine learning techniques for malware detection with challenges and future directions
Onoja et al. Exploring the effectiveness and efficiency of LightGBM algorithm for windows malware detection
Alzahem et al. Towards optimizing malware detection: An approach based on generative adversarial networks and transformers
Sun et al. MLxPack: Investigating the effects of packers on ML-based Malware detection systems using static and dynamic traits
WO2020075462A1 (en) Learner estimating device, learner estimation method, risk evaluation device, risk evaluation method, and program
Aslam et al. Explainable Classification Model for Android Malware Analysis Using API and Permission-Based Features.
Bala et al. Transfer learning approach for malware images classification on android devices using deep convolutional neural network
Rueda et al. A hybrid intrusion detection approach based on deep learning techniques
Thakur et al. Classification of Android malware using its image sections
Luo et al. Sequence-based malware detection using a single-bidirectional graph embedding and multi-task learning framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant