CN111814147A - Android malicious software detection method based on model library

Info

Publication number: CN111814147A
Application number: CN202010495937.6A
Authority: CN
Inventors: 李涛; 余东豪; 余鑫
Original assignee: Wuhan University of Science and Engineering WUSE
Current assignee: Wuhan University of Science and Engineering WUSE; Wuhan University of Science and Technology WHUST
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2020-10-23

Abstract

The invention discloses an Android malware detection method based on a model library, comprising the following steps: 1) building a data set from Android applications and labeling it; 2) dividing the training set into data set 1 and data set 2 at a ratio of 8:2, and training each algorithm in the algorithm set on data set 1 to generate base models (BaseModel); 3) randomly combining base models to obtain models (Models); 4) evaluating the models on data set 2 to obtain the detection accuracy of each model; 5) adjusting the weights among the base models; 6) repeating steps 4) to 5) until the best detection result no longer changes; 7) ranking the models to obtain the k models with the best recognition performance; 8) calculating the accuracy, recall and F1 value of the k models on the test set, and performing Android malware detection with the best-performing model. The method is suitable for detecting Android malware across multiple populations.

Description

Android malicious software detection method based on model library

Technical Field

The invention relates to an information security technology, in particular to an android malicious software detection method based on a model library.

Background

Many applications carry multiple class labels and therefore belong to multiple populations. Differences in APP features can place one application in several populations at once: application scenarios cross and overlap, and the population of an APP cannot be marked out precisely. As a result, a recognizer trained on a single population has difficulty detecting applications that span several populations, and malicious-application detection cannot be carried out directly from the population perspective. A model library is therefore proposed to address the difficulty of detecting malicious multi-population APPs, which requires combining multiple recognizers for detection. Quickly finding the optimal combination of recognizers is thus the key to solving the problem.

Disclosure of Invention

The invention aims to solve the technical problem of providing a method for detecting android malicious software based on a model library aiming at the defects in the prior art.

The technical scheme adopted by the invention for solving the technical problems is as follows: an android malicious software detection method based on a model library comprises the following steps:

1) collecting Android applications to build a data set, labeling the data set, and dividing the labeled data set into a test set and a training set;

2) dividing the training set into a data set 1 and a data set 2 according to the proportion of 8:2, and training the algorithm in the algorithm set by using the data in the data set 1 to generate a BaseModel;

the algorithm set is a packaged collection of classification algorithms, including SVM, RF and FC;

3) randomly combining base models (BaseModel) to obtain models (Models); a model is an ensemble recognizer composed of a plurality of base models, and comprises all its base models together with the weight w_i of each base model in the model;

where w_i = n_i/N;

where n_i is the number of instances of base model BaseModel_i in the model, and N is the total number of base models in the model;

4) detecting and evaluating the models by using the data set 2 to obtain the detection result of each model;

5) adjusting the weight between the base models;

6) repeating the steps 4) to 5) until the times of weight adjustment reach a set value or the best value of the detection result does not change any more;

7) sequencing the models to obtain k models with the best recognition effect, and confirming the combination and the weight of the base models;

8) calculating the accuracy, recall and F1 value of the k models on the test set, and performing Android malware detection with the best-performing model.
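Steps 4) to 7) can be sketched as a simple search loop. This is a minimal illustration, not the patented implementation: the `Model` representation, the `evaluate` and `adjust` callables, the `patience` counter used to decide that the best value "no longer changes", and the rule of keeping an adjusted model only when it scores higher are all assumptions introduced here.

```python
import random
from typing import Callable, Dict, List, Tuple

Model = Dict[str, float]  # hypothetical representation: base-model name -> weight


def search_models(
    models: List[Model],
    evaluate: Callable[[Model], float],
    adjust: Callable[[Model, random.Random], Model],
    k: int = 3,
    max_rounds: int = 50,
    patience: int = 5,
    seed: int = 0,
) -> List[Tuple[float, Model]]:
    """Repeat evaluation (step 4) and adjustment (step 5) until a stop rule fires (step 6)."""
    rng = random.Random(seed)
    scored = [(evaluate(m), m) for m in models]            # step 4
    best = max(score for score, _ in scored)
    stale = 0
    for _ in range(max_rounds):                            # stop rule: adjustment count reached
        adjusted = [adjust(m, rng) for _, m in scored]     # step 5
        rescored = [(evaluate(m), m) for m in adjusted]    # step 4 again
        # Assumption: an adjusted model replaces the old one only if it scores higher.
        scored = [new if new[0] > old[0] else old
                  for old, new in zip(scored, rescored)]
        round_best = max(score for score, _ in scored)
        if round_best > best:
            best, stale = round_best, 0
        else:
            stale += 1
            if stale >= patience:                          # stop rule: best value unchanged
                break
    scored.sort(key=lambda pair: pair[0], reverse=True)    # step 7: rank the models
    return scored[:k]                                      # the k best models
```

The returned list holds (score, model) pairs, so the combination and weights of the k best models can be read off directly before the final check on the test set in step 8).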

According to the scheme, the models (Models) in step 3) are obtained by random combination of base models (BaseModel) as follows: first determine the upper limit on the total number of base models in a model, then draw a random number less than or equal to that upper limit as the number of base models, and then combine that many base models to form the model.
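The random combination described above might look like the following sketch. The function name `random_model`, the string names standing in for trained base models, and the choice to draw base models with repetition (so that a base model can appear n_i > 1 times) are assumptions made for illustration.

```python
import random
from typing import List


def random_model(library: List[str], max_size: int, rng: random.Random) -> List[str]:
    """Draw a size no larger than max_size, then draw that many base models."""
    size = rng.randint(1, max_size)                      # random number <= upper limit
    return [rng.choice(library) for _ in range(size)]   # assumption: repetition allowed


rng = random.Random(42)
library = ["SVM", "RF", "FC"]    # placeholders for trained base models
models = [random_model(library, max_size=5, rng=rng) for _ in range(4)]
```

Each entry of `models` is one candidate combination; the weight of each base model then follows from how often it appears.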

According to the scheme, the weights among the base models in step 5) are adjusted by random weight adjustment.

According to the scheme, the following method is adopted for adjusting the weight between the basic models in the step 5):

if the weight of a certain base model BaseModel_p is larger than the set threshold, another base model BaseModel_q is randomly selected, and the weights of both are set to (w_p + w_q)/2.

selecting m base models in the model evaluated in step 4) according to a preset probability P, and replacing them with n randomly generated base models, thereby completing the weight adjustment among the base models and forming a new model, where Np ≥ m ≥ 1 and m ≥ n ≥ 1.
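A possible reading of the replacement adjustment is sketched below. The text does not say exactly how the probability P selects the m base models, so this sketch assumes each base model is picked independently with probability `p`, forces at least one pick so that m ≥ 1, and draws n uniformly from 1 to m. The function name and the list-of-names model representation are hypothetical.

```python
import random
from typing import List


def replace_adjust(model: List[str], library: List[str], p: float,
                   rng: random.Random) -> List[str]:
    """Replace m selected base models with n randomly drawn ones (m >= n >= 1)."""
    picked = {i for i in range(len(model)) if rng.random() < p}
    if not picked:                                   # enforce m >= 1
        picked = {rng.randrange(len(model))}
    n = rng.randint(1, len(picked))                  # enforce m >= n >= 1
    kept = [name for i, name in enumerate(model) if i not in picked]
    return kept + [rng.choice(library) for _ in range(n)]


new_model = replace_adjust(["SVM", "RF", "RF", "FC"], ["SVM", "RF", "FC"],
                           p=0.5, rng=random.Random(1))
```

Because the composition of the model changes, the weights w_i = n_i/N change with it, which is how the replacement acts as a weight adjustment.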

According to the scheme, in the step 5), in the new model, if the weight of a certain base model exceeds the set threshold, the weight of the base model is adjusted to the set threshold.

The invention has the following beneficial effects:

The model built by this detection method is suitable for detecting Android malware across multiple populations, and its accuracy can meet the set requirement.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of a method of an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, a method for detecting android malware based on a model library includes the following steps:

In this embodiment, 98 categories of crawled applications are used as the data set; applications labeled benign are positive samples and applications labeled malicious are negative samples. Applications are drawn from multiple populations of the data set to provide training data and test data for building the model library. The data set is divided into a training set TrainSet and a test set TestSet at a ratio of 9:1.
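The two splits (9:1 into TrainSet and TestSet, then 8:2 of the training set into data set 1 and data set 2) can be sketched as follows. The shuffle-and-cut strategy, the helper name `split`, and the synthetic sample list are assumptions; the text only fixes the ratios.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split(samples: Sequence[T], ratio: float, rng: random.Random) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and cut it at the given ratio."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
# Synthetic placeholder data: (file name, label), 1 = benign (positive), 0 = malicious (negative).
samples = [(f"app_{i}.apk", i % 2) for i in range(1000)]

train_set, test_set = split(samples, 0.9, rng)        # TrainSet : TestSet = 9:1
data_set_1, data_set_2 = split(train_set, 0.8, rng)   # data set 1 : data set 2 = 8:2
```

Data set 1 trains the base models, data set 2 scores the combined models during weight adjustment, and the test set is held back for the final comparison of the k best models.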

The algorithm set is: Algorithms = {Algorithm_1, Algorithm_2, …, Algorithm_n}; the various algorithms are implemented and wrapped so that they share a uniform input and output format;

a base model is a recognizer generated by training a single algorithm on the multi-population application data set;
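One way to give every algorithm a uniform input and output format is a shared interface, sketched below. The `Algorithm` and `BaseModel` protocols, the `fit`/`predict` method names, and the toy `MajorityAlgorithm` are hypothetical; a real algorithm set would wrap SVM, RF and FC classifiers behind the same two methods.

```python
from typing import Dict, Iterable, List, Protocol, Sequence

Features = Sequence[float]


class BaseModel(Protocol):
    def predict(self, x: Features) -> int: ...


class Algorithm(Protocol):
    name: str

    def fit(self, X: List[Features], y: List[int]) -> BaseModel: ...


class MajorityModel:
    """Toy base model that always predicts the label it was built with."""

    def __init__(self, label: int) -> None:
        self.label = label

    def predict(self, x: Features) -> int:
        return self.label


class MajorityAlgorithm:
    """Toy algorithm standing in for a wrapped classifier."""

    name = "majority"

    def fit(self, X: List[Features], y: List[int]) -> MajorityModel:
        return MajorityModel(max(set(y), key=y.count))


def train_all(algorithms: Iterable[Algorithm], X: List[Features],
              y: List[int]) -> Dict[str, BaseModel]:
    """Train every algorithm on the same data and collect the resulting base models."""
    return {a.name: a.fit(X, y) for a in algorithms}


model_library = train_all([MajorityAlgorithm()], [[0.0], [1.0], [2.0]], [1, 1, 0])
```

With this shape, the model library is simply the dictionary returned by `train_all`, and any combination step can treat the base models interchangeably.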

3) randomly combining the base models (BaseModel) to obtain the combined models (Models); a model is an ensemble recognizer consisting of a plurality of base models, and comprises all its base models and the weight of each base model in the model;

The models are obtained by random combination of base models as follows: first determine the upper limit on the total number of base models in a model, then draw a random number less than or equal to that upper limit as the number of base models, and then combine that many base models to form the model.

Model library:

ModelLibrary = {BaseModel_1, BaseModel_2, …, BaseModel_n}.

The ModelLibrary consists of all base models and stores the base models generated by every algorithm in the algorithm set;

a model is an ensemble recognizer composed of a plurality of base models, and comprises all its base models and the weight w_i of each base model in the model;

where w_i = n_i/N;
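The weight formula can be checked with a short worked example, reading n_i as the number of copies of BaseModel_i in the model and N as the total number of base models. The function name and the list-of-names model representation are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def base_model_weights(model: List[str]) -> Dict[str, float]:
    """Compute w_i = n_i / N for every distinct base model in the model."""
    counts = Counter(model)      # n_i for each base model
    total = len(model)           # N
    return {name: n_i / total for name, n_i in counts.items()}


# RF appears twice out of four base models; SVM and FC appear once each.
weights = base_model_weights(["SVM", "RF", "RF", "FC"])
```

Here `weights` is {"SVM": 0.25, "RF": 0.5, "FC": 0.25}, and the weights of any model built this way sum to 1.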

4) evaluating the models on data set 2 to obtain the recognition accuracy of each model;
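The text does not state how a model combines the outputs of its base models. The sketch below assumes a weighted vote with a 0.5 decision threshold, purely as an illustration of how accuracy on data set 2 could be measured; the function names and the toy predictors are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Predictor = Callable[[Sequence[float]], int]   # returns 1 (benign) or 0 (malicious)


def ensemble_predict(weights: Dict[str, float], predictors: Dict[str, Predictor],
                     x: Sequence[float]) -> int:
    """Assumed combination rule: weighted vote, positive when the score reaches 0.5."""
    score = sum(w * predictors[name](x) for name, w in weights.items())
    return 1 if score >= 0.5 else 0


def accuracy(weights: Dict[str, float], predictors: Dict[str, Predictor],
             data: List[Tuple[Sequence[float], int]]) -> float:
    """Fraction of samples whose ensemble prediction matches the label."""
    hits = sum(ensemble_predict(weights, predictors, x) == y for x, y in data)
    return hits / len(data)


# Toy predictors standing in for trained base models.
predictors = {"A": lambda x: 1, "B": lambda x: 0}
data_set_2 = [([0.0], 1), ([0.0], 0)]
score = accuracy({"A": 0.75, "B": 0.25}, predictors, data_set_2)
```

With these toy inputs the ensemble always predicts 1, so it is right on one of the two samples.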

5) adjusting the weight between the base models;

1. Random weight adjustment

The weights among the base models are adjusted by random weight adjustment: if the weight of a certain base model BaseModel_p is larger than the set threshold, another base model BaseModel_q is randomly selected, and the weights of both are set to (w_p + w_q)/2.
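The averaging rule can be sketched as below. The function name, the dictionary representation of weights, and the single pass over the base models are assumptions; the rule itself (set both weights to (w_p + w_q)/2) is taken from the text.

```python
import random
from typing import Dict


def average_adjust(weights: Dict[str, float], threshold: float,
                   rng: random.Random) -> Dict[str, float]:
    """For each base model p above the threshold, average its weight with a random q."""
    adjusted = dict(weights)
    for p in list(adjusted):
        if adjusted[p] > threshold:
            others = [name for name in adjusted if name != p]
            if not others:            # a single-member model has no partner q
                continue
            q = rng.choice(others)
            mean = (adjusted[p] + adjusted[q]) / 2
            adjusted[p] = adjusted[q] = mean
    return adjusted


adjusted = average_adjust({"SVM": 0.7, "RF": 0.2, "FC": 0.1}, threshold=0.5,
                          rng=random.Random(0))
```

Averaging two weights leaves their sum unchanged, so the total weight of the model is preserved while the dominant base model is pulled back toward a partner.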

2. Replacement adjustment

In the process of forming a new model, if the weight of a certain base model exceeds a set threshold, the weight of the base model is adjusted to the set threshold.
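The threshold rule amounts to clipping each weight, as in the short sketch below. The function name is hypothetical, and the text does not say whether the remaining weights are renormalized afterwards, so this sketch leaves them unchanged.

```python
from typing import Dict


def cap_weights(weights: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Clip every weight that exceeds the threshold down to the threshold."""
    return {name: min(w, threshold) for name, w in weights.items()}


capped = cap_weights({"SVM": 0.7, "RF": 0.3}, threshold=0.5)
```

Here `capped` is {"SVM": 0.5, "RF": 0.3}; note that the capped weights no longer sum to 1 unless a renormalization step is added.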

8) calculating the accuracy, recall and F1 value of the k models on the test set, and performing Android malware detection with the best-performing model, where k is generally 3 to 10.
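The three metrics can be computed with the standard definitions, sketched here without any library dependency. The function name is an assumption, and the positive class defaults to 1 (benign), following the labeling used in this embodiment.

```python
from typing import List, Tuple


def metrics(y_true: List[int], y_pred: List[int], positive: int = 1) -> Tuple[float, float, float]:
    """Return (accuracy, recall, F1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, recall, f1


result = metrics([1, 1, 0, 0], [1, 0, 0, 1])
```

In this toy case there is one true positive, one false negative and one false positive, so accuracy, recall and F1 all equal 0.5. Running `metrics` for each of the k candidate models on the test set gives the figures used to pick the final detector.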

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims

1. An android malicious software detection method based on a model library is characterized by comprising the following steps:

where w_i = n_i/N;

4) detecting and evaluating the models by using the data set 2 to obtain the accuracy detection result of each model;

5) adjusting the weight between the base models in the model;

6) repeating the steps 4) to 5) until the best value of the detection result is not changed;

2. The method for detecting Android malware based on a model library of claim 1, wherein the models (Models) in step 3) are obtained by random combination of base models (BaseModel) as follows: first determining the upper limit on the total number of base models in a model, then drawing a random number less than or equal to that upper limit as the number of base models, and then combining that many base models to form the model.

3. The method as claimed in claim 1, wherein the weights between the base models in step 5) are adjusted by random weight adjustment.

4. The method of claim 3, wherein the adjusting the weights between the base models in step 5) is performed by:

after random weight adjustment, if the weight of a certain base model BaseModel_p is larger than the set threshold, another base model BaseModel_q is randomly selected, and the weights of both are set to (w_p + w_q)/2.

5. The method for detecting android malware based on model library of claim 1, wherein the following method is adopted for adjusting the weight between the base models in the step 5):

6. The method as claimed in claim 1, wherein in the step 5), if the weight of a base model exceeds a predetermined threshold, the weight of the base model is adjusted to the predetermined threshold.