CN111444502B

CN111444502B - Population-oriented android malicious software detection model library method

Info

Publication number: CN111444502B
Application number: CN201911215882.2A
Authority: CN
Inventors: 余东豪; 李涛; 余鑫; 张晏成; 颜松; 郑昊天; 常远; 贾志强; 乐金祥; 黄甫; 谢君臣
Original assignee: Wuhan University of Science and Engineering WUSE
Current assignee: Wuhan University of Science and Engineering WUSE
Priority date: 2019-12-02
Filing date: 2019-12-02
Publication date: 2023-05-02
Anticipated expiration: 2039-12-02
Also published as: CN111444502A

Abstract

The invention discloses a population-oriented Android malware detection model library method, comprising the following steps: 1) collecting application files, extracting each application's permission usage, integrating it into a permission information matrix, and forming the population information of the applications according to their category labels; 2) training a classifier on the extracted application permission set; 3) collecting the permission information matrix of an application to be detected, determining its category with the classifier, and taking its population information as the input of the model library; then finding the recognizer pool corresponding to that population in the model library, detecting the application with the recognizer that best satisfies the constraint conditions, and judging whether the application is malicious. Borrowing the concept of a biological population, the method divides applications into different populations by processing their permission features and uses the constraints to find the corresponding recognition model in the model library, thereby obtaining better recognition results.

Description

Population-oriented android malicious software detection model library method

Technical Field

The invention relates to a malicious software detection technology, in particular to a population-oriented android malicious software detection model library method.

Background

The detection of the malicious nature of Android applications is an uncertainty problem. Heretofore, malware detection methods can be categorized into static detection, dynamic detection, and dynamic-static combination detection. However, with the rise of machine learning and data mining, more and more researchers choose to combine previous dynamic and static detection methods with machine learning techniques.

At present, a detector applied to Android application malicious detection is mainly trained by a machine learning method such as a support vector machine, a random forest, K-means and the like. Various detection methods lay a foundation for Android detection, but have some defects: because of the diversity of Android applications, the use of privacy permissions is a typical uncertainty problem, and it is difficult to distinguish between normal permissions and privacy permissions. There is still a certain disadvantage to using the same detector to achieve detection for all kinds of applications.

Different types of applications require different permissions. A permission should therefore not be judged in isolation or for a single application, but in light of the application's function. Take the permission to read the address book: social applications need it to remain fully functional, because most users register with a mobile phone number and the application links them to friends through their address book. Tool applications such as flashlights and readers do not need it, and requesting it violates the principle of least privilege. The same permission therefore carries different risk for applications of different function types, and applications with similar uses have similar functions and hence similar permission requirements.

Therefore, by referring to the concept of population in biology, the invention provides a method suitable for detecting large-scale Android malicious applications based on population angles. The same type of application performs similar functions and the required system permissions are similar. Therefore, we divide the applications of the same function type into a population, set population labels for them, and conduct malicious detection research of Android applications in units of the population.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a population-oriented android malicious software detection model library method.

The technical scheme adopted for solving the technical problems is as follows: a population-oriented android malicious software detection model library method comprises the following steps:

1) Collecting application files, extracting application authority use conditions, integrating the application authority use conditions into an authority information matrix, and forming population information of the application according to category labels; the information of the population comprises category labels corresponding to each application and an authority information matrix of the application after authority pretreatment;

2) Training a classifier according to the extracted application permission set;

dividing the extracted application permission set into a training set and a testing set, wherein the training set is used as the input of the SMO algorithm classifier, so that the classifier can classify the application through the permission through continuous learning; the test set tests the classifier and verifies the classification effect of the classifier;

3) Collecting an application to be detected, acquiring its permission information matrix, determining its category with the classifier, grouping applications of the same function type into a population, assigning the population's category label to the application, and taking the population information of the application to be detected as the input of the model library; the model library encapsulates a plurality of population identifier pools, and each identifier pool contains the identifiers generated by training three algorithms: SVM, random forest, and fully connected neural network;

and finding a recognizer pool corresponding to the population in the model library, detecting the application by using a recognizer which is most in line with the constraint condition according to the constraint condition, and judging the maliciousness of the application.

In the above scheme, in step 2), the classification model is built by training on the data set with the SMO function of Weka.

According to the above scheme, the application is detected in the step 3), and the application maliciousness is determined, specifically as follows:

3.1 According to the category label of the application's population, finding the population identifier pool of the corresponding type in the model library; the population identifier pool comprises an SVM identifier, a random forest identifier and a fully connected neural network identifier;

3.2 According to the constraint conditions, finding the identifier Classifier that best satisfies them in the population identifier pool; the identifier Classifier is one of an SVM identifier, a random forest identifier and a fully connected neural network identifier;

the constraint conditions are detection accuracy and detection running time;

3.3 The population information of the application is provided as input to Classifier for identification, which outputs the result R; R is either benign application or malicious application.

The invention has the beneficial effects that: the invention refers to the thought of biological population, divides the application into different populations by processing the authority characteristics of the application, and finds the corresponding recognition algorithm model in the model library by constraint, thus finally obtaining a better recognition result. The invention has the following characteristics:

(1) Application programs are classified with the efficient sequential minimal optimization (SMO) algorithm, and the classification accuracy for each category exceeds 85 percent;

(2) When the application program is detected, according to the category of the application program, a corresponding identifier population is automatically found in the model library, so that the identification effect is improved;

(3) By adding constraint conditions, the identifier that best satisfies them is selected, so that the identification result matches what the user expects.

The method of the invention not only can detect a large number of application programs at the same time, but also is easy to realize, is simple and convenient to use, and can obtain the result wanted by the user by modifying the constraint. The method provides a new idea for identifying android malicious software.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of a method of an embodiment of the present invention;

FIG. 2 is a schematic diagram of population information according to an embodiment of the present invention;

FIG. 3 is a classifier training schematic of an embodiment of the invention;

FIG. 4 is a diagram of a test pattern structure according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

As shown in fig. 1, a population-oriented android malicious software detection model library method includes the following steps:

1) Application APK files are crawled from the internet with a purpose-written Python program to serve as positive samples, malicious application samples are obtained from VirusShare, and the permission usage of each application is extracted and integrated into a permission matrix;

In this embodiment, the 360 application market and the Anzhi application market are selected as data sources. A crawler written in Python uses the application type labels provided by the websites to crawl applications and store them by type, enabling uninterrupted batch downloading from the application markets.

The source data acquired by the crawler must then undergo permission feature extraction before it can serve as the basic data of the experiment. Feature extraction has three stages: decompiling, parsing the XML file, and constructing the feature vector, as follows:

(1) In the decompilation stage, apktool is combined with a Python script to decompile the application and obtain AndroidManifest.xml, the manifest file used when the application is installed;

(2) In the XML parsing stage, combined with the AAPT (Android Asset Packaging Tool) tool, Python code is written to parse the AndroidManifest.xml file and extract the 'uses-permission' tags, yielding the application's permission information;

(3) After the application permissions are extracted, they are stored in a cloud database with the population as the unit. Because each permission is a nominal attribute that an application either declares or does not, the permissions are stored as a "0-1" matrix, where "1" indicates that the permission feature is present and "0" that it is absent;

thus, the characteristic data set DataSet divided by the population can be obtained.
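The three extraction stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the inventors' script: it assumes apktool is installed and on the PATH, that the manifest has already been decoded to plain XML, and it uses a hypothetical three-permission vocabulary and a made-up sample manifest in place of the 144 permissions used in the embodiment.

```python
import subprocess
import xml.etree.ElementTree as ET

# Namespace prefix of android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def decode_apk(apk_path, out_dir):
    """Stage 1: decode the APK with apktool (assumed to be on the PATH)."""
    subprocess.run(["apktool", "d", "-f", apk_path, "-o", out_dir], check=True)


def extract_permissions(manifest_xml):
    """Stage 2: collect android:name from every <uses-permission> tag."""
    root = ET.fromstring(manifest_xml)
    return {
        tag.get(ANDROID_NS + "name")
        for tag in root.iter("uses-permission")
        if tag.get(ANDROID_NS + "name")
    }


def to_feature_vector(declared, all_permissions):
    """Stage 3: 1 if the permission is declared, 0 otherwise."""
    return [1 if p in declared else 0 for p in all_permissions]


# Hypothetical vocabulary; the embodiment uses 144 Android permissions.
ALL_PERMISSIONS = [
    "android.permission.CAMERA",
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
]

# Made-up manifest standing in for the output of decode_apk().
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.flashlight">
    <uses-permission android:name="android.permission.CAMERA"/>
    <uses-permission android:name="android.permission.INTERNET"/>
</manifest>
"""

vector = to_feature_vector(extract_permissions(SAMPLE_MANIFEST), ALL_PERMISSIONS)
print(vector)  # [1, 1, 0]
```

Keeping the permission vocabulary as an ordered list fixes the column order, so vectors from different applications can be stacked into one matrix.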

The population is the foundation of the invention. Each APK file is converted into population information data, giving a unified input format for model library detection. For each dimension, a sample is marked 1 if the feature exists and 0 if not, as shown in fig. 2.

The Malicious field indicates whether the application is malware: 1 means yes and 0 means no. The Class field gives the category of the application. PackageName is the application package name, the unique identifier of the application. The remaining fields are the 144 permissions of the Android system; a field has the value 1 when the application holds the corresponding permission and 0 otherwise.

The permissions of the applications involved in this embodiment are a subset of all Android permissions; permission information and population information are defined as follows.

Definition 1 (Android application permissions):

$$\mathrm{Permissions} = \{P_i \mid P_i \in \mathrm{Android}\}$$

The set of permission information is a subset of all Android permissions.

Applications of the same function type form a population. The populations are divided into x categories, and the number of categories may change as the total number of crawled apps grows. The category set is defined as:

Definition 2 (category labels):

$$\mathrm{Class} = \{C_1, C_2, \ldots, C_x\}$$

Each element of Class is the category label of one population, such as flashlight, camera, player, or social chat;

Definition 3 (population):

$$\mathrm{Population} = \{C_x, \mathrm{PermissionMatrix}\}$$

$C_x$ is the category label of the population, and PermissionMatrix is the permission information matrix of its applications after permission preprocessing, defined as follows:

Definition 4 (permission matrix):

$$\mathrm{PermissionMatrix} = \{P_{ij} \mid i = 1, 2, 3, \ldots, m;\ j = 1, 2, 3, \ldots, n\}$$

Here i is the number of an app in population $C_x$: if $\mathrm{App}_i$ holds permission j, then $P_{ij}$ is 1; otherwise $P_{ij}$ is 0.
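The population and permission-matrix definitions map naturally onto a small data structure. The sketch below is illustrative only: the `Population` class, the `build_population` helper, the shortened permission names and the two package names are all hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Population:
    label: str                          # category label of the population
    package_names: List[str]            # row order: one app per row
    permission_matrix: List[List[int]]  # m apps x n permissions, entries 0 or 1


def build_population(label: str,
                     apps: Dict[str, Set[str]],
                     permissions: List[str]) -> Population:
    """Stack the 0-1 permission vectors of all apps in one population."""
    names = sorted(apps)  # fixed row order so the matrix is reproducible
    matrix = [[1 if p in apps[name] else 0 for p in permissions]
              for name in names]
    return Population(label, names, matrix)


# Hypothetical permission vocabulary and apps, for illustration only.
PERMISSIONS = ["CAMERA", "INTERNET", "READ_CONTACTS"]
flashlight_apps = {
    "com.example.torch": {"CAMERA"},
    "com.example.brightlight": {"CAMERA", "INTERNET"},
}

flashlight = build_population("flashlight", flashlight_apps, PERMISSIONS)
print(flashlight.package_names)
print(flashlight.permission_matrix)  # [[1, 1, 0], [1, 0, 0]]
```

Storing the package names alongside the matrix keeps each row traceable to its application, mirroring the PackageName field described for fig. 2.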

2) Training a classifier according to the applied population information;

as shown in fig. 3, the population information of all extracted applications is divided into a training set and a testing set, wherein the training set is used as the input of the SMO algorithm classifier, so that the classifier can classify the applications through the authority by continuous learning; the test set tests the classifier and verifies the classification effect of the classifier;

the training data sample set contains the permission of the Android application program and the class label corresponding to each application program, and the identification of the Android application program refers to the process of carrying out class identification on the application program sample to be detected through a trained classification model.

Assume that the N training data pairs are

$$(\mathrm{Permissions}_1, C_1), (\mathrm{Permissions}_2, C_2), \ldots, (\mathrm{Permissions}_N, C_N)$$

where $C_i$ is the classification label of application i and $\mathrm{Permissions}_i$ is the permission matrix of that application.

The SMO algorithm compares the N data pairs and learns from the permissions and categories of the applications in the training set to obtain a functional relationship that determines the category of an application. In this embodiment, the SMO algorithm of Weka, an open-source machine learning and data mining package for the Java environment, is used to train a classification model on the data set; the classification model is then used to determine the category of an application program to be detected;
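For orientation, SMO (sequential minimal optimization) is a solver for the dual problem of a soft-margin support vector machine. The notation below is standard textbook notation and an assumption here, not taken from the patent: $x_i$ is the permission vector of application $i$, $y_i \in \{-1, +1\}$ is a binary label, $K$ is a kernel function, $\alpha_i$ are the Lagrange multipliers, and $\lambda$ is the regularization bound (written $\lambda$ rather than the usual $C$ to avoid confusion with the category labels $C_i$).

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \lambda,\ \ \sum_{i=1}^{N} \alpha_i y_i = 0$$

SMO repeatedly picks two multipliers and optimizes them analytically while holding the others fixed. Because the category labels here take more than two values, a binary formulation like this has to be combined across categories; Weka's SMO documentation describes doing so by pairwise classification.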

3) The method comprises the steps of collecting an application to be detected, acquiring a right information matrix of the application, determining the category of the application to be detected by using a classifier, obtaining population information of the application to be detected as input, finding a recognizer pool corresponding to the population in a model library, detecting the application by using a recognizer which is most in line with constraint conditions according to constraint conditions, and judging the maliciousness of the application;

Identifier set:

$$\mathrm{Classifier} = \{\mathrm{Classifier}(P_i, A_i) \mid P_i \in \mathrm{Population},\ A_i \in \mathrm{Algorithm}\}$$

$\mathrm{Classifier}(P_i, A_i)$ is the identifier generated by training machine learning algorithm $A_i$ on the data of population $P_i$, such as a flashlight SVM identifier or a reader random forest identifier. Algorithm is defined as follows:

Algorithm set:

$$\mathrm{Algorithm} = \{\mathrm{SVM}, \mathrm{RF}, \mathrm{FC}\}$$

Population identifiers:

$$\mathrm{ClassifierPopulation} = \{\mathrm{Classifier}(P, A_i) \mid A_i \in \mathrm{Algorithm}\}$$

ClassifierPopulation is the set of all identifiers generated by training every machine learning algorithm on the data of population P.
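One way to picture these sets is as a mapping keyed by (population, algorithm) pairs. The sketch below is a hypothetical layout, not the patent's implementation: `train` is a placeholder that returns a name instead of a trained model, and the three populations are the ones used later in the experiment.

```python
from itertools import product

ALGORITHMS = ("SVM", "RF", "FC")                  # the Algorithm set
POPULATIONS = ("camera", "flashlight", "reader")  # example populations


def train(population, algorithm):
    """Placeholder for real training; returns a name, not a trained model."""
    return f"{population}-{algorithm}-identifier"


# The Classifier set: one identifier per (population, algorithm) pair.
classifier_set = {
    (p, a): train(p, a) for p, a in product(POPULATIONS, ALGORITHMS)
}


def classifier_population(population):
    """ClassifierPopulation for one population: one identifier per algorithm."""
    return {a: classifier_set[(population, a)] for a in ALGORITHMS}


pool = classifier_population("flashlight")
print(len(classifier_set))  # 9
print(pool["RF"])           # flashlight-RF-identifier
```

With this layout, looking up the pool for a population is a simple filter on the first component of the key.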

3.2 According to the category, the identifiers of the corresponding population are found in the population identifier pool, for example the flashlight SVM identifier, the flashlight random forest identifier and the flashlight fully connected neural network identifier. The constraint conditions are then compared with the identifier effect record table, and the identifier Classifier that best satisfies the constraint conditions is found according to their priority; the identifier Classifier is one of an SVM identifier, a random forest identifier and a fully connected neural network identifier;

the constraint conditions are detection accuracy and detection running time;
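Selection by prioritized constraints could look like the following sketch. The patent does not specify the format of the effect record table or the comparison rule, so the `EffectRecord` class, the lexicographic ordering and all numbers below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectRecord:
    algorithm: str
    accuracy: float   # fraction of test apps judged correctly
    runtime_s: float  # detection running time in seconds


# Hypothetical effect record table for one population (made-up numbers).
FLASHLIGHT_POOL = [
    EffectRecord("SVM", 0.95, 1.8),
    EffectRecord("RF", 0.91, 0.4),
    EffectRecord("FC", 0.95, 3.1),
]


def select_identifier(pool, priority=("accuracy", "runtime_s")):
    """Pick the record that is best on the constraints, in priority order.

    Higher accuracy is better and lower running time is better; later
    constraints only break ties left by earlier ones.
    """
    def key(record):
        parts = []
        for name in priority:
            value = getattr(record, name)
            parts.append(-value if name == "accuracy" else value)
        return tuple(parts)

    return min(pool, key=key)


best = select_identifier(FLASHLIGHT_POOL)
fastest = select_identifier(FLASHLIGHT_POOL, priority=("runtime_s", "accuracy"))
print(best.algorithm, fastest.algorithm)  # SVM RF
```

Changing the priority order changes the chosen identifier without retraining anything, which is the sense in which the user obtains a different result by modifying the constraint.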

Three algorithms are used for the identifiers: support vector machine (SVM), random forest (RF) and fully connected neural network (FC). The SVM algorithm gives stable results, the random forest algorithm runs fast, and the fully connected network classifies well in all kinds of situations.

Experimental description of the effects of the invention:

simple experiments were performed to verify the method. The sources of the datasets, the algorithms used for the experiments and simple constraints will be described.

The experimental operation environment is as follows: windows 7 operating system, 3.4GHz four-core processor, 8GB memory.

So far, a total of 32537 Android applications of 62 types have been crawled from the 360 application market and the Anzhi application market. For the collected apps, we obtained their manifest file AndroidManifest.xml and generated permission information vectors, where 1 means the permission is requested and 0 means it is not. We scanned the apps of the flashlight and reader populations with Kingsoft and F-Secure, and selected as positive samples only those marked benign by both. Based on the design of the experiment, the flashlight population, the camera population, the reader population, and malicious samples from VirusShare were selected for the experiment.

As shown in fig. 4, we selected the camera, flashlight and reader populations as the subjects of the experiment for the following reasons. First, these three types of applications have clear functional boundaries: whether an app belongs to the flashlight, camera or reader category is easily determined from its main permission declarations and the application description text filled in when it was uploaded. Second, flashlights, cameras and readers are widely used: apart from applications for individual needs, almost every user installs a flashlight, camera or reader application. If lawbreakers add malicious code to a well-made, feature-rich application of this kind and re-upload it after packing, a large number of users are affected.

1. Classification experiments based on SMO

The experiment used 2225 applications of three types (camera, flashlight and reader), which were combined into a training set.

Apktool was used to obtain the manifest file of each application, and a Python script extracted the permission vector from it. The results of 10-fold cross-validation with the SMO function of Weka are shown in table 1.
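To make the 10-fold protocol concrete, the sketch below splits 2225 sample indices into ten folds so that every sample is tested exactly once. It is a simplified illustration: the function name is hypothetical, and the shuffling and stratification that a real evaluation tool would normally apply are left out.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs covering every sample exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(2225, k=10))
print([len(test) for _, test in folds])
# [223, 223, 223, 223, 223, 222, 222, 222, 222, 222]
```

In each round the classifier is trained on nine folds and tested on the remaining one, and the per-fold results are averaged.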

TABLE 1 different categories of software Classification results

According to the application program classification result, the accuracy and recall ratio are high, and the SMO algorithm is proved to be capable of performing better classification learning.

2. Population-oriented Android malicious software detection experiment

Three algorithms are used in the experiment: support vector machine (SVM), random forest (RF) and fully connected neural network (FC). The SVM algorithm gives stable results, the random forest algorithm runs fast, and the fully connected network classifies well in all kinds of situations.

Because this is a simple verification, the evaluation criteria of the algorithms serve as the constraints. Accuracy and running time are adopted as the evaluation criteria; because the random forest algorithm is already highly efficient, accuracy is given more weight than running time in this experiment. A higher accuracy means that the algorithm recognizes a larger share of the applications in the population, and a shorter running time means that the algorithm identifies the population faster.
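Both criteria can be measured with a few lines of Python. The sketch below is an assumption about how such a measurement might be set up, not the experiment's code: `evaluate` is a hypothetical helper, `toy_identifier` is a made-up rule standing in for a trained SVM, RF or FC identifier, and the four test samples are invented.

```python
import time


def evaluate(identifier, samples):
    """Return (accuracy, elapsed_seconds) for an identifier on labelled samples."""
    start = time.perf_counter()
    predictions = [identifier(vector) for vector, _ in samples]
    elapsed = time.perf_counter() - start
    correct = sum(1 for pred, (_, label) in zip(predictions, samples)
                  if pred == label)
    return correct / len(samples), elapsed


def toy_identifier(vector):
    """Made-up rule: flag the app as malicious (1) if the third permission is set."""
    return 1 if vector[2] == 1 else 0


# Invented (permission vector, malicious label) pairs, for illustration only.
TEST_SET = [
    ([1, 0, 0], 0),
    ([1, 1, 0], 0),
    ([1, 0, 1], 1),
    ([0, 1, 1], 0),
]

accuracy, elapsed = evaluate(toy_identifier, TEST_SET)
print(accuracy)  # 0.75
```

Timing only the prediction loop keeps the running-time figure comparable across identifiers, since the scoring step is the same for all of them.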

The data set is first divided into a training set and a test set. The training set, not divided by population, is used directly as input to train the three algorithms (SVM, random forest and fully connected neural network); the resulting identifiers are run on the test set to obtain running time and accuracy. After the data set is classified by the classification module, the training set and the test set are divided by population, and the three algorithms are trained again on the training sets of the different populations. This yields identifiers of different algorithms that reflect the population differences, which are grouped into identifier populations by population. The test sets of the different populations are then used to obtain the running time and accuracy of the identifiers for each population. Finally, the metrics of the two kinds of identifiers are compared.

TABLE 2 malicious recognition results

Data sets A, B and C represent the camera, flashlight and reader populations respectively, and data set D is the merged full set of the three populations.

As can be seen from table 2, for the camera population the random forest algorithm is superior to the other two algorithms in both time and accuracy; for the flashlight and reader populations, the random forest algorithm has excellent detection time, but its accuracy is lower than that of the support vector machine and the fully connected neural network; all three algorithms perform better on data sets A, B and C than on data set D.

We can conclude that: 1) with the same algorithm, an identifier trained on a population-classified data set detects better than one trained on the full set, with a maximum improvement of 13.26%; 2) even within the same population, detection performance differs among identifiers, so an identifier that meets the conditions can be selected from the identifier population according to actual requirements to achieve the best effect.

The experiment demonstrates that the SMO algorithm is effective at dividing applications into populations and that dividing applications into populations greatly improves the detection of malicious applications. It also verifies that the population-oriented Android malware detection model library method is effective and feasible.

It will be understood that modifications and variations will be apparent to those skilled in the art from the foregoing description, and it is intended that all such modifications and variations be included within the scope of the following claims.

Claims

1. A population-oriented android malicious software detection model library method is characterized by comprising the following steps:

2) Training a classifier according to the extracted application permission set;

finding a recognizer pool corresponding to the population in the model library, detecting the application by using a recognizer which is most in line with the constraint condition according to the constraint condition, and judging the maliciousness of the application; the method comprises the following steps:

the constraint conditions are detection accuracy and detection running time;

2. The method of claim 1, wherein in the step 2), the classification model is built for the data set training using SMO algorithm of Weka.