CN113901463A

CN113901463A - Concept drift-oriented interpretable Android malicious software detection method

Info

Publication number: CN113901463A
Application number: CN202111033119.5A
Authority: CN
Inventors: 张炳; 文峥; 高原; 赵旭阳; 任家东
Original assignee: Yanshan University
Current assignee: Yanshan University
Priority date: 2021-09-03
Filing date: 2021-09-03
Publication date: 2022-01-07
Anticipated expiration: 2041-09-03
Also published as: CN113901463B

Abstract

The invention discloses a concept drift-oriented interpretable Android malicious software detection method, which belongs to the technical field of information security. The method comprises introducing detection features from manual Android malware analysis reports, improving the traditional feature wrapper method based on an automatic machine learning algorithm and an interpretable machine learning algorithm, and fusing an identical-distribution test algorithm with a transfer learning algorithm. The method improves the interpretability of the Android malware detection model, helps reverse analysts manually verify the detection model, reduces the influence of concept drift on the accuracy of the detection model, and helps the detection model maintain high accuracy over a long period at low cost. It is used for the detection and analysis of malicious Android application software.

Description

Concept drift-oriented interpretable Android malicious software detection method

Technical Field

The invention relates to the technical field of information security, in particular to a concept drift-oriented interpretable Android malicious software detection method.

Background

In the first quarter of 2021, the 360 Internet Security Center intercepted about 206.5 thousand newly added mobile malicious program samples, 426.5% more than in the same period of 2020, with a per-capita economic loss of 14611 yuan. As of April 2021, the Android operating system held 76.91% of the Chinese mobile terminal market, well ahead of iOS, and the application ecosystem of the open Android platform makes it more vulnerable to malware.

Existing Android malware detection technologies fall into three major categories: signature-based detection, machine-learning-based static detection, and machine-learning-based application behavior detection. The sandbox mechanism of the Android system makes it difficult to monitor the dynamic behavior of applications on a non-customized system. Machine-learning-based static detection has become the mainstream Android malware detection method owing to advantages such as high detection accuracy for unknown malware and low hardware requirements.

However, the static detection technology based on machine learning has 3 main problems as follows:

1. The proportion of applications in application markets that request sensitive permissions is decreasing, and some malicious applications can complete an attack without requesting any new permission. A single permission feature, or a combination of features introduced without a logical basis, is not sufficient to characterize malware.

2. As black-box machine learning algorithms achieve ever higher accuracy, the demands on model interpretability and transparency in malicious application detection are also growing. Android malware reverse engineers need the model to provide the basis for its decisions, so that they can carry out manual analysis or judge whether the model's decisions are reasonable.

3. Frequent updates of the Android system version mean that Android applications developed with software development kits of various versions each hold a certain market share. Owing to concept drift, a machine learning model trained at the cost of a large number of samples performs poorly when detecting Android malware from different periods.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a concept drift-oriented interpretable Android malicious software detection method, so that the interpretability of an Android malicious software detection model is improved, and the influence of the concept drift problem on the accuracy of the detection model is reduced.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

a concept drift-oriented interpretable Android malware detection method comprises the following steps:

step 1, collecting a plurality of analysis reports of the Android malicious application software to form an Android malicious application software manual analysis report sample library;

step 2, collecting a plurality of malicious and benign Android application software samples to form an initial Android application software sample library, wherein the number of the malicious samples is consistent with that of the benign samples;

step 3, extracting high-frequency words of Android malware reverse analysis from the manual analysis report sample library, wherein the top A ranked valid words are used as the feature types used by the detection model;

step 4, according to the initial Android application software sample library, using an automatic machine learning algorithm to construct a screening feature vector and train a feature component screening model for each feature type used by the detection model, wherein the number of screening feature vectors is A;

step 5, for each feature component screening model, using an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component in the screening feature vector, wherein the top B ranked components are used as a sub-feature vector used by the detection model;

step 6, combining all the sub-feature vectors used by the detection model to serve as the features used by the detection model; extracting the data corresponding to the features used by the detection model from the initial Android application software sample library to form an initial training data set;

step 7, training an initial detection model on the initial training data set by using a machine learning algorithm based on a tree model, and outputting the features used by the detection model as the basis for manually verifying the detection model;

step 8, for Android application software of unknown security, extracting the data corresponding to the features used by the detection model, inputting the data into the trained initial detection model, and detecting whether the application is Android malware;

step 9, obtaining Android malware samples from mainstream domestic and foreign application markets and security websites by using crawler technology to form a model migration malware sample library, wherein the interval between the release date of each malware sample and the collection date is not more than C months, and the number of malware samples is D;

step 10, extracting the data corresponding to the features used by the detection model from the model migration malware sample library to form a model migration data set;

step 11, calculating a test statistic by applying an identical-distribution test algorithm to the model migration data set and the initial training data set, and judging whether concept drift has occurred in Android malware;

step 12, if concept drift has occurred in Android malware, migrating the initial detection model by using a domain adaptation algorithm from the transfer learning field, iterating E times to train a new detection model, and replacing the initial detection model;

and step 13, repeating steps 8 to 12 with a period of C months, updating the detection model, and detecting Android malware.

The technical scheme of the invention is further improved as follows: in step 3, the high-frequency words of Android malware reverse analysis are extracted by a word frequency statistics algorithm, and the top A ranked valid words are Android programming-language keywords.

The technical scheme of the invention is further improved as follows: in step 4, the following substeps are included:

4.1 projecting one feature type used by the detection model from the initial Android application software sample library;

4.2 if the feature has already been projected, selecting a feature type used by the detection model that has not yet been projected, and executing step 4.1;

4.3 if the feature has not been projected, taking all distinct values of the feature contained in the projected data as the screening feature vector of the feature; constructing a feature component screening data set, wherein the feature component screening data set comprises the sample feature vectors of all samples;

4.4 inputting the feature component screening data set into an automatic machine learning algorithm, and selecting the pipeline with the highest accuracy among the output pipelines as the feature component screening model of the feature;

4.5 if there remains a feature type for which no feature component screening model has been output, executing step 4.1.

The technical scheme of the invention is further improved as follows: in step 4.4, the automatic machine learning algorithm is the TPOT automatic machine learning algorithm, and the pipeline selected as having the highest accuracy among the output pipelines uses a tree-based machine learning model.

The technical scheme of the invention is further improved as follows: in step 5, the sum of the mean absolute Shapley values of the top B ranked components is not less than F times the sum of the mean absolute Shapley values of the remaining components, wherein F is a positive integer not less than 4.

The technical scheme of the invention is further improved as follows: in step 5, the interpretable machine learning algorithm is a SHAP algorithm.

The technical scheme of the invention is further improved as follows: in steps 6, 8 and 10, the data corresponding to the features used by the detection model is extracted as follows: according to the features used by the detection model, an Android reverse-engineering tool is used to match against the files decompressed from the Android application software APK; if a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded; otherwise 0 is recorded. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise, and the result serves as the sample feature vector of the detection model.

The technical scheme of the invention is further improved as follows: in step 7, the machine learning algorithm based on the tree model is the CatBoost algorithm.

The technical scheme of the invention is further improved as follows: in step 12, the domain adaptation algorithm is the JDA (Joint Distribution Adaptation) algorithm.

The technical scheme of the invention is further improved as follows: A is a positive integer not less than 4, B is a positive integer not less than 1, C is a positive integer not less than 1, D is a positive integer not less than 100, and E is a positive integer not more than 5.

Due to the adoption of the technical scheme, the invention has the technical progress that:

1. The method extracts high-frequency words in the attack flow from Android malware analysis reports and introduces multiple types of features at the source-code level and the assembly-instruction level, which improves the logical soundness and reasonability of the detection features.

2. By combining the initial features, the invention ensures low storage overhead and high analysis speed while characterizing malware better.

3. The method uses an automatic machine learning algorithm to screen out the optimal tree-based machine learning classification model. Compared with the parameter tuning process in traditional machine learning model training, this improves the fit between the training data and the model, as well as convenience and efficiency.

4. The invention uses an interpretability algorithm to construct an interpretation mechanism for the detection model. The screened features contribute strongly to the classification results of most training samples, which ensures the interpretability and verifiability of the detection model.

5. The method introduces domain adaptation into the technical field of information security, in particular into Android malware detection. Based on the existing data and detection model, it uses a small number of new-period Android malware samples to keep the detection accuracy of the model stable over time, which effectively mitigates the concept drift problem in Android malware detection.

Drawings

FIG. 1 is a flow chart of the detection method of the present invention;

FIG. 2 is a sub-flowchart for constructing a feature component screening model in the present invention.

Detailed Description

The invention is described in further detail below with reference to the following figures and examples:

As shown in FIG. 1, a concept drift-oriented interpretable Android malware detection method specifically includes the following steps:

Step 1, collecting a sufficient number of manual analysis reports of Android malware to form an Android malware manual analysis report sample library.

In this embodiment, Android malware analysis reports are sampled from the Kharon data set to form the manual analysis report sample library. The reports are written in English and contain 4957 words in total.

Step 2, collecting a sufficient number of malicious and benign Android application software samples to form an initial Android application software sample library, wherein the number of malicious samples equals the number of benign samples.

In this embodiment, 2900 Android malware samples and 2900 benign samples are collected from the OmniDroid data set, and the Android application software samples are in APK format. Malware is defined as software for which more than 50% of the antivirus engine detection results on the VirusTotal website are positive, and benign software as software for which at least 50% of the antivirus engine detection results on the VirusTotal website are negative.

Step 3, extracting high-frequency words of Android malware reverse analysis from the manual analysis report sample library, wherein the top A ranked valid words, A being a positive integer not less than 4, serve as the feature types used by the detection model.

In this embodiment, a word frequency statistics algorithm is used to extract the high-frequency words of Android malware reverse analysis, and A is 4. The valid words used as the feature types of the detection model are Android programming-language keywords, extracted from the manual analysis report sample library after removing meaningless words such as articles, pronouns and quantifiers. The resulting feature types used by the detection model are permissions, API package names, intent names and Dalvik bytecode. The removed words include, but are not limited to: the, is, to, a, and, in, of, also, from.
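
The word frequency statistics described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the tokenization rule and the report snippets are assumptions invented for the example, and the final choice of which frequent words count as Android programming-language keywords is treated as a separate manual step that the code does not attempt.

```python
import re
from collections import Counter

# Stop words listed in the embodiment; the source notes the list is not exhaustive.
STOP_WORDS = {"the", "is", "to", "a", "and", "in", "of", "also", "from"}


def top_words(reports, top_a, stop_words=STOP_WORDS):
    """Return the top_a most frequent non-stop words across all reports."""
    counts = Counter()
    for text in reports:
        # Lowercase, then keep runs of letters, digits and underscores as tokens.
        tokens = re.findall(r"[a-z0-9_]+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(top_a)]


# Hypothetical report snippets, invented for illustration only.
reports = [
    "The malware requests the permission to send SMS and calls the API.",
    "An intent is sent to the service; the permission is also abused.",
    "The API is invoked from the intent handler after the permission check.",
]
print(top_words(reports, top_a=3))
```

In practice the ranked list would still be reviewed by an analyst, since frequent words such as "malware" carry no feature-type meaning.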

Step 4, according to the initial Android application software sample library, using an automatic machine learning algorithm to construct a screening feature vector and train a feature component screening model for each feature type used by the detection model, wherein the number of screening feature vectors is A.

As shown in FIG. 2, the method specifically includes the following sub-steps:

4.1 projecting one feature type used by the detection model from the initial Android application software sample library.

4.2 if the feature has already been projected, selecting a feature type used by the detection model that has not yet been projected, and executing step 4.1.

4.3 if the feature has not been projected, taking all distinct values of the feature contained in the projected data as the screening feature vector of the feature, and constructing a feature component screening data set, wherein the feature component screening data set comprises the sample feature vectors of all samples.

In this embodiment, each sample in the initial Android application software sample library includes 45 features such as the installation package name, file name, hash code, permissions, API package names, intent names and Dalvik bytecode. Projection yields the data for exactly one of the four features (permissions, API package names, intent names, Dalvik bytecode) at a time. If a component of the screening feature vector appears in a sample, the corresponding position is recorded as its number of occurrences; otherwise it is recorded as 0. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise. For example, if the screening feature vector of the Dalvik bytecode is [ "shl-int", "long-to-int", "if-gt" ] and the sample feature vector of one sample is [5, 3, 21, 1], then the sample is malicious and contains 5 occurrences of "shl-int", 3 of "long-to-int" and 21 of "if-gt". In this embodiment, the screening feature vectors of the four features (permissions, API package names, intent names and Dalvik bytecode) have 184, 4185, 223 and 436 dimensions, respectively.
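
The count-and-label encoding of step 4.3 can be sketched as below. The function name and the opcode stream are hypothetical. The stream is chosen only so that it reproduces the Dalvik bytecode example from the text, and it includes one opcode outside the screening vector to show that such items are ignored.

```python
from collections import Counter


def build_sample_vector(screening_vector, extracted_items, is_malicious):
    """Count each screening component in a sample and append the class label."""
    counts = Counter(extracted_items)
    # Counter returns 0 for components that never appear in the sample.
    vector = [counts[component] for component in screening_vector]
    vector.append(1 if is_malicious else 0)
    return vector


screening_vector = ["shl-int", "long-to-int", "if-gt"]
# Hypothetical opcode stream matching the counts in the example above.
opcodes = ["shl-int"] * 5 + ["long-to-int"] * 3 + ["if-gt"] * 21 + ["iput"] * 2
print(build_sample_vector(screening_vector, opcodes, is_malicious=True))
# prints [5, 3, 21, 1]
```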

4.4 inputting the feature component screening data set into an automatic machine learning algorithm, and selecting the pipeline with the highest accuracy among the output pipelines as the feature component screening model of the feature.

In this embodiment, the TPOT automatic machine learning algorithm is used, and the pipeline selected as having the highest accuracy among the output pipelines uses a tree-based machine learning model.

In this embodiment, step 4 yields four feature component screening models, one each for permissions, API package names, intent names and Dalvik bytecode.

Step 5, for each feature component screening model, using an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component in the screening feature vector, wherein the top B ranked components, B being a positive integer not less than 1, are used as a sub-feature vector used by the detection model. The sum of the mean absolute Shapley values of the top B ranked components is not less than F times the sum of the mean absolute Shapley values of the remaining components, wherein F is a positive integer not less than 4.
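
The ranking rule in step 5 can be illustrated with a short sketch. It assumes the per-sample Shapley values have already been computed by an explainer, which the code does not do; the function names, permission names and numeric values are invented for illustration.

```python
def mean_abs_shapley(shap_rows, names):
    """Average |value| per component; shap_rows holds one row of values per sample."""
    n = len(shap_rows)
    return {name: sum(abs(row[j]) for row in shap_rows) / n
            for j, name in enumerate(names)}


def select_top_components(mean_abs, b, f=4):
    """Pick the b components with the largest mean |Shapley value|.

    Returns (selected_names, ratio_ok), where ratio_ok reports whether the
    selected total is at least f times the total of the remaining components.
    """
    ranked = sorted(mean_abs.items(), key=lambda kv: kv[1], reverse=True)
    top, rest = ranked[:b], ranked[b:]
    top_sum = sum(value for _, value in top)
    rest_sum = sum(value for _, value in rest)
    return [name for name, _ in top], top_sum >= f * rest_sum


# Hypothetical mean |Shapley value| per permission, invented for illustration.
scores = {"SEND_SMS": 0.40, "READ_PHONE_STATE": 0.25, "RECEIVE_SMS": 0.15,
          "INTERNET": 0.05, "VIBRATE": 0.03, "WAKE_LOCK": 0.02}
selected, ratio_ok = select_top_components(scores, b=3, f=4)
print(selected, ratio_ok)
```

If `ratio_ok` is false for a given B, a larger B would be tried until the F-times condition holds; that search loop is an assumption about how the condition might be enforced, not something the source spells out.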

In this embodiment, B is 9 and the interpretable machine learning algorithm used is the SHAP algorithm. The sub-feature vector computed for the permission feature type is [ "SEND_SMS", "GET_TASKS", "READ_PHONE_STATE", "RECEIVE_root_complete", "RECEIVE_SMS", "insert_response", "GET_access", "view_component", "response" ], the sub-feature vector for the API package name feature type is [ "a.java.current.current.locks", "address.function", "address.function.address.function", and the sub-feature vector for the Dalvik bytecode feature type is [ "address.", "and-int/2addr", "rsub-int", "or-int/lit16", "rem-long", "invoke-super/range", "iput", "if-nez" ].

Step 6, combining all the sub-feature vectors used by the detection model to serve as the features used by the detection model; extracting the data corresponding to the features used by the detection model from the initial Android application software sample library, and horizontally splicing the sub-feature vectors in any order to form an initial training data set.

According to the features used by the detection model, an Android reverse-engineering tool is used to match against the files decompressed from the Android application software APK. If a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded; otherwise 0 is recorded. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise, and the result serves as the sample feature vector of the detection model.

In this embodiment, all the sub-feature vectors used by the detection model are merged by horizontal splicing, and the detection model therefore uses 36 features (4 feature types with 9 components each). The process of composing the initial training data set is the same as the process of constructing the feature component screening data set in step 4.3.
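
A minimal sketch of horizontal splicing follows. The component names `c0` to `c8` are placeholders, since the real components come from step 5, and prefixing each name with its feature type is a design choice assumed here to keep identically named components apart.

```python
def splice_features(sub_vectors):
    """Concatenate per-type sub-feature vectors into one flat feature list."""
    features = []
    for feature_type, components in sub_vectors.items():
        features.extend(f"{feature_type}:{name}" for name in components)
    return features


def splice_sample(counts_by_type, sub_vectors, is_malicious):
    """Build one training row: counts in spliced order, then the class label."""
    row = []
    for feature_type, components in sub_vectors.items():
        counts = counts_by_type.get(feature_type, {})
        row.extend(counts.get(name, 0) for name in components)
    row.append(1 if is_malicious else 0)
    return row


# Placeholder component names; four feature types with B = 9 components each.
feature_types = ["permission", "api_package", "intent", "dalvik_bytecode"]
sub_vectors = {t: [f"c{i}" for i in range(9)] for t in feature_types}
features = splice_features(sub_vectors)
print(len(features))  # 4 types x 9 components

# Hypothetical occurrence counts for one malicious sample.
row = splice_sample({"permission": {"c0": 2}, "dalvik_bytecode": {"c8": 7}},
                    sub_vectors, is_malicious=True)
print(len(row))
```

The splicing order is arbitrary, as the text says, but it has to stay fixed between training and detection so that each position keeps the same meaning.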

Step 7, training an initial detection model on the initial training data set by using a machine learning algorithm based on a tree model, and outputting the features used by the detection model as the basis for manually verifying the detection model.

In this embodiment, the machine learning algorithm based on the tree model is the CatBoost algorithm.

Step 8, for Android application software of unknown security, extracting the data corresponding to the features used by the detection model, inputting the data into the trained initial detection model, and detecting whether the application is Android malware.

In this embodiment, Python is used to integrate the "get_properties", "get_services", "get_methods", "get_classes" and "get_instructions" commands of the Android reverse-engineering tool, and the permission, API package name, intent name and Dalvik bytecode features are extracted from the APK file to form a sample to be tested. A sample vector to be tested is then constructed: if a feature used by the detection model appears in the sample, the corresponding position of the vector is recorded as its number of occurrences; otherwise it is recorded as 0. The initial detection model analyzes the sample vector and outputs a detection result. If the detection result is 1, the Android application software under test is malware; if the detection result is 0, it is benign software.

Step 9, obtaining Android malware samples from mainstream domestic and foreign application markets and security websites by using crawler technology to form a model migration malware sample library, wherein the interval between the release date of each malware sample and the collection date is not more than C months, C being a positive integer not less than 1, and the number of malware samples is D, D being a positive integer not less than 100.

In this embodiment, C is 12, the crawled website is GitHub, and the test years are 2019 and 2020. The 2019 model migration malware sample library contains 149 samples, and the 2020 model migration malware sample library contains 181 samples.

Step 10, extracting the data corresponding to the features used by the detection model from the model migration malware sample library to form a model migration data set.

The method for extracting the feature correspondence data used by the detection model in this embodiment is the same as that in step 8.

Step 11, calculating a test statistic by applying an identical-distribution test algorithm to the model migration data set and the initial training data set, and judging whether concept drift has occurred in Android malware.

In this embodiment, the identical-distribution test algorithm is the Mann-Whitney U test, and the critical value adopted for the test is 5.
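
For reference, the Mann-Whitney U statistic for one feature column can be computed as in the sketch below. It uses average ranks for ties and a normal approximation without tie correction for the p-value, which is rough for small samples. How the embodiment aggregates the statistic across features and applies its critical value is not specified in the text, so no drift decision rule is implemented here, and the feature counts are invented.

```python
import math


def mann_whitney_u(x, y):
    """Return (U_x, U_y, two_sided_p) for two samples of one feature."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Group tied values and give each the average of their 1-based ranks.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u_y = n1 * n2 - u_x
    # Normal approximation of the null distribution of U (no tie correction).
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / std_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, u_y, p_value


# Hypothetical occurrence counts of one feature in old and new malware samples.
old_counts = [0, 0, 1, 0, 2]
new_counts = [3, 5, 4, 6, 2]
u_old, u_new, p = mann_whitney_u(old_counts, new_counts)
print(u_old, u_new, round(p, 4))
```

A small value of the lesser U, or equivalently a small p-value, indicates that the two samples are unlikely to come from the same distribution, which is the signal step 11 relies on.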

Step 12, if concept drift has occurred in Android malware, migrating the initial detection model by using a domain adaptation algorithm from the transfer learning field, iterating E times to train new detection models, and replacing the initial detection model with the detection model that achieves the highest accuracy among the E iterations.

The domain adaptation algorithm used in this embodiment is the JDA algorithm, and E is 5.
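
The "iterate E times and keep the most accurate model" logic of step 12 can be sketched independently of the adaptation algorithm itself. The code below does not implement JDA: `adapt_once` and `evaluate` are hypothetical callables standing in for one JDA-based retraining round and an accuracy measurement, and the toy values are invented.

```python
def adapt_and_select(adapt_once, evaluate, initial_model, e=5):
    """Run e adaptation rounds and return the most accurate model seen.

    adapt_once(model) -> new model and evaluate(model) -> accuracy are
    hypothetical stand-ins for JDA-based retraining and scoring.
    """
    best_model, best_acc = None, float("-inf")
    model = initial_model
    for _ in range(e):
        model = adapt_once(model)
        acc = evaluate(model)
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc


# Toy stand-ins: a "model" is just its iteration number.
toy_accuracy = {1: 0.60, 2: 0.80, 3: 0.75, 4: 0.79, 5: 0.70}
best, acc = adapt_and_select(lambda m: m + 1, toy_accuracy.get, 0, e=5)
print(best, acc)
# prints 2 0.8
```

Only the models produced in the E iterations compete here, which follows the wording of step 12; whether the initial model should also be a candidate is left open by the text.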

Step 13, repeating steps 8 to 12 with a period of C months, updating the detection model, and detecting Android malware.

In this embodiment, data from the three years 2018, 2019 and 2020 is used as test data. The accuracy of the initial detection model in 2018 is 96%. In 2019, the accuracy is 34% before steps 8 to 12 are executed and rises to 80% afterwards. In 2020, the accuracy is 43% before steps 8 to 12 are executed and rises to 87% afterwards.

In conclusion, the invention introduces detection features from manual Android malware analysis reports, improves the traditional feature wrapper method based on an automatic machine learning algorithm and an interpretable machine learning algorithm, and fuses an identical-distribution test with a transfer learning algorithm, thereby improving the interpretability of the Android malware detection model and reducing the influence of concept drift on the accuracy of the detection model.

Claims

1. A concept drift-oriented interpretable Android malware detection method, characterized by comprising the following steps:

2. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 3, the high-frequency words of Android malware reverse analysis are extracted by a word frequency statistics algorithm, and the top A ranked valid words are Android programming-language keywords.

3. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 4, the following substeps are included:

4. The concept-drift-oriented interpretable Android malware detection method of claim 3, wherein: in step 4.4, the automatic machine learning algorithm is the TPOT automatic machine learning algorithm, and the pipeline selected as having the highest accuracy among the output pipelines uses a tree-based machine learning model.

5. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 5, the sum of the mean absolute Shapley values of the top B ranked components is not less than F times the sum of the mean absolute Shapley values of the remaining components, wherein F is a positive integer not less than 4.

6. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 5, the interpretable machine learning algorithm is a SHAP algorithm.

7. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in steps 6, 8 and 10, the data corresponding to the features used by the detection model is extracted as follows: according to the features used by the detection model, an Android reverse-engineering tool is used to match against the files decompressed from the Android application software APK; if a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded; otherwise 0 is recorded. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise, and the result serves as the sample feature vector of the detection model.

8. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 7, the machine learning algorithm based on the tree model is the CatBoost algorithm.

9. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 12, the domain adaptation algorithm is the JDA (Joint Distribution Adaptation) algorithm.

10. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: A is a positive integer not less than 4, B is a positive integer not less than 1, C is a positive integer not less than 1, D is a positive integer not less than 100, and E is a positive integer not more than 5.