CN113901463A - Concept drift-oriented interpretable Android malicious software detection method - Google Patents


Publication number
CN113901463A
Authority
CN
China
Prior art keywords
android
feature
detection model
model
interpretable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111033119.5A
Other languages
Chinese (zh)
Other versions
CN113901463B (en)
Inventor
张炳
文峥
高原
赵旭阳
任家东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Application filed by Yanshan University
Priority to CN202111033119.5A
Publication of CN113901463A
Application granted
Publication of CN113901463B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/031 Protect user input by software means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a concept drift-oriented interpretable Android malware detection method, which belongs to the technical field of information security. The method introduces detection features from manual Android malware analysis reports, improves the traditional feature wrapper method using an automated machine learning algorithm and an interpretability algorithm, and combines an identical-distribution test algorithm with a transfer learning algorithm. The method improves the interpretability of the Android malware detection model, helps reverse-engineering analysts verify the detection model manually, reduces the impact of concept drift on the accuracy of the detection model, helps the model maintain high accuracy over a long period at low cost, and is used for detecting and analyzing malicious Android applications.

Description

Concept drift-oriented interpretable Android malicious software detection method
Technical Field
The invention relates to the technical field of information security, in particular to a concept drift-oriented interpretable Android malicious software detection method.
Background
In the first quarter of 2021, the 360 Internet Security Center intercepted about 206.5 thousand newly added mobile malware samples, 426.5% more than in the same period of 2020, with a per-capita economic loss of 14,611 yuan. As of April 2021, the Android operating system held 76.91% of the Chinese mobile device market relative to iOS, and the open application ecosystem of the Android platform makes it more vulnerable to malware.
Existing Android malware detection technologies fall into three major categories: signature-based detection, machine learning-based static detection, and machine learning-based application behavior detection. The sandbox mechanism of the Android system makes it difficult to monitor the dynamic behavior of applications on a non-customized system. Machine learning-based static detection has become the mainstream Android malware detection approach because of its high detection accuracy on unknown malware and its low hardware requirements.
However, machine learning-based static detection has the following three main problems:
1. The proportion of applications in app markets that request sensitive permissions is decreasing, and some malicious applications can complete an attack without requesting any new permission. A single permission feature, or a combination of features introduced without an underlying logic, is not sufficient to characterize malware.
2. As black-box machine learning algorithms achieve higher and higher accuracy, the demands on model interpretability and transparency in malicious application detection are also rising. Android malware reverse-engineering analysts need the model to provide the basis for its decisions, so that they can carry out manual analysis or judge whether a model decision is reasonable.
3. Frequent updates of the Android system version mean that applications developed with many different versions of the software development kit each hold some market share. Because of concept drift, a machine learning model trained at the cost of a large number of samples performs poorly when detecting Android malware from a different period.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a concept drift-oriented interpretable Android malicious software detection method, so that the interpretability of an Android malicious software detection model is improved, and the influence of the concept drift problem on the accuracy of the detection model is reduced.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A concept drift-oriented interpretable Android malware detection method comprises the following steps:
step 1, collecting a plurality of manual analysis reports on malicious Android applications to form a manual analysis report sample library;
step 2, collecting a plurality of malicious and benign Android application samples to form an initial Android application sample library, wherein the number of malicious samples equals the number of benign samples;
step 3, extracting the high-frequency words of Android malware reverse analysis from the manual analysis report sample library, wherein the top A valid words by rank are used as the feature types used by the detection model;
step 4, according to the initial Android application sample library, using an automated machine learning algorithm to construct a screening feature vector and train a feature component screening model for each feature type used by the detection model, wherein the number of screening feature vectors is A;
step 5, for each feature component screening model, using an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component in the screening feature vector, wherein the top B components by rank are used as a sub-feature vector used by the detection model;
step 6, merging the sub-feature vectors used by the detection model to obtain the features used by the detection model; extracting the data corresponding to those features from the initial Android application sample library to form an initial training data set;
step 7, training an initial detection model on the initial training data set using a tree-based machine learning algorithm, and outputting the features used by the detection model as the basis for manual verification of the detection model;
step 8, for an Android application whose security is unknown, extracting the data corresponding to the features used by the detection model, inputting the data into the trained initial detection model, and detecting whether the application is Android malware;
step 9, obtaining Android malware samples from mainstream domestic and foreign application markets and security websites using a crawler to form a model migration malware sample library, wherein the release date of each malware sample is no more than C months before the collection date, and the number of malware samples is D;
step 10, extracting the data corresponding to the features used by the detection model from the model migration malware sample library to form a model migration data set;
step 11, calculating a test statistic from the model migration data set and the initial training data set using an identical-distribution test algorithm, and judging whether concept drift has occurred in the Android malware;
step 12, if concept drift has occurred, transferring the initial detection model using a domain adaptation algorithm from the field of transfer learning, iterating E times, training a new detection model, and replacing the initial detection model;
and step 13, repeating steps 8 to 12 with a period of C months, updating the detection model, and detecting Android malware.
The technical scheme of the invention is further improved as follows: in step 3, the high-frequency words of Android malware reverse analysis are extracted with a word frequency statistics algorithm, and the top A valid words by rank are Android programming language keywords.
The technical scheme of the invention is further improved as follows: step 4 includes the following sub-steps:
4.1 projecting one feature type used by the detection model from the initial Android application sample library;
4.2 if the feature type has already been projected, selecting a feature type used by the detection model that has not yet been projected, and executing step 4.1;
4.3 if the feature type has not been projected, taking all distinct values of that feature contained in the projected data as the screening feature vector of the feature; constructing a feature component screening data set, which comprises the sample feature vectors of all samples;
4.4 inputting the feature component screening data set into an automated machine learning algorithm, and selecting the pipeline with the highest accuracy among the output pipelines as the feature component screening model of the feature;
4.5 if there is a feature type for which no feature component screening model has been output, executing step 4.1.
The technical scheme of the invention is further improved as follows: in step 4.4, the automated machine learning algorithm is the TPOT automated machine learning algorithm, and the pipeline selected is the most accurate of the output pipelines that apply a tree-based machine learning model.
The technical scheme of the invention is further improved as follows: in step 5, the sum of the mean absolute Shapley values of the top B components is not less than F times the sum of the mean absolute Shapley values of the remaining components, wherein F is a positive integer not less than 4.
The technical scheme of the invention is further improved as follows: in step 5, the interpretable machine learning algorithm is the SHAP algorithm.
The technical scheme of the invention is further improved as follows: in steps 6, 8 and 10, the data corresponding to the features used by the detection model is extracted by using the Android reverse-engineering tool to match the features against the files obtained by decompressing the Android application APK. If a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded; otherwise 0 is recorded. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise, and the result is the sample feature vector of the detection model.
The technical scheme of the invention is further improved as follows: in step 7, the tree-based machine learning algorithm is the CatBoost algorithm.
The technical scheme of the invention is further improved as follows: in step 12, the domain adaptation algorithm is the JDA algorithm.
The technical scheme of the invention is further improved as follows: A is a positive integer not less than 4, B is a positive integer not less than 1, C is a positive integer not less than 1, D is a positive integer not less than 100, and E is a positive integer not more than 5.
By adopting the above technical scheme, the invention achieves the following technical progress:
1. The method extracts high-frequency words describing the attack flow from Android malware analysis reports and introduces multiple kinds of features at the source code level and the assembly instruction level, which improves the logical soundness and rationality of the detection features.
2. By combining the initial features, the invention ensures low storage overhead and high analysis speed while characterizing malware better.
3. The method uses an automated machine learning algorithm to screen for the optimal tree-based machine learning classification model. Compared with the parameter tuning process in conventional machine learning model training, this improves the fit between the training data and the model, as well as convenience and efficiency.
4. The invention uses an interpretability algorithm to construct an interpretation mechanism for the detection model. The screened features contribute strongly to the classification results of most training samples, which ensures that the detection model is interpretable and verifiable.
5. The method introduces domain adaptation into the technical field of information security, in particular into Android malware detection. Starting from the existing data and detection model, it uses a small amount of Android malware from the new period to keep the detection accuracy of the proposed model stable over time, effectively alleviating the concept drift problem in Android malware detection.
Drawings
FIG. 1 is a flow chart of the detection method of the present invention;
FIG. 2 is a sub-flowchart for constructing a feature component screening model in the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and examples:
As shown in FIG. 1, a concept drift-oriented interpretable Android malware detection method specifically includes the following steps:
Step 1, collecting a sufficient number of manual analysis reports on malicious Android applications to form a manual analysis report sample library.
In this embodiment, Android malware analysis reports are sampled from the Kharon data set to form the manual analysis report sample library. The reports are written in English and contain 4957 words in total.
Step 2, collecting a sufficient number of malicious and benign Android application samples to form an initial Android application sample library, wherein the number of malicious samples equals the number of benign samples.
In this embodiment, 2900 Android malware samples and 2900 benign samples are collected from the OmniDroid data set, and the Android application samples are in APK format. Malware is defined as a sample for which more than 50% of the antivirus engine results on the VirusTotal website are positive, and benign software is defined as a sample for which at least 50% of the antivirus engine results on the VirusTotal website are negative.
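The labelling rule of this embodiment can be sketched as follows. This is a minimal illustration rather than the patent's implementation; the function name `label_sample` and its two arguments (the number of engines reporting a positive result and the total number of engines) are assumptions introduced here.

```python
def label_sample(positive_engines: int, total_engines: int) -> int:
    """Return 1 (malicious) if more than 50% of engines flag the sample, else 0 (benign)."""
    if total_engines <= 0:
        raise ValueError("total_engines must be positive")
    # Malicious: strictly more than half of the engines report a positive result.
    # Benign: at least half of the engines report a negative result (the complement).
    return 1 if 2 * positive_engines > total_engines else 0


print(label_sample(36, 70))  # more than half positive
print(label_sample(35, 70))  # exactly half positive
```

Under this reading the two definitions are complementary, so a sample flagged by exactly half of the engines is treated as benign.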
Step 3, extracting the high-frequency words of Android malware reverse analysis from the manual analysis report sample library, wherein the top A valid words by rank, A being a positive integer not less than 4, are used as the feature types used by the detection model.
In this embodiment, a word frequency statistics algorithm is used to extract the high-frequency words of Android malware reverse analysis, and A is 4. The valid words used as feature types of the detection model are Android programming language keywords extracted from the manual analysis report sample library after meaningless words such as articles, pronouns and quantifiers have been removed. The resulting feature types used by the detection model are permissions, API package names, intent names and Dalvik bytecode. The removed words include, but are not limited to: the, is, to, a, and, in, of, also, from.
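The word frequency step can be illustrated with a short sketch. It assumes, for illustration only, that valid words are given as a whitelist of lower-case keywords; the function name `top_keywords`, the sample report sentences and the whitelist are hypothetical and do not come from the Kharon reports.

```python
import re
from collections import Counter

# Stop words listed in the embodiment; the text notes the list is not exhaustive.
STOP_WORDS = {"the", "is", "to", "a", "and", "in", "of", "also", "from"}


def top_keywords(reports, valid_words, a=4):
    """Return the `a` most frequent valid words across all reports."""
    counts = Counter()
    for text in reports:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS and word in valid_words:
                counts[word] += 1
    return [word for word, _ in counts.most_common(a)]


reports = [
    "The malware requests the permission to send SMS and registers an intent.",
    "An intent triggers the API call; the permission is abused from the service.",
    "The bytecode hides the API call and the permission check.",
]
valid = {"permission", "api", "intent", "bytecode"}  # hypothetical keyword whitelist
print(top_keywords(reports, valid))
```

In this toy input "permission" occurs three times and ranks first; the other three keywords fill the remaining places.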
Step 4, according to the initial Android application sample library, using an automated machine learning algorithm to construct a screening feature vector and train a feature component screening model for each feature type used by the detection model, wherein the number of screening feature vectors is A.
As shown in FIG. 2, this step specifically includes the following sub-steps:
4.1 Project one feature type used by the detection model from the initial Android application sample library.
4.2 If the feature type has already been projected, select a feature type used by the detection model that has not yet been projected, and execute step 4.1.
4.3 If the feature type has not been projected, take all distinct values of that feature contained in the projected data as the screening feature vector of the feature. Construct a feature component screening data set, which comprises the sample feature vectors of all samples.
In this embodiment, each sample in the initial Android application sample library includes 45 features, such as the installation package name, file name, hash code, permissions, API package names, intent names and Dalvik bytecode. Projection yields data containing only one of the four features: permissions, API package names, intent names or Dalvik bytecode. If a component of the screening feature vector appears in the sample, the corresponding position is recorded as its number of occurrences; otherwise it is recorded as 0. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise. For example, if the screening feature vector of the Dalvik bytecode is [ "shl-int", "long-to-int", "if-gt" ] and the sample feature vector of one sample is [5, 3, 21, 1], the sample is malicious and contains 5 occurrences of "shl-int", 3 of "long-to-int" and 21 of "if-gt". In this embodiment, the screening feature vectors of the four features (permissions, API package names, intent names and Dalvik bytecode) have 184, 4185, 223 and 436 dimensions, respectively.
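The construction of a sample feature vector can be sketched in a few lines. The function name `build_sample_vector` and the list-of-strings input are assumptions made for illustration; the screening vector and the counts reproduce the Dalvik bytecode example given in this embodiment.

```python
from collections import Counter


def build_sample_vector(screening_vector, observed_items, is_malicious):
    """Count each screening component in a sample and append the class label."""
    counts = Counter(observed_items)
    # Counter returns 0 for components that never occur in the sample.
    vector = [counts[component] for component in screening_vector]
    vector.append(1 if is_malicious else 0)
    return vector


screening = ["shl-int", "long-to-int", "if-gt"]
# Hypothetical opcode stream; "nop" is not in the screening vector and is ignored.
observed = ["shl-int"] * 5 + ["long-to-int"] * 3 + ["if-gt"] * 21 + ["nop"] * 2
print(build_sample_vector(screening, observed, is_malicious=True))  # [5, 3, 21, 1]
```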
4.4 Input the feature component screening data set into an automated machine learning algorithm, and select the pipeline with the highest accuracy among the output pipelines as the feature component screening model of the feature.
In this embodiment, the TPOT automated machine learning algorithm is used, and the pipeline selected is the most accurate of the output pipelines that apply a tree-based machine learning model.
4.5 If there is a feature type for which no feature component screening model has been output, execute step 4.1.
In this embodiment, step 4 yields four feature component screening models, one each for permissions, API package names, intent names and Dalvik bytecode.
Step 5, for each feature component screening model, using an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component in the screening feature vector, wherein the top B components by rank, B being a positive integer not less than 1, are used as a sub-feature vector used by the detection model. The sum of the mean absolute Shapley values of the top B components is not less than F times the sum of the mean absolute Shapley values of the remaining components, wherein F is a positive integer not less than 4.
In this embodiment, B is 9 and the interpretable machine learning algorithm used is the SHAP algorithm. The sub-feature vector calculated for the permission feature type is [ "SEND_SMS", "GET_TASKS", "READ_PHONE_STATE", "RECEIVE _ root _ complete", "RECEIVE_SMS", "insert _ response", "GET _ access", "view _ component", "response" and the sub-feature vector used by the API package name detection model is [ a.java.current.current.locks "," address.function "," address.function.address.function ", and the sub-feature vector used by the authority detection model is" address. "and-int/2addr", "rsub-int", "or-int/lit16", "rem-long", "invoke-super/range", "iput", "if-nez" ].
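The ranking criterion of step 5 can be sketched as follows, assuming the per-sample Shapley values have already been computed (for example by a SHAP explainer) and are passed in as plain lists. The function name `select_top_components`, the toy values and the component names are illustrative assumptions. The sketch returns the smallest top-ranked set that satisfies the F-times condition, which is one possible way to choose B; the embodiment simply fixes B at 9.

```python
def select_top_components(names, shap_rows, f=4):
    """Pick the smallest top-ranked set whose importance is at least f times the rest."""
    n_samples = len(shap_rows)
    # Mean absolute Shapley value of each component over all samples.
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_samples
        for j in range(len(names))
    ]
    ranked = sorted(range(len(names)), key=lambda j: mean_abs[j], reverse=True)
    total = sum(mean_abs)
    running = 0.0
    for b, j in enumerate(ranked, start=1):
        running += mean_abs[j]
        # Condition from step 5: sum(top B) >= F * sum(remaining components).
        if running >= f * (total - running):
            return [names[k] for k in ranked[:b]]
    return [names[k] for k in ranked]


names = ["SEND_SMS", "GET_TASKS", "INTERNET"]    # illustrative component names
shap_rows = [[0.8, -0.1, 0.05], [-0.6, 0.3, 0.05]]  # toy per-sample Shapley values
print(select_top_components(names, shap_rows, f=4))
```

With these toy values the mean absolute Shapley values are 0.7, 0.2 and 0.05, so the first component alone fails the condition and the first two together satisfy it.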
Step 6, merging the sub-feature vectors used by the detection model to obtain the features used by the detection model; extracting the data corresponding to those features from the initial Android application sample library, and horizontally concatenating the sub-feature vectors in any order to form an initial training data set.
According to the features used by the detection model, the Android reverse-engineering tool is used to match the files obtained by decompressing the Android application APK. If a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded; otherwise 0 is recorded. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise, and the result is the sample feature vector of the detection model.
In this embodiment, the sub-feature vectors are merged by horizontal concatenation, and the detection model uses 36 features in total. The initial training data set is composed in the same way as the feature component screening data set in step 4.3.
Step 7, training an initial detection model on the initial training data set using a tree-based machine learning algorithm, and outputting the features used by the detection model as the basis for manual verification of the detection model.
In this embodiment, the tree-based machine learning algorithm is the CatBoost algorithm.
Step 8, for an Android application whose security is unknown, extracting the data corresponding to the features used by the detection model, inputting the data into the trained initial detection model, and detecting whether the application is Android malware.
According to the features used by the detection model, the Android reverse-engineering tool is used to match the files obtained by decompressing the Android application APK. If a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded; otherwise 0 is recorded. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise, and the result is the sample feature vector of the detection model.
In this embodiment, Python is used to integrate the "get_properties", "get_services", "get_methods", "get_classes" and "get_instructions" commands of the Android reverse-engineering tool, and the permission, API package name, intent name and Dalvik bytecode features are extracted from the APK file to form the sample to be tested. The vector of the sample to be tested is then constructed: if a feature used by the detection model appears in the sample, the corresponding component of the vector is recorded as its number of occurrences; otherwise it is recorded as 0. The initial detection model analyzes this vector and outputs a detection result. If the detection result is 1, the Android application under test is malware; if the detection result is 0, it is benign.
Step 9, obtaining Android malware samples from mainstream domestic and foreign application markets and security websites using a crawler to form a model migration malware sample library, wherein the release date of each malware sample is no more than C months before the collection date, C being a positive integer not less than 1, and the number of malware samples is D, D being a positive integer not less than 100.
In this embodiment, C is 12, the crawled website is GitHub, and the test years are 2019 and 2020. The 2019 model migration malware sample library contains 149 samples, and the 2020 library contains 181 samples.
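The recency constraint of step 9 can be expressed as a simple date filter. The sketch below is an illustration under stated assumptions: the function names, the `(name, release_date)` tuple layout and the sample names are hypothetical, and "no more than C months" is interpreted in calendar months.

```python
from datetime import date


def within_c_months(released: date, collected: date, c_months: int) -> bool:
    """True if `released` is at most `c_months` calendar months before `collected`."""
    if released > collected:
        return False  # released after the collection date, so it cannot have been collected
    months = (collected.year - released.year) * 12 + (collected.month - released.month)
    # Exactly c_months apart counts only if the day of month has not been passed.
    return months < c_months or (months == c_months and released.day >= collected.day)


def select_recent(samples, collected: date, c_months: int):
    """Keep the (name, release_date) pairs that satisfy the recency constraint."""
    return [s for s in samples if within_c_months(s[1], collected, c_months)]


samples = [
    ("sample_a", date(2019, 1, 15)),
    ("sample_b", date(2019, 1, 14)),
    ("sample_c", date(2019, 6, 1)),
    ("sample_d", date(2020, 2, 1)),
]
kept = select_recent(samples, collected=date(2020, 1, 15), c_months=12)
print([name for name, _ in kept])  # ['sample_a', 'sample_c']
```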
Step 10, extracting the data corresponding to the features used by the detection model from the model migration malware sample library to form a model migration data set.
According to the features used by the detection model, the Android reverse-engineering tool is used to match the files obtained by decompressing the Android application APK. If a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded; otherwise 0 is recorded. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise, and the result is the sample feature vector of the detection model.
In this embodiment, the data corresponding to the features used by the detection model is extracted in the same way as in step 8.
Step 11, calculating a test statistic from the model migration data set and the initial training data set using an identical-distribution test algorithm, and judging whether concept drift has occurred in the Android malware.
In this embodiment, the identical-distribution test algorithm is the Mann-Whitney U test, and the critical value adopted is 5.
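For reference, the Mann-Whitney U statistic for one feature column can be computed as in the sketch below. This is a plain textbook implementation with average ranks for ties, not the patent's code; the function name and the toy inputs are assumptions. How the statistic is compared with the critical value of 5, and how the 36 features are combined into a single drift decision, is not spelled out in the text and is therefore left out.

```python
def mann_whitney_u(x, y):
    """Return the Mann-Whitney U statistic min(U1, U2), using average ranks for ties."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy 1-based ranks i+1 .. j; each receives their mean.
        avg_rank = (i + 1 + j) / 2
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return min(u1, u2)


train_counts = [1, 2, 3]    # toy occurrence counts of one feature in the training set
migrate_counts = [4, 5, 6]  # toy occurrence counts of the same feature in the new samples
print(mann_whitney_u(train_counts, migrate_counts))  # 0.0
```

A small U indicates that one sample tends to rank entirely above the other, which is the situation concept drift would produce for a given feature.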
Step 12, if concept drift has occurred in the Android malware, transferring the initial detection model using a domain adaptation algorithm from the field of transfer learning, iterating E times, training a new detection model, and replacing the initial detection model with the detection model that achieves the highest accuracy over the E iterations.
In this embodiment, the domain adaptation algorithm used is the JDA algorithm, and E is 5.
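The model selection logic of step 12 (iterate E times, keep the most accurate model) can be sketched independently of the adaptation algorithm. The sketch does not implement JDA; `adapt_once` and `evaluate` are hypothetical callables standing in for one adaptation iteration and for the accuracy measurement, and the toy models are plain integers.

```python
def migrate_model(adapt_once, evaluate, initial_model, e=5):
    """Run `e` adaptation iterations and return the most accurate adapted model."""
    best_model, best_accuracy = None, float("-inf")
    model = initial_model
    for _ in range(e):
        model = adapt_once(model)    # one adaptation iteration (placeholder for JDA)
        accuracy = evaluate(model)   # accuracy of the adapted model on held-out data
        if accuracy > best_accuracy:
            best_model, best_accuracy = model, accuracy
    return best_model, best_accuracy


# Toy stand-ins: a "model" is an iteration counter with a made-up accuracy table.
toy_accuracy = {1: 0.50, 2: 0.80, 3: 0.70, 4: 0.60, 5: 0.75}
best, acc = migrate_model(lambda m: m + 1, lambda m: toy_accuracy[m], initial_model=0, e=5)
print(best, acc)  # 2 0.8
```

Keeping the best of the E candidates, rather than the last one, matches the wording of step 12 and guards against a late iteration that degrades accuracy.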
Step 13, repeating steps 8 to 12 with a period of C months, updating the detection model, and detecting Android malware.
In this embodiment, data from the three years 2018, 2019 and 2020 is used as test data. The accuracy of the initial detection model in 2018 is 96%. In 2019 the accuracy is 34% before steps 8 to 12 are executed and rises to 80% afterwards. In 2020 the accuracy is 43% before steps 8 to 12 are executed and rises to 87% afterwards.
In conclusion, the invention introduces detection features from manual Android malware analysis reports, improves the traditional feature wrapper method using an automated machine learning algorithm and an interpretable machine learning algorithm, and combines an identical-distribution test with a transfer learning algorithm, thereby improving the interpretability of the Android malware detection model and reducing the impact of concept drift on its accuracy.

Claims (10)

1. A concept drift-oriented interpretable Android malware detection method, characterized by comprising the following steps:
step 1, collecting a plurality of manual analysis reports on malicious Android applications to form a sample library of manual analysis reports on malicious Android applications;
step 2, collecting a plurality of malicious and benign Android application samples to form an initial Android application sample library, wherein the number of malicious samples equals the number of benign samples;
step 3, extracting high-frequency words relating to the reverse analysis of malicious Android applications from the sample library of manual analysis reports, wherein the top A ranked valid words are used as the feature types used by the detection model;
step 4, based on the initial Android application sample library and using an automated machine learning algorithm, constructing a screening feature vector and training a feature component screening model for each feature type used by the detection model, wherein the number of screening feature vectors is A;
step 5, for each feature component screening model, using an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component of the screening feature vector, wherein the top B ranked components are used as a sub-feature vector used by the detection model;
step 6, combining all sub-feature vectors used by the detection model to form the features used by the detection model; extracting the data corresponding to the features used by the detection model from the initial Android application sample library to form an initial training data set;
step 7, training an initial detection model on the initial training data set using a tree-model-based machine learning algorithm, and outputting the features used by the detection model as the basis for manual verification of the detection model;
step 8, for an Android application of unknown security, extracting the data corresponding to the features used by the detection model, inputting the data into the trained initial detection model, and detecting whether the application is Android malware;
step 9, obtaining Android malware samples from mainstream domestic and foreign application markets and security websites using crawler technology to form a model migration malware sample library, wherein the interval between the release time of each malware sample and the collection date is not more than C months, and the number of malware samples is D;
step 10, extracting the data corresponding to the features used by the detection model from the model migration malware sample library to form a model migration data set;
step 11, calculating a test statistic from the model migration data set and the initial training data set using an identical-distribution test algorithm, and judging whether concept drift has occurred in the Android malware;
step 12, if concept drift has occurred in the Android malware, migrating the initial detection model using an adaptive algorithm from the transfer learning field, iterating E times to train a new detection model, and replacing the initial detection model;
and step 13, repeating steps 8 to 12 with a period of C months, updating the detection model, and detecting Android malware.
2. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 3, the high-frequency words relating to the reverse analysis of malicious Android applications are extracted using a word-frequency statistics algorithm, and the top A ranked valid words are Android programming language keywords.
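By way of illustration only, the word-frequency extraction recited above might be sketched as follows. The tokenizer, the hypothetical `valid_words` set standing in for "Android programming language keywords", and the sample report texts are all assumptions made for this sketch and are not taken from the patent.

```python
import re
from collections import Counter


def top_feature_types(reports, valid_words, a):
    """Count valid words across all reports and return the top `a` by frequency."""
    counts = Counter()
    for text in reports:
        # Lowercase the report and split it into alphabetic/underscore tokens.
        for token in re.findall(r"[A-Za-z_]+", text.lower()):
            if token in valid_words:
                counts[token] += 1
    return [word for word, _ in counts.most_common(a)]


# Hypothetical manual analysis reports and keyword set.
reports = [
    "The sample requests permission SEND_SMS and starts a service.",
    "A hidden service abuses the permission model; permission READ_CONTACTS is requested.",
    "The intent triggers an activity.",
]
valid_words = {"permission", "service", "intent", "activity"}
feature_types = top_feature_types(reports, valid_words, a=2)
```

Here `a=2` is used only to keep the example short; claim 10 requires A to be at least 4.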
3. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein step 4 comprises the following substeps:
4.1 projecting one feature type used by the detection model from the initial Android application sample library;
4.2 if the feature type has already been projected, selecting a feature type used by the detection model that has not yet been projected, and executing step 4.1;
4.3 if the feature type has not yet been projected, taking all distinct values of the feature type contained in the projected data as the screening feature vector of the feature type; constructing a feature component screening data set, wherein the feature component screening data set comprises the sample feature vectors of all samples;
4.4 inputting the feature component screening data set into an automated machine learning algorithm, and selecting the pipeline with the highest accuracy among the output pipelines as the feature component screening model of the feature type;
4.5 if there is a feature type for which no feature component screening model has been output, executing step 4.1.
4. The concept-drift-oriented interpretable Android malware detection method of claim 3, wherein: in step 4.4, the automated machine learning algorithm is the TPOT automated machine learning algorithm, and the pipeline with the highest accuracy is selected from among the output pipelines that apply a tree-based machine learning model.
5. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 5, the sum of the mean absolute Shapley values of the top B ranked components is not less than F times the sum of the mean absolute Shapley values of the remaining components, wherein F is a positive integer not less than 4.
6. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 5, the interpretable machine learning algorithm is the SHAP algorithm.
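By way of illustration only, the selection rule of claim 5 might be sketched as follows, assuming the mean absolute Shapley value of each component has already been computed (for example with the SHAP library, which is not invoked here). Choosing the smallest B that satisfies the F-times condition is an assumption of this sketch, since the claim only requires that the condition hold; the component names and values are hypothetical.

```python
def select_top_components(mean_abs_shap, f=4):
    """Return the shortest ranked prefix whose sum is >= f times the rest.

    mean_abs_shap: dict mapping component name -> mean absolute Shapley value.
    """
    ranked = sorted(mean_abs_shap.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(value for _, value in ranked)
    selected, running = [], 0.0
    for name, value in ranked:
        selected.append(name)
        running += value
        # Stop once the selected mass dominates the remaining mass.
        if running >= f * (total - running):
            break
    return selected


# Hypothetical mean absolute Shapley values for one feature type.
scores = {
    "SEND_SMS": 0.60,
    "READ_CONTACTS": 0.25,
    "INTERNET": 0.10,
    "VIBRATE": 0.03,
    "WAKE_LOCK": 0.02,
}
sub_feature_vector = select_top_components(scores, f=4)
```

With these values the first component alone is insufficient (0.60 against 4 × 0.40), while the first two suffice (0.85 against 4 × 0.15), so B is 2.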
7. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in steps 6, 8 and 10, the data corresponding to the features used by the detection model is extracted as follows: an Android reverse-engineering tool is used to match the features used by the detection model against the files unpacked from the Android application APK; if a feature used by the detection model appears in the unpacked files, its number of occurrences is recorded, and otherwise 0 is recorded, thereby generating a sequence; 1 is appended to the sequence for a malicious sample and 0 is appended otherwise, and the result is used as the sample feature vector of the detection model.
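By way of illustration only, the construction of a sample feature vector might be sketched as follows. The sketch assumes the APK has already been unpacked and its files read into strings by some reverse-engineering tool, which is not shown; the feature strings and file contents are hypothetical, and plain substring counting stands in for whatever matching the tool performs.

```python
def build_sample_vector(features, unpacked_texts, is_malicious):
    """Count occurrences of each feature in the unpacked files, then append the label."""
    counts = [
        # str.count returns 0 when the feature never appears.
        sum(text.count(feature) for text in unpacked_texts)
        for feature in features
    ]
    return counts + [1 if is_malicious else 0]


# Hypothetical features and unpacked file contents.
features = [
    "android.permission.SEND_SMS",
    "android.permission.INTERNET",
    "getDeviceId",
]
unpacked_texts = [
    '<uses-permission android:name="android.permission.SEND_SMS"/>',
    "invoke-virtual getDeviceId; move-result; invoke-virtual getDeviceId",
]
vector = build_sample_vector(features, unpacked_texts, is_malicious=True)
```

For a sample of unknown security (step 8) the label is not known, so only the count part of the vector would be passed to the detection model.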
8. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 7, the tree-model-based machine learning algorithm is the CatBoost algorithm.
9. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 12, the adaptive algorithm is the JDA algorithm.
10. The concept-drift-oriented interpretable Android malware detection method of claim 1, wherein: a is a positive integer of not less than 4, B is a positive integer of not less than 1, C is a positive integer of not less than 1, D is a positive integer of not less than 100, and E is a positive integer of not more than 5.
CN202111033119.5A 2021-09-03 2021-09-03 Concept drift-oriented interpretable Android malicious software detection method Active CN113901463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111033119.5A CN113901463B (en) 2021-09-03 2021-09-03 Concept drift-oriented interpretable Android malicious software detection method

Publications (2)

Publication Number Publication Date
CN113901463A true CN113901463A (en) 2022-01-07
CN113901463B CN113901463B (en) 2023-06-30

Family

ID=79188640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111033119.5A Active CN113901463B (en) 2021-09-03 2021-09-03 Concept drift-oriented interpretable Android malicious software detection method

Country Status (1)

Country Link
CN (1) CN113901463B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795466A (en) * 2023-02-06 2023-03-14 广东省科技基础条件平台中心 Malicious software organization identification method and equipment
TWI822388B (en) * 2022-10-12 2023-11-11 財團法人資訊工業策進會 Labeling method for information security protection detection rules and tactic, technique and procedure labeling device for the same

Also Published As

Publication number Publication date
CN113901463B (en) 2023-06-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant