CN111324893B - Detection method and background system for android malicious software based on sensitive mode - Google Patents

Publication number
CN111324893B
Authority
CN
China
Prior art keywords
sensitive
mode
sensitive mode
data
android
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010097459.3A
Other languages
Chinese (zh)
Other versions
CN111324893A (en)
Inventor
廖丹
陈锐
黄畅
李慧
张明
陈雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianfu Co Innovation Center University Of Electronic Science And Technology Of China
University of Electronic Science and Technology of China
Original Assignee
Tianfu Co Innovation Center University Of Electronic Science And Technology Of China
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianfu Co Innovation Center University Of Electronic Science And Technology Of China and University of Electronic Science and Technology of China
Priority to CN202010097459.3A
Publication of CN111324893A
Application granted
Publication of CN111324893B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/565 Static detection by checking file integrity

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sensitive-mode-based detection method for android malicious software and a background system. The detection method comprises: obtaining an APK file of the android software to be detected, disassembling the APK file to extract permission data and API (application programming interface) call data, and filtering the extracted data to form a data sample; reading the sensitive mode clusters constructed from the data samples of a plurality of android software, and constructing the data sample into a feature vector based on existence and maximum inclusion degree, with the number of sensitive mode clusters as the dimension; and inputting the feature vector into the trained malicious software detection model and outputting a detection result. The scheme also provides a background system of an application store in which the sensitive-mode-based android malicious software detection method is integrated.

Description

Detection method and background system for android malicious software based on sensitive mode
Technical Field
The invention relates to the field of communication security detection, in particular to a detection method and a background system of android malicious software based on a sensitive mode.
Background
With the rapid development of mobile communication technology, mobile communication devices such as smartphones and tablet computers are used more and more widely. To provide a good user experience, a variety of mobile operating systems have emerged, among which the Android system holds the largest market share. According to a global smartphone operating system market research report issued by IDC (International Data Corporation), the Android system is far ahead with a market share of 86.7%.
In order to protect android mobile users from malicious software and create a secure and healthy mobile communication environment, researchers in academia and industry have proposed techniques and tools for detecting malicious software. The detection method is mainly divided into static analysis and dynamic analysis according to the content of the analysis.
The traditional static analysis technology is based on a signature authentication mechanism, a detection system maintains a signature database of known malicious software, and when a signature used by software to be detected exists in the database, the software to be detected is judged to be the malicious software. A disadvantage of this approach is that unknown malware and malware that uses obfuscation techniques cannot be detected.
To address this problem, more static features have been introduced into malware analysis. For example, malicious software usually has to request the corresponding permissions before it can execute malicious behaviors such as reading the contact list or sending short messages, so some research works propose permission-based detection methods, which distinguish malicious software from normal software by comparing how the relevant permissions are used. API calls are the underlying implementation of an application's functional behavior and reflect its behavioral characteristics to a great extent. Data flow analysis can identify the API calls that pose a high security threat to users, and these API calls help to identify malicious software. In addition to permissions and API calls, some research has also analyzed Android application components, including Activity, Service, Broadcast Receiver and Content Provider, to improve detection accuracy.
Some researchers believe that string features run into the curse of dimensionality as a bottleneck, and that structured features are better suited to processing massive data. For example, a Dalvik opcode graph can be constructed and its topology analyzed, so that the number of nodes, the graph probability density, the graph distance and similar measures characterize the malware. In addition, a Function Call Graph (FCG) can be generated from the call dependencies among methods, and malicious software can be classified and detected by computing the similarity between graphs.
Unlike static analysis, which parses the APK file directly, dynamic analysis has to run the application software and observe and collect its runtime data. The running environment is usually a controllable simulation platform on which interactive operations are carried out under near-real scenarios. Taint analysis is a commonly used technique in dynamic analysis. For example, a tool called TaintDroid uses the Android virtualization architecture to integrate taint propagation at four granularities (messages, variables, methods and files), tracks several sources of sensitive information at the same time, and identifies malicious behaviors of application software by monitoring sensitive data. To capture operating-system-level and Java-level semantics simultaneously, some studies collect detailed native instruction traces and Dalvik instruction traces to track information leakage through Java and native components.
It has been found that most malware requires network connectivity when performing malicious activities, and therefore, analyzing network traffic is also an effective means for detecting malware. By analyzing the IP address, the port number and other connection information in the message, the difference between the malicious software and the normal software can be found. In addition, there are related researches for monitoring abnormal behaviors of application software from a hardware perspective, for example, a power consumption perception-based malware detection framework is proposed, which can generate a corresponding power signature according to a history of power consumption, and reduce detection overhead by adopting noise filtering and data compression techniques. By collecting the information related to the data such as CPU and memory during the operation of the system components, the malicious software can be found according to the abnormal situation of resource occupation.
By analyzing and comparing the prior art, this scheme finds that these technologies have certain limitations. In static analysis, for example, modeling with a single feature often carries a risk of overfitting, while introducing too many features leads to excessive complexity and the curse of dimensionality. Dynamic analysis can improve the generalization ability of the detection model to a certain extent, but because it has to execute the application software, its resource cost is relatively high.
Disclosure of Invention
In view of the above defects in the prior art, the invention provides a sensitive-mode-based android malicious software detection method and a background system that can detect software with high precision without launching it.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
In a first aspect, a method for detecting android malware based on sensitive patterns is provided, which includes:
acquiring an APK file of the android software to be detected, disassembling the APK file to extract permission data and API call data, and filtering the extracted data to form a data sample;
reading the sensitive mode clusters constructed from the data samples of a plurality of android software, and constructing the data sample into a feature vector based on existence and maximum inclusion degree, with the number of sensitive mode clusters as the dimension;
and inputting the feature vector into the trained malicious software detection model, and outputting a detection result.
Further, the method for constructing the malware detection model comprises the following steps:
acquiring a plurality of android software, disassembling the APK file of each to extract permission data and API call data, and filtering the extracted data, wherein each android software forms one data sample;
extracting all frequent item sets in the transaction data set formed by all data samples, wherein each frequent item set is used as a sensitive mode;
calculating the Jaro distance of any two sensitive modes as the text similarity and the cosine similarity of any two sensitive modes as the support similarity, and then taking each sensitive mode as a cluster;
calculating the similarity between two clusters based on the text similarity and the support similarity of the sensitive modes;
judging whether the maximum similarity is smaller than a set threshold value; if not, merging the two clusters with the maximum similarity into one cluster and returning to the previous step; otherwise, taking all current clusters as the sensitive mode clusters and entering the next step;
constructing, for each data sample, a feature vector based on existence and maximum inclusion degree, with the number of sensitive mode clusters as the dimension;
and training the multi-layer gradient boosting decision tree with a training set formed by all the feature vectors to obtain the malicious software detection model.
In a second aspect, a background system of an application store is provided, in which the sensitive-pattern-based android malware detection method is integrated.
The invention has the following beneficial effects: the detection method reveals the difference between malicious software and normal software from the perspective of sensitive permissions and API calls, constructs the data sample into a feature vector through the sensitive mode clusters, and detects malicious software with the constructed malicious software detection model. The application software does not need to be executed during detection, so the resource cost can be reduced.
In addition, the malware detection model constructed by the scheme is formed by training the feature vectors constructed by the data samples based on the sensitive pattern clusters, and has high precision and good generalization capability, so that the detection accuracy in malware detection can be ensured.
Drawings
FIG. 1 is a flow chart of the sensitive-mode-based android malware detection method.
FIG. 2 is a flow chart of the method of constructing the malware detection model.
FIG. 3 is a training process of a multi-layer gradient boosting decision tree.
FIG. 4 shows the support of different sensitive patterns (in part) in malware and normal software.
FIG. 5 is a graph comparing the detection performance of the detection method of the present embodiment with various detection methods in the prior art.
Detailed Description
The following description of embodiments is provided to help those skilled in the art understand the invention. The invention is not limited to the scope of these embodiments: to those skilled in the art, various changes are apparent as long as they remain within the spirit and scope of the invention as defined in the appended claims, and everything created using the inventive concept is protected.
Referring to FIG. 1, FIG. 1 shows a flow chart of the sensitive-mode-based android malware detection method; as shown in FIG. 1, the method 100 includes steps 101 to 103.
In step 101, an APK file of the android software to be detected is obtained, the APK file is disassembled to extract permission data and API call data, and the extracted data is then filtered to form a data sample.
Android software has many static features. In consideration of the complexity and resource cost of the method, this scheme analyzes only its permission and API call information. In the scheme, Apktool (a reverse engineering tool) is used to disassemble the APK file of the android software, and the permission data and API call data are extracted from the AndroidManifest file and the smali files, respectively.
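As a minimal sketch of this extraction step, the Python below parses manifest and smali text that Apktool has already decoded (for example with `apktool d app.apk -o out`). The function names, the regular expression and the sample inputs are illustrative assumptions, not part of the scheme.

```python
import re
import xml.etree.ElementTree as ET

# Namespace prefix of android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Matches smali invoke instructions such as
#   invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
INVOKE_RE = re.compile(r"invoke-[\w/-]+ \{[^}]*\}, (L[^;]+;)->([^(\s]+)\(")


def extract_permissions(manifest_xml):
    """Return the set of permission names declared via <uses-permission>."""
    root = ET.fromstring(manifest_xml)
    names = (el.get(ANDROID_NS + "name") for el in root.iter("uses-permission"))
    return {name for name in names if name}


def extract_api_calls(smali_text):
    """Return the set of 'Lclass;->method' targets invoked in the smali text."""
    return {f"{cls}->{method}" for cls, method in INVOKE_RE.findall(smali_text)}


# Hypothetical sample inputs, standing in for files produced by Apktool.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
</manifest>"""

SMALI = """
.method public run()V
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
    return-void
.end method
"""

PERMISSIONS = extract_permissions(MANIFEST)
API_CALLS = extract_api_calls(SMALI)
```

In a real run the two functions would be applied to the manifest and to every `.smali` file under the Apktool output directory, and the results merged per application.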
In implementation, the extracted data is preferably filtered as follows:
acquiring the dangerous permissions published on the official Android website and the sensitive API list provided by SuSi, and taking them together as a standard database;
and comparing the extracted data with the data in the standard database, and deleting the data that is not in the standard database to form a data sample.
The dangerous permissions published on the official Android website and the sensitive API list provided by SuSi are drawn from many aspects, such as network connection, phone state, contact lists, short messages, mail, account information and geographic location, which ensures that the standard database has comprehensive coverage.
By deleting part of the data in the above way, redundant information and noise in the extracted data can be removed, so as to reduce the complexity of analysis.
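The filtering step amounts to a set intersection with the standard database. The sketch below illustrates it; the entries in the two lists are a tiny hypothetical excerpt chosen for illustration, not the actual Android or SuSi lists.

```python
def build_standard_database(dangerous_permissions, sensitive_apis):
    """Merge the two reference lists into one lookup set."""
    return set(dangerous_permissions) | set(sensitive_apis)


def filter_sample(extracted, standard_db):
    """Keep only the extracted items that appear in the standard database."""
    return set(extracted) & standard_db


# Hypothetical excerpts of the two reference lists.
DANGEROUS_PERMISSIONS = ["READ_CONTACTS", "SEND_SMS", "READ_PHONE_STATE"]
SENSITIVE_APIS = ["getDeviceId()", "getLine1Number()"]

STANDARD_DB = build_standard_database(DANGEROUS_PERMISSIONS, SENSITIVE_APIS)

# Items extracted from one application; VIBRATE and toString() are noise here.
EXTRACTED = ["SEND_SMS", "VIBRATE", "getDeviceId()", "toString()"]
SAMPLE = filter_sample(EXTRACTED, STANDARD_DB)
```

Each resulting set is one data sample, i.e. one transaction in the later frequent item set mining.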
In step 102, a sensitive pattern cluster constructed based on a plurality of android software data samples is read (the sensitive pattern cluster is constructed by steps S1 to S6 in the method for constructing the malware detection model), and the data samples are constructed into feature vectors based on existence and maximum inclusion degrees by taking the number of the sensitive pattern cluster as a dimension.
In step 103, the feature vector is input into the trained malware detection model, and the detection result is output.
Referring to FIG. 2, FIG. 2 shows a flow chart of a method of building a malware detection model; as shown in fig. 2, the method S includes steps S1 to S7.
In step S1, a plurality of android software is acquired, the APK file of each is disassembled to extract permission data and API call data, and the extracted data is then filtered; each android software forms one data sample.
Step S1 is implemented in the same way as step 101, which is not described again here.
In step S2, all frequent item sets in the transaction data set formed by all data samples are extracted, and each frequent item set is taken as a sensitive mode. The frequent item sets of this scheme can be extracted with the existing Apriori algorithm or FP-growth algorithm.
Meanwhile, the scheme also provides a new method for extracting a plurality of frequent item sets in the transaction data set, which comprises the following steps:
A1, traversing the transaction data set D, and generating the corresponding FP tree and head pointer table T according to the minimum support degree;
A2, for each element item k in the head pointer table T, adding the element item k to the set Q and then to the frequent item set list L, and obtaining from the FP tree the conditional pattern base $B_k$ corresponding to the element item k;
A3, traversing the conditional pattern base $B_k$, and recording its element items and their count values in the head pointer table $t_k$ corresponding to the element item k;
A4, deleting from the head pointer table $t_k$ the element items whose conditional frequency is less than the minimum support degree or equal to the conditional support degree;
A5, if the head pointer table $t_k$ is not empty, going to step A6; otherwise going to step A9;
A6, filtering and sorting the conditional pattern base $B_k$ according to the head pointer table $t_k$;
A7, traversing the updated conditional pattern base $B_k$ to generate the closed conditional FP tree corresponding to the element item k;
A8, updating the FP tree with the closed conditional FP tree, updating the head pointer table T with the head pointer table $t_k$, and returning to step A2;
A9, outputting the frequent item set list L.
The novel method for extracting a plurality of frequent item sets in the transaction data set can greatly reduce the scale of a search space through a pruning strategy, improve the efficiency of finding effective frequent item sets, and reduce the number of frequent item sets to a great extent.
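To make the notion of a frequent item set concrete, the sketch below mines them with a plain level-wise Apriori search, one of the existing algorithms the scheme names. It is not the FP-tree procedure of steps A1 to A9. The `closed_only` helper is an assumption added for illustration: it drops item sets that have a superset with the same support, which is one common way to reduce redundant patterns.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: return {itemset: support} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    level = {frozenset([item]) for item in items}
    level = {s for s in level if support(s) >= min_support}
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        k += 1
        # Join step: unions of two frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return result


def closed_only(freq):
    """Keep item sets that have no proper superset with identical support."""
    return {
        s: sup for s, sup in freq.items()
        if not any(s < t and sup == freq[t] for t in freq)
    }


# Hypothetical transaction data set: one set of items per application.
TRANSACTIONS = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "D"}, {"C", "D"}]
FREQUENT = frequent_itemsets(TRANSACTIONS, min_support=0.5)
CLOSED = closed_only(FREQUENT)
```

On this toy data, {A} and {B} are frequent but not closed, because {A, B} occurs in exactly the same transactions.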
Using the new method for extracting frequent item sets from the transaction data set, this scheme mines sensitive modes in malicious software and normal software, and finds obvious differences between the two in different combination modes. As can be seen in FIG. 3, some sensitive patterns have a significantly higher support in malware than in normal software. For example, {READ_PHONE_STATE, INTERNET, SEND_SMS, getDeviceId(), getActiveNetworkInfo()} has a support of 0.75 in malicious software and 0.42 in normal software. Other sensitive modes show the opposite: {ACCESS_NETWORK_STATE, BLUETOOTH, getMessage()} has a support of up to 0.71 in normal software and only 0.36 in malware. To construct features that effectively distinguish malicious software from normal software, the scheme removes the sensitive modes whose support degrees are similar in the two types of software.
A sensitive mode in this scheme is a combination of sensitive permissions and API calls that frequently appears in malicious software or normal software.
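The removal of sensitive modes with similar support in both classes can be sketched as follows. The source does not state how "similar" is decided, so the absolute-gap criterion and the `min_gap` parameter below are assumptions, and the data is a toy example.

```python
def support(pattern, samples):
    """Fraction of samples that contain every item of the pattern."""
    if not samples:
        return 0.0
    return sum(pattern <= sample for sample in samples) / len(samples)


def discriminative_patterns(patterns, malware, benign, min_gap):
    """Keep patterns whose support differs between the classes by at least min_gap."""
    kept = []
    for pattern in patterns:
        gap = abs(support(pattern, malware) - support(pattern, benign))
        if gap >= min_gap:
            kept.append(pattern)
    return kept


# Hypothetical data samples for each class.
MALWARE = [{"SEND_SMS", "INTERNET"}, {"SEND_SMS", "INTERNET"},
           {"INTERNET"}, {"SEND_SMS", "INTERNET"}]
BENIGN = [{"INTERNET"}, {"INTERNET", "BLUETOOTH"},
          {"INTERNET"}, {"SEND_SMS", "INTERNET"}]

PATTERNS = [frozenset({"SEND_SMS", "INTERNET"}), frozenset({"INTERNET"})]
KEPT = discriminative_patterns(PATTERNS, MALWARE, BENIGN, min_gap=0.2)
```

Here {INTERNET} is dropped because it is equally common in both classes, while {SEND_SMS, INTERNET} is kept.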
In step S3, the Jaro distance of any two sensitive modes is calculated as the text similarity and the cosine similarity of any two sensitive modes is calculated as the support similarity, and then each sensitive mode is taken as a cluster.
In implementation, the preferred formula for the Jaro distance between any two sensitive modes is:
[Formula image GDA0003468256990000081 not reproduced in the source text]
where $d_{Jaro_{ij}}$ is the Jaro distance between sensitive mode $sp_i$ and sensitive mode $sp_j$; $m_{ij}$ is the number of words matched between sensitive mode $sp_i$ and sensitive mode $sp_j$; and $|s_1|$ and $|s_2|$ are the numbers of words in sensitive mode $sp_i$ and sensitive mode $sp_j$, respectively.
For example, for the sensitive mode {ACCESS_NETWORK_STATE, getDeviceId(), getLine1Number(), getMessage(), getText()} and the sensitive mode {ACCESS_NETWORK_STATE, getDeviceId(), getMessage(), getText(), INTERNET}, the number of matched words is 4, and their Jaro distance is:
[Formula image GDA0003468256990000082 not reproduced in the source text]
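Because the formula image is not available, the sketch below falls back on the standard Jaro similarity, applied to word tokens instead of characters. That choice is an assumption: the scheme's exact formula may differ, for instance in how transpositions are handled. The two word lists are the sensitive modes from the example above.

```python
def jaro_words(s1, s2):
    """Standard Jaro similarity over two sequences of word tokens."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Tokens match if equal and no further apart than this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, token in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == token:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched tokens that appear in a different order.
    matched1 = [t for t, used in zip(s1, used1) if used]
    matched2 = [t for t, used in zip(s2, used2) if used]
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


WORDS_A = ["ACCESS_NETWORK_STATE", "getDeviceId()", "getLine1Number()",
           "getMessage()", "getText()"]
WORDS_B = ["ACCESS_NETWORK_STATE", "getDeviceId()", "getMessage()",
           "getText()", "INTERNET"]
JARO_AB = jaro_words(WORDS_A, WORDS_B)
```

With this standard definition, the two example modes have 4 matched words out of 5 on each side and no transpositions.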
The cosine similarity of any two sensitive modes is calculated as:
$$\cos_{ij} = \frac{\vec{sup_i} \cdot \vec{sup_j}}{\|\vec{sup_i}\| \, \|\vec{sup_j}\|}$$
where $\cos_{ij}$ is the cosine similarity between sensitive mode $sp_i$ and sensitive mode $sp_j$; $\vec{sup_i} = ((supa)_i, (supb)_i)$ is the support vector of sensitive mode $sp_i$; $\vec{sup_j}$ is the support vector of sensitive mode $sp_j$; and $(supa)_i$ and $(supb)_i$ are the supports of sensitive mode $sp_i$ in malware and in normal software respectively, where a denotes the malware class and b denotes the normal software class.
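As a small worked example of the support similarity, the sketch below computes the cosine similarity of two support vectors. The vectors reuse the support values quoted earlier for two sensitive modes, (0.75, 0.42) and (0.36, 0.71); the function name is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# (support in malware, support in normal software) for two sensitive modes.
SUP_I = (0.75, 0.42)
SUP_J = (0.36, 0.71)
COS_IJ = cosine_similarity(SUP_I, SUP_J)
```

Two modes that are both far more common in malware than in normal software would score close to 1, while modes that lean towards opposite classes, as here, score lower.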
According to the scheme, the similarity in the step S4 is calculated by combining the two aspects of text similarity and support similarity, so that the stability of the malware detection model obtained by later training is higher, and the detection is more accurate when the malware detection model is applied to malware detection.
In step S4, the similarity between two clusters is calculated based on the text similarity and the support similarity of the sensitive modes they contain:
[Formula image GDA0003468256990000095 not reproduced in the source text]
$$sim\_max(C_i, C_j) = \max\{\, sim(sp_i, sp_j) \mid sp_i \in C_i,\ sp_j \in C_j \,\}$$
$$sim\_min(C_i, C_j) = \min\{\, sim(sp_i, sp_j) \mid sp_i \in C_i,\ sp_j \in C_j \,\}$$
[Formula image GDA0003468256990000096 not reproduced in the source text]
where $sim(sp_i, sp_j)$ is the similarity between sensitive mode $sp_i$ and sensitive mode $sp_j$; $w$ is a weight value in $[0, 1]$; $C_i$ and $C_j$ are clusters; $sim(C_i, C_j)$ is the similarity between $C_i$ and $C_j$; $sim\_max(C_i, C_j)$ is the maximum similarity of $C_i$ and $C_j$; $sim\_min(C_i, C_j)$ is the minimum similarity of $C_i$ and $C_j$; and the Jaro distance and the cosine similarity between sensitive mode $sp_i$ and sensitive mode $sp_j$ are those calculated in step S3.
In step S5, it is determined whether the maximum similarity is smaller than a set threshold. If not, the two clusters with the maximum similarity are merged into one cluster and the process returns to step S4; otherwise, all current clusters are taken as the sensitive mode clusters and the process proceeds to step S6.
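The merge loop of steps S4 and S5 can be sketched as a simple agglomerative clustering. The pairwise similarity is passed in as a function. The rule for combining the maximum and minimum pairwise similarities into a cluster similarity is not recoverable from the source text, so the plain average used below is an assumption, as is the Jaccard measure standing in for the pairwise similarity in the example.

```python
from itertools import combinations


def cluster_similarity(ci, cj, sim):
    """Combine the max and min pairwise similarities (the average is an assumption)."""
    pair_sims = [sim(a, b) for a in ci for b in cj]
    return (max(pair_sims) + min(pair_sims)) / 2


def merge_clusters(patterns, sim, threshold):
    """Start with one cluster per pattern; merge the closest pair until below threshold."""
    clusters = [[p] for p in patterns]
    while len(clusters) > 1:
        best, i, j = max(
            (cluster_similarity(clusters[a], clusters[b], sim), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best < threshold:
            break  # maximum similarity below threshold: clustering is finished
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # i < j, so index i is unaffected
    return clusters


def jaccard(a, b):
    """Stand-in pairwise similarity, used only for this example."""
    return len(a & b) / len(a | b)


PATTERNS = [frozenset({"A", "B"}), frozenset({"A", "B", "C"}), frozenset({"X", "Y"})]
CLUSTERS = merge_clusters(PATTERNS, jaccard, threshold=0.5)
```

In the scheme itself, `sim` would be the weighted combination of Jaro distance and cosine similarity rather than the Jaccard stand-in.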
In step S6, a feature vector based on existence and maximum inclusion degree is constructed for each data sample, with the number of sensitive mode clusters as the dimension. Step S6 is implemented as follows.
To represent each android software sample explicitly, this scheme constructs a feature vector based on existence and maximum inclusion degree. The feature dimension is the number of sensitive mode clusters. If the data sample contains any mode of a cluster, the corresponding feature value is 1; otherwise, the inclusion degree of the data sample with respect to each mode in the cluster is calculated, and the maximum inclusion degree is taken as the feature value. Assuming the set of permissions and API calls in the data sample is $PA_h$, the feature vector of the data sample is constructed as follows:
$$V_h = \{v_{h1}, v_{h2}, \ldots, v_{hi}, \ldots, v_{hn}\}$$
$$v_{hi} = \begin{cases} 1, & \exists\, sp_j \in C_i:\ sp_j \subseteq PA_h \\ \max\{\, inclu(sp_j, PA_h) \mid sp_j \in C_i \,\}, & \text{otherwise} \end{cases}$$
$$inclu(sp_j, PA_h) = \frac{|sp_j \cap PA_h|}{|sp_j|}$$
where $V_h$ is the feature vector corresponding to the h-th data sample; $v_{hi}$ is the i-th element of $V_h$; $n$ is the number of sensitive mode clusters; $PA_h$ is the h-th data sample; $sp_j$ is a sensitive mode; $|sp_j|$ is the number of elements in sensitive mode $sp_j$; $|sp_j \cap PA_h|$ is the number of element items common to $sp_j$ and $PA_h$; and $inclu(sp_j, PA_h)$ is the inclusion degree, i.e. the proportion of the element items common to $sp_j$ and $PA_h$ in the total number of element items of $sp_j$.
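The feature construction described above translates directly into a few lines of Python. The sketch below assumes each sensitive mode and each data sample is represented as a set; the cluster contents and the sample are hypothetical.

```python
def inclusion(sp, pa):
    """Share of the pattern's items that also occur in the data sample."""
    return len(sp & pa) / len(sp)


def feature_vector(clusters, pa):
    """One feature per cluster: 1 if any pattern is fully contained, else max inclusion."""
    vector = []
    for cluster in clusters:
        if any(sp <= pa for sp in cluster):
            vector.append(1.0)
        else:
            vector.append(max(inclusion(sp, pa) for sp in cluster))
    return vector


# Two hypothetical sensitive mode clusters and one data sample.
CLUSTERS = [
    [frozenset({"A", "B"}), frozenset({"A", "C"})],
    [frozenset({"X", "Y", "Z", "W"})],
]
SAMPLE = {"A", "B", "X"}
VECTOR = feature_vector(CLUSTERS, SAMPLE)
```

Note that a fully contained pattern has an inclusion degree of 1, so the existence case is the upper end of the same scale rather than a separate measure.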
In step S7, a training set composed of all feature vectors is used to train the multi-layer gradient boosting decision tree, so as to obtain a malware detection model.
The multi-layer gradient boosting decision tree (mGBDT) is a hierarchical model with strong feature learning ability. It is formed by stacking several regression GBDT layers as building blocks and training them jointly with a variant of target propagation. For each GBDT layer, the mapping $F_i: o_{i-1} \to o_i$ ($o_i$ denotes the output of the i-th layer) has a corresponding pseudo-inverse mapping $G_i$ satisfying
[Formula image GDA0003468256990000103 not reproduced in the source text]
(t denotes the t-th iteration), which can be calculated by minimizing the inverse loss function:
[Formula images GDA0003468256990000104 and GDA0003468256990000111 not reproduced in the source text]
For the inverse loss function
[Formula image GDA0003468256990000112 not reproduced in the source text]
injecting Gaussian noise $\varepsilon$ into the output $o_{i-1}$ can enhance the robustness and generalization ability of the model.
As shown in FIG. 4, the training process of the multi-layer gradient boosting decision tree comprises multiple iterations. In each iteration, the pseudo-inverse mapping $G_i$ of each layer is first updated in sequence from back to front and the corresponding pseudo-label $z_{i-1}$ is calculated; the mapping $F_i$ is then updated in sequence from front to back based on the forward loss function $L_i = \|F_i(o_{i-1}) - z_i\|$, which yields the new output $o_i$ of each layer. After a specified number of iterations, the construction of every GBDT layer is complete.
In addition, the scheme also provides a background system of an application store, in which the sensitive-mode-based android malicious software detection method is integrated.
In order to verify the effectiveness of the method proposed by the invention, the results thereof are analyzed by means of relevant experiments as follows:
Data set and experiment platform
In the experiment, the malware samples of the data set come from VirusShare and comprise 8183 malicious applications. The scheme also downloaded 9058 normal applications from several official application stores such as *** play and 360 assistants. To guarantee the quality of the data set, the downloaded normal software was verified a second time with VirusTotal, and 8745 normal applications were finally used in the experiments.
All experiments in the scheme were run on a single PC with a dual-core 3.7 GHz processor and 8 GB of memory, running Windows 10 (64-bit).
First, the effectiveness of the new method proposed by this scheme for extracting frequent item sets from the transaction data set (hereinafter the new extraction method) is examined by comparing its performance with that of the traditional FP-growth algorithm. In the data set used in this scheme, the number of sensitive permissions and API calls contained in each sample varies from a few to several hundred. Table 1 shows the number of frequent item sets mined by the two methods and their mining time at different minimum support degrees.
TABLE 1 Performance comparison of the new extraction method of this scheme with the conventional FP-growth algorithm
[Table image GDA0003468256990000121 not reproduced in the source text]
As can be seen from table 1, due to the addition of the pruning strategy, the number of frequent item sets mined by the new extraction method according to the scheme is less than that of the conventional FP-growth algorithm, and the difference is more obvious along with the reduction of the minimum support degree. Therefore, the new extraction method of the scheme can greatly reduce the generation of redundant mode information. In addition, the mining efficiency of the new extraction method is higher than that of the traditional FP-growth algorithm, and the mining of frequent item sets can be completed in a shorter time.
Multi-layer gradient boosting decision tree performance
In this scheme, a multi-layer gradient boosting decision tree (mGBDT) is used to train the detection model. To evaluate its performance, it is compared with traditional machine learning algorithms such as the support vector machine (SVM), decision tree (DT) and random forest (RF), as well as with XGBoost, which is highly competitive in the field of ensemble learning. The evaluation metrics are Accuracy, Precision and Recall.
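For reference, the three metrics can be computed from the confusion counts as in the sketch below, which treats malware as the positive class (label 1). The labels and predictions are made-up values for illustration, not experimental results.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall, with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # malware flagged as malware
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # normal software passed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # normal software flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # malware missed
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical ground-truth labels and model predictions.
Y_TRUE = [1, 1, 1, 0, 0, 0, 1, 0]
Y_PRED = [1, 1, 0, 0, 0, 1, 1, 0]
ACCURACY, PRECISION, RECALL = evaluate(Y_TRUE, Y_PRED)
```

Precision measures how many flagged applications are truly malicious, which matters for an application store because every false positive blocks a legitimate release.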
As can be seen from FIG. 5, the multi-layer gradient boosting decision tree clearly outperforms the other algorithms, in particular the support vector machine, decision tree and random forest, over which it improves by 3% to 6%. Although XGBoost achieves a higher accuracy and recall, mGBDT achieves a higher precision, which is also important for a malicious software detection system.
The beneficial effects and applications brought by the technology of the invention are as follows: the sensitive-mode-based android malicious software detection method reveals the difference between malicious software and normal software from the perspective of sensitive permissions and API calls; the new extraction method quickly obtains sensitive modes that effectively distinguish malicious software from normal software; and the detection model built with the multi-layer gradient boosting decision tree algorithm has high precision and good generalization ability.
The technology can be integrated in a background system of an application store in practical application, detection and evaluation are carried out on application software to be put on shelf, and high-risk application release is prohibited.

Claims (9)

1. The detection method of the android malicious software based on the sensitive mode is characterized by comprising the following steps:
acquiring an APK file of the android software to be detected, disassembling the APK file to extract permission data and API call data, and filtering the extracted data to form a data sample;
reading the sensitive mode clusters constructed from the data samples of a plurality of android software, and constructing the data sample into a feature vector based on existence and maximum inclusion degree, with the number of sensitive mode clusters as the dimension:
$$V_h = \{v_{h1}, v_{h2}, \ldots, v_{hi}, \ldots, v_{hn}\}$$
$$v_{hi} = \begin{cases} 1, & \exists\, sp_j \in C_i:\ sp_j \subseteq PA_h \\ \max\{\, inclu(sp_j, PA_h) \mid sp_j \in C_i \,\}, & \text{otherwise} \end{cases}$$
$$inclu(sp_j, PA_h) = \frac{|sp_j \cap PA_h|}{|sp_j|}$$
wherein $V_h$ is the feature vector corresponding to the h-th data sample; $v_{hi}$ is the i-th element of $V_h$; $n$ is the number of sensitive mode clusters; $PA_h$ is the h-th data sample; $sp_j$ is a sensitive mode; $|sp_j|$ is the number of elements in the sensitive mode $sp_j$; $|sp_j \cap PA_h|$ is the number of element items common to $sp_j$ and $PA_h$; and $inclu(sp_j, PA_h)$ is the inclusion degree, i.e. the proportion of the element items common to $sp_j$ and $PA_h$ in the total number of element items of $sp_j$;
inputting the feature vector into a trained malware detection model, and outputting a detection result;
wherein the sensitive pattern clusters are obtained as follows:
acquiring a plurality of Android software, disassembling the APK file of each to extract permission data and API call data, and filtering the extracted data, so that each Android software forms one data sample;
extracting all frequent item sets in the transaction data set formed by all data samples, each frequent item set serving as a sensitive pattern;
calculating the Jaro distance of any two sensitive patterns as their text similarity and the cosine similarity of any two sensitive patterns as their support similarity, and then taking each sensitive pattern as a cluster;
calculating the similarity between two clusters based on the text similarity and the support similarity of the sensitive patterns;
judging whether the maximum similarity is smaller than a set threshold; if not, merging the two clusters with the maximum similarity into one cluster and returning to the previous step; otherwise, taking all current clusters as the sensitive pattern clusters.
2. The method for detecting Android malware based on sensitive patterns according to claim 1, wherein the malware detection model is constructed as follows:
acquiring the sensitive pattern clusters;
constructing, for each data sample, a feature vector based on existence and maximum inclusion degree, taking the number of sensitive pattern clusters as the dimensionality;
training the multi-layer gradient boosting decision tree with a training set formed by all the feature vectors to obtain the malware detection model.
3. The method for detecting Android malware based on sensitive patterns according to claim 2, wherein the Jaro distance of any two sensitive patterns is calculated by the following formula:
[formula image: Jaro distance]
wherein the formula gives the Jaro distance between sensitive pattern sp_i and sensitive pattern sp_j in terms of m_ij, the number of words matched between sp_i and sp_j, and the numbers of words in sp_i and sp_j, respectively.
4. The method for detecting Android malware based on sensitive patterns according to claim 2, wherein the cosine similarity of any two sensitive patterns is calculated by the following formula:
[formula image: cosine similarity]
wherein the formula gives the cosine similarity between sensitive pattern sp_i and sensitive pattern sp_j from the support vector of sp_i and the support vector of sp_j; the two components of the support vector of sp_i are the supports of sp_i in malware and in normal software, respectively; the symbol a denotes the malware class, and b denotes the normal software class.
5. The method for detecting Android malware based on sensitive patterns according to claim 2, wherein the similarity between two clusters is calculated by the following formula:
[formula image: similarity between clusters]
wherein the quantities appearing in the formula are: the similarity between sensitive pattern sp_i and sensitive pattern sp_j; the weight w, a value in [0,1]; the clusters C_i and C_j; the similarity between C_i and C_j; the maximum similarity of C_i and C_j; the minimum similarity of C_i and C_j; the Jaro distance between sp_i and sp_j; and the cosine similarity between sp_i and sp_j.
6. The method for detecting Android malware based on sensitive patterns according to any one of claims 1 to 5, wherein the frequent item sets in the transaction data set are extracted as follows:
A1: traversing the transaction data set D, and generating the corresponding FP tree and header pointer table T according to the minimum support;
A2: for each element item k in the header pointer table T, adding element item k to the set Q, then adding element item k to the frequent item set list L, and obtaining from the FP tree the conditional pattern base B_k corresponding to element item k;
A3: traversing the conditional pattern base B_k, and recording the element items therein and their count values into the header pointer table t_k corresponding to element item k;
A4: deleting from the header pointer table t_k the element items whose conditional frequency is less than the minimum support or equal to the conditional support of element item k;
A5: if the header pointer table t_k is not empty, proceeding to step A6; otherwise, proceeding to step A9;
A6: filtering and sorting the conditional pattern base B_k according to the header pointer table t_k;
A7: traversing the updated conditional pattern base B_k to generate the closed conditional FP tree corresponding to element item k;
A8: updating the FP tree with the closed conditional FP tree, updating the header pointer table T with the header pointer table t_k, and returning to step A2;
A9: outputting the frequent item set list L.
7. The method for detecting Android malware based on sensitive patterns according to any one of claims 1 to 5, wherein the Apriori algorithm and the FP-growth algorithm are adopted to extract the frequent item sets in the transaction data set.
8. The method for detecting Android malware based on sensitive patterns according to any one of claims 1 to 5, wherein the extracted data is filtered as follows:
acquiring the dangerous permissions published on the official Android website and the sensitive API list provided by SuSi, and taking them together as a standard database;
comparing the extracted data with the data in the standard database, and deleting the data that is not in the standard database.
9. A backend system of an application store, comprising the detection method of the sensitive-pattern-based android malware according to any one of claims 1 to 8, the detection method being integrated in the backend system.
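The data preparation named in the claims (filtering extracted data against a standard database, then extracting frequent item sets as sensitive patterns) can be illustrated with a small sketch. It uses a plain level-wise Apriori search, not the FP-tree procedure of steps A1 to A9, and it treats the minimum support as a fraction of transactions. The function names `filter_sample` and `frequent_itemsets`, and all item names in the usage lines, are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def filter_sample(extracted: Iterable[str], standard_db: Set[str]) -> FrozenSet[str]:
    """Keep only the extracted items that appear in the standard database."""
    return frozenset(item for item in extracted if item in standard_db)


def frequent_itemsets(
    transactions: List[FrozenSet[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Level-wise Apriori: map each item set with support >= min_support to its support."""
    n = len(transactions)
    if n == 0:
        return {}

    def support(itemset: FrozenSet[str]) -> float:
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # candidate 1-item sets
    result: Dict[FrozenSet[str], float] = {}
    k = 1
    while current:
        frequent = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                frequent[candidate] = s
        result.update(frequent)
        k += 1
        # Join step: unions of two frequent (k-1)-item sets that contain exactly k items.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must itself be frequent.
        current = [
            c
            for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        ]
    return result


if __name__ == "__main__":
    db = {"SEND_SMS", "READ_SMS", "getDeviceId"}
    raw = [["SEND_SMS", "VIBRATE", "getDeviceId"], ["SEND_SMS", "READ_SMS"]]
    samples = [filter_sample(r, db) for r in raw]
    for itemset, sup in frequent_itemsets(samples, 0.5).items():
        print(sorted(itemset), sup)
```

Each returned item set plays the role of one sensitive pattern; on large data sets an FP-tree based miner avoids the repeated scans that this brute-force support count performs.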
CN202010097459.3A 2020-02-17 2020-02-17 Detection method and background system for android malicious software based on sensitive mode Active CN111324893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010097459.3A CN111324893B (en) 2020-02-17 2020-02-17 Detection method and background system for android malicious software based on sensitive mode

Publications (2)

Publication Number Publication Date
CN111324893A CN111324893A (en) 2020-06-23
CN111324893B true CN111324893B (en) 2022-05-10

Family

ID=71163431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010097459.3A Active CN111324893B (en) 2020-02-17 2020-02-17 Detection method and background system for android malicious software based on sensitive mode

Country Status (1)

Country Link
CN (1) CN111324893B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651024A (en) * 2020-12-29 2021-04-13 重庆大学 Method, device and equipment for malicious code detection
CN113747443B (en) * 2021-02-26 2024-06-07 上海观安信息技术股份有限公司 Safety detection method and device based on machine learning algorithm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101472321B1 (en) * 2013-06-11 2014-12-12 고려대학교 산학협력단 Malignant code detect method and system for application in the mobile
CN105530265A (en) * 2016-01-28 2016-04-27 李青山 Mobile Internet malicious application detection method based on frequent itemset description
CN106874763A (en) * 2017-01-16 2017-06-20 西安电子科技大学 The Android software malicious act triggering system and method for modelling customer behavior
CN109753800A (en) * 2019-01-02 2019-05-14 重庆邮电大学 Merge the Android malicious application detection method and system of frequent item set and random forests algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
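Text feature selection method based on inclusion degree and frequent patterns; Chi Yunxian (池云仙); Journal of Chinese Information Processing (《中文信息学报》); 2018-08-31; Vol. 32, No. 8; pp. 91-102 *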

Similar Documents

Publication Publication Date Title
Fan et al. Dapasa: detecting android piggybacked apps through sensitive subgraph analysis
Alsaheel et al. {ATLAS}: A sequence-based learning approach for attack investigation
Zhang et al. An efficient Android malware detection system based on method-level behavioral semantic analysis
Moonsamy et al. Mining permission patterns for contrasting clean and malicious android applications
Li et al. Attribution classification method of APT malware in IoT using machine learning techniques
CN106599686A (en) Malware clustering method based on TLSH character representation
EP2975873A1 (en) A computer implemented method for classifying mobile applications and computer programs thereof
Martín et al. Android malware characterization using metadata and machine learning techniques
CN111324893B (en) Detection method and background system for android malicious software based on sensitive mode
Li et al. On locating malicious code in piggybacked android apps
Ficco Comparing API call sequence algorithms for malware detection
CN113468525A (en) Similar vulnerability detection method and device for binary program
Li et al. Ungrafting malicious code from piggybacked android apps
Xiao et al. A novel malware classification method based on crucial behavior
Li et al. Large-scale third-party library detection in android markets
CN113626810B (en) Android malicious software detection method and system based on sensitive subgraph
CN113468524B (en) RASP-based machine learning model security detection method
Zhang et al. Automatic detection of Android malware via hybrid graph neural network
Lyu et al. An Efficient and Packing‐Resilient Two‐Phase Android Cloned Application Detection Approach
Ohm et al. Sok: Practical detection of software supply chain attacks
Banik et al. Android malware detection by correlated real permission couples using FP growth algorithm and neural networks
D’Onghia et al. Apícula: Static detection of API calls in generic streams of bytes
Li et al. Android malware detection method based on frequent pattern and weighted naive Bayes
Vinod et al. Empirical evaluation of a system call-based android malware detector
Jiang et al. Hetersupervise: Package-level android malware analysis based on heterogeneous graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant