CN110610066B

CN110610066B - Counterfeit application detection method and related device

Info

Publication number: CN110610066B
Application number: CN201810619963.8A
Authority: CN
Inventors: 张路; 潘宣辰
Original assignee: Wuhan Antiy Mobile Security Co ltd
Current assignee: Wuhan Antiy Mobile Security Co ltd
Priority date: 2018-06-15
Filing date: 2018-06-15
Publication date: 2022-08-09
Anticipated expiration: 2038-06-15
Also published as: CN110610066A

Abstract

The embodiments of the invention disclose a counterfeit application detection method, a corresponding device, computer equipment and a computer storage medium, and relate to the technical field of network security. The counterfeit application detection method comprises: when an application to be detected and a certain known genuine application meet a preset condition, judging that the application to be detected is a counterfeit application of that known genuine application, wherein the preset condition comprises: the Hamming distance between $S_x$ and $S_i$ is less than or equal to a first threshold, where $S_x$ is the simhash value of the application to be detected and $S_i$ is the simhash value of the known genuine application. The method starts from the essence of counterfeiting and bases the counterfeit judgment on class names and method names, giving high detection accuracy and preventing missed detections; fast retrieval over a massive number of applications can be realized with an ElasticSearch database, giving high retrieval efficiency; and, based on an adaboost classification system, applications are classified with multiple weights according to several simhash algorithms, making the counterfeit judgment more accurate.

Description

Counterfeit application detection method and related device

Technical Field

The invention relates to the technical field of network security, in particular to a counterfeit application detection method, a counterfeit application detection device, computer equipment and a computer storage medium thereof.

Background

With the rapid development of the mobile internet and the continuing spread of smart terminals, the number of applications has grown explosively. The security problems brought by this growth have increased in step: the number of malicious applications rises sharply year by year and the security situation is not optimistic, particularly on the Android system, whose application market is more open.

Counterfeit applications are the most hazardous kind of application and can lead to malicious fee deduction, privacy disclosure, and rogue behavior. Although the mobile payment industry is now well developed and the security of mobile applications has been strengthened, counterfeit applications can still steal users' money by intercepting their personal information. In addition, counterfeit applications often carry viruses, which can seriously damage users' property and privacy. Rogue behavior such as staying continuously connected to the network can greatly increase traffic costs, and an application that cannot be shut down completely makes the handset slower and more power hungry. Therefore, a method for detecting counterfeit applications is needed.

Existing counterfeit application detection methods perform keyword matching on the program name or the package name to determine whether an application is counterfeit. However, simple keyword matching easily produces a high false alarm rate. For example, an application whose program name is "Alipay Usage Guide" contains the keyword "Alipay", so the existing method judges it to be a counterfeit of Alipay; yet the application was developed not to imitate Alipay but to show people unfamiliar with Alipay how to use it. Although its program name contains the keyword, it is not a counterfeit, and the judgment is wrong. In addition, detection on the program name or package name relies on a single character string, so a real counterfeit application can bypass it, resulting in missed detection. For example, a developer who copies the code of "Honor of Kings" and changes the program name to "5v5 Super Playable" can bypass detection.

Disclosure of Invention

The embodiments of the invention provide a counterfeit application detection method and device, computer equipment and a computer storage medium, to solve the technical problems of a high error rate and frequent missed detections in existing methods that detect counterfeit applications from a single character string in the program name or package name dimension.

In a first aspect, an embodiment of the present invention provides a counterfeit application detection method.

Specifically, the counterfeit application detection method includes:

when an application to be detected and a certain known genuine application meet a preset condition, judging that the application to be detected is a counterfeit application of that known genuine application,

wherein the preset condition includes: the Hamming distance between $S_x$ and $S_i$ is less than or equal to a first threshold, where $S_x$ is the simhash value of the application to be detected and $S_i$ is the simhash value of the known genuine application, and the simhash value of an application is obtained by applying a simhash algorithm to the set of class names and method names of that application.

In a second aspect, an embodiment of the present invention provides a counterfeit application detection apparatus.

Specifically, the apparatus comprises:

a counterfeit judgment module for judging that an application to be detected is a counterfeit application of a certain known genuine application when the application to be detected and that known genuine application satisfy a preset condition,

wherein the preset condition includes: the Hamming distance between $S_x$ and $S_i$ is less than or equal to a first threshold, where $S_x$ is the simhash value of the application to be detected and $S_i$ is the simhash value of the known genuine application, and the simhash value of an application is obtained by applying a simhash algorithm to the set of class names and method names of that application.

In a third aspect, an embodiment of the present invention provides a computer device.

Specifically, the computer device comprises:

a processor; and

a memory for storing a computer program,

wherein the processor is configured to execute the computer program stored in the memory to implement the counterfeit application detection method of the first aspect.

In a fourth aspect, embodiments of the present invention provide a computer storage medium.

In particular, the computer storage medium has stored therein a computer program which, when executed by a processor, implements the counterfeit application detection method of the first aspect.

Because a counterfeit application's imitation of a genuine application is mainly embodied in source code characteristics such as class names and method names, the counterfeit application detection method, device, computer equipment and computer storage medium of the embodiments of the invention perform the counterfeit judgment based on an application's class names and the method names under each class. Compared with existing methods that detect counterfeit applications from a single character string in the program name or package name dimension, they start from the essence of counterfeiting, achieve high detection accuracy and prevent missed detections.

In addition, an ElasticSearch database is used to first check whether the character strings of the simhash value of the application to be detected and of the simhash value of a known genuine application match at any corresponding position, and only then to judge counterfeiting further. This usually reduces the candidate set to be judged by at least four orders of magnitude: a set of suspected counterfeit reference applications can be retrieved quickly from a massive application library, and the accurate simhash calculation is then carried out on that set, ensuring detection efficiency while avoiding omissions.

Finally, based on an adaboost classification system, the invention performs multi-weight classification of the application to be detected according to several simhash algorithms, making the counterfeit judgment more accurate.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly described below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a counterfeit application detection method according to embodiment 1 of the method of the present invention;

FIG. 2 is a flow chart of a counterfeit application detection method according to embodiment 2 of the method of the present invention;

FIG. 3 is a flow chart of a counterfeit application detection method according to embodiment 3 of the method of the present invention;

FIG. 4 is a schematic view of a counterfeit application detection apparatus according to embodiment 1 of the apparatus of the present invention;

FIG. 5 is a schematic view of a counterfeit application detection apparatus according to embodiment 2 of the apparatus of the present invention;

FIG. 6 is a schematic view of a counterfeit application detection apparatus according to embodiment 3 of the apparatus of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention.

Some of the flows described in the present specification, in the claims and in the above figures include a number of operations that appear in a particular order. It should be clearly understood that these operations may be performed in an order other than the one in which they appear herein, or in parallel; reference numbers such as 102 and 104 merely distinguish the operations from one another and do not themselves represent any order of performance. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" in this document distinguish different messages, devices, modules and so on; they do not represent a sequential order, nor do they require that the "first" and "second" items be of different types.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the examples of the present invention, are within the scope of protection of the present invention.

[ METHOD EXAMPLE 1 ]

Fig. 1 is a flow chart of a counterfeit application detection method according to embodiment 1 of the method of the present invention. Referring to fig. 1, in the present embodiment, the method includes:

Step S11, when an application to be detected and a certain known genuine application satisfy a preset condition, determining that the application to be detected is a counterfeit application of that known genuine application, wherein the preset condition includes: the Hamming distance between $S_x$ and $S_i$ is less than or equal to a first threshold, where $S_x$ is the simhash value of the application to be detected, $S_i$ is the simhash value of the known genuine application, and the simhash value of an application is obtained by applying a simhash algorithm to the set of class names and method names of that application.

The first threshold can be determined according to scene requirements and human experience.

Because a counterfeit application's imitation of a genuine application is mainly embodied in source code characteristics such as class names and method names, the counterfeit judgment is performed based on an application's class names and the method names under each class. Compared with existing methods that detect counterfeit applications from a single character string in the program name or package name dimension, this starts from the essence of counterfeiting, achieves high detection accuracy and prevents missed detections.
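The check in step S11 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the token format (`"class->method"` strings), the use of the first 8 bytes of MD5 as the per-token hash, the unweighted voting, and all function names are hypothetical choices made for the example.

```python
import hashlib


def token_hash64(token: str) -> int:
    """64-bit hash of one token, taken from the first 8 bytes of MD5 (assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(tokens) -> int:
    """Unweighted 64-bit simhash over a set of class-name/method-name tokens."""
    counts = [0] * 64
    for tok in set(tokens):
        h = token_hash64(tok)
        for bit in range(64):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit in range(64):
        if counts[bit] > 0:
            value |= 1 << bit
    return value


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def is_counterfeit(s_x: int, s_i: int, first_threshold: int) -> bool:
    """Preset condition of step S11: Hamming distance <= first threshold."""
    return hamming(s_x, s_i) <= first_threshold
```

With this sketch, two applications whose class and method names are largely the same produce simhash values that differ in few bits, regardless of what their program names or package names are.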

Further, the preset condition may further include: the character string of the simhash value of the application to be detected matches the character string of the simhash value of the known genuine application at any corresponding position.

It is understood that the simhash character strings of all known genuine applications can be stored in an ElasticSearch database (ElasticSearch is a full-text search engine whose core is distributed storage and data indexing). When counterfeit detection is performed on the application to be detected, the method can proceed in two steps. In the first step, based on the bool query of ElasticSearch and through the API provided by ElasticSearch, the character string of the application to be detected is compared at corresponding positions with the character string of each known genuine application stored in the ElasticSearch database; if, at any corresponding position, the character string of the application to be detected is the same as that of a known genuine application, that known genuine application is taken as a suspected counterfeit reference application of the application to be detected, which yields a set of suspected counterfeit reference applications. In the second step, the Hamming distance between the simhash value of each suspected counterfeit reference application in the set and the simhash value of the application to be detected is calculated, to further judge whether the application to be detected is indeed a counterfeit of that suspected counterfeit reference application.

Compared with performing the second step without the first, the present embodiment usually reduces the candidate set to be judged by at least four orders of magnitude. This matters because a common practical scenario is the following: given any genuine application, all of its counterfeit applications must be found quickly in a massive application library, which may contain tens of millions or even hundreds of millions of applications; comparing against every entry takes a long time, which hurts working efficiency and gives a poor experience. Using the ElasticSearch database, the method can quickly retrieve the set of suspected counterfeit reference applications from the massive application library and then carry out the accurate similarity calculation on that set, ensuring calculation efficiency while avoiding omissions.

Preferably, the simhash value of the application to be detected and the simhash value of the known genuine application are each divided equally into four segments, and the corresponding position is the position of the corresponding segment in the two values. That is, each segment of the 64-bit simhash value is 16 bits; each segment of the simhash value of the application to be detected is compared with the corresponding segment of the simhash value of the known genuine application, and if the 16-bit strings of any pair of corresponding segments match, the known genuine application is taken as a suspected counterfeit reference application of the application to be detected.

Through a large number of experiments, the inventors found that dividing the simhash value into 4 segments before the first-step screening that yields the suspected counterfeit reference applications preserves accuracy while giving higher calculation efficiency.

Further, the preset condition may further include: $\mathrm{card}(A_x \cap B_i)/\mathrm{card}(A_x \cup B_i)$ is greater than a second threshold, where $A_x$ is the set of class names and method names of the application to be detected and $B_i$ is the set of class names and method names of the known genuine application. Here $\mathrm{card}(A)$ denotes the number of elements in set $A$. The second threshold may be determined from scene requirements and human experience.
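The ratio above is the Jaccard similarity of the two sets. A minimal sketch follows; the function name and the convention of returning 0.0 for two empty sets are assumptions made for the example.

```python
def jaccard(a_x: set, b_i: set) -> float:
    """card(A_x ∩ B_i) / card(A_x ∪ B_i); returns 0.0 when both sets are empty (assumed convention)."""
    union = a_x | b_i
    if not union:
        return 0.0
    return len(a_x & b_i) / len(union)
```

For example, sets sharing two of four distinct elements give a ratio of 0.5, which would pass a second threshold of 0.4 but not one of 0.6.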

[ METHOD EXAMPLE 2 ]

Fig. 2 is a flow chart of a counterfeit application detection method according to embodiment 2 of the method of the present invention. Referring to fig. 2, in the present embodiment, the method includes:

Step S21, acquiring a massive number of reference applications;

The reference applications may be any large batch of application programs, obtained for example by web crawling or from an enterprise's internal sample library.

Step S22, for the reference applications, determining the occurrence frequency of each method name in each class name, wherein the occurrence frequency is the number of reference applications in which the corresponding method name appears;

That is, for each method name in each class name, the number of reference applications in which it appears is counted, and that count is defined as the occurrence frequency.

Step S23, constructing a third-party library from the method names whose occurrence frequencies rank highest and the class names to which those method names belong;

In practice, an application may call third-party libraries such as android/support/v13 and org/json, and the present embodiment uses the occurrence frequency to extract them.

Step S24, for the application to be detected and a certain known genuine application, acquiring all class names of each application and all method names under each class name;

Step S25, from all method names of the application to be detected and of the known genuine application, deleting every method name that is identical, together with the class name to which it belongs, to a method name in the third-party library and its class name, and constructing the set of class names and method names from the remaining method names and the class names to which they belong;

Step S26, when the application to be detected and the known genuine application satisfy a preset condition, determining that the application to be detected is a counterfeit application of the known genuine application, wherein the preset condition includes: the Hamming distance between $S_x$ and $S_i$ is less than or equal to a first threshold, where $S_x$ is the simhash value of the application to be detected, $S_i$ is the simhash value of the known genuine application, and the simhash value of an application is obtained by applying a simhash algorithm to the set of class names and method names of that application.

Many different applications may use the same third-party library. For example, a game app and a video app may both support Alipay payment, so both need to call Alipay's third-party library to implement the payment function. The two apps cannot be considered similar merely because they call the same third-party library; in fact there is no intrinsic connection between them and no counterfeiting. In other words, third-party libraries contribute nothing to characterizing an application and may on the contrary cause interference, so they need to be removed to improve the accuracy of the subsequent similarity calculation. In this embodiment, the interfering items in the application to be detected and in the known genuine application are removed by means of the third-party library, making the counterfeit judgment more accurate.
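Steps S21 to S25 can be sketched as follows. The representation of an application as a set of `(class_name, method_name)` pairs, the `top_n` cut-off parameter, and the function names are assumptions made for the example; the embodiment only says that the top-ranked entries by occurrence frequency form the third-party library.

```python
from collections import Counter


def build_third_party_library(reference_apps, top_n: int) -> set:
    """Collect the (class, method) pairs that appear in the most reference applications.

    reference_apps: iterable of collections of (class_name, method_name) pairs.
    """
    freq = Counter()
    for app in reference_apps:
        freq.update(set(app))      # count each pair at most once per application
    return {pair for pair, _ in freq.most_common(top_n)}


def strip_third_party(app_pairs, library: set) -> set:
    """Remove pairs that match the third-party library; the rest feed the simhash."""
    return set(app_pairs) - library
```

A small usage example: if `("org/json/JSONObject", "put")` appears in every reference application while each application's own classes appear only once, then `build_third_party_library(apps, 1)` returns just that pair, and `strip_third_party` leaves only application-specific class and method names.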

Further, the preset condition may further include: $\mathrm{card}(A_x \cap B_i)/\mathrm{card}(A_x \cup B_i)$ is greater than a second threshold, where $A_x$ is the set of class names and method names of the application to be detected and $B_i$ is the set of class names and method names of the known genuine application. It should be noted that $\mathrm{card}(A)$ denotes the number of elements in set $A$, and the second threshold may be determined from scene requirements and human experience.

[ METHOD EXAMPLE 3 ]

Fig. 3 is a flow chart of a counterfeit application detection method according to embodiment 3 of the method of the present invention. Referring to fig. 3, in the present embodiment, the method includes:

step S31, determining a training sample set, wherein the training samples comprise M counterfeit applications and N non-counterfeit applications corresponding to S known genuine applications;

In the present embodiment, $x_i$ denotes the i-th training sample; it includes the set of class names and method names of the corresponding application and corresponds to one of the S known genuine applications. Training sample $x_i$ may be a counterfeit application of that known genuine application, or another application unrelated to it, so counterfeit sample pairs and non-counterfeit sample pairs can be constructed.

Step S32, giving initial weight to all training samples;

In this embodiment, all counterfeit and non-counterfeit applications are given the same initial weight.

Step S33, training on the samples with a plurality of preset simhash algorithms based on the adaboost classification system, wherein, for the current simhash algorithm, it is judged whether each training sample and each known genuine application meet a preset condition, so as to determine whether each training sample is a counterfeit application of each known genuine application and thereby perform counterfeit classification on each training sample, wherein the preset condition includes: the Hamming distance between $S_x$ and $S_i$ is less than or equal to the simhash distance threshold corresponding to the current simhash algorithm, where $S_x$ is the simhash value of the training sample, $S_i$ is the simhash value of the known genuine application, and the simhash value of an application is obtained by applying the current simhash algorithm to the set of class names and method names of that application;

Specifically, for the current simhash algorithm, it is judged whether the Hamming distance between each training sample and each known genuine application is smaller than the corresponding simhash Hamming distance threshold. If so, the training sample is similar to the genuine sample and is classified as a counterfeit sample; otherwise the training sample is not similar to the genuine sample and is classified as a non-counterfeit sample.

The embodiment determines whether each training sample is counterfeit by using K simhash algorithms, including md5Hash, apHash, pyHash, and other algorithms.

Step S34, determining a current weak classifier according to the counterfeit classification result of each training sample, wherein the current weak classifier is used for outputting the classification result of the current simhash algorithm in a binary mode;

Here, the classification judgment for the i-th training sample under the k-th simhash algorithm may be represented by the k-th weak classifier $H_k(x_i)$. That is, $H_k(x_i)$ uses the current k-th simhash algorithm to compute two simhash values, one from the class names and method names of the counterfeit or non-counterfeit application in the training sample and one from those of the genuine application, calculates the Hamming distance between the two values, and judges whether that distance is smaller than the simhash distance threshold corresponding to the k-th simhash algorithm. If it is, the sample is classified as a counterfeit sample and $H_k(x_i)$ outputs 1; otherwise it is classified as a non-counterfeit sample and $H_k(x_i)$ outputs -1.

Step S35, judging whether each training sample is classified wrongly by the current weak classifier according to the sample label corresponding to each training sample;

For a given training sample, when the sample label is counterfeit, the classification is correct if the current weak classifier classifies it as a counterfeit sample and wrong otherwise; when the sample label is non-counterfeit, the classification is wrong if the current weak classifier classifies it as a counterfeit sample and correct otherwise.

Here, $y_i$ may denote the sample label of the i-th training sample: typically $y_i = 1$ for a counterfeit application and $y_i = -1$ for a non-counterfeit application. When the judgment output by the k-th weak classifier $H_k(x_i)$ for the i-th training sample is inconsistent with that sample's label, the classification is wrong and $I(H_k(x_i) \neq y_i)$ takes the value 1; otherwise the classification is correct and $I(H_k(x_i) \neq y_i)$ takes the value 0.

Step S36, determining the error rate of the current weak classifier according to the classification judgment result of the current weak classifier and the current weight of each training sample;

the error rate calculation for the weak classifier may be:

where $\varepsilon_k$ denotes the error rate of the k-th weak classifier and $w_{k,i}$ denotes the weight of the i-th training sample under the k-th simhash algorithm.

Step S37, determining the weight of the current weak classifier according to the error rate of the current weak classifier;

the weight calculation formula of the weak classifier can be:

wherein alpha is _k Representing the weight corresponding to the k weak classifier;

step S38, updating the weight of each training sample to classify by using the next simhash algorithm until all preset simhash algorithm training is finished;

the weight calculation formula of the training sample is as follows:

Here the normalization factor makes the sample weight vector $D_{k+1} = (w_{k+1,1}, \ldots, w_{k+1,i}, \ldots, w_{k+1,M+N})$ a probability distribution, where $1 \le i \le M+N$.

In this step, after training with the current simhash algorithm is finished, it is determined whether all K preset simhash algorithms have been trained. If so, training is complete and the next step is performed; if not, the process returns to step S33.

And step S39, performing counterfeit classification on the applications to be detected by using all the weak classifiers, and judging whether the applications to be detected are counterfeit applications of a certain known genuine application or not according to the weight of each weak classifier and the counterfeit classification result.

Specifically, for an application to be detected and a certain known genuine application, given K simhash algorithms in total, the following is done for each simhash algorithm: the simhash value of the application to be detected is determined, the Hamming distance between it and the simhash value of the known genuine application is calculated, and it is judged whether that Hamming distance is smaller than the simhash distance threshold corresponding to the simhash algorithm. If so, the application is classified as a counterfeit sample of the genuine sample and the weak classifier outputs 1; if not, it is classified as a non-counterfeit sample of the genuine sample and the weak classifier outputs -1. Then, from the weight of the weak classifier corresponding to each simhash algorithm (known from step S37 of each training round) and the value output by each weak classifier, the multi-weight classification result SH of the application to be detected and the known genuine application is calculated by a formula in which $1 \le k \le K$, $\alpha_k$ is the weight of the k-th weak classifier corresponding to the k-th simhash algorithm, and $H_k(x_i)$ is the output of the weak classifier corresponding to the k-th simhash algorithm for the application to be detected. If SH is positive, the application to be detected is a counterfeit sample of the known genuine sample; if SH is negative, the application to be detected is a non-counterfeit sample of the known genuine sample.
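The formulas referred to in steps S36 to S39 are not reproduced in this text. As a sketch only, and assuming the embodiment follows the standard discrete AdaBoost scheme (an assumption, since the original formulas are not available here), they can be written with the symbols already defined as follows. The initial weights are uniform, as stated in step S32:

$$
w_{1,i} = \frac{1}{M+N}, \qquad
\varepsilon_k = \sum_{i=1}^{M+N} w_{k,i}\, I\bigl(H_k(x_i) \neq y_i\bigr)
$$

$$
\alpha_k = \frac{1}{2} \ln \frac{1 - \varepsilon_k}{\varepsilon_k}
$$

$$
w_{k+1,i} = \frac{w_{k,i}}{Z_k} \exp\bigl(-\alpha_k\, y_i\, H_k(x_i)\bigr), \qquad
Z_k = \sum_{i=1}^{M+N} w_{k,i} \exp\bigl(-\alpha_k\, y_i\, H_k(x_i)\bigr)
$$

$$
SH = \sum_{k=1}^{K} \alpha_k\, H_k(x_i)
$$

Under this form, a weak classifier with a lower error rate $\varepsilon_k$ receives a larger weight $\alpha_k$, misclassified samples gain weight for the next simhash algorithm, and dividing by $Z_k$ makes $D_{k+1}$ sum to 1.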

In this embodiment, based on the adaboost classification system, multi-weight classification is performed on the to-be-detected application according to various simhash algorithms, so that counterfeit judgment is more accurate.
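The training and classification flow of steps S32 to S39 can be sketched in Python as below. This is an illustration under assumptions rather than the embodiment's implementation: it uses the standard discrete AdaBoost update, takes the weak classifier outputs as precomputed inputs, clamps the error rate to keep the logarithm finite, and compares distances with "less than or equal to" as in the preset condition of step S33 (the text also words this comparison as "smaller than"). All function and parameter names are hypothetical.

```python
import math


def train_weights(outputs, labels):
    """Compute one weight alpha_k per weak classifier.

    outputs[k][i] is H_k(x_i) in {+1, -1} for the k-th simhash algorithm;
    labels[i] is y_i in {+1, -1}.
    """
    n = len(labels)
    w = [1.0 / n] * n                      # step S32: equal initial weights
    alphas = []
    for h in outputs:                      # one round per simhash algorithm
        # Step S36: weighted error rate of the current weak classifier.
        eps = sum(wi for wi, hi, yi in zip(w, h, labels) if hi != yi)
        eps = min(max(eps, 1e-10), 1 - 1e-10)   # assumed safeguard against log(0)
        # Step S37: weight of the current weak classifier.
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Step S38: re-weight the samples and normalize to a probability distribution.
        w = [wi * math.exp(-alpha * yi * hi) for wi, hi, yi in zip(w, h, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
        alphas.append(alpha)
    return alphas


def weak_output(distance: int, threshold: int) -> int:
    """H_k: +1 (counterfeit) if the Hamming distance is within the threshold, else -1."""
    return 1 if distance <= threshold else -1


def classify(distances, thresholds, alphas) -> bool:
    """Step S39: weighted vote SH over all weak classifiers; True means counterfeit."""
    sh = sum(a * weak_output(d, t) for a, d, t in zip(alphas, distances, thresholds))
    return sh > 0
```

Here `distances[k]` is the Hamming distance between the application to be detected and the known genuine application under the k-th simhash algorithm, so the simhash computation itself stays outside the sketch.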

Further, the preset condition further includes: the character string of the simhash value of the application to be detected matches the character string of the simhash value of the known genuine application at any corresponding position. Furthermore, the simhash value of the application to be detected and the simhash value of the known genuine application are each divided equally into four segments, and the corresponding position is the position of the corresponding segment in the two values.

Further, the method further comprises:

acquiring a massive number of reference applications;

for the reference applications, determining the occurrence frequency of each method name in each class name, wherein the occurrence frequency is the number of reference applications in which the corresponding method name appears;

constructing a third-party library from the method names whose occurrence frequencies rank highest and the class names to which those method names belong;

for the application to be detected and the known genuine application, acquiring all class names of each application and all method names under each class name;

and, from all method names of the application to be detected and of the known genuine application, deleting every method name that is identical, together with the class name to which it belongs, to a method name in the third-party library and its class name, and constructing the set of class names and method names from the remaining method names and the class names to which they belong.

[ DEVICE EXAMPLE 1 ]

Fig. 4 is a flow chart of a counterfeit application detection apparatus according to embodiment 1 of the apparatus of the present invention. Referring to fig. 4, in the present embodiment, the apparatus includes:

a counterfeit determination module 41, configured to determine that the application to be detected is a counterfeit application of a certain known genuine application when the application to be detected and the certain known genuine application satisfy a preset condition, wherein the preset condition includes: the Hamming distance between S_x and S_i is less than or equal to a first threshold, where S_x is the simhash value of the application to be detected, S_i is the simhash value of the certain known genuine application, and the simhash value of an application is the simhash value calculated from the class name and method name set of that application according to a given simhash algorithm.
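The determination performed by this module can be illustrated with a short sketch. The specification does not fix a particular simhash algorithm, so the 64-bit fingerprint, the use of MD5 as the per-token hash, the unit token weights, and the "class#method" token format below are all assumptions chosen for illustration.

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute a simhash fingerprint over an iterable of string tokens.

    Assumption: each token has weight 1 and is hashed with MD5 (bits <= 128).
    """
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def satisfies_preset_condition(s_x, s_i, first_threshold):
    """True when the Hamming distance is at most the first threshold."""
    return hamming_distance(s_x, s_i) <= first_threshold


# Hypothetical class-name/method-name tokens of a known genuine application.
genuine_tokens = ["com.bank.Login#verify", "com.bank.Transfer#send"]
s_i = simhash(genuine_tokens)
s_x = simhash(list(genuine_tokens))  # identical code yields distance 0
```

An application whose class and method names differ only slightly from those of the genuine application tends to produce a fingerprint within a few bits of `s_i`, which is what the threshold comparison exploits.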

Further, the preset conditions further include:

the simhash value of the application to be detected matches, as a character string, the simhash value of the certain known genuine application at any one corresponding position.

It is understood that the character strings of all known genuine applications can be stored in an ElasticSearch database (ElasticSearch is a full-text search engine whose core is distributed storage and data indexing). When performing counterfeit detection on the application to be detected, the device can first use a bool query, through the API provided by ElasticSearch, to match the character string of the application to be detected against the character string of each known genuine application stored in the ElasticSearch database at the corresponding positions, thereby obtaining a set of suspected counterfeit reference applications. The device then calculates the Hamming distance between the simhash value of each suspected counterfeit reference application in the set and the simhash value of the application to be detected, and further determines whether the application to be detected is indeed a counterfeit application of that suspected counterfeit reference application. In this way, the device uses the ElasticSearch database to quickly retrieve the set of suspected counterfeit reference applications from a massive application library, and then calculates the similarity precisely within that set, which ensures computational efficiency while avoiding omissions.
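The retrieval step described above might be expressed as an Elasticsearch Query DSL body such as the one built below. This is a sketch only: the field names `seg0` to `seg3`, the hexadecimal encoding of each segment, the 64-bit fingerprint width, and the division into four segments (taken from the four-segment variant described elsewhere in this specification) are assumptions. The code only constructs the query dictionary and does not call any client library.

```python
def split_segments(simhash_value, bits=64, parts=4):
    """Split a fingerprint into equal-width hex strings, high bits first."""
    width = bits // parts
    mask = (1 << width) - 1
    hex_len = width // 4
    return [
        format((simhash_value >> (bits - width * (k + 1))) & mask, f"0{hex_len}x")
        for k in range(parts)
    ]


def build_bool_query(simhash_value):
    """Build a bool query that matches documents sharing any one segment.

    Assumption: each known genuine application is indexed with keyword
    fields seg0..seg3 holding the four segments of its simhash value.
    """
    segments = split_segments(simhash_value)
    return {
        "query": {
            "bool": {
                "should": [
                    {"term": {f"seg{k}": seg}} for k, seg in enumerate(segments)
                ],
                "minimum_should_match": 1,
            }
        }
    }


query = build_bool_query(0x0123456789ABCDEF)
```

The documents returned by such a query form the candidate set; the exact Hamming distance is then computed only for those candidates rather than for the whole library.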

[ DEVICE EXAMPLE 2 ]

Fig. 5 is a flow chart of a counterfeit application detection apparatus according to embodiment 2 of the present invention. Referring to fig. 5, in the present embodiment, the apparatus includes:

a mass reference application obtaining module 51, configured to obtain a large number of reference applications;

an appearance frequency determining module 52 for determining, for each reference application, an appearance frequency of each method name in each class name;

a third party library construction module 53, configured to construct a third-party library from the method names whose occurrence frequencies rank highest and the class names to which those method names belong;

a class name and method name obtaining module 54, configured to obtain all class names, and all method names under each class name, of each of the application to be detected and the certain known genuine application;

a class name and method name set construction module 55, configured to delete, for each of the application to be detected and the certain known genuine application, every method name, together with the class name to which it belongs, that is identical to a method name and its class name in the third-party library, and to construct a class name and method name set from the method names remaining after deletion and the class names to which they belong;

a counterfeit determination module 56, configured to determine that the application to be detected is a counterfeit application of the certain known genuine application when the application to be detected and the certain known genuine application satisfy a preset condition, where the preset condition includes: the Hamming distance between S_x and S_i is less than or equal to a first threshold, where S_x is the simhash value of the application to be detected, S_i is the simhash value of the certain known genuine application, and the simhash value of an application is the simhash value calculated from the class name and method name set of that application according to a simhash algorithm.

Many different applications may use the same third-party library. For example, a game app and a video app may both support Alipay payment, and both need to call the Alipay third-party library to implement the payment function. The game app and the video app should not be considered similar merely because both call the same third-party library, since there is in fact no essential connection between them and no counterfeiting. That is, the third-party library contributes nothing to characterizing an application and may instead cause interference, so it needs to be removed to improve the accuracy of the subsequent similarity calculation. In this embodiment, the interfering items in the application to be detected and in the known genuine application are removed by means of the third-party library, so that the counterfeit determination is more accurate.

Further, the preset condition may further include: card(A_x ∩ B_i) / card(A_x ∪ B_i) is greater than a second threshold, where A_x is the class name and method name set of the application to be detected, and B_i is the class name and method name set of the certain known genuine application. Note that card(A) denotes the number of elements in set A.
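The ratio above is the Jaccard similarity of the two sets. A minimal sketch follows; the function name `jaccard`, the sample pairs, and the convention of returning 0.0 for two empty sets are assumptions made for the example.

```python
def jaccard(a_x, b_i):
    """card(A_x ∩ B_i) / card(A_x ∪ B_i) for two collections of pairs."""
    a_x, b_i = set(a_x), set(b_i)
    union = a_x | b_i
    if not union:
        return 0.0  # assumed convention: two empty sets are not similar
    return len(a_x & b_i) / len(union)


# Hypothetical (class name, method name) sets sharing two of four pairs.
a_x = {("com.bank.Login", "verify"), ("com.bank.Transfer", "send"), ("com.fake.Ad", "show")}
b_i = {("com.bank.Login", "verify"), ("com.bank.Transfer", "send"), ("com.bank.Help", "open")}

similarity = jaccard(a_x, b_i)
```

With these sample sets the intersection has two elements and the union has four, so the condition holds for any second threshold below 0.5.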

Further, the preset condition may further include: the simhash value of the application to be detected matches, as a character string, the simhash value of the certain known genuine application at any one corresponding position. Preferably, the simhash value of the application to be detected and the simhash value of the certain known genuine application are each equally divided into four segments, and the "corresponding position" is the position of the corresponding segments of the application to be detected and of the certain known genuine application.
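One way to see why segment matching can act as a safe pre-filter is the pigeonhole argument below. It is a reasoning sketch that assumes the first threshold is at most 3, a value this passage does not itself fix; the superscript (k), denoting the k-th of the four segments, and the symbol d_H for the Hamming distance are notations introduced here.

```latex
d_H(S_x, S_i) \;=\; \sum_{k=1}^{4} d_H\!\left(S_x^{(k)}, S_i^{(k)}\right) \;\le\; 3
\quad\Longrightarrow\quad
\exists\, k \in \{1,2,3,4\} :\; S_x^{(k)} = S_i^{(k)}
```

If every one of the four segments differed in at least one bit, the total distance would be at least 4, so under this assumption any pair within the threshold necessarily shares at least one identical segment and is not lost by the segment match.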

[ DEVICE EXAMPLE 3 ]

Fig. 6 is a flow chart of a counterfeit application detection apparatus according to embodiment 3 of the apparatus of the present invention. Referring to fig. 6, in the present embodiment, the apparatus includes:

a training sample set determining module 61, configured to determine a training sample set, where the training sample set includes a plurality of known counterfeit applications and a plurality of known non-counterfeit applications corresponding to a plurality of known genuine applications;

a training sample weight initialization module 62, configured to initialize weight distribution of each training sample in a training set;

a training sample counterfeit classification module 63, configured to train on the training samples using a plurality of predetermined simhash algorithms based on an adaboost classification system, wherein, for the current simhash algorithm, it is determined whether each training sample and each known genuine application satisfy a preset condition, so as to determine whether each training sample is a counterfeit application of each known genuine application and thereby perform counterfeit classification on each training sample, the training sample being tentatively treated as the application to be detected, and the preset condition including: the Hamming distance between S_x and S_i is less than or equal to the simhash distance threshold corresponding to the current simhash algorithm, where S_x is the simhash value of the training sample, S_i is the simhash value of a certain known genuine application, and the simhash value of an application is the simhash value calculated from the class name and method name set of that application according to the current simhash algorithm;

a weak classifier determining module 64, configured to determine a current weak classifier according to the counterfeit classification result of each training sample, where the current weak classifier is configured to output the counterfeit classification result of the current simhash algorithm in binary form;

a weak classifier classification judgment module 65, configured to judge whether each training sample is classified incorrectly by the current weak classifier according to a sample label corresponding to each training sample;

a weak classifier weight determining module 66, configured to determine the weight of the current weak classifier according to the classification judgment result of the current weak classifier and the current weight of each training sample;

a training sample weight updating module 67, configured to update the weight of each training sample by using the current weak classifier and its weight, the current weight of each training sample, and the sample label until all simhash algorithm training is finished;

a counterfeit determination module 68, configured to perform counterfeit classification on the application to be detected using all the weak classifiers, and to determine, according to the weight of each weak classifier and its counterfeit classification result, whether the application to be detected is a counterfeit application of a certain known genuine application.
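The cooperation of modules 62 to 68 can be sketched with the standard AdaBoost update rules. This passage does not spell out the formulas, so the classifier weight 0.5·ln((1−e)/e), the exponential sample-weight update, the clamping constant `eps`, and the tie-breaking toward −1 are assumptions taken from the textbook algorithm. Each row of `predictions` stands for the +1/−1 outputs that one simhash algorithm, with its own distance threshold, would produce on the training samples.

```python
import math


def train_adaboost(predictions, labels, eps=1e-10):
    """Return one weight (alpha) per weak classifier.

    predictions[m][i]: +1/-1 output of weak classifier m on sample i.
    labels[i]: +1 for a known counterfeit, -1 for a known non-counterfeit.
    """
    n = len(labels)
    sample_weights = [1.0 / n] * n  # module 62: uniform initial weights
    alphas = []
    for preds in predictions:
        # modules 65/66: weighted error of the current weak classifier
        error = sum(w for w, p, y in zip(sample_weights, preds, labels) if p != y)
        error = min(max(error, eps), 1.0 - eps)
        alpha = 0.5 * math.log((1.0 - error) / error)
        alphas.append(alpha)
        # module 67: raise the weight of misclassified samples, then normalise
        sample_weights = [
            w * math.exp(-alpha * y * p)
            for w, p, y in zip(sample_weights, preds, labels)
        ]
        total = sum(sample_weights)
        sample_weights = [w / total for w in sample_weights]
    return alphas


def classify(alphas, votes):
    """Module 68: weighted vote; +1 means counterfeit, -1 means not."""
    score = sum(a * v for a, v in zip(alphas, votes))
    return 1 if score > 0 else -1


# Hypothetical outputs of three simhash algorithms on four training samples.
labels = [1, 1, -1, -1]
predictions = [[1, 1, -1, 1], [1, -1, -1, -1], [1, 1, 1, -1]]
alphas = train_adaboost(predictions, labels)
decision = classify(alphas, [1, -1, -1])
```

A simhash algorithm that misclassifies fewer weighted samples receives a larger weight, so its vote counts for more when the application to be detected is finally classified.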

An embodiment of the present invention further provides a computer device, including a processor and a memory for storing a computer program, where the processor is configured to execute the computer program stored in the memory to implement any one of the counterfeit application detection methods mentioned above, or to implement the processing performed by any one of the counterfeit application detection apparatuses mentioned above.

Furthermore, an embodiment of the present invention further provides a computer storage medium, in which a computer program is stored, where the computer program, when executed by a processor, implements any one of the counterfeit application detection methods mentioned above, or implements processing performed by any one of the counterfeit application detection apparatuses mentioned above.

Compared with existing methods that detect counterfeit applications based on a single character-string dimension such as the program name or the package name, the storage medium and the computer device, by implementing the counterfeit application detection method, start from the essence of counterfeiting, achieve high detection accuracy, and prevent missed detections.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts in the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and device embodiments, since they are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to some descriptions of the method embodiments for relevant points.

Those skilled in the art will clearly understand that the present invention may be implemented entirely in software, or by a combination of software and a hardware platform. With this understanding in mind, all or part of the technical solutions of the present invention that contribute over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which can be a personal computer, a server, a smart phone, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present invention.

As used herein, the term "software" or the like refers to any type of computer code or set of computer-executable instructions in a general sense that is executed to program a computer or other processor to perform various aspects of the present inventive concepts as discussed above. Furthermore, it should be noted that according to one aspect of the embodiment, one or more computer programs implementing the method of the present invention when executed do not need to be on one computer or processor, but may be distributed in modules in multiple computers or processors to execute various aspects of the present invention.

Computer-executable instructions may take many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. In particular, the operations performed by the program modules may be combined or divided as desired in various embodiments.

Also, the technical solution of the present invention may be embodied as a method, and at least one example of the method has been provided. The actions may be performed in any suitable order and may be presented as part of the method. Thus, embodiments may be configured such that acts may be performed in an order different than illustrated, which may include performing some acts simultaneously (although in the illustrated embodiments, the acts are sequential).

In various embodiments of the invention, the described features, architectures or functions may be combined in any combination in one or more embodiments, where well-known processes of operation, program modules, elements and their interconnection, linking, communication or operation with each other are not shown or described in detail. It will be understood by those skilled in the art that the various embodiments described below are illustrative only and are not intended to limit the scope of the present invention. Those skilled in the art will also readily appreciate that the program modules, elements, or steps of the various embodiments described herein and illustrated in the accompanying drawings may be combined and arranged in a wide variety of different configurations.

Technical terms not specifically described in the present specification should be construed in the broadest sense in the art unless otherwise specifically indicated. The definitions given and used herein should be understood with reference to dictionaries, definitions in documents incorporated by reference, and/or their ordinary meanings. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

As used in the claims and in the specification above, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It is to be understood that, although the terms first, second, third, etc. may be used herein to describe various information and/or modules, the information and/or modules should not be limited by these terms. These terms are only used to distinguish one type of information and/or module from another. For example, a first information and/or module may also be referred to as a second information and/or module, and similarly, a second information and/or module may also be referred to as a first information and/or module without departing from the scope hereof. Additionally, the word "if" as used herein may, depending on context, be interpreted as "when", "upon", or "in response to determining".

In the claims, as well as in the specification above, all transitional phrases such as "comprising," "having," "containing," "carrying," "involving," "consisting essentially of …," and any other variations thereof, are to be understood to be open-ended, i.e., to mean including but not limited to, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The terms and expressions used in the specification of the present invention have been set forth for illustrative purposes only and are not meant to be limiting. It will be appreciated by those skilled in the art that changes could be made to the details of the above-described embodiments without departing from the underlying principles thereof. The scope of the invention is, therefore, indicated by the appended claims, in which all terms are intended to be interpreted in their broadest reasonable sense unless otherwise indicated.

Claims

1. A counterfeit application detection method, comprising:

when the application to be detected and a certain known genuine application meet a preset condition, determining that the application to be detected is a counterfeit application of the certain known genuine application;

wherein the preset condition includes: the Hamming distance between Sx and Si is less than or equal to a first threshold, wherein Sx is the simhash value of the application to be detected, Si is the simhash value of the certain known genuine application, and the simhash value of an application is the simhash value calculated from the class name and method name set of that application according to a given simhash algorithm;

wherein determining that the application to be detected is a counterfeit application of the certain known genuine application comprises:

training a plurality of preset simhash algorithms based on an adaboost classification system, classifying the application to be detected step by step according to the plurality of simhash algorithms, outputting a classification value of 1 when the application to be detected and the certain known genuine application meet the preset condition, and outputting a classification value of -1 when the application to be detected and the certain known genuine application do not meet the preset condition;

calculating the multi-weight classification result of the application to be detected and the certain known genuine application according to the plurality of classification values;

if the result output is positive, determining that the application to be detected is a counterfeit application of the certain known genuine application;

and if the result output is negative, determining that the application to be detected is a non-counterfeit application of the certain known genuine application.

2. The counterfeit application detection method of claim 1, wherein training the predetermined plurality of simhash algorithms based on the adaboost classification system comprises:

determining a set of training samples, the training samples being a plurality of known counterfeit applications and a plurality of known non-counterfeit applications corresponding to the plurality of known genuine applications;

initializing the weight distribution of each training sample in the training sample set;

training on the training samples using the plurality of preset simhash algorithms based on the adaboost classification system, wherein, for the current simhash algorithm, it is determined whether each training sample and each known genuine application meet the preset condition, so as to determine whether each training sample is a counterfeit application of each known genuine application and thereby perform counterfeit classification on each training sample, the training samples being tentatively treated as applications to be detected;

determining a current weak classifier according to the counterfeit classification result of each training sample, wherein the current weak classifier is used for outputting the counterfeit classification result of the current simhash algorithm in a binary mode;

judging whether each training sample is classified wrongly by the current weak classifier according to a sample label corresponding to each training sample;

determining the weight of the current weak classifier according to the classification judgment result of the current weak classifier and the current weight of each training sample;

updating the weight of each training sample by using the current weak classifier and the weight thereof, the current weight of each training sample and a sample label until all preset simhash algorithm training is finished;

wherein, when the application to be detected and a certain known genuine application meet a preset condition, determining that the application to be detected is a counterfeit application of the certain known genuine application specifically comprises:

and carrying out counterfeit classification on the applications to be detected by utilizing all the weak classifiers, and judging whether the applications to be detected are counterfeit applications of a certain known genuine application or not according to the weight of each weak classifier and a counterfeit classification result.

3. The counterfeit application detection method of claim 1 or 2, further comprising:

acquiring a large number of reference applications;

constructing a third-party library from the method names whose occurrence frequencies rank highest and the class names to which those method names belong;

and, for each of the application to be detected and the certain known genuine application, deleting every method name, together with the class name to which it belongs, that is identical to a method name and its class name in the third-party library, and constructing the class name and method name set from the method names remaining after deletion and the class names to which they belong.

4. The counterfeit application detection method of claim 3, wherein the preset condition further comprises:

the simhash value of the application to be detected matches, as a character string, the simhash value of the certain known genuine application at any one corresponding position.

5. The counterfeit application detection method of claim 4, wherein the simhash value of the application to be detected and the simhash value of the certain known genuine application are divided equally into four segments, and the corresponding position is a position of the corresponding segment of the application to be detected and the certain known genuine application.

6. The counterfeit application detection method of claim 4, wherein the preset condition further comprises:

card(A_x ∩ B_i) / card(A_x ∪ B_i) is greater than a second threshold, wherein A_x is the class name and method name set of the application to be detected, and B_i is the class name and method name set of the certain known genuine application.

7. A counterfeit application detection apparatus comprising:

the counterfeit judgment module is used for judging that the application to be detected is counterfeit application of a certain known legal application when the application to be detected and the certain known legal application meet preset conditions;

8. The counterfeit application detection apparatus of claim 7, wherein the training of the predetermined plurality of simhash algorithms based on the adaboost classification system comprises:

a training sample set determining module, configured to determine a training sample set, where the training samples are a plurality of known counterfeit applications and a plurality of known non-counterfeit applications that correspond to the plurality of known genuine applications;

a training sample weight initialization module, configured to initialize weight distribution of each training sample in the training sample set;

a training sample counterfeit classification module, configured to train on the training samples using the plurality of preset simhash algorithms based on the adaboost classification system, wherein, for the current simhash algorithm, it is determined whether each training sample and each known genuine application meet the preset condition, so as to determine whether each training sample is a counterfeit application of each known genuine application and thereby perform counterfeit classification on each training sample, the training samples being tentatively treated as applications to be detected;

the weak classifier determining module is used for determining a current weak classifier according to the counterfeit classification result of each training sample, and the current weak classifier is used for outputting the counterfeit classification result of the current simhash algorithm in a binary mode;

the weak classifier classification judging module is used for judging whether each training sample is classified wrongly by the current weak classifier according to the sample label corresponding to each training sample;

the weak classifier weight determining module is used for determining the weight of the current weak classifier according to the classification judgment result of the current weak classifier and the current weight of each training sample;

a training sample weight updating module, configured to update the weight of each training sample by using the current weak classifier and the weight thereof, the current weight of each training sample, and a sample label until all predetermined simhash algorithm training is finished;

the counterfeit judgment module is specifically used for carrying out counterfeit classification on the applications to be detected by using all the weak classifiers, and judging whether the applications to be detected are counterfeit applications of a certain known genuine application or not according to the weight of each weak classifier and a counterfeit classification result.

9. The counterfeit application detection apparatus of claim 7 or 8, further comprising:

a mass reference application acquisition module, configured to acquire a large number of reference applications;

an appearance frequency determining module for determining an appearance frequency of each method name in each class name for each reference application;

a third-party library construction module, configured to construct a third-party library from the method names whose occurrence frequencies rank highest and the class names to which those method names belong;

a class name and method name acquisition module, configured to acquire all class names, and all method names under each class name, of each of the application to be detected and the certain known genuine application;

and a class name and method name set construction module, configured to delete, for each of the application to be detected and the certain known genuine application, every method name, together with the class name to which it belongs, that is identical to a method name and its class name in the third-party library, and to construct the class name and method name set from the method names remaining after deletion and the class names to which they belong.

10. The counterfeit application detection apparatus of claim 9, wherein the preset condition further comprises:

the simhash value of the application to be detected matches, as a character string, the simhash value of the certain known genuine application at any one corresponding position.

11. The counterfeit application detection apparatus of claim 10, wherein the simhash value of the application to be detected and the simhash value of the certain known genuine application are both divided into four segments, and the corresponding position is a position of the corresponding segment of the application to be detected and the certain known genuine application.

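12. The counterfeit application detection apparatus of claim 10, wherein the preset condition further comprises:

card(A_x ∩ B_i) / card(A_x ∪ B_i) is greater than a second threshold, wherein A_x is the class name and method name set of the application to be detected, and B_i is the class name and method name set of the certain known genuine application.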

13. A computer device, comprising:

a processor; and

a memory for storing a computer program,

wherein the processor is configured to execute a computer program stored in the memory to implement the counterfeit application detection method of any of claims 1 to 6.

14. A computer storage medium, having stored therein a computer program which, when executed by a processor, implements the counterfeit application detection method of any of claims 1 to 6.