CN111259385B

CN111259385B - Application program identification method and device and neural network system

Info

Publication number: CN111259385B
Application number: CN201811459216.9A
Authority: CN
Inventors: 史东杰; 周楠
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2023-10-31
Anticipated expiration: 2038-11-30
Also published as: CN111259385A

Abstract

The invention discloses an application program identification method, an application program identification device, and a neural network system. The method comprises: acquiring static information and identification information of an application program installation package, and inputting the static information into a pre-trained target neural network system, wherein the target neural network system comprises a first sub-network and a second sub-network; the first sub-network generates N feature sequences based on the static information and performs preset first feature parameter extraction processing on each feature sequence to obtain N first feature vectors, wherein N is an integer greater than or equal to 2; the second sub-network splices the N first feature vectors, performs preset second feature parameter extraction processing on the second feature vector obtained after splicing to obtain a third feature vector, and obtains a virus identification result of the application program installation package based on the third feature vector and the identification information, thereby effectively improving the accuracy of virus detection for the application program installation package.

Description

Application program identification method and device and neural network system

Technical Field

The present invention relates to the field of network security technologies, and in particular, to an application program identification method, an application program identification device, and a neural network system.

Background

With the continuous development of internet technology, software is increasingly used in people's work and life. However, with the widespread use of software, many potential security issues are increasingly exposed. In recent years, different kinds of malware are multiplying according to research reports from different antivirus software vendors. Such malware may destroy or perform undesirable actions on the computer system, such as disrupting computer operations, collecting sensitive information, bypassing access control, illegally accessing private computers, displaying various advertising information, and so forth. Therefore, detection of malware is critical.

The existing malware detection method builds a malicious code library by marking known malicious code and then obtains a detection result by matching against that library. However, this method requires the malicious code library to be updated manually and continuously, is easy to bypass, and has low accuracy.

Disclosure of Invention

The present invention has been made in view of the above problems, and has as its object to provide an application program identification method, apparatus and neural network system that overcome or at least partially solve the above problems.

In a first aspect, an embodiment of the present invention provides an application program identification method, where the method includes: acquiring static information and identification information of an application program installation package, and inputting the static information and the identification information into a target neural network system trained in advance, wherein the static information is obtained by analyzing a code file of the application program installation package, and the target neural network system comprises a first sub-network and a second sub-network; the first sub-network generates N characteristic sequences based on the static information, and respectively performs preset first characteristic parameter extraction processing on each characteristic sequence to obtain N first characteristic vectors, wherein N is an integer greater than or equal to 2; and the second sub-network splices the N first feature vectors, performs preset second feature parameter extraction processing on the second feature vectors obtained after splicing to obtain third feature vectors, and obtains a virus identification result of the application program installation package based on the third feature vectors and the identification information.

Further, the static information is a binary stream of the application program installation package, and the generating N feature sequences based on the static information includes: dividing the binary stream into N binary sequences; and encoding each binary sequence in the N binary sequences to obtain the N characteristic sequences.

Further, the encoding each binary sequence of the N binary sequences to obtain the N feature sequences includes: coding each binary sequence in the N binary sequences to obtain N coding sequences; and carrying out dimension reduction processing on each coding sequence in the N coding sequences based on a preset algorithm to obtain the N characteristic sequences, wherein the dimension of the characteristic sequences is lower than that of the corresponding coding sequences.

Further, the first sub-network includes an input layer, a first convolution layer, and a first pooling layer that are sequentially connected. The first sub-network generating N feature sequences based on the static information, and performing preset first feature parameter extraction processing on each feature sequence to obtain N first feature vectors, includes: the input layer generates the N feature sequences based on the static information; the first convolution layer performs one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence, and activates the first feature information through a preset first activation function to obtain activated first feature information; and the first pooling layer pools the activated first feature information corresponding to each of the N feature sequences to obtain the N first feature vectors.

Further, the first convolution layer performing one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence includes: the first convolution layer performs the following steps for each of the N feature sequences: performing one-dimensional convolution processing on the feature sequence to obtain a first processing result; activating the first processing result through a preset second activation function to obtain a second processing result; and taking the product of the first processing result and the second processing result as the first feature information of the feature sequence.
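As a non-limiting illustration, the three steps above (convolution, activation of the convolution result, and element-wise product) may be sketched as follows. The sigmoid used as the second activation function, the function names, and the kernel values are assumptions made only for this sketch; the present disclosure does not fix any of them.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, assumed here as the second activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def conv1d(sequence: Sequence[float], kernel: Sequence[float], stride: int = 1) -> List[float]:
    """One-dimensional convolution without padding (cross-correlation form)."""
    width = len(kernel)
    return [
        sum(sequence[start + k] * kernel[k] for k in range(width))
        for start in range(0, len(sequence) - width + 1, stride)
    ]


def gated_conv1d(sequence: Sequence[float], kernel: Sequence[float], stride: int = 1) -> List[float]:
    """Return the first feature information of one feature sequence."""
    first = conv1d(sequence, kernel, stride)        # first processing result
    second = [sigmoid(v) for v in first]            # second processing result
    return [f * s for f, s in zip(first, second)]   # product of the two results


features = gated_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, -1.0])
```

In this sketch the second processing result acts as a gate with values between 0 and 1 that scales each element of the convolution output.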

Further, the first pooling layer pooling the activated first feature information corresponding to each of the N feature sequences to obtain the N first feature vectors includes: the first pooling layer pools the activated first feature information corresponding to each of the N feature sequences by max pooling to obtain the N first feature vectors.
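A minimal sketch of the max pooling step is given below for illustration. It assumes that the activated first feature information of each feature sequence is organised as one list of values per convolution filter, and that pooling is taken over the whole length of each list; the pooling window is not specified in the present disclosure, and the function names are hypothetical.

```python
from typing import List, Sequence


def global_max_pool(channels: Sequence[Sequence[float]]) -> List[float]:
    """Reduce each channel (one per convolution filter) to its maximum value."""
    return [max(channel) for channel in channels]


def first_feature_vectors(activated: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Pool the activated feature information of N sequences into N first feature vectors."""
    return [global_max_pool(channels) for channels in activated]


# Two feature sequences (N = 2), each with two filter outputs of length 3.
activated_info = [
    [[0.1, 0.9, 0.3], [0.5, 0.2, 0.4]],
    [[-1.0, -2.0, -3.0], [0.0, 0.0, 7.0]],
]
vectors = first_feature_vectors(activated_info)
```

Under these assumptions each first feature vector has one entry per filter, regardless of the length of the feature sequence it was computed from.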

Further, the second sub-network includes a second convolution layer, a second pooling layer, and an output layer that are sequentially connected. The second sub-network splicing the N first feature vectors, performing preset second feature parameter extraction processing on the second feature vector obtained after splicing to obtain a third feature vector, and obtaining the virus identification result of the application program installation package based on the third feature vector and the identification information, includes: the second convolution layer splices the N first feature vectors to obtain the second feature vector, performs one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector, and activates the second feature information through a preset third activation function to obtain activated second feature information; the second pooling layer pools the activated second feature information to obtain the third feature vector; and the output layer obtains the virus identification result of the application program installation package based on the third feature vector and the identification information.

Further, the one-dimensional convolution processing is performed on the second feature vector to obtain second feature information of the second feature vector, including: carrying out one-dimensional convolution processing on the second feature vector to obtain a third processing result; activating the third processing result through a preset fourth activation function to obtain a fourth processing result; and taking the product of the third processing result and the fourth processing result as second characteristic information of the second characteristic vector.

Further, the second pooling layer pooling the activated second feature information to obtain the third feature vector includes: the second pooling layer pools the activated second feature information by average pooling to obtain the third feature vector.
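For illustration only, the processing of the second sub-network up to the third feature vector may be sketched as follows: the N first feature vectors are spliced end to end, convolved, gated, activated, and average pooled. The sigmoid gate, the ReLU used as the third activation function, the pooling over the whole length, and all names are assumptions of this sketch rather than features fixed by the present disclosure.

```python
import math
from typing import List, Sequence


def conv1d(vector: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """One-dimensional convolution without padding (cross-correlation form)."""
    width = len(kernel)
    return [
        sum(vector[i + k] * kernel[k] for k in range(width))
        for i in range(len(vector) - width + 1)
    ]


def third_feature_vector(
    first_vectors: Sequence[Sequence[float]],
    kernels: Sequence[Sequence[float]],
) -> List[float]:
    # Splice the N first feature vectors into the second feature vector.
    second_vector = [value for vector in first_vectors for value in vector]
    pooled = []
    for kernel in kernels:
        third = conv1d(second_vector, kernel)                  # third processing result
        fourth = [1.0 / (1.0 + math.exp(-v)) for v in third]   # fourth processing result (sigmoid assumed)
        gated = [a * b for a, b in zip(third, fourth)]         # second feature information
        activated = [max(0.0, v) for v in gated]               # third activation (ReLU assumed)
        pooled.append(sum(activated) / len(activated))         # average pooling over the whole length
    return pooled


result = third_feature_vector([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [-1.0, -1.0]])
```

With these choices the third feature vector has one entry per filter of the second convolution layer.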

Further, the obtaining the virus identification result of the application program installation package based on the third feature vector and the identification information includes: converting the identification information into a fourth feature vector; splicing the third feature vector and the fourth feature vector to obtain a fifth feature vector; and obtaining a virus identification result of the application program installation package based on the fifth feature vector.
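The conversion and splicing described above may be sketched as follows. The present disclosure does not specify how the identification information is converted into the fourth feature vector or how the result is computed from the fifth feature vector; the hashed character trigram counts and the single logistic output unit below are assumptions chosen only to make the sketch concrete, and all names and values are hypothetical.

```python
import math
import zlib
from typing import List, Sequence


def embed_identifier(text: str, dim: int = 8) -> List[float]:
    """Hash character trigrams of a name into a fixed-length count vector (assumed encoding)."""
    vector = [0.0] * dim
    for i in range(max(len(text) - 2, 1)):
        bucket = zlib.crc32(text[i:i + 3].encode("utf-8")) % dim
        vector[bucket] += 1.0
    return vector


def identify(
    third_vector: Sequence[float],
    identifier: str,
    weights: Sequence[float],
    bias: float,
) -> float:
    """Return an assumed probability that the installation package carries a virus."""
    fourth_vector = embed_identifier(identifier, dim=len(weights) - len(third_vector))
    fifth_vector = list(third_vector) + fourth_vector   # splice third and fourth vectors
    score = sum(w * x for w, x in zip(weights, fifth_vector)) + bias
    return 1.0 / (1.0 + math.exp(-score))


probability = identify([0.2, 0.4], "com.example.app", [0.0] * 10, 0.0)
```

In a trained system the weights and bias would be learned from labelled samples; the zero weights above are placeholders.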

Further, the identification information includes a name and/or package name of the application program.

In a second aspect, an embodiment of the present invention provides an application program identification method, where the method includes: acquiring a training sample, wherein the training sample comprises static information and identification information of a plurality of application program installation packages and virus labels of each application program installation package, and the static information is obtained by analyzing code files of the application program installation packages; and training the initial neural network system which is built in advance through the training sample to obtain the target neural network system. The initial neural network system includes a first subnetwork and a second subnetwork. The first subnetwork is used for generating N feature sequences based on static information of an application program installation package, and respectively carrying out preset first feature parameter extraction processing on each feature sequence to obtain N first feature vectors, wherein N is an integer greater than or equal to 2. The second subnetwork is used for splicing the N first feature vectors, extracting preset second feature parameters from the second feature vectors obtained after splicing to obtain a third feature vector, and obtaining a virus identification result of the application program installation package based on the third feature vector and the identification information of the application program installation package.

In a third aspect, an embodiment of the present invention provides a neural network system, including: the first sub-network is used for generating N characteristic sequences based on the acquired static information of the application program installation package, and respectively carrying out preset first characteristic parameter extraction processing on each characteristic sequence to obtain N first characteristic vectors, wherein the static information is obtained by analyzing a code file of the application program installation package, and N is an integer greater than or equal to 2; and the second sub-network is used for splicing the N first feature vectors, carrying out preset second feature parameter extraction processing on the second feature vectors obtained after splicing to obtain a third feature vector, and obtaining a virus identification result of the application program installation package based on the third feature vector and the obtained identification information of the application program installation package.

Further, the static information is a binary stream of the application installation package, and the first subnetwork is specifically configured to: dividing the binary stream into N binary sequences; and encoding each binary sequence in the N binary sequences to obtain the N characteristic sequences.

Further, the first subnetwork is specifically configured to: coding each binary sequence in the N binary sequences to obtain N coding sequences; and carrying out dimension reduction processing on each coding sequence in the N coding sequences based on a preset algorithm to obtain the N characteristic sequences, wherein the dimension of the characteristic sequences is lower than that of the corresponding coding sequences.

Further, the first sub-network includes an input layer, a first convolution layer, and a first pooling layer that are sequentially connected. The input layer is used for generating the N feature sequences based on the static information. The first convolution layer is configured to perform one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence, and to activate the first feature information through a preset first activation function to obtain activated first feature information. The first pooling layer is configured to pool the activated first feature information corresponding to each of the N feature sequences to obtain the N first feature vectors.

Further, the first convolution layer is specifically configured to: for each of the N feature sequences, performing the steps of: carrying out one-dimensional convolution processing on the characteristic sequence to obtain a first processing result; activating the first processing result through a preset second activation function to obtain a second processing result; and taking the product of the first processing result and the second processing result as first characteristic information of the characteristic sequence.

Further, the first pooling layer is specifically configured to pool, by using a maximum pooling manner, the activated first feature information corresponding to each feature sequence in the N feature sequences, so as to obtain the N first feature vectors.

Further, the second sub-network comprises a second convolution layer, a second pooling layer and an output layer, and the second convolution layer, the second pooling layer and the output layer are sequentially connected. The second convolution layer is configured to splice the N first feature vectors to obtain a second feature vector, perform one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector, and activate the second feature information through a preset third activation function to obtain activated second feature information. And the second pooling layer is used for pooling the activated second characteristic information to obtain a third characteristic vector. And the output layer is used for obtaining the virus identification result of the application program installation package based on the third feature vector and the identification information.

Further, the second convolution layer is specifically configured to: carrying out one-dimensional convolution processing on the second feature vector to obtain a third processing result; activating the third processing result through a preset fourth activation function to obtain a fourth processing result; and taking the product of the third processing result and the fourth processing result as second characteristic information of the second characteristic vector.

Further, the second pooling layer is specifically configured to pool the activated second feature information by average pooling to obtain the third feature vector.

Further, the second subnetwork is specifically configured to: converting the identification information into a fourth feature vector; splicing the third feature vector and the fourth feature vector to obtain a fifth feature vector; and obtaining a virus identification result of the application program installation package based on the fifth feature vector.

In a fourth aspect, an embodiment of the present invention provides an application program identification apparatus, including: an acquisition module, configured to acquire a training sample, wherein the training sample comprises static information and identification information of a plurality of application program installation packages and a virus label of each application program installation package, and the static information is obtained by analyzing a code file of the application program installation package; and a training module, configured to train a pre-constructed initial neural network system with the training sample to obtain a target neural network system. The initial neural network system comprises a first sub-network and a second sub-network. The first sub-network is used for generating N feature sequences based on static information of an application program installation package, and performing preset first feature parameter extraction processing on each feature sequence to obtain N first feature vectors, wherein N is an integer greater than or equal to 2. The second sub-network is used for splicing the N first feature vectors, performing preset second feature parameter extraction processing on the second feature vector obtained after splicing to obtain a third feature vector, and obtaining a virus identification result of the application program installation package based on the third feature vector and the identification information of the application program installation package.

In a fifth aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory is coupled to the processor, and the memory stores instructions that, when executed by the processor, cause the electronic device to perform the steps of the application identification method described above.

In a sixth aspect, an embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the application identification method.

According to the technical scheme provided by the embodiment of the invention, static information of an application program installation package is input into a pre-trained target neural network system, N feature sequences are generated by a first sub-network in the target neural network system based on the static information, and preset first feature parameter extraction processing is respectively carried out on each feature sequence to obtain N first feature vectors, wherein N is an integer greater than or equal to 2, then a second sub-network in the target neural network system is used for splicing the N first feature vectors, preset second feature parameter extraction processing is carried out on the second feature vectors obtained after splicing to obtain a third feature vector, and a virus identification result of the application program installation package is obtained based on the third feature vector and the identification information of the application program installation package. According to the scheme, through the pre-trained target neural network system comprising the first sub-network and the second sub-network, the static information and the identification information of the application program installation package are deeply learned, the virus identification result of the application program installation package is obtained, and the accuracy of virus detection of the application program installation package is effectively improved.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set forth below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

FIG. 1 illustrates a schematic diagram of an operating environment suitable for use with embodiments of the present invention;

FIG. 2 is a flow chart of a method for identifying an application program according to a first embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating an analysis process of a target neural network system according to a first embodiment of the present invention;

FIG. 4 is a flowchart of an embodiment of a training method of a neural network system according to the present invention;

FIG. 5 is a schematic diagram illustrating the architecture of one embodiment of a neural network system provided by the present invention;

FIG. 6 is a schematic diagram illustrating an embodiment of an application identification apparatus according to the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

Fig. 1 is a schematic diagram of an operating environment suitable for use in an embodiment of the present invention. As shown in fig. 1, one or more user terminals 100 (only one shown in fig. 1) may be connected to one or more servers 300 (only one shown in fig. 1) through a network 200 for data communication or interaction. The user terminal 100 may be a personal computer (Personal Computer, PC), a notebook computer, a tablet computer, a smart phone, an electronic reader, a vehicle-mounted device, a network television, a wearable device, or other intelligent devices with network functions.

The application program identification method provided by the embodiment of the invention can be applied to the user terminal or the server.

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Fig. 2 is a flowchart of a method for identifying an application according to a first embodiment of the present invention. In this embodiment, the application program identification method may be applied to a user terminal. Of course, in other embodiments of the present invention, the application identification method may also be applied to a server. As shown in fig. 2, the method at least includes the following steps S201 to S203.

Step S201, static information and identification information of an application program installation package are obtained, and the static information and the identification information are input into a target neural network system trained in advance.

In this embodiment, the application program installation package is one that needs to be checked for whether it carries a virus; in other embodiments of the present invention, it may also be one that needs to be checked for whether it carries a virus and for the type of virus carried. Specifically, the application program installation package may be a software installation package for a mobile terminal, such as an Android installation package with the suffix apk, or a software installation package for a computer, such as an installation package with the suffix exe.

In this embodiment, the static information is information obtained by parsing a code file of an application installation package. As one implementation, the static information may be a binary file of an application installation package.

In other embodiments of the present invention, the static information may be an operation code sequence obtained from a code file of the application program installation package. An operation code may be part of the code in the code file, or may be code with functional logic; after a plurality of operation codes are obtained, they are ordered to obtain the operation code sequence. In this case, the process of acquiring the static information may be: acquiring the application program installation package, disassembling it to obtain the disassembled smali files, and extracting the operation codes (opcodes) to obtain the operation code sequence. For example, assuming that the application program installation package to be tested is an apk file, the apk file contains a code file in dex format, the dex file contains all the code of the application program corresponding to the apk file, and the corresponding Java code can be obtained through a disassembly tool. After disassembly, files in smali format are obtained; each smali file represents a class in the dex file, each class consists of functions, each function consists of instructions, and each instruction consists of an operation code and a plurality of operands.
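As a simplified illustration of the opcode extraction described above, the following sketch reads the text of one smali file and collects the first token of each instruction line inside a method body. It assumes the usual smali text conventions (directives begin with ".", labels with ":", comments with "#"), and it does not handle payload blocks such as array data, switch tables, or annotations that may appear inside methods. The function name and the sample text are hypothetical.

```python
from typing import List


def extract_opcodes(smali_text: str) -> List[str]:
    """Collect the opcode (first token) of every instruction inside method bodies."""
    opcodes: List[str] = []
    in_method = False
    for raw_line in smali_text.splitlines():
        line = raw_line.strip()
        if line.startswith(".method"):
            in_method = True
        elif line.startswith(".end method"):
            in_method = False
        elif in_method and line and line[0] not in ".:#":
            # An instruction consists of an opcode followed by its operands.
            opcodes.append(line.split()[0])
    return opcodes


sample = """.class public Lcom/example/Foo;
.super Ljava/lang/Object;

.method public foo()V
    .locals 1
    const/4 v0, 0x1
    :label_0
    invoke-virtual {p0}, Lcom/example/Foo;->bar()V
    # a comment
    return-void
.end method
"""
opcode_sequence = extract_opcodes(sample)
```

The opcodes are returned in the order in which they appear, which gives an operation code sequence for one class; sequences from several smali files would still need to be ordered and joined.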

In this embodiment, the identification information of the application installation package may be a unique identifier of the application installation package. For example, the identification information of the application installation package may include the name of the application and/or the package name.

In this embodiment, the target neural network system includes a first subnetwork and a second subnetwork. The first sub-network is used for dividing static information of the application program installation package into a plurality of feature sequences and extracting local key information corresponding to each feature sequence. The second sub-network is used for further extracting more comprehensive characteristic parameters based on the local key information corresponding to each characteristic sequence extracted by the first sub-network, and obtaining a virus identification result of the application program installation package. As an embodiment, the first subnetwork and the second subnetwork may each employ a convolutional neural network. Of course, in other embodiments of the present invention, the first subnetwork and the second subnetwork may also employ other types of neural networks, such as deep neural networks, according to actual needs.

Step S202, the first sub-network generates N feature sequences based on the static information, and respectively performs preset first feature parameter extraction processing on each feature sequence to obtain N first feature vectors.

In this embodiment, N is an integer greater than or equal to 2. When the static information is a binary file of the application program installation package, specific implementation processes of generating N feature sequences based on the static information may be various, and four embodiments are mainly described below.

First, dividing a binary file into N binary sequences; and encoding each binary sequence in the N binary sequences to obtain the N characteristic sequences.

Second, as shown in fig. 3, the binary file is divided into N binary sequences; encoding each binary sequence in the N binary sequences to obtain N first encoding sequences; and carrying out dimension reduction processing on each first coding sequence in the N first coding sequences based on a preset algorithm to obtain N characteristic sequences, wherein the dimension of each characteristic sequence is lower than that of the corresponding first coding sequence.

Specifically, in the first and second modes described above, the manner of dividing the binary file into N binary sequences may be set according to actual needs. As one embodiment, the file may be divided every preset number of bytes, and the preset number of bytes may be set according to actual needs. For example, let a_i denote the i-th byte; when the preset number of bytes is 50000, a_1 to a_50000 are divided into one binary sequence, a_50001 to a_100000 are divided into another binary sequence, and so on. As another embodiment, the file may be divided according to a first preset step size and a first preset length, both of which may be set as required. For example, let a_i denote the i-th byte; when the first preset step size is 10000 bytes and the first preset length is 50000 bytes, a_1 to a_50000 are divided into one binary sequence, a_10001 to a_60000 are divided into another binary sequence, and so on.
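The two division manners above may be sketched as follows. The function names are hypothetical, and the handling of the tail of the file (a shorter final sequence in the first manner, no partial window in the second) is an assumption, since the present disclosure does not address it.

```python
from typing import List


def split_fixed(data: bytes, chunk_size: int) -> List[bytes]:
    """Divide the byte stream every chunk_size bytes; the last piece may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def split_sliding(data: bytes, step: int, length: int) -> List[bytes]:
    """Divide the byte stream into windows of the given length, advancing by step bytes."""
    last_start = max(len(data) - length, 0)
    return [data[i:i + length] for i in range(0, last_start + 1, step)]


data = bytes(range(10))
fixed = split_fixed(data, 4)          # bytes 0-3, 4-7, 8-9
sliding = split_sliding(data, 2, 6)   # bytes 0-5, 2-7, 4-9
```

With a preset number of bytes of 50000, `split_fixed` yields a_1 to a_50000, a_50001 to a_100000, and so on; with a step of 10000 and a length of 50000, `split_sliding` yields a_1 to a_50000, a_10001 to a_60000, and so on.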

Thirdly, the binary file is encoded to obtain a second coding sequence, and the second coding sequence is divided into N characteristic sequences.

Fourth, the binary file is coded, and a second coding sequence is obtained; performing dimension reduction processing on the second coding sequence based on a preset algorithm to obtain a target sequence; the target sequence is divided into N feature sequences. The number of codes of the second code sequence is the dimension of the second code sequence, and the dimension of the target sequence is lower than the dimension of the second code sequence. The specific dimension reduction multiple can be set according to actual needs, for example, the dimension reduction can be 100 times or 50 times, and the second coding sequence can be reduced from millions of dimensions to tens of thousands of dimensions.

Similarly, in the third and fourth modes, the manner of dividing the second coding sequence (or, in the fourth mode, the target sequence) into N feature sequences may be set according to actual needs. As one embodiment, the sequence may be divided every preset number of codes, and the preset number of codes may be set according to actual needs. As another embodiment, the sequence may be divided according to a second preset step size and a second preset length, both of which may be set as needed.

For example, suppose the second coding sequence has 5,000,000 dimensions; it is converted into a target sequence of 90,000 dimensions through dimension reduction processing, and the 90,000-dimensional target sequence is then divided. For example, when the sequence is divided every 1000 codes, the target sequence is divided into 90 feature sequences of 1000 dimensions each.
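The figures in this example can be reproduced with the following sketch of the fourth mode (reduce the coding sequence, then divide the target sequence). Sampling at the floor of the scaled index is used here only as a simple stand-in for the preset dimension reduction algorithm, a `range` object stands in for a real coding sequence, and the function names are hypothetical.

```python
from typing import List, Sequence


def reduce_nearest(codes: Sequence[int], target_len: int) -> List[int]:
    """Downsample by taking the code at the floor of each scaled position."""
    scale = len(codes) / target_len
    return [codes[min(int(i * scale), len(codes) - 1)] for i in range(target_len)]


def divide(sequence: Sequence[int], width: int) -> List[List[int]]:
    """Divide a sequence every width codes."""
    return [list(sequence[i:i + width]) for i in range(0, len(sequence), width)]


second_coding_sequence = range(5_000_000)                          # 5,000,000 dimensions
target_sequence = reduce_nearest(second_coding_sequence, 90_000)   # 90,000 dimensions
feature_sequences = divide(target_sequence, 1000)                  # 90 sequences of 1000 dimensions
```

The reduction factor here is about 55.6, which lies within the range of reduction multiples mentioned for the fourth mode.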

In the above modes, various encoding methods may be used and may be set as needed. For example, the binary number of each byte may be converted to a decimal number, so that each byte is converted to a number in the range of 0 to 255. For example, if part of the binary file of the application program installation package is written in hexadecimal as "\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff", the corresponding converted code is "144,0,3,0,0,0,4,0,0,0,255,255".
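This byte-to-decimal encoding can be reproduced with the short sketch below, which uses the same twelve bytes as the example; the function name is hypothetical.

```python
from typing import List


def encode_bytes(data: bytes) -> List[int]:
    """Convert each byte to its decimal value in the range 0 to 255."""
    return list(data)


sample = b"\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff"
print(",".join(str(code) for code in encode_bytes(sample)))
# prints 144,0,3,0,0,0,4,0,0,0,255,255
```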

Specifically, in the dimension reduction process, the preset algorithm may be a bicubic interpolation algorithm, a nearest neighbor interpolation algorithm, a bilinear interpolation algorithm, etc. Performing the subsequent processing after the dimension of the coding sequence has been reduced helps improve the processing speed and, correspondingly, reduces the training time and resource occupation of the neural network system.

Taking the second mode as an example, the principle that a specific dimension reduction process may follow is described below. Assume that, after the binary file of the application program installation package is encoded, the coding length of the obtained first coding sequence is M, and the coding length after dimension reduction is m. To obtain a code value after dimension reduction, the position in the first coding sequence corresponding to the dimension-reduced point can be obtained based on the formula x = X × (M/m), where x represents the corresponding position of the dimension-reduced point in the first coding sequence, and X represents the position of the dimension-reduced point in the corresponding feature sequence. It can be understood that the obtained x is generally a fractional value; the 4 points closest to x can be found from this fractional coordinate, and the corresponding weights are obtained using a preset basis function, thereby obtaining the code value after dimension reduction. For example, the preset basis function may be as follows:

def hermite(A, B, C, D, t):
    """Basis function.

    Parameters: A, B, C, D are the 4 points nearest to this point;
    t is the fractional part of this point's position.
    """
    a = A * (-0.5) + B * 1.5 + C * (-1.5) + D * 0.5
    b = A + B * (-2.5) + C * 2.0 + D * (-0.5)
    c = A * (-0.5) + C * 0.5
    d = B
    return a * t * t * t + b * t * t + c * t + d

The return result of the function is the encoded value of the point.
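As a minimal sketch of how the position mapping and the basis function above might be combined, the following code reduces a coding sequence of length M to length m. The function name `reduce_dimension`, the choice of the 4 neighbours as the codes at indices i-1, i, i+1 and i+2 around the integer part i of x, and the clamping of out-of-range indices at the sequence ends are all assumptions; the description does not specify boundary handling.

```python
import math


def hermite(A, B, C, D, t):
    """Basis function from the description: cubic through 4 neighbouring codes."""
    a = A * (-0.5) + B * 1.5 + C * (-1.5) + D * 0.5
    b = A + B * (-2.5) + C * 2.0 + D * (-0.5)
    c = A * (-0.5) + C * 0.5
    d = B
    return a * t * t * t + b * t * t + c * t + d


def reduce_dimension(codes, m):
    """Reduce `codes` (length M) to `m` values via x = X * (M / m)."""
    M = len(codes)

    def at(i):
        # Clamp indices at the sequence ends (assumed boundary handling).
        return codes[min(max(i, 0), M - 1)]

    reduced = []
    for X in range(m):
        x = X * (M / m)      # position in the original coding sequence
        i = math.floor(x)    # integer part: index of the nearest left code
        t = x - i            # fractional part passed to the basis function
        reduced.append(hermite(at(i - 1), at(i), at(i + 1), at(i + 2), t))
    return reduced


codes = [float(v) for v in range(100)]  # toy coding sequence, M = 100
reduced = reduce_dimension(codes, 10)   # m = 10, i.e. a 10-fold reduction
print(len(reduced))
```

In this toy run every x is an integer, so t is 0 and each reduced value equals the original code at that position.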

In other embodiments of the present invention, when the static information is an operation code sequence obtained according to a code file of an application installation package, N feature sequences may be obtained based on the operation code sequence, so that each feature sequence is further processed.

Optionally, the first subnetwork includes an input layer, a first convolution layer, and a first pooling layer, which are sequentially connected. It will be appreciated that the first convolution layer convolves each feature sequence with a preset number of convolution kernels according to preset settings to obtain convolution features, and inputs the convolution features into an activation function for activation. The first pooling layer is used for further dimension reduction and feature extraction on the activated convolution features.

At this time, the step in which the first sub-network generates N feature sequences based on the static information and performs preset first feature parameter extraction processing on each feature sequence to obtain N first feature vectors may include: the input layer generates the N feature sequences based on the static information; the first convolution layer performs one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence, and activates the first feature information through a preset first activation function to obtain activated first feature information; and the first pooling layer pools the activated first feature information corresponding to each of the N feature sequences to obtain the N first feature vectors.

Optionally, the first activation function may be a ReLU function. The ReLU activation function can better prevent the vanishing gradient problem. Of course, other activation functions may be employed as desired.

Optionally, the first convolution layer performs one-dimensional convolution processing on each feature sequence in the N feature sequences, and the obtaining the first feature information of the feature sequence may specifically include: the first convolution layer performs the following steps for each of the N feature sequences: carrying out one-dimensional convolution processing on the characteristic sequence to obtain a first processing result; activating the first processing result through a preset second activation function to obtain a second processing result; and taking the product of the first processing result and the second processing result as first characteristic information of the characteristic sequence, as shown in fig. 3. Wherein the second activation function may employ a Sigmoid function. Thus, a Gate structure can be formed, and the Gate structure can better control the transmission of local feature information and improve the representation capability of local features.
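The Gate structure described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the helper names `conv1d`, `sigmoid` and `gated_conv1d`, the single toy kernel, and the use of an unpadded sliding dot product (cross-correlation, as is common in neural network frameworks) are not taken from the description.

```python
import math


def conv1d(sequence, kernel, stride):
    """Unpadded one-dimensional convolution (sliding dot product)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(sequence) - k + 1, stride)
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_conv1d(sequence, kernel, stride):
    first = conv1d(sequence, kernel, stride)       # first processing result
    second = [sigmoid(v) for v in first]           # second processing result
    return [f * s for f, s in zip(first, second)]  # product = first feature information


features = gated_conv1d([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 0.0, -1.0], stride=1)
print(len(features))  # 4
```

Because the Sigmoid output lies between 0 and 1, it acts as a per-position weight on the convolution result, which is one way to read the statement that the Gate structure controls the transmission of local feature information.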

In the one-dimensional convolution processing, the number, size and step length of the one-dimensional convolution kernels may be set according to actual needs. This embodiment can reduce computation and storage pressures by employing a one-dimensional convolution kernel of larger size and step size.

Optionally, the step in which the first pooling layer pools the activated first feature information corresponding to each of the N feature sequences to obtain the N first feature vectors may specifically include: the first pooling layer pools, by max pooling (max-pooling), the activated first feature information corresponding to each of the N feature sequences to obtain the N first feature vectors. The dimension of each first feature vector is determined by the number of one-dimensional convolution kernels set in the first convolution layer. Max-pooling introduces invariance while performing dimension reduction and extracting local key information, which helps prevent overfitting.
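The following sketch illustrates one reading of the activation and pooling steps, in which max-pooling is taken over all positions of each kernel's output, so that the resulting vector has one element per convolution kernel. That global pooling window is an assumption consistent with the statement about the vector dimension, and the names `relu` and `max_pool` and the toy feature maps are illustrative only.

```python
def relu(values):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool(feature_maps):
    """Keep the maximum over positions for each kernel's feature map."""
    return [max(fm) for fm in feature_maps]


# Toy first feature information for one feature sequence: 2 kernels, 3 positions.
feature_maps = [[-0.2, 0.7, 0.1], [0.4, -1.0, 0.3]]
first_feature_vector = max_pool([relu(fm) for fm in feature_maps])
print(first_feature_vector)  # [0.7, 0.4]
```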

Step 203, the second sub-network splices the N first feature vectors, performs preset second feature parameter extraction processing on the second feature vector obtained after splicing to obtain a third feature vector, and obtains a virus identification result of the application program installation package based on the third feature vector and the identification information.

As an embodiment, the second subnetwork may include a second convolutional layer, a second pooling layer, and an output layer, which are sequentially connected.

At this time, the step in which the second sub-network splices the N first feature vectors, performs preset second feature parameter extraction processing on the second feature vector obtained after splicing to obtain a third feature vector, and obtains a virus identification result of the application program installation package based on the third feature vector and the identification information may specifically include: the second convolution layer splices the N first feature vectors to obtain the second feature vector, performs one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector, and activates the second feature information through a preset third activation function to obtain activated second feature information; the second pooling layer pools the activated second feature information to obtain the third feature vector; and the output layer obtains the virus identification result of the application program installation package based on the third feature vector and the identification information.

For example, P_j represents the first feature vector corresponding to the j-th feature sequence of the N feature sequences, where j is an integer between 1 and N. P_1 to P_N are spliced to obtain the second feature vector. It will be appreciated that, assuming P_1 to P_N are all vectors of dimension H, the second feature vector is a vector of dimension N×H; for example, if H=10 and N=90, the second feature vector is a 900-dimensional vector.
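The splicing step and the resulting dimension can be checked with a short sketch. The variable names and the placeholder vector contents are assumptions; only the values H=10 and N=90 come from the example above.

```python
N, H = 90, 10

# Placeholder first feature vectors P_1 ... P_N, each of dimension H.
first_feature_vectors = [[float(j)] * H for j in range(1, N + 1)]

# Splice (concatenate) them in order to form the second feature vector.
second_feature_vector = [v for P_j in first_feature_vectors for v in P_j]
print(len(second_feature_vector))  # 900
```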

In this embodiment, the third activation function may also adopt the ReLU function. Likewise, the ReLU activation function can better prevent the vanishing gradient problem.

Optionally, the performing one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector may specifically include: carrying out one-dimensional convolution processing on the second feature vector to obtain a third processing result; activating the third processing result through a preset fourth activation function to obtain a fourth processing result; and taking the product of the third processing result and the fourth processing result as second characteristic information of the second characteristic vector, as shown in fig. 3. Wherein the fourth activation function may employ a Sigmoid function. Therefore, a Gate structure can be formed, and at the moment, the Gate structure can better control the transmission of global feature information, and the representation capability of global features is improved.

Optionally, the step in which the second pooling layer pools the activated second feature information to obtain a third feature vector includes: the second pooling layer pools the activated second feature information by average pooling (avg-pooling) to obtain the third feature vector. Avg-pooling is used so that the model makes full use of the features of every feature sequence, taking both global information and local information into account.
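A minimal sketch of average pooling is given below, assuming the average is taken over all positions of each kernel's output in the second convolution layer. The pooling window is not specified in the description, so this global form, the name `avg_pool` and the toy values are assumptions.

```python
def avg_pool(feature_maps):
    """Average each kernel's feature map over all positions."""
    return [sum(fm) / len(fm) for fm in feature_maps]


# Toy activated second feature information: 2 kernels, 3 positions.
activated_second_feature_info = [[1.0, 2.0, 3.0], [0.0, 4.0, 2.0]]
third_feature_vector = avg_pool(activated_second_feature_info)
print(third_feature_vector)  # [2.0, 2.0]
```

Unlike max-pooling, every position contributes to the result, which matches the stated aim of using the features of each feature sequence.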

Optionally, as shown in fig. 3, the output layer may include a fully connected layer and a classifier. It will be appreciated that the convolution layers and pooling layers of a convolutional neural network map the original data into the hidden-layer feature space, whereas the fully connected layer maps the features learned by the network into the sample label space: it integrates the features produced by the convolution layers and pooling layers to obtain their high-level meaning for subsequent classification.

In this embodiment, the output of the fully connected layer is a T×1 vector, which is used as the input of the classifier, where T is the number of categories. The classifier obtains the probability that the sample belongs to each category based on the output vector of the fully connected layer. The output of the classifier is also a T×1 vector; the value of each element in the classifier output vector ranges from 0 to 1, and the sum of the element values is equal to 1.

The number of categories to be classified is set according to actual needs. For example, in one application scenario, it is only necessary to detect whether the application program installation package carries a virus. This is a two-category classification, and the virus identification result of the application program installation package is the probability that the package does or does not carry a virus, that is, whether the package carries a virus is identified. In another application scenario, it is necessary to detect both whether the application program installation package carries a virus and the type of the virus. In this case, the classification may have three or more categories, and the virus identification result is the probability that the package carries no virus or carries a virus of each type, that is, both whether the package carries a virus and the type of the virus are identified.

Optionally, the fully connected layer may adopt a Highway Network structure. By employing the Highway Network structure, the layer is represented as learning a residual function, which allows network performance to be improved simply by increasing network depth. Optionally, the classifier may employ a Softmax function.
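When the classifier is a Softmax function, the properties of the classifier output described above (each element between 0 and 1, elements summing to 1) follow directly, as the sketch below shows. The toy input vector with T=3 is an assumption, and subtracting the maximum before exponentiation is a common numerical-stability measure that is not part of the description.

```python
import math


def softmax(logits):
    """Convert a T-element score vector into T class probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy output of the fully connected layer for T = 3 categories.
probs = softmax([2.0, 0.5, -1.0])
print(len(probs))
```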

In order to improve the convergence rate of the training process of the neural network system, the method acquires the identification information of the application program installation package. For example, the identification information of the application installation package may include the name of the application and/or the package name. At this time, the implementation process for obtaining the virus identification result of the application installation package based on the third feature vector and the identification information may include: converting the identification information of the application program installation package into a fourth feature vector, and splicing the third feature vector and the fourth feature vector to obtain a fifth feature vector; and obtaining a virus identification result of the application program installation package based on the fifth feature vector. Therefore, the convergence speed of the training process of the neural network system can be greatly improved. As shown in fig. 3, when the identification information includes the name and package name of the application, the name and package name of the application may be converted into vectors by setting an Embedding layer (Embedding), respectively.
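The conversion and splicing described above can be sketched as follows. The function `toy_embedding` is a hypothetical, untrained stand-in for the Embedding layer of fig. 3 (a real embedding layer has learned parameters), and the example application name, package name and vector sizes are assumptions chosen only to make the shapes visible.

```python
def toy_embedding(text, dim):
    """Hypothetical stand-in for a trained Embedding layer.

    Deterministically maps a string to a `dim`-dimensional vector by
    counting characters into buckets; it only illustrates the shapes.
    """
    vec = [0.0] * dim
    for position, ch in enumerate(text):
        vec[(ord(ch) + position) % dim] += 1.0
    return vec


third_feature_vector = [0.3, 0.9, 0.1, 0.5]  # toy output of the second pooling layer

name_vector = toy_embedding("ExampleApp", dim=8)          # application name
package_vector = toy_embedding("com.example.app", dim=8)  # package name

# Fourth feature vector: the converted identification information.
fourth_feature_vector = name_vector + package_vector

# Fifth feature vector: the third and fourth feature vectors spliced together.
fifth_feature_vector = third_feature_vector + fourth_feature_vector
print(len(fifth_feature_vector))  # 20
```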

It can be understood that, when the identification information of the application program installation package is the name of the application program, the fourth feature vector is the vector converted from the name of the application program; when the identification information of the application program installation package is the package name of the application program, the fourth feature vector is the vector converted from the package name of the application program; and when the identification information of the application program installation package includes both the name and the package name of the application program, the fourth feature vector includes the vector converted from the application name and the vector converted from the package name.

According to the scheme, through the pre-trained target neural network system comprising the first sub-network and the second sub-network, the static information and the identification information of the application program installation package are deeply learned, the virus identification result of the application program installation package is obtained, and the accuracy of virus detection of the application program installation package is effectively improved.

Of course, before executing the application program identification method shown in the first embodiment, the target neural network system needs to be trained to obtain relevant parameters of each sub-network of the target neural network system. The process of training the target neural network system may be performed in a server. FIG. 4 is a flow chart of one embodiment of a training method of the neural network system of the present invention. As shown in fig. 4, the method of the present embodiment may at least include the following steps S401 to S402:

Step S401, obtaining a training sample, where the training sample includes static information and identification information of a plurality of application program installation packages, and a virus tag of each application program installation package, where the static information is obtained by analyzing a code file of the application program installation package.

It will be appreciated that the selection of training samples is dependent on the specific detection requirements, and that the virus labels are classification labels of training samples of known classes. For example, when it is required to detect whether the application installation package carries a virus, the training samples include static information and identification information of a plurality of application installation packages carrying viruses and static information and identification information of a plurality of application installation packages not carrying viruses; when it is required to detect whether the application program installation package carries viruses or not and the types of the viruses, if the types of the viruses are two types, i.e. the type I viruses and the type II viruses, respectively, the training sample comprises static information and identification information of a plurality of application program installation packages which do not carry viruses, static information and identification information of a plurality of application program installation packages which carry the type I viruses, and static information and identification information of a plurality of application program installation packages which carry the type II viruses.

Optionally, the static information is a binary file corresponding to the application program installation package. At this time, based on the static information, N feature sequences are generated, including: dividing the binary file into N binary sequences; and encoding each binary sequence in the N binary sequences to obtain the N characteristic sequences.
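A minimal sketch of this option is shown below: the binary content is divided into N binary sequences, and each one is encoded byte by byte into decimal values. The function name `binary_to_feature_sequences` and the near-equal split by index boundaries are assumptions; the description leaves the division manner open.

```python
def binary_to_feature_sequences(raw, n):
    """Divide `raw` bytes into n consecutive parts and encode each part.

    Parts are near-equal in length; each byte is encoded as a decimal
    value in the range 0 to 255.
    """
    bounds = [len(raw) * i // n for i in range(n + 1)]
    return [list(raw[bounds[i]:bounds[i + 1]]) for i in range(n)]


raw = b"\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff"
sequences = binary_to_feature_sequences(raw, 3)
print(sequences)  # [[144, 0, 3, 0], [0, 0, 4, 0], [0, 0, 255, 255]]
```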

Step S402, training the pre-constructed initial neural network system with the training sample to obtain a target neural network system.

Wherein the initial neural network system includes a first subnetwork and a second subnetwork. The first subnetwork is used for generating N feature sequences based on static information of the application program installation package, and respectively carrying out preset first feature parameter extraction processing on each feature sequence to obtain N first feature vectors, wherein N is an integer greater than or equal to 2. The second sub-network is used for splicing the N first feature vectors, extracting preset second feature parameters from the second feature vectors obtained after splicing to obtain a third feature vector, and obtaining a virus identification result of the application program installation package based on the third feature vector and the identification information of the application program installation package.

Optionally, the first subnetwork and the second subnetwork each employ a convolutional neural network.

Optionally, the first subnetwork includes: the input layer, the first convolution layer and the first pooling layer are sequentially connected. At this time, the input layer is configured to generate N feature sequences based on the static information. The first convolution layer is used for carrying out one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence, and activating the first feature information through a preset first activation function to obtain activated first feature information. The first pooling layer is used for pooling the activated first feature information corresponding to each feature sequence in the N feature sequences to obtain the N first feature vectors.

Optionally, the first convolution layer is specifically configured to: for each of the N feature sequences, performing the steps of: carrying out one-dimensional convolution processing on the characteristic sequence to obtain a first processing result; activating the first processing result through a preset second activation function to obtain a second processing result; and taking the product of the first processing result and the second processing result as first characteristic information of the characteristic sequence.

Optionally, the first pooling layer is specifically configured to pool, by using a maximum pooling manner, the activated first feature information corresponding to each feature sequence in the N feature sequences, so as to obtain the N first feature vectors.

Optionally, the second sub-network includes a second convolution layer, a second pooling layer, and an output layer, which are sequentially connected. At this time, the second convolution layer is configured to splice the N first feature vectors to obtain a second feature vector, perform one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector, and activate the second feature information through a preset third activation function to obtain activated second feature information. The second pooling layer is used for pooling the activated second feature information to obtain a third feature vector. The output layer is used for obtaining a virus identification result of the application program installation package based on the third feature vector and the identification information.

Optionally, the second convolution layer is specifically configured to perform one-dimensional convolution processing on the second feature vector to obtain a third processing result; activating the third processing result through a preset fourth activation function to obtain a fourth processing result; and taking the product of the third processing result and the fourth processing result as second characteristic information of the second characteristic vector.

Optionally, the second pooling layer is specifically configured to pool the activated second feature information in an average pooling manner to obtain a third feature vector.

Optionally, the second sub-network is specifically configured to convert the identification information into a fourth feature vector; splicing the third feature vector and the fourth feature vector to obtain a fifth feature vector; and obtaining a virus identification result of the application program installation package based on the fifth feature vector.

Fig. 5 is a schematic structural diagram of an embodiment of a neural network system provided by the present invention, and as shown in fig. 5, the neural network system provided by the present embodiment may include: a first subnetwork 51 and a second subnetwork 52.

The first subnetwork 51 is configured to generate N feature sequences based on the obtained static information of the application installation package, and perform preset first feature parameter extraction processing on each feature sequence to obtain N first feature vectors, where N is an integer greater than or equal to 2, where the static information is obtained by analyzing a code file of the application installation package.

The second subnetwork 52 is configured to splice the N first feature vectors, perform preset second feature parameter extraction processing on the second feature vector obtained after the splicing to obtain a third feature vector, and obtain a virus identification result of the application installation package based on the third feature vector and the obtained identification information of the application installation package.

Optionally, the static information is a binary stream of the application installation package, and the first subnetwork 51 is specifically configured to: dividing the binary stream into N binary sequences; and encoding each binary sequence in the N binary sequences to obtain the N characteristic sequences.

Optionally, the first subnetwork 51 is specifically configured to: coding each binary sequence in the N binary sequences to obtain N coding sequences; and carrying out dimension reduction processing on each coding sequence in the N coding sequences based on a preset algorithm to obtain the N characteristic sequences, wherein the dimension of the characteristic sequences is lower than that of the corresponding coding sequences.

Optionally, the first subnetwork 51 includes an input layer 511, a first convolution layer 512 and a first pooling layer 513, which are sequentially connected. The input layer 511 is configured to generate N feature sequences based on the static information. The first convolution layer 512 is configured to perform one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence, and activate the first feature information through a preset first activation function to obtain activated first feature information. The first pooling layer 513 is configured to pool the activated first feature information corresponding to each feature sequence in the N feature sequences, to obtain the N first feature vectors.

Optionally, the first convolution layer 512 is specifically configured to perform, for each of the N feature sequences, the following steps: carrying out one-dimensional convolution processing on the characteristic sequence to obtain a first processing result; activating the first processing result through a preset second activation function to obtain a second processing result; and taking the product of the first processing result and the second processing result as first characteristic information of the characteristic sequence.

Optionally, the first pooling layer 513 is specifically configured to pool the activated first feature information corresponding to each feature sequence in the N feature sequences by using a maximum pooling manner, so as to obtain the N first feature vectors.

Optionally, the second subnetwork 52 includes a second convolution layer 521, a second pooling layer 522, and an output layer 523, where the second convolution layer 521, the second pooling layer 522, and the output layer 523 are sequentially connected. The second convolution layer 521 is configured to splice the N first feature vectors to obtain a second feature vector, perform one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector, and activate the second feature information through a preset third activation function to obtain activated second feature information. The second pooling layer 522 is configured to pool the activated second feature information to obtain a third feature vector. The output layer 523 is configured to obtain a virus identification result of the application installation package based on the third feature vector and the identification information.

Optionally, the second convolution layer 521 is specifically configured to perform one-dimensional convolution processing on the second feature vector to obtain a third processing result; activating the third processing result through a preset fourth activation function to obtain a fourth processing result; and taking the product of the third processing result and the fourth processing result as second characteristic information of the second characteristic vector.

Optionally, the second pooling layer 522 is specifically configured to pool the activated second feature information in an average pooling manner to obtain a third feature vector.

Optionally, the second subnetwork 52 is specifically configured to: converting the identification information into a fourth feature vector; splicing the third feature vector and the fourth feature vector to obtain a fifth feature vector; and obtaining a virus identification result of the application program installation package based on the fifth feature vector.

Optionally, the identification information of the application installation package includes a name and/or a package name of the application.

The neural network system provided in this embodiment may be used to execute the technical solution provided in the method embodiment shown in fig. 2, and the specific implementation manner and the technical effect are similar, and are not described herein again.

In addition, the embodiment of the invention also provides an application program identification device. As shown in fig. 6, the application program identification apparatus includes:

the obtaining module 61 is configured to obtain a training sample, where the training sample includes static information and identification information of a plurality of application program installation packages, and a virus tag of each application program installation package, where the static information is obtained by analyzing a code file of the application program installation package;

the training module 62 is configured to train the initial neural network system built in advance through the training sample to obtain a target neural network system;

the initial neural network system comprises a first sub-network and a second sub-network, wherein the first sub-network is used for generating N characteristic sequences based on static information of an application program installation package, and respectively carrying out preset first characteristic parameter extraction processing on each characteristic sequence to obtain N first characteristic vectors, wherein N is an integer greater than or equal to 2;

the second subnetwork is used for splicing the N first feature vectors, extracting preset second feature parameters from the second feature vectors obtained after splicing to obtain a third feature vector, and obtaining a virus identification result of the application program installation package based on the third feature vector and the identification information of the application program installation package.

It should be noted that, the specific implementation and the technical effects of the application program identification device provided by the embodiment of the present invention are the same as those of the training method embodiment of the neural network system, and for the sake of brief description, reference may be made to corresponding contents in the training method embodiment of the neural network system where the device embodiment is not mentioned.

An embodiment of the present invention also provides an electronic device, including a processor and a memory, where the memory is coupled to the processor, and the memory stores instructions that, when executed by the processor, cause the electronic device to perform the steps of the application identification method provided in the first embodiment.

An embodiment of the present invention also provides an electronic device, including a processor and a memory, the memory being coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the steps of the training method of the neural network system described above.

Fig. 7 illustrates a block diagram of an exemplary electronic device 700. As shown in fig. 7, the electronic device 700 includes a memory 702, a memory controller 704, one or more (only one is shown in the figure) processors 706, a peripheral interface 708, a network module 710, an input-output module 712, and a display module 714, among others. These components communicate with each other via one or more communication buses/signal lines 716.

The memory 702 may be used to store software programs and modules, such as program instructions/modules corresponding to the application program identification method in the embodiment of the present invention, and the processor 706 executes the software programs and modules stored in the memory 702 to perform various functional applications and data processing, such as the application program identification method provided in the embodiment of the present invention.

The memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. Access to the memory 702 by the processor 706 and possibly other components may be under control of the memory controller 704.

A peripheral interface 708 couples various input/output devices to the processor 706 and memory 702. In some embodiments, the peripheral interface 708, the processor 706, and the memory controller 704 may be implemented in a single chip. In other examples, they may be implemented by separate chips.

The network module 710 is configured to receive and transmit network signals. The network signals may include wireless signals or wired signals.

The input/output module 712 is used to provide user input data, enabling user interaction with the electronic device. The input/output module 712 may be, but is not limited to, a mouse, a keyboard, a touch screen, etc.

The display module 714 provides an interactive interface (e.g., a user-operated interface) between the electronic device 700 and the user, or is used to display image data for the user's reference. In this embodiment, the display module 714 may be a liquid crystal display or a touch display. In the case of a touch display, the touch display may be a capacitive touch screen or a resistive touch screen, etc., supporting single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations generated simultaneously at one or more positions on the touch display, and the sensed touch operations are passed to the processor for calculation and processing.

It is to be understood that the configuration shown in fig. 7 is illustrative only, and that electronic device 700 may also include more or fewer components than those shown in fig. 7, or have a different configuration than that shown in fig. 7. The components shown in fig. 7 may be implemented in hardware, software, or a combination thereof.

An embodiment of the present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the application program identifying method provided in the first embodiment described above.

An embodiment of the present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the training method of the neural network system provided in the first embodiment described above.

The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content included in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.

In this specification, each embodiment is described with emphasis on its differences from the other embodiments; for identical or similar parts, the embodiments may be referred to one another.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in a gateway, proxy server, system according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order. These words may be interpreted as names.

The invention discloses A1, an application program identification method, which comprises the following steps:

acquiring static information and identification information of an application program installation package, and inputting the static information and the identification information into a target neural network system trained in advance, wherein the static information is obtained by analyzing a code file of the application program installation package, and the target neural network system comprises a first sub-network and a second sub-network;

The first sub-network generates N characteristic sequences based on the static information, and respectively performs preset first characteristic parameter extraction processing on each characteristic sequence to obtain N first characteristic vectors, wherein N is an integer greater than or equal to 2;

and the second sub-network splices the N first feature vectors, performs preset second feature parameter extraction processing on the second feature vectors obtained after splicing to obtain third feature vectors, and obtains a virus identification result of the application program installation package based on the third feature vectors and the identification information.

A2, the method according to A1, wherein the static information is a binary stream of the application program installation package, and the generating N feature sequences based on the static information includes: dividing the binary stream into N binary sequences; and encoding each binary sequence in the N binary sequences to obtain the N characteristic sequences.

A3, according to the method of A2, the encoding each binary sequence in the N binary sequences to obtain the N feature sequences includes: coding each binary sequence in the N binary sequences to obtain N coding sequences; and carrying out dimension reduction processing on each coding sequence in the N coding sequences based on a preset algorithm to obtain the N characteristic sequences, wherein the dimension of the characteristic sequences is lower than that of the corresponding coding sequences.
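The splitting, encoding and dimension reduction steps of A2 and A3 can be illustrated with a minimal sketch. The text above does not fix the encoding or the "preset algorithm", so the sketch assumes one-hot encoding of each byte (256 dimensions) followed by projection through a lookup table; the names `split_stream`, `one_hot`, `make_table` and `reduce_dim` are hypothetical, and the randomly initialised table merely stands in for whatever parameters a trained system would use.

```python
import random


def split_stream(data: bytes, n: int) -> list[bytes]:
    """Divide a binary stream into n contiguous binary sequences."""
    if n < 2:
        raise ValueError("N must be an integer greater than or equal to 2")
    size = -(-len(data) // n)  # ceiling division; trailing chunks may be shorter
    return [data[i * size:(i + 1) * size] for i in range(n)]


def one_hot(seq: bytes) -> list[list[int]]:
    """Assumed encoding: each byte becomes a 256-dimensional one-hot vector."""
    return [[1 if k == b else 0 for k in range(256)] for b in seq]


def make_table(dim: int, seed: int = 0) -> list[list[float]]:
    """Hypothetical 256 x dim projection table (random stand-in for learned values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(256)]


def reduce_dim(coded: list[list[int]], table: list[list[float]]) -> list[list[float]]:
    """Assumed dimension reduction: multiply each one-hot vector by the table."""
    dim = len(table[0])
    return [[sum(v[k] * table[k][d] for k in range(256)) for d in range(dim)]
            for v in coded]


# Usage: N = 3 binary sequences, each reduced from 256 to 4 dimensions per byte.
table = make_table(4)
chunks = split_stream(bytes(range(10)), 3)
feature_sequences = [reduce_dim(one_hot(c), table) for c in chunks]
```

Under these assumptions each feature sequence has one 4-dimensional entry per byte, which is lower than the 256 dimensions of the corresponding coding sequence, as A3 requires.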

A4, the method according to A1, wherein the first sub-network comprises an input layer, a first convolution layer and a first pooling layer connected in sequence,

the first subnetwork generates N feature sequences based on the static information, and respectively performs a preset first feature parameter extraction process on each feature sequence to obtain N first feature vectors, including:

the input layer generates N characteristic sequences based on the static information;

the first convolution layer performs one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence, and activates the first feature information through a preset first activation function to obtain activated first feature information;

and the first pooling layer pools the activated first feature information corresponding to each feature sequence in the N feature sequences to obtain N first feature vectors.

A5, according to the method of A4, the first convolution layer carries out one-dimensional convolution processing on each feature sequence in the N feature sequences to obtain first feature information of the feature sequence, and the method comprises the following steps: the first convolution layer performs the following steps for each of the N feature sequences: carrying out one-dimensional convolution processing on the characteristic sequence to obtain a first processing result; activating the first processing result through a preset second activation function to obtain a second processing result; and taking the product of the first processing result and the second processing result as first characteristic information of the characteristic sequence.

A6, according to the method of A4, the first pooling layer pools the activated first feature information corresponding to each feature sequence in the N feature sequences to obtain N first feature vectors, and the method comprises the following steps:

and the first pooling layer pools the activated first characteristic information corresponding to each characteristic sequence in the N characteristic sequences in a maximum pooling mode to obtain N first characteristic vectors.
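The processing of one feature sequence in A4 to A6 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the feature sequence is simplified to a list of scalars, the second activation function is assumed to be a sigmoid, the first activation function is assumed to be ReLU, and max pooling is taken over all positions for each kernel. The function names are hypothetical.

```python
import math


def conv1d(seq: list[float], kernel: list[float]) -> list[float]:
    """One-dimensional convolution, stride 1, no padding (cross-correlation form)."""
    k = len(kernel)
    if len(seq) < k:
        raise ValueError("sequence is shorter than the kernel")
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed second activation function)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_conv(seq: list[float], kernel: list[float]) -> list[float]:
    """A5: first feature information = conv result * activation(conv result)."""
    first = conv1d(seq, kernel)               # first processing result
    second = [sigmoid(v) for v in first]      # second processing result
    return [a * b for a, b in zip(first, second)]


def first_feature_vector(seq: list[float], kernels: list[list[float]]) -> list[float]:
    """A4 and A6: activate the gated output (ReLU assumed), then max-pool per kernel."""
    vector = []
    for kernel in kernels:
        activated = [max(0.0, v) for v in gated_conv(seq, kernel)]
        vector.append(max(activated))         # maximum pooling over positions
    return vector


# Usage: one feature sequence, two kernels, giving a 2-element first feature vector.
vec = first_feature_vector([1.0, 2.0, 3.0, 4.0], [[1.0, -1.0], [0.5, 0.5]])
```

Note that, following the wording of A5, both factors of the product come from the same convolution result; applying `first_feature_vector` to each of the N feature sequences would yield the N first feature vectors.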

A7, the method according to A1, wherein the second sub-network comprises a second convolution layer, a second pooling layer and an output layer, the second convolution layer, the second pooling layer and the output layer are sequentially connected,

the second sub-network splices the N first feature vectors, performs preset second feature parameter extraction processing on the second feature vectors obtained after splicing to obtain third feature vectors, and obtains virus identification results of the application program installation package based on the third feature vectors and the identification information, wherein the method comprises the following steps:

the second convolution layer splices the N first feature vectors to obtain a second feature vector, performs one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector, and activates the second feature information through a preset third activation function to obtain activated second feature information;

The second pooling layer pools the activated second characteristic information to obtain a third characteristic vector;

and the output layer obtains a virus identification result of the application program installation package based on the third feature vector and the identification information.

A8, the method according to A7, wherein the performing one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector includes:

carrying out one-dimensional convolution processing on the second feature vector to obtain a third processing result;

activating the third processing result through a preset fourth activation function to obtain a fourth processing result;

and taking the product of the third processing result and the fourth processing result as second characteristic information of the second characteristic vector.

A9, according to the method of A7, the second pooling layer pools the activated second feature information to obtain a third feature vector, which includes: and the second pooling layer pools the activated second characteristic information in an average pooling mode to obtain a third characteristic vector.
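The second sub-network steps of A7 to A9 can be sketched in the same style. As before, this is a simplified illustration: the fourth activation function is assumed to be a sigmoid, the third activation function is assumed to be ReLU, average pooling is taken over all positions for each kernel, and the function names are hypothetical.

```python
import math


def conv1d(seq: list[float], kernel: list[float]) -> list[float]:
    """One-dimensional convolution, stride 1, no padding (cross-correlation form)."""
    k = len(kernel)
    if len(seq) < k:
        raise ValueError("sequence is shorter than the kernel")
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed fourth activation function)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def third_feature_vector(first_vectors: list[list[float]],
                         kernels: list[list[float]]) -> list[float]:
    """Splice, gated 1-D convolution, activation, then average pooling per kernel."""
    # A7: splice the N first feature vectors into one second feature vector.
    second_vector = [v for vec in first_vectors for v in vec]
    result = []
    for kernel in kernels:
        third = conv1d(second_vector, kernel)          # third processing result
        fourth = [sigmoid(v) for v in third]           # fourth processing result
        info = [a * b for a, b in zip(third, fourth)]  # A8: second feature information
        activated = [max(0.0, v) for v in info]        # third activation (ReLU assumed)
        result.append(sum(activated) / len(activated)) # A9: average pooling
    return result


# Usage: three 1-element first feature vectors, one kernel of width 1.
out = third_feature_vector([[1.0], [1.0], [1.0]], [[1.0]])
```

The design mirrors the first sub-network, except that the input is the spliced vector and the pooling is an average rather than a maximum.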

A10, according to the method of A1, the obtaining the virus identification result of the application program installation package based on the third feature vector and the identification information includes: converting the identification information into a fourth feature vector; splicing the third feature vector and the fourth feature vector to obtain a fifth feature vector; and obtaining a virus identification result of the application program installation package based on the fifth feature vector.
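A10 can be illustrated with a short sketch. The text does not specify how the identification information is converted into a vector or how the fifth feature vector is scored, so the sketch assumes a hashed bag of character trigrams for the conversion and a single logistic unit for the output; `id_to_vector`, `classify`, the weights and the example package name are all hypothetical.

```python
import math
import zlib


def id_to_vector(text: str, dim: int = 8) -> list[float]:
    """Assumed conversion: hash character trigrams of the name into dim buckets."""
    t = text.lower()
    grams = [t[i:i + 3] for i in range(max(1, len(t) - 2))]
    vec = [0.0] * dim
    for g in grams:
        # crc32 is used because it is stable across runs, unlike built-in hash().
        vec[zlib.crc32(g.encode("utf-8")) % dim] += 1.0
    return vec


def classify(third: list[float], id_text: str, weights: list[float],
             bias: float = 0.0, id_dim: int = 8) -> float:
    """Splice third and fourth vectors, then score with one logistic unit (assumed)."""
    fourth = id_to_vector(id_text, id_dim)     # fourth feature vector
    fifth = third + fourth                     # fifth feature vector (spliced)
    if len(weights) != len(fifth):
        raise ValueError("weights must match the length of the fifth feature vector")
    score = sum(w * x for w, x in zip(weights, fifth)) + bias
    return 1.0 / (1.0 + math.exp(-score))     # value in (0, 1); higher = more suspicious


# Usage with placeholder (untrained) weights.
probability = classify([0.5, 0.2], "com.example.app", [0.0] * 10)
```

In a trained system the weights would come from the training procedure; the zero weights here only demonstrate the data flow from the third feature vector and the identification information to a single identification score.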

A11, the method according to A1, wherein the identification information comprises the name and/or package name of the application program.

The invention discloses B12, an application program identification method, which comprises the following steps:

acquiring a training sample, wherein the training sample comprises static information and identification information of a plurality of application program installation packages and virus labels of each application program installation package, and the static information is obtained by analyzing code files of the application program installation packages;

training the initial neural network system constructed in advance through the training sample to obtain a target neural network system;

The invention discloses C13, a neural network system, the system comprising:

the first sub-network is used for generating N characteristic sequences based on the acquired static information of the application program installation package, and respectively carrying out preset first characteristic parameter extraction processing on each characteristic sequence to obtain N first characteristic vectors, wherein the static information is obtained by analyzing a code file of the application program installation package, and N is an integer greater than or equal to 2;

and the second sub-network is used for splicing the N first feature vectors, carrying out preset second feature parameter extraction processing on the second feature vectors obtained after splicing to obtain a third feature vector, and obtaining a virus identification result of the application program installation package based on the third feature vector and the obtained identification information of the application program installation package.

C14, the system according to C13, wherein the static information is a binary stream of the application installation package, and the first sub-network is specifically configured to: dividing the binary stream into N binary sequences; and encoding each binary sequence in the N binary sequences to obtain the N characteristic sequences.

C15, the system of C14, the first subnetwork being specifically configured to:

coding each binary sequence in the N binary sequences to obtain N coding sequences;

and carrying out dimension reduction processing on each coding sequence in the N coding sequences based on a preset algorithm to obtain the N characteristic sequences, wherein the dimension of the characteristic sequences is lower than that of the corresponding coding sequences.

C16, the system of C13, wherein the first sub-network comprises an input layer, a first convolution layer and a first pooling layer connected in sequence,

the input layer is used for generating N characteristic sequences based on the static information;

the first convolution layer is used for carrying out one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence, and activating the first feature information through a preset first activation function to obtain activated first feature information;

the first pooling layer is configured to pool the activated first feature information corresponding to each feature sequence in the N feature sequences, so as to obtain the N first feature vectors.

C17, the system of C16, the first convolution layer being specifically configured to: for each of the N feature sequences, performing the steps of: carrying out one-dimensional convolution processing on the characteristic sequence to obtain a first processing result; activating the first processing result through a preset second activation function to obtain a second processing result; and taking the product of the first processing result and the second processing result as first characteristic information of the characteristic sequence.

C18, the system according to C16, wherein the first pooling layer is specifically configured to pool the activated first feature information corresponding to each feature sequence in the N feature sequences in a maximum pooling manner, so as to obtain the N first feature vectors.

C19, the system of C13, the second sub-network comprising a second convolution layer, a second pooling layer, and an output layer, the second convolution layer, the second pooling layer, and the output layer being connected in sequence. The second convolution layer is configured to splice the N first feature vectors to obtain a second feature vector, perform one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector, and activate the second feature information through a preset third activation function to obtain activated second feature information. And the second pooling layer is used for pooling the activated second characteristic information to obtain a third characteristic vector. And the output layer is used for obtaining the virus identification result of the application program installation package based on the third feature vector and the identification information.

C20, the system of C19, the second convolution layer being specifically configured to: carrying out one-dimensional convolution processing on the second feature vector to obtain a third processing result; activating the third processing result through a preset fourth activation function to obtain a fourth processing result; and taking the product of the third processing result and the fourth processing result as second characteristic information of the second characteristic vector.

C21, the system of C19, the second pooling layer being specifically configured to: pool the activated second feature information in an average pooling manner to obtain a third feature vector.

C22, the system according to C13, the second subnetwork being specifically configured to: converting the identification information into a fourth feature vector; splicing the third feature vector and the fourth feature vector to obtain a fifth feature vector; and obtaining a virus identification result of the application program installation package based on the fifth feature vector.

C23, the system according to C13, wherein the identification information comprises the name and/or package name of the application program.

The invention discloses D24, an application program identification device, which comprises:

an acquisition module, configured to acquire a training sample, wherein the training sample comprises static information and identification information of a plurality of application program installation packages and a virus label of each application program installation package, and the static information is obtained by analyzing a code file of the application program installation package;

the training module is used for training the initial neural network system constructed in advance through the training sample to obtain a target neural network system;

The invention discloses E25, an electronic device comprising a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the steps of the method of any one of A1-A11 and B12.

The invention discloses F26, a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of A1-A11 and B12.

Claims

1. A method of application identification, the method comprising:

2. The method of claim 1, wherein the static information is a binary stream of the application installation package, and wherein the generating N feature sequences based on the static information comprises:

dividing the binary stream into N binary sequences;

and encoding each binary sequence in the N binary sequences to obtain the N characteristic sequences.

3. The method according to claim 2, wherein the encoding each of the N binary sequences to obtain the N feature sequences comprises:

4. The method of claim 1, wherein the first subnetwork comprises an input layer, a first convolution layer and a first pooling layer connected in sequence,

5. The method of claim 4, wherein the first convolution layer performs one-dimensional convolution processing on each of the N feature sequences to obtain first feature information of the feature sequence, and the first convolution layer includes:

The first convolution layer performs the following steps for each of the N feature sequences:

carrying out one-dimensional convolution processing on the characteristic sequence to obtain a first processing result;

activating the first processing result through a preset second activation function to obtain a second processing result;

and taking the product of the first processing result and the second processing result as first characteristic information of the characteristic sequence.

6. The method of claim 4, wherein the pooling the activated first feature information corresponding to each of the N feature sequences by the first pooling layer to obtain the N first feature vectors includes:

7. The method of claim 1, wherein the second subnetwork comprises a second convolutional layer, a second pooling layer, and an output layer, the second convolutional layer, the second pooling layer, and the output layer being sequentially coupled,

8. The method of claim 7, wherein the performing one-dimensional convolution on the second feature vector to obtain second feature information of the second feature vector comprises:

9. The method of claim 7, wherein the second pooling layer pooling the activated second feature information to obtain a third feature vector, comprising:

and the second pooling layer pools the activated second characteristic information in an average pooling mode to obtain a third characteristic vector.

10. The method according to claim 1, wherein the obtaining the virus identification result of the application installation package based on the third feature vector and the identification information includes:

converting the identification information into a fourth feature vector;

splicing the third feature vector and the fourth feature vector to obtain a fifth feature vector;

and obtaining a virus identification result of the application program installation package based on the fifth feature vector.

11. The method according to claim 1, wherein the identification information comprises a name and/or package name of the application.

12. A method of application identification, the method comprising:

13. A neural network system, the system comprising:

14. The system of claim 13, wherein the static information is a binary stream of the application installation package, and wherein the first subnetwork is configured to:

dividing the binary stream into N binary sequences;

15. The system according to claim 14, wherein the first subnetwork is specifically configured to:

16. The system of claim 13, wherein the first subnetwork comprises an input layer, a first convolution layer and a first pooling layer connected in sequence,

17. The system of claim 16, wherein the first convolution layer is specifically configured to:

for each of the N feature sequences, performing the steps of:

18. The system of claim 16, wherein the first pooling layer is specifically configured to pool, by using a maximum pooling manner, the activated first feature information corresponding to each of the N feature sequences, so as to obtain the N first feature vectors.

19. The system of claim 13, wherein the second subnetwork comprises a second convolutional layer, a second pooling layer, and an output layer, the second convolutional layer, the second pooling layer, and the output layer being sequentially coupled,

the second convolution layer is configured to splice the N first feature vectors to obtain a second feature vector, perform one-dimensional convolution processing on the second feature vector to obtain second feature information of the second feature vector, and activate the second feature information through a preset third activation function to obtain activated second feature information;

the second pooling layer is used for pooling the activated second characteristic information to obtain a third characteristic vector;

and the output layer is used for obtaining the virus identification result of the application program installation package based on the third feature vector and the identification information.

20. The system of claim 19, wherein the second convolution layer is specifically configured to:

21. The system of claim 19, wherein the second pooling layer is specifically configured to:

22. The system according to claim 13, wherein the second subnetwork is specifically configured to:

converting the identification information into a fourth feature vector;

23. The system of claim 13, wherein the identification information includes a name and/or package name of the application.

24. An application identification device, the device comprising:

25. An electronic device comprising a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the steps of the method of any of claims 1-12.

26. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1-12.