CN114266046A - Network virus identification method and device, computer equipment and storage medium

Info

Publication number: CN114266046A
Application number: CN202111522855.7A
Authority: CN
Inventors: 潘佳斌; 董雷; 童志明
Original assignee: Antiy Technology Group Co Ltd
Current assignee: Antiy Technology Group Co Ltd
Priority date: 2021-12-13
Filing date: 2021-12-13
Publication date: 2022-04-01

Abstract

The application provides a network virus identification method and device, computer equipment, and a storage medium. It relates to the technical field of computer security and is intended to improve the efficiency and accuracy of network virus identification. The method mainly comprises the following steps: determining original features and virus labels respectively corresponding to a plurality of types of virus samples; processing the original features; performing model training on the processed original features and the virus labels to obtain a virus classification recognition model; extracting some of the parameters of the virus classification recognition model to generate a feature fusion model; inputting a target program code into the feature fusion model to obtain a specific feature vector of the target program code; and determining the cluster corresponding to the specific feature vector of the target program code, and determining, according to that cluster, whether the target program code is a network virus.

Description

Network virus identification method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of network security technologies, and in particular, to a method and an apparatus for identifying a network virus, a computer device, and a storage medium.

Background

Malicious code recognition is, in essence, a complex and very-large-scale task of classifying and discriminating network viruses. Traditional methods, which extract discriminative feature fragments by manual analysis or by automation, struggle to generalize well enough to discover unknown samples, and they lag behind newly emerging threats.

In addition, the traditional way to analyze and detect a network virus is to analyze and debug the virus manually, extract a characteristic fragment that is significant for the behavior pattern of the virus, and then detect the virus by means of that fragment. However, manual detection of network viruses is inefficient and inaccurate.

Disclosure of Invention

The embodiment of the application provides a network virus identification method, a network virus identification device, computer equipment and a storage medium, which are used for improving the efficiency and accuracy of network virus identification.

The embodiment of the invention provides a network virus identification method, which comprises the following steps:

determining original characteristics and virus labels respectively corresponding to a plurality of types of virus samples;

processing the original features;

performing model training on the processed original features and the virus labels to obtain a virus classification recognition model;

extracting partial parameters of the virus classification identification model to generate a feature fusion model;

inputting a target program code into the feature fusion model to obtain a specific feature vector of the target program code;

and determining a cluster corresponding to the specific characteristic vector of the target program code, and determining whether the target program code belongs to the network virus according to the determined cluster.

The embodiment of the invention provides a network virus identification device, which comprises:

the determining module is used for determining original characteristics and virus labels respectively corresponding to various types of virus samples;

the preprocessing module is used for processing the original features;

the training module is used for carrying out model training on the processed original features and the virus labels to obtain a virus classification and identification model;

the generating module is used for extracting partial parameters of the virus classification identification model and generating a feature fusion model;

the acquisition module is used for inputting a target program code into the feature fusion model to obtain a specific feature vector of the target program code;

the determining module is further configured to determine a cluster corresponding to the specific feature vector of the target program code, and determine whether the target program code belongs to the network virus according to the determined cluster.

A computer device comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the network virus identification method.

A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the network virus identification method described above.

A computer program product comprising a computer program which, when executed by a processor, implements the above-described network virus identification method.

The invention provides a network virus identification method, a network virus identification device, computer equipment and a storage medium, which are used for determining original characteristics and virus labels respectively corresponding to various types of virus samples, processing the original characteristics, and performing model training on the processed original characteristics and virus labels to obtain a virus classification identification model; extracting partial parameters of the virus classification identification model to generate a feature fusion model; inputting the target program code into the feature fusion model to obtain a specific feature vector of the target program code; and determining a cluster corresponding to the specific characteristic vector of the target program code, and determining whether the target program code belongs to the network virus according to the determined cluster. The invention realizes the fusion of various types of characteristics by using a characteristic fusion model to obtain a specific characteristic vector, the specific characteristic vector is the fusion expression of an original redundant characteristic set of a sample, the original characteristic scale can be simplified, the characteristic form can be unified, the characteristic structure can be normalized on the premise of keeping the original characteristic specificity, and then the clustering calculation is carried out according to the specific characteristic vector to determine whether a target program code belongs to the network virus, thereby improving the efficiency and the accuracy of identifying the network virus.

Drawings

FIG. 1 is a flowchart of a network virus identification method provided in the present application;

FIG. 2 is a diagram of a model architecture provided in the present application;

FIG. 3 is a flowchart of another network virus identification method provided in the present application;

FIG. 4 is a flowchart of another network virus identification method provided in the present application;

FIG. 5 is a schematic structural diagram of a network virus identification device provided in the present application;

FIG. 6 is a schematic diagram of a computer device provided in the present application.

Detailed Description

For a better understanding of the technical solutions described above, the technical solutions of the embodiments of the present application are described in detail below with reference to the drawings and the specific embodiments. It should be understood that the embodiments and their specific features are detailed descriptions of the technical solutions of the present application and do not limit them, and that the technical features of the embodiments may be combined with each other where they do not conflict.

Example one

Referring to fig. 1, a method for identifying a network virus according to an embodiment of the present invention specifically includes steps S101 to S106:

Step S101: determining original features and virus labels respectively corresponding to multiple types of virus samples.

The original features are malicious code feature information extracted from a sample by means such as static and dynamic feature analysis. By data type, they comprise numerical features, character-type features, serialized features and graph features; by origin, they comprise static features and dynamic features. Static features are obtained through static analysis and include file format information, file attribute information, character string information, binary information and instruction feature information. Dynamic features are obtained through dynamic analysis and include local behavior features, network behavior features, API call features, and the like.

In this embodiment, a virus label indicates the type of a virus, and there are as many virus labels as there are virus types. Viruses can be divided into categories such as virus, trojan and worm; each category contains multiple malicious code families, each family may have multiple variants, and each variant may have multiple files. The different sample classes referred to here may be any of the different variants of malicious code (sample program code).

It should be noted that, besides the virus type, the virus label in this embodiment may also represent the form in which the virus appears, for example a self-extracting archive or a packed (shelled) file. This embodiment does not specifically limit the form.

Step S102: processing the original features.

Processing the original features comprises: normalizing the numerical features to obtain target values, and converting the character-type features, the serialized features and the graph features into a corresponding first vector, second vector and third vector, respectively.

Specifically, methods such as Embedding can be used to convert the character-type features and the serialized features into word vectors (the first vector and the second vector) and to convert the graph features into a graph vector (the third vector).

Step S103: performing model training on the processed original features and the virus labels to obtain a virus classification recognition model.

Specifically, in this embodiment, model training may be performed on the virus labels and on the processed original features formed from the first vector, the second vector, the third vector and the target values, so as to obtain the virus classification recognition model.

In this embodiment, the first vector, the second vector, the third vector and the target values can be concatenated directly to obtain the processed original features. The concatenation order is not limited.
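The preprocessing and concatenation described in steps S102 and S103 can be sketched roughly as follows. The helper names (`min_max_normalize`, `embed_tokens`, `build_processed_features`), the min-max scheme, the lookup-table embedding and the averaging of token vectors are all assumptions made for illustration; the text itself only requires normalization, Embedding-style vectorization and concatenation.

```python
from typing import Dict, List


def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Scale a numerical feature into [0, 1]; lo and hi are assumed known bounds."""
    if hi == lo:
        return 0.0
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)


def embed_tokens(tokens: List[str], table: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embedding vectors of the tokens (unknown tokens map to zeros).

    Averaging is an illustrative choice, not something the text prescribes.
    """
    total = [0.0] * dim
    for tok in tokens:
        vec = table.get(tok, [0.0] * dim)
        total = [a + b for a, b in zip(total, vec)]
    n = max(len(tokens), 1)
    return [x / n for x in total]


def build_processed_features(first: List[float], second: List[float],
                             third: List[float], target_values: List[float]) -> List[float]:
    """Concatenate the three vectors and the normalized values (order is free)."""
    return first + second + third + target_values


# Toy usage with a hypothetical 2-dimensional embedding table.
table = {"mov": [1.0, 0.0], "lea": [0.0, 1.0]}
second_vector = embed_tokens(["mov", "lea"], table, dim=2)
features = build_processed_features([0.1, 0.9], second_vector, [0.3, 0.7],
                                    [min_max_normalize(5, 0, 10)])
```

Because the concatenation order is not limited, any fixed order works as long as training and inference use the same one.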

Step S104: extracting some of the parameters of the virus classification recognition model and generating a feature fusion model.

Specifically, extracting some of the parameters of the virus classification recognition model includes: removing the output layer of the virus classification recognition model and extracting the remaining model parameters; taking the specific feature vector that is input to the output layer as the output of the feature fusion model; and generating the feature fusion model from the remaining model parameters.

The virus classification recognition model may be built on, but is not limited to, a CNN, RNN or BERT network structure. As shown in FIG. 2, the model has a multilayer structure. The original features of the samples, preprocessed by normalization, Embedding and other methods, are used as the model input, and the corresponding virus labels are used as the output for model training; the model parameters are updated until the model is stable. The output layer of the virus classification recognition model is then removed, the remaining model parameters are extracted to generate the feature fusion model, and the specific feature vector input to the output layer is taken as the output of the feature fusion model.

The specific feature vector has a fixed form and value range and is a fused representation of the sample's original, redundant feature set (numerical features, character-type features, serialized features and graph features). While preserving the specificity of the original features, it reduces their scale, unifies their form and standardizes their structure. Cluster calculation can then be performed on the specific feature vector to determine whether a target program code is a network virus.
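The idea of dropping the output layer and reusing the remaining parameters can be illustrated with a deliberately tiny, pure-Python network. This is only a sketch: `TinyClassifier`, `FeatureFusionModel`, the single tanh hidden layer and the hand-set weights are hypothetical stand-ins for the trained CNN, RNN or BERT model, and training itself is omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def dense(x: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class TinyClassifier:
    """Hypothetical classifier: one hidden layer followed by an output layer."""

    def __init__(self, hidden_w: Matrix, hidden_b: List[float],
                 out_w: Matrix, out_b: List[float]) -> None:
        self.hidden_w, self.hidden_b = hidden_w, hidden_b
        self.out_w, self.out_b = out_w, out_b

    def hidden(self, x: List[float]) -> List[float]:
        return [math.tanh(v) for v in dense(x, self.hidden_w, self.hidden_b)]

    def predict_scores(self, x: List[float]) -> List[float]:
        return dense(self.hidden(x), self.out_w, self.out_b)


class FeatureFusionModel:
    """Keeps only the non-output parameters of a (trained) classifier."""

    def __init__(self, classifier: TinyClassifier) -> None:
        # Copy the remaining parameters; the output layer is discarded.
        self.hidden_w = [row[:] for row in classifier.hidden_w]
        self.hidden_b = classifier.hidden_b[:]

    def __call__(self, x: List[float]) -> List[float]:
        # The vector that used to feed the output layer is now the output.
        return [math.tanh(v) for v in dense(x, self.hidden_w, self.hidden_b)]


clf = TinyClassifier([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[1.0, 1.0]], [0.0])
fusion = FeatureFusionModel(clf)
specific_vector = fusion([0.5, -0.5])
```

With a deep-learning framework the same effect is usually achieved by truncating the network before its final classification layer; the exact mechanism depends on the framework and is not specified in the text.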

Step S105: inputting the target program code into the feature fusion model to obtain a specific feature vector of the target program code.

In this embodiment, because the output layer of the classification recognition model has been removed, inputting the target program code into the feature fusion model yields the specific feature vector of the target program code.

Step S106: determining the cluster corresponding to the specific feature vector of the target program code, and determining, according to that cluster, whether the target program code is a network virus.

The clusters are obtained by performing cluster calculation on the specific feature vectors corresponding to the samples. Each cluster corresponds to one virus label, and the samples in the cluster belong to the virus type indicated by that label. In this embodiment, the Hamming distance may be used as the feature distance metric, and the specific feature vectors corresponding to the samples may be used as input to the clustering calculation to obtain a plurality of clusters. Then, according to the distribution of the virus labels corresponding to the specific feature vectors in a cluster, the virus label with the largest share is selected as the virus label of the cluster. The virus label may be marking information such as the name of the malware family to which the sample belongs, whether the sample is a self-extracting archive, whether the sample is packed, whether the sample is an APT tool, and the like.
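The two ingredients named above, the Hamming distance and the majority label of a cluster, can be sketched as follows. The function names are illustrative, the specific feature vectors are assumed here to be binary sequences of equal length, and the clustering algorithm itself is left out because the text does not fix one.

```python
from collections import Counter
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def cluster_label(labels: List[str]) -> str:
    """Pick the virus label with the largest share among a cluster's samples."""
    if not labels:
        raise ValueError("cluster has no labeled samples")
    return Counter(labels).most_common(1)[0][0]


distance = hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])
label = cluster_label(["trojan", "trojan", "worm"])
```

If two labels tie, `Counter.most_common` returns the one encountered first; a real system would need an explicit tie-breaking rule.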

The invention provides a network virus identification method, which comprises the steps of determining original characteristics and virus labels respectively corresponding to various types of virus samples, processing the original characteristics, and performing model training on the processed original characteristics and the virus labels to obtain a virus classification identification model; extracting partial parameters of the virus classification identification model to generate a feature fusion model; inputting the target program code into the feature fusion model to obtain a specific feature vector of the target program code; and determining a cluster corresponding to the specific characteristic vector of the target program code, and determining whether the target program code belongs to the network virus according to the determined cluster. The invention realizes the fusion of various types of characteristics by using a characteristic fusion model to obtain a specific characteristic vector, the specific characteristic vector is the fusion expression of an original redundant characteristic set of a sample, the original characteristic scale can be simplified, the characteristic form can be unified, the characteristic structure can be normalized on the premise of keeping the original characteristic specificity, and then the clustering calculation is carried out according to the specific characteristic vector to determine whether a target program code belongs to the network virus, thereby improving the efficiency and the accuracy of identifying the network virus.

Example two

Referring to fig. 3, a method for identifying a network virus according to an embodiment of the present invention specifically includes steps S301 to S305:

Step S301: determining original features and virus labels respectively corresponding to multiple types of virus samples.

In this embodiment, after each type of feature in the original features is determined, the corresponding preprocessing is performed according to the feature value type of the feature. The feature value type is the original representation form of the extracted feature. For a person, for example, height and weight are numerical values, gender is a Boolean variable, and a fingerprint is a picture. Specifically, according to the data type of the original features in the sample, the original features may be divided into numerical features (number of file resources, number of file sections), character-type features, serialized features (disassembled instruction sequence), graph features (system call flow graph), Boolean features (whether executable sections exist), and the like.

Step S302, calculating hash codes respectively corresponding to the numerical characteristic, the character characteristic, the serialization characteristic and the graph characteristic in the original characteristic.

In an optional embodiment of the present invention, calculating hash codes corresponding to the numerical feature, the character-type feature, the serialization feature, and/or the graph feature respectively includes:

Step S3021: performing hash calculation on the numerical features and the character-type features to obtain the hash codes respectively corresponding to them.

Specifically, hash calculation is performed on the feature values of the character-type features (such as IP addresses and domain names) to obtain the corresponding hash codes. For a numerical feature (such as the number of PE file sections or the number of resource files), the hash code may be computed from the feature name alone, or from the feature name together with the corresponding feature value.

For example, if a numerical feature is named "called file number" and its feature value is 50, the hash code may be computed from "called file number" alone, or from "called file number" combined with the feature value 50.

Further, in this embodiment, before the hash code corresponding to the numerical characteristic is calculated, normalization processing may be performed on the numerical characteristic, and then the hash calculation may be performed according to the normalized numerical characteristic to obtain the corresponding hash code.
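A minimal sketch of the hash calculation in step S3021 follows. The text does not name a hash function or a code length for these features, so the use of SHA-256, the `name=value` encoding, the 128-bit truncation and the function name `feature_hash_bits` are all assumptions chosen for illustration.

```python
import hashlib
from typing import List, Optional


def feature_hash_bits(name: str, value: Optional[object] = None,
                      n_bits: int = 128) -> List[int]:
    """Hash a feature name, or the name combined with its value, into n_bits bits."""
    if not 0 < n_bits <= 256:
        raise ValueError("n_bits must be between 1 and 256 for SHA-256")
    text = name if value is None else f"{name}={value}"
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    bits: List[int] = []
    for byte in digest:
        # Unpack each byte from its most significant bit to its least.
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    return bits[:n_bits]


name_only = feature_hash_bits("called file number")
name_and_value = feature_hash_bits("called file number", 50)
```

Hashing the name alone gives every sample the same code for that feature, while including the (optionally normalized) value makes the code depend on the observed value; which variant is preferable is a design choice the text leaves open.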

Step S3022, converting each of the serialized features into a fixed-length feature vector.

Wherein the length of the feature vector is the same as the length of the hash code.

Step S3023, adding the feature vectors of each feature in the serialized features to obtain a target feature vector.

Step S3024, determining a hash code corresponding to the serialized features according to the target feature vector.

Determining the hash code corresponding to the serialized features according to the target feature vector comprises: obtaining the value of each element of the target feature vector; and setting each element whose value is greater than 0 to 1 and each element whose value is less than or equal to 0 to 0, to obtain the hash code corresponding to the serialized features.

For example, suppose the serialized feature is a disassembled instruction sequence with the content (lea, mov, mov, cmp, jz). Embedding is applied to obtain a fixed-length (128-dimensional) vector representation for each of lea, mov, mov, cmp and jz. These item vectors are accumulated to obtain a single fixed-length vector for the sequence (lea, mov, mov, cmp, jz). Each value in that vector is then thresholded (1 if the value is greater than 0, and 0 if it is less than or equal to 0), which yields a 128-bit binary sequence: the hash code corresponding to the serialized feature.
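Steps S3022 to S3024 can be sketched as below. The embedding table is a hypothetical 4-dimensional toy with made-up values (the example in the text uses 128 dimensions and a learned Embedding), and `sequence_hash_code` is an illustrative name.

```python
from typing import Dict, List, Sequence


def sequence_hash_code(tokens: Sequence[str],
                       embedding: Dict[str, List[float]]) -> List[int]:
    """Sum the per-token vectors, then map values > 0 to 1 and values <= 0 to 0."""
    dim = len(next(iter(embedding.values())))
    total = [0.0] * dim
    for tok in tokens:
        for i, v in enumerate(embedding[tok]):
            total[i] += v
    return [1 if v > 0 else 0 for v in total]


# Hypothetical toy embedding table (values chosen only for illustration).
toy_embedding = {
    "lea": [2.0, -3.0, 1.0, -1.0],
    "mov": [-1.0, 1.0, 1.0, -1.0],
    "cmp": [1.0, 1.0, -4.0, 2.0],
    "jz": [1.0, -1.0, 1.0, 2.0],
}
code = sequence_hash_code(["lea", "mov", "mov", "cmp", "jz"], toy_embedding)
# Accumulated vector is [2, -1, 0, 1], so the code is [1, 0, 0, 1].
```

Note that the third dimension sums to exactly 0 and therefore maps to 0, matching the "less than or equal to 0" rule.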

A graph feature can be represented as a set of nodes (functions or API calls) and edges (associations between them), and Embedding methods can be used to obtain vectorized representations of these nodes and edges.

Step S303, determining a specific feature vector corresponding to the sample according to the Hash codes respectively corresponding to the numerical characteristic, the character characteristic, the serialization characteristic and the graph characteristic.

Specifically, the hash codes corresponding to the numerical feature, the character feature, the serialization feature, and the graph feature may be added, and the result of the addition may be determined as the specific feature vector corresponding to the sample.

In an optional embodiment of the present invention, the determining, according to hash codes respectively corresponding to the numerical feature, the character-type feature, the serialization feature, and the graph feature, a specific feature vector corresponding to the sample includes: determining weight values corresponding to the numerical characteristic, the character type characteristic, the serialization characteristic and the graph characteristic respectively; and carrying out weighted calculation on the hash codes respectively corresponding to the numerical characteristic, the character characteristic, the serialization characteristic and the graph characteristic to obtain a specific characteristic vector corresponding to the sample.
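The plain and weighted combinations of hash codes described for step S303 can be sketched as a position-wise (weighted) sum. The function name `fuse_hash_codes`, the dictionary layout and the example weights are assumptions; the text does not say whether the fused vector is binarized afterwards, so the sketch returns the raw sums.

```python
from typing import Dict, List, Optional


def fuse_hash_codes(codes: Dict[str, List[int]],
                    weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Add equal-length hash codes position by position, optionally weighted."""
    lengths = {len(code) for code in codes.values()}
    if len(lengths) != 1:
        raise ValueError("all hash codes must exist and share one length")
    fused = [0.0] * lengths.pop()
    for name, code in codes.items():
        w = 1.0 if weights is None else weights[name]
        for i, bit in enumerate(code):
            fused[i] += w * bit
    return fused


codes = {
    "numerical": [1, 0, 1, 0],
    "character": [1, 1, 0, 0],
    "serialized": [0, 1, 1, 0],
    "graph": [0, 0, 1, 1],
}
plain = fuse_hash_codes(codes)
weighted = fuse_hash_codes(
    codes, {"numerical": 1.0, "character": 2.0, "serialized": 0.5, "graph": 0.5})
```

If a binary vector is needed for Hamming-distance comparison, one option (an assumption, not stated in the text) is to threshold the fused values in the same way as in step S3024.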

In this embodiment, the weight values corresponding to the numerical features, the character-type features, the serialized features and the graph features may be determined as follows: the numerical features are normalized to obtain their weight values; the weight values of the character-type features are determined by the term frequency-inverse document frequency (TF-IDF) algorithm; and the weight values of the serialized features and the graph features are preset. Specifically, for character-type or Boolean features, the frequency of occurrence of a feature value (TF) and its inverse document frequency over the whole sample set (IDF) are computed, and the TF-IDF method is used to calibrate the weight value; for the Embedding-based hash codes obtained from the serialized features, a pre-calibrated empirical weight is used as the weight of the corresponding hash code.
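As a rough illustration of the TF-IDF weighting for character-type features, the sketch below uses one common textbook variant, weight = TF × ln(N / DF). The text does not specify which TF-IDF variant is used, so the formula, the function name `tf_idf_weight` and the toy domain names are assumptions.

```python
import math
from typing import List, Sequence


def tf_idf_weight(value: str, sample: Sequence[str],
                  corpus: List[Sequence[str]]) -> float:
    """TF-IDF weight of one feature value within one sample.

    TF  = occurrences of the value in the sample / number of values in the sample
    IDF = ln(number of samples / number of samples containing the value)
    """
    if not sample or not corpus:
        return 0.0
    tf = sample.count(value) / len(sample)
    df = sum(1 for other in corpus if value in other)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Each sample is the list of domain names it contacted (toy data).
corpus = [["a.example", "b.example"], ["a.example"],
          ["a.example", "c.example"], ["a.example"]]
common = tf_idf_weight("a.example", corpus[0], corpus)
rare = tf_idf_weight("b.example", corpus[0], corpus)
```

A value that appears in every sample receives weight 0, while a value specific to few samples receives a larger weight, which is the behavior one would want when the weight is meant to emphasize sample-specific information.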

Alternatively, the weight values corresponding to the numerical features, the character-type features, the serialized features and the graph features may be determined as follows: the numerical features, the character-type features, the serialized features and the graph features are combined to obtain summary features; the summary features are input into a weight recognition model to obtain the weight value corresponding to each feature. The weight recognition model is trained on summary feature samples and the weight values corresponding to the features in those samples, where the weight values of the features are determined according to the TF-IDF algorithm.

Specifically, hash codes corresponding to the numerical type features, the character type features, the serialization features and the graph features are combined to obtain summary features, and then the summary features are input into a weight identification model to obtain weight values corresponding to the features.

Step S304: performing clustering calculation on the specific feature vectors and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters.

Step S305, determining a cluster corresponding to the specific characteristic vector of the target program code, and determining whether the target program code belongs to the network virus according to the determined cluster.

Identifying whether the target program code is a network virus according to the plurality of clusters comprises: acquiring the original features of the target program code; determining the specific feature vector corresponding to those original features; and determining, through the clustering algorithm, whether the specific feature vector of the target program code belongs to a cluster. If it does, the network virus corresponding to the target program code is determined according to the cluster label of the cluster to which the specific feature vector belongs, where the cluster label represents the virus type of the corresponding cluster. If it does not belong to any cluster, it is determined that the target program code is not a network virus.

For example, a K-nearest-neighbor search is performed for the target program code with a maximum valid distance threshold of 6 and K = 20. Samples with virus labels whose specific feature vectors lie at a distance of less than 6 are valid neighbors. Suppose 100 valid neighbors are found in total, at distances ranging from 0 to 5. The 100 neighbors are sorted by distance in ascending order and the nearest 20 are selected. The virus labels of these 20 samples (sample family, whether the sample is a self-extracting archive, whether the sample is packed) are then put to a vote. If 19 of the 20 samples are labeled trojan and 1 is labeled worm, and none is labeled as packed or as a self-extracting archive, the target program code is judged to be a trojan that is neither a self-extracting archive nor a packed file.
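The thresholded K-nearest-neighbor vote in this example can be sketched as follows. For brevity the sketch votes on a single label per sample, whereas the example votes on several label fields; the function names, the binary vector representation and the toy data are assumptions.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_vote(target: Sequence[int],
             labeled_samples: List[Tuple[Sequence[int], str]],
             k: int = 20, max_distance: int = 6) -> Optional[str]:
    """Vote among the k nearest labeled samples closer than max_distance.

    Returns None when there is no valid neighbor, i.e. the target is not
    attributed to any known virus type.
    """
    valid = []
    for vector, label in labeled_samples:
        d = hamming_distance(target, vector)
        if d < max_distance:
            valid.append((d, label))
    if not valid:
        return None
    valid.sort(key=lambda pair: pair[0])  # ascending distance, stable on ties
    votes = Counter(label for _, label in valid[:k])
    return votes.most_common(1)[0][0]


target = [0] * 8
samples = [
    ([1, 0, 0, 0, 0, 0, 0, 0], "trojan"),  # distance 1
    ([0, 1, 0, 0, 0, 0, 0, 0], "trojan"),  # distance 1
    ([1, 1, 0, 0, 0, 0, 0, 0], "worm"),    # distance 2
    ([1, 1, 1, 1, 1, 1, 1, 0], "worm"),    # distance 7, not a valid neighbor
]
verdict = knn_vote(target, samples, k=3, max_distance=6)
```

Voting on additional fields such as "packed" or "self-extracting" would amount to running the same vote once per field.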

The invention provides a network virus identification method, which is characterized in that original characteristics and virus labels respectively corresponding to samples of various virus types are determined, wherein the original characteristics comprise numerical type characteristics, character type characteristics, serialization characteristics and graph characteristics; calculating hash codes corresponding to all the characteristics in the original characteristics respectively; determining a specific characteristic vector corresponding to the sample according to the hash codes corresponding to all the characteristics; performing clustering calculation on the specific characteristic vectors and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clustering clusters; and finally, determining a cluster corresponding to the specific characteristic vector of the target program code, and determining whether the target program code belongs to the network virus according to the determined cluster. The invention utilizes the specific characteristic vector fusion technology, realizes the characteristic dimension reduction and the formatting expression, greatly retains the sample specific information in the multisource characteristics, and then performs clustering calculation according to the specific characteristic vector to determine whether the target program code belongs to the network virus, thereby improving the efficiency and the accuracy of identifying the network virus.

Example three

Referring to fig. 4, another network virus identification method according to an embodiment of the present invention includes steps S401 to S405:

Step S401: determining the specific feature vector of the target program code.

It should be noted that, in this embodiment, the specific feature vector in step S401 is determined in the same way as described for the corresponding step in FIG. 3, and the description is not repeated here.

Step S402: according to the specific feature vector of the target program code, acquiring the network virus corresponding to the target program code as determined by the clustering algorithm and the network virus corresponding to the target program code as determined by the virus library.

It should be noted that, in this embodiment, the network virus corresponding to the target program code is obtained from the clustering algorithm in the same way as described for the corresponding step in FIG. 3, and the description is not repeated here.

In this embodiment, obtaining the network virus corresponding to the target program code as determined by the virus library, according to the specific feature vector of the target program code, includes: calculating the similarity between the specific feature vector of the target program code and the virus features of the various types in the virus library; and determining the virus type corresponding to each specific feature vector in the virus library whose similarity exceeds a preset value as a network virus corresponding to the target program code.

The virus library stores the specific feature vectors of various types of viruses together with their corresponding virus types. In this embodiment, after the specific feature vector of the target program code is obtained, its similarity to each specific feature vector in the virus library is calculated, and the virus type corresponding to each virus feature whose similarity exceeds the preset value is determined as a network virus corresponding to the target program code. The preset value may be set according to actual requirements, for example 80%, 85% or 90%, and is not specifically limited in this embodiment.

For example, suppose the virus library contains the specific feature vectors of 5 virus types: virus type 1, virus type 2, virus type 3, virus type 4 and virus type 5. If the similarity between the specific feature vector of the target program code and the specific feature vectors of virus types 1 to 5 is 65%, 60%, 90%, 54% and 89%, respectively, and the preset value is 85%, then virus type 3 and virus type 5 are determined to be the network viruses corresponding to the target program code.
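The virus library lookup can be sketched as a similarity computation followed by a threshold filter. The text does not define the similarity measure, so the Hamming-based similarity below (share of matching positions) is an assumption, as are the function names; the similarity values in the usage lines are the ones from the example above.

```python
from typing import Dict, List, Sequence


def hamming_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of positions at which two equal-length binary vectors agree."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return 1.0 - mismatches / len(a)


def match_virus_library(similarities: Dict[str, float],
                        preset_value: float = 0.85) -> List[str]:
    """Return the virus types whose similarity exceeds the preset value."""
    return [name for name, s in similarities.items() if s > preset_value]


# Similarities taken from the example in the text.
similarities = {
    "virus type 1": 0.65,
    "virus type 2": 0.60,
    "virus type 3": 0.90,
    "virus type 4": 0.54,
    "virus type 5": 0.89,
}
matches = match_virus_library(similarities, preset_value=0.85)
```

The filter uses a strict comparison because the text speaks of similarity "exceeding" the preset value.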

Step S403, calculating the probability values of the network viruses that are the same in the results determined by the clustering algorithm and by the virus library.

Step S404, determining the network virus with the highest probability value as the network virus corresponding to the target program code.

For example, in step S402, the network viruses corresponding to the target program code determined according to the virus library are virus type 3 and virus type 5, where the similarity (that is, the probability value) of virus type 3 is 90% and that of virus type 5 is 89%. The network viruses corresponding to the target program code determined according to the clustering algorithm are virus type 3 and virus type 2, where the probability value of the cluster containing virus type 3 is 90% and that of the cluster containing virus type 2 is 20%. The probability values of the network virus identified by both methods (virus type 3) are averaged, giving (90% + 90%) / 2 = 90%. The resulting probability values are 90% for virus type 3, 89% for virus type 5, and 20% for virus type 2, so the network virus with the highest probability value, virus type 3, is determined as the network virus corresponding to the target program code.
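Steps S403 and S404 can be sketched as below. This is an illustrative reading of the example, assuming that a virus type found by both methods gets the average of its two probability values and that a type found by only one method keeps its single value; the function names `fuse_results` and `pick_most_likely` are hypothetical.

```python
def fuse_results(cluster_probs, library_probs):
    """Average the probability values of virus types reported by both methods.

    A type reported by only one method keeps that method's value (assumption).
    """
    fused = {}
    for virus_type in set(cluster_probs) | set(library_probs):
        values = [
            probs[virus_type]
            for probs in (cluster_probs, library_probs)
            if virus_type in probs
        ]
        fused[virus_type] = sum(values) / len(values)
    return fused


def pick_most_likely(fused):
    """Return the virus type with the highest probability, or None if empty."""
    if not fused:
        return None
    return max(fused, key=fused.get)


# Values from the worked example in the text.
library_result = {"virus type 3": 0.90, "virus type 5": 0.89}
cluster_result = {"virus type 3": 0.90, "virus type 2": 0.20}

fused = fuse_results(cluster_result, library_result)
print(pick_most_likely(fused))  # virus type 3
```

Returning `None` for an empty input is a design choice made here so that a sample matched by neither method is not forced into a virus type.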

In the network virus identification method provided by the present invention, the network virus corresponding to the target program code determined by the clustering algorithm and the network virus corresponding to the target program code determined by the virus library are both obtained according to the specific feature vector of the target program code. The two results are then combined to determine the network virus corresponding to the target program code, which further improves the accuracy of network virus identification.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In an embodiment, an apparatus for identifying a network virus is provided, which corresponds one-to-one to the network virus identification method in the foregoing embodiments. As shown in fig. 5, the functional modules of the network virus identification apparatus are described in detail as follows:

a determining module 51, configured to determine original features and virus labels corresponding to multiple types of virus samples, respectively;

a preprocessing module 52, configured to process the original features;

the training module 53 is configured to perform model training on the processed original features and the virus labels to obtain a virus classification and identification model;

a generating module 54, configured to extract partial parameters of the virus classification and identification model, and generate a feature fusion model;

an obtaining module 55, configured to input a target program code into the feature fusion model, so as to obtain a specific feature vector of the target program code;

the determining module 51 is further configured to determine a cluster corresponding to the specific feature vector of the target program code, and determine whether the target program code belongs to a network virus according to the determined cluster.

In an alternative embodiment, the raw features include numeric, character, serialized, and graph features; the preprocessing module 52 is specifically configured to perform normalization processing on the numeric characteristics to obtain a target numeric value, and convert the character-type characteristics, the serialized characteristics, and the graph characteristics into corresponding first, second, and third vectors.
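The preprocessing step can be illustrated with a small sketch. This passage does not fix the normalization or the vectorization technique, so min-max scaling and a hashed bag-of-tokens encoding are assumptions chosen for brevity, and `min_max_normalize` and `hashed_token_vector` are hypothetical names. Only the numeric and character-type features are shown; serialized and graph features would need their own encoders.

```python
import hashlib


def min_max_normalize(values):
    """Scale numeric features into [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]


def hashed_token_vector(tokens, dim=8):
    """Map character-type features to a fixed-length count vector.

    Each token is hashed with SHA-256 so the bucket index is stable across
    runs (Python's built-in hash() is randomized for strings).
    """
    vector = [0.0] * dim
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vector[index] += 1.0
    return vector


print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
first_vector = hashed_token_vector(["CreateFileA", "WriteFile", "CreateFileA"])
print(len(first_vector), sum(first_vector))  # 8 3.0
```

The token strings in the usage lines are placeholders for whatever character-type features are extracted from a sample.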

In an alternative embodiment, the generating module 54 is specifically configured to:

removing an output layer in the virus classification identification model, and extracting the residual model parameters of the virus classification identification model;

taking the specific feature vector that is input to the output layer as the output of the feature fusion model;

and generating a feature fusion model by using the residual model parameters.
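The idea of removing the output layer and reusing the remaining parameters can be sketched with a toy fully connected network in plain Python. The network structure, the ReLU activation, and the names `TinyClassifier` and `build_feature_fusion_model` are assumptions for illustration only; the text does not specify the model architecture.

```python
def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by ReLU."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


class TinyClassifier:
    """Toy stand-in for the trained virus classification recognition model."""

    def __init__(self, hidden_layers, output_layer):
        self.hidden_layers = hidden_layers  # list of (weights, biases)
        self.output_layer = output_layer    # (weights, biases) of the classifier head

    def hidden_output(self, features):
        out = features
        for weights, biases in self.hidden_layers:
            out = dense_relu(out, weights, biases)
        return out

    def predict_scores(self, features):
        hidden = self.hidden_output(features)
        weights, biases = self.output_layer
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(weights, biases)]


def build_feature_fusion_model(classifier):
    """Drop the output layer and keep the remaining parameters.

    The returned function outputs what used to be the input of the output
    layer, which serves as the specific feature vector.
    """
    remaining = list(classifier.hidden_layers)

    def feature_fusion_model(features):
        out = features
        for weights, biases in remaining:
            out = dense_relu(out, weights, biases)
        return out

    return feature_fusion_model


clf = TinyClassifier(
    hidden_layers=[([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])],
    output_layer=([[1.0, 1.0]], [0.5]),
)
fusion = build_feature_fusion_model(clf)
print(fusion([2.0, 3.0]))  # [2.0, 0.0]
```

In this sketch the feature fusion model shares the hidden-layer parameters with the classifier, so no retraining is needed to obtain feature vectors.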

In an optional embodiment, the apparatus further comprises: a calculation module 56;

a calculating module 56, configured to calculate hash codes corresponding to the numerical feature, the character feature, the serialization feature, and the graph feature respectively;

the determining module 51 is further configured to determine a specific feature vector corresponding to the sample according to hash codes corresponding to the numerical feature, the character-type feature, the serialization feature, and the graph feature respectively.

In an alternative embodiment, the determining module 51 is specifically configured to:

determining weight values corresponding to the numerical characteristic, the character type characteristic, the serialization characteristic and the graph characteristic respectively;

and performing weighted calculation on hash codes respectively corresponding to the numerical characteristic, the character characteristic, the serialization characteristic and the graph characteristic to obtain a specific characteristic vector corresponding to the sample.
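One possible reading of the weighted calculation is an element-wise weighted sum of equal-length hash codes, sketched below. The text does not state how the hash codes are computed or combined, so the bit-list representation, the element-wise sum, and the name `weighted_hash_fusion` are all assumptions.

```python
def weighted_hash_fusion(hash_codes, weights):
    """Combine per-feature hash codes into one vector by weighted sum.

    hash_codes: feature name -> list of bits, all of the same length.
    weights:    feature name -> weight value for that feature.
    """
    names = list(hash_codes)
    length = len(hash_codes[names[0]])
    if any(len(hash_codes[name]) != length for name in names):
        raise ValueError("all hash codes must have the same length")
    fused = [0.0] * length
    for name in names:
        weight = weights[name]
        for i, bit in enumerate(hash_codes[name]):
            fused[i] += weight * bit
    return fused


# Hypothetical 4-bit hash codes and weights, for illustration only.
codes = {
    "numeric": [1, 0, 1, 0],
    "character": [1, 1, 0, 0],
    "serialized": [0, 1, 1, 0],
    "graph": [0, 0, 0, 1],
}
weights = {"numeric": 0.5, "character": 0.25, "serialized": 0.125, "graph": 0.125}
print(weighted_hash_fusion(codes, weights))  # [0.75, 0.375, 0.625, 0.125]
```

An alternative design would scale each hash code by its weight and concatenate the results, which preserves which feature type contributed each component at the cost of a longer vector.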

In an alternative embodiment, the determining module 51 is specifically configured to:

determining whether the specific characteristic vector of the target program code belongs to a cluster through the clustering algorithm;

if the specific feature vector of the target program code belongs to a cluster, determining the network virus corresponding to the target program code according to the cluster label of that cluster, wherein the cluster label represents the virus type of the corresponding cluster;

and if the specific feature vector of the target program code does not belong to any cluster, determining that the target program code does not belong to a network virus.
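The cluster-membership decision can be sketched as follows. The clustering algorithm is not named in this passage, so the sketch assumes that each cluster is summarized by a centroid, a radius, and a label, and that membership is tested with Euclidean distance; the name `assign_cluster` and the dictionary layout are hypothetical.

```python
import math


def assign_cluster(vector, clusters):
    """Return the label of the nearest cluster that contains the vector.

    Returns None when the vector falls outside every cluster, which is
    interpreted as "not a network virus".
    """
    best_label, best_distance = None, math.inf
    for cluster in clusters:
        distance = math.dist(vector, cluster["centroid"])
        if distance <= cluster["radius"] and distance < best_distance:
            best_label, best_distance = cluster["label"], distance
    return best_label


# Hypothetical clusters, for illustration only.
clusters = [
    {"label": "virus type 2", "centroid": [0.0, 0.0], "radius": 1.0},
    {"label": "virus type 3", "centroid": [5.0, 5.0], "radius": 1.0},
]
print(assign_cluster([5.0, 5.5], clusters))    # virus type 3
print(assign_cluster([10.0, 10.0], clusters))  # None
```

With a density-based algorithm, the same outcome would correspond to a sample being labeled as noise rather than falling outside a fixed radius.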

In an optional embodiment, the calculating module 56 is further configured to calculate similarity between the specific feature vector of the target program code and various types of virus features in a virus library; the virus library stores virus types respectively corresponding to specific characteristic vectors of various types of viruses;

the determining module 51 is further configured to determine the virus category corresponding to the specific feature vector with the similarity exceeding a preset value in the virus library as the network virus corresponding to the target program code.

In an optional embodiment, the calculating module 56 is further configured to obtain the network virus corresponding to the target program code determined by the clustering algorithm and the network virus corresponding to the target program code determined by the virus library, and to calculate the probability values of the network viruses that are the same in the results determined by the clustering algorithm and by the virus library;

the determining module 51 is further configured to determine the network virus with the highest probability value as the network virus corresponding to the target program code.

For specific limitations of the network virus identification apparatus, reference may be made to the above limitations of the network virus identification method, which are not repeated here. Each module in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a network virus identification method.

In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

processing the original features;

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

processing the original features;

In one embodiment, a computer program product is provided, the computer program product comprising a computer program executed by a processor to perform the steps of:

processing the original features;

It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A method for identifying a network virus, the method comprising:

processing the original features;

2. The method of claim 1, wherein the raw features comprise numeric, character, serialized, and graph features;

the processing the original features comprises: normalizing the numerical feature to obtain a target numerical value, and converting the character-type feature, the serialization feature and the graph feature into a corresponding first vector, second vector and third vector.

3. The method of claim 1, wherein the extracting partial parameters of the virus classification identification model to generate a feature fusion model comprises:

and generating a feature fusion model by using the residual model parameters.

4. The method of claim 2, further comprising:

calculating hash codes corresponding to the numerical type feature, the character type feature, the serialization feature and the graph feature respectively;

and determining a specific feature vector corresponding to the sample according to the hash codes respectively corresponding to the numerical characteristic, the character type characteristic, the serialization characteristic and the graph characteristic.

5. The method according to claim 4, wherein the determining the specific feature vector corresponding to the sample according to the hash codes corresponding to the numerical feature, the character-type feature, the serialization feature and the graph feature respectively comprises:

6. The method according to any one of claims 1-5, wherein determining whether the target program code belongs to a network virus according to the determined cluster comprises:

7. The method of claim 6, further comprising:

calculating the similarity between the specific characteristic vector of the target program code and various types of virus characteristics in a virus library; the virus library stores virus types respectively corresponding to specific characteristic vectors of various types of viruses;

and determining the virus type corresponding to the specific characteristic vector with similarity exceeding a preset value in the virus library as the network virus corresponding to the target program code.

8. The method of claim 7, further comprising:

acquiring the network virus corresponding to the target program code determined by the clustering algorithm and the network virus corresponding to the target program code determined by the virus library;

calculating probability values of the network viruses which belong to the same network virus and are determined by the clustering algorithm and the virus library;

and determining the network virus with the highest probability value as the network virus corresponding to the target program code.

9. An apparatus for identifying a network virus, the apparatus comprising:

the preprocessing module is used for processing the original features;

10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the network virus identification method according to any one of claims 1 to 8 when executing the computer program.

11. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the network virus identification method according to any one of claims 1 to 8.