CN118051908A - Malicious code homology detection method, device, equipment and storage medium - Google Patents

Malicious code homology detection method, device, equipment and storage medium

Info

Publication number
CN118051908A
Authority
CN
China
Prior art keywords
malicious code
image
detected
feature extraction
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311579624.9A
Other languages
Chinese (zh)
Inventor
吴畑
安晓宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Network Security Technology Co Ltd
Original Assignee
Beijing Topsec Network Security Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Network Security Technology Co Ltd filed Critical Beijing Topsec Network Security Technology Co Ltd
Priority to CN202311579624.9A priority Critical patent/CN118051908A/en
Publication of CN118051908A publication Critical patent/CN118051908A/en
Pending legal-status Critical Current


Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a malicious code homology detection method, device, equipment and storage medium, and relates to the technical field of network security. The method comprises the following steps: acquiring malicious code to be detected and converting it into an image to be detected; inputting the image to be detected into a pre-trained feature extraction model to obtain the image features output by the feature extraction model for the image to be detected, the feature extraction model being obtained by training on a preset malicious code data set in a self-supervised learning mode; and classifying the image features with a preset classification model to obtain a homology detection result corresponding to the malicious code to be detected. Because the feature extraction model is trained in a self-supervised manner, deep features of malicious code samples can be extracted even in scenarios lacking annotated data, so that effective malicious code homology analysis can be performed, and the flexibility of malicious code homology detection based on image classification technology is effectively improved.

Description

Malicious code homology detection method, device, equipment and storage medium
Technical Field
The application relates to the technical field of network security, and in particular to a malicious code homology detection method, device, equipment and storage medium.
Background
With the popularization of the internet and digital technology, malicious code has become a continuing problem in the field of network security. Various malicious code detection methods exist, but as malicious code changes and evolves, conventional signature-based or behavior-based detection techniques find it increasingly difficult to identify complex and diverse malicious code. Identification based on image classification is an advanced malicious code homology analysis method whose core idea is to convert the code to be detected into an image form and then apply image processing and machine learning techniques to analyze and classify it.
At present, malicious code detection technology based on image classification is highly dependent on existing labeled data and lacks the capability of learning features from large amounts of unlabeled data, so effective homology analysis cannot be performed on malicious code, and the flexibility of malicious code homology detection is therefore not high.
Disclosure of Invention
The embodiment of the application aims to provide a malicious code homology detection method, device, equipment and storage medium, which are used for improving the flexibility of malicious code homology detection based on image classification technology.
In a first aspect, an embodiment of the present application provides a malicious code homology detection method, including:
Acquiring malicious codes to be detected, and converting the malicious codes to be detected into images to be detected;
Inputting the image to be detected into a pre-trained feature extraction model to obtain image features, which are output by the feature extraction model and correspond to the image to be detected; the feature extraction model is obtained by training based on a preset malicious code data set in a self-supervision learning mode;
and classifying the image features by using a preset classification model to obtain a homologous detection result corresponding to the malicious code to be detected.
According to the embodiment of the application, the feature extraction model is trained in a self-supervised learning mode, so that deep features of malicious code samples can be extracted even in scenarios lacking labeled data, effective malicious code homology analysis can be realized, and the flexibility of malicious code homology detection based on image classification technology is effectively improved.
In some possible embodiments, the training process of the feature extraction model includes:
loading malicious code image samples from the malicious code dataset;
masking the malicious code image sample based on a preset masking strategy;
training a pre-constructed self-supervision learning model by using the malicious code image sample subjected to mask processing to obtain the feature extraction model.
In the embodiment of the application, masking the malicious code image samples improves the feature extraction capability of the trained model, so that deeper feature information of the malicious code samples can be extracted and the flexibility of malicious code homology detection based on image classification technology is improved.
In some possible embodiments, the training the pre-built self-supervised learning model by using the malicious code image samples subjected to the mask processing to obtain the feature extraction model includes:
inputting the malicious code image sample subjected to mask processing into a pre-constructed self-supervision learning model so as to enable an encoder of the self-supervision learning model to learn and output corresponding characteristic information;
performing sample reconstruction by using a decoder of the self-supervision learning model based on the characteristic information output by the encoder to obtain a decoded image sample;
Determining loss indexes of the decoded image samples and the corresponding malicious code image samples based on a preset loss function, and training the self-supervision learning model with the loss indexes minimized as targets;
And when the training process of the self-supervision learning model reaches a preset training completion condition, obtaining the feature extraction model.
In the embodiment of the application, the encoder learns and extracts the feature information used to reconstruct the decoded image sample, and the model is trained by minimizing the loss between the reconstructed image and the original image, so that the flexibility of malicious code homology detection is further improved.
In some possible embodiments, the feature extraction model is obtained when the training process of the self-supervised learning model reaches a preset training completion condition, and specifically includes:
When the training process of the self-supervision learning model reaches a preset training completion condition, an encoder of the self-supervision learning model obtained through training is used as the feature extraction model.
In the embodiment of the application, the encoder part of the trained self-supervision learning model is directly adopted as the feature extraction model, so that the feature extraction process of malicious code samples is simplified, and the flexibility of malicious code homology detection is further improved.
In some possible embodiments, the preset loss function includes at least one of pixel loss, perceptual loss, and adversarial loss; the determining the loss index of the decoded image sample and the corresponding malicious code image sample based on a preset loss function comprises the following steps:
And determining loss indexes of the decoded image samples and the corresponding malicious code image samples based on at least one preset loss function and its corresponding hyper-parameters.
In the embodiment of the application, the training loss index is determined by integrating a plurality of loss functions, so that the effects of model learning and feature extraction can be improved, and the accuracy of malicious code homology detection is further improved.
In some possible embodiments, the classifying the image features by using a preset classification model to obtain a homologous detection result corresponding to the malicious code to be detected includes:
Classifying the image features based on known malicious code family information by using a preset multi-class support vector machine, and determining whether the malicious code to be detected belongs to the known malicious code family;
And when judging that the malicious code to be detected does not belong to the known malicious code family, classifying the image features by utilizing the multi-class support vector machine based on the pre-collected code sample to be classified to obtain a homologous detection result of the malicious code to be detected relative to the code sample to be classified.
In the embodiment of the application, the homology detection is firstly carried out on the malicious code to be detected based on the known malicious code family, and whether the malicious code is homologous with other code samples is detected when the malicious code does not belong to the known malicious code family, so that the flexibility of the homology detection of the malicious code is further improved.
In some possible embodiments, after the acquiring the malicious code to be detected and converting the malicious code to be detected into the image to be detected, before the inputting the image to be detected into the feature extraction model trained in advance, the method further includes:
And scaling the converted image to be detected based on the preset image size to obtain the image to be detected with uniform size.
In the embodiment of the application, the image to be detected is converted into the image with the uniform size, so that the calculation efficiency of the model is improved, and the flexibility of homologous detection of malicious codes is further improved.
In a second aspect, an embodiment of the present application provides a malicious code homology detection apparatus, including:
the image conversion module is used for acquiring malicious codes to be detected and converting the malicious codes to be detected into images to be detected;
The feature extraction module is used for inputting the image to be detected into a pre-trained feature extraction model to obtain image features, which are output by the feature extraction model and correspond to the image to be detected; the feature extraction model is obtained by training based on a preset malicious code data set in a self-supervision learning mode;
and the homology detection module is used for classifying the image features by utilizing a preset classification model to obtain a homology detection result corresponding to the malicious code to be detected.
In a third aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method according to any of the embodiments of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer program product comprising a computer program, wherein the computer program when executed by a processor implements the method according to any of the embodiments of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor executes the program to implement the method according to any one of the embodiments of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a malicious code homology detection method provided by an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a malicious code homology detection device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, an embodiment of the present application provides a malicious code homology detection method, which may include the steps of:
s1, acquiring malicious codes to be detected, and converting the malicious codes to be detected into images to be detected.
In the embodiment of the application, step S1 converts the malicious code to be detected into an image form, so as to visualize the malicious code to be detected. Specifically, the malicious code to be detected is first converted into a binary file, and the binary file is then converted into a one-dimensional array of 8-bit unsigned integers, where each element (one element per 8 bits) takes a value between 0 and 255 and corresponds to the gray value of one image pixel. Then, a linear mapping method is used to convert the one-dimensional array into a two-dimensional image, where the width and height of the image can be determined according to the size of the binary file: for example, the image width is 32 pixels when the file size is 0-10 KB, 64 pixels when the file size is 10-30 KB, and so on; the specific file size intervals and the corresponding image width or height values can be set according to actual requirements.
It should be noted that the image to be detected may be a gray-scale image or an RGB image. Specifically, the structure of the binary file may be mapped to the three RGB channels of each pixel in the image, resulting in an RGB image with richer semantic information.
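As a hedged illustration of this conversion step, the following Python sketch maps a binary file to a gray-scale (or naive RGB) image. The width table and the final 256×256 scaling follow the examples given above; the fallback width of 256 pixels for larger files and the simple channel-copy RGB mapping are assumptions, since the description leaves those details open.

```python
# Illustrative sketch of the binary-to-image conversion described above.
# The width intervals and the 256x256 target size follow the examples in the
# description; the fallback width and the RGB mapping are assumptions.
import numpy as np
from PIL import Image

def malware_to_image(path: str, rgb: bool = False) -> Image.Image:
    data = np.fromfile(path, dtype=np.uint8)      # one-dimensional 8-bit unsigned integer array (0-255)
    size_kb = data.size / 1024

    if size_kb < 10:                              # 0-10 KB  -> width 32 pixels
        width = 32
    elif size_kb < 30:                            # 10-30 KB -> width 64 pixels
        width = 64
    else:                                         # assumed fallback for larger files
        width = 256

    height = int(np.ceil(data.size / width))
    padded = np.zeros(width * height, dtype=np.uint8)
    padded[:data.size] = data                     # pad the last row with zeros
    img = padded.reshape(height, width)           # linear mapping to a two-dimensional image

    if rgb:                                       # naive 3-channel copy; the structure-aware RGB
        img = np.stack([img] * 3, axis=-1)        # mapping mentioned above is not specified
    return Image.fromarray(img).resize((256, 256))  # unify the size, as in the later scaling step

# image_to_detect = malware_to_image("sample.bin")
```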
S2, inputting the image to be detected into a pre-trained feature extraction model to obtain image features, which are output by the feature extraction model and correspond to the image to be detected; the feature extraction model is obtained by training based on a preset malicious code data set in a self-supervision learning mode.
After the image to be detected corresponding to the malicious code to be detected is obtained, the image to be detected can be input into a pre-trained feature extraction model so as to extract corresponding image features. The feature extraction model can be obtained by training in a self-supervision learning mode based on a large number of malicious code data sets.
It should be noted that, unlike the conventional supervised learning mode, self-supervised learning does not depend on label data but learns useful feature information from the data itself, so that the deep structure and pattern information of the data can be captured without performing cumbersome data labeling on the preset malicious code data set.
And S3, classifying the image features by using a preset classification model to obtain a homologous detection result corresponding to the malicious code to be detected.
After extracting the feature information of the image to be detected corresponding to the malicious code to be detected, a preset classification model (for example, a model classified based on various machine learning or statistical methods) may be used to classify the image to determine whether the malicious code to be detected belongs to a known malicious code family or belongs to the same source as other code samples.
Before classifying the image features, feature dimension reduction processing may be performed by a PCA (principal component analysis) method, so as to reduce the calculation amount of the model and improve the efficiency of model processing.
Based on the method, the feature extraction model is trained in a self-supervised learning mode, so that deep features of malicious code samples can be extracted even in scenarios lacking annotated data, effective malicious code homology analysis can be realized, and the flexibility of malicious code homology detection based on image classification technology is effectively improved.
In some possible embodiments, the training process of the feature extraction model includes:
Loading malicious code image samples from a malicious code dataset;
Masking the malicious code image sample based on a preset masking strategy;
Training a pre-constructed self-supervision learning model by using the malicious code image sample subjected to mask processing to obtain a feature extraction model.
In some possible embodiments, further, training a pre-built self-supervised learning model by using the malicious code image samples subjected to the mask processing to obtain a feature extraction model, which specifically includes:
inputting the malicious code image sample subjected to mask processing into a pre-constructed self-supervision learning model so as to enable an encoder of the self-supervision learning model to learn and output corresponding characteristic information;
a decoder utilizing a self-supervision learning model carries out sample reconstruction based on the characteristic information output by the encoder to obtain a decoded image sample;
Determining loss indexes of the decoded image samples and the corresponding malicious code image samples based on a preset loss function, and training a self-supervision learning model with the loss indexes minimized as targets;
and when the training process of the self-supervision learning model reaches the preset training completion condition, obtaining a feature extraction model.
In some possible embodiments, when the training process of the self-supervised learning model reaches a preset training completion condition, a feature extraction model is obtained, which specifically includes:
when the training process of the self-supervision learning model reaches the preset training completion condition, the encoder of the self-supervision learning model obtained through training is used as a feature extraction model.
It should be noted that, when training the feature extraction model, malicious code samples are first loaded from a preset malicious code data set, and these samples can be converted into image form in the same manner as described above. The preset malicious code data set may be the MalNet data set, currently the largest public binary-image database in the world; it contains more than 1.2 million binary images and covers a hierarchy of 47 types and 696 families. Compared with the traditionally popular Malimg database, the MalNet data set provides about 133 times more images and nearly 28 times more families, so the malicious code homology analysis is closer to the actual scenario.
For the obtained malicious code image samples, masking processing can be performed first: a certain proportion of regions in the input sample are randomly masked, and the resulting mask-perturbed sample is used as a new input of the model, so that the required supervision signal is obtained in an unsupervised manner. It should be noted that the masking proportion may be set to a relatively high value, such as 75% or more, to ensure that the input samples are perturbed to a large extent. As an example, assuming that the malicious code image sample is converted into an RGB image of size 256×256, the image may be divided into non-overlapping 16×16 tiles, and the masking process samples these tiles and randomly selects 75% of them to mask (that is, the image data of the corresponding tiles are deleted).
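A minimal sketch of this masking step is given below, assuming a 256×256 image split into non-overlapping 16×16-pixel tiles with a 75% masking ratio; masked tiles are simply zeroed out, which stands in for deleting their image data.

```python
# Hedged sketch of the random patch masking described above (75% of 16x16 tiles).
import torch

def random_mask_patches(img: torch.Tensor, patch: int = 16, ratio: float = 0.75) -> torch.Tensor:
    """img: (C, H, W) tensor; returns a copy with ~75% of the tiles zeroed out."""
    c, h, w = img.shape
    masked = img.clone()
    n_h, n_w = h // patch, w // patch
    n_masked = int(n_h * n_w * ratio)
    chosen = torch.randperm(n_h * n_w)[:n_masked]          # randomly sample tiles to mask
    for idx in chosen.tolist():
        r, col = divmod(idx, n_w)
        masked[:, r * patch:(r + 1) * patch, col * patch:(col + 1) * patch] = 0.0
    return masked

# masked_sample = random_mask_patches(torch.rand(3, 256, 256))
```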
Based on the method, the mask processing is carried out on the malicious code image sample, so that the feature extraction capability of the training model can be improved, deeper feature information of the malicious code sample is extracted, and the flexibility of malicious code homology detection based on the image classification technology is improved.
It should be noted that the structure of the self-supervised learning model (Masked Autoencoder, MAE) includes an encoder and a decoder. The encoder processes the mask-processed input samples to learn their feature representations, taking the masked samples as input and outputting corresponding feature vectors. The core goal of the encoder is to learn the randomly masked input content and mine valid features from the corrupted input data.
It will be appreciated that the characteristic information output by the encoder will be input to a decoder of the self-supervised learning model, which functions to attempt to reconstruct the original samples (decoded image samples) from the characteristic information output by the encoder and provide guidance for model training. The decoder takes as input the characteristic information output by the encoder and restores the specific input sample content (decoded image samples) by stepwise upsampling.
It should be noted that the error between the decoded image sample output by the decoder and the original sample (malicious code image sample) can be measured according to the preset loss function, and the training target of the self-supervised learning model is to minimize this error. An Adan optimizer (Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models) may be selected to minimize the error between the decoded image samples and the original samples.
When the training process of the self-supervised learning model reaches the preset training completion condition, for example when the set number of optimization iterations has been reached, the trained feature extraction model can be obtained.
It should be noted that during model training, the encoder encodes the input data into a low-dimensional implicit representation for the decoder to reconstruct the original data from the information of these implicit representations. Since the encoder provides a compressed representation of the input data, the encoder portion can be simply extracted directly as a feature extraction model at the completion of model training to extract deep features of malicious code images. Based on the method, the encoder part of the trained self-supervision learning model is directly adopted as the feature extraction model, so that the feature extraction process of malicious code samples is simplified, and the flexibility of malicious code homology detection is further improved.
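The following is a simplified, hedged sketch of this training scheme. A real MAE uses a ViT-style encoder over the visible patches only; here a small convolutional encoder/decoder pair stands in to show the reconstruction objective and how the encoder alone is kept afterwards as the feature extraction model. All layer sizes are assumptions, a plain pixel (MSE) loss is used, and AdamW replaces the Adan optimizer mentioned above; the combined loss discussed next could be substituted.

```python
# Simplified stand-in for the masked-autoencoder training described above.
# Layer sizes are assumptions; AdamW is used here, though the description
# mentions the Adan optimizer.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),    # 256 -> 128
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 64
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 128
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 128 -> 256
)
model = nn.Sequential(encoder, decoder)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
pixel_loss = nn.MSELoss()

def train_step(masked_batch: torch.Tensor, original_batch: torch.Tensor) -> float:
    recon = model(masked_batch)               # decoder reconstructs from the encoder features
    loss = pixel_loss(recon, original_batch)  # minimize the reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, only the encoder is retained as the feature extraction model:
# features = encoder(image_batch)
```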
In some possible embodiments, the preset loss function includes at least one of pixel loss, perceptual loss, and adversarial loss; determining loss indicators of the decoded image samples and the corresponding malicious code image samples based on a preset loss function may include:
and determining loss indexes of the decoded image samples and the corresponding malicious code image samples based on at least one preset loss function and its corresponding hyper-parameters.
It should be noted that, in training the self-supervised learning model, the loss function may use one of, or a combination of, pixel loss, perceptual loss and adversarial loss. The pixel loss is defined on pixel-level differences and ensures that the basic structural feature information of the malicious code samples is accurately captured. It can be expressed as:
L_pixel = (1/(W×H)) Σ_i Σ_j (Y_ij − Ŷ_ij)²
where Y_ij is the pixel value at row i and column j of the malicious code image sample, Ŷ_ij is the pixel value at row i and column j of the reconstructed decoded image sample, and W and H are the width and height of the image, respectively.
It should be noted that, when the self-supervised learning model is used to process image data, merely minimizing the reconstruction error at the pixel level may not be sufficient to obtain a high-quality reconstruction result, so the effect of reconstructing the malicious code image can be further improved by combining a perceptual loss. The perceptual loss helps the model focus on high-level characteristics of the image, such as texture and edge distribution, ensuring that the code image characteristics captured by the model are also similar at the level of human perception, which further enhances the robustness of the model. Specifically, a pre-trained VGG16 model may be used to extract intermediate-layer features of the input image and the reconstructed image, and the differences between these features are computed as the perceptual loss, which may be expressed as:
L_perceptual = Σ_i Σ_j (F(Y)_ij − F(Ŷ)_ij)²
where F(Y)_ij and F(Ŷ)_ij are the feature representations of the malicious code image and the reconstructed decoded image sample, respectively, at a chosen layer of the VGG16 model.
In addition, in order to generate a higher-quality and more realistic reconstruction result, an adversarial loss can also be introduced: a discriminator is added that tries to distinguish the true malicious code image from the reconstructed malicious code image, while the generator (the self-supervised learning model) tries to fool the discriminator into regarding the reconstructed malicious code image as true. Introducing the adversarial loss function subjects the model to adversarial training, so that the model can better distinguish tiny differences between malicious codes and its discrimination capability is improved. The adversarial loss can be expressed as:
L_adv = −log D(Ŷ)
where D(Ŷ) is the probability that the discriminator estimates the reconstructed malicious code image to be a true malicious code image.
It should be noted that, when the above three losses are combined, a weight may be assigned to each loss, which is expressed as:
L_total = α·L_pixel + β·L_perceptual + γ·L_adv
where α, β and γ are three preset hyper-parameters that respectively represent the weights of the three losses. As an example, the three hyper-parameters may take the values α=1, β=0.2 and γ=0.01, respectively.
It will be appreciated that, by combining the adversarial loss with the pixel loss and the perceptual loss as the total training loss, the final reconstruction result combines pixel-level accuracy, similarity of high-level features, and fidelity of the generated image. Based on the method, the training loss index is determined by integrating multiple loss functions, which improves the effects of model learning and feature extraction and further improves the accuracy of malicious code homology detection.
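The sketch below shows one hedged way to assemble such a combined loss in Python: an MSE pixel term, a perceptual term over frozen VGG16 intermediate features, and a generator-side adversarial term. The choice of VGG16 layer, the toy discriminator architecture, and the use of torchvision are assumptions; only the weighting scheme follows the formula above.

```python
# Hedged sketch of the combined loss L_total = a*L_pixel + b*L_perceptual + c*L_adv.
# The VGG16 layer cut-off and the discriminator below are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

vgg_features = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False                      # frozen feature extractor for the perceptual term

discriminator = nn.Sequential(                   # toy discriminator: real vs. reconstructed image
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1), nn.Sigmoid(),
)

def total_loss(original: torch.Tensor, recon: torch.Tensor,
               alpha: float = 1.0, beta: float = 0.2, gamma: float = 0.01) -> torch.Tensor:
    l_pixel = nn.functional.mse_loss(recon, original)                                   # L_pixel
    l_perceptual = nn.functional.mse_loss(vgg_features(recon), vgg_features(original))  # L_perceptual
    l_adv = -torch.log(discriminator(recon) + 1e-8).mean()                              # generator-side L_adv
    return alpha * l_pixel + beta * l_perceptual + gamma * l_adv
```

In practice the discriminator would be trained in alternation with the autoencoder; only the generator-side term is shown here.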
In some possible embodiments, step S3 may include:
Classifying image features based on known malicious code family information by using a preset multi-class support vector machine, and determining whether the malicious code to be detected belongs to the known malicious code family;
when judging that the malicious code to be detected does not belong to the known malicious code family, classifying the image features by utilizing a multi-class support vector machine based on the pre-collected code sample to be classified, and obtaining a homologous detection result of the malicious code to be detected relative to the code sample to be classified.
It should be noted that, after the image features of the malicious code to be detected are extracted through the feature extraction model, a preset classification model (for example, a model for classifying based on various machine learning or statistical methods) may be used to classify the malicious code to be detected. For example, classifying based on preset known malicious code family information to determine whether the malicious code to be tested belongs to the known malicious code families. If the malicious code does not belong to the known malicious code family, other pre-collected code samples to be classified can be further utilized to carry out homology analysis, so as to obtain a detection result of whether the malicious code to be detected shares the same source with the code samples to be classified.
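As a hedged illustration of this classification stage, the sketch below applies PCA dimensionality reduction followed by a multi-class SVM over known malicious code families, with a simple probability threshold used to decide that a sample does not belong to any known family. The threshold value and the use of scikit-learn are assumptions; the description does not specify how the "unknown family" decision is made.

```python
# Illustrative PCA + multi-class SVM classification of the extracted image features.
# The rejection threshold for "not a known family" is an assumed mechanism.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pca = PCA(n_components=64)                    # feature dimensionality reduction
svm = SVC(kernel="rbf", probability=True)     # multi-class support vector machine

def fit(features: np.ndarray, family_labels: np.ndarray) -> None:
    svm.fit(pca.fit_transform(features), family_labels)

def detect(feature: np.ndarray, threshold: float = 0.5):
    """Return (family, probability) if the sample matches a known family, else None."""
    z = pca.transform(feature.reshape(1, -1))
    probs = svm.predict_proba(z)[0]
    best = int(np.argmax(probs))
    if probs[best] >= threshold:
        return svm.classes_[best], float(probs[best])
    return None  # fall back to homology analysis against the pre-collected samples to be classified
```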
It can be appreciated that by determining homology information of malicious codes, people can be helped to understand behavior patterns of attackers, predict future attack trends, formulate effective defense strategies, and the like.
Based on the method, the homology detection is firstly carried out on the malicious code to be detected based on the known malicious code family, and whether the malicious code is homologous with other code samples is detected when the malicious code does not belong to the known malicious code family, so that the flexibility of malicious code homology detection is further improved.
In some possible embodiments, after acquiring the malicious code to be detected and converting the malicious code to be detected into the image to be detected, before inputting the image to be detected into the pre-trained feature extraction model, the method further includes:
And scaling the converted image to be detected based on the preset image size to obtain the image to be detected with uniform size.
It should be noted that if the malicious code file is large, the converted image will also be large, and the amount of calculation in the model training and classification process will increase correspondingly. Since research shows that the malicious code homology detection effect is little affected by different image sizes, the images converted from malicious code samples can be unified to a set size, for example uniformly scaled to 256×256. Based on the method, the image to be detected is converted into an image of uniform size, so that the calculation efficiency of the model is effectively improved and the flexibility of malicious code homology detection is further improved.
Referring to fig. 2, fig. 2 is a block diagram illustrating a malicious code homology detection apparatus according to some embodiments of the present application. It should be understood that the malicious code homology detection apparatus corresponds to the above embodiment of the method of fig. 1 and is capable of performing the steps involved in that embodiment; its specific functions may be referred to in the above description, and detailed descriptions are omitted here as appropriate to avoid redundancy.
The malicious code homology detection apparatus of fig. 2 includes at least one software functional module that can be stored in a memory in the form of software or firmware or cured in the malicious code homology detection apparatus, the malicious code homology detection apparatus including:
the image conversion module 210 is configured to obtain malicious codes to be detected, and convert the malicious codes to be detected into images to be detected;
The feature extraction module 220 is configured to input an image to be detected into a feature extraction model trained in advance, so as to obtain image features corresponding to the image to be detected, which are output by the feature extraction model; the feature extraction model is obtained by training based on a preset malicious code data set in a self-supervision learning mode;
The homology detection module 230 is configured to classify the image features by using a preset classification model, so as to obtain a homology detection result corresponding to the malicious code to be detected.
It can be understood that the embodiment of the device item corresponds to the embodiment of the method item of the present invention, and the malicious code homology detection device provided by the embodiment of the present invention can implement the malicious code homology detection method provided by any one of the embodiment of the method item of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
As shown in fig. 3, some embodiments of the present application provide an electronic device 300. The electronic device 300 comprises a memory 310, a processor 320, and a computer program stored on the memory 310 and executable on the processor 320, wherein the processor 320, when reading the program from the memory 310 via a bus 330 and executing it, may implement the method of any of the embodiments of the malicious code homology detection method described above.
Processor 320 may process digital signals and may include various computing structures, for example a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture implementing a combination of instruction sets. In some examples, processor 320 may be a microprocessor.
Memory 310 may be used for storing instructions to be executed by processor 320 or data related to execution of the instructions. Such instructions and/or data may include code to implement some or all of the functions of one or more of the modules described in embodiments of the present application. The processor 320 of the disclosed embodiments may be configured to execute instructions in the memory 310 to implement the methods shown above. Memory 310 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
Some embodiments of the application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the method embodiment.
Some embodiments of the application also provide a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and identical and similar parts among the embodiments may be referred to one another. For the apparatus class embodiments, the description is relatively simple as it is substantially similar to the method embodiments; reference is made to the description of the method embodiments for the relevant points.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the prior art, or in part, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The above description is only an example of the present application and is not intended to limit the scope of the present application; various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A malicious code homology detection method, comprising:
Acquiring malicious codes to be detected, and converting the malicious codes to be detected into images to be detected;
Inputting the image to be detected into a pre-trained feature extraction model to obtain image features, which are output by the feature extraction model and correspond to the image to be detected; the feature extraction model is obtained by training based on a preset malicious code data set in a self-supervision learning mode;
and classifying the image features by using a preset classification model to obtain a homologous detection result corresponding to the malicious code to be detected.
2. The malicious code homology detection method of claim 1, wherein the training process of the feature extraction model comprises:
loading malicious code image samples from the malicious code dataset;
masking the malicious code image sample based on a preset masking strategy;
training a pre-constructed self-supervision learning model by using the malicious code image sample subjected to mask processing to obtain the feature extraction model.
3. The malicious code homology detection method as claimed in claim 2, wherein said training a pre-built self-supervised learning model using mask processed malicious code image samples to obtain said feature extraction model comprises:
inputting the malicious code image sample subjected to mask processing into a pre-constructed self-supervision learning model so as to enable an encoder of the self-supervision learning model to learn and output corresponding characteristic information;
performing sample reconstruction by using a decoder of the self-supervision learning model based on the characteristic information output by the encoder to obtain a decoded image sample;
Determining loss indexes of the decoded image samples and the corresponding malicious code image samples based on a preset loss function, and training the self-supervision learning model with the loss indexes minimized as targets;
And when the training process of the self-supervision learning model reaches a preset training completion condition, obtaining the feature extraction model.
4. The malicious code homology detection method according to claim 3, wherein the feature extraction model is obtained when a training process of the self-supervised learning model reaches a preset training completion condition, specifically:
When the training process of the self-supervision learning model reaches a preset training completion condition, an encoder of the self-supervision learning model obtained through training is used as the feature extraction model.
5. A malicious code homology detection method as claimed in claim 3, wherein said preset loss function comprises at least one of pixel loss, perceptual loss and adversarial loss; the determining the loss index of the decoded image sample and the corresponding malicious code image sample based on a preset loss function comprises the following steps:
And determining loss indexes of the decoded image samples and the corresponding malicious code image samples based on at least one preset loss function and its corresponding hyper-parameters.
6. The malicious code homology detection method as claimed in claim 1, wherein said classifying the image features using a predetermined classification model to obtain a homology detection result corresponding to the malicious code to be detected comprises:
Classifying the image features based on known malicious code family information by using a preset multi-class support vector machine, and determining whether the malicious code to be detected belongs to the known malicious code family;
And when judging that the malicious code to be detected does not belong to the known malicious code family, classifying the image features by utilizing the multi-class support vector machine based on the pre-collected code sample to be classified to obtain a homologous detection result of the malicious code to be detected relative to the code sample to be classified.
7. The malicious code homology detection method of claim 1, further comprising, after said acquiring malicious code to be detected and converting said malicious code to be detected into an image to be detected, before said inputting said image to be detected into a pre-trained feature extraction model:
And scaling the converted image to be detected based on the preset image size to obtain the image to be detected with uniform size.
8. A malicious code homology detection apparatus, comprising:
the image conversion module is used for acquiring malicious codes to be detected and converting the malicious codes to be detected into images to be detected;
The feature extraction module is used for inputting the image to be detected into a pre-trained feature extraction model to obtain image features, which are output by the feature extraction model and correspond to the image to be detected; the feature extraction model is obtained by training based on a preset malicious code data set in a self-supervision learning mode;
and the homology detection module is used for classifying the image features by utilizing a preset classification model to obtain a homology detection result corresponding to the malicious code to be detected.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor is configured to implement the malicious code homology detection method of any one of claims 1-7 when the program is executed by the processor.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the malicious code homology detection method according to any of claims 1-7.
CN202311579624.9A 2023-11-22 2023-11-22 Malicious code homology detection method, device, equipment and storage medium Pending CN118051908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311579624.9A CN118051908A (en) 2023-11-22 2023-11-22 Malicious code homology detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311579624.9A CN118051908A (en) 2023-11-22 2023-11-22 Malicious code homology detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118051908A true CN118051908A (en) 2024-05-17

Family

ID=91052885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311579624.9A Pending CN118051908A (en) 2023-11-22 2023-11-22 Malicious code homology detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118051908A (en)

Similar Documents

Publication Publication Date Title
Laishram et al. A novel minimal distortion-based edge adaptive image steganography scheme using local complexity: (BEASS)
CN111696046A (en) Watermark removing method and device based on generating type countermeasure network
CN116910752B (en) Malicious code detection method based on big data
CN113111716B (en) Remote sensing image semiautomatic labeling method and device based on deep learning
CN111241550B (en) Vulnerability detection method based on binary mapping and deep learning
CN116311214B (en) License plate recognition method and device
CN114970447B (en) Chinese character font conversion method, device, equipment and storage medium
JP2015036939A (en) Feature extraction program and information processing apparatus
CN115358952A (en) Image enhancement method, system, equipment and storage medium based on meta-learning
Li et al. Image operation chain detection with machine translation framework
CN113139618B (en) Robustness-enhanced classification method and device based on integrated defense
CN112560034B (en) Malicious code sample synthesis method and device based on feedback type deep countermeasure network
CN117496338A (en) Method, equipment and system for defending website picture tampering and picture immediate transmission
CN112818774A (en) Living body detection method and device
CN116595525A (en) Threshold mechanism malicious software detection method and system based on software map
CN116975864A (en) Malicious code detection method and device, electronic equipment and storage medium
CN116824138A (en) Interactive image segmentation method and device based on click point influence enhancement
CN115565108A (en) Video camouflage and salient object detection method based on decoupling self-supervision
CN118051908A (en) Malicious code homology detection method, device, equipment and storage medium
Kang et al. Android malware family classification using images from dex files
Wang et al. Median filtering detection using LBP encoding pattern★
CN111581640A (en) Malicious software detection method, device and equipment and storage medium
Chhikara et al. Information theoretic steganalysis of processed image LSB steganography
CN115828248B (en) Malicious code detection method and device based on interpretive deep learning
Balliwal et al. Design Simulation of Predicting Age and Gender for Human using Machine Learning Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination