CN115080973B

CN115080973B - Malicious code detection method and system based on multi-mode feature fusion

Info

Publication number: CN115080973B
Application number: CN202210849728.6A
Authority: CN
Inventors: 路冰; 张海文; 王琦博
Original assignee: Zhongfu Safety Technology Co Ltd
Current assignee: Zhongfu Safety Technology Co Ltd
Priority date: 2022-07-20
Filing date: 2022-07-20
Publication date: 2022-12-06
Anticipated expiration: 2042-07-20
Also published as: CN115080973A

Abstract

The invention relates to the technical field of malicious code detection and discloses a malicious code detection method and system based on multi-modal feature fusion. The method comprises the following steps: acquiring a training code sample set; performing keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix; extracting semantic features from the keyword embedding matrix to obtain a semantic expression vector; constructing a code weight vector based on a keyword extraction algorithm; constructing a code statistical vector based on the chi-square test; constructing a topic representation vector based on a document topic classification algorithm; performing weighted fusion on the code weight vector, the code statistical vector and the topic representation vector to obtain a multi-modal feature vector; splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features; and training a model on the multi-modal fusion features of the training code sample set for malicious code detection. By extracting code features from multiple dimensions, the invention can effectively mine feature information in the code and improve detection accuracy.

Description

Malicious code detection method and system based on multi-mode feature fusion

Technical Field

The invention belongs to the technical field of malicious code detection, and particularly relates to a malicious code detection method and system based on multi-mode feature fusion.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

With the rapid development of the internet and information technology, the network security risks hidden behind the internet are considerable. Among them, malicious code is one of the most troublesome for network security practitioners: once it is invoked and executed by a computer, it can cause serious problems for the system or the network. The most common kinds are Trojan horses, worms, backdoors and the like. Malicious code has been changing and evolving continuously since it first appeared, and it is now characterized by a huge number of variants, rapid propagation and an uncontrollable scope of influence. Moreover, the operators behind malicious code often use sophisticated techniques such as packing and code transformation to disguise it and evade tracking and detection, so that user hosts are attacked.

At present, four kinds of methods are mainly used to detect malicious code: methods based on rules or text matching, methods combining sandbox technology with machine learning, methods based on feature extraction and a knowledge base, and methods based on deep learning. Rule-based or text-matching methods are suitable for detecting malicious code of the same category. Methods combining sandbox technology with machine learning extract code features by simulating the running environment and build a model with a machine learning method such as random forest to detect malicious code. Methods based on feature extraction and a knowledge base extract code features through technologies such as machine learning and mine the common behaviors of malicious code of the same family with data mining methods. Deep learning-based methods convert the code into byte code, use the N-gram sequences of the byte code as features, and build a network model with deep learning to detect malicious code.

As the number of malicious code samples and variants gradually increases, detection by traditional rules or text matching is gradually being phased out. Methods combining a sandbox with machine learning mainly obtain features of malicious code at a single scale; they cannot extract features efficiently and automatically, they rely on manual work, and the shallow features they extract are insufficient to describe malicious code accurately, so detection accuracy is low. Moreover, because code differs from ordinary text and is semantically sparse, traditional machine learning or byte-code-based feature extraction cannot fully represent the semantic information of the code.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a malicious code detection method and system based on multi-modal feature fusion, which are used for extracting code features from multiple dimensions and performing feature fusion, so that effective feature information in codes can be effectively mined, and the detection accuracy is improved.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

a malicious code detection method based on multi-modal feature fusion comprises the following steps:

obtaining a training code sample set, wherein the code samples comprise benign code and malicious code;

carrying out keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix;

extracting semantic features according to the keyword embedded matrix to obtain a semantic expression vector;

performing weight extraction on each keyword based on a keyword extraction algorithm, and connecting the weights to obtain a code weight vector of the code sample; performing statistic calculation on each keyword based on the chi-square test, and connecting the statistics to obtain a code statistical vector; obtaining a code-topic matrix and a topic-word embedding matrix based on a document topic classification algorithm, and multiplying the code-topic matrix and the topic-word embedding matrix to obtain a topic representation vector;

performing weighted fusion on the code weight vector, the code statistical vector and the topic representation vector to obtain a multi-modal feature vector;

splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features;

and training a malicious code detection model based on the multi-modal fusion characteristics of the code sample for malicious code detection.

Further, after the training code sample set is obtained, invalid characters are removed, stop words are removed, and variable name splitting preprocessing is performed.

Further, the performing semantic feature extraction includes:

respectively executing context semantic feature extraction and maximum pooling operation aiming at the keyword embedded matrix to obtain a semantic expression vector and a keyword vector;

and splicing the semantic expression vector and the keyword vector based on an attention mechanism to obtain a semantic feature matrix.

Further, for the keyword embedding matrix, performing context semantic feature extraction includes:

performing part-of-speech tagging based on the recommended keywords to generate a part-of-speech matrix;

obtaining a characteristic enhancement word embedding matrix according to the keyword embedding matrix and the part-of-speech matrix;

and executing context semantic feature extraction aiming at the feature enhancement word embedding matrix to obtain a semantic expression vector.

Further, the context semantic feature extraction adopts a bidirectional gated recurrent unit (Bi-GRU).

Further, the keyword extraction algorithm adopts a TF-IDF algorithm.

Further, the document topic classification algorithm adopts a Gaussian LDA topic model.

One or more embodiments provide a malicious code detection system based on multi-modal feature fusion, comprising:

the training data acquisition module is used for acquiring a training code sample set, and the training code sample set comprises benign code and malicious code;

the data preprocessing module is used for carrying out keyword recommendation and word embedding on each code sample in the code samples to obtain a keyword embedding matrix;

the semantic feature extraction module is used for extracting semantic features according to the keyword embedding matrix to obtain a semantic expression vector;

the multi-modal feature extraction module is used for performing weight extraction on each keyword based on a keyword extraction algorithm and connecting to obtain a code weight vector; performing statistic calculation on each keyword based on chi-square test, and connecting to obtain a code statistic vector; obtaining a code-topic matrix and a topic-word embedding matrix based on a document topic classification algorithm, and multiplying the code-topic matrix and the topic-word embedding matrix to obtain a topic expression vector; and

performing weighted fusion on the code weight vector, the code statistical vector and the code theme representation vector to obtain a multi-modal feature vector of the code sample;

the multi-modal feature fusion module is used for splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features of the code sample;

and the detection model training module is used for training a malicious code detection model based on the multi-modal fusion characteristics of the training code sample set and is used for detecting the malicious code.

One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the multi-modal feature fusion based malicious code detection method when executing the program.

One or more embodiments provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the multimodal feature fusion based malicious code detection method.

The above one or more technical solutions have the following beneficial effects:

according to the method and the device, code features are extracted from multiple dimensions and fused, effective feature information in the codes can be effectively mined, and the detection accuracy is improved.

The semantic features are constructed by combining two dimensions of context semantic features and keyword features based on word embedding results, so that the defect of sparse code semantics is overcome to a certain extent, and richer semantic information is beneficially mined.

In addition, the keywords in the code are analyzed in several modes (a keyword extraction algorithm, the chi-square test and a topic classification algorithm) to obtain the weight of each word, a statistic of the independence between each word and the classification result, and multi-level word-topic-code topic features. Fusing these multi-dimensional features allows benign code and malicious code to be distinguished effectively and benefits the accuracy of the subsequent model.

By introducing the GLDA topic classification model to extract the topic expression vector, the code semantics can be effectively enriched, the defect of sparse code semantics is further overcome, and the detection accuracy is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

FIG. 1 is an architecture diagram of an implementation of the malicious code detection method based on multi-modal feature fusion according to one or more embodiments of the present invention;

FIG. 2 is a flow diagram of data preprocessing in one or more embodiments of the invention;

FIG. 3 is a schematic diagram of GLDA-based code-topic-word embedding mining in one or more embodiments of the invention;

FIG. 4 is a graph of experimental validation results for GLDA using different numbers of topics;

FIG. 5 is a graph comparing the model of one or more embodiments of the invention with other baseline models as a function of the number of training epochs.

Detailed Description

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example one

Compared with traditional text, malicious code differs greatly in format and content: it contains special formatting, brackets, line feeds, indentation, special symbols and the like. It also has fewer semantic features than traditional text, so a good detection result is difficult to obtain from semantic extraction alone. Therefore, this embodiment provides a malicious code detection method based on multi-modal feature fusion, which includes the following steps:

Step 1: acquiring a training code sample set, wherein the code samples comprise benign code and malicious code, performing data preprocessing, and performing keyword recommendation and word embedding to obtain a keyword embedding matrix;

Because malicious code differs from traditional text, the code text usually contains a large number of invalid characters and stop words in addition to programming statements. The preprocessing stage removes this invalid information, which reduces training complexity, improves the quality of the input data set, and helps improve the accuracy of subsequent model training. Besides conventional data preprocessing, special processing tailored to the characteristics of malicious code is also required.

The data preprocessing comprises the following steps: removing invalid characters, removing stop words, splitting variable names, recommending keywords and embedding words, as shown in FIG. 2.
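The first three preprocessing steps can be illustrated with a minimal sketch. The stop-word list, the rule that everything other than letters, digits and underscores counts as an invalid character, and the function names `split_identifier` and `preprocess` are all assumptions made for illustration; the patent does not specify them.

```python
import re

# Hypothetical stop-word list; the patent does not enumerate one.
STOP_WORDS = {"var", "the", "a", "an", "of"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase parts."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return [part.lower() for part in spaced.split()]


def preprocess(code):
    """Remove invalid characters, split variable names, drop stop words."""
    # Assumption: anything that is not a letter, digit or underscore is invalid.
    cleaned = re.sub(r"[^A-Za-z0-9_]+", " ", code)
    tokens = []
    for raw in cleaned.split():
        tokens.extend(split_identifier(raw))
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("var userName = get_cookie(document);")
```

For the sample statement, the brackets and punctuation are dropped, `userName` and `get_cookie` are split into their component words, and the stop word `var` is removed.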

The word embedding is performed with Word2Vec, which learns in an unsupervised manner with a shallow network to form a multi-dimensional vector matrix.

Step 2: extracting semantic features from the preprocessed code samples to obtain semantic expression vectors. Step 2 specifically comprises:

step 2.1: performing keyword recommendation on the preprocessed malicious codes;

step 2.2: generating a keyword embedding matrix based on the keywords; meanwhile, performing part-of-speech tagging based on the recommended keywords to generate a part-of-speech matrix;

step 2.3: obtaining a characteristic enhancement word embedding matrix according to the keyword embedding matrix and the part-of-speech matrix;

step 2.4: executing context semantic feature extraction aiming at the feature enhancement word embedding matrix to obtain a semantic expression vector;

In this embodiment, the main semantic processing adopts the bidirectional gated recurrent unit (Bi-GRU), which is considered to give the most ideal effect at present. It processes the data in both the front-to-back and the back-to-front order at the same time and thereby obtains the context information of both sequences.
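The bidirectional processing can be sketched in plain Python as follows. This is only an illustrative, untrained Bi-GRU with randomly initialized parameters; the helper names (`init_gru`, `gru_step`, `bi_gru`), the dimensions and the specific GRU update convention are assumptions, not details taken from the patent.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def init_gru(input_dim, hidden_dim, rng):
    """Random parameters for the update (z), reset (r) and candidate (n) parts."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        gate: {"W": mat(hidden_dim, input_dim),
               "U": mat(hidden_dim, hidden_dim),
               "b": [0.0] * hidden_dim}
        for gate in ("z", "r", "n")
    }


def affine(part, x, h):
    wx = matvec(part["W"], x)
    uh = matvec(part["U"], h)
    return [a + b + c for a, b, c in zip(wx, uh, part["b"])]


def gru_step(params, x, h):
    """One GRU step: two gates (update and reset) plus a candidate state."""
    z = [sigmoid(v) for v in affine(params["z"], x, h)]
    r = [sigmoid(v) for v in affine(params["r"], x, h)]
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(v) for v in affine(params["n"], x, gated_h)]
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def run_gru(params, sequence, hidden_dim):
    h = [0.0] * hidden_dim
    states = []
    for x in sequence:
        h = gru_step(params, x, h)
        states.append(h)
    return states


def bi_gru(sequence, fwd_params, bwd_params, hidden_dim):
    """Concatenate front-to-back and back-to-front hidden states per position."""
    forward = run_gru(fwd_params, sequence, hidden_dim)
    backward = run_gru(bwd_params, sequence[::-1], hidden_dim)[::-1]
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
fwd, bwd = init_gru(3, 2, rng), init_gru(3, 2, rng)
states = bi_gru([[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]], fwd, bwd, 2)
```

Each position of the input sequence receives a state of twice the hidden size, carrying context from both directions. In practice a deep learning framework would supply the trained layer; the sketch only shows the data flow.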

Step 2.5: performing a maximum pooling operation on the keyword embedding matrix to obtain a keyword vector. In order to improve the classification and detection of malicious code, a keyword recommendation algorithm is applied to the code to screen keywords, the keyword information is passed through an embedding layer to obtain the keyword embeddings, and a max-pooling operation is then performed to obtain the keyword vector.
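As an illustration of the max-pooling operation, the sketch below takes the element-wise maximum over the rows of a keyword embedding matrix. It assumes one row per keyword and one column per embedding dimension; the function name `max_pool` and the toy values are hypothetical.

```python
def max_pool(embedding_matrix):
    """Column-wise maximum over the rows (one row per keyword)."""
    return [max(column) for column in zip(*embedding_matrix)]


keyword_embeddings = [
    [0.1, 0.9, -0.3],
    [0.4, 0.2, 0.0],
    [-0.5, 0.7, 0.8],
]
keyword_vector = max_pool(keyword_embeddings)
```

The result has the same length as a single embedding regardless of how many keywords were recommended, which gives a fixed-size keyword vector.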

Step 2.6: splicing the semantic expression vector and the keyword vector based on an attention mechanism to obtain a semantic feature matrix.

In order to make full use of the keyword information, the semantic feature matrix obtained in step 2.4 is combined with the keyword vector obtained in step 2.5 to obtain a feature matrix. An attention mechanism is used to attend fully to the information corresponding to the keywords in the malicious code, and the semantic expression vector is calculated as shown in equations (1)-(2), whose parameters are a weight matrix, a random bias parameter and an activation function.

（1）

（2）
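Since the exact form of equations (1)-(2) is not available in this text, the following is only a generic attention-pooling sketch built from the ingredients the description names (a weight parameter, a bias and an activation function). The choice of `tanh` as the activation, the concatenation of each hidden state with the keyword vector, the softmax normalization and the name `attention_pool` are all assumptions.

```python
import math


def attention_pool(hidden_states, keyword_vec, w, b):
    """Score each position with tanh(w . [h; k] + b), then softmax-average."""
    # Assumption: each hidden state is concatenated with the keyword vector.
    features = [h + keyword_vec for h in hidden_states]
    scores = [math.tanh(sum(wi * fi for wi, fi in zip(w, f)) + b)
              for f in features]
    # Numerically stable softmax over the positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[j] for a, h in zip(alphas, hidden_states))
              for j in range(dim)]
    return pooled, alphas


pooled, alphas = attention_pool(
    hidden_states=[[1.0, 0.0], [0.0, 1.0]],
    keyword_vec=[0.5],
    w=[1.0, -1.0, 0.0],
    b=0.0,
)
```

In this toy call the first position scores higher than the second, so it receives the larger attention weight and dominates the pooled vector.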

Step 3: performing weight extraction on each keyword based on the TF-IDF algorithm, and connecting the weights to obtain a code weight vector.

With the TF-IDF algorithm, the more often a word appears in a text and the less often it appears in the corpus, the more representative the word is. After the weight value of each word in the current data is obtained, the weight values of all words in the current data are connected to obtain the code weight vector.
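The weighting behaviour described above can be sketched as follows. This uses one common unsmoothed variant (term frequency times log(N / document frequency)); the patent does not say which TF-IDF variant is used, and the function name `tfidf_vectors` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, one weight vector per doc)."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([
            (counts[term] / total) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


docs = [["eval", "unescape", "eval"], ["alert", "eval"], ["alert", "click"]]
vocab, vectors = tfidf_vectors(docs)
```

In the first toy document, `unescape` appears only once but in no other document, so it receives a higher weight than `eval`, which appears twice but is also present in another document.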

Step 4: performing statistic calculation on each keyword based on the chi-square test, and connecting the statistics to obtain a code statistical vector.

In the chi-square test (i.e. the χ² test), the higher the value of the χ² statistic, the less independent a word in the malicious code is from its corresponding category. A normalization operation prevents words whose χ² statistic is too large from affecting the final classification effect. The calculation finally yields the statistical vector of the malicious code.
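A minimal sketch of the per-keyword statistic is given below, using the standard χ² formula for a 2x2 term/class contingency table. The min-max normalization is an assumption (the text only says that a normalization is applied), and the names `chi_square` and `chi_square_weights` are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: target-class docs containing the term, b: other docs containing it,
    c: target-class docs without it,          d: other docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def chi_square_weights(docs, labels, vocab, target=1):
    """Normalized chi-square statistic of every vocabulary term."""
    stats = []
    for term in vocab:
        a = b = c = d = 0
        for doc, label in zip(docs, labels):
            present = term in doc
            if label == target:
                if present:
                    a += 1
                else:
                    c += 1
            else:
                if present:
                    b += 1
                else:
                    d += 1
        stats.append(chi_square(a, b, c, d))
    # Assumption: min-max normalization keeps very large values in [0, 1].
    low, high = min(stats), max(stats)
    if high == low:
        return [0.0] * len(stats)
    return [(s - low) / (high - low) for s in stats]


docs = [["eval", "x"], ["eval", "y"], ["alert", "x"], ["alert", "y"]]
labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign (toy labels)
weights = chi_square_weights(docs, labels, ["alert", "eval", "x", "y"])
```

In the toy data, `eval` and `alert` each occur in only one class and therefore get the highest normalized statistic, while `x` and `y` occur equally in both classes and get zero. The statistical vector of one code sample would then be formed by connecting the values of that sample's keywords.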

Step 5: obtaining a code topic representation vector from the word embedding matrix based on a document topic classification algorithm.

GLDA (Gaussian LDA), proposed on the basis of LDA, replaces the word-generation part of LDA with multivariate Gaussian distributions and adopts a fast collapsed Gibbs sampling algorithm.

After the word embeddings are obtained, they are input into the GLDA model, which discovers the information in the word embeddings and performs hierarchical modeling. Using the hierarchical "code-topic-word embedding" model, the components of the topic representation vector for the current data are finally obtained: a code-topic matrix and a topic-word embedding matrix, as shown in FIG. 3.

The two matrices respectively represent the frequency of the malicious code under different topics and the frequency of the word embeddings under different topics; multiplying them yields the word-embedding feature vector of the code, namely the topic representation vector.
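The multiplication step can be sketched with toy matrices. The shapes are an assumption: a code-topic matrix with one row per code sample and K columns, and a topic-word embedding matrix with K rows and one column per embedding dimension. The values and the helper name `matmul` are hypothetical; fitting the GLDA model itself is not shown.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Two code samples, K = 2 topics (toy topic proportions per sample).
code_topic = [[0.5, 0.5],
              [1.0, 0.0]]
# K = 2 topics, embedding dimension 3 (toy values).
topic_embedding = [[1.0, 0.0, 2.0],
                   [0.0, 1.0, 2.0]]

topic_vectors = matmul(code_topic, topic_embedding)
```

Each row of `topic_vectors` is the topic representation vector of one code sample: a mixture of the per-topic embedding rows, weighted by how strongly the sample belongs to each topic.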

Step 6: performing weighted fusion on the code weight vector, the code statistical vector and the topic representation vector to obtain the multi-modal feature vector.

After the code weight vector, the statistical vector and the topic representation vector are obtained, feature fusion is performed on the three parts, as shown in formula (3):

（3）

wherein the three weight parameters sum to 1 and their respective values are obtained by iterative training; the feature vector generated by LDA can semantically enhance the code input to the model.
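A weighted fusion of this kind can be sketched as a weighted sum of the three vectors. The sketch assumes the three vectors have already been brought to the same length, and it uses a softmax over three raw trainable values as one possible way to keep the weights summing to 1 during training; both points, and the names `normalize_weights` and `fuse`, are assumptions not stated in the patent.

```python
import math


def normalize_weights(raw):
    """Softmax: maps any three raw parameters to positive weights summing to 1."""
    top = max(raw)
    exps = [math.exp(r - top) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(weight_vec, stat_vec, topic_vec, weights):
    """Weighted sum of the code weight, statistical and topic vectors."""
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("the three fusion weights must sum to 1")
    if not (len(weight_vec) == len(stat_vec) == len(topic_vec)):
        raise ValueError("the three vectors must have the same length")
    return [w1 * a + w2 * b + w3 * c
            for a, b, c in zip(weight_vec, stat_vec, topic_vec)]


fused = fuse([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], (0.5, 0.25, 0.25))
```

With the toy weights (0.5, 0.25, 0.25), the fused vector is dominated by the code weight vector while still mixing in the other two modalities.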

Step 7: splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features, and training a malicious code detection model for malicious code detection.

The malicious code detection model is trained with a Sigmoid classifier: the semantic expression vector is connected with the multi-modal feature vector, the connected vector is sent into a fully connected layer, and a Sigmoid classifier outputs the result. The overall structure of the resulting multi-modal feature fusion malicious code detection model MFFM (Multi-modal Feature Fusion Model) is shown in FIG. 1.
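The final classification step can be sketched as a concatenation followed by a single fully connected unit with a sigmoid output. The parameters here are fixed toy values rather than trained ones, and the function name `predict` and the single-output layer shape are assumptions.

```python
import math


def predict(semantic_vec, multimodal_vec, weights, bias):
    """Concatenate the two vectors, apply one dense unit and a sigmoid."""
    fused = semantic_vec + multimodal_vec  # list concatenation (splicing)
    if len(weights) != len(fused):
        raise ValueError("one weight is needed per fused feature")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


neutral = predict([1.0, 2.0], [0.5], [0.0, 0.0, 0.0], 0.0)
positive = predict([1.0, 2.0], [0.5], [1.0, 1.0, 1.0], 0.0)
```

The output is a probability-like score in (0, 1); a threshold such as 0.5 would turn it into a two-class decision.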

Example two

Based on the method provided by the first embodiment, the present embodiment provides a malicious code detection system based on multimodal feature fusion, including:

the data preprocessing module is used for carrying out keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix;

the multi-modal feature extraction module is used for performing weight extraction on each keyword based on a keyword extraction algorithm and connecting to obtain a code weight vector; performing statistic calculation on each keyword based on chi-square test, and connecting to obtain a code statistic vector; obtaining a code-theme matrix and a theme-word embedding matrix based on a document theme classification algorithm, and multiplying the code-theme matrix and the theme-word embedding matrix to obtain a theme representation vector; and

carrying out weighted fusion on the code weight vector, the code statistical vector and the code theme representation vector to obtain a multi-modal feature vector of the code sample;

and the detection model training module is used for training a malicious code detection model based on the multi-modal fusion characteristics of the training code sample set and is used for malicious code detection.

Example three

The embodiment aims at providing an electronic device.

An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described in embodiment one when executing the program.

Example four

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth in the first embodiment.

The steps involved in the second to fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the related description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.

Results of the experiment

Data set and experimental environment

The data set consists of JavaScript code and comprises 10483 benign samples and 8367 malicious samples. In order to avoid data imbalance, 8000 samples each of benign and malicious code are randomly extracted. The data set is divided into a training set and a test set by 8.

Model evaluation

Malicious code samples are defined as negative samples and benign code samples as positive samples. Precision, Recall and the F value (F-score) are selected to evaluate the experimental results.

Analysis of experiments

In LDA, different settings of the topic number K lead to different results from the code-topic and topic-word embedding layers, which directly affects the final topic representation vector and ultimately the classification performance of the model. To determine K, different values of K are selected for experimental verification and the results are evaluated by log-likelihood. The experimental results are shown in FIG. 4; the best log-likelihood is obtained when K is 90.

In addition, to demonstrate the effectiveness of the model, it was compared with other commonly used baseline models. The performance indicators are shown in the following table:

Table 1. Performance indicators of different malicious code detection models

Model type	Precision	Recall	F value
LSTM	0.869	0.882	0.875
Bi-LSTM	0.881	0.901	0.889
CNN-BiLSTM	0.875	0.883	0.878
MFFM	0.901	0.890	0.895
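The F value is the harmonic mean of precision and recall. The short sketch below (with the hypothetical helper name `f_score`) recomputes it from the precision and recall columns and reproduces the LSTM and MFFM entries of Table 1 to three decimal places.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f_lstm = round(f_score(0.869, 0.882), 3)
f_mffm = round(f_score(0.901, 0.890), 3)
```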

According to the experimental results, the other models perform only semantic feature extraction and lack multi-modal feature information, so their precision and F value are lower than those of the MFFM proposed by the invention (Bi-LSTM has a slightly higher recall). This indicates that using multi-modal feature information in malicious code detection has a positive effect on model training.

In addition, in FIG. 5 the abscissa represents the number of training epochs and the ordinate represents the loss on the validation set. The loss decreases continuously as the number of epochs increases. As the downward trend of the loss shows, the MFFM model proposed herein converges significantly faster than the other three models and reaches a relatively lower loss when training finally stops, because the Bi-GRU it adopts has one fewer gate structure than the Bi-LSTM.

Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive changes in the technical solutions of the present invention.

Claims

1. A malicious code detection method based on multi-modal feature fusion is characterized by comprising the following steps:

acquiring a training code sample set, wherein the training code sample set comprises benign code and malicious code;

performing weight extraction on each keyword based on a keyword extraction algorithm, and connecting to obtain a code weight vector; performing statistic calculation on each keyword based on chi-square test, and connecting to obtain a code statistic vector; obtaining a code-topic matrix and a topic-word embedding matrix based on a document topic classification algorithm, and multiplying the code-topic matrix and the topic-word embedding matrix to obtain a topic expression vector;

splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features of the code sample;

and training a malicious code detection model based on the multi-modal fusion characteristics of the training code sample set for malicious code detection.

2. The multi-modal feature fusion-based malicious code detection method as claimed in claim 1, wherein after the training code sample set is obtained, the preprocessing of eliminating invalid characters, removing stop words and splitting variable names is further performed.

3. The multi-modal feature fusion based malicious code detection method according to claim 1, wherein the performing semantic feature extraction comprises:

and splicing the semantic representation vector and the keyword vector based on an attention mechanism to obtain a semantic feature matrix.

4. The multi-modal feature fusion based malicious code detection method according to claim 3, wherein the performing context semantic feature extraction on the keyword embedding matrix comprises:

5. The multi-modal feature fusion based malicious code detection method according to claim 3, wherein the context semantic feature extraction is based on a bidirectional gated recurrent unit.

6. The multi-modal feature fusion based malicious code detection method according to claim 1, wherein the keyword extraction algorithm adopts a TF-IDF algorithm.

7. The multi-modal feature fusion based malicious code detection method according to claim 1, wherein the document topic classification algorithm employs a Gaussian LDA topic model.

8. A malicious code detection system based on multi-modal feature fusion is characterized by comprising:

the semantic feature extraction module is used for extracting semantic features according to the keyword embedded matrix to obtain a semantic expression vector;

9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-modal feature fusion based malicious code detection method according to any of claims 1-7 when executing the program.

10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the multi-modal feature fusion based malicious code detection method according to any of claims 1-7.