CN115080973B - Malicious code detection method and system based on multi-mode feature fusion - Google Patents


Info

Publication number
CN115080973B
CN115080973B CN202210849728.6A CN202210849728A CN115080973B CN 115080973 B CN115080973 B CN 115080973B CN 202210849728 A CN202210849728 A CN 202210849728A CN 115080973 B CN115080973 B CN 115080973B
Authority
CN
China
Prior art keywords
code
vector
keyword
matrix
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210849728.6A
Other languages
Chinese (zh)
Other versions
CN115080973A (en
Inventor
路冰
张海文
王琦博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongfu Safety Technology Co Ltd
Original Assignee
Zhongfu Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongfu Safety Technology Co Ltd
Priority to CN202210849728.6A
Publication of CN115080973A
Application granted
Publication of CN115080973B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection
    • G06F21/563: Static detection by source code analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • G06F8/42: Syntactic analysis
    • G06F8/425: Lexical analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • G06F8/43: Checking; Contextual analysis
    • G06F8/436: Semantic checking

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of malicious code detection and discloses a malicious code detection method and system based on multi-modal feature fusion. The method comprises the following steps: acquiring a training code sample set; performing keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix; extracting semantic features from the keyword embedding matrix to obtain a semantic expression vector; constructing a code weight vector based on a keyword extraction algorithm; constructing a code statistical vector based on the chi-square test; constructing a topic representation vector based on a document topic classification algorithm; carrying out weighted fusion on the code weight vector, the code statistical vector and the topic representation vector to obtain a multi-modal feature vector; splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features; and training a model on the multi-modal fusion features of the training code sample set for malicious code detection. The invention extracts code features from multiple dimensions, which helps mine effective feature information in the code and improve detection accuracy.

Description

Malicious code detection method and system based on multi-mode feature fusion
Technical Field
The invention belongs to the technical field of malicious code detection, and particularly relates to a malicious code detection method and system based on multi-mode feature fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of the internet and information technology, the network security risks hidden behind the internet are considerable. Among them, malicious code is one of the most troublesome for network security practitioners: once it is invoked and executed by a computer, it can cause serious problems for the system or the network. The most common types are Trojan horses, worms, backdoors and the like. Since its emergence, malicious code has been continuously changed and updated, and it is now characterized by a huge number of variants, rapid propagation and an uncontrollable range of influence. Moreover, the operators behind malicious code often use sophisticated methods such as packing and metamorphism to disguise it and evade tracking and detection, so that user hosts are attacked.
To detect malicious code effectively, four kinds of methods currently exist: methods based on rule or text matching, methods combining sandbox technology with machine learning, methods based on feature extraction and a knowledge base, and methods based on deep learning. Rule or text matching is suitable for detecting malicious code of the same category. Sandbox technology combined with machine learning extracts features of the code by simulating its running environment and detects malicious code by building a model with a machine learning method such as random forest. Methods based on feature extraction and a knowledge base extract code features through techniques such as machine learning and use data mining to extract the common behaviors of malicious code from the same family. Deep-learning-based methods take N-gram sequences of the code converted into bytecode as features and build a network model to detect malicious code.
As the number of malicious code samples and variants grows, detection by traditional rules or text matching is gradually being phased out. The approach combining a sandbox with machine learning obtains features of malicious code mainly at a single scale, cannot extract features efficiently and automatically, and relies on manual work; the shallow features it extracts are not sufficient to describe malicious code accurately, so detection accuracy is low. Moreover, code differs from ordinary text in that its semantics are sparse, so traditional machine learning or bytecode-based feature extraction cannot fully represent the semantic information of the code.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a malicious code detection method and system based on multi-modal feature fusion, which are used for extracting code features from multiple dimensions and performing feature fusion, so that effective feature information in codes can be effectively mined, and the detection accuracy is improved.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
a malicious code detection method based on multi-modal feature fusion comprises the following steps:
obtaining a training code sample set, wherein the code samples comprise benign code and malicious code;
carrying out keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix;
extracting semantic features according to the keyword embedding matrix to obtain a semantic expression vector;
performing weight extraction on each keyword based on a keyword extraction algorithm, and connecting the weights to obtain a code weight vector of the code sample; performing statistic calculation on each keyword based on the chi-square test, and connecting the statistics to obtain a code statistical vector; obtaining a code-topic matrix and a topic-word embedding matrix based on a document topic classification algorithm, and multiplying the code-topic matrix and the topic-word embedding matrix to obtain a topic representation vector;
carrying out weighted fusion on the code weight vector, the code statistical vector and the topic representation vector to obtain a multi-modal feature vector;
splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features;
and training a malicious code detection model based on the multi-modal fusion features of the code samples for malicious code detection.
Further, after the training code sample set is obtained, invalid characters are removed, stop words are removed, and variable name splitting preprocessing is performed.
Further, the performing semantic feature extraction includes:
respectively executing context semantic feature extraction and maximum pooling operation aiming at the keyword embedded matrix to obtain a semantic expression vector and a keyword vector;
and splicing the semantic expression vector and the keyword vector based on an attention mechanism to obtain a semantic feature matrix.
Further, for the keyword embedding matrix, performing context semantic feature extraction includes:
performing part-of-speech tagging based on the recommended keywords to generate a part-of-speech matrix;
obtaining a characteristic enhancement word embedding matrix according to the keyword embedding matrix and the part-of-speech matrix;
and executing context semantic feature extraction aiming at the feature enhancement word embedding matrix to obtain a semantic expression vector.
Further, the context semantic feature extraction adopts a bidirectional gated recurrent unit (Bi-GRU).
Further, the keyword extraction algorithm adopts a TF-IDF algorithm.
Further, the document topic classification algorithm adopts a Gaussian LDA topic model.
One or more embodiments provide a malicious code detection system based on multi-modal feature fusion, comprising:
the training data acquisition module is used for acquiring a training code sample set, and the training code sample set comprises benign code and malicious code;
the data preprocessing module is used for carrying out keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix;
the semantic feature extraction module is used for extracting semantic features according to the keyword embedding matrix to obtain a semantic expression vector;
the multi-modal feature extraction module is used for performing weight extraction on each keyword based on a keyword extraction algorithm and connecting the weights to obtain a code weight vector; performing statistic calculation on each keyword based on the chi-square test, and connecting the statistics to obtain a code statistical vector; obtaining a code-topic matrix and a topic-word embedding matrix based on a document topic classification algorithm, and multiplying the code-topic matrix and the topic-word embedding matrix to obtain a topic representation vector; and
performing weighted fusion on the code weight vector, the code statistical vector and the topic representation vector to obtain a multi-modal feature vector of the code sample;
the multi-modal feature fusion module is used for splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features of the code sample;
and the detection model training module is used for training a malicious code detection model based on the multi-modal fusion characteristics of the training code sample set and is used for detecting the malicious code.
One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the multi-modal feature fusion based malicious code detection method when executing the program.
One or more embodiments provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the multimodal feature fusion based malicious code detection method.
The above one or more technical solutions have the following beneficial effects:
according to the method and the device, code features are extracted from multiple dimensions and fused, effective feature information in the codes can be effectively mined, and the detection accuracy is improved.
The semantic features are constructed by combining two dimensions of context semantic features and keyword features based on word embedding results, so that the defect of sparse code semantics is overcome to a certain extent, and richer semantic information is beneficially mined.
In addition, the keywords in the code are analyzed with a keyword extraction algorithm, the chi-square test and a topic classification algorithm to obtain, respectively, the weights of the words, statistics on the independence between the words and the classification result, and multi-level word-topic-code topic features. Fusing these multidimensional features yields fusion features that can effectively distinguish benign code from malicious code, which benefits the accuracy of the subsequent model.
By introducing the GLDA topic classification model to extract the topic expression vector, the code semantics can be effectively enriched, the defect of sparse code semantics is further overcome, and the detection accuracy is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention. They illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.
FIG. 1 is an architecture diagram of an implementation of the malicious code detection method based on multi-modal feature fusion according to one or more embodiments of the present invention;
FIG. 2 is a flow diagram of data preprocessing in one or more embodiments of the invention;
FIG. 3 is a schematic diagram of GLDA-based code-topic-word embedding mining in one or more embodiments of the invention;
FIG. 4 is a graph of experimental validation results for GLDA using different numbers of topics;
FIG. 5 is a graph comparing the model of one or more embodiments of the invention with baseline models as a function of the number of training epochs.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
Compared with traditional text, malicious code differs considerably in format and content: it contains special formatting, brackets, line feeds, indentation, special symbols and the like. Its semantic features are also sparser than those of traditional text, so it is difficult to obtain a good detection result from semantic extraction alone. Therefore, this embodiment provides a malicious code detection method based on multi-modal feature fusion, which includes the following steps:
Step 1: acquiring a training code sample set, wherein the code samples comprise benign code and malicious code, performing data preprocessing, and performing keyword recommendation and word embedding to obtain a keyword embedding matrix;
Because malicious code differs from traditional text, the code text usually contains a large number of invalid characters and stop words in addition to programming statements. The preprocessing stage therefore removes this invalid information, which reduces training complexity, improves the quality of the input data set and benefits the accuracy of subsequent model training. Besides conventional data preprocessing, special processing tailored to the characteristics of malicious code is required.
The data preprocessing comprises the following steps: removing invalid characters, removing stop words, splitting variable names, recommending keywords and embedding words, as shown in fig. 2.
The word embedding is performed with Word2Vec, which learns in an unsupervised manner through a shallow network to form a multidimensional vector matrix.
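The first three preprocessing steps (removing invalid characters, removing stop words, splitting variable names) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the stop-word list, the comment-stripping rules and the function names `split_identifier` and `preprocess` are assumptions introduced here.

```python
import re

# Hypothetical stop-word list; the patent does not enumerate one.
STOP_WORDS = {"var", "let", "const", "function", "return", "the", "a"}


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase sub-words."""
    parts = []
    for chunk in re.split(r"[_$]+", name):
        # Put a space at each lowercase/digit -> uppercase boundary.
        spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", chunk)
        parts.extend(p.lower() for p in spaced.split())
    return parts


def preprocess(code):
    """Strip comments and non-identifier characters, split names, drop stop words."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)               # line comments
    tokens = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", code)
    words = [w for t in tokens for w in split_identifier(t)]
    return [w for w in words if w not in STOP_WORDS]
```

For example, `preprocess("var userName = eval(payload_str); // run")` yields `['user', 'name', 'eval', 'payload', 'str']`, which could then be passed to a word-embedding model.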
Step 2: extracting semantic features based on the preprocessed code samples to obtain a semantic expression vector. Step 2 specifically comprises:
step 2.1: performing keyword recommendation on the preprocessed malicious code;
step 2.2: generating a keyword embedding matrix based on the keywords; meanwhile, performing part-of-speech tagging based on the recommended keywords to generate a part-of-speech matrix;
step 2.3: obtaining a feature-enhanced word embedding matrix according to the keyword embedding matrix and the part-of-speech matrix;
step 2.4: performing context semantic feature extraction on the feature-enhanced word embedding matrix to obtain a semantic expression vector;
In this embodiment, the main semantic processing adopts a bidirectional gated recurrent unit (Bi-GRU), which the embodiment regards as currently giving the best results. It processes the data in front-to-back and back-to-front order at the same time, thereby obtaining context information from both directions.
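For reference, the widely used textbook formulation of a GRU cell and its bidirectional combination is given below. The patent does not state its equations, so the symbols (input x_t, hidden state h_t, parameters W, U, b) and the particular update convention are assumptions based on the standard definition, not the authors' notation.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
h_t^{\mathrm{bi}} &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right]
\end{aligned}
```

The last line concatenates the forward and backward hidden states at each position, which is how the context from both reading directions is combined.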
Step 2.5: performing a max-pooling operation on the keyword embedding matrix to obtain a keyword vector. To improve the classification and detection of malicious code, a keyword recommendation algorithm is applied to the code to screen keywords, the keyword information is passed through an embedding layer to obtain keyword embeddings, and a max-pooling operation is then carried out to obtain the keyword vector.
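The max-pooling in step 2.5 can be illustrated with a short sketch. It assumes the keyword embedding matrix is a list of equal-length rows (one row per keyword) and takes the maximum along each embedding dimension; the function name `max_pool` is hypothetical.

```python
def max_pool(embeddings):
    """Column-wise maximum over a list of equal-length keyword embedding rows."""
    if not embeddings:
        raise ValueError("need at least one keyword embedding")
    # zip(*rows) iterates over embedding dimensions (columns).
    return [max(column) for column in zip(*embeddings)]
```

With two keywords embedded as `[0.1, -0.5, 0.3]` and `[0.4, 0.2, -0.1]`, the pooled keyword vector is `[0.4, 0.2, 0.3]`.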
Step 2.6: splicing the semantic expression vector and the keyword vector based on an attention mechanism to obtain a semantic feature matrix.
To make full use of the keyword information, the semantic feature matrix obtained in step 2.4 is combined with the keyword vector obtained in step 2.5 to obtain a feature matrix. An attention mechanism is then used to focus on the information in the malicious code that corresponds to the keywords, and the semantic expression vector is calculated as shown in equations (1)-(2), which involve a weight matrix, a random bias parameter and an activation function.
[Equations (1) and (2) are images in the source and are not reproduced here.]
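Because equations (1)-(2) survive only as images, the exact attention form is not recoverable from this text. The sketch below shows one common additive-attention pooling under stated assumptions: each row of the feature matrix is scored with tanh of a learned projection plus a bias, the scores are normalized with softmax, and the rows are summed with those weights. The function name `attention_pool`, the use of a weight vector instead of a full weight matrix, and tanh as the activation are all assumptions.

```python
import math


def attention_pool(rows, weights, bias):
    """Score each row with tanh(w . h + b), softmax the scores, return the weighted sum.

    rows    : list of equal-length feature rows (one per position)
    weights : assumed projection vector, same length as a row
    bias    : assumed scalar bias
    """
    scores = [math.tanh(sum(w * h for w, h in zip(weights, row)) + bias) for row in rows]
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]      # attention weights, sum to 1
    dim = len(rows[0])
    pooled = [sum(a * row[j] for a, row in zip(alphas, rows)) for j in range(dim)]
    return pooled, alphas
```

Rows that score higher receive larger attention weights, so positions related to the recommended keywords can dominate the pooled semantic vector once the projection is trained.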
Step 3: performing weight extraction on each keyword based on the TF-IDF algorithm, and connecting the weights to obtain a code weight vector.
Under the TF-IDF algorithm, a word is considered more representative the more often it appears in a given text and the less often it appears in the rest of the corpus. After the weight of each word in the current data is obtained, the weights of all words in the current data are concatenated to obtain the code weight vector.
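A minimal sketch of step 3 follows. The patent names TF-IDF but not a specific variant, so the smoothed IDF used here (the form `log((1 + N) / (1 + df)) + 1`) and the function name `tfidf_vector` are assumptions.

```python
import math
from collections import Counter


def tfidf_vector(doc_tokens, corpus, keywords):
    """Return one TF-IDF weight per keyword for a single tokenized code sample.

    doc_tokens : tokens of the current code sample
    corpus     : list of token lists, one per code sample
    keywords   : keywords whose weights are concatenated into the vector
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    vector = []
    for word in keywords:
        tf = counts[word] / len(doc_tokens) if doc_tokens else 0.0
        df = sum(1 for doc in corpus if word in doc)        # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0       # smoothed IDF (assumption)
        vector.append(tf * idf)
    return vector
```

A keyword that is frequent in the sample but rare across the corpus receives the largest weight, matching the intuition described above.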
Step 4: performing statistic calculation on each keyword based on the chi-square test, and connecting the statistics to obtain a code statistical vector.
In the chi-square (χ²) test, a higher value of the χ² statistic indicates that a word in the malicious code is less independent of, that is, more strongly associated with, its corresponding category. A normalization operation prevents words with very large χ² statistics from distorting the final classification. The calculation finally yields the χ² statistical vector of the malicious code.
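Step 4 can be sketched with the standard χ² statistic for a 2×2 contingency table (word present or absent versus malicious or benign). The patent does not say which normalization it uses; dividing by the largest score is an assumption made here, and both function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table.

    a / b : malicious / benign samples that contain the word
    c / d : malicious / benign samples that do not contain it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def chi_square_vector(keywords, malicious_docs, benign_docs):
    """One max-normalized chi-square score per keyword (normalization is an assumption)."""
    scores = []
    for word in keywords:
        a = sum(word in doc for doc in malicious_docs)
        b = sum(word in doc for doc in benign_docs)
        scores.append(chi_square(a, b, len(malicious_docs) - a, len(benign_docs) - b))
    top = max(scores, default=0.0)
    return [s / top if top else 0.0 for s in scores]
```

A word that appears only in malicious samples gets the highest score, while a word spread evenly across both classes scores zero.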
Step 5: obtaining a code topic representation vector from the word embedding matrix based on a document topic classification algorithm.
GLDA (Gaussian LDA), proposed on the basis of LDA, replaces the word-generation part of LDA with multivariate Gaussian distributions and adopts a fast collapsed Gibbs sampling algorithm.
After the word embeddings are obtained, they are input into the GLDA model, which discovers the information in the word embeddings and performs hierarchical modeling. The hierarchical "code-topic-word embedding" model finally yields the two components of the topic representation for the current data: a code-topic matrix and a topic-word embedding matrix, as shown in fig. 3.
The two matrices respectively represent the frequency of the code under different topics and the frequency of the word embeddings under different topics; multiplying them gives the feature vector of the word embeddings in the code, namely the topic representation vector.
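The final multiplication in step 5 can be illustrated as follows. The sketch assumes that one code sample contributes a 1 x K row of topic proportions and that the topic-word embedding matrix has shape K x d; fitting the GLDA model itself is out of scope, and the function name `topic_representation` is hypothetical.

```python
def topic_representation(code_topic_row, topic_embedding):
    """Multiply a 1 x K code-topic row by a K x d topic-embedding matrix.

    Returns a length-d topic representation vector for one code sample.
    """
    if len(code_topic_row) != len(topic_embedding):
        raise ValueError("topic counts do not match")
    dim = len(topic_embedding[0])
    return [
        sum(p * topic_embedding[k][j] for k, p in enumerate(code_topic_row))
        for j in range(dim)
    ]
```

For a sample that belongs half to each of two topics, `topic_representation([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])` gives `[0.5, 0.5]`, the proportion-weighted mix of the two topic embeddings.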
Step 6: carrying out weighted fusion on the code weight vector, the code statistical vector and the code topic representation vector to obtain the multi-modal feature vector.
After the code weight vector, the χ² statistical vector and the topic representation vector are obtained, feature fusion is performed on the three parts, as shown in formula (3):
[Formula (3) is an image in the source and is not reproduced here.]
where the three weight parameters sum to 1 and their respective values are obtained by iterative training. The topic feature vector generated by LDA can provide semantic enhancement for the code fed into the model.
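Since formula (3) is not reproduced in this text, the sketch below shows one plausible reading of the weighted fusion: an element-wise weighted sum of the three vectors with weights that sum to 1. The softmax parameterization (a convenient way to keep trainable weights positive and normalized), the requirement that the three vectors share one dimension, and the names `fusion_weights` and `fuse` are all assumptions rather than the patent's stated method.

```python
import math


def fusion_weights(raw):
    """Softmax three raw trainable scores so the fusion weights are positive and sum to 1."""
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(weight_vec, chi_vec, topic_vec, raw=(0.0, 0.0, 0.0)):
    """Element-wise weighted sum of three equal-length feature vectors."""
    if not (len(weight_vec) == len(chi_vec) == len(topic_vec)):
        raise ValueError("vectors must share one dimension")
    alpha, beta, gamma = fusion_weights(raw)
    return [alpha * w + beta * c + gamma * t
            for w, c, t in zip(weight_vec, chi_vec, topic_vec)]
```

In a training loop the three raw scores would be updated together with the other model parameters, which matches the statement that the weight values are obtained by iterative training.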
Step 7: splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features, and training a malicious code detection model for malicious code detection.
The malicious code detection model is trained based on a Sigmoid classifier. And connecting the semantic expression vector with the multi-modal feature vector, sending the connected vector into a full connection layer, using a Sigmoid classifier and outputting a result. The overall structure diagram of the finally obtained Multi-modal Feature Fusion malicious code detection Model MFFM (Multi-modal Feature Fusion Model) is shown in fig. 1.
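The classification head described in step 7 (concatenate, fully connected layer, Sigmoid) can be sketched for a single sample as follows. The single output unit, the parameter names `fc_weights` and `fc_bias`, and the function names are assumptions; training of the parameters is omitted.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict(semantic_vec, multimodal_vec, fc_weights, fc_bias):
    """Concatenate both vectors, apply one fully connected unit, then Sigmoid."""
    features = list(semantic_vec) + list(multimodal_vec)
    if len(features) != len(fc_weights):
        raise ValueError("weight length must equal concatenated feature length")
    logit = sum(w * f for w, f in zip(fc_weights, features)) + fc_bias
    return sigmoid(logit)
```

The output is a score between 0 and 1 that can be thresholded (for example at 0.5) to decide between the two classes.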
Example two
Based on the method provided by the first embodiment, the present embodiment provides a malicious code detection system based on multimodal feature fusion, including:
the training data acquisition module is used for acquiring a training code sample set, and the training code sample set comprises benign code and malicious code;
the data preprocessing module is used for carrying out keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix;
the semantic feature extraction module is used for extracting semantic features according to the keyword embedding matrix to obtain a semantic expression vector;
the multi-modal feature extraction module is used for performing weight extraction on each keyword based on a keyword extraction algorithm and connecting the weights to obtain a code weight vector; performing statistic calculation on each keyword based on the chi-square test, and connecting the statistics to obtain a code statistical vector; obtaining a code-topic matrix and a topic-word embedding matrix based on a document topic classification algorithm, and multiplying the code-topic matrix and the topic-word embedding matrix to obtain a topic representation vector; and
carrying out weighted fusion on the code weight vector, the code statistical vector and the topic representation vector to obtain a multi-modal feature vector of the code sample;
the multi-modal feature fusion module is used for splicing the semantic expression vector and the multi-modal feature vector to obtain multi-modal fusion features of the code sample;
and the detection model training module is used for training a malicious code detection model based on the multi-modal fusion characteristics of the training code sample set and is used for malicious code detection.
Example three
The embodiment aims at providing an electronic device.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described in embodiment one when executing the program.
Example four
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth in the first embodiment.
The steps involved in the second to fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the related description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that causes the processor to perform any of the methods of the present invention.
Results of the experiment
Data set and experimental environment
The data set consists of JavaScript code and comprises 10483 benign codes and 8367 malicious codes. To avoid data imbalance, 8000 of the benign codes and the malicious codes are randomly extracted. The data set is then divided into a training set and a test set by 8.
Model evaluation
Samples of malicious code are defined as negative samples and samples of benign code as positive samples, and Precision, Recall and F-score are selected as the evaluation metrics for assessing the experimental results.
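The three metrics can be computed from confusion-matrix counts as sketched below. The patent does not state which F variant it uses; the balanced F1 (harmonic mean of precision and recall) is assumed here, and the function names are hypothetical.

```python
def f_from(precision, recall):
    """Harmonic mean of precision and recall (F1, an assumed variant)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_from(precision, recall)
```

As a consistency check under the F1 assumption, `f_from(0.901, 0.890)` rounds to 0.895 and `f_from(0.869, 0.882)` rounds to 0.875, which agree with the F values reported for MFFM and LSTM in Table 1.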
Analysis of experiments
In LDA, different settings of the number of topics K lead to different results from the code-topic and topic-word embedding layers, which directly affects the final topic representation vector and ultimately the classification performance of the model. To determine K, experiments were run with different values of K, and the results were evaluated by log-likelihood. The experimental results are shown in fig. 4; the best log-likelihood is obtained when K is 90.
In addition, to demonstrate the effectiveness of the model, it was compared with other widely used baseline models. The performance indicators are shown in the following table:
Table 1. Performance indicators of the different malicious code detection models
Model         Precision   Recall   F value
LSTM          0.869       0.882    0.875
Bi-LSTM       0.881       0.901    0.889
CNN-BiLSTM    0.875       0.883    0.878
MFFM          0.901       0.890    0.895
According to the experimental results, the other models perform only semantic feature extraction and lack multi-modal feature information, so their precision and F value are lower than those of the proposed MFFM (Bi-LSTM alone reaches a higher recall, 0.901 versus 0.890). This indicates that using multi-modal feature information in malicious code detection has a positive effect on model performance.
In addition, in fig. 5 the abscissa represents the number of training epochs and the ordinate represents the loss on the validation set. The loss decreases as the number of epochs increases. The downward trend shows that, compared with the other three models, the proposed MFFM converges noticeably faster and reaches a relatively lower loss when training stops, because the Bi-GRU it adopts has one fewer gate than the Bi-LSTM.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive changes in the technical solutions of the present invention.

Claims (10)

1. A malicious code detection method based on multi-modal feature fusion, characterized by comprising the following steps:
acquiring a training code sample set, wherein the training code sample set comprises benign code and malicious code;
performing keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix;
extracting semantic features from the keyword embedding matrix to obtain a semantic representation vector;
performing weight extraction on each keyword based on a keyword extraction algorithm, and concatenating the results to obtain a code weight vector; calculating a statistic for each keyword based on the chi-square test, and concatenating the results to obtain a code statistic vector; obtaining a code-topic matrix and a topic-word embedding matrix based on a document topic classification algorithm, and multiplying the code-topic matrix by the topic-word embedding matrix to obtain a topic representation vector;
performing weighted fusion on the code weight vector, the code statistic vector and the topic representation vector to obtain a multi-modal feature vector of the code sample;
concatenating the semantic representation vector and the multi-modal feature vector to obtain multi-modal fusion features of the code sample;
and training a malicious code detection model, based on the multi-modal fusion features of the training code sample set, for malicious code detection.
2. The multi-modal feature fusion based malicious code detection method according to claim 1, wherein after the training code sample set is acquired, preprocessing is further performed, comprising eliminating invalid characters, removing stop words and splitting variable names.
3. The multi-modal feature fusion based malicious code detection method according to claim 1, wherein the performing semantic feature extraction comprises:
performing context semantic feature extraction and a max pooling operation, respectively, on the keyword embedding matrix to obtain a semantic representation vector and a keyword vector;
and concatenating the semantic representation vector and the keyword vector based on an attention mechanism to obtain a semantic feature matrix.
4. The multi-modal feature fusion based malicious code detection method according to claim 3, wherein the performing context semantic feature extraction on the keyword embedding matrix comprises:
performing part-of-speech tagging based on the recommended keywords to generate a part-of-speech matrix;
obtaining a feature-enhanced word embedding matrix from the keyword embedding matrix and the part-of-speech matrix;
and performing context semantic feature extraction on the feature-enhanced word embedding matrix to obtain a semantic representation vector.
5. The multi-modal feature fusion based malicious code detection method according to claim 3, wherein the context semantic feature extraction is based on a bidirectional gated recurrent unit (Bi-GRU).
6. The multi-modal feature fusion based malicious code detection method according to claim 1, wherein the keyword extraction algorithm adopts a TF-IDF algorithm.
7. The multi-modal feature fusion based malicious code detection method according to claim 1, wherein the document topic classification algorithm employs a Gaussian LDA topic model.
8. A malicious code detection system based on multi-modal feature fusion, characterized by comprising:
a training data acquisition module, used for acquiring a training code sample set, the training code sample set comprising benign code and malicious code;
a data preprocessing module, used for performing keyword recommendation and word embedding on each code sample to obtain a keyword embedding matrix;
a semantic feature extraction module, used for extracting semantic features from the keyword embedding matrix to obtain a semantic representation vector;
a multi-modal feature extraction module, used for performing weight extraction on each keyword based on a keyword extraction algorithm and concatenating the results to obtain a code weight vector; calculating a statistic for each keyword based on the chi-square test, and concatenating the results to obtain a code statistic vector; obtaining a code-topic matrix and a topic-word embedding matrix based on a document topic classification algorithm, and multiplying the code-topic matrix by the topic-word embedding matrix to obtain a topic representation vector; and
performing weighted fusion on the code weight vector, the code statistic vector and the topic representation vector to obtain a multi-modal feature vector of the code sample;
a multi-modal feature fusion module, used for concatenating the semantic representation vector and the multi-modal feature vector to obtain multi-modal fusion features of the code sample;
and a detection model training module, used for training a malicious code detection model, based on the multi-modal fusion features of the training code sample set, for malicious code detection.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-modal feature fusion based malicious code detection method according to any of claims 1-7 when executing the program.
10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the multi-modal feature fusion based malicious code detection method according to any of claims 1-7.
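The multi-modal feature steps recited in claims 1, 6 and 8 can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the patented implementation: the toy samples, labels, topic matrices, fusion weights and the placeholder semantic vector are all hypothetical, the topic matrices are hard-coded rather than produced by a Gaussian LDA model, and "weighted fusion" is assumed here to mean scaling each vector by a scalar weight and concatenating, since the claims do not fix the exact form.

```python
import math

# Toy keyword lists per code sample; label 1 = malicious, 0 = benign (made up).
samples = [
    ["eval", "base64", "decode", "exec"],
    ["print", "open", "read"],
    ["exec", "socket", "eval"],
    ["open", "print", "close"],
]
labels = [1, 0, 1, 0]
vocab = sorted({kw for s in samples for kw in s})


def tfidf_vector(sample, corpus):
    """Code weight vector: TF-IDF weight of every vocabulary keyword."""
    n = len(corpus)
    vec = []
    for kw in vocab:
        tf = sample.count(kw) / len(sample)
        df = sum(1 for s in corpus if kw in s)  # >= 1, vocab comes from corpus
        vec.append(tf * math.log(n / df))
    return vec


def chi_square(kw, corpus, y):
    """Chi-square statistic of the 2x2 table: keyword presence vs. label."""
    a = sum(1 for s, t in zip(corpus, y) if kw in s and t == 1)
    b = sum(1 for s, t in zip(corpus, y) if kw in s and t == 0)
    c = sum(1 for s, t in zip(corpus, y) if kw not in s and t == 1)
    d = sum(1 for s, t in zip(corpus, y) if kw not in s and t == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def statistic_vector(sample, corpus, y):
    """Code statistic vector: chi-square value for present keywords, else 0."""
    return [chi_square(kw, corpus, y) if kw in sample else 0.0 for kw in vocab]


# Hypothetical topic-model outputs: code-topic (4 x 2), topic-word embedding (2 x 3).
code_topic = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
topic_embed = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]


def topic_vector(i):
    """Topic representation vector: row i of code_topic times topic_embed."""
    row = code_topic[i]
    return [sum(row[k] * topic_embed[k][j] for k in range(len(row)))
            for j in range(len(topic_embed[0]))]


def fuse(i, weights=(0.4, 0.3, 0.3)):
    """Assumed weighted fusion: scale each vector, then concatenate."""
    parts = (tfidf_vector(samples[i], samples),
             statistic_vector(samples[i], samples, labels),
             topic_vector(i))
    return [w * x for w, part in zip(weights, parts) for x in part]


# Placeholder for the semantic representation vector of sample 0.
semantic = [0.1, -0.2, 0.3, 0.05]
fusion_feature = semantic + fuse(0)  # final concatenation step of claim 1
print(len(vocab), len(fusion_feature))
```

In this toy corpus the keyword `eval` appears only in malicious samples, so its chi-square statistic is the maximum possible for four samples, while a keyword such as `base64` that appears in only one malicious sample scores lower.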

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210849728.6A CN115080973B (en) 2022-07-20 2022-07-20 Malicious code detection method and system based on multi-mode feature fusion

Publications (2)

Publication Number Publication Date
CN115080973A CN115080973A (en) 2022-09-20
CN115080973B true CN115080973B (en) 2022-12-06

Family

ID=83259799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210849728.6A Active CN115080973B (en) 2022-07-20 2022-07-20 Malicious code detection method and system based on multi-mode feature fusion

Country Status (1)

Country Link
CN (1) CN115080973B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093996B (en) * 2023-10-18 2024-02-06 湖南惟储信息技术有限公司 Safety protection method and system for embedded operating system
CN117668237B (en) * 2024-01-29 2024-05-03 深圳开源互联网安全技术有限公司 Sample data processing method and system for intelligent model training and intelligent model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990565A (en) * 2019-11-20 2020-04-10 广州商品清算中心股份有限公司 Extensible text analysis system and method for public sentiment analysis
CN112182568A (en) * 2019-07-02 2021-01-05 四川大学 Malicious code classification based on graph convolution network and topic model
CN112328475A (en) * 2020-10-28 2021-02-05 南京航空航天大学 Defect positioning method for multiple suspicious code files
CN112328469A (en) * 2020-10-22 2021-02-05 南京航空航天大学 Function level defect positioning method based on embedding technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190272375A1 (en) * 2019-03-28 2019-09-05 Intel Corporation Trust model for malware classification

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
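Relational Database Watermarking Based on Chinese Word Segmentation and Word Embedding; Wenling Li et al.; 2020 29th International Conference on Computer Communications and Networks (ICCCN); 2020-09-30; pp. 1-6 *
Research on Software Defect Prediction Technology Based on Topic Model; Zhang Zetao et al.; Computer Engineering & Science; 2016-05-15 (No. 05); pp. 932-937 *
Research Progress on Information-Retrieval-Based Software Defect Localization Techniques; Zhang Yun et al.; Journal of Software; 2020-08-31; Vol. 31 (No. 8); pp. 2432-2452 *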

Also Published As

Publication number Publication date
CN115080973A (en) 2022-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant