CN117610002B - Multi-mode feature alignment-based lightweight malicious software threat detection method - Google Patents


Info

Publication number
CN117610002B
CN117610002B (application CN202410086383.2A)
Authority
CN
China
Prior art keywords
software
nodes
label
tag
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410086383.2A
Other languages
Chinese (zh)
Other versions
CN117610002A (en)
Inventor
孙捷 (Sun Jie)
车洵 (Che Xun)
陈亚当 (Chen Yadang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongzhiwei Information Technology Co ltd
Original Assignee
Nanjing Zhongzhiwei Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongzhiwei Information Technology Co ltd filed Critical Nanjing Zhongzhiwei Information Technology Co ltd
Priority to CN202410086383.2A priority Critical patent/CN117610002B/en
Publication of CN117610002A publication Critical patent/CN117610002A/en
Application granted granted Critical
Publication of CN117610002B publication Critical patent/CN117610002B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 21/562 — Static detection (under G06F 21/56, computer malware detection or handling, e.g. anti-virus arrangements)
    • G06F 21/552 — Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G06F 18/213 — Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
    • G06F 18/2155 — Generating training patterns characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques
    • G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06N 3/042 — Knowledge-based neural networks; logical representations of neural networks
    • G06N 3/0455 — Auto-encoder networks; encoder-decoder networks
    • G06N 3/0895 — Weakly supervised learning, e.g. semi-supervised or self-supervised learning


Abstract

The invention discloses a lightweight malware threat detection method based on multi-modal feature alignment, which comprises the following steps: giving the log information and malware label table of a program sample of the software to be detected; analyzing the log information of the software to be detected and preliminarily outputting different probability labels for it; introducing malware labels to construct a vocabulary and obtaining embedded vectors; dividing the nodes in the graph into a plurality of clusters; obtaining classification voting labels of the software to be detected through a cluster voting prompt realignment algorithm; establishing a student encoder to obtain prediction labels of the software to be detected; for the weak labels and prediction labels, calculating the loss of the samples using the real sample labels and a maximized-boundary method; and obtaining a decision result according to the loss of the samples. The method identifies and detects malware threats, realizes efficient and lightweight threat detection, and reduces the risk such threats pose to user data and privacy.

Description

Multi-mode feature alignment-based lightweight malicious software threat detection method
Technical Field
The invention relates to the technical field of network security, in particular to a lightweight malicious software threat detection method based on multi-modal feature alignment.
Background
With the continued development of computer and internet technology, malware threats have become a serious challenge in the field of information security. Malware now takes a variety of forms including, but not limited to, viruses, worms, Trojan horses, ransomware, adware, and malicious browser plug-ins. The variability between these malware types makes detection more complex, posing a serious threat to the information assets of individuals, businesses, and government agencies. These threats may cause data leakage, system paralysis, financial loss, and personal privacy leakage. Malware detection requires not only timely discovery of threats, but also rapid response to isolate, clean, or repair infected systems and data. Real-time operation is critical to limit the spread of threats and reduce potential damage. Thus, malware threat detection is one of the important research directions in the field of information security.
In the current information security environment, the threat forms of malicious software are varied, and attackers continuously adopt new technical means to avoid the traditional detection method. Traditional malware detection methods mainly include feature-based detection and behavior-based detection.
Feature-based detection methods rely on known malware features such as virus signatures or patterns of malicious code. However, this approach is susceptible to malware variants, as an attacker can easily modify the characteristics of malware to evade detection. Furthermore, feature-based detection methods typically require a large library of features, which can lead to wasted storage and computing resources, and are not effective against zero-day threats, i.e., malware that has not been discovered and recorded. Furthermore, malware authors continue to improve tools and techniques to evade detection. This includes techniques of code obfuscation, multi-layer encryption, self-modifying malicious code, etc. These techniques make conventional feature-based detection methods more difficult in identifying and blocking malware.
Behavior-based detection methods attempt to analyze the execution behavior of malware without relying on specific features. Although this approach has certain advantages, it also has some problems. First, behavior-based detection methods typically require extensive training data and complex machine learning models, which increase the computational complexity of detection. Second, this approach may produce false positives, because some legitimate software may exhibit similar behavioral characteristics that are difficult to distinguish. Finally, behavior-based detection methods face challenges in real-time performance and efficiency, since analyzing malware behavior takes time, and time is often a critical factor during malware attacks. Thus, the malware threat detection field currently faces increasingly complex and diverse threats, and conventional approaches often fail to provide adequate protection. To ensure the security of information systems and data, novel artificial-intelligence methods such as deep learning should be combined with malware threat detection research and technology.
Disclosure of Invention
Therefore, it is necessary to provide a method for detecting the threat of lightweight malicious software based on multi-modal feature alignment, which aims to overcome the problems of the conventional method, and by introducing the multi-modal feature alignment technology, the malicious software can be detected more accurately, meanwhile, the computational complexity is reduced, and the real-time performance and the efficiency of the detection are improved by using a lightweight detection model.
To achieve the above object, the present inventors provide a lightweight malware threat detection method based on multi-modal feature alignment, comprising the steps of:
S1, giving log information and a malicious software tag table of a program sample of software to be detected;
S2, analyzing the log information of the software to be detected, obtaining relevant fields through regularization to judge association relations, initializing the relevant fields and their association relations as nodes and edges respectively, inputting the nodes and edges into a graph encoder, which updates the node embeddings through message passing, and preliminarily outputting different probability labels of the software to be detected;
S3, introducing a malicious software tag to construct a vocabulary, and obtaining an embedded vector after the vocabulary passes through a CLIP encoder;
S4, performing spectral clustering on the nodes and edges obtained in the step S2, and dividing the nodes in the graph into a plurality of clusters, so that the nodes in the same cluster have high similarity, and the nodes among different clusters have low similarity;
S5, obtaining a classified voting label of the software to be detected through a cluster voting prompt realignment algorithm by using the embedded vector obtained in the step S3 and the clustering result obtained in the step S4, and forming a weak label by using the classified voting label and the probability label obtained in the step S2;
S6, establishing a student encoder, updating the student encoder by using an exponential moving average for the graph encoder, inputting log information of the software to be detected into the graph encoder, and predicting by using updated weights of nodes and edges to obtain a software prediction tag to be detected;
S7, calculating the loss of the sample by adopting a real sample label and a maximized boundary method aiming at the weak label obtained in the step S5 and the predicted label obtained in the step S6;
And S8, obtaining a decision result according to the loss of the sample, judging whether an execution program of the software to be detected is judged to be a malicious software threat behavior according to the decision result, and adding the execution program into a training set to perform the next round of detection of other software.
As a preferred mode of the present invention, the log information in step S1 includes: date, timestamp, IP address, file path, user operation, port, and event type.
As a preferred mode of the present invention, step S2 further includes:
S201, transforming the original data fields by a regularization method so that every field lies within the same scale range; a field X is regularized to X' according to:

X' = (X − min(X)) / (max(X) − min(X))

wherein X represents an original data field, min(X) and max(X) are the minimum and maximum observed values of that field, and X' represents the regularized data field;
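The regularization of step S201 can be sketched minimally as min-max scaling, assuming (as the "same scale range" wording suggests) that each numeric field is mapped into [0, 1]:

```python
# Hedged sketch of S201: min-max regularization of raw log fields, assuming
# "same scale range" means scaling each numeric field into [0, 1].

def regularize(values):
    """Map a list of raw field values X into X' in [0, 1] via min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant field: map everything to 0.0
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

ports = [22, 80, 443, 8080]          # illustrative port-number field
print(regularize(ports))
```

Non-numeric fields (IP addresses, file paths) would first need an encoding into numbers; the patent does not specify that mapping.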
S202, selecting relevant fields among the regularized data fields as the initialization of nodes and edges, wherein the nodes represent different data fields and the edges represent association relations among the fields; the initialization of nodes and edges is used for constructing a graph in which nodes represent different data attributes and edges represent the association relations among the attributes; for each field f_i in the log data, its regularized value X'_i is taken as a node, and the expression of node initialization is:

V = {v_1, v_2, ..., v_k}

wherein V represents the set of nodes, each node v_i corresponds to field f_i, and k represents the number of fields;
S203, initializing edges to represent the association relation between the fields, wherein an edge exists between each pair of nodes in a fully connected graph, the weight of each edge is initialized to a default value, and the weights are represented as an adjacency matrix A, wherein A_ij represents the weight of the edge between node v_i and node v_j;
S204, updating the embedded representation of the nodes by using a graph encoder: for the obtained node set V and adjacency matrix A, each node v_i has an embedded vector h_i, initially set to the node's initialized value; the update expression of the graph encoder is:

h_i^(l+1) = σ( Σ_{v_j ∈ N(v_i)} (1 / c_ij) · W^(l) · h_j^(l) )

wherein h_i^(l) represents the embedding of node v_i at layer l, σ represents the activation function, N(v_i) represents the set of neighbor nodes of node v_i, c_ij represents a normalization constant, typically the sum of the weights of the edges between node v_i and its neighbor nodes v_j, and W^(l) represents the weight matrix of layer l for the linear transformation;
S205, using the node feature representation h_i obtained from the graph encoder for prediction, with an additional fully connected layer mapping node features to class scores; the expression of this process is:

z_i = W · h_i + b

wherein z_i represents the classification score of node v_i, W represents a weight matrix, and b represents a bias vector;
S206, converting the classification scores of the nodes into a probability distribution by using the softmax normalization function and taking the most probable class as the label; the expression is:

P(i|z) = softmax(z)_i,    label(v_i) = argmax_i P(i|z)

wherein P(i|z) represents the probability that a given node v_i belongs to class i, and label(v_i) is the label of node v_i.
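Steps S202–S206 can be illustrated end-to-end with a toy NumPy sketch: nodes from regularized field values, a fully connected adjacency, one message-passing update, a fully connected head, and softmax/argmax. All shapes and random weights here are assumptions for illustration, not the patent's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of S202-S206 (assumed shapes, untrained illustrative weights):
# k log fields -> k nodes holding regularized values, fully connected graph.
x = np.array([0.1, 0.4, 0.9, 0.3])        # regularized field values X'_i
k, d, n_cls = len(x), 8, 3                # nodes, hidden dim, label classes
A = np.ones((k, k)) - np.eye(k)           # fully connected, default weight 1

H = np.tile(x[:, None], (1, d))           # initial embeddings h_i
W_l = rng.standard_normal((d, d)) * 0.1   # layer weight matrix W^(l)

# One message-passing update: h_i <- sigma(sum_j (A_ij / c_ij) * W^(l) h_j)
c = A.sum(axis=1, keepdims=True)          # normalization constants c_ij
H = np.maximum(0.0, (A / c) @ H @ W_l)    # ReLU as the activation sigma

# Fully connected head + softmax -> per-node class probabilities (S205-S206)
W = rng.standard_normal((d, n_cls)) * 0.1
b = np.zeros(n_cls)
z = H @ W + b                             # classification scores z_i
P = np.exp(z - z.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)         # softmax: P(i|z)
labels = P.argmax(axis=1)                 # label(v_i) = argmax_i P(i|z)
print(P.shape, labels.shape)
```

A real implementation would stack several such layers and learn W^(l), W, and b by backpropagation.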
As a preferred mode of the present invention, step S3 further includes:
S301, introducing a malicious software tag for describing different types of malicious software features and behaviors, and constructing the malicious software tag into a malicious software tag vocabulary with multi-modal features;
s302, introducing a CLIP encoder, and processing a malicious software tag vocabulary through the CLIP encoder to obtain an embedded vector associated with the tag, wherein the expression is as follows:
Etext=CE(t)
Where E text denotes the embedded vector with which the tag is associated, t denotes the malware tag, and CE denotes the CLIP encoder.
As a preferred mode of the present invention, step S4 further includes: calculating the similarity among the nodes, then calculating a similarity graph matrix, and obtaining the clusters by a clustering algorithm; the expression is:

S_c = {C_1, C_2, ..., C_l}

wherein l represents the number of clusters, C_i is the i-th cluster center, and S_c is the cluster set.
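The spectral clustering of step S4 can be sketched in NumPy: a Gaussian similarity graph, the normalized Laplacian, its smallest eigenvectors as a spectral embedding, and a tiny k-means. The Gaussian kernel and the k-means details are assumptions; the patent only states that a clustering algorithm is applied to the similarity graph matrix:

```python
import numpy as np

# Minimal spectral-clustering sketch of step S4 (assumed Gaussian similarity
# and a plain k-means on the spectral embedding).

def spectral_clusters(H, n_clusters, sigma=1.0, iters=20):
    d2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma**2))              # similarity graph matrix
    D = np.diag(1.0 / np.sqrt(S.sum(1)))
    L = np.eye(len(H)) - D @ S @ D                # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                   # ascending eigenvalues
    U = vecs[:, :n_clusters]                      # spectral embedding rows
    idx = np.linspace(0, len(U) - 1, n_clusters).astype(int)
    C = U[idx]                                    # spread-out initial centers
    for _ in range(iters):                        # plain k-means on U
        lab = ((U[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.stack([U[lab == i].mean(0) if (lab == i).any() else C[i]
                      for i in range(n_clusters)])
    return lab

# Two well-separated toy groups of node embeddings
H = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (5, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (5, 2))])
labels = spectral_clusters(H, 2)
print(labels)
```

On such well-separated data the two groups land in distinct clusters, matching the stated goal of high intra-cluster and low inter-cluster similarity.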
As a preferred mode of the present invention, step S5 further includes:
S501, combining the embedded vector obtained in the step S3 with the clustering result obtained in the step S4, realigning a software sample through a cluster voting prompt, and obtaining a classification voting label of the software, wherein the expression is as follows:
TC=CVP(Etext,Sc)
wherein T c represents a classification voting label of the software sample text, CVP represents a cluster voting prompt algorithm for realignment and classification, E text represents a multi-modal feature embedding vector of the software sample, and S c represents a clustering result.
S502, combining the probability label and the classified voting label in the step S2 to generate a weak label of the software sample.
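The cluster voting prompt T_c = CVP(E_text, S_c) can be sketched under one explicit assumption: that node features and tag embeddings live in a shared space, so each cluster votes for the tag whose embedding is most similar to the cluster centroid. The one-hot toy embeddings and helper name `cvp` are illustrative, not from the patent:

```python
import numpy as np

# Hedged sketch of the cluster-voting-prompt (CVP) realignment in S5,
# assuming a shared embedding space and centroid-to-tag cosine voting.

def cvp(E_text, H, cluster_labels):
    """Return one voted tag index per cluster (argmax of the voting matrix M)."""
    votes = {}
    for c in np.unique(cluster_labels):
        centroid = H[cluster_labels == c].mean(axis=0)
        sims = E_text @ centroid / (
            np.linalg.norm(E_text, axis=1) * np.linalg.norm(centroid) + 1e-9)
        votes[int(c)] = int(sims.argmax())        # highest-similarity tag wins
    return votes

E_text = np.eye(3)                  # 3 toy tag embeddings (one-hot for clarity)
H = np.array([[0.9, 0.1, 0.0],      # cluster 0 nodes lean toward tag 0
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9],      # cluster 1 nodes lean toward tag 2
              [0.0, 0.2, 0.8]])
labels = np.array([0, 0, 1, 1])
print(cvp(E_text, H, labels))       # {0: 0, 1: 2}
```

The returned per-cluster tags would then be combined with the step-S2 probability labels to form the weak labels.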
As a preferred mode of the present invention, step S6 further includes: establishing a student encoder for converting the log information of the software to be detected into a feature representation, introducing an exponential moving average method to update the weights of the student encoder, and obtaining the prediction label T_s of the student encoder according to the prediction step in step S2.
As a preferred mode of the present invention, in step S7, for the weak label of the software sample obtained in step S5 and the predictive label obtained in step S6, the loss of the sample is calculated using the true sample label and the maximized-boundary method; the expression is:

L(θ) = (1/N) · Σ_{b=1}^{N} max(0, Δ_c(b) + Δ_s(b) − Δ)

where L(θ) represents the loss function of the model, N represents the total number of samples, Δ is a hyperparameter representing the minimum boundary interval used for training the model, Δ_c(b) represents the weak-label loss of sample b, i.e. the difference between the weak label and the real label, and Δ_s(b) represents the predicted-label loss of sample b, i.e. the difference between the model's output probability label and the real label.
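The maximized-boundary loss of step S7 can be sketched as follows. The patent's exact formula is not reproduced in this text, so the hinge form below, in which the combined weak-label loss Δ_c(b) and prediction loss Δ_s(b) are penalized only beyond the minimum boundary interval Δ, is an assumption consistent with the stated definitions:

```python
# Hedged sketch of the S7 loss: hinge over the combined per-sample losses,
# with margin=0.5 as an illustrative (assumed) boundary interval Delta.

def margin_loss(delta_c, delta_s, margin=0.5):
    """Average of max(0, delta_c(b) + delta_s(b) - margin) over samples b."""
    n = len(delta_c)
    return sum(max(0.0, dc + ds - margin)
               for dc, ds in zip(delta_c, delta_s)) / n

delta_c = [0.1, 0.6, 0.0]     # per-sample weak-label losses
delta_s = [0.1, 0.3, 0.2]     # per-sample prediction losses
print(margin_loss(delta_c, delta_s))
```

Only the second sample exceeds the margin here, so it alone contributes to the loss; well-classified samples inside the boundary are ignored, which is the boundary-maximizing behavior the step describes.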
Compared with the prior art, the beneficial effects achieved by the technical scheme are as follows:
(1) According to the method, a malware threat detection framework is constructed on the basis of a traditional model; structural feature information of malware is extracted, and unlabeled data is processed iteratively through a graph neural network to obtain pseudo-supervised structural feature information. A cluster voting prompt realignment algorithm is then adopted, and the malware categories in the vocabulary are initially identified by an iterative graph clustering method. Meanwhile, malware text category prompts are generated using CLIP and the malware vocabulary, and the encoded software representation and the software text category prompts are rearranged into structural alignment. Finally, a lightweight threat detection model is constructed based on a teacher-student learning strategy, so that malware threats can be effectively identified and detected;
(2) The method realizes the structural analysis of unknown, customized malware, thereby achieving efficient and lightweight threat detection;
(3) The method has adaptability and intelligence, can better identify and cope with the continuously evolving threats, improves the safety of computers and networks, and reduces the risks of the threats on user data and privacy.
Drawings
FIG. 1 is a training frame diagram of a method according to an embodiment;
Fig. 2 is a detailed flowchart of a process of the method according to the embodiment.
Detailed Description
In order to describe the technical content, constructional features, achieved objects and effects of the technical solution in detail, the following description is made in connection with the specific embodiments in conjunction with the accompanying drawings.
The embodiment provides a lightweight malware threat detection method based on multi-modal feature alignment, which takes a teacher-student learning framework as the basic structure of malware threat detection analysis. It can detect threats from the log information of malicious software, form a lightweight model classifier based on a multi-modal feature encoder to fully mine the structured information of existing malware data, and, by analyzing potential malware threats, realize efficient and low-cost construction of a threat detection framework. Therefore, no manual intervention is needed, errors caused by human factors are reduced, and the efficiency of network security operations is improved.
As shown in fig. 1 and 2, the method specifically comprises the steps of:
S1, giving log information and a malicious software tag table of a program sample of software to be detected;
S2, analyzing the log information of the software to be detected, obtaining relevant fields through regularization to judge association relations, initializing the relevant fields and their association relations as nodes and edges respectively, inputting the nodes and edges into a graph encoder, which updates the node embeddings through message passing, and preliminarily outputting different probability labels of the software to be detected;
S3, introducing a malicious software tag to construct a vocabulary, and obtaining an embedded vector after the vocabulary passes through a CLIP encoder;
S4, performing spectral clustering on the nodes and edges obtained in the step S2, and dividing the nodes in the graph into a plurality of clusters, so that the nodes in the same cluster have high similarity, and the nodes among different clusters have low similarity;
S5, obtaining a classified voting label of the software to be detected through a cluster voting prompt realignment algorithm by using the embedded vector obtained in the step S3 and the clustering result obtained in the step S4, and forming a weak label by using the classified voting label and the probability label obtained in the step S2;
S6, establishing a student encoder, updating the student encoder by using an exponential moving average for the graph encoder, inputting log information of the software to be detected into the graph encoder, and predicting by using updated weights of nodes and edges to obtain a software prediction tag to be detected;
S7, calculating the loss of the sample by adopting a real sample label and a maximized boundary method aiming at the weak label obtained in the step S5 and the predicted label obtained in the step S6;
And S8, obtaining a decision result according to the loss of the sample, judging whether an execution program of the software to be detected is judged to be a malicious software threat behavior according to the decision result, and adding the execution program into a training set to perform the next round of detection of other software.
In the implementation process of the above embodiment, for step S2, the log information of the software to be detected is analyzed in detail. Such log information typically includes, but is not limited to, date, timestamp, IP address, file path, user operation, port, and event type. To further process this information, the original data fields are transformed using a regularization method to ensure that the regularized data fields lie within the same scale range. Specifically, in the present embodiment, the field X is regularized to X' using the following formula:

X' = (X − min(X)) / (max(X) − min(X))

where X represents an original data field, such as a date, a timestamp, or an IP address, min(X) and max(X) are the minimum and maximum observed values of that field, and X' represents the regularized data field.
In the regularized data fields in step S2, relevant fields are selected as the initialization of nodes and edges; the nodes may represent different data fields, such as date, timestamp, and IP address, and the edges represent the association relations between the fields. The initialization of nodes and edges is used to construct a graph in which nodes represent different data attributes and edges represent the associations between these attributes. For each field f_i in the log data, its regularized value X'_i is taken as a node, so node initialization can be expressed as:

V = {v_1, v_2, ..., v_k}

where V represents the set of nodes, each node v_i corresponds to field f_i, and k is the number of fields.
The initialization of the edges is used for representing the association relation between the fields; in general, a fully connected graph approach may be used, where there is one edge between each pair of nodes. The weights of the edges may be initialized to some default value, such as 1. This may be represented as an adjacency matrix A, where A_ij represents the weight of the edge between node v_i and its neighbor node v_j.
The embedded representation of the nodes is updated using a graph encoder. From the preceding steps, the node set V and the adjacency matrix A are obtained. Each node v_i has an embedded vector h_i, which can be initially set to the node's initialized value. The update formula of the graph encoder can be expressed as:

h_i^(l+1) = σ( Σ_{v_j ∈ N(v_i)} (1 / c_ij) · W^(l) · h_j^(l) )

where h_i^(l) represents the embedding of node v_i at layer l, σ represents the activation function, N(v_i) represents the set of neighbor nodes of node v_i, c_ij represents a normalization constant, typically the sum of the weights of the edges between node v_i and its neighbor nodes v_j, and W^(l) represents the weight matrix of layer l for the linear transformation.
In the above embodiment, the node feature representation h_i obtained from the graph encoder is used for prediction; an additional fully connected layer maps node features to class scores:

z_i = W · h_i + b

where z_i represents the classification score of node v_i, W represents the weight matrix, and b represents the bias vector.
The classification scores of the nodes are obtained, and the scores are converted into a probability distribution using the softmax normalization function; the most probable class is then taken as the label:

P(i|z) = softmax(z)_i,    label(v_i) = argmax_i P(i|z)

where P(i|z) represents the probability that a given node v_i belongs to class i, and label(v_i) is the label of node v_i.
For step S3 in the above embodiment, specifically: to enable multi-modal malware detection, malware tags are first introduced, which describe different types of malware features and behaviors. The purpose of malware tag construction is to build a vocabulary of multi-modal features, so as to better understand and represent the various characteristics of malware.
In this embodiment, a CLIP encoder is introduced, and the malware tag vocabulary is processed by the CLIP encoder to obtain embedded vectors associated with the tags, expressed as:
Etext=CE(t)
Where E text denotes the embedded vector with which the tag is associated, t denotes the tag of malware, and CE denotes the CLIP encoder.
For step S4 in the above embodiment, specifically: the nodes in the graph are divided into a plurality of clusters by spectral clustering, so that nodes in the same cluster have higher similarity and nodes between different clusters have lower similarity. Specifically, the similarity among the nodes is calculated, a similarity graph matrix is then computed, and a conventional clustering algorithm yields the final clusters:

S_c = {C_1, C_2, ..., C_l}

where l is the number of clusters, C_i is the i-th cluster center, and S_c is the cluster set.
For step S5 in the above embodiment, specifically: the clusters obtained in the above steps represent similarities and associations between software samples. Combining the embedded vector from step S3 and the clustering result from step S4, the software samples are realigned by the cluster voting prompt (CVP) method. Specifically, the cluster set S_c and the tag embeddings E_text are used as inputs of the cluster voting prompt method. Given the semantic clustering result S_c, a vocabulary voting distribution matrix M is calculated, where M represents the probability that E_text belongs to each cluster. The clustering result with the highest probability in the matrix M is taken as the classification voting label of the software sample text; this process can be expressed as the following formula:

T_c = CVP(E_text, S_c)

where T_c represents the classification voting label of the software sample text, CVP is the cluster voting prompt algorithm for realignment and classification, E_text is the multi-modal feature embedding vector of the software sample, and S_c represents the clustering result.
With the help of the realignment algorithm, the present embodiment can reorganize and align the embedded vectors according to these clustering results to better reflect the similarity between software samples. And finally, combining the probability label obtained in the step S2 with the classified voting label to generate a weak label of the software sample. These weak tags reflect the classification information of the software sample and can be used for further threat detection and analysis.
For step S6 in the above embodiment, specifically: a student encoder is built for processing the log information of the software; its task is to convert the log information into a feature representation for subsequent prediction and classification. An exponential moving average (EMA) method is introduced to update the weights of the student encoder. EMA is a smooth weight-update strategy that helps improve the stability and generalization performance of the model. Following the prediction step in step S2, a prediction label T_s is obtained from the student encoder.
For step S7 in the above embodiment, specifically: the weak label Tc and the prediction label Ts of the software sample need to be processed to calculate the sample loss. The aim is to calculate the loss from the weak label Tc and the model-output prediction label Ts of the software sample, combined with the real sample label, by maximizing the boundary, so as to help the model learn a more accurate classification decision boundary and improve the efficiency of malware threat detection. The loss is calculated by the following expression:
Where L(θ) represents the loss function of the model, N represents the total number of samples, the hyper-parameter Δ represents the minimum boundary interval used to control the degree of boundary maximization, Δc(b) represents the pseudo-label loss of sample b, i.e. the difference between the weak label and the real label, and Δs(b) represents the prediction-label loss of sample b, i.e. the difference between the model's output probability label and the real label.
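The exact loss formula is not reproduced in the text above, so the following is only one plausible hinge-style reading of the boundary-maximizing description; the form max(0, Δ + Δs(b) − Δc(b)) is an assumption.

```python
def max_margin_loss(weak_losses, pred_losses, delta=0.3):
    """Hinge-style boundary-maximizing loss (assumed form).

    weak_losses : per-sample pseudo-label losses Δc(b)
    pred_losses : per-sample prediction-label losses Δs(b)
    delta       : minimum boundary interval Δ (0.3 matches the value
                  given later in the training configuration)
    """
    n = len(weak_losses)
    # Penalize samples whose prediction loss fails to beat the
    # weak-label loss by at least the margin delta.
    return sum(max(0.0, delta + ds - dc)
               for dc, ds in zip(weak_losses, pred_losses)) / n
```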
For the above embodiment, in order to better use the lightweight multi-modal feature alignment model (Multimodal Feature Alignment Model, MFC) already constructed to detect malware threat behavior, this embodiment also proposes a threat detection model. It builds a malware threat detection framework on a traditional model, extracts the structural feature information of the malware, and iteratively processes unlabeled data through a graph neural network to obtain pseudo-supervised structural feature information. A cluster voting prompt realignment algorithm is then adopted, and the malware categories in the vocabulary are initially identified by an iterative graph clustering method. Meanwhile, malware text category prompts are generated using CLIP and the malware vocabulary, and the software-encoded graph and the software text category prompts are rearranged into structural alignment. Finally, a lightweight threat detection model is built based on a teacher-student learning strategy to effectively identify and detect malware threats, thereby achieving rapid and efficient threat detection.
To verify the performance of the model, based on the above embodiments, this embodiment tests the model on the Common Vulnerabilities and Exposures (CVE) library, the Aposemat IoT-23 labeled dataset of malicious and benign Internet-of-Things network traffic, and the ADFA intrusion detection dataset, in combination with emergency response handling methods disclosed on the network. The performance of the small-sample learning model in malware detection and defense, evaluated from the three aspects of accuracy, recall and F1 value, is shown in Table 1:
Table 1: performance comparison table of small sample learning model in malware detection and defense
According to the analysis of the actual results, for malware detection on the same datasets, the small-sample learning model based on contrastive learning achieves a larger improvement in malware detection and defense than the other model methods. In the horizontal comparison, different models are compared on the ADFA dataset: on top of basic malware network frameworks such as Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM) and Gated Recurrent Unit (GRU) models, a few-sample contrastive learning classifier is added to detect malware attack behavior, and the recognition accuracy, recall and F1 values are shown in Table 2:
Table 2: comparison table of accuracy rate, recall rate and F1 value of small sample learning model
It can be seen that the best performance improves accuracy, recall and F1 by 8.82%, 5.70% and 7.70% respectively, a clear improvement over the industry average. In different malware detection scenarios, the required sample size of malware attack behaviors is also greatly reduced. Therefore, the lightweight model with multi-modal feature alignment proposed in this embodiment can detect malware even when few malware samples are available, and the analysis shows that, combined with the introduced teacher-student strategy method, it can effectively detect malware threats.
In this embodiment, the whole pipeline framework shown in fig. 1 needs to be trained in advance, and the prediction modes of the training phase and the testing phase are the same, as follows:
Pre-training with Malware Training Sets (malware training set): the pre-training task is performed in a teacher-student strategy learning mode. The two branches predict the labels of the software simultaneously through the graph encoder and the self-encoder; the graph encoder outputs predicted features by aligning the two modalities of graph and text, and the threat type of the software is classified according to these features. Meanwhile, the graph encoder updates the self-encoder by exponential moving average, and the labels output by the self-encoder and the alignment labels output by the graph encoder are used to calculate the loss by maximizing the boundary, so that the parameters of the model are iteratively updated.
After pre-training is completed, the network model is fine-tuned for 15000 iterations with the open-source dataset Malware Training Sets (malware training set).
In this embodiment, the network model is initialized with random parameters, the maximized boundary is used for the final loss calculation, and the AdamW optimizer is used with default momentum settings β1 = 0.9 and β2 = 0.999; dropout is set to 0.1.
The maximum length of the input log sequence is 256 and the training batch size is set to 16. The learning rate of the self-encoder is held until training reaches 5000 iterations and then begins to descend, with training continuing to 10000 iterations; the L2 decay parameter is 0.01, and the parameters of the backbone network are fixed at this point and do not participate in training. In the prediction phase, only the self-encoder branch is used to classify the software; the number of nodes in the self-encoder is set to 128, the weight decay is set to 0.015, the minimum boundary interval Δ is 0.3, and the same configuration is adopted in the training and inference phases.
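The learning-rate behavior described above might be sketched as follows; the linear decay shape and the base_lr value are assumptions, since the text only states that the rate is held and then descends (the base value itself is not reproduced).

```python
def lr_schedule(step, base_lr, hold_steps=5000, total_steps=10000):
    """Learning-rate schedule sketched from the description above.

    The rate is held constant for the first `hold_steps` self-encoder
    updates, then decays until `total_steps`. Linear decay is an
    assumption; `base_lr` must be supplied by the caller.
    """
    if step < hold_steps:
        return base_lr
    # Linearly decay from base_lr at hold_steps down to 0 at total_steps.
    frac = (total_steps - step) / (total_steps - hold_steps)
    return base_lr * max(0.0, frac)
```

In a PyTorch setup this shape could be handed to torch.optim.lr_scheduler.LambdaLR on top of the AdamW optimizer configured with the β and weight-decay values given above.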
It should be noted that, although the foregoing embodiments have been described herein, the scope of the present invention is not limited thereby. Therefore, based on the innovative concept of the present invention, alterations and modifications to the embodiments described herein, or equivalent structures or equivalent process transformations made using the contents of the description and drawings, which apply the above technical solution directly or indirectly to other related technical fields, are all included in the scope of protection of the present invention.

Claims (2)

1. The lightweight malicious software threat detection method based on multi-modal feature alignment is characterized by comprising the following steps:
S1, giving log information and a malicious software tag table of a program sample of software to be detected;
S2, analyzing the log information of the software to be detected, obtaining relevant fields through regularization to judge association relations, initializing the relevant fields and their association relations as nodes and edges respectively, inputting the nodes and edges into a graph encoder, updating the node embeddings in the graph encoder through message passing, and preliminarily outputting different probability labels of the software to be detected;
S3, introducing a malicious software tag to construct a vocabulary, and obtaining an embedded vector after the vocabulary passes through a CLIP encoder;
S4, performing spectral clustering on the nodes and edges obtained in step S2, and dividing the nodes in the graph into a plurality of clusters, so that nodes in the same cluster have high similarity and nodes in different clusters have low similarity;
S5, obtaining a classified voting label of the software to be detected through a cluster voting prompt realignment algorithm by using the embedded vector obtained in the step S3 and the clustering result obtained in the step S4, and forming a weak label by using the classified voting label and the probability label obtained in the step S2;
S6, establishing a student encoder, updating the student encoder from the graph encoder by using an exponential moving average, inputting the log information of the software to be detected into the graph encoder, and predicting with the updated weights of the nodes and edges to obtain a prediction label of the software to be detected;
S7, calculating the loss of the sample by using the real sample label and the boundary-maximization method for the weak label obtained in step S5 and the prediction label obtained in step S6;
S8, obtaining a decision result according to the loss of the sample, judging according to the decision result whether the execution program of the software to be detected constitutes a malicious software threat behavior, and adding the execution program into the training set for the next round of detection of other software;
Step S2 further includes:
s201, transforming an original data field by adopting a regularization method, regularizing the field X into an expression of X' within the same scale range, wherein the expression is as follows:
Wherein X represents an original data field, and X' represents a regularized data field;
S202, selecting relevant fields from the regularized data fields as the initialization of nodes and edges, wherein the nodes represent different data fields and the edges represent the association relations among the fields; the initialization of nodes and edges is used for constructing a graph in which the nodes represent different data attributes and the edges represent the association relations among the attributes; for each field Fi in the log data, the value of the field is taken as a node, each field Fi having a regularized value X'i, and the expression for node initialization is:
V={v1,v2,...,vk}
Where V represents the set of nodes, each node vi corresponding to a field fi, and k represents the number of fields;
S203, initializing the edges to represent the association relations between fields, wherein an edge exists between each pair of nodes in a fully connected graph, the weight of each edge is initialized to a default value and is represented as an adjacency matrix A, where Aij represents the weight of the edge between node vi and node vj;
S204, updating the embedded representation of the nodes by using the graph encoder, wherein, for the obtained node set V and adjacency matrix A, each node vi has an embedding vector hi, initially set to the node's initialized value, and the update expression of the graph encoder is:
hi(l+1) = σ( Σ_{vj∈N(vi)} (1/cij) · W(l) · hj(l) )
Wherein hi(l) represents the embedding of node vi at layer l, σ represents the activation function, N(vi) represents the set of neighbor nodes of node vi, cij represents a normalization constant, typically the sum of the weights of the edges between node vi and its neighbor nodes vj, and W(l) represents the weight matrix of layer l used for the linear transformation;
S205, the node feature representation hi obtained by the graph encoder is used for prediction, and an additional fully connected layer maps the node features to class probabilities; the expression of this process is:
Zi = W · hi + b
Wherein Zi represents the classification score or probability of node vi, W represents a weight matrix, and b represents a bias vector;
S206, converting the classification scores of the node into a probability distribution by using a normalization (softmax) function, wherein the expression is as follows:
P(i|z) = exp(zi) / Σj exp(zj)
label(vi) = argmax P(i|z)
Wherein P(i|z) represents the probability that the given node vi belongs to class i, and label(vi) is the label of node vi;
Step S3 further includes:
S301, introducing a malicious software tag for describing different types of malicious software features and behaviors, and constructing the malicious software tag into a malicious software tag vocabulary with multi-modal features;
S302, introducing a CLIP encoder, and processing a malicious software tag vocabulary through the CLIP encoder to obtain an embedded vector associated with the tag, wherein the expression is as follows:
Etext=CE(t)
Wherein E text represents the embedded vector associated with the tag, t represents the malware tag, CE represents the CLIP encoder;
step S4 further includes: and calculating the similarity among the nodes, then calculating a similarity graph matrix, and obtaining clusters by adopting a clustering algorithm, wherein the expression is as follows:
Wherein, l represents the number of clusters, Ci is the i-th cluster center, and Sc is the cluster set;
step S5 further includes:
s501, combining the embedded vector obtained in the step S3 with the clustering result obtained in the step S4, realigning a software sample through a cluster voting prompt, and obtaining a classification voting label of the software, wherein the expression is as follows:
Tc=CVP(Etext,Sc)
Wherein, Tc represents the classification voting label of the software sample text, CVP represents the cluster voting prompt algorithm for realignment and classification, Etext represents the multi-modal feature embedding vector of the software sample, and Sc represents the clustering result;
S502, combining the probability label and the classified voting label in the step S2 to generate a weak label of a software sample;
Step S6 further includes: establishing a student encoder for converting the log information of the software to be detected into a feature representation, introducing an exponential moving average method to update the weights of the student encoder, and obtaining the prediction label Ts of the self-encoder according to the prediction step in step S2;
In step S7, for the weak label of the software sample obtained in step S5 and the prediction label obtained in step S6, the loss of the sample is calculated by using the real sample label and the maximum boundary method, and the expression is:
Where L(θ) represents the loss function of the model, N represents the total number of samples, Δ is a hyper-parameter representing the minimum boundary interval used for training the model, Δc(b) represents the weak-label loss of sample b, i.e. the difference between the weak label and the real label, and Δs(b) represents the prediction-label loss of sample b, i.e. the difference between the model's output probability label and the real label.
2. The multi-modal feature alignment-based lightweight malware threat detection method of claim 1, wherein the log information in step S1 comprises: date, timestamp, IP address, file path, user operation, port, and event type.
CN202410086383.2A 2024-01-22 2024-01-22 Multi-mode feature alignment-based lightweight malicious software threat detection method Active CN117610002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410086383.2A CN117610002B (en) 2024-01-22 2024-01-22 Multi-mode feature alignment-based lightweight malicious software threat detection method


Publications (2)

Publication Number Publication Date
CN117610002A CN117610002A (en) 2024-02-27
CN117610002B true CN117610002B (en) 2024-04-30

Family

ID=89956493


Country Status (1)

Country Link
CN (1) CN117610002B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779581A (en) * 2021-09-15 2021-12-10 山东省计算中心(国家超级计算济南中心) Robust detection method and system for lightweight high-precision malicious software identification model
CN113821799A (en) * 2021-09-07 2021-12-21 南京邮电大学 Multi-label classification method for malicious software based on graph convolution neural network
WO2022162379A1 (en) * 2021-01-29 2022-08-04 Glasswall (Ip) Limited Machine learning methods and systems for determining file risk using content disarm and reconstruction analysis
CN115129896A (en) * 2022-08-23 2022-09-30 南京众智维信息科技有限公司 Network security emergency response knowledge graph relation extraction method based on comparison learning
CN115375781A (en) * 2022-07-20 2022-11-22 华为技术有限公司 Data processing method and device
CN116541838A (en) * 2023-04-19 2023-08-04 杭州电子科技大学 Malware detection method based on contrast learning
CN116610962A (en) * 2023-04-25 2023-08-18 上海任意门科技有限公司 Content auditing method and device, electronic equipment and storage medium
CN117094000A (en) * 2023-08-08 2023-11-21 合肥工业大学 Multi-modal migration anti-attack method oriented to vision-language pre-training model
CN117216741A (en) * 2023-07-28 2023-12-12 武汉盛信鸿通科技有限公司 Multimode sample implantation method based on contrast learning system
CN117235742A (en) * 2023-11-13 2023-12-15 中国人民解放军国防科技大学 Intelligent penetration test method and system based on deep reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11714905B2 (en) * 2019-05-10 2023-08-01 Sophos Limited Attribute relevance tagging in malware recognition
US20220129556A1 (en) * 2020-10-28 2022-04-28 Facebook, Inc. Systems and Methods for Implementing Smart Assistant Systems


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Learning Privacy-Preserving Student Networks via Discriminative-Generative Distillation"; S. Ge et al.; IEEE Transactions on Image Processing; 2022-12-07; vol. 32; 116-127 *
"Video object segmentation fusing visual words and a self-attention mechanism" (in Chinese); Chen Yadang et al.; Journal of Image and Graphics; 2022-08-12; 2444-2457 *
"Research on the construction of and defense against adversarial examples in deep learning" (in Chinese); Duan Guanghan; Ma Chunguang; Song Lei; Wu Peng; Chinese Journal of Network and Information Security; 2020-03-23 (No. 02); 1-11 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant