CN115758362A - Multi-feature-based automatic malicious software detection method - Google Patents

Multi-feature-based automatic malicious software detection method

Info

Publication number
CN115758362A
Authority
CN
China
Prior art keywords
substep
feature
binary
detection method
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211511935.7A
Other languages
Chinese (zh)
Inventor
李益洲
杨星
李梦龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202211511935.7A
Publication of CN115758362A
Legal status: Pending

Landscapes

  • Computer And Data Communications (AREA)

Abstract

The invention discloses a multi-feature-based automatic malware detection method comprising the following steps. Step S1: preprocess the data. Step S2: perform feature extraction on the data to obtain feature vectors. Step S3: align and fuse the binary feature vector obtained in step S2 with the opcode feature vector generated by the triangular attention mechanism to produce a final fused vector. Step S4: detect and classify the malware. The method can effectively detect malware variants with higher accuracy. By identifying variants more efficiently in real production environments, it improves the performance of automated malware analysis tools and thereby reduces labor costs.

Description

Multi-feature-based automatic malicious software detection method
Technical Field
The invention belongs to the field of software, and in particular relates to a multi-feature-based automatic malware detection method.
Background
The Internet is steadily permeating every area of human life and work, and the continuous, large-scale emergence of new Internet applications is profoundly changing social life in the information age. The global outbreak of COVID-19 has gradually moved working practices in every field online. The surge in online working has also created opportunities for malware developers, and cybercriminals have developed many kinds of malware, such as malicious advertising software, ransomware, and cryptomining malware. According to Kaspersky Security Network (KSN) statistics for November 2019 to October 2021, 15.45% of Internet users' computers worldwide suffered at least one malware attack within a single year, with ransomware and cryptomining malware particularly rampant. Malware collects private information on the target network, abuses server resources, and can even infect hosts across the Internet through bot programs. These behaviors pose a serious threat to the security of individuals and of every field, so detecting and responding to emerging malware in a timely manner has become a basic requirement for secure online working.
Malware detection methods fall mainly into static detection and dynamic detection. Static detection typically extracts features of the malware, such as binary sequences, assembly code, and opcode sequences, using static analysis tools such as decompilers. The extracted key features are used to generate feature signatures, and detection usually relies on rule matching to decide whether the code is malicious. Dynamic detection is generally divided into dynamic monitoring and dynamic analysis. Dynamic monitoring observes software behavior by running the target software in a virtualized environment and using techniques such as live debugging. Dynamic analysis runs the target software in an environment such as a sandbox, collects information such as the API (application programming interface) call sequence and system resource usage, and then analyzes it further.
Disclosure of Invention
In view of the shortcomings of the prior art and the need for improvement, the invention provides a multi-feature-based automatic malware detection method.
The specific technical scheme is as follows:
The multi-feature-based automatic malware detection method comprises the following steps:
step S1: preprocessing the data;
step S2: performing feature extraction on the data to obtain feature vectors;
step S3: aligning and fusing the binary feature vector obtained in step S2 with the opcode feature vector generated by the triangular attention mechanism to generate a final fused vector;
step S4: detecting and classifying the malware.
Preferably, step S1 comprises the following substeps:
substep S11: after the original malware file is obtained, statically disassembling it with the IDA Pro tool to obtain a binary file and an assembly file of the malware;
substep S12: counting the frequency of the key opcodes in the assembly file.
Preferably, substep S12 comprises the following substeps:
substep S121: dividing each program into several subprogram blocks and calculating the frequency of the key opcodes in each subprogram block;
substep S122: the key opcodes include basic opcodes, variable names, and register names.
Preferably, step S2 comprises the following substeps:
substep S21: the malware sequence information expressed in binary contains key information such as core functions and resource calls, which compilation usually scatters across different positions in the sequence; a stacked two-layer one-dimensional convolutional neural network model is used to encode the features of the binary file, which addresses the long-distance information dependency problem of the binary text;
substep S22: extracting opcode information, and using a state-of-the-art triangular attention algorithm to calculate the correlations within each opcode block and the long-distance relationships between different opcode blocks;
substep S23: calculating the relevance scores between the opcodes in different subprogram blocks and the other opcodes, thereby obtaining a matrix vector that describes the malware assembly file.
Preferably, step S3 comprises the following substeps:
substep S31: using a gated self-attention module to further select the key information;
substep S32: using a cross-attention module, which applies the self-attention calculation with the opcode matrix vector as the condition for the binary feature vector, so that the computed output vector fuses the information of both feature vectors;
substep S33: using a residual network structure to reduce information loss as features propagate through the model during training.
Preferably, step S4 comprises the following substeps:
substep S41: generating the classification result with a deep neural network (DNN);
substep S42: at the output layer, using the softmax function to label the input software with its malware family;
substep S43: using a cross-entropy loss function during training.
Compared with the prior art, the invention has the following beneficial effects:
1. A stacked two-layer one-dimensional convolutional network is provided to efficiently extract features from malware binary files.
2. A triangular attention algorithm is used to extract features from the assembly file, computing the features that deserve the most attention by taking the opcode and the assembly code block, respectively, as the subject.
3. A multi-feature neural network classification model is provided, which achieves good malware classification performance by means of a multi-feature alignment and fusion algorithm.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a diagram of a stacked two-layer one-dimensional convolutional neural network model of the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the drawings and examples. The detailed description and the accompanying drawings illustrate the principles of the invention and are not intended to limit its scope, which is defined by the claims and their equivalents. Those skilled in the art can combine features of the embodiments to form new embodiments without inventive effort, and such embodiments also fall within the scope of the invention.
By effectively integrating binary files and assembly code, a hybrid attention model for malware detection is proposed. To overcome the incomplete capture of binary sequence information, the binary sequence is used as the raw input and features are extracted by a stacked two-layer convolutional network: the first convolutional layer extracts temporal information, and the second captures the discontinuities introduced by function calls and jumps. Meanwhile, a triangular attention module extracts code-level features, taking into account both the overall relationships between functions and the inherent usage patterns of opcodes. Finally, a cross-attention module aligns and fuses the two kinds of features, and the fused feature vector is fed into the network for training, so that the network learns the relationship between the binary file and the assembly code. This improves the stability of the fused feature representation and yields better performance on independent tests.
FIG. 1 shows the overall design of the technical solution of the present invention. As shown in FIG. 1, the multi-feature-based automatic malware detection method provided by the invention comprises the following four modules:
Module a: data preprocessing. After the original malware file is obtained, it is statically disassembled with the IDA Pro tool to obtain a binary file and an assembly file of the malware. An opcode sequence is then extracted from the assembly file as follows: each program is divided into several subprogram blocks, and the frequency of occurrence of the key opcodes in each subprogram block is calculated. The key opcodes include basic opcodes, variable names, and register names.
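The per-block opcode counting in module a can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the patent does not list the key-opcode vocabulary, does not say how subprogram blocks are delimited, and does not fix the format of the assembly file. The sketch treats any line ending in `:` as the start of a new block and counts only instruction mnemonics, leaving out variable and register names.

```python
from collections import Counter

# Hypothetical key-opcode vocabulary; the patent does not enumerate the real one.
KEY_OPCODES = ("mov", "push", "call", "jmp", "xor")


def split_blocks(asm_lines):
    """Split assembly lines into subprogram blocks.

    Assumption: a line ending with ':' is a label that opens a new block.
    """
    blocks, current = [], []
    for line in asm_lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


def opcode_frequencies(block):
    """Return the count of each key opcode in one block, in KEY_OPCODES order."""
    counts = Counter(line.split()[0].lower() for line in block)
    return [counts[op] for op in KEY_OPCODES]


# Toy assembly listing standing in for a disassembler's output.
sample = [
    "sub_401000:",
    "push ebp",
    "mov ebp, esp",
    "call sub_401020",
    "sub_401020:",
    "xor eax, eax",
    "mov ecx, eax",
    "jmp loc_401030",
]

features = [opcode_frequencies(b) for b in split_blocks(sample)]
print(features)  # one frequency vector per subprogram block
```

Each row of `features` is the frequency vector of one subprogram block, which is the kind of per-block input the later opcode attention stage would consume.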
Module b: feature extraction. Features are further extracted from the binary file obtained in module a and from the opcodes extracted from the assembly file. First, the malware sequence information expressed in binary contains key information such as core functions and resource calls, which compilation usually scatters across different positions in the sequence. A stacked two-layer one-dimensional convolutional neural network model is therefore used to encode the features of the binary file, which addresses the long-distance information dependency problem of the binary text. Second, opcode information is extracted: a state-of-the-art triangular attention algorithm calculates the correlations within each opcode block and the long-distance relationships between different opcode blocks. The algorithm calculates the relevance scores between the opcodes in different subprogram blocks and the other opcodes, thereby obtaining a matrix vector that describes the malware assembly file.
The stacked two-layer one-dimensional convolutional neural network model proposed by the present invention is further described with reference to FIG. 2.
The binary sequence is first fed into a one-dimensional convolution that extracts byte features within a fixed window size.
A dilated convolution is then applied to the resulting vectors. By enlarging the receptive field, it widens the range of information covered by each convolution output and addresses the long-distance information dependency problem of the binary text.
The feature map output by the two one-dimensional convolution layers is fed into a pooling layer to reduce dimensionality and compress the features.
This basic structure, consisting of the two one-dimensional convolution layers and the pooling layer, is repeated until the dimension of the output features matches the threshold accepted by the subsequent module.
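The basic structure described above (plain convolution, dilated convolution, pooling, repeated until the output is short enough) can be sketched in plain Python. This is only an illustration under stated assumptions: the kernel weights, kernel size 3, dilation rate 2, pool size 2, the ReLU between the two convolutions, and the single feature channel are all hypothetical choices, since the patent specifies none of them, and a real model would learn its kernels.

```python
def conv1d(seq, kernel, dilation=1):
    """Valid 1-D convolution (cross-correlation) with optional dilation."""
    span = (len(kernel) - 1) * dilation + 1  # input positions covered by one output
    return [
        sum(w * seq[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(seq) - span + 1)
    ]


def max_pool(seq, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


def receptive_field(kernel_sizes, dilations):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


# Hypothetical fixed kernels; in a trained network these are learned weights.
K1 = [0.25, 0.5, 0.25]
K2 = [-1.0, 0.0, 1.0]


def basic_block(feats, dilation=2, pool=2):
    """One conv -> ReLU -> dilated conv -> max-pool block (ReLU is an assumption)."""
    hidden = [max(0.0, v) for v in conv1d(feats, K1)]
    return max_pool(conv1d(hidden, K2, dilation), pool)


def encode(byte_seq, max_len):
    """Repeat the basic block until the output length is at most max_len."""
    feats = [b / 255.0 for b in byte_seq]
    while len(feats) > max_len:
        shrunk = basic_block(feats)
        if not shrunk:  # input too short for another block
            break
        feats = shrunk
    return feats


encoded = encode(bytes(range(64)), max_len=16)
print(len(encoded), receptive_field([3, 3], [1, 2]))
```

With these settings the dilated second layer sees 7 input positions per output instead of the 5 that two undilated size-3 kernels would cover, which is the receptive-field enlargement the text refers to.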
Module c: feature alignment and fusion. This module effectively aligns and fuses the binary feature vector obtained in module b with the matrix vector generated by the triangular attention mechanism.
First, a gated self-attention module is used to further select the key information.
Then, a cross-attention module applies the self-attention calculation with the opcode matrix vector as the condition for the binary feature vector, so that the computed output vector fuses the information of both feature vectors.
Finally, a residual network structure is used to reduce information loss as features propagate through the model during training.
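The three operations of module c can be sketched as follows. The patent does not define the gate, the attention formula, or any projection matrices, so this sketch makes several labeled assumptions: it uses standard scaled dot-product attention, omits the learned query/key/value projections, gates the self-attention output with an element-wise sigmoid of the input, and takes the binary features as queries with the opcode features as keys and values. It is one plausible reading, not the patented implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of row vectors (no learned projections)."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)])
    return out


def gated_self_attention(x):
    """Self-attention whose output is scaled by a sigmoid gate.

    Assumption: the gate is sigmoid(x) element-wise; a trained model would
    normally compute it from a learned projection.
    """
    attended = attention(x, x, x)
    return [
        [a / (1.0 + math.exp(-g)) for a, g in zip(a_row, x_row)]
        for a_row, x_row in zip(attended, x)
    ]


def fuse(binary_feats, opcode_feats):
    """Gate both inputs, cross-attend binary -> opcode, then add a residual."""
    b = gated_self_attention(binary_feats)
    o = gated_self_attention(opcode_feats)
    cross = attention(b, o, o)  # opcode features condition the binary features
    return [[c + r for c, r in zip(c_row, b_row)] for c_row, b_row in zip(cross, b)]


# Toy inputs: 3 binary feature rows and 2 opcode rows, both of width 2.
binary_feats = [[0.2, 0.1], [0.4, -0.3], [0.0, 0.5]]
opcode_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse(binary_feats, opcode_feats)
print(fused)
```

The output keeps one row per binary feature row, so the fused representation has the shape of the binary branch while carrying information from both branches. The residual addition keeps the gated binary features available even when the cross-attention weights are nearly uniform.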
Module d: detection and classification of malware. The classification result is generated by a deep neural network (DNN). At the output layer, the softmax function is used to label the input software with its malware family. A cross-entropy loss function is used during training.
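The classification head of module d can be sketched as a small feed-forward pass followed by softmax and a cross-entropy loss. The family names, layer sizes, weights, and the ReLU hidden activation below are hypothetical placeholders chosen for illustration, since the patent gives none of them, and no training loop is shown.

```python
import math

# Hypothetical malware family labels.
FAMILIES = ["family_a", "family_b", "family_c"]


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """Apply hidden layers with ReLU (an assumption), then a softmax output layer."""
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]
    return softmax(dense(x, weights, bias))


def cross_entropy(probs, label_index):
    """Cross-entropy loss for a single sample with a one-hot target."""
    return -math.log(probs[label_index])


# Illustrative weights only: a 2 -> 2 hidden layer and a 2 -> 3 output layer.
layers = [
    ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1]),
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]),
]

fused_vector = [0.8, 0.4]  # stands in for the fused feature vector from module c
probs = forward(fused_vector, layers)
predicted = FAMILIES[probs.index(max(probs))]
loss = cross_entropy(probs, label_index=0)
print(predicted, loss)
```

The predicted family is the class with the highest softmax probability. During training, the cross-entropy value would be averaged over a batch and minimized with a gradient-based optimizer.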

Claims (6)

1. A multi-feature-based automatic malware detection method, characterized by comprising the following steps:
step S1: preprocessing the data;
step S2: performing feature extraction on the data to obtain feature vectors;
step S3: aligning and fusing the binary feature vector obtained in step S2 with the opcode feature vector generated by the triangular attention mechanism to generate a final fused vector;
step S4: detecting and classifying the malware.
2. The multi-feature-based automatic malware detection method according to claim 1, wherein step S1 comprises the following substeps:
substep S11: after the original malware file is obtained, statically disassembling it with the IDA Pro tool to obtain a binary file and an assembly file of the malware;
substep S12: counting the frequency of the key opcodes in the assembly file.
3. The multi-feature-based automatic malware detection method according to claim 1, wherein substep S12 comprises the following substeps:
substep S121: dividing each program into several subprogram blocks and calculating the frequency of the key opcodes in each subprogram block;
substep S122: the key opcodes include basic opcodes, variable names, and register names.
4. The multi-feature-based automatic malware detection method according to claim 1, wherein step S2 comprises the following substeps:
substep S21: the malware sequence information expressed in binary contains key information such as core functions and resource calls, which compilation usually scatters across different positions in the sequence; a stacked two-layer one-dimensional convolutional neural network model is used to encode the features of the binary file, which addresses the long-distance information dependency problem of the binary text;
substep S22: extracting opcode information, and using a state-of-the-art triangular attention algorithm to calculate the correlations within each opcode block and the long-distance relationships between different opcode blocks;
substep S23: calculating the relevance scores between the opcodes in different subprogram blocks and the other opcodes, thereby obtaining a matrix vector that describes the malware assembly file.
5. The multi-feature-based automatic malware detection method according to claim 1, wherein step S3 comprises the following substeps:
substep S31: using a gated self-attention module to further select the key information;
substep S32: using a cross-attention module, which applies the self-attention calculation with the opcode matrix vector as the condition for the binary feature vector, so that the computed output vector fuses the information of both feature vectors;
substep S33: using a residual network structure to reduce information loss as features propagate through the model during training.
6. The multi-feature-based automatic malware detection method according to claim 1, wherein step S4 comprises the following substeps:
substep S41: generating the classification result with a deep neural network (DNN);
substep S42: at the output layer, using the softmax function to label the input software with its malware family;
substep S43: using a cross-entropy loss function during training.
CN202211511935.7A 2022-11-29 2022-11-29 Multi-feature-based automatic malicious software detection method Pending CN115758362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211511935.7A CN115758362A (en) 2022-11-29 2022-11-29 Multi-feature-based automatic malicious software detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211511935.7A CN115758362A (en) 2022-11-29 2022-11-29 Multi-feature-based automatic malicious software detection method

Publications (1)

Publication Number Publication Date
CN115758362A true CN115758362A (en) 2023-03-07

Family

ID=85340286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211511935.7A Pending CN115758362A (en) 2022-11-29 2022-11-29 Multi-feature-based automatic malicious software detection method

Country Status (1)

Country Link
CN (1) CN115758362A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861428A (en) * 2023-09-04 2023-10-10 北京安天网络安全技术有限公司 Malicious detection method, device, equipment and medium based on associated files
CN116861428B (en) * 2023-09-04 2023-12-08 北京安天网络安全技术有限公司 Malicious detection method, device, equipment and medium based on associated files
CN117272303A (en) * 2023-09-27 2023-12-22 四川大学 Malicious code sample variant generation method and system based on genetic countermeasure

Similar Documents

Publication Publication Date Title
CN112235283B (en) Vulnerability description attack graph-based network attack evaluation method for power engineering control system
CN115758362A (en) Multi-feature-based automatic malicious software detection method
CN106557695B (en) A kind of malicious application detection method and system
CN111585955B (en) HTTP request abnormity detection method and system
Jeon et al. Hybrid malware detection based on Bi-LSTM and SPP-Net for smart IoT
CN112989348B (en) Attack detection method, model training method, device, server and storage medium
CN116305168B (en) Multi-dimensional information security risk assessment method, system and storage medium
CN105989287A (en) Method and system for judging homology of massive malicious samples
CN115396169B (en) Method and system for multi-step attack detection and scene restoration based on TTP
CN112507336A (en) Server-side malicious program detection method based on code characteristics and flow behaviors
CN116305119A (en) APT malicious software classification method and device based on predictive guidance prototype
CN115632874A (en) Method, device, equipment and storage medium for detecting threat of entity object
Lian et al. Cryptomining malware detection based on edge computing-oriented multi-modal features deep learning
Hong et al. [Retracted] Abnormal Access Behavior Detection of Ideological and Political MOOCs in Colleges and Universities
Yujie et al. End-to-end android malware classification based on pure traffic images
Shao et al. Malicious code classification method based on deep residual network and hybrid attention mechanism for edge security
CN117240632A (en) Attack detection method and system based on knowledge graph
CN116467720A (en) Intelligent contract vulnerability detection method based on graph neural network and electronic equipment
CN115622810A (en) Business application identification system and method based on machine learning algorithm
CN115567325A (en) Threat hunting method based on graph matching
CN114510717A (en) ELF file detection method and device and storage medium
CN112733144A (en) Malicious program intelligent detection method based on deep learning technology
CN113935034A (en) Malicious code family classification method and device based on graph neural network and storage medium
CN117556425B (en) Intelligent contract vulnerability detection method, system and equipment based on graph neural network
CN111125699B (en) Malicious program visual detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination