CN114491523A - Malicious software detection method and device, electronic equipment, medium and product - Google Patents

Info

Publication number
CN114491523A
Authority
CN
China
Prior art keywords
software
api
tested
information
dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111530747.4A
Other languages
Chinese (zh)
Inventor
刘浩然
王占一
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qax Technology Group Inc
Secworld Information Technology Beijing Co Ltd
Original Assignee
Qax Technology Group Inc
Secworld Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qax Technology Group Inc, Secworld Information Technology Beijing Co Ltd filed Critical Qax Technology Group Inc
Priority to CN202111530747.4A priority Critical patent/CN114491523A/en
Publication of CN114491523A publication Critical patent/CN114491523A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561 Virus type analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Virology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a malicious software detection method and device, electronic equipment, a medium and a product. The malicious software detection method comprises the following steps: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and a dynamic behavior API sequence of each API in the software to be tested; performing static feature extraction on the static attribute information to obtain the static features of the software to be tested; performing dynamic feature extraction on the dynamic behavior API sequence of each API to obtain the dynamic features of each API in the software to be tested; and inputting the static features of the software to be tested and the dynamic features of each API into a pre-trained malicious software detection model to obtain a detection result for the software to be tested. By using both the time-sequence information and the semantic information of the API sequence, the embodiment of the invention detects malicious software more accurately, reduces the false alarm rate of the malicious software detection model, and improves the detection rate.

Description

Malicious software detection method and device, electronic equipment, medium and product
Technical Field
The present invention relates to the field of computer malware detection technologies, and in particular, to a malware detection method and apparatus, an electronic device, a medium, and a product.
Background
Malicious programs are a major security threat to computer systems and networks, and their scale and impact keep growing. Malware detection can be divided into static detection and dynamic detection according to whether the software is run. Static detection analyzes statistical characteristics of the software without running it, for example by analyzing the binary directly or by analyzing the code obtained after disassembly. Dynamic detection analyzes and extracts features from the operating system resource calls that the software makes while running.
Existing static detection methods have limitations. To evade detection, malware authors deliberately hide information about the malware by technical means such as modifying code logic, obfuscating code and packing files, so the characteristics of the malware are usually hidden and difficult to obtain, and some malicious behavior is exposed only when the malware runs. In addition, existing static feature analysis techniques cannot capture the characteristics of malware comprehensively, so traditional static detection methods have a low detection rate for complex and changeable malware.
At present, dynamic malware detection methods can combine deep learning with dynamic behavior analysis of a sample, which reflects the behavior of the sample more comprehensively. However, these methods focus only on API call characteristics and merely count information such as the number of API calls and the API call order, so the detection result is not accurate enough.
Disclosure of Invention
The invention provides a malicious software detection method and device, electronic equipment, a medium and a product, which are used for overcoming the defect in the prior art that different risk levels of the same API function under different operating environments cannot be detected.
The invention provides a malicious software detection method, which comprises the following steps: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training based on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
According to the malware detection method provided by the invention, the dynamic features comprise an API category vector and an API semantic vector, and the API semantic vector at least comprises one or more of a file path information vector, a registry information vector and a network behavior information vector.
According to the malware detection method provided by the invention, the dynamic feature extraction is performed on the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic feature of each API in the software to be tested, and the method comprises the following steps:
obtaining API category vectors of all APIs in the software to be tested according to API name type information in the dynamic behavior API sequence of all APIs in the software to be tested;
performing semantic analysis on parameters in the dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested; wherein the parameters comprise one or more of: file path information; registry modification location information; and IP address information and/or domain name information;
and splicing the API category vector of each API in the software to be tested with the API semantic vector of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested.
According to the malware detection method provided by the invention, semantic analysis is performed on parameters in a dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested, and the method comprises the following steps:
under the condition that the API has a file modification function, acquiring file path information from the dynamic behavior API sequence;
performing character level segmentation on the file path information;
inputting the segmented file path information into a pre-trained file path extraction model to obtain a file path information vector;
the file path extraction model is obtained by training based on file path information after character level segmentation of the sample file and label information of the sample file;
and/or,
under the condition that the API has a registry modifying function, acquiring registry modification location information from the dynamic behavior API sequence;
performing character level segmentation on the registry modification location information;
inputting the segmented registry modification location information into a pre-trained registry extraction model to obtain the registry information vector;
the registry extraction model is obtained by training based on the character-level-segmented registry modification location information of the sample registry and the label information of the sample registry;
and/or,
under the condition that the API has a network access behavior, acquiring IP address information and/or domain name information from the dynamic behavior API sequence;
performing character level segmentation on the IP address information and/or the domain name information;
inputting the segmented IP address information and/or domain name information into a pre-trained network behavior extraction model to obtain the network behavior information vector;
the network behavior extraction model is obtained by training based on character level segmentation results of sample IP address information and/or sample domain name information and label information of the sample IP address information and/or the sample domain name information.
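The character level segmentation applied to file paths, registry locations, IP addresses and domain names can be sketched as follows. This is only a minimal illustration: the lower-casing, the `<pad>` and `<unk>` tokens, the fixed length `max_len` and the sample strings are all assumptions made for the example, and the pre-trained extraction models themselves are not shown.

```python
def char_segment(text):
    """Split a parameter string into single-character tokens (lower-cased by assumption)."""
    return list(text.lower())


def build_vocab(samples):
    """Map every character seen in the samples to an integer id.

    Id 0 is reserved for padding and id 1 for unknown characters.
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for sample in samples:
        for ch in char_segment(sample):
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Turn a string into a fixed-length list of character ids."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in char_segment(text)][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Hypothetical sample parameters: a file path, a registry location, a domain name.
samples = [r"C:\Users\Public\report.docx", r"HKLM\Software\Example\Run", "example.com"]
vocab = build_vocab(samples)
encoded = encode("example.com", vocab, max_len=16)
print(char_segment(r"C:\Temp"))
print(encoded)
```

The resulting id sequence is the kind of input a character-level extraction model could consume; which model architecture turns it into an information vector is left open here.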
According to the malware detection method provided by the invention, the obtaining of the API category vector of each API in the software to be tested according to the API name type information in the dynamic behavior API sequence of each API in the software to be tested comprises the following steps:
and converting the API name type information in the dynamic behavior API sequence of each API in the software to be tested into the API category vector of each API in the software to be tested by using a word vector model.
According to the malware detection method provided by the invention, before the preprocessing of the behavior log of the software to be tested, the method further comprises the following steps:
and running the software to be tested in the sandbox environment to obtain a behavior log of the software to be tested.
The present invention also provides a malware detection apparatus, including: the log preprocessing module is used for preprocessing the behavior log of the software to be tested to obtain the static attribute information of the software to be tested and the dynamic behavior API sequence of each API in the software to be tested;
the static feature extraction module is used for carrying out static feature extraction on the static attribute information of the software to be tested to obtain the static features of the software to be tested;
the dynamic characteristic extraction module is used for extracting dynamic characteristics of dynamic behavior API sequences of all APIs in the software to be tested to obtain the dynamic characteristics of all APIs in the software to be tested;
the software detection module is used for inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training based on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the above malware detection methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the malware detection methods described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the malware detection method as described in any one of the above.
The invention provides a malicious software detection method, a device, an electronic device, a medium and a product. The malicious software detection method preprocesses a behavior log of the software to be tested to obtain static attribute information and dynamic behavior API sequences, extracts features from each to obtain the static features and the dynamic features of each API, and inputs them into a malicious software detection model to obtain a detection result indicating whether the software to be tested is malicious software. The method uses the time-sequence information of the API sequences and also analyzes and extracts the parameter information contained in the API functions. For example, the same API function can have different actual meanings when it operates on different files; in that case, features representing different risk levels can be extracted from the parameter information. Malicious software can therefore be detected more accurately, the false alarm rate of the malicious software detection model is reduced, and the detection rate is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart illustrating a malware detection method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an implementation process of a malware detection method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an implementation process of splicing an API category vector and an API semantic vector according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a semantic analysis implementation process provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of an implementation process of integrating static features and dynamic features according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a malware detection apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, the API call features extracted by deep-learning-based dynamic behavior analysis are all based on information such as the number of API calls and the API call order. Under this condition, dynamic detection can detect the risk levels corresponding to different API functions, but it cannot detect the different risk levels of the same API function in different application environments. For example, when the API name is NtReadFile, the API function is to read data from an opened file. When the file read is an ordinary file, the degree of danger is low; when the file path points to a user privacy file, the behavior is likely to be theft of user privacy data, and the degree of danger is high. Ignoring this difference leads to inaccurate detection results.
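The limitation described above can be illustrated with a small sketch. The two traces and their file paths are hypothetical and serve only to show that count-based features cannot separate two NtReadFile calls, whereas features that keep the call parameter can.

```python
from collections import Counter

# Two hypothetical call traces: the same API, but different file arguments.
trace_ordinary = [("NtReadFile", r"C:\Program Files\App\readme.txt")]
trace_private = [("NtReadFile", r"C:\Users\alice\Documents\passwords.txt")]


def count_features(trace):
    """Count-only features: how many times each API name is called."""
    return Counter(name for name, _param in trace)


def name_and_param_features(trace):
    """Features that keep the call parameter next to the API name."""
    return [(name, param.lower()) for name, param in trace]


same_counts = count_features(trace_ordinary) == count_features(trace_private)
same_with_params = (
    name_and_param_features(trace_ordinary) == name_and_param_features(trace_private)
)
print(same_counts, same_with_params)
```

With count-only features the two traces are indistinguishable, so any difference in risk has to come from the parameters.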
In view of the above problem, an embodiment of the present invention provides a method for detecting malware, which is specifically described below with reference to fig. 1 of the accompanying drawings.
As shown in Fig. 1, a malware detection method provided in an embodiment of the present invention includes the following steps:
Step 101, preprocessing a behavior log of software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested.
In this step, the behavior log of the software to be tested is obtained by running the software to be tested in a simulation environment; the behavior log is then preprocessed, and static attribute information and a dynamic behavior API sequence are extracted from it.
The behavior log is JSON-structured data that records the API information called by the software to be tested while running, including the type of each API and the related parameters of each call.
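A minimal sketch of this preprocessing is given below. The log layout is an assumption: the description only states that the log is JSON and records each called API, its type and its call parameters, so the keys `static`, `calls`, `api`, `category` and `args` are hypothetical names chosen for the example.

```python
import json

# Hypothetical behavior log; key names and values are illustrative only.
raw_log = r"""
{
  "static": {"file_type": "PE"},
  "calls": [
    {"api": "NtCreateFile", "category": "file", "args": {"path": "C:\\Temp\\a.tmp"}},
    {"api": "connect", "category": "network", "args": {"ip": "203.0.113.5"}}
  ]
}
"""


def preprocess(log_text):
    """Split a JSON behavior log into static attribute info and an ordered API sequence."""
    log = json.loads(log_text)
    static_info = log.get("static", {})
    # Keep the calls in log order so that time-sequence information is preserved.
    api_sequence = [
        (call["api"], call.get("category"), call.get("args", {}))
        for call in log.get("calls", [])
    ]
    return static_info, api_sequence


static_info, api_sequence = preprocess(raw_log)
for name, category, args in api_sequence:
    print(name, category, args)
```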
The static attribute information refers to static attribute information of the PE file, and specifically includes PE file section table information, PE file resource information, PE file import/export table information, PE file PDB information, Office file macro code information, mail content information, PDF file information, picture information, release file information, and the like. The static attribute information may be obtained through a malware analysis tool or through automated sandbox analysis, which is not limited in this embodiment.
The dynamic behavior API sequence refers to the API call sequence obtained by recording the API calls that the software to be tested makes while running. It may be obtained directly by an API monitoring tool of the system on which the software to be tested runs, or by other related technologies (for example, an API hooking technology), which is not limited in this embodiment.
Step 102, performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested.
In this step, the PE file features are extracted as static features from static attribute information such as PE file section table information, PE file resource information, PE file import/export table information, PE file PDB information, Office file macro code information, mail content information, PDF file information, picture information, release file information, and the like.
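As a rough illustration of this step, the sketch below turns a dictionary of static attribute information into a small numeric vector. The dictionary keys (`sections`, `imports`, `pdb_path`, `has_macro`) and the chosen features are assumptions made for the example; the description does not specify which concrete features are derived from the static attribute information.

```python
def static_features(info):
    """Build a toy static feature vector from a dict of static attributes.

    All keys are hypothetical; a real extractor would derive far richer features.
    """
    sections = info.get("sections", [])
    imports = info.get("imports", [])
    return [
        float(len(sections)),                     # number of PE sections
        float(len(imports)),                      # number of imported functions
        1.0 if info.get("pdb_path") else 0.0,     # PDB information present
        1.0 if info.get("has_macro") else 0.0,    # Office macro code present
    ]


example_info = {
    "sections": [".text", ".data", ".rsrc"],
    "imports": ["NtReadFile", "NtWriteFile"],
    "pdb_path": "",
    "has_macro": False,
}
print(static_features(example_info))
```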
Step 103, performing dynamic feature extraction on the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic features of each API in the software to be tested.
As described above, the behavior log includes not only the called API information but also the type of each API and the related parameters of each call. During dynamic feature extraction, the API name type information can therefore be obtained directly from the dynamic behavior API sequence (obtained by preprocessing the behavior log), and the API category vector can be extracted from it.
When extracting features from the related parameters of a call, each dynamic behavior API sequence is examined to detect whether the API involves a file operation behavior, a network connection behavior or a registry operation behavior. Once it is determined whether the API has a file modification function, a network access function or a registry modification function, the dynamic behavior API sequences of APIs with different functions are processed accordingly, and dynamic features carrying semantic information are extracted from them.
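The detection of the API function and the selection of the matching parameters might look like the following sketch. The mapping from API names to function groups and the argument keys (`path`, `regkey`, `ip`, `domain`) are assumptions for illustration; the description does not list which APIs belong to which group.

```python
# Hypothetical mapping from API name to function group.
FUNCTION_GROUPS = {
    "NtWriteFile": "file",
    "NtCreateFile": "file",
    "RegSetValueExW": "registry",
    "NtSetValueKey": "registry",
    "connect": "network",
    "InternetOpenUrlW": "network",
}

# Hypothetical argument keys that carry the semantic parameter for each group.
PARAM_KEYS = {
    "file": ("path",),
    "registry": ("regkey",),
    "network": ("ip", "domain"),
}


def semantic_parameters(api_name, args):
    """Return the function group of an API call and the parameters to analyze."""
    group = FUNCTION_GROUPS.get(api_name)
    if group is None:
        return None, []  # no file, registry or network behavior detected
    values = [args[key] for key in PARAM_KEYS[group] if key in args]
    return group, values


network_call = semantic_parameters(
    "connect", {"ip": "203.0.113.5", "domain": "example.com"}
)
other_call = semantic_parameters("NtClose", {})
print(network_call, other_call)
```

Each returned group would then be routed to its own extraction model (file path, registry or network behavior).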
Step 104, inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested.
In this step, the malware detection model is obtained by training based on the static features of the sample software, the dynamic features of each API in the sample software, and the label information of the sample software.
In this embodiment, the malware detection model is a simple classifier for detecting whether the software to be tested is malware, and the detection result is either malware or not malware. Correspondingly, in the training process of the malware detection model, the label of each piece of sample software indicates whether it is normal software or malware. The specific training process is as follows: obtain the sample software and the corresponding labels in advance; construct a malware detection network; perform feature extraction on the sample software as in steps 101 to 103 to obtain the static features and dynamic features of the sample software; and input these features into the malware detection network for training until the network converges, thereby obtaining the trained malware detection model.
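The training procedure can be sketched with a stand-in classifier. The example below uses plain logistic regression trained by gradient descent, which is an assumption: the description only calls the model a simple classifier trained until convergence. The feature vectors are toy values standing in for combined static and dynamic features, not real data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, epochs=1000, lr=0.5):
    """Fit logistic regression weights with full-batch gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 for malware, 0 for normal software."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


# Toy, hand-made feature vectors (hypothetical); label 1 = malware, 0 = normal.
features = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.1], [0.9, 0.8, 1.0], [1.0, 0.7, 0.9]]
labels = [0, 0, 1, 1]

weights, bias = train(features, labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

A neural network would replace the logistic regression in practice; the loop structure (forward pass, error against the label, parameter update, repeat until convergence) stays the same.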
The malicious software detection method provided by the embodiment of the invention preprocesses the behavior log of the software to be tested to obtain static attribute information and dynamic behavior API sequences, extracts features from each to obtain the static features and the dynamic features of each API, and inputs them into the malicious software detection model to obtain a detection result indicating whether the software to be tested is malicious software. The method not only uses the time-sequence information of the API sequences but also analyzes and extracts the parameter information contained in the API functions. For example, the same API function can have different actual meanings when it operates on different files; in that case, features representing different risk levels can be extracted from the parameter information. Malicious software can therefore be detected more accurately, the false alarm rate of the malicious software detection model is reduced, and the detection rate is improved.
In this embodiment, as shown in Fig. 2, the extracted static features of the software to be tested and the dynamic features of each API in the software to be tested may also be applied to a malicious family classification task, so as to accurately determine the risk levels corresponding to different behaviors of the same API. Specifically, the static features and the dynamic features are input into a pre-trained malicious family classification model to determine which malicious family the software to be tested belongs to. The malicious family classification model is trained on the static features and dynamic features extracted from sample software and the malicious family type label corresponding to the sample software. The malicious family types may include, among others, the macro virus family, the CIH virus family, the worm virus family, the Trojan horse virus family, and the like. The malicious family classification model is a neural network model, such as a convolutional neural network or a recurrent neural network.
Further, the dynamic features include an API category vector and an API semantic vector including at least one or more of a file path information vector, a registry information vector, and a network behavior information vector.
The API category vector is obtained by extracting features from the API types in the behavior log, and the API semantic vector is obtained by extracting features from the related parameters generated during the calls. The API semantic vector is determined by the functions of the APIs called while the software to be tested runs. If only an API with a network access function is called, the related parameters contain only network access parameters, and the extracted API semantic vector contains only the network behavior information vector. If the software to be tested calls both an API with a network access function and an API with a file modification function while running, the related parameters necessarily include network access parameters and file modification parameters, and the API semantic vector then includes the file path information vector and the network behavior information vector.
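One possible way to assemble an API semantic vector from whichever component vectors are available is sketched below. The fixed slot order, the slot width `SLOT_DIM` and the zero-filling of absent components are assumptions introduced so that every semantic vector has the same length; the description does not state how missing components are handled.

```python
SLOT_ORDER = ("file", "registry", "network")  # assumed fixed layout
SLOT_DIM = 2                                   # assumed width of each component


def semantic_vector(parts):
    """Concatenate the available component vectors, zero-filling absent ones."""
    vector = []
    for slot in SLOT_ORDER:
        vector.extend(parts.get(slot, [0.0] * SLOT_DIM))
    return vector


# Only network behavior was observed (component values are placeholders).
network_only = semantic_vector({"network": [0.3, 0.4]})
# Both file modification and network access were observed.
file_and_network = semantic_vector({"file": [0.7, 0.1], "network": [0.3, 0.4]})
print(network_only)
print(file_and_network)
```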
According to the malicious software detection method provided by the embodiment of the invention, the API category vector and the API semantic vector are obtained, and the API semantic vector comprises one or more of the file path information vector, the registry information vector and the network behavior information vector. The API category and the semantic information of an API of that category in different environments can thus both be determined, and combining the category with the semantic information allows whether the software to be tested is malicious software to be detected more accurately.
Further, the dynamic feature extraction of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic feature of each API in the software to be tested includes:
obtaining API category vectors of all APIs in the software to be tested according to API name type information in the dynamic behavior API sequence of all APIs in the software to be tested;
performing semantic analysis on parameters in the dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested; wherein the parameters comprise one or more of: file path information; registry modification location information; and IP address information and/or domain name information;
and splicing the API category vector of each API in the software to be tested with the API semantic vector of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested.
Specifically, the behavior log includes an API name category, and the API name category is directly converted into an API category vector.
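The conversion of an API name into a category vector can be pictured as a table lookup. In the sketch below the table is filled with seeded random values purely as a placeholder for a trained word vector model; the dimension, the seed and the `<unk>` fallback are assumptions for illustration.

```python
import random


def build_category_table(api_names, dim=4, seed=0):
    """Create a placeholder lookup table: one fixed vector per known API name."""
    rng = random.Random(seed)
    table = {"<unk>": [0.0] * dim}
    for name in sorted(set(api_names)):
        table[name] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return table


def category_vector(table, api_name):
    """Look up the category vector of an API name, falling back to the unknown vector."""
    return table.get(api_name, table["<unk>"])


table = build_category_table(["NtReadFile", "NtWriteFile", "connect"])
vec_a = category_vector(table, "NtReadFile")
vec_b = category_vector(table, "NtReadFile")
vec_unknown = category_vector(table, "SomeUnlistedApi")
print(vec_a)
```

In a trained word vector model the entries would be learned from API call sequences, so APIs used in similar contexts would receive similar vectors.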
In the running process of the software to be tested, different APIs (application program interfaces) are called for realizing different functions, and further, the generated parameter contents are different. For example, if the software to be tested only performs network access during the running process, the parameters in the dynamic behavior API sequence only include network connection behavior parameters (only include IP address information or only include domain name information or both), and after performing semantic analysis on the parameters in the API sequence, a network behavior information vector is obtained. When the software to be tested calls the APIs with different functions at the same time, the parameters comprise information corresponding to the APIs, and semantic information of the same API can be obtained when the same API realizes different functions.
In this embodiment, as shown in Fig. 3, the API category vector and the API semantic vector are spliced to obtain a dynamic feature that carries both API category information and API semantic information. Two calls of the same category therefore yield different dynamic features when their operating environments differ.
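The splicing itself is a plain concatenation, as the following sketch shows. The vector values are placeholders chosen for the example; only the concatenation reflects the description.

```python
def splice(category_vec, semantic_vec):
    """Concatenate an API category vector with an API semantic vector."""
    return list(category_vec) + list(semantic_vec)


# Placeholder vectors: one API category, two different call parameters.
category_vec = [0.2, -0.5, 0.7]
semantic_ordinary_file = [0.1, 0.0]
semantic_private_file = [0.9, 0.8]

feature_ordinary = splice(category_vec, semantic_ordinary_file)
feature_private = splice(category_vec, semantic_private_file)
print(feature_ordinary)
print(feature_private)
```

The two dynamic features share the same category prefix but differ in the semantic part, which is what lets a downstream model assign different risk levels to the same API.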
The malicious software detection method provided by the invention can accurately judge different risk levels of the same API under different operating environments based on the dynamic characteristics, thereby improving the detection accuracy and reducing the false detection rate.
Further, performing semantic analysis on parameters in the dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested, including:
under the condition that the API has a file modification function, acquiring file path information from the dynamic behavior API sequence;
performing character level segmentation on the file path information;
inputting the segmented file path information into a pre-trained file path extraction model to obtain a file path information vector;
the file path extraction model is obtained by training based on file path information after character level segmentation of the sample file and label information of the sample file;
and/or,
under the condition that the API has a registry modification function, acquiring registry modification location information from the dynamic behavior API sequence;
performing character level segmentation on the registry modification location information;
inputting the segmented registry modification location information into a pre-trained registry extraction model to obtain a registry information vector;
the registry extraction model is obtained by training based on the character-level-segmented registry modification location information of a sample registry and the label information of the sample registry;
and/or,
under the condition that the API has a network access behavior, acquiring IP address information and/or domain name information from the dynamic behavior API sequence;
performing character level segmentation on the IP address information and/or the domain name information;
inputting the segmented IP address information and/or domain name information into a pre-trained network behavior extraction model to obtain the network behavior information vector;
the network behavior extraction model is obtained by training based on character level segmentation results of sample IP address information and/or sample domain name information and label information of the sample IP address information and/or the sample domain name information.
Specifically, if the API involves file modification, the file operation path information is segmented at the character level, and the segmented file path information is input into a pre-trained file path extraction model to obtain the file path information vector.
The file path extraction model may be a recurrent neural network (RNN) model, a long short-term memory (LSTM) network model, a gated recurrent unit (GRU) network model (a variant of LSTM), or another LSTM variant, which is not limited in this embodiment. The file path extraction model is obtained by training based on the character-level-segmented file path information of sample files and the label information of the sample files.
If the API involves registry modification, the registry modification location information is segmented at the character level, and the segmented registry modification location information is input into a pre-trained registry extraction model to obtain the registry information vector.
The registry extraction model is similar to the file path extraction model, and may be an RNN model, an LSTM network model, a GRU network model, or another LSTM variant, which is not limited in this embodiment. The registry extraction model is obtained by training based on the character-level-segmented registry modification location information of a sample registry and the label information of the sample registry.
If the API involves network access, the IP address and the domain name are segmented at the character level (if the API involves only an IP address, only the IP address is segmented; similarly, if the API involves only a domain name, only the domain name is segmented; if the API involves both, both are segmented), and the segmented IP address information and/or domain name information is input into a pre-trained network behavior extraction model to obtain the network behavior information vector.
The network behavior extraction model is similar to the file path extraction model and the registry extraction model, and may be an RNN model, an LSTM network model, a GRU network model, or another LSTM variant, which is not limited in this embodiment. The network behavior extraction model is obtained by training based on the character level segmentation results of sample IP address information and/or sample domain name information and the label information of the sample IP address information and/or the sample domain name information.
The three models of the file path extraction model, the registry extraction model and the network behavior extraction model may be the same neural network or different neural networks, which is not limited in this embodiment.
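The character-level segmentation that precedes each extraction model can be sketched as follows. This is only an illustration under assumptions: the reserved indices `PAD` and `UNK`, the fixed `max_len`, and the helper names are not specified by the embodiment, and the recurrent model that would consume the index sequence is omitted.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # reserved indices (an assumed convention, not from the source)


def segment_chars(text: str) -> List[str]:
    """Character-level segmentation: one token per character."""
    return list(text)


def build_vocab(samples: List[str]) -> Dict[str, int]:
    """Assign an integer index to every character seen in the samples."""
    vocab: Dict[str, int] = {}
    for sample in samples:
        for ch in segment_chars(sample):
            if ch not in vocab:
                vocab[ch] = len(vocab) + 2  # indices 0 and 1 are reserved
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map characters to indices, then truncate or pad to max_len."""
    ids = [vocab.get(ch, UNK) for ch in segment_chars(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab([r"C:\a.exe"])
encoded = encode(r"C:\b.exe", vocab, 10)
```

The same routine applies unchanged to registry locations, IP addresses, and domain names, since all of them are treated as plain character strings; whether the three models share one vocabulary is a design choice the embodiment leaves open.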
In this embodiment, as shown in fig. 4, the parameters recorded in the behavior log are generated by calling an API with a registry modification function, an API with a file modification function, and an API with a network access function. The registry key name, file path, IP address, and URL domain name are segmented at the character level, and the segmented data are input into the intelligent semantic analysis models (i.e., the file path extraction model, the registry extraction model, and the network behavior extraction model), thereby obtaining a behavior information vector (i.e., an API semantic vector) of each API.
The malicious software detection method provided by the invention can comprehensively cover all conditions influencing the risk level of the API by extracting the characteristics of the API with different functions, and obtains the semantic information of the API under different operating environments from the parameters generated in the calling process of the API, thereby improving the accuracy of malicious software detection.
Further, the obtaining of the API category vector of each API in the software to be tested according to the API name type information in the dynamic behavior API sequence of each API in the software to be tested includes:
and converting the API name type information in the dynamic behavior API sequence of each API in the software to be tested into the API category vector of each API in the software to be tested by using a word vector model.
The word vector model may be any one of word2vec, GloVe, ELMo, or BERT, and is used for converting words in natural language into dense vectors. In this embodiment, the API name type information is converted into the API category vector by the word2vec method.
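Once a word vector model has been trained, converting an API name into a category vector is a table lookup. The sketch below illustrates only that lookup step: the class `ApiCategoryEmbedding` is hypothetical, the vectors are randomly initialized placeholders rather than trained word2vec vectors, and mapping unknown names to a zero vector is an assumption.

```python
import random
from typing import Dict, Iterable, List


class ApiCategoryEmbedding:
    """Lookup table from API name to a dense vector (placeholder values)."""

    def __init__(self, api_names: Iterable[str], dim: int = 4, seed: int = 0):
        rng = random.Random(seed)  # fixed seed keeps the placeholders repeatable
        self.dim = dim
        self.unknown: List[float] = [0.0] * dim
        self.table: Dict[str, List[float]] = {
            name: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for name in sorted(set(api_names))
        }

    def vector(self, api_name: str) -> List[float]:
        """Return the vector for api_name, or the zero vector if unseen."""
        return self.table.get(api_name, self.unknown)


# Illustrative API names only; the real vocabulary comes from the behavior logs.
embedding = ApiCategoryEmbedding(["NtCreateFile", "RegSetValueExW", "connect"])
category_vector = embedding.vector("NtCreateFile")
```

In a real system the table would be filled with vectors learned from API call sequences, so that APIs used in similar contexts receive similar vectors.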
By the malicious software detection method provided by the invention, path parameter information contained in the API sequence can be extracted, dynamic behavior semantic vectors are output, different behaviors of the same API are represented by dynamic characteristics, and the accuracy of malicious software detection is improved.
Further, before the preprocessing the behavior log of the software to be tested, the method further comprises: and running the software to be tested in the sandbox environment to obtain a behavior log of the software to be tested.
In this embodiment, a simulation environment in which software to be tested runs is constructed by a sandbox. The sandbox is a virtual system program used for testing the behavior of an untrusted file or an application program and the like.
According to the malicious software detection method, the behavior log is obtained in the sandbox environment, and static feature extraction and dynamic feature extraction are performed on the basis of the behavior log, so that the software to be tested is handled in isolation during malware detection, which avoids further impact on the system and improves system security.
In this embodiment, as shown in fig. 5, before the static features of the software to be tested and the dynamic features of each API in the software to be tested are input into the pre-trained malware detection model, the static features also need to be preprocessed. Specifically: discrete-valued features are converted into numerical features through a mapping dictionary; continuous numerical features are normalized (i.e., their values are mapped to the interval 0-1); and missing values are replaced by the mode or the mean.
Missing data are data that should have been extracted but were not, owing to an abnormality of the software to be tested during operation; they can be replaced by the mean computed from statistics over a large amount of data.
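The three preprocessing operations can be sketched in a few lines. The helper names and the sample values below are hypothetical; `None` is assumed to mark a missing value, and min-max scaling is assumed as the way of mapping values to the interval 0-1.

```python
from collections import Counter
from typing import Dict, List, Optional


def map_discrete(values: List[Optional[str]]) -> List[Optional[int]]:
    """Mapping-dictionary conversion: each distinct value gets an integer id."""
    mapping: Dict[str, int] = {}
    return [None if v is None else mapping.setdefault(v, len(mapping))
            for v in values]


def fill_missing_mode(values: List[Optional[int]]) -> List[int]:
    """Replace missing entries with the most frequent observed value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def fill_missing_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values: List[float]) -> List[float]:
    """Map continuous values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns: a discrete attribute and a continuous one.
subsystem = fill_missing_mode(map_discrete(["GUI", "CUI", None, "GUI"]))
file_size = min_max_scale(fill_missing_mean([100.0, None, 300.0]))
```

Both fill functions assume that at least one value in the column is present; a column with no observed values would need separate handling.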
After the preprocessed static features and the spliced dynamic features are obtained, the preprocessed static features and the spliced dynamic features are integrated to obtain software features of the software to be tested, and the software features are input into a malicious software detection model to obtain a detection result corresponding to the software to be tested.
In this embodiment, as shown in fig. 2, if the parameters in the dynamic behavior API sequence include other information in addition to the file path information, the location information of the registry modification, the IP address information, and the domain name information, the features of the other information are also spliced with the API category vector and the API semantic vector to obtain the dynamic features.
In the following, the malware detection apparatus provided by the present invention is described, and the malware detection apparatus described below and the malware detection method described above may be referred to in correspondence with each other.
Fig. 6 is a schematic structural diagram of a malware detection apparatus according to an embodiment of the present invention; as shown in fig. 6, a malware detection apparatus includes:
the log preprocessing module 610 is configured to preprocess the behavior log of the software to be tested, and obtain static attribute information of the software to be tested and a dynamic behavior API sequence of each API in the software to be tested.
Specifically, the log preprocessing module 610 runs the software to be tested in the simulation environment to obtain a behavior log of the software to be tested, preprocesses the behavior log, and extracts static attribute information and a dynamic behavior API sequence from the behavior log.
The static attribute information refers to static attribute information of the PE file, and specifically includes PE file section table information, PE file resource information, PE file import/export table information, PE file PDB information, Office file macro code information, mail content information, PDF file information, picture information, release file information, and the like. The static attribute information may be obtained through a malware analysis tool or through sandbox automated analysis, which is not limited in this embodiment.
The dynamic behavior API sequence refers to an API call sequence obtained by recording the API calls made by the software to be tested while it runs. The dynamic behavior API sequence may be obtained directly by an API monitoring tool of the system on which the software runs, or by other related technologies (for example, an API hooking technology), which is not limited in this embodiment.
And the static feature extraction module 620 is configured to perform static feature extraction on the static attribute information of the software to be tested, so as to obtain a static feature of the software to be tested.
Specifically, the static feature extraction module 620 extracts the PE file features as static features from static attribute information such as PE file section table information, PE file resource information, PE file import/export table information, PE file PDB information, Office file macro code information, mail content information, PDF file information, picture information, release file information, and the like.
The dynamic feature extraction module 630 is configured to perform dynamic feature extraction on the dynamic behavior API sequence of each API in the software to be tested, so as to obtain the dynamic feature of each API in the software to be tested.
Specifically, the dynamic feature extraction module 630 directly obtains API name type information from the dynamic behavior API sequence (obtained by preprocessing the behavior log), and extracts an API category vector from the API name type information.
When extracting features from the related parameters generated in the calling process, each dynamic behavior API sequence needs to be examined to detect whether the API involves a file operation behavior, a network connection behavior, or a registry operation behavior. After it is determined whether the API has a file modification function, a network access function, or a registry modification function, the dynamic behavior API sequences corresponding to APIs with different functions are processed accordingly, and dynamic features carrying semantic information are extracted from them.
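The function-type check can be sketched as a routing step that decides which extraction model each parameter should be sent to. The record layout (a dict with `api` and `args`) and the parameter names `file_path`, `registry_key`, `ip`, and `domain` are assumptions made for illustration; a real behavior log will use its own field names.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter names grouped by the function they indicate.
PARAMETER_KINDS: Tuple[Tuple[str, Tuple[str, ...]], ...] = (
    ("file", ("file_path",)),
    ("registry", ("registry_key",)),
    ("network", ("ip", "domain")),
)


def route_parameters(call: Dict) -> Dict[str, List[str]]:
    """Group the parameters of one API call by the model that should read them."""
    args = call.get("args", {})
    routed: Dict[str, List[str]] = {}
    for kind, keys in PARAMETER_KINDS:
        values = [str(args[key]) for key in keys if key in args]
        if values:
            routed[kind] = values
    return routed


network_call = {"api": "connect",
                "args": {"ip": "192.0.2.10", "domain": "example.com"}}
file_call = {"api": "NtWriteFile", "args": {"file_path": r"C:\Temp\a.tmp"}}

routed_network = route_parameters(network_call)
routed_file = route_parameters(file_call)
```

A call that carries none of the recognized parameters yields an empty result, in which case only the API category vector would be available for that call.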
The software detection module 640 is configured to input the static features of the software to be tested and the dynamic features of each API in the software to be tested into a pre-trained malware detection model, so as to obtain a detection result corresponding to the software to be tested.
The malicious software detection model is obtained by training based on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
Specifically, the malware detection model is a simple classifier for detecting whether the software to be tested is malware; the detection result indicates either that the software to be tested is malware or that it is not. Correspondingly, in the training process of the malware detection model, the label of each piece of sample software is either normal software or malware. The specific training process is as follows: the sample software and its corresponding labels are obtained in advance, and a malware detection network is constructed; features are extracted from the sample software through the log preprocessing module 610, the static feature extraction module 620, and the dynamic feature extraction module 630; and the static features and dynamic features corresponding to the sample software are input into the malware detection network for training until the network converges, thereby obtaining the trained malware detection model.
The malware detection device provided by the embodiment of the invention preprocesses the behavior log of the software to be tested to obtain static attribute information and dynamic behavior API sequences, extracts features from each to obtain the static features and the dynamic features of each API, and inputs these features into the malware detection model to obtain a detection result indicating whether the software to be tested is malware. The device not only uses the time sequence information of the API sequences but also analyzes and extracts the parameter information contained in the API functions. For example, the same API function may have different actual meanings when it operates on different files; in that case, features representing different risk levels can be extracted from the parameter information. Malware is thereby detected more accurately, the false alarm rate of the malware detection model is reduced, and the detection rate is improved.
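As an illustration of training a simple binary classifier on labeled feature vectors, the sketch below uses logistic regression with stochastic gradient descent. This is a stand-in chosen for brevity, not the network of the embodiment, which is not specified; the toy feature vectors are invented for the example and are assumed to be already-integrated software features, with label 1 for malware and 0 for normal software.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(samples: List[List[float]], labels: List[int],
          epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic-regression weights and bias by per-sample gradient steps."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights: List[float], bias: float, x: List[float]) -> int:
    """Return 1 (malware) if the predicted probability is at least 0.5."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy, linearly separable feature vectors (not real measurements).
X = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights, bias = train(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

A fixed number of epochs replaces the convergence check here; a practical implementation would monitor the loss on held-out samples instead.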
In this embodiment, the dynamic features include an API category vector and an API semantic vector, where the API semantic vector includes at least one or more of a file path information vector, a registry information vector, and a network behavior information vector.
The API category vector is obtained by extracting features from the API types in the behavior log, and the API semantic vector is obtained by extracting features from the related parameters generated in the calling process. The API semantic vector is determined by the functions of the APIs called while the software to be tested runs. If only an API with a network access function is called, the corresponding related parameters contain only network access parameters, and the extracted API semantic vector contains only the network behavior information vector. If the software to be tested calls both an API with a network access function and an API with a file modification function while running, the corresponding related parameters necessarily include network access parameters and file modification parameters, and the API semantic vector then includes both the file path information vector and the network behavior information vector.
According to the malicious software detection device provided by the embodiment of the invention, the API category vector and the API semantic vector are obtained, and the API semantic vector at least comprises one or more of the file path information vector, the registry information vector and the network behavior information vector, so that the API category and the semantic information corresponding to the API of the category in different environments can be determined, and the type and the semantic information are combined, so that whether the software to be detected is malicious software can be detected more accurately.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a malware detection method comprising: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training based on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium; when executed by a processor, the computer program performs the malware detection method provided above, the method comprising: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training based on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the malware detection method provided above, the method comprising: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training based on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A malware detection method, comprising:
preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training based on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
2. The malware detection method of claim 1, wherein the dynamic features comprise an API category vector and an API semantic vector, the API semantic vector comprising at least one or more of a file path information vector, a registry information vector, and a network behavior information vector.
3. The malware detection method of claim 2, wherein the dynamic feature extraction of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic feature of each API in the software to be tested comprises:
obtaining API category vectors of all APIs in the software to be tested according to API name type information in the dynamic behavior API sequence of all APIs in the software to be tested;
performing semantic analysis on parameters in the dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested; wherein the parameters at least comprise one or more of file path information, registry modified location information and IP address information and/or domain name information;
and splicing the API category vector of each API in the software to be tested with the API semantic vector of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested.
4. The malware detection method of claim 3, wherein performing semantic analysis on parameters in the dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested comprises:
under the condition that the API has a file modification function, acquiring file path information from the dynamic behavior API sequence;
performing character level segmentation on the file path information;
inputting the segmented file path information into a pre-trained file path extraction model to obtain a file path information vector;
the file path extraction model is obtained by training based on file path information after character level segmentation of the sample file and label information of the sample file;
and/or,
under the condition that the API has a registry modification function, acquiring registry modification location information from the dynamic behavior API sequence;
performing character level segmentation on the registry modification location information;
inputting the segmented registry modification location information into a pre-trained registry extraction model to obtain a registry information vector;
the registry extraction model is obtained by training based on the character-level-segmented registry modification location information of a sample registry and the label information of the sample registry;
and/or,
under the condition that the API has a network access behavior, acquiring IP address information and/or domain name information from the dynamic behavior API sequence;
performing character level segmentation on the IP address information and/or the domain name information;
inputting the segmented IP address information and/or domain name information into a pre-trained network behavior extraction model to obtain the network behavior information vector;
the network behavior extraction model is obtained by training based on character level segmentation results of sample IP address information and/or sample domain name information and label information of the sample IP address information and/or the sample domain name information.
5. The malware detection method of claim 3, wherein obtaining the API category vector of each API in the software to be tested according to the API name type information in the API sequence of the dynamic behavior of each API in the software to be tested comprises:
and converting the API name type information in the dynamic behavior API sequence of each API in the software to be tested into the API category vector of each API in the software to be tested by using a word vector model.
6. The malware detection method of any one of claims 1 to 5, wherein prior to the preprocessing of the behavior log of the software to be tested, the method further comprises:
and running the software to be tested in the sandbox environment to obtain a behavior log of the software to be tested.
7. A malware detection apparatus, comprising:
a log preprocessing module, configured to preprocess a behavior log of software to be tested to obtain static attribute information of the software to be tested and a dynamic behavior API sequence of each API in the software to be tested;
a static feature extraction module, configured to perform static feature extraction on the static attribute information of the software to be tested to obtain static features of the software to be tested;
a dynamic feature extraction module, configured to perform dynamic feature extraction on the dynamic behavior API sequence of each API in the software to be tested to obtain dynamic features of each API in the software to be tested;
a software detection module, configured to input the static features of the software to be tested and the dynamic features of each API in the software to be tested into a pre-trained malware detection model to obtain a detection result corresponding to the software to be tested;
wherein the malware detection model is trained based on static features of sample software, dynamic features of each API in the sample software, and label information of the sample software.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the malware detection method of any one of claims 1 to 6 are implemented when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the malware detection method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the malware detection method of any one of claims 1 to 6.
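As an illustration of the preprocessing referred to in claims 4 and 5, the following minimal Python sketch shows character-level segmentation of a domain name or IP address and a simple way to turn API names into vectors. It is not the implementation of the claimed method: the function names, the padding length, the `<pad>` and `<unk>` tokens and the sample API names are all assumptions made for the example, and count-based co-occurrence vectors stand in here for whatever word vector model is actually used.

```python
def char_segment(text, max_len=32, pad="<pad>"):
    """Split an IP address or domain name into single characters,
    truncated or padded to a fixed length (max_len is an assumed value)."""
    chars = list(text.lower())[:max_len]
    return chars + [pad] * (max_len - len(chars))


def build_vocab(sequences, pad="<pad>", unk="<unk>"):
    """Assign an integer id to every token seen in the sample sequences."""
    vocab = {pad: 0, unk: 1}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(seq, vocab, unk="<unk>"):
    """Map tokens to ids; tokens not in the vocabulary map to the <unk> id."""
    return [vocab.get(tok, vocab[unk]) for tok in seq]


def cooccurrence_vectors(api_sequences, window=1):
    """Count-based word vectors: one vector per API name, one dimension per
    API name, counting how often the two appear within `window` calls."""
    names = sorted({name for seq in api_sequences for name in seq})
    index = {name: i for i, name in enumerate(names)}
    vectors = {name: [0] * len(names) for name in names}
    for seq in api_sequences:
        for i, name in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[name][index[seq[j]]] += 1
    return vectors


# Hypothetical inputs, not taken from any real behavior log.
domain_chars = char_segment("a.cn", max_len=6)
vocab = build_vocab([domain_chars])
domain_ids = encode(domain_chars, vocab)

api_vectors = cooccurrence_vectors([["NtCreateFile", "NtWriteFile", "NtClose"]])
```

In this sketch the integer ids produced by `encode` would be the input to a character-level model for network behavior, while each entry of `api_vectors` plays the role of an API category vector. A trained embedding model would normally give dense vectors of a fixed, smaller dimension.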
CN202111530747.4A 2021-12-14 2021-12-14 Malicious software detection method and device, electronic equipment, medium and product Pending CN114491523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111530747.4A CN114491523A (en) 2021-12-14 2021-12-14 Malicious software detection method and device, electronic equipment, medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111530747.4A CN114491523A (en) 2021-12-14 2021-12-14 Malicious software detection method and device, electronic equipment, medium and product

Publications (1)

Publication Number Publication Date
CN114491523A (en) 2022-05-13

Family

ID=81495002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111530747.4A Pending CN114491523A (en) 2021-12-14 2021-12-14 Malicious software detection method and device, electronic equipment, medium and product

Country Status (1)

Country Link
CN (1) CN114491523A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521068A (en) * 2023-12-08 2024-02-06 北京云弈科技有限公司 Linux host malicious software detection method, system, device and medium
CN117707986A (en) * 2024-02-05 2024-03-15 钦原科技有限公司 Software power consumption testing method and system for mobile terminal
CN117972702A (en) * 2024-04-01 2024-05-03 山东省计算中心(国家超级计算济南中心) API call heterogeneous parameter enhancement-based malicious software detection method and system

Similar Documents

Publication Publication Date Title
CN108734012B (en) Malicious software identification method and device and electronic equipment
CN111460446B (en) Malicious file detection method and device based on model
CN109858248B (en) Malicious Word document detection method and device
CN107659570A (en) Webshell detection methods and system based on machine learning and static and dynamic analysis
US11212297B2 (en) Access classification device, access classification method, and recording medium
CN109992969B (en) Malicious file detection method and device and detection platform
CN110909348B (en) Internal threat detection method and device
CN114491523A (en) Malicious software detection method and device, electronic equipment, medium and product
CN111368289B (en) Malicious software detection method and device
CN112492059A (en) DGA domain name detection model training method, DGA domain name detection device and storage medium
CN109145030B (en) Abnormal data access detection method and device
CN113360912A (en) Malicious software detection method, device, equipment and storage medium
CN112632537A (en) Malicious code detection method, device, equipment and storage medium
CN114090406A (en) Electric power Internet of things equipment behavior safety detection method, system, equipment and storage medium
CN112688966A (en) Webshell detection method, device, medium and equipment
CN111651768A (en) Method and device for identifying link library function name of computer binary program
CN113468524B (en) RASP-based machine learning model security detection method
CN109241739B (en) API-based android malicious program detection method and device and storage medium
CN110210216B (en) Virus detection method and related device
CN112231696B (en) Malicious sample identification method, device, computing equipment and medium
CN113971284B (en) JavaScript-based malicious webpage detection method, equipment and computer readable storage medium
Čeponis et al. Evaluation of deep learning methods efficiency for malicious and benign system calls classification on the AWSCTD
JP7439916B2 (en) Learning device, detection device, learning method, detection method, learning program and detection program
CN115146275A (en) Container safety protection method and device, electronic equipment and storage medium
CN115643044A (en) Data processing method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination