CN111460446A

CN111460446A - Malicious file detection method and device based on model

Info

Publication number: CN111460446A
Application number: CN202010151740.0A
Authority: CN
Inventors: 白皓文; 白敏�; 刘爽; 白子潘; 汪列军; 潘博文; 卫福龙
Original assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Current assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Priority date: 2020-03-06
Filing date: 2020-03-06
Publication date: 2020-07-28
Anticipated expiration: 2040-03-06
Also published as: CN111460446B

Abstract

The embodiment of the invention provides a malicious file detection method and device based on a model; the method comprises the following steps: acquiring a file to be detected; analyzing the file to be detected to obtain characteristic information of the file to be detected; and inputting the characteristic information of the file to be detected into a pre-constructed malicious family detection model, and acquiring whether the file to be detected belongs to a certain malicious family and a first detection result of a corresponding confidence coefficient. The malicious file detection method and device based on the model provided by the embodiment of the invention can be used for analyzing various types of files to be detected, extracting characteristic information with various dimensions and abundant types from the files to be detected, and inputting the characteristic information into the malicious family detection model, thereby realizing the detection of whether the files to be detected belong to a certain type of malicious family.

Description

Malicious file detection method and device based on model

Technical Field

The invention relates to the field of network security, in particular to a malicious file detection method and device based on a model.

Background

With the widespread adoption of intelligent devices such as computers and mobile intelligent terminals, some organizations or individuals add malicious code with specific purposes to electronic files in order to steal users' information and funds or to achieve other ulterior purposes. These electronic files carrying malicious code are also referred to as malicious files. In recent years, the number of malicious files has increased explosively, and the timely detection of malicious files has become a primary problem for network security analysts and operators.

Prior-art methods for detecting malicious files mainly analyze static information of a sample file to be detected and judge whether the sample file is a malicious file according to the analysis result. Because the information on which such detection is based is limited, the detection results have low accuracy and the detection efficiency is low.

Disclosure of Invention

The embodiment of the invention provides a malicious file detection method and device based on a model, which are used for solving the defects that the detection result of the malicious file detection method in the prior art is low in accuracy and low in detection efficiency.

An embodiment of a first aspect of the present invention provides a method for detecting a malicious file based on a model, including:

acquiring a file to be detected;

analyzing the file to be detected to obtain characteristic information of the file to be detected; the characteristic information comprises dynamic behavior information and static file information of the subfiles of each level in each level contained in the file to be detected, and the level relation of the subfiles in all levels contained in the file to be detected; the dynamic behavior information is information generated in the process that the subfiles are executed, and the static file information is information obtained by analyzing the subfiles in the non-executed state in a static analysis mode;

inputting the characteristic information of the file to be detected into a pre-constructed malicious family detection model, and acquiring whether the file to be detected belongs to a certain malicious family and a first detection result of a corresponding confidence coefficient; wherein,

the malicious family detection model is a model which is obtained by training in a machine learning mode by taking the characteristic information of a known malicious file and the label information of the known malicious file as sample data and is used for acquiring whether the file to be detected belongs to a certain malicious family and a detection result of a corresponding confidence coefficient; wherein,

the characteristic information of the known malicious file comprises dynamic behavior information and static file information of subfiles of each level in all levels contained in the known malicious file, and the level relation of the subfiles in all levels contained in the known malicious file; the dynamic behavior information is information generated in the process that the subfiles are executed, and the static file information is information obtained by analyzing the subfiles in the non-executed state in a static analysis mode; the tag information of the known malicious file includes information of a malicious family to which the known malicious file belongs.

In the above technical solution, further comprising:

inputting the characteristic information of the file to be detected into a pre-constructed attack group detection model, and acquiring whether the file to be detected is from a certain attack group and a second detection result of a corresponding confidence coefficient; the attack group detection model is a model which is obtained by taking characteristic information of known malicious files and label information of the known malicious files as sample data and adopting a machine learning mode to train and is used for obtaining whether the files to be detected are from a certain attack group and a detection result of corresponding confidence; wherein,

the characteristic information of the known malicious file comprises dynamic behavior information and static file information of subfiles of each level in all levels contained in the known malicious file, and the level relation of the subfiles in all levels contained in the known malicious file; the dynamic behavior information is information generated in the process that the subfiles are executed, and the static file information is information obtained by analyzing the subfiles in the non-executed state in a static analysis mode; the tag information of the known malicious file includes information of the attack group from which the known malicious file originated.

In the above technical solution, further comprising:

inputting the characteristic information of the file to be detected into a pre-constructed malicious family detection IBK model, and acquiring whether the file to be detected belongs to a certain malicious family and a third detection result of a corresponding confidence coefficient;

comparing a first detection result obtained by adopting the malicious family detection model with a third detection result obtained by adopting the malicious family detection IBK model, and obtaining a final confidence coefficient about the malicious family detection result according to the comparison result; wherein,

the malicious family detection IBK model is a model which is obtained by taking the characteristic information of known malicious files and the label information of the known malicious files as sample data and adopting IBK classification algorithm training and is used for obtaining whether the files to be detected belong to a certain malicious family and the detection result of the corresponding confidence coefficient.

In the above technical solution, further comprising:

inputting the characteristic information of the file to be detected into a pre-constructed attack group detection IBK model, and acquiring whether the file to be detected is from a certain attack group and a fourth detection result of a corresponding confidence coefficient;

comparing the second detection result obtained by adopting the attack group detection model with a fourth detection result obtained by adopting the attack group detection IBK model, and obtaining a final confidence coefficient about the attack group detection result according to the comparison result; wherein,

the attack group detection IBK model is a model which is obtained by taking characteristic information of known malicious files and label information of the known malicious files as sample data and adopting IBK classification algorithm training and is used for obtaining whether the files to be detected are from a certain attack group and the detection result of corresponding confidence coefficient.

In the above technical solution, the malicious family detection model is created in the following manner:

acquiring a plurality of known malicious files; the known malicious files comprise tag information for describing a malicious family to which the known malicious files belong;

analyzing the acquired multiple known malicious files to obtain characteristic information of the known malicious files;

and training by using the characteristic information of the known malicious file and the label information of the known malicious file as sample data in a machine learning mode to generate a malicious family detection model for acquiring whether the file to be detected belongs to a certain malicious family and a detection result of a corresponding confidence coefficient.

In the above technical solution, the malicious family detection IBK model is created in the following manner:

and training by using the characteristic information of the known malicious file and the label information of the known malicious file as sample data by adopting an IBK classification algorithm to generate a malicious family detection IBK model for acquiring whether the file to be detected belongs to a certain malicious family and a detection result of a corresponding confidence coefficient.

In the above technical solution, the attack group detection model is created in the following manner:

acquiring a plurality of known malicious files; the known malicious files comprise tag information describing the attack groups from which the known malicious files originate;

and training by using the characteristic information of the known malicious file and the label information of the known malicious file as sample data in a machine learning mode to generate an attack group detection model for acquiring whether the file to be detected is from a certain attack group and a detection result of corresponding confidence.

In the above technical solution, the attack group detection IBK model is created in the following manner:

and training by using the characteristic information of the known malicious file and the label information of the known malicious file as sample data by adopting an IBK classification algorithm to generate an attack group detection IBK model for acquiring whether the file to be detected is from a certain attack group and a detection result of corresponding confidence.

In the above technical solution, the analyzing the file to be detected to obtain the characteristic information of the file to be detected includes:

analyzing a file to be detected and determining the hierarchical structure of the file to be detected;

analyzing information items to be loaded when the subfiles of each level in each level of a file to be detected are executed, and obtaining dynamic execution information of the subfiles;

analyzing the fixed item of the subfile of each level in each level of the file to be detected to obtain the static file information of the subfile;

recording the hierarchical relation of each subfile in all hierarchies contained in the file to be detected;

converting the dynamic execution information and the static file information into a uniform intermediate temporary file object, and performing digital characterization on the intermediate temporary file object to obtain a first feature set corresponding to the dynamic execution information and a second feature set corresponding to the static file information;

and determining a feature vector for representing feature information of the file to be detected according to the first feature set, the second feature set, and the hierarchical relationship of each subfile in all the hierarchies contained in the file.

In the technical scheme, the malicious family detection model is obtained by training through a random forest method.

In the technical scheme, the attack group detection model is obtained by training through a random forest method.

An embodiment of a second aspect of the present invention provides a malicious file detection apparatus based on a model, including:

the file acquisition module to be detected is used for acquiring a file to be detected;

the file analysis module to be detected is used for analyzing the file to be detected to obtain the characteristic information of the file to be detected; the characteristic information comprises dynamic behavior information and static file information of the subfiles of each level in each level contained in the file to be detected, and the level relation of the subfiles in all levels contained in the file to be detected; the dynamic behavior information is information generated in the process that the subfiles are executed, and the static file information is information obtained by analyzing the subfiles in the non-executed state in a static analysis mode;

the malicious family detection module is used for inputting the characteristic information of the file to be detected into a pre-constructed malicious family detection model, and acquiring whether the file to be detected belongs to a certain malicious family and a first detection result of corresponding confidence.

In the above technical solution, further comprising:

the attack group detection module is used for inputting the characteristic information of the file to be detected into a pre-constructed attack group detection model and acquiring whether the file to be detected is from a certain attack group and a second detection result of a corresponding confidence coefficient; wherein,

the attack group detection model is a model which is obtained by taking the characteristic information of known malicious files and the label information of the known malicious files as sample data and adopting a machine learning mode to train and is used for obtaining whether the files to be detected are from a certain attack group and the detection result of the corresponding confidence coefficient.

In an embodiment of a third aspect of the present invention, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the method for detecting a malicious file based on a model according to an embodiment of the first aspect of the present invention are implemented.

A fourth aspect of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the model-based malicious file detection method according to the first aspect of the present invention.

A fifth aspect embodiment of the present invention provides a computer program product, which includes computer executable instructions, and when executed, the instructions are configured to implement the steps of the model-based malicious file detection method according to the first aspect embodiment of the present invention.

According to the method and the device for detecting the malicious file based on the model, provided by the embodiment of the invention, the files to be detected are analyzed, the characteristic information with various dimensions and abundant types is extracted from the files to be detected, and the characteristic information is input into the malicious family detection model, so that the detection of whether the files to be detected belong to a certain type of malicious family is realized.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart of a malicious file detection method based on a model according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a hierarchical structure of a file;

Fig. 3 is a flowchart of a malicious file detection apparatus based on a model according to an embodiment of the present invention;

Fig. 4 illustrates a physical structure diagram of an electronic device.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Before describing the present invention in detail, a unified description of related concepts involved in the present invention will be provided.

Malicious family: refers to a collection of malware that has similarities, inheritance, and derivations.

Attack group: a group that conducts APT attacks is defined as an attack group. An APT (advanced persistent threat) is a covert and persistent computer intrusion process, usually carefully planned by its operators and aimed at a specific target. It is usually motivated by commercial or political reasons, is directed at a particular organization or country, and requires a high degree of concealment to be maintained over a long period of time. An advanced persistent threat consists of three elements: advanced, persistent, and threat. "Advanced" emphasizes the use of sophisticated malware and techniques to exploit vulnerabilities in a system. "Persistent" implies that an external force continuously monitors a particular target and obtains data from it. "Threat" refers to an attack that is planned and carried out with human participation.

As can be seen from the definition of the malicious family and the attack group, the object of the malicious family is software and the object of the attack group is a person or an organization.

Fig. 1 is a flowchart of a method for detecting a malicious file based on a model according to an embodiment of the present invention. As shown in Fig. 1, the method for detecting a malicious file based on a model according to an embodiment of the present invention includes:

step 101, acquiring a file to be detected.

As the name implies, the file to be detected is a file on which malicious file detection has not yet been performed. The malicious file detection method provided by the embodiment of the invention is used for determining whether the file to be detected is a normal file or a malicious file. If the file is malicious, it is also necessary to detect which malicious family the file belongs to and/or which attack group it comes from.

The file to be detected can be of various types, including but not limited to Windows executable files, Office documents, Office compound documents, PDF files, ZIP compressed package files, RAR compressed package files, GZ compressed package files, Rich Text Format files, Email files, Linux executable files, Adobe Flash files, Windows shortcut files, HWP files, InPage files, Android APK files and the like.

As will be mentioned in the following description, the malicious file detection method provided by the embodiment of the present invention can extract corresponding feature information deep inside a file through analysis of a file hierarchy structure, and thus can support detection of multiple types of files. Compared with the malicious file detection method in the prior art, the malicious file detection method provided by the embodiment of the invention has the advantage that the types of the supported files are obviously increased.

Step 102, analyzing the file to be detected to obtain a feature vector of the file.

The specific process of analyzing the file to obtain the characteristic information of the file comprises the following steps:

Step 102-1, analyzing the file, determining the hierarchical structure of the file, and acquiring a hierarchical file information set;

Step 102-2, obtaining a feature vector of the file according to the hierarchical file information set.

Because a malicious file can hide identifying information in its inner layers, it is difficult to identify the malicious file effectively by relying only on external detection. For example, a RAR compressed package file may store multiple files of different types inside the compressed package. As another example, a Word file may have various links embedded in it.

In view of this characteristic of malicious files, in the embodiment of the invention, the analysis needs to go deep into the file when extracting its characteristic values. To do so, the hierarchical structure of the file needs to be determined, and a hierarchical file information set corresponding to the file is acquired from the file according to the hierarchical structure.

Files generally have a hierarchical structure. For example, a RAR compressed package file includes two levels: the compressed package itself serves as the subfile of the first level, and the files in the compressed package serve as subfiles of the second level. The hierarchical structure of a file is not limited to the two-level structure in this example and may have more levels. Fig. 2 is a schematic diagram of a hierarchical structure of a file.
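As an illustration of determining such a hierarchical structure, the following is a minimal sketch and not the patented implementation. It uses ZIP compressed packages instead of RAR because the Python standard library can read ZIP directly; the function names walk_levels and make_zip and all file names are hypothetical.

```python
import io
import zipfile


def walk_levels(data, name, level=1, parent=None):
    """Return (level, name, parent) records for a file and any nested ZIP members."""
    records = [(level, name, parent)]
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            for member in archive.namelist():
                # Each member sits one level below the package that contains it.
                records += walk_levels(archive.read(member), member, level + 1, name)
    return records


def make_zip(members):
    """Build an in-memory ZIP from a {name: bytes} mapping (helper for the example)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member_name, payload in members.items():
            archive.writestr(member_name, payload)
    return buffer.getvalue()


# Hypothetical three-level sample: outer.zip -> inner.zip -> picture.png
inner = make_zip({"picture.png": b"not a real image"})
outer = make_zip({"inner.zip": inner, "readme.txt": b"hello"})
levels = walk_levels(outer, "outer.zip")
for record in levels:
    print(record)
```

Each record keeps the parent name, so the hierarchical relationship between subfiles can be recovered from the list.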

The hierarchical file information set includes dynamic behavior information and static file information of the subfiles in each hierarchy, and hierarchical relationships of the subfiles in all the hierarchies.

The dynamic behavior information refers to information generated in the process of executing the file. Examples include the maximum value of the file sub-stream, the minimum value of the file sub-stream, the number of pe sub-streams, the number of pdf sub-streams, the number of png pictures, the number of jpg pictures, the number of ole objects, the number of api calls, the number of registry operations, the number of released files, and the like.

The static file information is obtained by analyzing the subfiles in the non-execution state in a static analysis mode. Examples include the file name, file author name, file size, file type, hash value, creation time, modification time, and the like.

The hierarchical relationship is the containment relationship between the files. For example, a compressed package file contains a Word file, a link file is inserted into the Word file, and the link file contains a picture file.

If the file to be analyzed only has one level, the level file information set corresponding to the file only comprises the dynamic behavior information and the static file information of the file in the first level.

If the file to be analyzed comprises at least two levels, the level file information set corresponding to the file comprises the dynamic behavior information and the static file information of the subfiles in each level and the level relation of each subfile in all levels.

If a subfile in a level cannot be executed, that subfile has only static file information.

The dynamic behavior information of the file is obtained by analyzing the execution items in the file; an execution item is an information item that the file loads when it is executed.

The static file information of the file is obtained by analyzing the basic items of the file; a basic item is a fixed item of the file, such as an author item, a time item, a type item, and the like.

The hierarchical relationship may be obtained by recording the relationship between subfiles of one level and subfiles of other levels (if any) in the file.
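The hierarchical file information set described above can be pictured as a small tree of records. The sketch below is only one possible representation under stated assumptions: the class SubFile, the function build_info_set, the field names, and all sample values are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SubFile:
    name: str
    static_info: Dict[str, object]
    # None models a subfile that cannot be executed and so has only static information.
    dynamic_info: Optional[Dict[str, object]] = None
    children: List["SubFile"] = field(default_factory=list)


def build_info_set(node, level=1, info_set=None):
    """Collect per-subfile information and (parent, child) relations in pre-order."""
    if info_set is None:
        info_set = {"subfiles": [], "relations": []}
    info_set["subfiles"].append(
        {"name": node.name, "level": level,
         "static": node.static_info, "dynamic": node.dynamic_info}
    )
    for child in node.children:
        info_set["relations"].append((node.name, child.name))
        build_info_set(child, level + 1, info_set)
    return info_set


# Hypothetical sample: package -> Word file -> link file -> picture file
picture = SubFile("image.png", {"file_type": "png", "file_size": 1024})
link = SubFile("link.lnk", {"file_type": "lnk"}, children=[picture])
doc = SubFile("report.doc", {"file_type": "doc"}, {"api_call_count": 12}, [link])
package = SubFile("sample.rar", {"file_type": "rar"}, children=[doc])

info_set = build_info_set(package)
print(info_set["relations"])
```

Storing the relations separately from the per-subfile information mirrors the three parts of the set: dynamic behavior information, static file information, and hierarchical relationships.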

After the hierarchical file information set is obtained, the feature vector of the file can be obtained according to the hierarchical file information set.

The hierarchical file information set comprises dynamic behavior information of the file and static file information of the file. However, certain specific information may either appear during the dynamic execution of the file or be obtained from static analysis of the file; that is, such information, for example the number of png pictures, may be classified as either dynamic behavior information or static file information. If this information is processed once with the dynamic behavior information and again with the static file information, computing resources are wasted and the accuracy of the subsequent malicious file detection result may also be affected.

Therefore, after the dynamic behavior information and the static file information of the file are obtained, files of different types can be converted into a uniform intermediate temporary file object by combining the dynamic behavior information and the static file information, and the intermediate temporary file object is then digitized to generate a numerical feature vector. For example, the static file information is stored in one JSON file and the dynamic behavior information is stored in another JSON file. The two JSON files are merged into a single JSON file, and the merged JSON file is the intermediate temporary file object obtained after conversion.
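The merging of the two JSON descriptions can be sketched as follows. This is an illustrative assumption about how duplicates might be handled, not the method fixed by the source: the function merge_info and all key names and values are hypothetical, and an item that appears in both inputs (here png_count) is kept only once.

```python
import json

# Hypothetical static and dynamic descriptions of the same file.
static_json = '{"file_name": "sample.doc", "file_size": 20480, "png_count": 2}'
dynamic_json = '{"api_call_count": 37, "registry_op_count": 4, "png_count": 2}'


def merge_info(static_text, dynamic_text):
    """Merge two JSON objects into one, keeping a single copy of shared keys."""
    merged = json.loads(static_text)
    for key, value in json.loads(dynamic_text).items():
        # setdefault keeps the value already present, so a shared item is not counted twice.
        merged.setdefault(key, value)
    return merged


intermediate = merge_info(static_json, dynamic_json)
print(json.dumps(intermediate, indent=2))
```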

As mentioned above, the method provided by the embodiment of the present invention supports multiple types of files, and thus can generate a variety of different intermediate temporary file objects. In the embodiment of the present invention, in addition to the file in the JSON format mentioned in the previous example, the types of the intermediate temporary file object include, but are not limited to: PE file section table information, PE file resource information, PE file import and export table information, PE file PDB information, Office file VB macro code information, Office file Sheet macro code information, Office file version information, PDF file script information, Email mail text content information, Email mail attachment information, sandbox API sequence information, sandbox API calling frequency information, sandbox network behavior information, sandbox release file information, sandbox registry operation information and the like.

The information in the intermediate temporary file object may be divided into word information and number information according to how it is expressed. Word information is information described in the file by words or sentences; for example, in "author name: Lisan", the value is expressed in words. Number information is information described in the file numerically; for example, in "file size: 20 kb", 20 is a numerical expression.

When the dynamic behavior information and the static file information are digitized, the processing differs according to whether the information is word information or number information: word information is converted into numbers, and number information has its values extracted. Specifically, for word information, word frequency and word length statistics can be computed with a bag-of-words method to generate digitized features. For number information, the corresponding numerical value can be taken directly as the digitized feature.
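A minimal sketch of this two-way digitization is given below. The source names word frequency and word length statistics but does not fix the exact statistics, so the three statistics computed here, the function name digitize, and the sample keys and values are all assumptions made for illustration.

```python
from collections import Counter


def digitize(info):
    """Map each item to numeric features: numbers pass through, text gets bag-of-words statistics."""
    features = {}
    for key, value in info.items():
        if isinstance(value, (int, float)):
            # Number information: take the numerical value directly.
            features[key] = value
        else:
            # Word information: count words, then derive frequency and length statistics.
            words = str(value).lower().split()
            counts = Counter(words)
            features[key + ".word_count"] = len(words)
            features[key + ".top_word_frequency"] = max(counts.values(), default=0)
            features[key + ".max_word_length"] = max((len(w) for w in words), default=0)
    return features


features = digitize(
    {"author_name": "Lisan", "file_size_kb": 20, "title": "invoice invoice urgent"}
)
print(features)
```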

After the dynamic behavior information and the static file information in the hierarchical file information set have been converted into the intermediate temporary file object and digitized, a first feature set and a second feature set are generated respectively. The corresponding feature vector is then generated according to a preset rule from the first feature set, the second feature set, and the hierarchical relationship of each file in the hierarchy. In the embodiment of the present invention, the obtained feature vector has 1 × n dimensions. For example, in the feature vector (3, 4, 1, 0, 0, …, -1), 3 indicates the number of entries whose squared string length equals 10, 4 indicates the number of entries whose squared string length equals 11, 1 indicates that the hostxx function is called once, the first 0 indicates that the internet xxx function is called 0 times, the second 0 indicates that the number of files of the hfxx type is 0, and so on; the final -1 indicates that the label of the feature vector is unknown. The value of n can be adjusted according to the specific application scenario.

It should be noted that a uniform feature vector format may be set for different files, that is, all features that a file can theoretically contain are described in one feature vector, and then, according to the features of a specific file, corresponding features in the feature vector are assigned. For a specific file, if the file does not contain a feature, the feature value corresponding to the feature in the feature vector is 0 by default.
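The uniform feature vector format can be sketched as a fixed ordering of feature names in which missing features default to 0 and the last position holds the label, with -1 meaning unknown. The feature names in FEATURE_ORDER and the function to_vector are hypothetical and chosen only for illustration.

```python
# Hypothetical fixed ordering shared by all files; a real deployment would list every supported feature.
FEATURE_ORDER = [
    "png_count",
    "jpg_count",
    "api_call_count",
    "registry_op_count",
    "released_file_count",
]
UNKNOWN_LABEL = -1  # label slot value for a file whose family is not yet known


def to_vector(features, label=UNKNOWN_LABEL):
    """Place features in the fixed order, using 0 for any feature the file does not contain."""
    return [features.get(name, 0) for name in FEATURE_ORDER] + [label]


vector = to_vector({"png_count": 3, "api_call_count": 4})
print(vector)
```

Because every file is mapped onto the same positions, vectors from different file types can be fed to the same model.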

It should be understood by those skilled in the art that the feature values in the feature vector obtained by analyzing the file to be detected depend on the file to be detected itself, and the feature values corresponding to different files to be detected are very likely to be different.

Step 103, inputting the characteristic information of the file to be detected into a pre-constructed malicious family detection model, and acquiring whether the file to be detected belongs to a certain malicious family and a first detection result of a corresponding confidence coefficient.

In the embodiment of the invention, the malicious family detection model is a model which is obtained by training in a machine learning mode by taking the characteristic information of the known malicious file and the label information of the known malicious file as sample data and is used for acquiring whether the file to be detected belongs to a certain malicious family and a detection result of a corresponding confidence coefficient.

The tag information of the known malicious file includes information of which malicious family the known malicious file belongs to.

The characteristic information of the known malicious file is the same as that of the file to be detected in category, and if the characteristic information of the known malicious file and the characteristic information of the file to be detected both include dynamic behavior information and static file information of the file, the dynamic behavior information includes information generated in the executed process of the file. Such as maximum file sub-stream, minimum file sub-stream, number of pe sub-streams, number of pdf sub-streams, number of png pictures, number of jpg pictures, number of ole objects, number of api calls, number of registry operations, number of released files, etc. The static file information may include fixed information that cannot be executed or executed in the file, such as a file name, a file author name, a file size, a file type, a hash value, creation time, modification time, and the like.

The characteristic values of known malicious files have their own characteristics. The known malicious files used for training the malicious family detection model are files whose malicious family can be determined, so their characteristic values reflect the characteristics of malicious files belonging to that malicious family. For example, malicious files belonging to malicious family A may differ significantly from malicious files of other malicious families in the values of several features, such as the number of PNG pictures, the number of JPG pictures, the number of API calls, the maximum file sub-stream, and the minimum file sub-stream, for instance by falling within specific value ranges. Therefore, by combining the characteristic values of the known malicious files with the tag information of the known malicious files, a malicious family detection model capable of identifying a specific malicious family can be trained.

In the embodiment of the invention, the malicious family detection model is obtained by training with the random forest method. When the random forest is used for malicious family detection, each random tree in the forest casts one vote for a possible result, and the possible result with the largest number of votes is the output result of the model. The confidence of the output result is obtained by dividing the number of votes for the final result by the total number of votes.
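The voting rule described above can be sketched in a few lines of Python. The function name `forest_vote` and the example votes are hypothetical; the sketch assumes each tree has already produced exactly one family label.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return (winning_label, confidence) given one predicted label per tree.

    Confidence is the number of votes for the winning label divided by the
    total number of votes, as described in the text.
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    votes = Counter(tree_predictions)
    label, count = votes.most_common(1)[0]
    return label, count / len(tree_predictions)


# Hypothetical votes from a forest of 200 trees.
predictions = ["adwind"] * 196 + ["xtreme"] * 4
family, confidence = forest_vote(predictions)
```

With these invented votes, 196 of 200 trees agree, so the confidence is 196/200.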

In other embodiments of the present invention, the malicious family detection model may also be generated by other methods, such as a J48 decision tree algorithm, a bayesian classification algorithm, an adaboost algorithm, and the like.

The detection result of the malicious family detection model is described below with reference to an example.

The output result of a malicious family detection model is as follows:

{'sha256': '57abdf298632cd08f5da86aeed73f00a7167f1c1aad36fef2603aeb5be3fb95a', 'classify_family': {'adwind': 0.98}}

The output result of another malicious family detection model is:

{'sha256': '57b230b03bf5db1b80c1ed4d942b74b6c44e25c1d7a47ac6aa4f9afd3a1dbeb5', 'classify_family': {'xtreme': 0.65}}

In the two detection results, the sha256 field represents the hash value of the file to be detected, and the classify_family field indicates the current classification result. The file to be detected with the hash value 57abdf298632cd08f5da86aeed73f00a7167f1c1aad36fef2603aeb5be3fb95a is determined to belong to the adwind family with a confidence of 0.98; the file to be detected with the hash value 57b230b03bf5db1b80c1ed4d942b74b6c44e25c1d7a47ac6aa4f9afd3a1dbeb5 is determined to belong to the xtreme family with a confidence of 0.65.
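For readers who want to consume such a result programmatically, the following is a minimal Python sketch. It assumes the result is delivered as text in the single-quoted form shown above, which is a Python-style literal rather than strict JSON; the variable names are illustrative.

```python
import ast

# The first example result from the text, assumed to arrive as a string.
raw = ("{'sha256': '57abdf298632cd08f5da86aeed73f00a7167f1c1aad36fef2603aeb5be3fb95a', "
       "'classify_family': {'adwind': 0.98}}")

# literal_eval safely parses Python literals without executing code.
result = ast.literal_eval(raw)

# classify_family maps the predicted family name to its confidence.
family, confidence = next(iter(result["classify_family"].items()))
file_hash = result["sha256"]
```

If the result were emitted as strict JSON with double quotes instead, `json.loads` would be the natural choice.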

A confidence equal to 0 indicates that the result is completely untrustworthy, and a confidence equal to 1 indicates that the result is completely trustworthy. For a confidence between 0 and 1, the higher the confidence, the more trustworthy the result.

The malicious file detection method based on the model provided by the embodiment of the invention can analyze various types of files to be detected, extract characteristic information with various dimensions and abundant types from the files to be detected, and input the characteristic information into the malicious family detection model, thereby realizing the detection of whether the files to be detected belong to a certain type of malicious family.

Based on any of the above embodiments, in an embodiment of the present invention, the method for detecting a malicious file based on a model further includes:

inputting the characteristic information of the file to be detected into a pre-constructed attack group detection model, and acquiring whether the file to be detected is from a certain attack group and a second detection result of a corresponding confidence coefficient.

In the embodiment of the invention, the attack group detection model is a model which is obtained by taking the characteristic information of the known malicious file and the label information of the known malicious file as sample data and adopting a machine learning mode to train and is used for obtaining whether the file to be detected is from a certain attack group and the detection result of the corresponding confidence coefficient.

In the embodiment of the invention, the attack group detection model is obtained based on random forest method training. In other embodiments of the present invention, the attack group detection model may also be generated by other methods, such as a J48 decision tree algorithm, a bayesian classification algorithm, an adaboost algorithm, and the like.

The malicious file detection method based on the model provided by the embodiment of the invention can analyze various types of files to be detected, extract characteristic information with various dimensions and abundant types from the files to be detected, and input the characteristic information into the malicious family detection model and the attack group detection model, thereby realizing the detection of whether the files to be detected belong to a certain type of malicious family and whether the files to be detected are from a certain attack group.

and comparing the first detection result obtained by adopting the malicious family detection model with the third detection result obtained by adopting the malicious family detection IBK model, and obtaining the final confidence coefficient of the malicious family detection result according to the comparison result.

In the embodiment of the invention, the malicious family detection IBK model is a model which is obtained by taking the characteristic information of known malicious files and the label information of the known malicious files as sample data and adopting IBK classification algorithm training and is used for obtaining whether the files to be detected belong to a certain malicious family and the detection result of the corresponding confidence coefficient.

It should be noted that the malicious family detection model and the malicious family detection IBK model use the same training data, and the known malicious files used in training the malicious family detection model are the same as the known malicious files used in training the malicious family detection IBK model, including the same file feature information and the same file tag information.

When the first detection result is compared with the third detection result, the conclusions as to whether the file to be detected belongs to a certain malicious family are compared first. If the conclusions differ, the conclusion in the first detection result is still taken as the final conclusion, but the confidence in the first detection result is reduced, and the reduced confidence is taken as the final confidence. If the conclusions are the same, the final confidence is calculated from the confidence in the first detection result and the confidence in the third detection result. For example, the difference between the confidence in the first detection result and the confidence in the third detection result is computed, and the result is used as the final confidence.
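The comparison logic above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say by how much the confidence is reduced when the conclusions differ, so the `penalty` factor of 0.5 is hypothetical, and the agreeing case follows the text's difference example literally (taken here as an absolute value, which is also an assumption). The function name `final_confidence` is invented.

```python
def final_confidence(first, third, penalty=0.5):
    """Combine the first detection result with the IBK (third) result.

    `first` and `third` are (family, confidence) pairs.
    `penalty` is an assumed reduction factor; the text gives no value.
    """
    family_1, conf_1 = first
    family_3, conf_3 = third
    if family_1 != family_3:
        # Conclusions differ: keep the first conclusion, lower its confidence.
        return family_1, conf_1 * penalty
    # Conclusions agree: the text's example takes the difference of the two
    # confidences; other combination rules could be substituted here.
    return family_1, abs(conf_1 - conf_3)


disagree = final_confidence(("adwind", 0.98), ("xtreme", 0.60))
agree = final_confidence(("adwind", 0.98), ("adwind", 0.90))
```

The same function shape would apply to the attack group case, with the second and fourth detection results in place of the first and third.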

According to the malicious file detection method based on the model, the confidence coefficient of the malicious family detection result is adjusted through the IBK classification algorithm, so that the finally obtained confidence coefficient is more accurate.

and comparing the second detection result obtained by adopting the attack group detection model with the fourth detection result obtained by adopting the attack group detection IBK model, and obtaining the final confidence coefficient of the attack group detection result according to the comparison result.

In the embodiment of the invention, the attack group detection IBK model is a model which takes the characteristic information of known malicious files and the label information of the known malicious files as sample data and is obtained by adopting IBK classification algorithm training to obtain whether the files to be detected are from a certain attack group and the detection result of the corresponding confidence coefficient. Wherein the tag information of the known malicious file comprises information of which attack group the known malicious file belongs to.

It should be noted that the attack group detection model and the attack group detection IBK model use the same training data, and the known malicious files used in training the attack group detection model are the same as the known malicious files used in training the attack group detection IBK model, including the same file feature information and the same file tag information.

When the second detection result is compared with the fourth detection result, the conclusions as to whether the file to be detected is from a certain attack group are compared first. If the conclusions differ, the conclusion in the second detection result is still taken as the final conclusion, but the confidence in the second detection result is reduced, and the reduced confidence is taken as the final confidence. If the conclusions are the same, the final confidence is calculated from the confidence in the second detection result and the confidence in the fourth detection result. For example, the difference between the confidence in the second detection result and the confidence in the fourth detection result is computed, and the result is used as the final confidence.

According to the malicious file detection method based on the model, the confidence coefficient of the attack group detection result is adjusted through the IBK classification algorithm, so that the finally obtained confidence coefficient is more accurate.

Based on any one of the above embodiments, in another embodiment of the present invention, the method for detecting a malicious file based on a model further includes:

Step S1-1, acquiring a plurality of known malicious files.

In the embodiment of the present invention, a known malicious file refers to a file that has been determined to be a malicious file and whose tag information (such as which malicious family it belongs to) is known.

The known malicious file can be a malicious file detected by the model-based malicious file detection method provided by the embodiment of the invention at a certain previous time; or malicious files detected by other malicious file detection methods in the prior art.

Step S1-2, analyzing the acquired multiple known malicious files to obtain the characteristic information of the known malicious files.

The specific implementation process of analyzing the known malicious file to obtain the feature information of the known malicious file is not substantially different from the specific implementation process of analyzing the file to be detected described in the previous embodiment of the present invention, and therefore, a repeated description is not provided herein.

The characteristic information (such as characteristic values) of the known malicious file can reflect the characteristics of the malicious family to which the known malicious file belongs.

Step S1-3, training in a machine learning mode by using the characteristic information of the known malicious files and the label information of the known malicious files as sample data, and generating a malicious family detection model for acquiring whether the file to be detected belongs to a certain malicious family and a detection result of a corresponding confidence coefficient.

In the embodiment of the invention, a random forest method is adopted to train the malicious family detection model. In the training process, the feature vector set (the set of feature vectors formed from the characteristic information of the known malicious files) is resampled according to the optimal weight determined from multiple statistical tests, the feature vectors are downsampled to the optimal dimension determined from multiple statistical tests, and a random tree is generated from the resulting feature vector subset (a part of the feature vector set). This process is repeated N times (for example, 200 times) to generate N random trees, which together form a random forest model; this random forest model is the trained malicious family detection model.
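The training loop described above can be illustrated with a small, standard-library-only Python sketch. It is a simplification under several assumptions: the rows are resampled uniformly with replacement and the feature dimensions are chosen uniformly at random per tree, because the text does not specify how the optimal weight and optimal dimension are determined; splits use Gini impurity; and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, feature_ids, depth=0, max_depth=5):
    """Grow one tree that may split only on the given feature subset."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority  # leaf node: a plain label
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    grow = lambda idx: build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                  feature_ids, depth + 1, max_depth)
    return (f, t, grow(left), grow(right))  # internal node


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(rows, labels, n_trees=200, n_features=2, seed=0):
    """Repeat resampling plus feature subsetting n_trees times."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # resample vectors
        feats = rng.sample(range(len(rows[0])), n_features)   # subsample dimensions
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], feats))
    return forest


# Toy data: two invented families separated by the first two features.
rows = [[1, 10, 0], [2, 11, 1], [1, 12, 0], [9, 1, 1], [8, 2, 0], [9, 0, 1]]
labels = ["a", "a", "a", "b", "b", "b"]
forest = train_forest(rows, labels, n_trees=50)
votes_a = Counter(predict_tree(tree, [1, 11, 0]) for tree in forest)
votes_b = Counter(predict_tree(tree, [9, 0, 1]) for tree in forest)
```

Each tree's vote can then be counted as described for the confidence calculation. In practice an established library implementation would normally be preferred over a hand-written forest.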

Because the model-based malicious file detection method provided by the embodiment of the invention needs to support diverse file types, and the feature vectors formed from the features extracted from the files have high dimensionality, the malicious family detection model trained by the random forest method showed the best detection effect in multiple comparison tests.

In the embodiment of the invention, the characteristic information of the known malicious file and the label information of the known malicious file are used as samples, and the corresponding relation between the characteristic information of the file and the malicious family type can be established by training in a machine learning mode. When the malicious family is detected, whether the file to be detected belongs to a certain malicious family or not and a result of corresponding confidence coefficient can be obtained by processing the characteristic information of the file to be detected. The confidence is generated in the malicious family detection process, and the confidence generation process is exemplified in the description taking the random forest method as an example. According to the malicious file detection method based on the model, provided by the embodiment of the invention, the malicious family detection model is created through the analysis of the known malicious files, so that the automatic detection of the malicious family to which the malicious files belong is realized.

Based on any one of the above embodiments, in a further embodiment of the present invention, the method for detecting a malicious file based on a model further includes:

In the embodiment of the invention, the characteristic information of the known malicious file and the label information of the known malicious file are used as samples, and the corresponding relation between the characteristic information of the file and the malicious family type can be established by training through an IBK classification algorithm. When the malicious family is detected, whether the file to be detected belongs to a certain malicious family or not and a result of corresponding confidence coefficient can be obtained by processing the characteristic information of the file to be detected. The confidence is generated in the malicious family detection process, and the confidence generation process is exemplified in the description taking the random forest method as an example.
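IBK commonly refers to an instance-based k-nearest-neighbor classifier. The following Python sketch shows one way such a classifier could return a family and a confidence. Using the share of the k nearest neighbors that carry the winning label as the confidence is an assumption; the text does not state how the IBK confidence is computed. The function name `ibk_predict`, the value k=3, and the toy data are hypothetical.

```python
import math
from collections import Counter


def ibk_predict(train_vectors, train_labels, query, k=3):
    """Return (label, confidence) from the k nearest training vectors.

    Distance is Euclidean; confidence is the fraction of the k neighbors
    that carry the winning label (an assumed definition).
    """
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    neighbors = [label for _, label in ranked[:k]]
    label, count = Counter(neighbors).most_common(1)[0]
    return label, count / len(neighbors)


# Toy training data with two invented families.
train_vectors = [[1, 10], [2, 11], [1, 12], [9, 1], [8, 2]]
train_labels = ["a", "a", "a", "b", "b"]
near_a = ibk_predict(train_vectors, train_labels, [1, 11])
near_b = ibk_predict(train_vectors, train_labels, [8, 1])
```

Because the classifier stores the training samples rather than building trees, it gives an independent second opinion on the same training data.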

According to the malicious file detection method based on the model, provided by the embodiment of the invention, the malicious family detection IBK model is created through the analysis of the known malicious file, and the confidence coefficient of another malicious family detection result can be obtained by using the model, so that the confidence coefficient of the detection result generated by the malicious family detection model can be adjusted, and the finally obtained confidence coefficient is more accurate.

Step S2-1, acquiring a plurality of known malicious files.

In the embodiment of the present invention, a known malicious file refers to a file that has been determined to be a malicious file and whose tag information (such as which attack group it comes from) is known.

Step S2-2, analyzing the acquired multiple known malicious files to obtain the characteristic information of the known malicious files.

The characteristic information (such as characteristic value) of the known malicious files can reflect the characteristics of the attack group from which the files are originated.

Step S2-3, training in a machine learning mode by using the characteristic information of the known malicious files and the label information of the known malicious files as sample data, and generating an attack group detection model for acquiring whether the file to be detected is from a certain attack group and a detection result of a corresponding confidence coefficient.

In the embodiment of the invention, a random forest method is adopted to train the attack group detection model. In the training process, the feature vector set is resampled according to the optimal weight determined from multiple statistical tests, the feature vectors are downsampled to the optimal dimension determined from multiple statistical tests, and a random tree is generated from the resulting feature vector subset. This process is repeated N times (for example, 200 times) to generate N random trees, which together form a random forest model; this random forest model is the trained attack group detection model.

Because the model-based malicious file detection method provided by the embodiment of the invention needs to support diverse file types, and the feature vectors formed from the features extracted from the files have high dimensionality, the attack group detection model trained by the random forest method showed the best detection effect in multiple comparison tests.

In the embodiment of the invention, the characteristic information of the known malicious file and the label information of the known malicious file are used as samples, and the corresponding relation between the characteristic information of the file and the attack group from which the file originates can be established by training in a machine learning mode. When the attack group is detected, whether the file to be detected is from a certain attack group and the result of the corresponding confidence coefficient can be obtained by processing the characteristic information of the file to be detected. The confidence is generated in the attack group identification process, and the confidence generation process is exemplified in the previous description taking the random forest method as an example.

According to the malicious file detection method based on the model, provided by the embodiment of the invention, the attack group detection model is created through the analysis of the known malicious files, so that the automatic detection of the attack group from which a malicious file originates is realized.

In the embodiment of the invention, the characteristic information of the known malicious file and the label information of the known malicious file are used as samples, and the corresponding relation between the characteristic information of the file and the attack group from which the file originates can be established by training through an IBK classification algorithm. When the attack group is detected, whether the file to be detected is from a certain attack group and the result of the corresponding confidence coefficient can be obtained by processing the characteristic information of the file to be detected. The confidence is generated in the attack group identification process, and the confidence generation process is exemplified in the previous description taking the random forest method as an example.

According to the malicious file detection method based on the model, provided by the embodiment of the invention, the attack group detection IBK model is created through the analysis of the known malicious files, and the confidence coefficient of another attack group detection result can be obtained by using the model, so that the adjustment of the confidence coefficient of the detection result generated by the attack group detection model can be realized, and the finally obtained confidence coefficient is more accurate.

Fig. 3 is a schematic diagram of a malicious file detection apparatus based on a model according to an embodiment of the present invention, and as shown in fig. 3, the malicious file detection apparatus based on a model according to an embodiment of the present invention includes:

the to-be-detected file acquisition module 301 is used for acquiring a to-be-detected file;

the to-be-detected file analysis module 302 is configured to analyze the to-be-detected file to obtain characteristic information of the to-be-detected file; the characteristic information comprises dynamic behavior information and static file information of the subfiles at each level contained in the file to be detected, and the hierarchical relationship among the subfiles at all levels contained in the file to be detected; the dynamic behavior information is information generated while the subfiles are executed, and the static file information is information obtained by statically analyzing the subfiles in the non-executed state;

the malicious family detection module 303 is configured to input the feature information of the to-be-detected file into a pre-constructed malicious family detection model, and acquire whether the to-be-detected file belongs to a certain malicious family and a first detection result of a corresponding confidence degree.

The malicious file detection device based on the model provided by the embodiment of the invention can analyze various types of files to be detected, extract characteristic information with various dimensions and abundant types from the files to be detected, and input the characteristic information into the malicious family detection model, thereby realizing the detection of whether the files to be detected belong to a certain type of malicious family.
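The module structure of the device can be pictured as a simple chain of three components. The following Python sketch is illustrative only: the class name `MaliciousFileDetector`, the callable-based interfaces, and the stub implementations are all assumptions and do not reflect how modules 301 to 303 are actually implemented.

```python
from typing import Callable, List, Tuple

Features = List[float]
Detection = Tuple[str, float]  # (malicious family, confidence)


class MaliciousFileDetector:
    """Chains acquisition (301), analysis (302) and family detection (303)."""

    def __init__(self,
                 acquire: Callable[[str], bytes],
                 analyze: Callable[[bytes], Features],
                 family_model: Callable[[Features], Detection]) -> None:
        self.acquire = acquire
        self.analyze = analyze
        self.family_model = family_model

    def detect(self, source: str) -> Detection:
        data = self.acquire(source)         # module 301: obtain the file
        features = self.analyze(data)       # module 302: extract features
        return self.family_model(features)  # module 303: first detection result


# Stub components standing in for real acquisition, analysis and a model.
detector = MaliciousFileDetector(
    acquire=lambda source: source.encode("utf-8"),
    analyze=lambda data: [float(len(data))],
    family_model=lambda features: ("adwind", 0.98) if features[0] > 3 else ("unknown", 0.0),
)
result = detector.detect("sample.bin")
```

Passing the components in as callables keeps each module replaceable, for example to add an attack group detection model alongside the malicious family detection model.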

Based on any one of the above embodiments, in an embodiment of the present invention, the malicious file detection apparatus based on a model further includes:

The malicious file detection device based on the model provided by the embodiment of the invention can analyze various types of files to be detected, extracts characteristic information with various dimensions and abundant types from the files to be detected, inputs the characteristic information into the malicious family detection model and the attack group detection model, and realizes the detection of whether the files to be detected belong to a certain type of malicious family and whether the files to be detected are from a certain attack group.

Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor 410, a communication interface 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with each other via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform the following method: acquiring a file to be detected; analyzing the file to be detected to obtain characteristic information of the file to be detected; and inputting the characteristic information of the file to be detected into a pre-constructed malicious family detection model, and acquiring whether the file to be detected belongs to a certain malicious family and a first detection result of a corresponding confidence coefficient.

In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the method provided by the foregoing embodiments, for example, including: acquiring a file to be detected; analyzing the file to be detected to obtain characteristic information of the file to be detected; and inputting the characteristic information of the file to be detected into a pre-constructed malicious family detection model, and acquiring whether the file to be detected belongs to a certain malicious family and a first detection result of a corresponding confidence coefficient.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A malicious file detection method based on a model is characterized by comprising the following steps:

acquiring a file to be detected;

2. The model-based malicious file detection method according to claim 1, further comprising:

inputting the characteristic information of the file to be detected into a pre-constructed attack group detection model, and acquiring whether the file to be detected is from a certain attack group and a second detection result of a corresponding confidence coefficient.

3. The model-based malicious file detection method according to claim 1, further comprising:

4. The model-based malicious file detection method according to claim 2, further comprising:

5. The model-based malicious file detection method according to claim 1, wherein the malicious family detection model is created by:

6. The model-based malicious file detection method according to claim 3, wherein the malicious family detection IBK model is created by:

7. The model-based malicious file detection method according to claim 2, wherein the attack group detection model is created by:

8. The model-based malicious file detection method according to claim 4, wherein the attack group detection IBK model is created by:

9. The model-based malicious file detection method according to claim 1, wherein the analyzing the file to be detected to obtain the feature information of the file to be detected comprises:

analyzing the fixed items of the subfiles at each level of the file to be detected to obtain the static file information of the subfiles; wherein a fixed item is an item in the subfile that does not depend on whether the subfile is run;

10. The model-based malicious file detection method according to claim 1, wherein the malicious family detection model is obtained by random forest training.

11. The model-based malicious file detection method according to claim 2, wherein the attack group detection model is obtained by random forest training.

12. A model-based malicious file detection apparatus, comprising:

13. The model-based malicious file detection apparatus according to claim 12, further comprising:

14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the model-based malicious file detection method according to any of claims 1 to 11.

15. A non-transitory computer readable storage medium, having stored thereon a computer program, which, when being executed by a processor, carries out the steps of the model-based malicious file detection method according to any one of claims 1 to 11.

16. A computer program product comprising computer executable instructions for carrying out the steps of the model-based malicious file detection method according to any one of claims 1 to 11 when executed.