CN112395612A - Malicious file detection method and device, electronic equipment and storage medium

Malicious file detection method and device, electronic equipment and storage medium

Info

Publication number: CN112395612A
Application number: CN201910755713.1A
Authority: CN (China)
Prior art keywords: sample, behavior, target, file, api
Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 程强
Current Assignee / Original Assignee: ZTE Corp
Application filed by ZTE Corp
Priority to CN201910755713.1A (CN112395612A)
Priority to PCT/CN2020/108614 (WO2021027831A1)
Publication of CN112395612A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55: Detecting local intrusion or implementing counter-measures
    • G06F 21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F 21/562: Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application discloses a malicious file detection method and device, electronic equipment and a storage medium, relating to the technical field of network security. The malicious file detection method comprises the following steps: encoding the API behaviors and API behavior parameters of an obtained target file to obtain a target coding set corresponding to the target file; vectorizing the target coding set to obtain a target behavior vector; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in a black and white sample set; and if so, determining the malicious category of the target file according to the distances between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set. The disclosed method and device retain behavior characteristics, improve the richness of the training input, and reduce the false alarm rate of the machine learning model.

Description

Malicious file detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a malicious file detection method and apparatus, an electronic device, and a storage medium.
Background
Major network security incidents such as Operation Aurora, the Stuxnet attack, the Night Dragon attack, and the theft of RSA token seeds have brought into public view a type of attack characterized by advanced attack techniques, long duration, and clearly defined targets, internationally known as the Advanced Persistent Threat (APT) attack. Such attacks not only use traditional viruses and Trojans as attack means, but also launch preliminary "pilot attacks" through social engineering channels such as e-mail, sending the user a carefully constructed file that exploits a 0-day vulnerability. Once the user opens the file, the vulnerability is triggered, attack code is injected into the user's system, and follow-up operations such as downloading further viruses and Trojans are carried out to enable long-term covert operation. Traditional firewalls, enterprise antivirus software and the like have very limited ability to detect and defend against such malicious files or code that carry no known signatures.
APT attack detection and defense has become a research hotspot of the new generation of network security, and its key technical difficulty is how to quickly detect attacks that exploit unknown vulnerabilities. A series of studies at home and abroad has produced many methods, the representative one being dynamic behavior analysis based on files or samples. This technique mainly targets the malicious-code implantation stage of an APT attack: it dynamically analyzes the behavior of suspicious sample files entering the protected system in controllable environments such as sandboxes and virtual machines, identifies malicious behaviors and attack code, prevents malicious code from being implanted, and blocks subsequent destructive actions. Because it can detect and protect before the attack enters the network, it avoids the attack's impact on the protected system. The judgment of whether a code file is malicious depends on a behavior feature library that stores malicious behavior features extracted through manual code analysis; the update speed and accuracy of the library's rules determine the success rate of malicious code detection.
Because malicious code mutates rapidly, machine learning has been applied: models are trained on large numbers of malicious samples to learn the behavior patterns of malware and automatically judge whether software is malicious, so as to meet practical detection needs. However, traditional machine learning methods rely on the correlation of sample distributions, so imbalanced samples lead to poor detection accuracy; they are also highly sensitive to numerical data types and have difficulty distinguishing and identifying behavior features that carry semantics.
Disclosure of Invention
In order to solve the above technical problem, the embodiment of the present application is implemented as follows:
the embodiment of the application provides a malicious file detection method, which comprises the following steps:
coding the API behavior and API behavior parameters of the obtained target file to obtain a target coding set corresponding to the target file;
vectorizing the target coding set to obtain a target behavior vector;
determining whether the target file is a malicious file or not according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set;
if so, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
Optionally, the encoding the API behavior and the API behavior parameter of the obtained target file to obtain a target encoding set corresponding to the target file includes:
coding the API behavior to obtain a first coding set;
coding the API behavior parameters to obtain a second coding set;
and carrying out unified dimension combination on the first encoding set and the second encoding set to obtain the normalized target encoding set.
Optionally, the API behavior parameter is a directory path, and the encoding the API behavior parameter to obtain a second encoding set includes:
performing directory layering on the API behavior parameters;
encoding the API behavior parameters after the directory layering to obtain a second encoding set;
when the path length of the API behavior parameter exceeds a preset length, the path length of the API behavior parameter is adjusted to the preset length, and then directory layering is carried out.
Optionally, the encoding the API behavior to obtain a first encoding set includes:
hexadecimal coding is carried out on the API behavior to obtain the first coding set with preset coding length;
the encoding the API behavior parameters to obtain a second encoding set includes:
performing hash coding on the API behavior parameters to obtain the second coding set;
the performing unified dimension combination on the first encoding set and the second encoding set to obtain the normalized target encoding set includes:
and converting the codes in the second code set into hexadecimal codes, and combining the codes in the first code set with the converted codes in the second code set in one-to-one correspondence to obtain the target code set.
Optionally, the determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black and white sample set includes:
calculating a first average distance between the target behavior vector and a sample behavior vector corresponding to a black sample in the black and white sample set;
calculating a second average distance between the target behavior vector and a sample behavior vector corresponding to a white sample in the black and white sample set;
and when the first average distance is greater than or equal to the second average distance, judging that the target file is a malicious file.
Optionally, the determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set includes:
calculating a third average distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set;
when there is a third average distance that does not exceed a preset critical value, selecting the malicious category of the black sample corresponding to the minimum of the third average distances as the malicious category of the target file;
otherwise, dividing the malicious category of the target file into a new malicious category.
Optionally, the method further includes:
and acquiring the API behavior and the API behavior parameters after the external analysis engine operates the target file.
Optionally, the API behavior is loading a system DLL file, writing a temporary file, or modifying a registry.
Optionally, the method further includes:
acquiring sample API behaviors and sample API behavior parameters of sample files in a training sample set, wherein the sample files comprise black sample files and white sample files, the black sample files comprise at least one of viruses, Trojans, worms and ransomware, and the white sample files are normal files;
coding the acquired sample API behaviors and the sample API behavior parameters to obtain a sample coding set corresponding to the training sample set;
determining the weight corresponding to each code according to the frequency of the same sample file and different sample files corresponding to each code in the sample code set;
and vectorizing the sample code set corresponding to the sample file in the training sample set according to the weight corresponding to each code to obtain the sample behavior vector in the black and white sample set.
Optionally, the type of the sample file is a PE file, a PDF file, or a text file.
An embodiment of the present application further provides a malicious file detection apparatus, where the malicious file detection apparatus includes:
the encoding module is used for encoding the API behavior and the API behavior parameters of the obtained target file to obtain a target encoding set corresponding to the target file;
the vectorization module is used for vectorizing the target coding set to obtain a target behavior vector;
the determining module is used for determining whether the target file is a malicious file or not according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set; and
if so, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
The embodiment of the application also provides an electronic device, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the bus;
a memory for storing a computer program;
a processor for executing the program stored in the memory to perform any of the method steps described above.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the method steps described above.
The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:
the scheme provided by the embodiment of the application reserves behavior characteristics, improves the richness of training input and reduces the false alarm rate of the machine learning model.
According to the scheme provided by the embodiment of the application, the maliciousness of the target file can be distinguished, and meanwhile the type of the target file can be identified through the distance and a new type of malicious file can be found.
The scheme provided by the embodiment of the application has good expansibility for supporting the file types, and compared with the traditional scheme which only supports an executable PE file analysis model, the scheme provided by the embodiment of the application also supports other types of files such as Word, PDF and the like.
Compared with a deep learning network malicious file detection method, the method and the device for detecting the malicious files in the network have the advantages that complexity is high, weight and parameter value adjustment is reduced, and dependence of a behavior model on sample distribution based on behavior number statistics is improved.
In addition, the scheme provided by the embodiment of the application aims at the sample imbalance and has a better generalization effect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a malicious file detection method according to a preferred embodiment of the present application.
Fig. 2 is a flowchart of another malicious file detection method according to a preferred embodiment of the present application.
Fig. 3 is a block diagram illustrating an electronic device according to a preferred embodiment of the present application.
Fig. 4 is a block diagram illustrating a malicious file detection apparatus according to a preferred embodiment of the present disclosure.
Icon: 100-an electronic device; 110-a processor; 120-internal bus; 130-a network interface; 140-a memory; 150-malicious file detection apparatus; 151-an encoding module; 152-a vectorization module; 153-a determination module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Please refer to fig. 1, which is a flowchart illustrating a malicious file detection method according to an embodiment of the present disclosure, where the malicious file detection method is applied to an electronic device for detecting malicious files such as viruses, Trojans, worms, and ransomware. The flow shown in fig. 1 will be explained in detail below.
Step S101, the API behavior and the API behavior parameters of the obtained target file are coded to obtain a target coding set corresponding to the target file.
In the embodiment of the present application, the API behavior may be, but is not limited to, starting a process, loading a system DLL file, writing a temporary file, or modifying a registry in a system such as Windows, Linux or Unix. An API behavior parameter is a parameter carried by the call, such as a directory path. The type of the target file may be, but is not limited to, a PE file, a PDF file, a text file, and the like. The API behaviors of the same target file correspond one-to-one with the API behavior parameters.
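As an illustration only (not part of the patent), a minimal Python sketch of how such a behavior trace could be represented; the class and field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ApiBehavior:
    """One captured record: an API action and its single corresponding parameter."""
    action: str  # e.g. "load_dll", "write_temp_file", "modify_registry"
    param: str   # the parameter carried by the call, e.g. a directory path

# A target file's trace is the list of such records, each action paired
# one-to-one with its parameter, as described above.
trace = [
    ApiBehavior("load_dll", "c:/windows/system32/kernel32.dll"),
    ApiBehavior("modify_registry", "hklm/software/microsoft/windows/run"),
]
```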
Before the API behavior and API behavior parameters of the target file are encoded, the target file to be detected is first run in an external analysis engine, which may be, but is not limited to, a sandbox, a virtual machine, and the like. After the target file has been run, its API behaviors and API behavior parameters are acquired, encoded and combined with unified dimensions to obtain the target coding set corresponding to the target file.
During coding, each API behavior and its corresponding API behavior parameter are coded separately, and the resulting codes are then combined with unified dimensions. Specifically, the API behavior is hexadecimally coded with a preset coding length to obtain a first coding set of preset length. At the same time, the API behavior parameters are coded to obtain a second coding set. The codes in the second coding set are then converted into hexadecimal codes in the same format as the codes in the first coding set, and the codes in the first coding set are combined one-to-one with the converted codes of the second coding set to obtain the normalized target coding set.
In the embodiment of the present application, hexadecimal coding is used for API behavior coding, and it is understood that in some other embodiments, binary, octal, or decimal coding may also be used. When encoding is performed, if the encoding modes adopted by the encoding in the first encoding set are different from the encoding in the second encoding set, the encoding in the second encoding set needs to be converted into the encoding of the same type as the encoding in the first encoding set, or the encoding in the first encoding set needs to be converted into the encoding of the same type as the encoding in the second encoding set. If the coding in the first code set is the same as the coding in the second code set, no conversion is necessary.
The API behavior parameters may be encoded by, but not limited to, hashing, and the like, and in this embodiment, hash encoding is used for encoding the API behavior parameters.
For convenience of illustration, assume the target file corresponds to one API behavior and one API behavior parameter: if the first code set corresponding to the API behavior contains the hexadecimal code 0200 and the second code set corresponding to the API behavior parameter contains the decimal code 67574613, the latter can be converted into the hexadecimal code 4071B55, so the code in the combined target code set is the hexadecimal code 02004071B55.
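A minimal Python sketch of this combination step, for illustration only; the specific hash function (zlib.crc32) is an assumption standing in for the hash coding left unspecified above:

```python
import zlib

def encode_behavior(action_code_hex: str, param: str) -> str:
    """Concatenate the fixed-length hexadecimal API-behavior code with the
    hexadecimal form of an integer hash of the API behavior parameter."""
    param_hash = zlib.crc32(param.encode("utf-8"))  # stand-in hash, an assumption
    return action_code_hex + format(param_hash, "X")

# Reproducing the worked example: decimal 67574613 converts to hexadecimal
# 4071B55, which prefixed with the behavior code 0200 gives 02004071B55.
assert format(67574613, "X") == "4071B55"
print(encode_behavior("0200", "c:/system/undefine"))
```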
Further, in order to improve the accuracy of detecting the target file, the scheme provided in the embodiment of the present application also presets a path length for the API behavior parameters: when an API behavior parameter is coded, if its path length exceeds the preset length, the path length is first adjusted to the preset length and directory layering is then performed. Adjusting the path length of the API behavior parameter can be realized by appending the fixed tail parameter undefine.
For example, if the preset longest coding path is c:/system and a certain API behavior parameter is c:/system/host/, the API behavior parameter can be adjusted to c:/system/undefine and then coded. In this way the obtained codes have uniform length, which facilitates feature extraction, avoids low feature discrimination caused by overly broad feature descriptions, and improves the accuracy of subsequent malicious file detection.
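For illustration, a small Python sketch of the path-length adjustment and directory layering described above; the depth limit and the tail marker follow the example in the text and are otherwise assumptions:

```python
def layer_path(path: str, max_layers: int = 2, tail: str = "undefine") -> list:
    """Split a directory path into layers and, if it exceeds the preset
    length, truncate it and append the fixed tail parameter so that all
    encoded paths end up with a uniform length."""
    layers = [part for part in path.strip("/").split("/") if part]
    if len(layers) > max_layers:
        layers = layers[:max_layers] + [tail]
    return layers

# Mirrors the example above: with c:/system as the preset longest coding
# path (two layers), c:/system/host/ becomes c:/system plus the fixed tail.
print(layer_path("c:/system/host/"))  # ['c:', 'system', 'undefine']
```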
Step S102, vectorizing the target coding set to obtain a target behavior vector.
In the embodiment of the application, a sample code set is established in advance from the API behaviors and API behavior parameters of the black and white samples in the black and white sample set, and each code in the sample code set corresponds to a different weight. The black samples include at least one of viruses, Trojans, worms and ransomware, and the white samples are normal files.
When the target coding set is subjected to vectorization processing, different weights are given to different codes in the target coding set according to the weight of each code, and the weight given to a code which does not appear in the sample coding set is 0, so that a target behavior vector corresponding to the target coding set can be obtained.
For example, if the target code set is {A1, A2, B1, A3, C1, A4}, the weights corresponding to A1, A2, A3 and A4 in the sample code set are a1, a2, a3 and a4 respectively, and codes B1 and C1 do not exist in the sample code set, then the target behavior vector obtained after vectorizing the target code set is (a1, a2, 0, a3, 0, a4).
It will be appreciated that in other embodiments, the weights assigned to codes not present in the sample code set may be other values, for example, the weights assigned to codes not present in the sample code set may be 1.
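A minimal Python sketch of this vectorization step, using the illustrative weights from the example above (the numeric values are placeholders):

```python
def vectorize(target_codes, weights, unseen_weight=0.0):
    """Replace each code in the target code set with its weight from the
    sample code set; codes never seen in the sample code set receive a
    fixed weight (0 in this embodiment)."""
    return [weights.get(code, unseen_weight) for code in target_codes]

# The example above: B1 and C1 are absent from the sample code set,
# so their positions in the target behavior vector become 0.
sample_weights = {"A1": 0.41, "A2": 0.12, "A3": 0.30, "A4": 0.07}  # placeholder values
print(vectorize(["A1", "A2", "B1", "A3", "C1", "A4"], sample_weights))
# [0.41, 0.12, 0.0, 0.3, 0.0, 0.07]
```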
Step S103, determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black and white sample set.
The black and white sample set comprises sample behavior vectors corresponding to the black samples and sample behavior vectors corresponding to the white samples. The black samples include at least one of viruses, Trojans, worms and ransomware, and the white samples are normal files. The sample behavior vectors are obtained by encoding and then vectorizing the API behaviors and API behavior parameters of the black and white samples in the black and white sample set, through the same process used to encode and vectorize the API behaviors and API behavior parameters of the target file.
When determining whether the target file is a malicious file, first distances between the target behavior vector and the sample behavior vectors corresponding to all black samples in the black and white sample set are calculated, as well as second distances between the target behavior vector and the sample behavior vectors corresponding to all white samples in the black and white sample set. The first distance and the second distance may be, but are not limited to, the average or the median of the respective sets of distances. In the embodiment of the present application, the first distance and the second distance are both average distances; that is, a first average distance between the target behavior vector and the sample behavior vectors corresponding to all black samples in the black and white sample set is calculated, and a second average distance between the target behavior vector and the sample behavior vectors corresponding to all white samples in the black and white sample set is calculated.
The distance between the target behavior vector and a sample behavior vector may be calculated by, but is not limited to, Euclidean distance, cosine similarity, and the like. In the embodiment of the application, the distance between the target behavior vector and the sample behavior vector is calculated by cosine similarity.
Assume the target behavior vector is Jx and the sample behavior vector corresponding to a certain black sample is Jk. The distance between the target behavior vector Jx and the sample behavior vector Jk can then be expressed as

$$d_k = \frac{J_x \cdot J_k}{\|J_x\|\,\|J_k\|}$$

Calculating the distances between the target behavior vector and the sample behavior vectors corresponding to all black samples yields a distance list [d1, d2, ... dB]. Averaging the values in the distance list gives the first average distance between the target behavior vector and the sample behavior vectors corresponding to all black samples in the black and white sample set, which can be expressed as

$$D_1 = \frac{1}{B}\sum_{i=1}^{B} d_i$$
Similarly, a second average distance between the target behavior vector and the sample behavior vectors corresponding to all white samples in the black-and-white sample set can be obtained.
The first average distance is then compared with the second average distance. If the first average distance is smaller than the second average distance, the target file is judged to be a normal file and the detection ends. If the first average distance is greater than or equal to the second average distance, the target file is judged to be a malicious file.
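A minimal Python sketch of the black/white decision, reading the cosine-similarity value as the "distance" so that a larger value means the target is closer to the black samples, which matches the greater-or-equal rule above; this interpretation of the translated text is an assumption:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def average_distance(target_vec, sample_vecs):
    """First/second average distance: the mean of the pairwise values
    between the target behavior vector and a group of sample vectors."""
    values = [cosine_similarity(target_vec, s) for s in sample_vecs]
    return sum(values) / len(values)

def is_malicious(target_vec, black_vecs, white_vecs):
    """Malicious when the first average distance (to the black samples) is
    greater than or equal to the second (to the white samples)."""
    return average_distance(target_vec, black_vecs) >= average_distance(target_vec, white_vecs)
```

If the "median" alternative mentioned above were used instead, average_distance would return the median of the values rather than their mean.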
And step S104, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
If the target file corresponding to the target behavior vector is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
Specifically, first calculate the third average distances between the target behavior vector and the sample behavior vectors corresponding to the different types of black samples (viruses, Trojans, worms and ransomware) in the black and white sample set. Denote the third average distance between the target behavior vector and the sample behavior vectors corresponding to viruses by Dα, to Trojans by Dβ, to worms by Dθ, and to ransomware by Dμ. Compare Dα, Dβ, Dθ and Dμ with a preset critical value to judge whether each of them exceeds the critical value. If Dα, Dβ, Dθ and Dμ all exceed the preset critical value, the difference between the target behavior vector and the sample behavior vectors of every type of black sample is large, and the target file corresponding to the target behavior vector is classified as a new type of malicious file other than viruses, Trojans, worms and ransomware. If one or more of Dα, Dβ, Dθ and Dμ does not exceed the preset critical value, the malicious category of the black sample corresponding to the minimum of those values is selected as the malicious category of the target file.
For example, assume the critical value is s. If s < Dα < Dβ < Dθ < Dμ, the target file is classified as a new type of malicious file other than viruses, Trojans, worms and ransomware. If Dα < Dβ < Dθ < Dμ < s, the target file is judged to be a malicious file whose category is virus.
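Continuing the sketch (and reusing average_distance from the previous block), the category decision as stated above; the family names and the critical value are illustrative, and the comparison direction follows the translated worked example:

```python
def malicious_category(target_vec, black_vecs_by_type, critical_value):
    """Compute the third average distance for each type of black sample;
    if none of them stays within the preset critical value, report a new
    malicious category, otherwise pick the type with the minimum third
    average distance."""
    third_avg = {t: average_distance(target_vec, vecs)
                 for t, vecs in black_vecs_by_type.items()}
    within = {t: d for t, d in third_avg.items() if d <= critical_value}
    if not within:
        return "new malicious category"
    return min(within, key=within.get)

# Hypothetical usage with the four family types named above:
# families = {"virus": [...], "trojan": [...], "worm": [...], "ransomware": [...]}
# print(malicious_category(target_vector, families, critical_value=0.8))
```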
Please refer to fig. 2, which is a flowchart illustrating another malicious file detection method according to an embodiment of the present disclosure. The flow shown in fig. 2 will be explained in detail below.
Step S201, obtaining sample API behaviors and sample API behavior parameters of sample files in a training sample set.
In an embodiment of the present application, the sample files include black sample files and white sample files; the black sample files include at least one of viruses, Trojans, worms and ransomware, and the white sample files are normal files. The type of a sample file may be, but is not limited to, a PE file, a PDF file, a text file, and the like.
Before a target file is detected, a training sample set used to judge whether the target file is a malicious file and to determine its malicious category needs to be established. Specifically, the sample files in the training sample set are run through an external analysis engine, and the sample API behaviors and sample API behavior parameters of each sample file are obtained. The external analysis engine may be, but is not limited to, a sandbox, a virtual machine, and the like.
Furthermore, when the sample API behaviors and sample API behavior parameters of the sample files in the training sample set are acquired, if behaviors with identical sample API behaviors and sample API behavior parameters exist (a behavior comprising a sample API behavior and its corresponding sample API behavior parameter), these identical behaviors can be merged to form a set without repetition, which effectively avoids data redundancy and reduces the amount of computation.
Step S202, the acquired sample API behaviors and sample API behavior parameters are encoded to obtain a sample encoding set corresponding to the training sample set.
Specifically, hexadecimal coding is performed on the corresponding sample API behavior for each sample file, the coding length is preset, and a third coding set with a preset length is obtained. And meanwhile, coding corresponding sample API behavior parameters to obtain a fourth coding set. And then, the codes in the fourth coding set are converted into hexadecimal codes consistent with the codes in the third coding set, and the codes in the third coding set and the codes in the converted fourth coding set are correspondingly combined one by one to obtain a normalized sample coding set.
In the embodiment of the present application, hexadecimal coding is used for the sample API behavior code, and it is understood that in some other embodiments, binary, octal, or decimal coding may also be used. When other binary codes are adopted, if the codes in the third code set and the codes in the fourth code set adopt different coding modes, the codes in the fourth code set are converted into the same type of codes as the codes in the third code set, or the codes in the third code set are converted into the same type of codes as the codes in the fourth code set. The sample API behavior parameters may be encoded by, but not limited to, hashing, and the like, and in the embodiment of the present application, the sample API behavior parameters are encoded by hashing.
Step S203, determining the weight corresponding to each code according to the frequency of the same sample file and different sample files corresponding to each code in the sample code set.
The weights corresponding to the codes can be determined by, but not limited to, the TF-IDF algorithm, the TextRank algorithm, and the like. In the embodiment of the application, the TF-IDF algorithm is adopted. Specifically, the more frequently the behavior corresponding to a code appears within the same sample file, the higher the weight given to that code; conversely, the more different sample files in which the behavior corresponding to a code appears, the lower the weight given to that code.
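A minimal Python sketch of a TF-IDF style weighting over the sample code set, for illustration; the exact TF-IDF variant, the +1 smoothing, and the way a code's weight is aggregated across sample files (here, the maximum) are assumptions:

```python
import math
from collections import Counter

def tfidf_weights(sample_code_sets):
    """sample_code_sets holds one list of codes per sample file. A code
    that is frequent inside a single file gets a higher term frequency,
    while a code appearing in many different files gets a lower inverse
    document frequency, as described above."""
    n_files = len(sample_code_sets)
    file_freq = Counter()
    for codes in sample_code_sets:
        file_freq.update(set(codes))
    weights = {}
    for codes in sample_code_sets:
        counts = Counter(codes)
        for code, count in counts.items():
            tf = count / len(codes)
            idf = math.log(n_files / file_freq[code]) + 1.0  # +1 keeps ubiquitous codes non-zero
            weights[code] = max(weights.get(code, 0.0), tf * idf)
    return weights

# Illustrative usage on three tiny sample code sets:
print(sorted(tfidf_weights([["A1", "A1", "A2"], ["A1", "A3"], ["A4"]]).items()))
```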
Step S204, the sample code set corresponding to the sample file in the training sample set is vectorized according to the weight corresponding to each code, and the sample behavior vector in the black and white sample set is obtained.
Step S205 encodes the API behavior and API behavior parameters of the obtained target file to obtain a target encoding set corresponding to the target file.
Step S206, vectorization processing is carried out on the target coding set to obtain a target behavior vector.
Step S207, determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set.
And S208, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
In summary, the malicious file detection method provided in the embodiment of the application encodes and normalizes the API behaviors and API behavior parameters of the target file and then converts them into a vector to obtain the target behavior vector of the target file, determines whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set, and, when the target file is a malicious file, determines the malicious category of the target file according to the distances between the target behavior vector and the sample behavior vectors of the different types of black samples in the black and white sample set. Because the behavior characteristics are retained, the richness of the training input is improved, the accuracy of malicious file detection can be improved, and the false alarm rate of the machine learning model is reduced. Meanwhile, according to the distances between the target behavior vector and the sample behavior vectors corresponding to the various black samples, not only can the maliciousness of the target file be judged, but its type can also be identified through the distances and new types of malicious files can be discovered. Secondly, the scheme provided by the embodiment of the application extends well to different file types: whereas the traditional scheme supports only an analysis model for executable PE files, this scheme also supports other file types such as Word and PDF. Compared with deep-learning-based malicious file detection methods, the scheme has lower complexity and requires less tuning of weights and parameter values, and it reduces the dependence of behavior models based on behavior-count statistics on sample distribution. The scheme also generalizes better in the presence of sample imbalance. In addition, the method provided by the embodiment of the application makes the obtained codes uniform in length, which facilitates feature extraction, avoids low feature discrimination caused by overly broad feature descriptions, and further improves the accuracy of malicious file detection. Finally, when the training sample set is established, behaviors with identical sample API behaviors and sample API behavior parameters are merged into a set without repetition, which effectively avoids data redundancy and reduces the amount of computation.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 3 is a block diagram illustrating an electronic device 100 according to an embodiment of the present application. Referring to fig. 3, at the hardware level, the electronic device 100 includes a processor 110, and optionally further includes an internal bus 120, a network interface 130, and a memory 140. The memory 140 may include an internal memory, such as a Random-Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device 100 may also include hardware required for other services.
The processor 110, the network interface 130, and the memory 140 may be connected to each other by an internal bus 120, and the internal bus 120 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 3, but this does not indicate only one bus or one type of bus.
And a memory 140 for storing programs. In particular, the program may include program code comprising computer operating instructions. The memory 140 may include memory and non-volatile storage and provides instructions and data to the processor 110.
The processor 110 reads a corresponding computer program from the non-volatile memory into the memory and then runs the computer program, thereby forming the malicious file detection apparatus 150 on a logical level. The processor 110 executes the program stored in the memory 140, and is specifically configured to perform the following operations:
vectorizing conversion is carried out on the API behavior and the API behavior parameters of the obtained target file to obtain a target behavior vector corresponding to the target file; determining whether the target file is a malicious file or not according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set; and when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
The method performed by the malicious file detection apparatus 150 according to the embodiment shown in fig. 3 of the present application may be applied to the processor 110, or implemented by the processor 110. The processor 110 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 110. The Processor 110 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 140, and the processor 110 reads the information in the memory 140 and completes the steps of the method in combination with the hardware thereof.
The electronic device 100 may also execute the methods shown in fig. 1 and fig. 2, and implement the functions of the malicious file detection apparatus 150 in the embodiments shown in fig. 1 and fig. 2, which are not described herein again in this embodiment of the present application.
Of course, besides the software implementation, the electronic device 100 of the present application does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.
Embodiments of the present application also propose a computer-readable storage medium storing one or more programs, where the one or more programs include instructions which, when executed by a portable electronic device including a plurality of application programs, cause the portable electronic device to perform the method of the embodiments shown in fig. 1 and fig. 2, and specifically to perform the following operations:
vectorizing conversion is carried out on the API behavior and the API behavior parameters of the obtained target file to obtain a target behavior vector corresponding to the target file; determining whether the target file is a malicious file or not according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set; and when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
Fig. 4 is a block diagram illustrating a malicious file detection apparatus 150 according to an embodiment of the present disclosure. Referring to fig. 4, in a software implementation, the malicious file detection apparatus 150 may include:
and an encoding module 151, configured to encode the API behavior and the API behavior parameters of the obtained target file to obtain a target encoding set corresponding to the target file.
It is understood that the encoding module 151 may be configured to perform the step S101 or the step S205.
The vectorization module 152 is configured to perform vectorization processing on the target encoding set to obtain a target behavior vector.
It is understood that the vectorization module 152 may be configured to perform the step S102 or the step S206.
The determining module 153 is configured to determine whether the target file is a malicious file according to a distance between the target behavior vector and a sample behavior vector in the black and white sample set. And when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
It is understood that the determining module 153 may be configured to perform the steps S103 and S104 or the steps S207 and S208 described above.
In short, the above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (13)

1. A malicious file detection method, comprising:
coding the API behavior and API behavior parameters of the obtained target file to obtain a target coding set corresponding to the target file;
vectorizing the target coding set to obtain a target behavior vector;
determining whether the target file is a malicious file or not according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set;
if so, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
2. The method according to claim 1, wherein the encoding the API behavior and the API behavior parameters of the obtained target file to obtain a target encoding set corresponding to the target file includes:
coding the API behavior to obtain a first coding set;
coding the API behavior parameters to obtain a second coding set;
and carrying out unified dimension combination on the first encoding set and the second encoding set to obtain the normalized target encoding set.
3. The method of claim 2, wherein the API behavior parameter is a directory path, and wherein encoding the API behavior parameter to obtain a second encoding set comprises:
performing directory layering on the API behavior parameters;
encoding the API behavior parameters after the directory layering to obtain a second encoding set;
when the path length of the API behavior parameter exceeds a preset length, the path length of the API behavior parameter is adjusted to the preset length, and then directory layering is carried out.
4. The method of claim 2, wherein encoding the API behavior to obtain a first set of codes comprises:
hexadecimal coding is carried out on the API behavior to obtain the first coding set with preset coding length;
the encoding the API behavior parameters to obtain a second encoding set includes:
performing hash coding on the API behavior parameters to obtain the second coding set;
the performing unified dimension combination on the first encoding set and the second encoding set to obtain the normalized target encoding set includes:
and converting the codes in the second code set into hexadecimal codes, and combining the codes in the first code set with the converted codes in the second code set in one-to-one correspondence to obtain the target code set.
5. The method of claim 1, wherein determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black and white sample set comprises:
calculating a first average distance between the target behavior vector and a sample behavior vector corresponding to a black sample in the black and white sample set;
calculating a second average distance between the target behavior vector and a sample behavior vector corresponding to a white sample in the black and white sample set;
and when the first average distance is greater than or equal to the second average distance, judging that the target file is a malicious file.
6. The method according to claim 1, wherein the determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set comprises:
calculating a third average distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set;
when there is a third average distance that does not exceed a preset critical value, selecting the malicious category of the black sample corresponding to the minimum of the third average distances as the malicious category of the target file;
otherwise, dividing the malicious category of the target file into a new malicious category.
7. The method of claim 1, further comprising:
and acquiring the API behavior and the API behavior parameters after the external analysis engine operates the target file.
8. The method of claim 1, wherein the API behavior is loading a system DLL file, writing a temporary file, or modifying a registry.
9. The method of claim 1, further comprising:
acquiring sample API behaviors and sample API behavior parameters of sample files in a training sample set, wherein the sample files comprise black sample files and white sample files, the black sample files comprise at least one of viruses, Trojans, worms and ransomware, and the white sample files are normal files;
coding the acquired sample API behaviors and the sample API behavior parameters to obtain a sample coding set corresponding to the training sample set;
determining the weight corresponding to each code according to the frequency of the same sample file and different sample files corresponding to each code in the sample code set;
and vectorizing the sample code set corresponding to the sample file in the training sample set according to the weight corresponding to each code to obtain the sample behavior vector in the black and white sample set.
10. The method of claim 9, wherein the sample file is of a type that is a PE file, a PDF file, or a text file.
11. A malicious file detection apparatus, comprising:
the encoding module is used for encoding the API behavior and the API behavior parameters of the obtained target file to obtain a target encoding set corresponding to the target file;
the vectorization module is used for vectorizing the target coding set to obtain a target behavior vector;
the determining module is used for determining whether the target file is a malicious file or not according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set; and
if so, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
12. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the bus;
a memory for storing a computer program;
a processor for executing a program stored in the memory to perform the method steps of any of claims 1 to 10.
13. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 10.
CN201910755713.1A 2019-08-15 2019-08-15 Malicious file detection method and device, electronic equipment and storage medium Pending CN112395612A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910755713.1A CN112395612A (en) 2019-08-15 2019-08-15 Malicious file detection method and device, electronic equipment and storage medium
PCT/CN2020/108614 WO2021027831A1 (en) 2019-08-15 2020-08-12 Malicious file detection method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910755713.1A CN112395612A (en) 2019-08-15 2019-08-15 Malicious file detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112395612A true CN112395612A (en) 2021-02-23

Family

ID=74570249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910755713.1A Pending CN112395612A (en) 2019-08-15 2019-08-15 Malicious file detection method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112395612A (en)
WO (1) WO2021027831A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343219A (en) * 2021-05-31 2021-09-03 烟台中科网络技术研究所 Automatic and efficient high-risk mobile application program detection method
CN113449301A (en) * 2021-06-22 2021-09-28 深信服科技股份有限公司 Sample detection method, device, equipment and computer readable storage medium
CN113704761A (en) * 2021-08-31 2021-11-26 上海观安信息技术股份有限公司 Malicious file detection method and device, computer equipment and storage medium
CN114006766A (en) * 2021-11-04 2022-02-01 杭州安恒信息安全技术有限公司 Network attack detection method and device, electronic equipment and readable storage medium
CN114297645A (en) * 2021-12-03 2022-04-08 深圳市木浪云科技有限公司 Method, device and system for identifying Lesox family in cloud backup system

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861428B (en) * 2023-09-04 2023-12-08 北京安天网络安全技术有限公司 Malicious detection method, device, equipment and medium based on associated files
CN116910756B (en) * 2023-09-13 2024-01-23 北京安天网络安全技术有限公司 Detection method for malicious PE (polyethylene) files

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8245295B2 (en) * 2007-07-10 2012-08-14 Samsung Electronics Co., Ltd. Apparatus and method for detection of malicious program using program behavior
CN104866763B (en) * 2015-05-28 2019-02-26 天津大学 Android malware mixing detection method based on permission
CN106960153B (en) * 2016-01-12 2021-01-29 阿里巴巴集团控股有限公司 Virus type identification method and device
US10972495B2 (en) * 2016-08-02 2021-04-06 Invincea, Inc. Methods and apparatus for detecting and identifying malware by mapping feature data into a semantic space
CN109145605A (en) * 2018-08-23 2019-01-04 北京理工大学 A kind of Android malware family clustering method based on SinglePass algorithm

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343219A (en) * 2021-05-31 2021-09-03 烟台中科网络技术研究所 Automatic and efficient high-risk mobile application program detection method
CN113343219B (en) * 2021-05-31 2023-03-07 烟台中科网络技术研究所 Automatic and efficient high-risk mobile application program detection method
CN113449301A (en) * 2021-06-22 2021-09-28 深信服科技股份有限公司 Sample detection method, device, equipment and computer readable storage medium
CN113704761A (en) * 2021-08-31 2021-11-26 上海观安信息技术股份有限公司 Malicious file detection method and device, computer equipment and storage medium
CN114006766A (en) * 2021-11-04 2022-02-01 杭州安恒信息安全技术有限公司 Network attack detection method and device, electronic equipment and readable storage medium
CN114297645A (en) * 2021-12-03 2022-04-08 深圳市木浪云科技有限公司 Method, device and system for identifying Lesox family in cloud backup system
CN114297645B (en) * 2021-12-03 2022-09-27 深圳市木浪云科技有限公司 Method, device and system for identifying Lesox family in cloud backup system

Also Published As

Publication number Publication date
WO2021027831A1 (en) 2021-02-18

Similar Documents

Publication Publication Date Title
CN112395612A (en) Malicious file detection method and device, electronic equipment and storage medium
US10104107B2 (en) Methods and systems for behavior-specific actuation for real-time whitelisting
US10430586B1 (en) Methods of identifying heap spray attacks using memory anomaly detection
US10986103B2 (en) Signal tokens indicative of malware
US9798981B2 (en) Determining malware based on signal tokens
US11882134B2 (en) Stateful rule generation for behavior based threat detection
Zhao et al. Malicious executables classification based on behavioral factor analysis
US10216934B2 (en) Inferential exploit attempt detection
US11379581B2 (en) System and method for detection of malicious files
US11100242B2 (en) Restricted resource classes of an operating system
US10623426B1 (en) Building a ground truth dataset for a machine learning-based security application
US11297083B1 (en) Identifying and protecting against an attack against an anomaly detector machine learning classifier
US10860719B1 (en) Detecting and protecting against security vulnerabilities in dynamic linkers and scripts
CN109933986B (en) Malicious code detection method and device
CN107070845B (en) System and method for detecting phishing scripts
EP3798885B1 (en) System and method for detection of malicious files
Andronio Heldroid: Fast and efficient linguistic-based ransomware detection
CN111062035A (en) Lesog software detection method and device, electronic equipment and storage medium
CN110413871B (en) Application recommendation method and device and electronic equipment
CN111240696A (en) Method for extracting similar modules of mobile malicious program
KR102174393B1 (en) Malicious code detection device
Malik Anomaly based Intrusion Detection in Android Mobiles: A Review
Reshi et al. Enhancing Malware Detection using Deep Learning Approach
CN114861179A (en) Risk detection method, device, terminal and medium for mobile terminal application and file
CN111143843A (en) Malicious application detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination