CN116578537A

CN116578537A - File detection method, readable storage medium and electronic device

Info

Publication number: CN116578537A
Application number: CN202310847689.0A
Authority: CN
Inventors: 吕经祥; 李石磊; 肖新光
Original assignee: Beijing Antiy Network Technology Co Ltd
Current assignee: Beijing Antiy Network Technology Co Ltd
Priority date: 2023-07-12
Filing date: 2023-07-12
Publication date: 2023-08-11
Anticipated expiration: 2043-07-12
Also published as: CN116578537B

Abstract

The application provides a file detection method, a readable storage medium and an electronic device, and relates to the field of file detection. The method comprises: acquiring a file to be detected; obtaining, according to a plurality of segments of structured data in the file, structured profile feature information corresponding to the file to be detected; determining, according to the structured profile feature information and a first preset detection method, whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold; and if so, processing the content of the plurality of segments of structured data based on a second preset detection method to determine whether the file to be detected is a malicious file. If, after the first detection method, the probability that the file to be detected is a malicious file is less than the preset probability threshold, the file is treated as not malicious, and the second detection method, which consumes more computer resources and computing power, is not performed, thereby saving computer resources.

Description

File detection method, readable storage medium and electronic device

Technical Field

The present application relates to the field of file detection, and in particular, to a file detection method, a readable storage medium, and an electronic device.

Background

In the prior art, when an unknown file is detected, the structured data in the unknown file is cut and spliced to generate a new file, and the content of the new file is then detected directly, which requires a large amount of computing power.

Therefore, how to detect an unknown file with less computing power and determine whether it is a malicious file is a technical problem to be solved.

Disclosure of Invention

In view of the above, the present application provides a file detection method, a readable storage medium and an electronic device, which at least partially solve the technical problems existing in the prior art, and the technical scheme adopted by the present application is as follows:

according to a first aspect of the present application, there is provided a file detection method comprising:

acquiring a file to be detected; the file to be detected comprises a plurality of sections of structured data and a plurality of sections of unstructured data; structured data and unstructured data are alternately arranged;

obtaining, according to the plurality of segments of structured data, structured profile feature information corresponding to the file to be detected; the structured profile feature information comprises the number of segments of structured data, and a start address and a length of each segment of structured data;

determining, according to the structured profile feature information and a first preset detection method, whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold;

if yes, processing the content of the plurality of sections of structured data based on a second preset detection method to determine whether the file to be detected is a malicious file; wherein, the computer resource needed by implementing the second preset detection method is larger than the computer resource needed by implementing the first preset detection method.

In one exemplary embodiment of the present application, the structured profile feature information is $A = (a, DA_1, LA_1, DA_2, LA_2, \ldots, DA_b, LA_b, \ldots, DA_a, LA_a)$, $b = 1, 2, \ldots, a$; where $a$ is the number of segments of structured data, $DA_b$ is the start address of the b-th segment of structured data, and $LA_b$ is the length of the b-th segment of structured data.

In an exemplary embodiment of the present application, determining whether a probability that a file to be detected is a malicious file is greater than a preset probability threshold according to structured profile feature information and a first preset detection method includes:

obtaining, according to a preset structured profile feature information set $B = (B_1, B_2, \ldots, B_c, \ldots, B_d)$, a structured profile matching degree set $C = (C_1, C_2, \ldots, C_c, \ldots, C_d)$, $c = 1, 2, \ldots, d$; where $d$ is the number of pieces of preset structured profile feature information, $B_c$ is the c-th piece of preset structured profile feature information, and $C_c$ is the matching degree between A and $B_c$;

if $C_c$ is greater than a preset matching degree threshold, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold.

obtaining a hash value AD of A;

if AD is the same as $D_e$ in a preset structured profile feature hash value set $D = (D_1, D_2, \ldots, D_e, \ldots, D_f)$, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold; where $e = 1, 2, \ldots, f$.

In an exemplary embodiment of the present application, after acquiring the file to be detected, the file detecting method further includes:

obtaining, according to the plurality of segments of unstructured data, unstructured profile feature information corresponding to the file to be detected; the unstructured profile feature information comprises the number of segments of unstructured data, and a start address and a length of each segment of unstructured data;

determining, according to the unstructured profile feature information and a third preset detection method, whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold;

if yes, processing the content of the plurality of segments of unstructured data based on a fourth preset detection method to determine whether the file to be detected is a malicious file; wherein the computer resources required to implement the fourth preset detection method are greater than those required to implement the third preset detection method.

In one exemplary embodiment of the application, the unstructured profile feature information is $E = (g, DE_1, LE_1, DE_2, LE_2, \ldots, DE_h, LE_h, \ldots, DE_g, LE_g)$, $h = 1, 2, \ldots, g$; where $g$ is the number of segments of unstructured data, with $g = a$, $g = a - 1$ or $g = a + 1$; $DE_h$ is the start address of the h-th segment of unstructured data, and $LE_h$ is the length of the h-th segment of unstructured data.

In an exemplary embodiment of the present application, determining whether a probability that a file to be detected is a malicious file is greater than a preset probability threshold according to unstructured profile feature information and a third preset detection method includes:

obtaining, according to a preset unstructured profile feature information set $F = (F_1, F_2, \ldots, F_i, \ldots, F_j)$, an unstructured profile matching degree set $G = (G_1, G_2, \ldots, G_i, \ldots, G_j)$, $i = 1, 2, \ldots, j$; where $j$ is the number of pieces of preset unstructured profile feature information, $F_i$ is the i-th piece of preset unstructured profile feature information, and $G_i$ is the matching degree between E and $F_i$; if $G_i$ is greater than a preset matching degree threshold, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold.

obtaining a hash value EH of E;

if EH is the same as $H_k$ in a preset unstructured profile feature hash value set $H = (H_1, H_2, \ldots, H_k, \ldots, H_m)$, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold; where $k = 1, 2, \ldots, m$.

In an exemplary embodiment of the present application, based on a fourth preset detection method, processing contents of a plurality of pieces of unstructured data to determine whether a file to be detected is a malicious file, includes:

obtaining a first hash value of each segment of unstructured data to obtain a first hash value set $I = (I_1, I_2, \ldots, I_h, \ldots, I_g)$; where $I_h$ is the first hash value of the h-th segment of unstructured data;

if $I_h$ is the same as $J_p$ in a preset unstructured data segment whole hash value set $J = (J_1, J_2, \ldots, J_p, \ldots, J_q)$, determining that the file to be detected is a malicious file; where $p = 1, 2, \ldots, q$.

obtaining a second hash value of the first 128 bytes of each segment of unstructured data to obtain a second hash value set $K = (K_1, K_2, \ldots, K_h, \ldots, K_g)$; where $K_h$ is the second hash value of the first 128 bytes of the h-th segment of unstructured data;

if $K_h$ is the same as $L_t$ in a preset unstructured data segment partial hash value set $L = (L_1, L_2, \ldots, L_t, \ldots, L_u)$, determining that the file to be detected is a malicious file; where $t = 1, 2, \ldots, u$.

obtaining a second hash value $K_1$ of the first 128 bytes of the first segment of unstructured data;

if $K_1$ is the same as $L_t$ in the preset unstructured data segment partial hash value set $L = (L_1, L_2, \ldots, L_t, \ldots, L_u)$, determining that the file to be detected is a malicious file; where $t = 1, 2, \ldots, u$.
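
As an illustration of the partial-hash check, the following Python sketch hashes the first 128 bytes of each unstructured segment and looks the result up in a preset set. The hash function (SHA-256), the function names `partial_hashes` and `matches_partial`, and the representation of segments as (start address, length) pairs are assumptions made for this example only; the application does not specify them.

```python
import hashlib

PREFIX_LEN = 128  # number of leading bytes hashed per segment, as stated in the text


def partial_hashes(data: bytes, segments):
    """Return K: one hash per unstructured segment, taken over its first 128 bytes.

    `segments` is a list of (start_address, length) pairs (hypothetical layout).
    SHA-256 is an assumption; the application does not name a hash function.
    """
    hashes = []
    for start, length in segments:
        # A segment shorter than 128 bytes is hashed in full.
        prefix = data[start:start + min(length, PREFIX_LEN)]
        hashes.append(hashlib.sha256(prefix).hexdigest())
    return hashes


def matches_partial(data: bytes, segments, preset_partial_hashes) -> bool:
    """True if any K_h equals some L_t in the preset partial hash set L."""
    preset = set(preset_partial_hashes)
    return any(k in preset for k in partial_hashes(data, segments))
```

Checking only the first segment, as in the last variant above, corresponds to passing a one-element `segments` list.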

According to a second aspect of the present application, there is provided a non-transitory computer readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement a file detection method.

According to a third aspect of the present application, there is provided an electronic device comprising a processor and the non-transitory computer readable storage medium described above.

The application has at least the following beneficial effects:

According to the file detection method provided by the application, for a file to be detected that contains both structured data and unstructured data arranged alternately, the structured data is processed by first obtaining its structured profile feature information (comprising the number of segments of structured data, and the start address and length of each segment), and then determining, based on the structured profile feature information and a first preset detection method, whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold. This processing relies only on the profile features of the structured data and does not process its content, so it consumes fewer computer resources and less computing power than the second preset detection method, which processes the content of the structured data. The purpose of the first preset detection method is to make a preliminary determination of whether the file to be detected may be a malicious file. If the probability is greater than the preset probability threshold, the file is likely to be malicious, and the second preset detection method, which consumes more computer resources and computing power, is then used to process the content of the structured data and obtain a more accurate result, thereby determining whether the file to be detected is a malicious file.
In the application, if, after the first detection method, the probability that the file to be detected is a malicious file is less than the preset probability threshold, the file is treated as not malicious, and the second detection method, which consumes more computer resources and computing power, is not performed, thereby saving computer resources.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of an embodiment of the file detection method provided by the present application;

Fig. 2 is a flowchart of another embodiment of the file detection method provided by the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

Fig. 1 is a flowchart of a file detection method according to an embodiment of the present application. The file detection method as shown in fig. 1 includes steps S100-S400:

Step S100, obtaining a file to be detected; the file to be detected comprises a plurality of sections of structured data and a plurality of sections of unstructured data; structured data and unstructured data are alternately arranged.

Specifically, the file to be detected comprises a plurality of sections of structured data and a plurality of sections of unstructured data, and the sections of structured data and the sections of unstructured data in the file to be detected are alternately arranged.

Step S200, according to a plurality of sections of structured data, structured outline characteristic information corresponding to a file to be detected is obtained; the structured profile feature information includes a number of segments of the structured data, and a start address and a length of each segment of the structured data.

In this embodiment, the structured data between every two segments of unstructured data is taken as one segment; the structured data between the file header and the unstructured data nearest the file header is taken as one segment; and the structured data between the unstructured data nearest the end of the file and the end of the file is taken as one segment. The total count of these segments is the number of segments of structured data.

The structured profile feature information is $A = (a, DA_1, LA_1, DA_2, LA_2, \ldots, DA_b, LA_b, \ldots, DA_a, LA_a)$, $b = 1, 2, \ldots, a$; where $a$ is the number of segments of structured data, $DA_b$ is the start address of the b-th segment of structured data, and $LA_b$ is the length of the b-th segment of structured data. Specifically, the structured data in the file to be detected is structurally parsed, and the number of segments of structured data is determined from the NumberOfSections field; the information in the IMAGE_SECTION_HEADER structure of each segment of structured data is then acquired, including the start address and the length of that segment.

Here, the start address and the length of each segment of structured data are placed adjacently in A, so that the profile information of each segment is recorded more clearly.
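
The following Python sketch shows one way the tuple A could be built for a PE file. It reads NumberOfSections from the COFF file header and walks the IMAGE_SECTION_HEADER entries. Treating PointerToRawData as the start address and SizeOfRawData as the length is an assumption of this example (the application does not say which header fields are used), and the function name `structured_profile` is hypothetical.

```python
import struct


def structured_profile(data: bytes) -> tuple:
    """Return (a, DA_1, LA_1, ..., DA_a, LA_a) for a PE image held in memory."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew (offset 0x3C in the DOS header) points at the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    coff = e_lfanew + 4  # start of the 20-byte COFF file header
    (number_of_sections,) = struct.unpack_from("<H", data, coff + 2)
    (size_of_optional_header,) = struct.unpack_from("<H", data, coff + 16)
    first_section = coff + 20 + size_of_optional_header

    profile = [number_of_sections]
    for b in range(number_of_sections):
        header = first_section + 40 * b  # each IMAGE_SECTION_HEADER is 40 bytes
        # SizeOfRawData is at offset 16, PointerToRawData at offset 20.
        size_of_raw_data, pointer_to_raw_data = struct.unpack_from(
            "<II", data, header + 16
        )
        # Assumption: start address = PointerToRawData, length = SizeOfRawData.
        profile += [pointer_to_raw_data, size_of_raw_data]
    return tuple(profile)
```

Only the headers are read; section contents are never touched, which is what keeps this stage cheap relative to content-based detection.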

Step S300, determining whether the probability of the file to be detected being a malicious file is greater than a preset probability threshold according to the structural outline characteristic information and a first preset detection method.

Here, whether the probability that the file to be detected is a malicious file is greater than the preset probability threshold may be determined through the following steps S310-S320:

Step S310, obtaining, according to a preset structured profile feature information set $B = (B_1, B_2, \ldots, B_c, \ldots, B_d)$, a structured profile matching degree set $C = (C_1, C_2, \ldots, C_c, \ldots, C_d)$, $c = 1, 2, \ldots, d$; where $d$ is the number of pieces of preset structured profile feature information, $B_c$ is the c-th piece of preset structured profile feature information, and $C_c$ is the matching degree between A and $B_c$.

Specifically, B includes d pieces of preset structured profile feature information, each of which is the structured profile feature information of a malicious file. B is traversed with A, and the matching degree between A and each piece of preset structured profile feature information in B is obtained, yielding the structured profile matching degree set C.

Step S320, if $C_c$ is greater than a preset matching degree threshold, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold.

Specifically, a preset matching degree threshold is set. If $C_c$ is greater than the preset matching degree threshold, the matching degree between A and $B_c$ is high; because $B_c$ is structured profile feature information obtained from historical malicious files, the probability that the file to be detected corresponding to A is a malicious file is greater than the preset probability threshold, and the content of that file needs further detection to determine whether it is in fact a malicious file. Otherwise, if $C_c$ is less than the preset matching degree threshold, the matching degree between A and $B_c$ is low, so the probability that the file to be detected corresponding to A is a malicious file is less than the preset probability threshold; the file is then determined not to be a malicious file, and no further content detection is performed on it.

In this embodiment, the structured profile feature information of the file to be detected is obtained and matched against the preset structured profile feature information set, and the probability that the file is a malicious file is preliminarily judged from the relation between the matching degree and the preset matching degree threshold, which in turn determines whether processing with higher computing power is needed. Files to be detected whose probability is less than the preset probability threshold need no further detection, so this embodiment does not apply high-computing-power processing to all files to be detected, which shortens detection time and saves computer resources.
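
The application does not define how the matching degree is computed. Purely as an assumed stand-in, the sketch below uses the Jaccard similarity of the (start address, length) pairs in two profiles, and then applies the threshold test of step S320. The names `matching_degree` and `exceeds_threshold` and the default threshold of 0.8 are hypothetical.

```python
def _pairs(profile):
    """Split (a, DA_1, LA_1, ..., DA_a, LA_a) into a set of (start, length) pairs."""
    return set(zip(profile[1::2], profile[2::2]))


def matching_degree(a_profile, b_profile) -> float:
    """Assumed metric: Jaccard similarity of the two sets of (start, length) pairs."""
    pairs_a, pairs_b = _pairs(a_profile), _pairs(b_profile)
    union = pairs_a | pairs_b
    if not union:
        return 1.0  # two empty profiles are treated as identical
    return len(pairs_a & pairs_b) / len(union)


def exceeds_threshold(a_profile, preset_profiles, threshold: float = 0.8) -> bool:
    """Step S320: True if any C_c is greater than the matching degree threshold."""
    return any(matching_degree(a_profile, b) > threshold for b in preset_profiles)
```

Any other similarity measure over profiles could be substituted for `matching_degree` without changing the threshold logic.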

In addition, whether the probability that the file to be detected is a malicious file is greater than the preset probability threshold may also be determined through the following steps S330-S340:

Step S330, obtaining a hash value AD of A.

Step S340, if AD is the same as $D_e$ in a preset structured profile feature hash value set $D = (D_1, D_2, \ldots, D_e, \ldots, D_f)$, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold; where $e = 1, 2, \ldots, f$.

Specifically, after the hash value AD of A is obtained, each preset structured profile feature hash value in D is traversed with AD, where $D_e$ is the hash value of structured profile feature information obtained from a historical malicious file. If AD is the same as $D_e$, the probability that the file to be detected corresponding to A is a malicious file is determined to be greater than the preset probability threshold, and detection with higher computing power needs to be performed on the content of that file. Otherwise, if AD is different from $D_e$, the probability that the file to be detected corresponding to A is a malicious file is determined to be less than the preset probability threshold; the file is then determined not to be a malicious file, and no further content detection is performed on it.

In this embodiment, the hash value of the structured profile feature information of the file to be detected is obtained and compared with the hash values in the preset structured profile feature hash value set. If it is the same as any of them, the probability that the file is a malicious file is determined to be greater than the preset probability threshold, and detection with higher computing power is needed. Otherwise, the probability is determined to be less than the preset probability threshold, and the file is not detected further. High-computing-power processing is therefore not applied to all files to be detected, which shortens detection time and saves computer resources.
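
A minimal sketch of steps S330-S340 follows. The hash function (SHA-256) and the serialization of A as little-endian 64-bit integers are assumptions made for the example, since the application names neither; `profile_hash` and `passes_hash_prefilter` are hypothetical names.

```python
import hashlib
import struct


def profile_hash(profile) -> str:
    """Hash AD of a profile tuple A = (a, DA_1, LA_1, ..., DA_a, LA_a).

    Assumption: each entry is packed as an unsigned little-endian 64-bit
    integer before hashing with SHA-256.
    """
    packed = struct.pack(f"<{len(profile)}Q", *profile)
    return hashlib.sha256(packed).hexdigest()


def passes_hash_prefilter(profile, preset_hashes) -> bool:
    """Step S340: True if AD equals some D_e in the preset hash set D."""
    return profile_hash(profile) in preset_hashes
```

Storing D as a Python `set` makes the lookup a constant-time membership test rather than an explicit traversal, which has the same outcome.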

Step S400, if yes, processing the content of a plurality of sections of structured data based on a second preset detection method to determine whether the file to be detected is a malicious file; wherein, the computer resource needed by implementing the second preset detection method is larger than the computer resource needed by implementing the first preset detection method.

Specifically, after further detection is determined, processing is performed on the content of a plurality of sections of structured data of the file to be detected based on a second preset detection method, and whether the file to be detected is a malicious file is further determined.

It should be noted that the second preset detection method requires more computer resources than the first preset detection method. The first preset detection method therefore serves as a preliminary probability screen for malicious files, and the data content of the file to be detected is then examined with the second preset detection method, which requires more computing power and computer resources, to determine whether the file is a malicious file. Here, the second preset detection method may be an existing detection method, and a person skilled in the art can select a specific detection method according to the actual situation.
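
The two-stage flow of steps S300-S400 can be sketched as a simple cascade in which the costly detector runs only when the cheap screen flags the file. Both detectors are passed in as callables because the application leaves their concrete form open; the function and parameter names are hypothetical.

```python
from typing import Callable


def detect(
    file_bytes: bytes,
    cheap_prefilter: Callable[[bytes], bool],
    expensive_detector: Callable[[bytes], bool],
) -> bool:
    """Return True if the file is judged malicious.

    cheap_prefilter stands in for the first preset detection method (profile
    based); expensive_detector stands in for the second (content based).
    """
    if not cheap_prefilter(file_bytes):
        # Probability below the threshold: treated as not malicious,
        # and the costly content analysis is skipped entirely.
        return False
    return expensive_detector(file_bytes)
```

The saving comes from the early return: files rejected by the screen never reach the content-based detector.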

Referring to fig. 2, after the file to be detected is acquired, the file detection method of the present application further includes step S500-step S700, specifically as follows:

step S500, according to a plurality of sections of unstructured data, unstructured profile characteristic information corresponding to a file to be detected is obtained; the unstructured profile characteristic information comprises a number of segments of unstructured data, and a starting address and a length of each segment of unstructured data.

Specifically, the unstructured profile feature information is $E = (g, DE_1, LE_1, DE_2, LE_2, \ldots, DE_h, LE_h, \ldots, DE_g, LE_g)$, $h = 1, 2, \ldots, g$; where $g$ is the number of segments of unstructured data, with $g = a$, $g = a - 1$ or $g = a + 1$; $DE_h$ is the start address of the h-th segment of unstructured data, and $LE_h$ is the length of the h-th segment of unstructured data.

In this embodiment, the unstructured data between every two segments of structured data is taken as one segment; the unstructured data between the file header and the structured data nearest the file header is taken as one segment; and the unstructured data between the structured data nearest the end of the file and the end of the file is taken as one segment. The total count of these segments is the number of segments of unstructured data.

When a segment of unstructured data lies between two segments of structured data, the end address of each segment of structured data is determined from its start address and length, which also gives the start address of the following segment of unstructured data; the length of the unstructured segment is then determined from the end address of the preceding structured segment and the start address of the next structured segment.

When a segment of unstructured data lies between the file header and the structured data nearest the file header, the start of the file is the start address of that segment, and the position immediately before the start address of the structured data nearest the file header is its end address, from which its length is determined.

When a segment of unstructured data lies between the structured data nearest the end of the file and the end of the file, the position immediately after the end address of that structured data is the start address of the segment, and the length from this start address to the end of the file is the length of the segment.
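
The three cases above amount to collecting the gaps that the structured segments leave in the file. The sketch below derives E from A and the file size under that reading; the function name `unstructured_profile` and the use of half-open byte ranges are assumptions of the example.

```python
def unstructured_profile(a_profile, file_size: int) -> tuple:
    """Derive E = (g, DE_1, LE_1, ..., DE_g, LE_g) from A and the file size.

    Every byte range not covered by a structured segment is treated as one
    segment of unstructured data (assumed interpretation).
    """
    structured = sorted(zip(a_profile[1::2], a_profile[2::2]))
    gaps = []
    cursor = 0  # first byte not yet assigned to any segment
    for start, length in structured:
        if start > cursor:
            # Gap before this structured segment (leading or in-between case).
            gaps.append((cursor, start - cursor))
        cursor = max(cursor, start + length)
    if cursor < file_size:
        # Trailing gap between the last structured segment and the file end.
        gaps.append((cursor, file_size - cursor))

    flat = [len(gaps)]
    for start, length in gaps:
        flat += [start, length]
    return tuple(flat)
```

Depending on whether the file begins and ends with structured or unstructured data, the resulting g equals a - 1, a or a + 1, consistent with the relation stated above.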

Step S600, determining whether the probability of the file to be detected being a malicious file is greater than a preset probability threshold according to unstructured profile characteristic information and a third preset detection method.

Here, the present application may determine whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold by the following steps S610-S620:

Step S610, obtaining, according to a preset unstructured profile feature information set $F = (F_1, F_2, \ldots, F_i, \ldots, F_j)$, an unstructured profile matching degree set $G = (G_1, G_2, \ldots, G_i, \ldots, G_j)$, $i = 1, 2, \ldots, j$; where $j$ is the number of pieces of preset unstructured profile feature information, $F_i$ is the i-th piece of preset unstructured profile feature information, and $G_i$ is the matching degree between E and $F_i$.

Specifically, F includes j pieces of preset unstructured profile feature information, each of which is the unstructured profile feature information of a malicious file. F is traversed with E, and the matching degree between E and each piece of preset unstructured profile feature information in F is obtained, yielding the unstructured profile matching degree set G.

Step S620, if $G_i$ is greater than a preset matching degree threshold, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold.

Specifically, a preset matching degree threshold is set. If $G_i$ is greater than the preset matching degree threshold, the matching degree between E and $F_i$ is high; because $F_i$ is unstructured profile feature information obtained from historical malicious files, the probability that the file to be detected corresponding to E is a malicious file is greater than the preset probability threshold, and the content of that file needs further detection to determine whether it is in fact a malicious file. Otherwise, if $G_i$ is less than the preset matching degree threshold, the matching degree between E and $F_i$ is low, so the probability that the file to be detected corresponding to E is a malicious file is less than the preset probability threshold; the file is then determined not to be a malicious file, and no further content detection is performed on it.

In this embodiment, the unstructured profile feature information of the file to be detected is obtained and matched against the preset unstructured profile feature information set, and the probability that the file is a malicious file is preliminarily judged from the relation between the matching degree and the preset matching degree threshold, which in turn determines whether processing with higher computing power is needed. Files to be detected whose probability is less than the preset probability threshold need no further detection, so high-computing-power processing is not applied to all files to be detected, which shortens detection time and saves computer resources.

In addition, the present application may further determine whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold through the following steps S630-S640:

Step S630, obtaining a hash value EH of E.

Step S640, if EH is the same as $H_k$ in a preset unstructured profile feature hash value set $H = (H_1, H_2, \ldots, H_k, \ldots, H_m)$, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold; where $k = 1, 2, \ldots, m$.

Specifically, after the hash value EH of E is obtained, each preset unstructured profile feature hash value in H is traversed with EH, where $H_k$ is the hash value of unstructured profile feature information obtained from a historical malicious file. If EH is the same as $H_k$, the probability that the file to be detected corresponding to E is a malicious file is determined to be greater than the preset probability threshold, and detection with higher computing power needs to be performed on the content of that file. Otherwise, if EH is different from $H_k$, the probability that the file to be detected corresponding to E is a malicious file is determined to be less than the preset probability threshold; the file is then determined not to be a malicious file, and no further content detection is performed on it.

In this embodiment, the hash value of the unstructured profile feature information of the file to be detected is obtained and compared with the hash values in the preset unstructured profile feature hash value set. If it is the same as any of them, the probability that the file is a malicious file is determined to be greater than the preset probability threshold, and detection with higher computing power is needed. Otherwise, the probability is determined to be less than the preset probability threshold, and the file is not detected further. High-computing-power processing is therefore not applied to all files to be detected, which shortens detection time and saves computer resources.

Step S700, if yes, processing the content of a plurality of sections of unstructured data based on a fourth preset detection method to determine whether the file to be detected is a malicious file; wherein, the computer resource required for implementing the fourth preset detection method is greater than the computer resource required for implementing the third preset detection method.

It may be determined whether the file to be detected is a malicious file by the following steps S710 to S720:

Step S710, obtaining a first hash value of each segment of unstructured data to obtain a first hash value set I = (I1, I2, ..., Ih, ..., Ig); where Ih is the first hash value of the h-th segment of unstructured data.

Specifically, if the probability that the file to be detected is a malicious file is greater than the preset probability threshold, the file needs to be detected further. In this embodiment, after that probability is determined to be greater than the preset probability threshold based on the unstructured profile characteristic information or its hash value, a first hash value of each segment of unstructured data in the file to be detected is obtained, yielding the first hash value set.

Step S720, if Ih is the same as any Jp in the preset unstructured data segment whole hash value set J = (J1, J2, ..., Jp, ..., Jq), determining that the file to be detected is a malicious file; where p = 1, 2, ..., q.

Specifically, the preset unstructured data segment whole hash value set J includes the hash value of each segment of unstructured data of a plurality of malicious files. After the first hash value set I is obtained, J is traversed with each first hash value in I; if any first hash value in I is the same as any preset unstructured data segment whole hash value in J, the file to be detected is determined to be a malicious file.

In this embodiment, a first hash value of each segment of unstructured data is obtained to form the first hash value set, and each first hash value in that set is used to traverse the preset unstructured data segment whole hash value set. If any first hash value is the same as any preset whole hash value, the file to be detected is determined to be a malicious file. When detection is based on unstructured data, it is first determined, from the unstructured profile characteristic information or its hash value, whether the file to be detected is likely to be a malicious file. When the probability that the file is a malicious file is smaller than the preset probability threshold, no further processing is performed. Only when the probability is greater than the preset probability threshold is the fourth preset detection method, which requires more computer resources, applied: the first hash value of the content of each segment of unstructured data is computed and compared with the preset whole hash values, which determines more accurately whether the file is a malicious file. Compared with calculating content hash values for all unstructured data of every file to be detected, this saves computer resources and reduces detection time.
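Steps S710 to S720 can be sketched as follows. This is an illustrative sketch only: SHA-256 as the hash function and the names `first_hashes`, `is_malicious_by_whole_hash`, and the sample set `J` are assumptions, not details given in the text.

```python
import hashlib
from typing import Iterable, List, Set


def first_hashes(segments: Iterable[bytes]) -> List[str]:
    """S710: hash the full content of every unstructured segment (the set I)."""
    return [hashlib.sha256(segment).hexdigest() for segment in segments]


def is_malicious_by_whole_hash(segments: Iterable[bytes], J: Set[str]) -> bool:
    """S720: the file is flagged if any Ih equals any Jp."""
    return any(h in J for h in first_hashes(segments))


# Hypothetical preset whole-hash set J built from a known malicious segment.
J = {hashlib.sha256(b"EVIL_PAYLOAD").hexdigest()}

print(is_malicious_by_whole_hash([b"\x00" * 64, b"EVIL_PAYLOAD"], J))  # True
print(is_malicious_by_whole_hash([b"\x00" * 64, b"\x00" * 32], J))     # False
```

Every byte of every segment is hashed here, which is why the text treats this stage as the more resource-intensive one and runs it only on files that pass the profile pre-screening.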

Further, the present application may further determine whether the file to be detected is a malicious file through the following steps S730 to S740:

Step S730, obtaining a second hash value of the first 128 bytes of each segment of unstructured data to obtain a second hash value set K = (K1, K2, ..., Kh, ..., Kg).

Specifically, if the probability that the file to be detected is a malicious file is greater than the preset probability threshold, the file needs to be detected further. In this embodiment, after that probability is determined to be greater than the preset probability threshold based on the unstructured profile characteristic information or its hash value, a second hash value of the first 128 bytes of each segment of unstructured data in the file to be detected is obtained, yielding the second hash value set.

Step S740, if Kh is the same as any Lt in the preset unstructured data segment partial hash value set L = (L1, L2, ..., Lt, ..., Lu), determining that the file to be detected is a malicious file; where t = 1, 2, ..., u.

Specifically, in this embodiment, after the second hash value of the first 128 bytes of each segment of unstructured data is obtained to form the second hash value set K, L is traversed with each Kh, where Lt is a partial hash value of an unstructured data segment obtained from a historical malicious file. If any Kh is the same as any Lt, the file to be detected is determined to be a malicious file.

This embodiment is a further improvement on the previous embodiment. The previous embodiment obtains a first hash value of each whole segment of unstructured data to form the first hash value set, whereas this embodiment obtains a second hash value of only the first 128 bytes of each segment to form the second hash value set, so it requires fewer computer resources and less time. Multiple tests show that the accuracy of determining whether the file to be detected is a malicious file using the second hash value set is greater than 90%.

Further, the present application may further determine whether the file to be detected is a malicious file through the following steps S750-S760:

Step S750, obtaining a second hash value K1 of the first 128 bytes of the first segment of unstructured data.

Specifically, if the probability that the file to be detected is a malicious file is greater than the preset probability threshold, the file needs to be detected further. In this embodiment, after that probability is determined to be greater than the preset probability threshold based on the unstructured profile characteristic information or its hash value, a second hash value K1 of the first 128 bytes of the first segment of unstructured data in the file to be detected is obtained.

Step S760, if K1 is the same as any Lt in the preset unstructured data segment partial hash value set L = (L1, L2, ..., Lt, ..., Lu), determining that the file to be detected is a malicious file; where t = 1, 2, ..., u.

Specifically, in this embodiment, after the second hash value K1 of the first 128 bytes of the first segment of unstructured data in the file to be detected is obtained, L is traversed with K1, where Lt is a partial hash value of an unstructured data segment obtained from a historical malicious file. If K1 is the same as any Lt, the file to be detected is determined to be a malicious file.

This embodiment is a further improvement on the previous embodiment. The previous embodiment obtains a second hash value of the first 128 bytes of each segment of unstructured data to form the second hash value set, whereas this embodiment obtains only the second hash value of the first 128 bytes of the first segment of unstructured data, so it requires fewer computer resources and less time. Multiple tests show that the accuracy of determining whether the file to be detected is a malicious file using the second hash value of the first 128 bytes of the first segment of unstructured data is greater than 90%.
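The two prefix-based variants (steps S730 to S740 over every segment, and steps S750 to S760 over the first segment only) can be sketched together. SHA-256, the flag `first_only`, and the sample data are illustrative assumptions; only the 128-byte prefix length comes from the text.

```python
import hashlib
from typing import Sequence, Set

PREFIX_LEN = 128  # prefix length stated in the text


def second_hash(segment: bytes) -> str:
    """Hash only the first 128 bytes of a segment (Kh)."""
    return hashlib.sha256(segment[:PREFIX_LEN]).hexdigest()


def is_malicious_by_prefix(segments: Sequence[bytes], L: Set[str],
                           first_only: bool = False) -> bool:
    """S740 checks every Kh against L; S760 (first_only=True) checks only K1."""
    candidates = segments[:1] if first_only else segments
    return any(second_hash(segment) in L for segment in candidates)


# Hypothetical data: the known-bad content sits in the second segment.
bad_segment = b"EVIL" + b"A" * 500
segments = [b"\x00" * 300, bad_segment]
L = {second_hash(bad_segment)}

print(is_malicious_by_prefix(segments, L))                   # True
print(is_malicious_by_prefix(segments, L, first_only=True))  # False
```

The example shows the trade-off between the variants: checking only K1 does the least work, but in this constructed case it misses a match located in a later segment.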

In summary, the file detection method provided in this embodiment applies to a file to be detected that contains both structured data and unstructured data arranged alternately. First, the structured profile characteristic information and the unstructured profile characteristic information corresponding to the structured data and the unstructured data are obtained. Second, the structured profile characteristic information (or its hash value) is matched against a preset structured profile characteristic information library (or a library of its hash values), and the unstructured profile characteristic information (or its hash value) is compared against a preset unstructured profile characteristic information library (or a library of its hash values), to preliminarily determine whether the file to be detected is likely to be a malicious file. More specifically, the purpose is to preliminarily determine from the matching results whether malicious code is likely to exist in the structured data and/or the unstructured data. When malicious code is likely to exist in the structured data, a more computationally intensive analysis of the content of the structured data is performed to determine whether the file to be detected is a malicious file; when malicious code is likely to exist in the unstructured data, a more computationally intensive analysis of the content of the unstructured data is performed to determine whether the file to be detected is a malicious file. Because of this preliminary step, the content of the structured data and the unstructured data does not always need to be fully detected, which saves computer resources and improves efficiency.

In some embodiments of the present application, after step S300, the present application further includes the following steps S311-S316:

Step S311, compressing each segment of unstructured data separately to obtain a first data segment compression ratio set M = (M1, M2, ..., Mv, ..., Mg); where v = 1, 2, ..., g, Mv is the first data segment compression ratio of the v-th segment of unstructured data, that is, the ratio of the data size of the v-th segment before compression to its data size after compression, and g is the number of segments of unstructured data.

Step S312, obtaining a first compression ratio fluctuation value MN according to M; MN = (Σ_{v=1}^{g} (Mv - avg(M))^2) / g; where avg() is a preset average value determination function.

Here, the unstructured data used to increase the file size is basically repeated, meaningless data. However, some hackers may embed malicious code in the unstructured data, which increases the difficulty of detection. Malicious code is meaningful data, so its size changes little after compression.

Step S313, if MN is greater than the first preset compression ratio fluctuation value, arranging all the first data segment compression ratios in M in order from small to large to obtain a first ordered data segment compression ratio set MA = (MA1, MA2, ..., MAw, ..., MAg); where w = 1, 2, ..., g, MAw is the first data segment compression ratio ranked w-th in MA, and g is the number of segments of unstructured data.

Specifically, if all the unstructured data is repeated, meaningless data, the compression ratio of each segment of unstructured data stays within a small range. If a hacker embeds malicious code in some of the unstructured data, then, because malicious code is meaningful data whose size changes little after compression, the compression ratio of a segment embedded with malicious code is smaller than that of a segment without embedded malicious code. Therefore, if at least one segment embedded with malicious code exists among the unstructured data, its compression ratio deviates more from the average than the compression ratios of the segments without embedded malicious code.

In summary, the larger MN is, the more the first data segment compression ratios in M deviate from their average value, that is, at least one unstructured data segment embedded with malicious code and at least one unstructured data segment not embedded with malicious code may coexist in the file. Conversely, the smaller MN is, the less the first data segment compression ratios in M deviate from their average value; if MN approaches 0, the compression ratio of each segment of unstructured data stays within a small range, that is, there may be no unstructured data embedded with malicious code.

A first preset compression ratio fluctuation value is set; if MN is greater than this value, there is a higher possibility that malicious code is embedded in some segment of unstructured data in the file to be detected.
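The compression ratio Mv and the fluctuation value MN defined in steps S311 to S313 can be illustrated with a short sketch. The use of zlib as the compressor, the threshold value, and the sample segments are assumptions chosen for illustration; the text does not name a compression algorithm or a threshold.

```python
import hashlib
import zlib
from typing import Sequence


def compression_ratio(segment: bytes) -> float:
    """Mv: size before compression divided by size after compression."""
    return len(segment) / len(zlib.compress(segment))


def fluctuation(values: Sequence[float]) -> float:
    """MN: sum of squared deviations from the average, divided by g."""
    g = len(values)
    avg = sum(values) / g
    return sum((x - avg) ** 2 for x in values) / g


# Hypothetical segments: repeated padding versus dense, hard-to-compress bytes.
padding = b"\x00" * 4096
dense = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))

M = [compression_ratio(s) for s in (padding, padding, dense)]
MN = fluctuation(M)
FIRST_PRESET_FLUCTUATION = 1.0  # assumed threshold, not given in the text

print(M[2] < M[0])                    # True: the dense segment compresses far less
print(MN > FIRST_PRESET_FLUCTUATION)  # True: the ratios are widely spread
```

MN as defined is the population variance of the ratios, so a file made entirely of similar padding yields a value near 0, while a single outlying segment raises it sharply.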

Step S314, extracting a first preset number of first data segment compression ratios from MA in order from small to large to obtain a target first data segment compression ratio set MB = (MB1, MB2, ..., MBx, ..., MBy); where x = 1, 2, ..., y, MBx is the target first data segment compression ratio ranked x-th in MB, and y is the number of target first data segment compression ratios.

Specifically, first data segment compression ratios are extracted from MA in order from small to large; that is, y first data segment compression ratios are extracted in descending order of the possibility that malicious code is embedded in the corresponding unstructured data segment.

Step S315, traversing the preset unstructured data segment whole hash value set J = (J1, J2, ..., Jp, ..., Jq) using, in sequence, the first hash value of the segment of unstructured data corresponding to each target first data segment compression ratio in MB; where p = 1, 2, ..., q.

Step S316, if the first hash value of the unstructured data corresponding to MBx is the same as Jp, stopping traversing, and determining the unstructured data corresponding to MBx as target unstructured data; the target unstructured data is unstructured data embedded with malicious code.

Specifically, the existing traversal method traverses J directly with the first hash value of every segment of unstructured data, which consumes more computer resources and is inefficient. In the present application, the order in which first hash values traverse J is set so that J is traversed first with the first hash values of the unstructured data most likely to contain embedded malicious code; if the first hash value of the unstructured data corresponding to MBx is the same as Jp, the traversal stops. Compared with traversing J directly with the first hash value of every segment of unstructured data, the traversal can finish sooner, which saves computer resources and improves traversal efficiency.

In this embodiment, the first compression ratio fluctuation value MN is determined first. Because the compression ratio of a segment of unstructured data embedded with malicious code is smaller than that of a segment without embedded malicious code, an MN greater than the first preset compression ratio fluctuation value indicates a greater possibility that at least one segment embedded with malicious code and at least one segment not embedded with malicious code coexist. The first data segment compression ratios in M are then sorted, and a first preset number of them are extracted to obtain the target first data segment compression ratio set. The preset unstructured data segment whole hash value set J is traversed in the order given by this target set, starting from the first hash value of the unstructured data most likely to contain embedded malicious code (that is, the unstructured data with the smallest first data segment compression ratio). If the first hash value of the unstructured data corresponding to MBx is the same as Jp, the traversal stops, and the unstructured data corresponding to MBx is determined to be the target unstructured data, that is, the unstructured data embedded with malicious code. Compared with traversing J directly with the first hash value of every segment of unstructured data, the traversal may finish sooner, which saves computer resources. Conversely, if MN is smaller than the first preset compression ratio fluctuation value, it is determined that no unstructured data embedded with malicious code exists, and the subsequent steps are not performed, which also saves computer resources.
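Steps S311 to S316 can be sketched end to end as follows. This is a sketch under stated assumptions: zlib as the compressor, SHA-256 as the hash function, and the names `locate_by_compression`, `mn_threshold`, and `y` are illustrative; the case where MN exactly equals the threshold is not covered by the text and is treated here as "skip".

```python
import hashlib
import zlib
from typing import Optional, Sequence, Set


def compression_ratio(segment: bytes) -> float:
    # Mv: size before compression divided by size after compression.
    return len(segment) / len(zlib.compress(segment))


def fluctuation(values: Sequence[float]) -> float:
    # MN: sum of squared deviations from the average, divided by g.
    avg = sum(values) / len(values)
    return sum((x - avg) ** 2 for x in values) / len(values)


def locate_by_compression(segments: Sequence[bytes], J: Set[str],
                          mn_threshold: float, y: int) -> Optional[int]:
    """Return the index of the first segment whose hash is in J, or None."""
    if not segments:
        return None
    ratios = [compression_ratio(s) for s in segments]        # S311
    if fluctuation(ratios) <= mn_threshold:                  # S312, S313
        return None                                          # no outlier expected
    order = sorted(range(len(segments)), key=ratios.__getitem__)  # ascending (MA)
    for index in order[:y]:                                  # S314: the y smallest (MB)
        if hashlib.sha256(segments[index]).hexdigest() in J:  # S315
            return index                                     # S316: stop at first match
    return None


# Hypothetical data: one dense payload hidden between two padding segments.
padding = b"\x00" * 4096
payload = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
J = {hashlib.sha256(payload).hexdigest()}

print(locate_by_compression([padding, payload, padding], J, 1.0, 2))  # 1
print(locate_by_compression([padding, padding, padding], J, 1.0, 2))  # None
```

In the first call the payload has the smallest compression ratio, so it is hashed and matched first and the remaining segments are never hashed; in the second call the fluctuation check ends the procedure before any hash is computed.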

The steps S311 to S316 may be performed after the step S300 or after the step S200.

In addition, after step S300, the present application may further include the following steps S321 to S326:

Step S321, compressing the first 128 bytes of each segment of unstructured data separately to obtain a second data segment compression ratio set P = (P1, P2, ..., Pv, ..., Pg); where v = 1, 2, ..., g, Pv is the second data segment compression ratio of the v-th segment of unstructured data, that is, the ratio of the data size of the first 128 bytes of the v-th segment before compression to their data size after compression, and g is the number of segments of unstructured data.

Step S322, obtaining a second compression ratio fluctuation value PN according to P; PN = (Σ_{v=1}^{g} (Pv - avg(P))^2) / g; where avg() is a preset average value determination function.

Step S323, if PN is greater than the second preset compression ratio fluctuation value, arranging all the second data segment compression ratios in P in order from small to large to obtain a second ordered data segment compression ratio set PA = (PA1, PA2, ..., PAσ, ..., PAg); where σ = 1, 2, ..., g, and PAσ is the second data segment compression ratio ranked σ-th in PA.

Step S324, extracting a second preset number of second data segment compression ratios from PA in order from small to large to obtain a target second data segment compression ratio set PB = (PB1, PB2, ..., PBz, ..., PBβ); where z = 1, 2, ..., β, PBz is the target second data segment compression ratio ranked z-th in PB, and β is the number of target second data segment compression ratios.

Step S325, traversing the preset unstructured data segment partial hash value set L = (L1, L2, ..., Lt, ..., Lu) using, in sequence, the second hash value of the segment of unstructured data corresponding to each target second data segment compression ratio in PB; where t = 1, 2, ..., u.

Step S326, if the second hash value of the unstructured data corresponding to PBz is the same as Lt, stopping the traversal and determining the unstructured data corresponding to PBz as the target unstructured data; the target unstructured data is unstructured data embedded with malicious code.

It should be noted that this embodiment is a further refinement of the previous embodiment (steps S311 to S316): only the first 128 bytes of each segment of unstructured data are used for compression and subsequent processing, so fewer computer resources and less time are required than in the previous embodiment. Multiple tests show that the accuracy of determining whether a segment of unstructured data in the file to be detected is embedded with malicious code using the second hash value corresponding to the first 128 bytes of the unstructured data is greater than 90%. The specific implementation and the related principles are the same and are not repeated here.
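A minimal sketch of the prefix-based ratio Pv from step S321 is given below, assuming zlib as the compressor (the text does not name one) and using the hypothetical helper name `prefix_compression_ratio`.

```python
import hashlib
import zlib

PREFIX_LEN = 128  # prefix length stated in the text


def prefix_compression_ratio(segment: bytes) -> float:
    """Pv: ratio computed on the first 128 bytes only (0.0 for an empty segment)."""
    prefix = segment[:PREFIX_LEN]
    if not prefix:
        return 0.0
    return len(prefix) / len(zlib.compress(prefix))


# Hypothetical segments: repeated padding versus dense, hard-to-compress bytes.
padding = b"\x00" * 4096
dense = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))

print(prefix_compression_ratio(padding) > prefix_compression_ratio(dense))  # True
```

One practical point when choosing a compressor for such a short input: a zlib stream carries a fixed header and checksum, so the ratio for a 128-byte prefix of dense data can drop below 1. The ordering between padding and dense data is still preserved, which is what the sorting in steps S323 and S324 relies on.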

Similarly, the steps S321 to S326 may be performed after the step S300 or after the step S200.

In other embodiments of the present application, after step S300, the present application further includes the following steps S331-S336:

Step S331, obtaining a first data segment information entropy of each segment of unstructured data separately to obtain a first data segment information entropy set R = (R1, R2, ..., Rγ, ..., Rg); where γ = 1, 2, ..., g, Rγ is the first data segment information entropy of the γ-th segment of unstructured data, and g is the number of segments of unstructured data.

Here, the first data segment information entropy is the average amount of information in the unstructured data after redundancy is excluded. Any existing method may be used to obtain it, and a person skilled in the art can choose a specific method according to the actual situation. The unstructured data used to increase the file size is basically repeated, meaningless data; however, some hackers may embed malicious code in the unstructured data, which increases the difficulty of detection. Because malicious code is meaningful data, the information entropy of a segment of unstructured data embedded with malicious code is greater than that of a segment without embedded malicious code. That is, the larger the first data segment information entropy, the greater the possibility that malicious code is embedded in the corresponding unstructured data.

Step S332, obtaining a first information entropy fluctuation value RS according to R; RS = (Σ_{γ=1}^{g} (Rγ - avg(R))^2) / g; where avg() is a preset average value determination function.

Specifically, the larger the RS, the greater the degree of deviation between the information entropy of the first data segment in R and the average value of the information entropy of all the first data segments in R, that is, at least one section of unstructured data embedded with malicious code and at least one section of unstructured data not embedded with malicious code may exist in R; conversely, the smaller the RS, the smaller the deviation degree between the information entropy of the first data segment in R and the average value of the information entropy of all the first data segments in R, and if the RS approaches 0, it is indicated that the information entropy of each segment of unstructured data basically maintains to fluctuate within a smaller range, that is, unstructured data embedded with malicious code may not exist in R.
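The entropy set R and the fluctuation value RS can be illustrated as follows. Since the text leaves the entropy method open, byte-level Shannon entropy is used here as an assumed choice; the function names and sample segments are likewise illustrative.

```python
import math
from collections import Counter
from typing import Sequence


def shannon_entropy(segment: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 for an empty segment)."""
    if not segment:
        return 0.0
    n = len(segment)
    return sum((c / n) * math.log2(n / c) for c in Counter(segment).values())


def fluctuation(values: Sequence[float]) -> float:
    """RS: sum of squared deviations from the average, divided by g."""
    avg = sum(values) / len(values)
    return sum((x - avg) ** 2 for x in values) / len(values)


# Hypothetical segments: constant padding versus bytes covering all 256 values.
padding = b"\x00" * 1024
varied = bytes(range(256)) * 4

R = [shannon_entropy(s) for s in (padding, varied, padding)]
RS = fluctuation(R)

print(R)  # [0.0, 8.0, 0.0]
print(RS > fluctuation([shannon_entropy(padding)] * 3))  # True
```

Constant padding has an entropy of 0 bits per byte and uniformly distributed bytes reach the maximum of 8, so a single varied segment among padding produces a large RS, while all-padding input gives RS equal to 0.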

Step S333, if RS is greater than the first preset information entropy fluctuation value, arranging all the first data segment information entropies in R in order from large to small to obtain a first ordered data segment information entropy set RA = (RA1, RA2, ..., RAδ, ..., RAg); where δ = 1, 2, ..., g, RAδ is the first data segment information entropy ranked δ-th in RA, and g is the number of segments of unstructured data.

A first preset information entropy fluctuation value is set; if RS is greater than the first preset information entropy fluctuation value, there is a higher possibility that malicious code is embedded in an unstructured data segment in the file to be detected.

Step S334, extracting a third preset number of first data segment information entropies from RA in order from large to small to obtain a target first data segment information entropy set RB = (RB1, RB2, ..., RBε, ..., RBτ); where ε = 1, 2, ..., τ, RBε is the target first data segment information entropy ranked ε-th in RB, and τ is the number of target first data segment information entropies.

Specifically, first data segment information entropies are extracted from RA in order from large to small; that is, τ first data segment information entropies are extracted in descending order of the possibility that malicious code is embedded in the corresponding unstructured data segment.

Step S335, traversing the preset unstructured data segment whole hash value set J = (J1, J2, ..., Jp, ..., Jq) using, in sequence, the first hash value of the segment of unstructured data corresponding to each target first data segment information entropy in RB; where p = 1, 2, ..., q.

Step S336, if the first hash value of the unstructured data corresponding to RBε is the same as Jp, stopping the traversal and determining the unstructured data corresponding to RBε as the target unstructured data; the target unstructured data is unstructured data embedded with malicious code.

Specifically, the existing traversal method traverses J directly with the first hash value of every segment of unstructured data, which consumes more computer resources and is inefficient. In the present application, the order in which first hash values traverse J is set so that J is traversed first with the first hash values of the unstructured data most likely to contain embedded malicious code; if the first hash value of the unstructured data corresponding to RBε is the same as Jp, the traversal stops. Compared with traversing J directly with the first hash value of every segment of unstructured data, the traversal can finish sooner, which saves computer resources and improves traversal efficiency.

The steps S331 to S336 may be performed after the step S300 or may be performed after the step S200.

In this embodiment, the first information entropy fluctuation value RS is determined first. Because the information entropy of a segment of unstructured data embedded with malicious code is greater than that of a segment without embedded malicious code, an RS greater than the first preset information entropy fluctuation value indicates a greater possibility that at least one segment embedded with malicious code and at least one segment not embedded with malicious code coexist. The first data segment information entropies in R are then sorted, and a third preset number of them are extracted to obtain the target first data segment information entropy set. The preset unstructured data segment whole hash value set J is traversed in the order given by this target set, starting from the first hash value of the unstructured data most likely to contain embedded malicious code (that is, the unstructured data with the largest first data segment information entropy). If the first hash value of the unstructured data corresponding to RBε is the same as Jp, the traversal stops, and the unstructured data corresponding to RBε is determined to be the target unstructured data, that is, the unstructured data embedded with malicious code is located.
Compared with traversing J directly with the first hash value of every segment of unstructured data, the traversal can finish sooner, which saves computer resources. Conversely, if RS is smaller than the first preset information entropy fluctuation value, it is determined that no unstructured data embedded with malicious code exists, and the subsequent steps are not performed, which also saves computer resources.
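Steps S331 to S336 can be sketched as follows. Byte-level Shannon entropy, SHA-256, and the names `locate_by_entropy`, `rs_threshold`, and `tau` are assumptions for illustration; the text does not fix the entropy method or the hash function, and an RS exactly equal to the threshold is treated here as "skip".

```python
import hashlib
import math
from collections import Counter
from typing import Optional, Sequence, Set


def shannon_entropy(segment: bytes) -> float:
    # Byte-level Shannon entropy in bits per byte (0.0 for an empty segment).
    if not segment:
        return 0.0
    n = len(segment)
    return sum((c / n) * math.log2(n / c) for c in Counter(segment).values())


def locate_by_entropy(segments: Sequence[bytes], J: Set[str],
                      rs_threshold: float, tau: int) -> Optional[int]:
    """Return the index of the first segment whose hash is in J, or None."""
    if not segments:
        return None
    entropies = [shannon_entropy(s) for s in segments]             # S331
    avg = sum(entropies) / len(entropies)
    rs = sum((r - avg) ** 2 for r in entropies) / len(entropies)   # S332
    if rs <= rs_threshold:                                         # S333
        return None
    order = sorted(range(len(segments)),
                   key=entropies.__getitem__, reverse=True)        # descending (RA)
    for index in order[:tau]:                                      # S334: tau largest (RB)
        if hashlib.sha256(segments[index]).hexdigest() in J:       # S335
            return index                                           # S336: stop at first match
    return None


# Hypothetical data: one varied payload hidden between two padding segments.
padding = b"\x00" * 1024
payload = bytes(range(256)) * 4
J = {hashlib.sha256(payload).hexdigest()}

print(locate_by_entropy([padding, payload, padding], J, 1.0, 2))  # 1
print(locate_by_entropy([padding, padding, padding], J, 1.0, 2))  # None
```

The highest-entropy segment is hashed first, so a match on it ends the traversal before the padding segments are hashed at all.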

In addition, after step S300, the present application further includes the following steps S341 to S346:

Step S341, obtaining a second data segment information entropy of the first 128 bytes of each segment of unstructured data separately to obtain a second data segment information entropy set T = (T1, T2, ..., Tζ, ..., Tg); where ζ = 1, 2, ..., g, Tζ is the second data segment information entropy of the ζ-th segment of unstructured data, and g is the number of segments of unstructured data.

Step S342, obtaining a second information entropy fluctuation value TS according to T; TS = (Σ_{ζ=1}^{g} (Tζ - avg(T))^2) / g; where avg() is a preset average value determination function.

Step S343, if TS is greater than the second preset information entropy fluctuation value, arranging all the second data segment information entropies in T in order from large to small to obtain a second ordered data segment information entropy set TA = (TA1, TA2, ..., TAg); where each element of TA is the second data segment information entropy at the corresponding position in this order, and g is the number of segments of unstructured data.

Step S344, extracting a fourth preset number of second data segment information entropies from TA in order from large to small to obtain a target second data segment information entropy set TB = (TB1, TB2, ..., TBη, ..., TBθ); where η = 1, 2, ..., θ, TBη is the target second data segment information entropy ranked η-th in TB, and θ is the number of target second data segment information entropies.

Step S345, traversing the preset unstructured data segment partial hash value set L = (L1, L2, ..., Lt, ..., Lu) using, in sequence, the second hash value of the segment of unstructured data corresponding to each target second data segment information entropy in TB; where t = 1, 2, ..., u.

Step S346, if the second hash value of the unstructured data corresponding to TBη is the same as Lt, stopping the traversal and determining the unstructured data corresponding to TBη as the target unstructured data; the target unstructured data is unstructured data embedded with malicious code.

It should be noted that this embodiment is a further refinement of the previous embodiment (steps S331 to S336): only the first 128 bytes of each segment of unstructured data are used to obtain the information entropy and for subsequent processing, so fewer computer resources and less time are required. Multiple tests show that the accuracy of determining whether a segment of unstructured data in the file to be detected is embedded with malicious code using the second hash value corresponding to the first 128 bytes of the unstructured data is greater than 90%. The specific implementation and principles are the same and are not repeated here.
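A minimal sketch of the prefix-based entropy from step S341 follows, again assuming byte-level Shannon entropy as the entropy method; `prefix_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter

PREFIX_LEN = 128  # prefix length stated in the text


def prefix_entropy(segment: bytes) -> float:
    """Byte-level Shannon entropy of the first 128 bytes (0.0 if empty)."""
    prefix = segment[:PREFIX_LEN]
    if not prefix:
        return 0.0
    n = len(prefix)
    return sum((c / n) * math.log2(n / c) for c in Counter(prefix).values())


print(prefix_entropy(b"\x00" * 4096))         # 0.0
print(prefix_entropy(bytes(range(256)) * 4))  # 7.0
```

With this entropy definition a 128-byte sample can contain at most 128 distinct byte values, so the result is capped at log2(128) = 7 bits per byte rather than 8. Any preset fluctuation threshold for the prefix variant would therefore need to be chosen on that narrower scale.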

Similarly, the steps S341 to S346 may be performed after the step S300 or after the step S200.

Embodiments of the present application also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the application described in the present specification when the program product is run on the electronic device.

Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.

From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, any of which may be referred to herein as a "circuit," "module," or "system."

An electronic device according to this embodiment of the application is described below. The electronic device is merely an example and should not impose any limitation on the functionality and scope of use of embodiments of the present application.

The electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: at least one processor, at least one memory, and a bus connecting the various system components, including the memory and the processor.

The memory stores program code that is executable by the processor to cause the processor to perform the steps according to the various exemplary embodiments of the present application described in the exemplary method section of this specification.

The storage may include readable media in the form of volatile storage, such as Random Access Memory (RAM) and/or cache memory, and may further include Read Only Memory (ROM).

The storage may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

The bus may be one or more of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.

The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. In addition, the electronic device may communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through a network adapter. As shown, the network adapter communicates with other modules of the electronic device over a bus. It should be appreciated that, although not shown, other hardware and/or software modules may be used in connection with the electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the aspects of the application may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the application as described in the "exemplary method" section of this specification, when the program product is run on the terminal device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present application, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.

It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.

The present application is not limited to the above embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims

1. A file detection method, comprising:

acquiring a file to be detected; the file to be detected comprises a plurality of segments of structured data and a plurality of segments of unstructured data; the structured data and the unstructured data are alternately arranged;

obtaining, according to the plurality of segments of structured data, structured profile feature information corresponding to the file to be detected; the structured profile feature information comprises the number of segments of the structured data, and the start address and the length of each segment of structured data;

determining whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold according to the structured profile feature information and a first preset detection method;

if yes, processing the content of the plurality of segments of structured data based on a second preset detection method to determine whether the file to be detected is a malicious file; the computer resources required for implementing the second preset detection method are greater than those required for implementing the first preset detection method.
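
By way of non-limiting illustration only, the two-stage gating recited in claim 1 might be sketched as below. The names `detect_file`, `cheap_probability`, and `expensive_check` are hypothetical; the two callables stand in for the first and second preset detection methods, whose internals are not specified by this sketch.

```python
from typing import Callable, Sequence, Tuple

Segment = Tuple[int, bytes]  # (start address, content) of one structured segment


def detect_file(
    segments: Sequence[Segment],
    cheap_probability: Callable[[tuple], float],
    expensive_check: Callable[[Sequence[bytes]], bool],
    threshold: float,
) -> bool:
    """Return True when the file is judged malicious (illustrative sketch only)."""
    # Profile: segment count, then start address and length of each segment.
    profile = (len(segments),) + tuple(
        value for start, data in segments for value in (start, len(data))
    )
    # First stage reads only the profile, never the segment content.
    if cheap_probability(profile) <= threshold:
        return False
    # Second stage inspects the content and is assumed to cost more.
    return expensive_check([data for _, data in segments])
```

The point of this arrangement is that `expensive_check` is never called for files whose profile falls at or below the threshold, so the more costly content analysis is reserved for suspicious files.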

2. The file detection method according to claim 1, wherein the structured profile feature information is A = (a, DA1, LA1, DA2, LA2, ..., DAb, LAb, ..., DAa, LAa), b = 1, 2, ..., a; where a is the number of segments of the structured data, DAb is the start address of the b-th segment of the structured data, and LAb is the length of the b-th segment of the structured data.

3. The method for detecting files according to claim 2, wherein determining whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold according to the structured profile feature information and a first preset detection method comprises:

If Cc is larger than a preset matching degree threshold, determining that the probability that the file to be detected is a malicious file is larger than a preset probability threshold.

4. The method for detecting files according to claim 2, wherein determining whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold according to the structured profile feature information and a first preset detection method comprises:

obtaining a hash value AD of A;

5. The file detection method according to claim 1, wherein after acquiring the file to be detected, the file detection method further comprises:

obtaining, according to the plurality of segments of unstructured data, unstructured profile feature information corresponding to the file to be detected; the unstructured profile feature information comprises the number of segments of the unstructured data, and the start address and the length of each segment of unstructured data;

determining whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold according to the unstructured profile feature information and a third preset detection method;

if yes, processing the content of the plurality of segments of unstructured data based on a fourth preset detection method to determine whether the file to be detected is a malicious file; wherein the computer resources required for implementing the fourth preset detection method are greater than the computer resources required for implementing the third preset detection method.

6. The file detection method according to claim 5, wherein the unstructured profile feature information is E = (g, DE1, LE1, DE2, LE2, ..., DEh, LEh, ..., DEg, LEg), h = 1, 2, ..., g; wherein g is the number of segments of unstructured data; g = a or g = a-1 or g = a+1; DEh is the start address of the h-th segment of unstructured data, and LEh is the length of the h-th segment of unstructured data.
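
As a non-limiting illustration of why g may equal a, a-1, or a+1, the sketch below derives E from the positions of the structured segments. It assumes, for illustration only, that the structured segments do not overlap or touch and that every byte not covered by a structured segment is unstructured; the function name and the `file_size` parameter are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start address, length)


def unstructured_profile(structured: List[Span], file_size: int) -> tuple:
    """Build E = (g, DE1, LE1, ..., DEg, LEg) from the gaps around structured spans."""
    gaps: List[Span] = []
    cursor = 0
    for start, length in sorted(structured):
        if start > cursor:  # bytes before this span are unstructured
            gaps.append((cursor, start - cursor))
        cursor = start + length
    if cursor < file_size:  # trailing unstructured bytes
        gaps.append((cursor, file_size - cursor))
    return (len(gaps),) + tuple(value for gap in gaps for value in gap)
```

With two structured segments (a = 2), a file that starts and ends with structured data yields g = 1, a file that starts or ends with unstructured data (but not both) yields g = 2, and a file that both starts and ends with unstructured data yields g = 3.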

7. The method for detecting files according to claim 6, wherein determining whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold according to the unstructured profile feature information and a third preset detection method comprises:

obtaining an unstructured profile matching degree set G = (G1, G2, ..., Gi, ..., Gj), i = 1, 2, ..., j according to a preset unstructured profile feature information set F = (F1, F2, ..., Fi, ..., Fj); wherein j is the number of pieces of preset unstructured profile feature information; Fi is the i-th preset unstructured profile feature information; Gi is the matching degree of E and Fi;

if Gi is greater than the preset matching degree threshold, determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold.

8. The method for detecting files according to claim 6, wherein determining whether the probability that the file to be detected is a malicious file is greater than a preset probability threshold according to the unstructured profile feature information and a third preset detection method comprises:

acquiring a hash value EH of E;

if EH is the same as Hk in the preset unstructured profile feature hash value set H = (H1, H2, ..., Hk, ..., Hm), determining that the probability that the file to be detected is a malicious file is greater than the preset probability threshold; where k = 1, 2, ..., m.
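
By way of non-limiting illustration, the hash comparison of claim 8 might be sketched as follows. The comma-separated text encoding of E and the use of SHA-256 are assumptions of this sketch, since the claim does not specify how the hash value EH is computed; the function names are hypothetical.

```python
import hashlib
from typing import Iterable, Set


def profile_hash(profile: Iterable[int]) -> str:
    """Hash a profile tuple via a canonical text encoding (assumed, not claimed)."""
    encoded = ",".join(str(value) for value in profile).encode("ascii")
    return hashlib.sha256(encoded).hexdigest()


def exceeds_threshold(profile: Iterable[int], known_hashes: Set[str]) -> bool:
    """True when EH equals some Hk in the preset hash value set H."""
    return profile_hash(profile) in known_hashes
```

Storing H as a set makes the comparison a single membership test, regardless of how many preset hash values there are.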

9. The method for detecting files according to claim 5, wherein the processing the content of the plurality of unstructured data based on the fourth preset detection method to determine whether the file to be detected is a malicious file includes:

if Ih is the same as Jp in a preset unstructured data segment whole hash value set J = (J1, J2, ..., Jp, ..., Jq), determining that the file to be detected is a malicious file; wherein p = 1, 2, ..., q.

10. The method for detecting files according to claim 5, wherein the processing the content of the plurality of unstructured data based on the fourth preset detection method to determine whether the file to be detected is a malicious file includes:

11. The method for detecting files according to claim 5, wherein the processing the content of the plurality of unstructured data based on the fourth preset detection method to determine whether the file to be detected is a malicious file includes:

if K1 is the same as Lt in a preset unstructured data segment partial hash value set L = (L1, L2, ..., Lt, ..., Lu), determining that the file to be detected is a malicious file; where t = 1, 2, ..., u.

12. A non-transitory computer readable storage medium having stored therein at least one instruction or at least one program, wherein the at least one instruction or the at least one program is loaded and executed by a processor to implement the file detection method of any one of claims 1-11.

13. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 12.