CN116992448B

CN116992448B - Sample determination method, device, equipment and medium based on importance degree of data source

Info

Publication number: CN116992448B
Application number: CN202311254330.9A
Authority: CN
Inventors: 吕经祥; 李石磊; 肖新光
Original assignee: Beijing Antiy Network Technology Co Ltd
Current assignee: Beijing Antiy Network Technology Co Ltd
Priority date: 2023-09-27
Filing date: 2023-09-27
Publication date: 2023-12-15
Anticipated expiration: 2043-09-27
Also published as: CN116992448A

Abstract

The invention provides a sample determination method, device, equipment and medium based on the importance degree of data sources, relating to the field of data processing. The method comprises: in response to receiving a target malicious file, acquiring the name string set by each target data source for the target malicious file to obtain a target name string list; splitting each name string to obtain a target candidate string list set; determining the importance degree of each target data source according to the target candidate string list set; and determining a target similar sample file corresponding to the target malicious file. The name string that each target data source sets for the target malicious file is split to obtain the number of strings that the data source uses for file feature analysis; the importance degree of each data source is determined from that number, and the similar sample files are determined from the importance degrees, so that the obtained similar sample files match the target malicious file more accurately.

Description

Sample determination method, device, equipment and medium based on importance degree of data source

Technical Field

The present application relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for determining a sample based on importance of a data source.

Background

Current methods determine similar sample files by collecting the file features of every historical sample file and computing statistics over them. Because each historical sample file has many file features, collection and statistics consume substantial system resources, so when the number of historical sample files is large these methods greatly increase the computing power the system must use. In addition, the different data sources that transmit historical sample files apply different detection rules and emphasize different file features, so the accuracy of the similar sample files determined from different data sources is uneven.

Disclosure of Invention

In view of the above, the present application provides a sample determination method, device, equipment and medium based on the importance degree of data sources, which at least partially solve the technical problem in the prior art that the accuracy of similar sample files determined from different data sources varies too widely. The following technical solution is adopted:

According to one aspect of the present application, there is provided a sample determination method based on the importance degree of data sources, the method comprising the following steps:

in response to receiving a target malicious file, acquiring the name string set by each target data source for the target malicious file to obtain a target name string list $Z = (Z_1, Z_2, \ldots, Z_j, \ldots, Z_m)$; where $j = 1, 2, \ldots, m$; $m$ is the number of target data sources; $Z_j$ is the name string set by the $j$-th target data source for the target malicious file;

splitting $Z_j$ according to the preset characters corresponding to the $j$-th target data source to obtain a target candidate string list set $N = (N_1, N_2, \ldots, N_j, \ldots, N_m)$; $N_j = (N_{j1}, N_{j2}, \ldots, N_{jc}, \ldots, N_{jf(j)})$; where $c = 1, 2, \ldots, f(j)$; $f(j)$ is the number of target candidate strings contained in $Z_j$; $N_j$ is the target candidate string list corresponding to $Z_j$; $N_{jc}$ is the $c$-th target candidate string contained in $Z_j$;

determining the importance degree of each target data source according to the target candidate string list set $N$ to obtain an importance degree set $Q = (Q_1, Q_2, \ldots, Q_j, \ldots, Q_m)$; where $Q_j$ is the importance degree of the $j$-th target data source; $Q_j = f(j) / \sum_{j=1}^{m} f(j)$;

and determining at least one target similar sample file corresponding to the target malicious file according to the importance degree of each target data source.

In an exemplary embodiment of the present application, determining at least one target similar sample file corresponding to the target malicious file according to the importance degree of each target data source comprises:

determining a plurality of target sample files from a plurality of history sample files according to the target name string list $Z$;

acquiring the name string set by each target data source for each target sample file to obtain a sample name string list set $P = (P_1, P_2, \ldots, P_j, \ldots, P_m)$; $P_j = (P_{j1}, P_{j2}, \ldots, P_{ja}, \ldots, P_{jb})$; where $a = 1, 2, \ldots, b$; $b$ is the number of target sample files; $P_j$ is the sample name string list corresponding to the $j$-th target data source; $P_{ja}$ is the name string set by the $j$-th target data source for the $a$-th target sample file;

splitting $P_{ja}$ according to the preset characters corresponding to the $j$-th target data source to obtain a sample candidate string list set $I_j = (I_{j1}, I_{j2}, \ldots, I_{ja}, \ldots, I_{jb})$ corresponding to the $j$-th target data source; $I_{ja} = (I_{ja1}, I_{ja2}, \ldots, I_{jag}, \ldots, I_{jah(ja)})$; where $g = 1, 2, \ldots, h(ja)$; $h(ja)$ is the number of sample candidate strings contained in $P_{ja}$; $I_{ja}$ is the sample candidate string list corresponding to $P_{ja}$; $I_{jag}$ is the $g$-th sample candidate string contained in $P_{ja}$;

determining, according to $I_{ja}$ and $N_j$, a sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file;

and determining, according to $H_a$, at least one target similar sample file corresponding to the target malicious file from the $b$ target sample files.

In an exemplary embodiment of the present application, determining the sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file according to $I_{ja}$ and $N_j$ comprises:

obtaining, according to $N_j$, a target string sentence $C_j$ of the target malicious file corresponding to the $j$-th target data source;

obtaining, according to $I_{ja}$, a sample string sentence $U_{ja}$ of the $a$-th target sample file corresponding to the $j$-th target data source;

determining a semantic matching degree $A_{ja}$ between $C_j$ and $U_{ja}$;

and determining, according to $A_{ja}$ and $Q_j$, the sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file.

In an exemplary embodiment of the present application, determining the sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file according to $A_{ja}$ and $Q_j$ comprises:

determining the sample matching degree between the $a$-th target sample file and the target malicious file according to $H_a = \left(\sum_{j=1}^{m} (A_{ja} \times Q_j)\right) / m$.

In an exemplary embodiment of the present application, determining, according to $H_a$, at least one target similar sample file corresponding to the target malicious file from the $b$ target sample files comprises:

if $H_a \ge H_0$, determining the $a$-th target sample file as a target similar sample file corresponding to the target malicious file; where $H_0$ is a preset sample matching degree threshold.

In an exemplary embodiment of the present application, determining a plurality of target sample files from a plurality of history sample files comprises:

acquiring the name strings corresponding to $s$ history sample files to obtain a history name string list $D = (D_1, D_2, \ldots, D_w, \ldots, D_s)$; where $w = 1, 2, \ldots, s$; $D_w$ is the name string corresponding to the $w$-th history sample file;

splitting $D_w$ to obtain the number of strings $B_w$ corresponding to $D_w$;

if $\mathrm{MIN}(f(1), f(2), \ldots, f(j), \ldots, f(m)) \le B_w \le \mathrm{MAX}(f(1), f(2), \ldots, f(j), \ldots, f(m))$, determining the $w$-th history sample file as a target sample file; where MIN() is a preset minimum value determination function and MAX() is a preset maximum value determination function.

In an exemplary embodiment of the present application, $Q_j$ may also be determined by the following steps:

traversing $I_j$ and determining the number of occurrences $L_{jag}$ of $I_{jag}$ in $I_j$;

if $L_{jag} \ge L_0$, determining $I_{jag}$ as a sample target string, so as to obtain the number $M_j$ of sample target strings corresponding to the $j$-th target data source; where $L_0$ is a preset character threshold;

determining the importance degree of the $j$-th target data source as $Q_j = M_j / \sum_{j=1}^{m} M_j$.

According to one aspect of the present application, there is provided a sample determination device based on the importance degree of data sources, comprising:

a target name string acquisition module, configured to acquire, when a target malicious file is received, the name string set by each target data source for the target malicious file to obtain a target name string list $Z = (Z_1, Z_2, \ldots, Z_j, \ldots, Z_m)$; where $j = 1, 2, \ldots, m$; $m$ is the number of target data sources; $Z_j$ is the name string set by the $j$-th target data source for the target malicious file;

a target candidate string determination module, configured to split $Z_j$ according to the preset characters corresponding to the $j$-th target data source to obtain a target candidate string list set $N = (N_1, N_2, \ldots, N_j, \ldots, N_m)$; $N_j = (N_{j1}, N_{j2}, \ldots, N_{jc}, \ldots, N_{jf(j)})$; where $c = 1, 2, \ldots, f(j)$; $f(j)$ is the number of target candidate strings contained in $Z_j$; $N_j$ is the target candidate string list corresponding to $Z_j$; $N_{jc}$ is the $c$-th target candidate string contained in $Z_j$;

an importance degree determination module, configured to determine the importance degree of each target data source according to the target candidate string list set $N$ to obtain an importance degree set $Q = (Q_1, Q_2, \ldots, Q_j, \ldots, Q_m)$; where $Q_j$ is the importance degree of the $j$-th target data source; $Q_j = f(j) / \sum_{j=1}^{m} f(j)$;

and a similar sample determination module, configured to determine at least one target similar sample file corresponding to the target malicious file according to the importance degree of each target data source.

According to one aspect of the present application, there is provided a non-transitory computer readable storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement the aforementioned data source importance-based sample determination method.

According to one aspect of the present application, there is provided an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.

The application has at least the following beneficial effects:

According to the method, when a target malicious file is received, the name string set by each target data source for the target malicious file is acquired and split according to the preset characters corresponding to that data source to obtain a plurality of target candidate strings. The importance degree of each target data source is determined from the number of its target candidate strings, and a target similar sample file corresponding to the target malicious file is then determined from a plurality of target sample files according to the importance degrees. In other words, splitting the name string of the target malicious file per target data source yields the number of strings that each data source uses for file feature analysis; that number determines the importance degree of the data source, and the importance degrees determine the similar sample files.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a method for determining a sample based on importance of a data source according to an embodiment of the present invention;

FIG. 2 is a block diagram of a sample determination device based on importance of a data source according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

A method for determining a sample based on importance of a data source, as shown in fig. 1, the method comprising the steps of:

Step S100, in response to receiving a target malicious file, acquiring the name string set by each target data source for the target malicious file to obtain a target name string list $Z = (Z_1, Z_2, \ldots, Z_j, \ldots, Z_m)$; where $j = 1, 2, \ldots, m$; $m$ is the number of target data sources; $Z_j$ is the name string set by the $j$-th target data source for the target malicious file;

The target malicious file is the malicious file for which similar sample files are sought. Upon receipt of the target malicious file, a plurality of target sample files are determined from a plurality of history sample files. A target sample file may be any history sample file, or a history sample file that is set as required or that meets a preset condition.

A target data source is a supplier of history sample files. Each target data source has a corresponding detection rule, with which it performs malicious detection on files to be detected. Each target data source also has a number of preset characters, which serve as the separator characters in the name strings it produces. A name string is the string of the virus name of the virus in the corresponding target malicious file, and comprises an attack type string, a virus family string, an application platform string, a virus variant string, and the like. Because each target data source extracts name strings differently, the order of the information in the name strings that different target data sources extract for the same file may differ. The name string corresponding to the target malicious file is therefore split by the preset characters of the corresponding target data source to obtain a plurality of target candidate strings for each target data source, the target candidate strings being the attack type string, virus family string, application platform string, virus variant string, and the like.

Step S200, splitting $Z_j$ according to the preset characters corresponding to the $j$-th target data source to obtain a target candidate string list set $N = (N_1, N_2, \ldots, N_j, \ldots, N_m)$; $N_j = (N_{j1}, N_{j2}, \ldots, N_{jc}, \ldots, N_{jf(j)})$; where $c = 1, 2, \ldots, f(j)$; $f(j)$ is the number of target candidate strings contained in $Z_j$; $N_j$ is the target candidate string list corresponding to $Z_j$; $N_{jc}$ is the $c$-th target candidate string contained in $Z_j$;

The target candidate strings are the strings obtained by splitting the name string of the target malicious file according to the preset characters corresponding to the $j$-th target data source.
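The splitting in step S200 can be sketched as follows. This is a minimal illustration, not the patented implementation: the source names, the delimiter sets in `PRESET_CHARS`, and the sample virus names are all hypothetical, since the text only states that each target data source has its own preset characters.

```python
import re

# Hypothetical preset (separator) characters per target data source.
PRESET_CHARS = {
    "source_a": "./",
    "source_b": ":!-",
}

def split_name(name: str, preset_chars: str) -> list[str]:
    """Split a name string on any preset character, dropping empty parts."""
    pattern = "[" + re.escape(preset_chars) + "]"
    return [part for part in re.split(pattern, name) if part]

def build_candidate_lists(names: dict[str, str]) -> dict[str, list[str]]:
    """Build N_j for every data source j from its name string Z_j."""
    return {src: split_name(name, PRESET_CHARS[src]) for src, name in names.items()}

# Z: name strings set by each source for the same target malicious file (made up).
Z = {"source_a": "Trojan/Win32.Agent.abc", "source_b": "Win32:Agent-XYZ!tr"}
N = build_candidate_lists(Z)
f = {src: len(parts) for src, parts in N.items()}  # f(j): candidate count per source
```

Here `f` holds the per-source counts $f(j)$ that the later steps use.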

Step S300, determining the importance degree of each target data source according to the target candidate string list set $N$ to obtain an importance degree set $Q = (Q_1, Q_2, \ldots, Q_j, \ldots, Q_m)$; where $Q_j$ is the importance degree of the $j$-th target data source; $Q_j = f(j) / \sum_{j=1}^{m} f(j)$;

The importance degree of a target data source is the weight of that data source, that is, the proportion it carries when the target similar sample file is determined.
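As a small worked sketch of the formula in step S300, the importance degrees are simply the candidate-string counts normalized to sum to 1. The function name and the counts below are illustrative assumptions.

```python
def importance_from_counts(counts: list[int]) -> list[float]:
    """Q_j = f(j) / sum of f(j) over all m target data sources."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no candidate strings were produced by any data source")
    return [c / total for c in counts]

f_counts = [4, 3, 5]          # hypothetical f(1), f(2), f(3)
Q = importance_from_counts(f_counts)
```

With these counts, the source whose name string splits into the most parts (5 of 12) receives the largest weight.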

Step S400, determining at least one target similar sample file corresponding to the target malicious file according to the importance degree of each target data source;

Further, in step S400, determining at least one target similar sample file corresponding to the target malicious file according to the importance degree of each target data source comprises:

Step S410, determining a plurality of target sample files from a plurality of history sample files according to the target name string list $Z$;

The history sample files are sample files that have already been detected, and include both malicious and non-malicious sample files. The target sample files are determined from the history sample files by comparing the file information of each history sample file with the file information of the target malicious file.

In step S410, determining a plurality of target sample files from a plurality of history sample files comprises:

Step S411, acquiring the name strings corresponding to $s$ history sample files to obtain a history name string list $D = (D_1, D_2, \ldots, D_w, \ldots, D_s)$; where $w = 1, 2, \ldots, s$; $D_w$ is the name string corresponding to the $w$-th history sample file;

Step S412, splitting $D_w$ to obtain the number of strings $B_w$ corresponding to $D_w$;

Step S413, if $\mathrm{MIN}(f(1), f(2), \ldots, f(j), \ldots, f(m)) \le B_w \le \mathrm{MAX}(f(1), f(2), \ldots, f(j), \ldots, f(m))$, determining the $w$-th history sample file as a target sample file; where MIN() is a preset minimum value determination function and MAX() is a preset maximum value determination function.

In step S410, the target sample files are determined according to the number of strings of each history sample file: the name string of each history sample file is acquired and split to obtain the corresponding number of strings. If that number lies between the minimum and the maximum of the numbers of target candidate strings determined by all the target data sources, the number of strings obtained by splitting the history sample file conforms to the splitting rules of the target data sources, and the history sample file is determined as a target sample file.
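Steps S411 to S413 amount to a range filter on the split count $B_w$. The sketch below assumes a single hypothetical delimiter set for splitting the history name strings, because the text does not say which preset characters are applied to $D_w$; the names and counts are made up.

```python
import re

def count_parts(name: str, preset_chars: str) -> int:
    """B_w: number of non-empty strings obtained by splitting a name string."""
    pattern = "[" + re.escape(preset_chars) + "]"
    return sum(1 for part in re.split(pattern, name) if part)

def select_target_samples(history_names: list[str], f_counts: list[int],
                          preset_chars: str) -> list[int]:
    """Return indices w with MIN(f) <= B_w <= MAX(f)."""
    lo, hi = min(f_counts), max(f_counts)
    return [w for w, name in enumerate(history_names)
            if lo <= count_parts(name, preset_chars) <= hi]

history = [
    "Trojan/Win32.Agent.abc",        # splits into 4 parts
    "Worm.Generic",                  # splits into 2 parts
    "Backdoor/Linux.Mirai.a.b.c.d",  # splits into 7 parts
]
targets = select_target_samples(history, f_counts=[3, 4, 5], preset_chars="./")
```

Only the first history sample falls inside the range [3, 5], so it alone is kept as a target sample file.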

In addition, the target sample files may also be determined as follows:

Each history sample file is traversed, and a history sample file is determined as a target sample file if its file information is the same as the file information of the target malicious file. The file information of the target malicious file includes its file format, file type, encoding mode, and the like. Screening by file information in this way provides a preliminary filter over the very large set of history sample files.

Step S420, acquiring the name string set by each target data source for each target sample file to obtain a sample name string list set $P = (P_1, P_2, \ldots, P_j, \ldots, P_m)$; $P_j = (P_{j1}, P_{j2}, \ldots, P_{ja}, \ldots, P_{jb})$; where $a = 1, 2, \ldots, b$; $b$ is the number of target sample files; $P_j$ is the sample name string list corresponding to the $j$-th target data source; $P_{ja}$ is the name string set by the $j$-th target data source for the $a$-th target sample file;

Step S430, splitting $P_{ja}$ according to the preset characters corresponding to the $j$-th target data source to obtain a sample candidate string list set $I_j = (I_{j1}, I_{j2}, \ldots, I_{ja}, \ldots, I_{jb})$ corresponding to the $j$-th target data source; $I_{ja} = (I_{ja1}, I_{ja2}, \ldots, I_{jag}, \ldots, I_{jah(ja)})$; where $g = 1, 2, \ldots, h(ja)$; $h(ja)$ is the number of sample candidate strings contained in $P_{ja}$; $I_{ja}$ is the sample candidate string list corresponding to $P_{ja}$; $I_{jag}$ is the $g$-th sample candidate string contained in $P_{ja}$;

Step S440, determining, according to $I_{ja}$ and $N_j$, a sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file;

In step S440, determining the sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file according to $I_{ja}$ and $N_j$ comprises:

Step S441, obtaining, according to $N_j$, a target string sentence $C_j$ of the target malicious file corresponding to the $j$-th target data source;

Step S442, obtaining, according to $I_{ja}$, a sample string sentence $U_{ja}$ of the $a$-th target sample file corresponding to the $j$-th target data source;

Step S443, determining a semantic matching degree $A_{ja}$ between $C_j$ and $U_{ja}$;

Step S444, determining, according to $A_{ja}$ and $Q_j$, the sample matching degree between the $a$-th target sample file and the target malicious file as $H_a = \left(\sum_{j=1}^{m} (A_{ja} \times Q_j)\right) / m$.
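Steps S441 to S444 can be sketched as below under two loudly stated assumptions: a "string sentence" is taken to be the candidate strings joined by spaces, and the semantic matching degree is replaced by `difflib.SequenceMatcher` character similarity, which is only a stand-in and not a semantic model. The text does not specify either choice.

```python
from difflib import SequenceMatcher

def to_sentence(parts: list[str]) -> str:
    """Assumed construction of a string sentence (C_j or U_ja): join with spaces."""
    return " ".join(parts)

def semantic_match(c_j: str, u_ja: str) -> float:
    """Stand-in for A_ja: character-level similarity in [0, 1], NOT true semantics."""
    return SequenceMatcher(None, c_j, u_ja).ratio()

def sample_match_degree(N: list[list[str]], I_a: list[list[str]], Q: list[float]) -> float:
    """H_a = (sum over j of A_ja * Q_j) / m."""
    m = len(Q)
    weighted = [semantic_match(to_sentence(N[j]), to_sentence(I_a[j])) * Q[j]
                for j in range(m)]
    return sum(weighted) / m

# Hypothetical data: two sources, and a sample whose names equal the target's names.
N = [["Trojan", "Win32", "Agent"], ["Win32", "Agent", "tr"]]
I_a = [["Trojan", "Win32", "Agent"], ["Win32", "Agent", "tr"]]
Q = [0.5, 0.5]
H_a = sample_match_degree(N, I_a, Q)
```

Because the weights $Q_j$ already sum to 1 and the formula also divides by $m$, a sample with identical names scores $1/m$ (0.5 here) rather than 1, so any threshold $H_0$ would need to be chosen on that scale.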

Step S450, determining, according to $H_a$, at least one target similar sample file corresponding to the target malicious file from the $b$ target sample files;

Further, in step S450, determining, according to $H_a$, at least one target similar sample file corresponding to the target malicious file from the $b$ target sample files comprises:

Step S451, if $H_a \ge H_0$, determining the $a$-th target sample file as a target similar sample file corresponding to the target malicious file; where $H_0$ is a preset sample matching degree threshold.

Comparing the sample matching degree with a preset sample matching degree threshold, and if the sample matching degree is greater than or equal to the preset sample matching degree threshold, determining the corresponding target sample file as a target similar sample file corresponding to the target malicious file.
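The threshold test of step S451 is a one-line filter; the sketch below uses made-up matching degrees and a made-up threshold purely for illustration.

```python
def select_similar(H: list[float], h0: float) -> list[int]:
    """Return indices a of target sample files with H_a >= H_0."""
    return [a for a, h_a in enumerate(H) if h_a >= h0]

H = [0.42, 0.10, 0.30]                 # hypothetical sample matching degrees
similar = select_similar(H, h0=0.30)   # hypothetical threshold H_0
```

Note that the comparison is inclusive, so a sample whose matching degree equals the threshold is kept.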

In addition, step S300 is a first embodiment of the method for determining the importance degree of a target data source: the name string of the target malicious file is split for each target data source to obtain the target candidate strings corresponding to each target data source, and the importance degree of each target data source is then determined from the number of its target candidate strings.

In a second embodiment of the method for determining the importance degree of a target data source, $Q_j$ can also be determined by the following steps:

Step S310, determining, according to $I_j$, a plurality of sample target strings from the plurality of sample candidate strings corresponding to the $j$-th target data source;

Further, in step S310, determining, according to $I_j$, a plurality of sample target strings from the plurality of sample candidate strings corresponding to the $j$-th target data source comprises:

Step S311, traversing $I_j$ and determining the number of occurrences $L_{jag}$ of $I_{jag}$ in $I_j$;

Step S312, if $L_{jag} \ge L_0$, determining $I_{jag}$ as a sample target string, so as to obtain the number $M_j$ of sample target strings corresponding to the $j$-th target data source; where $L_0$ is a preset character threshold.

Step S320, determining the importance degree of the $j$-th target data source according to the plurality of sample target strings corresponding to the $j$-th target data source;

Step S321, acquiring the number $M_j$ of sample target strings corresponding to the $j$-th target data source;

Step S322, determining the importance degree of the $j$-th target data source as $Q_j = M_j / \sum_{j=1}^{m} M_j$.

In the second embodiment, the importance degree of a target data source is determined from the number of sample target strings corresponding to that data source. Compared with the first embodiment, which uses the number of target candidate strings obtained from the target malicious file, the second embodiment relies on the sample candidate strings corresponding to the same target data source, so the number of samples is larger; moreover, the target sample files are history sample files that have passed transmission verification by each target data source. The importance degree determined in this way is therefore more accurate.
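The second embodiment (steps S311 to S322) can be sketched with a frequency count. One point is an assumption on my part: $M_j$ is counted here as the number of distinct strings whose occurrence count reaches $L_0$, since the text does not say whether repeated strings are counted once or many times. The sample data is made up.

```python
from collections import Counter

def count_sample_target_strings(I_j: list[list[str]], l0: int) -> int:
    """M_j: number of distinct strings occurring at least l0 times across I_j."""
    occurrences = Counter(s for I_ja in I_j for s in I_ja)  # L_jag per string
    return sum(1 for n in occurrences.values() if n >= l0)

def importance_from_samples(I: list[list[list[str]]], l0: int) -> list[float]:
    """Q_j = M_j / sum of M_j over all m target data sources."""
    M = [count_sample_target_strings(I_j, l0) for I_j in I]
    total = sum(M)
    if total == 0:
        raise ValueError("no string reached the occurrence threshold")
    return [m_j / total for m_j in M]

# I[j][a] is the candidate list I_ja (hypothetical: 2 sources, 2 target samples).
I = [
    [["Trojan", "Win32", "Agent"], ["Trojan", "Win32", "Zbot"]],  # source 1
    [["Agent", "tr"], ["Zbot", "tr"]],                            # source 2
]
Q = importance_from_samples(I, l0=2)
```

In this toy data, source 1 has two recurring strings ("Trojan", "Win32") and source 2 has one ("tr"), giving weights of 2/3 and 1/3.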

Accordingly, a third embodiment can be derived from the first and second embodiments of the method for determining the importance degree of a target data source: the importance degree in the third embodiment is the sum of the importance degree obtained by the first embodiment and the importance degree obtained by the second embodiment, which further improves the accuracy of the determined importance degree.

In addition, step S440 is a first embodiment of the method for determining the sample matching degree $H_a$. The first embodiment performs semantic matching between the target string sentence and the sample string sentence to obtain the corresponding semantic matching degree, multiplies each semantic matching degree by the importance degree of the corresponding target data source, and averages these products over all target data sources for the same target sample file to obtain the sample matching degree. This approach is suitable when the number of sample candidate strings is large enough for string sentences to be determined. When the number of sample candidate strings or target candidate strings is so small that a string sentence cannot be determined, or the determined string sentence is too short, the semantic matching degree obtained is inaccurate. To solve this problem, a second embodiment of the method for determining the sample matching degree $H_a$ is provided, as shown in steps S500 to S510.

The second embodiment of the method for determining the sample matching degree $H_a$ is as follows:

Step S500, determining, according to $I_{ja}$ and $N_j$, a name matching degree list set $E = (E_1, E_2, \ldots, E_a, \ldots, E_b)$; $E_a = (E_{a1}, E_{a2}, \ldots, E_{aj}, \ldots, E_{am})$; where $E_a$ is the name matching degree list between the $a$-th target sample file and the target malicious file; $E_{aj}$ is the name matching degree between $P_{ja}$ and $Z_j$;

where $E_{aj}$ is determined by the following steps:

Step S501, performing intersection processing on $I_{ja}$ and $N_j$ to obtain the $K_{ja}$ matching candidate strings between $P_{ja}$ and $Z_j$;

Step S502, determining $K_{ja}$ as the name matching degree $E_{aj}$ between $P_{ja}$ and $Z_j$.

Step S510, determining, according to $E_{aj}$ and $Q_j$, the sample matching degree between the $a$-th target sample file and the target malicious file as $H_a = \sum_{j=1}^{m} (E_{aj} \times Q_j)$.
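Steps S500 to S510 can be sketched with set intersection. The sketch assumes that "intersection processing" means exact string equality and that duplicates within a list count once; the data below is hypothetical.

```python
def name_match_degree(I_ja: list[str], N_j: list[str]) -> int:
    """E_aj = K_ja: number of candidate strings shared by I_ja and N_j."""
    return len(set(I_ja) & set(N_j))

def sample_match_degree_by_overlap(I_a: list[list[str]], N: list[list[str]],
                                   Q: list[float]) -> float:
    """H_a = sum over j of E_aj * Q_j (no division by m in this embodiment)."""
    return sum(name_match_degree(I_a[j], N[j]) * Q[j] for j in range(len(Q)))

N = [["Trojan", "Win32", "Agent", "abc"], ["Win32", "Agent", "tr"]]
I_a = [["Trojan", "Win32", "Zbot"], ["Win32", "Agent", "tr"]]
Q = [0.5, 0.5]
H_a = sample_match_degree_by_overlap(I_a, N, Q)
```

Here the overlaps are 2 and 3 strings, so $H_a = 2 \times 0.5 + 3 \times 0.5 = 2.5$. Unlike the first embodiment, this score is not bounded by 1, which matters when a threshold $H_0$ is chosen.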

Further, the method for determining a malicious detection rule from the target malicious file and the target similar sample files is as follows:

Step S600, sorting the target similar sample files in descending order of their sample matching degrees to obtain a sorted similar sample file list $T_1, T_2, \ldots, T_n, \ldots, T_q$; where $n = 1, 2, \ldots, q$; $q$ is the number of target similar sample files; $T_n$ is the $n$-th target similar sample file after sorting by sample matching degree;

The target similar sample files are sorted by sample matching degree to obtain the sorted similar sample file list; the later a target similar sample file appears in the list, the lower its similarity to the target malicious file.

Step S610, letting $n = 1$;

Step S611, if $n \le q$, obtaining a candidate detection rule from the file features contained in $T_1, \ldots, T_n$ of the sorted similar sample file list and the file features contained in the target malicious file;

Step S612, performing malicious detection on $T_{n+1}, \ldots, T_q$ according to the candidate detection rule to obtain the $q - n$ corresponding malicious detection results;

Step S613, if every malicious detection result indicates that the corresponding target similar sample file is a malicious file, determining the candidate detection rule as the initial detection rule; otherwise, letting $n = n + 1$ and returning to step S611.

To further reduce the amount of data processed, candidate detection rules are determined by taking the file features of the target similar sample files in descending order of sample matching degree, together with the file features of the target malicious file, and each candidate detection rule obtained is then verified by performing malicious detection on $T_{n+1}, \ldots, T_q$ to obtain the corresponding malicious detection results. Because the target similar sample files are similar sample files of the target malicious file, they are themselves malicious files. If every malicious detection result indicates that the corresponding target similar sample file is malicious, the candidate detection rule passes verification and is determined as the initial detection rule. Otherwise, the file features of the target similar sample file with the next-highest sample matching degree are added to determine a new candidate detection rule, which is verified in turn, until verification passes or the file features of all target similar sample files have been used.
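The loop of steps S610 to S613 can be sketched as follows. The text does not define how a detection rule is built from file features or how it is applied, so this sketch invents a toy representation: a rule is the set of features common to all inputs, and a file is flagged as malicious when it contains every feature in the rule. Both `derive_by_intersection` and `detect_by_subset`, and all feature names, are hypothetical.

```python
from typing import Callable, Optional, Sequence

FeatureSet = frozenset

def derive_by_intersection(feature_sets: Sequence[FeatureSet]) -> FeatureSet:
    """Toy rule derivation: keep only features shared by every input file."""
    common = set(feature_sets[0])
    for fs in feature_sets[1:]:
        common &= set(fs)
    return frozenset(common)

def detect_by_subset(rule: FeatureSet, features: FeatureSet) -> bool:
    """Toy detection: malicious if the file has every feature in a non-empty rule."""
    return bool(rule) and rule <= features

def find_initial_rule(target: FeatureSet, sorted_similar: Sequence[FeatureSet],
                      derive: Callable, detect: Callable) -> Optional[FeatureSet]:
    """Grow the prefix T_1..T_n until the rule flags all of T_{n+1}..T_q."""
    q = len(sorted_similar)
    n = 1
    while n <= q:
        candidate = derive([target, *sorted_similar[:n]])
        if all(detect(candidate, t) for t in sorted_similar[n:]):
            return candidate
        n += 1
    return None

target = frozenset({"upx_packed", "mutex_x", "c2_beacon"})
T = [
    frozenset({"upx_packed", "mutex_x", "c2_beacon"}),  # T_1, best match
    frozenset({"upx_packed", "mutex_x"}),               # T_2
    frozenset({"upx_packed", "mutex_x", "autorun"}),    # T_3
]
rule = find_initial_rule(target, T, derive_by_intersection, detect_by_subset)
```

With $n = 1$ the candidate rule still requires `c2_beacon`, which $T_2$ lacks, so verification fails; with $n = 2$ the rule shrinks to the two shared features and $T_3$ passes. When $n = q$ there are no remaining files to check, so the last candidate is accepted vacuously, which mirrors the "until all file features have been used" stopping condition.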

Step S620, perform malicious detection on a plurality of preset verification sample files according to the initial detection rule to obtain the detection accuracy corresponding to the initial detection rule;

the verification sample file is a sample file for rule verification.

Step S630, if the detection accuracy is smaller than the preset detection accuracy threshold, obtain a supplementary sample file;

If the detection accuracy is smaller than the preset detection accuracy threshold, the initial detection rule is not accurate enough, so a supplementary sample file is acquired and the initial detection rule is redetermined.

Further, in step S630, the method for acquiring the supplementary sample file includes:

step S631, acquiring the determination time t of the initial detection rule;

Step S632, sequentially obtain the files to be detected that were received from $t$ to the current time, to obtain a to-be-detected file set $Y = (Y_1, Y_2, \ldots, Y_k, \ldots, Y_u)$; where $k = 1, 2, \ldots, u$; $u$ is the number of files to be detected received from $t$ to the current time; $Y_k$ is the $k$-th file to be detected received from $t$ to the current time;

step S633, let k=1;

Step S634, if $k \le u$, perform similarity detection on the file features included in $Y_k$ according to the initial detection rule to obtain a corresponding similarity detection result;

Step S635, if the similarity detection result indicates that $Y_k$ is a similar file, determine $Y_k$ as the supplementary sample file; otherwise, let $k = k+1$ and return to step S634.

A detection accuracy below the preset detection accuracy threshold may be caused by historical sample files that were acquired too early and therefore have low reference value. For this reason, the first similar file received after the determination time of the initial detection rule is selected as the supplementary sample file.
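Steps S631 to S635 amount to a first-match scan over the files received since the rule was determined. A minimal sketch follows, assuming each received file is represented as a `(receive_time, features)` pair and that the similarity check of the initial detection rule is supplied as a predicate; the names `pick_supplementary` and `is_similar` are hypothetical and not taken from the specification.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

ReceivedFile = Tuple[float, FrozenSet[str]]  # (receive time, file features)


def pick_supplementary(
    files: List[ReceivedFile],
    rule_time: float,
    is_similar: Callable[[FrozenSet[str]], bool],
) -> Optional[ReceivedFile]:
    """Return the first file received at or after rule_time that the rule deems similar."""
    # step S632: files received from t to the current time, in order of receipt
    pending = sorted((f for f in files if f[0] >= rule_time), key=lambda f: f[0])
    k = 0                                   # step S633 (0-based here)
    while k < len(pending):                 # step S634
        if is_similar(pending[k][1]):       # step S635
            return pending[k]
        k += 1
    return None                             # no similar file has arrived yet
```

Returning `None` when no file matches is a choice made for this sketch; the specification does not describe that case.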

Step S640, redetermining an initial detection rule according to the supplementary sample file, the target malicious file and the plurality of target similar sample files, and determining the initial detection rule as a malicious detection rule if the detection accuracy corresponding to the initial detection rule is greater than or equal to a preset detection accuracy threshold.

The initial detection rule is redetermined from the supplementary sample file, the target malicious file, and the plurality of target similar sample files, and is then verified against the verification sample files. If the corresponding detection accuracy is still smaller than the preset detection accuracy threshold, a new supplementary sample file is acquired and the initial detection rule is redetermined again. This repeats until the detection accuracy is greater than or equal to the preset detection accuracy threshold, at which point the initial detection rule meets the verification requirement and is determined as the malicious detection rule.
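The verify-and-supplement cycle of steps S620 to S640 can be sketched as a loop over three caller-supplied functions. Everything named here is an assumption made for illustration: `derive` builds a rule from a list of sample files, `accuracy` scores a rule against the verification sample files, and `next_supplement` returns a new supplementary sample file or `None`. The sketch also assumes that supplementary files accumulate across rounds and adds a `max_rounds` bound, neither of which the specification states.

```python
from typing import Callable, List, Optional, TypeVar

Sample = TypeVar("Sample")
Rule = TypeVar("Rule")


def refine_rule(
    base_samples: List[Sample],
    derive: Callable[[List[Sample]], Rule],
    accuracy: Callable[[Rule], float],
    next_supplement: Callable[[Rule], Optional[Sample]],
    threshold: float,
    max_rounds: int = 10,
) -> Optional[Rule]:
    """Redetermine the rule with supplementary samples until it is accurate enough."""
    samples = list(base_samples)            # target file plus target similar samples
    rule = derive(samples)
    for _ in range(max_rounds):
        if accuracy(rule) >= threshold:     # steps S620 and S640: verification passed
            return rule
        extra = next_supplement(rule)       # step S630: acquire a supplementary sample
        if extra is None:
            return None                     # nothing left to learn from
        samples.append(extra)
        rule = derive(samples)              # step S640: redetermine the rule
    return rule if accuracy(rule) >= threshold else None
```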

In this method, upon receiving a target malicious file, the name string that each target data source has set for the file is obtained and split according to the preset characters corresponding to that data source, yielding a plurality of target candidate strings per data source. The importance of each target data source is determined from the number of its target candidate strings, and the target similar sample files corresponding to the target malicious file are then determined from the plurality of target sample files according to these importance degrees. The number of strings obtained by splitting reflects how many strings each target data source derives from file feature analysis, so determining similar sample files by importance makes the similarity between the obtained similar sample files and the target malicious file more accurate.
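As an illustration of the splitting and importance computation summarized above, the following sketch splits each data source's name string on that source's preset characters and sets the importance to the source's share of all candidate strings, that is, f(j) divided by the sum of f over all sources. The data source names, name strings, and delimiter sets used with it are hypothetical examples, and dropping empty fragments is an assumption of the sketch.

```python
import re
from typing import Dict, List


def split_name(name: str, delimiters: str) -> List[str]:
    """Split a name string on any of the given preset characters, dropping empty parts."""
    if not delimiters:
        return [name] if name else []
    pattern = "[" + re.escape(delimiters) + "]"
    return [part for part in re.split(pattern, name) if part]


def importance(names: Dict[str, str], delimiters: Dict[str, str]) -> Dict[str, float]:
    """Importance of each data source: its candidate-string count over the total count."""
    candidates = {src: split_name(name, delimiters[src]) for src, name in names.items()}
    total = sum(len(parts) for parts in candidates.values())
    if total == 0:
        return {src: 0.0 for src in names}
    return {src: len(parts) / total for src, parts in candidates.items()}
```

For example, with the hypothetical inputs `{"source_a": "Trojan/Win32.Agent", "source_b": "Trojan.Generic"}` and delimiters `{"source_a": "/.", "source_b": "."}`, source_a yields three candidate strings and source_b two, so source_a receives the larger importance.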

A sample determination apparatus 100 based on the importance of a data source, as shown in fig. 2, includes:

The target name string obtaining module 110 is configured to obtain, when receiving the target malicious file, the name string set by each target data source for the target malicious file, so as to obtain a target name string list $Z = (Z_1, Z_2, \ldots, Z_j, \ldots, Z_m)$; where $j = 1, 2, \ldots, m$; $m$ is the number of target data sources; $Z_j$ is the name string set by the $j$-th target data source for the target malicious file;

The target candidate character string determining module 120 is configured to split $Z_j$ according to the preset character corresponding to the $j$-th target data source, to obtain a target candidate string list set $N = (N_1, N_2, \ldots, N_j, \ldots, N_m)$; $N_j = (N_{j1}, N_{j2}, \ldots, N_{jc}, \ldots, N_{jf(j)})$; where $c = 1, 2, \ldots, f(j)$; $f(j)$ is the number of target candidate strings contained in $Z_j$; $N_j$ is the target candidate string list corresponding to $Z_j$; $N_{jc}$ is the $c$-th target candidate string in the list corresponding to $Z_j$;

The importance determining module 130 is configured to determine the importance of each target data source according to the target candidate string list set $N$, so as to obtain an importance set $Q = (Q_1, Q_2, \ldots, Q_j, \ldots, Q_m)$; where $Q_j$ is the importance of the $j$-th target data source; $Q_j = f(j) / \sum_{j=1}^{m} f(j)$;

The similarity sample determining module 140 is configured to determine at least one target similarity sample file corresponding to the target malicious file according to the importance level of each target data source.
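The weighting performed by the similar sample determining module can be sketched as below. The sketch assumes that the semantic matching degree of each data source for each target sample file is already available as a number, reads the sample matching degree as the importance-weighted sum of those semantic matching degrees, and keeps the samples whose matching degree reaches a preset threshold, sorted from most to least similar. The function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple


def sample_matching_degree(semantic: Dict[str, float], importance: Dict[str, float]) -> float:
    """Importance-weighted sum of per-source semantic matching degrees (assumed form)."""
    return sum(semantic[src] * q for src, q in importance.items())


def similar_samples(
    semantic_by_sample: Dict[str, Dict[str, float]],
    importance: Dict[str, float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Keep samples whose matching degree reaches the threshold, best match first."""
    scored = [
        (sample_id, sample_matching_degree(semantic, importance))
        for sample_id, semantic in semantic_by_sample.items()
    ]
    kept = [(sample_id, h) for sample_id, h in scored if h >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Because the importance degrees sum to 1, a sample matching degree computed this way stays within the range of the semantic matching degrees, which keeps a single preset threshold meaningful regardless of how many data sources are involved.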

Embodiments of the present invention also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the invention as described in the specification, when said program product is run on the electronic device.

Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.

From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit," "module," or "system."

An electronic device according to this embodiment of the invention is described below. The electronic device is merely an example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention.

The electronic device takes the form of a general-purpose computing device. Components of the electronic device may include, but are not limited to: at least one processor, at least one memory, and a bus connecting the various system components (including the memory and the processor).

The memory stores program code that is executable by the processor to cause the processor to perform steps according to various exemplary embodiments of the invention described in the "exemplary methods" section of this specification.

The storage may include readable media in the form of volatile storage, such as Random Access Memory (RAM) and/or cache memory, and may further include Read Only Memory (ROM).

The storage may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

The bus may be one or more of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.

The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter. As shown, the network adapter communicates with other modules of the electronic device over a bus. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.

It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A method for determining a sample based on importance of a data source, the method comprising the steps of:

in response to receiving a target malicious file, acquiring the name string set by each target data source for the target malicious file, to obtain a target name string list $Z = (Z_1, Z_2, \ldots, Z_j, \ldots, Z_m)$; where $j = 1, 2, \ldots, m$; $m$ is the number of target data sources; $Z_j$ is the name string set by the $j$-th target data source for the target malicious file;

determining the importance degree of each target data source according to the target candidate string list set $N$, to obtain an importance set $Q = (Q_1, Q_2, \ldots, Q_j, \ldots, Q_m)$; where $Q_j$ is the importance of the $j$-th target data source; $Q_j = f(j) / \sum_{j=1}^{m} f(j)$;

determining at least one target similar sample file corresponding to the target malicious file according to the importance degree of each target data source;

the determining at least one target similar sample file corresponding to the target malicious file according to the importance degree of each target data source includes:

according to $I_{ja}$ and $N_j$, determining a sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file;

according to $H_a$, determining at least one target similar sample file corresponding to the target malicious file from the $b$ target sample files;

wherein, determining a plurality of target sample files from a plurality of history sample files includes:

2. The method according to claim 1, wherein the determining, according to $I_{ja}$ and $N_j$, a sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file comprises:

determining a semantic matching degree $A_{ja}$ between $C_j$ and $U_{ja}$;

according to $A_{ja}$ and $Q_j$, determining the sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file.

3. The method according to claim 2, wherein the determining, according to $A_{ja}$ and $Q_j$, the sample matching degree $H_a$ between the $a$-th target sample file and the target malicious file comprises:

determining the sample matching degree between the $a$-th target sample file and the target malicious file according to $H_a = \sum_{j=1}^{m} (A_{ja} \times Q_j)$.

4. The method according to claim 1, wherein the determining, according to $H_a$, at least one target similar sample file corresponding to the target malicious file from the $b$ target sample files comprises:

if $H_a \ge H_0$, determining the $a$-th target sample file as a target similar sample file corresponding to the target malicious file; wherein $H_0$ is a preset sample matching degree threshold.

5. The method of claim 1, wherein $Q_j$ is also determined by the following steps:

traversing $I_j$ and determining the number $L_{jag}$ of occurrences of $I_{jag}$ in $I_j$;

6. A sample determination device based on importance of a data source, comprising:

a target candidate character string determining module, configured to split $Z_j$ according to the preset character corresponding to the $j$-th target data source, to obtain a target candidate string list set $N = (N_1, N_2, \ldots, N_j, \ldots, N_m)$; $N_j = (N_{j1}, N_{j2}, \ldots, N_{jc}, \ldots, N_{jf(j)})$; where $c = 1, 2, \ldots, f(j)$; $f(j)$ is the number of target candidate strings contained in $Z_j$; $N_j$ is the target candidate string list corresponding to $Z_j$; $N_{jc}$ is the $c$-th target candidate string in the list corresponding to $Z_j$;

The similarity sample determining module is used for determining at least one target similarity sample file corresponding to the target malicious file according to the importance degree of each target data source;

according to the importance degree of each target data source, determining at least one target similar sample file corresponding to the target malicious file, including:

according to $H_a$, determining at least one target similar sample file corresponding to the target malicious file from the $b$ target sample files;

wherein determining a plurality of target sample files from a plurality of history sample files includes:

7. A non-transitory computer readable storage medium having stored therein at least one instruction or at least one program, wherein the at least one instruction or the at least one program is loaded and executed by a processor to implement the method of any one of claims 1-5.

8. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 7.