CN110442845B - File repetition rate calculation method and device - Google Patents

Info

Publication number
CN110442845B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910611026.2A
Other languages
Chinese (zh)
Other versions
CN110442845A (en)
Inventor
张志凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Security Technologies Co Ltd
Original Assignee
New H3C Security Technologies Co Ltd
Application filed by New H3C Security Technologies Co Ltd
Priority to CN201910611026.2A
Publication of CN110442845A
Application granted
Publication of CN110442845B

Landscapes

  • Collating Specific Patterns (AREA)
  • Storage Device Security (AREA)

Abstract

The invention provides a file repetition rate calculation method and device. The method comprises the following steps: extracting a file fingerprint of each sample file in a plurality of sample files to obtain a sample fingerprint set; calculating the dispersion of each file fingerprint in the sample fingerprint set; calculating the information amount of a plurality of sample files as sample information amount; extracting a file fingerprint of each target file in a plurality of target files to obtain a target fingerprint set; recording each file fingerprint belonging to the sample fingerprint set in the target fingerprint set to obtain a repeated fingerprint set; calculating the dispersion of each file fingerprint in the repeated fingerprint set; calculating the information amount of repeated files in the plurality of target files and the plurality of sample files as the repeated information amount; and calculating the ratio of the repeated information quantity to the sample information quantity as the file repetition rate of the target files and the sample files. The accuracy of the calculated repetition rate can be improved.

Description

File repetition rate calculation method and device
Technical Field
The invention relates to the technical field of information security, in particular to a file repetition rate calculation method and device.
Background
In some application scenarios, such as article duplication checking and Data Leakage Prevention (DLP), it is necessary to determine the file repetition rate between files. Taking data leakage protection as an example, in order to deploy a data leakage protection policy accurately, a user needs to know the file repetition rate between the leaked files and the preset protection files.
In the related art, the file fingerprints of a plurality of leakage files and the file fingerprints of a plurality of preset protection files may be extracted respectively, the number of the preset protection files appearing in the leakage files is counted according to the extracted file fingerprints, and the ratio of the number of the files to the total number of the preset protection files is calculated as the file repetition rate between the leakage files and the preset protection files.
However, this scheme uses the file as the unit of information amount and simply assumes that every file contains the same amount of information. In an actual application scenario, the information amounts of different files may differ, and in scenarios with a large number of files they often differ considerably, so the ratio between file counts cannot accurately reflect the actual file repetition rate between the files.
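The count-based ratio of the related art can be sketched as follows. This is a minimal illustration, assuming fingerprints are already available as strings; the function name `count_based_repetition_rate` is hypothetical and not taken from the patent.

```python
def count_based_repetition_rate(protected_fps, leaked_fps):
    """Related-art style ratio: every protected file counts as one unit."""
    leaked_lookup = set(leaked_fps)
    # Number of protected files whose fingerprint appears among the leaked files.
    matched = sum(1 for fp in protected_fps if fp in leaked_lookup)
    return matched / len(protected_fps)


protected = ["A", "B", "C", "D"]
leaked = ["A", "E"]
rate = count_based_repetition_rate(protected, leaked)
```

In this sketch each file contributes equally to the ratio regardless of how much information it carries, which is the limitation described above.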
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for calculating a repetition rate of a file, so as to improve the accuracy of the calculated repetition rate. The specific technical scheme is as follows:
in a first aspect of the present invention, a file repetition rate calculation method is provided, the method including:
extracting a file fingerprint of each sample file in a plurality of sample files to obtain a sample fingerprint set;
calculating the dispersion of each file fingerprint in the sample fingerprint set, wherein the dispersion is used for representing the average value of the difference degree of the file fingerprint and each file fingerprint in the sample fingerprint set;
calculating the information quantity of the plurality of sample files according to the dispersion of each file fingerprint in the sample fingerprint set and a preset information quantity calculation formula to serve as sample information quantity;
extracting a file fingerprint of each target file in a plurality of target files to obtain a target fingerprint set;
recording each file fingerprint belonging to the sample fingerprint set in the target fingerprint set to obtain a repeated fingerprint set;
for each file fingerprint in the set of repeated fingerprints, calculating the dispersion of that file fingerprint;
calculating the information quantity of repeated files in the target files and the sample files according to the dispersion of the fingerprints of the files in the repeated fingerprint set, wherein the information quantity is used as the repeated information quantity;
and calculating the ratio of the repeated information quantity to the sample information quantity as the file repetition rate of the target files and the sample files.
With reference to the first aspect, in a first possible implementation manner, the calculating, according to a preset dispersion algorithm, a dispersion of each file fingerprint in the sample fingerprint set includes:
for each file fingerprint in the sample fingerprint set, counting the occurrence probability of the file fingerprint in the sample fingerprint set;
calculating the dispersion of the file fingerprint according to a preset dispersion calculation formula and the occurrence probability of the file fingerprint, wherein in the preset dispersion calculation formula, the dispersion is negatively correlated with the occurrence probability;
for each file fingerprint in the set of repeated fingerprints, calculating the dispersion of the file fingerprint, including:
for each file fingerprint in the repeated fingerprint set, counting the occurrence probability of the file fingerprint in the sample fingerprint set;
and calculating the dispersion of the file fingerprint according to the occurrence probability of the file fingerprint and the preset dispersion calculation formula.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, the preset dispersion calculation formula is:
$D = -\log_2 p$
wherein D is the dispersion of the file fingerprint, and p is the occurrence probability of the file fingerprint.
With reference to the first aspect, in a third possible implementation manner, the calculating, according to the dispersion of each file fingerprint in the sample fingerprint set and according to a preset information amount calculation formula, information amounts of the plurality of sample files as sample information amounts includes:
weighting and superposing the dispersion of each file fingerprint in the sample fingerprint set to obtain a superposition result, wherein the superposition result is used as sample information quantity;
the calculating, as an amount of repeated information, an amount of information of repeated files in the plurality of target files and the plurality of sample files according to the dispersion of each file fingerprint in the set of repeated fingerprints includes:
and weighting and superposing the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superposition result which is used as repeated information quantity.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner, the weighting and superimposing the dispersion of each file fingerprint in the sample fingerprint set to obtain a superimposed result, which is used as a sample information amount, includes:
weighting and superposing the dispersion of each file fingerprint in the sample fingerprint set by taking the occurrence probability of the file fingerprint in the sample fingerprint set as the weight of the file fingerprint to obtain a superposition result as sample information quantity;
the weighted superposition of the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superposition result, which is used as a repeated information quantity, comprises:
and weighting and superposing the dispersion of each file fingerprint in the repeated fingerprint set by taking the occurrence probability of the file fingerprint in the sample fingerprint set as the weight of the file fingerprint to obtain a superposition result which is used as the repeated information content.
In a second aspect of the present invention, there is provided a file repetition rate calculation apparatus, the apparatus comprising:
the first fingerprint extraction module is used for extracting a file fingerprint of each sample file in a plurality of sample files to obtain a sample fingerprint set;
the first dispersion calculation module is used for calculating the dispersion of each file fingerprint in the sample fingerprint set, and the dispersion is used for representing the average value of the difference degree of the file fingerprint and each file fingerprint in the sample fingerprint set;
the first information quantity calculating module is used for calculating the information quantities of the plurality of sample files according to the dispersion of the file fingerprints in the sample fingerprint set and a preset information quantity calculating formula to serve as sample information quantities;
the second fingerprint extraction module is used for extracting a file fingerprint of each target file in the plurality of target files to obtain a target fingerprint set;
the repeated recording module is used for recording each file fingerprint belonging to the sample fingerprint set in the target fingerprint set to obtain a repeated fingerprint set;
a second dispersion calculation module for calculating the dispersion of each file fingerprint in the set of repeated fingerprints;
a second information amount calculating module, configured to calculate, as repeated information amounts, information amounts of repeated files in the plurality of target files and the plurality of sample files according to the dispersion of each file fingerprint in the repeated fingerprint set;
and the information quantity ratio module is used for calculating the ratio of the repeated information quantity to the sample information quantity as the file repetition rate of the target files and the sample files.
With reference to the second aspect, in a first possible implementation manner, the first dispersion calculation module is specifically configured to, for each file fingerprint in the sample fingerprint set, count an occurrence probability of the file fingerprint in the sample fingerprint set;
calculating the dispersion of the file fingerprint according to a preset dispersion calculation formula according to the occurrence probability of the file fingerprint, wherein the dispersion is negatively related to the occurrence probability in the preset dispersion calculation formula;
the second dispersion calculation module is specifically configured to, for each file fingerprint in the repeated fingerprint set, count an occurrence probability of the file fingerprint in the sample fingerprint set;
and calculating the dispersion of the file fingerprint according to the occurrence probability of the file fingerprint and the preset dispersion calculation formula.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the preset dispersion calculation formula is:
$D = -\log_2 p$
wherein D is the dispersion of the file fingerprint, and p is the occurrence probability of the file fingerprint.
With reference to the second aspect, in a third possible implementation manner, the first information amount calculating module is specifically configured to weight and superimpose the dispersion of each file fingerprint in the sample fingerprint set to obtain a superimposed result, which is used as a sample information amount;
the second information amount calculation module is specifically configured to weight and superimpose the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superimposed result, which is used as a repeated information amount.
With reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the first information amount calculation module is specifically configured to use an occurrence probability of a file fingerprint in the sample fingerprint set as a weight of the file fingerprint, and weight and overlap the dispersion of each file fingerprint in the sample fingerprint set to obtain an overlap result, which is used as a sample information amount;
the second information amount calculating module is specifically configured to use the occurrence probability of a file fingerprint in the sample fingerprint set as the weight of the file fingerprint, and weight and overlap the dispersion of each file fingerprint in the repeated fingerprint set to obtain an overlap result, which is used as a repeated information amount.
In a third aspect of the invention, there is provided an electronic device comprising:
a memory for storing a computer program;
a processor configured to implement the method steps of any one of the first aspect when executing a program stored in the memory.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, having stored therein a computer program which, when executed by a processor, performs the method steps of any of the above-mentioned first aspects.
The file repetition rate calculating method and device provided by the invention can reflect the difference of information amount between different files by using the dispersion in the process of calculating the file repetition rate, so that the calculated file repetition rate is more accurate. Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a file repetition rate calculation method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a file repetition rate calculation method in the field of data leakage protection according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a file repetition rate calculation apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a file repetition rate calculation method according to an embodiment of the present invention, where the method may be applied to an electronic device with a file repetition rate calculation function, and according to different application scenarios, the electronic device may be a server, a personal computer, or other types of electronic devices. The method can comprise the following steps:
S101, extracting a file fingerprint of each sample file in a plurality of sample files to obtain a sample fingerprint set.
According to different application scenarios, the files referred to by the sample file may be different, and for example, in an application scenario of data leakage protection, the sample file may be a file that needs to be protected and is preset by a user.
A file fingerprint may be considered as an identification of a file, if two files are identical, the file fingerprints of the two files are theoretically identical, and if the two files are different, the file fingerprints of the two files are theoretically different.
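As a minimal sketch of S101, assume that a cryptographic hash of the file content serves as the file fingerprint. The text does not prescribe a fingerprint algorithm, so SHA-256 and the helper names `file_fingerprint` and `fingerprint_set` are assumptions made for illustration only.

```python
import hashlib
from typing import Iterable, List


def file_fingerprint(content: bytes) -> str:
    """Return a hex digest used as the fingerprint (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def fingerprint_set(files: Iterable[bytes]) -> List[str]:
    """One fingerprint per file; duplicates are kept so they can be counted later."""
    return [file_fingerprint(content) for content in files]


sample_files = [b"quarterly report", b"quarterly report", b"design notes"]
sample_fps = fingerprint_set(sample_files)
```

A content hash like this only matches byte-identical files; a text fingerprint algorithm that tolerates small edits could be substituted without changing the later steps.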
S102, for each file fingerprint in the sample fingerprint set, calculating the dispersion of the file fingerprint.
Wherein the dispersion is used to represent the mean of the degree of difference between the file fingerprint and each file fingerprint in the sample fingerprint set. The difference degree of the two file fingerprints may reflect the difference degree of the content between the files corresponding to the two file fingerprints, for example, the difference degree of the two file fingerprints may refer to whether the two file fingerprints are the same, if the two file fingerprints are the same, the file contents corresponding to the two file fingerprints may be considered to be the same, and if the two file fingerprints are different, the file contents corresponding to the two file fingerprints may be considered to be different.
The difference degree of two file fingerprints may be represented differently in different application scenarios. For example, it may be represented by a real number in the range [0,1], where, under this convention, a higher value indicates that the two file fingerprints are more similar and a value of 1 indicates that they are completely identical. For another example, the difference degree may be represented as 1 when the two file fingerprints are the same and as 0 when they are different; in this case, the average value of the difference degree between one file fingerprint and each file fingerprint in the sample fingerprint set may be regarded as the occurrence probability of the file fingerprint in the sample fingerprint set.
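The last statement can be checked directly. Writing $f_1, \dots, f_n$ for the fingerprints in the sample fingerprint set and $f$ for the fingerprint in question (symbols introduced here only for illustration), the mean of the 1/0 difference degrees is:

$$\frac{1}{n}\sum_{j=1}^{n} \mathbf{1}[f_j = f] = \frac{|\{\, j : f_j = f \,\}|}{n} = p$$

that is, the fraction of sample fingerprints equal to $f$, which is the occurrence probability $p$ of $f$ in the sample fingerprint set.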
Therefore, in one possible embodiment, the occurrence probability of the file fingerprint in the sample fingerprint set may be counted, and the dispersion of the file fingerprint is calculated according to the occurrence probability and a preset dispersion calculation formula. The preset dispersion calculation formula can be different according to different application scenes, but the negative correlation between the dispersion and the occurrence probability should be satisfied, that is, under the condition that other factors influencing the dispersion are unchanged, the higher the occurrence probability is, the lower the dispersion is, and the lower the occurrence probability is, the higher the dispersion is.
For example, the preset dispersion formula may be as follows:
$D = -\log_2 p$
wherein D is the dispersion of the file fingerprint, and p is the occurrence probability of the file fingerprint. In other possible embodiments, the dispersion formula may also use a logarithm base other than 2, such as the natural base e or 10, which is not limited in this embodiment of the present invention.
In the embodiment of the present invention, the occurrence probability of a piece of information in the sample information set may refer to the ratio of the number of pieces of sample information having the same fingerprint as that piece of information to the total number of pieces of sample information. For example, assuming that the fingerprint of a piece of information is A, and the fingerprints of the pieces of sample information in the sample information set are {A, A, A, B, C}, three of the 5 pieces of sample information have fingerprint A, so the occurrence probability of the information is 0.6.
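The occurrence probability and the dispersion formula can be sketched as follows, assuming fingerprints are plain strings. The function names are hypothetical, and the `base` parameter reflects the remark that bases other than 2 may be used.

```python
import math
from collections import Counter


def occurrence_probabilities(sample_fps):
    """Map each distinct fingerprint to its share of the sample fingerprint set."""
    counts = Counter(sample_fps)
    total = len(sample_fps)
    return {fp: count / total for fp, count in counts.items()}


def dispersion(p, base=2.0):
    """D = -log_base(p); decreases as the occurrence probability p increases."""
    return -math.log(p, base)


# Three of the five sample fingerprints are A, so p(A) = 3/5.
sample = ["A", "A", "A", "B", "C"]
probs = occurrence_probabilities(sample)
d_a = dispersion(probs["A"])
d_b = dispersion(probs["B"])
```

Here the rarer fingerprint B receives a higher dispersion than the more frequent fingerprint A, matching the negative correlation required of the formula.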
S103, calculating the information quantity of the plurality of sample files according to the dispersion of each file fingerprint in the sample fingerprint set and a preset information quantity calculation formula to serve as the sample information quantity.
Since the dispersion of a file fingerprint may represent a mean value of the degree of difference between the file fingerprint and each file fingerprint in the sample fingerprint set, and the file fingerprint has a corresponding relationship with the sample file, the dispersion of a file fingerprint may be regarded as a mean value of the degree of difference between the sample file corresponding to the file fingerprint and each sample file in the plurality of sample files.
Therefore, the higher the dispersion of a file fingerprint, the greater the average difference between the corresponding sample file and the plurality of sample files, that is, the less information the corresponding sample file shares with the other sample files, and thus the larger the effective information amount of that sample file can be considered to be.
In one possible embodiment, the dispersion of each file fingerprint in the sample fingerprint set may be weighted and superimposed to obtain the superimposed result as the sample information amount. The weight of the dispersion of each file fingerprint may be the occurrence probability of the file fingerprint, or may be a function related to the occurrence probability, which is not limited in the embodiment of the present invention. For example, the preset information amount calculation formula may be as follows:
$$H(U1) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
wherein H(U1) is the sample information amount, n is the number of sample files, i.e. the number of file fingerprints in the sample fingerprint set, and $p_i$ is the occurrence probability of the i-th file fingerprint in the sample fingerprint set. For the occurrence probability, reference may be made to the relevant description in S102, which is not repeated here.
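A minimal sketch of the sample information amount follows, assuming the occurrence probability is used as the weight and the sum runs over every file fingerprint in the set, one term per sample file, as the definition of n suggests. The function name `sample_information_amount` is hypothetical.

```python
import math
from collections import Counter


def sample_information_amount(sample_fps):
    """Weighted sum of dispersions, with one term per sample file."""
    counts = Counter(sample_fps)
    total = len(sample_fps)
    amount = 0.0
    for fp in sample_fps:
        p = counts[fp] / total          # occurrence probability, used as weight
        amount += p * -math.log2(p)     # weight times dispersion
    return amount


# Two distinct files: each has p = 0.5 and dispersion 1, so the sum is 1.0.
two_distinct = sample_information_amount(["A", "B"])
# Two identical files: p = 1 and dispersion 0 for both, so the sum is 0.0.
two_identical = sample_information_amount(["A", "A"])
```

The second case illustrates the intent of the weighting: a set made only of copies of one file carries no additional information under this measure.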
S104, extracting the file fingerprint of each target file in the plurality of target files to obtain a target fingerprint set.
The way in which the file fingerprint is extracted from the target file should be the same as the way in which the file fingerprint is extracted from the sample file.
S105, recording each file fingerprint in the target fingerprint set that belongs to the sample fingerprint set to obtain a repeated fingerprint set.
For example, if the sample fingerprint set includes {fingerprint A, fingerprint B, fingerprint C}, and the target fingerprint set includes {fingerprint A, fingerprint B, fingerprint D, fingerprint E}, then fingerprint A and fingerprint B in the target fingerprint set belong to the sample fingerprint set, and thus the obtained repeated fingerprint set includes {fingerprint A, fingerprint B}.
For another example, if the target fingerprint set includes {fingerprint A, fingerprint B, fingerprint C}, and the sample fingerprint set includes {fingerprint A, fingerprint B, fingerprint D, fingerprint E}, then fingerprint A and fingerprint B in the target fingerprint set belong to the sample fingerprint set, and thus the obtained repeated fingerprint set includes {fingerprint A, fingerprint B}.
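The recording step of S105 can be sketched as a membership filter over the target fingerprints. The function name is hypothetical, and how duplicate fingerprints within the target set should be handled is not stated in the text; this sketch keeps them in the order they appear.

```python
def repeated_fingerprints(sample_fps, target_fps):
    """Keep each target fingerprint that also occurs in the sample fingerprint set."""
    sample_lookup = set(sample_fps)  # set membership is O(1) on average
    return [fp for fp in target_fps if fp in sample_lookup]


# First example: sample {A, B, C}, target {A, B, D, E}.
first = repeated_fingerprints(["A", "B", "C"], ["A", "B", "D", "E"])
# Second example: sample {A, B, D, E}, target {A, B, C}.
second = repeated_fingerprints(["A", "B", "D", "E"], ["A", "B", "C"])
```

Both calls yield fingerprint A and fingerprint B, as in the two examples above.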
S106, for each file fingerprint in the repeated fingerprint set, calculating the dispersion of the file fingerprint.
For the dispersion, reference may be made to the related description in the foregoing S102, and details are not repeated here. The way of calculating the dispersion of the file fingerprints in the repeated fingerprint set should be the same as the way of calculating the dispersion of the file fingerprints in the sample fingerprint set.
S107, calculating the information amount of the repeated files in the plurality of target files and the plurality of sample files according to the dispersion of each file fingerprint in the repeated fingerprint set, as the repeated information amount.
For the information amount, reference may be made to the related description in the foregoing S103, which is not repeated here. The repeated information amount should be calculated in the same manner as the sample information amount; the only difference is that the dispersions used are those of the file fingerprints in the repeated fingerprint set.
S108, calculating the ratio of the repeated information amount to the sample information amount as the file repetition rate of the plurality of target files and the plurality of sample files.
It is understood that the repeated information amount indicates the information amount of the repeated files among the plurality of target files and the plurality of sample files, and the repeated files are a subset of the sample files, so the repeated information amount does not exceed the sample information amount. The obtained ratio therefore lies in the range [0,1], and may be expressed as a decimal or as a percentage, which is not limited in this embodiment of the present invention.
By adopting the embodiment of the invention, the difference of information amount between different files can be embodied by utilizing the dispersion in the process of calculating the file repetition rate, so that the calculated file repetition rate is more accurate.
In order to more clearly explain the file repetition rate calculation method provided by the present invention, the following will exemplify the field of data leakage protection. In the application scenario, the sample File is a text File that needs to be protected and is preset by a user, and the target File is a detected leaked File, for example, the target File may be a File transmitted through an FTP (File Transfer Protocol) within a preset time window. The file repetition rate between the target file and the sample file may be used as the file leakage rate of the sample file.
Referring to fig. 2, fig. 2 is a schematic flow chart of a file repetition rate calculation method in the field of data leakage protection according to an embodiment of the present invention, where the method includes:
S201, extracting the file fingerprint of each sample file in the plurality of sample files through a text fingerprint extraction algorithm to obtain a sample fingerprint set.
For convenience of description, it is assumed that the sample fingerprint set includes {fingerprint A, fingerprint A, fingerprint B, fingerprint C, fingerprint D}.
S202, counting the occurrence probability of each file fingerprint in the sample fingerprint set.
Because the sample fingerprint set comprises 5 fingerprints in total, the appearance probability of the fingerprint A is 0.4, the appearance probability of the fingerprint B is 0.2, the appearance probability of the fingerprint C is 0.2, and the appearance probability of the fingerprint D is 0.2.
S203, calculating the dispersion of each file fingerprint in the sample fingerprint set according to the occurrence probability of the file fingerprint and a preset dispersion calculation formula.
The dispersion calculation formula is as follows:
$D = -\log_2 p$
Therefore, in the sample fingerprint set, the dispersion of fingerprint A is 1.32, the dispersion of fingerprint B is 2.32, the dispersion of fingerprint C is 2.32, and the dispersion of fingerprint D is 2.32.
S204, taking the occurrence probability of the file fingerprint in the sample fingerprint set as the weight of the file fingerprint, and weighting and overlapping the dispersion of each file fingerprint in the sample fingerprint set to obtain an overlapping result as the sample information amount.
The calculation formula of the sample information amount can be as follows:
$$H(U1) = \sum_{i=1}^{n} p_i D_i$$
where H(U1) is the sample information amount, n is the number of file fingerprints in the sample fingerprint set, $p_i$ is the occurrence probability of the i-th file fingerprint, and $D_i$ is the dispersion of the i-th file fingerprint. Taking the example in which the sample fingerprint set includes {fingerprint A, fingerprint A, fingerprint B, fingerprint C, fingerprint D}, it can be calculated that H(U1) is equal to 2.45.
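The value 2.45 can be reproduced by hand. With five sample fingerprints, two of which are fingerprint A (as the probabilities in S202 imply), the weighted sum has one term per fingerprint, using the rounded dispersions from S203:

$$H(U1) = 2 \times (0.4 \times 1.32) + 3 \times (0.2 \times 2.32) = 1.056 + 1.392 = 2.448 \approx 2.45$$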
S205, extracting the file fingerprints of the plurality of target files through the text fingerprint extraction algorithm to obtain a target fingerprint set.
For convenience of description, it is assumed that the target fingerprint set includes {fingerprint A, fingerprint B, fingerprint E}.
S206, recording the file fingerprints of the sample fingerprint set that appear in the target fingerprint set to obtain a repeated fingerprint set.
Fingerprint A and fingerprint B in the target fingerprint set appear in the sample fingerprint set, and fingerprint E does not appear in the sample fingerprint set, so the resulting repeated fingerprint set is {fingerprint A, fingerprint B}.
S207, for each file fingerprint in the repeated fingerprint set, counting the occurrence probability of the file fingerprint in the sample fingerprint set.
The occurrence probability of fingerprint A in the sample fingerprint set is 0.4 and the occurrence probability of fingerprint B in the sample fingerprint set is 0.2.
S208, calculating the dispersion of each file fingerprint in the repeated fingerprint set according to the occurrence probability of the file fingerprint and the preset dispersion calculation formula.
According to the dispersion calculation formula shown in S203, the dispersion of fingerprint A is 1.32 and the dispersion of fingerprint B is 2.32.
S209, taking the occurrence probability of the file fingerprint in the sample fingerprint set as the weight of the file fingerprint, and weighting and superposing the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superposition result as the repeated information amount.
The calculation formula of the amount of repetitive information can be as follows:
$$H(U2) = \sum_{i=1}^{m} p_i D_i$$
where H(U2) is the repeated information amount and m is the number of file fingerprints included in the repeated fingerprint set. In this example m = 2, and it can be calculated that H(U2) is equal to 0.99.
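Expanding the sum for the two repeated fingerprints, with the occurrence probabilities from S207 and the rounded dispersions from S208:

$$H(U2) = 0.4 \times 1.32 + 0.2 \times 2.32 = 0.528 + 0.464 = 0.992 \approx 0.99$$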
S210, calculating the ratio of the repeated information quantity to the sample information quantity as the file repetition rate of the target files and the sample files.
Continuing the above example, the file repetition rate can be calculated as 0.99/2.54 × 100% = 39.1%; that is, the file leakage rate of the plurality of sample files can be considered to be 39.1%.
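The ratio in S210 can be sketched as follows. The sample information amount 2.54 is taken as given from the example, and the repeated information amount is recomputed without intermediate rounding (about 0.993) from the probabilities 0.4 and 0.2. The function name `repetition_rate` is illustrative and not part of the original.

```python
import math


def repetition_rate(repeated_information, sample_information):
    """File repetition rate: repeated information amount / sample information amount."""
    if sample_information <= 0:
        raise ValueError("sample information amount must be positive")
    return repeated_information / sample_information


# Repeated information amount for fingerprints A (p = 0.4) and B (p = 0.2).
repeated_information = -(0.4 * math.log2(0.4) + 0.2 * math.log2(0.2))
sample_information = 2.54  # taken as given from the example

rate = repetition_rate(repeated_information, sample_information)
print(f"{rate:.1%}")  # 39.1%
```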
Referring to Fig. 3, which is a schematic structural diagram of a file repetition rate calculating apparatus according to an embodiment of the present invention, the apparatus may include:
a first fingerprint extraction module 301, configured to extract a file fingerprint of each sample file in multiple sample files, to obtain a sample fingerprint set;
a first dispersion calculation module 302, configured to calculate, for each file fingerprint in the sample fingerprint set, a dispersion of the file fingerprint, where the dispersion is used to represent a mean value of a difference degree between the file fingerprint and each file fingerprint in the sample fingerprint set;
the first information amount calculating module 303 is configured to calculate, according to a preset information amount calculating formula, information amounts of a plurality of sample files as sample information amounts according to the dispersion of each file fingerprint in the sample fingerprint set;
a second fingerprint extraction module 304, configured to extract a file fingerprint of each of the multiple target files to obtain a target fingerprint set;
a duplicate recording module 305, configured to record each file fingerprint in the target fingerprint set that belongs to the sample fingerprint set, to obtain a duplicate fingerprint set;
a second dispersion calculation module 306, configured to calculate, for each file fingerprint in the set of repeated fingerprints, a dispersion of the file fingerprint;
a second information amount calculating module 307, configured to calculate, as repeated information amounts, information amounts of repeated files in the multiple target files and the multiple sample files according to the dispersion of each file fingerprint in the repeated fingerprint set;
the information amount ratio module 308 is configured to calculate a ratio of the repeated information amount to the sample information amount as a file repetition rate of the plurality of target files and the plurality of sample files.
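The division of work among modules 302 to 308 can be illustrated with a short end-to-end sketch. It rests on several assumptions that are not stated in this passage: fingerprint extraction (modules 301 and 304) has already happened upstream, the occurrence probability of a fingerprint is its count divided by the total number of fingerprints in the sample fingerprint set, and the information amount is the probability-weighted sum of dispersions. All function names and the input data are hypothetical.

```python
import math
from collections import Counter


def occurrence_probabilities(sample_fingerprints):
    """Assumed definition: count of each fingerprint / total count in the sample set."""
    counts = Counter(sample_fingerprints)
    total = sum(counts.values())
    return {fp: n / total for fp, n in counts.items()}


def dispersion(p):
    """Modules 302 and 306: D = -log2(p)."""
    return -math.log2(p)


def information_amount(fingerprints, probabilities):
    """Modules 303 and 307: dispersion weighted by occurrence probability."""
    return sum(probabilities[fp] * dispersion(probabilities[fp]) for fp in fingerprints)


def file_repetition_rate(sample_fingerprints, target_fingerprints):
    probabilities = occurrence_probabilities(sample_fingerprints)
    sample_information = information_amount(probabilities, probabilities)
    if sample_information == 0:
        raise ValueError("sample information amount is zero")
    # Module 305: target fingerprints that also belong to the sample set.
    repeated = set(target_fingerprints) & set(probabilities)
    repeated_information = information_amount(repeated, probabilities)
    # Module 308: ratio of repeated to sample information amount.
    return repeated_information / sample_information


# Hypothetical inputs; they do not reproduce the 2.54 of the worked example.
sample = ["A", "A", "B", "C", "D"]
target = ["A", "B", "E"]
rate = file_repetition_rate(sample, target)
print(f"{rate:.1%}")
```

Because both information amounts are computed from probabilities measured in the sample fingerprint set, a target that contains every sample fingerprint yields a rate of 1, and a target with no shared fingerprints yields 0.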
In a possible embodiment, the first dispersion calculation module 302 is specifically configured to, for each file fingerprint in the sample fingerprint set, count an occurrence probability of the file fingerprint in the sample fingerprint set;
calculating the dispersion of the file fingerprint according to a preset dispersion calculation formula and the occurrence probability of the file fingerprint, wherein the dispersion is negatively related to the occurrence probability in the preset dispersion calculation formula;
the second dispersion calculation module 306 is specifically configured to, for each file fingerprint in the repeated fingerprint set, count an occurrence probability of the file fingerprint in the sample fingerprint set;
and calculating the dispersion of the file fingerprint according to the occurrence probability of the file fingerprint and a preset dispersion calculation formula.
In one possible embodiment, the preset dispersion calculation formula is:
$D = -\log_2 p$
wherein D is the dispersion of the file fingerprint, and p is the occurrence probability of the file fingerprint.
In a possible embodiment, the first information amount calculating module 303 is specifically configured to weight and superpose the dispersion of each file fingerprint in the sample fingerprint set to obtain a superposition result, which is used as the sample information amount;
the second information amount calculating module 307 is specifically configured to weight and overlap the dispersion of each file fingerprint in the repeated fingerprint set to obtain an overlap result, which is used as a repeated information amount.
In a possible embodiment, the first information amount calculating module 303 is specifically configured to use an occurrence probability of a file fingerprint in a sample fingerprint set as a weight of the file fingerprint, and weight and overlap a dispersion of each file fingerprint in the sample fingerprint set to obtain an overlap result, which is used as the sample information amount;
the second information amount calculating module 307 is specifically configured to use the occurrence probability of the file fingerprint in the sample fingerprint set as the weight of the file fingerprint, and weight and overlap the dispersion of each file fingerprint in the repeated fingerprint set to obtain an overlap result, which is used as the repeated information amount.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, including:
a memory 401 for storing a computer program;
the processor 402, when executing the program stored in the memory 401, implements the following steps:
extracting a file fingerprint of each sample file in a plurality of sample files to obtain a sample fingerprint set;
calculating the dispersion of each file fingerprint in the sample fingerprint set, wherein the dispersion is used for representing the mean value of the difference degree between the file fingerprint and each file fingerprint in the sample fingerprint set;
calculating the information quantity of a plurality of sample files according to the dispersion of each file fingerprint in the sample fingerprint set and a preset information quantity calculation formula to serve as the sample information quantity;
extracting a file fingerprint of each target file in a plurality of target files to obtain a target fingerprint set;
recording each file fingerprint belonging to the sample fingerprint set in the target fingerprint set to obtain a repeated fingerprint set;
calculating the dispersion of each file fingerprint in the repeated fingerprint set;
calculating the information quantity of repeated files in the multiple target files and the multiple sample files according to the dispersion of the fingerprints of the files in the repeated fingerprint set, wherein the information quantity is used as the repeated information quantity;
and calculating the ratio of the repeated information quantity to the sample information quantity as the file repetition rate of the target files and the sample files.
In one possible embodiment, for each file fingerprint in the sample fingerprint set, calculating the dispersion of the file fingerprint according to a preset dispersion algorithm includes:
for each file fingerprint in the sample fingerprint set, counting the occurrence probability of the file fingerprint in the sample fingerprint set;
calculating the dispersion of the file fingerprint according to a preset dispersion calculation formula and the occurrence probability of the file fingerprint, wherein the dispersion is negatively related to the occurrence probability in the preset dispersion calculation formula;
for each file fingerprint in the set of repeated fingerprints, calculating a dispersion of the file fingerprint, including:
for each file fingerprint in the repeated fingerprint set, counting the occurrence probability of the file fingerprint in the sample fingerprint set;
and calculating the dispersion of the file fingerprint according to the occurrence probability of the file fingerprint and a preset dispersion calculation formula.
In one possible embodiment, the preset dispersion calculation formula is:
$D = -\log_2 p$
wherein D is the dispersion of the file fingerprint, and p is the occurrence probability of the file fingerprint.
In a possible embodiment, calculating the information amount of the plurality of sample files according to a preset information amount calculation formula according to the dispersion of the fingerprints of the files in the sample fingerprint set as the sample information amount includes:
weighting and superposing the dispersion of each file fingerprint in the sample fingerprint set to obtain a superposition result which is used as the sample information quantity;
calculating the information quantity of repeated files in a plurality of target files and a plurality of sample files according to the dispersion of the fingerprints of the files in the repeated fingerprint set, wherein the information quantity is used as the repeated information quantity and comprises the following steps:
and weighting and superposing the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superposition result as repeated information quantity.
In a possible embodiment, weighting and superposing the dispersion of each file fingerprint in the sample fingerprint set to obtain a superposition result, which is used as the sample information amount, includes:
taking the occurrence probability of the file fingerprints in the sample fingerprint set as the weight of the file fingerprints, and weighting and superposing the dispersion of each file fingerprint in the sample fingerprint set to obtain a superposition result which is used as the sample information content;
weighting and superposing the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superposition result, wherein the superposition result is used as a repeated information quantity and comprises the following steps:
and weighting and overlapping the dispersion of each file fingerprint in the repeated fingerprint set by taking the occurrence probability of the file fingerprint in the sample fingerprint set as the weight of the file fingerprint to obtain an overlapping result as the repeated information content.
The memory of the aforementioned electronic device may include a Random Access Memory (RAM), and may also include a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute any of the file repetition rate calculation methods in the above embodiments.
In yet another embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the file repetition rate calculation method of the above-mentioned embodiment.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the computer-readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A file repetition rate calculation method, characterized in that the method comprises:
extracting a file fingerprint of each sample file in a plurality of sample files to obtain a sample fingerprint set;
calculating the dispersion of each file fingerprint in the sample fingerprint set, wherein the dispersion is used for representing the mean value of the difference degree of the file fingerprint and each file fingerprint in the sample fingerprint set;
calculating the information quantity of the plurality of sample files according to the dispersion of each file fingerprint in the sample fingerprint set and a preset information quantity calculation formula to serve as sample information quantity; the dispersion is positively correlated with the information quantity, and the higher the dispersion is, the higher the sample information quantity is;
extracting a file fingerprint of each target file in a plurality of target files to obtain a target fingerprint set;
recording each file fingerprint belonging to the sample fingerprint set in the target fingerprint set to obtain a repeated fingerprint set;
for each file fingerprint in the set of repeated fingerprints, calculating the dispersion of that file fingerprint;
calculating the information quantity of repeated files in the target files and the sample files according to the dispersion of the fingerprints of the files in the repeated fingerprint set, wherein the information quantity is used as the repeated information quantity; the dispersion is positively correlated with the information quantity, and the higher the dispersion is, the higher the repeated information quantity is;
calculating the ratio of the repeated information amount to the sample information amount as the file repetition rate of the target files and the sample files;
for each file fingerprint in the sample fingerprint set, calculating the dispersion of the file fingerprint, including:
for each file fingerprint in the sample fingerprint set, counting the occurrence probability of the file fingerprint in the sample fingerprint set;
calculating the dispersion of the file fingerprint according to a preset dispersion calculation formula according to the occurrence probability of the file fingerprint, wherein the dispersion is negatively related to the occurrence probability in the preset dispersion calculation formula;
for each file fingerprint in the set of repeated fingerprints, calculating the dispersion of the file fingerprint, including:
for each file fingerprint in the repeated fingerprint set, counting the occurrence probability of the file fingerprint in the sample fingerprint set;
and calculating the dispersion of the file fingerprint according to the occurrence probability of the file fingerprint and the preset dispersion calculation formula.
2. The method according to claim 1, wherein the preset dispersion calculation formula is:
$D = -\log_2 p$
wherein D is the dispersion of the file fingerprint, and p is the occurrence probability of the file fingerprint.
3. The method according to claim 1, wherein the calculating the information amount of the plurality of sample files according to the dispersion of the respective file fingerprints in the sample fingerprint set and a preset information amount calculation formula as a sample information amount comprises:
weighting and superposing the dispersion of each file fingerprint in the sample fingerprint set to obtain a superposition result, wherein the superposition result is used as sample information content;
the calculating, as an amount of repeated information, an amount of information of repeated files in the plurality of target files and the plurality of sample files according to the dispersion of each file fingerprint in the set of repeated fingerprints includes:
and weighting and superposing the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superposition result which is used as repeated information quantity.
4. The method of claim 3, wherein the weighted overlap of the dispersion of each file fingerprint in the sample fingerprint set to obtain an overlap result as a sample information amount comprises:
weighting and superposing the dispersion of each file fingerprint in the sample fingerprint set by taking the occurrence probability of the file fingerprint in the sample fingerprint set as the weight of the file fingerprint to obtain a superposition result which is used as the sample information content;
the weighted superposition of the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superposition result, which is used as a repeated information quantity, comprises:
and weighting and superposing the dispersion of each file fingerprint in the repeated fingerprint set by taking the occurrence probability of the file fingerprint in the sample fingerprint set as the weight of the file fingerprint to obtain a superposition result which is used as the repeated information content.
5. An apparatus for calculating a file repetition rate, the apparatus comprising:
the first fingerprint extraction module is used for extracting a file fingerprint of each sample file in a plurality of sample files to obtain a sample fingerprint set;
the first dispersion calculation module is used for calculating the dispersion of each file fingerprint in the sample fingerprint set, and the dispersion is used for representing the mean value of the difference degree of the file fingerprint and each file fingerprint in the sample fingerprint set;
the first information quantity calculating module is used for calculating the information quantities of the plurality of sample files according to the dispersion of the file fingerprints in the sample fingerprint set and a preset information quantity calculating formula to serve as sample information quantities; the dispersion is positively correlated with the information quantity, and the higher the dispersion is, the higher the sample information quantity is;
the second fingerprint extraction module is used for extracting a file fingerprint of each target file in the plurality of target files to obtain a target fingerprint set;
the repeated recording module is used for recording each file fingerprint belonging to the sample fingerprint set in the target fingerprint set to obtain a repeated fingerprint set;
a second dispersion calculation module for calculating the dispersion of each file fingerprint in the set of repeated fingerprints;
a second information amount calculating module, configured to calculate, as repeated information amounts, information amounts of repeated files in the plurality of target files and the plurality of sample files according to the dispersion of each file fingerprint in the repeated fingerprint set; the dispersion is positively correlated with the information quantity, and the higher the dispersion is, the higher the repeated information quantity is;
an information quantity ratio module, configured to calculate a ratio of the repeated information quantity to the sample information quantity, as a file repetition rate of the multiple target files and the multiple sample files;
the first dispersion calculation module is specifically configured to, for each file fingerprint in the sample fingerprint set, count an occurrence probability of the file fingerprint in the sample fingerprint set;
calculating the dispersion of the file fingerprint according to a preset dispersion calculation formula according to the occurrence probability of the file fingerprint, wherein the dispersion is negatively related to the occurrence probability in the preset dispersion calculation formula;
the second dispersion calculation module is specifically configured to, for each file fingerprint in the repeated fingerprint set, count an occurrence probability of the file fingerprint in the sample fingerprint set;
and calculating the dispersion of the file fingerprint according to the occurrence probability of the file fingerprint and the preset dispersion calculation formula.
6. The apparatus of claim 5, wherein the preset dispersion calculation formula is:
$D = -\log_2 p$
wherein D is the dispersion of the file fingerprint, and p is the occurrence probability of the file fingerprint.
7. The apparatus according to claim 5, wherein the first information amount calculating module is specifically configured to weight and superimpose the dispersion of each file fingerprint in the sample fingerprint set to obtain a superimposed result as a sample information amount;
the second information amount calculation module is specifically configured to weight and superimpose the dispersion of each file fingerprint in the repeated fingerprint set to obtain a superimposed result, which is used as a repeated information amount.
8. The apparatus according to claim 7, wherein the first information amount calculating module is specifically configured to use an occurrence probability of a file fingerprint in the sample fingerprint set as a weight of the file fingerprint, and weight-superimpose the dispersion of each file fingerprint in the sample fingerprint set to obtain a superimposed result as a sample information amount;
the second information amount calculating module is specifically configured to use the occurrence probability of a file fingerprint in the sample fingerprint set as the weight of the file fingerprint, and weight and overlap the dispersion of each file fingerprint in the repeated fingerprint set to obtain an overlap result, which is used as a repeated information amount.
CN201910611026.2A 2019-07-08 2019-07-08 File repetition rate calculation method and device Active CN110442845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910611026.2A CN110442845B (en) 2019-07-08 2019-07-08 File repetition rate calculation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910611026.2A CN110442845B (en) 2019-07-08 2019-07-08 File repetition rate calculation method and device

Publications (2)

Publication Number Publication Date
CN110442845A CN110442845A (en) 2019-11-12
CN110442845B true CN110442845B (en) 2022-12-20

Family

ID=68429895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910611026.2A Active CN110442845B (en) 2019-07-08 2019-07-08 File repetition rate calculation method and device

Country Status (1)

Country Link
CN (1) CN110442845B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7774451B1 (en) * 2008-06-30 2010-08-10 Symantec Corporation Method and apparatus for classifying reputation of files on a computer network
CN106649676A (en) * 2016-12-15 2017-05-10 北京锐安科技有限公司 Duplication eliminating method and device based on HDFS storage file

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Data De-duplication on Similar File Detection;Yueguang Zhu et al.;《2014 Eighth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing》;20140702;第66-73页 *
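A New Data De-duplication Algorithm Based on the Bloom Filter Data Structure; 邓剑勋 (Deng Jianxun) et al.; Journal of Nanchang University (Natural Science); 20171031; Vol. 41, No. 5; pp. 455-459, 463 *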

Also Published As

Publication number Publication date
CN110442845A (en) 2019-11-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant