CN105956482B

CN105956482B - Method and system for data leakage prevention

Info

Publication number: CN105956482B
Application number: CN201610237022.9A
Authority: CN
Inventors: 李唱; 康靖; 陈虎
Original assignee: Baoli Nine Chapter (beijing) Data Technology Co Ltd
Current assignee: Quantum innovation (Beijing) Information Technology Co., Ltd
Priority date: 2016-04-15
Filing date: 2016-04-15
Publication date: 2019-06-04
Anticipated expiration: 2036-04-15
Also published as: CN105956482A

Abstract

The invention discloses a method and system for data leakage prevention, comprising: a method of extracting data features from a document to obtain a first data fingerprint and a second data fingerprint; a method of judging, from the extracted data features, whether a first document and a second document are related; and a method of judging, from the degree of correlation, whether a suspicious document contains sensitive content. The invention also provides the corresponding device for extracting document data features, the device for judging whether a first document and a second document are related, and the device for judging whether a suspicious document contains sensitive content.

Description

Method and system for data leakage prevention

Technical field

The present invention relates to the technical field of data security, and in particular to a method and system for data leakage prevention.

Background art

In recent years, with the rapid development of information technology, data security has become particularly important in the daily operation of information-based enterprises. If data is maliciously tampered with or destroyed, the enterprise may suffer irrecoverable losses. To improve information security, data security measures are generally needed to monitor and protect data. In the current big data environment, as the volume of enterprise data grows, how to monitor and protect the ever-increasing data quickly and efficiently has become a major problem facing the data security field.

At present, to prevent data leakage, many enterprises have deployed a data leakage prevention (DLP) system within the enterprise intranet to ensure the security of sensitive data. A DLP system monitors and protects sensitive data by software and, through certain technical means, prevents specified data or information assets of the enterprise from flowing out of the enterprise in a form that violates the security policy, so that sensitive data is neither lost nor leaked. In a DLP system, therefore, extracting data features and matching sensitive data are key steps.

Traditional DLP systems generally extract data features either by manually setting keywords or by generating a data fingerprint for the entire file. The former cannot complete feature extraction automatically; the latter becomes less accurate when the file is large. In addition, sensitive data is usually matched using rule matching and hash matching algorithms; likewise, when large files are involved, both the performance and the accuracy of these algorithms degrade sharply.

Summary of the invention

To this end, the present invention provides a method and system for data leakage prevention, in an effort to solve, or at least alleviate, at least one of the problems above.

According to one aspect of the invention, a method of extracting data features from a document is provided, wherein the extracted data features include a first data fingerprint and a second data fingerprint. The method comprises the steps of: extracting a first predetermined number of words from the document, calculating the data feature string corresponding to each word, and constructing the first data fingerprint of the document from this first predetermined number of data feature strings; and dividing the word sequence of the document into blocks in order, calculating the data feature string of each word block from the data content of that block, and combining the data feature strings of the word blocks to construct the second data fingerprint of the document.

According to another aspect of the invention, a method of judging whether a first document and a second document are related is provided, comprising the steps of: performing the data feature extraction method described above on the first document to obtain a first feature set; performing the data feature extraction method described above on the second document to obtain a second feature set; and calculating the similarity between the first feature set and the second feature set, the first document and the second document being considered related if the similarity reaches a preset range.

According to another aspect of the invention, a method of judging whether a suspicious document contains sensitive content is provided, comprising the steps of: performing the data feature extraction method described above on secure documents to extract their data features and build a feature library; then extracting the data features of the suspicious document and performing the relatedness judgment method described above to judge whether the suspicious document is related to a secure document in the feature library. If the suspicious document is judged to be related to a secure document, it is considered to contain sensitive content; if it is judged to be unrelated to the secure documents, it is considered not to contain sensitive content.

Correspondingly, the invention also provides a device for extracting data features from a document, a device for judging whether a first document and a second document are related, and a device for judging whether a suspicious document contains sensitive content.

According to a further aspect of the invention, a data leakage prevention system is provided, comprising: computing devices connected to a data security protection device; and the data security protection device, which comprises a document acquisition device, the sensitive content judging device described above, a control policy acquisition device, and a control device.

Based on the description above, this scheme extracts the data features of a document by automatically extracting document keywords and by extracting the data fingerprints of word blocks. On the one hand, the keywords of the document are selected by calculating a feature value that characterizes word importance, so there is no need to rely on manually set keywords. On the other hand, after the document is segmented into words, the word sequence is divided into blocks, and the data fingerprint of each block is calculated from the word block using a locality-sensitive hashing (LSH) algorithm. This effectively prevents the leakage of similar data and maintains the accuracy of feature extraction even when the document is very large.

For feature matching, this scheme judges similar content in documents either by matching the similarity of individual data feature strings (single matching) or by calculating the proportion of similar data feature strings (benchmark matching). Optionally, the similarity between documents can be expressed by the Hamming distance or the Jaccard coefficient. In this way, sensitive data can be matched more comprehensively, preventing sensitive data leakage and effectively countering the various means by which documents are leaked.
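The two similarity measures named above can be sketched as follows. This is a minimal illustration, not the patented matching procedure: the function names, the hex-string representation of a fingerprint, and the 0.5 threshold are all assumptions made for the example.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex fingerprints."""
    if len(hex_a) != len(hex_b):
        raise ValueError("fingerprints must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def jaccard(features_a: set, features_b: set) -> float:
    """Jaccard coefficient |A & B| / |A | B| of two feature sets."""
    if not features_a and not features_b:
        return 1.0
    return len(features_a & features_b) / len(features_a | features_b)


def is_related(features_a: set, features_b: set, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: related when similarity reaches the threshold."""
    return jaccard(features_a, features_b) >= threshold
```

A small Hamming distance between two block signatures, or a high Jaccard coefficient between two feature sets, would indicate similar content; the actual preset range is left open by the text.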

Brief description of the drawings

To accomplish the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the drawings. These aspects indicate the various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, the same reference numerals generally refer to the same components or elements.

Fig. 1 shows a schematic diagram of a data leakage prevention system 100 according to one embodiment of the invention;

Fig. 2A shows a flowchart of a method 200 of extracting data features from a document according to one embodiment of the invention;

Fig. 2B shows a flowchart of a method 200 of extracting data features from a document according to another embodiment of the invention;

Fig. 3A shows a schematic diagram of a device 300 for extracting data features from a document according to one embodiment of the invention;

Fig. 3B shows a schematic diagram of a device 300 for extracting data features from a document according to another embodiment of the invention;

Fig. 4 shows a flowchart of a method 400 of judging whether a first document and a second document are related according to one embodiment of the invention;

Fig. 5 shows a schematic diagram of a device 500 for judging whether a first document and a second document are related according to one embodiment of the invention;

Fig. 6 shows a flowchart of a method 600 of judging whether a suspicious document contains sensitive content according to one embodiment of the invention;

Fig. 7 shows a schematic diagram of a device 700 for judging whether a suspicious document contains sensitive content according to one embodiment of the invention; and

Fig. 8 shows a schematic diagram of the block division process.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

Fig. 1 shows a schematic diagram of a data leakage prevention system 100 according to one embodiment of the invention. Within the enterprise, computing devices 110 are connected to one another through a local area network. The components of a computing device 110 may include, but are not limited to: one or more processors or processing units, a system memory, and a bus connecting the different system components (including the system memory and the processing units). Note that, besides traditional computing devices (for example, computers), the computing devices 110 suitable for implementing embodiments of the invention also include mobile electronic devices, including but not limited to mobile phones, PDAs, and tablet computers, as well as servers, printers, CD/DVD devices, and the like in the enterprise working environment.

The data security protection device 120 for data leakage prevention is deployed in the local area network and is connected to all computing devices 110 through it. As shown in Fig. 1, the protection device 120 includes: a document acquisition device 122, a sensitive content judging device 700, a control policy acquisition device 124, and a control device 126.

The document acquisition device 122 is adapted to monitor all computing devices 110 in the local area network in real time and, when it detects that a computing device 110 is sending a document, to obtain the content of that document. Here, a document may be a chat message in instant messaging and/or a picture or file transmitted through instant messaging.

The sensitive content judging device 700 is adapted to judge whether the obtained document contains sensitive content; device 700 is described in more detail below.

The control policy acquisition device 124 is adapted to obtain, while it is being judged whether the document contains sensitive content, the control policy corresponding to the process associated with the document. Optionally, control policies may include: a policy of prohibiting printing when the specified process is a printing process, and a policy of substituting a garbled character string when the specified process is a file transfer process.

The control device 126 is adapted to control operations on the document according to the obtained control policy when the suspicious document is judged to contain sensitive content. For example, the sensitive data in the data content to be transmitted is replaced with a character string that marks it as garbled.

Based on the above description of system 100, accurately matching sensitive content is the key to data security protection in this system, and it is the operation performed by the sensitive content judging device 700. In brief, the sensitive content judging device 700 should include (but is not limited to): a storage module (for storing all data features of secure documents), a device for extracting document data features (for extracting the data features of a suspicious document), a document relatedness judging device (for judging, from the extracted data features, whether the suspicious document is related to a secure document), and a determination module (for determining, from the relatedness judgment result, whether the suspicious document contains sensitive content).

The composition of each of the above modules and the processes they execute are described below.

Figs. 2A and 2B show flowcharts of the method 200 of extracting data features from a document; Fig. 2A shows the flowchart of the method of extracting the first data feature from a document according to one embodiment of the invention. As shown in Fig. 2A, the method starts at step S210. In step S210, the document is first segmented into words, and useless information such as stop words, punctuation marks, and line breaks is removed to obtain a word sequence. According to an embodiment of the invention, method 200 performs word segmentation with a dictionary-based segmentation algorithm such as MMSEG (A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm). MMSEG is a commonly used dictionary-based algorithm for Chinese word segmentation; it is simple and intuitive, uncomplicated to implement, and fast. In brief, the algorithm consists of a "matching algorithm" and "disambiguation rules". The matching algorithm specifies how the sentence to be segmented is matched against the words stored in the dictionary. The disambiguation rules decide which segmentation to use when a sentence can be segmented in more than one way. For example, the Chinese phrase rendered here as "facility and service" can be segmented as "facility/kimono/affair" or as "facility/and/service"; choosing between these results is the job of the disambiguation rules. MMSEG defines two matching algorithms: simple maximum matching and complex maximum matching. It defines four disambiguation rules: maximum matching (corresponding to the two matching algorithms above), largest average word length, smallest variance of word lengths, and largest sum of degree of morphemic freedom of one-character words (the natural logarithm of the frequency of each one-character word in the phrase is calculated, the values are summed, and the phrase with the largest sum is chosen).
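As a rough illustration of the simple maximum matching variant only (not the full MMSEG algorithm with its disambiguation rules), the sketch below greedily takes the longest dictionary word at each position. The function name, the tiny dictionary, and the assumption that the example phrase is 设施和服务 are all choices made for this example.

```python
def simple_max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    """Forward maximum matching: at each position take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Greedy matching picks 和服 ("kimono") before it can see 服务 ("service").
print(simple_max_match("设施和服务", {"设施", "和服", "服务"}))
```

With this dictionary the greedy result is 设施/和服/务, which is the unwanted segmentation; this is the kind of ambiguity the disambiguation rules are meant to resolve.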

For example, applying word segmentation to document A below yields document B.

Document A:

" Group Life Accident Insurance material benefit plan

Unexpected injury: an objective event that is external, sudden, unintended, and not caused by disease, and that harms the body.

For example, being hit in a traffic accident, being burned in a fire, being injured by an object falling from a height, being attacked and injured by a ruffian, a liquefied gas or coal gas explosion, and being scalded by oil boiled by a cook all count as unexpected injury events.

Two combined schemes are recommended for the unit to choose from according to its actual conditions:

1. Unexpected injury insurance: insurance premium 150 yuan per person (occupation grades 1 and 2)

(1) Insurance period: 1 year

(2) Insurance responsibility: 100,000 yuan is paid for death due to unexpected injury; 100,000 yuan is paid for total disability or unexpected burns due to unexpected injury (partial disability is paid in proportion); 10,000 yuan for unexpected injury medical treatment.

2. Unexpected injury and medical insurance: insurance premium 100 yuan per person (occupation grades 1 and 2)

(1) Insurance period: 1 year

(2) Insurance responsibility: 50,000 yuan is paid for death due to unexpected injury; 50,000 yuan is paid for total disability or burns due to unexpected injury (partial disability is paid in proportion); 10,000 yuan for unexpected injury medical treatment.

Note: our company can design an insurance scheme according to the specific situation of your unit.

Low payment, extensive protection; convenient to insure, rapid claim settlement."

Document B (word sequence obtained after word segmentation processing):

[group, person, unexpected, injury, insurance, material benefit, plan, unexpected, injury, refers to, by, external, sudden, unintended, non, disease, body, injury, objective, event, traffic, accident, hit, occur, fire, burn, high altitude, fall, object, strike, cause injury, ruffian, attack, injured, liquefied gas, coal gas, explosion, cook, boil, oil, scald, belong, unexpected, injury, event, recommend, two kinds, combination, scheme, official documents and correspondence, for, unit, combine, actual conditions, select, unexpected, injury, insurance, insurance premium, 150 yuan, grade, occupation, insurance, period, 1 year, insurance, responsibility, unexpected, injury, death, pay, 10, ten thousand yuan, or because, unexpected, injury, total, disability, unexpected, burn, pay, 10, ten thousand yuan, partial, disabled, in proportion, pay, unexpected, injury, medical treatment, 1, ten thousand yuan, unexpected, injury, medical insurance, insurance premium, 100 yuan, grade, occupation, insurance, period, 1 year, insurance, responsibility, unexpected, injury, death, pay, 5, ten thousand yuan, or because, unexpected, injury, total, disability, burn, pay, 5, ten thousand yuan, partial, disabled, in proportion, pay, unexpected, injury, medical treatment, 1, ten thousand yuan, note, company, can, according to, your unit, specific, situation, design, insurance, scheme, official documents and correspondence, payment, not much, guarantee, much, insure, convenient, claim settlement, rapid]

Note that the invention is not limited to a specific segmentation method; any method that can segment a document to obtain the meaningful words in it falls within the scope of protection of the invention.

Then, in step S220, for each word in the word sequence, a feature value characterizing the importance of the word in the document is calculated, and a first predetermined number of words are selected from the word sequence according to the magnitude of the feature value.

According to one embodiment of the invention, the TF-IDF value is used to characterize the importance of a word in a document. The TF-IDF value is calculated as follows:

1. Compute the word frequency TF of the word as its frequency of occurrence in the document.

2. Compute the inverse document frequency IDF of the word from the ratio of the total number of documents in the document library to the number of documents in the library that contain the word.

3. Compute TF-IDF from the word frequency TF and the inverse document frequency IDF of the word:

TF-IDF = TF × IDF

For a given word, the larger its TF-IDF value, the more important it is to the document. The words are therefore ranked by TF-IDF value, and the top first predetermined number (for example, 5) of words are selected as the keywords of the document.
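The keyword selection described above can be sketched as follows. The text gives TF as a frequency and IDF as a ratio without showing the exact formulas, so the normalization by document length, the logarithm, and the +1 smoothing used here are assumptions borrowed from the conventional TF-IDF definition; the function name and the representation of the document library as lists of words are also hypothetical.

```python
import math
from collections import Counter


def tf_idf_keywords(doc_words: list, library: list, top_n: int = 5) -> list:
    """Return the top_n words of doc_words ranked by TF-IDF.

    library is a list of documents, each given as a list of words.
    The log and the +1 smoothing are assumptions, not taken from the source.
    """
    counts = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        containing = sum(1 for doc in library if word in doc)
        idf = math.log(len(library) / (1 + containing))
        scores[word] = tf * idf
    # Sort by score descending; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

Here `top_n` plays the role of the first predetermined number.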

In traditional DLP systems, keyword features are extracted mainly by manually setting keywords, which is clearly time-consuming and laborious. This scheme selects the keywords of the document automatically by calculating TF-IDF values, which saves labor while also maintaining the accuracy of extraction.

Then, in step S230, for each word among the selected first predetermined number of words, the data feature string corresponding to the word is calculated, and the first data fingerprint of the document is constructed from these data feature strings as a data feature of the document. Specifically, each selected keyword is hashed by a common hash algorithm into a numeric string of a first predetermined length, which serves as the data feature string of that word; the resulting data feature strings are then combined to obtain the first data fingerprint as a data feature of the document.

An example of extracting keywords from a document to generate the first data fingerprint is shown below:

doc_size:2278

word_num:284

keyword_num:5 // first predetermined number

word: injury word_hash:4229635582 offset:18

word: unexpected word_hash:424898618 offset:12

word: insurance word_hash:802497295 offset:24

word: payment word_hash:3684743136 offset:1594

word: official documents and correspondence word_hash:1412961926 offset:1051

For a document of 2278 bytes containing 284 words, 5 keywords are extracted (injury, unexpected, insurance, payment, official documents and correspondence). Each of these 5 keywords is then converted into one data feature string by a common hash algorithm; in this embodiment, a data feature string of the first predetermined length is a 32-bit unsigned int. Concatenating these 5 data feature strings gives the first data fingerprint of the document. According to one implementation, the first data fingerprint also includes, for each of the 5 keywords, its offset position in the document, offset, which records where the keyword first occurs in the document. The offset is mainly used to trace the source of sensitive data (that is, the keyword): the alert sent to the user after a sensitive data leak is detected can carry the offset information. Of course, to save memory, the first data fingerprint may omit the offset information; the invention places no restriction on this.
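A minimal sketch of building such keyword entries is given below. The text does not name the "common hash algorithm", so `zlib.crc32` is used purely as a stand-in that yields a 32-bit unsigned value; the function name is hypothetical, and treating offset as a byte offset into the UTF-8 text is an assumption based on how the offsets in the example appear to behave.

```python
import zlib


def first_fingerprint(text: str, keywords: list) -> list:
    """Build (keyword, 32-bit hash, byte offset of first occurrence) entries.

    zlib.crc32 stands in for the unspecified "common hash algorithm".
    The offset is -1 if the keyword does not occur in the text.
    """
    raw = text.encode("utf-8")
    entries = []
    for kw in keywords:
        kw_bytes = kw.encode("utf-8")
        word_hash = zlib.crc32(kw_bytes) & 0xFFFFFFFF
        offset = raw.find(kw_bytes)
        entries.append((kw, word_hash, offset))
    return entries
```

Dropping the third element of each entry corresponds to the memory-saving variant without offset information.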

To make feature extraction sufficiently accurate, according to an embodiment of the invention, a second data fingerprint can be extracted as a data feature of the document in addition to the first data fingerprint. Fig. 2B shows the flowchart of the method of extracting the second data fingerprint from a document according to another embodiment of the invention. The steps of this method are as follows:

First, the document is segmented into words according to step S210.

Then, on the basis of the segmentation, in step S240 the word sequence of the document is divided into blocks in order, to obtain one or more word blocks of a second predetermined length, where adjacent word blocks overlap by a third predetermined length. In other words, a sliding window of the second predetermined length is slid along the word sequence, moving by the third predetermined length each time, so that the document is divided into multiple word blocks of the second predetermined length. According to one embodiment of the invention, the second predetermined length is set to 64 words and the third predetermined length to 32 words.
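The sliding-window division can be sketched as follows. The function and parameter names are hypothetical, and how the tail of the sequence is handled (here the last block may be shorter than the window) is an assumption, since the text does not say.

```python
def split_into_blocks(words: list, block_size: int = 64, step: int = 32) -> list:
    """Slide a window of block_size words along the sequence, advancing step words each time."""
    if block_size <= 0 or step <= 0:
        raise ValueError("block_size and step must be positive")
    blocks = []
    start = 0
    while True:
        blocks.append(words[start:start + block_size])
        if start + block_size >= len(words):
            break  # the window has reached the end of the sequence
        start += step
    return blocks
```

Under these assumptions, a 284-word sequence with the default 64-word window and 32-word step produces 8 blocks.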

For example, dividing the word sequence of document B above into blocks yields the following 4 word blocks:

Word block 1:[Group Life Accident Insurance material benefit plan unexpected injury refers to by the non-original idea non-disease of external burst Actual bodily harm objective event traffic accident hits generation fire burn falling object from high altitude and hits the injured liquefied gas and coal gas of ruffian's attack of causing injury Cook's boiling oil scald of exploding belongs to two kinds of assembled scheme official documents and correspondences of unexpected injury event recommendation and selects meaning for unit combination actual conditions During 150 yuan of grade employment securities of outer injury insurance insurance premium]

Word block 2:[attacks injured liquefied gas and coal gas explosion cook boiling oil scald and belongs to two kinds of unexpected injury event recommendation combinations Scheme official documents and correspondence selects insurance in 1 year during 150 yuan of grade employment securities of accident/injury insurance insurance premium to blame for unit combination actual conditions Appoint unexpected injury to die to pay 100,000 yuan or pay accidental wound in proportion because 100,000 yuan of parts of unexpected injury Complete Disability accidental burns pair are disabled 10,000 yuan of unexpected injury medical insurance insurance premiums of evil medical treatment]

Word block 3:[insurance responsibility unexpected injury in 1 year, which is die, pays 100,000 yuan or because unexpected injury Complete Disability accidental burns pay 100,000 During unexpected injury 100 yuan of grade employment securities of medical 10,000 yuan of unexpected injury medical insurance insurance premiums are paid in first part deformity in proportion Insurance responsibility unexpected injury in 1 year dies to pay 50,000 yuan or pay 50,000 yuan of part deformity because of the burn of unexpected injury Complete Disability pays meaning in proportion Outer 10,000 yuan of injury medical treatment]

The ten thousand yuan of parts word block 4:[are disabled to pay unexpected injury medical 10,000 yuan of unexpected injury medical insurance insurance premiums 100 in proportion Insurance responsibility unexpected injury in 1 year, which is die, during first grade employment security pays 50,000 yuan or because 50,000 yuan of portions are paid in the burn of unexpected injury Complete Disability Dividing deformity to pay the medical 10,000 Yuan Zhu companies of unexpected injury in proportion can pay the bill not by your unit concrete condition design insurance scheme official documents and correspondence It is ensure that many insure facilitates Claims Resolution rapid] more

Then, in step S250, for the one or more word blocks obtained, the data feature string of each word block is calculated from the data content of that block. Optionally, a locality-sensitive hashing (LSH) algorithm is used to generate one data signature from the data content of each word block.

Then, in step S260, the data feature strings of the word blocks are combined to construct the second data fingerprint of the document as a data feature of the document.

A traditional DLP system generates a data fingerprint for the entire document. The consequence is that, when the document is very large, the performance of the algorithm degrades sharply and its accuracy also drops. This scheme therefore divides the segmented word sequence into blocks and extracts the data feature of each word block as a data feature of the whole document. Moreover, with a common hash algorithm such as MD5, two equal data signatures indicate that the original data is equal with a certain probability, but two unequal signatures indicate only that the original data differs and provide no further information. A traditional hash algorithm therefore clearly cannot defend well against the leakage of similar data. The advantage of the locality-sensitive hashing algorithm used in this method is that it generates identical or similar data signatures for similar data content, which improves the effect of subsequent feature matching.

The step of calculating the data feature string of each word block in step S250 can be further divided into the following steps:

First, the words in each word block of the second predetermined length are converted into characters, and the resulting character string is used as the word block.

Next, sub-blocks of a fourth predetermined length are selected from the word block in turn, where adjacent sub-blocks overlap by a fifth predetermined length. Similarly, in an implementation, a sliding window of the fourth predetermined length (for example, 5 bytes) can be slid along the word block, moving by the fifth predetermined length (for example, 1 byte) each time. As shown in Fig. 8, D1 represents the fourth predetermined length and D2 the fifth predetermined length; every two adjacent D1 windows overlap by D2.

Then, a feature value list of a sixth predetermined length (for example, 32 bytes, that is, 256 bits) is calculated from the content of the sub-blocks.

Finally, the data feature string of the word block is constructed from the feature value list of all sub-blocks.

Specifically, the feature value list of the sixth predetermined length is calculated from the content of the sub-blocks as follows:

1) Extract one or more content subsets, each consisting of part of the content of a sub-block; in other words, extract all triples in each sub-block;

2) Use a hash algorithm to hash each triple into (0, sixth predetermined length); in this embodiment, each triple is hashed into (0, 256);

3) Set the corresponding value in the feature value list according to the value obtained for each content subset. For example, if the hash value of a triple igr is 15, the value at position 15 of the feature value list is incremented by 1.

Once all triples in a word block have been processed, each position in the feature value list holds an accumulated value. The average of all accumulated values is calculated and used as a threshold: if the accumulated value at a position (that is, the value of one cell of the feature value list) is greater than the average, the cell is set to 1; otherwise it is set to 0. This binarization yields a binarized feature value list of the sixth predetermined length, which is converted into a numeric string of the sixth predetermined length and used as the data feature string of the word block.
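The steps above can be sketched as follows. This is an illustrative approximation rather than the exact algorithm: the triple hash is not specified in the text, so `zlib.crc32` modulo 256 is used as a stand-in; "all triples" is interpreted as every 3-byte combination within a 5-byte window; and the fifth predetermined length is treated as the 1-byte step of the window. The function name is hypothetical.

```python
import zlib
from itertools import combinations


def block_signature(block_text: str, window: int = 5, buckets: int = 256) -> str:
    """Return a 256-bit locality-sensitive signature of a word block as a hex string."""
    data = block_text.encode("utf-8")
    counts = [0] * buckets
    # Slide a `window`-byte sub-block along the data, one byte at a time.
    for start in range(max(len(data) - window + 1, 1)):
        sub = data[start:start + window]
        # Each 3-byte combination (positions kept in order) is one "triple".
        for triple in combinations(sub, 3):
            # crc32 modulo `buckets` stands in for the unspecified triple hash.
            counts[zlib.crc32(bytes(triple)) % buckets] += 1
    threshold = sum(counts) / buckets
    # Binarize: a bucket above the mean becomes 1, otherwise 0.
    bits = "".join("1" if c > threshold else "0" for c in counts)
    return f"{int(bits, 2):0{buckets // 4}x}"
```

The result is a 64-character hex string (32 bytes). Because each bit depends on aggregated triple counts rather than on the exact byte sequence, small edits to the block tend to flip only a few bits, which is the property the matching step relies on.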

Suppose the four data feature strings generated by applying the LSH algorithm to the four word blocks above are, respectively:

LSH1:3f26258da0a5310d6b5203845ab0784eb29acff9814564946ce458fc086ac2a8

LSH2:2f2465c480a3312f215d80ce53863000a0ad6b78a3616595ae4c56bc00e9c2ba

LSH3:0fa077e500a531bfa95dd08e53862020a0896b58a362659d2a484ebf00e9cbaa

LSH4:0fa467ed10a5331da959d40e53802000b0897310a1616d592a4844bb02a9c7ea

An example of a second data fingerprint generated for a document is shown below:

doc_size:2278

word_num:284 // number of words

sig_num:8 // number of word blocks

word_block_size:64 // second predetermined length

word_step_size:32 // third predetermined length

LSH:bf32258f90e5390da35083045a83630ab6be5fe9a10577102dc40acd286bd2a8 offset:0

LSH:0f2205c490d01f2fa3d000904a87630896f83fa0a147d79424e446adfc5b50aa offset:254

LSH:b92355848cfa3b6fab5744f14c92ad4892d86e89a143f4bee4a852386e09e2e8 offset:554

LSH:3baf75807cda31ede317c6c4745321ccd2966efd89422598ec62f3dca2f9e39a offset:840

LSH:2fafe588e0b131ade3559a46f31b30c4929e6b7c89d2671cae66f2dc02e9caca offset:1080

LSH:2f2475cc40a731bd2155d0ce53a83000b88d7b5883d06594aa4244be04ebe2ca offset:1326

LSH:0fa467cd00a433bca1d1d48f53a42021b889735481e241bd2a4844bf8ce9c78a offset:1594

LSH:07af6fbde0a5317c61d1544a02802142f29b730c93602cbe2e4ae83b83c9c7ee offset:1797

For this document of 2278 bytes containing 284 words, the LSH algorithm generates a data feature string for each of the 8 word blocks (with the fourth predetermined length set to 5 bytes, the fifth predetermined length to 1 byte, and the sixth predetermined length to 32 bytes). Concatenating these 8 data feature strings gives the second data fingerprint of the document. As described for the first data fingerprint, the second data fingerprint may also include the offset position information (offset) of each word block in the document.
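The block count in this example can be checked with a short Python sketch. It assumes, which the text does not state, that a trailing block shorter than the second predetermined length is kept; the function name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(words, block_size=64, step=32):
    """Split a word sequence into overlapping blocks.

    block_size is the second predetermined length and step the third.
    Assumption: the final, possibly shorter, block is kept.
    """
    blocks = []
    start = 0
    while True:
        blocks.append(words[start:start + block_size])
        if start + block_size >= len(words):
            break  # this block already reaches the end of the sequence
        start += step
    return blocks


words = [f"w{i}" for i in range(284)]
blocks = split_into_blocks(words)
print(len(blocks))      # 8
print(len(blocks[-1]))  # 60
```

With 284 words, blocks start at words 0, 32, ..., 224, which gives the 8 word blocks recorded in sig_num; the last block holds the remaining 60 words.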

It should be noted that, during execution of method 200, the first predetermined number, the second predetermined length, and the third predetermined length may be set according to the importance of the document, so as to vary the granularity of feature extraction. That is, the more important the document, the more keywords are extracted (the first predetermined number), and the smaller the size of each block (the second predetermined length) and the shift step (the third predetermined length) used in blocking.

This completes the steps of extracting the first and second data features from a document: through method 200, the first data fingerprint and the second data fingerprint are extracted from the document as its data features. Correspondingly, Fig. 3A and Fig. 3B respectively show schematic diagrams of a device 300, according to embodiments of the present invention, for implementing method 200 by extracting the first data features and the second data features from a document.

As shown in Fig. 3A, the feature extraction device 300 includes a word segmentation module 310, a computing module 320, a selection module 330, and a feature extraction module 340.

The word segmentation module 310 is adapted to perform word segmentation on the document to obtain a word sequence. According to an embodiment of the present invention, a dictionary-based segmentation algorithm (for example, MMSEG) is used to segment the document.

The computing module 320 is adapted to calculate, for each word in the word sequence, a feature value characterizing the importance of the word in the document (for example, a TF-IDF value).

The selection module 330 is adapted to select the first predetermined number of words from the word sequence based on the feature values, for example selecting 5 words in descending order of feature value, and then to pass them to the computing module 320 coupled to it, which calculates the data feature string corresponding to each selected word. Optionally, the computing module 320 is adapted to hash each word, using a common hash algorithm, into a numeric string of the first predetermined length, which serves as the data feature string corresponding to that word.

The feature extraction module 340 is adapted to construct the first data fingerprint of the document from the calculated data feature strings, as the data features of the document. Optionally, the feature extraction module 340 is adapted to combine the obtained data feature strings to obtain the first data fingerprint as the data features of the document.

The word segmentation method and the method of calculating the feature value characterizing the importance of a word in the document (taking the TF-IDF value as an example) have been disclosed in detail in the steps described with reference to Fig. 2A, and are not repeated here.
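How modules 320, 330 and 340 cooperate to produce the first data fingerprint can be sketched as below. Several details are assumptions made only for illustration: the IDF form log(N / (1 + n)), the use of CRC32 as the "common hash algorithm" (it happens to yield 32 bits), a non-empty document library, and the hypothetical function names `tf_idf_scores` and `first_fingerprint`.

```python
import math
import zlib
from collections import Counter


def tf_idf_scores(doc_words, library):
    """Score each distinct word of doc_words against a library of word lists."""
    tf = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for word, count in tf.items():
        containing = sum(1 for other in library if word in other)
        # Assumed IDF form; the +1 avoids division by zero for unseen words.
        idf = math.log(len(library) / (1 + containing))
        scores[word] = (count / total) * idf
    return scores


def first_fingerprint(doc_words, library, top_k=5):
    """Hash the top_k words by TF-IDF into 8-hex-digit (32-bit) strings."""
    scores = tf_idf_scores(doc_words, library)
    # Sort by descending score; ties are broken alphabetically for determinism.
    keywords = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    # CRC32 is an assumed stand-in for the unspecified hash algorithm.
    return [f"{zlib.crc32(w.encode('utf-8')):08x}" for w in keywords]
```

Here `top_k` plays the role of the first predetermined number, and the returned list of hash strings corresponds to the combined data feature strings that make up the first data fingerprint.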

According to one implementation, the feature extraction device 300 is further adapted to extract the second data fingerprint of the document. As shown in Fig. 3B, in this case the feature extraction device 300 includes, in addition to the computing module 320 and the feature extraction module 340, a blocking module 350, where the blocking module 350 is coupled to the computing module 320, and the computing module 320 is coupled to the feature extraction module 340.

The blocking module 350 is adapted to divide the word sequence of the document into blocks in order, so as to obtain one or more word blocks of the second predetermined length, where adjacent word blocks overlap each other by the third predetermined length.

Meanwhile, the computing module 320 is further adapted to calculate, for each of the one or more word blocks obtained, the data feature string of that word block based on the data content in it. The feature extraction module 340 is further adapted to combine the data feature strings of the word blocks to construct the second data fingerprint of the document, as the data features of the document.

In line with the above description of the steps for calculating the data feature string of each word block, the computing module 320 further includes a character conversion unit 322, a blocking unit 324, and a computing unit 326.

The character conversion unit 322 is adapted to convert the words in each word block of the second predetermined length into characters, obtaining a corresponding character string as the word block.

The blocking unit 324 is adapted to select, in sequence, sub-word blocks of the fourth predetermined length from the word block, where adjacent sub-word blocks overlap each other by the fifth predetermined length.

The computing unit 326 is adapted to calculate a feature value list of the sixth predetermined length according to the content of each sub-word block. Optionally, the computing unit 326 may include an extraction subunit adapted to extract one or more content subsets, each consisting of part of the content of the sub-word block. The computing unit then uses a hash algorithm to hash each content subset to a value between 0 and the sixth predetermined length, and sets the corresponding value in the feature value list of the sixth predetermined length according to the value obtained for each content subset. The computing unit may also include a counting subunit, adapted to superimpose and merge the values at corresponding positions of the feature value lists of the individual sub-word blocks so as to obtain the feature value list of the word block, and a binarization subunit, adapted to binarize the value of each unit in that feature value list so as to obtain a feature value list in which every unit value is 0 or 1. The computing unit is further adapted to convert this feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length, to serve as the data feature string of the word block.

The binarization subunit is adapted to calculate the average of all unit values in the feature value list and to compare the value of each unit with that average: if the value of a unit is greater than the average, the value of the unit is set to 1; if it is not greater than the average, the value of the unit is set to 0.

According to one embodiment, the second predetermined length is 64 words, the third predetermined length is 32 words, the fourth predetermined length is 5 bytes, the fifth predetermined length is 1 byte, and the sixth predetermined length is 32 bytes (that is, 256 bits). As described for method 200, the first predetermined number, the second predetermined length, and the third predetermined length may be set according to the importance of the document, so as to vary the granularity of feature extraction. That is, the more important the document, the more keywords are extracted (the first predetermined number), and the smaller the size of each block (the second predetermined length) and the shift step (the third predetermined length) used in blocking.

Optionally, the feature extraction module 340 is further adapted to extract the offset position information of each of the first predetermined number of words in the document, to be included in the first data fingerprint, and to extract the offset position information of each word block in the document, to be included in the second data fingerprint.

In summary, this scheme extracts the data features of a document by automatically extracting document keywords and/or extracting data fingerprints of word blocks. On the one hand, the keywords of the document are selected by calculating feature values characterizing word importance, so there is no need to rely on manually set keywords. On the other hand, the word sequence of the document is divided into blocks, and the data fingerprint of each word block is generated using a locality-sensitive hashing (LSH) algorithm, which can effectively prevent the leakage of similar data and preserves the accuracy of feature extraction even when the document is very large.

Fig. 4 shows a flowchart of a judgment method 400, according to an embodiment of the present invention, for judging whether a first document and a second document are related.

As shown in Fig. 4, the method begins at step S410, in which the steps of method 200 are performed on the first document and the data features of the document are extracted to obtain a first feature set, where the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document.

Then, in step S420, the steps of method 200 are likewise performed on the second document and the data features of the document are extracted to obtain a second feature set, where the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document.

Then, in step S430, the similarity between the first feature set and the second feature set is calculated; if the similarity reaches a preset range, the first document and the second document are considered related.

The process in step S430 of matching features and calculating the similarity between documents can be further divided into the following three kinds.

◆ the first is single matching:

Calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set.

Here, the Hamming distance refers to the number of corresponding binary positions at which two data feature strings (of equal length) differ. Let d(x, y) denote the Hamming distance between two data feature strings x and y: an XOR operation is performed on the two data feature strings and the number of 1s in the result is counted; the value obtained is the Hamming distance. When the Hamming distance is greater than a first threshold, the two data feature strings are judged to be similar. For example:

Hamming distance between 1011101 and 1001001 is 2；

Hamming distance between " toned " and " roses " is 3.
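The XOR-and-count procedure, and the two examples above, can be reproduced with the following Python sketch. The function names `hamming_bits` and `hamming_symbols` are illustrative only; the first operates on hexadecimal feature strings such as the LSH values shown earlier, the second on arbitrary equal-length strings.

```python
def hamming_bits(x: str, y: str) -> int:
    """Hamming distance, in bits, between two equal-length hex strings."""
    if len(x) != len(y):
        raise ValueError("feature strings must have equal length")
    # XOR the two values and count the 1 bits in the result.
    return bin(int(x, 16) ^ int(y, 16)).count("1")


def hamming_symbols(x: str, y: str) -> int:
    """Hamming distance counted per character position."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


print(bin(0b1011101 ^ 0b1001001).count("1"))  # 2
print(hamming_symbols("toned", "roses"))      # 3
```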

For single matching, the Hamming distances between the data feature strings in the second data fingerprint of the first document and the data feature strings in the second data fingerprint of the second document are calculated; when any data feature string in the second data fingerprint is judged to be similar, the similarity between the first feature set and the second feature set is considered to have reached the preset range. In other words, as long as any one word block is similar, the document is suspected of leaking data.

The following pseudocode performs single matching on the second data fingerprints of two documents and returns whether they are related. signature_base and signature_cmp denote the first feature set and the second feature set, respectively, and nilsima_base and nilsima_cmp denote the data feature strings in the second data fingerprints of the two documents:

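def single_match(signature_base, signature_cmp, threshold):
    # Compare every block feature string of one document with every one of the other.
    for nilsima_base in signature_base:
        for nilsima_cmp in signature_cmp:
            ham_dist = hamming_distance(nilsima_base, nilsima_cmp)
            # As stated in the description, a pair is judged similar when ham_dist exceeds the first threshold.
            if ham_dist > threshold:
                return 1  # one similar word block is enough
    return 0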

◆ second is benchmark matching:

Likewise, first calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set, to determine whether the two data feature strings are similar. Then count the number of data feature strings in the first feature set that are similar to those in the second feature set and calculate the ratio of this number to the total number of data feature strings in the first feature set; if the ratio reaches a second threshold, the similarity between the first feature set and the second feature set is considered to have reached the preset range.

The following pseudocode performs benchmark matching on the first data fingerprints of two documents and returns whether they are related. word_hash_base and word_hash_cmp denote the data feature strings in the first data fingerprints of the two documents, and keyword_base_num denotes the total number of data feature strings in the first feature set.

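def benchmark_match(signature_base, signature_cmp):
    keyword_base_num = signature_base.keyword_num
    simlar_num = 0
    for word_hash_base in signature_base:
        for word_hash_cmp in signature_cmp:
            # Keyword hashes are compared for exact equality.
            if word_hash_cmp == word_hash_base:
                simlar_num += 1
    # Share of the base document's keyword hashes that also occur in the compared document.
    return simlar_num / keyword_base_num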

◆ the third is whole matching:

Calculate the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set. When the Jaccard coefficient reaches a third threshold, the similarity between the first feature set and the second feature set is considered to have reached the preset range, and the first document and the second document are related.

Here, the Jaccard coefficient is the ratio of the size of the intersection of two sets to the size of their union:

Jaccard=| S ∩ T |/| S ∪ T |,

where S denotes the first feature set and T denotes the second feature set.
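As a small worked example of this formula, the sketch below computes the coefficient with Python sets. The function name `jaccard`, the sample feature strings, and the convention of returning 0.0 for two empty sets are assumptions for illustration.

```python
def jaccard(s, t):
    """Jaccard coefficient |S ∩ T| / |S ∪ T| of two collections of feature strings."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(s & t) / len(union)


base = {"a1", "b2", "c3", "d4"}   # first feature set S
cmp_ = {"b2", "c3", "e5"}         # second feature set T
print(jaccard(base, cmp_))        # 0.4
```

Two strings are shared and five distinct strings occur overall, so the coefficient is 2/5.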

The following pseudocode performs whole matching on the first data fingerprints of two documents and returns whether they are related. keyword_base_num and keyword_cmp_num denote the total number of data feature strings in the two feature sets, respectively, and word_hash_base and word_hash_cmp denote the data feature strings in the first data fingerprints of the two documents.

def whole_match(signature_base, signature_cmp):
    keyword_base_num = signature_base.keyword_num
    keyword_cmp_num = signature_cmp.keyword_num
    simlar_num = 0
    for word_hash_base in signature_base:
        for word_hash_cmp in signature_cmp:
            if word_hash_cmp == word_hash_base:
                simlar_num += 1
    # |S ∩ T| / |S ∪ T|, using |S ∪ T| = |S| + |T| - |S ∩ T|
    return simlar_num / (keyword_base_num + keyword_cmp_num - simlar_num)

Method 400 provides three modes for matching similar content in two documents; optionally, the Hamming distance or the Jaccard coefficient may be used to characterize the similarity between documents. In this way, sensitive data can be matched more comprehensively, providing a strong guarantee against sensitive data leakage.

Correspondingly, Fig. 5 shows a schematic diagram of a judgment device 500, according to an embodiment of the present invention, for implementing method 400 by judging whether a first document and a second document are related. As shown in Fig. 5, the document relevance judgment device 500 includes the feature extraction device 300, a similarity calculation module 510, and a similarity judgment module 520, where the similarity calculation module 510 is coupled to the feature extraction device 300 and to the similarity judgment module 520, respectively.

The feature extraction device 300 is adapted to extract the first feature set of the first document and the second feature set of the second document, respectively, where the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document, and the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document.

The similarity calculation module 510 is adapted to calculate the similarity between the first feature set and the second feature set.

The similarity judgment module 520 is adapted to consider the first document and the second document related when it judges that the similarity reaches the preset range.

According to an embodiment of the present invention, the similarity calculation module 510 further includes a similarity calculation unit for calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set.

The similarity judgment module 520 is further adapted to judge two corresponding data fingerprints to be similar when the Hamming distance is greater than the first threshold. Specifically, for the single matching mode, the similarity judgment module 520 is adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when any data feature string in the second data fingerprint is judged to be similar.

For the benchmark matching mode, the similarity judgment module 520 may further include a statistics unit for counting the number of data feature strings in the first feature set that are similar to those in the second feature set and for calculating the ratio of this number to the total number of data feature strings in the first feature set. The similarity judgment module 520 is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when the ratio reaches the second threshold.

According to yet another embodiment of the present invention, in the whole matching mode, the similarity calculation unit is further adapted to calculate the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set; when the Jaccard coefficient reaches the third threshold, the similarity judgment module 520 determines that the similarity between the first feature set and the second feature set reaches the preset range. The Jaccard coefficient characterizes the degree of correlation between two sets:

Jaccard=| S ∩ T |/| S ∪ T |,

where S denotes the first feature set and T denotes the second feature set.

Fig. 6 shows a flowchart of a method 600, according to an embodiment of the present invention, for judging whether a suspicious document contains sensitive content. As shown in Fig. 6, the method begins at step S610, in which the steps of method 200 are performed on the confidential documents, the data features of each document are extracted, and a feature library is established, where the feature library contains the first data fingerprints and the second data fingerprints of all confidential documents.

Then, in step S620, the steps of method 400 are performed on the suspicious document. During execution of method 400, the data features of the suspicious document are extracted as the second feature set, and the feature library obtained in the previous step is used as the first feature set, so as to judge whether the suspicious document is related to the confidential documents.

Then, in step S630, if the suspicious document is judged to be related to the confidential documents, the suspicious document is considered to contain sensitive content; if the suspicious document is judged to be unrelated to the confidential documents, the suspicious document is considered not to contain sensitive content.
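The three steps of method 600 can be illustrated with the following sketch. It is a simplified stand-in rather than the described system: the feature library is modeled as a dictionary from a document name to a set of feature strings, relatedness is decided by whole matching (the Jaccard coefficient) only, and the names `contains_sensitive_content` and `feature_library`, the sample values, and the threshold 0.5 are all hypothetical.

```python
def jaccard(s, t):
    """Jaccard coefficient of two sets of feature strings."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def contains_sensitive_content(suspicious_features, feature_library, threshold=0.5):
    """Return True if the suspicious document is related to any library document.

    feature_library plays the role of the first feature set (step S610),
    suspicious_features that of the second feature set (step S620).
    """
    return any(
        jaccard(protected, suspicious_features) >= threshold
        for protected in feature_library.values()
    )


feature_library = {
    "contract.docx": {"3f26", "a0a5", "6b52", "5ab0"},
    "roadmap.pptx": {"0fa4", "10a5", "a959"},
}
suspicious = {"3f26", "a0a5", "6b52", "ffff"}
print(contains_sensitive_content(suspicious, feature_library))  # True
```

In this example the suspicious document shares three of five distinct feature strings with contract.docx, a coefficient of 0.6, so it is flagged as containing sensitive content (step S630).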

Correspondingly, Fig. 7 shows a schematic diagram of a device, according to an embodiment of the present invention, for implementing method 600 by judging whether a suspicious document contains sensitive content, that is, the sensitive content judgment device 700 described in Fig. 1. The device 700 includes the feature extraction device 300, a storage module 710, the document relevance judgment device 500, and a determination module 720. According to an embodiment of the present invention, the feature extraction device 300 may also be arranged within the document relevance judgment device 500.

As described above, the feature extraction device 300 is adapted to extract data features from the confidential documents, and is further adapted to extract the data features of the suspicious document as the second feature set.

The storage module 710 is adapted to store the data features of the confidential documents as a feature library, where the feature library contains the first data fingerprints and the second data fingerprints of the confidential documents.

The document relevance judgment device 500 is adapted to judge whether the suspicious document is related to the confidential documents in the feature library; and

the determination module 720 is adapted to determine that the suspicious document contains sensitive content when the suspicious document is judged to be related to the confidential documents, and to determine that the suspicious document does not contain sensitive content when the suspicious document is judged to be unrelated to the confidential documents.

In summary, in the method and system for data leakage prevention according to the present invention, the document feature extraction method provided can extract the data features of a document more conveniently while capturing as much data feature information as possible. In addition, three matching modes (single matching, benchmark matching, and whole matching) are provided to match sensitive data comprehensively, so that various means of leaking documents can be effectively guarded against.

It should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.

Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may further be divided into multiple submodules.

Those skilled in the art will understand that the modules in the devices of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in an embodiment may be combined into one module, unit, or component, and may further be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

A5. The method as described in any one of A1-4, wherein the step of constructing the first data fingerprint of the document, based on the selected first predetermined number of words, as the data features of the document comprises: for each of the first predetermined number of words, hashing the word by a common hash algorithm into a numeric string of the first predetermined length as the data feature string corresponding to that word; and combining the obtained data feature strings to obtain the first data fingerprint as the data features of the document.

A6. The method as described in any one of A1-5, wherein the step of performing word segmentation on the document comprises: performing word segmentation using a dictionary-based segmentation algorithm, where the segmentation algorithm includes a dictionary, two matching algorithms, and four disambiguation rules.

A7. The method as described in any one of A2-6, wherein the step of calculating the data feature string of a word block based on the data content in the word block comprises: converting the words in each word block of the second predetermined length into characters, obtaining a corresponding character string as the word block; selecting, in sequence, sub-word blocks of the fourth predetermined length from the word block, where adjacent sub-word blocks overlap each other by the fifth predetermined length; for each sub-word block, calculating a feature value list of the sixth predetermined length according to the content of the sub-word block; and constructing the data feature string of the word block based on the feature value lists of all sub-word blocks.

A8. The method as described in A7, wherein the step of calculating the feature value list of the sixth predetermined length according to the content of the sub-word block comprises: extracting one or more content subsets, each consisting of part of the content of the sub-word block; hashing each content subset, using a hash algorithm, to a value between 0 and the sixth predetermined length; and setting the corresponding value in the feature value list of the sixth predetermined length according to the value obtained for each content subset.

A9. The method as described in A8, wherein the step of constructing the data feature string of the word block based on the feature value lists of all sub-word blocks comprises: superimposing and merging the values at corresponding positions of the feature value lists of the individual sub-word blocks, to obtain the feature value list of the word block; binarizing the value of each unit in that feature value list, to obtain a feature value list in which every unit value is 0 or 1; and converting the feature value list of the sixth predetermined length into a numeric string of sixth-predetermined-length bits, as the data feature string of the word block.

A10. The method as described in A9, wherein the step of binarizing the value of each unit in the feature value list comprises: calculating the average of all unit values in the feature value list; comparing the value of each unit with the average; and, if the value of a unit is greater than the average, setting the value of the unit to 1, and if the value of a unit is not greater than the average, setting the value of the unit to 0.

A11. The method as described in any one of A7-10, wherein the first predetermined number is 5 and the first predetermined length is 32; the second predetermined length is 64 words and the third predetermined length is 32 words; the fourth predetermined length is 5 bytes and the fifth predetermined length is 1 byte; and the sixth predetermined length is 32 bytes.

A12. The method as described in any one of A1-11, wherein the first data fingerprint further includes the offset position information of each of the first predetermined number of words in the document.

A13. The method as described in any one of A2-12, wherein the second data fingerprint further includes the offset position information of the word blocks in the document.

B15. The device as described in B14, further comprising: a blocking module adapted to divide the word sequence of the document into blocks in order, to obtain one or more word blocks of the second predetermined length, where adjacent word blocks overlap each other by the third predetermined length; wherein the computing module is further adapted to calculate, for each of the one or more word blocks obtained, the data feature string of that word block based on the data content in it; and the feature extraction module is further adapted to combine the data feature strings of the word blocks to construct the second data fingerprint of the document as the data features of the document.

B16. The device as described in B14, wherein the computing module is further adapted to: calculate the frequency of occurrence of the word in the document as the term frequency of the word; calculate the ratio between the total number of documents in a document library and the number of documents in the document library that contain the word as the inverse document frequency of the word; and calculate, from the term frequency and the inverse document frequency of the word, the feature value characterizing the importance of the word in the document.

B17. The device as described in B16, wherein the feature value TF-IDF characterizing the importance of the word in the document is defined as: TF-IDF = TF × IDF, where TF is the term frequency of the word and IDF is the inverse document frequency of the word, and TF and IDF are respectively:

B18. The device as described in B17, wherein the selection module is further adapted to select the first predetermined number of words in descending order of the calculated TF-IDF values.

B19. The device as described in any one of B16-19, wherein the computing module is further adapted to hash each of the selected first predetermined number of words, by a common hash algorithm, into a numeric string of the first predetermined length as the data feature string corresponding to that word; and the feature extraction module is further adapted to combine the obtained data feature strings to obtain the first data fingerprint as the data features of the document.

B20. The device as described in any one of B14-33, wherein the word segmentation module is further adapted to perform word segmentation using a dictionary-based segmentation algorithm, where the segmentation algorithm includes a dictionary, two matching algorithms, and four disambiguation rules.

B21. The device as described in any one of B15-20, wherein the computing module further comprises: a character conversion unit adapted to convert the words in each word block of the second predetermined length into characters, obtaining a corresponding character string as the word block; a blocking unit adapted to select, in sequence, sub-word blocks of the fourth predetermined length from the word block, where adjacent sub-word blocks overlap each other by the fifth predetermined length; and a computing unit adapted to calculate, for each sub-word block, a feature value list of the sixth predetermined length according to the content of the sub-word block, and to construct the data feature string of the word block based on the feature value lists of all sub-word blocks.

B22. The device as described in B21, wherein the computing unit further comprises: an extraction subunit adapted to extract one or more content subsets, each consisting of part of the content of the sub-word block; and the computing unit is further adapted to hash each content subset, using a hash algorithm, to a value between 0 and the sixth predetermined length, and to set the corresponding value in the feature value list of the sixth predetermined length according to the value obtained for each content subset.

B23. The device as described in B22, wherein the computing unit further comprises: a counting subunit adapted to superimpose and merge the values at corresponding positions of the feature value lists of the individual sub-word blocks, to obtain the feature value list of the word block; and a binarization subunit adapted to binarize the value of each unit in that feature value list, to obtain a feature value list in which every unit value is 0 or 1; and the computing unit is further adapted to convert the feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length, as the data feature string of the word block.

B24. The device as described in B23, wherein the binarization subunit is further adapted to calculate the average of all unit values in the feature value list and to compare the value of each unit with the average: if the value of a unit is greater than the average, the value of the unit is 1; if the value of a unit is not greater than the average, the value of the unit is 0.

B25. The device as described in any one of B15-24, wherein the first predetermined number is 5 and the first predetermined length is 32; the second predetermined length is 64 words and the third predetermined length is 32 words; the fourth predetermined length is 5 bytes and the fifth predetermined length is 1 byte; and the sixth predetermined length is 32 bytes.

B26. The device as described in any one of B14-25, wherein the feature extraction module is further adapted to extract the offset position information of each of the first predetermined number of words in the document, to be included in the first data fingerprint.

B27. The device as described in any one of B15-26, wherein the feature extraction module is further adapted to extract the offset position information of the word blocks in the document, to be included in the second data fingerprint.

C29. The judgment method of C28, wherein the step of calculating the similarity between the first feature set and the second feature set comprises: calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and when the Hamming distance is greater than a first threshold, determining that the two corresponding data feature strings are similar.

C30. The method of C29, further comprising the step of: when any data feature string in the second data fingerprint is judged to be similar, determining that the similarity between the first feature set and the second feature set reaches the predetermined range.

C31. The method of C29, further comprising the steps of: counting the number of data feature strings in the first feature set that are similar to the second feature set; calculating the ratio of that number to the total number of data feature strings in the first feature set; and if the ratio reaches a second threshold, determining that the similarity between the first feature set and the second feature set reaches the predetermined range.

C32. The method of C28, wherein the step of calculating the similarity between the first feature set and the second feature set further comprises: calculating the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set; and when the Jaccard coefficient reaches a third threshold, determining that the similarity between the first feature set and the second feature set reaches the predetermined range.

C33. The method of C32, wherein the Jaccard coefficient is: Jaccard = |S ∩ T| / |S ∪ T|, where S denotes the first feature set and T denotes the second feature set.

D35. The judgment device of D34, wherein the similarity calculation module further comprises: a similarity calculation unit adapted to calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and the similarity judgment module is further adapted to determine, when the Hamming distance is greater than a first threshold, that the two corresponding data feature strings are similar.

D36. The judgment device of D35, wherein the similarity judgment module is further adapted to determine, when a data feature string in the second data fingerprint is judged to be similar, that the similarity between the first feature set and the second feature set reaches the predetermined range.

D37. The judgment device of D35, wherein the similarity judgment module further comprises: a statistics unit adapted to count the number of data feature strings in the first feature set that are similar to the second feature set and to calculate the ratio of that number to the total number of data feature strings in the first feature set; and the similarity judgment module is further adapted to determine, when the ratio reaches a second threshold, that the similarity between the first feature set and the second feature set reaches the predetermined range.

D38. The judgment device of D35, wherein the similarity calculation unit is further adapted to calculate the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set; and the similarity judgment module is further adapted to determine, when the Jaccard coefficient reaches a third threshold, that the similarity between the first feature set and the second feature set reaches the predetermined range.

D39. The judgment device of D38, wherein the Jaccard coefficient is: Jaccard = |S ∩ T| / |S ∪ T|, where S denotes the first feature set and T denotes the second feature set.

Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

Furthermore, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the described function. A processor having the necessary instructions for implementing such a method or method element therefore forms a means for implementing the method or method element. In addition, an element of a device embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of implementing the invention.

As used herein, unless otherwise specified, the use of the ordinal numbers "first", "second", "third" and so on to describe an ordinary object merely indicates that different instances of similar objects are being referred to, and is not intended to imply that the objects so described must be in a given order, whether temporally, spatially, in ranking, or in any other manner.

Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be envisaged within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected primarily for readability and instructional purposes, not to explain or limit the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is intended to be illustrative, not limiting, of the scope of the invention, which is defined by the appended claims.

Claims

1. A method for extracting data features from a document, comprising the steps of:

performing word segmentation on the document to obtain a word sequence;

for each word in the word sequence, calculating a feature value characterizing the importance of the word within the document, and selecting a first predetermined number of words from the word sequence based on the feature values, wherein the first predetermined number is set according to the importance level of the document; and

for each of the selected first predetermined number of words, calculating the data feature string corresponding to the word, and constructing, based on the data feature strings, a first data fingerprint of the document as a data feature of the document, wherein the first data fingerprint further includes offset position information of the first predetermined number of words in the document, the offset position information recording the position at which each word first occurs in the document.

2. The method of claim 1, further comprising the steps of:

dividing the word sequence of the document, in order, into one or more word blocks of a second predetermined length, wherein adjacent word blocks overlap each other by a third predetermined length;

for the one or more obtained word blocks, calculating the data feature string of each word block based on the data content of that word block; and

combining the data feature strings of the word blocks to construct a second data fingerprint of the document as a data feature of the document.
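
As a rough illustration of the overlapping chunking described in claim 2, the following Python sketch splits a word sequence into word blocks. The function name `chunk_words`, the default lengths of 64 and 32 words, and the choice to keep a shorter trailing block are assumptions made for this example, not details fixed by the claim.

```python
def chunk_words(words, block_len=64, overlap=32):
    """Split a word sequence into blocks of block_len words.

    Adjacent blocks share `overlap` words. Returns (offset, block) pairs,
    where offset is the index of the block's first word in the sequence.
    """
    if not 0 <= overlap < block_len:
        raise ValueError("overlap must satisfy 0 <= overlap < block_len")
    step = block_len - overlap  # distance between the starts of adjacent blocks
    blocks = []
    for start in range(0, len(words), step):
        blocks.append((start, words[start:start + block_len]))
        if start + block_len >= len(words):
            break  # this block already reaches the end of the sequence
    return blocks


if __name__ == "__main__":
    demo = [f"w{i}" for i in range(100)]
    for offset, block in chunk_words(demo):
        print(offset, len(block))
```

With these assumed defaults, each block starts 32 words after the previous one, so every word away from the document edges falls into two blocks.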

3. The method of claim 2, wherein the step of calculating the feature value characterizing the importance of the word within the document comprises:

calculating the frequency of occurrence of the word in the document as the term frequency of the word;

calculating the ratio between the total number of documents in a document library and the number of documents in the library that contain the word as the inverse document frequency of the word; and

calculating, from the term frequency and the inverse document frequency of the word, the feature value characterizing the importance of the word in the document.

4. The method of claim 3, wherein the feature value TF-IDF characterizing the importance of the word in the document is defined as:

TF-IDF = TF × IDF,

where TF is the term frequency of the word, IDF is the inverse document frequency of the word, and TF and IDF are respectively given by:

and

the step of selecting the first predetermined number of words from the word sequence based on the feature values comprises:

selecting the first predetermined number of words in descending order of the calculated TF-IDF values.

5. The method of claim 4, wherein the step of constructing the first data fingerprint of the document as a data feature of the document based on the selected first predetermined number of words comprises:

for each of the first predetermined number of words, hashing the word with a common hash algorithm into a numeric string of a first predetermined length as the data feature string corresponding to the word; and

combining the obtained data feature strings to obtain the first data fingerprint as a data feature of the document.
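
The steps of claims 3 to 5 might be sketched in Python as follows. Several points are assumptions made only for illustration: TF is taken as the word count divided by the document length, IDF is taken as the plain ratio described in claim 3 (the exact TF and IDF formulas are not reproduced in this text), MD5 stands in for the unspecified "common hash algorithm" because its hexadecimal digest happens to be 32 characters long, and offsets are word indices. The names `first_fingerprint`, `doc_freq` and `total_docs` are hypothetical.

```python
import hashlib
from collections import Counter


def first_fingerprint(words, doc_freq, total_docs, top_n=5):
    """Return [(feature_string, first_offset), ...] for the top_n words by TF-IDF.

    words      -- word sequence of the document (output of word segmentation)
    doc_freq   -- assumed mapping: word -> number of library documents containing it
    total_docs -- assumed total number of documents in the library
    """
    counts = Counter(words)
    first_pos = {}
    for index, word in enumerate(words):
        first_pos.setdefault(word, index)  # remember the first occurrence only

    scores = {}
    for word, count in counts.items():
        tf = count / len(words)
        idf = total_docs / max(1, doc_freq.get(word, 0))  # plain ratio, assumed form
        scores[word] = tf * idf

    # Highest TF-IDF first; ties broken by earliest position for determinism.
    top_words = sorted(scores, key=lambda w: (-scores[w], first_pos[w]))[:top_n]

    fingerprint = []
    for word in top_words:
        feature = hashlib.md5(word.encode("utf-8")).hexdigest()  # 32 hex characters
        fingerprint.append((feature, first_pos[word]))
    return fingerprint
```

Keeping the first-occurrence offset next to each feature string mirrors the offset position information that claim 1 places in the first data fingerprint.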

6. The method of claim 5, wherein the step of performing word segmentation on the document comprises:

performing word segmentation using a dictionary-based segmentation algorithm, wherein the segmentation algorithm includes one dictionary, two matching algorithms and four disambiguation rules.
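
Claim 6 does not spell out the two matching algorithms or the four disambiguation rules. The Python sketch below shows only one simple dictionary-based strategy, forward maximum matching, as an illustrative assumption; the disambiguation rules are not modeled, and the function name `forward_max_match` and the parameter `max_word_len` are hypothetical.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text greedily, taking the longest dictionary word at each position.

    Falls back to a single character when no dictionary word matches.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    vocab = {"数据", "安全", "数据安全", "防护"}
    print(forward_max_match("数据安全防护", vocab))
```

A purely greedy matcher like this can pick the wrong split when several dictionary words overlap, which is presumably the situation the disambiguation rules in the claim are meant to handle.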

7. The method of claim 6, wherein the step of calculating the data feature string of a word block based on the data content of the word block comprises:

converting the words in each word block of the second predetermined length into characters, the resulting character string serving as the word block;

successively selecting sub-blocks of a fourth predetermined length from the word block, wherein adjacent sub-blocks overlap each other by a fifth predetermined length;

for each sub-block, calculating a feature value list of a sixth predetermined length from the content of the sub-block; and

constructing the data feature string of the word block from the feature value lists of all sub-blocks.

8. The method of claim 7, wherein the step of calculating the feature value list of the sixth predetermined length from the content of the sub-block comprises:

extracting one or more content subsets, each consisting of part of the content of the sub-block;

hashing each content subset, using a hash algorithm, to a value between 0 and the sixth predetermined length; and

setting the corresponding value in the feature value list of the sixth predetermined length according to the value obtained for each content subset.

9. The method of claim 8, wherein the step of constructing the data feature string of the word block from the feature value lists of all sub-blocks comprises:

merging the feature value lists of the sub-blocks by adding up the values at corresponding positions, to obtain the feature value list of the word block;

binarizing the value of each cell in this feature value list, yielding a feature value list in which every cell value is 0 or 1; and

converting the feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length, as the data feature string of the word block.

10. The method of claim 9, wherein the step of binarizing the value of each cell in the feature value list comprises:

calculating the average of all cell values in the feature value list;

comparing the value of each cell with the average; and

if the value of a cell is greater than the average, setting the cell to 1, and if the value of a cell is not greater than the average, setting the cell to 0.
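
The per-block computation of claims 7 to 10 can be sketched as follows, under stated assumptions. The claims do not say which content subsets are extracted, which hash is used, or how a hashed value "sets" a list cell; this sketch assumes all 3-byte combinations of each sub-block, the first byte of an MD5 digest reduced modulo the list length, and incrementing the addressed cell. It also treats the sixth predetermined length as the number of list cells. The names `sub_block_features` and `block_feature_string` and the defaults (5-byte sub-blocks, 1-byte overlap, 32 cells) are illustrative.

```python
import hashlib
from itertools import combinations


def sub_block_features(sub_block: bytes, list_len=32):
    """Feature value list for one sub-block: a histogram of hashed content subsets."""
    values = [0] * list_len
    for subset in combinations(sub_block, 3):  # assumed choice of content subsets
        cell = hashlib.md5(bytes(subset)).digest()[0] % list_len
        values[cell] += 1  # assumed meaning of "setting" the corresponding value
    return values


def block_feature_string(block_text, sub_len=5, overlap=1, list_len=32):
    """Data feature string of a word block, as a string of '0' and '1' characters."""
    data = block_text.encode("utf-8")  # words converted to a character/byte string
    step = sub_len - overlap
    totals = [0] * list_len
    # Slide over the block; a trailing partial window is ignored.
    for start in range(0, max(len(data) - sub_len + 1, 1), step):
        features = sub_block_features(data[start:start + sub_len], list_len)
        for i, value in enumerate(features):
            totals[i] += value  # merge lists by adding values at the same position
    mean = sum(totals) / list_len
    # Binarize: 1 if strictly greater than the average, otherwise 0.
    return "".join("1" if value > mean else "0" for value in totals)


if __name__ == "__main__":
    print(block_feature_string("data security protection"))
```

Because nearby text changes only a few of the hashed subsets, two blocks with similar content tend to produce feature strings that differ in few positions, which is what makes a bitwise comparison between them meaningful.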

11. The method of claim 10, wherein:

the first predetermined number is 5 and the first predetermined length is 32;

the second predetermined length is 64 words and the third predetermined length is 32 words;

the fourth predetermined length is 5 bytes and the fifth predetermined length is 1 byte; and

the sixth predetermined length is 32 bytes.

12. The method of any one of claims 1-11, wherein:

the first data fingerprint further includes the offset position information, within the document, of each of the first predetermined number of words.

13. The method of any one of claims 2-11, wherein:

the second data fingerprint further includes the offset position information of the word blocks within the document.

14. A device for extracting data features from a document, the device comprising:

a word segmentation module adapted to perform word segmentation on the document to obtain a word sequence;

a computing module adapted to calculate, for each word in the word sequence, a feature value characterizing the importance of the word within the document, and further adapted to calculate, for each of the selected first predetermined number of words, the data feature string corresponding to the word, wherein the first predetermined number is set according to the importance level of the document;

a selection module adapted to select the first predetermined number of words from the word sequence based on the feature values; and

a feature extraction module adapted to construct, based on the data feature strings, a first data fingerprint of the document as a data feature of the document, wherein the first data fingerprint further includes offset position information of the first predetermined number of words in the document, the offset position information recording the position at which each word first occurs in the document.

15. The device of claim 14, further comprising:

a chunking module adapted to divide the word sequence of the document, in order, into one or more word blocks of a second predetermined length, wherein adjacent word blocks overlap each other by a third predetermined length;

the computing module being further adapted to calculate, for the one or more obtained word blocks, the data feature string of each word block based on the data content of that word block; and

the feature extraction module being further adapted to combine the data feature strings of the word blocks to construct a second data fingerprint of the document as a data feature of the document.

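16. The device of claim 14, wherein the computing module is further adapted to:

calculate the frequency of occurrence of the word in the document as the term frequency of the word; calculate the ratio between the total number of documents in a document library and the number of documents in the library that contain the word as the inverse document frequency of the word; and calculate, from the term frequency and the inverse document frequency of the word, the feature value characterizing the importance of the word in the document.

17. The device of claim 16, wherein the feature value TF-IDF characterizing the importance of the word in the document is defined as:

TF-IDF = TF × IDF,

where TF is the term frequency of the word and IDF is the inverse document frequency of the word.

18. The device of claim 17, wherein the selection module is further adapted to select the first predetermined number of words in descending order of the calculated TF-IDF values.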

19. The device of claim 18, wherein:

the computing module is further adapted to hash each of the selected first predetermined number of words, using a common hash algorithm, into a numeric string of a first predetermined length as the data feature string corresponding to the word; and

the feature extraction module is further adapted to combine the obtained data feature strings to obtain the first data fingerprint as a data feature of the document.

20. The device of claim 19, wherein the word segmentation module is further adapted to perform word segmentation using a dictionary-based segmentation algorithm, wherein the segmentation algorithm includes one dictionary, two matching algorithms and four disambiguation rules.

21. The device of claim 20, wherein the computing module further comprises:

a character conversion unit adapted to convert the words in each word block of the second predetermined length into characters, the resulting character string serving as the word block;

a chunking unit adapted to successively select sub-blocks of a fourth predetermined length from the word block, wherein adjacent sub-blocks overlap each other by a fifth predetermined length; and

a computing unit adapted to calculate, for each sub-block, a feature value list of a sixth predetermined length from the content of the sub-block, and to construct the data feature string of the word block from the feature value lists of all sub-blocks.

22. The device of claim 21, wherein the computing unit further comprises:

an extraction subunit adapted to extract one or more content subsets, each consisting of part of the content of the sub-block; and

the computing unit is further adapted to hash each content subset, using a hash algorithm, to a value between 0 and the sixth predetermined length, and to set the corresponding value in the feature value list of the sixth predetermined length according to the value obtained for each content subset.

23. The device of claim 22, wherein the computing unit further comprises:

a counting subunit adapted to merge the feature value lists of the sub-blocks by adding up the values at corresponding positions, to obtain the feature value list of the word block;

a binarization subunit adapted to binarize the value of each cell in this feature value list, yielding a feature value list in which every cell value is 0 or 1; and

the computing unit is further adapted to convert the feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length, as the data feature string of the word block.

24. The device of claim 23, wherein:

the binarization subunit is further adapted to calculate the average of all cell values in the feature value list and to compare the value of each cell with that average; if the value of a cell is greater than the average, the cell is set to 1, and if the value of a cell is not greater than the average, the cell is set to 0.

25. The device of claim 24, wherein:

the first predetermined number is 5 and the first predetermined length is 32;

the second predetermined length is 64 words and the third predetermined length is 32 words;

the fourth predetermined length is 5 bytes and the fifth predetermined length is 1 byte; and

the sixth predetermined length is 32 bytes.

26. The device of any one of claims 14-25, wherein:

the feature extraction module is further adapted to extract the offset position information, within the document, of each of the first predetermined number of words, for inclusion in the first data fingerprint.

27. The device of any one of claims 15-25, wherein:

the feature extraction module is further adapted to extract the offset position information of the word blocks within the document, for inclusion in the second data fingerprint.

28. A method for judging whether a first document and a second document are related, the method comprising the steps of:

performing the method of any one of claims 1-13 on the first document to extract the data features of the document and obtain a first feature set, wherein the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document;

performing the method of any one of claims 1-13 on the second document to extract the data features of the document and obtain a second feature set, wherein the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document; and

calculating the similarity between the first feature set and the second feature set, and if the similarity reaches a predetermined range, determining that the first document and the second document are related.

29. The judgment method of claim 28, wherein the step of calculating the similarity between the first feature set and the second feature set comprises:

calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and

when the Hamming distance is greater than a first threshold, determining that the two corresponding data feature strings are similar.

30. The method of claim 29, further comprising the step of:

when any data feature string in the second data fingerprint is judged to be similar, determining that the similarity between the first feature set and the second feature set reaches the predetermined range.

31. The method of claim 29, further comprising the steps of:

counting the number of data feature strings in the first feature set that are similar to the second feature set;

calculating the ratio of that number to the total number of data feature strings in the first feature set; and

if the ratio reaches a second threshold, determining that the similarity between the first feature set and the second feature set reaches the predetermined range.
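
A minimal Python sketch of the Hamming-distance comparison and the ratio test in claims 29 and 31 is given below. The test against the first threshold is passed in as a caller-supplied predicate, `is_similar`, so the sketch itself does not fix the threshold value or the direction of the comparison. Comparing each string of the first set against every string of the second set is also an assumption, since the claim speaks only of the "corresponding" data fingerprint. All function names are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length feature strings differ."""
    if len(a) != len(b):
        raise ValueError("feature strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def similar_ratio(first_set, second_set, is_similar):
    """Share of strings in first_set judged similar to some string in second_set.

    is_similar -- predicate applied to a Hamming distance (the first-threshold test)
    """
    first = list(first_set)
    if not first:
        return 0.0
    hits = sum(
        1
        for a in first
        if any(is_similar(hamming_distance(a, b)) for b in second_set)
    )
    return hits / len(first)


if __name__ == "__main__":
    first = ["0000", "1111", "0101"]
    second = ["0001", "1010"]
    # Illustrative predicate only: treats a distance of at most 1 as similar.
    ratio = similar_ratio(first, second, lambda d: d <= 1)
    print(ratio)  # compared against the second threshold by the caller
```

The returned ratio corresponds to the quantity that claim 31 compares against the second threshold.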

32. The method of claim 28, wherein the step of calculating the similarity between the first feature set and the second feature set further comprises:

calculating the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set; and

when the Jaccard coefficient reaches a third threshold, determining that the similarity between the first feature set and the second feature set reaches the predetermined range.

33. The method of claim 32, wherein the Jaccard coefficient is:

Jaccard = |S ∩ T| / |S ∪ T|,

where S denotes the first feature set and T denotes the second feature set.
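
The Jaccard coefficient defined above translates directly into Python set operations. In this sketch, returning 0.0 when both sets are empty is an assumption, since the formula is undefined in that case; the function name `jaccard` is illustrative.

```python
def jaccard(s, t):
    """Jaccard coefficient |S ∩ T| / |S ∪ T| of two collections of feature strings."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(s & t) / len(union)


if __name__ == "__main__":
    print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared of 4 distinct: 0.5
```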

34. A judgment device for judging whether a first document and a second document are related, the device comprising:

the device for extracting data features from a document of any one of claims 14-27, adapted to extract the first feature set of the first document and the second feature set of the second document respectively, wherein

the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document;

the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document;

a similarity calculation module adapted to calculate the similarity between the first feature set and the second feature set; and

a similarity judgment module adapted to determine, when the similarity is judged to reach a predetermined range, that the first document and the second document are related.

35. The judgment device of claim 34, wherein the similarity calculation module further comprises:

a similarity calculation unit adapted to calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and

the similarity judgment module is further adapted to determine, when the Hamming distance is greater than a first threshold, that the two corresponding data feature strings are similar.

36. The judgment device of claim 35, wherein:

the similarity judgment module is further adapted to determine, when a data feature string in the second data fingerprint is judged to be similar, that the similarity between the first feature set and the second feature set reaches the predetermined range.

37. The judgment device of claim 35, wherein the similarity judgment module further comprises:

a statistics unit adapted to count the number of data feature strings in the first feature set that are similar to the second feature set, and to calculate the ratio of that number to the total number of data feature strings in the first feature set; and

the similarity judgment module is further adapted to determine, when the ratio reaches a second threshold, that the similarity between the first feature set and the second feature set reaches the predetermined range.

38. The judgment device of claim 35, wherein:

the similarity calculation unit is further adapted to calculate the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set; and

the similarity judgment module is further adapted to determine, when the Jaccard coefficient reaches a third threshold, that the similarity between the first feature set and the second feature set reaches the predetermined range.

39. The judgment device of claim 38, wherein the Jaccard coefficient is:

Jaccard = |S ∩ T| / |S ∪ T|,

where S denotes the first feature set and T denotes the second feature set.

40. A method for judging whether a suspicious document includes sensitive content, the method comprising the steps of:

performing the method of any one of claims 1-13 on protected documents to extract the data features of the documents and build a feature library, wherein the feature library includes the first data fingerprints and the second data fingerprints of the protected documents;

performing the judgment method of any one of claims 28-33 on the suspicious document, wherein the data features of the suspicious document are extracted as the second feature set and the feature library is used as the first feature set;

if the suspicious document is judged to be related to a protected document, determining that the suspicious document includes sensitive content; and

if the suspicious document is judged to be unrelated to the protected documents, determining that the suspicious document does not include sensitive content.
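
The decision flow of claim 40 could be sketched as below, assuming the feature library is a mapping from a protected document's identifier to its set of data feature strings, and assuming relatedness is decided by a Jaccard coefficient reaching a threshold. The threshold value of 0.5, the library layout and all names are hypothetical choices for illustration only.

```python
def jaccard(s, t):
    """Jaccard coefficient of two collections of feature strings (0.0 if both empty)."""
    s, t = set(s), set(t)
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def matching_protected_docs(suspicious_features, feature_library, threshold=0.5):
    """Identifiers of protected documents judged related to the suspicious document."""
    return [
        doc_id
        for doc_id, features in feature_library.items()
        if jaccard(suspicious_features, features) >= threshold
    ]


def contains_sensitive_content(suspicious_features, feature_library, threshold=0.5):
    """True if the suspicious document is related to at least one protected document."""
    return bool(matching_protected_docs(suspicious_features, feature_library, threshold))


if __name__ == "__main__":
    library = {"contract": {"f1", "f2", "f3"}, "plan": {"g1", "g2"}}
    print(contains_sensitive_content({"f1", "f2", "x"}, library))
```

Returning the matching identifiers, rather than only a yes or no answer, lets a caller see which protected document triggered the result.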

41. A device for judging whether a suspicious document includes sensitive content, the device comprising:

the device for extracting data features from a document of any one of claims 14-27, adapted to extract the data features of protected documents, and further adapted to extract the data features of the suspicious document as the second feature set;

a storage module adapted to store the data features of the protected documents as a feature library, wherein the feature library includes the first data fingerprints and the second data fingerprints of the protected documents;

the judgment device of any one of claims 34-39, adapted to judge whether the suspicious document is related to a protected document in the feature library; and

a determination module adapted to determine, when the suspicious document is judged to be related to a protected document, that the suspicious document includes sensitive content, and to determine, when the suspicious document is judged to be unrelated to the protected documents, that the suspicious document does not include sensitive content.

42. A data leakage prevention system, comprising:

a computing device connected to a data security protection device; and

the data security protection device, comprising:

a document acquisition device adapted to acquire the document content sent by the computing device;

the sensitive content judgment device of claim 41, adapted to judge whether the acquired document includes sensitive content;

a control policy acquisition device adapted to acquire, when it is judged whether the document includes sensitive content, the control policy corresponding to the process related to the document; and

a control device adapted to control, when the suspicious document is judged to include sensitive content, the operations performed on the document according to the acquired control policy.