CN105893859B

CN105893859B - Method and system for leakage prevention

Info

Publication number: CN105893859B
Application number: CN201610236738.7A
Authority: CN
Inventors: 李唱; 康靖; 陈虎
Original assignee: Baoli Nine Chapter (beijing) Data Technology Co Ltd
Current assignee: Baolixintong Science And Technology Co ltd Beijing
Priority date: 2016-04-15
Filing date: 2016-04-15
Publication date: 2019-05-03
Anticipated expiration: 2036-04-15
Also published as: CN105893859A

Abstract

The invention discloses a method and system for data leakage prevention, including: a method of extracting data features from a document to obtain a first data fingerprint and a second data fingerprint; a method of judging, using the extracted data features, whether a first document and a second document are related; and a method of judging, according to the degree of correlation, whether a suspicious document contains sensitive content. The invention also provides the corresponding equipment for extracting document data features, judging equipment for judging whether a first document and a second document are related, and equipment for judging whether a suspicious document contains sensitive content.

Description

Method and system for leakage prevention

Technical field

The present invention relates to the technical field of data security, and in particular to a method and system for data leakage prevention.

Background art

In recent years, with the rapid development of information technology, data security has become particularly important in the daily operation of information-based enterprises. If data is maliciously tampered with or destroyed, the enterprise may suffer irretrievable losses. To improve information security, data security measures are generally needed to monitor and protect data. In the current big-data environment, as the volume of enterprise data grows, how to monitor and protect ever-increasing data quickly and efficiently has become a major problem facing the data security field.

At present, to prevent data leakage, many enterprises have deployed data leakage prevention (DLP) systems in their intranets to ensure the security of sensitive data. A DLP system monitors and protects sensitive data by software and uses certain technical means to prevent an enterprise's specified data or information assets from flowing out of the enterprise in a form that violates security policy, thereby ensuring that sensitive data is neither lost nor leaked. In a DLP system, therefore, extracting data features and matching sensitive data are key steps.

Traditional DLP systems generally extract data features either by manually setting keywords or by generating a data fingerprint for the entire file. The former cannot perform feature extraction automatically; with the latter, extraction accuracy falls when the file is large. In addition, sensitive data is usually matched with rule matching and hash matching algorithms; likewise, when faced with large files, both the performance and the accuracy of these algorithms degrade sharply.

Summary of the invention

To this end, the present invention provides a method and system for data leakage prevention, in an effort to solve or at least alleviate at least one of the problems above.

According to one aspect of the present invention, there is provided a method of extracting data features from a document, wherein the extracted data features include a first data fingerprint and a second data fingerprint, the method comprising the steps of: dividing the data in the document into blocks in sequence, calculating a data feature string for each data block based on the data content in that block, and combining the data feature strings of the data blocks to construct the first data fingerprint of the document; and performing word segmentation on the document to obtain a word sequence, dividing the word sequence of the document into blocks in sequence, calculating a data feature string for each word block based on the data content in that word block, and combining the data feature strings of the word blocks to construct the second data fingerprint of the document.

According to another aspect of the present invention, there is provided a judgment method for judging whether a first document and a second document are related, comprising the steps of: performing the data feature extraction method described above on the first document to extract its data features and obtain a first feature set; performing the data feature extraction method described above on the second document to extract its data features and obtain a second feature set; and calculating the similarity between the first feature set and the second feature set, the first document and the second document being considered related if the similarity reaches a preset range.

According to another aspect of the present invention, there is provided a method of judging whether a suspicious document contains sensitive content, comprising the steps of: performing the data feature extraction method described above on secure documents, extracting their data features, and building a feature library; then extracting the data features of the suspicious document and performing the judgment method described above to judge whether the suspicious document is related to a secure document in the feature library. If the suspicious document is judged to be related to a secure document, it is considered to contain sensitive content; if it is judged to be unrelated to the secure documents, it is considered not to contain sensitive content.

Correspondingly, the present invention also provides equipment for extracting data features from a document, judging equipment for judging whether a first document and a second document are related, and equipment for judging whether a suspicious document contains sensitive content.

According to yet another aspect of the present invention, there is provided a data leakage prevention system, comprising: computing equipment connected to data security protection equipment; and the data security protection equipment, which comprises document acquisition equipment, the sensitive content judging equipment described above, control strategy acquisition equipment, and control equipment.

Based on the above, this solution extracts the data features of a document by dividing the document into blocks and extracting the data fingerprints of the data blocks and word blocks. The data fingerprint of each block is generated with a locality-sensitive hashing (LSH) algorithm, which effectively guards against the leakage of similar data and preserves the accuracy of feature extraction even when the document is very large.

For feature matching, this solution judges similar content in documents either by the similarity of a single matched data feature string (that is, single matching) or by the proportion of similar data feature strings (that is, benchmark matching). Optionally, the similarity between documents can be expressed by the Hamming distance or the Jaccard coefficient. In this way, sensitive data can be matched more comprehensively, preventing its leakage and effectively countering the various means by which documents are leaked.
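The two similarity measures and the two matching modes named above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the bit threshold `max_bits`, and the ratio `min_ratio` are hypothetical, and the reading of "benchmark matching" as a proportion test is an interpretation of the sentence above.

```python
from typing import Sequence, Set


def hamming_distance(sig_a: str, sig_b: str) -> int:
    """Number of differing bits between two equal-length hex signatures."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return bin(int(sig_a, 16) ^ int(sig_b, 16)).count("1")


def jaccard(set_a: Set[str], set_b: Set[str]) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_similar(sig_a: str, sig_b: str, max_bits: int) -> bool:
    """Two feature strings count as similar when few bits differ (assumed rule)."""
    return hamming_distance(sig_a, sig_b) <= max_bits


def single_match(suspect: Sequence[str], secure: Sequence[str],
                 max_bits: int = 32) -> bool:
    """Single matching: one sufficiently similar pair is enough."""
    return any(is_similar(s, t, max_bits) for s in suspect for t in secure)


def proportion_match(suspect: Sequence[str], secure: Sequence[str],
                     max_bits: int = 32, min_ratio: float = 0.5) -> bool:
    """Proportion-based matching: the share of suspect feature strings that
    have a similar counterpart must reach min_ratio."""
    if not suspect:
        return False
    hits = sum(1 for s in suspect
               if any(is_similar(s, t, max_bits) for t in secure))
    return hits / len(suspect) >= min_ratio
```

Single matching reacts to one leaked fragment, while the proportion test asks how much of the suspicious document resembles protected material; which one fits depends on how strictly a given document should be guarded.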

Brief description of the drawings

To accomplish the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the accompanying drawings. These aspects are indicative of the various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout the disclosure, the same reference numerals generally refer to the same components or elements.

Fig. 1 shows a schematic diagram of a data leakage prevention system 100 according to an embodiment of the present invention;

Fig. 2A shows a flowchart of a method 200 of extracting data features from a document according to an embodiment of the present invention;

Fig. 2B shows a flowchart of the method 200 of extracting data features from a document according to another embodiment of the present invention;

Fig. 3A shows a schematic diagram of equipment 300 for extracting data features from a document according to an embodiment of the present invention;

Fig. 3B shows a schematic diagram of the equipment 300 for extracting data features from a document according to another embodiment of the present invention;

Fig. 4 shows a flowchart of a judgment method 400 for judging whether a first document and a second document are related according to an embodiment of the present invention;

Fig. 5 shows a schematic diagram of judging equipment 500 for judging whether a first document and a second document are related according to an embodiment of the present invention;

Fig. 6 shows a flowchart of a method 600 of judging whether a suspicious document contains sensitive content according to an embodiment of the present invention;

Fig. 7 shows a schematic diagram of equipment 700 for judging whether a suspicious document contains sensitive content according to an embodiment of the present invention; and

Fig. 8 schematically illustrates block processing.

Specific embodiment

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.

Fig. 1 shows a schematic diagram of a data leakage prevention system 100 according to an embodiment of the present invention. Within an enterprise, the computing equipment 110 is interconnected through a local area network. Here, the components of computing equipment 110 may include, but are not limited to: one or more processors or processing units, a system memory, and a bus connecting the different system components (including the system memory and the processing units). Note that, in addition to traditional computing equipment (for example, computers), the computing equipment 110 suitable for implementing embodiments of the present invention also includes mobile electronic devices, including but not limited to mobile phones, PDAs, and tablet computers, as well as the servers, printers, CD/DVD units, and the like found in an enterprise working environment.

The data security protection equipment 120 for data leakage prevention is arranged in the local area network and is connected to all computing equipment 110 through the local area network. As shown in Fig. 1, the protection equipment 120 includes: document acquisition equipment 122, sensitive content judging equipment 700, control strategy acquisition equipment 124, and control equipment 126.

The document acquisition equipment 122 is adapted to monitor all computing equipment 110 in the local area network in real time and, when it detects that computing equipment 110 is sending a document, to obtain the content of the document being sent. Here, a document may be the chat messages of an instant messaging session and/or a picture or file transferred over instant messaging.

The sensitive content judging equipment 700 is adapted to judge whether the obtained document contains sensitive content; equipment 700 is described in more detail below.

The control strategy acquisition equipment 124 is adapted to obtain, while it is being judged whether the document contains sensitive content, the control strategy corresponding to the process associated with the document. Optionally, the control strategies may include: a no-print strategy when the specified process is printing, and a garbled-string strategy when the specified process is file transmission.

The control equipment 126 is adapted to control the operations performed on the document according to the obtained control strategy when the suspicious document is judged to contain sensitive content. For example, the sensitive data in the data content to be transmitted in the document is replaced with a character string that marks it as garbled.
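The interplay between control strategy and control action can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the strategy table, the process names `"print"` and `"transfer"`, the `*` placeholder character, and the choice to block when no strategy is configured.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical strategy table: process type -> control action.
CONTROL_STRATEGIES = {
    "print": "block",      # no-print strategy
    "transfer": "garble",  # garbled-string strategy
}
GARBLE_MARK = "*"  # assumed placeholder character


def apply_control(process: str, content: str,
                  sensitive_spans: Iterable[Tuple[int, int]],
                  contains_sensitive: bool) -> Optional[str]:
    """Return the content allowed to leave, or None if the operation is blocked."""
    if not contains_sensitive:
        return content
    action = CONTROL_STRATEGIES.get(process)
    if action == "garble":
        chars = list(content)
        for start, end in sensitive_spans:
            end = min(end, len(chars))  # clamp spans to the content length
            for i in range(max(start, 0), end):
                chars[i] = GARBLE_MARK
        return "".join(chars)
    # "block", or no strategy configured: fail closed (an assumption).
    return None
```

For example, `apply_control("transfer", "pay 100000 yuan", [(4, 10)], True)` returns `"pay ****** yuan"`, while the same content sent to a printing process is blocked.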

Based on the above description of system 100, accurately matching sensitive content is the key to achieving data security protection in this system; this is the operation that the sensitive content judging equipment 700 performs. In brief, the sensitive content judging equipment 700 should include (but is not limited to) a storage module (for storing all data features of the secure documents), equipment for extracting document data features (for extracting the data features of a suspicious document), document relevance judging equipment (for judging, from the extracted data features, whether the suspicious document is related to a secure document), and a determining module (for determining, from the relevance judgment result, whether the suspicious document contains sensitive content).

The composition of each of the above modules and the processes they execute are described below.

Fig. 2A shows a flowchart of a method 200 of extracting data features from a document according to an embodiment of the present invention. As shown in Fig. 2A, the method starts at step S210. In step S210, the data in the document is divided into blocks in sequence to obtain one or more data blocks of a first predetermined length, wherein adjacent data blocks overlap each other by a second predetermined length. In other words, a sliding window of the first predetermined length slides along the document, moving by the second predetermined length each time, so that the document is divided into multiple data blocks of the first predetermined length. According to one embodiment of the present invention, the first predetermined length is set to 512 bytes and the second predetermined length to 256 bytes.
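Step S210 can be sketched as a sliding window over the raw bytes. The function name `split_blocks` is hypothetical, and keeping trailing blocks shorter than the window is an assumption; it is chosen because it yields 9 blocks at offsets 0 to 2048 for a 2278-byte document, as in the example fingerprint given later in this description.

```python
from typing import Iterator, Tuple

BLOCK_SIZE = 512  # first predetermined length, in bytes (per the embodiment)
STEP_SIZE = 256   # second predetermined length, in bytes (per the embodiment)


def split_blocks(data: bytes, size: int = BLOCK_SIZE,
                 step: int = STEP_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, block) pairs from a window of `size` bytes that moves
    `step` bytes at a time. Trailing blocks may be shorter than `size`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for offset in range(0, len(data), step):
        yield offset, data[offset:offset + size]


# Usage: a 2278-byte document gives blocks at offsets 0, 256, ..., 2048.
offsets = [offset for offset, _ in split_blocks(bytes(2278))]
```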

Then, in step S220, for the one or more data blocks obtained, the data feature string of each data block is calculated based on the data content in that block. Optionally, a locality-sensitive hashing (LSH) algorithm is used to generate a data signature for the data content of each data block.

Then, in step S230, the data feature strings of the data blocks are combined to construct the first data fingerprint of the document, which serves as a data feature of the document.

If a data fingerprint were generated for the entire document, the performance of the algorithm would degrade sharply and its accuracy would fall when the document is very large. This solution therefore first divides the document into data blocks and then extracts the data features of each block as the data features of the whole document. Moreover, with a common hash algorithm such as MD5, two equal data signatures indicate that the original data is equal with a certain probability, but two unequal signatures provide no information beyond the fact that the original data differs. Traditional hash algorithms therefore cannot guard well against the leakage of similar data, which is why this method uses a locality-sensitive hashing algorithm. The advantage of an LSH algorithm is that it generates identical or similar data signatures for similar data content, which improves the subsequent feature matching.

Specifically, the data feature string of each data block is calculated in step S220 as follows:

First, data sub-blocks of a fifth predetermined length are selected from the data block in succession, wherein adjacent data sub-blocks overlap each other by a sixth predetermined length. Similarly, in an implementation, a sliding window of the fifth predetermined length (for example, 5 bytes) can slide along the data block, moving by the sixth predetermined length (for example, 1 byte) each time. Fig. 8 schematically illustrates the block processing of a document, where D1 denotes the fifth predetermined length, D2 denotes the sixth predetermined length, and every two adjacent spans of the fifth predetermined length overlap by the sixth predetermined length.

Next, a feature value list of a seventh predetermined length (for example, 32 bytes, that is, 256 bits) is calculated from the content of the data sub-blocks.

Finally, the data feature string of the data block is constructed from the feature value lists of all data sub-blocks.

Specifically, the feature value list of the seventh predetermined length is calculated from the content of a data sub-block as follows:

1) Extract one or more content subsets, each made up of part of the content of the data sub-block; in other words, extract all triples in each data sub-block;

2) Use a hash algorithm to hash each triple into the range (0, 256), that is, to one of the 256 positions of the feature value list;

3) According to the value corresponding to each content subset, update the corresponding value in the feature value list. For example, if the triple igr hashes to 15, the value at position 15 of the feature value list is incremented by 1.

Once all triples in a data block have been processed, each position in the feature value list holds an accumulated value. The average of all accumulated values is taken as the threshold: if the accumulated value at a position (that is, the value of a cell in the feature value list) is greater than the average, that cell is set to 1; otherwise it is set to 0. This binarization yields a binary feature value list of the seventh predetermined length, which is converted into a numeric string of the seventh predetermined length and used as the data feature string of the data block.
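The accumulate-and-binarize procedure above can be sketched as follows. The text does not name the hash applied to each triple, so `bucket_of` (the first byte of an MD5 digest) is a stand-in chosen only to make the sketch runnable; the names `block_signature`, `WINDOW`, and `STEP` are likewise hypothetical. The window moves 1 byte at a time, following the implementation example in the text.

```python
import hashlib
from itertools import combinations

WINDOW = 5         # fifth predetermined length, in bytes
STEP = 1           # sixth predetermined length, in bytes
NUM_BUCKETS = 256  # seventh predetermined length: 32 bytes = 256 bits


def bucket_of(triple: bytes) -> int:
    """Map a 3-byte triple to a position 0..255 (stand-in hash, an assumption)."""
    return hashlib.md5(triple).digest()[0]


def block_signature(block: bytes) -> str:
    """Return a 64-hex-digit (256-bit) feature string for one block."""
    counts = [0] * NUM_BUCKETS
    # Slide the sub-block window and count every triple it contains.
    for start in range(0, len(block) - WINDOW + 1, STEP):
        window = block[start:start + WINDOW]
        for triple in combinations(window, 3):
            counts[bucket_of(bytes(triple))] += 1
    # Binarize: 1 where the accumulated value exceeds the average, else 0.
    mean = sum(counts) / NUM_BUCKETS
    bits = 0
    for count in counts:
        bits = (bits << 1) | (1 if count > mean else 0)
    return bits.to_bytes(NUM_BUCKETS // 8, "big").hex()
```

Because each bit depends only on whether a bucket is above the block's average, a small change to the block alters a few counts and is intended to flip only a few bits, which is the property the later matching relies on. A block shorter than the window produces an all-zero string in this sketch.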

An example of a first data fingerprint generated for a document is shown below:

doc_size:2278

sig_num:9 // number of data blocks

bin_block_size:512 // first predetermined length

bin_step_size:256 // second predetermined length

threshold:75 // threshold used when executing the LSH algorithm

LSH:4f2745a4400f311cab5843643a9771299c9c5f4d81e1669ce3554e4d75fed43a offset:0

LSH:ef2205c4808533748bda836571976d2196b81dec8da154d6aad5366dfcb9d6d9 offset:256

LSH:ade326d490a64b77bbd349f0c0bced2096f874e9ad42dc7d24bef279ea05c5d9 offset:512

LSH:3b276695d8c63bdfeb1340c0c450c0c096ea6e79cdc2bc596ce7e35cea28e7be offset:768

LSH:2faded8472873999e9154bcc684270ec92a67a7cc9c02cd8eae742dc2a58c21e offset:1024

LSH:afa1f584d0a733a7a3559bc8530b78688aa473fc1be06df5aa23469c1b78c28a offset:1280

LSH:8fa4f5cd80e533b6abc1964b520f306088b073f081616df52803461f2af9c78a offset:1536

LSH:1fa56499c0a7333ca05046c1520420608a9577a093f12d586a114c3f12e8e3ce offset:1792

LSH:180d2cba6027725c0010468030c08142c857e7808a91275e2a51097b9300a34e offset:2048

In this embodiment, for a document of 2278 bytes, with a first predetermined length of 512 bytes and a second predetermined length of 256 bytes, 9 data blocks are generated. The LSH algorithm generates a data feature string for each of the 9 data blocks (with the fifth predetermined length set to 5 bytes, the sixth to 1 byte, and the seventh to 32 bytes), and concatenating these 9 data feature strings gives the first data fingerprint of the document. Optionally, the first data fingerprint also includes the offset position information, offset, of each data block in the document. The offset records the position of the data block in the document and is mainly used for tracing data to its source: the alarm issued to the user after a sensitive data leak is discovered can carry the offset information. Of course, to save memory, the first data fingerprint may omit the offset information; the present invention places no restriction on this.
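One possible in-memory shape for such a fingerprint, with the offset made optional, is sketched below. The names `BlockFeature`, `build_fingerprint`, and `placeholder_sign` are hypothetical. Note that `placeholder_sign` uses SHA-256 purely so the sketch runs on its own; it is not locality-sensitive and would be replaced by the LSH routine in practice.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BlockFeature:
    signature: str         # data feature string of one block
    offset: Optional[int]  # position in the document; None if omitted


def build_fingerprint(data: bytes, sign: Callable[[bytes], str],
                      size: int = 512, step: int = 256,
                      with_offsets: bool = True) -> List[BlockFeature]:
    """Combine the per-block feature strings into a document fingerprint."""
    features = []
    for offset in range(0, len(data), step):
        signature = sign(data[offset:offset + size])
        features.append(BlockFeature(signature, offset if with_offsets else None))
    return features


def placeholder_sign(block: bytes) -> str:
    """Stand-in signer (NOT locality-sensitive); used only for illustration."""
    return hashlib.sha256(block).hexdigest()
```

Keeping the offset lets an alarm point to where in the protected document the matching block sits; passing `with_offsets=False` drops that field to save memory.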

To ensure that feature extraction is sufficiently accurate, according to an embodiment of the present invention, a second data fingerprint can be extracted as a data feature of the document in addition to the first data fingerprint. Fig. 2B shows a flowchart of a method of extracting the second data fingerprint from a document according to another embodiment of the present invention. The steps of this method are as follows:

In step S240, word segmentation is first performed on the document, and useless information such as stop words, punctuation marks, and line breaks is removed to obtain a word sequence. According to an embodiment of the present invention, method 200 performs word segmentation with a dictionary-based segmentation algorithm such as MMSEG (A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm). MMSEG is a commonly used dictionary-based algorithm for Chinese word segmentation; it is simple and intuitive, uncomplicated to implement, and fast. In brief, the algorithm comprises a "matching algorithm" and "disambiguation rules". The matching algorithm determines how the sentence to be segmented is matched against the words stored in the dictionary. The disambiguation rules determine which segmentation to use when a sentence can be divided in more than one way. For example, the phrase "facility and service" can be segmented as "facility / kimono / business" or as "facility / and / service"; choosing between these results is the job of the disambiguation rules. MMSEG defines two matching algorithms, simple maximum matching and complex maximum matching, and four disambiguation rules: maximum matching (corresponding to the two matching algorithms above), largest average word length, smallest variance of word lengths, and largest sum of degree of morphemic freedom of one-character words (the natural logarithm of the frequency of each one-character word in the phrase is computed, the values are summed, and the phrase with the largest sum is chosen).
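A minimal sketch of simple (forward) maximum matching shows why disambiguation rules are needed. The Chinese phrase 设施和服务 is assumed here to be the original behind the "facility and service" example; it does not appear in the text, and the toy dictionary and the function name `simple_max_match` are likewise illustrative. Only the simple matching variant is shown, without the complex variant or the disambiguation rules.

```python
from typing import List, Set


def simple_max_match(text: str, dictionary: Set[str],
                     stop_words: Set[str] = frozenset()) -> List[str]:
    """Greedy forward segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    max_len = max([len(word) for word in dictionary] + [1])
    words = []
    pos = 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in dictionary:
                break
        pos += len(candidate)
        # Drop stop words and tokens that are only punctuation or whitespace.
        if candidate in stop_words or not any(ch.isalnum() for ch in candidate):
            continue
        words.append(candidate)
    return words


toy_dictionary = {"设施", "和服", "和", "服务"}
greedy = simple_max_match("设施和服务", toy_dictionary)
# greedy is ["设施", "和服", "务"]: the "facility / kimono / business" split.
```

The greedy result is the unwanted segmentation, because 和服 ("kimono") is longer than 和 ("and"). Choosing 设施 / 和 / 服务 instead requires comparing whole candidate segmentations, which is what the complex matching and the disambiguation rules provide.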

For example, document A below yields document B after word segmentation.

Document A:

" Group Life Accident Insurance material benefit plan

Unexpected injury: an objective event, caused by external, sudden, unintended, non-disease factors, that harms the body.

Examples include being hit in a traffic accident, being burned in a fire, being struck and injured by an object falling from a height, being attacked and injured by criminals, liquefied gas or coal gas explosions,

and a cook being scalded by boiling oil; all of these are unexpected injury events.

Two combined plans are recommended for your organization to choose from according to its actual circumstances:

1. Unexpected injury insurance: premium 150 yuan per person (occupation classes 1 and 2)

(1) Insurance period: 1 year

(2) Insurance liability: 100,000 yuan is paid for death due to unexpected injury; or 100,000 yuan is paid for total disability or accidental burns due to unexpected injury (partial disability is paid in proportion); 10,000 yuan for unexpected injury medical treatment.

2. Unexpected injury and medical insurance: premium 100 yuan per person (occupation classes 1 and 2)

(1) Insurance period: 1 year

(2) Insurance liability: 50,000 yuan is paid for death due to unexpected injury; or 50,000 yuan is paid for total disability or burns due to unexpected injury (partial disability is paid in proportion); 10,000 yuan for unexpected injury medical treatment.

Note: our company can design an insurance plan according to the specific circumstances of your organization

Low premiums, extensive coverage; convenient enrollment, rapid claims settlement."

Document B (word sequence obtained after word segmentation processing):

[group, the person is unexpected, injures, insurance, material benefit, and plan is unexpected, and injury refers to, by, external, burst, non- Meaning, non-, disease, body, injury, objective, event, traffic, accident are hit, and are occurred, fire, burn, high-altitude, pendant, and object is hit, and are caused Wound, ruffian attack, injured, liquefied gas, coal gas, explosion, cook, boiling, oil, and scald is fixed one's mind on, and outside, injury, event is recommended, Two kinds, combination, scheme, official documents and correspondence, confession, unit, in conjunction with, actual conditions are selected, and it is unexpected, it injures, insurance, insurance premium, 150 yuan, Grade, occupation, insurance, during which, 1 year, insurance, responsibility was unexpected, and injury is die, and paid, 10, Wan Yuan, or because, unexpected, injury, entirely, It is residual, it is unexpected, it burns, pays, 10, Wan Yuan, part, it is disabled, it in proportion, pays, unexpected, injury, medical treatment, 1, Wan Yuan, unexpected, injury, Medical insurance, insurance premium, 100 yuan, grade, occupation, insurance, during which, 1 year, insurance, responsibility was unexpected, and injury is die, and paid, 5, ten thousand Member, or because, unexpected, injury is entirely, residual, burns, pays, 5, Wan Yuan, part, and it is disabled, it in proportion, pays, unexpected, injury, medical treatment, 1, Wan Yuan, note, company can press, your unit, and specifically, situation designs, and insurance, scheme, official documents and correspondence is paid the bill, and seldom, ensures, much, throw It protects, convenient, Claims Resolution, rapidly]

It should be noted that the present invention is not limited to a specific word segmentation method; any method that performs word segmentation on a document to obtain the meaningful words in it falls within the protection scope of the present invention.

Then, in step S250, the word sequence of the document is divided into blocks in sequence to obtain one or more word blocks of a third predetermined length, wherein adjacent word blocks overlap each other by a fourth predetermined length. In other words, a sliding window of the third predetermined length slides along the word sequence, moving by the fourth predetermined length each time, so that the document is divided into multiple word blocks of the third predetermined length. According to one embodiment of the present invention, the third predetermined length is set to 64 words and the fourth predetermined length to 32 words.
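Step S250 can be sketched as the same kind of sliding window, applied to words instead of bytes. The function name `split_word_blocks` is hypothetical. How the tail of the sequence is handled is not stated in the text; this sketch stops as soon as a window reaches the end of the word sequence, an assumption chosen because it gives 8 word blocks for a 284-word document, the figures used in the second-fingerprint example of this description.

```python
from typing import List, Sequence

WORD_BLOCK_SIZE = 64  # third predetermined length, in words
WORD_STEP_SIZE = 32   # fourth predetermined length, in words


def split_word_blocks(words: Sequence[str], size: int = WORD_BLOCK_SIZE,
                      step: int = WORD_STEP_SIZE) -> List[List[str]]:
    """Split a word sequence into overlapping word blocks."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    blocks = []
    start = 0
    while start < len(words):
        blocks.append(list(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already covers the end of the sequence
        start += step
    return blocks


# Usage: 284 placeholder words give 8 blocks, the last one holding 60 words.
demo_blocks = split_word_blocks([f"w{i}" for i in range(284)])
```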

For example, dividing the word sequence of document B above into blocks yields the following 4 word blocks:

Word block 1:[Group Life Accident Insurance material benefit plan unexpected injury refers to by the non-original idea non-disease of external burst Actual bodily harm objective event traffic accident hits generation fire burn falling object from high altitude and hits the injured liquefied gas and coal gas of ruffian's attack of causing injury Cook's boiling oil scald of exploding belongs to two kinds of assembled scheme official documents and correspondences of unexpected injury event recommendation and selects meaning for unit combination actual conditions During 150 yuan of grade employment securities of outer injury insurance insurance premium]

Word block 2:[attacks injured liquefied gas and coal gas explosion cook boiling oil scald and belongs to two kinds of unexpected injury event recommendation combinations Scheme official documents and correspondence selects insurance in 1 year during 150 yuan of grade employment securities of accident/injury insurance insurance premium to blame for unit combination actual conditions Appoint unexpected injury to die to pay 100,000 yuan or pay accidental wound in proportion because 100,000 yuan of parts of unexpected injury Complete Disability accidental burns pair are disabled 10,000 yuan of unexpected injury medical insurance insurance premiums of evil medical treatment]

Word block 3:[insurance responsibility unexpected injury in 1 year, which is die, pays 100,000 yuan or because unexpected injury Complete Disability accidental burns pay 100,000 During unexpected injury 100 yuan of grade employment securities of medical 10,000 yuan of unexpected injury medical insurance insurance premiums are paid in first part deformity in proportion Insurance responsibility unexpected injury in 1 year dies to pay 50,000 yuan or pay 50,000 yuan of part deformity because of the burn of unexpected injury Complete Disability pays meaning in proportion Outer 10,000 yuan of injury medical treatment]

The ten thousand yuan of parts word block 4:[are disabled to pay unexpected injury medical 10,000 yuan of unexpected injury medical insurance insurance premiums 100 in proportion Insurance responsibility unexpected injury in 1 year, which is die, during first grade employment security pays 50,000 yuan or because 50,000 yuan of portions are paid in the burn of unexpected injury Complete Disability Dividing deformity to pay the medical 10,000 Yuan Zhu companies of unexpected injury in proportion can pay the bill not by your unit concrete condition design insurance scheme official documents and correspondence It is ensure that many insure facilitates Claims Resolution rapid] more

Then, in step S260, for the one or more word blocks obtained, the data feature string of each word block is calculated based on the data content in that word block. Optionally, a locality-sensitive hashing (LSH) algorithm is used to generate a data signature for the data content of each word block.

Then, in step S270, the data feature strings of the word blocks are combined to construct the second data fingerprint of the document, which serves as a data feature of the document.

According to one embodiment of the present invention, the same LSH algorithm can be used to process word blocks and data blocks to obtain data signatures. The steps for calculating the data feature string of each word block therefore follow the steps for calculating the data feature string of each data block in step S220 and are not repeated here.

For example, applying the LSH algorithm to the 4 word blocks above generates the following 4 data feature strings:

LSH1:3f26258da0a5310d6b5203845ab0784eb29acff9814564946ce458fc086ac2a8

LSH2:2f2465c480a3312f215d80ce53863000a0ad6b78a3616595ae4c56bc00e9c2ba

LSH3:0fa077e500a531bfa95dd08e53862020a0896b58a362659d2a484ebf00e9cbaa

LSH4:0fa467ed10a5331da959d40e53802000b0897310a1616d592a4844bb02a9c7ea

An example of a second data fingerprint generated for a document is shown below:

doc_size:2278

word_num:284 // number of words

sig_num:8 // number of word blocks

word_block_size:64 // third predetermined length

word_step_size:32 // fourth predetermined length

LSH:bf32258f90e5390da35083045a83630ab6be5fe9a10577102dc40acd286bd2a8 offset:0

LSH:0f2205c490d01f2fa3d000904a87630896f83fa0a147d79424e446adfc5b50aa offset:254

LSH:b92355848cfa3b6fab5744f14c92ad4892d86e89a143f4bee4a852386e09e2e8 offset:554

LSH:3baf75807cda31ede317c6c4745321ccd2966efd89422598ec62f3dca2f9e39a offset:840

LSH:2fafe588e0b131ade3559a46f31b30c4929e6b7c89d2671cae66f2dc02e9caca offset:1080

LSH:2f2475cc40a731bd2155d0ce53a83000b88d7b5883d06594aa4244be04ebe2ca offset:1326

LSH:0fa467cd00a433bca1d1d48f53a42021b889735481e241bd2a4844bf8ce9c78a offset:1594

LSH:07af6fbde0a5317c61d1544a02802142f29b730c93602cbe2e4ae83b83c9c7ee offset:1797

For a document of 2278 bytes containing 284 words, the data feature strings of 8 word blocks are generated using the LSH algorithm (with the fifth predetermined length set to 5 bytes, the sixth predetermined length to 1 byte, and the seventh predetermined length to 32 bytes); concatenating these 8 data feature strings yields the second data fingerprint of the document. As with the first data fingerprint, the second data fingerprint may also include the offset position information (offset) of each word block in the document.

It should be noted that, during the execution of method 200, the first, second, third and fourth predetermined lengths can be set according to the importance of the document, so as to vary the granularity of feature extraction. That is, the more important the document, the smaller the size of each block (first predetermined length, third predetermined length) and the shift step (second predetermined length, fourth predetermined length) used when partitioning the data and the word sequence.
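The overlapping partition of a word sequence can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_into_blocks` is hypothetical, the step between block starts is assumed to equal the block size minus the overlap, and the last block is assumed to be kept even when it is shorter than the block size. Under those assumptions, 284 words with a block size of 64 and an overlap of 32 give 8 word blocks, in line with the sig_num of the fingerprint example above.

```python
def split_into_blocks(items, block_size, overlap):
    """Split a sequence into blocks of block_size items that overlap by `overlap` items."""
    if not 0 <= overlap < block_size:
        raise ValueError("overlap must be smaller than block_size")
    step = block_size - overlap  # assumed relation between overlap and step
    blocks = []
    for start in range(0, len(items), step):
        # keep the start index so it can be stored alongside the block
        blocks.append((start, items[start:start + block_size]))
        if start + block_size >= len(items):
            break  # this block already reaches the end of the sequence
    return blocks


words = [f"w{i}" for i in range(284)]
word_blocks = split_into_blocks(words, block_size=64, overlap=32)
print(len(word_blocks))                      # 8
print([start for start, _ in word_blocks])   # [0, 32, 64, 96, 128, 160, 192, 224]
```

Lowering `block_size` and the step for an important document produces more, finer-grained blocks, which is the trade-off the paragraph above describes.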

This concludes the steps of extracting data features from a document. With method 200, the first data fingerprint and the second data fingerprint are extracted from the document as its data features. Correspondingly, Figs. 3A and 3B respectively show schematic diagrams of a device 300, according to embodiments of the present invention, that implements method 200 by extracting the first data feature and the second data feature from a document.

As shown in Fig. 3A, the feature extraction device 300 includes a blocking module 310, a computing module 320 and a feature extraction module 330. The computing module 320 is coupled to the blocking module 310 and to the feature extraction module 330.

The blocking module 310 is adapted to partition the data in the document sequentially into one or more data blocks of a first predetermined length, where adjacent data blocks overlap each other by a second predetermined length.

The computing module 320 is adapted to calculate, for each of the one or more data blocks obtained, the data feature string of the data block based on the data content in that data block.

The feature extraction module 330 is adapted to combine the data feature strings of the data blocks to construct the first data fingerprint of the document, which serves as a data feature of the document.

In line with the steps described above for calculating the data feature string of each data block, the computing module 320 further includes a blocking unit 322 and a computing unit 324.

The blocking unit 322 is adapted to successively select data sub-blocks of a fifth predetermined length within a data block, where adjacent data sub-blocks overlap each other by a sixth predetermined length.

The computing unit 324 is adapted to calculate a feature value list of a seventh predetermined length from the content of each data sub-block. Optionally, the computing unit 324 may include an extraction subunit adapted to extract parts of the content of a data sub-block to form one or more content subsets. The computing unit 324 then uses a hash algorithm to hash each content subset to a value between 0 and the seventh predetermined length, and sets the corresponding value in the feature value list of the seventh predetermined length according to the value obtained for each content subset. The computing unit 324 may also include a counting subunit, adapted to superimpose and merge the values at corresponding positions of the feature value lists of the individual data sub-blocks so as to obtain the feature value list of the data block, and a binarization subunit, adapted to binarize the value of each cell in that feature value list so as to obtain a feature value list in which every cell value is 0 or 1. The computing unit is further adapted to convert this feature value list of the seventh predetermined length into a numeric string of the seventh predetermined length, which serves as the data feature string of the data block.

The binarization subunit is adapted to calculate the average of all cell values in the feature value list and to compare the value of each cell with that average: if the value of a cell is greater than the average, the cell is set to 1; if it is not greater than the average, the cell is set to 0.
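The sub-block hashing, accumulation and binarization performed by the computing unit 324 can be sketched as follows. This is a simplified illustration in the spirit of the description, not the LSH algorithm actually used: the choice of 3-byte combinations as content subsets, CRC32 as the hash, 256 buckets (32 bytes × 8 bits) as the feature value list, and a one-byte step between sub-blocks are all assumptions, and every name is hypothetical.

```python
import zlib
from itertools import combinations

SUB_BLOCK = 5   # fifth predetermined length, in bytes
STEP = 1        # assumed step between sub-blocks (sixth predetermined length)
BUCKETS = 256   # assumed list size: 32 bytes x 8 bits


def sub_block_features(sub_block: bytes) -> list:
    """Feature value list of one sub-block: hash each 3-byte subset into a bucket."""
    features = [0] * BUCKETS
    for subset in combinations(sub_block, 3):      # assumed content subsets
        features[zlib.crc32(bytes(subset)) % BUCKETS] += 1
    return features


def data_feature_string(block: bytes) -> str:
    """Superimpose sub-block lists, binarize against the mean, return a hex string."""
    totals = [0] * BUCKETS
    last_start = max(len(block) - SUB_BLOCK, 0)
    for start in range(0, last_start + 1, STEP):
        features = sub_block_features(block[start:start + SUB_BLOCK])
        totals = [t + f for t, f in zip(totals, features)]
    mean = sum(totals) / BUCKETS
    # a cell becomes 1 only when it is strictly greater than the average
    bits = "".join("1" if value > mean else "0" for value in totals)
    return format(int(bits, 2), f"0{BUCKETS // 4}x")


signature = data_feature_string(b"accidental injury medical insurance premium")
print(len(signature))  # 64 hex characters, i.e. 32 bytes
```

Because every bucket count changes only slightly when a few bytes of the block change, most bits of the resulting string tend to stay the same for similar inputs, which is the property a locality-sensitive hash relies on.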

According to one implementation, the feature extraction device 300 is further adapted to extract the second data fingerprint of the document. In this case, in addition to the blocking module 310, the computing module 320 and the feature extraction module 330, the feature extraction device 300 further includes a word segmentation module 340 adapted to perform word segmentation on the document to obtain a word sequence. According to an embodiment of the present invention, the word segmentation module 340 is configured to segment the document using a dictionary-based word segmentation algorithm (for example, MMSEG).

The blocking module 310 is further adapted to partition the word sequence of the document sequentially into one or more word blocks of a third predetermined length, where adjacent word blocks overlap each other by a fourth predetermined length.

Meanwhile, the computing module 320 is further adapted to calculate, for each of the one or more word blocks obtained, the data feature string of the word block based on the data content in that word block. The feature extraction module 330 is further adapted to combine the data feature strings of the word blocks to construct the second data fingerprint of the document, which serves as a data feature of the document.

In line with the steps described above for calculating the data feature string of each word block, the computing module 320 includes, in addition to the blocking unit 322 and the computing unit 324, a character conversion unit 326.

The character conversion unit 326 is adapted to convert the words in each word block of the third predetermined length into characters and to use the resulting character string as the word block. The blocking unit 322 is further adapted to successively select sub-word blocks of the fifth predetermined length within a word block, where adjacent sub-word blocks overlap each other by the sixth predetermined length. The computing unit 324 is further adapted to calculate a feature value list of the seventh predetermined length from the content of each sub-word block. Optionally, the computing unit calculates the data feature string of a word block using the same LSH algorithm as for data blocks, which is not repeated here.

As described for method 200, the first, second, third and fourth predetermined lengths can be set according to the importance of the document, so as to vary the granularity of feature extraction. That is, the more important the document, the smaller the size of each block (first predetermined length, third predetermined length) and the shift step (second predetermined length, fourth predetermined length) used in partitioning.

Optionally, the feature extraction module 330 is further adapted to extract the offset position information of each data block in the document for inclusion in the first data fingerprint, and to extract the offset position information of each word block in the document for inclusion in the second data fingerprint.

In summary, this scheme extracts the data features of a document by partitioning the document into blocks and extracting the data fingerprints of its data blocks and word blocks. Because a data fingerprint is calculated for each block, and the fingerprints are generated with a locality-sensitive hashing (LSH) algorithm, leakage of similar data can be effectively prevented, and the accuracy of feature extraction is maintained even when the document is very large.

Fig. 4 shows a flowchart of a judgment method 400, according to an embodiment of the present invention, for judging whether a first document and a second document are related.

As shown in Fig. 4, the method starts at step S410, in which the steps of method 200 are performed on the first document and the data features of the document are extracted to obtain a first feature set, where the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document.

Then, in step S420, the steps of method 200 are likewise performed on the second document, and the data features of the document are extracted to obtain a second feature set, where the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document.

Then, in step S430, the similarity between the first feature set and the second feature set is calculated; if the similarity reaches a preset range, the first document and the second document are considered related.

The process in step S430 of matching features and calculating the similarity between documents can be carried out in the following 3 ways.

◆ The first is single matching:

The Hamming distance is calculated between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set.

Here, the Hamming distance refers to the number of positions at which the corresponding binary digits of two data feature strings of equal length differ. The Hamming distance between two data feature strings x and y is denoted d(x, y): an XOR operation is performed on the two data feature strings and the number of 1s in the result is counted; the value obtained is the Hamming distance. When the Hamming distance is greater than the first threshold, the two data feature strings are judged to be similar. For example:

The Hamming distance between 1011101 and 1001001 is 2;

The Hamming distance between "toned" and "roses" is 3.
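The two examples above can be reproduced with a short sketch. The helper names are hypothetical; `hamming_hex` assumes the data feature strings are equal-length hexadecimal strings such as the LSH values shown earlier, and counts differing bits with an XOR as described.

```python
def hamming_symbols(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def hamming_hex(x: str, y: str) -> int:
    """Number of differing bits between two equal-length hexadecimal strings."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return bin(int(x, 16) ^ int(y, 16)).count("1")  # XOR, then count the 1s


print(hamming_symbols("1011101", "1001001"))  # 2
print(hamming_symbols("toned", "roses"))      # 3
print(hamming_hex("0f", "ff"))                # 4
```

For 64-character data feature strings, `hamming_hex` returns a value between 0 (identical strings) and 256 (every bit differs).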

For single matching, the similarity between the first feature set and the second feature set is considered to reach the preset range as soon as any data feature string of the two documents is judged to be similar. That is, as long as any one data block or word block is similar, the document is suspected of leaking data.

The following pseudocode performs single matching on the data feature strings of two documents and returns whether they are related. signature_base and signature_cmp represent the first feature set and the second feature set respectively, and nilsima_base and nilsima_cmp respectively denote the data feature strings in the feature sets of the two documents:

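def single_match(signature_base, signature_cmp, threshold):  # function name is illustrative
    # compare every pair of data feature strings from the two feature sets
    for nilsima_base in signature_base:
        for nilsima_cmp in signature_cmp:
            ham_dist = hamming_distance(nilsima_base, nilsima_cmp)
            # the first threshold decides whether the pair counts as similar
            if ham_dist > threshold:
                return 1  # one similar pair is enough: related
    return 0  # no similar pair found: not related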

◆ The second is benchmark matching:

As before, the Hamming distance is first calculated between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set, to determine whether the two data feature strings are similar. Unlike single matching, benchmark matching counts the number of data feature strings in the first feature set that are similar to the second feature set and then calculates the ratio of that number to the total number of data feature strings in the first feature set; if the ratio reaches the second threshold, the similarity between the first feature set and the second feature set is considered to reach the preset range.

The following pseudocode performs benchmark matching on the data fingerprints of two documents and returns the ratio from which relatedness is decided; signature_base_num denotes the total number of data feature strings in the first feature set.

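def benchmark_match(signature_base, signature_cmp):  # function name is illustrative
    signature_base_num = signature_base.signature_num
    similar_num = 0
    for nilsima_base in signature_base:
        for nilsima_cmp in signature_cmp:
            if nilsima_cmp == nilsima_base:
                similar_num += 1
                break  # count each string of the first set at most once
    # share of the first feature set that is matched in the second
    return similar_num / signature_base_num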
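The ratio described for benchmark matching can also be sketched in runnable form. The sketch below is an illustration under stated assumptions: the feature sets are plain lists of strings, the function names are hypothetical, and the similarity test is passed in as a callable so that either the Hamming-distance criterion with the first threshold or plain equality, as in the pseudocode, can be plugged in.

```python
def benchmark_ratio(base_strings, cmp_strings, similar):
    """Share of strings in the first set that have a similar string in the second."""
    if not base_strings:
        return 0.0  # assumed convention for an empty first feature set
    matched = sum(
        1 for base in base_strings
        if any(similar(base, other) for other in cmp_strings)
    )
    return matched / len(base_strings)


def benchmark_match(base_strings, cmp_strings, similar, second_threshold):
    """True when the ratio reaches the second threshold."""
    return benchmark_ratio(base_strings, cmp_strings, similar) >= second_threshold


base = ["aa", "bb", "cc", "dd"]
other = ["bb", "dd", "ee"]
exact = lambda a, b: a == b            # equality, as in the pseudocode
print(benchmark_ratio(base, other, exact))        # 0.5
print(benchmark_match(base, other, exact, 0.5))   # True
```

Note that the ratio is taken relative to the first feature set only, so swapping the two arguments generally gives a different value.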

◆ The third is whole matching:

The Jaccard coefficient is calculated between all data feature strings of the first feature set and all data feature strings of the second feature set. When the Jaccard coefficient reaches the third threshold, the similarity between the first feature set and the second feature set is considered to reach the preset range, and the first document and the second document are related.

Here, the Jaccard coefficient refers to the ratio of the size of the intersection of two sets to the size of their union:

Jaccard = |S ∩ T| / |S ∪ T|,

where S denotes the first feature set and T denotes the second feature set.

The following pseudocode performs whole matching on the first data fingerprints of two documents and returns the coefficient from which relatedness is decided; signature_base_num and signature_cmp_num respectively denote the total number of data feature strings in the two feature sets:

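def whole_match(signature_base, signature_cmp):  # function name is illustrative
    signature_base_num = signature_base.signature_num
    signature_cmp_num = signature_cmp.signature_num
    similar_num = 0
    for nilsima_base in signature_base:
        for nilsima_cmp in signature_cmp:
            if nilsima_cmp == nilsima_base:
                similar_num += 1
                break  # count each shared string once
    # |S ∩ T| / |S ∪ T|, with |S ∪ T| = |S| + |T| - |S ∩ T|
    return similar_num / (signature_base_num + signature_cmp_num - similar_num)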
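With Python's built-in sets the same coefficient can be computed directly. This is a minimal sketch assuming each feature set is a collection of data feature strings compared by exact equality; the function name, the shortened example strings and the value returned for two empty sets are illustrative assumptions.

```python
def jaccard(first, second) -> float:
    """Jaccard coefficient |S ∩ T| / |S ∪ T| of two collections of feature strings."""
    s, t = set(first), set(second)
    union = s | t
    if not union:
        return 0.0  # assumed convention for two empty feature sets
    return len(s & t) / len(union)


s = {"3f26", "2f24", "0fa0", "0fa4"}   # shortened stand-ins for feature strings
t = {"2f24", "0fa4", "07af"}
print(jaccard(s, t))  # 0.4
```

Here two strings are shared and the union holds five, which agrees with the counting form of the pseudocode: 2 / (4 + 3 - 2) = 0.4.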

Method 400 provides 3 ways of matching similar content in two documents; optionally, the similarity between documents can be expressed by the Hamming distance or the Jaccard coefficient. In this way, the matching mode can be selected adaptively to judge document similarity, and sensitive data can be matched more comprehensively, providing a strong guarantee against the leakage of sensitive data.

Correspondingly, Fig. 5 shows a schematic diagram of a judgment device 500, according to an embodiment of the present invention, for implementing method 400 and judging whether a first document and a second document are related. As shown in Fig. 5, the document relevance judgment device 500 includes a feature extraction device 300, a similarity calculation module 510 and a similarity judgment module 520, where the similarity calculation module 510 is coupled to the feature extraction device 300 and to the similarity judgment module 520.

The feature extraction device 300 is adapted to extract the first feature set of the first document and the second feature set of the second document, where the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document, and the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document.

The similarity calculation module 510 is adapted to calculate the similarity between the first feature set and the second feature set.

The similarity judgment module 520 is adapted to consider the first document and the second document related when it judges that the similarity reaches the preset range.

According to one embodiment of the present invention, the similarity calculation module 510 further includes a similarity calculation unit for calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set.

The similarity judgment module 520 is further adapted to determine that two corresponding data fingerprints are similar when the Hamming distance is greater than the first threshold. Specifically, in the single matching mode, the similarity judgment module 520 is adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when any one data feature string is judged to be similar.

In the benchmark matching mode, the similarity judgment module 520 may also include a statistics unit for counting the number of data feature strings in the first feature set that are similar to the second feature set and for calculating the ratio of that number to the total number of data feature strings in the first feature set. The similarity judgment module 520 is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when the ratio reaches the second threshold.

According to still another embodiment of the present invention, in the whole matching mode, the similarity calculation unit is further adapted to calculate the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set; when the Jaccard coefficient reaches the third threshold, the similarity judgment module 520 determines that the similarity between the first feature set and the second feature set reaches the preset range. The Jaccard coefficient characterizes the degree of correlation of two sets:

Jaccard = |S ∩ T| / |S ∪ T|,

where S denotes the first feature set and T denotes the second feature set.

Fig. 6 shows a flowchart of a method 600, according to an embodiment of the present invention, for judging whether a suspicious document contains sensitive content. As shown in Fig. 6, the method starts at step S610, in which the steps of method 200 are performed on the secure documents, the data features of these documents are extracted, and a feature library is established, where the feature library includes the first data fingerprints and the second data fingerprints of all secure documents.

Then, in step S620, the steps of method 400 are performed on the suspicious document. During the execution of method 400, the data features of the suspicious document are extracted as the second feature set, and the feature library obtained in the previous step is used as the first feature set, in order to judge whether the suspicious document is related to a secure document.

Then, in step S630, if the suspicious document is judged to be related to a secure document, the suspicious document is considered to contain sensitive content; if the suspicious document is judged to be unrelated to the secure documents, the suspicious document is considered not to contain sensitive content.
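The overall flow of method 600 can be sketched as follows. The feature extraction of method 200 and the relatedness test of method 400 are passed in as callables; all names here are hypothetical, and the toy `extract` and `related` functions at the bottom are stand-ins for illustration only, not the fingerprinting or matching described above.

```python
def build_feature_library(secure_docs, extract):
    """Step S610: map each secure document name to its extracted feature set."""
    return {name: extract(content) for name, content in secure_docs.items()}


def find_related(suspicious, library, extract, related):
    """Steps S620-S630: names of secure documents the suspicious one is related to."""
    suspicious_features = extract(suspicious)   # second feature set
    return [name for name, features in library.items()
            if related(features, suspicious_features)]


# Toy stand-ins (assumptions): features are the set of words, and two documents
# are related when they share at least 3 words.
extract = lambda text: set(text.split())
related = lambda first, second: len(first & second) >= 3

library = build_feature_library(
    {"policy": "accidental injury medical insurance premium"}, extract)
hits = find_related("the insurance premium for accidental injury",
                    library, extract, related)
print(bool(hits))  # True: the suspicious document is treated as sensitive
```

Returning the names of the matching secure documents, rather than a bare yes or no, is a design choice of this sketch; it makes it possible to report which protected document triggered the decision.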

Correspondingly, Fig. 7 shows a schematic diagram of a device, according to an embodiment of the present invention, for implementing method 600 and judging whether a suspicious document contains sensitive content, namely the sensitive content judgment device 700 described with reference to Fig. 1. The device 700 includes a feature extraction device 300, a storage module 710, a document relevance judgment device 500 and a determination module 720. According to one embodiment of the present invention, the feature extraction device 300 may also be arranged within the document relevance judgment device 500.

As described above, the feature extraction device 300 is adapted to extract data features from secure documents, and is further adapted to extract the data features of the suspicious document as the second feature set.

The storage module 710 is adapted to store the data features of the secure documents as a feature library, where the feature library includes the first data fingerprints and the second data fingerprints of the secure documents.

The document relevance judgment device 500 is adapted to judge whether the suspicious document is related to a secure document in the feature library; and

The determination module 720 is adapted to determine that the suspicious document contains sensitive content when the suspicious document is judged to be related to a secure document, and to determine that the suspicious document does not contain sensitive content when the suspicious document is judged to be unrelated to the secure documents.

In summary, according to the method and system for leakage prevention of the present invention, the document feature extraction method provided can extract the data features of a document more conveniently while capturing as much data feature information as possible. In addition, 3 matching modes (single matching, benchmark matching and whole matching) are provided to match sensitive data comprehensively, so that various means of leaking documents can be effectively countered.

It should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention, features of the invention are sometimes grouped together in a single embodiment, figure or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the present invention.

Those skilled in the art should understand that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may furthermore be divided into multiple sub-modules.

Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

A5. The method of any one of A2-4, wherein the step of calculating the data feature string of a word block based on the data content in the word block includes: converting the words in each word block of the third predetermined length into characters and using the resulting character string as the word block; successively selecting sub-word blocks of a fifth predetermined length within the word block, where adjacent sub-word blocks overlap each other by a sixth predetermined length; for each sub-word block, calculating a feature value list of a seventh predetermined length from the content of the sub-word block; and constructing the data feature string of the word block based on the feature value lists of all sub-word blocks.

A6. The method of A4 or 5, wherein the step of calculating the feature value list of the seventh predetermined length from the content of a data sub-block or sub-word block includes: extracting one or more content subsets formed from parts of the content of the data sub-block or sub-word block; hashing each content subset, using a hash algorithm, to a value between 0 and the seventh predetermined length; and setting the corresponding value in the feature value list of the seventh predetermined length according to the value obtained for each content subset.

A7. The method of A6, wherein the step of constructing the data feature string of the data block or word block based on the feature value lists includes: superimposing and merging the values at corresponding positions of the feature value lists of the individual data sub-blocks or sub-word blocks to obtain the feature value list of the data block or word block; binarizing the value of each cell in that feature value list to obtain a feature value list in which every cell value is 0 or 1; and converting the feature value list of the seventh predetermined length into a numeric string of the seventh predetermined length, which serves as the data feature string of the data block or word block.

A8. The method of A7, wherein the step of binarizing the value of each cell in the feature value list includes: calculating the average of all cell values in the feature value list; comparing the value of each cell with the average; and setting the cell to 1 if its value is greater than the average, and to 0 if its value is not greater than the average.

A9. The method of any one of A1-8, wherein the first predetermined length is 512 bytes and the second predetermined length is 256 bytes; the third predetermined length is 64 words and the fourth predetermined length is 32 words; the fifth predetermined length is 5 bytes and the sixth predetermined length is 1 byte; and the seventh predetermined length is 32 bytes.

A10. The method of any one of A1-9, wherein the first data fingerprint further includes the offset position information of each data block in the document.

A11. The method of any one of A2-10, wherein the second data fingerprint further includes the offset position information of each word block in the document.

B13. The device of B12, further including: a word segmentation module adapted to perform word segmentation on the document to obtain a word sequence; the blocking module being further adapted to partition the word sequence sequentially into one or more word blocks of a third predetermined length, where adjacent word blocks overlap each other by a fourth predetermined length; the computing module being further adapted to calculate, for each of the one or more word blocks obtained, the data feature string of the word block based on the data content in that word block; and the feature extraction module being further adapted to combine the data feature strings of the word blocks to construct the second data fingerprint of the document, which serves as a data feature of the document.

B14. The device of B13, wherein the word segmentation module is further adapted to perform word segmentation with a dictionary-based word segmentation algorithm, the word segmentation algorithm including one dictionary, two matching algorithms and four disambiguation rules.

B15. The device of any one of B12-14, wherein the computing module includes: a blocking unit adapted to successively select data sub-blocks of a fifth predetermined length within a data block, where adjacent data sub-blocks overlap each other by a sixth predetermined length; and a computing unit adapted, for each data sub-block, to calculate a feature value list of a seventh predetermined length from the content of the data sub-block, and to construct the data feature string of the data block based on the feature value lists of all data sub-blocks.

B16. The device of any one of B13-15, wherein the computing module further includes: a character conversion unit adapted to convert the words in each word block of the third predetermined length into characters and to use the resulting character string as the word block; the blocking unit being further adapted to successively select sub-word blocks of the fifth predetermined length within a word block, where adjacent sub-word blocks overlap each other by the sixth predetermined length; and the computing unit being further adapted, for each sub-word block, to calculate a feature value list of the seventh predetermined length from the content of the sub-word block, and to construct the data feature string of the word block based on the feature value lists of all sub-word blocks.

B17. The device of B15 or 16, wherein the computing unit includes: an extraction subunit adapted to extract one or more content subsets formed from parts of the content of a data sub-block or sub-word block; the computing unit being further adapted to hash each content subset, using a hash algorithm, to a value between 0 and the seventh predetermined length, and to set the corresponding value in the feature value list of the seventh predetermined length according to the value obtained for each content subset.

B18. The device of B17, wherein the computing unit further includes: a counting subunit adapted to superimpose and merge the values at corresponding positions of the feature value lists of the individual data sub-blocks or sub-word blocks to obtain the feature value list of the data block or word block; and a binarization subunit adapted to binarize the value of each cell in that feature value list to obtain a feature value list in which every cell value is 0 or 1; the computing unit being further adapted to convert the feature value list of the seventh predetermined length into a numeric string of the seventh predetermined length, which serves as the data feature string of the data block or word block.

B19. The device of B18, wherein the binarization subunit is further adapted to calculate the average of all cell values in the feature value list and to compare the value of each cell with the average, setting the cell to 1 if its value is greater than the average and to 0 if its value is not greater than the average.

B20. The device of any one of B12-19, wherein the first predetermined length is 512 bytes and the second predetermined length is 256 bytes; the third predetermined length is 64 words and the fourth predetermined length is 32 words; the fifth predetermined length is 5 bytes and the sixth predetermined length is 1 byte; and the seventh predetermined length is 32 bytes.

B21. The device of any one of B12-20, wherein the feature extraction module is further adapted to extract the offset position information of each data block in the document for inclusion in the first data fingerprint.

B22. The device of any one of B13-21, wherein the feature extraction module is further adapted to extract the offset position information of each word block in the document for inclusion in the second data fingerprint.

C24. The judgment method of C23, wherein the step of calculating the similarity between the first feature set and the second feature set includes: calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and determining that two corresponding data feature strings are similar when the Hamming distance is greater than a first threshold.

C25. The method of C24, further including the step of: considering that the similarity between the first feature set and the second feature set reaches the preset range if any one data feature string is judged to be similar.

C26. The method of C24, further including the steps of: counting the number of data feature strings in the first feature set that are similar to the second feature set; calculating the ratio of that number to the total number of data feature strings in the first feature set; and considering that the similarity between the first feature set and the second feature set reaches the preset range if the ratio reaches a second threshold.

C27. The method of C23, wherein the step of calculating the similarity between the first feature set and the second feature set further includes: calculating the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set; and considering that the similarity between the first feature set and the second feature set reaches the preset range when the Jaccard coefficient reaches a third threshold.

C28. The method of C27, wherein the Jaccard coefficient is Jaccard = |S ∩ T| / |S ∪ T|, where S denotes the first feature set and T denotes the second feature set.

D30. The judging device of D29, wherein the similarity calculation module further comprises a similarity calculation unit adapted to calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and the similarity judgment module is further adapted to determine that two corresponding data feature strings are similar when the Hamming distance is greater than a first threshold. D31. The judging device of D30, wherein the similarity judgment module is further adapted to consider, when a data feature string is judged to be similar, that the similarity between the first feature set and the second feature set reaches the predetermined range. D32. The judging device of D30, wherein the similarity judgment module further comprises a statistics unit adapted to count the number of data feature strings in the first feature set that are similar to the second feature set and to calculate the ratio of that number to the total number of data feature strings in the first feature set; and the similarity judgment module is further adapted to consider, when the ratio reaches a second threshold, that the similarity between the first feature set and the second feature set reaches the predetermined range. D33. The judging device of D30, wherein the similarity calculation unit is further adapted to calculate the Jaccard coefficient of all data feature strings in the first feature set and all data feature strings in the second feature set; and the similarity judgment module is further adapted to consider, when the Jaccard coefficient reaches a third threshold, that the similarity between the first feature set and the second feature set reaches the predetermined range. D34. The judging device of D33, wherein the Jaccard coefficient is Jaccard = |S ∩ T| / |S ∪ T|, where S denotes the first feature set and T denotes the second feature set.

In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

In addition, some of the embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the described function. A processor with the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Furthermore, an element of a device embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.

As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", and so on to describe a common object merely indicates that different instances of similar objects are being referred to, and is not intended to imply that the objects so described must be in a given order, whether temporally, spatially, in ranking, or in any other manner.

Although the invention has been described in terms of a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected primarily for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is intended to be illustrative, not limiting, of the scope of the invention, which is defined by the appended claims.

Claims

1. A method for extracting data features from a document, comprising the steps of:

dividing the data in the document into blocks in sequence to obtain one or more data blocks of a first predetermined length, wherein adjacent data blocks overlap each other by a second predetermined length;

for the one or more data blocks obtained, calculating the data feature string of each data block based on the data content of that data block; and

combining the data feature strings of the data blocks to construct a first data fingerprint of the document as a data feature of the document,

wherein the first predetermined length and the second predetermined length are set based on the importance level of the document, and the higher the importance level of the document, the smaller the first predetermined length and the second predetermined length.
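The blocking step of claim 1 can be illustrated with a minimal Python sketch. The function name `split_into_blocks`, the returned (offset, block) pairs, and the handling of a shorter final block are assumptions made for illustration; the claim itself only requires blocks of a first predetermined length that overlap by a second predetermined length.

```python
def split_into_blocks(data: bytes, block_len: int = 512, overlap: int = 256) -> list[tuple[int, bytes]]:
    """Split data into blocks of block_len bytes; neighbours share `overlap` bytes.

    Returns (offset, block) pairs. Keeping a shorter final block is an
    assumption of this sketch, not something the claim specifies.
    """
    if not 0 <= overlap < block_len:
        raise ValueError("overlap must be non-negative and smaller than block_len")
    if not data:
        return []
    step = block_len - overlap  # distance between the starts of adjacent blocks
    blocks = []
    offset = 0
    while True:
        blocks.append((offset, data[offset:offset + block_len]))
        if offset + block_len >= len(data):
            break  # this block already reaches the end of the data
        offset += step
    return blocks


if __name__ == "__main__":
    sample = bytes(range(256)) * 4  # 1024 bytes of sample data
    for offset, block in split_into_blocks(sample):
        print(offset, len(block))
```

Under the claim's final clause, a more important document would simply be processed with smaller `block_len` and `overlap` values, which yields more, finer-grained blocks.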

2. The method of claim 1, further comprising the steps of:

performing word segmentation on the document to obtain a word sequence;

dividing the word sequence of the document into blocks in sequence to obtain one or more word blocks of a third predetermined length, wherein adjacent word blocks overlap each other by a fourth predetermined length;

for the one or more word blocks obtained, calculating the data feature string of each word block based on the data content of that word block; and

combining the data feature strings of the word blocks to construct a second data fingerprint of the document as a data feature of the document.

3. The method of claim 2, wherein the step of performing word segmentation on the document comprises:

performing word segmentation using a dictionary-based segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms, and four disambiguation rules.
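As a rough illustration of dictionary-based segmentation, the sketch below uses forward maximum matching, one common dictionary-driven matching strategy. The claim does not name its two matching algorithms or its four disambiguation rules, so this is a stand-in rather than the claimed algorithm; `forward_max_match`, the sample dictionary, and `max_word_len` are all hypothetical.

```python
def forward_max_match(text: str, dictionary: set[str], max_word_len: int = 4) -> list[str]:
    """Segment text greedily, always taking the longest dictionary word available."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


if __name__ == "__main__":
    # Hypothetical dictionary, for illustration only.
    vocab = {"数据", "安全", "数据安全", "防护"}
    print(forward_max_match("数据安全防护", vocab))
```

Greedy matching alone cannot resolve every ambiguity, which is presumably why the claim pairs its matching algorithms with disambiguation rules.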

4. The method of claim 3, wherein the step of calculating the data feature string of a data block based on the data content of the data block comprises:

successively selecting data sub-blocks of a fifth predetermined length from the data block, wherein adjacent data sub-blocks overlap each other by a sixth predetermined length;

for each data sub-block, calculating a feature value list of a seventh predetermined length according to the content of the data sub-block; and

constructing the data feature string of the data block based on the feature value lists of all data sub-blocks.

5. The method of claim 4, wherein the step of calculating the data feature string of a word block based on the data content of the word block comprises:

converting the words in each word block of the third predetermined length into characters to obtain a corresponding character string as the word block;

successively selecting sub-word blocks of the fifth predetermined length from the word block, wherein adjacent sub-word blocks overlap each other by the sixth predetermined length;

for each sub-word block, calculating a feature value list of the seventh predetermined length according to the content of the sub-word block; and

constructing the data feature string of the word block based on the feature value lists of all sub-word blocks.

6. The method of claim 5, wherein the step of calculating the feature value list of the seventh predetermined length according to the content of a data sub-block or sub-word block comprises:

extracting one or more content subsets, each consisting of part of the content of the data sub-block or sub-word block;

hashing each content subset, using a hash algorithm, to a value between 0 and the seventh predetermined length; and

setting the corresponding value in the feature value list of the seventh predetermined length according to the value corresponding to each content subset.
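The hashing step of claim 6 can be sketched as follows. The claim does not say how content subsets are chosen, which hash algorithm is used, or how a list entry is "set", so this sketch assumes 3-byte combinations as subsets, CRC-32 as the hash, and a counter increment per hit. The names `feature_value_list`, `list_len`, and `subset_size` are hypothetical.

```python
import zlib
from itertools import combinations


def feature_value_list(sub_block: bytes, list_len: int = 32, subset_size: int = 3) -> list[int]:
    """Hash each content subset of a sub-block into one slot of a fixed-size list."""
    values = [0] * list_len
    # Assumption: a content subset is any ordered choice of subset_size bytes.
    for subset in combinations(sub_block, subset_size):
        index = zlib.crc32(bytes(subset)) % list_len  # falls in 0 .. list_len - 1
        values[index] += 1  # assumption: "setting" a value means counting hits
    return values


if __name__ == "__main__":
    print(feature_value_list(b"hello"))
```

A 5-byte sub-block yields ten 3-byte subsets, so the counters in the returned list always sum to 10 for the default parameters.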

7. The method of claim 6, wherein the step of constructing the data feature string of the data block or word block based on the feature value lists comprises:

merging the feature value lists corresponding to the data sub-blocks or sub-word blocks by superimposing the values at corresponding positions, to obtain the feature value list of the corresponding data block or word block;

binarizing the value of each unit in that feature value list to obtain a feature value list in which the value of each unit is 0 or 1; and

converting the feature value list of the seventh predetermined length into a numeric string of the seventh predetermined length, as the data feature string of the data block or word block.

8. The method of claim 7, wherein the step of binarizing the value of each unit in the feature value list comprises:

calculating the average of all unit values in the feature value list;

comparing the value of each unit with the average value; and

if the value of a unit is greater than the average value, setting the value of the unit to 1, and if the value of a unit is not greater than the average value, setting the value of the unit to 0.
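The merge, binarize, and convert steps of claims 7 and 8 are simple enough to sketch directly. The helper names below are hypothetical, and rendering the numeric string as a text string of the characters 0 and 1 is an assumption; the claims do not fix an encoding.

```python
def merge_feature_lists(lists: list[list[int]]) -> list[int]:
    """Superimpose (add) the values at corresponding positions of all lists."""
    return [sum(column) for column in zip(*lists)]


def binarize(values: list[int]) -> list[int]:
    """Set a unit to 1 if it is greater than the average of all units, else 0."""
    average = sum(values) / len(values)
    return [1 if value > average else 0 for value in values]


def to_feature_string(bits: list[int]) -> str:
    """Render the 0/1 list as a numeric string (encoding assumed for this sketch)."""
    return "".join(str(bit) for bit in bits)


if __name__ == "__main__":
    merged = merge_feature_lists([[1, 0, 3, 0], [2, 0, 1, 0]])
    print(merged, to_feature_string(binarize(merged)))
```

Because the threshold is the block's own average, the resulting string reflects which hash slots are relatively busy rather than their absolute counts.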

9. The method of claim 8, wherein

the first predetermined length is 512 bytes, and the second predetermined length is 256 bytes;

the third predetermined length is 64 words, and the fourth predetermined length is 32 words;

the fifth predetermined length is 5 bytes, and the sixth predetermined length is 1 byte; and

the seventh predetermined length is 32 bytes.

10. The method of any one of claims 1-9, wherein

the first data fingerprint further includes the offset position information of the data blocks in the document.

11. The method of any one of claims 2-9, wherein

the second data fingerprint further includes the offset position information of the word blocks in the document.

12. A device for extracting data features from a document, the device comprising:

a blocking module, adapted to divide the data in the document into blocks in sequence to obtain one or more data blocks of a first predetermined length, wherein adjacent data blocks overlap each other by a second predetermined length;

a computing module, adapted to calculate, for the one or more data blocks obtained, the data feature string of each data block based on the data content of that data block; and

a feature extraction module, adapted to combine the data feature strings of the data blocks to construct a first data fingerprint of the document as a data feature of the document,

13. The device of claim 12, further comprising:

a word segmentation module, adapted to perform word segmentation on the document to obtain a word sequence;

wherein the blocking module is further adapted to divide the word sequence into blocks in sequence to obtain one or more word blocks of a third predetermined length, wherein adjacent word blocks overlap each other by a fourth predetermined length;

the computing module is further adapted to calculate, for the one or more word blocks obtained, the data feature string of each word block based on the data content of that word block; and

the feature extraction module is further adapted to combine the data feature strings of the word blocks to construct a second data fingerprint of the document as a data feature of the document.

14. The device of claim 13, wherein the word segmentation module is further adapted to perform word segmentation using a dictionary-based segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms, and four disambiguation rules.

15. The device of claim 14, wherein the computing module comprises:

a blocking unit, adapted to successively select data sub-blocks of a fifth predetermined length from the data block, wherein adjacent data sub-blocks overlap each other by a sixth predetermined length; and

a computing unit, adapted to calculate, for each data sub-block, a feature value list of a seventh predetermined length according to the content of the data sub-block, and to construct the data feature string of the data block based on the feature value lists of all data sub-blocks.

16. The device of claim 15, wherein the computing module further comprises:

a character conversion unit, adapted to convert the words in each word block of the third predetermined length into characters to obtain a corresponding character string as the word block;

wherein the blocking unit is further adapted to successively select sub-word blocks of the fifth predetermined length from the word block, wherein adjacent sub-word blocks overlap each other by the sixth predetermined length; and

the computing unit is further adapted to calculate, for each sub-word block, a feature value list of the seventh predetermined length according to the content of the sub-word block, and to construct the data feature string of the word block based on the feature value lists of all sub-word blocks.

17. The device of claim 16, wherein the computing unit comprises:

an extraction subunit, adapted to extract one or more content subsets, each consisting of part of the content of the data sub-block or sub-word block; and

the computing unit is further adapted to hash each content subset, using a hash algorithm, to a value between 0 and the seventh predetermined length, and to set the corresponding value in the feature value list of the seventh predetermined length according to the value corresponding to each content subset.

18. The device of claim 17, wherein the computing unit further comprises:

a counting subunit, adapted to merge the feature value lists corresponding to the data sub-blocks or sub-word blocks by superimposing the values at corresponding positions, to obtain the feature value list of the corresponding data block or word block; and

a binarization subunit, adapted to binarize the value of each unit in that feature value list to obtain a feature value list in which the value of each unit is 0 or 1;

wherein the computing unit is further adapted to convert the feature value list of the seventh predetermined length into a numeric string of the seventh predetermined length, as the data feature string of the data block or word block.

19. The device of claim 18, wherein

the binarization subunit is further adapted to calculate the average of all unit values in the feature value list and to compare the value of each unit with the average value; if the value of a unit is greater than the average value, the value of the unit is set to 1, and if the value of a unit is not greater than the average value, the value of the unit is set to 0.

20. The device of claim 19, wherein

the seventh predetermined length is 32 bytes.

21. The device of any one of claims 12-20, wherein

the feature extraction module is further adapted to extract the offset position information of the data blocks in the document, to be included in the first data fingerprint.

22. The device of any one of claims 13-20, wherein

the feature extraction module is further adapted to extract the offset position information of the word blocks in the document, to be included in the second data fingerprint.

23. A judgment method for judging whether a first document and a second document are related, the method comprising the steps of:

performing the method of any one of claims 1-11 on the first document to extract the data features of the document and obtain a first feature set, wherein the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document;

performing the method of any one of claims 1-11 on the second document to extract the data features of the document and obtain a second feature set, wherein the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document; and

calculating the similarity between the first feature set and the second feature set, and, if the similarity reaches a predetermined range, considering the first document and the second document to be related.

24. The judgment method of claim 23, wherein the step of calculating the similarity between the first feature set and the second feature set comprises:

calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and

when the Hamming distance is greater than a first threshold, determining that the two corresponding data feature strings are similar.
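A minimal sketch of the Hamming distance comparison follows. The text states the similarity test as "greater than the first threshold"; since a smaller Hamming distance conventionally indicates closer strings, the sketch leaves the comparison direction as a parameter that defaults to the wording of the text. The names `hamming_distance`, `strings_similar`, and `compare` are hypothetical.

```python
import operator
from typing import Callable


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length feature strings differ."""
    if len(a) != len(b):
        raise ValueError("feature strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def strings_similar(a: str, b: str, first_threshold: int,
                    compare: Callable[[int, int], bool] = operator.gt) -> bool:
    """Apply the threshold test; `compare` defaults to the text's 'greater than'."""
    return compare(hamming_distance(a, b), first_threshold)


if __name__ == "__main__":
    print(hamming_distance("1010", "1001"))
    print(strings_similar("1010", "1001", first_threshold=1))
```

Passing `compare=operator.le` gives the conventional reading, in which two strings count as similar when they differ in at most `first_threshold` positions.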

25. The method of claim 24, further comprising the step of:

if any data feature string is judged to be similar, considering that the similarity between the first feature set and the second feature set reaches the predetermined range.

26. The method of claim 24, further comprising the steps of:

counting the number of data feature strings in the first feature set that are similar to the second feature set;

calculating the ratio of that number to the total number of data feature strings in the first feature set; and

if the ratio reaches a second threshold, considering that the similarity between the first feature set and the second feature set reaches the predetermined range.
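The ratio test of claim 26 can be sketched as below, assuming the per-string similarity judgments have already been made and are passed in as booleans, one per data feature string of the first feature set. The function names and the treatment of an empty set are assumptions of this sketch.

```python
def similar_ratio(similar_flags: list[bool]) -> float:
    """Return the fraction of feature strings that were judged similar."""
    if not similar_flags:
        return 0.0  # assumption: an empty feature set matches nothing
    return sum(similar_flags) / len(similar_flags)


def reaches_predetermined_range(similar_flags: list[bool], second_threshold: float) -> bool:
    """True if the similar fraction reaches the second threshold."""
    return similar_ratio(similar_flags) >= second_threshold


if __name__ == "__main__":
    flags = [True, False, True, True]
    print(similar_ratio(flags), reaches_predetermined_range(flags, 0.5))
```

Compared with the any-match rule of claim 25, a ratio threshold tolerates a few coincidental matches before two documents are treated as related.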

27. The method of claim 23, wherein the step of calculating the similarity between the first feature set and the second feature set further comprises:

calculating the Jaccard coefficient of all data feature strings in the first feature set and all data feature strings in the second feature set; and

when the Jaccard coefficient reaches a third threshold, considering that the similarity between the first feature set and the second feature set reaches the predetermined range.

28. The method of claim 27, wherein the Jaccard coefficient is:

Jaccard = |S ∩ T| / |S ∪ T|,

where S denotes the first feature set and T denotes the second feature set.
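The Jaccard coefficient defined above maps directly onto Python set operations. In this sketch the feature sets are modeled as sets of data feature strings, and the value returned for two empty sets is an assumption, since the formula is undefined in that case.

```python
def jaccard(s: set[str], t: set[str]) -> float:
    """Return |S ∩ T| / |S ∪ T| for two sets of data feature strings."""
    union = s | t
    if not union:
        return 0.0  # assumption: two empty sets are treated as unrelated
    return len(s & t) / len(union)


if __name__ == "__main__":
    first = {"1010", "1100", "0011"}
    second = {"1100", "0011", "0101"}
    print(jaccard(first, second))
```

Here the two sets share two of four distinct strings, so the coefficient is 0.5; the documents would be considered related if the third threshold were 0.5 or lower.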

29. A judging device for judging whether a first document and a second document are related, the device comprising:

the device for extracting data features from a document of any one of claims 12-22, adapted to extract the first feature set of the first document and the second feature set of the second document respectively, wherein

the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document, and

the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document;

a similarity calculation module, adapted to calculate the similarity between the first feature set and the second feature set; and

a similarity judgment module, adapted to consider the first document and the second document to be related when the similarity is judged to reach a predetermined range.

30. The judging device of claim 29, wherein the similarity calculation module further comprises:

a similarity calculation unit, adapted to calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and

the similarity judgment module is further adapted to determine that two corresponding data feature strings are similar when the Hamming distance is greater than a first threshold.

31. The judging device of claim 30, wherein

the similarity judgment module is further adapted to consider, when a data feature string is judged to be similar, that the similarity between the first feature set and the second feature set reaches the predetermined range.

32. The judging device of claim 30, wherein the similarity judgment module further comprises:

a statistics unit, adapted to count the number of data feature strings in the first feature set that are similar to the second feature set, and to calculate the ratio of that number to the total number of data feature strings in the first feature set; and

the similarity judgment module is further adapted to consider, when the ratio reaches a second threshold, that the similarity between the first feature set and the second feature set reaches the predetermined range.

33. The judging device of claim 30, wherein

the similarity calculation unit is further adapted to calculate the Jaccard coefficient of all data feature strings in the first feature set and all data feature strings in the second feature set; and

the similarity judgment module is further adapted to consider, when the Jaccard coefficient reaches a third threshold, that the similarity between the first feature set and the second feature set reaches the predetermined range.

34. The judging device of claim 33, wherein the Jaccard coefficient is:

Jaccard = |S ∩ T| / |S ∪ T|,

where S denotes the first feature set and T denotes the second feature set.

35. A method for judging whether a suspicious document contains sensitive content, the method comprising the steps of:

performing the method of any one of claims 1-11 on a protected document to extract the data features of the document and build a feature library, wherein the feature library includes the first data fingerprint and the second data fingerprint of the protected document;

performing the judgment method of any one of claims 23-28 on the suspicious document, wherein the data features of the suspicious document are extracted as the second feature set and the feature library is used as the first feature set;

if the suspicious document is judged to be related to the protected document, considering that the suspicious document contains sensitive content; and

if the suspicious document is judged to be unrelated to the protected document, considering that the suspicious document does not contain sensitive content.
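The overall decision of claim 35 can be illustrated with a small sketch. It assumes that the feature library is stored as a mapping from a protected document's identifier to its set of data feature strings, and that relatedness is decided with the Jaccard coefficient of claims 27 and 28. Both choices, and every name below, are assumptions for illustration; the claim also allows the Hamming-distance tests of claims 24 to 26.

```python
def jaccard(s: set[str], t: set[str]) -> float:
    """Return |S ∩ T| / |S ∪ T|, or 0.0 when both sets are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def matching_protected_documents(feature_library: dict[str, set[str]],
                                 suspicious_features: set[str],
                                 third_threshold: float) -> list[str]:
    """List the protected documents whose features are related to the suspicious ones."""
    return [doc_id for doc_id, features in feature_library.items()
            if jaccard(features, suspicious_features) >= third_threshold]


def contains_sensitive_content(feature_library: dict[str, set[str]],
                               suspicious_features: set[str],
                               third_threshold: float = 0.5) -> bool:
    """A suspicious document is sensitive if it is related to any protected document."""
    return bool(matching_protected_documents(feature_library, suspicious_features, third_threshold))


if __name__ == "__main__":
    library = {"contract": {"1010", "1100", "0011"}, "memo": {"1111"}}  # hypothetical data
    print(contains_sensitive_content(library, {"1010", "1100", "0101"}))
```

Returning the matching identifiers as well as the yes/no verdict makes it possible to report which protected document a suspicious file resembles.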

36. A device for judging whether a suspicious document contains sensitive content, the device comprising:

the device for extracting data features from a document of any one of claims 12-22, adapted to extract data features from a protected document, and further adapted to extract the data features of the suspicious document as a second feature set;

a storage module, adapted to store the data features of the protected document as a feature library, wherein the feature library includes the first data fingerprint and the second data fingerprint of the protected document;

the judging device of any one of claims 29-34, adapted to judge whether the suspicious document is related to the protected document in the feature library; and

a determining module, adapted to determine that the suspicious document contains sensitive content when the suspicious document is judged to be related to the protected document, and to determine that the suspicious document does not contain sensitive content when the suspicious document is judged to be unrelated to the protected document.

37. A data leakage prevention system, comprising:

a computing device, connected to a data security protection device; and

the data security protection device, comprising:

a document acquisition device, adapted to acquire the document content sent by the computing device;

the sensitive content judging device of claim 36, adapted to judge whether the acquired document contains sensitive content;

a control policy acquisition device, adapted to acquire, when it is judged whether the document contains sensitive content, the control policy corresponding to the process related to the document; and

a control device, adapted to control operations on the document according to the acquired control policy when the suspicious document is judged to contain sensitive content.