CN105955978B - Method and system for leakage prevention - Google Patents



Publication number
CN105955978B
Authority
CN
China
Prior art keywords
data
document
word
value
feature set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610237061.9A
Other languages
Chinese (zh)
Other versions
CN105955978A (en
Inventor
李唱
康靖
陈虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quantum innovation (Beijing) Information Technology Co., Ltd
Original Assignee
Baoli Nine Chapter (beijing) Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baoli Nine Chapter (beijing) Data Technology Co Ltd filed Critical Baoli Nine Chapter (beijing) Data Technology Co Ltd
Priority to CN201610237061.9A priority Critical patent/CN105955978B/en
Publication of CN105955978A publication Critical patent/CN105955978A/en
Publication of CN105955978B publication Critical patent/CN105955978B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a method and system for data leakage prevention, including: a method of extracting data features from a document to obtain a first data fingerprint and a second data fingerprint; a method of judging, from the extracted data features, whether a first document and a second document are related; and a method of judging, according to the degree of correlation, whether a suspicious document contains sensitive content. The invention also provides the corresponding device for extracting document data features, device for judging whether a first document and a second document are related, and device for judging whether a suspicious document contains sensitive content.

Description

Method and system for leakage prevention
Technical field
The present invention relates to the technical field of data security, and in particular to a method and system for data leakage prevention.
Background technique
In recent years, with the rapid development of information technology, data security has become particularly important in the daily operation of information-driven enterprises. If data is maliciously tampered with or destroyed, the enterprise may suffer irrecoverable losses. To improve information security, data security policies generally need to be put in place to monitor and protect data. In the current big-data environment, as the volume of enterprise data grows, how to monitor and protect ever-increasing data quickly and efficiently has become a major problem facing the data security field.
At present, to prevent data leakage, many enterprises have deployed a data leakage prevention (DLP) system inside the intranet to ensure the security of sensitive data. A DLP system uses software to monitor and protect sensitive data and, by technical means, prevents specified enterprise data or information assets from leaving the enterprise in ways that violate the security policy, so that sensitive data is neither lost nor leaked. In a DLP system, therefore, extracting data features and matching sensitive data are key steps.
Traditional DLP systems usually extract data features either by manually setting keywords or by generating a data fingerprint for the entire file. The former cannot perform feature extraction automatically; with the latter, extraction accuracy drops when the file is large. In addition, sensitive data is usually matched with rule matching and hash matching algorithms; likewise, when files are large, both the performance and the accuracy of these algorithms degrade sharply.
Summary of the invention
To this end, the present invention provides a method and system for data leakage prevention, in an effort to solve or at least alleviate at least one of the problems above.
According to one aspect of the invention, a method of extracting data features from a document is provided, wherein the extracted data features include a first data fingerprint and a second data fingerprint. The method comprises the steps of: extracting a first predetermined number of words from the document, computing the data feature string corresponding to each word, and constructing the first data fingerprint of the document from this first predetermined number of data feature strings; and dividing the data in the document into blocks in sequence, computing the data feature string of each data block from the data content in that block, and combining the data feature strings of the data blocks to construct the second data fingerprint of the document.
According to another aspect of the invention, a method of judging whether a first document and a second document are related is provided, comprising the steps of: executing the data feature extraction method described above on the first document to obtain a first feature set; executing the data feature extraction method described above on the second document to obtain a second feature set; and computing the similarity between the first feature set and the second feature set, the first document and the second document being considered related if the similarity falls within a preset range.
According to another aspect of the invention, a method of judging whether a suspicious document contains sensitive content is provided, comprising the steps of: executing the data feature extraction method described above on confidential documents, extracting their data features, and building a feature library; then extracting the data features of the suspicious document and executing the above method of judging whether documents are related, to judge whether the suspicious document is related to a confidential document in the feature library. If the suspicious document is judged to be related to a confidential document, the suspicious document is considered to contain sensitive content; if it is judged to be unrelated, the suspicious document is considered not to contain sensitive content.
Correspondingly, the present invention also provides a device for extracting data features from a document, a device for judging whether a first document and a second document are related, and a device for judging whether a suspicious document contains sensitive content.
According to a further aspect of the invention, a data leakage prevention system is provided, comprising: computing devices connected to a data security protection device; and the data security protection device, which comprises a document acquisition device, the sensitive content judgment device described above, a control policy acquisition device, and a control device.
Based on the above, this scheme extracts the data features of a document by automatically extracting the document's keywords and by extracting data fingerprints of data blocks. On the one hand, keywords are selected by computing a feature value that characterizes word importance, so there is no reliance on manually set keywords. On the other hand, the document is divided into blocks, a data fingerprint is computed for each block, and the fingerprints are generated with a locality-sensitive hashing (LSH) algorithm, which effectively prevents the leakage of similar data and preserves the accuracy of feature extraction even when the document is very large.
For feature matching, this scheme judges similar content in documents either by matching the similarity of individual data feature strings (single matching) or by computing the proportion of similar data feature strings (benchmark matching). Optionally, the similarity between documents can be expressed with the Hamming distance or the Jaccard coefficient. In this way, sensitive data can be matched more comprehensively, preventing sensitive data from leaking and effectively blocking the various means of leaking documents.
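As a minimal sketch of the two similarity measures named above, the following Python computes the Hamming distance between two equal-length signatures and the Jaccard coefficient between two sets of data feature strings. The function names and the convention for empty sets are illustrative assumptions, not part of the described scheme.

```python
def hamming_distance(sig_a: bytes, sig_b: bytes) -> int:
    """Number of differing bits between two equal-length signatures."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(sig_a, sig_b))


def jaccard_coefficient(features_a: set, features_b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = features_a | features_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(features_a & features_b) / len(union)
```

A small Hamming distance or a Jaccard coefficient close to 1 would indicate similar documents; the thresholds that define "related" are left open here, as in the text.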
Brief description of the drawings
To accomplish the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and drawings. These aspects indicate various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, the same reference numerals generally refer to the same components or elements.
Fig. 1 shows a schematic diagram of a data leakage prevention system 100 according to an embodiment of the invention;
Fig. 2A shows a flowchart of a method 200 of extracting data features from a document according to an embodiment of the invention;
Fig. 2B shows a flowchart of the method 200 of extracting data features from a document according to another embodiment of the invention;
Fig. 3A shows a schematic diagram of a device 300 for extracting data features from a document according to an embodiment of the invention;
Fig. 3B shows a schematic diagram of the device 300 for extracting data features from a document according to another embodiment of the invention;
Fig. 4 shows a flowchart of a method 400 of judging whether a first document and a second document are related according to an embodiment of the invention;
Fig. 5 shows a schematic diagram of a device 500 for judging whether a first document and a second document are related according to an embodiment of the invention;
Fig. 6 shows a flowchart of a method 600 of judging whether a suspicious document contains sensitive content according to an embodiment of the invention;
Fig. 7 shows a schematic diagram of a device 700 for judging whether a suspicious document contains sensitive content according to an embodiment of the invention; and
Fig. 8 schematically illustrates the block division process.
Detailed description of embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a schematic diagram of a data leakage prevention system 100 according to an embodiment of the invention. Inside the enterprise, the computing devices 110 are connected to one another through a local area network. The components of a computing device 110 may include, but are not limited to: one or more processors or processing units, a system memory, and a bus connecting the different system components (including the system memory and the processing units). It should be noted that, in addition to conventional computing devices (for example, computers), the computing devices 110 suitable for implementing embodiments of the invention also include mobile electronic devices, including but not limited to mobile phones, PDAs, and tablet computers, as well as servers, printers, CD/DVD, and the like in the enterprise working environment.
A data security protection device 120 for data leakage prevention is deployed in the local area network and is connected through it to all computing devices 110. As shown in Fig. 1, the protection device 120 includes: a document acquisition device 122, a sensitive content judgment device 700, a control policy acquisition device 124, and a control device 126.
The document acquisition device 122 is adapted to monitor all computing devices 110 in the local area network in real time and, when it detects that a computing device 110 is sending a document, to obtain the content of the document being sent. Here, the document may be instant-messaging chat information and/or a picture or document transmitted by instant messaging.
The sensitive content judgment device 700 is adapted to judge whether the obtained document contains sensitive content; device 700 is described in greater detail below.
The control policy acquisition device 124 is adapted to obtain, while it is being judged whether the document contains sensitive content, the control policy corresponding to the process associated with the document. Optionally, the control policies may include: a no-print policy when the specified process is printing, and a garbled-string policy when the specified process is file transfer.
The control device 126 is adapted to control operations on the document according to the obtained control policy when the suspicious document is judged to contain sensitive content. For example, the sensitive data in the content to be transmitted is replaced with a string that marks it as garbled.
Based on the above description of system 100, accurately matching sensitive content is the key to data security protection in this system; this is the operation performed by the sensitive content judgment device 700. In brief, the sensitive content judgment device 700 should include (but is not limited to) a storage module (for storing all data features of confidential documents), a device for extracting document data features (for extracting the data features of a suspicious document), a document correlation judgment device (for judging, from the extracted data features, whether the suspicious document is related to a confidential document), and a determination module (for determining, from the correlation judgment result, whether the suspicious document contains sensitive content).
The composition of each of the above modules and the processes they execute are described below.
Figs. 2A and 2B show flowcharts of the method 200 of extracting data features from a document; Fig. 2A shows a flowchart of a method of extracting the first data feature from a document according to an embodiment of the invention. As shown in Fig. 2A, the method starts at step S210. In step S210, the document is first segmented into words, and useless information such as stop words, punctuation marks, and line breaks is removed to obtain a word sequence. According to an embodiment of the invention, method 200 performs word segmentation with a dictionary-based algorithm such as MMSEG (A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm). MMSEG is a commonly used dictionary-based algorithm for Chinese word segmentation; it is simple and intuitive, uncomplicated to implement, and fast. In brief, the algorithm consists of "matching algorithms" and "disambiguation rules". A matching algorithm determines how the sentence to be segmented is matched against the words stored in the dictionary. The disambiguation rules decide which segmentation to use when a sentence can be split in more than one way. For example, the Chinese phrase for "facilities and services" can be split as "facilities/kimono/affairs" or as "facilities/and/services"; choosing between these results is the job of the disambiguation rules. MMSEG defines two matching algorithms: simple maximum matching and complex maximum matching. It defines four disambiguation rules: maximum matching (corresponding to the two matching algorithms above), largest average word length, smallest variance of word lengths, and largest sum of degree of morphemic freedom of one-character words (take the natural logarithm of the frequency of every one-character word in a phrase, sum the resulting values, and choose the phrase with the largest sum).
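The simple maximum matching step can be sketched as follows. This is a minimal illustration, not the full MMSEG algorithm: it implements only greedy forward matching, with no disambiguation rules, and the function name, the `max_len` parameter, and the sample dictionary are assumptions made for the example.

```python
def simple_max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    """Greedy forward maximum matching: take the longest dictionary word at each position."""
    words = []
    pos = 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical dictionary; the phrase is assumed to be the Chinese for "facilities and services".
sample_dictionary = {"设施", "和服", "服务", "和"}
print(simple_max_match("设施和服务", sample_dictionary))
```

With this dictionary the greedy pass picks the split that ends in a stray single character, which is the kind of ambiguity the disambiguation rules are meant to resolve.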
For example, document A below yields document B after word segmentation.
Document A:
" Group Life Accident Insurance material benefit plan
Unexpected injury: refers to an objective event that is external, sudden, unintended, and not caused by disease, and that harms the body.
Examples include being hit in a traffic accident, being burned in a fire, being injured by an object falling from a height, being attacked and injured by a ruffian, a liquefied gas or coal gas explosion,
and a cook being scalded by boiling oil; all of these are unexpected injury events.
Two combined schemes are recommended, for the unit to choose from according to its actual conditions:
1. Unexpected injury insurance: insurance premium 150 yuan/person (occupation classes 1 and 2)
(1) Insurance period: 1 year
(2) Insurance liability: for death due to unexpected injury, 100,000 yuan is paid; or for total disability or burns due to unexpected injury, 100,000 yuan is paid (partial disability is paid in proportion); unexpected injury medical treatment, 10,000 yuan.
2. Unexpected injury and medical insurance: insurance premium 100 yuan/person (occupation classes 1 and 2)
(1) Insurance period: 1 year
(2) Insurance liability: for death due to unexpected injury, 50,000 yuan is paid; or for total disability or burns due to unexpected injury, 50,000 yuan is paid (partial disability is paid in proportion); unexpected injury medical treatment, 10,000 yuan.
Note: our company can design an insurance scheme according to the specific circumstances of your unit.
Pay little, be covered for much; insuring is convenient and claims are settled quickly."
Document B (word sequence obtained after word segmentation processing):
[group, the person is unexpected, injures, insurance, material benefit, and plan is unexpected, and injury refers to, by, external, burst, non- Meaning, non-, disease, body, injury, objective, event, traffic, accident are hit, and are occurred, fire, burn, high-altitude, pendant, and object is hit, and are caused Wound, ruffian attack, injured, liquefied gas, coal gas, explosion, cook, boiling, oil, and scald is fixed one's mind on, and outside, injury, event is recommended, Two kinds, combination, scheme, official documents and correspondence, confession, unit, in conjunction with, actual conditions are selected, and it is unexpected, it injures, insurance, insurance premium, 150 yuan, Grade, occupation, insurance, during which, 1 year, insurance, responsibility was unexpected, and injury is die, and paid, 10, Wan Yuan, or because, unexpected, injury, entirely, It is residual, it is unexpected, it burns, pays, 10, Wan Yuan, part, it is disabled, it in proportion, pays, unexpected, injury, medical treatment, 1, Wan Yuan, unexpected, injury, Medical insurance, insurance premium, 100 yuan, grade, occupation, insurance, during which, 1 year, insurance, responsibility was unexpected, and injury is die, and paid, 5, ten thousand Member, or because, unexpected, injury is entirely, residual, burns, pays, 5, Wan Yuan, part, and it is disabled, it in proportion, pays, unexpected, injury, medical treatment, 1, Wan Yuan, note, company can press, your unit, and specifically, situation designs, and insurance, scheme, official documents and correspondence is paid the bill, and seldom, ensures, much, throw It protects, convenient, Claims Resolution, rapidly]
It should be noted that the invention is not limited to a specific segmentation method; any method that segments a document to obtain its meaningful words falls within the scope of protection of the invention.
Then, in step S220, for each word in the word sequence, a feature value characterizing the word's importance in the document is computed, and a first predetermined number of words are chosen from the word sequence according to the magnitude of the feature value.
According to an embodiment of the invention, the TF-IDF value is used to characterize the importance of a word in the document. The TF-IDF value is computed as follows:
1. Compute the frequency with which the word occurs in the document as the term frequency TF of the word.
2. Compute the ratio between the total number of documents in the document library and the number of documents in the library that contain the word, as the inverse document frequency IDF of the word.
3. Compute TF-IDF from the term frequency TF and the inverse document frequency IDF of the word:
TF-IDF = TF × IDF
For a given word, the larger its TF-IDF value, the more important the word is to the document. The words are therefore ranked by TF-IDF value, and the top first predetermined number (for example, 5) of words are selected as the keywords of the document.
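The keyword selection can be sketched in Python as below. This is a minimal sketch under stated assumptions: TF is taken as the word's count divided by the document's word count, and IDF uses the common smoothed logarithmic form log(N / (1 + n)), which is an assumption here because the text describes IDF only as a ratio. The function name `top_keywords` and the representation of the document library as a list of word sets are likewise illustrative.

```python
import math
from collections import Counter


def top_keywords(doc_words: list, library: list, k: int = 5) -> list:
    """Rank the words of one document by TF-IDF and return the top k."""
    counts = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        containing = sum(1 for other in library if word in other)
        # Assumption: logarithm of the ratio, with +1 to avoid division by zero.
        idf = math.log(len(library) / (1 + containing))
        scores[word] = tf * idf
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


doc = ["insurance"] * 3 + ["injury", "the"]
library = [{"insurance", "the"}, {"the"}, {"the", "weather"}, {"the"}]
print(top_keywords(doc, library, k=2))
```

Words that appear in every library document, such as "the" in this sample, receive a low score and fall out of the keyword list.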
In traditional DLP systems, keyword features are extracted mainly by setting keywords manually, which is clearly time-consuming and laborious. This scheme selects the document's keywords automatically by computing TF-IDF values, which saves labor while also ensuring extraction accuracy.
Then, in step S230, for each word in the selected first predetermined number of words, the data feature string corresponding to the word is computed, and the first data fingerprint of the document is constructed from these data feature strings as a data feature of the document. Specifically, each selected keyword is hashed by a common hash algorithm into a numeric string of a first predetermined length, which serves as the data feature string of that word; the resulting data feature strings are then combined to obtain the first data fingerprint as a data feature of the document.
An example of extracting keywords from a document and generating the first data fingerprint is shown below:
doc_size:2278
word_num:284
keyword_num:5 // first predetermined number
word: injury word_hash:4229635582 offset:18
word: unexpected word_hash:424898618 offset:12
word: insurance word_hash:802497295 offset:24
word: payment word_hash:3684743136 offset:1594
word: official documents and correspondence word_hash:1412961926 offset:1051
For a document of 2278 bytes containing 284 words, 5 keywords are generated (injury, unexpected, insurance, payment, official documents and correspondence). Each of these 5 keywords is then turned into one data feature string by a common hash algorithm. In this embodiment, a data feature string of the first predetermined length is a 32-bit unsigned integer, and concatenating these 5 data feature strings gives the first data fingerprint of the document. According to one implementation, the first data fingerprint also includes, for each of the 5 keywords, its offset position information offset in the document, which records the position where the keyword first occurs. The offset is mainly used to trace sensitive data (that is, the keyword) back to its source: when a sensitive data leak is detected, the alert sent to the user can carry the offset information. Of course, to save memory, the first data fingerprint may also omit the offset information; the invention places no restriction on this.
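A minimal sketch of building such a first data fingerprint follows. The text does not name the "common hash algorithm", so `zlib.crc32` is used purely as a stand-in that yields a 32-bit unsigned integer; treating offset as a byte position in the UTF-8 encoded document, and the function name `first_fingerprint`, are also assumptions.

```python
import zlib


def first_fingerprint(text: str, keywords: list) -> list:
    """Return (word, 32-bit hash, byte offset of first occurrence) for each keyword."""
    data = text.encode("utf-8")
    fingerprint = []
    for word in keywords:
        encoded = word.encode("utf-8")
        # Stand-in for the unspecified "common hash algorithm"; result fits in 32 bits.
        word_hash = zlib.crc32(encoded) & 0xFFFFFFFF
        offset = data.find(encoded)  # -1 if the keyword does not occur
        fingerprint.append((word, word_hash, offset))
    return fingerprint


for entry in first_fingerprint("accident insurance covers injury", ["injury", "insurance"]):
    print("word:%s word_hash:%d offset:%d" % entry)
```

Dropping the third element of each tuple gives the memory-saving variant without offset information.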
To make feature extraction sufficiently accurate, according to an embodiment of the invention, in addition to the first data fingerprint, a second data fingerprint can also be extracted as a data feature of the document. Fig. 2B shows a flowchart of a method of extracting the second data fingerprint from a document according to another embodiment of the invention. The steps of this method are as follows:
First, in step S240, the data in the document is divided into blocks in sequence to obtain one or more data blocks of a second predetermined length, where adjacent data blocks overlap by a third predetermined length. In other words, a sliding window of the second predetermined length is slid along the document, advancing by the third predetermined length each time, so that the document is divided into multiple data blocks of the second predetermined length. According to an embodiment of the invention, the second predetermined length is set to 512 bytes and the third predetermined length to 256 bytes.
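The sliding-window block division can be sketched as below. The function name `split_blocks` is hypothetical, and letting the trailing blocks be shorter than the window is an assumption, since the text does not say how the end of the document is handled.

```python
def split_blocks(data: bytes, block_size: int = 512, step: int = 256) -> list:
    """Return (offset, block) pairs from a sliding window; trailing blocks may be shorter."""
    if block_size <= 0 or step <= 0:
        raise ValueError("block_size and step must be positive")
    return [(offset, data[offset:offset + block_size])
            for offset in range(0, len(data), step)]


blocks = split_blocks(bytes(2278))
print(len(blocks), [offset for offset, _ in blocks])
```

For a 2278-byte input with the 512-byte window and 256-byte step, this yields 9 blocks at offsets 0, 256, and so on up to 2048.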
Then, in step S250, for the one or more data blocks obtained, the data feature string of each data block is computed from the data content in that block. Optionally, a locality-sensitive hashing (LSH) algorithm is used to generate a data signature for the data content of each data block.
Then, in step S260, the data feature strings of the data blocks are combined to construct the second data fingerprint of the document as a data feature of the document.
If a data fingerprint were generated for the entire document, the algorithm's performance would degrade sharply and its accuracy would also drop when the document is very large. This scheme therefore first divides the document into blocks and then extracts the data features of each block as the data features of the entire document. Moreover, with a common hash algorithm such as MD5, two equal data signatures indicate that the original data is equal with a certain probability, but unequal signatures provide no information beyond the fact that the original data differs. Traditional hash algorithms therefore cannot defend well against the leakage of similar data, which is why this method uses a locality-sensitive hashing algorithm. The advantage of LSH is that it generates identical or similar data signatures for similar data content, which improves the subsequent feature matching.
The step of computing the data feature string of each data block in step S250 can be further divided into the following steps:
First, data sub-blocks of a fourth predetermined length are selected from the data block in sequence, where adjacent data sub-blocks overlap by a fifth predetermined length. Similarly, in an implementation, a sliding window of the fourth predetermined length (for example, 5 bytes) can be slid along the data block, advancing by the fifth predetermined length (for example, 1 byte) each time. As shown in Fig. 8, D1 represents the fourth predetermined length and D2 represents the fifth predetermined length; during block division, every two adjacent D1 overlap by one D2.
Next, a feature value list of a sixth predetermined length (for example, 32 bytes) is computed from the content of the data sub-blocks.
Finally, the data feature string of the data block is constructed from the feature value lists of all data sub-blocks.
Specifically, the feature value list of the sixth predetermined length (for example, 32 bytes) is computed from the content of a data sub-block as follows:
1) Extract one or more content subsets, each consisting of part of the content of the data sub-block; in other words, extract all triples in each data sub-block;
2) Use a hash algorithm to hash each triple into the range (0, sixth predetermined length);
3) Set the corresponding value in the feature value list according to the value corresponding to each content subset; for example, for a triple igr whose hash value is assumed to be 15, add 1 at the 15th position in the feature value list.
When all triples in a data block have been processed, each position in the feature value list holds an accumulated value. The average of all accumulated values is computed as a threshold: if the accumulated value at a position (that is, the value of a cell in the feature value list) is greater than the average, the cell is set to 1; otherwise it is set to 0. This binarization yields a binarized feature value list of the sixth predetermined length, which is converted into a numeric string of the sixth predetermined length and used as the data feature string of the data block.
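The per-block signature described above can be sketched as follows. Several details are assumptions made for illustration: the window advances 1 byte at a time, "all triples" is read as every 3-byte combination within a 5-byte sub-block, `zlib.crc32` stands in for the unspecified hash, and the 32-byte list is treated as 256 one-bit cells. The function name `block_signature` is hypothetical.

```python
import zlib
from itertools import combinations


def block_signature(block: bytes, window: int = 5, buckets: int = 256) -> bytes:
    """Binarized triple histogram of one data block, packed into buckets // 8 bytes."""
    counts = [0] * buckets
    for start in range(max(len(block) - window + 1, 0)):
        sub_block = block[start:start + window]
        # Assumption: every 3-byte combination of the sub-block counts as a triple.
        for triple in combinations(sub_block, 3):
            counts[zlib.crc32(bytes(triple)) % buckets] += 1
    threshold = sum(counts) / buckets  # average of the accumulated values
    bits = 0
    for value in counts:
        # A cell becomes 1 only when its accumulated value exceeds the average.
        bits = (bits << 1) | (1 if value > threshold else 0)
    return bits.to_bytes(buckets // 8, "big")


signature = block_signature(b"data leakage prevention " * 20)
print(signature.hex())
```

Because each cell depends only on whether its count exceeds the average, a small change to the block alters few counts and is expected to flip few bits, which is the property that makes signatures of similar blocks comparable by Hamming distance.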
An example of generating the second data fingerprint for a document is shown below:
doc_size:2278
sig_num:9 // number of data blocks
bin_block_size:512 // second predetermined length
bin_step_size:256 // third predetermined length
threshold:75 // threshold used when executing the LSH algorithm
LSH:4f2745a4400f311cab5843643a9771299c9c5f4d81e1669ce3554e4d75fed43a offset:0
LSH:ef2205c4808533748bda836571976d2196b81dec8da154d6aad5366dfcb9d6d9 offset:256
LSH:ade326d490a64b77bbd349f0c0bced2096f874e9ad42dc7d24bef279ea05c5d9 offset:512
LSH:3b276695d8c63bdfeb1340c0c450c0c096ea6e79cdc2bc596ce7e35cea28e7be offset:768
LSH:2faded8472873999e9154bcc684270ec92a67a7cc9c02cd8eae742dc2a58c21e offset:1024
LSH:afa1f584d0a733a7a3559bc8530b78688aa473fc1be06df5aa23469c1b78c28a offset:1280
LSH:8fa4f5cd80e533b6abc1964b520f306088b073f081616df52803461f2af9c78a offset:1536
LSH:1fa56499c0a7333ca05046c1520420608a9577a093f12d586a114c3f12e8e3ce offset:1792
LSH:180d2cba6027725c0010468030c08142c857e7808a91275e2a51097b9300a34e offset:2048
In this embodiment, for a 2278-byte document with a second predetermined length of 512 bytes and a third predetermined length of 256 bytes, 9 data blocks are generated. The data feature string of each of the 9 data blocks is generated with the LSH algorithm (with the fourth predetermined length set to 5 bytes, the fifth predetermined length to 1 byte, and the sixth predetermined length to 32 bytes), and concatenating these 9 data feature strings gives the second data fingerprint of the document. As with the first data fingerprint, the second data fingerprint may also include the offset position (offset) of each data block in the document.
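The block offsets in this example can be reproduced with a short sketch. It assumes that blocks start every 256 bytes and that trailing blocks are simply shorter than 512 bytes; the source does not say how the tail of the document is handled, and the function name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int = 512, step: int = 256):
    """Yield (offset, block) pairs; consecutive blocks start `step` bytes apart."""
    for offset in range(0, len(data), step):
        yield offset, data[offset:offset + block_size]


doc = bytes(2278)  # placeholder content with the example's doc_size
offsets = [offset for offset, _ in split_into_blocks(doc)]
print(len(offsets), offsets)
# 9 [0, 256, 512, 768, 1024, 1280, 1536, 1792, 2048]
```

The 9 offsets match the `offset` fields of the 9 LSH entries listed in the example.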
It should be noted that, while method 200 is executed, the first predetermined number, the second predetermined length and the third predetermined length can be set according to the importance level of the document, so as to vary the granularity of feature extraction. That is, the more important the document, the more keywords are extracted (the first predetermined number), and the smaller the block size (the second predetermined length) and the shift step (the third predetermined length) used during partitioning.
This concludes the steps for extracting data features from a document: by method 200, the first data fingerprint and the second data fingerprint are extracted from the document as its data features. Correspondingly, Figs. 3A and 3B are schematic diagrams of a device 300, according to embodiments of the present invention, that implements method 200 by extracting the first data feature and the second data feature, respectively, from a document.
As shown in Fig. 3A, the feature extraction device 300 includes a word segmentation module 310, a calculation module 320, a selection module 330 and a feature extraction module 340.
The word segmentation module 310 is adapted to perform word segmentation on the document to obtain a word sequence. According to an embodiment of the present invention, a dictionary-based segmentation algorithm (for example, MMSEG) is used to segment the document.
The calculation module 320 is adapted to calculate, for each word in the word sequence, a feature value characterizing the importance of the word in the document (for example, a TF-IDF value).
The selection module 330 is adapted to select a first predetermined number of words from the word sequence based on the feature values, for example the 5 words with the highest feature values, and to pass them to the calculation module 320 coupled to it, which calculates the data feature string corresponding to each selected word. Optionally, the calculation module 320 is adapted to hash each word, using a common hash algorithm, to a numeric string of a first predetermined length, which serves as the data feature string of that word.
The feature extraction module 340 is adapted to construct the first data fingerprint of the document from the calculated data feature strings, as the data feature of the document. Optionally, the feature extraction module 340 is adapted to combine the obtained data feature strings to obtain the first data fingerprint as the data feature of the document.
The calculation of the feature value characterizing the importance of a word in the document (taking the TF-IDF value as an example) has been described in detail in connection with the steps of Fig. 2A and is not repeated here.
According to one implementation, the feature extraction device 300 is further adapted to extract the second data fingerprint of the document, as shown in Fig. 3B. In this case, in addition to the calculation module 320 and the feature extraction module 340, the feature extraction device 300 further includes a partitioning module 350, where the partitioning module 350 is coupled to the calculation module 320 and the calculation module 320 is coupled to the feature extraction module 340.
The partitioning module 350 is adapted to partition the data in the document sequentially to obtain one or more data blocks of the second predetermined length, with adjacent data blocks overlapping by the third predetermined length. The calculation module 320 is further adapted to calculate, for each of the data blocks obtained, the data feature string of that data block from its data content. The feature extraction module 340 is further adapted to combine the data feature strings of the data blocks to construct the second data fingerprint of the document, as the data feature of the document.
In line with the steps described above for calculating the data feature string of each data block, the calculation module 320 further includes a blocking unit 322 and a computing unit 324, as shown in Fig. 3B.
The blocking unit 322 is adapted to successively select data sub-blocks of the fourth predetermined length from the data block, with adjacent data sub-blocks overlapping by the fifth predetermined length.
The computing unit 324 is adapted to calculate the feature value list of the sixth predetermined length from the content of each data sub-block. Optionally, the computing unit may include an extraction subunit adapted to extract one or more content subsets, each made up of part of the content of the data sub-block. The computing unit then uses a hash algorithm to hash each content subset to a value between 0 and the sixth predetermined length, and sets the corresponding value in the feature value list of the sixth predetermined length according to the value obtained for each content subset. The computing unit may also include a counting subunit, adapted to merge the feature value lists of the data sub-blocks by adding up the values at corresponding positions so as to obtain the feature value list of the data block, and a binarization subunit, adapted to binarize the value of each cell in that feature value list to obtain a feature value list in which every cell is 0 or 1. The computing unit is further adapted to convert this feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length, which serves as the data feature string of the data block.
The binarization subunit is adapted to calculate the average of all cell values in the feature value list and to compare the value of each cell with that average: if the value of a cell is greater than the average, the cell is set to 1; otherwise, the cell is set to 0.
According to one embodiment, the second predetermined length is 512 bytes, the third predetermined length is 256 bytes, the fourth predetermined length is 5 bytes, the fifth predetermined length is 1 byte, and the sixth predetermined length is 32 bytes.
As described for method 200, the first predetermined number, the second predetermined length and the third predetermined length can be set according to the importance level of the document, so as to vary the granularity of feature extraction. That is, the more important the document, the more keywords are extracted (the first predetermined number), and the smaller the block size (the second predetermined length) and the shift step (the third predetermined length) used during partitioning.
Optionally, the feature extraction module 340 is further adapted to extract the offset position in the document of each of the first predetermined number of words, for inclusion in the first data fingerprint, and to extract the offset position of each data block in the document, for inclusion in the second data fingerprint.
In summary, this scheme extracts the data features of a document by automatically extracting document keywords and by extracting data fingerprints of data blocks. On the one hand, the keywords of the document are selected by calculating a feature value that characterizes word importance, so there is no reliance on manually set keywords. On the other hand, the document is partitioned into blocks and a data fingerprint is calculated for each block using a locality-sensitive hashing (LSH) algorithm, which effectively prevents leakage of similar data and preserves the accuracy of feature extraction even when the document is very large.
Fig. 4 shows a flowchart of a judgment method 400, according to an embodiment of the present invention, for determining whether a first document and a second document are related.
As shown in Fig. 4, the method starts at step S410, in which the steps of method 200 are executed on the first document and its data features are extracted to obtain a first feature set, where the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document.
Then, in step S420, the steps of method 200 are likewise executed on the second document and its data features are extracted to obtain a second feature set, where the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document.
Then, in step S430, the similarity between the first feature set and the second feature set is calculated; if the similarity reaches a predetermined range, the first document and the second document are considered related.
In step S430, the process of matching features and calculating the similarity between documents can be further divided into the following 3 modes.
◆ the first is single matching:
Calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set.
Here, the Hamming distance refers to the number of corresponding binary digits that differ between two data feature strings of equal length. Let d(x, y) denote the Hamming distance between two data feature strings x and y: XOR the two data feature strings and count the number of 1s in the result; the value obtained is the Hamming distance. When the Hamming distance is greater than a first threshold, the two data feature strings are judged to be similar. For example:
Hamming distance between 1011101 and 1001001 is 2;
Hamming distance between " toned " and " roses " is 3.
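Both examples can be checked with a few lines of code. The sketch below gives a position-wise version for arbitrary equal-length strings and a bit-level version (XOR, then count the 1 bits) for hex-encoded feature strings; the function names are hypothetical.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("inputs must have equal length")
    return sum(a != b for a, b in zip(x, y))


def hamming_distance_hex(x_hex: str, y_hex: str) -> int:
    """Bit-level distance between two hex strings: XOR, then count the 1 bits."""
    if len(x_hex) != len(y_hex):
        raise ValueError("inputs must have equal length")
    return bin(int(x_hex, 16) ^ int(y_hex, 16)).count("1")


print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("toned", "roses"))      # 3
```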
For single matching, the Hamming distance between the data feature strings in the second data fingerprint of the first document and those in the second data fingerprint of the second document is calculated. When any data feature string in the second data fingerprint is judged similar, the similarity between the first feature set and the second feature set is considered to reach the predetermined range. In other words, as long as any one data block is similar, the document is suspected of leaking data.
The following pseudocode performs single matching on the second data fingerprints of two documents and returns whether they are related. signature_base and signature_cmp represent the first feature set and the second feature set, respectively, and nilsima_base and nilsima_cmp denote the data feature strings in the second data fingerprints of the two documents:
for nilsima_base in signature_base
    for nilsima_cmp in signature_cmp
        ham_dist = hamming_distance(nilsima_base, nilsima_cmp)
        if (ham_dist > threshold)
        {
            return 1
        }
return 0
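A runnable counterpart of the single-matching pseudocode might look like the sketch below. It assumes the feature strings are hex-encoded, as in the LSH example, and copies the `>` comparison directly from the pseudocode; a deployment that scores similarity by matching bits rather than differing bits would flip that comparison. The names `bit_distance` and `single_match` are hypothetical.

```python
def bit_distance(x_hex: str, y_hex: str) -> int:
    """Hamming distance between two hex-encoded feature strings."""
    return bin(int(x_hex, 16) ^ int(y_hex, 16)).count("1")


def single_match(signature_base, signature_cmp, threshold) -> bool:
    """Return True as soon as any pair of feature strings is judged similar.

    The `>` test mirrors the pseudocode; it is not an independent choice.
    """
    return any(
        bit_distance(base, cmp) > threshold
        for base in signature_base
        for cmp in signature_cmp
    )
```

Using `any` over a generator stops at the first qualifying pair, which corresponds to the early `return 1` in the pseudocode.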
◆ second is benchmark matching:
As before, first calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set, to determine whether the two data feature strings are similar. Then count the number of data feature strings in the first feature set that are similar to data feature strings in the second feature set, and calculate the ratio of that number to the total number of data feature strings in the first feature set. If the ratio reaches a second threshold, the similarity between the first feature set and the second feature set is considered to reach the predetermined range.
The following pseudocode performs benchmark matching on the first data fingerprints of two documents and returns whether they are related. word_hash_base and word_hash_cmp denote the data feature strings in the first data fingerprints of the two documents, and keyword_base_num denotes the total number of data feature strings in the first feature set.
keyword_base_num = signature_base.keyword_num
simlar_num = 0
for word_hash_base in signature_base
    for word_hash_cmp in signature_cmp
        if (word_hash_cmp == word_hash_base)
        {
            simlar_num += 1
        }
return (simlar_num / keyword_base_num)
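The benchmark-matching pseudocode can be turned into a small runnable sketch. Two choices here are assumptions rather than part of the source: each base feature string is counted at most once even if it occurs several times in the compared set (the nested loop in the pseudocode would count duplicates repeatedly), and "reaches the second threshold" is read as greater than or equal. The function names are hypothetical.

```python
def benchmark_ratio(base_hashes, cmp_hashes) -> float:
    """Fraction of the base feature strings that also occur in the compared set."""
    if not base_hashes:
        return 0.0
    cmp_set = set(cmp_hashes)
    matched = sum(1 for h in base_hashes if h in cmp_set)
    return matched / len(base_hashes)


def benchmark_match(base_hashes, cmp_hashes, second_threshold: float) -> bool:
    """Related when the matched fraction reaches the second threshold."""
    return benchmark_ratio(base_hashes, cmp_hashes) >= second_threshold
```

The ratio is taken relative to the first (base) feature set only, so the measure is asymmetric: a short base document fully contained in a long compared document scores 1.0.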
◆ the third is whole matching:
Calculate the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set. When the Jaccard coefficient reaches a third threshold, the similarity between the first feature set and the second feature set is considered to reach the predetermined range, and the first document and the second document are related.
The Jaccard coefficient is the ratio of the size of the intersection of two sets to the size of their union:
Jaccard = |S ∩ T| / |S ∪ T|,
where S denotes the first feature set and T denotes the second feature set.
The following pseudocode performs whole matching on the first data fingerprints of two documents and returns whether they are related. keyword_base_num and keyword_cmp_num denote the total number of data feature strings in each of the two feature sets, and word_hash_base and word_hash_cmp denote the data feature strings in the first data fingerprints of the two documents.
keyword_base_num = signature_base.keyword_num
keyword_cmp_num = signature_cmp.keyword_num
simlar_num = 0
for word_hash_base in signature_base
    for word_hash_cmp in signature_cmp
        if (word_hash_cmp == word_hash_base)
        {
            simlar_num += 1
        }
return (simlar_num / (keyword_base_num + keyword_cmp_num - simlar_num))
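The pseudocode computes |S ∩ T| / (|S| + |T| - |S ∩ T|), which equals the Jaccard coefficient when neither feature set contains duplicates. A minimal set-based sketch follows; returning 0.0 for two empty sets is an assumption, and the function name is hypothetical.

```python
def jaccard(base_hashes, cmp_hashes) -> float:
    """Jaccard coefficient |S ∩ T| / |S ∪ T| of two collections of feature strings."""
    s, t = set(base_hashes), set(cmp_hashes)
    union = s | t
    if not union:
        return 0.0  # assumption: two empty feature sets are treated as unrelated
    return len(s & t) / len(union)


print(jaccard(["a", "b", "c"], ["b", "c", "d"]))  # 0.5
```

Unlike benchmark matching, this measure is symmetric in the two documents, so a short excerpt copied into a much longer document yields a low score.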
Method 400 provides 3 modes for matching similar content in two documents; optionally, the similarity between documents can be characterized by the Hamming distance or by the Jaccard coefficient. In this way, sensitive data can be matched more comprehensively, providing a strong guarantee against sensitive data leakage.
Correspondingly, Fig. 5 shows a schematic diagram of a judgment device 500, according to an embodiment of the present invention, that implements method 400 to determine whether a first document and a second document are related. As shown in Fig. 5, the document relevance judgment device 500 includes a feature extraction device 300, a similarity calculation module 510 and a similarity judgment module 520, where the similarity calculation module 510 is coupled to the feature extraction device 300 and to the similarity judgment module 520.
The feature extraction device 300 is adapted to extract the first feature set of the first document and the second feature set of the second document, where the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document, and the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document.
The similarity calculation module 510 is adapted to calculate the similarity between the first feature set and the second feature set.
The similarity judgment module 520 is adapted to consider the first document and the second document related when the similarity is judged to reach the predetermined range.
According to an embodiment of the present invention, the similarity calculation module 510 further includes a similarity calculation unit for calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set.
The similarity judgment module 520 is further adapted to determine that two corresponding data fingerprints are similar when the Hamming distance is greater than the first threshold. Specifically, in single matching mode, the similarity judgment module 520 is adapted to consider that the similarity between the first feature set and the second feature set reaches the predetermined range when any data feature string in the second data fingerprint is judged similar.
In benchmark matching mode, the similarity judgment module 520 may further include a statistics unit for counting the number of data feature strings in the first feature set that are similar to data feature strings in the second feature set and for calculating the ratio of that number to the total number of data feature strings in the first feature set. The similarity judgment module is further adapted to consider that the similarity between the first feature set and the second feature set reaches the predetermined range when the ratio reaches the second threshold.
According to yet another embodiment of the present invention, in whole matching mode, the similarity calculation unit is further adapted to calculate the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set. When the Jaccard coefficient reaches the third threshold, the similarity judgment module 520 determines that the similarity between the first feature set and the second feature set reaches the predetermined range. The Jaccard coefficient characterizes the degree of correlation between two sets:
Jaccard = |S ∩ T| / |S ∪ T|,
where S denotes the first feature set and T denotes the second feature set.
Fig. 6 shows a flowchart of a method 600, according to an embodiment of the present invention, for determining whether a suspicious document contains sensitive content. As shown in Fig. 6, the method starts at step S610, in which the steps of method 200 are executed on the secure documents, their data features are extracted, and a feature library is established, where the feature library contains the first data fingerprints and second data fingerprints of all secure documents.
Then, in step S620, the steps of method 400 are executed on the suspicious document. While method 400 is executed, the data features of the suspicious document are extracted as the second feature set and the feature library obtained in the previous step is used as the first feature set; that is, it is determined whether the suspicious document is related to the secure documents.
Then, in step S630, if the suspicious document is judged to be related to the secure documents, the suspicious document is considered to contain sensitive content; if the suspicious document is judged to be unrelated to the secure documents, the suspicious document is considered not to contain sensitive content.
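Steps S620 and S630 amount to checking the suspicious document against every entry in the feature library. The sketch below assumes the library is held as one feature set per secure document and uses whole matching with a hypothetical third threshold of 0.5 as the default relatedness test; any of the three matching modes could be passed in instead. All names are illustrative.

```python
from typing import Callable, Iterable, Sequence

FeatureSet = Sequence[str]


def jaccard_related(base: FeatureSet, cmp: FeatureSet,
                    third_threshold: float = 0.5) -> bool:
    """Whole matching: related when |S ∩ T| / |S ∪ T| reaches the threshold."""
    s, t = set(base), set(cmp)
    union = s | t
    return bool(union) and len(s & t) / len(union) >= third_threshold


def contains_sensitive_content(
    suspicious: FeatureSet,
    feature_db: Iterable[FeatureSet],
    is_related: Callable[[FeatureSet, FeatureSet], bool] = jaccard_related,
) -> bool:
    """Sensitive if the suspicious document is related to any secure document."""
    return any(is_related(secure, suspicious) for secure in feature_db)


feature_db = [["a1", "b2", "c3"], ["d4", "e5"]]   # placeholder feature strings
print(contains_sensitive_content(["a1", "b2", "zz"], feature_db))  # True
print(contains_sensitive_content(["zz"], feature_db))              # False
```

Passing the matching function as a parameter keeps the library lookup independent of which matching mode is chosen.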
Correspondingly, Fig. 7 shows a schematic diagram of a device, according to an embodiment of the present invention, that implements method 600 to determine whether a suspicious document contains sensitive content, namely the sensitive content judgment device 700 described with reference to Fig. 1. The device 700 includes a feature extraction device 300, a storage module 710, a document relevance judgment device 500 and a determination module 720. According to an embodiment of the present invention, the feature extraction device 300 may also be arranged inside the document relevance judgment device 500.
As described above, the feature extraction device 300 is adapted to extract data features from the secure documents and is further adapted to extract the data features of the suspicious document as the second feature set.
The storage module 710 is adapted to store the data features of the secure documents as a feature library, where the feature library contains the first data fingerprints and second data fingerprints of the secure documents.
The document relevance judgment device 500 is adapted to determine whether the suspicious document is related to the secure documents in the feature library; and
the determination module 720 is adapted to determine that the suspicious document contains sensitive content when the suspicious document is judged to be related to the secure documents, and to determine that the suspicious document does not contain sensitive content when it is judged to be unrelated to the secure documents.
In summary, with the method and system for leakage prevention according to the present invention, the document feature extraction method provided can extract the data features of a document more conveniently while capturing as much data feature information as possible. In addition, the 3 matching modes (single matching, benchmark matching and whole matching) match sensitive data comprehensively and can effectively guard against a variety of document leakage techniques.
It should be appreciated that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or further divided into multiple submodules.
Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into multiple submodules, subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
A5. The method of any one of A1-4, wherein the step of constructing the first data fingerprint of the document from the selected first predetermined number of words, as the data feature of the document, includes: for each of the first predetermined number of words, hashing the word by a common hash algorithm to a numeric string of a first predetermined length as the data feature string of the word; and combining the obtained data feature strings to obtain the first data fingerprint as the data feature of the document.
A6. The method of any one of A1-5, wherein the step of performing word segmentation on the document includes: performing word segmentation with a dictionary-based segmentation algorithm, wherein the segmentation algorithm includes one dictionary, two matching algorithms and four disambiguation rules.
A7. The method of any one of A2-6, wherein the step of calculating the data feature string of a data block from the data content of the data block includes: successively selecting data sub-blocks of a fourth predetermined length from the data block, with adjacent data sub-blocks overlapping by a fifth predetermined length; for each data sub-block, calculating a feature value list of a sixth predetermined length from the content of the data sub-block; and constructing the data feature string of the data block from the feature value lists of all data sub-blocks.
A8. The method of A7, wherein the step of calculating the feature value list of the sixth predetermined length from the content of the data sub-block includes: extracting one or more content subsets, each made up of part of the content of the data sub-block; hashing each content subset, using a hash algorithm, to a value between 0 and the sixth predetermined length; and setting the corresponding value in the feature value list of the sixth predetermined length according to the value obtained for each content subset.
A9. The method of A8, wherein the step of constructing the data feature string of the data block from the feature value lists of all data sub-blocks includes: merging the feature value lists of the data sub-blocks by adding up the values at corresponding positions, to obtain the feature value list of the data block; binarizing the value of each cell in that feature value list to obtain a feature value list in which every cell is 0 or 1; and converting the feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length as the data feature string of the data block.
A10. The method of A9, wherein the step of binarizing the value of each cell in the feature value list includes: calculating the average of all cell values in the feature value list; comparing the value of each cell with the average; and setting a cell to 1 if its value is greater than the average and to 0 if its value is not greater than the average.
A11. The method of any one of A7-10, wherein the first predetermined number is 5 and the first predetermined length is 32; the second predetermined length is 512 bytes and the third predetermined length is 256 bytes; the fourth predetermined length is 5 bytes and the fifth predetermined length is 1 byte; and the sixth predetermined length is 32 bytes.
A12. The method of any one of A1-11, wherein the first data fingerprint further includes the offset position in the document of each of the first predetermined number of words.
A13. The method of any one of A2-12, wherein the second data fingerprint further includes the offset position of each data block in the document.
B15. The device of B14, further including: a partitioning module, adapted to partition the data in the document sequentially to obtain one or more data blocks of a second predetermined length, with adjacent data blocks overlapping by a third predetermined length; wherein the calculation module is further adapted to calculate, for each of the data blocks obtained, the data feature string of that data block from its data content; and the feature extraction module is further adapted to combine the data feature strings of the data blocks to construct the second data fingerprint of the document as the data feature of the document.
B16. The device of B14, wherein the calculation module is further adapted to: calculate the frequency of occurrence of a word in the document as the term frequency of the word; calculate the ratio of the total number of documents in a document library to the number of documents in the library that contain the word, as the inverse document frequency of the word; and calculate, from the term frequency and the inverse document frequency of the word, the feature value characterizing the importance of the word in the document.
B17. The device of B16, wherein the feature value TF-IDF characterizing the importance of a word in the document is defined as TF-IDF = TF × IDF, where TF is the term frequency of the word and IDF is the inverse document frequency of the word, and TF and IDF are respectively as follows:
B18. The device of B17, wherein the selection module is further adapted to select the first predetermined number of words in descending order of the calculated TF-IDF values.
B19. The device of any one of B14-18, wherein the calculation module is further adapted to hash each of the selected first predetermined number of words, by a common hash algorithm, to a numeric string of a first predetermined length as the data feature string of the word; and the feature extraction module is further adapted to combine the obtained data feature strings to obtain the first data fingerprint as the data feature of the document.
B20. The device of any one of B14-19, wherein the word segmentation module is further adapted to perform word segmentation with a dictionary-based segmentation algorithm, wherein the segmentation algorithm includes one dictionary, two matching algorithms and four disambiguation rules.
B21. The device of any one of B15-20, wherein the calculation module further includes: a blocking unit, adapted to successively select data sub-blocks of a fourth predetermined length from the data block, with adjacent data sub-blocks overlapping by a fifth predetermined length; and a computing unit, adapted to calculate, for each data sub-block, a feature value list of a sixth predetermined length from the content of the data sub-block, and to construct the data feature string of the data block from the feature value lists of all data sub-blocks.
B22. The device of B21, wherein the computing unit further includes: an extraction subunit, adapted to extract one or more content subsets, each made up of part of the content of the data sub-block; and the computing unit is further adapted to hash each content subset, using a hash algorithm, to a value between 0 and the sixth predetermined length, and to set the corresponding value in the feature value list of the sixth predetermined length according to the value obtained for each content subset.
B23. The device of B22, wherein the computing unit further includes: a counting subunit, adapted to merge the feature value lists of the data sub-blocks by adding up the values at corresponding positions, to obtain the feature value list of the data block; and a binarization subunit, adapted to binarize the value of each cell in that feature value list to obtain a feature value list in which every cell is 0 or 1; and the computing unit is further adapted to convert the feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length as the data feature string of the data block.
B24. The device of B23, wherein the binarization subunit is further adapted to calculate the average of all cell values in the feature value list and to compare the value of each cell with the average, setting a cell to 1 if its value is greater than the average and to 0 if its value is not greater than the average.
B25. The device of any one of B15-24, wherein the first predetermined number is 5 and the first predetermined length is 32; the second predetermined length is 512 bytes and the third predetermined length is 256 bytes; the fourth predetermined length is 5 bytes and the fifth predetermined length is 1 byte; and the sixth predetermined length is 32 bytes.
B26. The device of any one of B14-25, wherein the feature extraction module is further adapted to extract the offset position in the document of each of the first predetermined number of words, for inclusion in the first data fingerprint.
B27. The device of any one of B15-26, wherein the feature extraction module is further adapted to extract the offset position of each data block in the document, for inclusion in the second data fingerprint.
C29. The judgment method of C28, wherein the step of calculating the similarity between the first feature set and the second feature set includes: calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and when the Hamming distance is greater than a first threshold, determining that the two corresponding data feature strings are similar.
C30. The method of C29, further comprising the step of: when any one data feature string in the second data fingerprint is judged to be similar, considering that the similarity between the first feature set and the second feature set reaches the preset range.
C31. The method of C29, further comprising the steps of: counting the number of data feature strings in the first feature set that are similar to the second feature set; calculating the ratio of that number to the total number of data feature strings in the first feature set; and if the ratio reaches a second threshold, considering that the similarity between the first feature set and the second feature set reaches the preset range.
C32. The method of C28, wherein the step of calculating the similarity between the first feature set and the second feature set further includes: calculating the Jaccard coefficient of all data feature strings in the first feature set and all data feature strings in the second feature set; and when the Jaccard coefficient reaches a third threshold, considering that the similarity between the first feature set and the second feature set reaches the preset range.
C33. The method of C32, wherein the Jaccard coefficient is: Jaccard = |S ∩ T| / |S ∪ T|, where S denotes the first feature set and T denotes the second feature set.
D35. The judgment equipment of D34, wherein the similarity calculation module further includes: a similarity calculation unit, adapted to calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and the similarity judgment module is further adapted to determine that two corresponding data feature strings are similar when the Hamming distance is greater than a first threshold.
D36. The judgment equipment of D35, wherein the similarity judgment module is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when a data feature string in the second data fingerprint is judged to be similar.
D37. The judgment equipment of D35, wherein the similarity judgment module further includes: a statistics unit, adapted to count the number of data feature strings in the first feature set that are similar to the second feature set, and to calculate the ratio of that number to the total number of data feature strings in the first feature set; and the similarity judgment module is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when the ratio reaches a second threshold.
D38. The judgment equipment of D35, wherein the similarity calculation unit is further adapted to calculate the Jaccard coefficient of all data feature strings in the first feature set and all data feature strings in the second feature set; and the similarity judgment module is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when the Jaccard coefficient reaches a third threshold.
D39. The judgment equipment of D38, wherein the Jaccard coefficient is: Jaccard = |S ∩ T| / |S ∪ T|, where S denotes the first feature set and T denotes the second feature set.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
In addition, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of performing the function. Thus, a processor with the necessary instructions for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal numbers "first", "second", "third", etc. to describe an ordinary object merely indicates that different instances of similar objects are being referred to, and is not intended to imply that the objects so described must have a given order, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described in terms of a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be envisaged within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected primarily for readability and instructional purposes, rather than to explain or limit the subject matter of the invention. Therefore, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (40)

1. A method for extracting data features from a document, comprising the steps of:
performing word segmentation on the document to obtain a word sequence;
for each word in the word sequence, calculating a feature value characterizing the importance of the word in the document, and selecting a first predetermined number of words from the word sequence based on the feature value; and
for each of the selected first predetermined number of words, calculating the data feature string corresponding to the word, and constructing, based on the data feature strings, a first data fingerprint of the document as a data feature of the document;
dividing the data in the document into blocks in sequence, to obtain one or more data blocks of a second predetermined length, wherein adjacent data blocks overlap each other by a third predetermined length;
for the one or more obtained data blocks, calculating the data feature string of each data block based on the data content in that data block; and
combining the data feature strings of the data blocks to construct a second data fingerprint of the document as a data feature of the document.
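The overlapping blocking step of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_blocks` is hypothetical, the default sizes are taken from the values given later in the claims (512-byte blocks overlapping by 256 bytes), and the handling of a final block shorter than the block length is an assumption, since the claims do not address it.

```python
def split_blocks(data: bytes, block_len: int = 512, overlap: int = 256) -> list[tuple[int, bytes]]:
    """Return (offset, block) pairs; adjacent blocks share `overlap` bytes."""
    if not 0 <= overlap < block_len:
        raise ValueError("overlap must be in [0, block_len)")
    if not data:
        return []
    step = block_len - overlap  # distance between the starts of adjacent blocks
    blocks = []
    offset = 0
    while True:
        # Assumption: the last block is kept even if it is shorter than block_len.
        blocks.append((offset, data[offset:offset + block_len]))
        if offset + block_len >= len(data):
            break
        offset += step
    return blocks
```

Returning the offset alongside each block keeps the position information that the later claims allow to be stored in the second data fingerprint.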
2. The method of claim 1, wherein the step of calculating the feature value characterizing the importance of the word in the document includes:
calculating the frequency of occurrence of the word in the document as the word frequency of the word;
calculating the ratio of the total number of documents in a document library to the number of documents in the document library that contain the word, as the inverse document frequency of the word; and
calculating the feature value characterizing the importance of the word in the document from the word frequency and the inverse document frequency of the word.
3. The method of claim 2, wherein the feature value TF-IDF characterizing the importance of the word in the document is defined as:
TF-IDF = TF × IDF,
where TF is the word frequency of the word, IDF is the inverse document frequency of the word, and TF and IDF are respectively as follows:
and
the step of selecting the first predetermined number of words from the word sequence based on the feature value includes:
selecting the first predetermined number of words in descending order of the calculated TF-IDF values.
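The word selection of claims 2 and 3 can be sketched as follows. The exact TF and IDF formulas are not reproduced in this text, so the sketch assumes the common forms: TF as the word's count divided by the document's word count, and IDF as the natural logarithm of the document ratio described in claim 2. The names `tf_idf_scores` and `top_words` are hypothetical, and the document library is modeled as a list of word sets.

```python
import math
from collections import Counter

def tf_idf_scores(doc_words: list[str], library: list[set[str]]) -> dict[str, float]:
    """Score each distinct word of a segmented document by TF x IDF."""
    counts = Counter(doc_words)
    total_words = len(doc_words)
    scores = {}
    for word, count in counts.items():
        tf = count / total_words
        containing = sum(1 for doc in library if word in doc)
        # Assumption: log of (total documents / documents containing the word);
        # max(..., 1) avoids division by zero for words absent from the library.
        idf = math.log(len(library) / max(containing, 1))
        scores[word] = tf * idf
    return scores

def top_words(doc_words: list[str], library: list[set[str]], n: int = 5) -> list[str]:
    """Pick the n words with the highest TF-IDF, highest first."""
    scores = tf_idf_scores(doc_words, library)
    return sorted(scores, key=lambda w: -scores[w])[:n]
```

A word that appears in every document of the library gets an IDF of zero under this assumption, so it is never preferred over rarer words.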
4. The method of claim 1, wherein the step of constructing the first data fingerprint of the document, as a data feature of the document, based on the selected first predetermined number of words includes:
for each of the first predetermined number of words, hashing the word with a common hash algorithm into a numeric string of a first predetermined length, as the data feature string corresponding to the word; and
combining the obtained data feature strings to obtain the first data fingerprint as a data feature of the document.
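The hashing step of claim 4 can be sketched as follows. The claim only calls for "a common hash algorithm"; the choice of SHA-256, the truncation to the leading bits, and the reading of the first predetermined length (32) as a bit count are all assumptions made for illustration. The function names are hypothetical.

```python
import hashlib

def word_feature_string(word: str, bits: int = 32) -> str:
    """Hash a word to a fixed-length string of '0'/'1' characters."""
    if not 1 <= bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    # Keep only the leading `bits` bits of the 256-bit digest.
    value = int.from_bytes(digest, "big") >> (256 - bits)
    return format(value, f"0{bits}b")

def first_fingerprint(words: list[str], bits: int = 32) -> list[str]:
    """Combine the per-word feature strings into the first data fingerprint."""
    return [word_feature_string(w, bits) for w in words]
```

Keeping the feature strings as a list rather than concatenating them preserves the per-word boundaries needed for a later pairwise comparison.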
5. The method of claim 1, wherein the step of performing word segmentation on the document includes:
performing word segmentation using a dictionary-based segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms and four disambiguation rules.
6. The method of claim 4, wherein the step of calculating the data feature string of a data block based on the data content in the data block includes:
successively selecting data sub-blocks of a fourth predetermined length in the data block, wherein adjacent data sub-blocks overlap each other by a fifth predetermined length;
for each data sub-block, calculating a feature value list of a sixth predetermined length according to the content of the data sub-block; and
constructing the data feature string of the data block based on the feature value lists of all the data sub-blocks.
7. The method of claim 6, wherein the step of calculating the feature value list of the sixth predetermined length according to the content of a data sub-block includes:
extracting one or more content subsets, each composed of part of the content of the data sub-block;
hashing each content subset, using a hash algorithm, to a value between 0 and the sixth predetermined length; and
setting the corresponding value in the feature value list of the sixth predetermined length according to the value corresponding to each content subset.
8. The method of claim 7, wherein the step of constructing the data feature string of the data block based on the feature value lists of all the data sub-blocks includes:
superposing and merging the values at corresponding positions in the feature value lists of the data sub-blocks, to obtain the feature value list of the data block;
binarizing the value of each cell in that feature value list, to obtain a feature value list in which each cell value is 0 or 1; and
converting the feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length, as the data feature string of the data block.
9. The method of claim 8, wherein the step of binarizing the value of each cell in the feature value list includes:
calculating the average of all cell values in the feature value list;
comparing the value of each cell with the average; and
setting the value of a cell to 1 if it is greater than the average, and to 0 if it is not greater than the average.
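Claims 6 to 9 together describe how one data block becomes a data feature string; the steps can be sketched as follows. Several details are assumptions made for illustration because the claims leave them open: the content subsets are taken to be all 3-byte combinations of a sub-block, the hash is CRC-32 reduced modulo the list length, "setting" a cell is read as incrementing a counter, and the feature value list is given one cell per position (32 cells). All function names are hypothetical.

```python
import zlib
from itertools import combinations

def sub_blocks(block: bytes, sub_len: int = 5, overlap: int = 1) -> list[bytes]:
    """Successive sub-blocks; adjacent sub-blocks share `overlap` bytes."""
    step = sub_len - overlap
    last_start = max(len(block) - sub_len, 0)
    return [block[i:i + sub_len] for i in range(0, last_start + 1, step)]

def sub_block_features(sub: bytes, list_len: int = 32) -> list[int]:
    """Feature value list for one sub-block (assumed: counts per hash bucket)."""
    cells = [0] * list_len
    for subset in combinations(sub, 3):  # assumed choice of content subsets
        cells[zlib.crc32(bytes(subset)) % list_len] += 1
    return cells

def block_feature_string(block: bytes, list_len: int = 32) -> str:
    """Superpose the sub-block lists, binarize by the mean, emit a 0/1 string."""
    totals = [0] * list_len
    for sub in sub_blocks(block):
        for i, value in enumerate(sub_block_features(sub, list_len)):
            totals[i] += value
    mean = sum(totals) / list_len
    # Claim 9: 1 if the cell is greater than the average, otherwise 0.
    return "".join("1" if value > mean else "0" for value in totals)
```

Because each bit depends on aggregate counts over the whole block rather than on exact byte positions, small edits to a block tend to flip only a few bits, which is what makes a bitwise comparison of two feature strings meaningful.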
10. The method of any one of claims 6-9, wherein
the first predetermined number is 5, and the first predetermined length is 32;
the second predetermined length is 512 bytes, and the third predetermined length is 256 bytes;
the fourth predetermined length is 5 bytes, and the fifth predetermined length is 1 byte; and
the sixth predetermined length is 32 bytes.
11. The method of claim 1, wherein
the first data fingerprint further includes the offset position information, in the document, of each of the first predetermined number of words.
12. The method of claim 11, wherein
the second data fingerprint further includes the offset position information of the data block in the document.
13. Equipment for extracting data features from a document, the equipment comprising:
a word segmentation module, adapted to perform word segmentation on the document to obtain a word sequence;
a computing module, adapted to calculate, for each word in the word sequence, a feature value characterizing the importance of the word in the document, and further adapted to calculate, for each of the selected first predetermined number of words, the data feature string corresponding to the word;
a selection module, adapted to select the first predetermined number of words from the word sequence based on the feature value;
a feature extraction module, adapted to construct, based on the data feature strings, a first data fingerprint of the document as a data feature of the document; and
a blocking module, adapted to divide the data in the document into blocks in sequence, to obtain one or more data blocks of a second predetermined length, wherein adjacent data blocks overlap each other by a third predetermined length; wherein
the computing module is further adapted to calculate, for the one or more obtained data blocks, the data feature string of each data block based on the data content in that data block; and
the feature extraction module is further adapted to combine the data feature strings of the data blocks to construct a second data fingerprint of the document as a data feature of the document.
14. The equipment of claim 13, wherein the computing module is further adapted to:
calculate the frequency of occurrence of the word in the document as the word frequency of the word;
calculate the ratio of the total number of documents in a document library to the number of documents in the document library that contain the word, as the inverse document frequency of the word; and
calculate the feature value characterizing the importance of the word in the document from the word frequency and the inverse document frequency of the word.
15. The equipment of claim 14, wherein the feature value TF-IDF characterizing the importance of the word in the document is defined as:
TF-IDF = TF × IDF,
where TF is the word frequency of the word, IDF is the inverse document frequency of the word, and TF and IDF are respectively as follows:
16. The equipment of claim 15, wherein the selection module is further adapted to select the first predetermined number of words in descending order of the calculated TF-IDF values.
17. The equipment of claim 13, wherein
the computing module is further adapted to hash each of the selected first predetermined number of words, with a common hash algorithm, into a numeric string of a first predetermined length, as the data feature string corresponding to the word; and
the feature extraction module is further adapted to combine the obtained data feature strings to obtain the first data fingerprint as a data feature of the document.
18. The equipment of claim 13, wherein the word segmentation module is further adapted to perform word segmentation with a dictionary-based segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms and four disambiguation rules.
19. The equipment of claim 17, wherein the computing module further includes:
a blocking unit, adapted to successively select data sub-blocks of a fourth predetermined length in the data block, wherein adjacent data sub-blocks overlap each other by a fifth predetermined length; and
a computing unit, adapted to calculate, for each data sub-block, a feature value list of a sixth predetermined length according to the content of the data sub-block, and to construct the data feature string of the data block based on the feature value lists of all the data sub-blocks.
20. The equipment of claim 19, wherein the computing unit further includes:
an extraction subunit, adapted to extract one or more content subsets, each composed of part of the content of the data sub-block; and
the computing unit is further adapted to hash each content subset, using a hash algorithm, to a value between 0 and the sixth predetermined length, and to set the corresponding value in the feature value list of the sixth predetermined length according to the value corresponding to each content subset.
21. The equipment of claim 20, wherein the computing unit further includes:
a counting subunit, adapted to superpose and merge the values at corresponding positions in the feature value lists of the data sub-blocks, to obtain the feature value list of the data block; and
a binarization subunit, adapted to binarize the value of each cell in that feature value list, to obtain a feature value list in which each cell value is 0 or 1; and
the computing unit is further adapted to convert the feature value list of the sixth predetermined length into a numeric string of the sixth predetermined length, as the data feature string of the data block.
22. The equipment of claim 21, wherein
the binarization subunit is further adapted to calculate the average of all cell values in the feature value list and to compare the value of each cell with the average; if the value of a cell is greater than the average, the value of the cell is 1, and if the value of a cell is not greater than the average, the value of the cell is 0.
23. The equipment of any one of claims 19-22, wherein
the first predetermined number is 5, and the first predetermined length is 32;
the second predetermined length is 512 bytes, and the third predetermined length is 256 bytes;
the fourth predetermined length is 5 bytes, and the fifth predetermined length is 1 byte; and
the sixth predetermined length is 32 bytes.
24. The equipment of claim 13, wherein
the feature extraction module is further adapted to extract the offset position information, in the document, of each of the first predetermined number of words, to be included in the first data fingerprint.
25. The equipment of claim 24, wherein
the feature extraction module is further adapted to extract the offset position information of the data block in the document, to be included in the second data fingerprint.
26. A judgment method for judging whether a first document and a second document are related, the method comprising the steps of:
performing the method of any one of claims 1-12 on the first document to extract the data features of the document and obtain a first feature set, wherein the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document;
performing the method of any one of claims 1-12 on the second document to extract the data features of the document and obtain a second feature set, wherein the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document; and
calculating the similarity between the first feature set and the second feature set, and if the similarity reaches a preset range, considering that the first document and the second document are related.
27. The judgment method of claim 26, wherein the step of calculating the similarity between the first feature set and the second feature set includes:
calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and
when the Hamming distance is greater than a first threshold, determining that the two corresponding data feature strings are similar.
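The distance computation of claim 27 can be sketched as follows, assuming the data feature strings are equal-length strings of '0' and '1' characters. The function name `hamming_distance` is hypothetical. The sketch only computes the distance; the comparison against the first threshold is left to the caller.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length feature strings differ."""
    if len(a) != len(b):
        raise ValueError("feature strings must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

For example, `hamming_distance("1011", "1001")` is 1, because the two strings differ only in the third position.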
28. The method of claim 27, further comprising the step of:
when any one data feature string in the second data fingerprint is judged to be similar, considering that the similarity between the first feature set and the second feature set reaches the preset range.
29. The method of claim 27, further comprising the steps of:
counting the number of data feature strings in the first feature set that are similar to the second feature set;
calculating the ratio of that number to the total number of data feature strings in the first feature set; and
if the ratio reaches a second threshold, considering that the similarity between the first feature set and the second feature set reaches the preset range.
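The ratio test of claim 29 can be sketched as follows. The per-string similarity test is passed in as a function so that the sketch stays independent of how two strings are judged similar. Treating a string of the first set as "similar to the second feature set" when it is similar to at least one string of the second set is an assumption; the function names are hypothetical.

```python
from typing import Callable, Sequence

def similar_ratio(first: Sequence[str], second: Sequence[str],
                  is_similar: Callable[[str, str], bool]) -> float:
    """Fraction of strings in `first` that are similar to some string in `second`."""
    if not first:
        return 0.0
    hits = sum(1 for s in first if any(is_similar(s, t) for t in second))
    return hits / len(first)

def reaches_preset_range(first: Sequence[str], second: Sequence[str],
                         is_similar: Callable[[str, str], bool],
                         second_threshold: float) -> bool:
    """True when the ratio reaches the second threshold."""
    return similar_ratio(first, second, is_similar) >= second_threshold
```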
30. The method of claim 26, wherein the step of calculating the similarity between the first feature set and the second feature set further includes:
calculating the Jaccard coefficient of all data feature strings in the first feature set and all data feature strings in the second feature set; and
when the Jaccard coefficient reaches a third threshold, considering that the similarity between the first feature set and the second feature set reaches the preset range.
31. The method of claim 30, wherein the Jaccard coefficient is:
Jaccard = |S ∩ T| / |S ∪ T|,
where S denotes the first feature set and T denotes the second feature set.
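The Jaccard coefficient of claim 31 can be computed directly on sets of data feature strings, as in the following minimal sketch. The function name `jaccard` is hypothetical, and returning 0.0 when both sets are empty is an assumption, since the formula is undefined in that case.

```python
from typing import Iterable

def jaccard(s: Iterable[str], t: Iterable[str]) -> float:
    """|S ∩ T| / |S ∪ T| for two collections of data feature strings."""
    s_set, t_set = set(s), set(t)
    union = s_set | t_set
    if not union:
        return 0.0  # assumption: two empty sets are treated as not similar
    return len(s_set & t_set) / len(union)
```

For instance, the sets {a, b, c} and {b, c, d} share two of four distinct elements, giving a coefficient of 0.5.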
32. Judgment equipment for judging whether a first document and a second document are related, the equipment comprising:
the equipment for extracting data features from a document of any one of claims 13-25, adapted to extract the first feature set of the first document and the second feature set of the second document respectively, wherein
the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document, and
the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document;
a similarity calculation module, adapted to calculate the similarity between the first feature set and the second feature set; and
a similarity judgment module, adapted to consider that the first document and the second document are related when the similarity is judged to reach a preset range.
33. The judgment equipment of claim 32, wherein the similarity calculation module further includes:
a similarity calculation unit, adapted to calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and
the similarity judgment module is further adapted to determine that two corresponding data feature strings are similar when the Hamming distance is greater than a first threshold.
34. The judgment equipment of claim 33, wherein
the similarity judgment module is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when a data feature string in the second data fingerprint is judged to be similar.
35. The judgment equipment of claim 33, wherein the similarity judgment module further includes:
a statistics unit, adapted to count the number of data feature strings in the first feature set that are similar to the second feature set, and to calculate the ratio of that number to the total number of data feature strings in the first feature set; and
the similarity judgment module is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when the ratio reaches a second threshold.
36. The judgment equipment of claim 33, wherein
the similarity calculation unit is further adapted to calculate the Jaccard coefficient of all data feature strings in the first feature set and all data feature strings in the second feature set; and
the similarity judgment module is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when the Jaccard coefficient reaches a third threshold.
37. The judgment equipment of claim 36, wherein the Jaccard coefficient is:
Jaccard = |S ∩ T| / |S ∪ T|,
where S denotes the first feature set and T denotes the second feature set.
38. A method for judging whether a suspicious document contains sensitive content, the method comprising the steps of:
performing the method of any one of claims 1-12 on protected documents to extract their data features and build a feature database, wherein the feature database includes the first data fingerprint and the second data fingerprint of the protected documents;
performing the judgment method of any one of claims 26-31 on the suspicious document, wherein the data features extracted from the suspicious document serve as the second feature set and the feature database serves as the first feature set;
if the suspicious document is judged to be related to a protected document, considering that the suspicious document contains sensitive content; and
if the suspicious document is judged to be unrelated to the protected documents, considering that the suspicious document does not contain sensitive content.
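The decision flow of claim 38 can be sketched as follows, under simplifying assumptions: the feature database is modeled as a mapping from a protected document's identifier to its set of data feature strings, relatedness is decided with the Jaccard coefficient alone, and the threshold value is arbitrary. The names `matching_protected_docs` and `contains_sensitive_content` are hypothetical.

```python
def matching_protected_docs(suspect: set[str], feature_db: dict[str, set[str]],
                            threshold: float = 0.5) -> list[str]:
    """Identifiers of protected documents whose features are related to the suspect's."""
    matches = []
    for doc_id, features in feature_db.items():
        union = suspect | features
        score = len(suspect & features) / len(union) if union else 0.0
        if score >= threshold:  # assumed relatedness test: Jaccard reaches threshold
            matches.append(doc_id)
    return matches

def contains_sensitive_content(suspect: set[str], feature_db: dict[str, set[str]],
                               threshold: float = 0.5) -> bool:
    """A suspicious document is sensitive if it is related to any protected document."""
    return bool(matching_protected_docs(suspect, feature_db, threshold))
```

Returning the matching identifiers, rather than only a boolean, makes it possible to report which protected document triggered the match.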
39. Equipment for judging whether a suspicious document contains sensitive content, the equipment comprising:
the equipment for extracting data features from a document of any one of claims 13-25, adapted to extract data features from protected documents, and further adapted to extract the data features of the suspicious document as the second feature set;
a storage module, adapted to store the data features of the protected documents as a feature database, wherein the feature database includes the first data fingerprint and the second data fingerprint of the protected documents;
the judgment equipment of any one of claims 32-37, adapted to judge whether the suspicious document is related to a protected document in the feature database; and
a determining module, adapted to determine that the suspicious document contains sensitive content when the suspicious document is judged to be related to a protected document, and to determine that the suspicious document does not contain sensitive content when the suspicious document is judged to be unrelated to the protected documents.
40. A leakage prevention system, comprising:
a computing device, connected to a data security protection device; and
the data security protection device, comprising:
a document acquisition device, adapted to acquire the document content sent by the computing device;
the equipment of claim 39 for judging sensitive content, adapted to judge whether the acquired document contains sensitive content;
a control policy acquisition device, adapted to acquire the control policy corresponding to the process related to the document when judging whether the document contains sensitive content; and
a control device, adapted to control operations on the document according to the acquired control policy when the suspicious document is judged to contain sensitive content.
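The interplay of the control policy acquisition device and the control device in claim 40 can be sketched as a per-process policy lookup. Everything concrete here is hypothetical and chosen only for illustration: the action names, the process names in the table, and the default of blocking when no policy is registered for a process.

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    AUDIT = "audit"
    BLOCK = "block"

# Hypothetical policy table: process name -> action for sensitive documents.
POLICIES = {"mail_client": Action.BLOCK, "browser": Action.AUDIT}

def decide(process: str, is_sensitive: bool,
           policies: dict[str, Action] = POLICIES,
           default: Action = Action.BLOCK) -> Action:
    """Choose how to handle a document operation performed by `process`."""
    if not is_sensitive:
        return Action.ALLOW  # the policy applies only to sensitive documents
    return policies.get(process, default)
```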
CN201610237061.9A 2016-04-15 2016-04-15 Method and system for leakage prevention Active CN105955978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610237061.9A CN105955978B (en) 2016-04-15 2016-04-15 Method and system for leakage prevention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610237061.9A CN105955978B (en) 2016-04-15 2016-04-15 Method and system for leakage prevention

Publications (2)

Publication Number Publication Date
CN105955978A CN105955978A (en) 2016-09-21
CN105955978B true CN105955978B (en) 2019-07-02

Family

ID=56917999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610237061.9A Active CN105955978B (en) 2016-04-15 2016-04-15 Method and system for leakage prevention

Country Status (1)

Country Link
CN (1) CN105955978B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649262B (en) * 2016-10-31 2020-07-07 复旦大学 Method for protecting sensitive information of enterprise hardware facilities in social media
CN108073821B (en) * 2016-11-09 2021-08-06 ***通信有限公司研究院 Data security processing method and device
CN112685775A (en) * 2020-12-29 2021-04-20 北京八分量信息科技有限公司 Method and device for monitoring data leakage prevention in block chain system and related products

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294671A (en) * 2012-02-22 2013-09-11 腾讯科技(深圳)有限公司 Document detection method and system
CN103425639A (en) * 2013-09-06 2013-12-04 广州一呼百应网络技术有限公司 Similar information identifying method based on information fingerprints
CN103441924A (en) * 2013-09-03 2013-12-11 盈世信息科技(北京)有限公司 Method and device for spam filtering based on short text
CN103971061A (en) * 2014-05-26 2014-08-06 中电长城网际***应用有限公司 Method and device for acquiring text file fingerprint and data management method
CN104506545A (en) * 2014-12-30 2015-04-08 北京奇虎科技有限公司 Data leakage prevention method and data leakage prevention device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8504489B2 (en) * 2009-03-27 2013-08-06 Bank Of America Corporation Predictive coding of documents in an electronic discovery system

Also Published As

Publication number Publication date
CN105955978A (en) 2016-09-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200123

Address after: 100094 west side of the first floor of Building 1, yard 68, Zhongguo Beiqing Road, Haidian District, Beijing

Patentee after: Quantum innovation (Beijing) Information Technology Co., Ltd

Address before: 100086, A, building 1, building 48, No. 3 West Third Ring Road, Haidian District, Beijing, 23E

Patentee before: Baoli Nine Chapters (Beijing) Data Technology Co., Ltd.