CN105893859B - Method and system for leakage prevention - Google Patents
- Publication number
- CN105893859B (application CN201610236738.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- document
- block
- predetermined length
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/604—Tools and structures for managing or administering access control systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Automation & Control Theory (AREA)
- Bioethics (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Storage Device Security (AREA)
Abstract
The invention discloses methods and systems for data leakage prevention. They include: a method of extracting data features from a document to obtain a first data fingerprint and a second data fingerprint; a method of judging, from the extracted data features, whether a first document and a second document are related; and a method of judging, from the degree of correlation, whether a suspicious document contains sensitive content. The invention also provides the corresponding device for extracting document data features, a device for judging whether a first document and a second document are related, and a device for judging whether a suspicious document contains sensitive content.
Description
Technical field
The present invention relates to the technical field of data security, and in particular to a method and system for data leakage prevention.
Background art
In recent years, with the rapid development of information technology, data security has become particularly important in the daily operation of information-based enterprises. If data is maliciously tampered with or destroyed, the enterprise may suffer irrecoverable losses. To improve information security, data security policies generally need to be set so that data can be monitored and protected. In the current big-data environment, as the volume of enterprise data grows, how to monitor and protect the ever-increasing data quickly and efficiently has become a major problem facing the data security field.
At present, to prevent data leakage, many enterprises have deployed a data leakage prevention (DLP) system inside their intranet to ensure the security of sensitive data. A DLP system uses software to monitor and protect sensitive data and, by certain technical means, prevents the enterprise's designated data or information assets from leaving the enterprise in a form that violates the security policy, so that sensitive data is neither lost nor leaked. In a DLP system, therefore, extracting data features and matching them against sensitive data are key steps.
Traditional DLP systems usually extract data features either by manually configured keywords or by generating a data fingerprint for the entire file. The former cannot perform feature extraction automatically, and the latter loses extraction accuracy when the file is large. In addition, sensitive data is usually matched with rule matching and hash matching algorithms, and here too both performance and accuracy degrade sharply on large files.
Summary of the invention
To this end, the present invention provides a method and system for data leakage prevention that seek to solve, or at least alleviate, at least one of the problems above.
According to one aspect of the invention, there is provided a method of extracting data features from a document, where the extracted data features include a first data fingerprint and a second data fingerprint. The method comprises the steps of: dividing the data in the document sequentially into blocks, calculating a data feature string for each data block based on the data content of that block, and combining the data feature strings of the data blocks to construct the first data fingerprint of the document; and performing word segmentation on the document to obtain a word sequence, dividing the word sequence sequentially into blocks, calculating a data feature string for each word block based on the data content of that block, and combining the data feature strings of the word blocks to construct the second data fingerprint of the document.
According to another aspect of the invention, there is provided a judging method for judging whether a first document and a second document are related, comprising the steps of: executing the data feature extraction method described above on the first document and extracting its data features to obtain a first feature set; executing the data feature extraction method described above on the second document and extracting its data features to obtain a second feature set; and calculating the similarity of the first feature set and the second feature set, the first document and the second document being considered related if the similarity reaches a preset range.
According to another aspect of the invention, there is provided a method of judging whether a suspicious document contains sensitive content, comprising the steps of: executing the data feature extraction method described above on confidential documents, extracting their data features, and establishing a feature library; then extracting the data features of the suspicious document and executing the above judging method to judge whether the suspicious document is related to a confidential document in the feature library. If the suspicious document is judged to be related to a confidential document, the suspicious document is considered to contain sensitive content; if it is judged to be unrelated to the confidential documents, it is considered not to contain sensitive content.
Correspondingly, the invention also provides a device for extracting data features from a document, a judging device for judging whether a first document and a second document are related, and a device for judging whether a suspicious document contains sensitive content.
According to a further aspect of the invention, there is provided a data leakage prevention system, comprising: computing devices connected to a data security protection device; and the data security protection device, which comprises a document acquisition device, the sensitive content judging device described above, a control policy acquisition device, and a control device.
Based on the above description, this scheme extracts the data features of a document by dividing the document into blocks and extracting the data fingerprints of the data blocks and word blocks. A data fingerprint is calculated for each block using a locality-sensitive hashing (LSH) algorithm, which effectively prevents the leakage of similar data and keeps feature extraction accurate even when the document is very large.
In terms of feature matching, this scheme judges similar content in documents either by matching the similarity of a single data feature string (single matching) or by calculating the proportion of similar data feature strings (benchmark matching). Optionally, the similarity between documents can be expressed by the Hamming distance or the Jaccard coefficient. In this way, sensitive data can be matched more comprehensively to prevent it from leaking, which effectively counters the various means of leaking documents.
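The two similarity measures and the two matching modes named above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the bit threshold `max_bits`, the ratio `min_ratio`, and the reading of "single matching" as "any one feature string is close enough" and "benchmark matching" as "a large enough share of feature strings is close enough" are all assumptions made for the example.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex feature strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("feature strings must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def jaccard(set_a: set, set_b: set) -> float:
    """|A & B| / |A | B|; two empty sets are treated as identical."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def is_similar(feature: str, library: list, max_bits: int) -> bool:
    """True if the feature string is within max_bits of any library entry."""
    return any(hamming_distance(feature, ref) <= max_bits for ref in library)


def single_match(suspect: list, library: list, max_bits: int) -> bool:
    """Assumed reading of single matching: one similar feature string suffices."""
    return any(is_similar(f, library, max_bits) for f in suspect)


def benchmark_match(suspect: list, library: list,
                    max_bits: int, min_ratio: float) -> bool:
    """Assumed reading of benchmark matching: share of similar strings >= min_ratio."""
    if not suspect:
        return False
    similar = sum(1 for f in suspect if is_similar(f, library, max_bits))
    return similar / len(suspect) >= min_ratio
```

Hamming distance suits fixed-length LSH strings compared one against one, while the Jaccard coefficient compares two whole sets of feature strings; which one a deployment uses, and with what thresholds, is a tuning decision the text leaves open.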
Description of the drawings
To accomplish the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, the same reference numerals generally refer to the same components or elements.
Fig. 1 shows a schematic diagram of a data leakage prevention system 100 according to an embodiment of the invention;
Fig. 2A shows a flow chart of a method 200 of extracting data features from a document according to an embodiment of the invention;
Fig. 2B shows a flow chart of the method 200 of extracting data features from a document according to another embodiment of the invention;
Fig. 3A shows a schematic diagram of a device 300 for extracting data features from a document according to an embodiment of the invention;
Fig. 3B shows a schematic diagram of the device 300 for extracting data features from a document according to another embodiment of the invention;
Fig. 4 shows a flow chart of a judging method 400 for judging whether a first document and a second document are related according to an embodiment of the invention;
Fig. 5 shows a schematic diagram of a judging device 500 for judging whether a first document and a second document are related according to an embodiment of the invention;
Fig. 6 shows a flow chart of a method 600 of judging whether a suspicious document contains sensitive content according to an embodiment of the invention;
Fig. 7 shows a schematic diagram of a device 700 for judging whether a suspicious document contains sensitive content according to an embodiment of the invention; and
Fig. 8 schematically shows the block processing.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
Fig. 1 shows a schematic diagram of a data leakage prevention system 100 according to an embodiment of the invention. Inside the enterprise, computing devices 110 are connected to one another through a local area network. The components of a computing device 110 may include, but are not limited to: one or more processors or processing units, a system memory, and a bus connecting the different system components (including the system memory and the processing units). It should be noted that, in addition to traditional computing devices (for example, computers), the computing devices 110 suitable for implementing embodiments of the invention also include mobile electronic devices, including but not limited to mobile phones, PDAs, and tablet computers, as well as servers, printers, CD/DVD, and the like in the enterprise working environment.
A data security protection device 120 for data leakage prevention is arranged in the local area network and is connected to all computing devices 110 through the local area network. As shown in Fig. 1, the protection device 120 comprises a document acquisition device 122, a sensitive content judging device 700, a control policy acquisition device 124, and a control device 126.
The document acquisition device 122 is adapted to monitor all computing devices 110 in the local area network in real time and, when it detects that a computing device 110 is sending a document, to acquire the content of the document being sent. Here, a document can be chat information from instant messaging and/or a picture or document sent through instant messaging.
The sensitive content judging device 700 is adapted to judge whether the acquired document contains sensitive content; device 700 is described in more detail below.
The control policy acquisition device 124 is adapted to acquire, while it is being judged whether the document contains sensitive content, the control policy corresponding to the process associated with the document. Optionally, the control policies may include a no-print policy when the specified process is printing and a garbled-string policy when the specified process is transferring a file.
The control device 126 is adapted to control operations on the document according to the acquired control policy when the suspicious document is judged to contain sensitive content. For example, the sensitive data in the content to be transmitted is replaced with a character string that marks it as garbled.
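As a rough illustration of how a control policy might be looked up per process and applied, consider the sketch below. Everything in it is hypothetical: the process names `"print"` and `"file_transfer"`, the `Policy` enum, the `*` mask character, and the idea that sensitive data arrives as `(start, end)` character spans are assumptions for the example and are not specified by the text.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Policy(Enum):
    NO_PRINT = "no_print"          # refuse the print operation
    GARBLE = "garble"              # replace sensitive data before sending


# Hypothetical mapping from the process handling the document to a policy.
POLICY_BY_PROCESS = {
    "print": Policy.NO_PRINT,
    "file_transfer": Policy.GARBLE,
}

MASK_CHAR = "*"  # assumed marker for garbled content


def apply_policy(process: str, content: str,
                 sensitive_spans: List[Tuple[int, int]]) -> Optional[str]:
    """Return the content allowed to leave, or None if the operation is blocked."""
    policy = POLICY_BY_PROCESS.get(process)
    if policy is Policy.NO_PRINT:
        return None
    if policy is Policy.GARBLE:
        chars = list(content)
        for start, end in sensitive_spans:
            for i in range(max(start, 0), min(end, len(chars))):
                chars[i] = MASK_CHAR
        return "".join(chars)
    return content  # no policy registered for this process


print(apply_policy("file_transfer", "salary=9000", [(7, 11)]))  # salary=****
```

Keeping the policy table separate from the enforcement function mirrors the split between the control policy acquisition device 124 and the control device 126.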
Based on the above description of system 100, the key to data security protection in this system is matching sensitive content accurately, which is the operation performed by the sensitive content judging device 700. In brief, the sensitive content judging device 700 should include (but is not limited to) a storage module (for storing all data features of confidential documents), a device for extracting document data features (for extracting the data features of a suspicious document), a document relevance judging device (for judging, from the extracted data features, whether the suspicious document is related to a confidential document), and a determining module (for determining, from the relevance result, whether the suspicious document contains sensitive content).
The composition of each of the above modules and the processes they execute are described below.
Fig. 2A shows a flow chart of a method 200 of extracting data features from a document according to an embodiment of the invention. As shown in Fig. 2A, the method starts at step S210. In step S210, the data in the document is divided sequentially into blocks to obtain one or more data blocks of a first predetermined length, where adjacent data blocks overlap by a second predetermined length. In other words, a sliding window of the first predetermined length slides along the document, moving by the second predetermined length each time, so that the document is divided into multiple data blocks of the first predetermined length. According to one embodiment of the invention, the first predetermined length is set to 512 bytes and the second predetermined length to 256 bytes.
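The sliding-window blocking of step S210 can be sketched as below. The function name `split_blocks` is hypothetical, and keeping the shorter trailing blocks is an assumption, chosen because it reproduces the 9 blocks at offsets 0 to 2048 that the document's own 2278-byte example lists.

```python
from typing import Iterator, Tuple


def split_blocks(data: bytes, block_size: int = 512,
                 step: int = 256) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, block) pairs from a window of block_size moved by step.

    Assumption: blocks near the end of the document may be shorter than
    block_size rather than being dropped or padded.
    """
    if block_size <= 0 or step <= 0:
        raise ValueError("block_size and step must be positive")
    for offset in range(0, len(data), step):
        yield offset, data[offset:offset + block_size]


# A dummy 2278-byte document, the size used in the document's example.
blocks = list(split_blocks(bytes(2278)))
print(len(blocks))                 # 9
print([off for off, _ in blocks])  # [0, 256, 512, ..., 2048]
```

With a 512-byte window and a 256-byte step, every byte except those in the first and last half-blocks is covered by two blocks, so an insertion near a block boundary still leaves a neighbouring block largely intact.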
Then, in step S220, for the one or more data blocks obtained, the data feature string of each data block is calculated based on the data content of that block. Optionally, a locality-sensitive hashing (LSH) algorithm generates a data signature for the data content of each data block.
Then, in step S230, the data feature strings of the data blocks are combined to construct the first data fingerprint of the document, which serves as a data feature of the document.
If a data fingerprint were generated for the entire document, the algorithm's performance would degrade sharply and its accuracy would drop when the document is large. This scheme therefore first divides the document into data blocks and then extracts the data features of each block as the data features of the whole document. Moreover, with a common hash algorithm such as MD5, two equal data signatures indicate that the original data is equal with a certain probability, but two unequal signatures provide no information beyond the fact that the original data differs. A traditional hash algorithm therefore cannot defend well against the leakage of similar data, so this method uses a locality-sensitive hashing algorithm. The advantage of the LSH algorithm is that it generates identical or similar data signatures for similar data content, which improves the effect of subsequent feature matching.
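The point about conventional hashes can be seen with a small experiment using Python's standard `hashlib`. The two sample strings are made up for the illustration; the helper `bit_difference` is a hypothetical name.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bits in which two equal-length digests differ."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


# Two made-up inputs that differ in a single character.
original = b"Quarterly revenue forecast: 1,250,000 yuan"
edited = b"Quarterly revenue forecast: 1,250,001 yuan"

digest_a = hashlib.md5(original).digest()
digest_b = hashlib.md5(edited).digest()

print(digest_a == digest_b)                                   # False
print(bit_difference(digest_a, digest_b), "of 128 bits differ")
```

Because a cryptographic hash is designed so that a small input change scatters across the whole digest, the number of differing bits says nothing about how close the inputs were. That is the gap a locality-sensitive hash is meant to fill.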
Specifically, the data feature string of each data block is calculated in step S220 as follows:
First, data sub-blocks of a fifth predetermined length are selected from the data block in turn, where adjacent data sub-blocks overlap by a sixth predetermined length. Similarly, in an implementation, a sliding window of the fifth predetermined length (for example, 5 bytes) can slide along the data block, moving by the sixth predetermined length (for example, 1 byte) each time. Fig. 8 schematically shows the block processing of a document, where D1 denotes the fifth predetermined length, D2 denotes the sixth predetermined length, and every two adjacent sub-blocks of the fifth predetermined length overlap by the sixth predetermined length.
Next, a feature value list of a seventh predetermined length (for example, 32 bytes, i.e., 256 bits) is calculated from the content of each data sub-block.
Finally, the data feature string of the data block is constructed from the feature value lists of all data sub-blocks.
Specifically, the feature value list of the seventh predetermined length is calculated from the content of a data sub-block as follows:
1) extract one or more content subsets, each made up of part of the content of the data sub-block; in other words, extract all triples in each data sub-block;
2) use a hash algorithm to hash each triple into the range [0, 256);
3) set the corresponding value in the feature value list according to the value obtained for each content subset; for example, for a triple igr whose hash value is assumed to be 15, add 1 at position 15 of the feature value list.
When all triples in a data block have been processed, every position in the feature value list holds an accumulated value. The average of all accumulated values is used as a threshold: if the accumulated value at a position (that is, the value of a cell in the feature value list) is greater than the average, the cell is set to 1; otherwise it is set to 0. This binarization yields a binarized feature value list of the seventh predetermined length, which is converted into a numeric string of the seventh predetermined length and used as the data feature string of the data block.
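The accumulate-and-binarize procedure described above can be sketched as follows. Several details are assumptions, since the text does not fix them: "all triples" is read as every ordered 3-byte combination within a 5-byte window, the bucket hash is a stand-in (the first byte of an MD5 digest), bucket 0 is written as the most significant bit, and the names `bucket_of` and `feature_string` are hypothetical.

```python
import hashlib
from itertools import combinations

WINDOW = 5      # fifth predetermined length, in bytes
SLIDE = 1       # window displacement per step, in bytes
BUCKETS = 256   # seventh predetermined length, in bits (32 bytes)


def bucket_of(triple: bytes) -> int:
    """Stand-in hash mapping a triple to a bucket in [0, 256)."""
    return hashlib.md5(triple).digest()[0]


def feature_string(block: bytes) -> str:
    """Return a 64-hex-digit (256-bit) feature string for one data block."""
    counts = [0] * BUCKETS
    # Slide the sub-block window along the data block.
    for start in range(0, max(len(block) - WINDOW + 1, 0), SLIDE):
        window = block[start:start + WINDOW]
        # Assumed meaning of "all triples": every 3-byte combination, in order.
        for idx in combinations(range(WINDOW), 3):
            triple = bytes(window[i] for i in idx)
            counts[bucket_of(triple)] += 1
    # Binarize against the mean of all accumulated values.
    mean = sum(counts) / BUCKETS
    bits = "".join("1" if c > mean else "0" for c in counts)
    return format(int(bits, 2), "064x")


print(feature_string(b"The quick brown fox jumps over the lazy dog" * 12))
```

Because each bit reflects whether a bucket is hit more often than average, a local edit shifts only a few bucket counts, and most bits of the feature string stay the same. Comparing two such strings by Hamming distance then gives a graded similarity rather than an all-or-nothing answer.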
An example of a first data fingerprint generated for a document is shown below:
doc_size:2278
sig_num:9 // number of data blocks
bin_block_size:512 // first predetermined length
bin_step_size:256 // second predetermined length
threshold:75 // threshold used when executing the LSH algorithm
LSH:4f2745a4400f311cab5843643a9771299c9c5f4d81e1669ce3554e4d75fed43a
offset:0
LSH:ef2205c4808533748bda836571976d2196b81dec8da154d6aad5366dfcb9d6d9
offset:256
LSH:ade326d490a64b77bbd349f0c0bced2096f874e9ad42dc7d24bef279ea05c5d9
offset:512
LSH:3b276695d8c63bdfeb1340c0c450c0c096ea6e79cdc2bc596ce7e35cea28e7be
offset:768
LSH:2faded8472873999e9154bcc684270ec92a67a7cc9c02cd8eae742dc2a58c21e
offset:1024
LSH:afa1f584d0a733a7a3559bc8530b78688aa473fc1be06df5aa23469c1b78c28a
offset:1280
LSH:8fa4f5cd80e533b6abc1964b520f306088b073f081616df52803461f2af9c78a
offset:1536
LSH:1fa56499c0a7333ca05046c1520420608a9577a093f12d586a114c3f12e8e3ce
offset:1792
LSH:180d2cba6027725c0010468030c08142c857e7808a91275e2a51097b9300a34e
offset:2048
In this embodiment, for a document of 2278 bytes with a first predetermined length of 512 bytes and a second predetermined length of 256 bytes, 9 data blocks are generated. The LSH algorithm generates a data feature string for each of the 9 data blocks (with the fifth predetermined length set to 5 bytes, the sixth predetermined length to 1 byte, and the seventh predetermined length to 32 bytes), and concatenating these 9 data feature strings yields the first data fingerprint of the document. Optionally, the first data fingerprint also includes the offset position information (offset) of each data block in the document. The offset records the position of the data block in the document and is mainly used for tracing the source of data: the alarm issued to the user after a sensitive data leak is found can carry the offset information. Of course, to save memory, the first data fingerprint may also omit the offset information; the invention places no restriction on this.
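One way to hold such a fingerprint in memory, with the offset kept optional, is sketched below. The class and method names (`BlockSignature`, `FirstFingerprint`, `concatenated`, `locate`) are hypothetical; the field names follow the example listing, and the two LSH values are copied from it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BlockSignature:
    lsh: str                      # 64 hex digits, one data feature string
    offset: Optional[int] = None  # omitted when memory matters more than tracing


@dataclass
class FirstFingerprint:
    doc_size: int
    bin_block_size: int
    bin_step_size: int
    signatures: List[BlockSignature] = field(default_factory=list)

    @property
    def sig_num(self) -> int:
        """Number of data blocks that contributed a signature."""
        return len(self.signatures)

    def concatenated(self) -> str:
        """The fingerprint proper: all feature strings joined in block order."""
        return "".join(s.lsh for s in self.signatures)

    def locate(self, lsh: str) -> List[int]:
        """Offsets of blocks carrying this feature string, for source tracing."""
        return [s.offset for s in self.signatures
                if s.lsh == lsh and s.offset is not None]


fp = FirstFingerprint(doc_size=2278, bin_block_size=512, bin_step_size=256)
fp.signatures.append(BlockSignature(
    "4f2745a4400f311cab5843643a9771299c9c5f4d81e1669ce3554e4d75fed43a", 0))
fp.signatures.append(BlockSignature(
    "ef2205c4808533748bda836571976d2196b81dec8da154d6aad5366dfcb9d6d9", 256))
print(fp.sig_num, fp.locate(fp.signatures[1].lsh))  # 2 [256]
```

When a leaked block is matched, `locate` gives the byte position to include in the alarm; a fingerprint built without offsets still matches but cannot say where in the document the match came from.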
To make feature extraction sufficiently accurate, according to an embodiment of the invention, a second data fingerprint can be extracted as a data feature of the document in addition to the first data fingerprint. Fig. 2B shows a flow chart of the method of extracting the second data fingerprint from a document according to another embodiment of the invention. The steps of the method are as follows:
In step S240, word segmentation is first performed on the document, and useless information such as stop words, punctuation marks, and line breaks is removed, to obtain a word sequence. According to an embodiment of the invention, method 200 performs word segmentation with a dictionary-based segmentation algorithm such as MMSEG (A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm). MMSEG is a common dictionary-based algorithm for Chinese word segmentation; it is simple and intuitive, uncomplicated to implement, and fast. In brief, the algorithm consists of a "matching algorithm" and "disambiguation rules". The matching algorithm determines how the sentence to be segmented is matched against the words stored in the dictionary. The disambiguation rules decide which segmentation to use when a sentence can be split in more than one way. For example, the phrase "facility and service" can be segmented as "facility / kimono / business" or as "facility / and / service"; choosing between the two results is the job of the disambiguation rules. MMSEG defines two matching algorithms, simple maximum matching and complex maximum matching, and four disambiguation rules: maximum matching (corresponding to the two matching algorithms above), largest average word length, smallest variance of word lengths, and largest sum of degree of morphemic freedom of one-character words (compute the natural logarithm of the frequency of every one-character word in the phrase, sum the values, and take the phrase with the largest sum).
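A minimal sketch of simple (forward) maximum matching, the simpler of the two matching algorithms, shows why disambiguation rules are needed. The toy dictionary and the Chinese phrase 设施和服务 are assumptions: the phrase is a guess at the original wording behind "facility and service", and the function name `simple_max_match` is hypothetical. Complex maximum matching and the four disambiguation rules are not implemented here.

```python
def simple_max_match(text: str, dictionary: set, max_word_len: int = 4) -> list:
    """Greedy forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary: 设施 (facility), 和服 (kimono), 服务 (service), 和 (and).
toy_dictionary = {"设施", "和服", "服务", "和"}
print(simple_max_match("设施和服务", toy_dictionary))  # ['设施', '和服', '务']
```

The greedy pass picks 和服 ("kimono") and strands 务, which is exactly the unwanted "facility / kimono / business" split. Choosing "facility / and / service" instead requires looking at alternative segmentations and ranking them, which is what the disambiguation rules do.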
For example, word segmentation of the following document A yields document B.
Document A:
"Group Personal Accident Injury Insurance benefit plan
Accidental injury: an objective event that is external, sudden, unintended, and not caused by disease, and that causes bodily harm.
Being hit in a traffic accident, being burned in a fire, being injured by an object falling from a height, being attacked and injured by a criminal, a liquefied gas or coal gas explosion,
being scalded by boiling oil while cooking, and the like are all accidental injury events.
Two combined schemes are recommended for the unit to choose from according to its actual situation:
1. Accidental injury insurance: premium 150 yuan per person (occupation classes 1 and 2)
(1) Insurance period: 1 year
(2) Insurance liability: death due to accidental injury pays 100,000 yuan; total disability or accidental burns due to accidental injury pays 100,000 yuan (partial disability is paid proportionally); accidental injury medical treatment 10,000 yuan.
2. Accidental injury and medical insurance: premium 100 yuan per person (occupation classes 1 and 2)
(1) Insurance period: 1 year
(2) Insurance liability: death due to accidental injury pays 50,000 yuan; total disability or burns due to accidental injury pays 50,000 yuan (partial disability is paid proportionally); accidental injury medical treatment 10,000 yuan.
Note: our company can design an insurance scheme according to the specific situation of your unit.
Low payment, extensive coverage; convenient to insure, fast claims settlement."
Document B (word sequence obtained after word segmentation processing):
[group, the person is unexpected, injures, insurance, material benefit, and plan is unexpected, and injury refers to, by, external, burst, non-
Meaning, non-, disease, body, injury, objective, event, traffic, accident are hit, and are occurred, fire, burn, high-altitude, pendant, and object is hit, and are caused
Wound, ruffian attack, injured, liquefied gas, coal gas, explosion, cook, boiling, oil, and scald is fixed one's mind on, and outside, injury, event is recommended,
Two kinds, combination, scheme, official documents and correspondence, confession, unit, in conjunction with, actual conditions are selected, and it is unexpected, it injures, insurance, insurance premium, 150 yuan,
Grade, occupation, insurance, during which, 1 year, insurance, responsibility was unexpected, and injury is die, and paid, 10, Wan Yuan, or because, unexpected, injury, entirely,
It is residual, it is unexpected, it burns, pays, 10, Wan Yuan, part, it is disabled, it in proportion, pays, unexpected, injury, medical treatment, 1, Wan Yuan, unexpected, injury,
Medical insurance, insurance premium, 100 yuan, grade, occupation, insurance, during which, 1 year, insurance, responsibility was unexpected, and injury is die, and paid, 5, ten thousand
Member, or because, unexpected, injury is entirely, residual, burns, pays, 5, Wan Yuan, part, and it is disabled, it in proportion, pays, unexpected, injury, medical treatment, 1,
Wan Yuan, note, company can press, your unit, and specifically, situation designs, and insurance, scheme, official documents and correspondence is paid the bill, and seldom, ensures, much, throw
It protects, convenient, Claims Resolution, rapidly]
It should be noted that the invention is not limited to a specific word segmentation method; any method that performs word segmentation on a document to obtain the meaningful words in it falls within the scope of protection of the invention.
Then, in step S250, the word sequence of the document is divided into blocks in order, to obtain one or more word blocks of a third predetermined length, where adjacent word blocks overlap by a fourth predetermined length. In other words, a sliding window of the third predetermined length is moved along the word sequence, advancing by the fourth predetermined length each time, so that the document is divided into multiple word blocks of the third predetermined length. According to one embodiment of the present invention, the third predetermined length is set to 64 words and the fourth predetermined length to 32 words.
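The sliding-window blocking of step S250 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `split_into_blocks` and the handling of the final, possibly shorter, block are assumptions that the text does not specify.

```python
from typing import List, Sequence


def split_into_blocks(items: Sequence[str], block_len: int = 64, step: int = 32) -> List[List[str]]:
    """Slide a window of block_len items along the sequence, advancing by step."""
    if block_len <= 0 or step <= 0:
        raise ValueError("block_len and step must be positive")
    if not items:
        return []
    blocks: List[List[str]] = []
    start = 0
    while True:
        blocks.append(list(items[start:start + block_len]))
        # Assumption: stop once the window has reached the end of the sequence,
        # so the final block may be shorter than block_len.
        if start + block_len >= len(items):
            break
        start += step
    return blocks
```

With block_len = 64 and step = 32, a sequence of 284 words produces 8 blocks under this stopping rule.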
For example, performing the blocking operation on the word sequence of document B above yields the following 4 word blocks:
Word block 1:[Group Life Accident Insurance material benefit plan unexpected injury refers to by the non-original idea non-disease of external burst
Actual bodily harm objective event traffic accident hits generation fire burn falling object from high altitude and hits the injured liquefied gas and coal gas of ruffian's attack of causing injury
Cook's boiling oil scald of exploding belongs to two kinds of assembled scheme official documents and correspondences of unexpected injury event recommendation and selects meaning for unit combination actual conditions
During 150 yuan of grade employment securities of outer injury insurance insurance premium]
Word block 2:[attacks injured liquefied gas and coal gas explosion cook boiling oil scald and belongs to two kinds of unexpected injury event recommendation combinations
Scheme official documents and correspondence selects insurance in 1 year during 150 yuan of grade employment securities of accident/injury insurance insurance premium to blame for unit combination actual conditions
Appoint unexpected injury to die to pay 100,000 yuan or pay accidental wound in proportion because 100,000 yuan of parts of unexpected injury Complete Disability accidental burns pair are disabled
10,000 yuan of unexpected injury medical insurance insurance premiums of evil medical treatment]
Word block 3:[insurance responsibility unexpected injury in 1 year, which is die, pays 100,000 yuan or because unexpected injury Complete Disability accidental burns pay 100,000
During unexpected injury 100 yuan of grade employment securities of medical 10,000 yuan of unexpected injury medical insurance insurance premiums are paid in first part deformity in proportion
Insurance responsibility unexpected injury in 1 year dies to pay 50,000 yuan or pay 50,000 yuan of part deformity because of the burn of unexpected injury Complete Disability pays meaning in proportion
Outer 10,000 yuan of injury medical treatment]
Word block 4:[the ten thousand yuan of parts are disabled to pay unexpected injury medical 10,000 yuan of unexpected injury medical insurance insurance premiums 100
Insurance responsibility unexpected injury in 1 year, which is die, during first grade employment security pays 50,000 yuan or because 50,000 yuan of portions are paid in the burn of unexpected injury Complete Disability
Dividing deformity to pay the medical 10,000 Yuan Zhu companies of unexpected injury in proportion can pay the bill not by your unit concrete condition design insurance scheme official documents and correspondence
It is ensure that many insure facilitates Claims Resolution rapid] more
Then, in step S260, for the one or more word blocks obtained, the data feature string of each word block is calculated based on the data content of that word block. Optionally, a locality-sensitive hashing (LSH) algorithm is applied to the data content of each word block to generate a data signature.
Then, in step S270, the data feature strings of the word blocks are combined to construct the second data fingerprint of the document, which serves as a data feature of the document.
According to one embodiment of the present invention, the same LSH algorithm can be used to process word blocks and data blocks in order to obtain the data signatures. The step of calculating the data feature string of each word block can therefore follow the step of calculating the data feature string of each data block in step S220, and the details are not repeated here.
Suppose the 4 data feature strings generated by applying the LSH algorithm to the 4 word blocks above are, respectively:
LSH1:3f26258da0a5310d6b5203845ab0784eb29acff9814564946ce458fc086ac2a8
LSH2:2f2465c480a3312f215d80ce53863000a0ad6b78a3616595ae4c56bc00e9c2ba
LSH3:0fa077e500a531bfa95dd08e53862020a0896b58a362659d2a484ebf00e9cbaa
LSH4:0fa467ed10a5331da959d40e53802000b0897310a1616d592a4844bb02a9c7ea
An example of the second data fingerprint generated for a document is shown below:
doc_size:2278
word_num:284 // number of words
sig_num:8 // number of word blocks
word_block_size:64 // third predetermined length
word_step_size:32 // fourth predetermined length
LSH:bf32258f90e5390da35083045a83630ab6be5fe9a10577102dc40acd286bd2a8
offset:0
LSH:0f2205c490d01f2fa3d000904a87630896f83fa0a147d79424e446adfc5b50aa
offset:254
LSH:b92355848cfa3b6fab5744f14c92ad4892d86e89a143f4bee4a852386e09e2e8
offset:554
LSH:3baf75807cda31ede317c6c4745321ccd2966efd89422598ec62f3dca2f9e39a
offset:840
LSH:2fafe588e0b131ade3559a46f31b30c4929e6b7c89d2671cae66f2dc02e9caca
offset:1080
LSH:2f2475cc40a731bd2155d0ce53a83000b88d7b5883d06594aa4244be04ebe2ca
offset:1326
LSH:0fa467cd00a433bca1d1d48f53a42021b889735481e241bd2a4844bf8ce9c78a
offset:1594
LSH:07af6fbde0a5317c61d1544a02802142f29b730c93602cbe2e4ae83b83c9c7ee
offset:1797
For this document of 2278 bytes containing 284 words, the LSH algorithm is used to generate a data feature string for each of the 8 word blocks (with the fifth predetermined length set to 5 bytes, the sixth predetermined length to 1 byte, and the seventh predetermined length to 32 bytes); concatenating these 8 data feature strings yields the second data fingerprint of the document. As described for the first data fingerprint, the second data fingerprint may also include the offset position information (offset) of each word block in the document.
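To make the layout of the example fingerprint concrete, the sketch below reads such a listing into header fields and (LSH, offset) pairs. It assumes the `key:value // comment` layout of the example is the stored form, which the text does not state; the function name `parse_fingerprint` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_fingerprint(text: str) -> Tuple[Dict[str, int], List[Tuple[str, int]]]:
    """Parse 'key:value // comment' lines into header fields and (LSH, offset) pairs."""
    header: Dict[str, int] = {}
    signatures: List[Tuple[str, int]] = []
    pending_lsh: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.split("//", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key:value line
        key, value = key.strip().lower(), value.strip()
        if key == "lsh":
            pending_lsh = value
        elif key == "offset":
            # An offset belongs to the LSH line that precedes it.
            if pending_lsh is not None:
                signatures.append((pending_lsh, int(value)))
                pending_lsh = None
        else:
            header[key] = int(value)
    return header, signatures
```

Keeping each feature string together with its offset lets a later match be traced back to the position of the block in the document.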
It should be noted that, during execution of method 200, the first predetermined length, the second predetermined length, the third predetermined length and the fourth predetermined length can be set according to the importance of the document, so as to vary the granularity of feature extraction. That is, the more important the document, the smaller the block size (first predetermined length, third predetermined length) and the shift step (second predetermined length, fourth predetermined length) used when dividing the data and the word sequence into blocks.
This completes the steps of extracting data features from a document: through method 200, the first data fingerprint and the second data fingerprint are extracted from the document as its data features. Correspondingly, Figs. 3A and 3B respectively show schematic diagrams of a device 300, according to an embodiment of the present invention, that implements method 200 by extracting the first data feature and the second data feature from a document.
As shown in Fig. 3A, the feature extraction device 300 includes a blocking module 310, a computing module 320 and a feature extraction module 330, where the computing module 320 is coupled to the blocking module 310 and to the feature extraction module 330.
The blocking module 310 is adapted to divide the data in the document into blocks in order, to obtain one or more data blocks of a first predetermined length, where adjacent data blocks overlap by a second predetermined length.
The computing module 320 is adapted to calculate, for each of the one or more data blocks obtained, the data feature string of that data block based on its data content.
The feature extraction module 330 is adapted to combine the data feature strings of the data blocks to construct the first data fingerprint of the document, which serves as a data feature of the document.
As described above for the step of calculating the data feature string of each data block, the computing module 320 further includes a blocking unit 322 and a computing unit 324.
The blocking unit 322 is adapted to successively select data sub-blocks of a fifth predetermined length from a data block, where adjacent data sub-blocks overlap by a sixth predetermined length.
The computing unit 324 is adapted to calculate a feature value list of a seventh predetermined length from the content of each data sub-block. Optionally, the computing unit 324 may include an extraction subunit, adapted to extract one or more content subsets, each formed from part of the content of a data sub-block. The computing unit 324 then uses a hash algorithm to hash each content subset to a value between 0 and the seventh predetermined length, and sets the corresponding value in the feature value list of the seventh predetermined length according to the value obtained for each content subset. The computing unit 324 may also include a counting subunit, adapted to superimpose and merge the values at corresponding positions of the feature value lists of the data sub-blocks so as to obtain the feature value list of the data block, and a binarization subunit, adapted to binarize the value of each cell in that feature value list so as to obtain a feature value list in which every cell is 0 or 1. The computing unit is further adapted to convert this feature value list of the seventh predetermined length into a number string of the seventh predetermined length, which serves as the data feature string of the data block.
The binarization subunit is adapted to calculate the average of all cell values in the feature value list and to compare each cell value with that average: if the value of a cell is greater than the average, the cell is set to 1; if the value of a cell is not greater than the average, the cell is set to 0.
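The sub-block hashing, superposition and binarization described above can be sketched as follows. This is an illustration under stated assumptions, not the patented algorithm: the text names neither the hash algorithm nor how content subsets are chosen, so SHA-256 and "every 3-byte combination of a sub-block" are stand-ins. The sketch also treats the sixth predetermined length as the step of the sliding window and uses one counter per output bit (32 bytes give 256 cells); both are assumptions.

```python
import hashlib
from itertools import combinations


def lsh_feature_string(block: bytes, window: int = 5, step: int = 1, out_bytes: int = 32) -> str:
    """Return a hex feature string of out_bytes bytes for one block."""
    n_cells = out_bytes * 8  # one counter per output bit (assumption)
    counts = [0] * n_cells
    last_start = max(len(block) - window, 0)
    for start in range(0, last_start + 1, step):
        sub_block = block[start:start + window]
        # Assumed content subsets: every 3-byte combination of the sub-block.
        for subset in combinations(sub_block, 3):
            digest = hashlib.sha256(bytes(subset)).digest()
            cell = int.from_bytes(digest[:4], "big") % n_cells
            counts[cell] += 1  # superimpose the results of all sub-blocks
    # Binarize against the average: 1 if greater than the average, else 0.
    mean = sum(counts) / n_cells
    bits = "".join("1" if c > mean else "0" for c in counts)
    width = out_bytes * 2  # two hex digits per byte
    return f"{int(bits, 2):0{width}x}"
```

Because each sub-block only increments a few counters, a small local change to the block alters few counters, and the binarized strings of two similar blocks tend to differ in few bits.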
According to one implementation, the feature extraction device 300 is further adapted to extract the second data fingerprint of a document. In this case, in addition to the blocking module 310, the computing module 320 and the feature extraction module 330, the feature extraction device 300 further includes a word segmentation module 340, adapted to perform word segmentation on the document to obtain a word sequence. According to an embodiment of the present invention, the word segmentation module 340 is configured to segment the document using a dictionary-based segmentation algorithm (for example, MMSEG).
The blocking module 310 is further adapted to divide the word sequence of the document into blocks in order, to obtain one or more word blocks of a third predetermined length, where adjacent word blocks overlap by a fourth predetermined length.
Meanwhile, the computing module 320 is further adapted to calculate, for each of the one or more word blocks obtained, the data feature string of that word block based on its data content. The feature extraction module 330 is further adapted to combine the data feature strings of the word blocks to construct the second data fingerprint of the document, which serves as a data feature of the document.
As described above for the step of calculating the data feature string of each word block, the computing module 320 includes, in addition to the blocking unit 322 and the computing unit 324, a character conversion unit 326.
The character conversion unit 326 is adapted to convert the words in each word block of the third predetermined length into characters, obtaining the corresponding character string as the word block. The blocking unit 322 is further adapted to successively select sub-word-blocks of the fifth predetermined length from the word block, where adjacent sub-word-blocks overlap by the sixth predetermined length. The computing unit 324 is further adapted to calculate a feature value list of the seventh predetermined length from the content of each sub-word-block. Optionally, the computing unit calculates the data feature string of a word block using the same LSH algorithm as for data blocks; the details are not repeated here.
As described for method 200, the first predetermined length, the second predetermined length, the third predetermined length and the fourth predetermined length can be set according to the importance of the document, so as to vary the granularity of feature extraction. That is, the more important the document, the smaller the block size (first predetermined length, third predetermined length) and the shift step (second predetermined length, fourth predetermined length) used for blocking.
Optionally, the feature extraction module 330 is further adapted to extract the offset position information of each data block in the document, to be included in the first data fingerprint, and to extract the offset position information of each word block in the document, to be included in the second data fingerprint.
In summary, this scheme extracts the data features of a document by dividing the document into blocks and extracting the data fingerprints of the data blocks and word blocks. Because a data fingerprint is calculated for each block, and the fingerprints are generated with a locality-sensitive hashing (LSH) algorithm, leakage of similar data can be effectively prevented, and the accuracy of feature extraction is maintained even when the document is very large.
Fig. 4 shows a flowchart of a method 400 for judging whether a first document and a second document are related, according to an embodiment of the present invention.
As shown in Fig. 4, the method starts at step S410, in which the steps of method 200 are performed on the first document and its data features are extracted to obtain a first feature set, where the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document.
Then, in step S420, the steps of method 200 are likewise performed on the second document and its data features are extracted to obtain a second feature set, where the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document.
Then, in step S430, the similarity between the first feature set and the second feature set is calculated; if the similarity reaches a preset range, the first document and the second document are considered related.
The process in step S430 of matching features and calculating the similarity between documents can take one of the following 3 forms.
◆ The first is single matching:
The Hamming distance is calculated between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set.
Here, the Hamming distance refers to the number of positions at which two data feature strings of equal length differ in their corresponding binary digits. Let d(x, y) denote the Hamming distance between two data feature strings x and y: an XOR operation is performed on the two data feature strings and the number of 1s in the result is counted; the value obtained is the Hamming distance. When the Hamming distance is greater than a first threshold, the two data feature strings are judged to be similar. For example:
The Hamming distance between 1011101 and 1001001 is 2;
The Hamming distance between "toned" and "roses" is 3.
For single matching, the rule is that as soon as any data feature string of the two documents is judged similar, the similarity between the first feature set and the second feature set is considered to reach the preset range. That is, as long as any one data block or word block is similar, the document is suspected of leaking data.
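The Hamming distance definition and the two examples above can be checked with a short sketch. The function names are hypothetical. The first function compares strings position by position, which covers both examples; the second applies the XOR-and-count procedure to hexadecimal feature strings.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def hamming_distance_hex(x: str, y: str) -> int:
    """Bitwise Hamming distance of two equal-length hex strings: XOR, then count the 1 bits."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return bin(int(x, 16) ^ int(y, 16)).count("1")


print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("toned", "roses"))      # 3
```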
The following pseudocode performs single matching on the data feature strings of two documents and returns whether they are related. signature_base and signature_cmp represent the first feature set and the second feature set respectively, and nilsima_base and nilsima_cmp denote the data feature strings in the feature sets of the two documents:
def single_match(signature_base, signature_cmp, threshold):
    # Compare every pair of data feature strings from the two feature sets.
    for nilsima_base in signature_base:
        for nilsima_cmp in signature_cmp:
            ham_dist = hamming_distance(nilsima_base, nilsima_cmp)
            if ham_dist > threshold:
                return 1  # one similar pair is enough
    return 0
◆ The second is benchmark matching:
As before, the Hamming distance is first calculated between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set, to determine whether two data feature strings are similar. Unlike single matching, benchmark matching counts the number of data feature strings in the first feature set that are similar to those of the second feature set, and then calculates the ratio of this number to the total number of data feature strings in the first feature set; if the ratio reaches a second threshold, the similarity between the first feature set and the second feature set is considered to reach the preset range.
The following pseudocode performs benchmark matching on the data fingerprints of two documents and returns the ratio from which relatedness is decided; signature_base_num denotes the total number of data feature strings in the first feature set.
def benchmark_match(signature_base, signature_cmp):
    signature_base_num = signature_base.signature_num
    simlar_num = 0
    for nilsima_base in signature_base:
        for nilsima_cmp in signature_cmp:
            if nilsima_cmp == nilsima_base:
                simlar_num += 1
    # Share of the first feature set that also appears in the second.
    return simlar_num / signature_base_num
◆ The third is whole matching:
The Jaccard coefficient is calculated between all data feature strings of the first feature set and all data feature strings of the second feature set. When the Jaccard coefficient reaches a third threshold, the similarity between the first feature set and the second feature set is considered to reach the preset range, and the first document and the second document are related.
Here, the Jaccard coefficient refers to the ratio of the intersection of two sets to their union:
Jaccard = |S ∩ T| / |S ∪ T|,
where S denotes the first feature set and T denotes the second feature set.
The following pseudocode performs whole matching on the first data fingerprints of two documents and returns the Jaccard coefficient from which relatedness is decided; signature_base_num and signature_cmp_num denote the total number of data feature strings in the two feature sets respectively:
def whole_match(signature_base, signature_cmp):
    signature_base_num = signature_base.signature_num
    signature_cmp_num = signature_cmp.signature_num
    simlar_num = 0
    for nilsima_base in signature_base:
        for nilsima_cmp in signature_cmp:
            if nilsima_cmp == nilsima_base:
                simlar_num += 1
    # |S ∩ T| / |S ∪ T|, with |S ∪ T| = |S| + |T| - |S ∩ T|
    return simlar_num / (signature_base_num + signature_cmp_num - simlar_num)
Method 400 provides 3 modes for matching similar content in two documents; optionally, the Hamming distance or the Jaccard coefficient is used to characterize the similarity between documents. In this way, a matching mode can be selected adaptively to judge document similarity, and sensitive data is matched more comprehensively, providing a strong guarantee against sensitive data leakage.
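The three matching modes of method 400 can be put side by side in a small runnable sketch. It is illustrative only: all function names and the default thresholds are hypothetical, feature sets are modelled as plain lists of feature strings, and the test that decides whether two feature strings are similar is passed in as a function, so the threshold and the comparison applied to the Hamming distance stay with the caller.

```python
from typing import Callable, Sequence

Similar = Callable[[str, str], bool]  # caller-supplied similarity test for two feature strings


def single_match(base: Sequence[str], cmp: Sequence[str], similar: Similar) -> bool:
    """Single matching: one similar pair of feature strings is enough."""
    return any(similar(b, c) for b in base for c in cmp)


def benchmark_ratio(base: Sequence[str], cmp: Sequence[str], similar: Similar) -> float:
    """Benchmark matching: share of the first set that has a similar string in the second."""
    if not base:
        return 0.0
    hits = sum(1 for b in base if any(similar(b, c) for c in cmp))
    return hits / len(base)


def jaccard(base: Sequence[str], cmp: Sequence[str]) -> float:
    """Whole matching: |S ∩ T| / |S ∪ T| over the two sets of feature strings."""
    s, t = set(base), set(cmp)
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def documents_related(base: Sequence[str], cmp: Sequence[str], mode: str, similar: Similar,
                      ratio_threshold: float = 0.5, jaccard_threshold: float = 0.5) -> bool:
    """Dispatch to one matching mode; the threshold defaults are placeholders."""
    if mode == "single":
        return single_match(base, cmp, similar)
    if mode == "benchmark":
        return benchmark_ratio(base, cmp, similar) >= ratio_threshold
    if mode == "whole":
        return jaccard(base, cmp) >= jaccard_threshold
    raise ValueError(f"unknown matching mode: {mode}")
```

Single matching is the most sensitive of the three, since one matching block suffices; benchmark and whole matching require a larger share of the blocks to agree before two documents are treated as related.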
Correspondingly, Fig. 5 shows a schematic diagram of a judgment device 500 that implements method 400 for judging whether a first document and a second document are related, according to an embodiment of the present invention. As shown in Fig. 5, the document relevance judgment device 500 includes: the feature extraction device 300, a similarity calculation module 510 and a similarity judgment module 520, where the similarity calculation module 510 is coupled to the feature extraction device 300 and to the similarity judgment module 520.
The feature extraction device 300 is adapted to extract the first feature set of the first document and the second feature set of the second document, where the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document, and the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document.
The similarity calculation module 510 is adapted to calculate the similarity between the first feature set and the second feature set.
The similarity judgment module 520 is adapted to consider the first document and the second document related when it judges that the similarity reaches the preset range.
According to one embodiment of the present invention, the similarity calculation module 510 further includes a similarity calculation unit, for calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set.
The similarity judgment module 520 is further adapted to judge two corresponding data fingerprints similar when the Hamming distance is greater than the first threshold. Specifically, for the single matching mode, the similarity judgment module 520 is adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when any one data feature string is judged similar.
For the benchmark matching mode, the similarity judgment module 520 may further include a statistics unit, for counting the number of data feature strings in the first feature set that are similar to those of the second feature set and calculating the ratio of that number to the total number of data feature strings in the first feature set; the similarity judgment module 520 is further adapted to consider that the similarity between the first feature set and the second feature set reaches the preset range when the ratio reaches the second threshold.
According to yet another embodiment of the present invention, in the whole matching mode, the similarity calculation unit is further adapted to calculate the Jaccard coefficient between all data feature strings of the first feature set and all data feature strings of the second feature set; when the Jaccard coefficient reaches the third threshold, the similarity judgment module 520 determines that the similarity between the first feature set and the second feature set reaches the preset range. The Jaccard coefficient characterizes the degree of correlation between two sets:
Jaccard = |S ∩ T| / |S ∪ T|,
where S denotes the first feature set and T denotes the second feature set.
Fig. 6 shows a flowchart of a method 600 for judging whether a suspicious document contains sensitive content, according to an embodiment of the present invention. As shown in Fig. 6, the method starts at step S610, in which the steps of method 200 are performed on the secure documents, the data features of these documents are extracted, and a feature database is established, where the feature database contains the first data fingerprints and second data fingerprints of all secure documents.
Then, in step S620, the steps of method 400 are performed on the suspicious document. During the execution of method 400, the data features of the suspicious document are extracted as the second feature set, and the feature database obtained in the previous step is used as the first feature set, in order to judge whether the suspicious document is related to the secure documents.
Then, in step S630, if the suspicious document is judged to be related to the secure documents, the suspicious document is considered to contain sensitive content; if the suspicious document is judged to be unrelated to the secure documents, the suspicious document is considered not to contain sensitive content.
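The flow of steps S610 to S630 can be sketched as below. Feature extraction (method 200) and the relatedness test (method 400) are passed in as functions; `toy_extract` and `toy_related` are deliberately trivial stand-ins for illustration and do not represent the fingerprinting or matching described in the text. All names are hypothetical.

```python
from typing import Callable, Iterable, List

Extract = Callable[[str], List[str]]               # stand-in for method 200
Related = Callable[[List[str], List[str]], bool]   # stand-in for method 400


def build_feature_library(secure_docs: Iterable[str], extract: Extract) -> List[str]:
    """Step S610: collect the data features of every secure document."""
    library: List[str] = []
    for doc in secure_docs:
        library.extend(extract(doc))
    return library


def contains_sensitive_content(suspicious_doc: str, library: List[str],
                               extract: Extract, related: Related) -> bool:
    """Steps S620-S630: the library is the first feature set, the suspicious document the second."""
    return related(library, extract(suspicious_doc))


# Toy stand-ins: words act as "features", and any shared word counts as "related".
def toy_extract(doc: str) -> List[str]:
    return doc.lower().split()


def toy_related(first: List[str], second: List[str]) -> bool:
    return bool(set(first) & set(second))
```

Building the library once and reusing it for every suspicious document means that only the suspicious document has to be fingerprinted at check time.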
Correspondingly, Fig. 7 shows a schematic diagram of a device for judging whether a suspicious document contains sensitive content, which implements method 600 according to an embodiment of the present invention, that is, the sensitive content judgment device 700 described with reference to Fig. 1. The device 700 includes: the feature extraction device 300, a storage module 710, the document relevance judgment device 500 and a determining module 720. According to one embodiment of the present invention, the feature extraction device 300 may also be arranged within the document relevance judgment device 500.
As described above, the feature extraction device 300 is adapted to extract data features from the secure documents, and is further adapted to extract the data features of the suspicious document as the second feature set.
The storage module 710 is adapted to store the data features of the secure documents as a feature database, where the feature database contains the first data fingerprints and second data fingerprints of the secure documents.
The document relevance judgment device 500 is adapted to judge whether the suspicious document is related to the secure documents in the feature database; and
the determining module 720 is adapted to determine that the suspicious document contains sensitive content when the suspicious document is judged to be related to the secure documents, and to determine that the suspicious document does not contain sensitive content when the suspicious document is judged to be unrelated to the secure documents.
In summary, in the method and system for leakage prevention according to the present invention, the document feature extraction method provided extracts the data features of a document more conveniently and captures as much data feature information as possible. In addition, the 3 matching modes of single matching, benchmark matching and whole matching allow sensitive data to be matched comprehensively, so that various means of leaking documents can be effectively countered.
It should be appreciated that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single embodiment disclosed above. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art should understand that the modules, units or components of the devices in the examples disclosed herein can be arranged in a device as described in the embodiment, or alternatively can be located in one or more devices different from the device in the example. The modules in the foregoing examples can be combined into one module or, furthermore, divided into multiple submodules.
Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment can be combined into one module, unit or component, and furthermore they can be divided into multiple submodules, subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
A5. The method of any one of A2-4, wherein the step of calculating the data feature string of a word block based on the data content in the word block includes: converting the words in each word block of the third predetermined length into characters, obtaining the corresponding character string as the word block; successively selecting sub-word-blocks of the fifth predetermined length from the word block, where adjacent sub-word-blocks overlap by the sixth predetermined length; for each sub-word-block, calculating a feature value list of the seventh predetermined length from the content of the sub-word-block; and constructing the data feature string of the word block based on the feature value lists of all sub-word-blocks.
A6. The method of A4 or 5, wherein the step of calculating the feature value list of the seventh predetermined length from the content of the data sub-block or sub-word-block includes: extracting one or more content subsets formed from part of the content of the data sub-block or sub-word-block; hashing each content subset, using a hash algorithm, to a value between 0 and the seventh predetermined length; and setting the corresponding value in the feature value list of the seventh predetermined length according to the value obtained for each content subset.
A7. The method of A6, wherein the step of constructing the data feature string of the data block or word block based on the feature value lists includes: superimposing and merging the values at corresponding positions of the feature value lists of the data sub-blocks or sub-word-blocks, to obtain the feature value list of the corresponding data block or word block; binarizing the value of each cell in that feature value list, to obtain a feature value list in which every cell is 0 or 1; and converting the feature value list of the seventh predetermined length into a number string of the seventh predetermined length, which serves as the data feature string of the data block or word block.
A8. The method of A7, wherein the step of binarizing the value of each cell in the feature value list includes: calculating the average of all cell values in the feature value list; comparing the value of each cell with that average; and, if the value of a cell is greater than the average, setting the cell to 1, and if the value of a cell is not greater than the average, setting the cell to 0.
A9. The method of any one of A1-8, wherein the first predetermined length is 512 bytes and the second predetermined length is 256 bytes; the third predetermined length is 64 words and the fourth predetermined length is 32 words; the fifth predetermined length is 5 bytes and the sixth predetermined length is 1 byte; and the seventh predetermined length is 32 bytes.
A10. The method of any one of A1-9, wherein the first data fingerprint further includes the offset position information of the data blocks in the document.
A11. The method of any one of A2-10, wherein the second data fingerprint further includes the offset position information of the word blocks in the document.
B13. The device of B12, further comprising: a word segmentation module adapted to perform word segmentation on the document to obtain a word sequence; wherein the blocking module is further adapted to divide the word sequence into blocks in order, to obtain one or more word blocks of a third predetermined length, adjacent word blocks overlapping by a fourth predetermined length; the computing module is further adapted to calculate, for each of the one or more word blocks obtained, a data feature string of the word block based on the data content of that word block; and the feature extraction module is further adapted to combine the data feature strings of the word blocks to construct a second data fingerprint of the document as a data feature of the document. B14. The device of B13, wherein the word segmentation module is further adapted to perform word segmentation with a dictionary-based segmentation algorithm, the algorithm including one dictionary, two matching algorithms and four disambiguation rules. B15. The device of any one of B12-14, wherein the computing module comprises: a blocking unit adapted to successively select data sub-blocks of a fifth predetermined length in the data block, adjacent data sub-blocks overlapping by a sixth predetermined length; and a computing unit adapted to calculate, for each data sub-block, a feature value list of a seventh predetermined length according to the content of the data sub-block, and to construct the data feature string of the data block based on the feature value lists of all data sub-blocks. B16. The device of any one of B13-15, wherein the computing module further comprises: a character conversion unit adapted to convert the words in each word block of the third predetermined length into characters, the resulting character string serving as the word block; the blocking unit is further adapted to successively select sub-word blocks of the fifth predetermined length in the word block, adjacent sub-word blocks overlapping by the sixth predetermined length; and the computing unit is further adapted to calculate, for each sub-word block, a feature value list of the seventh predetermined length according to the content of the sub-word block, and to construct the data feature string of the word block based on the feature value lists of all sub-word blocks. B17. The device of B15 or B16, wherein the computing unit comprises: an extraction subunit adapted to extract one or more content subsets, each consisting of part of the content of the data sub-block or sub-word block; and the computing unit is further adapted to hash each content subset, using a hash algorithm, to a value between 0 and the seventh predetermined length, and to set the corresponding value in the feature value list of the seventh predetermined length according to the value obtained for each content subset. B18. The device of B17, wherein the computing unit further comprises: a counting subunit adapted to merge the feature value lists corresponding to the data sub-blocks or sub-word blocks by superimposing the values at corresponding positions, to obtain the feature value list of the corresponding data block or word block; and a binarization subunit adapted to binarize the value of each cell in that feature value list, to obtain a feature value list in which every cell value is 0 or 1; and the computing unit is further adapted to convert the feature value list of the seventh predetermined length into a numeric string of the seventh predetermined length, as the data feature string of the data block or word block. B19. The device of B18, wherein the binarization subunit is further adapted to calculate the average of all cell values in the feature value list and to compare the value of each cell with the average; if the value of a cell is greater than the average, the cell is set to 1, and if the value of a cell is not greater than the average, the cell is set to 0. B20. The device of any one of B12-19, wherein the first predetermined length is 512 bytes and the second predetermined length is 256 bytes; the third predetermined length is 64 words and the fourth predetermined length is 32 words; the fifth predetermined length is 5 bytes and the sixth predetermined length is 1 byte; and the seventh predetermined length is 32 bytes. B21. The device of any one of B12-20, wherein the feature extraction module is further adapted to extract offset position information of the data block in the document, to be included in the first data fingerprint. B22. The device of any one of B13-21, wherein the feature extraction module is further adapted to extract offset position information of the word block in the document, to be included in the second data fingerprint.
C24. The judgment method of C23, wherein the step of calculating the similarity between the first feature set and the second feature set comprises: calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and determining that the two corresponding data feature strings are similar when the Hamming distance is greater than a first threshold. C25. The method of C24, further comprising the step of: if any data feature string is judged to be similar, deeming the similarity between the first feature set and the second feature set to reach the predetermined range. C26. The method of C24, further comprising the steps of: counting the number of data feature strings in the first feature set that are similar to the second feature set; calculating the ratio of that number to the total number of data feature strings in the first feature set; and if the ratio reaches a second threshold, deeming the similarity between the first feature set and the second feature set to reach the predetermined range. C27. The method of C23, wherein the step of calculating the similarity between the first feature set and the second feature set further comprises: calculating the Jaccard coefficient of all data feature strings of the first feature set and all data feature strings of the second feature set; and when the Jaccard coefficient reaches a third threshold, deeming the similarity between the first feature set and the second feature set to reach the predetermined range. C28. The method of C27, wherein the Jaccard coefficient is Jaccard = |S ∩ T| / |S ∪ T|, where S denotes the first feature set and T denotes the second feature set.
D30. The judgment device of D29, wherein the similarity calculation module further comprises: a similarity calculation unit adapted to calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and the similarity judgment module is further adapted to determine that the two corresponding data feature strings are similar when the Hamming distance is greater than a first threshold. D31. The judgment device of D30, wherein the similarity judgment module is further adapted to deem the similarity between the first feature set and the second feature set to reach the predetermined range when a data feature string is judged to be similar. D32. The judgment device of D30, wherein the similarity judgment module further comprises: a statistics unit adapted to count the number of data feature strings in the first feature set that are similar to the second feature set, and to calculate the ratio of that number to the total number of data feature strings in the first feature set; and the similarity judgment module is further adapted to deem the similarity between the first feature set and the second feature set to reach the predetermined range when the ratio reaches a second threshold. D33. The judgment device of D30, wherein the similarity calculation unit is further adapted to calculate the Jaccard coefficient of all data feature strings of the first feature set and all data feature strings of the second feature set; and the similarity judgment module is further adapted to deem the similarity between the first feature set and the second feature set to reach the predetermined range when the Jaccard coefficient reaches a third threshold. D34. The judgment device of D33, wherein the Jaccard coefficient is Jaccard = |S ∩ T| / |S ∪ T|, where S denotes the first feature set and T denotes the second feature set.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or method element forms a means for carrying out the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, rather than to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is illustrative, not restrictive, with respect to the scope of the invention, which is defined by the appended claims.
Claims (37)
1. A method for extracting data features from a document, comprising the steps of:
dividing the data in the document into blocks in order, to obtain one or more data blocks of a first predetermined length, wherein adjacent data blocks overlap by a second predetermined length;
for each of the one or more data blocks obtained, calculating a data feature string of the data block based on the data content of that data block; and
combining the data feature strings of the data blocks to construct a first data fingerprint of the document as a data feature of the document,
wherein the first predetermined length and the second predetermined length are set based on the importance of the document: the higher the importance of the document, the smaller the first predetermined length and the second predetermined length.
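The overlapping blocking step of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `split_into_blocks` is hypothetical, and the default lengths of 512 and 256 bytes are the example values given in claim 9.

```python
def split_into_blocks(data: bytes, block_len: int = 512, overlap: int = 256):
    """Yield (offset, block) pairs; adjacent blocks share `overlap` bytes.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    if not 0 <= overlap < block_len:
        raise ValueError("overlap must satisfy 0 <= overlap < block_len")
    if not data:
        return
    step = block_len - overlap  # distance between the starts of adjacent blocks
    offset = 0
    while True:
        yield offset, data[offset:offset + block_len]
        if offset + block_len >= len(data):
            break  # the current block already reaches the end of the data
        offset += step


blocks = list(split_into_blocks(bytes(range(256)) * 4))
```

Halving both lengths halves the step between block starts, so a more important document is fingerprinted at a finer granularity, in line with the rule stated in the claim.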
2. The method of claim 1, further comprising the steps of:
performing word segmentation on the document to obtain a word sequence;
dividing the word sequence of the document into blocks in order, to obtain one or more word blocks of a third predetermined length, wherein adjacent word blocks overlap by a fourth predetermined length;
for each of the one or more word blocks obtained, calculating a data feature string of the word block based on the data content of that word block; and
combining the data feature strings of the word blocks to construct a second data fingerprint of the document as a data feature of the document.
3. The method of claim 2, wherein the step of performing word segmentation on the document comprises:
performing word segmentation using a dictionary-based segmentation algorithm, wherein the segmentation algorithm includes one dictionary, two matching algorithms and four disambiguation rules.
4. The method of claim 3, wherein the step of calculating the data feature string of the data block based on the data content of the data block comprises:
successively selecting data sub-blocks of a fifth predetermined length in the data block, wherein adjacent data sub-blocks overlap by a sixth predetermined length;
for each data sub-block, calculating a feature value list of a seventh predetermined length according to the content of the data sub-block; and
constructing the data feature string of the data block based on the feature value lists of all data sub-blocks.
5. The method of claim 4, wherein the step of calculating the data feature string of the word block based on the data content of the word block comprises:
converting the words in each word block of the third predetermined length into characters, the resulting character string serving as the word block;
successively selecting sub-word blocks of the fifth predetermined length in the word block, wherein adjacent sub-word blocks overlap by the sixth predetermined length;
for each sub-word block, calculating a feature value list of the seventh predetermined length according to the content of the sub-word block; and
constructing the data feature string of the word block based on the feature value lists of all sub-word blocks.
6. The method of claim 5, wherein the step of calculating the feature value list of the seventh predetermined length according to the content of the data sub-block or sub-word block comprises:
extracting one or more content subsets, each consisting of part of the content of the data sub-block or sub-word block;
hashing each content subset, using a hash algorithm, to a value between 0 and the seventh predetermined length; and
setting the corresponding value in the feature value list of the seventh predetermined length according to the value obtained for each content subset.
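The sub-block selection and hashing steps of claims 4 to 6 might look like the sketch below. Several details are assumptions, since the claims leave them open: the content subsets are taken to be all 3-byte combinations of a sub-block, `zlib.crc32` stands in for the unspecified hash algorithm, "setting the corresponding value" is read as incrementing a counter, and the list is given 32 cells. The function names are hypothetical.

```python
import zlib
from itertools import combinations


def iter_sub_blocks(block: bytes, size: int = 5, overlap: int = 1):
    """Yield sub-blocks of `size` bytes; neighbours share `overlap` bytes."""
    step = size - overlap
    i = 0
    while i < len(block):
        yield block[i:i + size]
        if i + size >= len(block):
            break
        i += step


def feature_value_list(sub_block: bytes, n_cells: int = 32) -> list:
    """Hash every 3-byte subset of the sub-block into one of `n_cells` counters.

    Assumption: subsets are 3-byte combinations and a hit increments the cell.
    """
    cells = [0] * n_cells
    for subset in combinations(sub_block, 3):
        index = zlib.crc32(bytes(subset)) % n_cells  # value in [0, n_cells)
        cells[index] += 1
    return cells


subs = list(iter_sub_blocks(b"abcdefghi"))
cells = feature_value_list(subs[0])
```

A 5-byte sub-block has ten 3-byte combinations, so the counters of one feature value list always sum to ten under these assumptions.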
7. The method of claim 6, wherein the step of constructing the data feature string of the data block or word block based on the feature value lists comprises:
merging the feature value lists corresponding to the data sub-blocks or sub-word blocks by superimposing the values at corresponding positions, to obtain the feature value list of the corresponding data block or word block;
binarizing the value of each cell in that feature value list, to obtain a feature value list in which every cell value is 0 or 1; and
converting the feature value list of the seventh predetermined length into a numeric string of the seventh predetermined length, as the data feature string of the data block or word block.
8. The method of claim 7, wherein the step of binarizing the value of each cell in the feature value list comprises:
calculating the average of all cell values in the feature value list;
comparing the value of each cell with the average; and
if the value of a cell is greater than the average, setting the cell to 1, and if the value of a cell is not greater than the average, setting the cell to 0.
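The merging, binarization and conversion steps of claims 7 and 8 can be illustrated with a short Python sketch. The function names are hypothetical, and representing the resulting numeric string as a string of "0" and "1" characters is an assumption; the claims do not fix the encoding.

```python
def merge_feature_lists(lists):
    """Superimpose the lists by summing the values at each position."""
    return [sum(column) for column in zip(*lists)]


def binarize(cells):
    """Set a cell to 1 if it is greater than the mean of all cells, else 0."""
    mean = sum(cells) / len(cells)
    return [1 if value > mean else 0 for value in cells]


def to_feature_string(bits):
    """Render the 0/1 list as a string (assumed encoding)."""
    return "".join(str(bit) for bit in bits)


merged = merge_feature_lists([[1, 0, 3, 0], [2, 0, 1, 0]])  # [3, 0, 4, 0]
feature_string = to_feature_string(binarize(merged))         # "1010"
```

Because the comparison is strictly "greater than", a cell whose value equals the average becomes 0, which is what claim 8 specifies.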
9. The method of claim 8, wherein
the first predetermined length is 512 bytes and the second predetermined length is 256 bytes;
the third predetermined length is 64 words and the fourth predetermined length is 32 words;
the fifth predetermined length is 5 bytes and the sixth predetermined length is 1 byte; and
the seventh predetermined length is 32 bytes.
10. The method of any one of claims 1-9, wherein
the first data fingerprint further includes offset position information of the data block in the document.
11. The method of any one of claims 2-9, wherein
the second data fingerprint further includes offset position information of the word block in the document.
12. A device for extracting data features from a document, the device comprising:
a blocking module adapted to divide the data in the document into blocks in order, to obtain one or more data blocks of a first predetermined length, wherein adjacent data blocks overlap by a second predetermined length;
a computing module adapted to calculate, for each of the one or more data blocks obtained, a data feature string of the data block based on the data content of that data block; and
a feature extraction module adapted to combine the data feature strings of the data blocks to construct a first data fingerprint of the document as a data feature of the document,
wherein the first predetermined length and the second predetermined length are set based on the importance of the document: the higher the importance of the document, the smaller the first predetermined length and the second predetermined length.
13. The device of claim 12, further comprising:
a word segmentation module adapted to perform word segmentation on the document to obtain a word sequence;
wherein the blocking module is further adapted to divide the word sequence into blocks in order, to obtain one or more word blocks of a third predetermined length, adjacent word blocks overlapping by a fourth predetermined length;
the computing module is further adapted to calculate, for each of the one or more word blocks obtained, a data feature string of the word block based on the data content of that word block; and
the feature extraction module is further adapted to combine the data feature strings of the word blocks to construct a second data fingerprint of the document as a data feature of the document.
14. The device of claim 13, wherein the word segmentation module is further adapted to perform word segmentation with a dictionary-based segmentation algorithm, wherein the segmentation algorithm includes one dictionary, two matching algorithms and four disambiguation rules.
15. The device of claim 14, wherein the computing module comprises:
a blocking unit adapted to successively select data sub-blocks of a fifth predetermined length in the data block, wherein adjacent data sub-blocks overlap by a sixth predetermined length; and
a computing unit adapted to calculate, for each data sub-block, a feature value list of a seventh predetermined length according to the content of the data sub-block, and to construct the data feature string of the data block based on the feature value lists of all data sub-blocks.
16. The device of claim 15, wherein the computing module further comprises:
a character conversion unit adapted to convert the words in each word block of the third predetermined length into characters, the resulting character string serving as the word block;
the blocking unit is further adapted to successively select sub-word blocks of the fifth predetermined length in the word block, wherein adjacent sub-word blocks overlap by the sixth predetermined length; and
the computing unit is further adapted to calculate, for each sub-word block, a feature value list of the seventh predetermined length according to the content of the sub-word block, and to construct the data feature string of the word block based on the feature value lists of all sub-word blocks.
17. The device of claim 16, wherein the computing unit comprises:
an extraction subunit adapted to extract one or more content subsets, each consisting of part of the content of the data sub-block or sub-word block; and
the computing unit is further adapted to hash each content subset, using a hash algorithm, to a value between 0 and the seventh predetermined length, and to set the corresponding value in the feature value list of the seventh predetermined length according to the value obtained for each content subset.
18. The device of claim 17, wherein the computing unit further comprises:
a counting subunit adapted to merge the feature value lists corresponding to the data sub-blocks or sub-word blocks by superimposing the values at corresponding positions, to obtain the feature value list of the corresponding data block or word block;
a binarization subunit adapted to binarize the value of each cell in that feature value list, to obtain a feature value list in which every cell value is 0 or 1; and
the computing unit is further adapted to convert the feature value list of the seventh predetermined length into a numeric string of the seventh predetermined length, as the data feature string of the data block or word block.
19. The device of claim 18, wherein
the binarization subunit is further adapted to calculate the average of all cell values in the feature value list and to compare the value of each cell with the average; if the value of a cell is greater than the average, the cell is set to 1, and if the value of a cell is not greater than the average, the cell is set to 0.
20. The device of claim 19, wherein
the first predetermined length is 512 bytes and the second predetermined length is 256 bytes;
the third predetermined length is 64 words and the fourth predetermined length is 32 words;
the fifth predetermined length is 5 bytes and the sixth predetermined length is 1 byte; and
the seventh predetermined length is 32 bytes.
21. The device of any one of claims 12-20, wherein
the feature extraction module is further adapted to extract offset position information of the data block in the document, to be included in the first data fingerprint.
22. The device of any one of claims 13-20, wherein
the feature extraction module is further adapted to extract offset position information of the word block in the document, to be included in the second data fingerprint.
23. A judgment method for judging whether a first document and a second document are related, the method comprising the steps of:
executing the method of any one of claims 1-11 on the first document to extract the data features of the document and obtain a first feature set, wherein the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document;
executing the method of any one of claims 1-11 on the second document to extract the data features of the document and obtain a second feature set, wherein the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document; and
calculating the similarity between the first feature set and the second feature set, and if the similarity reaches a predetermined range, deeming the first document and the second document to be related.
24. The judgment method of claim 23, wherein the step of calculating the similarity between the first feature set and the second feature set comprises:
calculating the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and
when the Hamming distance is greater than a first threshold, determining that the two corresponding data feature strings are similar.
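A Hamming distance comparison between two equal-length data feature strings can be sketched as below. The function names are hypothetical. One caveat about the comparison direction: the claim as translated says strings are similar when the distance is "greater than" the first threshold, whereas a smaller Hamming distance conventionally means the strings are closer. This sketch assumes the conventional reading (similar when the distance does not exceed `max_distance`); invert the comparison if the literal wording is intended.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("feature strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def strings_similar(a: str, b: str, max_distance: int) -> bool:
    """Assumed reading: similar when the distance does not exceed the threshold."""
    return hamming_distance(a, b) <= max_distance


distance = hamming_distance("1010", "1001")  # differs in the last two positions
```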
25. The method of claim 24, further comprising the step of:
if any data feature string is judged to be similar, deeming the similarity between the first feature set and the second feature set to reach the predetermined range.
26. The method of claim 24, further comprising the steps of:
counting the number of data feature strings in the first feature set that are similar to the second feature set;
calculating the ratio of that number to the total number of data feature strings in the first feature set; and
if the ratio reaches a second threshold, deeming the similarity between the first feature set and the second feature set to reach the predetermined range.
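The two decision rules of claims 25 and 26 can be contrasted in a small Python sketch. It rests on assumptions: a feature string of the first set counts as similar if any string of the second set lies within `max_distance` (the claims speak of the "corresponding" fingerprint without defining the pairing), similarity is read as a small Hamming distance, and all names are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    if len(a) != len(b):
        raise ValueError("feature strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def count_similar(first_set, second_set, max_distance):
    """Number of strings in first_set with a near match in second_set."""
    return sum(
        1
        for s in first_set
        if any(hamming_distance(s, t) <= max_distance for t in second_set)
    )


def related_by_any(first_set, second_set, max_distance):
    """Claim 25 style rule: one similar string is enough."""
    return count_similar(first_set, second_set, max_distance) > 0


def related_by_ratio(first_set, second_set, max_distance, min_ratio):
    """Claim 26 style rule: the share of similar strings must reach min_ratio."""
    if not first_set:
        return False
    return count_similar(first_set, second_set, max_distance) / len(first_set) >= min_ratio


first = ["0000", "1111", "1010"]
second = ["0001", "0101"]
```

With these inputs and `max_distance=1`, only "0000" has a near match, so the any-match rule reports the sets as related while the ratio rule does so only for a `min_ratio` of at most one third.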
27. The method of claim 23, wherein the step of calculating the similarity between the first feature set and the second feature set further comprises:
calculating the Jaccard coefficient of all data feature strings of the first feature set and all data feature strings of the second feature set; and
when the Jaccard coefficient reaches a third threshold, deeming the similarity between the first feature set and the second feature set to reach the predetermined range.
28. The method of claim 27, wherein the Jaccard coefficient is:
Jaccard = |S ∩ T| / |S ∪ T|,
where S denotes the first feature set and T denotes the second feature set.
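The Jaccard coefficient defined above maps directly onto Python set operations. In this minimal sketch the function name is hypothetical, and returning 0.0 when both sets are empty is an assumption made to avoid division by zero; the claims do not address that case.

```python
def jaccard(s, t) -> float:
    """Return |S ∩ T| / |S ∪ T| for two collections of feature strings."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 0.0  # assumption: two empty feature sets are treated as unrelated
    return len(s & t) / len(union)


coefficient = jaccard({"a", "b", "c"}, {"b", "c", "d"})  # 2 shared of 4 total
```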
29. A judgment device for judging whether a first document and a second document are related, the device comprising:
the device for extracting data features from a document of any one of claims 12-22, adapted to extract a first feature set of the first document and a second feature set of the second document, respectively, wherein
the first feature set includes the first data fingerprint and/or the second data fingerprint of the first document, and
the second feature set includes the first data fingerprint and/or the second data fingerprint of the second document;
a similarity calculation module adapted to calculate the similarity between the first feature set and the second feature set; and
a similarity judgment module adapted to deem the first document and the second document to be related when the similarity is judged to reach a predetermined range.
30. The judgment device of claim 29, wherein the similarity calculation module further comprises:
a similarity calculation unit adapted to calculate the Hamming distance between the data feature string of each data fingerprint in the first feature set and the data feature string of the corresponding data fingerprint in the second feature set; and
the similarity judgment module is further adapted to determine that the two corresponding data feature strings are similar when the Hamming distance is greater than a first threshold.
31. The judgment device of claim 30, wherein
the similarity judgment module is further adapted to deem the similarity between the first feature set and the second feature set to reach the predetermined range when a data feature string is judged to be similar.
32. The judgment device of claim 30, wherein the similarity judgment module further comprises:
a statistics unit adapted to count the number of data feature strings in the first feature set that are similar to the second feature set, and to calculate the ratio of that number to the total number of data feature strings in the first feature set; and
the similarity judgment module is further adapted to deem the similarity between the first feature set and the second feature set to reach the predetermined range when the ratio reaches a second threshold.
33. The judgment device of claim 30, wherein
the similarity calculation unit is further adapted to calculate the Jaccard coefficient of all data feature strings of the first feature set and all data feature strings of the second feature set; and
the similarity judgment module is further adapted to deem the similarity between the first feature set and the second feature set to reach the predetermined range when the Jaccard coefficient reaches a third threshold.
34. The judgment device of claim 33, wherein the Jaccard coefficient is:
Jaccard = |S ∩ T| / |S ∪ T|,
where S denotes the first feature set and T denotes the second feature set.
35. A method for judging whether a suspicious document contains sensitive content, the method comprising the steps of:
executing the method of any one of claims 1-11 on a protected document to extract the data features of the document and build a feature library, wherein the feature library includes the first data fingerprint and the second data fingerprint of the protected document;
executing the judgment method of any one of claims 23-28 on the suspicious document, wherein the data features extracted from the suspicious document serve as the second feature set and the feature library serves as the first feature set;
if the suspicious document is judged to be related to the protected document, deeming the suspicious document to contain sensitive content; and
if the suspicious document is judged to be unrelated to the protected document, deeming the suspicious document not to contain sensitive content.
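The overall flow of claim 35 (build a feature library from protected documents, then test a suspicious document against it) can be sketched as follows. Everything here is a simplification under stated assumptions: `extract_features` is a caller-supplied stand-in for the fingerprinting method of claims 1-11, exact matching of features replaces the Hamming comparison, the any-match rule of claim 25 decides relatedness, and the 4-character shingle extractor in the usage lines is purely illustrative.

```python
def build_feature_library(protected_docs, extract_features):
    """Collect the features of every protected document into one set."""
    library = set()
    for doc in protected_docs:
        library |= set(extract_features(doc))
    return library


def contains_sensitive_content(suspicious_doc, library, extract_features):
    """Assumed rule: any feature shared with the library marks the document."""
    return any(f in library for f in extract_features(suspicious_doc))


def shingles(text, size=4):
    """Illustrative extractor only: every run of `size` characters."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}


library = build_feature_library(["secret plan"], shingles)
flagged = contains_sensitive_content("the secret is out", library, shingles)
```

Passing the extractor as a parameter keeps the decision logic independent of how fingerprints are actually computed.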
36. A device for judging whether a suspicious document contains sensitive content, the device comprising:
the device for extracting data features from a document of any one of claims 12-22, adapted to extract data features from a protected document, and further adapted to extract the data features of the suspicious document as the second feature set;
a storage module adapted to store the data features of the protected document as a feature library, wherein the feature library includes the first data fingerprint and the second data fingerprint of the protected document;
the judgment device of any one of claims 29-34, adapted to judge whether the suspicious document is related to a protected document in the feature library; and
a determining module adapted to determine that the suspicious document contains sensitive content when the suspicious document is judged to be related to the protected document, and to determine that the suspicious document does not contain sensitive content when the suspicious document is judged to be unrelated to the protected document.
37. A leakage prevention system, comprising:
a computing device connected to a data security protection device; and
the data security protection device, comprising:
a document acquisition device adapted to obtain document content sent by the computing device;
the sensitive content judgment device of claim 36, adapted to judge whether the obtained document contains sensitive content;
a control policy acquisition device adapted to obtain, when it is being judged whether the document contains sensitive content, the control policy corresponding to the process associated with the document; and
a control device adapted to control operations on the document according to the obtained control policy when the suspicious document is judged to contain sensitive content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236738.7A CN105893859B (en) | 2016-04-15 | 2016-04-15 | Method and system for leakage prevention |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236738.7A CN105893859B (en) | 2016-04-15 | 2016-04-15 | Method and system for leakage prevention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105893859A CN105893859A (en) | 2016-08-24 |
CN105893859B true CN105893859B (en) | 2019-05-03 |
Family
ID=56705053
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610236738.7A Expired - Fee Related CN105893859B (en) | 2016-04-15 | 2016-04-15 | Method and system for leakage prevention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105893859B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109583233A (en) * | 2018-11-23 | 2019-04-05 | 阿里巴巴集团控股有限公司 | Data leak monitoring method and device |
CN111629027B (en) * | 2020-04-10 | 2023-06-23 | 云南电网有限责任公司信息中心 | Method for storing and processing trusted file based on blockchain |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103425639A (en) * | 2013-09-06 | 2013-12-04 | 广州一呼百应网络技术有限公司 | Similar information identifying method based on information fingerprints |
CN103514213A (en) * | 2012-06-28 | 2014-01-15 | 华为技术有限公司 | Term extraction method and device |
CN103632080A (en) * | 2013-11-06 | 2014-03-12 | 国家电网公司 | Mobile data application safety protection system and mobile data application safety protection method based on USBKey |
CN104142969A (en) * | 2013-11-27 | 2014-11-12 | 北京星网锐捷网络技术有限公司 | Data segmentation processing method and device |
CN104506545A (en) * | 2014-12-30 | 2015-04-08 | 北京奇虎科技有限公司 | Data leakage prevention method and data leakage prevention device |
Also Published As
Publication number | Publication date |
---|---|
CN105893859A (en) | 2016-08-24 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
KR101627592B1 (en) | Detection of confidential information | |
JP5727027B2 (en) | System and method for protecting specific combinations of data | |
US9692762B2 (en) | Systems and methods for efficient detection of fingerprinted data and information | |
CN110825757B (en) | Equipment behavior risk analysis method and system | |
Isaac | Hope, hype, and fear: the promise and potential pitfalls of artificial intelligence in criminal justice | |
CN105955978B (en) | Method and system for leakage prevention | |
CA2710614C (en) | Intrusion detection systems and methods | |
CN105893859B (en) | Method and system for leakage prevention | |
Ma et al. | An API Semantics‐Aware Malware Detection Method Based on Deep Learning | |
CN110490750B (en) | Data identification method, system, electronic equipment and computer storage medium | |
JP6777612B2 (en) | Systems and methods to prevent data loss in computer systems | |
CN105956482B (en) | Method and system for leakage prevention | |
CN115314236A (en) | System and method for detecting phishing domains in a Domain Name System (DNS) record set | |
CN105844118A (en) | Methods and system for preventing data leakage | |
CN109359481A (en) | It is a kind of based on BK tree anti-collision search about subtract method | |
CN115563288B (en) | Text detection method and device, electronic equipment and storage medium | |
CN112948887B (en) | Social engineering defense method based on confrontation sample generation | |
CN113923011A (en) | Phishing early warning method and device, computer equipment and storage medium | |
CN114491563A (en) | Method for acquiring risk level of information security event and related device | |
CN116431754A (en) | Keyword extraction method, keyword extraction device, keyword extraction equipment and computer readable medium | |
Geng et al. | Using data mining methods to predict personally identifiable information in emails | |
Bal et al. | Towards a content-based defense against text ddos in 9–1-1 emergency systems | |
Freeman | Have New South Wales criminal courts become more lenient in the past 20 years? | |
Sutton | Whose lives matter? War, the media, and structural inequities & racism | |
CN117201104A (en) | Log processing method, device, equipment and medium |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20200424. Address after: Room 101 and 102, floor 1, building 103, No. 3, minzhuang Road, Haidian District, Beijing 100093. Patentee after: Baolixintong Science and Technology Co.,Ltd. Beijing. Address before: 100086, A, building 1, building 48, No. 3 West Third Ring Road, Haidian District, Beijing, 23E. Patentee before: POLY DATA (BEIJING) DATA TECHNOLOGY Co.,Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190503 |