CN103425639A - Similar information identifying method based on information fingerprints - Google Patents
Similar information identifying method based on information fingerprints
- Publication number: CN103425639A
- Authority: CN (China)
- Prior art keywords: information, recognition methods, information fingerprint, word, methods based
- Prior art date
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Collating Specific Patterns (AREA)
Abstract
The invention discloses a similar information identifying method based on information fingerprints. According to the method, firstly, Chinese word segmenting is carried out on texts of each document, word frequency is calculated, and words with high word frequency are taken out to serve as feature values; according to the extracted feature values, an information fingerprint of each document is calculated, the information fingerprints of the two documents are compared, and if a comparison result is larger than a threshold value, then the result that the two documents are similar can be judged. According to the method, the situation that in the prior art, all information of two documents needs to be calculated and compared can be avoided, and calculation complexity is greatly reduced; due to the fact that the information fingerprint of each document is unique, when similarity of multiple documents is judged, what is needed is to compare the information fingerprints, and work efficiency can be effectively improved.
Description
Technical field
The present invention relates to a method for identifying similar information based on information fingerprints.
Background art
Existing methods for identifying duplicate information mainly MD5-encode each piece of information and then compare the two MD5 values: if the values are identical, the two pieces of information are the same; if the values differ, the information is different. Existing methods for identifying similar information mainly split the two pieces of information into characters, compare the characters in order, and derive the similarity of the two pieces of information from the percentage of positions at which the characters match exactly.
The main drawback of the existing duplicate-detection technique is that it can only recognize strings that are exactly identical. If one of two otherwise identical pieces of information has an extra space or other character added, the program judges them not to be duplicates, so accuracy is low. The main drawback of the existing similarity method is that every comparison must go through all of the split characters, so the amount of computation is large and performance is very poor in a big-data environment.
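The two prior-art checks described above can be sketched as follows. This is only an illustration of the stated drawback, not code from the patent; the function names `md5_equal` and `positional_similarity` and the choice to normalize by the longer text are assumptions.

```python
import hashlib


def md5_equal(a: str, b: str) -> bool:
    """Exact-duplicate check: compare the MD5 digests of the two texts."""
    digest_a = hashlib.md5(a.encode("utf-8")).hexdigest()
    digest_b = hashlib.md5(b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


def positional_similarity(a: str, b: str) -> float:
    """Share of positions holding the same character (assumed: relative to the longer text)."""
    if not a and not b:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


original = "information fingerprint"
padded = " " + original  # the same text with one extra leading space

print(md5_equal(original, original))            # identical texts give identical digests
print(md5_equal(original, padded))              # one added space changes the digest
print(positional_similarity(original, padded))  # the shift also misaligns every position
```

A single inserted character defeats both checks, which is the weakness the fingerprint approach is meant to address.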
Summary of the invention
The purpose of this invention is to provide a method for identifying similar information that is highly accurate and suitable for a big-data environment.
The method of the present invention for identifying similar information based on information fingerprints comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging them to be similar articles.
The method of the present invention uses the highest-frequency words as feature values to calculate an information fingerprint, and uses that fingerprint to judge whether documents are similar. Compared with the existing duplicate-detection method, adding a small number of characters to one of the documents does not affect the result, so the accuracy of the judgment is improved. In addition, because the information fingerprint of a document is unique, judging the similarity of many documents only requires comparing their fingerprints with one another, which greatly mitigates the heavy computation and poor big-data performance of the existing similarity method.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention for identifying similar information based on information fingerprints.
Fig. 2 is a flowchart of the steps for calculating the information fingerprint of a document according to the present invention.
Embodiment
As shown in Fig. 1, the method for identifying similar information based on information fingerprints comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging them to be similar articles.
When 15 to 25 words are extracted as feature values, the performance requirements of the method are essentially met. Extensive sampling tests and calculation showed that taking 20 words is the best choice.
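The word-frequency step above can be sketched in a few lines. The patent does not name a word segmenter, so this sketch assumes the text has already been segmented into a list of words; the function name `top_features` is hypothetical, and the default of 20 follows the figure given above.

```python
from collections import Counter


def top_features(words, k=20):
    """Return the k most frequent words with their counts.

    `words` is assumed to be the output of a Chinese word segmenter,
    which is outside the scope of this sketch.
    """
    return Counter(words).most_common(k)


tokens = ["fingerprint", "document", "fingerprint", "hash", "fingerprint", "document"]
print(top_features(tokens, k=2))
```

The counts returned alongside each word can be reused later as the feature weights.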
As shown in Fig. 2, calculating the information fingerprint of a document comprises the following steps:
Applying a 64-bit polynomial hash operation to each extracted feature value, giving a 64-bit hash value;
Processing the 64-bit hash value bit by bit: if bit i of the hash value is 1, position i takes the weight of the feature; if bit i of the hash value is 0, position i takes the negative of the feature weight;
The weight of a feature is numerically equal to the number of times the word occurs;
After all feature values have been processed, the results for all feature values are added column by column, giving a sequence of 64 numbers. Finally, each position holding a positive number is set to 1 and each position holding a negative number is set to 0, which yields a 64-bit array of 0s and 1s, namely the information fingerprint of the information.
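The fingerprint calculation described above can be sketched as follows. The patent does not specify the polynomial hash, so the base 131 and the byte-wise UTF-8 encoding are assumptions, as are the names `poly_hash64` and `fingerprint`. The text also does not say how a column sum of exactly zero is handled; this sketch sets that bit to 0.

```python
from collections import Counter

MASK64 = (1 << 64) - 1  # keep every intermediate value within 64 bits


def poly_hash64(word: str, base: int = 131) -> int:
    """64-bit polynomial hash over the UTF-8 bytes (base 131 is an assumed choice)."""
    h = 0
    for byte in word.encode("utf-8"):
        h = (h * base + byte) & MASK64
    return h


def fingerprint(weighted_features) -> int:
    """Build a 64-bit fingerprint from (word, weight) pairs."""
    sums = [0] * 64
    for word, weight in weighted_features:
        h = poly_hash64(word)
        for i in range(64):
            # bit i set: add the weight; bit i clear: subtract the weight
            sums[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i, total in enumerate(sums):
        if total > 0:  # positive column sum -> 1; negative (or zero, by assumption) -> 0
            fp |= 1 << i
    return fp


# Weights are the word counts, as stated in the text.
features = Counter(["fingerprint", "document", "fingerprint", "hash"]).most_common(20)
print(f"{fingerprint(features):064b}")
```

Because each bit is decided by a weighted vote, a frequent word influences the fingerprint more than a rare one, and a small change to the feature set tends to flip only a few bits.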
A 64-bit hash operation is chosen because 64 bits give 2^64 possible combinations, which satisfies the requirement of the present invention on the repetition rate. With 32 bits the repetition rate would still be rather high, while 128 bits is too long and would affect computing performance, so the 64-bit hash operation is chosen as a compromise.
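The trade-off between hash widths can be made concrete with the standard birthday approximation, p ≈ 1 − exp(−n(n−1) / (2·2^b)), for n uniformly distributed b-bit hash values. This estimate is not part of the patent; reading "repetition rate" as the chance of a hash collision, and the sample size of one million values, are assumptions made for illustration.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation of the chance that any two of n_items hashes coincide."""
    x = n_items * (n_items - 1) / (2.0 * 2.0 ** bits)
    return -math.expm1(-x)  # equals 1 - exp(-x), accurate for very small x


for bits in (32, 64, 128):
    print(bits, collision_probability(1_000_000, bits))
```

Under these assumptions a collision is practically certain at 32 bits and very unlikely at 64 bits, while 128 bits lowers the probability further at the cost of longer values to store and compare.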
The information fingerprints of two documents are compared with an XNOR or an XOR operation; the number of 0s or 1s in the result allows the similarity of the two documents to be judged quickly.
When the information fingerprints of two documents are compared with XOR, the number of 1s in the result is counted. If there are none, the two pieces of information are exactly the same. The more 1s there are, the more the two pieces of information differ. In this method the threshold chosen for the judgment is 3: if the number of 1s is less than or equal to 3, the two pieces of information are judged to be similar.
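The XOR comparison above amounts to a Hamming distance between the two fingerprints. A minimal sketch, assuming fingerprints are held as Python integers; the names `hamming_distance` and `is_similar` are hypothetical, and the default threshold of 3 follows the text.

```python
def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Count the 1 bits in fp_a XOR fp_b, i.e. the positions where the fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def is_similar(fp_a: int, fp_b: int, threshold: int = 3) -> bool:
    """Judge two fingerprints similar when at most `threshold` bits differ."""
    return hamming_distance(fp_a, fp_b) <= threshold


print(hamming_distance(0b1011, 0b1000))  # two differing bits
print(is_similar(0b1011, 0b1000))
```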
When the information fingerprints of two documents are compared with XNOR, the number of 0s in the result is counted. If there are none, the two pieces of information are exactly the same. The more 0s there are, the more the two pieces of information differ. The threshold chosen for the judgment is 3: if the number of 0s is less than or equal to 3, the two pieces of information are judged to be similar.
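The XNOR variant can be sketched in the same way. Python integers are unbounded, so the result has to be masked to 64 bits before the 0s are counted; the mask and the names `xnor64`, `zero_count` and `is_similar_xnor` are assumptions of this sketch.

```python
MASK64 = (1 << 64) - 1  # restrict results to the 64 fingerprint bits


def xnor64(fp_a: int, fp_b: int) -> int:
    """Bitwise XNOR over 64 bits: a bit is 1 where the fingerprints agree."""
    return ~(fp_a ^ fp_b) & MASK64


def zero_count(fp_a: int, fp_b: int) -> int:
    """Number of 0 bits in the XNOR result, i.e. positions where the fingerprints differ."""
    return 64 - bin(xnor64(fp_a, fp_b)).count("1")


def is_similar_xnor(fp_a: int, fp_b: int, threshold: int = 3) -> bool:
    """Judge two fingerprints similar when the XNOR result has at most `threshold` zeros."""
    return zero_count(fp_a, fp_b) <= threshold


print(zero_count(0b1011, 0b1000))
print(is_similar_xnor(0b1011, 0b1000))
```

Counting 0s in the XNOR result gives the same number as counting 1s in the XOR result, so both comparisons lead to the same judgment.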
Claims (7)
1. A method for identifying similar information based on information fingerprints, characterized in that the method comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging them to be similar articles.
2. The method for identifying similar information based on information fingerprints according to claim 1, characterized in that calculating the information fingerprint of the document comprises the following steps:
Applying a 64-bit polynomial hash operation to each extracted feature value, giving a 64-bit hash value;
Processing the 64-bit hash value bit by bit: if bit i of the hash value is 1, position i takes the weight of the feature; if bit i of the hash value is 0, position i takes the negative of the feature weight;
After all feature values have been processed, adding the results for all feature values column by column to give a sequence of 64 numbers, then setting each position holding a positive number to 1 and each position holding a negative number to 0, which yields a 64-bit array of 0s and 1s.
3. The method for identifying similar information based on information fingerprints according to claim 2, characterized in that the weight of a feature is numerically equal to the number of times the word occurs.
4. The method for identifying similar information based on information fingerprints according to claim 1, characterized in that the number of words extracted as feature values is 15 to 25.
5. The method for identifying similar information based on information fingerprints according to claim 1, characterized in that the number of words extracted as feature values is 20.
6. The method for identifying similar information based on information fingerprints according to claim 1, characterized in that an XNOR logical operation is used when comparing the information fingerprints of two documents.
7. The method for identifying similar information based on information fingerprints according to claim 1, characterized in that an XOR operation is used when comparing the information fingerprints of two documents.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013104024655A CN103425639A (en) | 2013-09-06 | 2013-09-06 | Similar information identifying method based on information fingerprints |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103425639A true CN103425639A (en) | 2013-12-04 |
Family
ID=49650403
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013104024655A Pending CN103425639A (en) | 2013-09-06 | 2013-09-06 | Similar information identifying method based on information fingerprints |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103425639A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8140505B1 (en) * | 2005-03-31 | 2012-03-20 | Google Inc. | Near-duplicate document detection for web crawling |
CN103123618A (en) * | 2011-11-21 | 2013-05-29 | 北京新媒传信科技有限公司 | Text similarity obtaining method and device |
Non-Patent Citations (4)
Title |
---|
MIN LU等: "Rank hash similarity for fast similarity", 《INFORMATION PROCESSING & MANAGEMENT》, vol. 49, no. 1, 31 January 2013 (2013-01-31), pages 158 - 168 * |
段飞: "相似网页识别算法的研究与实现", 《中国优秀硕士学位论文全文数据库》, 15 September 2011 (2011-09-15) * |
胡可云等: "《数据挖掘理论与应用》", 30 April 2008, article "数据挖掘理论与应用", pages: 124-125 * |
董博等: "基于多SimHash指纹的近似文本检测", 《小型微型计算机系统》, vol. 32, no. 11, 30 November 2011 (2011-11-30), pages 2152 - 2157 * |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105260878A (en) * | 2015-09-23 | 2016-01-20 | 成都网安科技发展有限公司 | Auxiliary secret-level setting method and device |
CN105681046A (en) * | 2016-02-29 | 2016-06-15 | 郑州悉知信息科技股份有限公司 | UGC fingerprint signature determination method and device as well as UGC deduplication method and device |
CN105844214B (en) * | 2016-03-02 | 2019-06-21 | 华南理工大学 | A kind of information fingerprint extracting method of the multipath depth coding based on bit space |
CN105844214A (en) * | 2016-03-02 | 2016-08-10 | 华南理工大学 | Multi-path depth encoded information fingerprint extraction method based on bit space |
CN105844118B (en) * | 2016-04-15 | 2020-02-21 | 量子创新(北京)信息技术有限公司 | Method and system for data leakage protection |
CN105893859A (en) * | 2016-04-15 | 2016-08-24 | 宝利九章(北京)数据技术有限公司 | Data leakage prevention method and system |
CN105956482A (en) * | 2016-04-15 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for data leakage protection |
CN105955978B (en) * | 2016-04-15 | 2019-07-02 | 宝利九章(北京)数据技术有限公司 | Method and system for leakage prevention |
CN105955978A (en) * | 2016-04-15 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for data leakage protection |
CN105844118A (en) * | 2016-04-15 | 2016-08-10 | 宝利九章(北京)数据技术有限公司 | Methods and system for preventing data leakage |
CN105956482B (en) * | 2016-04-15 | 2019-06-04 | 宝利九章(北京)数据技术有限公司 | Method and system for leakage prevention |
CN105893859B (en) * | 2016-04-15 | 2019-05-03 | 宝利九章(北京)数据技术有限公司 | Method and system for leakage prevention |
CN106649257A (en) * | 2016-09-21 | 2017-05-10 | 联动优势科技有限公司 | Semantic section conversion method and device |
CN106649257B (en) * | 2016-09-21 | 2019-06-18 | 联动优势科技有限公司 | A kind of conversion method and device of semanteme section |
CN106649214A (en) * | 2016-10-21 | 2017-05-10 | 天津海量信息技术股份有限公司 | Internet information content similarity definition method |
CN106873964A (en) * | 2016-12-23 | 2017-06-20 | 浙江工业大学 | A kind of improved SimHash detection method of code similarities |
CN107368472B (en) * | 2017-07-26 | 2021-01-05 | 成都科来软件有限公司 | Storage method of document analysis result capable of being iteratively optimized |
CN107368472A (en) * | 2017-07-26 | 2017-11-21 | 成都科来软件有限公司 | It is a kind of can iteration optimization document analysis result store method |
CN110019642A (en) * | 2017-08-06 | 2019-07-16 | 北京国双科技有限公司 | A kind of Similar Text detection method and device |
CN108282328A (en) * | 2018-02-02 | 2018-07-13 | 沈阳航空航天大学 | A kind of ciphertext statistical method based on homomorphic cryptography |
CN109145080A (en) * | 2018-07-26 | 2019-01-04 | 新华三信息安全技术有限公司 | A kind of text fingerprints preparation method and device |
CN109145080B (en) * | 2018-07-26 | 2021-01-01 | 新华三信息安全技术有限公司 | Text fingerprint obtaining method and device |
CN110891010A (en) * | 2018-09-05 | 2020-03-17 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN110134761A (en) * | 2019-04-16 | 2019-08-16 | 深圳壹账通智能科技有限公司 | Adjudicate document information retrieval method, device, computer equipment and storage medium |
CN112733523A (en) * | 2020-12-30 | 2021-04-30 | 深信服科技股份有限公司 | Document sending method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103425639A (en) | Similar information identifying method based on information fingerprints | |
CN109241274B (en) | Text clustering method and device | |
CN103123618B (en) | Text similarity acquisition methods and device | |
US10645105B2 (en) | Network attack detection method and device | |
CN103258037A (en) | Trademark identification searching method for multiple combined contents | |
CN103679012A (en) | Clustering method and device of portable execute (PE) files | |
CN105574156B (en) | Text Clustering Method, device and calculating equipment | |
CN103324632B (en) | A kind of concept identification method based on Cooperative Study and device | |
CN104636319A (en) | Text duplicate removal method and device | |
CN105550253B (en) | Method and device for acquiring type relationship | |
US20180143979A1 (en) | Method for segmenting and indexing features from multidimensional data | |
CN102081598A (en) | Method for detecting duplicated texts | |
KR20170004983A (en) | Line segmentation method | |
CN104572872A (en) | Data deduplication blocking method based on extreme value | |
CN101604408B (en) | Generation of detectors and detecting method | |
CN104346411B (en) | The method and apparatus that multiple contributions are clustered | |
CN102346830B (en) | Gradient histogram-based virus detection method | |
CN103246640B (en) | A kind of method and device detecting repeated text | |
CN102364458B (en) | Method for extracting file abstract | |
CN115941281A (en) | Abnormal network flow detection method based on bidirectional time convolution neural network and multi-head self-attention mechanism | |
CN110737748B (en) | Text deduplication method and system | |
CN103336806A (en) | Method for sequencing keywords based on entropy difference between word-spacing-appearing internal mode and external mode | |
CN114266251A (en) | Malicious domain name detection method and device, electronic equipment and storage medium | |
CN104881395A (en) | Method and system for obtaining similarity of vectors in matrix | |
CN112559474A (en) | Log processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20131204 |