CN103425639A - Similar information identifying method based on information fingerprints - Google Patents

Similar information identifying method based on information fingerprints

Info

Publication number
CN103425639A
CN103425639A CN2013104024655A CN201310402465A CN103425639A CN 103425639 A CN103425639 A CN 103425639A CN 2013104024655 A CN2013104024655 A CN 2013104024655A CN 201310402465 A CN201310402465 A CN 201310402465A CN 103425639 A CN103425639 A CN 103425639A
Authority
CN
China
Prior art keywords
information
recognition methods
information fingerprint
word
methods based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013104024655A
Other languages
Chinese (zh)
Inventor
戴森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGZHOU YIHUBAIYING NETWORK TECHNIQUE CO Ltd
Original Assignee
GUANGZHOU YIHUBAIYING NETWORK TECHNIQUE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGZHOU YIHUBAIYING NETWORK TECHNIQUE CO Ltd
Priority to CN2013104024655A
Publication of CN103425639A

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a similar information identifying method based on information fingerprints. According to the method, firstly, Chinese word segmenting is carried out on texts of each document, word frequency is calculated, and words with high word frequency are taken out to serve as feature values; according to the extracted feature values, an information fingerprint of each document is calculated, the information fingerprints of the two documents are compared, and if a comparison result is larger than a threshold value, then the result that the two documents are similar can be judged. According to the method, the situation that in the prior art, all information of two documents needs to be calculated and compared can be avoided, and calculation complexity is greatly reduced; due to the fact that the information fingerprint of each document is unique, when similarity of multiple documents is judged, what is needed is to compare the information fingerprints, and work efficiency can be effectively improved.

Description

A similar information identifying method based on information fingerprints
Technical field
The present invention relates to a method for identifying similar information based on information fingerprints.
Background art
Existing methods for identifying duplicate information mainly apply MD5 encoding to each piece of information and then compare the MD5 values of the two pieces: if the values are identical, the two pieces of information are the same; if the values differ, they are different. Existing methods for identifying similar information mainly split the two pieces of information into characters, compare the characters in order, and derive the similarity between the two pieces from the percentage of positions at which the characters match exactly.
The main drawback of the existing duplicate identification technique is that it can only detect character strings that are exactly identical: if one of two otherwise identical pieces of information has an added space or other character, the program judges that they are not duplicates, so accuracy is low. The main drawback of the existing similar information identifying method is that every comparison must check the split character strings against each other, so the amount of computation is large and performance is very poor in a big-data environment.
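The two prior-art approaches described above can be illustrated with a minimal sketch. The function names `md5_equal` and `positional_similarity` are hypothetical, and dividing by the length of the longer string is an assumption, since the text does not say how the percentage is normalised.

```python
import hashlib


def md5_equal(a: str, b: str) -> bool:
    """Duplicate check: two texts count as duplicates only if their MD5 digests match."""
    digest_a = hashlib.md5(a.encode("utf-8")).hexdigest()
    digest_b = hashlib.md5(b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


def positional_similarity(a: str, b: str) -> float:
    """Fraction of positions at which both texts hold the same character.

    Assumption: the fraction is taken over the length of the longer text.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


if __name__ == "__main__":
    original = "information fingerprint"
    padded = " information fingerprint"  # same text with one leading space
    print(md5_equal(original, padded))
    print(positional_similarity(original, padded))
```

Under these assumptions, a single leading space is enough to make the MD5 check fail and to shift every character out of position, which is the weakness the background section describes.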
Summary of the invention
The object of the present invention is to provide a similar information identifying method that has high accuracy and is suitable for big-data environments.
The similar information identifying method based on information fingerprints of the present invention comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging the documents to be similar.
The similar information identifying method based on information fingerprints of the present invention takes the highest-frequency words as feature values, computes an information fingerprint from them, and uses the fingerprint to judge whether documents are similar. Compared with existing duplicate information identifying methods, adding a small number of characters to one of the documents does not affect the result, which improves the accuracy of the judgment. In addition, because the information fingerprint of a document is unique, judging the similarity of multiple documents only requires comparing their information fingerprints with one another, which greatly alleviates the large amount of computation and the poor big-data performance of existing similar information identifying methods.
Description of drawings
Fig. 1 is a flow chart of the similar information identifying method based on information fingerprints of the present invention.
Fig. 2 is a flow chart of the steps for calculating the information fingerprint of a document according to the present invention.
Embodiment
As shown in Fig. 1, the similar information identifying method based on information fingerprints comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging the documents to be similar.
When the number of words extracted as feature values is between 15 and 25, the performance requirements of the identifying method are substantially met; extensive sampling tests and calculations found that taking 20 words is the optimal choice.
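The word-frequency step can be sketched as follows. This is a minimal illustration that assumes the document has already been segmented into a list of words by some Chinese word segmenter, which the text does not name; the function name `top_features` is hypothetical, and the default of 20 follows the value reported above.

```python
from collections import Counter
from typing import List, Tuple


def top_features(tokens: List[str], n: int = 20) -> List[Tuple[str, int]]:
    """Return the n most frequent words with their counts.

    `tokens` is assumed to be the output of a word segmenter.
    Words with equal counts keep the order in which they first appear.
    """
    return Counter(tokens).most_common(n)


if __name__ == "__main__":
    segmented = ["信息", "指纹", "信息", "文档", "信息", "指纹"]
    print(top_features(segmented, n=2))
```

Each returned pair carries the word and its occurrence count, which is convenient because the count is later used as the feature weight.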
As shown in Fig. 2, the steps for calculating the information fingerprint of a document comprise:
Performing a 64-bit polynomial hash operation on each extracted feature value to obtain a 64-bit hash value;
Processing each bit of the 64-bit hash value: if the i-th bit of the hash value is 1, the i-th position takes the weight of the feature; if the i-th bit of the hash value is 0, the i-th position takes the negative of the weight of the feature;
The weight of a feature is numerically equal to the number of times the word occurs;
After all feature values have been processed, adding the results of all feature values position by position to obtain a sequence of 64 numbers, and finally setting each position whose sum is positive to 1 and each position whose sum is negative to 0, which yields a 64-bit array of 0s and 1s, namely the information fingerprint of the information.
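The fingerprint calculation can be sketched as below. The text does not specify which polynomial hash is used, so `poly_hash64` (base 131 over UTF-8 bytes, reduced modulo 2^64) is an assumption, as are the function names, the choice to number bits from the least significant end, and the choice to map a zero sum to 0 (the text only covers positive and negative sums).

```python
from typing import Mapping

MASK64 = (1 << 64) - 1  # keeps values within 64 bits


def poly_hash64(word: str, base: int = 131) -> int:
    """Hypothetical 64-bit polynomial hash: h = h * base + byte (mod 2**64)."""
    h = 0
    for byte in word.encode("utf-8"):
        h = (h * base + byte) & MASK64
    return h


def fingerprint(features: Mapping[str, int]) -> int:
    """Compute a 64-bit fingerprint from {word: occurrence count}."""
    sums = [0] * 64
    for word, weight in features.items():
        h = poly_hash64(word)
        for i in range(64):
            # Bit i set: add the weight; bit i clear: subtract the weight.
            if (h >> i) & 1:
                sums[i] += weight
            else:
                sums[i] -= weight
    fp = 0
    for i, total in enumerate(sums):
        if total > 0:  # positive sum -> 1; negative (or zero, by assumption) -> 0
            fp |= 1 << i
    return fp


if __name__ == "__main__":
    print(format(fingerprint({"信息": 5, "指纹": 3, "文档": 1}), "064b"))
```

Because each position is decided by a weighted vote, a heavily weighted word dominates a lightly weighted one, which is why adding a few rare words to a document tends to leave most fingerprint bits unchanged.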
A 64-bit hash operation is chosen because 64 bits yield 2 to the 64th power possible combinations, which satisfies the requirement of the present invention on the repetition (collision) rate. With 32 bits the repetition rate would still be rather high, while with 128 bits the value is too long and would impair computing performance, so the 64-bit hash operation is chosen as a compromise.
When the information fingerprints of two documents are compared, an XNOR or XOR operation is used; the number of 0s or 1s in the result allows the similarity of the two documents to be judged quickly.
When the XOR operation is used, the number of 1s in the result is counted. If there are none, the two pieces of information are identical. The more 1s there are, the more the two pieces of information differ. The threshold chosen by this method is 3: if the number of 1s is less than or equal to 3, the two pieces of information are judged to be similar.
When the XNOR operation is used, the number of 0s in the result is counted. If there are none, the two pieces of information are identical. The more 0s there are, the more the two pieces of information differ. The threshold chosen is 3: if the number of 0s is less than or equal to 3, the two pieces of information are judged to be similar.
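The two comparison variants can be sketched as follows, assuming each fingerprint is held as a 64-bit non-negative integer. The function names are hypothetical; the threshold of 3 is the value given above.

```python
MASK64 = (1 << 64) - 1
THRESHOLD = 3  # maximum number of differing bits still judged similar


def differing_bits_xor(fp_a: int, fp_b: int) -> int:
    """XOR variant: count the 1s in fp_a XOR fp_b."""
    return bin(fp_a ^ fp_b).count("1")


def differing_bits_xnor(fp_a: int, fp_b: int) -> int:
    """XNOR variant: count the 0s among the 64 bits of fp_a XNOR fp_b."""
    xnor = ~(fp_a ^ fp_b) & MASK64
    return 64 - bin(xnor).count("1")


def is_similar(fp_a: int, fp_b: int, threshold: int = THRESHOLD) -> bool:
    """Judge two fingerprints similar if at most `threshold` bits differ."""
    return differing_bits_xor(fp_a, fp_b) <= threshold


if __name__ == "__main__":
    a, b = 0b1011, 0b0001
    print(differing_bits_xor(a, b), differing_bits_xnor(a, b), is_similar(a, b))
```

Both variants count the same quantity, the number of bit positions at which the two fingerprints disagree, so they always lead to the same judgment.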

Claims (7)

1. A similar information identifying method based on information fingerprints, characterized in that the method comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging the documents to be similar.
2. The similar information identifying method based on information fingerprints according to claim 1, characterized in that calculating the information fingerprint of the document comprises the following steps:
Performing a 64-bit polynomial hash operation on each extracted feature value to obtain a 64-bit hash value;
Processing each bit of the 64-bit hash value: if the i-th bit of the hash value is 1, the i-th position takes the weight of the feature; if the i-th bit of the hash value is 0, the i-th position takes the negative of the weight of the feature;
After all feature values have been processed, adding the results of all feature values position by position to obtain a sequence of 64 numbers, and finally setting each position whose sum is positive to 1 and each position whose sum is negative to 0, which yields a 64-bit array of 0s and 1s.
3. The similar information identifying method based on information fingerprints according to claim 2, characterized in that the weight of a feature is numerically equal to the number of times the word occurs.
4. The similar information identifying method based on information fingerprints according to claim 1, characterized in that the number of words extracted as feature values is 15 to 25.
5. The similar information identifying method based on information fingerprints according to claim 1, characterized in that the number of words extracted as feature values is 20.
6. The similar information identifying method based on information fingerprints according to claim 1, characterized in that an XNOR logical operation is used when comparing the information fingerprints of two documents.
7. The similar information identifying method based on information fingerprints according to claim 1, characterized in that an XOR operation is used when comparing the information fingerprints of two documents.
CN2013104024655A 2013-09-06 2013-09-06 Similar information identifying method based on information fingerprints Pending CN103425639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013104024655A CN103425639A (en) 2013-09-06 2013-09-06 Similar information identifying method based on information fingerprints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013104024655A CN103425639A (en) 2013-09-06 2013-09-06 Similar information identifying method based on information fingerprints

Publications (1)

Publication Number Publication Date
CN103425639A true CN103425639A (en) 2013-12-04

Family

ID=49650403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013104024655A Pending CN103425639A (en) 2013-09-06 2013-09-06 Similar information identifying method based on information fingerprints

Country Status (1)

Country Link
CN (1) CN103425639A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8140505B1 (en) * 2005-03-31 2012-03-20 Google Inc. Near-duplicate document detection for web crawling
CN103123618A (en) * 2011-11-21 2013-05-29 北京新媒传信科技有限公司 Text similarity obtaining method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MIN LU等: "Rank hash similarity for fast similarity", 《INFORMATION PROCESSING & MANAGEMENT》, vol. 49, no. 1, 31 January 2013 (2013-01-31), pages 158 - 168 *
段飞: "相似网页识别算法的研究与实现", 《中国优秀硕士学位论文全文数据库》, 15 September 2011 (2011-09-15) *
胡可云等: "《数据挖掘理论与应用》", 30 April 2008, article "数据挖掘理论与应用", pages: 124-125 *
董博等: "基于多SimHash指纹的近似文本检测", 《小型微型计算机系统》, vol. 32, no. 11, 30 November 2011 (2011-11-30), pages 2152 - 2157 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260878A (en) * 2015-09-23 2016-01-20 成都网安科技发展有限公司 Auxiliary secret-level setting method and device
CN105681046A (en) * 2016-02-29 2016-06-15 郑州悉知信息科技股份有限公司 UGC fingerprint signature determination method and device as well as UGC deduplication method and device
CN105844214B (en) * 2016-03-02 2019-06-21 华南理工大学 A kind of information fingerprint extracting method of the multipath depth coding based on bit space
CN105844214A (en) * 2016-03-02 2016-08-10 华南理工大学 Multi-path depth encoded information fingerprint extraction method based on bit space
CN105844118B (en) * 2016-04-15 2020-02-21 量子创新(北京)信息技术有限公司 Method and system for data leakage protection
CN105893859A (en) * 2016-04-15 2016-08-24 宝利九章(北京)数据技术有限公司 Data leakage prevention method and system
CN105956482A (en) * 2016-04-15 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for data leakage protection
CN105955978B (en) * 2016-04-15 2019-07-02 宝利九章(北京)数据技术有限公司 Method and system for leakage prevention
CN105955978A (en) * 2016-04-15 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for data leakage protection
CN105844118A (en) * 2016-04-15 2016-08-10 宝利九章(北京)数据技术有限公司 Methods and system for preventing data leakage
CN105956482B (en) * 2016-04-15 2019-06-04 宝利九章(北京)数据技术有限公司 Method and system for leakage prevention
CN105893859B (en) * 2016-04-15 2019-05-03 宝利九章(北京)数据技术有限公司 Method and system for leakage prevention
CN106649257A (en) * 2016-09-21 2017-05-10 联动优势科技有限公司 Semantic section conversion method and device
CN106649257B (en) * 2016-09-21 2019-06-18 联动优势科技有限公司 A kind of conversion method and device of semanteme section
CN106649214A (en) * 2016-10-21 2017-05-10 天津海量信息技术股份有限公司 Internet information content similarity definition method
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities
CN107368472B (en) * 2017-07-26 2021-01-05 成都科来软件有限公司 Storage method of document analysis result capable of being iteratively optimized
CN107368472A (en) * 2017-07-26 2017-11-21 成都科来软件有限公司 It is a kind of can iteration optimization document analysis result store method
CN110019642A (en) * 2017-08-06 2019-07-16 北京国双科技有限公司 A kind of Similar Text detection method and device
CN108282328A (en) * 2018-02-02 2018-07-13 沈阳航空航天大学 A kind of ciphertext statistical method based on homomorphic cryptography
CN109145080A (en) * 2018-07-26 2019-01-04 新华三信息安全技术有限公司 A kind of text fingerprints preparation method and device
CN109145080B (en) * 2018-07-26 2021-01-01 新华三信息安全技术有限公司 Text fingerprint obtaining method and device
CN110891010A (en) * 2018-09-05 2020-03-17 百度在线网络技术(北京)有限公司 Method and apparatus for transmitting information
CN110134761A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Adjudicate document information retrieval method, device, computer equipment and storage medium
CN112733523A (en) * 2020-12-30 2021-04-30 深信服科技股份有限公司 Document sending method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103425639A (en) Similar information identifying method based on information fingerprints
CN109241274B (en) Text clustering method and device
CN103123618B (en) Text similarity acquisition methods and device
US10645105B2 (en) Network attack detection method and device
CN103258037A (en) Trademark identification searching method for multiple combined contents
CN103679012A (en) Clustering method and device of portable execute (PE) files
CN105574156B (en) Text Clustering Method, device and calculating equipment
CN103324632B (en) A kind of concept identification method based on Cooperative Study and device
CN104636319A (en) Text duplicate removal method and device
CN105550253B (en) Method and device for acquiring type relationship
US20180143979A1 (en) Method for segmenting and indexing features from multidimensional data
CN102081598A (en) Method for detecting duplicated texts
KR20170004983A (en) Line segmentation method
CN104572872A (en) Data deduplication blocking method based on extreme value
CN101604408B (en) Generation of detectors and detecting method
CN104346411B (en) The method and apparatus that multiple contributions are clustered
CN102346830B (en) Gradient histogram-based virus detection method
CN103246640B (en) A kind of method and device detecting repeated text
CN102364458B (en) Method for extracting file abstract
CN115941281A (en) Abnormal network flow detection method based on bidirectional time convolution neural network and multi-head self-attention mechanism
CN110737748B (en) Text deduplication method and system
CN103336806A (en) Method for sequencing keywords based on entropy difference between word-spacing-appearing internal mode and external mode
CN114266251A (en) Malicious domain name detection method and device, electronic equipment and storage medium
CN104881395A (en) Method and system for obtaining similarity of vectors in matrix
CN112559474A (en) Log processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20131204