CN108009253A

CN108009253A - An improved string similarity comparison method

Info

Publication number: CN108009253A
Application number: CN201711263775.8A
Authority: CN
Inventors: 杜庆治; 陈鸣; 邵玉斌; 龙华
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2018-05-08

Abstract

The present invention relates to an improved string similarity comparison method and belongs to the field of duplicate checking. The method first calculates the locality-sensitive hash value of the string to be checked, then calculates the locality-sensitive hash value of the comparison string, and finally calculates the Hamming distance between the two, which gives the similarity of the two texts. To improve accuracy, after word segmentation each Chinese character in the string is encoded with a phonetic-stroke code, so that homophonic characters and similarly shaped characters can be compared effectively.

Description

An improved string similarity comparison method

Technical field

The present invention relates to an improved string similarity comparison method and belongs to the field of duplicate checking.

Background art

In the fields of data mining and knowledge discovery, a major challenge brought by the flood of massive data is the large amount of duplicated information. According to statistics, about 30% of web pages in China are duplicates, and excessive duplicate information is one of the main causes of retrieval difficulty. The algorithm on which this method builds was designed specifically to deduplicate web pages numbering in the hundreds of millions. It is also widely applied to text deduplication, although that task is more complicated than web page deduplication because Chinese sentence structure is distinctive and words are often polysemous. The core idea of the algorithm is dimensionality reduction: a high-dimensional feature vector is mapped to a low-dimensional feature vector, and the degree to which two documents are similar or duplicated is judged by the Hamming distance between the two vectors. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. For example, the Hamming distance between 1000110 and 1000001 is 3. This idea can be extended to string similarity comparison: duplicate checking based on the homophones and similarly shaped characters in the strings can both greatly reduce comparison time and improve comparison accuracy.
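The Hamming distance defined above can be illustrated with a minimal Python sketch. The function name `hamming_distance` is chosen here for illustration only and does not come from the patent.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


# The example from the text: the strings differ in their last three positions.
print(hamming_distance("1000110", "1000001"))  # 3
```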

Summary of the invention

The present invention provides an improved string similarity comparison method for judging the similarity of strings.

The technical solution of the invention is an improved string similarity comparison method with the following steps:

S1, word segmentation: perform Chinese word segmentation on the string to be checked;

S2, hash: encode each Chinese character in each word as a ten-digit phonetic-stroke code, add up the phonetic-stroke codes of the characters in each word, and convert the summed phonetic-stroke code to binary as the hash value of that word; the phonetic-stroke code is half sound code and half shape code, where the sound code consists of the final, initial, complement code, and tone of the character, and the shape code consists of the structure bit, four-corner code, and stroke count of the character;

S3, weight: calculate the weight of each word produced by segmentation using the TF_IDF algorithm, as follows:

1. Calculate the word frequency of each word after segmenting the string to be checked:

TF=n₁/n₂

where TF is the word frequency of a word, n₁ is the number of times the word occurs in the string, and n₂ is the total number of words in the string;

2. Calculate the inverse document frequency:

IDF=ln(ducom/ducom1)

where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word has occurred;

3. Calculate the weight of the word:

TF_IDF=TF*IDF

where TF_IDF is the weight of the word, and the word frequency and the inverse document frequency are those of the same word;
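The three formulas above can be combined into one small function. This is a sketch only: the function name `tf_idf` and the input validation are assumptions, while the parameter names follow the symbols used in the formulas.

```python
import math


def tf_idf(n1: int, n2: int, ducom: int, ducom1: int) -> float:
    """Weight of one word: TF = n1/n2, IDF = ln(ducom/ducom1), TF_IDF = TF*IDF."""
    if n2 <= 0 or ducom1 <= 0:
        raise ValueError("n2 and ducom1 must be positive")
    tf = n1 / n2                    # word frequency within the string
    idf = math.log(ducom / ducom1)  # natural logarithm, as in the formula
    return tf * idf
```

One consequence of the formula is that a word occurring in every document of the corpus (ducom equal to ducom1) has an IDF of 0 and therefore a weight of 0, so it contributes nothing to the later steps.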

S4, weighting: take the binary hash value of each word obtained in step S2 and examine each bit; if the bit is 1, assign it the positive weight of the word; if the bit is 0, assign it the negative weight of the word; continue until every bit of the binary hash value has been examined, so that each word finally yields a sequence of numbers;

S5, accumulation: for all words of the segmented string to be checked, add up the weighted results position by position, producing the accumulated result for the string;

S6, dimensionality reduction: reduce each element of the number sequence formed by the accumulated result; if the element is greater than 0, the bit becomes 1; otherwise the bit becomes 0; once every element has been examined, the result is the locality-sensitive hash value of the string;

S7, duplicate checking: calculate the locality-sensitive hash value of the comparison string by the same steps, then calculate the Hamming distance between it and that of the string to be checked to judge the similarity of the two.

In step S2, the accumulation is performed without carry.

In step S7, the two strings are judged similar if the Hamming distance is less than 3.
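Steps S4 to S7 can be sketched compactly in Python, assuming the word hash values (S2) and weights (S3) have already been computed elsewhere. The names `fingerprint` and `is_similar`, and the choice to hold hash values as integers, are illustrative assumptions rather than part of the patent.

```python
from typing import Iterable, Tuple


def fingerprint(words: Iterable[Tuple[int, float]], bits: int) -> int:
    """Steps S4-S6: signed weighting, accumulation, and reduction to bits.

    `words` holds (hash value, weight) pairs, assumed to come from S2 and S3.
    """
    totals = [0.0] * bits
    for word_hash, weight in words:
        for i in range(bits):
            # Bit i is counted from the most significant end of the hash.
            bit = (word_hash >> (bits - 1 - i)) & 1
            totals[i] += weight if bit else -weight       # S4 and S5
    result = 0
    for total in totals:
        result = (result << 1) | (1 if total > 0 else 0)  # S6
    return result


def is_similar(fp_a: int, fp_b: int, threshold: int = 3) -> bool:
    """Step S7: similar when the Hamming distance is below the threshold."""
    return bin(fp_a ^ fp_b).count("1") < threshold
```

With 4-bit toy hashes, `fingerprint([(0b1010, 0.5), (0b1100, 0.2)], 4)` accumulates to 0.7, -0.3, 0.3, -0.7 and so yields `0b1010`. The heavier word dominates each position, which is why replacing a low-weight word changes few bits of the result.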

The beneficial effects of the invention are: 1. by reducing high-dimensional feature vectors to low-dimensional feature vectors, the invention saves string comparison time; 2. the phonetic-stroke code used to calculate the hash value preserves both the pinyin characteristics and the shape characteristics of Chinese characters, which improves the accuracy of comparing homophones and similarly shaped characters.

Brief description of the drawings

Fig. 1 shows the Chinese character phonetic-stroke code encoding rule used in the invention.

Embodiment

Embodiment 1: As shown in Fig. 1, an improved string similarity comparison method whose steps are steps S1 to S7 as set out in the summary above. The weight of each word in step S3 is calculated as follows:

1. Calculate the word frequency of each word after segmenting the string to be checked:

TF=n₁/n₂

2. Calculate the inverse document frequency:

IDF=ln(ducom/ducom1)

3. Calculate the weight of the word:

TF_IDF=TF*IDF

Further, in step S2, the accumulation may be performed without carry.

Further, in step S7, the two strings may be judged similar if the Hamming distance is less than 3.

Embodiment 2: As shown in Fig. 1, an improved string similarity comparison method with the following steps:

S1, word segmentation: perform Chinese word segmentation on the short string to be checked;

For example, when the string "China Yunnan Kunming Zilang Road" is processed with a word segmentation package, the segmentation result is: China/Yunnan/Kunming/Zilang Road.

S2, hash: encode each Chinese character as a ten-digit phonetic-stroke code consisting of half sound code and half shape code. The sound code consists of the final, initial, complement code, and tone of the character. The first digit is the final: by a simple substitution rule, the final of the character is mapped to a single character position. Pinyin has 24 finals in total; for the purposes of later calculation, some of them are replaced by the same character, as shown in Table 1.

Table 1：

The second digit is the initial. Likewise, the initial is converted to a character using a substitution table, as shown in Table 2.

Table 2：

The third digit is the complement code, normally used when there is an additional consonant between the initial and the final; it uses the same substitution rule as the table of finals and defaults to 0 when there is none. The fourth digit is the tone, with 1, 2, 3, and 4 standing for the four tones of Chinese. In the shape code, the fifth digit is the structure bit: according to the structure of the character, a single character represents that structure, as shown in Table 3.

Table 3：

The sixth to ninth digits are the four-corner code, which describes the form of the character. For each character, the numbers of the four corners are taken in the order upper left, upper right, lower left, lower right. The coding rule follows the mnemonic: horizontal is 1, vertical is 2, dot or right-falling stroke is 3; cross is 4, pierce is 5, box is 6; corner is 7, the 八 shape is 8, the 小 shape is 9; a dot above a horizontal is 0. For example, the four-corner code of "庄" (village) is 0021 and that of "气" (gas) is 8001. The tenth digit is the stroke count: 1 to 9 represent one to nine strokes, then A represents 10, B represents 11, and so on; Z represents 35, and any count above 35 also uses Z. Every digit of the phonetic-stroke code described above falls in the range 0-9, A-Z.
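The stroke-count rule and the layout of digits five to ten can be sketched as follows. The helper names `stroke_digit` and `shape_code` are hypothetical; the structure character and four-corner digits are taken as inputs because their lookup tables are not reproduced here.

```python
import string

# Digits 0-9 followed by A-Z, so index 10 is "A" and index 35 is "Z".
DIGITS = string.digits + string.ascii_uppercase


def stroke_digit(strokes: int) -> str:
    """Encode a stroke count as the tenth digit; counts above 35 map to Z."""
    if strokes < 1:
        raise ValueError("a character has at least one stroke")
    return DIGITS[min(strokes, 35)]


def shape_code(structure: str, four_corner: str, strokes: int) -> str:
    """Assemble digits five to ten: structure bit, four corners, stroke count."""
    if len(structure) != 1 or len(four_corner) != 4:
        raise ValueError("expected 1 structure character and 4 corner digits")
    return structure + four_corner + stroke_digit(strokes)
```

For instance, `shape_code("1", "1313", 11)` returns `"11313B"`, which matches the last six digits of the example code F70211313B given in this embodiment.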

Finally, the phonetic-stroke codes of the characters in each word are added up, and the summed phonetic-stroke code is converted to binary as the hash value of the word.

For example, the phonetic-stroke code of "琅" (the "lang" of Zilang) is "F70211313B", which is then converted to binary. Likewise, "狼" (wolf) is encoded as F70214323A, which allows homophone comparison; characters such as "refined" and "China fir" can be compared for shape similarity. To calculate the hash value of "Zilang Road", the phonetic-stroke codes of its three characters 紫, 琅, and 路 are each worked out as described above and then added up, giving 41GE5EC78Y. (The 1st, 2nd, 3rd, and 10th digits of the ten-digit code can exceed F, so each of those four digits is expanded to two hexadecimal digits, while the other six digits stay as single hexadecimal digits, for a total of 8+6=14 hexadecimal digits of 4 bits each.) The result is then converted to a 56-bit binary number (the extra width makes subsequent calculation more accurate): 111010111…, and the hash values of China, Yunnan, and Kunming are calculated in the same way. Note that during accumulation, corresponding digits are added without carry, for example Z+Z=Y;
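The carry-free addition and the 56-bit layout can be sketched as below. Several points are assumptions rather than statements from the patent: digits are treated as base-36 values and added modulo 36 (which is consistent with Z+Z=Y), the four wide positions are written as 8 bits each, and the six narrow positions are truncated to 4 bits. The sketch does not claim to reproduce the exact bit string quoted in the text.

```python
import string

DIGITS = string.digits + string.ascii_uppercase  # "0".."9", "A".."Z"
WIDE = {0, 1, 2, 9}  # zero-based positions of the 1st, 2nd, 3rd, 10th digits


def add_no_carry(code_a: str, code_b: str) -> str:
    """Add two ten-digit codes position by position, discarding any carry."""
    if len(code_a) != 10 or len(code_b) != 10:
        raise ValueError("phonetic-stroke codes must have ten digits")
    return "".join(
        DIGITS[(DIGITS.index(x) + DIGITS.index(y)) % 36]
        for x, y in zip(code_a, code_b)
    )


def to_bits(code: str) -> str:
    """Expand a ten-digit code to 56 bits (assumed layout, see lead-in)."""
    out = []
    for pos, ch in enumerate(code):
        value = DIGITS.index(ch)
        if pos in WIDE:
            out.append(format(value, "08b"))       # two hexadecimal digits
        else:
            out.append(format(value % 16, "04b"))  # one hexadecimal digit
    return "".join(out)
```

Under these assumptions, adding a code of ten Z digits to itself gives ten Y digits, and `to_bits` always returns 4*8 + 6*4 = 56 bits.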

S3, weight: calculate the weight of each word produced by segmentation using the TF_IDF algorithm, as follows:

1. Calculate the word frequency of each word after segmenting the string to be checked:

TF=n₁/n₂

2. Calculate the inverse document frequency:

IDF=ln(ducom/ducom1)

3. Calculate the weight of the word:

TF_IDF=TF*IDF

For example, "Zilang Road" occurs once in the string, which has 4 words in total, so its word frequency is 0.25. The corpus has 10000 articles and "Zilang Road" has occurred in 1000 of them, so the inverse document frequency of the word is ln(10000/1000), about 2.3. The weight of "Zilang Road" is therefore 0.25*2.3, about 0.57. The weights of China, Yunnan, and Kunming are calculated in the same way;
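Written out with the symbols of the formulas, the arithmetic of this example is as follows. The intermediate values are shown to three decimal places; the text rounds the inverse document frequency to 2.3 before multiplying, which gives 0.575 and is quoted as 0.57.

```latex
\begin{aligned}
\mathrm{TF} &= \frac{n_1}{n_2} = \frac{1}{4} = 0.25 \\
\mathrm{IDF} &= \ln\frac{\mathit{ducom}}{\mathit{ducom1}} = \ln\frac{10000}{1000} = \ln 10 \approx 2.303 \\
\mathrm{TF\_IDF} &= \mathrm{TF} \times \mathrm{IDF} \approx 0.25 \times 2.303 \approx 0.576
\end{aligned}
```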

S4, weighting: take the binary hash value of each word obtained in step S2 and examine each bit; if the bit is 1, its weight is TF_IDF; if the bit is 0, its weight is -TF_IDF; continue until every bit of the binary hash value has been examined, so that each word finally yields a sequence of numbers;

For example, the hash value of "Zilang Road" converts to the 56-bit binary number 111010111…, so the weighted result of the word is a sequence of 56 numbers: 0.57, 0.57, 0.57, -0.57, 0.57, -0.57, 0.57, 0.57, 0.57, …; the weighted results of China, Yunnan, and Kunming are calculated in the same way;
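The weighting rule can be checked against the nine leading bits quoted in this example. The function name `weight_bits` is an illustrative choice, and the bits are held as a string of 0 and 1 characters for readability.

```python
from typing import List


def weight_bits(bits: str, weight: float) -> List[float]:
    """Map each bit to +weight (bit 1) or -weight (bit 0)."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only 0 and 1")
    return [weight if b == "1" else -weight for b in bits]


print(weight_bits("111010111", 0.57))
# [0.57, 0.57, 0.57, -0.57, 0.57, -0.57, 0.57, 0.57, 0.57]
```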

S5, accumulation: for all words of the segmented string to be checked, add up the weighted results position by position, producing the accumulated result for the string;

For example, for China/Yunnan/Kunming/Zilang Road, the accumulated weighted result is a sequence of 56 numbers: 13, 108, -22, -5, -32, 55, …;

S6, dimensionality reduction: reduce each element of the number sequence formed by the accumulated result; if the element is greater than 0, the bit is set to 1; otherwise it is set to 0; once every element has been examined, the result is the locality-sensitive hash value of the string;

For example, for China/Yunnan/Kunming/Zilang Road, the result of dimensionality reduction is the 56-bit binary number 110001…;
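The accumulation and dimensionality reduction steps can be sketched with two short functions and checked against the six leading values quoted in the example. The names `accumulate` and `reduce_to_bits` are hypothetical.

```python
from typing import List, Sequence


def accumulate(rows: Sequence[Sequence[float]]) -> List[float]:
    """Add the weighted sequences of all words position by position."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all weighted sequences must have the same length")
    return [sum(column) for column in zip(*rows)]


def reduce_to_bits(totals: Sequence[float]) -> str:
    """Emit 1 where the accumulated value is greater than 0, else 0."""
    return "".join("1" if t > 0 else "0" for t in totals)


print(reduce_to_bits([13, 108, -22, -5, -32, 55]))  # 110001
```

Note that an accumulated value of exactly 0 maps to 0, since the rule requires the value to be strictly greater than 0.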

S7, duplicate checking: calculate the locality-sensitive hash value of the comparison string by the same steps, then calculate the Hamming distance between it and that of the string to be checked; if the Hamming distance is less than 3, the two can be judged similar;

For example, for China/Yunnan/Kunming/Ziwei Road (crape myrtle road), the result of dimensionality reduction is the 56-bit binary number 111101…, and for China/Yunnan/Kunming/Zilang Road it is 110001…. The Hamming distance is 2, so the two strings can be judged similar (the digits elided by the ellipsis are identical in both);
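The comparison in this example can be written as a bitwise exclusive-or followed by a count of the 1 bits. The symbols a, b, and d_H are notation introduced here for illustration, and the calculation relies on the statement in the text that the elided digits are identical.

```latex
\begin{aligned}
a &= 111101\ldots, \qquad b = 110001\ldots \\
a \oplus b &= 001100\ldots \\
d_H(a, b) &= 2 < 3
\end{aligned}
```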

The embodiments of the present invention have been explained in detail above with reference to the accompanying drawing, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the invention.

Claims

1. An improved string similarity comparison method, characterized in that the method comprises the following steps:

S1, word segmentation: perform Chinese word segmentation on the string to be checked;

S2, hash: encode each Chinese character in each word as a ten-digit phonetic-stroke code, add up the phonetic-stroke codes of the characters in each word, and convert the summed phonetic-stroke code to binary as the hash value of that word; the phonetic-stroke code is half sound code and half shape code, where the sound code consists of the final, initial, complement code, and tone of the character, and the shape code consists of the structure bit, four-corner code, and stroke count of the character;

S3, weight: calculate the weight of each word produced by segmentation using the TF_IDF algorithm, as follows:

1. Calculate the word frequency of each word after segmenting the string to be checked:

TF=n₁/n₂

where TF is the word frequency of a word, n₁ is the number of times the word occurs in the string, and n₂ is the total number of words in the string;

2. Calculate the inverse document frequency:

IDF=ln(ducom/ducom1)

where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word has occurred;

3. Calculate the weight of the word:

TF_IDF=TF*IDF

where TF_IDF is the weight of the word, and the word frequency and the inverse document frequency are those of the same word;

S4, weighting: take the binary hash value of each word obtained in step S2 and examine each bit; if the bit is 1, assign it the positive weight of the word; if the bit is 0, assign it the negative weight of the word; continue until every bit of the binary hash value has been examined, so that each word finally yields a sequence of numbers;

S5, accumulation: for all words of the segmented string to be checked, add up the weighted results position by position, producing the accumulated result for the string;

S6, dimensionality reduction: reduce each element of the number sequence formed by the accumulated result; if the element is greater than 0, the bit becomes 1; otherwise the bit becomes 0; once every element has been examined, the result is the locality-sensitive hash value of the string;

S7, duplicate checking: calculate the locality-sensitive hash value of the comparison string by the same steps, then calculate the Hamming distance between it and that of the string to be checked to judge the similarity of the two.

2. The improved string similarity comparison method according to claim 1, characterized in that in step S2 the accumulation is performed without carry.

3. The improved string similarity comparison method according to claim 1, characterized in that in step S7 the two strings are judged similar if the Hamming distance is less than 3.