CN104008119B - A one-to-many mixed character string fusion comparison method

Info

Publication number: CN104008119B
Application number: CN201310746846.5A
Authority: CN
Inventors: 童晓阳; 甄威; 郑永康; 姜振超; 庄先涛; 吴继维; 张茜; 丁宣文
Original assignee: Southwest Jiaotong University; Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Current assignee: Southwest Jiaotong University; Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority date: 2013-12-30
Filing date: 2013-12-30
Publication date: 2017-09-26
Anticipated expiration: 2033-12-30
Also published as: CN104008119A

Abstract

The invention discloses a one-to-many mixed character string fusion comparison method, which finds, from a group of strings to be compared, the target string that is most similar to or best matches a source string. An improved GST* algorithm is applied first, followed by a partial-order string comparison algorithm, POC. Because strings have both unordered and partially ordered characteristics, the two algorithms are used for unordered matching and partial-order matching respectively, and the matching degree values they produce are fused by weighting to obtain the final matching degree. In addition, because synonymous strings are expressed differently in different settings, a string equivalence replacement strategy replaces synonymous substrings in the source string and the strings to be compared with identical strings, which greatly improves the matching degree of the two strings. The source string is matched against each string in a group of strings to be matched, the resulting matching degrees are sorted, and the string with the highest matching degree is taken as the target string, achieving the best one-to-many match for mixed character strings.

Description

A one-to-many mixed character string fusion comparison method

Technical field

The invention belongs to the technical field of intelligent string comparison, and in particular relates to a new one-to-many mixed character string fusion comparison method.

Background technology

String comparison is a basic problem in computer science, and research on it has important application value in many fields such as information retrieval and pattern recognition [1]-[4].

Document 1 studies approximate string matching for Chinese text, and Document 2 studies a Chinese string similarity calculation method based on Chinese character clustering features. Document 3 compares the LCS and GST algorithms. GST is a greedy string comparison algorithm and also an unordered matching algorithm; it is widely used at present, but because it compares the two strings character by character, its time complexity is high. Document 4 studies the RKR-GST algorithm, an improvement on GST that raises its running efficiency, but in RKR-GST the choice of hash function has a large influence on the running performance of the algorithm.

Existing string comparison methods often use only one algorithm and cannot make full use of the features of unordered substrings and partially ordered substrings when calculating the matching degree, so their comparison results are often unsatisfactory. In some practical applications involving mixed character strings, the comparison must be both accurate and fast. At present, a single matching degree calculation method often has difficulty expressing the similarity of strings accurately.

In addition, existing string comparison methods do not consider that synonymous strings may be expressed in different ways, so in such cases they can hardly meet the requirement of a more accurate comparison with a higher matching rate.

References:

[1] Chen Kaiqu, Zhao Jie, Peng Zhiwei. Fast approximate string matching for Chinese text [J]. Journal of Chinese Information Processing, 2003, 18(2): 58-65

[2] Chinese character string Similarity Measure research [J] modem long jump skill information of the graceful of Wang Jing based on Chinese characters clustering feature Technology, 2011,20 (2)：48-53

[3] Yu Haiying. Comparison of the LCS and GST algorithms in string similarity measurement [J]. Electronic Science and Technology, 2011, 24(3): 101-103

[4] Niu Yongjie. Analysis and implementation of the RKR_GST algorithm in .NET [J]. Information Technology, 2012, 3: 171-174

Summary of the invention

In view of the above shortcomings of the prior art, the object of the invention is to provide a more accurate mixed character string fusion comparison method. It addresses two problems in practical applications: a single matching degree calculation method can hardly express the similarity between strings accurately, and existing string comparison methods perform poorly when synonymous strings are expressed in different ways.

The object of the invention is achieved by the following means:

A one-to-many mixed character string fusion comparison method performs, based on Chinese character clustering features, a fused comparison of the similarity of mixed character strings composed of Chinese characters, digits and English letters, so as to express string similarity more accurately. It comprises the following main steps:

1) Take out the source string and a group of strings to be matched;

2) Read the string equivalence replacement dictionary built in memory in advance, and perform equivalence replacement on some characters (substrings) in this group of strings to be matched; using the equivalence replacement dictionary, unify pairs of substrings that are described differently in the source-string setting and the to-be-matched-string setting but have the same meaning;

3) Take out the source string, and take out in turn one string to be matched from the array of strings to be matched after equivalence replacement;

4) Calculate the matching degree a of the source string and the string to be matched using the GST* algorithm:

Using the traditional GST algorithm, obtain each common substring of the two strings and store them in a common-substring linked list. If the ratio of the character length of a common substring to the character length of the longer string is greater than or equal to 0.33, the character count of that common substring is multiplied by a weight when the matching degree is calculated, the weight being a constant greater than 1; if the ratio of the character length of a common substring to the character length of the longer string is less than 0.33 and the character count of the common substring is greater than the minimum match length, the character count of that common substring is used directly when the matching degree is calculated;

5) Calculate the matching degree b of the source string and the string to be matched using the partial-order string matching algorithm POC (Partial Order Comparison):

The two mixed character strings to be matched, which contain Chinese characters, digits and English letters, are referred to as the source string and the string to be matched.

First, find the identical characters or Chinese characters in the source string and the string to be matched, and record their number;

Secondly, taking the longer of the source string and the string to be matched as the standard, compute matching degree 1 (match_degree1) by formula (1):

Taking the shorter of the two strings as the standard, compute matching degree 2 (match_degree2) by formula (2):

In formulas (1) and (2), [ ] denotes rounding;

Thirdly, compare the 1st or 2nd digit and letter, and the last or second-to-last digit and letter, of the source string and the string to be matched; if one of them is equal, adjust the value of matching degree 2 to match_degree2 + 1;

Assign different weights, 0.41 and 0.59, to matching degree 1 and matching degree 2, and compute the final matching value b of the source string and the string to be matched:

b = match_degree1 × 0.41 + match_degree2 × 0.59    (3)

6) Fuse the matching degree a obtained by GST* in step 4) and the matching degree b obtained by POC in step 5) by weighting. The fusion rule is: if matching degree a is greater than matching degree b, the final matching degree is a; if matching degree a is less than matching degree b, the final matching degree is (a+b)/2;

7) Sort the matching degrees calculated between the source string and each string to be matched in the array; the string to be matched with the maximum matching degree is taken as the target string that best matches the source string.

In step 4), the common substrings of the source string and the string to be matched are found first, and then common substrings of different lengths are given different weights, increasing the weight of longer common substrings.

The GST* algorithm of the invention improves on the traditional GST algorithm, in which shorter common substrings may yield a matching degree as high as or higher than that of a long common substring. If the ratio of the character length of a common substring to the character length of the longer string is greater than or equal to 0.33, the character count of that common substring is multiplied by a weight (a constant greater than 1) when the matching degree is calculated; if the ratio is less than 0.33 and the character count of the common substring is greater than the minimum match length, the character count of that common substring is used directly when the matching degree is calculated.
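The GST* weighting described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the simplified tiling loop that takes one longest common substring per pass, the normalisation by the length of the longer string, the default minimum match length of 0, and the weight 1.08 are all hypothetical. The weight was chosen only because it reproduces the 43.2% figure that the embodiment quotes for "abcde" and "qbcio".

```python
def greedy_tiles(a, b, min_match_len=0):
    """Simplified greedy string tiling: repeatedly take the longest common
    substring made of characters not yet used in either string."""
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    tiles = []
    while True:
        best_len, best_i, best_j = 0, 0, 0
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        # Stop when nothing longer than the minimum match length remains.
        if best_len <= min_match_len:
            break
        for k in range(best_len):
            used_a[best_i + k] = True
            used_b[best_j + k] = True
        tiles.append(a[best_i:best_i + best_len])
    return tiles


def gst_star_degree(a, b, weight=1.08, ratio_threshold=0.33, min_match_len=0):
    """Matching degree in [0, weight]: long tiles are boosted by `weight`.
    weight=1.08 is an assumed value, not stated in the source."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    score = 0.0
    for tile in greedy_tiles(a, b, min_match_len):
        if len(tile) / longer >= ratio_threshold:
            score += len(tile) * weight   # long common substring: boosted
        else:
            score += len(tile)            # short common substring: counted as is
    return score / longer


if __name__ == "__main__":
    print(gst_star_degree("abcde", "qbcio"))  # one tile "bc", boosted
    print(gst_star_degree("abcde", "qbico"))  # two tiles "b" and "c", not boosted
```

Under these assumptions the first pair scores about 0.432 and the second 0.4, so the pair sharing the contiguous substring "bc" ranks higher, which is the behaviour the text attributes to GST*.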

In step 5), two mixed character strings containing digits, letters and Chinese characters serve as the source string and the string to be matched. Matching degree 1 and matching degree 2 are obtained by taking the longer string and the shorter string as the standard, respectively. The first one or more digits and letters, and the last one or more digits and letters, are then compared for equality, and matching degree 2 is corrected accordingly. Finally, the two matching degrees are given different weights to obtain the matching degree value between the two strings.

The partial-order string comparison algorithm POC of the invention considers that matching degree 2 better reflects the actual matching situation, and therefore gives matching degree 2 a somewhat larger weight.
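The boundary correction and the weighting of formula (3) can be sketched as below. Formulas (1) and (2) are not reproduced in this text, so the sketch takes match_degree1 and match_degree2 as already-computed inputs rather than guessing their form. The function names are hypothetical, and reading "digit and letter" as ASCII digits and letters compared at the first, second, last and second-to-last positions is one possible interpretation of the text, not a confirmed one.

```python
def ascii_alnum(s):
    """Digits and English letters of s, in order (Chinese characters skipped)."""
    return [ch for ch in s if ch.isascii() and ch.isalnum()]


def boundary_bonus(source, candidate):
    """Return 1 if the 1st, 2nd, last or second-to-last digit/letter agree.
    Assumed reading of the correction step; the source is not explicit."""
    s, c = ascii_alnum(source), ascii_alnum(candidate)
    for idx in (0, 1, -1, -2):
        try:
            if s[idx] == c[idx]:
                return 1
        except IndexError:
            continue  # one of the strings has too few digits/letters
    return 0


def poc_degree(match_degree1, match_degree2, source, candidate,
               w1=0.41, w2=0.59):
    """Formula (3): b = match_degree1 * 0.41 + match_degree2 * 0.59,
    applied after match_degree2 has been corrected by the boundary bonus."""
    match_degree2 = match_degree2 + boundary_bonus(source, candidate)
    return match_degree1 * w1 + match_degree2 * w2


if __name__ == "__main__":
    # Hypothetical inputs: the two degrees are supplied by the caller.
    print(poc_degree(50, 60, "A1", "B2"))  # no boundary agreement
    print(poc_degree(50, 60, "A1", "A2"))  # first letter agrees, so m2 becomes 61
```

Keeping the two degrees as parameters means the sketch stays valid whatever the exact form of formulas (1) and (2) turns out to be.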

The invention provides a string equivalence replacement strategy. For example, "high-voltage side" and "220KV side", or "kilovolt" and "kV", are equivalent in meaning. Existing comparison algorithms cannot accurately reflect the equivalence between them, so a string equivalence replacement strategy is proposed. A substring equivalence replacement dictionary is built in advance, with entries of the form: substring in the string to be matched = equivalent source substring, for example kilovolt=kV. Each entry states that the substrings on both sides of the equals sign are identical in meaning; the substring on the left of the equals sign is a substring occurring in a string to be matched, and the substring on the right is the equivalent substring in the source string.

Before the matching degree is calculated, each string to be matched is checked for the left-hand substring of every entry in the substring equivalence replacement dictionary; if one is found, it is replaced with the source substring on the right of the equals sign. The fusion comparison algorithm is then applied on this basis to calculate the corresponding matching degree. This substantially improves the accuracy of matching and reflects the real matching situation between the two strings being compared.

The invention is applicable to one-to-many comparison of mixed character strings. The matching degree between the source string and each string in a group of strings to be matched is calculated, the resulting matching degrees are sorted, and the string to be matched with the largest matching degree against the source string is determined to be the target string, thereby achieving the best one-to-many string match.
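A minimal Python sketch of this replacement step is given below. The entry format "left=right" follows the text; the function names, the handling of blank or malformed lines, and the choice to apply entries in dictionary order are assumptions made for illustration.

```python
def parse_equivalence_dictionary(lines):
    """Parse entries of the form 'substring to be matched=source substring'.
    Blank lines and lines without '=' are skipped (an assumed policy)."""
    pairs = []
    for line in lines:
        line = line.strip()
        if "=" not in line:
            continue
        left, right = line.split("=", 1)
        left, right = left.strip(), right.strip()
        if left:  # an empty left side would match everywhere
            pairs.append((left, right))
    return pairs


def apply_equivalences(candidate, pairs):
    """Replace every left-hand substring found in the string to be matched
    with the equivalent source-side substring."""
    for left, right in pairs:
        candidate = candidate.replace(left, right)
    return candidate


if __name__ == "__main__":
    pairs = parse_equivalence_dictionary(["kilovolt=kV"])
    print(apply_equivalences("220 kilovolt bus", pairs))  # 220 kV bus
```

Because entries are applied in order, overlapping entries (one left-hand substring containing another) would need to be listed longest first; the source does not say how such cases are handled.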
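The fusion rule and the one-to-many ranking can be sketched as follows. The scoring functions are passed in as parameters, so the sketch does not depend on any particular GST* or POC implementation; the function names are hypothetical, and both scores are assumed to be on the same scale. The text does not state what happens when a equals b, but both branches of the rule give the same value in that case.

```python
def fuse(a, b):
    """Fusion rule: keep a when a > b, otherwise average a and b.
    When a == b the average equals a, so the unspecified case is harmless."""
    return a if a > b else (a + b) / 2


def rank_candidates(source, candidates, gst_star, poc):
    """Score every string to be matched against the source string and
    return (final_degree, candidate) pairs sorted best first."""
    scored = [(fuse(gst_star(source, c), poc(source, c)), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    # Toy scorers standing in for GST* and POC (made-up numbers).
    toy_gst = lambda s, c: {"x": 0.75, "y": 0.25}[c]
    toy_poc = lambda s, c: {"x": 0.5, "y": 0.75}[c]
    ranking = rank_candidates("src", ["y", "x"], toy_gst, toy_poc)
    print(ranking[0][1])  # best-matching target string
```

The first element of the returned list is the target string; keeping the full sorted list also makes it easy to inspect near misses.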

Brief description of the drawings:

Fig. 1 is a flow chart of the new one-to-many string fusion comparison method.

Fig. 2 is an application example of the one-to-many mixed character string fusion comparison method.

Embodiment

The method of the invention is described in further detail below with reference to the accompanying drawings. The invention relates to a mixed character string fusion comparison method. The strings involved are referred to as the source string and the strings to be matched. The invention is particularly suited to finding, from a group of strings to be matched, the target string that best matches the source string.

The embodiment is as follows.

1. Take out the source string and a group of strings to be matched;

2. Read the string equivalence replacement dictionary built in advance and perform equivalence replacement on some characters in this group of strings to be matched. For example, "high-voltage side" is equivalent to "220KV side", and "kilovolt" is equivalent to "kV". Before the string matching degree is calculated, the equivalence replacement dictionary can be used to unify such differing descriptions;

3. Take out the source string, and take out in turn one string to be matched from the array of strings to be matched after equivalence replacement;

4. Calculate the matching degree of the source string and the string to be matched using the GST* algorithm.

The improvement of the GST* algorithm over the traditional GST algorithm is illustrated by the following example.

Take "abcde" and "qbcio", and "abcde" and "qbico", as two pairs of strings to be compared. The GST algorithm gives a matching degree of 40% for both pairs.

With the GST* algorithm, the matching degrees of the two pairs are 43.2% and 40% respectively. The GST* result is therefore more accurate, since only the first pair shares the contiguous substring "bc".

The GST* algorithm gives a higher matching degree to two strings that share a longer common substring.

5. Calculate the matching degree of the source string and the string to be matched using the partial-order string matching algorithm POC.

The two mixed character strings to be matched, which contain Chinese characters, digits and letters, are referred to as the source string and the string to be matched.

First, find the identical characters in the source string and the string to be matched, and record their number.

In formulas (1) and (2), [ ] denotes rounding.

Thirdly, compare the 1st (or 2nd) digit and letter, and the last (or second-to-last) digit and letter, of the source string and the string to be matched; if one of them is equal, adjust the value of matching degree 2 to match_degree2 + 1.

Finally, because matching degree 2 better reflects the actual matching situation in practical applications, it is given the larger weight:

b = match_degree1 × 0.41 + match_degree2 × 0.59    (3)

6. Fuse the matching degree a obtained by GST* in step 4 and the matching degree b obtained by POC in step 5 by weighting. The fusion rule is: if matching degree a is greater than matching degree b, the final matching degree is a; if matching degree a is less than matching degree b, the final matching degree is (a+b)/2.

The final comparison result is thus obtained, taking full advantage of the characteristics of both algorithms;

7. Check whether the loop over the strings to be matched has finished;

8. Sort the matching degrees and find the string with the maximum matching degree, which is taken as the best-matching target string.

Fig. 2 is an application example of the one-to-many comparison method for mixed character strings of the invention, showing the matching results for a group of mixed character strings. Fig. 2 lists the matching degrees obtained with the GST* algorithm, the partial-order string matching algorithm (POC), and the GST*_POC weighted fusion of the two.

It can be seen in Fig. 2 that, between the 1st and 2nd comparisons, the 1st string to be matched is closer to the source string in the 1st row than the 2nd string to be matched is.

From the results in Fig. 2 it can be concluded that the algorithm of the invention obtains satisfactory comparison results.

Claims

1. A one-to-many mixed character string fusion comparison method, which performs, based on Chinese character clustering features, a fused comparison of the similarity of mixed character strings composed of Chinese characters, digits and English letters, so as to express string similarity more accurately, comprising the following main steps:

1) taking out the source string and a group of strings to be matched;

2) reading the string equivalence replacement dictionary built in memory in advance, and performing equivalence replacement on some characters, i.e. substrings, in this group of strings to be matched; using the equivalence replacement dictionary, unifying pairs of substrings that are described differently in the source-string setting and the to-be-matched-string setting but have the same meaning;

3) taking out the source string, and taking out in turn one string to be matched from the array of strings to be matched after equivalence replacement;

4) calculating the matching degree a of the source string and the string to be matched using the GST* algorithm:

using the traditional GST algorithm, obtaining each common substring of the two strings and storing them in a common-substring linked list; if the ratio of the character length of a common substring to the character length of the longer string is greater than or equal to 0.33, multiplying the character count of that common substring by a weight when the matching degree is calculated, the weight being a constant greater than 1; if the ratio of the character length of a common substring to the character length of the longer string is less than 0.33 and the character count of the common substring is greater than the minimum match length, using the character count of that common substring directly when the matching degree is calculated;

5) calculating the matching degree b of the source string and the string to be matched using the partial-order string matching algorithm POC:

the two mixed character strings to be matched, which contain Chinese characters, digits and English letters, are referred to as the source string and the string to be matched;

first, finding the identical characters in the source string and the string to be matched and recording their number; secondly, taking the longer of the source string and the string to be matched as the standard, computing matching degree 1 by formula (1):

taking the shorter of the two strings as the standard, computing matching degree 2 by formula (2):

in formulas (1) and (2), [ ] denotes rounding;

thirdly, comparing the 1st or 2nd digit and letter, and the last or second-to-last digit and letter, of the source string and the string to be matched; if one of them is equal, adjusting the value of matching degree 2 to match_degree2 + 1;

assigning different weights, 0.41 and 0.59, to matching degree 1 and matching degree 2, and computing the matching degree b of the source string and the string to be matched:

b = match_degree1 × 0.41 + match_degree2 × 0.59    (3)

6) fusing the matching degree a obtained by GST* in step 4) and the matching degree b obtained by POC in step 5) by weighting, the fusion rule being: if matching degree a is greater than matching degree b, the final matching degree is a; if matching degree a is less than matching degree b, the final matching degree is (a+b)/2;

7) sorting the matching degrees calculated between the source string and each string to be matched in the array, the string to be matched with the maximum matching degree being taken as the target string that best matches the source string.