CN104008119B - A kind of one-to-many mixed characters string fusion comparison method - Google Patents

Publication number
CN104008119B
Authority
CN
China
Prior art keywords
string
matching
character
matching degree
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310746846.5A
Other languages
Chinese (zh)
Other versions
CN104008119A (en)
Inventor
童晓阳
甄威
郑永康
姜振超
庄先涛
吴继维
张茜
丁宣文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Original Assignee
Southwest Jiaotong University
Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University, Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority to CN201310746846.5A
Publication of CN104008119A
Application granted
Publication of CN104008119B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a one-to-many mixed character string fusion comparison method, which finds, for a source string, the most similar or best-matching target string in a group of strings to be compared. An improved GST* algorithm is applied first, followed by a partial-order string comparison algorithm, POC. The two algorithms capture, respectively, the unordered and the partially ordered characteristics of the strings, and the matching degree values they produce are combined by weighted fusion to obtain the final matching degree. In addition, because synonymous strings are expressed differently in different contexts, a string equivalence replacement strategy replaces synonymous substrings in the source string and the strings to be compared with identical substrings, which greatly improves the matching degree of the two strings. The source string is matched against each string in a group of strings to be matched, the resulting matching degrees are sorted, and the string with the highest matching degree is taken as the target string, achieving the best one-to-many match for mixed character strings.

Description

A one-to-many mixed character string fusion comparison method
Technical field
The invention belongs to the technical field of intelligent string comparison, and in particular relates to a new one-to-many mixed character string fusion comparison method.
Background
String comparison is a fundamental problem in computer science, with important applications in fields such as information retrieval and pattern recognition [1]-[4].
Reference 1 studies approximate string matching for Chinese text, and Reference 2 studies a Chinese string similarity calculation method based on Chinese character clustering features. Reference 3 compares the LCS and GST algorithms. GST is a greedy string comparison algorithm and an unordered matching algorithm; it is widely used at present, but because it compares the two strings character by character, its time complexity is high. Reference 4 studies RKR-GST, an improvement on GST that raises its running efficiency, although the choice of hash function in RKR-GST has a large effect on how efficiently the algorithm runs.
Existing string comparison methods usually use only a single algorithm and cannot make full use of the characteristics of both unordered substrings and partially ordered substrings when computing the matching degree, so their comparison results are often unsatisfactory. Some practical applications involving mixed character strings require comparison to be both accurate and fast. At present, a single matching degree calculation method often fails to express the similarity between strings accurately.
In addition, existing string comparison methods do not consider that synonymous strings may be expressed in different ways, so in such cases they struggle to achieve accurate results and a high matching rate.
Bibliography:
[1] Chen Kaiqu, Zhao Jie, Peng Zhiwei. Fast approximate string matching for Chinese text [J]. Journal of Chinese Information Processing, 2003, 18(2): 58-65
[2] Chinese character string Similarity Measure research [J] modem long jump skill information of the graceful of Wang Jing based on Chinese characters clustering feature Technology, 2011,20 (2):48-53
[3] Yu Haiying. Comparison of the LCS and GST algorithms in string similarity measurement [J]. Electronic Science and Technology, 2011, 24(3): 101-103
[4] Niu Yongjie. Analysis and implementation of the RKR_GST algorithm in .NET [J]. Information Technology, 2012, 3: 171-174
Summary of the invention
In view of the above shortcomings of the prior art, the object of the invention is to provide a more accurate fusion comparison method for mixed character strings. It addresses two problems in practical applications: a single matching degree calculation method struggles to express the similarity between strings accurately, and existing string comparison methods are of little use when synonymous strings are expressed in different ways.
The object of the invention is achieved by the following means:
A one-to-many mixed character string fusion comparison method, which performs fusion comparison of the similarity of mixed character strings composed of Chinese characters, digits and English letters, based on Chinese character clustering features, in order to improve the accuracy with which string similarity is expressed, comprising the following main steps:
1) Take a source string and a group of strings to be matched;
2) Read the string equivalence replacement dictionary built in memory in advance, and perform equivalence replacement on some of the characters (substrings) in the group of strings to be matched; using the equivalence replacement dictionary, substrings that are described differently in the source string context and the string-to-be-matched context but have the same meaning are unified;
3) Take the source string, and take in turn one string to be matched from the array of strings to be matched after equivalence replacement;
4) Compute the matching degree a of the source string and the string to be matched using the GST* algorithm:
Using the traditional GST algorithm, obtain each common substring of the two strings and store them in a common-substring linked list. If the ratio of the length of a common substring to the length of the longer string is greater than or equal to 0.33, the character count of that common substring is multiplied by a weight when computing the matching degree, the weight being a constant greater than 1; if the ratio is less than 0.33 and the character count of the common substring is greater than the minimum match length, the character count of that common substring is used directly when computing the matching degree;
5) Compute the matching degree b of the source string and the string to be matched using the partial-order string matching algorithm POC (Partial Order Comparison):
The two mixed character strings to be matched, containing Chinese characters, digits and English letters, are called the source string and the string to be matched, respectively.
First, find the characters or Chinese characters that are identical in the source string and the string to be matched, and record their number;
Second, taking the longer of the source string and the string to be matched as the standard, compute matching degree 1 (match_degree1):
Taking the shorter of the two strings as the standard, compute matching degree 2 (match_degree2):
In formulas (1) and (2), [ ] denotes rounding;
Third, compare the 1st or 2nd digit or letter, and the last or second-to-last digit or letter, of the source string and the string to be matched; if one of them is equal, matching degree 2 is adjusted to match_degree2+1;
Matching degree 1 and matching degree 2 are given the different weights 0.41 and 0.59, and the final matching value b of the source string and the string to be matched is computed:
b = match_degree1 × 0.41 + match_degree2 × 0.59 (3)
6) Perform weighted fusion of the matching degree a obtained by GST* in step 4) and the matching degree b obtained by POC in step 5). The fusion rule is: if a is greater than b, the final matching degree is a; if a is less than b, the final matching degree is (a+b)/2;
7) Sort the matching degrees computed between the source string and each string in the array of strings to be matched; the string to be matched with the maximum matching degree is taken as the target string that best matches the source string.
In step 4), the common substrings of the source string and the string to be matched are found first, and then common substrings of different lengths are given different weights, increasing the weight of longer common substrings.
In the traditional GST algorithm, shorter common substrings may produce a higher matching degree than a long common substring. The GST* algorithm of the invention improves on this as follows: if the ratio of the length of a common substring to the length of the longer string is greater than or equal to 0.33, the character count of that common substring is multiplied by a weight (a constant greater than 1) when computing the matching degree; if the ratio is less than 0.33 and the character count of the common substring is greater than the minimum match length, the character count of that common substring is used directly when computing the matching degree.
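The weighting rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the matching degree is taken to be the weighted tiled character count divided by the length of the longer string and expressed as a percentage, and the weight 1.08 is chosen only because it reproduces the 43.2% figure quoted in the embodiment (the text itself only requires a constant greater than 1). Tiles returned by the tiling routine already meet the minimum match length, which is read here as "at least" rather than strictly "greater than".

```python
def greedy_string_tiling(a, b, min_match_len=1):
    """Plain Greedy String Tiling: return the common substrings (tiles) of a and b."""
    if min_match_len < 1:
        raise ValueError("min_match_len must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match_len
        matches = []
        # Search phase: find the longest unmarked common substrings.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches, max_len = [(i, j, k)], k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        # Marking phase: turn non-overlapping maximal matches into tiles.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue  # occluded by a tile marked earlier in this round
            for t in range(k):
                marked_a[i + t] = marked_b[j + t] = True
            tiles.append(a[i:i + k])
    return tiles


def gst_star_score(a, b, min_match_len=1, long_ratio=0.33, long_weight=1.08):
    """GST*-style matching degree in percent (scoring formula is an assumption)."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    total = 0.0
    for tile in greedy_string_tiling(a, b, min_match_len):
        if len(tile) / longer >= long_ratio:
            total += len(tile) * long_weight  # long tile: boosted
        else:
            total += len(tile)                # short tile: counted as-is
    return 100.0 * total / longer
```

Under these assumptions, the pair "abcde" / "qbcio" yields the single tile "bc" (ratio 0.4, boosted), while "abcde" / "qbico" yields the two one-character tiles "b" and "c" (ratio 0.2 each, not boosted), so the first pair scores higher than the second.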
In step 5), two mixed character strings containing digits, letters and Chinese characters are taken as the source string and the string to be matched. Matching degree 1 and matching degree 2 are computed using the longer and the shorter string, respectively, as the standard. Then the first one or more digits and letters, and the last one or more digits and letters, are compared for equality, and matching degree 2 is adjusted accordingly. Finally, the two matching degrees are given different weights to obtain the matching degree value between the two strings.
The partial-order string comparison algorithm POC of the invention considers that matching degree 2 better reflects the actual match, so it gives matching degree 2 a slightly larger weight.
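A possible reading of the POC calculation is sketched below in Python. Formulas (1) and (2) are not reproduced in this text, so the sketch assumes that each matching degree is the number of shared characters divided by the length of the longer (or shorter) string, expressed as a rounded percentage; that assumption, the multiset counting of shared characters, the single +1 adjustment, and the names `poc_score` and `_same_at` are all illustrative. Only the weights 0.41 and 0.59 and formula (3) come from the text.

```python
from collections import Counter


def _same_at(s, c, p):
    """True if both sequences have an element at index p and they are equal."""
    try:
        return s[p] == c[p]
    except IndexError:
        return False


def poc_score(source, candidate):
    """POC-style matching degree b (percent scale; formulas (1), (2) assumed)."""
    if not source or not candidate:
        return 0.0
    # Number of identical characters, counted as a multiset intersection.
    common = sum((Counter(source) & Counter(candidate)).values())
    longer = max(len(source), len(candidate))
    shorter = min(len(source), len(candidate))
    match_degree1 = round(100 * common / longer)   # assumed form of formula (1)
    match_degree2 = round(100 * common / shorter)  # assumed form of formula (2)
    # Compare the 1st, 2nd, last and second-to-last ASCII digit or letter.
    s = [ch for ch in source if ch.isascii() and ch.isalnum()]
    c = [ch for ch in candidate if ch.isascii() and ch.isalnum()]
    if any(_same_at(s, c, p) for p in (0, 1, -1, -2)):
        match_degree2 += 1
    return match_degree1 * 0.41 + match_degree2 * 0.59  # formula (3)
```

For example, under these assumptions "ab12" and "ab34" share two of four characters, giving 50 for both degrees; the first letters agree, so matching degree 2 becomes 51 and b = 50 × 0.41 + 51 × 0.59 = 50.59. Note that with this reading, identical strings can score slightly above 100 because of the +1 adjustment.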
The invention provides a string equivalence replacement strategy. For example, "high-voltage side" and "220KV side", or "kilovolt" and "kV", are equivalent in meaning. Existing comparison algorithms cannot accurately reflect the equivalence between them, so the string equivalence replacement strategy is proposed. A substring equivalence replacement dictionary is built in advance, with entries of the form: substring in the string to be matched = equivalent source substring, for example kilovolt=kV. An entry states that the substrings on the two sides of the equals sign have the same meaning; the substring on the left is a substring of a string to be matched, and the substring on the right is the equivalent substring in the source string.
Before the matching degree is computed, each string to be matched is checked for substrings that appear on the left side of any row of the substring equivalence replacement dictionary; if one is found, it is replaced with the source substring on the right of the equals sign. The fusion comparison algorithm is then applied on this basis to compute the corresponding matching degree. This greatly improves matching accuracy and reflects the true match between the two strings being compared.
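The dictionary format and the replacement pass can be sketched as follows. This is only an illustration: the function names are hypothetical, the entry "220KV side=high-voltage side" is an assumed direction for the example pair given in the text, and replacing longer left-hand substrings first is a design choice made here to avoid a short entry rewriting part of a longer one.

```python
def load_equivalents(lines):
    """Parse 'left=right' rows into a dict; blank or malformed rows are skipped."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        left, right = line.split("=", 1)
        left, right = left.strip(), right.strip()
        if left:
            table[left] = right
    return table


def apply_equivalents(candidate, table):
    """Replace every left-hand substring found in candidate with its right-hand side."""
    # Longest keys first, so a short key cannot break up a longer one.
    for left in sorted(table, key=len, reverse=True):
        candidate = candidate.replace(left, table[left])
    return candidate


table = load_equivalents(["kilovolt=kV", "220KV side=high-voltage side"])
print(apply_equivalents("110 kilovolt bus", table))
```

After this pass, each string to be matched uses the same wording as the source string for the listed synonyms, so the later matching degree is not penalised for a purely notational difference.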
The invention is suited to one-to-many comparison of mixed character strings. The matching degree between the source string and each string in a group of strings to be matched is computed, the resulting matching degrees are sorted, and the string to be matched with the highest matching degree with the source string is identified as the target string, thereby achieving the best one-to-many string match.
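The fusion rule of step 6) and the one-to-many ranking of step 7) can be sketched together. The two scoring functions are passed in as parameters so that the sketch stays self-contained; `fuse`, `rank_candidates` and the toy scorers in the usage lines are hypothetical names for illustration. The text does not state the case a = b; both branches give the same value there, so the sketch folds it into the averaging branch.

```python
def fuse(a, b):
    """Fusion rule: keep a when it is larger, otherwise average a and b."""
    if a > b:
        return a
    return (a + b) / 2  # a < b; for a == b this also equals a


def rank_candidates(source, candidates, score_a, score_b):
    """Return (final matching degree, candidate) pairs, best match first."""
    scored = [(fuse(score_a(source, c), score_b(source, c)), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy scorers standing in for GST* and POC, for demonstration only.
exact = lambda s, c: 100.0 if s == c else 0.0
flat = lambda s, c: 50.0
ranking = rank_candidates("y", ["x", "y"], exact, flat)
print(ranking[0][1])  # the target string is the first entry of the ranking
```

Passing the scorers as arguments keeps the ranking step independent of how a and b are computed, which makes it easy to swap in any GST* and POC implementation.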
Brief description of the drawings:
Fig. 1 is a flow chart of the new one-to-many string fusion comparison method.
Fig. 2 is an application example of the one-to-many mixed character string fusion comparison method.
Detailed description
The method of the invention is described in further detail below with reference to the accompanying drawings. The invention relates specifically to a fusion comparison method for mixed character strings. The strings to be matched are first called the source string and the string to be matched, respectively. The invention is particularly suited to finding, in a group of strings to be matched, the target string that best matches the source string.
The embodiment is as follows.
1. Take a source string and a group of strings to be matched;
2. Read the string equivalence replacement dictionary built in advance, and perform equivalence replacement on some of the characters in the group of strings to be matched. For example, "high-voltage side" is equivalent to "220KV side", and "kilovolt" is equivalent to "kV". Before the string matching degree is computed, such differing descriptions can be unified using the equivalence replacement dictionary;
3. Take the source string, and take in turn one string to be matched from the array of strings to be matched after equivalence replacement;
4. Compute the matching degree of the source string and the string to be matched using the GST* algorithm.
The improvement of GST* over traditional GST is illustrated by the following example.
Take two pairs of strings to be compared: "abcde" and "qbcio", and "abcde" and "qbico". Computed with the GST algorithm, the matching degree of both pairs is 40%.
Computed with the GST* algorithm, the matching degrees of the two pairs are 43.2% and 40%, respectively. The comparison result of GST* is therefore more accurate: the first pair shares the contiguous substring "bc", whereas the second pair shares only the isolated characters "b" and "c".
GST* gives a higher matching degree to two strings that share a longer common substring.
5. Compute the matching degree of the source string and the string to be matched using the partial-order string matching algorithm POC.
The two mixed character strings to be matched, containing Chinese characters, digits and letters, are called the source string and the string to be matched, respectively.
First, find the characters that are identical in the source string and the string to be matched, and record their number.
Second, taking the longer of the source string and the string to be matched as the standard, compute matching degree 1 (match_degree1):
Taking the shorter of the two strings as the standard, compute matching degree 2 (match_degree2):
In formulas (1) and (2), [ ] denotes rounding.
Third, compare the 1st (or 2nd) digit or letter, and the last (or second-to-last) digit or letter, of the source string and the string to be matched; if one of them is equal, matching degree 2 is adjusted to match_degree2+1.
Finally, because matching degree 2 better reflects the actual match in practical applications, it is given the larger weight.
Matching degree 1 and matching degree 2 are given the different weights 0.41 and 0.59, and the final matching value b of the source string and the string to be matched is computed:
b = match_degree1 × 0.41 + match_degree2 × 0.59 (3)
6. Perform weighted fusion of the matching degree a obtained by GST* in step 4 and the matching degree b obtained by POC in step 5. The fusion rule is: if a is greater than b, the final matching degree is a; if a is less than b, the final matching degree is (a+b)/2.
The final comparison result is thereby obtained, making full use of the characteristics of both algorithms;
7. Check whether the loop over the strings to be matched has finished;
8. Sort the matching degrees, and take the string with the maximum matching degree as the best-matching target string.
Fig. 2 is an application example of the one-to-many comparison method for mixed character strings of the invention, computing the matching of a group of mixed character strings. Fig. 2 lists the matching degrees obtained with the GST* algorithm, with the partial-order string matching algorithm (POC), and with GST*_POC, the weighted fusion of the two.
As can be seen from the 1st and 2nd comparisons in Fig. 2, the 1st string to be matched is closer than the 2nd to the source string in the 1st row.
The results in Fig. 2 lead to the conclusion that the algorithm of the invention achieves satisfactory comparison results.

Claims (1)

1. A one-to-many mixed character string fusion comparison method, which performs fusion comparison of the similarity of mixed character strings composed of Chinese characters, digits and English letters, based on Chinese character clustering features, in order to improve the accuracy with which string similarity is expressed, comprising the following main steps:
1) take a source string and a group of strings to be matched;
2) read the string equivalence replacement dictionary built in memory in advance, and perform equivalence replacement on some of the characters, i.e. substrings, in the group of strings to be matched; using the equivalence replacement dictionary, unify substrings that are described differently in the source string context and the string-to-be-matched context but have the same meaning;
3) take the source string, and take in turn one string to be matched from the array of strings to be matched after equivalence replacement;
4) compute the matching degree a of the source string and the string to be matched using the GST* algorithm:
using the traditional GST algorithm, obtain each common substring of the two strings and store them in a common-substring linked list; if the ratio of the length of a common substring to the length of the longer string is greater than or equal to 0.33, the character count of that common substring is multiplied by a weight when computing the matching degree, the weight being a constant greater than 1; if the ratio is less than 0.33 and the character count of the common substring is greater than the minimum match length, the character count of that common substring is used directly when computing the matching degree;
5) compute the matching degree b of the source string and the string to be matched using the partial-order string matching algorithm POC:
the two mixed character strings to be matched, containing Chinese characters, digits and English letters, are called the source string and the string to be matched, respectively;
first, find the characters that are identical in the source string and the string to be matched, and record their number; second, taking the longer of the source string and the string to be matched as the standard, compute matching degree 1:
taking the shorter of the two strings as the standard, compute matching degree 2:
in formulas (1) and (2), [ ] denotes rounding;
third, compare the 1st or 2nd digit or letter, and the last or second-to-last digit or letter, of the source string and the string to be matched; if one of them is equal, matching degree 2 is adjusted to match_degree2+1;
matching degree 1 and matching degree 2 are given the different weights 0.41 and 0.59, and the matching degree b of the source string and the string to be matched is computed:
b = match_degree1 × 0.41 + match_degree2 × 0.59 (3)
6) perform weighted fusion of the matching degree a obtained by GST* in step 4) and the matching degree b obtained by POC in step 5); the fusion rule is: if a is greater than b, the final matching degree is a; if a is less than b, the final matching degree is (a+b)/2;
7) sort the matching degrees computed between the source string and each string in the array of strings to be matched; the string to be matched with the maximum matching degree is taken as the target string that best matches the source string.
CN201310746846.5A 2013-12-30 2013-12-30 A kind of one-to-many mixed characters string fusion comparison method Expired - Fee Related CN104008119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310746846.5A CN104008119B (en) 2013-12-30 2013-12-30 A kind of one-to-many mixed characters string fusion comparison method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310746846.5A CN104008119B (en) 2013-12-30 2013-12-30 A kind of one-to-many mixed characters string fusion comparison method

Publications (2)

Publication Number Publication Date
CN104008119A CN104008119A (en) 2014-08-27
CN104008119B true CN104008119B (en) 2017-09-26

Family

ID=51368778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310746846.5A Expired - Fee Related CN104008119B (en) 2013-12-30 2013-12-30 A kind of one-to-many mixed characters string fusion comparison method

Country Status (1)

Country Link
CN (1) CN104008119B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732041B (en) * 2015-04-13 2017-09-29 国网四川省电力公司电力科学研究院 A kind of empty terminal table automatic generation method based on many SCD templates
CN105184713A (en) * 2015-07-17 2015-12-23 四川久远银海软件股份有限公司 Intelligent matching and sorting system and method capable of benefitting contrast of assigned drugs of medical insurance
CN107102998A (en) 2016-02-22 2017-08-29 阿里巴巴集团控股有限公司 A kind of String distance computational methods and device
CN106484678A (en) * 2016-10-13 2017-03-08 北京智能管家科技有限公司 A kind of short text similarity calculating method and device
CN106919663A (en) * 2017-02-14 2017-07-04 华北电力大学 Character string matching method in the multi-source heterogeneous data fusion of power regulation system
CN109741745A (en) * 2019-01-28 2019-05-10 中国银行股份有限公司 A kind of transaction air navigation aid and device
CN112215216A (en) * 2020-09-10 2021-01-12 中国东方电气集团有限公司 Character string fuzzy matching system and method for image recognition result

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329680A (en) * 2008-07-17 2008-12-24 安徽科大讯飞信息科技股份有限公司 Large scale rapid matching method of sentence surface

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4538449B2 (en) * 2003-03-03 2010-09-08 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ String search method and equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329680A (en) * 2008-07-17 2008-12-24 安徽科大讯飞信息科技股份有限公司 Large scale rapid matching method of sentence surface

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Unified Approach for Computing Document Similarity with Fingerprinting and Alignments;Jongkyu Seo等;《2012 IEEE 12th International Conference on Computer and Information Technology》;20121029;第448-455页 *
多种字符串相似度算法的比较研究;牛永洁 等;《计算机与数字工程》;20120320(第3期);第14-17页 *

Also Published As

Publication number Publication date
CN104008119A (en) 2014-08-27

Similar Documents

Publication Publication Date Title
CN104008119B (en) A kind of one-to-many mixed characters string fusion comparison method
Joshi et al. Language geometry using random indexing
US10579661B2 (en) System and method for machine learning and classifying data
Asada et al. Enhancing drug-drug interaction extraction from texts by molecular structure information
CN104102626B (en) A kind of method for short text Semantic Similarity Measurement
CN107562824B (en) Text similarity detection method
US8943091B2 (en) System, method, and computer program product for performing a string search
CN105808709B (en) Recognition of face method for quickly retrieving and device
Yamaguchi et al. Text segmentation by language using minimum description length
Zhang et al. An improved Adagrad gradient descent optimization algorithm
WO2020114100A1 (en) Information processing method and apparatus, and computer storage medium
CN103345496A (en) Multimedia information searching method and system
US20160292198A1 (en) A method of generating a reference index data structure and method for finding a position of a data pattern in a reference data structure
CN103823814A (en) Information processing method and information processing device
CN104021234A (en) Large-scale image library retrieval method based on self-adaptive bit allocation Hash algorithm
WO2023004528A1 (en) Distributed system-based parallel named entity recognition method and apparatus
CN109325242A (en) It is word-based to judge method, device and equipment that whether sentence be aligned to translation
CN110110035A (en) Data processing method and device and computer readable storage medium
CN109657061A (en) A kind of Ensemble classifier method for the more word short texts of magnanimity
Boucher et al. Computing the original eBWT faster, simpler, and with less memory
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
WO2008119297A1 (en) Method for matching character string based on characteristic parameters
CN105183792A (en) Distributed fast text classification method based on locality sensitive hashing
CN105893601B (en) A kind of data comparison method
CN108170716B (en) Text duplicate checking method based on human vision

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170926

Termination date: 20181230