CN104008119B - One-to-many mixed character string fusion comparison method - Google Patents
One-to-many mixed character string fusion comparison method
- Publication number
- CN104008119B (application CN201310746846.5A; earlier publication CN104008119A)
- Authority
- CN
- China
- Prior art keywords
- string
- matching
- character
- matching degree
- source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a one-to-many mixed character string fusion comparison method, which finds, from a group of strings to be compared, the target string that is most similar to or best matches a source string. An improved GST* algorithm is applied first, followed by a partially ordered string comparison algorithm, POC. The two algorithms exploit, respectively, the unordered and the partially ordered matching characteristics of the strings, and the matching degree values they produce are fused by weighting to obtain the final matching degree. In addition, because synonymous strings may be expressed differently in different contexts, a string equivalence replacement strategy replaces synonymous substrings in the source string and the strings to be compared with an identical string, which greatly improves the matching degree of the two strings. The source string is matched against each string in a group of strings to be matched, the resulting matching degrees are sorted, and the string with the highest matching degree is taken as the target string, achieving the best one-to-many match for mixed character strings.
Description
Technical field
The invention belongs to the technical field of intelligent string comparison, and in particular relates to a new one-to-many mixed character string fusion comparison method.
Background technology
String comparison is a basic problem in computer science, and research on it has important application value in many fields such as information retrieval and pattern recognition [1]-[4].
Document 1 studies fast approximate string matching for Chinese text, and document 2 studies a Chinese character string similarity calculation method based on Chinese character clustering features. Document 3 compares the LCS and GST algorithms. The GST algorithm is a greedy string comparison algorithm and also an unordered matching algorithm. It is widely used at present, but because it compares the two strings character by character, its time complexity is high. Document 4 studies the RKR-GST algorithm, an improvement of GST that raises its running efficiency, although the choice of hash function in RKR-GST strongly affects the algorithm's running performance.
Existing string comparison methods usually use only one algorithm and cannot make full use of the characteristics of both unordered substrings and partially ordered substrings when calculating the matching degree, so their comparison results are often unsatisfactory. In some practical applications involving mixed character strings, the comparison must be both accurate and fast. At present, a single matching degree calculation method often cannot accurately express the degree of similarity between strings.
In addition, existing string comparison methods do not consider that synonymous strings may be expressed in different ways, so in such cases they can hardly meet the requirement of a more accurate, high matching rate.
Bibliography:
[1] Chen Kaiqu, Zhao Jie, Peng Zhiwei. Fast approximate string matching for Chinese text [J]. Journal of Chinese Information Processing, 2003, 18(2): 58-65
[2] Chinese character string Similarity Measure research [J] modem long jump skill information of the graceful of Wang Jing based on Chinese characters clustering feature
Technology, 2011,20 (2):48-53
[3] Yu Haiying. Comparison of the LCS and GST algorithms in string similarity measurement [J]. Electronic Science and Technology, 2011, 24(3): 101-103
[4] Niu Yongjie. Analysis and implementation of the RKR_GST algorithm in .NET [J]. Information Technology, 2012, 3: 171-174
Summary of the invention
In view of the above shortcomings of the prior art, the object of the invention is to provide a more accurate mixed character string fusion comparison method. It addresses two problems in practical applications: a single matching degree calculation method can hardly express the degree of similarity between strings accurately, and existing string comparison methods perform poorly when synonymous strings are expressed in different ways.
The object of the invention is achieved by the following means:
A one-to-many mixed character string fusion comparison method, which performs a fusion comparison of the similarity of mixed character strings composed of Chinese characters, digits and English letters, based on Chinese character clustering features, so as to express the similarity of the strings more accurately, comprising the following main steps:
1) Take out the source string and a group of strings to be matched;
2) Read the string equivalence replacement dictionary built in memory in advance, and perform equivalence replacement on some of the characters (substrings) in this group of strings to be matched; using the equivalence replacement dictionary, unify substrings that are described differently in the source string context and in the context of the strings to be matched but have the same meaning;
3) Take the source string, and take out in turn one string to be matched from the array of strings to be matched after equivalence replacement;
4) Calculate the matching degree a of the source string and the string to be matched using the GST* algorithm:
Using the traditional GST algorithm, obtain each common substring of the two strings and store them in a common-substring linked list. If the ratio of the length of a common substring to the length of the longer string is greater than or equal to 0.33, the number of characters of that common substring is multiplied by a weight when calculating the matching degree, the weight being a constant greater than 1; if the ratio is less than 0.33 and the number of characters of the common substring is greater than the minimum match length, the number of characters of that common substring is used directly in the calculation of the matching degree;
5) Calculate the matching degree b of the source string and the string to be matched using the partially ordered string matching algorithm POC (Partial Order Comparison):
The two mixed character strings to be matched, which contain Chinese characters, digits and English letters, are called the source string and the string to be matched.
First, find the characters (including Chinese characters) that are identical in the source string and the string to be matched, and record their number;
Secondly, taking the longer of the source string and the string to be matched as the standard, obtain matching degree 1 (match_degree1) by formula (1);
Taking the shorter of the two strings as the standard, obtain matching degree 2 (match_degree2) by formula (2);
In formulas (1) and (2), [ ] denotes rounding;
Thirdly, compare the 1st or 2nd digits and letters, and the last or second-to-last digits and letters, of the source string and the string to be matched; if one of them is equal, adjust matching degree 2 to match_degree2 + 1;
Different weights, 0.41 and 0.59, are assigned to matching degree 1 and matching degree 2 to obtain the final matching value b of the source string and the string to be matched:
b = match_degree1 × 0.41 + match_degree2 × 0.59    (3)
6) Fuse the matching degree a obtained by the GST* calculation in step 4) and the matching degree b obtained by the POC calculation in step 5) by weighting. The fusion rule is: if matching degree a is greater than matching degree b, the final matching degree is a; if matching degree a is less than matching degree b, the final matching degree is (a+b)/2;
7) Sort the matching degrees calculated between the source string and each string to be matched in the array, and take the string to be matched with the largest matching degree as the target string that best matches the source string.
In step 4), the common substrings of the source string and the string to be matched are found first, and then common substrings of different lengths are given different weights, increasing the weight of longer common substrings.
The GST* algorithm of the invention addresses a weakness of the traditional GST algorithm, in which shorter common substrings may produce a matching degree as large as or larger than that of a longer common substring. It is improved as follows: if the ratio of the length of a common substring to the length of the longer string is greater than or equal to 0.33, the number of characters of that common substring is multiplied by a weight (a constant greater than 1) when calculating the matching degree; if the ratio is less than 0.33 and the number of characters of the common substring is greater than the minimum match length, the number of characters of that common substring is used directly in the calculation of the matching degree.
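A minimal Python sketch of the GST* weighting described above. The tiling routine is a plain, unoptimized greedy string tiling; the function names, the normalization by the length of the longer string, the cap at 100, the use of `>=` for the minimum match length, and the default weight of 1.08 are assumptions made for illustration, not values stated by the method. The weight 1.08 is chosen only because, under this normalization, it reproduces the 43.2% that the description reports for "abcde" and "qbcio" (2 × 1.08 / 5).

```python
def greedy_string_tiling(a: str, b: str, min_match_len: int = 1) -> list[str]:
    """Return non-overlapping common substrings (tiles), longest first."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles: list[str] = []
    while True:
        best_len = 0
        best: list[tuple[int, int]] = []  # start positions of the longest matches
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the match over characters not yet covered by a tile.
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best = k, [(i, j)]
                elif k == best_len and k > 0:
                    best.append((i, j))
        if best_len == 0 or best_len < min_match_len:
            break
        for i, j in best:
            # Skip matches that overlap a tile marked earlier in this round.
            if any(marked_a[i:i + best_len]) or any(marked_b[j:j + best_len]):
                continue
            for k in range(best_len):
                marked_a[i + k] = marked_b[j + k] = True
            tiles.append(a[i:i + best_len])
    return tiles


def gst_star_degree(a: str, b: str, weight: float = 1.08,
                    ratio_threshold: float = 0.33,
                    min_match_len: int = 1) -> float:
    """Weighted matching degree in percent (normalization is an assumption)."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    total = 0.0
    for tile in greedy_string_tiling(a, b, min_match_len):
        if len(tile) / longer >= ratio_threshold:
            total += len(tile) * weight   # long common substring: boosted
        else:
            total += len(tile)            # short common substring: counted as is
    return min(100.0 * total / longer, 100.0)  # assumed cap at 100


print(gst_star_degree("abcde", "qbcio"))  # shares the contiguous tile "bc"
print(gst_star_degree("abcde", "qbico"))  # shares only the single tiles "b" and "c"
```

Under these assumptions the first pair scores about 43.2 and the second 40.0, because only the two-character tile "bc" reaches the 0.33 length ratio.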
In step 5), two mixed character strings containing digits, letters and Chinese characters serve as the source string and the string to be matched. Matching degree 1 and matching degree 2 are obtained by taking the longer string and the shorter string, respectively, as the standard. Then the first one or more digits and letters and the last one or more digits and letters are compared for equality, and matching degree 2 is corrected accordingly. Finally, different weights are assigned to the two matching degrees to obtain the matching degree value between the two strings.
The partially ordered string comparison algorithm POC of the invention considers that matching degree 2 better reflects the actual matching situation, and therefore gives matching degree 2 a somewhat larger weight.
The invention provides a string equivalence replacement strategy. For example, "high-voltage side" and "220KV side", or "kilovolt" and "kV", are equivalent in meaning. Existing comparison algorithms cannot accurately reflect the equivalence between them, so a string equivalence replacement strategy is proposed. A substring equivalence replacement dictionary is built in advance, in which each row has the form: substring in the string to be matched = equivalent source substring, for example kilovolt=kV. A row states that the substrings on the two sides of the equal sign have the same meaning; the substring on the left of the equal sign is a substring that occurs in the strings to be matched, and the substring on the right is the equivalent substring in the source string.
Before the matching degree is calculated, each string to be matched is checked for the left-hand substring of every row of the substring equivalence replacement dictionary; if one is present, it is replaced with the source substring on the right of the equal sign. The fusion comparison algorithm is then applied on this basis to calculate the corresponding matching degree. This greatly improves the accuracy of matching and reflects the real matching situation between the two strings being compared.
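The dictionary lookup described above might look like the following Python sketch. The row format `left=right` follows the text; the function names, the in-memory list of rows, and the choice to replace longer keys first are assumptions for illustration.

```python
def load_equivalence_dict(rows: list[str]) -> dict[str, str]:
    """Parse rows of the form 'substring to be matched=source substring'."""
    mapping: dict[str, str] = {}
    for row in rows:
        row = row.strip()
        if not row or "=" not in row:
            continue  # ignore blank or malformed rows
        left, right = row.split("=", 1)
        mapping[left.strip()] = right.strip()
    return mapping


def apply_equivalence(text: str, mapping: dict[str, str]) -> str:
    """Replace every left-hand substring found in text by its right-hand side."""
    # Longer keys first, so a short key cannot break up a longer one (assumption).
    for left in sorted(mapping, key=len, reverse=True):
        text = text.replace(left, mapping[left])
    return text


rows = ["kilovolt=kV", "high-voltage side=220KV side"]
mapping = load_equivalence_dict(rows)
print(apply_equivalence("main transformer high-voltage side", mapping))
print(apply_equivalence("110 kilovolt bus", mapping))
```

The replacement is applied only to the strings to be matched, so after this step both sides of the comparison use the source string's wording.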
The invention applies to one-to-many comparison of mixed character strings. The matching degree between the source string and each of a group of strings to be matched is calculated, the resulting matching degrees are sorted, and the string to be matched that has the largest matching degree with the source string is found and determined to be the target string, thereby achieving the best one-to-many string match.
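The one-to-many selection itself is independent of how the matching degree is computed, as the following Python sketch shows. `rank_candidates` and `stand_in_score` are hypothetical names, and the scorer is a stand-in built on the standard library's `difflib.SequenceMatcher`, used here only to keep the example runnable; it is not the GST*/POC fusion of the invention.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def rank_candidates(source: str, candidates: Sequence[str],
                    score: Callable[[str, str], float]) -> list[tuple[float, str]]:
    """Score every string to be matched and sort by descending matching degree."""
    scored = [(score(source, candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def stand_in_score(source: str, candidate: str) -> float:
    # Stand-in similarity in percent; replace with the fused matching degree.
    return 100.0 * SequenceMatcher(None, source, candidate).ratio()


ranking = rank_candidates("abcde", ["qbico", "abcdx", "zzzzz"], stand_in_score)
target = ranking[0][1]  # string with the largest matching degree
print(target)
```

Passing the scorer as a parameter keeps the ranking step unchanged when the stand-in is swapped for a fused GST*/POC score.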
Brief description of the drawings:
Fig. 1 is the flow chart of the new one-to-many string fusion comparison method.
Fig. 2 is an application example of the one-to-many mixed character string fusion comparison method.
Embodiment
The method of the invention is described in further detail below with reference to the accompanying drawings. The invention relates in particular to a mixed character string fusion comparison method. The two strings being compared are first called the source string and the string to be matched. The invention is particularly suited to finding, from a group of strings to be matched, the target string that best matches the source string.
The embodiment is as follows.
1. Take out the source string and a group of strings to be matched;
2. Read the string equivalence replacement dictionary built in advance, and perform equivalence replacement on some of the characters in this group of strings to be matched. For example, "high-voltage side" is equivalent to "220KV side", and "kilovolt" is equivalent to "kV". Before the string matching degree is calculated, such differing descriptions can be unified using the equivalence replacement dictionary;
3. Take the source string, and take out in turn one string to be matched from the array of strings to be matched after equivalence replacement;
4. Calculate the matching degree of the source string and the string to be matched using the GST* algorithm.
The improvement of the GST* algorithm over the traditional GST algorithm is illustrated by the following example.
Take two pairs of strings to be compared: "abcde" with "qbcio", and "abcde" with "qbico". The GST algorithm gives a matching degree of 40% for both pairs.
The GST* algorithm gives matching degrees of 43.2% and 40%, respectively. The first pair shares the contiguous substring "bc", whereas the second pair shares only the separate characters "b" and "c", so the GST* result is more accurate.
The GST* algorithm gives a higher matching degree to two strings that share a longer common substring.
5. Calculate the matching degree of the source string and the string to be matched using the partially ordered string matching algorithm POC.
The two mixed character strings to be matched, which contain Chinese characters, digits and letters, are called the source string and the string to be matched.
First, find the characters that are identical in the source string and the string to be matched, and record their number.
Secondly, taking the longer of the source string and the string to be matched as the standard, obtain matching degree 1 (match_degree1) by formula (1).
Taking the shorter of the two strings as the standard, obtain matching degree 2 (match_degree2) by formula (2).
In formulas (1) and (2), [ ] denotes rounding.
Thirdly, compare the 1st (or 2nd) digits and letters and the last (or second-to-last) digits and letters of the source string and the string to be matched; if one of them is equal, adjust matching degree 2 to match_degree2 + 1.
Finally, because matching degree 2 better reflects the actual matching situation in practical applications, matching degree 2 is given the larger weight.
Different weights, 0.41 and 0.59, are assigned to matching degree 1 and matching degree 2 to obtain the final matching value b of the source string and the string to be matched:
b = match_degree1 × 0.41 + match_degree2 × 0.59    (3)
6. Fuse the matching degree a obtained by the GST* calculation in step 4 and the matching degree b obtained by the POC calculation in step 5 by weighting. The fusion rule is: if matching degree a is greater than matching degree b, the final matching degree is a; if matching degree a is less than matching degree b, the final matching degree is (a+b)/2.
This yields the final comparison result and makes full use of the characteristics of both algorithms;
7. Check whether the loop over the strings to be matched has finished;
8. Sort the matching degrees and find the string corresponding to the largest matching degree as the best-matching target string.
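The fusion rule of step 6 is small enough to state directly in Python. The function name `fuse` is hypothetical, and both inputs are assumed to be on the same scale. The text does not say what happens when a equals b; in that case both branches of the rule give the same value, so the sketch needs no special case.

```python
def fuse(a: float, b: float) -> float:
    """Fuse the GST* matching degree a with the POC matching degree b."""
    if a > b:
        return a          # GST* is the larger value: keep it unchanged
    return (a + b) / 2    # otherwise average; equals a when a == b


print(fuse(43.2, 40.59))  # a > b, so the result is a
print(fuse(40.0, 50.0))   # a < b, so the result is the mean
```

The rule is asymmetric: a high POC value is damped by averaging, while a high GST* value is taken as is.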
Fig. 2 is an application example of the one-to-many comparison method for mixed character strings of the invention, in which the matching of a group of mixed character strings is calculated. Fig. 2 lists the matching degrees obtained with the GST* algorithm, with the partially ordered string matching algorithm (POC algorithm), and with GST*_POC, the weighted fusion of the two algorithms.
It can be seen that, in the 1st and 2nd comparisons in Fig. 2, the 1st string to be matched is closer than the 2nd string to be matched to the source string in the 1st row.
From the results in Fig. 2 it can be concluded that the algorithm of the invention gives satisfactory comparison results.
Claims (1)
1. A one-to-many mixed character string fusion comparison method, which performs a fusion comparison of the similarity of mixed character strings composed of Chinese characters, digits and English letters, based on Chinese character clustering features, so as to express the similarity of the strings more accurately, comprising the following main steps:
1) taking out a source string and a group of strings to be matched;
2) reading the string equivalence replacement dictionary built in memory in advance, and performing equivalence replacement on some of the characters, i.e. substrings, in this group of strings to be matched; using the equivalence replacement dictionary, unifying substrings that are described differently in the source string context and in the context of the strings to be matched but have the same meaning;
3) taking the source string, and taking out in turn one string to be matched from the array of strings to be matched after equivalence replacement;
4) calculating the matching degree a of the source string and the string to be matched using the GST* algorithm:
using the traditional GST algorithm, obtaining each common substring of the two strings and storing them in a common-substring linked list; if the ratio of the length of a common substring to the length of the longer string is greater than or equal to 0.33, multiplying the number of characters of that common substring by a weight when calculating the matching degree, the weight being a constant greater than 1; if the ratio is less than 0.33 and the number of characters of the common substring is greater than the minimum match length, using the number of characters of that common substring directly in the calculation of the matching degree;
5) calculating the matching degree b of the source string and the string to be matched using the partially ordered string matching algorithm POC:
the two mixed character strings to be matched, which contain Chinese characters, digits and English letters, being called the source string and the string to be matched,
first, finding the characters that are identical in the source string and the string to be matched, and recording their number; secondly, taking the longer of the source string and the string to be matched as the standard, obtaining matching degree 1 by formula (1);
taking the shorter of the two strings as the standard, obtaining matching degree 2 by formula (2);
in formulas (1) and (2), [ ] denotes rounding;
thirdly, comparing the 1st or 2nd digits and letters, and the last or second-to-last digits and letters, of the source string and the string to be matched; if one of them is equal, adjusting matching degree 2 to match_degree2 + 1;
assigning different weights, 0.41 and 0.59, to matching degree 1 and matching degree 2 to obtain the matching degree b of the source string and the string to be matched:
b = match_degree1 × 0.41 + match_degree2 × 0.59    (3)
6) fusing the matching degree a obtained by the GST* calculation in step 4) and the matching degree b obtained by the POC calculation in step 5) by weighting, the fusion rule being: if matching degree a is greater than matching degree b, the final matching degree is a; if matching degree a is less than matching degree b, the final matching degree is (a+b)/2;
7) sorting the matching degrees calculated between the source string and each string to be matched in the array, and taking the string to be matched with the largest matching degree as the target string that best matches the source string.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310746846.5A CN104008119B (en) | 2013-12-30 | 2013-12-30 | A kind of one-to-many mixed characters string fusion comparison method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310746846.5A CN104008119B (en) | 2013-12-30 | 2013-12-30 | A kind of one-to-many mixed characters string fusion comparison method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104008119A CN104008119A (en) | 2014-08-27 |
CN104008119B true CN104008119B (en) | 2017-09-26 |
Family
ID=51368778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310746846.5A Expired - Fee Related CN104008119B (en) | 2013-12-30 | 2013-12-30 | A kind of one-to-many mixed characters string fusion comparison method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104008119B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104732041B (en) * | 2015-04-13 | 2017-09-29 | 国网四川省电力公司电力科学研究院 | A kind of empty terminal table automatic generation method based on many SCD templates |
CN105184713A (en) * | 2015-07-17 | 2015-12-23 | 四川久远银海软件股份有限公司 | Intelligent matching and sorting system and method capable of benefitting contrast of assigned drugs of medical insurance |
CN107102998A (en) | 2016-02-22 | 2017-08-29 | 阿里巴巴集团控股有限公司 | A kind of String distance computational methods and device |
CN106484678A (en) * | 2016-10-13 | 2017-03-08 | 北京智能管家科技有限公司 | A kind of short text similarity calculating method and device |
CN106919663A (en) * | 2017-02-14 | 2017-07-04 | 华北电力大学 | Character string matching method in the multi-source heterogeneous data fusion of power regulation system |
CN109741745A (en) * | 2019-01-28 | 2019-05-10 | 中国银行股份有限公司 | A kind of transaction air navigation aid and device |
CN112215216A (en) * | 2020-09-10 | 2021-01-12 | 中国东方电气集团有限公司 | Character string fuzzy matching system and method for image recognition result |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101329680A (en) * | 2008-07-17 | 2008-12-24 | 安徽科大讯飞信息科技股份有限公司 | Large scale rapid matching method of sentence surface |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4538449B2 (en) * | 2003-03-03 | 2010-09-08 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | String search method and equipment |
Non-Patent Citations (2)
Title |
---|
A Unified Approach for Computing Document Similarity with Fingerprinting and Alignments;Jongkyu Seo等;《2012 IEEE 12th International Conference on Computer and Information Technology》;20121029;第448-455页 * |
多种字符串相似度算法的比较研究;牛永洁 等;《计算机与数字工程》;20120320(第3期);第14-17页 * |
Also Published As
Publication number | Publication date |
---|---|
CN104008119A (en) | 2014-08-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104008119B (en) | A kind of one-to-many mixed characters string fusion comparison method | |
Joshi et al. | Language geometry using random indexing | |
US10579661B2 (en) | System and method for machine learning and classifying data | |
Asada et al. | Enhancing drug-drug interaction extraction from texts by molecular structure information | |
CN104102626B (en) | A kind of method for short text Semantic Similarity Measurement | |
CN107562824B (en) | Text similarity detection method | |
US8943091B2 (en) | System, method, and computer program product for performing a string search | |
CN105808709B (en) | Recognition of face method for quickly retrieving and device | |
Yamaguchi et al. | Text segmentation by language using minimum description length | |
Zhang et al. | An improved Adagrad gradient descent optimization algorithm | |
WO2020114100A1 (en) | Information processing method and apparatus, and computer storage medium | |
CN103345496A (en) | Multimedia information searching method and system | |
US20160292198A1 (en) | A method of generating a reference index data structure and method for finding a position of a data pattern in a reference data structure | |
CN103823814A (en) | Information processing method and information processing device | |
CN104021234A (en) | Large-scale image library retrieval method based on self-adaptive bit allocation Hash algorithm | |
WO2023004528A1 (en) | Distributed system-based parallel named entity recognition method and apparatus | |
CN109325242A (en) | It is word-based to judge method, device and equipment that whether sentence be aligned to translation | |
CN110110035A (en) | Data processing method and device and computer readable storage medium | |
CN109657061A (en) | A kind of Ensemble classifier method for the more word short texts of magnanimity | |
Boucher et al. | Computing the original eBWT faster, simpler, and with less memory | |
CN111090994A (en) | Chinese-internet-forum-text-oriented event place attribution province identification method | |
WO2008119297A1 (en) | Method for matching character string based on characteristic parameters | |
CN105183792A (en) | Distributed fast text classification method based on locality sensitive hashing | |
CN105893601B (en) | A kind of data comparison method | |
CN108170716B (en) | Text duplicate checking method based on human vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20170926 Termination date: 20181230 |