CN113268986A

CN113268986A - Unit name matching and searching method and device based on fuzzy matching algorithm

Info

Publication number: CN113268986A
Application number: CN202110566536.XA
Authority: CN
Inventors: 李君�; 许志坚
Original assignee: Bank of Communications Co Ltd
Current assignee: Bank of Communications Co Ltd
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2021-08-17
Anticipated expiration: 2041-05-24
Also published as: CN113268986B

Abstract

The invention relates to a unit name matching and searching method and device based on a fuzzy matching algorithm. Custom word segmentation is performed on the two unit names to be matched to obtain the corresponding custom words, which fall into three types: attribute words, regional words and alias words. Fuzzy matching scores are then calculated for each type of custom word in the two unit names based on the segmentation results, and these per-type scores are weighted to obtain the fuzzy matching score of the two unit names. Compared with the prior art, the method offers more accurate unit name matching and searching.

Description

Unit name matching and searching method and device based on fuzzy matching algorithm

Technical Field

The invention relates to a unit name matching and searching method and device, in particular to a unit name matching and searching method and device based on a fuzzy matching algorithm.

Background

In the information technology era, industries of all kinds, and the financial industry in particular, have an increasingly urgent need to mine customer information: the information must be more accurate and the mining faster. To meet this demand, the ES (Elasticsearch) database was introduced. It is a distributed database that provides scalable search, has near-real-time search capability, and supports exact and fuzzy queries over large amounts of data.

ES fuzzy query is a fast fuzzy matching query method built on the word segmentation that ships with Elasticsearch. Elasticsearch is a distributed, multi-user full-text search engine that indexes data as JSON over HTTP. ES can be used for full-text retrieval, structured search and data analysis. With the built-in dictionary and word segmentation method of ES, similar records in a database table can be queried quickly.

However, the native word segmentation method offers a single approach and a single application scenario, so it cannot serve a business scenario in which several unit names must be fuzzily matched and a similarity obtained. Moreover, the native ES fuzzy query depends on a word library, so it cannot deliver the precise matching that unit names demand. In the prior art, a unit name is searched by segmenting the unit name to be matched directly with the IK tokenizer of Elasticsearch. When an application calls the native ES fuzzy matching method, fuzzy matching and querying are carried out by sending a JSON message over HTTP to the ES database. This technique, however, is limited to IK segmentation of the input unit name: the name is split into different index terms, and the matching name is then located in the database according to those terms. The matching method is simple, and mismatches occur for unit names that follow certain specific patterns, specifically:

1. the unit name contains an abbreviation, which ES cannot identify directly;

2. the unit name contains a region name, but ES cannot identify the region, so unit names outside that region may also be returned;

3. the unit name contains an alias field, but ES cannot directly retrieve unit names by alias;

4. because user input is unpredictable, the unit name may contain invalid characters such as brackets and dots, which ES cannot recognize;

5. there is no concept of weight: every matched segmentation index in ES carries the same weight, and the weights of different segmented words cannot be adjusted dynamically.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a unit name matching and searching method and device based on a fuzzy matching algorithm.

The purpose of the invention can be realized by the following technical scheme:

A unit name matching method based on a fuzzy matching algorithm comprises the following steps:

S1, acquiring two unit names to be matched and judging whether one contains the other or the two match exactly; if so, directly outputting a fuzzy matching score of 100; otherwise, executing step S2;

S2, preprocessing each of the two unit names, including standardization processing and filtering processing;

S3, performing custom word segmentation on each of the two unit names to obtain the corresponding custom words, wherein the custom words comprise three types, namely attribute words, regional words and alias words;

S4, judging, based on the custom word segmentation results, whether the fuzzy matching calculation can be carried out directly; if so, executing step S6; otherwise, executing step S5;

S5, carrying out a fuzzy matching direct judgment on the two unit names; if a direct judgment condition is met, directly outputting a fuzzy matching score of 100; otherwise, executing step S6;

S6, calculating the fuzzy matching score of each type of custom word in the two unit names based on the custom word segmentation results, and weighting these per-type scores to obtain the fuzzy matching score of the two unit names.
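The control flow of steps S1 to S6 can be sketched as follows. This is only an illustrative skeleton: the function name `match_unit_names` and the five callables passed in are hypothetical placeholders for the steps described above, not names taken from the source.

```python
from typing import Callable, Tuple

# (regional words, attribute words, alias words); the layout is an assumption
Segments = Tuple[list, list, list]


def match_unit_names(
    a: str,
    b: str,
    preprocess: Callable[[str], str],
    segment: Callable[[str], Segments],
    can_score_directly: Callable[[Segments, Segments], bool],
    direct_judgment: Callable[[str, str, Segments, Segments], bool],
    weighted_score: Callable[[Segments, Segments], float],
) -> float:
    # S1: exact match or mutual containment short-circuits to 100
    if a and b and (a in b or b in a):
        return 100.0
    # S2: standardization and filtering
    a, b = preprocess(a), preprocess(b)
    # S3: custom word segmentation into the three word types
    seg_a, seg_b = segment(a), segment(b)
    # S4: if the score cannot be computed directly, try the S5 shortcuts
    if not can_score_directly(seg_a, seg_b):
        # S5: any satisfied direct-judgment condition yields 100
        if direct_judgment(a, b, seg_a, seg_b):
            return 100.0
    # S6: per-type scores combined by weights
    return weighted_score(seg_a, seg_b)
```

Passing the steps in as callables keeps the skeleton independent of how each step is actually implemented.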

Preferably, the standardization processing in step S2 includes: replacing the abbreviations in the unit names with standard words based on an abbreviation library.

Preferably, the filtering processing in step S2 includes: deleting meaningless characters from the unit name, and deleting invalid words from the unit name based on an invalid word library.
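A minimal sketch of the S2 preprocessing might look like the following. The library contents, the character set and the final whitespace cleanup are all assumptions made for illustration; the source does not list the entries of its abbreviation or invalid word libraries.

```python
import re

# Hypothetical library contents, for illustration only
ABBREVIATION_LIBRARY = {"Co.": "Company", "Ltd.": "Limited"}
INVALID_WORDS = ["(deregistered)"]
# Assumed set of meaningless characters: punctuation and brackets
MEANINGLESS_CHARS = re.compile(r"[!:+.()\[\]]")


def preprocess(name: str) -> str:
    # Standardization: expand abbreviations before punctuation is stripped,
    # longest abbreviation first so overlapping keys do not clash
    for abbr in sorted(ABBREVIATION_LIBRARY, key=len, reverse=True):
        name = name.replace(abbr, ABBREVIATION_LIBRARY[abbr])
    # Filtering: drop invalid words, then meaningless characters
    for word in INVALID_WORDS:
        name = name.replace(word, "")
    name = MEANINGLESS_CHARS.sub("", name)
    # Collapse leftover whitespace (an extra step, not from the source)
    return " ".join(name.split())
```

Abbreviations are expanded first because some of them contain characters, such as the dot, that the filtering step would otherwise remove.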

Preferably, the conditions under which step S4 determines that the fuzzy matching calculation can be carried out directly include any one of the following:

the word segmentation results of both unit names contain regional words, and the regional words differ;

the word segmentation result of one unit name contains alias words, and that of the other does not.
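The two S4 conditions can be expressed as a small predicate. The `Segments` container and the function name are hypothetical; the sketch assumes each word type is held as a set.

```python
from typing import NamedTuple


class Segments(NamedTuple):
    region: frozenset
    attribute: frozenset
    alias: frozenset


def can_score_directly(a: Segments, b: Segments) -> bool:
    # Condition 1: both names carry regional words and the regions differ
    regions_differ = bool(a.region) and bool(b.region) and a.region != b.region
    # Condition 2: exactly one of the two names carries alias words
    alias_one_sided = bool(a.alias) != bool(b.alias)
    return regions_differ or alias_one_sided
```

If either condition holds, the flow goes straight to the weighted score of step S6. Otherwise the direct-judgment shortcuts of step S5 are tried first.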

Preferably, the fuzzy matching direct judgment of step S5 specifically includes:

S51, if the word segmentation results of the two unit names both contain alias words and those alias words differ, executing step S52; otherwise, executing step S53;

S52, ES word segmentation judgment: performing ES word segmentation on each of the two unit names with the IK word segmentation method that ships with the ES word library to obtain two word segmentation sets; recording the shorter set as Aset and the longer set as Bset; if every word in Aset is contained in Bset, directly outputting a fuzzy matching score of 100; otherwise, counting the words common to both sets as Cnt and the total number of words in Aset as Cnaset; if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, directly outputting a fuzzy matching score of 100; otherwise, executing step S53, wherein esLen and esPct are preset thresholds;

S53, common word judgment: recording the unit name with fewer characters as A and the one with more characters as B; matching the characters of A against B in sequence to obtain the longest matched character string set, called the common word C; if the common word exactly matches one of the unit names and is longer than 3 characters, directly outputting a fuzzy matching score of 100; otherwise, performing custom word segmentation again on the common word C and on the unit name B, and if the attribute words in the two segmentation results are the same and the alias words are empty, directly outputting a fuzzy matching score of 100; otherwise, executing step S54;

S54, subset judgment: performing a subset judgment on the unit names preprocessed in step S2; if one contains the other, directly outputting a fuzzy matching score of 100; otherwise, executing step S55;

S55, custom word segmentation judgment: if the regional word sets in the two segmentation results are both non-empty and equal, and the alias word sets are both non-empty and equal, directly outputting a fuzzy matching score of 100; otherwise, executing step S6.
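Two of the checks above lend themselves to a short sketch. The first follows the S52 rule on already-tokenized input (the IK tokenization itself is not reproduced here). The second reads the "common word" of S53 as the longest common substring, which is an interpretation rather than something the source states. Function names are hypothetical, and the threshold values used to call them are up to the caller.

```python
from typing import Iterable


def s52_token_judgment(tokens_a: Iterable[str], tokens_b: Iterable[str],
                       es_len: int, es_pct: float) -> bool:
    # Aset is the shorter token set, Bset the longer one
    a_set, b_set = sorted((set(tokens_a), set(tokens_b)), key=len)
    if not a_set:
        return False  # nothing to compare (guard added for safety)
    if a_set <= b_set:
        return True  # every word of Aset is contained in Bset
    cnt = len(a_set & b_set)   # Cnt: words common to both sets
    cn_aset = len(a_set)       # Cnaset: total words in Aset
    return cn_aset >= es_len and cnt / cn_aset >= es_pct


def longest_common_substring(a: str, b: str) -> str:
    # Dynamic programming over contiguous matches, one row at a time
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

A return value of `True` from `s52_token_judgment` corresponds to outputting a score of 100 directly, and `False` to falling through to step S53.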

Preferably, the fuzzy matching scores of the various custom segmented words in step S6 are obtained as follows:

Denoting the two unit names as A and B, for the custom words of category i, the longest common subsequence length maxSubSeq(A, B)_i is calculated over the sets of category-i custom words of the two unit names using a longest-common-subsequence length algorithm, and the fuzzy matching score score_i of the category-i custom words is then calculated by the following equation:

where |A|_i represents the length of the category-i custom words in A, |B|_i represents the length of the category-i custom words in B, and i ranges over the three categories: attribute words, regional words and alias words.
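The longest common subsequence length used above can be computed with the standard dynamic programming recurrence. The sketch below covers only that length: the score equation itself is not reproduced in this text, so no scoring formula is assumed. Applying the function to the category's words joined into one string is also an assumption, and the name `max_sub_seq` merely mirrors the maxSubSeq notation.

```python
def max_sub_seq(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                # Extend the subsequence ending before both characters
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a character of either a or b, whichever is better
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Unlike a common substring, a common subsequence need not be contiguous, so insertions inside a name do not break the match.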

Preferably, the fuzzy match scores for two unit names are obtained by:

where mathScore represents the fuzzy matching score of the two unit names, round(x, 2) denotes rounding to 2 decimal places, and w_i is the weight of the custom words of category i.

A unit name searching method based on a fuzzy matching algorithm comprises:

acquiring a unit name to be searched, and determining, through ES fuzzy matching, the candidate name with the highest matching degree in a unit name word library;

matching the unit name to be searched against the candidate name using the above unit name matching method based on the fuzzy matching algorithm, to determine the fuzzy matching score between the candidate name in the unit name word library and the unit name to be searched.
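The two-stage search can be sketched as below. The ES query is not issued here: `es_hits` stands for (name, ES score) pairs already returned by the database, and `match_score` stands for the one-to-one matching method. Both parameter names and the function name are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def search_unit_name(
    query: str,
    es_hits: Iterable[Tuple[str, float]],
    match_score: Callable[[str, str], float],
) -> Tuple[str, float]:
    # Stage 1: keep the candidate with the highest ES matching score
    # (max raises ValueError if es_hits is empty)
    best_name, _ = max(es_hits, key=lambda hit: hit[1])
    # Stage 2: score that candidate with the one-to-one matching method
    return best_name, match_score(query, best_name)
```

The ES score is used only to pick the candidate. The score returned to the caller comes from the one-to-one matching method.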

A unit name matching device based on a fuzzy matching algorithm comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the unit name matching method based on the fuzzy matching algorithm when the computer program is executed.

A unit name searching device based on a fuzzy matching algorithm comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the unit name searching method based on the fuzzy matching algorithm when the computer program is executed.

Compared with the prior art, the invention has the following advantages: the matching/searching method is finer-grained and its results are more accurate and reliable, specifically:

(1) where a unit name contains an abbreviation, the method can identify and match the abbreviated unit name;

(2) where a unit name contains a region name, the method can distinguish the region name in the unit name to be matched and match according to the actual region;

(3) where a unit name contains an alias field, the method can identify the alias and match on it;

(4) the method filters out useless words before matching, which improves matching accuracy;

(5) the unit name to be matched is divided into region, attribute and alias parts, and the matching weight of each category can be set independently, making the configuration more flexible.

Drawings

FIG. 1 is a block diagram of a unit name matching method based on fuzzy matching algorithm according to the present invention;

FIG. 2 is a block diagram of a unit name search method based on fuzzy matching algorithm according to the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. Note that the following description of the embodiments is merely illustrative in nature; it is not intended to limit the invention, its application or its use, and the invention is not limited to the following embodiments.

Example 1

Embodiment 1 provides a unit name matching method based on a fuzzy matching algorithm, which implements one-to-one fuzzy matching and includes the following steps:

S1, acquiring two unit names to be matched, recorded as A and B respectively, and judging whether one contains the other or the two match exactly; if so, directly outputting a fuzzy matching score of 100; otherwise, executing step S2.

S2, preprocessing each of the two unit names, including standardization processing and filtering processing, wherein the standardization processing comprises replacing the abbreviations in the unit names with standard words based on the abbreviation library, and the filtering processing comprises deleting meaningless characters from the unit name, such as "!", ":", "+" and ".", and deleting invalid words from the unit name based on the invalid word library.

S3, performing custom word segmentation on each of the two unit names to obtain the corresponding custom words, wherein the custom words comprise three types, namely attribute words, regional words and alias words: the regional words denote the region expressed in the unit name; the attribute words denote the industry type of the unit, for example a securities company or an electric power company; and all words other than attribute words and regional words are alias words. Before segmentation, the regional words are restored: regional abbreviations and province and city word samples are supplemented, the regional words are extracted, and finally the province, city and county regional word sets are obtained.

S4, judging, based on the custom word segmentation results, whether the fuzzy matching calculation can be carried out directly; if so, executing step S6; otherwise, executing step S5. Specifically, the conditions under which the fuzzy matching calculation can be carried out directly include any one of the following: the word segmentation results of both unit names contain regional words, and the regional words differ; the word segmentation result of one unit name contains alias words, and that of the other does not.

S5, carrying out a fuzzy matching direct judgment on the two unit names; if a direct judgment condition is met, directly outputting a fuzzy matching score of 100; otherwise, executing step S6. Step S5 specifically comprises the following steps:

S6, calculating the fuzzy matching score of each type of custom word in the two unit names based on the custom word segmentation results, and weighting these per-type scores to obtain the fuzzy matching score of the two unit names. The fuzzy matching score represents the degree to which the two unit names match: the larger the score, the higher the matching degree, and the smaller the score, the lower the matching degree.
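The three-way classification of step S3 can be illustrated with a simplified sketch. It assumes the name has already been split into tokens and that region and attribute lexicons exist. The lexicon contents and the function name are hypothetical, and the region restoration described in S3 is not modelled.

```python
from typing import List, Tuple

# Hypothetical lexicons, for illustration only
REGION_LEXICON = {"Guangdong", "Guangzhou"}
ATTRIBUTE_LEXICON = {"securities", "futures", "sports"}


def custom_segment(tokens: List[str]) -> Tuple[List[str], List[str], List[str]]:
    region, attribute, alias = [], [], []
    for token in tokens:
        if token in REGION_LEXICON:
            region.append(token)
        elif token in ATTRIBUTE_LEXICON:
            attribute.append(token)
        else:
            # Anything that is neither a region nor an attribute is an alias word
            alias.append(token)
    return region, attribute, alias
```

For example, `custom_segment(["Guangzhou", "Huatai", "futures"])` places "Guangzhou" in the regional words, "futures" in the attribute words and "Huatai" in the alias words.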

The fuzzy matching scores of the various custom words in step S6 are obtained as follows:

for the custom words of category i, the longest common subsequence length maxSubSeq(A, B)_i is calculated over the sets of category-i custom words of the two unit names using a longest-common-subsequence length algorithm, and the fuzzy matching score score_i of the category-i custom words is then calculated by the following equation:

The fuzzy matching score of the two unit names is obtained by:

where mathScore represents the fuzzy matching score of the two unit names.

The weights of the regional words, attribute words and alias words are initialized according to the business scenario and the desired weight proportions. In this embodiment the weights are set as follows: regional word weight 0.3, attribute word weight 0.15, alias word weight 0.55 (these initial weights may be adjusted to suit the situation). The initial weights are then adjusted according to the segmentation sets obtained for A and B. For example, if neither A nor B yields any alias words, the initial alias weight of 0.55 is split equally between the region and attribute weights: the region weight becomes 0.3 + 0.55/2 = 0.575, and the attribute weight becomes 0.15 + 0.55/2 = 0.425.
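The weight adjustment can be sketched as follows. The source only gives the case in which the alias words are missing. Spreading the weight of any missing category equally over the remaining ones is a generalization assumed here, and the function and key names are hypothetical.

```python
from typing import Dict


def adjust_weights(initial: Dict[str, float], present: Dict[str, bool]) -> Dict[str, float]:
    # Categories for which neither name produced any words
    missing = [k for k in initial if not present.get(k, False)]
    kept = [k for k in initial if k not in missing]
    if not missing or not kept:
        return dict(initial)
    # Split the freed weight equally among the remaining categories
    share = sum(initial[k] for k in missing) / len(kept)
    return {k: (initial[k] + share if k in kept else 0.0) for k in initial}
```

With the initial weights of this embodiment and no alias words, the function returns 0.575 for the region and 0.425 for the attribute, matching the figures above.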

In this embodiment, "Guangzhou Sports East Road Securities of Huatai Union Securities Limited Liability Company" and "Huatai Futures Limited Company" are used as an example:

the two company names are segmented separately, and the custom word segmentation results are:

the custom word segmentation result of "Guangzhou Sports East Road Securities of Huatai Union Securities Limited Liability Company" is: ([Guangdong province, Guangzhou city], [sports, securities], [Huatai union east road]), where [Guangdong province, Guangzhou city] is the regional word set, [sports, securities] is the attribute word set, and [Huatai union east road] is the alias word set;

the custom word segmentation result of "Huatai Futures Limited Company" is: ([ ], [futures], [Huatai]), where the regional word set is empty, [futures] is the attribute word set, and [Huatai] is the alias word set;

after region extraction, the results are:

([Guangdong province], [sports, securities], [Huatai union east road])

([ ], [futures], [Huatai]);

step S4 determines that the fuzzy matching calculation cannot be carried out directly; in step S5, the common word judgment, subset judgment and custom word segmentation judgment are performed in sequence and none of them allows a fuzzy matching score to be output directly; step S6 is therefore executed, and the fuzzy matching scores are calculated as follows:

since the regional word set of "Huatai Futures Limited Company" is empty, the fuzzy matching score of the regional words is 0;

for the alias words [Huatai union east road] and [Huatai], the fuzzy matching score is 0.6666665;

for the attribute words [sports, securities] and [futures], the fuzzy matching score is 0;

weights: regional word weight 0.3, alias word weight 0.55, attribute word weight 0.15;

the fuzzy matching score of the two unit names is round(0.55 × 0.6666665 × 100, 2) ≈ 37.
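The arithmetic of this example can be checked with a short sketch. It assumes the overall score is the weighted sum of the three category scores, scaled by 100 and rounded to two decimal places, which is how the expression in the example reads. The function and key names are illustrative.

```python
from typing import Dict


def math_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    # Weighted sum of the per-category scores, scaled to a 0-100 range
    total = sum(weights[k] * scores[k] for k in weights)
    return round(total * 100, 2)


example_scores = {"region": 0.0, "alias": 0.6666665, "attribute": 0.0}
example_weights = {"region": 0.3, "alias": 0.55, "attribute": 0.15}
result = math_score(example_scores, example_weights)
```

With the values of this example, `result` is 36.67, which rounds to the reported value of 37.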

Example 2

As shown in fig. 2, the unit name searching method based on the fuzzy matching algorithm of the present embodiment includes:

In this embodiment, the unit name matching method based on the fuzzy matching algorithm is completely the same as that in embodiment 1, and is not described in detail in this embodiment.

The following specific examples are illustrative:

Name of the unit to be searched: the general offices of broadcasting and television in Jiangxi province;

the name field type of the unit name word library is set in advance to keyword (exact query). After the ES keyword exact query and matching, the names with the highest matching scores are obtained, and the one-to-one matching similarity calculation is then carried out. The matched unit names and their scores are as follows:

Jiangxi province broadcast television bureau (es matching score: 18.56986), Heilongjiang province broadcast television bureau Shandong province broadcast television bureau (es matching score: 14.986288), Jiangxi province general workshop (es matching score: 14.648104), Guangdong province West river basin administration (es matching score: 13.420963), Jiangxi province Water and Electricity engineering bureau Limited (es matching score: 12.696965)

The matching name with the highest score is screened out: Jiangxi province broadcast television bureau. The one-to-one matching similarity calculation between the unit name to be searched and this keyword-match hit gives a similarity of 100 points.

The searching method thus implements one-to-many matching, that is, a matching search for the unit name to be searched within the unit name word library.

Example 3

The present embodiment provides a unit name matching device based on the fuzzy matching algorithm based on embodiment 1, which includes a memory for storing a computer program and a processor for implementing the unit name matching method based on the fuzzy matching algorithm in embodiment 1 when executing the computer program.

Example 4

The present embodiment provides a unit name lookup apparatus based on fuzzy matching algorithm based on embodiment 2, which includes a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for implementing the unit name lookup method based on fuzzy matching algorithm in embodiment 2 when executing the computer program.

Through the custom fuzzy matching technique, the invention improves the accuracy of matching a bank's high-quality customer groups. It suits both the scenario in which a single unit name is input for matching and the scenario in which two unit names are input and matched against each other. The fuzzy matching scenarios are more flexible and the results more accurate. A custom word segmentation technique covering regions, attributes and aliases is provided: each type of segmented word has its own score and weight, and the weights can be customized, which gives the business side more control.

The above embodiments are merely examples and do not limit the scope of the present invention. These embodiments may be implemented in other various manners, and various omissions, substitutions, and changes may be made without departing from the technical spirit of the present invention.

Claims

1. A unit name matching method based on fuzzy matching algorithm is characterized by comprising the following steps:

2. The unit name matching method based on fuzzy matching algorithm as claimed in claim 1, wherein the normalizing process in step S2 comprises: replacing the acronyms in the unit names with standard words based on the acronym library.

3. The unit name matching method based on fuzzy matching algorithm as claimed in claim 1, wherein the filtering process in step S2 comprises: and deleting meaningless characters in the unit name, and deleting invalid words in the unit name based on the invalid word library.

4. The unit name matching method based on the fuzzy matching algorithm as claimed in claim 1, wherein the condition that the step S4 determines that the fuzzy matching calculation can be directly performed includes any one of:

5. The unit name matching method based on fuzzy matching algorithm as claimed in claim 1, wherein the specific decision of step S5 fuzzy matching direct decision comprises:

6. The unit name matching method based on fuzzy matching algorithm as claimed in claim 1, wherein the fuzzy matching score of each type of custom segmentation in step S6 is obtained as follows:

7. The unit name matching method based on fuzzy matching algorithm as claimed in claim 6, wherein the fuzzy matching score of two unit names is obtained by the following formula:

wherein, the mathScore represents fuzzy matching scores of two unit names,

8. A unit name searching method based on fuzzy matching algorithm is characterized by comprising the following steps:

9. A unit name matching device based on a fuzzy matching algorithm, which is characterized by comprising a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the unit name matching method based on the fuzzy matching algorithm according to any one of claims 1 to 7 when the computer program is executed.

10. An apparatus for unit name lookup based on fuzzy matching algorithm, characterized in that the apparatus comprises a memory for storing a computer program and a processor for implementing the unit name lookup based on fuzzy matching algorithm of claim 8 when executing said computer program.