CN113268986B

CN113268986B - Unit name matching and searching method and device based on fuzzy matching algorithm

Info

Publication number: CN113268986B
Application number: CN202110566536.XA
Authority: CN
Inventors: 李君�; 许志坚
Original assignee: Bank of Communications Co Ltd
Current assignee: Bank of Communications Co Ltd
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2024-05-24
Anticipated expiration: 2041-05-24
Also published as: CN113268986A

Abstract

The invention relates to a unit name matching and searching method and device based on a fuzzy matching algorithm. Custom word segmentation is performed on the two unit names to be matched to obtain the corresponding custom word segments, which fall into three types: attribute words, regional words and other nouns (alias words). Based on the custom word segmentation result, a fuzzy matching score is calculated for each type of custom word segment in the two unit names, and these scores are then weighted to obtain the fuzzy matching score of the two unit names. Compared with the prior art, the invention matches and searches unit names more accurately.

Description

Unit name matching and searching method and device based on fuzzy matching algorithm

Technical Field

The invention relates to a unit name matching and searching method and device, in particular to a unit name matching and searching method and device based on a fuzzy matching algorithm.

Background

In the current information technology era, various industries, and the financial industry in particular, have an increasingly urgent need to mine customer information: the information must be more accurate and the mining faster. To meet this need, the ES (Elasticsearch) database has been introduced. As a distributed database, it provides scalable, near-real-time search and supports exact and fuzzy queries over large amounts of data.

ES fuzzy query is a fast fuzzy-matching query method built on the built-in word segmentation of Elasticsearch. Elasticsearch is a distributed, multi-user full-text search engine that indexes data as JSON over HTTP. ES can be used for full-text retrieval, structured search and data analysis. Using the ES dictionary and word segmentation method, similar data in a table can be queried quickly.

However, the native word segmentation method is limited, so its application scenarios are limited, and it cannot serve business scenarios that require fuzzy matching of multiple unit names together with a similarity score. The native ES fuzzy query method depends on word libraries, so it cannot achieve the accurate data matching that unit names demand. In the prior art, unit name search applies the IK word segmentation of Elasticsearch directly to the unit name to be fuzzily matched. When the native ES fuzzy matching method is called, fuzzy matching and querying are carried out by sending an HTTP JSON message to the ES database. This technique, however, is limited to IK-segmenting the input unit name into index terms and then locating matching names in the database by those terms. The matching method is simple, and for unit names that follow certain patterns it produces wrong matches, specifically:

1. Unit names contain abbreviations, which ES cannot identify directly;

2. Unit names contain region names, but ES cannot recognize the region, so unit names from other regions are also returned;

3. Unit names contain alias fields, but ES cannot directly filter unit names by alias;

4. Because user input is unpredictable, a unit name may contain many invalid characters, such as brackets and dots, which ES cannot identify;

5. There is no concept of weight: every matched word segmentation index in ES has the same weight, and the weights of different word segments cannot be adjusted dynamically.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a unit name matching and searching method and device based on a fuzzy matching algorithm.

The aim of the invention can be achieved by the following technical scheme:

A unit name matching method based on fuzzy matching algorithm includes the following steps:

S1, acquiring two unit names to be matched and judging whether one contains the other or the two match completely; if so, directly outputting a fuzzy matching score of 100, otherwise executing step S2;

S2, preprocessing the two unit names respectively, the preprocessing including standardization processing and filtering processing;

S3, performing custom word segmentation on the two unit names respectively to obtain the corresponding custom word segments, which comprise three types: attribute words, regional words and other nouns (alias words);

S4, determining, based on the custom word segmentation result, whether the fuzzy matching calculation can be performed directly; if so, executing step S6, otherwise executing step S5;

S5, performing direct fuzzy matching judgment on the two unit names; if a direct judgment condition is met, directly outputting a fuzzy matching score of 100, otherwise executing step S6;

S6, calculating, based on the custom word segmentation result, a fuzzy matching score for each type of custom word segment in the two unit names, and weighting these scores to obtain the fuzzy matching score of the two unit names.

Preferably, the standardization processing in step S2 includes: replacing abbreviations in the unit names with standard words based on an abbreviation word stock.

Preferably, the filtering processing in step S2 includes: deleting meaningless characters in the unit names, and deleting invalid words in the unit names based on an invalid word stock.
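The preprocessing of step S2 can be illustrated with a minimal Python sketch. The lexicon contents (`ABBREVIATIONS`, `INVALID_WORDS`), the character set and the function name `preprocess` are hypothetical: the source names an abbreviation word stock and an invalid word stock but does not list their entries. For simplicity the sketch strips meaningless characters before expanding abbreviations, so that a token such as "Ltd." is recognized.

```python
import re

# Hypothetical word stocks; the source does not list their contents.
ABBREVIATIONS = {"Ltd": "limited company", "Co": "company"}
# Longer entries first, so "limited company" is removed before "company".
INVALID_WORDS = ["limited company", "company"]
# Hypothetical set of meaningless characters (punctuation and symbols).
MEANINGLESS_CHARS = re.compile("[" + re.escape("|!%$#&()'\":/+,.\\") + "]")


def preprocess(name: str) -> str:
    """Return a standardized, filtered unit name (illustrative only)."""
    # Filtering, part 1: replace meaningless characters with spaces.
    cleaned = MEANINGLESS_CHARS.sub(" ", name)
    # Standardization: expand abbreviations token by token.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in cleaned.split()]
    text = " ".join(tokens)
    # Filtering, part 2: delete invalid words listed in the word stock.
    for word in INVALID_WORDS:
        text = text.replace(word, " ")
    # Collapse the whitespace left behind by the deletions.
    return " ".join(text.split())
```

Token-level replacement is used here so that a short abbreviation such as "Co" is not expanded inside a longer word; a production system for Chinese names would match against the word stock differently.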

Preferably, the condition for determining in step S4 that the fuzzy matching calculation can be performed directly includes any one of:

the word segmentation results of both unit names contain regional words, and the regional words are different;

the word segmentation result of one unit name contains other nouns (alias words), and the word segmentation result of the other unit name does not.
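The two S4 conditions can be expressed as a small predicate. The sketch below is an assumption-laden illustration: the `Segmentation` container and the function name `can_score_directly` are hypothetical, and comparing word sets with Python `set` equality is one possible reading of "the regional words are different".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segmentation:
    """Hypothetical container for one name's custom word segments."""
    region: List[str] = field(default_factory=list)
    attribute: List[str] = field(default_factory=list)
    alias: List[str] = field(default_factory=list)  # "other nouns"


def can_score_directly(a: Segmentation, b: Segmentation) -> bool:
    """Return True when step S4 sends the pair straight to step S6."""
    # Condition 1: both names contain regional words and they differ.
    regions_differ = bool(a.region) and bool(b.region) and set(a.region) != set(b.region)
    # Condition 2: exactly one of the two names contains alias words.
    alias_one_sided = bool(a.alias) != bool(b.alias)
    return regions_differ or alias_one_sided
```

When the predicate returns False, the pair goes through the direct judgment of step S5 before any weighted score is computed.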

Preferably, the direct fuzzy matching judgment of step S5 specifically includes:

S51, if the word segmentation results of both unit names contain other nouns (alias words) and these differ, executing step S52, otherwise executing step S53;

S52, ES word segmentation judgment: performing ES word segmentation on the two unit names respectively, using the IK word segmentation method bundled with the ES word stock, to obtain two word segment sets; denoting the shorter set as Aset and the longer set as Bset; if every word segment in Aset is contained in Bset, directly outputting a fuzzy matching score of 100; otherwise, counting the word segments common to both sets as Cnt and the total number of word segments in Aset as Cnaset; if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, directly outputting a fuzzy matching score of 100, otherwise executing step S53, where esLen and esPct are preset thresholds;

S53, public word judgment: denoting the unit name with fewer characters as A and the unit name with more characters as B; matching each character of A against B in turn to obtain the longest matched character string, called the public word C; if the public word completely matches one of the unit names and its length exceeds 3 characters, directly outputting a fuzzy matching score of 100; otherwise, performing custom word segmentation again on the public word C and on the unit name B, and if the attribute words in the two segmentation results are the same and the other nouns (alias words) are empty, directly outputting a fuzzy matching score of 100, otherwise executing step S54;

S54, subset judgment: performing subset judgment on the unit names preprocessed in step S2; if one unit name is contained in the other, directly outputting a fuzzy matching score of 100, otherwise executing step S55;

S55, custom word segmentation judgment: if, in the word segmentation results of the two unit names, the two regional word sets are non-empty and equal and the two alias word sets are non-empty and equal, directly outputting a fuzzy matching score of 100, otherwise executing step S6.
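Two of the sub-steps above, S52 and S55, reduce to set comparisons and can be sketched as follows. The function names are hypothetical, the token lists are assumed to come from an IK analyzer that is not called here, and the source gives no concrete values for `esLen` and `esPct`, so any values passed in are placeholders. Returning False for an empty token set is an added guard, not something the source specifies.

```python
from typing import Iterable


def es_token_judgment(tokens_a: Iterable[str], tokens_b: Iterable[str],
                      es_len: int, es_pct: float) -> bool:
    """S52: True means a fuzzy matching score of 100 is output directly."""
    # Aset is the smaller token set, Bset the larger one.
    aset, bset = sorted((set(tokens_a), set(tokens_b)), key=len)
    if not aset:
        return False  # assumption: nothing to compare, fall through to S53
    if aset <= bset:
        return True  # every token of Aset is contained in Bset
    cnt = len(aset & bset)   # Cnt: tokens common to both sets
    cn_aset = len(aset)      # Cnaset: number of tokens in Aset
    return cn_aset >= es_len and cnt / cn_aset >= es_pct


def custom_segmentation_judgment(regions_a, regions_b, aliases_a, aliases_b) -> bool:
    """S55: region sets and alias sets must each be non-empty and equal."""
    return (bool(regions_a) and set(regions_a) == set(regions_b)
            and bool(aliases_a) and set(aliases_a) == set(aliases_b))
```

A False result from `es_token_judgment` corresponds to continuing with step S53, and a False result from `custom_segmentation_judgment` corresponds to continuing with step S6.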

Preferably, the fuzzy matching score of each type of custom word in step S6 is obtained by the following method:

Denoting the two company names as A and B, for the custom word segments of category i, the longest common subsequence length maxSubSeq(A, B)_i is calculated from the sets of category-i custom word segments of the two company names using a longest common subsequence length algorithm, and the fuzzy matching score score_i of category i is then calculated using the following formula:

where |A|_i denotes the length of the category-i custom word segments in A, |B|_i denotes the length of the category-i custom word segments in B, and i ranges over the three categories: attribute words, regional words and alias words.
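The quantity maxSubSeq(A, B)_i is a longest common subsequence length, which can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook implementation, not the patent's own code; the function name `lcs_length` is hypothetical, and it assumes the category-i word segments of each name have already been joined into one string. How the length is normalized into score_i is left to the formula referred to above.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the subsequence by one.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the better of skipping a char in a or in b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so memory use is proportional to the length of the second string.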

Preferably, the fuzzy matching score of the two unit names is obtained by:

$$\mathrm{mathScore} = \mathrm{round}\Big(100 \times \sum_{i} w_i \cdot \mathrm{score}_i,\ 2\Big)$$

where mathScore denotes the fuzzy matching score of the two unit names, round(·, 2) denotes rounding to 2 decimal places, and w_i denotes the weight of the category-i custom word segments.

A unit name searching method based on fuzzy matching algorithm includes:

obtaining a unit name to be searched, and determining, through ES fuzzy matching, the name to be matched that has the highest matching degree in a unit name word stock;

matching the unit name to be searched against the name to be matched using the unit name matching method based on the fuzzy matching algorithm described above, thereby determining the fuzzy matching score between the unit name to be searched and the name to be matched in the unit name word stock.

A unit name matching device based on a fuzzy matching algorithm, the device comprising a memory for storing a computer program and a processor for implementing the unit name matching method based on the fuzzy matching algorithm when executing the computer program.

A fuzzy matching algorithm based unit name lookup apparatus comprising a memory for storing a computer program and a processor for implementing the fuzzy matching algorithm based unit name lookup method when the computer program is executed.

Compared with the prior art, the invention has the following advantages: the matching/searching method is finer, and the matching/searching result is more accurate and reliable, and is specifically shown as follows:

(1) Where unit names are abbreviated, the invention can identify and match the abbreviated unit names;

(2) Where unit names contain region names, the invention can distinguish the region names in the unit names to be matched and match according to the actual region;

(3) Where a unit name contains an alias field, the invention can identify and match the alias;

(4) The invention filters out useless words before matching, which improves matching accuracy;

(5) The invention divides the unit names to be matched into region, attribute and alias parts, and the matching weight of each category can be set independently, making configuration more flexible.

Drawings

FIG. 1 is a block flow diagram of a unit name matching method based on a fuzzy matching algorithm;

FIG. 2 is a block flow diagram of a unit name lookup method based on fuzzy matching algorithm.

Detailed Description

The invention will now be described in detail with reference to the drawings and specific examples. Note that the following description of the embodiments is merely an example, and the present invention is not intended to be limited to the applications and uses thereof, and is not intended to be limited to the following embodiments.

Example 1

Embodiment 1 provides a unit name matching method based on a fuzzy matching algorithm, which realizes one-to-one fuzzy matching and comprises the following steps:

S1, acquiring two unit names to be matched, denoted A and B, and judging whether one contains the other or the two match completely; if so, directly outputting a fuzzy matching score of 100, otherwise executing step S2.

S2, preprocessing the two unit names respectively, the preprocessing including standardization processing and filtering processing. The standardization processing comprises: replacing abbreviations in the unit names with standard words based on an abbreviation word stock. The filtering processing comprises: deleting meaningless characters in the unit names, such as "|", "!", "%", "$", "#", "&", ":", "/" and "+", and deleting invalid words in the unit names based on an invalid word stock.

S3, performing custom word segmentation on the two unit names to obtain the corresponding custom word segments, which comprise three types: attribute words, regional words and alias words. A regional word denotes the region indicated in the unit name; an attribute word denotes the industry type of the unit, for example securities company or electric power company; all words other than attribute words and regional words are alias words. Before word segmentation, abbreviated region names are first restored to their full form and the "province" and "city" suffixes are supplemented; regional words are then extracted, finally yielding the regional word set at province, city and county level.
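A lexicon-driven version of the three-way split in step S3 might look like the sketch below. It is only an illustration under stated assumptions: the lexicons `REGION_WORDS` and `ATTRIBUTE_WORDS` and the function name `segment` are hypothetical, English renderings with spaces stand in for the Chinese names, and the restoration of abbreviated region names is not covered.

```python
from typing import Dict, List

# Hypothetical lexicons; a real system would load these from word stocks.
REGION_WORDS = ["Guangdong province", "Guangzhou city"]
ATTRIBUTE_WORDS = ["securities", "futures"]


def segment(name: str) -> Dict[str, List[str]]:
    """Split a unit name into region, attribute and alias word segments."""
    rest = name
    result: Dict[str, List[str]] = {"region": [], "attribute": [], "alias": []}
    for category, lexicon in (("region", REGION_WORDS), ("attribute", ATTRIBUTE_WORDS)):
        for word in lexicon:
            if word in rest:
                result[category].append(word)
                # Remove the recognized word so it cannot become an alias.
                rest = rest.replace(word, " ")
    # Everything that is neither a region nor an attribute word is an alias.
    result["alias"] = rest.split()
    return result
```

For example, `segment("Huatai futures")` yields an empty region set, `["futures"]` as attribute words and `["Huatai"]` as alias words.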

S4, determining, based on the custom word segmentation result, whether the fuzzy matching calculation can be performed directly; if so, executing step S6, otherwise executing step S5. Specifically, the condition for determining that the fuzzy matching calculation can be performed directly includes any one of the following: the word segmentation results of both unit names contain regional words, and the regional words are different; or the word segmentation result of one unit name contains other nouns (alias words), and the word segmentation result of the other unit name does not.

S5, performing direct fuzzy matching judgment on the two unit names; if a direct judgment condition is met, directly outputting a fuzzy matching score of 100, otherwise executing step S6. Step S5 specifically comprises sub-steps S51 to S55 as set out in the Disclosure of Invention above.

S6, calculating, based on the custom word segmentation result, a fuzzy matching score for each type of custom word segment in the two unit names, and weighting these scores to obtain the fuzzy matching score of the two unit names. The fuzzy matching score represents the degree of matching between the two unit names: the larger the score, the higher the degree of matching, and the smaller the score, the lower the degree of matching.

In the step S6, fuzzy matching scores of various custom segmentation words are obtained by the following modes:

For the custom word segments of category i, the longest common subsequence length maxSubSeq(A, B)_i is calculated from the sets of category-i custom word segments of the two company names using a longest common subsequence length algorithm, and the fuzzy matching score score_i of category i is then calculated by the following formula:

The fuzzy match scores for the two unit names are obtained by:

The weights of regional words, attribute words and alias words are initialized according to the business scenario and the desired weight proportions. In this embodiment they are set as follows: regional word weight = 0.3, attribute word weight = 0.15, alias word weight = 0.55 (the initial weights may be adjusted as required). The initial weights are then adjusted according to the word segment sets obtained for A and B. For example, if neither A nor B yields any alias word segments, the initial alias weight of 0.55 is split equally between the region weight and the attribute weight: the region weight becomes 0.3 + 0.55/2 = 0.575 and the attribute weight becomes 0.15 + 0.55/2 = 0.425.
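The weight adjustment in this example can be sketched as a small function. The source only describes the case in which the alias category is absent from both names; generalizing the equal split to any absent category is an assumption of this sketch, as are the function name `adjust_weights` and the category keys.

```python
from typing import Dict, Iterable

# Initial weights as given in this embodiment.
INITIAL_WEIGHTS = {"region": 0.3, "attribute": 0.15, "alias": 0.55}


def adjust_weights(initial: Dict[str, float], absent: Iterable[str]) -> Dict[str, float]:
    """Split the weight of categories absent from both names among the rest."""
    absent = set(absent)
    missing = [c for c in initial if c in absent]
    kept = [c for c in initial if c not in absent]
    if not missing or not kept:
        return dict(initial)  # nothing to redistribute
    # Each remaining category receives an equal share of the freed weight.
    share = sum(initial[c] for c in missing) / len(kept)
    adjusted = {c: initial[c] + share for c in kept}
    adjusted.update({c: 0.0 for c in missing})
    return adjusted
```

Calling `adjust_weights(INITIAL_WEIGHTS, {"alias"})` reproduces the figures in the text: the region weight becomes 0.575 and the attribute weight 0.425, and the weights still sum to 1.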

In this embodiment, "Huatai united securities limited liability company Guangzhou sports east securities" and "Huatai futures limited company" are taken as an example:

Custom word segmentation is performed on the two company names respectively, with the following results:

The custom word segmentation result of "Huatai united securities limited liability company Guangzhou sports east securities" is: ([Guangdong province, Guangzhou city], [sports, securities], [Huatai united east road]), where [Guangdong province, Guangzhou city] is the regional word set, [sports, securities] is the attribute word set, and [Huatai united east road] is the alias word set;

The custom word segmentation result of "Huatai futures limited company" is: ([], [futures], [Huatai]), where the regional word set is empty, [futures] is the attribute word set, and [Huatai] is the alias word set;

After region extraction, the following is obtained:

([Guangdong province], [sports, securities], [Huatai united east road])

([], [futures], [Huatai]);

Step S4 determines that the fuzzy matching calculation cannot be performed directly, so step S5 is executed, performing public word judgment, subset judgment and custom word segmentation judgment in turn. None of these judgments outputs a fuzzy matching score directly, so step S6 is executed to calculate the fuzzy matching score, as follows:

Because the regional word set of "Huatai futures limited company" is empty, the fuzzy matching score of the regional words of the two names is 0;

For [Huatai united east road] and [Huatai], the fuzzy matching score of the alias words is 0.6666665;

For [sports, securities] and [futures], the fuzzy matching score of the attribute words is 0;

Weights: regional word weight = 0.3, alias word weight = 0.55, attribute word weight = 0.15;

Fuzzy matching score of the two unit names = round(0.55 × 0.6666665 × 100, 2) = 37.
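The weighted combination used in this worked example can be checked with a few lines of Python. The function name `match_score` and the dictionary keys are hypothetical; the per-category scores and weights are the ones stated above. Note that with rounding to two decimal places the expression evaluates to 36.67, whereas the text reports 37, which would correspond to rounding to the nearest integer.

```python
from typing import Dict


def match_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-category scores, scaled to 0-100, 2 decimals."""
    total = sum(weights[c] * scores[c] for c in weights)
    return round(100 * total, 2)


# Values taken from the worked example in the text.
example_scores = {"region": 0.0, "attribute": 0.0, "alias": 0.6666665}
example_weights = {"region": 0.3, "attribute": 0.15, "alias": 0.55}
example_result = match_score(example_scores, example_weights)
```

Because the regional and attribute scores are both 0, only the alias term contributes to the total in this example.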

Example 2

As shown in fig. 2, the unit name searching method based on the fuzzy matching algorithm in this embodiment includes:

In this embodiment, the unit name matching method based on the fuzzy matching algorithm is identical to that of embodiment 1, and will not be described in detail in this embodiment.

Specific examples are described below:

Unit name to be searched: Jiangxi province broadcasting and television general office;

The name field of the unit name word stock is set to type keyword (exact query) in advance. Candidate names are retrieved through ES keyword matching, the name with the highest matching score is selected, and one-to-one matching similarity calculation is then performed on it. The matched unit names and their ES match scores are as follows:

Broadcast television bureau in Jiangxi province (ES match score: 18.56986), broadcast television bureau in Shandong province in Heilongjiang province (ES match score: 14.986288), general Condition in Jiangxi province (ES match score: 14.648104), river basin bureau in Guangdong province (ES match score: 13.420963), water engineering bureau Limited in Jiangxi province (ES match score: 12.696965)

The highest-scoring name, Broadcast television bureau in Jiangxi province, is selected. One-to-one matching between the unit name to be searched and this keyword hit then gives a similarity score of 100.

The above results show that the searching method realizes one-to-many matching and realizes the matching searching of the unit names to be searched in the unit name word stock.
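The two-stage lookup of this embodiment, ES retrieval followed by one-to-one rescoring, can be sketched without a live cluster. The sketch assumes an Elasticsearch `match` query on a field called `name` and the usual search response layout (`hits.hits[]._score` and `_source`); the field name, the function names and the `match_fn` callable (standing in for the one-to-one method of Embodiment 1) are hypothetical, and the source does not show the query it actually sends.

```python
from typing import Callable, Optional, Tuple


def build_match_query(unit_name: str, field: str = "name", size: int = 5) -> dict:
    """Build a JSON search body asking ES for the top `size` candidates."""
    return {"size": size, "query": {"match": {field: unit_name}}}


def best_candidate(es_response: dict, field: str = "name") -> Optional[Tuple[str, float]]:
    """Pick the hit with the highest ES score from a search response."""
    hits = es_response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    top = max(hits, key=lambda hit: hit["_score"])
    return top["_source"][field], top["_score"]


def lookup(unit_name: str, es_response: dict,
           match_fn: Callable[[str, str], float]) -> Optional[Tuple[str, float]]:
    """Rescore the best ES candidate with the one-to-one matching method."""
    candidate = best_candidate(es_response)
    if candidate is None:
        return None
    name, _es_score = candidate
    return name, match_fn(unit_name, name)
```

The ES score is used only to choose the candidate; the score returned to the caller is the fuzzy matching score produced by `match_fn`.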

Example 3

The present embodiment provides a unit name matching device based on a fuzzy matching algorithm based on embodiment 1, the device including a memory for storing a computer program and a processor for implementing the unit name matching method based on the fuzzy matching algorithm in embodiment 1 when the computer program is executed.

Example 4

The present embodiment provides a unit name search device based on a fuzzy matching algorithm based on embodiment 2, the device including a memory for storing a computer program and a processor for implementing the unit name search method based on the fuzzy matching algorithm in embodiment 2 when the computer program is executed.

Through this custom fuzzy matching technology, the invention improves the accuracy of matching the units of a bank's high-quality customer groups. It suits both business scenarios in which a single unit name is input for matching and scenarios in which two unit names are input and matched against each other. The fuzzy matching scenarios are more flexible and the fuzzy matching results more accurate. The invention also provides custom word segmentation into region, attribute and alias categories, assigns a score and a weight to each category, and allows the weights to be adjusted as needed, giving the business greater initiative.

The above embodiments are merely examples, and do not limit the scope of the present invention. These embodiments may be implemented in various other ways, and various omissions, substitutions, and changes may be made without departing from the scope of the technical idea of the present invention.

Claims

1. The unit name matching method based on the fuzzy matching algorithm is characterized by comprising the following steps of:

S6, calculating fuzzy matching scores of various custom words in the two unit names based on the custom word segmentation result, and further carrying out weighting treatment on the fuzzy matching scores of the various custom words to obtain fuzzy matching scores of the two unit names;

The condition for determining in step S4 that the fuzzy matching calculation can be directly performed includes any one of:

the word segmentation result of one unit name contains other nouns, and the word segmentation result of the other unit name does not contain other nouns;

the specific determination of the fuzzy matching direct determination in the step S5 comprises the following steps:

S55, custom word segmentation judgment: if, in the word segmentation results of the two unit names, the two regional word sets are non-empty and equal and the two alias word sets are non-empty and equal, directly outputting a fuzzy matching score of 100, otherwise executing step S6;

Wherein, |A| _i represents the custom word segmentation length of the i category in A, |B| _i represents the custom word segmentation length of the i category in B, i corresponds to three categories of attribute words, regional words and alias words;

the fuzzy match scores for the two unit names are obtained by:

2. The unit name matching method based on the fuzzy matching algorithm of claim 1, wherein the standardization processing in step S2 includes: replacing abbreviations in the unit names with standard words based on an abbreviation word stock.

3. The unit name matching method based on the fuzzy matching algorithm of claim 1, wherein the filtering processing in step S2 includes: deleting meaningless characters in the unit names, and deleting invalid words in the unit names based on an invalid word stock.

4. The unit name searching method based on the fuzzy matching algorithm is characterized by comprising the following steps of:

the unit name to be searched and the name to be matched are matched by adopting the unit name matching method based on the fuzzy matching algorithm as set forth in any one of claims 1 to 3, and the fuzzy matching score between the unit name to be searched and the name to be matched in the unit name word stock is determined.

5. A unit name matching device based on a fuzzy matching algorithm, characterized in that the device comprises a memory for storing a computer program and a processor for implementing the unit name matching method based on a fuzzy matching algorithm as claimed in any one of claims 1 to 3 when said computer program is executed.

6. A fuzzy matching algorithm based unit name lookup apparatus comprising a memory for storing a computer program and a processor for implementing the fuzzy matching algorithm based unit name lookup method as claimed in claim 4 when said computer program is executed.