CN113268986B - Unit name matching and searching method and device based on fuzzy matching algorithm - Google Patents

Unit name matching and searching method and device based on fuzzy matching algorithm

Info

Publication number
CN113268986B
Authority
CN
China
Prior art keywords
unit
word
fuzzy matching
word segmentation
names
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110566536.XA
Other languages
Chinese (zh)
Other versions
CN113268986A (en)
Inventor
李君�
许志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of Communications Co Ltd
Original Assignee
Bank of Communications Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of Communications Co Ltd
Priority to CN202110566536.XA
Publication of CN113268986A
Application granted
Publication of CN113268986B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a unit name matching and searching method and device based on a fuzzy matching algorithm. Custom word segmentation is performed on the two unit names to be matched to obtain their custom word segments, which fall into three types: attribute words, regional words and alias words (other nouns). Based on the segmentation result, a fuzzy matching score is calculated for each type of custom word segment in the two unit names, and these per-type scores are weighted to obtain the fuzzy matching score of the two unit names. Compared with the prior art, the method matches and searches unit names more accurately.

Description

Unit name matching and searching method and device based on fuzzy matching algorithm
Technical Field
The invention relates to a unit name matching and searching method and device, in particular to a unit name matching and searching method and device based on a fuzzy matching algorithm.
Background
In the current information technology era, industries of all kinds, and the financial industry in particular, have an increasingly urgent need to mine client information, and require that the information be more accurate and be mined faster. To meet this need, the Elasticsearch (ES) database has been introduced. As a distributed database, it provides scalable, near real-time search and supports exact and fuzzy queries over large amounts of data.
ES fuzzy query is a fast fuzzy matching query method built on the word segmentation that ships with Elasticsearch. Elasticsearch is a distributed, multi-user full-text search engine that indexes data as JSON over HTTP. ES can be used for full-text retrieval, structured search and data analysis. Using the ES dictionary library and word segmentation method, similar data in a table can be queried quickly.
However, the native word segmentation method offers only one approach, so its application scenarios are limited, and it cannot serve business scenarios that require fuzzy matching of multiple unit names together with a similarity score. The native ES fuzzy query depends on word libraries, so it cannot deliver accurate matches where unit names demand high fuzzy matching quality. In the prior art, a unit name is searched by segmenting the unit name to be matched directly with the Elasticsearch IK word segmentation. The native ES fuzzy matching method is invoked by sending an HTTP JSON message to the ES database. This technique is limited to splitting the input unit name into index terms by IK segmentation and then locating matching names in the database by those terms. The matching method is simple, and for unit names that follow certain patterns it produces wrong matches, specifically:
1. Unit names contain abbreviations, which ES cannot recognize directly;
2. Unit names contain region names, but ES cannot recognize the region, so unit names from other regions may also be returned;
3. Unit names contain alias fields, but ES cannot directly retrieve unit names by alias;
4. Because user input is unpredictable, a unit name may contain many invalid characters, such as brackets and dots, which ES cannot recognize;
5. There is no concept of weight: every matched segmentation index term in ES carries the same weight, and the weights of different word segments cannot be adjusted dynamically.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a unit name matching and searching method and device based on a fuzzy matching algorithm.
The aim of the invention can be achieved by the following technical scheme:
A unit name matching method based on fuzzy matching algorithm includes the following steps:
S1. Acquire the two unit names to be matched and judge whether one contains the other or they match exactly; if so, directly output a fuzzy matching score of 100; otherwise execute step S2;
S2. Preprocess the two unit names respectively, including standardization processing and filtering processing;
S3. Perform custom word segmentation on the two unit names respectively to obtain their custom word segments, which fall into three types: attribute words, regional words and alias words (other nouns);
S4. Determine, based on the custom word segmentation result, whether the fuzzy matching calculation can be carried out directly; if so, execute step S6; otherwise execute step S5;
S5. Perform direct fuzzy matching judgment on the two unit names; if a direct judgment condition is met, directly output a fuzzy matching score of 100; otherwise execute step S6;
S6. Calculate the fuzzy matching score of each type of custom word segment in the two unit names based on the segmentation result, and weight these per-type scores to obtain the fuzzy matching score of the two unit names.
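The control flow of steps S1 to S6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the five injected callables (`preprocess`, `segment`, `can_score_directly`, `direct_judgment`, `weighted_score`) are hypothetical placeholders for the steps described above, and treating an empty name as a non-match is an added assumption.

```python
from typing import Callable, Dict, List

Segments = Dict[str, List[str]]  # hypothetical shape: class name -> word segments


def contained_or_equal(a: str, b: str) -> bool:
    """Step S1: the names are identical or one contains the other."""
    if not a or not b:
        # Assumption: an empty name never counts as a match.
        return False
    return a in b or b in a  # equality is a special case of containment


def match_unit_names(
    a: str,
    b: str,
    preprocess: Callable[[str], str],
    segment: Callable[[str], Segments],
    can_score_directly: Callable[[Segments, Segments], bool],
    direct_judgment: Callable[[str, str, Segments, Segments], bool],
    weighted_score: Callable[[Segments, Segments], float],
) -> float:
    if contained_or_equal(a, b):                  # S1
        return 100.0
    a, b = preprocess(a), preprocess(b)           # S2
    seg_a, seg_b = segment(a), segment(b)         # S3
    if not can_score_directly(seg_a, seg_b):      # S4: "no" leads to S5
        if direct_judgment(a, b, seg_a, seg_b):   # S5
            return 100.0
    return weighted_score(seg_a, seg_b)           # S6
```

Passing the steps in as callables keeps the skeleton independent of any particular lexicon or scoring formula.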
Preferably, the standardization processing in step S2 includes: replacing abbreviations in the unit names with standard words based on an abbreviation word stock.
Preferably, the filtering processing in step S2 includes: deleting meaningless characters in the unit names, and deleting invalid words in the unit names based on an invalid word stock.
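A minimal sketch of the S2 preprocessing is given below, assuming small in-memory word stocks. The contents of `ABBREVIATIONS` and `INVALID_WORDS` are hypothetical examples, and treating every non-word, non-space character as meaningless is an assumption; the source does not list the actual word stocks.

```python
import re

# Hypothetical abbreviation word stock: abbreviation -> standard word.
ABBREVIATIONS = {"Co.": "Company", "Ltd.": "Limited"}
# Hypothetical invalid word stock: words that carry no matching value.
INVALID_WORDS = ["Limited", "Company"]
# Assumption: anything that is neither a word character nor whitespace is
# meaningless. \w is Unicode-aware for str patterns, so CJK text is kept.
MEANINGLESS = re.compile(r"[^\w\s]")


def normalize(name: str) -> str:
    """Standardization: replace abbreviations with standard words."""
    for short, full in ABBREVIATIONS.items():
        name = name.replace(short, full)
    return name


def filter_name(name: str) -> str:
    """Filtering: drop meaningless characters, then invalid words."""
    name = MEANINGLESS.sub("", name)
    for word in INVALID_WORDS:
        name = name.replace(word, "")
    return " ".join(name.split())  # collapse leftover whitespace


def preprocess(name: str) -> str:
    return filter_name(normalize(name))


print(preprocess("Huatai Futures Co., Ltd."))  # Huatai Futures
```

Standardization runs before filtering so that an expanded abbreviation can still be removed as an invalid word.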
Preferably, in step S4 the fuzzy matching calculation is determined to be directly performable when either of the following conditions holds:
the word segmentation results of both unit names contain regional words, and the regional words differ;
the word segmentation result of one unit name contains alias words, and that of the other does not.
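The two S4 conditions can be expressed as a short predicate. This sketch assumes each segmentation result is a dictionary with the hypothetical keys `region`, `attribute` and `alias`, and compares regional words as sets; both choices are illustrative rather than taken from the source.

```python
from typing import Dict, List

Segments = Dict[str, List[str]]  # hypothetical keys: "region", "attribute", "alias"


def can_score_directly(seg_a: Segments, seg_b: Segments) -> bool:
    """Step S4: True means skip the direct judgment and go straight to S6."""
    # Condition 1: both names have regional words and they differ.
    regions_differ = (
        bool(seg_a["region"])
        and bool(seg_b["region"])
        and set(seg_a["region"]) != set(seg_b["region"])
    )
    # Condition 2: exactly one of the names has alias words.
    alias_one_sided = bool(seg_a["alias"]) != bool(seg_b["alias"])
    return regions_differ or alias_one_sided
```

With one regional set empty and both alias sets non-empty, neither condition holds and the predicate returns `False`, so the flow continues with step S5.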
Preferably, the direct fuzzy matching judgment of step S5 specifically includes:
S51. If the word segmentation results of both unit names contain alias words and the alias words differ, execute step S52; otherwise execute step S53;
S52. ES word segmentation judgment: segment the two unit names respectively with the IK word segmentation method provided with ES to obtain two sets of word segments; denote the shorter set as Aset and the longer set as Bset. If every word segment in Aset is contained in Bset, directly output a fuzzy matching score of 100. Otherwise, count the word segments that appear in both sets as Cnt and the total number of word segments in Aset as Cnaset; if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, directly output a fuzzy matching score of 100; otherwise execute step S53, where esLen and esPct are preset thresholds;
S53. Common word judgment: denote the unit name with fewer characters as A and the one with more characters as B; traverse the characters of A in order against B to obtain the longest matched character string, called the common word C. If the common word exactly matches one of the unit names and is longer than 3 characters, directly output a fuzzy matching score of 100. Otherwise, perform custom word segmentation again on the common word C and the unit name B; if the attribute words in the two segmentation results are the same and the alias words are empty, directly output a fuzzy matching score of 100; otherwise execute step S54;
S54. Subset judgment: perform subset judgment on the unit names preprocessed in step S2; if one contains the other, directly output a fuzzy matching score of 100; otherwise execute step S55;
S55. Custom word segmentation judgment: if, in the segmentation results of the two unit names, the two regional word sets are non-empty and equal and the two alias word sets are non-empty and equal, directly output a fuzzy matching score of 100; otherwise execute step S6.
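The ES word segmentation judgment of step S52 reduces to set arithmetic once the IK word segments are available. The sketch below assumes the segments have already been obtained from ES and are passed in as lists. The function name is hypothetical, the source does not give values for esLen and esPct, and returning `False` for an empty segment set is an added guard.

```python
from typing import Iterable


def es_token_judgment(
    tokens_a: Iterable[str],
    tokens_b: Iterable[str],
    es_len: int,
    es_pct: float,
) -> bool:
    """Step S52: True means a fuzzy matching score of 100 is output directly."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    # Aset is the shorter segment set, Bset the longer one.
    aset, bset = (set_a, set_b) if len(set_a) <= len(set_b) else (set_b, set_a)
    if not aset:
        return False  # assumption: nothing to compare, fall through to S53
    if aset <= bset:  # every segment of Aset is contained in Bset
        return True
    cnt = len(aset & bset)  # Cnt: segments that appear in both sets
    cn_aset = len(aset)     # Cnaset: number of segments in Aset
    return cn_aset >= es_len and cnt / cn_aset >= es_pct
```

A `False` result corresponds to continuing with the common word judgment of step S53.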
Preferably, the fuzzy matching score of each type of custom word segment in step S6 is obtained as follows:
Denote the two unit names as A and B. For custom word segments of class i, take the class i segment sets of the two unit names and compute the length of their longest common subsequence, maxSubSeq(A, B)_i, using a longest common subsequence length algorithm; the fuzzy matching score score_i of class i is then calculated by the following formula:
[formula missing from this text]
where |A|_i denotes the length of the class i custom word segments in A, |B|_i denotes the length of the class i custom word segments in B, and i ranges over the three classes of attribute words, regional words and alias words.
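The per-class score can be sketched with the classic dynamic-programming computation of the longest common subsequence length. The scoring formula itself does not appear in this text, so the normalisation used in `class_score` (the mean of maxSubSeq/|A|_i and maxSubSeq/|B|_i) is purely an assumption chosen for illustration, as are the function names and the choice to concatenate each class's word segments into one string.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def class_score(words_a: List[str], words_b: List[str]) -> float:
    """Score for one class of custom word segments, in the range 0..1."""
    a, b = "".join(words_a), "".join(words_b)
    if not a or not b:
        return 0.0  # an empty class on either side scores 0
    common = lcs_length(a, b)
    # ASSUMED normalisation, not taken from the source text.
    return (common / len(a) + common / len(b)) / 2
```

Under this assumed normalisation, segment strings of lengths 6 and 2 that share a 2-character subsequence score (2/6 + 2/2) / 2, roughly 0.667.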
Preferably, the fuzzy matching score of the two unit names is obtained by:
mathScore = round(100 × Σ_i (w_i × score_i), 2)
where mathScore denotes the fuzzy matching score of the two unit names, round(·, 2) denotes rounding to 2 decimal places, and w_i denotes the weight of the class i custom word segments.
A unit name searching method based on fuzzy matching algorithm includes:
Obtaining a unit name to be searched, and determining, through ES fuzzy matching, the unit name to be matched that has the highest matching degree in a unit name word stock;
Matching the unit name to be searched against the unit name to be matched by the unit name matching method based on the fuzzy matching algorithm as set forth in any one of claims 1 to 7, thereby determining the fuzzy matching score between the unit name to be searched and the unit name to be matched in the unit name word stock.
A unit name matching device based on a fuzzy matching algorithm, the device comprising a memory for storing a computer program and a processor for implementing the unit name matching method based on the fuzzy matching algorithm when executing the computer program.
A fuzzy matching algorithm based unit name lookup apparatus comprising a memory for storing a computer program and a processor for implementing the fuzzy matching algorithm based unit name lookup method when the computer program is executed.
Compared with the prior art, the invention has the following advantages: the matching/searching method is finer-grained, and the matching/searching result is more accurate and reliable, specifically:
(1) Where unit names contain abbreviations, the invention can recognize the abbreviations and match them;
(2) Where unit names contain region names, the invention can distinguish the region names in the unit names to be matched and match according to the actual region names;
(3) Where a unit name contains an alias field, the invention can recognize the alias and match on it;
(4) The invention screens out useless words before matching, which improves the matching accuracy;
(5) The invention divides the unit names to be matched into regions, attributes and aliases, and the matching weight of each class can be set independently, making configuration more flexible.
Drawings
FIG. 1 is a block flow diagram of a unit name matching method based on a fuzzy matching algorithm;
FIG. 2 is a block flow diagram of a unit name lookup method based on fuzzy matching algorithm.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. Note that the following description of the embodiments is merely an example, and the present invention is not intended to be limited to the applications and uses thereof, and is not intended to be limited to the following embodiments.
Example 1
This embodiment provides a unit name matching method based on a fuzzy matching algorithm, which realizes one-to-one fuzzy matching and comprises the following steps:
S1. Acquire the two unit names to be matched, denoted A and B, and judge whether one contains the other or they match exactly; if so, directly output a fuzzy matching score of 100; otherwise execute step S2.
S2. Preprocess the two unit names respectively, including standardization processing and filtering processing. The standardization processing includes: replacing abbreviations in the unit names with standard words based on an abbreviation word stock. The filtering processing includes: deleting meaningless characters in the unit names, such as punctuation marks and symbols (for example "!", "%", "$", "#", "&", "/", ":", "+" and brackets), and deleting invalid words in the unit names based on an invalid word stock.
S3. Perform custom word segmentation on the two unit names to obtain their custom word segments, which fall into three types: attribute words, regional words and alias words. A regional word denotes the region named in the unit name; an attribute word denotes the industry type of the unit, for example securities company or electric power company; all words other than attribute words and regional words are alias words. Before segmentation, abbreviated regional words are first restored to their full form by supplementing the province and city suffixes; regional words are then extracted, finally yielding the regional word segment set at province, city and county level.
S4. Determine, based on the custom word segmentation result, whether the fuzzy matching calculation can be carried out directly; if so, execute step S6; otherwise execute step S5. Specifically, the calculation is determined to be directly performable when either of the following conditions holds: the word segmentation results of both unit names contain regional words, and the regional words differ; or the word segmentation result of one unit name contains alias words, and that of the other does not.
S5. Perform direct fuzzy matching judgment on the two unit names; if a direct judgment condition is met, directly output a fuzzy matching score of 100; otherwise execute step S6. Step S5 specifically includes:
S51. If the word segmentation results of both unit names contain alias words and the alias words differ, execute step S52; otherwise execute step S53;
S52. ES word segmentation judgment: segment the two unit names respectively with the IK word segmentation method provided with ES to obtain two sets of word segments; denote the shorter set as Aset and the longer set as Bset. If every word segment in Aset is contained in Bset, directly output a fuzzy matching score of 100. Otherwise, count the word segments that appear in both sets as Cnt and the total number of word segments in Aset as Cnaset; if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, directly output a fuzzy matching score of 100; otherwise execute step S53, where esLen and esPct are preset thresholds;
S53. Common word judgment: denote the unit name with fewer characters as A and the one with more characters as B; traverse the characters of A in order against B to obtain the longest matched character string, called the common word C. If the common word exactly matches one of the unit names and is longer than 3 characters, directly output a fuzzy matching score of 100. Otherwise, perform custom word segmentation again on the common word C and the unit name B; if the attribute words in the two segmentation results are the same and the alias words are empty, directly output a fuzzy matching score of 100; otherwise execute step S54;
S54. Subset judgment: perform subset judgment on the unit names preprocessed in step S2; if one contains the other, directly output a fuzzy matching score of 100; otherwise execute step S55;
S55. Custom word segmentation judgment: if, in the segmentation results of the two unit names, the two regional word sets are non-empty and equal and the two alias word sets are non-empty and equal, directly output a fuzzy matching score of 100; otherwise execute step S6.
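The first branch of the common word judgment in step S53 can be sketched as follows. Reading "the longest matched character string" as the longest common substring (a contiguous run) is an interpretation, not something the text states outright. The function names are hypothetical, and the second branch (re-segmenting C and B) is omitted because it depends on the custom segmentation lexicons.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ch_a in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def common_word_full_match(name_1: str, name_2: str, min_len: int = 3) -> bool:
    """S53, first branch: C equals one of the names and is longer than min_len."""
    a, b = sorted((name_1, name_2), key=len)  # A is the shorter name
    common = longest_common_substring(a, b)
    return len(common) > min_len and common in (a, b)
```

When this returns `False`, the described flow re-segments C and B before deciding whether to continue with step S54.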
S6. Calculate the fuzzy matching score of each type of custom word segment in the two unit names based on the segmentation result, and weight these per-type scores to obtain the fuzzy matching score of the two unit names. The magnitude of the fuzzy matching score represents how well the two unit names match: the larger the score, the higher the matching degree, and the smaller the score, the lower the matching degree.
In step S6, the fuzzy matching score of each type of custom word segment is obtained as follows:
For custom word segments of class i, take the class i segment sets of the two unit names and compute the length of their longest common subsequence, maxSubSeq(A, B)_i, using a longest common subsequence length algorithm; the fuzzy matching score score_i of class i is then calculated by the following formula:
[formula missing from this text]
where |A|_i denotes the length of the class i custom word segments in A, |B|_i denotes the length of the class i custom word segments in B, and i ranges over the three classes of attribute words, regional words and alias words.
The fuzzy matching score of the two unit names is obtained by:
mathScore = round(100 × Σ_i (w_i × score_i), 2)
where mathScore denotes the fuzzy matching score of the two unit names, round(·, 2) denotes rounding to 2 decimal places, and w_i denotes the weight of the class i custom word segments.
The weights of regional words, attribute words and alias words are initialized according to the business scenario and the desired weight proportions. In this embodiment they are set as: regional word weight = 0.3, attribute word weight = 0.15, alias word weight = 0.55 (the initial weights may be adjusted case by case). The initial weights are then adjusted according to the word segment sets obtained from A and B. For example, if neither A nor B yields any alias word segments, the initial alias weight of 0.55 is split equally between the regional and attribute weights: the regional weight becomes 0.3 + 0.55/2 = 0.575 and the attribute weight becomes 0.15 + 0.55/2 = 0.425.
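The weight adjustment can be sketched as below. The text gives only the case where alias words are absent from both names, so generalising it to "split the weight of any class that is empty in both names equally among the remaining classes" is an assumption; the dictionary keys and function name are hypothetical.

```python
from typing import Dict, List

Segments = Dict[str, List[str]]

# Initial weights of this embodiment.
DEFAULT_WEIGHTS = {"region": 0.30, "attribute": 0.15, "alias": 0.55}


def adjust_weights(
    seg_a: Segments,
    seg_b: Segments,
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
) -> Dict[str, float]:
    """Redistribute the weight of classes that are empty in both names."""
    missing = [k for k in weights if not seg_a[k] and not seg_b[k]]
    present = [k for k in weights if k not in missing]
    if not present:
        return dict(weights)  # nothing to redistribute to
    # ASSUMED generalisation: equal split among the remaining classes.
    bonus = sum(weights[k] for k in missing) / len(present)
    return {k: 0.0 if k in missing else weights[k] + bonus for k in weights}


no_alias_a = {"region": ["Guangdong province"], "attribute": ["securities"], "alias": []}
no_alias_b = {"region": ["Guangdong province"], "attribute": ["futures"], "alias": []}
print(adjust_weights(no_alias_a, no_alias_b))
```

For the no-alias case this yields approximately 0.575 for the regional weight and 0.425 for the attribute weight, matching the arithmetic in the paragraph above.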
In this embodiment, "Huatai united securities limited liability company Guangzhou sports east securities" and "Huatai futures limited company" are taken as a specific example:
The two company names are segmented respectively, and the custom word segmentation results are as follows:
The custom word segmentation result of "Huatai united securities limited liability company Guangzhou sports east securities" is ([Guangdong province, Guangzhou city], [sports, securities], [Huatai united east road]), where [Guangdong province, Guangzhou city] is the regional word set, [sports, securities] is the attribute word set, and [Huatai united east road] is the alias word set;
The custom word segmentation result of "Huatai futures limited company" is ([ ], [futures], [Huatai]), where the regional word set is empty, [futures] is the attribute word set, and [Huatai] is the alias word set;
After region extraction, the following is obtained:
([Guangdong province], [sports, securities], [Huatai united east road])
([ ], [futures], [Huatai]);
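Segmentation results of the shape shown above could be produced by a lexicon-driven extraction such as the following sketch. The two lexicons and the sample strings are illustrative assumptions rather than data from the source. Restoring abbreviated region names, which the method performs before extraction, is not modelled here.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicons; a real deployment would load full word stocks.
REGION_LEXICON = ["广东省", "广州市"]
ATTRIBUTE_LEXICON = ["证券", "期货"]


def extract(name: str, lexicon: List[str]) -> Tuple[List[str], str]:
    """Pull lexicon words out of name; return (found words, remaining text)."""
    found = []
    for word in sorted(lexicon, key=len, reverse=True):  # longest words first
        if word in name:
            found.append(word)
            name = name.replace(word, "")
    return found, name


def segment(name: str) -> Dict[str, List[str]]:
    """Split a preprocessed name into regional, attribute and alias words."""
    regions, rest = extract(name, REGION_LEXICON)
    attributes, rest = extract(rest, ATTRIBUTE_LEXICON)
    alias = [rest] if rest else []  # whatever is left counts as alias words
    return {"region": regions, "attribute": attributes, "alias": alias}


print(segment("华泰期货"))
# {'region': [], 'attribute': ['期货'], 'alias': ['华泰']}
```

Extracting regional words first and attribute words second means anything left over falls into the alias class, in line with the definition of alias words as everything other than attribute and regional words.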
Step S4 determines that the fuzzy matching calculation cannot be carried out directly, so step S5 is executed, performing the common word judgment, subset judgment and custom word segmentation judgment in turn. None of them outputs a fuzzy matching score directly, so step S6 is executed to calculate the fuzzy matching score:
Because the regional word set of "Huatai futures limited company" is empty, the fuzzy matching score of the regional words is 0;
For the alias word sets [Huatai united east road] and [Huatai], the fuzzy matching score of the alias words is 0.6666665;
For the attribute word sets [sports, securities] and [futures], the fuzzy matching score of the attribute words is 0;
Weights: regional word weight = 0.3, alias word weight = 0.55, attribute word weight = 0.15;
Fuzzy matching score of the two unit names = round(0.55 × 0.6666665 × 100, 2) = 36.67, i.e. about 37.
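The final weighting can be checked numerically with a few lines. The function name and dictionary keys are hypothetical; the per-class scores and weights are the ones listed in this worked example.

```python
from typing import Dict


def total_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-class scores on a 0..100 scale, 2 decimals."""
    return round(100 * sum(weights[k] * scores[k] for k in weights), 2)


scores = {"region": 0.0, "attribute": 0.0, "alias": 0.6666665}
weights = {"region": 0.3, "attribute": 0.15, "alias": 0.55}
print(total_score(scores, weights))  # 36.67
```

Rounding to two decimal places gives 36.67; the value 37 corresponds to rounding the same product to the nearest integer.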
Example 2
As shown in fig. 2, the unit name searching method based on the fuzzy matching algorithm in this embodiment includes:
Obtaining a unit name to be searched, and determining, through ES fuzzy matching, the unit name to be matched that has the highest matching degree in a unit name word stock;
Matching the unit name to be searched against the unit name to be matched by the unit name matching method based on the fuzzy matching algorithm as set forth in any one of claims 1 to 7, thereby determining the fuzzy matching score between the unit name to be searched and the unit name to be matched in the unit name word stock.
In this embodiment, the unit name matching method based on the fuzzy matching algorithm is identical to that of embodiment 1, and will not be described in detail in this embodiment.
Specific examples are described below:
Unit name to be searched: the Jiangxi province broadcasting and television general office;
In the unit name word stock, the name field type is set to keyword (exact query) in advance. The name with the highest matching score is taken from the ES keyword query results, and a one-to-one matching similarity calculation is then carried out on it. The unit names returned by ES and their ES match scores are as follows:
Broadcast television bureau in Jiangxi province (ES match score: 18.56986)
Broadcast television bureau in Shandong province in Heilongjiang province (ES match score: 14.986288)
General Condition in Jiangxi province (ES match score: 14.648104)
River basin bureau in Guangdong province (ES match score: 13.420963)
Water engineering bureau Limited in Jiangxi province (ES match score: 12.696965)
The matching name with the highest score, "Broadcast television bureau in Jiangxi province", is selected; a one-to-one matching calculation between the unit name to be searched and this keyword-matched name yields a similarity of 100 points.
The above results show that the searching method realizes one-to-many matching and realizes the matching searching of the unit names to be searched in the unit name word stock.
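The two-stage lookup can be sketched as below, assuming the ES stage has already returned candidate names with their ES match scores. The `lookup` function and the injected `match` callable are hypothetical; `match` stands in for the one-to-one matching method of Example 1, and no Elasticsearch client call is shown.

```python
from typing import Callable, List, Optional, Tuple


def lookup(
    query: str,
    candidates: List[Tuple[str, float]],
    match: Callable[[str, str], float],
) -> Optional[Tuple[str, float]]:
    """Pick the best ES candidate, then score it one-to-one against the query."""
    if not candidates:
        return None
    best_name, _ = max(candidates, key=lambda c: c[1])  # highest ES score
    return best_name, match(query, best_name)


# Candidate list as reported in this example (name, ES match score).
candidates = [
    ("Broadcast television bureau in Jiangxi province", 18.56986),
    ("Broadcast television bureau in Shandong province in Heilongjiang province", 14.986288),
    ("General Condition in Jiangxi province", 14.648104),
    ("River basin bureau in Guangdong province", 13.420963),
    ("Water engineering bureau Limited in Jiangxi province", 12.696965),
]

# Placeholder matcher for demonstration only: exact equality scores 100.
exact = lambda a, b: 100.0 if a == b else 0.0
print(lookup("Broadcast television bureau in Jiangxi province", candidates, exact))
```

Keeping the ES ranking and the one-to-one score separate lets ES narrow the candidates while the final similarity comes from the weighted method.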
Example 3
The present embodiment provides a unit name matching device based on a fuzzy matching algorithm based on embodiment 1, the device including a memory for storing a computer program and a processor for implementing the unit name matching method based on the fuzzy matching algorithm in embodiment 1 when the computer program is executed.
Example 4
The present embodiment provides a unit name search device based on a fuzzy matching algorithm based on embodiment 2, the device including a memory for storing a computer program and a processor for implementing the unit name search method based on the fuzzy matching algorithm in embodiment 2 when the computer program is executed.
Through the custom fuzzy matching technique, the invention improves the accuracy of matching the units of a bank's high-quality customer groups. It suits both the business scenario in which a single unit name is entered for matching and the scenario in which two unit names are entered and matched against each other. The fuzzy matching scenarios are more flexible and the fuzzy matching results more accurate. It also provides custom word segmentation by region, attribute and alias, with a score and a weight for each class of word segment; the weights can be customized, giving the business side more control.
The above embodiments are merely examples, and do not limit the scope of the present invention. These embodiments may be implemented in various other ways, and various omissions, substitutions, and changes may be made without departing from the scope of the technical idea of the present invention.

Claims (6)

1. The unit name matching method based on the fuzzy matching algorithm is characterized by comprising the following steps of:
S1. Acquire the two unit names to be matched and judge whether one contains the other or they match exactly; if so, directly output a fuzzy matching score of 100; otherwise execute step S2;
S2. Preprocess the two unit names respectively, including standardization processing and filtering processing;
S3. Perform custom word segmentation on the two unit names respectively to obtain their custom word segments, which fall into three types: attribute words, regional words and alias words (other nouns);
S4. Determine, based on the custom word segmentation result, whether the fuzzy matching calculation can be carried out directly; if so, execute step S6; otherwise execute step S5;
S5. Perform direct fuzzy matching judgment on the two unit names; if a direct judgment condition is met, directly output a fuzzy matching score of 100; otherwise execute step S6;
S6. Calculate the fuzzy matching score of each type of custom word segment in the two unit names based on the segmentation result, and weight these per-type scores to obtain the fuzzy matching score of the two unit names;
in step S4, the fuzzy matching calculation is determined to be directly performable when either of the following conditions holds:
the word segmentation results of both unit names contain regional words, and the regional words differ;
the word segmentation result of one unit name contains alias words, and that of the other does not;
the specific determination of the fuzzy matching direct determination in the step S5 comprises the following steps:
S51, if the word segmentation results of the two unit names contain other nouns and the other nouns are different, executing the step S52, otherwise executing the step S53;
S52, ES word segmentation judgment: respectively performing ES word segmentation on two unit names by using an IK word segmentation method carried by an ES word stock to obtain two word segmentation combinations, recording a shorter word segmentation set as an Aset and a longer word segmentation set as a Bset, directly outputting a fuzzy matching score of 100 if all the words in the Aset are contained in the Bset, otherwise, counting repeated word segmentation in the two word segmentation sets as a Cnt, and simultaneously recording the total number of the word segments in the Aset as Cnaset, wherein if Cnaset is more than or equal to esLen and Cnt/Cnaset is more than or equal to esPct, directly outputting the fuzzy matching score of 100, otherwise, executing step S53, wherein esLen and esPct are set thresholds;
S53, public word judgment: marking the unit name with fewer characters as A and the unit name with more characters as B; traversing each character of A in order against B to obtain the longest matched character string set, called the public word C; if the public word completely matches one of the unit names and its length exceeds 3 characters, directly outputting a fuzzy matching score of 100; otherwise, performing custom word segmentation again on the public word C and the unit name B, and if the attribute words in the two word segmentation results are the same and the other nouns are empty, directly outputting a fuzzy matching score of 100, otherwise executing step S54;
S54, subset judgment: performing subset judgment on the unit names preprocessed in step S2; if one unit name is contained in the other, directly outputting a fuzzy matching score of 100, otherwise executing step S55;
S55, custom word segmentation judgment: if, in the word segmentation results of the two unit names, the two regional word sets are non-empty and equal and the two individual noun sets are non-empty and equal, directly outputting a fuzzy matching score of 100, otherwise executing step S6;
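Two of the direct determinations above, S52 and S54, reduce to simple set and substring checks. The sketch below is illustrative only: it takes already-tokenized input instead of calling the IK tokenizer, the function names are hypothetical, treating an empty token set as "no direct match" is an added assumption, and no values for esLen and esPct are given in the claim.

```python
def es_token_judgment(tokens_a, tokens_b, es_len, es_pct):
    """S52 sketch: return True if the score 100 is output directly,
    False if the flow would fall through to step S53."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    # Aset is the shorter token set, Bset the longer one.
    aset, bset = (set_a, set_b) if len(set_a) <= len(set_b) else (set_b, set_a)
    if not aset:
        # Assumption: an empty token set never triggers a direct match.
        return False
    if aset <= bset:
        # Every token of Aset is contained in Bset.
        return True
    cnt = len(aset & bset)   # tokens occurring in both sets (Cnt)
    cnaset = len(aset)       # total number of tokens in Aset (Cnaset)
    return cnaset >= es_len and cnt / cnaset >= es_pct


def subset_judgment(name_a, name_b):
    """S54 sketch: True when one preprocessed name is contained in the other."""
    return name_a in name_b or name_b in name_a
```

For instance, with hypothetical thresholds `es_len=3` and `es_pct=0.6`, token lists sharing three of the four tokens in the shorter set would be accepted directly, while a two-token shorter set that is not fully contained would fall through to S53.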
In step S6, the fuzzy matching score of each class of custom words is obtained as follows:
Denoting the two unit names as A and B, for the class-i custom words, calculating the longest common subsequence length maxSubSeq(A, B)_i from the sets of class-i custom words corresponding to the two unit names using a longest common subsequence length calculation method, and then calculating the fuzzy matching score_i of the class-i custom words using the following formula:
wherein |A|_i denotes the length of the class-i custom words in A, |B|_i denotes the length of the class-i custom words in B, and i ranges over three categories: attribute words, regional words and alias words;
the fuzzy matching score of the two unit names is obtained by:
wherein mathScore denotes the fuzzy matching score of the two unit names, the result is computed to 2 decimal places, and w_i denotes the weight of the class-i custom words.
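The longest common subsequence length used in step S6 can be computed with the standard dynamic programming recurrence. The sketch below covers only that step: the per-class scoring formula and the weights w_i are not reproduced in this text, so they are deliberately left out. The function name `lcs_length` and the choice to compare two strings character by character are assumptions for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b,
    using O(len(a) * len(b)) time and O(len(b)) memory."""
    prev = [0] * (len(b) + 1)  # DP row for the prefix of a processed so far
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching characters extend the subsequence found diagonally.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of "skip x" and "skip y".
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Under these assumptions, maxSubSeq(A, B)_i would correspond to `lcs_length` applied to the class-i custom words of A and B, with |A|_i and |B|_i being the lengths of those two inputs.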
2. The unit name matching method based on the fuzzy matching algorithm of claim 1, wherein the normalization processing in step S2 includes: replacing the abbreviation words in the unit names with standard words based on the abbreviation word stock.
3. The unit name matching method based on the fuzzy matching algorithm of claim 1, wherein the filtering in step S2 includes: deleting meaningless characters from the unit names, and deleting invalid words from the unit names based on the invalid word stock.
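The normalization and filtering of claims 2 and 3 might be sketched as below. Everything concrete here is an assumption: the function name, the order of the three operations, the regular expression used to define "meaningless characters" (punctuation, whitespace and underscores), and the sample word stock entries, none of which are specified in the claims.

```python
import re


def preprocess(name: str, abbreviations: dict, invalid_words: list) -> str:
    """Hypothetical S2 preprocessing: filter characters, expand abbreviations,
    then drop invalid words."""
    # Delete meaningless characters (assumed: anything that is not a letter,
    # digit or CJK character, plus underscores).
    cleaned = re.sub(r"[\W_]+", "", name)
    # Replace abbreviation words with their standard words.
    for short, standard in abbreviations.items():
        cleaned = cleaned.replace(short, standard)
    # Delete invalid words listed in the invalid word stock.
    for word in invalid_words:
        cleaned = cleaned.replace(word, "")
    return cleaned
```

A call such as `preprocess("交行（上海）分行 筹建", {"交行": "交通银行"}, ["筹建"])`, with made-up word stock entries, would strip the brackets and space, expand the abbreviation, and remove the invalid word.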
4. The unit name searching method based on the fuzzy matching algorithm is characterized by comprising the following steps of:
obtaining a unit name to be searched, and determining the unit names to be matched with the highest matching degree in a unit name word stock through ES fuzzy matching;
matching the unit name to be searched against the unit names to be matched by adopting the unit name matching method based on the fuzzy matching algorithm as set forth in any one of claims 1 to 3, and determining the fuzzy matching scores between the unit name to be searched and the unit names to be matched in the unit name word stock.
5. A unit name matching device based on a fuzzy matching algorithm, characterized in that the device comprises a memory for storing a computer program and a processor for implementing the unit name matching method based on a fuzzy matching algorithm as claimed in any one of claims 1 to 3 when said computer program is executed.
6. A fuzzy matching algorithm based unit name lookup apparatus comprising a memory for storing a computer program and a processor for implementing the fuzzy matching algorithm based unit name lookup method as claimed in claim 4 when said computer program is executed.
CN202110566536.XA 2021-05-24 2021-05-24 Unit name matching and searching method and device based on fuzzy matching algorithm Active CN113268986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110566536.XA CN113268986B (en) 2021-05-24 2021-05-24 Unit name matching and searching method and device based on fuzzy matching algorithm

Publications (2)

Publication Number Publication Date
CN113268986A CN113268986A (en) 2021-08-17
CN113268986B true CN113268986B (en) 2024-05-24

Family

ID=77232606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110566536.XA Active CN113268986B (en) 2021-05-24 2021-05-24 Unit name matching and searching method and device based on fuzzy matching algorithm

Country Status (1)

Country Link
CN (1) CN113268986B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905473B (en) * 2022-12-20 2024-03-05 无锡锡银金科信息技术有限责任公司 Full noun fuzzy matching method, device and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102651013A (en) * 2012-03-23 2012-08-29 上海安捷力信息***有限公司 Method and system for extracting area information from enterprise name data
CN107644010A (en) * 2016-07-20 2018-01-30 阿里巴巴集团控股有限公司 A kind of Text similarity computing method and device
CN108829661A (en) * 2018-05-09 2018-11-16 成都信息工程大学 A kind of subject of news title extracting method based on fuzzy matching
CN109657213A (en) * 2018-12-21 2019-04-19 北京金山安全软件有限公司 Text similarity detection method and device and electronic equipment
CN110046341A (en) * 2018-12-29 2019-07-23 ***股份有限公司 For carrying out matched method and system to information
CN110309504A (en) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 Text handling method, device, equipment and storage medium based on participle
CN111104795A (en) * 2019-11-19 2020-05-05 平安金融管理学院(中国·深圳) Company name matching method and device, computer equipment and storage medium
CN111859042A (en) * 2020-07-30 2020-10-30 上海妙一生物科技有限公司 Retrieval method and device and electronic equipment
CN112364635A (en) * 2020-11-30 2021-02-12 中国银行股份有限公司 Enterprise name duplication checking method and device
CN112749255A (en) * 2020-12-30 2021-05-04 科大国创云网科技有限公司 Human-computer interaction semantic recognition intention matching method and system based on ES

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014028860A2 (en) * 2012-08-17 2014-02-20 Opera Solutions, Llc System and method for matching data using probabilistic modeling techniques
CN108108373B (en) * 2016-11-25 2020-09-25 阿里巴巴集团控股有限公司 Name matching method and device
US11232111B2 (en) * 2019-04-14 2022-01-25 Zoominfo Apollo Llc Automated company matching
US20210089562A1 (en) * 2019-09-20 2021-03-25 Jpmorgan Chase Bank, N.A. System and method for generating and implementing context weighted words

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
乔世权 ; 戴继勇 ; .基于文本相似度的智能查号引擎研究.河北科技大学学报.2018,第39卷(第3期),第282-288页. *
段晨迪.基于ElasticSearch面向MOOC的垂直搜索引擎设计与实现.《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》.2020,(第1期),第1138-2507页. *
连誉舜.中文组织机构名检索***的设计与实现.《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》.2016,(第4期),第I138-1519页. *
钟良伍,郑 方.基于中文机构名简称的检索方法研究.中文信息学报.2007,第21卷(第1期),第38-42页. *

Also Published As

Publication number Publication date
CN113268986A (en) 2021-08-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant