CN113268986A - Unit name matching and searching method and device based on fuzzy matching algorithm

Info

Publication number
CN113268986A
Authority
CN
China
Prior art keywords
fuzzy matching
unit
words
unit name
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110566536.XA
Other languages
Chinese (zh)
Other versions
CN113268986B (en)
Inventor
李君�
许志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of Communications Co Ltd
Original Assignee
Bank of Communications Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of Communications Co Ltd
Priority to CN202110566536.XA
Publication of CN113268986A
Application granted
Publication of CN113268986B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a unit name matching and searching method and device based on a fuzzy matching algorithm. Custom word segmentation is performed on two unit names to be matched to obtain the corresponding custom segmented words, which fall into three types: attribute words, region words and alias words. Fuzzy matching scores are then calculated for each type of custom segmented word in the two unit names based on the segmentation results, and these per-type scores are weighted to obtain the fuzzy matching score of the two unit names. Compared with the prior art, the method has the advantages of more accurate unit name matching and searching.

Description

Unit name matching and searching method and device based on fuzzy matching algorithm
Technical Field
The invention relates to a unit name matching and searching method and device, in particular to a unit name matching and searching method and device based on a fuzzy matching algorithm.
Background
In the information technology era, industries of all kinds, and the financial industry in particular, have an increasingly urgent need to mine client information, and they require the information to be more accurate and the mining to be faster. To meet this demand, the ES (Elasticsearch) database is introduced. It is a distributed database that provides scalable, near-real-time search and supports exact and fuzzy queries over large amounts of data.
ES fuzzy query is a fast fuzzy matching query method built on the word segmentation method that ships with Elasticsearch. Elasticsearch is a distributed, multi-user capable full-text search engine that indexes data as JSON over HTTP. ES can be used for full-text retrieval, structured search and data analysis. With the built-in dictionary and word segmentation method of ES, similar data in a database table can be queried quickly.
However, the native word segmentation method is limited to a single approach and a single application scenario, so it cannot serve the business scenario of fuzzy matching several unit names and obtaining their similarity. The native ES fuzzy query also depends on a word library, so it cannot achieve the accurate matching that unit names demand. In the prior art, a unit name is searched by segmenting the unit name to be matched directly with the IK word segmentation of Elasticsearch. When an application calls the native ES fuzzy matching method, fuzzy matching and querying are performed by sending a JSON message over HTTP to the ES database. This technique, however, only applies IK segmentation to the input unit name, splits it into index terms, and then locates matching names in the database according to those terms. The matching method is simple, and unit names that follow certain specific patterns are mismatched, specifically as follows:
1. unit names contain abbreviations, which ES cannot identify directly;
2. unit names contain region names, but ES cannot identify the region, so unit names from other regions are also returned;
3. unit names contain an alias field, but ES cannot directly pick out unit names by alias;
4. because user input is unpredictable, unit names contain many invalid characters such as brackets and dots, which ES cannot recognize;
5. there is no concept of weight: every matched segmented word in ES carries the same weight, and the weights of different segmented words cannot be adjusted dynamically.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a unit name matching and searching method and device based on a fuzzy matching algorithm.
The purpose of the invention can be realized by the following technical scheme:
A unit name matching method based on a fuzzy matching algorithm comprises the following steps:
S1, acquiring two unit names to be matched and judging whether one contains the other or the two match exactly; if so, directly outputting a fuzzy matching score of 100, otherwise executing step S2;
S2, preprocessing the two unit names respectively, the preprocessing comprising standardization and filtering;
S3, performing custom word segmentation on the two unit names respectively to obtain the corresponding custom segmented words, which comprise three types: attribute words, region words and alias words;
S4, judging, based on the custom word segmentation results, whether the fuzzy matching calculation can be carried out directly; if so, executing step S6, otherwise executing step S5;
S5, performing a direct fuzzy matching judgment on the two unit names; if the direct judgment condition is met, directly outputting a fuzzy matching score of 100, otherwise executing step S6;
S6, calculating the fuzzy matching score of each type of custom segmented word in the two unit names based on the custom word segmentation results, and weighting these scores to obtain the fuzzy matching score of the two unit names.
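The control flow of steps S1 to S6 can be sketched as follows. This is only an illustrative skeleton: the function name `match_unit_names`, the injected callables and the dictionary keys `"region"`, `"attribute"` and `"alias"` are assumptions made for this sketch and are not named in the source.

```python
from typing import Callable, Dict, List

# Assumed shape of a custom segmentation result (keys are hypothetical).
Segmentation = Dict[str, List[str]]


def match_unit_names(
    a: str,
    b: str,
    preprocess: Callable[[str], str],
    segment: Callable[[str], Segmentation],
    can_score_directly: Callable[[Segmentation, Segmentation], bool],
    direct_judgment: Callable[[str, str, Segmentation, Segmentation], bool],
    weighted_score: Callable[[Segmentation, Segmentation], float],
) -> float:
    # S1: exact match or mutual containment gives 100 immediately.
    if a in b or b in a:
        return 100.0
    # S2: standardization and filtering.
    a, b = preprocess(a), preprocess(b)
    # S3: custom word segmentation into region / attribute / alias words.
    seg_a, seg_b = segment(a), segment(b)
    # S4: if the weighted calculation cannot be done directly, try S5 first.
    if not can_score_directly(seg_a, seg_b):
        # S5: direct judgment; a hit outputs 100 without further calculation.
        if direct_judgment(a, b, seg_a, seg_b):
            return 100.0
    # S6: per-category scores combined with weights.
    return weighted_score(seg_a, seg_b)
```

Passing the individual steps in as callables keeps the skeleton independent of any particular lexicon or scoring formula.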
Preferably, the standardization in step S2 comprises: replacing abbreviations in the unit names with standard words based on an abbreviation library.
Preferably, the filtering in step S2 comprises: deleting meaningless characters from the unit names, and deleting invalid words from the unit names based on an invalid word library.
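A minimal sketch of the step S2 preprocessing is given below, assuming space-separated English names for readability. The lexicon contents, the character set and the function name `preprocess` are hypothetical placeholders; the source does not list the actual libraries.

```python
import re

# Hypothetical lexicons; real deployments would load these from the
# abbreviation library and invalid word library.
ABBREVIATIONS = {"Co": "Company", "Ltd": "Limited"}
INVALID_WORDS = {"test", "unknown"}
# Hypothetical set of meaningless characters (punctuation and brackets).
MEANINGLESS_CHARS = re.compile(r"[!:+.()\[\]\\]")


def preprocess(name: str) -> str:
    # Standardization: replace whole-word abbreviations with standard words.
    for short, full in ABBREVIATIONS.items():
        name = re.sub(rf"\b{re.escape(short)}\b", full, name)
    # Filtering, part 1: delete meaningless characters.
    name = MEANINGLESS_CHARS.sub("", name)
    # Filtering, part 2: delete invalid words found in the invalid word library.
    words = [w for w in name.split() if w.lower() not in INVALID_WORDS]
    return " ".join(words)
```

For Chinese unit names there are no spaces or word boundaries, so a real implementation would match lexicon entries as substrings instead of whole words.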
Preferably, in step S4 the fuzzy matching calculation can be carried out directly when either of the following conditions holds:
the word segmentation results of both unit names contain region words, and the region words differ;
the word segmentation result of one unit name contains alias words, and that of the other does not.
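The two conditions of step S4 can be expressed as a small predicate. The sketch assumes each segmentation result is a dictionary with the hypothetical keys `"region"` and `"alias"`, and it reads the first condition as "both names have region words and the sets differ", which is an interpretation of the translated text.

```python
from typing import Dict, List


def can_score_directly(seg_a: Dict[str, List[str]], seg_b: Dict[str, List[str]]) -> bool:
    regions_a, regions_b = set(seg_a["region"]), set(seg_b["region"])
    # Condition 1: both names carry region words, and the region words differ.
    regions_differ = bool(regions_a) and bool(regions_b) and regions_a != regions_b
    # Condition 2: exactly one of the two names has alias words.
    only_one_has_alias = bool(seg_a["alias"]) != bool(seg_b["alias"])
    return regions_differ or only_one_has_alias
```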
Preferably, the direct fuzzy matching judgment of step S5 comprises:
S51, if the word segmentation results of both unit names contain alias words and the alias words differ, executing step S52, otherwise executing step S53;
S52, ES word segmentation judgment: segmenting the two unit names with the IK word segmentation method of the ES word library to obtain two sets of segmented words, denoting the smaller set Aset and the larger set Bset; if every word in Aset is contained in Bset, directly outputting a fuzzy matching score of 100; otherwise counting the words common to the two sets as Cnt and the total number of words in Aset as Cnaset, and directly outputting a fuzzy matching score of 100 if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, otherwise executing step S53, where esLen and esPct are preset thresholds;
S53, common word judgment: denoting the unit name with fewer characters A and the unit name with more characters B, traversing each character of A against B in order to obtain the longest matched character string, called the common word C; if the common word exactly matches one of the unit names and is longer than 3 characters, directly outputting a fuzzy matching score of 100; otherwise performing custom word segmentation again on the common word C and the unit name B, and if the attribute words in the two segmentation results are the same and the alias words are empty, directly outputting a fuzzy matching score of 100, otherwise executing step S54;
S54, subset judgment: performing a subset judgment on the unit names preprocessed in step S2; if one contains the other, directly outputting a fuzzy matching score of 100, otherwise executing step S55;
S55, custom word segmentation judgment: if the region word sets in the two segmentation results are non-empty and equal, and the alias word sets are non-empty and equal, directly outputting a fuzzy matching score of 100, otherwise executing step S6.
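The threshold test of step S52 can be sketched as below. The sketch takes already-segmented token lists as input, so no Elasticsearch call is shown; the function name `es_token_judgment` and the guard for an empty token set are assumptions, and the source does not give values for esLen and esPct.

```python
from typing import Iterable


def es_token_judgment(tokens_a: Iterable[str], tokens_b: Iterable[str],
                      es_len: int, es_pct: float) -> bool:
    # Aset is the smaller token set, Bset the larger one.
    a_set, b_set = sorted((set(tokens_a), set(tokens_b)), key=len)
    if not a_set:
        # Assumed guard: an empty token set never triggers a direct 100.
        return False
    if a_set <= b_set:
        # Every word of Aset is contained in Bset.
        return True
    cnt = len(a_set & b_set)   # Cnt: words common to both sets
    cn_aset = len(a_set)       # Cnaset: total words in Aset
    return cn_aset >= es_len and cnt / cn_aset >= es_pct
```

A `True` result corresponds to directly outputting a fuzzy matching score of 100; `False` means the flow continues with step S53.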
Preferably, the fuzzy matching score of each type of custom segmented word in step S6 is obtained as follows:
denote the two company names A and B; for the custom segmented words of category i, a longest-common-subsequence length calculation is applied to the two names' word sets of that category to obtain the longest common subsequence length maxSubSeq_i(A, B); the fuzzy matching score of the category-i custom segmented words is then calculated by the following equation:
[Figure BDA0003081147450000031: equation image, not reproduced]
where |A|_i denotes the length of the category-i custom segmented words in A, |B|_i denotes the length of the category-i custom segmented words in B, and i ranges over the three categories: attribute words, region words and alias words.
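The equation image is not available in this text. One form that uses only the symbols defined here and that reproduces the alias-word score of 0.6666665 reported in the worked example of Example 1 is the mean of the two coverage ratios. This is an inference from that example, not the patent's confirmed formula, and the symbol name score_i is an assumption.

```latex
\mathrm{score}_i
  = \frac{1}{2}\left(
      \frac{\mathrm{maxSubSeq}_i(A,B)}{|A|_i}
    + \frac{\mathrm{maxSubSeq}_i(A,B)}{|B|_i}
    \right)
```

For instance, a common subsequence of length 1 against lengths 3 and 1 gives (0.333333 + 1)/2 = 0.6666665 when the first quotient is truncated to six decimals.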
Preferably, the fuzzy matching score of the two unit names is obtained by:
[Figure BDA0003081147450000041: equation image, not reproduced]
where mathScore denotes the fuzzy matching score of the two unit names, the operator shown in Figure BDA0003081147450000042 denotes rounding to 2 decimal places, and w_i is the weight of the category-i custom segmented words.
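Since the equation image is not available here, the following is a reconstruction based on the surrounding definitions and on the arithmetic round(0.55 × 0.6666665 × 100, 2) shown in the worked example of Example 1. The per-category symbol score_i is an assumed name.

```latex
\mathrm{mathScore}
  = \mathrm{round}\!\left( 100 \cdot \sum_{i} w_i \,\mathrm{score}_i ,\; 2 \right),
\qquad i \in \{\text{attribute},\ \text{region},\ \text{alias}\}
```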
A unit name searching method based on fuzzy matching algorithm includes:
acquiring a unit name to be searched, and determining, through ES fuzzy matching, the candidate name with the highest matching degree in a unit name library;
matching the unit name to be searched against the candidate name using the unit name matching method based on the fuzzy matching algorithm of any one of claims 1 to 7, to determine the fuzzy matching score between the candidate name in the unit name library and the unit name to be searched.
A unit name matching device based on a fuzzy matching algorithm comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the unit name matching method based on the fuzzy matching algorithm when the computer program is executed.
A unit name searching device based on a fuzzy matching algorithm comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the unit name searching method based on the fuzzy matching algorithm when the computer program is executed.
Compared with the prior art, the invention has the following advantages: the matching/searching method is more refined and its results are more accurate and reliable, specifically:
(1) where a unit name contains an abbreviation, the invention can identify and match the abbreviated unit name;
(2) where a unit name contains a region name, the invention can distinguish the region name in the unit name to be matched and match according to the actual region;
(3) where a unit name contains an alias field, the invention can identify the alias and match on it;
(4) the invention can filter out useless words before matching, improving matching accuracy;
(5) the unit name to be matched is divided into region, attribute and alias parts, and the matching weight of each category can be set independently, making configuration more flexible.
Drawings
FIG. 1 is a block diagram of a unit name matching method based on fuzzy matching algorithm according to the present invention;
FIG. 2 is a block diagram of a unit name search method based on fuzzy matching algorithm according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. Note that the following description of the embodiments is merely illustrative in nature; it is not intended to limit the invention, its applications or its uses, and the invention is not limited to the following embodiments.
Example 1
This embodiment 1 provides a unit name matching method based on a fuzzy matching algorithm, which implements one-to-one fuzzy matching, and includes the following steps:
S1, acquiring two unit names to be matched, denoted A and B, and judging whether one contains the other or the two match exactly; if so, directly outputting a fuzzy matching score of 100, otherwise executing step S2.
S2, preprocessing the two unit names respectively, the preprocessing comprising standardization and filtering. The standardization comprises: replacing abbreviations in the unit names with standard words based on an abbreviation library. The filtering comprises: deleting meaningless characters from the unit names, such as "!", ":", "+", "." and the like, and deleting invalid words from the unit names based on an invalid word library.
S3, performing custom word segmentation on the two unit names respectively to obtain the corresponding custom segmented words, which comprise three types: attribute words, region words and alias words. Region words denote the regions that appear in the unit name; attribute words denote the industry type of the unit, for example a securities company or an electric power company; all words other than attribute words and region words are alias words. Before word segmentation, region words are restored: region abbreviations are expanded and province and city word samples are supplemented; region words are then extracted, finally yielding the region word sets at province, city and county level.
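The three-way classification of step S3 can be illustrated with a toy tokenizer. Everything here is a simplifying assumption: names are split on spaces, the lexicons `REGION_WORDS`, `ATTRIBUTE_WORDS` and `CITY_TO_PROVINCE` are hypothetical samples, and region restoration is reduced to adding the province of a recognized city.

```python
from typing import Dict, List

# Hypothetical sample lexicons.
REGION_WORDS = {"guangdong", "guangzhou", "jiangxi"}
ATTRIBUTE_WORDS = {"securities", "futures"}
CITY_TO_PROVINCE = {"guangzhou": "guangdong"}


def segment(name: str) -> Dict[str, List[str]]:
    result: Dict[str, List[str]] = {"region": [], "attribute": [], "alias": []}
    for token in name.lower().split():
        if token in REGION_WORDS:
            # Region restoration: supplement the province for a known city.
            parent = CITY_TO_PROVINCE.get(token)
            if parent and parent not in result["region"]:
                result["region"].append(parent)
            if token not in result["region"]:
                result["region"].append(token)
        elif token in ATTRIBUTE_WORDS:
            result["attribute"].append(token)
        else:
            # Anything that is neither a region nor an attribute is an alias word.
            result["alias"].append(token)
    return result
```

Chinese unit names have no spaces, so a production segmenter would match lexicon entries inside the string instead of splitting on whitespace.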
S4, judging, based on the custom word segmentation results, whether the fuzzy matching calculation can be carried out directly; if so, executing step S6, otherwise executing step S5. Specifically, the fuzzy matching calculation can be carried out directly when either of the following conditions holds: the word segmentation results of both unit names contain region words, and the region words differ; or the word segmentation result of one unit name contains alias words, and that of the other does not.
S5, performing a direct fuzzy matching judgment on the two unit names; if the direct judgment condition is met, directly outputting a fuzzy matching score of 100, otherwise executing step S6. Step S5 specifically comprises the following steps:
S51, if the word segmentation results of both unit names contain alias words and the alias words differ, executing step S52, otherwise executing step S53;
S52, ES word segmentation judgment: segmenting the two unit names with the IK word segmentation method of the ES word library to obtain two sets of segmented words, denoting the smaller set Aset and the larger set Bset; if every word in Aset is contained in Bset, directly outputting a fuzzy matching score of 100; otherwise counting the words common to the two sets as Cnt and the total number of words in Aset as Cnaset, and directly outputting a fuzzy matching score of 100 if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, otherwise executing step S53, where esLen and esPct are preset thresholds;
S53, common word judgment: denoting the unit name with fewer characters A and the unit name with more characters B, traversing each character of A against B in order to obtain the longest matched character string, called the common word C; if the common word exactly matches one of the unit names and is longer than 3 characters, directly outputting a fuzzy matching score of 100; otherwise performing custom word segmentation again on the common word C and the unit name B, and if the attribute words in the two segmentation results are the same and the alias words are empty, directly outputting a fuzzy matching score of 100, otherwise executing step S54;
S54, subset judgment: performing a subset judgment on the unit names preprocessed in step S2; if one contains the other, directly outputting a fuzzy matching score of 100, otherwise executing step S55;
S55, custom word segmentation judgment: if the region word sets in the two segmentation results are non-empty and equal, and the alias word sets are non-empty and equal, directly outputting a fuzzy matching score of 100, otherwise executing step S6.
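The common word judgment of step S53 can be sketched as follows. The text does not pin down how the "longest matched character string" is computed; this sketch assumes it is the longest common substring, and it assumes "the alias words are empty" applies to both segmentation results. The function names and the `segment` callable are hypothetical.

```python
from typing import Callable, Dict, List


def longest_common_substring(a: str, b: str) -> str:
    # Dynamic programming over suffix match lengths; keeps the first longest hit.
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def common_word_judgment(name_a: str, name_b: str,
                         segment: Callable[[str], Dict[str, List[str]]]) -> bool:
    # A is the shorter name, B the longer one.
    a, b = sorted((name_a, name_b), key=len)
    c = longest_common_substring(a, b)
    # The common word equals one of the names and is longer than 3 characters.
    if len(c) > 3 and c in (a, b):
        return True
    # Otherwise re-segment C and B and compare attribute and alias words.
    seg_c, seg_b = segment(c), segment(b)
    return (seg_c["attribute"] == seg_b["attribute"]
            and not seg_c["alias"] and not seg_b["alias"])
```

A `True` result corresponds to directly outputting 100; `False` means the flow continues with the subset judgment of step S54.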
S6, calculating the fuzzy matching score of each type of custom segmented word in the two unit names based on the custom word segmentation results, and weighting these scores to obtain the fuzzy matching score of the two unit names. The fuzzy matching score represents the degree to which the two unit names match: the larger the score, the higher the matching degree, and the smaller the score, the lower the matching degree.
The fuzzy matching score of each type of custom segmented word in step S6 is obtained as follows:
for the custom segmented words of category i, a longest-common-subsequence length calculation is applied to the two company names' word sets of that category to obtain the longest common subsequence length maxSubSeq_i(A, B); the fuzzy matching score of the category-i custom segmented words is then calculated by the following equation:
[Figure BDA0003081147450000061: equation image, not reproduced]
where |A|_i denotes the length of the category-i custom segmented words in A, |B|_i denotes the length of the category-i custom segmented words in B, and i ranges over the three categories: attribute words, region words and alias words.
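A per-category score built on the longest common subsequence (LCS) length might look like the sketch below. The LCS routine is the standard dynamic program. The scoring formula in `category_score` is an assumption: the equation image is missing from this text, and the mean of the two coverage ratios was chosen because it agrees with the alias-word score reported in this embodiment's worked example. Treating each category as one concatenated string is also an assumption.

```python
def lcs_length(a: str, b: str) -> int:
    # Classic LCS dynamic program, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def category_score(a: str, b: str) -> float:
    # An empty category on either side scores 0, as in the worked example.
    if not a or not b:
        return 0.0
    lcs = lcs_length(a, b)
    # Assumed formula: mean of LCS/|A|_i and LCS/|B|_i.
    return (lcs / len(a) + lcs / len(b)) / 2
```

With a 6-character and a 2-character alias sharing a 2-character subsequence, the result is (2/6 + 2/2)/2 ≈ 0.6667.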
The fuzzy matching score of the two unit names is obtained by:
[Figure BDA0003081147450000071: equation image, not reproduced]
where mathScore denotes the fuzzy matching score of the two unit names, the operator shown in Figure BDA0003081147450000072 denotes rounding to 2 decimal places, and w_i is the weight of the category-i custom segmented words.
The weights of the region words, attribute words and alias words are initialized according to the business scenario and the desired weighting. In this embodiment the weights are set as follows: region word weight 0.3, attribute word weight 0.15 and alias word weight 0.55 (these initial weights may be adjusted as the situation requires). The initial weights are then adjusted according to the word sets obtained by segmenting A and B. For example, if neither A nor B yields any alias words, the initial alias weight of 0.55 is split equally between the region and attribute weights: the region weight becomes 0.3 + 0.55/2 = 0.575 and the attribute weight becomes 0.15 + 0.55/2 = 0.425.
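The weight adjustment can be sketched as below. The source only describes the case where neither name yields alias words; applying the same equal split to any category missing from both names is a generalization assumed for this sketch, as are the function and key names.

```python
from typing import Dict, List

# Initial weights given in this embodiment.
DEFAULT_WEIGHTS = {"region": 0.3, "attribute": 0.15, "alias": 0.55}


def adjust_weights(seg_a: Dict[str, List[str]], seg_b: Dict[str, List[str]],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> Dict[str, float]:
    # A category counts as missing only when neither name produced words for it.
    missing = [c for c in weights if not seg_a[c] and not seg_b[c]]
    present = [c for c in weights if c not in missing]
    if not missing or not present:
        return dict(weights)
    # Split the weight of the missing categories equally among the others.
    share = sum(weights[c] for c in missing) / len(present)
    return {c: (weights[c] + share if c in present else 0.0) for c in weights}
```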
In this embodiment, "Guangzhou Sports East Road Securities of Huatai Union Securities Limited Liability Company" and "Huatai Futures Limited Company" are taken as an example:
the two company names are segmented respectively, and the custom word segmentation results are as follows:
the custom word segmentation result of "Guangzhou Sports East Road Securities of Huatai Union Securities Limited Liability Company" is ([Guangdong province, Guangzhou city], [sports, securities], [Huatai union east way]), where [Guangdong province, Guangzhou city] is the region word set, [sports, securities] is the attribute word set, and [Huatai union east way] is the alias word set;
the custom word segmentation result of "Huatai Futures Limited Company" is ([], [futures], [Huatai]), where the region word set is empty, [futures] is the attribute word set, and [Huatai] is the alias word set;
after region extraction, the results are:
([Guangdong province], [sports, securities], [Huatai union east way])
([], [futures], [Huatai]);
in step S4 it is found that the fuzzy matching calculation cannot be carried out directly; in step S5 the common word judgment, subset judgment and custom word segmentation judgment are applied in turn, and none of them allows a fuzzy matching score to be output directly; step S6 therefore calculates the fuzzy matching scores, giving:
since the region word set of "Huatai Futures Limited Company" is empty, the fuzzy matching score of the region words is 0;
for [Huatai union east way] and [Huatai], the fuzzy matching score of the alias words is 0.6666665;
for [sports, securities] and [futures], the fuzzy matching score of the attribute words is 0;
weights: region word weight 0.3, alias word weight 0.55, attribute word weight 0.15;
the fuzzy matching score of the two unit names is round(0.55 × 0.6666665 × 100, 2), given as 37.
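The arithmetic of this example can be replayed in a few lines, using the per-category scores and weights stated above. The variable names are illustrative. Rounding to two decimals yields 36.67, and the value 37 given in the text matches this result rounded to a whole number.

```python
# Weights and per-category scores as stated in the worked example.
weights = {"region": 0.3, "alias": 0.55, "attribute": 0.15}
scores = {"region": 0.0, "alias": 0.6666665, "attribute": 0.0}

# Weighted sum scaled to 0..100 and rounded to 2 decimal places.
total = round(sum(weights[c] * scores[c] for c in weights) * 100, 2)
print(total)         # 36.67
print(round(total))  # 37
```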
Example 2
As shown in fig. 2, the unit name searching method based on the fuzzy matching algorithm of the present embodiment includes:
acquiring a unit name to be searched, and determining, through ES fuzzy matching, the candidate name with the highest matching degree in a unit name library;
matching the unit name to be searched against the candidate name using the unit name matching method based on the fuzzy matching algorithm of any one of claims 1 to 7, to determine the fuzzy matching score between the candidate name in the unit name library and the unit name to be searched.
In this embodiment, the unit name matching method based on the fuzzy matching algorithm is completely the same as that in embodiment 1, and is not described in detail in this embodiment.
The following specific examples are illustrative:
Unit name to be searched: the general offices of broadcasting and television in Jiangxi province;
in the unit name library, the type of the name field is set in advance to keyword (exact query). After the ES keyword query, the names with the highest matching scores are obtained, and a one-to-one similarity calculation is then carried out. The matched unit names and their scores are as follows:
Jiangxi province broadcast television bureau (ES matching score: 18.56986), Heilongjiang province broadcast television bureau Shandong province broadcast television bureau (ES matching score: 14.986288), Jiangxi province general workshop (ES matching score: 14.648104), Guangdong province West river basin administration (ES matching score: 13.420963), Jiangxi province Water and Electricity engineering bureau Limited (ES matching score: 12.696965)
The matching name with the highest score, "Jiangxi province broadcast television bureau", is selected, and a one-to-one similarity calculation between the unit name to be searched and this keyword-match hit gives a similarity of 100 points.
The searching method thus achieves one-to-many matching: the unit name to be searched is matched against the unit name library.
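The two-stage search can be sketched without committing to any particular Elasticsearch client. In this sketch `es_candidates` stands for whatever returns (name, ES score) pairs from the unit name library and `match` stands for the one-to-one matching method of Example 1; both are hypothetical callables supplied by the caller.

```python
from typing import Callable, List, Optional, Tuple


def find_unit(
    query: str,
    es_candidates: Callable[[str], List[Tuple[str, float]]],
    match: Callable[[str, str], float],
) -> Optional[Tuple[str, float]]:
    # Stage 1: ES fuzzy matching returns scored candidates from the library.
    candidates = es_candidates(query)
    if not candidates:
        return None
    # Keep the candidate with the highest ES matching score.
    best_name, _ = max(candidates, key=lambda c: c[1])
    # Stage 2: one-to-one fuzzy matching score between query and candidate.
    return best_name, match(query, best_name)
```

The ES score is used only to pick the candidate; the score returned to the caller comes from the one-to-one matching method.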
Example 3
The present embodiment provides a unit name matching device based on the fuzzy matching algorithm based on embodiment 1, which includes a memory for storing a computer program and a processor for implementing the unit name matching method based on the fuzzy matching algorithm in embodiment 1 when executing the computer program.
Example 4
The present embodiment provides a unit name lookup apparatus based on fuzzy matching algorithm based on embodiment 2, which includes a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for implementing the unit name lookup method based on fuzzy matching algorithm in embodiment 2 when executing the computer program.
Through its custom fuzzy matching technique, the invention improves the accuracy of matching a bank's high-quality customer groups. It suits both the business scenario in which a single unit name is input for matching and the scenario in which two unit names are input and matched against each other. The fuzzy matching scenarios are more flexible and the fuzzy matching results more accurate. The invention also provides a custom word segmentation technique covering regions, attributes and aliases, with a score and a weight for each type of segmented word; the weights can be customized, which gives the business more control.
The above embodiments are merely examples and do not limit the scope of the present invention. These embodiments may be implemented in other various manners, and various omissions, substitutions, and changes may be made without departing from the technical spirit of the present invention.

Claims (10)

1. A unit name matching method based on fuzzy matching algorithm is characterized by comprising the following steps:
S1, acquiring two unit names to be matched and judging whether one contains the other or the two match exactly; if so, directly outputting a fuzzy matching score of 100, otherwise executing step S2;
S2, preprocessing the two unit names respectively, the preprocessing comprising standardization and filtering;
S3, performing custom word segmentation on the two unit names respectively to obtain the corresponding custom segmented words, which comprise three types: attribute words, region words and alias words;
S4, judging, based on the custom word segmentation results, whether the fuzzy matching calculation can be carried out directly; if so, executing step S6, otherwise executing step S5;
S5, performing a direct fuzzy matching judgment on the two unit names; if the direct judgment condition is met, directly outputting a fuzzy matching score of 100, otherwise executing step S6;
S6, calculating the fuzzy matching score of each type of custom segmented word in the two unit names based on the custom word segmentation results, and weighting these scores to obtain the fuzzy matching score of the two unit names.
2. The unit name matching method based on fuzzy matching algorithm as claimed in claim 1, wherein the normalizing process in step S2 comprises: replacing the acronyms in the unit names with standard words based on the acronym library.
3. The unit name matching method based on fuzzy matching algorithm as claimed in claim 1, wherein the filtering process in step S2 comprises: and deleting meaningless characters in the unit name, and deleting invalid words in the unit name based on the invalid word library.
4. The unit name matching method based on the fuzzy matching algorithm as claimed in claim 1, wherein the conditions under which step S4 determines that the fuzzy matching calculation can be performed directly include any one of the following:
the word segmentation results of both unit names contain region words, and the region words differ;
the word segmentation result of one unit name contains an individual noun, and the word segmentation result of the other unit name does not.
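The two conditions of claim 4 can be sketched as a small predicate. The dictionary keys `"region"` and `"noun"` are hypothetical names for the region-word and individual-noun categories, and comparing region words as sets is an assumption about what "differ" means here.

```python
from typing import Dict, List

Segmentation = Dict[str, List[str]]  # hypothetical: category -> words


def can_score_directly(seg_a: Segmentation, seg_b: Segmentation) -> bool:
    regions_a = set(seg_a.get("region", []))
    regions_b = set(seg_b.get("region", []))
    # Condition 1: both names have region words and they differ.
    if regions_a and regions_b and regions_a != regions_b:
        return True
    # Condition 2: exactly one of the names has an individual noun.
    has_noun_a = bool(seg_a.get("noun"))
    has_noun_b = bool(seg_b.get("noun"))
    return has_noun_a != has_noun_b
```

When either condition holds, the flow goes straight to the weighted scoring of step S6 instead of the direct-judgment checks of step S5.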
5. The unit name matching method based on the fuzzy matching algorithm as claimed in claim 1, wherein the fuzzy matching direct judgment of step S5 specifically comprises:
S51, if the word segmentation results of the two unit names both contain individual nouns and the individual nouns differ, executing step S52; otherwise, executing step S53;
S52, ES word segmentation judgment: segmenting each of the two unit names with the IK tokenizer bundled with the ES lexicon to obtain two word sets, denoting the shorter word set as Aset and the longer word set as Bset; if every word in Aset is contained in Bset, directly outputting a fuzzy matching score of 100; otherwise, counting the words common to the two sets as Cnt and the total number of words in Aset as Cnaset; if Cnaset is greater than or equal to esLen and Cnt/Cnaset is greater than or equal to esPct, directly outputting a fuzzy matching score of 100; otherwise, executing step S53, where esLen and esPct are preset thresholds;
S53, common word judgment: denoting the unit name with fewer characters as A and the unit name with more characters as B, and traversing each character of A in turn against B to obtain the longest matched character string, called the common word C; if the common word is identical to one of the unit names and is longer than 3 characters, directly outputting a fuzzy matching score of 100; otherwise, performing custom word segmentation again on the common word C and on the unit name B; if the attribute words in the two segmentation results are the same and the individual nouns are empty, directly outputting a fuzzy matching score of 100; otherwise, executing step S54;
S54, subset judgment: performing a subset judgment on the unit names preprocessed in step S2; if one contains the other, directly outputting a fuzzy matching score of 100; otherwise, executing step S55;
S55, custom word segmentation judgment: if the region word sets in the two segmentation results are both non-empty and equal, and the individual noun sets are both non-empty and equal, directly outputting a fuzzy matching score of 100; otherwise, executing step S6.
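Two of the checks above, the threshold test of step S52 and the first half of the common word test of step S53, can be sketched as follows. This is a hedged illustration: the tokens are assumed to have been produced already (the IK tokenizer call is not reproduced), the token collections are treated as sets, and the "longest matched character string" is interpreted as the longest common substring. All function names are hypothetical.

```python
from typing import Iterable


def es_token_judgment(tokens_a: Iterable[str], tokens_b: Iterable[str],
                      es_len: int, es_pct: float) -> bool:
    # Aset is the smaller token set, Bset the larger one.
    a_set, b_set = sorted((set(tokens_a), set(tokens_b)), key=len)
    if a_set <= b_set:  # every word of Aset is contained in Bset
        return True
    cnt = len(a_set & b_set)   # Cnt: words shared by both sets
    cn_aset = len(a_set)       # Cnaset: number of words in Aset
    return cn_aset >= es_len and cnt / cn_aset >= es_pct


def longest_common_substring(a: str, b: str) -> str:
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def common_word_full_match(name_1: str, name_2: str) -> bool:
    # A is the name with fewer characters, B the one with more.
    a, b = sorted((name_1, name_2), key=len)
    c = longest_common_substring(a, b)  # the common word C
    return len(c) > 3 and c in (a, b)
```

The second half of step S53 (re-segmenting C and B and comparing attribute words) depends on the custom segmenter and is therefore omitted from this sketch.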
6. The unit name matching method based on the fuzzy matching algorithm as claimed in claim 1, wherein the fuzzy matching score of each type of custom segmented word in step S6 is obtained as follows:
denoting the two unit names as A and B; for the custom segmented words of category i, applying a longest common subsequence length algorithm to the sets of category-i custom segmented words of the two names to obtain the longest common subsequence length $\mathrm{maxSubSeq}_i(A, B)$, and then calculating the fuzzy matching score $score_i$ of the category-i custom segmented words by the following formula:
Figure FDA0003081147440000021
where $|A|_i$ denotes the length of the category-i custom segmented words in A, $|B|_i$ denotes the length of the category-i custom segmented words in B, and i ranges over the three categories: attribute words, region words, and individual nouns.
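The longest common subsequence length used in claim 6 can be computed with the standard dynamic programming recurrence, sketched below. The function name `lcs_length` is hypothetical, and the sketch stops at the length itself: the score formula appears only as an image in the source, so it is not reproduced here. Whether the comparison runs over characters or over whole words is also not stated, so the function accepts any sequence.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

Unlike the longest common substring, a subsequence need not be contiguous, so inserted characters in one name do not break the match.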
7. The unit name matching method based on the fuzzy matching algorithm as claimed in claim 6, wherein the fuzzy matching score of the two unit names is obtained by the following formula:
Figure FDA0003081147440000031
where mathScore denotes the fuzzy matching score of the two unit names, the operator shown in Figure FDA0003081147440000032 denotes keeping 2 decimal places, and $w_i$ is the weight of the category-i custom segmented words.
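The weighting step can be illustrated as below. Because the formula itself is only available as an image, this sketch assumes a plain weighted sum of the per-category scores, and it assumes that "keeping 2 decimal places" means rounding; the category keys and the weights in the usage example are hypothetical.

```python
from typing import Dict


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    # Assumed form: sum over categories of w_i * score_i, kept to 2 decimals.
    total = sum(weights[category] * scores[category] for category in weights)
    return round(total, 2)


# Hypothetical per-category scores and weights.
example = combine_scores(
    {"attribute": 80.0, "region": 100.0, "noun": 50.0},
    {"attribute": 0.25, "region": 0.25, "noun": 0.5},
)
```

Keeping the weights in a separate mapping makes it easy to tune how much each category contributes.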
8. A unit name searching method based on a fuzzy matching algorithm, characterized by comprising the following steps:
acquiring a unit name to be searched, and determining, through ES fuzzy matching, the candidate name with the highest matching degree in a unit name lexicon;
matching the unit name to be searched against the candidate name using the unit name matching method based on the fuzzy matching algorithm of any one of claims 1 to 7, to determine the fuzzy matching score between the candidate name in the unit name lexicon and the unit name to be searched.
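The two-stage search of claim 8 (retrieve the best candidate, then score it) might be outlined as follows. This is a sketch under stated assumptions: the ES fuzzy retrieval is replaced by a stand-in built on Python's standard `difflib.get_close_matches`, the lexicon entries are hypothetical, and the matcher is injected as a callable rather than implemented here.

```python
import difflib
from typing import Callable, List, Optional, Tuple

# Hypothetical unit name lexicon.
LEXICON: List[str] = ["交通银行股份有限公司", "中国银行股份有限公司"]


def retrieve_with_difflib(query: str) -> Optional[str]:
    # Stand-in for ES fuzzy matching: return the single closest entry.
    hits = difflib.get_close_matches(query, LEXICON, n=1, cutoff=0.0)
    return hits[0] if hits else None


def search_unit_name(
    query: str,
    retrieve_best: Callable[[str], Optional[str]],
    match_score: Callable[[str, str], float],
) -> Optional[Tuple[str, float]]:
    # Stage 1: pick the candidate with the highest matching degree.
    candidate = retrieve_best(query)
    if candidate is None:
        return None
    # Stage 2: score the candidate with the unit name matching method.
    return candidate, match_score(query, candidate)
```

Separating retrieval from scoring lets a fast index narrow the lexicon to one candidate before the more detailed matching runs.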
9. A unit name matching device based on a fuzzy matching algorithm, characterized by comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to implement the unit name matching method based on the fuzzy matching algorithm according to any one of claims 1 to 7 when executing the computer program.
10. A unit name searching device based on a fuzzy matching algorithm, characterized by comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to implement the unit name searching method based on the fuzzy matching algorithm of claim 8 when executing the computer program.
CN202110566536.XA 2021-05-24 2021-05-24 Unit name matching and searching method and device based on fuzzy matching algorithm Active CN113268986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110566536.XA CN113268986B (en) 2021-05-24 2021-05-24 Unit name matching and searching method and device based on fuzzy matching algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110566536.XA CN113268986B (en) 2021-05-24 2021-05-24 Unit name matching and searching method and device based on fuzzy matching algorithm

Publications (2)

Publication Number Publication Date
CN113268986A (en) 2021-08-17
CN113268986B (en) 2024-05-24

Family

ID=77232606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110566536.XA Active CN113268986B (en) 2021-05-24 2021-05-24 Unit name matching and searching method and device based on fuzzy matching algorithm

Country Status (1)

Country Link
CN (1) CN113268986B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102651013A (en) * 2012-03-23 2012-08-29 上海安捷力信息***有限公司 Method and system for extracting area information from enterprise name data
US20140052688A1 (en) * 2012-08-17 2014-02-20 Opera Solutions, Llc System and Method for Matching Data Using Probabilistic Modeling Techniques
CN107644010A (en) * 2016-07-20 2018-01-30 阿里巴巴集团控股有限公司 A kind of Text similarity computing method and device
US20190251085A1 (en) * 2016-11-25 2019-08-15 Alibaba Group Holding Limited Method and apparatus for matching names
CN108829661A (en) * 2018-05-09 2018-11-16 成都信息工程大学 A kind of subject of news title extracting method based on fuzzy matching
CN109657213A (en) * 2018-12-21 2019-04-19 北京金山安全软件有限公司 Text similarity detection method and device and electronic equipment
CN110046341A (en) * 2018-12-29 2019-07-23 ***股份有限公司 For carrying out matched method and system to information
US20200327136A1 (en) * 2019-04-14 2020-10-15 EverString Innovation Technology Automated company matching
CN110309504A (en) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 Text handling method, device, equipment and storage medium based on participle
US20210089562A1 (en) * 2019-09-20 2021-03-25 Jpmorgan Chase Bank, N.A. System and method for generating and implementing context weighted words
CN111104795A (en) * 2019-11-19 2020-05-05 平安金融管理学院(中国·深圳) Company name matching method and device, computer equipment and storage medium
CN111859042A (en) * 2020-07-30 2020-10-30 上海妙一生物科技有限公司 Retrieval method and device and electronic equipment
CN112364635A (en) * 2020-11-30 2021-02-12 中国银行股份有限公司 Enterprise name duplication checking method and device
CN112749255A (en) * 2020-12-30 2021-05-04 科大国创云网科技有限公司 Human-computer interaction semantic recognition intention matching method and system based on ES

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
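乔世权; 戴继勇: "Research on an intelligent directory enquiry engine based on text similarity", Journal of Hebei University of Science and Technology, vol. 39, no. 3, pages 282-288 *
段晨迪: "Design and implementation of a MOOC-oriented vertical search engine based on ElasticSearch", China Master's Theses Full-text Database, Information Science and Technology, no. 1, pages 1138-2507 *
连誉舜: "Design and implementation of a Chinese organization name retrieval ***", China Master's Theses Full-text Database, Information Science and Technology, no. 4, pages 138-1519 *
钟良伍, 郑方: "Research on retrieval methods based on abbreviations of Chinese organization names", Journal of Chinese Information Processing, vol. 21, no. 1, pages 38-42 *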

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905473A (en) * 2022-12-20 2023-04-04 无锡锡银金科信息技术有限责任公司 Full noun fuzzy matching method, device and storage medium
CN115905473B (en) * 2022-12-20 2024-03-05 无锡锡银金科信息技术有限责任公司 Full noun fuzzy matching method, device and storage medium

Also Published As

Publication number Publication date
CN113268986B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN111783419B (en) Address similarity calculation method, device, equipment and storage medium
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
US8812300B2 (en) Identifying related names
US7424421B2 (en) Word collection method and system for use in word-breaking
CN107862070B (en) Online classroom discussion short text instant grouping method and system based on text clustering
US8316041B1 (en) Generation and processing of numerical identifiers
CN103106287B (en) A kind of processing method and system of user search sentence
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN111259151A (en) Method and device for recognizing mixed text sensitive word variants
CN111899890A (en) Medical data similarity detection system and method based on bit string Hash
CN110413998B (en) Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
CN110990676A (en) Social media hotspot topic extraction method and system
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
CN112989414A (en) Mobile service data desensitization rule generation method based on width learning
CN111061873B (en) Multi-channel text classification method based on Attention mechanism
CN113268986B (en) Unit name matching and searching method and device based on fuzzy matching algorithm
CN111753547A (en) Keyword extraction method and system for sensitive data leakage detection
CN111597423A (en) Performance evaluation method and device of interpretable method of text classification model
CN114943285B (en) Intelligent auditing system for internet news content data
CN108172304B (en) Medical information visualization processing method and system based on user medical feedback
CN107291952B (en) Method and device for extracting meaningful strings
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method
CN113657443A (en) Online Internet of things equipment identification method based on SOINN network
TWI534640B (en) Chinese network information monitoring and analysis system and its method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant