CN113268986A - Unit name matching and searching method and device based on fuzzy matching algorithm - Google Patents
- Publication number
- CN113268986A (application number CN202110566536.XA)
- Authority
- CN
- China
- Prior art keywords
- fuzzy matching
- unit
- words
- unit name
- word segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a unit name matching and searching method and device based on a fuzzy matching algorithm. Custom word segmentation is performed on the two unit names to be matched to obtain custom segmented words of three types: attribute words, region words and alias words. A fuzzy matching score is then calculated for each type of custom segmented word in the two unit names, and these per-type scores are weighted to obtain the fuzzy matching score of the two unit names. Compared with the prior art, the method matches and searches unit names more accurately.
Description
Technical Field
The invention relates to a unit name matching and searching method and device, in particular to a unit name matching and searching method and device based on a fuzzy matching algorithm.
Background
In the information technology era, industries of all kinds, and the financial industry in particular, have an increasingly urgent need to mine customer information: the information must be more accurate and the mining must be faster. To meet this need, the ES (Elasticsearch) database was introduced. It is a distributed database that provides scalable, near-real-time search and supports both exact and fuzzy queries over large amounts of data.
ES fuzzy query is a fast fuzzy-matching query method built on the word segmentation that ships with Elasticsearch. Elasticsearch is a distributed, multi-user full-text search engine that indexes data as JSON over HTTP. ES can be used for full-text retrieval, structured search and data analysis. Using the built-in ES lexicon and word segmentation method, similar data in a database table can be queried quickly.
However, the native word segmentation method supports only a single approach and a single application scenario, so it cannot serve a business scenario in which several unit names are fuzzy matched and a similarity is obtained. The native ES fuzzy query also depends on a word lexicon, so it cannot deliver the accuracy that fuzzy matching of unit names requires. In the prior art, unit names are searched by segmenting the unit name to be fuzzy matched directly with the IK segmenter of Elasticsearch. When an application calls the native ES fuzzy matching method, fuzzy matching and querying are performed by sending a JSON message over HTTP to the ES database. This technique, however, only performs IK segmentation on the input unit name, splits it into index terms, and then locates the matching names in the database by those terms. The matching method is simple, and for unit names that follow certain specific rules mismatches occur, specifically:
1. The unit name contains an abbreviation, which ES cannot identify directly.
2. The unit name contains a region name, but ES cannot identify the region, so unit names outside that region may be returned.
3. The unit name contains an alias field, but ES cannot directly pick out the unit name by its alias.
4. Because user input is unpredictable, the unit name may contain many invalid characters such as brackets and dots, which ES cannot recognize.
5. There is no concept of weight: every matched index term in ES carries the same weight, and the weights of different segmented words cannot be adjusted dynamically.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a unit name matching and searching method and device based on a fuzzy matching algorithm.
The purpose of the invention can be realized by the following technical scheme:
A unit name matching method based on a fuzzy matching algorithm includes the following steps:
S1, acquiring two unit names to be matched and judging whether one contains the other or whether they match exactly; if so, directly outputting a fuzzy matching score of 100; otherwise, executing step S2;
S2, preprocessing each of the two unit names, the preprocessing including standardization processing and filtering processing;
S3, performing custom word segmentation on each of the two unit names to obtain the corresponding custom segmented words, which are of three types: attribute words, region words and alias words;
S4, judging, based on the custom word segmentation results, whether the fuzzy matching calculation can be performed directly; if so, executing step S6; otherwise, executing step S5;
S5, performing a direct fuzzy matching judgment on the two unit names; if the direct judgment condition is met, directly outputting a fuzzy matching score of 100; otherwise, executing step S6;
S6, calculating, based on the custom word segmentation results, a fuzzy matching score for each type of custom segmented word in the two unit names, and weighting these per-type scores to obtain the fuzzy matching score of the two unit names.
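The control flow of steps S1 to S6 can be sketched as follows. This is only an illustrative skeleton: the function name `match_unit_names` and the five callables passed to it (`preprocess`, `segment`, `can_score_directly`, `direct_judgment`, `weighted_score`) are hypothetical placeholders for the individual steps, not names taken from the patent.

```python
from typing import Callable, Dict, List

# Hypothetical container for a segmentation result:
# keys "region", "attribute" and "alias", each mapping to a list of words.
Segments = Dict[str, List[str]]


def match_unit_names(
    a: str,
    b: str,
    preprocess: Callable[[str], str],
    segment: Callable[[str], Segments],
    can_score_directly: Callable[[Segments, Segments], bool],
    direct_judgment: Callable[[str, str, Segments, Segments], bool],
    weighted_score: Callable[[Segments, Segments], float],
) -> float:
    """Skeleton of steps S1-S6; each callable stands in for one step."""
    # S1: exact match or mutual containment short-circuits to 100.
    if a == b or a in b or b in a:
        return 100.0
    # S2: standardization and filtering.
    a_clean, b_clean = preprocess(a), preprocess(b)
    # S3: custom word segmentation into region / attribute / alias words.
    seg_a, seg_b = segment(a_clean), segment(b_clean)
    # S4: if the S4 condition holds, skip the direct judgment of S5.
    if not can_score_directly(seg_a, seg_b):
        # S5: a successful direct judgment outputs 100 immediately.
        if direct_judgment(a_clean, b_clean, seg_a, seg_b):
            return 100.0
    # S6: per-category scores combined by weights.
    return weighted_score(seg_a, seg_b)
```

Passing the steps in as callables keeps the sketch independent of any particular lexicon or segmenter.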
Preferably, the standardization processing in step S2 includes: replacing abbreviations in the unit names with standard words based on an abbreviation lexicon.
Preferably, the filtering processing in step S2 includes: deleting meaningless characters from the unit name, and deleting invalid words from the unit name based on an invalid word lexicon.
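A minimal sketch of the preprocessing in step S2, assuming small in-memory lexicons. The dictionary `ABBREVIATIONS`, the set `INVALID_WORDS`, the character class in `MEANINGLESS_CHARS` and the sample name are all hypothetical; the patent does not list the contents of its lexicons.

```python
import re

# Hypothetical lexicons; a real deployment would load these from storage.
ABBREVIATIONS = {"Intl": "International"}
INVALID_WORDS = {"HeadOffice"}
# Hypothetical set of "meaningless" characters: punctuation and brackets.
MEANINGLESS_CHARS = re.compile(r"[!:+.()\[\]（）]")


def standardize(name: str) -> str:
    """Replace each known abbreviation with its standard word."""
    # Longest abbreviations first, so a short key never clobbers a longer one.
    for short in sorted(ABBREVIATIONS, key=len, reverse=True):
        name = name.replace(short, ABBREVIATIONS[short])
    return name


def filter_name(name: str) -> str:
    """Drop meaningless characters, then words from the invalid word lexicon."""
    name = MEANINGLESS_CHARS.sub("", name)
    for word in INVALID_WORDS:
        name = name.replace(word, "")
    return name.strip()


def preprocess(name: str) -> str:
    """Step S2: standardization followed by filtering."""
    return filter_name(standardize(name))


cleaned = preprocess("Intl Trade Co. (HeadOffice)!")
```

Standardization runs before filtering here so that abbreviations containing punctuation could still be recognized; the patent does not state the order, so this is a design choice of the sketch.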
Preferably, in step S4 the fuzzy matching calculation is judged to be directly performable if either of the following conditions holds:
the word segmentation results of both unit names contain region words, and the region words differ;
the word segmentation result of one unit name contains alias words, and the word segmentation result of the other does not.
Preferably, the direct fuzzy matching judgment of step S5 specifically includes:
S51, if the word segmentation results of both unit names contain alias words and the alias words differ, executing step S52; otherwise executing step S53;
S52, ES word segmentation judgment: segmenting each of the two unit names with the IK segmenter and the ES lexicon to obtain two segmentation sets, and denoting the shorter set Aset and the longer set Bset; if every word in Aset is contained in Bset, directly outputting a fuzzy matching score of 100; otherwise, counting the words shared by the two sets as Cnt and the total number of words in Aset as Cnaset; if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, directly outputting a fuzzy matching score of 100; otherwise executing step S53, where esLen and esPct are preset thresholds;
S53, common word judgment: denoting the unit name with fewer characters A and the unit name with more characters B, matching each character of A against B in turn to obtain the longest matched character string, called the common word C; if the common word exactly matches one of the unit names and is longer than 3 characters, directly outputting a fuzzy matching score of 100; otherwise, performing custom word segmentation again on the common word C and the unit name B, and if the attribute words in the two segmentation results are the same and the alias words are empty, directly outputting a fuzzy matching score of 100; otherwise executing step S54;
S54, subset judgment: performing a subset judgment on the unit names preprocessed in step S2; if one contains the other, directly outputting a fuzzy matching score of 100; otherwise executing step S55;
S55, custom word segmentation judgment: if the region word sets in the two word segmentation results are non-empty and equal, and the alias word sets are non-empty and equal, directly outputting a fuzzy matching score of 100; otherwise executing step S6.
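Two of the judgments above, S52 and the first half of S53, can be sketched as follows. Several points are assumptions: the IK segmentation is not performed here, so the token lists are taken as given inputs; the default thresholds `es_len=3` and `es_pct=0.8` are hypothetical values for esLen and esPct; and the "longest matched character string" of S53 is read as the longest common substring, which is one plausible interpretation of the text. The re-segmentation branch of S53 is omitted.

```python
from typing import Iterable


def es_token_judgment(tokens_a: Iterable[str], tokens_b: Iterable[str],
                      es_len: int = 3, es_pct: float = 0.8) -> bool:
    """S52: return True when a score of 100 would be output directly."""
    # Aset is the smaller token set, Bset the larger one.
    a_set, b_set = sorted((set(tokens_a), set(tokens_b)), key=len)
    if not a_set:
        return False  # nothing to compare
    if a_set <= b_set:
        return True  # every word of Aset is contained in Bset
    cnt = len(a_set & b_set)   # Cnt: words shared by both sets
    cn_aset = len(a_set)       # Cnaset: total number of words in Aset
    return cn_aset >= es_len and cnt / cn_aset >= es_pct


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def common_word_judgment(name_a: str, name_b: str) -> bool:
    """First half of S53: the common word equals one name and exceeds 3 characters."""
    shorter, longer = sorted((name_a, name_b), key=len)
    common = longest_common_substring(shorter, longer)
    return len(common) > 3 and common in (shorter, longer)
```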
Preferably, the fuzzy matching score of each type of custom segmented word in step S6 is obtained as follows:
Denote the two unit names as A and B. For the custom segmented words of category i, the longest common subsequence length maxSubSeq_i(A, B) is calculated, by a longest common subsequence length calculation method, over the sets of custom segmented words of that category in the two unit names; the fuzzy matching score score_i of the category-i custom segmented words is then calculated by the following equation:
(equation not reproduced in this text)
where |A|_i denotes the length of the category-i custom segmented words in A, |B|_i denotes the length of the category-i custom segmented words in B, and i ranges over the three categories: attribute words, region words and alias words.
Preferably, the fuzzy matching score of the two unit names is obtained by the following equation:
(equation not reproduced in this text)
where mathScore denotes the fuzzy matching score of the two unit names, round(·, 2) denotes rounding to 2 decimal places, and w_i is the weight of the category-i custom segmented words.
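The two equations for this passage are not reproduced in the text, so the following sketch rests on stated assumptions rather than on the patent's exact formulas. It assumes the per-category score is the average of maxSubSeq_i/|A|_i and maxSubSeq_i/|B|_i, a form chosen because it gives about 0.667 when a two-character alias is fully contained in a six-character alias, which is in line with the alias score of 0.6666665 reported in Example 1. It further assumes the overall score is the weighted sum of the category scores, scaled to 100 and rounded to 2 decimal places. The function names are hypothetical.

```python
from typing import Dict, List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def category_score(a_words: List[str], b_words: List[str]) -> float:
    """Assumed form: mean of LCS/|A|_i and LCS/|B|_i; 0 if either side is empty."""
    a, b = "".join(a_words), "".join(b_words)
    if not a or not b:
        return 0.0
    common = lcs_length(a, b)
    return (common / len(a) + common / len(b)) / 2


def match_score(seg_a: Dict[str, List[str]], seg_b: Dict[str, List[str]],
                weights: Dict[str, float]) -> float:
    """Assumed form: round(100 * sum_i w_i * score_i, 2)."""
    total = sum(w * category_score(seg_a[c], seg_b[c]) for c, w in weights.items())
    return round(total * 100, 2)
```

With this form a category scores 1.0 only when the two word strings are identical, and 0 when they share no characters.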
A unit name searching method based on a fuzzy matching algorithm includes:
acquiring a unit name to be searched, and determining, through ES fuzzy matching, the candidate name with the highest matching degree in a unit name lexicon;
matching the unit name to be searched against the candidate name with the unit name matching method based on the fuzzy matching algorithm of any one of claims 1 to 7, to determine the fuzzy matching score between the candidate name in the unit name lexicon and the unit name to be searched.
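The two-stage search can be sketched as below. To avoid guessing at any Elasticsearch client API, the sketch takes the ES hits as an already-retrieved list of (name, ES score) pairs; `search_unit_name`, `es_hits` and `matcher` are hypothetical names, and `matcher` stands in for the one-to-one matching method.

```python
from typing import Callable, List, Optional, Tuple


def search_unit_name(
    query: str,
    es_hits: List[Tuple[str, float]],
    matcher: Callable[[str, str], float],
) -> Optional[Tuple[str, float]]:
    """Pick the ES hit with the highest ES score, then re-score it one-to-one."""
    if not es_hits:
        return None
    # Stage 1: the candidate with the highest ES matching score.
    best_name, _es_score = max(es_hits, key=lambda hit: hit[1])
    # Stage 2: one-to-one fuzzy matching between the query and that candidate.
    return best_name, matcher(query, best_name)


# Toy usage with a stand-in matcher that only recognizes exact equality.
hits = [("Bureau A", 18.5), ("Bureau B", 14.9)]
result = search_unit_name("Bureau A", hits, lambda q, c: 100.0 if q == c else 0.0)
```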
A unit name matching device based on a fuzzy matching algorithm comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the unit name matching method based on the fuzzy matching algorithm when the computer program is executed.
A unit name searching device based on a fuzzy matching algorithm comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the unit name searching method based on the fuzzy matching algorithm when the computer program is executed.
Compared with the prior art, the invention has the following advantages: the matching and searching method is more precise and its results are more accurate and reliable, specifically:
(1) where a unit name contains an abbreviation, the method can identify and match the abbreviated unit name;
(2) where a unit name contains a region name, the method can distinguish the region name in the unit name to be matched and match according to the actual region;
(3) where a unit name contains an alias field, the method can identify the alias and match on it;
(4) the method filters out useless words before matching, which improves matching accuracy;
(5) the unit name to be matched is split into region, attribute and alias words, and the matching weight of each category can be set independently, which makes configuration more flexible.
Drawings
FIG. 1 is a block diagram of a unit name matching method based on fuzzy matching algorithm according to the present invention;
FIG. 2 is a block diagram of a unit name search method based on fuzzy matching algorithm according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The following description of the embodiments is merely illustrative; it is not intended to limit the invention, its applications or its uses, and the invention is not limited to the following embodiments.
Example 1
Embodiment 1 provides a unit name matching method based on a fuzzy matching algorithm, which implements one-to-one fuzzy matching and includes the following steps:
S1, acquiring two unit names to be matched, denoted A and B, and judging whether one contains the other or whether they match exactly; if so, directly outputting a fuzzy matching score of 100; otherwise, executing step S2.
S2, preprocessing each of the two unit names, the preprocessing including standardization processing and filtering processing. The standardization processing includes replacing abbreviations in the unit names with standard words based on an abbreviation lexicon. The filtering processing includes deleting meaningless characters from the unit name (punctuation such as "!", ":", "+" and "." and the like) and deleting invalid words from the unit name based on an invalid word lexicon.
S3, performing custom word segmentation on each of the two unit names to obtain the corresponding custom segmented words, which are of three types: attribute words, region words and alias words. Region words represent the region indicated in the unit name; attribute words represent the industry type of the unit, for example a securities company or an electric power company; all words other than attribute words and region words are alias words. Before word segmentation, the region words are restored: region abbreviations and province and city word samples are supplemented, region words are extracted, and finally the region word sets for province, city and county are obtained.
S4, judging, based on the custom word segmentation results, whether the fuzzy matching calculation can be performed directly; if so, executing step S6; otherwise, executing step S5. Specifically, the fuzzy matching calculation is judged to be directly performable if either of the following conditions holds: the word segmentation results of both unit names contain region words, and the region words differ; or the word segmentation result of one unit name contains alias words and the word segmentation result of the other does not.
S5, performing a direct fuzzy matching judgment on the two unit names; if the direct judgment condition is met, directly outputting a fuzzy matching score of 100; otherwise, executing step S6. Step S5 specifically includes:
S51, if the word segmentation results of both unit names contain alias words and the alias words differ, executing step S52; otherwise executing step S53;
S52, ES word segmentation judgment: segmenting each of the two unit names with the IK segmenter and the ES lexicon to obtain two segmentation sets, and denoting the shorter set Aset and the longer set Bset; if every word in Aset is contained in Bset, directly outputting a fuzzy matching score of 100; otherwise, counting the words shared by the two sets as Cnt and the total number of words in Aset as Cnaset; if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, directly outputting a fuzzy matching score of 100; otherwise executing step S53, where esLen and esPct are preset thresholds;
S53, common word judgment: denoting the unit name with fewer characters A and the unit name with more characters B, matching each character of A against B in turn to obtain the longest matched character string, called the common word C; if the common word exactly matches one of the unit names and is longer than 3 characters, directly outputting a fuzzy matching score of 100; otherwise, performing custom word segmentation again on the common word C and the unit name B, and if the attribute words in the two segmentation results are the same and the alias words are empty, directly outputting a fuzzy matching score of 100; otherwise executing step S54;
S54, subset judgment: performing a subset judgment on the unit names preprocessed in step S2; if one contains the other, directly outputting a fuzzy matching score of 100; otherwise executing step S55;
S55, custom word segmentation judgment: if the region word sets in the two word segmentation results are non-empty and equal, and the alias word sets are non-empty and equal, directly outputting a fuzzy matching score of 100; otherwise executing step S6.
S6, calculating, based on the custom word segmentation results, a fuzzy matching score for each type of custom segmented word in the two unit names, and weighting these per-type scores to obtain the fuzzy matching score of the two unit names. The fuzzy matching score represents the degree to which the two unit names match: the larger the score, the better the match, and the smaller the score, the poorer the match.
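The custom word segmentation of step S3 can be illustrated with a simple lexicon lookup. This is a sketch under assumptions, not the patent's segmenter: `custom_segment`, the two lexicons and the sample name are hypothetical, region restoration (supplementing province and city for abbreviations) is not modelled, and whatever text remains after removing region and attribute words is treated as a single alias word.

```python
from typing import Dict, Iterable, List


def custom_segment(name: str, region_words: Iterable[str],
                   attribute_words: Iterable[str]) -> Dict[str, List[str]]:
    """Split a preprocessed name into region, attribute and alias words."""
    result: Dict[str, List[str]] = {"region": [], "attribute": [], "alias": []}
    rest = name
    for category, lexicon in (("region", region_words),
                              ("attribute", attribute_words)):
        # Longest entries first, so a longer lexicon word wins over its prefix.
        for word in sorted(lexicon, key=len, reverse=True):
            if word in rest:
                result[category].append(word)
                rest = rest.replace(word, "")
    # Everything that is neither a region word nor an attribute word is alias.
    if rest:
        result["alias"].append(rest)
    return result


segments = custom_segment("GuangzhouHuataiSecurities",
                          {"Guangzhou"}, {"Securities", "Futures"})
```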
The fuzzy matching score of each type of custom segmented word in step S6 is obtained as follows:
For the custom segmented words of category i, the longest common subsequence length maxSubSeq_i(A, B) is calculated, by a longest common subsequence length calculation method, over the sets of custom segmented words of that category in the two unit names; the fuzzy matching score score_i of the category-i custom segmented words is then calculated by the following equation:
(equation not reproduced in this text)
where |A|_i denotes the length of the category-i custom segmented words in A, |B|_i denotes the length of the category-i custom segmented words in B, and i ranges over the three categories: attribute words, region words and alias words.
The fuzzy matching score of the two unit names is obtained by the following equation:
(equation not reproduced in this text)
where mathScore denotes the fuzzy matching score of the two unit names, round(·, 2) denotes rounding to 2 decimal places, and w_i is the weight of the category-i custom segmented words.
The weights of the region words, attribute words and alias words are initialized according to the business scenario and the desired weight proportions. In this embodiment the weights are set as follows: region word weight 0.3, attribute word weight 0.15, alias word weight 0.55 (the initial weights may be adjusted as needed). The initial weights are then adjusted according to the word segmentation sets obtained for A and B. For example, if neither A nor B yields any alias words, the initial alias weight of 0.55 is split equally between the region and attribute weights: the region weight becomes 0.3 + 0.55/2 = 0.575 and the attribute weight becomes 0.15 + 0.55/2 = 0.425.
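The weight adjustment can be sketched as follows. The text only gives the case where alias words are missing from both names; applying the same equal split to any category that is empty in both names is an assumption of this sketch, and `adjust_weights` is a hypothetical name.

```python
from typing import Dict, List

BASE_WEIGHTS = {"region": 0.3, "attribute": 0.15, "alias": 0.55}


def adjust_weights(seg_a: Dict[str, List[str]], seg_b: Dict[str, List[str]],
                   base: Dict[str, float]) -> Dict[str, float]:
    """Split the weight of categories empty in both names equally among the rest."""
    missing = [c for c in base if not seg_a.get(c) and not seg_b.get(c)]
    present = [c for c in base if c not in missing]
    if not missing or not present:
        return dict(base)  # nothing to redistribute
    share = sum(base[c] for c in missing) / len(present)
    return {c: (base[c] + share if c in present else 0.0) for c in base}


# Neither name has alias words: 0.55 is split between region and attribute.
a = {"region": ["Jiangxi"], "attribute": ["bureau"], "alias": []}
b = {"region": ["Jiangxi"], "attribute": ["office"], "alias": []}
adjusted = adjust_weights(a, b, BASE_WEIGHTS)
```

Because the redistributed weights still sum to 1, the overall score stays on the same 0 to 100 scale after adjustment.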
In this embodiment, "Guangzhou Sports East Road Securities of Huatai Union Securities Limited Liability Company" and "Huatai Futures Limited Company" are used as an example:
The two company names are segmented, giving the following custom word segmentation results:
The custom word segmentation result of "Guangzhou Sports East Road Securities of Huatai Union Securities Limited Liability Company" is ([Guangdong Province, Guangzhou City], [sports, securities], [Huatai Union East Road]), where [Guangdong Province, Guangzhou City] is the region word set, [sports, securities] is the attribute word set, and [Huatai Union East Road] is the alias word set;
The custom word segmentation result of "Huatai Futures Limited Company" is ([ ], [futures], [Huatai]), where the region word set is empty, [futures] is the attribute word set, and [Huatai] is the alias word set;
After region extraction, the results are:
([Guangdong Province], [sports, securities], [Huatai Union East Road])
([ ], [futures], [Huatai]);
Step S4 determines that the fuzzy matching calculation cannot be performed directly. In step S5, the common word judgment, the subset judgment and the custom word segmentation judgment are performed in sequence, and none of them allows a fuzzy matching score to be output directly. Step S6 is therefore executed to calculate the fuzzy matching score, as follows:
Since the region word set of "Huatai Futures Limited Company" is empty, the fuzzy matching score of the region words is 0;
For [Huatai Union East Road] and [Huatai], the fuzzy matching score of the alias words is 0.6666665;
For [sports, securities] and [futures], the fuzzy matching score of the attribute words is 0;
Weights: region word weight 0.3, alias word weight 0.55, attribute word weight 0.15;
The fuzzy matching score of the two unit names is round(0.55 × 0.6666665 × 100, 2) = 37.
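The arithmetic of this example can be checked with a few lines. The per-category scores and weights are the ones stated above; the variable names are hypothetical. Rounding the weighted sum to 2 decimal places gives 36.67, while rounding to the nearest integer gives the 37 reported above, so which rounding the final figure reflects is not clear from the text.

```python
# Per-category scores and weights as stated in the worked example.
weights = {"region": 0.3, "attribute": 0.15, "alias": 0.55}
scores = {"region": 0.0, "attribute": 0.0, "alias": 0.6666665}

# Weighted sum of the category scores, scaled to a 0-100 range.
weighted = sum(weights[c] * scores[c] for c in weights) * 100

two_decimals = round(weighted, 2)   # rounding to 2 decimal places
nearest_integer = round(weighted)   # rounding to the nearest integer

print(two_decimals, nearest_integer)
```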
Example 2
As shown in FIG. 2, the unit name searching method based on the fuzzy matching algorithm of this embodiment includes:
acquiring a unit name to be searched, and determining, through ES fuzzy matching, the candidate name with the highest matching degree in a unit name lexicon;
matching the unit name to be searched against the candidate name with the unit name matching method based on the fuzzy matching algorithm of any one of claims 1 to 7, to determine the fuzzy matching score between the candidate name in the unit name lexicon and the unit name to be searched.
In this embodiment, the unit name matching method based on the fuzzy matching algorithm is identical to that of Embodiment 1 and is not described again here.
A specific example follows:
Unit name to be searched: Jiangxi Province General Bureau of Broadcasting and Television;
The name field of the unit name lexicon is set in advance to the type keyword (exact query). After the ES keyword query, the name with the highest matching score is taken and a one-to-one matching similarity calculation is performed. The matched unit names and their ES matching scores are as follows:
Jiangxi Province Broadcast Television Bureau (ES matching score: 18.56986); Heilongjiang Province Broadcast Television Bureau Shandong Province Broadcast Television Bureau (ES matching score: 14.986288); Jiangxi Province General Workshop (ES matching score: 14.648104); Guangdong Province West River Basin Administration (ES matching score: 13.420963); Jiangxi Province Water and Electricity Engineering Bureau Limited (ES matching score: 12.696965)
The matching name with the highest score, Jiangxi Province Broadcast Television Bureau, is selected. A one-to-one matching similarity calculation between the unit name to be searched and this keyword hit gives a similarity of 100 points.
The searching method thus implements one-to-many matching: the unit name to be searched is matched against, and looked up in, the unit name lexicon.
Example 3
The present embodiment provides a unit name matching device based on the fuzzy matching algorithm based on embodiment 1, which includes a memory for storing a computer program and a processor for implementing the unit name matching method based on the fuzzy matching algorithm in embodiment 1 when executing the computer program.
Example 4
The present embodiment provides a unit name lookup apparatus based on fuzzy matching algorithm based on embodiment 2, which includes a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for implementing the unit name lookup method based on fuzzy matching algorithm in embodiment 2 when executing the computer program.
Through the custom fuzzy matching technology, the invention improves the accuracy with which banks match high-quality customer groups. It suits both the business scenario in which a single unit name is input and searched and the scenario in which two unit names are input and matched against each other. The fuzzy matching scenarios are more flexible and the fuzzy matching results are more accurate. The invention also provides custom word segmentation into region, attribute and alias words, assigns a score and a weight to each type of segmented word, and lets the weights be adjusted, which gives the business side more control.
The above embodiments are merely examples and do not limit the scope of the present invention. These embodiments may be implemented in other various manners, and various omissions, substitutions, and changes may be made without departing from the technical spirit of the present invention.
Claims (10)
1. A unit name matching method based on a fuzzy matching algorithm, characterized by comprising the following steps:
S1, acquiring two unit names to be matched and judging whether one contains the other or whether they match exactly; if so, directly outputting a fuzzy matching score of 100; otherwise, executing step S2;
S2, preprocessing each of the two unit names, the preprocessing including standardization processing and filtering processing;
S3, performing custom word segmentation on each of the two unit names to obtain the corresponding custom segmented words, which are of three types: attribute words, region words and alias words;
S4, judging, based on the custom word segmentation results, whether the fuzzy matching calculation can be performed directly; if so, executing step S6; otherwise, executing step S5;
S5, performing a direct fuzzy matching judgment on the two unit names; if the direct judgment condition is met, directly outputting a fuzzy matching score of 100; otherwise, executing step S6;
S6, calculating, based on the custom word segmentation results, a fuzzy matching score for each type of custom segmented word in the two unit names, and weighting these per-type scores to obtain the fuzzy matching score of the two unit names.
2. The unit name matching method based on a fuzzy matching algorithm as claimed in claim 1, wherein the standardization processing in step S2 comprises: replacing abbreviations in the unit names with standard words based on an abbreviation lexicon.
3. The unit name matching method based on a fuzzy matching algorithm as claimed in claim 1, wherein the filtering processing in step S2 comprises: deleting meaningless characters from the unit name, and deleting invalid words from the unit name based on an invalid word lexicon.
4. The unit name matching method based on a fuzzy matching algorithm as claimed in claim 1, wherein in step S4 the fuzzy matching calculation is judged to be directly performable if either of the following conditions holds:
the word segmentation results of both unit names contain region words, and the region words differ;
the word segmentation result of one unit name contains alias words, and the word segmentation result of the other does not.
5. The unit name matching method based on fuzzy matching algorithm as claimed in claim 1, wherein the direct fuzzy matching judgment of step S5 specifically comprises:
S51, if the word segmentation results of the two unit names both contain different nouns and different alias words, executing step S52, otherwise executing step S53;
S52, ES word segmentation judgment: performing ES word segmentation on each of the two unit names using the IK word segmentation method provided with ES to obtain two word segmentation sets, and recording the shorter set as Aset and the longer set as Bset; if every word in Aset is contained in Bset, directly outputting a fuzzy matching score of 100; otherwise, counting the words common to the two sets and recording the count as Cnt, and recording the total number of words in Aset as Cnaset; if Cnaset ≥ esLen and Cnt/Cnaset ≥ esPct, directly outputting a fuzzy matching score of 100, otherwise executing step S53, wherein esLen and esPct are preset thresholds;
S53, common word judgment: recording the unit name with fewer characters as A and the unit name with more characters as B, and traversing the characters of A in order against B to obtain the set of longest matched character strings, called the common word C; if the common word C exactly matches one of the unit names and its length exceeds 3 characters, directly outputting a fuzzy matching score of 100; otherwise, performing custom word segmentation again on the common word C and the unit name B, and if the attribute words in the two word segmentation results are the same and the nouns are empty, directly outputting a fuzzy matching score of 100, otherwise executing step S54;
S54, subset judgment: performing a subset judgment on the unit names preprocessed in step S2; if one unit name is contained in the other, directly outputting a fuzzy matching score of 100, otherwise executing step S55;
S55, custom word segmentation judgment: if the regional word sets in the word segmentation results of the two unit names are both non-empty and equal, and the noun sets are both non-empty and equal, directly outputting a fuzzy matching score of 100; otherwise executing step S6.
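Three of the checks above (S52, S54 and S55) reduce to set and substring tests and can be sketched as follows. The sketch takes already-tokenized input, so it does not call ES or the IK tokenizer; the function names, the `Segments` layout, the treatment of the token collections as sets and the guards against empty input are assumptions added for illustration. Each function returns True where the claim says a score of 100 is output directly.

```python
from typing import Dict, Iterable, List

Segments = Dict[str, List[str]]  # hypothetical: category name -> custom words


def es_token_check(tokens_1: Iterable[str], tokens_2: Iterable[str],
                   es_len: int, es_pct: float) -> bool:
    """S52: compare the two token sets (Aset is the smaller one)."""
    a_set, b_set = sorted((set(tokens_1), set(tokens_2)), key=len)
    if not a_set:                       # guard added for this sketch only
        return False
    if a_set <= b_set:                  # every word of Aset is in Bset
        return True
    cnt = len(a_set & b_set)            # Cnt: words common to both sets
    cn_aset = len(a_set)                # Cnaset: number of words in Aset
    return cn_aset >= es_len and cnt / cn_aset >= es_pct


def subset_check(name_a: str, name_b: str) -> bool:
    """S54: one preprocessed name is contained in the other."""
    if not name_a or not name_b:        # guard added for this sketch only
        return False
    return name_a in name_b or name_b in name_a


def segment_check(seg_a: Segments, seg_b: Segments) -> bool:
    """S55: regional sets and noun sets are each non-empty and equal."""
    def same_nonempty(key: str) -> bool:
        a, b = set(seg_a.get(key, [])), set(seg_b.get(key, []))
        return bool(a) and a == b
    return same_nonempty("regional") and same_nonempty("noun")
```

In the claimed flow these checks run in sequence, and the first one that succeeds ends step S5 with a score of 100.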
6. The unit name matching method based on fuzzy matching algorithm as claimed in claim 1, wherein the fuzzy matching score of each type of custom word in step S6 is obtained as follows:
recording the two unit names as A and B; for the custom words of category i, calculating the longest common subsequence length $\mathrm{maxSubSeq}(A, B)_i$ of the category-i custom word sets of the two unit names by a longest common subsequence length calculation method, and then calculating the fuzzy matching score of the category-i custom words by the following formula:
wherein $|A|_i$ denotes the length of the category-i custom words in A, $|B|_i$ denotes the length of the category-i custom words in B, and i ranges over the three categories of attribute words, regional words and alias words.
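The longest common subsequence length used in claim 6 can be computed with the standard dynamic-programming recurrence. The sketch below covers only that length; the per-category score formula itself is not reproduced in this text and is deliberately left out. The function name `max_sub_seq` mirrors the claim's maxSubSeq, and applying it to character strings rather than word lists is an assumption.

```python
from typing import Sequence


def max_sub_seq(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Keeping only two rows of the table gives O(len(a) * len(b)) time and O(len(b)) memory, which is ample for short name fragments.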
7. The unit name matching method based on fuzzy matching algorithm as claimed in claim 6, wherein the fuzzy matching score of two unit names is obtained by the following formula:
8. A unit name searching method based on a fuzzy matching algorithm, characterized by comprising the following steps:
acquiring a unit name to be searched, and determining, through ES fuzzy matching, the name to be matched that has the highest matching degree in a unit name lexicon;
and matching the unit name to be searched against the name to be matched using the unit name matching method based on the fuzzy matching algorithm of any one of claims 1 to 7, so as to determine the fuzzy matching score between the name to be matched in the unit name lexicon and the unit name to be searched.
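The two-stage search of claim 8 (candidate retrieval, then scoring) can be outlined as below. Note the substitution: the claim retrieves the candidate through ES fuzzy matching, whereas this sketch uses Python's standard `difflib.get_close_matches` as a stand-in so that it runs without a search server. The function name `find_unit_name` and the `score` callable are hypothetical.

```python
import difflib
from typing import Callable, Optional, Sequence, Tuple


def find_unit_name(
    query: str,
    lexicon: Sequence[str],
    score: Callable[[str, str], float],
) -> Optional[Tuple[str, float]]:
    """Pick the closest lexicon entry, then score it against the query."""
    # Stage 1 (stand-in for ES fuzzy matching): the single closest entry.
    candidates = difflib.get_close_matches(query, lexicon, n=1, cutoff=0.0)
    if not candidates:
        return None
    best = candidates[0]
    # Stage 2: score the pair with the matching method of claims 1 to 7.
    return best, score(query, best)
```

Separating retrieval from scoring means the comparatively expensive matching method runs on one candidate rather than on the whole lexicon.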
9. A unit name matching device based on a fuzzy matching algorithm, which is characterized by comprising a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the unit name matching method based on the fuzzy matching algorithm according to any one of claims 1 to 7 when the computer program is executed.
10. An apparatus for unit name lookup based on fuzzy matching algorithm, characterized in that the apparatus comprises a memory for storing a computer program and a processor for implementing the unit name lookup based on fuzzy matching algorithm of claim 8 when executing said computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110566536.XA CN113268986B (en) | 2021-05-24 | 2021-05-24 | Unit name matching and searching method and device based on fuzzy matching algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113268986A (en) | 2021-08-17 |
CN113268986B (en) | 2024-05-24 |
Family
ID=77232606
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110566536.XA Active CN113268986B (en) | 2021-05-24 | 2021-05-24 | Unit name matching and searching method and device based on fuzzy matching algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113268986B (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102651013A (en) * | 2012-03-23 | 2012-08-29 | 上海安捷力信息***有限公司 | Method and system for extracting area information from enterprise name data |
US20140052688A1 (en) * | 2012-08-17 | 2014-02-20 | Opera Solutions, Llc | System and Method for Matching Data Using Probabilistic Modeling Techniques |
CN107644010A (en) * | 2016-07-20 | 2018-01-30 | 阿里巴巴集团控股有限公司 | A kind of Text similarity computing method and device |
US20190251085A1 (en) * | 2016-11-25 | 2019-08-15 | Alibaba Group Holding Limited | Method and apparatus for matching names |
CN108829661A (en) * | 2018-05-09 | 2018-11-16 | 成都信息工程大学 | A kind of subject of news title extracting method based on fuzzy matching |
CN109657213A (en) * | 2018-12-21 | 2019-04-19 | 北京金山安全软件有限公司 | Text similarity detection method and device and electronic equipment |
CN110046341A (en) * | 2018-12-29 | 2019-07-23 | ***股份有限公司 | For carrying out matched method and system to information |
US20200327136A1 (en) * | 2019-04-14 | 2020-10-15 | EverString Innovation Technology | Automated company matching |
CN110309504A (en) * | 2019-05-23 | 2019-10-08 | 平安科技(深圳)有限公司 | Text handling method, device, equipment and storage medium based on participle |
US20210089562A1 (en) * | 2019-09-20 | 2021-03-25 | Jpmorgan Chase Bank, N.A. | System and method for generating and implementing context weighted words |
CN111104795A (en) * | 2019-11-19 | 2020-05-05 | 平安金融管理学院(中国·深圳) | Company name matching method and device, computer equipment and storage medium |
CN111859042A (en) * | 2020-07-30 | 2020-10-30 | 上海妙一生物科技有限公司 | Retrieval method and device and electronic equipment |
CN112364635A (en) * | 2020-11-30 | 2021-02-12 | 中国银行股份有限公司 | Enterprise name duplication checking method and device |
CN112749255A (en) * | 2020-12-30 | 2021-05-04 | 科大国创云网科技有限公司 | Human-computer interaction semantic recognition intention matching method and system based on ES |
Non-Patent Citations (4)
Title |
---|
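乔世权; 戴继勇: "Research on an intelligent directory enquiry engine based on text similarity", Journal of Hebei University of Science and Technology, vol. 39, no. 3, pages 282 - 288 * |
段晨迪: "Design and implementation of an ElasticSearch-based vertical search engine for MOOC", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology, no. 1, pages 1138 - 2507 * |
连誉舜: "Design and implementation of a Chinese organization name retrieval ***", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology, no. 4, pages 138 - 1519 * |
钟良伍, 郑方: "Research on retrieval methods based on abbreviations of Chinese organization names", Journal of Chinese Information Processing, vol. 21, no. 1, pages 38 - 42 * |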
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115905473A (en) * | 2022-12-20 | 2023-04-04 | 无锡锡银金科信息技术有限责任公司 | Full noun fuzzy matching method, device and storage medium |
CN115905473B (en) * | 2022-12-20 | 2024-03-05 | 无锡锡银金科信息技术有限责任公司 | Full noun fuzzy matching method, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113268986B (en) | 2024-05-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111783419B (en) | | Address similarity calculation method, device, equipment and storage medium |
CN110321925B (en) | | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints |
US8812300B2 (en) | | Identifying related names |
US7424421B2 (en) | | Word collection method and system for use in word-breaking |
CN107862070B (en) | | Online classroom discussion short text instant grouping method and system based on text clustering |
US8316041B1 (en) | | Generation and processing of numerical identifiers |
CN103106287B (en) | | A kind of processing method and system of user search sentence |
CN109885675B (en) | | Text subtopic discovery method based on improved LDA |
CN111259151A (en) | | Method and device for recognizing mixed text sensitive word variants |
CN111899890A (en) | | Medical data similarity detection system and method based on bit string Hash |
CN110413998B (en) | | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof |
CN110990676A (en) | | Social media hotspot topic extraction method and system |
CN111090994A (en) | | Chinese-internet-forum-text-oriented event place attribution province identification method |
CN112559747A (en) | | Event classification processing method and device, electronic equipment and storage medium |
CN112989414A (en) | | Mobile service data desensitization rule generation method based on width learning |
CN111061873B (en) | | Multi-channel text classification method based on Attention mechanism |
CN113268986B (en) | | Unit name matching and searching method and device based on fuzzy matching algorithm |
CN111753547A (en) | | Keyword extraction method and system for sensitive data leakage detection |
CN111597423A (en) | | Performance evaluation method and device of interpretable method of text classification model |
CN114943285B (en) | | Intelligent auditing system for internet news content data |
CN108172304B (en) | | Medical information visualization processing method and system based on user medical feedback |
CN107291952B (en) | | Method and device for extracting meaningful strings |
CN113934910A (en) | | Automatic optimization and updating theme library construction method and hot event real-time updating method |
CN113657443A (en) | | Online Internet of things equipment identification method based on SOINN network |
TWI534640B (en) | | Chinese network information monitoring and analysis system and its method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||