CN110442603B

CN110442603B - Address matching method, device, computer equipment and storage medium

Info

Publication number: CN110442603B
Application number: CN201910601364.8A
Authority: CN
Inventors: 申超波; 阮晓雯; 徐亮
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-07-03
Filing date: 2019-07-03
Publication date: 2024-01-19
Anticipated expiration: 2039-07-03
Also published as: CN110442603A; WO2021000831A1

Abstract

The application discloses an address matching method, an address matching device, computer equipment, and a storage medium. A first address is an address to be retrieved that is input by a user, and a second address is stored in an index server. The method comprises: calling a preset matching algorithm and segmenting the first address and the second address into words according to a first preset rule, to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address, wherein the preset matching algorithm comprises word-segmentation calculation and matching calculation; dividing the first address into a plurality of first segments according to the first word-segmentation group, and dividing the second address into a plurality of second segments according to the second word-segmentation group; and obtaining a matching result of the first segments and the second segments according to a second preset rule, and judging from it whether the first address and the second address are the same. The first four administrative levels of a segmented address are matched exactly against a tree-structured national address library of provinces, cities, districts, and towns, and partially missing levels are effectively completed.

Description

Address matching method, device, computer equipment and storage medium

Technical Field

The present invention relates to the field of computers, and in particular, to an address matching method, an address matching device, a computer device, and a storage medium.

Background

Traditional fuzzy address matching usually treats an address as a single whole and performs NLP-based fuzzy matching on it. This approach has the following defects: 1) An address is a tree of address names in which names become more similar toward the lower layers, but whole-address matching compares the names as a flat, parallel structure, which does not reflect how address names are actually distributed. 2) Short addresses compare poorly, even though most short addresses carry high value. 3) All address names within one address are treated as equally valuable words, although in practice they are not; for example, in Shenzhen/Nanshan/Tencent, the name Tencent is clearly more valuable for identifying the address.

Disclosure of Invention

The main purpose of the application is to provide an address matching method that overcomes the defects of existing address matching described above.

The application provides an address matching method, wherein a first address is an address to be retrieved input by a user, and a second address is stored in an index server, and the method comprises the following steps:

calling a preset matching algorithm, and respectively word-segmenting the first address and the second address according to a first preset rule to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address, wherein the preset matching algorithm comprises word-segmentation calculation and matching calculation;

dividing the first address into a plurality of first segments according to the first word-segmentation group, and dividing the second address into a plurality of second segments according to the second word-segmentation group;

obtaining matching results of all the first segments and all the second segments according to a second preset rule;

and judging whether the first address and the second address are the same according to the matching result.

The application also provides an address matching device, wherein the first address is an address to be retrieved input by a user, the second address is stored in an index server, and the device comprises:

the word segmentation module is used for calling a preset matching algorithm, and respectively segmenting the first address and the second address according to a first preset rule to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm comprises word segmentation calculation and matching calculation;

the dividing module is used for dividing the first address into a plurality of first segments according to the first word-segmentation group and dividing the second address into a plurality of second segments according to the second word-segmentation group;

the second acquisition module is used for acquiring matching results of all the first segments and all the second segments according to a second preset rule;

and the judging module is used for judging whether the first address and the second address are the same according to the matching result.

The present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.

The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above-described method.

The application segments an address and matches its first four administrative levels exactly against a tree-structured national address library of provinces, cities, districts, and towns, and it effectively completes partially missing levels. The index server combines NoSQL distributed storage with an index structure to query and compute over massive data in real time. On this basis the application provides an address matching model with configurable weights that is based on multi-level division of the address: a natural language processing model first segments the address into a word-segmentation group, the group is divided into segments by administrative level, and the segments are mapped to nodes of a tree structure. The tree structure of the address is thus fully taken into account, each segment is matched with a weight that depends on its administrative level, and the weights can be fine-tuned for the actual business scenario. An index structure is built over the massive data pre-stored in the index server, and the computing architecture and strong distributed computing capability of the Elasticsearch component allow the first address to be queried quickly and in real time in the preset index structure. The default weights are obtained by training a model: the training parameters, which include the weight values, are adjusted continuously during training until the similarity output by the model is consistent with the pre-labeled similarity value or falls within a preset deviation range, which makes the weight setting more reliable.

Drawings

FIG. 1 is a schematic flow chart of an address matching method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an address matching device according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

Referring to FIG. 1, an embodiment of the present application provides an address matching method in which the first address is an address to be retrieved that is input by a user and the second address is stored in an index server. The method includes:

S1: calling a preset matching algorithm, and respectively word-segmenting the first address and the second address according to a first preset rule to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address, wherein the preset matching algorithm comprises word-segmentation calculation and matching calculation.

In this embodiment, the similarity between the first address and the second address is computed, and both addresses are written from the highest administrative level to the lowest, that is, from the general region to the specific location. The first preset rule applies different word-segmentation rules to different administrative levels of the address. The four levels that are common nationwide (province, city, district or county, and town or township) are generally segmented with a national address database. For example, segmenting "Guangdong Province Foshan City Nanhai District Guicheng" gives: Guangdong Province/Foshan City/Nanhai District/Guicheng. Address information beyond these four levels is segmented semantically.

S2: dividing the first address into a plurality of first segments according to the first word-segmentation group, and dividing the second address into a plurality of second segments according to the second word-segmentation group.

In this embodiment, an address is divided into segments, or into administrative levels, according to its word-segmentation group, and each segment or administrative level corresponds to one or more tokens. The terms "first" and "second" only distinguish the segments of the first address from those of the second address and are not limiting; similar terms elsewhere serve the same purpose and are not explained again. A word-segmentation group is the sequence of tokens of an actual address, kept in the writing order of the original address. Segments are formed from these tokens by administrative level: for example, a longer name such as "a certain city development zone" yields two tokens, "a certain city/development zone", but both tokens belong to one segment.
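The grouping of tokens into segments can be illustrated with a minimal sketch. It assumes that each token has already been tagged with an administrative level; the tag names, the sample tokens, and the function `group_into_segments` are hypothetical and not taken from the application.

```python
from itertools import groupby

# Hypothetical input: tokens already tagged with an administrative level,
# kept in the writing order of the original address.
tagged_tokens = [
    ("province", "Guangdong Province"),
    ("city", "A City"),
    ("city", "Development Zone"),
    ("town", "Guicheng"),
]


def group_into_segments(tagged):
    """Merge consecutive tokens that share a level into one segment."""
    return [
        (level, [token for _, token in run])
        for level, run in groupby(tagged, key=lambda pair: pair[0])
    ]


segments = group_into_segments(tagged_tokens)
```

Here the two tokens tagged "city" end up in a single segment, as in the "a certain city development zone" example.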

S3: obtaining matching results of all the first segments and all the second segments according to a second preset rule.

In the embodiment, the first segments and the second segments are matched one by one according to the corresponding relation of administrative levels to obtain a matching result. For example, the first segment corresponding to the province level of the first address is compared with the second segment corresponding to the province level of the second address, so as to improve the symmetry and reliability of information comparison.

S4: judging whether the first address and the second address are the same according to the matching result.

In this embodiment, the first address and the second address are compared level by level through the correspondence of administrative levels. When the matching rate of the two addresses falls within the preset range, they are judged to be the same; otherwise they are judged to be different. In other embodiments of the present application, the matching rate must fall within the preset range and, in addition, the segments of a specified administrative level must match 100% before the two addresses are judged to be the same, which improves matching accuracy.
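The two decision variants can be sketched as one function. This is a minimal illustration: the function name, the parameter names, and the default threshold of 0.9 are assumptions, since the application does not state a value for the preset range.

```python
def is_same_address(match_rate, level_scores, threshold=0.9, required_levels=()):
    """Decide whether two addresses are the same.

    match_rate      -- overall weighted matching rate in [0, 1]
    level_scores    -- dict mapping a level name to its matching value in [0, 1]
    threshold       -- assumed lower bound of the preset range
    required_levels -- levels that must match 100% (the stricter variant)
    """
    if match_rate < threshold:
        return False
    # With no required levels this reduces to the plain threshold test.
    return all(level_scores.get(level, 0.0) == 1.0 for level in required_levels)
```

Passing `required_levels=("flag",)`, for instance, accepts a pair only when the flag segments match completely in addition to the overall rate reaching the threshold.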

The first address in this embodiment is the address to be queried that the user inputs. Its data structure is not restricted, so any address to be queried can be matched, which gives the user more flexibility and freedom. For example, the first address may contain data arranged in order of six levels (province; city; district or county; town or township; road, residential community, or building; block and room number), or data in which one or several of these levels are missing. The preset matching conditions in this embodiment include the matching rate reaching a preset threshold, or the flag data in the first address matching 100%, and so on. Flag data is the data in the first address that pins down a geographical location, such as the name of a residential community or of a building; for example, "Jiangnan Mingju Community Rongyuan" in the first address is flag data. In another embodiment of the present application, the flag data of the first address is the data that follows the "town or township" level and precedes the "block and room number".

Further, the first address and the second address each include a range address and a flag address, and step S1 of calling a preset matching algorithm and segmenting the first address and the second address into words according to a first preset rule, to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address, includes:

S11: segmenting the range addresses corresponding to the first address and the second address respectively according to a pre-associated address dictionary in a natural language processing model, to obtain a first word-segmentation part corresponding to the first address and a first word-segmentation part corresponding to the second address.

The range address of this embodiment includes at least one of the four administrative levels: province, city, district or county, and town or township. It is segmented with a pre-associated address dictionary, that is, a lexicon corresponding to the national address database that is associated in advance with the natural language processing model. The preset matching algorithm of this embodiment comprises word-segmentation calculation and matching calculation. To improve matching precision, a crawled address library is added when the open-source word-segmentation package jieba performs the segmentation; it is used together with the national address library to correct the address to be segmented, after which the address is segmented by administrative level, which improves segmentation accuracy. The method checks whether the current address contains an administrative level covered by the address dictionary and, if so, calls the address dictionary to segment it. For example, the address "Guangdong Province Foshan City Nanhai District Guicheng Jiangnan Mingju Community Rongyuan Block 1 306" contains the four administrative levels covered by the address dictionary, so these levels are segmented with the dictionary, giving: Guangdong Province/Foshan City/Nanhai District/Guicheng/Jiangnan Mingju Community Rongyuan Block 1 306. The first word-segmentation part is Guangdong Province/Foshan City/Nanhai District/Guicheng.
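Dictionary-based segmentation of the range address can be sketched as follows. This is a simplified stand-in, not the jieba-based implementation: it uses plain prefix matching against a small tree-shaped dictionary. `ADDRESS_TREE` and `segment_range_address` are hypothetical names, and the sketch assumes that no administrative level is missing from the input.

```python
# Hypothetical tree-shaped address dictionary: each name maps to its children.
ADDRESS_TREE = {
    "Guangdong Province": {
        "Foshan City": {
            "Nanhai District": {
                "Guicheng": {},
            },
        },
    },
}


def segment_range_address(address, tree=ADDRESS_TREE):
    """Peel administrative-level names off the front of `address`.

    Returns (tokens, remainder): the matched names from the highest level
    down, and the text left over for the grammar-based segmentation steps.
    """
    tokens = []
    rest = address.strip()
    children = tree
    while children:
        # Try longer names first so a short name cannot shadow a longer one.
        match = next(
            (name for name in sorted(children, key=len, reverse=True)
             if rest.startswith(name)),
            None,
        )
        if match is None:
            break
        tokens.append(match)
        rest = rest[len(match):].lstrip(" ,")
        children = children[match]
    return tokens, rest


tokens, remainder = segment_range_address(
    "Guangdong Province Foshan City Nanhai District Guicheng "
    "Jiangnan Mingju Community Rongyuan Block 1 306"
)
```

The remainder, "Jiangnan Mingju Community Rongyuan Block 1 306", is what the grammar models of the later steps would receive.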

S12: segmenting the flag addresses corresponding to the first address and the second address respectively according to a first grammar model in the natural language processing model, to obtain a second word-segmentation part corresponding to the first address and a second word-segmentation part corresponding to the second address.

The flag address of this embodiment contains information that pins down a geographical location, such as the name of a residential community or of a building, for example "Jiangnan Mingju Community Rongyuan" in the address above. The flag address is segmented according to a first grammar model in the natural language processing model, whose patterns include but are not limited to "a certain community" and "a certain building". For example, for "Guicheng Jiangnan Mingju Community Rongyuan Block 1 306", the corresponding second word-segmentation part is "Guicheng/Jiangnan Mingju Community/Rongyuan". In another embodiment of the application, the first grammar model takes the characters after the town or township and before the room number as the flag address.
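The alternative embodiment, in which everything between the town level and the room number is the flag address, can be approximated with a regular expression. This is a sketch under stated assumptions: the pattern `DETAIL_START`, the English keyword "Block", and the function `split_flag_and_detail` are illustrative choices, and a production grammar model would be considerably richer.

```python
import re

# Assumed pattern: the detail address starts at "Block <n>" or, failing that,
# at the first bare number in the text.
DETAIL_START = re.compile(r"\b(?:Block\s+)?\d+")


def split_flag_and_detail(remainder):
    """Split the text after the town level into (flag address, detail address)."""
    match = DETAIL_START.search(remainder)
    if match is None:
        return remainder.strip(), ""
    return remainder[:match.start()].strip(), remainder[match.start():].strip()


flag, detail = split_flag_and_detail("Jiangnan Mingju Community Rongyuan Block 1 306")
```

An address without any number is returned entirely as the flag address, with an empty detail part.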

S13: and forming a first word segmentation part corresponding to the first address and a second word segmentation part corresponding to the first address into a first word segmentation group corresponding to the first address, and forming the first word segmentation part corresponding to the second address and the second word segmentation part corresponding to the second address into a second word segmentation group corresponding to the second address.

The first address or the second address in this embodiment consists of a range address followed by a flag address, arranged from left to right. For example, the first address is "Guangdong Province Foshan City Nanhai District Guicheng Jiangnan Mingju Community Rongyuan" and the second address is "Guangdong Province Foshan City Nanhai District Guicheng Jiangnan Mingju Rongyuan". The first word-segmentation group is then "Guangdong Province/Foshan City/Nanhai District/Guicheng/Jiangnan Mingju Community/Rongyuan", and the second word-segmentation group is "Guangdong Province/Foshan City/Nanhai District/Guicheng/Jiangnan Mingju/Rongyuan".

Further, the first address and the second address each further include a detail address, and after step S12 of segmenting according to a first grammar model in a natural language processing model to obtain a second word-segmentation part corresponding to the first address and a second word-segmentation part corresponding to the second address, the method further includes:

S14: segmenting the detail addresses corresponding to the first address and the second address respectively according to a second grammar model in the natural language processing model, to obtain a third word-segmentation part corresponding to the first address and a third word-segmentation part corresponding to the second address.

The detail address of this embodiment is the specific block and room number. It has little influence on the similarity of two addresses, and some embodiments ignore it altogether. Certain application scenarios, however, require precision down to the detail address to meet business requirements. The patterns of the second grammar model of this embodiment include but are not limited to "a certain floor" and "a certain floor, a certain room".

S15: and forming a first word segmentation group corresponding to the first address by the first word segmentation part corresponding to the first address, the second word segmentation part corresponding to the first address and the third word segmentation part corresponding to the first address, and forming a second word segmentation group corresponding to the second address by the first word segmentation part corresponding to the second address, the second word segmentation part corresponding to the second address and the third word segmentation part corresponding to the second address.

The first address or the second address in this embodiment consists of a range address, a flag address, and a detail address, arranged from left to right. For example, the first address is "Guangdong Province Foshan City Nanhai District Guicheng Jiangnan Mingju Community Rongyuan Block 1 306" and the second address is "Guangdong Province Foshan City Nanhai District Guicheng Jiangnan Mingju Rongyuan Block 1 502". The first word-segmentation group is then "Guangdong Province/Foshan City/Nanhai District/Guicheng/Jiangnan Mingju Community/Rongyuan/Block 1/306", and the second word-segmentation group is "Guangdong Province/Foshan City/Nanhai District/Guicheng/Jiangnan Mingju/Rongyuan/Block 1/502". The first address or the second address is then divided into segments or administrative levels according to its word-segmentation group.

Further, the range address includes the four administrative levels of province, city, district or county, and town or township; the flag address includes a community name or a building name; and step S3 of obtaining matching results of all the first segments and all the second segments according to a second preset rule includes:

S31: mapping all the first segments and all the second segments into two structural trees with the same structure in order of administrative level from high to low, wherein each structural tree comprises a plurality of nodes, and the nodes correspond one to one to the first segments or to the second segments.

In this embodiment, all first segments of the first address and all second segments of the second address are mapped, from the highest administrative level to the lowest, into two structural trees with the same structure. A node corresponds to at least one segment, or to several tokens of the same administrative level. For example, the token "Guangdong Province" for the highest level contained in an address, "province", serves as the root node; the token "Foshan City" for the next-level child node, "city", is connected to it; and so on down to the end node "Block 1 502". The administrative levels of the root node and the end node depend on the specific address, which may be a full address covering all administrative levels or a short address covering only some of them.
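Since a single address occupies one root-to-leaf path of the tree, the mapping and pairing of nodes can be sketched with ordered chains. The level names in `LEVELS` and the functions `to_chain` and `align` are hypothetical; the sketch only shows how nodes of the same administrative level are lined up, including the case of a short address that lacks some levels.

```python
LEVELS = ["province", "city", "district", "town", "flag", "detail"]


def to_chain(segments):
    """Order segments from the highest to the lowest administrative level.

    `segments` maps a level name to its list of tokens; levels the address
    does not contain are simply absent, which yields a shorter chain.
    """
    return [(level, tuple(segments[level])) for level in LEVELS if level in segments]


def align(first_segments, second_segments):
    """Pair up the nodes of the two chains that sit on the same level."""
    first = dict(to_chain(first_segments))
    second = dict(to_chain(second_segments))
    return [
        (level, first.get(level), second.get(level))
        for level in LEVELS
        if level in first or level in second
    ]


pairs = align(
    {"province": ["Guangdong Province"], "city": ["Foshan City"],
     "detail": ["Block 1", "306"]},
    {"province": ["Guangdong Province"], "detail": ["Block 1", "502"]},
)
```

A level present in only one address appears with `None` on the other side, which leaves room for the completion step or for scoring the level as unmatched.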

S32: and obtaining the matching values corresponding to the nodes of the two structural trees respectively.

In the matching calculation of this embodiment, the nodes of the two structural trees are paired according to their administrative levels, and a matching value is computed for each pair of nodes. The matching value of a node is the number of its segments that match divided by the total number of its segments. For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", the nodes match; otherwise they do not.

S33: and respectively acquiring a first weight corresponding to the range address, a second weight corresponding to the mark address and a third weight corresponding to the detail address.

According to the embodiment, different weights are set according to different influences of the segments corresponding to the administrative levels on the address, so that flexibility of meeting business requirements is improved. For example, the second weight corresponding to the flag address is higher than the first weight corresponding to the range address, etc.

S34: and calculating the matching rate according to the multiplication of the matching value by the corresponding weight, and respectively obtaining a first matching rate corresponding to the range address, a second matching rate corresponding to the mark address and a third matching rate corresponding to the detail address.

The matching rate in this embodiment is calculated as follows: the matching value of each segment multiplied by the weight configured for that segment gives the matching rate of the segment, and the matching rates of all segments are summed to obtain the matching result of the first address and the second address.

S35: and adding the first matching rate, the second matching rate and the third matching rate to obtain the matching result of all the first segments and all the second segments.

Further, the step S32 of obtaining the matching values corresponding to the nodes of the two structural trees includes:

S321: performing exact full matching between each first segment corresponding to the range address in the first address and each second segment corresponding to the range address in the second address, according to the one-to-one node correspondence, to obtain each first matching value.

Nodes of different administrative levels are matched in different ways in this embodiment. The four levels of province, city, district or county, and town or township are matched exactly: the corresponding characters must be 100% identical for the nodes to match; otherwise they do not match. For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", the two nodes match.

S322: performing model keyword matching between each first segment corresponding to the flag address in the first address and each second segment corresponding to the flag address in the second address, according to the one-to-one node correspondence, to obtain each second matching value.

In this embodiment, the segments of the flag address are matched by an NLP (Natural Language Processing) model, and a containment relationship between corresponding segments is sufficient for a match. For example, "Jiangnan Mingju Community/Rongyuan" and "Jiangnan Mingju/Rongyuan" are not identical character by character, but "Jiangnan Mingju Community" contains "Jiangnan Mingju", so the segments still match one to one.
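The containment relationship can be sketched with a plain substring test. This is a simplified stand-in for the NLP model keyword matching, and the function name `flag_match` is hypothetical.

```python
def flag_match(first, second):
    """Return 1.0 if either flag segment contains the other, else 0.0."""
    a, b = first.strip(), second.strip()
    if not a or not b:
        # An empty segment trivially "fits inside" anything, so reject it.
        return 0.0
    return 1.0 if a in b or b in a else 0.0
```

With this test, "Jiangnan Mingju Community" and "Jiangnan Mingju" match, while two unrelated community names do not.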

S323: performing numeric matching between each first segment corresponding to the detail address in the first address and each second segment corresponding to the detail address in the second address, according to the one-to-one node correspondence, to obtain each third matching value.

If the detail address in this embodiment contains a first specified number of segments, of which a second specified number satisfy the matching relationship, the matching value of the detail address is the second specified number divided by the first specified number.
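The ratio can be sketched as a position-by-position comparison of the detail fields. The function name `detail_match` is hypothetical, and taking the longer of the two field lists as the total is an assumption for the case where the two addresses have different numbers of detail fields.

```python
def detail_match(first_fields, second_fields):
    """Fraction of detail fields that agree position by position."""
    total = max(len(first_fields), len(second_fields))
    if total == 0:
        return 0.0
    matched = sum(1 for a, b in zip(first_fields, second_fields) if a == b)
    return matched / total
```

For the fields 1/306 and 1/502, one of two fields agrees, which gives a matching value of 0.5.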

S324: and summarizing the first matching values, the second matching values and the third matching values to obtain matching values corresponding to nodes of the two structural trees respectively.

For example, the word-segmentation group of the first address is Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju Community/Rongyuan/1/306, and that of the second address is Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju/Rongyuan/1/502. After segmentation, each address is divided into six levels (province; city; district or county; town or township; road, residential community, or building; block and room number), which form six nodes with default weights of 0.1/0.1/0.1/0.1/0.5/0.1. The first four levels are matched by 100% character identity, and each contributes 0.1 × 1. The fifth level is matched by a model based on character containment: "Jiangnan Mingju Community/Rongyuan" matches "Jiangnan Mingju/Rongyuan", contributing 0.5 × 1. The sixth level is matched by fuzzy matching: between 1/306 and 1/502 only one of the two corresponding fields matches, because 306 and 502 differ, so the matching value is 0.5 and the contribution is 0.5 × 0.1 = 0.05. The matching rate of the first address and the second address is therefore 0.1 + 0.1 + 0.1 + 0.1 + 0.5 + 0.05 = 0.95.
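The arithmetic of this example can be checked with a few lines of code. The weights and matching values are those of the example; the function name `matching_rate` is hypothetical.

```python
WEIGHTS = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]       # default weights of the six nodes
MATCH_VALUES = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5]  # matching value of each node


def matching_rate(values, weights):
    """Sum of matching value times weight over all nodes."""
    if len(values) != len(weights):
        raise ValueError("one weight is needed per node")
    return sum(v * w for v, w in zip(values, weights))


rate = matching_rate(MATCH_VALUES, WEIGHTS)
print(round(rate, 2))  # 0.95
```

Because the weights sum to 1, the rate stays in the range 0 to 1 and can be compared directly with a preset threshold.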

Further, before the step S33 of obtaining the first weight corresponding to the range address, the second weight corresponding to the flag address, and the third weight corresponding to the detail address, the method includes:

S331: inputting a specified number of training samples with pre-labeled similarity values into the natural language processing model for training.

S332: adjusting the training parameters to first parameters at which the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value.

S333: assigning the weight values contained in the first parameters to the first weight, the second weight, and the third weight according to the node correspondence.

The default weights of this embodiment are obtained by training a model. The training parameters, which include the weight values, are adjusted continuously during training until the similarity output by the model is consistent with the pre-labeled similarity value or falls within a preset deviation range, and the weight values are determined in this way. In other embodiments of the application, one or more of the default weights can be adjusted for a specific application scenario so that the matching model fits that scenario better.
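How weights might be fitted to pre-labeled similarity values can be sketched with a linear model trained by batch gradient descent. This is an assumption made for illustration only: the application does not specify the model or the optimizer, and the samples, the learning rate, and the function names below are made up.

```python
def predict(weights, values):
    """Weighted sum of the per-level matching values."""
    return sum(w * v for w, v in zip(weights, values))


def fit_weights(samples, n_levels, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean squared error (assumed optimizer)."""
    weights = [1.0 / n_levels] * n_levels
    n = len(samples)
    for _ in range(epochs):
        grads = [0.0] * n_levels
        for values, label in samples:
            error = predict(weights, values) - label
            for j, v in enumerate(values):
                grads[j] += 2.0 * error * v / n
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Made-up labeled samples: (per-level matching values, labeled similarity).
samples = [
    ([1, 1, 1, 1, 1, 1], 1.0),
    ([0, 1, 1, 1, 1, 1], 0.9),
    ([1, 0, 1, 1, 1, 1], 0.9),
    ([1, 1, 0, 1, 1, 1], 0.9),
    ([1, 1, 1, 0, 1, 1], 0.9),
    ([1, 1, 1, 1, 0, 1], 0.5),
    ([1, 1, 1, 1, 1, 0], 0.9),
]
weights = fit_weights(samples, n_levels=6)
```

The labels here were chosen to be consistent with the default weights of the worked example, so the fitted weights approach 0.1/0.1/0.1/0.1/0.5/0.1; with real labeled data the stopping rule would instead be the preset deviation range.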

Further, before step S11 of performing word segmentation on the range addresses corresponding to the first address and the second address respectively according to a pre-associated address dictionary in a natural language processing model to obtain a first word segmentation part corresponding to the first address and a first word segmentation part corresponding to the second address respectively, the method includes:

S10: calling an address database to correct the first address and the second address respectively according to a third preset rule.

The first address or the second address in this embodiment may be address data that does not conform to the national address database. Address correction, including address completion, removal of qualifiers and the like, can be performed by calling the address database. During address completion, an upper-level node can be completed from its child node; for example, Nanhai District can be completed with Foshan City. Alternatively, an intermediate node can be completed from the nodes before and after it; for example, given Foshan City and Guicheng Town, the intermediate node Nanhai District is filled in.
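A minimal sketch of such completion is given below, assuming the address database can be reduced to a child-to-parent lookup. The `PARENT` table, the function name `complete_address`, and the English renderings of the place names (Foshan City, Nanhai District, Guicheng Town) are assumptions made for illustration. Unlike the example above, this sketch always completes the chain up to the province.

```python
# Hypothetical child -> parent table standing in for the address database.
PARENT = {
    "Guicheng Town": "Nanhai District",
    "Nanhai District": "Foshan City",
    "Foshan City": "Guangdong Province",
}


def complete_address(levels):
    """Complete missing upper-level and intermediate nodes.

    `levels` lists the known administrative names from high to low,
    possibly with gaps. The result is the full chain from the top
    level down to the lowest known level.
    """
    if not levels:
        return []
    chain = []
    node = levels[-1]
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)  # None once the top level is reached
    chain.reverse()
    # Every name supplied by the caller must lie on the completed chain.
    unknown = [name for name in levels if name not in chain]
    if unknown:
        raise ValueError(f"inconsistent address levels: {unknown}")
    return chain


print(complete_address(["Nanhai District"]))
print(complete_address(["Foshan City", "Guicheng Town"]))
```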

Further, before step S1 of calling a preset matching algorithm and segmenting the first address and the second address into words respectively according to a first preset rule to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, the method includes:

S1a: indexing a specified number of unstructured address data pre-stored in the index server to obtain the preset index structure.

The data pre-stored in the index server is unstructured data stored in a key-value columnar storage mode. The unstructured data is columnar storage of text, images, voice and the like based on NoSQL storage technology. Because the data volume is very large, a NoSQL technology with a distributed architecture is required for storage and calculation, and the index server combines NoSQL distributed storage with an index structure to realize real-time, rapid query and calculation of massive data. NoSQL refers to non-relational databases and is an open source technology. Elasticsearch is based on key-value storage and an inverted index, and its calculation relies mainly on a large amount of memory, thereby realizing rapid real-time calculation.

S1b: receiving an interface plug-in uploaded to a designated directory of the index server, wherein the interface plug-in is formed by packaging the preset matching algorithm.

The index server of this embodiment is an open source component that supports a plug-in mode. The address matching algorithm plug-in can be developed as a custom extension by inheriting the plug-in interface of the index server, and can be loaded for use by restarting the index server.

S1c: acquiring configuration parameters of the interface plug-in.

S1d: establishing a calculation association between the preset index structure and the interface plug-in by running the configuration parameters.

After the preset matching algorithm has been developed, it is packaged and uploaded to the designated directory of the index server, and the relevant configuration parameters are set. The calculation association between the preset index structure and the interface plug-in is then established by loading and running the configuration parameters, so that the matching calculation of the first address in the preset index structure is completed by calling the address matching algorithm in the plug-in, thereby realizing address data query.

The index server of this embodiment is the open source Elasticsearch component (Elasticsearch is used for distributed full-text retrieval), which provides a full-text search engine with distributed computing capability based on a RESTful web interface and can query massive data quickly and in real time. The query steps are as follows: (1) The addresses of the massive address library are imported into the underlying storage of Elasticsearch as key-value pairs through the data import interface of Elasticsearch, and an index is established on the keys. (2) The address matching model is modified according to the custom extended search model of Elasticsearch, added to the extension module of the Elasticsearch master node, and Elasticsearch is restarted, forming an address matching model that Elasticsearch can calculate with distributed storage and high concurrency. (3) A one-to-many massive address matching interface is developed on Elasticsearch using the custom model. (4) An upper-layer interface is developed on Elasticsearch through which a new address can be input and the massive address library to be matched and the custom model can be selected. Fast real-time calculation between the new address and the addresses in the massive address library is then performed based on Elasticsearch, and the N most similar addresses (top N) are returned, where N can be set by a parameter passed by the program. In this way, an index structure is built for the massive data pre-stored in the index server, and real-time, rapid query of the first address in the preset index structure is achieved by combining the computing architecture of the Elasticsearch component with its strong distributed computing capability.
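The one-to-many matching of step (4) can be illustrated without an index server. The sketch below is a plain in-memory stand-in and does not use Elasticsearch: it scores every candidate against the query and returns the N best. The scoring function `toy_score`, the function name `top_n_matches` and the sample addresses are assumptions made for illustration.

```python
import heapq


def toy_score(first, second):
    """Fraction of positions at which the two segment tuples are equal."""
    same = sum(1 for a, b in zip(first, second) if a == b)
    return same / max(len(first), len(second))


def top_n_matches(query, candidates, score, n=3):
    """Return the n highest-scoring (score, candidate) pairs, best first."""
    scored = ((score(query, c), c) for c in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[0])


QUERY = ("Guangdong", "Foshan", "Nanhai", "Guicheng")
LIBRARY = [
    ("Hunan", "Changsha", "Yuelu", "Juzizhou"),
    ("Guangdong", "Foshan", "Nanhai", "Dali"),
    ("Guangdong", "Foshan", "Nanhai", "Guicheng"),
    ("Guangdong", "Foshan", "Chancheng", "Zumiao"),
]

best = top_n_matches(QUERY, LIBRARY, toy_score, n=2)
for value, address in best:
    print(value, "/".join(address))
```

In the embodiment this scoring runs inside the index server as a plug-in, so that the candidates are evaluated where they are stored.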

In this embodiment, the segments corresponding to different administrative levels of the first address use different matching methods, different matching models and different matching weights. The first address of this embodiment is divided into six segments, which correspond to six administrative levels and to six nodes in the tree structure. The first four of the six administrative levels share the same matching model and are matched character by character in one-to-one correspondence; the fifth administrative level is matched by a fuzzy matching model based on inclusion relationships; the sixth administrative level is matched by a numerical matching model. This embodiment sets a filtering mechanism in the matching calculation. First, exact matching is performed, by one-to-one character matching, on the target segments corresponding to the four administrative levels of province/city, district/county/town, village and road. When the matching result of the target segments corresponding to these four administrative levels is lower than a preset threshold, it is judged that the preset index structure contains no address data meeting the preset matching conditions with the first address, and the matching conclusion is output directly, which reduces the amount of matching calculation and improves the response speed. With this filtering mechanism, more than 90% of the addresses can be filtered out, so an address only needs to be fully matched against the remaining roughly 10% of addresses, which greatly saves computing resources.
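The filtering mechanism can be sketched as a cheap first pass over the four upper-level segments. The weights below follow the earlier worked example; the threshold value, the function names and the sample data are assumptions made for illustration.

```python
RANGE_WEIGHTS = (0.1, 0.1, 0.1, 0.1)  # weights of the first four levels


def range_score(first, second, weights=RANGE_WEIGHTS):
    """Sum the weights of the upper-level segments that match exactly."""
    return sum(w for a, b, w in zip(first, second, weights) if a == b)


def prefilter(query, candidates, threshold=0.4):
    """Keep only candidates whose first four levels reach the threshold.

    A small epsilon guards against floating-point rounding in the sum.
    """
    kept = []
    for candidate in candidates:
        if range_score(query[:4], candidate[:4]) + 1e-9 >= threshold:
            kept.append(candidate)
    return kept


QUERY = ("Guangdong", "Foshan", "Nanhai", "Guicheng", "Jiangnan Mingju", "1/306")
CANDIDATES = [
    ("Guangdong", "Foshan", "Nanhai", "Guicheng", "Jiangnan Mingju", "1/502"),
    ("Guangdong", "Guangzhou", "Tianhe", "Shipai", "Jiangnan Mingju", "1/306"),
]

survivors = prefilter(QUERY, CANDIDATES)
print(len(survivors), "of", len(CANDIDATES), "candidates need full matching")
```

Only the survivors go on to the more expensive inclusion and numerical matching of the fifth and sixth levels.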

Referring to fig. 2, in an address matching apparatus according to an embodiment of the present application, the first address is an address to be retrieved input by a user, and the second address is stored in an index server, where the apparatus includes:

The word segmentation module 1 is configured to call the preset matching algorithm and segment the first address and the second address into words respectively according to a first preset rule to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm comprises word segmentation calculation and matching calculation.

The dividing module 2 is configured to divide the first address into a plurality of first segments according to the first group of segments, and divide the second address into a plurality of second segments according to the second group of segments.

The first obtaining module 3 is configured to obtain matching results of all the first segments and all the second segments according to a second preset rule.

The judging module 4 is configured to judge whether the first address and the second address are the same according to the matching result.

Further, the word segmentation module 1 includes:

The first word segmentation unit is configured to segment the range addresses corresponding to the first address and the second address respectively according to a pre-associated address dictionary in a natural language processing model, to obtain a first word segmentation part corresponding to the first address and a first word segmentation part corresponding to the second address respectively.

The range address of this embodiment includes at least one of the four administrative levels of province, city, district/county and town. The range address is segmented by means of a pre-associated address dictionary, which is a word library corresponding to the national address database and is associated in advance with the natural language processing model for segmenting address names. To improve address matching precision, this embodiment uses the open source word segmentation package jieba and adds a crawler address library when performing word segmentation calculation: the address to be segmented is first corrected by combining the crawler address library with the national address library, and is then segmented according to administrative level, which improves word segmentation accuracy. It is judged whether the current address contains an administrative level for which the address dictionary is to be called, and if so, the address dictionary is called to perform word segmentation. For example, the address "Rong Yuan Block 1, Room 306, Jiangnan Mingju Community, Guicheng, Nanhai District, Foshan City, Guangdong Province" contains the four administrative levels for which the address dictionary is called, so those four levels are segmented according to the address dictionary, and the segmentation result is: Guangdong Province/Foshan City/Nanhai District/Guicheng/Jiangnan Mingju Community Rong Yuan Block 1 Room 306. The first word segmentation part is Guangdong Province/Foshan City/Nanhai District/Guicheng.
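The dictionary-based step can be sketched as repeated longest-prefix matching against the address dictionary. This is a simplified stand-in written for illustration: it does not use jieba, it works on an English rendering of the example with the levels written from high to low, and the dictionary contents and the function name `segment_range_address` are assumptions.

```python
# Hypothetical address dictionary covering the four administrative levels.
ADDRESS_DICTIONARY = [
    "Guangdong Province",
    "Foshan City",
    "Nanhai District",
    "Guicheng",
]


def segment_range_address(address, dictionary=ADDRESS_DICTIONARY):
    """Split leading dictionary entries off the address.

    Returns (parts, rest): the matched administrative names in order,
    and the remaining unsegmented text (flag and detail address).
    """
    entries = sorted(dictionary, key=len, reverse=True)  # longest first
    parts = []
    rest = address.strip()
    while True:
        for entry in entries:
            if rest.startswith(entry):
                parts.append(entry)
                rest = rest[len(entry):].lstrip()
                break
        else:
            break  # no dictionary entry matches the remaining text
    return parts, rest


parts, rest = segment_range_address(
    "Guangdong Province Foshan City Nanhai District Guicheng "
    "Jiangnan Mingju Community Rong Yuan Block 1 Room 306"
)
print("/".join(parts + [rest]))
```

The returned `parts` play the role of the first word segmentation part; `rest` is left for the grammar models that handle the flag address and the detail address.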

The second word segmentation unit is configured to segment the flag addresses corresponding to the first address and the second address respectively according to a first grammar model in a natural language processing model, to obtain a second word segmentation part corresponding to the first address and a second word segmentation part corresponding to the second address respectively.

The first composition unit is used for composing the first word segmentation part corresponding to the first address and the second word segmentation part corresponding to the first address into a first word segmentation group corresponding to the first address, and composing the first word segmentation part corresponding to the second address and the second word segmentation part corresponding to the second address into a second word segmentation group corresponding to the second address.

Further, the first address and the second address further include a detail address, respectively, and the word segmentation module 1 includes:

The third word segmentation unit is configured to segment the detail addresses corresponding to the first address and the second address respectively according to a second grammar model in a natural language processing model, to obtain a third word segmentation part corresponding to the first address and a third word segmentation part corresponding to the second address respectively.

The second composition unit is configured to compose the first word segmentation part, the second word segmentation part and the third word segmentation part corresponding to the first address into a first word segmentation group corresponding to the first address, and to compose the first word segmentation part, the second word segmentation part and the third word segmentation part corresponding to the second address into a second word segmentation group corresponding to the second address.

Further, the range address includes the four administrative levels of province, city, district/county and town, and the first obtaining module 3 includes:

The mapping unit is configured to map all the first segments and all the second segments into two structural trees with the same structure in order of administrative level from high to low, wherein each structural tree comprises a plurality of nodes, and each node corresponds one-to-one to a first segment or a second segment.

The first acquisition unit is used for acquiring the matching values corresponding to the nodes of the two structural trees respectively.

The correspondence between the nodes of the two structural trees is established according to the correspondence of administrative levels, and the matching value of each node is obtained according to that correspondence, wherein a matching value is the proportion of matching segments among all segments corresponding to the node. For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", the nodes match; otherwise they do not match.

The second obtaining unit is used for obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address and the third weight corresponding to the detail address respectively.

The calculating unit is configured to calculate matching rates by multiplying each matching value by its corresponding weight, to obtain a first matching rate corresponding to the range address, a second matching rate corresponding to the flag address and a third matching rate corresponding to the detail address respectively.

The adding unit is configured to add the first matching rate, the second matching rate and the third matching rate, and to take the sum as the matching result of all the first segments and all the second segments.

Further, the first acquisition unit includes:

The first matching subunit is configured to perform exact full matching between each first segment corresponding to the range address in the first address and each second segment corresponding to the range address in the second address, one by one according to the node correspondence, to obtain the first matching values.

The second matching subunit is configured to match each first segment corresponding to the flag address in the first address with each second segment corresponding to the flag address in the second address, one by one according to the node correspondence, to obtain the second matching values.

The third matching subunit is configured to perform numerical matching between each first segment corresponding to the detail address in the first address and each second segment corresponding to the detail address in the second address, one by one according to the node correspondence, to obtain the third matching values.

The summarizing subunit is configured to summarize the first matching values, the second matching values and the third matching values to obtain the matching values corresponding to the nodes of the two structural trees respectively.
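The three kinds of matching performed by the subunits above can be sketched as three small functions. This is an illustrative reading of the embodiment rather than its actual implementation: the function names are hypothetical, inclusion matching is assumed to be a substring test per field, and numerical matching is assumed to be the fraction of equal fields.

```python
def exact_match(first, second):
    """Range address: 1.0 only if the two segments are identical."""
    return 1.0 if first == second else 0.0


def inclusion_match(first_fields, second_fields):
    """Flag address: share of field pairs where one contains the other."""
    total = max(len(first_fields), len(second_fields))
    if total == 0:
        return 0.0
    hits = sum(
        1
        for a, b in zip(first_fields, second_fields)
        if a and b and (a in b or b in a)
    )
    return hits / total


def numeric_match(first_fields, second_fields):
    """Detail address: share of field pairs whose values are equal."""
    total = max(len(first_fields), len(second_fields))
    if total == 0:
        return 0.0
    hits = sum(
        1
        for a, b in zip(first_fields, second_fields)
        if a.strip() == b.strip()
    )
    return hits / total


print(exact_match("Guangdong", "Guangdong"))
print(inclusion_match(["Jiangnan Mingju Community", "Rong Yuan"],
                      ["Jiangnan Mingju", "Rong Yuan"]))
print(numeric_match(["1", "306"], ["1", "502"]))
```

Summarizing then amounts to collecting these values in node order so that each can be multiplied by its node weight.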

Further, the first obtaining module 3 includes:

The input unit is configured to input a specified number of training samples with pre-labeled similarity values into the natural language processing model for training.

The adjusting unit is configured to adjust the training parameters to first parameters, so that the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value.

The corresponding unit is configured to assign the weight values corresponding to the first parameters to the first weight, the second weight and the third weight respectively, according to the node correspondence.

Further, the word segmentation module 1 includes:

The calling unit is configured to call the address database to correct the first address and the second address respectively according to a third preset rule.

Further, the address matching device further includes:

The indexing module is configured to index a specified number of unstructured address data pre-stored in the index server to obtain the preset index structure.

The receiving module is configured to receive an interface plug-in uploaded to a designated directory of the index server, wherein the interface plug-in is formed by packaging the preset matching algorithm.

The second obtaining module is configured to obtain the configuration parameters of the interface plug-in.

The establishing module is used for establishing a calculation association relation between the preset index structure and the interface plug-in by running the configuration parameters.

After the address matching algorithm has been developed, it is packaged and uploaded to the designated directory of the index server, and the relevant configuration parameters are set. The calculation association between the preset index structure and the interface plug-in is then established by loading and running the configuration parameters, so that the matching calculation of the first address in the preset index structure is completed by calling the address matching algorithm in the plug-in, thereby realizing address data query.

Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store all the data required for the address matching process. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements an address matching method.

The processor executes the address matching method, wherein the first address is an address to be retrieved input by a user, and the second address is stored in an index server, and the method comprises the following steps: calling a preset matching algorithm, and respectively word-segmenting the first address and the second address according to a first preset rule to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address; dividing the first address into a plurality of first segments according to the first word-dividing group, and dividing the second address into a plurality of second segments according to the second word-dividing group; obtaining matching results of all the first segments and all the second segments according to a second preset rule; and judging whether the first address and the second address are the same according to the matching result.

In the computer device, the data pre-stored in the index server is unstructured data stored in a key-value columnar storage mode. The unstructured data is columnar storage of text, images, voice and the like based on NoSQL storage technology; because the data volume is very large, a NoSQL technology with a distributed architecture is required for storage and calculation, and the index server combines NoSQL distributed storage with an index structure to realize real-time, rapid query and calculation of massive data. A configurable-weight address matching model based on multi-level division of addresses is provided. The address name is first segmented into a word segmentation group by a natural language processing model, the word segmentation group is divided into segments according to administrative level, and the segments are mapped to nodes in a tree structure. The tree structure of addresses is thus fully taken into account: addresses are segmented by administrative level, each administrative level segment is given its own weight, and the weights can be fine-tuned for the actual service scenario. By establishing an index structure for the massive data pre-stored in the index server and combining the computing architecture of the Elasticsearch component with its strong distributed computing capability, real-time, rapid query of the first address in the preset index structure is realized. The first four administrative levels of the segmented address are matched exactly against the national province, city, district and town address library (tree type), and partially missing levels are effectively completed.
The default weights are obtained by training a model. During training, the training parameters, which include the weight values, are adjusted continuously until the similarity output by the model is consistent with the pre-labeled similarity value or lies within a preset deviation range of it, and the weight values are thereby determined, which makes the weight setting more reliable.

In one embodiment, the first address and the second address include a range address and a flag address, respectively, the processor invokes the preset matching algorithm, and the step of word segmentation is performed on the first address and the second address according to a first preset rule to obtain a first word segment group corresponding to the first address and a second word segment group corresponding to the second address, respectively, includes: the range addresses corresponding to the first address and the second address are segmented according to a pre-associated address dictionary in a natural language processing model, and a first segmentation part corresponding to the first address and a first segmentation part corresponding to the second address are obtained respectively; the first address and the second address are respectively corresponding to the mark address, word segmentation is carried out according to a first grammar model in a natural language processing model, and a second word segmentation part corresponding to the first address and a second word segmentation part corresponding to the second address are respectively obtained; and forming a first word segmentation part corresponding to the first address and a second word segmentation part corresponding to the first address into a first word segmentation group corresponding to the first address, and forming the first word segmentation part corresponding to the second address and the second word segmentation part corresponding to the second address into a second word segmentation group corresponding to the second address.

In one embodiment, the first address and the second address further include detailed addresses, and the step of the processor performing word segmentation according to a first grammar model in a natural language processing model by using the flag addresses corresponding to the first address and the second address respectively to obtain a second word segmentation part corresponding to the first address and a second word segmentation part corresponding to the second address respectively includes: the detail addresses corresponding to the first address and the second address are segmented according to a second grammar model in a natural language processing model, and a third segmentation part corresponding to the first address and a third segmentation part corresponding to the second address are obtained respectively; and forming a first word segmentation group corresponding to the first address by the first word segmentation part corresponding to the first address, the second word segmentation part corresponding to the first address and the third word segmentation part corresponding to the first address, and forming a second word segmentation group corresponding to the second address by the first word segmentation part corresponding to the second address, the second word segmentation part corresponding to the second address and the third word segmentation part corresponding to the second address.

In one embodiment, the range address includes the four administrative levels of province, city, district/county and town, the flag address includes a community name or a building name, and the step of the processor obtaining matching results of all the first segments and all the second segments according to a second preset rule includes: mapping all the first segments and all the second segments into two structural trees with the same structure in order of administrative level from high to low, wherein each structural tree comprises a plurality of nodes, and each node corresponds one-to-one to a first segment or a second segment; obtaining matching values corresponding to the nodes of the two structural trees respectively; obtaining a first weight corresponding to the range address, a second weight corresponding to the flag address and a third weight corresponding to the detail address respectively; calculating matching rates by multiplying each matching value by its corresponding weight, to obtain a first matching rate corresponding to the range address, a second matching rate corresponding to the flag address and a third matching rate corresponding to the detail address respectively; and adding the first matching rate, the second matching rate and the third matching rate to obtain the matching result of all the first segments and all the second segments.

In one embodiment, the step of obtaining, by the processor, matching values corresponding to nodes of two structural trees includes: each first segment corresponding to the range address in the first address is precisely and fully matched with each second segment corresponding to the range address in the second address according to the one-to-one correspondence of the node correspondence, and each first matching value is obtained; performing model keyword matching on each first segment corresponding to the mark address in the first address and each second segment corresponding to the mark address in the second address according to the one-to-one correspondence of the node correspondence, so as to obtain each second matching value; performing digital matching on each first segment corresponding to the detail address in the first address and each second segment corresponding to the detail address in the second address according to the one-to-one correspondence of the node correspondence, so as to obtain each third matching value; and summarizing the first matching values, the second matching values and the third matching values to obtain matching values corresponding to nodes of the two structural trees respectively.

In one embodiment, before the step of obtaining the first weight corresponding to the range address, the second weight corresponding to the flag address, and the third weight corresponding to the detail address, the processor includes: inputting a specified number of training samples with pre-labeled similarity values into the natural language processing model for training; the similarity value output by the natural language processing model is consistent with the pre-labeling similarity value by adjusting training parameters to first parameters; and respectively corresponding the weight values corresponding to the first parameters to the first weight, the second weight and the third weight according to the node corresponding relation.

Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device to which the present application is applied.

An embodiment of the present application further provides a computer readable storage medium having a computer program stored thereon, where the computer program when executed by a processor implements an address matching method, a first address being an address to be retrieved input by a user, and a second address being stored in an index server, the method including: calling a preset matching algorithm, and respectively word-segmenting the first address and the second address according to a first preset rule to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address; dividing the first address into a plurality of first segments according to the first word-dividing group, and dividing the second address into a plurality of second segments according to the second word-dividing group; obtaining matching results of all the first segments and all the second segments according to a second preset rule; and judging whether the first address and the second address are the same according to the matching result.

In the computer readable storage medium, the data pre-stored in the index server is unstructured data stored in a key-value columnar storage mode. The unstructured data is columnar storage of text, images, voice and the like based on NoSQL storage technology; because the data volume is very large, a NoSQL technology with a distributed architecture is required for storage and calculation, and the index server combines NoSQL distributed storage with an index structure to realize real-time, rapid query and calculation of massive data. A configurable-weight address matching model based on multi-level division of addresses is provided. The address name is first segmented into a word segmentation group by a natural language processing model, and the word segmentation group is divided into segments according to administrative level. The tree structure of addresses is thus fully taken into account: addresses are segmented by administrative level, each administrative level segment is given its own weight, and the weights can be fine-tuned for the actual service scenario. By establishing an index structure for the massive data pre-stored in the index server and combining the computing architecture of the Elasticsearch component with its strong distributed computing capability, real-time, rapid query of the first address in the preset index structure is realized. The first four administrative levels of the segmented address are matched exactly against the national province, city, district and town address library (tree type), and partially missing levels are effectively completed.
The default weights are obtained by training a model. During training, the training parameters, which include the weight values, are adjusted continuously until the similarity output by the model is consistent with the pre-labeled similarity value or lies within a preset deviation range of it, and the weight values are thereby determined, which makes the weight setting more reliable.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, apparatus, article, or method that comprises the element.

The foregoing describes only preferred embodiments of the present application and is not intended to limit the scope of the claims. All equivalent structures or equivalent processes derived from the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of the claims of the present application.

Claims

1. An address matching method, characterized in that a first address is an address to be retrieved that is input by a user, and a second address is stored in an index server, the method comprising the following steps:

judging whether the first address and the second address are the same according to the matching result;

the step of calling a preset matching algorithm to segment the first address and the second address respectively according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, comprises the following steps:

and calling an address database to respectively correct the first address and the second address according to a third preset rule.

2. The address matching method according to claim 1, wherein the first address and the second address each include a range address and a mark address, and the step of calling a preset matching algorithm to segment the first address and the second address respectively according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, includes:

segmenting the range addresses corresponding to the first address and the second address according to a pre-associated address dictionary in a natural language processing model, to obtain respectively a first word segmentation part corresponding to the first address and a first word segmentation part corresponding to the second address;

segmenting the mark addresses corresponding to the first address and the second address according to a first grammar model in the natural language processing model, to obtain respectively a second word segmentation part corresponding to the first address and a second word segmentation part corresponding to the second address;

and forming the first word segmentation part corresponding to the first address and the second word segmentation part corresponding to the first address into the first word segmentation group corresponding to the first address, and forming the first word segmentation part corresponding to the second address and the second word segmentation part corresponding to the second address into the second word segmentation group corresponding to the second address.
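As a hedged illustration of dictionary-based segmentation of a range address, the sketch below uses forward maximum matching, a common dictionary segmentation technique that is swapped in here because the claim does not specify how the address dictionary is applied. The dictionary contents `RANGE_DICTIONARY` and the function name `segment_range` are assumptions of the sketch.

```python
# Hypothetical address dictionary of administrative-level names.
RANGE_DICTIONARY = {"广东省", "深圳市", "福田区", "南山区"}
MAX_WORD_LEN = max(len(word) for word in RANGE_DICTIONARY)


def segment_range(address):
    """Split the leading range address into dictionary words.

    Forward maximum matching: at each position, take the longest
    dictionary word that starts there. Segmentation stops at the first
    position where no dictionary word matches; the unmatched tail (mark
    and detail address) is returned for other models to process.
    """
    tokens = []
    pos = 0
    while pos < len(address):
        longest = min(MAX_WORD_LEN, len(address) - pos)
        for length in range(longest, 0, -1):
            candidate = address[pos:pos + length]
            if candidate in RANGE_DICTIONARY:
                tokens.append(candidate)
                pos += length
                break
        else:
            break  # no dictionary word starts here: range address ends
    return tokens, address[pos:]


range_tokens, remainder = segment_range("广东省深圳市福田区平安金融中心")
```

Here `range_tokens` holds the three administrative-level words and `remainder` holds the building name, which would be passed on to the grammar-model segmentation of the mark address.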

3. The address matching method according to claim 2, wherein the first address and the second address each further include a detail address, and the step of performing word segmentation according to a first grammar model in a natural language processing model, to obtain respectively a second word segmentation part corresponding to the first address and a second word segmentation part corresponding to the second address, includes:

segmenting the detail addresses corresponding to the first address and the second address according to a second grammar model in the natural language processing model, to obtain respectively a third word segmentation part corresponding to the first address and a third word segmentation part corresponding to the second address;

and forming a first word segmentation group corresponding to the first address by the first word segmentation part corresponding to the first address, the second word segmentation part corresponding to the first address and the third word segmentation part corresponding to the first address, and forming a second word segmentation group corresponding to the second address by the first word segmentation part corresponding to the second address, the second word segmentation part corresponding to the second address and the third word segmentation part corresponding to the second address.

4. The address matching method according to claim 3, wherein the range address includes four administrative levels of province, city, district, and town, the mark address includes a residential community name or a building name, and the step of acquiring matching results of all the first segments and all the second segments according to a second preset rule includes:

mapping all the first segments and all the second segments into two structural trees with the same structure, in order of administrative level from high to low, wherein each structural tree comprises a plurality of nodes, and the nodes correspond one-to-one to the first segments or the second segments;

obtaining the matching values corresponding to the nodes of the two structural trees respectively;

respectively acquiring a first weight corresponding to the range address, a second weight corresponding to the mark address and a third weight corresponding to the detail address;

calculating matching rates by multiplying each matching value by its corresponding weight, to obtain respectively a first matching rate corresponding to the range address, a second matching rate corresponding to the mark address, and a third matching rate corresponding to the detail address;

and adding the first matching rate, the second matching rate and the third matching rate to obtain the matching result of all the first segments and all the second segments.
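The weighted combination recited in claim 4 can be sketched as follows. The claim does not state how several node matching values within one address part (for example the four administrative levels of the range address) are reduced to a single value, so this sketch assumes a simple average; the function name `matching_result`, the part keys, and the example weights are likewise assumptions.

```python
def matching_result(match_values, weights):
    """Combine per-node matching values into one matching result.

    match_values maps each address part ("range", "mark", "detail") to
    the list of matching values of its nodes; weights maps the same keys
    to the weight of that part. Assumption of this sketch: the node
    values of a part are averaged before being multiplied by the weight.
    """
    total = 0.0
    for part, values in match_values.items():
        part_value = sum(values) / len(values) if values else 0.0
        total += part_value * weights[part]  # matching rate of this part
    return total


# Range and mark addresses agree; the detail address differs.
example_values = {
    "range": [1.0, 1.0, 1.0, 1.0],  # province, city, district, town
    "mark": [1.0],
    "detail": [0.0],
}
example_weights = {"range": 0.5, "mark": 0.3, "detail": 0.2}
result = matching_result(example_values, example_weights)
```

With the assumed weights, the first matching rate is 0.5, the second is 0.3, and the third is 0, so the matching result is approximately 0.8.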

5. The address matching method according to claim 4, wherein the step of obtaining the matching values corresponding to the nodes of the two structural trees respectively comprises:

performing exact full matching between each first segment corresponding to the range address in the first address and each second segment corresponding to the range address in the second address, according to the one-to-one node correspondence, to obtain the first matching values;

performing model keyword matching between each first segment corresponding to the mark address in the first address and each second segment corresponding to the mark address in the second address, according to the one-to-one node correspondence, to obtain the second matching values;

performing digital matching between each first segment corresponding to the detail address in the first address and each second segment corresponding to the detail address in the second address, according to the one-to-one node correspondence, to obtain the third matching values;

and summarizing the first matching values, the second matching values and the third matching values to obtain matching values corresponding to nodes of the two structural trees respectively.
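The three kinds of node matching in claim 5 might be sketched as below. The claim names the matching modes but not their rules, so the concrete rules here are assumptions: exact matching compares strings for equality, keyword matching checks that both segments contain the same non-empty set of keywords from a supplied list, and digital matching compares the sequences of numbers found in the two segments. All function names are hypothetical.

```python
import re


def exact_match(first, second):
    """Range address: 1.0 only if the two segments are identical."""
    return 1.0 if first == second else 0.0


def keyword_match(first, second, keywords):
    """Mark address: 1.0 if both segments contain the same non-empty
    set of keywords from the supplied (hypothetical) keyword list."""
    found_first = {k for k in keywords if k in first}
    found_second = {k for k in keywords if k in second}
    return 1.0 if found_first and found_first == found_second else 0.0


def digit_match(first, second):
    """Detail address: 1.0 if both segments contain the same sequence
    of numbers (building, unit, room), ignoring the surrounding text."""
    numbers_first = re.findall(r"\d+", first)
    numbers_second = re.findall(r"\d+", second)
    return 1.0 if numbers_first and numbers_first == numbers_second else 0.0


keywords = {"Ping An Finance Centre"}
range_value = exact_match("Futian District", "Futian District")
mark_value = keyword_match(
    "Ping An Finance Centre South Tower", "Ping An Finance Centre", keywords
)
detail_value = digit_match("Building 3, Room 1201", "Bldg 3 Rm 1201")
```

Digital matching in this form tolerates different abbreviations around the numbers, while a different room number yields a matching value of 0.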

6. The address matching method according to claim 5, wherein before the step of obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detail address, respectively, the method comprises:

inputting a specified number of training samples with pre-labeled similarity values into the natural language processing model for training;

adjusting the training parameters to first parameters so that the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value;

and assigning the weight values corresponding to the first parameters to the first weight, the second weight, and the third weight respectively, according to the node correspondence.

7. The address matching method according to claim 2, wherein before the step of calling a preset matching algorithm to segment the first address and the second address respectively according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, the method comprises:

indexing a specified quantity of unstructured address data pre-stored in the index server to obtain a preset index structure;

receiving an interface plug-in uploaded to a specified directory of the index server, wherein the interface plug-in is formed by packaging the preset matching algorithm;

acquiring configuration parameters of the interface plug-in;

and establishing a calculation association relation between the preset index structure and the interface plug-in by running the configuration parameters.

8. An address matching apparatus, wherein a first address is an address to be retrieved input by a user, and a second address is stored in an index server, the apparatus comprising:

the judging module is used for judging whether the first address and the second address are the same according to the matching result;

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.