CN117874214A - Method and equipment for standardized management and dynamic matching of address information - Google Patents


Info

Publication number
CN117874214A
Authority
CN
China
Prior art keywords
address
entity
new
standard
place name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410276382.4A
Other languages
Chinese (zh)
Other versions
CN117874214B (en)
Inventor
林韶军
黄炳裕
陈征宇
黄河
侯国通
戴文艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Evecom Information Technology Development Co ltd
Original Assignee
Evecom Information Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Evecom Information Technology Development Co ltd
Priority to CN202410276382.4A
Publication of CN117874214A
Application granted
Publication of CN117874214B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and equipment for standardized management and dynamic matching of address information, comprising the following steps: automatically constructing triplet data (place name entity 1, relationship, place name entity 2) representing the relationship between two place name entities; using Cypher syntax to map the place name entities and relationships in each triplet into nodes and edges of a knowledge graph, respectively, thereby generating an address knowledge graph; and, according to the seven-level administrative hierarchy of a standard address, screening out from the address knowledge graph the address entities whose administrative level is the seventh level, then searching upward level by level for the address entities of the higher administrative levels by graph query, completing the seven levels of place name entities to obtain a standard address, and storing the standard address in a standard address database, thereby completing the standardized management of the original address information. The advantages of the invention are that original address information is managed automatically, the accuracy and efficiency of address standardization are improved, and labor cost is reduced.

Description

Method and equipment for standardized management and dynamic matching of address information
Technical Field
The invention relates to the field of artificial intelligence data processing, in particular to a method and equipment for standardized management and dynamic matching of address information.
Background
With the continuing digitization of China and the spread of information systems, the address field serves as key information that uniquely identifies a place, and filling it in accurately and in compliance with the standard is a key link in the operation of information systems in fields such as government affairs, logistics, and trade. In practice, however, the way address fields are filled in and annotated differs from field to field and from industry to industry. Moreover, an address that conforms to the unified national standard must include several elements, such as the administrative area name, the basic area qualifier name, the primary local point position description, and the secondary local point position description; it has many levels, and the information described at each level is complex. As a result, a great deal of manual effort is needed to correct and annotate original address information, while users find address entry slow and laborious and are prone to omitting key elements, mistyping characters, or dropping characters, which distorts the information. How to unify the standard address fields used across fields and industries according to the national standard, simplify address entry for users, and improve its efficiency and accuracy has therefore become an urgent problem.
Disclosure of Invention
To solve the above problems, the invention aims to provide a method for standardized management and dynamic matching of address information, which manages original address information automatically, improves the accuracy and efficiency of address standardization, and reduces labor cost.
To achieve the above purpose, the invention adopts the following technical solutions:
Technical Solution One
A method for standardized management and dynamic matching of address information comprises the following steps:
automatically constructing triplet data (place name entity 1, relationship, place name entity 2) representing the relationship between two place name entities: defining the relationships among place name entities; creating, for each relationship, an initial prompt and a plurality of example entity pairs corresponding to the initial prompt; randomly selecting one of the example entity pairs and inserting it into the corresponding initial prompt to generate an initial sentence; paraphrasing the initial sentence to generate a plurality of paraphrased sentences with the same meaning, and screening out the several sentences with the lowest similarity as candidate sentences; deleting the place name entities from the candidate sentences so that each candidate sentence yields a new prompt, and grouping the new prompts by relationship type to generate a prompt set for each relationship type; inputting the new prompts into a language model LM and searching for new entity pairs consistent with all of the new prompts; and combining each new entity pair found with the relationship type of the corresponding new prompts to construct the triplet data;
using Cypher syntax to map the place name entities and relationships in each triplet into nodes and edges of a knowledge graph, respectively, thereby generating an address knowledge graph; and, according to the seven-level administrative hierarchy of a standard address, screening out from the address knowledge graph the address entities whose administrative level is the seventh level, then searching upward level by level for the address entities of the higher administrative levels by graph query, completing the seven levels of place name entities to obtain a standard address, and storing the standard address in a standard address database, thereby completing the standardized management of the original address information.
Preferably, to ensure that the entity pairs found are consistent with the corresponding relationship, the following steps are further performed: after the prompt set is generated, a matching score is calculated between each new prompt and each corresponding example entity pair, reflecting how well both the entity pair as a whole and each single address entity in it match the new prompt; the matching score is calculated as:
$$f(\langle h,t\rangle,p)=\alpha\,\log P_{LM}(h,t\mid p)+(1-\alpha)\,\min\{\log P_{LM}(h\mid p),\ \log P_{LM}(t\mid p)\}$$
where $\alpha$ is a balance factor and $h$ and $t$ denote the two place name entities of the pair; the first term is the predicted log-probability of the corresponding entity pair $\langle h,t\rangle$; the second term is the minimum log-likelihood of a single place name entity of the pair under the new prompt $p$. The value obtained by normalizing the matching scores with a softmax function is used as the confidence weight of the new prompt; all new prompts in the prompt set are traversed to obtain the confidence weight $w_p$ of each new prompt. The new prompts in the prompt set are input into the language model LM, which searches for new entity pairs consistent with all the new prompts in the prompt set, as follows: the number M of new entity pairs required for the relationship corresponding to the prompt set is preset; during the search, M new entity pairs with high consistency are screened out by considering the minimum log-likelihood of a single address entity: after each new prompt is input, the first M new entity pairs found are stored as a candidate heap, and the single-entity log-likelihood value of each new entity pair is calculated as
$$\ell(\langle h,t\rangle,p)=\min\{\log P_{LM}(h\mid p),\ \log P_{LM}(t\mid p)\}$$
The candidate heap is then ordered from smallest to largest, with the new entity pair having the smallest single-entity log-likelihood value at the top of the heap and that value used as the threshold. When the next new entity pair is found, its log-likelihood value is calculated: if it is lower than the current threshold, the new entity pair is discarded; if it is higher than the current threshold, the new entity pair is inserted at the position in the candidate heap determined by its log-likelihood value, the entity pair at the top of the heap is removed, and the current threshold is updated dynamically. After the search is finished, the new entity pairs in the candidate heap are stored in a candidate set. Each new entity pair in the candidate set is denoted $\langle h^{new},t^{new}\rangle$, and its consistency score is calculated with the following consistency score function:
$$s(\langle h^{new},t^{new}\rangle)=\sum_{p} w_p\, f(\langle h^{new},t^{new}\rangle,p)$$
where $w_p$ is the confidence weight and $p$ ranges over the new prompts in the prompt set. All new entity pairs in the candidate set are re-ordered from the highest consistency score to the lowest, and the top N1 new entity pairs are output as the search result. The following step is then performed: querying the prompt set corresponding to the new entity pairs found, determining the relationship type corresponding to that prompt set, and combining the new entity pairs with that relationship type to construct the triplet data.
More preferably, address information input in real time in an information system is dynamically matched based on the constructed address knowledge graph, as follows: acquiring the address information input by a user in real time and using it as key information; inputting the key information into a trained mapping model, which outputs a standard place name entity; judging whether the standard place name entity is at the seventh level of the administrative hierarchy; if so, obtaining the standard address information from the address knowledge graph by graph query and offering it to the user for selection; if not, obtaining the upper-level administrative region information from the address knowledge graph by graph query, completing the upper-level administrative region information of the standard place name entity, matching the result against the standard addresses in the standard address database through a matching model, and outputting the top N2 standard addresses from the matching result for the user to select or modify.
More preferably, the mapping model is implemented with a Transformer model, a machine learning algorithm that uses an attention mechanism to capture the dependencies and context information in the address text. In the training data set, each training sample comprises the key information of an address and the place name entity of a node obtained from the address knowledge graph; the place name entity of the node serves as the label (ground truth) for supervised learning. Before being input to the model, the key information is randomly masked in the manner of MLM (masked language model) pre-training; the corresponding hidden representation is then obtained through the encoder of the Transformer model and mapped through an additional fully connected layer to a score for each possible standard place name entity. Finally, the scores are input to a softmax function and converted into a probability distribution, giving the predicted probability of each standard place name entity, and the standard place name entity with the highest predicted probability is output. During training, a cross-entropy loss function measures the difference between the standard address entity predicted by the model and the address entity of the node in the address knowledge graph, calculated as:
$$L=-\frac{1}{N}\sum_{i=1}^{N} y_i\,\log \hat{y}_i$$
where $N$ is the number of samples, $i$ is the sample index, $y_i$ is the actual label of the i-th sample, and $\hat{y}_i$ is the model's predicted probability for the i-th sample.
More preferably, the matching model is implemented using an unsupervised SimCSE-based model.
Based on the same inventive concept, the invention also provides a device for standardized management and dynamic matching of address information.
Technical Solution Two
Equipment for standardized management and dynamic matching of address information, comprising a memory storing an executable program and a processor that runs the program to perform the following steps: automatically constructing triplet data (place name entity 1, relationship, place name entity 2) representing the relationship between two place name entities; using Cypher syntax to map the place name entities and relationships in each triplet into nodes and edges of a knowledge graph, respectively, thereby generating an address knowledge graph; and, according to the seven-level administrative hierarchy of a standard address, screening out from the address knowledge graph the address entities whose administrative level is the seventh level, then searching upward level by level for the address entities of the higher administrative levels by graph query, completing the seven levels of place name entities to obtain a standard address, and storing the standard address in a standard address database, thereby completing the standardized management of the original address information.
The invention has the following beneficial effects:
1. Once the initial prompts and the associated example entity pairs have been defined, the invention manages original address information automatically across the whole process: it automatically constructs the address knowledge graph, generates standard place name entities, and imports them into the standardized address database.
2. The invention generates new prompts automatically by means of a paraphrasing model, which ensures that each relationship is expressed in a wide variety of ways and improves the comprehensiveness of the address search.
3. The invention determines the confidence weight of each new prompt from how well the entity pairs match it, and applies the confidence weight both in the language model search and in the consistency judgment, improving the accuracy and efficiency of the search results.
4. The invention collects the key information of an address in real time and fills in address information automatically based on the address knowledge graph; combined with the matching model, it automatically generates recommended standard address information, which avoids incomplete or erroneous standard data caused by, for example, fields left unfilled, simplifies the business process, and improves the efficiency with which users handle business.
Drawings
FIG. 1 is a flow chart of the standard address management of the present invention;
FIG. 2 is a flow chart of dynamic matching according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and specific examples:
Example One
Referring to FIG. 1, a method for standardized management and dynamic matching of address information includes the following steps:
Step 10, automatically constructing triplet data (place name entity 1, relationship, place name entity 2) representing the relationship between two place name entities. The specific process is as follows:
Step 11, defining the relationships that exist between place name entities, which are of several types: administrative hierarchy relationships, alias relationships, historical-name relationships, and inclusion relationships; Table 1 gives examples of some place name entities and the relationships between them. The administrative hierarchy relationship between place name entities is divided into seven levels: province, city, district (county), street (township/town), road (lane), community (natural village/administrative village), and residential quarter (village). That is, when an administrative hierarchy relationship is defined, it must be further specified as a province-level, city-level, district (county)-level, street (township/town)-level, road (lane)-level, community (natural village/administrative village)-level, or residential quarter (village)-level relationship.
TABLE 1
Step 12, creating, for each relationship, an initial prompt and several example entity pairs corresponding to the initial prompt. For example: the initial prompt "The provincial capital of A is B" together with a small number of example entity pairs, such as {(Fujian Province, Fuzhou City), (Hubei Province, Wuhan City), ...}.
Step 13, randomly selecting one of the example entity pairs and inserting it into the corresponding initial prompt to generate an initial sentence. For example: "The provincial capital of Fujian Province is Fuzhou City".
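As a non-limiting illustration, steps 12 and 13 may be sketched in Python as follows. The template string, the placeholder names A and B, and the function name make_initial_sentence are assumptions introduced here for illustration only and are not prescribed by the method.

```python
import random

# Hypothetical initial prompt and example entity pairs for the
# provincial capital relationship used in the example above.
INITIAL_PROMPT = "The provincial capital of {A} is {B}"
EXAMPLE_PAIRS = [
    ("Fujian Province", "Fuzhou City"),
    ("Hubei Province", "Wuhan City"),
]


def make_initial_sentence(prompt, pairs, rng):
    """Pick one example pair at random and substitute it into the prompt."""
    head, tail = rng.choice(pairs)
    return prompt.format(A=head, B=tail)


# A seeded generator keeps the illustration reproducible.
sentence = make_initial_sentence(INITIAL_PROMPT, EXAMPLE_PAIRS, random.Random(0))
print(sentence)
```

Passing the random generator as a parameter keeps the selection reproducible in tests while leaving the choice random in normal use.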
Step 14, inputting the initial sentence into a text paraphrasing model to generate several paraphrased sentences with the same meaning. For example, after paraphrasing by the text paraphrasing model, the initial sentence "The provincial capital of Fujian Province is Fuzhou City" may become "Fuzhou City is the provincial capital of Fujian Province", "If the province is Fujian Province, its provincial capital is Fuzhou City", and so on. Paraphrasing ensures that the relationship is expressed in a wide variety of ways.
Step 15, calculating the pairwise similarity among the paraphrased sentences and the initial sentence, and retaining the K sentences with the lowest similarity as candidate sentences, where K is a preset value. One possible implementation is as follows: for a new prompt, its similarity to each prompt already in the prompt set is calculated one by one and the maximum is taken; if the maximum is below a preset comparison threshold, the new prompt is added to the prompt set, otherwise it is discarded. The comparison threshold may be set to 75. After paraphrasing is finished, the K sentences with the lowest similarity in the prompt set are retained and the remaining sentences are discarded.
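The filtering of step 15 may be sketched as follows, under stated assumptions: the similarity metric is not specified above, so the sketch substitutes the character-level ratio of Python's difflib scaled to 0-100 (matching the scale implied by the threshold of 75), and it ranks the retained sentences by their similarity to their nearest retained neighbour. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity (difflib ratio; an assumed stand-in metric)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def select_candidates(sentences, k, threshold=75.0):
    """Greedily drop near-duplicates, then keep the k least similar sentences."""
    kept = []
    for s in sentences:
        # Largest similarity between s and any sentence already kept.
        max_sim = max((similarity(s, t) for t in kept), default=0.0)
        if max_sim < threshold:
            kept.append(s)

    def closest(s):
        # Similarity of s to its nearest neighbour among the other kept sentences.
        return max((similarity(s, t) for t in kept if t != s), default=0.0)

    return sorted(kept, key=closest)[:k]


paraphrases = [
    "The provincial capital of Fujian Province is Fuzhou City",
    "The provincial capital of Fujian Province is Fuzhou City.",
    "If the province is Fujian Province, its provincial capital is Fuzhou City",
]
candidates = select_candidates(paraphrases, k=2)
print(candidates)
```

A semantic similarity model could replace the difflib ratio without changing the surrounding logic.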
Step 16, deleting the place name entities from the candidate sentences, so that each candidate sentence yields a new prompt and a prompt set containing K new prompts is obtained; one prompt set is generated for each relationship type. For example, for the paraphrased sentences "Fuzhou City is the provincial capital of Fujian Province" and "If the province is Fujian Province, its provincial capital is Fuzhou City", the new prompts generated after deleting the entity names are "B is the provincial capital of A" and "If the province is A, its provincial capital is B", respectively. Further, by paraphrasing the new prompts again, steps 14 through 16 may be performed repeatedly to generate more new prompts.
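Step 16 may be illustrated with the following minimal sketch, which assumes that the entity names appear verbatim in the candidate sentence and uses the placeholders A and B from the example above. The function name and the relationship key PROVINCIAL_CAPITAL are hypothetical.

```python
from collections import defaultdict


def sentence_to_prompt(sentence, head, tail):
    """Replace the two place name entities with the placeholders A and B."""
    if head not in sentence or tail not in sentence:
        raise ValueError("both entities must appear in the sentence")
    return sentence.replace(head, "A").replace(tail, "B")


# One prompt set per relationship type, keyed by a hypothetical relation name.
prompt_sets = defaultdict(list)
pair = ("Fujian Province", "Fuzhou City")
for candidate in [
    "Fuzhou City is the provincial capital of Fujian Province",
    "If the province is Fujian Province, its provincial capital is Fuzhou City",
]:
    prompt_sets["PROVINCIAL_CAPITAL"].append(sentence_to_prompt(candidate, *pair))

print(prompt_sets["PROVINCIAL_CAPITAL"])
```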
Step 17, inputting the new prompts in the prompt set into a language model LM, which searches for new entity pairs consistent with all the new prompts in the prompt set. Specifically:
First, because the new prompts are generated automatically, they may not accurately convey the intended relationship. An appropriate confidence weight is therefore determined for each new prompt, taking into account how well both each single entity and the entity pair as a whole fit the new prompt. A matching score is calculated between each new prompt and each of its corresponding example entity pairs as follows:
$$f(\langle h,t\rangle,p)=\alpha\,\log P_{LM}(h,t\mid p)+(1-\alpha)\,\min\{\log P_{LM}(h\mid p),\ \log P_{LM}(t\mid p)\}$$
where $\alpha$ is a balance factor and $h$ and $t$ denote the two place name entities of the pair; the first term is the predicted log-probability of the corresponding entity pair $\langle h,t\rangle$, that is, it considers the fit between the whole entity pair and the new prompt; the second term is the minimum log-likelihood of a single place name entity of the pair under the new prompt $p$. The value obtained by normalizing the matching scores with a softmax function is used as the confidence weight of the new prompt; all new prompts in the prompt set are traversed to obtain the confidence weight $w_p$ of each new prompt. Second, the number M of new entity pairs required for the relationship corresponding to the prompt set is preset.
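The confidence weighting described above may be sketched as follows. The sketch assumes that the matching score has the form "balance factor times the pair log-probability plus (1 minus the balance factor) times the smaller single-entity log-likelihood", and that each prompt's scores are averaged over its example entity pairs before the softmax; the averaging, the function names, and all numeric values are illustrative assumptions, not outputs of any language model.

```python
import math


def matching_score(log_p_pair, log_p_head, log_p_tail, alpha=0.5):
    """alpha * pair log-probability + (1 - alpha) * min single-entity log-likelihood."""
    return alpha * log_p_pair + (1.0 - alpha) * min(log_p_head, log_p_tail)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weights(prompt_to_logprobs, alpha=0.5):
    """Average each prompt's score over its example pairs, then apply softmax."""
    prompts = list(prompt_to_logprobs)
    means = []
    for p in prompts:
        scores = [matching_score(*lp, alpha=alpha) for lp in prompt_to_logprobs[p]]
        means.append(sum(scores) / len(scores))
    return dict(zip(prompts, softmax(means)))


# Made-up (pair, head, tail) log-probabilities for two prompts and two example pairs.
demo = {
    "B is the provincial capital of A": [(-2.0, -1.0, -1.5), (-2.2, -1.1, -1.4)],
    "If the province is A, B is there": [(-6.0, -3.0, -4.0), (-5.5, -2.5, -3.5)],
}
weights = confidence_weights(demo)
print(weights)
```

A prompt that fits the example pairs poorly receives a small weight and therefore contributes little to later consistency scoring.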
Then, the new prompts in the prompt set are input into the language model LM. If all new entity pairs were enumerated during the search and each were judged directly for consistency with the new prompts, the amount of computation and the computational complexity would be too high. In this embodiment, therefore, the minimum log-likelihood of a single address entity in each new entity pair is considered first, so as to screen out the M new entity pairs whose single address entities are most consistent with the new prompt; the consistency of the whole entity pair is then judged only for these M new entity pairs. This both eliminates some obviously wrong search results and greatly reduces the computation needed for the later whole-pair consistency judgment. For example, for the new prompt "B is the provincial capital of A", the pair (Fujian Province, Taijiang District) may appear in the search results; by considering single-entity consistency it can be excluded, because Taijiang District is not a provincial capital. The specific process is as follows: after each new prompt is input, the first M new entity pairs found are stored as a candidate heap, and the single-entity log-likelihood value of each new entity pair is calculated as
$$\ell(\langle h,t\rangle,p)=\min\{\log P_{LM}(h\mid p),\ \log P_{LM}(t\mid p)\}$$
The candidate heap is then ordered from smallest to largest, with the new entity pair having the smallest single-entity log-likelihood value at the top of the heap and that value used as the threshold. When the next new entity pair is found, its log-likelihood value is calculated: if it is lower than the current threshold, the new entity pair is discarded; if it is higher than the current threshold, the new entity pair is inserted at the position in the candidate heap determined by its log-likelihood value, the entity pair at the top of the heap is removed, and the current threshold is updated dynamically. After the search is finished, the new entity pairs in the candidate heap are stored in the candidate set corresponding to the prompt set; since a prompt set contains K new prompts, the candidate set contains K·M new entity pairs. For example, to search for 50 new entity pairs, the 50 entity pairs with the largest $\ell$ values retrieved so far are stored, and the smallest $\ell$ value among them is kept at the top of the heap as the threshold for subsequent retrieval.
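The candidate heap described above behaves like a bounded min-heap, which may be sketched with Python's standard heapq module. The function name top_m_pairs and the log-likelihood values below are hypothetical; in practice the values would come from the language model.

```python
import heapq


def top_m_pairs(scored_pairs, m):
    """Keep the m pairs with the largest single-entity log-likelihood via a min-heap."""
    heap = []  # heap[0] always holds the smallest retained value, i.e. the threshold
    for pair, value in scored_pairs:
        if len(heap) < m:
            heapq.heappush(heap, (value, pair))
        elif value > heap[0][0]:
            # Evict the top of the heap and insert the new pair; the threshold updates.
            heapq.heapreplace(heap, (value, pair))
        # Otherwise the value does not exceed the threshold and the pair is discarded.
    return [pair for value, pair in sorted(heap, reverse=True)]


# Made-up single-entity log-likelihood values for illustration.
stream = [
    (("Fujian Province", "Fuzhou City"), -1.0),
    (("Fujian Province", "Taijiang District"), -9.0),
    (("Hubei Province", "Wuhan City"), -1.5),
    (("Henan Province", "Zhengzhou City"), -2.0),
]
kept = top_m_pairs(stream, m=3)
print(kept)
```

Each incoming pair costs at most O(log M) work, so the search never has to hold or sort the full enumeration of entity pairs.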
Finally, each new entity pair in the candidate set is denoted $\langle h^{new},t^{new}\rangle$, and its consistency score is calculated with the following consistency score function:
$$s(\langle h^{new},t^{new}\rangle)=\sum_{p} w_p\, f(\langle h^{new},t^{new}\rangle,p)$$
where $w_p$ is the confidence weight and $p$ ranges over the new prompts in the prompt set. All new entity pairs in the candidate set are re-ordered from the highest consistency score to the lowest, and the top N1 new entity pairs are output as the search result. The number of valid entity pairs may differ between relationships. For example, for the provincial capital relationship between place name entities, the number of correct entity pairs does not exceed 34, because China has 34 provincial-level administrative regions. A truncation method specific to each relationship is therefore adopted: the consistency score of the 34th entity pair is taken (the value of N1 may differ from relationship to relationship), and only the entity pairs whose consistency score is higher than this value are retained as output knowledge.
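The re-ranking and truncation may be sketched as follows, assuming the consistency score is the confidence-weighted sum of the per-prompt matching scores. For simplicity the sketch keeps the first N1 ranked pairs rather than applying a score cut-off; the prompt identifiers p1 and p2 and all numbers are made up for illustration.

```python
def consistency_score(per_prompt_scores, weights):
    """Confidence-weighted sum of the matching scores of one entity pair."""
    return sum(weights[p] * score for p, score in per_prompt_scores.items())


def rank_and_truncate(candidates, weights, n1):
    """Sort candidate pairs by consistency score (high to low) and keep n1 of them."""
    ranked = sorted(
        candidates,
        key=lambda pair: consistency_score(candidates[pair], weights),
        reverse=True,
    )
    return ranked[:n1]


# Hypothetical confidence weights and per-prompt matching scores.
weights = {"p1": 0.7, "p2": 0.3}
candidates = {
    ("Fujian Province", "Fuzhou City"): {"p1": -1.0, "p2": -2.0},
    ("Fujian Province", "Taijiang District"): {"p1": -6.0, "p2": -5.0},
    ("Hubei Province", "Wuhan City"): {"p1": -1.5, "p2": -1.0},
}
result = rank_and_truncate(candidates, weights, n1=2)
print(result)
```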
Step 18, querying the prompt set corresponding to the new entity pairs found, determining the relationship type corresponding to that prompt set, and combining the new entity pairs with that relationship type to construct the triplet data. For example, for the provincial capital relationship between place name entities, the result is the set of entity pairs for the 34 provincial-level regions and their capitals, including (Fujian Province, Fuzhou City), (Henan Province, Zhengzhou City), (Hubei Province, Wuhan City), etc.
Step 20, using Cypher syntax to map the place name entities and relationships in the triplet data into nodes and edges of the knowledge graph, respectively, thereby generating the address knowledge graph. The Neo4j graph database, which offers high query efficiency and a mature development ecosystem, can be used as the persistence tool for the knowledge graph, and the constructed address knowledge graph is imported into Neo4j for storage.
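One way to turn a triplet into a Cypher statement is sketched below. The node label Place, the property name, and the relationship type PROVINCIAL_CAPITAL are assumptions made for illustration; the sketch only builds the query text and its parameters, and sending them to Neo4j through a database driver is outside its scope.

```python
import re


def triple_to_cypher(head, relation, tail):
    """Build a parameterized MERGE statement for one (head, relation, tail) triplet."""
    # Relationship types cannot be passed as Cypher parameters, so the type is
    # validated before being interpolated into the query text.
    if not re.fullmatch(r"[A-Z][A-Z_]*", relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    query = (
        "MERGE (h:Place {name: $head}) "
        "MERGE (t:Place {name: $tail}) "
        f"MERGE (h)-[:{relation}]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = triple_to_cypher("Fujian Province", "PROVINCIAL_CAPITAL", "Fuzhou City")
print(query)
print(params)
```

MERGE is used instead of CREATE so that a place name shared by several triplets maps to a single node.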
Step 30, according to the seven-level administrative hierarchy of a standard address, screening out from the address knowledge graph the address entities whose administrative level is the seventh level, then searching upward level by level for the address entities of the higher administrative levels by graph query, completing the seven levels of place name entities to obtain a standard address, and storing the standard address in the standard address database, thereby completing the standardized management of the original address information. Specifically: according to the seven-level administrative hierarchy that exists between address entities, the address entities at the seventh level, namely those at the residential quarter (village) level, are screened out by regular-expression matching, the address information of the higher administrative levels is completed by graph query, and the completed result is stored in the standard address database. As an example of the graph query, starting from the address entity "Gulou District", the corresponding upper-level administrative relationships are completed by graph query to obtain "Fujian Province Fuzhou City Gulou District"; the Cypher statement for this graph query can be written as follows:
MATCH (start:AdministrativeDistrict {name: 'Gulou District'})
MATCH (next)<-[:SUPERIOR]-(start) WHERE next.name = 'Fuzhou City'
MATCH (upper)<-[:SUPERIOR]-(next) WHERE upper.name = 'Fujian Province'
RETURN upper.name + ' ' + next.name + ' ' + start.name AS result;
Specifically, the administrative district node named "Gulou District" is first matched and bound to start. Its upper-level node is then matched and bound to next, with a WHERE clause restricting it to the node named "Fuzhou City". The next upper-level node is matched in the same way and bound to upper, again with a WHERE clause restricting it to the node named "Fujian Province". Finally, the names of the three nodes are concatenated with the + operator, and the completed result is returned as a column aliased "result".
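The level-by-level upward completion can also be pictured without a graph database. The following minimal sketch assumes a hypothetical in-memory mapping from each place name entity to its upper-level entity and walks that mapping to the top; it stands in for the graph query and is not the Neo4j implementation itself.

```python
# Hypothetical child -> parent mapping extracted from the address knowledge graph.
PARENT_OF = {
    "Gulou District": "Fuzhou City",
    "Fuzhou City": "Fujian Province",
}


def complete_address(entity, parent_of):
    """Walk upward through parent links and join the names from the top level down."""
    chain = [entity]
    seen = {entity}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:
            raise ValueError("cycle detected in administrative hierarchy")
        seen.add(parent)
        chain.append(parent)
    return " ".join(reversed(chain))


standard_address = complete_address("Gulou District", PARENT_OF)
print(standard_address)
```

The cycle check guards against malformed hierarchy data, which would otherwise make the upward walk loop forever.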
Information systems commonly involve filling in standard address fields, so this embodiment also provides a method for dynamically matching standard addresses based on the address knowledge graph, specifically as follows:
Referring to FIG. 2, step 40: to prevent wrongly written characters, polyphonic characters, and similar input problems from reducing retrieval accuracy, the address information input by the user is acquired in real time and used as key information; the key information is input into a trained mapping model, which outputs a standard place name entity. For example, for key information key = "venturi", the mapping model corrects it to the corresponding value, where value is the complete expression of the place name entity of a node in the address knowledge graph.
The mapping model is implemented with a Transformer model, a machine learning algorithm; its network structure consists of several encoders followed by an additional fully connected layer and a softmax layer. An attention mechanism is used to capture the dependencies and context information in the address text. In the training data set, each training sample comprises the key information of an address and the place name entity of a node obtained from the address knowledge graph; the place name entity of the node serves as the label (ground truth) for supervised learning. Before being input to the model, the key information is randomly masked in the manner of MLM (masked language model) pre-training; the corresponding hidden representation is then obtained through the encoder of the Transformer model and mapped through the additional fully connected layer to a score for each possible standard place name entity. Finally, the scores are input to a softmax function and converted into a probability distribution, giving the predicted probability of each standard place name entity, and the standard place name entity with the highest predicted probability is output. During training, a cross-entropy loss function measures the difference between the standard address entity predicted by the model and the address entity of the node in the address knowledge graph, calculated as:
$$L=-\frac{1}{N}\sum_{i=1}^{N} y_i\,\log \hat{y}_i$$
where $N$ is the number of samples, $i$ is the sample index, $y_i$ is the actual label of the i-th sample, and $\hat{y}_i$ is the model's predicted probability for the i-th sample.
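The output stage of the mapping model (scores, softmax, selection of the most probable entity, and cross-entropy loss) may be illustrated numerically as follows. The sketch assumes one-hot labels and a loss averaged over the batch; the candidate entities and scores are made up, and the Transformer encoder that would produce the scores is not modelled.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, entities):
    """Return the entity with the highest predicted probability, and all probabilities."""
    probs = softmax(scores)
    best = max(range(len(entities)), key=probs.__getitem__)
    return entities[best], probs


def cross_entropy(batch_probs, true_indices):
    """Mean of -log(probability assigned to the true entity) over the batch."""
    losses = [-math.log(probs[idx]) for probs, idx in zip(batch_probs, true_indices)]
    return sum(losses) / len(losses)


# Hypothetical candidate entities and fully connected layer scores for one input.
ENTITIES = ["Gulou District", "Taijiang District", "Minhou County"]
entity, probs = predict([2.0, 0.5, -1.0], ENTITIES)
loss = cross_entropy([probs], [0])
print(entity, loss)
```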
Step 50, judging whether the standard place name entity is at the seventh level of the administrative hierarchy. If so, standard address information is acquired from the address knowledge graph by graph query and provided to the user for selection. If not, the information of the upper-level administrative regions is acquired from the address knowledge graph by graph query to complete the upper-level administrative region information of the standard place name entity; the completed information is then matched against the standard addresses in the standard address database through a matching model, and the top N2 standard addresses in the matching result are output for the user to select or modify. The matching model is implemented with an unsupervised SimCSE model. When training the matching model, a given input containing P pieces of address information is passed through an encoder with random dropout (the dropout rate may take a value such as 0.1, 0.05 or 0.3) to obtain a vector $h_i^{z_i}$. The same P pieces of address information are then passed through the encoder again with a dropout different from the previous pass, and the resulting vector is denoted $h_i^{z_i'}$. Because the two vectors differ only slightly, $(h_i^{z_i}, h_i^{z_i'})$ is used as a positive sample pair for contrastive learning, and another sample randomly selected from the same data set is used as the negative sample. For example, if the given original address input is "Jing Xizhen Hong Ganlu of Fujian city and hou county in Fujian province and Wen mountain district of Yongfeng community at software intersection", then after random dropout with dropout=0.3, $h_i^{z_i}$ may be expressed as "Fujian Fuzhou city Hong Ganlu and software intersection Yongfeng community Wenshan mountain region", and after random dropout with dropout=0.1, $h_i^{z_i'}$ may be expressed as "Fujian Fuzhou City Jingxi town Hong Ganlu and software intersection Yongfeng community Wenshan region".
Another sample in the data set is randomly selected as the negative sample b = "mountain land ballasting house Lu Guoguang community majorities garden in drum building area of Fu, Fujian province". Training the model can be understood as pulling $h_i^{z_i}$ and $h_i^{z_i'}$ closer together while pushing $h_i^{z_i}$ and $b$ apart. The purpose of contrastive learning is to learn effective representations by pulling semantically close neighbors together and pushing non-neighbors apart, so the objective function used when training the SimCSE model is:
$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i},\, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i},\, h_j^{z_j'})/\tau}}$$
wherein $h_j^{z_j'}$ (for $j \neq i$) is a negative example obtained by randomly sampling another input in the batch; $\mathrm{sim}(\cdot,\cdot)$ represents cosine similarity, e.g. $\mathrm{sim}(h_1, h_2) = \frac{h_1^\top h_2}{\|h_1\|\,\|h_2\|}$; $N$ represents the number of input samples; $z$ and $z'$ distinguish the representations of the same original sample obtained under different dropouts; $i$ and $j$ denote the i-th and j-th samples respectively; and $\tau$ is a hyperparameter, typically set to 0.05, used to increase the rate of model convergence.
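To make the contrastive objective concrete, the following is a minimal sketch of the in-batch loss used by unsupervised SimCSE, written in plain Python. It assumes that the two dropout views of each sample are already available as vectors; the encoder itself is not modelled. The names `cosine` and `simcse_loss` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def simcse_loss(h, h_prime, tau=0.05):
    """Mean contrastive loss; h[i] and h_prime[i] are two dropout views of sample i.

    For sample i, h_prime[i] is the positive and every other h_prime[j]
    in the batch acts as a negative.
    """
    n = len(h)
    total = 0.0
    for i in range(n):
        logits = [cosine(h[i], h_prime[j]) / tau for j in range(n)]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += -(logits[i] - log_denominator)
    return total / n

# Toy vectors (made up): each sample's two views point in nearly the same direction.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
aligned = simcse_loss(view_a, view_b)
mismatched = simcse_loss(view_a, list(reversed(view_b)))
```

With the views correctly paired the loss is close to zero, while swapping the pairs makes it large, which mirrors the "pull positives together, push negatives apart" description.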
With the method for standardized treatment and dynamic matching of address information disclosed by the invention, once the initial prompts and the related example entity pairs have been defined, the original address information can be treated automatically from end to end: the address knowledge graph is constructed automatically, and standard place name entities are generated and imported into the standard address database. This improves accuracy and efficiency while greatly reducing the labor cost of building standard addresses. The invention also provides automatic filling of address information based on the address knowledge graph: key information of the address is acquired in real time so that standard address information can be identified and matched automatically. This avoids incomplete or erroneous standard data caused by, for example, fields being left unfilled, and it simplifies the business handling process while improving the user's business handling efficiency.
Example two
An apparatus for standardized treatment and dynamic matching of address information, comprising a memory storing an executable program and a processor that runs the program to perform the following steps (referring to fig. 1 and fig. 2):
Step 10, automatically constructing triplet data (place name entity 1, relationship, place name entity 2) representing the relationship between two place name entities;
Step 20, using the Cypher grammar to map the place name entities and relationships in each triplet into nodes and edges of the knowledge graph respectively, and generating an address knowledge graph;
Step 30, according to the seven-level administrative hierarchy of the standard address, screening out the address entities whose administrative level is the seventh level from the address knowledge graph, then searching for the address entities of the upper administrative levels step by step by graph search, completing the seven-level place name entities to obtain the standard address, and storing the standard address into a standard address database to finish the standardized treatment of the original address information;
Step 40, in order to avoid reduced retrieval accuracy caused by wrongly written characters, polyphonic characters and the like during user input, acquiring the address information input by the user in real time and taking it as key information; inputting the key information into a trained mapping model, and outputting a standard place name entity;
Step 50, judging whether the standard place name entity is at the seventh level of the administrative hierarchy; if so, acquiring standard address information from the address knowledge graph by graph query and providing it to the user for selection; if not, acquiring the information of the upper-level administrative regions from the address knowledge graph by graph query, completing the upper-level administrative region information of the standard place name entity, matching it against the standard addresses in the standard address database through a matching model, and outputting the top N2 standard addresses in the matching result for the user to select or modify.
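Steps 20, 30 and 50 above can be illustrated with a small sketch. It assumes that every triplet has the form (lower-level place, relation, upper-level place); the node label `Place`, the relation name `BELONGS_TO`, the helper names, and the shortened four-level hierarchy are all hypothetical and chosen only for illustration. The Cypher text is built as a string for readability; a real deployment would more likely pass names as query parameters, and the upward traversal here is done in memory rather than in a graph database.

```python
# Hypothetical triples: (lower-level place, relation, upper-level place).
TRIPLES = [
    ("Fuzhou City", "BELONGS_TO", "Fujian Province"),
    ("Jingxi Town", "BELONGS_TO", "Fuzhou City"),
    ("Yongfeng Community", "BELONGS_TO", "Jingxi Town"),
]

def triple_to_cypher(head, relation, tail):
    """Render one triple as a Cypher MERGE statement (two nodes and a directed edge)."""
    def quote(name):
        # Escape backslashes and single quotes for a Cypher string literal.
        return "'" + name.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return (
        f"MERGE (a:Place {{name: {quote(head)}}}) "
        f"MERGE (b:Place {{name: {quote(tail)}}}) "
        f"MERGE (a)-[:{relation}]->(b)"
    )

def complete_address(entity, triples):
    """Walk upward level by level and return the chain from the top level down to entity."""
    parent = {head: tail for head, _, tail in triples}
    chain = [entity]
    seen = {entity}
    while chain[-1] in parent:
        upper = parent[chain[-1]]
        if upper in seen:  # guard against cycles in malformed data
            break
        chain.append(upper)
        seen.add(upper)
    return list(reversed(chain))

statements = [triple_to_cypher(*t) for t in TRIPLES]
full_address = complete_address("Yongfeng Community", TRIPLES)
```

Here `full_address` lists the place names from the highest level down to the queried entity, which corresponds to completing the upper-level administrative information for a lower-level place name.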
Since the device described in the second embodiment of the present invention is a hardware device for executing the method described in the first embodiment of the present invention, a person skilled in the art can understand the specific implementation of the apparatus based on the method described in the first embodiment of the present invention, and therefore, the description thereof is omitted herein. All methods used in the first embodiment of the present invention are within the scope of the present invention.
The foregoing description is only specific embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the present invention and the accompanying drawings, or direct or indirect application in other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A standardized treatment and dynamic matching method for address information is characterized in that: the method comprises the following steps:
automatically constructing triplet data (place name entity 1, relation, place name entity 2) representing the relation between two place name entities: defining the relations among place name entities; for each relation, creating an initial prompt and a plurality of example entity pairs corresponding to the initial prompt; randomly selecting one entity pair from the example entity pairs and inserting it into the corresponding initial prompt to generate an initial sentence; paraphrasing the initial sentence to generate a plurality of paraphrased sentences with the same meaning, and screening out the several sentences with the lowest similarity as candidate sentences; deleting the place name entities from the candidate sentences to generate a new prompt corresponding to each candidate sentence, and collecting the new prompts by relation type to generate a prompt set corresponding to each relation type; inputting the new prompts into a language model LM and searching out new entity pairs consistent with all the new prompts; combining the searched new entity pairs with the relation type corresponding to the new prompts to construct the triplet data;
using a Cypher grammar to map the place name entities and relations in each triplet data into nodes and edges in the knowledge graph respectively, and generating an address knowledge graph;
and screening out an address entity with an administrative level of a seventh level from the address knowledge graph according to the seven-level administrative level relation of the standard address, then searching the address entity of the upper level administrative level step by step in a graph searching mode, complementing the seven-level place name entity to obtain the standard address, and storing the standard address into a standard address database to finish the standard treatment of the original address information.
2. The method for standardized management and dynamic matching of address information according to claim 1, wherein: in order to ensure the consistency of the searched entity pairs and the corresponding relations, the following steps are further executed:
after the prompt set is generated, calculating a matching score of each new prompt and each corresponding example entity pair in order to reflect the matching degree between the entity pair and each single address entity in the entity pair and each new prompt in the prompt set, wherein the calculation formula of the matching score is as follows:
wherein $\alpha$ is a balance factor; the first term in the formula is the predicted probability of the corresponding entity pair $(h, t)$; the second term is the minimum of the log-likelihoods of the individual place name entities in the entity pair under the new prompt $p$; the value obtained by normalizing the matching scores with a softmax function is used as the confidence weight of the new prompt; all new prompts in the prompt set are traversed to obtain the confidence weight $w_p$ corresponding to each new prompt;
The new prompts in the prompt set are input into a language model LM, the language model LM searches out new entity pairs consistent with all the new prompts in the prompt set, and the execution process is as follows:
presetting new entity pair quantity M required by the corresponding relation of the prompt set;
in the searching process, M new entity pairs with high consistency are screened out by considering the minimum log-likelihood of the individual address entities: after each new prompt is input, the first M new entity pairs found are stored as a candidate heap, and the individual log-likelihood value of each new entity pair is calculated;
the candidate heap is then sorted in ascending order, the new entity pair with the smallest individual log-likelihood value is taken as the heap top, and that value is taken as the threshold; when the next new entity pair is found, its log-likelihood value is calculated; if the value is lower than the current threshold, the new entity pair is discarded; if the value is higher than the current threshold, the new entity pair is inserted into the corresponding position in the candidate heap according to its log-likelihood value, the entity pair at the heap top is pushed out, and the current threshold is dynamically updated; after the search is finished, the new entity pairs in the candidate heap are stored into a candidate set;
denoting each new entity pair in the candidate set as $(h, t)$, and calculating its consistency score with a constructed consistency score function, the calculation formula of which is as follows:
wherein $w_p$ is the confidence weight and $p$ represents a new prompt; all new entity pairs in the candidate set are re-sorted in descending order of consistency score, and the top N1 new entity pairs are output as the search result;
the steps are then performed: inquiring a prompt set corresponding to the searched new entity pair, determining a corresponding relation type of the prompt set, combining the searched new entity pair with the corresponding relation type, and constructing the triple data.
3. The method for standardized management and dynamic matching of address information according to claim 1, wherein: based on the constructed address knowledge graph, dynamically matching the address information input in real time in the informatization system, wherein the steps are as follows:
acquiring address information input by a user in real time, and taking the address information as key information;
inputting the key information into a trained mapping model, and outputting a standard place name entity;
judging whether the standard place name entity is at the seventh level of the administrative hierarchy; if so, acquiring standard address information from the address knowledge graph by graph query and providing it to the user for selection; if not, acquiring the information of the upper-level administrative regions from the address knowledge graph by graph query, completing the upper-level administrative region information of the standard place name entity, matching it against the standard addresses in the standard address database through a matching model, and outputting the top N2 standard addresses in the matching result for the user to select or modify.
4. The method for standardized management and dynamic matching of address information according to claim 3, wherein: the mapping model is implemented with a Transformer model based on a machine learning algorithm, and an attention mechanism is used to capture the dependency relationships and context information in the address text; in the training data set, the address data to be trained comprise the key information of an address and the place name entities of nodes obtained from the address knowledge graph, the place name entities of the nodes serving as labels (true values) for supervised learning; before input to the model, the input key information is randomly masked using the MLM (masked language model) pre-training approach, the corresponding hidden representation is then obtained through the encoder of the Transformer model, and the hidden representation is mapped through an additional fully connected layer to obtain a score for each possible standard place name entity; finally, the scores are input into a softmax function and converted into a probability distribution, giving the predicted probability of each standard place name entity, and the standard place name entity with the highest predicted probability is output; in the training process, a cross-entropy loss function is used to measure the difference between the predicted standard address entity output by the model and the address entity of the node in the address knowledge graph, calculated as follows:
wherein $N$ represents the number of samples, $i$ is the sample index, $y_i$ is the actual label of the i-th sample, and $\hat{y}_i$ is the model's predicted probability for the i-th sample.
5. A method for standardized management and dynamic matching of address information according to claim 3, wherein: the matching model is realized by adopting an unsupervised SimCSE model.
6. A device for standardized management and dynamic matching of address information, characterized in that: it comprises a memory storing an executable program and a processor that runs the program to perform the method steps of any one of claims 1 to 5.
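The candidate-heap search described in claim 2 can be sketched with a standard min-heap: the lowest-scoring kept pair sits at the heap top and acts as the threshold, and a newly found pair replaces it only if it scores higher. This is a minimal sketch under assumptions; the function name `top_m_pairs`, the entity pairs, and the log-likelihood values are hypothetical, and the scoring by a language model is not modelled.

```python
import heapq

def top_m_pairs(scored_pairs, m):
    """Keep the m entity pairs with the highest log-likelihood using a min-heap."""
    if m <= 0:
        return []
    heap = []  # heap[0] holds the lowest kept score, i.e. the current threshold
    for pair, loglik in scored_pairs:
        if len(heap) < m:
            heapq.heappush(heap, (loglik, pair))
        elif loglik > heap[0][0]:
            # Push out the pair at the heap top and insert the new one;
            # the threshold is updated automatically.
            heapq.heapreplace(heap, (loglik, pair))
        # Otherwise the pair does not beat the threshold and is discarded.
    return [pair for _, pair in sorted(heap, reverse=True)]

# Made-up entity pairs with made-up log-likelihood values, in search order.
found = [
    (("Place A", "Place B"), -1.2),
    (("Place C", "Place D"), -0.3),
    (("Place E", "Place F"), -2.5),
    (("Place G", "Place H"), -0.8),
]
candidates = top_m_pairs(found, 2)
```

Using a heap keeps each update at logarithmic cost in M, so the search never has to store or sort every entity pair it encounters.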
CN202410276382.4A 2024-03-12 2024-03-12 Method and equipment for standardized management and dynamic matching of address information Active CN117874214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410276382.4A CN117874214B (en) 2024-03-12 2024-03-12 Method and equipment for standardized management and dynamic matching of address information

Publications (2)

Publication Number Publication Date
CN117874214A (en) 2024-04-12
CN117874214B (en) 2024-06-11

Family

ID=90583422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410276382.4A Active CN117874214B (en) 2024-03-12 2024-03-12 Method and equipment for standardized management and dynamic matching of address information

Country Status (1)

Country Link
CN (1) CN117874214B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140236575A1 (en) * 2013-02-21 2014-08-21 Microsoft Corporation Exploiting the semantic web for unsupervised natural language semantic parsing
CN112528174A (en) * 2020-11-27 2021-03-19 暨南大学 Address finishing and complementing method based on knowledge graph and multiple matching and application
CN114492438A (en) * 2021-11-26 2022-05-13 武汉众智数字技术有限公司 Address standardization method based on knowledge graph and natural language processing technology
CN114780680A (en) * 2022-04-21 2022-07-22 河南数慧信息技术有限公司 Retrieval and completion method and system based on place name and address database
CN117540729A (en) * 2022-07-27 2024-02-09 丰图科技(深圳)有限公司 Address detection method, address detection device, computer equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐步陌上行: "使用Pyecharts进行全国水质TDS地图可视化全过程3:用Python拆分物流地址以及实现地址补全", pages 1 - 6, Retrieved from the Internet <URL:https://blog.csdn.net/weixin_42878250/article/details/125753493> *
那宇嘉: "融合上下文的知识图谱补全方法", 《山东大学学报》, vol. 58, no. 9, 30 September 2023 (2023-09-30), pages 71 - 80 *

Also Published As

Publication number Publication date
CN117874214B (en) 2024-06-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant