CN114398886A - Address extraction and standardization method based on pre-training - Google Patents

Address extraction and standardization method based on pre-training

Info

Publication number
CN114398886A
CN114398886A (application CN202111582633.4A)
Authority
CN
China
Prior art keywords
address
matching
place name
self
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111582633.4A
Other languages
Chinese (zh)
Inventor
冯纯博
廖奇
黄洋
陈楷
王辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kexun Jialian Information Technology Co ltd
Original Assignee
Kexun Jialian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kexun Jialian Information Technology Co ltd filed Critical Kexun Jialian Information Technology Co ltd
Priority to CN202111582633.4A priority Critical patent/CN114398886A/en
Publication of CN114398886A publication Critical patent/CN114398886A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to address extraction, and in particular to an address extraction and standardization method based on pre-training. The method collects corpora containing address information and pre-trains a model; fine-tunes the pre-trained model in a semi-supervised self-learning mode based on an enhanced address corpus and recognizes place names with the fine-tuned model; performs address correction based on a self-updating, self-maintaining dictionary; and performs address normalization based on a multi-head attention mechanism generation model. The technical scheme provided by the invention effectively overcomes two defects of the prior art: the high cost of corpus labeling and the insufficient standardization of the extracted addresses.

Description

Address extraction and standardization method based on pre-training
Technical Field
The invention relates to address extraction, in particular to an address extraction and standardization method based on pre-training.
Background
Place names are the names that a community of people agrees upon for specific entities with a definite extent, character and orientation; they are a particular kind of linguistic symbol. A place name describes the specific spatial position of a specific object and, because it names a region or a landmark, it carries a definite structure, typically a hierarchical one. At the same time, since a place name is a human abstraction of a spatial position or region, a linguistic expression formed after mental processing of geographic coordinates, the same place tends to be expressed in as many different ways as there are writers, as shown in Table 1.
Human activities are closely tied to addresses, and address information plays an increasingly important role in daily life. The registration of personal information such as native place, birthplace and current residence; the entry of origins and destinations for travel, express delivery and navigation; the management of location information for entities such as government offices, real-estate transactions and municipal facilities; and the rapid, accurate reporting of an address in case of illness, danger or accident all depend on address information. Because address information is so widely used, almost every industry collects it, yet there is no standard format, so the collected addresses are complex and varied. In addition, China is vast, its administrative hierarchy is deep and complex, and address planning lags behind rapid urban construction, so addresses suffer from abbreviations, alternative names, colloquial names, irregular writing and structural disorder, and descriptions of the same address are inconsistent, incomplete and non-standard. Moreover, Chinese addresses have no separators, contain homophones, and are prone to omissions or redundancy, which sharply increases the difficulty of extracting and standardizing address information.
At present, Chinese address extraction methods mainly include address-dictionary matching and entity recognition based on rule matching, statistics or deep learning. Dictionary matching is simple but has obvious drawbacks: place names that are absent from the dictionary or ambiguous cannot be matched, only place names that exist and are described exactly can be extracted, building the dictionary is difficult, and updating and maintaining it consumes considerable manpower and material resources.
Rule-based matching usually relies on forward or reverse matching, but the rules depend on a specific region: rules built for one region are difficult to apply to another, so generality is poor; the number of rules to maintain grows as the covered area expands, later maintenance is costly, and errors occur easily.
Statistics-based methods remove the dependence on dictionaries by extracting addresses from the frequency of co-occurring words; words that frequently co-occur in sentences reflect the confidence that they jointly form an address. Such methods usually use machine-learning models such as N-gram models, HMMs (hidden Markov models), CRFs (conditional random fields), SVMs (support vector machines) and Maximum Entropy models. They depend heavily on feature engineering: with too few features the model is hard to converge, and with too many features it easily overfits.
Common deep-learning methods such as ELMo and Bi-LSTM + CRF have achieved good results in address extraction. However, the parameter count of deep-learning models grows far beyond that of classical machine-learning models, and more parameters mean that more corpora are needed for training. Supervised learning requires labeled corpora, and labeling incurs huge time and labor costs, especially in specialized domains where labeling is even harder.
To address these problems, BERT + BiLSTM + CRF is generally adopted for place-name extraction, with the BERT pre-trained model fine-tuned on the specific task in the place-name domain. This further frees address extraction from corpus limitations, allows out-of-dictionary words to be predicted, and further improves generalization.
However, the BERT + BiLSTM + CRF type of scheme still has problems. First, Chinese place names have many levels (such as the ten-level addresses shown in Table 2) and the total number of Chinese place names is extremely large; the collected corpora cannot cover all provinces, cities and districts. In particular, place names in remote areas are complex (for example, Yuanjiang Hani, Yi and Dai Autonomous County in Yuxi) and their corpora are scarce, so the trained model performs well on common place names but poorly on place names from remote areas.
In addition, although this scheme improves generalization, it extracts a large number of non-standard (abbreviated, miswritten, etc.) and even wrong place names, which makes it hard to use directly in practice. Finally, whichever scheme is adopted, the output is only isolated place-name information, and problems such as missing address information (e.g., 'Guanshan Avenue, Wuhan City, Hubei Province'), redundant address information (e.g., 'New World T1 office building, Guanshan Avenue, Guanshan District, Wuhan City, Hubei Province, diagonally opposite, next door to Jindi Sun City') and entangled address information (e.g., 'Guanshan Avenue New World, Guanshan District, Wuchang District, Wuhan City, Hubei Province') can occur.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides an address extraction and standardization method based on pre-training, which can effectively overcome the high cost of corpus labeling and the insufficient standardization of the extracted addresses in the prior art.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
a pre-training based address extraction and normalization method comprises the following steps:
s1, collecting corpora containing address information, and pre-training the model;
s2, fine-tuning the pre-training model through a semi-supervised self-learning mode based on the enhanced address corpus, and recognizing the place name by using the fine-tuned model;
s3, correcting addresses based on the self-updating self-maintenance dictionary;
S4, performing address normalization based on the multi-head attention mechanism generation model.
Preferably, in S2, the fine-tuning of the model through the semi-supervised self-learning mode based on the enhanced address corpus includes:
acquiring an address white list according to a national administrative division, randomly combining all levels of addresses from the address white list, and constructing address white list corpora;
randomly replacing the corresponding level slot position in the existing real corpus by using the address in the address white list corpus to construct an enhanced address corpus;
and fine-tuning the pre-training model by utilizing the enhanced address corpus, and performing place name recognition by taking the fine-tuned model as a recognition model.
Preferably, the enhanced address corpus is produced by corpus enhancement in a semi-supervised self-learning mode, comprising the following steps:
dynamically calculating the threshold for selecting corpora, and extracting the corpora to be labeled in batches;
after place-name recognition with the recognition model, selecting the predictions with higher confidence and merging them into the training corpus;
performing multiple rounds of extraction on the corpora to be labeled, adjusting the extraction amount in each round according to the prediction results: increasing the amount if the predictions are good and reducing it otherwise;
fusing the manually checked predictions with the original manually labeled corpus to form the enhanced address corpus.
Preferably, fine-tuning of the pre-trained model is bootstrapped with a small amount of manually labeled corpus.
Preferably, the performing address correction based on the self-updated self-maintained dictionary in S3 includes:
splitting each place name recognized by the recognition model to obtain the place name of the level above it and the common-name part of the place name;
delimiting a candidate set of place names from the place name of the level above, and applying rule processing to the recognized place names and to the place names in the candidate set to separate their common names from their proper names;
performing maximum forward matching and maximum reverse matching on the place-name proper names with the self-updating self-maintaining dictionary, performing weighted matching on the matching results, and outputting the result with the highest weight as the corrected place name.
Preferably, the maximum forward matching and the maximum reverse matching of the place name proper name include:
performing exact matching against the standard place names: if an exact match exists, the match succeeds directly and the standard place name is output; otherwise the matching range is expanded and exact matching is performed against the alternative place names in the self-updating self-maintaining dictionary, and the match succeeds if an exact match is found;
if no exact match is found, performing rule matching: if a rule-matching result is obtained, the process proceeds to the next step, weighted matching; otherwise the matching process exits directly.
Preferably, the weighted matching includes:
obtaining the weights of the keywords in the rule-matching results through weighted matching based on pinyin fuzzy matching.
Preferably, the self-updating self-maintaining dictionary is extracted from actual dialogue data; a dictionary entry is constructed for each place name that has alternative names, with the corresponding standard place name as the key and the alternative names together with their occurrence counts as the value, and the dictionary automatically re-orders the alternative names according to how often they occur.
Preferably, the address normalization based on the multi-head attention mechanism generation model in S4 includes:
inputting the corrected address data into an encoding network for encoding; applying a multi-head attention mechanism between the address data and the context vector; inputting the output vector of the encoding network into a decoder for decoding; feeding the decoder output, together with the multi-head attention vector that contains the context vector produced by the multi-head attention mechanism, into a copy network; the copy network decides whether to generate a word from the vocabulary or to copy a word directly from the original text, and its output is taken as the final result.
Preferably, the multi-head attention mechanism generation model comprises an encoding network, a decoder and a copy network, wherein the encoding network is a bidirectional long short-term memory (BiLSTM) network, the decoder is a unidirectional LSTM network, and the copy network is a Pointer Network (PointerNet).
(III) advantageous effects
Compared with the prior art, the address extraction and standardization method based on pre-training provided by the invention has the following beneficial effects:
1) In the place-name recognition task, the training corpus required for model fine-tuning is constructed by combining manual construction and mixing with a semi-supervised self-learning mode, which effectively reduces the corpus-labeling cost of fine-tuning, enhances place names that occur rarely, and improves later recognition accuracy;
2) For the non-standard addresses recognized by the BERT pre-trained model, correction matching is performed with the self-updating self-maintaining dictionary, making the recognition results more effective and accurate;
3) For the repetition, redundancy, omissions and partly erroneous hierarchy that often occur in address text, the address text is normalized with a multi-head attention mechanism generation model, so that a standard address can be output.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flow chart of corpus enhancement in the semi-supervised self-learning mode according to the present invention;
FIG. 2 is a schematic view illustrating a process of performing address correction based on a self-updating self-maintaining dictionary according to the present invention;
FIG. 3 is a schematic diagram illustrating a process of address normalization based on a multi-head attention mechanism generation model according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
TABLE 1 Different address statements and corresponding problems for the same standard address (the table body is provided as an image in the original publication)
TABLE 2 Ten-level address hierarchy
Label        Address element hierarchy type    Examples
PROV         Province                          Province, direct-administered municipality, autonomous region, etc.
CITY         City                              City, autonomous prefecture, etc.
ADNAME       District/county                   District, county, county-level city, etc.
TOWN         Town/township                     Town, township, etc.
VIL          Village/community                 Village, village committee, community, etc.
ROAD         Road                              Road, highway, national road, etc.
StreetNUM    Street number                     Street number, etc.
BuildingNUM  Building number                   Building No., Block A, etc.
LOCATION     Location point                    Square, building, hospital, school, residential compound, etc.
DIRECTION    Orientation                       East, west, south, north, beside, opposite, etc.
A pre-training based address extraction and normalization method, as shown in fig. 1, includes:
Firstly, corpora containing address information are collected and a BERT-WWM model is pre-trained.
Secondly, the pre-trained model is fine-tuned in a semi-supervised self-learning mode based on the enhanced address corpus, and place-name recognition is performed with the fine-tuned model.
The method specifically comprises the following steps:
acquiring an address white list according to a national administrative division, randomly combining all levels of addresses from the address white list, and constructing address white list corpora;
randomly replacing the corresponding level slot position in the existing real corpus by using the address in the address white list corpus to construct an enhanced address corpus;
fine-tuning the pre-trained model with the enhanced address corpus (the fine-tuning is first bootstrapped with a small amount of manually labeled corpus), and performing place-name recognition with the fine-tuned model as the recognition model; a sketch of the white-list slot replacement described above follows this list.
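As an illustration, the slot-replacement enhancement can be sketched in Python as follows. The white-list format, the (start, end, level) span annotation on the real corpus and the 50% replacement rate are assumptions made only for this example; the patent does not prescribe them.

```python
import random

# Hypothetical white list built from the national administrative divisions,
# keyed by address level (format assumed for illustration).
ADDRESS_WHITE_LIST = {
    "PROV": ["湖北省", "云南省", "广西壮族自治区"],
    "CITY": ["武汉市", "玉溪市"],
    "ADNAME": ["洪山区", "元江哈尼族彝族傣族自治县"],
}

def enhance_sentence(chars, slots, replace_rate=0.5):
    """Randomly replace annotated address slots in one real sentence with white-list
    addresses of the same level, keeping the span annotations aligned.

    chars : list of characters of a real corpus sentence
    slots : list of (start, end, level) spans marking address elements, sorted by start
    """
    new_chars, new_slots, offset = list(chars), [], 0
    for start, end, level in slots:
        candidates = ADDRESS_WHITE_LIST.get(level, [])
        if candidates and random.random() < replace_rate:
            piece = list(random.choice(candidates))   # swap in a white-list address
        else:
            piece = chars[start:end]                  # keep the original slot text
        new_chars[start + offset:end + offset] = piece
        new_slots.append((start + offset, start + offset + len(piece), level))
        offset += len(piece) - (end - start)
    return new_chars, new_slots
```

Keeping the span annotations aligned after replacement means the enhanced sentences can be fed to fine-tuning exactly like the original labeled corpus.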
The enhanced address corpus is produced by corpus enhancement in a semi-supervised self-learning mode, comprising:
dynamically calculating the threshold for selecting corpora, and extracting the corpora to be labeled in batches;
after place-name recognition with the recognition model, selecting the predictions with higher confidence and merging them into the training corpus;
performing multiple rounds of extraction on the corpora to be labeled, adjusting the extraction amount in each round according to the prediction results: increasing the amount if the predictions are good and reducing it otherwise;
fusing the manually checked predictions with the original manually labeled corpus to form the enhanced address corpus; a sketch of this self-learning loop is given below.
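A compact sketch of the self-learning loop is given below. The recognition-model interface (fine_tune / predict), the fixed confidence threshold, the batch-adjustment factors and the manually_check stand-in are all illustrative assumptions; the patent only states that the threshold is computed dynamically and the extraction amount is adjusted per round.

```python
def manually_check(predicted_items):
    """Stand-in for the manual inspection of accepted predictions (hypothetical helper)."""
    return predicted_items

def build_enhanced_corpus(model, labeled, unlabeled, rounds=5,
                          first_batch=1000, conf_threshold=0.9):
    """Semi-supervised self-learning sketch: bootstrap on a small labeled corpus, then
    repeatedly label batches of unlabeled corpora, keep high-confidence predictions,
    and grow or shrink the next batch according to prediction quality.

    model.predict(text) is assumed to return (tags, confidence)."""
    corpus, batch = list(labeled), first_batch
    model.fine_tune(corpus)                              # bootstrap fine-tuning
    for _ in range(rounds):
        drawn, unlabeled = unlabeled[:batch], unlabeled[batch:]
        predictions = [(text, *model.predict(text)) for text in drawn]
        accepted = [(t, tags) for t, tags, conf in predictions if conf >= conf_threshold]
        accept_rate = len(accepted) / max(len(drawn), 1)
        batch = int(batch * (1.2 if accept_rate > 0.8 else 0.8))   # adjust extraction amount
        corpus += manually_check(accepted)               # human spot-check before fusing
        model.fine_tune(corpus)                          # re-tune on the growing corpus
    return corpus                                        # the enhanced address corpus
```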
In the corpora collected for place-name recognition, place names from popular areas appear frequently while those from remote areas appear rarely or not at all. Such corpora can be enhanced with artificially constructed standard addresses, but purely artificial addresses generalize poorly and their benefit drops sharply in practical use. In this technical scheme, the training corpus combines artificial address data, labeled address data and the enhanced address corpus built from real addresses and labeled data; an enhanced corpus that covers the full range of addresses improves the model's recognition, in particular the recognition of place names from remote areas.
Each slot in place-name recognition can also be regarded as a classification problem, and for classification problems self-learning is an accurate and widely applicable way to enhance the corpus. With the semi-supervised self-learning mode, labeled corpus is obtained from a large pool of unlabeled corpora and used as training data for model fine-tuning, which further reduces the dependence on manually labeled corpus and saves the time and labor cost of labeling.
Thirdly, address correction is carried out based on self-updating and self-maintaining dictionary
The method specifically comprises the following steps:
splitting each place name recognized by the recognition model to obtain the place name of the level above it and the common-name part of the place name;
delimiting a candidate set of place names from the place name of the level above, and applying rule processing to the recognized place names and to the place names in the candidate set to separate their common names from their proper names (common names are generic terms such as 'county' or 'village'; proper names are specific terms such as 'Tongshan'; a splitting sketch is given after this list);
performing maximum forward matching and maximum reverse matching on the place-name proper names with the self-updating self-maintaining dictionary, performing weighted matching on the matching results, and outputting the result with the highest weight as the corrected place name.
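The common-name/proper-name split can be illustrated with a minimal Python sketch; the suffix list below is hand-picked for this example, since the actual rule set is not specified in the patent.

```python
# Illustrative common-name (generic) suffixes; the real rule set is not given in the patent.
COMMON_NAMES = ("省", "市", "区", "县", "镇", "乡", "村", "社区", "街道", "大道", "路")

def split_place_name(name):
    """Split a recognized place name into (proper name, common name),
    e.g. "通山县" -> ("通山", "县")."""
    for suffix in sorted(COMMON_NAMES, key=len, reverse=True):   # try longer suffixes first
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[:-len(suffix)], suffix
    return name, ""        # no recognizable common name
```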
The maximum forward matching and maximum reverse matching of the place-name proper names comprise the following steps:
performing exact matching against the standard place names: if an exact match exists, the match succeeds directly and the standard place name is output; otherwise the matching range is expanded and exact matching is performed against the alternative place names in the self-updating self-maintaining dictionary, and the match succeeds if an exact match is found;
if no exact match is found, performing rule matching: if a rule-matching result is obtained, the process proceeds to the next step, weighted matching; otherwise the matching process exits directly (this cascade is sketched, together with the weighted matching, below).
Wherein, the weighted matching comprises:
obtaining the weights of the keywords in the rule-matching results through weighted matching based on pinyin fuzzy matching (one possible realization is sketched below).
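The matching cascade and the pinyin-based weighting can be sketched together. This is a minimal Python illustration under several assumptions: the dictionary layout follows the description below (standard name mapped to aliases with counts), the rule matching is reduced to a simple prefix/suffix check, and the weights use the third-party pypinyin package with a plain edit-ratio score; none of these specifics are prescribed by the patent.

```python
from difflib import SequenceMatcher
from pypinyin import lazy_pinyin      # third-party package, one possible pinyin source

def pinyin_weight(a, b):
    """Weight two place names by the similarity of their pinyin, tolerating homophones."""
    return SequenceMatcher(None, " ".join(lazy_pinyin(a)), " ".join(lazy_pinyin(b))).ratio()

def match_place_name(proper_name, standard_names, alias_dict):
    """Sketch of the cascade: exact match on standard names, then on dictionary aliases,
    then rule matching followed by pinyin-weighted selection; None means 'exit matching'.

    standard_names : collection of standard proper names for the candidate level
    alias_dict     : {standard_name: {alias: occurrence_count, ...}}
    """
    if proper_name in standard_names:                     # exact match, output directly
        return proper_name
    for standard, aliases in alias_dict.items():          # expand range to known aliases
        if proper_name in aliases:
            return standard
    # rule matching, e.g. a shared prefix/suffix as a stand-in for forward/reverse matching
    candidates = [s for s in standard_names
                  if s.startswith(proper_name) or proper_name.startswith(s)
                  or s.endswith(proper_name) or proper_name.endswith(s)]
    if not candidates:
        return None                                       # no rule match: exit
    weights = {cand: pinyin_weight(proper_name, cand) for cand in candidates}
    return max(weights, key=weights.get)                  # highest weight wins

# e.g. match_place_name("凉山", {"凉山彝族自治州"}, {"凉山彝族自治州": {"凉山": 12}})
# returns "凉山彝族自治州" via the alias lookup.
```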
The self-updating self-maintaining dictionary is extracted from actual dialogue data: a dictionary entry is constructed for each place name that has alternative names, with the corresponding standard place name as the key and the alternative names together with their occurrence counts as the value; the dictionary automatically re-orders the alternative names according to how often they occur. On one hand this improves the efficiency of later matching; on the other hand, alternative names that are no longer used are demoted in the ordering.
Because address planning changes constantly and for historical reasons, place names exhibit a very particular phenomenon of abbreviations and alternative names. For example, the Guangxi Zhuang Autonomous Region is abbreviated as Guangxi, the Liangshan Yi Autonomous Prefecture is abbreviated as Liangshan, and Yuzhou City is also known as Yuxian.
To solve these problems, the technical scheme of this application constructs a place-name expansion dictionary that is updated and maintained automatically, in real time, from actual dialogue data. This removes the difficulty of maintaining dictionaries in rule matching and, by matching the recognition results against the dictionary, effectively improves the accuracy of the place names recognized by the BERT pre-trained model.
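A minimal data-structure sketch of such an alias dictionary follows; the class name and interface are illustrative, and only the key/value layout and the frequency-based ordering come from the description above.

```python
from collections import Counter

class AliasDictionary:
    """Self-updating, self-maintaining alias dictionary: keys are standard place names,
    values count the aliases observed in real dialogue data, ordered by frequency."""

    def __init__(self):
        self.entries = {}                      # {standard_name: Counter(alias -> count)}

    def observe(self, standard_name, alias):
        """Record one occurrence of an alias extracted from dialogue data."""
        self.entries.setdefault(standard_name, Counter())[alias] += 1

    def aliases(self, standard_name):
        """Aliases of a standard name, most frequent first; rarely used aliases sink."""
        return [a for a, _ in self.entries.get(standard_name, Counter()).most_common()]

# e.g. record the abbreviations mentioned above
d = AliasDictionary()
d.observe("广西壮族自治区", "广西")
d.observe("凉山彝族自治州", "凉山")
```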
Fourthly, address normalization is carried out based on a multi-head attention mechanism generation model
The method specifically comprises the following steps:
inputting the corrected address data into an encoding network for encoding; applying a multi-head attention mechanism between the address data and the context vector (which increases the encoding dimension of the address data and focuses attention on the key place-name information); inputting the output vector of the encoding network into a decoder for decoding; feeding the decoder output, together with the multi-head attention vector that contains the context vector produced by the multi-head attention mechanism, into a copy network; the copy network decides whether to generate a word from the vocabulary or to copy a word directly from the original text, and its output is taken as the final result.
The multi-head attention mechanism generation model comprises an encoding network, a decoder and a copy network: the encoding network is a bidirectional long short-term memory (BiLSTM) network, the decoder is a unidirectional LSTM network, and the copy network is a Pointer Network (PointerNet).
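A minimal PyTorch sketch of this architecture is given below. The hidden sizes, the single-pass teacher-forced decoding and the way the copy distribution is built by scattering attention weights onto the source token ids are simplifying assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class AddressNormalizer(nn.Module):
    """Sketch: BiLSTM encoder, unidirectional LSTM decoder, multi-head attention
    and a pointer-style copy mechanism (dimensions and details assumed)."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, 2 * hid_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hid_dim, n_heads, batch_first=True)
        self.gen_head = nn.Linear(4 * hid_dim, vocab_size)    # generate from the vocabulary
        self.copy_gate = nn.Linear(4 * hid_dim, 1)            # copy-vs-generate switch

    def forward(self, src_ids, tgt_in_ids):
        src_emb = self.embed(src_ids)                         # (B, S, E)
        enc_out, _ = self.encoder(src_emb)                    # (B, S, 2H)

        tgt_emb = self.embed(tgt_in_ids)                      # (B, T, E)
        ctx0 = enc_out.mean(dim=1, keepdim=True).expand(-1, tgt_emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([tgt_emb, ctx0], dim=-1))   # (B, T, 2H)

        # multi-head attention: decoder states attend over the encoder outputs
        ctx, attn_w = self.attn(dec_out, enc_out, enc_out)    # (B, T, 2H), (B, T, S)

        feats = torch.cat([dec_out, ctx], dim=-1)             # (B, T, 4H)
        p_vocab = torch.softmax(self.gen_head(feats), dim=-1) # generation distribution
        p_copy = torch.sigmoid(self.copy_gate(feats))         # probability of copying

        # copy distribution: scatter attention weights back onto the source token ids
        copy_dist = torch.zeros_like(p_vocab)
        index = src_ids.unsqueeze(1).expand(-1, feats.size(1), -1)
        copy_dist.scatter_add_(-1, index, attn_w)

        return (1 - p_copy) * p_vocab + p_copy * copy_dist    # final word distribution

# toy usage with an assumed vocabulary size and random token ids
model = AddressNormalizer(vocab_size=8000)
src = torch.randint(0, 8000, (2, 20))       # corrected address token ids
tgt_in = torch.randint(0, 8000, (2, 12))    # shifted target tokens (teacher forcing)
probs = model(src, tgt_in)                  # (2, 12, 8000)
```

At inference time the decoder would instead run step by step (for example with greedy or beam search), feeding each generated or copied token back in; the sketch only shows the training-time forward pass.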
After the three preceding steps, most of the place-name information is standard. However, the address text as a whole may still suffer from missing elements, redundant information, hierarchical confusion and information entanglement, so it is not yet a standardized address. To solve this, the technical scheme of this application performs address normalization with a multi-head attention mechanism generation model: a multi-head attention mechanism and a copy mechanism are integrated on top of the generation model, and the standard address is obtained from the non-standard address text by generation rather than by combination.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A pre-training-based address extraction and standardization method, characterized by comprising the following steps:
s1, collecting corpora containing address information, and pre-training the model;
s2, fine-tuning the pre-training model through a semi-supervised self-learning mode based on the enhanced address corpus, and recognizing the place name by using the fine-tuned model;
s3, correcting addresses based on the self-updating self-maintenance dictionary;
S4, performing address normalization based on the multi-head attention mechanism generation model.
2. The pre-training based address extraction and normalization method of claim 1, wherein: in S2, based on the enhanced address corpus, the model is fine-tuned through a semi-supervised self-learning mode, including:
acquiring an address white list according to a national administrative division, randomly combining all levels of addresses from the address white list, and constructing address white list corpora;
randomly replacing the corresponding level slot position in the existing real corpus by using the address in the address white list corpus to construct an enhanced address corpus;
and fine-tuning the pre-training model by utilizing the enhanced address corpus, and performing place name recognition by taking the fine-tuned model as a recognition model.
3. The pre-training based address extraction and normalization method of claim 2, wherein: the enhanced address corpus is produced by corpus enhancement in a semi-supervised self-learning mode, comprising the following steps:
dynamically calculating the threshold for selecting corpora, and extracting the corpora to be labeled in batches;
after place-name recognition with the recognition model, selecting the predictions with higher confidence and merging them into the training corpus;
performing multiple rounds of extraction on the corpora to be labeled, adjusting the extraction amount in each round according to the prediction results: increasing the amount if the predictions are good and reducing it otherwise;
fusing the manually checked predictions with the original manually labeled corpus to form the enhanced address corpus.
4. The pre-training based address extraction and normalization method of claim 2 or 3, wherein: fine-tuning of the pre-trained model is bootstrapped with a small amount of manually labeled corpus.
5. The pre-training based address extraction and normalization method of claim 1, wherein: the address correction based on the self-updating self-maintenance dictionary in the S3 includes:
splitting each place name recognized by the recognition model to obtain the place name of the level above it and the common-name part of the place name;
delimiting a candidate set of place names from the place name of the level above, and applying rule processing to the recognized place names and to the place names in the candidate set to separate their common names from their proper names;
performing maximum forward matching and maximum reverse matching on the place-name proper names with the self-updating self-maintaining dictionary, performing weighted matching on the matching results, and outputting the result with the highest weight as the corrected place name.
6. The pre-training based address extraction and normalization method of claim 5, wherein: the maximum forward matching and the maximum reverse matching are carried out on the place name proper names, and the method comprises the following steps:
performing exact matching against the standard place names: if an exact match exists, the match succeeds directly and the standard place name is output; otherwise the matching range is expanded and exact matching is performed against the alternative place names in the self-updating self-maintaining dictionary, and the match succeeds if an exact match is found;
if no exact match is found, performing rule matching: if a rule-matching result is obtained, the process proceeds to the next step, weighted matching; otherwise the matching process exits directly.
7. The pre-training based address extraction and normalization method of claim 5 or 6, wherein: the weighted matching comprises the following steps:
obtaining the weights of the keywords in the rule-matching results through weighted matching based on pinyin fuzzy matching.
8. The pre-training based address extraction and normalization method of claim 7, wherein: the self-updating self-maintaining dictionary is extracted from actual dialogue data; a dictionary entry is constructed for each place name that has alternative names, with the corresponding standard place name as the key and the alternative names together with their occurrence counts as the value, and the dictionary automatically re-orders the alternative names according to how often they occur.
9. The pre-training based address extraction and normalization method of claim 1, wherein: in S4, address normalization is performed based on the multi-head attention mechanism generation model, which includes:
inputting the corrected address data into an encoding network for encoding; applying a multi-head attention mechanism between the address data and the context vector; inputting the output vector of the encoding network into a decoder for decoding; feeding the decoder output, together with the multi-head attention vector that contains the context vector produced by the multi-head attention mechanism, into a copy network; the copy network decides whether to generate a word from the vocabulary or to copy a word directly from the original text, and its output is taken as the final result.
10. The pre-training based address extraction and normalization method of claim 9, wherein: the multi-head attention mechanism generation model comprises an encoding network, a decoder and a copy network, wherein the encoding network is a bidirectional long short-term memory (BiLSTM) network, the decoder is a unidirectional LSTM network, and the copy network is a Pointer Network (PointerNet).
CN202111582633.4A 2021-12-22 2021-12-22 Address extraction and standardization method based on pre-training Pending CN114398886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111582633.4A CN114398886A (en) 2021-12-22 2021-12-22 Address extraction and standardization method based on pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111582633.4A CN114398886A (en) 2021-12-22 2021-12-22 Address extraction and standardization method based on pre-training

Publications (1)

Publication Number Publication Date
CN114398886A true CN114398886A (en) 2022-04-26

Family

ID=81226784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111582633.4A Pending CN114398886A (en) 2021-12-22 2021-12-22 Address extraction and standardization method based on pre-training

Country Status (1)

Country Link
CN (1) CN114398886A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688779A (en) * 2022-10-11 2023-02-03 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning
CN115688779B (en) * 2022-10-11 2023-05-09 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning

Similar Documents

Publication Publication Date Title
CN104091054B (en) Towards the Mass disturbance method for early warning and system of short text
CN106528526B (en) A kind of Chinese address semanteme marking method based on Bayes's segmentation methods
WO2015027836A1 (en) Method and system for place name entity recognition
CN111104802B (en) Method for extracting address information text and related equipment
CN104462216B (en) Occupy committee's standard code converting system and method
WO2015027835A1 (en) System and terminal for querying mailing address postal codes
CN112329467A (en) Address recognition method and device, electronic equipment and storage medium
CN110781670B (en) Chinese place name semantic disambiguation method based on encyclopedic knowledge base and word vectors
CN107169079A (en) A kind of field text knowledge abstracting method based on Deepdive
WO2022126988A1 (en) Method and apparatus for training entity naming recognition model, device and storage medium
CN113722478B (en) Multi-dimensional feature fusion similar event calculation method and system and electronic equipment
CN114510566B (en) Method and system for mining, classifying and analyzing hotword based on worksheet
CN114860960B (en) Method for constructing flood type Natech disaster event knowledge graph based on text mining
CN113592037A (en) Address matching method based on natural language inference
CN112527933A (en) Chinese address association method based on space position and text training
CN109299469A (en) A method of identifying complicated address in long text
CN114780680A (en) Retrieval and completion method and system based on place name and address database
CN114676353B (en) Address matching method based on segmentation inference
CN114398886A (en) Address extraction and standardization method based on pre-training
CN113505233B (en) Extraction method of ecological civilized geographic knowledge based on open domain
CN112989811B (en) History book reading auxiliary system based on BiLSTM-CRF and control method thereof
CN114091454A (en) Method for extracting place name information and positioning space in internet text
Ghukasyan et al. pioNER: Datasets and baselines for Armenian named entity recognition
CN117149140B (en) Method, device and related equipment for generating coded architecture information
CN117010398A (en) Address entity identification method based on multi-layer knowledge perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination