CN114398886A - Address extraction and standardization method based on pre-training - Google Patents

Address extraction and standardization method based on pre-training

Info

Publication number
CN114398886A
CN114398886A (application CN202111582633.4A)
Authority
CN
China
Prior art keywords
address
matching
place name
self
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111582633.4A
Other languages
Chinese (zh)
Inventor
冯纯博
廖奇
黄洋
陈楷
王辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kexun Jialian Information Technology Co ltd
Original Assignee
Kexun Jialian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kexun Jialian Information Technology Co ltd filed Critical Kexun Jialian Information Technology Co ltd
Priority to CN202111582633.4A priority Critical patent/CN114398886A/en
Publication of CN114398886A publication Critical patent/CN114398886A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to address extraction, and in particular to an address extraction and standardization method based on pre-training. The method collects corpora containing address information and pre-trains a model; fine-tunes the pre-trained model in a semi-supervised self-learning mode based on an enhanced address corpus and recognizes place names with the fine-tuned model; performs address correction based on a self-updating, self-maintaining dictionary; and performs address normalization based on a multi-head attention mechanism generation model. The technical scheme provided by the invention effectively overcomes two defects of the prior art: the high cost of corpus labeling and the insufficient standardization of the extracted addresses.

Description

Address extraction and standardization method based on pre-training
Technical Field
The invention relates to address extraction, in particular to an address extraction and standardization method based on pre-training.
Background
Place names are the names that a community of people agrees upon for specific entities with a definite extent, character and orientation; they are a particular kind of linguistic symbol. A place name describes the specific spatial position of a specific object and, because it names a region or a landmark, it carries a definite structure, typically a hierarchical one. At the same time, since a place name is a human abstraction of a spatial position or region, a linguistic expression formed after mental processing of geographic coordinates, the same place tends to be expressed in as many different ways as there are writers, as shown in Table 1.
Human activities are closely tied to addresses, and address information plays an increasingly important role in daily life. The registration of personal information such as native place, birthplace and current residence; the entry of origins and destinations for travel, express delivery and navigation; the management of location information for entities such as government offices, real-estate transactions and municipal facilities; and the rapid, accurate reporting of an address in case of illness, danger or accident all depend on address information. Because address information is so widely used, almost every industry collects it, yet there is no standard format, so the collected addresses are complex and varied. In addition, China is vast, its administrative hierarchy is deep and complex, and address planning lags behind rapid urban construction, so addresses suffer from abbreviations, alternative names, colloquial names, irregular writing and structural disorder, and descriptions of the same address are inconsistent, incomplete and non-standard. Moreover, Chinese addresses have no separators, contain homophones, and are prone to omissions or redundancy, which sharply increases the difficulty of extracting and standardizing address information.
At present, Chinese address extraction methods mainly include address-dictionary matching and entity recognition based on rule matching, statistics or deep learning. Dictionary matching is simple but has obvious drawbacks: place names that are absent from the dictionary or ambiguous cannot be matched, only place names that exist and are described exactly can be extracted, building the dictionary is difficult, and updating and maintaining it consumes considerable manpower and material resources.
Rule-based matching usually relies on forward or reverse matching, but the rules depend on a specific region: rules built for one region are difficult to apply to another, so generality is poor; the number of rules to maintain grows as the covered area expands, later maintenance is costly, and errors occur easily.
Statistics-based methods remove the dependence on dictionaries by extracting addresses from the frequency of co-occurring words; words that frequently co-occur in sentences reflect the confidence that they jointly form an address. Such methods usually use machine-learning models such as N-gram models, HMMs (hidden Markov models), CRFs (conditional random fields), SVMs (support vector machines) and Maximum Entropy models. They depend heavily on feature engineering: with too few features the model is hard to converge, and with too many features it easily overfits.
Common deep-learning methods such as ELMo and Bi-LSTM + CRF have achieved good results in address extraction. However, the parameter count of deep-learning models grows far beyond that of classical machine-learning models, and more parameters mean that more corpora are needed for training. Supervised learning requires labeled corpora, and labeling incurs huge time and labor costs, especially in specialized domains where labeling is even harder.
To address these problems, BERT + BiLSTM + CRF is generally adopted for place-name extraction, with the BERT pre-trained model fine-tuned on the specific task in the place-name domain. This further frees address extraction from corpus limitations, allows out-of-dictionary words to be predicted, and further improves generalization.
However, the BERT + BiLSTM + CRF type of scheme still has problems. First, Chinese place names have many levels (such as the ten-level addresses shown in Table 2) and the total number of Chinese place names is extremely large; the collected corpora cannot cover all provinces, cities and districts. In particular, place names in remote areas are complex (for example, Yuanjiang Hani, Yi and Dai Autonomous County in Yuxi) and their corpora are scarce, so the trained model performs well on common place names but poorly on place names from remote areas.
In addition, although this scheme improves generalization, it extracts a large number of non-standard (abbreviated, miswritten, etc.) and even wrong place names, which makes it hard to use directly in practice. Finally, whichever scheme is adopted, the output is only isolated place-name information, and problems such as missing address information (e.g., 'Guanshan Avenue, Wuhan City, Hubei Province'), redundant address information (e.g., 'New World T1 office building, Guanshan Avenue, Guanshan District, Wuhan City, Hubei Province, diagonally opposite, next door to Jindi Sun City') and entangled address information (e.g., 'Guanshan Avenue New World, Guanshan District, Wuchang District, Wuhan City, Hubei Province') can occur.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides an address extraction and standardization method based on pre-training, which can effectively overcome the high cost of corpus labeling and the insufficient standardization of the extracted addresses in the prior art.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
a pre-training based address extraction and normalization method comprises the following steps:
s1, collecting corpora containing address information, and pre-training the model;
s2, fine-tuning the pre-training model through a semi-supervised self-learning mode based on the enhanced address corpus, and recognizing the place name by using the fine-tuned model;
s3, correcting addresses based on the self-updating self-maintenance dictionary;
S4, performing address normalization based on the multi-head attention mechanism generation model.
Preferably, in S2, the fine-tuning of the model through the semi-supervised self-learning mode based on the enhanced address corpus includes:
acquiring an address white list according to a national administrative division, randomly combining all levels of addresses from the address white list, and constructing address white list corpora;
randomly replacing the corresponding level slot position in the existing real corpus by using the address in the address white list corpus to construct an enhanced address corpus;
and fine-tuning the pre-training model by utilizing the enhanced address corpus, and performing place name recognition by taking the fine-tuned model as a recognition model.
Preferably, the enhanced address corpus is produced by corpus enhancement in a semi-supervised self-learning mode, comprising the following steps:
dynamically calculating the threshold for selecting corpora, and extracting the corpora to be labeled in batches;
after place-name recognition with the recognition model, selecting the predictions with higher confidence and merging them into the training corpus;
performing multiple rounds of extraction on the corpora to be labeled, adjusting the extraction amount in each round according to the prediction results: increasing the amount if the predictions are good and reducing it otherwise;
fusing the manually checked predictions with the original manually labeled corpus to form the enhanced address corpus.
Preferably, fine-tuning of the pre-trained model is bootstrapped with a small amount of manually labeled corpus.
Preferably, the performing address correction based on the self-updated self-maintained dictionary in S3 includes:
splitting each place name recognized by the recognition model to obtain the place name of the level above it and the common-name part of the place name;
delimiting a candidate set of place names from the place name of the level above, and applying rule processing to the recognized place names and to the place names in the candidate set to separate their common names from their proper names;
performing maximum forward matching and maximum reverse matching on the place-name proper names with the self-updating self-maintaining dictionary, performing weighted matching on the matching results, and outputting the result with the highest weight as the corrected place name.
Preferably, the maximum forward matching and the maximum reverse matching of the place name proper name include:
performing exact matching against the standard place names: if an exact match exists, the match succeeds directly and the standard place name is output; otherwise the matching range is expanded and exact matching is performed against the alternative place names in the self-updating self-maintaining dictionary, and the match succeeds if an exact match is found;
if no exact match is found, performing rule matching: if a rule-matching result is obtained, the process proceeds to the next step, weighted matching; otherwise the matching process exits directly.
Preferably, the weighted matching includes:
obtaining the weights of the keywords in the rule-matching results through weighted matching based on pinyin fuzzy matching.
Preferably, the self-updating self-maintaining dictionary is extracted from actual dialogue data; a dictionary entry is constructed for each place name that has alternative names, with the corresponding standard place name as the key and the alternative names together with their occurrence counts as the value, and the dictionary automatically re-orders the alternative names according to how often they occur.
Preferably, the address normalization based on the multi-head attention mechanism generation model in S4 includes:
inputting the corrected address data into an encoding network for encoding; applying a multi-head attention mechanism between the address data and the context vector; inputting the output vector of the encoding network into a decoder for decoding; feeding the decoder output, together with the multi-head attention vector that contains the context vector produced by the multi-head attention mechanism, into a copy network; the copy network decides whether to generate a word from the vocabulary or to copy a word directly from the original text, and its output is taken as the final result.
Preferably, the multi-head attention mechanism generation model comprises an encoding network, a decoder and a copy network, wherein the encoding network is a bidirectional long short-term memory (BiLSTM) network, the decoder is a unidirectional LSTM network, and the copy network is a Pointer Network (PointerNet).
(III) advantageous effects
Compared with the prior art, the address extraction and standardization method based on pre-training provided by the invention has the following beneficial effects:
1) In the place-name recognition task, the training corpus required for model fine-tuning is constructed by combining manual construction and mixing with a semi-supervised self-learning mode, which effectively reduces the corpus-labeling cost of fine-tuning, enhances place names that occur rarely, and improves later recognition accuracy;
2) For the non-standard addresses recognized by the BERT pre-trained model, correction matching is performed with the self-updating self-maintaining dictionary, making the recognition results more effective and accurate;
3) For the repetition, redundancy, omissions and partly erroneous hierarchy that often occur in address text, the address text is normalized with a multi-head attention mechanism generation model, so that a standard address can be output.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flow chart of corpus enhancement in the semi-supervised self-learning mode according to the present invention;
FIG. 2 is a schematic view illustrating a process of performing address correction based on a self-updating self-maintaining dictionary according to the present invention;
FIG. 3 is a schematic diagram illustrating a process of address normalization based on a multi-head attention mechanism generation model according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
TABLE 1 Different address statements and corresponding problems for the same standard address (the table body is provided as an image in the original publication)
TABLE 2 Ten-level address hierarchy
Label        Address element hierarchy type    Examples
PROV         Province                          Province, direct-administered municipality, autonomous region, etc.
CITY         City                              City, autonomous prefecture, etc.
ADNAME       District/county                   District, county, county-level city, etc.
TOWN         Town/township                     Town, township, etc.
VIL          Village/community                 Village, village committee, community, etc.
ROAD         Road                              Road, highway, national road, etc.
StreetNUM    Street number                     Street number, etc.
BuildingNUM  Building number                   Building No., Block A, etc.
LOCATION     Location point                    Square, building, hospital, school, residential compound, etc.
DIRECTION    Orientation                       East, west, south, north, beside, opposite, etc.
A pre-training based address extraction and normalization method, as shown in fig. 1, includes:
Firstly, corpora containing address information are collected and a BERT-WWM model is pre-trained.
Secondly, the pre-trained model is fine-tuned in a semi-supervised self-learning mode based on the enhanced address corpus, and place-name recognition is performed with the fine-tuned model.
The method specifically comprises the following steps:
acquiring an address white list according to a national administrative division, randomly combining all levels of addresses from the address white list, and constructing address white list corpora;
randomly replacing the corresponding level slot position in the existing real corpus by using the address in the address white list corpus to construct an enhanced address corpus;
fine-tuning the pre-trained model with the enhanced address corpus (the fine-tuning is first bootstrapped with a small amount of manually labeled corpus), and performing place-name recognition with the fine-tuned model as the recognition model; a sketch of the white-list slot replacement described above follows this list.
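As an illustration, the slot-replacement enhancement can be sketched in Python as follows. The white-list format, the (start, end, level) span annotation on the real corpus and the 50% replacement rate are assumptions made only for this example; the patent does not prescribe them.

```python
import random

# Hypothetical white list built from the national administrative divisions,
# keyed by address level (format assumed for illustration).
ADDRESS_WHITE_LIST = {
    "PROV": ["湖北省", "云南省", "广西壮族自治区"],
    "CITY": ["武汉市", "玉溪市"],
    "ADNAME": ["洪山区", "元江哈尼族彝族傣族自治县"],
}

def enhance_sentence(chars, slots, replace_rate=0.5):
    """Randomly replace annotated address slots in one real sentence with white-list
    addresses of the same level, keeping the span annotations aligned.

    chars : list of characters of a real corpus sentence
    slots : list of (start, end, level) spans marking address elements, sorted by start
    """
    new_chars, new_slots, offset = list(chars), [], 0
    for start, end, level in slots:
        candidates = ADDRESS_WHITE_LIST.get(level, [])
        if candidates and random.random() < replace_rate:
            piece = list(random.choice(candidates))   # swap in a white-list address
        else:
            piece = chars[start:end]                  # keep the original slot text
        new_chars[start + offset:end + offset] = piece
        new_slots.append((start + offset, start + offset + len(piece), level))
        offset += len(piece) - (end - start)
    return new_chars, new_slots
```

Keeping the span annotations aligned after replacement means the enhanced sentences can be fed to fine-tuning exactly like the original labeled corpus.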
The enhanced address corpus is produced by corpus enhancement in a semi-supervised self-learning mode, comprising:
dynamically calculating the threshold for selecting corpora, and extracting the corpora to be labeled in batches;
after place-name recognition with the recognition model, selecting the predictions with higher confidence and merging them into the training corpus;
performing multiple rounds of extraction on the corpora to be labeled, adjusting the extraction amount in each round according to the prediction results: increasing the amount if the predictions are good and reducing it otherwise;
fusing the manually checked predictions with the original manually labeled corpus to form the enhanced address corpus; a sketch of this self-learning loop is given below.
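A compact sketch of the self-learning loop is given below. The recognition-model interface (fine_tune / predict), the fixed confidence threshold, the batch-adjustment factors and the manually_check stand-in are all illustrative assumptions; the patent only states that the threshold is computed dynamically and the extraction amount is adjusted per round.

```python
def manually_check(predicted_items):
    """Stand-in for the manual inspection of accepted predictions (hypothetical helper)."""
    return predicted_items

def build_enhanced_corpus(model, labeled, unlabeled, rounds=5,
                          first_batch=1000, conf_threshold=0.9):
    """Semi-supervised self-learning sketch: bootstrap on a small labeled corpus, then
    repeatedly label batches of unlabeled corpora, keep high-confidence predictions,
    and grow or shrink the next batch according to prediction quality.

    model.predict(text) is assumed to return (tags, confidence)."""
    corpus, batch = list(labeled), first_batch
    model.fine_tune(corpus)                              # bootstrap fine-tuning
    for _ in range(rounds):
        drawn, unlabeled = unlabeled[:batch], unlabeled[batch:]
        predictions = [(text, *model.predict(text)) for text in drawn]
        accepted = [(t, tags) for t, tags, conf in predictions if conf >= conf_threshold]
        accept_rate = len(accepted) / max(len(drawn), 1)
        batch = int(batch * (1.2 if accept_rate > 0.8 else 0.8))   # adjust extraction amount
        corpus += manually_check(accepted)               # human spot-check before fusing
        model.fine_tune(corpus)                          # re-tune on the growing corpus
    return corpus                                        # the enhanced address corpus
```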
In the corpora collected for place-name recognition, place names from popular areas appear frequently while those from remote areas appear rarely or not at all. Such corpora can be enhanced with artificially constructed standard addresses, but purely artificial addresses generalize poorly and their benefit drops sharply in practical use. In this technical scheme, the training corpus combines artificial address data, labeled address data and the enhanced address corpus built from real addresses and labeled data; an enhanced corpus that covers the full range of addresses improves the model's recognition, in particular the recognition of place names from remote areas.
Each slot in place-name recognition can also be regarded as a classification problem, and for classification problems self-learning is an accurate and widely applicable way to enhance the corpus. With the semi-supervised self-learning mode, labeled corpus is obtained from a large pool of unlabeled corpora and used as training data for model fine-tuning, which further reduces the dependence on manually labeled corpus and saves the time and labor cost of labeling.
Thirdly, address correction is carried out based on self-updating and self-maintaining dictionary
The method specifically comprises the following steps:
splitting each place name recognized by the recognition model to obtain the place name of the level above it and the common-name part of the place name;
delimiting a candidate set of place names from the place name of the level above, and applying rule processing to the recognized place names and to the place names in the candidate set to separate their common names from their proper names (common names are generic terms such as 'county' or 'village'; proper names are specific terms such as 'Tongshan'; a splitting sketch is given after this list);
performing maximum forward matching and maximum reverse matching on the place-name proper names with the self-updating self-maintaining dictionary, performing weighted matching on the matching results, and outputting the result with the highest weight as the corrected place name.
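The common-name/proper-name split can be illustrated with a minimal Python sketch; the suffix list below is hand-picked for this example, since the actual rule set is not specified in the patent.

```python
# Illustrative common-name (generic) suffixes; the real rule set is not given in the patent.
COMMON_NAMES = ("省", "市", "区", "县", "镇", "乡", "村", "社区", "街道", "大道", "路")

def split_place_name(name):
    """Split a recognized place name into (proper name, common name),
    e.g. "通山县" -> ("通山", "县")."""
    for suffix in sorted(COMMON_NAMES, key=len, reverse=True):   # try longer suffixes first
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[:-len(suffix)], suffix
    return name, ""        # no recognizable common name
```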
The maximum forward matching and maximum reverse matching of the place-name proper names comprise the following steps:
performing exact matching against the standard place names: if an exact match exists, the match succeeds directly and the standard place name is output; otherwise the matching range is expanded and exact matching is performed against the alternative place names in the self-updating self-maintaining dictionary, and the match succeeds if an exact match is found;
if no exact match is found, performing rule matching: if a rule-matching result is obtained, the process proceeds to the next step, weighted matching; otherwise the matching process exits directly (this cascade is sketched, together with the weighted matching, below).
Wherein, the weighted matching comprises:
obtaining the weights of the keywords in the rule-matching results through weighted matching based on pinyin fuzzy matching (one possible realization is sketched below).
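The matching cascade and the pinyin-based weighting can be sketched together. This is a minimal Python illustration under several assumptions: the dictionary layout follows the description below (standard name mapped to aliases with counts), the rule matching is reduced to a simple prefix/suffix check, and the weights use the third-party pypinyin package with a plain edit-ratio score; none of these specifics are prescribed by the patent.

```python
from difflib import SequenceMatcher
from pypinyin import lazy_pinyin      # third-party package, one possible pinyin source

def pinyin_weight(a, b):
    """Weight two place names by the similarity of their pinyin, tolerating homophones."""
    return SequenceMatcher(None, " ".join(lazy_pinyin(a)), " ".join(lazy_pinyin(b))).ratio()

def match_place_name(proper_name, standard_names, alias_dict):
    """Sketch of the cascade: exact match on standard names, then on dictionary aliases,
    then rule matching followed by pinyin-weighted selection; None means 'exit matching'.

    standard_names : collection of standard proper names for the candidate level
    alias_dict     : {standard_name: {alias: occurrence_count, ...}}
    """
    if proper_name in standard_names:                     # exact match, output directly
        return proper_name
    for standard, aliases in alias_dict.items():          # expand range to known aliases
        if proper_name in aliases:
            return standard
    # rule matching, e.g. a shared prefix/suffix as a stand-in for forward/reverse matching
    candidates = [s for s in standard_names
                  if s.startswith(proper_name) or proper_name.startswith(s)
                  or s.endswith(proper_name) or proper_name.endswith(s)]
    if not candidates:
        return None                                       # no rule match: exit
    weights = {cand: pinyin_weight(proper_name, cand) for cand in candidates}
    return max(weights, key=weights.get)                  # highest weight wins

# e.g. match_place_name("凉山", {"凉山彝族自治州"}, {"凉山彝族自治州": {"凉山": 12}})
# returns "凉山彝族自治州" via the alias lookup.
```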
The self-updating self-maintaining dictionary is extracted from actual dialogue data: a dictionary entry is constructed for each place name that has alternative names, with the corresponding standard place name as the key and the alternative names together with their occurrence counts as the value; the dictionary automatically re-orders the alternative names according to how often they occur. On one hand this improves the efficiency of later matching; on the other hand, alternative names that are no longer used are demoted in the ordering.
Because address planning changes constantly and for historical reasons, place names exhibit a very particular phenomenon of abbreviations and alternative names. For example, the Guangxi Zhuang Autonomous Region is abbreviated as Guangxi, the Liangshan Yi Autonomous Prefecture is abbreviated as Liangshan, and Yuzhou City is also known as Yuxian.
To solve these problems, the technical scheme of this application constructs a place-name expansion dictionary that is updated and maintained automatically, in real time, from actual dialogue data. This removes the difficulty of maintaining dictionaries in rule matching and, by matching the recognition results against the dictionary, effectively improves the accuracy of the place names recognized by the BERT pre-trained model.
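A minimal data-structure sketch of such an alias dictionary follows; the class name and interface are illustrative, and only the key/value layout and the frequency-based ordering come from the description above.

```python
from collections import Counter

class AliasDictionary:
    """Self-updating, self-maintaining alias dictionary: keys are standard place names,
    values count the aliases observed in real dialogue data, ordered by frequency."""

    def __init__(self):
        self.entries = {}                      # {standard_name: Counter(alias -> count)}

    def observe(self, standard_name, alias):
        """Record one occurrence of an alias extracted from dialogue data."""
        self.entries.setdefault(standard_name, Counter())[alias] += 1

    def aliases(self, standard_name):
        """Aliases of a standard name, most frequent first; rarely used aliases sink."""
        return [a for a, _ in self.entries.get(standard_name, Counter()).most_common()]

# e.g. record the abbreviations mentioned above
d = AliasDictionary()
d.observe("广西壮族自治区", "广西")
d.observe("凉山彝族自治州", "凉山")
```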
Fourthly, address normalization is carried out based on a multi-head attention mechanism generation model
The method specifically comprises the following steps:
inputting the corrected address data into an encoding network for encoding; applying a multi-head attention mechanism between the address data and the context vector (which increases the encoding dimension of the address data and focuses attention on the key place-name information); inputting the output vector of the encoding network into a decoder for decoding; feeding the decoder output, together with the multi-head attention vector that contains the context vector produced by the multi-head attention mechanism, into a copy network; the copy network decides whether to generate a word from the vocabulary or to copy a word directly from the original text, and its output is taken as the final result.
The multi-head attention mechanism generation model comprises an encoding network, a decoder and a copy network: the encoding network is a bidirectional long short-term memory (BiLSTM) network, the decoder is a unidirectional LSTM network, and the copy network is a Pointer Network (PointerNet).
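A minimal PyTorch sketch of this architecture is given below. The hidden sizes, the single-pass teacher-forced decoding and the way the copy distribution is built by scattering attention weights onto the source token ids are simplifying assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class AddressNormalizer(nn.Module):
    """Sketch: BiLSTM encoder, unidirectional LSTM decoder, multi-head attention
    and a pointer-style copy mechanism (dimensions and details assumed)."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, 2 * hid_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hid_dim, n_heads, batch_first=True)
        self.gen_head = nn.Linear(4 * hid_dim, vocab_size)    # generate from the vocabulary
        self.copy_gate = nn.Linear(4 * hid_dim, 1)            # copy-vs-generate switch

    def forward(self, src_ids, tgt_in_ids):
        src_emb = self.embed(src_ids)                         # (B, S, E)
        enc_out, _ = self.encoder(src_emb)                    # (B, S, 2H)

        tgt_emb = self.embed(tgt_in_ids)                      # (B, T, E)
        ctx0 = enc_out.mean(dim=1, keepdim=True).expand(-1, tgt_emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([tgt_emb, ctx0], dim=-1))   # (B, T, 2H)

        # multi-head attention: decoder states attend over the encoder outputs
        ctx, attn_w = self.attn(dec_out, enc_out, enc_out)    # (B, T, 2H), (B, T, S)

        feats = torch.cat([dec_out, ctx], dim=-1)             # (B, T, 4H)
        p_vocab = torch.softmax(self.gen_head(feats), dim=-1) # generation distribution
        p_copy = torch.sigmoid(self.copy_gate(feats))         # probability of copying

        # copy distribution: scatter attention weights back onto the source token ids
        copy_dist = torch.zeros_like(p_vocab)
        index = src_ids.unsqueeze(1).expand(-1, feats.size(1), -1)
        copy_dist.scatter_add_(-1, index, attn_w)

        return (1 - p_copy) * p_vocab + p_copy * copy_dist    # final word distribution

# toy usage with an assumed vocabulary size and random token ids
model = AddressNormalizer(vocab_size=8000)
src = torch.randint(0, 8000, (2, 20))       # corrected address token ids
tgt_in = torch.randint(0, 8000, (2, 12))    # shifted target tokens (teacher forcing)
probs = model(src, tgt_in)                  # (2, 12, 8000)
```

At inference time the decoder would instead run step by step (for example with greedy or beam search), feeding each generated or copied token back in; the sketch only shows the training-time forward pass.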
After the three preceding steps, most of the place-name information is standard. However, the address text as a whole may still suffer from missing elements, redundant information, hierarchical confusion and information entanglement, so it is not yet a standardized address. To solve this, the technical scheme of this application performs address normalization with a multi-head attention mechanism generation model: a multi-head attention mechanism and a copy mechanism are integrated on top of the generation model, and the standard address is obtained from the non-standard address text by generation rather than by combination.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A pre-training-based address extraction and standardization method, characterized by comprising the following steps:
s1, collecting corpora containing address information, and pre-training the model;
s2, fine-tuning the pre-training model through a semi-supervised self-learning mode based on the enhanced address corpus, and recognizing the place name by using the fine-tuned model;
s3, correcting addresses based on the self-updating self-maintenance dictionary;
S4, performing address normalization based on the multi-head attention mechanism generation model.
2. The pre-training based address extraction and normalization method of claim 1, wherein: in S2, based on the enhanced address corpus, the model is fine-tuned through a semi-supervised self-learning mode, including:
acquiring an address white list according to a national administrative division, randomly combining all levels of addresses from the address white list, and constructing address white list corpora;
randomly replacing the corresponding level slot position in the existing real corpus by using the address in the address white list corpus to construct an enhanced address corpus;
and fine-tuning the pre-training model by utilizing the enhanced address corpus, and performing place name recognition by taking the fine-tuned model as a recognition model.
3. The pre-training based address extraction and normalization method of claim 2, wherein: the enhanced address corpus is produced by corpus enhancement in a semi-supervised self-learning mode, comprising the following steps:
dynamically calculating the threshold for selecting corpora, and extracting the corpora to be labeled in batches;
after place-name recognition with the recognition model, selecting the predictions with higher confidence and merging them into the training corpus;
performing multiple rounds of extraction on the corpora to be labeled, adjusting the extraction amount in each round according to the prediction results: increasing the amount if the predictions are good and reducing it otherwise;
fusing the manually checked predictions with the original manually labeled corpus to form the enhanced address corpus.
4. The pre-training based address extraction and normalization method of claim 2 or 3, wherein: fine-tuning of the pre-trained model is bootstrapped with a small amount of manually labeled corpus.
5. The pre-training based address extraction and normalization method of claim 1, wherein: the address correction based on the self-updating self-maintenance dictionary in the S3 includes:
splitting each place name recognized by the recognition model to obtain the place name of the level above it and the common-name part of the place name;
delimiting a candidate set of place names from the place name of the level above, and applying rule processing to the recognized place names and to the place names in the candidate set to separate their common names from their proper names;
performing maximum forward matching and maximum reverse matching on the place-name proper names with the self-updating self-maintaining dictionary, performing weighted matching on the matching results, and outputting the result with the highest weight as the corrected place name.
6. The pre-training based address extraction and normalization method of claim 5, wherein: the maximum forward matching and the maximum reverse matching are carried out on the place name proper names, and the method comprises the following steps:
performing exact matching against the standard place names: if an exact match exists, the match succeeds directly and the standard place name is output; otherwise the matching range is expanded and exact matching is performed against the alternative place names in the self-updating self-maintaining dictionary, and the match succeeds if an exact match is found;
if no exact match is found, performing rule matching: if a rule-matching result is obtained, the process proceeds to the next step, weighted matching; otherwise the matching process exits directly.
7. The pre-training based address extraction and normalization method of claim 5 or 6, wherein: the weighted matching comprises the following steps:
obtaining the weights of the keywords in the rule-matching results through weighted matching based on pinyin fuzzy matching.
8. The pre-training based address extraction and normalization method of claim 7, wherein: the self-updating self-maintaining dictionary is extracted from actual dialogue data; a dictionary entry is constructed for each place name that has alternative names, with the corresponding standard place name as the key and the alternative names together with their occurrence counts as the value, and the dictionary automatically re-orders the alternative names according to how often they occur.
9. The pre-training based address extraction and normalization method of claim 1, wherein: in S4, address normalization is performed based on the multi-head attention mechanism generation model, which includes:
inputting the corrected address data into an encoding network for encoding; applying a multi-head attention mechanism between the address data and the context vector; inputting the output vector of the encoding network into a decoder for decoding; feeding the decoder output, together with the multi-head attention vector that contains the context vector produced by the multi-head attention mechanism, into a copy network; the copy network decides whether to generate a word from the vocabulary or to copy a word directly from the original text, and its output is taken as the final result.
10. The pre-training based address extraction and normalization method of claim 9, wherein: the multi-head attention mechanism generation model comprises an encoding network, a decoder and a copy network, wherein the encoding network is a bidirectional long short-term memory (BiLSTM) network, the decoder is a unidirectional LSTM network, and the copy network is a Pointer Network (PointerNet).
CN202111582633.4A 2021-12-22 2021-12-22 Address extraction and standardization method based on pre-training Pending CN114398886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111582633.4A CN114398886A (en) 2021-12-22 2021-12-22 Address extraction and standardization method based on pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111582633.4A CN114398886A (en) 2021-12-22 2021-12-22 Address extraction and standardization method based on pre-training

Publications (1)

Publication Number Publication Date
CN114398886A true CN114398886A (en) 2022-04-26

Family

ID=81226784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111582633.4A Pending CN114398886A (en) 2021-12-22 2021-12-22 Address extraction and standardization method based on pre-training

Country Status (1)

Country Link
CN (1) CN114398886A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688779A (en) * 2022-10-11 2023-02-03 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning
CN115688779B (en) * 2022-10-11 2023-05-09 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning

Similar Documents

Publication Publication Date Title
CN104091054B (en) Towards the Mass disturbance method for early warning and system of short text
CN106528526B (en) A kind of Chinese address semanteme marking method based on Bayes's segmentation methods
WO2015027836A1 (en) Method and system for place name entity recognition
CN111104802B (en) Method for extracting address information text and related equipment
CN104462216B (en) Occupy committee's standard code converting system and method
WO2015027835A1 (en) System and terminal for querying mailing address postal codes
CN112329467A (en) Address recognition method and device, electronic equipment and storage medium
CN110781670B (en) Chinese place name semantic disambiguation method based on encyclopedic knowledge base and word vectors
CN107169079A (en) A kind of field text knowledge abstracting method based on Deepdive
WO2022126988A1 (en) Method and apparatus for training entity naming recognition model, device and storage medium
CN113722478B (en) Multi-dimensional feature fusion similar event calculation method and system and electronic equipment
CN114510566B (en) Method and system for mining, classifying and analyzing hotword based on worksheet
CN114860960B (en) Method for constructing flood type Natech disaster event knowledge graph based on text mining
CN113592037A (en) Address matching method based on natural language inference
CN112527933A (en) Chinese address association method based on space position and text training
CN109299469A (en) A method of identifying complicated address in long text
CN114780680A (en) Retrieval and completion method and system based on place name and address database
CN114676353B (en) Address matching method based on segmentation inference
CN114398886A (en) Address extraction and standardization method based on pre-training
CN113505233B (en) Extraction method of ecological civilized geographic knowledge based on open domain
CN112989811B (en) History book reading auxiliary system based on BiLSTM-CRF and control method thereof
CN114091454A (en) Method for extracting place name information and positioning space in internet text
Ghukasyan et al. pioNER: Datasets and baselines for Armenian named entity recognition
CN117149140B (en) Method, device and related equipment for generating coded architecture information
CN117010398A (en) Address entity identification method based on multi-layer knowledge perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination