CN117010398A - Address entity identification method based on multi-layer knowledge perception - Google Patents

Address entity identification method based on multi-layer knowledge perception

Info

Publication number
CN117010398A
Authority
CN
China
Prior art keywords
knowledge
address
character
layer
vocabulary
Prior art date
Legal status
Pending
Application number
CN202311110916.8A
Other languages
Chinese (zh)
Inventor
李茹
高俊杰
邵文远
谭红叶
张虎
闫智超
苏雪峰
张越
梁吉业
Current Assignee
Shanxi University
Original Assignee
Shanxi University
Priority date
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Priority to CN202311110916.8A priority Critical patent/CN117010398A/en
Publication of CN117010398A publication Critical patent/CN117010398A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the field of natural language processing, and in particular relates to an address entity identification method based on multi-layer knowledge perception. The method combines address entity identification with the application scenarios that arise during knowledge graph construction. From the perspective of address entity identification, an address knowledge tree is constructed according to the characteristics of address text, and external knowledge at the sentence and vocabulary levels related to the input text is obtained from the tree, helping a general model learn domain-specific address knowledge, thereby enhancing address entity identification capability and improving model accuracy. In addition, both research value and practicability are taken into account: from an engineering point of view, the method can be integrated into an existing place indexing system, improving the usability of that system. The performance of the address entity identification model is further improved through a fine-tuning interface, and address information is converted into a unified format through address standardization and completion.

Description

Address entity identification method based on multi-layer knowledge perception
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to an address entity identification method based on multi-layer knowledge perception.
Background
Early address named entity recognition methods relied primarily on rule- and dictionary-based approaches. Shaalan et al. constructed rule templates using a variety of features such as vocabulary, syntax, and semantics, and used place-name dictionary matching to identify address entities appearing in text. JA et al. used a place-name dictionary to match Twitter text and identify address entities. However, such methods require manual formulation of the corresponding rule templates and dictionaries, consume a great deal of time, and require the dictionaries to be updated periodically.
As machine learning techniques developed, researchers used machine learning methods to learn feature weights from labeled training data and identified entities by predicting the most likely tag sequence. Bikel et al. modeled text sequence data with an HMM to calculate the probability that each word belongs to a given entity type, thereby predicting the entity type of each word. Mena et al. proposed a hybrid place-name extraction method based on an HMM and an SVM: place names are extracted with the HMM, disambiguated by matching against a gazetteer, and the final result is obtained through the SVM. Sobhana et al. proposed a CRF-based model that uses word context information to predict address entities. However, methods based on statistical machine learning lack generalization capability and require manually engineered features, which depends on the knowledge and experience of domain experts and consumes considerable manpower.
In recent years, deep learning-based methods have become the mainstream for NER tasks. They can learn feature representations automatically through end-to-end training and generalize better. Collobert et al. proposed a word-level neural network model that processes word vector features with convolutional layers and incorporates a CRF for entity recognition. Huang et al. proposed the BiLSTM-CRF sequence labeling model, which significantly improved the accuracy of named entity recognition. Yang et al. applied a convolutional layer with a fixed window size over the character embedding layer for entity recognition. Liu et al. proposed the WC-LSTM model, which represents text as character-word pairs, thereby integrating word information into each character.
Although deep learning-based methods can effectively improve NER performance, context information is easily lost during training. Some studies capture rich contextual information and linguistic representations via pre-trained models (e.g., BERT, ERNIE), further enhancing model performance. Li et al. proposed the BERT-IDCNN-CRF Chinese NER model, which obtains contextual word representations from a BERT pre-trained model and feeds the word vector sequence into an IDCNN-CRF model for training. Liu et al. helped the model understand domain knowledge by introducing knowledge-graph-structured knowledge into the pre-trained language model. Sun et al. used multi-task learning and knowledge distillation for model training, and used entity relationship predictions to assist the NER results, further improving NER performance. The study by Seyler et al. demonstrated the importance of external knowledge for named entity recognition.
Although the above works achieve good results, two problems remain: 1) compared with the general domain, the address domain has many more entity types and instances, and the similarity between address entities is high, making them difficult to identify; 2) text in the address domain is written freely, and shorthand, homonyms/aliases, and unregistered place names are common, making it harder to determine entity boundaries and types.
Disclosure of Invention
Based on the problems and shortcomings of the prior art, the invention combines address entity identification with the application scenarios it faces, and designs an address entity identification method based on multi-layer knowledge perception, together with address standardization and a model fine-tuning interface. For address entity identification, an address knowledge tree is constructed according to the characteristics of address text, and external knowledge at the sentence and vocabulary levels related to the input text is obtained from the tree, helping a general model learn domain-specific address knowledge, enhancing address entity identification capability and improving model accuracy. In addition, both research value and practicability are taken into account: from an engineering point of view, the method can be integrated into an existing place indexing system, improving the usability of the system and accelerating the construction of an address database.
In order to achieve the above purpose, the present invention is realized by the following technical scheme:
an address entity identification method based on multi-layer knowledge perception comprises the following steps:
step 1, address entity identification: constructing an address entity identification model based on multi-layer knowledge perception; the address entity identification model comprises: 1) a knowledge recall layer: sentence knowledge and vocabulary knowledge matched with the address text sequence are obtained from an address knowledge tree AKT; 2) an encoding layer: the character sequence and the sentence knowledge are encoded with the BERT embedding layer, the semantic relevance between the character sequence and each sentence knowledge is calculated to obtain a score for each sentence knowledge, the Top-N sentence knowledge is selected and injected into the feedforward neural network of the BERT Transformer; the vocabulary knowledge is encoded with pre-trained word vectors and aligned with the character vector dimension; 3) a vocabulary knowledge fusion layer: a global semantic representation is extracted with a BiGRU, different weights are assigned to different vocabulary knowledge using bilinear attention, and the weighted sum is fused with the character sequence; 4) a tag prediction layer: the joint representation of character sequence information and vocabulary knowledge is passed through a CRF to learn in-sentence constraints, yielding the prediction result for the address text;
step 2, place indexing: comprising address standardization and a model fine-tuning interface; the address standardization normalizes the extracted address entities according to the labeling specification of the National Bureau of Statistics; the model fine-tuning interface fine-tunes the model using a pre-trained natural language processing model.
Further, the knowledge recall layer in the step 1 specifically includes:
(1) External address knowledge is acquired from the 2022 statistical division table of the National Bureau of Statistics, and the AKT is constructed according to the statistical division types;
(2) The input address text is scanned continuously in longest-match mode to find all potential vocabulary knowledge in the AKT;
(3) For each vocabulary knowledge matched by the characters, its corresponding path in the AKT is found, and the spliced paths serve as the sentence knowledge of the address text.
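The recall procedure above can be pictured as a longest-match scan over a trie of administrative divisions. The following is a minimal sketch only: the class names (AddressKnowledgeTree, AKTNode), the "/" path separator, and the sample divisions are assumptions for illustration, not the patent's implementation.

```python
# Minimal sketch of the knowledge recall layer over a trie-shaped AKT.
class AKTNode:
    def __init__(self):
        self.children = {}   # next character -> AKTNode
        self.path = None     # full division path if a division name ends here

class AddressKnowledgeTree:
    def __init__(self):
        self.root = AKTNode()

    def insert(self, name, path):
        """Index one division name under its full administrative path."""
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, AKTNode())
        node.path = path

    def recall(self, text):
        """Longest-match scan: return (vocabulary knowledge, sentence knowledge)."""
        words, paths = [], []
        i = 0
        while i < len(text):
            node, j, best = self.root, i, None
            while j < len(text) and text[j] in node.children:
                node = node.children[text[j]]
                j += 1
                if node.path is not None:
                    best = (text[i:j], node.path)   # keep the longest match
            if best:
                words.append(best[0])
                paths.append(best[1])
                i += len(best[0])                   # continue after the match
            else:
                i += 1
        sentence_knowledge = ["/".join(p) for p in paths]   # spliced paths
        return words, sentence_knowledge

# Usage: divisions would come from the 2022 statistical division table.
akt = AddressKnowledgeTree()
akt.insert("太原市", ["山西省", "太原市"])
akt.insert("小店区", ["山西省", "太原市", "小店区"])
print(akt.recall("太原市小店区坞城路92号"))
```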
Further, the encoding layer in the step 1 specifically includes:
obtaining corresponding text embedding vectors $x^c = \{x^c_1, x^c_2, \ldots, x^c_n\}$ through the BERT embedding layer according to formula (1), where $x^c_n$ represents the nth character vector in the text,
$x^c = \mathrm{BERT}(S)$ (1)
the sentence knowledge $S^k = \{k_1, k_2, \ldots, k_l\}$ is regarded as a character sequence in the same way as the address text, where $k_i$ represents the ith sentence knowledge; each sentence knowledge $k_i$ consists of a plurality of characters $\{c^i_1, c^i_2, \ldots, c^i_n\}$, where $c^i_n$ represents the nth character of the ith sentence; passing every character through the BERT embedding layer gives the sentence knowledge embedding vectors $x^k_i = \{x^k_{i1}, x^k_{i2}, \ldots, x^k_{in}\}$, where $x^k_{in}$ represents the nth character vector of the ith sentence; according to formula (2), the average pooling value $\bar{x}^k_i$ is used to represent the sentence knowledge,
$\bar{x}^k_i = \frac{1}{n}\sum_{j=1}^{n} x^k_{ij}$ (2)
to avoid excessive sentence knowledge bringing noise and erroneous semantic information to the model, the semantic relevance between the character sequence and each sentence knowledge is calculated with cosine similarity according to formula (3), and the Top-N sentences are finally selected as the sentence knowledge set $K = \{\bar{k}_1, \bar{k}_2, \ldots, \bar{k}_N\}$, $N \le l$, where $\bar{k}_n$ represents the feature vector representation of the nth sentence,
$\mathrm{sim}(x^c, \bar{x}^k_i) = \frac{x^c \cdot \bar{x}^k_i}{\lVert x^c \rVert \lVert \bar{x}^k_i \rVert}$ (3)
according to formula (4), the feedforward neural network of the BERT Transformer consists of two linear transforms with a GeLU activation function,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot W_k) \cdot W_v$ (4)
where $H_i$ represents the output of Transformer layer $i$, $W_k \in \mathbb{R}^{d \times d_m}$ and $W_v \in \mathbb{R}^{d_m \times d}$ are the two linear networks in the feedforward neural network, $d_m$ represents the intermediate size of the Transformer, and $d$ represents the dimension of the hidden states; after the BERT embedding layer, the Top-N sentence knowledge $K \in \mathbb{R}^{N \times d_n}$ is obtained, where $d_n$ represents the dimension of the sentence knowledge; to inject the sentence knowledge into the ith Transformer layer, the knowledge is projected with two different linear layers linear1 and linear2 according to formulas (5) and (6), where $W^{l1}$ and $W^{l2}$ denote the weights of linear1 and linear2 respectively, and $K^k$ and $K^v$ denote the projected knowledge;
$K^k = K \cdot W^{l1}$ (5)
$K^v = K \cdot W^{l2}$ (6)
according to formula (7), the projected knowledge is stitched to the end of the linear layers, so that $K^k$ and $K^v$ are injected into the corresponding $W_k$ and $W_v$,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot [W_k; (K^k)^\top]) \cdot [W_v; K^v]$ (7)
$H_i$ can then query the corresponding values in the injected knowledge, i.e., the model can learn context information from the external sentence knowledge;
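As a concrete illustration of formulas (2)-(7), the following PyTorch sketch selects the Top-N sentence knowledge and stitches its projections onto the FFN's two linear maps. The unbatched setting, tensor shapes, and function names are assumptions made for brevity, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def select_top_n(x_c, x_k, n):
    """Formulas (2)-(3): mean-pool each sentence knowledge and keep the Top-N
    by cosine similarity to the pooled character sequence.
    x_c: (seq_len, d) character embeddings; x_k: (l, sent_len, d)."""
    query = x_c.mean(dim=0)                                       # pooled characters
    keys = x_k.mean(dim=1)                                        # eq. (2): (l, d)
    sim = F.cosine_similarity(query.unsqueeze(0), keys, dim=-1)   # eq. (3): (l,)
    idx = sim.topk(min(n, keys.size(0))).indices
    return keys[idx]                                              # K: (N, d_n)

def knowledge_ffn(h, w_k, w_v, k, w_l1, w_l2):
    """Formulas (4)-(7): project the selected knowledge (eqs. 5-6) and stitch
    it onto the ends of the FFN's two linear maps (eq. 7)."""
    k_k = k @ w_l1                                 # eq. (5): (N, d)
    k_v = k @ w_l2                                 # eq. (6): (N, d)
    w_k_ext = torch.cat([w_k, k_k.t()], dim=1)     # (d, d_m + N)
    w_v_ext = torch.cat([w_v, k_v], dim=0)         # (d_m + N, d)
    return F.gelu(h @ w_k_ext) @ w_v_ext           # eq. (7): (seq_len, d)

# Toy shapes: d=8, d_m=16, d_n=8, N=2.
x_c, x_k = torch.randn(5, 8), torch.randn(4, 6, 8)
K = select_top_n(x_c, x_k, n=2)
out = knowledge_ffn(torch.randn(5, 8), torch.randn(8, 16), torch.randn(16, 8),
                    K, torch.randn(8, 8), torch.randn(8, 8))
```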
each vocabulary knowledge acquired by the knowledge recall layer is assigned to the characters it contains, so that the character sequence is converted into a character-vocabulary pair sequence $S^{cw} = \{(c_1, w_1), (c_2, w_2), \ldots, (c_n, w_n)\}$, where $c_i$ represents the ith character in the address text and $w_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$ represents all the vocabulary knowledge matched by the ith character;
pre-trained word vectors PWV are obtained by training a word2vec model on the address vocabulary knowledge acquired from the 2022 statistical division table of the National Bureau of Statistics;
according to formulas (8) and (9), the jth vocabulary knowledge in $w_i$ of character $c_i$ is looked up in PWV to obtain the vocabulary knowledge embedding vector $x^w_{ij}$, which is aligned to the character vector dimension by a nonlinear transformation so that the vocabulary knowledge can subsequently be integrated into the model,
$x^w_{ij} = \mathrm{PWV}(w_{ij})$ (8)
$\tilde{x}^w_{ij} = W_2\,\sigma(W_1 x^w_{ij} + b_1) + b_2$ (9)
where $x^w_{ij}$ is the jth vocabulary knowledge vector representation of the ith character, $\tilde{x}^w_{ij}$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, $w_{ij}$ is the jth vocabulary knowledge of the ith character, $\sigma$ is a nonlinear activation, $b_1$ and $b_2$ are bias terms, $W_1 \in \mathbb{R}^{d \times d_w}$, $W_2 \in \mathbb{R}^{d \times d}$, and $d_w$ is the dimension of the vocabulary embedding vectors.
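A minimal sketch of formulas (8)-(9) follows, using gensim's word2vec for PWV. The toy corpus, the choice of tanh as the nonlinearity (the patent only says "nonlinear conversion"), and the class name VocabAligner are assumptions for illustration.

```python
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Train PWV on division names; corpus and hyperparameters are placeholders.
corpus = [["山西省", "太原市", "小店区"], ["山西省", "太原市", "迎泽区"]]
pwv = Word2Vec(corpus, vector_size=50, min_count=1).wv   # d_w = 50

class VocabAligner(nn.Module):
    """Formulas (8)-(9): look up each matched word in PWV, then map it to the
    character dimension d with a small nonlinear transform."""
    def __init__(self, d_w, d):
        super().__init__()
        self.lin1 = nn.Linear(d_w, d)   # W1, b1
        self.lin2 = nn.Linear(d, d)     # W2, b2

    def forward(self, words):
        vecs = torch.stack([torch.from_numpy(pwv[w]).float() for w in words])
        return self.lin2(torch.tanh(self.lin1(vecs)))   # (num_words, d)

aligner = VocabAligner(d_w=50, d=768)
aligned = aligner(["太原市", "小店区"])   # ready to fuse with BERT characters
```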
Further, the vocabulary knowledge fusion layer in the step 1 specifically includes:
inputting the character representation sequence integrated with sentence knowledge into a BiGRU to obtain more comprehensive context sequence information $h = \{h_1, h_2, \ldots, h_n\}$; the vocabulary knowledge fusion layer takes the characters and the vocabulary knowledge matched with them as two inputs, expressed as $(h_i, \tilde{x}^w_i)$, where $h_i$ is the character vector at the ith position of the context sequence information and $\tilde{x}^w_i$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, of size $m \times d$, $m$ being the total number of matched words;
the importance of each word it matches is different for each character; according to formula (10), bilinear attention is used to assign different weights to different words,
$a_i = \mathrm{softmax}(h_i W_{attn} (\tilde{x}^w_i)^\top)$ (10)
where $a_i = \{a_{i1}, a_{i2}, \ldots, a_{im}\}$, $a_{ij}$ represents the weight of the jth word of the ith character, and $W_{attn}$ represents the bilinear attention matrix; for each character, the word vectors are weighted and summed according to the weight of each matched vocabulary knowledge to obtain the vocabulary knowledge vector representation $z_i$ of the character, the calculation formula being:
$z_i = \sum_{j=1}^{m} a_{ij}\,\tilde{x}^w_{ij}$ (11)
the character vector and the weighted vocabulary vector are added to obtain the feature fusion vector $\hat{h}_i$, and finally dropout, layer normalization and residual connection are applied to obtain the final output of the fused vocabulary knowledge, the calculation formulas being:
$\hat{h}_i = h_i + z_i$ (12)
$o_i = \mathrm{LayerNorm}(h_i + \mathrm{Dropout}(\hat{h}_i))$ (13)
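The fusion step can be sketched per character as below. The module name, the exact dropout/residual ordering in formula (13), and the single-character (unbatched) interface are assumptions.

```python
import torch
import torch.nn as nn

class VocabFusion(nn.Module):
    """Sketch of formulas (10)-(13): bilinear attention over the words matched
    by one character, weighted sum, then residual + dropout + layer norm."""
    def __init__(self, d, p=0.1):
        super().__init__()
        self.w_attn = nn.Parameter(torch.empty(d, d))   # bilinear matrix W_attn
        nn.init.xavier_uniform_(self.w_attn)
        self.dropout = nn.Dropout(p)
        self.norm = nn.LayerNorm(d)

    def forward(self, h_i, x_w_i):
        # h_i: (d,) BiGRU character vector; x_w_i: (m, d) aligned word vectors.
        scores = x_w_i @ (self.w_attn @ h_i)            # eq. (10) logits: (m,)
        a_i = torch.softmax(scores, dim=0)              # word weights a_i
        z_i = (a_i.unsqueeze(1) * x_w_i).sum(dim=0)     # eq. (11): (d,)
        h_hat = h_i + z_i                               # eq. (12)
        return self.norm(h_i + self.dropout(h_hat))     # eq. (13)

fusion = VocabFusion(d=8)
print(fusion(torch.randn(8), torch.randn(3, 8)).shape)   # torch.Size([8])
```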
further, the label prediction layer in the step 1 specifically includes:
the tag prediction layer takes the sequence representation $o = \{o_1, o_2, \ldots, o_n\}$ output by the vocabulary knowledge fusion layer and performs global consistency modeling through a CRF layer, obtaining the score of each label and screening them to produce the final prediction result; the loss function of the model can be expressed as the negative of the log-likelihood, with the calculation formula:
$\mathcal{L} = -\big(\mathrm{score}(x, t) - \log \sum_{t'} \exp(\mathrm{score}(x, t'))\big)$
where $t$ represents the true tag sequence, $t'$ ranges over all tag sequences the model can predict, $\mathrm{score}(\cdot)$ computes the score of the input sequence $x = \{x_1, x_2, \ldots, x_n\}$ with a tag sequence, and $\sum_{t'} \exp(\mathrm{score}(x, t'))$ represents the sum over the scores of all possible tag sequences.
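One possible realization of this layer uses the pytorch-crf package, whose CRF module returns exactly this log-likelihood; the emission projection and class name below are assumptions.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pytorch-crf package

class TagPredictor(nn.Module):
    """Sketch of the tag prediction layer: project the fused representations to
    per-tag emission scores and let a CRF model the label transitions; the
    negative of the CRF log-likelihood is the loss above."""
    def __init__(self, d, num_tags):
        super().__init__()
        self.emit = nn.Linear(d, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, o, tags, mask):
        # o: (batch, seq, d) fusion-layer output; tags: (batch, seq) gold ids.
        return -self.crf(self.emit(o), tags, mask=mask)   # -log P(t | x)

    def predict(self, o, mask):
        return self.crf.decode(self.emit(o), mask=mask)   # best tag sequences
```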
Further, the address standardization in step 2 specifically includes: standardizing the address entities extracted in step 1 according to the labeling specification of the National Bureau of Statistics; automatically normalizing the address information into elements such as province, city, district/county, street/town, community, building, unit, floor, and room; completing data missing at any level; constructing the address hierarchy; and providing services that associate road house numbers with communities and match geographic coordinates at every level, so as to ensure the consistency and comparability of the indexing results.
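As an illustration of the element-splitting step only, the following sketch segments a raw address by suffix patterns. The suffix lists, field names, and sample address are assumptions for demonstration and do not reproduce the official labeling specification.

```python
import re

# Split a raw address into part of the hierarchy the patent lists and mark
# missing levels as candidates for completion.
LEVELS = [
    ("province", r"(.+?(?:省|自治区|直辖市|市))"),
    ("city",     r"(.+?市)"),
    ("district", r"(.+?(?:区|县|旗))"),
    ("street",   r"(.+?(?:街道|镇|乡))"),
    ("road_no",  r"(.+?(?:路|街|巷)\d*号?)"),
]

def normalize(address):
    result, rest = {}, address
    for name, pattern in LEVELS:
        m = re.match(pattern, rest)
        if m:
            result[name] = m.group(1)
            rest = rest[m.end():]
        else:
            result[name] = None     # level missing: candidate for completion
    result["detail"] = rest or None
    return result

print(normalize("山西省太原市小店区坞城路92号"))
# {'province': '山西省', 'city': '太原市', 'district': '小店区',
#  'street': None, 'road_no': '坞城路92号', 'detail': None}
```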
Further, the model fine-tuning interface in step 2 specifically includes: using a pre-trained natural language processing model and adjusting the model parameters according to the training results and the chosen optimization algorithm; optimizing the model structure according to the characteristics and requirements of the model; and training with the optimized model, fine-tuning it on a specific address-domain dataset according to the task requirements.
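A minimal sketch of such an interface, assuming the Hugging Face transformers toolkit (the patent does not prescribe one); the checkpoint, label count, and hyperparameters are placeholders.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def build_trainer(train_set, dev_set, num_labels=21):
    """Assemble a token-classification fine-tuning run; datasets are expected
    to be pre-tokenized address-domain NER examples."""
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-chinese", num_labels=num_labels)
    args = TrainingArguments(
        output_dir="address-ner",
        learning_rate=3e-5,                 # adjusted from training results
        num_train_epochs=3,
        per_device_train_batch_size=32,
    )
    return Trainer(model=model, args=args, tokenizer=tokenizer,
                   train_dataset=train_set, eval_dataset=dev_set)

# trainer = build_trainer(train_set, dev_set); trainer.train()
```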
Compared with the prior art, the invention has the following beneficial effects:
(1) By combining multiple layers of knowledge, the invention can identify address entities in address text more accurately. Experiments on a real-world dataset demonstrate the practicability and reliability of the technique, which improves recognition efficiency and reduces the cost in manpower and material resources.
(2) The invention addresses the need for fast, highly concurrent address entity identification and, for different application scenarios, realizes address standardization and an open model fine-tuning interface. Address information is converted into a unified format through address standardization and completion, the performance of the address entity identification model is further improved through the fine-tuning interface, the usability of the system is improved, and technical support is provided for improving the address standardization and model fine-tuning interfaces of subsequent place indexing systems. The invention has a significant positive effect on address entity identification in natural language processing and has good application prospects.
Drawings
FIG. 1 is a diagram of an address entity identification model;
FIG. 2 is a place indexing flow chart.
Detailed Description
The present invention will be described more fully hereinafter in order to facilitate an understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Examples
An address entity identification method based on multi-layer knowledge perception comprises the following steps:
step 1, address entity identification: constructing an address entity identification model based on multi-layer knowledge perception; as shown in fig. 1, the address entity identification model comprises: 1) a knowledge recall layer: sentence knowledge and vocabulary knowledge matched with the address text sequence are obtained from an address knowledge tree (Address Knowledge Tree, AKT); 2) an encoding layer: the character sequence and the sentence knowledge are encoded with the BERT embedding layer, the semantic relevance between the character sequence and each sentence knowledge is calculated to obtain a score for each sentence knowledge, the Top-N sentence knowledge is selected and injected into the feedforward neural network of the BERT Transformer; the vocabulary knowledge is encoded with pre-trained word vectors and aligned with the character vector dimension; 3) a vocabulary knowledge fusion layer: a global semantic representation is extracted with a BiGRU, different weights are assigned to different vocabulary knowledge using bilinear attention, and the weighted sum is fused with the character sequence; 4) a tag prediction layer: the joint representation of character sequence information and vocabulary knowledge is passed through a CRF to learn in-sentence constraints, yielding the prediction result for the address text;
the knowledge recall layer prepares related external knowledge features from two dimensions of sentence knowledge and vocabulary knowledge respectively, and specifically comprises the following steps:
(1) The address administrative divisions themselves contain tree-structured information: the hierarchical relationship reflects the type relationship of the address data, the tree structure allows fast lookup, and same-named addresses are not ambiguous. The invention obtains external address knowledge from the 2022 statistical division table of the National Bureau of Statistics and constructs the AKT according to the statistical division types;
(2) The input address text is scanned continuously in longest-match mode to find all potential vocabulary knowledge in the AKT;
(3) For each vocabulary knowledge matched by the characters, its corresponding path in the AKT is found, and the spliced paths serve as the sentence knowledge of the address text.
The coding layer specifically comprises:
obtaining corresponding text embedding vectors $x^c = \{x^c_1, x^c_2, \ldots, x^c_n\}$ through the BERT embedding layer according to formula (1), where $x^c_n$ represents the nth character vector in the text,
$x^c = \mathrm{BERT}(S)$ (1)
the sentence knowledge $S^k = \{k_1, k_2, \ldots, k_l\}$ is regarded as a character sequence in the same way as the address text, where $k_i$ represents the ith sentence knowledge; each sentence knowledge $k_i$ consists of a plurality of characters $\{c^i_1, c^i_2, \ldots, c^i_n\}$, where $c^i_n$ represents the nth character of the ith sentence; passing every character through the BERT embedding layer gives the sentence knowledge embedding vectors $x^k_i = \{x^k_{i1}, x^k_{i2}, \ldots, x^k_{in}\}$, where $x^k_{in}$ represents the nth character vector of the ith sentence; according to formula (2), the average pooling value $\bar{x}^k_i$ is used to represent the sentence knowledge,
$\bar{x}^k_i = \frac{1}{n}\sum_{j=1}^{n} x^k_{ij}$ (2)
to avoid excessive sentence knowledge bringing noise and erroneous semantic information to the model, the semantic relevance between the character sequence and each sentence knowledge is calculated with cosine similarity according to formula (3), and the Top-N sentences are finally selected as the sentence knowledge set $K = \{\bar{k}_1, \bar{k}_2, \ldots, \bar{k}_N\}$, $N \le l$, where $\bar{k}_n$ represents the feature vector representation of the nth sentence,
$\mathrm{sim}(x^c, \bar{x}^k_i) = \frac{x^c \cdot \bar{x}^k_i}{\lVert x^c \rVert \lVert \bar{x}^k_i \rVert}$ (3)
according to formula (4), the feedforward neural network of the BERT Transformer consists of two linear transforms with a GeLU activation function,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot W_k) \cdot W_v$ (4)
where $H_i$ represents the output of Transformer layer $i$, $W_k \in \mathbb{R}^{d \times d_m}$ and $W_v \in \mathbb{R}^{d_m \times d}$ are the two linear networks in the feedforward neural network, $d_m$ represents the intermediate size of the Transformer, and $d$ represents the dimension of the hidden states; after the BERT embedding layer, the Top-N sentence knowledge $K \in \mathbb{R}^{N \times d_n}$ is obtained, where $d_n$ represents the dimension of the sentence knowledge; to inject the sentence knowledge into the ith Transformer layer, the knowledge is projected with two different linear layers linear1 and linear2 according to formulas (5) and (6), where $W^{l1}$ and $W^{l2}$ denote the weights of linear1 and linear2 respectively, and $K^k$ and $K^v$ denote the projected knowledge;
$K^k = K \cdot W^{l1}$ (5)
$K^v = K \cdot W^{l2}$ (6)
according to formula (7), the projected knowledge is stitched to the end of the linear layers, so that $K^k$ and $K^v$ are injected into the corresponding $W_k$ and $W_v$,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot [W_k; (K^k)^\top]) \cdot [W_v; K^v]$ (7)
$H_i$ can then query the corresponding values in the injected knowledge, i.e., the model can learn context information from the external sentence knowledge;
assigning each vocabulary knowledge acquired by the knowledge recall layer to the characters it contains, so that the character sequence is converted into a character-vocabulary pair sequence $S^{cw} = \{(c_1, w_1), (c_2, w_2), \ldots, (c_n, w_n)\}$, where $c_i$ represents the ith character in the address text and $w_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$ represents all the vocabulary knowledge matched by the ith character;
pre-trained word vectors (Pre-Trained Word Vectors, PWV) are obtained by training a word2vec model on the address vocabulary knowledge acquired from the 2022 statistical division table of the National Bureau of Statistics;
according to formulas (8) and (9), the jth vocabulary knowledge in $w_i$ of character $c_i$ is looked up in PWV to obtain the vocabulary knowledge embedding vector $x^w_{ij}$, which is aligned to the character vector dimension by a nonlinear transformation so that the vocabulary knowledge can subsequently be integrated into the model,
$x^w_{ij} = \mathrm{PWV}(w_{ij})$ (8)
$\tilde{x}^w_{ij} = W_2\,\sigma(W_1 x^w_{ij} + b_1) + b_2$ (9)
where $x^w_{ij}$ is the jth vocabulary knowledge vector representation of the ith character, $\tilde{x}^w_{ij}$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, $w_{ij}$ is the jth vocabulary knowledge of the ith character, $\sigma$ is a nonlinear activation, $b_1$ and $b_2$ are bias terms, $W_1 \in \mathbb{R}^{d \times d_w}$, $W_2 \in \mathbb{R}^{d \times d}$, and $d_w$ is the dimension of the vocabulary embedding vectors.
The vocabulary knowledge fusion layer specifically comprises:
inputting the character representation sequence integrated with sentence knowledge into a BiGRU to obtain more comprehensive context sequence information $h = \{h_1, h_2, \ldots, h_n\}$; the vocabulary knowledge fusion layer takes the characters and the vocabulary knowledge matched with them as two inputs, expressed as $(h_i, \tilde{x}^w_i)$, where $h_i$ is the character vector at the ith position of the context sequence information and $\tilde{x}^w_i$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, of size $m \times d$, $m$ being the total number of matched words;
the importance of each word it matches is different for each character; according to formula (10), bilinear attention is used to assign different weights to different words,
$a_i = \mathrm{softmax}(h_i W_{attn} (\tilde{x}^w_i)^\top)$ (10)
where $a_i = \{a_{i1}, a_{i2}, \ldots, a_{im}\}$, $a_{ij}$ represents the weight of the jth word of the ith character, and $W_{attn}$ represents the bilinear attention matrix; for each character, the word vectors are weighted and summed according to the weight of each matched vocabulary knowledge to obtain the vocabulary knowledge vector representation $z_i$ of the character, the calculation formula being:
$z_i = \sum_{j=1}^{m} a_{ij}\,\tilde{x}^w_{ij}$ (11)
the character vector and the weighted vocabulary vector are added to obtain the feature fusion vector $\hat{h}_i$, and finally dropout, layer normalization and residual connection are applied to obtain the final output of the fused vocabulary knowledge, the calculation formulas being:
$\hat{h}_i = h_i + z_i$ (12)
$o_i = \mathrm{LayerNorm}(h_i + \mathrm{Dropout}(\hat{h}_i))$ (13)
the label prediction layer specifically comprises:
the tag prediction layer takes the sequence representation $o = \{o_1, o_2, \ldots, o_n\}$ output by the vocabulary knowledge fusion layer and performs global consistency modeling through a CRF layer, obtaining the score of each label and screening them to produce the final prediction result; the loss function of the model can be expressed as the negative of the log-likelihood, with the calculation formula:
$\mathcal{L} = -\big(\mathrm{score}(x, t) - \log \sum_{t'} \exp(\mathrm{score}(x, t'))\big)$
where $t$ represents the true tag sequence, $t'$ ranges over all tag sequences the model can predict, $\mathrm{score}(\cdot)$ computes the score of the input sequence $x = \{x_1, x_2, \ldots, x_n\}$ with a tag sequence, and $\sum_{t'} \exp(\mathrm{score}(x, t'))$ represents the sum over the scores of all possible tag sequences.
Step 2, place indexing: comprising address standardization and a model fine-tuning interface;
the address standardization is to normalize the extracted address entity according to the labeling specification of the national statistical bureau, and specifically comprises the following steps: and (3) carrying out standardization processing on the address entity extracted in the step (1) according to the labeling specification of the national statistical bureau, automatically standardizing the address information into elements such as provinces, cities, district and counties, street and town, cells, buildings, units, floors, houses and rooms, supplementing level missing data, constructing an address level relation, realizing the service of associating road house numbers with the cells and matching geographic coordinates of all levels so as to ensure consistency and comparability of indexing results. Through uniformly converting addresses in different formats and expression modes into standard formats, the location matching and association can be better carried out, and the accuracy and reliability of the indexing result are improved.
The model fine-tuning interface fine-tunes the model using a pre-trained natural language processing model, and specifically includes: using a pre-trained natural language processing model and adjusting the model parameters according to the training results and the chosen optimization algorithm; optimizing the model structure according to the characteristics and requirements of the model; and training with the optimized model, fine-tuning it on a specific address-domain dataset according to the task requirements.
The address entity identification model of the invention was compared with previous methods on the CCKS2021 address element parsing dataset; the experimental results are shown in Table 1:
TABLE 1. Experimental results on the CCKS2021 address element parsing dataset (table contents not reproduced in this text)
The experimental results on this dataset demonstrate that the method performs well in terms of the effectiveness of address entity identification.
The foregoing is merely illustrative of the present invention and is not to be construed as limiting thereof, and it is intended to cover all modifications and equivalent arrangements included within the spirit and scope of the invention.

Claims (7)

1. An address entity identification method based on multi-layer knowledge perception, characterized by comprising the following steps:
step 1, address entity identification: constructing an address entity identification model based on multi-layer knowledge perception; the address entity identification model comprises: 1) a knowledge recall layer: sentence knowledge and vocabulary knowledge matched with the address text sequence are obtained from an address knowledge tree AKT; 2) an encoding layer: the character sequence and the sentence knowledge are encoded with the BERT embedding layer, the semantic relevance between the character sequence and each sentence knowledge is calculated to obtain a score for each sentence knowledge, the Top-N sentence knowledge is selected and injected into the feedforward neural network of the BERT Transformer; the vocabulary knowledge is encoded with pre-trained word vectors and aligned with the character vector dimension; 3) a vocabulary knowledge fusion layer: a global semantic representation is extracted with a BiGRU, different weights are assigned to different vocabulary knowledge using bilinear attention, and the weighted sum is fused with the character sequence; 4) a tag prediction layer: the joint representation of character sequence information and vocabulary knowledge is passed through a CRF to learn in-sentence constraints, yielding the prediction result for the address text;
step 2, place indexing: comprising address standardization and a model fine-tuning interface; the address standardization normalizes the extracted address entities according to the labeling specification of the National Bureau of Statistics; the model fine-tuning interface fine-tunes the model using a pre-trained natural language processing model.
2. The address entity identification method based on multi-layer knowledge perception according to claim 1, wherein the knowledge recall layer in step 1 specifically comprises:
(1) external address knowledge is acquired from the 2022 statistical division table of the National Bureau of Statistics, and the AKT is constructed according to the statistical division types;
(2) the input address text is scanned continuously in longest-match mode to find all potential vocabulary knowledge in the AKT;
(3) for each vocabulary knowledge matched by the characters, its corresponding path in the AKT is found, and the spliced paths serve as the sentence knowledge of the address text.
3. The method for identifying an address entity based on multi-layer knowledge sensing as claimed in claim 1, wherein the coding layer in step 1 specifically comprises:
obtaining corresponding text embedding vectors $x^c = \{x^c_1, x^c_2, \ldots, x^c_n\}$ through the BERT embedding layer according to formula (1), where $x^c_n$ represents the nth character vector in the text,
$x^c = \mathrm{BERT}(S)$ (1)
the sentence knowledge $S^k = \{k_1, k_2, \ldots, k_l\}$ is regarded as a character sequence in the same way as the address text, where $k_i$ represents the ith sentence knowledge; each sentence knowledge $k_i$ consists of a plurality of characters $\{c^i_1, c^i_2, \ldots, c^i_n\}$, where $c^i_n$ represents the nth character of the ith sentence; passing every character through the BERT embedding layer gives the sentence knowledge embedding vectors $x^k_i = \{x^k_{i1}, x^k_{i2}, \ldots, x^k_{in}\}$, where $x^k_{in}$ represents the nth character vector of the ith sentence; according to formula (2), the average pooling value $\bar{x}^k_i$ is used to represent the sentence knowledge,
$\bar{x}^k_i = \frac{1}{n}\sum_{j=1}^{n} x^k_{ij}$ (2)
to avoid excessive sentence knowledge bringing noise and erroneous semantic information to the model, the semantic relevance between the character sequence and each sentence knowledge is calculated with cosine similarity according to formula (3), and the Top-N sentences are finally selected as the sentence knowledge set $K = \{\bar{k}_1, \bar{k}_2, \ldots, \bar{k}_N\}$, $N \le l$, where $\bar{k}_n$ represents the feature vector representation of the nth sentence,
$\mathrm{sim}(x^c, \bar{x}^k_i) = \frac{x^c \cdot \bar{x}^k_i}{\lVert x^c \rVert \lVert \bar{x}^k_i \rVert}$ (3)
according to formula (4), the feedforward neural network of the BERT Transformer consists of two linear transforms with a GeLU activation function,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot W_k) \cdot W_v$ (4)
where $H_i$ represents the output of Transformer layer $i$, $W_k \in \mathbb{R}^{d \times d_m}$ and $W_v \in \mathbb{R}^{d_m \times d}$ are the two linear networks in the feedforward neural network, $d_m$ represents the intermediate size of the Transformer, and $d$ represents the dimension of the hidden states; after the BERT embedding layer, the Top-N sentence knowledge $K \in \mathbb{R}^{N \times d_n}$ is obtained, where $d_n$ represents the dimension of the sentence knowledge; to inject the sentence knowledge into the ith Transformer layer, the knowledge is projected with two different linear layers linear1 and linear2 according to formulas (5) and (6), where $W^{l1}$ and $W^{l2}$ denote the weights of linear1 and linear2 respectively, and $K^k$ and $K^v$ denote the projected knowledge;
$K^k = K \cdot W^{l1}$ (5)
$K^v = K \cdot W^{l2}$ (6)
according to formula (7), the projected knowledge is stitched to the end of the linear layers, so that $K^k$ and $K^v$ are injected into the corresponding $W_k$ and $W_v$,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot [W_k; (K^k)^\top]) \cdot [W_v; K^v]$ (7)
$H_i$ can then query the corresponding values in the injected knowledge, i.e., the model can learn context information from the external sentence knowledge;
each vocabulary knowledge acquired by the knowledge recall layer is assigned to the characters it contains, so that the character sequence is converted into a character-vocabulary pair sequence $S^{cw} = \{(c_1, w_1), (c_2, w_2), \ldots, (c_n, w_n)\}$, where $c_i$ represents the ith character in the address text and $w_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$ represents all the vocabulary knowledge matched by the ith character;
pre-trained word vectors PWV are obtained by training a word2vec model on the address vocabulary knowledge acquired from the 2022 statistical division table of the National Bureau of Statistics;
according to formulas (8) and (9), the jth vocabulary knowledge in $w_i$ of character $c_i$ is looked up in PWV to obtain the vocabulary knowledge embedding vector $x^w_{ij}$, which is aligned to the character vector dimension by a nonlinear transformation so that the vocabulary knowledge can subsequently be integrated into the model,
$x^w_{ij} = \mathrm{PWV}(w_{ij})$ (8)
$\tilde{x}^w_{ij} = W_2\,\sigma(W_1 x^w_{ij} + b_1) + b_2$ (9)
where $x^w_{ij}$ is the jth vocabulary knowledge vector representation of the ith character, $\tilde{x}^w_{ij}$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, $w_{ij}$ is the jth vocabulary knowledge of the ith character, $\sigma$ is a nonlinear activation, $b_1$ and $b_2$ are bias terms, $W_1 \in \mathbb{R}^{d \times d_w}$, $W_2 \in \mathbb{R}^{d \times d}$, and $d_w$ is the dimension of the vocabulary embedding vectors.
4. The method for identifying an address entity based on multi-layer knowledge sensing as claimed in claim 1, wherein the vocabulary knowledge fusion layer in step 1 specifically comprises:
inputting the character representation sequence integrated with sentence knowledge into a BiGRU to obtain more comprehensive context sequence information $h = \{h_1, h_2, \ldots, h_n\}$; the vocabulary knowledge fusion layer takes the characters and the vocabulary knowledge matched with them as two inputs, expressed as $(h_i, \tilde{x}^w_i)$, where $h_i$ is the character vector at the ith position of the context sequence information and $\tilde{x}^w_i$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, of size $m \times d$, $m$ being the total number of matched words;
the importance of each word it matches is different for each character; according to formula (10), bilinear attention is used to assign different weights to different words,
$a_i = \mathrm{softmax}(h_i W_{attn} (\tilde{x}^w_i)^\top)$ (10)
where $a_i = \{a_{i1}, a_{i2}, \ldots, a_{im}\}$, $a_{ij}$ represents the weight of the jth word of the ith character, and $W_{attn}$ represents the bilinear attention matrix; for each character, the word vectors are weighted and summed according to the weight of each matched vocabulary knowledge to obtain the vocabulary knowledge vector representation $z_i$ of the character, the calculation formula being:
$z_i = \sum_{j=1}^{m} a_{ij}\,\tilde{x}^w_{ij}$ (11)
the character vector and the weighted vocabulary vector are added to obtain the feature fusion vector $\hat{h}_i$, and finally dropout, layer normalization and residual connection are applied to obtain the final output of the fused vocabulary knowledge, the calculation formulas being:
$\hat{h}_i = h_i + z_i$ (12)
$o_i = \mathrm{LayerNorm}(h_i + \mathrm{Dropout}(\hat{h}_i))$ (13)
5. the method for identifying an address entity based on multi-layer knowledge sensing according to claim 1, wherein the label prediction layer in step 1 specifically comprises:
the tag prediction layer takes the sequence representation $o = \{o_1, o_2, \ldots, o_n\}$ output by the vocabulary knowledge fusion layer and performs global consistency modeling through a CRF layer, obtaining the score of each label and screening them to produce the final prediction result; the loss function of the model can be expressed as the negative of the log-likelihood, with the calculation formula:
$\mathcal{L} = -\big(\mathrm{score}(x, t) - \log \sum_{t'} \exp(\mathrm{score}(x, t'))\big)$
where $t$ represents the true tag sequence, $t'$ ranges over all tag sequences the model can predict, $\mathrm{score}(\cdot)$ computes the score of the input sequence $x = \{x_1, x_2, \ldots, x_n\}$ with a tag sequence, and $\sum_{t'} \exp(\mathrm{score}(x, t'))$ represents the sum over the scores of all possible tag sequences.
6. The address entity identification method based on multi-layer knowledge perception according to claim 1, wherein the address normalization in step 2 specifically comprises: standardizing the address entities extracted in step 1 according to the labeling specification of the National Bureau of Statistics; automatically normalizing the address information into elements such as province, city, district/county, street/town, community, building, unit, floor, and room; completing data missing at any level; constructing the address hierarchy; and providing services that associate road house numbers with communities and match geographic coordinates at every level, so as to ensure the consistency and comparability of the indexing results.
7. The address entity identification method based on multi-layer knowledge perception according to claim 1, wherein the model fine-tuning interface in step 2 specifically comprises: using a pre-trained natural language processing model and adjusting the model parameters according to the training results and the chosen optimization algorithm; optimizing the model structure according to the characteristics and requirements of the model; and training with the optimized model, fine-tuning it on a specific address-domain dataset according to the task requirements.
CN202311110916.8A 2023-08-30 2023-08-30 Address entity identification method based on multi-layer knowledge perception Pending CN117010398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311110916.8A CN117010398A (en) 2023-08-30 2023-08-30 Address entity identification method based on multi-layer knowledge perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311110916.8A CN117010398A (en) 2023-08-30 2023-08-30 Address entity identification method based on multi-layer knowledge perception

Publications (1)

Publication Number Publication Date
CN117010398A true CN117010398A (en) 2023-11-07

Family

ID=88565568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311110916.8A Pending CN117010398A (en) 2023-08-30 2023-08-30 Address entity identification method based on multi-layer knowledge perception

Country Status (1)

Country Link
CN (1) CN117010398A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933372A * 2024-03-22 2024-04-26 Shandong University Data enhancement-oriented vocabulary combined knowledge modeling method and device
CN117933372B * 2024-03-22 2024-06-07 Shandong University Data enhancement-oriented vocabulary combined knowledge modeling method and device

Similar Documents

Publication Publication Date Title
CN109508462B (en) Neural network Mongolian Chinese machine translation method based on encoder-decoder
CN110929030B (en) Text abstract and emotion classification combined training method
CN112329467B (en) Address recognition method and device, electronic equipment and storage medium
Lyu et al. Let: Linguistic knowledge enhanced graph transformer for chinese short text matching
CN114048350A (en) Text-video retrieval method based on fine-grained cross-modal alignment model
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN117290489B (en) Method and system for quickly constructing industry question-answer knowledge base
CN112100332A (en) Word embedding expression learning method and device and text recall method and device
CN115080694A (en) Power industry information analysis method and equipment based on knowledge graph
CN116737759B (en) Method for generating SQL sentence by Chinese query based on relation perception attention
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
Jian et al. [Retracted] LSTM‐Based Attentional Embedding for English Machine Translation
CN111488455A (en) Model training method, text classification method, system, device and medium
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN117010398A (en) Address entity identification method based on multi-layer knowledge perception
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN114091454A (en) Method for extracting place name information and positioning space in internet text
Ma et al. Improving Chinese spell checking with bidirectional LSTMs and confusionset-based decision network
CN111382333B (en) Case element extraction method in news text sentence based on case correlation joint learning and graph convolution
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network
CN116955594A (en) Semantic fusion pre-training model construction method and cross-language abstract generation method and system
CN113807102B (en) Method, device, equipment and computer storage medium for establishing semantic representation model
Liang et al. Hierarchical hybrid code networks for task-oriented dialogue
CN115393849A (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination