CN117010398A - Address entity identification method based on multi-layer knowledge perception - Google Patents

Address entity identification method based on multi-layer knowledge perception

Info

Publication number
CN117010398A
Authority
CN
China
Prior art keywords
knowledge
address
character
layer
vocabulary
Prior art date
Legal status
Pending
Application number
CN202311110916.8A
Other languages
Chinese (zh)
Inventor
李茹
高俊杰
邵文远
谭红叶
张虎
闫智超
苏雪峰
张越
梁吉业
Current Assignee
Shanxi University
Original Assignee
Shanxi University
Priority date
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Priority to CN202311110916.8A priority Critical patent/CN117010398A/en
Publication of CN117010398A publication Critical patent/CN117010398A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the field of natural language processing, and in particular relates to an address entity identification method based on multi-layer knowledge perception. The method combines address entity identification with the application scenarios that arise during knowledge graph construction. From the perspective of address entity identification, an address knowledge tree is constructed according to the characteristics of address text, and external knowledge at the sentence and vocabulary levels related to the input text is obtained from the tree, helping a general model learn domain-specific address knowledge, thereby enhancing address entity identification capability and improving model accuracy. In addition, both research value and practicability are taken into account: from an engineering point of view, the method can be integrated into an existing place indexing system, improving the usability of that system. The performance of the address entity identification model is further improved through a fine-tuning interface, and address information is converted into a unified format through address standardization and completion.

Description

Address entity identification method based on multi-layer knowledge perception
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to an address entity identification method based on multi-layer knowledge perception.
Background
Early address named entity recognition methods relied primarily on rule- and dictionary-based approaches. Shaalan et al. constructed rule templates using a variety of features such as vocabulary, syntax, and semantics, and used place-name dictionary matching to identify address entities appearing in text. JA et al. used a place-name dictionary to match Twitter text and identify address entities. However, such methods require manual formulation of the corresponding rule templates and dictionaries, consume a great deal of time, and require the dictionaries to be updated periodically.
As machine learning techniques developed, researchers used machine learning methods to learn feature weights from labeled training data and identified entities by predicting the most likely tag sequence. Bikel et al. modeled text sequence data with an HMM to calculate the probability that each word belongs to a given entity type, thereby predicting the entity type of each word. Mena et al. proposed a hybrid place-name extraction method based on an HMM and an SVM: place names are extracted with the HMM, disambiguated by matching against a gazetteer, and the final result is obtained through the SVM. Sobhana et al. proposed a CRF-based model that uses word context information to predict address entities. However, methods based on statistical machine learning lack generalization capability and require manually engineered features, which depends on the knowledge and experience of domain experts and consumes considerable manpower.
In recent years, deep learning-based methods have become the mainstream for NER tasks. They can learn feature representations automatically through end-to-end training and generalize better. Collobert et al. proposed a word-level neural network model that processes word vector features with convolutional layers and incorporates a CRF for entity recognition. Huang et al. proposed the BiLSTM-CRF sequence labeling model, which significantly improved the accuracy of named entity recognition. Yang et al. applied a convolutional layer with a fixed window size over the character embedding layer for entity recognition. Liu et al. proposed the WC-LSTM model, which represents text as character-word pairs, thereby integrating word information into each character.
Although deep learning-based methods can effectively improve NER performance, context information is easily lost during training. Some studies capture rich contextual information and linguistic representations via pre-trained models (e.g., BERT, ERNIE), further enhancing model performance. Li et al. proposed the BERT-IDCNN-CRF Chinese NER model, which obtains contextual word representations from a BERT pre-trained model and feeds the word vector sequence into an IDCNN-CRF model for training. Liu et al. helped the model understand domain knowledge by introducing knowledge-graph-structured knowledge into the pre-trained language model. Sun et al. used multi-task learning and knowledge distillation for model training, and used entity relationship predictions to assist the NER results, further improving NER performance. The study by Seyler et al. demonstrated the importance of external knowledge for named entity recognition.
Although the above works achieve good results, two problems remain: 1) compared with the general domain, the address domain has many more entity types and instances, and the similarity between address entities is high, making them difficult to identify; 2) text in the address domain is written freely, and shorthand, homonyms/aliases, and unregistered place names are common, making it harder to determine entity boundaries and types.
Disclosure of Invention
Based on the problems and shortcomings of the prior art, the invention combines address entity identification with the application scenarios it faces, and designs an address entity identification method based on multi-layer knowledge perception, together with address standardization and a model fine-tuning interface. For address entity identification, an address knowledge tree is constructed according to the characteristics of address text, and external knowledge at the sentence and vocabulary levels related to the input text is obtained from the tree, helping a general model learn domain-specific address knowledge, enhancing address entity identification capability and improving model accuracy. In addition, both research value and practicability are taken into account: from an engineering point of view, the method can be integrated into an existing place indexing system, improving the usability of the system and accelerating the construction of an address database.
In order to achieve the above purpose, the present invention is realized by the following technical scheme:
an address entity identification method based on multi-layer knowledge perception comprises the following steps:
step 1, address entity identification: constructing an address entity identification model based on multi-layer knowledge perception; the address entity identification model comprises: 1) a knowledge recall layer: sentence knowledge and vocabulary knowledge matched with the address text sequence are obtained from an address knowledge tree AKT; 2) an encoding layer: the character sequence and the sentence knowledge are encoded with the BERT embedding layer, the semantic relevance between the character sequence and each sentence knowledge is calculated to obtain a score for each sentence knowledge, the Top-N sentence knowledge is selected and injected into the feedforward neural network of the BERT Transformer; the vocabulary knowledge is encoded with pre-trained word vectors and aligned with the character vector dimension; 3) a vocabulary knowledge fusion layer: a global semantic representation is extracted with a BiGRU, different weights are assigned to different vocabulary knowledge using bilinear attention, and the weighted sum is fused with the character sequence; 4) a tag prediction layer: the joint representation of character sequence information and vocabulary knowledge is passed through a CRF to learn in-sentence constraints, yielding the prediction result for the address text;
step 2, place indexing: comprising address standardization and a model fine-tuning interface; the address standardization normalizes the extracted address entities according to the labeling specification of the National Bureau of Statistics; the model fine-tuning interface fine-tunes the model using a pre-trained natural language processing model.
Further, the knowledge recall layer in the step 1 specifically includes:
(1) External address knowledge is acquired from the 2022 statistical division table of the National Bureau of Statistics, and the AKT is constructed according to the statistical division types;
(2) The input address text is scanned continuously in longest-match mode to find all potential vocabulary knowledge in the AKT;
(3) For each vocabulary knowledge matched by the characters, its corresponding path in the AKT is found, and the spliced paths serve as the sentence knowledge of the address text.
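The recall procedure above can be pictured as a longest-match scan over a trie of administrative divisions. The following is a minimal sketch only: the class names (AddressKnowledgeTree, AKTNode), the "/" path separator, and the sample divisions are assumptions for illustration, not the patent's implementation.

```python
# Minimal sketch of the knowledge recall layer over a trie-shaped AKT.
class AKTNode:
    def __init__(self):
        self.children = {}   # next character -> AKTNode
        self.path = None     # full division path if a division name ends here

class AddressKnowledgeTree:
    def __init__(self):
        self.root = AKTNode()

    def insert(self, name, path):
        """Index one division name under its full administrative path."""
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, AKTNode())
        node.path = path

    def recall(self, text):
        """Longest-match scan: return (vocabulary knowledge, sentence knowledge)."""
        words, paths = [], []
        i = 0
        while i < len(text):
            node, j, best = self.root, i, None
            while j < len(text) and text[j] in node.children:
                node = node.children[text[j]]
                j += 1
                if node.path is not None:
                    best = (text[i:j], node.path)   # keep the longest match
            if best:
                words.append(best[0])
                paths.append(best[1])
                i += len(best[0])                   # continue after the match
            else:
                i += 1
        sentence_knowledge = ["/".join(p) for p in paths]   # spliced paths
        return words, sentence_knowledge

# Usage: divisions would come from the 2022 statistical division table.
akt = AddressKnowledgeTree()
akt.insert("太原市", ["山西省", "太原市"])
akt.insert("小店区", ["山西省", "太原市", "小店区"])
print(akt.recall("太原市小店区坞城路92号"))
```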
Further, the encoding layer in the step 1 specifically includes:
obtaining corresponding text embedding vectors $x^c = \{x^c_1, x^c_2, \ldots, x^c_n\}$ through the BERT embedding layer according to formula (1), where $x^c_n$ represents the nth character vector in the text,
$x^c = \mathrm{BERT}(S)$ (1)
the sentence knowledge $S^k = \{k_1, k_2, \ldots, k_l\}$ is regarded as a character sequence in the same way as the address text, where $k_i$ represents the ith sentence knowledge; each sentence knowledge $k_i$ consists of a plurality of characters $\{c^i_1, c^i_2, \ldots, c^i_n\}$, where $c^i_n$ represents the nth character of the ith sentence; passing every character through the BERT embedding layer gives the sentence knowledge embedding vectors $x^k_i = \{x^k_{i1}, x^k_{i2}, \ldots, x^k_{in}\}$, where $x^k_{in}$ represents the nth character vector of the ith sentence; according to formula (2), the average pooling value $\bar{x}^k_i$ is used to represent the sentence knowledge,
$\bar{x}^k_i = \frac{1}{n}\sum_{j=1}^{n} x^k_{ij}$ (2)
to avoid excessive sentence knowledge bringing noise and erroneous semantic information to the model, the semantic relevance between the character sequence and each sentence knowledge is calculated with cosine similarity according to formula (3), and the Top-N sentences are finally selected as the sentence knowledge set $K = \{\bar{k}_1, \bar{k}_2, \ldots, \bar{k}_N\}$, $N \le l$, where $\bar{k}_n$ represents the feature vector representation of the nth sentence,
$\mathrm{sim}(x^c, \bar{x}^k_i) = \frac{x^c \cdot \bar{x}^k_i}{\lVert x^c \rVert \lVert \bar{x}^k_i \rVert}$ (3)
according to formula (4), the feedforward neural network of the BERT Transformer consists of two linear transforms with a GeLU activation function,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot W_k) \cdot W_v$ (4)
where $H_i$ represents the output of Transformer layer $i$, $W_k \in \mathbb{R}^{d \times d_m}$ and $W_v \in \mathbb{R}^{d_m \times d}$ are the two linear networks in the feedforward neural network, $d_m$ represents the intermediate size of the Transformer, and $d$ represents the dimension of the hidden states; after the BERT embedding layer, the Top-N sentence knowledge $K \in \mathbb{R}^{N \times d_n}$ is obtained, where $d_n$ represents the dimension of the sentence knowledge; to inject the sentence knowledge into the ith Transformer layer, the knowledge is projected with two different linear layers linear1 and linear2 according to formulas (5) and (6), where $W^{l1}$ and $W^{l2}$ denote the weights of linear1 and linear2 respectively, and $K^k$ and $K^v$ denote the projected knowledge;
$K^k = K \cdot W^{l1}$ (5)
$K^v = K \cdot W^{l2}$ (6)
according to formula (7), the projected knowledge is stitched to the end of the linear layers, so that $K^k$ and $K^v$ are injected into the corresponding $W_k$ and $W_v$,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot [W_k; (K^k)^\top]) \cdot [W_v; K^v]$ (7)
$H_i$ can then query the corresponding values in the injected knowledge, i.e., the model can learn context information from the external sentence knowledge;
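As a concrete illustration of formulas (2)-(7), the following PyTorch sketch selects the Top-N sentence knowledge and stitches its projections onto the FFN's two linear maps. The unbatched setting, tensor shapes, and function names are assumptions made for brevity, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def select_top_n(x_c, x_k, n):
    """Formulas (2)-(3): mean-pool each sentence knowledge and keep the Top-N
    by cosine similarity to the pooled character sequence.
    x_c: (seq_len, d) character embeddings; x_k: (l, sent_len, d)."""
    query = x_c.mean(dim=0)                                       # pooled characters
    keys = x_k.mean(dim=1)                                        # eq. (2): (l, d)
    sim = F.cosine_similarity(query.unsqueeze(0), keys, dim=-1)   # eq. (3): (l,)
    idx = sim.topk(min(n, keys.size(0))).indices
    return keys[idx]                                              # K: (N, d_n)

def knowledge_ffn(h, w_k, w_v, k, w_l1, w_l2):
    """Formulas (4)-(7): project the selected knowledge (eqs. 5-6) and stitch
    it onto the ends of the FFN's two linear maps (eq. 7)."""
    k_k = k @ w_l1                                 # eq. (5): (N, d)
    k_v = k @ w_l2                                 # eq. (6): (N, d)
    w_k_ext = torch.cat([w_k, k_k.t()], dim=1)     # (d, d_m + N)
    w_v_ext = torch.cat([w_v, k_v], dim=0)         # (d_m + N, d)
    return F.gelu(h @ w_k_ext) @ w_v_ext           # eq. (7): (seq_len, d)

# Toy shapes: d=8, d_m=16, d_n=8, N=2.
x_c, x_k = torch.randn(5, 8), torch.randn(4, 6, 8)
K = select_top_n(x_c, x_k, n=2)
out = knowledge_ffn(torch.randn(5, 8), torch.randn(8, 16), torch.randn(16, 8),
                    K, torch.randn(8, 8), torch.randn(8, 8))
```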
each vocabulary knowledge acquired by the knowledge recall layer is assigned to the characters it contains, so that the character sequence is converted into a character-vocabulary pair sequence $S^{cw} = \{(c_1, w_1), (c_2, w_2), \ldots, (c_n, w_n)\}$, where $c_i$ represents the ith character in the address text and $w_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$ represents all the vocabulary knowledge matched by the ith character;
pre-trained word vectors PWV are obtained by training a word2vec model on the address vocabulary knowledge acquired from the 2022 statistical division table of the National Bureau of Statistics;
according to formulas (8) and (9), the jth vocabulary knowledge in $w_i$ of character $c_i$ is looked up in PWV to obtain the vocabulary knowledge embedding vector $x^w_{ij}$, which is aligned to the character vector dimension by a nonlinear transformation so that the vocabulary knowledge can subsequently be integrated into the model,
$x^w_{ij} = \mathrm{PWV}(w_{ij})$ (8)
$\tilde{x}^w_{ij} = W_2\,\sigma(W_1 x^w_{ij} + b_1) + b_2$ (9)
where $x^w_{ij}$ is the jth vocabulary knowledge vector representation of the ith character, $\tilde{x}^w_{ij}$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, $w_{ij}$ is the jth vocabulary knowledge of the ith character, $\sigma$ is a nonlinear activation, $b_1$ and $b_2$ are bias terms, $W_1 \in \mathbb{R}^{d \times d_w}$, $W_2 \in \mathbb{R}^{d \times d}$, and $d_w$ is the dimension of the vocabulary embedding vectors.
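A minimal sketch of formulas (8)-(9) follows, using gensim's word2vec for PWV. The toy corpus, the choice of tanh as the nonlinearity (the patent only says "nonlinear conversion"), and the class name VocabAligner are assumptions for illustration.

```python
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Train PWV on division names; corpus and hyperparameters are placeholders.
corpus = [["山西省", "太原市", "小店区"], ["山西省", "太原市", "迎泽区"]]
pwv = Word2Vec(corpus, vector_size=50, min_count=1).wv   # d_w = 50

class VocabAligner(nn.Module):
    """Formulas (8)-(9): look up each matched word in PWV, then map it to the
    character dimension d with a small nonlinear transform."""
    def __init__(self, d_w, d):
        super().__init__()
        self.lin1 = nn.Linear(d_w, d)   # W1, b1
        self.lin2 = nn.Linear(d, d)     # W2, b2

    def forward(self, words):
        vecs = torch.stack([torch.from_numpy(pwv[w]).float() for w in words])
        return self.lin2(torch.tanh(self.lin1(vecs)))   # (num_words, d)

aligner = VocabAligner(d_w=50, d=768)
aligned = aligner(["太原市", "小店区"])   # ready to fuse with BERT characters
```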
Further, the vocabulary knowledge fusion layer in the step 1 specifically includes:
inputting the character representation sequence integrated with sentence knowledge into a BiGRU to obtain more comprehensive context sequence information $h = \{h_1, h_2, \ldots, h_n\}$; the vocabulary knowledge fusion layer takes the characters and the vocabulary knowledge matched with them as two inputs, expressed as $(h_i, \tilde{x}^w_i)$, where $h_i$ is the character vector at the ith position of the context sequence information and $\tilde{x}^w_i$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, of size $m \times d$, $m$ being the total number of matched words;
the importance of each word it matches is different for each character; according to formula (10), bilinear attention is used to assign different weights to different words,
$a_i = \mathrm{softmax}(h_i W_{attn} (\tilde{x}^w_i)^\top)$ (10)
where $a_i = \{a_{i1}, a_{i2}, \ldots, a_{im}\}$, $a_{ij}$ represents the weight of the jth word of the ith character, and $W_{attn}$ represents the bilinear attention matrix; for each character, the word vectors are weighted and summed according to the weight of each matched vocabulary knowledge to obtain the vocabulary knowledge vector representation $z_i$ of the character, the calculation formula being:
$z_i = \sum_{j=1}^{m} a_{ij}\,\tilde{x}^w_{ij}$ (11)
the character vector and the weighted vocabulary vector are added to obtain the feature fusion vector $\hat{h}_i$, and finally dropout, layer normalization and residual connection are applied to obtain the final output of the fused vocabulary knowledge, the calculation formulas being:
$\hat{h}_i = h_i + z_i$ (12)
$o_i = \mathrm{LayerNorm}(h_i + \mathrm{Dropout}(\hat{h}_i))$ (13)
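The fusion step can be sketched per character as below. The module name, the exact dropout/residual ordering in formula (13), and the single-character (unbatched) interface are assumptions.

```python
import torch
import torch.nn as nn

class VocabFusion(nn.Module):
    """Sketch of formulas (10)-(13): bilinear attention over the words matched
    by one character, weighted sum, then residual + dropout + layer norm."""
    def __init__(self, d, p=0.1):
        super().__init__()
        self.w_attn = nn.Parameter(torch.empty(d, d))   # bilinear matrix W_attn
        nn.init.xavier_uniform_(self.w_attn)
        self.dropout = nn.Dropout(p)
        self.norm = nn.LayerNorm(d)

    def forward(self, h_i, x_w_i):
        # h_i: (d,) BiGRU character vector; x_w_i: (m, d) aligned word vectors.
        scores = x_w_i @ (self.w_attn @ h_i)            # eq. (10) logits: (m,)
        a_i = torch.softmax(scores, dim=0)              # word weights a_i
        z_i = (a_i.unsqueeze(1) * x_w_i).sum(dim=0)     # eq. (11): (d,)
        h_hat = h_i + z_i                               # eq. (12)
        return self.norm(h_i + self.dropout(h_hat))     # eq. (13)

fusion = VocabFusion(d=8)
print(fusion(torch.randn(8), torch.randn(3, 8)).shape)   # torch.Size([8])
```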
further, the label prediction layer in the step 1 specifically includes:
the tag prediction layer takes the sequence representation $o = \{o_1, o_2, \ldots, o_n\}$ output by the vocabulary knowledge fusion layer and performs global consistency modeling through a CRF layer, obtaining the score of each label and screening them to produce the final prediction result; the loss function of the model can be expressed as the negative of the log-likelihood, with the calculation formula:
$\mathcal{L} = -\big(\mathrm{score}(x, t) - \log \sum_{t'} \exp(\mathrm{score}(x, t'))\big)$
where $t$ represents the true tag sequence, $t'$ ranges over all tag sequences the model can predict, $\mathrm{score}(\cdot)$ computes the score of the input sequence $x = \{x_1, x_2, \ldots, x_n\}$ with a tag sequence, and $\sum_{t'} \exp(\mathrm{score}(x, t'))$ represents the sum over the scores of all possible tag sequences.
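One possible realization of this layer uses the pytorch-crf package, whose CRF module returns exactly this log-likelihood; the emission projection and class name below are assumptions.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pytorch-crf package

class TagPredictor(nn.Module):
    """Sketch of the tag prediction layer: project the fused representations to
    per-tag emission scores and let a CRF model the label transitions; the
    negative of the CRF log-likelihood is the loss above."""
    def __init__(self, d, num_tags):
        super().__init__()
        self.emit = nn.Linear(d, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, o, tags, mask):
        # o: (batch, seq, d) fusion-layer output; tags: (batch, seq) gold ids.
        return -self.crf(self.emit(o), tags, mask=mask)   # -log P(t | x)

    def predict(self, o, mask):
        return self.crf.decode(self.emit(o), mask=mask)   # best tag sequences
```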
Further, the address standardization in step 2 specifically includes: standardizing the address entities extracted in step 1 according to the labeling specification of the National Bureau of Statistics; automatically normalizing the address information into elements such as province, city, district/county, street/town, community, building, unit, floor, and room; completing data missing at any level; constructing the address hierarchy; and providing services that associate road house numbers with communities and match geographic coordinates at every level, so as to ensure the consistency and comparability of the indexing results.
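As an illustration of the element-splitting step only, the following sketch segments a raw address by suffix patterns. The suffix lists, field names, and sample address are assumptions for demonstration and do not reproduce the official labeling specification.

```python
import re

# Split a raw address into part of the hierarchy the patent lists and mark
# missing levels as candidates for completion.
LEVELS = [
    ("province", r"(.+?(?:省|自治区|直辖市|市))"),
    ("city",     r"(.+?市)"),
    ("district", r"(.+?(?:区|县|旗))"),
    ("street",   r"(.+?(?:街道|镇|乡))"),
    ("road_no",  r"(.+?(?:路|街|巷)\d*号?)"),
]

def normalize(address):
    result, rest = {}, address
    for name, pattern in LEVELS:
        m = re.match(pattern, rest)
        if m:
            result[name] = m.group(1)
            rest = rest[m.end():]
        else:
            result[name] = None     # level missing: candidate for completion
    result["detail"] = rest or None
    return result

print(normalize("山西省太原市小店区坞城路92号"))
# {'province': '山西省', 'city': '太原市', 'district': '小店区',
#  'street': None, 'road_no': '坞城路92号', 'detail': None}
```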
Further, the model fine-tuning interface in step 2 specifically includes: using a pre-trained natural language processing model and adjusting the model parameters according to the training results and the chosen optimization algorithm; optimizing the model structure according to the characteristics and requirements of the model; and training with the optimized model, fine-tuning it on a specific address-domain dataset according to the task requirements.
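A minimal sketch of such an interface, assuming the Hugging Face transformers toolkit (the patent does not prescribe one); the checkpoint, label count, and hyperparameters are placeholders.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def build_trainer(train_set, dev_set, num_labels=21):
    """Assemble a token-classification fine-tuning run; datasets are expected
    to be pre-tokenized address-domain NER examples."""
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-chinese", num_labels=num_labels)
    args = TrainingArguments(
        output_dir="address-ner",
        learning_rate=3e-5,                 # adjusted from training results
        num_train_epochs=3,
        per_device_train_batch_size=32,
    )
    return Trainer(model=model, args=args, tokenizer=tokenizer,
                   train_dataset=train_set, eval_dataset=dev_set)

# trainer = build_trainer(train_set, dev_set); trainer.train()
```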
Compared with the prior art, the invention has the following beneficial effects:
(1) By combining multiple layers of knowledge, the invention can identify address entities in address text more accurately. Experiments on a real-world dataset demonstrate the practicability and reliability of the technique, which improves recognition efficiency and reduces the cost in manpower and material resources.
(2) The invention addresses the need for fast, highly concurrent address entity identification and, for different application scenarios, realizes address standardization and an open model fine-tuning interface. Address information is converted into a unified format through address standardization and completion, the performance of the address entity identification model is further improved through the fine-tuning interface, the usability of the system is improved, and technical support is provided for improving the address standardization and model fine-tuning interfaces of subsequent place indexing systems. The invention has a significant positive effect on address entity identification in natural language processing and has good application prospects.
Drawings
FIG. 1 is a diagram of an address entity identification model;
FIG. 2 is a place indexing flow chart.
Detailed Description
The present invention will be described more fully hereinafter in order to facilitate an understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Examples
An address entity identification method based on multi-layer knowledge perception comprises the following steps:
step 1, address entity identification: constructing an address entity identification model based on multi-layer knowledge perception; as shown in fig. 1, the address entity identification model comprises: 1) a knowledge recall layer: sentence knowledge and vocabulary knowledge matched with the address text sequence are obtained from an address knowledge tree (Address Knowledge Tree, AKT); 2) an encoding layer: the character sequence and the sentence knowledge are encoded with the BERT embedding layer, the semantic relevance between the character sequence and each sentence knowledge is calculated to obtain a score for each sentence knowledge, the Top-N sentence knowledge is selected and injected into the feedforward neural network of the BERT Transformer; the vocabulary knowledge is encoded with pre-trained word vectors and aligned with the character vector dimension; 3) a vocabulary knowledge fusion layer: a global semantic representation is extracted with a BiGRU, different weights are assigned to different vocabulary knowledge using bilinear attention, and the weighted sum is fused with the character sequence; 4) a tag prediction layer: the joint representation of character sequence information and vocabulary knowledge is passed through a CRF to learn in-sentence constraints, yielding the prediction result for the address text;
the knowledge recall layer prepares related external knowledge features from two dimensions of sentence knowledge and vocabulary knowledge respectively, and specifically comprises the following steps:
(1) The address administrative divisions themselves contain tree-structured information: the hierarchical relationship reflects the type relationship of the address data, the tree structure allows fast lookup, and same-named addresses are not ambiguous. The invention obtains external address knowledge from the 2022 statistical division table of the National Bureau of Statistics and constructs the AKT according to the statistical division types;
(2) The input address text is scanned continuously in longest-match mode to find all potential vocabulary knowledge in the AKT;
(3) For each vocabulary knowledge matched by the characters, its corresponding path in the AKT is found, and the spliced paths serve as the sentence knowledge of the address text.
The coding layer specifically comprises:
obtaining corresponding text embedding vectors $x^c = \{x^c_1, x^c_2, \ldots, x^c_n\}$ through the BERT embedding layer according to formula (1), where $x^c_n$ represents the nth character vector in the text,
$x^c = \mathrm{BERT}(S)$ (1)
the sentence knowledge $S^k = \{k_1, k_2, \ldots, k_l\}$ is regarded as a character sequence in the same way as the address text, where $k_i$ represents the ith sentence knowledge; each sentence knowledge $k_i$ consists of a plurality of characters $\{c^i_1, c^i_2, \ldots, c^i_n\}$, where $c^i_n$ represents the nth character of the ith sentence; passing every character through the BERT embedding layer gives the sentence knowledge embedding vectors $x^k_i = \{x^k_{i1}, x^k_{i2}, \ldots, x^k_{in}\}$, where $x^k_{in}$ represents the nth character vector of the ith sentence; according to formula (2), the average pooling value $\bar{x}^k_i$ is used to represent the sentence knowledge,
$\bar{x}^k_i = \frac{1}{n}\sum_{j=1}^{n} x^k_{ij}$ (2)
to avoid excessive sentence knowledge bringing noise and erroneous semantic information to the model, the semantic relevance between the character sequence and each sentence knowledge is calculated with cosine similarity according to formula (3), and the Top-N sentences are finally selected as the sentence knowledge set $K = \{\bar{k}_1, \bar{k}_2, \ldots, \bar{k}_N\}$, $N \le l$, where $\bar{k}_n$ represents the feature vector representation of the nth sentence,
$\mathrm{sim}(x^c, \bar{x}^k_i) = \frac{x^c \cdot \bar{x}^k_i}{\lVert x^c \rVert \lVert \bar{x}^k_i \rVert}$ (3)
according to formula (4), the feedforward neural network of the BERT Transformer consists of two linear transforms with a GeLU activation function,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot W_k) \cdot W_v$ (4)
where $H_i$ represents the output of Transformer layer $i$, $W_k \in \mathbb{R}^{d \times d_m}$ and $W_v \in \mathbb{R}^{d_m \times d}$ are the two linear networks in the feedforward neural network, $d_m$ represents the intermediate size of the Transformer, and $d$ represents the dimension of the hidden states; after the BERT embedding layer, the Top-N sentence knowledge $K \in \mathbb{R}^{N \times d_n}$ is obtained, where $d_n$ represents the dimension of the sentence knowledge; to inject the sentence knowledge into the ith Transformer layer, the knowledge is projected with two different linear layers linear1 and linear2 according to formulas (5) and (6), where $W^{l1}$ and $W^{l2}$ denote the weights of linear1 and linear2 respectively, and $K^k$ and $K^v$ denote the projected knowledge;
$K^k = K \cdot W^{l1}$ (5)
$K^v = K \cdot W^{l2}$ (6)
according to formula (7), the projected knowledge is stitched to the end of the linear layers, so that $K^k$ and $K^v$ are injected into the corresponding $W_k$ and $W_v$,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot [W_k; (K^k)^\top]) \cdot [W_v; K^v]$ (7)
$H_i$ can then query the corresponding values in the injected knowledge, i.e., the model can learn context information from the external sentence knowledge;
assigning each vocabulary knowledge acquired by the knowledge recall layer to the characters it contains, so that the character sequence is converted into a character-vocabulary pair sequence $S^{cw} = \{(c_1, w_1), (c_2, w_2), \ldots, (c_n, w_n)\}$, where $c_i$ represents the ith character in the address text and $w_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$ represents all the vocabulary knowledge matched by the ith character;
pre-trained word vectors (Pre-Trained Word Vectors, PWV) are obtained by training a word2vec model on the address vocabulary knowledge acquired from the 2022 statistical division table of the National Bureau of Statistics;
according to formulas (8) and (9), the jth vocabulary knowledge in $w_i$ of character $c_i$ is looked up in PWV to obtain the vocabulary knowledge embedding vector $x^w_{ij}$, which is aligned to the character vector dimension by a nonlinear transformation so that the vocabulary knowledge can subsequently be integrated into the model,
$x^w_{ij} = \mathrm{PWV}(w_{ij})$ (8)
$\tilde{x}^w_{ij} = W_2\,\sigma(W_1 x^w_{ij} + b_1) + b_2$ (9)
where $x^w_{ij}$ is the jth vocabulary knowledge vector representation of the ith character, $\tilde{x}^w_{ij}$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, $w_{ij}$ is the jth vocabulary knowledge of the ith character, $\sigma$ is a nonlinear activation, $b_1$ and $b_2$ are bias terms, $W_1 \in \mathbb{R}^{d \times d_w}$, $W_2 \in \mathbb{R}^{d \times d}$, and $d_w$ is the dimension of the vocabulary embedding vectors.
The vocabulary knowledge fusion layer specifically comprises:
inputting the character representation sequence integrated with sentence knowledge into a BiGRU to obtain more comprehensive context sequence information $h = \{h_1, h_2, \ldots, h_n\}$; the vocabulary knowledge fusion layer takes the characters and the vocabulary knowledge matched with them as two inputs, expressed as $(h_i, \tilde{x}^w_i)$, where $h_i$ is the character vector at the ith position of the context sequence information and $\tilde{x}^w_i$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, of size $m \times d$, $m$ being the total number of matched words;
the importance of each word it matches is different for each character; according to formula (10), bilinear attention is used to assign different weights to different words,
$a_i = \mathrm{softmax}(h_i W_{attn} (\tilde{x}^w_i)^\top)$ (10)
where $a_i = \{a_{i1}, a_{i2}, \ldots, a_{im}\}$, $a_{ij}$ represents the weight of the jth word of the ith character, and $W_{attn}$ represents the bilinear attention matrix; for each character, the word vectors are weighted and summed according to the weight of each matched vocabulary knowledge to obtain the vocabulary knowledge vector representation $z_i$ of the character, the calculation formula being:
$z_i = \sum_{j=1}^{m} a_{ij}\,\tilde{x}^w_{ij}$ (11)
the character vector and the weighted vocabulary vector are added to obtain the feature fusion vector $\hat{h}_i$, and finally dropout, layer normalization and residual connection are applied to obtain the final output of the fused vocabulary knowledge, the calculation formulas being:
$\hat{h}_i = h_i + z_i$ (12)
$o_i = \mathrm{LayerNorm}(h_i + \mathrm{Dropout}(\hat{h}_i))$ (13)
the label prediction layer specifically comprises:
the tag prediction layer takes the sequence representation $o = \{o_1, o_2, \ldots, o_n\}$ output by the vocabulary knowledge fusion layer and performs global consistency modeling through a CRF layer, obtaining the score of each label and screening them to produce the final prediction result; the loss function of the model can be expressed as the negative of the log-likelihood, with the calculation formula:
$\mathcal{L} = -\big(\mathrm{score}(x, t) - \log \sum_{t'} \exp(\mathrm{score}(x, t'))\big)$
where $t$ represents the true tag sequence, $t'$ ranges over all tag sequences the model can predict, $\mathrm{score}(\cdot)$ computes the score of the input sequence $x = \{x_1, x_2, \ldots, x_n\}$ with a tag sequence, and $\sum_{t'} \exp(\mathrm{score}(x, t'))$ represents the sum over the scores of all possible tag sequences.
Step 2, place indexing: comprising address standardization and a model fine-tuning interface;
the address standardization is to normalize the extracted address entity according to the labeling specification of the national statistical bureau, and specifically comprises the following steps: and (3) carrying out standardization processing on the address entity extracted in the step (1) according to the labeling specification of the national statistical bureau, automatically standardizing the address information into elements such as provinces, cities, district and counties, street and town, cells, buildings, units, floors, houses and rooms, supplementing level missing data, constructing an address level relation, realizing the service of associating road house numbers with the cells and matching geographic coordinates of all levels so as to ensure consistency and comparability of indexing results. Through uniformly converting addresses in different formats and expression modes into standard formats, the location matching and association can be better carried out, and the accuracy and reliability of the indexing result are improved.
The model fine-tuning interface fine-tunes the model using a pre-trained natural language processing model, and specifically includes: using a pre-trained natural language processing model and adjusting the model parameters according to the training results and the chosen optimization algorithm; optimizing the model structure according to the characteristics and requirements of the model; and training with the optimized model, fine-tuning it on a specific address-domain dataset according to the task requirements.
The address entity identification model of the invention was compared with previous methods on the CCKS2021 address element parsing dataset; the experimental results are shown in Table 1:
TABLE 1. Experimental results on the CCKS2021 address element parsing dataset (table contents not reproduced in this text)
The experimental results on this dataset demonstrate that the method performs well in terms of the effectiveness of address entity identification.
The foregoing is merely illustrative of the present invention and is not to be construed as limiting thereof, and it is intended to cover all modifications and equivalent arrangements included within the spirit and scope of the invention.

Claims (7)

1. An address entity identification method based on multi-layer knowledge perception, characterized by comprising the following steps:
step 1, address entity identification: constructing an address entity identification model based on multi-layer knowledge perception; the address entity identification model comprises: 1) a knowledge recall layer: sentence knowledge and vocabulary knowledge matched with the address text sequence are obtained from an address knowledge tree AKT; 2) an encoding layer: the character sequence and the sentence knowledge are encoded with the BERT embedding layer, the semantic relevance between the character sequence and each sentence knowledge is calculated to obtain a score for each sentence knowledge, the Top-N sentence knowledge is selected and injected into the feedforward neural network of the BERT Transformer; the vocabulary knowledge is encoded with pre-trained word vectors and aligned with the character vector dimension; 3) a vocabulary knowledge fusion layer: a global semantic representation is extracted with a BiGRU, different weights are assigned to different vocabulary knowledge using bilinear attention, and the weighted sum is fused with the character sequence; 4) a tag prediction layer: the joint representation of character sequence information and vocabulary knowledge is passed through a CRF to learn in-sentence constraints, yielding the prediction result for the address text;
step 2, place indexing: comprising address standardization and a model fine-tuning interface; the address standardization normalizes the extracted address entities according to the labeling specification of the National Bureau of Statistics; the model fine-tuning interface fine-tunes the model using a pre-trained natural language processing model.
2. The address entity identification method based on multi-layer knowledge perception according to claim 1, wherein the knowledge recall layer in step 1 specifically comprises:
(1) external address knowledge is acquired from the 2022 statistical division table of the National Bureau of Statistics, and the AKT is constructed according to the statistical division types;
(2) the input address text is scanned continuously in longest-match mode to find all potential vocabulary knowledge in the AKT;
(3) for each vocabulary knowledge matched by the characters, its corresponding path in the AKT is found, and the spliced paths serve as the sentence knowledge of the address text.
3. The method for identifying an address entity based on multi-layer knowledge sensing as claimed in claim 1, wherein the coding layer in step 1 specifically comprises:
obtaining corresponding text embedding vectors $x^c = \{x^c_1, x^c_2, \ldots, x^c_n\}$ through the BERT embedding layer according to formula (1), where $x^c_n$ represents the nth character vector in the text,
$x^c = \mathrm{BERT}(S)$ (1)
the sentence knowledge $S^k = \{k_1, k_2, \ldots, k_l\}$ is regarded as a character sequence in the same way as the address text, where $k_i$ represents the ith sentence knowledge; each sentence knowledge $k_i$ consists of a plurality of characters $\{c^i_1, c^i_2, \ldots, c^i_n\}$, where $c^i_n$ represents the nth character of the ith sentence; passing every character through the BERT embedding layer gives the sentence knowledge embedding vectors $x^k_i = \{x^k_{i1}, x^k_{i2}, \ldots, x^k_{in}\}$, where $x^k_{in}$ represents the nth character vector of the ith sentence; according to formula (2), the average pooling value $\bar{x}^k_i$ is used to represent the sentence knowledge,
$\bar{x}^k_i = \frac{1}{n}\sum_{j=1}^{n} x^k_{ij}$ (2)
to avoid excessive sentence knowledge bringing noise and erroneous semantic information to the model, the semantic relevance between the character sequence and each sentence knowledge is calculated with cosine similarity according to formula (3), and the Top-N sentences are finally selected as the sentence knowledge set $K = \{\bar{k}_1, \bar{k}_2, \ldots, \bar{k}_N\}$, $N \le l$, where $\bar{k}_n$ represents the feature vector representation of the nth sentence,
$\mathrm{sim}(x^c, \bar{x}^k_i) = \frac{x^c \cdot \bar{x}^k_i}{\lVert x^c \rVert \lVert \bar{x}^k_i \rVert}$ (3)
according to formula (4), the feedforward neural network of the BERT Transformer consists of two linear transforms with a GeLU activation function,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot W_k) \cdot W_v$ (4)
where $H_i$ represents the output of Transformer layer $i$, $W_k \in \mathbb{R}^{d \times d_m}$ and $W_v \in \mathbb{R}^{d_m \times d}$ are the two linear networks in the feedforward neural network, $d_m$ represents the intermediate size of the Transformer, and $d$ represents the dimension of the hidden states; after the BERT embedding layer, the Top-N sentence knowledge $K \in \mathbb{R}^{N \times d_n}$ is obtained, where $d_n$ represents the dimension of the sentence knowledge; to inject the sentence knowledge into the ith Transformer layer, the knowledge is projected with two different linear layers linear1 and linear2 according to formulas (5) and (6), where $W^{l1}$ and $W^{l2}$ denote the weights of linear1 and linear2 respectively, and $K^k$ and $K^v$ denote the projected knowledge;
$K^k = K \cdot W^{l1}$ (5)
$K^v = K \cdot W^{l2}$ (6)
according to formula (7), the projected knowledge is stitched to the end of the linear layers, so that $K^k$ and $K^v$ are injected into the corresponding $W_k$ and $W_v$,
$\mathrm{FFN}(H_i) = \mathrm{gelu}(H_i \cdot [W_k; (K^k)^\top]) \cdot [W_v; K^v]$ (7)
$H_i$ can then query the corresponding values in the injected knowledge, i.e., the model can learn context information from the external sentence knowledge;
each vocabulary knowledge acquired by the knowledge recall layer is assigned to the characters it contains, so that the character sequence is converted into a character-vocabulary pair sequence $S^{cw} = \{(c_1, w_1), (c_2, w_2), \ldots, (c_n, w_n)\}$, where $c_i$ represents the ith character in the address text and $w_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$ represents all the vocabulary knowledge matched by the ith character;
pre-trained word vectors PWV are obtained by training a word2vec model on the address vocabulary knowledge acquired from the 2022 statistical division table of the National Bureau of Statistics;
according to formulas (8) and (9), the jth vocabulary knowledge in $w_i$ of character $c_i$ is looked up in PWV to obtain the vocabulary knowledge embedding vector $x^w_{ij}$, which is aligned to the character vector dimension by a nonlinear transformation so that the vocabulary knowledge can subsequently be integrated into the model,
$x^w_{ij} = \mathrm{PWV}(w_{ij})$ (8)
$\tilde{x}^w_{ij} = W_2\,\sigma(W_1 x^w_{ij} + b_1) + b_2$ (9)
where $x^w_{ij}$ is the jth vocabulary knowledge vector representation of the ith character, $\tilde{x}^w_{ij}$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, $w_{ij}$ is the jth vocabulary knowledge of the ith character, $\sigma$ is a nonlinear activation, $b_1$ and $b_2$ are bias terms, $W_1 \in \mathbb{R}^{d \times d_w}$, $W_2 \in \mathbb{R}^{d \times d}$, and $d_w$ is the dimension of the vocabulary embedding vectors.
4. The method for identifying an address entity based on multi-layer knowledge sensing as claimed in claim 1, wherein the vocabulary knowledge fusion layer in step 1 specifically comprises:
inputting the character representation sequence integrated with sentence knowledge into a BiGRU to obtain more comprehensive context sequence information $h = \{h_1, h_2, \ldots, h_n\}$; the vocabulary knowledge fusion layer takes the characters and the vocabulary knowledge matched with them as two inputs, expressed as $(h_i, \tilde{x}^w_i)$, where $h_i$ is the character vector at the ith position of the context sequence information and $\tilde{x}^w_i$ is the dimension transformation result of the vocabulary knowledge corresponding to the ith character, of size $m \times d$, $m$ being the total number of matched words;
the importance of each word it matches is different for each character; according to formula (10), bilinear attention is used to assign different weights to different words,
$a_i = \mathrm{softmax}(h_i W_{attn} (\tilde{x}^w_i)^\top)$ (10)
where $a_i = \{a_{i1}, a_{i2}, \ldots, a_{im}\}$, $a_{ij}$ represents the weight of the jth word of the ith character, and $W_{attn}$ represents the bilinear attention matrix; for each character, the word vectors are weighted and summed according to the weight of each matched vocabulary knowledge to obtain the vocabulary knowledge vector representation $z_i$ of the character, the calculation formula being:
$z_i = \sum_{j=1}^{m} a_{ij}\,\tilde{x}^w_{ij}$ (11)
the character vector and the weighted vocabulary vector are added to obtain the feature fusion vector $\hat{h}_i$, and finally dropout, layer normalization and residual connection are applied to obtain the final output of the fused vocabulary knowledge, the calculation formulas being:
$\hat{h}_i = h_i + z_i$ (12)
$o_i = \mathrm{LayerNorm}(h_i + \mathrm{Dropout}(\hat{h}_i))$ (13)
5. the method for identifying an address entity based on multi-layer knowledge sensing according to claim 1, wherein the label prediction layer in step 1 specifically comprises:
the tag prediction layer takes the sequence representation $o = \{o_1, o_2, \ldots, o_n\}$ output by the vocabulary knowledge fusion layer and performs global consistency modeling through a CRF layer, obtaining the score of each label and screening them to produce the final prediction result; the loss function of the model can be expressed as the negative of the log-likelihood, with the calculation formula:
$\mathcal{L} = -\big(\mathrm{score}(x, t) - \log \sum_{t'} \exp(\mathrm{score}(x, t'))\big)$
where $t$ represents the true tag sequence, $t'$ ranges over all tag sequences the model can predict, $\mathrm{score}(\cdot)$ computes the score of the input sequence $x = \{x_1, x_2, \ldots, x_n\}$ with a tag sequence, and $\sum_{t'} \exp(\mathrm{score}(x, t'))$ represents the sum over the scores of all possible tag sequences.
6. The address entity identification method based on multi-layer knowledge perception according to claim 1, wherein the address normalization in step 2 specifically comprises: standardizing the address entities extracted in step 1 according to the labeling specification of the National Bureau of Statistics; automatically normalizing the address information into elements such as province, city, district/county, street/town, community, building, unit, floor, and room; completing data missing at any level; constructing the address hierarchy; and providing services that associate road house numbers with communities and match geographic coordinates at every level, so as to ensure the consistency and comparability of the indexing results.
7. The address entity identification method based on multi-layer knowledge perception according to claim 1, wherein the model fine-tuning interface in step 2 specifically comprises: using a pre-trained natural language processing model and adjusting the model parameters according to the training results and the chosen optimization algorithm; optimizing the model structure according to the characteristics and requirements of the model; and training with the optimized model, fine-tuning it on a specific address-domain dataset according to the task requirements.
CN202311110916.8A 2023-08-30 2023-08-30 Address entity identification method based on multi-layer knowledge perception Pending CN117010398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311110916.8A CN117010398A (en) 2023-08-30 2023-08-30 Address entity identification method based on multi-layer knowledge perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311110916.8A CN117010398A (en) 2023-08-30 2023-08-30 Address entity identification method based on multi-layer knowledge perception

Publications (1)

Publication Number Publication Date
CN117010398A true CN117010398A (en) 2023-11-07

Family

ID=88565568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311110916.8A Pending CN117010398A (en) 2023-08-30 2023-08-30 Address entity identification method based on multi-layer knowledge perception

Country Status (1)

Country Link
CN (1) CN117010398A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933372A * 2024-03-22 2024-04-26 Shandong University Data enhancement-oriented vocabulary combined knowledge modeling method and device
CN117933372B * 2024-03-22 2024-06-07 Shandong University Data enhancement-oriented vocabulary combined knowledge modeling method and device

Similar Documents

Publication Publication Date Title
CN109508462B (en) Neural network Mongolian Chinese machine translation method based on encoder-decoder
CN110929030B (en) Text abstract and emotion classification combined training method
CN112329467B (en) Address recognition method and device, electronic equipment and storage medium
Lyu et al. Let: Linguistic knowledge enhanced graph transformer for chinese short text matching
CN114048350A (en) Text-video retrieval method based on fine-grained cross-modal alignment model
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN117290489B (en) Method and system for quickly constructing industry question-answer knowledge base
CN112100332A (en) Word embedding expression learning method and device and text recall method and device
CN115080694A (en) Power industry information analysis method and equipment based on knowledge graph
CN116737759B (en) Method for generating SQL sentence by Chinese query based on relation perception attention
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
Jian et al. [Retracted] LSTM‐Based Attentional Embedding for English Machine Translation
CN111488455A (en) Model training method, text classification method, system, device and medium
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN117010398A (en) Address entity identification method based on multi-layer knowledge perception
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN114091454A (en) Method for extracting place name information and positioning space in internet text
Ma et al. Improving Chinese spell checking with bidirectional LSTMs and confusionset-based decision network
CN111382333B (en) Case element extraction method in news text sentence based on case correlation joint learning and graph convolution
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network
CN116955594A (en) Semantic fusion pre-training model construction method and cross-language abstract generation method and system
CN113807102B (en) Method, device, equipment and computer storage medium for establishing semantic representation model
Liang et al. Hierarchical hybrid code networks for task-oriented dialogue
CN115393849A (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination