CN112733541A - Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism - Google Patents

Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism

Info

Publication number
CN112733541A
CN112733541A
Authority
CN
China
Prior art keywords
bert
bigru
model
idcnn
crf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110016942.9A
Other languages
Chinese (zh)
Inventor
张毅
王爽胜
何彬
叶培明
***
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110016942.9A priority Critical patent/CN112733541A/en
Publication of CN112733541A publication Critical patent/CN112733541A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a named entity identification method of BERT-BiGRU-IDCNN-CRF based on an attention mechanism, which comprises the following steps: training a BERT pre-training language model on large-scale unlabeled corpora; constructing a complete BERT-BiGRU-IDCNN-Attention-CRF named entity recognition model on the basis of the trained BERT model; and constructing an entity recognition training set and training the complete entity recognition model on it. The invention combines the feature vectors extracted by the BiGRU and IDCNN neural networks, overcoming the tendency of the BiGRU network to neglect local features while extracting global context features, and introduces an attention mechanism that distributes weights over the extracted features, strengthening the features that play a key role in entity recognition and weakening irrelevant ones, thereby further improving the effect of named entity recognition.

Description

Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
Technical Field
The invention belongs to the field of named entity identification, and particularly relates to a method for identifying a named entity of BERT-BiGRU-IDCNN-CRF based on an attention mechanism.
Background
Named Entity Recognition (NER) is one of the basic tasks in the field of natural language processing: identifying the instances of concepts in text, i.e., entities such as person names, place names and organization names. Named entity recognition is widely applied in tasks such as information extraction, question-answering systems, intelligent translation and knowledge graph construction.
Methods for named entity recognition fall mainly into the following three categories:
The first category comprises dictionary- and rule-based methods, which perform named entity recognition by matching against manually constructed dictionaries or rule templates.
The second category is a statistical machine learning-based method, which considers the named entity recognition task as a sequence tagging problem and learns tagging models using large-scale corpora. Such methods mainly include Hidden Markov Models (HMMs), Maximum Entropy Models (MEM), Support Vector Machines (SVMs), Conditional Random Fields (CRFs), and the like.
The third category comprises deep-learning methods, in which characters or words are mapped into a vector space and then fed to a neural network for feature extraction and label prediction. A popular neural network model is BiLSTM-CRF, which extracts global context features through the BiLSTM while the CRF captures the dependencies among labels and outputs the corrected context features. To accelerate convergence, researchers have also proposed the BiGRU-CRF model.
The above prior art has the following disadvantages:
1. Dictionary- and rule-based methods rely on rule templates manually constructed by linguists; they are labor-intensive, subjective, error-prone, and cannot be transferred across domains.
2. Statistical machine learning-based methods still require a large amount of human involvement in feature extraction and rely heavily on the corpus.
3. In the mainstream deep-learning model BiLSTM-CRF, the static word vectors of the embedding layer cannot represent polysemy, which degrades the recognition performance of the subsequent layers; in addition, BiLSTM and BiGRU may miss some local features while extracting global context features.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A named entity recognition method of BERT-BiGRU-IDCNN-CRF based on attention mechanism is provided. The technical scheme of the invention is as follows:
a named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism comprises the following steps:
S1, training a BERT (Bidirectional Encoder Representations from Transformers) model on large-scale unlabeled corpora; the BERT model comprises an embedding layer, a bidirectional Transformer coding layer and an output layer; the embedding layer is the input of the model, and a multi-head attention mechanism is used in the Transformer coding;
s2, acquiring and labeling training corpus data of the named entity recognition model, and constructing a data set;
s3, constructing a BERT-BiGRU-IDCNN-Attention-CRF neural network model on the basis of the trained BERT model obtained in the step S1, and training the model on the data set obtained in the step S2;
s4, conducting named entity recognition on the text to be recognized by utilizing the trained neural network model based on BERT-BiGRU-IDCNN-Attention-CRF obtained in the step S3, and obtaining a recognition result.
Further, in step S1, the embedding layer is the sum of word embedding, position embedding, and type embedding, and respectively represents word information, position information, and sentence pair information;
the Transformer consists of a self-attention mechanism and a feedforward neural network; the self-attention mechanism computes the degree of association between the words in a text sequence and adjusts the weight coefficients according to that association, and the association degree is calculated as:

Attention(Q, K, V) = softmax(QK^T / √d_k)·V

wherein Q represents the query vector, K the key vector and V the value vector; the scaling (penalty) factor √d_k is introduced to prevent the inner product of Q and K from becoming too large, where d_k denotes the input vector dimension;

the Transformer coding structure uses a multi-head attention mechanism, that is, Q, K and V are passed through several different linear mappings, attention is recomputed on each new Q, K, V, and the results are concatenated, with W a weight matrix; specifically:

MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_n)·W

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

head_i denotes the attention result of attention head i, and Concat denotes concatenation of the results of the different heads. Let Z denote the output of the multi-head attention mechanism in the Transformer structure and b a bias vector; the fully connected feedforward network FFN is then:

FFN(Z) = max(0, Z·W_1 + b_1)·W_2 + b_2

W_1, b_1 denote the weight and bias from the multi-head attention mechanism to the add-and-normalize layer, and W_2, b_2 denote the weight and bias from the add-and-normalize layer to the fully connected feedforward neural network. The BERT model obtains its parameters by training on large-scale unlabeled data and infers and outputs the word vector representation of the input sequence, namely T_1, T_2, …, T_n.
Further, the step S2 is to obtain and label the corpus data of the named entity recognition model, and construct a data set, which specifically includes:
S2-1, performing routine data processing on the original corpus for named entity recognition, including correcting wrongly written characters and normalizing character forms;
s2-2, determining the entity type to be identified according to the actual application requirement;
S2-3, in order to cope with entities of different lengths and entity boundaries that are difficult to distinguish, adopting the BIO three-tag entity labeling scheme: B marks the first character of an entity, I marks the characters following the first character inside an entity, and O marks non-entity characters.
Further, the step S3 is to construct a BERT-BiGRU-IDCNN-Attention-CRF neural network model, which specifically includes the steps of:
s3-1, inputting the training set of the named entity recognition obtained by preprocessing in the step S2 into the BERT pre-training model trained in the step S1, and outputting word embedding vectors by the BERT model;
S3-2, inputting the word vector output by the BERT pre-training language model into a BiGRU (bidirectional gated recurrent unit) neural network model;
S3-3, inputting the word embedding vector output by the BERT pre-training language model into an IDCNN (iterated dilated convolutional neural network) model;
s3-4, merging the feature vectors output by the BiGRU and IDCNN neural networks;
s3-5, inputting the combined feature vector obtained in the step S3-4 into an attention mechanism module, performing weight distribution on the extracted features, strengthening the features which play a key role in named entity identification, and weakening irrelevant features;
s3-6, inputting the weight-distributed feature vectors obtained in the step S3-5 into a CRF layer, extracting the dependency features among the labels, and calculating a loss function;
And S3-7, updating the parameters of the whole named entity recognition model with the CRF-layer loss function using stochastic gradient descent (SGD); the updated parameters cover the BiGRU neural network, the IDCNN neural network, the Attention layer and the CRF layer, while the parameters of the BERT model are kept unchanged; training terminates when the loss value produced by the model meets the set requirement or the set maximum number of iterations is reached.
Further, the step S3-2 inputs the word vector output by the BERT pre-training language model to the BiGRU neural network model; the GRU is a special recurrent neural network, and the state of its neurons is calculated as follows:
z_t = σ(W_i·[h_{t-1}, x_t])

r_t = σ(W_r·[h_{t-1}, x_t])

h̃_t = tanh(W_c·[r_t ⊙ h_{t-1}, x_t])

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, x_t is the input vector at time t, and h_t is the hidden state and also the output vector, containing all valid information up to time t; z_t is the update gate, which controls how much information flows into the next time step; h̃_t denotes the candidate hidden state; W_i, W_c and W_r are the weight matrices of the GRU; r_t is the reset gate, which controls how much information is discarded; the update gate and the reset gate jointly determine the output of the hidden state.
Further, the step S3-5 is to input the merged feature vector obtained in the step S3-4 to the attention mechanism module, perform weight distribution on the extracted features, strengthen the features that play a key role in named entity recognition, and weaken irrelevant features;
defining a feature vector set H = {h_0, h_1, …, h_n}, with the additional information being the part-of-speech matrix P = {P_0, P_1, …, P_n}; so that the part-of-speech information can assign weights to the target vector set, weight matrices W_1 and W_2 are used to apply affine transformations to H and P so that their vector space dimensions agree, and the transformed results are fed into the tanh(·) activation function to obtain the joint feature vector u; the softmax function is then used to score u and obtain the weight α of each input, and finally the weight vector α is used to attention-weight the feature vector set H and output the weighted feature vector logits; the calculation is:

u = tanh(W_1·H + W_2·P)

α = softmax(u)

logits = α ⊙ H
further, the step S3-6 inputs the weight-assigned feature vector obtained in the step S3-5 to the CRF layer, extracts the dependency features between the labels, and calculates the loss function, which specifically includes:
for an input feature vector X, the corresponding prediction sequence is Y, the probability generated by the prediction sequence Y is obtained by calculating the scoring function of Y, and finally, the prediction labeling sequence when the likelihood function of the probability generated by the prediction sequence is maximum is calculated as output, wherein the calculating method of the scoring function of the prediction sequence Y is as follows:
score(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

A denotes the transition score matrix, n the number of words, y_i the i-th label of the predicted sequence y, A_{i,j} the score of the transition from label i to label j, P the score matrix output by the previous layer, and P_{i,j} the score of the j-th label of the i-th word; the probability generated by the predicted sequence y is:

P(y | X) = exp(score(X, y)) / Σ_{ỹ ∈ Y_X} exp(score(X, ỹ))

where ỹ denotes a candidate labeling sequence with scoring-function value score(X, ỹ), y denotes the actual annotation sequence, and Y_X denotes all possible labeling sequences; taking the logarithm of both sides gives the likelihood function of the predicted sequence:

log P(y | X) = score(X, y) − log Σ_{ỹ ∈ Y_X} exp(score(X, ỹ))

The training loss function is:

L = −log P(y | X) + λθ

where λ represents the penalty term coefficient and θ represents the penalty term.
Further, the step S4 performs named entity recognition on the text to be recognized by using the trained neural network model based on BERT-BiGRU-IDCNN-Attention-CRF obtained in the step S3 to obtain a recognition result, and includes the steps of:
s4-1, inputting text data needing named entity recognition into a trained BERT-BiGRU-IDCNN-Attention-CRF neural network model;
And S4-2, the text data are converted into word vectors by the BERT model, the features of the word vectors are extracted by the BiGRU and IDCNN neural networks, the extracted features are weighted by the Attention layer, and finally the CRF layer uses the Viterbi algorithm to find the most probable labeling sequence of each sentence, which is the result of named entity recognition.
The invention has the following advantages and beneficial effects:
the invention provides a named entity identification method of BERT-BiGRU-IDCNN-CRF based on an attention mechanism. The method can solve the problem that the word vector cannot represent the polysemy of a word, can overcome the defect that the BiGRU module ignores local features in the process of extracting the context features, and can strengthen the features playing a key role in classification and weaken irrelevant features.
1. The BERT model used by the invention can perform unsupervised training on large-scale label-free data, can perform pre-training by combining context semantic information, and learns the characteristics of word level, syntactic structure and context semantic information so as to solve the defect that static word embedding cannot represent word ambiguity.
2. An IDCNN neural network is introduced to extract local features of a text sequence so as to make up for the defect that the BiGRU neural network ignores the local features when extracting global context features.
3. On the basis of global context features and local features extracted by BiGRU and IDCNN, an attention mechanism is introduced, the features playing a key role in the named entity are strengthened by carrying out weight distribution on the features, and irrelevant features are weakened or ignored, so that the accuracy of named entity identification is further improved.
The main idea of the invention is that, while the BiGRU neural network extracts global context features, the local features extracted by the IDCNN neural network are fused in to compensate for the BiGRU's neglect of local features; moreover, since the GRU has only two gate structures, the BiGRU has fewer training parameters and trains faster than the BiLSTM. On the other hand, on the basis of the fused features extracted by the BiGRU and IDCNN neural networks, an attention mechanism is introduced to distribute weights over the features, strengthening those that play a key role in classification and weakening irrelevant ones, thereby achieving a better recognition effect.
Drawings
FIG. 1 is a schematic flow chart of a named entity recognition method based on a BERT-BiGRU-IDCNN-Attention-CRF neural network model according to a preferred embodiment of the present invention;
FIG. 2 is a block diagram of a BERT model of an embodiment of the present invention;
FIG. 3 is a structural diagram of a BERT-BiGRU-IDCNN-Attention-CRF neural network model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
fig. 1 is a schematic flow chart of the named entity recognition method of BERT-BiGRU-IDCNN-CRF based on the attention mechanism of the present invention. The method comprises the following steps:
S1, training a BERT model on large-scale unlabeled corpora;
specifically, the structure of the BERT model is shown in fig. 2, and mainly includes a model embedding layer, a bidirectional Transformer encoding layer, and an output layer.
The embedding layer is the input of the model, which is the sum of word embedding, position embedding and type embedding, and respectively represents word information, position information and sentence pair information.
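As a rough illustration of this three-part sum, the following PyTorch sketch builds an embedding layer of that form; the vocabulary size, maximum length, type count and hidden size are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class BertStyleEmbedding(nn.Module):
    """Sum of word (token), position and type (segment) embeddings, as described above.
    All sizes below are illustrative assumptions."""
    def __init__(self, vocab_size=21128, max_len=512, type_vocab=2, hidden=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # word information
        self.pos = nn.Embedding(max_len, hidden)      # position information
        self.seg = nn.Embedding(type_vocab, hidden)   # sentence-pair (type) information

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)[None, :, :] + self.seg(segment_ids)
```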
The Transformer consists of a self-attention mechanism and a feedforward neural network, the working principle of the self-attention mechanism is mainly to calculate the association degree between words in a text sequence, and adjust the weight coefficient according to the association degree, and the calculation method of the association degree is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V

wherein Q represents the query vector, K the key vector and V the value vector; the scaling (penalty) factor √d_k is introduced to prevent the inner product of Q and K from becoming too large, where d_k denotes the input vector dimension.

The Transformer coding structure uses a multi-head attention mechanism, that is, Q, K and V are passed through several different linear mappings, attention is recomputed on each new Q, K, V, and the results are concatenated, with W a weight matrix; specifically:

MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_n)·W

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

If the output of the multi-head attention mechanism in the Transformer structure is denoted Z and b is the bias vector, the fully connected feed-forward network (FFN) can be expressed as:

FFN(Z) = max(0, Z·W_1 + b_1)·W_2 + b_2

The BERT model obtains its parameters by training on large-scale unlabeled data and infers and outputs the word vector representation of the input sequence, namely T_1, T_2, …, T_n.
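To make the computations above concrete, here is a minimal numpy sketch of scaled dot-product attention, multi-head attention and the position-wise feed-forward network; the head count and weight shapes are assumptions for illustration, not the patented implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with sqrt(d_k) as the scaling (penalty) factor."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """Apply a different linear mapping per head, attend, concatenate, then project with Wo."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

def ffn(Z, W1, b1, W2, b2):
    """FFN(Z) = max(0, Z W1 + b1) W2 + b2."""
    return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2
```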
S2, obtaining and labeling the training corpus data of the named entity recognition model, and constructing a data set, wherein the method comprises the following steps:
S2-1, performing routine data processing on the original corpus for named entity recognition, including correcting wrongly written characters, normalizing character forms, and the like;
s2-2, determining the entity type to be identified according to the actual application requirement;
S2-3, in order to cope with entities of different lengths and entity boundaries that are difficult to distinguish, adopting the BIO three-tag entity labeling scheme: B marks the first character of an entity, I marks the characters following the first character inside an entity, and O marks non-entity characters.
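For instance, under this scheme each character of a sentence receives one tag; the hypothetical helper below converts entity spans into BIO tags (the sentence, spans and entity types are made up for illustration):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) entity spans into BIO tags.
    Spans use half-open indices, e.g. (0, 6, "ORG") covers tokens[0:6]."""
    tags = ["O"] * len(tokens)               # O: non-entity character
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"            # B: first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"            # I: characters after the first one
    return tags

# Example (hypothetical sentence and spans):
# tokens = list("重庆邮电大学位于南岸区")
# spans_to_bio(tokens, [(0, 6, "ORG"), (8, 11, "LOC")])
# -> ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
```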
S3, on the basis of the trained BERT model obtained in the step S1, constructing a BERT-BiGRU-IDCNN-Attention-CRF neural network model, wherein the model structure is shown in FIG. 3, and the model is trained on the data set obtained in the step S2, and the method comprises the following steps:
s3-1, inputting the training set of the named entity recognition obtained by preprocessing in the step S2 into the BERT pre-training model trained in the step S1, and outputting word embedding vectors by the BERT model;
s3-2, inputting the word vector output by the BERT pre-training language model to the BiGRU neural network model;
specifically, a GRU is a special recurrent neural network whose state of neurons is calculated as follows:
z_t = σ(W_i·[h_{t-1}, x_t])

r_t = σ(W_r·[h_{t-1}, x_t])

h̃_t = tanh(W_c·[r_t ⊙ h_{t-1}, x_t])

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, x_t is the input vector at time t, and h_t is the hidden state and also the output vector, containing all valid information up to time t. z_t is the update gate, which controls how much information flows into the next time step; h̃_t is the candidate hidden state; r_t is the reset gate, which controls how much information is discarded. The update gate and the reset gate jointly determine the output of the hidden state.
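A step-by-step numpy sketch of one GRU update following the gate equations above; the weight shapes and the exact form of the candidate state are standard GRU assumptions, not code from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_i, W_r, W_c):
    """One GRU step. W_i, W_r, W_c are the update-gate, reset-gate and candidate weights."""
    concat = np.concatenate([h_prev, x_t])                          # [h_{t-1}, x_t]
    z_t = sigmoid(W_i @ concat)                                     # update gate
    r_t = sigmoid(W_r @ concat)                                     # reset gate
    h_tilde = np.tanh(W_c @ np.concatenate([r_t * h_prev, x_t]))    # candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                      # new hidden state / output
    return h_t
```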
S3-3, inputting the word embedding vector output by the BERT pre-training language model to the IDCNN neural network model;
Specifically, the iterated dilated convolutional neural network (IDCNN) is composed of multiple layers of dilated convolutional neural networks (DCNN) with different dilation widths. Compared with a conventional convolutional neural network, the DCNN dilates its convolution kernel and thereby enlarges the receptive field: for a convolution kernel of size 3x3 with a dilation width of 2, the receptive field is expanded to 7x7. The benefit of the DCNN is that the convolution output contains information from a larger field of view without changing the size of the convolution kernel.
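A PyTorch sketch of one dilated-convolution block of this kind; the channel width, kernel size and dilation schedule are assumptions chosen only to illustrate how stacking dilated 1-D convolutions widens the receptive field while preserving the sequence length:

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """One IDCNN-style block: stacked 1-D convolutions with increasing dilation.
    Padding of d * (kernel_size - 1) // 2 keeps the sequence length unchanged."""
    def __init__(self, channels=128, kernel_size=3, dilations=(1, 1, 2)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations
        ])

    def forward(self, x):                  # x: (batch, channels, seq_len)
        for conv in self.convs:
            x = torch.relu(conv(x))        # each layer sees a wider context than the last
        return x
```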
S3-4, merging the feature vectors output by the BiGRU and IDCNN neural networks;
s3-5, inputting the combined feature vector obtained in the step S3-4 into an attention mechanism module, performing weight distribution on the extracted features, strengthening the features which play a key role in named entity identification, and weakening irrelevant features;
Specifically, a feature vector set H = {h_0, h_1, …, h_n} is defined, and the additional information is the part-of-speech matrix P = {P_0, P_1, …, P_n}. So that the part-of-speech information can assign weights to the target vector set, weight matrices W_1 and W_2 are used to apply affine transformations to H and P so that their vector space dimensions agree. The transformed results are fed into the tanh(·) activation function to obtain the joint feature vector u; the softmax function is used to score u and obtain the weight α of each input, and finally the weight vector α is used to attention-weight the feature vector set H and output the weighted feature vector logits; the calculation is:

u = tanh(W_1·H + W_2·P)

α = softmax(u)

logits = α ⊙ H
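A numpy sketch of this weighting step, under the assumption that the joint feature, the softmax weights and the weighted output follow the standard additive-attention pattern implied by the description; the scoring vector v is an extra assumption introduced only to turn each joint feature into a scalar score:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weighting(H, P, W1, W2, v):
    """H: (n, d_h) fused BiGRU/IDCNN features; P: (n, d_p) part-of-speech matrix.
    W1, W2 map both into a shared space; v (assumed) scores each position."""
    u = np.tanh(H @ W1 + P @ W2)      # joint feature vectors, shape (n, d_a)
    alpha = softmax(u @ v)            # one attention weight per position, shape (n,)
    return alpha[:, None] * H         # re-weighted features passed on to the CRF layer
```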
s3-6, inputting the weight-distributed feature vectors obtained in the step S3-5 into a CRF layer, extracting the dependency features among the labels, and calculating a loss function;
specifically, for an input feature vector X, the corresponding prediction sequence is Y, the probability of generation of the prediction sequence Y is obtained by calculating the scoring function of Y, and finally, the prediction labeling sequence when the likelihood function of the probability of generation of the prediction sequence is maximum is calculated and is used as output. The calculation method of the scoring function of the predicted sequence Y is as follows:
score(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

A represents the transition score matrix, A_{i,j} the score of the transition from label i to label j, P the score matrix output by the upper layer, and P_{i,j} the score of the j-th label of the i-th word; the probability generated by the predicted sequence y is:

P(y | X) = exp(score(X, y)) / Σ_{ỹ ∈ Y_X} exp(score(X, ỹ))

where y represents the actual annotation sequence and Y_X represents all possible labeling sequences; taking the logarithm of both sides gives the likelihood function of the predicted sequence:

log P(y | X) = score(X, y) − log Σ_{ỹ ∈ Y_X} exp(score(X, ỹ))

The training loss function is:

L = −log P(y | X) + λθ

where λ is the penalty term coefficient and θ is the penalty term.
and S3-7, updating parameters of the whole named entity recognition model by using a loss function of a CRF layer by adopting an SGD (random gradient descent) method, wherein the parameters comprise model parameters including a BiGRU neural network model, an IDCNN neural network model, an Attention layer and the CRF layer, the parameters of the BERT model are kept unchanged, and when the loss value generated by the model meets the set requirement or reaches the set maximum iteration number, the training of the model is terminated.
S4, carrying out named entity recognition on the text to be recognized by utilizing the trained neural network model based on BERT-BiGRU-IDCNN-Attention-CRF obtained in the step S3 to obtain a recognition result, and comprising the following steps:
s4-1, inputting text data needing named entity recognition into a trained BERT-BiGRU-IDCNN-Attention-CRF neural network model;
And S4-2, the text data are converted into word vectors by the BERT model, the features of the word vectors are extracted by the BiGRU and IDCNN neural networks, the extracted features are weighted by the Attention layer, and finally the CRF layer uses the Viterbi algorithm to find the most probable labeling sequence of each sentence, which is the result of named entity recognition.
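A minimal Viterbi decoder over per-position label scores P and a transition matrix A, sketching the decoding step described above (variable names are assumptions; a real CRF layer would also handle start/stop transitions):

```python
import numpy as np

def viterbi_decode(P, A):
    """P: (n, k) label scores for each of n positions; A: (k, k) transition scores.
    Returns the highest-scoring label sequence as a list of label indices."""
    n, k = P.shape
    dp = np.zeros((n, k))                  # best score ending in each label
    back = np.zeros((n, k), dtype=int)     # backpointers
    dp[0] = P[0]
    for i in range(1, n):
        scores = dp[i - 1][:, None] + A + P[i][None, :]   # (prev_label, cur_label)
        back[i] = scores.argmax(axis=0)
        dp[i] = scores.max(axis=0)
    path = [int(dp[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```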
The modules or methods illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (8)

1. A named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism is characterized by comprising the following steps:
S1, training a BERT model on large-scale unlabeled corpora; the BERT model comprises an embedding layer, a bidirectional Transformer coding layer and an output layer; the embedding layer is the input of the model, and a multi-head attention mechanism is used in the Transformer coding;
s2, acquiring and labeling training corpus data of the named entity recognition model, and constructing a data set;
s3, constructing a BERT-BiGRU-IDCNN-Attention-CRF neural network model on the basis of the trained BERT model obtained in the step S1, and training the model on the data set obtained in the step S2;
s4, conducting named entity recognition on the text to be recognized by utilizing the trained neural network model based on BERT-BiGRU-IDCNN-Attention-CRF obtained in the step S3, and obtaining a recognition result.
2. The method for identifying named entities on BERT-BiGRU-IDCNN-CRF according to claim 1, wherein in step S1, the embedding layer is the sum of word embedding, position embedding and type embedding, and represents word information, position information and sentence pair information respectively;
the Transformer consists of a self-attention mechanism and a feedforward neural network; the self-attention mechanism computes the degree of association between the words in a text sequence and adjusts the weight coefficients according to that association, and the association degree is calculated as:

Attention(Q, K, V) = softmax(QK^T / √d_k)·V

wherein Q represents the query vector, K the key vector and V the value vector; the scaling (penalty) factor √d_k is introduced to prevent the inner product of Q and K from becoming too large, where d_k denotes the input vector dimension;

the Transformer coding structure uses a multi-head attention mechanism, that is, Q, K and V are passed through several different linear mappings, attention is recomputed on each new Q, K, V, and the results are concatenated, with W a weight matrix; specifically:

MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_n)·W

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

head_i denotes the attention result of attention head i, and Concat denotes concatenation of the results of the different heads. Let Z denote the output of the multi-head attention mechanism in the Transformer structure and b a bias vector; the fully connected feedforward network FFN is then:

FFN(Z) = max(0, Z·W_1 + b_1)·W_2 + b_2

W_1, b_1 denote the weight and bias from the multi-head attention mechanism to the add-and-normalize layer, and W_2, b_2 denote the weight and bias from the add-and-normalize layer to the fully connected feedforward neural network. The BERT model obtains its parameters by training on large-scale unlabeled data and infers and outputs the word vector representation of the input sequence, namely T_1, T_2, …, T_n.
3. The method for recognizing the named entity of BERT-BiGRU-IDCNN-CRF based on attention mechanism as claimed in claim 1, wherein the step S2 is performed by obtaining and labeling corpus data of the named entity recognition model to construct a data set, specifically comprising:
S2-1, performing routine data processing on the original corpus for named entity recognition, including correcting wrongly written characters and normalizing character forms;
s2-2, determining the entity type to be identified according to the actual application requirement;
S2-3, in order to cope with entities of different lengths and entity boundaries that are difficult to distinguish, adopting the BIO three-tag entity labeling scheme: B marks the first character of an entity, I marks the characters following the first character inside an entity, and O marks non-entity characters.
4. The method for identifying named entities on BERT-BiGRU-IDCNN-CRF based on Attention mechanism as claimed in claim 3, wherein the step S3 is to construct a BERT-BiGRU-IDCNN-Attention-CRF neural network model, specifically comprising the steps of:
s3-1, inputting the training set of the named entity recognition obtained by preprocessing in the step S2 into the BERT pre-training model trained in the step S1, and outputting word embedding vectors by the BERT model;
S3-2, inputting the word vector output by the BERT pre-training language model into a BiGRU (bidirectional gated recurrent unit) neural network model;
S3-3, inputting the word embedding vector output by the BERT pre-training language model into an IDCNN (iterated dilated convolutional neural network) model;
s3-4, merging the feature vectors output by the BiGRU and IDCNN neural networks;
s3-5, inputting the combined feature vector obtained in the step S3-4 into an attention mechanism module, performing weight distribution on the extracted features, strengthening the features which play a key role in named entity identification, and weakening irrelevant features;
s3-6, inputting the weight-distributed feature vectors obtained in the step S3-5 into a CRF layer, extracting the dependency features among the labels, and calculating a loss function;
And S3-7, updating the parameters of the whole named entity recognition model with the CRF-layer loss function using stochastic gradient descent (SGD); the updated parameters cover the BiGRU neural network, the IDCNN neural network, the Attention layer and the CRF layer, while the parameters of the BERT model are kept unchanged; training terminates when the loss value produced by the model meets the set requirement or the set maximum number of iterations is reached.
5. The method for named entity recognition of BERT-BiGRU-IDCNN-CRF based on attention mechanism as claimed in claim 4, wherein the step S3-2 inputs the word vector outputted from the BERT pre-training language model to the BiGRU neural network model; the GRU is a special recurrent neural network, and the state of its neurons is calculated as follows:
z_t = σ(W_i·[h_{t-1}, x_t])

r_t = σ(W_r·[h_{t-1}, x_t])

h̃_t = tanh(W_c·[r_t ⊙ h_{t-1}, x_t])

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, x_t is the input vector at time t, and h_t is the hidden state and also the output vector, containing all valid information up to time t; z_t is the update gate, which controls how much information flows into the next time step; h̃_t denotes the candidate hidden state; W_i, W_c and W_r are the weight matrices of the GRU; r_t is the reset gate, which controls how much information is discarded; the update gate and the reset gate jointly determine the output of the hidden state.
6. The method for identifying named entities on a BERT-BiGRU-IDCNN-CRF as claimed in claim 5, wherein the step S3-5 comprises inputting the merged feature vector obtained in the step S3-4 to an attention mechanism module, performing weight distribution on the extracted features, enhancing features critical to named entity identification, and weakening irrelevant features;
defining a feature vector set H = {h_0, h_1, …, h_n}, with the additional information being the part-of-speech matrix P = {P_0, P_1, …, P_n}; so that the part-of-speech information can assign weights to the target vector set, weight matrices W_1 and W_2 are used to apply affine transformations to H and P so that their vector space dimensions agree, and the transformed results are fed into the tanh(·) activation function to obtain the joint feature vector u; the softmax function is then used to score u and obtain the weight α of each input, and finally the weight vector α is used to attention-weight the feature vector set H and output the weighted feature vector logits; the calculation is:

u = tanh(W_1·H + W_2·P)

α = softmax(u)

logits = α ⊙ H
7. the method for identifying named entities on BERT-BiGRU-IDCNN-CRF according to claim 6, wherein the step S3-6 inputs the weight-assigned feature vectors obtained in step S3-5 into a CRF layer, extracts the dependency features between labels, and calculates the loss function, and comprises:
for an input feature vector X, the corresponding prediction sequence is Y, the probability generated by the prediction sequence Y is obtained by calculating the scoring function of Y, and finally, the prediction labeling sequence when the likelihood function of the probability generated by the prediction sequence is maximum is calculated as output, wherein the calculating method of the scoring function of the prediction sequence Y is as follows:
score(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

A denotes the transition score matrix, n the number of words, y_i the i-th label of the predicted sequence y, A_{i,j} the score of the transition from label i to label j, P the score matrix output by the previous layer, and P_{i,j} the score of the j-th label of the i-th word; the probability generated by the predicted sequence y is:

P(y | X) = exp(score(X, y)) / Σ_{ỹ ∈ Y_X} exp(score(X, ỹ))

where ỹ denotes a candidate labeling sequence with scoring-function value score(X, ỹ), y denotes the actual annotation sequence, and Y_X denotes all possible labeling sequences; taking the logarithm of both sides gives the likelihood function of the predicted sequence:

log P(y | X) = score(X, y) − log Σ_{ỹ ∈ Y_X} exp(score(X, ỹ))

The training loss function is:

L = −log P(y | X) + λθ

where λ represents the penalty term coefficient and θ represents the penalty term.
8. The method for named entity recognition on BERT-BiGRU-IDCNN-CRF based on Attention mechanism as claimed in claim 7, wherein the step S4 using the trained BERT-BiGRU-IDCNN-Attention-CRF based neural network model obtained in step S3 to perform named entity recognition on the text to be recognized to obtain the recognition result comprises the steps of:
s4-1, inputting text data needing named entity recognition into a trained BERT-BiGRU-IDCNN-Attention-CRF neural network model;
And S4-2, the text data are converted into word vectors by the BERT model, the features of the word vectors are extracted by the BiGRU and IDCNN neural networks, the extracted features are weighted by the Attention layer, and finally the CRF layer uses the Viterbi algorithm to find the most probable labeling sequence of each sentence, which is the result of named entity recognition.
CN202110016942.9A 2021-01-06 2021-01-06 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism Pending CN112733541A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110016942.9A CN112733541A (en) 2021-01-06 2021-01-06 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110016942.9A CN112733541A (en) 2021-01-06 2021-01-06 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism

Publications (1)

Publication Number Publication Date
CN112733541A true CN112733541A (en) 2021-04-30

Family

ID=75590850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110016942.9A Pending CN112733541A (en) 2021-01-06 2021-01-06 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112733541A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110516231A (en) * 2019-07-12 2019-11-29 北京邮电大学 Expansion convolution entity name recognition method based on attention mechanism
CN112115238A (en) * 2020-10-29 2020-12-22 电子科技大学 Question-answering method and system based on BERT and knowledge base

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
李妮 等: "基于BERT-IDCNN-CRF的中文命名实体识别方法" [A Chinese named entity recognition method based on BERT-IDCNN-CRF], 《山东大学学报(理学版)》 [Journal of Shandong University (Natural Science)] *
杨文明 等: "在线医疗问答文本的命名实体识别" [Named entity recognition in online medical question-answering texts], 《计算机***应用》 *
王雪梅 等: "基于深度学习的中文命名实体识别研究" [Research on Chinese named entity recognition based on deep learning], 《成都信息工程大学学报》 [Journal of Chengdu University of Information Technology] *

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139069A (en) * 2021-05-14 2021-07-20 上海交通大学 Knowledge graph construction-oriented Chinese text entity identification method and system for power failure
CN112949637A (en) * 2021-05-14 2021-06-11 中南大学 Bidding text entity identification method based on IDCNN and attention mechanism
CN113268740A (en) * 2021-05-27 2021-08-17 四川大学 Input constraint completeness detection method of website system
CN113221571A (en) * 2021-05-31 2021-08-06 重庆交通大学 Entity relation joint extraction method based on entity correlation attention mechanism
CN113221571B (en) * 2021-05-31 2022-07-01 重庆交通大学 Entity relation joint extraction method based on entity correlation attention mechanism
CN113361277A (en) * 2021-06-16 2021-09-07 西南交通大学 Medical named entity recognition modeling method based on attention mechanism
CN113378574A (en) * 2021-06-30 2021-09-10 武汉大学 Named entity identification method based on KGANN
CN113378574B (en) * 2021-06-30 2023-10-24 武汉大学 KGANN-based named entity identification method
CN113392649B (en) * 2021-07-08 2023-04-07 上海浦东发展银行股份有限公司 Identification method, device, equipment and storage medium
CN113392649A (en) * 2021-07-08 2021-09-14 上海浦东发展银行股份有限公司 Identification method, device, equipment and storage medium
CN113408291A (en) * 2021-07-09 2021-09-17 平安国际智慧城市科技股份有限公司 Training method, device and equipment for Chinese entity recognition model and storage medium
CN113408291B (en) * 2021-07-09 2023-06-30 平安国际智慧城市科技股份有限公司 Training method, training device, training equipment and training storage medium for Chinese entity recognition model
CN113505613A (en) * 2021-07-29 2021-10-15 沈阳雅译网络技术有限公司 Model structure simplification compression method for small CPU equipment
CN113642862A (en) * 2021-07-29 2021-11-12 国网江苏省电力有限公司 Method and system for identifying named entities of power grid dispatching instructions based on BERT-MBIGRU-CRF model
CN113673248A (en) * 2021-08-23 2021-11-19 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113673248B (en) * 2021-08-23 2022-02-01 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113569554A (en) * 2021-09-24 2021-10-29 北京明略软件***有限公司 Entity pair matching method and device in database, electronic equipment and storage medium
CN113836926A (en) * 2021-09-27 2021-12-24 北京林业大学 Electronic medical record named entity identification method, electronic equipment and storage medium
CN113687242A (en) * 2021-09-29 2021-11-23 温州大学 Lithium ion battery SOH estimation method for optimizing and improving GRU neural network based on GA algorithm
CN113744805A (en) * 2021-09-30 2021-12-03 山东大学 Method and system for predicting DNA methylation based on BERT framework
CN113627157A (en) * 2021-10-13 2021-11-09 京华信息科技股份有限公司 Probability threshold value adjusting method and system based on multi-head attention mechanism
CN114036948A (en) * 2021-10-26 2022-02-11 天津大学 Named entity identification method based on uncertainty quantification
CN114036948B (en) * 2021-10-26 2024-05-31 天津大学 Named entity identification method based on uncertainty quantification
CN114048750A (en) * 2021-12-10 2022-02-15 广东工业大学 Named entity identification method integrating information advanced features
CN114048750B (en) * 2021-12-10 2024-06-28 广东工业大学 Named entity identification method integrating advanced features of information
CN114580412B (en) * 2021-12-29 2024-06-04 西安工程大学 Clothing entity identification method based on field adaptation
CN114580412A (en) * 2021-12-29 2022-06-03 西安工程大学 Clothing entity identification method based on field adaptation
CN114547301A (en) * 2022-02-21 2022-05-27 北京百度网讯科技有限公司 Document processing method, document processing device, recognition model training equipment and storage medium
CN114580422B (en) * 2022-03-14 2022-12-13 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN114648029A (en) * 2022-03-31 2022-06-21 河海大学 Electric power field named entity identification method based on BiLSTM-CRF model
WO2023201791A1 (en) * 2022-04-22 2023-10-26 深圳计算科学研究院 Data entity recognition method and apparatus, and computer device and storage medium
CN115146630A (en) * 2022-06-08 2022-10-04 平安科技(深圳)有限公司 Word segmentation method, device, equipment and storage medium based on professional domain knowledge
CN115146630B (en) * 2022-06-08 2023-05-30 平安科技(深圳)有限公司 Word segmentation method, device, equipment and storage medium based on professional domain knowledge
CN115186667A (en) * 2022-07-19 2022-10-14 平安科技(深圳)有限公司 Named entity identification method and device based on artificial intelligence
CN115186667B (en) * 2022-07-19 2023-05-26 平安科技(深圳)有限公司 Named entity identification method and device based on artificial intelligence
CN115587594A (en) * 2022-09-20 2023-01-10 广东财经大学 Network security unstructured text data extraction model training method and system
CN115906845B (en) * 2022-11-08 2024-05-10 芽米科技(广州)有限公司 Method for identifying title named entity of electronic commerce commodity
CN115906845A (en) * 2022-11-08 2023-04-04 重庆邮电大学 E-commerce commodity title naming entity identification method
CN115859983A (en) * 2022-12-14 2023-03-28 成都信息工程大学 Fine-grained Chinese named entity recognition method
CN115859983B (en) * 2022-12-14 2023-08-25 成都信息工程大学 Fine-granularity Chinese named entity recognition method
CN115757325A (en) * 2023-01-06 2023-03-07 珠海金智维信息科技有限公司 Intelligent conversion method and system for XES logs
CN116050418A (en) * 2023-03-02 2023-05-02 浙江工业大学 Named entity identification method, device and medium based on fusion of multi-layer semantic features
CN116050418B (en) * 2023-03-02 2023-10-31 浙江工业大学 Named entity identification method, device and medium based on fusion of multi-layer semantic features
CN116484848A (en) * 2023-03-17 2023-07-25 北京深维智讯科技有限公司 Text entity identification method based on NLP
CN116484848B (en) * 2023-03-17 2024-03-29 北京深维智讯科技有限公司 Text entity identification method based on NLP
CN116501884A (en) * 2023-03-31 2023-07-28 重庆大学 Medical entity identification method based on BERT-BiLSTM-CRF
CN116611436A (en) * 2023-04-18 2023-08-18 广州大学 Threat information-based network security named entity identification method
CN116545779B (en) * 2023-07-06 2023-10-03 鹏城实验室 Network security named entity recognition method, device, equipment and storage medium
CN116545779A (en) * 2023-07-06 2023-08-04 鹏城实验室 Network security named entity recognition method, device, equipment and storage medium
CN116561588A (en) * 2023-07-07 2023-08-08 北京国电通网络技术有限公司 Power text recognition model construction method, power equipment maintenance method and device
CN116561588B (en) * 2023-07-07 2023-10-20 北京国电通网络技术有限公司 Power text recognition model construction method, power equipment maintenance method and device
CN116682436A (en) * 2023-07-27 2023-09-01 成都大成均图科技有限公司 Emergency alert acceptance information identification method and device
CN117236338A (en) * 2023-08-29 2023-12-15 北京工商大学 Named entity recognition model of dense entity text and training method thereof
CN117236338B (en) * 2023-08-29 2024-05-28 北京工商大学 Named entity recognition model of dense entity text and training method thereof
CN117933259B (en) * 2024-03-25 2024-06-14 成都中医药大学 Named entity recognition method based on local text information
CN117933259A (en) * 2024-03-25 2024-04-26 成都中医药大学 Named entity recognition method based on local text information

Similar Documents

Publication Publication Date Title
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN112989834B (en) Named entity identification method and system based on flat grid enhanced linear converter
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN110008469B (en) Multilevel named entity recognition method
Xie et al. Fully convolutional recurrent network for handwritten chinese text recognition
CN110263325B (en) Chinese word segmentation system
CN112541356B (en) Method and system for recognizing biomedical named entities
CN112115238A (en) Question-answering method and system based on BERT and knowledge base
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN114943230B (en) Method for linking entities in Chinese specific field by fusing common sense knowledge
CN110555084A (en) remote supervision relation classification method based on PCNN and multi-layer attention
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
Ren et al. Detecting the scope of negation and speculation in biomedical texts by using recursive neural network
CN111476024A (en) Text word segmentation method and device and model training method
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
CN114417872A (en) Contract text named entity recognition method and system
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
Aggarwal et al. Recurrent neural networks
CN114254645A (en) Artificial intelligence auxiliary writing system
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210430)