CN111563383A - Chinese named entity identification method based on BERT and semi CRF - Google Patents

Info

Publication number: CN111563383A
Application number: CN202010272320.8A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: crf, named entity, layer, score, entity recognition
Inventors: 蔡毅, 郑煜佳
Original and current assignee: South China University of Technology (SCUT), which filed the application
Legal status: Pending (the listed status is an assumption by Google Patents, not a legal conclusion)

Classifications

    • G06F40/295 Named entity recognition (G06F40/00 Handling natural language data)
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates (G06F40/00 Handling natural language data)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (G06N3/02 Neural networks)
    • G06N3/045 Combinations of networks (G06N3/02 Neural networks)
    • G06N3/08 Learning methods (G06N3/02 Neural networks)


Abstract

The invention discloses a Chinese named entity recognition method based on BERT and SemiCRF, which constructs a named entity recognition model through the following steps: obtaining a pre-trained BERT model; preprocessing the original corpus data for named entity recognition to construct a training set; inputting the constructed training set data into the pre-trained BERT language model; sequentially feeding the output of the BERT language model into a bidirectional LSTM neural network and a joint CRF-and-SemiCRF module, and performing multiple iterations of training on the bidirectional LSTM network and the joint module; and performing named entity recognition on Chinese text with the complete named entity recognition model obtained after training. The invention solves the problem that traditional word2vec cannot distinguish polysemous words and, by introducing a SemiCRF-based method, combines the segment-level information often ignored by the traditional CRF method with word-level information, thereby improving the effect of Chinese named entity recognition to a certain extent.

Description

Chinese named entity identification method based on BERT and semi CRF
Technical Field
The invention relates to the technical field of named entity recognition, in particular to a Chinese named entity recognition method based on BERT and semi CRF.
Background
Named Entity Recognition (NER) is a task in the field of Natural Language Processing (NLP) that aims at identifying entities in text and classifying them into predefined entity types, such as names of people, places, and organizations. Named entity recognition can not only be used on its own as a tool for information extraction, but also plays an important role in other natural language processing tasks and applications, such as information retrieval, automatic text summarization, question answering, machine translation, and knowledge base construction.
The existing mainstream method for named entity recognition is Bi-LSTM + CRF. The Bi-LSTM (bidirectional long short-term memory network) used is a deep neural network popular in deep learning that can learn contextual feature relationships over long sequences; the CRF (conditional random field) used is a traditional machine learning method that can learn the context of labels in named entity recognition.
The Bi-LSTM + CRF method must learn its word-embedding representations from the named entity recognition data set itself, which has several drawbacks: Bi-LSTM cannot handle polysemy when learning word embeddings; named entity recognition data sets are not large, so the quality of the word embeddings learnable from them is limited; and Bi-LSTM cannot process data in parallel, so the dimensionality of its word embeddings must be kept small, otherwise the time cost of training multiplies. In addition, text has the property that named entities mostly exist as segments composed of several words, and the CRF in the Bi-LSTM + CRF method, working in word-level units, cannot exploit segment-level information.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a Chinese named entity recognition method based on BERT and SemiCRF. The invention addresses both the limited quality of learned word-embedding representations and the otherwise unsolved problem of polysemy, and avoids the problem that a CRF can only use word-level information while ignoring segment-level information.
The purpose of the invention can be realized by the following technical scheme:
A named entity recognition method based on BERT and SemiCRF builds a named entity recognition model, wherein the model comprises a BERT language model, a bidirectional LSTM, and a joint CRF-and-SemiCRF module, and the method comprises the following steps:
acquiring a pre-trained BERT language model;
preprocessing original corpus data of named entity recognition to construct a training set of the named entity recognition;
inputting the obtained training set data of named entity recognition into a pre-trained BERT language model;
sequentially inputting the output of the BERT language model into a bidirectional LSTM neural network and a joint CRF-and-SemiCRF module, and performing multiple iterations of training on the bidirectional LSTM network and the joint module;
and carrying out named entity recognition on the Chinese text by using the complete named entity recognition model obtained after training.
Further, the pre-trained BERT language model is obtained by: downloading Google's open-source BERT code and using it to pre-train a BERT language model on massive unlabeled Chinese text corpora; or directly downloading the Chinese BERT language model chinese_L-12_H-768_A-12 pre-trained and released by Google.
Further, the step of preprocessing the original corpus data of the named entity recognition and constructing the training set of the named entity recognition includes:
carrying out conventional data preprocessing on the named entity identification original corpus;
determining the entity type to be identified according to the actual application requirement, or directly using the universal entity type;
marking the original corpus by adopting an entity marking method of BIOES;
formulating specific labeling rules according to the actual application requirements, and, combining these rules, manually labeling the unlabeled original corpus or converting and correcting an already-labeled corpus.
Further, the step of sequentially inputting the output of the BERT language model into the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module, and performing multiple iterations of training on them, includes:
inputting the sequence output by the BERT language model into the bidirectional LSTM neural network;
the bidirectional LSTM neural network outputs, for each word in the sequence, a probability distribution vector over all entity types, i.e., the word-level CRF features of each word;
inputting the obtained word-level CRF feature sequence into the CRF layer and the SemiCRF layer of the joint CRF-and-SemiCRF module respectively;
the CRF layer calculates its loss function by the bi-LSTM + CRF method;
the SemiCRF layer calculates the SemiCRF features of each segment in the sentence from the CRF features and the ground-truth tags of each word in the sentence, and from them the score of the best path;
the SemiCRF layer calculates the scores of all paths with a forward algorithm using the SemiCRF feature transition matrix;
the loss function of the SemiCRF layer is calculated from the best-path score and the all-paths score;
SGD is used to update the parameters of the entire named entity recognition model with a weighted sum of the CRF-layer loss and the SemiCRF-layer loss.
Further, the step of performing named entity recognition on the Chinese text by using the trained complete named entity recognition model includes:
inputting the sentence needing named entity recognition into the trained complete named entity recognition model;
after the pre-trained BERT language model, the input sequence passes sequentially through the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module; the CRF features of each word of the input sentence are calculated, and from them the CRF feature matrix of the CRF layer and the SemiCRF feature matrix of the SemiCRF layer;
using the Viterbi algorithm, the best path of the input sentence is decoded on the CRF layer and the SemiCRF layer respectively, giving the CRF-layer tag sequence with its CRF-layer score, score_{C-C}, and the SemiCRF-layer tag sequence with its SemiCRF-layer score, score_{S-S};
the score of the CRF-decoded tag sequence on the SemiCRF layer, score_{C-S}, and the score of the SemiCRF-decoded tag sequence on the CRF layer, score_{S-C}, are calculated;
the total score of the CRF-layer tag sequence, score_{C-C} + score_{C-S}, and of the SemiCRF-layer tag sequence, score_{S-S} + score_{S-C}, are then calculated; because the scores come from negative log-likelihood processing, the tag sequence with the smaller total score is taken as the result of named entity recognition.
Compared with the prior art, the invention has the following beneficial effects:
1. The BERT model used by the invention can learn high-quality word-embedding representations from large-scale Chinese text through pre-training and fine-tuning; it is not limited to manually labeled and processed named entity recognition data sets, and it adjusts the current semantics according to the context, thereby solving the problem of polysemy.
2. Compared with a conditional random field (CRF) that uses only word-level information, the semi-Markov conditional random field (SemiCRF) introduced into named entity recognition here is better suited to named entities with obvious segment-level characteristics. To guarantee the recognition effect of the SemiCRF, it is combined with the CRF to a certain extent so that the model can effectively exploit word-level and segment-level features simultaneously.
3. The invention considers both the CRF and SemiCRF methods during training and decoding; in particular, taking the better-scoring result as the final result at decoding time helps ensure the accuracy of named entity recognition.
Drawings
FIG. 1 is a flow chart of a method for Chinese named entity recognition based on BERT and SemicRF in the present invention.
Fig. 2 is a schematic structural diagram of the named entity recognition model for Chinese named entity recognition based on BERT and SemiCRF in the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
Fig. 1 is a flow chart of the method for Chinese named entity recognition based on BERT and SemiCRF. A named entity recognition model as shown in Fig. 2 is constructed, wherein the model comprises a BERT language model, a bidirectional LSTM, and a joint CRF-and-SemiCRF module, and the method comprises the following steps:
S1, obtaining a pre-trained BERT model;
Specifically, the obtaining modes include: downloading Google's open-source BERT code and obtaining a BERT pre-trained language model with existing pre-training techniques on massive unlabeled Chinese text corpora; or directly downloading the Chinese BERT language model chinese_L-12_H-768_A-12 pre-trained and released by Google.
S2, preprocessing the original corpus data of the named entity recognition, and constructing a training set of the named entity recognition, comprising the following steps:
S21, performing conventional data preprocessing on the original corpus for named entity recognition, including correcting wrongly written characters, normalizing characters, and the like; the original corpus data are labeled named entity data;
S22, determining the entity types to be recognized according to the actual application requirements, or directly using universal entity types such as person name (PERSON), location name (LOCATION), and organization name (ORGANIZATION);
S23, to cope with entities of varying length whose boundaries are hard to distinguish, adopting the BIOES entity labeling method: B marks the beginning of a multi-character entity, I its interior, E its tail, S an entity consisting of a single character, and O a non-entity; for example, the three-character name "Liu Xuande" would be labeled (B-PER, I-PER, E-PER);
S24, formulating specific labeling rules according to the actual application requirements, and, combining the labeling rules of steps S22 and S23, manually labeling the unlabeled original corpus or converting and correcting an already-labeled corpus.
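As an illustrative sketch (not part of the patent text), the BIOES labeling of steps S23 and S24 can be expressed in code; the function name and the (start, end, type) span format are assumptions chosen for the example:

```python
def spans_to_bioes(n_tokens, entities):
    """Convert entity spans into a BIOES tag sequence.

    entities: list of (start, end, type) with inclusive, 0-based
    token indices; positions outside every entity get the tag "O".
    """
    tags = ["O"] * n_tokens
    for start, end, etype in entities:
        if start == end:                      # single-character entity -> S
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"        # beginning of a multi-char entity
            for i in range(start + 1, end):   # interior characters -> I
                tags[i] = f"I-{etype}"
            tags[end] = f"E-{etype}"          # last character -> E
    return tags

# A three-character person name spanning tokens 0..2, followed by a non-entity:
print(spans_to_bioes(4, [(0, 2, "PER")]))  # ['B-PER', 'I-PER', 'E-PER', 'O']
```

The inverse conversion (tags back to spans) is the usual decoding step after prediction and follows the same scheme.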
And S3, inputting the training set data of the named entity recognition obtained by preprocessing in the step S2 into the pretrained BERT language model.
Specifically, the training set data is input into a pre-trained BERT language model in sentence units, and the output of the BERT language model is a word embedding vector sequence.
S4, sequentially inputting the output of the BERT language model from step S3 into the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module, and performing multiple iterations of training on the bidirectional LSTM network and the joint module, comprising the steps of:
S41, inputting the sequence output by the BERT language model into the bidirectional LSTM neural network;
S42, the bidirectional LSTM neural network outputs, for each word in the sequence, a probability distribution vector over all entity types, i.e., the word-level CRF features of each word;
S43, inputting the obtained CRF feature sequences into the CRF layer and the SemiCRF layer of the joint CRF-and-SemiCRF module respectively;
S44, calculating the loss function of the CRF layer by the bi-LSTM + CRF method;
For example, suppose the input sentence translates to "Dr. Wilson came to California for research", with ground-truth label sequence (B-PER, I-PER, E-PER, O, O, O, B-LOC, I-LOC, I-LOC, I-LOC, E-LOC, O, O, O, O, O). For a character inside the location entity, the CRF considers not only the score of that position itself being labeled I-LOC, but also the labels of the neighboring characters: if the character were labeled I-PER, the transition from the preceding B-LOC to I-PER is obviously impossible, so a CRF that has learned label context from the data set assigns a very low score to labeling that character I-PER;
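The label-context behavior in the S44 example can be sketched as follows; the hand-set transition function and emission scores below are illustrative assumptions, not the patent's learned parameters (a trained CRF learns both from data):

```python
import math

def transition(prev_tag, tag):
    """Illustrative hand-set transition scores. -inf marks transitions
    the BIOES scheme forbids: after B-LOC only I-LOC or E-LOC may follow."""
    if prev_tag == "B-LOC" and tag not in {"I-LOC", "E-LOC"}:
        return -math.inf
    return 0.0

def path_score(emissions, path):
    """Sum of per-position emission scores plus transition scores along
    a tag path (emissions: one {tag: score} dict per position)."""
    score = emissions[0][path[0]]
    for k in range(1, len(path)):
        score += transition(path[k - 1], path[k]) + emissions[k][path[k]]
    return score

# Two-character fragment: first character clearly B-LOC, second ambiguous.
emissions = [
    {"B-LOC": 3.0, "I-LOC": 0.1, "I-PER": 0.1, "E-LOC": 0.1, "O": 0.1},
    {"B-LOC": 0.1, "I-LOC": 1.0, "I-PER": 2.0, "E-LOC": 0.5, "O": 0.1},
]
print(path_score(emissions, ["B-LOC", "I-LOC"]))  # 4.0
print(path_score(emissions, ["B-LOC", "I-PER"]))  # -inf (forbidden transition)
```

Even though I-PER has the higher emission score at the second position, the forbidden transition makes the whole path score negative infinity, which is exactly how a CRF uses label context to reject it.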
S45, the SemiCRF layer calculates the SemiCRF features of each segment in the sentence from the CRF features and the ground-truth labels of each word in the sentence, and from them the score of the best path;
In the invention, the words in the word-embedding vector sequence correspond to single characters of the text, and the segment level corresponds to words of the text. For the same input sequence as above, the segment-level annotation sequence for the SemiCRF is ((1,3,PER), (4,4,O), (5,5,O), (6,6,O), (7,7,O), (8,12,LOC), (13,13,O), (14,14,O), (15,15,O), (16,16,O)), and the score of the best path can be calculated according to the following formulas:

score(s, w) = Σ_{i=1}^{|s|} ( m_i + b_{l_{i-1}, l_i} )

m_i = Σ_{k=b_i}^{e_i} θ_{y_k}^T w'_k

w'_k = [ w_k ; d_{k-b_i} ]

where s denotes the segment-level tag sequence; w the word-embedding vector representation of the input sequence; l_i the i-th segment-level label; b_i and e_i the positions on the input sequence of the beginning and end of the i-th segment; m_i the score of the i-th segment itself; b_{i,j} the segment-level transition score from category i to category j; y_k the ground-truth tag of the k-th word of the input sequence; θ_{y_k} the weight parameter vector associated with label y_k; and w'_k the feature vector of the k-th word, constructed by concatenating the word embedding w_k with d_{k-b_i}, the embedding vector corresponding to the word's index within its segment;
S46, the SemiCRF layer calculates the scores of all paths with a forward algorithm using the SemiCRF feature transition matrix;
S47, calculating the loss function of the SemiCRF layer from the best-path score and the all-paths score; the scores required by the loss function are obtained through negative log-likelihood processing, so the loss is expressed as Loss = score_{all_path} - score_{best_path}.
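The all-paths score of S46 comes from the SemiCRF forward algorithm, which sums over every segmentation and label assignment; a minimal log-space sketch, where the generic segment-score and transition functions are illustrative assumptions:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def semicrf_log_partition(n, labels, seg_score, trans, max_len):
    """Forward algorithm over segmentations.
    alpha[j][y] = log-sum of exp-scores of all segmentations of the
    first j tokens whose final segment carries label y."""
    alpha = [{y: -math.inf for y in labels} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for y in labels:
            terms = []
            for d in range(1, min(max_len, j) + 1):  # last segment length
                i = j - d                            # segment covers tokens i..j-1
                s = seg_score(i, j - 1, y)
                if i == 0:
                    terms.append(s)                  # first segment: no transition
                else:
                    for yp in labels:
                        terms.append(alpha[i][yp] + trans(yp, y) + s)
            alpha[j][y] = logsumexp(terms)
    return logsumexp([alpha[n][y] for y in labels])

# Tiny hand-checkable case: 2 tokens, one label "A"; the only two
# segmentations are [(0,0),(1,1)] and [(0,1)].
seg = lambda b, e, y: float(e - b + 1)   # toy segment score
tr = lambda yp, y: -0.5                  # toy transition score
z = semicrf_log_partition(2, ["A"], seg, tr, 2)
expected = math.log(math.exp(1.0 + 1.0 - 0.5) + math.exp(2.0))
print(abs(z - expected) < 1e-9)  # True
```

With both the all-paths log-score and the best-path score in hand, the S47 loss is simply their difference.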
S48, adopting SGD (stochastic gradient descent) to update the parameters of the whole named entity recognition model with a weighted sum of the CRF-layer loss and the SemiCRF-layer loss; the parameters include those of the BERT language model, the LSTM neural network, and the joint CRF-and-SemiCRF module. The weighting coefficients are tuned by a controlled-variable method and may change with different named entity recognition training data.
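The weighted-sum training objective of S48 can be illustrated with a toy example; the quadratic stand-in losses, scalar parameter, and learning rate below are assumptions chosen only to show the mechanics of SGD on a weighted combination of two losses:

```python
def sgd_step(params, grads, lr=0.1):
    """One stochastic gradient descent (SGD) update."""
    return [p - lr * g for p, g in zip(params, grads)]

def combined_loss_and_grad(theta, w_crf=0.5, w_semi=0.5):
    """Toy stand-ins for the CRF-layer and SemiCRF-layer losses:
    two quadratics in a shared scalar parameter, combined by the
    weighted sum described in S48."""
    loss_crf, grad_crf = (theta - 1.0) ** 2, 2.0 * (theta - 1.0)
    loss_semi, grad_semi = (theta + 1.0) ** 2, 2.0 * (theta + 1.0)
    loss = w_crf * loss_crf + w_semi * loss_semi
    grad = w_crf * grad_crf + w_semi * grad_semi
    return loss, grad

theta = 5.0
for _ in range(100):
    _, grad = combined_loss_and_grad(theta)
    (theta,) = sgd_step([theta], [grad])
print(round(theta, 3))  # 0.0: the minimum of the equally weighted sum
```

Changing the two weights moves the optimum between the two individual minima, which is the same trade-off the controlled-variable tuning of the weights explores.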
S5, using the complete named entity recognition model obtained by training in the step S4 to recognize the named entity of the Chinese text, comprising the following steps:
S51, inputting the sentence requiring named entity recognition into the trained complete named entity recognition model;
S52, after the pre-trained BERT language model, the input sequence passes sequentially through the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module; the CRF features of each word of the input sentence are calculated, and from them the CRF feature matrix of the CRF layer and the SemiCRF feature matrix of the SemiCRF layer;
S53, decoding the best path of the input sentence on the CRF layer and the SemiCRF layer respectively with the Viterbi algorithm, obtaining the CRF-layer tag sequence with its CRF-layer score, score_{C-C}, and the SemiCRF-layer tag sequence with its SemiCRF-layer score, score_{S-S};
S54, calculating the score of the tag sequence decoded by the CRF layer in step S53 on the SemiCRF layer, score_{C-S}, and the score of the tag sequence decoded by the SemiCRF layer on the CRF layer, score_{S-C};
S55, calculating the total scores of the two tag sequences obtained in step S53, score_{C-C} + score_{C-S} and score_{S-S} + score_{S-C}; because the scores come from negative log-likelihood processing, the tag sequence with the smaller total score is taken as the result of named entity recognition.
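Steps S53 to S55 reduce to a simple selection between the two decoded candidates; the sequences and score values below are illustrative only:

```python
def select_final_sequence(crf_seq, semicrf_seq,
                          score_cc, score_cs, score_ss, score_sc):
    """Choose between the CRF-decoded and SemiCRF-decoded tag sequences.
    All scores are negative log-likelihoods, so the smaller combined
    score wins (S55):
      CRF candidate total:     score_cc + score_cs
      SemiCRF candidate total: score_ss + score_sc
    """
    if score_cc + score_cs <= score_ss + score_sc:
        return crf_seq
    return semicrf_seq

# Illustrative numbers: the CRF candidate totals 2.1, the SemiCRF
# candidate 2.6, so the CRF-decoded sequence is returned.
print(select_final_sequence(["B-PER", "E-PER"], ["S-PER", "O"],
                            1.2, 0.9, 1.5, 1.1))  # ['B-PER', 'E-PER']
```

Scoring each candidate on *both* layers before comparing is what lets the model pick whichever of the word-level and segment-level views fits the sentence better.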
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A method for recognizing Chinese named entities based on BERT and SemiCRF, characterized in that a named entity recognition model is constructed, the model comprising a BERT language model, a bidirectional LSTM, and a joint CRF-and-SemiCRF module, and the method comprising the following steps:
obtaining a pre-trained BERT model;
preprocessing original corpus data of named entity recognition to construct a training set of the named entity recognition;
inputting the constructed training set data of named entity recognition into a pre-trained BERT language model;
sequentially inputting the output of the BERT language model into a bidirectional LSTM neural network and a joint CRF-and-SemiCRF module, and performing multiple iterations of training on the bidirectional LSTM network and the joint module;
and carrying out named entity recognition on the Chinese text by using the complete named entity recognition model obtained after training.
2. The method of claim 1, wherein the pre-trained BERT model is obtained by: downloading Google's open-source BERT code and using it to pre-train a BERT language model on massive unlabeled Chinese text corpora; or directly downloading the Chinese BERT language model chinese_L-12_H-768_A-12 pre-trained and released by Google.
3. The method according to claim 1, wherein the step of preprocessing the raw corpus data of the named entity recognition to construct the training set of the named entity recognition comprises:
carrying out conventional data preprocessing on the named entity identification original corpus;
determining the entity type to be identified according to the actual application requirement, or directly using the universal entity type;
marking the original corpus by adopting an entity marking method of BIOES;
formulating specific labeling rules according to the actual application requirements, and, combining these rules, manually labeling the unlabeled original corpus or converting and correcting an already-labeled corpus.
4. The method of claim 1, wherein the step of inputting the output of the BERT language model to the bi-directional LSTM neural network and the CRF and SemiCRF joint module sequentially comprises the steps of:
inputting the sequence output by the BERT language model into a bidirectional LSTM neural network;
the bidirectional LSTM neural network outputs probability distribution vectors of all entity types of each word in the sequence, namely CRF characteristics of each word at a word level;
respectively inputting the obtained CRF characteristic sequences into a CRF layer and a semiCRF layer in a CRF and semiCRF combined module;
calculating a loss function of the CRF layer by adopting a bi-LSTM + CRF method in the CRF layer;
the SemiCRF layer calculates the SemiCRF features of each segment in the sentence according to the CRF features and the ground-truth tags of each word in the sentence, and further calculates the score of the best path;
the SemiCRF layer calculates the scores of all paths through a forward algorithm using the SemiCRF feature transition matrix;
calculating a loss function of the SemiCRF layer according to the best-path score and the all-paths score;
the SGD is used to update the parameters of the named entity recognition model with a weighted sum of the loss function of the CRF layer and the loss function of the SemiCRF layer.
5. The method of claim 4, wherein the SemiCRF layer calculates the SemiCRF features of each segment in the sentence according to the CRF features and the ground-truth tags of each word in the sentence, and further calculates the best-path score, the best-path score being calculated by:

score(s, w) = Σ_{i=1}^{|s|} ( m_i + b_{l_{i-1}, l_i} )

m_i = Σ_{k=b_i}^{e_i} θ_{y_k}^T w'_k

w'_k = [ w_k ; d_{k-b_i} ]

where s denotes the segment-level tag sequence; w the word-embedding vector representation of the input sequence; l_i the i-th segment-level label; b_i and e_i the positions on the input sequence of the beginning and end of the i-th segment; m_i the score of the i-th segment itself; b_{i,j} the segment-level transition score from category i to category j; y_k the ground-truth tag of the k-th word of the input sequence; θ_{y_k} the weight parameter vector associated with label y_k; and w'_k the feature vector of the k-th word, constructed by concatenating the word embedding w_k with d_{k-b_i}, the embedding vector corresponding to the word's index within the segment.
6. The method of claim 4, wherein, in the step of calculating the loss function of the SemiCRF layer from the best-path score and all path scores, the scores are processed by negative log-likelihood, so the loss function is finally expressed as Loss = score_{all_path} - score_{best_path}.
7. The method according to claim 1, wherein the step of performing named entity recognition on Chinese text using the trained complete named entity recognition model comprises:
inputting the sentence needing named entity recognition into the trained complete named entity recognition model;
after the pre-trained BERT language model, the input sequence passes sequentially through the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module; the CRF features of each word of the input sentence are calculated, and from them the CRF feature matrix of the CRF layer and the SemiCRF feature matrix of the SemiCRF layer;
decoding the best path of the input sentence on the CRF layer and the SemiCRF layer respectively with the Viterbi algorithm, obtaining the CRF-layer tag sequence with its CRF-layer score, score_{C-C}, and the SemiCRF-layer tag sequence with its SemiCRF-layer score, score_{S-S};
calculating the score of the tag sequence decoded by the CRF layer on the SemiCRF layer, score_{C-S}, and the score of the tag sequence decoded by the SemiCRF layer on the CRF layer, score_{S-C};
calculating the total score of the CRF-layer tag sequence, score_{C-C} + score_{C-S}, and of the SemiCRF-layer tag sequence, score_{S-S} + score_{S-C}; because the scores come from negative log-likelihood processing, the tag sequence with the smallest score is taken as the result of named entity recognition.
CN202010272320.8A 2020-04-09 2020-04-09 Chinese named entity identification method based on BERT and semi CRF Pending CN111563383A (en)


Publications (1)

Publication number: CN111563383A; publication date: 2020-08-21


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879831A (en) * 2019-10-12 2020-03-13 杭州师范大学 Chinese medicine sentence word segmentation method based on entity recognition technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHI-XIU YE ET AL.: "Hybrid Semi-Markov CRF for Neural Sequence Labeling", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics *
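The non-patent reference above captures the core idea behind the semi-CRF layer named in the title: score and label whole candidate spans rather than individual tokens. As a rough, hypothetical illustration (not the patent's or the paper's actual code), the sketch below decodes the highest-scoring labeled segmentation with a semi-Markov Viterbi dynamic program; the segment scores, label set, transition table, and `max_len` bound are invented stand-ins for what a BERT encoder would supply in a real system:

```python
# Illustrative semi-Markov CRF Viterbi decoder (a sketch, not the patented method).
# seg_score maps (start, end, label) -> score for tagging tokens[start:end] with label;
# trans maps (prev_label, label) -> transition score between adjacent segments.

NEG = float("-inf")

def semi_crf_decode(n, labels, seg_score, trans, max_len=4):
    """Return (best_score, segments), segments as [(start, end, label), ...]."""
    # best[j][y]: best score of a segmentation of the first j tokens
    # whose last segment carries label y
    best = [{y: NEG for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):  # candidate segment covers tokens[i:j]
            for y in labels:
                emit = seg_score.get((i, j, y), NEG)
                if emit == NEG:
                    continue
                if i == 0:
                    cand, prev = emit, None
                else:
                    # best predecessor label, adding its label-transition score
                    prev, prev_score = max(
                        ((p, best[i][p] + trans.get((p, y), 0.0)) for p in labels),
                        key=lambda t: t[1],
                    )
                    cand = prev_score + emit
                if cand > best[j][y]:
                    best[j][y] = cand
                    back[j][y] = (i, prev)
    # backtrack from the best final label
    y = max(labels, key=lambda lab: best[n][lab])
    score, segments, j = best[n][y], [], n
    while j > 0:
        i, prev = back[j][y]
        segments.append((i, j, y))
        j, y = i, prev
    return score, list(reversed(segments))

# Toy example: 5 tokens ("张三在北京"), with hand-picked span scores
seg_score = {(0, 2, "PER"): 2.0, (2, 3, "O"): 1.0, (3, 5, "LOC"): 2.0}
score, segs = semi_crf_decode(5, ["PER", "O", "LOC"], seg_score, trans={})
# score == 5.0; segs == [(0, 2, "PER"), (2, 3, "O"), (3, 5, "LOC")]
```

Bounding segment length by `max_len` keeps decoding at O(n · max_len · |Y|²), which is the usual trade-off that makes semi-CRF labeling practical for entity spans of limited length.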

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022048210A1 (en) * 2020-09-03 2022-03-10 平安科技(深圳)有限公司 Named entity recognition method and apparatus, and electronic device and readable storage medium
CN111967266A (en) * 2020-09-09 2020-11-20 中国人民解放军国防科技大学 Chinese named entity recognition model and construction method and application thereof
CN111967266B (en) * 2020-09-09 2024-01-26 中国人民解放军国防科技大学 Chinese named entity recognition system, model construction method, application and related equipment
CN112115238A (en) * 2020-10-29 2020-12-22 电子科技大学 Question-answering method and system based on BERT and knowledge base
CN112115238B (en) * 2020-10-29 2022-11-15 电子科技大学 Question-answering method and system based on BERT and knowledge base
CN112347253A (en) * 2020-11-04 2021-02-09 新智数字科技有限公司 Method and device for establishing text information recognition model and terminal equipment
CN112347253B (en) * 2020-11-04 2023-09-08 新奥新智科技有限公司 Text information recognition model building method and device and terminal equipment
CN112699682A (en) * 2020-12-11 2021-04-23 山东大学 Named entity identification method and device based on combinable weak authenticator
CN112733533A (en) * 2020-12-31 2021-04-30 浙大城市学院 Multi-mode named entity recognition method based on BERT model and text-image relation propagation
CN112733533B (en) * 2020-12-31 2023-11-07 浙大城市学院 Multi-modal named entity recognition method based on BERT model and text-image relation propagation
CN112836046A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Four-risk one-gold-field policy and regulation text entity identification method
CN112949310A (en) * 2021-03-01 2021-06-11 创新奇智(上海)科技有限公司 Model training method, traditional Chinese medicine name recognition method and device and network model
CN113011141A (en) * 2021-03-17 2021-06-22 平安科技(深圳)有限公司 Buddha note model training method, Buddha note generation method and related equipment
CN113158671B (en) * 2021-03-25 2023-08-11 胡明昊 Open domain information extraction method combined with named entity identification
CN113158671A (en) * 2021-03-25 2021-07-23 胡明昊 Open domain information extraction method combining named entity recognition
CN113127060A (en) * 2021-04-09 2021-07-16 中通服软件科技有限公司 Software function point identification method based on natural language pre-training model (BERT)
CN113344098A (en) * 2021-06-22 2021-09-03 北京三快在线科技有限公司 Model training method and device
CN113468889A (en) * 2021-06-29 2021-10-01 上海犀语科技有限公司 Method and device for extracting model information based on BERT pre-training
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training
CN113722476A (en) * 2021-07-30 2021-11-30 的卢技术有限公司 Resume information extraction method and system based on deep learning
CN113673248A (en) * 2021-08-23 2021-11-19 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113673248B (en) * 2021-08-23 2022-02-01 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113849597A (en) * 2021-08-31 2021-12-28 艾迪恩(山东)科技有限公司 Illegal advertising word detection method based on named entity recognition
CN113761891A (en) * 2021-08-31 2021-12-07 国网冀北电力有限公司 Power grid text data entity identification method, system, equipment and medium
CN113849597B (en) * 2021-08-31 2024-04-30 艾迪恩(山东)科技有限公司 Illegal advertisement word detection method based on named entity recognition
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN115221882A (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN115221882B (en) * 2022-07-28 2023-06-20 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN115713083A (en) * 2022-11-23 2023-02-24 重庆邮电大学 Intelligent extraction method for key information of traditional Chinese medicine text
CN115713083B (en) * 2022-11-23 2023-12-15 北京约来健康科技有限公司 Intelligent extraction method for traditional Chinese medicine text key information
CN116204610A (en) * 2023-04-28 2023-06-02 深圳市前海数据服务有限公司 Data mining method and device based on named entity recognition of report capable of being ground

Similar Documents

Publication Publication Date Title
CN111563383A (en) Chinese named entity identification method based on BERT and semi CRF
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN109635279B (en) Chinese named entity recognition method based on neural network
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN107203511A (en) A kind of network text name entity recognition method based on neutral net probability disambiguation
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN110750959A (en) Text information processing method, model training method and related device
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN111460824B (en) Unmarked named entity identification method based on anti-migration learning
CN110555084A (en) remote supervision relation classification method based on PCNN and multi-layer attention
CN111444704B (en) Network safety keyword extraction method based on deep neural network
CN111008526A (en) Named entity identification method based on dual-channel neural network
CN110276069A (en) A kind of Chinese braille mistake automatic testing method, system and storage medium
CN110837736B (en) Named entity recognition method of Chinese medical record based on word structure
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN110991185A (en) Method and device for extracting attributes of entities in article
CN114153971A (en) Error-containing Chinese text error correction, identification and classification equipment
CN115510864A (en) Chinese crop disease and pest named entity recognition method fused with domain dictionary
Du et al. Named entity recognition method with word position
CN110569506A (en) Medical named entity recognition method based on medical dictionary
CN111507103B (en) Self-training neural network word segmentation model using partial label set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200821