CN117610567A - Named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF - Google Patents
- Publication number
- CN117610567A (application CN202311539422.1A)
- Authority
- CN
- China
- Prior art keywords
- idcnn
- crf
- att
- model
- bigru
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF, comprising the steps of: step 1: the ERNIE3.0 model, a large-language pre-training model released by Baidu, is used as the semantic characterization model; step 2: the semantic word vectors characterized by ERNIE3.0 in the previous step are passed through an Att (attention mechanism) to strengthen the sequences before and after each entity; step 3: the output of step 2 is embedded into an IDCNN (iterated dilated convolutional neural network) to obtain the local features of entity sequences in a sentence; step 4: the output of the IDCNN is connected to a BiGRU (bidirectional gated recurrent unit); step 5: finally, a classification layer and a CRF (conditional random field) are added to obtain the final result. A series of experiments on three Chinese datasets for the named entity recognition task, MSRA, Weibo and People's Daily, verifies the validity of the ERNIE3.0_Att_IDCNN_BiGRU_CRF model.
Description
Technical Field
The invention belongs to the technical field of Chinese named entity recognition in natural language processing, and relates to a named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF.
Background
With the development of technologies such as the Internet and intelligent manufacturing, the information generated in production and daily life has grown exponentially, and mining valuable information from massive data has become especially important, so knowledge mining has become one of the research hotspots worldwide. Named entity recognition, as one of the important links of knowledge extraction, has long been a foundation of this research. For example, JIB proposed that dictionary + BiLSTM_CRF is one of the most common and most classical schemes, extracting multidimensional information from Chinese text through different feature-vector construction modes and thereby improving model recognition performance. Emma Strubell and Patrick Verga proposed the IDCNN, obtained by dilating the convolution kernels of a CNN, which makes the CNN, originally unsuited to sequence problems, applicable to them and puts the computational advantage of the CNN to better use. Since BERT was released by Google, the semantic word vectors characterized by pre-training models have brought remarkable improvements to many Chinese task models; Yan Yangtian and Yang Wenming explored named entity recognition with the BERT_BiLSTM_CRF and BERT_IDCNN_CRF algorithms respectively, and verified their better performance through experiments.
In practical knowledge extraction, however, named entity recognition data are usually small samples, because manual labelling is costly and expert effort is limited; and because the BERT pre-training model does not account for the characteristics of Chinese grammar, phrases and words, its accuracy on the named entity recognition task is limited on small samples and on text with a high degree of subject intersection. Given how widespread such named entity tasks are in actual production, this research was conducted and a named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF is proposed. It is not a simple replacement of the pre-trained model: it also characterizes the local and global features of entities better and more accurately.
Disclosure of Invention
In order to solve the above problems, a named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF is proposed: a pre-training model better suited to Chinese tasks is adopted, and a multi-head attention mechanism together with two neural networks is used to extract the sequence features of entities in sentences.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF, comprising the steps of:
step 1: the ERNIE3.0 model, a large-language pre-training model released by Baidu, is used as the semantic characterization model;
step 2: the semantic word vectors characterized by ERNIE3.0 in the previous step are passed through an Att (attention mechanism) to strengthen the sequences before and after each entity;
step 3: the output of step 2 is embedded into an IDCNN (iterated dilated convolutional neural network) to obtain the local features of entity sequences in a sentence;
step 4: the output of the IDCNN is connected to a BiGRU (bidirectional gated recurrent unit);
step 5: finally, a classification layer and a CRF (conditional random field) are added to obtain the final result.
Preferably, before step 1, a sliding window is used to intercept the corpus according to the window size (i.e. every two sentences form one sample), constructing a dataset in a relatively standard format.
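As an illustration only (the function name, window size and stride are assumptions, not part of the claim), the sliding-window interception described above might be sketched as:

```python
# Hypothetical sketch of the sliding-window corpus construction: every
# `window` consecutive sentences form one sample; `stride` controls overlap.
def build_corpus(sentences, window=2, stride=1):
    """Slide a window of `window` sentences over the text, `stride` at a time."""
    samples = []
    for start in range(0, len(sentences) - window + 1, stride):
        samples.append(" ".join(sentences[start:start + window]))
    return samples

sents = ["Sentence one.", "Sentence two.", "Sentence three."]
print(build_corpus(sents))
# → ['Sentence one. Sentence two.', 'Sentence two. Sentence three.']
```

With `stride=window` the windows become non-overlapping, matching a reading of "every two sentences are one corpus" without shared sentences.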
Preferably, the ERNIE pre-training model used in step 1 differs from the BERT pre-training model in its masking strategy: ERNIE masks knowledge units that carry grammatical information, such as words and phrases, and then characterizes semantics through context prediction, whereas BERT masks only single characters; ERNIE therefore captures more grammatical characteristics and is better suited to Chinese-language tasks.
Preferably, in step 2, a multi-head attention mechanism is used to semantically strengthen the characterization of the entity context, so that the functional relationships within it are learned more easily.
Preferably, in step 3, the convolution kernels of a CNN are dilated to obtain the IDCNN, so that while retaining the speed advantage of the CNN, the original CNN's unsuitability for front-to-back sequences is also remedied; the IDCNN is used to explore the local features of entities in a sentence so that entity features are expressed more accurately.
Preferably, step 4 further uses a BiGRU to extract global features, which is simpler and more efficient than algorithms that explore long-sequence problems with a BiLSTM (bidirectional long short-term memory network) or the like.
Preferably, in step 5, the CRF makes a reasonable judgment of each entity's label, giving the algorithm higher accuracy.
Compared with the prior art, the invention has the following beneficial effects:
In practical small-sample, multidisciplinary named entity recognition tasks, the text is often too complex and existing techniques frequently suffer from low recognition accuracy; a named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF is therefore proposed. The model uses the ERNIE pre-training model, one of the pre-training models that best understands Chinese, and then combines an attention mechanism with other neural networks, so that the local and global characteristics of entity sequences are both considered while avoiding a model that is overly complex and slow to compute. Finally, a CRF is applied to account for the plausibility of label categories. A series of experiments on three Chinese named entity recognition datasets (MSRA, Weibo and People's Daily) verifies the effectiveness of the model. The results show that, under small-sample multidisciplinary intersection, the model outperforms the other comparison models, whether machine learning or deep learning models, on the F1 index for named entity recognition. In addition, the model's speed also ranks among the best of the compared models.
Drawings
FIG. 1 is a flowchart of the overall named entity recognition based on the ERNIE3.0_Att_IDCNN_BiGRU_CRF algorithm;
FIG. 2 is a schematic diagram of a five-fold cross-validation method for partitioning training and testing sets;
FIG. 3 is a schematic diagram of the ERNIE3.0_Att_IDCNN_BiGRU_CRF model.
Detailed Description
The following detailed description of the preferred embodiments of the invention is provided to enable those skilled in the art to understand the advantages and features of the invention more readily, and to define its scope of protection clearly and concisely.
A named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF, comprising the steps of:
step 1: the ERNIE3.0 model, a large-language pre-training model released by Baidu, is used as the semantic characterization model;
step 2: the semantic word vectors characterized by ERNIE3.0 in the previous step are passed through an Att (attention mechanism) to strengthen the sequences before and after each entity;
step 3: the output of step 2 is embedded into an IDCNN (iterated dilated convolutional neural network) to obtain the local features of entity sequences in a sentence;
step 4: the output of the IDCNN is connected to a BiGRU (bidirectional gated recurrent unit);
step 5: finally, a classification layer and a CRF (conditional random field) are added to obtain the final result.
As shown in fig. 1, before being input into the model the raw data must undergo preprocessing operations such as text segmentation, noise removal, spell checking, data cleaning and removal of redundant spaces. After preprocessing, a sliding window automatically reads every two sentences into one context to construct the dataset. The processed data are then annotated with the YEDDA labelling tool according to the specified named entity labelling rules. Finally, after the labelled dataset is shuffled, it is randomly divided into 5 mutually non-overlapping subsets using the five-fold cross-validation method (shown in fig. 2). Each subset in turn is taken as the validation set, with the remaining 4 subsets as the training set.
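The five-fold partition just described (and pictured in fig. 2) can be sketched as follows; the function name and random seed are illustrative assumptions:

```python
import random

# Minimal sketch of the five-fold split: the labelled samples are shuffled,
# partitioned into 5 non-overlapping subsets, and each subset serves once as
# the validation set while the remaining 4 form the training set.
def five_fold_splits(samples, k=5, seed=42):
    data = list(samples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # round-robin gives near-equal sizes
    for i in range(k):
        val = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, val

for train, val in five_fold_splits(range(10)):
    print(len(train), len(val))  # → 8 2 on each of the 5 folds
```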
After the training data are input in step 1, text embedding is performed in the Embedding layer with the ERNIE3.0 model (shown in fig. 3): the input text sequence is converted into fixed-size vector representations that serve as the input of the subsequent network, improving the model's performance and generalization capability.
Step 1.1: In constructing word vectors, word2vec is simple and efficient, trains quickly, and can express semantically similar words as similar vectors, but it cannot handle polysemy (the same word may have different meanings in different contexts) and does not consider context order. GloVe addresses word2vec's inability to handle word ambiguity and models a global corpus so that the relations between word vectors are more accurate, but it cannot directly process a context sequence, so some information may be lost, and on tasks in specific domains a small training dataset can lead to poor results. ELMo can capture semantic information at different levels and has strong expressive power, but its large parameter count makes computation expensive, and as a deep-model-based method the meaning of each dimension is hard to interpret. The ERNIE model has achieved excellent results on many NLP tasks, including named entity recognition, text classification and sentiment analysis, so using the ERNIE model for text embedding in the named entity recognition task achieves better results and applicability.
Step 1.2: so far, ernie has evolved to version 3.0. The Ernie3.0 model is a pre-training model developed by Baidu corporation in the field of natural language processing, and has the advantages of multi-language support, rich fields, context sensitivity (the Ernie3.0 adopts a structure based on a transducer, and is excellent in context understanding, and particularly in NER tasks, entities can be better identified according to the context understanding), combined training and the like compared with other pre-training models. Thus, text embedding using the Ernie3.0 model in named entity recognition may improve the accuracy and efficiency of the model, especially in multi-lingual, multi-domain and context sensitive scenarios.
As shown in fig. 3, in step 2 a multi-head attention mechanism is added after the Embedding layer to capture context information at different levels and positions, so that long text sequences can be modelled effectively.
Step 2.1: The named entity recognition task requires contextual information to label entities in text accurately. The multi-head attention mechanism is a self-attention mechanism that can learn the relations between different positions and different levels in the context, capturing richer contextual information and improving the model's performance on NER tasks.
Step 2.2: The multi-head attention mechanism derives several attention heads from the input vector through different linear transformations, performs attention computation on each head, and concatenates the results into the final representation. The advantage is that information at different positions and levels can be attended to simultaneously, so context is captured better. In the named entity recognition task the input is a text sequence; the model processes each word in turn from left to right and predicts whether it is an entity from the context. Multi-head attention can capture the relationship between the current word and both the preceding and following words, modelling the whole sequence better. Beyond sequence modelling, multi-head attention also learns contextual information at different positions in the text: in the NER task the model must consider the words before and after an entity, and multi-head attention learns the relations between different positions to capture that context better.
Step 3 performs the convolution operation in the convolution layer with an IDCNN, a convolutional neural network structure improved on the basis of the CNN that better captures long-distance dependencies when processing sequence data. The IDCNN expands the receptive field through multiple iterations, so that the convolution kernels cover a wider area and extract feature information spanning different time steps and spatial positions; the iterated dilated convolution operation combines several convolution kernels of different sizes to enhance the model's receptive field and feature-extraction capability.
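The dilated convolution underlying the IDCNN can be illustrated with a one-dimensional sketch; the kernel values and dilation rate below are arbitrary examples, not model parameters:

```python
import numpy as np

# A width-3 kernel with dilation d looks at positions t, t+d, t+2d, so
# stacking layers with dilations 1, 2, 4, ... widens the receptive field
# exponentially without adding parameters, which is the IDCNN idea.
def dilated_conv1d(x, kernel, dilation):
    k = len(kernel)
    span = dilation * (k - 1)
    out = np.zeros(len(x) - span)
    for t in range(len(out)):
        out[t] = sum(kernel[i] * x[t + i * dilation] for i in range(k))
    return out

x = np.arange(10, dtype=float)
h = dilated_conv1d(x, kernel=[1.0, 1.0, 1.0], dilation=2)  # sums x[t], x[t+2], x[t+4]
print(h)  # → [ 6.  9. 12. 15. 18. 21.]
```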
Step 4 uses a bidirectional GRU for sequence modelling. Because the bidirectional GRU comprises two GRUs, one processing the input sequence from front to back and the other from back to front, using forward and backward information simultaneously improves the model's ability to model context. In the named entity recognition task the input is a text sequence; through sequence modelling with the bidirectional GRU, the forward and backward hidden states are obtained at each time step and combined into a more comprehensive representation, which better captures context information and helps identify named entities.
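A minimal sketch of the bidirectional GRU of step 4; random weights stand in for learned parameters, and the hidden size is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU pass over a sequence: update gate z, reset gate r, candidate state.
def gru_pass(xs, wz, wr, wh, hidden):
    h = np.zeros(hidden)
    states = []
    for x in xs:
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ wz)                              # update gate
        r = sigmoid(xh @ wr)                              # reset gate
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ wh)
        h = (1 - z) * h + z * h_tilde
        states.append(h)
    return np.stack(states)

# Bidirectional wrapper: one GRU reads left-to-right, another right-to-left,
# and the hidden states at each step are concatenated, as described above.
def bigru(xs, hidden=8, seed=0):
    rng = np.random.default_rng(seed)
    d = xs.shape[1] + hidden
    def w():
        return rng.normal(scale=0.1, size=(d, hidden))
    fwd = gru_pass(xs, w(), w(), w(), hidden)
    bwd = gru_pass(xs[::-1], w(), w(), w(), hidden)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)            # (seq_len, 2*hidden)

xs = np.random.default_rng(1).normal(size=(5, 4))
print(bigru(xs).shape)   # → (5, 16)
```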
Step 5 uses the CRF model for label prediction in the output layer. Since the CRF is a discriminative probabilistic model, it can score different tag sequences based on the conditional probability of the output sequence given the input sequence. In the named entity recognition task, the CRF can consider the relationship between each word's tag and the tags of surrounding words, thereby handling context information better and improving the model's accuracy and robustness.
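The CRF decoding of step 5 can be illustrated with Viterbi search over toy scores; the tag set, emission values and transition values below are invented for illustration (in the real model both score tables are learned):

```python
import numpy as np

# Viterbi decoding: find the tag sequence maximizing emission + transition
# scores, so implausible transitions such as O -> I-PER can be penalised away.
def viterbi(emissions, transitions):
    n_tokens, n_tags = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n_tokens, n_tags), dtype=int)
    for t in range(1, n_tokens):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(n_tokens - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

# Toy 3-tag example: O, B-PER, I-PER; forbid jumping straight from O to I-PER.
trans = np.array([[0.0, 0.0, -10.0],   # from O
                  [0.0, 0.0,   1.0],   # from B-PER
                  [0.0, 0.0,   1.0]])  # from I-PER
emit = np.array([[2.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0],
                 [0.0, 0.0, 2.0]])
print(viterbi(emit, trans))   # → [0, 1, 2]
```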
Step 5.1: since the data set is divided by adopting five-fold cross validation, it is necessary to wait until all subsets are used as one validation set, and the average value of 5 validation results can be calculated as the performance index of the final model. Therefore, the training and testing can be performed by utilizing all data sets as much as possible, and the problem of over-fitting or under-fitting of the model caused by insufficient sample data is avoided.
Step 5.2: to verify whether the named entity recognition algorithm based on ERNIE 3.0_att_idcnn_biglu_crf is truly valid, we selected some other algorithms currently in the mainstream based on pre-training models as comparisons, and the comparison results are shown in table 1.
Table 1. Test performance indices of each comparison model
As can be seen from Table 1, the ERNIE3.0_Att_IDCNN_BiGRU_CRF named entity recognition algorithm obtains the highest F1 value among the compared models, so the model is superior for Chinese named entity recognition. At the same time, the model is 1.5% higher in F1 value than ERNIE_IDCNN_BiGRU_Att_CRF. This is probably because ERNIE3.0 was pre-trained on a broader corpus and has a deeper hierarchy than ERNIE, including 24-layer and 4-layer Transformer encoders and decoders, so it can learn and represent text data better. Meanwhile, compared with a single attention mechanism, the multi-head attention mechanism offers improved robustness, diverse representations, parallel computation and more comprehensive information, so it processes the information in an input sequence better and improves model performance.
The foregoing is only a preferred embodiment of the invention, but the scope of the invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme and inventive concept of the invention, within the scope disclosed by the invention, shall be covered by the scope of protection of the invention.
Claims (6)
1. A named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF, comprising the steps of:
step 1: the ERNIE3.0 model, a large-language pre-training model released by Baidu, is used as the semantic characterization model;
step 2: the semantic word vectors characterized by ERNIE3.0 in the previous step are passed through an Att (attention mechanism) to strengthen the sequences before and after each entity;
step 3: the output of step 2 is embedded into an IDCNN (iterated dilated convolutional neural network) to obtain the local features of entity sequences in a sentence;
step 4: the output of the IDCNN is connected to a BiGRU (bidirectional gated recurrent unit);
step 5: finally, a classification layer and a CRF (conditional random field) are added to obtain the final result.
2. The named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF according to claim 1, wherein the ERNIE3.0 pre-training model used in step 1 differs from the BERT pre-training model in its masking strategy: the former masks knowledge units carrying grammatical information, such as words and phrases, and then characterizes semantics through context prediction, whereas the latter masks only single characters, so that ERNIE captures more grammatical characteristics and is better suited to Chinese-language tasks.
3. The named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF according to claim 1, wherein in step 2 a multi-head attention mechanism is used to semantically strengthen the characterization of the entity context, so that the functional relationships within it are learned more easily.
4. The named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF according to claim 1, wherein in step 3 the convolution kernels of a CNN are dilated to obtain the IDCNN, so that while retaining the speed advantage of the CNN, the original CNN's unsuitability for front-to-back sequences is remedied, and the result is used to explore the local features of entities in a sentence so that entity features are expressed more accurately.
5. The named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF according to claim 1, wherein step 4 further uses a BiGRU to extract global features, which is simpler and more efficient than algorithms that explore long-sequence problems with a BiLSTM (bidirectional long short-term memory network) or the like.
6. The named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF according to claim 1, wherein in step 5 the CRF is finally used to make a reasonable judgment of each entity's label, giving the algorithm higher accuracy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311539422.1A CN117610567A (en) | 2023-11-17 | 2023-11-17 | Named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311539422.1A CN117610567A (en) | 2023-11-17 | 2023-11-17 | Named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117610567A true CN117610567A (en) | 2024-02-27 |
Family
ID=89952704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311539422.1A Pending CN117610567A (en) | 2023-11-17 | 2023-11-17 | Named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117610567A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118133830A (en) * | 2024-04-30 | 2024-06-04 | 北京壹永科技有限公司 | Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118133830A (en) * | 2024-04-30 | 2024-06-04 | 北京壹永科技有限公司 | Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||