CN113361277A - Medical named entity recognition modeling method based on attention mechanism - Google Patents

Info

Publication number
CN113361277A
CN113361277A (application CN202110667423.9A)
Authority
CN
China
Prior art keywords: vector, sentence, medical, sequence, bgru
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110667423.9A
Other languages
Chinese (zh)
Inventors: 李天瑞 (Li Tianrui), 张世豪 (Zhang Shihao), 贾真 (Jia Zhen), 杜圣东 (Du Shengdong), 滕飞 (Teng Fei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202110667423.9A priority Critical patent/CN113361277A/en
Publication of CN113361277A publication Critical patent/CN113361277A/en
Pending legal-status Critical Current

Classifications

    • G Physics
    • G06 Computing; calculating or counting
    • G06F Electric digital data processing
    • G06F40/00 Handling natural language data
    • G06F40/295 Named entity recognition
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N Computing arrangements based on specific computational models
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/08 Learning methods


Abstract

The invention discloses a medical named entity recognition modeling method based on an attention mechanism. First, each word of an input medical text sentence is converted into a word vector using a vector representation technique; a BGRU then captures rich context information in the sentence; an attention mechanism weighs the importance of the contextual semantic information; finally, a CRF finds the globally optimal medical entity label sequence, completing recognition of the medical named entities. The invention builds an attention-based medical named entity recognition model on the "RNN + CRF" network framework. The RNN part uses a BGRU network which, compared with the commonly used BLSTM, has a simpler structure, trains faster, and performs better. An attention mechanism is introduced on top of the "RNN + CRF" framework to weigh the importance of the context information, thereby improving entity recognition.

Description

Medical named entity recognition modeling method based on attention mechanism
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a medical named entity recognition modeling method based on an attention mechanism.
Background
With the advance of medical informatization, the medical field has accumulated a huge amount of unstructured text data containing a great deal of valuable information. Extracting effective information from such medical texts, then storing and managing it to construct a large-scale, high-quality medical knowledge graph, is of great significance to the development of medical informatization and is a research hotspot in natural language processing. Named entity recognition is one of the core tasks of structured information extraction from medical text; its goal is to identify entities with specific meanings in unstructured text.
Traditional named entity recognition methods are mainly rule-based, dictionary-based, or machine-learning-based. Rule- and dictionary-based methods require domain experts to hand-craft rule templates, or recognize entities through string matching against a domain dictionary. Machine-learning-based methods treat named entity recognition as a sequence labeling problem and recognize entities with task-specific feature engineering and a suitable model; commonly used models include maximum entropy, support vector machines (SVM), and conditional random fields (CRF). Although these methods achieve reasonable results, they depend on medical-domain experts to build rules or dictionaries, or on hand-designed features to train the models, which not only costs considerable time and effort but also caps the recognition quality at that of the hand-crafted rules, dictionaries, or features. In recent years, with the development of deep learning, neural-network-based methods have been applied to entity recognition tasks and have produced many research results. These methods do not rely on hand-designed features; all relevant features are learned automatically by the neural network.
At present, the "RNN + CRF" network framework, which combines a recurrent neural network with a conditional random field, is the mainstream model for named entity recognition. Owing to the particularity of the medical field, entities in medical text are highly specialized and abound in abbreviations, so the desired medical entities can be extracted accurately and effectively only by relying on context information with high relevance and strong dependency. However, a plain "RNN + CRF" framework can only learn a sentence's context information through the RNN; it cannot weigh the importance of that context information.
Disclosure of Invention
The invention provides a medical named entity recognition modeling method based on an attention mechanism, addressing the poor entity recognition caused by the strong medical specialization, abundant abbreviations, and similar characteristics of entities in medical text. First, each word of an input medical text sentence is converted into a word vector using a vector representation technique; a BGRU then captures rich context information in the sentence; an attention mechanism weighs the importance of the contextual semantic information; finally, a CRF finds the globally optimal medical entity label sequence, completing recognition of the medical named entities.
The medical named entity recognition modeling method based on the attention mechanism comprises the following steps:
step 1: vectorizing the medical text statement sequence X to obtain an input feature vector W, which specifically comprises the following steps:
the sentence sequence X with the length of n is equal to (X)1,x2,...,xn) Word x iniInto a low-dimensional dense real-valued vector wiWord vectors of words are embedded by words in a matrix WcharIs represented by a vector code ofcharIs | V | × d, where | V | is a fixed-size input word table and d is the dimension of the word vector; wherein i belongs to [1, 2];
Representing an input feature vector of a medical text sentence as W ═ W (W ═ W1,w2,...,wn);
Step 2: learning the context information of the medical text sentence from the input feature vector W by using a bidirectional gate control loop unit network BGRU to obtain a sentence vector H, which specifically comprises the following steps:
From the input feature vector W, the BGRU obtains the hidden-layer state outputs for the preceding and following context of the medical text sentence through a forward GRU network and a backward GRU network, respectively:
h_t^→ = GRU(w_t, h_{t-1}^→);
h_t^← = GRU(w_t, h_{t+1}^←);
where h_t^→ and h_t^← denote the hidden-layer state outputs of the forward and backward GRU networks at time t, and t ∈ {1, 2, ..., n};
The BGRU concatenates the hidden-layer state outputs of the forward and backward GRU networks into the sentence vector H = (h_1, h_2, ..., h_n), where the hidden-layer state output of the BGRU at time t is:
h_t = [h_t^→; h_t^←];
Step 3: select the importance of the context information in the sentence vector H with an attention mechanism to obtain the sentence feature vector M, specifically:
Perform the attention weight calculation on the sentence vector H to obtain the attention weight vector a:
a = softmax(w_a·tanh(H));
where w_a is the weight vector to be learned and tanh(·) is the hyperbolic tangent function;
and the sentence vector H carries out weighted summation according to the attention weight vector a to obtain a feature vector M of the sentence:
M=aH;
Step 4: decode the feature vector M with a conditional random field (CRF) to obtain the final output sequence Y* of the input sentence X, specifically:
For the obtained sentence feature vector M = (m_1, m_2, ..., m_n), calculate the conditional probability of each possible output label sequence Y:
P(Y|M) = CRF(M, Y);
where Y ∈ Y_X, and Y_X denotes all possible output label sequences of the input sequence X;
Finally, take the output label sequence Y* with the maximum conditional probability as the final output sequence of the input sentence X:
Y* = argmax_{Y∈Y_X} P(Y|M).
the invention constructs a medical named entity recognition model based on an attention mechanism, and adopts a network framework of RNN + CRF. Wherein, the RNN part uses BGRU network, compares with BLSTM commonly used, and its structure is simpler, and training speed is faster, and the effect is also better. An attention mechanism is introduced on the basis of the network framework of the RNN + CRF to select the importance degree of the context information, so that the entity identification effect is improved.
Drawings
FIG. 1 is a structural diagram of a medical named entity recognition model based on an attention mechanism.
Detailed Description
The specific implementation steps are as follows:
step 1: vectorizing the medical text statement by using a vector representation technology to obtain an input feature vector:
the medical text sentence sequence X with the length of n is equal to (X)1,x2,...,xn) Word x iniInto a low-dimensional dense real-valued vector wiWord vectors of words are embedded by words in a matrix WcharIs represented by a vector code ofcharIs | V | × d, where | V | is a fixed-size input word table and d is the dimension of the word vector; wherein i belongs to [1, 2];
Thus, the input feature vector of the medical text sentence may be expressed as W ═ (W ═ W1w2,...,wn)。
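Step 1 amounts to a table lookup into the embedding matrix W_char. A minimal pure-Python sketch of that lookup follows; the vocabulary, the dimension d = 4, and the random initialization are all illustrative stand-ins, not values from the patent:

```python
import random

# Toy sketch of step 1: map each character x_i of a sentence to a
# d-dimensional vector w_i via an embedding matrix W_char of shape |V| x d.
random.seed(0)
vocab = {"<unk>": 0, "糖": 1, "尿": 2, "病": 3}   # hypothetical character table
d = 4
W_char = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in vocab]

def embed(sentence):
    """Return W = (w_1, ..., w_n) for a character sequence X."""
    return [W_char[vocab.get(ch, vocab["<unk>"])] for ch in sentence]

W = embed("糖尿病")  # n = 3 characters -> 3 vectors of dimension d = 4
```

In a trained model these rows would be learned (or pretrained) rather than random; the lookup itself is unchanged.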
Step 2: and learning the context information of the medical text sentence from the input feature vector by using a bidirectional gating circulation unit network BGRU to obtain a sentence vector.
For named entity recognition, this sequence tagging problem is suitably learned using LSTM to solve the problem of dependency on sequence data. GRU as LSTM variant can learn sequence data dependence well, solve RNN gradient disappearance problem, and has simpler structure, faster training speed and better effect than LSTM. Therefore, here, the input sequence data is processed using the GRU.
The BGRU learns the context information of a text sentence by combining a forward GRU network and a backward GRU network. Each GRU controls its information flow through an update gate z and a reset gate r, thereby updating, filtering, and retaining historical information. The information flow of the forward GRU network comprises the input w_t at the current time t and the hidden-layer state output h_{t-1} of the GRU at the previous time.
The update gate z_t and reset gate r_t at time t are computed as:
z_t = σ(W_wz·w_t + W_hz·h_{t-1} + b_z);
r_t = σ(W_wr·w_t + W_hr·h_{t-1} + b_r);
where σ(·) denotes the sigmoid function, W_wz and W_hz are the weight matrices to be learned in the update gate, b_z is the bias vector of the update gate, W_wr and W_hr are the weight matrices to be learned in the reset gate, and b_r is the bias vector of the reset gate;
Then the reset gate r_t is used to obtain the candidate information h̃_t of the GRU hidden layer at the current time t:
h̃_t = tanh(W_wh·w_t + W_hh·(r_t ⊙ h_{t-1}) + b_h);
where tanh(·) denotes the hyperbolic tangent function, ⊙ denotes the Hadamard product, W_wh and W_hh are the weight matrices to be learned in the candidate information of the hidden layer at the current time, and b_h is the corresponding bias vector;
Finally, the update gate z_t is combined via Hadamard products with the hidden-layer state output of the GRU at the previous time and the candidate information at the current time to obtain the hidden-layer state output h_t of the GRU at the current time:
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t.
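The gate equations above can be sketched directly in pure Python. This is a toy 1-D version (scalar state and input, hand-picked weights, standard-GRU update convention), not the patent's trained network:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(w_t, h_prev, p):
    """One GRU step for a 1-D input/state; p holds the (toy) parameters."""
    z = sigmoid(p["Wwz"] * w_t + p["Whz"] * h_prev + p["bz"])    # update gate z_t
    r = sigmoid(p["Wwr"] * w_t + p["Whr"] * h_prev + p["br"])    # reset gate r_t
    h_cand = math.tanh(p["Wwh"] * w_t + p["Whh"] * (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand                        # h_t

# Illustrative parameters only; a real model learns matrices, not scalars.
params = {"Wwz": 0.5, "Whz": 0.3, "bz": 0.0,
          "Wwr": 0.4, "Whr": 0.2, "br": 0.0,
          "Wwh": 0.6, "Whh": 0.5, "bh": 0.0}

h = 0.0
for w_t in [1.0, -0.5, 0.2]:   # toy input sequence w_1..w_3
    h = gru_step(w_t, h, params)
```

Because h_t is a convex combination of h_{t-1} and tanh(·), the hidden state always stays in (-1, 1).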
The forward GRU network learns the preceding context of the medical text sentence, and the backward GRU network learns the following context; the information flow of the backward GRU network comprises the input w_t at the current time t and the hidden-layer state output h_{t+1} of the GRU at the next time, and is computed in the same way as the forward GRU network;
The BGRU concatenates the hidden-layer state outputs of the forward and backward GRU networks into the sentence vector H = (h_1, h_2, ..., h_n), where at time t the hidden-layer output of the BGRU is:
h_t = [h_t^→; h_t^←];
where h_t^→ and h_t^← denote the hidden-layer state outputs of the forward and backward GRU networks at time t, respectively.
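The bidirectional pass reduces to running the same recurrence in both directions and pairing the states. A minimal sketch, where the recurrence is a stand-in scalar function rather than a real GRU and all values are illustrative:

```python
def toy_step(w, h):
    # Stand-in scalar recurrence (not a trained GRU); shows data flow only.
    return 0.5 * w + 0.5 * h

def run_gru(seq, step):
    """Run a recurrence left-to-right, returning the state at each position."""
    h, out = 0.0, []
    for w in seq:
        h = step(w, h)
        out.append(h)
    return out

seq = [1.0, -1.0, 0.5]                      # toy input w_1..w_n
fwd = run_gru(seq, toy_step)                # forward states h_1^> ... h_n^>
bwd = run_gru(seq[::-1], toy_step)[::-1]    # backward states h_1^< ... h_n^<
H = [(f, b) for f, b in zip(fwd, bwd)]      # h_t = [h_t^> ; h_t^<]
```

Each H[t] holds both directions' states for position t, mirroring the concatenation h_t = [h_t^→; h_t^←].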
Step 3: select the importance of the context information in the sentence vector with an attention mechanism to obtain the sentence feature vector.
The BGRU can learn the context-dependency information of a medical text sentence fairly comprehensively, enabling effective recognition of the current character. However, not every piece of context information is equally important for identifying the current character. Applying an attention mechanism after the BGRU therefore strengthens attention to context with higher relevance and stronger dependency on the current character, and weakens attention to context with lower relevance and weaker dependency, improving entity recognition.
Specifically, the attention weight calculation is performed on the sentence vector H output by the BGRU network in step 2, resulting in an attention weight vector a:
a = softmax(w_a·tanh(H)),
where w_a is the weight vector to be learned and tanh(·) is the hyperbolic tangent function;
and (3) carrying out weighted summation on the sentence vector H output by the BGRU network according to the attention weight vector a to obtain a feature vector M of the sentence:
M=aH。
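The attention step (score each position with w_a·tanh(h_t), softmax into weights a, then take the weighted sum) can be sketched in a toy 1-D form where each h_t and w_a are scalars; all values are illustrative:

```python
import math

def attention(H, w_a):
    """a = softmax(w_a * tanh(h_t)) over positions t; M = sum_t a_t * h_t."""
    scores = [w_a * math.tanh(h) for h in H]
    mx = max(scores)                          # subtract max for stability
    exps = [math.exp(s - mx) for s in scores]
    Z = sum(exps)
    a = [e / Z for e in exps]                 # attention weight vector a
    M = sum(ai * hi for ai, hi in zip(a, H))  # weighted sentence feature M
    return a, M

a, M = attention([0.5, -0.25, 0.125], w_a=1.0)
```

Since tanh is monotonic, the position with the largest h_t receives the largest weight here; in the real model w_a is learned and h_t is a vector.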
Step 4: jointly decode the predicted labels with the CRF to obtain the final output sequence of the input sentence.
Named entity recognition is a sequence tagging problem in which the labels of a predicted sequence are not independent: each label must be predicted jointly with its neighbors to be accurate. For example, for an entity composed of multiple characters, the label of every character must agree with the entity's category; predicting each character's label in isolation loses this constraint and can cause errors. Although the BGRU learns the context of the current character well, its label predictions remain independent, which produces a label bias problem. Therefore a CRF is placed after the attention-augmented BGRU to decode the label sequence jointly, so that the label at the current position is predicted in light of the labels at neighboring positions and a globally optimal label sequence is obtained.
Specifically, for the sentence feature vector M = (m_1, m_2, ..., m_n) obtained in step 3, the conditional probability of a possible output label sequence Y is calculated as:
P(Y|M) = exp(S(M, Y)) / Σ_{Y′∈Y_X} exp(S(M, Y′));
S(M, Y) = Σ_{i,k} λ_k·t_k(y_{i-1}, y_i, m, i) + Σ_{i,l} μ_l·s_l(y_i, m, i);
where t_k and s_l are feature functions: t_k is a transition feature function extracting features of the state sequence, in which the state y_i at the current position depends on the state y_{i-1} at the previous position; s_l is a state feature function extracting features of the observation sequence, in which the state y_i at the current position depends on the observation m_i at the current position. A feature function takes only the value 0 or 1: 1 when its feature is satisfied and 0 otherwise. λ_k and μ_l are the weights of the two kinds of feature functions, measuring the importance of the current feature. Y_X denotes all possible output label sequences of the input sequence X;
Finally, the output label sequence Y* with the maximum conditional probability is taken as the final output sequence of the input sentence X:
Y* = argmax_{Y∈Y_X} P(Y|M).
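The argmax over all label sequences is typically computed with the Viterbi algorithm over per-position emission scores plus label-transition scores. A toy sketch over BIO labels follows; the scores here are hand-picked for illustration, not the learned λ_k, μ_l weights:

```python
def viterbi(emit, trans, labels):
    """Return the highest-scoring label path (dynamic programming)."""
    n = len(emit)
    dp = [{y: emit[0][y] for y in labels}]   # best score ending in y at pos 0
    back = []                                # backpointers for path recovery
    for i in range(1, n):
        dp.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: dp[i-1][yp] + trans[(yp, y)])
            back[-1][y] = best_prev
            dp[i][y] = dp[i-1][best_prev] + trans[(best_prev, y)] + emit[i][y]
    y_last = max(labels, key=lambda y: dp[-1][y])
    path = [y_last]
    for b in reversed(back):
        path.append(b[path[-1]])
    return path[::-1]

labels = ["B", "I", "O"]
# Transitions reward B->I and I->I, penalize any other move into I (toy values).
trans = {(a, b): (1.0 if (a, b) in {("B", "I"), ("I", "I")} else
                  -2.0 if b == "I" else 0.0) for a in labels for b in labels}
emit = [{"B": 2.0, "I": 0.0, "O": 0.5},      # toy emission scores per position
        {"B": 0.0, "I": 1.0, "O": 0.8},
        {"B": 0.1, "I": 0.2, "O": 1.5}]
Y_star = viterbi(emit, trans, labels)         # globally best label sequence
```

The transition term is what lets the decoder forbid inconsistent paths such as O followed by I, which independent per-position argmax cannot do.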
To verify its effectiveness, the invention was compared against 3 medical-domain named entity recognition models on the dataset of CHIP 2020 evaluation task 1: Chinese medical text named entity recognition. The 3 comparison models are:
(1) CRF: the input medical text is converted into word vectors, and entity recognition is performed with a CRF model.
(2) BLSTM-CRF: context-dependency information is learned from the input feature vectors with a BLSTM, and entities are then recognized by CRF joint decoding.
(3) BLSTM-att-CRF: an attention mechanism is introduced on top of BLSTM-CRF, placed after the BLSTM to select the importance of the context information.
The evaluation metrics are precision (P), recall (R), and F1-score (F1). Precision is the proportion of predicted entities of a given type that are correct; recall is the proportion of true entities of that type that are correctly predicted; F1 is the harmonic mean of precision and recall, balancing these two often-conflicting metrics and giving a more comprehensive overall evaluation of model performance.
Let J = (j_1, j_2, ..., j_n) be the set of true medical entity labels, and let K = (k_1, k_2, ..., k_m) be the set of medical entity labels predicted by the medical named entity recognition model of the invention. Each element of either set represents one medical entity and comprises four items: sentence number, entity start position, entity end position, and entity type. The sentence number is the index within the dataset of the sentence containing the entity; the entity start position is the position in the sentence of the entity's first character; the entity end position is the position of the entity's last character; and the entity type is the category of the medical entity. For any two elements j_i and k_l of the two sets, the two elements are equal if and only if their sentence number, entity start position, entity end position, and entity type are all the same.
Based on this, the accuracy, recall and F1 values were calculated as follows:
P = |J ∩ K| / |K|;
R = |J ∩ K| / |J|;
F1 = 2PR / (P + R);
where ∩ denotes set intersection, i.e., the elements shared by both sets;
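This entity-level evaluation is a direct set computation. A sketch with hypothetical gold and predicted entity tuples (sentence number, start, end, type); the data and type names are invented for illustration:

```python
# Gold entities J and predicted entities K as (sentence, start, end, type);
# an entity counts as correct only if all four fields match exactly.
J = {(0, 3, 5, "disease"), (0, 9, 11, "drug"), (1, 0, 2, "symptom")}
K = {(0, 3, 5, "disease"), (1, 0, 2, "symptom"), (1, 4, 6, "drug")}

tp = len(J & K)            # |J intersect K|: exactly matched entities
P = tp / len(K)            # precision
R = tp / len(J)            # recall
F1 = 2 * P * R / (P + R)   # harmonic mean of P and R
```

With two of three predictions matching, P = R = F1 = 2/3 here; representing entities as hashable tuples makes the intersection trivial.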
the results of the experiment are shown in table 1:
TABLE 1 Experimental results
[Table 1, giving the precision, recall, and F1 of each model, appears as an image in the original publication.]
The experimental results show that the proposed BGRU-att-CRF model outperforms all comparison models, achieving the best precision, recall, and F1 values.

Claims (1)

1. The medical named entity recognition modeling method based on the attention mechanism is characterized by comprising the following steps:
step 1: vectorizing the medical text statement sequence X to obtain an input feature vector W, which specifically comprises the following steps:
the sentence sequence X with the length of n is equal to (X)1,x2,...,xn) Word x iniInto a low-dimensional dense real-valued vector wiWord vectors of words are embedded by words in a matrix WcharIs represented by a vector code ofcharIs | V | × d, where | V | is a fixed-size input word table and d is the dimension of the word vector; wherein i belongs to [1, 2];
Representing an input feature vector of a medical text sentence as W ═ W (W ═ W1,w2,...,wn);
Step 2: learning the context information of the medical text sentence from the input feature vector W by using a bidirectional gate control loop unit network BGRU to obtain a sentence vector H, which specifically comprises the following steps:
the BGRU obtains the state output of the hidden layer from the upper information and the lower information of the medical text sentence from the input feature vector W by a forward GRU network and a backward GRU network respectively
Figure FDA0003117428910000011
And
Figure FDA0003117428910000012
Figure FDA0003117428910000013
Figure FDA0003117428910000014
wherein
Figure FDA0003117428910000015
And
Figure FDA0003117428910000016
respectively representing hidden layer state output of a forward GRU network and a backward GRU network at the time t, wherein t belongs to [1, 2];
BGRU splices hidden layer state outputs of forward GRU network and backward GRU network to obtain sentence vector H ═ (H)1,h2,...,hn) Wherein the hidden layer state output of BGRU at time t is:
Figure FDA0003117428910000017
Step 3: select the importance of the context information in the sentence vector H with an attention mechanism to obtain the sentence feature vector M, specifically:
perform the attention weight calculation on the sentence vector H to obtain the attention weight vector a:
a = softmax(w_a·tanh(H));
where w_a is the weight vector to be learned and tanh(·) is the hyperbolic tangent function;
and the sentence vector H carries out weighted summation according to the attention weight vector a to obtain a feature vector M of the sentence:
M=aH;
Step 4: decode the feature vector M with a conditional random field (CRF) to obtain the final output sequence Y* of the input sentence X, specifically:
for the obtained sentence feature vector M = (m_1, m_2, ..., m_n), calculate the conditional probability of each possible output label sequence Y:
P(Y|M) = CRF(M, Y);
where Y ∈ Y_X, and Y_X denotes all possible output label sequences of the input sequence X;
finally, take the output label sequence Y* with the maximum conditional probability as the final output sequence of the input sentence X:
Y* = argmax_{Y∈Y_X} P(Y|M).
CN202110667423.9A 2021-06-16 2021-06-16 Medical named entity recognition modeling method based on attention mechanism Pending CN113361277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110667423.9A CN113361277A (en) 2021-06-16 2021-06-16 Medical named entity recognition modeling method based on attention mechanism


Publications (1)

Publication Number Publication Date
CN113361277A true CN113361277A (en) 2021-09-07

Family

ID=77534506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110667423.9A Pending CN113361277A (en) 2021-06-16 2021-06-16 Medical named entity recognition modeling method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113361277A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115630649A (en) * 2022-11-23 2023-01-20 南京邮电大学 Medical Chinese named entity recognition method based on generative model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284361A (en) * 2018-09-29 2019-01-29 深圳追科技有限公司 A kind of entity abstracting method and system based on deep learning
CN112733541A (en) * 2021-01-06 2021-04-30 重庆邮电大学 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210907)