CN113536799B - Medical named entity recognition modeling method based on fusion attention - Google Patents

Medical named entity recognition modeling method based on fusion attention

Info

Publication number
CN113536799B
CN113536799B — Application CN202110927320.1A
Authority
CN
China
Prior art keywords: character, word, medical, LSTM, model
Prior art date
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number
CN202110927320.1A
Other languages
Chinese (zh)
Other versions
CN113536799A (en)
Inventor
李天瑞
邬萌
贾真
杜圣东
滕飞
Current Assignee
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date
Filing date
Publication date
Application filed by Southwest Jiaotong University
Priority to CN202110927320.1A
Publication of CN113536799A
Application granted
Publication of CN113536799B
Legal status: Active

Links

Images

Classifications

    • G06F 40/295 — Handling natural language data; natural language analysis; recognition of textual entities; named entity recognition
    • G06F 40/242 — Handling natural language data; natural language analysis; lexical tools; dictionaries
    • G06N 3/044 — Neural networks; architecture; recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Neural networks; architecture; combinations of networks
    • G06N 3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • Y02D 10/00 — Climate change mitigation in information and communication technologies; energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The fusion-attention-based medical named entity recognition modeling method comprises the following steps: performing Chinese word segmentation and indexing on the medical text sentence; obtaining a Bi-LSTM model by splicing a forward LSTM and a reverse LSTM; updating the output feature vectors through an attention mechanism; and decoding the output feature vectors with a conditional random field (CRF) to obtain the medical entity type labels of the input medical text sentence. On the basis of the character sequence, the invention adds the words matched in a dictionary and, through the dynamic control of gate structures, provides more guidance to the model, which selects the most relevant characters and words from the medical corpus. Compared with character-based methods, the model explicitly exploits multi-granularity information to obtain better recognition performance. An attention mechanism is further introduced so that the model focuses on informative content, overcoming the shortcoming of the traditional Bi-LSTM-CRF model, which considers context information but ignores the fact that different characters and words carry different importance within a sentence.

Description

Medical named entity recognition modeling method based on fusion attention
Technical Field
The invention relates to the technical field of natural language processing, in particular to a medical named entity recognition modeling method based on fusion attention.
Background
The continuous advance of medical informatization has produced explosive growth of data in the medical field, in increasingly diverse forms that contain a great deal of useful information. Medical data are widely available, coming mainly from medical websites, hospital electronic medical records, medical books, and the like. A medical knowledge graph standardizes, according to a complete set of criteria, the whole refinement process from acquiring knowledge to updating it, so that massive medical knowledge can be managed and applied efficiently. Accurately identifying medical named entities from massive medical data lays an important foundation for constructing such a knowledge graph. Medical named entity recognition extracts medical entities of specific predefined types, for example the entity "blood routine" of the examination-item type and the entity "bronchitis" of the disease type; this task is also referred to as medical entity extraction.
Named entity recognition is a basic task of knowledge extraction and is generally viewed as a sequence labeling task that predicts entity boundaries and entity class labels. Current named entity recognition methods are mainly based on deep learning. Neural networks were first applied to named entity recognition using a unidirectional Long Short-Term Memory network (LSTM). Excellent recognition has been achieved by combining a Convolutional Neural Network (CNN) with a Conditional Random Field (CRF), and the performance of the CNN-CRF model can be further enhanced with a character-based CNN. In Chinese named entity recognition, the LSTM-CRF architecture is currently mainstream and has been adopted by a large body of research. Hand-crafted character features, character-based CNNs, character-based LSTMs, and the like are used to represent characters.
Character-level sequence labeling is the primary method for Chinese named entity recognition. Comparisons of word-based and character-based statistical methods on this task indicate that the character-based method is relatively superior. How to better utilize word information in Chinese named entity recognition has attracted much research interest, for example using word segmentation information as features for named entity recognition, combining word segmentation and named entity recognition through dual decomposition, and multi-task learning.
External information resources, especially dictionaries, are widely used to assist named entity recognition, for example multi-task learning on large raw text with word-level language models to enhance the training of named entity recognition, enhancing word representations by pre-training character-level language models, and mining cross-domain and cross-language knowledge through multi-task learning.
Disclosure of Invention
The invention aims to provide a medical named entity recognition modeling method based on fusion attention.
The technical scheme for realizing the purpose of the invention is as follows:
the medical named entity recognition modeling method based on the fusion attention comprises the following steps:
step 1: performing Chinese word segmentation and indexing on the medical text sentence s: matching the medical text sentence s with a dictionary to obtain a word sequence $w_1, w_2, \ldots, w_n$, wherein $w_i$ is the i-th word in the word sequence, $i = 1, 2, \ldots, n$; the k-th character of the i-th word is indexed by t(i, k), wherein k is the position of the character within the word;

the medical text sentence $s = c_1, c_2, \ldots, c_m$, wherein $c_j$ is the j-th character of the medical text sentence s, $j = 1, 2, \ldots, m$;

the word whose index starts at b and ends at e is denoted $w^d_{b,e}$, wherein b represents the index of the word's first character and e represents the index of the word's last character;
step 2: obtaining a Bi-LSTM model by splicing the forward LSTM and the reverse LSTM;

the forward LSTM is:

$$\begin{bmatrix} i^c_j \\ f^c_j \\ o^c_j \\ \tilde{c}^c_j \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{c\,T} \begin{bmatrix} x^c_j \\ h^c_{j-1} \end{bmatrix} + b^c \right)$$

$$c^c_j = f^c_j \odot c^c_{j-1} + i^c_j \odot \tilde{c}^c_j$$

$$h^c_j = o^c_j \odot \tanh(c^c_j)$$

wherein $i^c_j$, $f^c_j$ and $o^c_j$ are the input gate, the forget gate and the output gate respectively; $\tilde{c}^c_j$ is the new candidate cell information; $W^{cT}$ and $b^c$ are the model weight parameter and bias term to be learned; $\sigma$ is the Sigmoid function; $\odot$ is the Hadamard product; $x^c_j$ is the embedded representation of the character $c_j$, $x^c_j = e^c(c_j)$, wherein $e^c$ represents the character embedding lookup table; $h^c_j$ is the hidden state corresponding to the character $c_j$, and $c^c_j$ is the character cell state corresponding to $c_j$; $h^c_{j-1}$ and $c^c_{j-1}$ are the hidden state and character cell state corresponding to the previous character $c_{j-1}$;

the character cell state after word information is introduced is:

$$c^c_e = \sum_{b \in \{b'\,|\,w^d_{b',e} \in \mathbb{D}\}} \alpha^c_{b,e} \odot c^w_{b,e} \;+\; \alpha^c_e \odot \tilde{c}^c_e$$

$$\alpha^c_{b,e} = \frac{\exp(i^l_{b,e})}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}, \qquad \alpha^c_e = \frac{\exp(i^c_e)}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}$$

wherein $\mathbb{D}$ is the dictionary; $\alpha^c_{b,e}$ and $\alpha^c_e$ are obtained by normalizing $i^l_{b,e}$ and $i^c_e$; $i^l_{b,e}$ is the additional gate structure introduced to control the contribution of every word cell $c^w_{b,e}$ ending at the character with index e to the tail character cell $c^c_e$:

$$i^l_{b,e} = \sigma\!\left( W^{l\,T} \begin{bmatrix} x^c_e \\ c^w_{b,e} \end{bmatrix} + b^l \right)$$

wherein $W^{lT}$ and $b^l$ are the model weight parameter and bias term to be learned;

the word cell state $c^w_{b,e}$ is:

$$\begin{bmatrix} i^w_{b,e} \\ f^w_{b,e} \\ \tilde{c}^w_{b,e} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{w\,T} \begin{bmatrix} x^w_{b,e} \\ h^c_b \end{bmatrix} + b^w \right)$$

$$c^w_{b,e} = f^w_{b,e} \odot c^c_b + i^w_{b,e} \odot \tilde{c}^w_{b,e}$$

wherein $i^w_{b,e}$ is the input gate and $f^w_{b,e}$ is the forget gate; $\tilde{c}^w_{b,e}$ is the new candidate word cell information; $W^{wT}$ and $b^w$ are the model weight parameter and bias term to be learned; $h^c_b$ is the hidden state corresponding to the first character of the word; $x^w_{b,e}$ is the embedded representation of the word $w^d_{b,e}$, $x^w_{b,e} = e^w(w^d_{b,e})$, wherein $e^w$ represents the word embedding lookup table obtained from the dictionary of step 1;

the reverse LSTM is analogous to the forward LSTM;

applying the two directions to the medical text sentence s respectively yields $\overrightarrow{h^c_j}$ and $\overleftarrow{h^c_j}$; the two groups of vectors are spliced, and the final hidden vector $h^c_j$ corresponding to each character in s is computed as:

$$h^c_j = [\overrightarrow{h^c_j}\,;\,\overleftarrow{h^c_j}]$$
step 3: the output of step 2 is assigned a corresponding weight $\alpha_{tj}$ through the attention mechanism; the feature vectors $h_j$ are weighted and summed with their corresponding weights $\alpha_{tj}$ to obtain a new output vector $c_t$, specifically:

$$c_t = \sum_{j=1}^{m} \alpha_{tj} h_j$$

the weight $\alpha_{tj}$ corresponding to the feature vector $h_j$ is obtained as follows:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{m} \exp(e_{tk})}$$

$$e_{tj} = v_a^{T} \tanh(W_a s_{t-1} + U_a h_j)$$

wherein $e_{tj}$ measures the degree of matching between the j-th source-end character and the t-th target-end character; $s_{t-1}$ is the hidden layer state at the previous time step; $W_a$ and $U_a$ are weight matrices and $v_a$ is a weight vector;
step 4: the feature vectors $c = \{c_1, c_2, \ldots, c_m\}$ output in step 3 are decoded by the conditional random field CRF to obtain the medical entity type labels of the input medical text sentence s, specifically:

P(y|c) = CRF(c, y);

wherein y ranges over all possible output tag sequences of the input medical text sentence s, and P(y|c) is the conditional probability of the output tag sequence y;

during prediction, the highest-scoring tag sequence over the input sequence is found with the dynamic-programming Viterbi Algorithm, giving the medical entity type labels $y^*$ of the input medical text sentence s:

$$y^* = \arg\max_y P(y|c)$$
Compared with the prior art, the invention has the following beneficial effects:
1. Word information is used to improve the accuracy of medical named entity recognition. On the basis of the character sequence, the input of the invention adds the words matched in the dictionary and, through the dynamic control of gate structures, provides more guidance to the model, so that the most relevant characters and words are selected from the medical corpus. Compared with character-based methods, the model explicitly exploits multi-granularity information for better recognition performance.
2. An attention mechanism is introduced so that the model focuses on informative content, remedying the shortcoming of the traditional Bi-LSTM-CRF model, which considers context information but ignores the fact that different characters and words carry different importance within a sentence.
Drawings
Fig. 1 is a schematic diagram of a medical named entity recognition model architecture.
FIG. 2 is a schematic illustration of an Encoder-Decoder using an attention mechanism.
FIG. 3 is a schematic diagram of the attention mechanism in the modeling method.
Detailed Description
In order to use word information to improve the recognition accuracy of medical named entities, to avoid or reduce recognition errors on out-of-vocabulary words caused by the ambiguity inherent in medical terminology, and to remedy the shortcoming of the traditional Bi-LSTM-CRF model, which considers context information but ignores the differing importance of characters and words within a sentence, the invention provides Ng-BAC, a medical named entity recognition method based on new word discovery and a fused attention mechanism; the model architecture is shown in FIG. 1. The model first acquires new words from the medical corpus through an N-grams algorithm to construct a medicine-related external dictionary; potential word information is integrated into the character-based model as an extension, with gate structures dynamically routing information from different paths to each character; an attention mechanism assigns different weights to the Bi-LSTM layer output to improve its accuracy; and finally a conditional random field CRF performs sequence labeling.
The specific implementation steps are as follows:
step 1: extracting new words with a new word discovery algorithm based on the N-grams model; new words are extracted mainly according to two indicators, solidity and word frequency:
(1) Solidity: represents how tightly the parts of a word cohere, usually measured by mutual information;
(2) Word frequency: represents the number of times a word occurs in the corpus.
Unlike methods that only consider the solidity of adjacent characters, the method of the invention also considers the internal solidity of multi-character strings in the corpus: the corpus is segmented by N-grams and the internal solidity is computed. Taking a 3-character string as an example, the internal solidity is computed as:

$$\mathrm{solidity}(abc) = \min\left( \frac{p(abc)}{p(a)\,p(bc)},\ \frac{p(abc)}{p(ab)\,p(c)} \right)$$

wherein a, b and c are three adjacent characters in the string; p(a), p(b) and p(c) represent the frequencies of their individual occurrence; p(ab), p(bc) and p(abc) represent the frequencies of the composed substrings; the minimum over the possible splits is taken as the internal solidity of the whole string.
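For illustration, this computation can be sketched in a few lines of Python (the function name and the frequency-table interface are our own assumptions, not taken from the patent):

```python
def internal_solidity(s, freq, total):
    """Internal solidity of an n-gram s: the minimum over all binary splits
    of p(s) / (p(left) * p(right)), with p(x) = freq[x] / total.
    For "abc" the splits are (a, bc) and (ab, c), as in the formula above."""
    p = lambda x: freq.get(x, 0) / total
    if p(s) == 0.0:
        return 0.0
    ratios = [p(s) / (p(s[:i]) * p(s[i:]))
              for i in range(1, len(s))
              if p(s[:i]) > 0 and p(s[i:]) > 0]
    return min(ratios) if ratios else 0.0
```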
New word discovery extracts strings that satisfy both the solidity threshold and the word frequency requirement. Taking 4-grams as an example, the algorithm executes as follows (a code sketch follows the list):
step 1.1: setting n to 4 characters (i.e., 4-grams), segment the sentences, count the 2-grams, 3-grams and 4-grams, and compute their internal solidity;
step 1.2: set a different threshold for each n as a power of 5: if n equals 2 the threshold is 5, if n equals 3 the threshold is 25, and if n equals 4 the threshold is 125;
step 1.3: generate a set G that retains the fragments whose internal solidity exceeds the threshold;
step 1.4: coarsely segment the corpus according to the gram setting of step 1.1; if a fragment exists in the set G, it is not segmented further and its frequency is counted;
step 1.5: after step 1.4, a backtracking check is needed under the principle of "better to leave unsegmented than to segment wrongly": when a word has at most 4 characters, it is kept if it is in the set G and deleted otherwise; when a word has more than 4 characters, every 4-character fragment of it is checked against the set G, and the word is kept only if all fragments are present.
After the new words are extracted, they are compared against the built-in Jieba word segmentation dictionary, and the candidates remaining after this comparison are screened manually to determine the final dictionary.
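Steps 1.1–1.3 can be condensed into the following Python sketch, which reuses the hypothetical internal_solidity function above; the coarse segmentation and backtracking check of steps 1.4–1.5 are elided:

```python
from collections import Counter

def candidate_fragments(corpus, max_n=4):
    """Collect 2- to max_n-grams whose internal solidity exceeds 5**(n-1),
    i.e. 5 for 2-grams, 25 for 3-grams, 125 for 4-grams (steps 1.1-1.3)."""
    freq = Counter()
    for sent in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                freq[sent[i:i + n]] += 1
    total = sum(c for s, c in freq.items() if len(s) == 1)  # character count
    G = set()
    for gram in freq:
        n = len(gram)
        if 2 <= n <= max_n and internal_solidity(gram, freq, total) > 5 ** (n - 1):
            G.add(gram)
    return G, freq
```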
The invention introduces new word discovery on top of the use of word information to avoid or reduce errors caused by out-of-vocabulary words. Because medical terminology has a certain particularity, the recognition of unknown words in a domain-specific named entity recognition task can go wrong due to ambiguity. For example, in "bronchiectasis", "trachea" may be erroneously recognized as a common everyday object; such errors are particularly significant in the medical field. The invention obtains new words from the medical corpus through the N-grams algorithm to assist the word segmentation algorithm, and then builds a medicine-related dictionary with Word2Vec from the segmentation result.
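Building the word embeddings for the dictionary from the segmented corpus could then proceed roughly as follows (a sketch using the gensim library; the hyperparameters, sample tokens, and file name are illustrative assumptions, not from the patent):

```python
from gensim.models import Word2Vec

# Segmented corpus: token lists produced by Jieba augmented with the
# new-word dictionary (tiny illustrative sample).
segmented_corpus = [["支气管", "扩张"], ["血常规", "检查"]]

w2v = Word2Vec(sentences=segmented_corpus, vector_size=100,
               window=5, min_count=1, sg=1)  # sg=1: skip-gram
w2v.save("medical_word2vec.model")
```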
step 2: the invention takes Bi-LSTM as the baseline network structure and introduces word information on this basis to improve the accuracy of medical named entity recognition.
Unlike traditional character-based models, for the character sequence $c_1, c_2, \ldots, c_m$ the method of the invention additionally inputs the words matched, by all subsequences of the character sequence, in the dictionary determined by the N-grams-based new word discovery algorithm of step 1. Other dictionaries, such as the built-in Jieba segmentation dictionary, may also be used.
Formally, $s = c_1, c_2, \ldots, c_m$ represents the input sentence, and $c_j$ represents the j-th character of the input sequence.
For the method of the invention, the input sentence s can also be regarded as a word sequence $s = w_1, w_2, \ldots, w_n$, obtained by Chinese word segmentation of the input sentence, wherein the i-th word in the sequence is $w_i$.
The subsequence whose index starts at b and ends at e is denoted $w^d_{b,e}$. As shown in FIG. 1, in "bronchiectasis" (支气管扩张), $w^d_{1,3}$ denotes "bronchus" (支气管) and $w^d_{4,5}$ denotes "dilation" (扩张).
The specific character k of the specific word i is indexed by t(i, k). As shown in FIG. 1, if the sentence is segmented as "bronchus / dilation", then t(2, 1) = 4 (the first character of "dilation") and t(1, 3) = 3 (the third character of "bronchus").
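This character/word indexing can be made concrete with a short Python sketch (the helper name is ours, for illustration only):

```python
def build_index(words):
    """Map (i, k) -> global character position t(i, k), 1-based,
    for a sentence segmented into `words`."""
    t, pos = {}, 1
    for i, w in enumerate(words, start=1):
        for k in range(1, len(w) + 1):
            t[(i, k)] = pos
            pos += 1
    return t

t = build_index(["支气管", "扩张"])   # "bronchus" / "dilation"
assert t[(2, 1)] == 4 and t[(1, 3)] == 3
```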
step 3: compared with the traditional recurrent neural network, the LSTM model introduces a gating mechanism that adds a forgetting capability to the neurons, which can be used to avoid the vanishing gradient phenomenon.
Taking the forward LSTM as an example, the algorithm executes as follows.
The baseline model of the method is character-based; at the model input, the embedded representation of each character $c_j$ is computed as:

$$x^c_j = e^c(c_j)$$

wherein $e^c$ represents the character embedding lookup table.
The basic recurrent structure of the baseline model is built on the character cell state $c^c_j$ and the hidden state $h^c_j$ corresponding to each character $c_j$, wherein $c^c_j$ records the recurrent information flow from the beginning of the sentence up to the character $c_j$, and $h^c_j$ is used for sequence labeling at the CRF layer.
The main computation in the LSTM is as follows:

$$\begin{bmatrix} i^c_j \\ f^c_j \\ o^c_j \\ \tilde{c}^c_j \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{c\,T} \begin{bmatrix} x^c_j \\ h^c_{j-1} \end{bmatrix} + b^c \right)$$

$$c^c_j = f^c_j \odot c^c_{j-1} + i^c_j \odot \tilde{c}^c_j$$

$$h^c_j = o^c_j \odot \tanh(c^c_j)$$

wherein $i^c_j$, $f^c_j$ and $o^c_j$ represent the input gate, the forget gate and the output gate respectively; $W^{cT}$ and $b^c$ are model parameters; $\sigma$ represents the Sigmoid function; $\odot$ represents the Hadamard product.
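For concreteness, one step of this character-level recurrence might be sketched in NumPy as follows (the stacked-weight layout and dimension handling are our own assumptions, not specified by the patent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_char_step(x_j, h_prev, c_prev, W, b):
    """One forward-LSTM step for character j.
    W has shape (4*d, d_x + d); b has shape (4*d,); rows of W are
    stacked as [input gate; forget gate; output gate; candidate]."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([x_j, h_prev]) + b
    i = sigmoid(z[:d])            # input gate  i^c_j
    f = sigmoid(z[d:2*d])         # forget gate f^c_j
    o = sigmoid(z[2*d:3*d])       # output gate o^c_j
    c_tilde = np.tanh(z[3*d:])    # candidate   ~c^c_j
    c = f * c_prev + i * c_tilde  # character cell state c^c_j
    h = o * np.tanh(c)            # hidden state h^c_j
    return h, c
```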
Unlike the character-based model, the method of the invention matches the sentence against the dictionary to obtain subsequences $w^d_{b,e}$ and introduces the computation of word cell states $c^w_{b,e}$. At the model input, the embedded representation of each subsequence $w^d_{b,e}$ is computed as:

$$x^w_{b,e} = e^w(w^d_{b,e})$$

wherein $e^w$ represents the word embedding lookup table.
The word cell $c^w_{b,e}$ represents the recurrent state of $x^w_{b,e}$ from the beginning of the sentence. Since labeling is performed only at the character level, there is no output gate for word cells; $c^w_{b,e}$ is computed as follows:

$$\begin{bmatrix} i^w_{b,e} \\ f^w_{b,e} \\ \tilde{c}^w_{b,e} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{w\,T} \begin{bmatrix} x^w_{b,e} \\ h^c_b \end{bmatrix} + b^w \right)$$

$$c^w_{b,e} = f^w_{b,e} \odot c^c_b + i^w_{b,e} \odot \tilde{c}^w_{b,e}$$

wherein $i^w_{b,e}$ represents the input gate and $f^w_{b,e}$ represents the forget gate.
With word cells $c^w_{b,e}$ introduced, more recurrent paths flow into each $c^c_j$. For the "bronchiectasis" (支气管扩张) input in FIG. 1, for example, the input sources of $c^c_3$ include $x^c_3$ ("管", tube), $c^w_{2,3}$ ("气管", trachea) and $c^w_{1,3}$ ("支气管", bronchus).
The method of the invention connects all word cells $c^w_{b,e}$ that end at the character with index e to the cell $c^c_e$, and introduces an additional gate structure $i^l_{b,e}$ to control the contribution of each subsequence cell $c^w_{b,e}$ to $c^c_e$, computed as:

$$i^l_{b,e} = \sigma\!\left( W^{l\,T} \begin{bmatrix} x^c_e \\ c^w_{b,e} \end{bmatrix} + b^l \right)$$

Therefore, the cell state of the model after word information is introduced is computed as:

$$c^c_e = \sum_{b \in \{b'\,|\,w^d_{b',e} \in \mathbb{D}\}} \alpha^c_{b,e} \odot c^w_{b,e} \;+\; \alpha^c_e \odot \tilde{c}^c_e$$

wherein $\alpha^c_{b,e}$ and $\alpha^c_e$ are obtained by normalizing $i^l_{b,e}$ and $i^c_e$, computed as:

$$\alpha^c_{b,e} = \frac{\exp(i^l_{b,e})}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}, \qquad \alpha^c_e = \frac{\exp(i^c_e)}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}$$
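Under the same stacked-weight layout assumptions as the character-step sketch above, the word-cell computation and the gated fusion into the tail character cell might be sketched as follows (sigmoid is redefined for self-containment):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_cell(x_w, h_b, c_b, Ww, bw):
    """Word cell c^w_{b,e}: input/forget gates and candidate, no output gate.
    Ww has shape (3*d, d_x + d), rows stacked as [input; forget; candidate]."""
    d = h_b.shape[0]
    z = Ww @ np.concatenate([x_w, h_b]) + bw
    i_w = sigmoid(z[:d])                          # i^w_{b,e}
    f_w = sigmoid(z[d:2*d])                       # f^w_{b,e}
    c_tilde_w = np.tanh(z[2*d:])                  # ~c^w_{b,e}
    return f_w * c_b + i_w * c_tilde_w            # c^w_{b,e}

def fuse_word_cells(x_e, i_c_e, c_tilde_e, word_cells, Wl, bl):
    """Fuse all word cells ending at index e into the character cell c^c_e,
    normalizing the extra gates i^l_{b,e} against the character input gate."""
    gates = [sigmoid(Wl @ np.concatenate([x_e, c_w]) + bl) for c_w in word_cells]
    denom = np.exp(i_c_e) + sum(np.exp(g) for g in gates)
    c_e = (np.exp(i_c_e) / denom) * c_tilde_e     # alpha^c_e term
    for g, c_w in zip(gates, word_cells):
        c_e += (np.exp(g) / denom) * c_w          # alpha^c_{b,e} terms
    return c_e
```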
After word information is introduced, the hidden vector $h^c_j$ is still computed as in the character-based model.
During model training, the loss value is back-propagated to the parameters, which are continuously optimized so that the model dynamically focuses on the more relevant words during labeling.
The Bi-LSTM layer applies the above method to each corpus sentence in forward and reverse order, yielding $\overrightarrow{h^c_j}$ and $\overleftarrow{h^c_j}$. The two groups of vectors are spliced, and the final hidden vector $h^c_j$ corresponding to each character in the corpus is computed as:

$$h^c_j = [\overrightarrow{h^c_j}\,;\,\overleftarrow{h^c_j}]$$
step 4: as shown in FIG. 2, the method of the invention introduces an attention mechanism that mimics the way humans focus attention, letting the model ignore useless information in the input medical corpus and concentrate on the target information that needs emphasis and is relevant to the current output, so as to improve output efficiency and quality.
As shown in FIG. 3, the attention mechanism assigns weights corresponding to the output of the Bi-LSTM layer: the feature vectors $h_j$ output by the preceding model are weighted and summed with their corresponding weights $\alpha_{tj}$ to obtain a new output vector $c_t$, computed as:

$$c_t = \sum_{j=1}^{m} \alpha_{tj} h_j$$

The weight $\alpha_{tj}$ corresponding to each feature vector $h_j$ is computed as:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{m} \exp(e_{tk})}$$

$$e_{tj} = v_a^{T} \tanh(W_a s_{t-1} + U_a h_j)$$

wherein $e_{tj}$ measures the degree of matching between the j-th source-end character and the t-th target-end character.
After the attention mechanism is introduced into the traditional Decoder, the Encoder no longer needs to encode all information into a fixed length; in this way the Decoder can select which information propagates into the annotation sequence.
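This additive attention computation might be sketched as follows (NumPy; v_a is the scoring vector standard in this formulation, which the patent does not name explicitly):

```python
import numpy as np

def attention_step(s_prev, H, Wa, Ua, va):
    """Compute the weights alpha_{t,j} and the context vector c_t from the
    previous decoder state s_{t-1} and the Bi-LSTM outputs H (row j = h_j)."""
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])  # e_{tj}
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()        # softmax over source positions j
    c_t = alpha @ H                    # weighted sum of the h_j
    return alpha, c_t
```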
step 5: finally, sequence labeling is performed with a standard CRF; during labeling, the CRF can use the dependency information between labels to predict the relationships between them, so as to obtain a globally optimal label sequence.
The feature vectors $c = \{c_1, c_2, \ldots, c_m\}$ output in step 4 are decoded by the conditional random field CRF to obtain the medical entity type labels of the input medical text sentence s, specifically:

P(y|c) = CRF(c, y);

wherein y ranges over all possible output tag sequences of the input medical text sentence s, and P(y|c) is the conditional probability of the output tag sequence y.
During prediction, the highest-scoring tag sequence over the input sequence is found with the dynamic-programming Viterbi Algorithm, giving the medical entity type labels $y^*$ of the input medical text sentence s, computed as:

$$y^* = \arg\max_y P(y|c)$$
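Viterbi decoding over the CRF scores might be sketched as follows (the emission and transition score matrices are assumed to be given; this is an illustrative sketch, not the patent's implementation):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence y*.
    emissions: (m, K) per-character tag scores;
    transitions: (K, K), transitions[a, b] = score of moving from tag a to b."""
    m, K = emissions.shape
    score = emissions[0].copy()                 # best score ending in each tag
    back = np.zeros((m, K), dtype=int)          # backpointers
    for t in range(1, m):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(m - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]                           # tag indices, left to right
```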
In order to verify the effectiveness of the invention, comparison experiments were conducted on the 1200 manually annotated electronic medical records of the CCKS2017-Task2 public data set against 3 classical models for the named entity recognition task. The 3 comparison models are:
(1) LSTM-CRF: introducing a conditional random field in the unidirectional long-short term memory network;
(2) Bi-LSTM-CRF: introducing a conditional random field in a bidirectional long-short term memory network;
(3) Lattice-LSTM: word-based units are introduced in character-based LSTM-CRF.
The evaluation indicators are Precision, Recall and the F1 value (F1-Score), computed as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Depending on the expected output and the actual output of the model, results can be divided into TP (True Positive), FP (False Positive), TN (True Negative) and FN (False Negative). In the named entity recognition task, their meanings are as follows:
(1) TP: the number of entities correctly identified as entities;
(2) FP: the number of non-entities erroneously identified as entities;
(3) TN: the number of non-entities correctly judged to be non-entities;
(4) FN: the number of entities erroneously judged to be non-entities.
Precision is the proportion of true entities among all samples identified as entities. Recall is the proportion of all entities in the sample that are correctly identified. The two indicators are considered jointly by taking their weighted harmonic mean, the F1 value, so that neither indicator is optimized at the expense of the other.
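The three indicators follow directly from these counts; a minimal sketch:

```python
def prf1(tp, fp, fn):
    """Entity-level precision, recall and F1 from the counts defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```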
The results of the experiment are shown in table 1:
table 1 model comparative experimental results
Figure BDA0003203592810000094
The experimental results show that the Ng-BAC model proposed by the invention performs excellently on the CCKS2017-Task2 data set, reaching a precision of 89.86%, a recall of 92.05% and an F1 value of 90.94%, an improvement over the classical models for the named entity recognition task.

Claims (1)

1. The medical named entity recognition modeling method based on fusion attention is characterized by comprising the following steps:
step 1: performing Chinese word segmentation and indexing on the medical text sentence s: matching the medical text sentence s with the dictionary to obtain a word sequence $w_1, w_2, \ldots, w_n$, wherein $w_i$ is the i-th word in the word sequence, $i = 1, 2, \ldots, n$; the k-th character of the i-th word is indexed by t(i, k), wherein k is the position of the character within the word;

the medical text sentence $s = c_1, c_2, \ldots, c_m$, wherein $c_j$ is the j-th character of the medical text sentence s, $j = 1, 2, \ldots, m$;

the word whose index starts at b and ends at e is denoted $w^d_{b,e}$, wherein b represents the index of the word's first character and e represents the index of the word's last character;
step 2: obtaining a Bi-LSTM model by splicing the forward LSTM and the reverse LSTM;

the forward LSTM is:

$$\begin{bmatrix} i^c_j \\ f^c_j \\ o^c_j \\ \tilde{c}^c_j \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{c\,T} \begin{bmatrix} x^c_j \\ h^c_{j-1} \end{bmatrix} + b^c \right)$$

$$c^c_j = f^c_j \odot c^c_{j-1} + i^c_j \odot \tilde{c}^c_j$$

$$h^c_j = o^c_j \odot \tanh(c^c_j)$$

wherein $i^c_j$, $f^c_j$ and $o^c_j$ are the input gate, the forget gate and the output gate respectively; $\tilde{c}^c_j$ is the new candidate cell information; $W^{cT}$ and $b^c$ are the model weight parameter and bias term to be learned; $\sigma$ is the Sigmoid function; $\odot$ is the Hadamard product; $x^c_j$ is the embedded representation of the character $c_j$, $x^c_j = e^c(c_j)$, wherein $e^c$ represents the character embedding lookup table; $h^c_j$ is the hidden state corresponding to the character $c_j$, and $c^c_j$ is the character cell state corresponding to $c_j$; $h^c_{j-1}$ and $c^c_{j-1}$ are the hidden state and character cell state corresponding to the previous character $c_{j-1}$;

the character cell state after word information is introduced is:

$$c^c_e = \sum_{b \in \{b'\,|\,w^d_{b',e} \in \mathbb{D}\}} \alpha^c_{b,e} \odot c^w_{b,e} \;+\; \alpha^c_e \odot \tilde{c}^c_e$$

$$\alpha^c_{b,e} = \frac{\exp(i^l_{b,e})}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}, \qquad \alpha^c_e = \frac{\exp(i^c_e)}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}$$

wherein $\mathbb{D}$ is the dictionary; $\alpha^c_{b,e}$ and $\alpha^c_e$ are obtained by normalizing $i^l_{b,e}$ and $i^c_e$; $i^l_{b,e}$ is the additional gate structure introduced to control the contribution of every word cell $c^w_{b,e}$ ending at the character with index e to the tail character cell $c^c_e$:

$$i^l_{b,e} = \sigma\!\left( W^{l\,T} \begin{bmatrix} x^c_e \\ c^w_{b,e} \end{bmatrix} + b^l \right)$$

wherein $W^{lT}$ and $b^l$ are the model weight parameter and bias term to be learned;

the word cell state $c^w_{b,e}$ is:

$$\begin{bmatrix} i^w_{b,e} \\ f^w_{b,e} \\ \tilde{c}^w_{b,e} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{w\,T} \begin{bmatrix} x^w_{b,e} \\ h^c_b \end{bmatrix} + b^w \right)$$

$$c^w_{b,e} = f^w_{b,e} \odot c^c_b + i^w_{b,e} \odot \tilde{c}^w_{b,e}$$

wherein $i^w_{b,e}$ is the input gate and $f^w_{b,e}$ is the forget gate; $\tilde{c}^w_{b,e}$ is the new candidate word cell information; $W^{wT}$ and $b^w$ are the model weight parameter and bias term to be learned; $h^c_b$ is the hidden state corresponding to the first character of the word; $x^w_{b,e}$ is the embedded representation of the word $w^d_{b,e}$, $x^w_{b,e} = e^w(w^d_{b,e})$, wherein $e^w$ represents the word embedding lookup table obtained from the dictionary conversion of step 1;

the reverse LSTM is analogous to the forward LSTM;

applying the two directions to the medical text sentence s respectively yields the two groups of vectors $\overrightarrow{h^c_j}$ and $\overleftarrow{h^c_j}$; the two groups of vectors are spliced, and the final hidden vector $h^c_j$ corresponding to each character in s is computed as:

$$h^c_j = [\overrightarrow{h^c_j}\,;\,\overleftarrow{h^c_j}]$$
step 3: assigning to the output of step 2 a corresponding weight $\alpha_{tj}$ through the attention mechanism; the feature vectors $h_j$ are weighted and summed with their corresponding weights $\alpha_{tj}$ to obtain a new output vector $c_t$, specifically:

$$c_t = \sum_{j=1}^{m} \alpha_{tj} h_j$$

the weight $\alpha_{tj}$ corresponding to the feature vector $h_j$ is obtained as follows:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{m} \exp(e_{tk})}$$

$$e_{tj} = v_a^{T} \tanh(W_a s_{t-1} + U_a h_j)$$

wherein $e_{tj}$ measures the degree of matching between the j-th source-end character and the t-th target-end character; $s_{t-1}$ is the hidden layer state at the previous time step; $W_a$ and $U_a$ are weight matrices and $v_a$ is a weight vector;
step 4: the feature vectors $c = \{c_1, c_2, \ldots, c_m\}$ output in step 3 are decoded by the conditional random field CRF to obtain the medical entity type labels of the input medical text sentence s, specifically:

P(y|c) = CRF(c, y);

wherein y ranges over all possible output tag sequences of the input medical text sentence s, and P(y|c) is the conditional probability of the output tag sequence y;

during prediction, the highest-scoring tag sequence over the input sequence is found with the dynamic-programming Viterbi Algorithm, giving the medical entity type labels $y^*$ of the input medical text sentence s:

$$y^* = \arg\max_y P(y|c)$$
CN202110927320.1A — filed 2021-08-10, priority 2021-08-10 — Medical named entity recognition modeling method based on fusion attention — Active — granted as CN113536799B

Priority Applications (1)

CN202110927320.1A — priority date 2021-08-10 — filing date 2021-08-10 — Medical named entity recognition modeling method based on fusion attention


Publications (2)

Publication Number — Publication Date
CN113536799A (en) — 2021-10-22
CN113536799B — 2023-04-07

Family ID: 78122434

Family Applications (1)

CN202110927320.1A (Active) — priority/filing date 2021-08-10 — Medical named entity recognition modeling method based on fusion attention

Country Status (1): CN — CN113536799B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818721B (en) * 2022-06-30 2022-11-01 湖南工商大学 Event joint extraction model and method combined with sequence labeling
CN115146644B (en) * 2022-09-01 2022-11-22 北京航空航天大学 Alarm situation text-oriented multi-feature fusion named entity identification method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN110032739A (en) * 2019-04-18 2019-07-19 清华大学 Chinese electronic health record name entity abstracting method and system
CN110866401A (en) * 2019-11-18 2020-03-06 山东健康医疗大数据有限公司 Chinese electronic medical record named entity identification method and system based on attention mechanism
CN111079377A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Method for recognizing named entities oriented to Chinese medical texts
CN111737991A (en) * 2020-07-01 2020-10-02 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN111783466A (en) * 2020-07-15 2020-10-16 电子科技大学 Named entity identification method for Chinese medical records
CN112597774A (en) * 2020-12-14 2021-04-02 山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10606846B2 (en) * 2015-10-16 2020-03-31 Baidu Usa Llc Systems and methods for human inspired simple question answering (HISQA)


Also Published As

Publication number Publication date
CN113536799A (en) 2021-10-22


Legal Events

Date Code Title Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant