CN113536799B - Medical named entity recognition modeling method based on fusion attention - Google Patents

Medical named entity recognition modeling method based on fusion attention

Info

Publication number
CN113536799B
CN113536799B — Application CN202110927320.1A
Authority
CN
China
Prior art keywords: character, word, medical, LSTM, model
Prior art date
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number
CN202110927320.1A
Other languages
Chinese (zh)
Other versions
CN113536799A (en)
Inventor
李天瑞
邬萌
贾真
杜圣东
滕飞
Current Assignee
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date
Filing date
Publication date
Application filed by Southwest Jiaotong University
Priority to CN202110927320.1A
Publication of CN113536799A
Application granted
Publication of CN113536799B
Legal status: Active

Links

Images

Classifications

    • G06F 40/295 — Handling natural language data; natural language analysis; recognition of textual entities; named entity recognition
    • G06F 40/242 — Handling natural language data; natural language analysis; lexical tools; dictionaries
    • G06N 3/044 — Neural networks; architecture; recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Neural networks; architecture; combinations of networks
    • G06N 3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • Y02D 10/00 — Climate change mitigation in information and communication technologies; energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The fusion-attention-based medical named entity recognition modeling method comprises the following steps: performing Chinese word segmentation and indexing on the medical text sentence; obtaining a Bi-LSTM model by splicing a forward LSTM and a reverse LSTM; updating the output feature vectors through an attention mechanism; and decoding the output feature vectors with a conditional random field (CRF) to obtain the medical entity type labels of the input medical text sentence. On the basis of the character sequence, the invention adds the words matched in a dictionary and, through the dynamic control of gate structures, provides more guidance to the model, which selects the most relevant characters and words from the medical corpus. Compared with character-based methods, the model explicitly exploits multi-granularity information to obtain better recognition performance. An attention mechanism is further introduced so that the model focuses on informative content, overcoming the shortcoming of the traditional Bi-LSTM-CRF model, which considers context information but ignores the fact that different characters and words carry different importance within a sentence.

Description

Medical named entity recognition modeling method based on fusion attention
Technical Field
The invention relates to the technical field of natural language processing, in particular to a medical named entity recognition modeling method based on fusion attention.
Background
The continuous advance of medical informatization has produced explosive growth of data in the medical field, in increasingly diverse forms that contain a great deal of useful information. Medical data are widely available, coming mainly from medical websites, hospital electronic medical records, medical books, and the like. A medical knowledge graph standardizes, according to a complete set of criteria, the whole refinement process from acquiring knowledge to updating it, so that massive medical knowledge can be managed and applied efficiently. Accurately identifying medical named entities from massive medical data lays an important foundation for constructing such a knowledge graph. Medical named entity recognition extracts medical entities of specific predefined types, for example the entity "blood routine" of the examination-item type and the entity "bronchitis" of the disease type; this task is also referred to as medical entity extraction.
Named entity recognition is a basic task of knowledge extraction and is generally viewed as a sequence labeling task that predicts entity boundaries and entity class labels. Current named entity recognition methods are mainly based on deep learning. Neural networks were first applied to named entity recognition using a unidirectional Long Short-Term Memory network (LSTM). Excellent recognition has been achieved by combining a Convolutional Neural Network (CNN) with a Conditional Random Field (CRF), and the performance of the CNN-CRF model can be further enhanced with a character-based CNN. In Chinese named entity recognition, the LSTM-CRF architecture is currently mainstream and has been adopted by a large body of research. Hand-crafted character features, character-based CNNs, character-based LSTMs, and the like are used to represent characters.
Character-level sequence labeling is the primary method for Chinese named entity recognition. Comparisons of word-based and character-based statistical methods on this task indicate that the character-based method is relatively superior. How to better utilize word information in Chinese named entity recognition has attracted much research interest, for example using word segmentation information as features for named entity recognition, combining word segmentation and named entity recognition through dual decomposition, and multi-task learning.
External information resources, especially dictionaries, are widely used to assist named entity recognition, for example multi-task learning on large raw text with word-level language models to enhance the training of named entity recognition, enhancing word representations by pre-training character-level language models, and mining cross-domain and cross-language knowledge through multi-task learning.
Disclosure of Invention
The invention aims to provide a medical named entity recognition modeling method based on fusion attention.
The technical scheme for realizing the purpose of the invention is as follows:
the medical named entity recognition modeling method based on the fusion attention comprises the following steps:
step 1: performing Chinese word segmentation and indexing on the medical text sentence s: matching the medical text sentence s with a dictionary to obtain a word sequence $w_1, w_2, \ldots, w_n$, wherein $w_i$ is the i-th word in the word sequence, $i = 1, 2, \ldots, n$; the k-th character of the i-th word is indexed by t(i, k), wherein k is the position of the character within the word;

the medical text sentence $s = c_1, c_2, \ldots, c_m$, wherein $c_j$ is the j-th character of the medical text sentence s, $j = 1, 2, \ldots, m$;

the word whose index starts at b and ends at e is denoted $w^d_{b,e}$, wherein b represents the index of the word's first character and e represents the index of the word's last character;
step 2: obtaining a Bi-LSTM model by splicing the forward LSTM and the reverse LSTM;

the forward LSTM is:

$$\begin{bmatrix} i^c_j \\ f^c_j \\ o^c_j \\ \tilde{c}^c_j \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{c\,T} \begin{bmatrix} x^c_j \\ h^c_{j-1} \end{bmatrix} + b^c \right)$$

$$c^c_j = f^c_j \odot c^c_{j-1} + i^c_j \odot \tilde{c}^c_j$$

$$h^c_j = o^c_j \odot \tanh(c^c_j)$$

wherein $i^c_j$, $f^c_j$ and $o^c_j$ are the input gate, the forget gate and the output gate respectively; $\tilde{c}^c_j$ is the new candidate cell information; $W^{cT}$ and $b^c$ are the model weight parameter and bias term to be learned; $\sigma$ is the Sigmoid function; $\odot$ is the Hadamard product; $x^c_j$ is the embedded representation of the character $c_j$, $x^c_j = e^c(c_j)$, wherein $e^c$ represents the character embedding lookup table; $h^c_j$ is the hidden state corresponding to the character $c_j$, and $c^c_j$ is the character cell state corresponding to $c_j$; $h^c_{j-1}$ and $c^c_{j-1}$ are the hidden state and character cell state corresponding to the previous character $c_{j-1}$;

the character cell state after word information is introduced is:

$$c^c_e = \sum_{b \in \{b'\,|\,w^d_{b',e} \in \mathbb{D}\}} \alpha^c_{b,e} \odot c^w_{b,e} \;+\; \alpha^c_e \odot \tilde{c}^c_e$$

$$\alpha^c_{b,e} = \frac{\exp(i^l_{b,e})}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}, \qquad \alpha^c_e = \frac{\exp(i^c_e)}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}$$

wherein $\mathbb{D}$ is the dictionary; $\alpha^c_{b,e}$ and $\alpha^c_e$ are obtained by normalizing $i^l_{b,e}$ and $i^c_e$; $i^l_{b,e}$ is the additional gate structure introduced to control the contribution of every word cell $c^w_{b,e}$ ending at the character with index e to the tail character cell $c^c_e$:

$$i^l_{b,e} = \sigma\!\left( W^{l\,T} \begin{bmatrix} x^c_e \\ c^w_{b,e} \end{bmatrix} + b^l \right)$$

wherein $W^{lT}$ and $b^l$ are the model weight parameter and bias term to be learned;

the word cell state $c^w_{b,e}$ is:

$$\begin{bmatrix} i^w_{b,e} \\ f^w_{b,e} \\ \tilde{c}^w_{b,e} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{w\,T} \begin{bmatrix} x^w_{b,e} \\ h^c_b \end{bmatrix} + b^w \right)$$

$$c^w_{b,e} = f^w_{b,e} \odot c^c_b + i^w_{b,e} \odot \tilde{c}^w_{b,e}$$

wherein $i^w_{b,e}$ is the input gate and $f^w_{b,e}$ is the forget gate; $\tilde{c}^w_{b,e}$ is the new candidate word cell information; $W^{wT}$ and $b^w$ are the model weight parameter and bias term to be learned; $h^c_b$ is the hidden state corresponding to the first character of the word; $x^w_{b,e}$ is the embedded representation of the word $w^d_{b,e}$, $x^w_{b,e} = e^w(w^d_{b,e})$, wherein $e^w$ represents the word embedding lookup table obtained from the dictionary of step 1;

the reverse LSTM is analogous to the forward LSTM;

applying the two directions to the medical text sentence s respectively yields $\overrightarrow{h^c_j}$ and $\overleftarrow{h^c_j}$; the two groups of vectors are spliced, and the final hidden vector $h^c_j$ corresponding to each character in s is computed as:

$$h^c_j = [\overrightarrow{h^c_j}\,;\,\overleftarrow{h^c_j}]$$
step 3: the output of step 2 is assigned a corresponding weight $\alpha_{tj}$ through the attention mechanism; the feature vectors $h_j$ are weighted and summed with their corresponding weights $\alpha_{tj}$ to obtain a new output vector $c_t$, specifically:

$$c_t = \sum_{j=1}^{m} \alpha_{tj} h_j$$

the weight $\alpha_{tj}$ corresponding to the feature vector $h_j$ is obtained as follows:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{m} \exp(e_{tk})}$$

$$e_{tj} = v_a^{T} \tanh(W_a s_{t-1} + U_a h_j)$$

wherein $e_{tj}$ measures the degree of matching between the j-th source-end character and the t-th target-end character; $s_{t-1}$ is the hidden layer state at the previous time step; $W_a$ and $U_a$ are weight matrices and $v_a$ is a weight vector;
step 4: the feature vectors $c = \{c_1, c_2, \ldots, c_m\}$ output in step 3 are decoded by the conditional random field CRF to obtain the medical entity type labels of the input medical text sentence s, specifically:

P(y|c) = CRF(c, y);

wherein y ranges over all possible output tag sequences of the input medical text sentence s, and P(y|c) is the conditional probability of the output tag sequence y;

during prediction, the highest-scoring tag sequence over the input sequence is found with the dynamic-programming Viterbi Algorithm, giving the medical entity type labels $y^*$ of the input medical text sentence s:

$$y^* = \arg\max_y P(y|c)$$
Compared with the prior art, the invention has the following beneficial effects:
1. Word information is used to improve the accuracy of medical named entity recognition. On the basis of the character sequence, the input of the invention adds the words matched in the dictionary and, through the dynamic control of gate structures, provides more guidance to the model, so that the most relevant characters and words are selected from the medical corpus. Compared with character-based methods, the model explicitly exploits multi-granularity information for better recognition performance.
2. An attention mechanism is introduced so that the model focuses on informative content, remedying the shortcoming of the traditional Bi-LSTM-CRF model, which considers context information but ignores the fact that different characters and words carry different importance within a sentence.
Drawings
Fig. 1 is a schematic diagram of a medical named entity recognition model architecture.
FIG. 2 is a schematic illustration of an Encoder-Decoder using an attention mechanism.
FIG. 3 is a schematic diagram of the attention mechanism in the modeling method.
Detailed Description
In order to use word information to improve the recognition accuracy of medical named entities, to avoid or reduce recognition errors on out-of-vocabulary words caused by the ambiguity inherent in medical terminology, and to remedy the shortcoming of the traditional Bi-LSTM-CRF model, which considers context information but ignores the differing importance of characters and words within a sentence, the invention provides Ng-BAC, a medical named entity recognition method based on new word discovery and a fused attention mechanism; the model architecture is shown in FIG. 1. The model first acquires new words from the medical corpus through an N-grams algorithm to construct a medicine-related external dictionary; potential word information is integrated into the character-based model as an extension, with gate structures dynamically routing information from different paths to each character; an attention mechanism assigns different weights to the Bi-LSTM layer output to improve its accuracy; and finally a conditional random field CRF performs sequence labeling.
The specific implementation steps are as follows:
step 1: extracting new words with a new word discovery algorithm based on the N-grams model; new words are extracted mainly according to two indicators, solidity and word frequency:
(1) Solidity: represents how tightly the parts of a word cohere, usually measured by mutual information;
(2) Word frequency: represents the number of times a word occurs in the corpus.
Unlike methods that only consider the solidity of adjacent characters, the method of the invention also considers the internal solidity of multi-character strings in the corpus: the corpus is segmented by N-grams and the internal solidity is computed. Taking a 3-character string as an example, the internal solidity is computed as:

$$\mathrm{solidity}(abc) = \min\left( \frac{p(abc)}{p(a)\,p(bc)},\ \frac{p(abc)}{p(ab)\,p(c)} \right)$$

wherein a, b and c are three adjacent characters in the string; p(a), p(b) and p(c) represent the frequencies of their individual occurrence; p(ab), p(bc) and p(abc) represent the frequencies of the composed substrings; the minimum over the possible splits is taken as the internal solidity of the whole string.
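For illustration, this computation can be sketched in a few lines of Python (the function name and the frequency-table interface are our own assumptions, not taken from the patent):

```python
def internal_solidity(s, freq, total):
    """Internal solidity of an n-gram s: the minimum over all binary splits
    of p(s) / (p(left) * p(right)), with p(x) = freq[x] / total.
    For "abc" the splits are (a, bc) and (ab, c), as in the formula above."""
    p = lambda x: freq.get(x, 0) / total
    if p(s) == 0.0:
        return 0.0
    ratios = [p(s) / (p(s[:i]) * p(s[i:]))
              for i in range(1, len(s))
              if p(s[:i]) > 0 and p(s[i:]) > 0]
    return min(ratios) if ratios else 0.0
```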
New word discovery extracts strings that satisfy both the solidity threshold and the word frequency requirement. Taking 4-grams as an example, the algorithm executes as follows (a code sketch follows the list):
step 1.1: setting n to 4 characters (i.e., 4-grams), segment the sentences, count the 2-grams, 3-grams and 4-grams, and compute their internal solidity;
step 1.2: set a different threshold for each n as a power of 5: if n equals 2 the threshold is 5, if n equals 3 the threshold is 25, and if n equals 4 the threshold is 125;
step 1.3: generate a set G that retains the fragments whose internal solidity exceeds the threshold;
step 1.4: coarsely segment the corpus according to the gram setting of step 1.1; if a fragment exists in the set G, it is not segmented further and its frequency is counted;
step 1.5: after step 1.4, a backtracking check is needed under the principle of "better to leave unsegmented than to segment wrongly": when a word has at most 4 characters, it is kept if it is in the set G and deleted otherwise; when a word has more than 4 characters, every 4-character fragment of it is checked against the set G, and the word is kept only if all fragments are present.
After the new words are extracted, they are compared against the built-in Jieba word segmentation dictionary, and the candidates remaining after this comparison are screened manually to determine the final dictionary.
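Steps 1.1–1.3 can be condensed into the following Python sketch, which reuses the hypothetical internal_solidity function above; the coarse segmentation and backtracking check of steps 1.4–1.5 are elided:

```python
from collections import Counter

def candidate_fragments(corpus, max_n=4):
    """Collect 2- to max_n-grams whose internal solidity exceeds 5**(n-1),
    i.e. 5 for 2-grams, 25 for 3-grams, 125 for 4-grams (steps 1.1-1.3)."""
    freq = Counter()
    for sent in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                freq[sent[i:i + n]] += 1
    total = sum(c for s, c in freq.items() if len(s) == 1)  # character count
    G = set()
    for gram in freq:
        n = len(gram)
        if 2 <= n <= max_n and internal_solidity(gram, freq, total) > 5 ** (n - 1):
            G.add(gram)
    return G, freq
```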
The invention introduces new word discovery on top of the use of word information to avoid or reduce errors caused by out-of-vocabulary words. Because medical terminology has a certain particularity, the recognition of unknown words in a domain-specific named entity recognition task can go wrong due to ambiguity. For example, in "bronchiectasis", "trachea" may be erroneously recognized as a common everyday object; such errors are particularly significant in the medical field. The invention obtains new words from the medical corpus through the N-grams algorithm to assist the word segmentation algorithm, and then builds a medicine-related dictionary with Word2Vec from the segmentation result.
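Building the word embeddings for the dictionary from the segmented corpus could then proceed roughly as follows (a sketch using the gensim library; the hyperparameters, sample tokens, and file name are illustrative assumptions, not from the patent):

```python
from gensim.models import Word2Vec

# Segmented corpus: token lists produced by Jieba augmented with the
# new-word dictionary (tiny illustrative sample).
segmented_corpus = [["支气管", "扩张"], ["血常规", "检查"]]

w2v = Word2Vec(sentences=segmented_corpus, vector_size=100,
               window=5, min_count=1, sg=1)  # sg=1: skip-gram
w2v.save("medical_word2vec.model")
```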
step 2: the invention takes Bi-LSTM as the baseline network structure and introduces word information on this basis to improve the accuracy of medical named entity recognition.
Unlike traditional character-based models, for the character sequence $c_1, c_2, \ldots, c_m$ the method of the invention additionally inputs the words matched, by all subsequences of the character sequence, in the dictionary determined by the N-grams-based new word discovery algorithm of step 1. Other dictionaries, such as the built-in Jieba segmentation dictionary, may also be used.
Formally, $s = c_1, c_2, \ldots, c_m$ represents the input sentence, and $c_j$ represents the j-th character of the input sequence.
For the method of the invention, the input sentence s can also be regarded as a word sequence $s = w_1, w_2, \ldots, w_n$, obtained by Chinese word segmentation of the input sentence, wherein the i-th word in the sequence is $w_i$.
The subsequence whose index starts at b and ends at e is denoted $w^d_{b,e}$. As shown in FIG. 1, in "bronchiectasis" (支气管扩张), $w^d_{1,3}$ denotes "bronchus" (支气管) and $w^d_{4,5}$ denotes "dilation" (扩张).
The specific character k of the specific word i is indexed by t(i, k). As shown in FIG. 1, if the sentence is segmented as "bronchus / dilation", then t(2, 1) = 4 (the first character of "dilation") and t(1, 3) = 3 (the third character of "bronchus").
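This character/word indexing can be made concrete with a short Python sketch (the helper name is ours, for illustration only):

```python
def build_index(words):
    """Map (i, k) -> global character position t(i, k), 1-based,
    for a sentence segmented into `words`."""
    t, pos = {}, 1
    for i, w in enumerate(words, start=1):
        for k in range(1, len(w) + 1):
            t[(i, k)] = pos
            pos += 1
    return t

t = build_index(["支气管", "扩张"])   # "bronchus" / "dilation"
assert t[(2, 1)] == 4 and t[(1, 3)] == 3
```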
step 3: compared with the traditional recurrent neural network, the LSTM model introduces a gating mechanism that adds a forgetting capability to the neurons, which can be used to avoid the vanishing gradient phenomenon.
Taking the forward LSTM as an example, the algorithm executes as follows.
The baseline model of the method is character-based; at the model input, the embedded representation of each character $c_j$ is computed as:

$$x^c_j = e^c(c_j)$$

wherein $e^c$ represents the character embedding lookup table.
The basic recurrent structure of the baseline model is built on the character cell state $c^c_j$ and the hidden state $h^c_j$ corresponding to each character $c_j$, wherein $c^c_j$ records the recurrent information flow from the beginning of the sentence up to the character $c_j$, and $h^c_j$ is used for sequence labeling at the CRF layer.
The main computation in the LSTM is as follows:

$$\begin{bmatrix} i^c_j \\ f^c_j \\ o^c_j \\ \tilde{c}^c_j \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{c\,T} \begin{bmatrix} x^c_j \\ h^c_{j-1} \end{bmatrix} + b^c \right)$$

$$c^c_j = f^c_j \odot c^c_{j-1} + i^c_j \odot \tilde{c}^c_j$$

$$h^c_j = o^c_j \odot \tanh(c^c_j)$$

wherein $i^c_j$, $f^c_j$ and $o^c_j$ represent the input gate, the forget gate and the output gate respectively; $W^{cT}$ and $b^c$ are model parameters; $\sigma$ represents the Sigmoid function; $\odot$ represents the Hadamard product.
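For concreteness, one step of this character-level recurrence might be sketched in NumPy as follows (the stacked-weight layout and dimension handling are our own assumptions, not specified by the patent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_char_step(x_j, h_prev, c_prev, W, b):
    """One forward-LSTM step for character j.
    W has shape (4*d, d_x + d); b has shape (4*d,); rows of W are
    stacked as [input gate; forget gate; output gate; candidate]."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([x_j, h_prev]) + b
    i = sigmoid(z[:d])            # input gate  i^c_j
    f = sigmoid(z[d:2*d])         # forget gate f^c_j
    o = sigmoid(z[2*d:3*d])       # output gate o^c_j
    c_tilde = np.tanh(z[3*d:])    # candidate   ~c^c_j
    c = f * c_prev + i * c_tilde  # character cell state c^c_j
    h = o * np.tanh(c)            # hidden state h^c_j
    return h, c
```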
Unlike the character-based model, the method of the invention matches the sentence against the dictionary to obtain subsequences $w^d_{b,e}$ and introduces the computation of word cell states $c^w_{b,e}$. At the model input, the embedded representation of each subsequence $w^d_{b,e}$ is computed as:

$$x^w_{b,e} = e^w(w^d_{b,e})$$

wherein $e^w$ represents the word embedding lookup table.
The word cell $c^w_{b,e}$ represents the recurrent state of $x^w_{b,e}$ from the beginning of the sentence. Since labeling is performed only at the character level, there is no output gate for word cells; $c^w_{b,e}$ is computed as follows:

$$\begin{bmatrix} i^w_{b,e} \\ f^w_{b,e} \\ \tilde{c}^w_{b,e} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{w\,T} \begin{bmatrix} x^w_{b,e} \\ h^c_b \end{bmatrix} + b^w \right)$$

$$c^w_{b,e} = f^w_{b,e} \odot c^c_b + i^w_{b,e} \odot \tilde{c}^w_{b,e}$$

wherein $i^w_{b,e}$ represents the input gate and $f^w_{b,e}$ represents the forget gate.
With word cells $c^w_{b,e}$ introduced, more recurrent paths flow into each $c^c_j$. For the "bronchiectasis" (支气管扩张) input in FIG. 1, for example, the input sources of $c^c_3$ include $x^c_3$ ("管", tube), $c^w_{2,3}$ ("气管", trachea) and $c^w_{1,3}$ ("支气管", bronchus).
The method of the invention connects all word cells $c^w_{b,e}$ that end at the character with index e to the cell $c^c_e$, and introduces an additional gate structure $i^l_{b,e}$ to control the contribution of each subsequence cell $c^w_{b,e}$ to $c^c_e$, computed as:

$$i^l_{b,e} = \sigma\!\left( W^{l\,T} \begin{bmatrix} x^c_e \\ c^w_{b,e} \end{bmatrix} + b^l \right)$$

Therefore, the cell state of the model after word information is introduced is computed as:

$$c^c_e = \sum_{b \in \{b'\,|\,w^d_{b',e} \in \mathbb{D}\}} \alpha^c_{b,e} \odot c^w_{b,e} \;+\; \alpha^c_e \odot \tilde{c}^c_e$$

wherein $\alpha^c_{b,e}$ and $\alpha^c_e$ are obtained by normalizing $i^l_{b,e}$ and $i^c_e$, computed as:

$$\alpha^c_{b,e} = \frac{\exp(i^l_{b,e})}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}, \qquad \alpha^c_e = \frac{\exp(i^c_e)}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}$$
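Under the same stacked-weight layout assumptions as the character-step sketch above, the word-cell computation and the gated fusion into the tail character cell might be sketched as follows (sigmoid is redefined for self-containment):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_cell(x_w, h_b, c_b, Ww, bw):
    """Word cell c^w_{b,e}: input/forget gates and candidate, no output gate.
    Ww has shape (3*d, d_x + d), rows stacked as [input; forget; candidate]."""
    d = h_b.shape[0]
    z = Ww @ np.concatenate([x_w, h_b]) + bw
    i_w = sigmoid(z[:d])                          # i^w_{b,e}
    f_w = sigmoid(z[d:2*d])                       # f^w_{b,e}
    c_tilde_w = np.tanh(z[2*d:])                  # ~c^w_{b,e}
    return f_w * c_b + i_w * c_tilde_w            # c^w_{b,e}

def fuse_word_cells(x_e, i_c_e, c_tilde_e, word_cells, Wl, bl):
    """Fuse all word cells ending at index e into the character cell c^c_e,
    normalizing the extra gates i^l_{b,e} against the character input gate."""
    gates = [sigmoid(Wl @ np.concatenate([x_e, c_w]) + bl) for c_w in word_cells]
    denom = np.exp(i_c_e) + sum(np.exp(g) for g in gates)
    c_e = (np.exp(i_c_e) / denom) * c_tilde_e     # alpha^c_e term
    for g, c_w in zip(gates, word_cells):
        c_e += (np.exp(g) / denom) * c_w          # alpha^c_{b,e} terms
    return c_e
```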
After word information is introduced, the hidden vector $h^c_j$ is still computed as in the character-based model.
During model training, the loss value is back-propagated to the parameters, which are continuously optimized so that the model dynamically focuses on the more relevant words during labeling.
The Bi-LSTM layer applies the above method to each corpus sentence in forward and reverse order, yielding $\overrightarrow{h^c_j}$ and $\overleftarrow{h^c_j}$. The two groups of vectors are spliced, and the final hidden vector $h^c_j$ corresponding to each character in the corpus is computed as:

$$h^c_j = [\overrightarrow{h^c_j}\,;\,\overleftarrow{h^c_j}]$$
step 4: as shown in FIG. 2, the method of the invention introduces an attention mechanism that mimics the way humans focus attention, letting the model ignore useless information in the input medical corpus and concentrate on the target information that needs emphasis and is relevant to the current output, so as to improve output efficiency and quality.
As shown in FIG. 3, the attention mechanism assigns weights corresponding to the output of the Bi-LSTM layer: the feature vectors $h_j$ output by the preceding model are weighted and summed with their corresponding weights $\alpha_{tj}$ to obtain a new output vector $c_t$, computed as:

$$c_t = \sum_{j=1}^{m} \alpha_{tj} h_j$$

The weight $\alpha_{tj}$ corresponding to each feature vector $h_j$ is computed as:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{m} \exp(e_{tk})}$$

$$e_{tj} = v_a^{T} \tanh(W_a s_{t-1} + U_a h_j)$$

wherein $e_{tj}$ measures the degree of matching between the j-th source-end character and the t-th target-end character.
After the attention mechanism is introduced into the traditional Decoder, the Encoder no longer needs to encode all information into a fixed length; in this way the Decoder can select which information propagates into the annotation sequence.
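This additive attention computation might be sketched as follows (NumPy; v_a is the scoring vector standard in this formulation, which the patent does not name explicitly):

```python
import numpy as np

def attention_step(s_prev, H, Wa, Ua, va):
    """Compute the weights alpha_{t,j} and the context vector c_t from the
    previous decoder state s_{t-1} and the Bi-LSTM outputs H (row j = h_j)."""
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])  # e_{tj}
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()        # softmax over source positions j
    c_t = alpha @ H                    # weighted sum of the h_j
    return alpha, c_t
```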
step 5: finally, sequence labeling is performed with a standard CRF; during labeling, the CRF can use the dependency information between labels to predict the relationships between them, so as to obtain a globally optimal label sequence.
The feature vectors $c = \{c_1, c_2, \ldots, c_m\}$ output in step 4 are decoded by the conditional random field CRF to obtain the medical entity type labels of the input medical text sentence s, specifically:

P(y|c) = CRF(c, y);

wherein y ranges over all possible output tag sequences of the input medical text sentence s, and P(y|c) is the conditional probability of the output tag sequence y.
During prediction, the highest-scoring tag sequence over the input sequence is found with the dynamic-programming Viterbi Algorithm, giving the medical entity type labels $y^*$ of the input medical text sentence s, computed as:

$$y^* = \arg\max_y P(y|c)$$
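Viterbi decoding over the CRF scores might be sketched as follows (the emission and transition score matrices are assumed to be given; this is an illustrative sketch, not the patent's implementation):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence y*.
    emissions: (m, K) per-character tag scores;
    transitions: (K, K), transitions[a, b] = score of moving from tag a to b."""
    m, K = emissions.shape
    score = emissions[0].copy()                 # best score ending in each tag
    back = np.zeros((m, K), dtype=int)          # backpointers
    for t in range(1, m):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(m - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]                           # tag indices, left to right
```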
In order to verify the effectiveness of the invention, comparison experiments were conducted on the 1200 manually annotated electronic medical records of the CCKS2017-Task2 public data set against 3 classical models for the named entity recognition task. The 3 comparison models are:
(1) LSTM-CRF: introducing a conditional random field in the unidirectional long-short term memory network;
(2) Bi-LSTM-CRF: introducing a conditional random field in a bidirectional long-short term memory network;
(3) Lattice-LSTM: word-based units are introduced in character-based LSTM-CRF.
The evaluation indicators are Precision, Recall and the F1 value (F1-Score), computed as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Depending on the expected output and the actual output of the model, results can be divided into TP (True Positive), FP (False Positive), TN (True Negative) and FN (False Negative). In the named entity recognition task, their meanings are as follows:
(1) TP: the number of entities correctly identified as entities;
(2) FP: the number of non-entities erroneously identified as entities;
(3) TN: the number of non-entities correctly judged to be non-entities;
(4) FN: the number of entities erroneously judged to be non-entities.
Precision is the proportion of true entities among all samples identified as entities. Recall is the proportion of all entities in the sample that are correctly identified. The two indicators are considered jointly by taking their weighted harmonic mean, the F1 value, so that neither indicator is optimized at the expense of the other.
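The three indicators follow directly from these counts; a minimal sketch:

```python
def prf1(tp, fp, fn):
    """Entity-level precision, recall and F1 from the counts defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```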
The results of the experiment are shown in table 1:
table 1 model comparative experimental results
Figure BDA0003203592810000094
The experimental results show that the Ng-BAC model proposed by the invention performs excellently on the CCKS2017-Task2 data set, reaching a precision of 89.86%, a recall of 92.05% and an F1 value of 90.94%, an improvement over the classical models for the named entity recognition task.

Claims (1)

1. The medical named entity recognition modeling method based on fusion attention is characterized by comprising the following steps:
step 1: performing Chinese word segmentation and indexing on the medical text sentence s: matching the medical text sentence s with the dictionary to obtain a word sequence $w_1, w_2, \ldots, w_n$, wherein $w_i$ is the i-th word in the word sequence, $i = 1, 2, \ldots, n$; the k-th character of the i-th word is indexed by t(i, k), wherein k is the position of the character within the word;

the medical text sentence $s = c_1, c_2, \ldots, c_m$, wherein $c_j$ is the j-th character of the medical text sentence s, $j = 1, 2, \ldots, m$;

the word whose index starts at b and ends at e is denoted $w^d_{b,e}$, wherein b represents the index of the word's first character and e represents the index of the word's last character;
step 2: obtaining a Bi-LSTM model by splicing the forward LSTM and the reverse LSTM;

the forward LSTM is:

$$\begin{bmatrix} i^c_j \\ f^c_j \\ o^c_j \\ \tilde{c}^c_j \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{c\,T} \begin{bmatrix} x^c_j \\ h^c_{j-1} \end{bmatrix} + b^c \right)$$

$$c^c_j = f^c_j \odot c^c_{j-1} + i^c_j \odot \tilde{c}^c_j$$

$$h^c_j = o^c_j \odot \tanh(c^c_j)$$

wherein $i^c_j$, $f^c_j$ and $o^c_j$ are the input gate, the forget gate and the output gate respectively; $\tilde{c}^c_j$ is the new candidate cell information; $W^{cT}$ and $b^c$ are the model weight parameter and bias term to be learned; $\sigma$ is the Sigmoid function; $\odot$ is the Hadamard product; $x^c_j$ is the embedded representation of the character $c_j$, $x^c_j = e^c(c_j)$, wherein $e^c$ represents the character embedding lookup table; $h^c_j$ is the hidden state corresponding to the character $c_j$, and $c^c_j$ is the character cell state corresponding to $c_j$; $h^c_{j-1}$ and $c^c_{j-1}$ are the hidden state and character cell state corresponding to the previous character $c_{j-1}$;

the character cell state after word information is introduced is:

$$c^c_e = \sum_{b \in \{b'\,|\,w^d_{b',e} \in \mathbb{D}\}} \alpha^c_{b,e} \odot c^w_{b,e} \;+\; \alpha^c_e \odot \tilde{c}^c_e$$

$$\alpha^c_{b,e} = \frac{\exp(i^l_{b,e})}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}, \qquad \alpha^c_e = \frac{\exp(i^c_e)}{\exp(i^c_e) + \sum_{b''} \exp(i^l_{b'',e})}$$

wherein $\mathbb{D}$ is the dictionary; $\alpha^c_{b,e}$ and $\alpha^c_e$ are obtained by normalizing $i^l_{b,e}$ and $i^c_e$; $i^l_{b,e}$ is the additional gate structure introduced to control the contribution of every word cell $c^w_{b,e}$ ending at the character with index e to the tail character cell $c^c_e$:

$$i^l_{b,e} = \sigma\!\left( W^{l\,T} \begin{bmatrix} x^c_e \\ c^w_{b,e} \end{bmatrix} + b^l \right)$$

wherein $W^{lT}$ and $b^l$ are the model weight parameter and bias term to be learned;

the word cell state $c^w_{b,e}$ is:

$$\begin{bmatrix} i^w_{b,e} \\ f^w_{b,e} \\ \tilde{c}^w_{b,e} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}\!\left( W^{w\,T} \begin{bmatrix} x^w_{b,e} \\ h^c_b \end{bmatrix} + b^w \right)$$

$$c^w_{b,e} = f^w_{b,e} \odot c^c_b + i^w_{b,e} \odot \tilde{c}^w_{b,e}$$

wherein $i^w_{b,e}$ is the input gate and $f^w_{b,e}$ is the forget gate; $\tilde{c}^w_{b,e}$ is the new candidate word cell information; $W^{wT}$ and $b^w$ are the model weight parameter and bias term to be learned; $h^c_b$ is the hidden state corresponding to the first character of the word; $x^w_{b,e}$ is the embedded representation of the word $w^d_{b,e}$, $x^w_{b,e} = e^w(w^d_{b,e})$, wherein $e^w$ represents the word embedding lookup table obtained from the dictionary conversion of step 1;

the reverse LSTM is analogous to the forward LSTM;

applying the two directions to the medical text sentence s respectively yields the two groups of vectors $\overrightarrow{h^c_j}$ and $\overleftarrow{h^c_j}$; the two groups of vectors are spliced, and the final hidden vector $h^c_j$ corresponding to each character in s is computed as:

$$h^c_j = [\overrightarrow{h^c_j}\,;\,\overleftarrow{h^c_j}]$$
step 3: assigning to the output of step 2 a corresponding weight $\alpha_{tj}$ through the attention mechanism; the feature vectors $h_j$ are weighted and summed with their corresponding weights $\alpha_{tj}$ to obtain a new output vector $c_t$, specifically:

$$c_t = \sum_{j=1}^{m} \alpha_{tj} h_j$$

the weight $\alpha_{tj}$ corresponding to the feature vector $h_j$ is obtained as follows:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{m} \exp(e_{tk})}$$

$$e_{tj} = v_a^{T} \tanh(W_a s_{t-1} + U_a h_j)$$

wherein $e_{tj}$ measures the degree of matching between the j-th source-end character and the t-th target-end character; $s_{t-1}$ is the hidden layer state at the previous time step; $W_a$ and $U_a$ are weight matrices and $v_a$ is a weight vector;
step 4: the feature vectors $c = \{c_1, c_2, \ldots, c_m\}$ output in step 3 are decoded by the conditional random field CRF to obtain the medical entity type labels of the input medical text sentence s, specifically:

P(y|c) = CRF(c, y);

wherein y ranges over all possible output tag sequences of the input medical text sentence s, and P(y|c) is the conditional probability of the output tag sequence y;

during prediction, the highest-scoring tag sequence over the input sequence is found with the dynamic-programming Viterbi Algorithm, giving the medical entity type labels $y^*$ of the input medical text sentence s:

$$y^* = \arg\max_y P(y|c)$$
CN202110927320.1A — filed 2021-08-10, priority 2021-08-10 — Medical named entity recognition modeling method based on fusion attention — Active — granted as CN113536799B

Priority Applications (1)

CN202110927320.1A — priority date 2021-08-10 — filing date 2021-08-10 — Medical named entity recognition modeling method based on fusion attention


Publications (2)

Publication Number — Publication Date
CN113536799A (en) — 2021-10-22
CN113536799B — 2023-04-07

Family ID: 78122434

Family Applications (1)

CN202110927320.1A (Active) — priority/filing date 2021-08-10 — Medical named entity recognition modeling method based on fusion attention

Country Status (1): CN — CN113536799B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818721B (en) * 2022-06-30 2022-11-01 湖南工商大学 Event joint extraction model and method combined with sequence labeling
CN115146644B (en) * 2022-09-01 2022-11-22 北京航空航天大学 Alarm situation text-oriented multi-feature fusion named entity identification method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN110032739A (en) * 2019-04-18 2019-07-19 清华大学 Chinese electronic health record name entity abstracting method and system
CN110866401A (en) * 2019-11-18 2020-03-06 山东健康医疗大数据有限公司 Chinese electronic medical record named entity identification method and system based on attention mechanism
CN111079377A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Method for recognizing named entities oriented to Chinese medical texts
CN111737991A (en) * 2020-07-01 2020-10-02 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN111783466A (en) * 2020-07-15 2020-10-16 电子科技大学 Named entity identification method for Chinese medical records
CN112597774A (en) * 2020-12-14 2021-04-02 山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10606846B2 (en) * 2015-10-16 2020-03-31 Baidu Usa Llc Systems and methods for human inspired simple question answering (HISQA)


Also Published As

Publication number Publication date
CN113536799A (en) 2021-10-22


Legal Events

Date Code Title Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant