CN115017910A - Entity relation joint extraction method, network, equipment and computer readable storage medium based on Chinese electronic medical record - Google Patents


Info

Publication number
CN115017910A
Authority
CN
China
Prior art keywords
entity
attention
features
characteristic
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210749641.1A
Other languages
Chinese (zh)
Inventor
李丽双
王泽昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202210749641.1A priority Critical patent/CN115017910A/en
Publication of CN115017910A publication Critical patent/CN115017910A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

An entity relation joint extraction method, network, device, and computer-readable storage medium based on Chinese electronic medical records belong to the field of natural language processing and address two problems: long-distance context semantic information cannot be fully captured, and the interaction between entity recognition and relation extraction information is insufficient. Entity task features are acquired and passed through a conditional random field to obtain an entity tag sequence; the entities obtained from the tag sequence are paired pairwise into multiple entity pairs; according to the relation extraction features, the features at each entity's positions are weighted and averaged to obtain the entity's feature; the features of the two entities of each pair are spliced with the sentence representation of the pre-trained model, and a classifier outputs the entity relation type obtained from the spliced representation, thereby improving the accuracy of entity relation extraction.

Description

Entity relation joint extraction method, network, equipment and computer readable storage medium based on Chinese electronic medical record
Technical Field
The invention belongs to the field of natural language processing and relates to a method for entity relation joint extraction on Chinese electronic medical record (EMR) texts, in particular to a Memory-LSTM encoding algorithm, a deep fusion algorithm for entity and relation information based on a Co-Attention mechanism, entity recognition based on a self-attention mechanism and the CRF algorithm, and relation extraction based on a multilayer CNN.
Background
Entity Relation Extraction is one of the core tasks in Information Extraction (IE) and comprises two subtasks, Named Entity Recognition (NER) and Relation Extraction (RE), which aim to automatically extract named entities of predefined types from unstructured text and classify the type of relation existing between each pair of entities. In the medical field, electronic medical records hold abundant clinical information, such as important medical entities (diseases, examinations, symptoms, treatments, body parts, and the like) and the semantic relations among the various types of medical entities. Entity relation extraction can mine valuable medical information from massive electronic medical records and provide important technical support for many downstream tasks such as medical question-answering systems, high-quality medical knowledge graph construction, and clinical decision support.
The entity relation extraction task was first proposed by the Message Understanding Conference (MUC) in 1996; subsequently, authoritative evaluation campaigns such as ACE, TAC, and SemEval provided high-quality evaluation corpora and accepted evaluation standards for the task, strongly promoting research on entity relation extraction in the general domain. In 2010, i2b2 (Informatics for Integrating Biology and the Bedside) published a medical entity relation extraction task based on English electronic medical records, bringing medical entity relation extraction into the research spotlight. For Chinese medical entity relation extraction, in 2020 the sixth China Health Information Processing conference (CHIP 2020) released a medical text information extraction task, and the Chinese Biomedical Language Understanding Evaluation benchmark (CBLUE) subsequently went online. It is currently the first publicly released benchmark in the Chinese medical information processing field, aiming to build a uniformly recognized performance evaluation platform for medical information systems and to promote the rapid development of medical informatics.
Entity relation extraction methods fall mainly into pipeline methods and joint extraction methods. The traditional approach is pipeline-based and treats entity recognition and relation extraction as two independent subtasks: for a piece of text, all entities are first identified, and then the relation category of each entity pair is determined. However, the pipeline method suffers from error propagation, i.e., errors accumulated in an earlier stage propagate to the next stage. For example, erroneous entities obtained in the entity recognition stage can seriously affect the training and performance of the relation model in the relation extraction stage. Meanwhile, the pipeline method ignores the latent association between entity recognition and relation extraction, so traditional pipeline methods cannot achieve satisfactory results.
To address the problems of the pipeline method, researchers proposed joint models that perform entity recognition and relation extraction simultaneously. Joint extraction effectively alleviates error propagation and can exploit the relatedness of the two tasks so that their information interacts, mining deeper semantic information. Miwa and Bansal (End-to-end relation extraction using LSTMs on sequences and tree structures [C]. Association for Computational Linguistics, 2016) first proposed an end-to-end model for joint entity relation extraction in which relation extraction shares the features of the entity recognition task, demonstrating that entity recognition features help improve relation extraction performance. Zheng et al. (Joint extraction of entities and relations based on a novel tagging scheme [C]. Association for Computational Linguistics, 2017) unified entity relation extraction as a sequence labeling problem, letting the two tasks share a unified network model and thereby realizing interaction of hidden information. However, these early joint models could not efficiently extract more complex nested entities and overlapping relations. To solve this problem, Bekoulis et al. (Joint entity recognition and relation extraction as a multi-head selection problem [J]. Expert Systems with Applications, 2018) proposed a table-filling-style decoding method that can represent complex entities and overlapping relations within a table-filling framework. Wei et al. (A novel cascade binary tagging framework for relational triple extraction [C]. Association for Computational Linguistics, 2020) converted triple extraction into first extracting subject words and then extracting object words and relation types, enabling the model to effectively extract overlapping relations.
The above models generally adopt parameter sharing to model the information interaction of the two tasks, i.e., entity recognition and relation extraction share the same word embeddings. However, this mode of interaction is insufficient and cannot deeply join the information of the two tasks. First, the two tasks only share the same input features, after which two models learn task features independently; second, the interaction is unidirectional: relation extraction uses the recognized entity features for relation classification, but conversely entity recognition does not effectively use the features of relation extraction. Wang et al. (Two are better than one: Joint entity and relation extraction with table-sequence encoders [C]. Empirical Methods in Natural Language Processing, 2020) adopted a multi-layer Transformer structure that lets the entity recognition and relation extraction features interact at every layer, enhancing information interaction. Yan et al. (A partition filter network for joint entity and relation extraction [C]. Empirical Methods in Natural Language Processing, 2021) split and recombine entity features and relation features so that each task can fuse the information of the other, realizing bidirectional interaction and proving that relation extraction also promotes entity recognition.
The above analysis shows that entity relation extraction research has achieved rich results. However, in the Chinese medical domain the performance of joint extraction is still low (the highest F1 scores for entity recognition and relation extraction on the CBLUE 2.0 benchmark are only 70.1% and 62.8%, respectively). The main reasons are as follows. First, the entity distribution of medical text (e.g., electronic medical records) is sparse with large spans, and traditional recurrent neural networks cannot learn long-distance dependency information. Second, medical entities are complex and medical texts contain a large amount of redundant information, making them harder for models to discriminate. Finally, current joint models still suffer from insufficient information interaction; how to construct deeper joint schemes and strengthen inter-task information interaction remains a research topic of great concern to many researchers.
Disclosure of Invention
The invention provides an entity relation joint extraction method, network, device, and computer-readable storage medium based on Chinese electronic medical records, which extract medical entities from large numbers of unstructured electronic medical records and classify entity relations, solving the problems that prior research cannot fully capture long-distance context semantic information and that the interaction between entity recognition and relation extraction information is insufficient.
Based on the above purpose, the invention provides the following technical scheme:
an entity relationship joint extraction method based on Chinese electronic medical records comprises the following steps:
encoding each position of a sentence of the Chinese electronic medical record to obtain the output feature h̃_t of each position of the sentence, and splicing the output features of all positions to obtain the sentence feature representation;
splicing the forward and reverse timing features of the sentence feature representation to obtain the bidirectional timing feature H_m of the sentence;
passing the bidirectional timing feature H_m through a self-attention network to obtain the entity task feature H_ner, and passing H_ner through a conditional random field to obtain the entity tag sequence;
passing the entity task feature H_ner and the bidirectional timing feature H_m of the sentence through a Co-Attention network to obtain the deep fusion feature H_merge, and passing H_merge through multilayer convolution to obtain the relation extraction feature H_re;
pairing the entities obtained through the entity tag sequence pairwise to obtain multiple entity pairs;
according to the relation extraction feature H_re, weighting and averaging the features at the positions of each entity to obtain the entity feature h_e;
splicing the entity features h_e of the two entities of each entity pair with the sentence representation h_cls of the pre-trained model to obtain a spliced representation, and outputting through a classifier the entity relation type obtained from the spliced representation.
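To make the last three steps concrete, here is a minimal numpy sketch of building an entity-pair representation and classifying it. All shapes, the uniform averaging weights, and the random classifier parameters are illustrative assumptions, not the patent's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, C = 10, 8, 7          # sentence length, feature dim, number of relation types

H_re = rng.standard_normal((L, H))   # relation extraction features, one row per character
h_cls = rng.standard_normal(H)       # pre-trained model's sentence representation

def entity_feature(H_re, span, weights=None):
    """Weighted average of the features at the entity's character positions."""
    feats = H_re[span]                        # (len(span), H)
    if weights is None:
        weights = np.ones(len(span)) / len(span)  # uniform weights as a default
    return weights @ feats                    # (H,)

h_e1 = entity_feature(H_re, [1, 2])     # entity 1 occupies positions 1-2
h_e2 = entity_feature(H_re, [5, 6, 7])  # entity 2 occupies positions 5-7

# Splice the two entity features with the sentence representation (size 1 x 3H)
pair = np.concatenate([h_e1, h_e2, h_cls])

# Linear classifier over relation types
W, b = rng.standard_normal((C, 3 * H)), np.zeros(C)
logits = W @ pair + b
pred_relation = int(np.argmax(logits))
```

The spliced representation has size 1 × 3H, matching the network description later in the document.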
In one aspect, the invention further relates to an entity relation joint extraction network based on Chinese electronic medical records, comprising:
an improved LSTM network based on a Memory mechanism,
an entity extraction network based on self-attention,
an entity-relation deep information fusion network based on the Co-Attention mechanism, and
a relation extraction network based on multilayer convolution;
the LSTM improved network based on the Memory mechanism consists of two LSTM units, two storage units and two gate control units which are respectively expressed as
Figure BDA0003720748250000033
Wherein forward and reverse arrows indicate the direction of data input; input data X ═ X 0 ,x 1 ,...,x L-1 Dimension of L × H, where L is the length of the input data and H is the dimension of the feature; input size of LSTM cell is 1 XH for each character of input dataCharacteristic x t E.g. X, which is input to the forward LSTM unit in turn to obtain
Figure BDA0003720748250000041
The output size is 1 XH/2; the storage unit is a full connection layer with input size of 1 XH/2, the feature to be forward LSTM unit encoded
Figure BDA0003720748250000042
Obtaining memory characteristics from forward memory cells
Figure BDA0003720748250000043
The output size is 1 XH/2; the gate control units are linear layers and are connected with sigmoid functions, the input size of the gate control units is 1 multiplied by H/2, the output size of the gate control units is 1 multiplied by 1, and the characteristics are memorized
Figure BDA0003720748250000044
Obtaining gate control weights by a gate control unit
Figure BDA0003720748250000045
Features for then encoding LSTM units
Figure BDA0003720748250000046
And memory features
Figure BDA0003720748250000047
By weight
Figure BDA0003720748250000048
Linear combination to obtain
Figure BDA0003720748250000049
The output size is 1 XH/2; all characters are sequentially passed through forward LSTM units and spliced to obtain
Figure BDA00037207482500000410
The output size is L multiplied by H/2, all characters are reversely and sequentially passed through an inverse LSTM unit to obtain
Figure BDA00037207482500000411
Splicing the two-way Memory with the forward characteristic to finally obtain the bidirectional Memory-LSTM characteristic
Figure BDA00037207482500000412
The output size is L multiplied by H;
the self-attention-based entity extraction network consists of a self-attention network and a CRF decoding module; the input size of the self-attention network is L multiplied by H, and the output size is L multiplied by H; obtaining entity task characteristics H through self-attention network ner (ii) a Entity task characteristics H ner Decoding is carried out through a CRF module, the input size of the CRF module is L multiplied by H, and the output is a label sequence L of the path with the highest score pred
the Co-Attention-based entity-relation deep information fusion network consists of two attention feature learners; each learner is a linear layer with input size L × H and output size L × 1; the entity feature H_ner encoded by the self-attention network and the Memory-LSTM feature H_m are passed through their respective learners to output the attention scores s_ner and s_m; the two attention scores are spliced and passed through a softmax activation function to obtain the attention weights attn_m, attn_ner ∈ R^{L×1}; finally, each attention weight is multiplied element-wise with its corresponding original feature and the results are combined to obtain the fused feature H_merge, which is taken as the input of the multilayer-convolution-based relation extraction module;
the relation extraction model based on the multilayer convolution is formed by stacking two one-dimensional convolution units, namely Conv1 and Conv 2; the input size of Conv1 is L × H, the output size is L × H/2, and the convolution kernel size is 3; the input size of Conv2 is L × H/2, the output size is L × H, and the convolution kernel size is 3; h to be characterized by a Co-Attention network merge Obtaining output H through convolution operation re The output size is L multiplied by H; then weighting the characteristics of the corresponding position of the entity to obtain the entity characteristics h e The characteristic size is 1 XH; splicing the entity features pairwise, and integrating the entity features into the global features to obtain entity pair features
Figure BDA00037207482500000415
Figure BDA00037207482500000416
The characteristic size is 1 multiplied by 3H; classifying the characteristics through a linear layer and outputting a prediction result y ored ∈R 1×C
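The Conv1/Conv2 shape pipeline described above can be illustrated with a small numpy sketch. The naive convolution loop, the ReLU between the two layers, and the "same" padding that keeps the length at L are assumptions for demonstration, not details stated in the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
L, H = 12, 8

def conv1d(x, W, b):
    """1-D convolution over the length axis, kernel size 3, 'same' padding.
    x: (L, C_in), W: (C_out, C_in, 3), b: (C_out,) -> (L, C_out)."""
    Lx, C_in = x.shape
    C_out = W.shape[0]
    xp = np.pad(x, ((1, 1), (0, 0)))          # pad so the output length stays L
    out = np.empty((Lx, C_out))
    for t in range(Lx):
        window = xp[t:t + 3]                  # (3, C_in) slice centred on position t
        out[t] = np.tensordot(W, window.T, axes=([1, 2], [0, 1])) + b
    return out

H_merge = rng.standard_normal((L, H))                 # Co-Attention output, L x H
W1, b1 = rng.standard_normal((H // 2, H, 3)) * 0.1, np.zeros(H // 2)
W2, b2 = rng.standard_normal((H, H // 2, 3)) * 0.1, np.zeros(H)

mid = np.maximum(conv1d(H_merge, W1, b1), 0)          # Conv1: L x H -> L x H/2
H_re = conv1d(mid, W2, b2)                            # Conv2: L x H/2 -> L x H
```

The intermediate feature has shape L × H/2 and the final relation feature H_re recovers shape L × H, matching the sizes stated for Conv1 and Conv2.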
In one aspect, the invention also relates to an electronic device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method when executing the computer program.
In one aspect, the invention also relates to a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method.
Beneficial effects: the invention improves the accuracy of entity relation extraction.
Drawings
FIG. 1: framework diagram of the entity relation joint extraction model based on Memory-LSTM.
Detailed Description
In this method, a training set is first established to train the model, then the prediction performance of the model is tested, and the model is compared with advanced pipeline and joint models to verify the effectiveness of the proposed model.
1. Data set preprocessing
First, electronic medical record data are manually annotated according to the entities and relation types defined by the invention. The constructed dataset contains 12219 samples and is randomly divided into a training set of 9773 samples and a test set of 2446 samples, a ratio of 80%:20%.
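The stated 80%:20% random partition can be sketched as follows; the sample indices are placeholders standing in for the annotated records:

```python
import random

samples = list(range(12219))          # placeholder indices for the 12219 samples
random.seed(0)
random.shuffle(samples)               # random split, as described
n_train = 9773                        # 80% of the corpus
train_set, test_set = samples[:n_train], samples[n_train:]
```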
2. Network model structure
As shown in FIG. 1, the network model constructed by the invention includes an LSTM improved network based on a Memory mechanism, an entity extraction network based on self-Attention, an entity relationship depth information fusion network based on a Co-Attention mechanism, and a relationship extraction network based on multilayer convolution.
In the invention, the Memory-mechanism-based improved LSTM network consists of two LSTM units, two memory units, and two gate control units, where the forward and reverse variants indicate the direction of data input. The input data X = {x_0, x_1, ..., x_{L-1}} has dimension L × H, where L is the length of the input data and H is the feature dimension. The input size of an LSTM unit is 1 × H; the feature x_t ∈ X of each character of the input data is fed in turn to the forward LSTM unit to obtain h_t, with output size 1 × H/2. The memory unit is a fully connected layer with input size 1 × H/2; the feature h_t encoded by the forward LSTM unit is passed through the forward memory unit to obtain the memory feature m_t, with output size 1 × H/2. The gate control unit is a linear layer followed by a sigmoid function, with input size 1 × H/2 and output size 1 × 1; the memory feature m_t is passed through the gate control unit to obtain the gate weight g_t. The feature h_t encoded by the LSTM unit and the memory feature m_t are then linearly combined with weight g_t to obtain h̃_t, with output size 1 × H/2. Thus the output h̃_t of each character not only includes the output of the LSTM unit but also integrates the memory feature, alleviating the forgetting of long-distance information, and the proportion of the two features can be dynamically regulated through the gating mechanism. All characters pass in turn through the forward LSTM unit and are spliced to obtain the forward feature with output size L × H/2; likewise, all characters pass in reverse order through the backward LSTM unit to obtain the reverse feature, which is spliced with the forward feature to finally obtain the bidirectional Memory-LSTM feature H_m with output size L × H.
In the present invention, the self-attention-based entity extraction network consists of a self-attention network and a CRF decoding module. The input and output sizes of the self-attention network are both L × H. The entity features are learned through the self-attention network, which strengthens the weight of important features with the attention mechanism and reduces the interference of redundant information, yielding the entity task feature H_ner. H_ner is decoded by the CRF module, whose input size is L × H and whose outputs are the tag sequence L_pred of the highest-scoring path and the entity recognition loss loss_ner. The module uses the Viterbi algorithm to compute the real-path score s_real and the total score of all paths s_total; the entity recognition loss is loss_ner = s_total − s_real.
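The Viterbi decoding step can be illustrated with a toy tag set and hand-picked scores. The emission and transition tables below are invented for demonstration and are not the patent's learned CRF parameters:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.
    emissions: (L, T) per-position tag scores; transitions: (T, T) tag-to-tag scores."""
    L, T = emissions.shape
    score = emissions[0].copy()               # best score ending in each tag so far
    back = np.zeros((L, T), dtype=int)        # backpointers for path recovery
    for t in range(1, L):
        cand = score[:, None] + transitions + emissions[t][None, :]  # (T, T)
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    best_last = int(np.argmax(score))
    path = [best_last]
    for t in range(L - 1, 0, -1):             # walk the backpointers in reverse
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(np.max(score))

# Toy example: 3 tags (O, B, I) over 4 positions
em = np.array([[1., 2., 0.], [0., 0., 3.], [2., 0., 1.], [3., 0., 0.]])
tr = np.array([[0., 0., -9.], [-1., -9., 1.], [0., -9., 1.]])  # e.g. O->I heavily penalized
L_pred, s_best = viterbi(em, tr)
# L_pred is the highest-scoring tag path; s_best is its score
```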
In the invention, the Co-Attention-based entity-relation deep information fusion network consists of two attention feature learners; each learner is a linear layer with input size L × H and output size L × 1. The entity feature H_ner encoded by the self-attention network and the Memory-LSTM feature H_m are passed through their respective learners to output the attention scores s_ner and s_m; the two attention scores are then spliced and passed through a softmax activation function to obtain the attention weights attn_m, attn_ner ∈ R^{L×1}; finally, each attention weight is multiplied element-wise with its corresponding original feature and the results are combined to obtain the fused feature H_merge, which is used as the input of the relation extraction module.
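A minimal numpy sketch of this Co-Attention fusion follows. The per-position softmax over the two score streams and the summation used to combine the weighted features are illustrative assumptions about details the text leaves open:

```python
import numpy as np

rng = np.random.default_rng(2)
L, H = 6, 4

H_ner = rng.standard_normal((L, H))   # self-attention-encoded entity features
H_m = rng.standard_normal((L, H))     # Memory-LSTM features

# Two linear attention learners, each mapping L x H -> L x 1
w_ner, w_m = rng.standard_normal(H), rng.standard_normal(H)
s_ner = H_ner @ w_ner                 # (L,) attention scores for the entity stream
s_m = H_m @ w_m                       # (L,) attention scores for the timing stream

# Splice the two scores and normalize per position with softmax,
# giving each position a pair of weights that sum to one
scores = np.stack([s_m, s_ner], axis=1)                   # (L, 2)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
attn_m, attn_ner = weights[:, 0:1], weights[:, 1:2]       # each (L, 1)

# Weight each feature stream by its attention and combine (here: sum) into H_merge
H_merge = attn_m * H_m + attn_ner * H_ner                 # (L, H)
```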
In the present invention, the multilayer-convolution-based relation extraction model is formed by stacking two one-dimensional convolution units, Conv1 and Conv2. Conv1 has input size L × H, output size L × H/2, and convolution kernel size 3; Conv2 has input size L × H/2, output size L × H, and convolution kernel size 3. The feature H_merge produced by the Co-Attention network passes through the convolution operations to obtain the output H_re with size L × H. The features at each entity's positions are then weighted to obtain the entity feature h_e with size 1 × H. The entity features are spliced pairwise and combined with the global sentence feature to obtain the entity pair feature h_<e1,e2> with size 1 × 3H. The feature is classified through a linear layer, outputting the prediction y_pred ∈ R^{1×C}; finally the relation extraction loss loss_re is computed with cross entropy. The total loss of the joint model is loss_ner + loss_re.
3. Model training
For a training sample, as shown in FIG. 1, the electronic medical record text first passes through a BERT pre-trained model to obtain the contextual feature representation of each character, and is then encoded through the Memory-LSTM network to obtain the feature H_m. The encoded feature H_m passes through the self-attention network to obtain the entity task feature H_ner, which is then decoded by the CRF module to obtain the entity recognition loss loss_ner and the predicted entity tag sequence L_pred. The entity task feature H_ner encoded by the self-attention network and the Memory-LSTM feature H_m pass through the Co-Attention network to obtain the deep-fusion feature H_merge, which then passes through multilayer convolution to obtain the relation extraction feature H_re. The feature representation h_<e1,e2> of each entity pair is obtained according to the entity tag sequence, and the loss loss_re is computed with cross entropy. Finally the total loss loss_ner + loss_re is backpropagated to train the model.
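The joint objective can be sketched as follows. The cross-entropy helper is a standard single-example formulation; the scalar NER loss value is a placeholder standing in for the CRF loss, not a number produced by the patent's model:

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of a single example: -log softmax(logits)[label]."""
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Toy logits for one relation example and a placeholder scalar CRF (NER) loss
loss_re = cross_entropy(np.array([2.0, 0.5, -1.0]), label=0)
loss_ner = 0.37                               # placeholder value for the CRF loss
total_loss = loss_ner + loss_re               # joint objective that is backpropagated
```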
4. Method for designing network model
The method comprises the following steps:
step 1, constructing an electronic medical record data set
The corpus of the invention comes from real electronic medical record data of the internal medicine, surgery, pediatrics, and other departments of a hospital. Following the concept standard of the Unified Medical Language System (UMLS) and the characteristics of the Chinese electronic medical record corpus, the invention defines 5 entity types: disease, body part, symptom, examination, and treatment; and 7 relation categories: "disease-disease", "disease-symptom", "disease-body part", "treatment-symptom", "treatment-disease", "examination-symptom", and "examination-disease".
Step 2, constructing an entity relation joint extraction model
(I) Constructing an improved LSTM network based on the Memory mechanism
Relations in electronic medical records exhibit long distances between head and tail entities and sparse entity relations, and the traditional Long Short-Term Memory network (LSTM) cannot effectively extract relation features of head and tail entities with large spans. Therefore, the invention proposes a Memory-LSTM structure: a Memory mechanism stores the features obtained at each LSTM unit iteration, and the LSTM unit can use the information stored in the Memory at the next iteration, effectively alleviating the forgetting of feature information in the traditional LSTM.
The LSTM network is mainly composed of an input gate, a forgetting gate and an output gate, and the transmission and loss of information are controlled through a gate mechanism. Wherein at the time t, the LSTM unit has three input features, namely an input feature x at the current time t Hidden layer feature h of previous moment t-1 And LSTM cell state C at the previous time t-1 Finally, the output characteristic h at the time t is obtained t The LSTM model formula is as follows:
f_t = σ(W_f·[h_(t−1), x_t] + b_f)
i_t = σ(W_i·[h_(t−1), x_t] + b_i)
o_t = σ(W_o·[h_(t−1), x_t] + b_o)
C̃_t = tanh(W_C·[h_(t−1), x_t] + b_C)
C_t = f_t ⊙ C_(t−1) + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh(C_t)
wherein W_(·), b_(·) denote trainable weights and biases, tanh and σ are activation functions, and ⊙ denotes element-wise multiplication. In the LSTM structure, the hidden-layer feature h_t is sensitive to short-term input features and changes quickly, while the cell state C_t preserves the long-term state and changes slowly. That is, the cell state C_t contains information from long-distance text, so the invention introduces a memory unit M to store the C_t features; at time t the input of the memory unit is the cell state C_t, formulated as follows:
m_t = W_M·C_t + b_M
wherein W_M, b_M denote the trainable parameters of the memory unit. Through t−1 iterations, the previous cell states C_0 ~ C_(t−1) are stored in W_M, b_M, and a memory feature m_t containing long-distance text information is obtained. A gating mechanism then linearly combines the output feature h_t of the LSTM with the memory feature m_t to obtain the final output ĥ_t at time t, formulated as follows:
g_t = σ(W_g·m_t + b_g)
ĥ_t = g_t ⊙ h_t + (1 − g_t) ⊙ m_t
wherein W_g, b_g denote the trainable parameters of the gate unit, σ denotes the sigmoid activation function, and ⊙ denotes element-wise multiplication. For a text of length k, the output features ĥ_0 ~ ĥ_(k−1) of all positions are spliced to obtain the complete feature representation of the sentence encoded by Memory-LSTM; finally the forward and reverse features are spliced to obtain the bidirectional Memory-LSTM feature H_m:
H_m = [H_m^→ ; H_m^←]
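As a concrete illustration of the Memory-LSTM unit described above, the following NumPy sketch implements a standard LSTM step, the memory unit m_t = W_M·C_t + b_M, and a gated combination of h_t and m_t. The class and function names, the weight initialization, and the exact form of the gated combination (g_t ⊙ h_t + (1 − g_t) ⊙ m_t) are illustrative assumptions, not taken verbatim from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryLSTMCell:
    """Sketch of a Memory-LSTM unit: a standard LSTM cell plus a memory
    unit M that re-encodes the cell state C_t, and a gate g_t that
    combines h_t with the memory feature m_t (assumed form)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        z, s = input_size + hidden_size, 0.1
        # standard LSTM parameters (forget / input / output / candidate)
        self.Wf, self.bf = rng.normal(0, s, (hidden_size, z)), np.zeros(hidden_size)
        self.Wi, self.bi = rng.normal(0, s, (hidden_size, z)), np.zeros(hidden_size)
        self.Wo, self.bo = rng.normal(0, s, (hidden_size, z)), np.zeros(hidden_size)
        self.Wc, self.bc = rng.normal(0, s, (hidden_size, z)), np.zeros(hidden_size)
        # memory unit: m_t = W_M . C_t + b_M
        self.WM, self.bM = rng.normal(0, s, (hidden_size, hidden_size)), np.zeros(hidden_size)
        # gate unit: g_t = sigmoid(W_g . m_t + b_g)
        self.Wg, self.bg = rng.normal(0, s, (hidden_size, hidden_size)), np.zeros(hidden_size)

    def step(self, x_t, h_prev, c_prev):
        hx = np.concatenate([h_prev, x_t])
        f = sigmoid(self.Wf @ hx + self.bf)
        i = sigmoid(self.Wi @ hx + self.bi)
        o = sigmoid(self.Wo @ hx + self.bo)
        c_tilde = np.tanh(self.Wc @ hx + self.bc)
        c = f * c_prev + i * c_tilde           # C_t
        h = o * np.tanh(c)                     # h_t
        m = self.WM @ c + self.bM              # memory feature m_t
        g = sigmoid(self.Wg @ m + self.bg)     # gate g_t
        h_out = g * h + (1.0 - g) * m          # gated combination (assumed)
        return h_out, h, c

def encode(cell, xs):
    """Run the cell over a sequence and stack the gated outputs."""
    h = np.zeros(cell.bf.shape[0])
    c = np.zeros_like(h)
    outs = []
    for x in xs:
        h_out, h, c = cell.step(x, h, c)
        outs.append(h_out)
    return np.stack(outs)
```

Running `encode` over a forward and a reversed copy of the sequence and concatenating the results would give the bidirectional feature H_m.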
(II) Constructing a self-attention-based entity extraction network
Electronic medical record text contains a large amount of redundant information, and medical entities are sparsely distributed. A traditional sequence-based entity recognition model learns entity information and redundant information equally; when there is too much redundant information, the learning of entity features is hindered and entity extraction suffers. Therefore, the invention uses a self-attention mechanism to learn entity features, strengthening the weight of important information such as medical entities through the attention model and reducing the interference of irrelevant redundant information. Self-attention is calculated as follows:
Q = W_Q·H_m + b_Q
K = W_K·H_m + b_K
V = W_V·H_m + b_V
H_ner = softmax(Q·K^T / √d_k)·V
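The self-attention step above can be sketched as plain single-head scaled dot-product attention over the Memory-LSTM features; the function name and the random parameters below are illustrative.

```python
import numpy as np

def self_attention(H_m, W_Q, b_Q, W_K, b_K, W_V, b_V):
    """Single-head scaled dot-product self-attention over the
    Memory-LSTM features H_m (shape L x H). Weight names mirror the
    formulas in the text; their values here are illustrative."""
    Q = H_m @ W_Q + b_Q
    K = H_m @ W_K + b_K
    V = H_m @ W_V + b_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # L x L logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A @ V                                   # H_ner, shape L x d_k

# Illustrative usage with random parameters
rng = np.random.default_rng(0)
L, H = 6, 8
H_m = rng.normal(size=(L, H))
W_Q, W_K, W_V = (rng.normal(size=(H, H)) for _ in range(3))
b = np.zeros(H)
H_ner = self_attention(H_m, W_Q, b, W_K, b, W_V, b)
```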
After the entity feature H_ner is obtained by the attention mechanism, the entity tag sequence is obtained through a Conditional Random Field (CRF) and the loss function is calculated. For a sequence of length N with M entity tag types, the calculation is as follows:
s_i = Σ_(j=0)^(N−1) P_(j, y_j) + Σ_(j=0)^(N−2) A_(y_j, y_(j+1))
loss_ner = −log( e^(s_real) / Σ_i e^(s_i) )
wherein s_i is the score of a candidate tag sequence and s_real is the score of the true tag sequence; P_(i, y_j) denotes the score of tag y_j at position w_i, and A_(y_i, y_j) denotes the score of transitioning from tag y_i at the current position to tag y_j at the next position. The entity recognition loss loss_ner is calculated through the CRF.
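The CRF sequence score and loss above can be illustrated by brute-force enumeration over all tag sequences of a short input. This is a didactic sketch only (real CRFs compute the partition function with the forward algorithm), and all names are illustrative.

```python
import itertools
import numpy as np

def crf_loss(P, A, y_real):
    """Brute-force CRF negative log-likelihood for a short sequence.
    P[i, t]: emission score of tag t at position i; A[t1, t2]:
    transition score from tag t1 to tag t2; y_real: gold tag sequence.
    Enumerating all M**N paths is for illustration only."""
    N, M = P.shape

    def path_score(y):
        s = sum(P[i, y[i]] for i in range(N))           # emission terms
        s += sum(A[y[i], y[i + 1]] for i in range(N - 1))  # transitions
        return s

    all_scores = np.array([path_score(y)
                           for y in itertools.product(range(M), repeat=N)])
    # loss_ner = -log( exp(s_real) / sum_i exp(s_i) )
    return np.logaddexp.reduce(all_scores) - path_score(tuple(y_real))
```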
(III) constructing an entity relationship depth information fusion network based on a Co-Attention mechanism
In joint extraction models, how to model the information interaction between entity recognition and relation extraction has always been a research focus. Most current work uses only the extracted entity features as interaction information, ignoring the entity's context and other information potentially useful for relation extraction. Therefore, the invention proposes a depth information fusion algorithm based on a Co-Attention mechanism, which realizes character-level information interaction over the complete features of the entity recognition task, achieving depth information fusion. Concretely, an attention feature learner first maps the entity feature H_ner obtained by self-attention and the Memory-LSTM feature H_m to attention features α_ner, α_m ∈ R^(L×1) (L is the sentence length); the two attention features are then spliced and passed through a softmax activation function to obtain attention scores attn_m, attn_ner ∈ R^(L×1); finally the original features are multiplied by the attention scores to obtain the fused feature, calculated as follows:
α_m = W_m·H_m + b_m
α_ner = W_ner·H_ner + b_ner
[attn_m ; attn_ner] = softmax([α_m ; α_ner])
H_merge = attn_m ⊙ H_m + attn_ner ⊙ H_ner
wherein the attention scores are attn_m = (attn_m^0, …, attn_m^(L−1)) and attn_ner = (attn_ner^0, …, attn_ner^(L−1)); each element represents the proportion of the feature at the current position in the fused feature. For example, the co-attention fused feature at position i of the text is H_merge^i, and the attention scores attn_m^i and attn_ner^i are dynamically adjusted to the optimal proportion during training, realizing character-level depth information fusion. Since the attention scores satisfy attn_m^i + attn_ner^i = 1, the scale is unchanged before and after feature fusion. The co-attention mechanism not only considers all features from the entity extraction task, but also gives higher weight, through the attention scores, to important information in the entity task (such as an entity and its context), so the relation extraction model can attend to all the important information and achieve better performance.
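Under the assumption that the co-attention fusion takes the form sketched above (a per-position softmax over two learned scalar attention features, with attn_m[i] + attn_ner[i] = 1), a minimal NumPy version might look like this; all names are illustrative.

```python
import numpy as np

def co_attention_fuse(H_m, H_ner, w_m, b_m, w_ner, b_ner):
    """Character-level co-attention fusion (assumed form): each learner
    maps its L x H input to an L x 1 attention feature; a per-position
    softmax over the pair yields scores with attn_m[i] + attn_ner[i] = 1;
    the fused feature is their score-weighted sum."""
    a_m = H_m @ w_m + b_m            # L x 1, bidirectional-feature attention
    a_ner = H_ner @ w_ner + b_ner    # L x 1, entity-feature attention
    logits = np.concatenate([a_m, a_ner], axis=1)   # L x 2
    logits -= logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    attn = e / e.sum(axis=1, keepdims=True)         # softmax per position
    attn_m, attn_ner = attn[:, :1], attn[:, 1:]
    H_merge = attn_m * H_m + attn_ner * H_ner       # broadcast over H
    return H_merge, attn_m, attn_ner
```

Note that the softmax guarantees the per-position proportions sum to one, matching the scale-preservation property stated above.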
(IV) constructing a relation extraction network based on multilayer convolution
Relation extraction adopts a Convolutional Neural Network (CNN) as the feature extractor. The fused feature H_merge obtained through co-attention is passed through two layers of one-dimensional convolution to obtain the relation extraction feature H_re. All entities obtained by entity extraction are paired pairwise, and the relation type of each entity pair is predicted separately. For an entity pair <E_1, E_2>, suppose entity E_1 occupies positions i~j in the sentence; the feature of entity E_1 is the weighted average of the features at positions i~j, expressed as h_e1 = H_re[i~j] / (j − i). Similarly, the feature h_e2 of entity E_2 is obtained. Finally, the two entity features are spliced with the "[CLS]" position feature h_cls of the BERT model, the relation type is judged by a classifier, and the loss is calculated, formulated as follows:
y_pred = softmax(W_pred·[h_e1 ; h_e2 ; h_cls] + b_pred)
loss_re = CrossEntropyLoss(y_pred, y_label)
wherein W_pred and b_pred denote trainable parameters of the classifier, y_pred ∈ R^(1×C) (C denotes the total number of relation categories) is the predicted relation category, y_label is the ground-truth relation category, and the relation extraction loss loss_re is obtained through the CrossEntropyLoss cross-entropy loss function.
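The entity-pair classification step can be sketched as follows: span-averaged pooling of the relation features, concatenation with a sentence-level [CLS] feature, and a softmax layer. Names and shapes are illustrative assumptions.

```python
import numpy as np

def classify_pair(H_re, span1, span2, h_cls, W_pred, b_pred):
    """Relation classification for one entity pair (sketch): average the
    relation features H_re over each entity's span (h_e = H_re[i~j]/(j-i)),
    concatenate with the sentence-level [CLS] feature h_cls, and apply a
    softmax classifier."""
    i, j = span1
    h_e1 = H_re[i:j].mean(axis=0)
    i, j = span2
    h_e2 = H_re[i:j].mean(axis=0)
    feat = np.concatenate([h_e1, h_e2, h_cls])
    logits = feat @ W_pred + b_pred
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()                     # y_pred over C relation classes
```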
5. Entity relation joint extraction method based on Chinese electronic medical records, executed on the above network model
The extraction method comprises the following steps:
By encoding each position of a sentence of the Chinese electronic medical record, the output feature ĥ_t of each position of the sentence is obtained, and the output features of all positions are spliced to obtain the sentence feature representation;
the forward features and reverse features of the sentence feature representation are spliced to obtain the bidirectional temporal feature H_m of the sentence;
the bidirectional temporal feature H_m is passed through a self-attention network to obtain the entity task feature H_ner, and the entity task feature H_ner is passed through a conditional random field to obtain the entity tag sequence;
the entity task feature H_ner and the bidirectional temporal feature H_m of the sentence are passed through a Co-Attention network to obtain the deep fusion feature H_merge, and the fusion feature H_merge is passed through multi-layer convolution to obtain the relation extraction feature H_re;
the entities obtained through the entity tag sequence are paired pairwise to obtain a plurality of entity pairs;
according to the relation extraction feature H_re, the features at the positions corresponding to each entity are averaged to obtain the entity feature h_e;
the entity features h_e corresponding to the two entities of each entity pair are spliced with the sentence representation h_cls of the pre-trained model to obtain a spliced representation, and the entity relation type obtained from the spliced representation is output through a classifier.
In one arrangement, obtaining the output feature ĥ_t comprises:
inputting the feature representation of each character of the sentence into the LSTM, and obtaining the cell state C_t of the LSTM containing long-distance text information;
storing the cell state C_t of the LSTM in a memory unit M, and obtaining the memory feature m_t corresponding to the cell state C_t;
linearly combining the output feature h_t of the LSTM at time t with the memory feature m_t to obtain the output feature ĥ_t at time t.
In one approach:
obtaining the memory feature m_t is formulated as: m_t = W_M·C_t + b_M, where W_M, b_M denote trainable parameters of the memory unit M; through t−1 iterations, the cell states C_0 ~ C_(t−1) are stored in W_M, b_M, so the memory feature m_t contains long-distance text information;
obtaining the output feature ĥ_t at time t is formulated as: g_t = σ(W_g·m_t + b_g), ĥ_t = g_t ⊙ h_t + (1 − g_t) ⊙ m_t, where W_g, b_g denote trainable parameters of the gate unit, σ denotes the sigmoid activation function, and ⊙ denotes element-wise multiplication;
the obtained sentence feature representation is formulated as: Ĥ = [ĥ_0 ; ĥ_1 ; … ; ĥ_(k−1)], where k is the text length and t ∈ (0 ~ k−1);
obtaining the bidirectional temporal feature H_m of the sentence is formulated as: H_m = [H_m^→ ; H_m^←], where H_m^→ denotes the forward temporal feature and H_m^← denotes the reverse temporal feature.
In one approach, obtaining the fusion feature H_merge comprises:
linearly mapping the entity task feature H_ner and the bidirectional temporal feature H_m through attention feature learners to obtain the entity task attention feature α_ner and the bidirectional temporal attention feature α_m, α_m, α_ner ∈ R^(L×1), where L is the sentence length;
splicing the bidirectional temporal attention feature α_m and the entity task attention feature α_ner, and obtaining, through a softmax activation function, the attention score attn_m of the bidirectional temporal attention feature α_m and the attention score attn_ner of the entity task attention feature α_ner, attn_m, attn_ner ∈ R^(L×1);
the attention score attn_m characterizes the proportion that the feature at each position of the bidirectional temporal attention feature occupies in the feature at the corresponding position of the fusion feature H_merge, arranged in position order; the attention score attn_ner characterizes the proportion that the feature at each position of the entity task attention feature occupies in the feature at the corresponding position of the fusion feature H_merge, arranged in position order;
the fusion feature H_merge is calculated from the proportion determined for each position by the bidirectional temporal attention feature and the proportion determined for each position by the entity task attention feature.
In one approach:
obtaining the bidirectional temporal attention feature α_m is formulated as: α_m = W_m·H_m + b_m;
obtaining the entity task attention feature α_ner is formulated as: α_ner = W_ner·H_ner + b_ner;
obtaining the attention scores is formulated as: [attn_m ; attn_ner] = softmax([α_m ; α_ner]);
obtaining the fusion feature H_merge is formulated as: H_merge = attn_m ⊙ H_m + attn_ner ⊙ H_ner.
In one arrangement, according to the relation extraction feature H_re, the features at the positions of each entity of each entity pair in the sentence are averaged to obtain the entity feature h_e, formulated as: h_e = H_re[i~j] / (j − i), where the entity occupies positions i to j in the sentence.
In one approach:
for the entity pair <E_1, E_2>, entity E_1 occupies positions i~j in the sentence, and the feature of entity E_1 is the weighted average of the features at positions i~j, expressed as h_e1 = H_re[i~j] / (j − i);
likewise, entity E_2 occupies its own positions i~j in the sentence, and its feature h_e2 is the weighted average of the features at those positions;
the two entity features are spliced with the sentence representation h_cls of the pre-trained model, and the entity relation type obtained from the spliced representation is output through a classifier, formulated as: y_pred = softmax(W_pred·[h_e1 ; h_e2 ; h_cls] + b_pred), where W_pred and b_pred denote trainable parameters of the classifier, y_pred ∈ R^(1×C), C denotes the total number of relation categories, and y_label denotes the ground-truth relation category.
6. Model quality assessment
In order to evaluate the entity relation extraction effect of the model, micro-averaging is adopted, and precision (P), recall (R) and F-value are used to evaluate the model. Predicted triples are evaluated with strict matching, i.e. a predicted triple is considered correct if and only if its entity boundaries, entity types and relation type are all completely correct.
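The strict-matching, micro-averaged evaluation described above can be sketched as a small helper; the triple format (head, head type, relation, tail, tail type) is an assumption about how predictions are represented.

```python
def micro_prf(pred_triples, gold_triples):
    """Micro-averaged precision / recall / F-value under strict matching:
    a predicted triple counts as correct only if it exactly matches a
    gold triple (entity boundaries, entity types and relation type all
    correct)."""
    pred, gold = set(pred_triples), set(gold_triples)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```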
Table 1: entity identification evaluation result
(per-category table data not recoverable from this extraction)
The entity recognition results are shown in Table 1. Entity recognition achieves good results in every category, reaching an overall F-value of 88.05%. The "symptom" and "site" types perform less well. One reason is that "symptom" entities are semantically close to "disease" entities; for example, "migraine" is a disease but is easily confused with the symptom "headache". "Site" entities account for a small proportion of all entities, so the model cannot fully learn their features, and complex "site" entities such as "where the ureter crosses the iliac vessel" and "the ureteropelvic junction" cannot be extracted correctly. The "disease", "treatment" and "examination" types are extracted well because their features are more distinctive and their samples are sufficient.
Table 2: relationship extraction evaluation result
(per-category table data not recoverable from this extraction)
The relation extraction results are shown in Table 2. Relation extraction also achieves good overall results, reaching an F-value of 77.17%. The "disease-disease" and "disease-symptom" categories perform less well because "symptom" entities are semantically similar to "disease" entities and the model does not easily distinguish them; if a "disease" (or "symptom") entity is incorrectly predicted as the "symptom" (or "disease") type, the relation extraction model uses the incorrect information and predicts a wrong relation type. In addition, during manual annotation of the electronic medical record dataset, annotators with insufficient medical knowledge are prone to mislabeling entity relation types, which introduces noise into model training.
7. Model comparison
To verify the validity of the proposed model, the relation extraction results of the invention are compared with advanced pipeline and joint models. The pipeline model is that of Li et al. (Entity relation extraction from electronic medical records based on position denoising and rich semantics [J]. Journal of Chinese Information Processing, 2021); the joint model is that of Yan et al. (A partition filter network for joint entity and relation extraction [C]. Empirical Methods in Natural Language Processing, 2021). The two models are compared using the same training set, test set and evaluation protocol as the invention. The F-values of entity recognition and relation extraction of the pipeline model are 86.10% and 65.07% respectively, and the per-category results of entity recognition and relation extraction are shown in Tables 3 and 4 respectively:
table 3: pipeline method entity identification evaluation result
(per-category table data not recoverable from this extraction)
Table 4: pipeline method relation extraction evaluation result
(per-category table data not recoverable from this extraction)
From the pipeline results it can be seen that the joint model of the invention clearly outperforms the traditional pipeline model: the F-value of entity recognition improves by 1.95% and the F-value of relation extraction improves by 12.10%, with clear improvements in every category. The reasons are as follows. First, the pipeline model extracts entities by knowledge-base matching in the entity recognition stage; although this guarantees the accuracy of the extracted entities, long and difficult entities cannot be extracted effectively. For example, the entity "pancreatic head enlargement accompanied by peripheral exudation" is split in the word segmentation stage, so the knowledge base cannot match it. Meanwhile, the quality of the knowledge base directly affects entity matching; as seen in Table 3, the extraction effect for the "site" type is very low, largely because "site" entities in the knowledge base are incomplete, resulting in poor matching. Finally, the pipeline model suffers from error propagation and cannot exploit the potential correlation between the two tasks, so its results are worse.
We also compared the model of the invention with a current advanced joint model (A partition filter network for joint entity and relation extraction [C]. Empirical Methods in Natural Language Processing, 2021). This joint model (PFN) achieved the best results on general-domain datasets in 2021. The entity recognition and relation extraction F-values of the PFN model on the dataset of the invention reached 87.18% and 74.49% respectively, and the per-category results of entity recognition and relation extraction are shown in Tables 5 and 6 respectively:
table 5: comparing the results of entity identification and evaluation in the combined method
(per-category table data not recoverable from this extraction)
Table 6: relation extraction evaluation result of comparison joint method
(per-category table data not recoverable from this extraction)
The results show that the joint model of the invention outperforms the PFN joint model overall: the F-value of entity recognition improves by 0.87% and the F-value of relation extraction improves by 2.68%. Per category, the model of the invention is clearly superior to the PFN model in predicting "disease"-related relations (such as "disease-disease" and "disease-symptom"), because medical entities are more complex than general-domain entities and the span between head and tail entities is larger; the joint model of the invention effectively preserves long-distance features through the Memory mechanism, so it better extracts relations between medical entities with larger spans.
In summary, the invention provides an entity relation joint extraction model for Chinese electronic medical records: an improved LSTM encoding algorithm based on a Memory mechanism is proposed for the document-level characteristics of electronic medical records, and a Co-Attention-based depth information fusion mechanism is proposed to strengthen the information interaction between entity recognition and relation extraction. Finally, the model of the invention is compared with advanced pipeline and joint models; the entity extraction and relation extraction results of the model are clearly improved, verifying the effectiveness of the method.
An embodiment of the present invention further provides an electronic device, where the electronic device includes: the memory, the processor and the computer program stored on the memory and capable of running on the processor, when the processor executes the computer program, the steps of the method provided by the above embodiments are realized. The electronic device provided by the embodiment of the invention can realize each implementation mode in the method embodiments and corresponding beneficial effects.
The embodiment of the invention also provides a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the method provided by the embodiment of the invention is realized, and the same technical effect can be achieved.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present invention and is not intended to limit the scope of the appended claims.

Claims (10)

1. A method for extracting entity relationship jointly based on Chinese electronic medical records is characterized by comprising the following steps:
obtaining the output feature ĥ_t of each position of a sentence of the Chinese electronic medical record by encoding each position of the sentence, and splicing the output features of all positions to obtain a sentence feature representation;
splicing the forward features and backward features of the sentence feature representation to obtain the bidirectional temporal feature H_m of the sentence;
passing the bidirectional temporal feature H_m through a self-attention network to obtain the entity task feature H_ner, and passing the entity task feature H_ner through a conditional random field to obtain an entity tag sequence;
passing the entity task feature H_ner and the bidirectional temporal feature H_m of the sentence through a Co-Attention network to obtain the deep fusion feature H_merge, and passing the deep fusion feature H_merge through multi-layer convolution to obtain the relation extraction feature H_re;
pairing the entities obtained through the entity tag sequence pairwise to obtain a plurality of entity pairs;
averaging, according to the relation extraction feature H_re, the features at the positions corresponding to each entity to obtain the entity feature h_e;
splicing the entity features h_e corresponding to the two entities of each entity pair with the sentence representation h_cls of the pre-trained model to obtain a spliced representation, and outputting, through a classifier, the entity relation type obtained from the spliced representation.
2. The method of claim 1, wherein obtaining the output feature ĥ_t comprises:
inputting the feature representation of each character of the sentence into the LSTM, and obtaining the cell state C_t of the LSTM containing long-distance text information;
storing the cell state C_t of the LSTM in a memory unit M, and obtaining the memory feature m_t corresponding to the cell state C_t;
linearly combining the output feature h_t of the LSTM at time t with the memory feature m_t to obtain the output feature ĥ_t at time t.
3. The method of claim 2, wherein:
obtaining the memory feature m_t is formulated as: m_t = W_M·C_t + b_M, where W_M, b_M denote trainable parameters of the memory unit M; through t−1 iterations, the cell states C_0 ~ C_(t−1) are stored in W_M, b_M, so the memory feature m_t contains long-distance text information;
obtaining the output feature ĥ_t at time t is formulated as: g_t = σ(W_g·m_t + b_g), ĥ_t = g_t ⊙ h_t + (1 − g_t) ⊙ m_t, where W_g, b_g denote trainable parameters of the gate unit, σ denotes the sigmoid activation function, and ⊙ denotes element-wise multiplication;
the obtained sentence feature representation is formulated as: Ĥ = [ĥ_0 ; ĥ_1 ; … ; ĥ_(k−1)], where k is the text length and t ∈ (0 ~ k−1);
obtaining the bidirectional temporal feature H_m of the sentence is formulated as: H_m = [H_m^→ ; H_m^←], where H_m^→ denotes the forward temporal feature and H_m^← denotes the reverse temporal feature.
4. The method of claim 1, wherein obtaining the depth fusion feature H_merge comprises:
linearly mapping the entity task feature H_ner and the bidirectional temporal feature H_m through attention feature learners to obtain the bidirectional temporal attention feature α_m and the entity task attention feature α_ner, α_m, α_ner ∈ R^(L×1), where L is the sentence length;
splicing the bidirectional temporal attention feature α_m and the entity task attention feature α_ner, and obtaining, through a softmax activation function, the attention score attn_m of the bidirectional temporal attention feature α_m and the attention score attn_ner of the entity task attention feature α_ner, attn_m, attn_ner ∈ R^(L×1);
the attention score attn_m characterizes the proportion that the feature at each position of the bidirectional temporal attention feature occupies in the feature at the corresponding position of the fusion feature H_merge, arranged in position order; the attention score attn_ner characterizes the proportion that the feature at each position of the entity task attention feature occupies in the feature at the corresponding position of the depth fusion feature H_merge, arranged in position order;
the depth fusion feature H_merge is calculated from the proportion determined for each position by the bidirectional temporal attention feature and the proportion determined for each position by the entity task attention feature.
5. The method of claim 4, wherein:
obtaining the bidirectional temporal attention feature α_m is formulated as: α_m = W_m·H_m + b_m;
obtaining the entity task attention feature α_ner is formulated as: α_ner = W_ner·H_ner + b_ner;
obtaining the attention scores is formulated as: [attn_m ; attn_ner] = softmax([α_m ; α_ner]);
obtaining the fusion feature H_merge is formulated as: H_merge = attn_m ⊙ H_m + attn_ner ⊙ H_ner.
6. The method of claim 1, wherein, according to the relation extraction feature H_re, the features at the positions of each entity of each entity pair in the sentence are averaged to obtain the entity feature h_e, formulated as: h_e = H_re[i~j] / (j − i), where the entity occupies positions i to j in the sentence.
7. The method of claim 6, wherein:
for the entity pair <E_1, E_2>, entity E_1 occupies positions i~j in the sentence, and the feature of entity E_1 is the weighted average of the features at positions i~j, expressed as h_e1 = H_re[i~j] / (j − i);
likewise, entity E_2 occupies its own positions i~j in the sentence, and its feature h_e2 is the weighted average of the features at those positions;
the two entity features are spliced with the sentence representation h_cls of the pre-trained model, and the entity relation type obtained from the spliced representation is output through a classifier, formulated as: y_pred = softmax(W_pred·[h_e1 ; h_e2 ; h_cls] + b_pred), where W_pred and b_pred denote trainable parameters of the classifier, y_pred ∈ R^(1×C), C denotes the total number of relation categories, and y_label denotes the ground-truth relation category.
8. An entity relation joint extraction network based on Chinese electronic medical records, characterized by comprising:
LSTM advanced networks based on Memory mechanisms,
the self-attention-based entity abstracts the network,
an entity relationship depth information fusion network based on the Co-Attention mechanism, and
extracting a network based on the relation of the multilayer convolution;
the Memory-mechanism-based improved LSTM network consists of two LSTM units, two memory units and two gate units, denoted →LSTM/←LSTM, →Mem/←Mem and →Gate/←Gate respectively, wherein the forward and backward arrows indicate the direction of data input; the input data X = {x_0, x_1, ..., x_{L-1}} has dimension L×H, where L is the length of the input data and H is the feature dimension; the input size of each LSTM unit is 1×H, and the feature x_t ∈ X of each character of the input data is fed into the forward LSTM unit in turn to obtain →h_t, with output size 1×H/2; each memory unit is a fully connected layer with input size 1×H/2, and the feature →h_t encoded by the forward LSTM unit passes through the forward memory unit to obtain the memory feature →m_t, with output size 1×H/2; each gate unit is a linear layer followed by a sigmoid function, with input size 1×H/2 and output size 1×1, and the memory feature →m_t passes through the gate unit to obtain the gate weight →g_t; the feature →h_t encoded by the LSTM unit and the memory feature →m_t are then linearly combined with weight →g_t to obtain

    →h^m_t = →g_t · →h_t + (1 − →g_t) · →m_t

with output size 1×H/2; all characters pass through the forward LSTM unit in turn and the results are spliced to obtain the forward feature →H^m, with output size L×H/2; all characters pass through the backward LSTM unit in reverse order to obtain ←H^m, which is spliced with the forward feature to give the final bidirectional Memory-LSTM feature

    H_m = [→H^m; ←H^m]

with output size L×H;
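As a rough illustration of the gating described above, one Memory-LSTM step might look like the sketch below (the explicit combination g·h + (1 − g)·m and all parameter names are assumptions; the claim only states that the two features are linearly combined by the gate weight):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_step(h_t, W_m, b_m, w_g, b_g):
    """One Memory-LSTM step (hypothetical parameter names).
    h_t: (H/2,) hidden state already produced by the LSTM unit.
    Memory unit: fully connected layer   -> m_t, shape (H/2,)
    Gate unit:   linear layer + sigmoid  -> scalar g_t in (0, 1)
    Output: one plausible linear combination g_t*h_t + (1 - g_t)*m_t."""
    m_t = h_t @ W_m + b_m             # memory feature, (H/2,)
    g_t = sigmoid(m_t @ w_g + b_g)    # scalar gate weight
    return g_t * h_t + (1.0 - g_t) * m_t

rng = np.random.default_rng(1)
Hh = 4                                # plays the role of H/2
h_t = rng.normal(size=Hh)
out = memory_step(h_t, rng.normal(size=(Hh, Hh)), np.zeros(Hh),
                  rng.normal(size=Hh), 0.0)
```

Stacking this step over all timesteps in both directions and concatenating would give the L×H bidirectional feature H_m described in the claim.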
the self-attention-based entity extraction network consists of a self-attention network and a CRF decoding module; the input and output sizes of the self-attention network are both L×H, and the entity task feature H_ner is obtained through the self-attention network; the entity task feature H_ner is decoded by the CRF module, whose input size is L×H and whose output is the label sequence L_pred of the highest-scoring path;
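A generic single-head self-attention layer matching the stated L×H-in, L×H-out sizes could be sketched as follows (the claim does not fix the internal form, so the scaled dot-product variant and the projection matrices below are assumptions):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (L, H) input features; Wq, Wk, Wv: (H, H) projections.
    Returns an (L, H) output, preserving the claimed sizes."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # each (L, H)
    scores = Q @ K.T / np.sqrt(X.shape[1])     # (L, L) similarity scores
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)          # row-wise softmax
    return A @ V                               # (L, H)

rng = np.random.default_rng(2)
L, H = 6, 8
X = rng.normal(size=(L, H))
H_ner = self_attention(X, *(rng.normal(size=(H, H)) for _ in range(3)))
```

The resulting H_ner would then be handed to a CRF decoder, which is not reproduced here.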
the Co-Attention-based entity-relation deep information fusion network consists of two attention feature learners; each learner is a linear layer with input size L×H and output size L×1; the entity feature H_ner encoded by the self-attention network and the Memory-LSTM feature H_m pass through their respective learners to produce attention scores, each of size L×1; the two attention scores are concatenated and passed through a softmax activation function to obtain the attention weights attn_m, attn_ner ∈ R^{L×1}; finally, the attention weights are multiplied element-wise with their corresponding original features to obtain the fused feature H_merge, which serves as the input of the multilayer-convolution-based relation extraction module;
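The two-learner fusion could be sketched as below (the learner weights and the additive way of merging the two reweighted feature maps are assumptions; the claim only states that the attention weights are multiplied with the original features):

```python
import numpy as np

def co_attention_fuse(H_ner, H_m, w_ner, w_m):
    """Co-Attention fusion sketch (learner weights w_* are assumptions).
    Each learner: linear layer (L, H) -> (L, 1); the two score columns are
    concatenated, softmax-normalised per position, then used to reweight
    and merge the two feature maps."""
    s_ner = H_ner @ w_ner                        # (L, 1) score from learner 1
    s_m = H_m @ w_m                              # (L, 1) score from learner 2
    s = np.concatenate([s_ner, s_m], axis=1)     # (L, 2)
    s = np.exp(s - s.max(axis=1, keepdims=True))
    attn = s / s.sum(axis=1, keepdims=True)      # softmax over the two sources
    # weight each source by its attention column and merge additively
    return attn[:, :1] * H_ner + attn[:, 1:] * H_m   # (L, H)

rng = np.random.default_rng(3)
L, H = 6, 8
H_merge = co_attention_fuse(rng.normal(size=(L, H)), rng.normal(size=(L, H)),
                            rng.normal(size=(H, 1)), rng.normal(size=(H, 1)))
```

The fused L×H map is what the convolutional relation-extraction stage would consume next.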
the multilayer-convolution-based relation extraction network is formed by stacking two one-dimensional convolution units, Conv1 and Conv2; Conv1 has input size L×H, output size L×H/2 and convolution kernel size 3; Conv2 has input size L×H/2, output size L×H and convolution kernel size 3; the fused feature H_merge obtained through the Co-Attention network passes through the convolution operations to produce the output H_re, with output size L×H; the features at the positions of each entity are then weighted to obtain the entity feature h_e, with feature size 1×H; the entity features are spliced pairwise and combined with the global feature to obtain the entity-pair feature

    h_pair = [h_cls; h_E1; h_E2]

with feature size 1×3H; the feature is classified through a linear layer, outputting the prediction result y_pred ∈ R^{1×C}.
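The shape pipeline of the two stacked convolutions (L×H → L×H/2 → L×H, kernel size 3) can be checked with a minimal same-padded 1-D convolution in numpy (the helper and parameter names are illustrative):

```python
import numpy as np

def conv1d(X, W, b):
    """'Same'-padded 1-D convolution over the length axis.
    X: (L, C_in); W: (k, C_in, C_out) with k = 3; returns (L, C_out)."""
    k = W.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))   # zero-pad so the length is kept
    L = X.shape[0]
    out = np.stack([np.tensordot(Xp[t:t + k], W, axes=([0, 1], [0, 1]))
                    for t in range(L)])
    return out + b

rng = np.random.default_rng(4)
L, H = 6, 8
X = rng.normal(size=(L, H))                                 # stands in for H_merge
W1, b1 = rng.normal(size=(3, H, H // 2)), np.zeros(H // 2)  # Conv1: L x H   -> L x H/2
W2, b2 = rng.normal(size=(3, H // 2, H)), np.zeros(H)       # Conv2: L x H/2 -> L x H
mid = conv1d(X, W1, b1)
H_re = conv1d(mid, W2, b2)
```

With kernel size 3 and padding 1 the sequence length L is preserved at both layers, which is what makes the claimed L×H/2 and L×H output sizes consistent.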
9. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method as claimed in any one of claims 1 to 7.
CN202210749641.1A 2022-06-29 2022-06-29 Entity relation joint extraction method, network, equipment and computer readable storage medium based on Chinese electronic medical record Pending CN115017910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210749641.1A CN115017910A (en) 2022-06-29 2022-06-29 Entity relation joint extraction method, network, equipment and computer readable storage medium based on Chinese electronic medical record


Publications (1)

Publication Number Publication Date
CN115017910A true CN115017910A (en) 2022-09-06

Family

ID=83078611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210749641.1A Pending CN115017910A (en) 2022-06-29 2022-06-29 Entity relation joint extraction method, network, equipment and computer readable storage medium based on Chinese electronic medical record

Country Status (1)

Country Link
CN (1) CN115017910A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117316372A (en) * 2023-11-30 2023-12-29 Tianjin University Ear disease electronic medical record analysis method based on deep learning
CN117316372B (en) * 2023-11-30 2024-04-09 Tianjin University Ear disease electronic medical record analysis method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination