CN109388807B

CN109388807B - Method, device and storage medium for identifying named entities of electronic medical records

Info

Publication number: CN109388807B
Application number: CN201811282557.3A
Authority: CN
Inventors: 任江涛; 殷明旺
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2021-09-21
Anticipated expiration: 2038-10-30
Also published as: CN109388807A

Abstract

The invention discloses a method for identifying named entities of electronic medical records, which comprises the following steps: generating a word vector matrix and a radical vector matrix corresponding to a character sequence of the electronic medical record of the named entity to be identified, inputting the radical vector matrix into a convolutional neural network layer for processing to obtain a radical convolutional vector matrix corresponding to the character sequence, generating a word characteristic vector matrix according to the word vector matrix and the radical convolutional vector matrix, and inputting the word characteristic vector matrix into a bidirectional long-short term memory network for processing to obtain a named entity identification result of the electronic medical record. The invention also discloses an electronic medical record named entity recognition device and a storage medium. The invention provides the method for identifying the named entity of the electronic medical record, which has high identification accuracy by extracting the internal morphological characteristics of the characters of the electronic medical record and sequentially inputting the characteristics of the characters and the internal morphological characteristics of the characters into a deep neural network to predict the character label.

Description

Method, device and storage medium for identifying named entities of electronic medical records

Technical Field

The invention relates to the technical field of computers, in particular to a method for identifying named entities of electronic medical records, a device for identifying the named entities of the electronic medical records and a computer storage medium.

Background

With the rapid socioeconomic development of China and the steady improvement of living standards, public health awareness keeps growing, and how to build intelligent medical systems from the large volume of available medical data has become an urgent social need. The electronic medical record is the medical text that carries the most medical data and information. It is highly specialized: it is written by a professional doctor for a particular patient and records in detail the symptoms observed from admission to discharge, the diseases diagnosed by the doctor and the corresponding treatments, together with a large amount of other medical information such as the results of examination reports. Many intelligent medical information systems are therefore built on information from electronic medical records. When building such systems, named entity recognition is the foundation of the important task of extracting information from large amounts of medical data, and it is essential to information processing and management systems across the medical field.

In the prior art, there are deep-learning-based named entity recognition methods for the medical field, in which a neural network model extracts context information between characters or words and outputs a probability distribution over entity categories. However, these methods rely only on character vectors or word vectors and do not consider the deeper information hidden inside the characters or words, so the representation of each character or word is incomplete and the recognition effect is poor.

The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.

Disclosure of Invention

The invention mainly aims to provide an electronic medical record named entity identification method, an electronic medical record named entity identification device and a computer storage medium, so as to solve the technical problem that deep-learning-based methods in the prior art rely only on character vectors or word vectors, do not consider the deeper information hidden inside the characters or words, and therefore achieve a poor identification effect.

In order to achieve the above object, the present invention provides a method for identifying named entities of electronic medical records, which comprises the following steps:

generating a word vector matrix corresponding to the character sequence of the electronic medical record of the named entity to be identified;

generating a radical vector matrix corresponding to the character sequence;

inputting the radical vector matrix into a first neural network for processing to obtain a radical convolution vector matrix corresponding to the character sequence, wherein the first neural network comprises a convolutional neural network layer;

generating a word feature vector matrix according to the word vector matrix and the radical convolution vector matrix;

inputting the word feature vector matrix into a second neural network for processing to obtain a named entity recognition result of the electronic medical record, wherein the second neural network comprises a bidirectional long-short term memory network layer;

and the parameters of the first neural network and the second neural network are obtained by training according to the electronic medical record of the identified named entity.

Preferably, the step of generating the radical vector matrix corresponding to the character sequence includes:

acquiring Chinese character components of each character in the character sequence;

generating radical vectors of the characters according to the Chinese character components;

and generating a radical vector matrix corresponding to the character sequence according to the radical vector of each character.

Preferably, the second neural network further includes a fully connected layer, and the step of inputting the word feature vector matrix into the second neural network for processing to obtain the named entity identification result of the electronic medical record includes:

inputting the word feature vector matrix into the bidirectional long-short term memory network layer for processing to obtain a hidden vector matrix corresponding to the character sequence;

and inputting the hidden vector matrix into the fully connected layer for processing to obtain a named entity recognition result of the electronic medical record.

Preferably, the second neural network further includes a self-attention mechanism layer, and the step of inputting the word feature vector matrix into the second neural network for processing to obtain the named entity identification result of the electronic medical record includes:

and inputting the hidden vector matrix into a self-attention mechanism layer for processing to obtain a named entity recognition result of the electronic medical record.

Preferably, the second neural network further includes a self-attention mechanism layer and a conditional random field model, and the step of inputting the word feature vector matrix into the second neural network for processing to obtain the named entity recognition result of the electronic medical record includes:

inputting the hidden vector matrix into a self-attention mechanism layer for processing to obtain a prediction matrix corresponding to the character sequence;

and inputting the prediction matrix into the conditional random field model for processing to obtain a named entity recognition result of the electronic medical record.

Preferably, the self-attention mechanism layer includes a fully connected layer, and the step of inputting the hidden vector matrix into the self-attention mechanism layer for processing to obtain the prediction matrix corresponding to the character sequence includes:

calculating attention weights of hidden vectors in the hidden vector matrix;

generating an attention vector matrix according to the attention weights and the hidden vectors;

generating an attention hidden vector matrix according to the hidden vector matrix and the attention vector matrix;

and inputting the attention hidden vector matrix into the fully connected layer for processing to obtain a prediction matrix corresponding to the character sequence.

Preferably, the step of calculating attention weights of hidden vectors in the hidden vector matrix comprises:

calculating the dependency relationship between the hidden vectors in the hidden vector matrix according to the following formula:

$$f_{t,t'} = \sigma\left(w_a \tanh\left(w_t h_t + w_{t'} h_{t'}\right)\right),$$

where $t$ and $t'$ denote different time steps, $w_a$, $w_t$ and $w_{t'}$ are weight vectors, $\sigma$ is the sigmoid function, and $h_t$ and $h_{t'}$ are the hidden vectors at those time steps;

calculating, according to the following formula, the attention weight corresponding to each hidden vector $h_k$ in the hidden vector matrix:

where $e$ is the exponential function and $N$ is the number of hidden vectors.
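The attention computation described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the patented formula: the text calls $w_t$ and $w_{t'}$ weight vectors, but to make the shapes work the sketch treats them as projection matrices (hypothetical names `W_t`, `W_tp`) and keeps `w_a` as a vector. The attention-weight formula itself is not reproduced in the text, so the sketch assumes a softmax normalisation of $f_{t,k}$ over the $N$ hidden vectors, which is one reading consistent with the mention of the exponential function and of $N$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_weights(H, W_t, W_tp, w_a):
    """H: (N, d) hidden vectors. Returns A of shape (N, N),
    where A[t, k] is the weight of hidden vector h_k for time step t."""
    left = H @ W_t.T    # (N, a): projection of h_t
    right = H @ W_tp.T  # (N, a): projection of h_t'
    # f[t, k] = sigmoid(w_a . tanh(W_t h_t + W_t' h_k))
    f = sigmoid(np.tanh(left[:, None, :] + right[None, :, :]) @ w_a)
    # ASSUMPTION: softmax over k; the source formula is not reproduced.
    e = np.exp(f)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, d, a = 5, 8, 4  # illustrative sizes only
H = rng.normal(size=(N, d))
A = attention_weights(H,
                      rng.normal(size=(a, d)),
                      rng.normal(size=(a, d)),
                      rng.normal(size=a))
# Attention vector matrix: each row is a weighted sum of the hidden vectors.
attention_vectors = A @ H
# Attention hidden vector matrix (assumed here to be a concatenation).
attention_hidden = np.concatenate([H, attention_vectors], axis=-1)
```

Combining the hidden vectors and the attention vectors by concatenation is also an assumption; the text only states that the attention hidden vector matrix is generated from both.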

Preferably, the second neural network further includes a conditional random field model, and the step of inputting the word feature vector matrix into the second neural network for processing to obtain the named entity recognition result of the electronic medical record includes:

and inputting the hidden vector matrix into a conditional random field model for processing to obtain a named entity recognition result of the electronic medical record.

In addition, in order to achieve the above object, the present invention further provides an electronic medical record named entity recognition apparatus, including: a memory, a processor, and an electronic medical record named entity recognition processing program stored in the memory and executable on the processor, wherein the processing program, when executed by the processor, implements the steps of the method for identifying named entities of electronic medical records as described above.

In addition, in order to achieve the above object, the present invention further provides a computer storage medium, wherein the computer storage medium stores a processing program for named entity identification of an electronic medical record, and the processing program for named entity identification of an electronic medical record is executed by a processor to implement the steps of the method for named entity identification of an electronic medical record as described above.

The method for identifying named entities of electronic medical records, the device for identifying named entities of electronic medical records and the computer storage medium provided by the embodiments of the invention generate a word vector matrix and a radical vector matrix corresponding to the character sequence of the electronic medical record whose named entities are to be identified, input the radical vector matrix into a convolutional neural network layer for processing to obtain the radical convolution vector matrix corresponding to the character sequence, generate a word feature vector matrix according to the word vector matrix and the radical convolution vector matrix, and input the word feature vector matrix into a bidirectional long-short term memory network for processing to obtain the named entity identification result of the electronic medical record. By extracting the internal morphological features of the characters of the electronic medical record and inputting both the character features and these internal morphological features into a deep neural network to predict the character labels, the invention provides a method for identifying named entities of electronic medical records with high identification accuracy.

Drawings

FIG. 1 is a schematic diagram of an apparatus in a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for identifying named entities in an electronic medical record according to a first embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a convolutional neural network processing procedure according to a first embodiment of the method for identifying named entities in electronic medical records of the present invention;

FIG. 4 is a schematic diagram of a neural network system processing procedure according to a first embodiment of the method for identifying named entities in electronic medical records of the present invention;

FIG. 5 is a flowchart illustrating a method for identifying named entities in an electronic medical record according to a second embodiment of the present invention;

FIG. 6 is a schematic diagram of a neural network system processing procedure according to a second embodiment of the method for identifying named entities in electronic medical records of the present invention;

FIG. 7 is a flowchart illustrating a method for identifying named entities in an electronic medical record according to a third embodiment of the present invention;

FIG. 8 is a diagram illustrating a processing procedure of a neural network system according to a third embodiment of the method for identifying named entities in electronic medical records of the present invention;

FIG. 9 is a flowchart illustrating a method for identifying named entities in an electronic medical record according to a fourth embodiment of the invention;

FIG. 10 is a diagram illustrating a processing procedure of a neural network system according to a fourth embodiment of the method for identifying named entities in electronic medical records of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.

The terminal of the embodiment of the invention may be a PC, or a mobile terminal device with a display function, such as a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (MPEG-4 Part 14) player, a portable computer, and the like.

As shown in fig. 1, the terminal may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Optionally, the terminal may further include a camera, a Radio Frequency (RF) circuit, sensors, an audio circuit, a WiFi module, and the like. The sensors may include, for example, light sensors, motion sensors and other sensors. Specifically, the light sensors may include an ambient light sensor, which can adjust the brightness of the display screen according to the brightness of ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally along three axes), can detect the magnitude and direction of gravity when the mobile terminal is stationary, and can be used for applications that recognize the attitude of the mobile terminal (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as pedometers and tap detection). Of course, the mobile terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer and an infrared sensor, which are not described herein again.

Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, the memory 1005, which is a type of computer storage medium, can include an operating system, a network communication module, a user interface module, and an electronic medical record named entity recognition processing program.

In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to invoke the electronic medical record named entity recognition processing program stored in the memory 1005 and execute the steps of the electronic medical record named entity recognition method.

Referring to fig. 2, a first embodiment of the present invention provides a method for identifying named entities of electronic medical records, where the method includes:

Step S10, generating a word vector matrix corresponding to the character sequence of the electronic medical record of the named entity to be identified.

First, the character sequence in the current medical history content of the electronic medical record of the named entity to be identified is acquired. The method for identifying named entities of electronic medical records provided by this embodiment combines a convolutional neural network model (CNN) with a bidirectional long-short term memory network model (Bi-LSTM), and these network models can only process numerical input. The acquired character sequence therefore needs to be converted into vector form.

Word vectors corresponding to a character sequence can be obtained from word vectors trained in advance, for example by Google's word2vec vector representation method. This method projects characters into a low-dimensional space in which characters or words with similar semantics lie relatively close together. For example, in this low-dimensional space the distance between the words "China" and "Guangzhou" is much smaller than the distance between the words "China" and "computer".

In order to obtain accurate word vectors with the word2vec vector representation method, 10000 electronic medical records are used as the corpus for training the word vectors, and the Skip-Gram model in word2vec is used for training. Although the Skip-Gram model trains more slowly than the CBOW model, it performs better than CBOW on corpora containing rare characters, so the resulting word vectors match the character sequences of electronic medical records more closely.

Specifically, when the word vectors corresponding to a character sequence are obtained by the word2vec vector representation method, an index-based lookup can be used. For example, let the character sequence of the electronic medical record be C = (C1, C2, …, Cn), where n is the length of the input character sequence, and let a character index be generated according to the position of each character in the sequence. After the pre-trained word vectors are obtained, the word vector corresponding to each character can be obtained by table lookup through its character index, giving the word vector sequence x = (x1, x2, …, xn), where $x \in \mathbb{R}^{n \times d}$ and d is the dimension of the word vector space.
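The index-based lookup described above can be sketched as follows. This is a minimal illustration: the example characters, the vocabulary-building scheme and the random table standing in for pre-trained word2vec vectors are all assumptions introduced here, and only the dimension d = 100 comes from the worked example later in the description.

```python
import numpy as np

D = 100  # word vector dimension d

# Hypothetical input character sequence C = (C1, ..., Cn).
chars = list("头痛头晕")

# ASSUMPTION: each distinct character gets one row index in the table.
vocab = {"<unk>": 0}
for ch in chars:
    vocab.setdefault(ch, len(vocab))

# Stand-in for the pre-trained word2vec table (random values here).
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), D))

# Character indices, then a table lookup yielding x in R^{n x d}.
indices = np.array([vocab.get(ch, 0) for ch in chars])
x = table[indices]
```

Because the lookup is by index, repeated characters (the first and third characters in this example) receive identical word vectors.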

Step S20, generating a radical vector matrix corresponding to the character sequence.

In common neural-network-based named entity recognition methods, the character vectors or word vectors corresponding to the text to be recognized are usually input into a neural network model for label prediction. However, the amount of information a single character or word vector can express is limited, so relying only on character vectors or word vectors improves the accuracy of named entity recognition only to a limited extent.

Based on the above-mentioned drawbacks of the prior art, the inventive concept of the present invention arises from mining the deeper levels of information that may exist inside a character or word. Because Chinese characters developed from pictographs, many characters still retain their original meanings, and many characters with similar shapes have similar meanings; for example, two different characters that both mean "disease", or two that both mean "pain", share a similar form. The morphological information of the characters can therefore be taken as input to a neural network, which extracts features from it and supplies the deeper information contained in the characters or words to the subsequent label prediction.

Intuitively, the radical composition of a character reflects the form of the character to a certain extent, so the radical composition of a Chinese character can be acquired as its morphological information. For example, the radical composition of the character 和 ("and") is acquired as the morphological information of that character.

Specifically, when the radical composition of a character is acquired, each Chinese character component is regarded as an independent radical of the character. For example, 禾 (he) and 口 (kou) are respectively the left and right components of the character 和 ("and"). A corresponding radical vector is generated for each radical, so the radical sequence of a character, which comprises several radicals, corresponds to a radical vector sequence, and this radical vector sequence is equivalent to a two-dimensional radical vector matrix. For the character sequence of the named entity to be recognized, the two-dimensional radical vector matrices of the individual characters together form a three-dimensional radical vector matrix corresponding to the character sequence.
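The construction of the three-dimensional radical vector matrix can be sketched as follows. The decomposition table `RADICALS`, the padding scheme and the random initial vectors are illustrative assumptions; the fixed radical sequence length of 10 and the radical dimension of 50 are taken from the worked example later in the description.

```python
import numpy as np

# ASSUMPTION: a tiny hand-written decomposition table for illustration.
RADICALS = {"和": ["禾", "口"], "疼": ["疒", "冬"], "痛": ["疒", "甬"]}
MAX_RADICALS = 10  # fixed radical sequence length per character
RADICAL_DIM = 50   # radical vector dimension
PAD = "<pad>"

def build_radical_ids(chars, radical_vocab):
    """Map each character to a fixed-length row of radical indices (0 = padding)."""
    ids = np.zeros((len(chars), MAX_RADICALS), dtype=np.int64)
    for i, ch in enumerate(chars):
        # Characters missing from the table are treated as their own radical.
        parts = RADICALS.get(ch, [ch])[:MAX_RADICALS]
        for j, part in enumerate(parts):
            ids[i, j] = radical_vocab.setdefault(part, len(radical_vocab))
    return ids

radical_vocab = {PAD: 0}
ids = build_radical_ids(list("疼痛"), radical_vocab)

# Randomly initialised radical vectors; the padding row is kept at zero.
rng = np.random.default_rng(0)
radical_table = rng.normal(scale=0.1, size=(len(radical_vocab), RADICAL_DIM))
radical_table[0] = 0.0

# Shape (n characters, MAX_RADICALS, RADICAL_DIM): one 2-D matrix per character.
radical_matrix = radical_table[ids]
```

In this example the two characters share their first radical, so the first row of each per-character matrix is identical, which is the kind of shared morphological signal the text describes.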

Step S30, inputting the radical vector matrix into a first neural network for processing, so as to obtain a radical convolution vector matrix corresponding to the character sequence, where the first neural network includes a convolution neural network layer.

In this embodiment, a convolutional neural network (CNN) is used for feature extraction. For example, as shown in fig. 3, a fixed length is allocated during processing to the radical sequence of the character "pain" to be recognized, so the radical sequence includes padding. After the radical vectors of the radical sequence of the character to be recognized are obtained, the radical vector matrix is input into the CNN layer of the first neural network, and after convolution by the convolutional layers, pooling by the pooling layers and processing by the fully connected layers, the radical convolution vector matrix containing the internal form information of the characters is output. It should be noted that the CNN may include a plurality of convolutional layers, a plurality of pooling layers and a plurality of fully connected layers, and this embodiment does not limit this structure.

It can be understood that the radical vectors may be initialized randomly rather than taken from pre-trained vectors, and that the radical vectors are then trained as parameters of the first neural network.
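The convolution over each character's radical sequence can be sketched in NumPy as follows. This is a simplified stand-in rather than the patented network: it uses a single convolutional layer with ReLU and takes the maximum over all window positions, whereas the text describes convolutional, pooling and fully connected layers without fixing their arrangement. The kernel window of 3 and the 30 kernels follow the worked example later in the description; the function name `radical_cnn` is hypothetical.

```python
import numpy as np

def radical_cnn(radical_matrix, kernels, bias):
    """radical_matrix: (n, L, d_r); kernels: (F, w, d_r); bias: (F,).
    Returns (n, F): one radical convolution vector per character."""
    n, L, d_r = radical_matrix.shape
    F, w, _ = kernels.shape
    # All length-w windows along the radical axis: (n, L - w + 1, w, d_r).
    windows = np.stack(
        [radical_matrix[:, i:i + w, :] for i in range(L - w + 1)], axis=1)
    # Convolution as a dot product of each window with each kernel.
    conv = np.einsum("npwd,fwd->npf", windows, kernels) + bias
    conv = np.maximum(conv, 0.0)  # ReLU (an assumption)
    # ASSUMPTION: max over all positions instead of a pooling window of 2.
    return conv.max(axis=1)

rng = np.random.default_rng(0)
K = 7  # illustrative sentence length
radical_matrix = rng.normal(size=(K, 10, 50))
kernels = rng.normal(scale=0.1, size=(30, 3, 50))
Y = radical_cnn(radical_matrix, kernels, np.zeros(30))  # shape (K, 30)
```

With this choice each character is reduced to a 30-dimensional vector regardless of how many of its 10 radical slots are padding.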

Step S40, generating a word feature vector matrix according to the word vector matrix and the radical convolution vector matrix.

The character sequences in the electronic medical record of the named entity to be identified are processed through the steps to obtain the corresponding word vector matrix and the radical convolution vector matrix, and because the two vector matrices both contain the characteristic information of the character sequences of the named entity to be identified, the overall word characteristic vector matrix needs to be generated according to the two vector matrices.

Specifically, the vectors in the word vector matrix and the vectors in the radical convolution vector matrix are spliced (concatenated). For example, a character sequence C = (C1, C2, …, Cn) corresponds to a word vector matrix X = (X1, X2, …, Xn) and a radical convolution vector matrix Y = (Y1, Y2, …, Yn), where X1, X2, …, Xn and Y1, Y2, …, Yn are vectors. The word vector and the radical convolution vector corresponding to character Ci in the character sequence C are Xi and Yi respectively, and splicing Xi and Yi gives a new vector Zi. Splicing the word vector and the radical convolution vector of every character in the character sequence C in this way yields the corresponding word feature vector matrix Z = (Z1, Z2, …, Zn).
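The splicing step amounts to a concatenation along the feature axis, as the following sketch shows. The random matrices are placeholders, and the dimensions 100 and 30 follow the worked example later in the description.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                          # illustrative sequence length
X = rng.normal(size=(n, 100))  # word vector matrix, one row Xi per character
Y = rng.normal(size=(n, 30))   # radical convolution vector matrix, rows Yi

# Zi = [Xi ; Yi] for every character, giving the word feature vector matrix.
Z = np.concatenate([X, Y], axis=-1)
```

Each row of `Z` therefore has 130 dimensions, with the first 100 coming from the word vector and the last 30 from the radical convolution vector.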

Step S50, inputting the word feature vector matrix into a second neural network for processing to obtain the named entity recognition result of the electronic medical record, wherein the second neural network comprises a bidirectional long-short term memory network layer.

Since named entity recognition is a sequence labeling problem, the second neural network in this embodiment uses a bidirectional long-short term memory network (Bi-LSTM) to extract the context information of a sequence. The long-short term memory network (LSTM) is a type of recurrent neural network (RNN); it alleviates the vanishing/exploding gradient problem of the RNN and addresses the RNN's inability to capture long-term dependencies in a sequence.

The Bi-LSTM employed in this embodiment comprises LSTM networks in both the forward and backward directions. The word feature vector matrix Z = (Z1, Z2, …, Zn) generated from the word vector matrix and the radical convolution vector matrix contains the feature vectors of the n characters in the character sequence. These feature vectors are input into the forward LSTM network from left to right, and the hidden vector corresponding to the feature vector of each character is output in sequence.

Similarly, the feature vectors of the n characters are input into the backward LSTM network from right to left, and another hidden vector corresponding to the feature vector of each character is output in sequence.

It can be understood that processing the feature vectors of the n characters with the bidirectional long-short term memory network captures the context information of the sequence, which is more comprehensive than the information captured by a unidirectional long-short term memory network. The two hidden vectors corresponding to the feature vector of each character are spliced to obtain a bidirectional hidden vector.

The bidirectional hidden vectors corresponding to the feature vectors of all the characters are placed in the same matrix to generate the total hidden vector matrix.
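The forward pass, backward pass and splicing described above can be sketched with a plain NumPy LSTM cell. This is an illustration under assumptions: the gate layout, the random weights and the helper names `lstm_pass` and `bilstm` are introduced here, and a trained model would learn the weights. The input size of 130 and hidden size of 64 follow the worked example later in the description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(Z, W, U, b, reverse=False):
    """Z: (n, d_in); W: (4h, d_in); U: (4h, h); b: (4h,).
    Returns (n, h) hidden vectors aligned with the input positions."""
    n, h_size = Z.shape[0], U.shape[1]
    h, c = np.zeros(h_size), np.zeros(h_size)
    out = np.zeros((n, h_size))
    steps = range(n - 1, -1, -1) if reverse else range(n)
    for t in steps:
        gates = W @ Z[t] + U @ h + b
        i, f, o, g = np.split(gates, 4)  # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

def bilstm(Z, fwd, bwd):
    """Splice the forward and backward hidden vectors of each position."""
    return np.concatenate(
        [lstm_pass(Z, *fwd), lstm_pass(Z, *bwd, reverse=True)], axis=-1)

def init_params(rng, d_in, h_size):
    return (rng.normal(scale=0.1, size=(4 * h_size, d_in)),
            rng.normal(scale=0.1, size=(4 * h_size, h_size)),
            np.zeros(4 * h_size))

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 130))  # word feature vectors of 4 characters
fwd, bwd = init_params(rng, 130, 64), init_params(rng, 130, 64)
H = bilstm(Z, fwd, bwd)        # total hidden vector matrix, shape (4, 128)
```

The first 64 columns of each row depend only on the characters to the left (inclusive), and the last 64 only on the characters to the right (inclusive), which is why the spliced vector carries context from both directions.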

Further, the second neural network also comprises a fully connected layer, which processes the total hidden vector matrix output by the bidirectional long-short term memory network to finally obtain a probability matrix corresponding to the character sequence of the named entity to be identified. How the final named entity recognition result is obtained from the probability matrix is explained next.

Named entity recognition, also called proper name recognition, refers to recognizing entities with specific meaning in text. For the electronic medical records to be recognized in this embodiment, these entities are body parts, examinations, diseases, symptoms, treatments, and the like.

Named entity recognition typically requires solving two problems: the first is entity boundary identification, namely word segmentation; the second is determining the entity category. Both problems can be solved by training a neural network on labeled data and using the network to predict a label for each character of the text to be recognized. Various labeling schemes can be adopted, such as the IOB labeling method or the BIOES labeling method.

In this embodiment, when the BIOES label labeling method is used in the process of identifying named entities of an electronic medical record, 15 types of labels are defined: B-BodyPart, I-BodyPart, E-BodyPart, B-Check, I-Check, E-Check, B-Disease, I-Disease, E-Disease, B-Symptom, I-Symptom, E-Symptom, B-Treatment, I-Treatment and E-Treatment. The B-, I- and E- prefixes indicate the beginning, the interior and the end of an entity, respectively, and the suffix indicates the entity type: BodyPart for a "body part" entity, Check for a "test examination" entity, Disease for a "disease" entity, Symptom for a "symptom" entity and Treatment for a "treatment" entity. For example, the B-BodyPart label indicates the beginning of a "body part" entity, the I-BodyPart label indicates its interior, and the E-BodyPart label indicates its end.

The probability values in the probability matrix obtained in the above step are the predicted label classification probabilities for the character sequence. For example, when the above 15 kinds of labels are defined, each character in the character sequence has 15 probability values, namely the probability that the character is predicted as each label, and the label with the highest probability value is selected as the predicted label of the character. After the predicted label of each character in the character sequence is determined, the character sequence can be segmented and the entity categories determined according to the meaning of the labels, which completes the named entity identification.
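The label selection and the subsequent grouping of characters into entities can be sketched as follows. Only the 15 labels listed in the text are modelled; the grouping rule in `decode` (an entity runs from a B- label to the next E- label of the same type, and inconsistent runs are dropped) and the example input are assumptions made for illustration.

```python
import numpy as np

TYPES = ["BodyPart", "Check", "Disease", "Symptom", "Treatment"]
# The 15 labels, in the order listed in the text.
LABELS = [f"{prefix}-{etype}" for etype in TYPES for prefix in "BIE"]

def decode(chars, probs):
    """chars: list of n characters; probs: (n, 15) probability matrix.
    Returns the predicted labels and the (text, type) entities found."""
    tags = [LABELS[k] for k in probs.argmax(axis=1)]
    entities, start, current = [], None, None
    for i, tag in enumerate(tags):
        prefix, etype = tag.split("-")
        if prefix == "B":
            start, current = i, etype
        elif current != etype:
            start, current = None, None  # label run is inconsistent: drop it
        elif prefix == "E":
            entities.append(("".join(chars[start:i + 1]), etype))
            start, current = None, None
    return tags, entities

# Hypothetical two-character input whose most probable labels are B/E-Symptom.
chars = list("头痛")
probs = np.full((2, len(LABELS)), 0.01)
probs[0, LABELS.index("B-Symptom")] = 0.86
probs[1, LABELS.index("E-Symptom")] = 0.86
tags, entities = decode(chars, probs)
```

Here the boundary (word segmentation) comes from the B- and E- prefixes and the entity category from the label suffix, matching the two sub-problems described earlier.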

Understandably, the parameters of the first neural network and the second neural network need to be trained by adopting a back propagation and gradient descent algorithm according to the electronic medical record of the identified named entity, so as to obtain better parameters, and improve the accuracy of the named entity identification.

Wherein, the character sequence acquisition of the electronic medical record of the identified named entity includes but is not limited to: running a script program to extract the current medical history part in the electronic medical record and converting the current medical history part into an xml file; importing the xml file into a labeling tool, and performing data labeling on a part of the xml file by a professional doctor; carrying out consistency detection on the data labeling result; if the detection result meets the expected threshold value, marking the rest files by a professional doctor; and running a script program to convert the file marked with the named entity into a training text required by the neural network.

For further explanation of the method for identifying named entities in electronic medical records in this embodiment, fig. 4 shows an illustration of a processing procedure of a neural network system in this embodiment. As shown in fig. 4, the neural network system includes a character embedding layer, a first neural network including a convolutional network layer, and a second neural network including a forward long-short term memory network layer and a backward long-short term memory network layer, and the process of the system for identifying the named entity of the electronic medical record is as follows:

1. Acquire the text of the electronic medical record and feed it to the character embedding layer in groups of 10 sentences at a time. The sentence length is set to the maximum sentence length K among the 10 sentences, the radical sequence length of each character is fixed at 10, the pre-trained word vectors have 100 dimensions, and the radical vectors have 50 dimensions, so that after the character embedding layer a group of 10 sentences forms a 10 × K × 100 word vector matrix and a 10 × K × 10 × 50 radical vector matrix.

2. Input the radical vector matrix obtained in step 1 into the convolutional network layer for processing, where the convolution kernel window size is 3, the number of convolution kernels is 30, and the pooling window is 2. The output of the convolutional network layer is a 10 × K × 30 radical convolution vector matrix, that is, the extracted internal morphological information of each character is represented by a 30-dimensional radical convolution vector. The radical convolution vectors in the radical convolution vector matrix are concatenated with the word vectors in the word vector matrix to obtain a 10 × K × 130 word feature vector matrix.

3. Pass the word feature vectors obtained in step 2 through a dropout layer to prevent the model from over-fitting, with the dropout rate set to 0.5, and then input them into the forward long short-term memory network and the backward long short-term memory network, whose hidden unit size is set to 64. The outputs of the forward and backward long short-term memory networks at each time step are concatenated to obtain a 10 × K × 128 hidden vector matrix.

4. Pass the hidden vector matrix obtained in step 3 through a fully connected layer whose size is the number N of labels in the training samples, to obtain a 10 × K × N probability matrix.

5. Since the output 10 × K × N matrix gives, for each character, the probabilities of the N labels, the label with the highest of the N probabilities may be selected as the label of that character. For example, the four characters of the sequence "neck, region, pain, pain" in FIG. 4 are labeled, in order, B-BodyPart (B-BOD in the figure), I-BodyPart (I-BOD in the figure), B-Symptom (B-SYM in the figure) and I-Symptom (I-SYM in the figure).
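To make step 2 concrete, the following sketch processes the radicals of a single character in plain Python. It is a simplified illustration under stated assumptions: the weights are random placeholders, there is no bias or activation function, and each kernel's responses are max-pooled over all window positions so that one value per kernel remains. The patent states a pooling window of 2 without spelling out how the 30-dimensional result is reached, so the pooling used here is an assumption. The name `conv1d_max_pool` is hypothetical.

```python
import random
from typing import List


def conv1d_max_pool(radicals: List[List[float]],
                    kernels: List[List[List[float]]],
                    window: int = 3) -> List[float]:
    """Slide each kernel over the radical sequence and keep its maximum response.

    radicals: R vectors of dimension D (one per radical position).
    kernels:  F kernels, each made of `window` weight vectors of dimension D.
    Returns one value per kernel (an F-dimensional vector).
    """
    positions = len(radicals) - window + 1
    features = []
    for kernel in kernels:
        responses = []
        for start in range(positions):
            total = 0.0
            for offset in range(window):
                vec = radicals[start + offset]
                weights = kernel[offset]
                total += sum(w * v for w, v in zip(weights, vec))
            responses.append(total)
        features.append(max(responses))
    return features


random.seed(0)
NUM_RADICALS, RADICAL_DIM, NUM_KERNELS, WORD_DIM = 10, 50, 30, 100

# Random placeholders for one character's radical vectors, kernels and word vector.
radicals = [[random.uniform(-1, 1) for _ in range(RADICAL_DIM)]
            for _ in range(NUM_RADICALS)]
kernels = [[[random.uniform(-0.1, 0.1) for _ in range(RADICAL_DIM)]
            for _ in range(3)] for _ in range(NUM_KERNELS)]
word_vector = [random.uniform(-1, 1) for _ in range(WORD_DIM)]

radical_feature = conv1d_max_pool(radicals, kernels)   # 30 values
char_feature = word_vector + radical_feature           # 100 + 30 = 130 values
```

Applying the same operation to every character of every sentence in a group gives the 10 × K × 30 and 10 × K × 130 shapes quoted in step 2.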

In this embodiment, the internal morphological features of the characters of the electronic medical record are extracted, and the character features and the internal morphological features are then input into the deep neural network to predict the character labels, which provides a method for identifying named entities of electronic medical records with high identification accuracy.

Further, referring to fig. 5, a second embodiment of the present invention provides a method for identifying named entities of electronic medical records based on the first embodiment, in which step S50 includes:

Step S60, inputting the character feature vector matrix into the bidirectional long-short term memory network for processing to obtain a hidden vector matrix corresponding to the character sequence.

Step S70, inputting the hidden vector matrix into a self-attention mechanism layer for processing to obtain a prediction matrix corresponding to the character sequence.

In researching methods for identifying named entities of electronic medical records, it was found that some entities depend on one another. Consider the following text from an electronic medical record: "Over the past 10 years the symptoms have appeared repeatedly and worsened year by year, appearing in winter and spring and after catching a cold; the patient went to a local hospital, where the doctor diagnosed chronic bronchitis with recurrent cough and expectoration." In this text, "winter and spring" and "catching a cold" are inducement (cause) entities, "chronic bronchitis" is a disease entity, and "cough" and "expectoration" are symptom entities. In an ordinary sentence "winter and spring" denotes time, but in the medical records used as training samples for the neural network in this embodiment it denotes an inducement, because the arrival of winter and spring induces the recurrence of seasonal diseases, and a professional doctor labels it as an inducement. The neural network should therefore rely mainly on the information of "chronic bronchitis", "cough" and "expectoration" when deciding the entity type of "winter and spring". For this reason, this embodiment adopts a self-attention mechanism, which calculates the dependency relationship between entities directly, regardless of the distance between them.

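For each hidden vector $h_t$, the corresponding attention weights are calculated. The score between two time steps is given by the following formula:

$$f_{t,t'} = \sigma\left(w_a \tanh\left(w_t h_t + w_{t'} h_{t'}\right)\right),$$

where $t$ and $t'$ represent different time steps, $w_a$, $w_t$ and $w_{t'}$ are weight vectors, $\sigma$ is the sigmoid function, and $h_t$ and $h_{t'}$ are the hidden vectors of the respective time steps. In the attention weight formula, $e$ is the exponential function and $N$ is the number of hidden vectors.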

The calculation of the attention weight is described in detail below with reference to fig. 6.

As shown in fig. 6, if the character sequence processed this time is {neck, region, pain, pain}, the character sequence length is 4, the hidden vectors input to the self-attention mechanism layer are h1, h2, h3 and h4, and N in the corresponding formula takes a value of 4.

Since the characters are input to the neural network system one after another in time order, each character in the character sequence corresponds to a different time step. In this example, the time steps of the four characters "neck, region, pain, pain" may be labeled t1, t2, t3 and t4, and the hidden vector of each character corresponds one-to-one to these time steps.

For each character to be recognized in the character sequence, the hidden vector of its time step is output, and the attention weights with respect to every time step other than its own need to be calculated. For example, the character "neck" is input at time step t1 and the hidden vector h1 is output for that time step; the other time steps are t2, t3 and t4, so the attention weights to be calculated are those between t1 and each of t2, t3 and t4. In the attention weight formula for this time step, the index k then ranges over t2, t3 and t4.

The attention weights obtained are multiplied by the corresponding hidden vectors to obtain the final attention vector.

And finally forming an attention vector matrix by the attention vectors corresponding to the plurality of hidden vectors.
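The attention computation for one time step can be sketched as follows. Only the score $f_{t,t'} = \sigma(w_a \tanh(w_t h_t + w_{t'} h_{t'}))$ comes from the text. Everything else is an assumption made for illustration: the scores are normalized with a softmax over the other time steps (suggested by the references to an exponential function and to k ranging over the other time steps), the attention vector is the weighted sum of the other hidden vectors, $w_t$ and $w_{t'}$ are treated as small projection matrices so that $w_a$ can be a vector, and all numeric values and function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score(h_t: Vector, h_k: Vector, w_t: Matrix, w_k: Matrix, w_a: Vector) -> float:
    """f_{t,k} = sigmoid(w_a . tanh(w_t h_t + w_k h_k))."""
    proj = [a + b for a, b in zip(matvec(w_t, h_t), matvec(w_k, h_k))]
    return sigmoid(sum(wa * math.tanh(p) for wa, p in zip(w_a, proj)))


def attention_vector(t: int, hidden: List[Vector],
                     w_t: Matrix, w_k: Matrix, w_a: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights over the other time steps, attention vector)."""
    others = [k for k in range(len(hidden)) if k != t]
    exps = [math.exp(score(hidden[t], hidden[k], w_t, w_k, w_a)) for k in others]
    total = sum(exps)
    weights = [e / total for e in exps]            # assumed softmax normalization
    dim = len(hidden[t])
    vec = [sum(w * hidden[k][d] for w, k in zip(weights, others)) for d in range(dim)]
    return weights, vec


# Hypothetical 2-dimensional hidden vectors h1..h4 and weights.
HIDDEN = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
W_T = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.1], [-0.3, 0.2]]
W_A = [0.7, -0.5]

weights_t1, attention_t1 = attention_vector(0, HIDDEN, W_T, W_K, W_A)
```

Because h2, h3 and h4 are identical in this toy input, the three weights for time step t1 come out equal and the attention vector equals that shared hidden vector.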

Since the hidden vector matrix and the attention vector matrix both contain the prediction information of the character sequence of the named entity to be identified, the attention hidden vector matrix including the total information needs to be generated according to the two vector matrices.

Specifically, each hidden vector in the hidden vector matrix is concatenated with the corresponding attention vector in the attention vector matrix.

For example, given a hidden vector matrix $H = (H_1, H_2, \ldots, H_n)$ and an attention vector matrix, written here as $\tilde{H} = (\tilde{H}_1, \tilde{H}_2, \ldots, \tilde{H}_n)$, where $H_1, H_2, \ldots, H_n$ and $\tilde{H}_1, \tilde{H}_2, \ldots, \tilde{H}_n$ are all vectors, $H_i$ and $\tilde{H}_i$ are concatenated to obtain a new vector $H'_i = [H_i ; \tilde{H}_i]$. Concatenating every hidden vector with its attention vector in this way yields the corresponding attention hidden vector matrix $H' = (H'_1, H'_2, \ldots, H'_n)$.

The obtained attention hidden vector matrix is input into the fully connected layer for processing to obtain a prediction matrix corresponding to the character sequence.

And step S80, inputting the prediction matrix into the conditional random field model for processing to obtain the named entity recognition result of the electronic medical record.

If character labels are predicted independently, directly from the hidden vectors produced by the Bi-LSTM network layer or the self-attention mechanism layer, the dependency relationships among the labels are not taken into account, and improvement of the prediction accuracy hits a bottleneck. For example, the label predicted after I-Symptom might be I-Disease, which is clearly an erroneous label sequence. In the named entity recognition task, labels usually have strong dependency relationships: for example, the label following B-Symptom cannot be I-Disease, and of the I- labels only I-Symptom can follow B-Symptom.
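The kind of label dependency described above can be expressed as a small transition check. The sketch below is illustrative only and encodes one lenient rule set: an I- or E- label is accepted only directly after a B- or I- label of the same category, while a B- label may follow anything. A stricter BIOES reading would also require every entity to be closed by an E- label. The function names are hypothetical.

```python
from typing import List, Tuple


def is_valid_transition(prev: str, nxt: str) -> bool:
    """Check whether label `nxt` may directly follow label `prev`."""
    prev_prefix, _, prev_type = prev.partition("-")
    next_prefix, _, next_type = nxt.partition("-")
    if next_prefix in ("I", "E"):
        # Continuing or ending an entity needs an open entity of the same category.
        return prev_prefix in ("B", "I") and prev_type == next_type
    # A B- label starts a new entity and is accepted after any label.
    return True


def find_violations(labels: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, previous label, label) for every invalid adjacent pair."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(labels, labels[1:]), start=1)
            if not is_valid_transition(a, b)]


violations = find_violations(["B-Symptom", "I-Symptom", "I-Disease", "B-Check"])
```

A conditional random field learns such constraints as transition scores rather than applying them as hard rules.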

Therefore, in order to further improve the accuracy of named entity recognition, a Conditional Random Field (CRF) model is used in the present embodiment for the final character label prediction. The CRF model overcomes the independence assumption of the Hidden Markov Model and solves the label bias problem of the Maximum Entropy Markov Model. The working principle of the CRF model is explained below.

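For an input sequence $x = (x_1, x_2, \ldots, x_n)$, let $P$ be the matrix obtained after the attention network, $P \in \mathbb{R}^{n \times s}$, where $s$ is the number of labels and $P_{ij}$ is the score of predicting the $j$-th label for the $i$-th character in the input sequence. For a predicted sequence $y = (y_1, y_2, \ldots, y_n)$, its score is defined as:

$$\operatorname{score}(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},$$

where $A$ represents a transition matrix, $A \in \mathbb{R}^{(s+2) \times (s+2)}$, and $A_{ij}$ represents the probability (score) of a transition from label $i$ to label $j$. The two extra rows and columns of $A$ correspond to a start label $y_0$ and an end label $y_{n+1}$ added to the sequence. Applying softmax over all possible label sequences then yields the probability of sequence $y$:

$$p(y \mid x) = \frac{e^{\operatorname{score}(x, y)}}{\sum_{\tilde{y} \in Y_x} e^{\operatorname{score}(x, \tilde{y})}}.$$

The log probability of the correct label sequence is maximized during the training process:

$$\log p(y \mid x) = \operatorname{score}(x, y) - \log \sum_{\tilde{y} \in Y_x} e^{\operatorname{score}(x, \tilde{y})},$$

where $Y_x$ denotes all possible label sequences, including those that do not satisfy the BIOES labeling scheme constraints. In decoding, the output sequence with the maximum score is predicted:

$$y^{*} = \arg\max_{\tilde{y} \in Y_x} \operatorname{score}(x, \tilde{y}).$$

The CRF model can be trained and decoded efficiently by employing the Viterbi algorithm.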
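As a minimal sketch of the decoding step, the code below scores a label sequence and finds the best-scoring one with the Viterbi algorithm. It assumes the usual linear-chain form in which a sequence's score is the sum of its transition scores (including transitions from a start label and to an end label, stored as the last two rows and columns of the transition matrix) and its per-character label scores. The function names and the toy numbers are hypothetical, and the training objective is not shown.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def sequence_score(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """Sum of transition scores (with start/end labels) and per-character scores."""
    s = len(P[0])
    start, end = s, s + 1
    path = [start] + list(y) + [end]
    transitions = sum(A[path[i]][path[i + 1]] for i in range(len(path) - 1))
    emissions = sum(P[i][label] for i, label in enumerate(y))
    return transitions + emissions


def viterbi(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the label sequence with the maximum score, and that score."""
    n, s = len(P), len(P[0])
    start, end = s, s + 1
    best = [A[start][j] + P[0][j] for j in range(s)]
    backpointers = []
    for i in range(1, n):
        pointers, current = [], []
        for j in range(s):
            k = max(range(s), key=lambda prev: best[prev] + A[prev][j])
            pointers.append(k)
            current.append(best[k] + A[k][j] + P[i][j])
        backpointers.append(pointers)
        best = current
    last = max(range(s), key=lambda j: best[j] + A[j][end])
    total = best[last] + A[last][end]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total


# Toy problem: 3 characters, 2 labels; indices 2 and 3 of A are the start and end labels.
P_TOY = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
A_TOY = [[0.0, -5.0, 0.0, 0.0],   # label 0 -> label 1 is heavily penalized
         [0.0, 0.0, 0.0, 0.0],
         [0.0, -1.0, 0.0, 0.0],   # starting with label 1 is mildly penalized
         [0.0, 0.0, 0.0, 0.0]]

best_path, best_score = viterbi(P_TOY, A_TOY)
```

In this toy problem, choosing labels independently per character would pick the sequence 0, 1, 0, but the transition penalty makes 0, 0, 0 the best-scoring sequence overall.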

Finally, the method for identifying the named entity of the electronic medical record in the embodiment is further described with reference to fig. 6. Fig. 6 shows a schematic structure of a neural network system according to this embodiment, where the neural network system includes a character embedding layer, a first neural network including a radical CNN convolutional layer, and a second neural network including a bidirectional LSTM layer, a self-attention mechanism layer, and a conditional random field model, and a process of identifying a named entity in an electronic medical record by the system is as follows:

2. Input the radical vector matrix obtained in step 1 into the radical CNN convolutional network layer for processing, where the convolution kernel window size is 3, the number of convolution kernels is 30, and the pooling window is 2. The output of the radical CNN convolutional network layer is a 10 × K × 30 radical convolution vector matrix, that is, the extracted internal morphological information of each character is represented by a 30-dimensional radical convolution vector. The radical convolution vectors in the radical convolution vector matrix are concatenated with the word vectors in the word vector matrix to obtain a 10 × K × 130 word feature vector matrix.

3. Pass the word feature vectors obtained in step 2 through a dropout layer to prevent the model from over-fitting, with the dropout rate set to 0.5, and then input them into a bidirectional LSTM network whose hidden unit size is set to 64. The outputs of the bidirectional LSTM at each time step are concatenated to obtain a 10 × K × 128 hidden vector matrix.

4. Process the hidden vector matrix obtained in step 3 sequentially through the self-attention mechanism layer and the conditional random field model to obtain a 10 × K × N prediction probability matrix.

5. Since the output 10 × K × N matrix gives, for each character, the probabilities of the N labels, the label with the highest of the N probabilities may be selected as the label of that character.
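The matrix sizes quoted in these steps follow from simple arithmetic, which the short sketch below reproduces. The function name `pipeline_shapes` and its parameter names are hypothetical; the default values are the ones stated in the text (groups of 10 sentences, 10 radicals per character, 100-dimensional word vectors, 50-dimensional radical vectors, 30 convolution kernels, 64 hidden units per LSTM direction).

```python
from typing import Dict, Tuple


def pipeline_shapes(max_len_k: int, num_labels_n: int, batch: int = 10,
                    radicals: int = 10, word_dim: int = 100, radical_dim: int = 50,
                    kernels: int = 30, lstm_hidden: int = 64) -> Dict[str, Tuple[int, ...]]:
    """Return the shape of each intermediate matrix for one group of sentences."""
    return {
        "word_vectors": (batch, max_len_k, word_dim),
        "radical_vectors": (batch, max_len_k, radicals, radical_dim),
        "radical_conv": (batch, max_len_k, kernels),
        # Concatenating word vector and radical convolution vector: 100 + 30 = 130.
        "char_features": (batch, max_len_k, word_dim + kernels),
        # Concatenating forward and backward LSTM outputs: 64 + 64 = 128.
        "hidden": (batch, max_len_k, 2 * lstm_hidden),
        "prediction": (batch, max_len_k, num_labels_n),
    }


shapes = pipeline_shapes(max_len_k=20, num_labels_n=15)
```

Here K = 20 is an arbitrary example sentence length and N = 15 matches the 15 labels of the first embodiment.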

Further, referring to fig. 7, a third embodiment of the present invention provides a method for identifying named entities of electronic medical records based on the first embodiment, where the step S50 includes:

Step S90, inputting the character feature vector matrix into the bidirectional long-short term memory network for processing to obtain a hidden vector matrix corresponding to the character sequence.

And S100, inputting the hidden vector matrix into a self-attention mechanism layer for processing to obtain a named entity identification result of the electronic medical record.

It can be understood that, to suit different application scenarios or processing resources, this embodiment, which is based on the first embodiment, differs from the second embodiment in that, as shown in fig. 8, the second neural network includes the self-attention mechanism layer but does not include a conditional random field model.

After the hidden vector matrix output by the bidirectional long and short term memory network is input into the self-attention mechanism layer, the attention weight of the hidden vector in the hidden vector matrix is calculated, an attention vector matrix is generated according to the attention weight and the hidden vector, an attention hidden vector matrix is generated according to the hidden vector matrix and the attention vector matrix, and finally the attention hidden vector matrix is input into the full-connection layer to be processed to obtain a prediction probability matrix corresponding to the character sequence to be recognized.

Each probability value in the prediction probability matrix obtained in the above step is a label classification probability predicted for the character sequence to be recognized, and the label with the highest probability value is selected as the predicted label of the corresponding character. After the predicted label of each character in the character sequence is determined, the character sequence can be segmented into words and the entity categories determined according to the meaning of the labels, which completes named entity identification.

In the embodiment, morphological features inside the electronic medical record characters are extracted through the convolutional neural network, the features of the characters and the morphological features inside the characters are sequentially input into the bidirectional long-short term memory network layer and the self-attention mechanism layer in the deep neural network to predict the character tags, and the accurate and efficient electronic medical record named entity recognition method is provided.

Further, referring to fig. 9, a fourth embodiment of the present invention provides a method for identifying named entities of electronic medical records based on the first embodiment, where the step S50 includes:

Step S110, inputting the character feature vector matrix into a bidirectional long-short term memory network for processing to obtain a hidden vector matrix corresponding to the character sequence.

And step S120, inputting the hidden vector matrix into a conditional random field model for processing to obtain a named entity recognition result of the electronic medical record.

It can be understood that, to suit different application scenarios or processing resources, this embodiment, which is based on the first embodiment, differs from the second embodiment in that, as shown in fig. 10, the second neural network includes the conditional random field model but does not include the self-attention mechanism layer.

After the hidden vector matrix output by the bidirectional long-short term memory network is input into the conditional random field model, it is processed to obtain a prediction probability matrix corresponding to the character sequence to be recognized.

In the embodiment, morphological features inside the electronic medical record characters are extracted through the convolutional neural network, the features of the characters and the morphological features inside the characters are sequentially input into the bidirectional long-short term memory network layer and the conditional random field model in the deep neural network to predict the character tags, and the accurate and efficient electronic medical record named entity recognition method is provided.

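The invention also provides an electronic medical record named entity recognition device, which comprises: a memory, a processor, and an electronic medical record named entity recognition processing program that is stored on the memory and can run on the processor, wherein the electronic medical record named entity recognition processing program, when executed by the processor, realizes the steps of the method for recognizing the electronic medical record named entity described above.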

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where an electronic medical record named entity identification processing program is stored on the computer-readable storage medium, and when the electronic medical record named entity identification processing program is executed by a processor, the steps of the method for identifying an electronic medical record named entity are implemented.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for identifying named entities of electronic medical records is characterized by comprising the following steps:

generating a radical vector matrix corresponding to the character sequence;

splicing the word vectors in the word vector matrix with the radical convolution vectors in the radical convolution vector matrix to obtain corresponding word characteristic vectors, and generating a word characteristic vector matrix based on the word characteristic vectors;

2. The method for identifying named entities in electronic medical records according to claim 1, wherein the step of generating the radical vector matrix corresponding to the character sequence comprises:

3. The method for identifying named entities in electronic medical records according to claim 2, wherein the second neural network further comprises a full connection layer, and the step of inputting the word feature vector matrix into the second neural network for processing to obtain the identifying result of the named entities in the electronic medical records comprises:

4. The method for identifying named entities in electronic medical records according to claim 2, wherein the second neural network further comprises a self-attention mechanism layer, and the step of inputting the word feature vector matrix into the second neural network for processing to obtain the identifying result of the named entities in the electronic medical records comprises:

5. The method for identifying named entities in electronic medical records according to claim 2, wherein the second neural network further comprises a self-attention mechanism layer and a conditional random field model, and the step of inputting the word feature vector matrix into the second neural network for processing to obtain the identifying result of the named entities in the electronic medical records comprises:

6. The method for identifying named entities in electronic medical records according to claim 5, wherein the self-attention mechanism layer includes a fully connected layer, and the step of inputting the hidden vector matrix into the self-attention mechanism layer for processing to obtain the prediction matrix corresponding to the character sequence includes:

calculating attention weights of hidden vectors in the hidden vector matrix;

7. The method for identifying named entities in electronic medical records according to claim 6, wherein the step of calculating the attention weight of the hidden vector in the hidden vector matrix comprises:

$$f_{t,t'} = \sigma\left(w_a \tanh\left(w_t h_t + w_{t'} h_{t'}\right)\right),$$

where $t$ and $t'$ represent different time steps, $w_a$, $w_t$ and $w_{t'}$ are weight vectors, $\sigma$ is the sigmoid function, and $h_t$ and $h_{t'}$ are the hidden vectors of different time steps;

Wherein e is an exponential function, N is the number of the hidden vectors,

8. The method for identifying named entities in electronic medical records according to claim 2, wherein the second neural network further comprises a conditional random field model, and the step of inputting the word feature vector matrix into the second neural network for processing to obtain the identifying result of the named entities in the electronic medical records comprises:

9. An electronic medical record named entity recognition device, characterized in that the electronic medical record named entity recognition device includes: a memory, a processor, and an electronic medical record named entity recognition processing program that is stored on the memory and can run on the processor, wherein the electronic medical record named entity recognition processing program, when executed by the processor, realizes the steps of the electronic medical record named entity recognition method according to any one of claims 1 to 8.

10. A storage medium, characterized in that the storage medium stores thereon a processing program for named entity identification of electronic medical record, and the processing program for named entity identification of electronic medical record realizes the steps of the method for named entity identification of electronic medical record according to any one of claims 1 to 8 when being executed by a processor.