CN116451690A - Medical field named entity identification method - Google Patents
- Publication number: CN116451690A (application number CN202310282404.3A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention provides a method for identifying named entities in the medical field, characterized by comprising the following steps: S1, acquiring data related to the medical field and labeling the medical data; S2, augmenting the labeled data with the EDA method, comprising synonym replacement, random insertion, random swap, and random deletion; S3, constructing a Bert (pre-trained on a large medical data set, with its position encodings extended by a hierarchical decomposition method) + Bi-GRU (fused with an attention mechanism) + CRF model; S4, tuning and optimizing the model with 10-fold cross-validation. The invention effectively alleviates the problem of data scarcity, effectively extracts complex medical entities from very long text, and uses K-fold cross-validation to tune and optimize the model.
Description
Technical Field
The invention relates to the application of artificial intelligence technology in the medical field, and in particular to a method for identifying named entities in the medical field.
Background
Medical named entity recognition refers to identifying the boundaries of medical entities in medical text and judging their categories; common medical entity categories include disease names, body parts, drug information, examination or test items, symptoms, and the like. The accuracy of medical named entity recognition affects downstream tasks such as event extraction and relation extraction. It is a key task of medical text mining and provides a foundation for building medical ICD coding systems, health care systems, intelligent medical question-answering systems, and medical knowledge graphs, so a good medical named entity recognition method is of far-reaching significance.
Existing medical named entity recognition technologies ignore the problem of data scarcity, the deep learning methods used are too simple to produce good results on complex medical entities and the very long texts of electronic medical records, and the models lack an optimization procedure.
Disclosure of Invention
The invention aims to provide a method for identifying named entities in the medical field, in order to solve the problems of the prior art noted in the background: the problem of data scarcity is ignored, the deep learning methods used are too simple and often fail to produce good results on complex medical entities and the very long texts of electronic medical records, and the models lack an optimization procedure.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a method of medical domain named entity identification, comprising the steps of:
s1, acquiring related data in the medical field, and marking the medical data;
s2, augmenting the labeled data (mainly the unlabeled words) with the EDA method (a data augmentation technique from text classification), which comprises: synonym replacement, random insertion, random swap, and random deletion;
s3, constructing a Bert (pre-trained on a large medical data set, with its position encodings extended by a hierarchical decomposition method) + Bi-GRU (fused with an attention mechanism) + CRF model;
s4, tuning the parameters and optimizing the model with 10-fold cross-validation.
The step S1 of acquiring data related to the medical field specifically comprises the following steps:
s1.1, acquiring electronic medical record data by interfacing with medical institutions;
s1.2, crawling medical-field data from the web, which specifically comprises the following steps:
s1.2.1, acquiring the URL of the target medical data;
s1.2.2, submitting an HTTP request to the corresponding URL;
s1.2.3, parsing the HTTP response;
s1.2.4, storing the parsed result.
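The four crawling sub-steps can be sketched as follows. This is a minimal illustration, not the patent's implementation; the record fields and file format are assumptions.

```python
import json
import urllib.request

def fetch(url):
    """S1.2.1/S1.2.2: submit an HTTP request to the target URL, return the raw body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")

def parse_and_store(url, body, out_path):
    """S1.2.3/S1.2.4: parse the response into a record and store it as JSON."""
    record = {"url": url, "text": body.strip()}  # minimal "parsing": whitespace cleanup
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)
    return record
```

In practice `fetch` would be followed by HTML extraction before storage; the two-function split simply mirrors the patent's step boundaries.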
In step S1, the acquired medical data is labeled; the main label types include: diagnosis, surgery, treatment, examination, drug, and body site.
The processing of the labeled data in step S2 specifically includes the following:
(1) synonym replacement: randomly select 1-10 non-stop words from the sentence; replace each selected word with a randomly chosen synonym;
(2) random insertion: find a non-stop word in the sentence, randomly select one of its synonyms, and insert it at an arbitrary position in the sentence; repeat 1-10 times;
(3) random swap: arbitrarily select two words in the sentence and exchange their positions; repeat 1-10 times;
(4) random deletion: each word in the sentence is randomly deleted with probability 0.1, and is otherwise kept.
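The four EDA operations can be sketched as below. This is an illustrative word-level sketch; the toy synonym table and stop-word list stand in for a real medical thesaurus, which the patent does not specify.

```python
import random

# Toy synonym table and stop-word list (assumptions, standing in for a medical thesaurus).
SYNONYMS = {"pain": ["ache"], "severe": ["acute"], "stomach": ["abdomen"]}
STOP_WORDS = {"the", "a", "in", "of"}

def synonym_replace(words, n=1):
    """(1) Replace up to n randomly chosen non-stop words that have a synonym."""
    out = words[:]
    candidates = [i for i, w in enumerate(out) if w not in STOP_WORDS and w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insert(words, n=1):
    """(2) Insert a synonym of a random non-stop word at a random position, n times."""
    out = words[:]
    for _ in range(n):
        candidates = [w for w in out if w in SYNONYMS]
        if not candidates:
            break
        syn = random.choice(SYNONYMS[random.choice(candidates)])
        out.insert(random.randrange(len(out) + 1), syn)
    return out

def random_swap(words, n=1):
    """(3) Swap the positions of two randomly chosen words, n times."""
    out = words[:]
    for _ in range(n):
        i, j = random.randrange(len(out)), random.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_delete(words, p=0.1):
    """(4) Drop each word independently with probability p."""
    out = [w for w in words if random.random() > p]
    return out or words[:1]  # never return an empty sentence
```

Chinese medical text would operate on characters or segmented words rather than space-separated tokens; the control flow is the same.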
The step S3 specifically includes the following:
s3.1, obtaining from the Internet a Bert model pre-trained on a large-scale medical data set;
s3.2, constructing a Bert layer with hierarchically decomposed position encodings;
s3.3, constructing a Bi-GRU layer with an attention mechanism.
the step S3.2 specifically includes the following:
specifically, let the maximum position code length that Bert defaults to be trainable be n and the corresponding position code vector be p 1 ,p 2 ,···,p n The new coding vector which can be constructed in turn by the method is q 1 ,q 2 ,···,q m Wherein m=n 2 ,
q (i-1)×n+j =au i +(1+a)u j
Where i is the position index of the first layer, j is the position index of the second layer, n is the Bert layer length, a ε (0, 1) and a+.0.5.
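The hierarchical decomposition above can be sketched directly. A minimal illustration with NumPy standing in for trainable embedding tables; the two nested loops index the first and second layers:

```python
import numpy as np

def hierarchical_position_codes(p, a=0.4):
    """Expand n position vectors p (n x d) into n*n vectors via
    q_{(i-1)*n + j} = a * p_i + (1 - a) * p_j  (1-based i, j in the text;
    0-based here). Note q_{(i-1)*n + i} = p_i, so the original codes survive."""
    n, d = p.shape
    q = np.empty((n * n, d))
    for i in range(n):          # first-layer position index
        for j in range(n):      # second-layer position index
            q[i * n + j] = a * p[i] + (1 - a) * p[j]
    return q
```

The constraint a ≠ 0.5 mirrors the patent's condition; it keeps the two layers distinguishable.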
The step S3.3 specifically includes the following:
For any time step i, given a minibatch of input data X_i ∈ R^{n×d}, where n is the batch length and d is the vector length, let the hidden-layer activation function be φ. The forward and reverse hidden states of this time step are l_i ∈ R^{n×h} and r_i ∈ R^{n×h} respectively, where h is the number of hidden units. The forward and reverse hidden states are updated as

l_i = φ(X_i·W_xl + l_{i-1}·W_ll + b_l)
r_i = φ(X_i·W_xr + r_{i+1}·W_rr + b_r)

where W_xl, W_xr ∈ R^{d×h} and W_ll, W_rr ∈ R^{h×h} are the weight terms and b_l, b_r ∈ R^{1×h} are the bias terms; X_i is formed according to the self-attention mechanism.
Here r_i corresponds to the i-th data item after embedding. Taking r_i as the query and all the input data r_1, r_2, ..., r_n as the keys and values, with attention scoring function f,

X_i = concat(r_i, Σ_{j=1}^{n} β(r_i, r_j)·r_j)

where concat denotes the concatenation of two vectors, and β(r_i, r_j) is obtained by mapping the query and key vectors to a scalar through the attention scoring function (here the scaled dot-product scoring function f(q, k) = q^T·k / √d) and then applying the softmax function:

β(r_i, r_j) = exp(f(r_i, r_j)) / Σ_{k=1}^{n} exp(f(r_i, r_k))
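The self-attention construction of X_i can be sketched as below: each embedding acts as the query against all embeddings as keys and values under the scaled dot-product score, and the attended vector is concatenated to the original. A minimal NumPy sketch, not the patent's trained layer:

```python
import numpy as np

def self_attention_inputs(r):
    """Given embeddings r (n x d), compute X with rows
    X_i = concat(r_i, sum_j beta(r_i, r_j) * r_j), where
    beta = softmax over j of f(r_i, r_j) = r_i . r_j / sqrt(d)."""
    n, d = r.shape
    scores = r @ r.T / np.sqrt(d)                    # f(r_i, r_j) for all pairs
    beta = np.exp(scores - scores.max(axis=1, keepdims=True))
    beta /= beta.sum(axis=1, keepdims=True)          # row-wise softmax over j
    attended = beta @ r                              # sum_j beta(r_i, r_j) r_j
    return np.concatenate([r, attended], axis=1)     # X_i in R^{2d}
```

Subtracting the row maximum before exponentiating is the standard numerically stable softmax; it does not change β.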
the step S4 specifically includes the following:
s4.1, dividing data into K parts (K is 10 here), taking 1 part of the data as a test set and the rest as a training set, obtaining 10 training sets and verification sets here, training a model sequentially by the data, and obtaining 10 error average values;
and S4.2, reasonably adjusting the model hyper-parameters and the neural network structure, repeating the step S4.1, finding the model with the optimal error result, and training the optimal model by using all data.
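Step S4.1 can be sketched as a generic K-fold loop. The `train_and_eval` callable is an illustrative placeholder for the patent's model training and error evaluation:

```python
import numpy as np

def kfold_cv_errors(data, labels, train_and_eval, k=10):
    """S4.1: split the data into k parts; hold each part out in turn as the
    validation set, train on the rest, and collect the k validation errors."""
    idx = np.arange(len(data))
    folds = np.array_split(idx, k)
    errors = []
    for held in range(k):
        val = folds[held]
        train = np.concatenate([folds[f] for f in range(k) if f != held])
        errors.append(train_and_eval(data[train], labels[train],
                                     data[val], labels[val]))
    return float(np.mean(errors)), errors
```

Step S4.2 then wraps this in a hyper-parameter search and retrains the best configuration on all of the data.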
Compared with the prior art, the invention has the following beneficial effects:
The invention can be used to extract named entities from medical text. EDA data augmentation is applied to the original data; the model uses a Bert pre-trained on a large-scale medical data set as the embedding layer, hierarchically decomposes the position encodings of the Bert layer, and builds a Bi-GRU + CRF layer with a self-attention mechanism; the model is trained and validated by 10-fold cross-validation, with the model hyper-parameters and the neural-network structure adjusted accordingly.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a flow chart of the invention for acquiring data related to the medical field and labeling the medical data;
FIG. 3 is a schematic diagram of a hierarchical decomposition of a position code when constructing a Bert layer containing the hierarchical decomposition of the position code according to the present invention;
FIG. 4 is a schematic diagram of a method for constructing a Bi-GRU layer containing an attention mechanism according to the present invention;
FIG. 5 is a flow chart showing how X_i is formed in step S3.3 of the present invention;
FIG. 6 is a specific flow of model construction in step S3 of the present invention;
fig. 7 is a flowchart showing the step S4 of the present invention.
Detailed Description
In order to clarify the technical problems, technical solutions, implementation process, and performance, the present invention will be further described in detail below with reference to examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the invention. Various exemplary embodiments, features, and aspects of the disclosure are described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Example 1
As shown in fig. 1, a method for identifying named entities in a medical field includes the following steps:
s1, acquiring related data in the medical field, and marking the medical data;
s2, augmenting the labeled data (mainly the unlabeled words) with the EDA method (a data augmentation technique from text classification), which comprises: synonym replacement, random insertion, random swap, and random deletion;
s3, constructing a Bert (pre-trained on a large medical data set, with its position encodings extended by a hierarchical decomposition method) + Bi-GRU (fused with an attention mechanism) + CRF model;
s4, tuning the parameters and optimizing the model with 10-fold cross-validation.
As shown in fig. 2, the step S1 of acquiring data related to the medical field specifically includes the following steps:
s1.1, acquiring electronic medical record data by interfacing with medical institutions;
s1.2, crawling medical-field data from the web, which specifically comprises the following steps:
s1.2.1, acquiring the URL of the target medical data;
s1.2.2, submitting an HTTP request to the corresponding URL;
s1.2.3, parsing the HTTP response;
s1.2.4, storing the parsed result.
In step S1, the acquired medical data is labeled; the main label types include: diagnosis, surgery, treatment, examination, drug, and body site.
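Labeled data of this kind is commonly encoded per character in a BIO scheme over the six label types. A minimal sketch; the tag abbreviation `DIAG` and the example span are illustrative assumptions, not from the patent:

```python
def bio_tags(sentence, spans):
    """Turn (start, end, type) entity spans into per-character BIO tags,
    e.g. a 'DIAG' span yields 'B-DIAG' at its first character and
    'I-DIAG' on the rest; all other characters get 'O'."""
    tags = ["O"] * len(sentence)
    for start, end, etype in spans:        # end is exclusive
        tags[start] = "B-" + etype
        for k in range(start + 1, end):
            tags[k] = "I-" + etype
    return tags
```

The resulting tag sequence is what the CRF layer of step S3 predicts.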
The processing of the labeled data in step S2 specifically includes the following:
(1) synonym replacement: randomly select 1-10 non-stop words from the sentence; replace each selected word with a randomly chosen synonym;
(2) random insertion: find a non-stop word in the sentence, randomly select one of its synonyms, and insert it at an arbitrary position in the sentence; repeat 1-10 times;
(3) random swap: arbitrarily select two words in the sentence and exchange their positions; repeat 1-10 times;
(4) random deletion: each word in the sentence is randomly deleted with probability 0.1, and is otherwise kept.
The step S3 specifically includes the following:
s3.1, obtaining from the Internet a Bert model pre-trained on a large-scale medical data set;
s3.2, constructing a Bert layer with hierarchically decomposed position encodings;
s3.3, constructing a Bi-GRU layer with an attention mechanism.
as shown in fig. 3, the step S3.2 specifically includes the following:
specifically, let the maximum position code length that Bert defaults to be trainable be n and the corresponding position code vector be p 1 ,p 2 ,···,p n The new coding vector which can be constructed in turn by the method is q 1 ,q 2 ,···,q m Wherein m=n 2 ,
q (i-1)×n+j =au i +(1+a)u j
Where i is the position index of the first layer, j is the position index of the second layer, n is the Bert layer length, a ε (0, 1) and a+.0.5.
As shown in fig. 4, the step S3.3 specifically includes the following:
For any time step i, given a minibatch of input data X_i ∈ R^{n×d}, where n is the batch length and d is the vector length, let the hidden-layer activation function be φ. The forward and reverse hidden states of this time step are l_i ∈ R^{n×h} and r_i ∈ R^{n×h} respectively, where h is the number of hidden units. The forward and reverse hidden states are updated as

l_i = φ(X_i·W_xl + l_{i-1}·W_ll + b_l)
r_i = φ(X_i·W_xr + r_{i+1}·W_rr + b_r)

where W_xl, W_xr ∈ R^{d×h} and W_ll, W_rr ∈ R^{h×h} are the weight terms and b_l, b_r ∈ R^{1×h} are the bias terms; X_i is formed according to the self-attention mechanism, and the specific flow of its formation is shown in fig. 5.
Here r_i corresponds to the i-th data item after embedding. Taking r_i as the query and all the input data r_1, r_2, ..., r_n as the keys and values, with attention scoring function f,

X_i = concat(r_i, Σ_{j=1}^{n} β(r_i, r_j)·r_j)

where concat denotes the concatenation of two vectors, and β(r_i, r_j) is obtained by mapping the query and key vectors to a scalar through the attention scoring function (here the scaled dot-product scoring function f(q, k) = q^T·k / √d) and then applying the softmax function:

β(r_i, r_j) = exp(f(r_i, r_j)) / Σ_{k=1}^{n} exp(f(r_i, r_k))
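The CRF layer that sits on top of the Bi-GRU outputs is decoded at inference time with the Viterbi algorithm. A minimal sketch; the emission and transition matrices are illustrative stand-ins for the trained model's scores:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence under a linear-chain CRF.
    emissions: (T, K) per-token tag scores (e.g. from the Bi-GRU layer);
    transitions: (K, K) tag-transition scores, transitions[a, b] scoring
    a move from tag a to tag b."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag at t=0
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)   # best previous tag for each current tag
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow backpointers to recover the path
        best.append(int(back[t][best[-1]]))
    return best[::-1], float(score.max())
```

The transition matrix is what lets the CRF forbid invalid tag sequences (e.g. an I- tag without a preceding B- tag), which per-token classification alone cannot do.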
as shown in fig. 6, the step S4 specifically includes the following:
s4.1, dividing data into K parts (K is 10 here), taking 1 part of the data as a test set and the rest as a training set, obtaining 10 training sets and verification sets here, training a model sequentially by the data, and obtaining 10 error average values;
and S4.2, reasonably adjusting the model hyper-parameters and the neural network structure, repeating the step S4.1, finding the model with the optimal error result, and training the optimal model by using all data.
The foregoing has shown and described the basic principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments; the above embodiments and descriptions merely illustrate the principles of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (7)
1. A method for identifying named entities in a medical field, comprising the steps of:
s1, acquiring related data in the medical field, and marking the medical data;
s2, augmenting the labeled data with the EDA method, comprising: synonym replacement, random insertion, random swap, and random deletion;
s3, constructing a Bert (pre-trained on a large medical data set, with its position encodings extended by a hierarchical decomposition method) + Bi-GRU (fused with an attention mechanism) + CRF model;
s4, tuning the parameters and optimizing the model with 10-fold cross-validation.
2. The method for identifying named entities in the medical field according to claim 1, wherein the step S1 of acquiring data related to the medical field specifically includes the following steps:
s1.1, acquiring electronic medical record data by interfacing with medical institutions;
s1.2, crawling medical-field data from the web, which specifically comprises the following steps:
s1.2.1, acquiring the URL of the target medical data;
s1.2.2, submitting an HTTP request to the corresponding URL;
s1.2.3, parsing the HTTP response;
s1.2.4, storing the parsed result;
wherein in step S1 the acquired medical data is labeled, and the main label types include: diagnosis, surgery, treatment, examination, drug, and body site.
3. The method for identifying named entities in the medical field according to claim 1, wherein the processing of the labeled data in step S2 specifically includes the following:
(1) synonym replacement: randomly select 1-10 non-stop words from the sentence; replace each selected word with a randomly chosen synonym;
(2) random insertion: find a non-stop word in the sentence, randomly select one of its synonyms, and insert it at an arbitrary position in the sentence; repeat 1-10 times;
(3) random swap: arbitrarily select two words in the sentence and exchange their positions; repeat 1-10 times;
(4) random deletion: each word in the sentence is randomly deleted with probability 0.1, and is otherwise kept.
4. The method for identifying named entities in the medical field according to claim 1, wherein the step S3 comprises the following steps:
s3.1, obtaining from the Internet a Bert model pre-trained on a large-scale medical data set;
s3.2, constructing a Bert layer with hierarchically decomposed position encodings;
s3.3, constructing a Bi-GRU layer with an attention mechanism.
5. The method for identifying named entities in the medical field according to claim 4, wherein the step S3.2 specifically comprises the following:
Let the maximum trainable position encoding length of Bert by default be n, with corresponding position encoding vectors p_1, p_2, ..., p_n; the method constructs new encoding vectors q_1, q_2, ..., q_m, where m = n^2, as

q_{(i-1)×n+j} = a·p_i + (1-a)·p_j

where i is the position index of the first layer, j is the position index of the second layer, n is the Bert position encoding length, and a ∈ (0, 1) with a ≠ 0.5.
6. The method for identifying named entities in the medical field according to claim 4, wherein the step S3.3 specifically comprises the following:
For any time step i, given a minibatch of input data X_i ∈ R^{n×d}, let the hidden-layer activation function be φ; the forward and reverse hidden states of this time step are l_i ∈ R^{n×h} and r_i ∈ R^{n×h} respectively, where h is the number of hidden units, and they are updated as

l_i = φ(X_i·W_xl + l_{i-1}·W_ll + b_l)
r_i = φ(X_i·W_xr + r_{i+1}·W_rr + b_r)

where W_xl, W_xr ∈ R^{d×h} and W_ll, W_rr ∈ R^{h×h} are the weight terms and b_l, b_r ∈ R^{1×h} are the bias terms; X_i is formed according to the self-attention mechanism;
wherein r_i corresponds to the i-th data item after embedding; taking r_i as the query and all the input data r_1, r_2, ..., r_n as the keys and values, with attention scoring function f,

X_i = concat(r_i, Σ_{j=1}^{n} β(r_i, r_j)·r_j)

where concat denotes the concatenation of two vectors, and β(r_i, r_j) is obtained by mapping the query and key vectors to a scalar through the attention scoring function (here the scaled dot-product scoring function f(q, k) = q^T·k / √d) and then applying the softmax function:

β(r_i, r_j) = exp(f(r_i, r_j)) / Σ_{k=1}^{n} exp(f(r_i, r_k))
7. The method for identifying named entities in the medical field according to claim 1, wherein the step S4 specifically comprises the following steps:
s4.1, dividing the data into K parts (here K = 10), taking one part in turn as the validation set and the rest as the training set, which yields 10 training/validation splits; training the model on each split in turn and averaging the 10 validation errors;
s4.2, adjusting the model hyper-parameters and the neural-network structure, repeating step S4.1 to find the model with the best error, and training that optimal model on all of the data.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310282404.3A | 2023-03-21 | 2023-03-21 | Medical field named entity identification method |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116451690A | 2023-07-18 |
Family

- ID: 87129371

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310282404.3A | Medical field named entity identification method | 2023-03-21 | 2023-03-21 |

Country Status (1)

| Country | Status |
|---|---|
| CN | Pending |
Citations (9)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110083831A * | 2019-04-16 | 2019-08-02 | 武汉大学 | A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF |
| CN112395879A * | 2020-11-10 | 2021-02-23 | 华中科技大学 | Scientific and technological text named entity recognition method |
| CN112541356A * | 2020-12-21 | 2021-03-23 | 山东师范大学 | Method and system for recognizing biomedical named entities |
| CN113836930A * | 2021-09-28 | 2021-12-24 | 浙大城市学院 | Chinese dangerous chemical named entity recognition method |
| CN114372465A * | 2021-09-29 | 2022-04-19 | 武汉工程大学 | Legal named entity identification method based on Mixup and BQRNN |
| CN114548106A * | 2022-02-22 | 2022-05-27 | 辽宁工程技术大学 | Method for recognizing science collaborative activity named entity based on ALBERT |
| CN114742059A * | 2022-04-13 | 2022-07-12 | 浙江科技学院 | Chinese electronic medical record named entity identification method based on multitask learning |
| CN114943230A * | 2022-04-17 | 2022-08-26 | 西北工业大学 | Chinese specific field entity linking method fusing common knowledge |
| WO2022222224A1 * | 2021-04-19 | 2022-10-27 | 平安科技(深圳)有限公司 | Deep learning model-based data augmentation method and apparatus, device, and medium |
Non-Patent Citations (1)

| Title |
|---|
| 苏剑林: "层次分解位置编码，让BERT可以处理超长文本", pages 1-4, retrieved from the Internet: https://www.spaces.ac.cn/archives/7947 * |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |