CN111178047A

CN111178047A - Ancient medical record prescription extraction method based on hierarchical sequence labeling

Info

Publication number: CN111178047A
Application number: CN201911347473.8A
Authority: CN
Inventors: 张引; 熊海辉
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2020-05-19
Anticipated expiration: 2039-12-24
Also published as: CN111178047B

Abstract

The invention discloses an ancient medical record prescription extraction method based on hierarchical sequence labeling. The method uses a BERT + CRF hierarchical sequence labeling network comprising an input layer, a feature extraction layer, a fully connected layer, a medicine and prescription name prediction CRF layer and a prescription prediction CRF layer. First, the invention does not need to segment medical records into sentences; it takes the complete medical record directly as input, which avoids error propagation caused by sentence segmentation. Second, the prescription text is obtained by sequence labeling, so the most relevant text fragment is obtained directly. Finally, medicine name and prescription name information is taken into account during recognition, which enhances the feature representation used for prescription extraction and improves results, so that prescription text in ancient medical records can be identified using a small amount of manually labeled data. The invention also designs a BLEU-based evaluation metric suitable for model selection, which quantifies the degree of match between the model's extraction results and the labeled results and is used to select the best model.

Description

Ancient medical record prescription extraction method based on hierarchical sequence labeling

Technical Field

The invention relates to pre-trained language models and conditional random fields in deep learning, and in particular to an ancient medical record prescription extraction method based on hierarchical sequence labeling.

Background

Traditional Chinese medical records document the complete course of a patient's treatment, including the symptoms, prescriptions and medicines involved during the treatment period. However, because of differences in era and in the personal style of the physicians, ancient medical records vary widely in level of detail and in format. This makes it difficult to format and digitize their content. For scholars and enthusiasts of traditional Chinese medicine, studying the treatment experience of earlier physicians through medical records is an important way to learn treatment approaches. For this reason, structuring the text of ancient medical records is particularly important. The purpose of medical record structuring is to identify medicines, prescription names, diseases, symptoms, syndromes, prescriptions and similar content in unstructured medical record text, thereby forming a structured medical record data resource. Structured medical records are easier to store and retrieve by computer, which makes them easier to browse. They also support mining and research on medical record data, which is of significant value to traditional Chinese medicine enthusiasts and researchers.

Prescription extraction is one of the tasks in medical record structuring. Its basic purpose is to identify and extract prescription text fragments from unstructured medical record text. One existing solution identifies prescriptions by classification at sentence granularity: the medical record is first segmented into sentences, each sentence is labeled as a prescription or not, a binary classification model is trained to classify the sentences, and the sentences classified as prescriptions are returned. This approach has two problems: (1) sentence segmentation is rule-based, so segmentation errors propagate to the classifier; (2) classification at sentence granularity is too coarse, because some prescriptions occupy only part of a sentence rather than the whole sentence. Therefore, to better accomplish ancient medical record prescription extraction, the main technical difficulties are as follows:

1. how to design a model that extracts long text segments;

2. how to reduce the labeling cost and burden so that prescription extraction can be achieved with only a small amount of labeled data;

3. how to design an evaluation metric that quantifies model performance.

Disclosure of Invention

To solve these problems, the invention provides a hierarchical sequence labeling model for prescription extraction, which treats prescription extraction as a sequence labeling problem. First, the complete medical record is used as input, which avoids error propagation caused by sentence segmentation. Second, the prescription text is obtained by sequence labeling, so the most relevant text segment is obtained directly and the identified content is more accurate. Finally, medicine name and prescription name information is taken into account during recognition, which enhances the feature representation used for prescription extraction and improves results, so that prescription text in ancient medical records can be identified using a small amount of manually labeled data.

In order to achieve the purpose, the invention adopts the following technical scheme:

an ancient medical record prescription extraction method based on hierarchical sequence labeling comprises the following steps:

1) collecting authoritative medical record data resources, and extracting medical record text information through an OCR tool;

2) labeling prescription data, medicine names and prescription names in part of medical record texts by using a data labeling tool to obtain manual labeling data comprising two labeling sequences, wherein one labeling sequence is a prescription labeling sequence, and the other labeling sequence is a medicine name and prescription name labeling sequence; the two labeling sequences both adopt a BIO label system, wherein B represents the initial part of a prescription, a medicine name or a prescription name, I represents the middle part of the prescription, the medicine name or the prescription name, and O represents the part which is not the prescription, the medicine name or the prescription name;

3) segmenting unlabeled medical record text into sentences, and filtering out sentences whose number of characters is less than a preset threshold to obtain a pre-training corpus; using the pre-training corpus to fine-tune the parameters of a BERT model;

4) establishing a BERT + CRF hierarchical sequence labeling network comprising an input layer, a feature extraction layer, a fully connected layer, a medicine and prescription name prediction CRF layer and a prescription prediction CRF layer, wherein the feature extraction layer adopts the BERT model trained in step 3); loading the fine-tuned BERT parameters and training the BERT + CRF hierarchical sequence labeling network with the manually labeled data obtained in step 2); during training, decoding the BIO label sequence output by the prescription prediction CRF layer to obtain the predicted prescription contents, and calculating their matching score against the manual labeling result using a BLEU-based evaluation metric; and selecting the network parameters corresponding to the model with the highest matching score to obtain the BERT + CRF hierarchical sequence labeling model;

5) inputting the medical record text to be processed into the BERT + CRF hierarchical sequence labeling model obtained in step 4), outputting the BIO label sequence predicted for the prescription content, and decoding all BI sequences from the BIO label sequence, wherein the text corresponding to the BI sequences is the extracted prescription content.
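The BIO scheme of step 2) and the decoding of step 5) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `spans_to_bio` and `decode_bio`, the half-open span convention, and the sample string are all assumptions introduced here, and tagging is done per character as is usual for Chinese text.

```python
from typing import List, Tuple


def spans_to_bio(length: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Encode half-open character spans [start, end) as one BIO tag per character."""
    tags = ["O"] * length
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def decode_bio(text: str, tags: List[str]) -> List[str]:
    """Return the text of every B I* run; an I with no preceding B is ignored."""
    segments, start = [], None
    for i, tag in enumerate(tags):
        if start is not None and tag != "I":
            segments.append(text[start:i])  # the current run ends before position i
            start = None
        if tag == "B":
            start = i
    if start is not None:
        segments.append(text[start:])  # run that extends to the end of the text
    return segments


# Hypothetical sample sentence, used only to exercise the two functions.
text = "宜用玉屏风散加桂枝芍药以固表"
prescription_tags = spans_to_bio(len(text), [(2, 11)])
name_tags = spans_to_bio(len(text), [(2, 6), (7, 9), (9, 11)])
print(decode_bio(text, prescription_tags))
print(decode_bio(text, name_tags))
```

The two tag lists correspond to the two labeling sequences of step 2): one long prescription span, and several short medicine or prescription name spans inside it.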

Further, the invention designs a BLEU-based evaluation metric for quantifying the degree of match between the model's extraction results and the labeling results. The BLEU-based evaluation metric is specifically as follows:

in the training process, the BIO label sequence predicted for the prescription content, output by the prescription prediction CRF layer of the hierarchical sequence labeling network, is decoded to obtain all BI sequences, and the prescription contents are obtained from the text corresponding to the BI sequences and expressed as $pred = [p_1, p_2, \ldots, p_n]$; the manual labeling result is expressed as $label = [t_1, t_2, \ldots, t_m]$, where $p_i$ and $t_j$ are text strings, $n$ is the number of predicted prescription contents, and $m$ is the number of manually labeled prescription contents;

defining $t = \min(m, n)$ and $T = \max(m, n)$, the BLEU method is used to calculate the correlation of each pair $\{p_i, t_j\}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, which gives a correlation matrix; the sums of correlations over every selection of $t$ entries lying in different rows and different columns are enumerated, the maximum sum is taken as the numerator and $T$ as the denominator, and the final matching score is calculated by the formula:

$$score = \frac{\max(matrix\_sum)}{T}$$

where $matrix\_sum$ denotes the sum of the correlations of $t$ entries in different rows and different columns.
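The matching score can be sketched as below. Two things are assumptions made for this sketch and are not taken from the patent: `char_bleu` is a simplified character-level stand-in for BLEU (clipped unigram and bigram precision with a brevity penalty), and the handling of empty inputs in `matching_score` is a convention chosen here. The enumeration over row and column selections follows the description directly and is brute force.

```python
from collections import Counter
from itertools import permutations
from math import exp
from typing import List


def char_bleu(pred: str, ref: str, max_n: int = 2) -> float:
    """Simplified character-level BLEU (assumed stand-in for the BLEU method)."""
    if not pred or not ref:
        return 0.0
    orders = min(max_n, len(pred))  # a 1-character prediction has no bigrams
    score = 1.0
    for n in range(1, orders + 1):
        p_ngrams = Counter(pred[i:i + n] for i in range(len(pred) - n + 1))
        r_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in p_ngrams.items())
        score *= overlap / sum(p_ngrams.values())  # clipped n-gram precision
    brevity = 1.0 if len(pred) >= len(ref) else exp(1 - len(ref) / len(pred))
    return brevity * score ** (1 / orders)


def matching_score(pred: List[str], label: List[str]) -> float:
    """max(matrix_sum) / T, with t = min(m, n) and T = max(m, n)."""
    n, m = len(pred), len(label)
    t, T = min(n, m), max(n, m)
    if T == 0:
        return 1.0  # assumption: nothing labeled and nothing predicted is a perfect match
    if t == 0:
        return 0.0
    sim = [[char_bleu(p, l) for l in label] for p in pred]  # n rows, m columns
    if n > m:
        sim = [list(col) for col in zip(*sim)]  # put the shorter side on the rows
    # Each of the t rows picks a distinct column out of T; keep the best total.
    best = max(
        sum(sim[row][col] for row, col in enumerate(cols))
        for cols in permutations(range(T), t)
    )
    return best / T


print(matching_score(["abcd"], ["abcd", "efgh"]))
```

Dividing by `T` rather than `t` is what penalizes predicting too few or too many segments. Enumeration grows factorially with the number of segments; for large counts an assignment solver would be the usual replacement, though the description itself only mentions enumeration.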

Further, the BERT + CRF hierarchical sequence labeling network comprises an input layer, a feature extraction layer, a fully connected layer, a medicine and prescription name prediction CRF layer and a prescription prediction CRF layer, wherein the feature extraction layer adopts the BERT model trained in step 3);

the input layer maps the input character sequence into a corresponding ID sequence; the BERT model takes the ID sequence as input and produces a feature representation for each ID, with a feature length of 768 dimensions; the feature representation of each ID is fed into two fully connected layers for feature conversion, each reducing the features to 3 dimensions, wherein the features converted by fully connected layer 1 are the medicine and prescription name features and the features converted by fully connected layer 2 are the prescription features; the medicine and prescription name features output by fully connected layer 1 are input into the medicine and prescription name prediction CRF layer to obtain the BIO label sequence predicted for medicine and prescription names; the prescription features output by fully connected layer 2 are added to the medicine and prescription name features output by fully connected layer 1 to form new features, which are input into the prescription prediction CRF layer to obtain the BIO label sequence predicted for the prescription content.
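The data flow through the two fully connected layers and the two CRF layers can be sketched in plain Python as follows. This is an inference-only illustration under stated assumptions: random vectors stand in for the 768-dimensional BERT features, the weights are random rather than trained, "adding" the two feature sets is read as element-wise addition, CRF decoding is shown as Viterbi search over emission and transition scores, and the large negative O-to-I transition is a common convention assumed here. All names (`linear`, `viterbi`, `hierarchical_tag`) are hypothetical.

```python
import random
from typing import List, Tuple

TAGS = ["B", "I", "O"]
HIDDEN = 768  # BERT feature size stated in the description
Layer = Tuple[List[List[float]], List[float]]


def linear(x: List[float], w: List[List[float]], b: List[float]) -> List[float]:
    """Fully connected layer y = W x + b, reducing 768 dimensions to 3."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def viterbi(emissions: List[List[float]], trans: List[List[float]]) -> List[str]:
    """Best tag path given per-token emission scores and trans[from][to] scores."""
    score = list(emissions[0])
    back = []
    for emit in emissions[1:]:
        prev = [max(range(3), key=lambda i: score[i] + trans[i][j]) for j in range(3)]
        score = [score[prev[j]] + trans[prev[j]][j] + emit[j] for j in range(3)]
        back.append(prev)
    tag = max(range(3), key=lambda j: score[j])
    path = [tag]
    for prev in reversed(back):  # follow back-pointers from the last token
        tag = prev[tag]
        path.append(tag)
    return [TAGS[k] for k in reversed(path)]


def hierarchical_tag(features, fc1: Layer, fc2: Layer, trans_name, trans_rx):
    name_emit = [linear(x, *fc1) for x in features]  # medicine and prescription name features
    rx_emit = [linear(x, *fc2) for x in features]    # prescription features
    # Hierarchy: the prescription CRF sees its own features plus the name features.
    rx_emit = [[a + b for a, b in zip(r, n)] for r, n in zip(rx_emit, name_emit)]
    return viterbi(name_emit, trans_name), viterbi(rx_emit, trans_rx)


rng = random.Random(0)


def rand_layer() -> Layer:
    w = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(3)]
    return w, [0.0] * 3


# Stand-in for BERT output on a 6-character input.
features = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(6)]
# Assumed transition scores: strongly discourage O -> I.
trans = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, -1e4, 0.0]]
name_tags, rx_tags = hierarchical_tag(features, rand_layer(), rand_layer(), trans, trans)
print(name_tags, rx_tags)
```

In a trained model the layer weights and transition scores would be learned jointly with BERT; the sketch only shows how the name features feed into the prescription prediction.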

The invention has the following beneficial effects:

(1) by using sequence labeling, the invention does not need to segment medical records into sentences when training the BERT + CRF hierarchical sequence labeling network; complete medical records are used directly as input and the most relevant text segments are identified directly, which avoids error propagation; the extracted prescription content is more accurate and cleaner and does not contain irrelevant text;

(2) the invention uses a pre-trained language model from natural language processing to obtain character representations; the model can be pre-trained on large-scale unlabeled text to learn general semantics and grammar, which reduces the labeling cost and burden so that prescription extraction can be achieved with only a small amount of labeled data;

(3) the invention makes full use of medicine name and prescription name information for prescription identification, which assists prescription identification, enhances the feature representation in the prescription extraction process, and achieves better results;

(4) the invention designs a BLEU-based evaluation metric for quantifying the degree of match between the model's extraction results and the labeling results, which is used to obtain the best model during training.

Drawings

FIG. 1 is a diagram of a model structure based on hierarchical sequence labeling;

FIG. 2 is an explanatory diagram of the evaluation metric.

Detailed Description

The present invention is described in detail below with reference to specific examples.

In current information extraction tasks, the goal is mainly to extract named entities from text, and named entities are usually short compared with prescription text, which is usually a long sequence of medicines. The invention therefore considered two approaches. One extracts prescriptions by hierarchical sequence labeling: prescription extraction is treated as a sequence labeling problem, and the fragments corresponding to prescriptions are marked with the BIO label system. The other is based on boundary prediction, that is, predicting the start and end positions of the prescription segment in the text. Extensive experiments showed that the hierarchical sequence labeling scheme outperforms the boundary prediction method, so prescription extraction is ultimately performed by hierarchical sequence labeling.

Because data annotation in the traditional Chinese medicine domain requires annotators to have basic professional domain knowledge, prescription annotation data is difficult to obtain. To address this, the invention uses a pre-trained language model from natural language processing to obtain character representations; the model can be pre-trained on large-scale unlabeled text to learn general semantics and grammar. It is then fine-tuned on unlabeled data from the traditional Chinese medicine domain to obtain better domain-specific character features, and finally trained on labeled data. This reduces the labeling cost and burden and allows prescription extraction with a small amount of labeled data.

As shown in FIG. 1, the ancient medical record prescription extraction method based on hierarchical sequence labeling provided by the invention adopts a BERT + CRF hierarchical sequence labeling network, which comprises an input layer, a BERT model, a fully connected layer, a medicine and prescription name prediction CRF layer and a prescription prediction CRF layer.

In a specific embodiment of the invention, the hierarchical sequence labeling network of BERT + CRF is adopted to extract the ancient medical record prescription, and the steps are as follows:

Step one, classic books of traditional Chinese medical records, such as 'the second-generation famous medical record', are scanned and converted into text by OCR (optical character recognition); related marks are inserted during scanning to mark the department, disease, body text and other content of each medical record.

Step two, prescription text fragments in part of the medical record texts are marked with the BIO label system using a medical record prescription labeling tool to obtain manually labeled data, which comprises two labeling sequences: one is the prescription labeling sequence and the other is the medicine name and prescription name labeling sequence; both labeling sequences use the BIO label system, and during labeling each character corresponds to one of the BIO labels, where B marks the beginning of a prescription, medicine name or prescription name, I marks its interior, and O marks content that is not part of a prescription, medicine name or prescription name;

for example, for text: "Yupingfeng san plus Guizhi Shaoyao is used to benefit the defense and strengthen the exterior to achieve the effect of sweating self-stopping taking", the labeling result of the corresponding prescription labeling sequence and the labeling result of the medicine and prescription name are as follows:

Step three, preparing the pre-training corpus. The unlabeled medical record text is segmented into sentences, and sentences with fewer than 5 characters are filtered out to obtain the pre-training corpus, which is used to fine-tune the pre-trained BERT model. The corpus format is as follows: each line is one sentence, and each medical record is treated as a document, with documents separated by empty lines in the corpus.

In one embodiment of the present invention, the corpus format is as follows:

Lines 1-6 represent the contents of one medical record, with each line representing a sentence, and lines 8-10 represent another medical record; medical records are separated by empty lines.
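The corpus preparation of step three can be sketched as below. The set of sentence-ending punctuation marks, the choice to count punctuation toward the 5-character threshold, the function names, and the two toy records are all assumptions made for illustration; only the threshold of 5 and the one-sentence-per-line, blank-line-between-records layout come from the description.

```python
import re
from typing import List

MIN_CHARS = 5  # threshold from the embodiment: drop sentences shorter than 5 characters


def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation (an assumed rule set)."""
    parts = re.split(r"(?<=[。！？；!?;])", text)
    return [p.strip() for p in parts if p.strip()]


def build_corpus(records: List[str], min_chars: int = MIN_CHARS) -> str:
    """One sentence per line, one blank line between medical records."""
    docs = []
    for record in records:
        sentences = [s for s in split_sentences(record) if len(s) >= min_chars]
        if sentences:  # skip records left empty after filtering
            docs.append("\n".join(sentences))
    return "\n\n".join(docs) + "\n"


# Hypothetical toy records, used only to show the output layout.
records = ["患者自汗恶风。咳嗽。服药三剂而愈。", "脉细而缓。汗。"]
corpus = build_corpus(records)
print(corpus)
```

The blank-line document separator matches the input layout commonly expected by BERT pre-training data scripts, which use document boundaries when sampling sentence pairs.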

Step four, loading the fine-tuned BERT parameters and training the BERT + CRF hierarchical sequence labeling network with the manually labeled data, using an evaluation metric based on BLEU (bilingual evaluation understudy): during training, the BIO label sequence predicted for the prescription content, output by the prescription prediction CRF layer of the hierarchical sequence labeling network, is decoded to obtain all BI sequences, and the prescription contents are obtained from the text corresponding to the BI sequences and expressed as $pred = [p_1, p_2, \ldots, p_n]$; the manual labeling result is expressed as $label = [t_1, t_2, \ldots, t_m]$, where $p_i$ and $t_j$ are text strings, $n$ is the number of predicted prescription contents, and $m$ is the number of manually labeled prescription contents;

FIG. 2 is an explanatory diagram of the evaluation metric, where A is the case in which the number of predicted prescription contents is smaller than the number of manually labeled prescription contents, B is the case in which the two numbers are equal, and C is the case in which the number of predicted prescription contents is larger than the number of manually labeled prescription contents. Because the number of predicted prescription contents may differ from the number of labeled ones, the evaluation metric must penalize predictions that are fewer or more than the labeled number. Therefore $t$ is defined as $\min(m, n)$ and $T$ as $\max(m, n)$, and the BLEU method is used to calculate the correlation of each pair $\{p_i, t_j\}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, which gives a correlation matrix; the sums of correlations over every selection of $t$ entries lying in different rows and different columns are enumerated, the maximum sum is taken as the numerator and $T$ as the denominator, and the final matching score is calculated by the formula:

$$score = \frac{\max(matrix\_sum)}{T}$$

where $matrix\_sum$ denotes the sum of the correlations of $t$ entries in different rows and different columns. The network parameters corresponding to the model with the highest matching score are selected to obtain the BERT + CRF hierarchical sequence labeling model. The medical record text to be processed is input into the BERT + CRF hierarchical sequence labeling model, which outputs the BIO label sequence predicted for the prescription content; all BI sequences are decoded from the BIO label sequence, and the text corresponding to the BI sequences is the extracted prescription content, giving the final prescription extraction result.
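As a worked illustration of how the matching score penalizes a mismatch in counts, consider a case like A in FIG. 2 with two predicted segments and three labeled segments. The correlation values below are made up for the example and do not come from the patent; rows are predictions and columns are labels.

```latex
% Hypothetical correlation values, chosen only for illustration.
S = \begin{pmatrix} 0.9 & 0.1 & 0.0 \\ 0.2 & 0.8 & 0.1 \end{pmatrix},
\qquad n = 2,\; m = 3,\; t = \min(m, n) = 2,\; T = \max(m, n) = 3

% Best selection of t = 2 entries in different rows and different columns:
\max(matrix\_sum) = S_{11} + S_{22} = 0.9 + 0.8 = 1.7

score = \frac{\max(matrix\_sum)}{T} = \frac{1.7}{3} \approx 0.567
```

Had the sum been divided by $t = 2$ instead, the result would be 0.85 and the missed third prescription would go unpenalized.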

For instance, for the medical record text "…… it is indicated for the syndrome of downward movement, slow and thready pulse, qi deficiency, exterior deficiency, wind-cold-yang deficiency, insecurity of body fluids and intolerance of wind and malaria. It is administered with Yupingfeng san and Guizhi Shaoyao to strengthen superficies and relieve sweating. The original prescription is used to remove cassia twig and ginseng, and the ginseng is cooked and attached to the skin to be taken for a plurality of times, so that the clothes can be completely removed. ……", the extraction results based on sentence classification are as follows (the parenthesized content after each sentence is the recognition result, and the ellipses indicate omitted preceding and following text):

The extraction result based on the hierarchical sequence labeling method is as follows (the underlined content is the prescription text segment, and the ellipses indicate omitted preceding and following text):

The invention does not need to segment medical records into sentences and takes the complete medical record directly as input, which avoids error propagation caused by sentence segmentation. The prescription text is obtained by sequence labeling, so the most relevant text fragment is obtained directly. Finally, medicine name and prescription name information is taken into account during recognition, which enhances the feature representation used for prescription extraction and improves results, so that prescription text in ancient medical records can be identified using a small amount of manually labeled data. Compared with the extraction result based on sentence classification, the prescription content extracted by this method is more accurate and cleaner and does not contain irrelevant text.

The above example shows only one embodiment of the present invention; although the description is specific and detailed, it is not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. An ancient medical record prescription extraction method based on hierarchical sequence labeling is characterized by comprising the following steps:

2. The method for extracting ancient medical record prescriptions based on hierarchical sequence labeling of claim 1, wherein the evaluation index method based on BLEU specifically comprises:

defining $t = \min(m, n)$ and $T = \max(m, n)$, the BLEU method is used to calculate the correlation of each pair $\{p_i, t_j\}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$; the sums of correlations over every selection of $t$ entries lying in different rows and different columns are enumerated, the maximum sum is taken as the numerator and $T$ as the denominator, and the final matching score is calculated by the formula:

$$score = \frac{\max(matrix\_sum)}{T}$$

3. The method for extracting ancient medical record prescriptions based on hierarchical sequence labeling of claim 1, wherein the BERT + CRF hierarchical sequence labeling network comprises an input layer, a feature extraction layer, a fully connected layer, a medicine and prescription name prediction CRF layer and a prescription prediction CRF layer, wherein the feature extraction layer adopts the BERT model trained in step 3);