CN113360643A

CN113360643A - Electronic medical record data quality evaluation method based on short text classification

Info

Publication number: CN113360643A
Application number: CN202110587641.1A
Authority: CN
Inventors: 叶方全; 陈逸龙
Original assignee: Guangzhou Tianpeng Computer Technology Co ltd; Chongqing Nanpeng Artificial Intelligence Technology Research Institute Co ltd
Current assignee: Guangzhou Tianpeng Computer Technology Co ltd; Chongqing Nanpeng Artificial Intelligence Technology Research Institute Co ltd
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2021-09-07

Abstract

The invention discloses an electronic medical record data quality evaluation method based on short text classification, comprising the following steps: S1: data processing; S2: basis identification; S3: quality evaluation. The method splits the original text of an electronic medical record into shorter sentences, constructs a BiLSTM-Attention model to classify the short sentences, and finally evaluates the record according to whether the classification results are consistent with the diagnosis. The method does not require manual processing of the original text of the electronic medical record, which saves labor and time and reduces the reliance on professional medical personnel. Meanwhile, the deep learning model can make full use of massive electronic medical record data to classify the split sentences effectively, so that a reasonable evaluation is made.

Description

Electronic medical record data quality evaluation method based on short text classification

Technical Field

The invention belongs to the technical field of electronic medical record data quality evaluation, and particularly relates to an electronic medical record data quality evaluation method based on short text classification.

Background

With the advent of the big data era, computer network technology is widely applied to the medical field, and various medical institutions collect massive electronic medical record data through an information management system to replace traditional handwritten paper medical records. The electronic medical record records the whole process of diagnosis and treatment of a patient by a doctor, contains information such as symptoms, signs, diagnosis, prescription and the like, and has great potential value in the fields of auxiliary diagnosis, risk prediction, medicine recommendation and the like. However, due to the limited data management level of the medical institution and the insufficient diagnosis and treatment capability of the doctor, a large amount of non-standard description texts exist in the electronic medical record data, so that the recorded information is inaccurate and incomplete, and the efficiency and the quality of medical research and product development are directly influenced. Therefore, data quality evaluation needs to be performed on the electronic medical records, and the electronic medical records with high quality are screened based on the data quality evaluation, so that interference of noise information and redundant information is reduced, which is of great significance for completing tasks such as medical data analysis, prediction model research, auxiliary system development and the like.

Existing electronic medical record data quality evaluation methods fall into two main types: manual evaluation, and the combination of information extraction with basis identification. In manual evaluation, professional medical personnel check each electronic medical record directly and, drawing on their clinical diagnosis and treatment experience, confirm whether the record suffers from problems such as inaccurate description, incomplete diagnosis, or insufficient basis, and thereby make a reliable evaluation. The advantage of this method is that the evaluation results are stable and effective; the disadvantage is that the labor and time costs are very high. The method combining information extraction and basis identification first uses a question-answering system to extract key patient information from the electronic medical record, then builds a basis identification model with machine learning algorithms such as logistic regression, decision trees, and random forests, and evaluates the electronic medical record according to whether the key information provides a sufficient diagnostic basis. Its advantage is that a large number of electronic medical records can be processed efficiently, saving labor and time; its disadvantage is that the performance of the basis identification model depends heavily on the quality of the information extraction results. Information extraction requires medical experts to first design rules and formulate standards, then match structured data from the original text, and finally standardize the structured data, so the result carries high uncertainty. Because both methods have obvious shortcomings, data quality evaluation of electronic medical records remains a challenge.

In summary, the electronic medical record has a data quality problem, and an accurate and efficient data quality evaluation method is needed to solve the problem.

Disclosure of Invention

The invention aims to provide an electronic medical record data quality evaluation method based on short text classification, so as to solve the problems in the background technology.

In order to achieve this purpose, the invention provides the following technical solution: a method for evaluating the quality of electronic medical record data based on short text classification, comprising the following steps:

S1: data processing:

S1.1: classifying the electronic medical record data into history of present illness, physical examination, imaging examination and laboratory examination, and constructing corresponding data sets according to the diagnostic basis of different diseases;

S1.2: splitting each type of data into short text sequences: using commas and periods as separators, the original data is split into short sentence samples, and short text data sets are constructed from the short sentence samples and their diagnosis results;

S1.3: removing short sentence samples that contain a diagnosis description, so that the corresponding diagnosis result is not directly revealed;
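Steps S1.2 and S1.3 can be illustrated with a minimal Python sketch. The separator set follows the text (commas and periods, here in both ASCII and Chinese forms). The cue-word list `DIAGNOSIS_CUES` and all function names are hypothetical: the patent does not state how a diagnosis description is detected, so simple keyword matching is assumed here purely for illustration.

```python
import re

# Separators named in step S1.2: commas and periods (ASCII and Chinese forms).
# Note: splitting on "." would also break decimals such as "38.5"; a real
# pipeline would need to guard against that.
SEPARATORS = r"[,，.。]"

# Hypothetical cue words marking a diagnosis description (step S1.3).
# The patent does not specify the detection rule; this list is an assumption.
DIAGNOSIS_CUES = ("consider", "diagnosed", "diagnosis")


def split_short_sentences(text):
    """Split one record field into non-empty short sentences."""
    return [part.strip() for part in re.split(SEPARATORS, text) if part.strip()]


def is_diagnosis_description(sentence):
    """Return True if the sentence contains any of the assumed cue words."""
    lowered = sentence.lower()
    return any(cue in lowered for cue in DIAGNOSIS_CUES)


def build_short_text_samples(text, diagnosis):
    """Pair each remaining short sentence with the record's diagnosis label."""
    return [
        (sentence, diagnosis)
        for sentence in split_short_sentences(text)
        if not is_diagnosis_description(sentence)
    ]


record = "Cough for 3 years, worse for 1 week. Consider bronchiectasis with infection."
samples = build_short_text_samples(record, "bronchiectasis")
```

In this example the third short sentence is dropped because it states the diagnosis rather than a basis for it, leaving two labelled samples.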

S2: basis identification:

S2.1: dividing the data set into a training set and a validation set at a ratio of 4:1, wherein the training set is used for training the model, and the model is optimized through a cross-entropy loss function based on the error between the predicted label and the true label, the cross-entropy loss being calculated as follows:

Loss = -Σ_{j=1}^{l} y_j log(ŷ_j)

where l is the number of labels, y is the true label, and ŷ is the predicted label;
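The 4:1 split and the loss can be sketched as follows. This is a minimal illustration, assuming a one-hot true label and the standard cross-entropy form -Σ y_j log(ŷ_j); the function names, the shuffling before the split, and the fixed seed are assumptions not stated in the text. The first two probabilities are taken from the example output given later in the description, and the third value (0.0707) is a made-up remainder so that the distribution sums to 1.

```python
import math
import random


def split_train_validation(samples, seed=0):
    """Shuffle samples and split them 4:1 into training and validation sets.

    Shuffling and the seed are assumptions; the text only gives the ratio.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot true label and predicted probabilities."""
    # eps guards against log(0) when a predicted probability is exactly zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


train, val = split_train_validation(range(10))
loss = cross_entropy([1, 0, 0], [0.8538, 0.0755, 0.0707])
```

With a one-hot true label the sum reduces to the negative log of the probability assigned to the true class, so a confident correct prediction gives a loss close to zero.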

the validation set is used for verifying the performance of the model, and the effectiveness of the model is demonstrated by calculating the precision, the recall and the F1 score, the precision being calculated as follows:

Precision = TP / (TP + FP)

the recall is calculated as follows:

Recall = TP / (TP + FN)

the F1 score is calculated as follows:

F1 = 2 · Precision · Recall / (Precision + Recall)

where TP is the number of positive samples whose predicted label is positive, FN is the number of positive samples whose predicted label is negative, and FP is the number of negative samples whose predicted label is positive;
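The three validation metrics follow directly from the TP, FP and FN counts defined above. The sketch below computes them for one disease label treated as the positive class; the function name, the label abbreviations and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one label treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Return 0.0 instead of dividing by zero (an assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: "ILD" and "BE" stand for two disease classes.
y_true = ["ILD", "ILD", "BE", "BE", "ILD"]
y_pred = ["ILD", "BE", "BE", "ILD", "ILD"]
p, r, f1 = precision_recall_f1(y_true, y_pred, positive="ILD")
```

For a multi-class model the same function can be called once per disease label and the results averaged, although the text does not say which averaging scheme is used.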

S2.2: using each type of data as model input and training a separate basis identification model for each type;

S2.3: the model processes the input in sequence. First, the input words of the original text x are fed into the Embedding layer to compute the word vector representation e:

e_i = Embed(x_i)

then e is input into a bidirectional long short-term memory network (BiLSTM) to compute the hidden state h as follows:

i^t = σ(W_i h^{t-1} + U_i x^t + b_i)

f^t = σ(W_f h^{t-1} + U_f x^t + b_f)

o^t = σ(W_o h^{t-1} + U_o x^t + b_o)

a^t = tanh(W_a h^{t-1} + U_a x^t + b_a)

where t is the time step, i is the input gate, f is the forget gate, o is the output gate, c is the cell state, h is the hidden state, W, U and b are model parameters, and σ and tanh are activation functions. Finally, the hidden states are input into the Attention layer to compute the predicted label ŷ.

The calculation is as follows:

s_i = v tanh(h_i)

w_i = softmax(s_i)

where w is the weight and v is a model parameter;
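The gate equations of step S2.3 can be traced with a small pure-Python sketch. The text names a cell state c and a hidden state h but prints no update rule for them, so the sketch assumes the standard LSTM updates c^t = f^t ⊙ c^{t-1} + i^t ⊙ a^t and h^t = o^t ⊙ tanh(c^t). The bidirectional wrapper (one pass in each direction, outputs concatenated per time step), the random initialization and all names are likewise assumptions, not the patented implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, h_prev, U, x, b):
    """Compute W h_prev + U x + b for list-of-lists matrices."""
    return [
        sum(w * h for w, h in zip(W[r], h_prev))
        + sum(u * v for u, v in zip(U[r], x))
        + b[r]
        for r in range(len(b))
    ]


def lstm_step(params, h_prev, c_prev, x):
    """One time step: gates i, f, o and candidate a as written in the text."""
    gate = {}
    for name, act in (("i", sigmoid), ("f", sigmoid), ("o", sigmoid), ("a", math.tanh)):
        W, U, b = params[name]
        gate[name] = [act(z) for z in affine(W, h_prev, U, x, b)]
    # Assumed standard LSTM updates (not printed in the source text).
    c = [f * cp + i * a for f, cp, i, a in zip(gate["f"], c_prev, gate["i"], gate["a"])]
    h = [o * math.tanh(cv) for o, cv in zip(gate["o"], c)]
    return h, c


def run_lstm(params, inputs, hidden):
    """Run one direction over the sequence, returning h for every time step."""
    h, c = [0.0] * hidden, [0.0] * hidden
    states = []
    for x in inputs:
        h, c = lstm_step(params, h, c, x)
        states.append(h)
    return states


def bilstm(fwd, bwd, inputs, hidden):
    """Concatenate forward and backward hidden states per time step."""
    forward = run_lstm(fwd, inputs, hidden)
    backward = run_lstm(bwd, inputs[::-1], hidden)[::-1]
    return [fw + bw for fw, bw in zip(forward, backward)]


def init_params(hidden, embed, rng):
    """Small random parameters (W, U, b) for each of the four gates."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(hidden, hidden), mat(hidden, embed), [0.0] * hidden) for g in "ifoa"}


rng = random.Random(0)
# Four made-up word vectors e_i of dimension 3 standing in for Embed(x_i).
embeddings = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2], [0.3, 0.0, 0.1]]
states = bilstm(init_params(2, 3, rng), init_params(2, 3, rng), embeddings, hidden=2)
```

Each entry of `states` has twice the hidden size because the two directions are concatenated; in practice a deep learning framework would replace this hand-written loop.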

S2.4: the model outputs the probability that a short text is a diagnostic basis for each of the different diseases;

S3: quality evaluation, wherein each electronic medical record is evaluated as clean data, high-quality data, low-quality data or noise data.

Preferably, the clean data in step S3 means that the predicted labels of all short sentences in the electronic medical record are consistent with the true label, which indicates that the sample has a sufficient diagnostic basis.

Preferably, the high-quality data in step S3 means that the most frequent predicted label in the electronic medical record is the true label, which indicates that the sample has a large amount of diagnostic basis and a small amount of noise information.

Preferably, the low-quality data in step S3 means that the most frequent predicted label in the electronic medical record is not the true label, which indicates that the sample has a small amount of diagnostic basis and a large amount of noise information.

Preferably, the noise data in step S3 means that the predicted labels of all short sentences in the electronic medical record are inconsistent with the true label, which indicates that the sample consists entirely of noise information.
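The four quality grades of step S3 amount to a small decision rule over the predicted labels of a record's short sentences. The sketch below is one possible reading, with two assumptions that the text does not settle: "most frequent predicted label" is taken as a plurality vote, and a tie for first place is graded as low quality. The function name and the grade strings are hypothetical.

```python
from collections import Counter


def evaluate_record(predicted_labels, true_label):
    """Grade one record from the predicted labels of its short sentences."""
    if not predicted_labels:
        raise ValueError("record has no short sentences to evaluate")
    matches = sum(1 for label in predicted_labels if label == true_label)
    if matches == len(predicted_labels):
        return "clean"          # every short sentence supports the diagnosis
    if matches == 0:
        return "noise"          # no short sentence supports the diagnosis
    counts = Counter(predicted_labels)
    top_count = max(counts.values())
    top_labels = [label for label, n in counts.items() if n == top_count]
    # Assumption: the true label must be the unique most frequent prediction;
    # a tie for first place is graded as low quality.
    return "high quality" if top_labels == [true_label] else "low quality"


grades = [
    evaluate_record(["A", "A", "A"], "A"),
    evaluate_record(["A", "A", "B"], "A"),
    evaluate_record(["A", "B", "B"], "A"),
    evaluate_record(["B", "C"], "A"),
]
```

Here the four example records are graded clean, high quality, low quality and noise respectively, one for each category named in step S3.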

Compared with the prior art, the method provided by the invention has the following advantages:

1) Low labor and time costs. Manual evaluation requires checking each electronic medical record, and the method combining information extraction with basis identification requires formulating information extraction rules; both consume a large amount of labor and time and place high demands on the medical personnel involved. Short text classification only needs to split the original data into short sentences and identify them with a short text classification model. The whole process is carried out by computer, which saves labor and time overall.

2) Short-sentence-level noise has little impact. For a model that performs basis identification directly on the whole electronic medical record, noise at the short-sentence level affects the overall judgment and causes strong interference. The short text classification model used by the invention identifies each short sentence sample independently, so even if some short sentences contain noise information, the data quality evaluation of the whole electronic medical record sample is hardly affected. The method of the invention therefore has stronger anti-interference capability.

3) Stable and reliable evaluation results. A single predicted label for a whole electronic medical record sample is subject to considerable chance and is not necessarily convincing. By comparison, a data quality evaluation based on the predicted labels of multiple short sentence samples is stable and reliable, and it also indicates how noisy the electronic medical record sample is, which makes the method better suited to practical application scenarios.

Drawings

FIG. 1 is a schematic view of the overall process of the present invention;

FIG. 2 is a schematic diagram of the data processing of the present invention;

FIG. 3 is a schematic structural diagram of the basis identification model of the present invention;

FIG. 4 is a schematic diagram of the bidirectional long short-term memory network (BiLSTM) of the present invention;

FIG. 5 is a schematic diagram of the Attention layer of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings. It is to be understood that the described embodiments are some, but not all, embodiments of the present invention. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Referring to fig. 1-5, the present invention provides a technical solution, a method for evaluating the quality of electronic medical record data based on short text classification, comprising the following steps:

S1: data processing:

S1.3: removing short sentence samples that contain a diagnosis description, so that the corresponding diagnosis result is not directly revealed; for example, "consider two lungs with multiple bronchiectasis and infection" does not give a diagnostic basis but directly states the diagnosis result, so this diagnosis description cannot be used as a reference for data quality evaluation;

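S2: basis identification:

the model is optimized with the cross-entropy loss Loss = -Σ_{j=1}^{l} y_j log(ŷ_j), where l is the number of labels, y is the true label, and ŷ is the predicted label;

the recall is calculated as follows:

Recall = TP / (TP + FN)

the F1 score is calculated as follows:

F1 = 2 · Precision · Recall / (Precision + Recall)

the word vector representation e is computed by the Embedding layer as follows:

e_i = Embed(x_i)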

then e is input into a bidirectional long short-term memory network (BiLSTM), as shown in FIG. 3, and the hidden state h is computed as follows:

i^t = σ(W_i h^{t-1} + U_i x^t + b_i)

f^t = σ(W_f h^{t-1} + U_f x^t + b_f)

o^t = σ(W_o h^{t-1} + U_o x^t + b_o)

a^t = tanh(W_a h^{t-1} + U_a x^t + b_a)

where t is the time step, i is the input gate, f is the forget gate, o is the output gate, c is the cell state, h is the hidden state, W, U and b are model parameters, and σ and tanh are activation functions. Finally, the hidden states are input into the Attention layer, as shown in FIG. 4, to compute the predicted label ŷ.

The calculation is as follows:

s_i = v tanh(h_i)

w_i = softmax(s_i)

where w is the weight and v is a model parameter;
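The two attention formulas can be made concrete with a short pure-Python sketch. It reads s_i = v tanh(h_i) as a dot product between the parameter vector v and the element-wise tanh of the hidden state, and takes the softmax over the positions i of the short sentence. The final weighted sum of hidden states into a single context vector is an assumption about the step that follows, since the text does not print how the weights are turned into the predicted label; all names are hypothetical.

```python
import math


def attention_weights(hidden_states, v):
    """Scores s_i = v · tanh(h_i), normalized with a softmax over positions."""
    scores = [
        sum(vk * math.tanh(hk) for vk, hk in zip(v, h))
        for h in hidden_states
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, v):
    """Weighted sum of hidden states (assumed pooling step, not in the text)."""
    weights = attention_weights(hidden_states, v)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, hidden_states))
        for k in range(dim)
    ]
    return weights, context


# Three made-up hidden states of dimension 2 and a made-up parameter vector v.
hidden_states = [[0.1, -0.2], [0.4, 0.3], [0.0, 0.0]]
weights, context = attention_pool(hidden_states, v=[1.0, 1.0])
```

The weights sum to 1, and the position with the largest score (the second one here) contributes most to the pooled vector, which a classification layer could then map to disease probabilities.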

s2.4: the model output is the probability of identifying short text as diagnostic basis for different diseases, such as "interstitial lung disease-0.8538, bronchiectasis-0.0755, … …";

In this embodiment, the clean data in step S3 means that the predicted labels of all short sentences in the electronic medical record are consistent with the true label, which indicates that the sample has a sufficient diagnostic basis.

In this embodiment, the high-quality data in step S3 means that the most frequent predicted label in the electronic medical record is the true label, which indicates that the sample has a large amount of diagnostic basis and a small amount of noise information.

In this embodiment, the low-quality data in step S3 means that the most frequent predicted label in the electronic medical record is not the true label, which indicates that the sample has a small amount of diagnostic basis and a large amount of noise information.

In this embodiment, the noise data in step S3 means that the predicted labels of all short sentences in the electronic medical record are inconsistent with the true label, which indicates that the sample consists entirely of noise information.

High-quality electronic medical record data contains accurate and complete information from which the patient's disease can be effectively inferred. Low-quality electronic medical record data contains a large amount of erroneous and redundant information, and the clinical manifestations it records are often inconsistent with the diagnosis result. To distinguish the two, a basis identification model is constructed and its predicted labels are compared with the true label of the electronic medical record. In order to evaluate the quality of electronic medical record data accurately and efficiently, the invention provides a method based on short text classification. The method does not require manual processing of the original text of the electronic medical record, which saves labor and time and reduces the reliance on professional medical personnel. Meanwhile, the deep learning model can make full use of massive electronic medical record data to classify the split sentences effectively, so that a reasonable evaluation is made.

The method provided by the invention has the following advantages:

2) Short-sentence-level noise has little impact. For a model that performs basis identification directly on the whole electronic medical record, noise at the short-sentence level affects the overall judgment and causes strong interference. The short text classification model used by the invention identifies each short sentence sample independently, so even if some short sentences contain noise information, the data quality evaluation of the whole electronic medical record sample is hardly affected. The method of the invention therefore has stronger anti-interference capability.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in the claims shall not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is written this way for clarity only; those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments understandable to those skilled in the art.

Claims

1. An electronic medical record data quality evaluation method based on short text classification, characterized by comprising the following steps:

s1: data processing:

S2: basis identification:

S2.1: dividing the data set into a training set and a validation set at a ratio of 4:1, wherein the training set is used for training the model, and the model is optimized through a cross-entropy loss function based on the error between the predicted label and the true label, the cross-entropy loss being calculated as follows:

Loss = -Σ_{j=1}^{l} y_j log(ŷ_j)

where l is the number of labels, y is the true label, and ŷ is the predicted label;

the recall is calculated as follows:

Recall = TP / (TP + FN)

the F1 score is calculated as follows:

F1 = 2 · Precision · Recall / (Precision + Recall)

the word vector representation e is computed by the Embedding layer as follows:

e_i = Embed(x_i)

inputting e into a bidirectional long short-term memory network (BiLSTM) and computing the hidden state h as follows:

i^t = σ(W_i h^{t-1} + U_i x^t + b_i)

f^t = σ(W_f h^{t-1} + U_f x^t + b_f)

o^t = σ(W_o h^{t-1} + U_o x^t + b_o)

a^t = tanh(W_a h^{t-1} + U_a x^t + b_a)

where t is the time step, i is the input gate, f is the forget gate, o is the output gate, c is the cell state, h is the hidden state, W, U and b are model parameters, and σ and tanh are activation functions; finally, the hidden states are input into the Attention layer and the predicted label ŷ is obtained by calculation.

The calculation is as follows:

s_i = v tanh(h_i)

w_i = softmax(s_i)

where w is the weight and v is a model parameter;

2. The method for evaluating the quality of electronic medical record data based on short text classification as claimed in claim 1, wherein: the clean data in step S3 means that the predicted labels of all short sentences in the electronic medical record are consistent with the true label, which indicates that the sample has a sufficient diagnostic basis.

3. The method for evaluating the quality of electronic medical record data based on short text classification as claimed in claim 1, wherein: the high-quality data in step S3 means that the most frequent predicted label in the electronic medical record is the true label, which indicates that the sample has a large amount of diagnostic basis and a small amount of noise information.

4. The method for evaluating the quality of electronic medical record data based on short text classification as claimed in claim 1, wherein: the low-quality data in step S3 means that the most frequent predicted label in the electronic medical record is not the true label, which indicates that the sample has a small amount of diagnostic basis and a large amount of noise information.

5. The method for evaluating the quality of electronic medical record data based on short text classification as claimed in claim 1, wherein: the noise data in step S3 means that the predicted labels of all short sentences in the electronic medical record are inconsistent with the true label, which indicates that the sample consists entirely of noise information.