CN115394435B

CN115394435B - Method and system for identifying key clinical index entity based on deep learning

Info

Publication number: CN115394435B
Application number: CN202211103092.7A
Authority: CN
Inventors: 孔桂兰; 张路霞; 丁国辉; 张家豪; 林鸿波; 沈鹏; 孙烨祥; 王怀玉; 彭苏元; 孟若谷; 孙小宇; 郝建国
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2022-09-09
Filing date: 2022-09-09
Publication date: 2023-04-25
Anticipated expiration: 2042-09-09
Also published as: CN115394435A

Abstract

The invention discloses a method and a system for identifying a key clinical index entity based on deep learning, wherein the method comprises the following steps: establishing an original index entity library according to the original clinical index data of the chronic kidney disease; establishing a key index entity knowledge base of the chronic kidney disease based on an expert marking method; constructing an index entity normalization model based on an Entity Embeddings algorithm to form a mapping relation and a classifier between the original index entity library and the key index entity knowledge base; and inputting the test data to be processed into the index entity normalization model to obtain a classification result. Compared with the prior art, the technical scheme of the invention realizes efficient and accurate automatic identification of the key clinical indexes of multi-source chronic kidney disease, has strong reliability and high accuracy, and has good prospects for popularization and application.

Description

Method and system for identifying key clinical index entity based on deep learning

Technical Field

The invention relates to the technical field of electronic medical record data classification, in particular to a method and a system for identifying key clinical index entities based on deep learning.

Background

With "Healthy China" established as a national strategy, it is important to provide people with comprehensive, full-life-cycle health services. Chronic diseases are regarded as the foremost threat to the health of the population in China and place a heavy burden on the medical and health system. Statistics show that the prevalence of hypertension among adults in China is 18%, with nearly two hundred million people affected. Deaths caused by chronic diseases such as chronic kidney disease, cardiovascular disease, tumors, diabetes, and respiratory diseases account for more than half of all deaths in China, and the situation for chronic disease prevention and control is severe.

With the continued development of internet technology, cloud computing, big data, artificial intelligence, and interconnection technologies have converged and are transforming the health and medical industry. Full-life-cycle data acquisition and structured integration have profound effects on clinical research and on decisions about chronic disease prevention and control. However, data integration brings with it the problems of improving data quality and meeting data standards. After data from different medical institutions are integrated, the lack of a unified naming standard for clinical indexes in EMRs (Electronic Medical Records) and the different naming habits of different medical institutions result in the same laboratory clinical index having several different names. This creates a major obstacle to subsequent cross-modal retrieval, disease risk factor analysis, prognosis prediction, and medical decision support based on multi-modal EMR data.

The prior art relies mainly on clinicians manually processing the data to normalize the names of laboratory clinical indexes, which is inefficient and difficult to scale. In fact, index entity normalization can be treated as a structured data classification problem. Current research on structured data classification falls into two types: traditional machine learning models and deep learning neural network models. Both, however, remain largely limited to theoretical research; their accuracy does not yet meet practical requirements, and they have not been applied successfully to electronic medical record data processing. Because the current manual approach is difficult to sustain, a method and a system that identify clinical index entities efficiently and accurately are needed.

Disclosure of Invention

Therefore, the embodiment of the invention provides the method and the system for identifying the key clinical index entity based on deep learning, which realize the efficient and accurate automatic identification of the key clinical index of the multi-source chronic kidney disease and have good popularization and application prospects.

The embodiment of the invention provides a key clinical index entity identification method based on deep learning, which comprises the following steps:

establishing an original index entity library according to the original clinical index data of the chronic kidney disease;

establishing a key index entity knowledge base of the chronic kidney disease based on an expert marking method;

constructing an index entity normalization model based on an Entity Embeddings algorithm to form a mapping relation and a classifier between the original index entity library and the key index entity knowledge base;

inputting the test data to be processed into the index entity normalization model to obtain a classification result.

Further, the constructing an index entity normalization model based on the Entity Embeddings algorithm, and forming the mapping relation and the classifier between the original index entity library and the key index entity knowledge library includes:

acquiring test data recorded by the mixed original index entity library and the key index entity knowledge library, and preprocessing the test data to be processed according to the characteristics of different fields of the test data;

adopting a model based on Entity Embeddings algorithm to map the classified test data into discrete characteristic values one by one;

outputting probability calculation corresponding to each index category based on the Softmax function to form a Softmax classifier;

and calculating the deviation degree between the index classification result and the actual result by using the cross entropy loss function, and adjusting the weight parameter by a gradient descent method to reduce the deviation degree.

Further, the mapping the classified assay data into discrete feature values one by one using a model based on Entity Embeddings algorithm comprises:

mapping the classified test data into numerical values one by one using a hard coding method, storing the mapping relation in a hash table, and performing One-Hot encoding on a target column;

and setting the length of the vector, and inputting the One-Hot encoded classified field into an Entity Embeddings layer for conversion processing to obtain the fixed dimension vector.

Further, outputting the probabilities corresponding to the index categories based on the Softmax function includes:

inputting the fixed dimension vector to a neural network layer for training;

and normalizing the trained vector into a probability distribution vector by adopting the Softmax function at an output layer to obtain a probability value corresponding to each index category, wherein the total probability sum is 1.

Further, the inputting the test data to be processed into the index entity normalization model to obtain the classification result includes:

acquiring the assay data to be processed;

mapping the test data to be processed through the hash table to obtain a mapping result;

inputting the mapping result into the neural network layer for training, and obtaining the probability of the category to which the assay data belongs through the Softmax function, so as to realize the classification of the classifier;

and selecting a category corresponding to the highest probability value in the categories to which the assay data belong as a clinical index entity category to be used as a classification result of the classifier.

Further, the clinical index entity category includes an institution number, an index internal number, an assay Chinese name, an assay English name, a unit, and a reference range.

Further, inputting the One-Hot encoded classified field into the Entity Embeddings layer for conversion processing comprises the following steps:

mapping the discrete feature values into vectors at the Entity Embeddings layer;

and determining the association degree between different discrete feature values through vector distances, and determining the distance length of the fixed dimension vector according to the association degree.

Further, the preprocessing of the assay data to be processed includes:

and carrying out data cleaning and data complementation on the test data to be processed, and screening out data needing classification.

Further, the method further comprises the following steps:

judging whether a new field value or category exists, wherein the field value or category is not included in the training set;

if yes, the new field value or the new category is replaced uniformly.

Another embodiment of the present invention proposes a key clinical index entity identification system based on deep learning, comprising:

the original index entity library establishing unit is used for establishing an original index entity library according to the original clinical index data of the chronic kidney disease;

the index entity knowledge base establishing unit is used for establishing a key index entity knowledge base of the chronic kidney disease based on an expert marking method;

the model construction unit is used for constructing an index entity normalization model based on an Entity Embeddings algorithm to form a mapping relation and a classifier between the original index entity library and the key index entity knowledge base;

and the data classification unit is used for inputting the test data to be processed into the index entity normalization model to obtain a classification result.

Yet another embodiment of the present invention provides a computer readable storage medium storing a computer program which, when executed, implements the key clinical index entity identification method based on deep learning as described above.

The invention provides a key clinical index entity identification method and system based on deep learning. First, an original index entity library is established from massive original clinical index data of chronic kidney disease. Second, a key index entity knowledge base of the chronic kidney disease is constructed by professional medical staff. Third, an index entity normalization model is constructed based on the Entity Embeddings algorithm to form a mapping relation between the original index entity library and the key index entity knowledge base, yielding an association knowledge base, and deep learning is used to learn from it the association between the original index entities and the knowledge base index entities. Test data to be processed can then be input into the index entity normalization model to obtain a classification result. Compared with the prior art, the technical scheme of the invention realizes efficient and accurate automatic identification of the key clinical indexes of multi-source chronic kidney disease, has strong reliability and high accuracy, and has good prospects for popularization and application.

Drawings

In order to more clearly illustrate the technical solutions of the present invention, the drawings that are required for the embodiments will be briefly described, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope of the present invention. Like elements are numbered alike in the various figures.

FIG. 1 is a schematic flow chart of a method for identifying a key clinical index entity based on deep learning according to an embodiment of the present invention;

FIG. 2 is a flowchart of step S103 provided in an embodiment of the present invention;

FIG. 3 is a schematic diagram of data preprocessing according to an embodiment of the present invention;

FIG. 4 is a flowchart of step S104 according to an embodiment of the present invention;

FIG. 5 is a graph showing the comparison of ROC-AUC curves of different classification methods according to the embodiments of the present invention;

FIG. 6 is a graph showing PR-AP curves for different classification methods according to the embodiment of the present invention;

FIG. 7 is a schematic diagram of a system for identifying a key clinical index entity based on deep learning according to an embodiment of the present invention.

Description of main reference numerals:

10-an original index entity library building unit; 20-an index entity knowledge base building unit; 30-a model building unit; 40-data classification unit.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.

An original CKD (Chronic Kidney Disease) clinical index entity, drawn from multiple sources, consists of six fields: organization code, index internal code, index Chinese name, index English name, unit, and reference range. It is typical structured data. Field types can generally be divided into numerical data and categorical data. Classifying structured data mainly means filling in the class label of a record from the values of one or more of its columns. Current research on structured data classification algorithms is mainly divided into methods based on deep neural networks and traditional machine learning classification algorithms such as decision trees. The embodiment of the invention provides a method for classifying CKD key index entities under a deep learning framework using the Entity Embeddings technique.

Example 1

Fig. 1 is a schematic flow chart of a method for identifying a key clinical index entity based on deep learning according to an embodiment of the present invention.

Referring to fig. 1, the method for identifying the key clinical index entity based on deep learning comprises the following steps:

step S101, an original index entity library is established according to the original clinical index data of the chronic kidney disease;

step S102, a key index entity knowledge base of the chronic kidney disease is established based on the expert marking method (the expert marking method is a common labeling approach in which a knowledge base is generated from manually labeled data, gathering information from a source such as the original clinical index data of the chronic kidney disease);

step S103, constructing an index entity normalization model based on an Entity Embeddings algorithm to form a mapping relation and a classifier between the original index entity library and the key index entity knowledge base;

and step S104, inputting the test data to be processed into the index entity normalization model to obtain a classification result.

Specifically, the Entity Embeddings algorithm is a technique for mapping discrete feature values into vectors. Its advantage is that the degree of association between two feature values is reflected by the distance between their vectors: the greater the association, the smaller the distance. This lets the neural network learn the discrete features of tabular data better. Traditional structured data classification algorithms often use hard coding to map the categorical data of an original clinical index entity into a vector space one by one, which can blur the differences among values and mislead the computation. The Entity Embeddings algorithm supplements hard coding to compensate for this defect. In addition, the weight parameters of Entity Embeddings are obtained through neural network backpropagation training; compared with manually set weights, they fit the target problem better and generalize better. In the embodiment of the invention, the input x passes through multiple layers of neurons: each neuron computes on its input and passes its output to the next layer, and the final output layer maps the values of the output vector to between 0 and 1 through a Softmax function to represent the probability of belonging to each class.

Referring to fig. 2, step S103 (i.e. constructing an index entity normalization model based on Entity Embeddings algorithm to form a mapping relationship and classifier between the original index entity library and the key index entity knowledge library) includes:

step S201, acquiring test data recorded in the mixed original index entity library and key index entity knowledge base, and preprocessing the test data to be processed according to the characteristics of its different fields; FIG. 3 shows the data after preprocessing. The data used come from 42.46 million laboratory sheets containing more than 600 million laboratory records, from which 453.84 million index entities, each with six fields (organization code, index internal code, index Chinese name, index English name, unit, and reference range), are extracted to form the original index entity library. A knowledge base containing 78 CKD key index entities is constructed by professional medical staff. The original index entities and the key index entities are then combined into a mixed association knowledge base, and deep learning is used to learn from it the association between the original index entities and the knowledge base index entities, that is, the normalization rules for the different names of the same index entity. This is referred to as the multi-source CKD key clinical index entity identification method based on deep learning.

Step S202, adopting an algorithm based on Entity Embeddings to map the classified test data into discrete feature values one by one;

specifically, a hard coding method is used to map the categorical test data of the original clinical index entity into numerical values one by one; this mapping only converts categorical data into numerical data. The mapping relation is stored in a hash table, One-Hot encoding is applied to the target column, and the One-Hot encoded categorical fields are input into the Entity Embeddings layer for conversion to obtain fixed-dimension vectors. At the Entity Embeddings layer, discrete feature values are mapped into vectors, the degree of association between different discrete feature values is determined by vector distance, and the distance length of the fixed-dimension vector is determined according to the degree of association. In the embodiment of the present invention, when categorical data are processed for the neural network, they are first mapped to numbers using the hard coding method. For example, if a categorical field contains two values, "red blood cells" and "red blood cell count", the two values are mapped to 1 and 2 respectively, so that the categorical field is represented numerically.
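The hard coding and One-Hot steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the sample values, the helper names `build_code_table` and `one_hot`, and the choice to start codes at 1 (matching the "1, 2" example in the text) are assumptions.

```python
# Hypothetical sample values for one categorical field.
field_values = ["red blood cells", "red blood cell count", "red blood cells"]

def build_code_table(values):
    """Assign each distinct value an integer code, starting at 1, in order of first appearance."""
    table = {}
    for value in values:
        if value not in table:
            table[value] = len(table) + 1
    return table

def one_hot(code, num_values):
    """Return a one-hot list of length num_values for a 1-based integer code."""
    vector = [0] * num_values
    vector[code - 1] = 1
    return vector

# A Python dict plays the role of the hash table that stores the mapping relation.
code_table = build_code_table(field_values)
codes = [code_table[value] for value in field_values]
one_hot_rows = [one_hot(code, len(code_table)) for code in codes]
```

Keeping the table as a plain dictionary means the same mapping can be reused unchanged when new test data are encoded at classification time.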

Next, the One-Hot encoded categorical fields are processed by Entity Embeddings. Let N denote the dimension of the mapped data and M the number of distinct values of the field. $O_{M\times M}$ is the matrix storing the One-Hot vectors, one per row, and $W_{M\times N}$ is the matrix of weights learned by the neural network. The matrix $E_{M\times N}$ is obtained by formula (1); it holds the field value vectors, each row being the embedded vector representation of one field value.

$$E_{M\times N} = O_{M\times M} \times W_{M\times N} \qquad \text{Formula (1)}$$
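Formula (1) can be checked with a small numeric sketch. The sizes M = 3 and N = 2 and the weight values below are made up for illustration; the point shown is only that multiplying a One-Hot row by W selects the matching row of W, so the embedding acts as a table lookup.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (r x k) and b (k x c)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

M, N = 3, 2  # assumed: 3 distinct field values, 2-dimensional embeddings

# O holds one One-Hot vector per row, so for all M values it is the identity matrix.
O = [[1 if row == col else 0 for col in range(M)] for row in range(M)]

# W would be learned by backpropagation; these numbers are illustrative only.
W = [[0.1, -0.4],
     [0.7, 0.2],
     [-0.3, 0.5]]

E = matmul(O, W)                       # E = O x W, as in formula (1)
second_value = matmul([O[1]], W)[0]    # embedding of the second field value
```

Because each One-Hot row has a single 1, row i of E equals row i of W, which is why embedding layers are usually implemented as an index lookup rather than an explicit matrix product.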

Step S203, outputting probability calculation corresponding to each index category based on the Softmax function to form a Softmax classifier; the normalization processing of different names under the same index entity can be realized through the processing of the Softmax classifier, and the mapping relation between the original index entity library and the key index entity knowledge library can be completed;

specifically, the vectors are passed in turn to the subsequent network layers for training, and in the output layer a Softmax function maps all results into [0, 1]. The output of the Softmax function is the probability that the test data (i.e., the sample) belongs to each category, and the probabilities sum to 1. The Softmax function is the activation function of the output layer. The neural network provided by the embodiment of the invention is divided into three parts: an input layer, a hidden layer, and an output layer. After the Entity Embeddings algorithm is applied, all fields in the structured data are converted into numerical types, which meets the computation requirement, and the output layer finally outputs the probability that the current test data belongs to each category. For example, if the output vector corresponding to (serum creatinine, urinary creatinine) is (0.2, 0.8), the probability that the test result is serum creatinine is 0.2 and the probability that it is urinary creatinine is 0.8. Because the Softmax function makes the components of the output vector sum to 1, they can be interpreted as probabilities.

$$S_i = \frac{e^{V_i}}{\sum_j e^{V_j}} \qquad \text{Formula (2)}$$

wherein, in formula (2), $S_i$ represents the probability of belonging to the i-th category, $V_i$ is the output layer value in the column of the i-th category, e is the natural constant, and j indexes the columns (target columns) summed over.
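The Softmax normalization described above can be sketched in a few lines. Subtracting the maximum before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result and is not mentioned in the source.

```python
import math

def softmax(values):
    """Map a list of output-layer values to probabilities that sum to 1."""
    shift = max(values)                        # stability shift, cancels in the ratio
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])   # illustrative output-layer values
uniform = softmax([0.0, 0.0])      # equal inputs give equal probabilities
```

Larger output-layer values receive larger probabilities, and every probability lies between 0 and 1.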

In step S204, the deviation between the index classification result and the actual result (i.e. the actual category) is calculated by using the cross entropy loss function, and the weight parameter is adjusted by the gradient descent method to reduce the deviation.

Specifically, for the result output by the Softmax function, the probability of the correct class needs to be raised as much as possible. For example, if the actual class is serum creatinine, the probability value of serum creatinine (0.2) should be increased and the probability value of urinary creatinine (0.8) should be decreased. The difference between the actual result and the index classification result can be measured with a cross entropy loss function, formula (3), where θ represents the sample number, m represents the batch size, h represents the fully connected training process, λ represents the hyperparameter for preventing the function from sinking into a local optimum, and n represents the number for generating the same probability.

Formula (3) uses the cross entropy loss function to calculate J(θ). In formula (3), x represents the feature value of the index classification result, y represents the feature value of the actual result (i.e., the actual category), i represents the i-th category, and j represents the row belonging to the category. J(θ) is calculated after traversing each batch and is the deviation degree obtained after traversing the probabilities generated by all samples. A batch is the portion of training-set samples used for one backpropagation update of the model weights; the size of that portion is the batch size, i.e., the number of samples used in one iteration. The parameters are updated each time one batch has been processed.

The cross entropy loss function is used for measuring the deviation degree between the current prediction result and the actual result, has better smoothness compared with the common mean square error loss, is convenient for gradient descent calculation, and can prevent the function from sinking into local optimum.
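As a hedged illustration of how cross entropy measures this deviation, the sketch below uses the plain multi-class form (the negative log of the probability assigned to the actual class, averaged over a batch). It is not claimed to be the exact formula (3) of the source, which also involves a hyperparameter λ; the function names are hypothetical.

```python
import math

def cross_entropy(probs, true_index):
    """Loss for one sample: negative log-probability of the actual class."""
    return -math.log(probs[true_index])

def batch_cross_entropy(batch_probs, true_indices):
    """Mean loss over one batch of samples."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, true_indices)]
    return sum(losses) / len(losses)

# Classes are (serum creatinine, urinary creatinine); the actual class is index 0.
loss_poor = cross_entropy([0.2, 0.8], 0)     # prediction from the example in the text
loss_better = cross_entropy([0.9, 0.1], 0)   # a prediction closer to the actual class
batch_loss = batch_cross_entropy([[0.2, 0.8], [0.9, 0.1]], [0, 0])
```

The loss shrinks toward zero as the probability of the actual class approaches 1, which is the behavior the training procedure exploits.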

The weights and biases are changed through gradient descent so that the deviation value decreases. The embodiment of the invention uses the sign of the partial derivative to decide how each parameter changes, and updates the parameters through the following formula;

wherein $x_0, x_1, \ldots, x_n$ represent the feature values of a plurality of index classification results, and y represents the feature value of the actual result; m represents the batch size, and the h function represents the fully connected training process; i represents the i-th category, and j represents the column to which it belongs; α represents the learning rate;

$\theta_i$ represents the partial derivative of the i-th class;
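The update rule can be illustrated with a single Softmax layer trained by plain gradient descent. This is a sketch under assumptions: one sample, a learning rate α of 0.5, no bias or regularization term, and the standard result that for Softmax with cross entropy the gradient with respect to each output-layer value is the predicted probability minus the one-hot target. The source's own update formula is not reproduced here.

```python
import math

def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_step(weights, x, true_index, alpha):
    """Apply one update theta <- theta - alpha * dJ/dtheta; return the loss before the update."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[true_index])
    for k, row in enumerate(weights):
        # Gradient of the loss with respect to output value k is p_k - y_k.
        error = probs[k] - (1.0 if k == true_index else 0.0)
        for d in range(len(x)):
            row[d] -= alpha * error * x[d]
    return loss

weights = [[0.0, 0.0], [0.0, 0.0]]   # two classes, two input features (illustrative)
x = [1.0, 0.5]                        # hypothetical input vector
losses = [gradient_step(weights, x, true_index=0, alpha=0.5) for _ in range(5)]
```

The sign of each partial derivative determines whether a weight is increased or decreased, and the recorded losses fall with every step.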

In summary, the deep neural network is composed of an input layer, a hidden layer, and an output layer. The input x from the input layer passes through multiple layers of neurons, each of which computes on its input and passes its output to the next layer; the last layer, the output layer, maps the values of the output vector to between 0 and 1 through a Softmax function to represent the probability of belonging to each class.

Training of the neural network includes four steps: initializing the neural network parameters, forward propagation, loss calculation, and backpropagation. In forward propagation, the input values pass through each layer of neurons from front to back, and each layer's result is input to the next layer. In the loss calculation stage, the difference between the network classification result and the actual class is calculated according to the cross entropy loss function, with the goal of driving that difference toward zero.

To achieve this goal, the parameters are updated iteratively in the direction that decreases the function value. Because the value of the cross entropy loss function is computed by forward propagation, the loss function contains the parameter information of every layer of neurons, and the parameters of each layer can be updated by computing partial derivatives of the deviation value given by the cross entropy function.

According to the embodiment of the invention, a robust model can be obtained by letting the neural network learn its parameters automatically. The categorical fields in the structured data must first be converted into numerical matrices by the hard coding method, but hard coding has the defect that the differences between different categories are the same after encoding. The Entity Embeddings algorithm therefore maps the hard-coded data into fixed-size vectors inside the neural network. The function performing the Entity Embeddings operation is added to the neural network as a neuron, and its parameters are updated iteratively through forward propagation and backpropagation, so that after training the relations between the vectors better match their actual meaning.

Referring to fig. 4, step S104 includes:

step S301, obtaining test data to be processed;

step S302, mapping the test data to be processed through a hash table;

step S303, inputting the mapping result into a neural network layer for training, and obtaining the probability of the category to which the test data belongs through a Softmax function to realize the classification of the classifier;

step S304, selecting the category corresponding to the highest probability value in the categories to which the test data belong as the entity category of the clinical index.
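Step S304 amounts to taking the category with the largest Softmax probability. A minimal sketch, with the function name `classify` and the two class names chosen only for illustration:

```python
def classify(probabilities, class_names):
    """Return the class with the highest probability, together with that probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]

# Reusing the two-class example from the text.
label, confidence = classify([0.2, 0.8], ["serum creatinine", "urinary creatinine"])
```

Returning the probability alongside the label leaves room for a later confidence threshold, though the source does not describe one.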

It should be noted that the clinical index entity category provided by the embodiment of the invention includes, but is not limited to, an institution number, an index internal number, an assay Chinese name, an assay English name, a unit, and a reference range.

According to the embodiment of the invention, a knowledge base of CKD key index entities is constructed, and the association between the original index entities and the index entities in the knowledge base is established to form an association knowledge base. Deep learning is then used to learn this association from the association knowledge base, that is, to learn the normalization rules for the different names of the same index entity and thereby the labeling experience of professional medical staff. This realizes efficient and accurate automatic identification of the key clinical indexes of multi-source chronic kidney disease, with strong reliability, high accuracy, and good prospects for popularization and application.

Example 2

The invention utilizes laboratory test data of a plurality of medical institutions in a certain area to carry out statistical analysis on the naming rule of the names of clinical indexes, and establishes an index entity normalization model. The data are from laboratory sheets and laboratory records, from which hundreds of millions of index entities containing 6 field marks of organization codes, index internal codes, index Chinese names, index English names, units, reference ranges and the like are extracted, so as to form an original index entity library. A knowledge base containing 78 CKD key index entities is constructed by professional medical staff, then the association between the original index entities and the index entities in the knowledge base is established to form an association knowledge base, and the classification processing of data is realized by adopting the method in the embodiment 1.

The data set of the embodiment of the invention is normalized as follows: (1) the original name is expanded into variables, which after expansion comprise the institution number, index internal number, assay Chinese name, assay English name, unit, and reference range; (2) relevant medical professionals label the data manually, dividing it into 78 classes; (3) 80% of the labeled data set is used as the training set; (4) the data are preprocessed: the range field is cleaned, the categorical data (i.e., categorical test data) are mapped one-to-one to specific values by the hard coding method with the mapping relation stored in a hash table, and One-Hot encoding is applied to the target field; (5) the training data are fed into the network and training begins.

The 6 columns of discrete feature values are each fed into an Embedding layer for learning; the Embedding vectors corresponding to the columns are concatenated into a training vector, which is passed to the hidden layers of the network. There are 7 hidden layers, each containing 500 to 700 neurons, and the activation function is ReLU (Rectified Linear Unit).
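The forward pass through such a network can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the per-column vocabulary sizes, the embedding width of 4, the uniform width of 500 neurons, and the random initial weights are all hypothetical, and the output layer and training loop are omitted.

```python
import random

random.seed(0)

# Hypothetical sizes: distinct values per input column and embedding width.
VOCAB_SIZES = [5, 40, 30, 30, 8, 20]
EMBED_DIM = 4
HIDDEN_SIZES = [500] * 7  # 7 hidden layers, within the 500-700 range given in the text

def random_matrix(n_rows, n_cols):
    """Small random weights; a real model would learn these during training."""
    return [[random.uniform(-0.05, 0.05) for _ in range(n_cols)] for _ in range(n_rows)]

# One embedding table per column: row i is the vector for encoded value i.
embeddings = [random_matrix(size, EMBED_DIM) for size in VOCAB_SIZES]

# Hidden-layer weights; the first layer takes the concatenated embeddings.
layer_inputs = [EMBED_DIM * len(VOCAB_SIZES)] + HIDDEN_SIZES[:-1]
weights = [random_matrix(n_out, n_in) for n_in, n_out in zip(layer_inputs, HIDDEN_SIZES)]
biases = [[0.0] * n_out for n_out in HIDDEN_SIZES]

def forward(encoded_row):
    """Look up one embedding per column, concatenate, then apply the ReLU layers."""
    x = []
    for table, index in zip(embeddings, encoded_row):
        x.extend(table[index])
    for w, b in zip(weights, biases):
        x = [max(0.0, sum(wi * xi for wi, xi in zip(w_row, x)) + b_j)
             for w_row, b_j in zip(w, b)]
    return x

hidden = forward([0, 3, 7, 7, 1, 2])  # one hard-coded row with 6 column indices
```

The concatenated input has 6 × 4 = 24 components here; in practice the embedding width per column would be chosen separately, and a deep learning framework would replace the hand-written loops.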

The test set consists of 20% of the manually annotated data set and covers all 78 classes. Because newly appearing field values or categories must be handled uniformly in the experiments, the test set is processed as follows: if a feature value in the test set does not appear in the training set, this embodiment treats it as a new feature and uniformly replaces it with a preset character.
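A minimal sketch of this replacement step is given below. The token `<UNK>`, the choice to reserve index 0 for it when the mapping is built, and the sample unit strings are assumptions made for illustration; the text specifies only that unseen values are uniformly replaced with a preset character.

```python
UNKNOWN = "<UNK>"  # hypothetical preset token

def fit_mapping(train_values):
    """Hash table from value to index; index 0 is reserved for the preset token."""
    mapping = {UNKNOWN: 0}
    for value in train_values:
        mapping.setdefault(value, len(mapping))
    return mapping

def encode(values, mapping):
    """Encode values, routing anything unseen in training to the preset token."""
    return [mapping.get(v, mapping[UNKNOWN]) for v in values]

unit_mapping = fit_mapping(["umol/L", "mg/dL", "umol/L", "g/L"])
test_codes = encode(["mg/dL", "mmol/L", "g/L"], unit_mapping)
```

Here "mmol/L" never occurs in the training values, so it is encoded with the index of the preset token rather than raising a lookup error.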

To evaluate the deep-learning-based multi-source CKD key clinical index entity identification method, this embodiment uses Accuracy, Precision, Recall, and F1 Score (hereinafter the "balanced average") as evaluation metrics for classification performance, and runs comparison experiments against traditional machine learning algorithms such as SVM (Support Vector Machine), Linear Regression, K-nearest neighbor classification, XGBoost, and Decision Tree. Table 1 shows the results of the comparison experiments: classification based on Entity Embeddings outperforms the other algorithms on every metric and, in particular, improves on the XGBoost algorithm that is currently widely applied to structured data.
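The four metrics can be computed as in the sketch below. The text does not state how the per-class values are averaged over the 78 classes, so macro averaging is assumed here; the function name and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 (averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy example with three hypothetical index classes.
y_true = ["Scr", "Scr", "BUN", "UA"]
y_pred = ["Scr", "BUN", "BUN", "UA"]
metrics = classification_metrics(y_true, y_pred)
```

In the toy example one "Scr" sample is misclassified as "BUN", which lowers the recall of "Scr" and the precision of "BUN" while leaving "UA" unaffected.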

TABLE 1 Relevant metric values of different classification algorithms

As can be seen from Table 1, compared with a neural network of the same structure trained with hard coding, the Entity Embeddings algorithm is higher by 21.47% in accuracy, 37.08% in recall, 34.17% in precision, and 36.99% in balanced average. Compared with XGBoost, the best-performing traditional machine learning algorithm, it is higher by 1.1%, 2.79%, and 0.71% in accuracy, recall, and balanced average, respectively. The Entity Embeddings algorithm therefore performs markedly better than traditional hard coding in classifying test items, and also improves to some extent on XGBoost, the traditional machine learning algorithm currently in wide use.

FIG. 5 shows the ROC-AUC graphs of four different algorithms, and FIG. 6 shows their PR-AP graphs (PR-AP curves are precision-recall curves). The four algorithms are the Entity Embeddings algorithm, the hard coding method, the support vector machine, and linear regression. With the neural network structure held the same, comparing training with Entity Embeddings added against training with hard coding alone shows that the Entity Embeddings algorithm performs markedly better than hard coding on both the ROC-AUC and PR-AP graphs, and far better than the traditional machine learning algorithms. This embodiment thus shows that the neural network classification algorithm based on Entity Embeddings works well in practice for multi-source CKD key clinical index entity identification.

Referring to FIG. 5, FIG. 5 is a ROC (Receiver Operating Characteristic) curve; each point on the curve reflects the response to the same signal stimulus at a different decision threshold. The horizontal axis of FIG. 5 is the false positive rate (False Positive Rate, FPR), i.e., the proportion of all negative cases that are classified as positive; the vertical axis is the true positive rate (True Positive Rate, TPR). By plotting the ROC curves in FIG. 5, the Area Under the Curve (hereinafter "AUC") of each of the four algorithms can be obtained.

In FIG. 5, the AUC value of Entity Embeddings is 0.999, that of hard coding is 0.972, that of the support vector machine is 0.908, and that of linear regression is 0.928.
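How an ROC curve and its AUC are obtained from classifier scores can be sketched as follows for a single binary (one-vs-rest) problem. The text does not describe how the 78-class results were reduced to one curve per algorithm, so this is only an illustration; the helper names and the toy scores are hypothetical, and the trapezoidal rule is assumed for the area.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data: 1 marks the positive class, scores are predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc = auc(roc_points(y_true, scores))
```

In the toy data 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9; a value such as 0.999 means that almost every such pair is ranked correctly.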

Referring to FIG. 6, FIG. 6 is a precision-recall graph (hereinafter the "PR-AP graph"). The horizontal axis of FIG. 6 is the recall rate and the vertical axis is the precision rate. By plotting the PR-AP graph, the AUC values of the four algorithms on the PR-AP graph can be obtained.

In FIG. 6, the AUC value of Entity Embeddings is 0.987, that of hard coding is 0.720, that of the support vector machine is 0.520, and that of linear regression is 0.254.
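The area summarized by a PR curve can be computed as in the sketch below, again for a single binary problem. The text does not specify how the area was calculated; average precision (precision weighted by each increase in recall) is assumed here, ties between scores are assumed absent, and the helper names and toy scores are hypothetical.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) points as the threshold is lowered one sample at a time."""
    positives = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = []
    tp = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / positives, tp / rank))
    return points

def average_precision(points):
    """Sum precision weighted by each increase in recall."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Toy data: 1 marks the positive class, scores are predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ap = average_precision(pr_points(y_true, scores))
```

Unlike the ROC curve, the PR curve ignores true negatives, which is one reason the gaps between algorithms in FIG. 6 are wider than those in FIG. 5.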

Example 3

Referring to FIG. 7, the deep learning-based key clinical index entity recognition system includes:

an original index entity library establishing unit 10, configured to establish an original index entity library according to original clinical index data of chronic kidney disease;

an index entity knowledge base establishing unit 20 for establishing a key index entity knowledge base of chronic kidney disease based on expert labeling;

a model construction unit 30, configured to construct an index entity normalization model based on an Entity Embeddings algorithm, so as to form a mapping relationship and a classifier between the original index entity library and the key index entity knowledge base;

and the data classification unit 40 is used for inputting the test data to be processed into the index entity normalization model to obtain a classification result.

It can be understood that the above-described deep learning-based key clinical index entity recognition system corresponds to the deep learning-based key clinical index entity recognition method of embodiment 1. Any of the alternatives in embodiment 1 are also applicable to this embodiment and will not be described in detail here.

The foregoing is merely illustrative of the present invention, and the scope of the present invention is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive falls within the scope of the present invention.

Claims

1. The method for identifying the key clinical index entity based on deep learning is characterized by comprising the following steps of:

constructing an index entity normalization model based on an Entity Embeddings algorithm to form a mapping relation and a classifier between the original index entity library and the key index entity knowledge library;

inputting the test data to be processed into the index entity normalization model to obtain a classification result;

wherein constructing the index entity normalization model based on the Entity Embeddings algorithm to form the mapping relation and the classifier between the original index entity library and the key index entity knowledge library comprises the following steps:

adopting a model based on an Entity Embeddings algorithm to map the classified test data into discrete feature values one by one;

outputting probabilities corresponding to the index categories based on the Softmax function to form a Softmax classifier; the normalization processing of different names under the same index entity can be realized through the processing of the Softmax classifier, and the mapping relation between the original index entity library and the key index entity knowledge library can be completed;

calculating the deviation degree between the index classification result and the actual result by using the cross entropy loss function, and adjusting the weight parameter by a gradient descent method to reduce the deviation degree;

wherein adopting the model based on the Entity Embeddings algorithm to map the classified test data into discrete feature values one by one comprises the following steps:

mapping the classified test data into numerical values one by one using a hard coding method, storing the mapping relation into a hash table, and performing One-Hot coding on a target column;

setting the length of the vector, and inputting the classification field coded by One-Hot into an Entity Embeddings layer for conversion processing to obtain a fixed dimension vector.

2. The deep learning-based key clinical indicator entity recognition method according to claim 1, wherein the outputting probabilities corresponding to respective indicator categories based on a Softmax function comprises:

inputting the fixed dimension vector to a neural network layer for training;

3. The method for identifying key clinical index entity based on deep learning according to claim 2, wherein inputting the test data to be processed into the index entity normalization model to obtain the classification result comprises:

acquiring the assay data to be processed;

inputting the mapping result into the neural network layer for training, and obtaining the probability of the category to which the test data belongs through the Softmax function, so as to realize the classification of the classifier;

and selecting a category corresponding to the highest probability value in the categories to which the assay data belong as a clinical index entity category.

4. The deep learning-based key clinical index entity identification method according to claim 3, wherein the clinical index entity categories include institution numbers, index internal numbers, test chinese names, test english names, units, and reference ranges.

5. The deep learning based key clinical index entity identification method according to claim 3, wherein the inputting the classified field encoded by One-Hot into Entity Embeddings layer for conversion processing comprises:

mapping the discrete feature values into vectors at the Entity Embeddings layer;

6. The deep learning based key clinical indicator entity identification method of claim 3, wherein the preprocessing the assay data to be processed comprises:

and carrying out data cleaning and data complementation on the test data to be processed, and screening out the data of the required classification type.

7. The deep learning based key clinical indicator entity identification method of claim 3, further comprising:

if yes, the new field value or the new category is replaced uniformly.