CN114530248A - Method for determining risk pre-warning model of potentially inappropriate prescription for cardiovascular disease - Google Patents


Info

Publication number
CN114530248A
Authority
CN
China
Prior art keywords
data
model
sampling
risk
cardiovascular disease
Prior art date
Legal status
Pending
Application number
CN202210168629.1A
Other languages
Chinese (zh)
Inventor
童荣生
吴行伟
常欢
温亚林
Current Assignee
Sichuan Peoples Hospital of Sichuan Academy of Medical Sciences
Original Assignee
Sichuan Peoples Hospital of Sichuan Academy of Medical Sciences
Priority date
Filing date
Publication date
Application filed by Sichuan Peoples Hospital of Sichuan Academy of Medical Sciences
Priority to CN202210168629.1A
Publication of CN114530248A
Legal status: Pending

Classifications

    • G16H 50/30: ICT specially adapted for medical diagnosis or medical data mining; calculating health indices; individual health risk assessment
    • G16H 10/60: ICT for patient-specific data, e.g. electronic patient records
    • G06F 18/2411: classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/2415: classification based on parametric or probabilistic models
    • G06F 18/24323: tree-organised classifiers
    • G06N 3/045: neural networks; combinations of networks
    • G06N 3/047: probabilistic or stochastic networks
    • G06N 3/08: neural network learning methods
    • Y02A 90/10: ICT supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention discloses a method for determining a risk early-warning model for potentially inappropriate prescriptions (PIP) in cardiovascular disease. The method first performs data processing on desensitized medical records to obtain N data sets; it then establishes N × M risk early-warning models based on the N data sets, where M denotes the number of model types and each risk early-warning model takes a data set as input and outputs any one of PIP, PIM, and PPO. Finally, with AUC, accuracy, precision, recall, and F1 value as evaluation indices of model performance, the best-performing risk early-warning model for each of PIP, PIM, and PPO is selected according to these indices and used for PIP, PIM, and PPO risk early warning, so that a patient's risk of adverse drug reactions can be predicted from the medical history of the cardiovascular disease patient.

Description

Method for determining risk pre-warning model of potentially inappropriate prescription for cardiovascular disease
Technical Field
The invention relates to the technical field of machine learning and medical information processing, and in particular to a method for determining a risk early-warning model for potentially inappropriate prescriptions in cardiovascular disease.
Background
In recent years, as the population ages, elderly people often suffer from more than one chronic disease. Cardiovascular disease is common among elderly patients, many of whom also present with depression or anxiety, and many require combination therapy (antithrombotic drugs, statins, and antihypertensive drugs). Moreover, pharmacodynamics and pharmacokinetics change with age, increasing the risk of adverse drug reactions (ADRs). Relevant studies have shown that age, number of medications, number of diseases, and potentially inappropriate prescriptions (PIP) all increase the risk of ADRs in elderly patients, with PIP being the most common risk factor for ADRs in the elderly. PIP in turn comprises potentially inappropriate medications (PIM) and potential prescribing omissions (PPO).
Various PIP evaluation criteria have been established at home and abroad, including the Beers criteria established at the University of California, the STOPP/START criteria developed by experts at the affiliated hospitals of University College Cork, Ireland, and PIP criteria for the disease states of elderly people in China established by the Delphi method. Several studies have shown that the STOPP/START criteria (second edition), which list whether the use of certain drug classes in specific disease states is reasonable or possibly omitted, are more accurate than the Beers criteria. However, the existing criteria all intervene after the fact: they cannot warn in advance of an elderly patient's risk of developing PIP, cannot early-warn that risk accurately, and therefore cannot support precise intervention or individualized treatment for elderly patients.
With the application of machine learning in the medical field, many researchers have proposed modelling risk factors in patients' electronic medical records with machine learning algorithms to predict the risk of cardiovascular disease. Therefore, in order to reduce the occurrence of adverse drug reactions, it is necessary to provide a technical solution that learns from cardiovascular electronic medical records by machine learning to predict the risk of adverse drug reactions in patients with cardiovascular disease.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present invention aims to provide a method for determining a risk early-warning model for potentially inappropriate prescriptions in cardiovascular disease, which predicts the risk of a cardiovascular disease patient developing an adverse drug reaction based on the patient's electronic medical record.
In order to achieve the purpose, the invention provides the following technical scheme:
a method of determining a risk pre-warning model for potentially inappropriate prescription of cardiovascular disease, comprising the steps of:
s1: and deleting the information such as the identification card number, the name, the home address, the telephone number and the like from the cardiovascular patient medical record information. The study was a retrospective study, with no intervention taken, and the ethical committee considered it unnecessary to obtain patient consent. Performing data processing after the desensitization processing to obtain N data sets, dividing the N data sets into a training set and a test set, modeling the N data sets by using an M machine learning model, and establishing N multiplied by M risk early warning models;
s2: internal verification is performed by using a ten-fold cross verification method: inputting training set data into an M machine learning model, and adjusting model parameters by using a cross-folding verification method until the model parameters obtain the maximum AUC value on the training set, so as to obtain AUC values, accuracy, precision, recall rate and F1 values of internally verified data sampling and feature sampling of a PIP model, a PPO model and a PIM model and a machine learning model;
s3: generating an integrated model by selecting a plurality of models with the largest AUC values, establishing N data sets by using the M +1 machine learning model, and establishing N (M +1) risk early warning models;
s4: external verification was performed using Bootstrapping method: resampling the test set for N times by using a Bootstrapping method, establishing a new sample, and carrying out external verification on N x (M +1) risk early warning models by using the new sample to obtain AUC values, accuracy rates, precision rates, recall rates and F1 values of data sampling, feature sampling and machine learning models of external verification of a PIP model, a PPO model and a PIM model;
s5: the method comprises the steps of using an AUC, accuracy, precision, recall rate and F1 value index evaluation model, selecting a model with the maximum AUC as five models with the best prediction performance to obtain ROC curves and P-R curves of the five models, and selecting a model with the maximum ROC curve value as a risk early warning model with the most accurate prediction performance corresponding to a PIP model, a PIM model and a PPO model;
s6: calculating the SHAP value of each variable in the risk early warning model with the most accurate prediction performance, expressing the relationship between each output variable and the best model prediction result by using the SHAP value, and respectively arranging the contribution value of each variable to the best prediction model by taking the average value of the absolute value of the SHAP value of each variable as the importance of the variable to obtain the risk early warning output of each variable to PIP, PIM and PPO.
Further, the data processing comprises: deleting variables with a missing-data proportion above 90%, variables in which a single category accounts for more than 90%, and variables with a coefficient of variation below 0.1; dividing each data set into a training set and a test set at a ratio of 8 : 2; and sampling the data with x different data-sampling modes and screening it with y different data-screening modes, thereby obtaining the N data sets, where N = x × y.
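The cleaning rules in the preceding paragraph can be sketched with pandas as follows. This is a minimal illustration on synthetic data; the column names and thresholds for the toy columns are assumptions, not taken from the patent's data set.

```python
# Sketch of the data cleaning described above: drop variables with >90%
# missing values, >90% single-category share, or coefficient of variation
# < 0.1, then split 8 : 2. Column names here are hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(75, 12, 200),                 # kept (CV well above 0.1)
    "mostly_missing": [np.nan] * 195 + [1.0] * 5,   # dropped: >90% missing
    "nearly_constant": [0] * 195 + [1] * 5,         # dropped: >90% one category
    "low_cv": rng.normal(100, 1, 200),              # dropped: CV < 0.1
    "pip_label": rng.integers(0, 2, 200),
})

def keep(col: pd.Series) -> bool:
    if col.isna().mean() > 0.9:                     # missing-proportion rule
        return False
    if col.value_counts(normalize=True, dropna=True).iloc[0] > 0.9:
        return False                                # single-category rule
    cv = col.std() / abs(col.mean()) if col.mean() else 0.0
    return cv >= 0.1                                # coefficient-of-variation rule

features = [c for c in df.columns if c != "pip_label" and keep(df[c])]
train, test = train_test_split(df[features + ["pip_label"]],
                               test_size=0.2, random_state=0)   # 8 : 2 split
print(features, len(train), len(test))
```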
Further, the variables include myocardial infarction, cardiac conduction block, venous thromboembolism, history of gout, renal failure, anticoagulation therapy, angina, atherosclerosis, heart failure, diabetes, number of medications, number of illnesses, gender, hospital stays, age, gastrointestinal bleeding, antithrombotic therapy, history of cardiovascular disease, cerebrovascular disease, atrial fibrillation, hyperlipidemia, and hypertension.
Further, the cardiovascular patient medical record information uses the cardiovascular-system and antiplatelet/anticoagulant criteria of the STOPP/START standard (second edition) to identify prescriptions in which PIP may exist for elderly patients, comprising the PIM criteria for the cardiovascular system and antiplatelet/anticoagulant drugs and the PPO criteria for the cardiovascular system; the desensitization treatment comprises deleting the patient's identity number, name, home address, and telephone number.
Further, the method also comprises the step of verifying the data set quantity required by the method for determining the risk early warning model of the potential improper prescription of the cardiovascular disease.
Furthermore, training data of different quantities are randomly drawn in turn from the training set by resampling with replacement, and several groups of different AUC values are obtained on the test set; this is repeated n times to plot an AUC curve. As the quantity of data increases, the AUC values rise, the dispersion of the data decreases, and the curve flattens. In the PIP model the curve flattens when the data quantity reaches 70% of the available data, so the sample size for the PIP model is sufficient; in the PPO and PIM models the curve flattens at 70% of the sample and then continues to rise, so a data quantity of 70%-80% of the available data satisfies the PPO and PIM models.
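The sample-size check above can be sketched as a learning curve: train repeatedly on resampled subsets of increasing size and watch the test AUC flatten and its dispersion shrink. The estimator, fractions, and repeat count below are illustrative assumptions.

```python
# Sketch of the sample-size verification: resample training subsets of
# increasing size with replacement and record mean/std of the test AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
curve = {}
for frac in (0.2, 0.4, 0.6, 0.8, 1.0):
    n = int(frac * len(y_tr))
    aucs = []
    for _ in range(20):                            # repeat n times per size
        idx = rng.integers(0, len(y_tr), size=n)   # resample with replacement
        if len(np.unique(y_tr[idx])) < 2:
            continue
        m = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
        aucs.append(roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]))
    curve[frac] = (np.mean(aucs), np.std(aucs))    # dispersion shrinks with size
for frac, (mu, sd) in curve.items():
    print(f"{frac:.0%} of data: AUC {mu:.3f} ± {sd:.3f}")
```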
Further, hypothesis testing is included: a population is inferred from the available data quantity to obtain the difference between the indices of the available data and those of the population.
Further, the hypothesis testing comprises analysis of variance and the rank-sum test: if the data do not follow a normal distribution, the rank-sum test is used; if they do, homogeneity of variance is judged, and analysis of variance is selected when the variances are homogeneous, otherwise the rank-sum test is selected. The significance level is set to 0.05. The hypothesis tests are implemented with scipy.stats in Python 3.8, and the models are built with sklearn in Python 3.8.
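A minimal scipy.stats sketch of this test-selection rule follows. The group data are synthetic, and the rank-sum test is shown here with the Mann-Whitney U test (the patent does not name the exact scipy function).

```python
# Test-selection rule described above: non-normal data -> rank-sum test;
# normal data with homogeneous variances -> ANOVA; otherwise rank-sum.
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level set in the patent

def compare_groups(a, b):
    normal = stats.shapiro(a)[1] > ALPHA and stats.shapiro(b)[1] > ALPHA
    if not normal:
        return "rank-sum", stats.mannwhitneyu(a, b)[1]
    if stats.levene(a, b)[1] > ALPHA:            # variances homogeneous
        return "anova", stats.f_oneway(a, b)[1]
    return "rank-sum", stats.mannwhitneyu(a, b)[1]

rng = np.random.default_rng(0)
name, p = compare_groups(rng.normal(0, 1, 80), rng.normal(0.5, 1, 80))
print(name, round(p, 4))
```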
Further, the data-sampling modes include: no sampling, random up-sampling, random down-sampling, SMOTE up-sampling, and Borderline SMOTE up-sampling; the data-screening modes include: no screening, Lasso screening, and Boruta screening; and the machine learning models include: AdaBoost, Bagging, Bernoulli_Naive_Bayes, Decision_Tree, Extra_Tree, Gaussian_Naive_Bayes, Gradient_Boosting, KNN, LDA, Logistic_Regression, Multinomial_Naive_Bayes, Passive_Aggressive, QDA, Random_Forest, SGD, SVM, and XGBoost.
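The combinatorial structure implied here (5 sampling modes × 3 screening modes = 15 data-set variants, each fitted with 17 algorithms) can be enumerated directly; the string identifiers below simply mirror the lists above.

```python
# Enumerate the model grid: 15 data-set variants x 17 algorithms = 255
# candidate risk early-warning models per outcome (PIP, PIM, PPO).
from itertools import product

samplings = ["none", "random_over", "random_under", "smote", "borderline_smote"]
screenings = ["none", "lasso", "boruta"]
algorithms = [
    "AdaBoost", "Bagging", "Bernoulli_Naive_Bayes", "Decision_Tree",
    "Extra_Tree", "Gaussian_Naive_Bayes", "Gradient_Boosting", "KNN",
    "LDA", "Logistic_Regression", "Multinomial_Naive_Bayes",
    "Passive_Aggressive", "QDA", "Random_Forest", "SGD", "SVM", "XGBoost",
]

grid = list(product(samplings, screenings, algorithms))
print(len(samplings) * len(screenings), "data sets,", len(grid), "models")
```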
Further, the data sampling method includes:
and (3) non-sampling: inputting the original data into the model without sampling the data;
random upsampling: randomly copying data with few label types to ensure that the number of positive and negative data is consistent;
random down-sampling: randomly deleting data with more tag types to ensure that the number of positive and negative data is consistent;
SMOTE upsampling:
Synthetic minority over-sampling analyses the minority-class data, synthesizes new minority-class data, and adds them to the data set. The specific algorithm flow is as follows:
For each datum x in the minority class, compute its distance to all data in the minority-class data set using the Euclidean distance, obtaining the k nearest neighbours of x; set a sampling ratio according to the class-imbalance ratio to determine a sampling multiplier N; for each minority-class datum x, randomly select several data from its k nearest neighbours; and for each randomly selected neighbour x_n, construct a new datum from the original datum: x_new = x + rand(0, 1) · (x_n − x);
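The SMOTE flow above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the interpolation step (not the imbalanced-learn library), with hypothetical parameter defaults.

```python
# Minimal SMOTE sketch: for each minority sample x, pick one of its k
# nearest minority neighbours x_n and synthesise x_new on the segment
# between them: x_new = x + rand(0, 1) * (x_n - x).
import numpy as np

def smote(minority, k=5, n_new=10, seed=0):
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # Euclidean distance from x to every minority sample
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]        # skip x itself (distance 0)
        xn = minority[rng.choice(neighbours)]
        out.append(x + rng.random() * (xn - x))    # interpolate toward x_n
    return np.array(out)

rng = np.random.default_rng(1)
minority = rng.normal(size=(20, 3))
synthetic = smote(minority)
print(synthetic.shape)
```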
Borderline SMOTE upsampling:
The Borderline SMOTE algorithm synthesizes new data from minority-class data on the class boundary, thereby improving the class distribution: it divides the minority-class data into three categories, Safe, Danger, and Noise, and oversamples only the Danger minority-class data.
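The Safe/Danger/Noise partition described above can be sketched as follows, labelling each minority sample by how many of its k nearest neighbours belong to the majority class (a common formulation of Borderline SMOTE; the thresholds are standard, the data synthetic).

```python
# Borderline SMOTE partition sketch: Safe (< k/2 majority neighbours),
# Danger (>= k/2 but not all), Noise (all k neighbours are majority).
# Only Danger points would then be oversampled.
import numpy as np

def partition(X, y, k=5):
    labels = {}
    for i in np.flatnonzero(y == 1):               # minority class = 1
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]                # k nearest, excluding self
        maj = int(np.sum(y[nn] == 0))              # majority-class neighbours
        if maj == k:
            labels[i] = "Noise"
        elif maj >= k / 2:
            labels[i] = "Danger"
        else:
            labels[i] = "Safe"
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 40 + [1] * 10)
labels = partition(X, y)
print(sorted(set(labels.values())))
```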
A risk early-warning system for potentially inappropriate prescriptions of cardiovascular disease: a prediction system is established from the parameters of the optimal PIP, PPO, and PIM models, realizing visualization of the potentially-inappropriate-prescription risk early warning. The system comprises a data processing module and a risk early-warning module:
the data processing module is used for processing medical record data of a patient with cardiovascular disease;
and the risk early-warning module adopts a risk early-warning model determined by the above method for determining a risk early-warning model of potentially inappropriate prescriptions, and outputs the risk probabilities of PIP, PIM, and PPO from the data supplied by the data processing module.
Compared with the prior art, the invention has the beneficial effects that:
the method for determining the risk early warning model of potential improper prescription of cardiovascular diseases is mainly based on the traditional statistical analysis, such as one-factor analysis of variance, multivariate logistic regression and the like, on the research of PIP of patients with cardiovascular diseases. The present study used a variety of machine learning algorithms to build a PIP prediction model according to the SOTTP/START standard (second edition). The probability of the old patients with cardiovascular diseases to suffer from PIP is early warned, the model is internally and externally verified, the generalization capability is higher, the prediction result is more accurate, and the model is also verified in sample amount to verify whether the sample amount meets the requirements of the invention.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is an importance ranking diagram of the PIP, PPO and PIM models obtained by variable screening;
FIG. 3 is a diagram showing the results of PIP model, PPO model and PIM model internal verification;
fig. 4 is a diagram showing the results of external verification of the PIP model, the PPO model and the PIM model;
FIG. 5 is an importance diagram ranking the variables on data after 200 Bootstrapping resamples;
FIG. 6 is a bar graph of the average of SHAP values according to variables;
FIG. 7 is a graph of variables versus best model prediction results;
FIG. 8 is a bar graph of variable importance, taken as the mean of the absolute SHAP values of each variable;
fig. 9 is a sample-size/dispersion diagram of the PIP, PPO, and PIM models;
FIG. 10 is a schematic block diagram of a risk early warning system for potentially inappropriate prescription of cardiovascular disease;
FIG. 11 is a schematic view of a risk pre-warning system visualization interface for potentially inappropriate prescription of cardiovascular disease;
FIG. 12 is a schematic representation of Borderline SMOTE sampling.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments.
Thus, the following detailed description of the embodiments of the invention is not intended to limit the scope of the invention as claimed, but is merely representative of some embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments of the present invention and the features and technical solutions thereof may be combined with each other without conflict.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "upper", "lower", and the like refer to orientations or positional relationships based on orientations or positional relationships shown in the drawings, orientations or positional relationships that are usually used for placing the products of the present invention, or orientations or positional relationships that are usually understood by those skilled in the art, and these terms are only used for convenience of description and simplification of the description, and do not indicate or imply that the devices or elements referred to must have specific orientations, be constructed and operated in specific orientations, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
Example 1: as shown in fig. 1, the inventive method for determining a risk early-warning model of potentially inappropriate prescriptions of cardiovascular disease uses the cardiovascular-system and antiplatelet/anticoagulant criteria of the STOPP/START standard (second edition) to identify prescriptions in which PIP may exist for elderly patients, comprising 24 PIM criteria (13 for the cardiovascular system and 11 for antiplatelet/anticoagulant drugs) and 8 PPO criteria for the cardiovascular system.
Patients discharged from a hospital's geriatric cardiovascular department were selected as study subjects. The corresponding medical information was collected from the electronic medical record system as the data source, including prescription information, medical record information, and laboratory examinations. The inclusion criteria were: (1) patient age ≥ 65 years; (2) hospitalization of 3 to 60 days; (3) a diagnosis of at least one cardiovascular disorder, including hypertension, myocardial infarction, angina pectoris, hyperlipidemia, peripheral vascular disease, and indications for antithrombotic therapy (whether a patient has such an indication is determined by a cardiovascular physician). After the electronic medical records were collected, data desensitization was completed by deleting each patient's identity card number, name, home address, telephone number, and similar information. The assignment of the patients' basic information is shown in the following table:
TABLE 1 variable assignment case
(Table 1 is reproduced as an image in the original publication and is not available as text.)
Then, preprocessing the data after data desensitization treatment, which specifically comprises:
1. Delete columns with a missing-data ratio above 90%, columns in which a single category exceeds 90%, and columns with a coefficient of variation below 0.1. Pre-screening retained 16 variables and removed 6: X8 myocardial infarction, X11 cardiac conduction block, X16 venous thromboembolism, X17 history of gout, X18 renal failure, and X21 anticoagulation therapy.
2. The data set after data pre-screening and data sampling was screened for variables using Lasso and Boruta, see fig. 2. The results show that the first five variables of importance ranking in PIP model are angina pectoris, atherosclerosis, heart failure, diabetes and drug dose number, respectively, but hyperlipidemia, cardiovascular history, hypertension and cerebrovascular disease are of lower importance in PIP model, see fig. 2A. The five variables in the PPO model before the significance ranking were the number of doses, angina pectoris, atherosclerosis, cardiovascular history and age, respectively, and hyperlipidemia, diabetes, hypertension and cerebrovascular disease were of lower significance in the PPO model, as shown in FIG. 2B. The first five variables of importance ranking in the PIM model were the number of medications, number of illnesses, duration of hospitalization, age, and heart failure, respectively, and the less important variables in the PIM model were atherosclerosis, gastrointestinal bleeding, antithrombotic therapy, and cardiovascular history, see fig. 2C.
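The Lasso half of this screening step can be sketched with scikit-learn on synthetic stand-in data; the patent does not specify the exact estimator, so an L1-penalised logistic regression is used here as the selector, and Boruta (which needs a third-party package) is omitted.

```python
# Sketch of Lasso-style variable screening: an L1 penalty drives the
# coefficients of uninformative variables to zero, and SelectFromModel
# keeps only the variables with non-zero coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=16, n_informative=5,
                           random_state=0)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lasso).fit(X, y)
kept = np.flatnonzero(selector.get_support())
print("variables kept:", kept)
```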
Data sampling uses 5 modes: no sampling, random up-sampling, random down-sampling, SMOTE up-sampling, and Borderline SMOTE up-sampling; data screening uses 3 modes: no screening, Lasso screening, and Boruta screening. This yields 15 data sets.
The process of establishing the risk early warning model comprises the following steps:
and performing model internal verification by using a ten-fold cross verification method. The PIP, PPO and PIM respectively obtain 15 data sets through 5 data sampling and 3 feature screening methods, 17 machine learning algorithms are used, including AdaBoost, Bagging, Bernoulli _ Naive _ Bayes, precision _ Tree, Extra _ Tree, Gaussian _ Naive _ Bayes, Gradient _ Boosting, KNN, LDA, Logistic _ Regression, Multinomial _ Naive _ Bayes, Pasive _ aggregation, QDA, Rand _ Forest, SGD, SVM and XGS, and 255 prediction models are respectively established. As shown in fig. 3, the results of the PIP model, PPO model and PIM model internal verification show that,
the PIP model has different data samples and model performances established by different algorithms, wherein the best data sample is SMOTE, the AUC is 0.880 +/-0.095, and the accuracy is 0.814 +/-0.095. The best model is XGboost, the AUC is 0.854 plus or minus 0.130, and the accuracy is 0.798 plus or minus 0.121;
different data samples in the PPO model and models established by different algorithms have different performances, but different feature screening methods have no difference. The results show that the best data sampling is SMOTE, AUC is 0.800 + -0.126, and the accuracy is 0.733 + -0.114. The best model among the 17 models is XGboost, the AUC is 0.832 +/-0.133, and the accuracy is 0.772 +/-0.119;
the PIM model and the PPO model have similar results, the performances of the models established by different data samples and different algorithms have difference, and the different feature screening methods have no difference. The optimal data sampling method is Random Over Sampler, AUC is 0.786 +/-0.126, and accuracy is 0.726 +/-0.115. The best model is Random _ Forest, AUC is 0.818 +/-0.138, and accuracy is 0.758 +/-0.117.
External validation is then performed: an ensemble model is generated from the optimal five models with the largest AUC values, giving 18 models in total. The 15 data sets are modelled with the 18 machine learning models, building 270 machine learning models, and the best model is selected as the risk early-warning model. Data from 200 Bootstrapping resamples are used for the external validation; as shown in fig. 4, the results show that the models built with different data sampling, feature screening, and algorithms differ significantly in the PIP, PPO, and PIM models.
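The ensemble step (combining the models with the largest cross-validated AUC into one "Ensemble_Learning" model) can be sketched with a soft-voting classifier; the five base estimators below are placeholders for whichever five models rank best.

```python
# Rank candidate models by cross-validated AUC and combine the top five
# into a soft-voting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "rf": RandomForestClassifier(random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
    "lr": LogisticRegression(max_iter=1000),
    "nb": GaussianNB(),
    "dt": DecisionTreeClassifier(random_state=0),
}
# rank by cross-validated AUC, then keep the top five (all five here)
ranked = sorted(candidates.items(), reverse=True,
                key=lambda kv: cross_val_score(kv[1], X_tr, y_tr,
                                               scoring="roc_auc").mean())
ensemble = VotingClassifier(ranked[:5], voting="soft").fit(X_tr, y_tr)
auc = roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1])
print(f"ensemble AUC: {auc:.3f}")
```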
The best data sampling method in the PIP model was Random Over Sampler, with an AUC of 0.643±0.101 and an accuracy of 0.616±0.100. The best feature screening was no screening, with an AUC of 0.622±0.095 and an accuracy of 0.647±0.106. The best model was Ensemble_Learning, with an AUC of 0.696±0.095 and an accuracy of 0.682±0.107;
the optimal data sampling method in the PPO model was Borderline SMOTE, with an AUC of 0.632±0.082 and an accuracy of 0.673±0.070. The best feature screening was no screening, with an AUC of 0.604±0.089 and an accuracy of 0.646±0.081. The best model was Ensemble_Learning, with an AUC of 0.676±0.070 and an accuracy of 0.678±0.071;
the best data sampling method in the PIM model was Random Under Sampler, with an AUC of 0.563±0.089 and an accuracy of 0.572±0.078. The best feature screening was Boruta, with an AUC of 0.546±0.089 and an accuracy of 0.596±0.087. The best model was Ensemble_Learning, with an AUC of 0.637±0.076 and an accuracy of 0.647±0.072.
The data obtained by Bootstrapping with 200 resamples were input into the PIP, PIM and PPO models, and the variables were ranked by importance, see fig. 5. In the PIP model, the five most important variables were cerebrovascular disease, cardiovascular disease history, number of medications, length of hospital stay and age, while diabetes, gastrointestinal bleeding, hypertension and angina pectoris were less important in the PIP model, see fig. 5a. In the PPO model, the five most important variables were diabetes, hyperlipidemia, heart failure, length of hospital stay and gastrointestinal bleeding, while hypertension, cerebrovascular disease, antithrombotic therapy and atrial fibrillation were less important, see fig. 5b. In the PIM model, the five most important variables were diabetes, antithrombotic therapy, length of hospital stay, age and hypertension, while gastrointestinal bleeding, hyperlipidemia, cardiovascular disease history and atrial fibrillation were less important, see fig. 5c.
The five models with the best prediction performance were selected using evaluation indices such as the AUC, accuracy, precision, recall and F1 value. The best-performing model in the PIP group achieved an area under the ROC curve of 0.8341 and an area under the P-R curve of 0.9556; the best-performing model in the PPO group achieved 0.7007 (ROC) and 0.7992 (P-R); the best-performing model in the PIM group achieved 0.7061 (ROC) and 0.4268 (P-R). Other prediction performance indices are shown in tables 2-4 below:
TABLE 2 predicted Performance indices for PIP 5 best models
(Table content reproduced as an image in the original publication.)
TABLE 3 prediction performance index of PPO 5 best models
(Table content reproduced as an image in the original publication.)
TABLE 4 predicted Performance indices for PIM 5 best models
(Table content reproduced as an image in the original publication.)
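The prediction performance indices used in tables 2-4 (accuracy, precision, recall, F1 value) can all be computed from confusion-matrix counts. A minimal pure-Python sketch; the counts in the example call are hypothetical and purely illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many are real
    recall = tp / (tp + fn)             # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one external-validation run
acc, prec, rec, f1 = classification_metrics(tp=60, fp=20, fn=15, tn=55)
```

The AUC, by contrast, is computed from the ranking of predicted probabilities rather than from a single confusion matrix, so it is not shown here.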
The contribution of the variables in the optimal model is explained using SHAP values, see fig. 6. The results show that in the PIP model, cerebrovascular disease, heart failure, age, hyperlipidemia and hypertension contributed most to the predictions of the optimal model, while length of hospital stay (days), myocardial infarction and gender contributed least, see fig. 6A. In the PPO model, angina pectoris, age, heart failure and hyperlipidemia contributed most, while cerebrovascular disease, number of diseases, hypertension and myocardial infarction contributed least, see fig. 6B. In the PIM model, number of medications, angina pectoris and length of hospital stay (days) contributed most, while the variables contributing least included cerebrovascular disease, number of diseases, myocardial infarction, atrial fibrillation and diabetes, see fig. 6C.
Further, the relationship between each variable and the predictions of the optimal model is explained by calculating the SHAP value of each variable in the optimal model, as shown in fig. 7. The results show that the SHAP values of the PIP, PPO and PIM models are similar: higher values of variables such as length of hospital stay (days), heart failure, number of medications and angina pectoris reduce the prediction of the optimal model. Taking the mean of the absolute SHAP values of each variable as its importance, bar charts were drawn, as shown in figs. 8a-8c. The five most important variables in the PIP model were angina pectoris, atherosclerosis, number of diseases, number of medications and cardiovascular disease history; in the PPO model, angina pectoris, number of diseases, cardiovascular disease history, atherosclerosis and heart failure; and in the PIM model, heart failure, number of medications, angina pectoris, length of hospital stay and age.
Whether the sample size is sufficient was verified by resampling with replacement, and figs. 9a-9c were drawn. The results show that as the sample size increases, the AUC gradually rises, the dispersion of the samples decreases and the curve flattens. In fig. 9a, the curve for the PIP model flattens when the sample size reaches 70%, indicating that the sample size is sufficient for the PIP model. As shown in figs. 9b and 9c, the curves for the PPO and PIM models flatten when the sample size reaches 70% and then continue to rise, indicating that a sample size of 70-80% is sufficient for the PPO and PIM models.
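The sample-size check above — resampling with replacement from growing fractions of the data and watching the estimate stabilize — can be sketched as follows. The function name and the per-sample scores are illustrative stand-ins; in the real procedure each resample would yield a model AUC rather than a simple mean:

```python
import random

def bootstrap_curve(scores, fractions, n_boot=200, seed=0):
    """For each sample fraction, draw n_boot resamples with replacement from
    the first k scores and record the mean estimate, mimicking the AUC-vs-
    sample-size curve of figs. 9a-9c."""
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        k = max(1, int(len(scores) * frac))
        estimates = []
        for _ in range(n_boot):
            resample = [rng.choice(scores[:k]) for _ in range(k)]
            estimates.append(sum(resample) / k)
        curve.append(sum(estimates) / n_boot)
    return curve
```

When the curve stops changing as the fraction grows, the sample size is judged sufficient, as in the 70-80% thresholds reported above.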
As shown in fig. 10, in another aspect of the present invention, there is also provided a risk pre-warning system for determining a potentially inappropriate prescription for a cardiovascular disease, comprising:
a data preprocessing module for preprocessing the electronic medical record for which the risk of potentially inappropriate prescription of cardiovascular disease is to be determined;
a risk pre-warning module for performing PIP, PIM and PPO risk pre-warning according to the data input from the preprocessing module, using the risk pre-warning model determined by the method of determining a risk pre-warning model of potentially inappropriate prescription for cardiovascular disease.
In implementation, as shown in fig. 11, in order to facilitate the input of the electronic medical record data of the patient, the present invention further provides an electronic medical record data input interface module, where the input interface includes input fields of variables such as X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, and X22 in table 1, and the data preprocessing module is configured to preprocess the input variable data to obtain a data set, and then perform operation processing on the data set by the risk early warning module to further achieve the early warning of the risk of improper prescription of the cardiovascular disease of the patient.
The technical solution of the invention is further elaborated below in connection with the 5 data sampling methods and 17 machine learning algorithms:
Non-sampling: the original data are input into the model without any sampling;
Random upsampling: data of the minority label class are randomly copied so that the numbers of positive and negative samples are consistent;
Random downsampling: data of the majority label class are randomly deleted so that the numbers of positive and negative samples are consistent;
SMOTE upsampling:
The minority-class samples are analyzed with the synthetic minority oversampling technique, and new synthetic minority-class samples are added to the data set. The specific algorithm flow is as follows:
For each sample x in the minority class, the distance from x to every sample in the minority-class sample set is calculated using the Euclidean distance, giving the k nearest neighbours of x.
A sampling ratio is set according to the class-imbalance ratio to determine the sampling multiplier N, and for each minority-class sample x several samples are randomly selected from its k nearest neighbours. Assume the selected neighbour is xn.
For each randomly selected neighbour xn, a new sample is constructed together with the original sample:
xnew = x + rand(0,1) × (xn − x)
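The interpolation rule above can be sketched in pure Python. The function names are illustrative; a production implementation would normally use a library such as imbalanced-learn instead:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def smote_synthesize(x, minority, k=3, rng=None):
    """Create one synthetic sample from minority-class point x using the
    interpolation rule xnew = x + rand(0,1) * (xn - x)."""
    rng = rng or random.Random(0)
    # k nearest minority-class neighbours of x (excluding x itself)
    neighbours = sorted((p for p in minority if p is not x),
                        key=lambda p: euclidean(x, p))[:k]
    xn = rng.choice(neighbours)
    t = rng.random()                       # rand(0,1)
    return tuple(xi + t * (ni - xi) for xi, ni in zip(x, xn))
```

Each synthetic point lies on the line segment between x and one of its neighbours, which is what keeps SMOTE samples inside the minority-class region.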
Borderline SMOTE upsampling:
The Borderline SMOTE algorithm improves the class distribution by synthesizing new samples from the minority-class samples on the decision boundary. Borderline SMOTE divides the minority-class samples into 3 classes, Safe, Danger and Noise, and oversamples only the Danger minority-class samples, as shown in FIG. 12.
AdaBoost: the adaptive boosting algorithm works as follows. A weak learner 1 is first trained from the training set with initial weights, and the weights of the training samples are updated according to the error rate of weak learner 1, so that the samples misclassified by weak learner 1 receive higher weights and draw more attention from the subsequent weak learner 2. Weak learner 2 is then trained on the reweighted training set, and this is repeated until the number of weak learners reaches a pre-specified number T; finally, the T weak learners are combined by a set strategy into the final strong learner.
Given a training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)}, where each instance xi belongs to the instance space χ ⊆ R^n and each label yi belongs to the label set {-1, +1}, the purpose of AdaBoost is to learn a series of weak classifiers (basic classifiers) from the training data and then combine these weak classifiers into one strong classifier.
The algorithm flow of Adaboost is as follows:
Step 1: initialize the weight distribution of the training data, assigning each of the N training samples the same weight 1/N at the start:
D1 = (w11, ..., w1i, ..., w1N), w1i = 1/N, i = 1, 2, ..., N
Step 2: perform M iterations, with m = 1, 2, ..., M denoting the current round:
a. learning by using a training data set with weight distribution Dm to obtain a basic classifier:
Gm(x):χ→{-1,+1}
b. Calculate the classification error rate of Gm(x) on the training data set:
em = Σ(i=1..N) wmi · I(Gm(xi) ≠ yi)
As the above equation shows, the error rate em of Gm(x) on the training data set is the sum of the weights of the samples misclassified by Gm(x);
c. Calculate the coefficient am of Gm(x), where am represents the importance of Gm(x) in the final classifier, i.e. the weight occupied by this basic classifier:
am = (1/2) · ln((1 − em) / em)
As this formula shows, am ≥ 0 when em ≤ 1/2, and am increases as em decreases, meaning that a basic classifier with a smaller classification error rate plays a greater role in the final classifier;
d. Update the weight distribution of the training data set to obtain the sample weights for the next iteration:
Dm+1 = (wm+1,1, wm+1,2, ..., wm+1,i, ..., wm+1,N),
wm+1,i = (wmi / Zm) · exp(−am · yi · Gm(xi)), i = 1, 2, ..., N
In this way the weights of samples misclassified by the basic classifier Gm(x) are increased and the weights of correctly classified samples are decreased, so that AdaBoost "focuses on" the samples that are difficult to classify;
where Zm is a normalization factor that makes Dm+1 a probability distribution:
Zm = Σ(i=1..N) wmi · exp(−am · yi · Gm(xi))
Step 3: combine the weak classifiers:
f(x) = Σ(m=1..M) am · Gm(x)
The final classifier is thus obtained as:
G(x) = sign(f(x)) = sign(Σ(m=1..M) am · Gm(x))
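One AdaBoost round (steps b-d above: error rate em, coefficient am, and the renormalised weight distribution Dm+1) can be sketched in pure Python; the function name is illustrative:

```python
import math

def adaboost_round(weights, preds, labels):
    """One AdaBoost round for labels/predictions in {-1, +1}:
    returns (em, am, Dm+1)."""
    # em: sum of weights of misclassified samples
    e = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
    # am = (1/2) ln((1 - em) / em)
    a = 0.5 * math.log((1 - e) / e)
    # wm+1,i proportional to wmi * exp(-am * yi * Gm(xi)), then normalise by Zm
    unnorm = [w * math.exp(-a * y * p) for w, p, y in zip(weights, preds, labels)]
    z = sum(unnorm)
    return e, a, [w / z for w in unnorm]
```

With four equally weighted samples and one misclassification, the misclassified sample's weight rises from 1/4 to 1/2 after one round, illustrating how AdaBoost "focuses on" hard samples.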
Bagging: used in conjunction with other algorithms to improve their accuracy and stability and to avoid overfitting. The principle is that several weak learners are constructed independently and in parallel, so they can be trained simultaneously, and the weak learners are finally combined. Bagging suits models with low bias and high variance. The input is a sample set D = {(x1, y1), (x2, y2), ..., (xm, ym)}, a weak learner algorithm and the number of weak classifier iterations T;
the output is the final strong classifier f (x)
1. For t = 1, 2, ..., T:
a: Randomly sample the training set for the t-th time, drawing m samples in total, to obtain a sampling set Dt containing m samples;
b: Train the t-th weak learner Gt(x) on the sampling set Dt.
2. If a classification algorithm is used for prediction, the final class is the class (or one of the classes) receiving the most votes from the T weak learners; if it is a regression algorithm, the final model output is the arithmetic mean of the regression results of the T weak learners.
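The two Bagging building blocks above — bootstrap sampling of the training set and majority voting over the T learners — can be sketched as (function names are illustrative):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw m samples with replacement from a training set of size m."""
    return [rng.choice(data) for _ in data]

def bagging_predict(learners, x):
    """Combine T weak learners by majority vote (classification case)."""
    votes = Counter(g(x) for g in learners)
    return votes.most_common(1)[0][0]
```

For regression, `bagging_predict` would instead return the arithmetic mean of the learners' outputs, as stated in step 2.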
The naive Bayes algorithm is a classification method based on Bayes' theorem. The joint probability distribution is learned from the training data, from which the posterior probability distribution is obtained; variants include Bernoulli naive Bayes, Gaussian naive Bayes and multinomial naive Bayes.
The Bayesian theorem is introduced as follows:
p(ci | x, y) = p(x, y | ci) · p(ci) / p(x, y)
where (x, y) denotes the feature variables, ci denotes class i, and p(ci | x, y) denotes the probability of belonging to class ci when the features are (x, y). Combining the conditional probability with Bayes' theorem:
1: if p(c1 | x, y) > p(c2 | x, y), the sample should belong to class c1;
2: if p(c1 | x, y) < p(c2 | x, y), the sample should belong to class c2.
Bernoulli_Naive_Bayes: the model suits the multivariate Bernoulli distribution, i.e. each feature is a binary variable; if a feature is not binary, the model first binarizes it. In document classification, for example, the feature is whether a word appears in a document: 1 if the word appears in a file, 0 otherwise. In text classification, a vector of word-occurrence indicators, rather than word counts, can be used to train and apply this classifier. BernoulliNB may perform better on some data sets, especially those with shorter documents.
Gaussian_Naive_Bayes: suitable for continuous variables; it assumes that each feature xi follows a normal distribution within each class y, and the algorithm calculates the probability from the probability density function of the normal distribution:
P(xi | y) = (1 / √(2π·σy²)) · exp(−(xi − μy)² / (2σy²))
μy: the mean of feature xi in samples of class y;
σy²: the variance of feature xi in samples of class y.
Multinomial_Naive_Bayes: implements the naive Bayes algorithm for multinomially distributed data and is one of the two classic naive Bayes variants for text classification, where data are often represented as word-count vectors. The distribution is parameterized for each class c by a vector θc = (θc1, θc2, ..., θcn), where n is the number of features (the vocabulary size in text classification) and θci is the probability p(xi | c) of feature i given class c. θci is estimated by the smoothed maximum likelihood estimate, i.e. the relative frequency count:
θ̂ci = (Nci + α) / (Nc + α·n)
where Nci is the number of occurrences of feature i in class c, Nc is the total number of occurrences of all features in class c, and α is the smoothing parameter.
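The smoothed estimate above can be sketched directly; `smoothed_theta` is an illustrative name, and α = 1 corresponds to Laplace smoothing:

```python
def smoothed_theta(counts, alpha=1.0):
    """Smoothed relative-frequency estimates theta_ci = (N_ci + alpha) /
    (N_c + alpha * n) for the n feature counts of one class c."""
    n = len(counts)
    total = sum(counts)                  # N_c: all feature occurrences in class c
    return [(c + alpha) / (total + alpha * n) for c in counts]
```

Smoothing keeps every θci strictly positive, so a feature never seen in a class does not zero out the whole class posterior.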
Decision_Tree: decision trees are a basic machine learning algorithm. A classification decision tree is used for categorical data and a regression decision tree for continuous data. The model is readable and fast at classification. The most common decision tree algorithms are ID3, C4.5 and CART.
The information gain represents the degree to which knowing the information of feature A reduces the uncertainty of the classification;
g(D,A)=H(D)-H(D|A)
the information gain g (D, A) of the feature A to the training data set D, defined as the difference between the empirical entropy H (D) of the set D and the empirical conditional entropy H (D | A) of D given the feature A;
Assume the data set D has K classes and the feature A has n possible values. The empirical entropy H(D) of the data set D is
H(D) = −Σ(k=1..K) Pk · log2(Pk)
where Pk is the probability that any sample in D belongs to class k, i.e. the proportion of samples belonging to class k.
The empirical conditional entropy H(D | A) is:
H(D | A) = Σ(i=1..n) (|Di| / |D|) · H(Di)
which can also be written as
H(D | A) = Σ(i=1..n) Pi · H(Di)
where Pi is the probability that feature A takes its i-th value, and Di is the set of samples for which feature A takes its i-th value.
The greater the information gain, the greater the "purity boost" obtained using attribute a for partitioning. Thus, the information gain may be used for partition attribute selection for the decision tree. The ID3 decision tree learning algorithm selects partition attributes based on information gain.
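The entropy and information-gain formulas above can be sketched in pure Python (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) = -sum_k P_k * log2(P_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """g(D, A) = H(D) - H(D|A), with H(D|A) weighted by subset size |Di|/|D|."""
    n = len(labels)
    groups = {}
    for y, a in zip(labels, feature_values):
        groups.setdefault(a, []).append(y)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond
```

A feature whose values perfectly separate the classes yields a gain equal to H(D); a feature independent of the labels yields a gain of zero.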
Extra_Tree: the ET or Extra-Trees algorithm is very similar to the random forest algorithm and is also composed of many decision trees.
Extra-Trees uses all the samples and selects split points at random; because the splits are random, it can sometimes obtain better results than random forests.
Gradient_Boosting: the difference is that Gradient Boosting measures the learner with the negative gradient; each round of learning fits the negative gradient to correct the error of the previous round, making the model more robust to noise. Specifically:
1. Initialization:
f0(x) = argmin_γ Σ(i=1..N) L(yi, γ)
where γ is the output value of the leaf node of the base learner.
2. For m = 1 to M:
(a) Calculate the negative gradient:
rim = −[∂L(yi, f(xi)) / ∂f(xi)] evaluated at f = fm−1, i = 1, 2, ..., N
(b) Fit the negative gradient with a base learner hm(x; wm) by minimizing the squared error:
wm = argmin_w Σ(i=1..N) (rim − hm(xi; w))²
(c) Determine the step size ρm by line search so as to minimize L:
ρm = argmin_ρ Σ(i=1..N) L(yi, fm−1(xi) + ρ · hm(xi; wm))
(d) fm(x) = fm−1(x) + ρm · hm(x; wm)
3. Output fM(x).
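A single boosting round under squared loss, where the negative gradient of step (a) reduces to the residual y − f(x), can be sketched as follows. The mean-predictor base learner in the test is a deliberately trivial stand-in for a regression tree, and all names are illustrative:

```python
def gradient_boost_step(preds, targets, learn_fn, rho=0.1):
    """One boosting round under squared loss L = (1/2)(y - f)^2:
    the negative gradient is the residual y - f(x), a base learner is fitted
    to the residuals, and the ensemble is updated with step size rho."""
    residuals = [y - f for f, y in zip(preds, targets)]   # step (a)
    h = learn_fn(residuals)                               # step (b)
    return [f + rho * hi for f, hi in zip(preds, h)]      # step (d)
```

Iterating this step drives the ensemble prediction toward the targets, each round correcting part of the previous round's error.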
KNN: the k-nearest-neighbor algorithm. Its idea is that a sample is most similar to the k samples closest to it in the data set; if most of those k samples belong to a certain class, then the sample also belongs to that class.
In KNN, the matching problem between objects is avoided by using the distance between objects as an index of their dissimilarity, where the distance is generally the Euclidean or Manhattan distance:
Euclidean distance:
d(x, y) = √(Σ(k=1..n) (xk − yk)²)
Manhattan distance:
d(x, y) = Σ(k=1..n) |xk − yk|
where x and y are two samples, n is the dimension, and xk, yk are the feature values of x and y in the k-th dimension.
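The two distances and the k-nearest-neighbor vote can be sketched as (function names are illustrative):

```python
import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_k (x_k - y_k)^2)."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def manhattan(x, y):
    """d(x, y) = sum_k |x_k - y_k|."""
    return sum(abs(xk - yk) for xk, yk in zip(x, y))

def knn_classify(query, data, k=3, dist=euclidean):
    """Majority vote among the k nearest training points.
    data is a list of (features, label) pairs."""
    nearest = sorted(data, key=lambda fl: dist(query, fl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Swapping `dist=manhattan` changes only the dissimilarity index, not the voting rule.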
LDA: the linear discriminant analysis algorithm is a supervised algorithm. The original data are projected into a low-dimensional space so that samples of the same class are clustered together and samples of different classes are separated.
Let the data set be D = {(x1, y1), (x2, y2), ..., (xm, ym)}, where each sample xi is an n-dimensional vector and yi ∈ {0, 1}. Define Nj (j = 0, 1) as the number of samples of class j, Xj (j = 0, 1) as the set of samples of class j, μj (j = 0, 1) as the mean vector of class j, and Σj (j = 0, 1) as the covariance matrix of class j with the denominator omitted.
The expression for μj is:
μj = (1/Nj) · Σ(x∈Xj) x (j = 0, 1)
The expression for Σj is:
Σj = Σ(x∈Xj) (x − μj)(x − μj)^T (j = 0, 1)
Because there are two classes of data, the data only need to be projected onto a straight line. Let the projection direction be the vector w; for any sample xi, its projection on the line is w^T·xi, and for the two class centers μ0 and μ1, the projections are w^T·μ0 and w^T·μ1. Because LDA requires the distance between the centers of different classes to be as large as possible, it maximizes ||w^T·μ0 − w^T·μ1||²₂; at the same time, the projections of samples of the same class should be as close as possible, i.e. the covariances of the projected same-class samples, w^T·Σ0·w and w^T·Σ1·w, should be as small as possible, so w^T·Σ0·w + w^T·Σ1·w is minimized. The optimization objective is:
J = ||w^T·μ0 − w^T·μ1||²₂ / (w^T·Σ0·w + w^T·Σ1·w)
The within-class scatter matrix Sw is generally defined as:
Sw = Σ0 + Σ1 = Σ(x∈X0)(x − μ0)(x − μ0)^T + Σ(x∈X1)(x − μ1)(x − μ1)^T
and the between-class scatter matrix Sb is defined as:
Sb = (μ0 − μ1)(μ0 − μ1)^T
so the optimization objective can be rewritten as:
J = (w^T·Sb·w) / (w^T·Sw·w)
Logistic_Regression: logistic regression is a probabilistic nonlinear regression model, a multivariate analysis method for studying the relationship between a binary outcome and its influencing factors.
Consider a vector x = (x1, x2, ..., xn) of n independent variables, and let the conditional probability P(y = 1 | x) = p be the probability of the event occurring given the observation x. The logistic regression model can then be expressed as:
p = exp(ω0 + ω1·x1 + ... + ωn·xn) / (1 + exp(ω0 + ω1·x1 + ... + ωn·xn))
The ratio of the probability that the event occurs to the probability that it does not is:
p / (1 − p)
This ratio is called the odds of the event. Taking the logarithm of the odds gives:
ln(p / (1 − p)) = ω0 + ω1·x1 + ... + ωn·xn
In the classification case, the learned LR classifier is a set of weights ω0, ω1, ..., ωn.
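The model and its log-odds (logit) form can be sketched as (function names are illustrative):

```python
import math

def logistic_p(x, w0, w):
    """P(y=1|x) = exp(w0 + sum_i w_i x_i) / (1 + exp(w0 + sum_i w_i x_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1 + math.exp(z))

def log_odds(p):
    """ln(p / (1 - p)): the linear part of the model, recovered from p."""
    return math.log(p / (1 - p))
```

Applying `log_odds` to the model output recovers exactly the linear predictor ω0 + Σ ωi·xi, which is why the coefficients of logistic regression are interpreted on the log-odds scale.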
Passive_Aggressive: a classic online linear classifier; the passive-aggressive algorithm can continuously incorporate new samples to adjust the classification model and enhance its classification ability.
QDA: the quadratic discriminant analysis algorithm is similar to LDA; it generalizes LDA to the case where the covariance matrices of the classes differ. For the QDA classifier to achieve high classification performance, an accurate estimate of the covariance matrices is needed.
The center of each class is determined, and the class k for which the discriminant function
δk(x) = −(1/2)·ln|Σk| − (1/2)·(x − μk)^T·Σk⁻¹·(x − μk) + ln(πk)
attains its maximum value is chosen.
Random_Forest: a random forest performs classification by integrating many decision trees. Suppose the training set has n samples with d features each, and a random forest of T trees is to be trained. The specific algorithm flow is as follows:
1. For each of the T decision trees, repeat the following: a. use Bootstrap sampling to draw a training set Dt of size n from the training set D; b. randomly select m of the d features;
2. if it is a regression problem, the final output is the mean of each tree output;
3. if the problem is a classification problem, determining a final class according to a voting principle;
The generation of each tree is random. As for the size m of the randomly selected feature subset, it is mainly determined by two methods: cross validation or empirical setting.
SGD: the objective function f(x) is minimized using stochastic gradient descent, which iterates in the direction opposite to the gradient vector to reach an extreme point of the function. SGD is generally the simplest and most practical method, but its convergence is somewhat slow.
SVM: the support vector machine is a binary classification model, defined as the maximum-margin separating hyperplane in the feature space. Support vectors play the key role in determining the separating hyperplane. For any hyperplane, the data points on its two sides have a minimum perpendicular distance to it, and the sum of these two minimum distances is the margin. The larger the margin, the smaller the probability of error, i.e. the higher the robustness: the classifier generalizes well to unknown sample points, is not easily disturbed and is more credible. The hyperplane is expressed as:
w^T·x + b = 0
where w = (w1; w2; ...; wd) is the normal vector that determines the direction of the hyperplane, and d is the number of features.
x is a training sample; b is the displacement term, which determines the distance between the hyperplane and the origin.
Once the normal vector w and displacement b are determined, the separating hyperplane is uniquely determined. The distance between the two margin hyperplanes on either side of the separating hyperplane is
2 / ||w||
Using some mathematical derivation, the constraint yi · (w0 + w1·x1 + w2·x2) ≥ 1 turns the problem into a constrained convex optimization problem whose solution yields the margin-maximizing separating hyperplane, with decision function:
f(X) = sign(Σi αi · yi · ⟨Xi, X⟩ + b0)
This equation represents the margin-maximizing separating hyperplane, where i indexes the support vector points; since most points are not support vectors and only individual points on the margin hyperplanes are, the sum runs only over the support vectors. Xi is the feature vector of support vector point i; yi is the class label of Xi, such as +1 or −1; X is the instance to be tested, substituted into the equation; αi and b0 are single numerical parameters obtained from the optimization above, αi being a Lagrange multiplier.
Whenever there is a new test sample X, it is substituted into the equation; the sign of the result determines its class.
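The decision function above, summing only over the support vector points, can be sketched as follows; `support` holds hypothetical (αi, yi, Xi) triples, as would be produced by the optimization:

```python
def svm_decision(x, support, b0):
    """f(x) = sign(sum_i alpha_i * y_i * <X_i, x> + b0), where support is a
    list of (alpha_i, y_i, X_i) triples for the support vectors only."""
    s = b0 + sum(a * y * sum(xi * xj for xi, xj in zip(sv, x))
                 for a, y, sv in support)
    return 1 if s >= 0 else -1
```

Because only support vectors enter the sum, prediction cost depends on the number of support vectors, not the full training-set size.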
XGBoost: XGBoost has a linear model solver and a tree learning algorithm; it is an improvement on the GBDT algorithm and is more efficient. Traditional GBDT uses only first-order derivative information, whereas XGBoost applies a second-order Taylor expansion to the loss function, accurately approximating the actual loss function; a regularization term is added to the objective function and the optimal solution is found globally, balancing the decrease of the objective function against the complexity of the model, avoiding overfitting and improving the solving efficiency of the model.
The above embodiments are only intended to illustrate the invention, not to limit the technical solutions described herein. Although the present specification has described the invention in detail with reference to the above embodiments, the invention is not limited to them; any modification or equivalent replacement of the invention, and all such modifications and variations, are intended to fall within the scope of this disclosure and the appended claims.

Claims (10)

1. A method of determining a risk pre-warning model for potentially inappropriate prescription of cardiovascular disease, comprising the steps of:
s1: performing desensitization processing on cardiovascular patient medical record information by deleting identity card number, name, home address and telephone number information; performing data processing on the desensitized case data, the data processing comprising data sampling and feature screening, to obtain N data sets; dividing each of the N data sets into a training set and a test set, the training set being larger than the test set; and modeling the N data sets with M machine learning models to establish N × M risk pre-warning models;
s2: performing internal verification using ten-fold cross validation: inputting the training set data into the M machine learning models, and adjusting the model parameters by ten-fold cross validation until the models obtain the maximum AUC value on the training set, thereby obtaining the AUC value, accuracy, precision, recall and F1 value of the internally verified data sampling, feature screening and machine learning models for the PIP, PPO and PIM models;
s3: generating an integrated model from the several models with the largest AUC values, and modeling the N data sets with the resulting M + 1 machine learning models to establish N × (M + 1) risk pre-warning models;
s4: performing external verification using the Bootstrapping method: resampling the test set N times by Bootstrapping to establish new samples, and externally verifying the N × (M + 1) risk pre-warning models with the new samples to obtain the AUC value, accuracy, precision, recall and F1 value of the externally verified data sampling, feature screening and machine learning models for the PIP, PPO and PIM models;
s5: evaluating the models with the AUC, accuracy, precision, recall and F1 value indices; selecting the five models with the largest AUC values as the models with the best prediction performance to obtain the ROC and P-R curves of the five models; and selecting the model with the largest ROC curve value as the most accurate risk pre-warning model corresponding to each of the PIP, PIM and PPO models;
s6: calculating the SHAP value of each variable in the most accurate risk pre-warning model, expressing the relationship between each output variable and the prediction of the best model with the SHAP values, and ranking the contribution of each variable to the best prediction model by taking the mean of the absolute SHAP values of each variable as its importance, to obtain the risk pre-warning output of each variable for PIP, PIM and PPO.
2. The method of determining a risk pre-warning model for potentially inappropriate prescription of cardiovascular disease as claimed in claim 1, wherein the data processing further comprises: deleting variables with a data-missing proportion higher than 90%, variables with a single-category proportion higher than 90% and variables with a coefficient of variation smaller than 0.1; dividing the N data sets into training sets and test sets in a ratio of 8:2; and sampling the data with x different data sampling methods and screening the data with y different feature screening methods respectively to obtain the N data sets, wherein N = x × y.
3. The method of determining a risk pre-warning model for potentially inappropriate prescription of cardiovascular disease as claimed in claim 2, wherein the variables comprise myocardial infarction, cardiac conduction block, venous thromboembolism, history of gout, renal failure, anticoagulant therapy, angina, atherosclerosis, heart failure, diabetes, number of medications, number of illnesses, gender, length of hospital stay, age, gastrointestinal bleeding, antithrombotic therapy, history of cardiovascular disease, cerebrovascular disease, atrial fibrillation, hyperlipidemia, and hypertension.
4. The method of determining a risk pre-warning model for potentially inappropriate prescription of cardiovascular disease as claimed in claim 1, wherein the cardiovascular patient medical record information identifies PIP that may be present in elderly patients using cardiovascular system and antiplatelet/anticoagulant drugs according to the second edition of the STOPP/START criteria, including the PIM criteria for cardiovascular system and antiplatelet/anticoagulant drugs and the PPO criteria for the cardiovascular system, and the desensitization processing comprises deleting the patient's identity card number, name, home address and telephone number information.
5. The method of determining a risk pre-warning model for potentially inappropriate prescription of cardiovascular disease as claimed in claim 1, further comprising sample size verification of the amount of data required by the method of determining a risk pre-warning model for potentially inappropriate prescription of cardiovascular disease.
6. The method for determining a risk pre-warning model for potentially inappropriate prescription for cardiovascular disease as claimed in claim 5, wherein the sample size verification randomly draws data from the training set by resampling with replacement, evaluates each draw on the test set, and repeats this n times to obtain multiple sets of AUC values, from which a curve of AUC against data volume is plotted; as the data volume increases, the AUC gradually rises, the dispersion of the values decreases, and the curve flattens; once the curve is flat, the sample size used by the model is sufficient, giving the data volume that meets the requirements of the model.
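The bootstrap sample-size check of this claim can be sketched as follows. This is a minimal sketch on synthetic data: `make_classification` stands in for the patient data sets and logistic regression for the candidate model, which are assumptions, not the patented configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# toy stand-in for the claim's training and test sets
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
sizes = [50, 100, 200, 400, 800, 1600]
n_repeats = 20
mean_auc = []
for m in sizes:
    aucs = []
    for _ in range(n_repeats):
        # resampling with replacement: draw m training rows, duplicates allowed
        idx = rng.integers(0, len(X_tr), size=m)
        clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
        # each bootstrap model is scored on the fixed test set
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    mean_auc.append(float(np.mean(aucs)))
# plotting mean_auc against sizes gives the claim's curve; where the curve
# flattens, the sample size is deemed sufficient
```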
7. The method of determining a risk pre-warning model for potentially inappropriate prescription for cardiovascular disease of claim 1, further comprising a hypothesis test that infers the population from the existing data volume, determining whether the indices of the existing data differ from those of the population.
8. The method of determining a risk pre-warning model for potentially inappropriate prescription for cardiovascular disease of claim 7, wherein the hypothesis testing comprises analysis of variance and the rank-sum test: data that do not satisfy a normal distribution use the rank-sum test; data that satisfy a normal distribution are judged for homogeneity of variance, with analysis of variance selected when the variances are homogeneous and the rank-sum test otherwise; the significance level is set to 0.05, the hypothesis tests are performed using stats in python3.8, and the model is established using sklearn in python3.8.
9. The method of determining a risk pre-warning model for potentially inappropriate prescription for cardiovascular disease as claimed in claim 2, wherein the data sampling modes comprise: non-sampling, random upsampling, random downsampling, SMOTE upsampling, and Borderline SMOTE upsampling; the data screening modes comprise: non-screening, Lasso screening, and Boruta screening; and the machine learning models comprise: AdaBoost, Bagging, Bernoulli_Naive_Bayes, Decision_Tree, Extra_Tree, Gaussian_Naive_Bayes, Gradient_Boosting, KNN, LDA, Logistic_Regression, Multinomial_Naive_Bayes, Passive_Aggressive, QDA, Random_Forest, SGD, SVM, XGBoost.
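The N = x × y combination of claim 2 follows directly from these lists: every pairing of one sampling mode with one screening mode yields one derived data set. A minimal sketch (the string labels are illustrative, not identifiers from the patent):

```python
from itertools import product

# the x = 5 sampling modes and y = 3 screening modes named in the claim
samplers = ["none", "random_up", "random_down", "SMOTE", "Borderline_SMOTE"]
selectors = ["none", "Lasso", "Boruta"]

# every (sampling, screening) pair yields one derived data set: N = x * y = 15
derived_sets = list(product(samplers, selectors))
```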
10. The method of determining a risk pre-warning model for potentially inappropriate prescription of cardiovascular disease as claimed in claim 9, wherein the data sampling modes are defined as follows:
non-sampling: the original data are input into the model without any sampling;
random upsampling: data of the minority label class are randomly duplicated until the numbers of positive and negative data are equal;
random downsampling: data of the majority label class are randomly deleted until the numbers of positive and negative data are equal;
SMOTE upsampling:
the synthetic minority oversampling technique analyses the minority-class data, synthesizes new minority-class data, and adds them to the data set; the specific algorithm flow is as follows:
for each datum x in the minority class, compute its distance to all data in the minority-class data set using the Euclidean distance as the metric, obtaining the k nearest neighbours of x; set a sampling ratio according to the class-imbalance ratio to determine a sampling multiplier N; randomly select several data from the k nearest neighbours of each minority-class datum x; and, for each randomly selected neighbour x_n, construct a new datum from the original datum: x_new = x + rand(0,1) × (x_n − x);
Borderline SMOTE upsampling:
the Borderline SMOTE algorithm synthesizes new data from the minority-class data on the class boundary, thereby improving the class distribution; Borderline SMOTE sampling divides the minority-class data into three classes, Safe, Danger and Noise, and oversamples only the Danger minority-class data.
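The SMOTE flow described in this claim can be sketched as a minimal standalone function. This is an illustrative sketch of the interpolation step only (the function name `smote` and the brute-force neighbour search are assumptions; production code would typically use `imblearn.over_sampling.SMOTE`):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: for each synthetic point, pick a minority sample x,
    one of its k nearest minority neighbours x_n, and interpolate
    x_new = x + rand(0,1) * (x_n - x)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours of each sample
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # a minority-class datum x
        j = nn[i, rng.integers(nn.shape[1])]   # a randomly chosen neighbour x_n
        gap = rng.random()                     # rand(0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)
```

Because each synthetic point lies on the segment between a minority sample and one of its neighbours, all synthetic data stay inside the minority class's bounding box.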
CN202210168629.1A 2022-02-21 2022-02-21 Method for determining risk pre-warning model of potentially inappropriate prescription for cardiovascular disease Pending CN114530248A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210168629.1A CN114530248A (en) 2022-02-21 2022-02-21 Method for determining risk pre-warning model of potentially inappropriate prescription for cardiovascular disease


Publications (1)

Publication Number Publication Date
CN114530248A true CN114530248A (en) 2022-05-24

Family

ID=81624793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210168629.1A Pending CN114530248A (en) 2022-02-21 2022-02-21 Method for determining risk pre-warning model of potentially inappropriate prescription for cardiovascular disease

Country Status (1)

Country Link
CN (1) CN114530248A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543910A (en) * 2023-07-05 2023-08-04 山东大学 Antibiotic auxiliary decision-making system for reducing clinical uncertainty of upper respiratory tract infection
CN116543910B (en) * 2023-07-05 2023-11-03 山东大学 Antibiotic auxiliary decision-making system for reducing clinical uncertainty of upper respiratory tract infection
CN117079743A (en) * 2023-10-18 2023-11-17 中日友好医院(中日友好临床医学研究所) Statin drug treatment effect prediction model and application


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination