CN114121296A - Data-driven clinical information rule extraction method, storage medium and device - Google Patents
Data-driven clinical information rule extraction method, storage medium and device
- Publication number
- CN114121296A CN202111500068.2A
- Authority
- CN
- China
- Prior art keywords
- rule
- data
- rule set
- optimal
- clinical information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Abstract
The invention provides a data-driven clinical information rule extraction method, a storage medium and a device. The data-driven clinical information rule extraction method comprises the following steps: acquiring patient sample data, wherein the patient sample data comprises various clinical characteristics of a patient; generating an initial rule set according to the patient sample data; screening the initial rule set based on the time sequence characteristics in the initial rule set to obtain a universal rule set; and determining an optimal rule set through the accuracy and the interpretability of each rule in the universal rule set. The invention can mine a series of rules with high confidence and high accuracy from clinical information while ensuring accuracy, thereby effectively obtaining a clear conclusion path and assisting doctors in making decisions to a certain extent.
Description
Technical Field
The invention belongs to the technical field of data mining, relates to a rule extraction method, and particularly relates to a data-driven clinical information rule extraction method, a storage medium and equipment.
Background
Currently, with the development of intelligent medical technology, medical rules play an important role in processes such as disease risk prediction and clinical diagnosis. Mining high-confidence rules from data such as clinical diagnosis information and demographic information can assist doctors' decisions to a certain extent.
Most existing disease risk and clinical diagnosis rules come from medical scales and machine learning prediction models. (1) A medical scale quantifies a patient's clinical information, demographic information, daily habits and the like, assigns different scores to different characteristics, and finally measures the degree of illness, the risk of illness and so on in the form of a score. However, most existing medical scales were developed abroad and often ignore factors such as race, daily habits and individual differences, which affects the accuracy of scale evaluation to some extent. (2) Machine learning models can improve prediction and diagnostic accuracy to some extent. However, most existing machine learning models do not directly provide interpretable decision rules.
Therefore, how to provide a data-driven clinical information rule extraction method, a storage medium and a device to solve the defects that the prior art cannot provide a rule extraction scheme with high accuracy and interpretability, and the like, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention is directed to a data-driven clinical information rule extraction method, a storage medium and a device, which are used to solve the problem that the prior art cannot provide a rule extraction scheme with high accuracy and interpretability.
To achieve the above and other related objects, an aspect of the present invention provides a data-driven clinical information rule extraction method, including: acquiring patient sample data, wherein the patient sample data comprises various clinical characteristics of a patient; generating an initial rule set according to the patient sample data; screening the initial rule set based on the time sequence characteristics in the initial rule set to obtain a universal rule set; and determining an optimal rule set through the accuracy and the interpretability of each rule in the universal rule set.
In an embodiment of the present invention, the patient sample data is table data without missing values, wherein each row of the table data represents a patient sample, and each column represents a feature of the patient.
In an embodiment of the present invention, the step of generating an initial rule set according to the patient sample data includes: pre-processing the patient sample data; aiming at the preprocessed patient sample data, utilizing a tree model to perform rule extraction on each node in each generated tree; and generating the initial rule set according to the rule extraction result.
In an embodiment of the present invention, the step of screening the initial rule set based on the time sequence characteristics in the initial rule set to obtain a universal rule set includes: acquiring the time frequency with which a rule occurs at each node by using a time sequence statistical method; and screening out the rules whose time frequency meets the preset requirement of the user as the universal rule set.
In an embodiment of the invention, the step of determining the optimal rule set according to the accuracy and interpretability of each rule in the universal rule set comprises: aiming at each rule in the universal rule set, determining an optimal solution through a multi-objective optimization algorithm; and determining the combination of all the optimal solutions as the optimal rule set.
In an embodiment of the present invention, the step of determining the optimal solution through the multi-objective optimization algorithm includes: taking the accuracy and the interpretability of each rule as two optimization targets; randomly initializing a particle swarm for the optimization targets; determining the fitness of each particle in the particle swarm; updating the speed and the position of each particle according to the fitness; judging whether the maximum number of iterations is reached or the global optimal position meets the minimum bound; and if so, determining the Pareto optimal solution.
In an embodiment of the invention, after the step of determining the optimal rule set according to the accuracy and interpretability of each rule in the universal rule set, the data-driven clinical information rule extraction method further includes: acquiring prediction data of a user needing to make a clinical decision; all the acquired prediction data form a prediction data set; and comparing the predicted data with the rules in the optimal rule set one by one, and obtaining the rules which are met by the predicted data set according to the matching result of the predicted data and the optimal rule set.
In an embodiment of the present invention, the optimal rule set includes a first rule, a second rule and a third rule; the step of comparing the prediction data with the rules in the optimal rule set one by one, and obtaining the rules which the prediction data set accords with according to the matching result of the prediction data and the optimal rule set, comprises: and determining the user illness probability corresponding to the prediction data set in response to the prediction data simultaneously meeting the first rule, the second rule and the third rule, wherein the user illness probability is used for providing auxiliary judgment information for a doctor in the process of disease diagnosis of the doctor.
To achieve the above and other related objects, another aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the data-driven clinical information rule extraction method.
To achieve the above and other related objects, a final aspect of the present invention provides an electronic device, comprising: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored by the memory so as to enable the electronic equipment to execute the data-driven clinical information rule extraction method.
As described above, the data-driven clinical information rule extraction method, the storage medium, and the device according to the present invention have the following advantages:
According to the method, an initial rule set is generated from patient sample data, universal rules are then screened according to time sequence characteristics, and an optimal rule set is determined by using the accuracy and the interpretability of each rule. Therefore, the problems of low prediction accuracy of medical scales and poor interpretability of traditional machine learning models are well solved, and the data-driven rule extraction scheme can mine a series of rules with high confidence and high accuracy from clinical information while ensuring accuracy. The method can effectively obtain a clear conclusion path and assist a doctor in making a decision to a certain extent.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a data-driven clinical information rule extraction method according to an embodiment of the present invention.
FIG. 2 is a flow chart of the optimal rule set determination in an embodiment of the data-driven clinical information rule extraction method according to the present invention.
FIG. 3 is a flowchart illustrating the calculation of an optimal solution for the data-driven-based clinical information rule extraction method according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating predictive data matching in an embodiment of a data-driven clinical information rule extraction method according to the present invention.
Fig. 5 is a schematic structural connection diagram of an electronic device according to an embodiment of the invention.
Description of the element reference numerals
5 electronic device
51 processor
52 memory
S11 to S16 steps
S141 to S142 steps
S141A to S141F steps
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The data-driven clinical information rule extraction method, the storage medium and the equipment can mine a series of rules with high confidence coefficient and accuracy from clinical information on the premise of ensuring the accuracy, so that a clear conclusion path can be effectively obtained, and a doctor is assisted in making a decision to a certain extent.
The principle and implementation of a data-driven clinical information rule extraction method, a storage medium and a device according to the present embodiment will be described in detail below with reference to fig. 1 to 5, so that those skilled in the art can understand the data-driven clinical information rule extraction method, the storage medium and the device according to the present embodiment without creative work.
Referring to fig. 1, a schematic flow chart of a data-driven clinical information rule extraction method according to an embodiment of the invention is shown. As shown in fig. 1, the data-driven clinical information rule extraction method specifically includes the following steps:
S11, obtaining patient sample data including various clinical characteristics of the patient.
In an embodiment of the present invention, the patient sample data is table data without missing values, wherein each row of the table data represents a patient sample, and each column represents a feature of the patient.
In practical applications, taking pulmonary artery embolism as an example, laboratory examination data of a batch of patients with outcome variables is taken out by a hospital-related department as patient sample data.
S12, generating an initial rule set according to the patient sample data.
In one embodiment, S12 specifically includes the following steps:
(1) pre-processing the patient sample data.
Specifically, the preprocessing includes existing preprocessing means such as data cleaning, data merging, data transformation, and data normalization, so as to improve the availability of patient sample data.
(2) For the preprocessed patient sample data, performing rule extraction on each node in each generated tree by using a tree model.
Specifically, the tree model may be any robust model such as a decision tree, a random forest, GBDT (Gradient Boosting Decision Tree) or XGBoost.
In practical application, a random forest algorithm is used to extract a rule from each node in each generated tree. The random forest is a stable ensemble learning model that adopts the idea of bagging: a plurality of training sets are generated by the bootstrap method, a decision tree is constructed for each training set, and finally the classification results of the decision-tree base classifiers are combined to obtain a relatively better prediction model.
Specifically, given a dataset D with feature vectors X and corresponding labels y, let D = {(X_i, y_i)}, i = 1, 2, …, n, where X_i ∈ X, X_i = (x_i1, x_i2, …, x_im), m is the number of features, and y_i ∈ y = {0, 1, …}. Gini(D) is defined as the measure of the purity of D and can be expressed as follows:
$$\mathrm{Gini}(D) = \sum_{k=1}^{K} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{K} p_k^2 \qquad (1)$$
In Formula 1, p_k (k = 1, 2, …, K) represents the proportion of samples of the k-th class in the current dataset, and k' represents the categories other than the k-th category. The smaller Gini(D), the higher the purity of dataset D. Assuming that a feature m has V possible values {m_1, m_2, …, m_V}, dividing the dataset D by the feature m generates V different branch nodes, where the v-th branch is denoted D^v. Gini_index(D, m) is defined to represent the uncertainty of feature m in D and can be expressed as:
$$\mathrm{Gini}_{\mathrm{index}}(D, m) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v) \qquad (2)$$
For the training set D, the learning algorithm for constructing the decision tree can be represented as a mapping from X to y. The dataset D is recursively divided into a plurality of subsets, each time using the feature whose split yields the lowest Gini index, to form a tree. The selected feature m* is represented as:
$$m^{*} = \arg\min_{m} \mathrm{Gini}_{\mathrm{index}}(D, m) \qquad (3)$$
Then, the classification result is obtained by integrating the weighted outputs of all decision trees:
[Formula 4]
In Formula 4, ω_h represents the weight of the h-th tree. A sample can be classified according to the following formula:
[Formula 5]
In Formula 5, S represents the number of trees.
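The Gini-based split selection described above can be illustrated with a short Python sketch. It assumes categorical features and the standard definitions of Gini impurity and weighted Gini index; the function names (`gini`, `gini_index`, `best_feature`) and the toy data are hypothetical and are not taken from the patent.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label list: 1 - sum_k p_k^2 (assumed standard form)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_index(rows, labels, feature):
    """Weighted Gini impurity of the branches produced by splitting on `feature`."""
    n = len(labels)
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[feature], []).append(label)
    return sum(len(ys) / n * gini(ys) for ys in branches.values())

def best_feature(rows, labels, features):
    """Pick the feature whose split gives the lowest Gini index."""
    return min(features, key=lambda m: gini_index(rows, labels, m))

# Illustrative toy data (hypothetical feature names and values).
rows = [
    {"smoker": 1, "sex": 0},
    {"smoker": 1, "sex": 1},
    {"smoker": 0, "sex": 0},
    {"smoker": 0, "sex": 1},
]
labels = [1, 1, 0, 0]
chosen = best_feature(rows, labels, ["smoker", "sex"])
```

In this toy data, splitting on `smoker` separates the two classes perfectly, so it has the lower Gini index and is selected.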
(3) And generating the initial rule set according to the rule extraction result.
Specifically, the initial rule set is obtained as follows: the random forest algorithm traverses the path from the root node to each leaf node in each decision tree; the features of the nodes along each path give the conditions of a rule, and the category of the leaf node gives the conclusion of the rule.
In practical applications, the type of tree model output is determined by the individual tree output when performing disease prediction or medical diagnosis tasks. Since the tree model is a "white-box model" that provides a clear path for each conclusion, the rules for all nodes on each tree in the tree model are output as the initial rule set.
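As a minimal sketch of this root-to-leaf traversal, the following Python code represents each tree as nested dictionaries and collects one rule per path. The tree layout, the feature names and the thresholds are hypothetical illustrations; a real implementation would read the structure from the trained tree model.

```python
def extract_rules(node, path=()):
    """Return a list of (conditions, conclusion) pairs, one per root-to-leaf path."""
    if "label" in node:  # leaf node: its category is the conclusion of the rule
        return [(list(path), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    rules = extract_rules(node["left"], path + ((feature, "<=", threshold),))
    rules += extract_rules(node["right"], path + ((feature, ">", threshold),))
    return rules

# Hypothetical tree: internal nodes split on a feature, leaves carry a class label.
tree = {
    "feature": "varicose_vein_dx", "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": "age_days", "threshold": 26373.0,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}

forest = [tree]  # a random forest would contain many such trees
initial_rule_set = [rule for t in forest for rule in extract_rules(t)]
```

Each entry of `initial_rule_set` pairs the conditions along one path with the class of the leaf that ends it.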
S13, based on the time sequence characteristics in the initial rule set, screening the initial rule set to obtain a universal rule set. By screening on indexes such as the time frequency with which a rule occurs, rules that correspond to rare "black swan" events and lack universality can be effectively filtered out.
In one embodiment, S13 specifically includes the following steps:
(1) Acquiring the time frequency with which a rule occurs at each node by using a time sequence statistical method.
Specifically, the time sequence statistical method may be a time sequence statistical function or any other implementation that can realize time sequence statistics.
In practical application, the time series data in a rule are statistically analyzed with the Python pandas package, whose grouping and aggregation functions are applied to the samples at each node by time frequency, for example to count information with a time-frequency attribute such as the number of days, weeks, months or years in which samples appear at the node, or the start and end times of their appearance.
(2) And screening out the rule of which the time frequency meets the preset requirement of the user as the universal rule set.
Specifically, suppose for example that the user's preset requirement is 1 year. If the patient sample data corresponding to a rule all appear within 2 weeks, the rule extracted from those data is not universal; if the patient sample data appear over 2 years, the rule extracted from those data is universal.
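The screening step can be sketched as follows. This example uses only the Python standard library as a stand-in for the pandas grouping and aggregation mentioned above, and it assumes that "time frequency" is measured as the span in days between the first and last sample supporting a rule; the rule names, dates and the 365-day requirement are hypothetical.

```python
from datetime import date

def time_span_days(dates):
    """Number of days between the earliest and latest sample date."""
    return (max(dates) - min(dates)).days

def screen_rules(rule_dates, min_span_days):
    """Keep only rules whose supporting samples span at least `min_span_days`."""
    return [name for name, dates in rule_dates.items()
            if time_span_days(dates) >= min_span_days]

# Hypothetical visit dates of the samples that fall on each rule's node.
rule_dates = {
    "rule_a": [date(2021, 3, 1), date(2021, 3, 10), date(2021, 3, 14)],
    "rule_b": [date(2019, 1, 5), date(2020, 6, 1), date(2021, 1, 20)],
}

# Assumed user requirement: samples must span at least one year.
universal_rule_set = screen_rules(rule_dates, min_span_days=365)
```

Here `rule_a` is supported only by samples from a two-week window and is dropped, while `rule_b` spans about two years and is kept.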
S14, determining an optimal rule set according to the accuracy and the interpretability of each rule in the universal rule set.
Referring to fig. 2, a flow chart of determining an optimal rule set according to an embodiment of the data-driven clinical information rule extraction method of the present invention is shown. As shown in fig. 2, S14 specifically includes the following steps:
S141, for each rule in the universal rule set, determining an optimal solution through a multi-objective optimization algorithm, wherein the multi-objective optimization algorithm is used to balance the accuracy and the interpretability of the rules.
Specifically, the multi-objective optimization algorithm may be any algorithm capable of realizing optimization analysis of two or more objectives, such as a multi-objective particle swarm algorithm, a non-dominated sorting genetic algorithm, a multi-objective evolutionary algorithm, and the like.
Referring to fig. 3, a flowchart of an optimal solution calculation of the data-driven-based clinical information rule extraction method according to an embodiment of the invention is shown. As shown in fig. 3, S141 specifically includes the following steps:
S141A, with accuracy and interpretability of each rule as two optimization objectives.
To ensure the accuracy of the rule sets, the accuracy of each rule set, namely the proportion of samples that are correctly predicted, is calculated. Rule accuracy is defined as follows:
[Formula 6]
In Formula 6, Q_ACC represents the accuracy of the rule set, Q represents the number of samples, and x_i represents the i-th sample. The interpretability of a rule is defined as:
[Formula 7]
In Formula 7, Q_FEA, Q_COV and Q_CNT respectively represent the complexity of the rule, the coverage of the rule and the quality of the rule. α, β and γ are the weights of the three and can be set according to the actual situation. Specifically, Q_FEA measures the number of features in each rule: the fewer features a rule involves on average, the larger its Q_FEA value. Q_COV indicates the coverage of each rule: the more widely applicable a rule is, the larger its Q_COV. Q_CNT measures the quality of the rules. They are defined as:
[Formulas 8 to 10]
Formula 8 is based on the valid features in the i-th rule, and Formula 9 is based on the number of samples that match the i-th rule. In Formula 10, rule_selected represents the number of rules derived from the algorithm, and Z is the number of generated candidate rules. Q_FEA = 1 indicates that the rule contains only one feature, and Q_FEA = 0 indicates that the rule contains all the features. That is, the fewer features a rule contains, the easier it is for a physician to understand at the time of diagnosis.
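Because the formulas themselves are not reproduced in this text, the following Python sketch only illustrates one possible reading of the two objectives. Every definition in it is an assumption: accuracy is the share of samples whose label equals the rule's prediction (unmatched samples are predicted negative), Q_FEA is scaled so that one feature gives 1 and all features give 0, Q_COV is the share of samples a rule matches, Q_CNT is taken as 1 minus the ratio of selected to candidate rules, and interpretability is the weighted sum with α, β, γ. All names and values are hypothetical.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

def matches(rule, sample):
    """True if the sample satisfies every condition of the rule."""
    return all(OPS[op](sample[f], t) for f, op, t in rule["conditions"])

def accuracy(rule, samples, labels):
    """Assumed Q_ACC: share of samples whose label equals the rule's prediction."""
    preds = [rule["label"] if matches(rule, s) else 0 for s in samples]
    return sum(p == y for p, y in zip(preds, labels)) / len(samples)

def interpretability(rule, samples, n_features, n_selected, n_candidates,
                     alpha, beta, gamma):
    """Assumed weighted sum alpha*Q_FEA + beta*Q_COV + gamma*Q_CNT."""
    used = len({f for f, _, _ in rule["conditions"]})
    q_fea = (n_features - used) / (n_features - 1)   # 1: one feature, 0: all features
    q_cov = sum(matches(rule, s) for s in samples) / len(samples)
    q_cnt = 1 - n_selected / n_candidates            # assumption, not from the source
    return alpha * q_fea + beta * q_cov + gamma * q_cnt

samples = [
    {"vv": 1, "age": 20000, "sex": 1},
    {"vv": 1, "age": 30000, "sex": 1},
    {"vv": 0, "age": 20000, "sex": 2},
    {"vv": 1, "age": 25000, "sex": 2},
]
labels = [1, 0, 0, 1]
rule = {"conditions": [("vv", ">", 0.5), ("age", "<=", 26373.0)], "label": 1}

q_acc = accuracy(rule, samples, labels)
q_int = interpretability(rule, samples, n_features=3, n_selected=3,
                         n_candidates=12, alpha=0.5, beta=0.25, gamma=0.25)
```

The pair `(q_acc, q_int)` is the kind of two-objective score that a multi-objective optimizer would then trade off.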
S141B, randomly initializing a particle swarm according to the optimization target.
In the invention, each solution of the optimization problem is treated as a "particle". All particles search in an N-dimensional space, and each particle has only two attributes, position and speed: the speed represents how fast the particle moves, and the position represents the direction of movement. The current position of a particle is a candidate solution of the corresponding optimization problem, and the flight of the particle is the search process of that individual.
S141C, determining the fitness of each particle in the particle swarm.
Specifically, a fitness function capable of determining an individual optimal solution of each particle is defined, and a global optimal value is found from the individual optimal solutions.
S141D, updating the speed and the position of the particles according to the fitness.
Specifically, the flight speed of each particle is dynamically adjusted according to the historical optimal position of the particle and the historical optimal position of the population, and the speed and the position of the particle are then updated according to the fitness.
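For illustration, the widely used textbook form of the particle swarm update can be sketched as below. The source does not state the exact update equations, so the inertia weight `w`, the acceleration coefficients `c1` and `c2`, and the function name `update_particle` are assumptions; the sketch also shows only the single-particle update, not the full multi-objective procedure.

```python
import random

def update_particle(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """One velocity and position update for a single particle.

    x, v   : current position and velocity (lists of floats)
    pbest  : best position found so far by this particle
    gbest  : best position found so far by the whole swarm
    """
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        # Inertia term plus attraction toward the personal and global bests.
        vi = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        new_v.append(vi)
        new_x.append(xi + vi)
    return new_x, new_v

# A particle already sitting on both best positions keeps only its inertia.
position, velocity = update_particle(
    x=[1.0, 2.0], v=[1.0, -1.0], pbest=[1.0, 2.0], gbest=[1.0, 2.0], w=0.5
)
```

The random factors `r1` and `r2` make the search stochastic; a seeded `random.Random` instance can be passed as `rng` for reproducible runs.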
S141E, judging whether the maximum number of iterations is reached or the global optimal position meets the minimum bound.
The optimal solution found by each particle on its own is called an individual extremum, and the best individual extremum in the particle swarm is taken as the current global optimal solution. The speed and the position are updated through continuous iteration until an optimal solution meeting the termination condition is obtained. If the maximum number of iterations is not reached and the global optimal position does not meet the minimum bound, the process returns to step S141C.
S141F, if yes, determining the Pareto optimal solution.
For the particles obtained when the maximum number of iterations is reached or the global optimal position meets the minimum bound, the Pareto optimal solution in the final population is determined by using a fast non-dominated sorting method.
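The idea of selecting Pareto optimal solutions can be sketched with a simple non-dominated filter over (accuracy, interpretability) pairs. This is a naive quadratic-time filter that returns only the first front; it is a simplified stand-in for fast non-dominated sorting, and the function names and scores are hypothetical.

```python
def dominates(a, b):
    """a dominates b if it is at least as good in every objective and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that no other point dominates (both objectives maximized)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (accuracy, interpretability) scores of candidate rules.
scores = [(0.9, 0.3), (0.8, 0.6), (0.7, 0.5), (0.6, 0.9), (0.9, 0.2)]
front = pareto_front(scores)
```

Candidates on the front represent different trade-offs: none of them can be improved in accuracy without losing interpretability, or the reverse.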
S142, determining the combination of all the optimal solutions as the optimal rule set.
Specifically, for pulmonary artery embolism, the optimal rule set is: "1 month_varicose vein of lower limb_diagnosis_any > 0.5, 10000 days_sex_visit_count ≤ 1.5, 10000 days_age_visit_last ≤ 26373.0".
When "1 month_varicose vein of lower limb_diagnosis_any > 0.5, 10000 days_sex_visit_count ≤ 1.5, 10000 days_age_visit_last ≤ 26373.0" are all satisfied, the probability that the patient suffers from VTE is determined to be 90% or more.
Referring to fig. 4, a flow chart of prediction data matching in an embodiment of the data-driven clinical information rule extraction method of the invention is shown. As shown in fig. 4, after step S14, the data-driven clinical information rule extraction method further includes the following steps:
S15, acquiring the prediction data of a user who needs a clinical decision; all acquired prediction data constitute a prediction data set.
S16, comparing the prediction data with the rules in the optimal rule set one by one, and obtaining the rules that the prediction data set satisfies according to the matching result of the prediction data and the optimal rule set.
In one embodiment, the optimal rule set includes a first rule, a second rule, and a third rule.
In response to the prediction data simultaneously meeting the first rule, the second rule and the third rule, the user illness probability corresponding to the prediction data set is determined, wherein the user illness probability provides auxiliary judgment information for a doctor during disease diagnosis.
Specifically, for pulmonary artery embolism, the optimal rule set is: "1 month _ varicose vein of lower limb _ diagnosis _ any >0.5, 10000 days _ sex _ visit _ count ═ 1.5, and 10000 days _ age _ visit _ last ≦ 26373.0". The first rule is 1 month _ varicose vein _ diagnose _ any >0.5 of lower limb, the second rule is 10000 days _ sex _ visit _ count < (1.5), and the third rule is 10000 days _ age _ visit _ last < (26373.0). When the corresponding prediction data of a certain patient simultaneously satisfy three rules, the analyzed probability that the patient has pulmonary artery embolism is more than 90%, and after the doctor knows the information that the probability that the patient has pulmonary artery embolism is more than 90%, the doctor can diagnose the patient according to the information.
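The one-by-one comparison of steps S15 and S16 can be sketched as follows. This is only an illustration under stated assumptions: the `Rule` type and the helper functions are hypothetical, the comparison operators are read as `>` and `<=`, the feature names are written without the extraction spacing, and the patient values are invented.

```python
import operator
from typing import NamedTuple

# Comparison operators a rule may use (assumed set).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


class Rule(NamedTuple):
    feature: str
    op: str
    threshold: float

    def matches(self, record):
        """A rule matches when the feature is present and the comparison holds."""
        value = record.get(self.feature)
        return value is not None and OPS[self.op](value, self.threshold)


# The three rules quoted above for pulmonary artery embolism.
OPTIMAL_RULE_SET = [
    Rule("1 month_varicose vein of lower limb_diagnosis_any", ">", 0.5),
    Rule("10000 days_sex_visit_count", "<=", 1.5),
    Rule("10000 days_age_visit_last", "<=", 26373.0),
]


def matched_rules(record, rules):
    """Compare the prediction data with each rule and keep the ones it satisfies."""
    return [r for r in rules if r.matches(record)]


def satisfies_all(record, rules):
    """True only if every rule in the set is satisfied simultaneously."""
    return len(matched_rules(record, rules)) == len(rules)


# Hypothetical prediction data for one patient.
patient = {
    "1 month_varicose vein of lower limb_diagnosis_any": 1.0,
    "10000 days_sex_visit_count": 1.0,
    "10000 days_age_visit_last": 20000.0,
}
flag_high_risk = satisfies_all(patient, OPTIMAL_RULE_SET)
```

A missing feature is treated as a non-match here, which is a design choice of the sketch; the source states that the patient sample data contain no missing values but does not say how absent prediction features are handled.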
The effect of the invention is compared with that of existing machine learning models as follows. Taking a risk ratio regression model as an example, an existing machine learning model evaluates the influence of several factors on the disease risk or diagnosis result simultaneously, and obtains a function usable for prediction and diagnosis by weighting the factors and applying a nonlinear mapping. Taking the prediction of the probability that chronic kidney disease progresses to renal failure within five years as an example, the following risk ratio regression model can be obtained:
This function can give a fairly accurate prediction, but a result obtained by weighting or nonlinearly transforming factors such as GFR (glomerular filtration rate), ACR (albumin-to-creatinine ratio) and age is not interpretable. By contrast, the invention uses a multi-objective optimization algorithm to mine a series of rules with high confidence and accuracy from clinical information while maintaining accuracy.
The protection scope of the data-driven clinical information rule extraction method according to the present invention is not limited to the execution sequence of the steps listed in this embodiment; all schemes that add, remove or replace steps using the prior art in accordance with the principles of the present invention are included in the protection scope of the present invention.
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data-driven clinical information rule extraction method.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned computer-readable storage media comprise: various computer storage media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Please refer to fig. 5, which is a schematic structural connection diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the present embodiment provides an electronic device 5, which specifically includes: a processor 51 and a memory 52; the memory 52 is used for storing computer programs, and the processor 51 is used for executing the computer programs stored in the memory 52 to make the electronic device 5 execute the steps of the data-driven clinical information rule extraction method.
The processor 51 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The Memory 52 may include a Random Access Memory (RAM), and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
In practice, the electronic device may be a computer including all or some of the following components: memory, memory controller, one or more processing units (CPUs), peripheral interfaces, RF circuits, audio circuits, speakers, microphones, input/output (I/O) subsystems, display screens, other output or control devices, and external ports. The computer includes, but is not limited to, personal computers such as desktop computers, notebook computers, tablet computers, smart phones, Personal Digital Assistants (PDAs), and the like. In other embodiments, the electronic device may also be a server, where the server may be deployed on one or more physical servers according to factors such as function and load, or may be a cloud server formed by a distributed or centralized server cluster, which is not limited in this embodiment.
In summary, the data-driven clinical information rule extraction method, storage medium and device of the present invention generate an initial rule set from patient sample data, screen it for universal rules according to time-series characteristics, and determine an optimal rule set using the accuracy and interpretability of each rule. This addresses the low prediction accuracy of medical scales and the poor interpretability of traditional machine learning models: the data-driven rule extraction scheme can mine a series of rules with high confidence and high accuracy from clinical information while maintaining accuracy. The method yields a clear conclusion path and assists a doctor in making decisions to a certain extent. The invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.
Claims (10)
1. A data-driven clinical information rule extraction method is characterized by comprising the following steps:
acquiring patient sample data, wherein the patient sample data comprises various clinical characteristics of a patient;
generating an initial rule set according to the patient sample data;
screening the initial rule set based on the time sequence characteristics in the initial rule set to obtain a universal rule set;
and determining an optimal rule set through the accuracy and the interpretability of each rule in the universal rule set.
2. The data-driven-based clinical information rule extraction method according to claim 1, wherein:
the patient sample data is tabular data without missing values, wherein each row of the tabular data represents a patient sample, and each column represents a characteristic of the patient.
3. The data-driven clinical information rule extraction-based method of claim 1, wherein the step of generating an initial rule set from the patient sample data comprises:
pre-processing the patient sample data;
aiming at the preprocessed patient sample data, utilizing a tree model to perform rule extraction on each node in each generated tree;
and generating the initial rule set according to the rule extraction result.
4. The method according to claim 3, wherein the step of screening the initial rule set based on the time-series characteristics of the initial rule set to obtain a universal rule set comprises:
acquiring the time frequency with which the rule on each node occurs by using a time sequence statistical method;
and screening out the rule of which the time frequency meets the preset requirement of the user as the universal rule set.
5. The method of claim 1, wherein the step of determining the optimal rule set according to the accuracy and interpretability of each rule in the universal rule set comprises:
aiming at each rule in the universal rule set, determining an optimal solution through a multi-objective optimization algorithm;
and determining the combination of all the optimal solutions as the optimal rule set.
6. The data-driven-based clinical information rule extraction method of claim 5, wherein the step of determining an optimal solution through a multi-objective optimization algorithm comprises:
the accuracy and the interpretability of each rule are taken as two optimization targets;
randomly initializing a particle swarm for the optimization target;
determining a fitness of each particle in the population of particles;
updating the speed and the position of the particle according to the fitness;
judging whether the maximum number of iterations has been reached or the global optimal position meets the minimum authority;
and if so, determining the pareto optimal solution.
7. The data-driven-based clinical information rule extraction method of claim 1, wherein after the step of determining the optimal rule set by the accuracy and interpretability of each rule in the universal rule set, the data-driven-based clinical information rule extraction method further comprises:
acquiring prediction data of a user needing to make a clinical decision; all the acquired prediction data form a prediction data set;
and comparing the predicted data with the rules in the optimal rule set one by one, and obtaining the rules which are met by the predicted data set according to the matching result of the predicted data and the optimal rule set.
8. The data-driven clinical information rule extraction-based method according to claim 7, wherein the optimal rule set includes a first rule, a second rule, and a third rule; the step of comparing the prediction data with the rules in the optimal rule set one by one, and obtaining the rules which the prediction data set accords with according to the matching result of the prediction data and the optimal rule set, comprises:
and determining the user illness probability corresponding to the prediction data set in response to the prediction data simultaneously meeting the first rule, the second rule and the third rule, wherein the user illness probability is used for providing auxiliary judgment information for a doctor in the process of disease diagnosis of the doctor.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the data-driven clinical information rule extraction-based method according to any one of claims 1 to 8.
10. An electronic device, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the electronic device to perform the data-driven clinical information rule extraction method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111500068.2A CN114121296B (en) | 2021-12-09 | 2021-12-09 | Data-driven clinical information rule extraction method, storage medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114121296A (en) | 2022-03-01 |
CN114121296B (en) | 2024-02-02 |
Family
ID=80364078
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111500068.2A Active CN114121296B (en) | 2021-12-09 | 2021-12-09 | Data-driven clinical information rule extraction method, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114121296B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117059214A (en) * | 2023-07-21 | 2023-11-14 | 南京智慧云网络科技有限公司 | Clinical scientific research data integration and intelligent analysis system and method based on artificial intelligence |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103326353A (en) * | 2013-05-21 | 2013-09-25 | 武汉大学 | Environmental economic power generation dispatching calculation method based on improved multi-objective particle swarm optimization algorithm |
CN111489827A (en) * | 2020-04-10 | 2020-08-04 | 吉林大学 | Thyroid disease prediction modeling method based on associative decision tree |
US20200357514A1 (en) * | 2019-05-07 | 2020-11-12 | International Business Machines Corporation | Clinical decision support |
CN112071420A (en) * | 2020-08-12 | 2020-12-11 | 福建中榕数据科技有限公司 | Clinical aid decision making method, system, equipment and medium based on real-time data |
AU2020103709A4 (en) * | 2020-11-26 | 2021-02-11 | Daqing Oilfield Design Institute Co., Ltd | A modified particle swarm intelligent optimization method for solving high-dimensional optimization problems of large oil and gas production systems |
Also Published As
Publication number | Publication date |
---|---|
CN114121296B (en) | 2024-02-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||