CN112820403A - Deep learning method for predicting prognosis risk of cancer patient based on multiple groups of mathematical data - Google Patents
Deep learning method for predicting prognosis risk of cancer patient based on multiple groups of mathematical data
- Publication number
- CN112820403A (application CN202110210941.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/80—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data, comprising the following steps: S1: acquiring clinical data $Y$ of a target cancer patient and the corresponding multi-omics expression data $X$ from an existing public data set; S2: constructing a deep neural network; S3: passing the cancer multi-omics data $X_p$ and patient clinical information $Y_p$ of the existing public data set through the constructed deep neural network and updating the weights $\theta$ to obtain a pre-trained network $N_p$ based on the public data set; S4: training the network $N_p$ again on the target cancer patient's data until the number of training epochs reaches the upper limit, thereby obtaining a risk prediction network $N_f$; S5: using the XGBoost algorithm to select the top $n$ gene features of the target cancer patient by importance coefficient, improving the risk prediction network $N_f$, and obtaining a final risk prediction model. The invention improves the robustness of the prediction model and predicts the prognosis risk of cancer patients more accurately by using multi-omics data.
Description
Technical Field
The invention relates to the technical field of survival analysis of cancer patients, and in particular to a deep learning method for predicting the prognosis risk of cancer patients based on multi-omics data.
Background
The high incidence of cancer has prompted the development of medical assistance techniques in recent years, and prognostic risk analysis is a key medical assistance technique that can assist in the selection of different treatment regimens based on the potential risk of prognosis for different patients.
Most methods for predicting cancer prognosis analyze expression data from a single omics layer, such as mRNA expression data, methylation data, or miRNA data. However, a patient's prognosis is jointly regulated by multiple molecules at different levels, and strong complementary effects and interactions exist among these molecules, so analysis of single-omics data can only provide one-sided information. In addition, fusing the analysis of different omics and modalities can mitigate the excessive noise sensitivity of single-omics methods through error cancellation. Fusing multiple data types for cancer analysis has therefore become a powerful tool in recent years.
The biggest difficulty in fusing multi-omics data is how to optimize the dimensionality reduction of high-dimensional omics data using small-sample cancer data. In 2018, Li Xin et al. (Li Xin, Weigong, Lu Zhang Yan, et al., "Building a lung adenocarcinoma prognosis-related risk prediction model based on multi-omics data" [J], Journal of Nanjing Medical University (Natural Sciences), 2018, 38(12): 1820-1825) used a traditional L1-regularized Cox method to build a lung adenocarcinoma prognosis-related risk prediction model that integrates clinical, genomic, and transcriptomic information. However, the method is not robust enough, performs poorly on high-dimensional small-sample cancer data, and has low prediction accuracy. Researchers then applied deep learning to this field: an autoencoder was used to extract high-dimensional multi-omics features of liver cancer (including mRNA, miRNA, and methylation data), and the compressed features were then used to identify different clinical subtypes of patients. On this basis, researchers fused copy number variation data to distinguish two prognostic subtypes of high-risk neuroblastoma. Several variants based on other autoencoder methods have also been derived. However, the biggest problem with this framework is that it splits feature reduction and patient risk prediction into two separate models, so the method is not robust enough. In 2019, researchers combined the loss function of a proportional hazards model with a deep neural network to predict patient survival risk directly from multi-omics data.
The problem with this method is that the deep neural network directly optimizes the risk prediction loss function, while the features reconstructed after multi-layer compression in the network still retain the spatial distribution of the initial features, which limits the performance of the method.
Disclosure of Invention
The invention provides a deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data, aiming to overcome the low accuracy of prognosis risk prediction and the small size of the target data set in the prior art.
The primary objective of the present invention is to solve the above technical problems, and the technical solution of the present invention is as follows:
A deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data, comprising the following steps:
S1: acquiring clinical data $Y$ of a target cancer patient and the corresponding multi-omics expression data $X$ from an existing public data set;
S2: constructing a deep neural network;
S3: passing the cancer multi-omics data $X_p$ and patient clinical information $Y_p$ of the existing public data set through the constructed deep neural network and updating the weights $\theta$ to obtain a pre-trained network $N_p$ based on the public data set;
S4: training the network $N_p$ again on the clinical data $Y$ of the target cancer patient and the corresponding multi-omics expression data $X$ until the number of training epochs reaches the upper limit, thereby obtaining a risk prediction network $N_f$;
S5: using the XGBoost algorithm to select the top $n$ gene features of the target cancer patient by importance coefficient, improving the risk prediction network $N_f$, and obtaining a final risk prediction model.
Further, the specific process of constructing the deep neural network in step S2 is as follows:
S201: encoding the multi-omics expression data $X$ to generate the compressed feature $z = E(X)$, decoding the compressed feature to generate the new feature $X'$, and calculating the data recovery loss $l_r$ after decoding;
S202: defining a survival risk function representing the probability that the cancer patient survives beyond a given time $t$;
S203: constructing a proportional risk function by using the survival risk function;
S204: constructing a maximum likelihood function by using the proportional risk function, and obtaining a preliminary prognosis risk prediction loss function $l_p$ from the maximum likelihood function;
S205: adding the data recovery loss $l_r$ to the preliminary prognosis risk prediction loss function to construct the final loss function.
Further, the data recovery loss function is expressed as:
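The equation that followed this sentence did not survive in this text. A common choice for an autoencoder's data recovery loss is the mean squared error between the input and its reconstruction. The form below is that assumption, not the patent's confirmed expression; the sample count $N$ and the decoder symbol $D$ are hypothetical names.

```latex
l_r = \frac{1}{N} \sum_{i=1}^{N} \left\lVert X_i - X'_i \right\rVert_2^2,
\qquad X'_i = D\big(E(X_i)\big)
```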
Further, the survival risk function is expressed as: $S(t) = \Pr(T > t)$,
where $T$ is the survival time recorded for the patient;
the survival risk function at time $t$ is:
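The expression itself is missing from this text. In standard survival analysis, the quantity defined at an instant $t$ alongside $S(t)$ is the hazard function, that is, the instantaneous death rate among patients who have survived up to $t$. Assuming that is what is meant here, and with $\delta$ as a hypothetical symbol for a small time interval, it would read:

```latex
\lambda(t) = \lim_{\delta \to 0}
  \frac{\Pr\left(t \le T < t + \delta \mid T \ge t\right)}{\delta}
```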
Further, the proportional risk function is:
$\lambda(t \mid x) = \lambda_0(t) \exp(h(x))$, where $h(x_i) = \beta x_i$ and $\lambda_0(t)$ represents the baseline risk function at time $t$.
Further, the maximum likelihood function may be expressed as:
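The likelihood expression did not survive extraction. For a proportional risk function of the form $\lambda(t \mid x) = \lambda_0(t)\exp(h(x))$, the usual likelihood is the Cox partial likelihood, in which the baseline term $\lambda_0(t)$ cancels. The sketch below assumes that form and ignores tied survival times; $R(T_i)$ (the set of patients still at risk at time $T_i$) is notation introduced here, and $st_i$ is the death status of patient $i$.

```latex
L(\beta) = \prod_{i:\, st_i = 1}
  \frac{\exp\big(h(x_i)\big)}{\sum_{j \in R(T_i)} \exp\big(h(x_j)\big)},
\qquad R(T_i) = \{\, j : T_j \ge T_i \,\}
```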
Further, the preliminary prognostic risk prediction loss function can then be expressed as:
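The loss expression is also missing from this text. A loss derived from a maximum likelihood function is normally its negative logarithm, so a plausible form, assuming the Cox partial likelihood over patients with an observed death ($st_i = 1$) and a risk set $R(T_i) = \{ j : T_j \ge T_i \}$ (notation introduced here), is the following. Some implementations additionally divide by the number of observed deaths; whether the patent does so cannot be recovered.

```latex
l_p = -\sum_{i:\, st_i = 1}
  \left[ h(x_i) - \log \sum_{j \in R(T_i)} \exp\big(h(x_j)\big) \right]
```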
Further, the final loss function is expressed as: $l_{TRDN} = (1-\gamma)\, l_r + \gamma\, l_p$, where $0 < \gamma < 1$.
Further, the final risk prediction model in step S5 is expressed as:
where $X_m$ denotes the mRNA features used to construct the model, $Y_m$ is the patient risk predicted by the risk prediction network $N_f$, $\mathcal{F}$ represents the space of regression trees, $q$ represents the structure of a tree, $T$ is the number of leaf nodes in the tree, and each $f_k$ corresponds to a regression tree with structure $q$ and weights $w$.
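The model expression is missing from this text. The symbols listed match the usual additive tree ensemble used by XGBoost, so a hedged reconstruction is given below. The number of trees $K$, the feature dimension $d$, and the hat on $\hat{Y}_m$ (the ensemble's approximation of the network's predicted risk $Y_m$) are assumptions introduced for illustration.

```latex
\hat{Y}_m = \sum_{k=1}^{K} f_k(X_m), \quad f_k \in \mathcal{F},
\qquad
\mathcal{F} = \left\{ f(x) = w_{q(x)} \right\},\;
q : \mathbb{R}^{d} \to \{1, \dots, T\},\; w \in \mathbb{R}^{T}
```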
Further, in step S5, $n$ is set to 200 when selecting the top $n$ gene features by importance coefficient.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
According to the invention, the deep neural network obtains more prior knowledge from the public data sets during learning, which improves the robustness of the prediction model; a data recovery loss function and a risk prediction loss function are introduced, and the prognosis risk of a cancer patient is predicted more accurately by using multi-omics data.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 shows the performance of different patient prognostic risk prediction methods on simulated data in an embodiment of the present invention.
FIG. 3 is a schematic diagram of the target genes and pathways affecting bladder cancer prognosis, identified from the risk predicted by the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Example 1
As shown in FIG. 1, a deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data comprises the following steps:
S1: obtaining clinical data $Y$ of a target cancer patient and the corresponding multi-omics expression data $X$ from an existing public data set (e.g., TCGA, GEO);
In a specific example, 14 TCGA datasets (BRCA, CESC, COAD, ESCA, HNSC, KIRC, LGG, LIHC, LUAD, LUSC, MESO, PAAD, SARC, and SKCM) were used for pre-training, while bladder cancer (BLCA) data served as the target cancer.
The multi-omics data comprise mRNA expression, miRNA expression, DNA methylation information, and copy number variation (CNV) information of bladder cancer patients. The mRNA data are RNA sequencing data generated by UNC IlluminaHiSeq_RNASeqV2. The miRNA data are miRNA sequencing data obtained from BCGSC IlluminaHiSeq_miRNASeq. The DNA methylation data were generated by USC HumanMethylation450, and the CNV data were generated by BROAD-MIT Genome_Wide_SNP_6. All of these data are TCGA level 3 data. We calculated the mean DNA methylation over the CpG sites of each gene as its methylation expression. CNV features were extracted by averaging the copy number of all CNV variations on a gene.
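As a minimal sketch of the per-gene averaging described above, assume the site-level values of one patient are held in a dictionary and a site-to-gene annotation is available. The function name, probe IDs, and gene names are hypothetical and not part of the patent; the same helper would apply to CNV segments mapped to genes.

```python
from collections import defaultdict
from statistics import mean

def average_by_gene(site_values, site_to_gene):
    """Average site-level values (e.g. CpG methylation) into one value per gene."""
    grouped = defaultdict(list)
    for site, value in site_values.items():
        gene = site_to_gene.get(site)
        if gene is not None:              # skip sites without a gene annotation
            grouped[gene].append(value)
    return {gene: mean(values) for gene, values in grouped.items()}

# Hypothetical CpG probes and methylation values for one patient.
site_values = {"cg01": 0.2, "cg02": 0.4, "cg03": 0.9}
site_to_gene = {"cg01": "GENE_A", "cg02": "GENE_A", "cg03": "GENE_B"}
gene_methylation = average_by_gene(site_values, site_to_gene)
```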
S2: constructing a deep neural network;
The steps of constructing the deep neural network in the invention comprise:
S201: encoding the multi-omics expression data $X$ to generate the compressed feature $z = E(X)$,
decoding the compressed feature to generate the new feature $X'$, and calculating the decoded data recovery loss $l_r$, wherein the loss function expression is as follows:
S202: defining a survival risk function representing the probability that the cancer patient survives beyond a given time $t$;
the survival risk function is expressed as: $S(t) = \Pr(T > t)$,
where $T$ is the survival time recorded for the patient;
the survival risk function at time $t$ is:
S203: constructing a proportional risk function by using the survival risk function, wherein the proportional risk function is:
$\lambda(t \mid x) = \lambda_0(t) \exp(h(x))$, where $h(x_i) = \beta x_i$ and $\lambda_0(t)$ represents the baseline risk function at time $t$;
S204: constructing a maximum likelihood function by using the proportional risk function, and obtaining a preliminary prognosis risk prediction loss function from the maximum likelihood function,
the maximum likelihood function may be expressed as:
the preliminary prognostic risk prediction loss function $l_p$ may then be expressed as:
S205: adding the data recovery loss $l_r$ to the preliminary prognosis risk prediction loss function to construct the final loss function, which is expressed as: $l_{TRDN} = (1-\gamma)\, l_r + \gamma\, l_p$, where $0 < \gamma < 1$;
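Steps S201 to S205 can be sketched in plain Python as follows. The patent's own expressions for the two component losses are not recoverable from this text, so the sketch assumes a mean squared error for the data recovery loss and the negative log Cox partial likelihood (without tie correction) for the risk prediction loss; all function names and toy values are hypothetical. Only the weighted sum in `trdn_loss` is taken directly from the stated formula.

```python
import math

def reconstruction_loss(x, x_rec):
    """Mean squared error between features and decoded features (assumed form of l_r)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

def cox_loss(risk_scores, times, events):
    """Negative log partial likelihood (assumed form of l_p, no tie correction)."""
    loss = 0.0
    for h_i, t_i, st_i in zip(risk_scores, times, events):
        if st_i != 1:
            continue  # censored patients (st = 0) appear only inside risk sets
        # Risk set: every patient whose recorded time is at least t_i.
        denom = sum(math.exp(h_j)
                    for h_j, t_j in zip(risk_scores, times) if t_j >= t_i)
        loss -= h_i - math.log(denom)
    return loss

def trdn_loss(l_r, l_p, gamma=0.5):
    """Final loss l_TRDN = (1 - gamma) * l_r + gamma * l_p with 0 < gamma < 1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return (1.0 - gamma) * l_r + gamma * l_p

# Hypothetical toy values: one feature vector and two patients with observed deaths.
l_r = reconstruction_loss([1.0, 2.0], [1.0, 4.0])
l_p = cox_loss([0.0, 0.0], times=[1.0, 2.0], events=[1, 1])
total = trdn_loss(l_r, l_p, gamma=0.25)
```

A larger $\gamma$ shifts the optimization toward risk prediction, while a smaller one favors faithful reconstruction of the input features.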
In the present invention, the neural network is trained using the TCGA public cancer datasets, which provide the published cancer multi-omics data $X_p$ and patient clinical information $Y_p$. The patient clinical information $Y_p$ includes the patient's survival time $t$ and status $st$, where $st = 1$ indicates that the patient had died at that time point and $st = 0$ indicates that the patient had not.
The TCGA public cancer datasets were preprocessed before training: genes and samples with more than 20% missing values were removed, and the remaining missing values were then filled in using the median.
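The preprocessing can be sketched as follows, under stated assumptions: the expression matrix is a list of samples, missing values are `None`, genes are filtered before samples, and the median is taken per gene over the remaining samples. The order of filtering and the per-gene median are choices made for this sketch; the patent does not specify them.

```python
from statistics import median

def preprocess(matrix, max_missing=0.2):
    """matrix: list of samples (rows), each a list of gene values or None if missing."""
    n_genes = len(matrix[0])
    # 1. Drop genes (columns) missing in more than max_missing of the samples.
    keep_genes = [g for g in range(n_genes)
                  if sum(row[g] is None for row in matrix) / len(matrix) <= max_missing]
    rows = [[row[g] for g in keep_genes] for row in matrix]
    # 2. Drop samples (rows) missing more than max_missing of the remaining genes.
    rows = [row for row in rows
            if sum(v is None for v in row) / len(row) <= max_missing]
    # 3. Fill each remaining gap with the median of the observed values of that gene.
    for g in range(len(keep_genes)):
        fill = median(row[g] for row in rows if row[g] is not None)
        for row in rows:
            if row[g] is None:
                row[g] = fill
    return rows, keep_genes

# Hypothetical 5-sample, 6-gene matrix.
matrix = [
    [None, 2.0, 3.0, 4.0, 5.0, None],
    [1.0, 2.0, 3.0, 4.0, 5.0, None],
    [3.0, 2.0, 3.0, 4.0, 5.0, None],
    [5.0, 2.0, 3.0, 4.0, 5.0, 1.0],
    [7.0, None, None, 4.0, 5.0, 1.0],
]
clean, kept = preprocess(matrix)
```

Here the last gene is dropped (missing in 60% of samples), the last sample is dropped (missing 40% of the remaining genes), and the single gap left in the first sample is filled with the median of that gene.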
S3: passing the cancer multi-omics data $X_p$ and patient clinical information $Y_p$ of the preprocessed public data set through the constructed deep neural network and updating the weights $\theta$ to obtain a pre-trained network $N_p$ based on the public data set;
The specific process is as follows:
The compressed feature generated by encoding in the deep neural network is $z = E(X)$. The compressed feature is decoded to generate the new feature $X'$, and the decoded data recovery loss $l_r$ is calculated.
The final loss function is constructed as: $l_{TRDN} = (1-\gamma)\, l_r + \gamma\, l_p$, where $0 < \gamma < 1$.
The model is optimized with a stochastic gradient descent algorithm, which updates the weights $\theta$ of the deep neural network to obtain the pre-trained network $N_p$ based on the public data set.
S4: training the network $N_p$ again on the clinical data $Y$ of the target cancer patient and the corresponding multi-omics expression data $X$ until the number of training epochs reaches the upper limit, thereby obtaining the risk prediction network $N_f$;
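The pre-train then re-train idea of steps S3 and S4 can be illustrated with a deliberately simplified stand-in. The sketch below replaces the deep network with a linear risk score $h(x) = \beta x$, omits the data recovery term, and uses full-batch rather than stochastic gradient descent; the cohorts, learning rate, and epoch limits are hypothetical. What it keeps is the mechanism: the weights learned on the public cohort are the starting point for a fixed number of epochs on the target cohort.

```python
import math

def cox_loss_and_grad(beta, xs, times, events):
    """Negative log partial likelihood and its gradient for h(x) = beta . x."""
    scores = [sum(b * v for b, v in zip(beta, x)) for x in xs]
    loss, grad = 0.0, [0.0] * len(beta)
    for i, (t_i, st_i) in enumerate(zip(times, events)):
        if st_i != 1:
            continue                      # censored: contributes only via risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        weights = [math.exp(scores[j]) for j in risk]
        total = sum(weights)
        loss -= scores[i] - math.log(total)
        for k in range(len(beta)):
            expected = sum(w * xs[j][k] for w, j in zip(weights, risk)) / total
            grad[k] -= xs[i][k] - expected
    return loss, grad

def train(beta, xs, times, events, epochs, lr=0.1):
    """Gradient descent for a fixed number of epochs, starting from beta."""
    beta = list(beta)
    for _ in range(epochs):
        _, grad = cox_loss_and_grad(beta, xs, times, events)
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical public cohort (larger) and target cohort (small), one feature each.
public_x = [[2.0], [1.5], [1.0], [0.5], [0.0], [-0.5]]
public_t, public_e = [1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1]
target_x = [[1.0], [0.2], [-0.3]]
target_t, target_e = [2, 5, 7], [1, 0, 1]

beta_p = train([0.0], public_x, public_t, public_e, epochs=50)   # stands in for N_p
beta_f = train(beta_p, target_x, target_t, target_e, epochs=20)  # stands in for N_f
loss_scratch, _ = cox_loss_and_grad([0.0], target_x, target_t, target_e)
loss_finetuned, _ = cox_loss_and_grad(beta_f, target_x, target_t, target_e)
```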
As shown in FIG. 2, in the simulation experiment we tested how much different improvement mechanisms enhance tumor prognosis performance, comparing a Cox neural network without transfer learning (Deep_surv), a deep Cox network combining the two loss functions (Deep_Cox), a transfer Cox neural network using the pre-training dataset (trans_Cox), and our proposed method TRCN. The C-index values obtained for different amounts of training data are shown in FIG. 2. As can be seen from FIG. 2, Deep_surv has the lowest C-index on every data set, while Deep_Cox, which uses the combined loss function, performs better than Deep_surv but worse than the other methods. Deep_Cox improved the C-index by an average of 3.7% compared to Deep_surv, an improvement less pronounced than that of trans_Cox_all (13.8%) and TRCN (17.9%). Compared with trans_Cox, the C-index values obtained by TRCN on the three types of simulation data improved by 3.3%, 4.2%, and 2.9%, respectively. These results indicate that combining the losses is an effective way to improve predictive performance, and that pre-trained models can bring more useful information to the learning task.
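For reference, the C-index used throughout these comparisons can be computed as in the sketch below, which follows Harrell's usual definition: among pairs of patients whose order of death is known, it is the fraction in which the patient who died earlier was assigned the higher risk. The function name and toy data are hypothetical, and the patent does not state which C-index variant or tie handling was used.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked in the correct order."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: patient i died (event observed) before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5     # tied predictions count as half
    return concordant / comparable        # undefined if no pair is comparable

# Hypothetical cohort: the third patient is censored at time 3.
times, events = [1, 2, 3], [1, 1, 0]
perfect = concordance_index(times, events, [3.0, 2.0, 1.0])
reversed_order = concordance_index(times, events, [1.0, 2.0, 3.0])
uninformative = concordance_index(times, events, [1.0, 1.0, 1.0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values should be read.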
Table 1. C-index values of different methods for predicting bladder cancer prognosis risk
In Table 1, this example compares the accuracy of different existing methods in predicting bladder cancer prognosis risk (real data), including four conventional methods and four deep learning-based methods. Among the conventional methods, the simple Cox method has the lowest C-index (0.525) and performs worst, while Cox with elastic net regularization (Cox-elasticnet) has the highest C-index, 0.561. The C-index values obtained by the conventional methods are much smaller than those obtained by the deep learning-based methods. Among the deep learning-based methods, the Cox model using features reconstructed by an autoencoder (AE-Cox) outperforms the Cox model with a deep neural network (Deep_surv). The C-index of TRCN without the transfer learning mechanism is higher than that of Deep_surv and AE-Cox, which shows that the combined loss function mechanism provided by the invention improves accuracy. TRCN obtains the highest C-index among all methods, showing that transfer learning helps improve the performance of model learning.
This example also performed an ablation study, predicting patient risk from different combinations of omics data to investigate the contribution of each omics type to prediction accuracy, as shown in Table 2.
Table 2. Contribution of different omics data in predicting bladder cancer prognosis risk
The results show that, when a single type of omics data is used, mRNA performs best with a C-index of 0.624, and miRNA is lowest at 0.552. CNV and DNA methylation rank second and third, respectively. When one type was removed from TRCN's four omics data types, removing mRNA caused the greatest decrease, with the C-index dropping from 0.643 to 0.599. The decrease in C-index was smallest, 0.09, after the exclusion of miRNA. These results indicate that mRNA data play the most important role in the prognosis prediction of bladder cancer, while miRNA contributes the least.
S5: using the XGBoost algorithm to select the top 200 gene features of the target cancer patient by importance coefficient, improving the risk prediction network $N_f$, and obtaining a final risk prediction model.
The final risk prediction model described in step S5 is expressed as:
where $X_m$ denotes the mRNA features used to construct the model, $Y_m$ is the patient risk predicted by the risk prediction network $N_f$, $\mathcal{F}$ represents the space of regression trees, $q$ represents the structure of a tree, $T$ is the number of leaf nodes in the tree, and each $f_k$ corresponds to a regression tree with structure $q$ and weights $w$.
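The selection part of step S5 reduces to ranking features by an importance score and keeping the first $n$. The sketch below assumes the importance coefficients are already available as a mapping from gene name to score, for example as reported by a fitted gradient-boosted tree model; the gene names and scores are hypothetical, and fitting the tree model itself is outside this sketch.

```python
def top_n_features(importance, n=200):
    """Return the n feature names with the largest importance scores."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]

# Hypothetical importance coefficients for three genes.
importance = {"GENE_A": 0.1, "GENE_B": 0.5, "GENE_C": 0.3}
selected = top_n_features(importance, n=2)
```

With the value $n = 200$ stated in the text, the default argument applies; if fewer than $n$ features carry a score, all of them are returned.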
The final risk prediction model established in this embodiment is verified and analyzed as follows:
In this embodiment, four bladder cancer data sets in GEO were downloaded as independent tests to verify the robustness of the model constructed with the XGBoost method. GSE13507 contains RNA-seq data and survival information for 165 patients with primary bladder cancer, collected at Chungbuk National University Hospital. GSE31684 contains data on 93 bladder cancer patients shared by the Dana-Farber Cancer Institute. GSE32894 contains information on 224 bladder cancer patients from the SCIBLU genomics center, Lund University, Sweden. GSE42876 contains information collected at the university of coenzyme about 43 patients with bladder cancer.
Table 3 shows the results of the independent verification. The C-index values on all four data sets are greater than 0.6, which verifies the accuracy of the model in predicting patient risk, and the p values between risk groups are all less than 0.05, which indicates a significant difference between the risk groups. These results demonstrate that the prediction model constructed with the XGBoost algorithm works well on these four data sets.
Table 3. Independent test results of the XGBoost risk prediction model on 4 GEO datasets
Patients can be classified into high-risk and low-risk groups based on the median of the predicted patient risk. Differential expression analysis is then carried out between the risk groups to find differentially expressed genes that influence prognosis. A total of 244 genes were identified, with 90 genes downregulated and 154 genes upregulated (FIG. 3A). The 20 differential genes with the highest correlation coefficients are additionally labeled in FIG. 3A. A heat map based on the expression of these differential genes is shown in FIG. 3B. According to a review of the literature, 104 of these genes have been shown to be associated with bladder cancer. In addition to known cancer genes, our results also reveal 140 potential genes, not yet fully studied, that may influence the prognosis of bladder cancer.
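The median split described above can be sketched as follows. The patient identifiers and risk values are hypothetical, and assigning patients whose risk equals the median to the low-risk group is a choice made for this sketch that the patent does not specify.

```python
from statistics import median

def split_by_median_risk(risk_by_patient):
    """Split patients into high- and low-risk groups at the median predicted risk."""
    cutoff = median(risk_by_patient.values())
    high = [p for p, r in risk_by_patient.items() if r > cutoff]
    low = [p for p, r in risk_by_patient.items() if r <= cutoff]
    return high, low

# Hypothetical predicted risks for four patients.
predicted_risk = {"P1": 0.9, "P2": 0.1, "P3": 0.5, "P4": 0.7}
high_risk, low_risk = split_by_median_risk(predicted_risk)
```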
Using these 244 genes, we performed KEGG pathway analysis to find the pathways enriched in differentially expressed genes. A total of 38 KEGG pathways (2 downregulated and 36 upregulated) were found to correlate with the prognosis of bladder cancer. Because far more pathways are upregulated than downregulated, only pathways with more than 4 genes are shown in FIG. 3C. The metabolic pathway is one of the common pathways in cancer, and it therefore contains the largest number of genes (n = 12). Among these pathways, the PI3K-Akt signaling pathway has the lowest p-value. The PI3K-Akt signaling pathway is an important intracellular signal transduction pathway in regulating the cell cycle, and the growth of human bladder cancer cells can be inhibited by regulating it. In addition, we found MAPK signaling pathways, Ras signaling pathways, PPAR signaling pathways, proteoglycans, and cancer pathways, among others. MAPK signaling pathways have also been shown to affect treatment in patients with bladder cancer. These results further demonstrate that the cancer outcomes predicted by TRCN are biologically meaningful.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (10)
1. A deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data, comprising the steps of:
S1: acquiring clinical data $Y$ of a target cancer patient and the corresponding multi-omics expression data $X$ from an existing public data set;
S2: constructing a deep neural network;
S3: using the multi-omics data $X_p$ of cancers in the existing public data set and the patient clinical information $Y_p$, updating the weights $\theta$ through the constructed deep neural network to obtain a pre-trained network $N_p$ based on the public data set;
S4: training the network $N_p$ again on the clinical data $Y$ of the target cancer patient and the multi-omics expression data $X$ until the number of training epochs reaches the upper limit, thereby obtaining a risk prediction network $N_f$;
S5: using the XGBoost algorithm to select the top $n$ gene features by importance coefficient for the target cancer patient, refining the risk prediction network $N_f$, and obtaining a final risk prediction model.
2. The deep learning method for predicting the prognosis risk of cancer patients based on multi-omics data as claimed in claim 1, wherein step S2 constructs the deep neural network by:
S201: encoding the multi-omics expression data $X$ to generate compressed features $z = E(X)$, decoding the compressed features to generate new features $X'$, and calculating the data recovery loss $L_r$ after decoding;
S202: defining a survival risk function representing the survival rate of the cancer patient before a set time $t$;
S203: constructing a proportional risk function by using the survival risk function;
S204: constructing a maximum likelihood function by using the proportional risk function, and obtaining a preliminary prognosis risk prediction loss function from the maximum likelihood function;
S205: adding the data recovery loss $L_r$ to the preliminary prognosis risk prediction loss function to construct a final loss function.
5. The deep learning method of claim 4, wherein the proportional risk function is:
$\lambda(t \mid x) = \lambda_0(t)\,\exp(h(x))$, wherein $h(x) = \beta x_i$ and $\lambda_0(t)$ represents the baseline risk function at time $t$.
8. The deep learning method of claim 7, wherein the final loss function is expressed as: $l_{TRDN} = (1 - \gamma)\,l_r + \gamma\,l_p$, wherein $0 < \gamma < 1$.
9. The deep learning method for predicting the prognosis risk of cancer patients based on multi-omics data as claimed in claim 8, wherein the final risk prediction model of step S5 is expressed as:
wherein $X_m$ denotes the mRNA features used to construct the model; $Y_m$ denotes the patient risk predicted by the risk prediction network $N_f$; $q$ represents the structure of a tree; $T$ represents the number of leaf nodes in the tree; and $f_k$ represents a regression tree with structure $q$ and weight $w$, drawn from the space of regression trees.
10. The deep learning method for predicting the prognosis risk of cancer patients based on multi-omics data as claimed in claim 1, wherein, for the top $n$ gene features by importance coefficient in step S5, the value of $n$ is 200.
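The quantities named in the claims above (the data recovery loss, the proportional risk function, the weighted final loss, and the top-n feature selection) can be illustrated with a minimal Python sketch. Several parts are assumptions rather than the claimed method: mean squared error stands in for the data recovery loss, the risk prediction loss is the standard negative log partial likelihood of the Cox proportional hazards model averaged over observed events, and the importance scores are supplied as a plain dictionary instead of being produced by XGBoost. All function names are hypothetical.

```python
import math

def reconstruction_loss(x, x_rec):
    """Data recovery loss between input features and decoded features.

    Assumption: mean squared error; the claims do not name the metric.
    """
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative log partial likelihood under lambda(t|x) = lambda_0(t) * exp(h(x)).

    risk[i]  : network output h(x_i) for patient i
    time[i]  : follow-up time of patient i
    event[i] : 1 if the event was observed, 0 if the patient was censored
    Assumption: the sum is averaged over the number of observed events.
    """
    total = 0.0
    n_events = 0
    for i in range(len(risk)):
        if not event[i]:
            continue
        # Risk set: patients still under observation at time[i].
        denom = sum(
            math.exp(risk[j]) for j in range(len(risk)) if time[j] >= time[i]
        )
        total += risk[i] - math.log(denom)
        n_events += 1
    return -total / max(n_events, 1)

def combined_loss(l_r, l_p, gamma):
    """Final loss (1 - gamma) * l_r + gamma * l_p with 0 < gamma < 1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return (1.0 - gamma) * l_r + gamma * l_p

def top_n_features(importance, n=200):
    """Return the n feature names with the largest importance scores."""
    return sorted(importance, key=importance.get, reverse=True)[:n]

# Hypothetical toy values.
l_r = reconstruction_loss([1.0, 2.0], [1.0, 4.0])
l_p = cox_neg_log_partial_likelihood([0.0, 0.0], [1.0, 2.0], [1, 1])
loss = combined_loss(l_r, l_p, gamma=0.5)
selected = top_n_features({"geneA": 0.1, "geneB": 0.5, "geneC": 0.3}, n=2)
```

The weight gamma trades reconstruction quality against survival ranking: values near 1 emphasize the risk prediction term, and values near 0 emphasize data recovery.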
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110210941.8A CN112820403B (en) | 2021-02-25 | 2021-02-25 | Deep learning method for predicting prognosis risk of cancer patient based on multiple sets of learning data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112820403A (en) | 2021-05-18 |
CN112820403B (en) | 2024-03-29 |
Family
ID=75865575
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110210941.8A Active CN112820403B (en) | 2021-02-25 | 2021-02-25 | Deep learning method for predicting prognosis risk of cancer patient based on multiple sets of learning data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112820403B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108922628A (en) * | 2018-04-23 | 2018-11-30 | 华北电力大学 | A kind of Prognosis in Breast Cancer survival rate prediction technique based on dynamic Cox model |
KR20190021471A (en) * | 2017-02-02 | 2019-03-05 | 사회복지법인 삼성생명공익재단 | Method, Apparatus and Program for Predicting Prognosis of Gastric Cancer Using Artificial Neural Network |
CN109859801A (en) * | 2019-02-14 | 2019-06-07 | 辽宁省肿瘤医院 | A kind of model and method for building up containing seven genes as biomarker prediction lung squamous cancer prognosis |
CN110853756A (en) * | 2019-11-08 | 2020-02-28 | 郑州轻工业学院 | Esophagus cancer risk prediction method based on SOM neural network and SVM |
CN110942808A (en) * | 2019-12-10 | 2020-03-31 | 山东大学 | Prognosis prediction method and prediction system based on gene big data |
CN111161882A (en) * | 2019-12-04 | 2020-05-15 | 深圳先进技术研究院 | Breast cancer life prediction method based on deep neural network |
CN111161799A (en) * | 2019-12-24 | 2020-05-15 | 大连海事大学 | Method and system for acquiring multigene risk scores based on multigroup mathematical data |
KR102119687B1 (en) * | 2020-03-02 | 2020-06-05 | 엔에이치네트웍스 주식회사 | Learning Apparatus and Method of Image |
CN112037919A (en) * | 2020-09-15 | 2020-12-04 | 南京鼓楼医院 | Risk assessment model for papillary carcinoma of thyroid nodule patient |
CN112086199A (en) * | 2020-09-14 | 2020-12-15 | 中科院计算所西部高等技术研究院 | Liver cancer data processing system based on multiple groups of mathematical data |
CN112201346A (en) * | 2020-10-12 | 2021-01-08 | 哈尔滨工业大学(深圳) | Cancer survival prediction method, apparatus, computing device and computer-readable storage medium |
CN112309576A (en) * | 2020-09-22 | 2021-02-02 | 江南大学 | Colorectal cancer survival period prediction method based on deep learning CT (computed tomography) image omics |
CN112397143A (en) * | 2020-10-30 | 2021-02-23 | 深圳思勤医疗科技有限公司 | Method for predicting tumor risk value based on plasma multi-omic multi-dimensional features and artificial intelligence |
Non-Patent Citations (1)
Title |
---|
ZHI HUANG et al.: "Deep learning-based cancer survival prognosis from RNA-seq data: approaches and evaluations", BMC MEDICAL GENOMICS, vol. 13, no. 5, pages 1 - 12 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113409946A (en) * | 2021-07-02 | 2021-09-17 | 中山大学 | System and method for predicting cancer prognosis risk under high-dimensional deletion data |
CN113838570A (en) * | 2021-08-31 | 2021-12-24 | 华中科技大学 | Cervical cancer self-consistent typing method and system based on deep learning |
CN113838570B (en) * | 2021-08-31 | 2024-04-26 | 华中科技大学 | Cervical cancer self-consistent typing method and system based on deep learning |
CN114927162A (en) * | 2022-05-19 | 2022-08-19 | 大连理工大学 | Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution |
CN114927162B (en) * | 2022-05-19 | 2024-06-14 | 大连理工大学 | Multi-mathematic association phenotype prediction method based on hypergraph characterization and dirichlet allocation |
CN114783524A (en) * | 2022-06-17 | 2022-07-22 | 之江实验室 | Path abnormity detection system based on self-adaptive resampling depth encoder network |
WO2024065987A1 (en) * | 2022-09-27 | 2024-04-04 | 山东第一医科大学(山东省医学科学院) | Lung cancer prognosis prediction system based on multi-omics of radiomics, pathomics and genomics |
CN116417070A (en) * | 2023-04-17 | 2023-07-11 | 齐鲁工业大学(山东省科学院) | Method for improving prognosis prediction precision of gastric cancer typing based on gradient lifting depth feature selection algorithm |
CN116862861A (en) * | 2023-07-04 | 2023-10-10 | 浙江大学 | Prediction model training and prediction method and system for gastric cancer treatment efficacy based on multiple groups of students |
CN116580841B (en) * | 2023-07-12 | 2023-11-10 | 北京大学 | Disease diagnosis device, device and storage medium based on multiple groups of study data |
CN116580841A (en) * | 2023-07-12 | 2023-08-11 | 北京大学 | Disease diagnosis device, device and storage medium based on multiple groups of study data |
CN117594243A (en) * | 2023-10-13 | 2024-02-23 | 太原理工大学 | Ovarian cancer prognosis prediction method based on cross-modal view association discovery network |
CN117594243B (en) * | 2023-10-13 | 2024-05-14 | 太原理工大学 | Ovarian cancer prognosis prediction method based on cross-modal view association discovery network |
Also Published As
Publication number | Publication date |
---|---|
CN112820403B (en) | 2024-03-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||