CN112820403A

CN112820403A - Deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data

Info

Publication number: CN112820403A
Application number: CN202110210941.8A
Authority: CN
Inventors: 杨跃东; 柴华; 张仲岳; 周翔
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-02-25
Filing date: 2021-02-25
Publication date: 2021-05-18
Anticipated expiration: 2041-02-25
Also published as: CN112820403B

Abstract

The invention discloses a deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data, comprising the following steps: S1: acquiring clinical data Y of a target cancer patient and the corresponding multi-omics expression data X from an existing public data set; S2: constructing a deep neural network; S3: using the cancer multi-omics data $X_p$ and patient clinical information $Y_p$ of the existing public data set to update the weights θ of the constructed deep neural network, obtaining a pre-trained network $N_p$ based on the public data set; S4: training the network $N_p$ again until the number of training epochs reaches the upper limit, thereby obtaining a risk prediction network $N_f$; S5: using the XGBoost algorithm to select the top n gene features by importance coefficient for the target cancer patient, refining the risk prediction network $N_f$ to obtain a final risk prediction model. By utilizing multi-omics data, the invention improves the robustness of the prediction model and predicts the prognosis risk of cancer patients more accurately.

Description

Deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data

Technical Field

The invention relates to the technical field of survival analysis of cancer patients, in particular to a deep learning method for predicting the prognosis risk of cancer patients based on multi-omics data.

Background

The high incidence of cancer has prompted the development of medical assistance techniques in recent years, and prognostic risk analysis is a key medical assistance technique that can assist in the selection of different treatment regimens based on the potential risk of prognosis for different patients.

Most methods for predicting cancer prognosis analyze expression data from a single omics layer, such as gene mRNA expression data, methylation data or miRNA data. However, a patient's prognosis is jointly regulated by molecules at several different levels, and strong complementary effects and interactions exist among these levels, so an analysis of single-omics data can only provide one-sided information. In addition, fusing analyses of different omics and different modalities can mitigate, through error cancellation, the excessive noise sensitivity of single-omics methods. Fusing multiple types of data for cancer analysis has therefore become a powerful tool in recent years.

The biggest difficulty in fusing multi-omics data is how to optimize the dimensionality reduction of high-dimensional omics data when only small-sample cancer data are available. In 2018, Li Xin et al. (Li Xin, Weigong, Lu Zhang Yan, et al. Building a lung adenocarcinoma prognosis-related risk prediction model based on multi-omics data [J]. Nanjing university of medicine science (Nature science edition), 2018, 38(12): 1820-1825) used a traditional L1-regularized Cox method to build a prognosis-related risk prediction model for lung adenocarcinoma, integrating clinical, genomic and transcriptomic information. However, that method is not robust enough, does not overcome the poor performance on high-dimensional small-sample cancer data, and has low prediction accuracy. Researchers then applied deep learning to this field, using an autoencoder to extract high-dimensional multi-omics features of liver cancer (including mRNA, miRNA and methylation data) and then using the compressed features to identify different clinical subtypes of patients. Building on this, other researchers incorporated copy number variation data to distinguish two prognostic subtypes of high-risk neuroblastoma, and several variants based on other autoencoder methods have also been derived. However, the biggest problem with this framework is that it splits feature reduction and patient risk prediction into two separate models, so the method is not robust enough. In 2019, researchers combined the loss function of a proportional hazards model with a deep neural network to predict patient survival risk directly from multi-omics data.
The problem with this method is that the deep neural network directly optimizes the risk prediction loss function, with no guarantee that the features reconstructed after multi-layer compression in the network still retain the spatial distribution characteristics of the initial features, which limits its performance.

Disclosure of Invention

The invention provides a deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data, aiming to overcome the defects of the prior art, namely the low accuracy of prognosis risk prediction and the small size of the target data set.

The primary objective of the present invention is to solve the above technical problems, and the technical solution of the present invention is as follows:

A deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data, comprising the following steps:

S1: acquiring clinical data Y of a target cancer patient and the corresponding multi-omics expression data X from an existing public data set;

S2: constructing a deep neural network;

S3: using the cancer multi-omics data $X_p$ and patient clinical information $Y_p$ of the existing public data set to update the weights θ of the constructed deep neural network, obtaining a pre-trained network $N_p$ based on the public data set;

S4: training the network $N_p$ again with the clinical data Y of the target cancer patient and the corresponding multi-omics expression data X until the number of training epochs reaches the upper limit, thereby obtaining a risk prediction network $N_f$;

S5: using the XGBoost algorithm to select the top n gene features by importance coefficient for the target cancer patient, refining the risk prediction network $N_f$ to obtain a final risk prediction model.
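Steps S3 and S4 amount to pre-training on public cohorts and then continuing training on the target cohort from the pre-trained weights. The following is a minimal sketch of that idea only, under loud assumptions: it swaps the deep autoencoder network for a linear Cox model, uses full-batch gradient descent, and runs on simulated data. The helper names `cox_loss_and_grad`, `train` and `simulate`, the learning rate and the cohort sizes are all hypothetical and do not come from the patent.

```python
import math
import random

def cox_loss_and_grad(beta, xs, times, events):
    """Negative log partial likelihood of a linear Cox model h(x) = beta . x,
    averaged over observed events, together with its gradient."""
    scores = [sum(b * v for b, v in zip(beta, x)) for x in xs]
    loss, grad = 0.0, [0.0] * len(beta)
    n_events = max(1, sum(events))
    for i, x_i in enumerate(xs):
        if events[i] != 1:
            continue  # censored patients only contribute through risk sets
        risk_set = [j for j in range(len(xs)) if times[j] >= times[i]]
        weights = [math.exp(scores[j]) for j in risk_set]
        denom = sum(weights)
        loss -= scores[i] - math.log(denom)
        for k in range(len(beta)):
            expected = sum(w * xs[j][k] for w, j in zip(weights, risk_set)) / denom
            grad[k] -= x_i[k] - expected
    return loss / n_events, [g / n_events for g in grad]

def train(beta, xs, times, events, epochs, lr=0.1):
    """Full-batch gradient descent for a fixed number of epochs (assumed optimizer)."""
    beta = list(beta)
    for _ in range(epochs):
        _, grad = cox_loss_and_grad(beta, xs, times, events)
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

def simulate(n, beta_true, rng):
    """Toy cohort: exponential survival times, every death observed."""
    xs = [[rng.gauss(0.0, 1.0) for _ in beta_true] for _ in range(n)]
    times = [rng.expovariate(math.exp(sum(b * v for b, v in zip(beta_true, x))))
             for x in xs]
    return xs, times, [1] * n

rng = random.Random(0)
public = simulate(60, [1.0, -1.0], rng)   # stands in for (X_p, Y_p)
target = simulate(15, [1.0, -1.0], rng)   # stands in for (X, Y)

beta_p = train([0.0, 0.0], *public, epochs=30)   # S3: pre-trained model N_p
beta_f = train(beta_p, *target, epochs=10)       # S4: fine-tuned model N_f
```

The design point the sketch illustrates is that the second call to `train` starts from `beta_p` rather than from zeros, so the small target cohort only has to adjust weights that already encode what was learned from the larger public cohort.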

Further, the specific process of constructing the deep neural network in step S2 is as follows:

S201: encoding the multi-omics expression data X to generate compressed features $z = E(X)$, decoding the compressed features to generate new features $X'$, and calculating the data recovery loss $l_r$ after decoding;

S202: defining a survival function representing the probability that the cancer patient survives beyond a given time t;

S203: constructing a proportional hazards function using the survival function;

S204: constructing a maximum likelihood function using the proportional hazards function, and obtaining a preliminary prognosis risk prediction loss function from the maximum likelihood function;

S205: adding the data recovery loss $l_r$ to the preliminary prognosis risk prediction loss function to construct the final loss function.

Further, the loss function expression is:

Further, the survival function is expressed as: $S(t) = \Pr(T > t)$

where T is the observed survival time of the patient;

the hazard function at time t is:
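The expression itself is not reproduced in this text. As a hedged reminder, the standard survival-analysis definition of the hazard, assumed here rather than taken from the source, is the instantaneous rate of death at time t given survival up to t:

```latex
\lambda(t) = \lim_{\delta \to 0} \frac{\Pr\left(t \le T < t + \delta \mid T \ge t\right)}{\delta}
```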

Further, the proportional hazards function is:

$\lambda(t \mid x) = \lambda_0(t)\,\exp(h(x))$, where $h(x_i) = \beta x_i$ and $\lambda_0(t)$ denotes the baseline hazard function at time t.

Further, the maximum likelihood function may be expressed as:

Further, the preliminary prognosis risk prediction loss function can then be expressed as:
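Neither the likelihood nor the resulting loss is reproduced in this text. For orientation, the usual Cox partial likelihood and its negative logarithm are given below as an assumed form; the source may differ, for example by normalizing over the number of observed events. Here $st_i = 1$ marks a patient whose death was observed at time $t_i$, and the inner sum runs over the patients still at risk at $t_i$.

```latex
L(\beta) = \prod_{i:\, st_i = 1} \frac{\exp\left(h(x_i)\right)}{\sum_{j:\, t_j \ge t_i} \exp\left(h(x_j)\right)}
\qquad
l_p = -\sum_{i:\, st_i = 1} \left[ h(x_i) - \log \sum_{j:\, t_j \ge t_i} \exp\left(h(x_j)\right) \right]
```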

Further, the final loss function is expressed as: $l_{TRDN} = (1-\gamma)\,l_r + \gamma\,l_p$, where $l_r$ is the data recovery loss, $l_p$ is the preliminary prognosis risk prediction loss, and $0 < \gamma < 1$.
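The weighted sum can be sketched in a few lines of Python. The forms of the two component losses are assumptions, since their expressions are not reproduced in this text: mean squared error for the data recovery loss, and the negative log partial likelihood averaged over observed events for the risk prediction loss. The function names and the default `gamma=0.5` are hypothetical.

```python
import math

def reconstruction_loss(xs, xs_rec):
    """Mean squared error between inputs X and reconstructions X' (assumed form of l_r)."""
    total = sum((a - b) ** 2 for x, r in zip(xs, xs_rec) for a, b in zip(x, r))
    count = sum(len(x) for x in xs)
    return total / count

def cox_loss(scores, times, events):
    """Negative log partial likelihood, averaged over observed events (assumed form of l_p)."""
    loss, n_events = 0.0, 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored patients only appear in the risk sets
        log_denom = math.log(sum(math.exp(s) for s, t in zip(scores, times) if t >= t_i))
        loss -= scores[i] - log_denom
        n_events += 1
    return loss / max(n_events, 1)

def trdn_loss(xs, xs_rec, scores, times, events, gamma=0.5):
    """Final loss: (1 - gamma) * l_r + gamma * l_p with 0 < gamma < 1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must satisfy 0 < gamma < 1")
    l_r = reconstruction_loss(xs, xs_rec)
    l_p = cox_loss(scores, times, events)
    return (1.0 - gamma) * l_r + gamma * l_p

# Toy call: two patients, two features, both deaths observed.
example = trdn_loss(
    xs=[[1.0, 2.0], [3.0, 4.0]],
    xs_rec=[[1.0, 2.0], [3.0, 6.0]],
    scores=[0.0, 0.0],
    times=[1.0, 2.0],
    events=[1, 1],
    gamma=0.5,
)
```

A value of gamma close to 1 makes the network behave like a plain Cox network, while a value close to 0 makes it behave like a plain autoencoder; the constraint keeps both terms active.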

Further, the final risk prediction model in step S5 is expressed as:

where $X_m$ denotes the mRNA features used to construct the model, $Y_m$ denotes the patient risk predicted by the risk prediction network $N_f$,

denotes the space of regression trees, $q$ the structure of a tree, $T$ the number of leaf nodes in the tree, and $f_k$ a regression tree with structure $q$ and leaf weights $w$.
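The model expression is not reproduced in this text. The symbols described match the standard tree-ensemble formulation used by XGBoost, which is sketched below as an assumption. The number of trees $K$, the feature dimension $d$, the symbol $\mathcal{F}$ for the tree space and the hat on the fitted value are notation introduced here, not taken from the source.

```latex
\hat{Y}_m = \sum_{k=1}^{K} f_k(X_m), \quad f_k \in \mathcal{F},
\qquad
\mathcal{F} = \left\{ f(x) = w_{q(x)} \right\}, \quad
q : \mathbb{R}^{d} \to \{1, \dots, T\}, \quad w \in \mathbb{R}^{T}
```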

Further, in step S5, n is set to 200 when selecting the top n gene features by importance coefficient.
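Selecting the top n features reduces to a sort once the importance coefficients are available. The sketch below assumes the coefficients have already been obtained from a fitted XGBoost model and are held in a plain dictionary; the function name `top_n_features` and the gene names are hypothetical, and ties are broken alphabetically as an arbitrary choice.

```python
def top_n_features(importance, n=200):
    """Return the n feature names with the largest importance coefficients."""
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

# Toy importance coefficients for four hypothetical genes.
importance = {"GENE_A": 0.30, "GENE_B": 0.05, "GENE_C": 0.45, "GENE_D": 0.20}
selected = top_n_features(importance, n=2)  # ["GENE_C", "GENE_A"]
```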

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

By learning from public data sets, the deep neural network of the invention acquires more prior knowledge, which improves the robustness of the prediction model; by introducing both a data recovery loss function and a risk prediction loss function, the invention uses multi-omics data to predict the prognosis risk of cancer patients more accurately.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 shows the performance of different patient prognosis risk prediction methods on simulated data in an embodiment of the present invention.

FIG. 3 is a schematic diagram of the target genes and pathways affecting bladder cancer prognosis, identified from the risk predicted by the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

Example 1

As shown in FIG. 1, a deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data comprises the following steps:

S1: acquiring clinical data Y of a target cancer patient and the corresponding multi-omics expression data X from an existing public data set (e.g., TCGA, GEO);

In a specific example, 14 TCGA datasets (BRCA, CESC, COAD, ESCA, HNSC, KIRC, LGG, LIHC, LUAD, LUSC, MESO, PAAD, SARC and SKCM) were used for pre-training, while bladder cancer (BLCA) data served as the target cancer.

The multi-omics data comprise mRNA expression, miRNA expression, DNA methylation and copy number variation (CNV) information of the bladder cancer patients. The mRNA data are RNA sequencing data generated by UNC IlluminaHiSeq_RNASeqV2. The miRNA data are miRNA sequencing data obtained from BCGSC IlluminaHiSeq_miRNASeq. The DNA methylation data were generated by USC HumanMethylation450, and the CNV data were generated by BROAD-MIT Genome_Wide_SNP_6. All of these data are TCGA level 3 data. For each gene, the mean DNA methylation over its CpG sites was calculated as the methylation expression. CNV features were extracted by averaging the copy numbers of all CNV variations on a gene.
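The per-gene averaging used for both methylation and CNV features can be sketched as follows. This is an illustration under assumptions: the site-to-gene mapping is taken as given, and the probe identifiers, gene names and the function name `gene_level_means` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def gene_level_means(site_values, site_to_gene):
    """Average site-level values (e.g. CpG methylation) into one value per gene."""
    grouped = defaultdict(list)
    for site, value in site_values.items():
        gene = site_to_gene.get(site)
        if gene is not None and value is not None:
            grouped[gene].append(value)  # skip unmapped sites and missing values
    return {gene: mean(values) for gene, values in grouped.items()}

# Toy methylation values for one patient.
site_values = {"cg01": 0.25, "cg02": 0.75, "cg03": 0.9, "cg04": 0.4}
site_to_gene = {"cg01": "GENE_A", "cg02": "GENE_A", "cg03": "GENE_B"}
per_gene = gene_level_means(site_values, site_to_gene)
```

In this toy call, `cg04` has no gene assignment and is ignored, while the two GENE_A sites are averaged into a single feature.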

S2: constructing a deep neural network;

The steps of constructing the deep neural network in the invention comprise:

S201: encoding the multi-omics expression data X to generate compressed features $z = E(X)$,

decoding the compressed features to generate new features $X'$, and calculating the data recovery loss $l_r$ after decoding, wherein the loss function expression is as follows:

S202: defining a survival function, expressed as: $S(t) = \Pr(T > t)$

where T is the observed survival time of the patient;

the hazard function at time t is:

S203: constructing a proportional hazards function using the survival function, wherein the proportional hazards function is:

$\lambda(t \mid x) = \lambda_0(t)\,\exp(h(x))$, where $h(x_i) = \beta x_i$ and $\lambda_0(t)$ denotes the baseline hazard function at time t;

S204: constructing a maximum likelihood function using the proportional hazards function, and obtaining a preliminary prognosis risk prediction loss function from the maximum likelihood function,

the maximum likelihood function may be expressed as:

the preliminary prognosis risk prediction loss function may then be expressed as:

S205: adding the data recovery loss $l_r$ to the preliminary prognosis risk prediction loss function to construct the final loss function, expressed as: $l_{TRDN} = (1-\gamma)\,l_r + \gamma\,l_p$, where $0 < \gamma < 1$;

In the present invention, the published cancer multi-omics data $X_p$ and patient clinical information $Y_p$ are obtained from the TCGA public cancer datasets to train the neural network. The patient clinical information $Y_p$ includes the patient's survival time t and status st, where st = 1 indicates that the patient had died at that time point and st = 0 indicates that the patient had not died at that time point.

Before training, the TCGA public cancer data were preprocessed: genes and samples with more than 20% missing values were removed, and the remaining missing values were then imputed with the median.
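The preprocessing rule can be sketched in plain Python as below. Several details are assumptions not stated in the text: genes (columns) are filtered before samples (rows), the median is taken per gene over the retained samples, and `None` marks a missing value. The function name `preprocess` is hypothetical; the toy call raises the threshold to 0.5 only so that a very small matrix still exercises every branch.

```python
from statistics import median

def preprocess(matrix, max_missing=0.2):
    """Drop genes, then samples, whose missing fraction exceeds max_missing,
    and fill the remaining gaps with the per-gene median.
    matrix is a list of sample rows; None marks a missing value."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    keep_cols = [j for j in range(n_cols)
                 if sum(row[j] is None for row in matrix) / n_rows <= max_missing]
    if not keep_cols:
        return [], [], []
    keep_rows = [i for i, row in enumerate(matrix)
                 if sum(row[j] is None for j in keep_cols) / len(keep_cols) <= max_missing]
    kept = [[matrix[i][j] for j in keep_cols] for i in keep_rows]
    for c in range(len(keep_cols)):
        observed = [row[c] for row in kept if row[c] is not None]
        if not observed:
            continue  # nothing to impute from; leave the column untouched
        fill = median(observed)
        for row in kept:
            if row[c] is None:
                row[c] = fill
    return kept, keep_rows, keep_cols

# Toy matrix: 4 samples x 3 genes, threshold relaxed to 0.5 for illustration.
toy = [
    [1.0, None, 5.0],
    [None, None, 6.0],
    [3.0, None, 7.0],
    [5.0, 2.0, None],
]
clean, rows, cols = preprocess(toy, max_missing=0.5)
```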

S3: using the cancer multi-omics data $X_p$ and patient clinical information $Y_p$ of the preprocessed public data set to update the weights θ of the constructed deep neural network, obtaining a pre-trained network $N_p$ based on the public data set;

The specific process is as follows:

the compressed features generated by encoding in the deep neural network are $z = E(X)$, and the new features $X'$ generated by decoding can be expressed as:

the data recovery loss $l_r$ after decoding is calculated, wherein the loss function expression is as follows:

the risk prediction loss in the deep neural network is calculated:

the final loss function is constructed: $l_{TRDN} = (1-\gamma)\,l_r + \gamma\,l_p$, where $0 < \gamma < 1$.

The model is optimized by a stochastic gradient descent algorithm to update the weights θ of the deep neural network, obtaining the pre-trained network $N_p$ based on the public data set.

As shown in FIG. 2, the simulation experiment tested how different improvement mechanisms enhance tumor prognosis performance, comparing a Cox neural network without transfer learning (Deep_surv), a deep Cox network combining the two loss functions (Deep_Cox), a transfer Cox neural network using the pre-training datasets (trans_Cox) and the proposed method TRCN. FIG. 2 shows the C-index values obtained with different amounts of training data. Deep_surv has the lowest C-index on every data set, while Deep_Cox, which uses the combined loss function, performs better than the plain Cox neural network but worse than the other methods. Deep_Cox improved the C-index by an average of 3.7% over Deep_surv, a smaller gain than trans_Cox_all (13.8%) and TRCN (17.9%). Compared with trans_Cox, TRCN improved the C-index on the three types of simulated data by 3.3%, 4.2% and 2.9%, respectively. These results indicate that the combined loss is an effective way to improve predictive performance, and that pre-trained models bring more useful information to the learning task.
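The C-index used throughout these comparisons is the fraction of comparable patient pairs that a model orders correctly. The following is a minimal sketch of the common definition (Harrell's C-index), which is assumed here to be the variant intended; the function name is hypothetical, and tied risk scores are counted as half-concordant by convention.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: among comparable pairs, the share in which the patient
    who died earlier was assigned the higher predicted risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # a pair is comparable only if the earlier time is an observed death
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: the third patient is censored at t = 6.
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.3, 0.5, 0.1],
)  # 4 of 5 comparable pairs are ordered correctly, so 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, which is why differences of a few hundredths between methods are treated as meaningful.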

Table 1. C-index values for predicting bladder cancer prognosis risk by different methods

In Table 1, this example compares the accuracy of different existing methods in predicting bladder cancer prognosis risk on real data, including four conventional methods and four deep learning-based methods. Among the conventional methods, the simple Cox method performs worst, with the lowest C-index (0.525), while Cox with elastic net regularization (Cox-elastic net) achieves the highest C-index, 0.561. The C-index values obtained by the conventional methods are much lower than those obtained by the deep learning-based methods. Among the deep learning-based methods, the Cox model using features reconstructed by an autoencoder (AE-Cox) outperforms the Cox model with a deep neural network (Deep_surv). TRCN without the transfer learning mechanism achieves a higher C-index than both Deep_surv and AE-Cox, which shows that the combined loss function proposed by the invention improves accuracy. The full TRCN achieves the highest C-index among all methods, which shows that transfer learning helps improve model performance.

This example also performed an ablation study on risk prediction from multi-omics data to investigate the contribution of each omics data type to prediction accuracy, as shown in Table 2.

Table 2. Contribution of different omics data in predicting bladder cancer prognosis risk

The results show that, when a single type of omics data is used, mRNA performs best with a C-index of 0.624, and miRNA is lowest at 0.552. CNV and DNA methylation rank second and third, respectively. When one type is removed from the four omics data types used by TRCN, removing mRNA causes the largest decrease, with the C-index falling from 0.643 to 0.599. The decrease in C-index after excluding miRNA was minimal (0.09). These results indicate that mRNA data play the most important role in the prognosis prediction of bladder cancer, while miRNA contributes the least.

S5: using the XGBoost algorithm to select the top 200 gene features by importance coefficient for the target cancer patient, refining the risk prediction network $N_f$ to obtain a final risk prediction model.

The final risk prediction model in step S5 is expressed as:

The final risk prediction model established in this embodiment is verified and analyzed as follows:

In this embodiment, four bladder cancer data sets were downloaded from GEO as independent tests to verify the robustness of the model constructed with the XGBoost method. GSE13507 contains RNA-seq data and survival information collected at Chungbuk National University Hospital for 165 patients with primary bladder cancer. GSE31684 contains data on 93 bladder cancer patients shared by the Dana-Farber Cancer Institute. GSE32894 contains information on 224 bladder cancer patients from the SCIBLU genomics center at Lund University, Sweden. GSE42876 contains information collected at the university of coenzyme on 43 patients with bladder cancer.

Table 3 shows the results of the independent verification. The C-index values on all four data sets are greater than 0.6, which verifies the accuracy of the model in predicting patient risk, and the p values between risk groups are all less than 0.05, which indicates a significant difference between the risk groups. These results demonstrate that the prediction model constructed with the XGBoost algorithm works well on these four datasets.

Table 3. Independent test results of the XGBoost risk prediction model on 4 GEO datasets

Patients can be divided into a high-risk group and a low-risk group based on the median of the predicted patient risk. Differential expression analysis is then carried out between the risk groups to find differentially expressed genes that influence prognosis. A total of 244 genes were identified, of which 90 were downregulated and 154 were upregulated (FIG. 3A). The top 20 differential genes with the highest correlation coefficients are additionally labeled in FIG. 3A. A heat map of the expression of these differential genes is shown in FIG. 3B. According to a review of the literature, 104 of these genes have been shown to be associated with bladder cancer. Beyond these known cancer genes, the results also reveal 140 potential genes, not yet fully studied, that may influence the prognosis of bladder cancer.
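The median split that defines the two risk groups can be sketched as follows. The text does not say how a patient whose risk equals the median is assigned; placing such patients in the low-risk group is an assumption made here, and the patient identifiers and function name are hypothetical.

```python
from statistics import median

def split_by_median_risk(risks):
    """Split patients into high- and low-risk groups at the median predicted risk.
    Patients exactly at the median go to the low-risk group (assumed convention)."""
    cutoff = median(risks.values())
    high = sorted(p for p, r in risks.items() if r > cutoff)
    low = sorted(p for p, r in risks.items() if r <= cutoff)
    return high, low

# Toy predicted risks for four hypothetical patients.
predicted = {"P1": 0.8, "P2": 0.1, "P3": 0.5, "P4": 0.3}
high_risk, low_risk = split_by_median_risk(predicted)
```

The two returned groups are what a downstream differential expression analysis or log-rank test would compare.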

Using these 244 genes, KEGG pathway analysis was performed to find the pathways enriched in differentially expressed genes. A total of 38 KEGG pathways (2 downregulated and 36 upregulated) were found to correlate with the prognosis of bladder cancer. Because far more pathways are upregulated than downregulated, only pathways with more than 4 genes are shown in FIG. 3C. The metabolic pathway is one of the common pathways in cancer and therefore contains the most genes (n = 12). Among these pathways, the PI3K-Akt signaling pathway has the lowest p value. The PI3K-Akt signaling pathway is an important intracellular signal transduction pathway in regulating the cell cycle, and the growth of human bladder cancer cells can be inhibited by regulating it. The analysis also found MAPK signaling pathways, Ras signaling pathways, PPAR signaling pathways, proteoglycans, and cancer pathways, among others. MAPK signaling pathways have also been shown to affect treatment in patients with bladder cancer. These results further demonstrate that the cancer outcomes predicted by TRCN are biologically meaningful.

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data, comprising the following steps:

S2: constructing a deep neural network;

S5: using the XGBoost algorithm to select the top n gene features by importance coefficient for the target cancer patient, refining the risk prediction network $N_f$ to obtain a final risk prediction model.

2. The deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data according to claim 1, wherein the specific process of constructing the deep neural network in step S2 is as follows:

3. The deep learning method according to claim 2, wherein the loss function is expressed as:

4. The deep learning method according to claim 3, wherein the survival function is expressed as: $S(t) = \Pr(T > t)$

where T is the observed survival time of the patient;

the hazard function at time t is:

5. The deep learning method according to claim 4, wherein the proportional hazards function is:

6. The deep learning method according to claim 5, wherein the maximum likelihood function is expressed as:

7. The deep learning method according to claim 6, wherein the preliminary prognosis risk prediction loss function is expressed as:

8. The deep learning method according to claim 7, wherein the final loss function is expressed as: $l_{TRDN} = (1-\gamma)\,l_r + \gamma\,l_p$, where $0 < \gamma < 1$.

9. The deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data according to claim 8, wherein the final risk prediction model in step S5 is expressed as:

denotes the space of regression trees, $q$ the structure of a tree, $T$ the number of leaf nodes in the tree, and $f_k$ a regression tree with structure $q$ and leaf weights $w$.

10. The deep learning method for predicting the prognosis risk of a cancer patient based on multi-omics data according to claim 1, wherein n is set to 200 when selecting the top n gene features by importance coefficient in step S5.