CN114927166A - Pan-cancer multi-omics molecular typing and prognosis model construction method based on the Notch signaling pathway - Google Patents

Pan-cancer multi-omics molecular typing and prognosis model construction method based on the Notch signaling pathway

Info

Publication number
CN114927166A
Authority
CN
China
Prior art keywords
data
cancer
notch
pan
gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210079683.9A
Other languages
Chinese (zh)
Inventor
郭丽
李孙静
任得康
任玉杰
窦宇阳
向阳洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202210079683.9A
Publication of CN114927166A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the field of bioinformatics and discloses a method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway. DNA methylation, mRNA expression profile, miRNA expression and copy number variation data for all Notch pathway genes are extracted from multi-omics data as model features, missing values are imputed and the data are standardized, and all multi-omics features are input into a denoising autoencoder network to obtain representative features. Univariate Cox-PH analysis is performed on the representative features to obtain features significantly associated with survival time, K-means clustering is performed on these features, the samples are divided into a high-risk group and a low-risk group according to the median risk score, and Kaplan-Meier survival analysis is performed on the two groups. The multi-omics features of the Notch signaling pathway separate patients into two subtypes and have prognostic value in multiple cancers.

Description

Pan-cancer multi-omics molecular typing and prognosis model construction method based on the Notch signaling pathway
Technical Field
The invention relates to a method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway, and belongs to the field of bioinformatics.
Background
The Notch signaling pathway is a highly conserved cellular signaling system that is present in most multicellular organisms. The Notch signaling pathway plays a key role in the regulation of many fundamental cellular processes, such as proliferation, stem cell maintenance and differentiation during embryonic and adult development. Many studies have shown that Notch signaling pathways are involved in the development of a variety of tumors, including pancreatic cancer, prostate cancer, breast cancer, lung cancer, sarcoma, cervical cancer, melanoma, head and neck cancer, renal cancer, and gastroenteropancreatic neuroendocrine tumors, all with deregulated expression of wild-type Notch receptors and Notch ligands.
The Notch signaling pathway plays an important role in cell proliferation, differentiation and apoptosis, and the occurrence of many tumors is related to abnormalities of this pathway. The function of Notch differs between tumors and between stages of the same tumor, which indicates that the Notch signaling pathway is very complex and that extensive research is still needed to explain its mechanism. Blocking the Notch signaling pathway by targeting Notch receptors, ligands and various modifying molecules in order to exert an antitumor effect has attracted wide attention as a new approach to tumor therapy.
A number of studies have found evidence of Notch pathway activity in a variety of cancers. However, previous studies of the Notch signaling pathway have focused on its role as a single component in a small number of tumor settings, and to date the overall molecular characteristics of the pathway in cancer have not been described, so knowledge of the pathway is used inefficiently in oncology. With the rapid development of next-generation sequencing technologies, several large-scale genomics projects have provided large amounts of multi-omics data for many cancer types, including transcriptomics, genomics and proteomics data, creating an unprecedented opportunity to study comprehensively the molecular mechanisms that drive the Notch signaling pathway. Comparative and integrated studies of different types of omics data therefore help to reveal the regulatory mechanisms of the Notch pathway in different cancers.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway. The method helps to identify the drivers and regulators that affect the Notch signaling pathway, explores the role of multi-omics features in prognosis, and provides knowledge and a reference for tumor diagnosis, prognosis and treatment, in particular for the potential application of anticancer drugs.
To achieve this purpose, the invention adopts the following technical scheme:
A method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway comprises the following steps:
Extracting, from multi-omics data, the DNA methylation, mRNA expression profile, miRNA expression and copy number variation data of all Notch pathway genes as model features; imputing missing values and standardizing the data to obtain standardized high-dimensional multi-omics feature data; inputting all multi-omics features into a denoising autoencoder network to obtain representative features; performing univariate Cox-PH analysis on the representative features to obtain features significantly associated with survival time; performing K-means clustering on these features; dividing the samples into a high-risk group and a low-risk group according to the median risk score; and performing Kaplan-Meier survival analysis on the two groups.
Further, each gene satisfies any one of the following conditions: (a) the gene has previously been associated with cancer research; (b) the gene is directly involved in the binding and regulation of Notch function; or (c) the gene has a phenotypic association with Notch family members, such that its mutation or deletion produces a phenotype similar to loss of Notch family function.
Further, samples are screened for each feature type, and the somatic mutation data are screened as follows: the MC3 TCGA MAF file is mined to determine the non-silent mutation frequency of each gene in each cancer and in the pan-cancer context; hypermutated samples carrying 1,000 somatic mutations are removed; and, for each data type, tumor samples from the same patient are screened.
Further, the data are screened as follows: for somatic mutation data, only mutation classes corresponding to non-silent mutations are retained; for DNA methylation data, probes that are significantly negatively correlated with mRNA data in each cancer are selected, and these feature data are retained; for miRNA data, the Spearman correlation between miRNA and gene expression levels is calculated for each cancer and the results are then filtered against the records in miRTarBase v6.0; only interaction pairs that appear in ≥ 3 cancer types are considered high-confidence miRNA-Notch regulators.
Further, within each cancer dataset, patients for whom more than 20% of the features are missing are excluded.
Further, missing-value processing and standardization are carried out as follows: statistical analysis and quantification of the data are performed in R, missing values are filled with the median, and the feature data and the Notch pathway scores are standardized to a mean of 0 and a standard deviation of 1, giving standardized multi-omics data for the Notch signaling pathway core genes in multiple cancers.
Further, log2-transformed mRNA and miRNA expression data are used as features, and gene-level features are extracted and quantified using the average copy number of each gene or the average DNA methylation of its CpG sites.
Further, the standardized high-dimensional multi-omics feature data are input into the DAE network for training; the feature data are encoded into lower-dimensional representative features, and the network weights are updated iteratively by minimizing a loss function.
Further, the samples are divided into two classes, which are used to build a multivariate Cox proportional hazards regression model based on the LASSO algorithm and cross-validation.
Further, the K value of the K-means clustering is 2.
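The K-means step with K = 2 can be sketched in plain Python as follows. This is only an illustration of Lloyd's algorithm under stated assumptions: the function name, the toy points and the deterministic initialization from the first k points are hypothetical and are not taken from the patent, which does not say how the clustering is initialized or implemented.

```python
import math

def kmeans(points, k=2, iters=100):
    """Lloyd's algorithm; starts from the first k points (an assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels, centers

# Toy survival-related features for five hypothetical samples.
features = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
labels, centers = kmeans(features, k=2)
```

In practice the initialization would usually be randomized and repeated several times; the fixed start here only keeps the example reproducible.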
Advantageous effects: most methods for predicting cancer prognosis analyze the expression data of a single omics layer, such as mRNA expression data, methylation data or miRNA data. However, a patient's prognosis is co-regulated by molecules at several different levels that interact with each other. The analysis of single-omics data can therefore provide only one-sided information, whereas analyzing data at different molecular levels mitigates, through error cancellation, the excessive sensitivity of single-omics methods to noise. For this reason, integrating several data types for cancer analysis has become a powerful tool in recent years.
The greatest difficulty in fusing multi-omics data is how to reduce the dimensionality of high-dimensional omics data effectively when only a small number of cancer samples is available. A denoising autoencoder can be used for data denoising, dimensionality reduction for visualization and similar tasks, and through encoding and decoding it outputs a robust, denoised vector of small size. The invention therefore provides a method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway, which aims to solve the problems of insufficient sample size and low accuracy of prognosis models.
The invention discloses a method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway. The method consists mainly of a multi-omics analysis stage and a prognostic typing stage. First, DNA methylation, mRNA expression profile, miRNA expression and copy number variation data for all Notch pathway genes are extracted from the multi-omics data as model features, missing values are imputed and the data are standardized, and all multi-omics features are input into a denoising autoencoder (DAE) network to obtain representative features. Second, these features are combined with TCGA clinical information in a univariate Cox-PH regression analysis to screen for survival-related features. Finally, a prognosis model is built from the survival-related features using LASSO regression and Cox regression analysis, the risk of each patient is evaluated, and the patient samples are divided into a high-risk group and a low-risk group according to the median risk score; survival analysis shows that the high-risk group has poor overall survival (OS). K-means clustering is also performed on the survival-related features with K set to 2, and survival analysis shows that the two groupings agree closely, which indicates that the multi-omics features of the Notch signaling pathway separate patients into two subtypes and have prognostic value in multiple cancers.
The invention provides a method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway. Based on high-throughput sequencing data, pan-cancer multi-omics data for the Notch pathway genes are integrated and screened by bioinformatics methods, so that the characteristics of the Notch signaling pathway in different types of omics data in a specific cancer type are uncovered and combined with survival data to assess the prognostic value of the pathway.
The invention is based on a deep learning network that optimizes a loss function by iteratively updating its weights, selects representative features, and combines the multi-omics features with survival data to build a prognostic typing model. The molecular drivers and regulators that affect the Notch signaling pathway are identified and the role of multi-omics features in prognosis is explored; the results provide knowledge and a reference for tumor diagnosis, prognosis and treatment, in particular for the potential application of anticancer drugs.
Drawings
FIG. 1 is a flow chart of the method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway, as implemented by the present invention;
FIG. 2 shows the BLCA, BRCA, HNSC and KIRP results of the method;
FIG. 3 shows the KICH, KIRC and LAML results of the method;
FIG. 4 shows the LGG, LIHC, LUAD and MESO results of the method;
FIG. 5 shows the OV, SARC, SKCM and STAD results of the method;
FIG. 6 shows the THCA and UCEC results of the method.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
A method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway comprises the following steps:
Step 1: Notch signaling pathway gene sets are retrieved from four public databases (MSigDB, KEGG, BioMart and AmiGO) and screened against the existing literature to obtain 22 core genes. mRNA expression profile, miRNA expression profile, DNA methylation, mutation and copy number variation data for 33 cancers are downloaded from TCGA and preprocessed into a matrix format comprising gene names, sample names and the corresponding quantitative data. Clinical data for the pan-cancer patients are downloaded from TCGA and preprocessed into a matrix format comprising patient names and the corresponding clinical data.
The initial gene list was generated by searching four public sources for the keyword "Notch": (i) BIOCARTA_NOTCH_PATHWAY from GSEA (http://software.broadinstitute.org/gsea/msigdb/cards/BIOCARTA_NOTCH_PATHWAY), (ii) KEGG_NOTCH_SIGNALING_PATHWAY from GSEA (http://software.broadinstitute.org/gsea/msigdb/cards/KEGG_NOTCH_SIGNALING_PATHWAY), (iii) the full gene set of GO:0007219 from BioMart, and (iv) the subset of GO:0007219 (filtered by "experimental approach") from AmiGO. An extensive literature search was then performed to retain only genes that (a) had previously been associated with cancer research, or (b) are directly involved in the binding and regulation of Notch function, or (c) have a phenotypic association with Notch family members, such that mutation or deletion of the gene produces a phenotype similar to loss of Notch family function. This yielded 22 core Notch signaling pathway genes: 5 canonical ligands, 4 non-canonical ligands, 4 Notch receptors, 1 transcription factor and 8 transcriptional coactivators.
Step 2: Valid data are screened from the multi-omics data of step 1. Following previous research, 9125 tumor samples are selected to ensure that the data are valid. The screening criterion is that, within each cancer dataset, patients for whom more than 20% of the features are missing are excluded.
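The 20% missing-feature criterion can be illustrated with a short sketch. It assumes the reading that a patient is dropped when more than 20% of that patient's features are missing; the function name, the dictionary layout and the patient identifiers are hypothetical and serve only as an example.

```python
def drop_sparse_patients(matrix, max_missing=0.20):
    """Keep patients whose fraction of missing features is at most max_missing.

    matrix maps a patient ID to a list of feature values, with None marking
    a missing value (an assumed encoding for this sketch).
    """
    kept = {}
    for patient, values in matrix.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction <= max_missing:
            kept[patient] = values
    return kept

# Hypothetical patients with five features each.
cohort = {
    "P1": [1.0, None, 2.0, 3.0, 4.0],   # 20% missing: kept
    "P2": [None, None, 1.0, 2.0, 3.0],  # 40% missing: excluded
    "P3": [1.0, 2.0, 3.0, 4.0, 5.0],    # complete: kept
}
filtered = drop_sparse_patients(cohort)
```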
Samples are screened for each feature type. 1) Somatic mutation data are screened as follows: the frequency of non-silent mutations of each gene in each cancer and in the pan-cancer context is determined by mining the MC3 TCGA MAF file (9125 patients from 33 cancer types, covering the pan-cancer pathway analysis cohort). To prevent false positives, I) for all cancer types only mutations marked "PASS" in the "FILTER" column are retained, except for OV (ovarian serous cystadenocarcinoma) and LAML (acute myeloid leukemia), for which "wga" mutations may additionally be retained; II) hypermutated samples carrying 1,000 somatic mutations are removed. 2) For each data type, tumor samples from the same patient are screened.
Genes are then screened for each feature type.
1) Somatic mutation data are screened as follows: only non-silent mutation classes are retained. The mutation classes "Intron", "Silent", "3'UTR", "5'UTR", "3'Flank" and "5'Flank" are defined as silent mutations; all others are non-silent mutations.
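The mutation filters described above can be sketched as follows. The column names Hugo_Symbol, Variant_Classification, Tumor_Sample_Barcode and FILTER follow the usual MAF layout; the function names, the toy records and the assumption that the FILTER field holds a single value are illustrative only and are not specified in the patent.

```python
# Mutation classes that the text defines as silent.
SILENT_CLASSES = {"Intron", "Silent", "3'UTR", "5'UTR", "3'Flank", "5'Flank"}
# Cancer types for which "wga" calls may additionally be retained.
WGA_ALLOWED = {"OV", "LAML"}

def keep_mutation(record, cancer):
    """True if the call passes the FILTER rule and is non-silent."""
    allowed = {"PASS", "wga"} if cancer in WGA_ALLOWED else {"PASS"}
    return (record["FILTER"] in allowed
            and record["Variant_Classification"] not in SILENT_CLASSES)

def non_silent_frequency(records, gene, cancer, n_samples):
    """Fraction of samples with at least one retained mutation in the gene."""
    mutated = {
        r["Tumor_Sample_Barcode"]
        for r in records
        if r["Hugo_Symbol"] == gene and keep_mutation(r, cancer)
    }
    return len(mutated) / n_samples

# Hypothetical MAF-like records.
maf = [
    {"Hugo_Symbol": "NOTCH1", "Variant_Classification": "Missense_Mutation",
     "Tumor_Sample_Barcode": "S1", "FILTER": "PASS"},
    {"Hugo_Symbol": "NOTCH1", "Variant_Classification": "Silent",
     "Tumor_Sample_Barcode": "S2", "FILTER": "PASS"},
    {"Hugo_Symbol": "NOTCH1", "Variant_Classification": "Nonsense_Mutation",
     "Tumor_Sample_Barcode": "S3", "FILTER": "wga"},
]
freq = non_silent_frequency(maf, "NOTCH1", "BRCA", n_samples=4)
```

Removing hypermutated samples would be a separate per-sample count on the same records and is left out to keep the sketch short.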
2) For copy number variation data, copy number values of the 22 genes in the 9125 patients are obtained from TCGA.
3) For DNA methylation data, probes that are significantly negatively correlated with mRNA data in each cancer are selected, and these feature data are retained.
4) For miRNA data, the Spearman correlation between miRNA and gene expression levels is calculated in each cancer, and significant negative correlations (q < 10^-5 and Rs < -0.25) are selected. The results are then filtered against the records in miRTarBase v6.0, retaining both the stronger and the weaker functional interactions. Only interaction pairs that appear in ≥ 3 cancer types are considered high-confidence miRNA-Notch regulators.
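The correlation and recurrence filters for miRNA data can be sketched in plain Python. The sketch covers only the Spearman coefficient (Rs) and the "≥ 3 cancer types" rule; the q-value computation and the miRTarBase lookup are omitted, and all function names and miRNA identifiers below are hypothetical placeholders.

```python
from collections import Counter

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (sx * sy)

def recurrent_pairs(hits_by_cancer, min_cancers=3):
    """Pairs reported as negatively correlated in at least min_cancers cancers."""
    counts = Counter(pair for pairs in hits_by_cancer.values() for pair in pairs)
    return {pair for pair, n in counts.items() if n >= min_cancers}

# Hypothetical per-cancer hits after the Rs < -0.25 filter.
hits = {
    "BRCA": {("miR-X", "NOTCH1"), ("miR-Y", "JAG1")},
    "LUAD": {("miR-X", "NOTCH1")},
    "STAD": {("miR-X", "NOTCH1"), ("miR-Y", "JAG1")},
}
trusted = recurrent_pairs(hits)
```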
Within each cancer dataset, patients for whom more than 20% of the features are missing are excluded.
Step 3: Statistical analysis and quantification of the data processed in step 2 are carried out in R. Log2-transformed mRNA and miRNA expression data are used as features, and gene-level features are extracted using the average copy number of each gene or the average DNA methylation of its CpG sites. Missing values are filled with the median, estimated with the R package 'imputeMissings'. The feature data and the Notch pathway scores are standardized to a mean of 0 and a standard deviation of 1, giving standardized high-dimensional multi-omics feature data for the 22 Notch signaling pathway core genes in multiple cancers.
Table 1. 22 Notch signaling pathway core genes
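Median imputation followed by standardization to mean 0 and standard deviation 1 can be sketched as below. The patent performs this step in R; the Python version here is only an illustration, the function names are hypothetical, and the use of the population standard deviation is an assumption (the text does not say which estimator is used).

```python
import statistics

def impute_median(column):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in column]

def standardize(column):
    """Rescale one feature to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        return [0.0 for _ in column]  # constant feature carries no information
    return [(v - mean) / sd for v in column]

# One hypothetical feature measured in five samples, with one value missing.
raw = [2.0, None, 4.0, 6.0, 8.0]
z = standardize(impute_median(raw))
```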
Notch pathway core gene
JAG1
JAG2
DLL1
DLL3
DLL4
NOV
CNTN1
CNTN6
DNER
NOTCH1
NOTCH2
NOTCH3
NOTCH4
RBPJ
MAML1
MAML2
MAML3
CREBBP
EP300
KAT2A
KAT2B
SNW1
Step 4: The standardized high-dimensional multi-omics feature data obtained in step 3 are input into the DAE network for training. The data are encoded into lower-dimensional representative features, and the network weights are updated iteratively by minimizing a loss function.
The principle of the algorithm is as follows:
Let x = (x1, x2, ..., xn) be the input feature list. The autoencoder (AE) encodes x into a lower-dimensional representative feature and decodes it into the output x', which has the same size as x. The mean squared error (MSE) measures the difference between the input x and the output x':
Figure RE-GDA0003752825140000071
A denoising autoencoder (DAE) constructs corrupted data by adding noise to the high-dimensional features and restores the original input through encoding and decoding. The DAE allows a deep neural network to build an informative and robust low-dimensional representation. The input is:
Figure RE-GDA0003752825140000081
The loss function of the DAE is:
Figure RE-GDA0003752825140000082
where f_e(·) denotes the encoding of the input features into compressed features and f_d(·) denotes the decoding of the data. To avoid overfitting the high-dimensional features, an L2 regularization penalty term is added, giving the loss function:
Figure RE-GDA0003752825140000083
where γ is the coefficient of the L2-norm regularization penalty, F_{1→i} is the node activity in the deep neural network, and K is the total number of layers (input, output and hidden layers).
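The formulas referenced above survive only as image placeholders in this text. One plausible reconstruction, based solely on the symbols defined in the surrounding prose, is given below. The corruption notation q(x̃ | x), the placement of the penalty on the activities of the corrupted input, and the squared norm are assumptions, so the expressions should be read as a sketch rather than as the patent's exact equations.

```latex
% Reconstruction error between input x and output x'
\mathrm{MSE}(x, x') = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - x'_i \right)^2

% Corrupted input produced by a noise process q (assumed notation)
\tilde{x} \sim q(\tilde{x} \mid x)

% DAE loss: reconstruct the clean input from the corrupted one
L(x, x') = \mathrm{MSE}\!\left( x,\; f_d\!\left( f_e(\tilde{x}) \right) \right)

% DAE loss with an L2 penalty on the node activities of all K layers
L_{\mathrm{reg}}(x, x') = \mathrm{MSE}\!\left( x,\; f_d\!\left( f_e(\tilde{x}) \right) \right)
  + \gamma \sum_{i=1}^{K} \left\lVert F_{1 \to i}(\tilde{x}) \right\rVert_2^2
```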
Step 5: Univariate Cox-PH analysis is performed on the representative features obtained in step 4 to obtain features significantly associated with survival time. K-means clustering is performed on these features, the optimal K is determined to be 2, and the samples are divided into two classes, which are used to build a multivariate Cox proportional hazards regression model based on the LASSO algorithm and cross-validation (CV). Harrell's concordance index (C-index) is used to evaluate the predictive power of the model: a higher C-index indicates better prediction, and 0.50 corresponds to random prediction. The samples are divided into a high-risk group and a low-risk group according to the median risk score, Kaplan-Meier survival analysis is performed on the two groups, and the log-rank test value between the two groups is calculated.
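Harrell's C-index can be illustrated with a minimal sketch. It counts, among comparable patient pairs, how often the patient who died earlier was assigned the higher risk score. The function name and the toy data are hypothetical, and pairs with tied survival times are simply ignored, which is a simplification.

```python
def c_index(times, events, risks):
    """Harrell's concordance index for right-censored survival data.

    times:  observed follow-up times
    events: 1 if death was observed, 0 if the patient was censored
    risks:  model risk scores (higher means worse predicted prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is a death
        for j in range(n):
            if times[j] > times[i]:  # patient j outlived patient i
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable

# Four hypothetical patients; the third one is censored.
value = c_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1], risks=[0.9, 0.7, 0.4, 0.1])
```

A value of 1.0 means every comparable pair is ordered correctly, and 0.5 matches the random-prediction baseline mentioned in the text.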
The multivariate Cox-PH (Cox proportional hazards) model is defined as:
h(t | X_i) = h_0(t) θ_i
where h(t | X_i) is the hazard function at time t given the covariates X_i, h_0(t) is the baseline hazard function describing how the risk changes over time t, and θ_i = exp(βX_i) describes how the risk varies with the coefficient vector β and the covariate vector X_i of patient i. The probability that patient i dies at time t_i is:
Figure RE-GDA0003752825140000084
where the sum over j in the denominator covers the risks of all individuals at risk at time t_i.
The corresponding log-likelihood function is:
Figure RE-GDA0003752825140000085
where θ_j is the risk term of the j-th sample,
Figure RE-GDA0003752825140000091
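For reference, the standard Cox partial-likelihood expressions that match the verbal description above are sketched here. They are the textbook forms, not a transcription of the missing equation images; the event indicator C_i (1 if the death of patient i was observed, 0 if censored) is an assumed symbol.

```latex
% Probability that patient i is the one who dies at time t_i,
% given the set of patients still at risk at t_i
P(\text{patient } i \text{ dies at } t_i) =
  \frac{\theta_i}{\sum_{j:\, t_j \ge t_i} \theta_j},
\qquad \theta_i = \exp(\beta X_i)

% Corresponding log partial likelihood over observed deaths
\ell(\beta) = \sum_{i:\, C_i = 1}
  \left( \beta X_i - \log \sum_{j:\, t_j \ge t_i} \theta_j \right)
```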
Step 6: Prognosis models are built from the standardized single-omics feature data obtained in step 3 and from their different combinations, and the C-index of each is calculated. With a single type of omics data, methylation performs best, with a C-index of 0.677, and miRNA performs worst, with a C-index of 0.635; mRNA and CNV rank second and third, respectively. When one data type at a time is removed from the four omics data types, removing mRNA causes the largest decrease, from 0.679 to 0.648, whereas removing methylation causes the smallest decrease, 0.008. The prognosis model that integrates all four omics data types has the highest C-index, 0.679. These results indicate that combining multi-omics data, as proposed in the invention, improves accuracy.
Table 2. C-indices for cross-validation testing of 17 cancers
Figure RE-GDA0003752825140000092
Figure RE-GDA0003752825140000101
Table 3. C-index for different data types
Figure RE-GDA0003752825140000102
Example 2
As shown in FIG. 1, the data screening strategy used to implement the method for constructing a pan-cancer multi-omics molecular typing and prognosis model based on the Notch signaling pathway comprises the following steps:
Step 1) Multi-omics data are extracted from TCGA and preprocessed into a matrix format comprising gene names, sample names and the corresponding quantitative data.
The multi-omics data comprise large-scale medical data on gene mutation, mRNA expression, miRNA expression, DNA methylation and copy number variation, and can be analyzed for a single cancer type or across cancers. Each feature type of each cancer type is integrated and processed into a data matrix whose row names are gene names and whose column names are sample numbers.
Step 2) Valid data are screened from the multi-omics data of step 1).
Samples are screened for each feature type. Somatic mutation data are screened as follows: the frequency of non-silent mutations of each gene in each cancer and in the pan-cancer context is determined by mining the MC3 TCGA MAF file (9125 patients from 33 cancer types, covering the pan-cancer pathway analysis cohort). To prevent false positives, I) for all cancer types only mutations marked "PASS" in the "FILTER" column are retained, except for OV and LAML, for which "wga" mutations may additionally be retained; II) hypermutated samples carrying 1,000 somatic mutations are removed. For each data type, tumor samples from the same patient are screened.
Genes are screened for each feature type. Somatic mutation data are screened as follows: only non-silent mutation classes are retained. The mutation classes "Intron", "Silent", "3'UTR", "5'UTR", "3'Flank" and "5'Flank" are defined as silent mutations; all others are non-silent mutations. For copy number variation data, copy number values of the 22 genes are obtained from TCGA.
For DNA methylation data, probes that are significantly negatively correlated with mRNA data in each cancer are selected, and these feature data are retained.
For miRNA data, the Spearman correlation between miRNA and gene expression levels is calculated in each cancer, and significant negative correlations (q < 10^-5 and Rs < -0.25) are selected. The results are then filtered against the records in miRTarBase v6.0, retaining both the stronger and the weaker functional interactions. Only interaction pairs that appear in ≥ 3 cancer types are considered high-confidence miRNA-Notch regulators.
Step 3) The multi-omics feature data integrated in step 2) are standardized. Specifically, all missing values in the miRNA and mRNA expression data are estimated with the R package 'imputeMissings', and the data are log2-transformed for downstream analysis. The mean DNA methylation beta value in each cancer is used as a model feature, as is the average copy number. The feature data are standardized to a mean of 0 and a standard deviation of 1.
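The log2 transform and the gene-level averaging of CpG beta values can be sketched as follows. The pseudocount of 1 used to avoid log2(0), the function names and the probe identifiers are assumptions introduced for illustration; the patent states only that expression data are log2-processed and that methylation is averaged.

```python
import math

def log2_expression(values, pseudocount=1.0):
    """log2(x + pseudocount) for each expression value (pseudocount assumed)."""
    return [math.log2(v + pseudocount) for v in values]

def gene_level_mean(probe_values, probe_to_gene):
    """Average per-probe values (e.g. CpG beta values) into one value per gene."""
    sums, counts = {}, {}
    for probe, value in probe_values.items():
        gene = probe_to_gene[probe]
        sums[gene] = sums.get(gene, 0.0) + value
        counts[gene] = counts.get(gene, 0) + 1
    return {gene: sums[gene] / counts[gene] for gene in sums}

# Hypothetical probe IDs mapped to two of the core genes.
betas = {"cg_a": 0.2, "cg_b": 0.4, "cg_c": 0.9}
mapping = {"cg_a": "NOTCH1", "cg_b": "NOTCH1", "cg_c": "JAG1"}
methylation_by_gene = gene_level_mean(betas, mapping)
expression = log2_expression([0.0, 1.0, 3.0])
```

The same averaging helper applies to segment-level copy number values when they are mapped to genes.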
Step 4) The standardized feature data are input into the DAE network for training; the weights are updated iteratively to minimize the loss function, and the data dimension is compressed to obtain representative feature data.
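The two ingredients that distinguish a DAE from a plain autoencoder, input corruption and a loss measured against the clean input, can be sketched without a deep learning framework. Masking noise is an assumption (the patent says only that noise is added), and the function names, the activation lists and the value of gamma are hypothetical. The encoder and decoder themselves are not implemented here.

```python
import random

def corrupt(x, drop_rate, rng):
    """Masking noise: zero each feature independently with probability drop_rate."""
    return [0.0 if rng.random() < drop_rate else v for v in x]

def mse(x, x_rec):
    """Mean squared error between the clean input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

def dae_loss(x, x_rec, activations, gamma):
    """Reconstruction error plus an L2 penalty on the layer activities."""
    penalty = sum(a * a for layer in activations for a in layer)
    return mse(x, x_rec) + gamma * penalty

rng = random.Random(0)
clean = [0.5, -1.2, 0.3, 2.0]
noisy = corrupt(clean, drop_rate=0.3, rng=rng)  # this is what the encoder would see
# A trained network would produce x_rec and the activations; fixed toy values here.
loss = dae_loss(clean, [0.4, -1.0, 0.2, 1.8], [[0.1, -0.2], [0.3]], gamma=0.01)
```

The key design point is that the loss compares the reconstruction with the clean vector, not the corrupted one, which is what forces the network to learn a noise-robust representation.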
Step 5) The representative feature data are combined with survival data, and univariate Cox-PH analysis is performed to obtain features significantly associated with survival time. K-means clustering is performed on these features, the optimal K is determined to be 2, and the samples are divided into two classes, which are used to build a multivariate Cox proportional hazards regression model based on the LASSO algorithm and cross-validation (CV). The samples are divided into a high-risk group and a low-risk group according to the median risk score, Kaplan-Meier survival analysis is performed on the two groups, and the log-rank test value between the two groups is calculated.
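The median split and the Kaplan-Meier estimate used in this step can be sketched as below. The function names, the patient identifiers and the convention that scores equal to the median fall into the low-risk group are assumptions; the log-rank test is not implemented in this sketch.

```python
import statistics

def split_by_median(risk_scores):
    """Label each patient 'high' or 'low' relative to the median risk score."""
    cutoff = statistics.median(risk_scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in risk_scores.items()}

def kaplan_meier(times, events):
    """Return [(t, S(t))] evaluated at each distinct observed death time."""
    curve, survival = [], 1.0
    death_times = sorted({t for t, e in zip(times, events) if e})
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical risk scores and follow-up data (event 0 means censored).
groups = split_by_median({"P1": 0.1, "P2": 0.5, "P3": 0.9, "P4": 0.3})
curve = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
```

Computing one curve per risk group and comparing them is what the Kaplan-Meier analysis of the two groups amounts to.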

Claims (10)

1. A method for constructing a pan-cancer multi-group chemical molecular typing and prognosis model based on a Notch signal pathway is characterized by comprising the following steps:
extracting data containing all Notch pathway gene DNA methylation, mRNA expression profiles, miRNA expression and copy number variation from the multigroup mathematical data as model characteristic values, carrying out deletion value processing and standardization processing to obtain standardized high-dimensional multigroup mathematical characteristic data, and inputting all characteristics containing the multigroup mathematical data into a noise reduction self-encoder network to obtain representative characteristics; and carrying out univariate Cox-PH analysis on the obtained representative characteristics to obtain characteristic data obviously related to survival time, carrying out K-means clustering on the characteristics, dividing the sample into a high risk group and a low risk group according to median risk score, and carrying out Kaplan-Meier survival analysis on the two groups.
2. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 1, wherein each gene meets any one of the following conditions: (a) the gene has previously been implicated in cancer research; (b) the gene is directly involved in binding to and regulating Notch function; (c) the gene is phenotypically associated with Notch family members, such that its mutation or deletion produces a phenotype similar to loss of function of the Notch family.
3. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 1, wherein sample screening is performed for each feature value, and the somatic mutation data are screened as follows: mining the MC3 TCGA MAF file to determine the non-silent mutation frequency of each gene in each cancer and in the pan-cancer context; removing hypermutated samples carrying 1,000 somatic mutations; and, for each type of data, screening tumor samples from the same patient.
4. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 1, wherein the data are screened as follows: for somatic mutation data, only the base mutation classes in which non-silent mutations occur are retained; for DNA methylation data, the probes in each cancer that are significantly negatively correlated with the mRNA data are selected, and these feature data are not deleted; for miRNA data, the Spearman correlation between miRNA and gene expression levels is calculated for each cancer and then filtered against the records in miRTarBase v6.0; only the interaction pairs that appear in ≥3 cancer types are considered high-confidence miRNA-Notch regulators.
5. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 1, wherein patients missing more than 20% of the features are excluded from the data of each cancer.
6. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 1, wherein the missing value processing and standardization are: performing statistical analysis and quantification of the data in the R language, filling missing values with the median, and standardizing the feature value data and the Notch pathway scores, wherein all features are standardized to a mean of 0 and a standard deviation of 1, thereby obtaining standardized multi-omics data for the Notch signaling pathway core genes in multiple cancers.
7. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 6, wherein the log2-transformed mRNA and miRNA expression data are used as feature values, and the average copy number, or the average DNA methylation over the CpG sites of each gene, is used to extract the respective gene-level features for quantification.
8. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 1, wherein the standardized high-dimensional omics feature data are input into the DAE network for training and encoded into representative features of smaller dimension, the weights being updated continuously by computing the loss function.
9. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 1, wherein the samples are divided into two classes, which are used to build a multivariate Cox proportional hazards regression model based on the LASSO algorithm and cross-validation.
10. The method for constructing a pan-cancer multi-component molecular typing and prognosis model based on the Notch signaling pathway of claim 1, wherein the K value of the K-means clustering is 2.
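The K-means clustering with K = 2 recited in claims 1 and 10 can be illustrated with a minimal pure-Python version of Lloyd's algorithm. The deterministic initialization (first point plus the point farthest from it) and the toy points are assumptions for this example; the claims do not say how the clustering is initialized or how K = 2 was selected.

```python
def kmeans_two(points, max_iter=100):
    """Cluster points into two groups with Lloyd's algorithm (K = 2)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Deterministic start: the first point and the point farthest from it.
    first = points[0]
    second = max(points, key=lambda p: dist2(p, first))
    centroids = [list(first), list(second)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            0 if dist2(p, centroids[0]) <= dist2(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical survival-associated features for six samples.
features = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
]
labels, centroids = kmeans_two(features)
```

The resulting labels give the two sample classes on which the multivariate Cox model of claim 9 would then be built.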
CN202210079683.9A 2022-01-24 2022-01-24 Pan-cancer multi-component molecular typing and prognosis model construction method based on Notch signal pathway Pending CN114927166A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210079683.9A CN114927166A (en) 2022-01-24 2022-01-24 Pan-cancer multi-component molecular typing and prognosis model construction method based on Notch signal pathway

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210079683.9A CN114927166A (en) 2022-01-24 2022-01-24 Pan-cancer multi-component molecular typing and prognosis model construction method based on Notch signal pathway

Publications (1)

Publication Number Publication Date
CN114927166A true CN114927166A (en) 2022-08-19

Family

ID=82805635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210079683.9A Pending CN114927166A (en) 2022-01-24 2022-01-24 Pan-cancer multi-component molecular typing and prognosis model construction method based on Notch signal pathway

Country Status (1)

Country Link
CN (1) CN114927166A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631799A (en) * 2022-12-20 2023-01-20 深圳先进技术研究院 Sample phenotype prediction method and device, electronic equipment and storage medium
CN115985388A (en) * 2022-12-27 2023-04-18 上海人工智能创新中心 Multi-group chemical integration method and system based on preprocessing noise reduction and biological center rule
CN115985388B (en) * 2022-12-27 2024-05-28 上海人工智能创新中心 Multi-group-study integration method and system based on preprocessing noise reduction and biological center rule
CN116153424A (en) * 2023-04-18 2023-05-23 北京概普生物科技有限公司 Monogenic pan-cancer prognosis analysis system and analysis method
CN116153424B (en) * 2023-04-18 2023-06-23 北京概普生物科技有限公司 Monogenic pan-cancer prognosis analysis system and analysis method

Similar Documents

Publication Publication Date Title
CN114927166A (en) Pan-cancer multi-component molecular typing and prognosis model construction method based on Notch signal pathway
Zhou et al. Relapse-related long non-coding RNA signature to improve prognosis prediction of lung adenocarcinoma
Haibe-Kains et al. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all?
Archer et al. L1 penalized continuation ratio models for ordinal response prediction using high‐dimensional datasets
Milanez-Almeida et al. Cancer prognosis with shallow tumor RNA sequencing
CN112048559A (en) Model construction and clinical application of m6A-related lncRNA network gastric cancer prognosis-based model
CN110714078B (en) Marker gene for colorectal cancer recurrence prediction in stage II and application thereof
CN112626218A (en) Gene expression classifier and in-vitro diagnosis kit for predicting pancreatic cancer metastasis risk
CN109801681B (en) SNP (Single nucleotide polymorphism) selection method based on improved fuzzy clustering algorithm
Nguyen et al. Identification and validation of a novel three hub long noncoding RNAs with m6A modification signature in low-grade gliomas
Chang et al. Predicting colorectal cancer microsatellite instability with a self-attention-enabled convolutional neural network
CN112037863B (en) Early NSCLC prognosis prediction system
Ming et al. Integrated analysis of gene co-expression network and prediction model indicates immune-related roles of the identified biomarkers in sepsis and sepsis-induced acute respiratory distress syndrome
Zhang et al. Weighted gene co-expression network analysis identified a novel thirteen-gene signature associated with progression, prognosis, and immune microenvironment of colon adenocarcinoma patients
Jia et al. ChrNet: a re-trainable chromosome-based 1D convolutional neural network for predicting immune cell types
Maden et al. recountmethylation enables flexible analysis of public blood DNA methylation array data
Zhou et al. Identification of subtype-specific genes signature by WGCNA for prognostic prediction in diffuse type gastric cancer
CN113234823B (en) Pancreatic cancer prognosis risk assessment model and application thereof
Fan et al. Genetic cross‐talk between oral squamous cell carcinoma and type 2 diabetes: the potential role of immunity
Yan et al. Identification of immune-related molecular clusters and diagnostic markers in chronic kidney disease based on cluster analysis
Luyapan et al. A new efficient method to detect genetic interactions for lung cancer GWAS
Yi et al. Identification of four novel prognostic biomarkers and construction of two nomograms in adrenocortical carcinoma: a multi-omics data study via bioinformatics and machine learning methods
Ma et al. Identification of non-Hodgkin's lymphoma prognosis signatures using the CTGDR method
Lin et al. Screening Potential Diagnostic Biomarkers for Age‐Related Sarcopenia in the Elderly Population by WGCNA and LASSO
Gan et al. Identification of differential gene groups from single-cell transcriptomes using network entropy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination