CN110791565B - Prognostic marker gene for colorectal cancer recurrence prediction in stage II and random survival forest model - Google Patents

Publication number
CN110791565B
Authority
CN
China
Legal status
Active
Application number
CN201910932418.9A
Other languages
Chinese (zh)
Other versions
CN110791565A (en)
Inventor
丁克峰
陆玮
肖乾
李军
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/118 Prognosis of disease development
    • C12Q2600/158 Expression markers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Abstract

The invention discloses prognostic marker genes for predicting stage II colorectal cancer recurrence and their application, and provides a model that predicts a patient's recurrence risk from the gene expression information of tumors of stage II colorectal cancer patients. The model predicts the 5-year recurrence risk of stage II colorectal cancer patients with an AUC of 0.993, and it significantly separates patients with high and low recurrence risk in the test set. For model construction and selection, the invention uses a random survival forest model, screens variables according to the minimal depth of the maximal subtree in which each variable is located, and rebuilds the model with the selected important variables, which greatly reduces the variable dimension of the model. After patients in the test set were divided into high and low recurrence risk groups using the random survival forest model, the recurrence-free survival time differed significantly between the two groups.

Description

Prognostic marker gene for colorectal cancer recurrence prediction in stage II and random survival forest model
(I) Technical field
The invention relates to a method for constructing a stage II colorectal cancer recurrence prediction model and to prognostic marker genes.
(II) Background of the invention
Colorectal cancer is treated mainly by surgery, supplemented by comprehensive treatment such as chemotherapy, radiotherapy, targeted therapy and immunotherapy. For patients with early-stage colorectal cancer, high-quality radical surgery brings significant benefit. However, some early-stage patients develop local or metastatic recurrence after radical surgery, and the prognosis of these patients is poor; studies show that the shorter the RFS (recurrence-free survival) after radical surgery, the shorter the overall survival. Predicting the recurrence risk after radical surgery for early colorectal cancer is therefore important and helps guide the choice of postoperative adjuvant therapy.
Because prognosis varies widely between patients, stage II colorectal cancer has drawn much attention in recent years; the prognosis of some stage IIB and IIC patients is even worse than that of stage IIIA patients. The NCCN (National Comprehensive Cancer Network) guidelines propose 8 high-risk factors for stage II colorectal cancer, including: (1) fewer than 12 lymph nodes examined; (2) T4 tumor; (3) tumor perforation; (4) tumor obstruction; (5) lymphovascular invasion; and (6) poorly differentiated or undifferentiated histology. In recent years, MSI-H (microsatellite instability-high) or dMMR (deficient mismatch repair) has also been found to be a low-risk factor for stage II colorectal cancer. A stage II patient with any high-risk factor should be considered high-risk, a stage II patient with a low-risk factor should be considered low-risk, and a patient with neither should be considered general-risk. In general, low-risk and general-risk stage II patients have a better prognosis and gain little from adjuvant chemotherapy, whereas high-risk stage II patients have a poorer prognosis and a higher risk of postoperative recurrence and can benefit from adjuvant chemotherapy.
However, many clinical studies have found that this definition of high-risk stage II disease is inaccurate: many high-risk stage II patients do not relapse, while some general-risk stage II patients develop recurrence and metastasis. One possible reason is that the conventional high-risk factors consider only the clinicopathological characteristics of the patient and not the biological characteristics of the tumor itself. Current gene chip and high-throughput sequencing technologies allow researchers to mine the gene expression information of the tumor and thereby capture its biological characteristics.
(III) Disclosure of the invention
The invention aims to identify genes related to stage II colorectal cancer recurrence from the gene expression information of tumor tissue of stage II colorectal cancer patients, and to construct a stage II colorectal cancer recurrence prediction model by a machine learning method.
The technical scheme adopted by the invention is as follows:
The present invention provides prognostic marker genes for prediction of stage II colorectal cancer recurrence, which are NVL, ACER, JUP, PLAT, JUNB, IGDCC3, ANKRD27, NUP50, GIT1, PRKCDBP, TDO2, LOC100287896, MCMBP, SNED1, ESM1, MARCO, FAM46C, LMO2, HAUS8, TTC13, ADNP2, RSBN1L, RAVER2, SNORA21, PNRC1 and PJA1.
The invention also provides an application of the prognostic marker genes in constructing a model for predicting stage II colorectal cancer recurrence, wherein the expression of the model in the R language program is: v.obj_2nd score = predict(v.obj_2nd, data), wherein v.obj_2nd is the tumor recurrence prediction model for stage II colorectal cancer patients, constructed by the random survival forest method from the expression values of the prognostic marker genes related to recurrence, and v.obj_2nd score is the tumor recurrence risk score.
The method for constructing the stage II colorectal cancer recurrence prediction model comprises the following steps:
(1) Obtaining a gene expression dataset: stage II colorectal cancer tumor samples are obtained and mRNA expression data are obtained for each sample; the detection technology includes but is not limited to gene chip technology, high-throughput transcriptome sequencing and real-time fluorescence quantitative PCR (qPCR). Outliers are detected by cluster analysis and removed.
(2) Identifying common differentially expressed genes associated with recurrence in stage II colorectal cancer patients: using gene chip meta-analysis, the p value of each gene in each data set is calculated by the logrank test; the p values of each gene across the data sets are then combined by the minP method to obtain the p_minP value of each gene; the p_minP value of each gene is corrected by the Benjamini-Hochberg method to obtain its FDR (false discovery rate) value; and the common differentially expressed genes associated with recurrence in stage II colorectal cancer patients (i.e., the prognostic marker genes) are screened with FDR < 0.1 as the criterion.
(3) Signaling pathway enrichment analysis of the differentially expressed genes: the common differentially expressed genes identified in step (2) are subjected to signaling pathway enrichment analysis using the Metascape database (http://metascape.org/gp/index.html#/main/step1), and pathways significantly enriched for the differentially expressed genes are screened with p = 0.01 as the threshold, i.e., pathways with a p value below 0.01.
(4) Constructing a model: the tumor recurrence prediction model v.obj_2nd for stage II colorectal cancer patients is constructed by the random survival forest method from the expression values of the common differentially expressed genes identified in step (2).
The expression of the model in the R language program is as follows: v.obj_2nd score = predict(v.obj_2nd, data)
The magnitude of the v.obj_2nd score represents the tumor recurrence risk. The model (v.obj_2nd) and the patient's gene expression data (data) are passed to the predict function of the stats package of the R language program to obtain the patient's tumor recurrence risk score. Patients are divided into a high-risk group (recurrence risk score above the median) and a low-risk group (recurrence risk score below the median) according to the median tumor recurrence risk score of stage II colorectal cancer patients (preferably a median of 6.10).
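The median-based grouping described above can be illustrated with a short Python sketch (the patent's own listings are in R). The patient identifiers and scores below are hypothetical, and assigning a score exactly equal to the median to the low-risk group is an assumption, since the text only covers scores above or below the median.

```python
from statistics import median

def split_by_median(scores):
    """Label each patient 'high' if the risk score exceeds the cohort median, else 'low'.

    Assumption: a score exactly equal to the median is labelled 'low'.
    """
    cutoff = median(scores.values())
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return cutoff, groups

# Hypothetical risk scores for six patients (not data from the patent).
scores = {"P1": 2.4, "P2": 9.8, "P3": 6.0, "P4": 6.2, "P5": 11.3, "P6": 4.1}
cutoff, groups = split_by_median(scores)
print(cutoff, groups)
```

In practice the cutoff would be fixed from the reference cohort (6.10 in this example of the patent) and then applied unchanged to new patients.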
Compared with the prior art, the invention has the following beneficial effects: the invention provides a model that predicts a patient's recurrence risk from the gene expression information of tumors of stage II colorectal cancer patients; the model predicts the 5-year recurrence risk of stage II colorectal cancer patients with an AUC of 0.993, and patients with high and low recurrence risk are significantly separated in the test set (HR = 1.901, 95% CI 1.124-3.215). For model construction and selection, the invention uses a random survival forest model, screens variables according to the minimal depth of the maximal subtree in which each variable is located, and rebuilds the model with the selected important variables, which greatly reduces the variable dimension of the model. After patients in the test set were divided into high and low recurrence risk groups using the random survival forest model, the recurrence-free survival time differed significantly between the two groups.
(IV) Description of the drawings
FIG. 1: Screening procedure of the data sets.
FIG. 2: Hierarchical clustering of the GSE14333 data set.
FIG. 3: Signaling pathways significantly enriched for the differentially expressed genes. The color of each bar reflects the p value: the darker the color, the smaller the p value; the pathway names are shown to the right of the bar chart.
FIG. 4: Tuning of the parameter combinations of the random survival forest model v.obj_2nd. The abscissa is the number of data points in a terminal node, the ordinate is the number of candidate variables at each node, and the color depth reflects the out-of-bag error rate: the darker the color, the higher the out-of-bag error rate.
FIG. 5: Non-linear relationship between the gene expression values in the random survival forest model v.obj_2nd and the recurrence-free survival of the patient. The abscissa is the expression value of each gene and the ordinate is the recurrence-free survival rate of the patient.
FIG. 6: Time-dependent ROC curves in the training set, i.e., time-dependent ROC curves of the random survival forest model at 1 year, 3 years and 5 years.
FIG. 7: Survival curves of the high and low recurrence risk groups predicted by the random survival forest model v.obj_2nd in the test set. Line b is the high recurrence risk group predicted by the model; line a is the low recurrence risk group predicted by the model.
(V) Detailed description of the preferred embodiments
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:
Example 1: Tumor recurrence prediction model based on gene expression information in tumor tissue of stage II colorectal cancer patients
(1) Obtaining a Gene expression dataset
The gene expression data refers to mRNA expression data of a plurality of individual samples, and the detection technology comprises but is not limited to gene chip technology, high-throughput transcriptome sequencing technology, real-time fluorescence quantitative qPCR technology and the like.
Gene expression datasets were obtained by searching the high-throughput gene expression database (https://www.ncbi.nlm.nih.gov/gds/) of NCBI (National Center for Biotechnology Information, USA) with the following search formula: ("clinical cancer"[All Fields] OR "clinical cancer"[All Fields]) AND "Expression profile by array"[Filter]. A total of 981 gene expression data sets were obtained and then screened according to the inclusion and exclusion criteria.
The inclusion criteria for a data set were: (1) the sample type is postoperative tumor tissue from stage II colorectal cancer patients; (2) the patients' RFS data are available; (3) the detection technology is a gene expression profiling chip. Data sets meeting all 3 criteria were included in the subsequent analysis.
The exclusion criteria for a data set were: (1) the sample type is not postoperative tumor tissue from stage II colorectal cancer patients; (2) only the patients' recurrence status is provided, without RFS information; (3) the sample size of the data set is less than 40. Data sets meeting any 1 of these 3 criteria were excluded. The samples finally included in the analysis were 651 postoperative tumor tissue samples from stage II colorectal cancer patients in 6 data sets (GSE14333, GSE17538, GSE24551, GSE33113, GSE39582, GSE92921); the screening procedure of the data sets is shown in FIG. 1.
The raw gene chip data of the 651 samples were downloaded, read with the oligo package of the R language program, and preprocessed by standardization with the RMA (robust multi-array average) algorithm, which removes variation in gene expression caused by other factors such as experimental technique and brings the gene expression data of all samples to a comparable level. This yields a gene expression matrix whose columns are the sample numbers and whose rows are the probe set numbers. To obtain gene-level expression values for each sample, the DAVID database (http://david.ncifcrf.gov/) was used to convert the probe set numbers to gene names, resulting in a gene expression matrix for each data set. The R language program code is as follows (taking the data set GSE14333 as an example; for the other data sets, GSE14333 only needs to be replaced with the name of the data set).
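# Load oligo and the platform annotation package for the arrays
library(oligo)
library(pd.hg.u133.plus.2)
# List the CEL files (including gzipped ones) in the working directory
geneCELs=list.celfiles(listGzipped=TRUE,full.names=TRUE)
affyGeneFS<-read.celfiles(geneCELs)
# RMA preprocessing: background correction, normalization and summarization
geneCore<-rma(affyGeneFS)
# Write the expression matrix (rows: probe sets, columns: samples)
write.csv(exprs(geneCore),file="GSE14333.csv")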
Because of experimental error and similar factors, gene chip data may contain outliers, which make the subsequent construction and testing of models more difficult. Pearson correlation coefficients between samples were calculated from the gene expression matrix, and hierarchical clustering was performed with (1 - Pearson correlation coefficient)/2 as the distance between samples (clustering is performed with the hclust function of the R language program; the NbClust package is also loaded). Outliers were detected using a cluster height of 0.2 as the threshold. The hierarchical clustering diagram of the samples in one data set, GSE14333, is shown in FIG. 2. In this example, the gene expression matrices of all 651 samples of the 6 data sets passed outlier detection and were included in the next analysis. The R language program code for outlier detection is as follows (taking GSE14333 as an example; for the other data sets, GSE14333 only needs to be replaced with the name of the data set).
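library(NbClust)
# Read the expression matrix (columns: samples)
GSE14333=read.table("GSE14333.txt",sep="\t",header=TRUE)
# Drop the first two rows, use the first column as row names, then drop the first two columns
GSE14333=GSE14333[-c(1,2),]
row.names(GSE14333)=GSE14333[,1]
GSE14333=GSE14333[,-c(1,2)]
# Distance between samples: (1 - Pearson correlation coefficient)/2
dd<-as.dist((1-cor(GSE14333))/2)
cluster=hclust(dd)
# Save the dendrogram as a TIFF image
tiff(filename="GSE14333.tiff",width=4500,height=3500,res=300)
plot(cluster,main="Cluster dendrogram of GSE14333",xlab=NA)
dev.off()
# Cut the tree at height 0.2 to obtain the cluster membership of each sample
cutree(cluster,h=0.2)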
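To make the outlier check concrete, the following Python sketch (standard library only, not the patent's R code) computes the (1 - Pearson correlation)/2 distance and merges clusters by complete linkage until the next merge would exceed the height threshold. The sample names and expression values are hypothetical, complete linkage is assumed because it is the default method of hclust, and treating samples outside the largest cluster as outlier candidates is an assumption, since the text gives only the height threshold.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cut_complete_linkage(samples, height):
    """Agglomerate with complete linkage and stop once the next merge exceeds `height`."""
    names = list(samples)
    dist = {(a, b): (1 - pearson(samples[a], samples[b])) / 2
            for a in names for b in names}
    clusters = [[name] for name in names]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[(a, b)] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > height:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression vectors (5 genes) for four samples.
samples = {
    "S1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "S2": [1.1, 2.0, 3.2, 3.9, 5.1],
    "S3": [0.9, 2.2, 2.9, 4.1, 4.8],
    "S4": [5.0, 1.0, 4.0, 2.0, 3.0],
}
clusters = cut_complete_linkage(samples, height=0.2)
largest = max(clusters, key=len)
outlier_candidates = [s for c in clusters if c is not largest for s in c]
print(clusters, outlier_candidates)
```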
(2) Identification of common differentially expressed genes associated with recurrence in stage II colorectal cancer patients
Conventionally, genes related to tumor recurrence are identified and recurrence prediction models are built from a single data set, which suffers from insufficient sample size and poor reproducibility of the differentially expressed genes. One of the innovations of the invention is the use of gene chip meta-analysis to identify common differentially expressed genes related to recurrence in stage II colorectal cancer patients.
The gene expression matrix of each data set was read with the MetaDE package of the R language program, and the RFS time and recurrence status of each patient were matched to the gene expression matrix for the subsequent gene chip meta-analysis. Because several probe set numbers may correspond to one gene name in the gene expression matrix, the probe sets were screened by their IQR (interquartile range) values with the MetaDE.match function of the MetaDE package, and for each gene name the probe set with the largest IQR was selected to represent the expression value of the gene.
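The probe set selection rule can be illustrated with a short Python sketch (standard library only). It is not the MetaDE implementation; the probe names, gene names and expression values are hypothetical, and the quartile convention (the "inclusive" method of statistics.quantiles) is an assumption that may differ from the one used in R.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range using the 'inclusive' quartile convention."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def select_probe_per_gene(expression, probe_to_gene):
    """For each gene, keep the probe set whose expression has the largest IQR."""
    best = {}
    for probe, values in expression.items():
        gene = probe_to_gene[probe]
        spread = iqr(values)
        if gene not in best or spread > best[gene][1]:
            best[gene] = (probe, spread)
    return {gene: probe for gene, (probe, _) in best.items()}

# Hypothetical probe-level expression values across four samples.
expression = {
    "probe_A1": [5.0, 5.1, 5.2, 5.1],
    "probe_A2": [4.0, 6.0, 8.0, 5.0],
    "probe_B1": [7.0, 7.5, 6.5, 7.2],
}
probe_to_gene = {"probe_A1": "GENE_A", "probe_A2": "GENE_A", "probe_B1": "GENE_B"}
selected = select_probe_per_gene(expression, probe_to_gene)
print(selected)
```

A larger IQR indicates a probe set whose signal varies more across samples, which is why it is preferred as the representative of the gene.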
The gene chip meta-analysis was carried out by combining p values, implemented with the MetaDE package. First, the p value of each gene in each data set was calculated by the logrank test; then the p values of each gene across the data sets were combined by the minP method to obtain the p_minP value of each gene. Because one hypothesis test is performed for every gene, the p_minP values were corrected by the Benjamini-Hochberg method to control type I (alpha) errors arising from multiple testing, giving the FDR (false discovery rate) value of each gene. With FDR < 0.1 as the criterion, 479 common differentially expressed genes related to recurrence in stage II colorectal cancer patients were identified. The R language program code is as follows:
rm(list=ls())
library(MetaDE)
study.names<-c("GSE14333","GSE17538","GSE24551","GSE33113","GSE39582","GSE92921")
# Read the six data sets from tab-delimited text files
crc.raw<-MetaDE.Read(study.names,skip=rep(2,6),via="txt",matched=FALSE,log=FALSE)
# For genes with several probe sets, keep the probe set with the largest IQR
crc.raw_match=MetaDE.match(crc.raw,pool.replicate="IQR")
dim(crc.raw[[1]][[1]])
# Merge the data sets on their common genes
crc.merged<-MetaDE.merge(crc.raw_match)
dim(crc.merged[[1]][[1]])
# Two-step gene filtering with deletion proportions 0.2 and 0
crc.filtered<-MetaDE.filter(crc.merged,c(0.2,0))
dim(crc.filtered[[1]][[1]])
# Carry the censoring status of each data set over to the filtered object
for(s in study.names) crc.filtered[[s]]$censoring.status=crc.raw[[s]]$censoring.status
start=Sys.time()
set.seed(2018)
# Logrank test within each data set, p values combined by the minP method
MetaDE.Res.minP<-MetaDE.rawdata(crc.filtered,ind.method=rep("logrank",6),meta.method=c("minP"),paired=rep(F,6),asymptotic=T)
b<-Sys.time()-start
print(b)
View(MetaDE.Res.minP$meta.analysis$FDR)
# Number of differentially expressed genes at the given p value and FDR cutoffs
count.DEnumber(MetaDE.Res.minP,p.cut=c(0.01,0.05),q.cut=c(0.1))
draw.DEnumber(MetaDE.Res.minP,1,FDR=T)
# Save the FDR value of each gene
write.table(MetaDE.Res.minP$meta.analysis$FDR,file="DEG.txt",sep="\t")
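The two statistical steps of the listing above, minP combination and Benjamini-Hochberg correction, can be sketched in plain Python. This is an illustration under stated assumptions, not the MetaDE code: the closed form for minP assumes the K data sets are independent with uniformly distributed null p values, so that the smallest of K p values follows a Beta(1, K) distribution, and all gene names and p values are hypothetical.

```python
def minp_combined(p_values):
    """Combined p value of the minP statistic for K independent studies."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (FDR), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical logrank p values of three genes in three data sets.
per_gene = {
    "gene1": [0.001, 0.2, 0.5],
    "gene2": [0.4, 0.6, 0.3],
    "gene3": [0.02, 0.05, 0.9],
}
genes = list(per_gene)
p_minp = [minp_combined(per_gene[g]) for g in genes]
fdr = benjamini_hochberg(p_minp)
selected = [g for g, q in zip(genes, fdr) if q < 0.1]
print(dict(zip(genes, fdr)), selected)
```

The minP statistic favors genes that are strongly associated with recurrence in at least one data set, which is a more permissive criterion than requiring significance in every data set.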
(3) Signaling pathway enrichment analysis of differentially expressed genes
Gene enrichment analysis is commonly used for functional and pathway annotation of differentially expressed genes obtained by high-throughput techniques such as gene chips and transcriptome sequencing, and it helps to explore the molecular mechanisms underlying biological phenotypes and clinical diseases. The 479 identified common differentially expressed genes related to recurrence in stage II colorectal cancer patients were subjected to signaling pathway enrichment analysis using the Metascape database (http://metascape.org/gp/index.html#/main/step1), and pathways significantly enriched for the differentially expressed genes were screened with p = 0.01 as the threshold (p < 0.01), see FIG. 3.
(4) Construction of stage II colorectal cancer patient tumor recurrence prediction model in training set
To construct a tumor recurrence prediction model for stage II colorectal cancer patients, the 651 samples of the 6 data sets were randomly divided into a training set (60%) and a test set (40%), and the model was constructed in the training set. Gene expression data are high-dimensional and highly correlated; the traditional Cox regression model is not suited to such data, whereas the random survival forest model handles them well. In addition, the effect of a gene expression value on the recurrence risk of a patient may be non-linear. The traditional Cox regression model is essentially linear, which can reduce prediction accuracy, whereas a random survival forest can capture non-linear effects of gene expression values on recurrence risk. The choice of the random survival forest model is also one of the innovations of the invention.
The important parameters of the random survival forest model are the number of candidate variables at each node and the number of data points in a terminal node. The out-of-bag error rate was calculated in the training set for various parameter combinations, and the combination with the lowest out-of-bag error rate was selected, as shown in FIG. 4. The random survival forest model can screen variables according to the minimal depth of the maximal subtree in which each variable is located, with the screening threshold being the average of these minimal depth values over all variables. After variable screening, the model building steps can be repeated with the retained variables until the minimal depth values of all variables are below the threshold.
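A single round of the screening rule described above can be written as a small Python sketch. It follows the wording of the text (threshold equal to the average minimal depth, variables kept when their minimal depth is below the threshold) rather than the internals of any R package, and the gene names and depth values are hypothetical.

```python
from statistics import mean

def screen_by_minimal_depth(minimal_depth):
    """Keep the variables whose minimal depth is below the average minimal depth."""
    threshold = mean(minimal_depth.values())
    kept = [name for name, depth in minimal_depth.items() if depth < threshold]
    return threshold, kept

# Hypothetical minimal depth values (smaller means closer to the root, i.e. more important).
minimal_depth = {"GENE_A": 1.2, "GENE_B": 3.5, "GENE_C": 2.0, "GENE_D": 5.3}
threshold, kept = screen_by_minimal_depth(minimal_depth)
print(threshold, kept)
```

After each round the forest has to be refitted on the retained variables, because the minimal depth values change once the other variables are removed.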
After 2 rounds of variable screening with the random survival forest model, a prediction model v.obj_2nd consisting of 26 gene expression values was finally obtained. The software used is the randomForestSRC package of the R language program, and the specific R language program code is as follows:
library("randomForestSRC")
library("survival")
library("timeROC")
# Stack the six data sets row-wise
data=rbind(GSE14333_BN,GSE17538_BN,GSE24551_BN,GSE33113_BN,GSE39582_BN,GSE92921_BN)
data=data.frame(data)
colnames(data)[480]="time"
set.seed(2018)
# Random 60%/40% split: 390 training samples and 261 test samples
random_sample=sample(651)
train_index=random_sample[1:390]
test_index=random_sample[391:651]
train_data=data[train_index,]
test_data=data[test_index,]
# Tune mtry and nodesize by the out-of-bag error rate
tune_data=tune(Surv(time,status)~.,data=train_data,ntreeTry=1000,trace=T)
print(tune_data$rf)
# Initial random survival forest on all candidate genes
v.obj<-rfsrc(Surv(time,status)~.,data=train_data,nodesize=2,mtry=34,ntree=1000,importance=TRUE)
print(v.obj)
plot(v.obj)
pred=predict(v.obj,train_data)
pred
#1st simplify#
# Minimal depth of the maximal subtree for each variable of the initial model
v.max<-max.subtree(v.obj)
v.max$topvars
train_data_1st=train_data[,c(v.max$topvars,"time","status")]
test_data_1st=test_data[,c(v.max$topvars,"time","status")]
tune_data_1st=tune(Surv(time,status)~.,data=train_data_1st,ntreeTry=1000,trace=T)
print(tune_data_1st$rf)
v.obj_1st<-rfsrc(Surv(time,status)~.,data=train_data_1st,nodesize=1,mtry=11,ntree=1000,importance=TRUE)
print(v.obj_1st)
plot(v.obj_1st)
plot.survival(v.obj_1st,cens.model="rfsrc")
pred_1st=predict(v.obj_1st,train_data_1st)
pred_1st
#2nd simplify#
# Minimal depth of the maximal subtree for each variable of the first simplified model
v.max_1st<-max.subtree(v.obj_1st)
v.max_1st$topvars
train_data_2nd=train_data[,c(v.max_1st$topvars,"time","status")]
test_data_2nd=test_data[,c(v.max_1st$topvars,"time","status")]
tune_data_2nd=tune(Surv(time,status)~.,data=train_data_2nd,ntreeTry=1000,trace=T)
print(tune_data_2nd$rf)
v.obj_2nd<-rfsrc(Surv(time,status)~.,data=train_data_2nd,nodesize=2,mtry=10,ntree=1000,importance=TRUE)
print(v.obj_2nd)
plot(v.obj_2nd)
# Marginal effect of each gene on survival at 60 months
plot.variable(v.obj_2nd,surv.type="surv",time=60,plots.per.page=6)
pred_2nd=predict(v.obj_2nd,train_data_2nd)
pred_2nd
#minimal_depth
v.max_2nd<-max.subtree(v.obj_2nd)
print(round(v.max_2nd$order,3))
#the minimal depth is the first order depth
print(round(v.max_2nd$order[,1],3))
minimal_depth_2nd=round(v.max_2nd$order[,1],3)
print(v.max_2nd$threshold)
v.max_2nd$topvars
#test#
pred_test_2nd=predict(v.obj_2nd,test_data_2nd)
#tdroc#
ROC_train_2nd=timeROC(T=train_data_2nd$time,delta=train_data_2nd$status,marker=pred_2nd$predicted,cause=1,iid=TRUE,times=c(12,36,60))
plot(ROC_train_2nd,time=12,add=F)
plot(ROC_train_2nd,time=36,add=F)
plot(ROC_train_2nd,time=60,add=F)
The expression of the model in the R language program is as follows: v.obj_2nd score = predict(v.obj_2nd, data)
The magnitude of the v.obj_2nd score represents the tumor recurrence risk. The model (v.obj_2nd) and the patient's gene expression data (data) are passed to the predict function of the stats package of the R language program to obtain the patient's tumor recurrence risk score. The names of the 26 genes in the v.obj_2nd model and the minimal depth values of the maximal subtrees in which they are located are shown in Table 1.
TABLE 1 Minimal depth values of the maximal subtrees of the 26 genes in the random survival forest model v.obj_2nd
[Table 1 is provided as images in the original publication: Figure GDA0002354029440000091 and Figure GDA0002354029440000101.]
The non-linear relationship between the gene expression values and the recurrence-free survival of patients is shown by variable marginal effect plots, see FIG. 5.
After the tumor recurrence prediction model v.obj_2nd was constructed in the training set with the random survival forest method, its predictive performance was evaluated by the AUC (area under the curve) of the time-dependent ROC (receiver operating characteristic) curve. The AUC ranges from 0 to 1, and the larger the AUC, the better the predictive performance of the model. For the random survival forest model v.obj_2nd, the 1-year AUC is 0.995, the 3-year AUC is 0.999 and the 5-year AUC is 0.993, see FIG. 6.
(5) Evaluating model prediction effects in test set
Gene expression matrices were obtained for all samples in the test set, and the expression values of the 26 genes (NVL, ACER, JUP, PLAT, JUNB, IGDCC3, ANKRD27, NUP50, GIT1, PRKCDBP, TDO2, LOC100287896, MCMBP, SNED1, ESM1, MARCO, FAM46C, LMO2, HAUS8, TTC13, ADNP2, RSBN1L, RAVER2, SNORA21, PNRC1, PJA1) of the random survival forest model v.obj_2nd were extracted. The constructed random survival forest prediction model v.obj_2nd was loaded into the R language program, and the predict function was called with the model and the 26-gene expression matrix of the test set samples as input parameters to obtain the tumor recurrence risk scores in the test set. Patients were divided into a high-risk group (recurrence risk score higher than 6.10) and a low-risk group (recurrence risk score lower than 6.10) according to the median tumor recurrence risk score of stage II colorectal cancer patients (median 6.10 in this example). Kaplan-Meier survival curves were drawn for the two groups, the logrank test was used to test whether recurrence-free survival differed significantly between the two groups, and the Mantel-Haenszel method was used to calculate the HR (hazard ratio) and its 95% CI (confidence interval). After the test set was divided into the high and low recurrence risk groups in this way, the p value of the logrank test between the two groups was 0.0165, and the HR of the high recurrence risk group was 1.901 (95% CI 1.124-3.215), as shown in FIG. 7.
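For readers who want to see what the two-group logrank comparison computes, the following Python sketch (standard library only) derives the chi-square statistic and p value from follow-up times and event indicators. It is an illustrative re-implementation, not the code used in the patent; the follow-up data are hypothetical, and the hazard ratio and its confidence interval are not computed here.

```python
from math import erfc, sqrt

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test; returns (chi-square statistic, p value).

    events_* use 1 for recurrence and 0 for censoring.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for ti, _, _ in data if ti >= t)               # at risk, both groups
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)   # at risk, group A
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)     # events at time t
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 degree of freedom
    return chi2, p_value

# Hypothetical recurrence-free survival times in months (1 = recurrence, 0 = censored).
high_times, high_events = [5, 8, 12, 15, 20, 24], [1, 1, 1, 1, 1, 1]
low_times, low_events = [40, 48, 55, 60, 60, 60], [1, 0, 0, 0, 0, 0]
chi2, p_value = logrank_test(high_times, high_events, low_times, low_events)
print(chi2, p_value)
```

The test compares, at every recurrence time, the observed number of events in one group with the number expected if both groups shared the same hazard.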
Having thus described one embodiment of the present invention, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and additions may be made to the embodiment, methodology, selected model and programming software without departing from the principles and spirit of the invention, which shall be deemed to be within the scope of the invention.

Claims (3)

1. A prognostic marker gene for use in prediction of stage II colorectal cancer recurrence, characterised in that the marker gene is NVL, ACER, JUP, PLAT, JUNB, IGDCC3, ANKRD27, NUP50, GIT1, PRKCDBP, TDO2, LOC100287896, MCMBP, SNED1, ESM1, MARCO, FAM46C, LMO2, HAUS8, TTC13, ADNP2, RSBN1L, RAVER2, SNORA21, PNRC1 and PJA1.
2. Use of the prognostic marker gene according to claim 1 for constructing a model for predicting recurrence of stage II colorectal cancer, wherein the model is expressed in the R language program as: v.obj_2nd score = pred(v.obj_2nd, data), wherein v.obj_2nd is the tumor recurrence prediction model for stage II colorectal cancer patients, constructed by the random survival forest method from the expression values of the prognostic marker genes related to recurrence in stage II colorectal cancer patients, and v.obj_2nd score is the tumor recurrence risk score.
3. The use according to claim 2, characterized in that the model building method comprises the following steps: (1) obtaining a gene expression dataset: obtaining stage II colorectal cancer tumor samples, obtaining mRNA expression data for each sample, and detecting and removing outliers by cluster analysis; (2) identifying common differentially expressed genes associated with recurrence in stage II colorectal cancer patients: using a gene chip meta-analysis technique, calculating the p value of each gene in each data set by the log-rank test; then combining the p values of each gene across the data sets by the minP method to obtain the p_minP value of each gene; correcting the p_minP value of each gene by the Benjamini-Hochberg method to obtain the FDR value of each gene; and screening, by the criterion FDR < 0.1, the common differentially expressed genes associated with recurrence in stage II colorectal cancer patients; (3) signaling pathway enrichment analysis of the differentially expressed genes: performing signaling pathway enrichment analysis, using the Metascape database, on the common differentially expressed genes associated with recurrence in stage II colorectal cancer patients identified in step (2), and screening, with p = 0.01 as the threshold, the signaling pathways in which the differentially expressed genes are significantly enriched, namely the signaling pathways with a p value of less than 0.01; (4) constructing a model: constructing the tumor recurrence prediction model v.obj_2nd for stage II colorectal cancer patients by the random survival forest method from the expression values of the common differentially expressed genes associated with recurrence in stage II colorectal cancer patients identified in step (2),
the expression of the model in the R language program being: v.obj_2nd score = pred(v.obj_2nd, data).
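The p value combination and FDR screening of step (2) can be illustrated with the minimal Python sketch below. It is an assumption-laden illustration rather than the implementation used here: the minP method is taken to be the minimum-p rule, in which the combined p value of a gene over K independent data sets is 1 - (1 - min p)^K; the function names, the gene labels GENE_A to GENE_D and all p values are hypothetical.

```python
def minp_combine(pvalues):
    """Combine one gene's p values from K independent data sets (minimum-p rule)."""
    k = len(pvalues)
    return 1.0 - (1.0 - min(pvalues)) ** k


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values (FDR) in the input order."""
    m = len(pvalues)
    # Walk from the largest p value down so the running minimum enforces
    # monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of this p value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log-rank p values of four genes in three data sets.
per_gene = {
    "GENE_A": [0.0004, 0.20, 0.03],
    "GENE_B": [0.30, 0.45, 0.25],
    "GENE_C": [0.01, 0.008, 0.6],
    "GENE_D": [0.9, 0.7, 0.8],
}

genes = list(per_gene)
p_minp = [minp_combine(per_gene[g]) for g in genes]
fdr = benjamini_hochberg(p_minp)
selected = [g for g, q in zip(genes, fdr) if q < 0.1]  # FDR < 0.1 criterion
print(selected)
```

Only genes whose adjusted value falls below 0.1 are carried forward to the enrichment analysis and model construction steps.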
CN201910932418.9A 2019-09-29 2019-09-29 Prognostic marker gene for colorectal cancer recurrence prediction in stage II and random survival forest model Active CN110791565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910932418.9A CN110791565B (en) 2019-09-29 2019-09-29 Prognostic marker gene for colorectal cancer recurrence prediction in stage II and random survival forest model

Publications (2)

Publication Number Publication Date
CN110791565A CN110791565A (en) 2020-02-14
CN110791565B CN110791565B (en) 2021-09-03

Family

ID=69438762

Country Status (1)

Country Link
CN (1) CN110791565B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant