CN110791565B

CN110791565B - Prognostic marker gene for colorectal cancer recurrence prediction in stage II and random survival forest model

Info

Publication number: CN110791565B
Application number: CN201910932418.9A
Authority: CN
Inventors: 丁克峰; 陆玮; 肖乾; 李军
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-09-29
Filing date: 2019-09-29
Publication date: 2021-09-03
Anticipated expiration: 2039-09-29
Also published as: CN110791565A

Abstract

The invention discloses prognostic marker genes for predicting stage II colorectal cancer recurrence and their application, and provides a model that predicts a patient's recurrence risk from the gene expression information of tumors of stage II colorectal cancer patients. The model predicts the 5-year recurrence risk of stage II colorectal cancer patients with an AUC of 0.993, and clearly separates patients with high and low recurrence risk in the test set. For model construction and selection, the invention uses a random survival forest model, screens variables according to the minimal depth of the maximal subtree in which each variable is located, and rebuilds the model with the selected important variables, which greatly reduces the variable dimension of the model. After patients in the test set were divided into high and low recurrence risk groups by the random survival forest model, recurrence-free survival time differed significantly between the two groups.

Description

Prognostic marker gene for colorectal cancer recurrence prediction in stage II and random survival forest model

(I) technical field

The invention relates to a method for constructing a stage II colorectal cancer recurrence prediction model and to prognostic marker genes.

(II) background of the invention

Colorectal cancer is treated mainly by surgery, supplemented by comprehensive treatment such as chemotherapy, radiotherapy, targeted therapy and immunotherapy. For patients with early-stage colorectal cancer, high-quality radical surgery brings significant benefit. However, some early-stage patients develop local or metastatic recurrence after radical surgery, and the prognosis of patients with recurrence is poor: studies show that the shorter the RFS (recurrence-free survival) after radical surgery, the shorter the overall survival. Predicting the risk of recurrence after radical surgery for early-stage colorectal cancer is therefore important, and helps guide the choice of postoperative adjuvant therapy.

Because prognosis varies widely among patients, stage II colorectal cancer has drawn considerable attention in recent years; the prognosis of some stage IIB and IIC patients is even worse than that of stage IIIA patients. The NCCN (National Comprehensive Cancer Network) guidelines propose 8 high-risk factors for stage II colorectal cancer, including: (1) fewer than 12 lymph nodes examined; (2) T4 tumor; (3) tumor perforation; (4) tumor obstruction; (5) lymphatic or vascular invasion; and (6) poorly differentiated or undifferentiated tumor. In recent years, MSI-H (high microsatellite instability) or dMMR (deficient mismatch repair) has also been found to be a low-risk factor for stage II colorectal cancer. Stage II patients with any high-risk factor should be considered high-risk patients, stage II patients with the low-risk factor should be considered low-risk patients, and patients who are neither high-risk nor low-risk should be considered general-risk patients. In general, low-risk and general-risk stage II patients have a better prognosis and gain little from adjuvant chemotherapy, while high-risk stage II patients have a poorer prognosis and a higher risk of postoperative recurrence, and can benefit from adjuvant chemotherapy.
However, many clinical studies have found that the definition of high-risk stage II disease is not accurate: many high-risk stage II patients do not relapse, while some general-risk stage II patients develop recurrence and metastasis. This may be because the conventional high-risk factors consider only the clinicopathological characteristics of the patient and not the biological characteristics of the tumor itself. Current gene chip and high-throughput sequencing technologies allow researchers to better mine the gene expression information of the tumor, which reflects its biological characteristics.

(III) disclosure of the invention

The invention aims to identify the colorectal cancer recurrence related genes in the stage II according to the gene expression information of the tumor tissues of patients with colorectal cancer in the stage II, and construct a colorectal cancer recurrence prediction model in the stage II by adopting a machine learning method.

The technical scheme adopted by the invention is as follows:

The present invention provides prognostic marker genes for the prediction of stage II colorectal cancer recurrence, which include NVL, ACER, JUP, PLAT, JUNB, IGDCC3, ANKRD27, NUP50, GIT1, PRKCDBP, TDO2, LOC100287896, MCMBP, SNED1, ESM1, MARCO, FAM46C, LMO2, HAUS8, TTC13, ADNP2, RSBN1L, RAVER2, SNORA21, PNRC1 and PJA1.

The invention also provides an application of the prognostic marker genes in constructing a model for predicting stage II colorectal cancer recurrence, wherein the expression of the model in the R language program is: v.obj_2nd score = predict(v.obj_2nd, data), where v.obj_2nd is the tumor recurrence prediction model for stage II colorectal cancer patients, constructed by the random survival forest method from the expression values of the prognostic marker genes related to recurrence, and v.obj_2nd score is the tumor recurrence risk score.

The invention relates to a construction method of a prediction model of colorectal cancer recurrence in stage II, which comprises the following steps:

(1) Obtaining a gene expression dataset: obtain stage II colorectal cancer tumor samples and the mRNA expression data of each sample, where the detection technology includes but is not limited to gene chip technology, high-throughput transcriptome sequencing and real-time fluorescence quantitative PCR (qPCR); detect outliers by cluster analysis and remove them.

(2) Identifying common differentially expressed genes associated with recurrence in stage II colorectal cancer patients: using gene chip meta-analysis, calculate the p value of each gene in each data set by the logrank test; then combine the p values of each gene across the data sets by the minP method to obtain the p_minP value of each gene; correct the p_minP values by the Benjamini-Hochberg method to obtain the FDR (false discovery rate) value of each gene, and screen the common differentially expressed genes (i.e., the prognostic marker genes) associated with recurrence in stage II colorectal cancer patients according to the criterion FDR < 0.1.

(3) Signaling pathway enrichment analysis of the differentially expressed genes: perform signaling pathway enrichment analysis on the common differentially expressed genes identified in step (2) using the Metascape database (http://metascape.org/gp/index.html#/main/step1), and screen the pathways significantly enriched for the differentially expressed genes with p = 0.01 as the threshold, i.e., pathways with a p value less than 0.01.

(4) Constructing the model: construct the tumor recurrence prediction model v.obj_2nd for stage II colorectal cancer patients by the random survival forest method from the expression values of the common differentially expressed genes identified in step (2).

The expression of the model in the R language program is: v.obj_2nd score = predict(v.obj_2nd, data)

The magnitude of the v.obj_2nd score represents the tumor recurrence risk: passing the model (v.obj_2nd) and the gene expression data of the patient (data) to the predict function of the stats package of the R language program yields the tumor recurrence risk score of the patient. Patients are divided into a high-risk group (recurrence risk score higher than the median) and a low-risk group (recurrence risk score lower than the median) according to the median tumor recurrence risk score of stage II colorectal cancer patients (preferably a median of 6.10).
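The median split described above can be sketched in a few lines of base R. This is only an illustration: the scores are made-up numbers rather than patent data, the variable names are hypothetical, and assigning a score exactly equal to the median to the low-risk group is an assumption, since the text does not say how ties are handled.

```r
# Hypothetical risk scores for ten patients (made-up numbers, not patent data).
risk_score <- c(2.1, 9.4, 6.5, 3.3, 7.8, 5.0, 8.2, 4.4, 6.9, 1.7)

# Cutoff: the median of the scores (the patent reports 6.10 for its own cohort).
cutoff <- median(risk_score)

# Scores above the cutoff form the high-risk group; all others (including
# ties, by assumption) form the low-risk group.
risk_group <- factor(ifelse(risk_score > cutoff, "high", "low"),
                     levels = c("low", "high"))

print(cutoff)
print(table(risk_group))
```

In practice `risk_score` would hold the scores returned by the prediction step rather than hand-typed values.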

Compared with the prior art, the invention has the following beneficial effects. The invention provides a model that predicts a patient's recurrence risk from the gene expression information of tumors of stage II colorectal cancer patients; the model predicts the 5-year recurrence risk of stage II colorectal cancer patients with an AUC of 0.993, and significantly separates patients with high and low recurrence risk in the test set (HR = 1.901, 95% CI 1.124-3.215). For model construction and selection, the invention uses a random survival forest model, screens variables according to the minimal depth of the maximal subtree in which each variable is located, and rebuilds the model with the selected important variables, which greatly reduces the variable dimension of the model. After patients in the test set were divided into high and low recurrence risk groups by the random survival forest model, recurrence-free survival time differed significantly between the two groups.

(IV) description of the drawings

FIG. 1: Screening procedure for the data sets.

FIG. 2: Hierarchical clustering of the GSE14333 data set.

FIG. 3: Signaling pathways significantly enriched for the differentially expressed genes. The color of the bars reflects the p value: the darker the color, the smaller the p value. Pathway names are shown to the right of the bar chart.

FIG. 4: Tuning of the parameter combination of the random survival forest model v.obj_2nd. The abscissa is the number of data points in a terminal node, the ordinate is the number of candidate variables at each node, and the color depth reflects the out-of-bag error rate: the darker the color, the higher the out-of-bag error rate.

FIG. 5: Non-linear relationship between the gene expression values in the random survival forest model v.obj_2nd and the recurrence-free survival of patients. The abscissa is the expression value of each gene and the ordinate is the recurrence-free survival rate of the patient.

FIG. 6: Time-dependent ROC curves in the training set, showing the time-dependent ROC curves of the random survival forest model at 1 year, 3 years and 5 years.

FIG. 7: Survival curves of the high and low recurrence risk groups predicted by the random survival forest model v.obj_2nd in the test set. Line b is the high recurrence risk group predicted according to the Lasso Cox model; line a is the low recurrence risk group predicted according to the Lasso Cox model.

(V) detailed description of the preferred embodiments

The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:

Example 1: Tumor recurrence prediction model based on gene expression information in tumor tissue of stage II colorectal cancer patients

(1) Obtaining a Gene expression dataset

The gene expression data refers to mRNA expression data of a plurality of individual samples, and the detection technology comprises but is not limited to gene chip technology, high-throughput transcriptome sequencing technology, real-time fluorescence quantitative qPCR technology and the like.

Gene expression datasets were obtained by searching the high-throughput gene expression database (https://www.ncbi.nlm.nih.gov/gds/) of NCBI (National Center for Biotechnology Information, USA) with the following search formula: ("clinical cancer"[All Fields] OR "clinical cancer"[All Fields]) AND "Expression profile by array"[Filter]. A total of 981 gene expression data sets were obtained and then screened according to the inclusion and exclusion criteria.

The inclusion criteria for the data sets were: (1) the sample type is postoperative tumor tissue from stage II colorectal cancer patients; (2) the RFS data of the patients are available; (3) the detection technology is a gene expression profile chip. Data sets meeting all 3 criteria were included in the subsequent analysis.

The exclusion criteria for the data sets were: (1) the sample type is not postoperative tumor tissue from stage II colorectal cancer patients; (2) only the recurrence status of the patients is provided, without RFS information; (3) the sample size of the data set is less than 40. Data sets meeting any 1 of these 3 criteria were excluded. The samples finally included in the analysis were 651 postoperative tumor tissue samples from stage II colorectal cancer patients in 6 data sets (GSE14333, GSE17538, GSE24551, GSE33113, GSE39582 and GSE92921); the screening procedure is shown in FIG. 1.

The raw gene chip data of the 651 samples were downloaded and read with the oligo package of the R language program, then normalized with the RMA (robust multi-array average) algorithm, which aims to remove variation in gene expression caused by other factors such as experimental technique and to bring the gene expression data of all samples to a comparable level. This yields a gene expression matrix whose columns are the sample numbers and whose rows are the probe set numbers. To obtain gene-level expression values for each sample, the DAVID database (http://david.ncifcrf.gov/) was used to convert the probe set numbers to gene names, giving a gene expression matrix for each data set. The R language program code is as follows (taking the data set GSE14333 as an example; for the other data sets, replace GSE14333 with the name of the data set):

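library(oligo)
library(pd.hg.u133.plus.2)
# List the CEL files (including gzipped ones) in the working directory.
geneCELs = list.celfiles(listGzipped = TRUE, full.names = TRUE)
# Read the raw probe-level data.
affyGeneFS <- read.celfiles(geneCELs)
# Normalize and summarize to probe-set level with RMA.
geneCore <- rma(affyGeneFS)
# Export the probe-set-by-sample expression matrix.
write.csv(exprs(geneCore), file = "GSE14333.csv")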

Because of factors such as experimental error, gene chip data may contain outliers, which make it harder to build and test the model. Pearson correlation coefficients between samples were calculated from the gene expression matrix, and hierarchical clustering was performed with (1 - Pearson correlation coefficient)/2 as the distance between samples (using the hclust function of the R language program; the NbClust package is loaded in the code). Outliers were detected with a cluster height of 0.2 as the threshold; the hierarchical clustering diagram of the samples in one data set, GSE14333, is shown in FIG. 2. In this example, the gene expression matrices of all 651 samples in the 6 data sets passed outlier detection and were included in the next analysis. The R language program code for outlier detection is as follows (taking GSE14333 as an example; for the other data sets, replace GSE14333 with the name of the data set):

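library(NbClust)
# Read the gene expression matrix (tab-separated, with a header row).
GSE14333 = read.table("GSE14333.txt", sep = "\t", header = TRUE)
# Drop the first two rows and use the first column as row names.
GSE14333 = GSE14333[-c(1, 2), ]
row.names(GSE14333) = GSE14333[, 1]
# Drop the first two columns, keeping the expression columns.
GSE14333 = GSE14333[, -c(1, 2)]
# Distance between samples: (1 - Pearson correlation) / 2.
dd <- as.dist((1 - cor(GSE14333)) / 2)
cluster = hclust(dd)
# Save the dendrogram as a TIFF image (the original listing named the file "GSE14333.GIF").
tiff(filename = "GSE14333.tiff", width = 4500, height = 3500, res = 300)
plot(cluster, main = "Cluster dendrogram of GSE14333", xlab = NA)
dev.off()
# Cut the tree at height 0.2 to detect outlying samples.
cutree(cluster, h = 0.2)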
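To make the outlier rule concrete, the following sketch runs the same correlation-distance clustering on synthetic data in base R. The simulated matrix and the sample names S1 to S8 are hypothetical, and treating every sample outside the largest cluster as an outlier is an assumption: the text only states that a cluster height of 0.2 is used as the threshold.

```r
set.seed(1)
# Simulated expression matrix: 200 genes (rows) x 8 samples (columns).
# All values are synthetic; the sample names S1..S8 are hypothetical.
gene_profile <- rnorm(200, mean = 8, sd = 2)
expr <- sapply(1:8, function(i) gene_profile + rnorm(200, sd = 0.3))
colnames(expr) <- paste0("S", 1:8)
# Turn S8 into an artificial outlier by replacing it with unrelated noise.
expr[, "S8"] <- rnorm(200, mean = 8, sd = 2)

# Distance between samples: (1 - Pearson correlation) / 2, which lies in [0, 1].
dd <- as.dist((1 - cor(expr)) / 2)
cluster <- hclust(dd)                 # complete linkage by default

# Cut the tree at height 0.2 and flag samples outside the largest cluster
# (the "largest cluster" rule is an assumption of this sketch).
groups <- cutree(cluster, h = 0.2)
main_group <- as.integer(names(which.max(table(groups))))
outliers <- names(groups)[groups != main_group]
print(outliers)
```

Samples that share the common expression profile have pairwise distances close to 0, while the unrelated sample sits near 0.5, so it falls into its own cluster when the tree is cut at 0.2.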

(2) Identification of common differentially expressed genes associated with recurrence in stage II colorectal cancer patients

Traditionally, genes related to tumor recurrence are identified, and recurrence prediction models are built, on a single data set, which suffers from insufficient sample size and poor reproducibility of the differentially expressed genes. One of the innovations of the invention is the use of gene chip meta-analysis to identify the common differentially expressed genes related to recurrence in stage II colorectal cancer patients.

The gene expression matrix of each data set was read with the MetaDE package of the R language program, and the RFS time and recurrence status of each patient were matched to the gene expression matrix for the subsequent gene chip meta-analysis. Because several probe set numbers may correspond to one gene name in the gene expression matrix, the probe sets were screened with the MetaDE.match function of the MetaDE package according to their IQR (interquartile range) values: for each gene name, the probe set with the largest IQR was selected to represent the expression value of the gene.
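The IQR-based probe selection can be sketched in base R as follows. This is not the internal implementation of MetaDE.match, which is not reproduced here; the probe IDs, gene names and expression values are all hypothetical.

```r
set.seed(2)
# Synthetic probe-level matrix: 5 probe sets x 6 samples (hypothetical IDs).
probe_expr <- matrix(rnorm(30, mean = 8), nrow = 5,
                     dimnames = list(paste0("probe", 1:5), paste0("S", 1:6)))
# Stretch probe2 so that it clearly has the largest spread within GENE_A.
probe_expr["probe2", ] <- c(2, 5, 8, 11, 14, 17)
# Hypothetical probe-to-gene annotation: GENE_A is covered by three probe sets.
gene_of_probe <- c(probe1 = "GENE_A", probe2 = "GENE_A", probe3 = "GENE_A",
                   probe4 = "GENE_B", probe5 = "GENE_C")

# Interquartile range of each probe set across samples.
probe_iqr <- apply(probe_expr, 1, IQR)

# For each gene, keep the probe set with the largest IQR.
best_probe <- vapply(split(names(probe_iqr), gene_of_probe[names(probe_iqr)]),
                     function(p) p[which.max(probe_iqr[p])], character(1))

# Gene-level matrix: one row per gene, taken from the selected probe set.
gene_expr <- probe_expr[best_probe, , drop = FALSE]
rownames(gene_expr) <- names(best_probe)
print(best_probe)
```

Choosing the probe set with the largest IQR favors the probe that varies most across samples, on the view that a nearly constant probe carries little information for differential expression.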

The gene chip meta-analysis was carried out by combining p values, and its implementation relies on the MetaDE package. First, the p value of each gene in each data set was calculated by the logrank test; then the p values of each gene across the data sets were combined by the minP method to obtain the p_minP value of each gene. Because a hypothesis test is performed for every gene, the p_minP values were corrected by the Benjamini-Hochberg method to limit the type I (alpha) errors arising from multiple testing, giving the FDR (false discovery rate) value of each gene. According to the criterion FDR < 0.1, 479 common differentially expressed genes related to recurrence in stage II colorectal cancer patients were identified. The R language program code is as follows:

rm(list = ls())
library(MetaDE)
study.names <- c("GSE14333", "GSE17538", "GSE24551", "GSE33113", "GSE39582", "GSE92921")
# Read the six data sets, skipping the first two rows of each file.
crc.raw <- MetaDE.Read(study.names, skip = rep(2, 6), via = "txt", matched = FALSE, log = FALSE)
# Collapse probe sets mapping to the same gene, keeping the one with the largest IQR.
crc.raw_match = MetaDE.match(crc.raw, pool.replicate = "IQR")
dim(crc.raw[[1]][[1]])
# Merge the data sets on their common genes.
crc.merged <- MetaDE.merge(crc.raw_match)
dim(crc.merged[[1]][[1]])
# Gene filtering with percentages 0.2 and 0 for the two filtering steps.
crc.filtered <- MetaDE.filter(crc.merged, c(0.2, 0))
dim(crc.filtered[[1]][[1]])
# Copy the censoring status of each data set into the filtered object.
crc.filtered$GSE14333$censoring.status = crc.raw$GSE14333$censoring.status
crc.filtered$GSE17538$censoring.status = crc.raw$GSE17538$censoring.status
crc.filtered$GSE24551$censoring.status = crc.raw$GSE24551$censoring.status
crc.filtered$GSE33113$censoring.status = crc.raw$GSE33113$censoring.status
crc.filtered$GSE39582$censoring.status = crc.raw$GSE39582$censoring.status
crc.filtered$GSE92921$censoring.status = crc.raw$GSE92921$censoring.status
start = Sys.time()
set.seed(2018)
# Logrank test per data set, combined by minP (paired has one entry per data set; the original listing used rep(F,7)).
MetaDE.Res.minP <- MetaDE.rawdata(crc.filtered, ind.method = rep("logrank", 6), meta.method = c("minP"), paired = rep(FALSE, 6), asymptotic = TRUE)
b <- Sys.time() - start
print(b)
# Inspect the FDR values and count the genes passing each cutoff.
View(MetaDE.Res.minP$meta.analysis$FDR)
count.DEnumber(MetaDE.Res.minP, p.cut = c(0.01, 0.05), q.cut = c(0.1))
draw.DEnumber(MetaDE.Res.minP, 1, FDR = TRUE)
write.table(MetaDE.Res.minP$meta.analysis$FDR, file = "DEG.txt", sep = "\t")
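The logic of combining p values by minP and then applying the Benjamini-Hochberg correction can be illustrated with a small base R sketch. The per-study p values below are hypothetical, and the closed-form null distribution used here assumes independent studies with uniformly distributed p values; MetaDE computes its own combined p values, so this is an illustration of the idea rather than a reproduction of the package.

```r
# Hypothetical per-study logrank p values: 4 genes (rows) x 6 data sets (columns).
p_mat <- rbind(
  geneA = c(0.0004, 0.20, 0.35, 0.60, 0.15, 0.70),
  geneB = c(0.40, 0.55, 0.30, 0.80, 0.65, 0.90),
  geneC = c(0.03, 0.01, 0.25, 0.45, 0.50, 0.12),
  geneD = c(0.70, 0.95, 0.85, 0.60, 0.75, 0.50)
)
K <- ncol(p_mat)                      # number of data sets

# minP statistic: the smallest p value of each gene across the data sets.
min_p <- apply(p_mat, 1, min)

# Under the null, with K independent uniform p values,
# P(min p <= x) = 1 - (1 - x)^K, which gives the combined p value.
p_minP <- 1 - (1 - min_p)^K

# Benjamini-Hochberg correction across genes, then the FDR < 0.1 screen.
fdr <- p.adjust(p_minP, method = "BH")
selected <- names(fdr)[fdr < 0.1]
print(round(cbind(min_p, p_minP, fdr), 4))
print(selected)
```

With these made-up numbers only geneA passes the screen: geneC has a small minimum p value in one study, but after combination and correction its FDR is above 0.1.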

(3) Signaling pathway enrichment analysis of differentially expressed genes

Gene enrichment analysis is commonly used for functional and pathway annotation of differentially expressed genes obtained by high-throughput techniques such as gene chips and transcriptome sequencing, and helps explore the molecular mechanisms underlying biological phenotypes and clinical diseases. The 479 identified common differentially expressed genes related to recurrence in stage II colorectal cancer patients were subjected to signaling pathway enrichment analysis using the Metascape database (http://metascape.org/gp/index.html#/main/step1), and the pathways significantly enriched for the differentially expressed genes were screened with p < 0.01 as the threshold, see FIG. 3.
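Pathway enrichment of this kind is commonly assessed with a hypergeometric test, which asks how likely it is to see at least the observed number of pathway genes in a gene list of the given size. The text does not state which statistic underlies the reported p values, so the test below is an assumption, and every count except the list size of 479 is hypothetical.

```r
# All counts except n are hypothetical, chosen only to illustrate the test.
N <- 20000   # genes in the background universe
K <- 150     # background genes annotated to the pathway
n <- 479     # genes in the differentially expressed list
k <- 12      # genes in the list that belong to the pathway

# Upper-tail hypergeometric probability P(X >= k): the chance of seeing
# at least k pathway genes in a random list of n genes.
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# A pathway passes the screen when its p value is below 0.01.
is_enriched <- p_enrich < 0.01
print(p_enrich)
print(is_enriched)
```

The same p value is obtained from a one-sided Fisher exact test on the corresponding 2 x 2 table, which is a convenient cross-check.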

(4) Construction of stage II colorectal cancer patient tumor recurrence prediction model in training set

To construct the tumor recurrence prediction model for stage II colorectal cancer patients, the 651 samples of the 6 data sets were randomly divided into a training set (60%) and a test set (40%), and the model was constructed in the training set. Gene expression data are high-dimensional and highly correlated; the traditional Cox regression model is not suited to high-dimensional data, whereas the random survival forest model handles it well. In addition, the effect of gene expression on a patient's recurrence risk may be non-linear. The traditional Cox regression model is essentially linear, which can reduce prediction accuracy, while random survival forests can capture non-linear effects of gene expression on recurrence risk. The choice of the random survival forest model is also one of the innovations of the invention.

The important parameters of the random survival forest model are the number of candidate variables at each node and the number of data points in a terminal node. The out-of-bag error rate was calculated in the training set for each parameter combination, and the combination giving the lowest out-of-bag error rate was selected, as shown in FIG. 4. The random survival forest model can screen variables according to the minimal depth of the maximal subtree in which each variable is located, with the screening threshold being the average of the minimal depth values of all variables. After variable screening, the model building steps can be repeated with the selected variables until the minimal depth values of all variables are below the threshold.
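The screening rule in this paragraph can be sketched in base R with made-up numbers. The variable names and minimal depth values are hypothetical, and the threshold follows the wording of the text (the mean of the minimal depths); the max.subtree function of randomForestSRC reports its own threshold and topvars, which may be computed differently.

```r
# Hypothetical minimal depth values for six variables (smaller = more important).
minimal_depth <- c(gene1 = 1.2, gene2 = 3.8, gene3 = 2.1,
                   gene4 = 5.6, gene5 = 2.9, gene6 = 6.4)

# Screening threshold as described in the text: the mean of all minimal depths.
threshold <- mean(minimal_depth)

# Keep the variables whose minimal depth falls below the threshold.
top_vars <- names(minimal_depth)[minimal_depth < threshold]
print(threshold)
print(top_vars)
```

A variable that tends to split close to the root of its trees has a small minimal depth, which is why low values are read as high importance.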

After 2 rounds of variable screening with the random survival forest model, a prediction model v.obj_2nd consisting of 26 gene expression values was finally obtained. The software used is the randomForestSRC package of the R language program, and the R language program code is as follows:

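library("randomForestSRC")
library("survival")
library("timeROC")
# Combine the six data sets into one data frame.
data = rbind(GSE14333_BN, GSE17538_BN, GSE24551_BN, GSE33113_BN, GSE39582_BN, GSE92921_BN)
data = data.frame(data)
colnames(data)[480] = "time"
# Random split of the 651 samples: 390 for training (60%), 261 for testing (40%).
set.seed(2018)
random_sample = sample(651)
train_index = random_sample[1:390]
test_index = random_sample[391:651]
train_data = data[train_index, ]
test_data = data[test_index, ]
# Tune mtry and nodesize by out-of-bag error.
tune_data = tune(Surv(time, status) ~ ., data = train_data, ntreeTry = 1000, trace = TRUE)
print(tune_data$rf)
# Fit the random survival forest on all genes with the tuned parameters.
v.obj <- rfsrc(Surv(time, status) ~ ., data = train_data, nodesize = 2, mtry = 34, ntree = 1000, importance = TRUE)
print(v.obj)
plot(v.obj)
pred = predict(v.obj, train_data)
pred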

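#1st simplify#
# v.max is used but never defined in the original listing; this call is assumed by analogy with v.max_2nd below.
v.max <- max.subtree(v.obj)
v.max$topvars
# Keep only the variables selected by minimal depth.
train_data_1st = train_data[, c(v.max$topvars, "time", "status")]
test_data_1st = test_data[, c(v.max$topvars, "time", "status")]
tune_data_1st = tune(Surv(time, status) ~ ., data = train_data_1st, ntreeTry = 1000, trace = TRUE)
print(tune_data_1st$rf)
v.obj_1st <- rfsrc(Surv(time, status) ~ ., data = train_data_1st, nodesize = 1, mtry = 11, ntree = 1000, importance = TRUE)
print(v.obj_1st)
plot(v.obj_1st)
plot.survival(v.obj_1st, cens.model = "rfsrc")
pred_1st = predict(v.obj_1st, train_data_1st)
pred_1st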

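#2nd simplify#
# v.max_1st is used but never defined in the original listing; this call is assumed by analogy with v.max_2nd below.
v.max_1st <- max.subtree(v.obj_1st)
v.max_1st$topvars
# Keep only the variables selected in the second round of screening.
train_data_2nd = train_data[, c(v.max_1st$topvars, "time", "status")]
test_data_2nd = test_data[, c(v.max_1st$topvars, "time", "status")]
tune_data_2nd = tune(Surv(time, status) ~ ., data = train_data_2nd, ntreeTry = 1000, trace = TRUE)
print(tune_data_2nd$rf)
v.obj_2nd <- rfsrc(Surv(time, status) ~ ., data = train_data_2nd, nodesize = 2, mtry = 10, ntree = 1000, importance = TRUE)
print(v.obj_2nd)
plot(v.obj_2nd)
# Marginal effect of each gene on survival at 60 months.
plot.variable(v.obj_2nd, surv.type = "surv", time = 60, plots.per.page = 6)
pred_2nd = predict(v.obj_2nd, train_data_2nd)
pred_2nd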

#minimal_depth
v.max_2nd <- max.subtree(v.obj_2nd)
print(round(v.max_2nd$order, 3))
#the minimal depth is the first order depth
print(round(v.max_2nd$order[, 1], 3))
minimal_depth_2nd = round(v.max_2nd$order[, 1], 3)
print(v.max_2nd$threshold)
v.max_2nd$topvars
#test#
pred_test_2nd = predict(v.obj_2nd, test_data_2nd)
#tdroc#
# Time-dependent ROC in the training set at 12, 36 and 60 months.
ROC_train_2nd = timeROC(T = train_data_2nd$time, delta = train_data_2nd$status, marker = pred_2nd$predicted, cause = 1, iid = TRUE, times = c(12, 36, 60))
plot(ROC_train_2nd, time = 12, add = FALSE)
plot(ROC_train_2nd, time = 36, add = FALSE)
plot(ROC_train_2nd, time = 60, add = FALSE)

The magnitude of the v.obj_2nd score represents the tumor recurrence risk: passing the model (v.obj_2nd) and the gene expression data of the patient (data) to the predict function of the stats package of the R language program yields the tumor recurrence risk score of the patient. The names of the 26 genes in the v.obj_2nd model and the minimal depth values of the maximal subtrees in which they are located are shown in Table 1.

TABLE 1 Minimal depth values of the maximal subtrees of the 26 genes in the random survival forest model v.obj_2nd

The non-linear relationship between gene expression values and the recurrence-free survival of patients is shown by a variable marginal effect plot, see FIG. 5.

After the tumor recurrence prediction model v.obj_2nd was constructed in the training set by the random survival forest method, its predictive performance was evaluated by the AUC (area under the curve) of the time-dependent ROC (receiver operating characteristic) curve. The AUC ranges from 0 to 1, and a larger AUC indicates better prediction. For the random survival forest model v.obj_2nd, AUC_{1 year} = 0.995, AUC_{3 years} = 0.999 and AUC_{5 years} = 0.993, see FIG. 6.

(5) Evaluating model prediction effects in test set

Gene expression matrices were obtained for all samples in the test set, and the expression values of the 26 genes of the random survival forest model v.obj_2nd (NVL, ACER, JUP, PLAT, JUNB, IGDCC3, ANKRD27, NUP50, GIT1, PRKCDBP, TDO2, LOC100287896, MCMBP, SNED1, ESM1, MARCO, FAM46C, LMO2, HAUS8, TTC13, ADNP2, RSBN1L, RAVER2, SNORA21, PNRC1, PJA1) were extracted. The constructed model v.obj_2nd was loaded into the R language program, and the predict function was called with the model and the 26-gene expression matrix of the test set samples as input parameters, giving the tumor recurrence risk scores in the test set. Patients were divided into a high-risk group (recurrence risk score higher than 6.10) and a low-risk group (recurrence risk score lower than 6.10) according to the median tumor recurrence risk score of stage II colorectal cancer patients (6.10 in this example). Kaplan-Meier survival curves were drawn for the two groups, the logrank test was used to test whether recurrence-free survival differed significantly between them, and the Mantel-Haenszel method was used to calculate the HR (hazard ratio) and its 95% CI (confidence interval). With the test set divided in this way, the logrank p value between the two groups was 0.0165, and the HR of the high recurrence risk group was 1.901 (95% CI 1.124-3.215), as shown in FIG. 7.
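The evaluation steps above (Kaplan-Meier curves, logrank test, hazard ratio with confidence interval) can be sketched with the survival package on simulated data. The cohort below is synthetic and its variable names are hypothetical. The hazard ratio uses the one-step estimator built from the logrank quantities, which is one common reading of "Mantel-Haenszel method"; whether the patent used exactly this formula is an assumption.

```r
library(survival)

set.seed(2018)
n <- 100
# Simulated cohort: the high-risk group has twice the recurrence hazard.
group <- factor(rep(c("low", "high"), each = n), levels = c("low", "high"))
event_time <- rexp(2 * n, rate = ifelse(group == "high", 0.04, 0.02))
censor_time <- runif(2 * n, min = 12, max = 90)       # censoring times
follow_up <- pmin(event_time, censor_time)            # months of follow-up
status <- as.integer(event_time <= censor_time)       # 1 = recurrence, 0 = censored

# Kaplan-Meier curves for the two groups.
km <- survfit(Surv(follow_up, status) ~ group)
plot(km, lty = 1:2, xlab = "Months", ylab = "Recurrence-free survival")

# Logrank test of the difference between the two curves.
lr <- survdiff(Surv(follow_up, status) ~ group)
p_value <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)

# One-step hazard ratio (high vs low) from the logrank quantities:
# log(HR) = (O - E) / V, with standard error 1 / sqrt(V).
o_minus_e <- lr$obs[2] - lr$exp[2]
v <- lr$var[2, 2]
hr <- exp(o_minus_e / v)
ci <- exp(o_minus_e / v + c(-1, 1) * qnorm(0.975) / sqrt(v))
print(c(p = p_value, HR = hr, lower = ci[1], upper = ci[2]))
```

With real data, `group` would come from the median split of the predicted risk scores, and `follow_up` and `status` from the RFS time and recurrence status of the test set patients.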

Having thus described one embodiment of the present invention, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and additions may be made to the embodiment, methodology, selected model and programming software without departing from the principles and spirit of the invention, which shall be deemed to be within the scope of the invention.

Claims

1. A prognostic marker gene for use in prediction of stage II colorectal cancer recurrence, characterised in that the marker gene is NVL, ACER, JUP, PLAT, JUNB, IGDCC3, ANKRD27, NUP50, GIT1, PRKCDBP, TDO2, LOC100287896, MCMBP, SNED1, ESM1, MARCO, FAM46C, LMO2, HAUS8, TTC13, ADNP2, RSBN1L, RAVER2, SNORA21, PNRC1 and PJA1.

2. Use of the prognostic marker gene according to claim 1 for constructing a model for predicting stage II colorectal cancer recurrence, wherein the expression of the model in the R language program is: v.obj_2nd score = predict(v.obj_2nd, data), where v.obj_2nd is the tumor recurrence prediction model for stage II colorectal cancer patients, constructed by the random survival forest method from the expression values of the prognostic marker genes related to recurrence, and v.obj_2nd score is the tumor recurrence risk score.

3. The use according to claim 2, characterized in that the model building method comprises the following steps: (1) obtaining a gene expression dataset: obtaining stage II colorectal cancer tumor samples and the mRNA expression data of each sample, and detecting and removing outliers by cluster analysis; (2) identifying common differentially expressed genes associated with recurrence in stage II colorectal cancer patients: using gene chip meta-analysis, calculating the p value of each gene in each data set by the logrank test; then combining the p values of each gene across the data sets by the minP method to obtain the p_minP value of each gene; correcting the p_minP values by the Benjamini-Hochberg method to obtain the FDR value of each gene, and screening the common differentially expressed genes associated with recurrence in stage II colorectal cancer patients according to the criterion FDR < 0.1; (3) signaling pathway enrichment analysis of the differentially expressed genes: performing signaling pathway enrichment analysis on the common differentially expressed genes identified in step (2) using the Metascape database, and screening the pathways significantly enriched for the differentially expressed genes with p = 0.01 as the threshold, i.e., pathways with a p value less than 0.01; (4) constructing the model: constructing the tumor recurrence prediction model v.obj_2nd for stage II colorectal cancer patients by the random survival forest method from the expression values of the common differentially expressed genes identified in step (2),

the expression of the model in the R language program being: v.obj_2nd score = predict(v.obj_2nd, data).