CN117038092A

CN117038092A - Pancreatic cancer prognosis model construction method based on Cox regression analysis

Info

Publication number: CN117038092A
Application number: CN202311057905.8A
Authority: CN
Inventors: 刘建平; 郑剑锋; 刘尊龙; 任飞
Original assignee: Sun Yat Sen Memorial Hospital Sun Yat Sen University
Current assignee: Sun Yat Sen Memorial Hospital Sun Yat Sen University
Priority date: 2023-08-21
Filing date: 2023-08-21
Publication date: 2023-11-10

Abstract

The invention provides a method for constructing a pancreatic cancer prognosis model based on Cox regression analysis, comprising the following steps: downloading pancreatic cancer expression profiles from the TCGA database and preprocessing them to form sample data; randomly dividing the sample data into a training data set and a verification data set in proportion, further screening the training data set for lncRNAs related to overall survival (OS) using LASSO Cox regression analysis, and calculating a risk score for each patient sample; establishing the prediction model in the verification data set and the complete data set using the lncRNAs screened from the training data set, and dividing the samples into a high-risk group and a low-risk group; constructing a nomogram and performing a consistency test, calibration curve analysis and time-dependent ROC curve analysis; screening out protein-coding genes co-expressed with the prognostic lncRNAs using Pearson correlation analysis; and further selecting and establishing a prognosis prediction model using LASSO Cox regression analysis. The predictive ability of the prognosis model constructed by the invention is better than that of previously published lncRNA models and the TNM staging system, and the model can greatly improve the accuracy of pancreatic cancer prognosis prediction.

Description

Pancreatic cancer prognosis model construction method based on Cox regression analysis

Technical Field

The invention belongs to the field of medicine, and particularly relates to a pancreatic cancer prognosis model construction method based on Cox regression analysis.

Background

Pancreatic cancer is one of the common malignant tumors of the digestive tract and arises most often in the head of the pancreas. Abdominal pain and painless jaundice are common symptoms of pancreatic head cancer. Incidence is relatively high among long-term heavy smokers, people with diabetes, and people whose diets are high in fat and animal protein. The disease occurs mostly in middle-aged and elderly people; male patients far outnumber premenopausal women, while the incidence in postmenopausal women is similar to that in men. The cause of the disease is not clear, but some environmental factors have been found to be related to the occurrence of pancreatic cancer. The main risk factors are smoking, alcohol consumption (including beer), diabetes, cholelithiasis, diets high in fat, protein and refined flour, chronic pancreatitis, and gastrectomy. The mortality rate is extremely high.

Up to now, surgery is the only method by which pancreatic cancer can be cured, but fewer than 20% of patients with a confirmed diagnosis can benefit from resection, and the risk of recurrence after radical resection is high. In terms of survival time, one study showed that the 5-year survival rate of patients with surgically resectable pancreatic cancer was about 20%. In the last decade, research on pancreatic cancer treatment has achieved some success, for example with increasing options in adjuvant therapy and in the treatment of metastatic disease. In terms of adjuvant therapy, both the modified FOLFIRINOX (mFOLFIRINOX) regimen and the gemcitabine regimen were able to increase disease-free survival and overall survival in pancreatic cancer patients. The results of one study showed that the mFOLFIRINOX group had a 3-year disease-free survival of 39.7% and an overall survival of 63.4%, while the gemcitabine group had a disease-free survival of 21.4% and an overall survival of 48.6%. ASCO guidelines recommend mFOLFIRINOX as the preferred postoperative adjuvant treatment, with gemcitabine preferred when monotherapy is used. However, despite adjuvant treatment, the recurrence rate is still high, with 69% to 75% of patients recurring within 2 years. In terms of metastatic pancreatic cancer treatment, the FOLFIRINOX regimen is an effective first-line treatment option for patients with good ECOG performance status, with a 4.3-month increase in median survival compared with gemcitabine (11.1 months vs 6.8 months).

Despite the aforementioned favorable advances in therapy, prognosis of pancreatic cancer remains poor, requiring more research, especially in personalized therapies.

Disclosure of Invention

The invention aims to provide a method for constructing a pancreatic cancer prognosis model based on Cox regression analysis. The method constructs the pancreatic cancer prognosis model using LASSO Cox regression analysis; the model is an independent prognostic factor for pancreatic cancer, its predictive ability is verified in a training data set, a verification data set and a complete data set, and it has higher prediction accuracy than previously published lncRNA models and the TNM staging system.

In order to achieve the above object, in the present invention, there is provided a method for constructing a prognosis model of pancreatic cancer based on Cox regression analysis, the method comprising:

step one, downloading a pancreatic cancer expression profile from a TCGA database, and preprocessing the pancreatic cancer expression profile to form sample data;

step two, the sample data form a complete data set; the complete data set is randomly divided into a training data set and a verification data set in proportion; LASSO Cox regression analysis is used on the training data set to further screen the lncRNAs related to overall survival (OS); the risk score of each patient sample is then calculated using a risk score formula based on the regression coefficients and expression levels of the lncRNAs; and the patients are divided into a high-risk group and a low-risk group using the median risk score in the prediction model as the critical value;

step three, establishing the prediction model in the verification data set and the complete data set using the lncRNAs screened from the training data set, and dividing the samples of the verification data set and the complete data set into a high-risk group and a low-risk group based on the critical value of the median risk score in the model;

step four, constructing a nomogram combining the model and clinicopathological features based on the independent prognostic factors for pancreatic cancer screened by univariate and multivariate Cox regression analysis, and performing a consistency test, calibration curve analysis and time-dependent ROC curve analysis;

step five, using Pearson correlation analysis to screen out protein coding genes co-expressed with the prognosis lncRNAs;

wherein the preprocessing in step one comprises: sorting and annotating the genes; processing the pancreatic cancer expression profiles with the edgeR package of the R language; after comparison against Ensembl IDs, separating out the lncRNAs and retaining those with an average expression value greater than 1 for further analysis; screening differentially expressed lncRNAs with |logFC| > 1 and p < 0.05 from the pancreatic cancer expression profiles; performing univariate Cox regression analysis to screen out the lncRNAs related to OS; and deleting sample data that have incomplete clinical information, have a survival time of 0, or are duplicated;

the lncRNAs related to OS are AL031658.1, ABCA9-AS1, DNAH17-AS1, AP003086.1 and AC018755.4;

the risk score of each sample is calculated as: risk score = (-0.23189 × AL031658.1 expression level) + (0.20984 × ABCA9-AS1 expression level) + (0.03709 × DNAH17-AS1 expression level) + (-0.26114 × AP003086.1 expression level) + (0.15556 × AC018755.4 expression level), wherein the risk score formula is the prediction model and the median of the risk scores is the critical value: samples with a risk score greater than or equal to the critical value form the high-risk group, and samples with a risk score less than the critical value form the low-risk group.

Further, the method comprises step six: using LASSO Cox regression analysis to further select and build an lncRNA-based prediction model for predicting the 3-year and 5-year survival rates of pancreatic cancer patients.

Further, in step three, the patients of the verification data set and the complete data set are divided into a high-risk group and a low-risk group according to the critical value, and Kaplan-Meier analysis and the log-rank test are performed.

Further, in step four, the clinical pathology features include lncRNAs model, age, sex, drinking, history of radiotherapy, history of chemotherapy, family history, smoking, tumor differentiation, and pathological stage.

Further, in step five, Pearson correlation analysis is used to screen out the protein-coding genes co-expressed with the prognostic lncRNAs, with screening criteria of an absolute correlation coefficient greater than 0.4 and p < 0.001; the selected protein-coding genes are subjected to KEGG and GO functional enrichment analysis with ClueGO and plotted with CluePedia in Cytoscape, so as to find the pathways through which the prognostic lncRNAs act.

Further, the pancreatic cancer independent prognosis influencing factors are lncRNAs model, history of chemotherapy and pathological stage.

Further, a nomogram is constructed based on the independent prognostic factors for pancreatic cancer, in which each pancreatic cancer patient sample receives a score for the risk score level of the lncRNAs model, a score for chemotherapy history and a score for pathological stage, and the three scores are added to obtain the total nomogram score used to predict the patient's 3-year and 5-year survival rates;

the nomogram predictions are plotted against the actual observations to evaluate their consistency for the 3-year and 5-year survival rates of the pancreatic cancer patients.

Further, the step of performing KEGG and GO functional enrichment analysis and plotting on the selected protein-coding genes using ClueGO and CluePedia of Cytoscape specifically includes:

extracting all mRNAs from the pancreatic cancer data downloaded from TCGA, and screening the mRNAs that are differentially expressed between pancreatic cancer and paracancerous normal tissues with |logFC| > 2 and p < 0.05;

performing Pearson correlation analysis between the screened mRNAs and the 5 previously screened lncRNAs;

performing KEGG and GO functional enrichment analysis using ClueGO of Cytoscape and plotting the results with CluePedia.

The beneficial technical effects of the invention are at least as follows:

(1) The predictive ability of the prognosis model constructed by the invention is better than that of previously published lncRNA models and the TNM staging system, and when the prognosis model is used in combination with the TNM staging system, the accuracy of pancreatic cancer prognosis prediction can be greatly improved;

(2) A comprehensive clinical prognosis prediction model for pancreatic cancer can be constructed based on the constructed model, chemotherapy history and pathological stage, and it has good predictive ability;

(3) KEGG and GO enrichment analysis of the mRNAs co-expressed with the 5 lncRNAs finds 30 pathways with p ≤ 0.05 and 3 pathways with p ≤ 0.00001, the latter being lymphocyte activation, B cell activation and leukocyte activation. This indicates that the influence of the 5 lncRNAs on pancreatic cancer prognosis is related to immunity to a certain extent, so that more targeted treatment can be combined in clinical practice to improve the survival rate of pancreatic cancer patients.

Drawings

The invention will be further described with reference to the accompanying drawings, in which embodiments do not constitute any limitation of the invention, and other drawings can be obtained by one of ordinary skill in the art without inventive effort from the following drawings.

FIG. 1a is a flow chart of a method for constructing a prognosis model of pancreatic cancer based on Cox regression analysis.

FIG. 1b is a flowchart showing a specific implementation of a method for constructing a pancreatic cancer prognosis model based on Cox regression analysis.

FIG. 2 is a flow chart of the present invention for predicting the function of lncRNAs.

FIG. 3 is a volcano plot of the expression profiles of the 92 differentially expressed lncRNAs (DElncRNAs) in this example, of which 3 were up-regulated and 89 were down-regulated in pancreatic cancer.

FIG. 4 shows the results of LASSO Cox regression analysis in the training data set of this example: (A) selection of the tuning parameter lambda for OS by LASSO; the model is penalized while lambda is progressively reduced, so as to obtain the best-performing model at the minimum of the parameter, and the number of lncRNA variables in the figure increases as lambda decreases; (B) LASSO Cox regression coefficient profiles of the 5 OS-related lncRNAs; when 5 variables are included, the partial likelihood deviance of the model is smallest and the minimum parameter is obtained, i.e. the model performs best when 5 lncRNAs are included.

FIG. 5 shows the patient distribution, survival analysis and time-dependent ROC curve analysis results in the training data set of this example: (A) survival status of the patients; (B) cumulative distribution of risk scores; (C) time-dependent ROC curves of the lncRNAs model for the 3-year and 5-year survival rates of pancreatic cancer patients; (D) overall survival curves of pancreatic cancer patients.

FIG. 6 is a heat map of the 5 lncRNA expression profiles in the high-risk and low-risk groups in the training data set of this example.

FIG. 7 shows the patient distribution, survival analysis and time-dependent ROC curve analysis results in the verification data set of this example: (A) survival status of the patients; (B) cumulative distribution of risk scores; (C) time-dependent ROC curves of the lncRNAs model for the 3-year and 5-year survival rates of pancreatic cancer patients; (D) overall survival curves of pancreatic cancer patients.

FIG. 8 is a heat map of the 5 lncRNA expression profiles in the high-risk and low-risk groups in the verification data set of this example.

FIG. 9 shows the patient distribution, survival analysis and time-dependent ROC curve analysis results in the complete data set of this example: (A) survival status of the patients; (B) cumulative distribution of risk scores; (C) time-dependent ROC curves of the lncRNAs model for the 3-year and 5-year survival rates of pancreatic cancer patients; (D) overall survival curves of pancreatic cancer patients.

FIG. 10 is a heat map of the 5 lncRNA expression profiles in the high-risk and low-risk groups in the complete data set of this example.

FIG. 11 shows the time-dependent ROC curves of this example: (A) time-dependent ROC curves for the 3-year and 5-year survival rates of pancreatic cancer patients using pathological stage alone; (B) time-dependent ROC curves for the 3-year survival rate of pancreatic cancer patients using the lncRNAs model alone and the lncRNAs model combined with pathological stage; (C) time-dependent ROC curves for the 5-year survival rate of pancreatic cancer patients using the lncRNAs model alone, pathological stage alone, and the lncRNAs model combined with pathological stage.

FIG. 12 is the nomogram of this example, constructed by combining the lncRNAs model, chemotherapy history and pathological stage. The patient's chemotherapy history, pathological stage and lncRNAs model risk score each have a corresponding score; the three scores are added to obtain a total score, which then corresponds to the predicted 3-year and 5-year survival rates.

FIG. 13 shows the verification results of the prognostic predictive value of the nomogram of this example in pancreatic cancer: (A) time-dependent ROC curves for the 3-year and 5-year survival rates of pancreatic cancer patients; (B) calibration curve for the 3-year survival rate of pancreatic cancer patients, in which the nomogram predictions are basically consistent with the actual observations; (C) calibration curve for the 5-year survival rate of pancreatic cancer patients, in which the nomogram predictions are basically consistent with the actual observations.

FIG. 14 shows the KEGG and GO enrichment pathways (p ≤ 0.05) of the mRNAs co-expressed with the 5 lncRNAs in this example.

FIG. 15 shows the KEGG and GO enrichment pathways (p ≤ 0.00001) of the mRNAs co-expressed with the 5 lncRNAs in this example.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.

In the description of the present invention, it should be noted that the positional or positional relationship indicated by the terms such as "upper", "lower", "inner", "outer", "top/bottom", etc. are based on the positional or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

As shown in fig. 1a, the method for constructing a pancreatic cancer prognosis model based on Cox regression analysis comprises the following steps:

s1, downloading a pancreatic cancer expression profile from a TCGA database, and preprocessing the pancreatic cancer expression profile to form sample data; wherein the sample data is a complete data set;

s2, randomly dividing the complete data set into a training data set and a verification data set in proportion, using LASSO Cox regression analysis on the training data set to further screen the lncRNAs related to overall survival (OS), calculating the risk score of each patient sample using a risk score formula based on the regression coefficients and expression levels of the lncRNAs, and dividing the patients into a high-risk group and a low-risk group using the median risk score in the prediction model as the critical value;

s3, establishing the prediction model in the verification data set and the complete data set using the lncRNAs screened from the training data set, and dividing the samples of the verification data set and the complete data set into a high-risk group and a low-risk group based on the critical value of the median risk score in the model;

s4, constructing a nomogram combining the model and clinicopathological features based on the independent prognostic factors for pancreatic cancer screened by univariate and multivariate Cox regression analysis, and performing a consistency test, calibration curve analysis and time-dependent ROC curve analysis;

s5, screening out protein-coding genes co-expressed with the prognostic lncRNAs using Pearson correlation analysis.

Wherein the preprocessing in step S1 includes: sorting and annotating the genes; processing the pancreatic cancer expression profiles with the edgeR package of the R language; after comparison against Ensembl IDs, separating out the lncRNAs and retaining those with an average expression value greater than 1 for further analysis; screening differentially expressed lncRNAs with |logFC| > 1 and p < 0.05 from the pancreatic cancer expression profiles; performing univariate Cox regression analysis to screen out the lncRNAs related to OS; and deleting sample data that have incomplete clinical information, have a survival time of 0, or are duplicated;

the risk score of each sample is calculated as: risk score = (-0.23189 × AL031658.1 expression level) + (0.20984 × ABCA9-AS1 expression level) + (0.03709 × DNAH17-AS1 expression level) + (-0.26114 × AP003086.1 expression level) + (0.15556 × AC018755.4 expression level), wherein the risk score formula is the prediction model and the median of the risk scores is the critical value: samples with a risk score greater than or equal to the critical value form the high-risk group, and samples with a risk score less than the critical value form the low-risk group.
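The risk score formula and the median split can be illustrated with a minimal Python sketch. The five coefficients are those stated in the formula; the sample identifiers and expression values are hypothetical and serve only to show the arithmetic, and the function names are introduced here for illustration.

```python
from statistics import median

# Regression coefficients as stated in the risk score formula.
COEFFICIENTS = {
    "AL031658.1": -0.23189,
    "ABCA9-AS1": 0.20984,
    "DNAH17-AS1": 0.03709,
    "AP003086.1": -0.26114,
    "AC018755.4": 0.15556,
}


def risk_score(expression):
    """Weighted sum of the five lncRNA expression levels for one sample."""
    return sum(coef * expression[name] for name, coef in COEFFICIENTS.items())


def assign_groups(scores):
    """Split at the median: >= median is high risk, < median is low risk."""
    cutoff = median(scores.values())
    groups = {sid: ("high" if s >= cutoff else "low") for sid, s in scores.items()}
    return cutoff, groups


# Hypothetical expression levels, for illustration only.
samples = {
    "P1": {"AL031658.1": 5.0, "ABCA9-AS1": 0.0, "DNAH17-AS1": 0.0,
           "AP003086.1": 5.0, "AC018755.4": 0.0},
    "P2": {"AL031658.1": 0.0, "ABCA9-AS1": 5.0, "DNAH17-AS1": 5.0,
           "AP003086.1": 0.0, "AC018755.4": 5.0},
    "P3": {name: 1.0 for name in COEFFICIENTS},
}
scores = {sid: risk_score(expr) for sid, expr in samples.items()}
cutoff, groups = assign_groups(scores)
print(cutoff, groups)
```

In this toy input, the sample with high expression of the two negatively weighted lncRNAs falls in the low-risk group, which matches the protective interpretation of negative coefficients.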

As a preferred solution, this embodiment further includes the step of, after step S5:

using LASSO Cox regression analysis to further select and build an lncRNA-based prediction model for predicting the 3-year and 5-year survival rates of pancreatic cancer patients.

In the following, a method for constructing a prognosis model of pancreatic cancer based on Cox regression analysis is described in detail with reference to fig. 1b and 2.

In this example, expression data and clinical information for 182 tissue samples (pancreatic cancer tissues and adjacent normal tissues) were downloaded from the TCGA database (https://portal.gdc.cancer.gov/). The data were downloaded on October 12, 2018.

The preprocessing steps are as follows: the genes in the 182 tissues were sorted and annotated with Perl, the expression profiles were then processed with the edgeR package of the R language, and after comparison against Ensembl IDs the lncRNAs were separated out. lncRNAs with an average expression value greater than 1 were retained for further analysis. Differentially expressed lncRNAs with |logFC| > 1 and p < 0.05 between pancreatic cancer tissues and adjacent normal tissues were then screened, and univariate Cox regression analysis was performed on them to screen out the lncRNAs related to OS. Sample data that had incomplete clinical information, had a survival time of 0, or were duplicated were deleted.
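The two expression filters described above can be sketched in Python as follows. This is an illustration only: it assumes the logFC and p-values have already been computed elsewhere (the example obtains them with edgeR in R), and the record layout, gene names and numbers are hypothetical.

```python
from statistics import mean


def screen_lncrnas(records, min_mean_expr=1.0, min_abs_logfc=1.0, max_p=0.05):
    """Return (up, down) gene lists passing the expression and differential filters."""
    up, down = [], []
    for rec in records:
        if mean(rec["expression"]) <= min_mean_expr:
            continue  # average expression value must be greater than 1
        if abs(rec["logFC"]) <= min_abs_logfc or rec["p_value"] >= max_p:
            continue  # require |logFC| > 1 and p < 0.05
        (up if rec["logFC"] > 0 else down).append(rec["gene"])
    return up, down


# Hypothetical records; names and values are made up for illustration.
records = [
    {"gene": "lnc_A", "expression": [2.0, 3.0, 4.0], "logFC": 1.8, "p_value": 0.01},
    {"gene": "lnc_B", "expression": [0.2, 0.4, 0.3], "logFC": -2.5, "p_value": 0.001},
    {"gene": "lnc_C", "expression": [5.0, 6.0, 7.0], "logFC": -1.4, "p_value": 0.03},
    {"gene": "lnc_D", "expression": [5.0, 6.0, 7.0], "logFC": 0.6, "p_value": 0.001},
    {"gene": "lnc_E", "expression": [5.0, 6.0, 7.0], "logFC": 2.2, "p_value": 0.20},
]
up, down = screen_lncrnas(records)
print("up-regulated:", up, "down-regulated:", down)
```

Splitting the result by the sign of logFC mirrors the up-regulated and down-regulated counts reported for the volcano plot.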

Specifically, data sorting with Perl showed that the 182 tissues comprised 178 pancreatic cancer tissues (n=178) and 4 paracancerous normal tissues (n=4). After comparison against Ensembl IDs with edgeR, 14447 lncRNAs were identified, of which 8090 had an average expression value greater than 1. Further screening yielded 92 differentially expressed lncRNAs with |logFC| > 1 and p < 0.05, of which 3 were up-regulated and 89 were down-regulated in pancreatic cancer tissues, as shown in FIG. 3. Univariate Cox regression analysis then screened out a total of 5 OS-related lncRNAs (p < 0.05). After deleting sample data that had incomplete clinical information, had a survival time of 0, or were duplicated, 174 pancreatic cancer tissue samples remained.
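To make the univariate Cox screening step concrete, the following is a minimal pure-Python sketch that fits a single-covariate Cox model by Newton-Raphson on the partial likelihood (Breslow handling of ties) and reports a Wald p-value. It is not the software used in this example; the function name and the toy survival data are hypothetical.

```python
import math


def fit_univariate_cox(times, events, x, iterations=50, tol=1e-9):
    """Fit h(t | x) = h0(t) * exp(beta * x); return (beta, standard error, Wald p)."""
    n = len(times)
    beta = 0.0
    hess = 0.0
    for _ in range(iterations):
        grad, hess = 0.0, 0.0
        for i in range(n):
            if not events[i]:
                continue  # censored samples contribute only through risk sets
            # Risk set: samples still under observation at this event time.
            risk = [j for j in range(n) if times[j] >= times[i]]
            w = [math.exp(beta * x[j]) for j in risk]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
            grad += x[i] - s1 / s0
            hess -= s2 / s0 - (s1 / s0) ** 2
        if hess == 0.0:
            raise ValueError("covariate does not vary within any risk set")
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    se = math.sqrt(-1.0 / hess)
    p = math.erfc(abs(beta / se) / math.sqrt(2.0))  # two-sided Wald test
    return beta, se, p


# Hypothetical data: survival time, event flag (1 = death), expression level.
times = [5, 8, 12, 15, 20, 24, 30, 36]
events = [1, 1, 1, 0, 1, 1, 0, 1]
expr = [3.1, 2.0, 2.7, 1.2, 1.9, 0.8, 1.5, 1.1]
beta, se, p = fit_univariate_cox(times, events, expr)
print(f"beta={beta:.3f}, hazard ratio={math.exp(beta):.2f}, p={p:.4f}")
```

A positive beta (hazard ratio above 1) marks a candidate risk factor and a negative beta a candidate protective factor; an lncRNA would be retained when p < 0.05.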

The samples containing all the required data were then randomly split into a training data set and an internal verification data set at a 1:1 ratio. LASSO Cox regression analysis was applied to the training data set to further screen the OS-related lncRNAs and prevent overfitting. Next, the risk score of each patient sample was calculated using a risk score formula based on the regression coefficients and expression levels of the lncRNAs, thereby constructing a prognosis prediction model for pancreatic cancer. Finally, the patients were divided into a high-risk group and a low-risk group using the median risk score in the model as the critical value.
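For reference, LASSO Cox regression in its standard textbook form maximizes the Cox partial log-likelihood minus an L1 penalty. The notation below is introduced here for illustration and does not appear in the original text; some software scales the likelihood term by the sample size, which changes the numerical value of the penalty parameter but not the idea.

```latex
\hat{\beta} = \arg\max_{\beta}
\left[
  \sum_{i:\,\delta_i = 1}
  \left(
    x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right)
  \right)
  - \lambda \sum_{k=1}^{p} \lvert \beta_k \rvert
\right]
```

Here x_i is the vector of candidate lncRNA expression levels of sample i, δ_i = 1 marks an observed death, R(t_i) is the set of samples still at risk at time t_i, and λ ≥ 0 is the penalty parameter. Larger values of λ shrink more coefficients to exactly zero, which is why the method both selects a subset of lncRNAs and limits overfitting.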

Specifically, the 174 pancreatic cancer tissue samples were randomly divided into a training data set (n=87) and a verification data set (n=87) at a 1:1 ratio, as shown in Table 1.
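A reproducible 1:1 random split of 174 samples can be sketched in Python as below. The sample identifiers and the seed are assumptions made for illustration; the original text does not state how the randomization was seeded.

```python
import random


def split_half(sample_ids, seed=0):
    """Shuffle a copy of the identifiers and cut it into two equal halves."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers standing in for the 174 retained samples.
sample_ids = [f"S{i:03d}" for i in range(174)]
training, verification = split_half(sample_ids)
print(len(training), len(verification))
```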

Table 1 clinical information of 174 pancreatic cancer patient samples

In the training data set, LASSO Cox regression analysis further confirmed that the 5 lncRNAs were strongly correlated with OS, as shown in FIG. 4; the 5 lncRNAs were therefore used to construct the prognosis prediction model for pancreatic cancer, and their details are shown in Table 2.

TABLE 2 details of 5 lncRNA significantly associated with OS in pancreatic cancer patients

The Ensembl IDs were obtained from Ensembl (http://asia.ensembl.org/). The risk score is defined as the sum, over the lncRNAs, of the regression coefficient from multivariate regression analysis multiplied by the lncRNA expression level. The risk score of each pancreatic cancer sample was therefore calculated as follows: risk score = (-0.23189 × AL031658.1 expression level) + (0.20984 × ABCA9-AS1 expression level) + (0.03709 × DNAH17-AS1 expression level) + (-0.26114 × AP003086.1 expression level) + (0.15556 × AC018755.4 expression level). Based on the risk score of each sample, a prognosis prediction model based on the 5 lncRNAs (the 5-lncRNA model) was constructed. The samples in the training data set were divided into a high-risk group and a low-risk group using the median risk score in the model as the critical value, as shown in FIG. 5.

Kaplan-Meier analysis and the log-rank test showed that the survival rate of the low-risk group was significantly higher than that of the high-risk group (p < 0.01). To evaluate the predictive value of the model for pancreatic cancer prognosis, time-dependent ROC curve analysis was performed; the AUCs of the 5-lncRNA model for the 3-year and 5-year survival rates were 0.748 and 0.995, respectively, indicating good predictive ability. Among the 5 lncRNAs, the regression coefficients of AL031658.1 and AP003086.1 are negative, so these lncRNAs appear to be protective factors for pancreatic cancer: high expression gives a lower risk score. The regression coefficients of ABCA9-AS1, DNAH17-AS1 and AC018755.4 are positive, so these lncRNAs may be risk factors for pancreatic cancer: high expression gives a higher risk score. To show the expression of the 5 lncRNAs in the training data set, a heat map ordered by risk score was drawn, as shown in FIG. 6.
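The Kaplan-Meier estimate and the two-group log-rank test can be sketched in pure Python as follows. This is a minimal illustration, not the software used in this example; the function names and the survival data for the two groups are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] evaluated at each distinct event time."""
    surv, curve = 1.0, []
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def log_rank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p-value) with 1 degree of freedom."""
    event_times = {u for u, e in zip(times_a + times_b, events_a + events_b) if e}
    observed_a = expected_a = variance = 0.0
    for t in sorted(event_times):
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical survival times (months) and event flags (1 = death).
high_t, high_e = [2, 3, 4, 5, 6], [1, 1, 1, 1, 1]
low_t, low_e = [10, 12, 14, 15, 16], [1, 0, 1, 0, 1]
chi2, p = log_rank(high_t, high_e, low_t, low_e)
print(kaplan_meier(low_t, low_e), round(chi2, 2), p)
```

The p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)), which avoids any dependency beyond the standard library.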

The prediction model was established in the verification data set and the complete data set using the lncRNAs screened from the training data set, and the samples of the verification data set and the complete data set were divided into a high-risk group and a low-risk group based on the critical value of the median risk score in the model. The predictive ability of the model in the complete data set was then compared with that of previously published lncRNA models and the TNM staging system, and the predictive ability of the model when used in combination with TNM staging was explored.

Based on the independent prognostic factors for pancreatic cancer screened by univariate and multivariate Cox regression analysis, a nomogram combining the model and clinicopathological features was constructed to predict the prognosis of pancreatic cancer. To verify the prognostic predictive value of the nomogram, a consistency test, calibration curve analysis and time-dependent ROC curve analysis were performed.
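A consistency test for a survival model is commonly reported as Harrell's concordance index (C-index), the fraction of comparable patient pairs in which the patient with the higher predicted risk dies first. Assuming that is the statistic intended here, a minimal Python sketch is shown below; the function name and data are hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index: 1.0 is perfect ranking, 0.5 is no better than chance."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when sample i has an observed event
            # strictly before the follow-up time of sample j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times, event flags and predicted risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
predicted_risk = [0.9, 0.7, 0.4, 0.1]
print(concordance_index(times, events, predicted_risk))
```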

In particular, as shown in FIGS. 7-10, to further verify the prognostic value of the 5-lncRNA model in pancreatic cancer, the same procedure was performed in both the verification data set and the complete data set. The patients of the verification data set and the complete data set were divided into high-risk and low-risk groups using the median risk score in the 5-lncRNA model as the critical value, and Kaplan-Meier analysis and the log-rank test were then performed. The survival rates of the two groups differed significantly, with the low-risk group having a significantly higher survival rate than the high-risk group in each data set (p < 0.01). Time-dependent ROC curve analysis showed that the AUCs for the 3-year and 5-year survival rates were 0.775 and 0.907 in the verification data set and 0.746 and 0.897 in the complete data set, indicating strong predictive ability. Heat maps of the expression profiles of the 5 lncRNAs in the verification data set and the complete data set were drawn and ordered by risk score.
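The idea behind an AUC at a fixed time horizon can be illustrated with a deliberately simplified Python sketch. It treats patients who died by the horizon as cases and patients followed beyond the horizon as controls, and it simply drops patients censored before the horizon. Established time-dependent ROC estimators weight for censoring instead, so this sketch is an assumption-laden illustration and would not reproduce the AUC values reported above. The function name and data are hypothetical.

```python
def auc_at_horizon(times, events, scores, horizon):
    """Probability that a case has a higher risk score than a control at the horizon."""
    cases = [s for t, e, s in zip(times, events, scores) if e and t <= horizon]
    controls = [s for t, s in zip(times, scores) if t > horizon]
    # Samples censored at or before the horizon are in neither list (simplification).
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical follow-up times (years), event flags and risk scores.
times = [1, 2, 2.5, 4, 6, 7]
events = [1, 1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.5, 0.3, 0.4, 0.1]
print(auc_at_horizon(times, events, scores, horizon=3))
```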

Protein-coding genes for pancreatic cancer were downloaded from TCGA and, based on the expression levels of the prognostic lncRNAs and the protein-coding genes, Pearson correlation analysis was used to screen out the protein-coding genes co-expressed with the prognostic lncRNAs, as shown in FIG. 2. The screening criteria were an absolute correlation coefficient > 0.4 and p < 0.001. The screened protein-coding genes were subjected to KEGG and GO functional enrichment analysis with ClueGO and plotted with CluePedia in Cytoscape, to search for the pathways through which the prognostic lncRNAs act.

In terms of statistical methods, univariate Cox regression analysis was used to screen the OS-related lncRNAs, and LASSO Cox regression analysis was used to further select them and build the lncRNA-based prognosis prediction model, with the regression coefficients of the lncRNAs derived from multivariate Cox regression analysis. Kaplan-Meier analysis and the log-rank test were used to explore whether the model's risk stratification of samples into high-risk and low-risk groups was statistically significant. The specificity and sensitivity of the model's prognosis prediction were analyzed with time-dependent ROC curves, and the prediction accuracy of the model was measured by the area under the curve (AUC), i.e. the area enclosed by the time-dependent ROC curve and the horizontal axis. Kaplan-Meier analysis, the log-rank test and time-dependent ROC curve analysis were also used in the verification and comparison studies. The independence of the model was verified using univariate and multivariate regression analysis, and a consistency test, calibration curve analysis and time-dependent ROC curve analysis were used to verify the prognostic predictive ability of the nomogram. Pearson correlation analysis was used to screen for the protein-coding genes co-expressed with the prognostic lncRNAs.
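The co-expression screen (absolute correlation coefficient > 0.4 and p < 0.001) can be sketched in pure Python. The p-value below uses the Fisher z-transformation with a normal approximation, which is an assumption made to stay within the standard library; statistical packages usually use an exact t-distribution test, so borderline pairs may be classified differently. Function names and expression vectors are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transformation (normal approximation)."""
    r = max(min(r, 1.0 - 1e-12), -1.0 + 1e-12)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_coexpressed(lnc_expr, gene_expr, min_abs_r=0.4, max_p=0.001):
    """Apply the |r| > 0.4 and p < 0.001 screening criteria to one lncRNA-gene pair."""
    r = pearson_r(lnc_expr, gene_expr)
    return abs(r) > min_abs_r and approx_p_value(r, len(lnc_expr)) < max_p


# Hypothetical expression vectors across 20 samples.
lnc = [float(v) for v in range(20)]
gene_related = [2.0 * v + (0.5 if i % 2 == 0 else -0.5) for i, v in enumerate(lnc)]
gene_unrelated = [1.0 if i % 2 == 0 else -1.0 for i in range(20)]
print(is_coexpressed(lnc, gene_related), is_coexpressed(lnc, gene_unrelated))
```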

In this example, the data were randomly split into a training data set and a verification data set at a 1:1 ratio, so there is no significant difference or bias in the composition of clinicopathological features between the two sets, as demonstrated by the results in Table 1. To investigate the relationship between the lncRNAs model and the clinicopathological features in their effect on pancreatic cancer OS, univariate and multivariate Cox regression analyses were performed in the training, verification and complete data sets, respectively, on the lncRNAs model together with age, sex, alcohol consumption, radiotherapy history, chemotherapy history, family history, smoking, tumor differentiation and pathological stage. The results show that the effect of the lncRNAs model on pancreatic cancer prognosis is independent of the clinicopathological features, as shown in Table 3.

TABLE 3 results of single-variable and multi-variable Cox regression analysis of the lncRNAs model in training, validation and complete dataset

In the univariate Cox regression analysis, tumor differentiation, pathological stage and the lncRNAs model were strongly correlated with OS in the training data set (p < 0.05); radiotherapy history, chemotherapy history and the lncRNAs model were strongly correlated with OS in the verification data set (p < 0.05); and radiotherapy history, chemotherapy history, pathological stage and the lncRNAs model were strongly correlated with OS in the complete data set (p < 0.05). In the multivariate Cox regression analysis, tumor differentiation, pathological stage and the lncRNAs model remained strongly correlated with OS in the training data set (p < 0.05), chemotherapy history and the lncRNAs model remained strongly correlated with OS in the verification data set (p < 0.05), and chemotherapy history, pathological stage and the lncRNAs model remained strongly correlated with OS in the complete data set.

In this example, the risk score of the lncRNAs model was also examined in relation to the clinicopathological features. As can be seen from Table 4, gender differed significantly between the high-risk group and the low-risk group only in the training dataset (p < 0.05), with no significant difference in the validation dataset or the complete dataset (p > 0.05), whereas pathological stage differed significantly between the high-risk group and the low-risk group in the training dataset and the complete dataset (p < 0.05). Thus, in combination with the chi-square values, it can be concluded that the lncRNAs model is positively correlated with pathological stage: patients with a higher pathological stage are more likely to have a higher risk score.

Table 4. Correlation of the risk score of the lncRNAs model with clinicopathological features in the training, validation and complete datasets

Time-dependent ROC curve analysis was performed in the complete dataset, giving AUCs of 0.746 and 0.897 for 3-year and 5-year survival, respectively. The lncRNAs model of the invention therefore has higher predictive power than previously published lncRNA models [48, 49].

In clinical practice, the most widely known and accepted method of prognostic risk classification is the TNM staging system established by the AJCC. The present invention therefore compared the lncRNAs model with the TNM staging system, as shown in FIG. 11.

The results showed that the AUCs of the TNM staging system for 3-year and 5-year survival were 0.67 and 0.723, respectively, both lower than the corresponding AUCs of the model of the invention, indicating that the lncRNAs model of the invention has better prognostic predictive power than the TNM staging system. In addition, when the lncRNAs model was used in combination with the TNM staging system, the AUCs for 3-year and 5-year survival were 0.82 and 0.959, respectively, so that the accuracy of pancreatic cancer prognosis prediction was greatly improved.
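The AUC values compared above can be illustrated with a simplified sketch. It treats a sample as a case if death was observed at or before the time horizon and as a control if follow-up extends beyond the horizon, and it discards samples censored before the horizon. This is a deliberate simplification: dedicated time-dependent ROC estimators weight censored samples instead of discarding them, so the sketch is not the estimator used in the example. The follow-up times, event indicators and risk scores are hypothetical.

```python
def auc_at_horizon(times, events, scores, horizon):
    """Naive AUC at a fixed time horizon.

    Cases: event observed (events == 1) at or before the horizon.
    Controls: follow-up time beyond the horizon.
    Samples censored before the horizon are dropped (simplifying assumption).
    """
    cases = [s for t, e, s in zip(times, events, scores) if e == 1 and t <= horizon]
    controls = [s for t, s in zip(times, scores) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0   # case correctly ranked as higher risk
            elif case_score == control_score:
                wins += 0.5   # ties count as half
    return wins / (len(cases) * len(controls))


# Hypothetical data: follow-up in years, 1 = death observed, 0 = censored.
times = [1.0, 2.0, 2.5, 4.0, 5.0, 6.0]
events = [1, 1, 0, 1, 0, 0]
scores = [2.1, 0.4, 1.0, 1.5, 0.2, 0.9]

auc_3y = auc_at_horizon(times, events, scores, horizon=3.0)
```

With these placeholder values, two cases and three controls remain at the 3-year horizon, and four of the six case-control pairs are ranked correctly.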

As shown in FIGS. 12-13, clinical predictive models are often built by combining multiple clinicopathological features. In order to establish a comprehensive clinical model for predicting the OS of pancreatic cancer patients, independent influencing factors of pancreatic cancer OS were selected from the lncRNAs model and the clinicopathological features. As shown in Table 3, the univariate and multivariate Cox regression analysis results in the complete dataset indicated that the lncRNAs model, chemotherapy history and pathological stage were independent influencing factors of pancreatic cancer OS (p < 0.05).

Therefore, a nomogram was constructed using these three independent influencing factors. Each sample received a score for its lncRNAs model risk score level, its chemotherapy history and its pathological stage, and the three scores were added to give the total score of the nomogram, which was used to predict 3-year and 5-year survival. The specific scores are shown in Table 5.

Table 5. Nomogram variable scores

To verify the predictive value of the nomogram, the C index was calculated, time-dependent ROC curve analysis was performed and calibration curves were drawn. The results showed that the C index of the nomogram was 0.677 (95% CI: 0.614-0.740) and that the AUCs for 3-year and 5-year survival were 0.718 and 0.786, respectively, indicating good predictive value, although slightly lower than that of the lncRNAs model. The calibration curves showed that the nomogram's predicted 3-year and 5-year survival essentially matched the actual observed values, further supporting its predictive value.
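The C index reported above measures how often, among comparable pairs of patients, the patient who died earlier also had the higher predicted risk. A minimal sketch of Harrell's concordance index follows, assuming that a pair is comparable only when the patient with the shorter time has an observed event, and that tied predictions count as half. The data are hypothetical, and the confidence interval is not computed here.

```python
def concordance_index(times, events, risk):
    """Harrell's C index for right-censored survival data.

    A pair (i, j) is comparable when sample i has an observed event
    (events[i] == 1) and a strictly shorter time than sample j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times, event indicators and predicted risks.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]

c_index = concordance_index(times, events, risk)
```

For a nomogram, the total score of each patient would play the role of `risk`; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.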

To further investigate the potential biological role of these 5 lncRNAs, KEGG and GO enrichment analyses were performed in this example on the protein-coding genes co-expressed with them. As shown in FIGS. 14-15, all mRNAs were first extracted from the pancreatic cancer data downloaded from TCGA, and 415 mRNAs differentially expressed between pancreatic cancer and paracancerous normal tissues were screened from them (|logFC| > 2, p < 0.05). Pearson correlation analysis was then performed between these 415 mRNAs and the 5 previously screened lncRNAs, and 339 co-expressed mRNAs were finally obtained (absolute correlation coefficient > 0.4, p < 0.001). Based on these 339 mRNAs, KEGG and GO functional enrichment analysis was performed using ClueGO of Cytoscape and plotted with CluePedia. Thirty pathways had p ≤ 0.05, of which 3 had p ≤ 0.00001, namely lymphocyte activation, B cell activation and leukocyte activation, suggesting that the effect of these 5 lncRNAs on pancreatic cancer prognosis may be immune-related.
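The co-expression screen (absolute correlation coefficient > 0.4, p < 0.001) can be sketched as follows. The Pearson coefficient is computed directly; the p-value is approximated with the Fisher z-transform and a normal distribution, which is an assumption made to keep the sketch dependency-free and is not necessarily the test used in the example. Gene names and expression vectors are hypothetical, and constant expression vectors are not handled.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (normal approximation)."""
    if abs(r) >= 1.0:
        return 0.0  # guards atanh against rounding at |r| = 1
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def coexpressed(lnc_expr, mrna_expr, r_min=0.4, p_max=0.001):
    """Return (lncRNA, gene, r) triples passing both screening criteria."""
    hits = []
    for lnc, x in lnc_expr.items():
        for gene, y in mrna_expr.items():
            r = pearson_r(x, y)
            if abs(r) > r_min and approx_p_value(r, len(x)) < p_max:
                hits.append((lnc, gene, r))
    return hits


# Hypothetical expression vectors across 20 samples.
x = [float(v) for v in range(1, 21)]
lnc_expr = {"lncA": x}
mrna_expr = {
    "geneA": [2 * v + 1 for v in x],                          # positively co-expressed
    "geneB": [1.0 if i % 2 == 0 else -1.0 for i in range(20)],  # essentially uncorrelated
    "geneC": [-v for v in x],                                 # negatively co-expressed
}

hits = coexpressed(lnc_expr, mrna_expr)
```

Because the criterion uses the absolute value of the coefficient, negatively correlated genes such as the placeholder `geneC` pass the screen as well.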

All of the above statistical analyses were performed in RStudio. To investigate differences in the composition of clinicopathological features between the training dataset and the validation dataset, and the relationship between the risk score of the model and the clinicopathological features, chi-square tests were performed in SPSS.
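For a binary feature, the chi-square comparison between two groups reduces to a 2x2 contingency table. The sketch below computes the Pearson chi-square statistic without continuity correction and obtains the p-value for one degree of freedom from the complementary error function. The choice to omit the continuity correction and the counts themselves are assumptions for illustration; features with more than two levels would need a general implementation.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table[i][j] is the count for group i (e.g. high or low risk) and
    feature level j. Returns (chi2, p) with one degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are risk groups, columns are feature levels.
unbalanced = [[30, 10], [10, 30]]
balanced = [[20, 20], [20, 20]]

chi2_u, p_u = chi_square_2x2(unbalanced)
chi2_b, p_b = chi_square_2x2(balanced)
```

The unbalanced placeholder table yields a large statistic and a small p-value, while the balanced table yields a statistic of zero and a p-value of one.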

While embodiments of the invention have been shown and described, it will be understood by those skilled in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for constructing a prognosis model of pancreatic cancer based on Cox regression analysis, the method comprising:

step one, downloading a pancreatic cancer expression profile from a TCGA database, and preprocessing the pancreatic cancer expression profile to form sample data; the sample data is a complete data set;

step two, randomly dividing the complete data set into a training data set and a verification data set according to a proportion, further screening the lncRNAs related to overall survival (OS) from the training data set by using LASSO Cox regression analysis, calculating the risk score of each patient sample based on a risk score calculation formula using the regression coefficients and expression levels of the lncRNAs, and dividing the patients into a high-risk group and a low-risk group using the median risk score in the prediction model as the critical value;

wherein the preprocessing in step one comprises the following steps: sorting and annotating genes, processing the pancreatic cancer expression profiles by using the edgeR software package of the R language, matching Ensembl IDs, separating the lncRNAs from the genes and screening those with an average expression value larger than 1 for subsequent analysis, screening differentially expressed lncRNAs with |logFC| >1 and p <0.05 from the pancreatic cancer expression profiles, carrying out univariate Cox regression analysis to screen lncRNAs related to OS, and deleting sample data with incomplete clinical information, a survival time of 0, or duplication;

2. The method for constructing a prognosis model for pancreatic cancer based on Cox regression analysis according to claim 1, further comprising: step six, using LASSO Cox regression analysis to further select lncRNAs and establish a prediction model based on the lncRNAs, the prediction model being used for predicting the 3-year and 5-year survival of pancreatic cancer patients.

3. The method for constructing a pancreatic cancer prognosis model based on Cox regression analysis according to claim 1, wherein in the third step, patients of the verification data set and the complete data set are classified into a high-risk group and a low-risk group according to the critical value, and Kaplan-Meier analysis and the log-rank test are performed.

4. The method for constructing a prognosis model for pancreatic cancer based on Cox regression analysis according to claim 1, wherein in the fourth step, the clinical pathological features include the lncRNAs model, age, sex, alcohol consumption, history of radiotherapy, history of chemotherapy, family history, smoking, tumor differentiation and pathological stage.

5. The method for constructing a pancreatic cancer prognosis model based on Cox regression analysis according to claim 1, wherein in the fifth step, Pearson correlation analysis is used to screen out the protein coding genes co-expressed with the prognostic lncRNAs, the screening criteria being an absolute value of the correlation coefficient >0.4 and p <0.001, and the screened protein coding genes are subjected to KEGG and GO functional enrichment analysis and plotting by using ClueGO and CluePedia of Cytoscape, so as to identify the pathways through which the prognostic lncRNAs act.

6. The method for constructing a pancreatic cancer prognosis model based on Cox regression analysis according to claim 1, wherein the pancreatic cancer independent prognosis influencing factors are lncRNAs model, chemotherapy history and pathological stage.

7. The method for constructing a prognosis model for pancreatic cancer based on Cox regression analysis according to claim 6, wherein a nomogram is constructed based on the independent prognosis influencing factors for pancreatic cancer, each pancreatic cancer patient sample is given a score for its lncRNAs model risk score level, chemotherapy history and pathological stage, and the three scores obtained are added to calculate the total score of the nomogram for predicting the 3-year and 5-year survival of the pancreatic cancer patient.

8. The method for constructing a pancreatic cancer prognosis model based on Cox regression analysis according to claim 5, wherein the step of subjecting the screened protein coding genes to KEGG and GO functional enrichment analysis and plotting by using ClueGO and CluePedia of Cytoscape specifically comprises:

KEGG and GO functional enrichment analysis was performed using ClueGO from Cytoscape and plotted with CluePedia.