CN111243662A

CN111243662A - Pan-cancer gene pathway prediction method, system and storage medium based on improved XGBoost

Info

Publication number: CN111243662A
Application number: CN202010041366.9A
Authority: CN
Inventors: 阿丽玛; 刘朝锐; 张玉; 周维
Original assignee: Yunnan University YNU
Current assignee: Yunnan University YNU
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2020-06-05
Anticipated expiration: 2040-01-15
Also published as: CN111243662B

Abstract

The invention discloses a pan-cancer gene pathway prediction method, system and storage medium based on improved XGBoost. The method trains an improved XGBoost model on a training data set until the model converges. The improved XGBoost model adds a threshold selection process to the XGBoost model: a threshold controls the classification boundary between positive and negative samples, and the threshold selection process adjusts the threshold according to a classification index. Because the method is built on the tree-structured XGBoost, its split-node selection handles outliers in the continuous values of biological data and resolves the classification boundary shift caused by data preprocessing. Cross validation is supported, and early stopping can be used to obtain the best training result. The added threshold control corrects the weight bias caused by imbalanced class samples and improves the predicted AUROC and AUPR values, giving a better classification result.

Description

Pan-cancer gene pathway prediction method, system and storage medium based on improved XGBoost

Technical Field

The invention relates to the field of biological genes, and in particular to a pan-cancer gene pathway prediction method, system and storage medium based on improved XGBoost.

Background

Predicting pan-cancer gene pathways from TCGA gene expression data enables early diagnosis of cancer and reveals the relationship between gene expression and cancer pathway activation. A pan-cancer gene pathway analysis algorithm, XBBPCPA, is proposed. It uses the machine learning algorithm XGBoost to integrate data from more than 9,000 samples with 180 million feature points, and mines and analyzes the influence of pan-cancer gene expression on pathway activation. A threshold-control hyperparameter is designed to control the classification boundary between positive and negative samples, which addresses the sample imbalance in the data and improves the classification evaluation metrics AUC and AUPR. Comparative experiments show that the XBPPA algorithm has higher generalization performance on cancer pathway prediction.

Pan-cancer covers 33 common human cancers. The Cancer Genome Atlas (TCGA) is a project (https://cancergenome.nih.gov/) jointly carried out by the U.S. National Human Genome Research Institute and the U.S. National Cancer Institute; it has collected gene data from more than 11,000 cases of common cancers.

Experiments on and validation of the RAS pathway and the P53 pathway have shown that the RAS pathway is altered in most cancers. When it is activated, copy number variations are usually present, including gain-type variations (KRAS, NRAS and HRAS variations) and loss-type variations (NF1 variations). Cancer types such as pancreatic cancer, cutaneous melanoma, thyroid cancer and lung adenocarcinoma have been determined to be triggered by RAS pathway variation. In addition, RAS pathway variation has been shown to be an early event in cancer development. Cancers caused by RAS pathway variation are difficult to treat, so accurately predicting and locating the conditions that cause RAS pathway activation is critical for subsequent treatment. The P53 pathway is currently the gene pathway known to be most strongly associated with cancer; variations and abnormal expression of P53 are found in a large number of known cancers. P53 serves more as a marker for cancer diagnosis, and accurate prediction will lead to earlier detection and corresponding treatment.

In 2018, Gregory P. Way et al., in the article "Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas" published in Cell Reports, predicted the RAS pathway using logistic regression, a memory-based algorithm. In 5-fold cross-validation the fit reached an AUROC of 0.86 and an AUPR of 0.61; on a new data set the generalization performance was an AUROC of 0.76 and an AUPR of 0.58. However, this method generalizes poorly and cannot be applied to pathways other than the RAS pathway. In addition, its AUROC and AUPR do not reach the theoretical upper limit for the data.

Disclosure of Invention

The object of the invention is as follows: in view of the above problems, to provide a pan-cancer gene pathway prediction method that generalizes well across data types and can predict a variety of pan-cancer gene pathways. The method is built on the tree-structured XGBoost; its split-node selection handles outliers in the continuous values of biological data and resolves the classification boundary shift caused by data preprocessing. Cross validation is supported, and early stopping can be used to obtain the best training result. The added threshold control corrects the weight bias caused by imbalanced class samples and improves the predicted AUROC and AUPR values, giving a better classification result.

The technical scheme adopted by the invention is as follows:

A pan-cancer gene pathway prediction method based on improved XGBoost comprises the following steps:

training the improved XGBoost model with a training data set until the model converges, wherein the improved XGBoost model adds a threshold selection process to the XGBoost model, the threshold controls the classification boundary between positive and negative samples, and the threshold selection process adjusts the threshold according to the classification index.

The training data set consists of acquired cancer sample data, in which each cancer sample is associated with the cancer type to which it belongs.

Adding a threshold selection process to the XGBoost model corrects the weight bias caused by imbalanced class samples. Because the method is built on the tree-structured XGBoost, its split-node selection handles outliers in the continuous values of biological data and resolves the classification boundary shift caused by data preprocessing.

Further, training the improved XGBoost model with the training data set specifically comprises: fitting the improved XGBoost model to the training data set and training it with K-fold cross validation. The invention supports cross validation, and early stopping can be used to obtain the best training result.

Further, the classification index is ROC-AUC.

Further, when the improved XGBoost model is trained with K-fold cross validation, the tuned parameters include the number of iterations, the maximum tree depth, the subsampling ratio, the regularization coefficients and the learning rate.

Further, K is 5.

Further, adjusting the threshold according to the classification index comprises: taking 0.5 as the reference threshold, predicting the positive and negative sample intervals, calculating the AUROC, and adjusting the threshold according to the result.

Further, the preparation process of the training data set comprises:

merging the number variation matrix and the gene expression matrix by sample ID, and labeling the samples with their mutation data, wherein the number variation matrix and the gene expression matrix are generated from the input RNA-seq, copy number and mutation data;

and preprocessing the merged matrix, the preprocessing including a filtering step, to finally obtain the training data set.

Further, the step of preprocessing the integrated data includes: integrating the expression data and the variation data by sample ID, and filtering out cancer types with fewer patients than a predetermined number.

A computer-readable storage medium stores a computer program which, when run, performs the above pan-cancer gene pathway prediction method based on improved XGBoost.

A pan-cancer gene pathway prediction system based on improved XGBoost comprises a processor and the above computer-readable storage medium, wherein the processor runs the computer program stored in the computer-readable storage medium to execute the pan-cancer gene pathway prediction method based on improved XGBoost.

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

The method is built on the tree-structured XGBoost; its split-node selection handles outliers in the continuous values of biological data and resolves the classification boundary shift caused by data preprocessing. Cross validation is supported, and early stopping can be used to obtain the best training result. The added threshold control (threshold selection) corrects the weight bias caused by imbalanced class samples and improves the predicted AUROC and AUPR values (the evaluation indexes for gene pathway prediction), so that the classification result is better and the generalization performance of the model is greatly improved.

Drawings

The invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is a flow chart of the pan-cancer gene pathway prediction method based on improved XGBoost according to the invention.

FIG. 2 is a diagram of the improved XGBoost model architecture.

FIG. 3 is the P53 pathway distribution for each cancer after sample filtering.

FIG. 4 shows the evaluation indexes of the prediction results of the P53 model trained by the method of the invention.

FIG. 5 is the RAS pathway distribution for each cancer after sample filtering.

FIG. 6 shows the evaluation indexes of the prediction results of the RAS model trained by the method of the invention.

Detailed Description

All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.

Any feature disclosed in this specification (including any accompanying claims, abstract) may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.

At present, cancer gene data come in many types, which strongly affects the generalization performance of logistic regression methods such as that of Gregory P. Way et al. Gene expression data are continuous features: real values within a certain range that represent expression level, not class membership. Discretizing these continuous values would discard much of the original information, yet the continuous values contain many outliers, and in algorithms such as logistic regression the weights are easily influenced by extreme values. The algorithm chosen for such data should therefore handle continuous feature values, determine their split intervals automatically, and keep its classification performance unaffected by outliers. Gene variation data, by contrast, are discrete and low-dimensional; they call for a simple algorithm such as logistic regression, and overfitting still has to be addressed. Variation data are usually used in combination with other data.

Logistic regression is a memory-based model: training assigns a weight to every feature of the samples, and when a new sample is predicted the weights are summed according to its feature values, the result being the predicted probability. Logistic regression has two problems, however. First, good generalization requires tuning both L1 and L2 regularization. Second, continuous numerical features contain many outliers, and since the gradient of the sigmoid function approaches 0 for very large positive or negative inputs, the gradient can vanish; discretizing the features instead would change the original data distribution. Logistic regression is therefore unsuitable for continuous numerical features with many abnormal values, where the data strongly affect the generalization performance of the model.
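The vanishing-gradient point can be checked numerically. The following is a minimal sketch, not part of the disclosed method; the input values are illustrative stand-ins for typical and outlier-sized weighted sums.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Illustrative inputs: 0.0 is a typical value, +/-20.0 stand in for outliers.
for z in (0.0, 5.0, 20.0, -20.0):
    print(f"z={z:+6.1f}  sigmoid={sigmoid(z):.6f}  grad={sigmoid_grad(z):.3e}")
```

The gradient is largest at zero and shrinks towards zero as the input grows in magnitude, which is why samples with extreme feature values contribute almost nothing to the weight updates.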

The invention adds a threshold selection step, which corrects the weight bias caused by imbalanced class samples and improves the predicted AUROC and AUPR values, giving a better classification result. Because the method is built on the tree-structured XGBoost, its split-node selection handles outliers in the continuous values of biological data and resolves the classification boundary shift caused by data preprocessing.

Example one

This embodiment discloses a pan-cancer gene pathway prediction method based on improved XGBoost, which trains an improved XGBoost model on a training data set until the model converges. The improved XGBoost model adds a threshold selection process to the XGBoost model; the threshold controls the classification boundary between positive and negative samples, and the threshold selection process adjusts the threshold according to the classification index. After a threshold is set, the model is trained on the samples to obtain the corresponding classification index, and when the index does not meet the requirement the threshold is adjusted automatically according to the index.

The training data set used for the training is prepared cancer sample data, and each cancer sample corresponds to the cancer type to which the cancer sample belongs.

Example two

This embodiment discloses a pan-cancer gene pathway prediction method based on improved XGBoost which, as shown in FIG. 1, comprises a data acquisition step, a matrix generation step, a matrix merging step, a model training step and a model validation step. Specifically:

The number variation matrix and the gene expression matrix are merged by sample ID. At the same time, the samples are labeled using their mutation data. For example, pathway activation is typically indicated by a TP53 mutation for the P53 pathway and by a gain-of-function KRAS, NRAS or HRAS mutation for the RAS pathway. The number variation matrix and the gene expression matrix are generated from the input RNA-seq (gene expression data), copy number and mutation data, which are obtained from TCGA (https://cancergenome.nih.gov/).

And preprocessing the merged data, wherein the preprocessing comprises merging and filtering steps so as to finally obtain a training data set.

The improved XGBoost algorithm is trained on the training data set until the model converges. The improved XGBoost algorithm adds a threshold selection process to the XGBoost algorithm to control the classification boundary between positive and negative samples, with the aim of correcting the weight shift between positive and negative samples caused by data imbalance. The threshold selection process takes 0.5 as the reference, predicts the upper and lower intervals, calculates the AUROC, and obtains the optimal threshold parameter from the value of the classification index. An example of the corresponding procedure is as follows:

The improved XGBoost model is trained on the training data set using K-fold cross validation. The main parameters tuned are the number of iterations (n_estimators), the maximum tree depth (max_depth), the subsampling ratio (subsample) and the regularization coefficients (reg_alpha for L1 and reg_lambda for L2); the learning rate (learning_rate) is tuned last. During cross validation, ROC-AUC is used as the evaluation index.
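For reference, the two evaluation indexes used throughout can be computed from labels and predicted scores as in the sketch below. It is dependency-free and illustrative only: the function names and the toy data are assumptions, AUROC is computed by pairwise comparison, and AUPR is approximated by average precision without special handling of tied scores.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (AUPR estimate)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one positive sample")
    return total / hits


# Toy data (hypothetical): 3 positives and 3 negatives.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(auroc(toy_labels, toy_scores), average_precision(toy_labels, toy_scores))
```

With imbalanced classes the two indexes can diverge noticeably, which is why both are reported for each model.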

In the training procedure described above, tumor data from the TCGA PanCanAtlas are used to train the model with the improved XGBoost; RNA-seq, copy number and mutation data from 9075 cancer samples covering 33 different cancer types can be integrated.

Example three

This embodiment discloses a pan-cancer gene pathway prediction method based on improved XGBoost, which comprises the following steps:

RNA-seq, copy number and mutation data are obtained from TCGA and entered into a number variation matrix and a gene expression matrix. The number variation matrix and the gene expression matrix are merged by sample ID. At the same time, the samples are labeled using their mutation data. For example, pathway activation is typically indicated by a TP53 mutation for the P53 pathway and by a gain-of-function KRAS, NRAS or HRAS mutation for the RAS pathway.

Cancer types with too few samples (fewer than the specified number) are filtered out. Using tumor data from the TCGA PanCanAtlas, the model is trained with the improved XGBoost on RNA-seq, copy number and mutation data integrated from 9075 cancer samples covering 33 different cancer types. RNA-seq is used as the measure of tumor expression status, and a classifier is trained to detect gene expression patterns associated with aberrant pathway activity. The XGBoost algorithm maps samples to leaf nodes through a node splitting function and adds trees to fit the residual of the previous trees. The tree algorithm learns combinations of gene importance weights that together best separate aberrant from wild-type expression patterns. In principle, the method can be used to predict other gene pathways or for similar tasks. In the present invention, it is applied to classify P53 and RAS activity. In addition, the invention compares a logistic regression based method with XGBoost, and the results show that XGBoost has stronger generalization performance.

Specifically, the prediction method is as follows:

In the TCGA database, the appropriate files (expression file: Gene Expression Quantification, HTSeq-FPKM; variation file: copy number variation) are selected according to the required filter condition (cancer type), and a Manifest file named after the current time is downloaded. The Manifest file contains four fields (id, filename, md5, size), where id is the identifier of the data stored in the TCGA database. After the fields are parsed, download requests for the data are sent using the requests library.
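Parsing the four manifest fields might look like the sketch below. It is an illustration under stated assumptions: the manifest is taken to be tab-separated, the manifest rows are toy data generated in the code, and the download URL pattern (the GDC data endpoint) is an assumption that the text does not state. No network request is made here; each resulting URL would be passed to the download step.

```python
import csv
import hashlib
import io

# Toy payloads standing in for downloaded files (hypothetical).
TOY_PAYLOADS = {"0001-aaaa": b"hello", "0002-bbbb": b"world"}

# Toy manifest with the four fields named in the text; the tab-separated
# layout is an assumption.
MANIFEST_TEXT = "id\tfilename\tmd5\tsize\n" + "".join(
    f"{fid}\t{fid}.FPKM.txt.gz\t{hashlib.md5(data).hexdigest()}\t{len(data)}\n"
    for fid, data in TOY_PAYLOADS.items()
)

# Assumed download endpoint; not given in the text.
DATA_ENDPOINT = "https://api.gdc.cancer.gov/data/{file_id}"


def parse_manifest(text):
    """Return one dict per manifest row, with an integer size and a download URL."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        dict(row, size=int(row["size"]), url=DATA_ENDPOINT.format(file_id=row["id"]))
        for row in reader
    ]


def md5_matches(payload: bytes, expected_md5: str) -> bool:
    """Check a downloaded payload against the md5 field of its manifest row."""
    return hashlib.md5(payload).hexdigest() == expected_md5


entries = parse_manifest(MANIFEST_TEXT)
for entry in entries:
    print(entry["filename"], entry["size"], entry["url"])
```

Checking the md5 field after each download catches truncated or corrupted files before they reach the merging step.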

The data are preprocessed with pandas; preprocessing includes merging and filtering. Functions such as merge, groupby and count are used to generate pathway-related labels according to the classification target, and cancer types with fewer than 15 patients are removed according to the set filter condition, finally yielding the training data set. For the P53 pathway, 6746 samples from 22 cancer types remain. The data set has two kinds of features: gene expression level, which is a continuous numerical feature, and whether the gene carries a variation, which is a discrete feature.
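The merge, label and filter steps can be sketched as follows. This is a dependency-free illustration in which plain dictionaries stand in for the pandas merge/groupby/count calls named in the text; the function name, feature names and toy data are hypothetical, and only the cutoff of 15 patients comes from the text.

```python
from collections import Counter

MIN_PATIENTS = 15  # filtering cutoff stated in the text


def build_training_rows(expression, variation, cancer_type, mutated_samples,
                        min_patients=MIN_PATIENTS):
    """Join per-sample tables on sample ID, label each sample, drop rare cancer types."""
    # Inner join: keep only samples present in every input table.
    shared_ids = sorted(set(expression) & set(variation) & set(cancer_type))
    patients_per_type = Counter(cancer_type[sid] for sid in shared_ids)
    rows = []
    for sid in shared_ids:
        if patients_per_type[cancer_type[sid]] < min_patients:
            continue  # cancer type has too few patients
        features = {**expression[sid], **variation[sid]}
        label = 1 if sid in mutated_samples else 0  # 1 = pathway-activating mutation
        rows.append((sid, cancer_type[sid], features, label))
    return rows


# Toy data (hypothetical): 16 LUAD samples and 4 PAAD samples.
expression = {f"S{i:02d}": {"TP53_expr": 1.0 + i} for i in range(20)}
variation = {f"S{i:02d}": {"TP53_cnv": i % 2} for i in range(20)}
cancer_type = {f"S{i:02d}": ("LUAD" if i < 16 else "PAAD") for i in range(20)}
mutated = {"S00", "S03"}

rows = build_training_rows(expression, variation, cancer_type, mutated)
print(len(rows), "samples kept")
```

In this toy input the four PAAD samples fall below the cutoff and are dropped, while the LUAD samples are kept with their merged continuous and discrete features.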

The improved XGBoost algorithm is trained on the training data set until the model converges.

The model is trained with XGBoost using 5-fold cross validation (GridSearchCV). Parameter tuning mainly covers the number of iterations (n_estimators), the maximum tree depth (max_depth), the subsampling ratio (subsample) and the regularization coefficients (reg_alpha for L1 and reg_lambda for L2); the learning rate (learning_rate) is tuned last. During cross validation, ROC-AUC is used as the evaluation index.
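What a cross-validated grid search does can be sketched without any library as shown below. This is an illustration, not the GridSearchCV implementation: the candidate values in the grid are hypothetical (the text names the parameters but not their ranges), and `score_fn` is a placeholder for "train the model on the training fold and return ROC-AUC on the validation fold".

```python
import itertools
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; every sample is validated exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


# Hypothetical candidate values for the parameters named in the text.
PARAM_GRID = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "subsample": [0.8, 1.0],
    "reg_alpha": [0.0, 0.1],
    "reg_lambda": [1.0],
    "learning_rate": [0.05, 0.1],
}


def iter_param_settings(grid):
    """Enumerate every combination of candidate values."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(n_samples, grid, score_fn, k=5):
    """Return (best_params, best_mean_score) over all settings and k folds."""
    best_params, best_score = None, float("-inf")
    for params in iter_param_settings(grid):
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Because every setting is trained once per fold, the cost grows with the product of the grid sizes, which is one reason to tune the learning rate last, as the text describes, rather than jointly with everything else.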

Example four

This embodiment explains why the XGBoost algorithm needs to be improved, covering the theoretical basis of the algorithm and the derivation of the loss function, and analyzes the applicability of XGBoost to pathway analysis from the perspective of the loss function.

In this experiment, a supervised-learning decision tree model, XGBoost (Tianqi Chen et al., 2016), is trained on the data set. Given a data set D consisting of n samples and m features:

$$\mathcal{D} = \{(x_i, y_i)\}, \quad |\mathcal{D}| = n, \; x_i \in \mathbb{R}^m, \; y_i \in \mathbb{R}$$

the predicted value for a sample is the sum of the outputs of the K trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

where f_k(x_i) is the weight of the leaf node that sample x_i occupies in the k-th tree.

Then we define the function space of XGBoost:

$$\mathcal{F} = \{ f(x) = w_{q(x)} \}, \quad q: \mathbb{R}^m \to \{1, \dots, T\}, \; w \in \mathbb{R}^T$$

where q is the structure function of a single tree in the ensemble, mapping the input x to a leaf node; T is the number of leaf nodes; and w_{q(x)} is the score of sample x in tree q, that is, the weight w of the leaf to which x is mapped. The structure of XGBoost is explained with a simple example.

(A) Format of the input data: the expression matrix is combined with the number variation matrix. An example expression matrix is given in the following table:

(B) The final output for sample x is equal to the sum of the scores of the leaf nodes to which sample x is mapped. The structure is shown in FIG. 2.
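The additive structure in (B) can be illustrated with two hand-written toy trees. This is a sketch only: the split features, split values and leaf scores are hypothetical, and converting the summed score to a probability with the logistic function assumes a logistic binary objective.

```python
import math


def tree_1(x):
    """Hypothetical tree splitting on an expression feature; returns a leaf score."""
    return 0.8 if x["TP53_expr"] < 2.0 else -0.4


def tree_2(x):
    """Hypothetical tree splitting on a variation feature; returns a leaf score."""
    return 0.5 if x["TP53_cnv"] == 1 else -0.1


def predict_score(x, trees):
    """Sum of the scores of the leaves that x reaches, one leaf per tree."""
    return sum(tree(x) for tree in trees)


def predict_proba(x, trees):
    """Logistic transform of the summed score (assumes a logistic objective)."""
    return 1.0 / (1.0 + math.exp(-predict_score(x, trees)))


trees = [tree_1, tree_2]
sample = {"TP53_expr": 1.2, "TP53_cnv": 1}
print(predict_score(sample, trees), predict_proba(sample, trees))
```

Each tree contributes exactly one leaf score per sample, so adding a tree can only shift the summed score by the value of one of its leaves.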

Repeated tests showed that training with XGBoost improves the AUC compared with logistic regression, but does not noticeably improve recall.

Analysis shows that the classification performance of the model is affected by the distribution of positive and negative samples in the data: the model parameters are adjusted more by the class with more samples, so the algorithm tends to predict that class. Considering the nature of the XGBoost algorithm, the predicted value over the K trees is the sum of the outputs of each tree:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

If one class has too many samples, the weak classifiers fit only that class, and once the number of such weak classifiers exceeds a certain level the algorithm can no longer learn the characteristics of the other class. This bias ultimately leads the algorithm to output only the probability of the majority class.

Traditional methods address the imbalance between positive and negative samples by resampling the data in a certain proportion or by weighting the samples differently so that a relatively balanced state is reached. However, such preprocessing strongly alters the original distribution of the data. Cancer gene expression data are precise, and small differences in expression correspond to different degrees and types of lesion, so the class of a predicted sample must instead be controlled at the output stage of the algorithm, reducing as far as possible the effect of sample imbalance on classification.

Therefore, the invention adds a hyperparameter to the original algorithm: the threshold, which controls the classification boundary between positive and negative samples in order to correct the weight shift caused by data imbalance. The corresponding algorithm is exemplified as follows:

In the experiment, choosing the threshold is a typical interval search problem. A threshold selection method is designed and implemented on the basis of binary search, combined with the XGBoost training process: taking 0.5 as the reference, the upper and lower intervals are predicted, the AUROC is calculated, and the optimal threshold parameter is obtained from the value of the classification index. In one embodiment, pseudocode for the threshold selection method is as follows:

The threshold is adjusted dynamically according to the AUROC value, so that the sample classification boundary is adjusted adaptively and automatically, which solves the problem of positive and negative sample weight shift caused by data imbalance.
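The pseudocode itself is not reproduced in this text. The sketch below is one possible reading of the description, not the patented procedure: it assumes the AUROC is computed on the hard 0/1 predictions made at a given threshold (where it equals the mean of the true positive and true negative rates), and it assumes a step-halving search that starts from 0.5 and tries one candidate in the lower and one in the upper interval at each step. The function names, tolerance and toy data are hypothetical.

```python
def hard_auroc(labels, probs, threshold):
    """AUROC of the 0/1 predictions made at `threshold`; equals (TPR + TNR) / 2."""
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if y == 1:
            tp += pred
            fn += 1 - pred
        else:
            fp += pred
            tn += 1 - pred
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def select_threshold(labels, probs, start=0.5, tol=1e-3):
    """Step-halving search around `start` for the threshold with the best AUROC."""
    best_t = start
    best_score = hard_auroc(labels, probs, best_t)
    step = 0.25
    while step >= tol:
        center = best_t
        # Try one candidate in the lower interval and one in the upper interval.
        for candidate in (center - step, center + step):
            if 0.0 < candidate < 1.0:
                score = hard_auroc(labels, probs, candidate)
                if score > best_score:
                    best_t, best_score = candidate, score
        step /= 2
    return best_t, best_score


# Toy imbalanced data (hypothetical): 8 negatives, 2 positives whose predicted
# probabilities sit below 0.5 because the majority class dominates training.
toy_labels = [0] * 8 + [1] * 2
toy_probs = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.4, 0.45]
print(select_threshold(toy_labels, toy_probs))
```

On this toy input the default boundary of 0.5 classifies every sample as negative, while the search moves the boundary down until both positives are recovered, which is the behaviour the threshold hyperparameter is meant to provide.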

Example five

This example discloses a specific implementation procedure of the pan-cancer gene pathway prediction method based on the improved XGBoost of the present invention to demonstrate the advantages of the present invention.

The P53 pathway is activated mainly by TP53 gene variation and is currently the gene pathway known to be most strongly associated with cancer. Across all malignant tumors, the TP53 gene is mutated in more than 50% of cases, which activates the P53 pathway. TP53 is by nature a tumor suppressor gene, but after mutation it becomes an oncogene: its spatial structure changes and it loses its control over cell division, apoptosis and DNA repair.

In this example, preprocessing and filtering (removing cancer types with fewer than 15 patients) yielded 6746 data samples covering 22 cancer types; the data distribution (P53 pathway status for each disease) is shown in FIG. 3. 20% of the data (n = 1349) was set aside as a validation data set to test the generalization performance of the model, and the remaining 80% (n = 5397) was used as the training data set for 5-fold cross-validation training of the model.
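The reported split sizes can be reproduced with a simple hold-out split. This is a sketch under assumptions: the text does not say whether the split was random or stratified, or how the 20% was rounded; a seeded random split with truncation is assumed here because it matches the reported counts.

```python
import random


def holdout_split(sample_ids, valid_fraction=0.2, seed=0):
    """Shuffle sample IDs and split them into (train_ids, valid_ids)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_valid = int(len(ids) * valid_fraction)  # truncation is an assumption
    return ids[n_valid:], ids[:n_valid]


# Sample counts taken from the text; the IDs are just integer placeholders.
train_ids, valid_ids = holdout_split(range(6746))
print(len(train_ids), len(valid_ids))
```

The validation samples are held out before cross-validation starts, so they play no part in parameter tuning.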

For the P53 model, 5-fold cross validation gave an AUC-ROC of 0.939 and an AUPR of 0.892; on the validation data set the AUC-ROC was 0.90 and the AUPR 0.95. Compared with logistic regression (AUC-ROC 0.83, AUPR 0.83), this is a large improvement, showing that the improved algorithm generalizes well; the cross-validation AUC is also improved. Without threshold control, the AUC-ROC was 0.88 and the AUPR 0.89. The improved algorithm thus clearly raises the evaluation indexes for gene pathway prediction. The results are shown in FIG. 4, where panel A is the AUROC and panel B is the AUPR.

RAS pathway activation is caused by mutations in the KRAS, NRAS, HRAS or NF1 genes. Certain cancer types, such as pancreatic cancer (PAAD), cutaneous melanoma (SKCM), thyroid cancer (THCA), lung adenocarcinoma (LUAD) and COAD, are known to be driven primarily by RAS pathway gene mutations.

In this example, preprocessing and filtering (removing cancer types with fewer than 15 patients) yielded 2326 samples; the data distribution is shown in FIG. 5. 20% of the samples (n = 465) were selected as the validation data set and the rest (n = 1861) as the training data set, and the model was again trained with 5-fold cross-validation. For the RAS pathway prediction model, the 5-fold cross-validation AUC-ROC was 0.84. Adding threshold control improved the generalization performance of the model: on the validation data set, the AUC-ROC rose from 0.766 to 0.80 and the AUPR from 0.60 to 0.845. The results are shown in FIG. 6, where panel A is the AUROC and panel B is the AUPR.

Example six

This embodiment discloses a computer-readable storage medium storing a computer program which, when run, executes the pan-cancer gene pathway prediction method based on improved XGBoost of any one of the above embodiments.

Example seven

This embodiment discloses a pan-cancer gene pathway prediction system based on improved XGBoost, which comprises a processor and the computer-readable storage medium of Example six, wherein the processor runs the computer program in the computer-readable storage medium to execute the pan-cancer gene pathway prediction method based on improved XGBoost.

Example eight

This embodiment discloses a pan-cancer gene pathway prediction system based on improved XGBoost, which runs the pan-cancer gene pathway prediction method based on improved XGBoost of any one of Examples one to five.

The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed.

Claims

1. A pan-cancer gene pathway prediction method based on improved XGboost is characterized by comprising the following steps:

2. The method for predicting a pan-cancer gene pathway based on improved XGBoost as claimed in claim 1, wherein training the improved XGBoost model with the training data set specifically comprises: fitting the improved XGBoost model to the training data set and training the improved XGBoost model with K-fold cross validation.

3. The method for predicting a pan-cancer gene pathway based on improved XGBoost of claim 2, wherein the classification index is ROC-AUC.

4. The method for predicting pan-cancer gene pathway based on improved XGBoost of claim 2, wherein in the training of the improved XGBoost model using K-fold cross validation, the adjusted parameters comprise iteration number, maximum depth of spanning tree, down-sampling coefficient, regularization coefficient and learning rate.

5. The method for predicting a pan-cancer gene pathway based on improved XGBoost according to claim 2, wherein K = 5.

6. The method for predicting a pan-cancer gene pathway based on improved XGBoost according to any one of claims 1-4, wherein adjusting the threshold according to the classification index comprises: taking 0.5 as the reference threshold, predicting the positive and negative sample intervals, calculating the AUROC, and adjusting the threshold according to the result.

7. The method of claim 1, wherein the preparation of the training data set comprises:

8. The method of claim 6, wherein the step of pre-processing the integrated data comprises: integrating the expression data and the variant data according to the sample ID, and filtering the cancer categories with the data volume of the patient being less than the predetermined number.

9. A computer-readable storage medium storing a computer program, wherein the computer program is executed to perform the method for predicting a pan-cancer gene pathway based on modified XGBoost according to any one of claims 1 to 8.

10. An improved XGBoost-based pan-cancer gene pathway prediction system comprising a processor and the computer-readable storage medium of claim 9, the processor being configured to execute a computer program stored in the computer-readable storage medium to execute an improved XGBoost-based pan-cancer gene pathway prediction method.