CN111243662A - Pan-cancer gene pathway prediction method, system and storage medium based on improved XGBoost

Publication number
CN111243662A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010041366.9A
Other languages
Chinese (zh)
Other versions
CN111243662B (en)
Inventor
阿丽玛
刘朝锐
张玉
周维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/50 Mutagenesis
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a pan-cancer gene pathway prediction method, system and storage medium based on improved XGBoost. The method trains an improved XGBoost model on a training data set until the model converges. The improved model adds a threshold selection process to the XGBoost model: a threshold controls the classification boundary between positive and negative samples, and the threshold selection process adjusts that threshold according to a classification index. Because the method builds on the tree-structured XGBoost, its split-node selection handles outliers in continuous-valued biological data and avoids the shift of the classification boundary that data preprocessing would cause. The method also supports cross-validation, and early stopping can be used to obtain the best training result. The added threshold control addresses the weight bias caused by imbalanced class samples and improves the predicted AUROC and AUPR values, giving a better classification result.

Description

Pan-cancer gene pathway prediction method, system and storage medium based on improved XGBoost
Technical Field
The invention relates to the field of biological gene analysis, and in particular to a pan-cancer gene pathway prediction method, system and storage medium based on improved XGBoost.
Background
Predicting pan-cancer gene pathways from TCGA gene expression data enables early diagnosis of cancer and reveals the relationship between gene expression and the activation of cancer pathways. A pan-cancer gene pathway analysis algorithm, XBBPCPA, is proposed. It uses the XGBoost machine learning algorithm to integrate data from more than 9,000 samples with about 180 million feature points, and mines and analyzes how pan-cancer gene expression affects pathway activation. A threshold-control hyperparameter is designed to control the classification boundary between positive and negative samples; this addresses the sample imbalance in the data and improves the classification evaluation metrics AUC and AUPR. Comparative experiments show that the XBPPA algorithm generalizes better in cancer pathway prediction.
Pan-cancer covers 33 common human cancers. The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) is a project run jointly by the U.S. National Human Genome Research Institute and the U.S. National Cancer Institute that has collected genomic data from more than 11,000 cases of common cancers.
Experiments on and validation of the RAS and P53 pathways have shown that the RAS pathway is altered in most cancers. When it is activated, variants are usually present, including gain-of-function variants (in KRAS, NRAS and HRAS) and loss-of-function variants (in NF1). Cancer types such as pancreatic cancer, cutaneous melanoma, thyroid cancer and lung adenocarcinoma have been determined to be triggered by RAS pathway variants. In addition, RAS pathway variation has been shown to be an early event in cancer development. Cancers caused by RAS pathway variation are difficult to treat, so accurately predicting and locating the conditions that activate the RAS pathway is critical for subsequent treatment. The P53 pathway is the gene pathway currently known to be most strongly associated with cancer: variants and abnormal expression of P53 are found in a large number of known cancers. P53 serves mainly as a marker for cancer diagnosis, and accurate prediction allows earlier detection and corresponding treatment.
In 2018, Gregory P. Way et al. predicted RAS pathway activation with logistic regression, a memory-based algorithm, in the article "Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas" published in Cell Reports. In 5-fold cross-validation the fit reached an AUROC of 0.86 and an AUPR of 0.61; on a new data set the generalization performance was an AUROC of 0.76 and an AUPR of 0.58. However, this method generalizes poorly and cannot be applied to pathways other than the RAS pathway. In addition, its AUROC and AUPR do not reach the theoretical upper limit for the data.
Disclosure of Invention
The object of the invention is as follows: in view of the problems above, to provide a pan-cancer gene pathway prediction method that generalizes well across data types and can predict a variety of pan-cancer gene pathways. Because the method builds on the tree-structured XGBoost, its split-node selection handles outliers in continuous-valued biological data and avoids the shift of the classification boundary that data preprocessing would cause. The method also supports cross-validation, and early stopping can be used to obtain the best training result. The added threshold control addresses the weight bias caused by imbalanced class samples and improves the predicted AUROC and AUPR values, giving a better classification result.
The technical scheme adopted by the invention is as follows:
A pan-cancer gene pathway prediction method based on improved XGBoost comprises the following steps:
training an improved XGBoost model with a training data set until the model converges, wherein the improved XGBoost model adds a threshold selection process to the XGBoost model, the threshold controls the classification boundary between positive and negative samples, and the threshold selection process adjusts the threshold according to a classification index.
The training data set consists of acquired cancer sample data, in which each cancer sample is associated with the cancer type to which it belongs.
Adding a threshold selection process to the XGBoost model addresses the weight bias caused by imbalanced class samples. Because the model is based on the tree-structured XGBoost, split-node selection handles outliers in continuous-valued biological data and avoids the classification boundary shift caused by data preprocessing.
Further, training the improved XGBoost model with the training data set specifically comprises training the improved XGBoost model on the training data set using K-fold cross-validation. The invention supports cross-validation, and early stopping can be used to obtain the best training result.
Further, the classification index is ROC-AUC.
Further, the parameters tuned while training the improved XGBoost model with K-fold cross-validation include the number of iterations, the maximum tree depth, the subsampling ratio, the regularization coefficients and the learning rate.
Further, K is 5.
Further, adjusting the threshold according to the classification index comprises: taking 0.5 as the reference threshold, predicting in the positive and negative sample intervals, computing the AUROC, and adjusting the threshold according to the result.
Further, the training data set is prepared as follows:
the copy number variation matrix and the gene expression matrix are merged by sample ID, and the samples are labeled using their mutation data, wherein the copy number variation matrix and the gene expression matrix are generated from the input RNA-seq, copy number and mutation data;
the merged matrix is preprocessed, the preprocessing including a filtering step, to obtain the final training data set.
Further, preprocessing the merged data comprises integrating the expression data and the variant data by sample ID and filtering out cancer types with fewer patients than a predetermined number.
A computer-readable storage medium stores a computer program which, when run, performs the above pan-cancer gene pathway prediction method based on improved XGBoost.
A pan-cancer gene pathway prediction system based on improved XGBoost comprises a processor and the above computer-readable storage medium, wherein the processor runs the computer program stored in the computer-readable storage medium to execute the pan-cancer gene pathway prediction method based on improved XGBoost.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
Because the method builds on the tree-structured XGBoost, its split-node selection handles outliers in continuous-valued biological data and avoids the shift of the classification boundary that data preprocessing would cause. The method also supports cross-validation, and early stopping can be used to obtain the best training result. The added threshold control (threshold selection) addresses the weight bias caused by imbalanced class samples and improves the predicted AUROC and AUPR values (the evaluation indices for gene pathway prediction), so that the classification result is better and the generalization performance of the model is greatly improved.
Drawings
The invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of the pan-cancer gene pathway prediction method based on improved XGBoost according to the present invention.
FIG. 2 is a diagram of the improved XGBoost model architecture.
FIG. 3 is a graph of the P53 pathway distribution for each cancer after sample filtering.
FIG. 4 shows the evaluation indices of the prediction results of the P53 model trained by the method of the present invention.
FIG. 5 is a graph of the RAS pathway distribution for each cancer after sample filtering.
FIG. 6 shows the evaluation indices of the prediction results of the RAS model trained by the method of the present invention.
Detailed Description
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
Any feature disclosed in this specification (including any accompanying claims, abstract) may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.
At present, cancer gene data come in many types, which strongly affects the generalization performance of the logistic regression method of Gregory P. Way et al. Gene expression data are continuous features: real values within a certain range that represent expression levels, not classes. Discretizing these continuous values would discard much of the original information, yet the continuous values contain many outliers, and in algorithms such as logistic regression the weights are easily influenced by extreme values. The algorithm chosen for such data must therefore handle continuous features, compute the split intervals of continuous feature values automatically, and keep its classification performance unaffected by outliers. Gene variant data, by contrast, are discrete and low-dimensional; they call for a simple algorithm, such as logistic regression, that must still guard against overfitting. Variant data are often used in combination with other data.
Logistic regression is a memory-type model: training assigns a weight to every feature value in the samples, and when a new sample is predicted, the weights are summed according to its feature values to give the predicted probability. Logistic regression has two problems here. First, its generalization performance depends on tuning both L1 and L2 regularization. Second, continuous numerical features contain many outliers, and the gradient of the sigmoid function approaches 0 as its input tends to positive or negative infinity, so gradients can vanish; discretizing the features instead would change the original data distribution. Logistic regression is therefore unsuitable for continuous numerical features with many outliers, since such data strongly degrade the generalization performance of the model.
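The contrast drawn above can be illustrated with a small sketch. It is not taken from the patent: the function names and the toy expression values are assumptions chosen for illustration. It shows that the sigmoid gradient is largest at 0 and nearly vanishes for an extreme input, while a tree node that only tests `value < split` produces the same partition however extreme the outlier is.

```python
import math


def sigmoid(z):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def stump_partition(values, split):
    """Partition produced by a tree node that tests `value < split`."""
    return [v < split for v in values]


# Hypothetical expression values for one gene; the last one is an outlier.
expression = [1.2, 2.5, 3.1, 4.8, 50.0]
# Same samples, but the outlier is far more extreme.
with_extreme = expression[:-1] + [5.0e6]

print(sigmoid_grad(0.0))   # gradient at the centre of the sigmoid
print(sigmoid_grad(20.0))  # gradient for an extreme input, close to 0
print(stump_partition(expression, 4.0) == stump_partition(with_extreme, 4.0))
```

The split test depends only on the ordering of the values relative to the split point, which is why the magnitude of the outlier does not change the result.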
The invention adds a threshold selection step, which addresses the weight bias caused by imbalanced class samples and improves the predicted AUROC and AUPR values, giving a better classification result. Because the method is based on the tree-structured XGBoost, split-node selection handles outliers in continuous-valued biological data and avoids the classification boundary shift caused by data preprocessing.
Example one
This embodiment discloses a pan-cancer gene pathway prediction method based on improved XGBoost, which trains an improved XGBoost model with a training data set until the model converges. The improved XGBoost model adds a threshold selection process to the XGBoost model: the threshold controls the classification boundary between positive and negative samples, and the threshold selection process adjusts the threshold according to the classification index. Once a threshold is set, the model is trained on the samples to obtain the corresponding classification index; if the index does not meet the requirement, the threshold is adjusted automatically according to the index.
The training data set used for training is prepared cancer sample data, in which each cancer sample is associated with the cancer type to which it belongs.
Example two
This embodiment discloses a pan-cancer gene pathway prediction method based on improved XGBoost, which comprises a data acquisition step, a matrix generation step, a matrix merging step, a model training step and a model verification step, as shown in FIG. 1. Specifically:
The copy number variation matrix and the gene expression matrix are merged by sample ID, and the samples are labeled using their mutation data: pathway activation is typically indicated by a TP53 mutation for the P53 pathway and by a gain-of-function KRAS, NRAS or HRAS mutation for the RAS pathway. The copy number variation matrix and the gene expression matrix are generated from the RNA-seq (gene expression), copy number and mutation data, which are obtained from TCGA (https://cancergenome.nih.gov/).
The merged data are preprocessed, including merging and filtering steps, to obtain the final training data set.
The improved XGBoost algorithm is trained on the training data set until the model converges. The improved XGBoost algorithm adds a threshold selection process to the XGBoost algorithm to control the classification boundary between positive and negative samples, with the aim of correcting the weight bias between positive and negative samples caused by data imbalance. The threshold selection process takes 0.5 as the reference, predicts in the intervals above and below it, computes the AUROC, and obtains the best threshold parameter from the value of the classification index. An example of the corresponding procedure is as follows:
Figure RE-GDA0002418511130000071
The improved XGBoost model is trained on the training data set using K-fold cross-validation. The main parameters tuned are the number of iterations (n_estimators), the maximum tree depth (max_depth), the subsampling ratio (subsample) and the regularization coefficients (reg_alpha for L1 and reg_lambda for L2); the learning rate (learning_rate) is tuned last. During cross-validation, ROC-AUC is used as the evaluation index.
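For readers unfamiliar with the two evaluation indices used throughout, the following is a minimal, dependency-free sketch of how they can be computed from labels and predicted scores. It is an illustration only: the function names are assumptions, AUROC is computed by pairwise ranking of positive against negative scores, and AUPR is approximated by average precision without special handling of tied scores.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Average precision, a common step-wise estimate of AUPR."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits


labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, predicted), average_precision(labels, predicted))
```

On imbalanced data the two indices can diverge noticeably, which is one reason the document reports both.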
In the training procedure described above, tumor data from the TCGA PanCanAtlas are used to train the improved XGBoost model; RNA-seq, copy number and mutation data are integrated for 9075 cancer samples across 33 different cancer types.
Example three
This embodiment discloses a pan-cancer gene pathway prediction method based on improved XGBoost, which comprises the following steps:
RNA-seq, copy number and mutation data are obtained from TCGA and entered into a copy number variation matrix and a gene expression matrix. The copy number variation matrix and the gene expression matrix are merged by sample ID, and the samples are labeled using their mutation data: pathway activation is typically indicated by a TP53 mutation for the P53 pathway and by a gain-of-function KRAS, NRAS or HRAS mutation for the RAS pathway.
Cancer types with too few samples (below the specified number) are filtered out. Tumor data from the TCGA PanCanAtlas are used to train the improved XGBoost model; RNA-seq, copy number and mutation data are integrated for 9075 cancer samples across 33 different cancer types. RNA-seq serves as the measure of tumor expression status, and a classifier is trained to detect gene expression patterns associated with aberrant pathway activity. The XGBoost algorithm maps samples to leaf nodes through node-splitting functions and adds trees to fit the residual of the preceding trees. The tree algorithm can learn combinations of gene importance weights that together best separate aberrant from wild-type expression patterns. In principle, the method can be used to predict other gene pathways or for similar tasks. In the present invention it is applied to classify P53 and RAS activity. The invention also compares a logistic-regression-based method with XGBoost, and the results show that XGBoost generalizes better.
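The idea that each new tree fits the residual left by the preceding trees can be sketched in a few lines. This is a simplified illustration, not XGBoost itself: it uses one feature, depth-1 trees (stumps), squared error, and no regularization or second-order terms. All names and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        wl, wr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - wl) ** 2 for r in left) + sum((r - wr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, wl, wr)
    _, t, wl, wr = best
    return lambda x: wl if x < t else wr  # leaf weight for sample x


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    trees, preds = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learning_rate * t(x) for t in trees)


# Toy data: expression of one gene and a binary pathway label.
model = boost([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0, 0, 0, 1, 1, 1])
print(model(2.0), model(5.0))
```

Each round shrinks the remaining residual by the learning rate, so the summed output approaches the labels as trees are added.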
Specifically, the prediction method is as follows:
From the TCGA database, suitable files (expression files: Gene Expression Quantification, HTSeq-FPKM; variant files: copy number variation) are selected according to the required filter condition (cancer type). Downloading the selection yields a Manifest file named with the current time, which contains four fields (id, filename, md5, size), where id is the identifier under which the data are stored in the TCGA database. After the fields are parsed, the data are downloaded by sending requests (using request).
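A minimal sketch of the manifest-handling step might look as follows. It assumes the Manifest file is tab-separated with a header row containing the four fields named above. The endpoint URL and the function names are assumptions for illustration and are not stated in the source. No network request is made here; the sketch only parses the manifest, builds the per-file download URL and checks a downloaded payload against the md5 field.

```python
import csv
import hashlib
import io

# Assumed data endpoint; the source only says the files come from TCGA.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def parse_manifest(text):
    """Parse a tab-separated manifest into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "id": row["id"],
            "filename": row["filename"],
            "md5": row["md5"],
            "size": int(row["size"]),
        }
        for row in reader
    ]


def download_url(entry):
    """URL from which the file identified by `id` would be requested."""
    return f"{GDC_DATA_ENDPOINT}/{entry['id']}"


def verify_md5(payload, entry):
    """Check downloaded bytes against the md5 recorded in the manifest."""
    return hashlib.md5(payload).hexdigest() == entry["md5"]
```

Checking the md5 and size fields after each download is a cheap way to detect truncated transfers before the files enter preprocessing.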
The data are preprocessed with pandas, including merging and filtering. Functions such as merge, groupby and count are used to generate pathway-related labels according to the classification target, and cancer types with fewer than 15 patients are removed according to the set filter condition, yielding the training data set. For the P53 pathway, 6746 samples covering 22 cancer types remain. The data set has two kinds of features: gene expression levels, which are continuous numerical features, and whether a gene carries a variant, which is a discrete feature.
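The merge, label and filter steps described above could be sketched with pandas roughly as follows. The column names (`sample_id`, `cancer_type`, `label`), the function name and the layout of the three input tables are assumptions made for illustration; only the use of merge/groupby/count and the 15-patient cutoff come from the text.

```python
import pandas as pd

MIN_SAMPLES_PER_CANCER = 15  # filter condition stated in the text


def build_training_set(expression, copy_number, mutation, pathway_genes):
    """Merge feature matrices by sample ID, label samples, filter rare cancers."""
    # Merge the expression and copy number matrices on the sample ID;
    # gene columns present in both get a suffix to keep them apart.
    features = expression.merge(
        copy_number, on="sample_id", how="inner", suffixes=("_expr", "_cnv")
    )
    # A sample is labeled positive if any pathway gene is mutated.
    label = mutation.set_index("sample_id")[pathway_genes].any(axis=1).astype(int)
    merged = features.merge(label.rename("label").reset_index(), on="sample_id")
    # Count samples per cancer type and drop types below the cutoff.
    counts = merged.groupby("cancer_type")["sample_id"].transform("count")
    return merged[counts >= MIN_SAMPLES_PER_CANCER].reset_index(drop=True)
```

For the P53 pathway one would call it with `pathway_genes=["TP53"]`; for the RAS pathway the list would hold the RAS genes named in the text.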
The improved XGBoost algorithm is trained on the training data set until the model converges.
The data are trained with XGBoost, and the model is trained with 5-fold cross-validation (GridSearchCV). Parameter tuning mainly covers the number of iterations (n_estimators), the maximum tree depth (max_depth), the subsampling ratio (subsample) and the regularization coefficients (reg_alpha for L1 and reg_lambda for L2); the learning rate (learning_rate) is tuned last. During cross-validation, ROC-AUC is used as the evaluation index.
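The tuning order described above can be sketched as a stage-wise grid search. This is not GridSearchCV itself but a plain-Python stand-in that shows the order in which the parameters are searched. The candidate values, the grouping into stages and the `evaluate` callable (which would return the mean 5-fold ROC-AUC of a model built with the given parameters) are all assumptions for illustration.

```python
from itertools import product

# Hypothetical candidate values; the source does not list the grids used.
STAGES = [
    {"n_estimators": [100, 300, 500]},
    {"max_depth": [3, 5, 7]},
    {"subsample": [0.6, 0.8, 1.0]},
    {"reg_alpha": [0.0, 0.1, 1.0], "reg_lambda": [0.1, 1.0, 10.0]},
    {"learning_rate": [0.01, 0.05, 0.1]},  # tuned last, as in the text
]


def staged_grid_search(stages, evaluate, base=None):
    """Tune one group of parameters at a time, keeping earlier winners fixed."""
    best = dict(base or {})
    best_score = float("-inf")
    for grid in stages:
        names = list(grid)
        stage_best, stage_score = None, float("-inf")
        for values in product(*(grid[n] for n in names)):
            candidate = {**best, **dict(zip(names, values))}
            score = evaluate(candidate)  # e.g. mean 5-fold ROC-AUC
            if score > stage_score:
                stage_best, stage_score = candidate, score
        best, best_score = stage_best, stage_score
    return best, best_score
```

Searching in stages evaluates far fewer combinations than a full grid over all six parameters, at the cost of possibly missing interactions between parameters tuned in different stages.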
Example four
This embodiment explains why the XGBoost algorithm needs to be improved. It presents the theoretical basis of the algorithm and the derivation of the loss function, and analyzes the suitability of XGBoost for pathway analysis from the perspective of the loss function.
In this experiment, XGBoost (Tianqi Chen et al., 2016), a supervised decision tree model, is trained on the data set. We are given a data set $\mathcal{D}$ consisting of $n$ samples and $m$ features:
$$\mathcal{D} = \{(x_i, y_i)\} \quad (|\mathcal{D}| = n,\; x_i \in \mathbb{R}^m,\; y_i \in \mathbb{R})$$
The predicted value for a sample is the sum of the outputs of the $K$ trees:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
where $f_k(x_i)$ is the weight of the leaf node to which sample $x_i$ is assigned in the $k$-th tree.
We then define the function space of XGBoost:
$$\mathcal{F} = \{ f(x) = w_{q(x)} \} \quad (q : \mathbb{R}^m \to \{1, \dots, T\},\; w \in \mathbb{R}^T)$$
where $q$ is the structure function of a single tree, which maps an input $x$ to a leaf node; $T$ is the number of leaf nodes; and $w_{q(x)}$ is the score of sample $x$ in the tree, that is, the weight $w$ of the leaf to which $q$ maps $x$. The structure of XGBoost is explained below with a simple example.
(A) Example of the input data format: the expression matrix merged with the copy number variation matrix. An example expression matrix is shown in the following table:
Figure RE-GDA0002418511130000102
(B) The final output for sample x is equal to the sum of the scores of the leaf nodes to which x is mapped. The structure is shown in FIG. 2.
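The additive structure described in (B) can be written out directly. In this sketch each tree is represented by a structure function `q` (sample to leaf index) and a list of leaf weights `w`, following the symbols used in this embodiment. The class name, the two toy trees and the feature layout are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Tree:
    q: Callable[[Sequence[float]], int]  # structure function: sample -> leaf index
    w: List[float]                       # leaf weights, one per leaf (T entries)

    def __call__(self, x):
        """Score of sample x in this tree: the weight of its leaf."""
        return self.w[self.q(x)]


def predict(trees, x):
    """Final output for x: sum of the leaf scores over all trees."""
    return sum(tree(x) for tree in trees)


# Toy sample: x[0] is an expression level, x[1] is a 0/1 variant flag.
trees = [
    Tree(q=lambda x: 0 if x[0] < 5.0 else 1, w=[-0.4, 0.6]),
    Tree(q=lambda x: 0 if x[1] == 0 else 1, w=[-0.1, 0.3]),
]
print(predict(trees, [7.2, 1]))
```

For binary classification the summed score would then be passed through a sigmoid to obtain a probability, to which the classification threshold is applied.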
Repeated tests show that XGBoost improves the AUC over logistic regression in training, but Recall is not noticeably improved.
Analysis shows that the classification performance of the model is influenced by the distribution of positive and negative samples: the model parameters are adjusted more by the class with more samples, so the algorithm tends to predict that class. This follows from the nature of the XGBoost algorithm, whose predicted value is the sum of the outputs of the $K$ trees:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
If one class has too many samples, the weak classifiers fit only that class, and once the number of such weak classifiers exceeds a certain level the algorithm can no longer learn the characteristics of the other classes. This bias ultimately leads the algorithm to output only the probability of the majority class.
The conventional way to handle an imbalanced distribution of positive and negative samples is to reach a relatively balanced state by resampling the data in a certain proportion or by weighting the samples differently. However, such preprocessing strongly alters the original distribution of the data. Cancer gene expression data are precise, and small differences in expression lead to different degrees and types of lesions. The class of a predicted sample must therefore be controlled at the output stage of the algorithm, so as to reduce as far as possible the effect of sample imbalance on classification.
Therefore, the invention adds a hyperparameter to the original algorithm, the threshold:
Figure RE-GDA0002418511130000112
The threshold controls the classification boundary between positive and negative samples in order to correct the weight bias between positive and negative samples caused by data imbalance. An example of the corresponding algorithm is as follows:
Figure RE-GDA0002418511130000121
In the experiments, threshold selection is treated as a typical interval search problem. A threshold selection method is designed and implemented on the basis of binary search combined with the XGBoost training process: taking 0.5 as the reference, predictions are made in the intervals above and below it, the AUROC is computed, and the best threshold parameter is obtained from the value of the classification index. In one embodiment, pseudocode for the threshold selection method is as follows:
Figure RE-GDA0002418511130000131
The threshold is adjusted dynamically according to the AUROC value, so that the sample classification boundary adapts automatically, which solves the problem of weight bias between positive and negative samples caused by data imbalance.
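Since the original listing is only available as an image, the following is a hedged sketch of one way such a threshold search could work, not a reproduction of the patented procedure. It starts at 0.5, tries a candidate threshold below and above the current best, keeps whichever gives the higher index, and halves the step each round. It assumes the AUROC is computed on the thresholded 0/1 predictions (which equals the mean of the true positive and true negative rates), because an AUROC computed on the raw scores would not change with the threshold. Function names, step size and tolerance are assumptions.

```python
def hard_label_auroc(y_true, y_pred):
    """AUROC of a hard 0/1 prediction; equals (TPR + TNR) / 2."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def select_threshold(y_true, scores, start=0.5, step=0.25, tol=1e-3):
    """Interval search around `start`; the step is halved every round."""
    def score_at(t):
        return hard_label_auroc(y_true, [1 if s >= t else 0 for s in scores])

    best_t, best_auc = start, score_at(start)
    while step > tol:
        # Probe the intervals below and above the current best threshold.
        for cand in (best_t - step, best_t + step):
            if 0.0 < cand < 1.0:
                auc = score_at(cand)
                if auc > best_auc:
                    best_t, best_auc = cand, auc
        step /= 2
    return best_t, best_auc


# Imbalanced toy data: 8 negatives, 2 positives, all scores below 0.5.
y = [0] * 8 + [1] * 2
p = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.45]
print(select_threshold(y, p))
```

In the toy data the default threshold of 0.5 labels every sample negative, while the search settles on a lower threshold that separates the two classes.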
Example five
This example discloses a specific implementation of the pan-cancer gene pathway prediction method based on improved XGBoost, to demonstrate the advantages of the present invention.
Activation of the P53 pathway is mainly caused by variants of the TP53 gene. The P53 pathway is the gene pathway currently known to be most strongly associated with cancer: TP53 is mutated in more than 50% of all malignant tumors, which activates the P53 pathway. TP53 is in essence a tumor suppressor gene, but after mutation it turns into an oncogene; its spatial structure changes and it loses its control over cell division, apoptosis and DNA repair.
In this example, preprocessing and filtering (removing cancer types with fewer than 15 patients) produced 6746 data samples covering 22 cancer types; the data distribution (P53 pathway status for each cancer) is shown in FIG. 3. 20% of the data (n = 1349) was set aside as a validation data set for checking the generalization performance of the model, and the remaining 80% (n = 5397) was used as the training data set, on which the model was trained with 5-fold cross-validation.
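The split sizes quoted above can be checked with a short sketch of the hold-out and 5-fold partitioning. It is illustrative only: the function names and the random seed are assumptions, and the split shown is a plain shuffle rather than a stratified one, since the source does not say how samples were assigned.

```python
import random


def holdout_split(n_samples, holdout_frac=0.2, seed=0):
    """Shuffle sample indices and set aside a fraction for validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_holdout = int(n_samples * holdout_frac)
    return idx[n_holdout:], idx[:n_holdout]  # (training, validation)


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, folds[i]


train_idx, valid_idx = holdout_split(6746)
print(len(valid_idx), len(train_idx))  # sizes of the 20% / 80% split
for fold_train, fold_valid in k_fold(train_idx, k=5):
    print(len(fold_train), len(fold_valid))
```

The validation indices are never passed to `k_fold`, which keeps the held-out set independent of the cross-validation used for tuning.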
For the P53 training model, 5-fold cross validation had an AUC-ROC of 0.939, an AUPR of 0.892, an AUC-ROC of 0.90 and an AUPR of 0.95 on the validation dataset. Compared with logistic regression, AUC-ROC is 0.83, AUPR is 0.83, the improvement is large, the improved algorithm has strong generalization performance, cross validation AUC is also improved, AUC-ROC is 0.88 before threshold control is not used, and AUPR is 0.89. The improved algorithm can be seen to have greater improvement on the evaluation index for the prediction of the gene pathway. The effect is shown in FIG. 4, where FIG. A is AUROC and B is AUPR.
RAS pathway activation results from mutations in the KRAS, NRAS, HRAS, or NF1 genes. Certain cancer types are known to be driven primarily by RAS pathway gene mutations, such as pancreatic cancer (PAAD), cutaneous melanoma (SKCM), thyroid cancer (THCA), lung adenocarcinoma (LUAD), and colon adenocarcinoma (COAD).
In this example, preprocessing and filtering (removing cancer types with fewer than 15 patients) yielded 2326 samples; the data distribution is shown in FIG. 5. Of these, 20% (n = 465) were selected as the validation data set and the rest (n = 1861) as the training data set, and the model was again trained with 5-fold cross-validation. For the RAS pathway prediction model, the 5-fold cross-validation AUC-ROC was 0.84. Adding threshold control improved the generalization performance of the model: on the validation data set, the AUC-ROC rose from 0.766 before the improvement to 0.80, and the AUPR rose from 0.60 to 0.845. The results are shown in FIG. 6, where panel A is AUROC and panel B is AUPR.
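The preprocessing used in both experiments (integrating data by sample ID, then dropping cancer types with fewer than 15 patients) might look like the sketch below. The record layout, the `cancer_type` key, the `expr:`/`cn:` feature prefixes, and the assumption of one sample per patient are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List


def merge_by_sample_id(
    expression: Dict[str, Dict[str, float]],
    copy_number: Dict[str, Dict[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Join two feature tables on sample ID, keeping only IDs present in both.

    Feature names are prefixed so that a gene present in both tables
    does not overwrite itself.
    """
    merged = {}
    for sample_id in expression.keys() & copy_number.keys():
        row = {f"expr:{gene}": value for gene, value in expression[sample_id].items()}
        row.update({f"cn:{gene}": value for gene, value in copy_number[sample_id].items()})
        merged[sample_id] = row
    return merged


def filter_rare_cancer_types(samples: List[dict], min_patients: int = 15) -> List[dict]:
    """Drop every sample whose cancer type has fewer than `min_patients` samples.

    Assumes one sample per patient, so sample counts equal patient counts.
    """
    counts = Counter(sample["cancer_type"] for sample in samples)
    return [sample for sample in samples if counts[sample["cancer_type"]] >= min_patients]


cohort = [{"id": f"L{i}", "cancer_type": "LUAD"} for i in range(15)]
cohort += [{"id": f"P{i}", "cancer_type": "PAAD"} for i in range(14)]
kept = filter_rare_cancer_types(cohort)
print(len(kept), {sample["cancer_type"] for sample in kept})
```

Filtering before the train/validation split keeps every remaining cancer type large enough to appear in both partitions.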
EXAMPLE six
This embodiment discloses a computer-readable storage medium storing a computer program which, when run, executes the pan-cancer gene pathway prediction method based on improved XGBoost of any one of the above embodiments.
EXAMPLE seven
This embodiment discloses a pan-cancer gene pathway prediction system based on improved XGBoost, comprising a processor and the computer-readable storage medium of the sixth embodiment, wherein the processor runs the computer program stored in the computer-readable storage medium to execute the pan-cancer gene pathway prediction method based on improved XGBoost.
EXAMPLE eight
This embodiment discloses a pan-cancer gene pathway prediction system based on improved XGBoost, which runs the pan-cancer gene pathway prediction method based on improved XGBoost of any one of the first to fifth embodiments.
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed.

Claims (10)

1. A pan-cancer gene pathway prediction method based on improved XGBoost, characterized by comprising the following steps:
training the improved XGBoost model by using a training data set until the model converges; wherein the improved XGBoost model adds a threshold selection process to the XGBoost model, the threshold is used for controlling the classification boundary between positive and negative samples, and the threshold selection process adjusts the threshold according to a classification index.
2. The method for predicting a pan-cancer gene pathway based on improved XGBoost of claim 1, wherein training the improved XGBoost model using the training data set specifically comprises: fitting the improved XGBoost model to the training data set, and training the improved XGBoost model by using K-fold cross-validation.
3. The method for predicting a pan-cancer gene pathway based on improved XGBoost of claim 2, wherein the classification index is ROC-AUC.
4. The method for predicting a pan-cancer gene pathway based on improved XGBoost of claim 2, wherein in the training of the improved XGBoost model using K-fold cross-validation, the adjusted parameters comprise the number of iterations, the maximum tree depth, the subsampling coefficient, the regularization coefficient, and the learning rate.
5. The method for predicting a pan-cancer gene pathway based on improved XGBoost of claim 2, wherein K = 5.
6. The method for predicting a pan-cancer gene pathway based on improved XGBoost according to any one of claims 1-4, wherein the process of adjusting the threshold according to the classification index comprises: predicting the positive and negative sample intervals by taking 0.5 as a reference threshold, calculating AUROC, and adjusting the threshold according to the calculation result.
7. The method of claim 1, wherein the preparation of the training data set comprises:
merging the number variation matrix and the gene expression matrix according to the ID of the sample; labeling mutation data of the sample; wherein, the number variation matrix and the gene expression matrix are generated by correspondingly inputting RNA-seq, copy number and mutation data;
and preprocessing the merged matrix, wherein the preprocessing comprises a filtering step so as to finally obtain a training data set.
8. The method of claim 6, wherein the step of pre-processing the integrated data comprises: integrating the expression data and the variant data according to the sample ID, and filtering the cancer categories with the data volume of the patient being less than the predetermined number.
9. A computer-readable storage medium storing a computer program, wherein the computer program is executed to perform the method for predicting a pan-cancer gene pathway based on modified XGBoost according to any one of claims 1 to 8.
10. An improved XGBoost-based pan-cancer gene pathway prediction system comprising a processor and the computer-readable storage medium of claim 9, the processor being configured to execute a computer program stored in the computer-readable storage medium to execute an improved XGBoost-based pan-cancer gene pathway prediction method.
CN202010041366.9A 2020-01-15 2020-01-15 Method, system and storage medium for predicting genetic pathway of pan-cancer based on improved XGBoost Active CN111243662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010041366.9A CN111243662B (en) 2020-01-15 2020-01-15 Method, system and storage medium for predicting genetic pathway of pan-cancer based on improved XGBoost

Publications (2)

Publication Number Publication Date
CN111243662A true CN111243662A (en) 2020-06-05
CN111243662B CN111243662B (en) 2023-04-21

Family

ID=70876947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010041366.9A Active CN111243662B (en) 2020-01-15 2020-01-15 Method, system and storage medium for predicting genetic pathway of pan-cancer based on improved XGBoost

Country Status (1)

Country Link
CN (1) CN111243662B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801140A (en) * 2021-01-07 2021-05-14 长沙理工大学 XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm
CN113269190A (en) * 2021-07-21 2021-08-17 中国平安人寿保险股份有限公司 Data classification method and device based on artificial intelligence, computer equipment and medium
CN113506593A (en) * 2021-07-06 2021-10-15 大连海事大学 Intelligent inference method for large-scale gene regulation network
CN113674864A (en) * 2021-08-30 2021-11-19 重庆大学 Method for predicting risk of malignant tumor complicated with venous thromboembolism
CN116417070A (en) * 2023-04-17 2023-07-11 齐鲁工业大学(山东省科学院) Method for improving prognosis prediction precision of gastric cancer typing based on gradient lifting depth feature selection algorithm

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170185904A1 (en) * 2015-12-29 2017-06-29 24/7 Customer, Inc. Method and apparatus for facilitating on-demand building of predictive models
US20180365372A1 (en) * 2017-06-19 2018-12-20 Jungla Inc. Systems and Methods for the Interpretation of Genetic and Genomic Variants via an Integrated Computational and Experimental Deep Mutational Learning Framework
US20190156159A1 (en) * 2017-11-20 2019-05-23 Kavya Venkata Kota Sai KOPPARAPU System and method for automatic assessment of cancer
CN110070916A (en) * 2019-04-29 2019-07-30 安徽大学 A kind of Cancerous disease gene expression characteristics selection method based on historical data
CN110111888A (en) * 2019-05-16 2019-08-09 闻康集团股份有限公司 A kind of XGBoost disease probability forecasting method, system and storage medium
EP3540654A1 (en) * 2018-03-16 2019-09-18 Ricoh Company, Ltd. Learning classification device and learning classification method
US20190316209A1 (en) * 2018-04-13 2019-10-17 Grail, Inc. Multi-Assay Prediction Model for Cancer Detection
WO2019242597A1 (en) * 2018-06-20 2019-12-26 The Chinese University Of Hong Kong Measurement and prediction of virus genetic mutation patterns

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
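R. VIJAYARAJESWARI et al.: "Classification of mammogram for early detection of breast cancer using SVM classifier and Hough transform", MEASUREMENT *
Wang Guisong et al.: "Wind turbine gearbox temperature early warning based on XGBoost modeling and Change-Point residual processing", Electric Power Science and Engineering *
Gao Xin et al.: "Multi-class fault classification method for smart meters based on adaptive model selection and fusion", Power System Technology *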

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801140A (en) * 2021-01-07 2021-05-14 长沙理工大学 XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm
CN113506593A (en) * 2021-07-06 2021-10-15 大连海事大学 Intelligent inference method for large-scale gene regulation network
CN113506593B (en) * 2021-07-06 2024-04-12 大连海事大学 Intelligent inference method for large-scale gene regulation network
CN113269190A (en) * 2021-07-21 2021-08-17 中国平安人寿保险股份有限公司 Data classification method and device based on artificial intelligence, computer equipment and medium
CN113674864A (en) * 2021-08-30 2021-11-19 重庆大学 Method for predicting risk of malignant tumor complicated with venous thromboembolism
CN113674864B (en) * 2021-08-30 2023-08-11 重庆大学 Malignant tumor combined venous thromboembolism risk prediction method
CN116417070A (en) * 2023-04-17 2023-07-11 齐鲁工业大学(山东省科学院) Method for improving prognosis prediction precision of gastric cancer typing based on gradient lifting depth feature selection algorithm

Also Published As

Publication number Publication date
CN111243662B (en) 2023-04-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant