CN113284612B - Survival analysis method based on XGBoost algorithm - Google Patents

Survival analysis method based on XGBoost algorithm

Info

Publication number
CN113284612B
Authority
CN
China
Prior art keywords
survival
individual
decision tree
patient
risk ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110560207.4A
Other languages
Chinese (zh)
Other versions
CN113284612A (en)
Inventor
马宝山
侯晓宇
苍佳慧
赵浩然
钱政宇
陈一盈
闫格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University
Priority to CN202110560207.4A
Publication of CN113284612A
Application granted
Publication of CN113284612B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a survival analysis method based on the XGBoost algorithm. The method optimizes the objective function of the original XGBoost method and uses Cox regression with a penalty term as the new learning target. A loss function is customized to the survival data, and its first-order and second-order gradients are derived. The Breslow approximation of the Cox partial likelihood, combined with an L1 penalty term, is used to derive a simplified mathematical expression of the gradient. Based on this expression, the individual risk ratio predicted value is optimized by a decision tree algorithm, which enables accurate prediction of patient survival from gene expression data, provides interpretability and adaptability to high-dimensional data, and effectively predicts the survival state of the patient.

Description

Survival analysis method based on XGBoost algorithm
Technical Field
The invention relates to the technical field of biomedicine, in particular to a survival analysis method based on an XGBoost algorithm.
Background
Survival analysis is a method of analyzing and inferring the survival time of organisms or people from data obtained through trials or surveys, and of studying the relationship between survival time, outcome, and multiple influencing factors, together with the strength of that relationship; it is also called survival rate analysis. By studying the distribution of survival times, the influence of experimental conditions on survival time can be understood. In biomedical research, survival analysis is a very important and common analytical method. It can be used to understand disease prognosis, evaluate the quality of a treatment, or observe the effect of health-care measures. In addition, plotting survival curves from the results of a survival analysis, and quantifying and testing survival differences between two or more groups of patients, helps clinicians and clinical researchers in treating patients.
In medical research, the Cox proportional hazards (CPH) model, commonly referred to as the Cox model, is often used for survival analysis. It predicts risk scores by regressing the features or covariates of a set of patient data on the time to event, while making effective use of censored data. As a linear model, however, the Cox model cannot fully represent the complex nonlinear relationship between the log risk ratio and static covariates. Consequently, XGBoost survival analysis models may also be affected when processing gene expression data.
Disclosure of Invention
The invention provides a survival analysis method based on an XGBoost algorithm, which aims to overcome the technical problems.
The invention discloses a survival analysis method based on XGBoost algorithm, which comprises the following steps:
inputting survival analysis data; the survival analysis data includes: patient individual samples, gene characteristics, individual risk ratio predictive values and XGBoost algorithm parameters; the patient individual sample comprising: training samples and test samples; the training sample and the test sample both comprise the survival time and the survival state of the patient;
initializing the individual risk ratio predictor;
establishing a Cox model based on survival time and survival state, defining a loss function of the Cox model according to the survival time and survival state, and adding a penalty term to the Cox model through the loss function;
calculating a first derivative and a second derivative of the loss function to the individual risk ratio predictor;
establishing a decision tree according to the first derivative and the second derivative, and training the decision tree through the training sample to optimize the prediction precision of the individual risk ratio and XGBoost algorithm parameters;
calculating the test sample by using the trained decision tree to obtain a plurality of prediction precision values, and selecting the optimal value in the prediction precision values to be substituted into a Cox regression model with penalty items to obtain an optimized individual risk ratio prediction value; and predicting the survival time of the individual sample according to the optimized individual risk ratio predicted value.
Further, the survival analysis data includes:
the patient individual samples are represented by $\{(x_i, t_i, \delta_i) \mid i = 1, \dots, n\}$, $x_i \in \mathbb{R}^m$;
wherein $x_i$ is the patient's gene expression data, $t_i$ is the patient's survival time, and $\delta_i$ is the patient's survival state;
the patient individual samples are selected from the TCGA database; the XGBoost algorithm parameters include: the parameter ranges of the XGBoost algorithm, the number of loops for calculating prediction accuracy, the booster parameters and the step size.
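A minimal Python sketch of how such inputs could be preset is shown here. It assumes the parameter names of the open-source XGBoost library, and it uses that library's stock `survival:cox` objective as a stand-in for the customized penalized objective of this method. The helper `encode_cox_labels` and every numeric value are hypothetical placeholders, not settings taken from the patent.

```python
def encode_cox_labels(times, events):
    """Signed-time labels as expected by the stock survival:cox objective:
    positive time for an observed event, negative time for a censored sample."""
    return [t if d == 1 else -t for t, d in zip(times, events)]

# Hypothetical starting values; each would normally be searched over a preset range.
params = {
    "objective": "survival:cox",   # stock Cox objective, not the customized one
    "eval_metric": "cox-nloglik",  # negative partial log-likelihood
    "eta": 0.05,                   # step size (learning rate)
    "max_depth": 3,                # booster parameter: maximum tree depth
    "alpha": 0.1,                  # L1 regularization on leaf weights
    "lambda": 1.0,                 # L2 regularization on leaf weights
}
num_boost_round = 200              # number of boosting iterations

# Three patients: (t_i, delta_i) = (5, event), (8, censored), (3, event).
labels = encode_cox_labels([5.0, 8.0, 3.0], [1, 0, 1])
```

Note that `alpha` in the library penalizes leaf weights, whereas the L1 penalty described in this method is stated on the Cox coefficients, so the two are related but not identical.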
Further, the establishing the Cox model based on the survival time and the survival state comprises the following steps: establishing a risk function to obtain the instantaneous mortality of the observed object at the time t, wherein the risk function is expressed as:
wherein X represents the gene features, $h(t, X)$ represents the risk function, $\Delta t$ represents a time interval, and $P(T > t)$ represents the survival probability;
assuming that the ratio of individual risk to population baseline risk is a constant scalar factor, the function of the Cox model is expressed as:
wherein $\beta_1, \beta_2, \dots, \beta_m$ are the partial regression coefficients of the independent variables, and $h_0(t)$ is the baseline risk rate, i.e. $h(t, X)$ when X is 0; $y = f(x) = \beta^T X$ represents the log hazard ratio, where $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates and m is the number of gene features;
the Cox model estimates the coefficient vector $\beta$ for survival data without ties by maximizing the partial likelihood function, expressed as:
wherein $\beta^T X_i$ represents the predicted log risk ratio of individual i; $\delta_i = 1$ indicates that the event of interest occurred and $\delta_i = 0$ indicates censoring; $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates; and $q(t)$ represents the individual dying at time t;
when the survival data contain ties, the partial likelihood function given by the Breslow approximation is:
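The Breslow form can be illustrated with a short Python sketch. It assumes the standard textbook Breslow approximation, in which every death at a tied time shares the same risk set; the function name and argument layout are illustrative and are not taken from the patent.

```python
import math

def breslow_neg_log_partial_likelihood(scores, times, events):
    """Negative log of the Breslow partial likelihood.

    scores: predicted log risk ratios (beta^T x_i)
    times:  survival times t_i
    events: delta_i, 1 = event observed, 0 = censored
    """
    nll = 0.0
    # Each distinct event time contributes one factor to the likelihood.
    for t in sorted({ti for ti, d in zip(times, events) if d == 1}):
        # Scores of the individuals dying at t (tied deaths share one risk set).
        deaths = [s for s, ti, d in zip(scores, times, events) if ti == t and d == 1]
        # Summed risk exp(score) over everyone still at risk at time t.
        risk_sum = sum(math.exp(s) for s, ti in zip(scores, times) if ti >= t)
        nll -= sum(deaths) - len(deaths) * math.log(risk_sum)
    return nll
```

With all scores equal to zero the value depends only on the risk-set sizes, which makes a convenient sanity check for the handling of ties and censoring.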
Further, defining the loss function of the Cox model according to the survival time and the survival state, and adding a penalty term to the Cox model through the loss function, comprises:
the partial likelihood function obtained from equation (4) is the Breslow approximation $L_B$, in which the set of individuals at risk at time t is used to express the sum of their estimated risk ratios; taking the negative logarithm of both sides of $L_B$ gives:
adding an L1 penalty term and building the loss function, expressed as:
$L_p = l_B - P(\beta)$ (6)
wherein m is the number of gene features, $P(\beta)$ is the L1 penalty, $\lambda$ is the penalty term parameter, and $\beta = (\beta_1, \beta_2, \dots, \beta_m)^T$ is the vector of coefficients of the gene expression data in the Cox regression.
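A small Python sketch of equation (6) follows. It assumes the usual L1 form $P(\beta) = \lambda \sum_j |\beta_j|$, and it reads $l_B$ as a log partial likelihood to be maximized, which is the reading under which subtracting the penalty is consistent. If $l_B$ is instead taken as a negative log-likelihood to be minimized, the penalty would be added rather than subtracted. Function names are illustrative.

```python
def l1_penalty(beta, lam):
    """P(beta) = lam * sum_j |beta_j| over the m gene-feature coefficients."""
    return lam * sum(abs(b) for b in beta)

def penalized_objective(log_pl, beta, lam):
    """L_p = l_B - P(beta), with l_B read as the log partial likelihood."""
    return log_pl - l1_penalty(beta, lam)
```

Larger values of `lam` push more coefficients toward zero, which is the usual motivation for an L1 penalty on high-dimensional gene expression data.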
Further, the calculating first and second derivatives of the loss function to the individual risk ratio predictor includes:
deriving the first and second derivatives of the loss function $L_p$; the derivative $d(y_i)$ of the log-likelihood function $l_B$ is expressed as:
$g_i$ is calculated from $P'(\beta)$, expressed as:
wherein,
for individual sample i, the gradient with respect to the predicted relative risk ratio $\hat{y}_i$ must be derived; if the patient dies, the indicator variable is $\delta_i = 1$ and $t \le T_i$; if the patient is not observed to die, $\delta_i = 0$ and $t < T_i$; then the first derivative of $L_p$ with respect to $\hat{y}_i$ is expressed as:
differentiating formula (10), the second derivative of $L_p$ with respect to $\hat{y}_i$ is expressed as:
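The derivatives can be sketched in Python for the likelihood part alone. The sketch assumes the standard Breslow negative log partial likelihood $\sum_i \delta_i [\log S(t_i) - y_i]$ with $S(t) = \sum_{j: t_j \ge t} e^{y_j}$, and it omits the L1 penalty contribution, whose exact form in equations (7) to (11) is not reproduced in this text. The function name is illustrative.

```python
import math

def cox_grad_hess(scores, times, events):
    """First and second derivatives of the Breslow negative log partial
    likelihood with respect to each predicted log risk ratio y_k."""
    n = len(scores)
    exp_s = [math.exp(s) for s in scores]
    # risk[i]: summed exp(y_j) over everyone still at risk at time t_i.
    risk = [sum(exp_s[j] for j in range(n) if times[j] >= times[i]) for i in range(n)]
    grads, hesss = [], []
    for k in range(n):
        # Events at or before t_k are the ones whose risk set contains sample k.
        idx = [i for i in range(n) if events[i] == 1 and times[i] <= times[k]]
        a = sum(1.0 / risk[i] for i in idx)
        b = sum(1.0 / risk[i] ** 2 for i in idx)
        grads.append(exp_s[k] * a - events[k])         # g_k
        hesss.append(exp_s[k] * a - exp_s[k] ** 2 * b)  # h_k
    return grads, hesss
```

The gradients sum to zero because the partial likelihood is unchanged when a constant is added to every score, which gives a quick check on an implementation.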
Further, building a decision tree from the first and second derivatives comprises:
the decision tree is expressed as:
$F = \{ f(x) = \omega_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$ (12)
wherein F represents the set of trees defined by the leaf node weights $\omega$; q represents the structure of each decision tree, which maps individual samples to the corresponding leaf nodes; T represents the number of leaf nodes of the corresponding decision tree; $\omega_{q(x)}$ combines the structure q of the decision tree with the leaf node weights $\omega$; the predicted value of the XGBoost algorithm is the sum of the leaf node values of all decision trees; and f(x) represents a classification and regression tree of the XGBoost algorithm framework.
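Equation (12) can be made concrete with a toy Python sketch in which each tree is a pair of a structure function q and a weight list ω. The single-split trees, feature indices, thresholds and weights below are hypothetical values chosen only for illustration.

```python
def make_stump(feature, threshold, weights):
    """A depth-1 tree: q maps a sample to leaf 0 or 1, weights holds omega."""
    def q(x):
        return 0 if x[feature] < threshold else 1
    return q, weights

def predict(trees, x):
    """Ensemble prediction: the sum of omega[q(x)] over all trees."""
    return sum(weights[q(x)] for q, weights in trees)

# Two hypothetical trees over a two-feature sample.
trees = [make_stump(0, 0.5, [-0.2, 0.3]), make_stump(1, 1.0, [0.1, -0.4])]
```

For example, `predict(trees, [0.7, 0.2])` adds the weight of leaf 1 of the first tree to the weight of leaf 0 of the second tree.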
Further, training the decision tree with the training sample to optimize the prediction precision of the individual risk ratio and the XGBoost algorithm parameters comprises:
minimizing the objective function with the penalty term, adding a new function $f_t(x)$ in each iteration, and using a second-order approximation to optimize the prediction precision of the individual risk ratio and the XGBoost algorithm parameters, expressed as:
in formula (13), t represents the iteration number, $g_i$ and $h_i$ are the first and second derivatives, $\Omega(f_t)$ represents the constraint on $f_t(x)$, and $\hat{y}_i$ and $y_i$ respectively represent the predicted value and the true value of individual sample i; $\epsilon$ in formula (14) is the step size; $L^{(t)}$ represents the objective function.
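One boosting round under a second-order approximation can be sketched in Python as below. The sketch assumes the standard XGBoost regularizer with an L2 term on leaf weights, so that the best weight of a leaf is $-G/(H + \lambda)$ with G and H the summed first and second derivatives in that leaf. The exact form of $\Omega(f_t)$ in this method is not reproduced in the text, so `reg_lambda`, `leaf_weight` and `boost_step` are illustrative names and assumptions.

```python
def leaf_weight(g, h, reg_lambda=1.0):
    """Weight minimizing sum(g)*w + 0.5*(sum(h) + reg_lambda)*w**2 for one leaf."""
    return -sum(g) / (sum(h) + reg_lambda)

def boost_step(y_pred, leaf_of, g, h, step=0.1, reg_lambda=1.0):
    """One boosting round for a fixed tree structure.

    leaf_of[i] is the leaf index q(x_i); the result is
    y_pred[i] + step * omega[leaf_of[i]] for every sample.
    """
    omega = {}
    for j in set(leaf_of):
        members = [i for i in range(len(leaf_of)) if leaf_of[i] == j]
        omega[j] = leaf_weight([g[i] for i in members],
                               [h[i] for i in members],
                               reg_lambda)
    return [y + step * omega[j] for y, j in zip(y_pred, leaf_of)]
```

The `step` argument plays the role of the step size $\epsilon$: it shrinks each new tree's contribution before it is added to the running prediction.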
Further, calculating the test sample with the trained decision tree to obtain a plurality of prediction precision values, and selecting the optimal value among them to substitute into the Cox regression model with the penalty term to obtain the optimized individual risk ratio predicted value, comprises: retaining the parameters of the trained decision tree and using the trained decision tree to calculate the prediction precision on the test sample; if l' < l, recalculating the first and second derivatives of the loss function with respect to the individual risk ratio predicted value; if l' = l, selecting the maximum of the prediction precision values and substituting it into the Cox regression model with the penalty term to obtain the optimized individual risk ratio predicted value; here l' is the current number of cycles of calculating the prediction precision on the test sample, l is the preset number of cycles, and l' ≤ l.
According to the method, the Cox regression objective function is optimized within the original XGBoost method, and Cox regression with a penalty term is used as the new learning target. A loss function is customized to the survival data, and its first-order and second-order gradients are derived. The Breslow approximation of the Cox partial likelihood with an L1 penalty term is used to derive a simplified mathematical expression of the gradient. Based on this expression, the individual risk ratio predicted value is optimized by a decision tree algorithm, which enables accurate prediction of patient survival from gene expression data, provides interpretability and adaptability to high-dimensional data, and effectively predicts the survival state of the patient.
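The selection over l evaluation cycles can be sketched in Python. The text does not name the prediction precision metric, so the sketch assumes Harrell's concordance index, a common choice for survival models; `concordance_index` and `best_round` are illustrative names, and the sketch assumes at least one comparable pair exists.

```python
def concordance_index(scores, times, events):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when i has an observed event and t_i < t_j;
    it is concordant when the higher risk score belongs to i.
    """
    concordant, comparable = 0.0, 0
    n = len(scores)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable

def best_round(score_history, times, events):
    """Evaluate each recorded cycle and return (best index, best C-index)."""
    values = [concordance_index(s, times, events) for s in score_history]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values[best]
```

Here `score_history` would hold the test-sample predictions from each of the l cycles, and the cycle with the maximum value is the one carried forward.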
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the present embodiment provides a survival analysis method based on XGBoost algorithm, including:
101. inputting survival analysis data; survival analysis data, comprising: patient individual samples, gene characteristics, individual risk ratio predictive values and XGBoost algorithm parameters; patient individual samples include: training samples and test samples; the training sample and the test sample both comprise the survival time and the survival state of the patient;
specifically, survival analysis data includes:
patient individual samples are represented by $\{(x_i, t_i, \delta_i) \mid i = 1, \dots, n\}$, $x_i \in \mathbb{R}^m$;
wherein $x_i$ is the patient's gene expression data, $t_i$ is the patient's survival time, and $\delta_i$ is the patient's survival state; in survival analysis, the survival function $S(t, X) = \Pr(T > t, X)$ is also known as the survival probability. Genomic features and the corresponding clinical information for each cancer type are downloaded from the public TCGA database. The Cancer Genome Atlas (TCGA) project collects genomic features and clinical information for 33 different cancer types from over 10,000 cancer patients. The XGBoost algorithm parameters include general parameters, booster parameters, and learning task parameters. When the method is applied, the main settings to preset are the parameter ranges of the XGBoost algorithm, the number of cycles and the step size for calculating the prediction precision, and the parameters to be tuned, in particular the booster parameters.
102. Initializing an individual risk ratio predictive value;
103. establishing a Cox model based on the survival time and the survival state, defining a loss function of the Cox model according to the survival time and the survival state, and adding a penalty term to the Cox model through the loss function;
specifically, a risk function is established to obtain the instantaneous mortality of the observed subject at time t, the risk function being expressed as:
wherein X represents the gene features, $h(t, X)$ represents the risk function, $\Delta t$ represents a time interval, and $P(T > t)$ represents the survival probability;
assuming that the ratio of individual risk to population baseline risk is a constant scalar factor, the function of the Cox model is expressed as:
wherein $\beta_1, \beta_2, \dots, \beta_m$ are the partial regression coefficients of the independent variables, and $h_0(t)$ is the baseline risk rate, i.e. $h(t, X)$ when X is 0; $y = f(x) = \beta^T x$ represents the log hazard ratio, where $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates and m is the number of gene features;
the Cox model estimates the coefficient vector $\beta$ for survival data without ties by maximizing the partial likelihood function, expressed as:
wherein $\beta^T X_i$ represents the predicted log risk ratio of individual i; $\delta_i = 1$ indicates that the event of interest occurred and $\delta_i = 0$ indicates censoring; $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates; and $q(t)$ represents the individual dying at time t;
when the survival data contain ties, the partial likelihood function given by the Breslow approximation is:
the partial likelihood function obtained from equation (4) is the Breslow approximation $L_B$, in which the set of individuals at risk at time t is used to express the sum of their estimated risk ratios; taking the negative logarithm of both sides of $L_B$ gives:
adding an L1 penalty term and building a loss function, expressed as:
$L_p = l_B - P(\beta)$ (6)
wherein m is the number of gene features, $P(\beta)$ is the L1 penalty, $\lambda$ is the penalty term parameter, and $\beta = (\beta_1, \beta_2, \dots, \beta_m)^T$ is the vector of coefficients of the gene expression data in the Cox regression.
104. Calculating a first derivative and a second derivative of the loss function on the individual risk ratio predicted value;
specifically, the gradient $d(y_i)$ of the log-likelihood is expressed as:
$g_i$ is calculated from $P'(\beta)$, expressed as:
wherein,
for individual sample i, the gradient with respect to the predicted relative risk ratio $\hat{y}_i$ must be derived; if the patient dies, the indicator variable is $\delta_i = 1$ and $t \le T_i$; if the patient is not observed to die, $\delta_i = 0$ and $t < T_i$; then the first derivative of $L_p$ with respect to $\hat{y}_i$ is expressed as:
differentiating formula (10), the second derivative of $L_p$ with respect to $\hat{y}_i$ is expressed as:
105. establishing a decision tree according to the first derivative and the second derivative, and training the decision tree through training samples to optimize the prediction precision of the individual risk ratio and XGBoost algorithm parameters;
specifically, the decision tree is expressed as:
$F = \{ f(x) = \omega_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$ (12)
wherein F represents the set of trees defined by the leaf node weights $\omega$; q represents the structure of each decision tree, which maps individual samples to the corresponding leaf nodes; T represents the number of leaf nodes of the corresponding decision tree; $\omega_{q(x)}$ combines the structure q of the decision tree with the leaf node weights $\omega$; the predicted value of the XGBoost algorithm is the sum of the leaf node values of all decision trees; and f(x) represents a classification and regression tree of the XGBoost algorithm framework.
The objective function with the penalty term is minimized, a new function $f_t(x)$ is added in each iteration, and a second-order approximation is used to optimize the individual risk ratio predicted value and the XGBoost algorithm parameters, expressed as:
in formula (13), t represents the iteration number, $g_i$ and $h_i$ are the first and second derivatives, $\Omega(f_t)$ represents the constraint on $f_t(x)$, and $\hat{y}_i$ and $y_i$ respectively represent the predicted value and the true value of individual sample i; $\epsilon$ in formula (14) is the step size; $L^{(t)}$ represents the objective function.
106. Calculating a plurality of prediction precision values for the test sample by using the trained decision tree, and selecting an optimal value in the prediction precision values to be substituted into a Cox regression model with penalty items to obtain an optimized individual risk ratio prediction value; and predicting the survival time of the individual sample according to the optimized individual risk ratio predicted value.
Specifically, the parameters of the trained decision tree are retained, and the trained decision tree is used to calculate the prediction precision on the test sample; if l' < l, the first and second derivatives of the loss function with respect to the individual risk ratio predicted value are recalculated; if l' = l, the maximum of the prediction precision values is selected and substituted into the Cox regression model with the penalty term. Here l' is the current number of cycles of calculating the prediction precision on the test sample, l is the preset number of cycles, and l' ≤ l. In this way, patient survival is predicted accurately from gene expression data, the model remains interpretable and adaptable to high-dimensional data, and the survival period of the patient is effectively predicted.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (5)

1. The survival analysis method based on the XGBoost algorithm is characterized by comprising the following steps of:
inputting survival analysis data; the survival analysis data includes: patient individual samples, gene characteristics, individual risk ratio predictive values and XGBoost algorithm parameters; the patient individual sample comprising: training samples and test samples; the training sample and the test sample both comprise the survival time and the survival state of the patient;
the patient individual samples are represented by $\{(x_i, t_i, \delta_i) \mid i = 1, \dots, n\}$, $x_i \in \mathbb{R}^m$;
wherein $x_i$ is the patient's gene expression data, $t_i$ is the patient's survival time, and $\delta_i$ is the patient's survival state; $\delta_i = 1$ indicates that the event of interest occurred and $\delta_i = 0$ indicates censoring;
initializing the individual risk ratio predictor;
establishing a Cox model based on survival time and survival state, defining a loss function of the Cox model according to the survival time and survival state, and adding a penalty term to the Cox model through the loss function;
the establishing the Cox model based on the survival time and the survival state comprises the following steps:
establishing a risk function to obtain the instantaneous mortality of the observed object at the time t, wherein the risk function is expressed as:
wherein X represents the gene features, $h(t, X)$ represents the risk function, $\Delta t$ represents a time interval, and $P(T > t)$ represents the survival probability;
assuming that the ratio of individual risk to population baseline risk is a constant scalar factor, the function of the Cox model is expressed as:
wherein $\beta_1, \beta_2, \dots, \beta_m$ are the partial regression coefficients of the independent variables, and $h_0(t)$ is the baseline risk rate, i.e. $h(t, X)$ when X is 0; $y = f(x) = \beta^T X$ represents the log hazard ratio, where $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates and m is the number of gene features;
the Cox model estimates the coefficient vector $\beta$ for survival data without ties by maximizing the partial likelihood function, expressed as:
wherein $\beta^T X_i$ represents the predicted log risk ratio of individual i, $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates, and $q(t)$ represents the individual dying at time t;
when the survival data contain ties, the partial likelihood function given by the Breslow approximation is:
defining the loss function of the Cox model according to the survival time and the survival state, and adding a penalty term to the Cox model through the loss function, comprises:
the partial likelihood function obtained from equation (4) is the Breslow approximation $L_B$, in which the set of individuals at risk at time t is used to express the sum of their estimated risk ratios; taking the negative logarithm of both sides of $L_B$ gives:
adding an L1 penalty term and building the loss function, expressed as:
$L_p = l_B - P(\beta)$ (6)
wherein $P(\beta)$ is the L1 penalty, m is the number of gene features, $\lambda$ is the penalty term parameter, and $\beta = (\beta_1, \beta_2, \dots, \beta_m)^T$ is the vector of coefficients of the gene expression data in the Cox regression;
calculating a first derivative and a second derivative of the loss function to the individual risk ratio predictor;
establishing a decision tree according to the first derivative and the second derivative, and training the decision tree with the training sample to optimize the prediction precision of the individual risk ratio and the XGBoost algorithm parameters, comprising:
minimizing the objective function with the penalty term, adding a new function $f_t(x)$ in each iteration, and using a second-order approximation to optimize the prediction precision of the individual risk ratio and the XGBoost algorithm parameters, expressed as:
in formula (13), t represents the iteration number, $g_i$ and $h_i$ are the first and second derivatives, $\Omega(f_t)$ represents the constraint on $f_t(x)$, and $\hat{y}_i$ and $y_i$ respectively represent the predicted value and the true value of individual sample i; $\epsilon$ in formula (14) is the step size; $L^{(t)}$ represents the objective function; l is the preset number of cycles for calculating the prediction precision;
calculating the test sample by using the trained decision tree to obtain a plurality of prediction precision values, and selecting the optimal value in the prediction precision values to be substituted into a Cox regression model with penalty items to obtain an optimized individual risk ratio prediction value; and predicting the survival time of the individual sample according to the optimized individual risk ratio predicted value.
2. The XGBoost algorithm-based survival analysis method according to claim 1, wherein the survival analysis data comprises:
the patient individual samples are selected from the TCGA database; the XGBoost algorithm parameters include: the parameter ranges of the XGBoost algorithm, the number of loops for calculating prediction accuracy, the booster parameters and the step size.
3. The XGBoost algorithm-based survival analysis method of claim 1, wherein the calculating of the first and second derivatives of the loss function to the individual risk ratio predictor comprises:
deriving the first and second derivatives of the loss function $L_p$; the derivative $d(y_i)$ of the log-likelihood function $l_B$ is expressed as:
$g_i$ is calculated from $P'(\beta)$, expressed as:
wherein,
for individual sample i, the gradient with respect to the predicted relative risk ratio $\hat{y}_i$ must be derived; if the patient dies, the indicator variable is $\delta_i = 1$ and $t \le T_i$; if the patient is not observed to die, $\delta_i = 0$ and $t < T_i$; then the first derivative of $L_p$ with respect to $\hat{y}_i$ is expressed as:
differentiating formula (10), the second derivative of $L_p$ with respect to $\hat{y}_i$ is expressed as:
4. The XGBoost algorithm-based survival analysis method according to claim 3, wherein building a decision tree from the first and second derivatives comprises:
the decision tree is expressed as:
$F = \{ f(x) = \omega_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$ (12)
wherein F represents the set of trees defined by the leaf node weights $\omega$; q represents the structure of each decision tree, which maps individual samples to the corresponding leaf nodes; T represents the number of leaf nodes of the corresponding decision tree; $\omega_{q(x)}$ combines the structure q of the decision tree with the leaf node weights $\omega$; the predicted value of the XGBoost algorithm is the sum of the leaf node values of all decision trees; and f(x) represents a classification and regression tree of the XGBoost algorithm framework.
5. The XGBoost algorithm-based survival analysis method according to claim 1, wherein the calculating the test sample by using the trained decision tree to obtain a plurality of prediction precision values, selecting an optimal value of the prediction precision values to be substituted into a Cox regression model with a penalty term to obtain an optimized individual risk ratio prediction value, includes:
retaining the parameters of the trained decision tree, and using the trained decision tree to calculate the prediction precision of the test sample; if $l' < l$, recalculating the first and second derivatives of the loss function with respect to the individual risk ratio prediction value; if $l' = l$, selecting the maximum of the prediction precision values and substituting it into the Cox regression model with a penalty term to obtain the optimized individual risk ratio prediction value; wherein $l'$ is the current number of cycles of calculating the prediction precision of the test sample, $l$ is the preset number of cycles of calculating the prediction precision, and $l' \le l$.
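The loop in claim 5 can be sketched as follows. The sketch assumes that the prediction precision is a concordance index (the claim text here does not name the metric) and uses a hypothetical callable `boost_round()` that stands for one cycle: recompute the derivatives, fit a tree, and return the current risk predictions for the test sample. Substituting the selected value into the penalized Cox regression model is not modelled.

```python
def concordance_index(risk, time, event):
    """Assumed precision metric: share of comparable pairs ranked correctly."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is comparable only if the earlier sample had an observed death
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

def select_best_precision(boost_round, time, event, n_cycles):
    """Run n_cycles cycles (l' = 1 .. l) and keep the maximum precision value."""
    scores = []
    for _ in range(n_cycles):
        risk = boost_round()  # hypothetical: one more cycle, returns test predictions
        scores.append(concordance_index(risk, time, event))
    best = max(scores)
    return best, scores.index(best) + 1, scores
```

The function returns the maximum precision, the cycle in which it was reached, and the full list of per-cycle values so that the selection can be inspected.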
CN202110560207.4A 2021-05-21 2021-05-21 Survival analysis method based on XGBoost algorithm Active CN113284612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110560207.4A CN113284612B (en) 2021-05-21 2021-05-21 Survival analysis method based on XGBoost algorithm

Publications (2)

Publication Number Publication Date
CN113284612A CN113284612A (en) 2021-08-20
CN113284612B CN113284612B (en) 2024-04-16

Family

ID=77281044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110560207.4A Active CN113284612B (en) 2021-05-21 2021-05-21 Survival analysis method based on XGBoost algorithm

Country Status (1)

Country Link
CN (1) CN113284612B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006104474A2 (en) * 2005-03-14 2006-10-05 Immunivest Corporation A method for predicting progression free and overall survival at each follow-up time point during therapy of metastatic breast cancer patients using circulating tumor cells
CN108922628A (en) * 2018-04-23 2018-11-30 华北电力大学 A kind of Prognosis in Breast Cancer survival rate prediction technique based on dynamic Cox model
CN111243736A (en) * 2019-10-24 2020-06-05 中国人民解放军海军军医大学第三附属医院 Survival risk assessment method and system
KR20210001959A (en) * 2019-06-27 2021-01-06 서울대학교산학협력단 Etiome model for gastric cancer development based on multi-layer ad multi-factor panel and computational biological network modeling
CN112635063A (en) * 2020-12-30 2021-04-09 华南理工大学 Lung cancer prognosis comprehensive prediction model, construction method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
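Title
"Computing the Hazard Ratios Associated With Explanatory Variables Using Machine Learning Models of Survival Data"; Sameer Sundrani et al.; JCO Clinical Cancer Informatics; Vol. 5; pp. 364-378 *
"Research and Application of Survival Analysis Optimization Methods Based on Gradient Boosting Trees"; Liu Pei; China Master's Theses Full-text Database, Basic Sciences (monthly) (No. 07); A002-384 *
"A Method for Predicting the Remaining Useful Life of Bearings Based on Gradient Boosting Decision Trees"; Zhang Chenglong; Liu Jie; Information & Computer (Theory Edition) (No. 10); pp. 34-35 *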

Also Published As

Publication number Publication date
CN113284612A (en) 2021-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant