CN113284612B

CN113284612B - Survival analysis method based on XGBoost algorithm

Info

Publication number: CN113284612B
Application number: CN202110560207.4A
Authority: CN
Inventors: 马宝山; 侯晓宇; 苍佳慧; 赵浩然; 钱政宇; 陈一盈; 闫格
Original assignee: Dalian Maritime University
Current assignee: Dalian Maritime University
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2024-04-16
Anticipated expiration: 2041-05-21
Also published as: CN113284612A

Abstract

The invention discloses a survival analysis method based on the XGBoost algorithm, which modifies the objective function of the original XGBoost method and uses Cox regression with a penalty term as the new learning objective. A loss function is customized for survival data, and its first-order and second-order gradients are derived. The Breslow approximation of the Cox partial likelihood, combined with an L1 penalty term, is used to derive a simplified mathematical expression for the gradient. Based on this expression, the predicted individual risk ratio is optimized by a decision tree algorithm. This enables accurate prediction of patient survival from gene expression data, provides interpretability and adaptability to high-dimensional data, and effectively predicts the survival state of the patient.

Description

Survival analysis method based on XGBoost algorithm

Technical Field

The invention relates to the technical field of biomedicine, in particular to a survival analysis method based on an XGBoost algorithm.

Background

Survival analysis is a method for analyzing and inferring the survival time of organisms or people from data obtained through experiments or surveys, and for studying the relationship between survival time, outcome, and multiple influencing factors, as well as the magnitude of their effects; it is also called survival rate analysis. By studying the distribution of subjects' survival times, the influence of experimental conditions on survival time can be understood. In biomedical research, survival analysis is a very important and common analytical method. It can be used to understand disease prognosis, evaluate the effectiveness of treatments, or observe the effect of health-care measures. In addition, plotting survival curves from the results of a survival analysis, and quantifying and testing survival differences between two or more groups of patients, helps clinicians and clinical researchers in treating patients.

In medical research, the Cox Proportional Hazards (CPH) model, commonly referred to as the Cox model, is often used for survival analysis. It predicts risk scores by regressing the features, or covariates, of a set of patient data on the time to event, while making effective use of censored data. As a linear model, however, the Cox model cannot fully represent the complex nonlinear relationship between the logarithmic risk ratio and the static covariates. Consequently, XGBoost survival analysis models may also be affected when processing gene expression data.

Disclosure of Invention

The invention provides a survival analysis method based on the XGBoost algorithm, which aims to overcome the above technical problems.

The invention discloses a survival analysis method based on XGBoost algorithm, which comprises the following steps:

inputting survival analysis data; the survival analysis data includes: patient individual samples, gene characteristics, individual risk ratio predictive values and XGBoost algorithm parameters; the patient individual sample comprising: training samples and test samples; the training sample and the test sample both comprise the survival time and the survival state of the patient;

initializing the individual risk ratio predictor;

establishing a Cox model based on survival time and survival state, defining a loss function of the Cox model according to the survival time and survival state, and adding a penalty term to the Cox model through the loss function;

calculating a first derivative and a second derivative of the loss function to the individual risk ratio predictor;

establishing a decision tree according to the first derivative and the second derivative, and training the decision tree through the training sample to optimize the prediction precision of the individual risk ratio and XGBoost algorithm parameters;

calculating the test sample by using the trained decision tree to obtain a plurality of prediction accuracy values, and selecting the optimal value among them to be substituted into the Cox regression model with the penalty term to obtain an optimized individual risk ratio predicted value; and predicting the survival time of the individual sample according to the optimized individual risk ratio predicted value.

Further, the survival analysis data includes:

the patient individual samples are represented by $\{(x_i, t_i, \delta_i) \mid i = 1, \dots, n\}$, $x_i \in \mathbb{R}^m$;

where $x_i$ is the patient's gene expression data, $t_i$ is the patient's survival time, and $\delta_i$ is the patient's survival state;

the patient individual samples are selected from the TCGA database; the XGBoost algorithm parameters include: the ranges of the XGBoost algorithm parameters, the number of cycles for calculating the prediction accuracy, the booster parameters, and the step size.

Further, establishing the Cox model based on the survival time and the survival state comprises: establishing a risk function to obtain the instantaneous mortality of the observed subject at time $t$, the risk function being expressed as:

where $X$ represents the gene features, $h(t, X)$ represents the risk function, $\Delta t$ represents a time interval, and $P(T > t)$ represents the survival probability;

assuming that the ratio of individual risk to population baseline risk is a constant scalar factor, the function of the Cox model is expressed as:

where $\beta_1, \beta_2, \dots, \beta_m$ are the partial regression coefficients of the independent variables, and $h_0(t)$ is the baseline risk rate, i.e. $h(t, X)$ when $X$ is 0; $y = f(x) = \beta^T X$ represents the logarithmic hazard ratio, where $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates and $m$ is the number of gene features;
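As a reading aid, the risk function and the Cox model described above are commonly written in the following standard forms. This is a sketch consistent with the symbols defined here ($h$, $h_0$, $\beta$, $X$, $\Delta t$), not the patent's own typeset formulas; the conditioning on $T \ge t$ and the numbering (1) and (2) are assumptions inferred from the surrounding text.

```latex
% (1) risk (hazard) function: instantaneous mortality at time t
h(t, X) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t, X)}{\Delta t} \tag{1}

% (2) Cox model: individual risk is the baseline risk times a constant factor
h(t, X) = h_0(t)\,\exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m) = h_0(t)\,\exp(\beta^T X) \tag{2}
```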

the Cox model estimates the coefficient vector $\beta$ for survival data without ties by maximizing the partial likelihood function, expressed as:

where $\beta^T X_i$ represents the logarithmic risk ratio prediction of individual $i$; $\delta_i = 1$ indicates that the event of interest occurred and $\delta_i = 0$ indicates censoring; $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates; and $q(t)$ represents the individual dying at time $t$;

when the survival data contain ties, the partial likelihood function given by the Breslow approximation is:
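As a reading aid, the two partial likelihoods described above are commonly written as follows. This is a sketch in standard notation, not the patent's typeset formulas: the risk set $R_t$ (individuals still under observation at time $t$), the set $D_t$ of individuals dying at time $t$, and its size $d_t$ are symbols assumed here for illustration, and the numbering (3) and (4) is inferred from the later reference to equation (4).

```latex
% (3) no tied event times: a single death q(t) at each event time t
L(\beta) = \prod_{t} \frac{\exp\left(\beta^T X_{q(t)}\right)}{\sum_{j \in R_t} \exp\left(\beta^T X_j\right)} \tag{3}

% (4) Breslow approximation for tied event times
L_B(\beta) = \prod_{t} \frac{\exp\left(\sum_{i \in D_t} \beta^T X_i\right)}{\left[\sum_{j \in R_t} \exp\left(\beta^T X_j\right)\right]^{d_t}} \tag{4}
```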

Further, defining the loss function of the Cox model according to the survival time and the survival state and adding a penalty term to the Cox model through the loss function includes:

the partial likelihood function obtained according to equation (4) is the Breslow approximation $L_B$; the individuals at risk at time $t$ form a set, over which the estimated risk ratios are summed; taking the negative logarithm of both sides of $L_B$ gives:

adding an L1 penalty term and building the loss function, expressed as:

$L_p = l_B - P(\beta)$ (6)

where $m$ is the number of gene features, $P(\beta) = \lambda \sum_{j=1}^{m} |\beta_j|$ is the L1 penalty, $\lambda$ is the penalty term parameter, and $\beta = (\beta_1, \beta_2, \dots, \beta_m)^T$ is the vector of coefficients of the gene expression data in the Cox regression.
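The two terms of equation (6) can be evaluated numerically as in the minimal Python sketch below. It assumes that $l_B$ is the negative logarithm of the Breslow partial likelihood computed from per-patient risk scores $\beta^T x_i$; the function names are hypothetical. The sketch returns the two terms separately and does not combine them, because the sign convention of $l_B$ in equation (6) cannot be confirmed from the text.

```python
import math

def breslow_neg_log_likelihood(risk_scores, times, events):
    """Negative log of the Breslow partial likelihood.

    risk_scores: predicted log risk ratios (beta^T x_i), one per patient
    times:       observed survival times t_i
    events:      delta_i, 1 = death observed, 0 = censored
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored patients contribute only through risk sets
        # Sum of estimated risk ratios over everyone still at risk at t_i.
        # Tied deaths share the same risk set, which is the Breslow choice.
        risk_sum = sum(math.exp(s) for s, t in zip(risk_scores, times) if t >= t_i)
        total -= risk_scores[i] - math.log(risk_sum)
    return total

def l1_penalty(beta, lam):
    """L1 penalty: lam * sum_j |beta_j|."""
    return lam * sum(abs(b) for b in beta)

# Toy data: three patients, the last one censored.
l_b = breslow_neg_log_likelihood([0.0, 0.0, 0.0], [1, 2, 3], [1, 1, 0])
p_beta = l1_penalty([1.0, -2.0, 0.0], lam=0.5)
```

With all risk scores equal to zero, the first death sees three patients at risk and the second sees two, so `l_b` equals `log(3) + log(2)`.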

Further, calculating the first and second derivatives of the loss function with respect to the individual risk ratio predicted value includes:

deriving the first and second derivatives of the loss function $L_p$; the derivative $d(y_i)$ of the log-likelihood function $l_B$ is expressed as:

$g_i$ is calculated from $P'(\beta)$ and is expressed as:

wherein,

for individual sample $i$, the gradient with respect to the predicted relative risk ratio $\hat{y}_i$ needs to be derived; if the patient dies, the indicator variable is $\delta_i = 1$ and $t \le T_i$; if the patient's death is not observed, $\delta_i = 0$ and $t < T_i$; the first derivative of $L_p$ with respect to $\hat{y}_i$ is then expressed as:

differentiating formula (10) again, the second derivative of $L_p$ with respect to $\hat{y}_i$ is expressed as:
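One common way to obtain such derivatives is sketched below in Python. It assumes the unpenalized negative log Breslow partial likelihood $-\sum_{k:\delta_k=1}\big[\hat{y}_k - \log \sum_{j: t_j \ge t_k} e^{\hat{y}_j}\big]$ as the loss and differentiates it with respect to each prediction $\hat{y}_i$. The function name `cox_grad_hess` is hypothetical, the L1 penalty is left out, and only the diagonal of the Hessian is kept, as gradient boosting requires. It is not claimed to reproduce the patent's formulas (10) and (11) term by term.

```python
import math

def cox_grad_hess(y_hat, times, events):
    """Per-sample first and second derivatives of the negative log
    Breslow partial likelihood with respect to y_hat[i].

    Written for clarity rather than speed.
    """
    n = len(y_hat)
    exp_y = [math.exp(v) for v in y_hat]
    grad, hess = [0.0] * n, [0.0] * n
    for i in range(n):
        r = 0.0  # sum of 1 / S_k over deaths k with t_k <= t_i
        s = 0.0  # sum of 1 / S_k^2 over the same deaths
        for k in range(n):
            if events[k] == 1 and times[k] <= times[i]:
                # S_k: summed risk ratios of everyone at risk at t_k
                risk_sum = sum(exp_y[j] for j in range(n) if times[j] >= times[k])
                r += 1.0 / risk_sum
                s += 1.0 / risk_sum ** 2
        grad[i] = exp_y[i] * r - events[i]
        hess[i] = exp_y[i] * r - exp_y[i] ** 2 * s
    return grad, hess

g, h = cox_grad_hess([0.0, 0.0, 0.0], [1, 2, 3], [1, 1, 0])
```

A quick sanity check on such an implementation is that the gradients sum to zero, because adding a constant to every prediction leaves the partial likelihood unchanged.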

Further, building the decision tree from the first and second derivatives includes:

the decision tree is expressed as:

$F = \{ f(x) = \omega_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$ (12)

where $F$ represents the set of trees, each given by its leaf node weights $\omega$; $q$ represents the structure of each decision tree and maps an individual sample to its corresponding leaf node; $T$ represents the number of leaf nodes of the corresponding decision tree; $\omega_{q(x)}$ is determined by the structure $q$ of the decision tree and the leaf node weights $\omega$; the predicted value of the XGBoost algorithm is the sum of the leaf node values of all decision trees; and $f(x)$ represents a classification and regression tree in the XGBoost framework.
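The notation of formula (12) can be made concrete with a small Python sketch: a tree is a structure function `q` that maps a sample to a leaf index, plus a weight vector `omega` with one entry per leaf. The split thresholds, leaf weights, and helper names below are purely illustrative assumptions.

```python
def make_tree(q, omega):
    """Return f(x) = omega[q(x)] for a structure q and leaf weights omega."""
    def f(x):
        return omega[q(x)]
    return f

# Hypothetical structure with T = 3 leaves, splitting on gene features 0 and 1.
def q_example(x):
    if x[0] < 0.5:
        return 0
    return 1 if x[1] < 2.0 else 2

trees = [
    make_tree(q_example, [-0.3, 0.1, 0.8]),
    make_tree(lambda x: 0, [0.05]),  # a single-leaf tree
]

def predict(x):
    # The ensemble prediction is the sum of the leaf values of every tree.
    return sum(f(x) for f in trees)
```

For example, a sample with `x[0] = 0.2` lands in leaf 0 of the first tree and in the only leaf of the second, so its prediction is the sum of those two leaf weights.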

Further, training the decision tree on the training samples to optimize the prediction accuracy of the individual risk ratio and the XGBoost algorithm parameters includes:

minimizing the objective function with the penalty term by adding a new function $f_t(x)$ in each iteration, and optimizing the prediction accuracy of the individual risk ratio and the XGBoost algorithm parameters using a second-order approximation, expressed as:

In formula (13), $t$ represents the iteration number; $g_i$ and $h_i$ are the first and second derivatives; $\Omega(f_t)$ represents the regularization constraint on $f_t(x)$; $\hat{y}_i$ and $y_i$ represent the predicted value and the true value of individual sample $i$, respectively; $\epsilon$ in formula (14) is the step size; and $L^{(t)}$ represents the objective function.
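A minimal Python sketch of one such iteration follows, under stated assumptions: the tree structure is already fixed, $\Omega(f_t)$ is taken to be the usual XGBoost-style penalty on leaf weights (L2 strength `reg_lambda`, L1 strength `reg_alpha`), and the prediction is updated additively with step size `epsilon`. The function names are hypothetical, and the patent's exact $\Omega$ is not given in the text.

```python
def leaf_weight(g_sum, h_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf value minimising G*w + 0.5*(H + lambda)*w^2 + alpha*|w|."""
    # Soft-threshold the summed gradient by the L1 strength.
    if g_sum > reg_alpha:
        shrunk = g_sum - reg_alpha
    elif g_sum < -reg_alpha:
        shrunk = g_sum + reg_alpha
    else:
        shrunk = 0.0
    return -shrunk / (h_sum + reg_lambda)

def boosting_step(y_hat, grad, hess, leaf_of, epsilon=0.1,
                  reg_lambda=1.0, reg_alpha=0.0):
    """One additive update y_hat <- y_hat + epsilon * f_t(x).

    leaf_of[i] is the index of the leaf that sample i falls into.
    """
    sums = {}
    for g, h, leaf in zip(grad, hess, leaf_of):
        g_acc, h_acc = sums.get(leaf, (0.0, 0.0))
        sums[leaf] = (g_acc + g, h_acc + h)
    weights = {leaf: leaf_weight(g_acc, h_acc, reg_lambda, reg_alpha)
               for leaf, (g_acc, h_acc) in sums.items()}
    return [y + epsilon * weights[leaf] for y, leaf in zip(y_hat, leaf_of)]

updated = boosting_step([0.0, 0.0, 0.0], [1.0, 1.0, -2.0], [1.0, 1.0, 1.0],
                        leaf_of=[0, 0, 1], epsilon=0.5)
```

The leaf weight comes from minimising the second-order approximation within each leaf, which is why only the summed first and second derivatives per leaf are needed.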

Further, calculating the test sample by using the trained decision tree to obtain a plurality of prediction accuracy values, and selecting the optimal value among them to be substituted into the Cox regression model with the penalty term to obtain an optimized individual risk ratio predicted value, includes: retaining the parameters of the trained decision tree, and using the trained decision tree to calculate the prediction accuracy on the test sample; if l' is less than l, recalculating the first derivative and the second derivative of the loss function with respect to the individual risk ratio predicted value; if l' = l, selecting the maximum of the prediction accuracy values and substituting it into the Cox regression model with the penalty term to obtain the optimized individual risk ratio predicted value; where l' is the current number of cycles for calculating the prediction accuracy on the test sample, l is the preset number of cycles for calculating the prediction accuracy, and l' is less than or equal to l.
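The text does not name the metric behind "prediction accuracy". A common choice for survival models is Harrell's concordance index, sketched below in Python as an assumption rather than as the metric used by the patent; the function name is hypothetical.

```python
def concordance_index(risk_scores, times, events):
    """Harrell's C-index: fraction of comparable patient pairs in which
    the patient who died earlier has the higher predicted risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a pair is comparable only if the earlier time is a death
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count half
    return concordant / comparable if comparable else float("nan")

times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
c_good = concordance_index([4.0, 3.0, 2.0, 1.0], times, events)
c_bad = concordance_index([1.0, 2.0, 3.0, 4.0], times, events)
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 to a fully reversed ranking.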

According to the method, the objective function of the original XGBoost method is modified, and Cox regression with a penalty term is used as the new learning objective. A loss function is customized for survival data, and its first-order and second-order gradients are derived. The Breslow approximation of the Cox partial likelihood, combined with an L1 penalty term, is used to derive a simplified mathematical expression for the gradient. Based on this expression, the predicted individual risk ratio is optimized by a decision tree algorithm. This enables accurate prediction of patient survival from gene expression data, provides interpretability and adaptability to high-dimensional data, and effectively predicts the survival state of the patient.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort to a person skilled in the art.

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in fig. 1, the present embodiment provides a survival analysis method based on XGBoost algorithm, including:

101. Inputting survival analysis data; the survival analysis data comprise: patient individual samples, gene features, individual risk ratio predicted values, and XGBoost algorithm parameters; the patient individual samples include training samples and test samples; the training samples and the test samples both comprise the survival time and the survival state of the patient;

Specifically, the survival analysis data include:

the patient individual samples, represented by $\{(x_i, t_i, \delta_i) \mid i = 1, \dots, n\}$, $x_i \in \mathbb{R}^m$;

where $x_i$ is the patient's gene expression data, $t_i$ is the patient's survival time, and $\delta_i$ is the patient's survival state. In survival analysis, the survival function $S(t, X) = P_r(T > t, X)$ is also known as the survival probability. Genomic features and the corresponding clinical information for each cancer type are downloaded from the public TCGA database. The Cancer Genome Atlas (TCGA) project collects genomic features and clinical information for 33 different cancer types from over 10,000 cancer patients. The XGBoost algorithm parameters include general parameters, booster parameters, and learning task parameters. When applying the method, the ranges of the XGBoost algorithm parameters, the number of cycles and the step size for calculating the prediction accuracy, and the parameters to be tuned, in particular the booster parameters, mainly need to be preset.
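The presets described above could be organised as in the following Python sketch. The parameter names (`booster`, `objective`, `eval_metric`, `eta`, `max_depth`, `alpha`) are real XGBoost parameters, but the specific values, the search ranges, and `n_cycles` are illustrative assumptions. The built-in `survival:cox` objective appears only as a placeholder for the learning task group, since the method described here defines its own penalized loss.

```python
from itertools import product

# General and learning task parameters (illustrative values).
base_params = {
    "booster": "gbtree",
    "objective": "survival:cox",   # placeholder for the custom Cox loss
    "eval_metric": "cox-nloglik",
}

# Booster parameters to be tuned, each with a preset range.
search_space = {
    "eta": [0.01, 0.05, 0.1],      # step size
    "max_depth": [2, 4, 6],
    "alpha": [0.0, 0.1, 1.0],      # L1 regularization strength
}

n_cycles = 10  # preset number of cycles for computing prediction accuracy

def candidate_params(base, space):
    """Yield one full parameter dict per point of the search grid."""
    keys = sorted(space)
    for values in product(*(space[k] for k in keys)):
        yield {**base, **dict(zip(keys, values))}

candidates = list(candidate_params(base_params, search_space))
```

Each candidate dictionary can then be passed to a training routine in turn, so that the parameter ranges are explored systematically.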

102. Initializing an individual risk ratio predictive value;

103. Establishing a Cox model based on the survival time and the survival state, defining a loss function of the Cox model according to the survival time and the survival state, and adding a penalty term to the Cox model through the loss function;

Specifically, a risk function is established to obtain the instantaneous mortality of the observed subject at time $t$, the risk function being expressed as:

where $\beta_1, \beta_2, \dots, \beta_m$ are the partial regression coefficients of the independent variables, and $h_0(t)$ is the baseline risk rate, i.e. $h(t, X)$ when $X$ is 0; $y = f(x) = \beta^T X$ represents the logarithmic hazard ratio, where $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates and $m$ is the number of gene features;

where $\beta^T X_i$ represents the logarithmic risk ratio prediction of individual $i$; $\delta_i = 1$ indicates that the event of interest occurred and $\delta_i = 0$ indicates censoring; $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates; and $q(t)$ represents the individual dying at time $t$;

the partial likelihood function obtained according to equation (4) is the Breslow approximation $L_B$; the individuals at risk at time $t$ form a set, over which the estimated risk ratios are summed; taking the negative logarithm of both sides of $L_B$ gives:

adding an L1 penalty term and building the loss function, expressed as:

$L_p = l_B - P(\beta)$ (6)

104. Calculating a first derivative and a second derivative of the loss function on the individual risk ratio predicted value;

Specifically, the gradient $d(y_i)$ of the log-likelihood is expressed as:

$g_i$ is calculated from $P'(\beta)$ and is expressed as:

wherein,

105. Establishing a decision tree according to the first derivative and the second derivative, and training the decision tree on the training samples to optimize the prediction accuracy of the individual risk ratio and the XGBoost algorithm parameters;

Specifically, the decision tree is expressed as:

$F = \{ f(x) = \omega_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$ (12)

The objective function with the penalty term is minimized by adding a new function $f_t(x)$ in each iteration, and the individual risk ratio predicted value and the XGBoost algorithm parameters are optimized using a second-order approximation, expressed as:

In formula (13), $t$ represents the iteration number; $g_i$ and $h_i$ are the first and second derivatives; $\Omega(f_t)$ represents the regularization constraint on $f_t(x)$; $\hat{y}_i$ and $y_i$ represent the predicted value and the true value of individual sample $i$, respectively; $\epsilon$ in formula (14) is the step size; and $L^{(t)}$ represents the objective function.

106. Calculating a plurality of prediction accuracy values for the test sample by using the trained decision tree, and selecting the optimal value among them to be substituted into the Cox regression model with the penalty term to obtain an optimized individual risk ratio predicted value; and predicting the survival time of the individual sample according to the optimized individual risk ratio predicted value.

Specifically, the parameters of the trained decision tree are retained, and the prediction accuracy on the test sample is calculated using the trained decision tree; if l' is less than l, the first derivative and the second derivative of the loss function with respect to the individual risk ratio predicted value are recalculated; if l' = l, the maximum of the prediction accuracy values is selected and substituted into the Cox regression model with the penalty term. Here l' is the current number of cycles for calculating the prediction accuracy on the test sample, l is the preset number of cycles for calculating the prediction accuracy, and l' is less than or equal to l. In this way, accurate prediction of patient survival from gene expression data, together with interpretability and adaptability to high-dimensional data, is achieved, and the survival time of the patient is effectively predicted.
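The cycle logic of step 106 can be sketched in Python as follows. The callable `run_cycle` is a hypothetical stand-in for one round of gradient computation, tree training, and test-set scoring; how each round is carried out and which accuracy metric it returns are assumptions outside this sketch.

```python
def select_best_over_cycles(run_cycle, n_cycles):
    """Repeat training and evaluation n_cycles times and keep the best result.

    run_cycle(k) is expected to return (accuracy, predictions) for cycle k.
    """
    best_accuracy, best_predictions = float("-inf"), None
    current = 0                    # l' in the text
    while current < n_cycles:      # l' < l: recompute derivatives and retrain
        current += 1
        accuracy, predictions = run_cycle(current)
        if accuracy > best_accuracy:
            best_accuracy, best_predictions = accuracy, predictions
    # l' == l: return the maximum accuracy and the predictions that achieved it
    return best_accuracy, best_predictions

# Toy usage with made-up accuracy values for three cycles.
toy_scores = [0.60, 0.72, 0.68]
best_acc, best_preds = select_best_over_cycles(
    lambda k: (toy_scores[k - 1], [k]), n_cycles=3)
```

In the toy run the second cycle has the highest accuracy, so its predictions are the ones kept.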

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. The survival analysis method based on the XGBoost algorithm is characterized by comprising the following steps of:

the patient individual samples are represented by $\{(x_i, t_i, \delta_i) \mid i = 1, \dots, n\}$, $x_i \in \mathbb{R}^m$;

where $x_i$ is the patient's gene expression data, $t_i$ is the patient's survival time, and $\delta_i$ is the patient's survival state; $\delta_i = 1$ indicates that the event of interest occurred and $\delta_i = 0$ indicates censoring;

initializing the individual risk ratio predictor;

establishing the Cox model based on the survival time and the survival state comprises:

establishing a risk function to obtain the instantaneous mortality of the observed subject at time $t$, the risk function being expressed as:

where $\beta_1, \beta_2, \dots, \beta_m$ are the partial regression coefficients of the independent variables, and $h_0(t)$ is the baseline risk rate, i.e. $h(t, X)$ when $X$ is 0; $y = f(x) = \beta^T X$ represents the logarithmic hazard ratio, where $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates and $m$ is the number of gene features;

where $\beta^T X_i$ represents the logarithmic risk ratio prediction of individual $i$, $\beta \in \mathbb{R}^m$ is the coefficient vector of the covariates, and $q(t)$ represents the individual dying at time $t$;

defining the loss function of the Cox model according to the survival time and the survival state and adding a penalty term to the Cox model through the loss function comprises:

the partial likelihood function obtained according to equation (4) is the Breslow approximation $L_B$; the individuals at risk at time $t$ form a set, over which the estimated risk ratios are summed; taking the negative logarithm of both sides of $L_B$ gives:

adding an L1 penalty term and building the loss function, expressed as:

$L_p = l_B - P(\beta)$ (6)

where $P(\beta) = \lambda \sum_{j=1}^{m} |\beta_j|$ is the L1 penalty, $m$ is the number of gene features, $\lambda$ is the penalty term parameter, and $\beta = (\beta_1, \beta_2, \dots, \beta_m)^T$ is the vector of coefficients of the gene expression data in the Cox regression;

establishing a decision tree according to the first derivative and the second derivative, and training the decision tree on the training samples to optimize the prediction accuracy of the individual risk ratio and the XGBoost algorithm parameters, comprises:

minimizing the objective function with the penalty term by adding a new function $f_t(x)$ in each iteration, and optimizing the prediction accuracy of the individual risk ratio and the XGBoost algorithm parameters using a second-order approximation, expressed as:

In formula (13), $t$ represents the iteration number; $g_i$ and $h_i$ are the first and second derivatives; $\Omega(f_t)$ represents the regularization constraint on $f_t(x)$; $\hat{y}_i$ and $y_i$ represent the predicted value and the true value of individual sample $i$, respectively; $\epsilon$ in formula (14) is the step size; $L^{(t)}$ represents the objective function; and l is the preset number of cycles for calculating the prediction accuracy;

2. The XGBoost algorithm-based survival analysis method according to claim 1, wherein the survival analysis data comprises:

3. The XGBoost algorithm-based survival analysis method of claim 1, wherein the calculating of the first and second derivatives of the loss function to the individual risk ratio predictor comprises:

$g_i$ is calculated from $P'(\beta)$ and is expressed as:

wherein,

for individual sample $i$, the gradient with respect to the predicted relative risk ratio $\hat{y}_i$ needs to be derived; if the patient dies, the indicator variable is $\delta_i = 1$ and $t \le T_i$; if the patient's death is not observed, $\delta_i = 0$ and $t < T_i$; the first derivative of $L_p$ with respect to $\hat{y}_i$ is then expressed as:

4. a survival analysis method based on XGBoost algorithm according to claim 3, characterized in that the building of decision tree from the first and second derivatives comprises:

the decision tree is expressed as:

$F = \{ f(x) = \omega_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$ (12)

5. The XGBoost algorithm-based survival analysis method according to claim 1, wherein the calculating the test sample by using the trained decision tree to obtain a plurality of prediction precision values, selecting an optimal value of the prediction precision values to be substituted into a Cox regression model with a penalty term to obtain an optimized individual risk ratio prediction value, includes:

the parameters of the trained decision tree are retained, and the trained decision tree is used to calculate the prediction accuracy on the test sample; if l' is less than l, the first derivative and the second derivative of the loss function with respect to the individual risk ratio predicted value are recalculated; if l' = l, the maximum of the prediction accuracy values is selected and substituted into the Cox regression model with the penalty term to obtain the optimized individual risk ratio predicted value; where l' is the current number of cycles for calculating the prediction accuracy on the test sample, l is the preset number of cycles for calculating the prediction accuracy, and l' is less than or equal to l.