CN112309576A

CN112309576A - Colorectal cancer survival period prediction method based on deep learning CT (computed tomography) image omics

Info

Publication number: CN112309576A
Application number: CN202011005022.9A
Authority: CN
Inventors: 潘祥; 王孝磊; 胡曙东; 张衡; 吕天旭; 谢振平; 刘渊
Original assignee: Jiangnan University; Affiliated Hospital of Jiangnan University
Current assignee: Jiangnan University; Affiliated Hospital of Jiangsu University; Affiliated Hospital of Jiangnan University
Priority date: 2020-09-22
Filing date: 2020-09-22
Publication date: 2021-02-02

Abstract

The invention discloses a colorectal cancer survival prediction method based on deep learning CT (computed tomography) image omics, and belongs to the technical field of medical image processing. The method comprises the following specific steps: (1) acquiring data; (2) labeling the colorectal tumor region in the CT image omics data; (3) preprocessing the acquired data; (4) constructing a feature learning model based on a deep neural network; (5) reducing the dimensionality of the deep high-throughput features of the colorectal cancer CT image omics data by Lasso regression and establishing a risk score model of the patient; (6) grouping patients according to the risk score; (7) plotting survival curves and verifying the effectiveness of the features; (8) constructing a deep neural network multi-task logistic regression (DNN-MTLR) model to predict the survival probability. After the CT image of a patient is obtained, it is fed into the system for analysis, and the result can serve as a reference for doctors (especially less experienced radiologists) to better understand the patient's condition and decide on the next step.

Description

Colorectal cancer survival period prediction method based on deep learning CT (computed tomography) image omics

Technical Field

The invention belongs to the technical field of medical image processing and can be used for intelligent medical disease diagnosis; it relates to a colorectal cancer survival prediction method based on deep learning CT (computed tomography) image omics, which finally obtains the five-year disease-free survival (DFS) probability of colorectal cancer patients.

Background

Colorectal cancer is a common malignant tumor of the gastrointestinal tract, with high morbidity and mortality. According to the 2018 global survey by the International Agency for Research on Cancer, colorectal cancer ranks third in incidence, behind lung cancer and breast cancer, and second in mortality, behind only lung cancer. In China, incidence and mortality are also rising markedly in economically developed areas and along the southeastern coast.

Accurate prediction of a patient's survival time has important clinical and social value. For physicians (especially young, less experienced ones), an accurate survival prediction helps them better understand the patient's condition, make a diagnosis, and reach the best medical decision. For patients, an accurate prediction provides a scientifically grounded survival expectation and a better understanding of their own physical condition. Patients can thus be guided to follow the treatment plan, overtreatment is avoided, the financial burden on the family is reduced, and the doctor-patient relationship improves.

With the development of imaging and artificial intelligence technologies, imaging modalities such as computed tomography (CT), positron emission tomography (PET), and magnetic resonance (MR) play an increasingly important role in the diagnosis and prognosis of tumors. The role of medical imaging is gradually shifting from traditional uses such as disease diagnosis and screening toward individualized, precise diagnosis and treatment. The mainstream direction of future medical development is precision medicine, which must take individual variability into account in prevention and in the corresponding diagnostic and therapeutic strategies. Combining artificial intelligence with medicine is also a necessary path for the future development of medicine; artificial intelligence cannot be realized without machine learning, and deep learning is one of its techniques. Deep learning can be combined with CT image omics features to predict the survival of colorectal cancer patients.

Disclosure of Invention

In view of the above problems, the present invention provides a novel method for predicting the survival of colorectal cancer (CRC) patients based on deep learning and CT image omics, namely a colorectal cancer survival period prediction method based on deep learning CT (computed tomography) image omics.

The technical scheme of the invention is as follows: the colorectal cancer survival period prediction method based on deep learning CT imaging omics specifically comprises the following steps:

step (1.1), data acquisition: the data comprises clinical data and CT imaging omics data;

step (1.2), carrying out colorectal tumor region labeling on CT image omics data;

step (1.3), preprocessing the acquired data;

step (1.4), constructing a feature learning model based on a deep neural network to obtain deep high-throughput features of the colorectal cancer CT image omics data;

step (1.5), reducing the dimensionality of the deep high-throughput features of the colorectal cancer CT image omics data by Lasso regression, and establishing a risk score model of the patient;

step (1.6), according to the image omics risk score S of the patient, taking the median of the image omics label scores as the cutoff value T, and dividing the patients into a high-risk survival group and a low-risk survival group;

step (1.7), plotting Kaplan-Meier (KM) curves and using data analysis software to evaluate and verify the obtained deep high-throughput features;

step (1.8), constructing a deep neural network multi-task logistic regression model to predict the survival probability.

Further, in step (1.1), specifically:

(1.1.1), clinical data: including the patient's age, sex, survival status (1 or 0, where 1 represents death and 0 represents survival), and the time of interest elapsed since the CT image was taken;

(1.1.2), CT image omics data: i.e., the CT image data acquired from the patient.

Further, in step (1.2), the colorectal tumor region is labeled in the CT image omics data as follows: the CT image omics data are imported into ITK-SNAP in batches, unit by unit; the tumor is annotated manually in ITK-SNAP by selecting the region of interest where it is located; and the annotated data are saved as an .nii file.

Further, in the step (1.3), the specific operation steps of preprocessing the acquired data are as follows:

the data are screened, with the following exclusion criteria:

(1.3.1) incomplete clinical records, where the causes of incompleteness include loss to follow-up, withdrawal, and termination;

(1.3.2) cases in which survival observation was cut off by causes other than a death event;

(1.3.3) using the region-of-interest .nii file obtained in step (1.2) together with the original CT image omics data, the region-of-interest features are extracted, giving for each unit a three-dimensional feature matrix f(P, P, P) containing the region of interest, where P represents the size of the matrix.

Further, in step (1.4), the deep neural network-based feature learning model is specifically described as follows: the feature matrix containing the region of interest obtained for each unit is used as the input of the network; the size of the input is [M × P], where M represents the total number of units and P represents the dimension of each unit's feature matrix;

the input is fed into a feature selector for feature selection; the feature selector is composed of $N_0$ convolution layers, $N_0$ pooling layers, a fully connected layer, and a logistic regression output layer; the first convolution layer contains $M_1$ filters and the other convolution layers contain $M_i$ filters, where each filter has size n × n × n and n represents the filter size;

each convolution layer uses a rectified linear unit activation and is followed by a max pooling operation with pool size m × m; the loss function is the mean square error:

$$\mathrm{MSE} = \frac{1}{M}\sum_{m=1}^{M}\left(y_m - \hat{y}_m\right)^2$$

where $y_m$ represents the actual value and $\hat{y}_m$ represents the predicted value.

Further, in step (1.5), the specific operation method for effective dimensionality reduction of the colorectal cancer CT image omics data is as follows: first, the M × K node outputs of the fully connected layer of the deep neural network feature learning model are taken as the first effective feature dimensionality reduction, where M represents the total number of units and K is the number of nodes; the data are then standardized;

next, the features are further reduced by least absolute shrinkage and selection operator (Lasso) regression, and the risk coefficient score S of each person is obtained; the Lasso regression loss function is:

$$L(w) = \sum_{i=1}^{M}\left(y_i - w^{\top}x_i\right)^2 + \lambda\lVert w\rVert_1$$

where $x_i$ represents the feature vector of unit i, $y_i$ represents the time label of unit i, λ represents the regularization coefficient, and $w$ represents the weight coefficients.
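The Lasso step can be illustrated with a minimal sketch. It assumes the loss is the sum of squared errors plus λ times the L1 norm of the weights, solves it by cyclic coordinate descent (a solver chosen here for illustration; the patent does not name one), and assumes the risk score S of a unit is the weighted sum of its retained features. The function names and the toy data are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values inside [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for k in range(p):
            rho = 0.0  # correlation of feature k with the partial residual
            z = 0.0    # squared norm of feature k
            for i in range(n):
                pred_wo_k = sum(w[j] * X[i][j] for j in range(p) if j != k)
                rho += X[i][k] * (y[i] - pred_wo_k)
                z += X[i][k] ** 2
            # Setting the subgradient to zero gives w_k = soft(rho, lam/2) / z.
            w[k] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return w


def risk_score(x, w):
    """Assumed form of the risk score S: linear combination of the features."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy data (hypothetical): two orthogonal features, the second one uninformative.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
w = lasso_cd(X, y, lam=1.0)
print(w, risk_score([1.0, 1.0], w))
```

In this toy case the weight of the weak second feature is driven exactly to zero, which is the property that makes Lasso usable for dimensionality reduction: only features with non-zero weights contribute to S.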

Further, in step (1.7), the specific operation steps of curve evaluation and verification for the selected features are as follows:

(1.7.1) drawing a corresponding KM curve according to the cut-off value T obtained in the step (1.6), so that a result is visualized, and two survival probability curves are obtained;

(1.7.2) after different survival probability curves are obtained by using a KM method, chi-square test is carried out through data analysis software, and finally a P value is obtained;

(1.7.3) judging whether the two curves have significant difference according to the P value.
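The curve evaluation in step (1.7) can be sketched in plain Python as follows. This is an illustrative stand-in for the data analysis software, not the software itself: `kaplan_meier` computes the KM survival estimate for one group, and `logrank_test` computes the log-rank statistic, which follows a chi-square distribution with one degree of freedom for two groups. Both function names and the toy follow-up data are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time; events: 1 = death, 0 = censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        total = sum(1 for tt, _ in data if tt == t)
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= total  # deaths and censored units both leave the risk set
        i += total
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    death_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    obs_minus_exp = 0.0
    var = 0.0
    for t in death_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0.0:
        raise ValueError("log-rank variance is zero; the groups cannot be compared")
    chi2 = obs_minus_exp ** 2 / var
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


# Toy follow-up data in months (hypothetical): high-risk group dies earlier.
high_t, high_e = [1, 2, 3], [1, 1, 1]
low_t, low_e = [4, 5, 6], [1, 1, 1]
print(kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1]))
chi2, p = logrank_test(high_t, high_e, low_t, low_e)
print(chi2, p)
```

A P value below the chosen significance level indicates that the two survival curves differ more than would be expected by chance.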

Further, in the step (1.8), the specific operation steps of constructing the deep neural network multi-task logistic regression model for predicting the lifetime probability are as follows:

(1.8.1) introducing the final effective characteristics obtained in the step (1.5), the time labels and the survival state labels into a deep neural network multitask logistic regression model;

wherein each layer of the deep neural network multitask logistic regression model uses the following activation function:

Layer #1: M1 neurons, using the activation function h⁽¹⁾(x) = LeakyReLU(x)

Layer #2: M2 neurons, using the activation function h⁽²⁾(x) = ReLU(x)

Layer #3: M3 neurons, using the activation function h⁽³⁾(x) = ReLU(x)

where LeakyReLU denotes the leaky rectified linear unit and ReLU denotes the rectified linear unit;

the time axis is divided into J time intervals such that

$$a_j = [\tau_{j-1}, \tau_j), \quad j = 1, \dots, J,$$

with $\tau_0 = 0$ and $\tau_J = \infty$;

on each interval $a_j$ a logistic regression model is established, with parameters $(\theta_j, b_j)$ and response variable

$$y_j = \begin{cases} 1, & T \in a_j \\ 0, & \text{otherwise,} \end{cases}$$

i.e., $y_j$ is 1 if the event time T falls in interval $a_j$ and 0 otherwise;

when a unit experiences an event in interval $a_s$, with $s \in [1, J]$, its state in the remaining intervals stays unchanged; thus, the response vector is:

$$Y = [\,y_1 = 0, \dots, y_{s-1} = 0,\; y_s = 1, \dots, y_J = 1\,]$$

where $a_j$ represents a unit time interval of one month, and $y_j$ is the response variable: 1 represents that the event has occurred and 0 that it has not;

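probability density function:

$$f(a_s, x) = P\left[T \in a_s \mid x\right] = \frac{\exp\left(\sum_{j=s}^{J}\psi_j(x)\right)}{Z(\psi(x))}$$

where exp() represents the exponential function with base e;

survival function:

$$S(\tau_{s-1}, x) = P\left[T \ge \tau_{s-1} \mid x\right] = \sum_{k=s}^{J} f(a_k, x)$$

where $\psi: \mathbb{R}^p \to \mathbb{R}^J$ is the nonlinear transformation that takes the feature vector $x \in \mathbb{R}^p$ as its input; its output is an $\mathbb{R}^J$ vector whose values are mapped to the J subdivisions of the time axis, and the normalizing term is:

$$Z(\psi(x)) = \sum_{k=1}^{J}\exp\left(\sum_{j=k}^{J}\psi_j(x)\right)$$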
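As a hedged illustration of step (1.8.1), the sketch below encodes the response vector and turns a network output ψ(x) into interval probabilities. It assumes the sequence-softmax form of multi-task logistic regression, in which the score of an event in interval a_s is the sum of ψ_j(x) for j = s..J and the survival function is the sum of the densities of the intervals from s onward. The function names are hypothetical, and the list `psi` stands in for the output of the trained network.

```python
import math


def response_vector(s, J):
    """Response vector Y for an event in interval a_s (1-based): 0 before s, 1 from s to J."""
    return [0] * (s - 1) + [1] * (J - s + 1)


def density(psi):
    """f(a_s, x) for s = 1..J, given psi(x) as a list of J real scores."""
    J = len(psi)
    # Score of the response vector whose event falls in a_s: sum_{j=s}^{J} psi_j.
    seq_scores = [sum(psi[s - 1:]) for s in range(1, J + 1)]
    top = max(seq_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in seq_scores]
    z = sum(exps)          # normalizing term Z(psi(x)), rescaled by exp(-top)
    return [e / z for e in exps]


def survival(psi):
    """S(tau_{s-1}, x) = sum_{k=s}^{J} f(a_k, x), for s = 1..J."""
    f = density(psi)
    return [sum(f[s - 1:]) for s in range(1, len(f) + 1)]


# Hypothetical network output for one patient over J = 5 monthly intervals.
psi = [0.2, -0.1, 0.4, 0.0, -0.3]
print(response_vector(3, 5))
print(density(psi))
print(survival(psi))
```

Because the densities sum to one, the survival curve starts at 1 and can only decrease from one interval to the next, which is what allows a survival probability to be read off at any month.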

(1.8.2) setting the ratio of the training set to the test set to 8:2, and visualizing the results;

(1.8.3) evaluating the discriminative power of the deep neural network multi-task logistic regression model using the concordance index (C-index): the C-index gives an overall evaluation of the model's discriminative power; it ranges from 0 to 1, where 1 corresponds to the best possible prediction model, 0.5 to a random prediction model, and 0 to an unusable model; the C-index is calculated as follows:

$$C\text{-index} = \frac{\sum_{i,j} 1_{T_j < T_i}\cdot 1_{\eta_j > \eta_i}\cdot \delta_j}{\sum_{i,j} 1_{T_j < T_i}\cdot \delta_j}$$

where $\eta_i$ represents the risk score of unit i; $1_{T_j < T_i}$ is 1 if $T_j < T_i$ and 0 otherwise; $1_{\eta_j > \eta_i}$ is 1 if $\eta_j > \eta_i$ and 0 otherwise; and $\delta_j$ is 1 if the event was observed for unit j and 0 otherwise;
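The concordance index can be computed directly from its pairwise definition. The sketch below is a minimal quadratic-time version, assuming the common convention in which a pair (i, j) is comparable when unit j had an observed event before time T_i, and in which tied risk scores count as non-concordant (some implementations count them as one half). The function name and toy data are hypothetical.

```python
def c_index(times, events, risks):
    """Fraction of comparable pairs in which the earlier event has the higher risk score."""
    concordant = 0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Unit j must have an observed event strictly before T_i.
            if events[j] == 1 and times[j] < times[i]:
                comparable += 1
                if risks[j] > risks[i]:
                    concordant += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): survival in months, death indicator, risk score.
times = [10, 20, 30]
events = [1, 1, 1]
print(c_index(times, events, [3.0, 2.0, 1.0]))  # risk ordered with outcome
print(c_index(times, events, [1.0, 2.0, 3.0]))  # risk ordered against outcome
```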

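(1.8.4) evaluating the accuracy of the deep neural network multi-task logistic regression model using the Integrated Brier Score (IBS): the IBS ranges from 0 to 1, where 0 is the best possible value, and IBS < 0.25 indicates a useful model; the IBS is calculated as follows:

$$\mathrm{BS}(t) = \frac{1}{N}\sum_{i=1}^{N}\left(1_{T_i > t} - \hat{S}(t, x_i)\right)^2, \qquad \mathrm{IBS}(t_{\max}) = \frac{1}{t_{\max}}\int_{0}^{t_{\max}} \mathrm{BS}(t)\,dt$$

where the IBS assesses the accuracy of the survival function predicted by the model, N is the number of data samples, $1_{T_i > t}$ represents the actual status of sample i at time t (1 if the event has not yet occurred, 0 otherwise), and $\hat{S}(t, x_i)$ represents the predicted survival probability of sample i at time t.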
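A minimal sketch of the Integrated Brier Score follows. It assumes the uncensored form of the Brier score (squared difference between the observed status and the predicted survival probability, averaged over samples) and integrates it over a time grid with the trapezoidal rule; censoring weights, which a full implementation would normally add, are omitted. The function names and toy data are hypothetical.

```python
def brier_score(t, times, surv_at_t):
    """BS(t): mean squared gap between 'still event-free at t' and predicted S(t | x_i)."""
    n = len(times)
    return sum(((1.0 if T > t else 0.0) - s) ** 2 for T, s in zip(times, surv_at_t)) / n


def integrated_brier_score(grid, times, surv):
    """IBS over the grid; surv[i][k] is the predicted S(grid[k] | x_i)."""
    bs = [brier_score(t, times, [row[k] for row in surv]) for k, t in enumerate(grid)]
    # Trapezoidal rule, then normalize by the length of the time window.
    area = sum((grid[k + 1] - grid[k]) * (bs[k] + bs[k + 1]) / 2.0
               for k in range(len(grid) - 1))
    return area / (grid[-1] - grid[0])


# Toy data (hypothetical): two patients with events at months 5 and 15.
times = [5, 15]
grid = [0, 10, 20]
perfect = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
uninformative = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
print(integrated_brier_score(grid, times, perfect))
print(integrated_brier_score(grid, times, uninformative))
```

The second call shows where the 0.25 threshold comes from: a model that always predicts a survival probability of 0.5 scores exactly 0.25, so a useful model has to score below that.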

The invention has the following beneficial effects: the invention uses deep learning and CT image omics labels to predict the survival time of colorectal cancer patients; the technique relies on CT imaging, and CT images are easy to obtain clinically; in practice, once a patient's CT image is obtained it is fed into the system for analysis, and the result can serve as a reference for doctors (especially young radiologists with limited experience) to better understand the patient's condition and decide on the next step; patients can also better understand their own condition;

CT images contain abundant features, but they are large and comprise many slices, so the data volume is large and many features are redundant; the method reduces the data dimensionality through a DL feature selector and least absolute shrinkage and selection operator (Lasso) regression, yielding low-dimensional effective features that benefit prediction;

in addition, the invention constructs a deep neural network multi-task logistic regression (DNN-MTLR) model, which provides results similar to those of the CoxPH model without relying on the assumptions the latter requires, and which can be used to estimate the probability that the event of interest occurs within each time interval.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic illustration of the manual labeling of the present invention using ITK-SNAP;

FIG. 3 is a diagram of a model of a DL feature selector network in accordance with the present invention;

FIG. 4 is a high-low risk group-KM graph in accordance with the present invention;

FIG. 5 is a diagram of a DNN-MTLR network model in the present invention;

FIG. 6 is a graph of the results of the present invention using a DNN-MTLR model for prediction;

FIG. 7 is a diagram of the prediction result in an embodiment of the present invention.

Detailed Description

In order to more clearly illustrate the technical solution of the present invention, the following detailed description is made with reference to the accompanying drawings:

as depicted in fig. 1; the colorectal cancer survival time prediction method based on deep learning CT image omics finally obtains the five-year disease-free survival time (DFS) probability of colorectal cancer patients, and comprises the following specific steps:

step (1.3), preprocessing the acquired data;

step (1.6), according to the image omics risk score S of the patient, taking the median of the image omics label scores as the cutoff value T, and dividing the patients into a high-risk survival group (S > T) and a low-risk survival group (S < T);

and (1.8) constructing a deep neural network multitask logistic regression (DNN-MTLR) model for predicting the survival time probability.

Further, in step (1.1), specifically:

(1.1.2), CT imaging omics data: i.e. CT image data taken by the patient.

Further, in step (1.2), the colorectal tumor region is labeled in the CT image omics data as follows: the CT image omics data are imported into ITK-SNAP in batches, unit by unit; the tumor is annotated manually in ITK-SNAP by selecting the region of interest where it is located; and the annotated data are saved as an .nii file; the labeling results are shown in FIG. 2.

Further, in the step (1.3), the specific operation steps of preprocessing the data are as follows:

(1.3.1) incomplete clinical records, where the causes of incompleteness include loss to follow-up (contact with the subject is lost), withdrawal (the subject leaves the study for reasons unrelated to the study or the treatment), and termination (observation ends because the time specified by the design has been reached while the subject is still alive);

(1.3.3) obtaining a region of interest nii file according to the step (1.2), extracting the region of interest features by combining the original CT image data, and obtaining a feature three-dimensional matrix f (32,32,32) containing the region of interest by each unit.

Further, in step (1.4), as shown in FIG. 3, the deep neural network-based feature learning model is specifically described as follows: the feature matrix containing the region of interest obtained for each unit is used as the input of the network; the size of the input is [M × P], where M represents the total number of units and P represents the dimension of each unit's feature matrix;

the input is fed into a feature selector for feature selection; the feature selector is composed of $N_0$ convolution layers, $N_0$ pooling layers, a fully connected layer, and a logistic regression output layer; the first convolution layer contains $M_1$ filters and the other convolution layers contain $M_i$ filters, where each filter has size n × n × n and n represents the filter size;

each convolution layer uses a rectified linear unit (ReLU) activation and is followed by a max pooling operation with pool size m × m; the loss function is the mean square error (MSE):

$$\mathrm{MSE} = \frac{1}{M}\sum_{m=1}^{M}\left(y_m - \hat{y}_m\right)^2$$

where $y_m$ represents the actual value and $\hat{y}_m$ represents the predicted value.

Further, in step (1.5), the specific operation method for effective dimensionality reduction of the colorectal cancer CT image omics data is as follows: first, the 600 × 6400 node outputs of the fully connected layer of the deep neural network feature learning model are taken as the first effective feature dimensionality reduction; the data are then standardized;

next, the features are further reduced by least absolute shrinkage and selection operator (Lasso) regression, and the risk coefficient score S of each person is obtained; the Lasso regression loss function is:

$$L(w) = \sum_{i=1}^{M}\left(y_i - w^{\top}x_i\right)^2 + \lambda\lVert w\rVert_1$$

where $x_i$ represents the feature vector of unit i, $y_i$ represents the time label of unit i, λ represents the regularization coefficient, and $w$ represents the weight coefficients.

Further, in step (1.6), the specific procedure for dividing patients into a high-risk survival group (S > T) and a low-risk survival group (S < T) is as follows: the risk coefficient score S of each patient is read from the score file, and the median of the image omics label scores is used as the cutoff value T = -2.227; patients with S > T form the high-risk group with shorter survival, and patients with S < T form the low-risk group with longer survival.
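The grouping rule can be written in a few lines. In this sketch the risk scores are made-up values chosen so that their median equals the cutoff reported above; the function name is hypothetical, and assigning a score exactly equal to T to the low-risk group is an assumption, since the text only defines S > T and S < T.

```python
from statistics import median


def split_by_median(scores):
    """Return (cutoff, labels); a patient is 'high' risk when the score exceeds the median."""
    cutoff = median(scores)
    # Assumption: a score equal to the cutoff is placed in the low-risk group.
    labels = ["high" if s > cutoff else "low" for s in scores]
    return cutoff, labels


# Hypothetical risk scores S for five patients (not data from the patent).
scores = [-3.10, -2.50, -2.227, -1.90, -0.75]
cutoff, labels = split_by_median(scores)
print(cutoff, labels)
```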

(1.7.2) after the survival probability curves are obtained by the KM method, direct observation alone is not sufficient to determine whether the curves differ significantly, so a log-rank test is performed with IBM SPSS Statistics 26 to obtain a P value;

(1.7.3) judging from the P value whether the two curves differ significantly; P < 0.05 is generally considered statistically significant; the result obtained was P < 0.01, i.e., the difference is statistically significant.

Further, in the step (1.8), the specific operation steps of constructing a deep neural network multi-task logistic regression (DNN-MTLR) model for lifetime probability prediction are as follows:

(1.8.1) feeding the final effective features obtained in step (1.5), together with the time labels and the survival status labels, into the deep neural network multi-task logistic regression (DNN-MTLR) model; the DNN-MTLR model is shown in FIG. 5;

each layer uses the following activation function:

Layer #1: 326 neurons, using the activation function h⁽¹⁾(x) = LeakyReLU(x)

Layer #2: 652 neurons, using the activation function h⁽²⁾(x) = ReLU(x)

Layer #3: 1304 neurons, using the activation function h⁽³⁾(x) = ReLU(x)

where LeakyReLU denotes the leaky rectified linear unit and ReLU denotes the rectified linear unit;

the time axis is divided into J time intervals such that

$$a_j = [\tau_{j-1}, \tau_j), \quad j = 1, \dots, J,$$

with $\tau_0 = 0$ and $\tau_J = \infty$;

on each interval $a_j$ a logistic regression model is established, with parameters $(\theta_j, b_j)$ and response variable

$$y_j = \begin{cases} 1, & T \in a_j \\ 0, & \text{otherwise,} \end{cases}$$

i.e., $y_j$ is 1 if the event time T falls in interval $a_j$ and 0 otherwise; however, because the effects of recurrent events are not analyzed, it must be ensured that when a unit experiences an event in interval $a_s$, with $s \in [1, J]$, its state in the remaining intervals stays unchanged; thus, the response vector is:

$$Y = [\,y_1 = 0, \dots, y_{s-1} = 0,\; y_s = 1, \dots, y_J = 1\,]$$

where $a_j$ represents a unit time interval of one month, and $y_j$ is the response variable: 1 represents that the event has occurred and 0 that it has not;

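probability density function:

$$f(a_s, x) = P\left[T \in a_s \mid x\right] = \frac{\exp\left(\sum_{j=s}^{J}\psi_j(x)\right)}{Z(\psi(x))}$$

survival function:

$$S(\tau_{s-1}, x) = P\left[T \ge \tau_{s-1} \mid x\right] = \sum_{k=s}^{J} f(a_k, x)$$

where $\psi: \mathbb{R}^p \to \mathbb{R}^J$ is the nonlinear transformation that takes the feature vector $x \in \mathbb{R}^p$ as its input, and the normalizing term is $Z(\psi(x)) = \sum_{k=1}^{J}\exp\left(\sum_{j=k}^{J}\psi_j(x)\right)$;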

(1.8.3) evaluating the discriminative power of the DNN-MTLR model using the concordance index (C-index): the C-index gives an overall evaluation of the discriminative power of the deep neural network multi-task logistic regression model; a C-index of 0.82 is obtained, where 1 corresponds to the best possible prediction model, 0.5 to a random prediction model, and 0 to an unusable model. The C-index is calculated as follows:

$$C\text{-index} = \frac{\sum_{i,j} 1_{T_j < T_i}\cdot 1_{\eta_j > \eta_i}\cdot \delta_j}{\sum_{i,j} 1_{T_j < T_i}\cdot \delta_j}$$

where $\eta_i$ represents the risk score of unit i; $1_{T_j < T_i}$ is 1 if $T_j < T_i$ and 0 otherwise; $1_{\eta_j > \eta_i}$ is 1 if $\eta_j > \eta_i$ and 0 otherwise; and $\delta_j$ is 1 if the event was observed for unit j and 0 otherwise;

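(1.8.4) evaluating the accuracy of the deep neural network multi-task logistic regression (DNN-MTLR) model using the Integrated Brier Score (IBS): the IBS assesses the accuracy of the survival function predicted by the model; an IBS of 0.06 is obtained, where 0 is the best possible value and IBS < 0.25 indicates a useful model; the IBS is calculated as follows:

$$\mathrm{BS}(t) = \frac{1}{N}\sum_{i=1}^{N}\left(1_{T_i > t} - \hat{S}(t, x_i)\right)^2, \qquad \mathrm{IBS}(t_{\max}) = \frac{1}{t_{\max}}\int_{0}^{t_{\max}} \mathrm{BS}(t)\,dt$$

where N is the number of data samples, $1_{T_i > t}$ represents the actual status of sample i at time t (1 if the event has not yet occurred, 0 otherwise), and $\hat{S}(t, x_i)$ represents the predicted survival probability of sample i at time t.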

The specific embodiment is as follows:

(1) Data acquisition: the CT image omics data of patient A are obtained.

(2) The colorectal tumor region is labeled in the CT image omics data.

(3) The acquired data are preprocessed to obtain the time and status labels.

(4) A feature learning model based on a deep neural network is constructed to obtain the deep high-throughput CT image omics features of patient A.

(5) The dimensionality of the deep high-throughput features of the CT image omics data is reduced by Lasso regression, and the risk score model of patient A is established.

(6) Patient A is assigned to the high-risk group according to the image omics risk score of patient A.

(7) The obtained deep high-throughput features are evaluated and verified.

(8) The features obtained by dimensionality reduction are fed into the deep neural network multi-task logistic regression model for survival probability prediction, and the prediction result is finally obtained.

The results are shown in FIG. 7, where the annual probability results are shown in the following table:

Month	12	24	36	48	60
Probability of survival	97.649882%	93.804818%	85.995414%	66.933918%	48.437748%

The results show that such methods can be used for survival prediction in colorectal cancer patients.

Claims

1. The colorectal cancer survival period prediction method based on deep learning CT (computed tomography) image omics is characterized by comprising the following specific steps of:

step (1.3), preprocessing the acquired data;

2. The method for predicting survival of colorectal cancer based on deep learning CT imaging omics as claimed in claim 1, wherein in step (1.1), specifically:

(1.1.2), CT imaging omics data: i.e. CT image data taken by the patient.

3. The method for predicting survival of colorectal cancer based on deep learning CT imaging omics as claimed in claim 1, wherein in step (1.2), the colorectal tumor region is labeled in the CT image omics data as follows: the CT image omics data are imported into ITK-SNAP in batches, unit by unit; the tumor is annotated manually in ITK-SNAP by selecting the region of interest where it is located; and the annotated data are saved as an .nii file.

4. The method for predicting survival of colorectal cancer based on deep learning CT imaging omics as claimed in claim 1, wherein in said step (1.3), the specific operation steps of preprocessing the obtained data are as follows:

5. The method for predicting survival of colorectal cancer based on deep learning CT imaging omics as claimed in claim 1, wherein in step (1.4), the deep neural network-based feature learning model is specifically described as follows: obtaining a characteristic matrix containing an interested area by each unit as the input of a network, wherein the size of the characteristic matrix is [ M multiplied by P ], wherein M represents the total number of the units; p represents the feature matrix dimension of each unit in the total units;

wherein $y_m$ represents the actual value and $\hat{y}_m$ represents the predicted value.

6. The method for predicting survival of colorectal cancer based on deep learning CT imaging omics as claimed in claim 1, wherein in step (1.5), the specific operation method for performing effective dimension reduction on the colorectal cancer CT imaging omics data is as follows: firstly, M multiplied by K node information of a full connection layer of a feature learning model of a deep neural network is selected as first effective feature dimension reduction, wherein M represents the total unit number, and K is the node information number; standardizing the data;

representing the weight coefficients.

7. The method for predicting survival of colorectal cancer based on deep learning CT imaging omics as claimed in claim 1, wherein the specific operation steps of curve evaluation and verification of the selected features in step (1.7) are as follows:

8. The method for predicting survival of colorectal cancer based on deep learning CT imaging omics as claimed in claim 1, wherein in step (1.8), the specific operation steps of constructing the deep neural network multi-task logistic regression model for predicting the survival probability are as follows:

Layer #1: M1 neurons, using the activation function h⁽¹⁾(x) = LeakyReLU(x)

Layer #2: M2 neurons, using the activation function h⁽²⁾(x) = ReLU(x)

Layer #3: M3 neurons, using the activation function h⁽³⁾(x) = ReLU(x)

the time axis is divided into J-time intervals such that