CN115938590A - Construction method and prediction system of colorectal cancer postoperative LARS prediction model

Info

Publication number: CN115938590A
Application number: CN202310088636.5A
Authority: CN
Inventors: 汪晓东; 黄明君; 李立; 叶林
Original assignee: West China Hospital of Sichuan University
Current assignee: West China Hospital of Sichuan University
Priority date: 2023-02-09
Filing date: 2023-02-09
Publication date: 2023-04-07
Anticipated expiration: 2043-02-09
Also published as: CN115938590B

Abstract

The invention discloses a construction method and a prediction system of a colorectal cancer postoperative LARS prediction model, which comprise the following steps: obtaining a first sample pair; generating a second sample pair; training to generate a prediction model; selecting a test set to test the prediction model; calibrating a prediction model in a clustering space; performing semi-supervised clustering analysis to generate a clustering result; and taking the patient variables as input data of the colorectal cancer postoperative LARS prediction model, and taking the corresponding output data of the optimal prediction model as output data of the colorectal cancer postoperative LARS prediction model. According to the invention, different patient variables correspond to different prediction models, so that the problem of poor prediction accuracy of a single model on LARS is avoided, and LARS prediction data with good accuracy can be obtained only by selecting the optimal prediction model corresponding to the patient variables for prediction in the use process of the prediction model.

Description

Construction method and prediction system of colorectal cancer postoperative LARS prediction model

Technical Field

The invention relates to the technical field of information analysis, in particular to a construction method and a prediction system of a colorectal cancer postoperative LARS prediction model.

Background

Colorectal cancer is the third most common cancer in the world and the second leading cause of cancer-related death in developed countries. With the continuous development of colorectal cancer diagnosis and treatment technology, and with doctors and patients attaching increasing importance to postoperative quality of life, anus-preserving surgery has become an increasingly preferred surgical approach for colorectal cancer, and it improves patients' quality of life to a certain extent. However, after anus-preserving surgery many colorectal cancer patients develop symptoms such as increased defecation frequency, defecation urgency, gas and fecal incontinence, and incomplete defecation, collectively known as Low Anterior Resection Syndrome (LARS), and these symptoms seriously affect patients' quality of life. Currently, only a few studies in China address the factors influencing the occurrence of LARS, including the patient's basic condition (such as sex and age), the tumor stage, and the treatment scheme (such as whether preoperative chemotherapy is performed), which may be correlated with the occurrence and severity of LARS. However, at present no research or product, at home or abroad, can predict the occurrence of LARS after colorectal cancer surgery.

Therefore, relying on the clinical data of a large number of colorectal cancer patients, a prospective cohort study is carried out, and a prediction model of low anterior resection syndrome is constructed by a machine learning method with appropriate data analysis.

Disclosure of Invention

In order to overcome at least the above defects in the prior art, the present application aims to provide a method for constructing a prediction model of LARS after colorectal cancer surgery and a prediction system.

In a first aspect, the embodiments of the present application provide a method for constructing a prediction model of LARS after colorectal cancer surgery, including:

obtaining a first sample pair from a sample library; the first sample pair is a correspondence of a patient variable and a LARS score;

preprocessing the patient variable in the first sample pair to generate a second sample pair;

selecting a training set from the second sample pair, and training a plurality of models through the training set to generate a plurality of prediction models corresponding to different models;

selecting a plurality of groups of test sets corresponding to different patient variables from the second sample pair, and testing all the prediction models to generate test evaluation data;

constructing a clustering space by using the test evaluation data and the corresponding patient variables, and calibrating all the prediction models in the clustering space;

performing semi-supervised clustering analysis on all second sample pairs according to the test evaluation data to generate clustering results; the clustering result is an optimal prediction model corresponding to the patient variable in the second sample pair;

and taking the patient variables as input data of the post-colorectal cancer LARS prediction model, and taking the corresponding output data of the optimal prediction model as output data of the post-colorectal cancer LARS prediction model to complete the construction of the post-colorectal cancer LARS prediction model.

Medical prediction models have been studied in the prior art. Chinese patent application No. 202111071269.5 discloses a method for constructing a prediction model for survival rate after lung cancer surgery and a prediction model system, in which postoperative survival is predicted from clinical data that include gene mutation typing. The method comprises: a data acquisition step, in which clinical data after lung cancer surgery are acquired; a preprocessing step, in which the clinical data are classified and grouped to obtain modeling-group clinical data and verification-group clinical data; a risk factor screening step, in which risk factors are screened from the modeling-group clinical data to obtain risk factor data and overall survival data; and a regression analysis step, in which the risk factor data and the overall survival data are subjected to regression analysis. The clinical data after lung cancer surgery comprise gene mutation typing, age, tumor size, lymph node metastasis and surgical approach. The inventors found that, owing to the complexity of medical samples, it is difficult to obtain a more accurate prediction result with such an approach no matter how the parameters are screened and the model is corrected.

In the embodiment of the application, in order to effectively improve prediction accuracy, the first sample pair extracted from the sample library is processed into the second sample pair, and a plurality of different models are then trained to generate corresponding prediction models. The models can be any of a logistic regression model, a support vector machine, a decision tree model, a random forest model and a neural network model, and each trained prediction model corresponds to its model type. The inventors found that each model predicts accurately only for a limited range of patient variables, so classification is performed by semi-supervised learning with preset clustering centers, with the aim of identifying the patient variables that each prediction model predicts best.

In the embodiment of the application, each clustering center is set to a prediction model; one way of doing so is to calibrate the prediction model at the position of the patient variable for which its test evaluation data are best. Cluster analysis is then performed on the second sample pairs, which determines the patient variables that each prediction model predicts well; that prediction model is the optimal prediction model for those patient variables. In subsequent prediction, once the patient's data are obtained, they are converted into patient variables and input into the corresponding optimal prediction model, so that the strengths of the different prediction models are exploited and prediction accuracy is improved. Because different patient variables correspond to different prediction models, the poor LARS prediction accuracy of a single model is avoided; in use, it is only necessary to select the optimal prediction model corresponding to the patient variables in order to obtain LARS prediction data with good accuracy.

In one possible implementation, the patient variables include demographic data, tumor location, tumor length and diameter, surgical procedure, prophylactic stoma, neoadjuvant therapy, and tumor distance from dentate line.

In one possible implementation, preprocessing the patient variable in the first sample pair to generate a second sample pair includes:

interpolating missing values in the patient variables based on a random forest method to form interpolated patient variables;

selecting features of the interpolated patient variables, screening out features whose correlation with one another is greater than a preset value as correlated features, and performing decorrelation calculation on the correlated features to form corrected patient variables;

the corrected patient variable is normalized to generate a second sample pair.
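The normalization step at the end of this preprocessing can be sketched as follows. This is a minimal illustration that assumes z-score standardization; the application does not state which normalization is used, and the function name `zscore_columns` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each column of a list of numeric rows to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mu) / sigma for value, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]

# Hypothetical corrected patient variables:
# [age in years, tumor distance from the dentate line in cm]
corrected = [[52.0, 4.0], [61.0, 6.0], [70.0, 8.0]]
normalized = zscore_columns(corrected)
```

Standardizing the columns puts variables with different units on a comparable scale, which matters later when distances are computed in the clustering space.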

In one possible implementation, screening out features of the interpolated patient variables whose correlation with one another is greater than a preset value as correlated features, and performing decorrelation calculation on the correlated features to form corrected patient variables, includes:

calculating the variance inflation factor (VIF) of each feature in the interpolated patient variables, and taking the features with a VIF greater than 10 as the correlated features;

and splitting the correlated features, and removing the correlated parts from the splitting result to form the corrected patient variables.

In one possible implementation, the plurality of models includes at least three of logistic regression, support vector machine, decision tree, random forest, and neural network.

In a possible implementation manner, the selecting of the training set and the test set includes:

arranging the second sample pairs into a development queue;

randomly selecting a sample pair with a first preset proportion from the development queue as a training set, and selecting a sample pair with a second preset proportion as a testing set; the sum of the first preset proportion and the second preset proportion is 1;

training a prediction model through the training set, and testing the prediction model through the test set;

and circularly selecting a training set and a testing set according to the first preset proportion and the second preset proportion to train and test the prediction models until the number of the prediction models trained by each model meets the expectation.
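The repeated random selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the 0.7/0.3 proportions, the number of rounds, the function name `repeated_splits` and the toy sample pairs are all hypothetical and do not come from the application.

```python
import random

def repeated_splits(sample_pairs, train_ratio, n_rounds, seed=0):
    """Yield (training_set, test_set) pairs drawn repeatedly from the same development queue."""
    rng = random.Random(seed)
    n_train = round(len(sample_pairs) * train_ratio)
    for _ in range(n_rounds):
        shuffled = sample_pairs[:]   # copy so the development queue itself is untouched
        rng.shuffle(shuffled)
        # The two proportions sum to 1, so the test set is simply the remainder.
        yield shuffled[:n_train], shuffled[n_train:]

# Hypothetical second sample pairs: (patient variables, LARS label)
development_queue = [({"age": 50 + i}, i % 2) for i in range(10)]
splits = list(repeated_splits(development_queue, train_ratio=0.7, n_rounds=5))
```

Each round would train and test one prediction model per model type; the loop ends once the expected number of prediction models has been produced.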

In one possible implementation, the constructing a clustering space and performing semi-supervised clustering analysis includes:

constructing a clustering space according to the patient variables, and calibrating the prediction model in the clustering space according to the test evaluation data;

clustering analysis is carried out by taking the position of the prediction model in the clustering space as a clustering center, and the second sample pairs are clustered to be close to different clustering centers to form a plurality of categories;

and taking the prediction model corresponding to the same class clustering center as the corresponding optimal prediction model of the patient variable in the class.
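Clustering around fixed, pre-calibrated centers can be sketched as a nearest-center assignment. This is a minimal illustration that assumes Euclidean distance in a two-dimensional clustering space; the application does not specify the distance measure, and the model names, center coordinates and patient points below are hypothetical.

```python
import math

def assign_to_centers(samples, centers):
    """Map each sample index to the name of the nearest fixed cluster center."""
    assignment = {}
    for idx, point in enumerate(samples):
        assignment[idx] = min(
            centers, key=lambda name: math.dist(point, centers[name])
        )
    return assignment

# Hypothetical calibrated positions of three prediction models in the clustering space
model_centers = {
    "logistic_regression": (-1.0, -1.0),
    "random_forest": (1.0, 1.0),
    "svm": (1.0, -1.0),
}
# Hypothetical normalized patient variables
normalized_patients = [(-0.8, -1.2), (0.9, 0.7), (1.1, -0.6), (-1.5, -0.4)]
optimal_model = assign_to_centers(normalized_patients, model_centers)
```

Because the centers are given in advance rather than learned, the centers are never updated; each category is simply the set of patients closest to one model's calibrated position.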

In one possible implementation, selecting a plurality of test sets corresponding to different patient variables from the second sample pair, and testing all the predictive models to generate test evaluation data includes:

testing all the prediction models through the test set to obtain output data of the prediction models as calculation data;

comparing the LARS score in the test set serving as reference data with the calculation data to obtain difference data of the prediction model; the difference data comprises at least one of accuracy, specificity, sensitivity, positive predictive value, and negative predictive value;

and obtaining a receiver operating characteristic (ROC) curve and the area under the curve (AUC) from the difference data to serve as the test evaluation data.
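The difference data and the AUC named in these steps can be computed as in the following sketch. It assumes binary labels (1 for LARS, 0 for no LARS), a hypothetical cut-off of 0.5 for turning scores into predictions, and toy data; none of these values come from the application.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV and NPV for binary labels.

    Assumes both classes occur in y_true and in y_pred, so no denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy reference labels and model scores (hypothetical)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = confusion_metrics(y_true, y_pred)
model_auc = auc(y_true, scores)
```

The pairwise formulation of the AUC is equivalent to the area under the ROC curve and avoids building the curve explicitly for small test sets.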

In a second aspect, an embodiment of the present application provides a prediction system, including:

an acquisition module configured to acquire patient data of a target patient as input data;

a calculation module configured to input the input data into the post-colorectal cancer LARS prediction model and obtain output data as a prediction result of post-colorectal cancer LARS.
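The two modules can be sketched as a small class. This is an illustration only, assuming that routing to the optimal prediction model is done by nearest calibrated center; the class name, method names, feature names and the stand-in scoring functions are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

@dataclass
class PredictionSystem:
    centers: Dict[str, Tuple[float, ...]]                   # calibrated position of each model
    models: Dict[str, Callable[[Sequence[float]], float]]   # model name -> scoring function

    def acquire(self, patient_record: Dict[str, float],
                feature_order: Sequence[str]) -> Tuple[float, ...]:
        """Acquisition module: turn a patient record into an ordered input vector."""
        return tuple(patient_record[name] for name in feature_order)

    def predict(self, features: Sequence[float]) -> Tuple[str, float]:
        """Calculation module: route the input to the model with the nearest center."""
        name = min(self.centers, key=lambda n: math.dist(features, self.centers[n]))
        return name, self.models[name](features)

# Stand-in models that return fixed scores; real models would be trained beforehand.
system = PredictionSystem(
    centers={"model_a": (0.0, 0.0), "model_b": (1.0, 1.0)},
    models={"model_a": lambda x: 0.2, "model_b": lambda x: 0.8},
)
inputs = system.acquire({"age_z": 0.9, "distance_z": 0.8}, ["age_z", "distance_z"])
chosen_model, lars_score = system.predict(inputs)
```

Keeping acquisition and calculation separate mirrors the module split in the text: the first only shapes the input, the second selects a model and produces the prediction.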

Compared with the prior art, the invention has the following advantages and beneficial effects:

according to the construction method and the prediction system of the colorectal cancer postoperative LARS prediction model, different patient variables correspond to different prediction models, the problem that a single model is poor in LARS prediction precision is solved, and LARS prediction data with good precision can be obtained only by selecting the optimal prediction model corresponding to the patient variables for prediction in the using process of the prediction model.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a schematic diagram of the steps of an embodiment of the method of the present application;

FIG. 2 is a system architecture diagram according to an embodiment of the present application;

FIG. 3 is a graph illustrating the ROC curve and the AUC according to an embodiment of the present application.

Detailed description of the preferred embodiments

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Further, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some of the embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to, or remove one or more operations from, the flowchart.

In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

Please refer to FIG. 1, which is a schematic flow chart of a method for constructing a colorectal cancer postoperative LARS prediction model according to an embodiment of the present invention. The method may specifically include the following steps S1 to S7.

S1: obtaining a first sample pair from a sample library; the first sample pair is a correspondence of a patient variable and a LARS score;

S2: preprocessing a patient variable in the first sample pair to generate a second sample pair;

S3: selecting a training set from the second sample pair, and training a plurality of models through the training set to generate a plurality of prediction models corresponding to different models;

S4: selecting a plurality of groups of test sets corresponding to different patient variables from the second sample pair, and testing all the prediction models to generate test evaluation data;

S5: constructing a clustering space by using the test evaluation data and the corresponding patient variables, and calibrating all the prediction models in the clustering space;

S6: performing semi-supervised clustering analysis on all second sample pairs according to the test evaluation data to generate clustering results; the clustering result is an optimal prediction model corresponding to the patient variable in the second sample pair;

S7: and taking the patient variables as input data of the post-colorectal cancer LARS prediction model, and taking the corresponding output data of the optimal prediction model as output data of the post-colorectal cancer LARS prediction model to complete the construction of the post-colorectal cancer LARS prediction model.

Medical prediction models have been studied in the prior art. Chinese patent application No. 202111071269.5 discloses a method for constructing a prediction model for survival rate after lung cancer surgery and a prediction model system, in which postoperative survival is predicted from clinical data that include gene mutation typing. The method comprises: a data acquisition step, in which clinical data after lung cancer surgery are acquired; a preprocessing step, in which the clinical data are classified and grouped to obtain modeling-group clinical data and verification-group clinical data; a risk factor screening step, in which risk factors are screened from the modeling-group clinical data to obtain risk factor data and overall survival data; and a regression analysis step, in which the risk factor data and the overall survival data are subjected to regression analysis. The clinical data after lung cancer surgery comprise gene mutation typing, age, tumor size, lymph node metastasis and surgical approach. The inventors found that, owing to the complexity of medical samples, it is difficult to obtain accurate prediction results with such an approach no matter how the parameters are screened and the model is corrected.

In one possible implementation, the patient variables include demographic data, tumor location, tumor length and diameter, surgical procedure, prophylactic stoma, neoadjuvant therapy, and tumor distance from the dentate line.

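In one possible implementation, preprocessing the patient variable in the first sample pair to generate a second sample pair includes interpolating missing values in the patient variables based on a random forest method, performing decorrelation calculation on correlated features to form corrected patient variables, and normalizing the corrected patient variables to generate the second sample pair.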

In the embodiment of the application, the sample library is generally built through regular follow-up visits to patients: the patients' LARS symptoms are evaluated with the LARS scale, and the occurrence of LARS 6 months after surgery is used as the outcome measure. Part of the data in the first sample pair may therefore be missing; for example, if a patient is away in a given week, the data for that period are not collected. In the embodiment of the present application these data are therefore completed by a random forest method, which is generally used to impute continuous variables such as age and the distance between the anastomosis and the anal margin.

Meanwhile, the inventors found in practice that many features in the patient variables are strongly correlated with one another, and that this strong correlation reduces the reliability of the model. In the embodiment of the application, decorrelation calculation is therefore performed on the correlated features to reduce this influence.

In the embodiment of the present application, a specific calculation scheme is provided, in which multicollinearity screening of the independent variables is performed by calculating the variance inflation factor (VIF) after feature selection, and variables with a VIF greater than 10 are eliminated.

The VIF of the $i$-th independent variable is

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

wherein $R_i$ is the multiple correlation coefficient obtained by regressing the $i$-th independent variable on the remaining independent variables.
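The VIF screening can be illustrated with a short, dependency-free sketch. It uses the standard definition VIF = 1 / (1 - R^2), where R^2 comes from an ordinary least-squares regression of one variable on the others; the helper names and the synthetic columns are hypothetical, and a statistics library would normally be used in practice.

```python
def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(y, xs):
    """R^2 of an ordinary least-squares fit of y on the columns xs, with an intercept."""
    n = len(y)
    cols = [[1.0] * n] + xs
    k = len(cols)
    ata = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)] for i in range(k)]
    atb = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    beta = _solve(ata, atb)
    fitted = [sum(beta[i] * cols[i][t] for i in range(k)) for t in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((y[t] - fitted[t]) ** 2 for t in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """Variance inflation factor of each column against all the other columns."""
    result = []
    for i, col in enumerate(columns):
        others = [c for j, c in enumerate(columns) if j != i]
        r2 = r_squared(col, others)
        result.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return result

# Synthetic columns: b is almost a multiple of a, c is unrelated to both.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [2.5, 3.5, 6.5, 7.5, 9.5, 12.5, 13.5, 16.5]
c = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
vifs = vif([a, b, c])
flagged = [i for i, v in enumerate(vifs) if v > 10]  # indices of correlated features
```

With these columns the two nearly collinear variables exceed the threshold of 10 and the unrelated one does not, which is the behavior the screening relies on.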

For correlated features that can be split, the correlated part is removed after splitting. For example, suppose one feature is that a patient takes two medicines, A and B, and another feature is difficult defecation, and medicine B is highly correlated with difficult defecation. The feature "the patient takes medicines A and B" is then split into "the patient takes medicine A" and "the patient takes medicine B", and the latter is eliminated. In this way the stability of the model can be effectively improved.

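In one possible implementation, selecting the training set and the test set includes arranging the second sample pairs into a development queue and repeatedly drawing a training set and a test set from it at random according to the first preset proportion and the second preset proportion.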

In the embodiment of the application, dividing the samples into a training set and a test set for model training belongs to the prior art and is not described in detail here. However, because samples are scarce, model training in the embodiment of the present application is performed by resampling: a first preset proportion and a second preset proportion are determined, a training set and a test set are selected and a model is trained, and a training set and a test set are then randomly selected again according to the same proportions to train a further model, until enough prediction models are obtained. This provides a sufficient pool from which to select prediction models at a later stage.

In one possible implementation, the constructing a clustering space and the performing semi-supervised clustering analysis comprise:

In the embodiment of the application, the calibration process has been explained above. Because specific clustering centers are determined in advance, the scheme is a typical semi-supervised learning approach: clustering forms a plurality of categories, each gathered around one clustering center. The patient variables in a category therefore correspond to one prediction model, which is the optimal prediction model for the patient variables in that category.

In the embodiment of the present application, please refer to FIG. 3, which shows a receiver operating characteristic (ROC) curve and the area under the curve (AUC) used as test evaluation data, in which the labels Sensitivity and Specificity denote the sensitivity and the specificity, respectively. The ROC curve is formed from the results of a model; it gives an AUC of 0.832 and an optimal threshold of 0.540, at which the sensitivity is 0.911 and the specificity is 0.717, indicating that the model has good predictive ability.
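The application does not state how the optimal threshold was chosen. One common criterion is Youden's index, J = sensitivity + specificity - 1, maximized over candidate thresholds; the sketch below assumes that criterion purely for illustration, and the labels and scores are toy values rather than the data behind FIG. 3.

```python
def best_threshold_by_youden(y_true, scores):
    """Return (J, threshold, sensitivity, specificity) for the threshold maximizing Youden's J.

    A case is predicted positive when its score is >= threshold.
    Assumes both classes occur in y_true. Ties keep the lowest threshold.
    """
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
        tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:
            best = (j, threshold, sensitivity, specificity)
    return best

# Toy labels and scores (hypothetical)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
j, threshold, sensitivity, specificity = best_threshold_by_youden(y_true, scores)

# Youden's index implied by the figures reported in the text
reported_j = round(0.911 + 0.717 - 1, 3)
```

For the reported operating point, the same formula gives J = 0.911 + 0.717 - 1 = 0.628.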

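To facilitate the explanation of the prediction system, please refer to FIG. 2, which is a schematic diagram of a communication architecture of the prediction system according to an embodiment of the present invention. The prediction system may comprise an acquisition module configured to acquire patient data of a target patient as input data, and a calculation module configured to input the input data into the colorectal cancer postoperative LARS prediction model and obtain the output data as the prediction result of LARS after colorectal cancer surgery.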

Those of ordinary skill in the art will appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.

The elements described as separate parts may or may not be physically separate.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. The method for constructing the colorectal cancer postoperative LARS prediction model is characterized by comprising the following steps:

preprocessing a patient variable in the first sample pair to generate a second sample pair;

constructing a clustering space from the test evaluation data and the corresponding patient variables, and defining all the prediction models in the clustering space;

2. The method of claim 1, wherein the patient variables include demographic data, tumor location, tumor length and diameter, surgical procedure, preventive stoma, neoadjuvant therapy, and tumor dentate line distance.

3. The method of claim 1, wherein preprocessing the patient variables in the first sample pair to generate a second sample pair comprises:

the corrected patient variable is normalized to generate a second sample pair.

4. The method of claim 3, wherein the step of screening the interpolated patient variables for which the correlation between the interpolated patient variables is greater than a predetermined value as correlation features, and the step of performing decorrelation calculation on the correlation features to form corrected patient variables comprises:

calculating the variance inflation factor (VIF) of each feature in the interpolated patient variables, and taking the features with a VIF greater than 10 as the correlated features;

5. The method of claim 1, wherein the plurality of models includes at least three of logistic regression, support vector machines, decision trees, random forests, and neural networks.

6. The method of claim 1, wherein the selecting the training set and the testing set comprises:

arranging the second sample pairs into a development queue;

7. The method for constructing a model for predicting postoperative LARS for colorectal cancer according to claim 1, wherein constructing a clustering space and performing semi-supervised clustering analysis comprises:

8. The method of claim 1, wherein selecting a plurality of test sets corresponding to different patient variables from the second sample pair for testing all the predictive models to generate test evaluation data comprises:

9. The prediction system of the prediction model of LARS after colorectal cancer surgery, which is constructed by the method of any one of claims 1 to 8, is characterized by comprising: