CN115938590A - Construction method and prediction system of colorectal cancer postoperative LARS prediction model - Google Patents

Publication number
CN115938590A
Authority
CN
China
Prior art keywords
prediction
prediction model
patient
data
lars
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310088636.5A
Other languages
Chinese (zh)
Other versions
CN115938590B (en
Inventor
汪晓东
黄明君
李立
叶林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202310088636.5A priority Critical patent/CN115938590B/en
Publication of CN115938590A publication Critical patent/CN115938590A/en
Application granted granted Critical
Publication of CN115938590B publication Critical patent/CN115938590B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a construction method and a prediction system of a colorectal cancer postoperative LARS prediction model. The method comprises the following steps: obtaining a first sample pair; generating a second sample pair; training to generate prediction models; selecting test sets to test the prediction models; calibrating the prediction models in a clustering space; performing semi-supervised clustering analysis to generate a clustering result; and taking the patient variables as input data of the colorectal cancer postoperative LARS prediction model, and taking the corresponding output data of the optimal prediction model as output data of the colorectal cancer postoperative LARS prediction model. According to the invention, different patient variables correspond to different prediction models, which avoids the poor LARS prediction accuracy of a single model; when the prediction model is used, LARS prediction data with good accuracy can be obtained simply by selecting the optimal prediction model corresponding to the patient variables.

Description

Construction method and prediction system of colorectal cancer postoperative LARS prediction model
Technical Field
The invention relates to the technical field of information analysis, in particular to a construction method and a prediction system of a colorectal cancer postoperative LARS prediction model.
Background
Colorectal cancer is the third most common cancer in the world and the second leading cause of cancer-related death in developed countries. With the continuous development of colorectal cancer diagnosis and treatment, and the growing importance that doctors and patients attach to postoperative quality of life, sphincter-preserving surgery has become an increasingly preferred surgical approach for colorectal cancer, and it improves patients' quality of life to a certain extent. However, after sphincter-preserving surgery many colorectal cancer patients develop symptoms such as increased defecation frequency, urgency, gas and fecal incontinence, and incomplete evacuation, collectively known as low anterior resection syndrome (LARS), and these symptoms seriously affect the patients' quality of life. Currently, only a few studies in China address the factors influencing the occurrence of LARS, including the patient's basic conditions (such as sex and age), the tumor stage, and the treatment scheme (such as whether preoperative chemotherapy is performed), which may be correlated with the occurrence and severity of LARS. However, no research or product, in China or abroad, is currently able to predict the occurrence of LARS after colorectal cancer surgery.
Therefore, relying on the clinical data of a large number of colorectal cancer patients, a prospective cohort study is carried out, and a prediction model of low anterior resection syndrome is constructed with machine learning methods through appropriate data analysis.
Disclosure of Invention
In order to overcome at least the above defects in the prior art, the present application aims to provide a method for constructing a prediction model of LARS after colorectal cancer surgery and a prediction system.
In a first aspect, the embodiments of the present application provide a method for constructing a prediction model of LARS after colorectal cancer surgery, including:
obtaining a first sample pair from a sample library; the first sample pair is a correspondence of a patient variable and a LARS score;
preprocessing the patient variable in the first sample pair to generate a second sample pair;
selecting a training set from the second sample pair, and training a plurality of models through the training set to generate a plurality of prediction models corresponding to different models;
selecting a plurality of groups of test sets corresponding to different patient variables from the second sample pair, and testing all the prediction models to generate test evaluation data;
constructing a clustering space by using the test evaluation data and the corresponding patient variables, and calibrating all the prediction models in the clustering space;
performing semi-supervised clustering analysis on all second sample pairs according to the test evaluation data to generate clustering results; the clustering result is an optimal prediction model corresponding to the patient variable in the second sample pair;
and taking the patient variables as input data of the post-colorectal cancer LARS prediction model, and taking the corresponding output data of the optimal prediction model as output data of the post-colorectal cancer LARS prediction model to complete the construction of the post-colorectal cancer LARS prediction model.
In the prior art, research on medical prediction models has been carried out. Chinese patent application No. 202111071269.5 discloses a method for constructing a prediction model for survival rate after lung cancer surgery and a prediction model system, in which survival after lung cancer surgery is predicted from clinical data including gene mutation typing. The method comprises: a data acquisition step, in which clinical data after lung cancer surgery are acquired; a preprocessing step, in which the clinical data are classified and grouped to obtain modeling-group clinical data and verification-group clinical data; a risk factor screening step, in which risk factors are screened from the modeling-group clinical data to obtain risk factor data and overall survival data; and a regression analysis step, in which regression analysis is performed on the risk factor data and the overall survival data. The clinical data after lung cancer surgery comprise gene mutation typing, age, tumor size, lymph node metastasis and surgery mode. The inventor finds that, due to the complexity of medical samples, an accurate prediction result is difficult to obtain no matter how parameter screening and model correction are carried out.
In the embodiment of the application, in order to effectively improve the prediction accuracy of the prediction model, after the first sample pair extracted from the sample library is processed into the second sample pair, a plurality of different models are trained to generate corresponding different prediction models, the models can be any of a logistic regression model, a support vector machine, a decision tree model, a random forest model and a neural network model, and the trained prediction models also correspond to the types of the models. The inventor finds that the prediction accuracy of each model on different patient variables is limited, so that classification is performed by adopting a semi-supervised learning mode of a preset clustering center, and the aim is to select the patient variables which can be better predicted by each prediction model.
In the embodiment of the application, each clustering center is set to a prediction model; one way of doing so is to calibrate the prediction model at the position of the patient variable for which its test evaluation data are best. Cluster analysis is then performed on the second sample pairs, which determines the patient variables that each prediction model predicts well; that prediction model is the optimal prediction model for those patient variables. During subsequent prediction, the data related to the patient are converted into patient variables and input into the corresponding optimal prediction model, so that the strengths of the different prediction models can be exploited and the prediction accuracy is greatly improved. According to the embodiment of the application, different patient variables correspond to different prediction models, which avoids the poor LARS prediction accuracy of a single model; when the prediction model is used, LARS prediction data with good accuracy can be obtained simply by selecting the optimal prediction model corresponding to the patient variables.
In one possible implementation, the patient variables include demographic data, tumor location, tumor length and diameter, surgical procedure, prophylactic stoma, neoadjuvant therapy, and tumor distance from dentate line.
In one possible implementation, preprocessing the patient variable in the first sample pair to generate a second sample pair includes:
interpolating missing values in the patient variables based on a random forest method to form interpolated patient variables;
performing feature selection on the interpolated patient variables, screening out features whose mutual correlation is greater than a preset value as correlated features, and performing decorrelation calculation on the correlated features to form corrected patient variables;
the corrected patient variable is normalized to generate a second sample pair.
In one possible implementation, screening out features of the interpolated patient variables whose mutual correlation is greater than a preset value as correlated features, and performing decorrelation calculation on the correlated features to form corrected patient variables, includes:
calculating the variance inflation factor (VIF) of each feature in the interpolated patient variables, and taking the features with a VIF greater than 10 as the correlated features;
and splitting the correlated features, and removing the correlated parts from the splitting result to form corrected patient variables.
In one possible implementation, the plurality of models includes at least three of logistic regression, support vector machine, decision tree, random forest, and neural network.
In a possible implementation manner, the selecting of the training set and the test set includes:
arranging the second sample pairs into a development queue;
randomly selecting a sample pair with a first preset proportion from the development queue as a training set, and selecting a sample pair with a second preset proportion as a testing set; the sum of the first preset proportion and the second preset proportion is 1;
training a prediction model through the training set, and testing the prediction model through the test set;
and circularly selecting a training set and a testing set according to the first preset proportion and the second preset proportion to train and test the prediction models until the number of the prediction models trained by each model meets the expectation.
In one possible implementation, the constructing a clustering space and performing semi-supervised clustering analysis includes:
constructing a clustering space according to the patient variables, and calibrating the prediction model in the clustering space according to the test evaluation data;
clustering analysis is carried out by taking the position of the prediction model in the clustering space as a clustering center, and the second sample pairs are clustered to be close to different clustering centers to form a plurality of categories;
and taking the prediction model corresponding to the same class clustering center as the corresponding optimal prediction model of the patient variable in the class.
In one possible implementation, selecting a plurality of test sets corresponding to different patient variables from the second sample pair, and testing all the predictive models to generate test evaluation data includes:
testing all the prediction models through the test set to obtain output data of the prediction models as calculation data;
comparing the LARS score in the test set serving as reference data with the calculation data to obtain difference data of the prediction model; the difference data comprises at least one of accuracy, specificity, sensitivity, positive predictive value, and negative predictive value;
and acquiring a receiver operating characteristic (ROC) curve and the area under the curve (AUC) according to the difference data to serve as test evaluation data.
In a second aspect, an embodiment of the present application provides a prediction system, including:
an acquisition module configured to acquire patient data of a target patient as input data;
a calculation module configured to input the input data into the post-colorectal cancer LARS prediction model and obtain output data as a prediction result of post-colorectal cancer LARS.
Compared with the prior art, the invention has the following advantages and beneficial effects:
according to the construction method and the prediction system of the colorectal cancer postoperative LARS prediction model, different patient variables correspond to different prediction models, which avoids the poor LARS prediction accuracy of a single model; when the prediction model is used, LARS prediction data with good accuracy can be obtained simply by selecting the optimal prediction model corresponding to the patient variables.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic diagram of the steps of an embodiment of the method of the present application;
FIG. 2 is a system architecture diagram according to an embodiment of the present application;
FIG. 3 is a graph illustrating the ROC curve and the AUC according to an embodiment of the present application.
Detailed description of the preferred embodiments
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Further, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some of the embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Please refer to FIG. 1, which is a schematic flow chart of a method for constructing a colorectal cancer postoperative LARS prediction model according to an embodiment of the present invention. The method may specifically include the following steps S1 to S7.
S1: obtaining a first sample pair from a sample library; the first sample pair is a correspondence of a patient variable and a LARS score;
S2: preprocessing a patient variable in the first sample pair to generate a second sample pair;
S3: selecting a training set from the second sample pair, and training a plurality of models through the training set to generate a plurality of prediction models corresponding to different models;
S4: selecting a plurality of groups of test sets corresponding to different patient variables from the second sample pair, and testing all the prediction models to generate test evaluation data;
S5: constructing a clustering space by using the test evaluation data and the corresponding patient variables, and calibrating all the prediction models in the clustering space;
S6: performing semi-supervised clustering analysis on all second sample pairs according to the test evaluation data to generate clustering results; the clustering result is an optimal prediction model corresponding to the patient variable in the second sample pair;
S7: and taking the patient variables as input data of the post-colorectal cancer LARS prediction model, and taking the corresponding output data of the optimal prediction model as output data of the post-colorectal cancer LARS prediction model to complete the construction of the post-colorectal cancer LARS prediction model.
In the prior art, research on medical prediction models has been carried out. Chinese patent application No. 202111071269.5 discloses a method for constructing a prediction model for survival rate after lung cancer surgery and a prediction model system, in which survival after lung cancer surgery is predicted from clinical data including gene mutation typing. The method comprises: a data acquisition step, in which clinical data after lung cancer surgery are acquired; a preprocessing step, in which the clinical data are classified and grouped to obtain modeling-group clinical data and verification-group clinical data; a risk factor screening step, in which risk factors are screened from the modeling-group clinical data to obtain risk factor data and overall survival data; and a regression analysis step, in which regression analysis is performed on the risk factor data and the overall survival data. The clinical data after lung cancer surgery comprise gene mutation typing, age, tumor size, lymph node metastasis and surgery mode. The inventor finds that, due to the complexity of medical samples, an accurate prediction result is difficult to obtain no matter how parameter screening and model correction are carried out.
In the embodiment of the application, in order to effectively improve the prediction accuracy of the prediction model, after the first sample pair extracted from the sample library is processed into the second sample pair, a plurality of different models are trained to generate corresponding different prediction models, the models can be any of a logistic regression model, a support vector machine, a decision tree model, a random forest model and a neural network model, and the trained prediction models also correspond to the types of the models. The inventor finds that the prediction accuracy of each model on different patient variables is limited, so that classification is performed by adopting a semi-supervised learning mode of a preset clustering center, and the aim is to select the patient variables which can be better predicted by each prediction model.
In the embodiment of the application, each clustering center is set to a prediction model; one way of doing so is to calibrate the prediction model at the position of the patient variable for which its test evaluation data are best. Cluster analysis is then performed on the second sample pairs, which determines the patient variables that each prediction model predicts well; that prediction model is the optimal prediction model for those patient variables. During subsequent prediction, the data related to the patient are converted into patient variables and input into the corresponding optimal prediction model, so that the strengths of the different prediction models can be exploited and the prediction accuracy is greatly improved. According to the embodiment of the application, different patient variables correspond to different prediction models, which avoids the poor LARS prediction accuracy of a single model; when the prediction model is used, LARS prediction data with good accuracy can be obtained simply by selecting the optimal prediction model corresponding to the patient variables.
In one possible implementation, the patient variables include demographic data, tumor location, tumor length and diameter, surgical procedure, prophylactic stoma, neoadjuvant therapy, and tumor distance from the dentate line.
In one possible implementation, preprocessing the patient variable in the first sample pair to generate a second sample pair includes:
interpolating missing values in the patient variables based on a random forest method to form interpolated patient variables;
performing feature selection on the interpolated patient variables, screening out features whose mutual correlation is greater than a preset value as correlated features, and performing decorrelation calculation on the correlated features to form corrected patient variables;
the corrected patient variable is normalized to generate a second sample pair.
In the embodiment of the application, the sample library is generally built from regular follow-up visits to patients: the patients' LARS symptoms are evaluated with the LARS scale, and the LARS outcome at 6 months after surgery is used as the outcome measure. Some data in the first sample pair may therefore be missing, for example when a patient was away during a given week and the data for that follow-up could not be collected. In the embodiment of the present application, these data are filled in with a random forest method, which is generally used to impute continuous variables such as age and the distance between the anastomosis and the anal verge.
Meanwhile, in practice, the inventor finds that many features in the patient variables are strongly correlated with one another, and that this strong correlation reduces the reliability of the model. In the embodiment of the application, decorrelation calculation is therefore performed on the correlated features to reduce this influence.
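The final normalization step of the preprocessing can be illustrated with a short sketch. The source does not state which scaling is used; the example below assumes z-score standardization (zero mean, unit variance per feature), and the feature names and values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Z-score each feature column: (x - mean) / population standard deviation.

    Assumption: z-scoring is used; the source only says that the corrected
    patient variables are normalized.
    """
    scaled = {}
    for name, values in columns.items():
        mu = mean(values)
        sigma = pstdev(values)
        # A constant column carries no information; map it to zeros.
        scaled[name] = [0.0 if sigma == 0 else (v - mu) / sigma for v in values]
    return scaled


# Hypothetical corrected patient variables, one list per feature.
corrected = {
    "age": [45.0, 52.0, 60.0, 38.0, 71.0],
    "tumor_distance_cm": [4.0, 6.5, 5.0, 8.0, 3.5],
}
second_sample_features = standardize(corrected)
```

Scaling every feature to a comparable range also matters later, because the clustering step compares patient variables by their position in a common space.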
In one possible implementation, screening out features of the interpolated patient variables whose mutual correlation is greater than a preset value as correlated features, and performing decorrelation calculation on the correlated features to form corrected patient variables, includes:
calculating the variance inflation factor (VIF) of each feature in the interpolated patient variables, and taking the features with a VIF greater than 10 as the correlated features;
and splitting the correlated features, and removing the correlated parts from the splitting result to form corrected patient variables.
In the embodiment of the present application, a specific calculation scheme is provided: after feature selection, multicollinearity among the independent variables is screened by calculating the variance inflation factor (VIF), and variables with VIF > 10 are eliminated.
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i$ is the multiple correlation coefficient obtained by regressing the $i$-th independent variable on the remaining independent variables.
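As an illustration of the VIF screening, the following minimal sketch regresses each feature on the remaining features by ordinary least squares and computes VIF = 1 / (1 - R^2). The helper names (`solve_linear`, `r_squared`, `vif`) and the sample values are hypothetical and not taken from the source; only the threshold of 10 comes from the text.

```python
from statistics import mean


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictor columns, with intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Variance inflation factor of every feature against all the others."""
    result = {}
    for name, y in columns.items():
        others = [vals for key, vals in columns.items() if key != name]
        result[name] = 1.0 / (1.0 - r_squared(y, others))
    return result


# Hypothetical data: tumor diameter is roughly half the tumor length.
features = {
    "age": [45.0, 52.0, 60.0, 38.0, 71.0, 66.0, 49.0, 58.0],
    "tumor_length": [3.0, 4.5, 5.0, 2.5, 6.0, 4.0, 3.5, 5.5],
    "tumor_diameter": [1.6, 2.2, 2.6, 1.2, 3.1, 1.9, 1.8, 2.7],
}
vif_values = vif(features)
correlated_features = [name for name, value in vif_values.items() if value > 10]
```

A perfectly collinear feature would give R^2 = 1 and a division by zero, so a production implementation would need to guard that case; the sketch leaves it out for brevity.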
For correlated features that can be split, the correlated part is removed after splitting. For example, one feature is that the patient takes two medicines, A and B, and another feature is difficult defecation, and medicine B is highly correlated with difficult defecation. In this case, the feature that the patient takes medicines A and B is split into the feature that the patient takes medicine A and the feature that the patient takes medicine B, and the latter is eliminated. This effectively improves the stability of the model.
In one possible implementation, the plurality of models includes at least three of logistic regression, support vector machine, decision tree, random forest, and neural network.
In a possible implementation manner, the selecting of the training set and the test set includes:
arranging the second sample pairs into a development queue;
randomly selecting a sample pair with a first preset proportion from the development queue as a training set, and selecting a sample pair with a second preset proportion as a testing set; the sum of the first preset proportion and the second preset proportion is 1;
training a prediction model through the training set, and testing the prediction model through the test set;
and circularly selecting a training set and a testing set according to the first preset proportion and the second preset proportion to train and test the prediction models until the number of the prediction models trained by each model meets the expectation.
In the embodiment of the application, dividing the samples into a training set and a test set for model training belongs to the prior art and is not described further here. However, because samples are scarce, model training in the embodiment of the present application is performed by resampling: a first preset proportion and a second preset proportion are determined, a training set and a test set are selected, and a model is trained; a new training set and test set are then randomly selected according to the same proportions and another model is trained, and so on until enough prediction models have been obtained. This provides a sufficient pool of candidates for the later selection of prediction models.
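The resampling loop described above can be sketched as follows. The function name `repeated_splits`, the 70/30 proportions, the number of rounds and the toy cohort are all hypothetical choices made for illustration; the source only requires that the two proportions sum to 1.

```python
import random


def repeated_splits(sample_pairs, train_fraction, n_rounds, seed=0):
    """Yield (training set, test set) pairs drawn repeatedly from one cohort.

    The test fraction is implicitly 1 - train_fraction, so the two
    proportions always sum to 1.
    """
    rng = random.Random(seed)
    n_train = round(len(sample_pairs) * train_fraction)
    for _ in range(n_rounds):
        shuffled = sample_pairs[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


# Hypothetical development cohort of (patient variables, LARS score) pairs.
cohort = [({"age": 40 + i}, 20 + i % 3) for i in range(10)]

split_sizes = []
for train_set, test_set in repeated_splits(cohort, train_fraction=0.7, n_rounds=5):
    # Each round would train one prediction model per model type on train_set
    # and evaluate it on test_set; here only the split sizes are recorded.
    split_sizes.append((len(train_set), len(test_set)))
```

Fixing the seed keeps the sequence of splits reproducible, which makes it possible to trace each trained prediction model back to the exact samples it saw.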
In one possible implementation, the constructing a clustering space and the performing semi-supervised clustering analysis comprise:
constructing a clustering space according to the patient variables, and calibrating the prediction model in the clustering space according to the test evaluation data;
clustering analysis is carried out by taking the position of the prediction model in the clustering space as a clustering center, and the second sample pairs are clustered to be close to different clustering centers to form a plurality of categories;
and taking the prediction model corresponding to the same class clustering center as the corresponding optimal prediction model of the patient variable in the class.
In the embodiment of the application, the calibration process has been explained above. Because the clustering centers are fixed in advance, the scheme is a typical semi-supervised learning approach: clustering produces a plurality of categories, each gathered around one clustering center. The patient variables in a category therefore correspond to one prediction model, which is the optimal prediction model for the patient variables in that category.
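A minimal sketch of calibration and clustering with preset centers is given below. It rests on several assumptions that the source does not spell out: each test group is represented by a single point in the patient-variable space, the AUC is the evaluation figure used for calibration, distance is Euclidean, and the centers are not updated after assignment. All names and numbers are hypothetical.

```python
import math


def calibrate_centers(evaluations):
    """Place each model at the patient-variable point where it scored best.

    evaluations maps a model name to a list of (patient_vector, auc) pairs,
    one pair per test group.
    """
    return {
        model: max(results, key=lambda item: item[1])[0]
        for model, results in evaluations.items()
    }


def assign_optimal_model(patient_vector, centers):
    """Return the model whose preset clustering center is nearest (Euclidean)."""
    return min(centers, key=lambda model: math.dist(patient_vector, centers[model]))


# Hypothetical test evaluation data on two test groups (standardized 2-D vectors).
evaluations = {
    "logistic_regression": [((-1.0, 0.5), 0.81), ((1.2, -0.3), 0.66)],
    "random_forest": [((-1.0, 0.5), 0.74), ((1.2, -0.3), 0.85)],
}
centers = calibrate_centers(evaluations)

# Cluster the second sample pairs around the fixed centers.
clusters = {}
for vector in [(-0.8, 0.4), (1.0, 0.0), (0.9, -0.6)]:
    clusters.setdefault(assign_optimal_model(vector, centers), []).append(vector)
```

Keeping the centers fixed is what distinguishes this from ordinary k-means: the categories are defined by the models rather than discovered from the data.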
In one possible implementation, selecting a plurality of test sets corresponding to different patient variables from the second sample pair, and testing all the predictive models to generate test evaluation data includes:
testing all the prediction models through the test set to obtain output data of the prediction models as calculation data;
comparing the LARS score in the test set serving as reference data with the calculation data to obtain difference data of the prediction model; the difference data comprises at least one of accuracy, specificity, sensitivity, positive predictive value, and negative predictive value;
and acquiring a receiver operating characteristic (ROC) curve and the area under the curve (AUC) according to the difference data to serve as test evaluation data.
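The difference data listed above can be computed from a confusion matrix. The sketch below assumes that both the reference LARS outcome and the model output have already been reduced to binary labels (1 for LARS, 0 for no LARS); the source does not say how this dichotomization is done, and the function name `difference_data` is hypothetical.

```python
def difference_data(reference, predicted):
    """Accuracy, sensitivity, specificity, PPV and NPV from binary labels."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
    }


# Hypothetical reference outcomes and model outputs for six test patients.
metrics = difference_data([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```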
In the embodiment of the present application, please refer to FIG. 3, which shows the receiver operating characteristic (ROC) curve and the area under the curve (AUC) used as test evaluation data, plotted in terms of sensitivity and specificity. The ROC curve is formed from the results of the model: the AUC is 0.832, the optimal threshold is 0.540, and the corresponding sensitivity and specificity are 0.911 and 0.717, indicating that the model has good predictive ability.
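For readers who want to reproduce this kind of evaluation, the sketch below computes the AUC from pairwise comparisons of positive and negative cases and picks a threshold by maximizing Youden's J (sensitivity + specificity - 1). The use of Youden's J is an assumption: the source reports an optimal threshold but does not say how it was chosen. The labels and scores are toy values, not the data behind FIG. 3.

```python
def auc(labels, scores):
    """Probability that a random positive case scores above a random negative one."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def best_threshold(labels, scores):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    A case is predicted positive when its score is >= the threshold.
    Ties are resolved in favor of the lowest threshold.
    """
    n_pos = labels.count(1)
    n_neg = labels.count(0)
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t)
        tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < t)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:
            best = (j, t, sensitivity, specificity)
    return best[1:]


# Toy example: two patients without LARS (0) and two with LARS (1).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
toy_threshold, toy_sensitivity, toy_specificity = best_threshold(toy_labels, toy_scores)
```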
To facilitate the explanation of the prediction system, please refer to fig. 2, which is a schematic diagram of a communication architecture of the prediction system according to an embodiment of the present invention. Wherein the prediction system may comprise:
an acquisition module configured to acquire patient data of a target patient as input data;
a calculation module configured to input the input data into the post-colorectal cancer LARS prediction model and obtain output data as a prediction result of post-colorectal cancer LARS.
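The two modules can be sketched as small classes. This is only one possible arrangement under stated assumptions: the class names, the feature names, the Euclidean nearest-center routing and the constant-valued stand-in models are all hypothetical, and real trained prediction models would replace the lambdas.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass
class AcquisitionModule:
    """Turns a patient record into an ordered vector of patient variables."""

    feature_order: Sequence[str]

    def acquire(self, patient_record: Dict[str, float]) -> Vector:
        return tuple(float(patient_record[name]) for name in self.feature_order)


@dataclass
class CalculationModule:
    """Routes a patient vector to its optimal model and returns that model's output."""

    centers: Dict[str, Vector]
    models: Dict[str, Callable[[Vector], float]]

    def predict(self, patient_vector: Vector) -> Tuple[str, float]:
        chosen = min(
            self.centers,
            key=lambda name: math.dist(patient_vector, self.centers[name]),
        )
        return chosen, self.models[chosen](patient_vector)


acquisition = AcquisitionModule(feature_order=["age_z", "tumor_distance_z"])
calculation = CalculationModule(
    centers={"logistic_regression": (-1.0, 0.5), "random_forest": (1.2, -0.3)},
    # Stand-in models returning fixed values; real models would be trained.
    models={
        "logistic_regression": lambda x: 0.30,
        "random_forest": lambda x: 0.70,
    },
)

patient_vector = acquisition.acquire({"age_z": 1.0, "tumor_distance_z": 0.0})
model_name, lars_prediction = calculation.predict(patient_vector)
```

Separating acquisition from calculation keeps the feature ordering in one place, so the vector passed to a model always matches the layout used when the clustering centers were calibrated.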
Those of ordinary skill in the art will appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative: the division into units is only one logical division, and other divisions are possible in practice. For example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The elements described as separate parts may or may not be physically separate.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. The method for constructing the colorectal cancer postoperative LARS prediction model is characterized by comprising the following steps:
obtaining a first sample pair from a sample library; the first sample pair is a correspondence of a patient variable and a LARS score;
preprocessing a patient variable in the first sample pair to generate a second sample pair;
selecting a training set from the second sample pair, and training a plurality of models through the training set to generate a plurality of prediction models corresponding to different models;
selecting a plurality of groups of test sets corresponding to different patient variables from the second sample pair, and testing all the prediction models to generate test evaluation data;
constructing a clustering space from the test evaluation data and the corresponding patient variables, and defining all the prediction models in the clustering space;
performing semi-supervised clustering analysis on all second sample pairs according to the test evaluation data to generate clustering results; the clustering result is an optimal prediction model corresponding to the patient variable in the second sample pair;
and taking the patient variables as input data of the post-colorectal cancer LARS prediction model, and taking the corresponding output data of the optimal prediction model as output data of the post-colorectal cancer LARS prediction model to complete the construction of the post-colorectal cancer LARS prediction model.
2. The method of claim 1, wherein the patient variables include demographic data, tumor location, tumor length and diameter, surgical procedure, prophylactic stoma, neoadjuvant therapy, and distance from the tumor to the dentate line.
3. The method of claim 1, wherein preprocessing the patient variables in the first sample pair to generate a second sample pair comprises:
imputing missing values in the patient variables using a random forest method to form interpolated (imputed) patient variables;
performing feature selection on the interpolated patient variables, screening out features whose correlation is greater than a preset value as correlated features, and performing decorrelation calculation on the correlated features to form corrected patient variables;
and normalizing the corrected patient variables to generate the second sample pair.
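The imputation and normalization steps of claim 3 can be illustrated with the short sketch below. Two substitutions are made and should be read as assumptions: median imputation stands in for the random forest imputation named in the claim (which would need a machine-learning library), and z-score standardization is assumed as the normalization, since the claim does not say which normalization is used. The decorrelation step is not shown here. All names and values are hypothetical.

```python
import statistics


def impute_median(column):
    """Replace None with the median of the observed values.

    Stand-in only: the claim specifies random forest imputation.
    """
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in column]


def standardize(column):
    """Z-score normalization (assumed): subtract the mean, divide by the population std."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd if sd else 0.0 for v in column]


# Toy column with one missing value; values are illustrative only.
raw = [2.0, None, 4.0, 6.0]
imputed = impute_median(raw)
normalized = standardize(imputed)
print(imputed, normalized)
```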
4. The method of claim 3, wherein screening out features of the interpolated patient variables whose correlation is greater than a preset value as correlated features, and performing decorrelation calculation on the correlated features to form corrected patient variables, comprises:
calculating the variance inflation factor (VIF) of each feature in the interpolated patient variables, and taking the features whose VIF is greater than 10 as the correlated features;
and splitting the correlated features, and removing the correlated parts from the splitting result to form the corrected patient variables.
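The VIF screening step can be sketched as follows. The sketch relies on the standard identity that the VIF of feature j equals the j-th diagonal element of the inverse of the feature correlation matrix. Only the screening against the threshold of 10 is shown; the subsequent splitting and removal of correlated parts is not specified in enough detail to sketch. Feature names and values are hypothetical, and the code assumes no constant or perfectly collinear columns.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long feature columns."""
    n = len(columns[0])
    unit = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])  # fails on constant columns (norm == 0)
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit] for ci in unit]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (fails on singular matrices)."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each feature: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Toy features: feature_a and feature_b are nearly collinear, feature_c is unrelated.
features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [1.1, 1.9, 3.0, 4.1, 4.9],
    "feature_c": [1.0, -1.0, 0.0, -1.0, 1.0],
}
factors = vif(list(features.values()))
correlated = [name for name, f in zip(features, factors) if f > 10]
print(correlated)
```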
5. The method of claim 1, wherein the plurality of models includes at least three of logistic regression, support vector machines, decision trees, random forests, and neural networks.
6. The method of claim 1, wherein the selecting the training set and the testing set comprises:
forming a development cohort from the second sample pairs;
randomly selecting sample pairs in a first preset proportion from the development cohort as a training set, and selecting sample pairs in a second preset proportion as a test set; the sum of the first preset proportion and the second preset proportion is 1;
training a prediction model through the training set, and testing the prediction model through the test set;
and repeatedly selecting a training set and a test set according to the first preset proportion and the second preset proportion to train and test prediction models, until the number of prediction models trained for each model reaches the expected number.
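The repeated random selection in claim 6 can be sketched as below. The 0.7 / 0.3 proportions, the number of repetitions, and the function names are assumptions for illustration; the claim only requires that the two proportions sum to 1.

```python
import random


def split_once(pairs, train_fraction, rng):
    """Shuffle a copy of the sample pairs and cut it into a training and a test set."""
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def repeated_splits(pairs, train_fraction, n_models, seed=0):
    """Draw one random train/test split per prediction model to be trained."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    return [split_once(pairs, train_fraction, rng) for _ in range(n_models)]


# Integers stand in for second sample pairs; the proportions are assumed values.
pairs = list(range(10))
splits = repeated_splits(pairs, train_fraction=0.7, n_models=5)
for train_set, test_set in splits:
    print(len(train_set), len(test_set))
```

Because every split is drawn from the same cohort, the test set of one repetition overlaps the training sets of others; the resulting evaluations are therefore not independent of one another.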
7. The method for constructing a model for predicting postoperative LARS for colorectal cancer according to claim 1, wherein constructing a clustering space and performing semi-supervised clustering analysis comprises:
constructing a clustering space according to the patient variables, and calibrating the prediction model in the clustering space according to the test evaluation data;
performing clustering analysis with the positions of the prediction models in the clustering space as clustering centers, so that the second sample pairs are grouped around different clustering centers to form a plurality of categories;
and taking the prediction model corresponding to the clustering center of a category as the optimal prediction model for the patient variables in that category.
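One simplified way to picture claim 7 is a single nearest-center assignment with fixed centers, sketched below. This is a stand-in for the semi-supervised clustering named in the claims, not a reproduction of it: how the models are calibrated into the clustering space from the test evaluation data is not shown, and the coordinates, model names, and sample identifiers are hypothetical.

```python
import math


def assign_to_models(samples, model_centers):
    """Map each sample to the model whose center is nearest in Euclidean distance."""
    assignment = {}
    for sample_id, point in samples.items():
        assignment[sample_id] = min(
            model_centers,
            key=lambda name: math.dist(point, model_centers[name]),
        )
    return assignment


# Hypothetical positions of two prediction models in a 2-D clustering space.
model_centers = {
    "logistic_regression": (0.0, 0.0),
    "random_forest": (1.0, 1.0),
}
# Hypothetical patient-variable vectors projected into the same space.
samples = {"s1": (0.1, 0.2), "s2": (0.9, 0.7), "s3": (0.2, 0.1)}

optimal_model = assign_to_models(samples, model_centers)
print(optimal_model)
```

At prediction time, a new patient would be routed the same way: locate the nearest center and use the model attached to it.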
8. The method of claim 1, wherein selecting a plurality of test sets corresponding to different patient variables from the second sample pair for testing all the predictive models to generate test evaluation data comprises:
testing all the prediction models through the test set to obtain output data of the prediction models as calculation data;
comparing the LARS score in the test set serving as reference data with the calculation data to obtain difference data of the prediction model; the difference data comprises at least one of accuracy, specificity, sensitivity, positive predictive value, and negative predictive value;
and obtaining a receiver operating characteristic (ROC) curve and the area under the curve (AUC) from the difference data to serve as test evaluation data.
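The difference data listed in claim 8 are the usual confusion-matrix measures. A minimal sketch follows, assuming that both the reference LARS score and the model output have already been reduced to binary labels (1 = LARS, 0 = no LARS); the claims do not state how that reduction is done. The function name and the toy labels are hypothetical.

```python
def difference_data(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV and NPV from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Toy reference labels and model outputs, for illustration only.
reference = [1, 1, 1, 0, 0, 0, 1, 0]
calculated = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = difference_data(reference, calculated)
print(metrics)
```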
9. A prediction system using the colorectal cancer postoperative LARS prediction model constructed by the method of any one of claims 1 to 8, characterized by comprising:
an acquisition module configured to acquire patient data of a target patient as input data;
a calculation module configured to input the input data into the post-colorectal cancer LARS prediction model and obtain output data as a prediction result of post-colorectal cancer LARS.
CN202310088636.5A 2023-02-09 2023-02-09 Construction method and prediction system of colorectal cancer postoperative LARS prediction model Active CN115938590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310088636.5A CN115938590B (en) 2023-02-09 2023-02-09 Construction method and prediction system of colorectal cancer postoperative LARS prediction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310088636.5A CN115938590B (en) 2023-02-09 2023-02-09 Construction method and prediction system of colorectal cancer postoperative LARS prediction model

Publications (2)

Publication Number Publication Date
CN115938590A (en) 2023-04-07
CN115938590B (en) 2023-05-02

Family

ID=85838728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310088636.5A Active CN115938590B (en) 2023-02-09 2023-02-09 Construction method and prediction system of colorectal cancer postoperative LARS prediction model

Country Status (1)

Country Link
CN (1) CN115938590B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580846A (en) * 2023-07-05 2023-08-11 四川大学华西医院 Colorectal cancer prognosis risk model construction method and system based on correlation analysis
CN117393171A (en) * 2023-12-11 2024-01-12 四川大学华西医院 Method and system for constructing prediction model of LARS development track after rectal cancer operation

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101389957A (en) * 2005-12-23 2009-03-18 环太平洋生物技术有限公司 Prognosis prediction for colorectal cancer
CN104915560A (en) * 2015-06-11 2015-09-16 万达信息股份有限公司 Method for disease diagnosis and treatment scheme based on generalized neural network clustering
US20160358290A1 (en) * 2012-04-20 2016-12-08 Humana Inc. Health severity score predictive model
CN109599157A (en) * 2018-11-29 2019-04-09 同济大学 A kind of accurate intelligent diagnosis and treatment big data system
CN111223569A (en) * 2019-04-25 2020-06-02 岭南师范学院 LARS diabetes prediction method based on feature weight
CN113538070A (en) * 2020-10-30 2021-10-22 深圳市九九互动科技有限公司 User life value cycle detection method and device and computer equipment
TWM627900U (en) * 2022-02-15 2022-06-01 新光人壽保險股份有限公司 Repurchase predictive model system
US20220229071A1 (en) * 2017-11-02 2022-07-21 Prevencio, Inc. Diagnostic and prognostic methods for peripheral arterial diseases, aortic stenosis, and outcomes

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101389957A (en) * 2005-12-23 2009-03-18 环太平洋生物技术有限公司 Prognosis prediction for colorectal cancer
US20160358290A1 (en) * 2012-04-20 2016-12-08 Humana Inc. Health severity score predictive model
CN104915560A (en) * 2015-06-11 2015-09-16 万达信息股份有限公司 Method for disease diagnosis and treatment scheme based on generalized neural network clustering
US20220229071A1 (en) * 2017-11-02 2022-07-21 Prevencio, Inc. Diagnostic and prognostic methods for peripheral arterial diseases, aortic stenosis, and outcomes
CN109599157A (en) * 2018-11-29 2019-04-09 同济大学 A kind of accurate intelligent diagnosis and treatment big data system
CN111223569A (en) * 2019-04-25 2020-06-02 岭南师范学院 LARS diabetes prediction method based on feature weight
CN113538070A (en) * 2020-10-30 2021-10-22 深圳市九九互动科技有限公司 User life value cycle detection method and device and computer equipment
TWM627900U (en) * 2022-02-15 2022-06-01 新光人壽保險股份有限公司 Repurchase predictive model system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
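MING JUN HUANG et al.: "Development of prediction model of low anterior resection syndrome for colorectal cancer patients after surgery based on machine-learning technique", 《CANCER MEDICINE》 *
Meng Qianwen (孟倩雯): "Research on the readmission prediction problem supporting personalized signs", 《China Master's Theses Full-text Database, Medicine and Health Sciences》 *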

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580846A (en) * 2023-07-05 2023-08-11 四川大学华西医院 Colorectal cancer prognosis risk model construction method and system based on correlation analysis
CN116580846B (en) * 2023-07-05 2023-09-15 四川大学华西医院 Colorectal cancer prognosis risk model construction method and system based on correlation analysis
CN117393171A (en) * 2023-12-11 2024-01-12 四川大学华西医院 Method and system for constructing prediction model of LARS development track after rectal cancer operation
CN117393171B (en) * 2023-12-11 2024-02-20 四川大学华西医院 Method and system for constructing prediction model of LARS development track after rectal cancer operation

Also Published As

Publication number Publication date
CN115938590B (en) 2023-05-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant