CN114743685B - Endometrial cancer risk screening method and system based on artificial intelligence - Google Patents
- Publication number
- CN114743685B (application CN202210338149.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- training
- model
- endometrial cancer
- structured
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Data Mining & Analysis (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
The invention provides an artificial intelligence-based endometrial cancer risk screening method and system, wherein the method comprises a training process and a verification process. The system comprises: a data processing module, used for performing data processing, extracting feature information from the data, and performing labeling to form structured data; an Xgboost loss model module, used for training on or predicting from the structured data to obtain trained or predicted data; a Lasso model simplification module, used for feature simplification to obtain a feature data matrix; and a BR ridge model module, used for obtaining an endometrial cancer risk screening prediction model from the trained data and an endometrial cancer risk prediction value from the predicted data. The invention can be widely used for cross-sectional risk screening of routine outpatient or physical-examination patients and avoids missing high-risk patients; automatic risk screening also eases doctors' work and improves efficiency. The risk prediction model can be moved forward to the patient side for preliminary screening, which facilitates patients' self-monitoring and self-management.
Description
Technical Field
The invention relates to the technical field of natural language processing and machine learning, in particular to an endometrial cancer risk screening method and system based on artificial intelligence.
Background
Endometrial cancer is one of the three major cancers of the female reproductive system, and its incidence and mortality are rising continuously worldwide. As the incidence of endometrial cancer increases and patients become younger, early screening of patients at high risk of endometrial lesions is becoming increasingly important.
Currently, the common screening methods for endometrial cancer are tumor markers combined with transvaginal ultrasound of the uterus and bilateral adnexa, or endometrial sampling combined with cytology. However, the first method is more suitable for postmenopausal women; for women of childbearing age, the screening information provided by ultrasound is limited, a cut-off value is lacking, and the accuracy is low. The second method is more suitable for further screening of patients at high risk of endometrial cancer: endometrial sampling is invasive for the patient, the disposable consumables are costly, and cytological examination places high demands on a hospital's level of pathological diagnosis. Thus, compared with the current, more mature early screening system for cervical cancer, effective screening methods for endometrial cancer and precancerous lesions are lacking.
Disclosure of Invention
The invention uses natural language processing and machine learning technology to screen for the risk of endometrial cancer and precancerous lesions based on patients' electronic medical records, past clinical auxiliary examination data, and similar information. Unstructured text such as medical records and examination data is converted into structured data through several artificial intelligence algorithms: a rule-based method is used to extract the features relevant to endometrial cancer and represent them in structured form, the features are vectorized, and the results are input into a pre-designed algorithm for data analysis, mining, and model training, thereby constructing a risk screening system for endometrial cancer and precancerous lesions. Given a patient's electronic medical record, past clinical auxiliary examination data, and similar information as input, the risk screening system outputs a risk prediction value. This avoids wasting manpower and material resources and improves the accuracy of disease screening and diagnosis, making the system an auxiliary diagnosis tool with high sensitivity, specificity, stability, and reliability.
The invention provides an artificial intelligence-based endometrial cancer risk screening method, which is characterized by comprising a training process and a verification process, wherein the training process comprises the following steps: firstly, performing data processing, namely taking medical record data information of an endometrial cancer patient as an original data set, extracting characteristic information in a text of the original data set, and performing labeling processing to form a structured training data set; secondly, taking the structured training data set as input, and training the structured training data set by adopting an improved Xgboost loss model to obtain a trained training data set; then, simplifying the trained training data set through a Lasso model to obtain a characteristic data matrix; and finally, taking the characteristic data matrix and the trained training data set as input, and training through a BR ridge model to obtain an endometrial cancer risk screening prediction model.
Further, the verification process of the risk screening method comprises the following steps: firstly, inputting new medical record data information of a patient with endometrial cancer, extracting the characteristic information from the medical record data information, and performing labeling treatment to form structured data information; secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information; and finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process.
Further, before the structured training data set is trained with the improved Xgboost loss model, feature processing needs to be performed on the structured training data set, and the feature processing includes the following steps: firstly, splitting the feature information in the structured training data set to generate a plurality of independent features; secondly, converting the independent features into numerical features, wherein one-hot binarization is applied to the text features to obtain discrete features, while continuous features are not one-hot binarized, that is, they are not discretized; then, deleting abnormal feature values beyond the normal range, and deleting the medical record samples in the original data set that lack a label; and finally, leaving the missing values in the feature information unfilled.
Furthermore, before the BR ridge model is trained, feature processing is further required to be performed on the trained training data set, and the feature processing includes the following steps: firstly, splitting the characteristic information of the trained training data set to generate a plurality of independent characteristic information; secondly, performing one-hot binarization processing on the independent characteristic information respectively, and converting all the characteristic information into discrete characteristics; then deleting abnormal characteristic information beyond a normal range, and deleting sample information in the original data set lacking a tag; finally, the missing value in the characteristic information is filled with 0.
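Further, the overall objective function of the improved Xgboost loss model is formulated as follows:
$$Obj = \sum_{i} FL(p_{t,i}) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}, \qquad FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
wherein T represents the number of leaf nodes, and w represents the scores of the leaf nodes; in the regularization term, γ is used for controlling the number of leaf nodes, and λ is used for controlling the scores of the leaf nodes; if the true label is 1, then $p_t = p$ and $\alpha_t = \alpha$; otherwise, when the true label is 0, then $p_t = 1 - p$ and $\alpha_t = 1 - \alpha$, where p is the predicted probability of the positive class.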
Further, training the structured training data set with the improved Xgboost loss model to obtain a trained training data set comprises the following steps: firstly, loading the structured training data set, dividing a part of its samples into 10 folds for cross-validating the training model, and using the remaining samples as an independent verification set to verify the effect of the endometrial cancer risk screening prediction model; secondly, inputting the structured training data set into the improved Xgboost loss model to obtain the indexes of the leaf nodes, which form new combined features, analyzing the new feature columns against the true labels by correlation analysis, selecting the features with larger correlation and combining them with the original features, and training the improved Xgboost loss model again on the combined features with a grid search method to find an optimal model; finally, the grid search method automatically searches for the optimal parameters on the new training set, and 10-fold cross-validation is used during the parameter search.
The invention also provides an artificial intelligence-based endometrial cancer risk screening system, which is characterized by comprising: the data processing module is used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting characteristic information in a text of the original data, and performing labeling processing to form structured data; the Xgboost loss model module is used for taking the structured data as input, and training or predicting the structured data by adopting the improved Xgboost loss model to obtain the trained or predicted data; the Lasso model simplifying module is used for simplifying the trained data through the Lasso model to obtain a characteristic data matrix; and the BR ridge model module is used for taking the characteristic data matrix and the trained or predicted data as input, training or predicting through the BR ridge model, obtaining an endometrial cancer risk screening prediction model aiming at the trained data, and obtaining the predicted endometrial cancer risk prediction value aiming at the predicted data.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the endometrial cancer risk screening method based on artificial intelligence.
The present invention also provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of any of the above-described methods for screening for risk of endometrial cancer based on artificial intelligence.
The beneficial effects of the invention are as follows:
Based on information such as patients' electronic medical records and auxiliary examination data, the endometrial cancer risk screening method and system provided by the invention can be widely used for cross-sectional risk screening of routine outpatient or physical-examination patients, so that high-risk patients are not missed. Automatic risk screening also eases doctors' work, improves efficiency, and reduces the influence of differences in doctors' skill across hospital levels and training backgrounds. The risk prediction model can be moved forward to the patient side for preliminary screening, which facilitates patients' self-monitoring and self-management.
Meanwhile, the method and the system provided by the invention provide objective and reliable auxiliary diagnosis reference results for doctors in primary hospitals to screen patients with high risk of endometrial cancer and women in daily physical examination to self-determine endometrial cancer risk. The system screens out high-risk people in a noninvasive mode for further diagnosis and screening, and is beneficial to doctors to establish a stepped screening strategy for endometrial cancer.
Drawings
Fig. 1 shows a schematic step diagram of the training process of the present invention.
Fig. 2 shows a schematic step diagram of the verification process of the present invention.
Fig. 3 shows a verification result diagram at a verification set.
Fig. 4 shows a schematic structural diagram of an endometrial cancer risk screening system.
Detailed Description
The following examples and experimental examples are illustrative of the present invention and are not intended to limit the scope of the present invention. The present invention will be further described with reference to specific examples and experimental examples.
The invention provides an artificial intelligence-based endometrial cancer risk screening method comprising a training process and a verification process; the steps of the training process are shown schematically in Fig. 1. Firstly, data processing is performed: medical record data of endometrial cancer patients is taken as the original data set, feature information is extracted from the text of the original data set, and labeling is performed to form a structured training data set. Secondly, the structured training data set is taken as input and trained with an improved Xgboost loss model to obtain a trained training data set. Then, the trained training data set is simplified through a Lasso model to obtain a feature data matrix. Finally, the feature data matrix and the trained training data set are taken as input and trained through a BR ridge model to obtain an endometrial cancer risk screening prediction model.
A schematic of the steps of the verification process of the present invention is shown in fig. 2. The verification process includes: firstly, inputting new medical record data information of a patient with endometrial cancer, extracting the characteristic information from the medical record data information, and performing labeling treatment to form structured data information; secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information; and finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process.
Specifically, in the data processing step, features are first preliminarily selected from information such as the patient's electronic medical record for endometrial cancer, and the key information is extracted as features according to the items shown in Table 1.
Preoperative TCT (liquid-based cytology) | History of oral contraceptive use |
Most recent preoperative HPV (human papillomavirus) | Age |
Personal history of other tumors | Menstrual cycle (days) |
Delivery mode | Age at menopause (years) |
Main symptoms | Number of pregnancies (times) |
Complications | Parity (times) |
Preoperative ultrasound findings | Lowest preoperative HGB (g/L) |
Combined lesions visible on ultrasound | BMI |
Family history of tumors | Menstrual period changes |
Menopause (yes/no) | Age at menarche (years) |
Smoking (yes/no) | Menstrual period (days) |
Intrauterine device | Preoperative CA125 (U/ml) |
Postmenopausal hormone replacement therapy (HRT) | Predictive score |
Preoperative progestogen use |
Table 1. Feature information to be extracted according to the invention
Then, the feature information is labeled to obtain a structured data set. Specifically, the label of each case, i.e., the outcome label, is taken from the postoperative pathology result. Table 2 below lists the pathological results marked as positive for endometrial cancer and precancerous lesions: "endometrial dysplasia", "endometrial hyperplasia" and the various endometrial cancer subtypes are considered positive. Cases whose pathological result is clearly "normal endometrium", "proliferation stage endometrium", "secretion stage endometrium" or "simple endometrial hyperplasia" are considered negative. Cases with any other pathology result are not included in the model training.
Atypical hyperplasia of endometrium | Endometrial cancer |
Endometrial hyperplasia | Endometrial serous carcinoma |
Endometrial cancer | Serous adenocarcinomas of the endometrium |
Endometrial clear cell carcinoma |
Table 2 pathological results marked as positive for endometrial cancer and precancerous lesions
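The labeling rule described above can be sketched as a small lookup, given here only as an illustration. The function name `outcome_label` and the exact strings are assumptions; real pathology reports would need the rule-based text extraction described earlier before such a lookup could apply.

```python
from typing import Optional

# Label sets copied from the text and Table 2; the exact report wording
# in a real data set is an assumption.
POSITIVE = {
    "atypical hyperplasia of endometrium",
    "endometrial dysplasia",
    "endometrial hyperplasia",
    "endometrial cancer",
    "endometrial serous carcinoma",
    "serous adenocarcinomas of the endometrium",
    "endometrial clear cell carcinoma",
}
NEGATIVE = {
    "normal endometrium",
    "proliferation stage endometrium",
    "secretion stage endometrium",
    "simple endometrial hyperplasia",
}


def outcome_label(pathology: str) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (excluded from training)."""
    key = pathology.strip().lower()
    if key in NEGATIVE:
        return 0
    if key in POSITIVE:
        return 1
    return None  # any other result is left out of model training
```

Exact matching keeps "simple endometrial hyperplasia" (negative) separate from "endometrial hyperplasia" (positive), which a substring match would confuse.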
In the model training step, the improved Xgboost loss model is trained first. Before the structured training data set is trained with the improved Xgboost loss model, however, feature processing must be performed on it, and the feature processing includes the following steps:
First, to reduce the interaction between the original features, the original feature information is split to generate a plurality of individual features.
Second, the individual text features are converted into numerical features. Converting all features into discrete features by one-hot binarization gave a poor training effect, so one-hot binarization is applied only to part of the text features, and continuous features are not one-hot binarized, that is, they are not discretized.
Then, abnormal feature values beyond the normal range are deleted, and samples lacking a label are deleted.
Finally, most missing values in the feature information are left unfilled, because the Xgboost loss model itself contains a mechanism for handling null values.
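The feature processing steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name `preprocess`, the record layout, and the choice to blank an out-of-range value to NaN (rather than drop the whole sample) are not specified in the source.

```python
import math


def preprocess(records, text_cols, valid_ranges, label_key="label"):
    """One-hot encode text columns, keep continuous columns numeric,
    blank out-of-range values, drop unlabeled rows, leave gaps unfilled."""
    # Samples without a label are removed.
    rows = [r for r in records if r.get(label_key) is not None]
    # Categories observed for each text column.
    cats = {
        c: sorted({r[c] for r in rows if r.get(c) is not None})
        for c in text_cols
    }
    out = []
    for r in rows:
        row = {label_key: r[label_key]}
        for c in text_cols:
            for v in cats[c]:
                row[f"{c}={v}"] = 1 if r.get(c) == v else 0
        for c, (lo, hi) in valid_ranges.items():
            x = r.get(c)
            # Missing or abnormal values stay NaN; they are not imputed.
            ok = x is not None and lo <= x <= hi
            row[c] = float(x) if ok else math.nan
        out.append(row)
    return out
```

Leaving NaN in place matches the stated reliance on the tree model's own handling of null values at this stage.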
After the data set is subjected to the characteristic processing, the improved Xgboost loss model training is started, and the specific steps are as follows:
First, the training data set obtained after feature processing and labeling is loaded. Most of its samples are divided into 10 folds for cross-validating the training model, and the remaining processed samples are used as an independent verification set to verify the model effect.
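A minimal sketch of this split is shown below. The hold-out fraction of 20%, the fixed seed, and the function name `split_folds` are assumptions made for illustration; the source only states that most samples go to the 10 folds.

```python
import random


def split_folds(n_samples, holdout_fraction=0.2, n_folds=10, seed=0):
    """Shuffle sample indices, reserve an independent verification set,
    and spread the remaining indices over n_folds folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_holdout = int(n_samples * holdout_fraction)
    holdout, rest = idx[:n_holdout], idx[n_holdout:]
    # Striding keeps fold sizes within one sample of each other.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return folds, holdout
```

Keeping the fold assignment fixed matters later, because the Lasso and BR ridge steps are described as reusing the same sample grouping.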
Second, the Xgboost loss model is improved. Because the number of samples used for cross-validation in the previous step is much larger than the number of samples used for independent verification, a data imbalance arises, which may give the model poor generalization ability. To improve the stability of the trained model, the invention does not apply up-sampling or down-sampling at the data level, but instead improves the training process of the Xgboost loss model.
The overall objective function of the Xgboost loss model has two parts: the first measures the difference between the predicted score and the true score, and the second is a regularization term. The formula is as follows:
$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}$$
where T represents the number of leaf nodes and w represents the scores of the leaf nodes. γ controls the number of leaf nodes, and λ keeps the leaf scores from becoming too large, which prevents overfitting.
In the original Xgboost loss model, the first part, $l(y_i, \hat{y}_i)$, is a cross-entropy loss function; with it, the aforementioned data imbalance problem can lead to small model thresholds and an unstable model.
The invention replaces the cross-entropy loss function with the focal loss function, whose formula is as follows:
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where, if the true label is 1, then $p_t = p$ and $\alpha_t = \alpha$; otherwise, when the true label is 0, then $p_t = 1 - p$ and $\alpha_t = 1 - \alpha$, with p being the predicted probability of the positive class.
By adjusting the α and γ hyperparameters, the focal loss function changes the loss weight of the class with fewer samples, so that the prediction of the Xgboost model is not biased toward the class with more samples and the data imbalance problem is handled better. The exponent γ in the focal loss is a separate hyperparameter from the coefficient γ of the regularization term.
The overall objective function of the improved Xgboost loss model is as follows:
$$Obj = \sum_{i} FL(p_{t,i}) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}$$
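To make the focal loss concrete, the sketch below evaluates it and its derivative with respect to the raw score z, where p = sigmoid(z); a gradient of this kind is what a boosted-tree library would need from a custom objective. The default values α = 0.25 and γ = 2 and the function names are assumptions, since the source does not give the values it used, and the second derivative a library would also require is omitted here.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(z, y, alpha=0.25, gamma=2.0):
    """Derivative of the focal loss with respect to the raw score z."""
    p = 1.0 / (1.0 + math.exp(-z))
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    sign = 1.0 if y == 1 else -1.0  # dp_t/dz changes sign for y = 0
    inner = gamma * p_t * math.log(p_t) + p_t - 1.0
    return sign * a_t * (1.0 - p_t) ** gamma * inner
```

With γ = 0 the expression reduces to an α-weighted cross entropy, and for γ > 0 well-classified samples (p_t close to 1) contribute much less loss than hard ones.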
Third, to further improve the performance of model training, the original training data set is input into the improved Xgboost model to obtain the indexes of the leaf nodes, which form new combined features. However, adding too many feature columns to the original training data set may cause overfitting. In this embodiment, an RM correlation analysis method is therefore used to analyze the new feature columns against the true labels, the features with larger correlation are selected and combined with the original features, and the improved Xgboost model is trained again on the combined features with a grid search method to find an optimal model.
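The selection of leaf-index features by correlation can be sketched as below. The source names an "RM correlation analysis method" without defining it, so this illustration swaps in the ordinary Pearson correlation; the threshold of 0.3 and the function names are likewise assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def select_correlated(columns, labels, min_abs_corr=0.3):
    """Keep the names of new feature columns whose absolute correlation
    with the true labels reaches the threshold."""
    return [
        name
        for name, col in columns.items()
        if abs(pearson(col, labels)) >= min_abs_corr
    ]
```

Only the columns that pass this filter would be appended to the original features before the second round of training.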
Fourth, the grid search method automatically searches for the optimal parameters on the new training set, and 10-fold cross-validation is used during the parameter search.
In this step, to ensure that good model parameters are found, so that the model performs well on both the training set and the verification set, the evaluation criterion is the average AUC obtained during cross-validation: the parameter values giving the highest average AUC are selected. Specifically, given a candidate range for each model parameter, the method trains on every combination of parameter values; each combination is cross-validated and yields a mean AUC value.
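The mean-AUC criterion can be sketched as follows, using the pairwise definition of AUC (the probability that a positive sample is scored above a negative one). The function names and the `(labels, scores)` layout of the per-fold results are assumptions.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly;
    tied scores count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc(fold_results):
    """Average the AUC over the validation folds, where fold_results
    is a list of (labels, scores) pairs, one per fold."""
    return sum(auc(y, s) for y, s in fold_results) / len(fold_results)
```

Each parameter combination would be scored by `mean_auc` over its 10 validation folds, and the combination with the highest value kept.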
Because the model has many parameters, giving every parameter a range and searching the whole space at once would make the training time intolerable, since the training time grows with the product of the numbers of candidate values of all parameters. This embodiment therefore uses grouping to approximate the optimal solution.
First, the parameters are divided into groups of two, and the optimal solution for two parameters is found at a time, so that the total search cost is the sum, over the groups, of the product of the numbers of candidate values of the two parameters in each group. Second, the next two optimal parameters are searched for on the basis of the previous optimal solution. Then the optimal values of the remaining parameters are found in the same way. Finally, an ROC curve is drawn from the optimal model and the test set.
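The grouped search can be sketched as a greedy loop over pairs of parameters. The function name `grouped_grid_search`, the use of each parameter's first candidate as its starting value, and the `score` callback (standing in for a cross-validated mean AUC) are assumptions for illustration.

```python
from itertools import product


def grouped_grid_search(grid, score, group_size=2):
    """Tune parameters group by group, fixing each group's best values
    before moving on. Returns the best setting found and the number of
    score evaluations performed."""
    names = list(grid)
    best = {name: grid[name][0] for name in names}  # starting point
    evaluations = 0
    for start in range(0, len(names), group_size):
        group = names[start:start + group_size]
        best_score, best_combo = None, None
        for combo in product(*(grid[n] for n in group)):
            trial = {**best, **dict(zip(group, combo))}
            s = score(trial)
            evaluations += 1
            if best_score is None or s > best_score:
                best_score, best_combo = s, combo
        best.update(zip(group, best_combo))
    return best, evaluations
```

For four parameters with 3, 2, 3 and 2 candidate values, this evaluates 3×2 + 3×2 = 12 combinations instead of the 36 of a full grid, at the price of possibly missing interactions between parameters in different groups.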
After the training data set has been trained with the improved Xgboost loss model, the trained training data set is simplified by feature selection with a Lasso model. The Lasso model has an L1 penalty, which can effectively shrink the coefficients of non-key features to 0, helping to reduce the computational workload and the risk of overfitting. The Lasso simplification comprises the following steps:
First, the hyperparameter α = 0.1 is set, intercept fitting is enabled, the maximum number of iterations is set to 10000, and the feature updated in each iteration is selected at random.
Second, the original training data matrix corresponding to the BR ridge model in the feature preprocessing is substituted into the Lasso model, and the model is trained with the same sample grouping as the improved Xgboost model.
For the training on each fold, according to the feature coefficients output by the Lasso model, the features whose coefficients have an absolute value >= 0.01 are retained for the next step; that is, the corresponding columns of the data matrix are kept and the remaining feature columns are discarded. The threshold of 0.01 depends on the values of the labels, and a person skilled in the art should adjust it accordingly when the label values differ.
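The column-retention rule can be sketched in a few lines. The coefficients below are made-up values standing in for the output of a fitted Lasso model, and the function name is hypothetical. The source does not name a library; the settings in the first step happen to match the `alpha`, `fit_intercept`, `max_iter` and `selection="random"` parameters of scikit-learn's `Lasso`, which is noted here only as a possible correspondence.

```python
def keep_columns(matrix, coefficients, threshold=0.01):
    """Keep the columns whose Lasso coefficient has absolute value >= threshold.
    Returns (reduced_matrix, kept_column_indexes)."""
    kept = [j for j, c in enumerate(coefficients) if abs(c) >= threshold]
    reduced = [[row[j] for j in kept] for row in matrix]
    return reduced, kept

# Hypothetical coefficients for four feature columns.
coefficients = [0.25, 0.0, -0.04, 0.005]
matrix = [[1, 0, 1, 0],
          [0, 1, 1, 1],
          [1, 1, 0, 0]]
reduced, kept = keep_columns(matrix, coefficients)
```

The comparison uses the absolute value, so a feature with a negative coefficient such as -0.04 is retained.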
Since the Xgboost predicted value for each fold of test data is calculated from the corresponding training data and therefore contains information about that training data, the subsequent training is likewise performed on each pair of training set and test set, and the grouping of the samples is consistent with that used by the improved Xgboost model.
After the results of the Xgboost loss model and of the Lasso model are obtained, the feature data matrix and the trained training data set are used as input for training with the BR ridge model: the Xgboost predicted value is appended as a new feature column to the feature data matrix simplified by the Lasso model, forming a new feature data matrix that is input into the BR ridge model to obtain the endometrial cancer risk screening prediction model.
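Forming the new feature data matrix amounts to appending one column. A minimal sketch, with a hypothetical function name and toy values:

```python
def append_prediction_column(reduced_matrix, xgb_predictions):
    """Append each sample's Xgboost predicted value as a final feature column."""
    if len(reduced_matrix) != len(xgb_predictions):
        raise ValueError("one prediction is needed per sample")
    return [row + [pred] for row, pred in zip(reduced_matrix, xgb_predictions)]

# Rows of a Lasso-reduced matrix and the matching Xgboost predicted values.
reduced_matrix = [[1, 1], [0, 1], [1, 0]]
xgb_predictions = [0.82, 0.35, 0.10]
stacked = append_prediction_column(reduced_matrix, xgb_predictions)
```

Each prediction must come from the model trained on the fold pairing that the row belongs to, so that the stacked column carries no information from the row's own label.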
Before the BR ridge model is trained, the trained feature data matrix must undergo feature processing, which comprises the following steps:
First, to reduce the interaction between the multiple pieces of original feature information, the original feature information is split to generate multiple pieces of independent feature information.
Second, one-hot binarization is performed on each piece of independent feature information, converting all of the independent feature information into discrete features.
Then, abnormal feature values outside the normal range are deleted, and samples lacking a label are deleted.
Finally, missing (null) feature values are filled with 0, yielding a training data set that can be used directly for training.
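The four steps above can be sketched as follows. The field names `menopause` and `age`, the category list and the valid range are hypothetical, and the numeric column is kept only to illustrate the range check; the sketch also assumes that an out-of-range value is deleted (treated as missing and then filled with 0) rather than the whole sample being dropped.

```python
def preprocess(records, categories, valid_ranges):
    """One-hot encode categorical fields, blank out-of-range numeric values,
    drop unlabeled samples and fill missing values with 0."""
    rows, labels = [], []
    for rec in records:
        if rec.get("label") is None:          # sample without a label: drop it
            continue
        row = []
        for name, values in categories.items():
            # One-hot encoding; a missing category yields all zeros.
            row.extend(1 if rec.get(name) == v else 0 for v in values)
        for name, (low, high) in valid_ranges.items():
            value = rec.get(name)
            if value is None or not (low <= value <= high):
                value = 0                     # missing or abnormal value becomes 0
            row.append(value)
        rows.append(row)
        labels.append(rec["label"])
    return rows, labels

records = [
    {"menopause": "yes", "age": 61, "label": 1},
    {"menopause": "no", "age": 430, "label": 0},    # abnormal age
    {"menopause": None, "age": None, "label": 0},   # missing features
    {"menopause": "yes", "age": 55, "label": None}, # missing label
]
categories = {"menopause": ["yes", "no"]}
valid_ranges = {"age": (18, 100)}
rows, labels = preprocess(records, categories, valid_ranges)
```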
In the BR ridge model training step, as with the Xgboost loss model, 10-fold cross validation is adopted: the data set is divided into 10 folds in advance and each fold is used in turn as the test set, so that the two models are trained and tested on the same samples. Model prediction and evaluation are subsequently carried out on each given training set and its corresponding test set.
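Sharing one fold assignment between the two models can be sketched as follows; the function names, the fixed seed and the sample count are assumptions made for illustration.

```python
import random

def make_folds(n_samples, k=10, seed=0):
    """Shuffle the sample indexes once and deal them into k folds."""
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)
    return [indexes[i::k] for i in range(k)]

def train_test_pairs(folds):
    """Yield (train_indexes, test_indexes), using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# The same `folds` object is reused for the Xgboost stage and the BR stage,
# so both stages see identical training and test samples.
folds = make_folds(95)
pairs = list(train_test_pairs(folds))
```

Computing the folds once and passing them to both stages is what guarantees the consistency the text requires; re-splitting inside each stage would not.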
The basic principle of the BR ridge model is to perform maximum a posteriori (MAP) estimation of the parameter w according to Bayes' theorem, given the likelihood function of the target value y under the parameter w. The formula is as follows:
wherein $\sigma_0$ and $\sigma$ are, respectively, the standard deviation of the parameter w and the standard deviation of the target value y given the parameter w. This formula is used as the loss function of the BR algorithm and is solved by gradient descent.
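For reference, a standard form consistent with these definitions is given here as a sketch. It assumes a Gaussian likelihood and a zero-mean Gaussian prior; the symbols $x_i$, $n$ and the linear predictor $w^{\top} x_i$ are introduced for illustration and do not appear in this text.

$$p(y \mid X, w) = \prod_{i=1}^{n} \mathcal{N}\!\left(y_i \mid w^{\top} x_i,\ \sigma^{2}\right), \qquad p(w) = \mathcal{N}\!\left(w \mid 0,\ \sigma_0^{2} I\right)$$

$$\hat{w}_{\mathrm{MAP}} = \arg\max_{w}\ p(w \mid X, y) = \arg\min_{w} \left[ \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \left(y_i - w^{\top} x_i\right)^{2} + \frac{1}{2\sigma_0^{2}} \lVert w \rVert_2^{2} \right]$$

Under these assumptions the second term acts as a ridge penalty whose strength relative to the squared error is $\sigma^{2} / \sigma_0^{2}$.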
After the combined feature data matrix is substituted into the BR algorithm, an optimal coefficient is fitted for each feature; these coefficients can be used directly in subsequent applications, and the predicted value of each case under the linear model is recorded. An ROC curve is then drawn against the actual labels of the test set to evaluate the prediction performance, and the optimal prediction threshold is determined from it and used for decisions in later prediction.
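Threshold selection from the ROC curve can be sketched as follows. The source does not state which criterion defines the optimal threshold, so maximizing Youden's J (TPR minus FPR) is an assumption here, as are the function names and the toy scores.

```python
def roc_points(labels, scores):
    """(threshold, FPR, TPR) for each distinct score, where a case is
    predicted positive when its score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points

def best_threshold(labels, scores):
    """Threshold that maximizes Youden's J = TPR - FPR (assumed criterion)."""
    return max(roc_points(labels, scores), key=lambda p: p[2] - p[1])[0]

# Toy test-set labels and linear-model predicted values.
labels = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
points = roc_points(labels, scores)
threshold = best_threshold(labels, scores)
```

A screening application might prefer a criterion that weights sensitivity more heavily than specificity; the same `roc_points` output supports that by changing the key function.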
In this embodiment, the risk assessment model of the present invention is verified using data from a hospital different from that of the training set as the verification set; fig. 3 shows the verification results on the verification set. New endometrial cancer patient data can be used directly as input to the model to obtain an endometrial cancer risk prediction value.
The invention also provides an endometrial cancer risk screening system based on artificial intelligence, as shown in fig. 4, the risk screening system comprises: the data processing module is used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting characteristic information in a text of the original data, and performing labeling processing to form structured data; the Xgboost loss model module is used for taking the structured data as input, and training or predicting the structured data by adopting the improved Xgboost loss model to obtain the trained or predicted data; the Lasso model simplifying module is used for simplifying the trained data through the Lasso model to obtain a characteristic data matrix; and the BR ridge model module is used for taking the characteristic data matrix and the trained or predicted data as input, training or predicting through the BR ridge model, obtaining an endometrial cancer risk screening prediction model aiming at the trained data, and obtaining the predicted endometrial cancer risk prediction value aiming at the predicted data.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the endometrial cancer risk screening methods based on artificial intelligence.
The present invention also provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of any of the aforementioned methods for screening for risk of endometrial cancer based on artificial intelligence.
The feature coefficients generated by this embodiment are robust for patient data drawn from the same distribution. Specifically, a user inputs an electronic medical record and clinical auxiliary examination results; the embodiment converts the input information into features and automatically binarizes them to obtain a data matrix that the Xgboost model and the BR model can accept; the data matrix is fed into the Xgboost model and then into the BR model, and the endometrial cancer risk prediction value for the medical record is output, thereby predicting the risk of endometrial cancer and precancerous lesions.
The invention improves the Xgboost model, simplifies the result of the improved Xgboost model using the Lasso model, combines the training results of the two models, and takes them as the input of the BR ridge model for further model training to obtain the final risk prediction result.
The improved Xgboost model replaces the cross entropy loss function with the focal loss function, which handles imbalanced data, so that the trained model is not biased toward the class with more samples; the model is therefore more stable and its predictions more accurate. The Lasso model reduces the high-dimensional feature space to a low-dimensional one, which benefits the generalization of the model. The BR ridge model is used in the classification model of the invention to correct some of the errors of the preceding models, thereby fusing the models and improving training performance.
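The effect of the focal loss can be illustrated numerically. The sketch below uses the commonly published binary focal loss form with a focusing parameter `gamma` and a class weight `alpha`; the exact form and parameter values used in the invention are not given in this passage, so both are assumptions.

```python
from math import log

def cross_entropy(p, y):
    """Binary cross entropy for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross entropy scaled by alpha_t * (1 - p_t) ** gamma.
    gamma is the focusing parameter and alpha the class weight (assumed values)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)

# A well-classified positive (p = 0.95) versus a badly classified one (p = 0.10).
easy_ratio = focal_loss(0.95, 1) / cross_entropy(0.95, 1)
hard_ratio = focal_loss(0.10, 1) / cross_entropy(0.10, 1)
```

The ratio of focal loss to cross entropy is far smaller for the well-classified case than for the badly classified one, so the many easy samples of the majority class contribute little to the total loss.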
Claims (7)
1. An endometrial cancer risk screening method based on artificial intelligence is characterized in that the risk screening method comprises a training process and a verification process,
wherein the training process comprises:
firstly, performing data processing, namely taking medical record data information of an endometrial cancer patient as an original data set, extracting characteristic information in a text of the original data set, and performing labeling processing to form a structured training data set;
secondly, taking the structured training data set as input, and training the structured training data set by adopting an improved Xgboost loss model to obtain a trained training data set;
then, simplifying the trained training data set through a Lasso model to obtain a characteristic data matrix;
finally, taking the characteristic data matrix and the trained training data set as input, and training through a Bayesian ridge regression (BR) model to obtain an endometrial cancer risk screening prediction model;
the verification process of the risk screening method comprises the following steps:
firstly, inputting new medical record data information of a patient with endometrial cancer, extracting the characteristic information from the medical record data information, and performing labeling treatment to form structured data information;
secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information;
finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process;
wherein the formula of the overall objective function of the improved Xgboost loss model is as follows:
;
;
wherein $T$ represents the number of leaf nodes and $w$ represents the score of a leaf node; $\gamma$ is used for controlling the number of leaf nodes and $\lambda$ is used for controlling the scores of the leaf nodes; if the real label is 1, then $p_t = p$; otherwise, when the real label is 0, $p_t = 1 - p$.
2. An artificial intelligence based endometrial cancer risk screening method according to claim 1, wherein said structured training data set is further subjected to a feature processing prior to said training of said structured training data set using a modified Xgboost loss model, said feature processing comprising the steps of:
firstly, splitting the characteristic information in the structured training data set to generate a plurality of independent characteristic information;
secondly, converting the independent characteristic information into numerical characteristic information, wherein one-hot binarization processing is performed on the text characteristic information to obtain discrete characteristics, and the continuous characteristic information is not subjected to one-hot binarization processing, that is, it is not discretized;
then deleting abnormal characteristic information beyond a normal range, and deleting medical record data sample information in the original dataset lacking the label;
and finally, not filling the missing value in the characteristic information.
3. The method of claim 1, wherein before training the Bayesian ridge regression (BR) model, the trained training data set is further subjected to feature processing, the feature processing comprising the steps of:
firstly, splitting the characteristic information of the trained training data set to generate a plurality of independent characteristic information;
secondly, performing one-hot binarization processing on the independent characteristic information respectively, and converting all the characteristic information into discrete characteristics;
then deleting abnormal characteristic information beyond a normal range, and deleting sample information in the original data set lacking a tag;
finally, the missing value in the characteristic information is filled with 0.
4. An artificial intelligence based endometrial cancer risk screening method according to claim 1, wherein said structured training data set is trained using an improved Xgboost loss model to obtain a trained training data set, said training method comprising the steps of:
firstly, loading the structured training data set, dividing a part of samples in the structured training data set into 10 folds for cross-verifying a training model, and verifying the effect of the endometrial cancer risk screening prediction model by taking the rest samples in the structured training data set as an independent verification set;
secondly, inputting the structured training data set into the improved Xgboost loss model to obtain an index of leaf nodes to form a new feature combination, analyzing the new feature combination column and the real label by adopting correlation analysis, selecting features with larger correlation to combine with original features, training the improved Xgboost loss model by using a grid search method again by using the combined new data features, and finding out an optimal model;
finally, the grid search method is used for automatically searching the optimal parameters by utilizing the new training set, and 10-fold cross validation is used in the parameter searching process.
5. An artificial intelligence based endometrial cancer risk screening system, the risk screening system comprising:
the data processing module is used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting characteristic information in a text of the original data, and performing labeling processing to form structured data;
the Xgboost loss model module is used for taking the structured data as input, and training or predicting the structured data by adopting the improved Xgboost loss model to obtain the trained or predicted data;
the Lasso model simplifying module is used for simplifying the trained data through the Lasso model to obtain a characteristic data matrix;
a Bayesian ridge regression (BR) model module, configured to use the feature data matrix and the trained or predicted data as inputs, perform training or prediction by using the Bayesian ridge regression (BR) model, obtain an endometrial cancer risk screening prediction model for the trained data, and obtain the predicted endometrial cancer risk prediction value for the predicted data;
wherein the formula of the overall objective function of the improved Xgboost loss model is as follows:
;
;
wherein $T$ represents the number of leaf nodes and $w$ represents the score of a leaf node; $\gamma$ is used for controlling the number of leaf nodes and $\lambda$ is used for controlling the scores of the leaf nodes; if the real label is 1, then $p_t = p$; otherwise, when the real label is 0, $p_t = 1 - p$.
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of an artificial intelligence based endometrial cancer risk screening method according to any one of claims 1-5 when said program is executed.
7. A computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of an artificial intelligence based endometrial cancer risk screening method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210338149.5A CN114743685B (en) | 2022-04-01 | 2022-04-01 | Endometrial cancer risk screening method and system based on artificial intelligence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114743685A CN114743685A (en) | 2022-07-12 |
CN114743685B true CN114743685B (en) | 2024-01-05 |
Family ID=82280540
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210338149.5A Active CN114743685B (en) | 2022-04-01 | 2022-04-01 | Endometrial cancer risk screening method and system based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114743685B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116705289B (en) * | 2023-05-23 | 2023-12-19 | 北京透彻未来科技有限公司 | Cervical pathology diagnosis device based on semantic segmentation network |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017220782A1 (en) * | 2016-06-24 | 2017-12-28 | Molecular Health Gmbh | Screening method for endometrial cancer |
CN111524554A (en) * | 2020-04-24 | 2020-08-11 | 上海海洋大学 | Cell activity prediction method based on LINCS-L1000 perturbation signal |
CN112102879A (en) * | 2020-07-31 | 2020-12-18 | 蒋涛 | System and method for predicting chemotherapy curative effect of advanced lung cancer |
CN112530592A (en) * | 2020-12-14 | 2021-03-19 | 青岛大学 | Non-small cell lung cancer risk prediction method based on machine learning |
CN112831567A (en) * | 2021-03-04 | 2021-05-25 | 苏州大学 | Marker of endometrial cancer and detection kit thereof |
KR20210081547A (en) * | 2019-12-24 | 2021-07-02 | 연세대학교 산학협력단 | Methods for poviding information about responses to cancer immunotherapy and devices using the same |
KR20210108682A (en) * | 2020-02-26 | 2021-09-03 | 계명대학교 산학협력단 | Method for Providing Information on Predicting Breast Cancer Lymph Node Metastasis Using Machine Learning |
CN114023448A (en) * | 2021-12-10 | 2022-02-08 | 华中科技大学同济医学院附属同济医院 | Construction method of endometrial cancer prediction diagnosis model, diagnosis model and diagnosis device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170003291A1 (en) * | 2015-06-27 | 2017-01-05 | William Beaumont Hospital | Methods for detecting, diagnosing and treating endometrial cancer |
Non-Patent Citations (1)
Title |
---|
Chinese Journal of Laboratory Diagnosis, 2018 (Vol. 22), general table of contents; Chinese Journal of Laboratory Diagnosis (No. 12); 1-2 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||