CN114743685B - Endometrial cancer risk screening method and system based on artificial intelligence - Google Patents

Endometrial cancer risk screening method and system based on artificial intelligence

Info

Publication number
CN114743685B
Authority
CN
China
Prior art keywords
data
training
model
endometrial cancer
structured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210338149.5A
Other languages
Chinese (zh)
Other versions
CN114743685A (en
Inventor
朱兰
王姝
刘西洋
高颖
王晓东
郭丰
刘创
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Original Assignee
Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking Union Medical College Hospital Chinese Academy of Medical Sciences filed Critical Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Priority to CN202210338149.5A priority Critical patent/CN114743685B/en
Publication of CN114743685A publication Critical patent/CN114743685A/en
Application granted granted Critical
Publication of CN114743685B publication Critical patent/CN114743685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention provides an endometrial cancer risk screening method and system based on artificial intelligence, wherein the method comprises a training process and a verification process. The system comprises a data processing module, an Xgboost loss model module, a Lasso model simplification module and a BR ridge model module, wherein the data processing module is used for performing data processing, extracting characteristic information from the data, and performing labeling processing to form structured data; the Xgboost loss model module is used for training on or predicting from the structured data to obtain trained or predicted data; the Lasso model simplification module is used for feature simplification to obtain a feature data matrix; and the BR ridge model module is used for obtaining an endometrial cancer risk screening prediction model from the trained data and an endometrial cancer risk prediction value from the predicted data. The invention can be widely used for cross-sectional risk screening of routine outpatient or physical-examination patients and avoids missing high-risk patients; the automatic risk screening also eases doctors' work and improves efficiency; the risk prediction model can be moved forward to the patient end for preliminary screening, which facilitates self-monitoring and management by the patient.

Description

Endometrial cancer risk screening method and system based on artificial intelligence
Technical Field
The invention relates to the technical field of natural language processing and machine learning, in particular to an endometrial cancer risk screening method and system based on artificial intelligence.
Background
Endometrial cancer is one of the three major cancers of the female reproductive system, and its incidence and death rate are rising continuously worldwide. As the incidence increases and patients present at younger ages, early screening of patients at high risk of endometrial lesions is becoming increasingly important.
Currently, the common screening methods for endometrial cancer are tumor markers combined with transvaginal ultrasound of the uterus and bilateral adnexa, or endometrial sampling combined with cytology. The first method is better suited to postmenopausal women; for women of childbearing age, ultrasound provides limited screening information, cut-off values are lacking, and accuracy is low. The second method is better suited to further screening of patients already at high risk of endometrial cancer: endometrial sampling is invasive for the patient, the disposable sampling devices are costly, and cytological examination places high demands on a hospital's pathological diagnosis level. Thus, compared with the relatively mature early screening systems for cervical cancer, effective screening methods for endometrial cancer and precancerous lesions are lacking.
Disclosure of Invention
The invention uses natural language processing and machine learning to screen for the risk of endometrial cancer and precancerous lesions based on patients' electronic medical records, past clinical auxiliary examination data and similar information. Unstructured text such as medical records and examination data is converted into structured data by several artificial intelligence algorithms: a rule-based method extracts the features relevant to endometrial cancer and represents them in structured form, the features are vectorized, and the results are fed into a pre-designed algorithm for data analysis, mining and model training, thereby constructing a risk screening system for endometrial cancer and precancerous lesions. Given a patient's electronic medical record, past clinical auxiliary examination data and similar information as input, the risk screening system outputs a risk prediction value. This avoids wasting manpower and material resources and improves the accuracy of disease screening and diagnosis, making the system an auxiliary diagnostic tool with high sensitivity, specificity, stability and reliability.
The invention provides an artificial intelligence-based endometrial cancer risk screening method, which is characterized by comprising a training process and a verification process, wherein the training process comprises the following steps: firstly, performing data processing, namely taking medical record data information of an endometrial cancer patient as an original data set, extracting characteristic information in a text of the original data set, and performing labeling processing to form a structured training data set; secondly, taking the structured training data set as input, and training the structured training data set by adopting an improved Xgboost loss model to obtain a trained training data set; then, simplifying the trained training data set through a Lasso model to obtain a characteristic data matrix; and finally, taking the characteristic data matrix and the trained training data set as input, and training through a BR ridge model to obtain an endometrial cancer risk screening prediction model.
Further, the verification process of the risk screening method comprises the following steps: firstly, inputting new medical record data information of a patient with endometrial cancer, extracting the characteristic information from the medical record data information, and performing labeling treatment to form structured data information; secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information; and finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process.
Further, before the structured training data set is trained with the improved Xgboost loss model, feature processing needs to be performed on the structured training data set, and the feature processing includes the following steps: firstly, splitting the characteristic information in the structured training data set to generate a plurality of independent characteristic information; secondly, converting the independent characteristic information into digital characteristic information, wherein one-hot binarization is applied to the text characteristic information to obtain discrete features, while continuous characteristic information is not one-hot binarized, that is, not discretized; then deleting abnormal characteristic values beyond the normal range, and deleting the medical record samples in the original data set that lack a label; and finally, leaving the missing values in the characteristic information unfilled.
Furthermore, before the BR ridge model is trained, feature processing is also required on the trained training data set, and the feature processing includes the following steps: firstly, splitting the characteristic information of the trained training data set to generate a plurality of independent characteristic information; secondly, performing one-hot binarization on each piece of independent characteristic information, so that all the characteristic information is converted into discrete features; then deleting abnormal characteristic values beyond the normal range, and deleting the samples in the original data set that lack a label; finally, filling the missing values in the characteristic information with 0.
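Further, the overall objective function of the improved Xgboost loss model is formulated as follows:
$$\mathrm{Obj} = \sum_{i} FL(p_{t,i}) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
wherein FL denotes the focal loss function, T represents the number of leaf nodes, and w represents the scores of the leaf nodes; γ is used for controlling the number of leaf nodes, and λ is used for controlling the scores of the leaf nodes; if the true label is 1, then p_t = p; otherwise, when the true label is 0, then p_t = 1 - p, where p is the predicted probability of the positive class.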
Further, training the structured training data set with the improved Xgboost loss model to obtain a trained training data set comprises the following steps: firstly, loading the structured training data set, dividing a part of its samples into 10 folds for cross-validating the training model, and using the remaining samples as an independent verification set to verify the effect of the endometrial cancer risk screening prediction model; secondly, inputting the structured training data set into the improved Xgboost loss model to obtain the indexes of the leaf nodes, which form new feature combinations, analyzing the new feature columns against the true labels by correlation analysis, selecting the features with higher correlation and combining them with the original features, and training the improved Xgboost loss model again on the combined data features using a grid search method to find the optimal model; finally, the grid search method automatically searches for the optimal parameters on the new training set, and 10-fold cross-validation is used during the parameter search.
The invention also provides an artificial intelligence-based endometrial cancer risk screening system, which is characterized by comprising: the data processing module is used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting characteristic information in a text of the original data, and performing labeling processing to form structured data; the Xgboost loss model module is used for taking the structured data as input, and training or predicting the structured data by adopting the improved Xgboost loss model to obtain the trained or predicted data; the Lasso model simplifying module is used for simplifying the trained data through the Lasso model to obtain a characteristic data matrix; and the BR ridge model module is used for taking the characteristic data matrix and the trained or predicted data as input, training or predicting through the BR ridge model, obtaining an endometrial cancer risk screening prediction model aiming at the trained data, and obtaining the predicted endometrial cancer risk prediction value aiming at the predicted data.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the endometrial cancer risk screening method based on artificial intelligence.
The present invention also provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of any of the above-described methods for screening for risk of endometrial cancer based on artificial intelligence.
The beneficial effects of the invention are as follows:
The endometrial cancer risk screening method and system provided by the invention can be widely used for cross-sectional risk screening of routine outpatient or physical-examination patients, based on information such as the patients' electronic medical records and auxiliary examination data, so that high-risk patients are not missed. At the same time, the automatic risk screening eases doctors' work, improves efficiency, and reduces the influence of differences in service level among doctors of different grades and degrees of training; the risk prediction model can be moved forward to the patient end for preliminary screening, which facilitates self-monitoring and management by the patient.
Meanwhile, the method and system provided by the invention give objective and reliable auxiliary diagnostic reference results to doctors in primary hospitals screening patients at high risk of endometrial cancer, and to women assessing their own endometrial cancer risk during routine physical examination. The system identifies high-risk people in a noninvasive way for further diagnosis and screening, and helps doctors establish a stepwise screening strategy for endometrial cancer.
Drawings
Fig. 1 shows a schematic step diagram of the training process of the present invention.
Fig. 2 shows a schematic step diagram of the verification process of the present invention.
Fig. 3 shows a verification result diagram at a verification set.
Fig. 4 shows a schematic structural diagram of an endometrial cancer risk screening system.
Detailed Description
The following examples and experimental examples are illustrative of the present invention and are not intended to limit the scope of the present invention. The present invention will be further described with reference to specific examples and experimental examples.
The invention provides an endometrial cancer risk screening method based on artificial intelligence, which comprises a training process and a verification process. A schematic diagram of the steps of the training process is shown in Fig. 1. Firstly, performing data processing, namely taking medical record data information of an endometrial cancer patient as an original data set, extracting characteristic information in a text of the original data set, and performing labeling processing to form a structured training data set; secondly, taking the structured training data set as input, and training the structured training data set by adopting an improved Xgboost loss model to obtain a trained training data set; then, simplifying the trained training data set through a Lasso model to obtain a characteristic data matrix; and finally, taking the characteristic data matrix and the trained training data set as input, and training through a BR ridge model to obtain an endometrial cancer risk screening prediction model.
A schematic of the steps of the verification process of the present invention is shown in fig. 2. The verification process includes: firstly, inputting new medical record data information of a patient with endometrial cancer, extracting the characteristic information from the medical record data information, and performing labeling treatment to form structured data information; secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information; and finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process.
Specifically, in the data processing step, first, information such as an electronic medical record of endometrial cancer of a patient is subjected to feature preliminary selection, and key information therein is subjected to feature extraction based on the information shown in table 1.
Preoperative TCT (liquid-based cytology)             History of oral contraceptive use
Most recent preoperative HPV (human papillomavirus)  Age
Personal history of other tumors                     Menstrual cycle length (days)
Delivery mode                                        Age at menopause (years)
Principal symptoms                                   Gravidity (number of pregnancies)
Comorbidities                                        Parity (number of deliveries)
Preoperative ultrasound findings                     Lowest preoperative HGB (g/L)
Concurrent lesions seen on ultrasound                BMI
Family history of tumors                             Changes in menstrual period
Menopausal status (yes/no)                           Age at menarche (years)
Smoking status (yes/no)                              Duration of menses (days)
Intrauterine device                                  Preoperative CA125 (U/ml)
Postmenopausal hormone replacement therapy (HRT)     Predictive score
Preoperative progestogen use
Table 1. Characteristic information to be extracted according to the invention
The characteristic information is then labeled to obtain a structured data set. Specifically, the label of each case, i.e. the outcome label, is taken from the postoperative pathology result. As shown in Table 2 below, pathological results of "endometrial dysplasia", "endometrial hyperplasia" and the various endometrial cancer subtypes are regarded as positive for endometrial cancer and precancerous lesions. Pathological results that are definitely "normal endometrium", "proliferative-phase endometrium", "secretory-phase endometrium" or "simple endometrial hyperplasia" are regarded as negative. The remaining pathology results are not included in model training.
Atypical hyperplasia of endometrium    Endometrial cancer
Endometrial hyperplasia                Endometrial serous carcinoma
Endometrial cancer                     Serous adenocarcinoma of the endometrium
Endometrial clear cell carcinoma
Table 2. Pathological results marked as positive for endometrial cancer and precancerous lesions
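The labeling rule described above can be sketched as a small lookup. This is only an illustration: the English strings are renderings of the entries in the text and Table 2, and the exact string matching, the function name `outcome_label` and the set names are assumptions, not part of the patent.

```python
from typing import Optional

# Illustrative English renderings of the positive entries (Table 2).
POSITIVE = {
    "atypical hyperplasia of endometrium",
    "endometrial hyperplasia",
    "endometrial cancer",
    "endometrial serous carcinoma",
    "serous adenocarcinoma of the endometrium",
    "endometrial clear cell carcinoma",
}
# Illustrative English renderings of the negative results named in the text.
NEGATIVE = {
    "normal endometrium",
    "proliferative-phase endometrium",
    "secretory-phase endometrium",
    "simple endometrial hyperplasia",
}


def outcome_label(pathology: str) -> Optional[int]:
    """Return 1 (positive), 0 (negative) or None (excluded from training)."""
    key = pathology.strip().lower()
    if key in NEGATIVE:
        return 0
    if key in POSITIVE:
        return 1
    return None


# Hypothetical postoperative pathology results for three cases.
results = ["Endometrial clear cell carcinoma", "Normal endometrium", "Endometrial polyp"]
labelled = [(r, outcome_label(r)) for r in results]
training_cases = [(r, y) for r, y in labelled if y is not None]
```

Cases whose result maps to `None` are dropped, which mirrors the statement that the remaining pathology results are not included in model training.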
In the model training step, an improved Xgboost loss model training is first performed. However, before training the structured training data set with the improved Xgboost loss model, it is also necessary to perform a feature processing on the structured training data set, where the feature processing includes the following steps:
First, to reduce the interaction between the original pieces of feature information, the original feature information is split to generate a plurality of individual features.
Secondly, the individual text features are converted into numerical features. Converting all features into discrete features by one-hot binarization gave poor training results, so one-hot binarization is applied only to part of the text features, while continuous features are not one-hot binarized, that is, not discretized.
Then, abnormal feature values beyond the normal range are deleted, and samples lacking a label are deleted.
Finally, most missing values in the feature information are left unfilled, because the Xgboost loss model itself contains a mechanism for handling null values.
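The four feature-processing steps above can be sketched as follows. The feature names, the value ranges and the choice to turn an out-of-range value into a missing value are assumptions made for illustration; the patent does not specify them.

```python
import math
from typing import Dict, List, Sequence, Tuple

NAN = float("nan")  # missing values stay missing for the boosted-tree model


def prepare_for_boosting(
    rows: Sequence[Dict[str, object]],
    text_features: Sequence[str],
    ranges: Dict[str, Tuple[float, float]],
) -> Tuple[List[str], List[List[float]], List[int]]:
    # Samples without a label are removed.
    labelled = [r for r in rows if r.get("label") is not None]

    # Continuous features keep one column each; text features are one-hot encoded.
    columns = list(ranges)
    categories = {}
    for name in text_features:
        categories[name] = sorted({r[name] for r in labelled if r.get(name) is not None})
        columns.extend(f"{name}={c}" for c in categories[name])

    matrix = []
    for r in labelled:
        out = []
        for name, (lo, hi) in ranges.items():
            v = r.get(name)
            # Out-of-range values are deleted, i.e. treated as missing (assumption).
            out.append(float(v) if v is not None and lo <= v <= hi else NAN)
        for name in text_features:
            v = r.get(name)
            for c in categories[name]:
                out.append(NAN if v is None else float(v == c))
        matrix.append(out)
    labels = [int(r["label"]) for r in labelled]
    return columns, matrix, labels


# Hypothetical records; "age", "bmi" and "delivery_mode" are illustrative names.
rows = [
    {"age": 52, "bmi": 27.5, "delivery_mode": "vaginal", "label": 1},
    {"age": 430, "bmi": None, "delivery_mode": "caesarean", "label": 0},
    {"age": 47, "bmi": 22.0, "delivery_mode": None, "label": None},
]
columns, matrix, labels = prepare_for_boosting(
    rows, ["delivery_mode"], {"age": (0.0, 120.0), "bmi": (10.0, 80.0)}
)
```

The second record keeps its row but loses the implausible age, and the third record is dropped because it has no label.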
After the data set is subjected to the characteristic processing, the improved Xgboost loss model training is started, and the specific steps are as follows:
Firstly, the training data set obtained after feature processing and labeling is loaded; most of its samples are divided into 10 folds for cross-validating the training model, and the remaining samples are used as an independent verification set to verify the model effect.
Secondly, the Xgboost loss model is improved. Since the number of samples used for cross-validation in the previous step is much larger than the number of samples used for independent verification, a data imbalance arises, which may lead to poor generalization ability of the model. To improve the stability of the training model, the invention does not apply up-sampling or down-sampling at the data level, but instead improves the training process of the Xgboost loss model.
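A minimal sketch of this split, assuming a simple random shuffle: a fixed number of samples is set aside as the independent verification set and the rest is dealt into 10 folds. The function names, the seed and the size of the verification set are illustrative assumptions, and no stratification by label is attempted here.

```python
import random
from typing import Iterator, List, Tuple


def split_holdout_and_folds(
    n_samples: int, n_holdout: int, n_folds: int = 10, seed: int = 0
) -> Tuple[List[int], List[List[int]]]:
    """Shuffle sample indices, reserve a hold-out set, deal the rest into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    holdout = indices[:n_holdout]
    rest = indices[n_holdout:]
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return holdout, folds


def cv_pairs(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists, using each fold once as the test set."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


holdout, folds = split_holdout_and_folds(n_samples=100, n_holdout=20)
for train_idx, test_idx in cv_pairs(folds):
    pass  # fit on train_idx, evaluate on test_idx
```

Keeping the fold assignment in one place makes it easy to reuse the same grouping for the later Lasso and BR ridge steps, as the description requires.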
The overall objective function of the Xgboost loss model has two parts: the first measures the difference between the predicted score and the true score, and the second is a regular term. The formula is as follows:
$$\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
where l(y_i, ŷ_i) is the loss between the true label y_i and the prediction ŷ_i of sample i, T represents the number of leaf nodes, and w_j represents the score of leaf node j. γ controls the number of leaf nodes, and λ keeps the scores of the leaf nodes from becoming too large, which prevents overfitting.
In the original Xgboost loss model, the first part of the objective function is a cross-entropy loss function; with it, the aforementioned data imbalance problem can lead to small model thresholds and an unstable model.
The invention replaces the cross-entropy loss function with the focal loss function, and the formula is as follows:
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
wherein, if the true label is 1, then p_t = p; otherwise, when the true label is 0, then p_t = 1 - p, where p is the predicted probability of the positive class.
By adjusting the α and γ hyperparameters, the focal loss function can change the loss weight of the class with fewer samples, so that the prediction of the Xgboost model is not biased toward the class with more samples, and the data imbalance problem is handled better.
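The effect of the focal loss can be checked numerically with a short sketch. The convention that α_t equals α for positive samples and 1 - α for negative ones, and the default values α = 0.25 and γ = 2, follow common practice for focal loss and are assumptions here; the patent does not state its values. A gradient-boosting library that accepts a custom objective needs the first and second derivatives of the loss with respect to the raw score, so the sketch also estimates those by finite differences.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) for one sample."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha  # assumed convention
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p: float, y: int) -> float:
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(z: float, y: int, alpha: float, gamma: float, eps: float = 1e-4):
    """Finite-difference gradient and Hessian of the loss w.r.t. the raw score z."""
    f = lambda t: focal_loss(sigmoid(t), y, alpha, gamma)
    grad = (f(z + eps) - f(z - eps)) / (2.0 * eps)
    hess = (f(z + eps) - 2.0 * f(z) + f(z - eps)) / eps ** 2
    return grad, hess


# A well-classified positive (p = 0.9) contributes far less than a hard one (p = 0.1).
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
g, h = grad_hess(0.0, 1, alpha=0.5, gamma=0.0)
```

With γ = 0 and α = 0.5 the focal loss reduces to half the cross-entropy, which gives a simple sanity check for any implementation.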
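The formula of the overall objective function of the improved Xgboost loss model is as follows:
$$\mathrm{Obj} = \sum_{i} FL(p_{t,i}) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
Here the γ in the regular term penalizes the number of leaf nodes and is distinct from the exponent γ inside the focal loss FL.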
Thirdly, to further improve the performance of model training, the original training data set is input into the improved Xgboost model to obtain the indexes of the leaf nodes, which form new combined features. However, adding too many feature columns to the original training data set may cause overfitting. In this embodiment, an RM correlation analysis method is therefore used to analyze the new feature columns against the true labels, the features with higher correlation are selected and combined with the original features, and the improved Xgboost model is trained again on the combined data features using a grid search method to find the optimal model.
Fourthly, the grid search method automatically searches for the optimal parameters on the new training set, and 10-fold cross-validation is used during the parameter search.
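The leaf-index feature step can be sketched as below. The leaf indexes are given as a hypothetical table with one row per sample and one column per tree (in the xgboost library such a table can be obtained by predicting with `pred_leaf=True`). Pearson correlation is swapped in here as a stand-in for the "RM correlation analysis", whose exact definition the text does not give, and the threshold of 0.5 is likewise an assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(vx * vy)


def select_leaf_features(
    leaf_index: Sequence[Sequence[int]],
    labels: Sequence[int],
    min_abs_corr: float = 0.5,
) -> Dict[Tuple[int, int], List[float]]:
    """One-hot encode (tree, leaf) pairs and keep columns correlated with the label."""
    selected = {}
    n_trees = len(leaf_index[0])
    for t in range(n_trees):
        for leaf in sorted({row[t] for row in leaf_index}):
            column = [1.0 if row[t] == leaf else 0.0 for row in leaf_index]
            if abs(pearson(column, labels)) >= min_abs_corr:
                selected[(t, leaf)] = column
    return selected


# Hypothetical leaf indexes for 4 samples and 2 trees.
leaf_index = [[3, 1], [3, 2], [4, 1], [4, 2]]
labels = [1, 1, 0, 0]
selected = select_leaf_features(leaf_index, labels)
```

In this toy data the first tree separates the labels perfectly and both of its leaves are kept, while the second tree's leaves are uncorrelated with the label and are dropped. The kept columns would then be appended to the original feature columns before retraining.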
In this step, to ensure that better model parameters are found so that the model performs well on both the training set and the validation set, the evaluation criterion is the average AUC obtained during cross-validation: the model parameter values that give the highest average AUC are selected. Specifically, given a candidate range for each model parameter, the method trains on every combination of parameter values; each combination is cross-validated and yields a mean AUC value.
Because there are many model parameters, giving each one a range and searching all of them in the search space at the same time would make the training time intolerable, since the number of combinations is the product of the numbers of candidate values of all parameters. This embodiment therefore searches in groups to approximate the optimal solution.
Firstly, the parameters are divided into groups of two and the optimal values of two parameters are found at a time, so that the total number of combinations searched is the sum, over the groups, of the products of the numbers of candidate values of the two parameters; secondly, the next two parameters are searched on the basis of the previous optimal solution; the optimal values of the remaining parameters are then found in the same way; and finally, an ROC curve is drawn from the optimal model and the test set.
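The grouped search can be sketched as follows. The parameter names are common xgboost hyperparameters chosen for illustration, and `toy_score` is a hypothetical stand-in for the mean AUC that 10-fold cross-validation would return; neither the grid nor the score comes from the patent.

```python
import itertools
from typing import Callable, Dict, List, Sequence, Tuple


def grouped_grid_search(
    grid: Dict[str, List[float]],
    groups: Sequence[Tuple[str, ...]],
    score: Callable[[Dict[str, float]], float],
    start: Dict[str, float],
) -> Tuple[Dict[str, float], int]:
    """Tune one group of parameters at a time, keeping earlier optima fixed."""
    best = dict(start)
    n_evals = 0
    for group in groups:
        best_score, best_values = None, None
        for values in itertools.product(*(grid[name] for name in group)):
            candidate = {**best, **dict(zip(group, values))}
            s = score(candidate)
            n_evals += 1
            if best_score is None or s > best_score:
                best_score, best_values = s, values
        best.update(zip(group, best_values))
    return best, n_evals


grid = {
    "max_depth": [2, 4, 6],
    "min_child_weight": [1, 3, 5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}


def toy_score(params: Dict[str, float]) -> float:
    # Hypothetical stand-in for the mean cross-validated AUC.
    return (
        1.0
        - 0.01 * (params["max_depth"] - 4) ** 2
        - 0.01 * (params["min_child_weight"] - 3) ** 2
        - (params["subsample"] - 0.8) ** 2
        - (params["colsample_bytree"] - 1.0) ** 2
    )


start = {name: values[0] for name, values in grid.items()}
groups = [("max_depth", "min_child_weight"), ("subsample", "colsample_bytree")]
best, n_evals = grouped_grid_search(grid, groups, toy_score, start)
```

Here the grouped search evaluates 9 + 9 = 18 combinations instead of the 81 of the full grid. It reaches the true optimum in this example only because the toy score is separable across groups; in general the result is an approximation, as the description notes.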
After the training data set has been trained with the improved Xgboost loss model, its features are simplified using the Lasso model. The Lasso model has an L1 penalty, which can effectively shrink the coefficients of non-key features to 0 and helps reduce both the computational workload and the risk of overfitting. The Lasso simplification comprises the following steps:
First, the hyperparameter α is set to 0.1, intercept fitting is set to true, the maximum number of iterations is set to 10000, and the feature updated in each iteration is selected at random.
Second, the original training data matrix prepared for the BR ridge model in feature preprocessing is fed into the Lasso model, and the model is trained on the same sample folds as those used for the improved Xgboost model.
Third, for the training on each fold, the features whose coefficients output by the Lasso model have an absolute value >= 0.01 are kept for the next step, that is, the corresponding columns of the data matrix are retained and the remaining feature columns are discarded. The threshold of 0.01 depends on the values taken by the label, and a person skilled in the art should rescale it accordingly when the label values differ.
Since the Xgboost predicted value for each fold's test data is calculated from the corresponding training data and therefore contains information from that training data, the subsequent training is also carried out on each pair of training set and test set, with the same sample grouping as that used for the improved Xgboost model.
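The Lasso simplification can be illustrated with a small coordinate-descent sketch that updates one randomly chosen coefficient per iteration and then keeps the columns whose coefficient magnitude is at least 0.01. The stated settings resemble the options of scikit-learn's `Lasso` (`alpha=0.1`, `fit_intercept=True`, `max_iter=10000`, `selection="random"`), but that correspondence, the objective scaling (1/(2n))‖y - Xw‖² + α‖w‖₁ and the toy data are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def soft_threshold(rho: float, alpha: float) -> float:
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_fit(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    alpha: float = 0.1,
    max_iter: int = 10000,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Random coordinate descent for (1/(2n))*||y - Xw - b||^2 + alpha*||w||_1."""
    n, d = len(X), len(X[0])
    # Fitting an intercept is handled by centering the columns and the target.
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    resid = [v - y_mean for v in y]
    z = [sum(row[j] ** 2 for row in Xc) / n for j in range(d)]
    w = [0.0] * d
    rng = random.Random(seed)
    for _ in range(max_iter):
        j = rng.randrange(d)  # the coordinate to update is chosen at random
        if z[j] == 0.0:
            continue
        rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * w[j]) for i in range(n)) / n
        new_w = soft_threshold(rho, alpha) / z[j]
        delta = new_w - w[j]
        if delta != 0.0:
            for i in range(n):
                resid[i] -= Xc[i][j] * delta
            w[j] = new_w
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(d))
    return w, intercept


# Toy data: the target depends on column 0 only; column 1 is irrelevant.
X = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y = [-2.0, 2.0, -2.0, 2.0]
w, b = lasso_fit(X, y)
kept = [j for j, c in enumerate(w) if abs(c) >= 0.01]
X_reduced = [[row[j] for j in kept] for row in X]
```

The L1 penalty shrinks the relevant coefficient from 2 to 1.9 and sets the irrelevant one to exactly 0, so only the first column survives the 0.01 threshold.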
After the results of the Xgboost loss model and of the Lasso model are obtained, the feature data matrix and the trained training data set are used as input for training the BR ridge model: the Xgboost predicted value is appended as a new feature column to the feature data matrix simplified by the Lasso model, and the resulting new feature data matrix is input into the BR ridge model to obtain the endometrial cancer risk screening prediction model.
Before the BR ridge model is trained, feature processing must be performed on the trained feature data matrix, and the feature processing comprises the following steps:
First, to reduce the interaction between the original pieces of feature information, the original feature information is split to generate a plurality of individual features.
Secondly, one-hot binarization is applied to each individual feature, so that all of them are converted into discrete features.
Then, abnormal feature values beyond the normal range are deleted, and samples lacking a label are deleted.
Finally, missing feature values (null values) are filled with 0, which yields a training data set that can be used directly for training.
In the BR ridge model training step, as for the Xgboost loss model, 10-fold cross-validation is adopted: the data set is divided into 10 folds in advance and each fold is used in turn as the test set, so that the two models are trained and tested on the same samples, and model prediction and evaluation are subsequently carried out on each given training set and its corresponding test set.
The basic principle of the BR ridge model is as follows: given the likelihood function of the target value y under parameters w, the value of w is obtained by maximum a posteriori (MAP) estimation according to Bayes' theorem. In the resulting formula, σ0 and σ are, respectively, the standard deviation of the parameter w and the standard deviation of the target value y given w. This formula serves as the loss function of the BR algorithm and is solved by gradient descent.
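For reference, one standard way to write such a MAP estimate is sketched below. It assumes a Gaussian likelihood and a zero-mean Gaussian prior on w; the symbols x_i, n, and I are introduced here for illustration, and the notation of the original formula may differ.

```latex
% Assumed Gaussian likelihood and zero-mean Gaussian prior
p(y \mid w) = \prod_{i=1}^{n} \mathcal{N}\!\left(y_i \mid w^{\top} x_i,\ \sigma^{2}\right),
\qquad
p(w) = \mathcal{N}\!\left(w \mid 0,\ \sigma_0^{2} I\right)

% Bayes' theorem: p(w | y) is proportional to p(y | w) p(w).
% Taking the negative logarithm and dropping constants gives the loss:
\hat{w}_{\mathrm{MAP}}
  = \arg\max_{w}\ p(w \mid y)
  = \arg\min_{w}\ \left[
      \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \left(y_i - w^{\top} x_i\right)^{2}
      + \frac{1}{2\sigma_0^{2}} \lVert w \rVert_2^{2}
    \right]
```

Under these assumptions the loss is a ridge (L2-penalized least squares) objective whose effective penalty strength is the ratio σ²/σ0².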
After the combined feature data matrix is substituted into the BR algorithm, an optimal coefficient is fitted for each feature. These coefficients can be used directly in subsequent applications, and the predicted value of each case under the linear model is recorded. An ROC curve is then drawn against the actual labels of the test set to evaluate prediction performance. The optimal prediction threshold is determined from this curve and is used to make decisions in later prediction.
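The ROC-based threshold selection can be illustrated with a small plain-Python sketch. The source does not state which criterion defines the "optimal" threshold; maximising Youden's J (TPR minus FPR) is an assumption made here, and the labels and scores are made-up toy values.

```python
def roc_points(labels, scores):
    """Return (threshold, fpr, tpr) triples, one per distinct score, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        # A case is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((thr, fp / neg, tp / pos))
    return points

def best_threshold(labels, scores):
    """Pick the threshold maximising Youden's J = TPR - FPR (assumed criterion)."""
    return max(roc_points(labels, scores), key=lambda p: p[2] - p[1])[0]

# Toy test-set labels and linear-model predicted values (hypothetical).
labels = [0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.70, 0.80, 0.90]

threshold = best_threshold(labels, scores)
predicted = [1 if s >= threshold else 0 for s in scores]
```

In a screening setting a different criterion, such as a minimum required sensitivity, may be preferable; the selection function is the only part that would change.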
In this embodiment, the risk assessment model of the present invention is verified using data from a hospital different from that of the training set as the verification set; fig. 3 shows the verification result on this verification set. For a new endometrial cancer patient, the patient's data information can be used directly as input to the model to obtain an endometrial cancer risk prediction value.
The invention also provides an endometrial cancer risk screening system based on artificial intelligence. As shown in fig. 4, the risk screening system comprises:
a data processing module, used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting feature information from the text of the original data, and performing labeling processing to form structured data;
an Xgboost loss model module, used for taking the structured data as input and training or predicting on the structured data with the improved Xgboost loss model to obtain the trained or predicted data;
a Lasso model simplifying module, used for simplifying the trained data through the Lasso model to obtain a feature data matrix;
and a BR ridge model module, used for taking the feature data matrix and the trained or predicted data as input and training or predicting through the BR ridge model, obtaining an endometrial cancer risk screening prediction model for the trained data and an endometrial cancer risk prediction value for the predicted data.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the endometrial cancer risk screening methods based on artificial intelligence.
The present invention also provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of any of the aforementioned methods for screening for risk of endometrial cancer based on artificial intelligence.
The feature coefficients generated by this embodiment are robust for patient data drawn from the same distribution. Specifically, a user inputs an electronic medical record and clinical auxiliary examination results; the embodiment converts the input information into features and automatically binarizes them to obtain a data matrix that the Xgboost model and the BR model can accept. The data matrix is fed into the Xgboost model and then the BR model, and the endometrial cancer risk prediction value for that medical record is output, thereby predicting the risk of endometrial cancer and precancerous lesions.
The invention improves the Xgboost model, simplifies the result of the improved Xgboost model using the Lasso model, combines the results of these two models as the input of the BR ridge model, and carries out further model training to obtain the final risk prediction result.
The improved Xgboost model replaces the cross-entropy loss function with the focal loss function, which is designed for imbalanced data, so that the trained model is not biased toward the class with more samples; the model is therefore more stable and its predictions more accurate. The Lasso model reduces the high-dimensional feature space to a low-dimensional one, which benefits generalization of the model. The BR ridge model is used in the classification model of the invention to correct some errors of the preceding model, thereby fusing the models and improving training performance.
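The effect of swapping cross-entropy for focal loss can be sketched in plain Python. This sketch assumes the common binary form FL = -(1 - p_t)^k · log(p_t) without a class-weighting factor, with p_t = p for a true label of 1 and p_t = 1 - p otherwise; the parameter name `focusing` (k) and its default value are illustrative, and the exact form used by the improved Xgboost loss model may differ. A custom objective for a gradient-boosting library typically needs the second derivative as well, which is omitted here.

```python
import math

def sigmoid(z):
    """Map a raw score (logit) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def focal_loss(z, y, focusing=2.0):
    """Binary focal loss for raw score z and true label y (0 or 1)."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** focusing) * math.log(p_t)

def focal_grad(z, y, focusing=2.0):
    """Analytic derivative of focal_loss with respect to the raw score z."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    sign = 1.0 if y == 1 else -1.0  # dp_t/dz = sign * p_t * (1 - p_t)
    return sign * (focusing * p_t * (1.0 - p_t) ** focusing * math.log(p_t)
                   - (1.0 - p_t) ** (focusing + 1.0))

# An easy, correctly classified positive: focal loss shrinks its contribution.
easy_ce = focal_loss(3.0, 1, focusing=0.0)     # focusing=0 is plain cross-entropy
easy_focal = focal_loss(3.0, 1, focusing=2.0)
```

Because well-classified samples, which are mostly from the majority class in an imbalanced data set, contribute far less loss, the gradient signal is dominated by hard and minority-class cases.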

Claims (7)

1. An endometrial cancer risk screening method based on artificial intelligence is characterized in that the risk screening method comprises a training process and a verification process,
wherein the training process comprises:
firstly, performing data processing, namely taking medical record data information of an endometrial cancer patient as an original data set, extracting characteristic information in a text of the original data set, and performing labeling processing to form a structured training data set;
secondly, taking the structured training data set as input, and training the structured training data set by adopting an improved Xgboost loss model to obtain a trained training data set;
then, simplifying the trained training data set through a Lasso model to obtain a characteristic data matrix;
finally, taking the characteristic data matrix and the trained training data set as input, and training through a Bayesian ridge regression (BR) model to obtain an endometrial cancer risk screening prediction model;
the verification process of the risk screening method comprises the following steps:
firstly, inputting new medical record data information of a patient with endometrial cancer, extracting the characteristic information from the medical record data information, and performing labeling treatment to form structured data information;
secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information;
finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process;
wherein the formula of the overall objective function of the improved Xgboost loss model is as follows:
wherein $T$ represents the number of leaf nodes and $w$ represents the score of a leaf node; $\gamma$ is used for controlling the number of leaf nodes and $\lambda$ is used for controlling the scores of the leaf nodes; if the true label is 1, then $p_t = p$; otherwise, when the true label is 0, $p_t = 1 - p$.
2. An artificial intelligence based endometrial cancer risk screening method according to claim 1, wherein said structured training data set is further subjected to a feature processing prior to said training of said structured training data set using a modified Xgboost loss model, said feature processing comprising the steps of:
firstly, splitting the characteristic information in the structured training data set to generate a plurality of independent characteristic information;
secondly, converting the independent feature information into numerical feature information, performing one-hot binarization on text feature information to obtain discrete features, and leaving continuous feature information without one-hot binarization, that is, without discretization;
then deleting abnormal feature information beyond a normal range, and deleting medical record data samples in the original data set that lack a label;
and finally, leaving missing values in the feature information unfilled.
3. The method of claim 1, wherein before training the Bayesian ridge regression (BR) model, the trained training data set is further subjected to feature processing, the feature processing comprising the steps of:
firstly, splitting the characteristic information of the trained training data set to generate a plurality of independent characteristic information;
secondly, performing one-hot binarization processing on the independent characteristic information respectively, and converting all the characteristic information into discrete characteristics;
then deleting abnormal characteristic information beyond a normal range, and deleting sample information in the original data set lacking a tag;
finally, the missing value in the characteristic information is filled with 0.
4. An artificial intelligence based endometrial cancer risk screening method according to claim 1, wherein said structured training data set is trained using an improved Xgboost loss model to obtain a trained training data set, said training method comprising the steps of:
firstly, loading the structured training data set, dividing a part of the samples in the structured training data set into 10 folds for cross-validating the training model, and using the remaining samples in the structured training data set as an independent verification set to verify the effect of the endometrial cancer risk screening prediction model;
secondly, inputting the structured training data set into the improved Xgboost loss model to obtain the indices of the leaf nodes, which form new feature combinations; analyzing the new feature combination columns against the true labels by correlation analysis; selecting the features with larger correlation and combining them with the original features; and training the improved Xgboost loss model again on the combined new data features using a grid search method to find the optimal model;
finally, the grid search method automatically searches for the optimal parameters using the new training set, and 10-fold cross-validation is used during the parameter search.
5. An artificial intelligence based endometrial cancer risk screening system, the risk screening system comprising:
the data processing module is used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting characteristic information in a text of the original data, and performing labeling processing to form structured data;
the Xgboost loss model module is used for taking the structured data as input, and training or predicting the structured data by adopting the improved Xgboost loss model to obtain the trained or predicted data;
the Lasso model simplifying module is used for simplifying the trained data through the Lasso model to obtain a characteristic data matrix;
a Bayesian ridge regression (BR) model module, configured to use the feature data matrix and the trained or predicted data as inputs, perform training or prediction by using the Bayesian ridge regression (BR) model, obtain an endometrial cancer risk screening prediction model for the trained data, and obtain the predicted endometrial cancer risk prediction value for the predicted data;
wherein the formula of the overall objective function of the improved Xgboost loss model is as follows:
wherein $T$ represents the number of leaf nodes and $w$ represents the score of a leaf node; $\gamma$ is used for controlling the number of leaf nodes and $\lambda$ is used for controlling the scores of the leaf nodes; if the true label is 1, then $p_t = p$; otherwise, when the true label is 0, $p_t = 1 - p$.
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of an artificial intelligence based endometrial cancer risk screening method according to any one of claims 1-5 when said program is executed.
7. A computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of an artificial intelligence based endometrial cancer risk screening method according to any one of claims 1-5.
CN202210338149.5A 2022-04-01 2022-04-01 Endometrial cancer risk screening method and system based on artificial intelligence Active CN114743685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210338149.5A CN114743685B (en) 2022-04-01 2022-04-01 Endometrial cancer risk screening method and system based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN114743685A CN114743685A (en) 2022-07-12
CN114743685B true CN114743685B (en) 2024-01-05




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant