CN114743685B - Endometrial cancer risk screening method and system based on artificial intelligence

Info

Publication number: CN114743685B
Application number: CN202210338149.5A
Authority: CN
Inventors: 朱兰; 王姝; 刘西洋; 高颖; 王晓东; 郭丰; 刘创
Original assignee: Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Current assignee: Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Priority date: 2022-04-01
Filing date: 2022-04-01
Publication date: 2024-01-05
Anticipated expiration: 2042-04-01
Also published as: CN114743685A

Abstract

The invention provides an endometrial cancer risk screening method and system based on artificial intelligence, wherein the method comprises a training process and a verification process. The system comprises a data processing module, an Xgboost loss model module, a Lasso model simplification module and a BR ridge model module, wherein the data processing module is used for performing data processing, extracting feature information from the data, and performing labeling to form structured data; the Xgboost loss model module is used for training on or predicting from the structured data to obtain trained or predicted data; the Lasso model simplification module is used for feature simplification to obtain a feature data matrix; and the BR ridge model module is used for obtaining an endometrial cancer risk screening prediction model from the trained data and an endometrial cancer risk prediction value from the predicted data. The invention can be widely used for cross-sectional risk screening of routine outpatient or physical-examination patients and avoids missing high-risk patients; at the same time, automatic risk screening eases doctors' work and improves efficiency; the risk prediction model can also be moved forward to the patient side for preliminary screening, which facilitates patient self-monitoring and self-management.

Description

Endometrial cancer risk screening method and system based on artificial intelligence

Technical Field

The invention relates to the technical field of natural language processing and machine learning, in particular to an endometrial cancer risk screening method and system based on artificial intelligence.

Background

Endometrial cancer is one of the three major cancers of the female reproductive system, and its incidence and mortality are rising continuously worldwide. As the incidence increases and patients become younger, early screening of patients at high risk of endometrial lesions is becoming increasingly important.

Currently, the common screening methods for endometrial cancer are tumor markers combined with transvaginal ultrasound of the uterus and bilateral adnexa, or endometrial sampling combined with cytology. However, the first method is more suitable for postmenopausal women; for women of childbearing age, the screening information provided by ultrasound is limited, a cutoff value is lacking, and the accuracy is low. The second method is more suitable for further screening of patients at high risk of endometrial cancer: endometrial sampling is invasive for the patient, the disposable consumables are costly, and cytological examination places high demands on the pathological diagnosis level of the hospital. Thus, compared with the currently more mature early screening system for cervical cancer, effective screening methods for endometrial cancer and precancerous lesions are lacking.

Disclosure of Invention

The invention uses natural language processing and machine learning to screen for the risk of endometrial cancer and precancerous lesions based on patients' electronic medical records, past clinical auxiliary examination data and similar information. Unstructured text such as medical records and examination data is converted into structured data through several artificial intelligence algorithms: a rule-based method extracts the features relevant to endometrial cancer and gives them a structured representation, the result is vectorized and input into a pre-designed algorithm, and data analysis, mining and model training are carried out to construct a risk screening system for endometrial cancer and precancerous lesions. Given a patient's electronic medical record, past clinical auxiliary examination data and similar information, the risk screening system outputs a risk prediction value. This avoids wasting manpower and material resources and improves the accuracy of disease screening and diagnosis, making the system an auxiliary diagnosis tool with high sensitivity, specificity, stability and reliability.

The invention provides an artificial intelligence-based endometrial cancer risk screening method, which is characterized by comprising a training process and a verification process, wherein the training process comprises the following steps: firstly, performing data processing, namely taking medical record data information of an endometrial cancer patient as an original data set, extracting characteristic information in a text of the original data set, and performing labeling processing to form a structured training data set; secondly, taking the structured training data set as input, and training the structured training data set by adopting an improved Xgboost loss model to obtain a trained training data set; then, simplifying the trained training data set through a Lasso model to obtain a characteristic data matrix; and finally, taking the characteristic data matrix and the trained training data set as input, and training through a BR ridge model to obtain an endometrial cancer risk screening prediction model.

Further, the verification process of the risk screening method comprises the following steps: firstly, inputting new medical record data information of a patient with endometrial cancer, extracting the characteristic information from the medical record data information, and performing labeling treatment to form structured data information; secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information; and finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process.

Further, before the structured training data set is trained with the improved Xgboost loss model, feature processing needs to be performed on it, the feature processing comprising the following steps: firstly, splitting the feature information in the structured training data set to generate a plurality of individual features; secondly, converting the individual features into numerical features, wherein one-hot binarization is applied to text features to obtain discrete features, and continuous features are not one-hot binarized, i.e. not discretized; then deleting abnormal feature values beyond the normal range, and deleting medical record samples in the original data set that lack a label; and finally, leaving missing values in the feature information unfilled.

Furthermore, before the BR ridge model is trained, feature processing is further required to be performed on the trained training data set, and the feature processing includes the following steps: firstly, splitting the characteristic information of the trained training data set to generate a plurality of independent characteristic information; secondly, performing one-hot binarization processing on the independent characteristic information respectively, and converting all the characteristic information into discrete characteristics; then deleting abnormal characteristic information beyond a normal range, and deleting sample information in the original data set lacking a tag; finally, the missing value in the characteristic information is filled with 0.

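Further, the overall objective function of the improved Xgboost loss model is formulated as follows:

$$\mathrm{Obj} = \sum_{i=1}^{n} \mathrm{FL}(p_{t,i}) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}, \qquad \mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

wherein the sum runs over the $n$ training samples, $T$ represents the number of leaf nodes, and $w$ represents the scores of the leaf nodes; $\gamma$ in the regularization term is used for controlling the number of leaf nodes, and $\lambda$ is used for controlling the scores of the leaf nodes; if the true label is 1, then $p_t = p$; otherwise, when the true label is 0, then $p_t = 1 - p$, where $p$ is the predicted probability of the positive class.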

Further, training the structured training data set with the improved Xgboost loss model to obtain a trained training data set comprises the following steps: firstly, loading the structured training data set, dividing a part of its samples into 10 folds for cross-validating the training model, and using the remaining samples as an independent validation set to verify the effect of the endometrial cancer risk screening prediction model; secondly, inputting the structured training data set into the improved Xgboost loss model to obtain the indices of the leaf nodes, which form new combined feature columns, analyzing the new feature columns against the true labels by correlation analysis, selecting the features with larger correlation and combining them with the original features, and training the improved Xgboost loss model again on the combined features using a grid search method to find an optimal model; finally, the grid search method automatically searches for the optimal parameters on the new training set, with 10-fold cross-validation used during the parameter search.

The invention also provides an artificial intelligence-based endometrial cancer risk screening system, which is characterized by comprising: the data processing module is used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting characteristic information in a text of the original data, and performing labeling processing to form structured data; the Xgboost loss model module is used for taking the structured data as input, and training or predicting the structured data by adopting the improved Xgboost loss model to obtain the trained or predicted data; the Lasso model simplifying module is used for simplifying the trained data through the Lasso model to obtain a characteristic data matrix; and the BR ridge model module is used for taking the characteristic data matrix and the trained or predicted data as input, training or predicting through the BR ridge model, obtaining an endometrial cancer risk screening prediction model aiming at the trained data, and obtaining the predicted endometrial cancer risk prediction value aiming at the predicted data.

The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the endometrial cancer risk screening method based on artificial intelligence.

The present invention also provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of any of the above-described methods for screening for risk of endometrial cancer based on artificial intelligence.

The beneficial effects of the invention are as follows:

The endometrial cancer risk screening method and system provided by the invention can be widely used, on the basis of patients' electronic medical records, auxiliary examination data and similar information, for cross-sectional risk screening of routine outpatient or physical-examination patients, so that high-risk patients are not missed; at the same time, automatic risk screening eases doctors' work, improves efficiency, and reduces the influence of differences in doctors' professional level across hospital tiers and degrees of training; the risk prediction model can also be moved forward to the patient side for preliminary screening, which facilitates patient self-monitoring and self-management.

Meanwhile, the method and the system provided by the invention provide objective and reliable auxiliary diagnosis reference results for doctors in primary hospitals to screen patients with high risk of endometrial cancer and women in daily physical examination to self-determine endometrial cancer risk. The system screens out high-risk people in a noninvasive mode for further diagnosis and screening, and is beneficial to doctors to establish a stepped screening strategy for endometrial cancer.

Drawings

Fig. 1 shows a schematic step diagram of the training process of the present invention.

Fig. 2 shows a schematic step diagram of the verification process of the present invention.

Fig. 3 shows a verification result diagram at a verification set.

Fig. 4 shows a schematic structural diagram of an endometrial cancer risk screening system.

Detailed Description

The following examples and experimental examples are illustrative of the present invention and are not intended to limit the scope of the present invention. The present invention will be further described with reference to specific examples and experimental examples.

The invention provides an endometrial cancer risk screening method based on artificial intelligence, which comprises a training process and a verification process; the steps of the training process are shown schematically in FIG. 1. Firstly, data processing is performed: medical record data of endometrial cancer patients are taken as the original data set, feature information is extracted from the text of the original data set, and labeling is performed to form a structured training data set; secondly, the structured training data set is taken as input and trained with an improved Xgboost loss model to obtain a trained training data set; then, the trained training data set is simplified through a Lasso model to obtain a feature data matrix; and finally, the feature data matrix and the trained training data set are taken as input and trained through a BR ridge model to obtain an endometrial cancer risk screening prediction model.

A schematic of the steps of the verification process of the present invention is shown in fig. 2. The verification process includes: firstly, inputting new medical record data information of a patient with endometrial cancer, extracting the characteristic information from the medical record data information, and performing labeling treatment to form structured data information; secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information; and finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process.

Specifically, in the data processing step, a preliminary feature selection is first performed on information such as the patient's electronic medical record for endometrial cancer, and the key information therein is extracted as features according to Table 1.

Preoperative TCT (liquid-based cytology)	History of oral contraceptive use
Most recent preoperative HPV (human papillomavirus)	Age
Personal history of other tumors	Menstrual cycle (days)
Delivery mode	Menopausal age (years)
Main symptoms	Gravidity (number of pregnancies)
Comorbidities	Parity (number of deliveries)
Preoperative ultrasound findings	Lowest preoperative HGB (g/L)
Concurrent lesions seen on ultrasound	BMI
Family history of tumors	Menstrual period changes
Menopausal status (yes/no)	Age at menarche (years)
Smoking (yes/no)	Menstrual period (days)
Intrauterine device	Preoperative CA125 (U/ml)
Postmenopausal hormone replacement therapy (HRT)	Predictive score
Preoperative progestogen use

TABLE 1 Feature information to be extracted according to the invention

Then, the feature information is labeled to obtain a structured data set. Specifically, the label of each case, i.e. the outcome label, is taken from the postoperative pathology result. As shown in Table 2, pathology results of "atypical endometrial hyperplasia", "endometrial hyperplasia" and the various endometrial cancer subtypes are labeled positive for endometrial cancer and precancerous lesions. Pathology results that are definitely "normal endometrium", "proliferative-phase endometrium", "secretory-phase endometrium" or "simple endometrial hyperplasia" are labeled negative. Cases with any other pathology result are not included in model training.

Atypical hyperplasia of endometrium	Endometrial cancer
Endometrial hyperplasia	Endometrial serous carcinoma
Endometrial cancer	Serous adenocarcinoma of the endometrium
Endometrial clear cell carcinoma

TABLE 2 Pathological results labeled positive for endometrial cancer and precancerous lesions
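The labeling rule above can be sketched in Python as a lookup on the postoperative pathology result. This is a minimal illustration only: the normalized result strings, the set names `POSITIVE` and `NEGATIVE`, and the function `outcome_label` are hypothetical, and exact string matching is an assumption not stated in the description.

```python
from typing import Optional

# Assumed normalized pathology strings, following Table 2 and the negative
# categories named in the description. Exact matching is an assumption.
POSITIVE = {
    "atypical hyperplasia of endometrium",
    "endometrial hyperplasia",
    "endometrial cancer",
    "endometrial serous carcinoma",
    "serous adenocarcinoma of the endometrium",
    "endometrial clear cell carcinoma",
}
NEGATIVE = {
    "normal endometrium",
    "proliferative-phase endometrium",
    "secretory-phase endometrium",
    "simple endometrial hyperplasia",
}


def outcome_label(pathology: str) -> Optional[int]:
    """Return 1 (positive), 0 (negative) or None (excluded from training)."""
    key = pathology.strip().lower()
    if key in NEGATIVE:
        return 0
    if key in POSITIVE:
        return 1
    return None  # any other pathology result is left out of model training
```

For example, `outcome_label("Endometrial clear cell carcinoma")` gives a positive label, while a result string that appears in neither set yields `None` and the case is dropped.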

In the model training step, an improved Xgboost loss model training is first performed. However, before training the structured training data set with the improved Xgboost loss model, it is also necessary to perform a feature processing on the structured training data set, where the feature processing includes the following steps:

First, to reduce the interaction among the original features, the original feature information is split to generate a plurality of individual features.

Secondly, the individual text features are converted into numerical features. Because converting all features into discrete features by one-hot binarization gave a poor training effect, one-hot binarization is applied only to part of the text features to turn them into discrete features, while continuous features are not one-hot binarized, i.e. not discretized.

Then, abnormal feature values beyond the normal range are deleted, and samples lacking a label are deleted.

Finally, most missing values in the feature information are left unfilled, because the Xgboost loss model itself contains a mechanism for handling null values.
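The feature processing steps above might look as follows in plain Python. The records, field names, value ranges and the function `preprocess_for_boosting` are hypothetical and chosen only for illustration; representing a deleted or missing value as NaN is likewise an assumption.

```python
import math

# Hypothetical raw records; field names and values are illustrative only.
RECORDS = [
    {"age": 52, "bmi": 27.4, "delivery_mode": "vaginal", "label": 1},
    {"age": 34, "bmi": None, "delivery_mode": "cesarean", "label": 0},
    {"age": 430, "bmi": 22.0, "delivery_mode": "vaginal", "label": 0},  # age out of range
    {"age": 61, "bmi": 30.1, "delivery_mode": None, "label": None},     # no label
]
CONTINUOUS_RANGES = {"age": (10, 110), "bmi": (10.0, 70.0)}  # assumed normal ranges
CATEGORICAL = {"delivery_mode": ["vaginal", "cesarean"]}      # assumed categories


def preprocess_for_boosting(records):
    """One-hot encode text features, keep continuous ones, leave gaps as NaN."""
    rows = []
    for rec in records:
        if rec["label"] is None:  # drop samples that lack a label
            continue
        row = {"label": rec["label"]}
        for name, (low, high) in CONTINUOUS_RANGES.items():
            value = rec[name]
            if value is not None and not (low <= value <= high):
                value = None  # delete the abnormal value, keep the sample
            # continuous features stay numeric; missing values are not filled
            row[name] = math.nan if value is None else float(value)
        for name, categories in CATEGORICAL.items():
            for category in categories:  # one-hot binarization of text features
                if rec[name] is None:
                    row[f"{name}={category}"] = math.nan
                else:
                    row[f"{name}={category}"] = 1.0 if rec[name] == category else 0.0
        rows.append(row)
    return rows


ROWS = preprocess_for_boosting(RECORDS)
```

Leaving NaN in place relies on the boosting model handling null values itself, as the description states; a model without such a mechanism would need the gaps filled first.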

After the data set is subjected to the characteristic processing, the improved Xgboost loss model training is started, and the specific steps are as follows:

Firstly, the training data set after feature processing and labeling is loaded; most of its samples are divided into 10 folds for cross-validating the training model, and the remaining processed samples serve as an independent validation set for verifying the model effect.

And secondly, improving the Xgboost loss model. Since the number of sample sets used for cross-validation in the previous step is much larger than the number of sample sets used for independent validation, a data imbalance is generated, which may result in poor generalization ability of the model. In order to improve the stability of the training model, the invention does not use an up-sampling or down-sampling method from the data layer, but improves the training process of the Xgboost loss model.

The overall objective function of the Xgboost loss model consists of two parts: the first part measures the difference between the predicted score and the true score, and the second part is a regularization term. The formula is as follows:

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$

where $y_i$ and $\hat{y}_i$ are the true label and the predicted score of the $i$-th of $n$ samples, $T$ represents the number of leaf nodes, and $w$ represents the scores of the leaf nodes. $\gamma$ controls the number of leaf nodes, and $\lambda$ keeps the leaf scores from becoming too large, which prevents overfitting.

In the original Xgboost loss model, the first part, $l(y_i, \hat{y}_i)$, is a cross-entropy loss function; with it, the aforementioned data imbalance problem can lead to a small model threshold and an unstable model.

The invention replaces the cross entropy loss function with the focal loss function, and the formula is as follows:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $p_t = p$ if the true label is 1 and $p_t = 1 - p$ if the true label is 0, with $p$ the predicted probability of the positive class.

By adjusting the hyperparameters α and γ, the focal loss function changes the loss weight of the class with more or fewer samples, so that the prediction of the Xgboost model is not biased toward the class with more samples, and the data imbalance problem is handled better.
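The effect of the focal loss can be illustrated with a short Python sketch for a single sample. It assumes the usual convention that p_t is the predicted probability of the true class, and that α_t equals α for positive samples and 1 − α for negative samples; the default values α = 0.25 and γ = 2 are illustrative and are not taken from the description.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) for one sample."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # assumed class weighting convention
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p: float, y: int) -> float:
    """Binary cross-entropy for one sample, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


# A confidently correct sample (p = 0.9) is down-weighted far more strongly
# than a hard one (p = 0.6), relative to plain cross-entropy.
easy_ratio = focal_loss(0.9, 1) / cross_entropy(0.9, 1)
hard_ratio = focal_loss(0.6, 1) / cross_entropy(0.6, 1)
```

With γ = 0 and α_t = 1 the expression reduces to the cross-entropy loss, which is why the focal loss can be read as a reweighted version of the original objective's first part.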

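The formula of the overall objective function of the improved Xgboost loss model is as follows:

$$\mathrm{Obj} = \sum_{i=1}^{n} \mathrm{FL}(p_{t,i}) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$

where $p_{t,i}$ is the value of $p_t$ for the $i$-th sample, and $\gamma T$ and $\lambda$ belong to the regularization term.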

Thirdly, to further improve training performance, the original training data set is input into the improved Xgboost model to obtain the indices of the leaf nodes, which form new combined features. However, an excessive number of feature columns may cause overfitting. In this embodiment, an RM correlation analysis method is therefore used to analyze the new feature columns against the true labels, the features with larger correlation are selected and combined with the original features, and the improved Xgboost model is trained again on the combined features using a grid search method to find an optimal model.

Fourthly, the grid search method automatically searches for the optimal parameters on the new training set, and 10-fold cross-validation is used during the parameter search.
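The selection of leaf-index features in the third step can be sketched as follows. The description does not define the RM correlation analysis, so the sketch substitutes a plain Pearson correlation as a stand-in; the function names and the `top_k` cut-off are hypothetical.

```python
import math


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_leaf_features(leaf_index, labels, top_k):
    """leaf_index[i][t] is the leaf reached by sample i in tree t.

    Returns the indices of the top_k trees whose leaf-index column is most
    correlated (in absolute value) with the true labels.
    """
    n_trees = len(leaf_index[0])
    scored = []
    for t in range(n_trees):
        column = [row[t] for row in leaf_index]
        scored.append((abs(pearson(column, labels)), t))
    scored.sort(reverse=True)
    return sorted(t for _, t in scored[:top_k])


def combine(original, leaf_index, kept):
    """Append the selected leaf-index columns to the original feature rows."""
    return [list(x) + [row[t] for t in kept] for x, row in zip(original, leaf_index)]
```

In practice the leaf indices would come from the fitted boosting model; here they are simply passed in as a matrix so the sketch stays independent of any particular library.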

In this step, to find model parameters with which the model performs well on both the training set and the validation set, the evaluation criterion is the average AUC obtained during cross-validation, and the parameter values giving the highest average AUC are selected. Specifically, given a candidate range for each model parameter, the method trains on all combinations of parameter values; each combination is cross-validated and yields a mean AUC value.

Because the model has many parameters, giving every parameter a range and searching all of them simultaneously would make the training time intolerable, since the number of combinations is the product of the numbers of candidate values of all parameters. This embodiment therefore searches the parameters in groups to approximate the optimal solution.

Firstly, the parameters are divided into groups of two, and the optimal values of two parameters are found at a time, so that the total search space is the sum, over the groups, of the product of the numbers of candidate values of the two parameters; secondly, the next two optimal parameters are searched on the basis of the previous optimal solution; then the optimal values of the remaining parameters are found in the same way; and finally, an ROC curve is drawn from the optimal model and the test set.
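The grouped search can be sketched in Python as below. The parameter names, candidate values and the `toy_score` function are hypothetical; in the embodiment the score would be the mean AUC from 10-fold cross-validation, which is replaced here by a simple stand-in so the sketch runs on its own.

```python
from itertools import product


def grouped_grid_search(grid, groups, score, defaults):
    """Search the parameters group by group, fixing earlier optima.

    Returns the best parameter dict found and the number of evaluations.
    """
    best = dict(defaults)
    evaluations = 0
    for group in groups:
        best_score, best_values = float("-inf"), None
        for values in product(*(grid[name] for name in group)):
            candidate = {**best, **dict(zip(group, values))}
            current = score(candidate)
            evaluations += 1
            if current > best_score:
                best_score, best_values = current, values
        best.update(zip(group, best_values))  # keep this group's optimum
    return best, evaluations


# Illustrative search space (names and values are assumptions).
GRID = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8],
}
GROUPS = [("max_depth", "min_child_weight"), ("subsample", "colsample_bytree")]
DEFAULTS = {name: values[0] for name, values in GRID.items()}
TARGET = {"max_depth": 5, "min_child_weight": 3, "subsample": 0.8, "colsample_bytree": 0.6}


def toy_score(params):
    """Stand-in for the mean cross-validated AUC of a parameter setting."""
    return -sum((params[k] - v) ** 2 for k, v in TARGET.items())


BEST, N_EVALS = grouped_grid_search(GRID, GROUPS, toy_score, DEFAULTS)
```

For this grid the grouped search evaluates 3·2 + 3·2 = 12 settings instead of the 3·2·3·2 = 36 of a full search. The result is only an approximation of the joint optimum when the parameter groups interact.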

After the training data set has been trained with the improved Xgboost loss model, its features are simplified with the Lasso model. The Lasso model carries an L1 penalty, which can effectively shrink the coefficients of non-key features to 0 and helps reduce both the computational workload and the risk of overfitting. The Lasso model simplification comprises the following steps:

First, the hyperparameter α is set to 0.1, intercept fitting is enabled, the maximum number of iterations is set to 10000, and the feature updated in each iteration is selected at random.

Then, the original training data matrix prepared for the BR ridge model in the feature preprocessing is substituted into the Lasso model, and the model is trained on sample groups consistent with the data grouping used for the improved Xgboost model.

For the training on each fold, according to the feature coefficients output by the Lasso model, the features whose coefficients have an absolute value ≥ 0.01 are retained for the next step, i.e. the corresponding columns of the data matrix are kept and the remaining feature columns are discarded. The threshold of 0.01 is related to the values of the labels, and a person skilled in the art should transform it accordingly when the label values differ.
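The column selection in the last step can be sketched as follows. The coefficients are assumed to come from an already fitted Lasso model; the description does not name a library, although the stated settings would correspond, for instance, to scikit-learn's `Lasso(alpha=0.1, fit_intercept=True, max_iter=10000, selection="random")`. The function names are hypothetical.

```python
def keep_columns(coefficients, threshold=0.01):
    """Indices of features whose Lasso coefficient satisfies |coef| >= threshold."""
    return [j for j, coef in enumerate(coefficients) if abs(coef) >= threshold]


def reduce_matrix(matrix, kept):
    """Keep only the selected columns of a row-major data matrix."""
    return [[row[j] for j in kept] for row in matrix]


# Illustrative coefficients for five features of one fold (made-up values).
coefs = [0.2, -0.004, 0.01, -0.3, 0.0]
kept = keep_columns(coefs)
reduced = reduce_matrix([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], kept)
```

Because the selection is repeated for every fold, different folds may retain different columns; each fold's reduced matrix is then paired with the Xgboost predictions for the same fold.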

Since the Xgboost predicted value is calculated based on the training data corresponding to each fold test data, and includes the information of the training data, the subsequent training will also be performed on each pair of training set and test set, and the grouping of the samples is consistent with the grouping of the samples adopted by the improved Xgboost model.

After the results of the Xgboost loss model and of the Lasso model are obtained, the feature data matrix and the trained training data set are used as input for training the BR ridge model: the Xgboost predicted value is appended as a new feature column to the feature data matrix simplified by the Lasso model, and the resulting new feature data matrix is input into the BR ridge model to obtain the endometrial cancer risk screening prediction model.

Before the BR ridge model is trained, the trained feature data matrix must undergo feature processing, which comprises the following steps:

Firstly, the feature information is split to generate a plurality of individual features.

Secondly, one-hot binarization is performed on each individual feature, converting all of them into discrete features.

Then, abnormal feature values beyond the normal range are deleted, and samples lacking a label are deleted.

Finally, missing feature values (null values) are filled with 0, yielding a training data set that can be used directly for training.

In the BR ridge model training step, as for the Xgboost loss model, a 10-fold cross validation method is adopted, namely a data set is divided into 10 folds in advance, and each fold is taken as a test set in turn, so that samples adopted in the training and testing of the two models are the same, and model prediction and evaluation are carried out in the follow-up step according to a given training set and a corresponding test set.
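Keeping the folds identical for both models amounts to fixing the split once and reusing it. A minimal sketch, assuming a seeded random shuffle without stratification (the description does not say how the folds are drawn); the function names are hypothetical.

```python
import random


def make_folds(n_samples, n_folds=10, seed=0):
    """Split sample indices into n_folds disjoint folds, reproducibly."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[k::n_folds]) for k in range(n_folds)]


def train_test_pairs(folds):
    """Yield (train_indices, test_indices), each fold serving once as the test set."""
    for k, test in enumerate(folds):
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


FOLDS = make_folds(25)
PAIRS = list(train_test_pairs(FOLDS))
```

Passing the same `FOLDS` to the Xgboost, Lasso and BR ridge steps ensures that the prediction for a test sample never comes from a model that saw that sample during training.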

The basic principle of the BR ridge model is as follows: given the likelihood function of the target value y when the parameters are w,

$$p(y \mid w) = \mathcal{N}(y \mid Xw, \sigma^{2} I),$$

and the prior $p(w) = \mathcal{N}(w \mid 0, \sigma_0^{2} I)$, where $X$ is the feature data matrix, the maximum a posteriori estimate of w is obtained according to Bayes' theorem:

$$\hat{w} = \arg\min_{w} \left[ \frac{1}{2\sigma^{2}} \lVert y - Xw \rVert^{2} + \frac{1}{2\sigma_0^{2}} \lVert w \rVert^{2} \right]$$

where $\sigma_0$ and $\sigma$ are, respectively, the standard deviation of the parameter w and the standard deviation of the target value y when the parameter is w. This expression is used as the loss function of the BR algorithm and is solved by gradient descent.
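The gradient descent solution can be illustrated with a small Python sketch. It assumes a Gaussian likelihood with standard deviation σ, a zero-mean Gaussian prior on w with standard deviation σ₀, fixed values for both, and no intercept; the function name, learning rate and step count are hypothetical choices.

```python
def map_ridge_gd(xs, ys, sigma=1.0, sigma0=1.0, lr=0.01, steps=2000):
    """Minimise sum_i (y_i - w.x_i)^2 / (2 sigma^2) + ||w||^2 / (2 sigma0^2)."""
    n_features = len(xs[0])
    w = [0.0] * n_features
    for _ in range(steps):
        # gradient of the prior (ridge) term
        grad = [w_j / sigma0 ** 2 for w_j in w]
        # gradient of the likelihood (squared error) term
        for x, y in zip(xs, ys):
            residual = sum(w_j * x_j for w_j, x_j in zip(w, x)) - y
            for j in range(n_features):
                grad[j] += residual * x[j] / sigma ** 2
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
    return w


# One-feature toy data, y = 2x; the prior shrinks the coefficient below 2.
W = map_ridge_gd([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0])
```

For a single feature the minimiser has the closed form Σxy / (Σx² + σ²/σ₀²), which equals 28/15 for the toy data and can be used to check convergence.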

After substituting the combined characteristic data matrix into the BR algorithm, the optimal coefficient can be further fitted for each characteristic, the coefficient of each characteristic can be directly used for subsequent application, and the predicted value of each case under the linear model is recorded. And then drawing an ROC curve according to the actual labeling result of the test set, and evaluating the prediction performance. The optimal prediction threshold is determined based on this, and used to make decisions at the time of later prediction.
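The ROC evaluation and threshold selection can be sketched as below. The description does not state how the optimal threshold is chosen, so the sketch assumes Youden's J statistic (TPR − FPR); the scores and labels are made-up values, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) triples, one per distinct score, high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule, starting at (0, 0)."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for _, fpr, tpr in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area


def best_threshold(points):
    """Threshold maximising Youden's J = TPR - FPR (an assumed criterion)."""
    return max(points, key=lambda point: point[2] - point[1])[0]


SCORES = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
LABELS = [1, 1, 0, 1, 0, 0, 0]
POINTS = roc_points(SCORES, LABELS)
```

A case whose predicted value is at or above `best_threshold(POINTS)` would then be flagged as high risk at prediction time.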

In this embodiment, the risk assessment model of the present invention is verified using data of a different hospital from the training set as the verification set, and fig. 3 shows a verification result diagram in the verification set. For new endometrial cancer patient data information, the data information can be directly used as input of the model to obtain an endometrial cancer risk prediction value.

The invention also provides an endometrial cancer risk screening system based on artificial intelligence, as shown in fig. 4, the risk screening system comprises: the data processing module is used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting characteristic information in a text of the original data, and performing labeling processing to form structured data; the Xgboost loss model module is used for taking the structured data as input, and training or predicting the structured data by adopting the improved Xgboost loss model to obtain the trained or predicted data; the Lasso model simplifying module is used for simplifying the trained data through the Lasso model to obtain a characteristic data matrix; and the BR ridge model module is used for taking the characteristic data matrix and the trained or predicted data as input, training or predicting through the BR ridge model, obtaining an endometrial cancer risk screening prediction model aiming at the trained data, and obtaining the predicted endometrial cancer risk prediction value aiming at the predicted data.

The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the endometrial cancer risk screening methods based on artificial intelligence.

The present invention also provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of any of the aforementioned methods for screening for risk of endometrial cancer based on artificial intelligence.

The feature coefficients generated by this embodiment are robust for patient data drawn from the same distribution. Specifically, a user inputs an electronic medical record and clinical auxiliary examination results; the embodiment converts the input into features and automatically binarizes them to obtain a data matrix that the Xgboost model and the BR model can process; the data matrix is fed into the Xgboost model and the BR model in turn, and the endometrial cancer risk prediction value for the medical record is output, thereby predicting the risk of endometrial cancer and precancerous lesions.

The invention improves the Xgboost model, simplifies the result of the improved Xgboost model by utilizing the Lasso model, combines the results of the two model training, takes the results as the input of the BR ridge model, and carries out further model training to obtain the final risk prediction result.

The improved Xgboost model replaces the cross-entropy loss function with the focal loss function, which is designed for imbalanced data, so that the trained model is not biased toward the class with more samples, the model is more stable, and the prediction is more accurate. The Lasso model reduces the high-dimensional feature space to a low-dimensional one, which benefits the generalization of the model. The BR ridge model is used in the classification model of the invention to correct some of the errors of the preceding models, thereby fusing the models and improving training performance.

Claims

1. An endometrial cancer risk screening method based on artificial intelligence is characterized in that the risk screening method comprises a training process and a verification process,

wherein the training process comprises:

firstly, performing data processing, namely taking medical record data information of an endometrial cancer patient as an original data set, extracting characteristic information in a text of the original data set, and performing labeling processing to form a structured training data set;

secondly, taking the structured training data set as input, and training the structured training data set by adopting an improved Xgboost loss model to obtain a trained training data set;

then, simplifying the trained training data set through a Lasso model to obtain a characteristic data matrix;

finally, taking the characteristic data matrix and the trained training data set as input, and training through a Bayesian ridge regression (BR) model to obtain an endometrial cancer risk screening prediction model;

the verification process of the risk screening method comprises the following steps:

firstly, inputting new medical record data information of an endometrial cancer patient, extracting the characteristic information from the medical record data information, and performing labeling processing to form structured data information;

secondly, taking the structured data information as input, and predicting the structured data information by adopting the improved Xgboost loss model to obtain predicted data information;

finally, taking the predicted data information and the structured data information as input, and obtaining an endometrial cancer risk prediction value corresponding to the new endometrial cancer patient by adopting the endometrial cancer risk screening prediction model obtained in the training process;

wherein the formula of the overall objective function of the improved Xgboost loss model is as follows:

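$$\mathrm{Obj} = \sum_{i} \mathrm{FL}\left(p_{t,i}\right) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2};$$

wherein $\mathrm{FL}$ denotes the focal loss function evaluated for each sample $i$; $T$ represents the number of leaf nodes and $w_j$ represents the score of the $j$-th leaf node; $\gamma$ is used for controlling the number of leaf nodes and $\lambda$ is used for controlling the scores of the leaf nodes; if the real label is 1, then $p_t = p$; otherwise, when the real label is 0, $p_t = 1 - p$, where $p$ is the predicted probability that the label is 1.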

2. An artificial intelligence based endometrial cancer risk screening method according to claim 1, wherein said structured training data set is further subjected to a feature processing prior to said training of said structured training data set using a modified Xgboost loss model, said feature processing comprising the steps of:

firstly, splitting the characteristic information in the structured training data set to generate a plurality of independent characteristic information;

secondly, converting the independent characteristic information into numerical characteristic information, wherein one-hot binarization is performed on the text characteristic information to obtain discrete features, and no one-hot binarization, that is, no discretization, is performed on the continuous characteristic information;

then deleting abnormal characteristic information beyond a normal range, and deleting medical record data sample information in the original dataset lacking the label;

and finally, leaving the missing values in the characteristic information unfilled.

3. The method of claim 1, wherein before training the Bayesian ridge regression (BR) model, the trained training data set is further subjected to feature processing, the feature processing comprising the steps of:

firstly, splitting the characteristic information of the trained training data set to generate a plurality of independent characteristic information;

secondly, performing one-hot binarization processing on the independent characteristic information respectively, and converting all the characteristic information into discrete characteristics;

then deleting abnormal characteristic information beyond a normal range, and deleting sample information in the original data set lacking a tag;

finally, the missing value in the characteristic information is filled with 0.

4. An artificial intelligence based endometrial cancer risk screening method according to claim 1, wherein said structured training data set is trained using an improved Xgboost loss model to obtain a trained training data set, said training method comprising the steps of:

firstly, loading the structured training data set, dividing a part of the samples in the structured training data set into 10 folds for cross-validating the training model, and using the remaining samples in the structured training data set as an independent validation set to verify the effect of the endometrial cancer risk screening prediction model;

secondly, inputting the structured training data set into the improved Xgboost loss model to obtain the indices of the leaf nodes, which form new feature combinations; analyzing the new feature combination columns against the real labels by correlation analysis; selecting the features with higher correlation and combining them with the original features; and training the improved Xgboost loss model again on the combined new data features by a grid search method to find the optimal model;

finally, using the grid search method to automatically search for the optimal parameters on the new training set, with 10-fold cross-validation used during the parameter search.

5. An artificial intelligence based endometrial cancer risk screening system, the risk screening system comprising:

the data processing module is used for performing data processing, taking medical record data information of endometrial cancer patients as original data, extracting characteristic information in a text of the original data, and performing labeling processing to form structured data;

the Xgboost loss model module is used for taking the structured data as input, and training or predicting the structured data by adopting the improved Xgboost loss model to obtain the trained or predicted data;

the Lasso model simplifying module is used for simplifying the trained data through the Lasso model to obtain a characteristic data matrix;

a Bayesian ridge regression (BR) model module, configured to use the feature data matrix and the trained or predicted data as inputs, perform training or prediction by using the Bayesian ridge regression (BR) model, obtain an endometrial cancer risk screening prediction model for the trained data, and obtain the predicted endometrial cancer risk prediction value for the predicted data;

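wherein the formula of the overall objective function of the improved Xgboost loss model is as follows:

$$\mathrm{Obj} = \sum_{i} \mathrm{FL}\left(p_{t,i}\right) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2};$$

wherein $\mathrm{FL}$ denotes the focal loss function evaluated for each sample $i$; $T$ represents the number of leaf nodes and $w_j$ represents the score of the $j$-th leaf node; $\gamma$ is used for controlling the number of leaf nodes and $\lambda$ is used for controlling the scores of the leaf nodes; if the real label is 1, then $p_t = p$; otherwise, when the real label is 0, $p_t = 1 - p$, where $p$ is the predicted probability that the label is 1.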

6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of an artificial intelligence based endometrial cancer risk screening method according to any one of claims 1-5 when said program is executed.

7. A computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of an artificial intelligence based endometrial cancer risk screening method according to any one of claims 1-5.