CN116805534A: Disease typing method, system, medium and equipment based on weak supervision learning


Info

Publication number: CN116805534A
Application number: CN202310708568.8A
Authority: CN
Prior art keywords: data, model, branch, module, feature
Other languages: Chinese (zh)
Inventors: 朱程广, 张维夏
Assignee (current and original): Shanghai Jiaotong University
Application filed by Shanghai Jiaotong University
Legal status: Pending

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02A: Technologies for adaptation to climate change
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Image Analysis (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The application provides a disease typing method, system, medium and equipment based on weakly supervised learning, comprising the following steps: step 1: acquiring image data, clinical data and pathological data, and performing data preprocessing; step 2: building a prediction model comprising a multi-modal feature extraction network and a feature fusion prediction network, and training the prediction model; step 3: evaluating the performance of the prediction model by AUC, precision, accuracy, recall and F1-score. The method is based on multi-modal data and adopts a cross-modal cross-fusion approach to capture the correlations and multi-scale characteristics of the different modalities; it is applicable to clinical prediction tasks such as typing different diseases and grading risk, and achieves high precision and recall even in the absence of pixel-level annotation.

Description

Disease typing method, system, medium and equipment based on weak supervision learning
Technical Field
The application relates to the technical field of weakly supervised learning, and in particular to a disease typing method, system, medium and equipment based on weakly supervised learning.
Background
Disease typing plays an important role in the diagnosis and treatment of disease, helping clinicians grasp, in a targeted way, the causal factors behind disease onset and progression. To address problems in clinical practice such as missing data, high annotation cost and low prediction accuracy, the application aims to mine the associations within multi-modal data, extract multi-modal features through weakly supervised learning, and realize cross-modal data fusion perception and disease typing, thereby assisting physicians in prognostic evaluation, providing targeted options for subsequent treatment, and improving the efficiency and accuracy of disease diagnosis and treatment.
Patent document CN116013535A (application number: CN202211251219.X), in the technical field of disease typing, relates to a deep-learning-based complex disease typing method, system and storage medium. That method comprises: collecting multi-modal disease omics data and screening its features to obtain first features; fusing the first features using sparse canonical correlation analysis to obtain multi-modal data; inputting the multi-modal data into a neural network classifier and extracting feature representations; clustering the feature representations, retraining the classifier with the clustering result as pseudo labels, and iteratively updating the network parameters; and extracting feature representations with the updated classifier and clustering again until the clustering result converges. That patent cannot resolve the current technical difficulties and cannot achieve the technical effects of the present application.
Disclosure of Invention
Aiming at the defects in the prior art, the application aims to provide a disease typing method, system, medium and equipment based on weakly supervised learning.
The disease typing method based on weak supervised learning provided by the application comprises the following steps:
step 1: acquiring image data, clinical data and pathological data, and performing data preprocessing;
step 2: building a prediction model comprising a multi-modal feature extraction network and a feature fusion prediction network, and training the prediction model;
step 3: the performance of the prediction model is evaluated by AUC, precision, accuracy, recall and F1-score.
Preferably, the step 1 includes:
step 1.1: reading data of different modes according to the unique number index of a patient in a data query and pairing mode, and realizing sample deduplication, outlier processing and standardization processing by using a third party library Pandas of python;
step 1.2: filling of missing clinical indexes is achieved by MICE;
step 1.3: cutting the image data;
step 1.4: and carrying out normalization processing on pixel values of the image data.
Preferably, the step 2 includes:
step 2.1: dividing the disease typing model into four branches, wherein branch 1 comprises pathological data processing, a feature extraction network Model1 based on the pre-trained model ResNet50, and a multi-layer perceptron MLP1; branch 2 comprises pathological data processing, a feature extraction network Model2 based on the ResNet50 framework, and MLP2; branch 3 comprises image data processing, Model3 based on the 3D ResNet framework, and MLP3; branch 4 comprises clinical data processing and MLP4; the data processed in branch 1 yield L1 patches of size 3×256×256, each patch is input into the pre-trained ResNet50 network to extract a 1024×6×6 feature map, and a feature vector V1 of length 256 is obtained after pooling and MLP1; the pathological image processed in branch 2 has size 3×256×256, is input into the feature extraction network to obtain a 1024×6×6 feature map, and yields a feature vector V2 of length 256 after pooling and MLP2; the image data processed in branch 3 have size 100×256×256 and are input into the 3D ResNet network to obtain a 256×6×6 feature map, yielding a feature vector V3 of length 256 after pooling and MLP3; branch 4 encodes the clinical indexes into a feature vector of length L2, and a feature vector V4 of length 256 is obtained after MLP4; Model1 and Model2 have the same network structure, Model1 being pre-trained on a public data set while Model2 is retrained on the present task to obtain its parameters; MLP1, MLP2, MLP3 and MLP4 are all multi-layer perceptrons whose parameters are obtained by training; the feature extraction module outputs the feature vectors V1, V2, V3 and V4, and the feature fusion module weights and concatenates V1, V2, V3 and V4 through the learnable parameters a1, a2, a3 and a4 to obtain a feature vector V5; the multi-modally fused features output the prediction probability p through a fully connected layer;
step 2.2: dividing the sample data into a training set and a testing set, adopting ten-fold cross validation in the training set, determining a training stop threshold according to an AUC value, and selecting a model with optimal performance as a final model.
Preferably, the step 3 includes:
the performance of the model on the test set is assessed using AUC, precision = TP/(TP+FP), accuracy = (TP+TN)/(TP+TN+FP+FN), recall = TP/(TP+FN), and F1-score = 2×TP/(2×TP+FP+FN);
wherein AUC is the sum of the areas of all parts under the ROC curve, whose abscissa is the false positive rate and whose ordinate is the true positive rate; TP denotes predicted positive, actually positive; TN denotes predicted negative, actually negative; FP denotes predicted positive, actually negative; FN denotes predicted negative, actually positive; for all indicators, a larger value indicates better prediction model performance.
The disease typing system based on weak supervised learning provided by the application comprises:
module M1: acquiring image data, clinical data and pathological data, and performing data preprocessing;
module M2: building a prediction model comprising a multi-modal feature extraction network and a feature fusion prediction network, and training the prediction model;
module M3: the performance of the prediction model is evaluated by AUC, precision, accuracy, recall and F1-score.
Preferably, the module M1 comprises:
module M1.1: reading data of different modes according to the unique number index of a patient in a data query and pairing mode, and realizing sample deduplication, outlier processing and standardization processing by using a third party library Pandas of python;
module M1.2: filling of missing clinical indexes is achieved by MICE;
module M1.3: cutting the image data;
module M1.4: and carrying out normalization processing on pixel values of the image data.
Preferably, the module M2 comprises:
module M2.1: dividing the disease typing model into four branches, wherein branch 1 comprises pathological data processing, a feature extraction network Model1 based on the pre-trained model ResNet50, and a multi-layer perceptron MLP1; branch 2 comprises pathological data processing, a feature extraction network Model2 based on the ResNet50 framework, and MLP2; branch 3 comprises image data processing, Model3 based on the 3D ResNet framework, and MLP3; branch 4 comprises clinical data processing and MLP4; the data processed in branch 1 yield L1 patches of size 3×256×256, each patch is input into the pre-trained ResNet50 network to extract a 1024×6×6 feature map, and a feature vector V1 of length 256 is obtained after pooling and MLP1; the pathological image processed in branch 2 has size 3×256×256, is input into the feature extraction network to obtain a 1024×6×6 feature map, and yields a feature vector V2 of length 256 after pooling and MLP2; the image data processed in branch 3 have size 100×256×256 and are input into the 3D ResNet network to obtain a 256×6×6 feature map, yielding a feature vector V3 of length 256 after pooling and MLP3; branch 4 encodes the clinical indexes into a feature vector of length L2, and a feature vector V4 of length 256 is obtained after MLP4; Model1 and Model2 have the same network structure, Model1 being pre-trained on a public data set while Model2 is retrained on the present task to obtain its parameters; MLP1, MLP2, MLP3 and MLP4 are all multi-layer perceptrons whose parameters are obtained by training; the feature extraction module outputs the feature vectors V1, V2, V3 and V4, and the feature fusion module weights and concatenates V1, V2, V3 and V4 through the learnable parameters a1, a2, a3 and a4 to obtain a feature vector V5; the multi-modally fused features output the prediction probability p through a fully connected layer;
module M2.2: dividing the sample data into a training set and a testing set, adopting ten-fold cross validation in the training set, determining a training stop threshold according to an AUC value, and selecting a model with optimal performance as a final model.
Preferably, the module M3 includes:
the performance of the model on the test set is assessed using AUC, precision = TP/(TP+FP), accuracy = (TP+TN)/(TP+TN+FP+FN), recall = TP/(TP+FN), and F1-score = 2×TP/(2×TP+FP+FN);
wherein AUC is the sum of the areas of all parts under the ROC curve, whose abscissa is the false positive rate and whose ordinate is the true positive rate; TP denotes predicted positive, actually positive; TN denotes predicted negative, actually negative; FP denotes predicted positive, actually negative; FN denotes predicted negative, actually positive; for all indicators, a larger value indicates better prediction model performance.
According to the present application, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the disease typing method based on weakly supervised learning.
The electronic equipment provided by the application comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the computer program realizes the steps of the disease typing method based on weak supervised learning when being executed by the processor.
Compared with the prior art, the application has the following beneficial effects:
the method is based on multi-modal data, captures the correlation and multi-scale characteristics of different modal data by adopting a cross-modal data cross-fusion method, can be suitable for predicting clinical applications such as types and risk grades of different diseases, and still can achieve high accuracy and recall rate under the condition of lacking pixel-level labeling.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of the disease typing method;
FIG. 2 is a diagram of the prediction network model.
Detailed Description
The present application will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit the application in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present application.
Example 1:
as shown in fig. 1, the present application provides a disease typing method based on weakly supervised learning, comprising: step 1: data preprocessing; step 2: model building and training; step 3: system performance evaluation. Taking benign/malignant prediction of lung nodules as an example, the implementation process of the system is introduced below.
The step 1 comprises the following steps:
step 1.1: multi-modal data deduplication, outlier processing and standardization. The data types covered by the application include image data, clinical data and pathological data. In the process of collecting patient data, problems such as duplicated records and missing modalities are unavoidable. Through data query and pairing, the application reads the data of the different modalities according to each patient's unique number index. Sample deduplication, outlier handling and standardization are implemented with the Python third-party library Pandas. Outlier processing refers to handling non-canonical clinical entries, such as converting non-real values into real numbers. For image or pathology data, the integrity of the data is checked and abnormal data are removed or deleted. The purpose of standardization is to convert data, such as multi-class categorical variables, into integer representations.
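As a minimal sketch of this preprocessing step, the Pandas routine below deduplicates by patient number, coerces non-canonical clinical entries to numeric, and standardizes the numeric columns. The column names (`patient_id`, `age`) are illustrative assumptions, not taken from the patent.

```python
import pandas as pd

def preprocess_clinical(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate by patient number, coerce non-canonical entries, standardize."""
    # Keep one record per unique patient number ("patient_id" is a placeholder name)
    df = df.drop_duplicates(subset="patient_id").copy()
    numeric_cols = [c for c in df.columns if c != "patient_id"]
    # Outlier/non-canonical handling: non-real entries (e.g. ">100") become NaN
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    # Standardize each clinical index to zero mean and unit variance
    df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
    return df
```

The NaN values produced by coercion are left for the subsequent imputation step (step 1.2).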
Step 1.2: completing missing data. Because clinical data are varied, examination indexes such as height, age, CA125 and hemoglobin are often missing. Common filling methods include mean filling, K-nearest-neighbor filling, multiple interpolation, random forest filling and MICE (Multiple Imputation by Chained Equations). The application adopts MICE to fill missing clinical indexes. For example, taking tumor benignity or malignancy as the final outcome, several sets of candidate filling values are randomly generated for each index sequence with missing entries, the statistical distribution of each candidate set is analyzed, and the filling values whose distribution fits best are selected.
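The chained-equations filling described above can be sketched with scikit-learn's `IterativeImputer`, which implements a MICE-style scheme; the matrix below is a fabricated placeholder for height, age, CA125 and hemoglobin values.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates the estimator)
from sklearn.impute import IterativeImputer

# Illustrative clinical matrix: rows = patients; columns = height, age, CA125, hemoglobin
X = np.array([
    [170.0, 62.0, 35.0, 13.5],
    [165.0, np.nan, 12.0, np.nan],
    [np.nan, 70.0, np.nan, 14.2],
    [172.0, 55.0, 20.0, 15.0],
])

# Each column with missing entries is regressed on the others, and the
# predictions are cycled until the imputations stabilize (chained equations).
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```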
Step 1.3: cropping the image data. The input size of the network model is B×C×M×M, where B is the number of samples fed to the network at once, C is the number of slices in a single sample, and M is the spatial size of an image. The original images are cropped and resized to 256×256.
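A minimal sketch of the cropping step, assuming inputs whose spatial axes are at least 256; the crop is taken from the image center, which is an assumption since the patent does not specify the crop anchor.

```python
import numpy as np

def center_crop(volume: np.ndarray, size: int = 256) -> np.ndarray:
    """Crop the last two (spatial) axes of a C×H×W array to size×size around the center."""
    h, w = volume.shape[-2], volume.shape[-1]
    top, left = (h - size) // 2, (w - size) // 2
    return volume[..., top:top + size, left:left + size]
```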
Step 1.4: data normalization. Image normalization scales the pixel values of the input image so that features of different dimensions vary over a similar range, which stabilizes convergence during model training. Let the original pixel values of an M×M image be y_1, y_2, …, y_{M×M}, let y_max denote the maximum pixel value and y_min the minimum pixel value. The normalized pixel value is ŷ_i = (y_i − y_min)/(y_max − y_min), so that ŷ_i lies between 0 and 1.
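The min-max normalization above, as a short sketch (the constant-image guard is an added safeguard, not part of the patent's description):

```python
import numpy as np

def minmax_normalize(img: np.ndarray) -> np.ndarray:
    """Map pixel values to [0, 1] via (y - y_min) / (y_max - y_min)."""
    y_min, y_max = float(img.min()), float(img.max())
    if y_max == y_min:  # constant image: avoid division by zero
        return np.zeros_like(img, dtype=np.float64)
    return (img - y_min) / (y_max - y_min)
```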
Step 1.5: dividing the data set. To address the scarcity of medical image samples, the application adopts ten-fold cross-validation. All sample data are divided into training and test sets at a 7:3 ratio. Ten-fold cross-validation is performed within the training set, and the model with the best performance is selected as the optimal model.
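The 7:3 split followed by ten-fold cross-validation can be sketched with scikit-learn; the feature matrix and labels below are random placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # placeholder feature matrix
y = rng.integers(0, 2, size=100)     # placeholder benign/malignant labels

# 7:3 division into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Ten train/validation splits of the training set; the ten validation
# folds are mutually non-overlapping by construction.
folds = list(KFold(n_splits=10, shuffle=True, random_state=0).split(X_tr))
```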
The step 2 comprises the following steps:
step 2.1: constructing a multi-modal data feature extraction and fusion prediction model comprising two parts: a multi-modal feature extraction network and a feature fusion prediction network. The multi-modal data are fed to the feature fusion network through multi-branch preprocessing. As shown in fig. 2, modules 2 and 3 extract global pathology features and CT image features using ResNet50 and 3D ResNet networks, respectively. Module 1 extracts local pathology features using a pre-trained ResNet50. Module 1 differs from module 2 in that module 2 is trainable, whereas module 1 uses a network model trained on a public data set. Clinical index features are extracted through a fully connected layer. The features extracted by the four branches are each encoded by a fully connected layer, and the encoded multi-modal features are sent to the feature fusion prediction module. The feature fusion module fuses the multi-branch features, for example by concatenating the multi-dimensional features. Finally, the extracted multi-modal fused features pass through a fully connected layer to perform the classification prediction task.
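The fusion head can be sketched in PyTorch as follows. The four branch backbones are omitted: each branch is assumed to have already produced a length-256 feature vector, and the learnable weights a1..a4 are realized here as a single `nn.Parameter`; this is a sketch of the described scheme, not the patent's implementation.

```python
import torch
import torch.nn as nn

class FusionTyping(nn.Module):
    """Weighted concatenation of four branch features followed by a classifier."""
    def __init__(self, feat_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(4))           # learnable a1..a4
        self.classifier = nn.Linear(4 * feat_dim, n_classes)  # fully connected layer

    def forward(self, v1, v2, v3, v4):
        vs = [v1, v2, v3, v4]
        # V5: weighted splicing of V1..V4 through a1..a4
        v5 = torch.cat([self.weights[i] * vs[i] for i in range(4)], dim=1)
        # prediction probability p
        return torch.softmax(self.classifier(v5), dim=1)
```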
Step 2.2: model training, obtaining the optimal model through ten-fold cross-validation. Specifically, the training set is randomly partitioned into ten train/validation splits whose validation folds are mutually non-overlapping. The model is trained on each of the ten training folds, its performance is verified on the corresponding validation fold, and a training stop criterion is determined from the AUC (Area Under Curve) value. For example, if the best AUC reached during training is AUC_best and the AUC does not exceed AUC_best for 30 consecutive epochs, training is stopped.
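The stopping rule can be sketched as a patience counter on the validation AUC; `evaluate_auc` is a hypothetical callback standing in for one epoch of training plus validation.

```python
def train_with_auc_stopping(evaluate_auc, max_epochs: int = 1000, patience: int = 30):
    """Stop when `patience` consecutive epochs fail to beat the best AUC so far."""
    best_auc, stale = 0.0, 0
    for epoch in range(max_epochs):
        auc = evaluate_auc(epoch)   # one epoch of training + validation (hypothetical)
        if auc > best_auc:
            best_auc, stale = auc, 0
        else:
            stale += 1
            if stale >= patience:   # e.g. 30 epochs without a new best: stop
                break
    return best_auc
```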
The step 3 comprises the following steps:
step 3.1: cross-validation;
step 3.2: performance evaluation. The performance of the model on the test set is evaluated using AUC, precision = TP/(TP+FP), accuracy = (TP+TN)/(TP+TN+FP+FN), recall = TP/(TP+FN), and F1-score = 2×TP/(2×TP+FP+FN). AUC is obtained by summing the areas of the parts under the ROC (receiver operating characteristic) curve, whose abscissa is the false positive rate and whose ordinate is the true positive rate. TP denotes predicted positive and actually positive; TN predicted negative and actually negative; FP predicted positive but actually negative; FN predicted negative but actually positive. For all indicators, a larger value indicates better model performance.
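The four formulas above, computed directly from the confusion-matrix counts:

```python
def metrics(tp: int, tn: int, fp: int, fn: int):
    """Precision, accuracy, recall and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, accuracy, recall, f1
```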
Example 2:
the application also provides a disease typing system based on the weak supervised learning, which can be realized by executing the flow steps of the disease typing method based on the weak supervised learning, namely, a person skilled in the art can understand the disease typing method based on the weak supervised learning as a preferred implementation mode of the disease typing system based on the weak supervised learning.
The disease typing system based on weakly supervised learning provided by the application comprises: module M1: acquiring image data, clinical data and pathological data, and performing data preprocessing; module M2: building a prediction model comprising a multi-modal feature extraction network and a feature fusion prediction network, and training the prediction model; module M3: evaluating the performance of the prediction model by AUC, precision, accuracy, recall and F1-score.
The module M1 includes: module M1.1: reading data of different modes according to the unique number index of a patient in a data query and pairing mode, and realizing sample deduplication, outlier processing and standardization processing by using a third party library Pandas of python; module M1.2: filling of missing clinical indexes is achieved by MICE; module M1.3: cutting the image data; module M1.4: and carrying out normalization processing on pixel values of the image data.
The module M2 includes: module M2.1: dividing the disease typing model into four branches, wherein branch 1 comprises pathological data processing, a feature extraction network Model1 based on the pre-trained model ResNet50, and a multi-layer perceptron MLP1; branch 2 comprises pathological data processing, a feature extraction network Model2 based on the ResNet50 framework, and MLP2; branch 3 comprises image data processing, Model3 based on the 3D ResNet framework, and MLP3; branch 4 comprises clinical data processing and MLP4; the data processed in branch 1 yield L1 patches of size 3×256×256, each patch is input into the pre-trained ResNet50 network to extract a 1024×6×6 feature map, and a feature vector V1 of length 256 is obtained after pooling and MLP1; the pathological image processed in branch 2 has size 3×256×256, is input into the feature extraction network to obtain a 1024×6×6 feature map, and yields a feature vector V2 of length 256 after pooling and MLP2; the image data processed in branch 3 have size 100×256×256 and are input into the 3D ResNet network to obtain a 256×6×6 feature map, yielding a feature vector V3 of length 256 after pooling and MLP3; branch 4 encodes the clinical indexes into a feature vector of length L2, and a feature vector V4 of length 256 is obtained after MLP4; Model1 and Model2 have the same network structure, Model1 being pre-trained on a public data set while Model2 is retrained on the present task to obtain its parameters; MLP1, MLP2, MLP3 and MLP4 are all multi-layer perceptrons whose parameters are obtained by training; the feature extraction module outputs the feature vectors V1, V2, V3 and V4, and the feature fusion module weights and concatenates V1, V2, V3 and V4 through the learnable parameters a1, a2, a3 and a4 to obtain a feature vector V5; the multi-modally fused features output the prediction probability p through a fully connected layer; module M2.2: dividing the sample data into a training set and a test set, adopting ten-fold cross-validation within the training set, determining a training stop threshold according to the AUC value, and selecting the model with the best performance as the final model.
Advantages of the method:
(1) Tumor tissue differs to some extent from benign tissue in the smoothness of its surface and the regularity of its shape. Compared with previous multi-instance-learning-based pathological data processing, branch 1 extracts local microscopic information while branch 2 extracts global pathological features, capturing global benign-versus-malignant morphological structure information;
(2) For a specific clinical task, some modalities may be missing; fusion prediction based on one or any combination of modalities can be accommodated through branch pruning;
(3) Based on weakly supervised learning, the application realizes prediction by capturing tissue-level and cell-level features as well as global and local features. Only case-level labels are required, without pixel-level segmentation annotation, which reduces the manual labeling cost.
The module M3 includes: the performance of the model on the test set is assessed using AUC, precision = TP/(TP+FP), accuracy = (TP+TN)/(TP+TN+FP+FN), recall = TP/(TP+FN), and F1-score = 2×TP/(2×TP+FP+FN); wherein AUC is the sum of the areas of all parts under the ROC curve, whose abscissa is the false positive rate and whose ordinate is the true positive rate; TP denotes predicted positive, actually positive; TN denotes predicted negative, actually negative; FP denotes predicted positive, actually negative; FN denotes predicted negative, actually positive; for all indicators, a larger value indicates better prediction model performance.
Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the apparatus, and the respective modules thereof provided by the present application may be regarded as one hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as being either software programs for implementing the methods or structures within hardware components.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method for disease typing based on weakly supervised learning, comprising:
step 1: acquiring image data, clinical data and pathological data, and performing data preprocessing;
step 2: building a disease typing model, performing multi-modal feature extraction and feature fusion, training the model, and predicting the disease type through the model;
step 3: the performance of the model is assessed by AUC, precision, accuracy, recall and F1-score.
2. The method for disease typing based on weakly supervised learning as claimed in claim 1, wherein the step 1 comprises:
step 1.1: reading data of different modes according to the unique number index of a patient in a data query and pairing mode, and realizing sample deduplication, outlier processing and standardization processing by using a third party library Pandas of python;
step 1.2: filling of missing clinical indexes is achieved by MICE;
step 1.3: cutting the image data;
step 1.4: and carrying out normalization processing on pixel values of the image data.
3. The method for disease typing based on weakly supervised learning as claimed in claim 1, wherein the step 2 comprises:
step 2.1: dividing the disease typing model into four branches, wherein branch 1 comprises pathological data processing, a feature extraction network Model1 based on the pre-trained model ResNet50, and a multi-layer perceptron MLP1; branch 2 comprises pathological data processing, a feature extraction network Model2 based on the ResNet50 framework, and MLP2; branch 3 comprises image data processing, Model3 based on the 3D ResNet framework, and MLP3; branch 4 comprises clinical data processing and MLP4; the data processed in branch 1 yield L1 patches of size 3×256×256, each patch is input into the pre-trained ResNet50 network to extract a 1024×6×6 feature map, and a feature vector V1 of length 256 is obtained after pooling and MLP1; the pathological image processed in branch 2 has size 3×256×256, is input into the feature extraction network to obtain a 1024×6×6 feature map, and yields a feature vector V2 of length 256 after pooling and MLP2; the image data processed in branch 3 have size 100×256×256 and are input into the 3D ResNet network to obtain a 256×6×6 feature map, yielding a feature vector V3 of length 256 after pooling and MLP3; branch 4 encodes the clinical indexes into a feature vector of length L2, and a feature vector V4 of length 256 is obtained after MLP4; Model1 and Model2 have the same network structure, Model1 being pre-trained on a public data set while Model2 is retrained on the present task to obtain its parameters; MLP1, MLP2, MLP3 and MLP4 are all multi-layer perceptrons whose parameters are obtained by training; the feature extraction module outputs the feature vectors V1, V2, V3 and V4, and the feature fusion module weights and concatenates V1, V2, V3 and V4 through the learnable parameters a1, a2, a3 and a4 to obtain a feature vector V5; the multi-modally fused features output the prediction probability p through a fully connected layer;
step 2.2: dividing the sample data into a training set and a testing set, adopting ten-fold cross validation in the training set, determining a training stop threshold according to an AUC value, and selecting a model with optimal performance as a final model.
4. The method for disease typing based on weakly supervised learning as claimed in claim 1, wherein the step 3 comprises:
the performance of the model on the test set is assessed using AUC, precision = TP/(TP+FP), accuracy = (TP+TN)/(TP+FP+TN+FN), recall = TP/(TP+FN), and F1-score = 2×TP/(2×TP+FP+FN);
wherein AUC is the sum of the areas of all parts under the ROC curve, the abscissa of the ROC curve being the false positive rate and the ordinate the true positive rate; TP denotes samples predicted positive that are actually positive; TN denotes samples predicted negative that are actually negative; FP denotes samples predicted positive that are actually negative; FN denotes samples predicted negative that are actually positive; for all indicators, a larger value indicates better model performance.
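The confusion-matrix metrics of claim 4 can be checked with a short numeric example. The counts below are invented purely for demonstration; the formulas are the standard definitions of precision, accuracy, recall and F1.

```python
# Example confusion-matrix counts (hypothetical, for illustration only)
TP, TN, FP, FN = 40, 45, 5, 10

precision = TP / (TP + FP)                   # TP/(TP+FP)
accuracy  = (TP + TN) / (TP + TN + FP + FN)  # (TP+TN)/(TP+FP+TN+FN)
recall    = TP / (TP + FN)                   # TP/(TP+FN)
f1        = 2 * TP / (2 * TP + FP + FN)      # 2*TP/(2*TP+FP+FN)

print(precision, accuracy, recall, f1)
```

With these counts, precision ≈ 0.889, accuracy = 0.85, recall = 0.8 and F1 ≈ 0.842; as the claim notes, larger values of every indicator mean better model performance.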
5. A disease typing system based on weakly supervised learning, comprising:
module M1: acquiring image data, clinical data and pathological data, and performing data preprocessing;
module M2: building a prediction model comprising a multi-modal feature extraction and feature fusion prediction network, and training the prediction model;
module M3: the performance of the predictive model is evaluated by AUC, precision, accuracy, recall and F1-score.
6. The weakly supervised learning based disease typing system of claim 5, wherein the module M1 comprises:
module M1.1: reading the data of different modalities indexed by each patient's unique number through data query and pairing, and performing sample de-duplication, outlier handling and standardization using the Python third-party library Pandas;
module M1.2: imputing missing clinical indicators using MICE (Multiple Imputation by Chained Equations);
module M1.3: cropping the image data;
module M1.4: normalizing the pixel values of the image data.
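Modules M1.1, M1.2 and M1.4 can be sketched with Pandas, which the claim names as the preprocessing library. The column names and tiny in-memory tables below are hypothetical, and simple mean imputation stands in for the full MICE procedure (which iteratively regresses each incomplete column on the others).

```python
import numpy as np
import pandas as pd

# Hypothetical per-modality tables keyed by the patient's unique number
clinical = pd.DataFrame({"patient_id": [1, 2, 2, 3],
                         "age": [63.0, 57.0, 57.0, np.nan]})
imaging  = pd.DataFrame({"patient_id": [1, 2, 3],
                         "scan": ["a.nii", "b.nii", "c.nii"]})

# M1.1: de-duplicate samples and pair modalities on the unique patient number
paired = clinical.drop_duplicates("patient_id").merge(imaging, on="patient_id")

# M1.2: fill the missing clinical indicator (mean imputation as a simplified
# stand-in for MICE)
paired["age"] = paired["age"].fillna(paired["age"].mean())

# M1.4: min-max normalise image pixel values to [0, 1]
pixels = np.array([0.0, 127.5, 255.0])
pixels_norm = (pixels - pixels.min()) / (pixels.max() - pixels.min())
print(paired.shape, pixels_norm)
```

M1.3 (cropping) is omitted here since it depends on the imaging format; in practice it would slice each volume or slide to the fixed input sizes used by the branches.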
7. The weakly supervised learning based disease typing system of claim 5, wherein the module M2 comprises:
module M2.1: dividing the disease typing model into four branches, wherein branch 1 comprises pathological data processing, a feature extraction network Model1 based on the pre-trained model ResNet50, and a multi-layer perceptron MLP1; branch 2 comprises pathological data processing, a feature extraction network Model2 based on the ResNet50 network framework, and MLP2; branch 3 comprises image data processing, Model3 based on the 3D ResNet network framework, and MLP3; branch 4 comprises clinical data processing and MLP4; the data processed in branch 1 yields L1 patches of size 3×256×256, each patch is input into the pre-trained ResNet50 network to extract a 1024×6×6 feature map, and a feature vector V1 of length 256 is obtained after pooling and MLP1; the pathological image processed in branch 2 has size 3×256×256, is input into the feature extraction network to obtain a 1024×6×6 feature map, and a feature vector V2 of length 256 is obtained after pooling and MLP2; the image data processed in branch 3 has size 100×256×256, is input into the 3D ResNet network to obtain a 256×6×6 feature map, and a feature vector V3 of length 256 is obtained after pooling and MLP3; branch 4 encodes the clinical indicators into a feature vector of length L2, and a feature vector V4 of length 256 is obtained after MLP4; Model1 and Model2 have the same network structure, Model1 being pre-trained on a public dataset while Model2 is retrained on the present task to obtain its model parameters; MLP1, MLP2, MLP3 and MLP4 are all multi-layer perceptrons whose network parameters are obtained by training; the feature extraction module outputs the feature vectors V1, V2, V3 and V4, and the feature fusion module weights and concatenates V1, V2, V3 and V4 using the learnable parameters a1, a2, a3 and a4 to obtain a feature vector V5, the multi-modally fused features outputting a prediction probability p through a fully-connected layer;
module M2.2: dividing the sample data into a training set and a test set, performing ten-fold cross-validation on the training set, determining the training stop threshold according to the AUC value, and selecting the best-performing model as the final model.
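The train/test split and ten-fold cross-validation of module M2.2 might look like the following sketch. The split sizes, the random seed, and the AUC placeholder are all assumptions; in the real pipeline each fold would train the full four-branch network and score it on the held-out fold.

```python
import numpy as np

rng = np.random.default_rng(42)
indices = rng.permutation(100)                 # 100 samples (hypothetical)
test_idx, train_idx = indices[:20], indices[20:]

folds = np.array_split(train_idx, 10)          # ten folds over the training set
best_auc, best_fold = -1.0, None
for k in range(10):
    val_idx = folds[k]
    fit_idx = np.concatenate([folds[j] for j in range(10) if j != k])
    # Placeholder for training on fit_idx and computing AUC on val_idx
    auc = rng.uniform(0.7, 0.9)
    if auc > best_auc:                         # keep the best-performing model
        best_auc, best_fold = auc, k
print(best_fold, round(best_auc, 3))
```

The AUC values here are random stand-ins; the point is the fold bookkeeping: every sample in the training set serves as validation data exactly once, and the fold whose model scores highest is kept as the final model.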
8. The weakly supervised learning based disease typing system of claim 5, wherein the module M3 comprises:
the performance of the model on the test set is assessed using AUC, precision = TP/(TP+FP), accuracy = (TP+TN)/(TP+FP+TN+FN), recall = TP/(TP+FN), and F1-score = 2×TP/(2×TP+FP+FN);
wherein AUC is the sum of the areas of all parts under the ROC curve, the abscissa of the ROC curve being the false positive rate and the ordinate the true positive rate; TP denotes samples predicted positive that are actually positive; TN denotes samples predicted negative that are actually negative; FP denotes samples predicted positive that are actually negative; FN denotes samples predicted negative that are actually positive; for all indicators, a larger value indicates better model performance.
9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the weak supervised learning based disease typing method of any one of claims 1 to 4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of the weakly supervised learning based disease typing method of any of claims 1 to 4.
CN202310708568.8A 2023-06-14 2023-06-14 Disease typing method, system, medium and equipment based on weak supervision learning Pending CN116805534A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310708568.8A CN116805534A (en) 2023-06-14 2023-06-14 Disease typing method, system, medium and equipment based on weak supervision learning


Publications (1)

Publication Number Publication Date
CN116805534A true CN116805534A (en) 2023-09-26

Family

ID=88079312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310708568.8A Pending CN116805534A (en) 2023-06-14 2023-06-14 Disease typing method, system, medium and equipment based on weak supervision learning


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117894454A (en) * 2024-01-29 2024-04-16 脉得智能科技(无锡)有限公司 Sarcopenia diagnosis method and device and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination