CN116228795A - Ultrahigh resolution medical image segmentation method based on weak supervised learning - Google Patents

Ultrahigh resolution medical image segmentation method based on weak supervised learning

Info

Publication number
CN116228795A
CN116228795A
Authority
CN
China
Prior art keywords
network
convolution
feature map
image
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310231039.3A
Other languages
Chinese (zh)
Inventor
刘博
王强
周子安
丁磊
杨滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202310231039.3A priority Critical patent/CN116228795A/en
Publication of CN116228795A publication Critical patent/CN116228795A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for segmenting ultra-high resolution medical images based on weakly supervised learning. By using weakly supervised deep learning techniques, a satisfactory segmentation effect can be obtained with only weak, image-level label information. In addition, the method not only addresses the weakness of the annotation information but also focuses on improving the training procedure for ultra-high resolution medical images, proposing several improvements in the preprocessing stage of the medical image dataset and the post-processing stage of model prediction, thereby overcoming shortcomings of the model training process in the prior art.

Description

Ultrahigh resolution medical image segmentation method based on weak supervised learning
Technical Field
The invention relates to the field of medical image semantic segmentation, and in particular to an ultra-high resolution medical image semantic segmentation method based on weakly supervised learning and a training method for its model. The method has been well validated on an ultra-high resolution medical image dataset.
Background
Today, a large number of semantic segmentation schemes revolve around fully supervised convolutional neural networks, which require huge amounts of pixel-level annotation, and labeling images at the pixel level by hand is time-consuming and labor-intensive. Studies have shown that annotators of the MSCOCO dataset take an average of 10.1 minutes to produce a pixel-by-pixel annotation for each picture. Compared with fully supervised semantic segmentation, weakly supervised semantic segmentation offers faster labeling, lower labor cost, and higher labeling efficiency. For ultra-high resolution medical image datasets in particular, the labeling workload is very large and annotators must have a professional background, which clearly increases the difficulty of label acquisition. In this case, the advantage of weakly supervised semantic segmentation is even more pronounced.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a method for segmenting ultra-high resolution medical images based on weakly supervised learning and a training method for its model, which can obtain a satisfactory segmentation effect with only weak, image-level label information by using weakly supervised deep learning techniques. In addition, the method not only addresses the weakness of the annotation information but also focuses on improving the training procedure for ultra-high resolution medical images, proposing several improvements in the preprocessing stage of the medical image dataset and the post-processing stage of model prediction, thereby overcoming shortcomings of the model training process in the prior art.
To achieve the above and other related objects, the present invention provides a training method for a medical image segmentation model based on weakly supervised learning, comprising the following steps: acquiring medical image data from a hospital to obtain a plurality of medical images to be segmented, where the annotated masks serve as references for the subsequent prediction maps; preprocessing the medical images to be segmented, and dividing the preprocessed image data into a training set, a validation set and a test set according to a preset proportion; then training a weakly supervised image segmentation model with the training set and the validation set; and, after training, testing the trained weakly supervised image segmentation model with the test set.
To achieve the above object and other related objects, the present invention further provides a method for segmenting ultra-high resolution medical images based on weakly supervised learning, whose segmentation flow is shown in fig. 1. By performing image segmentation with weak supervision, a relatively accurate segmentation result can be obtained with only image-level weak labels, and an ultra-high resolution medical image segmentation model based on weakly supervised learning can be obtained with the training method described above. The image segmentation method comprises the following steps:
The dataset is first preprocessed and expanded by data enhancement.
Then a segmentation network is built. The invention adopts the encoder-decoder structure common in the segmentation field: the encoder can be regarded as a feature extraction network that gradually reduces the size of the input data with pooling layers, while the decoder gradually restores target detail and the corresponding spatial dimensions through layers such as upsampling and deconvolution. The network adopted by the invention is improved from the ResNet50 feature extraction network: the last two ordinary convolution modules are replaced with two hole convolution modules whose sampling rates are set to 2 and 4 respectively. The input image enters from the contracting path on one side, where every three 3×3 convolution operations are paired with one max pooling step; during this period the feature map size keeps shrinking until max pooling has been applied twice. At the bottom of the model, the feature map first passes through the modified hole convolution module and then a 1×1 convolution to obtain the segmentation output.
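For illustration, the following is a minimal Python/TensorFlow sketch of such an encoder under stated assumptions: the helper names (conv_bn_relu, build_encoder), channel widths, and input size are placeholders rather than the full modified ResNet50 described above, but the stage layout (two stages of three 3×3 convolutions plus max pooling, hole convolutions at rates 2 and 4, and a final 1×1 convolution) follows the text.

```python
import tensorflow as tf

def conv_bn_relu(x, filters, dilation=1):
    # 3x3 convolution + batch norm + ReLU; dilation > 1 makes it a hole convolution
    x = tf.keras.layers.Conv2D(filters, 3, padding="same",
                               dilation_rate=dilation, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def build_encoder(input_shape=(512, 512, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (64, 128):                  # two stages: three 3x3 convs + one max pool each
        for _ in range(3):
            x = conv_bn_relu(x, filters)
        x = tf.keras.layers.MaxPool2D()(x)
    x = conv_bn_relu(x, 512, dilation=2)       # hole convolution module, sampling rate 2
    x = conv_bn_relu(x, 1024, dilation=4)      # hole convolution module, sampling rate 4
    logits = tf.keras.layers.Conv2D(1, 1)(x)   # final 1x1 convolution -> segmentation output
    return tf.keras.Model(inputs, logits)

model = build_encoder()
```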
Further, a global average pooling operation is used to optimize the performance of the convolutional neural network on the image semantic segmentation task. At the end of the segmentation network model, the feature map flows into the classifier for global average pooling, which strengthens the relation between feature maps and categories. The global average pooling layer classifies directly from the feature maps, which helps preserve the spatial position information of the image; and because the layer has no parameters that change during training, overfitting is effectively avoided. This both improves the performance of the model and reduces its parameter count.
In addition, the salient regions of the class activation maps from each training round are accumulated to obtain a more comprehensive class activation map for subsequent segmentation: the class activation map of each round is the average of the previous round's map and the current round's map. As training continues, the features learned by the network grow richer, and the most recent class activation maps carry larger weight, making the model's segmentation more accurate. A sketch of this accumulation follows.
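A minimal Python sketch of this round-by-round running average (the array shapes are stand-ins). Because the average is re-applied every round, earlier rounds decay geometrically and the most recent map always carries the largest weight, matching the description above.

```python
import numpy as np

def accumulate_cam(prev_cam, current_cam):
    # Average the previous accumulated CAM with the current round's CAM.
    if prev_cam is None:              # first training round: nothing accumulated yet
        return current_cam
    return 0.5 * (prev_cam + current_cam)

# usage sketch: the accumulator starts empty and is updated once per round
accumulated = None
for epoch_cam in [np.random.rand(64, 64) for _ in range(3)]:  # stand-in CAMs
    accumulated = accumulate_cam(accumulated, epoch_cam)
```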
Subsequently, for the input feature map, its dimension is H×W×1024. The feature map is first processed with a 1×1 convolution layer, changing its dimension to H×W×256. It is then fed into two parallel branches. In each branch, separable hole convolution with a different sampling rate is first applied; the two branches process the data-transformed image and the untransformed image respectively. The processed feature map then undergoes a point convolution that reduces its dimension to H×W×1, i.e. the feature map is reduced in dimension once by point convolution. This lets the network capture detailed information in the image more accurately and improves segmentation accuracy.
Further, an attention map is generated by applying a sigmoid function. The attention map is multiplied element by element with the original feature map, and the results are summed. The summed result is then processed by a point convolution layer, followed by batch normalization and an activation function. This series of operations restores the feature map size to H×W×1024. A sketch of this module follows.
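A hedged TensorFlow sketch of this attention module: the 1024→256 reduction, the H×W×1 point convolutions, the sigmoid attention, and the restoring point convolution follow the description above, while the dilation rates of the two branches and the internal channel width of the separable convolutions are assumptions. The parallel branches are interpreted here as two dilation-rate branches on one input; the twin transformed/untransformed pathway is handled outside this module.

```python
import tensorflow as tf

def attention_module(feature_map):
    # feature_map: H x W x 1024 tensor, as stated in the text
    x = tf.keras.layers.Conv2D(256, 1)(feature_map)               # 1x1 conv: 1024 -> 256
    weighted = []
    for rate in (2, 4):                                           # two parallel branches; rates assumed
        b = tf.keras.layers.SeparableConv2D(256, 3, padding="same",
                                            dilation_rate=rate)(x)  # separable hole convolution
        b = tf.keras.layers.Conv2D(1, 1)(b)                       # point convolution -> H x W x 1
        # sigmoid gives the attention map; multiply element-wise with the original features
        w = tf.keras.layers.Lambda(lambda t: t[0] * tf.sigmoid(t[1]))([feature_map, b])
        weighted.append(w)
    merged = tf.keras.layers.Add()(weighted)                      # sum the two weighted results
    out = tf.keras.layers.Conv2D(1024, 1)(merged)                 # point convolution back to 1024
    out = tf.keras.layers.BatchNormalization()(out)
    return tf.keras.layers.ReLU()(out)                            # restores H x W x 1024

inp = tf.keras.Input(shape=(64, 64, 1024))
module = tf.keras.Model(inp, attention_module(inp))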
The evaluation metric of the model is the mean intersection over union (mIoU): IoU is the intersection of the model's predicted region and the ground-truth region divided by the union of the two, and mIoU is the average of IoU over all classes. The invention evaluates the predictive performance of the algorithm on the DCI medical image dataset, and the results show that the method achieves competitive performance. The segmentation effect is verified on the test set, narrowing the accuracy gap to fully supervised segmentation.
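The mIoU metric itself is straightforward to compute; a NumPy sketch (the class count and random masks are placeholders):

```python
import numpy as np

def mean_iou(pred, target, num_classes=2):
    # Per-class intersection over union, averaged over classes.
    # pred and target are integer label maps of the same shape.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                       # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# usage sketch with random masks
p = np.random.randint(0, 2, (256, 256))
t = np.random.randint(0, 2, (256, 256))
print(mean_iou(p, t))
```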
Drawings
Fig. 1 is a schematic overall flow chart of a segmentation method according to the present invention.
Fig. 2 is a diagram of a network structure according to the present invention.
Fig. 3 is a diagram of the hardware configuration and software environment used by the present invention.
Fig. 4 is a schematic diagram illustrating sliding clipping of the expansion prediction method according to the present invention.
Fig. 5 is a schematic diagram of a segmentation training process index according to the present invention.
Fig. 6 shows the prediction results on a partial (small) image according to the present invention.
Fig. 7 shows the prediction result on an ultra-high resolution image according to the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to the drawings and through specific examples, so that those skilled in the art can easily understand the other advantages and effects of the invention from this disclosure. The invention may also be practiced or carried out in other, different embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the invention. It should be noted that the following embodiments and the features within them may be combined with each other when no conflict arises.
It should be noted that the illustrations provided in the following embodiments merely explain the basic concept of the invention in a schematic way; the drawings show only the components related to the invention rather than the number, shape and size of the components in an actual implementation, where the form, quantity and proportion of each component may change arbitrarily and the layout may be more complex.
The invention relates to an ultra-high resolution medical image segmentation method based on weakly supervised learning and a training method for its model. As shown in fig. 1, the specific flow is as follows: first, preprocessing and data enhancement are applied to the cleaned raw data, which improves the robustness and accuracy of the model. After preprocessing and enhancement, the image is input into a convolutional neural network, and a series of operations such as convolution, up- and down-sampling and residual connection are performed to generate a class activation map, further improving the accuracy and performance of the model. A segmentation model is then retrained using the class activation map. Finally, the effect of the model is verified, mainly by monitoring its mean intersection-over-union (mIoU) metric. This metric helps evaluate the performance of the model and guides further optimization.
The specific algorithm is described below:
(1) Data preprocessing
Data preprocessing is performed on the DCI medical image dataset. Because image sizes in the dataset differ greatly, the proportions of segmentation targets are unbalanced, and images may contain overly dark or overly bright regions, the images must be cropped before being fed into the network for training. To keep training consistent and save GPU memory, the image length and width are limited to a uniform side length of 1150, with the aspect ratio kept unchanged. In addition, to preserve information at image edges, the invention adopts an overlapped cropping strategy: when the ultra-high resolution medical image is cropped, the edge regions of adjacent crops overlap. This strategy requires no scaling of the original image, so the pixel value at each position matches the original and scaling errors are avoided. The length of the overlap region is one quarter of the side length.
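A NumPy sketch of this overlapped cropping; boundary crops are simply clipped at the image border, which is an assumption about edge handling not stated in the text.

```python
import numpy as np

def overlapped_crops(image, side=1150):
    # Fixed crop side of 1150 pixels; adjacent crops overlap by a quarter of the side.
    overlap = side // 4
    stride = side - overlap
    h, w = image.shape[:2]
    crops = []
    for top in range(0, max(h - overlap, 1), stride):
        for left in range(0, max(w - overlap, 1), stride):
            crop = image[top:top + side, left:left + side]  # clipped at the border
            crops.append(((top, left), crop))
    return crops

tiles = overlapped_crops(np.zeros((2300, 3000, 3)))
```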
(2) Data enhancement
The DCI medical image dataset contains 149 images of different resolutions, of which 113 are positive and 36 are negative. Given these characteristics, the invention adopts a variety of data enhancement methods, including scaling, mirroring, up-down flipping, xy translation, cropping, rotation with cropping, and Gaussian smoothing. These methods derive more training samples from the raw data, prevent model overfitting, and improve the generalization ability of the model. A sketch of such a pipeline follows.
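A minimal sketch of such an augmentation pipeline using SciPy, covering a subset of the listed operations; the probabilities and parameter ranges are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate, shift

def augment(image, rng):
    # image: H x W x C float array
    if rng.random() < 0.5:
        image = image[:, ::-1]                               # mirror (left-right)
    if rng.random() < 0.5:
        image = image[::-1, :]                               # up-down flip
    if rng.random() < 0.5:
        dy, dx = rng.integers(-20, 21, size=2)               # xy translation
        image = shift(image, (dy, dx) + (0,) * (image.ndim - 2), mode="reflect")
    if rng.random() < 0.5:
        image = rotate(image, angle=rng.uniform(-15, 15),    # rotation, shape preserved
                       reshape=False, mode="reflect")
    if rng.random() < 0.5:
        s = rng.uniform(0.5, 1.5)
        sigma = (s, s) + (0,) * (image.ndim - 2)             # smooth spatially, not across channels
        image = gaussian_filter(image, sigma=sigma)          # Gaussian smoothing
    return image

rng = np.random.default_rng(0)
sample = augment(np.zeros((128, 128, 3)), rng)
```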
(3) Network construction
The present invention uses ResNet50 as the underlying neural network. During the training of a convolutional neural network, the fixed convolution and pooling operations reduce the size of the image while enlarging the receptive field of the convolution kernels, but they cause the generated class activation maps to contain only the most salient regions of the target. To address this problem with ResNet50-generated class activation maps, the invention proposes an improvement: hole convolution modules are added, improving the quality and reliability of the class activation maps. The sampling rates of the hole convolutions are set to 2 and 4 respectively. The overall network structure is shown in fig. 2. This approach effectively alleviates the information loss caused by downsampling and improves the accuracy with which the model locates regions of interest in medical images.
The invention exploits the ability of global average pooling to make full use of spatial information and ties the convolutional neural network to target classification. When the number of feature maps in the last layer of the convolutional neural network matches the number of target categories, each feature map is more strongly associated with the features of one category, so category confidence maps for all classes can be generated from the feature maps.
Further, for a given input image, the present invention defines the output value at coordinates (m, n) of the k-th feature map in the final convolution layer of the backbone network as f_k(m, n). Global average pooling of this feature map gives:

F_k = Σ_{m,n} f_k(m, n)    formula (1)

Further, a fully connected layer is introduced after the global average pooling, and its output is normalized with a Softmax activation function. The weight w_k^c is defined as the weight in the fully connected layer connecting feature k to class c. For a class c, its Softmax input value S_c can be expressed as:

S_c = Σ_k w_k^c F_k    formula (2)

For a given class c, CAM_c(m, n) is defined as the class activation map of that class at location (m, n), expressed as:

CAM_c(m, n) = Σ_k w_k^c f_k(m, n)    formula (3)

Combining formula (2) and formula (3) establishes the association between the class and the feature map, expressed as:

S_c = Σ_{m,n} CAM_c(m, n)    formula (4)
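These formulas can be checked numerically; the following NumPy sketch implements formulas (1)-(3) on random data and verifies the identity of formula (4). The array sizes are placeholders.

```python
import numpy as np

H, W, K, C = 8, 8, 16, 2
f = np.random.rand(H, W, K)        # f_k(m, n): final-layer feature maps
w = np.random.rand(K, C)           # w_k^c: fully connected weights

F = f.sum(axis=(0, 1))             # formula (1): F_k = sum over (m, n) of f_k(m, n)
S = w[:, 1] @ F                    # formula (2): S_c for class c = 1
cam = f @ w[:, 1]                  # formula (3): CAM_c(m, n), shape (H, W)

# formula (4): the Softmax input equals the class activation map summed over all locations
assert np.isclose(S, cam.sum())
```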
The network model mainly comprises a backbone convolution network, an up-sampling module and a hole convolution module. The model takes the processed data as input; high-level features of the lesion area of the medical image are extracted by the backbone network, sent to the up-sampling module for processing, and passed through the hole convolution module to reduce information loss; finally, the CAM (class activation map) is used to supervise network training.
(4) Model training
To eliminate the influence of data transformations on the feature map, the invention adopts the idea of a twin network, i.e. a network with two branches that processes the data-transformed image and the untransformed image respectively, and evaluates the network's robustness to data transformations by comparing the similarity of the class activation maps obtained by the two branches. By simultaneously learning the correspondence between data transformations and label transformations, an effect equivalent to the feature maps generated by a segmentation network is achieved.
The overall training steps of the segmentation network are as follows:
the network uses the improved ResNet50 as a feature extractor, and the upper and lower networks share weights. The top half processes the image without data transformation, and after all the convolution layers, generates a class activation map by data transformation T. The lower half processes the data transformed image, directly after all the convolution layers, resulting in a class activation map. The loss function of this network training consists of three parts: the classification loss of two branches and the canonical loss of the two types of activation graphs generated.
The classification loss adopts the binary cross-entropy function:

L_cls = -(1/N) Σ_{i=1}^{N} [ y_i log(q_i) + (1 - y_i) log(1 - q_i) ]    formula (5)

There are N samples in the medical image dataset, where y_i denotes the true label of a medical image and q_i denotes the predicted label of the network. The invention adopts the two-norm of the class activation map difference as the regularization loss, which represents the distance from the difference vector of the class activation maps to the origin in vector space. Specifically, the regularization loss is calculated from the two-norm of the difference vector between the class activation maps obtained by the two branches:

L_eq = || T[Net(I)] - Net[T(I)] ||_2    formula (6)

Here T[Net(I)] denotes that the image is first fed into the network to extract features and the data transformation is applied afterwards, while Net[T(I)] denotes that the data transformation is applied first and the result is then fed into the network. Both orderings are commonly used in image processing to probe the relationship between data transformations and network feature extraction.
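A hedged TensorFlow sketch of this three-part loss; the equal weighting of the classification and regularization terms is an assumption, and the tensor shapes in the usage example are placeholders.

```python
import tensorflow as tf

def twin_losses(logits_plain, logits_aug, labels, cam_plain_T, cam_aug):
    # Binary cross entropy for each branch (formula 5) plus the two-norm
    # consistency loss between T[Net(I)] and Net[T(I)] (formula 6).
    bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    cls_loss = bce(labels, logits_plain) + bce(labels, logits_aug)
    eq_loss = tf.norm(cam_plain_T - cam_aug)   # ||T[Net(I)] - Net[T(I)]||_2
    return cls_loss + eq_loss

# usage sketch with stand-in tensors
labels = tf.constant([[1.0], [0.0]])
lp, la = tf.random.normal([2, 1]), tf.random.normal([2, 1])
camT, cam2 = tf.random.normal([2, 32, 32]), tf.random.normal([2, 32, 32])
loss = twin_losses(lp, la, labels, camT, cam2)
```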
After the regularized class activation map is obtained from the network, the probability map is converted into segmentation results by setting different thresholds. Since the quality of the class activation map directly affects the subsequent semantic segmentation, the segmentation results obtained by thresholding directly reflect the effectiveness of the weakly supervised semantic segmentation.
The hardware configuration and software environment of this experiment are shown in fig. 3. The model is built and trained with the TensorFlow deep learning framework developed by ***, version 2.3.0, with Python 3.8.6, CUDA 10.1 and cuDNN 7.6.5; all experiments run on an NVIDIA Quadro RTX 5000 16GB graphics card.
The model is trained with an Adam optimizer, which performs well in most scenarios. Compared with other optimizers, Adam computes quickly and requires less memory. More importantly, Adam adjusts the learning rate automatically, which improves tuning efficiency and reduces tuning difficulty; its hyperparameters generally need no adjustment, making it convenient to use. In this experiment, β1 is set to 0.9, β2 to 0.999, and ε to 10^-8. The batch size is set to 24, limited by GPU memory, with memory occupancy around 95%.
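For reference, this configuration maps onto an optimizer as follows; the learning rate is not stated in the text and the value below is an assumption, and the description's "10e-8" is read as 10^-8.

```python
import tensorflow as tf

# Adam as configured in the text: β1 = 0.9, β2 = 0.999, ε = 1e-8.
# learning_rate=1e-4 is an assumption, not a value from the description.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4,
                                     beta_1=0.9, beta_2=0.999, epsilon=1e-8)
```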
(5) Model prediction: because of the high resolution of DCI medical images, whose resolution can reach 20000 × 20000 with a single image occupying hundreds of megabytes, directly feeding a complete large image into the model would likely overflow GPU memory. The original image therefore has to be cut into small patches of a given patch size, set to 1024 in this embodiment. The invention proposes an expansion prediction method for segmenting ultra-high resolution medical images. The core idea of expansion prediction is to keep only the central region of each predicted patch and discard the surrounding edge region; by sliding-cropping the ultra-high resolution image, each discarded region becomes the center of a subsequent patch to be predicted. This method avoids the stitching marks produced by boundary feature extraction and reduces holes in the prediction result map.
The expansion prediction flow for the ultra-high resolution medical image is specifically as follows:
First, all large images of the test set are cropped according to the given patch size and divided into several small images of the same size; this embodiment sets the side length of the retained central region to 512. However, since the size of a large image is not necessarily an integer multiple of the patch size, the width or height of the last small image may be smaller than the patch size. To solve this problem, the invention adopts pixel filling. Specifically, the width of the original image is divided by the patch size to obtain a remainder, and subtracting the remainder from the patch size gives the number of pixels to fill in the horizontal (width) direction; likewise, dividing the height by the patch size gives a remainder, and subtracting it from the patch size gives the number of pixels to fill in the vertical (height) direction. The filled pixels are black, with RGB values (0, 0, 0). A pixel-filled region then sits at the lower right corner of the large image.
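The pixel-filling arithmetic can be written directly; the extra modulo guards the case where the side is already an exact multiple of the patch size, which the textual rule leaves implicit.

```python
def padding_amounts(width, height, patch_size=512):
    # Number of black pixels to add on the right and bottom so the image
    # becomes an integer multiple of patch_size.
    pad_w = (patch_size - width % patch_size) % patch_size   # 0 when already divisible
    pad_h = (patch_size - height % patch_size) % patch_size
    return pad_w, pad_h

print(padding_amounts(10656, 10656))   # the resolution quoted later in the text -> (96, 96)
```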
Next, sliding cropping is performed. When the large image to be predicted is cropped, any two adjacent crops are guaranteed to overlap by half, as illustrated in fig. 4. The sliding stride set by the invention is 512, so 512/2 = 256 pixels are filled on the top, bottom, left and right sides respectively, forming an edge-filling region around the large image. As the sliding window moves from left to right and from top to bottom, the image is cropped into an integer number of small images regardless of the large image's size, without losing any edge information.
The small images cropped from the large image are fed to the model, and the segmentation result maps obtained from inference are stitched back together. The stitching process is as follows: from each small result map, the central 512 × 512 region is taken in adjacency order, the surrounding regions are discarded, and the centers are stitched in sequence. This yields a large image assembled from the small-image prediction maps; if a filled black region remains at the lower right corner, it is cut off in proportion to the original pixel filling, giving a segmentation result map at the original image size. A sketch of the whole procedure appears below.
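A NumPy sketch of the whole expansion-prediction loop under the sizes stated above (window 1024, retained center 512, border margin 256). predict_fn is a stand-in for model inference on one window and is assumed to return a per-pixel map the same size as the window.

```python
import numpy as np

def expansion_predict(image, predict_fn, patch=1024, keep=512):
    # Pad the border by (patch - keep) / 2, slide a patch-sized window with
    # stride = keep, and keep only the central keep x keep region of each
    # prediction before stitching.
    margin = (patch - keep) // 2                       # 256 with the stated sizes
    h, w = image.shape[:2]
    pad_h = (keep - h % keep) % keep                   # bottom/right fill to a multiple of keep
    pad_w = (keep - w % keep) % keep
    padded = np.pad(image,
                    ((margin, margin + pad_h), (margin, margin + pad_w), (0, 0)),
                    constant_values=0)                 # black pixel filling
    out = np.zeros((h + pad_h, w + pad_w), dtype=np.float32)
    for top in range(0, h + pad_h, keep):
        for left in range(0, w + pad_w, keep):
            window = padded[top:top + patch, left:left + patch]
            pred = predict_fn(window)                  # patch x patch prediction
            out[top:top + keep, left:left + keep] = pred[margin:margin + keep,
                                                         margin:margin + keep]
    return out[:h, :w]                                 # crop away the filled area

# usage sketch with a dummy predictor returning the window's mean intensity per pixel
dummy = lambda win: np.full(win.shape[:2], win.mean(), dtype=np.float32)
mask = expansion_predict(np.zeros((1200, 900, 3), dtype=np.float32), dummy)
```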
The evaluation metric for the weakly supervised segmentation model is the mean intersection over union (mIoU). The metric is monitored during model training, as shown in fig. 5: the mIoU of the validation set stabilizes around 0.80 after 160 epochs, indicating that the training trend is stable and the model has essentially converged.
To verify the effectiveness of the proposed segmentation method, the algorithm's performance is evaluated on the ultra-high resolution DCI medical image test set, obtaining a mean intersection over union of 0.7732. The invention thus achieves a satisfactory segmentation effect while using only weak annotation information.
When the model is applied to a partial view of a large image from the prediction set, the resulting segmentation map is shown in fig. 6. The left side is the original image to be predicted, the middle is the manually annotated image, and the right side is the image predicted by the model obtained with the present method. The method achieves a good segmentation effect on this partial image, distinguishing the background region from the lesion region well; in the lower right corner region, the segmentation fineness of the method is clearly better than the manual annotation.
When the trained model is used to predict a complete large image of the prediction set, the comparison of segmentation results is shown in fig. 7: fig. 7 (a) is a manual annotation map with a resolution as high as 10656 × 10656, and fig. 7 (b) is the prediction result map. The proposed algorithm clearly achieves accurate segmentation performance on the ultra-high resolution medical image; the output segmentation mask has finer edge details and smoother transitions between regions, further narrowing the gap between its overall segmentation performance and that of traditional fully supervised segmentation methods.

Claims (2)

1. The ultra-high resolution medical image segmentation method based on weak supervised learning is characterized by comprising the following steps of:
firstly, preprocessing a data set and expanding the data set by using a data enhancement means;
then building a segmentation network adopting an encoder-decoder structure, the encoder serving as a feature extraction network that gradually reduces the size of the input data with pooling layers, and the decoder restoring target detail and the corresponding spatial dimensions; the adopted network is improved from the ResNet50 feature extraction network, with the last two convolution modules replaced by two hole convolution modules whose sampling rates are set to 2 and 4 respectively; the input image enters from the contracting path on one side, where every three 3×3 convolution operations are paired with one max pooling step; during this period the feature map size keeps shrinking until max pooling has been applied twice; the feature map then passes through the hole convolution module, undergoes two 3×3 convolutions, and finally a 1×1 convolution yields the segmentation output;
at the end of the segmentation network model, a feature map with 1024 channels is obtained, which then flows into the classifier for a global average pooling operation;
accumulating the significant areas of the class activation graphs of each training round, wherein the class activation graphs of each round are obtained by averaging the class activation graphs of the previous round and the class activation graphs of the current round;
subsequently, for the input feature map, its dimension is H×W×1024; the feature map is first processed with a 1×1 convolution layer, changing its dimension to H×W×256; it is then fed into two parallel branches; in each branch, separable hole convolution with a different sampling rate is first applied, the two branches processing the data-transformed image and the untransformed image respectively; the processed feature map then undergoes a point convolution that changes its dimension to H×W×1, i.e. the feature map is reduced in dimension once by point convolution;
generating an attention map by applying a sigmoid function; the attention map is multiplied element by element with the original feature map and the results are summed; the summed result is then processed by a point convolution layer, followed by batch normalization and an activation function, restoring the feature map dimension to H×W×1024.
2. The ultra-high resolution medical image segmentation method based on weak supervised learning as set forth in claim 1, wherein:
(1) Data preprocessing
Performing data preprocessing on the DCI medical image dataset; the length and width of the image are limited to a uniform side length of 1150 with the aspect ratio kept unchanged; to preserve information at the image edge, an overlapped cropping strategy is adopted; the length of the overlap region is one quarter of the side length;
(2) Data enhancement
The DCI medical image dataset comprises a plurality of images of different resolutions, including positive and negative images; given the characteristics of the dataset, a variety of data enhancement methods are adopted, including scaling, mirroring, up-down flipping, xy translation, cropping, rotation with cropping, and Gaussian smoothing;
(3) Network construction
ResNet50 was used as the underlying neural network; replacing the standard convolution by using a hole convolution in the last two convolution modules of ResNet 50;
for a given input image, the output value at coordinates (m, n) of the k-th feature map in the final convolution layer of the backbone network is defined as f_k(m, n); global average pooling is performed on this feature map, and the pooled result is expressed as:

F_k = Σ_{m,n} f_k(m, n)    formula (1)

a fully connected layer is introduced after the global average pooling, and the output of the fully connected layer is normalized with a Softmax activation function; the weight w_k^c is defined as the weight in the fully connected layer connecting feature k to class c; for a class c, its Softmax input value S_c is expressed as:

S_c = Σ_k w_k^c F_k    formula (2)

for a given class c, CAM_c(m, n) is defined as the class activation map of that class at location (m, n), expressed as:

CAM_c(m, n) = Σ_k w_k^c f_k(m, n)    formula (3)

combining formula (2) and formula (3) establishes the association between the class and the feature map, expressed as:

S_c = Σ_{m,n} CAM_c(m, n)    formula (4)
The network model comprises a backbone convolution network, an up-sampling module and a hole convolution module; the model takes the processed data as input, high-level features are extracted by the backbone network and sent to the up-sampling module for processing, and a segmentation map is used to supervise network training;
(4) Model training
The idea of a twin network is adopted, i.e. a network with two branches processes the data-transformed image and the untransformed image respectively, and the robustness of the network to data transformations is evaluated by comparing the similarity of the class activation maps obtained by the two branches; meanwhile, by learning the correspondence between data transformations and label transformations, an effect equivalent to the feature maps generated by a segmentation network is achieved;
the network adopts the improved ResNet50 as a feature extractor, with the upper and lower branches sharing weights; the upper half processes the untransformed image and, after all convolution layers, its class activation map is obtained by applying the data transformation T; the lower half processes the data-transformed image, and a class activation map is obtained directly after all convolution layers; the loss function of this network training consists of three parts: the classification losses of the two branches and the regularization loss between the two generated class activation maps;
the classification loss adopts the binary cross-entropy function:

L_cls = -(1/N) Σ_{i=1}^{N} [ y_i log(q_i) + (1 - y_i) log(1 - q_i) ]    formula (5)

there are N samples in the medical image dataset, where y_i denotes the true label of a medical image and q_i denotes the predicted label of the classification network; the two-norm of the class activation map difference is adopted as the regularization loss, representing the distance from the difference vector of the class activation maps to the origin in vector space; the regularization loss is calculated from the two-norm of the difference vector between the class activation maps obtained by the two branches:

L_eq = || T[Net(I)] - Net[T(I)] ||_2    formula (6)
wherein T[Net(I)] denotes that the image is first fed into the network to extract features and the data transformation is applied afterwards, and Net[T(I)] denotes that the data transformation is applied first and the result is then fed into the network;
after the regularized class activation map is obtained from the network, the probability map is converted into segmentation results by setting different thresholds.
CN202310231039.3A 2023-03-13 2023-03-13 Ultrahigh resolution medical image segmentation method based on weak supervised learning Pending CN116228795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310231039.3A CN116228795A (en) 2023-03-13 2023-03-13 Ultrahigh resolution medical image segmentation method based on weak supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310231039.3A CN116228795A (en) 2023-03-13 2023-03-13 Ultrahigh resolution medical image segmentation method based on weak supervised learning

Publications (1)

Publication Number Publication Date
CN116228795A true CN116228795A (en) 2023-06-06

Family

ID=86584264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310231039.3A Pending CN116228795A (en) 2023-03-13 2023-03-13 Ultrahigh resolution medical image segmentation method based on weak supervised learning

Country Status (1)

Country Link
CN (1) CN116228795A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824146A (en) * 2023-07-05 2023-09-29 深圳技术大学 Small sample CT image segmentation method, system, terminal and storage medium
CN116824146B (en) * 2023-07-05 2024-06-07 深圳技术大学 Small sample CT image segmentation method, system, terminal and storage medium

Similar Documents

Publication Publication Date Title
CN111860670B (en) Domain adaptive model training method, image detection method, device, equipment and medium
CN107992874B (en) Image salient target region extraction method and system based on iterative sparse representation
CN111914698B (en) Human body segmentation method, segmentation system, electronic equipment and storage medium in image
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN111696110B (en) Scene segmentation method and system
CN111612008A (en) Image segmentation method based on convolution network
CN114266794B (en) Pathological section image cancer region segmentation system based on full convolution neural network
Thanikkal et al. Advanced plant leaf classification through image enhancement and canny edge detection
CN115331245A (en) Table structure identification method based on image instance segmentation
Liang et al. Automatic defect detection of texture surface with an efficient texture removal network
CN114821052A (en) Three-dimensional brain tumor nuclear magnetic resonance image segmentation method based on self-adjustment strategy
CN116228795A (en) Ultrahigh resolution medical image segmentation method based on weak supervised learning
CN114565605A (en) Pathological image segmentation method and device
CN113177956B (en) Semantic segmentation method for unmanned aerial vehicle remote sensing image
CN113962905A (en) Single image rain removing method based on multi-stage feature complementary network
Han et al. Recognition and segmentation of complex texture images based on superpixel algorithm and deep learning
CN117636298A (en) Vehicle re-identification method, system and storage medium based on multi-scale feature learning
CN117593275A (en) Medical image segmentation system
CN116129417A (en) Digital instrument reading detection method based on low-quality image
CN111768436B (en) Improved image feature block registration method based on fast-RCNN
Lv et al. An image rendering-based identification method for apples with different growth forms
CN117274294B (en) Homologous chromosome segmentation method
CN117893934B (en) Improved UNet3+ network unmanned aerial vehicle image railway track line detection method and device
CN110837801B (en) SAR image fusion shielding target identification method based on segmentation image sparse representation
CN117315702B (en) Text detection method, system and medium based on set prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination