CN117036380A - Brain tumor segmentation method based on cascade Transformer - Google Patents

Brain tumor segmentation method based on cascade Transformer

Info

Publication number
CN117036380A
CN117036380A (application CN202310877040.3A)
Authority
CN
China
Legal status: Pending
Application number
CN202310877040.3A
Other languages
Chinese (zh)
Inventor
张建新
陈柏年
韩雨童
刘冬伟
孙鉴
张俊星
Current Assignee
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date
Filing date
Publication date
Application filed by Dalian Minzu University
Priority to CN202310877040.3A
Publication of CN117036380A


Classifications

    • G06T7/11 Region-based segmentation (Image analysis; Segmentation; Edge detection)
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06T2207/10088 Magnetic resonance imaging [MRI]
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20132 Image cropping
    • G06T2207/30016 Brain
    • G06T2207/30096 Tumor; Lesion


Abstract

The application discloses a brain tumor segmentation method based on cascaded Transformers that segments multi-modal magnetic resonance (MR) images of brain tumors with high precision. Because the size and position of brain tumors in MR images are not fixed, omni-dimensional dynamic convolution is introduced into the first- to third-layer encoders and decoders of the second stage, letting the network determine convolution kernel parameters dynamically from the input and thus adapt to the large differences among brain tumor MR images. Because long-range dependencies are important for accurately segmenting brain tumors, Swin Transformers serve as the fourth- to sixth-layer encoders and decoders of the second stage, capturing these long-range dependencies while keeping Transformer complexity manageable. By strengthening the network's ability to extract global and local feature information, the method produces more accurate segmentation results and can provide effective help for clinical diagnosis.

Description

Brain tumor segmentation method based on cascade Transformer
Technical Field
The application relates to the technical field of image processing, in particular to a brain tumor segmentation method based on cascade Transformer.
Background
Brain tumors pose a serious threat to human life and health, and early treatment of brain tumor patients has been shown to greatly increase the cure rate. Magnetic resonance (MR) imaging is one of the most common means of detecting brain tumors; it produces tumor images in different modalities and thereby provides a diagnostic basis for doctors. However, manual segmentation of brain tumors is time-consuming and labor-intensive, and different doctors may segment the same brain tumor image very differently depending on their experience, attention, and expertise. Computer-aided diagnosis techniques are therefore needed to help doctors segment brain tumors with high accuracy.
In recent years, deep learning has become the mainstream approach to brain tumor segmentation and can effectively segment tumors from MR images; three-dimensional U-Net architectures in particular have advanced rapidly. To obtain finer results, researchers have cascaded multiple segmentation networks into one, achieving coarse-to-fine segmentation of brain tumors. However, such cascaded networks generally learn brain tumor features with conventional convolutions of fixed kernel size, so the receptive field is limited and the tumor's long-range dependencies are hard to capture; moreover, conventional convolution kernel parameters are frozen after training and therefore adapt poorly to brain tumor images that differ greatly from one another.
Disclosure of Invention
To address these problems in the prior art, the application discloses a brain tumor segmentation method based on cascaded Transformers: two three-dimensional U-Nets are cascaded into a two-stage network, and Swin Transformers are introduced in the second stage as the deep encoders and decoders to capture the brain tumor's long-range dependencies and local feature information while keeping Transformer complexity in check, so that the tumor can be segmented accurately. The method comprises the following steps:
s1: processing the brain tumor multi-modality MR image into an input dataset of the network;
s2: constructing a network structure based on cascading transformers;
s3: training the network structure and storing a network model;
s4: loading a network model, testing the network model to obtain brain tumor segmentation results, and performing post-processing to obtain final segmentation results.
S1 specifically adopts the following steps:
s11: brain tumor image dataset BraTS2021 has four modality MR images, each modality having image dimensions 240mm 155mm. The dataset was labeled as four types, respectively: background and healthy sites, gangrene parts, edema parts and enhanced tumor parts. Since there is a lot of redundant background information in the image, the useless background information is clipped first.
S12: and carrying out normalization operation on the images of each mode by adopting a Z-score method, and solving the contrast difference problem among different modes. The formula is as follows:
where z is the input MR image, z Is the normalized MR image, μ is the input MR image mean, δ is the input MR image standard deviation.
S13: to distinguish background voxels from voxels with normalized values close to 0, a new channel is additionally created for one-hot encoding of foreground voxels. Input to the network during the training phase the image size is 5 x 128.
S14: the input data set is divided into a training set and a testing set according to a certain proportion, and the training set is subjected to data enhancement processing by using a plurality of data enhancement methods.
S2 specifically adopts the following steps:
s21: two three-dimensional U-Net with different depths are cascaded into a two-stage cascade network, the three-dimensional U-Net is used for coarse segmentation in the first stage, a new three-dimensional U-Net variant structure is adopted in the second stage, and the input of the new three-dimensional U-Net variant structure is a fused image of the input of the first stage and the coarse segmentation result, so that a final fine segmentation result is obtained, the brain tumor can be segmented from coarse to fine by the network, and the brain tumor segmentation precision is further improved.
S22: in the first stage, the three-dimensional U-Net contains 4 decoders and encoders, the lowest layer being the bottleneck layer, there being downsampling and upsampling operations after each layer of decoders and before the encoders, each encoder and decoder employing three-dimensional convolutions, normalized layers and nesting of nonlinear layers, where the convolution kernel size is 3 x 3, the normalized layers employ group normalization and the nonlinear layers employ LeakyRuLe. In addition, downsampling uses a three-dimensional max-pooling layer, and upsampling uses a tri-linear interpolation method.
S23: in the second stage, the encoder and decoder with first to third layers, the third convolution is replaced by the omnibearing dynamic convolution so as to improve the adaptability of the network to brain tumors with changeable shapes. The omnibearing dynamic convolution firstly carries out global average pooling on the input characteristic map, and then uses the full connection layer to compress the channel number of the characteristic map to 16. Then, the four branches are used to calculate the attention of the input channel dimension, the output channel dimension, the convolution kernel space dimension, and the convolution kernel number dimension, respectively. Finally, multiplying these four complementary points of attention to the corresponding dimensions yields the final convolution output.
S24: the feature map after passing through the first three layers of encoders is divided into a plurality of non-overlapping small blocks of a size, and the non-overlapping small blocks are linearly mapped to an input size suitable for the Swin transform encoder. These non-overlapping patches are then input into a Swin transform encoder, passed through a two-layer encoder and bottleneck layer, and passed through a two-layer decoder, and then the number of feature map channels is restored to an input size suitable for a normal encoder using three-dimensional convolution.
S241: each layer of Swin transducer encoder and decoder comprises two Swin transducer modules, each module firstly flattens the space size of the input characteristic diagram into a Patch with 1 dimension, then calculates the self-attention of the characteristic diagram in the window in the form of a sliding window for the Patch, and finally restores the self-attention to the characteristic diagram into a three-dimensional characteristic diagram.
S3 specifically adopts the following steps:
s31: setting network super parameters, using an Adam optimizer, dynamically adjusting learning rate by strategy training using cosine annealing, using automatic mixing precision, and finally storing optimal training weights.
S32: the training set is input into a cascading transducer model for training, the Focal loss function and the Dice loss function are fused to serve as the loss function of the network model, and all three focus areas of tumor, tumor core and enhanced tumor are used as segmentation targets.
S4 specifically adopts the following steps:
s41: loading training weights into the network, and predicting all tumors, tumor cores and segmentation results of the enhanced tumors of the test set.
S42: and (3) respectively overturning and predicting the segmentation result along the x-axis, the y-axis, the z-axis, the xy-axis, the xz-axis, the yz-axis and the xyz-axis of the input brain tumor by a Test Time Augmentation (TTA) method, and averaging the predicted result.
S43: and replacing the prediction result with a corresponding label according to a certain probability by using a voxel clipping method, and obtaining a final segmentation result. In the prediction result, voxels belonging to all tumors are replaced with a label 0 with a certain probability; the voxels belonging to the tumor core are replaced with labels 2 with a certain probability; voxels belonging to an enhanced tumor are replaced with label 1 with a certain probability. Meanwhile, the number of voxels and the overall enhanced tumor mass with the average probability smaller than a certain value are replaced by the label 1.
Drawings
To illustrate the embodiments of the application or the prior-art technical solutions more clearly, the drawings needed in their description are briefly introduced below. The drawings described below show only some embodiments of the application; other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of the method disclosed in the present application;
FIG. 2 is a network block diagram of the method disclosed in the present application;
FIG. 3 is an omnibearing dynamic convolution block diagram of the method disclosed in the present application;
FIG. 4 is a block diagram of the Swin Transformer of the method disclosed in the present application.
Detailed Description
To make the technical scheme and advantages of the application clearer, the technical scheme in the embodiments of the application is described clearly and completely below with reference to the accompanying drawings:
the application provides a brain tumor segmentation method based on cascade Transformer, which is characterized in that two three-dimensional U-Net cascades and introduces omnibearing dynamic convolution and Swin Transformer to capture long-distance dependence and local characteristic information of brain tumor and consider the complexity problem of Transformer, thus realizing accurate brain tumor segmentation of all tumors, tumor cores and three focus areas of enhanced tumor.
FIG. 1 is a flow chart of the method. First, image preprocessing converts the dataset into the input the network requires; second, the cascaded Transformer network model is constructed; the model is then trained on the dataset and the weights are saved; finally, the trained weights are used to predict the test set, and post-processing further improves the brain tumor segmentation accuracy.
FIG. 2 is a schematic diagram of the cascaded Transformer network model: two three-dimensional U-Nets are cascaded into a two-stage network, and the U-Net is modified in the second stage. The three-dimensional convolution in the first- to third-layer encoders and decoders is replaced with omni-dimensional dynamic convolution (structure shown in FIG. 3), while the fourth- to sixth-layer encoders and decoders are replaced with Swin Transformer modules (structure shown in FIG. 4).
The method disclosed by the application comprises the following specific steps:
s1: and (3) preprocessing data, and processing the brain tumor multi-mode MR image into an input format of a network. The method specifically adopts the following steps:
s11: brain tumor image dataset the BraTS2021 dataset has four modality MR images, each modality having image dimensions 240mm x 155mm, manually labeled for the dataset by an expert, including background and healthy sites being label 0, gangrene being label 1, edema being label 2 and enhanced tumor being label 4. Since there is a lot of redundant background information in the image, the useless background information is clipped first.
S12: and carrying out normalization operation on the images of each mode by adopting a Z-score method, and solving the contrast difference problem among different modes. The formula is as follows:
where z is the input MR image, z Is the normalized MR image, μ is the input MR image mean, δ is the input MR image standard deviation.
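As a concrete illustration of S12, the Z-score step can be sketched in NumPy as follows. Restricting the statistics to nonzero (brain) voxels is a common BraTS convention assumed here; the text itself does not specify it.

```python
import numpy as np

def z_score_normalize(image: np.ndarray) -> np.ndarray:
    """Z-score normalize one MR modality: z' = (z - mu) / delta.

    Statistics are computed over nonzero voxels only (assumed convention);
    true background stays exactly 0.
    """
    mask = image > 0
    mu = image[mask].mean()
    delta = image[mask].std()
    out = np.zeros_like(image, dtype=np.float64)
    out[mask] = (image[mask] - mu) / delta
    return out
```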
S13: to distinguish background voxels from voxels with normalized values close to 0, a new channel is additionally created for one-hot encoding of foreground voxels. Input to the network during the training phase the image size is 5 x 128.
S2: constructing a network structure based on cascade convertors, which specifically adopts the following modes:
s21: two three-dimensional U-Net with different depths are cascaded into a two-stage cascade network, and the three-dimensional U-Net is used for coarse segmentation in the first stage. The second stage adopts three-dimensional U-Net, and the input is the fusion image of the first stage input and the coarse segmentation result, so as to obtain the final fine segmentation result, so that the network can segment brain tumor from coarse to fine, and the brain tumor segmentation accuracy is further improved. As shown in fig. 2
S22: in the first stage, the three-dimensional U-Net contains 4 decoders and encoders, the lowest layer being the bottleneck layer, there being downsampling and upsampling operations after each layer of decoders and before the encoders, each encoder and decoder employing three-dimensional convolutions, normalized layers and nesting of nonlinear layers, where the convolution kernel size is 3 x 3, the normalized layers employ group normalization and the nonlinear layers employ LeakyRuLe. In addition, downsampling uses a three-dimensional max-pooling layer, and upsampling uses a tri-linear interpolation method.
S23: in the second stage, the encoder and decoder of the first to third layers, the third convolution is replaced by an omnibearing dynamic convolution to improve the adaptability of the network to brain tumors with changeable shapes. As shown in fig. 3, the omnibearing dynamic convolution firstly carries out global average pooling on the input feature map, and then uses the full-connection layer to compress the channel number of the feature map. Then, the four branches are used to calculate the attention of the input channel dimension, the output channel dimension, the convolution kernel space dimension, and the convolution kernel number dimension, respectively. The specific operation is that after the channel number dimension is lifted to the input channel number dimension, sigmoid is used to obtain the attention of the input channel; after the channel number dimension is lifted to the output channel number dimension, using sigmoid to obtain the attention of the output channel; the channel number dimension is changed into the third power of the convolution kernel size, the size is deformed into the single convolution kernel size of 1 multiplied by 3, and then sigmoid is used to obtain the convolution kernel space attention; the channel number dimension is changed to the number of convolution kernels set to 4, at the same time, the size is deformed into 4 multiplied by 1 the Softmax was then used to get the convolution kernel number attention. Finally, multiplying these four complementary points of attention to the corresponding dimensions yields the final convolution output.
S24: the feature map after passing through the first three layers of encoders is divided into a plurality of non-overlapping small blocks of a size, and the non-overlapping small blocks are linearly mapped to an input size suitable for the Swin transform encoder. The characteristic diagram channel number is then input into a Swin transform encoder, passed through a two-layer encoder and a bottleneck layer, passed through a two-layer decoder, and then restored to an input size suitable for a common encoder by using three-dimensional convolution.
S241: each layer of Swin transducer encoder and decoder has two Swin transducer modules. As shown in fig. 4, the Swin transducer module first flattens the spatial dimension of the input feature map to a Patch of 1 dimension; subsequently, inputting Patch into the window multi-head self-attention to capture global information in the window, obtaining attention force, and then using a two-layer multi-layer perceptron to enhance the non-linear capability of the network; secondly, capturing global information among different windows by adopting a sliding window multi-head self-attention, and obtaining a Patch by using a multi-layer perceptron layer; and finally, restoring the three-dimensional characteristic diagram. In addition, layerNorm was used to normalize the feature map before each windowed multi-head self-attention and multi-layer perceptron layer.
S3: training a network and storing a network model, wherein the method specifically comprises the following steps of:
s31: setting network super parameters, using an Adam optimizer, dynamically adjusting learning rate by strategy training using cosine annealing, using automatic mixing precision, and finally storing optimal training weights.
S32: the training set is input into a cascading transducer model for training, the Focal loss function and the Dice loss function are fused to serve as the loss function of the network model, and all three focus areas of tumor, tumor core and enhanced tumor are used as segmentation targets.
S4: loading a network model, testing the test set to obtain brain tumor segmentation results, and performing post-processing to obtain final segmentation results. The method specifically adopts the following steps:
s41: loading training weights into the network, and predicting all tumors, tumor cores and segmentation results of the enhanced tumors of the test set.
S42: and (3) respectively overturning and predicting the segmentation result along the x-axis, the y-axis, the z-axis, the xy-axis, the xz-axis, the yz-axis and the xyz-axis of the input brain tumor by a Test Time Augmentation (TTA) method, and averaging the predicted result. S43: and replacing the prediction result with a corresponding label according to a certain probability by using a voxel clipping method, and obtaining a final segmentation result. In the prediction result, voxels belonging to all tumors are replaced with a label 0 with a certain probability; the voxels belonging to the tumor core are replaced with labels 2 with a certain probability; voxels belonging to an enhanced tumor are replaced with label 1 with a certain probability. Meanwhile, the number of voxels and the overall enhanced tumor mass with the average probability smaller than a certain value are replaced by the label 1.
The foregoing is only a preferred embodiment of the application, but the scope of the application is not limited thereto: any equivalent substitution or modification that a person skilled in the art could readily conceive within the technical scope disclosed here, according to the technical scheme and inventive concept of the application, falls within the scope of protection of the application.

Claims (5)

1. A brain tumor segmentation method based on cascade Transformer, characterized by comprising the following steps:
processing the brain tumor multi-modality MR images into an input dataset for the network; constructing a network structure based on cascaded Transformers; training the network structure and saving a network model; loading the network model, testing it to obtain brain tumor segmentation results, and post-processing to obtain the final segmentation results.
2. The cascade-Transformer-based brain tumor segmentation method according to claim 1, wherein: the brain tumor image dataset comprises MR images of four modalities and is labeled with four classes: background and healthy tissue, necrosis, edema, and enhancing tumor; useless background information in the images is cropped;
the images of each modality are normalized with the Z-score method to reduce the contrast differences between modalities, according to the formula:

z′ = (z − μ) / δ

where z is the input MR image, z′ is the normalized MR image, μ is the mean of the input MR image, and δ is its standard deviation;
a new channel is created to one-hot encode the foreground voxels, distinguishing background voxels from voxels whose normalized values are close to 0;
the input dataset is divided into a training set and a test set in a set proportion, and the training set is augmented with several data augmentation methods, including random scaling, random flipping, Gaussian noise, Gaussian blur, and random brightness.
3. The cascade-Transformer-based brain tumor segmentation method according to claim 1, wherein constructing the cascaded-Transformer-based network structure comprises: cascading two three-dimensional U-Nets of different depths into a two-stage network, where the first stage uses a three-dimensional U-Net for coarse segmentation and the second stage adopts a new three-dimensional U-Net variant whose input is the first-stage input fused with the coarse segmentation result, yielding the final fine segmentation;
in the first stage, the three-dimensional U-Net comprises four encoders and decoders with the bottom layer as the bottleneck; a downsampling operation follows each encoder layer and an upsampling operation precedes each decoder layer; each encoder and decoder stacks a three-dimensional convolution layer, a normalization layer, and a nonlinear layer;
in the second stage, the three-dimensional convolution in the first- to third-layer encoders and decoders is replaced with omni-dimensional dynamic convolution to improve the network's adaptability to brain tumors of variable shape; the omni-dimensional dynamic convolution first applies global average pooling to the input feature map, then compresses the number of channels with a fully connected layer, uses four branches to compute attention over the input-channel dimension, the output-channel dimension, the kernel spatial dimension, and the kernel-count dimension, respectively, and multiplies the four complementary attentions into the corresponding dimensions to obtain the final convolution output;
the feature map produced by the first three encoder layers is divided into non-overlapping patches of a given size, linearly mapped to the input size of the Swin Transformer encoder, and fed through two Swin Transformer encoder layers, a bottleneck layer, and two decoder layers, after which a three-dimensional convolution restores the number of feature map channels to the input size of an ordinary encoder;
each Swin Transformer encoder and decoder layer comprises two Swin Transformer modules; each module first flattens the spatial dimensions of the input three-dimensional feature map into one-dimensional patches, then computes self-attention within (shifted) windows over the patches, and finally restores the result to a three-dimensional feature map.
4. The cascade-Transformer-based brain tumor segmentation method according to claim 3, wherein training the network structure comprises:
setting network hyperparameters, optimizing the network structure with an Adam optimizer, adjusting the learning rate dynamically with a cosine annealing schedule, reducing computation with automatic mixed precision, and saving the best training weights;
feeding the training set into the cascaded Transformer model for training, fusing the Focal loss and Dice loss as the network model's loss function, and using the three lesion regions (whole tumor, tumor core, and enhancing tumor) as segmentation targets.
5. The cascade-Transformer-based brain tumor segmentation method according to claim 1, wherein testing the network model to obtain brain tumor segmentation results comprises:
loading the trained weights into the network model and predicting the whole tumor, tumor core, and enhancing tumor segmentation results for the test set;
flipping the input brain tumor image along the x, y, z, xy, xz, yz, and xyz axes by test-time augmentation, predicting the segmentation for each flip, and averaging the predictions;
replacing predictions with the corresponding label with a certain probability by a voxel relabeling method to obtain the final segmentation, wherein, in the prediction, voxels belonging to the whole tumor are replaced with label 0 with a certain probability, voxels belonging to the tumor core are replaced with label 2 with a certain probability, voxels belonging to the enhancing tumor are replaced with label 1 with a certain probability, and independent enhancing tumor blocks and the enhancing tumor as a whole are replaced with label 1 when their voxel count or average probability is below a set value.
CN202310877040.3A 2023-07-17 2023-07-17 Brain tumor segmentation method based on cascade Transformer Pending CN117036380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310877040.3A CN117036380A (en) 2023-07-17 2023-07-17 Brain tumor segmentation method based on cascade Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310877040.3A CN117036380A (en) 2023-07-17 2023-07-17 Brain tumor segmentation method based on cascade Transformer

Publications (1)

Publication Number Publication Date
CN117036380A true CN117036380A (en) 2023-11-10

Family

ID=88636357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310877040.3A Pending CN117036380A (en) 2023-07-17 2023-07-17 Brain tumor segmentation method based on cascade Transformer

Country Status (1)

Country Link
CN (1) CN117036380A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333777A (en) * 2023-12-01 2024-01-02 Shandong Yuanmingqing Technology Co., Ltd. Dam anomaly identification method, device and storage medium
CN117333777B (en) * 2023-12-01 2024-02-13 Shandong Yuanmingqing Technology Co., Ltd. Dam anomaly identification method, device and storage medium
CN118247582A (en) * 2024-05-27 2024-06-25 Changchun University of Science and Technology Brain tumor image automatic classification method and system based on Swin Transformer

Similar Documents

Publication Publication Date Title
Chang et al. Spatial-adaptive network for single image denoising
CN111627019B (en) Liver tumor segmentation method and system based on convolutional neural network
CN111612754B (en) MRI tumor optimization segmentation method and system based on multi-modal image fusion
CN117036380A (en) Brain tumor segmentation method based on cascade Transformer
CN111291825B (en) Focus classification model training method, apparatus, computer device and storage medium
CN109389585B (en) Brain tissue extraction method based on full convolution neural network
CN112634192B (en) Cascaded U-Net brain tumor segmentation method combining wavelet transform
CN110930378B (en) Emphysema image processing method and system based on low data demand
CN112862805B (en) Automatic auditory neuroma image segmentation method and system
CN115375711A (en) Image segmentation method of global context attention network based on multi-scale fusion
CN115564649B (en) Image super-resolution reconstruction method, device and equipment
CN112767417A (en) Multi-modal image segmentation method based on cascaded U-Net network
CN114445420A (en) Image segmentation model with coding and decoding structure combined with attention mechanism and training method thereof
CN115393289A (en) Tumor image semi-supervised segmentation method based on integrated cross pseudo label
CN113362310A (en) Medical image liver segmentation method based on unsupervised learning
Xue et al. TC-net: transformer combined with cnn for image denoising
CN113744284B (en) Brain tumor image region segmentation method and device, neural network and electronic equipment
CN115809998A (en) Glioma MRI data segmentation method based on E²C-Transformer network
Wang et al. A deep learning algorithm for fully automatic brain tumor segmentation
CN114972382A (en) Brain tumor segmentation algorithm based on lightweight UNet + + network
CN117197627B (en) Multi-mode image fusion method based on high-order degradation model
Zhang et al. 3D cross-scale feature transformer network for brain MR image super-resolution
Yang et al. MGDUN: An interpretable network for multi-contrast MRI image super-resolution reconstruction
CN117475268A (en) Multimode medical image fusion method based on SGDD GAN
Aetesam et al. Perceptually motivated generative model for magnetic resonance image denoising

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination