CN114463341A - Medical image segmentation method based on long and short distance features - Google Patents


Info

Publication number
CN114463341A
CN114463341A (application CN202210026011.1A)
Authority
CN
China
Prior art keywords
distance
feature
long
module
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210026011.1A
Other languages
Chinese (zh)
Inventor
种衍文
谢柠迪
潘少明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210026011.1A
Publication of CN114463341A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20112 Image segmentation details
    • G06T 2207/20132 Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a medical image segmentation method based on long and short distance features. A medical image segmentation network based on long and short distance features is built using the PyTorch deep learning framework, adopting an encoder-decoder design based on a Transformer and a convolutional network. A suspicious lesion area is segmented from an input medical image through the processing of four parts: a METransformer module, a convolutional feature extraction module, a global-local feature fusion module, and a deconvolution decoder module. The invention achieves better segmentation performance on lesion image datasets from various body regions, with a stable segmentation effect and clearer edges.

Description

Medical image segmentation method based on long and short distance features
Technical Field
The invention belongs to the technical field of medical image segmentation, and particularly relates to a medical image segmentation method based on long and short distance characteristics.
Background
With advances in medical technology, many new medical devices have emerged, and the internal body images they acquire (CT and MRI are common examples) greatly facilitate the diagnostic process. In the past, image analysis was performed mainly by professional physicians, but because medical experts are scarce, computer-based auxiliary diagnosis systems that can automatically delineate lesion areas in images are needed clinically. Improving the segmentation accuracy and speed of such auxiliary systems is one of the current research hotspots.
Deep learning is currently the most popular image processing approach: supervised learning on correctly labeled image samples is used to complete tasks such as classification, segmentation, and detection. Compared with natural image segmentation, medical image segmentation differs mainly in its data. First, medical image annotation requires medical expertise, making dataset acquisition more difficult and expensive than for natural images, so public datasets have relatively few samples. Second, medical images show greater similarity in shape and color, and lesion areas are better hidden. Finally, medical image segmentation is generally used in medical assistance systems, which impose higher requirements on stability and real-time performance.
Currently, researchers have done much work in the field of medical image segmentation. Ronneberger et al. proposed the encoder-decoder U-network (U-Net), which adds skip connections on top of the encoding-decoding scheme and supplements low-level edge features to the decoder through a stacking operation to generate segmentation results, thereby improving segmentation accuracy. Oktay et al. proposed a U-Net structure with an attention mechanism, which supervises the extraction of shallow shape features with deep semantic features to achieve feature optimization, taking the decoder output as the result of the segmentation task. Chen et al. introduced the multi-head-attention-based Transformer structure from natural language processing into U-Net to better extract global context information and perform more effective feature modeling, further improving segmentation performance. However, these methods still leave room for improvement in segmentation performance, and they do not consider the space occupied by the network model or its computational cost. With the development of deep learning techniques and high-performance computing chips, a medical image segmentation technique combining accuracy and real-time performance is needed to overcome these shortcomings.
In summary, existing medical image segmentation methods still have room for improvement in modeling lesion-area features, and a network architecture that can more comprehensively and effectively fuse global and local features needs to be designed to handle the complex and variable medical images encountered in clinical applications. Meanwhile, to meet the stability and real-time requirements of medical auxiliary diagnosis systems, the proposed network model must keep its parameter count and computational cost small enough for deployment on clinical application servers. In addition, the network model should offer some flexibility in inference time and accuracy, so that a precision/runtime trade-off can be realized on different clinical devices.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a medical image segmentation method based on long and short distance features. A medical image segmentation network based on long and short distance features is built using the PyTorch deep learning framework, adopting an encoder-decoder design based on a Transformer and a convolutional network. A suspicious lesion area is segmented from an input medical image through four processing parts: a masked efficient Transformer (METransformer) module, a convolutional feature extraction module, a global-local feature fusion module (TCFuse), and a deconvolution decoder module. The invention achieves better segmentation performance on lesion image datasets from various body regions, with a stable segmentation effect and clearer edges.
In order to achieve the above object, the present invention provides a medical image segmentation method based on long and short distance features, comprising the following steps:
step 1, firstly, crop the images in the training set to size C × H × W, apply random vertical and horizontal flips to the cropped training-set pictures for data augmentation, and then divide the data into a training set and a test set;
step 2, constructing a medical image segmentation network based on the long and short distance characteristics;
step 3, training a medical image segmentation network based on the long and short distance characteristics by using a training set image;
and 4, performing medical image segmentation by using the segmentation network trained in the step 3.
Moreover, the medical image segmentation network based on long and short distance features in step 2 adopts an encoder-decoder design based on a Transformer and a convolutional network; the encoder module is composed of a convolutional feature extraction module, a METransformer module, and a global-local feature fusion module. First, after preliminary feature extraction of the image tensor by two convolutional feature extraction modules, a METransformer module extracts long-distance global features in parallel with a convolutional feature extraction module that extracts short-distance local features; a global-local feature fusion module then fuses the long- and short-distance feature vectors. On the basis of the fused feature map, the parallel feature extraction and feature-vector fusion operations are performed once more to further model the long- and short-distance features in the data, and finally the segmentation result of the network is output through three deconvolution decoder modules. The network model iteratively learns its parameters through back-propagation of the loss function, realizing automatic optimization, and can perform medical image segmentation after multiple rounds of training.
The convolutional feature extraction module is composed of several convolution-based blocks. Each block consists of three convolutional layers with kernel sizes 1 × 1, 3 × 3, and 1 × 1 together with normalization layers, with a ReLU activation after each normalization to shape the distribution of feature activations. The feature map generated by each block is stored as a skip connection of low-order features and fed to the subsequent deconvolution decoding operation; the final output of the module is a short-distance feature map.
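A minimal PyTorch sketch of one such block follows. The 1 × 1, 3 × 3, 1 × 1 kernel pattern and the normalization + ReLU after each layer come from the text; the channel widths and the bottleneck ratio are illustrative assumptions not fixed by the patent.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One block of the convolutional feature-extraction module:
    three conv layers (1x1, 3x3, 1x1), each followed by BatchNorm + ReLU.
    The bottleneck width (out_ch // 4) is an assumption."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = out_ch // 4  # assumed bottleneck width
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)  # the caller stores this as a skip feature

x = torch.randn(1, 3, 64, 64)
skip = ConvBlock(3, 32)(x)  # feature map kept as a skip connection
```

The block preserves spatial size; downsampling between encoder stages is left to the surrounding network.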
The METransformer module comprises an axial multi-head attention module and a mask module. The input of the axial multi-head attention module is an image tensor (B, C, H, W). The picture is first divided into 2 × 2 patches, the number of which is N = (H/2) × (W/2). Each patch is mapped through a fully connected layer into a vector of length C, and the N vectors are recombined in their original order into a feature map of size B × C × (H/2) × (W/2). The feature map is averaged along the W and H directions respectively to obtain two vectors of size B × C × (H/2) and B × C × (W/2); the channel dimension C is then divided equally by the number of heads h, reshaping the two vectors into B × h × (C/h) × (H/2) and B × h × (C/h) × (W/2). On this basis, the matrices Q, K, and V are respectively obtained by matrix multiplication; finally, the weighted query set is matrix-multiplied with V to obtain the output of the module.
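The axial-attention-with-mask idea can be sketched in PyTorch as below, shown for a single axis (the H direction); the patent applies the same scheme along both H and W. The Gaussian mask that suppresses short-distance (near-diagonal) attention, its width sigma, and the projection details are assumptions based on the description, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Sketch of axial multi-head attention: the (B, C, H, W) features are
    averaged along one spatial axis and attention is applied along the
    remaining axis, so the attention matrix is H x H instead of HW x HW.
    The Gaussian mask down-weighting nearby positions is an assumption."""
    def __init__(self, channels, heads=8, sigma=4.0):
        super().__init__()
        self.heads, self.dk = heads, channels // heads
        self.qkv = nn.Linear(channels, 3 * channels)
        self.sigma = sigma

    def forward(self, x):                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.mean(dim=3).transpose(1, 2)   # average over W -> (B, H, C)
        q, k, v = self.qkv(seq).chunk(3, dim=-1)
        # split channels across heads: (B, heads, H, C/heads)
        q, k, v = [t.view(b, h, self.heads, self.dk).transpose(1, 2)
                   for t in (q, k, v)]
        logits = (q @ k.transpose(-2, -1)) / self.dk ** 0.5  # (B, heads, H, H)
        # Gaussian mask: subtract a bump near the diagonal to suppress
        # short-distance correlations and favor long-distance ones
        idx = torch.arange(h)
        dist = (idx[None, :] - idx[:, None]).abs().float()
        logits = logits - torch.exp(-dist ** 2 / (2 * self.sigma ** 2))
        out = logits.softmax(-1) @ v          # (B, heads, H, C/heads)
        return out.transpose(1, 2).reshape(b, h, c)  # (B, H, C)

attn = AxialAttention(channels=32, heads=8)
out = attn(torch.randn(2, 32, 16, 20))  # sequence along the H axis
```

Because attention is computed over each axis separately, the cost drops from O((HW)^2) to O(H^2 + W^2) per head, which is the computational saving the patent cites.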
The inputs of the global-local feature fusion module are a long-distance feature map and a short-distance feature map, both of scale (B, C, H, W). A channel-wise attention map is first obtained by global average pooling of the long-distance features and multiplied element-wise with the short-distance feature map; preliminary feature fusion is then performed by a convolutional layer, the fused features are stacked with the original long-distance features, and a further convolutional layer achieves deeper fusion and feature dimension reduction, generating the high-order feature map blocks required subsequently.
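A sketch of this fusion in PyTorch, following the steps just described; the kernel sizes of the two fusion convolutions are assumptions.

```python
import torch
import torch.nn as nn

class TCFuse(nn.Module):
    """Sketch of the global-local fusion module: global average pooling of
    the long-distance features yields a per-channel attention vector that
    reweights the short-distance features; the result is fused by a conv,
    stacked (concatenated) with the long-distance features, and reduced
    back to C channels by a second conv."""
    def __init__(self, channels):
        super().__init__()
        self.fuse1 = nn.Conv2d(channels, channels, 3, padding=1)      # assumed 3x3
        self.fuse2 = nn.Conv2d(2 * channels, channels, 1)             # dim. reduction

    def forward(self, long_feat, short_feat):       # both (B, C, H, W)
        w = long_feat.mean(dim=(2, 3), keepdim=True)  # (B, C, 1, 1) channel attention
        fused = self.fuse1(short_feat * w)            # element-wise reweight + conv
        stacked = torch.cat([fused, long_feat], dim=1)  # (B, 2C, H, W)
        return self.fuse2(stacked)                      # back to (B, C, H, W)

fuse = TCFuse(16)
hi_feat = fuse(torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8))
```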
The deconvolution decoder module consists of several decoder blocks based on convolution and bilinear interpolation. The features output by the second global-local feature fusion module are upsampled by bilinear interpolation, doubling H and W, stacked with the skip features, and passed through two convolutional layers for multi-level fusion.
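One decoder block can be sketched as follows. The upsample-concatenate-two-convolutions pattern comes from the text; channel widths, kernel sizes, and the activation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Sketch of one deconvolution-decoder block: bilinear upsampling
    doubles H and W, the encoder skip feature is stacked (concatenated),
    and two conv layers fuse the result."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)        # double H and W
        return self.conv(torch.cat([x, skip], dim=1))  # stack with skip, fuse

dec = DecoderBlock(in_ch=64, skip_ch=32, out_ch=32)
y = dec(torch.randn(1, 64, 8, 8), torch.randn(1, 32, 16, 16))
```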
In step 3, with batch size B, the combined image tensors are B × C × H × W; the network hyper-parameters are set to learning rate 0.01, momentum 0.9, and weight_decay 0.0001, and after iterative optimization over 150 epochs, the loss function used in training is as follows:
Loss=0.5×Cross Entropy Loss+0.5×Dice Loss (1)
where Cross Entropy Loss denotes the cross-entropy loss function value, and Dice Loss is a set-similarity measure used to compute the similarity of two samples.
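Equation (1) can be sketched for the binary (single-channel) case as below; the soft-Dice formulation and the smoothing term eps are common choices and are assumptions, since the patent states only the 0.5/0.5 weighting.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1.0):
    """Soft Dice loss for binary segmentation (one-channel logits);
    eps is an assumed smoothing term."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    return 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)

def combined_loss(logits, target):
    """Loss = 0.5 * Cross Entropy + 0.5 * Dice, per equation (1)."""
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return 0.5 * ce + 0.5 * dice_loss(logits, target)
```

Cross-entropy drives per-pixel correctness while Dice directly optimizes region overlap, which helps on the small, imbalanced lesion areas typical of medical images.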
Moreover, the medical image segmentation in the step 4 comprises an encoding stage and a decoding stage.
Encoding stage: with the batch size initially set to B, the original B × C × H × W image tensor enters two serially connected convolutional feature extraction modules for preliminary low-dimensional feature extraction, generating two skip features. The low-dimensional features are then fed separately into a METransformer module and another convolutional feature extraction module to obtain long-distance global features and short-distance local features, each of size (B, C1, H/8, W/8), where C1 is the channel width at this stage (a skip feature is also generated here, for 3 skip features in total). The METransformer module replaces the multi-head attention of the Transformer with axial attention to reduce computation; on the (H/16) × (W/16) token grid produced by the 2 × 2 patching, a feature-map mask of the same size is added to reduce the weight of short-distance channels and ensure that long-distance features are extracted. The long- and short-distance feature maps are then fed to a global-local feature fusion module, which takes the long-distance features as input to generate a C1 × 1 channel weight, multiplies it onto the short-distance features, and obtains a preliminary fused feature through convolution and stacking, its size still (B, C1, H/8, W/8). The long- and short-distance features are extracted and fused once more, further compressing the features to (B, 2C1, H/16, W/16), completing the task of compressing the image data features.
And a decoding stage: input is as
Figure BDA0003463909810000044
The compression characteristics generated by the coding structure are firstly sampled to reduce the number of channels and improve the length and width dimensions, then stacked with the third skip generated in the coder stage and sent into two convolution layers, and continuously pass through the decoder module for three times to respectively correspond to the three skip characteristics, and finally the compression characteristics are obtained
Figure BDA0003463909810000045
And then the characteristic diagram is processed by a convolution layer and an up-sampling layer which take an output channel as the category number to obtain the final segmentation result.
Compared with the prior art, the invention has the following advantages:
1) Addressing the problem of insufficient feature extraction in traditional medical image segmentation, a mask-based axial-attention feature extraction module, the METransformer module, is designed. On the basis of the Transformer, the token mappings are projected separately onto the H and W directions, correlations are computed along each of the two directions, and a Gaussian-function-based mask module is applied to the correlation importance map, raising the proportional weight of long-distance features while reducing short-distance weights. The mask module guides the multi-head attention structure to effectively model long-distance features and complements the parallel convolutional feature extraction module used for short-distance modeling, further improving the image-feature modeling capability of the whole structure and making full use of the spatial and positional information of the original image.
2) A long-short distance feature fusion module is proposed for the problem of fusing the two kinds of features: long-distance features represent global information well and play an important guiding role in the fusion process, while short-distance features express detail information more strongly and supplement the details. A 1 × C channel-attention vector is first generated from the long-distance features and multiplied onto the short-distance features along the channel dimension; a stacking operation then expands the C dimension while H and W remain unchanged; finally, a stride-1 convolutional layer performs channel dimension reduction to comprehensively fuse the information of the two.
3) Addressing the large device memory and computation requirements of deep networks, the network structure is adjusted. On one hand, axial attention is introduced into the attention mechanism, avoiding extremely large matrix computations and reducing the computational cost to a reasonable level; on the other hand, the hyper-parameters of the network's encoder and decoder are tuned, removing most redundant parameters without affecting precision, saving storage space and improving device utilization.
Drawings
Fig. 1 is a schematic diagram of a medical image segmentation network based on long and short distance features according to the present invention.
Fig. 2 is a structural diagram of the METransformer module of the present invention.
FIG. 3 is a block diagram of a global local feature fusion module according to the present invention.
Fig. 4 is a block diagram of a convolution feature extraction module according to the present invention.
Fig. 5 is a block diagram of a deconvolution decoder module of the present invention.
Fig. 6 shows the segmentation effect on a gastric polyp lesion image according to the present invention, wherein Fig. 6(a) is the gastric polyp lesion image and Fig. 6(b) is the corresponding segmentation result.
Fig. 7 illustrates pathological recognition segmentation on a cell image according to the present invention, wherein Fig. 7(a) is the cell image and Fig. 7(b) is the corresponding segmentation result.
Detailed Description
The invention provides a medical image segmentation method based on long and short distance characteristics, and the technical scheme of the invention is further explained by combining the accompanying drawings and an embodiment.
As shown in fig. 1, the process of the embodiment of the present invention includes the following steps:
Step 1, firstly, crop the images in the training set to size 3 × 512 × 512 and apply random vertical and horizontal flips to the cropped training-set images for data augmentation; then divide the data into training and test sets at a ratio of 8:2, that is, 80% of the data is used for training and 20% for testing the training result.
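Step 1 can be sketched as follows. The flips are applied jointly to image and label so the annotation stays aligned; the cropping itself and the dataset layout are left abstract, and the helper names are illustrative.

```python
import random
import torch

def augment(img, mask):
    """Random vertical / horizontal flips applied jointly to a 3x512x512
    image and its label mask, as described in step 1 (a sketch)."""
    if random.random() < 0.5:
        img, mask = torch.flip(img, [-1]), torch.flip(mask, [-1])  # horizontal
    if random.random() < 0.5:
        img, mask = torch.flip(img, [-2]), torch.flip(mask, [-2])  # vertical
    return img, mask

def split_8_2(samples):
    """8:2 train/test split as in the embodiment."""
    n = int(0.8 * len(samples))
    return samples[:n], samples[n:]

img, mask = augment(torch.randn(3, 512, 512), torch.zeros(1, 512, 512))
train_set, test_set = split_8_2(list(range(10)))
```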
And 2, constructing a medical image segmentation network based on the long and short distance characteristics.
The method adopts an encoder-decoder design based on a Transformer and a convolutional network; the encoder module consists of a convolutional feature extraction module, a METransformer module, and a global-local feature fusion module. First, after preliminary feature extraction of the image tensor by two convolutional feature extraction modules, a METransformer module extracts long-distance global features in parallel with a convolutional feature extraction module that extracts short-distance local features; a global-local feature fusion module then fuses the long- and short-distance feature vectors. On the basis of the fused feature map, the parallel feature extraction and feature-vector fusion operations are performed once more to further model the long- and short-distance features in the data, and finally the segmentation result of the network is output through three deconvolution decoder modules. The network model iteratively learns its parameters through back-propagation of the loss function, realizing automatic optimization, and can perform medical image segmentation after multiple rounds of training.
Convolutional feature extraction module: the module consists of several convolution-based blocks. Each block consists of three convolutional layers with kernel sizes 1 × 1, 3 × 3, and 1 × 1 together with normalization layers, with a ReLU activation after each normalization to shape the distribution of feature activations. The feature map generated by each block is stored as a skip connection of low-order features and fed to the subsequent deconvolution decoding operation; the final output of the module is a short-distance feature map.
The MERTransformer module: the module includes an axial multi-head attention module and a mask module. The input to the axial multi-headed attention module is the image tensors (B, C, H, W), and the picture is first divided into 2 × 2 patches, the number of which is
Figure BDA0003463909810000061
Each patch is mapped into a vector with the length of C through a full connection layer, and the N vectors are combined into a vector with the size of C according to the original sequence
Figure BDA0003463909810000062
And (4) a characteristic diagram of (A). Respectively averaging the characteristic diagram in H and W directions to convert into
Figure BDA0003463909810000063
And
Figure BDA0003463909810000064
two vectors, then dividing channel C by multiple number equally, dividing two vectorsBecome into
Figure BDA0003463909810000071
And
Figure BDA0003463909810000072
on the basis of this, matrices Q, K and V are obtained by matrix multiplication. And finally, carrying out matrix multiplication on the weighted query set and V to obtain the output of the module.
Global-local feature fusion module (TCFuse): the inputs of the module are a long-distance feature map and a short-distance feature map, both of scale (B, C, H, W). A channel-wise attention map is first obtained by global average pooling of the long-distance features and multiplied element-wise with the short-distance feature map; preliminary feature fusion is then performed by a convolutional layer, the fused features are stacked with the original long-distance features, and a further convolutional layer achieves deeper fusion and feature dimension reduction, generating the high-order feature map blocks required subsequently.
Deconvolution decoder module: the module consists of several decoder blocks based on convolution and bilinear interpolation. The features output by the second global-local feature fusion module are upsampled by bilinear interpolation, doubling H and W, stacked with the skip features, and passed through two convolutional layers for multi-level fusion.
And 3, training the medical image segmentation network based on the long and short distance characteristics by using the training set image.
With the batch size set to 4, 4 × 3 × 512 × 512 image tensors are combined as the training input of the network; the network hyper-parameters are set to learning rate 0.01, momentum 0.9, and weight_decay 0.0001, and iterative optimization is performed for 150 epochs using the following loss function:
Loss=0.5×Cross Entropy Loss+0.5×Dice Loss (1)
where Cross Entropy Loss denotes the cross-entropy loss function value, and Dice Loss is a set-similarity measure used to compute the similarity of two samples.
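The optimizer setup implied by the hyper-parameters above can be sketched as follows. SGD itself is an assumption, the text names only the learning rate, momentum, and weight decay, which are SGD's usual arguments.

```python
import torch
import torch.nn as nn

def make_optimizer(model):
    """Optimizer configured with the hyper-parameters from the text
    (lr=0.01, momentum=0.9, weight_decay=1e-4); the choice of SGD
    is an assumption."""
    return torch.optim.SGD(model.parameters(), lr=0.01,
                           momentum=0.9, weight_decay=0.0001)

# usage with a placeholder model; training then runs for 150 epochs
opt = make_optimizer(nn.Conv2d(3, 1, 1))
```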
And 4, performing medical image segmentation by using the segmentation network trained in the step 3.
The segmentation network performs medical image segmentation including two stages of encoding and decoding.
Encoding stage: with the batch size initially set to 4, the original 4 × 3 × 512 × 512 image tensor enters two serially connected convolutional feature extraction modules for preliminary low-dimensional feature extraction, generating two skip features, as shown in Fig. 4. The low-dimensional features are then fed separately into a METransformer module and another convolutional feature extraction module, extracting long-distance global features and short-distance local features, each of size 4 × 512 × 64 × 64 (a skip feature is also generated here, for 3 skip features in total). The METransformer module replaces the multi-head attention of the Transformer with axial attention to reduce computation; H and W of the attention grid are both 32, and a 32 × 32 feature-map mask is added to reduce the weight of short-distance channels and ensure that long-distance features are extracted. The long- and short-distance feature maps are then fed to the global-local feature fusion module; a 512 × 1 channel weight generated from the long-distance features is multiplied onto the short-distance features, which are then convolved and stacked to obtain a preliminary fused feature (still of size 4 × 512 × 64 × 64). Further long-short distance feature extraction and fusion are performed once more, compressing the features to 4 × 1024 × 32 × 32 and completing the task of image data feature compression.
And a decoding stage: inputting the compression characteristics generated by the 4 × 1024 × 32 × 32 coding structure, performing upsampling to reduce the number of channels and increase the length and width dimensions, stacking the compression characteristics with the third skip generated in the coder stage, and sending the compression characteristics into two convolutional layers, as shown in fig. 5, continuously passing through a decoder module (corresponding to the three skip characteristics respectively) three times to obtain a 4 × 127 × 128 × 128 characteristic diagram, and passing the characteristic diagram through a convolutional layer and an upsampling layer of which the output channels are the number of categories to obtain a final segmentation result.
In specific implementation, the above process can adopt computer software technology to realize automatic operation process.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or adopt alternatives, without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (8)

1. A medical image segmentation method based on long and short distance features is characterized by comprising the following steps:
step 1, firstly, cutting an image in a training set into a size of C multiplied by H multiplied by W, randomly turning a cut training set picture up and down and horizontally to realize data expansion, and then dividing the training set and a test set;
step 2, constructing a medical image segmentation network based on the long and short distance characteristics;
the method comprises the following steps of adopting a design mode of an encoder and a decoder based on a transformer and a convolutional network, wherein an encoder module consists of a convolutional feature extraction module, an MEtransformer module and a global local feature fusion module; firstly, after the image tensor is subjected to primary feature extraction through two convolution feature extraction modules, extracting long-distance global features by using a MEfransformer module in parallel, extracting short-distance local features by using the convolution feature extraction module, then fusing long-distance and short-distance feature vectors by using a global local feature fusion module, further modeling long-distance and short-distance features in data by performing the parallel feature extraction and feature vector fusion operations again on the basis of a fused feature map, and finally obtaining segmentation result output of a network through three deconvolution decoder modules;
step 3, training a medical image segmentation network based on the long and short distance characteristics by using a training set image;
and 4, performing medical image segmentation by using the segmentation network trained in the step 3.
2. The medical image segmentation method based on long and short distance features as claimed in claim 1, characterized in that: in step 2, the convolutional feature extraction module is composed of several convolution-based blocks; each block consists of three convolutional layers with kernel sizes 1 × 1, 3 × 3, and 1 × 1 together with normalization layers, with a ReLU activation after each normalization to shape the distribution of feature activations; the feature map generated by each block is stored as a skip connection of low-order features and fed to the subsequent deconvolution decoding operation; the final output of the module is a short-distance feature map.
3. The medical image segmentation method based on the long and short distance features as claimed in claim 1, characterized in that: in step 2, the MEtransformer module comprises an axial multi-head attention module and a mask module; the input of the axial multi-head attention module is an image tensor of size (B, C, H, W); the image is first divided into 2 × 2 patches, the number of which is N = (H/2) × (W/2); each patch is mapped into a vector of length C through a fully connected layer, and the N vectors are assembled in their original order into a feature map of size (B, C, H/2, W/2); the feature map is averaged along the H and W directions respectively, giving two vectors of size (B, C, H/2, 1) and (B, C, 1, W/2); the channel dimension C is then divided equally by the number of heads h, reshaping the two vectors into (B, h, C/h, H/2) and (B, h, C/h, W/2); on this basis, matrix multiplication is carried out to obtain the matrices Q, K and V respectively; finally, the weighted query set is matrix-multiplied with V to obtain the output of the module.
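Under the reading above, the axial multi-head attention can be sketched shape-by-shape in NumPy. The projection matrices `Wp`, `Wq`, `Wk`, `Wv` are random stand-ins for learned parameters, and only the H-axis attention is computed (the W axis is handled symmetrically); this is an illustrative sketch, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, heads=4):
    """Shape-level sketch of the axial multi-head attention of claim 3."""
    B, C, H, W = x.shape
    # 1) split into 2x2 patches: N = (H/2) * (W/2) patches of 2*2*C values
    p = x.reshape(B, C, H // 2, 2, W // 2, 2).transpose(0, 2, 4, 3, 5, 1)
    p = p.reshape(B, (H // 2) * (W // 2), 4 * C)
    # 2) fully connected mapping of each patch to a length-C vector
    Wp = rng.standard_normal((4 * C, C)) / np.sqrt(4 * C)
    f = (p @ Wp).reshape(B, H // 2, W // 2, C).transpose(0, 3, 1, 2)  # (B,C,H/2,W/2)
    # 3) average along the H and W directions
    fh = f.mean(axis=3)                          # (B, C, H/2)
    fw = f.mean(axis=2)                          # (B, C, W/2)
    # 4) split the channel dimension over the heads
    fh = fh.reshape(B, heads, C // heads, H // 2)
    fw = fw.reshape(B, heads, C // heads, W // 2)
    # 5) Q, K, V by matrix multiplication, then attention along the H axis
    d = C // heads
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q = np.einsum('bhdl,de->bhel', fh, Wq)
    k = np.einsum('bhdl,de->bhel', fh, Wk)
    v = np.einsum('bhdl,de->bhel', fh, Wv)
    attn = softmax(np.einsum('bhdi,bhdj->bhij', q, k) / np.sqrt(d), axis=-1)
    out = np.einsum('bhij,bhdj->bhdi', attn, v)  # (B, heads, C/heads, H/2)
    return out, fw
```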
4. The medical image segmentation method based on the long and short distance features as claimed in claim 1, characterized in that: the input of the global local feature fusion module in step 2 is a long-distance feature map and a short-distance feature map, both of scale (B, C, H, W); a channel-wise attention map is first obtained by global average pooling of the long-distance features and multiplied element-wise with the short-distance feature map; preliminary feature fusion is then performed through a convolution layer; the fused features are stacked with the original long-distance features and passed through a convolution layer again for further fusion and feature dimension reduction, thereby generating the high-order feature map block required subsequently.
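The fusion step can be sketched as follows; the two fusing convolutions are modelled as 1 × 1 channel-mixing matrices with random stand-in weights, so this is illustrative only, not the patented implementation:

```python
import numpy as np

def fuse_long_short(long_f, short_f, rng=np.random.default_rng(0)):
    """Sketch of the global local feature fusion of claim 4."""
    B, C, H, W = long_f.shape
    # channel attention from global average pooling of the long-distance features
    w = long_f.mean(axis=(2, 3)).reshape(B, C, 1, 1)       # (B, C, 1, 1)
    gated = w * short_f                                    # element-wise product
    W1 = rng.standard_normal((C, C)) / np.sqrt(C)
    fused = np.einsum('bchw,cd->bdhw', gated, W1)          # preliminary fusion (1x1 conv)
    stacked = np.concatenate([fused, long_f], axis=1)      # stack with long features
    W2 = rng.standard_normal((2 * C, C)) / np.sqrt(2 * C)  # further fusion + reduction
    return np.einsum('bchw,cd->bdhw', stacked, W2)
```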
5. The medical image segmentation method based on the long and short distance features as claimed in claim 1, characterized in that: the deconvolution decoder module in step 2 consists of a plurality of decoder blocks based on convolution and bilinear interpolation; the features output by the second global local feature fusion module are upsampled by bilinear interpolation so that H and W are doubled, stacked with the skip features, and then fused across layers through two convolution layers.
6. The medical image segmentation method based on the long and short distance features as claimed in claim 1, characterized in that: in step 3, images with batch size B are combined into an image tensor of B × C × H × W as the training input of the network; the network model hyper-parameters are set as learning_rate = 0.01, momentum = 0.9 and weight_decay = 0.0001, and 150 epochs of iterative optimization are performed; the following loss function is used in training:
Loss=0.5×Cross Entropy Loss+0.5×Dice Loss (1)
in the formula, Cross Entropy Loss represents the cross entropy loss function value, and Dice Loss is a set similarity measurement function used to calculate the similarity of two samples.
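Equation (1) can be sketched for a flat prediction as follows. The soft Dice formulation (probabilities against one-hot targets, with an `eps` smoothing term) is an assumed concrete choice, since the claim does not fix one:

```python
import numpy as np

def combined_loss(probs, target, eps=1e-7):
    """Loss = 0.5 * CrossEntropy + 0.5 * Dice, per equation (1).

    probs:  (N, K) predicted class probabilities
    target: (N,)   integer class labels
    """
    N, K = probs.shape
    onehot = np.eye(K)[target]                                 # (N, K)
    # cross entropy over the true-class probabilities
    ce = -np.mean(np.log(probs[np.arange(N), target] + eps))
    # soft Dice: 1 - 2|A ∩ B| / (|A| + |B|)
    inter = (probs * onehot).sum()
    dice = 1.0 - 2.0 * inter / (probs.sum() + onehot.sum() + eps)
    return 0.5 * ce + 0.5 * dice
```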
7. The medical image segmentation method based on the long and short distance features as claimed in claim 1, characterized in that: the medical image segmentation in step 4 comprises an encoding stage and a decoding stage, the encoding stage being as follows: the batch size is initially set to B, and the original B × C × H × W image tensor enters two serially connected convolution feature extraction modules for initial low-dimensional feature extraction, generating two skip features; the low-dimensional features are then sent to the MEtransformer module and another convolution feature extraction module respectively, obtaining a long-distance global feature and a short-distance local feature of identical size (B, C2, H/8, W/8), and a third skip feature is generated at this point, for 3 skip features in total; the MEtransformer module replaces the multi-head attention of the transformer with axial attention to reduce the computation amount, the H and W sizes both being 32 × 32, and a mask of the feature map, also 32 × 32 in size, is added to reduce the weight of the short-distance channels and ensure that long-distance features are extracted; the long-distance and short-distance feature maps are then sent to the global local feature fusion module, which takes the long-distance feature as input to generate a channel weight of size C2 × 1 × 1, multiplies it onto the short-distance features, and then performs convolution and stacking to obtain a preliminarily fused feature whose size remains (B, C2, H/8, W/8); the extraction and fusion of long- and short-distance features are performed once more, further compressing the features into (B, C3, H/16, W/16) and completing the task of compressing the image data features.
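The shape flow of the encoding stage can be illustrated with hypothetical channel counts C1, C2, C3; the concrete numbers below are assumptions, chosen so that with a 256 × 256 input the parallel stage lands on the 32 × 32 attention size mentioned in the claim:

```python
def encoder_shapes(B=2, C=3, H=256, W=256, C1=64, C2=128, C3=256):
    """Hypothetical shape bookkeeping for the encoding stage of claim 7,
    assuming each stage halves H and W; channel counts are illustrative."""
    return {
        'input':                   (B, C, H, W),
        'after_two_conv_modules':  (B, C1, H // 4, W // 4),    # 2 skips saved
        'parallel_long_short':     (B, C2, H // 8, W // 8),    # 3rd skip saved
        'after_second_fusion':     (B, C3, H // 16, W // 16),  # compressed output
    }
```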
8. The medical image segmentation method based on the long and short distance features as claimed in claim 7, characterized in that: the decoding stage in step 4 is as follows: the input is the compressed feature of size (B, C3, H/16, W/16) generated by the encoding structure; it is first upsampled to reduce the number of channels and enlarge the height and width dimensions, then stacked with the third skip feature generated in the encoder stage and sent into two convolution layers; the decoder block is passed through three times in succession, corresponding to the three skip features respectively, finally obtaining a feature map of size (B, C1, H/2, W/2); this feature map is then processed by a convolution layer whose number of output channels equals the number of classes and an upsampling layer to obtain the final segmentation result.
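The decoding stage's shape progression can likewise be illustrated with assumed channel counts: three ×2-upsampling decoder blocks followed by a final class-channel convolution and upsampling. All concrete numbers are hypothetical:

```python
def decoder_shapes(B=2, C3=256, H=256, W=256, num_classes=2):
    """Illustrative shape flow for the decoding stage of claim 8."""
    flow = [(B, C3, H // 16, W // 16)]      # compressed encoder output
    for c in (128, 64, 32):                 # three decoder blocks, each x2 upsample
        h = flow[-1][2] * 2
        flow.append((B, c, h, h))
    flow.append((B, num_classes, H, W))     # final conv + upsampling layer
    return flow
```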
CN202210026011.1A 2022-01-11 2022-01-11 Medical image segmentation method based on long and short distance features Pending CN114463341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210026011.1A CN114463341A (en) 2022-01-11 2022-01-11 Medical image segmentation method based on long and short distance features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210026011.1A CN114463341A (en) 2022-01-11 2022-01-11 Medical image segmentation method based on long and short distance features

Publications (1)

Publication Number Publication Date
CN114463341A true CN114463341A (en) 2022-05-10

Family

ID=81409298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210026011.1A Pending CN114463341A (en) 2022-01-11 2022-01-11 Medical image segmentation method based on long and short distance features

Country Status (1)

Country Link
CN (1) CN114463341A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147606A * 2022-08-01 2022-10-04 Shenzhen Technology University Medical image segmentation method and device, computer equipment and storage medium
CN115147606B * 2022-08-01 2024-05-14 Shenzhen Technology University Medical image segmentation method, medical image segmentation device, computer equipment and storage medium
CN116188435A * 2023-03-02 2023-05-30 Nantong University Medical image depth segmentation method based on fuzzy logic
CN116188435B * 2023-03-02 2023-11-07 Nantong University Medical image depth segmentation method based on fuzzy logic
CN116977336A * 2023-09-22 2023-10-31 Suzhou SmartMore Intelligent Technology Co., Ltd. Camera defect detection method, device, computer equipment and storage medium
CN117952992A * 2024-03-21 2024-04-30 Foshan University Intelligent segmentation method and device for CT image
CN117952992B * 2024-03-21 2024-06-11 Foshan University Intelligent segmentation method and device for CT image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination