CN116205927A - Image segmentation method based on boundary enhancement - Google Patents

Image segmentation method based on boundary enhancement

Info

Publication number: CN116205927A
Authority: CN (China)
Prior art keywords: boundary, enhancement, feature map, feature, module
Legal status: Pending
Application number: CN202310165505.2A
Other languages: Chinese (zh)
Inventors: 古晶, 孙新凯, 翟得胜, 杨淑媛, 冯婕, 侯彪, 刘芳, 焦李成
Current Assignee: Xidian University
Original Assignee: Xidian University
Filing date: 2023-02-24
Publication date: 2023-06-02

Classifications

    • G06T 7/10: Image data processing; Image analysis; Segmentation; Edge detection
    • G06N 3/02, G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods
    • G06V 10/70, G06V 10/77, G06V 10/80: Image or video recognition or understanding using pattern recognition or machine learning; Processing features in feature spaces; Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • Y02T 10/10, Y02T 10/40: Climate change mitigation technologies related to transportation; Internal combustion engine based vehicles; Engine management systems


Abstract

The invention discloses an image segmentation method based on boundary enhancement, which comprises the following steps: establishing a boundary-enhanced image segmentation network model, and segmenting an input image with the trained model. The encoder of the network model comprises a first feature extraction module, a boundary extraction module and a second feature extraction module, and mainly extracts boundary features of the input image under the supervision of boundary labels so as to obtain feature maps of different scales. The decoder comprises a bidirectional mutual enhancement module and a multi-scale attention aggregation module, and mainly performs attention aggregation on the enhanced feature maps along the scale, spatial and channel dimensions to obtain a multi-dimensional fusion feature map. The boundary information extracted by the method is more accurate, and the resulting multi-dimensional fusion feature map highlights the spatial and semantic information that most effectively improves the segmentation result, thereby improving the accuracy of image segmentation.

Description

Image segmentation method based on boundary enhancement
Technical Field
The invention belongs to the technical field of image segmentation, and particularly relates to an image segmentation method based on boundary enhancement.
Background
The goal of image segmentation is to partition the input image according to semantic information and to predict, from a given label set, the semantic class of each pixel. As modern life becomes increasingly intelligent, more and more applications, such as augmented reality, autonomous driving and video surveillance, need to infer semantic information from images for subsequent processing, so accurate image segmentation has become important.
Conventional image segmentation typically relies on classical machine learning algorithms, such as clustering and random forests, to obtain image features. In recent years, the rapid development of specialized computing chips has driven computing costs down sharply, making large-scale use of deep learning algorithms feasible and markedly improving segmentation accuracy at little extra cost. Image segmentation methods based on deep learning have therefore received a great deal of attention from researchers.
For example, Jianlong Hou et al., in "BSNet: Dynamic Hybrid Gradient Convolution Based Boundary-Sensitive Network for Remote Sensing Image Segmentation" (IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-22, 2022), propose a boundary-sensitive network based on dynamic hybrid gradient convolution. Chengli Peng et al., in "Cross Fusion Net: A Fast Semantic Segmentation Network for Small-Scale Semantic Information Capturing in Aerial Scenes" (IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-13, 2022), propose a cross fusion network that extracts multi-scale semantic information. Aijin Li et al., in "Multitask Semantic Boundary Awareness Network for Remote Sensing Image Segmentation" (IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-14, 2022), propose a semantic boundary awareness network. Guohui Deng et al., in "CCANet: Class-Constraint Coarse-to-Fine Attentional Deep Network for Subdecimeter Aerial Image Semantic Segmentation" (IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-20, 2022), propose a class-constrained coarse-to-fine attention deep network. Rui Li et al., in "Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images" (IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-13, 2022), propose a multi-attention network.
However, existing image segmentation methods based on deep learning have the following shortcomings:
First, when boundary information is extracted from feature maps in existing schemes, the boundary information in feature maps produced by convolution, downsampling and similar operations already contains errors, so the spatial information used to recover boundary details is also erroneous.
Second, when aggregating features of different scales, the feature maps are simply upsampled and then fused by cascading or addition. Neither the degree to which features of each scale influence the segmentation result, nor the difference between low-level spatial information and high-level semantic information across scales, is taken into account. As a result, the spatial and semantic information in the multi-scale feature maps cannot be fully fused, and the segmentation result is unsatisfactory.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides an image segmentation method based on boundary enhancement. The technical problems to be solved by the invention are addressed through the following technical solution:
an image segmentation method based on boundary enhancement, comprising:
establishing a boundary enhanced image segmentation network model based on the encoder-decoder framework;
training the image segmentation network model, and segmenting an input image by using the trained network model to obtain a segmentation result map;
the encoder of the image segmentation network model comprises a first feature extraction module, a boundary extraction module and a second feature extraction module;
the first feature extraction module is used for carrying out feature extraction on the input image to obtain a first feature map;
the boundary extraction module is used for extracting boundary features of the first feature map, extracting boundary labels of image labels corresponding to the input images, and supervising the output boundary features by utilizing the boundary labels to obtain a boundary feature map;
the second feature extraction module is used for carrying out multi-scale feature extraction on the first feature map and the boundary feature map to obtain feature maps with different scales;
the decoder of the image segmentation network model comprises a bidirectional mutual enhancement module and a multi-scale attention aggregation module; the bidirectional mutual enhancement module is used for processing the feature maps of different scales to obtain enhanced feature maps of different scales;
the multi-scale attention aggregation module is used for performing attention aggregation processing on the enhanced feature map based on the scale dimension, the space dimension and the channel dimension to obtain a multi-dimensional fusion feature map.
The invention has the beneficial effects that:
according to the image segmentation method based on boundary enhancement, on the one hand, when boundary extraction is performed on the feature map, boundary extraction is also performed on the label image to obtain the boundary label, and the boundary extracted from the feature map is supervised by the boundary label, so that the extracted boundary information is more accurate; on the other hand, during feature fusion, attention processing along the scale, spatial and channel dimensions makes the obtained multi-dimensional fusion feature map highlight the spatial and semantic information that most effectively improves the segmentation result, thereby improving the accuracy of image segmentation.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a schematic flow chart of an image segmentation method based on boundary enhancement according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an image segmentation network model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a bidirectional mutual enhancement sub-network according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.
Example 1
Referring to fig. 1, fig. 1 is a flowchart of an image segmentation method based on boundary enhancement according to an embodiment of the present invention, which includes:
establishing a boundary enhanced image segmentation network model based on the encoder-decoder framework;
training an image segmentation network model, and segmenting an input image by using the trained network model to obtain a segmentation result map;
the encoder of the image segmentation network model comprises a first feature extraction module, a boundary extraction module and a second feature extraction module;
the first feature extraction module is used for carrying out feature extraction on the input image to obtain a first feature map;
the boundary extraction module is used for extracting boundary features of the first feature map, extracting boundary labels of image labels corresponding to the input images, and supervising the output boundary features by utilizing the boundary labels to obtain a boundary feature map;
the second feature extraction module is used for carrying out multi-scale feature extraction on the first feature map and the boundary feature map to obtain feature maps with different scales;
the decoder of the image segmentation network model comprises a bidirectional mutual enhancement module and a multi-scale attention aggregation module; the bidirectional mutual enhancement module is used for processing the feature maps of different scales to obtain enhanced feature maps of different scales;
the multi-scale attention aggregation module is used for carrying out attention aggregation processing on the enhanced feature map based on the scale dimension, the space dimension and the channel dimension to obtain a multi-dimensional fusion feature map.
In this embodiment, the first feature extraction module is a convolution layer comprising 1 3×3 convolution operation (Conv), 1 batch normalization (BN) and 1 linear rectification function (ReLU). The calculation formula of the linear rectification function ReLU is as follows:
ReLU(x) = max(0, x)
where x is an element in the input feature map.
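As an illustration, the first feature extraction module can be sketched in a few lines of PyTorch (the framework and the channel counts are assumptions of this sketch; the patent only specifies the 3×3 convolution + BN + ReLU composition):

import torch
import torch.nn as nn

class FirstFeatureExtraction(nn.Module):
    # 1 Conv 3x3 + 1 BatchNorm + 1 ReLU, as described above.
    def __init__(self, in_channels=3, out_channels=64):  # 64 output channels is an assumption
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),  # ReLU(x) = max(0, x)
        )

    def forward(self, x):
        return self.block(x)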
Furthermore, the boundary extraction module in this embodiment may use the Laplace, Sobel, Canny or LoG operator to extract the boundary label.
Preferably, the boundary extraction module in this embodiment is constructed with the Laplace operator. Referring to fig. 2, fig. 2 is a schematic structural diagram of the image segmentation network model according to an embodiment of the present invention; the boundary extraction module comprises 2 Laplace operators and 1 convolution layer; wherein:
the first Laplace operator is used for extracting boundary features of the first feature map, and the second Laplace operator is used for extracting boundary labels of the image labels corresponding to the input image;
the convolution layer is used for processing the boundary features; in the model training stage, its output is supervised with the boundary labels to obtain the boundary feature map.
The boundary extraction performed by the Laplace operator on the image label, and the supervision of the boundary features output by the convolution layer, take place only in the network training stage. When the trained network is used for inference, the boundary extraction module only applies the Laplace operator to the first feature map output by the first feature extraction module and then processes the boundary features through the convolution layer. Finally, the boundary feature map output by the convolution layer is added to the first feature map and fed into the second feature extraction module for processing.
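A hedged sketch of the boundary extraction module follows, with the Laplacian implemented as a fixed depthwise convolution (the 4-neighbour kernel, the channel count and the form of the training loss are assumptions of this sketch; the patent only fixes the structure of 2 Laplace operators and 1 convolution layer):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryExtraction(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        lap = torch.tensor([[0., 1., 0.],
                            [1., -4., 1.],
                            [0., 1., 0.]])
        # One fixed (non-learnable) Laplacian per channel, applied as a depthwise conv.
        self.register_buffer("kernel", lap.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.channels = channels
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def laplacian(self, x):
        return F.conv2d(x, self.kernel, padding=1, groups=self.channels)

    def forward(self, first_feat):
        # Boundary features extracted from the first feature map.
        return self.conv(self.laplacian(first_feat))

During training, the second Laplace operator would be applied in the same way to the label map to produce the boundary label that supervises this output (the supervision loss, e.g. binary cross-entropy, is not specified in the text); at inference only the forward path above is used, and its output is added to the first feature map.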
In this embodiment, the second feature extraction module employs a convolutional neural network architecture, which may be any one of ResNet-50, ResNet-152, ResNeXt-50, ResNeXt-101 or ResNeXt-152.
Preferably, this embodiment takes a convolutional neural network architecture based on ResNet-101 as the second feature extraction module; its structure is shown in fig. 2 and comprises a downsampling layer, a ResNet-101 first stage, a ResNet-101 second stage, a ResNet-101 third stage and a ResNet-101 fourth stage; wherein:
the ResNet-101 first stage contains 3 residual blocks, the second stage contains 4 residual blocks, the third stage contains 23 residual blocks, and the fourth stage contains 3 residual blocks.
More specifically, each residual block comprises: 1 1×1 convolution operation, 1 3×3 convolution operation, and 1 1×1 convolution operation.
In this embodiment, the downsampling layer is max pooling with a stride of 2. Downsampling in the ResNet-101 first stage is achieved by max pooling, while downsampling in the ResNet-101 second, third and fourth stages is achieved by setting the stride of the first convolution operation in each stage to 2.
The boundary feature map output by the boundary extraction module is added to the first feature map output by the first feature extraction module and then fed into the second feature extraction module for processing, so that high-level and low-level features of different sizes can be obtained.
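Under the assumption that the backbone is instantiated from torchvision (an implementation detail the patent does not specify), the second feature extraction module might look as follows; the four stages map onto torchvision's layer1 to layer4:

import torch
import torchvision

class SecondFeatureExtraction(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        self.downsample = backbone.maxpool  # max pooling, stride 2
        self.stage1 = backbone.layer1       # 3 residual blocks
        self.stage2 = backbone.layer2       # 4 residual blocks (stride-2 downsampling)
        self.stage3 = backbone.layer3       # 23 residual blocks (stride-2 downsampling)
        self.stage4 = backbone.layer4       # 3 residual blocks (stride-2 downsampling)

    def forward(self, first_feat, boundary_feat):
        # The boundary feature map is added to the first feature map
        # before entering the backbone, as described above.
        x = self.downsample(first_feat + boundary_feat)
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        return f1, f2, f3, f4  # feature maps at four scales

Note that torchvision's layer1 expects a 64-channel input, so the first feature extraction module is assumed here to output 64 channels.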
In this embodiment, when boundary extraction is performed on the feature map, boundary extraction is also performed on the label image to obtain the boundary label, and the boundary extracted from the feature map is supervised by the boundary label, making the extracted boundary information more accurate.
Further, the bidirectional mutual enhancement module comprises a plurality of bidirectional mutual enhancement sub-networks with the same structure; each sub-network comprises a first enhancement module and a second enhancement module, which enhance feature maps of two different sizes respectively.
Specifically, as shown in fig. 2, based on the ResNet-101 convolutional neural network architecture adopted by the second feature extraction module in this embodiment, two bidirectional mutual enhancement sub-networks are set in the bidirectional mutual enhancement module. The inputs of one sub-network are the output feature maps of the ResNet-101 first and third stages; the inputs of the other are the output feature maps of the ResNet-101 second and fourth stages.
Further, referring to fig. 3, fig. 3 is a schematic structural diagram of a bidirectional mutual enhancement sub-network according to an embodiment of the present invention, which comprises a first enhancement module and a second enhancement module, wherein:
the first enhancement module first processes the input feature map of a first size with two convolution layers to obtain a second feature map of the first size; it then processes the input feature map of the first size through a convolution layer, a pooling layer and a Sigmoid activation layer in sequence to obtain a third feature map of a second size;
correspondingly, the second enhancement module first processes the input feature map of the second size with two convolution layers to obtain a fourth feature map of the second size; it then processes the input feature map of the second size through a convolution layer, an up-sampling layer and a Sigmoid activation layer in sequence to obtain a fifth feature map of the first size;
the first enhancement module is further configured to multiply the second feature map by the fifth feature map and pass the result through a convolution layer to obtain a first enhanced feature map of the first size;
the second enhancement module is further configured to multiply the third feature map by the fourth feature map and pass the result through a convolution layer to obtain a second enhanced feature map of the second size.
For example, for the first enhancement module, the input feature map size (i.e., the first size) is H×W×C. The input feature map is processed through two convolution layers, yielding a processed feature map of size H×W×C, equal to the input size, namely the second feature map. In addition, the input feature map is processed through a convolution layer with a stride of 2, a pooling layer with a stride of 2 and a Sigmoid activation layer to obtain an activated feature map of size H/4×W/4×C, namely the third feature map.
For the second enhancement module, the input feature map size (i.e., the second size) is H/4×W/4×C. The input feature map is processed through two convolution layers, yielding a processed feature map of size H/4×W/4×C, equal to the input size, namely the fourth feature map. In addition, the input feature map is processed through a convolution layer, an up-sampling operation with a factor of 4 and a Sigmoid activation layer to obtain an activated feature map of size H×W×C, namely the fifth feature map.
The calculation formula of the Sigmoid activation function is as follows:
Sigmoid(x) = 1 / (1 + e^(-x))
where x is an element in the input feature map.
The processed feature map of size H×W×C (i.e., the second feature map) is multiplied by the activated feature map of size H×W×C (i.e., the fifth feature map), and the result passes through one convolution layer to give an output feature map of size H×W×C (i.e., the first enhanced feature map). Likewise, the activated feature map of size H/4×W/4×C (i.e., the third feature map) is multiplied by the processed feature map of size H/4×W/4×C (i.e., the fourth feature map), and the result passes through one convolution layer to give an output feature map of size H/4×W/4×C (i.e., the second enhanced feature map).
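The following is a minimal sketch of one bidirectional mutual enhancement sub-network under the H×W×C and H/4×W/4×C sizes of the example above (the 3×3 kernels, max pooling, bilinear upsampling, and the shared channel count C are assumptions of this sketch; in practice, ResNet stages with different channel counts would need projection to a common C):

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(c):
    return nn.Conv2d(c, c, kernel_size=3, padding=1)

class BidirectionalMutualEnhancement(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # First enhancement module (high-resolution input, H x W x C).
        self.high_feat = nn.Sequential(conv3x3(channels), conv3x3(channels))
        self.high_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),  # conv, stride 2
            nn.MaxPool2d(kernel_size=2, stride=2),                  # pooling, stride 2
        )
        self.high_out = conv3x3(channels)
        # Second enhancement module (low-resolution input, H/4 x W/4 x C).
        self.low_feat = nn.Sequential(conv3x3(channels), conv3x3(channels))
        self.low_gate = conv3x3(channels)
        self.low_out = conv3x3(channels)

    def forward(self, high, low):
        second = self.high_feat(high)                  # second feature map, H x W
        third = torch.sigmoid(self.high_gate(high))    # third feature map, H/4 x W/4
        fourth = self.low_feat(low)                    # fourth feature map, H/4 x W/4
        fifth = torch.sigmoid(F.interpolate(           # fifth feature map, H x W
            self.low_gate(low), scale_factor=4, mode="bilinear", align_corners=False))
        enhanced_high = self.high_out(second * fifth)  # first enhanced feature map
        enhanced_low = self.low_out(third * fourth)    # second enhanced feature map
        return enhanced_high, enhanced_low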
In this embodiment, the feature maps of the ResNet-101 first and third stages are input into one bidirectional mutual enhancement sub-network, and those of the ResNet-101 second and fourth stages into the other, so that enhanced feature maps at four scales, matching the sizes of the ResNet-101 first, second, third and fourth stages, are obtained.
Further, referring again to fig. 2, the multi-scale attention aggregation module comprises a multi-scale fusion sub-network, a scale-dimension attention sub-network, a spatial-dimension attention sub-network and a channel-dimension attention sub-network; wherein:
the multi-scale fusion sub-network is used for performing size transformation and feature cascading on the enhanced feature maps of different scales;
the scale-dimension attention sub-network is used for performing global average pooling on the cascaded feature maps in the scale dimension to obtain the feature map after scale-dimension attention processing;
the spatial-dimension attention sub-network is used for performing global average pooling and max pooling on the feature map after scale-dimension attention processing in the spatial dimension to obtain the feature map after spatial-dimension attention processing;
the channel-dimension attention sub-network is used for performing global average pooling on the feature map after spatial-dimension attention processing in the channel dimension to obtain the multi-dimensional fusion feature map.
Specifically, as shown in fig. 2, the multi-scale fusion sub-network resizes the four-scale enhanced feature maps output by the bidirectional mutual enhancement module so that all four become the same size as the feature map output by the ResNet-101 first stage. The resized feature maps are then cascaded to obtain a feature map with dimensions (scale S, channel C, space H×W).
For the cascaded feature map, the scale-dimension attention sub-network performs global average pooling in the scale dimension to obtain a feature vector of size (S×1×1×1). This feature vector is then processed through a convolution layer (with a 1×1 convolution kernel) to obtain the scale-dimension attention vector, which is multiplied by the cascaded feature map from the multi-scale fusion sub-network to obtain the feature map after scale-dimension attention processing.
The spatial-dimension attention sub-network first processes the feature map after scale-dimension attention processing through a convolution layer, then performs global average pooling and max pooling on the processed feature map in the spatial dimension, obtaining two spatial attention maps of size (1×1×H×W). These two spatial attention maps are processed through a convolution layer to output a spatial-dimension attention map of size (1×1×H×W), which is multiplied by the feature map after scale-dimension attention processing to obtain the feature map after spatial-dimension attention processing.
The channel-dimension attention sub-network first processes the feature map after spatial-dimension attention processing through a convolution layer, then performs global average pooling on the processed feature map in the channel dimension to obtain a feature vector of size (1×C×1×1). This feature vector is processed through a fully connected layer and a convolution layer (with a 1×1 convolution kernel) to obtain the channel-dimension attention vector, which is multiplied by the feature map after spatial-dimension attention processing to obtain the output feature map of the multi-scale attention aggregation module.
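A hedged sketch of the multi-scale attention aggregation module is given below. The cascaded features are kept as a 5-D tensor of shape (batch, S, C, H, W); collapsing the scale dimension by summation after the scale attention, the sigmoid gating, and the exact layer shapes are assumptions of this sketch, since the text fixes only the pooling dimensions and the order scale, then space, then channel:

import torch
import torch.nn as nn

class MultiScaleAttentionAggregation(nn.Module):
    def __init__(self, num_scales, channels):
        super().__init__()
        self.scale_fc = nn.Linear(num_scales, num_scales)  # equivalent to a 1x1 conv on the pooled scale vector
        self.spatial_pre = nn.Conv2d(channels, channels, 3, padding=1)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.channel_pre = nn.Conv2d(channels, channels, 3, padding=1)
        self.channel_fc = nn.Sequential(                   # FC layer + 1x1-conv-like layer
            nn.Linear(channels, channels),
            nn.Linear(channels, channels))

    def forward(self, feats):
        # feats: list of S enhanced feature maps, each already resized to (B, C, H, W).
        x = torch.stack(feats, dim=1)                      # cascade -> (B, S, C, H, W)
        b, s, c, h, w = x.shape
        # Scale-dimension attention: global average pooling over everything but S.
        sv = x.mean(dim=(2, 3, 4))                         # (B, S)
        x = x * torch.sigmoid(self.scale_fc(sv)).view(b, s, 1, 1, 1)
        x = x.sum(dim=1)                                   # fuse scales -> (B, C, H, W)
        # Spatial-dimension attention: average- and max-pool across channels.
        y = self.spatial_pre(x)
        sp = torch.cat([y.mean(dim=1, keepdim=True),
                        y.max(dim=1, keepdim=True).values], dim=1)  # (B, 2, H, W)
        x = x * torch.sigmoid(self.spatial_conv(sp))       # (B, C, H, W)
        # Channel-dimension attention: global average pooling over the spatial dims.
        z = self.channel_pre(x).mean(dim=(2, 3))           # (B, C)
        x = x * torch.sigmoid(self.channel_fc(z)).view(b, c, 1, 1)
        return x                                           # multi-dimensional fusion feature map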
In this embodiment, through attention processing along the three dimensions of scale, space and channel, the obtained multi-dimensional fusion feature map highlights the spatial and semantic information that effectively improves the accuracy of the segmentation result.
It can be understood that, after the multi-scale attention aggregation module, the decoder of the image segmentation network model provided in this embodiment further comprises an up-sampling layer and a convolution layer: the up-sampling layer up-samples the output feature map of the multi-scale attention aggregation module by a factor of 4, and the segmentation result map is then obtained through the convolution layer.
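A corresponding sketch of this decoder tail (bilinear interpolation and the number of classes are assumptions; the text fixes only the up-sampling factor of 4 and the final convolution layer):

import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, fused):
        # Up-sample the fused feature map by a factor of 4, then classify per pixel.
        up = F.interpolate(fused, scale_factor=4, mode="bilinear", align_corners=False)
        return self.classifier(up)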
In summary, according to the image segmentation method based on boundary enhancement, on the one hand, when boundary extraction is performed on the feature map, boundary extraction is also performed on the label image to obtain the boundary label, and the boundary extracted from the feature map is supervised by the boundary label, so that the extracted boundary information is more accurate; on the other hand, during feature fusion, attention processing along the scale, spatial and channel dimensions makes the obtained multi-dimensional fusion feature map highlight the spatial and semantic information that most effectively improves the segmentation result, thereby improving the accuracy of image segmentation.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (9)

1. An image segmentation method based on boundary enhancement, comprising:
establishing a boundary enhanced image segmentation network model based on the encoder-decoder framework;
training the image segmentation network model, and segmenting an input image by using the trained network model to obtain a segmentation result map;
the encoder of the image segmentation network model comprises a first feature extraction module, a boundary extraction module and a second feature extraction module;
the first feature extraction module is used for carrying out feature extraction on the input image to obtain a first feature map;
the boundary extraction module is used for extracting boundary features of the first feature map, extracting boundary labels of image labels corresponding to the input images, and supervising the output boundary features by utilizing the boundary labels to obtain a boundary feature map;
the second feature extraction module is used for carrying out multi-scale feature extraction on the first feature map and the boundary feature map to obtain feature maps with different scales;
the decoder of the image segmentation network model comprises a bidirectional mutual enhancement module and a multi-scale attention aggregation module; the bidirectional mutual enhancement module is used for processing the feature maps of different scales to obtain enhanced feature maps of different scales;
the multi-scale attention aggregation module is used for performing attention aggregation processing on the enhanced feature map based on the scale dimension, the space dimension and the channel dimension to obtain a multi-dimensional fusion feature map.
2. The boundary enhancement based image segmentation method according to claim 1, wherein the first feature extraction module is a convolution layer comprising 1 3×3 convolution operation (Conv), 1 batch normalization (BN) and 1 linear rectification function (ReLU).
3. The image segmentation method based on boundary enhancement according to claim 1, wherein the boundary extraction module adopts the Laplace operator, the Sobel operator, the Canny operator or the LoG operator to extract the boundary label.
4. The boundary enhancement based image segmentation method according to claim 1, wherein the boundary extraction module comprises 2 Laplace operators and 1 convolution layer; wherein:
the first Laplace operator is used for extracting boundary features of the first feature map, and the second Laplace operator is used for extracting boundary labels of image labels corresponding to the input image;
the convolution layer is used for processing the boundary features and supervising the output of the first feature extraction module based on the boundary labels in a model training stage to obtain a boundary feature map.
5. The boundary-enhancement-based image segmentation method according to claim 1, wherein the second feature extraction module employs a convolutional neural network architecture, which may be any one of ResNet-50, ResNet-152, ResNeXt-50, ResNeXt-101 or ResNeXt-152.
6. The boundary enhancement based image segmentation method according to claim 1, wherein the second feature extraction module adopts a convolutional neural network architecture based on ResNet-101, comprising a downsampling layer, a ResNet-101 first stage, a ResNet-101 second stage, a ResNet-101 third stage and a ResNet-101 fourth stage; wherein:
the ResNet-101 first stage contains 3 residual blocks, the ResNet-101 second stage contains 4 residual blocks, the ResNet-101 third stage contains 23 residual blocks, and the ResNet-101 fourth stage contains 3 residual blocks.
7. The image segmentation method based on boundary enhancement according to claim 1, wherein the bidirectional mutual enhancement module comprises a plurality of bidirectional mutual enhancement sub-networks with the same structure, each comprising a first enhancement module and a second enhancement module, which enhance feature maps of two different sizes respectively; wherein:
the first enhancement module first processes the input feature map of a first size with two convolution layers to obtain a second feature map of the first size; it then processes the input feature map of the first size through a convolution layer, a pooling layer and a Sigmoid activation layer in sequence to obtain a third feature map of a second size;
correspondingly, the second enhancement module first processes the input feature map of the second size with two convolution layers to obtain a fourth feature map of the second size; it then processes the input feature map of the second size through a convolution layer, an up-sampling layer and a Sigmoid activation layer in sequence to obtain a fifth feature map of the first size;
the first enhancement module is further configured to multiply the second feature map by the fifth feature map and pass the result through a convolution layer to obtain a first enhanced feature map of the first size;
the second enhancement module is further configured to multiply the third feature map by the fourth feature map and pass the result through a convolution layer to obtain a second enhanced feature map of the second size.
8. The boundary-enhancement-based image segmentation method according to claim 6, wherein the bidirectional mutual enhancement module comprises two bidirectional mutual enhancement sub-networks, wherein the inputs of one sub-network are the output feature maps of the ResNet-101 first and third stages, and the inputs of the other are the output feature maps of the ResNet-101 second and fourth stages.
9. The boundary-enhancement-based image segmentation method of claim 1, wherein the multi-scale attention aggregation module comprises a multi-scale fusion sub-network, a scale-dimension attention sub-network, a spatial-dimension attention sub-network and a channel-dimension attention sub-network; wherein:
the multi-scale fusion sub-network is used for performing size transformation and feature cascading on the enhanced feature maps of different scales;
the scale-dimension attention sub-network is used for performing global average pooling on the cascaded feature maps in the scale dimension to obtain the feature map after scale-dimension attention processing;
the spatial-dimension attention sub-network is used for performing global average pooling and max pooling on the feature map after scale-dimension attention processing in the spatial dimension to obtain the feature map after spatial-dimension attention processing;
the channel-dimension attention sub-network is used for performing global average pooling on the feature map after spatial-dimension attention processing in the channel dimension to obtain the multi-dimensional fusion feature map.

Cited By (1)

* Cited by examiner, † Cited by third party

CN116721351A * (priority 2023-07-06, published 2023-09-08, assignee 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司): Remote sensing intelligent extraction method for road environment characteristics in overhead line channel


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination