CN111028235A - Image segmentation method for enhancing edge and detail information by utilizing feature fusion
- Publication number: CN111028235A (application CN201911094462.3)
- Authority: CN (China)
- Prior art keywords: feature map, feature, fusion, conv, pooling
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06T 7/10 - Image analysis: Segmentation; Edge detection
- G06N 3/045 - Neural networks: Combinations of networks
- G06T 5/00 - Image enhancement or restoration
- G06T 2207/10004 - Still image; Photographic image
- G06T 2207/20081 - Training; Learning
- G06T 2207/20192 - Edge enhancement; Edge preservation
- Y02T 10/40 - Engine management systems (auto-assigned)
Abstract
The invention provides an image segmentation method that enhances edge and detail information through feature fusion, in the technical field of computer vision. The method extracts features from an input image with a convolutional neural network; feeds the extracted features into a decoding structure augmented with additional feature fusion, which enriches edge and detail information while restoring the image resolution, yielding a dense feature map; outputs per-class probabilities through a normalization method; and computes a cross-entropy loss function, updating the network weights by stochastic gradient descent. While restoring the resolution of the feature map, the method recovers the position and boundary detail information lost in the encoding stage, enriches the image information, and obtains a dense feature map that compensates for the sparsity introduced by direct upsampling, producing clearer segmentation boundaries and details and improving the segmentation of small, detailed objects.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to an image segmentation method for enhancing edge and detail information by utilizing feature fusion.
Background
With the continuous progress of science and technology and the rapid development of the economy, artificial intelligence has gradually entered public view, plays an increasingly important role in human production and daily life, and is widely applied across many fields. Computer vision is an important research direction of artificial intelligence and a very important means of realizing automatic scene understanding, with applications in many areas such as autonomous driving systems and unmanned vehicles.
Image semantic segmentation is an important branch of computer vision within machine learning: it processes an input image and automatically segments and identifies its content. Before deep learning was applied to computer vision, classifiers for image semantic segmentation were usually built on texton forests or random forests. With the emergence and vigorous development of deep convolutional neural networks, an effective approach to semantic segmentation became available: applying CNNs to semantic segmentation has made good progress, promoted the development of the field, and achieved remarkable results across many application areas.
After deep learning was applied to semantic segmentation, many classical segmentation methods appeared, such as the Fully Convolutional Network (FCN), the SegNet network with its encoder-decoder structure, and DeepLab with dilated (atrous) convolution. However, as the CNN hierarchy deepens, repeated pooling and downsampling discard the position information and boundary detail information of the picture. This process is irreversible, and the discarded information cannot be completely recovered, so the feature maps upsampled in the decoding stage become sparse due to the information loss; these methods therefore have certain limitations.
In the FCN and the traditional SegNet network, position and edge details are lost to downsampling, and the lost information does not reappear during upsampling in the decoding stage, so the resulting feature map is sparse. Although the SegNet network recovers position information through the pooling index and enriches boundary and detail information with convolution operations, a large amount of information is still lost.
Dilated convolution is a convolutional layer that can produce a dense feature map, but it is computationally expensive and occupies a large amount of memory when processing many high-resolution feature maps.
Existing image semantic segmentation methods still need further improvement in the retention of edge detail features and position information, and segmentation accuracy also needs to be improved.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an image segmentation method that enhances edge and detail information through feature fusion, so as to realize the segmentation of images.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: an image segmentation method for enhancing edge and detail information by using feature fusion comprises the following steps:
step 1: processing the images in the training data set to obtain images with uniform resolution;
step 1.1: scaling and cropping the images in the training data set so that the input images have a uniform size;
step 1.2: fixing the resolution of the input image to 360 × 480;
step 2: inputting the image into a coding structure for feature extraction; the coding structure is the same as that of the SegNet network, adopting the first 13 layers of VGG-16, with a max pooling index added during pooling to record the maximum pixel values in the image and their positions;
the convolution kernel size of each convolutional layer in the coding structure is 3 × 3, and the feature map after each convolutional layer is denoted conv_i_j, where i = 1, 2, 3, 4, 5; j = 1, 2 when i = 1, 2, and j = 1, 2, 3 when i = 3, 4, 5. Each convolutional layer is followed by Batch Normalization and a ReLU activation function. A max pooling index is added to each pooling layer; downsampling is realized with 2 × 2 non-overlapping max pooling, and the position of the maximum pixel value is kept through the max pooling index. The feature map obtained by each pooling layer is denoted pool_r, where r = 1, 2, 3, 4, 5;
the specific method of recording the maximum pixel values in the image and their positions via the max pooling index during pooling is as follows:
for an input feature map $X \in \mathbb{R}^{h \times w \times c}$, where $h$ and $w$ are the height and width of the feature map and $c$ is the number of channels, 2 × 2 non-overlapping max pooling yields the feature map $Y \in \mathbb{R}^{\frac{h}{2} \times \frac{w}{2} \times c}$, where the value at pixel $(i, j)$ is given by:

$$Y_{i,j} = \max_{(p,\,q)\,\in\,\{2i-1,\,2i\}\times\{2j-1,\,2j\}} X_{p,q}$$

The position corresponding to the maximum pixel value, recorded as $(m_i, n_j)$, is given by:

$$(m_i, n_j) = \mathop{\arg\max}_{(p,\,q)\,\in\,\{2i-1,\,2i\}\times\{2j-1,\,2j\}} X_{p,q}$$
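As an illustration of the pooling-with-index bookkeeping described in step 2, the following NumPy sketch implements 2 × 2 non-overlapping max pooling over a single-channel map and records the winning positions; the helper name and the flat-index convention are our own, not the patent's.

```python
import numpy as np

def max_pool_2x2_with_indices(x):
    """2x2 non-overlapping max pooling over a (h, w) feature map.

    Returns the pooled map Y and, for each pooled pixel (i, j), the flat
    index of the winning position (m_i, n_j) in the input map, mirroring
    the max pooling index kept by the coding structure.
    """
    h, w = x.shape
    # View the map as (h/2, w/2) blocks of 4 values and reduce each block.
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    pooled = blocks.max(axis=-1)
    argmax = blocks.argmax(axis=-1)          # 0..3 within each 2x2 block
    # Convert the within-block winner to absolute (row, col) coordinates.
    di, dj = argmax // 2, argmax % 2
    rows = 2 * np.arange(h // 2)[:, None] + di
    cols = 2 * np.arange(w // 2)[None, :] + dj
    indices = rows * w + cols                # flat index into the input map
    return pooled, indices

x = np.array([[1., 2., 5., 0.],
              [4., 3., 1., 2.],
              [0., 1., 3., 7.],
              [2., 6., 4., 5.]])
pooled, idx = max_pool_2x2_with_indices(x)
# pooled: [[4., 5.], [6., 7.]]
```

In a full encoder this would run per channel after each conv block; only the 2×2 non-overlapping case used by the patent is sketched here.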
step 3: inputting the pooled feature map pool_5 obtained by the coding structure into a decoding structure augmented with additional feature fusion, releasing the maximum pixel values at their original positions using the max pooling index and filling the remaining positions with 0, realizing 2× upsampling and obtaining the sparse feature map upsampling5;
the decoding structure comprises three three-layer convolution blocks and two two-layer convolution blocks; each convolutional layer in the decoding structure is followed by Batch Normalization and a ReLU activation function;
the value of each pixel in the obtained sparse feature map upsampling5 is given by:

$$Z_{u,v} = \begin{cases} Y_{i,j}, & (u, v) = (m_i, n_j) \\ 0, & \text{otherwise} \end{cases}$$

wherein $Z_{u,v}$ is the pixel value of pixel $(u, v)$ in the sparse feature map upsampling5;
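The sparse upsampling of step 3 can be sketched as the inverse of pooling: each pooled maximum is released at its recorded position and every other position is filled with 0. This is an illustrative NumPy sketch, assuming the flat-index convention of a hypothetical pooling helper, not the patent's code.

```python
import numpy as np

def max_unpool_2x2(pooled, indices, out_shape):
    """Invert 2x2 max pooling: place each pooled value at its recorded
    flat index in a zero-filled map of shape out_shape, giving the
    sparse 2x-upsampled feature map (upsampling5 in step 3)."""
    z = np.zeros(out_shape)
    z.flat[indices.ravel()] = pooled.ravel()
    return z

pooled = np.array([[4., 5.], [6., 7.]])
indices = np.array([[4, 2], [13, 11]])   # flat positions of the maxima
z = max_unpool_2x2(pooled, indices, (4, 4))
# z carries 4, 5, 6, 7 at their original positions and 0 elsewhere
```

Every position other than the four maxima is zero, which is exactly why the patent adds feature fusion afterwards to densify the map.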
step 4: performing a feature fusion operation through the decoding structure, fusing the sparse feature map upsampling5 with the convolutional feature maps conv_5_1 and conv_5_2, and fusing the result with the pooled feature map pool_4 of corresponding size to obtain the fused feature map F1;
the fusion process adds the pixel values at corresponding positions of the feature maps;
the fused feature map F1 is input into the first three-layer convolution block for convolution to obtain the dense feature map conv_decode5, compensating for the information loss caused by pooling and downsampling;
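A minimal numeric illustration of the additive fusion in step 4 (toy 2 × 2 single-channel maps; the variable names mirror the document's feature-map names, but the values are invented):

```python
import numpy as np

# Feature fusion as described in step 4: pixel values at corresponding
# positions are added. All participating maps share the same resolution.
upsampling5 = np.array([[0., 5.], [4., 0.]])   # sparse upsampled map
conv_5_1    = np.array([[1., 1.], [1., 1.]])   # encoder conv map
conv_5_2    = np.array([[2., 0.], [0., 2.]])   # encoder conv map
pool_4      = np.array([[0., 3.], [3., 0.]])   # pooled map of same size

# Fuse the sparse map with the conv maps, then with the pooled map.
F1 = upsampling5 + conv_5_1 + conv_5_2 + pool_4
# F1: [[3., 9.], [8., 3.]] -- no longer sparse
```

Element-wise addition keeps the channel count unchanged, which is what lets F1 feed directly into the following three-layer convolution block.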
step 5: performing four more feature fusion operations through the decoding structure, repeating upsampling, feature fusion, and convolution until the resolution of the feature map is restored to the original size;
step 5.1: performing the second feature fusion through the decoding structure to restore image information;
step 5.1.1: upsampling conv_decode5 by a factor of 2 using the max pooling index stored when the pooled feature map pool_4 was generated, obtaining the sparse feature map upsampling4;
step 5.1.2: fusing the sparse feature map upsampling4 with the convolutional feature maps conv_4_1 and conv_4_2 of the same resolution extracted from the coding structure, and with the pooled feature map pool_3, to obtain the fused feature map F2;
step 5.1.3: inputting the fused feature map F2 into the second three-layer convolution block for convolution to obtain the dense feature map conv_decode4;
step 5.2: performing the third feature fusion through the decoding structure to restore image information;
step 5.2.1: upsampling the feature map conv_decode4 by a factor of 2 using the max pooling index stored when the pooled feature map pool_3 was generated, obtaining the sparse feature map upsampling3;
step 5.2.2: fusing the sparse feature map upsampling3 with the convolutional feature maps conv_3_1 and conv_3_2 of the same resolution extracted from the coding structure, and with the pooled feature map pool_2, to obtain the fused feature map F3;
step 5.2.3: inputting the fused feature map F3 into the third three-layer convolution block for convolution to obtain the dense feature map conv_decode3;
step 5.3: performing the fourth feature fusion through the decoding structure to recover the detail information of the image;
step 5.3.1: upsampling the feature map conv_decode3 by a factor of 2 using the max pooling index stored when the pooled feature map pool_2 was generated, obtaining the sparse feature map upsampling2;
step 5.3.2: fusing the sparse feature map upsampling2 with the convolutional feature map conv_2_1 and the pooled feature map pool_1 to obtain the fused feature map F4;
step 5.3.3: following the symmetry of the SegNet network, inputting the fused feature map F4 into the first two-layer convolution block for convolution to obtain the dense feature map conv_decode2;
step 5.4: performing the fifth feature fusion through the decoding structure to recover the edge information of the image;
step 5.4.1: upsampling the feature map conv_decode2 by a factor of 2 using the max pooling index stored when the pooled feature map pool_1 was generated, obtaining the sparse feature map upsampling1;
step 5.4.2: fusing the sparse feature map upsampling1 with the convolutional feature map conv_1_1 to obtain the fused feature map F5;
step 5.4.3: inputting the fused feature map F5 into the second two-layer convolution block for convolution to obtain the dense feature map conv_decode1;
step 6: inputting the dense feature map conv_decode1 into a Softmax layer to obtain, for each pixel, the maximum classification probability;
step 7: computing the cross-entropy loss function from the per-pixel classification probabilities, and updating the convolution kernel parameters of each convolutional layer and pooling layer in the coding and decoding structures by stochastic gradient descent, realizing image segmentation.
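Steps 6 and 7 amount to a per-pixel softmax followed by a cross-entropy loss. The NumPy sketch below illustrates only the normalization and loss computation (the gradient-descent update is omitted); it is a hedged illustration, not the patent's training code, and the toy logits are invented.

```python
import numpy as np

def softmax(logits):
    """Per-pixel softmax over the class axis (last axis)."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stabilized
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean per-pixel cross-entropy; labels are integer class ids."""
    h, w, _ = probs.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(picked).mean()

# Toy 2x2 image with 3 classes; conv_decode1 stands in for the dense map.
conv_decode1 = np.array([[[2., 0., 0.], [0., 2., 0.]],
                         [[0., 0., 2.], [2., 0., 0.]]])
labels = np.array([[0, 1], [2, 0]])
probs = softmax(conv_decode1)
pred = probs.argmax(axis=-1)        # class with maximum probability
loss = cross_entropy(probs, labels)
```

In training, the gradient of this loss with respect to the convolution kernels would drive the stochastic-gradient-descent update of step 7.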
The technical principle of the method is as follows: on the basis of the original SegNet network, the decoding stage is improved so that image position and boundary detail information are restored while the resolution of the feature map is restored, yielding a dense feature map. Features of the image are extracted with the convolutional and pooling layers of the coding structure, and layers at different depths extract information at different scales: the shallow structure extracts global low-level semantic information such as edges, orientation, texture, and chroma, while the deep structure extracts local high-level semantic information such as object shape. The deeper the network level, the more abstract the extracted features; to extract more abstract high-level features, the model uses max pooling rather than average pooling in the coding structure.
The maximum pixel values extracted from the feature map and their positions are both important: pooling loses not only edge detail information but also, through the reduced feature-map resolution, position information. A pooling index is therefore added to the coding structure to remember the position of the maximum pixel value. The decoding structure releases the maximum value at its original position via the pooling index and fills the remaining positions with 0, so 2× upsampling is realized while important position information is recovered and errors are reduced.
However, as the network hierarchy deepens, the extracted features become more and more abstract and much edge detail information is lost, with information at a different scale lost at each layer. In the feature map obtained after upsampling in the decoding structure, every position other than the maxima is 0, so the map is sparse, and the lost information does not reappear. Feature fusion is therefore added to the decoding structure to recover this information: the sparse feature map obtained after each upsampling is superposed with the convolutional and pooled feature maps of corresponding size from the encoding stage. In this way, each upsampled feature map is input into the fusion structure, the information lost in the coding stage is gradually recovered, and the fusion result is then passed through convolutional layers to further enrich the information, so that a denser feature map is obtained, the segmentation effect is better, and the accuracy is higher.
The beneficial effect of the above technical scheme is as follows: the image segmentation method that enhances edge and detail information through feature fusion restores the position and edge detail information lost in the encoding stage while restoring the resolution of the feature map, enriches the image information, and obtains a dense feature map that compensates for the sparsity of direct upsampling. The segmented edges and details become clearer, the segmentation of fine and small objects improves, and average segmentation accuracy and mIoU increase.
Drawings
Fig. 1 is a flowchart of an image segmentation method for enhancing edge and detail information by feature fusion according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
In this embodiment, an image segmentation method for enhancing edge and detail information by feature fusion, as shown in fig. 1, includes the following steps:
step 1: processing the images in the training data set to obtain images with uniform resolution;
step 1.1: scaling and cropping the images in the training data set so that the input images have a uniform size;
step 1.2: fixing the resolution of the input image to 360 × 480;
step 2: inputting the image into a coding structure for feature extraction; the coding structure is the same as that of the SegNet network, adopting the first 13 layers of VGG-16, with a max pooling index added during pooling to record the maximum pixel values in the image and their positions;
the convolution kernel size of each convolutional layer in the coding structure is 3 × 3, which ensures the image size is unchanged, and the feature map after each convolutional layer is denoted conv_i_j, where i = 1, 2, 3, 4, 5; j = 1, 2 when i = 1, 2, and j = 1, 2, 3 when i = 3, 4, 5. Each convolutional layer is followed by Batch Normalization and a ReLU activation function. Batch Normalization accelerates model convergence and, to a certain extent, alleviates the gradient-dispersion problem in deep networks, making the deep network model easier and more stable to train; the ReLU activation function mitigates vanishing gradients and alleviates overfitting of the network. A max pooling index is added to each pooling layer; downsampling is realized with 2 × 2 non-overlapping max pooling, and the position of the maximum pixel value is kept through the max pooling index. The feature map obtained by each pooling layer is denoted pool_r, where r = 1, 2, 3, 4, 5;
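The remark that a 3 × 3 kernel keeps the image size unchanged holds when the input is zero-padded by one pixel on each side. A minimal single-channel, single-filter, stride-1 NumPy sketch (illustrative only, not the patent's layer implementation):

```python
import numpy as np

def conv3x3_same(x, kernel):
    """3x3 convolution with zero padding of 1, so output shape == input shape."""
    h, w = x.shape
    padded = np.pad(x, 1)                    # one-pixel zero border
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def relu(x):
    """ReLU activation, applied after each convolution in the patent."""
    return np.maximum(x, 0.0)

x = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.zeros((3, 3)); kernel[1, 1] = 1.0   # identity kernel for the demo
y = relu(conv3x3_same(x, kernel))
# y.shape == x.shape == (4, 4): the spatial size is preserved
```

A real layer would also apply batch normalization between the convolution and the ReLU and use many learned filters; those are omitted to keep the spatial-size point isolated.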
the coding structure uses the first 13 layers of VGG-16 to extract picture features, with the convolutional and pooling layers extracting image features at different scales. The first 4 layers can be regarded as a shallow structure yielding low-level semantic information, and the last 9 layers as a deep structure yielding high-level abstract information, so features at different scales are obtained through the coding structure;
for an input feature map $X \in \mathbb{R}^{h \times w \times c}$, where $h$ and $w$ are the height and width of the feature map and $c$ is the number of channels, 2 × 2 non-overlapping max pooling yields the feature map $Y \in \mathbb{R}^{\frac{h}{2} \times \frac{w}{2} \times c}$, where the value at pixel $(i, j)$ is given by:

$$Y_{i,j} = \max_{(p,\,q)\,\in\,\{2i-1,\,2i\}\times\{2j-1,\,2j\}} X_{p,q}$$

The position corresponding to the maximum pixel value, recorded as $(m_i, n_j)$, is given by:

$$(m_i, n_j) = \mathop{\arg\max}_{(p,\,q)\,\in\,\{2i-1,\,2i\}\times\{2j-1,\,2j\}} X_{p,q}$$
step 3: inputting the pooled feature map pool_5 obtained by the coding structure into a decoding structure augmented with additional feature fusion, releasing the maximum pixel values at their original positions using the max pooling index and filling the remaining positions with 0, realizing 2× upsampling and obtaining the sparse feature map upsampling5;
the decoding structure comprises three three-layer convolution blocks and two two-layer convolution blocks; each convolutional layer in the decoding structure is followed by Batch Normalization and a ReLU activation function;
the value of each pixel in the obtained sparse feature map upsampling5 is given by:

$$Z_{u,v} = \begin{cases} Y_{i,j}, & (u, v) = (m_i, n_j) \\ 0, & \text{otherwise} \end{cases}$$

wherein $Z_{u,v}$ is the pixel value of pixel $(u, v)$ in the sparse feature map upsampling5.
Step 4: since the feature map obtained by upsampling is sparse, a feature fusion operation is performed through the decoding structure. The convolutional feature maps extracted from the coding structure with the same resolution as the sparse feature map upsampling5 are conv_5_1, conv_5_2, and conv_5_3; because pool_5 is obtained by directly pooling conv_5_3, part of that information is already recovered during the 2× upsampling. Therefore, to reduce the model's training parameters, only the sparse feature map upsampling5 is fused with the convolutional feature maps conv_5_1 and conv_5_2, and the fused result is fused with the pooled feature map pool_4 of corresponding size, obtaining the fused feature map F1;
the fusion process adds the pixel values at corresponding positions of the feature maps;
to maintain the symmetry of the original SegNet network, the fused feature map F1 is input into the first three-layer convolution block for convolution, obtaining the dense feature map conv_decode5, further enriching the picture information and compensating for the information loss caused by pooling and downsampling;
Step 4 constitutes the first feature fusion operation; five feature fusions are required in the decoding process of the method. According to the upsampling depth, they fall into three different fusion forms: the first three fusions share the same form, and four more feature fusions follow after this step.
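The five fusion stages and their inputs, as enumerated in steps 4 through 5.4, can be summarized in one place. The dictionary below merely restates the text; the stage and map names are the document's, while the layout is ours.

```python
# Inputs fused at each of the five decoding stages (steps 4 and 5):
# the first three stages share one "deep" fusion form, then two
# progressively shallower forms follow.
fusion_stages = {
    "conv_decode5": ["upsampling5", "conv_5_1", "conv_5_2", "pool_4"],
    "conv_decode4": ["upsampling4", "conv_4_1", "conv_4_2", "pool_3"],
    "conv_decode3": ["upsampling3", "conv_3_1", "conv_3_2", "pool_2"],
    "conv_decode2": ["upsampling2", "conv_2_1", "pool_1"],   # detail recovery
    "conv_decode1": ["upsampling1", "conv_1_1"],             # edge recovery
}
# The first three stages fuse four maps each; the last two fuse fewer
# encoder maps to limit the number of training parameters.
assert len(fusion_stages) == 5
```

Reading the table top to bottom follows the decoder from coarsest to finest resolution.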
Step 5: performing four more feature fusion operations through the decoding structure, repeating upsampling, feature fusion, and convolution until the resolution of the feature map is restored to the original size, obtaining the dense feature map conv_decode1;
step 5.1: performing the second feature fusion through the decoding structure to restore image information;
step 5.1.1: after step 4, the resolution of the feature map conv_decode5 equals that of the pooled feature map pool_4; conv_decode5 is upsampled by a factor of 2 using the max pooling index stored when pool_4 was generated, obtaining the sparse feature map upsampling4;
step 5.1.2: fusing the sparse feature map upsampling4 with the convolutional feature maps conv_4_1 and conv_4_2 of the same resolution extracted from the coding structure, and with the pooled feature map pool_3, to obtain the fused feature map F2;
step 5.1.3: inputting the fused feature map F2 into the second three-layer convolution block for convolution to obtain the dense feature map conv_decode4;
step 5.2: performing the third feature fusion through the decoding structure to restore image information;
step 5.2.1: upsampling the feature map conv_decode4 by a factor of 2 using the max pooling index stored when the pooled feature map pool_3 was generated, obtaining the sparse feature map upsampling3;
step 5.2.2: fusing the sparse feature map upsampling3 with the convolutional feature maps conv_3_1 and conv_3_2 of the same resolution extracted from the coding structure, and with the pooled feature map pool_2, to obtain the fused feature map F3;
step 5.2.3: inputting the fused feature map F3 into the third three-layer convolution block for convolution to obtain the dense feature map conv_decode3;
The first three feature fusions correspond to the coding feature maps of three stages and share the same fusion structure; the feature maps participating in these fusions have lower resolution and carry local abstract features, so the same fusion form is used to recover them.
Step 5.3: performing the fourth feature fusion through the decoding structure to recover the detail information of the image;
step 5.3.1: upsampling the feature map conv_decode3 by a factor of 2 using the max pooling index stored when the pooled feature map pool_2 was generated, obtaining the sparse feature map upsampling2;
step 5.3.2: after step 5.3.1, the resolution of the feature map is restored to half that of the original image; the corresponding coding feature maps are conv_2_1, conv_2_2, and pool_1. To reduce the parameters of model training, only the sparse feature map upsampling2 is fused with the convolutional feature map conv_2_1 and the pooled feature map pool_1, obtaining the fused feature map F4;
step 5.3.3: following the symmetry of the SegNet network, inputting the fused feature map F4 into the first two-layer convolution block for convolution to obtain the dense feature map conv_decode2;
Unlike the first three feature fusions, this fusion corresponds to the coding feature maps of two stages and is used to recover detail information, so its fusion form differs;
step 5.4: performing the fifth feature fusion through the decoding structure to recover the edge information of the image;
step 5.4.1: upsampling the feature map conv_decode2 by a factor of 2 using the max pooling index stored when the pooled feature map pool_1 was generated, obtaining the sparse feature map upsampling1;
step 5.4.2: after step 5.4.1, the resolution of the feature map is restored to the original size, and the coding structure provides convolutional feature maps of the same resolution, conv_1_1 and conv_1_2. To reduce the parameters of model training, only the sparse feature map upsampling1 is fused with the convolutional feature map conv_1_1, obtaining the fused feature map F5;
step 5.4.3: inputting the fused feature map F5 into the second two-layer convolution block for convolution to obtain the dense feature map conv_decode1;
In this feature fusion, only the coding feature map of one stage participates; it is used to recover edge information.
Step 6: inputting the dense feature map conv_decode1 into the Softmax layer to obtain, for each pixel, the maximum classification probability.
Step 7: computing the cross-entropy loss function from the per-pixel classification probabilities, and updating the convolution kernel parameters of each convolutional layer and pooling layer in the coding and decoding structures by stochastic gradient descent, realizing image segmentation.
Finally, it should be noted that the above examples are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the described technical solutions may still be modified, or some or all of the technical features equivalently replaced, without departing from the spirit and scope of the corresponding technical solutions as defined in the appended claims.
Claims (6)
1. An image segmentation method for enhancing edge and detail information by using feature fusion is characterized in that: the method comprises the following steps:
Step 1: process the images in the training data set to obtain images of uniform resolution;
Step 2: input the image into a coding structure for feature extraction; the coding structure is the same as that of the SegNet network, adopting the first 13 layers of VGG-16, and a max-pooling index is added during pooling to remember the maximum pixel values in the image and their positions;
The convolution kernel size of each convolutional layer of the coding structure is 3 × 3, and the feature map after each convolutional layer is denoted conv_i_j, where i = 1, 2, 3, 4, 5; j = 1, 2 when i = 1, 2, and j = 1, 2, 3 when i = 3, 4, 5. Each convolutional layer is followed by batch normalization and a ReLU activation function. A max-pooling index is added to each pooling layer, downsampling is realized by 2 × 2 non-overlapping max pooling, and the position of the maximum pixel value is kept through the max-pooling index; the feature map obtained by each pooling layer is denoted pool_r, where r = 1, 2, 3, 4, 5;
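The 2 × 2 non-overlapping max pooling with a stored max-pooling index can be sketched for a single-channel map as follows. This is an illustrative toy implementation, not the patented code: the function name and the list-of-lists representation are assumptions.

```python
# Toy sketch: 2x2 non-overlapping max pooling that also stores, for each
# window, the position of its maximum pixel (the "max-pooling index").
def max_pool_2x2_with_index(x):
    h, w = len(x), len(x[0])
    pooled = [[0.0] * (w // 2) for _ in range(h // 2)]
    index = [[(0, 0)] * (w // 2) for _ in range(h // 2)]
    for i in range(h // 2):
        for j in range(w // 2):
            # All four positions of the 2x2 window starting at (2i, 2j).
            window = [(2 * i + p, 2 * j + q) for p in (0, 1) for q in (0, 1)]
            m, n = max(window, key=lambda pos: x[pos[0]][pos[1]])
            pooled[i][j] = x[m][n]
            index[i][j] = (m, n)  # remembered position of the maximum
    return pooled, index
```

The stored index is what the decoding structure later uses to place each value back at its original position during upsampling.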
Step 3: input the pooled feature map pool_5 obtained by the coding structure into a decoding structure with additional feature fusion; the max-pooling index releases the maximum pixel values at their original positions and the remaining positions are filled with 0, realizing 2× upsampling and obtaining a sparse feature map upsampling5;
Step 4: perform one feature fusion operation through the decoding structure: fuse the sparse feature map upsampling5 with the convolution feature maps conv_5_1 and conv_5_2, then fuse the result with the pooled feature map pool_4 of the corresponding size, obtaining a fusion feature map F1;
Input the fusion feature map F1 into the first three-layer convolution structure and perform the convolution operation, obtaining a dense feature map conv_decode5 that compensates for the information loss caused by pooling and downsampling;
Step 5: perform four feature fusion operations through the decoding structure, repeating upsampling, feature fusion and convolution until the resolution of the feature map is restored to the original size, obtaining a dense feature map conv_decode1;
Step 6: input the dense feature map conv_decode1 into a Softmax layer to obtain the maximum probability of the classification of each pixel in the image;
Step 7: compute the cross-entropy loss function from the maximum pixel-classification probabilities, and update the convolution kernel parameters of each convolutional layer and each pooling layer in the coding and decoding structures by stochastic gradient descent, thereby realizing the image segmentation.
2. The image segmentation method for enhancing edge and detail information by feature fusion according to claim 1, wherein: the specific method of step 1 is as follows:
Step 1.1: scale and crop the images in the training data set so that the input images have a uniform size;
Step 1.2: fix the resolution of the input images to 360 × 480.
3. The image segmentation method for enhancing edge and detail information by feature fusion according to claim 1, wherein: the specific method of adding a max-pooling index during pooling in step 2, to remember the maximum pixel values in the image and their positions, is as follows:
For an input feature map X ∈ R^(h×w×c), where h and w are the height and width of the feature map and c is the number of channels, 2 × 2 non-overlapping max pooling yields a feature map Y ∈ R^((h/2)×(w/2)×c) in which the value at pixel (i, j) is given by:

Y(i, j) = max{ X(2i − 1, 2j − 1), X(2i − 1, 2j), X(2i, 2j − 1), X(2i, 2j) }
The position corresponding to the maximum value is recorded as (m_i, n_j), given by:

(m_i, n_j) = argmax_{(u, v) ∈ {2i − 1, 2i} × {2j − 1, 2j}} X(u, v)
4. The image segmentation method for enhancing edge and detail information by feature fusion according to claim 3, wherein: in step 3, the decoding structure comprises three-layer convolution structures and two-layer convolution structures, and each convolutional layer in the decoding structure is followed by batch normalization and a ReLU activation function;
The value of each pixel in the resulting sparse feature map upsampling5 is given by:

Z(u, v) = Y(i, j) if (u, v) = (m_i, n_j), and Z(u, v) = 0 otherwise,

where Z(u, v) is the pixel value at point (u, v) in the sparse feature map upsampling5.
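The unpooling described by this formula can be sketched as follows (illustrative only; the function name and list-of-lists representation are assumptions): the stored max-pooling index releases each pooled value at its original position, and every other position of the sparse map is filled with 0.

```python
# Toy sketch of 2x max-unpooling: place each pooled value back at the
# position recorded in the max-pooling index; all other entries stay 0.
def max_unpool_2x2(pooled, index, h, w):
    z = [[0.0] * w for _ in range(h)]  # sparse output, zero-filled
    for i in range(len(pooled)):
        for j in range(len(pooled[0])):
            m, n = index[i][j]        # original position of the maximum
            z[m][n] = pooled[i][j]    # Z(u, v) = Y(i, j) at (m_i, n_j)
    return z
```

This pairs with the pooling sketch given for step 2: the `index` argument is exactly what was stored when the corresponding pooling layer ran.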
5. The image segmentation method for enhancing edge and detail information by feature fusion according to claim 1, wherein: the fusion process of step 4 adds the pixel values at corresponding positions in the feature maps.
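Since fusion is element-wise addition of same-resolution maps, it can be sketched in one function (illustrative only; the function name and list-of-lists representation are assumptions):

```python
# Toy sketch of the fusion operation: feature maps of identical resolution
# are fused by summing the pixel values at each corresponding position.
def fuse(*feature_maps):
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(fm[i][j] for fm in feature_maps) for j in range(w)]
            for i in range(h)]
```

The variadic signature mirrors the method's fusions of two maps (step 5.4.2) as well as three or more (steps 4 and 5.1.2).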
6. The image segmentation method for enhancing edge and detail information by feature fusion according to claim 4, wherein: the specific method of step 5 is as follows:
Step 5.1: perform the second feature fusion through the decoding structure to restore image information;
Step 5.1.1: perform 2× upsampling on conv_decode5 using the max-pooling index stored when the pooling feature map pool_4 was generated, obtaining a sparse feature map upsampling4;
Step 5.1.2: fuse the sparse feature map upsampling4 with the convolution feature maps conv_4_1 and conv_4_2 and the pooled feature map pool_3 of the same resolution extracted from the coding structure, obtaining a fusion feature map F2;
Step 5.1.3: input the fusion feature map F2 into the second three-layer convolution structure and perform the convolution operation, obtaining a dense feature map conv_decode4;
Step 5.2: perform the third feature fusion through the decoding structure to restore image information;
Step 5.2.1: perform 2× upsampling on the feature map conv_decode4 using the max-pooling index stored when the pooling feature map pool_3 was generated, obtaining a sparse feature map upsampling3;
Step 5.2.2: fuse the sparse feature map upsampling3 with the convolution feature maps conv_3_1 and conv_3_2 of the same resolution extracted from the coding structure and the pooled feature map pool_2, obtaining a fusion feature map F3;
Step 5.2.3: input the fusion feature map F3 into the third three-layer convolution structure and perform the convolution operation, obtaining a dense feature map conv_decode3;
Step 5.3: perform the fourth feature fusion through the decoding structure to recover detail information of the image;
Step 5.3.1: perform 2× upsampling on the feature map conv_decode3 using the max-pooling index stored when the pooling feature map pool_2 was generated, obtaining a sparse feature map upsampling2;
Step 5.3.2: fuse the sparse feature map upsampling2 with the convolution feature map conv_2_1 and the pooled feature map pool_1, obtaining a fusion feature map F4;
Step 5.3.3: according to the symmetry of the SegNet network, input the fusion feature map F4 into the first two-layer convolution structure and perform the convolution operation, obtaining a dense feature map conv_decode2;
Step 5.4: perform the fifth feature fusion through the decoding structure to recover edge information of the image;
Step 5.4.1: perform 2× upsampling on the feature map conv_decode2 using the max-pooling index stored when the pooling feature map pool_1 was generated, obtaining a sparse feature map upsampling1;
Step 5.4.2: fuse the sparse feature map upsampling1 with the convolution feature map conv_1_1, obtaining a fusion feature map F5;
Step 5.4.3: input the fusion feature map F5 into the second two-layer convolution structure and perform the convolution operation, obtaining a dense feature map conv_decode1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911094462.3A CN111028235B (en) | 2019-11-11 | 2019-11-11 | Image segmentation method for enhancing edge and detail information by utilizing feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111028235A true CN111028235A (en) | 2020-04-17 |
CN111028235B CN111028235B (en) | 2023-08-22 |
Family
ID=70205321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911094462.3A Active CN111028235B (en) | 2019-11-11 | 2019-11-11 | Image segmentation method for enhancing edge and detail information by utilizing feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111028235B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582111A (en) * | 2020-04-29 | 2020-08-25 | 电子科技大学 | Cell component segmentation method based on semantic segmentation |
CN111666842A (en) * | 2020-05-25 | 2020-09-15 | 东华大学 | Shadow detection method based on double-current-cavity convolution neural network |
CN111784642A (en) * | 2020-06-10 | 2020-10-16 | 中铁四局集团有限公司 | Image processing method, target recognition model training method and target recognition method |
CN113052159A (en) * | 2021-04-14 | 2021-06-29 | ***通信集团陕西有限公司 | Image identification method, device, equipment and computer storage medium |
CN113192200A (en) * | 2021-04-26 | 2021-07-30 | 泰瑞数创科技(北京)有限公司 | Method for constructing urban real scene three-dimensional model based on space-three parallel computing algorithm |
CN113280820A (en) * | 2021-06-09 | 2021-08-20 | 华南农业大学 | Orchard visual navigation path extraction method and system based on neural network |
CN113496453A (en) * | 2021-06-29 | 2021-10-12 | 上海电力大学 | Anti-network image steganography method based on multi-level feature fusion |
CN113724269A (en) * | 2021-08-12 | 2021-11-30 | 浙江大华技术股份有限公司 | Example segmentation method, training method of example segmentation network and related equipment |
CN115828079A (en) * | 2022-04-20 | 2023-03-21 | 北京爱芯科技有限公司 | Method and device for maximum pooling operation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10304193B1 (en) * | 2018-08-17 | 2019-05-28 | 12 Sigma Technologies | Image segmentation and object detection using fully convolutional neural network |
CN109903292A (en) * | 2019-01-24 | 2019-06-18 | 西安交通大学 | A kind of three-dimensional image segmentation method and system based on full convolutional neural networks |
CN110264483A (en) * | 2019-06-19 | 2019-09-20 | 东北大学 | A kind of semantic image dividing method based on deep learning |
Non-Patent Citations (1)
Title |
---|
Xiao Zhaoxia et al.: "A Survey of Research on Image Semantic Segmentation", Software Guide (软件导刊) *
Also Published As
Publication number | Publication date |
---|---|
CN111028235B (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111028235B (en) | Image segmentation method for enhancing edge and detail information by utilizing feature fusion | |
Anwar et al. | Image colorization: A survey and dataset | |
CN107644006B (en) | Automatic generation method of handwritten Chinese character library based on deep neural network | |
CN108830855B (en) | Full convolution network semantic segmentation method based on multi-scale low-level feature fusion | |
CN111028177B (en) | Edge-based deep learning image motion blur removing method | |
CN109087258B (en) | Deep learning-based image rain removing method and device | |
CN108647560B (en) | CNN-based face transfer method for keeping expression information | |
CN110276354B (en) | High-resolution streetscape picture semantic segmentation training and real-time segmentation method | |
CN113408471B (en) | Non-green-curtain portrait real-time matting algorithm based on multitask deep learning | |
CN113569865B (en) | Single sample image segmentation method based on class prototype learning | |
CN111915627A (en) | Semantic segmentation method, network, device and computer storage medium | |
CN110689599A (en) | 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement | |
CN114936605A (en) | Knowledge distillation-based neural network training method, device and storage medium | |
WO2023212997A1 (en) | Knowledge distillation based neural network training method, device, and storage medium | |
CN113689434B (en) | Image semantic segmentation method based on strip pooling | |
CN113066025B (en) | Image defogging method based on incremental learning and feature and attention transfer | |
CN112950477A (en) | High-resolution saliency target detection method based on dual-path processing | |
CN113888547A (en) | Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network | |
CN114048822A (en) | Attention mechanism feature fusion segmentation method for image | |
CN112270366B (en) | Micro target detection method based on self-adaptive multi-feature fusion | |
CN111833360B (en) | Image processing method, device, equipment and computer readable storage medium | |
WO2020043296A1 (en) | Device and method for separating a picture into foreground and background using deep learning | |
CN115984747A (en) | Video saliency target detection method based on dynamic filter | |
CN113139551A (en) | Improved semantic segmentation method based on deep Labv3+ | |
CN114022497A (en) | Image processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||