CN113436210A - Road image segmentation method fusing context progressive sampling - Google Patents

Road image segmentation method fusing context progressive sampling

Info

Publication number
CN113436210A
CN113436210A CN202110706637.2A CN202110706637A
Authority
CN
China
Prior art keywords
layer
output
feature map
pooling
deep
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110706637.2A
Other languages
Chinese (zh)
Other versions
CN113436210B (en)
Inventor
陆彦钊
刘惠义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110706637.2A priority Critical patent/CN113436210B/en
Publication of CN113436210A publication Critical patent/CN113436210A/en
Application granted granted Critical
Publication of CN113436210B publication Critical patent/CN113436210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30248 Vehicle exterior or interior
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle
    • G06T 2207/30256 Lane; Road marking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a road image segmentation method fusing context with progressive sampling, which comprises the following steps: preprocessing a set of acquired road images to obtain segmented pictures; inputting each segmented picture into a constructed Xception model to extract a deep feature map and shallow feature maps; inputting the shallow feature maps into a constructed CBAM attention model to amplify the features of small targets, and feeding the output into a constructed HRNet module for fusion; inputting the deep feature map into a constructed ASPP pyramid module for pooling; and fusing the deep or shallow feature maps of the same resolution in the fusion result and the pooling result, then upsampling step by step by a factor of 2 until the original image size is restored. The method improves segmentation accuracy and makes the segmentation more precise in its details.

Description

Road image segmentation method fusing context progressive sampling
Technical Field
The invention relates to a road image segmentation method fusing context progressive sampling, and belongs to the technical field of image segmentation.
Background
Semantic image segmentation is a key problem in computing and an important direction of computer vision research. Early image segmentation in computer vision generally relied on cues such as edges and gradients and lacked pixel-level understanding, so segmentation accuracy was low and such methods could not be applied in fields such as intelligent driving. In recent years, with deepening research on convolutional neural networks, computers' pixel-level understanding has become increasingly strong and semantic segmentation networks have matured, giving the technology broad application prospects in autonomous driving, human-computer interaction, virtual reality and other fields.
Early semantic segmentation generally used threshold-, edge- and region-based methods. Although these methods are convenient and easy to understand, they lose a great deal of spatial information and segment poorly. To address these problems, Jonathan Long et al. proposed the Fully Convolutional Network (FCN) based on the CNN convolutional neural network. The network removes the last fully connected layers of the CNN, upsamples the final CNN feature map by deconvolution, and enlarges the result to the original image size to achieve pixel-level classification. The work of Jonathan Long et al. was a major breakthrough in semantic image segmentation. However, because the FCN reduces the original image 32-fold before enlarging it, its pooling causes information loss, and no probabilistic model between labels is applied. Chen et al. proposed the DeepLab V1 method, which uses dilated (atrous) convolution to enlarge the receptive field and reduce the number of pooling layers, avoiding the loss of detail caused by excessive pooling. The addition of a CRF (conditional random field) further refines edges and improves the segmentation of complex boundaries such as trees and bicycles. Building on DeepLab V1, Liang-Chieh Chen et al. proposed DeepLab V2. Compared with the DeepLab V1 network, the VGG16 backbone is replaced with ResNet and an ASPP (Atrous Spatial Pyramid Pooling) module is added. ASPP applies dilated convolution layers with several sampling rates in parallel and fuses global and local features to improve segmentation. The subsequent DeepLab V3+ introduced an encoder-decoder structure, fusing the backbone output with shallow features and gradually reconstructing spatial information to better capture object detail, while adopting depthwise separable convolutions to reduce computation. DeepLab V3+ captures context information well, but its edge segmentation accuracy for small-scale objects is still not high.
In order to solve the problems, the application provides a road image segmentation method fusing context progressive sampling.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art and provide a road image segmentation method fusing context with progressive sampling, which identifies small target objects on the road more accurately and markedly improves the segmentation of image detail.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
a road image segmentation method fusing context progressive sampling comprises the following steps:
preprocessing the acquired multiple road images to obtain segmentation pictures;
inputting the segmented picture into a constructed Xception model to extract a deep layer feature map and a shallow layer feature map;
inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets, and inputting the output result into the constructed HRNet module for fusion;
inputting the deep characteristic map into a constructed ASPP pyramid module for pooling;
and fusing the deep or shallow feature maps of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge them back to the original image size.
Preferably, the preprocessing the acquired multiple road images to obtain the segmentation picture includes:
cutting the road image into 1024 x 1024 pixel pictures, and uniformly storing the pictures into a jpg format;
performing semantic annotation on each picture to obtain a segmented picture;
the semantic annotation content comprises a background, an automobile, a person, the sky, a road, a grassland, a wall, a building and a pedestrian road.
Preferably, the construction of the Xception model includes:
building a block1 intermediate feature layer, consisting of a 32-channel 3 × 3 convolution layer, a ReLU activation layer, a 64-channel 3 × 3 convolution layer and a ReLU activation layer;
building a block2 intermediate feature layer, consisting of 2 128-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block3 intermediate feature layer, consisting of 2 256-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block4 intermediate feature layer, consisting of 2 728-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building the block5 to block13 intermediate feature layers, each consisting of 3 728-channel 3 × 3 depthwise separable convolution layers and 3 ReLU activation layers;
wherein the output of the block1 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; the output of the block2 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; and the output of the block3 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer.
Preferably, the step of inputting the segmented picture into the constructed Xception model to extract the deep feature map and the shallow feature map includes: the Xception model extracts deep feature maps in the segmented pictures in the block13 intermediate feature layer, and the Xception model extracts shallow feature maps in the segmented pictures in the block2, block3 and block4 intermediate feature layers.
Preferably, the inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets therein, and inputting the output result into the constructed HRNet module for fusion includes:
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify the features of small targets, and outputting out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps at the 3 corresponding resolutions, namely hrout1, hrout2 and hrout3;
wherein a small target is an object whose area in the segmented picture is smaller than 10 × 10 pixels;
the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
Preferably, the CBAM attention model is constructed by: constructing a channel attention mechanism and a space attention mechanism;
the channel attention mechanism comprises:
respectively performing maximum pooling and average pooling on the input feature map on channel dimensions to extract the maximum weight and the average weight on each channel;
respectively sending the maximum weight and the average weight to two full-connection layers for classification;
adding the classification results and activating by using a sigmoid function to obtain an importance weight matrix of each channel;
multiplying the importance weight matrix of each channel with the input feature map to obtain the output of amplified channel features;
the maximum pooling takes the maximum of the pixel values in each channel, the average pooling takes the average of the pixel values in each channel, and the sigmoid activation function maps the weights into the range (0, 1), pushing larger input values toward 1 and smaller input values toward 0;
the spatial attention mechanism comprises:
performing primary maximum pooling and primary average pooling on the output of the amplified channel characteristics on the spatial dimension to extract the maximum weight and the average weight of each pixel point;
carrying out convolution operation on the maximum weight and the average weight through a 3-by-3 convolution layer, activating by using a sigmoid function, and outputting an importance weight matrix of each pixel point;
and multiplying the importance weight matrix of each pixel point by the output of the amplified channel characteristic to obtain the output of the amplified pixel point characteristic.
Preferably, the ASPP pyramid module comprises 3 × 3 convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1; the step of inputting the deep feature map into the constructed ASPP pyramid module for pooling comprises:
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into an average pooling layer with stride 1 to output a result;
and combining the output results to obtain the final pooled output of the ASPP pyramid module.
Preferably, fusing the deep or shallow feature maps of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge them back to the original image size, comprises:
convolving the pooled output of the ASPP pyramid module once, upsampling by 2× once, and combining with hrout3;
convolving the combined result once, upsampling by 2× once, and combining with hrout2;
convolving the combined result once, upsampling by 2× once, and combining with hrout1;
and convolving the combined result twice, upsampling by 2× once, and activating with a softmax function to obtain the final output.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a road image segmentation method fusing context and step-by-step sampling, which fuses different levels of features by utilizing a HRNet mode, adds a CBAM attention mechanism in front of an HRNet module, enhances a beneficial feature channel, weakens a useless feature channel, and finally samples the output of an ASPP pyramid module and the fused different levels of features step by step. The experimental results show that: the method for integrating context and up-sampling step by step is more accurate in identifying small target objects on the road and has obvious improvement on image detail segmentation; the invention can help the automobile to identify the type, position and size of the road object, and can effectively pre-judge the remote small target pedestrian in advance because the identification of the small target object is more accurate, thereby having a large exertion space in the intelligent driving direction.
Drawings
Fig. 1 is a flowchart of a road image segmentation method fusing context progressive sampling according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The first embodiment is as follows:
the embodiment provides a road image segmentation method, which comprises the following steps:
step 1, preprocessing a plurality of acquired road images to obtain segmented pictures;
cutting the road image into 1024 x 1024 pixel pictures, and uniformly storing the pictures into a jpg format;
performing semantic annotation on each picture to obtain a segmented picture;
the semantic annotation content comprises a background, an automobile, a person, the sky, a road, a grassland, a wall, a building and a pedestrian road.
Step 2, inputting the segmented picture into the constructed Xception model to extract a deep feature map and shallow feature maps;
the Xception model extracts deep feature maps in the segmented pictures in the block13 intermediate feature layer, and the Xception model extracts shallow feature maps in the segmented pictures in the block2, block3 and block4 intermediate feature layers.
Step 3, inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets, and inputting the output result into the constructed HRNet module for fusion;
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify the features of small targets, and outputting out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps at the 3 corresponding resolutions, namely hrout1, hrout2 and hrout3;
wherein a small target is an object whose area in the segmented picture is smaller than 10 × 10 pixels;
the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
Step 4, inputting the deep feature map into the constructed ASPP pyramid module for pooling;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into an average pooling layer with stride 1 to output a result;
and combining the output results to obtain the final pooled output of the ASPP pyramid module.
Step 5, fusing the deep or shallow feature maps of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge them back to the original image size.
Convolve the pooled output of the ASPP pyramid module once, upsample by 2× once, and combine with hrout3;
convolve the combined result once, upsample by 2× once, and combine with hrout2;
convolve the combined result once, upsample by 2× once, and combine with hrout1;
convolve the combined result twice, upsample by 2× once, and activate with a softmax function to obtain the final output.
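As an illustration of this step-by-step upsampling, a minimal tf.keras sketch is given below. It is not the patented implementation: the intermediate channel widths, the bilinear interpolation and the function name decoder are assumptions; only the overall pattern (convolve, upsample by 2, merge with the next hrout, finish with softmax) follows the description above.

    import tensorflow as tf
    from tensorflow.keras import layers

    def decoder(aspp_out, hrout1, hrout2, hrout3, num_classes=9):
        # aspp_out is 1/16 of the input size; hrout3/hrout2/hrout1 are 1/8, 1/4 and 1/2
        x = layers.Conv2D(80, 1, padding="same", activation="relu")(aspp_out)
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        for skip in (hrout3, hrout2, hrout1):
            x = layers.Concatenate()([x, skip])                  # merge at equal resolution
            x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
            x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(num_classes, 1, padding="same")(x)     # 9 semantic classes
        return layers.Activation("softmax")(x)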
The Xception model is a network structure proposed by *** corporation for image classification, and in this embodiment, the construction of the Xception model includes:
building a block1 intermediate feature layer, consisting of a 32-channel 3 × 3 convolution layer, a ReLU activation layer, a 64-channel 3 × 3 convolution layer and a ReLU activation layer;
building a block2 intermediate feature layer, consisting of 2 128-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block3 intermediate feature layer, consisting of 2 256-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block4 intermediate feature layer, consisting of 2 728-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building the block5 to block13 intermediate feature layers, each consisting of 3 728-channel 3 × 3 depthwise separable convolution layers and 3 ReLU activation layers;
wherein the output of the block1 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; the output of the block2 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; and the output of the block3 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer.
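By way of illustration only, the following is a minimal tf.keras sketch of an Xception-style backbone of the kind described above. It is not the authors' implementation: the helper names (entry_block, middle_block, build_backbone), the stride placement, and the use of the block outputs (rather than the pre-pooling features) as the shallow feature maps are assumptions made for the sketch.

    import tensorflow as tf
    from tensorflow.keras import layers

    def entry_block(x, filters):
        # block2-block4 style: two separable 3 x 3 convs, ReLU, max pooling,
        # plus a 1 x 1-projected residual added to the block output
        residual = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
        y = layers.SeparableConv2D(filters, 3, padding="same")(x)
        y = layers.ReLU()(y)
        y = layers.SeparableConv2D(filters, 3, padding="same")(y)
        y = layers.MaxPooling2D(3, strides=2, padding="same")(y)
        return layers.Add()([y, residual])

    def middle_block(x, filters=728):
        # block5-block13 style: three separable 3 x 3 convs with ReLU, residual add
        y = x
        for _ in range(3):
            y = layers.ReLU()(y)
            y = layers.SeparableConv2D(filters, 3, padding="same")(y)
        return layers.Add()([y, x])

    def build_backbone(input_shape=(1024, 1024, 3)):
        inp = layers.Input(shape=input_shape)
        # block1: plain 3 x 3 convolutions with 32 and 64 channels
        x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        shallow2 = x = entry_block(x, 128)   # block2 (shallow feature map)
        shallow3 = x = entry_block(x, 256)   # block3 (shallow feature map)
        shallow4 = x = entry_block(x, 728)   # block4 (shallow feature map)
        for _ in range(9):                   # block5 to block13
            x = middle_block(x, 728)
        deep = x                             # block13 (deep feature map)
        return tf.keras.Model(inp, [shallow2, shallow3, shallow4, deep])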
The Convolutional Block Attention Module (CBAM) is an attention model combining spatial and channel attention; by using the spatial and channel information among pixels it can improve the recognition of small target objects. In this embodiment, constructing the CBAM attention model comprises: constructing a channel attention mechanism and a spatial attention mechanism;
the channel attention mechanism comprises:
respectively performing maximum pooling and average pooling on the input feature map on channel dimensions to extract the maximum weight and the average weight on each channel;
respectively sending the maximum weight and the average weight to two full-connection layers for classification;
adding the classification results and activating by using a sigmoid function to obtain an importance weight matrix of each channel;
multiplying the importance weight matrix of each channel with the input feature map to obtain the output of amplified channel features;
the maximum pooling takes the maximum of the pixel values in each channel, the average pooling takes the average of the pixel values in each channel, and the sigmoid activation function maps the weights into the range (0, 1), pushing larger input values toward 1 and smaller input values toward 0;
the spatial attention mechanism comprises:
performing primary maximum pooling and primary average pooling on the output of the amplified channel characteristics on the spatial dimension to extract the maximum weight and the average weight of each pixel point;
carrying out convolution operation on the maximum weight and the average weight through a 3-by-3 convolution layer, activating by using a sigmoid function, and outputting an importance weight matrix of each pixel point;
and multiplying the importance weight matrix of each pixel point by the output of the amplified channel characteristic to obtain the output of the amplified pixel point characteristic.
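The channel and spatial attention described above can be sketched in tf.keras as follows. This is a minimal sketch rather than the patented implementation: the reduction ratio of 8, the shared two-layer perceptron used as the two fully connected layers, and the helper names (channel_attention, spatial_attention, cbam) are assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    def channel_attention(x, reduction=8):
        channels = x.shape[-1]
        fc1 = layers.Dense(channels // reduction, activation="relu")   # first fully connected layer
        fc2 = layers.Dense(channels)                                   # second fully connected layer
        avg = layers.GlobalAveragePooling2D()(x)        # average weight of each channel
        mx = layers.GlobalMaxPooling2D()(x)             # maximum weight of each channel
        w = tf.nn.sigmoid(fc2(fc1(avg)) + fc2(fc1(mx))) # importance weight per channel
        w = layers.Reshape((1, 1, channels))(w)
        return x * w                                    # output with channel features amplified

    def spatial_attention(x):
        avg = tf.reduce_mean(x, axis=-1, keepdims=True) # average weight of each pixel
        mx = tf.reduce_max(x, axis=-1, keepdims=True)   # maximum weight of each pixel
        w = layers.Conv2D(1, 3, padding="same",
                          activation="sigmoid")(tf.concat([avg, mx], axis=-1))
        return x * w                                    # output with pixel features amplified

    def cbam(x):
        # channel attention first, then spatial attention, as described above
        return spatial_attention(channel_attention(x))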
The ASPP pyramid module enlarges the receptive field using dilated (hole) convolutions with different dilation rates, avoiding the loss of resolution that conventional methods accept in order to obtain a larger receptive field. In this embodiment, the ASPP pyramid module comprises 3 × 3 convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1.
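For illustration, a minimal tf.keras sketch of such an ASPP module is given below. The 256-channel width and the concatenation to 1024 channels follow the embodiment; the pooling window of the stride-1 average pooling branch and the single 3 × 3 projection after concatenation are assumptions of the sketch.

    import tensorflow as tf
    from tensorflow.keras import layers

    def aspp(x, filters=256):
        branches = []
        for rate in (6, 12, 18):                        # parallel dilated 3 x 3 convolutions
            branches.append(layers.Conv2D(filters, 3, dilation_rate=rate,
                                          padding="same", activation="relu")(x))
        pool = layers.AveragePooling2D(pool_size=3, strides=1, padding="same")(x)
        pool = layers.Conv2D(filters, 3, padding="same", activation="relu")(pool)
        branches.append(pool)                           # stride-1 average pooling branch
        merged = layers.Concatenate()(branches)         # 4 x 256 = 1024 channels
        return layers.Conv2D(filters, 3, padding="same", activation="relu")(merged)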
In this embodiment, constructing the HRNet module includes:
combining out1, the 2× upsampled out2 and the 4× upsampled out3 into out11; combining the 2× downsampled out1, out2 and the 2× upsampled out3 into out22; combining the 4× downsampled out1, the 2× downsampled out2 and out3 into out33;
combining out11, the 2× upsampled out22 and the 4× upsampled out33 into out111; combining the 2× downsampled out11, out22 and the 2× upsampled out33 into out222; and combining the 4× downsampled out11, the 2× downsampled out22 and out33 into out333 (these are hrout1, hrout2 and hrout3, respectively).
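For illustration, a minimal tf.keras sketch of this two-round cross-resolution fusion is given below. The use of average pooling for downsampling, nearest-neighbour upsampling, and a single 3 × 3 convolution after each merge (with the 128/256/512 channel widths of the embodiment) are assumptions of the sketch, not requirements of the method.

    import tensorflow as tf
    from tensorflow.keras import layers

    def _fuse(o1, o2, o3, c1=128, c2=256, c3=512):
        # one round of cross fusion between three resolutions (o1 largest, o3 smallest)
        f1 = layers.Concatenate()([o1,
                                   layers.UpSampling2D(2)(o2),
                                   layers.UpSampling2D(4)(o3)])
        f2 = layers.Concatenate()([layers.AveragePooling2D(2)(o1),
                                   o2,
                                   layers.UpSampling2D(2)(o3)])
        f3 = layers.Concatenate()([layers.AveragePooling2D(4)(o1),
                                   layers.AveragePooling2D(2)(o2),
                                   o3])
        conv = lambda t, c: layers.Conv2D(c, 3, padding="same", activation="relu")(t)
        return conv(f1, c1), conv(f2, c2), conv(f3, c3)

    def hrnet_fusion(out1, out2, out3):
        out11, out22, out33 = _fuse(out1, out2, out3)   # first round of cross fusion
        return _fuse(out11, out22, out33)               # hrout1, hrout2, hrout3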
Taking German city street scenes as an example, the data set contains 9 broad categories: background, car, person, sky, road, grass, wall, building and pedestrian road. The data set has 1300 road street-view pictures from 10 German cities; 1000 samples are used for training and 300 samples for testing. Each picture is 2048 × 1024 pixels. Training is performed on a Tesla P100 GPU with 16 GB of video memory, using mini-batch stochastic gradient descent with the Adam optimizer; the learning rate is 0.001 for the first 500 epochs and is reduced to 0.0001 for the last 200 epochs. The loss function is the cross-entropy loss (categorical_crossentropy).
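The training configuration above can be expressed, as a rough sketch only, with the tf.keras API. Here `model`, `x_train` and `y_train` (per-pixel one-hot labels) are assumed to exist, and the batch size of 2 is an assumption rather than a value taken from the patent.

    import tensorflow as tf

    def lr_schedule(epoch, lr):
        # 0.001 for the first 500 epochs, 0.0001 for the last 200 epochs
        return 1e-3 if epoch < 500 else 1e-4

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy",       # cross-entropy loss
                  metrics=["accuracy"])
    model.fit(x_train, y_train,                          # y_train: one-hot labels per pixel
              batch_size=2,
              epochs=700,
              callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])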
The training data set is put into step (1) for image preprocessing, as follows:
1.1 Cut the 1000 training pictures into 2000 pictures of 1024 × 1024 pixels.
1.2 Convert the 2000 cropped pictures from step 1.1 into 3-channel array format, giving 2000 matrices of size 1024 × 1024 × 3.
1.3 Merge the 2000 three-dimensional matrices from step 1.2 into one four-dimensional matrix of size 2000 × 1024 × 1024 × 3.
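A minimal sketch of steps 1.1 to 1.3 is shown below; it assumes the source pictures are 2048 × 1024 jpg files read with Pillow, and the directory layout and function name are illustrative.

    import numpy as np
    from pathlib import Path
    from PIL import Image

    def load_crops(image_dir):
        crops = []
        for path in sorted(Path(image_dir).glob("*.jpg")):
            img = np.asarray(Image.open(path).convert("RGB"))   # H x W x 3 array
            h, w, _ = img.shape                                  # expected 1024 x 2048
            for x0 in range(0, w, 1024):                         # two 1024-wide crops
                crops.append(img[:1024, x0:x0 + 1024, :])
        return np.stack(crops)                                   # N x 1024 x 1024 x 3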
The result of step 1.3 is put into step (2), where features are extracted with the Xception network, as follows:
2.1 Pad the height and width of the four-dimensional matrix with 2 rows and columns of zeros on each side to form a 2000 × 1028 × 1028 × 3 matrix, and put it into block1 of the Xception network to produce a 2000 × 512 × 512 × 64 matrix.
2.2 Put the output of step 2.1 into block2 to obtain a shallow feature map matrix of size 2000 × 256 × 256 × 128.
2.3 Put the output of step 2.2 into block3 to obtain a shallow feature map matrix of size 2000 × 128 × 128 × 256.
2.4 Put the output of step 2.3 into block4 to obtain a shallow feature map matrix of size 2000 × 64 × 64 × 728.
2.5 Put the output of step 2.4 sequentially through block5, block6, block7, block8, block9, block10, block11, block12 and block13 to obtain a deep feature map matrix of size 2000 × 64 × 64 × 728.
The outputs of steps 2.2, 2.3 and 2.4 are each put into step (3), where the small-target features are amplified with the CBAM attention mechanism, specifically as follows:
3.1 The feature matrix before block2 pooling in step 2.2 passes through the channel attention mechanism and then the spatial attention mechanism of the CBAM module, giving an output matrix with the small-target features amplified, of size 2000 × 512 × 512 × 128.
3.2 The feature matrix before block3 pooling in step 2.3 passes through the channel attention mechanism and then the spatial attention mechanism of the CBAM module, giving an output matrix with the small-target features amplified, of size 2000 × 256 × 256 × 256.
3.3 The feature matrix before block4 pooling in step 2.4 passes through the channel attention mechanism and then the spatial attention mechanism of the CBAM module, giving an output matrix with the small-target features amplified, of size 2000 × 128 × 128 × 512.
The deep feature map of block13 obtained in step 2.5 is put into step (4), where a larger receptive field is obtained with the ASPP pyramid module, specifically as follows:
4.1 Put the output feature matrix of block13 into a 3 × 3 convolution layer with 256 channels and dilation rate 6, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.2 Put the output feature matrix of block13 into a 3 × 3 convolution layer with 256 channels and dilation rate 12, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.3 Put the output feature matrix of block13 into a 3 × 3 convolution layer with 256 channels and dilation rate 18, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.4 Put the output feature matrix of block13 into a pooling layer with stride 1 and then through a 3 × 3 convolution layer with 256 channels, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.5 Combine the outputs of steps 4.1, 4.2, 4.3 and 4.4 to obtain a feature matrix of size 2000 × 64 × 64 × 1024, and pass it through a 3 × 3 convolution layer with 256 channels to obtain a feature matrix of size 2000 × 64 × 64 × 256.
The shallow feature matrices from step (3) are put into step (5) and cross-fused across resolutions by the HRNet module, specifically as follows:
5.1 Upsample the output of step 3.2 by 2× to 2000 × 512 × 512 × 256 and the output of step 3.3 by 4× to 2000 × 512 × 512 × 512; combine these two results with the result of step 3.1 to obtain a 2000 × 512 × 512 × 896 matrix, and pass it through two 3 × 3 convolution layers with 128 channels to obtain a feature matrix of size 2000 × 512 × 512 × 128.
5.2 Downsample the output of step 3.1 by 2× to 2000 × 256 × 256 × 128 and upsample the output of step 3.3 by 2× to 2000 × 256 × 256 × 512; combine these two results with the result of step 3.2 to obtain a 2000 × 256 × 256 × 896 matrix, and pass it through two 3 × 3 convolution layers with 256 channels to obtain a feature matrix of size 2000 × 256 × 256 × 256.
5.3 Downsample the output of step 3.1 by 4× to 2000 × 128 × 128 × 128 and the output of step 3.2 by 2× to 2000 × 128 × 128 × 256; combine these two results with the result of step 3.3 to obtain a 2000 × 128 × 128 × 896 matrix, and pass it through two 3 × 3 convolution layers with 512 channels to obtain a feature matrix of size 2000 × 128 × 128 × 512.
5.4 Upsample the output of step 5.2 by 2× to 2000 × 512 × 512 × 256 and the output of step 5.3 by 4× to 2000 × 512 × 512 × 512; combine these two results with the result of step 5.1 to obtain a 2000 × 512 × 512 × 896 matrix, and pass it through two 3 × 3 convolution layers with 128 channels to obtain a feature matrix of size 2000 × 512 × 512 × 128.
5.5 Downsample the output of step 5.1 by 2× to 2000 × 256 × 256 × 128 and upsample the output of step 5.3 by 2× to 2000 × 256 × 256 × 512; combine these two results with the result of step 5.2 to obtain a 2000 × 256 × 256 × 896 matrix, and pass it through two 3 × 3 convolution layers with 256 channels to obtain a feature matrix of size 2000 × 256 × 256 × 256.
5.6 Downsample the output of step 5.1 by 4× to 2000 × 128 × 128 × 128 and the output of step 5.2 by 2× to 2000 × 128 × 128 × 256; combine these two results with the result of step 5.3 to obtain a 2000 × 128 × 128 × 896 matrix, and pass it through two 3 × 3 convolution layers with 512 channels to obtain a feature matrix of size 2000 × 128 × 128 × 512.
The outputs of steps 4.5, 5.4, 5.5 and 5.6 are sent to step (6), where the feature map is progressively upsampled and enlarged, specifically as follows:
6.1 Put the output of step 5.6 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 128 × 128 × 80.
6.2 Upsample the output of step 4.5 by 2× and combine it with the output of step 6.1 to obtain a matrix of size 2000 × 128 × 128 × 336; then pass it through a 3 × 3 convolution layer with 256 channels and upsample by 2× to obtain a matrix of size 2000 × 256 × 256 × 256.
6.3 Put the output of step 5.5 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 256 × 256 × 80.
6.4 Combine the output of step 6.3 with the output of step 6.2 to obtain a matrix of size 2000 × 256 × 256 × 336; then pass it through a 3 × 3 convolution layer with 256 channels and upsample by 2× to obtain a matrix of size 2000 × 512 × 512 × 256.
6.5 Put the output of step 5.4 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 512 × 512 × 80.
6.6 Combine the output of step 6.5 with the output of step 6.4 to obtain a matrix of size 2000 × 512 × 512 × 336; then pass it through a 3 × 3 convolution layer with 256 channels and upsample by 2× to obtain a matrix of size 2000 × 1024 × 1024 × 256.
6.7 Put the output matrix of step 6.6 into a 9-channel 1 × 1 convolution layer and activate it with a softmax function, obtaining a matrix of size 2000 × 1024 × 1024 × 9.
6.8 Compare the difference between the output matrix and the annotated picture matrix and continuously optimize the network parameters by gradient descent, using the cross-entropy loss function; the final network is obtained after 700 training epochs.
In step (7) the segmented pictures are output, specifically as follows:
7.1 Testing uses 300 test pictures of 2048 × 1024 pixels; each picture is cropped into 2 pictures of 1024 × 1024 pixels.
7.2 The cropped pictures from step 7.1 are fed into the network in one pass to obtain a matrix of size 600 × 1024 × 1024 × 9; taking the argmax over the last dimension (one-hot decoding) reduces it to a matrix of size 600 × 1024 × 1024, i.e. 600 pictures of 1024 × 1024 pixels, in which each pixel carries a label from 0 to 8 representing the 9 classes background, car, person, sky, road, grass, wall, building and pedestrian road.
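Step 7.2 amounts to taking the argmax over the class dimension of the network output; a minimal sketch is shown below, where `model` and `test_crops` are assumed to exist and the class-index order follows the annotation scheme above.

    import numpy as np

    probs = model.predict(test_crops)         # N x 1024 x 1024 x 9 class probabilities
    label_maps = np.argmax(probs, axis=-1)    # N x 1024 x 1024 label map, values 0 to 8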
TABLE 1 Comparison of the method of the invention with other methods

Method      DeepLab V1    DeepLab V2    DeepLab V3+    Method of the invention
Accuracy    79.5%         83.32%        88.48%         90.02%
As can be seen from Table 1, the method of the invention achieves higher accuracy in road image segmentation than the existing mainstream segmentation networks. In particular, its ability to recognize small targets is stronger and its segmentation of object edges is more accurate. The method uses deep learning to discover and classify the distinguishing features of different objects, and can be widely applied in directions such as road recognition and road scene segmentation.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (8)

1. A road image segmentation method, comprising:
preprocessing the acquired multiple road images to obtain segmentation pictures;
inputting the segmented picture into a constructed Xception model to extract a deep layer feature map and a shallow layer feature map;
inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets, and inputting the output result into the constructed HRNet module for fusion;
inputting the deep characteristic map into a constructed ASPP pyramid module for pooling;
and fusing the deep feature map or the shallow feature map of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge it back to the original image size.
2. The road image segmentation method according to claim 1, wherein the preprocessing the acquired road images to obtain segmented pictures comprises:
cutting the road image into 1024 x 1024 pixel pictures, and uniformly storing the pictures into a jpg format;
performing semantic annotation on each picture to obtain a segmented picture;
the semantic annotation content comprises a background, an automobile, a person, the sky, a road, a grassland, a wall, a building and a pedestrian road.
3. The road image segmentation method according to claim 1, wherein the construction of the Xception model comprises:
building a block1 intermediate feature layer, consisting of a 32-channel 3 × 3 convolution layer, a ReLU activation layer, a 64-channel 3 × 3 convolution layer and a ReLU activation layer;
building a block2 intermediate feature layer, consisting of 2 128-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block3 intermediate feature layer, consisting of 2 256-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block4 intermediate feature layer, consisting of 2 728-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building the block5 to block13 intermediate feature layers, each consisting of 3 728-channel 3 × 3 depthwise separable convolution layers and 3 ReLU activation layers;
wherein the output of the block1 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; the output of the block2 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; and the output of the block3 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer.
4. The road image segmentation method of claim 3, wherein the step of inputting the segmented picture into the constructed Xception model to extract the deep feature map and the shallow feature maps comprises: the Xception model extracts the deep feature map of the segmented picture in the block13 intermediate feature layer, and extracts the shallow feature maps of the segmented picture in the block2, block3 and block4 intermediate feature layers.
5. The road image segmentation method according to claim 4, wherein the inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets, and inputting the output result into the constructed HRNet module for fusion comprises:
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify the features of small targets, and outputting out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps at the 3 corresponding resolutions, namely hrout1, hrout2 and hrout3;
wherein a small target is an object whose area in the segmented picture is smaller than 10 × 10 pixels;
the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
6. The road image segmentation method according to claim 1, wherein the constructing of the CBAM attention model comprises: constructing a channel attention mechanism and a space attention mechanism;
the channel attention mechanism comprises:
respectively performing maximum pooling and average pooling on the input feature map on channel dimensions to extract the maximum weight and the average weight on each channel;
respectively sending the maximum weight and the average weight to two full-connection layers for classification;
adding the classification results and activating by using a sigmoid function to obtain an importance weight matrix of each channel;
multiplying the importance weight matrix of each channel with the input feature map to obtain the output of amplified channel features;
the maximum pooling takes the maximum of the pixel values in each channel, the average pooling takes the average of the pixel values in each channel, and the sigmoid activation function maps the weights into the range (0, 1), pushing larger input values toward 1 and smaller input values toward 0;
the spatial attention mechanism comprises:
performing primary maximum pooling and primary average pooling on the output of the amplified channel characteristics on the spatial dimension to extract the maximum weight and the average weight of each pixel point;
carrying out convolution operation on the maximum weight and the average weight through a 3-by-3 convolution layer, activating by using a sigmoid function, and outputting an importance weight matrix of each pixel point;
and multiplying the importance weight matrix of each pixel point by the output of the amplified channel characteristic to obtain the output of the amplified pixel point characteristic.
7. The road image segmentation method according to claim 4, wherein the ASPP pyramid module comprises 3 × 3 convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1; the step of inputting the deep feature map into the constructed ASPP pyramid module for pooling comprises:
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into an average pooling layer with stride 1 to output a result;
and combining the output results to obtain the final pooled output of the ASPP pyramid module.
8. The road image segmentation method of claim 7, wherein fusing the deep feature map or the shallow feature map of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge it back to the original image size, comprises:
convolving the pooled output of the ASPP pyramid module once, upsampling by 2× once, and combining with hrout3;
convolving the combined result once, upsampling by 2× once, and combining with hrout2;
convolving the combined result once, upsampling by 2× once, and combining with hrout1;
and convolving the combined result twice, upsampling by 2× once, and activating with a softmax function to obtain the final output.
CN202110706637.2A 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling Active CN113436210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110706637.2A CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110706637.2A CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Publications (2)

Publication Number Publication Date
CN113436210A true CN113436210A (en) 2021-09-24
CN113436210B CN113436210B (en) 2022-10-11

Family

ID=77754090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110706637.2A Active CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Country Status (1)

Country Link
CN (1) CN113436210B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842333A (en) * 2022-04-14 2022-08-02 湖南盛鼎科技发展有限责任公司 Remote sensing image building extraction method, computer equipment and storage medium
CN116935226A (en) * 2023-08-01 2023-10-24 西安电子科技大学 HRNet-based improved remote sensing image road extraction method, system, equipment and medium
CN117789153A (en) * 2024-02-26 2024-03-29 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842333A (en) * 2022-04-14 2022-08-02 湖南盛鼎科技发展有限责任公司 Remote sensing image building extraction method, computer equipment and storage medium
CN114842333B (en) * 2022-04-14 2022-10-28 湖南盛鼎科技发展有限责任公司 Remote sensing image building extraction method, computer equipment and storage medium
CN116935226A (en) * 2023-08-01 2023-10-24 西安电子科技大学 HRNet-based improved remote sensing image road extraction method, system, equipment and medium
CN117789153A (en) * 2024-02-26 2024-03-29 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision
CN117789153B (en) * 2024-02-26 2024-05-03 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision

Also Published As

Publication number Publication date
CN113436210B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN113362223B (en) Image super-resolution reconstruction method based on attention mechanism and two-channel network
CN113436210B (en) Road image segmentation method fusing context progressive sampling
CN111915592B (en) Remote sensing image cloud detection method based on deep learning
CN112163449B (en) Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN110956094A (en) RGB-D multi-mode fusion personnel detection method based on asymmetric double-current network
CN111126379A (en) Target detection method and device
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN111160205B (en) Method for uniformly detecting multiple embedded types of targets in traffic scene end-to-end
CN109344818B (en) Light field significant target detection method based on deep convolutional network
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN113313180B (en) Remote sensing image semantic segmentation method based on deep confrontation learning
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN112819000A (en) Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN112560701B (en) Face image extraction method and device and computer storage medium
CN114187520B (en) Building extraction model construction and application method
CN116343043B (en) Remote sensing image change detection method with multi-scale feature fusion function
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN115984574B (en) Image information extraction model and method based on cyclic transducer and application thereof
CN113139489A (en) Crowd counting method and system based on background extraction and multi-scale fusion network
CN113139551A (en) Improved semantic segmentation method based on deep Labv3+
CN111583265A (en) Method for realizing phishing behavior detection processing based on codec structure and corresponding semantic segmentation network system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant