CN109241872B - Image semantic fast segmentation method based on multistage network - Google Patents


Info

Publication number
CN109241872B
CN109241872B (application number CN201810947526.9A)
Authority
CN
China
Prior art keywords
layer
level network
network
level
semantic segmentation
Prior art date
Legal status
Expired - Fee Related
Application number
CN201810947526.9A
Other languages
Chinese (zh)
Other versions
CN109241872A (en)
Inventor
程建
苏炎洲
周娇
刘三元
刘畅
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201810947526.9A
Publication of CN109241872A
Application granted
Publication of CN109241872B
Status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V 20/38 Outdoor scenes
    • G06V 20/39 Urban scenes
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/32 Normalisation of the pattern dimensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image semantic fast segmentation method based on a multistage network, relating to the fields of image semantic segmentation and deep learning. The constructed multistage semantic segmentation model comprises a first-level network, a second-level network and a third-level network. The first-level network comprises dense connection blocks composed of layers and a convolutional layer; the second-level network comprises dense connection blocks composed of layers, a cascade layer and a convolutional layer; the third-level network comprises dense connection blocks composed of layers, a cascade layer and a convolutional layer. The cascade layer of the second-level network is connected to the convolutional layer of the first-level network through an upsampling layer, and the cascade layer of the third-level network is connected to the cascade layer of the second-level network through an upsampling layer. Each dense connection block of the first-level, second-level and third-level networks is also connected to an INPLACE-ABN, and the input of each layer in a dense connection block is the concatenation of the feature maps output by all previous layers. The invention solves the problem that prior-art methods cannot achieve both speed and accuracy in image semantic segmentation.

Description

Image semantic fast segmentation method based on multistage network
Technical Field
The invention relates to the field of image semantic segmentation and deep learning, in particular to a multistage network-based image semantic fast segmentation method.
Background
Image semantic segmentation is a fundamental task in computer vision: it aims to predict a label for every pixel in an image, and is therefore regarded as an important step toward a deep understanding of scenes, objects and people. In recent years, the development of deep convolutional neural networks has brought great success to image semantic segmentation.
Most of the current best image semantic segmentation methods are based on fully convolutional neural networks. These networks have produced state-of-the-art segmentation algorithms for a large number of applications, but their effectiveness depends largely on the depth and width of the model: increasing the accuracy of the results generally requires more operations and parameters, which means slower inference. In some practical applications, however, such as video analysis for autonomous driving, the speed with which pedestrians and vehicles are distinguished is critical, and semantic segmentation is often required to respond and run in real time.
Disclosure of Invention
The invention aims to design an image semantic fast segmentation method based on a multistage network, solving the problem that prior-art methods cannot achieve both segmentation speed and accuracy.
The technical scheme of the invention is as follows:
The image semantic fast segmentation method based on the multilevel network comprises the following steps:
Step 1: A training dataset is selected.
Step 2: A multi-level semantic segmentation model is constructed.
The model comprises a first-level network, a second-level network and a third-level network. The first-level network comprises dense connection blocks composed of layers and a convolutional layer; the second-level network comprises dense connection blocks composed of layers, a cascade layer and a convolutional layer; the third-level network comprises dense connection blocks composed of layers, a cascade layer and a convolutional layer. The cascade layer of the second-level network is connected to the convolutional layer of the first-level network through an upsampling layer, and the cascade layer of the third-level network is connected to the cascade layer of the second-level network through an upsampling layer. Each dense connection block of the first-level, second-level and third-level networks is also connected to an INPLACE-ABN. The input of each layer in a dense connection block is the concatenation of the feature maps output by all previous layers.
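The dense-connection rule described above, where each layer receives the channel-wise concatenation of the block input and all previous layer outputs, can be sketched as follows. This is an illustrative NumPy toy, not the patented implementation; the layer count, growth rate and the matrix-multiply stand-in for a 1 × 1 convolution are invented for the example.

```python
import numpy as np

def dense_block(x, num_layers=3, growth=12, rng=None):
    """Toy dense block: each layer sees the channel-wise concatenation of
    the block input and every previous layer's output."""
    rng = rng or np.random.default_rng(0)
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=-1)        # all previous outputs
        w = rng.standard_normal((inp.shape[-1], growth))
        out = np.maximum(inp @ w, 0.0)                 # 1x1 conv + ReLU stand-in
        features.append(out)
    # the block output is again the concatenation of everything produced
    return np.concatenate(features, axis=-1)

x = np.zeros((8, 8, 16))     # H x W x C feature map
y = dense_block(x)           # channels grow to 16 + 3 * 12 = 52
```

The channel count grows linearly with depth, which is what lets dense blocks reuse features instead of recomputing them.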
Step 3: The multilevel semantic segmentation model is trained.
Step 4: A new image is input and propagated forward through the trained multistage semantic segmentation model, and the predicted semantic segmentation result is output end to end.
Specifically, in the multistage semantic segmentation model, each layer consists of an INPLACE-ABN, a 1 × 1 convolutional layer, a 3 × 3 convolutional layer, a 1 × 1 convolutional layer and an INPLACE-ABN, connected in sequence.
Specifically, the input of the first-level network is the original image down-sampled to 1/4 scale, the input of the second-level network is the original image down-sampled to 1/2 scale, and the input of the third-level network is the original image itself.
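The three input scales can be produced by simple down-sampling. The sketch below is not from the patent: it uses average pooling as a stand-in for whatever resizing a real pipeline would apply, and the 512 × 512 input size is invented for illustration.

```python
import numpy as np

def downscale(img, factor):
    """Average-pool an H x W x C image by an integer factor
    (a stand-in for the resizing a real pipeline would use)."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    return img[:h2 * factor, :w2 * factor].reshape(
        h2, factor, w2, factor, c).mean(axis=(1, 3))

original = np.random.rand(512, 512, 3)   # input to the third-level network
half     = downscale(original, 2)        # 1/2 scale, second-level input
quarter  = downscale(original, 4)        # 1/4 scale, first-level input
```

Average pooling preserves the image mean exactly when the factor divides the dimensions, which makes the stand-in easy to sanity-check.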
Further, the specific process of step 3 is as follows:
step 3.1: images in the training dataset are pre-processed and cropped to a fixed size.
Step 3.2: The multistage semantic segmentation model is initialized.
Step 3.3: The output of the first-level network is passed through a convolutional layer and the first-level cross-entropy loss loss1 is computed against the semantic segmentation annotation image of the same scale; the output feature map of the second-level network is concatenated with the upsampled feature map output by the first-level network, the result is passed through a convolutional layer, and the second-level cross-entropy loss loss2 is computed against the annotation image of the same scale; the output feature map of the third-level network is concatenated with the upsampled output of the second-level network, the result is passed through a convolutional layer, and the third-level cross-entropy loss loss3 is computed against the annotation image of the same scale. The respective weights of loss1, loss2 and loss3 are then set.
Step 3.4: The data in the dataset are augmented by flipping, scaling and rotation.
Step 3.5: Formal training is performed on the augmented dataset: the cross-entropy losses loss1, loss2 and loss3 are weighted and summed to obtain the overall loss, error back-propagation is carried out with a stochastic gradient descent algorithm according to the overall loss, and the model parameters are updated to obtain the trained semantic segmentation model. The method of step 3 uses the deep neural network's efficient processing of low-resolution images to correct the semantic inference on the high-resolution image, accelerating the segmentation speed of the network.
After the scheme is adopted, the invention has the following beneficial effects:
the real-time image semantic segmentation method based on the multi-level network carries out multi-level network processing by adopting images with different resolutions, corrects semantic inference of a high-resolution image by utilizing efficient processing of a deep neural network on a low-resolution image, and accelerates the segmentation speed of the neural network. Meanwhile, the quantity of parameters and part of memory consumption are reduced by adopting a dense connection mode and INPLACE-ABN, the speed of the neural network is further increased, and the real-time performance is achieved.
(1) Most existing neural networks pursue high accuracy while neglecting speed, extracting features with a powerful front-end network. Such networks bring large parameter redundancy; moreover, existing feature extraction networks adopt a pre-trained model as the front end, and because pre-trained models usually contain down-sampling, spatial information is inevitably lost. If, on the other hand, the front-end network avoids down-sampling, no spatial information is lost but the semantic segmentation model becomes huge and its parameters redundant. The dense connection scheme of the invention reduces parameter redundancy to a certain extent, improves the efficiency of the neural network, and preserves spatial information.
(2) Dense connection is regarded as an efficient feature extraction scheme and achieves excellent results in classification networks. A traditional dense connection block uses only 1 × 1 and 3 × 3 convolutions; the invention adds a further 1 × 1 convolution after feature extraction, which expands the number of feature channels and reduces the loss of features in the next round of feature extraction.
(3) To further reduce computation and parameters and so increase segmentation speed, the invention places an INPLACE-ABN before the input of the 1 × 1, 3 × 3 and 1 × 1 convolution sequence, further reducing the number of parameters and making the structure suitable for fast image semantic segmentation.
(4) The invention performs fast semantic segmentation with the multi-level network, avoids down-sampling so that no spatial information is lost, and combines efficient dense connections with INPLACE-ABN (which merges the activation function and the batch normalization layer), overcoming the parameter redundancy and slow segmentation speed that a multi-level network with unchanged spatial dimensions would otherwise suffer.
(5) The network of the invention reaches 75.36% accuracy on the VOC semantic segmentation dataset, performing well at a speed comparable to that of SegNet, a well-known semantic segmentation network noted for its speed.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a block diagram of a multi-level semantic segmentation model of the present invention;
FIG. 3 is a diagram of a dense connection block of the present invention;
FIG. 4 is a schematic view of the layer structure of the present invention;
FIG. 5 is a schematic diagram of the INPLACE-ABN calculation according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and accompanying drawings.
In order to solve the problem of low image semantic segmentation speed in the prior art, the invention provides an image semantic rapid segmentation method based on a multistage network.
Hereinafter, the present invention will be described in detail with reference to the most preferred embodiments thereof.
In this embodiment, the image segmentation dataset is the Cityscapes urban street scene dataset, which contains 20 category labels (including 1 background label), covers 50 European cities, and provides 5000 finely annotated images, of which 2975 form the training set, 500 the validation set and 1525 the test set.
As shown in fig. 1, the present embodiment includes the following steps:
Step 1: The 2975 training images of the Cityscapes urban street scene dataset are selected as the training dataset. The training dataset includes training images and annotation images.
Step 2: A multi-level semantic segmentation model is constructed.
As shown in fig. 2, the model includes a first-level network, a second-level network and a third-level network. The first-level network includes dense connection blocks containing 3 layers and a convolutional layer; the second-level network includes dense connection blocks containing 3 layers, a cascade layer and a convolutional layer; the third-level network includes dense connection blocks containing 3 layers, a cascade layer and a convolutional layer. The cascade layer of the second-level network is connected to the convolutional layer of the first-level network through an upsampling layer, and the cascade layer of the third-level network is connected to the cascade layer of the second-level network through an upsampling layer. Each dense connection block of the three networks is also connected to an INPLACE-ABN, and the input of each layer in a dense connection block is the concatenation of the feature maps output by all previous layers.
Specifically, each layer consists of an INPLACE-ABN, a 1 × 1 convolutional layer, a 3 × 3 convolutional layer, a 1 × 1 convolutional layer and an INPLACE-ABN, connected in sequence. The input of the first-level network is the original image down-sampled to 1/4 scale, the input of the second-level network is the original image down-sampled to 1/2 scale, and the input of the third-level network is the original image itself.
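The layer structure just described (INPLACE-ABN, 1 × 1 convolution, 3 × 3 convolution, 1 × 1 convolution, INPLACE-ABN) can be sketched as below. This is a hypothetical NumPy illustration, not the patented implementation: `abn` approximates INPLACE-ABN by plain per-channel normalization plus leaky ReLU, and the channel sizes are invented.

```python
import numpy as np

def conv2d(x, w):
    """Naive same-padded 2-D convolution; x is H x W x Cin, w is k x k x Cin x Cout."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    h, wd, _ = x.shape
    out = np.empty((h, wd, w.shape[-1]))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k], w, axes=3)
    return out

def abn(x):
    """Stand-in for INPLACE-ABN: per-channel normalization + leaky ReLU."""
    x = (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-5)
    return np.where(x >= 0, x, 0.01 * x)

def layer(x, c_mid=8, c_out=12, rng=np.random.default_rng(0)):
    """ABN -> 1x1 conv -> 3x3 conv -> 1x1 conv -> ABN, mirroring the layer order."""
    cin = x.shape[-1]
    x = abn(x)
    x = conv2d(x, rng.standard_normal((1, 1, cin, c_mid)))
    x = conv2d(x, rng.standard_normal((3, 3, c_mid, c_mid)))
    x = conv2d(x, rng.standard_normal((1, 1, c_mid, c_out)))
    return abn(x)

y = layer(np.random.rand(6, 6, 4))   # 6 x 6 x 12 output feature map
```

The trailing 1 × 1 convolution is what expands the channel count, matching the design rationale given in the advantages section.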
For the INPLACE-ABN, whose calculation diagram is shown in fig. 5: when a batch normalization layer (BN), activation layer and convolution layer (Conv) are processed in the forward pass, two large buffers normally need to be stored, namely the input x of BN and the input z of Conv, because the standard backward passes of BN and Conv depend on their inputs to compute gradients. Replacing the BN-activation sequence with INPLACE-ABN allows x to be safely discarded, saving up to 50% of GPU memory during training. To achieve this, the backward pass of BN is rewritten in terms of its output y, which is in turn reconstructed from z by inverting the activation function.
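The key property exploited here, recovering a discarded buffer by inverting the activation, can be illustrated with an invertible leaky ReLU. This is a sketch of the idea, not the library's implementation, and the 0.01 slope is an assumption:

```python
import numpy as np

# Leaky ReLU is invertible, so during the backward pass the pre-activation x
# can be recovered from the stored output z instead of keeping x in memory.
def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)

def leaky_relu_inverse(z, slope=0.01):
    return np.where(z >= 0, z, z / slope)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
z = leaky_relu(x)                 # what gets stored
x_recovered = leaky_relu_inverse(z)   # x reconstructed on demand
```

A non-invertible activation such as plain ReLU would not allow this reconstruction, which is why an invertible variant matters for the memory saving.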
Step 3: The multilevel semantic segmentation model is trained; the specific process is as follows:
step 3.1: preprocessing images in the training data set, and cutting the images into fixed sizes 513 multiplied by 513;
step 3.2: initializing the parameters of a segmentation network model by adopting an Xavier method;
Step 3.3: The output of the first-level network is passed through a convolutional layer and the first-level cross-entropy loss loss1 is computed against the semantic segmentation annotation image of the same scale; the output feature map of the second-level network is concatenated with the upsampled feature map output by the first-level network, the result is passed through a convolutional layer, and the second-level cross-entropy loss loss2 is computed against the annotation image of the same scale; the output feature map of the third-level network is concatenated with the upsampled output of the second-level network, the result is passed through a convolutional layer, and the third-level cross-entropy loss loss3 is computed against the annotation image of the same scale. The weights of loss1, loss2 and loss3 are set to 0.5, 0.3 and 0.2, respectively.
Step 3.4: The data in the dataset are augmented by flipping, scaling and rotation; specifically, images are randomly flipped, randomly scaled between 0.5 and 2 times, and randomly rotated between -10 and 10 degrees.
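A minimal sketch of the augmentation described above (random flip, scale in [0.5, 2], rotation in [-10, 10] degrees). The helper names are invented, and only the flip is actually applied here, since rotation and rescaling would need an interpolation routine:

```python
import numpy as np

def sample_augmentation(rng):
    """Sample one set of augmentation parameters as described above."""
    return {
        "flip": bool(rng.random() < 0.5),      # random horizontal flip
        "scale": rng.uniform(0.5, 2.0),        # random scale factor in [0.5, 2]
        "angle": rng.uniform(-10.0, 10.0),     # random rotation in degrees
    }

def apply_flip(image, label, flip):
    # The same flip must be applied to the image and its annotation,
    # otherwise pixels and labels fall out of alignment.
    if flip:
        return image[:, ::-1], label[:, ::-1]
    return image, label

rng = np.random.default_rng(42)
params = sample_augmentation(rng)
img = np.arange(12.0).reshape(2, 6)
lbl = np.arange(12).reshape(2, 6)
img2, lbl2 = apply_flip(img, lbl, True)
```

Sampling the parameters once per image and applying them jointly to image and annotation is the essential point; the geometric transforms themselves are library calls in practice.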
Step 3.5: Formal training is performed on the augmented dataset. The cross-entropy losses loss1, loss2 and loss3 are weighted and summed to obtain the overall loss, i.e.

loss = 0.5 · loss1 + 0.3 · loss2 + 0.2 · loss3

A polynomial learning-rate policy is used, with the learning rate lr set to

lr = baselr · (1 - iter / max_iter)^power

where baselr is the initial learning rate, here set to 0.001, and power is set to 0.9.
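The polynomial learning-rate policy can be written directly; `max_iter` below is an illustrative value, as the iteration count is not stated here:

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """Polynomial ('poly') learning-rate decay: base_lr * (1 - it/max_iter)^power."""
    return base_lr * (1 - it / max_iter) ** power

lr0 = poly_lr(0.001, 0, 30000)          # full base rate at the start
lr_end = poly_lr(0.001, 30000, 30000)   # decays to zero at the last iteration
```

With power = 0.9 the rate decays almost linearly, falling off slightly faster near the end of training.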
Error back-propagation is then carried out with a stochastic gradient descent algorithm according to the overall loss, and the model parameters are updated to obtain the trained semantic segmentation model.
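The weighted overall loss used in training can be sketched as follows. The pixel-wise cross-entropy helper and the toy tensor sizes are invented for illustration, while the 0.5/0.3/0.2 weights come from the embodiment:

```python
import numpy as np

def pixel_cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy; probs is H x W x K class probabilities,
    labels is H x W integer class indices."""
    h, w, _ = probs.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(picked + 1e-12).mean()

rng = np.random.default_rng(0)
scores = rng.random((4, 4, 5))
probs = scores / scores.sum(axis=-1, keepdims=True)   # normalize to probabilities
labels = rng.integers(0, 5, size=(4, 4))

# in the real model these come from the three network levels
loss1 = loss2 = loss3 = pixel_cross_entropy(probs, labels)
total = 0.5 * loss1 + 0.3 * loss2 + 0.2 * loss3       # weights from the embodiment
```

Because the weights sum to 1, the overall loss stays on the same scale as each per-level loss, which keeps a single learning rate workable for all three supervision signals.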
Step 4: A new image is input and propagated forward through the trained multistage semantic segmentation model, and the predicted semantic segmentation result is output end to end.
The most critical step of the invention is the construction of the multistage semantic segmentation model. The dense connection block concatenates the output feature maps of its three layers, i.e. the input of each layer is the concatenation of the feature maps output by all previous layers, and each layer consists in sequence of an INPLACE-ABN, convolutional layers and an INPLACE-ABN. The dense connection block uses feature maps efficiently, greatly reduces the number of parameters, makes the neural network easy to train and makes computation more efficient. The INPLACE-ABN greatly reduces the memory consumption of the neural network: it replaces the usual combination of a batch normalization layer and an activation layer with a single merged layer, and by storing only a small amount of results (discarding part of the intermediate results and inverting the computation to recover the required values during back-propagation) it saves up to 50% of memory consumption.
Experiments show that the network of the invention reaches 75.36% accuracy on the VOC semantic segmentation dataset, performing well at a speed comparable to that of SegNet, a well-known semantic segmentation network noted for its speed.
All the technical variants made according to the technical solution of the present invention fall within the scope of protection of the present invention.

Claims (3)

1. An image semantic fast segmentation method based on a multilevel network, characterized by comprising the following steps:
step 1: selecting a training data set;
step 2: constructing a multi-level semantic segmentation model;
the model comprises a first-level network, a second-level network and a third-level network, wherein the first-level network comprises dense connection blocks composed of layers and a convolutional layer, the second-level network comprises dense connection blocks composed of layers, a cascade layer and a convolutional layer, and the third-level network comprises dense connection blocks composed of layers, a cascade layer and a convolutional layer; the cascade layer of the second-level network is connected to the convolutional layer of the first-level network through an upsampling layer, and the cascade layer of the third-level network is connected to the cascade layer of the second-level network through an upsampling layer; each dense connection block of the first-level, second-level and third-level networks is also connected to an INPLACE-ABN; the input of each layer in a dense connection block is the concatenation of the feature maps output by all previous layers;
step 3: training the multilevel semantic segmentation model;
step 4: inputting a new image, performing forward propagation in the trained multistage semantic segmentation model, and outputting the predicted semantic segmentation result end to end;
the specific process of the step 3 is as follows:
step 3.1: pre-processing the images in the training dataset and cropping them to a fixed size;
step 3.2: initializing a multi-level semantic segmentation model;
step 3.3: passing the output of the first-level network through a convolutional layer and computing the first-level cross-entropy loss loss1 against the semantic segmentation annotation image of the same scale; concatenating the output feature map of the second-level network with the upsampled feature map output by the first-level network, passing the result through a convolutional layer, and computing the second-level cross-entropy loss loss2 against the annotation image of the same scale; concatenating the output feature map of the third-level network with the upsampled output of the second-level network, passing the result through a convolutional layer, and computing the third-level cross-entropy loss loss3 against the annotation image of the same scale; setting the respective weights of loss1, loss2 and loss3;
step 3.4: augmenting the data in the dataset by flipping, scaling and rotation;
step 3.5: performing formal training on the augmented dataset: weighting and summing the cross-entropy losses loss1, loss2 and loss3 to obtain the overall loss, carrying out error back-propagation with a stochastic gradient descent algorithm according to the overall loss, and updating the model parameters to obtain the trained semantic segmentation model.
2. The image semantic fast segmentation method based on a multilevel network according to claim 1, wherein in the multilevel semantic segmentation model each layer consists of an INPLACE-ABN, a 1 × 1 convolutional layer, a 3 × 3 convolutional layer, a 1 × 1 convolutional layer and an INPLACE-ABN, connected in sequence.
3. The method for image semantic fast segmentation based on multi-level network as claimed in claim 1, wherein the input of the first level network is a down-sampled image obtained by scaling the original image to 1/4, the input of the second level network is a down-sampled image obtained by scaling the original image to 1/2, and the input of the third level network is the original image.
CN201810947526.9A 2018-08-20 2018-08-20 Image semantic fast segmentation method based on multistage network Expired - Fee Related CN109241872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810947526.9A CN109241872B (en) 2018-08-20 2018-08-20 Image semantic fast segmentation method based on multistage network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810947526.9A CN109241872B (en) 2018-08-20 2018-08-20 Image semantic fast segmentation method based on multistage network

Publications (2)

Publication Number Publication Date
CN109241872A CN109241872A (en) 2019-01-18
CN109241872B (en) 2022-03-18

Family

ID=65071076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810947526.9A Expired - Fee Related CN109241872B (en) 2018-08-20 2018-08-20 Image semantic fast segmentation method based on multistage network

Country Status (1)

Country Link
CN (1) CN109241872B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969182A (en) * 2019-05-17 2020-04-07 丰疆智能科技股份有限公司 Convolutional neural network construction method and system based on farmland image
CN111275721B (en) * 2020-02-14 2021-06-08 推想医疗科技股份有限公司 Image segmentation method and device, electronic equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296699A (en) * 2016-08-16 2017-01-04 电子科技大学 Cerebral tumor dividing method based on deep neural network and multi-modal MRI image
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
CN106886801A (en) * 2017-04-14 2017-06-23 北京图森未来科技有限公司 A kind of image, semantic dividing method and device
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107480726A (en) * 2017-08-25 2017-12-15 电子科技大学 A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
CN107527031A (en) * 2017-08-22 2017-12-29 电子科技大学 A kind of indoor objects detection method based on SSD
CN107610141A (en) * 2017-09-05 2018-01-19 华南理工大学 A kind of remote sensing images semantic segmentation method based on deep learning
CN107644426A (en) * 2017-10-12 2018-01-30 中国科学技术大学 Image, semantic dividing method based on pyramid pond encoding and decoding structure
CN108230346A (en) * 2017-03-30 2018-06-29 北京市商汤科技开发有限公司 For dividing the method and apparatus of image semantic feature, electronic equipment
CN108256527A (en) * 2018-01-23 2018-07-06 深圳市唯特视科技有限公司 A kind of cutaneous lesions multiclass semantic segmentation method based on end-to-end full convolutional network
CN108268870A (en) * 2018-01-29 2018-07-10 重庆理工大学 Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11048997B2 (en) * 2016-12-27 2021-06-29 Texas Instruments Incorporated Reduced complexity convolution for convolutional neural networks


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gate function based structure-aware convolution for scene semantic segmentation;Zhou Cheng等;《2017 IEEE International Conference on Multimedia and Expo (ICME)》;20170714;第253-258页 *
Improved automatic image segmentation method based on convolutional neural networks; Wen Peizhi et al.; Application Research of Computers; 20170901; pp. 2848-2852 *

Also Published As

Publication number Publication date
CN109241872A (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN112330681B (en) Attention mechanism-based lightweight network real-time semantic segmentation method
Zhu et al. Asymmetric non-local neural networks for semantic segmentation
CN110322495B (en) Scene text segmentation method based on weak supervised deep learning
CN113888744A (en) Image semantic segmentation method based on Transformer visual upsampling module
CN110276354B (en) High-resolution streetscape picture semantic segmentation training and real-time segmentation method
CN109902748A (en) A kind of image, semantic dividing method based on the full convolutional neural networks of fusion of multi-layer information
CN111091130A (en) Real-time image semantic segmentation method and system based on lightweight convolutional neural network
CN111062395B (en) Real-time video semantic segmentation method
CN111860233B (en) SAR image complex building extraction method and system based on attention network selection
CN113011336B (en) Real-time street view image semantic segmentation method based on deep multi-branch aggregation
CN109241872B (en) Image semantic fast segmentation method based on multistage network
CN110929735B (en) Rapid significance detection method based on multi-scale feature attention mechanism
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN111666948A (en) Real-time high-performance semantic segmentation method and device based on multi-path aggregation
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN113240683A (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN115082675A (en) Transparent object image segmentation method and system
CN117726954B (en) Sea-land segmentation method and system for remote sensing image
Luo et al. Multi-scale receptive field fusion network for lightweight image super-resolution
CN116612283A (en) Image semantic segmentation method based on large convolution kernel backbone network
CN115908793A (en) Coding and decoding structure semantic segmentation model based on position attention mechanism
CN113888505A (en) Natural scene text detection method based on semantic segmentation
CN113436210A (en) Road image segmentation method fusing context progressive sampling
CN116434039B (en) Target detection method based on multiscale split attention mechanism
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220318