CN112861727A - Real-time semantic segmentation method based on mixed depth separable convolution - Google Patents
- Publication number
- CN112861727A CN112861727A CN202110179063.8A CN202110179063A CN112861727A CN 112861727 A CN112861727 A CN 112861727A CN 202110179063 A CN202110179063 A CN 202110179063A CN 112861727 A CN112861727 A CN 112861727A
- Authority
- CN
- China
- Prior art keywords
- depth separable
- separable convolution
- mixed
- semantic segmentation
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4007—Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4046—Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention discloses a real-time semantic segmentation method based on mixed depth separable convolution that achieves a good balance between prediction efficiency and prediction accuracy. The method comprises the following steps: data preprocessing and data enhancement; designing a mixed depth separable convolution unit to improve multi-scale feature expression; constructing a mixed depth separable convolution module; building a mixed depth separable convolution semantic segmentation network to extract high-level semantic features of the image; and training and verifying the network. The method offers high prediction accuracy with a small number of model parameters, making it fast and lightweight.
Description
Technical Field
The invention belongs to the field of real-time image semantic segmentation in computer vision and relates to a method for performing semantic segmentation with a convolutional neural network.
Background
In computer vision, semantic segmentation of images is a key task and an active research topic. Its purpose is to partition a scene picture taken by a camera into a series of disjoint image regions, assigning each pixel a particular class; these classes typically include countable objects (e.g., bicycles, cars, people) and uncountable ones (e.g., sky, roads, grass). With the rise of deep learning, image semantic segmentation has paved the way for scene analysis and plays a significant role in autonomous driving, augmented reality, and video surveillance.
However, existing semantic segmentation methods mainly aim at improving accuracy. To raise segmentation accuracy, the feature encoder usually adopts a relatively complex backbone network, and the semantic decoder adopts compute-intensive structures such as spatial pyramid pooling, fusion of shallow and deep features, and dilated convolutions with different expansion rates to enlarge the receptive field. As a result, the network model is large and scene segmentation is slow. Practical applications, however, require both high segmentation accuracy and real-time performance. Some real-time semantic segmentation algorithms (such as ICNet) are fast, but, limited by a local receptive-field mechanism, they lack sufficient understanding of targets of different scales in a scene image and struggle to capture long-range dependencies among features, so their accuracy is low. It is therefore necessary to break through the limitation of local receptive fields and fuse multi-scale image features for prediction, improving the accuracy of scene segmentation while remaining real-time.
To solve these problems, the invention provides a real-time semantic segmentation method based on mixed depth separable convolution. It mixes several convolution kernels of different sizes into a single depth separable convolution unit, designing parallel convolutions of different sizes to enhance the expression of multi-scale target features. Feature rearrangement then removes the mutual independence of information across channels, breaking the limitation of the local receptive field and improving the ability to extract features at different scales, thereby improving the quality of the final feature map. The proposed method achieves a good balance between segmentation efficiency and segmentation accuracy.
Disclosure of Invention
Aiming at the problem that model segmentation efficiency and segmentation accuracy cannot be balanced in image semantic segmentation, the invention designs a real-time semantic segmentation method based on mixed depth separable convolution, so that prediction efficiency and prediction accuracy are well balanced, further promoting the practical application of semantic segmentation. The overall scheme of the method is shown in FIG. 1.
A real-time semantic segmentation method based on mixed depth separable convolution comprises the following steps:
(1) Data preprocessing and data enhancement: 3475 pictures are selected from the public urban landscape (Cityscapes) dataset, of which 2975 form the training set and 500 form the verification set. The training set is used to train the network model, and the verification set is used to select the best training result. Categories suited to the practical application are selected from the labeling information, and unsuitable categories are removed. The data are normalized and expanded with 6 different data enhancement methods: random crop-and-scale, random horizontal flipping, random brightness, random saturation, random contrast, and Gaussian blur.
(2) Designing a mixed depth separable convolution unit: to break through the local receptive-field limitation of ordinary convolution, improve the multi-scale feature expression of the image data, and meet the requirement of real-time prediction, the mixed depth separable convolution of the invention is shown in fig. 2. First, the feature map to be processed is evenly divided into 4 groups by channel, and the edges of the 4 groups of feature maps are zero-padded; second, odd-sized convolution kernels of different sizes extract features from the 4 groups respectively; the 4 groups of features are then concatenated by channel; finally, the feature maps are alternately rearranged to remove the mutual independence of information across channels.
(3) Constructing a mixed depth separable convolution module: the module, shown in fig. 3, is formed by stacking 4 mixed depth separable convolution units, 4 normalization layers, and 4 nonlinear activation layers in series. Three skip connections are used within the module, and 1 × 1 convolution residual connections between modules alleviate the vanishing-gradient problem.
(4) Building the mixed depth separable convolution semantic segmentation network: the network structure is shown in fig. 4. The backbone comprises mixed depth separable convolutions, normalization layers, and nonlinear activation layers, and is formed by connecting 4 mixed depth separable convolution modules in series; each convolution block produces a feature map. To obtain a higher-resolution feature map, the downsampling operation is removed from the last 2 modules, leaving the final feature map at 1/8 the size of the input image and thus preserving more detail.
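The resulting 1/8 output resolution follows directly from the stride schedule just described. A minimal sketch (not from the patent; the function name is invented) traces it, assuming a stride-2 stem convolution followed by modules with strides 2, 2, 1, 1:

```python
# Hedged sketch: multiply the per-stage strides of the backbone described
# above to obtain the total spatial downsampling factor.
def downsample_factor(strides):
    """Product of per-stage strides = total downsampling of the feature map."""
    factor = 1
    for s in strides:
        factor *= s
    return factor

# Stem 7x7 conv (stride 2) + 4 mixed modules with strides 2, 2, 1, 1.
stages = [2, 2, 2, 1, 1]
print(downsample_factor(stages))  # 8 -> final feature map is 1/8 of the input
```

With downsampling kept in all 4 modules the factor would be 32; removing it from the last 2 is what preserves the extra detail.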
(5) Training and verifying the semantic segmentation network of mixed depth separable convolution: the processed pictures are fed into the designed network; a predicted segmentation result is output through the network's forward computation, and the loss value is computed with a cross-entropy loss function against the corresponding pixel-level labels. A stochastic gradient descent optimizer with momentum is used, with momentum coefficient 0.9 and weight decay 0.00005; the learning rate rises from 0 to 0.003 over 5 warm-up epochs. Training iterates until the cross-entropy loss converges, and performance is verified on the verification set.
The invention designs a real-time semantic segmentation method with mixed depth separable convolution. On the urban landscape verification set, the backbone network built with mixed depth separable convolution offers high prediction accuracy with a small parameter count, making it fast and lightweight, and achieves a good balance between prediction efficiency and prediction accuracy. The method enables efficient scene perception and further promotes applications in fields such as autonomous driving, augmented reality, and video surveillance.
Drawings
FIG. 1 is a schematic overall view of the process of the present invention.
FIG. 2 is a schematic diagram of a hybrid depth separable convolution unit according to the present invention.
Fig. 3 is a block diagram of a hybrid depth separable convolution module according to the present invention.
Fig. 4 is a diagram of a semantic segmentation network structure based on mixed depth separable convolution according to the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings in which:
the invention discloses a real-time semantic segmentation method based on mixed depth separable convolution, which comprises the following specific processes as shown in figure 1: firstly, 3475 pictures of an open urban landscape data set are selected, and preprocessing and data enhancement are carried out on the data; then, in order to improve the multi-scale feature expression in the image data and meet the requirements of real-time prediction, a mixed depth separable convolution unit is designed; then a mixed depth separable convolution module is constructed, so that the multi-scale feature expression capability is further improved, and the aliasing effect of the multi-scale features is eliminated; then building a semantic segmentation network with mixed depth separable convolution; and finally, training and verifying a semantic segmentation network with mixed depth separable convolution, inputting the picture processed in the first step into the network for training, performing performance verification through forward reasoning of the network, and outputting a predicted segmentation result.
The invention achieves good segmentation performance on the public dataset; the specific implementation comprises the following steps:
the first step is as follows: data pre-processing and data enhancement
(1) Data preparation. 3475 pictures of the public urban landscape dataset are taken, of which 2975 come from the training set and the remaining 500 from the verification set. The pictures, which contain the real traffic scenes of the application, were captured from driving scenes in 50 different cities and carry high-quality pixel-level labeling information. Categories suited to the actual application are selected from the labeling information; unsuitable categories are removed and set as ignored categories.
(2) Normalization and data enhancement. First, the RGB images are normalized to remove the possible adverse effect of singular sample data. Second, to improve the segmentation accuracy and generalization ability of the model, 6 different data enhancement methods are applied to the prepared data: random crop-and-scale, random horizontal flipping, random brightness, random saturation, random contrast, and Gaussian blur.
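The normalization step can be sketched per pixel as below. This is a hedged illustration, not the patent's code: the patent does not give its mean/std values, so the widely used ImageNet statistics are assumed.

```python
# Hedged sketch: per-channel RGB standardization, as commonly applied before
# training. MEAN/STD are assumed (ImageNet statistics), not from the patent.
MEAN = (0.485, 0.456, 0.406)  # assumed per-channel means
STD = (0.229, 0.224, 0.225)   # assumed per-channel standard deviations

def normalize_pixel(rgb):
    """Scale 0-255 RGB values to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))

x = normalize_pixel((124, 116, 104))
print([round(v, 3) for v in x])  # [0.006, -0.005, 0.008]
```

Standardizing this way centers each channel near zero, which is what removes the "singular sample" effect mentioned above.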
The second step is that: designing hybrid depth separable convolution elements
Since the selected images contain many objects of different sizes, a single convolution map often lacks sufficient understanding of multi-scale features. A mixed depth separable convolution unit is therefore designed. As shown in fig. 2, it first divides the feature map to be processed into 4 groups by channel; the 4 groups are fed in parallel to odd-sized convolution kernels of different sizes, namely 3 × 3, 5 × 5, 7 × 7, and 9 × 9 depth separable convolutions, to extract multi-scale feature maps. To align the sizes of the output feature maps, the edges of the 4 input groups are zero-padded before convolution. The 4 groups of feature maps are then concatenated along the channel dimension. Because depth separable convolution is a form of group convolution, information in different channels easily remains mutually independent, lacking feature fusion between groups and channels. After concatenation, feature rearrangement therefore shuffles the 4 groups of feature maps, interleaving their channels in turn; this removes the mutual independence of channel information, improves feature extraction, and strengthens the network's ability to extract features of targets at different scales.
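The split-concatenate-rearrange flow above can be sketched on channel indices alone. This is a hedged illustration of the channel bookkeeping, not the patent's code, and the helper names are invented:

```python
# Hedged sketch: split the channel axis into 4 groups (each group would pass
# through a different depthwise kernel), then interleave the groups so that
# information mixes across them after concatenation.
def split_channels(channels, groups=4):
    """Evenly split a list of channel indices into `groups` groups."""
    size = len(channels) // groups
    return [channels[i * size:(i + 1) * size] for i in range(groups)]

def interleave(groups):
    """Alternate rearrangement: take one channel from each group in turn."""
    return [g[i] for i in range(len(groups[0])) for g in groups]

chans = list(range(8))       # 8 channels -> 4 groups of 2
g = split_channels(chans)    # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(interleave(g))         # [0, 2, 4, 6, 1, 3, 5, 7]
```

After interleaving, adjacent channels come from different groups, so a subsequent convolution sees features produced by all four kernel sizes.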
The third step: building mixed depth separable convolution modules
The mixed depth separable convolution module constructed by the invention is shown in fig. 3. Building on the second step, it is formed by stacking 4 mixed depth separable convolution units, 4 normalization layers, and 4 nonlinear activation layers. Specifically, a mixed depth separable convolution unit, a normalization layer, and a nonlinear activation layer are stacked in series 4 times in sequence. The feature maps output by the first, second, and third units have the same channel count and the same size; the fourth unit outputs 2 times as many feature channels as its input, with output feature maps one half the size of its input. A residual structure with 3 skip connections alleviates the vanishing-gradient problem. In the first mixed depth separable convolution unit, the 4 groups use 3 × 3, 5 × 5, 7 × 7, 9 × 9 depth separable convolution kernels from left to right; the second unit uses 5 × 5, 7 × 7, 9 × 9, 3 × 3; the third unit uses 7 × 7, 9 × 9, 3 × 3, 5 × 5; and the fourth unit uses 9 × 9, 3 × 3, 5 × 5, 7 × 7. Using different orders eliminates the aliasing effect of the multi-scale features and further improves their expression.
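The four kernel-size orders listed above are left rotations of the base order (3, 5, 7, 9), which a short sketch (hedged, not the patent's code) can generate:

```python
# Hedged sketch: each of the 4 units in a module shifts the base kernel-size
# order (3, 5, 7, 9) left by its index, matching the orders described above.
def kernel_orders(base=(3, 5, 7, 9), n_units=4):
    """Rotate the kernel sizes so each unit assigns them to groups differently."""
    return [base[i:] + base[:i] for i in range(n_units)]

for order in kernel_orders():
    print(order)
# (3, 5, 7, 9) / (5, 7, 9, 3) / (7, 9, 3, 5) / (9, 3, 5, 7)
```

Over the 4 stacked units, every channel group is thus processed by every kernel size exactly once, which is what counteracts the aliasing of multi-scale features.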
The fourth step: building semantic segmentation network with mixed depth separable convolution
To achieve a good balance between prediction efficiency and prediction accuracy, the invention builds the semantic segmentation network of mixed depth separable convolution on the above basis. As shown in fig. 4, the backbone responsible for feature extraction is formed by connecting 18 convolutional layers in series. Layer 1 is a 7 × 7 convolution with 3 input channels, 64 output channels, and stride 2, followed by a normalization layer and a nonlinear activation layer. Layers 2 to 5 form the first mixed depth separable convolution module (64 input channels, 128 output channels): layers 2 to 4 are mixed depth separable convolution units with 64 input and 64 output channels and stride 1, and layer 5 is a unit with 64 input channels, 128 output channels, and stride 2. Layers 6 to 9 form the second module (128 input channels, 256 output channels): layers 6 to 8 are units with 128 input and 128 output channels and stride 1, and layer 9 is a unit with 128 input channels, 256 output channels, and stride 2.
Layers 10 to 13 form the third module (256 input channels, 512 output channels): layers 10 to 12 are units with 256 input and 256 output channels and stride 1, and layer 13 is a unit with 256 input channels, 512 output channels, and stride 1, preserving the feature resolution. Layers 14 to 17 form the fourth module (512 input channels, 1024 output channels): layers 14 to 16 are units with 512 input and 512 output channels and stride 1, and layer 17 is a unit with 512 input channels, 1024 output channels, and stride 1, preserving the resolution of the feature maps. Layer 18, a 3 × 3 convolution with stride 1, has 1024 input channels and as many output channels as the dataset has label classes, and serves as the final classification layer. Each mixed depth separable convolution module uses a residual structure with 3 skip connections to mitigate vanishing gradients, and 1 × 1 convolutions connect the modules, improving training efficiency and avoiding the vanishing-gradient problem. Finally, the output of layer 18 passes through an upsampling layer that enlarges the feature map 8 times using bilinear interpolation. The bilinear interpolation function is shown in equation (1).
f(u+i,v+j)=(1-i)(1-j)f(u,v)+i(1-j)f(u+1,v)+(1-i)jf(u,v+1)+ijf(u+1,v+1) (1)
Here f(u, v) is the pixel value of the feature map at position (u, v), and the pixel coordinate of the bilinear interpolation result is (u + i, v + j), where u and v are the integer parts of the floating-point coordinates and i and j are the fractional parts. The interpolated pixel f(u + i, v + j) is obtained by proportionally weighting the pixel values f(u, v), f(u, v + 1), f(u + 1, v), and f(u + 1, v + 1) of the four neighbouring positions.
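Equation (1) transcribes directly into code; the sketch below (function name invented) evaluates it on a tiny grid:

```python
# Direct transcription of equation (1): bilinear interpolation of the pixel at
# floating-point position (u+i, v+j) from its four integer-grid neighbours.
def bilinear(f, u, v, i, j):
    """f is a 2-D grid; (u, v) the integer part, (i, j) the fractional part."""
    return ((1 - i) * (1 - j) * f[u][v]
            + i * (1 - j) * f[u + 1][v]
            + (1 - i) * j * f[u][v + 1]
            + i * j * f[u + 1][v + 1])

grid = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear(grid, 0, 0, 0.5, 0.5))  # 1.5, the average of the four corners
```

At i = j = 0 the formula reduces to f(u, v), and at the grid centre it averages the four neighbours, as expected of bilinear weighting.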
The fifth step: training and validating a semantic segmentation network for mixed-depth separable convolution
(1) Initialize the parameters of the mixed depth separable convolution semantic segmentation network, including the parameters of each convolutional layer and of the normalization layers. Random initialization uses the Xavier method.
(2) Configure the network hyperparameters. The initial learning rate is set to 0.003 with a warm-up strategy: the learning rate rises from 0 to the initial value over 5 warm-up epochs. A stochastic gradient descent optimizer with momentum is used, with momentum coefficient 0.9 and weight decay 0.00005. The cross-entropy loss function is used, with the loss weight coefficient defaulting to 1.
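A minimal sketch of the warm-up schedule described above. The patent only says the rate rises from 0 to 0.003 over 5 warm-up epochs; the linear ramp and the function name are assumptions:

```python
# Hedged sketch: linear learning-rate warm-up to the initial rate 0.003 over
# 5 epochs (linear shape assumed), then a constant base rate.
def warmup_lr(epoch, base_lr=0.003, warmup_epochs=5):
    """Linear warm-up for the first `warmup_epochs`, then the base rate."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

print([round(warmup_lr(e), 5) for e in range(7)])
# [0.0006, 0.0012, 0.0018, 0.0024, 0.003, 0.003, 0.003]
```

In practice a decay schedule would typically follow the warm-up, but the patent does not specify one, so none is shown.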
(3) Model training. The training pictures processed in the first step are input into the network; the backbone's forward computation outputs a predicted segmentation result, and the loss value is computed with the cross-entropy loss function against the pictures' pixel-level labels. Backpropagating the loss yields the gradients of each layer's parameters, which are then updated by stochastic gradient descent with momentum. Iteration continues until the cross-entropy loss fluctuates around a fixed value, at which point the model is considered converged.
The cross entropy loss function is shown in equation (2):
Loss = −[y_GT · log(y_pred) + (1 − y_GT) · log(1 − y_pred)] (2)
where y_GT denotes the manually labeled class, y_pred is the probability the network predicts for the sample class, and Loss is the resulting cross-entropy value.
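A hedged sketch of the per-pixel binary form of equation (2); in practice the loss is averaged over all pixels and extended over all classes, and the clamping constant is an implementation detail assumed here:

```python
# Sketch of equation (2): binary cross-entropy between a 0/1 ground-truth
# label and a predicted probability.
import math

def cross_entropy(y_gt, y_pred, eps=1e-12):
    """Return -[y*log(p) + (1-y)*log(1-p)], clamping p to avoid log(0)."""
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_gt * math.log(y_pred) + (1 - y_gt) * math.log(1 - y_pred))

print(round(cross_entropy(1, 0.9), 4))  # 0.1054 -> small loss, good prediction
```

The loss vanishes as the predicted probability approaches the true label and grows without bound as it approaches the wrong one, which is the gradient signal driving training.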
(4) Verify the segmentation performance of the model. To verify the segmentation performance of the real-time semantic segmentation method based on mixed depth separable convolution, a comparison experiment was designed: semantic segmentation networks were built with ordinary 3 × 3 convolutions and with mixed depth separable convolution units, respectively, trained iteratively, and tested on the urban landscape verification set. The experimental results show that, compared with ordinary convolution, the method reduces the parameter count by about three fifths and the computation by about three quarters, meeting real-time requirements while keeping high prediction accuracy. The mean intersection-over-union (MIoU) of the backbone built with mixed depth separable convolution is 73.48%, giving high prediction accuracy with a small, fast, and lightweight model, a 1.78% advantage over the 71.70% MIoU of the classic real-time semantic segmentation network ICNet. The evaluation index MIoU is shown in equation (3):
MIoU = (1/k) Σᵢ TPᵢ / (TPᵢ + FPᵢ + FNᵢ) (3)
where TP denotes true positives, FP false positives, FN false negatives, and k the number of classes.
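A hedged sketch of the MIoU metric of equation (3), computed from per-class confusion counts (the function name and example counts are illustrative, not from the patent):

```python
# Sketch of equation (3): per-class IoU = TP / (TP + FP + FN), averaged over
# the k classes.
def mean_iou(counts):
    """counts: list of (tp, fp, fn) tuples, one tuple per class."""
    ious = [tp / (tp + fp + fn) for tp, fp, fn in counts]
    return sum(ious) / len(ious)

# Two illustrative classes: a well-segmented one and a harder one.
print(round(mean_iou([(90, 5, 5), (60, 20, 20)]), 3))  # 0.75
```

Because each class contributes equally regardless of pixel count, MIoU rewards consistent performance on rare classes, which is why it is the standard segmentation benchmark metric.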
Claims (6)
1. A real-time semantic segmentation method based on mixed depth separable convolution is characterized by comprising the following steps:
step S1, selecting 3475 pictures from the public urban landscape dataset, selecting suitable categories and setting unsuitable categories as ignored categories, and preprocessing and enhancing the selected data;
step S2, designing a mixed depth separable convolution unit to meet the multi-scale feature expression and real-time prediction requirements of the image data of step S1; the unit first evenly divides the input feature map into 4 groups, extracts features with 4 groups of depth separable convolution kernels of different sizes, then concatenates the feature maps, and finally alternately rearranges them;
step S3, constructing a mixed depth separable convolution module on the basis of step S2 by stacking in series 4 mixed depth separable convolution units with different convolution-kernel orders, eliminating the multi-scale aliasing effect;
step S4, building the mixed depth separable convolution semantic segmentation network on the basis of step S3, the network being formed by stacking mixed depth separable convolution modules in series 4 times, with 3 skip connections used within each module and 1 × 1 convolution residual connections between modules;
step S5, training and verifying the mixed depth separable convolution semantic segmentation network: the pictures processed in step S1 are input into the network for training, performance is verified through the network's forward inference, and the predicted segmentation result is output.
2. The real-time semantic segmentation method based on mixed depth separable convolution according to claim 1, characterized in that, in the mixed depth separable convolution unit designed in step S2: the feature map to be processed is evenly divided into 4 groups by channel, and the edges of the 4 groups of feature maps are zero-padded to align the resolutions of the 4 output feature maps; the 4 groups are subjected to feature extraction with odd-sized depth separable convolution kernels of different sizes, namely 3 × 3, 5 × 5, 7 × 7, and 9 × 9; the resulting 4 groups of features of equal resolution are concatenated by channel; finally, the feature maps are alternately rearranged.
3. The real-time semantic segmentation method based on mixed depth separable convolution according to claim 1, characterized in that, in the mixed depth separable convolution module constructed in step S3, the mixed depth separable convolution unit designed in step S2, a normalization layer, and a nonlinear activation layer are stacked in series 4 times in sequence; the feature maps output by the first, second, and third units have the same channel count and the same size; the fourth unit outputs 2 times as many feature channels as its input, with output feature maps one half the size of its input; meanwhile, a residual structure with 3 skip connections alleviates the vanishing-gradient problem and speeds up training.
4. The real-time semantic segmentation method based on mixed depth separable convolution according to claim 3, characterized in that: the mixed depth separable convolution unit is stacked 4 times when constructing the mixed depth separable convolution module; in the first mixed depth separable convolution unit, the 4 groups use 3 × 3, 5 × 5, 7 × 7 and 9 × 9 depth separable convolution kernels from left to right; in the second mixed depth separable convolution unit, the 4 groups use 5 × 5, 7 × 7, 9 × 9 and 3 × 3 depth separable convolution kernels from left to right; in the third mixed depth separable convolution unit, the 4 groups use 7 × 7, 9 × 9, 3 × 3 and 5 × 5 depth separable convolution kernels from left to right; and in the fourth mixed depth separable convolution unit, the 4 groups use 9 × 9, 3 × 3, 5 × 5 and 7 × 7 depth separable convolution kernels from left to right, which eliminates the aliasing effect of the multi-scale features and improves the multi-scale feature representation capability.
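The kernel-size assignment in claim 4 is a cyclic left shift of (3, 5, 7, 9) across the four stacked units, so that every channel group sees every kernel size exactly once per module. A small sketch of that schedule (the function name is illustrative, not from the patent):

```python
BASE_KERNELS = (3, 5, 7, 9)

def unit_kernels(unit_index, base=BASE_KERNELS):
    """Kernel sizes used left-to-right by the unit at `unit_index` (0-based):
    each successive unit rotates the base assignment one position left."""
    shift = unit_index % len(base)
    return base[shift:] + base[:shift]

# Per-unit kernel assignments for the 4 units of one module:
# (3, 5, 7, 9), (5, 7, 9, 3), (7, 9, 3, 5), (9, 3, 5, 7)
schedule = [unit_kernels(i) for i in range(4)]
```

The rotation means no channel group is restricted to a single receptive-field size within a module, which is the stated mechanism for reducing multi-scale aliasing.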
5. The real-time semantic segmentation method based on mixed depth separable convolution according to claim 1, characterized in that: in step S4, the mixed depth separable convolution semantic segmentation network is constructed; the mixed depth separable convolution module designed in step S3 is stacked in series 4 times in sequence, using 16 mixed depth separable convolution units in total, to improve the semantic feature extraction capability for the image; within each mixed depth separable convolution module, a residual structure with 3 skip connections is used to alleviate vanishing gradients; residual connections between the mixed depth separable convolution modules use 1 × 1 convolutions; finally, a 3 × 3 convolution outputs the final segmentation result.
6. The real-time semantic segmentation method based on mixed depth separable convolution according to claim 1, characterized in that: in step S5, the mixed depth separable convolution semantic segmentation network is trained and verified; the training pictures processed in step S1 are input into the network, performance verification is performed through forward computation of the mixed depth separable convolution network, the predicted segmentation result is output, and supervised training is performed by computing a cross-entropy loss value against the corresponding pixel-level labels.
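The pixel-level cross-entropy loss of claim 6 can be illustrated with a minimal pure-Python sketch. Real training would use a deep learning framework over whole label maps; this only shows the per-pixel loss computation for a handful of pixels, with illustrative function names that are not from the patent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_cross_entropy(logits_per_pixel, labels):
    """Mean cross-entropy over pixels; `labels` holds ground-truth class indices."""
    total = 0.0
    for logits, label in zip(logits_per_pixel, labels):
        probs = softmax(logits)
        total += -math.log(probs[label])
    return total / len(labels)

# Two pixels, three classes; the logits favour the correct classes,
# so the resulting loss is small.
loss = pixel_cross_entropy([[2.0, 0.1, 0.1], [0.2, 3.0, 0.1]], [0, 1])
```

Averaging the per-pixel losses over the whole predicted segmentation map against the pixel-level annotation yields the supervision signal used during training.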
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110179063.8A CN112861727A (en) | 2021-02-09 | 2021-02-09 | Real-time semantic segmentation method based on mixed depth separable convolution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110179063.8A CN112861727A (en) | 2021-02-09 | 2021-02-09 | Real-time semantic segmentation method based on mixed depth separable convolution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112861727A true CN112861727A (en) | 2021-05-28 |
Family
ID=75988131
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110179063.8A Pending CN112861727A (en) | 2021-02-09 | 2021-02-09 | Real-time semantic segmentation method based on mixed depth separable convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112861727A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110692A (en) * | 2019-05-17 | 2019-08-09 | 南京大学 | A kind of realtime graphic semantic segmentation method based on the full convolutional neural networks of lightweight |
CN112183360A (en) * | 2020-09-29 | 2021-01-05 | 上海交通大学 | Lightweight semantic segmentation method for high-resolution remote sensing image |
Non-Patent Citations (2)
Title |
---|
TACHIBANA KANADE: "MixConv reading notes", pages 1 - 3 *
机器之心: "ShuffleSeg: a novel real-time semantic segmentation network usable on embedded devices", pages 1 - 6 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113887373A (en) * | 2021-09-27 | 2022-01-04 | 中关村科学城城市大脑股份有限公司 | Attitude identification method and system based on urban intelligent sports parallel fusion network |
CN116206109A (en) * | 2023-02-21 | 2023-06-02 | 桂林电子科技大学 | Liver tumor segmentation method based on cascade network |
CN116206109B (en) * | 2023-02-21 | 2023-11-07 | 桂林电子科技大学 | Liver tumor segmentation method based on cascade network |
CN116307267A (en) * | 2023-05-15 | 2023-06-23 | 成都信息工程大学 | Rainfall prediction method based on convolution |
CN116307267B (en) * | 2023-05-15 | 2023-07-25 | 成都信息工程大学 | Rainfall prediction method based on convolution |
CN117436452A (en) * | 2023-12-15 | 2024-01-23 | 西南石油大学 | Financial entity identification method integrating context awareness and multi-level features |
CN117436452B (en) * | 2023-12-15 | 2024-02-23 | 西南石油大学 | Financial entity identification method integrating context awareness and multi-level features |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112861727A (en) | Real-time semantic segmentation method based on mixed depth separable convolution | |
CN109726627B (en) | Neural network model training and universal ground wire detection method | |
EP3608844A1 (en) | Methods for training a crnn and for semantic segmentation of an inputted video using said crnn | |
CN107103285B (en) | Face depth prediction method based on convolutional neural network | |
CN110473137A (en) | Image processing method and device | |
CN112561027A (en) | Neural network architecture searching method, image processing method, device and storage medium | |
CN110659664B (en) | SSD-based high-precision small object identification method | |
CN111523546A (en) | Image semantic segmentation method, system and computer storage medium | |
CN109389667B (en) | High-efficiency global illumination drawing method based on deep learning | |
CN116229461A (en) | Indoor scene image real-time semantic segmentation method based on multi-scale refinement | |
CN111666948B (en) | Real-time high-performance semantic segmentation method and device based on multipath aggregation | |
CN113780132B (en) | Lane line detection method based on convolutional neural network | |
CN111860398A (en) | Remote sensing image target detection method and system and terminal equipment | |
CN113762267B (en) | Semantic association-based multi-scale binocular stereo matching method and device | |
CN113850324B (en) | Multispectral target detection method based on Yolov4 | |
CN117037119A (en) | Road target detection method and system based on improved YOLOv8 | |
CN114821058A (en) | Image semantic segmentation method and device, electronic equipment and storage medium | |
CN114359293A (en) | Three-dimensional MRI brain tumor segmentation method based on deep learning | |
CN114764856A (en) | Image semantic segmentation method and image semantic segmentation device | |
CN115049945B (en) | Unmanned aerial vehicle image-based wheat lodging area extraction method and device | |
CN114373073A (en) | Method and system for road scene semantic segmentation | |
CN115984701A (en) | Multi-modal remote sensing image semantic segmentation method based on coding and decoding structure | |
CN114677479A (en) | Natural landscape multi-view three-dimensional reconstruction method based on deep learning | |
CN113066089A (en) | Real-time image semantic segmentation network based on attention guide mechanism | |
CN116863194A (en) | Foot ulcer image classification method, system, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |