CN111062395B - Real-time video semantic segmentation method - Google Patents

Real-time video semantic segmentation method

Info

Publication number
CN111062395B
CN111062395B
Authority
CN
China
Prior art keywords
semantic segmentation
key frame
training
layer
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911185021.4A
Other languages
Chinese (zh)
Other versions
CN111062395A (en)
Inventor
赵三元
吴俊蓉
文宗正
黄科乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201911185021.4A priority Critical patent/CN111062395B/en
Publication of CN111062395A publication Critical patent/CN111062395A/en
Application granted granted Critical
Publication of CN111062395B publication Critical patent/CN111062395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and relates to a real-time video semantic segmentation method. The method comprises the following steps: step 1: selecting training and test data sets; step 2: constructing an image-based backbone network; step 3: pre-training the backbone network with the training data set; step 4: constructing the overall video semantic segmentation model; step 5: training the overall video semantic segmentation model on the training data set; step 6: inputting video frames from the test set, performing forward propagation in the trained video semantic segmentation model, and outputting the predicted semantic segmentation results end to end. The method has a high inference speed and can meet real-time requirements; it has high accuracy, can segment video semantically and accurately, and is highly practical.

Description

Real-time video semantic segmentation method
Technical Field
The invention belongs to the field of computer vision, and relates to a real-time video semantic segmentation method.
Background
Semantic segmentation is a fundamental task in computer vision that aims to predict a semantic label for each pixel of a given image. Driven by deep learning, the task has taken a brand-new direction; in particular, the introduction of fully convolutional networks brought image semantic segmentation to a new milestone. Video semantic segmentation tends to be more complex, because video has one more temporal dimension than image data and contains a large amount of redundant information.
Directly segmenting every frame of a video with an image-based semantic segmentation method is time consuming and cannot fully exploit the correlation between frames, so satisfactory performance cannot be obtained. Existing video semantic segmentation methods can be roughly classified by how they use temporal information, mainly into methods that encode motion and structural features with 3D convolutions, methods that aggregate frame-by-frame information with recurrent neural networks, methods that model spatial and temporal context with CRFs, and methods that compute optical flow with a separate network and propagate features. However, 3D-convolution-based methods can be regarded as a form of information aggregation that takes a whole video clip as input, so their processing efficiency is low, and recurrent-neural-network-based methods have similar drawbacks. CRF-based methods require high computational cost because of the complex inference of CRFs. Optical-flow-based methods struggle to obtain accurate optical flow estimates, are time consuming, and often suffer from misalignment. Most existing methods process video frames slowly and cannot run in real time, which is necessary in many practical applications of video semantic segmentation, such as autonomous driving and intelligent surveillance.
In summary, current video semantic segmentation methods need to fully exploit inter-frame consistency and reduce the information redundancy between adjacent video frames in order to save inference time.
Disclosure of Invention
The invention aims to solve the problem of low inference speed in existing video semantic segmentation, and provides a real-time video semantic segmentation method.
The working principle and process of the invention are as follows. To address the existing problems, a lightweight, efficient and real-time image-based backbone network is first proposed as the foundation of the overall video semantic segmentation method. The backbone network adopts an encoder-decoder architecture, and a residual double-branch depth-separable convolution module (RDDS module) is proposed in the encoder to capture detail information effectively while reducing the amount of computation. To enable feature propagation, a key frame selection mechanism is employed and a dedicated global attention module is proposed to indicate the spatial correlation between non-key frames and their preceding key frames. More specifically, the proposed attention-based feature propagation architecture is used to build a real-time fully convolutional network. First, input frames are divided into key frames and non-key frames according to a fixed key frame selection mechanism. For a key frame, the whole backbone network is used to extract rich spatial information at multiple levels for feature propagation. A non-key frame does not need to spend a large amount of time extracting redundant features through the whole backbone network; it only extracts low-level features through the lower part of the backbone network, preserving spatial details, and these are then fused with the high-level features of the previous key frame that have been propagated and weighted by attention. To achieve this propagation efficiently, the invention proposes an attention-based method: the low-level feature maps of the non-key frame and the corresponding key frame are taken as input, and the spatial similarity between any two positions of the feature maps is computed to obtain an overall attention map A, in which the value at each position represents the correlation between the corresponding positions of the key frame and the non-key frame. Because the overall attention map integrates the per-pixel correlation between the two frames, it can be regarded as a spatial transformation guide that captures inter-frame consistency information. The predicted high-level features of the non-key frame are obtained by applying the attention weights to the high-level features of the corresponding key frame, and are then fused with the low-level features of the non-key frame to supplement new information that was not present in the previous key frame, thereby enhancing the ability to handle complex and changing scenes. The proposed model can be trained end to end.
The purpose of the invention is realized by the following technical scheme.
A real-time video semantic segmentation method comprises the following steps:
step 1, selecting a training and testing data set;
step 2, constructing a backbone network based on images;
step 3, pre-training the backbone network by using a training data set;
step 4, constructing an integral video semantic segmentation model;
step 5, training the whole video semantic segmentation model on a training data set;
and 6, inputting the video frame of the test set, carrying out forward propagation in the trained video semantic segmentation model, and outputting a predicted semantic segmentation result end to end.
The image-based backbone network described in step 2 adopts an encoder-decoder architecture. The encoder comprises a residual double-branch depth-separable convolution module (RDDS module) and a downsampling module. The RDDS module comprises two symmetric branches, each containing 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 linear rectification (ReLU) activation layer and 1 dropout layer; the outputs of the two branches are concatenated (Concat) and then passed through a convolution layer with 1 × 1 kernels and a ReLU activation layer. The downsampling module concatenates (Concat) the outputs of a max-pooling layer and a convolution layer with 3 × 3 kernels. The decoder contains a convolution layer with 1 × 1 kernels and an 8× bilinear upsampling layer.
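The following is a minimal PyTorch sketch of an RDDS module matching the description above. It is an illustration only: the channel count, dilation rate, dropout probability, the exact placement of the three batch-normalization layers within a branch, and the residual connection around the whole module are assumptions not specified verbatim in the text.

```python
import torch
import torch.nn as nn

class RDDSBranch(nn.Module):
    """One symmetric branch: depth-separable conv + depth-separable dilated conv."""
    def __init__(self, channels, dilation=2, p_drop=0.1):
        super().__init__()
        self.block = nn.Sequential(
            # depth-separable conv = depthwise 3x3 + pointwise 1x1
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # depth-separable dilated conv
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation,
                      groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Dropout2d(p_drop),
        )

    def forward(self, x):
        return self.block(x)

class RDDSModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch1 = RDDSBranch(channels)
        self.branch2 = RDDSBranch(channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),  # 1x1 conv after Concat
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x)], dim=1)  # Concat the two branches
        return self.fuse(y) + x  # residual connection (assumed from the module name)
```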
The step 3 comprises the following steps:
step 3.1: pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color (brightness, saturation and contrast) changes as data augmentation;
step 3.2: initializing a whole image semantic segmentation model;
step 3.3: and recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
The overall video semantic segmentation model in step 4 is based on a key frame selection mechanism. First, the encoder part of the backbone network is divided into a lower part and a higher part, with the second downsampling layer as the boundary. If the current input video frame is a key frame, the whole backbone network is used for computation to obtain a low-level feature map F_lk and a high-level feature map F_hk. If the current input is a non-key frame, only the lower part of the encoder is used to compute its feature map F_ln, and the global attention module is then used to carry out feature propagation: F_ln and the F_lk of the previous key frame are taken as input and combined by matrix multiplication to obtain the overall attention map A; the F_hk of the previous key frame is then multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information. In both cases, the final semantic segmentation result is obtained through the decoder.
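To make step 4 concrete, the following is a minimal PyTorch sketch of the key-frame / non-key-frame forward pass, assuming injected sub-modules low_encoder, high_encoder, attention and decoder that correspond to the lower encoder part, the higher encoder part, the global attention module and the decoder. The module names, the fixed key-frame interval, and the assumption that the sub-modules return shape-compatible feature maps (so that F_hk * A and F_hn + F_ln are well defined) are illustrative and not taken from the patent.

```python
import torch.nn as nn

class VideoSegmenter(nn.Module):
    """Key-frame / non-key-frame forward pass with attention-based feature propagation."""
    def __init__(self, low_encoder, high_encoder, attention, decoder, key_interval=5):
        super().__init__()
        self.low_encoder = low_encoder      # lower part of the encoder (up to the 2nd downsampling)
        self.high_encoder = high_encoder    # higher part of the encoder
        self.attention = attention          # global attention module
        self.decoder = decoder
        self.key_interval = key_interval    # fixed key-frame selection (interval is an assumption)
        self.F_lk = None                    # cached key-frame low-level features
        self.F_hk = None                    # cached key-frame high-level features

    def forward(self, frame, frame_idx):
        if frame_idx % self.key_interval == 0:          # key frame: run the whole backbone
            self.F_lk = self.low_encoder(frame)
            self.F_hk = self.high_encoder(self.F_lk)
            features = self.F_hk
        else:                                           # non-key frame: lower encoder only
            F_ln = self.low_encoder(frame)
            A = self.attention(F_ln, self.F_lk)         # overall attention map
            F_hn = self.F_hk * A                        # propagate key-frame high-level features
            features = F_hn + F_ln                      # fuse with low-level details
        return self.decoder(features)
```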
The step 5 comprises the following steps:
step 5.1, pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color (brightness, saturation and contrast) changes as data augmentation;
step 5.2, load the pre-trained backbone network parameters, initialize the overall video semantic segmentation model, and input a key frame/non-key frame image pair each time, where each continuous video segment comprises 1 key frame and n non-key frames;
and 5.3, recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
For step 3 and step 5, error back-propagation is performed according to the loss using the stochastic gradient descent algorithm, and the model parameters are updated with a polynomial learning strategy to obtain the trained semantic segmentation model. In the polynomial learning strategy, the learning rate lr is set as:
lr = baselr × (1 − iter / max_iter)^power
where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5, and power is set to 0.9.
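A short sketch of this polynomial ("poly") schedule, assuming the standard form lr = baselr · (1 − iter/max_iter)^power; the total iteration count used in the example is an assumption, since only baselr and power are specified.

```python
def poly_lr(baselr, cur_iter, max_iter, power=0.9):
    """Polynomial learning-rate decay: lr = baselr * (1 - cur_iter/max_iter)**power."""
    return baselr * (1.0 - cur_iter / max_iter) ** power

# example: backbone pre-training (step 3) with an assumed total of 300 iterations
for it in (0, 100, 200, 299):
    print(it, round(poly_lr(5e-4, it, 300), 6))
```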
Advantageous effects
Compared with the prior art, the invention has the following remarkable advantages:
(1) the method has high reasoning speed and can meet the requirement of real-time performance;
(2) the method has high accuracy, can accurately realize video semantic segmentation, and has practicability.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a backbone network architecture of the present invention;
FIG. 3 is a block diagram of the residual two-branch depth separable convolution module of the present invention;
FIG. 4 is a block diagram of a downsampling module of the present invention;
FIG. 5 is a diagram of a video semantic segmentation model architecture of the present invention;
FIG. 6 is a block diagram of the overall attention module of the present invention;
FIG. 7 is a partial example of the present invention on a Cityscapes dataset.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings.
The present invention will now be described more fully hereinafter with particular reference to a preferred embodiment.
As shown in fig. 1, the real-time video semantic segmentation method of the present invention includes the following steps:
step 1, selecting training and test data sets; in this embodiment, the 20 classes of the Cityscapes dataset (of which class 1 is the background) are used as the benchmark, the Cityscapes (single-frame image) and Cityscapes-sequence (continuous video frame) datasets are used when training the backbone network, only the Cityscapes-sequence dataset is used when training the overall video semantic segmentation model, and the Cityscapes test dataset is used for testing.
And 2, constructing a backbone network based on the image.
As shown in fig. 2, the backbone network employs an encoder-decoder architecture. The encoder comprises a residual two-branch depth separable convolution module (RDDS module) and a down-sampling module, and the decoder comprises a convolution layer of 1 x 1 convolution kernel and an 8-fold bilinear up-sampling layer.
As shown in fig. 3, the RDDS module includes two symmetric branches; each branch includes 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 linear rectification (ReLU) activation layer and 1 dropout layer, and the two branch outputs are concatenated (Concat) and then passed through a convolution layer with 1 × 1 kernels and a ReLU activation layer. The RDDS module effectively captures detail information while reducing the amount of computation.
As shown in fig. 4, the downsampling module performs the downsampling operation used to extract features; it concatenates (Concat) the outputs of a max-pooling layer and a convolution layer with 3 × 3 kernels.
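A minimal PyTorch sketch of such a downsampling module follows. Giving the convolution out_ch − in_ch output channels so that the Concat yields out_ch channels (as in ENet-style initial blocks) is an assumption, as are the stride of 2 and an even input size.

```python
import torch
import torch.nn as nn

class DownsamplingModule(nn.Module):
    """Concat of a max-pooling branch and a strided 3x3 convolution branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # the convolution supplies the extra channels so that the Concat yields out_ch
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return torch.cat([self.pool(x), self.conv(x)], dim=1)
```

Combining a pooled branch with a learned strided-convolution branch lets both pooled and learned features survive the resolution drop at low cost.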
The specific network structure of the backbone network is shown in table 1:
TABLE 1 RDDS Module network architecture
And 3, pre-training the backbone network by using a Cityscapes and Cityscapes sequence training data set.
The process of step 3 is:
step 3.1, pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color (brightness, saturation and contrast) changes as data augmentation (an illustrative augmentation pipeline is sketched after step 3.3);
step 3.2, initializing the whole image semantic segmentation model;
and 3.3, recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
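An illustrative augmentation pipeline for step 3.1, sketched with torchvision; the target size 512 × 1024 and the jitter strengths are assumptions, and for segmentation the same geometric transforms would also have to be applied to the label maps, which is omitted here.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((512, 1024)),                            # resize to a fixed size (assumed)
    transforms.RandomHorizontalFlip(p=0.5),                    # horizontal flipping
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
    transforms.ColorJitter(brightness=0.3, saturation=0.3, contrast=0.3),  # color change
    transforms.ToTensor(),
])
```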
And 4, constructing an integral video semantic segmentation model.
As shown in fig. 5, the overall video semantic segmentation model is based on a key frame selection mechanism. First, the encoder part of the backbone network is divided into a lower part and a higher part, with the second downsampling layer as the boundary. If the current input video frame is a key frame, the whole backbone network is used for computation to obtain a low-level feature map F_lk and a high-level feature map F_hk. If the current input is a non-key frame, only the lower part of the encoder is used to compute its feature map F_ln, and the global attention module is then used to carry out feature propagation: F_ln and the F_lk of the previous key frame are taken as input and combined by matrix multiplication to obtain the overall attention map A; the F_hk of the previous key frame is then multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information. In both cases, the final semantic segmentation result is obtained through the decoder.
As shown in fig. 6, in order to deeply mine the spatial correlation between the low-level feature maps of the key frame and the non-key frame and implement feature propagation, an overall attention module is designed, and the attention map calculated by the module implicitly contains inter-frame consistency information and can be regarded as guiding information of feature propagation. The calculation process in the global attention module is as follows:
(1) After the computation of the second downsampling module of the backbone encoder, the low-level feature maps of the key frame and the non-key frame, F_lk, F_ln ∈ R^(C×H×W), are obtained. The number of channels of both is reduced, and the transposed F_lk is multiplied with F_ln by matrix multiplication to obtain the map A′ ∈ R^(N×N), where N = H × W.
(2) A′ is then fed into two parallel branches to obtain the maximum point-to-point response: average pooling and max pooling are applied to A′ over the channel dimension, respectively, and the outputs of the two branches are concatenated (Concat) to obtain a maximum-response attention map with 2 channels.
(3) Finally, a 5 × 5 convolution layer is used to reduce the number of channels to 1, and a Sigmoid activation layer limits the values to [-1, 1], giving the final attention map A.
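A minimal PyTorch sketch of such a global attention computation is given below, assuming inputs F_ln and F_lk of shape (B, C, H, W); the separate 1 × 1 channel-reduction convolutions, the reduced channel count, and the reshaping of the pooled responses back to (B, 1, H, W) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    def __init__(self, in_ch, reduced_ch=32):
        super().__init__()
        self.reduce_k = nn.Conv2d(in_ch, reduced_ch, 1)   # reduce channels of key-frame features
        self.reduce_n = nn.Conv2d(in_ch, reduced_ch, 1)   # reduce channels of non-key-frame features
        self.conv = nn.Conv2d(2, 1, kernel_size=5, padding=2)

    def forward(self, F_ln, F_lk):
        B, _, H, W = F_ln.shape
        k = self.reduce_k(F_lk).flatten(2)                # (B, C_r, N), N = H*W
        n = self.reduce_n(F_ln).flatten(2)                # (B, C_r, N)
        A_prime = torch.bmm(k.transpose(1, 2), n)         # (B, N, N): transposed F_lk x F_ln
        avg = A_prime.mean(dim=1).view(B, 1, H, W)        # average pooling over the "channel" dim
        mx = A_prime.max(dim=1).values.view(B, 1, H, W)   # max pooling over the "channel" dim
        fused = self.conv(torch.cat([avg, mx], dim=1))    # Concat -> 5x5 conv -> 1 channel
        return torch.sigmoid(fused)                       # Sigmoid as named in the text (range (0, 1))
```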
And 5, training the whole video semantic segmentation model on a Cityscapes sequence training data set.
The process of step 5 is:
step 5.1, pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color (brightness, saturation and contrast) changes as data augmentation;
step 5.2, load the pre-trained backbone network parameters, initialize the overall video semantic segmentation model, and input a key frame/non-key frame image pair each time, where each continuous video segment comprises 1 key frame and n non-key frames (a sketch of this pairing is given after step 5.3);
and 5.3, recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
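For illustration, a minimal sketch of how (key frame, non-key frame) training pairs could be drawn from continuous video segments as described in step 5.2; the segment layout, the value of n, and the function name make_pairs are assumptions, not taken from the patent.

```python
def make_pairs(frames, n=4):
    """frames: list of consecutive video frames; returns (key, non-key) training pairs."""
    pairs = []
    for start in range(0, len(frames) - n, n + 1):   # each segment: 1 key frame + n non-key frames
        key = frames[start]
        for offset in range(1, n + 1):
            pairs.append((key, frames[start + offset]))
    return pairs

# usage: pairs = make_pairs(video_frames, n=4); the loss of step 5.3 is computed on the
# predicted non-key-frame segmentation against its ground-truth label.
```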
For step 3 and step 5, error back-propagation is performed according to the loss using the stochastic gradient descent algorithm, and the model parameters are updated with a polynomial learning strategy to obtain the trained semantic segmentation model. In the polynomial learning strategy, the learning rate lr is set as:
lr = baselr × (1 − iter / max_iter)^power
where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5, and power is set to 0.9.
And 6, inputting the video frame of the test set, carrying out forward propagation in the trained video semantic segmentation model, and outputting a predicted semantic segmentation result end to end.
Table 2 compares the accuracy (mIoU) and inference speed of this video semantic segmentation method with other state-of-the-art methods; it shows that the method greatly increases inference speed while maintaining high accuracy, reaching 131.6 fps at an mIoU of 60.6%:
TABLE 2 comparison of this video semantic segmentation method with other most advanced methods
FIG. 7 shows a partial example of the present invention on a Cityscapes dataset.

Claims (4)

1. A real-time video semantic segmentation method is characterized by comprising the following steps:
step 1, selecting a training and testing data set;
step 2, constructing a backbone network based on images;
the backbone network adopts an encoder-decoder architecture; the encoder comprises a residual double-branch depth-separable convolution module (RDDS module) and a downsampling module, wherein the RDDS module comprises two symmetric branches, each branch comprising 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 ReLU activation layer and 1 dropout layer, and the two branch outputs, after Concat, pass through a convolution layer with 1 × 1 kernels and a ReLU activation layer; the downsampling module is composed of a max-pooling layer and a convolution layer with 3 × 3 kernels whose outputs are combined by Concat; the decoder comprises a convolution layer with 1 × 1 kernels and an 8× bilinear upsampling layer;
step 3, pre-training the backbone network by using a training data set;
step 4, constructing an integral video semantic segmentation model;
the overall video semantic segmentation model is based on a key frame selection mechanism; first, the encoder part of the backbone network is divided into a lower part and a higher part, with the second downsampling layer as the boundary; if the current input video frame is a key frame, the whole backbone network is used for computation to obtain a low-level feature map F_lk and a high-level feature map F_hk; if the current input is a non-key frame, only the lower part of the encoder is used to compute its feature map F_ln, and the global attention module is then used to carry out feature propagation, with F_ln and the F_lk of the previous key frame as input; the calculation process in the global attention module is as follows:
(1) after the computation of the second downsampling module of the backbone encoder, the low-level feature maps of the key frame and the non-key frame, F_lk, F_ln ∈ R^(C×H×W), are obtained; the number of channels of both is reduced, and the transposed F_lk is multiplied with F_ln by matrix multiplication to obtain the map A′ ∈ R^(N×N), where N = H × W;
(2) A′ is fed into two parallel branches to obtain the maximum point-to-point response: average pooling and max pooling are applied to A′ over the channel dimension, respectively, and the outputs of the two branches are concatenated (Concat) to obtain a maximum-response attention map with 2 channels;
(3) finally, a 5 × 5 convolution layer is used to reduce the number of channels to 1, and a Sigmoid activation layer limits the values to [-1, 1], giving the final attention map A;
the F_hk of the previous key frame is multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information; in both cases, the final semantic segmentation result is obtained through the decoder;
step 5, training the whole video semantic segmentation model on a training data set;
and 6, inputting the video frame of the test set, carrying out forward propagation in the trained video semantic segmentation model, and outputting a predicted semantic segmentation result end to end.
2. The method according to claim 1, wherein step 3 comprises:
step 3.1: pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color changes as data augmentation;
step 3.2: initializing a whole image semantic segmentation model;
step 3.3: and recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
3. The method according to claim 1, wherein the step 5 comprises:
step 5.1: pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color changes as data augmentation;
step 5.2: loading pre-trained model parameters of a backbone network, initializing a whole video semantic segmentation model, inputting a key frame-non-key frame image pair each time, wherein each continuous video segment comprises 1 key frame and n non-key frames;
step 5.3: and recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
4. The real-time video semantic segmentation method according to claim 2 or 3, characterized in that for step 3 and step 5, a stochastic gradient descent algorithm is used for error back propagation according to loss, and a polynomial learning strategy is used to update model parameters to obtain a trained semantic segmentation model; in the polynomial learning strategy, the learning rate lr is set as:
lr = baselr × (1 − iter / max_iter)^power
where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5, and power is set to 0.9.
CN201911185021.4A 2019-11-27 2019-11-27 Real-time video semantic segmentation method Active CN111062395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911185021.4A CN111062395B (en) 2019-11-27 2019-11-27 Real-time video semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911185021.4A CN111062395B (en) 2019-11-27 2019-11-27 Real-time video semantic segmentation method

Publications (2)

Publication Number Publication Date
CN111062395A CN111062395A (en) 2020-04-24
CN111062395B true CN111062395B (en) 2020-12-18

Family

ID=70299046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911185021.4A Active CN111062395B (en) 2019-11-27 2019-11-27 Real-time video semantic segmentation method

Country Status (1)

Country Link
CN (1) CN111062395B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651421B (en) * 2020-09-04 2024-05-28 江苏濠汉信息技术有限公司 Infrared thermal imaging power transmission line anti-external-damage monitoring system and modeling method thereof
CN112364822B (en) * 2020-11-30 2022-08-19 重庆电子工程职业学院 Automatic driving video semantic segmentation system and method
CN112862839B (en) * 2021-02-24 2022-12-23 清华大学 Method and system for enhancing robustness of semantic segmentation of map elements
CN113177478B (en) * 2021-04-29 2022-08-05 西华大学 Short video semantic annotation method based on transfer learning
CN113505680B (en) * 2021-07-02 2022-07-15 兰州理工大学 Content-based bad content detection method for high-duration complex scene video
CN113658189B (en) * 2021-09-01 2022-03-11 北京航空航天大学 Cross-scale feature fusion real-time semantic segmentation method and system
CN116246075B (en) * 2023-05-12 2023-07-21 武汉纺织大学 Video semantic segmentation method combining dynamic information and static information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
CN109241972A (en) * 2018-08-20 2019-01-18 电子科技大学 Image, semantic dividing method based on deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9436876B1 (en) * 2014-12-19 2016-09-06 Amazon Technologies, Inc. Video segmentation techniques
CN108235116B (en) * 2017-12-27 2020-06-16 北京市商汤科技开发有限公司 Feature propagation method and apparatus, electronic device, and medium
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN110147763B (en) * 2019-05-20 2023-02-24 哈尔滨工业大学 Video semantic segmentation method based on convolutional neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
CN109241972A (en) * 2018-08-20 2019-01-18 电子科技大学 Image, semantic dividing method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Survey on visual attention detection; Wang Wenguan et al.; Journal of Software (《软件学报》); 2019-02-15; Vol. 30, No. 2; full text *

Also Published As

Publication number Publication date
CN111062395A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN111062395B (en) Real-time video semantic segmentation method
CN112634276B (en) Lightweight semantic segmentation method based on multi-scale visual feature extraction
CN110059772B (en) Remote sensing image semantic segmentation method based on multi-scale decoding network
CN110276354B (en) High-resolution streetscape picture semantic segmentation training and real-time segmentation method
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN111696110B (en) Scene segmentation method and system
CN111832453B (en) Unmanned scene real-time semantic segmentation method based on two-way deep neural network
CN111652081B (en) Video semantic segmentation method based on optical flow feature fusion
CN111563507A (en) Indoor scene semantic segmentation method based on convolutional neural network
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN114565770A (en) Image segmentation method and system based on edge auxiliary calculation and mask attention
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113870160A (en) Point cloud data processing method based on converter neural network
CN115393289A (en) Tumor image semi-supervised segmentation method based on integrated cross pseudo label
CN115496919A (en) Hybrid convolution-transformer framework based on window mask strategy and self-supervision method
CN116486080A (en) Lightweight image semantic segmentation method based on deep learning
CN115830575A (en) Transformer and cross-dimension attention-based traffic sign detection method
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN110942463B (en) Video target segmentation method based on generation countermeasure network
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
CN115909465A (en) Face positioning detection method, image processing device and readable storage medium
CN116310324A (en) Pyramid cross-layer fusion decoder based on semantic segmentation
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance
CN113313721B (en) Real-time semantic segmentation method based on multi-scale structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant