CN111882563B - Semantic segmentation method based on directional full convolution network - Google Patents


Info

Publication number
CN111882563B
CN111882563B · Application CN202010669134.8A
Authority
CN
China
Prior art keywords
directional
network
semantic segmentation
convolution
full convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010669134.8A
Other languages
Chinese (zh)
Other versions
CN111882563A (en)
Inventor
武伯熹 (Wu Boxi)
蔡登 (Cai Deng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010669134.8A priority Critical patent/CN111882563B/en
Publication of CN111882563A publication Critical patent/CN111882563A/en
Application granted granted Critical
Publication of CN111882563B publication Critical patent/CN111882563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06T 7/11 — Image analysis; Segmentation; Region-based segmentation
    • G06N 3/045 — Neural networks; Architecture; Combinations of networks
    • G06N 3/08 — Neural networks; Learning methods
    • G06T 2207/10004 — Image analysis indexing scheme; Image acquisition modality; Still image; Photographic image


Abstract

The invention discloses a semantic segmentation method based on a directional full convolution network, comprising the following steps: (1) constructing a fully convolutional network built from directional convolutions; (2) appending a pooling layer and a fully connected layer to the top of the constructed directional full convolution network to form a first depth model, and pre-training it on a large-scale data set; (3) extracting the fully convolutional part of the pre-trained first depth model, using its layers to initialize the parameters of the directional full convolution network, and adding a new fully connected layer to form a second depth model; (4) training the second depth model on an image semantic segmentation data set until the model converges; (5) analysing the picture to be processed with the trained second depth model, predicting the category of each pixel, and outputting the resulting semantic segmentation map. The method encourages semantic segmentation to learn the relation between the receptive field and its central pixel, improving the robustness of the trained model.

Description

Semantic segmentation method based on directional full convolution network
Technical Field
The invention belongs to the field of computer vision and image processing, and in particular relates to a semantic segmentation method based on a directional full convolution network.
Background
With the development and deepening study of deep learning theory, many fields and tasks within computer vision have seen rapid breakthroughs and remarkable improvements. Among them, semantic segmentation, owing to its high demands on the fineness of the vision system, is one of the most challenging computer vision tasks and a popular current research direction. The semantic segmentation task requires a computer vision system to predict, for a picture of arbitrary size, the category of the object to which each pixel belongs. The mainstream semantic segmentation solutions adopt a fully convolutional network architecture, originating from the work "Fully Convolutional Networks for Semantic Segmentation" by Jonathan Long et al. of the University of California, Berkeley, presented at the Conference on Computer Vision and Pattern Recognition (CVPR 2015). Drawing on experience from the image recognition field, that work processes images using only convolutional layers (a fully convolutional network), combined with bilinear interpolation so that the output predictions correspond one-to-one with the pixels of the input picture. Trained end-to-end under the supervised learning framework, the network learns image features far superior to those of traditional methods. The DeepLabv3+ method, published by Liang-Chieh Chen et al. at the 2018 European Conference on Computer Vision (ECCV) as "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", is a leading solution in the field; it enlarges the effective receptive field through techniques such as atrous (dilated) convolution.
However, the effectiveness of the fully convolutional network is not fully understood, and its predictions still show defects. A careful analysis of its prediction process reveals that, for the prediction at a single pixel, the neural network has access to all pixels in the receptive field (the part of the input that can directly influence a given output), yet it outputs only the class of the central pixel. On the one hand, no mechanism explicitly guides the network during training to predict the pixel at the centre of the receptive field; on the other hand, experiments show that the fully convolutional network does learn from data the association between the receptive field and its central pixel. This contrast motivates us to understand the underlying mechanism of convolutional networks and, building on that understanding, to encourage the network to pay more attention to the central position, thereby producing a more robust semantic segmentation system.
Disclosure of Invention
The invention provides a semantic segmentation method based on a directional full convolution network, which encourages semantic segmentation to learn the relation between the receptive field and its central pixel, improves the robustness of the trained model, and makes image semantic segmentation more accurate.
A semantic segmentation method based on a directional full convolution network comprises the following steps:
(1) constructing a full convolution network of directional convolution;
(2) appending a pooling layer and a fully connected layer to the top of the constructed directional full convolution network to form a first depth model, and pre-training it on a large-scale data set;
(3) extracting the fully convolutional part of the pre-trained first depth model, using its layers to initialize the parameters of the directional full convolution network, and adding a new fully connected layer to form a second depth model;
(4) training the second depth model on an image semantic segmentation data set until the model converges;
(5) and analyzing the picture to be detected by using the trained second depth model, predicting the category of each pixel in the picture, and forming and outputting a semantic segmentation graph of the picture.
The method first constructs a fully convolutional network that uses only directional convolutions, then pre-trains a deep network consisting of this fully convolutional part plus a pooling layer and a fully connected layer, to serve as the initialization of the fully convolutional layers; it then trains on the semantic segmentation training data set and predicts the category of each pixel of an input image. The method promotes the latent task of "predicting the pixel at the centre of the receptive field" during deep network learning, so that a robust semantic segmentation model emerges more easily.
In step (1), all ordinary convolutions are replaced by directional convolutions. The precise definition of directional convolution is as follows.
An ordinary convolution operation is the linear transformation

y_{c_o} = Σ_{c_i=1}^{C_i} Σ_{s∈S} w_{s,c_i} · x_{s,c_i} + b_{c_o}

where y_{c_o} is the c_o-th output feature, c_i indexes the input features (C_i in total), S is the set of pixel offsets sampled during the convolution, and w_{s,c_i}, x_{s,c_i} and b_{c_o} are respectively the weight, input and bias of the linear operation. Because ordinary convolution samples uniformly, the offset set S is chosen as

S = {0, 1, -1}²
For a directional convolution the offset set is no longer constant as above, but is chosen from the following dynamic sets:

M_k = {(s_1, s_2) | (s_1 - e_1)² + (s_2 - e_2)² ≤ 2²; s_1, s_2 ∈ [-2, 2]; s_1, s_2 ∈ Z} ∪ {(0, 0)}

where k ranges over the integers 0 to 15, representing 16 different directions, and (e_1, e_2) is the endpoint of direction k on a circle of radius 2 (the original formula for (e_1, e_2) appears only as an image in the patent). The value of S is chosen per input channel as

S = M_(c_i mod 16)

where c_i is the index of the input channel, and the modulo-16 operation sorts the channels into 16 different groups.
This turns the original 3 × 3 square sampling region into fan-shaped regions pointing in different directions. Because the centre pixel is always sampled while the surrounding pixels are sampled in turn, the centre pixel has more paths for transmitting information in the resulting computational graph, which raises its attention.
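By way of illustration (not part of the claimed method), the sampling sets M_k can be enumerated directly. This sketch assumes the 16 directions are evenly spaced on a circle of radius 2, i.e. (e1, e2) = (2 cos(kπ/8), 2 sin(kπ/8)) — an assumption, since the patent's formula for (e1, e2) is reproduced only as an image:

```python
import math

def direction_endpoint(k: int):
    """Endpoint (e1, e2) of direction k on a circle of radius 2.

    Assumption: the 16 directions are evenly spaced; the patent's own
    formula for (e1, e2) is not reproduced in the text.
    """
    theta = 2.0 * math.pi * k / 16.0
    return 2.0 * math.cos(theta), 2.0 * math.sin(theta)

def sampling_set(k: int):
    """Offset set M_k: integer points of the 5x5 grid within distance 2
    of the direction endpoint, plus the always-sampled centre (0, 0)."""
    e1, e2 = direction_endpoint(k)
    offsets = {(0, 0)}
    for s1 in range(-2, 3):
        for s2 in range(-2, 3):
            if (s1 - e1) ** 2 + (s2 - e2) ** 2 <= 2 ** 2:
                offsets.add((s1, s2))
    return offsets

# Input channel ci uses S = M_(ci % 16), so channels fall into 16 groups.
```

For k = 0 the set is the 9-point fan {(0,0), (1,-1), (1,0), (1,1), (2,-2), …, (2,2)} — the same number of samples as an ordinary 3 × 3 kernel, but skewed toward one direction with the centre always present.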
The directional convolution described above is named DirConv-I, where I indicates that the direction is selected according to the input dimension. Similarly, DirConv-O selects the direction according to the output dimension, with convolution offsets

S = M_(c_o mod 16)

The above design is a variant of the 3 × 3 convolution; treating it instead as a 2 × 2-type convolution yields slim versions of the directional convolution: DirConv-SI and DirConv-SO.
In step (2), to alleviate the problem of the large data volume required for semantic segmentation, the large-scale data set used is the image recognition data set ImageNet, which accelerates convergence and improves the quality of the semantic segmentation training.
The specific steps of the step (2) are as follows:
(2-1) adding an image pooling layer at the top of the fully convolutional network, turning the three-dimensional feature map into a feature vector, and then using a fully connected layer to map it into a 1000-dimensional vector, corresponding to the 1000 ImageNet image categories;
(2-2) training the constructed first depth model on GPUs, wherein each GPU calculates 32 images at a time, and 8 GPUs are trained in parallel;
(2-3) using the SGDM (SGD with momentum) optimization algorithm with an initial learning rate of 0.256, multiplying the learning rate by 0.1 every 30 epochs, training for 90 epochs in total with the momentum parameter set to 0.9, until the model converges.
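The schedule in (2-3) can be sketched as a simple step-decay function; in a framework such as PyTorch it corresponds to `torch.optim.SGD(momentum=0.9)` combined with `StepLR(step_size=30, gamma=0.1)`:

```python
def learning_rate(epoch: int, base_lr: float = 0.256) -> float:
    """Step schedule from (2-3): the learning rate is multiplied by 0.1
    every 30 epochs; training runs for 90 epochs in total."""
    return base_lr * (0.1 ** (epoch // 30))

# epochs 0-29 -> 0.256, epochs 30-59 -> 0.0256, epochs 60-89 -> 0.00256
```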
In step (3), the parameters of the directional full convolution network are initialized from the fully convolutional layers obtained in the previous step; a fully connected layer is then appended to map the feature values into a c-dimensional vector, where c is the number of object categories in the target semantic segmentation data set. The newly added fully connected layer is randomly initialized from a Gaussian distribution.
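A minimal sketch of this model surgery, using plain dictionaries as stand-ins for a framework's parameter store. The key prefixes "pool." and "fc." and the standard deviation 0.01 are illustrative assumptions, not values from the patent:

```python
import random

def build_second_model(pretrained_params: dict, feat_dim: int,
                       num_classes: int, std: float = 0.01,
                       seed: int = 0) -> dict:
    """Keep the fully convolutional parameters, drop the pooling/FC head
    used for ImageNet pre-training, and append a new fully connected
    layer with Gaussian-initialised weights (std is an assumed value)."""
    rng = random.Random(seed)
    model = {k: v for k, v in pretrained_params.items()
             if not k.startswith(("pool.", "fc."))}   # backbone only
    model["new_fc.weight"] = [[rng.gauss(0.0, std) for _ in range(feat_dim)]
                              for _ in range(num_classes)]  # c x feat_dim
    model["new_fc.bias"] = [0.0] * num_classes
    return model
```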
The specific process of the step (4) is as follows:
(4-1) inputting the pictures of the training set into a second depth model, and generating a feature map after calculation;
(4-2) replacing the last strided convolution in the model network with a non-strided convolution, and setting the dilation of all subsequent convolutions to 2;
(4-3) because the convolution strides reduce the image resolution during the network's computation, the final feature map is 1/16 the size of the original image, so bilinear interpolation is used to enlarge the feature map back to the original image size;
(4-4) feeding the generated features into a softmax function to obtain the probability distribution of the predictions, computing the gradients of the network parameters with a cross-entropy loss function, and updating the parameter values with the SGDM (SGD with momentum) optimization algorithm; the initial learning rate is set to 10⁻³;
(4-5) repeating the above steps until the model converges.
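The per-pixel loss of step (4-4) can be sketched in plain Python; in a framework, the ×16 enlargement of (4-3) would be done first, e.g. with `torch.nn.functional.interpolate(..., mode='bilinear')`:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(logits, target: int) -> float:
    """Cross-entropy loss for one pixel: the negative log-probability
    assigned to the ground-truth class."""
    return -math.log(softmax(logits)[target])
```

During training this loss is averaged over all pixels of the batch, and its gradients drive the SGDM parameter update.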
Compared with the prior art, the invention has the following beneficial effects:
1. Based on an understanding of the receptive field of convolutional networks, the invention designs a novel directional convolution layer that highlights the network's attention to the centre of the receptive field, so that the neural network can more easily learn the intrinsic correlation between input and output.
2. The invention is widely applicable: simply replacing the ordinary convolutional network with a directional convolutional network deploys it directly and effectively in most existing semantic segmentation techniques, without affecting the rest of the pipeline.
Drawings
FIG. 1 is a schematic flow chart of a semantic segmentation method based on a directional convolutional network according to the present invention;
FIG. 2 is a visualization of convolution kernels of different convolution networks and corresponding perception fields in an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a semantic segmentation method based on a directional convolutional network includes the following steps:
and S01, constructing a full convolution network of the directional convolution.
By adopting the design of a deep learning network reference residual error network ResNet-101, parameters such as network depth, network width, image resolution, convolution span and the like are kept unchanged, 3-by-3 type convolution is replaced by directional convolution, and a visual graph of the general convolution of a directional convolution kernel is shown in figure 2.
S02: append a pooling layer and a fully connected layer to the top of the constructed directional full convolution network, and pre-train on a large data set. The directional fully convolutional network constructed in step S01 cannot be used directly for pre-training on the image recognition task, because its output feature vectors do not match the format of the ImageNet data set. ImageNet is the first ultra-large image recognition data set, published by Jia Deng et al. of Stanford University at the 2009 Conference on Computer Vision and Pattern Recognition in the article "ImageNet: A large-scale hierarchical image database". It gathers Internet pictures of 1000 object classes, with image sizes around 256 × 256 and more than 1000 training pictures per class: 1,281,167 pictures in the training set and 50,000 in the validation set.
The fully convolutional network is followed by an image pooling layer, which reduces the feature map to a feature vector, and a fully connected layer then converts the feature vector into a prediction vector of length 1000. Pre-training is done on ImageNet with the same training scheme as ResNet-101. The pre-training results are shown in Table 1; it can be seen that the directional convolutions achieve the same pre-training effect:
TABLE 1
[table reproduced as an image in the original document]
S03: extract the fully convolutional part from step S02 for subsequent training; the remaining parameters are randomly initialized.
S04: train the model. During training, when the images are too large, convolution strides are used to reduce the feature map to 1/16 of the input size; during subsequent prediction the strides can be cancelled, which raises the prediction resolution and produces better results. The difference arises because training must process multiple pictures simultaneously.
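The stride adjustment described above can be planned mechanically. This sketch — a simplified version of the output-stride trick popularised by DeepLab, offered here as an illustration rather than the patent's exact procedure — takes the per-stage strides of a backbone and returns (stride, dilation) pairs, cancelling strides once the target output stride is reached:

```python
def remove_strides(strides, target_os):
    """For each stage, keep its stride while the cumulative output stride
    stays within target_os; afterwards set the stride to 1 and enlarge
    the dilation instead, preserving the receptive field."""
    plan, os_, rate = [], 1, 1
    for s in strides:
        if s > 1 and os_ * s > target_os:
            rate *= s                  # stride cancelled -> dilate instead
            plan.append((1, rate))
        else:
            os_ *= s
            plan.append((s, rate))
    return plan

# ResNet-style stage strides give output stride 32 when unchanged;
# target_os=16 cancels the last stride, target_os=8 cancels the last two.
```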
S05: perform the semantic segmentation task with the trained model.
To demonstrate the effectiveness of the method of the invention, tests were performed on the Cityscapes data set. The base model is ResNet-101, and the semantic segmentation method adopts the DeepLabv3/3+ framework. The semantic segmentation task is evaluated with the mean IoU over the 21 classes used in Cityscapes; the results are shown in Table 2.
TABLE 2
[table reproduced as an image in the original document]
The results show that all four kinds of directional convolution effectively improve the segmentation quality. Their parameter counts are also listed; it can be seen that DirConv-SI and DirConv-SO achieve better results with fewer parameters.
Next, multiple transformations of the image are used to help the neural network make a joint prediction. Since a single input can yield unstable predictions, this embodiment predicts the final result from flipped and multi-scale pictures together with the network whose convolution strides have been adjusted; the results are shown in Table 3.
TABLE 3
[table reproduced as an image in the original document]
In the table above, OS8 means that the stride-2 convolutions are removed at additional positions so that the output stride becomes 8; MS means averaging the predictions over three input scales [0.75, 1, 1.25]; Flip means that the horizontally flipped image is used as well. Directional convolution achieves consistent gains in combination with these methods. The above experiments are based on the DeepLabv3+ model.
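The flip half of this test-time augmentation can be sketched directly; multi-scale inputs would be handled the same way after resizing, which is omitted here because it needs an interpolation routine. `predict` is a placeholder for the trained network, assumed to map a 2-D image (list of rows) to a same-sized 2-D score map:

```python
def tta_average(predict, image):
    """Average the predictions for the original image and its horizontal
    flip; the flipped prediction is flipped back before averaging so the
    two score maps are pixel-aligned."""
    preds = [predict(image)]
    flipped = [row[::-1] for row in image]                  # flip input
    preds.append([row[::-1] for row in predict(flipped)])   # un-flip output
    n = len(preds)
    h, w = len(preds[0]), len(preds[0][0])
    return [[sum(p[i][j] for p in preds) / n for j in range(w)]
            for i in range(h)]
```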
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (6)

1. A semantic segmentation method based on a directional full convolution network is characterized by comprising the following steps:
(1) constructing a directional full convolution network;
(2) appending a pooling layer and a fully connected layer to the top of the constructed directional full convolution network to form a first depth model, and pre-training on a large data set;
(3) extracting the directional full convolution network part of the pre-trained first depth model, using the fully convolutional layers to initialize the parameters of the directional full convolution network, and adding a new fully connected layer to form a second depth model;
(4) training the second depth model on an image semantic segmentation data set until the model converges;
(5) and analyzing the picture to be detected by using the trained second depth model, predicting the category of each pixel in the picture, and forming and outputting a semantic segmentation graph of the picture.
2. The semantic segmentation method based on the directional full convolution network according to claim 1, wherein in step (1) the directional convolution is defined as follows:

y_{c_o} = Σ_{c_i=1}^{C_i} Σ_{s∈S} w_{s,c_i} · x_{s,c_i} + b_{c_o}

where y_{c_o} is the c_o-th output feature, c_i indexes the input features (C_i in total), S is the set of pixel offsets sampled during the convolution, and w_{s,c_i}, x_{s,c_i} and b_{c_o} are respectively the weight, input and bias of the linear operation; S is chosen from the following dynamic sets:

M_k = {(s_1, s_2) | (s_1 - e_1)² + (s_2 - e_2)² ≤ 2²; s_1, s_2 ∈ [-2, 2]; s_1, s_2 ∈ Z} ∪ {(0, 0)}

where k ranges over the integers 0 to 15, representing 16 different directions, and (e_1, e_2) is the endpoint of direction k on a circle of radius 2; the value of S is

S = M_(c_i mod 16)

where c_i is the index of the input channel, and the modulo-16 operation sorts the channels into 16 different groups.
3. The method for semantic segmentation based on the directional full convolution network of claim 1, wherein in the step (2), the large-scale data set is a large-scale image recognition data set ImageNet.
4. The semantic segmentation method based on the directional fully convolutional network as claimed in claim 3, wherein the step (2) comprises the following steps:
(2-1) adding an image pooling layer at the top of the directional full convolution network, turning the three-dimensional feature map into a feature vector, and then using a fully connected layer to map it into a 1000-dimensional vector, corresponding to the 1000 ImageNet image categories;
(2-2) training the constructed first depth model on GPUs, wherein each GPU calculates 32 images at a time, and 8 GPUs are trained in parallel;
(2-3) using the SGDM optimization algorithm with an initial learning rate of 0.256, multiplying the learning rate by 0.1 every 30 epochs, training for 90 epochs in total with the momentum parameter set to 0.9, until the model converges.
5. The method for semantic segmentation based on directional full convolution network of claim 1, wherein in the step (3), the newly added full connection layer adopts random initialization of Gaussian distribution.
6. The semantic segmentation method based on the directional full convolution network of claim 1, wherein the specific process of the step (4) is as follows:
(4-1) inputting the pictures of the training set into a second depth model, and generating a feature map after calculation;
(4-2) replacing the last strided convolution in the second depth model with a non-strided convolution, and setting the dilation of all convolutions after it to 2;
(4-3) enlarging the feature map to the size of the original image by bilinear interpolation;
(4-4) feeding the generated features into a softmax function to obtain the probability distribution of the predictions, computing the gradients of the network parameters with a cross-entropy loss function, and updating the parameter values with the SGDM optimization algorithm; the initial learning rate is set to 10⁻³;
(4-5) repeating the above steps until the model converges.
CN202010669134.8A 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network Active CN111882563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010669134.8A CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010669134.8A CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Publications (2)

Publication Number Publication Date
CN111882563A CN111882563A (en) 2020-11-03
CN111882563B true CN111882563B (en) 2022-05-27

Family

ID=73151747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010669134.8A Active CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Country Status (1)

Country Link
CN (1) CN111882563B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564025B (en) * 2017-08-09 2020-05-29 浙江大学 Electric power equipment infrared image semantic segmentation method based on deep neural network
CN108564587A (en) * 2018-03-07 2018-09-21 浙江大学 A kind of a wide range of remote sensing image semantic segmentation method based on full convolutional neural networks
CN110443805B (en) * 2019-07-09 2021-08-17 浙江大学 Semantic segmentation method based on pixel density
CN110826596A (en) * 2019-10-09 2020-02-21 天津大学 Semantic segmentation method based on multi-scale deformable convolution

Also Published As

Publication number Publication date
CN111882563A (en) 2020-11-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant