CN111882563A - Semantic segmentation method based on directional convolutional network

Semantic segmentation method based on directional convolutional network

Info

Publication number
CN111882563A
CN111882563A
Authority
CN
China
Prior art keywords
network
convolution
directional
semantic segmentation
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010669134.8A
Other languages
Chinese (zh)
Other versions
CN111882563B (en)
Inventor
武伯熹
蔡登
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010669134.8A priority Critical patent/CN111882563B/en
Publication of CN111882563A publication Critical patent/CN111882563A/en
Application granted granted Critical
Publication of CN111882563B publication Critical patent/CN111882563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image


Abstract

The invention discloses a semantic segmentation method based on a directional convolutional network, which comprises the following steps: (1) constructing a full convolution network of directional convolutions; (2) adding a pooling layer and a fully connected layer network on top of the constructed directional full convolution network to form a first depth model, and pre-training it on a large data set; (3) extracting the full convolution part of the pre-trained first depth model, initializing the parameters of the directional full convolution network with those full convolution layers, and adding a new fully connected layer to form a second depth model; (4) training the second depth model on a semantic segmentation data set until the model converges; (5) analyzing the picture to be examined with the trained second depth model, predicting the category of each pixel in the picture, and forming and outputting the semantic segmentation map of the picture. The method encourages the semantic segmentation network to learn the relation between the perception field and its central pixel, and improves the robustness of the trained model.

Description

Semantic segmentation method based on directional convolutional network
Technical Field
The invention belongs to the field of computer vision and image processing, and particularly relates to a semantic segmentation method based on a directional convolutional network.
Background
With the development and deepening research of deep learning theory, many fields and tasks in computer vision have seen rapid breakthroughs and remarkable improvements. Among them, semantic segmentation, owing to its high demands on the fineness of the vision system, is one of the most challenging computer vision tasks and a popular current research direction. A semantic segmentation task requires the vision system to predict, for a picture of arbitrary size, the object class to which each pixel belongs. The current mainstream solution adopts a full convolution network architecture, which starts from the work "Fully Convolutional Networks for Semantic Segmentation" presented at the Conference on Computer Vision and Pattern Recognition in 2014 by Jonathan Long et al. of the University of California, Berkeley. Drawing on experience from the image recognition field, that work processes images using only convolutional layers (a full convolution network) combined with bilinear interpolation, so that the output predictions correspond one-to-one with the input picture pixels. Through end-to-end training under a supervised learning framework, the neural network learns image features far superior to those of traditional learning methods. The work "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation" published by Liang-Chieh Chen et al. at the European Conference on Computer Vision in 2018 introduced the DeepLab v3+ method, a leading solution in this field, which enlarges the effective perception field through techniques such as dilated (atrous) convolution.
However, the effectiveness of the full convolution network is not fully understood, and its predictions still have defects. A careful analysis of the prediction process shows that, for the prediction at a single pixel, the neural network can access all pixels in the perception field (the part of the input that can directly influence the network output), yet it only outputs the class of the central pixel. On the one hand, no mechanism explicitly guides the neural network during training to predict the pixel at the center of the perception field; on the other hand, experimental results show that the full convolution network does learn the association between the perception field and the central pixel from the data. This contrast motivates us to understand the deep mechanisms of convolutional networks and, building on this understanding, to encourage the neural network to pay more attention to the central location, thereby producing a more robust semantic segmentation system.
Disclosure of Invention
The invention provides a semantic segmentation method based on a directional convolutional network, which can promote semantic segmentation to learn the relation between a perception field and a central pixel, improve the robustness of a training model and enable the image semantic segmentation to be more accurate.
A semantic segmentation method based on a directional convolutional network is characterized by comprising the following steps:
(1) constructing a full convolution network of directional convolution;
(2) adding a pooling layer and a fully connected layer network on top of the constructed directional full convolution network to form a first depth model, and pre-training it on a large data set;
(3) extracting the full convolution part of the pre-trained first depth model, initializing the parameters of the directional full convolution network with those full convolution layers, and adding a new fully connected layer to form a second depth model;
(4) training the second depth model using a semantic segmentation data set until the model converges;
(5) analyzing the picture to be examined with the trained second depth model, predicting the category of each pixel in the picture, and forming and outputting the semantic segmentation map of the picture.
The method first constructs a full convolution network that uses only directional convolutions, then pre-trains a deep learning network consisting of the full convolution layers plus a pooling layer and a fully connected layer, which serves as the initialization of the full convolution layers, and finally trains on a semantic segmentation training data set to predict the category of each pixel of an input image. The method promotes the latent task of "predicting the pixel at the center of the perception field" in deep network learning, so that a robust semantic segmentation model is obtained more easily.
In step (1), all ordinary convolutions are replaced by directional convolutions. The directional convolution is defined precisely as follows:
for normal convolution operations, there is a linear transformation as follows:
y_co = Σ_{ci=1..C_i} Σ_{s ∈ S} w_{s,ci} · x_{s,ci} + b_co
wherein y_co is the co-th feature of the output; ci is the index of the input features, of which there are C_i in total; S is the set of pixel offsets sampled during the convolution; and w_{s,ci}, x_{s,ci} and b_co denote, respectively, the weight, the input and the bias of the linear operation. Because ordinary convolution adopts uniform sampling, the offset set S is chosen as:
S = {-1, 0, 1}^2
For directional convolution, the offset set is no longer constant as above, but is chosen from the following family of dynamic sets:
M_k = {(s1, s2) | (s1 - e1)² + (s2 - e2)² ≤ 2²; (e1, e2) = 2·(cos(2πk/16), sin(2πk/16)); s1, s2 ∈ [-2, 2]; s1, s2 ∈ ℤ} ∪ {(0, 0)}
wherein k takes integer values from 0 to 15, representing 16 different directions, and (e1, e2) is the center of the sampling region for the k-th direction; the rule for choosing S is:
S=M(ci%16)
where ci is the index of the input channel; the modulo-16 operation sorts the channels into 16 different groups, one per direction.
This turns the original 3 × 3 square sampling region into sector-shaped regions pointing in different directions. Because the central pixel is always sampled while the surrounding pixels are sampled in turn, the central pixel has more paths along which to transmit information in the resulting computation graph, which raises the attention paid to it.
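To make the construction of M_k concrete, the following sketch enumerates the 16 offset sets in pure Python. Note the hedge: the exact definition of the direction center (e1, e2) appears only as an image in the patent, so placing it on a circle of radius 2 at angle 2πk/16 is an assumption consistent with the "16 different directions" described above, and direction_center, offset_set and dirconv_i_offsets are hypothetical helper names, not the patent's own code.

```python
import math

def direction_center(k: int, radius: float = 2.0):
    # Assumed reconstruction: the defining formula for (e1, e2) is given
    # only as an image in the patent, so the k-th direction center is
    # taken to lie on a circle of radius 2 at angle 2*pi*k/16.
    angle = 2.0 * math.pi * k / 16.0
    return radius * math.cos(angle), radius * math.sin(angle)

def offset_set(k: int):
    """Build M_k: integer offsets within distance 2 of the direction
    center, restricted to the 5x5 window [-2, 2]^2, plus the center."""
    e1, e2 = direction_center(k)
    m = {(s1, s2)
         for s1 in range(-2, 3)
         for s2 in range(-2, 3)
         if (s1 - e1) ** 2 + (s2 - e2) ** 2 <= 2.0 ** 2}
    m.add((0, 0))  # the central pixel is always sampled
    return m

def dirconv_i_offsets(ci: int):
    """DirConv-I: the sampling set for input channel ci is S = M_(ci % 16)."""
    return offset_set(ci % 16)
```

Each M_k is a fan of roughly nine taps pointing in one of the 16 directions, always containing (0, 0), which is exactly the "center pixel sampled all the time, surrounding pixels sampled in turn" behaviour described above.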
The directional convolution described above is named DirConv-I, where I indicates that the direction is selected according to the input dimension. Similarly, DirConv-O selects the direction according to the output dimension, with the convolution offsets:
S=M(co%16)
the above design is based on a variant of the 3 x 3 convolution, which can be treated as a 2 x 2 type convolution to get a slim version of the directional convolution: DirConv-SI and DirConv-SO.
In step (2), to alleviate the heavy data requirements of semantic segmentation, the large-scale image recognition data set ImageNet is adopted, which accelerates the convergence and improves the training quality of the semantic segmentation model.
The specific steps of step (2) are as follows:
(2-1) adding an image pooling layer on top of the full convolution network so that the three-dimensional feature map is reduced to a feature vector, and then transforming the feature vector with a fully connected network into a 1000-dimensional vector, corresponding to the 1000 image categories of ImageNet;
(2-2) training the constructed first depth model on GPUs, with each GPU processing 32 images at a time and 8 GPUs training in parallel;
(2-3) using the SGDM optimization algorithm with an initial learning rate of 0.256; after every 30 epochs the learning rate is reduced to 10% of its value, for 90 epochs of training in total, with the momentum parameter set to 0.9, until the model converges.
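The schedule of step (2-3) can be written down directly. The helpers below are a minimal sketch with names of our own choosing; the momentum update is shown in the standard SGDM form, which the patent does not spell out.

```python
def pretrain_lr(epoch: int, base_lr: float = 0.256) -> float:
    """Step schedule of step (2-3): start at 0.256 and cut the learning
    rate to 10% of its value after every 30 epochs, over 90 epochs."""
    return base_lr * (0.1 ** (epoch // 30))

def sgdm_step(param: float, grad: float, velocity: float,
              lr: float, momentum: float = 0.9):
    """One SGD-with-momentum update (momentum 0.9 as in the patent);
    shown on scalars, but the same rule applies elementwise to tensors."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity
```

So epochs 0-29 train at 0.256, epochs 30-59 at 0.0256, and epochs 60-89 at 0.00256.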
In step (3), the parameters of the directional full convolution network are initialized from the full convolution layers obtained in the previous step; a fully connected layer is then appended that transforms the feature values into a c-dimensional vector, where c is the number of object categories in the target semantic segmentation data set. The newly added fully connected layer is initialized randomly from a Gaussian distribution.
The specific process of step (4) is as follows:
(4-1) inputting the pictures of the training set into a second depth model, and generating a feature map after calculation;
(4-2) replacing the last strided convolution in the network with a non-strided convolution, and setting the dilation rate of all subsequent convolutions to 2;
(4-3) because the convolution strides reduce the image resolution during the network computation, the final feature map is only 1/16 the size of the original image; it therefore needs to be enlarged back to the original size by bilinear interpolation;
(4-4) feeding the generated features into a softmax function to obtain the probability distribution of the prediction, using this probability distribution to compute the gradients of the network parameters with a cross-entropy loss function, and updating the parameter values with the SGDM optimization algorithm; the initial learning rate is set to 10^-3;
(4-5) repeating the above steps until the model converges.
Compared with the prior art, the invention has the following beneficial effects:
1. based on an understanding of the perception field of convolutional networks, the invention designs a novel directional convolutional network layer that highlights the neural network's attention to the center of the perception field, so that the network more easily learns the intrinsic correlation between input and output.
2. the invention has broad applicability: by replacing ordinary convolutional networks with directional convolutional networks, it can be deployed directly and effectively in most existing semantic segmentation techniques without affecting the rest of the method pipeline.
Drawings
FIG. 1 is a schematic flow chart of a semantic segmentation method based on a directional convolutional network according to the present invention;
FIG. 2 is a visualization of convolution kernels of different convolution networks and corresponding perception fields in an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a semantic segmentation method based on a directional convolutional network includes the following steps:
and S01, constructing a full convolution network of the directional convolution.
The deep learning network follows the design of the residual network ResNet-101: parameters such as network depth, network width, image resolution and convolution stride are kept unchanged, and the 3 × 3 convolutions are replaced with directional convolutions. A visualization of directional and ordinary convolution kernels is shown in FIG. 2.
And S02, adding a pooling layer and a fully connected layer network on top of the constructed directional full convolution network, and pre-training on a large data set. The directional full convolution network constructed in step S01 cannot be used directly for pre-training on the image recognition task, because its output feature vectors do not match the format of the ImageNet data set. ImageNet is the ultra-large image recognition data set published by Jia Deng et al. of Stanford University at the Conference on Computer Vision and Pattern Recognition in 2009 in the article "ImageNet: A large-scale hierarchical image database". It collects Internet pictures of 1000 object classes, with image sizes around 256 × 256 and more than 1000 training pictures per class; the training set contains 1,281,167 pictures and the validation set 50,000 pictures.
The full convolution network is connected to an image pooling layer, which reduces the feature map to a feature vector; a fully connected layer then converts the feature vector into a prediction vector of length 1000. Pre-training is done on ImageNet with the same training schedule as ResNet-101. The results after pre-training are shown in Table 1; it can be seen that the directional convolutions achieve the same pre-training quality:
TABLE 1
[Table 1 appears as an image in the original publication: ImageNet pre-training results of the directional convolutions and the baseline.]
And S03, extracting the full convolution part from step S02 for subsequent training; the other parameters are randomly initialized.
And S04, training the model. During training, when the images are too large, convolution strides are used to reduce them to 1/16 resolution; during prediction, the strides can be removed to raise the prediction resolution and produce better results. This difference arises because multiple pictures must be batched together during training.
And S05, performing semantic segmentation tasks by using the trained model.
To demonstrate the effectiveness of the method of the invention, tests were performed on the Cityscapes data set. The base model is ResNet-101, and the semantic segmentation method adopts the DeepLab v3/v3+ framework. The semantic segmentation task is evaluated with the mean IoU over 21 classes of Cityscapes, and the results are shown in Table 2.
TABLE 2
[Table 2 appears as an image in the original publication: Cityscapes mean IoU and parameter counts for the four directional convolutions and the baseline.]
The results show that all four directional convolutions effectively improve the segmentation quality. Their parameter counts are also listed; DirConv-SI and DirConv-SO achieve better results with fewer parameters.
Multiple deformations of the image are further used to help the neural network make a joint prediction. Since a single input can produce unstable predictions, this embodiment combines flipped and multi-scale pictures with the stride-adjusted network to predict the final result; the results are shown in Table 3.
TABLE 3
[Table 3 appears as an image in the original publication: results of combining the directional convolutions with the OS8, MS and Flip settings.]
As shown in the table above, OS8 denotes removing the convolution stride at 2 to 3 positions so that the output stride becomes 8; MS denotes averaging the predictions over three scaled inputs with ratios [0.75, 1, 1.25]; Flip denotes additionally using the horizontally flipped image. The directional convolutions obtain consistent gains in combination with these methods. The above experiments are based on the DeepLab v3+ model.
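The MS and Flip settings amount to test-time augmentation: run the network on several rescaled inputs and on the flipped image, map each prediction back to the input resolution, and average. The sketch below illustrates the idea with our own function names and a nearest-neighbour resize for brevity, where the embodiment would use a proper rescaling.

```python
import numpy as np

def tta_predict(model, image, scales=(0.75, 1.0, 1.25), use_flip=True):
    """Average per-class probability maps over multi-scale and flipped
    inputs (the MS and Flip settings of Table 3). `model` maps a (C, H, W)
    image to a (K, H, W) array of class probabilities."""
    c, h, w = image.shape

    def resize_to(img, H, W):
        # nearest-neighbour resize, kept deliberately simple
        ys = np.minimum(np.arange(H) * img.shape[1] // H, img.shape[1] - 1)
        xs = np.minimum(np.arange(W) * img.shape[2] // W, img.shape[2] - 1)
        return img[:, ys][:, :, xs]

    preds = []
    for s in scales:
        scaled = resize_to(image, max(1, round(h * s)), max(1, round(w * s)))
        variants = [(scaled, False)]
        if use_flip:
            variants.append((scaled[:, :, ::-1], True))
        for v, flipped in variants:
            p = model(v)
            if flipped:
                p = p[:, :, ::-1]             # map the prediction back
            preds.append(resize_to(p, h, w))  # back to input resolution
    return np.mean(preds, axis=0)
```

Averaging in probability space, as here, is one reasonable reading of "averaging after prediction"; averaging logits before the softmax would be an equally plausible variant.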
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only specific embodiments and do not limit the invention; any modifications, additions and equivalents made within the scope of the principles of the present invention shall fall within the scope of protection of the invention.

Claims (6)

1. A semantic segmentation method based on a directional convolutional network is characterized by comprising the following steps:
(1) constructing a full convolution network of directional convolution;
(2) adding a pooling layer and a fully connected layer network on top of the constructed directional full convolution network to form a first depth model, and pre-training it on a large data set;
(3) extracting the full convolution part of the pre-trained first depth model, initializing the parameters of the directional full convolution network with those full convolution layers, and adding a new fully connected layer to form a second depth model;
(4) training the second depth model using a semantic segmentation data set until the model converges;
(5) analyzing the picture to be examined with the trained second depth model, predicting the category of each pixel in the picture, and forming and outputting the semantic segmentation map of the picture.
2. The method for semantic segmentation based on the directional convolutional network as claimed in claim 1, wherein in step (1), the definition of the directional convolution is as follows:
y_co = Σ_{ci=1..C_i} Σ_{s ∈ S} w_{s,ci} · x_{s,ci} + b_co
wherein y_co is the co-th feature of the output; ci is the index of the input features, of which there are C_i in total; S is the set of pixel offsets sampled during the convolution; and w_{s,ci}, x_{s,ci} and b_co denote, respectively, the weight, the input and the bias of the linear operation; the offset set S is selected from the following dynamic sets:
M_k = {(s1, s2) | (s1 - e1)² + (s2 - e2)² ≤ 2²; (e1, e2) = 2·(cos(2πk/16), sin(2πk/16)); s1, s2 ∈ [-2, 2]; s1, s2 ∈ ℤ} ∪ {(0, 0)}
wherein k takes integer values from 0 to 15, representing 16 different directions; the rule for choosing S is:
S=M(ci%16)
where ci is the index of the input channel; the modulo-16 operation sorts the channels into 16 different groups, one per direction.
3. The method for semantic segmentation based on the directional convolutional network as claimed in claim 1, wherein in step (2), the large-scale data set is a large-scale image recognition data set ImageNet.
4. The semantic segmentation method based on the directional convolutional network as claimed in claim 3, wherein the step (2) comprises the following steps:
(2-1) adding an image pooling layer on top of the full convolution network so that the three-dimensional feature map is reduced to a feature vector, and then transforming the feature vector with a fully connected network into a 1000-dimensional vector, corresponding to the 1000 image categories of ImageNet;
(2-2) training the constructed first depth model on GPUs, with each GPU processing 32 images at a time and 8 GPUs training in parallel;
(2-3) using the SGDM optimization algorithm with an initial learning rate of 0.256; after every 30 epochs the learning rate is reduced to 10% of its value, for 90 epochs of training in total, with the momentum parameter set to 0.9, until the model converges.
5. The method for semantic segmentation based on the directional convolutional network as claimed in claim 1, wherein in step (3), the newly added full-link layer is initialized randomly with gaussian distribution.
6. The semantic segmentation method based on the directional convolutional network as claimed in claim 1, wherein the specific process of step (4) is as follows:
(4-1) inputting the pictures of the training set into a second depth model, and generating a feature map after calculation;
(4-2) replacing the last strided convolution in the network with a non-strided convolution, and setting the dilation rate of all subsequent convolutions to 2;
(4-3) amplifying the feature map to the size of the original image by adopting bilinear interpolation;
(4-4) sending the generated features into a softmax function to obtain the probability distribution of the prediction, computing the gradients of the network parameters with a cross-entropy loss function, and updating the parameter values with the SGDM (stochastic gradient descent with momentum) optimization algorithm; the initial learning rate is set to 10^-3;
(4-5) repeating the above steps until the model converges.
CN202010669134.8A 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network Active CN111882563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010669134.8A CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010669134.8A CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Publications (2)

Publication Number Publication Date
CN111882563A true CN111882563A (en) 2020-11-03
CN111882563B CN111882563B (en) 2022-05-27

Family

ID=73151747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010669134.8A Active CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Country Status (1)

Country Link
CN (1) CN111882563B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564025A (en) * 2017-08-09 2018-01-09 浙江大学 A kind of power equipment infrared image semantic segmentation method based on deep neural network
CN108564587A (en) * 2018-03-07 2018-09-21 浙江大学 A kind of a wide range of remote sensing image semantic segmentation method based on full convolutional neural networks
CN110443805A (en) * 2019-07-09 2019-11-12 浙江大学 A kind of semantic segmentation method spent closely based on pixel
CN110826596A (en) * 2019-10-09 2020-02-21 天津大学 Semantic segmentation method based on multi-scale deformable convolution

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564025A (en) * 2017-08-09 2018-01-09 浙江大学 A kind of power equipment infrared image semantic segmentation method based on deep neural network
CN108564587A (en) * 2018-03-07 2018-09-21 浙江大学 A kind of a wide range of remote sensing image semantic segmentation method based on full convolutional neural networks
CN110443805A (en) * 2019-07-09 2019-11-12 浙江大学 A kind of semantic segmentation method spent closely based on pixel
CN110826596A (en) * 2019-10-09 2020-02-21 天津大学 Semantic segmentation method based on multi-scale deformable convolution

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JONATHAN LONG et al.: "Fully Convolutional Networks for Semantic Segmentation", OpenAccess *
KAI ZHU et al.: "One-Shot Texture Retrieval with Global Context Metric", arXiv:1905.06656v2 *
SEBASTIAN SABOGAL et al.: "ReCoN: A Reconfigurable CNN Acceleration Framework for Hybrid Semantic Segmentation on Hybrid SoCs for Space Applications", 2019 IEEE Space Computing Conference (SCC) *
LIN Yun et al.: "Liveness detection algorithm based on semantic segmentation", Journal of Jilin University (Engineering and Technology Edition) *

Also Published As

Publication number Publication date
CN111882563B (en) 2022-05-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant