CN111882563B - Semantic segmentation method based on directional full convolution network - Google Patents


Info

Publication number
CN111882563B
CN111882563B · Application CN202010669134.8A
Authority
CN
China
Prior art keywords
directional
network
semantic segmentation
convolution
full convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010669134.8A
Other languages
Chinese (zh)
Other versions
CN111882563A (en)
Inventor
武伯熹 (Wu Boxi)
蔡登 (Cai Deng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010669134.8A priority Critical patent/CN111882563B/en
Publication of CN111882563A publication Critical patent/CN111882563A/en
Application granted granted Critical
Publication of CN111882563B publication Critical patent/CN111882563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06T 7/11 — Image analysis; Segmentation; Region-based segmentation
    • G06N 3/045 — Neural networks; Architecture; Combinations of networks
    • G06N 3/08 — Neural networks; Learning methods
    • G06T 2207/10004 — Image analysis indexing scheme; Image acquisition modality; Still image; Photographic image


Abstract

The invention discloses a semantic segmentation method based on a directional full convolution network, comprising the following steps: (1) constructing a fully convolutional network built from directional convolutions; (2) appending a pooling layer and a fully connected layer to the top of the constructed directional full convolution network to form a first depth model, and pre-training it on a large-scale data set; (3) extracting the fully convolutional part of the pre-trained first depth model, using its layers to initialize the parameters of the directional full convolution network, and adding a new fully connected layer to form a second depth model; (4) training the second depth model on an image semantic segmentation data set until the model converges; (5) analysing the picture to be processed with the trained second depth model, predicting the category of each pixel, and outputting the resulting semantic segmentation map. The method encourages semantic segmentation to learn the relation between the receptive field and its central pixel, improving the robustness of the trained model.

Description

Semantic segmentation method based on directional full convolution network
Technical Field
The invention belongs to the field of computer vision and image processing, and in particular relates to a semantic segmentation method based on a directional full convolution network.
Background
With the development and deepening study of deep learning theory, many fields and tasks within computer vision have seen rapid breakthroughs and remarkable improvements. Among them, semantic segmentation, owing to its high demands on the fineness of the vision system, is one of the most challenging computer vision tasks and a popular current research direction. The semantic segmentation task requires a computer vision system to predict, for a picture of arbitrary size, the category of the object to which each pixel belongs. The mainstream semantic segmentation solutions adopt a fully convolutional network architecture, originating from the work "Fully Convolutional Networks for Semantic Segmentation" by Jonathan Long et al. of the University of California, Berkeley, presented at the Conference on Computer Vision and Pattern Recognition (CVPR 2015). Drawing on experience from the image recognition field, that work processes images using only convolutional layers (a fully convolutional network), combined with bilinear interpolation so that the output predictions correspond one-to-one with the pixels of the input picture. Trained end-to-end under the supervised learning framework, the network learns image features far superior to those of traditional methods. The DeepLabv3+ method, published by Liang-Chieh Chen et al. at the 2018 European Conference on Computer Vision (ECCV) as "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", is a leading solution in the field; it enlarges the effective receptive field through techniques such as atrous (dilated) convolution.
However, the effectiveness of the fully convolutional network is not fully understood, and its predictions still show defects. A careful analysis of its prediction process reveals that, for the prediction at a single pixel, the neural network has access to all pixels in the receptive field (the part of the input that can directly influence a given output), yet it outputs only the class of the central pixel. On the one hand, no mechanism explicitly guides the network during training to predict the pixel at the centre of the receptive field; on the other hand, experiments show that the fully convolutional network does learn from data the association between the receptive field and its central pixel. This contrast motivates us to understand the underlying mechanism of convolutional networks and, building on that understanding, to encourage the network to pay more attention to the central position, thereby producing a more robust semantic segmentation system.
Disclosure of Invention
The invention provides a semantic segmentation method based on a directional full convolution network, which encourages semantic segmentation to learn the relation between the receptive field and its central pixel, improves the robustness of the trained model, and makes image semantic segmentation more accurate.
A semantic segmentation method based on a directional full convolution network comprises the following steps:
(1) constructing a full convolution network of directional convolution;
(2) appending a pooling layer and a fully connected layer to the top of the constructed directional full convolution network to form a first depth model, and pre-training it on a large-scale data set;
(3) extracting the fully convolutional part of the pre-trained first depth model, using its layers to initialize the parameters of the directional full convolution network, and adding a new fully connected layer to form a second depth model;
(4) training the second depth model on an image semantic segmentation data set until the model converges;
(5) and analyzing the picture to be detected by using the trained second depth model, predicting the category of each pixel in the picture, and forming and outputting a semantic segmentation graph of the picture.
The method first constructs a fully convolutional network that uses only directional convolutions, then pre-trains a deep network consisting of this fully convolutional part plus a pooling layer and a fully connected layer, to serve as the initialization of the fully convolutional layers; it then trains on the semantic segmentation training data set and predicts the category of each pixel of an input image. The method promotes the latent task of "predicting the pixel at the centre of the receptive field" during deep network learning, so that a robust semantic segmentation model emerges more easily.
In step (1), all ordinary convolutions are replaced by directional convolutions. The precise definition of directional convolution is as follows.
An ordinary convolution operation is the linear transformation

y_{c_o} = Σ_{c_i=1}^{C_i} Σ_{s∈S} w_{s,c_i} · x_{s,c_i} + b_{c_o}

where y_{c_o} is the c_o-th output feature, c_i indexes the input features (C_i in total), S is the set of pixel offsets sampled during the convolution, and w_{s,c_i}, x_{s,c_i} and b_{c_o} are respectively the weight, input and bias of the linear operation. Because ordinary convolution samples uniformly, the offset set S is chosen as

S = {0, 1, -1}²
For a directional convolution the offset set is no longer constant as above, but is chosen from the following dynamic sets:

M_k = {(s_1, s_2) | (s_1 - e_1)² + (s_2 - e_2)² ≤ 2²; s_1, s_2 ∈ [-2, 2]; s_1, s_2 ∈ Z} ∪ {(0, 0)}

where k ranges over the integers 0 to 15, representing 16 different directions, and (e_1, e_2) is the endpoint of direction k on a circle of radius 2 (the original formula for (e_1, e_2) appears only as an image in the patent). The value of S is chosen per input channel as

S = M_(c_i mod 16)

where c_i is the index of the input channel, and the modulo-16 operation sorts the channels into 16 different groups.
This turns the original 3 × 3 square sampling region into fan-shaped regions pointing in different directions. Because the centre pixel is always sampled while the surrounding pixels are sampled in turn, the centre pixel has more paths for transmitting information in the resulting computational graph, which raises its attention.
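By way of illustration (not part of the claimed method), the sampling sets M_k can be enumerated directly. This sketch assumes the 16 directions are evenly spaced on a circle of radius 2, i.e. (e1, e2) = (2 cos(kπ/8), 2 sin(kπ/8)) — an assumption, since the patent's formula for (e1, e2) is reproduced only as an image:

```python
import math

def direction_endpoint(k: int):
    """Endpoint (e1, e2) of direction k on a circle of radius 2.

    Assumption: the 16 directions are evenly spaced; the patent's own
    formula for (e1, e2) is not reproduced in the text.
    """
    theta = 2.0 * math.pi * k / 16.0
    return 2.0 * math.cos(theta), 2.0 * math.sin(theta)

def sampling_set(k: int):
    """Offset set M_k: integer points of the 5x5 grid within distance 2
    of the direction endpoint, plus the always-sampled centre (0, 0)."""
    e1, e2 = direction_endpoint(k)
    offsets = {(0, 0)}
    for s1 in range(-2, 3):
        for s2 in range(-2, 3):
            if (s1 - e1) ** 2 + (s2 - e2) ** 2 <= 2 ** 2:
                offsets.add((s1, s2))
    return offsets

# Input channel ci uses S = M_(ci % 16), so channels fall into 16 groups.
```

For k = 0 the set is the 9-point fan {(0,0), (1,-1), (1,0), (1,1), (2,-2), …, (2,2)} — the same number of samples as an ordinary 3 × 3 kernel, but skewed toward one direction with the centre always present.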
The directional convolution described above is named DirConv-I, where I indicates that the direction is selected according to the input dimension. Similarly, DirConv-O selects the direction according to the output dimension, with convolution offsets

S = M_(c_o mod 16)

The above design is a variant of the 3 × 3 convolution; treating it instead as a 2 × 2-type convolution yields slim versions of the directional convolution: DirConv-SI and DirConv-SO.
In step (2), to alleviate the problem of the large data volume required for semantic segmentation, the large-scale data set used is the image recognition data set ImageNet, which accelerates convergence and improves the quality of the semantic segmentation training.
The specific steps of the step (2) are as follows:
(2-1) adding an image pooling layer at the top of the fully convolutional network, turning the three-dimensional feature map into a feature vector, and then using a fully connected layer to map it into a 1000-dimensional vector, corresponding to the 1000 ImageNet image categories;
(2-2) training the constructed first depth model on GPUs, wherein each GPU calculates 32 images at a time, and 8 GPUs are trained in parallel;
(2-3) using the SGDM (SGD with momentum) optimization algorithm with an initial learning rate of 0.256, multiplying the learning rate by 0.1 every 30 epochs, training for 90 epochs in total with the momentum parameter set to 0.9, until the model converges.
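The schedule in (2-3) can be sketched as a simple step-decay function; in a framework such as PyTorch it corresponds to `torch.optim.SGD(momentum=0.9)` combined with `StepLR(step_size=30, gamma=0.1)`:

```python
def learning_rate(epoch: int, base_lr: float = 0.256) -> float:
    """Step schedule from (2-3): the learning rate is multiplied by 0.1
    every 30 epochs; training runs for 90 epochs in total."""
    return base_lr * (0.1 ** (epoch // 30))

# epochs 0-29 -> 0.256, epochs 30-59 -> 0.0256, epochs 60-89 -> 0.00256
```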
In step (3), the parameters of the directional full convolution network are initialized from the fully convolutional layers obtained in the previous step; a fully connected layer is then appended to map the feature values into a c-dimensional vector, where c is the number of object categories in the target semantic segmentation data set. The newly added fully connected layer is randomly initialized from a Gaussian distribution.
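A minimal sketch of this model surgery, using plain dictionaries as stand-ins for a framework's parameter store. The key prefixes "pool." and "fc." and the standard deviation 0.01 are illustrative assumptions, not values from the patent:

```python
import random

def build_second_model(pretrained_params: dict, feat_dim: int,
                       num_classes: int, std: float = 0.01,
                       seed: int = 0) -> dict:
    """Keep the fully convolutional parameters, drop the pooling/FC head
    used for ImageNet pre-training, and append a new fully connected
    layer with Gaussian-initialised weights (std is an assumed value)."""
    rng = random.Random(seed)
    model = {k: v for k, v in pretrained_params.items()
             if not k.startswith(("pool.", "fc."))}   # backbone only
    model["new_fc.weight"] = [[rng.gauss(0.0, std) for _ in range(feat_dim)]
                              for _ in range(num_classes)]  # c x feat_dim
    model["new_fc.bias"] = [0.0] * num_classes
    return model
```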
The specific process of the step (4) is as follows:
(4-1) inputting the pictures of the training set into a second depth model, and generating a feature map after calculation;
(4-2) replacing the last strided convolution in the model network with a non-strided convolution, and setting the dilation of all subsequent convolutions to 2;
(4-3) because the convolution strides reduce the image resolution during the network's computation, the final feature map is 1/16 the size of the original image, so bilinear interpolation is used to enlarge the feature map back to the original image size;
(4-4) feeding the generated features into a softmax function to obtain the probability distribution of the predictions, computing the gradients of the network parameters with a cross-entropy loss function, and updating the parameter values with the SGDM (SGD with momentum) optimization algorithm; the initial learning rate is set to 10⁻³;
(4-5) repeating the above steps until the model converges.
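The per-pixel loss of step (4-4) can be sketched in plain Python; in a framework, the ×16 enlargement of (4-3) would be done first, e.g. with `torch.nn.functional.interpolate(..., mode='bilinear')`:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(logits, target: int) -> float:
    """Cross-entropy loss for one pixel: the negative log-probability
    assigned to the ground-truth class."""
    return -math.log(softmax(logits)[target])
```

During training this loss is averaged over all pixels of the batch, and its gradients drive the SGDM parameter update.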
Compared with the prior art, the invention has the following beneficial effects:
1. Based on an understanding of the receptive field of convolutional networks, the invention designs a novel directional convolution layer that highlights the network's attention to the centre of the receptive field, so that the neural network can more easily learn the intrinsic correlation between input and output.
2. The invention is widely applicable: simply replacing the ordinary convolutional network with a directional convolutional network deploys it directly and effectively in most existing semantic segmentation techniques, without affecting the rest of the pipeline.
Drawings
FIG. 1 is a schematic flow chart of a semantic segmentation method based on a directional convolutional network according to the present invention;
FIG. 2 is a visualization of convolution kernels of different convolution networks and corresponding perception fields in an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a semantic segmentation method based on a directional convolutional network includes the following steps:
and S01, constructing a full convolution network of the directional convolution.
By adopting the design of a deep learning network reference residual error network ResNet-101, parameters such as network depth, network width, image resolution, convolution span and the like are kept unchanged, 3-by-3 type convolution is replaced by directional convolution, and a visual graph of the general convolution of a directional convolution kernel is shown in figure 2.
S02: append a pooling layer and a fully connected layer to the top of the constructed directional full convolution network, and pre-train on a large data set. The directional fully convolutional network constructed in step S01 cannot be used directly for pre-training on the image recognition task, because its output feature vectors do not match the format of the ImageNet data set. ImageNet is the first ultra-large image recognition data set, published by Jia Deng et al. of Stanford University at the 2009 Conference on Computer Vision and Pattern Recognition in the article "ImageNet: A large-scale hierarchical image database". It gathers Internet pictures of 1000 object classes, with image sizes around 256 × 256 and more than 1000 training pictures per class: 1,281,167 pictures in the training set and 50,000 in the validation set.
The fully convolutional network is followed by an image pooling layer, which reduces the feature map to a feature vector, and a fully connected layer then converts the feature vector into a prediction vector of length 1000. Pre-training is done on ImageNet with the same training scheme as ResNet-101. The pre-training results are shown in Table 1; it can be seen that the directional convolutions achieve the same pre-training effect:
TABLE 1
[table reproduced as an image in the original document]
S03: extract the fully convolutional part from step S02 for subsequent training; the remaining parameters are randomly initialized.
S04: train the model. During training, when the images are too large, convolution strides are used to reduce the feature map to 1/16 of the input size; during subsequent prediction the strides can be cancelled, which raises the prediction resolution and produces better results. The difference arises because training must process multiple pictures simultaneously.
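The stride adjustment described above can be planned mechanically. This sketch — a simplified version of the output-stride trick popularised by DeepLab, offered here as an illustration rather than the patent's exact procedure — takes the per-stage strides of a backbone and returns (stride, dilation) pairs, cancelling strides once the target output stride is reached:

```python
def remove_strides(strides, target_os):
    """For each stage, keep its stride while the cumulative output stride
    stays within target_os; afterwards set the stride to 1 and enlarge
    the dilation instead, preserving the receptive field."""
    plan, os_, rate = [], 1, 1
    for s in strides:
        if s > 1 and os_ * s > target_os:
            rate *= s                  # stride cancelled -> dilate instead
            plan.append((1, rate))
        else:
            os_ *= s
            plan.append((s, rate))
    return plan

# ResNet-style stage strides give output stride 32 when unchanged;
# target_os=16 cancels the last stride, target_os=8 cancels the last two.
```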
S05: perform the semantic segmentation task with the trained model.
To demonstrate the effectiveness of the method of the invention, tests were performed on the Cityscapes data set. The base model is ResNet-101, and the semantic segmentation method adopts the DeepLabv3/3+ framework. The semantic segmentation task is evaluated with the mean IoU over the 21 classes used in Cityscapes; the results are shown in Table 2.
TABLE 2
[table reproduced as an image in the original document]
The results show that all four kinds of directional convolution effectively improve the segmentation quality. Their parameter counts are also listed; it can be seen that DirConv-SI and DirConv-SO achieve better results with fewer parameters.
Next, multiple transformations of the image are used to help the neural network make a joint prediction. Since a single input can yield unstable predictions, this embodiment predicts the final result from flipped and multi-scale pictures together with the network whose convolution strides have been adjusted; the results are shown in Table 3.
TABLE 3
[table reproduced as an image in the original document]
In the table above, OS8 means that the stride-2 convolutions are removed at additional positions so that the output stride becomes 8; MS means averaging the predictions over three input scales [0.75, 1, 1.25]; Flip means that the horizontally flipped image is used as well. Directional convolution achieves consistent gains in combination with these methods. The above experiments are based on the DeepLabv3+ model.
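The flip half of this test-time augmentation can be sketched directly; multi-scale inputs would be handled the same way after resizing, which is omitted here because it needs an interpolation routine. `predict` is a placeholder for the trained network, assumed to map a 2-D image (list of rows) to a same-sized 2-D score map:

```python
def tta_average(predict, image):
    """Average the predictions for the original image and its horizontal
    flip; the flipped prediction is flipped back before averaging so the
    two score maps are pixel-aligned."""
    preds = [predict(image)]
    flipped = [row[::-1] for row in image]                  # flip input
    preds.append([row[::-1] for row in predict(flipped)])   # un-flip output
    n = len(preds)
    h, w = len(preds[0]), len(preds[0][0])
    return [[sum(p[i][j] for p in preds) / n for j in range(w)]
            for i in range(h)]
```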
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (6)

1. A semantic segmentation method based on a directional full convolution network is characterized by comprising the following steps:
(1) constructing a directional full convolution network;
(2) appending a pooling layer and a fully connected layer to the top of the constructed directional full convolution network to form a first depth model, and pre-training on a large data set;
(3) extracting the directional full convolution network part of the pre-trained first depth model, using the fully convolutional layers to initialize the parameters of the directional full convolution network, and adding a new fully connected layer to form a second depth model;
(4) training the second depth model on an image semantic segmentation data set until the model converges;
(5) and analyzing the picture to be detected by using the trained second depth model, predicting the category of each pixel in the picture, and forming and outputting a semantic segmentation graph of the picture.
2. The semantic segmentation method based on the directional full convolution network according to claim 1, wherein in step (1) the directional convolution is defined as follows:

y_{c_o} = Σ_{c_i=1}^{C_i} Σ_{s∈S} w_{s,c_i} · x_{s,c_i} + b_{c_o}

where y_{c_o} is the c_o-th output feature, c_i indexes the input features (C_i in total), S is the set of pixel offsets sampled during the convolution, and w_{s,c_i}, x_{s,c_i} and b_{c_o} are respectively the weight, input and bias of the linear operation; S is chosen from the following dynamic sets:

M_k = {(s_1, s_2) | (s_1 - e_1)² + (s_2 - e_2)² ≤ 2²; s_1, s_2 ∈ [-2, 2]; s_1, s_2 ∈ Z} ∪ {(0, 0)}

where k ranges over the integers 0 to 15, representing 16 different directions, and (e_1, e_2) is the endpoint of direction k on a circle of radius 2; the value of S is

S = M_(c_i mod 16)

where c_i is the index of the input channel, and the modulo-16 operation sorts the channels into 16 different groups.
3. The method for semantic segmentation based on the directional full convolution network of claim 1, wherein in the step (2), the large-scale data set is a large-scale image recognition data set ImageNet.
4. The semantic segmentation method based on the directional fully convolutional network as claimed in claim 3, wherein the step (2) comprises the following steps:
(2-1) adding an image pooling layer at the top of the directional full convolution network, turning the three-dimensional feature map into a feature vector, and then using a fully connected layer to map it into a 1000-dimensional vector, corresponding to the 1000 ImageNet image categories;
(2-2) training the constructed first depth model on GPUs, wherein each GPU calculates 32 images at a time, and 8 GPUs are trained in parallel;
(2-3) using the SGDM optimization algorithm with an initial learning rate of 0.256, multiplying the learning rate by 0.1 every 30 epochs, training for 90 epochs in total with the momentum parameter set to 0.9, until the model converges.
5. The method for semantic segmentation based on directional full convolution network of claim 1, wherein in the step (3), the newly added full connection layer adopts random initialization of Gaussian distribution.
6. The semantic segmentation method based on the directional full convolution network of claim 1, wherein the specific process of the step (4) is as follows:
(4-1) inputting the pictures of the training set into a second depth model, and generating a feature map after calculation;
(4-2) replacing the last strided convolution in the second depth model with a non-strided convolution, and setting the dilation of all convolutions after it to 2;
(4-3) enlarging the feature map to the size of the original image by bilinear interpolation;
(4-4) feeding the generated features into a softmax function to obtain the probability distribution of the predictions, computing the gradients of the network parameters with a cross-entropy loss function, and updating the parameter values with the SGDM optimization algorithm; the initial learning rate is set to 10⁻³;
(4-5) repeating the above steps until the model converges.
CN202010669134.8A 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network Active CN111882563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010669134.8A CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010669134.8A CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Publications (2)

Publication Number Publication Date
CN111882563A CN111882563A (en) 2020-11-03
CN111882563B true CN111882563B (en) 2022-05-27

Family

ID=73151747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010669134.8A Active CN111882563B (en) 2020-07-13 2020-07-13 Semantic segmentation method based on directional full convolution network

Country Status (1)

Country Link
CN (1) CN111882563B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564025B (en) * 2017-08-09 2020-05-29 浙江大学 Electric power equipment infrared image semantic segmentation method based on deep neural network
CN108564587A (en) * 2018-03-07 2018-09-21 浙江大学 A kind of a wide range of remote sensing image semantic segmentation method based on full convolutional neural networks
CN110443805B (en) * 2019-07-09 2021-08-17 浙江大学 Semantic segmentation method based on pixel density
CN110826596A (en) * 2019-10-09 2020-02-21 天津大学 Semantic segmentation method based on multi-scale deformable convolution

Also Published As

Publication number Publication date
CN111882563A (en) 2020-11-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant