CN110136141B - Image semantic segmentation method and device oriented to complex environment

Image semantic segmentation method and device oriented to complex environment

Info

Publication number
CN110136141B
CN110136141B (application CN201910333809.9A)
Authority
CN
China
Prior art keywords
convolution
network
features
image
semantic segmentation
Prior art date
Legal status
Active
Application number
CN201910333809.9A
Other languages
Chinese (zh)
Other versions
CN110136141A (en)
Inventor
吴俊君
王嫣然
陈世浪
Current Assignee
Foshan University
Original Assignee
Foshan University
Priority date
Filing date
Publication date
Application filed by Foshan University filed Critical Foshan University
Priority to CN201910333809.9A priority Critical patent/CN110136141B/en
Publication of CN110136141A publication Critical patent/CN110136141A/en
Application granted granted Critical
Publication of CN110136141B publication Critical patent/CN110136141B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and particularly relates to an image semantic segmentation method and device oriented to complex environments. A fine-tuned VGG16 convolutional neural network is first used to generate a base network, and preliminary features of a training image are extracted through it. A hidden layer convolution feature module is connected to the convolution layers of the VGG16 network to generate high-level semantic features, while the preliminary features are fed through pyramid-structured dilated (atrous) convolutions to obtain fine-grained low-level features. The high-level semantic features and the fine-grained low-level features are then fused into a high-resolution feature map. Network training parameters are set and the network is trained by back propagation with the cross-entropy loss function as the objective, establishing a semantic segmentation network. Finally, a test image is input into the semantic segmentation network to generate its semantic segmentation result. The invention overcomes the blurred segmentation boundaries that existing methods produce in complex environments, generates a high-resolution prediction image, and improves the performance of image semantic segmentation in complex environments.

Description

Image semantic segmentation method and device oriented to complex environment
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an image semantic segmentation method and device for a complex environment.
Background
Image semantic segmentation classifies an image according to the semantic content expressed by each pixel. As a fundamental technique of scene understanding, it locates and identifies objects at the pixel level and is crucial for unmanned systems such as intelligent driving and robot cognition with autonomous navigation, unmanned aerial vehicle landing systems, and intelligent security monitoring; it directly determines how accurately these systems understand a scene.
Because traditional semantic segmentation methods understand scenes poorly and work inefficiently when an unmanned system faces an unstructured complex environment, semantic segmentation for complex environments has become a research hot spot in recent years and has produced a series of notable results. In particular, the advent of convolutional neural networks brought substantial progress to image semantic segmentation, with accuracy improved from different angles such as model structure, loss function, and efficiency. However, the accuracy of existing methods is still challenged by the unstructured nature, diverse targets, irregular shapes, and object occlusion found in complex real environments.
Disclosure of Invention
The invention aims to provide an image semantic segmentation method and device oriented to complex environments, which overcome the blurred segmentation boundaries that existing methods produce in complex environments and improve the performance of image semantic segmentation in such environments.
In order to achieve the above object, the present invention provides the following solutions:
a complex environment-oriented image semantic segmentation method comprises the following steps:
step S100, modifying a VGG16 convolutional neural network to generate a base network and extracting preliminary features of a training image through the base network, wherein the convolution layers of the VGG16 convolutional neural network are divided into 5 stages;
step S200, processing the preliminary features obtained by the convolution layers of the first 4 stages of the base network with a hidden layer convolution feature module to generate high-level semantic features;
step S300, processing the preliminary features obtained by the last convolution layer of the base network through pyramid-structured dilated convolution to obtain fine-grained low-level features;
step S400, fusing the high-level semantic features and the fine-grained low-level features to generate a high-resolution feature map;
step S500, setting network training parameters and training the network by back propagation with the cross-entropy loss function as the objective, thereby establishing a semantic segmentation network;
and step S600, inputting a test image into the semantic segmentation network to generate its semantic segmentation result.
Further, in step S100, modifying the VGG16 convolutional neural network to generate the base network specifically comprises:
discarding all fully connected layers and the last pooling layer of the original VGG16 convolutional neural network to construct an end-to-end fully convolutional network;
and performing convolution, pooling, batch normalization, and ReLU operations through the fully convolutional network to obtain the feature map of each convolution layer of the base network, thereby extracting the preliminary features of the image.
Further, step S200 is specifically implemented as follows:
step S210, inputting the feature map into a 1×1 convolution and a 3×3 convolution respectively to obtain convolution features at each scale;
step S220, fusing the convolution features of all scales and applying a ReLU operation to obtain a first result;
step S230, inputting the first result into a 1×1 convolution and adjusting the number of output feature channels to the number of categories, thereby generating the high-level semantic features.
Further, step S300 is specifically implemented as follows:
step S310, inputting the feature map into two groups of dilated convolutions respectively, applying batch normalization and ReLU operations to each, then feeding each into a 1×1 convolution that adjusts the number of output feature channels to the number of categories, generating a first feature map and a second feature map;
step S320, applying convolution, batch normalization, and ReLU operations to the first and second feature maps so as to form a pyramid structure;
and step S330, fusing the first and second feature maps of the pyramid structure to generate the fine-grained low-level features.
Further, in step S400, the high-level semantic features and the fine-grained low-level features are specifically fused by element-wise addition through an eltwise layer to generate the high-resolution feature map.
Further, in step S500, the network training parameters are specifically set as follows:
with the poly learning strategy, the initial learning rate is set to 0.001 and the power to 0.9; the convolution kernel weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01; the biases are initialized to 0; the weight decay is set to 0.0005 and the momentum to 0.9.
An image semantic segmentation device oriented to complex environments, the device comprising:
an extraction unit for modifying a VGG16 convolutional neural network to generate a base network and extracting the preliminary features of a training image through the base network, the convolution layers of the VGG16 convolutional neural network being divided into 5 stages;
a high-level semantic feature unit for processing the preliminary features obtained by the convolution layers of the first 4 stages of the base network with a hidden layer convolution feature module to generate high-level semantic features;
a fine-grained low-level feature unit for processing the preliminary features obtained by the last convolution layer of the base network through pyramid-structured dilated convolution to obtain fine-grained low-level features;
a high-resolution feature map unit for fusing the high-level semantic features and the fine-grained low-level features to generate a high-resolution feature map;
a semantic segmentation network unit for setting network training parameters and training the network by back propagation with the cross-entropy loss function as the objective, thereby establishing a semantic segmentation network;
and a semantic segmentation result unit for inputting a test image into the semantic segmentation network and generating its semantic segmentation result.
The beneficial effects of the invention are as follows. The invention discloses an image semantic segmentation method and device oriented to complex environments. A VGG16 convolutional neural network is first modified to generate a base network, and preliminary features of a training image are extracted through it. A hidden layer convolution feature module then processes the preliminary features from the first 4 stages of VGG16 convolution layers to generate high-level semantic features, while the preliminary features from the last VGG16 convolution layer are processed by pyramid-structured dilated convolution to obtain fine-grained low-level features. The high-level semantic features and the fine-grained low-level features are fused into a high-resolution feature map. Network training parameters are set and the network is trained by back propagation with the cross-entropy loss function as the objective, establishing a semantic segmentation network. Finally, a test image is input into the semantic segmentation network to generate its semantic segmentation result. The invention overcomes the blurred segmentation boundaries that existing methods produce in complex environments, generates a high-resolution prediction image, and improves the performance of image semantic segmentation in complex environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description are obviously only some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of an image semantic segmentation method oriented to complex environments according to an embodiment of the invention;
FIG. 2 is a schematic structural diagram of an image semantic segmentation device oriented to complex environments according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
As shown in FIG. 1, the image semantic segmentation method oriented to complex environments provided by an embodiment of the invention comprises the following steps:
step S100, modifying a VGG16 convolutional neural network to generate a base network and extracting preliminary features of a training image through the base network, wherein the convolution layers of the VGG16 convolutional neural network are divided into 5 stages;
step S200, processing the preliminary features obtained by the convolution layers of the first 4 stages of the base network with a hidden layer convolution feature module to generate high-level semantic features;
step S300, processing the preliminary features obtained by the last convolution layer of the base network through pyramid-structured dilated convolution to obtain fine-grained low-level features;
step S400, fusing the high-level semantic features and the fine-grained low-level features to generate a high-resolution feature map;
step S500, setting network training parameters and training the network by back propagation with the cross-entropy loss function as the objective, thereby establishing a semantic segmentation network;
and step S600, inputting a test image into the semantic segmentation network to generate its semantic segmentation result.
As a preference of this embodiment, in step S100 the VGG16 convolutional neural network is specifically a neural network with learning capability formed by connecting 13 convolution layers and 3 fully connected layers in series, the 13 convolution layers being divided into 5 stages: the first stage comprises two 3×3 convolution layers with an output dimension of 64; the second stage comprises two 3×3 convolution layers with an output dimension of 128; the third stage comprises two 3×3 convolution layers and one 1×1 convolution layer with an output dimension of 256; the fourth stage comprises two 3×3 convolution layers and one 1×1 convolution layer with an output dimension of 512; the fifth stage comprises two 3×3 convolution layers and one 1×1 convolution layer with an output dimension of 512; each stage is followed by a max pooling layer.
In step S100, modifying the VGG16 convolutional neural network to generate the base network specifically comprises:
discarding all fully connected layers and the last pooling layer of the original VGG16 convolutional neural network to construct an end-to-end fully convolutional network;
and performing convolution, pooling, batch normalization, and ReLU operations through the fully convolutional network to obtain the feature map of each convolution layer of the base network, thereby extracting the preliminary features of the image.
In one or more embodiments, the ReLU operation is given by the formula
f(x) = max(0, x), where x is the input and f(x) is the output.
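For illustration only, since the patent discloses no source code, a minimal PyTorch sketch of such a base network follows; the torchvision vgg16_bn backbone, the untrained weights, and the per-stage tap points are assumptions of this sketch rather than features disclosed by the patent.

```python
import torch.nn as nn
from torchvision.models import vgg16_bn

class BaseNetwork(nn.Module):
    """Minimal sketch of the modified base network: VGG16's fully
    connected layers are never used and its last pooling layer is
    dropped, leaving an end-to-end fully convolutional extractor
    (convolution, pooling, batch normalization, ReLU)."""

    def __init__(self):
        super().__init__()
        # vgg16_bn.features holds the convolution layers (with BN and
        # ReLU) and 5 max-pooling layers; [:-1] discards the 5th max-pool.
        self.layers = nn.ModuleList(
            list(vgg16_bn(weights=None).features.children())[:-1]
        )

    def forward(self, x):
        stage_outputs = []               # preliminary features per stage
        for layer in self.layers:
            x = layer(x)
            if isinstance(layer, nn.MaxPool2d):
                stage_outputs.append(x)  # ends of stages 1-4
        stage_outputs.append(x)          # stage-5 output (no final pool)
        return stage_outputs
```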
In one embodiment, step S200 is specifically implemented as follows:
step S210, inputting the feature map into a 1×1 convolution and a 3×3 convolution respectively to obtain convolution features at each scale;
step S220, fusing the convolution features of all scales and applying a ReLU operation to obtain a first result;
step S230, inputting the first result into a 1×1 convolution and adjusting the number of output feature channels to the number of categories, thereby generating the high-level semantic features (see the sketch below).
In one embodiment, step S300 is specifically implemented as follows:
step S310, inputting the feature map into two groups of dilated convolutions respectively, applying batch normalization and ReLU operations to each, then feeding each into a 1×1 convolution that adjusts the number of output feature channels to the number of categories, generating a first feature map and a second feature map;
in a preferred embodiment, the feature map is input into a 3×3 dilated convolution with dilation rate 6, followed by batch normalization and a ReLU operation, and then into a 1×1 convolution that adjusts the number of output feature channels to the number of categories, generating the first feature map;
the feature map is likewise input into a 3×3 dilated convolution with dilation rate 12, followed by batch normalization and a ReLU operation, and then into a 1×1 convolution that adjusts the number of output feature channels to the number of categories, generating the second feature map;
step S320, applying convolution, batch normalization, and ReLU operations to the first and second feature maps so as to form a pyramid structure;
and step S330, fusing the first and second feature maps of the pyramid structure to generate the fine-grained low-level features, as sketched below.
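The two-branch pyramid can be sketched as follows; the additive fusion in S330 and the intermediate channel width are assumptions of this illustration.

```python
import torch.nn as nn

def dilated_branch(in_channels: int, num_classes: int, dilation: int):
    """One pyramid branch: 3x3 dilated convolution -> batch
    normalization -> ReLU -> 1x1 convolution to class channels."""
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3,
                  padding=dilation, dilation=dilation),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, num_classes, kernel_size=1),
    )

class PyramidDilatedModule(nn.Module):
    """Sketch of steps S310-S330: two dilated-convolution branches
    with dilation rates 6 and 12 whose outputs are fused."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.branch_d6 = dilated_branch(in_channels, num_classes, dilation=6)
        self.branch_d12 = dilated_branch(in_channels, num_classes, dilation=12)

    def forward(self, x):
        return self.branch_d6(x) + self.branch_d12(x)  # S330 fusion
```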
In one embodiment, the high-level semantic features and the fine-grained low-level features of step S400 are fused by element-wise addition through an eltwise layer to generate the high-resolution feature map.
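Because both inputs already carry num_classes channels, the eltwise SUM fusion reduces to a tensor addition; in the sketch below the spatial alignment by bilinear upsampling is an assumption, as the patent does not describe a resizing step.

```python
import torch.nn.functional as F

def eltwise_fuse(high_level, fine_grained):
    """Element-wise (eltwise SUM) fusion of the two class-channel
    feature maps; the fine-grained map is resized to the high-level
    resolution (the resizing is an assumption of this sketch)."""
    fine_grained = F.interpolate(fine_grained, size=high_level.shape[2:],
                                 mode='bilinear', align_corners=False)
    return high_level + fine_grained
```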
In one embodiment, in step S500 the network training parameters are specifically set as follows:
with the poly learning strategy, the initial learning rate is set to 0.001 and the power to 0.9; the convolution kernel weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01; the biases are initialized to 0; the weight decay is set to 0.0005 and the momentum to 0.9.
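The poly policy scales the learning rate as base_lr * (1 - iter/max_iter)^power. A sketch of the stated settings follows, reading the original "weight attenuation" as SGD weight decay and "attenuation momentum" as SGD momentum; the training length max_iter and the optimizer choice are assumptions.

```python
import torch

def poly_lr(base_lr: float, cur_iter: int, max_iter: int,
            power: float = 0.9) -> float:
    """Poly learning-rate policy: base_lr * (1 - iter/max_iter)**power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

def init_weights(module: torch.nn.Module) -> None:
    """Gaussian(mean 0, std 0.01) convolution weights, zero biases."""
    if isinstance(module, torch.nn.Conv2d):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.01)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)

# Hypothetical usage on an assembled model:
# model.apply(init_weights)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
#                             momentum=0.9, weight_decay=0.0005)
```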
To measure the prediction performance of the network in this embodiment and verify the accuracy of the semantic segmentation results, the method was run in the following experimental environment: a Dell Precision Tower T7920 workstation configured with CPU: Intel Xeon Silver 4114 (10 cores, 20 threads, base frequency 2.2 GHz); memory: 64 GB; operating system: Ubuntu 16.04 LTS (64-bit); GPU: NVIDIA GeForce GTX 1080 Ti with 11 GB of video memory.
Verification was carried out with the following steps:
step S610, dividing the pictures of the SUN RGB-D dataset into training, validation, and test pictures;
step S620, preprocessing the divided training pictures, specifically by mirroring and randomly cropping them (a sketch follows this list);
step S630, training the network with the training and validation pictures, testing it with the test pictures, and measuring its prediction performance with the pixel accuracy, mean pixel accuracy, and mean intersection-over-union (mIoU) metrics.
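A sketch of step S620 follows; the crop size and the joint image/label handling via torchvision are assumptions of this illustration.

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def preprocess(image, label, crop_size=(480, 640)):
    """Step S620 sketch: random horizontal mirroring and random
    cropping applied jointly to a training picture and its label map.
    The crop size is an assumption; the patent does not state one."""
    if random.random() < 0.5:
        image, label = TF.hflip(image), TF.hflip(label)
    i, j, h, w = T.RandomCrop.get_params(image, output_size=crop_size)
    return TF.crop(image, i, j, h, w), TF.crop(label, i, j, h, w)
```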
The pixel accuracy is the proportion of correctly classified pixels among all pixels.
The mean pixel accuracy refines the pixel accuracy: the proportion of correctly classified pixels is first computed for each class, and these proportions are then averaged over all classes.
The mean intersection-over-union is the ratio of the intersection to the union of two sets; in semantic segmentation it is computed between the ground-truth segmentation and the predicted segmentation, i.e., the number of true positives divided by the total number of true positives, false negatives, and false positives, averaged over classes.
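These three metrics can be computed from a confusion matrix; the sketch below is a standard formulation, not code from the patent.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Metrics from a (C x C) confusion matrix whose entry [i, j]
    counts pixels of ground-truth class i predicted as class j.
    Classes absent from the test data would need a zero-division guard."""
    tp = np.diag(conf).astype(float)
    pixel_acc = tp.sum() / conf.sum()                 # correct / total pixels
    mean_pixel_acc = (tp / conf.sum(axis=1)).mean()   # per-class acc, averaged
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp  # TP + FN + FP
    mean_iou = (tp / union).mean()                    # mIoU
    return pixel_acc, mean_pixel_acc, mean_iou
```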
Experimental tests show that the method of this embodiment can generate a high-resolution prediction image and achieves a certain improvement over some of the better-performing segmentation methods on the SUN RGB-D dataset.
Referring to FIG. 2, an embodiment of the present invention further provides an image semantic segmentation device oriented to complex environments, the device comprising:
an extraction unit 100 for modifying a VGG16 convolutional neural network to generate a base network and extracting preliminary features of a training image through the base network, the convolution layers of the VGG16 convolutional neural network being divided into 5 stages;
a high-level semantic feature unit 200 for processing the preliminary features obtained by the convolution layers of the first 4 stages of the base network with a hidden layer convolution feature module to generate high-level semantic features;
a fine-grained low-level feature unit 300 for processing the preliminary features obtained by the last convolution layer of the base network through pyramid-structured dilated convolution to obtain fine-grained low-level features;
a high-resolution feature map unit 400 for fusing the high-level semantic features and the fine-grained low-level features to generate a high-resolution feature map;
a semantic segmentation network unit 500 for setting network training parameters and training the network by back propagation with the cross-entropy loss function as the objective, thereby establishing a semantic segmentation network;
and a semantic segmentation result unit 600 for inputting a test image into the semantic segmentation network and generating its semantic segmentation result.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the method of the present invention and its core ideas. Modifications made by those of ordinary skill in the art in light of these teachings remain within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (5)

1. An image semantic segmentation method oriented to complex environments, characterized by comprising the following steps:
step S100, modifying a VGG16 convolutional neural network to generate a base network and extracting preliminary features of a training image through the base network, wherein the convolution layers of the VGG16 convolutional neural network are divided into 5 stages;
step S200, processing the preliminary features obtained by the convolution layers of the first 4 stages of the base network with a hidden layer convolution feature module to generate high-level semantic features;
step S300, processing the preliminary features obtained by the last convolution layer of the base network through pyramid-structured dilated convolution to obtain fine-grained low-level features;
step S400, fusing the high-level semantic features and the fine-grained low-level features to generate a high-resolution feature map;
step S500, setting network training parameters and training the network by back propagation with the cross-entropy loss function as the objective, thereby establishing a semantic segmentation network;
step S600, inputting a test image into the semantic segmentation network to generate its semantic segmentation result;
in step S100, modifying the VGG16 convolutional neural network to generate the base network specifically comprises:
discarding all fully connected layers and the last pooling layer of the original VGG16 convolutional neural network to construct an end-to-end fully convolutional network;
performing convolution, pooling, batch normalization, and ReLU operations through the fully convolutional network to obtain the feature map of each convolution layer of the base network, thereby extracting the preliminary features of the image;
step S200 is specifically implemented as follows:
step S210, inputting the feature map into a 1×1 convolution and a 3×3 convolution respectively to obtain convolution features at each scale;
step S220, fusing the convolution features of all scales and applying a ReLU operation to obtain a first result;
step S230, inputting the first result into a 1×1 convolution and adjusting the number of output feature channels to the number of categories, thereby generating the high-level semantic features.
2. The image semantic segmentation method oriented to complex environments according to claim 1, wherein step S300 is specifically implemented as follows:
step S310, inputting the feature map into two groups of dilated convolutions respectively, applying batch normalization and ReLU operations to each, then feeding each into a 1×1 convolution that adjusts the number of output feature channels to the number of categories, generating a first feature map and a second feature map;
step S320, applying convolution, batch normalization, and ReLU operations to the first and second feature maps so as to form a pyramid structure;
and step S330, fusing the first and second feature maps of the pyramid structure to generate the fine-grained low-level features.
3. The image semantic segmentation method oriented to complex environments according to claim 1, wherein in step S400 the high-level semantic features and the fine-grained low-level features are fused by element-wise addition through an eltwise layer to generate the high-resolution feature map.
4. The image semantic segmentation method according to claim 1, wherein in step S500 the network training parameters are specifically set as follows:
with the poly learning strategy, the initial learning rate is set to 0.001 and the power to 0.9; the convolution kernel weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01; the biases are initialized to 0; the weight decay is set to 0.0005 and the momentum to 0.9.
5. An image semantic segmentation device oriented to complex environments, the device comprising:
an extraction unit for modifying a VGG16 convolutional neural network to generate a base network and extracting the preliminary features of a training image through the base network, the convolution layers of the VGG16 convolutional neural network being divided into 5 stages;
a high-level semantic feature unit for processing the preliminary features obtained by the convolution layers of the first 4 stages of the base network with a hidden layer convolution feature module to generate high-level semantic features;
a fine-grained low-level feature unit for processing the preliminary features obtained by the last convolution layer of the base network through pyramid-structured dilated convolution to obtain fine-grained low-level features;
a high-resolution feature map unit for fusing the high-level semantic features and the fine-grained low-level features to generate a high-resolution feature map;
a semantic segmentation network unit for setting network training parameters and training the network by back propagation with the cross-entropy loss function as the objective, thereby establishing a semantic segmentation network;
and a semantic segmentation result unit for inputting a test image into the semantic segmentation network and generating its semantic segmentation result;
wherein modifying the VGG16 convolutional neural network to generate the base network specifically comprises:
discarding all fully connected layers and the last pooling layer of the original VGG16 convolutional neural network to construct an end-to-end fully convolutional network;
performing convolution, pooling, batch normalization, and ReLU operations through the fully convolutional network to obtain the feature map of each convolution layer of the base network, thereby extracting the preliminary features of the image;
and wherein the high-level semantic feature unit is specifically configured to:
input the feature map into a 1×1 convolution and a 3×3 convolution respectively to obtain convolution features at each scale;
fuse the convolution features of all scales and apply a ReLU operation to obtain a first result;
and input the first result into a 1×1 convolution and adjust the number of output feature channels to the number of categories, thereby generating the high-level semantic features.
CN201910333809.9A 2019-04-24 2019-04-24 Image semantic segmentation method and device oriented to complex environment Active CN110136141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910333809.9A CN110136141B (en) 2019-04-24 2019-04-24 Image semantic segmentation method and device oriented to complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910333809.9A CN110136141B (en) 2019-04-24 2019-04-24 Image semantic segmentation method and device oriented to complex environment

Publications (2)

Publication Number Publication Date
CN110136141A CN110136141A (en) 2019-08-16
CN110136141B true CN110136141B (en) 2023-07-11

Family

ID=67571100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910333809.9A Active CN110136141B (en) 2019-04-24 2019-04-24 Image semantic segmentation method and device oriented to complex environment

Country Status (1)

Country Link
CN (1) CN110136141B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079767B (en) * 2019-12-22 2022-03-22 浪潮电子信息产业股份有限公司 Neural network model for segmenting image and image segmentation method thereof
CN111259901A (en) * 2020-01-13 2020-06-09 镇江优瞳智能科技有限公司 Efficient method for improving semantic segmentation precision by using spatial information
CN113496158A (en) * 2020-03-20 2021-10-12 中移(上海)信息通信科技有限公司 Object detection model optimization method, device, equipment and storage medium
CN111340139B (en) * 2020-03-27 2024-03-05 中国科学院微电子研究所 Method and device for judging complexity of image content
CN111444923A (en) * 2020-04-13 2020-07-24 中国人民解放军国防科技大学 Image semantic segmentation method and device under natural scene
CN111428739B (en) * 2020-04-14 2023-08-25 图觉(广州)智能科技有限公司 High-precision image semantic segmentation method with continuous learning capability
CN112116594B (en) * 2020-09-10 2023-12-19 福建省海峡智汇科技有限公司 Semantic segmentation-based wind-drift foreign matter identification method and device
CN112801104B (en) * 2021-01-20 2022-01-07 吉林大学 Image pixel level pseudo label determination method and system based on semantic segmentation
CN113657388B (en) * 2021-07-09 2023-10-31 北京科技大学 Image semantic segmentation method for super-resolution reconstruction of fused image
CN113780297B (en) * 2021-09-15 2024-03-12 北京百度网讯科技有限公司 Image processing method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644426A (en) * 2017-10-12 2018-01-30 中国科学技术大学 Image, semantic dividing method based on pyramid pond encoding and decoding structure
CN109145769A (en) * 2018-08-01 2019-01-04 辽宁工业大学 The target detection network design method of blending image segmentation feature
CN109598269A (en) * 2018-11-14 2019-04-09 天津大学 A kind of semantic segmentation method based on multiresolution input with pyramid expansion convolution

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10068171B2 (en) * 2015-11-12 2018-09-04 Conduent Business Services, Llc Multi-layer fusion in a convolutional neural network for image classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644426A (en) * 2017-10-12 2018-01-30 中国科学技术大学 Image, semantic dividing method based on pyramid pond encoding and decoding structure
CN109145769A (en) * 2018-08-01 2019-01-04 辽宁工业大学 The target detection network design method of blending image segmentation feature
CN109598269A (en) * 2018-11-14 2019-04-09 天津大学 A kind of semantic segmentation method based on multiresolution input with pyramid expansion convolution

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs; George Papandreou; arXiv; pp. 1-14 *
Pyramid Feature Attention Network for Saliency Detection; Ting Zhao; arXiv; pp. 1-10 *
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation; Guosheng Lin; arXiv; pp. 1-10 *

Also Published As

Publication number Publication date
CN110136141A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110136141B (en) Image semantic segmentation method and device oriented to complex environment
CN110335270B (en) Power transmission line defect detection method based on hierarchical regional feature fusion learning
US20230206603A1 (en) High-precision point cloud completion method based on deep learning and device thereof
Chen et al. PCB defect detection method based on transformer-YOLO
CN111242026B (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
WO2021169049A1 (en) Method for glass detection in real scene
CN116503318A (en) Aerial insulator multi-defect detection method, system and equipment integrating CAT-BiFPN and attention mechanism
CN116309313A (en) Battery surface welding defect detection method
CN114897857A (en) Solar cell defect detection method based on light neural network
CN117649448A (en) Intelligent recognition and segmentation method for leakage water of tunnel working face
CN113902792A (en) Building height detection method and system based on improved RetinaNet network and electronic equipment
CN117252817A (en) Transparent conductive film glass surface defect detection method and system
CN116012709B (en) High-resolution remote sensing image building extraction method and system
CN116129158A (en) Power transmission line iron tower small part image recognition method and device
CN115082650A (en) Implementation method of automatic pipeline defect labeling tool based on convolutional neural network
Wang et al. Automated pavement crack detection based on multiscale fully convolutional network
CN117553807B (en) Automatic driving navigation method and system based on laser radar
CN115272814B (en) Long-distance space self-adaptive multi-scale small target detection method
CN115114860B (en) Data modeling amplification method for concrete pipeline damage identification
CN112446267B (en) Setting method of face recognition network suitable for front end
CN117593517B (en) Camouflage target detection method based on complementary perception cross-view fusion network
CN114972373A (en) Target detection method based on parallax segmentation
CN115526840A (en) Infrared image segmentation method and system for typical ground wire clamp of power transmission line
CN118154607A (en) Lightweight defect detection method based on mixed multiscale knowledge distillation
CN118314190A (en) Method and system for constructing anisotropic non-lambertian reflection residual retroreflective model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant