CN113537228B - Real-time image semantic segmentation method based on depth features

Real-time image semantic segmentation method based on depth features

Info

Publication number
CN113537228B
CN113537228B
Authority
CN
China
Prior art keywords
image
semantic segmentation
attention
features
segmented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110767097.9A
Other languages
Chinese (zh)
Other versions
CN113537228A (en)
Inventor
李爽
金一
姜天姣
赵茜
李雅宁
梁晓虎
祝瑞辉
张衡
黄璐
贾浩男
程建强
陈冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 54 Research Institute
Original Assignee
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 54 Research Institute filed Critical CETC 54 Research Institute
Priority to CN202110767097.9A
Publication of CN113537228A
Application granted
Publication of CN113537228B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a real-time image semantic segmentation method based on depth features, and relates to the field of computer vision. The method introduces an attention mechanism into the shallow part of a dual-branch network, so that features can be acquired more efficiently, the computational efficiency of the model is improved, and the introduction of noise is reduced. During training, learning is carried out with an optimizer that combines Adam and LookAhead, which reduces unnecessary computation during model convergence and allows the model to converge to the target condition more quickly. The invention can significantly reduce the computational overhead, so that the model can perform real-time semantic segmentation.

Description

Real-time image semantic segmentation method based on depth features
Technical Field
The invention relates to the field of computer vision, in particular to the field of image semantic segmentation, and provides a real-time image semantic segmentation method based on depth features.
Background
The semantic segmentation of images is a very typical computer vision problem; it is essential for scene understanding and has broad application prospects. With the progress of science and technology, more and more scenes, such as medical image processing, road scene understanding and even game picture processing, require faster semantic segmentation methods. For the task of image semantic segmentation, two types of methods are currently mainstream. The first type comprises traditional semantic segmentation methods, including segmentation based on thresholds, regions and edge detection, segmentation based on genetic algorithms, and the like; these methods are simple and easy to understand, but are easily affected by noise and illumination in the image, which degrades the segmentation result, or they cannot obtain classification information for regions. The second type comprises the deep learning methods that are currently the focus of active research: convolutional neural networks have developed rapidly with the development of neural networks and the improvement of computing performance, and the proposal of the fully convolutional network quickly advanced deep learning in the field of computer vision. On this basis, the SegNet model adopts a symmetrical encoder-decoder structure that records the positions of features during down-sampling in the training process and restores them during up-sampling, improving the resolution of the model output; dilated (atrous) convolution inserts holes into the convolution kernel, enlarging the receptive field of each output unit without increasing the number of parameters; the RefineNet model collects image information along multiple paths using the features of each level, exploits features of different levels across the whole image as much as possible, and performs semantic segmentation with long-range residual connections; DeepLab v3 adds a Batch Normalization layer and designs parallel and serial atrous convolution modules to classify objects at multiple scales.
However, existing image semantic segmentation methods have large parameter counts, require considerable hardware resources and long training times, and are also slow at test time. In addition, the optimization algorithm does not move in the overall optimization direction at every iteration during training; due to frequent updates, the loss function oscillates strongly and is noisy. As a result, current deep-learning-based semantic segmentation techniques lack real-time performance and are difficult to apply widely.
Disclosure of Invention
In view of the above, the invention provides a depth-feature-based real-time image semantic segmentation method with low computational overhead, strong feature extraction capability and fast convergence.
In order to achieve the purpose, the invention adopts the technical scheme that:
a real-time image semantic segmentation method based on depth features comprises the following steps:
(1) Carrying out data standardization and image cropping transformation on the images to be segmented in the training set (a preprocessing sketch follows these steps), inputting the processed images to be segmented into an image semantic segmentation network comprising a channel attention module and a spatial attention module, and carrying out forward propagation to obtain a semantically segmented image;
(2) Calculating the loss between the semantically segmented image and the target image, performing back propagation on the image semantic segmentation network using the loss, updating the weights of the image semantic segmentation network, and returning to step (1) until the set number of iterations is reached, to obtain a trained image semantic segmentation network;
(3) Loading the data of the test set, processing it through the trained image semantic segmentation network to obtain the image semantic segmentation result, calculating an evaluation index, and judging the performance of the image semantic segmentation network according to the evaluation index; returning to step (1) if the performance does not meet the expected requirement, and saving the model if it does.
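As a concrete illustration of the preprocessing in step (1), the following is a minimal torchvision sketch. The normalization statistics (ImageNet means and standard deviations) and the 512 × 1024 crop size are illustrative assumptions; the patent does not specify these values.

    import torchvision.transforms as T

    # sketch of the step (1) preprocessing pipeline (values are assumptions)
    preprocess = T.Compose([
        T.RandomCrop((512, 1024)),                  # image cropping transformation
        T.ToTensor(),                               # to a [0, 1] float tensor
        T.Normalize(mean=[0.485, 0.456, 0.406],     # data standardization
                    std=[0.229, 0.224, 0.225]),
    ])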
Further, the forward propagation of the image semantic segmentation network in step (1) is specifically as follows: the input image to be segmented, of size h × w × c, passes through a standard convolution layer and a depthwise convolution layer and then enters an attention learning module, obtaining a feature representation that incorporates the attention mechanism; this feature representation then enters a depthwise separable convolution layer, whose output is divided into two branches. One branch sequentially passes through a bottleneck module and a pyramid pooling module, and then through up-sampling, a depthwise convolution layer and an ordinary convolution layer to obtain output features; the other branch passes through an ordinary convolution layer to obtain output features. The output features of the two branches are added, and a nonlinear transformation is applied with an activation function. Finally, two depthwise separable convolutions, one convolution with a 1 × 1 kernel and one up-sampling operation are performed in sequence to obtain the segmented image. Here h is the image height, w is the image width, and c is the number of image channels.
Further, the ratio of the number of channels inside the bottleneck module to the number of channels at its input is set to 6, the stride is 2, and three 1 × 1 convolution kernels and three 3 × 3 convolution kernels are used.
Further, the attention learning module comprises a channel attention learning module and a spatial attention learning module. The features are input into the channel attention learning module, where global max pooling and global average pooling are performed to obtain two 1 × C channel descriptions, C being the number of channels. The two channel descriptions are sent through a two-layer fully-connected network to obtain two features, the two features are added, and the weight M_c(F) is obtained through a Sigmoid activation function:

M_c(F) = σ(MLP(MaxPool(F)) + MLP(AvgPool(F)))

The weight M_c(F) is multiplied with the input features to obtain intermediate features, which are input into the spatial attention learning module. There, max pooling and average pooling along the channel dimension are performed to obtain two h × w × 1 descriptions, the two descriptions are concatenated along the channel dimension, passed through a convolution layer, and the weight M_s(F) is obtained after a Sigmoid activation function:

M_s(F) = σ(f^{7×7}([MaxPool(F), AvgPool(F)]))

where F is the input feature and MLP denotes the fully-connected layers.

Finally, the weight M_s(F) is multiplied with the intermediate features to obtain the feature representation incorporating the attention mechanism.
Further, in the two-layer fully-connected network, the number of neurons in the first layer is C/r, where r is the reduction ratio and the activation function is ReLU, and the number of neurons in the second layer is C.
Further, in step (2), the loss between the semantically segmented image and the target image is calculated, back propagation of the image semantic segmentation network is performed using the loss, and the network weights are updated, specifically:
the loss is calculated using a cross-entropy loss function, with the following formula:

L = -Σ_j y_j log(P_j)

where j denotes a pixel point to be inferred, y_j represents the correct value, i.e. the target image, and P_j represents the predicted value, i.e. the semantically segmented image;
after the loss function value is obtained, the image semantic segmentation network parameters are updated through back propagation using an optimizer that fuses Adam and LookAhead.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a novel real-time semantic segmentation attention learning network based on depth features, which applies an attention mechanism when extracting shallow global features and connects a channel attention module and a spatial attention module in series to obtain the key attention regions, reducing the number of parameters required by the network model and effectively reducing training time and hardware consumption.
(2) The invention provides a more efficient optimizer for back propagation, in which the Adam optimizer is fused into the LookAhead algorithm: the search direction for weight updates is selected from the fast-weight sequence generated by the Adam optimizer, while the lagged slow-weight updates provide longer-term stability for the model and improve its convergence speed.
Drawings
FIG. 1 is a flowchart of a segmentation method according to an embodiment of the present invention.
Fig. 2 is a diagram of a network model structure according to an embodiment of the present invention.
FIG. 3 is a diagram of an attention learning module according to an embodiment of the present invention.
Fig. 4 is a flow chart of the Adam optimizer in the embodiment of the present invention.
Fig. 5 is a flowchart of the LookAhead algorithm in the embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings. Fig. 1 shows the depth-feature-based real-time image semantic segmentation method disclosed in an embodiment of the present invention, which comprises the following steps:
Step 1: carry out data standardization and image cropping transformation on the images to be segmented in the training set, input the processed images to be segmented into an image semantic segmentation network comprising a channel attention module and a spatial attention module, and carry out forward propagation to obtain a semantically segmented image.
as shown in fig. 2, the forward propagation of the image semantic segmentation network specifically includes: the input image h x w x c to be segmented passes through a standard convolution layer and a depth convolution layer and then enters an attention mechanics learning module to obtain a characteristic representation introducing an attention mechanism, the characteristic representation enters the depth separable convolution layer after passing through the attention mechanics learning module, and the output is divided into two branches; one branch passes through a bottleeck module and a pyramid pooling module in sequence, then goes through upsampling processing, a depth convolution layer and a common convolution layer to obtain output characteristics, and the other branch passes through a common convolution layer to obtain output characteristics; adding the output characteristics of the two branches, and performing nonlinear transformation by using an activation function; finally, sequentially performing two depth separable convolutions, one convolution with convolution kernel size of 1 multiplied by 1 and one up-sampling operation to obtain a segmented image; where h is the image height, w is the image width, and c is the number of image channels.
The ratio of the number of channels inside the bottleneck module to the number of channels at its input is set to 6, the stride is 2, and three 1 × 1 convolution kernels and three 3 × 3 convolution kernels are used.
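One plausible reading of this configuration is a MobileNetV2-style inverted residual block, sketched below: a 1 × 1 expansion to six times the input channels, a strided 3 × 3 depthwise convolution, and a 1 × 1 linear projection. How the three 1 × 1 and three 3 × 3 kernels are distributed across blocks is not fully specified in the patent, so this single-block structure is an assumption.

    import torch.nn as nn

    class Bottleneck(nn.Module):
        def __init__(self, cin, cout, t=6, stride=2):
            super().__init__()
            mid = cin * t                              # internal/input channel ratio = 6
            self.block = nn.Sequential(
                nn.Conv2d(cin, mid, 1, bias=False),    # 1x1 expansion convolution
                nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
                nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # 3x3 depthwise, stride 2
                nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
                nn.Conv2d(mid, cout, 1, bias=False),   # 1x1 linear projection
                nn.BatchNorm2d(cout))
            self.use_res = stride == 1 and cin == cout # residual only when shapes match

        def forward(self, x):
            out = self.block(x)
            return x + out if self.use_res else out

With cin=64 and cout=96, this slots into the deep branch of the network sketch above.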
As shown in fig. 3, the attention learning module comprises a channel attention learning module and a spatial attention learning module. The features are input into the channel attention learning module, where global max pooling and global average pooling are performed to obtain two 1 × C channel descriptions, C being the number of channels. The two channel descriptions are sent through a two-layer fully-connected network to obtain two features, the two features are added, and the weight M_c(F) is obtained through a Sigmoid activation function:

M_c(F) = σ(MLP(MaxPool(F)) + MLP(AvgPool(F)))

The weight M_c(F) is multiplied with the input features to obtain intermediate features, which are input into the spatial attention learning module. There, max pooling and average pooling along the channel dimension are performed to obtain two h × w × 1 descriptions, the two descriptions are concatenated along the channel dimension, passed through a 7 × 7 convolution layer, and the weight M_s(F) is obtained after a Sigmoid activation function:

M_s(F) = σ(f^{7×7}([MaxPool(F), AvgPool(F)]))

where F is the input feature and MLP denotes the fully-connected layers.

Finally, the weight M_s(F) is multiplied with the intermediate features to obtain the feature representation incorporating the attention mechanism.
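This serial channel-then-spatial design matches the CBAM formulation, and a compact PyTorch sketch is given below. The reduction ratio r = 16 is a common default assumed here; the patent leaves r as a free parameter.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ChannelAttention(nn.Module):
        def __init__(self, c, r=16):                   # r: reduction ratio (assumed value)
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True),
                                     nn.Linear(c // r, c))  # shared MLP: C -> C/r -> C

        def forward(self, x):
            n, c = x.shape[:2]
            mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(n, c))  # global max pooling
            av = self.mlp(F.adaptive_avg_pool2d(x, 1).view(n, c))  # global average pooling
            return torch.sigmoid(mx + av).view(n, c, 1, 1)         # M_c(F)

    class SpatialAttention(nn.Module):
        def __init__(self, k=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, k, padding=k // 2)         # the 7x7 convolution

        def forward(self, x):
            mx = x.max(dim=1, keepdim=True).values                 # max pool over channels
            av = x.mean(dim=1, keepdim=True)                       # avg pool over channels
            return torch.sigmoid(self.conv(torch.cat([mx, av], dim=1)))  # M_s(F)

    class AttentionModule(nn.Module):
        # channel attention followed by spatial attention, applied in series
        def __init__(self, c, r=16):
            super().__init__()
            self.ca, self.sa = ChannelAttention(c, r), SpatialAttention()

        def forward(self, x):
            x = x * self.ca(x)                         # intermediate features = M_c(F) * F
            return x * self.sa(x)                      # output = M_s(F) * intermediate

An instance such as AttentionModule(32) can be passed as the attn argument of the network sketch above.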
Step 2: calculate the loss between the semantically segmented image and the target image, perform back propagation on the image semantic segmentation network using the loss, and update the network weights; return to step 1 until the set number of iterations is reached, obtaining a trained image semantic segmentation network.
the loss result is calculated using a cross-entropy loss function, the formula is as follows:
Figure BDA0003151173680000051
wherein j represents a pixel point requiring inference, y i Indicating the correct value, i.e. the target image, P j Representing a predicted value, namely a semantically segmented image;
and after a loss function value is obtained, updating the image semantic segmentation network parameters by adopting an optimizer fusing Adam and LookAhead through back propagation. The Adam optimizer flow is shown in fig. 4. On the basis of Adam, the Adam is fused into a LookAhead algorithm to reduce variance, the algorithm flow of the LookAhead is shown in figure 5, the rapid weight is updated by the Adam algorithm, a new round of learning is started after gradient back propagation is completed until a preset iteration number is reached, the model effect is tested after the preset iteration number is reached, if the target requirement is not met, the super-parameter configuration of the model is adjusted, and if the target requirement is reached, the model is stored.
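A minimal sketch of this optimizer combination follows, implementing the published LookAhead rule around an inner Adam optimizer: the fast weights take k Adam steps, after which the slow weights are interpolated toward them and copied back. The values k = 5 and alpha = 0.5 are the usual LookAhead defaults rather than values given in the patent.

    import torch
    import torch.nn.functional as F

    class Lookahead:
        def __init__(self, inner, k=5, alpha=0.5):
            self.inner, self.k, self.alpha, self.steps = inner, k, alpha, 0
            # slow-weight snapshot, one tensor per parameter of the inner optimizer
            self.slow = [[p.detach().clone() for p in g['params']]
                         for g in inner.param_groups]

        def zero_grad(self):
            self.inner.zero_grad()

        def step(self):
            self.inner.step()                          # fast-weight update by Adam
            self.steps += 1
            if self.steps % self.k == 0:               # every k fast steps
                for group, slow_group in zip(self.inner.param_groups, self.slow):
                    for p, slow in zip(group['params'], slow_group):
                        slow += self.alpha * (p.detach() - slow)  # slow += a * (fast - slow)
                        p.data.copy_(slow)             # reset fast weights to slow weights

    def train_one_epoch(model, loader, opt):
        for images, targets in loader:
            opt.zero_grad()
            logits = model(images)                     # (N, num_classes, h, w) score map
            loss = F.cross_entropy(logits, targets)    # pixel-wise cross-entropy loss
            loss.backward()                            # gradient back propagation
            opt.step()                                 # Adam fast step + LookAhead slow step

    # usage: opt = Lookahead(torch.optim.Adam(model.parameters(), lr=1e-3))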
Step 3: load the data of the test set and process it through the trained image semantic segmentation network to obtain the image semantic segmentation result; calculate the evaluation index and judge the performance of the image semantic segmentation network according to it; return to step 1 if the performance does not meet the expected requirement, and save the model if it does.
This completes the depth-feature-based real-time image semantic segmentation.

Claims (5)

1. A real-time image semantic segmentation method based on depth features is characterized by comprising the following steps:
(1) Carrying out data standardization and image cropping transformation on the images to be segmented in the training set, inputting the processed images to be segmented into an image semantic segmentation network comprising a channel attention module and a spatial attention module, and obtaining a semantically segmented image through forward propagation;
(2) Calculating the loss between the semantically segmented image and the target image, performing back propagation on the image semantic segmentation network using the loss, updating the weights of the image semantic segmentation network, and returning to step (1) until the set number of iterations is reached, to obtain a trained image semantic segmentation network;
(3) Loading the data of the test set, processing it through the trained image semantic segmentation network to obtain the image semantic segmentation result, calculating an evaluation index, and judging the performance of the image semantic segmentation network according to the evaluation index; returning to step (1) if the performance does not meet the expected requirement, and saving the model if it does;
the forward propagation of the image semantic segmentation network in step (1) is specifically as follows: the input image to be segmented, of size h × w × c, passes through a standard convolution layer and a depthwise convolution layer and then enters an attention learning module, obtaining a feature representation that incorporates the attention mechanism; the feature representation then enters a depthwise separable convolution layer, whose output is divided into two branches; one branch sequentially passes through a bottleneck module and a pyramid pooling module, and then through up-sampling, a depthwise convolution layer and an ordinary convolution layer to obtain output features, and the other branch passes through an ordinary convolution layer to obtain output features; the output features of the two branches are added, and a nonlinear transformation is applied with an activation function; finally, two depthwise separable convolutions, one convolution with a 1 × 1 kernel and one up-sampling operation are performed in sequence to obtain the segmented image; where h is the image height, w is the image width, and c is the number of image channels.
2. The real-time image semantic segmentation method based on depth features as claimed in claim 1, wherein the ratio of the number of channels inside the bottleneck module to the number of channels at its input is set to 6, the stride is 2, and three 1 × 1 convolution kernels and three 3 × 3 convolution kernels are used.
3. The real-time image semantic segmentation method based on depth features as claimed in claim 1, wherein the attention learning module comprises a channel attention learning module and a spatial attention learning module; the features are input into the channel attention learning module, where global max pooling and global average pooling are performed to obtain two 1 × C channel descriptions, C being the number of channels; the two channel descriptions are respectively sent through a two-layer fully-connected network to obtain two features, the two features are added, and the weight M_c(F) is obtained through a Sigmoid activation function:

M_c(F) = σ(MLP(MaxPool(F)) + MLP(AvgPool(F)))

the weight M_c(F) is multiplied with the input features to obtain intermediate features, which are input into the spatial attention learning module; there, max pooling and average pooling along the channel dimension are performed to obtain two h × w × 1 descriptions, the two descriptions are concatenated along the channel dimension, passed through a convolution layer, and the weight M_s(F) is obtained after a Sigmoid activation function:

M_s(F) = σ(f^{7×7}([MaxPool(F), AvgPool(F)]))

where F is the input feature and MLP denotes the fully-connected layers;

finally, the weight M_s(F) is multiplied with the intermediate features to obtain the feature representation incorporating the attention mechanism.
4. The method as claimed in claim 3, wherein in the two-layer fully-connected network the number of neurons in the first layer is C/r, where r is the reduction ratio and the activation function is ReLU, and the number of neurons in the second layer is C.
5. The real-time image semantic segmentation method based on depth features as claimed in claim 1, wherein in step (2) the loss between the semantically segmented image and the target image is calculated, back propagation of the image semantic segmentation network is performed using the loss, and the network weights are updated, specifically:

the loss is calculated using a cross-entropy loss function, with the following formula:

L = -Σ_j y_j log(P_j)

where j denotes a pixel point to be inferred, y_j represents the correct value, i.e. the target image, and P_j represents the predicted value, i.e. the semantically segmented image;

after the loss function value is obtained, the image semantic segmentation network parameters are updated through back propagation using an optimizer that fuses Adam and LookAhead.
CN202110767097.9A 2021-07-07 2021-07-07 Real-time image semantic segmentation method based on depth features Active CN113537228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110767097.9A CN113537228B (en) 2021-07-07 2021-07-07 Real-time image semantic segmentation method based on depth features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110767097.9A CN113537228B (en) 2021-07-07 2021-07-07 Real-time image semantic segmentation method based on depth features

Publications (2)

Publication Number Publication Date
CN113537228A CN113537228A (en) 2021-10-22
CN113537228B true CN113537228B (en) 2022-10-21

Family

ID=78126983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110767097.9A Active CN113537228B (en) 2021-07-07 2021-07-07 Real-time image semantic segmentation method based on depth features

Country Status (1)

Country Link
CN (1) CN113537228B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140469B (en) * 2021-12-02 2023-06-23 北京交通大学 Depth layered image semantic segmentation method based on multi-layer attention

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008979A (en) * 2019-12-09 2020-04-14 杭州凌像科技有限公司 Robust night image semantic segmentation method
CN111951288A (en) * 2020-07-15 2020-11-17 南华大学 Skin cancer lesion segmentation method based on deep learning
CN112233129A (en) * 2020-10-20 2021-01-15 湘潭大学 Deep learning-based parallel multi-scale attention mechanism semantic segmentation method and device
CN112508066A (en) * 2020-11-25 2021-03-16 四川大学 Hyperspectral image classification method based on residual error full convolution segmentation network
CN112633186A (en) * 2020-12-26 2021-04-09 上海有个机器人有限公司 Method, device, medium and robot for dividing drivable road surface in indoor environment
CN112818862A (en) * 2021-02-02 2021-05-18 南京邮电大学 Face tampering detection method and system based on multi-source clues and mixed attention

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111680695A (en) * 2020-06-08 2020-09-18 河南工业大学 Semantic segmentation method based on reverse attention model
CN112651973B (en) * 2020-12-14 2022-10-28 南京理工大学 Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN113160229A (en) * 2021-03-15 2021-07-23 西北大学 Pancreas segmentation method and device based on hierarchical supervision cascade pyramid network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008979A (en) * 2019-12-09 2020-04-14 杭州凌像科技有限公司 Robust night image semantic segmentation method
CN111951288A (en) * 2020-07-15 2020-11-17 南华大学 Skin cancer lesion segmentation method based on deep learning
CN112233129A (en) * 2020-10-20 2021-01-15 湘潭大学 Deep learning-based parallel multi-scale attention mechanism semantic segmentation method and device
CN112508066A (en) * 2020-11-25 2021-03-16 四川大学 Hyperspectral image classification method based on residual error full convolution segmentation network
CN112633186A (en) * 2020-12-26 2021-04-09 上海有个机器人有限公司 Method, device, medium and robot for dividing drivable road surface in indoor environment
CN112818862A (en) * 2021-02-02 2021-05-18 南京邮电大学 Face tampering detection method and system based on multi-source clues and mixed attention

Also Published As

Publication number Publication date
CN113537228A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN111354017B (en) Target tracking method based on twin neural network and parallel attention module
CN108681752B (en) Image scene labeling method based on deep learning
CN113313657B (en) Unsupervised learning method and system for low-illumination image enhancement
WO2022027937A1 (en) Neural network compression method, apparatus and device, and storage medium
CN110349185B (en) RGBT target tracking model training method and device
CN111612008A (en) Image segmentation method based on convolution network
Kao et al. Automatic image cropping with aesthetic map and gradient energy map
CN111861906A (en) Pavement crack image virtual augmentation model establishment and image virtual augmentation method
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN113554599B (en) Video quality evaluation method based on human visual effect
CN115249332A (en) Hyperspectral image classification method and device based on space spectrum double-branch convolution network
CN107564007B (en) Scene segmentation correction method and system fusing global information
CN113344933B (en) Glandular cell segmentation method based on multi-level feature fusion network
CN114140469A (en) Depth hierarchical image semantic segmentation method based on multilayer attention
CN114897782B (en) Gastric cancer pathological section image segmentation prediction method based on generation type countermeasure network
CN113421237A (en) No-reference image quality evaluation method based on depth feature transfer learning
CN113537228B (en) Real-time image semantic segmentation method based on depth features
CN113205103A (en) Lightweight tattoo detection method
CN113283524A (en) Anti-attack based deep neural network approximate model analysis method
CN114299305B (en) Saliency target detection algorithm for aggregating dense and attention multi-scale features
CN115171074A (en) Vehicle target identification method based on multi-scale yolo algorithm
CN111723852A (en) Robust training method for target detection network
CN113609904B (en) Single-target tracking algorithm based on dynamic global information modeling and twin network
CN117058079A (en) Thyroid imaging image automatic diagnosis method based on improved ResNet model
CN116433721A (en) Outdoor RGB-T target tracking algorithm based on pseudo fusion feature generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant