CN113421268A - Semantic segmentation method based on DeepLabv3+ network with multi-level channel attention mechanism - Google Patents

Semantic segmentation method based on DeepLabv3+ network with multi-level channel attention mechanism

Info

Publication number
CN113421268A
CN113421268A
Authority
CN
China
Prior art keywords
level
network
feature map
attention mechanism
channel attention
Prior art date
Legal status
Granted
Application number
CN202110637809.5A
Other languages
Chinese (zh)
Other versions
CN113421268B (en)
Inventor
欧晓焱
葛琦
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202110637809.5A
Publication of CN113421268A
Application granted
Publication of CN113421268B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/045 Combinations of networks
                • G06N3/048 Activation functions
              • G06N3/08 Learning methods
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T5/00 Image enhancement or restoration
            • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
          • G06T7/00 Image analysis
            • G06T7/10 Segmentation; Edge detection
          • G06T2207/00 Indexing scheme for image analysis or image enhancement
            • G06T2207/20 Special algorithmic details
              • G06T2207/20081 Training; Learning
              • G06T2207/20084 Artificial neural networks [ANN]
              • G06T2207/20212 Image combination
                • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method based on a DeepLabv3+ network with a multi-level channel attention mechanism, belonging to the technical field of image processing and segmentation, comprising the following steps. Step 1: input an image. Step 2: acquire high-level and low-level semantic features of the input image through a deep convolutional neural network. Step 3: send the high-level semantic features to the atrous spatial pyramid pooling module to obtain a first feature map. Step 4: send the first feature map obtained in step 3 into the multi-level channel attention mechanism module to obtain a second feature map. Step 5: perform bilinear interpolation upsampling on the second feature map obtained in step 4 and merge it with the low-level semantic features obtained in step 2. Step 6: perform bilinear interpolation upsampling on the merged feature map from step 5 again. Step 7: output the final prediction result. The method improves the accuracy of semantic segmentation while reducing the size of the network model and increasing the recognition speed, so as to meet the real-time requirements of mobile applications.

Description

Semantic segmentation method based on DeepLabv3+ network with multi-level channel attention mechanism
Technical Field
The invention relates to a semantic segmentation method based on a DeepLabv3+ network with a multi-level channel attention mechanism, and belongs to the technical field of image processing and segmentation.
Background
Semantic segmentation classifies an image at the pixel level, predicting the class to which each pixel belongs; it is one of the key problems in computer vision today. As convolutional neural networks (CNNs) and deep learning have shown excellent performance in the field of computer vision, more and more research builds semantic segmentation models on them.
However, in complex environments, factors such as occlusion, posture tilt, and unbalanced illumination greatly reduce the accuracy of object edge segmentation, which must be improved by adding an additional loss function and making reasonable use of context modeling. Attention mechanisms in computer vision let a model learn what to attend to, filtering out a large amount of irrelevant information through a top-down information selection mechanism; used appropriately in a semantic segmentation model, they can capture the connections between local and global features and between high-level and low-level semantic features more effectively. Meanwhile, CNN models with many parameters are large and slow to classify and segment, and as mobile applications become widespread, higher demands are placed on the size and running speed of network models.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a semantic segmentation method based on a DeepLabv3+ network with a multi-level channel attention mechanism, which reduces the size of the network model and improves the recognition speed while improving the accuracy of semantic segmentation, so as to meet the real-time requirements of mobile applications.
The technical scheme is as follows: the invention provides a semantic segmentation method based on a DeepLabv3+ network with a multi-level channel attention mechanism, comprising the following steps:
step 1: input an image;
step 2: acquire high-level and low-level semantic features of the input image through a deep convolutional neural network;
step 3: send the high-level semantic features to the atrous spatial pyramid pooling module to obtain a first feature map;
step 4: send the first feature map obtained in step 3 into the multi-level channel attention mechanism module to obtain a second feature map;
step 5: perform bilinear interpolation upsampling on the second feature map obtained in step 4 and merge it with the low-level semantic features obtained in step 2 to obtain a merged feature map;
step 6: perform bilinear interpolation upsampling on the merged feature map from step 5 again;
step 7: output the final prediction result.
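The seven steps above can be sketched as a decoder skeleton in PyTorch. This is an illustrative sketch only: the backbone, pyramid pooling, and attention modules are stood in by plain convolutions, and all names and channel counts here (DecoderSketch, low_ch, n_classes, and so on) are assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    """Illustrative skeleton of steps 1-7; real backbone/ASPP/attention omitted."""
    def __init__(self, low_ch=256, high_ch=2048, mid_ch=256, n_classes=19):
        super().__init__()
        self.high_proj = nn.Conv2d(high_ch, mid_ch, 1)   # stand-in for steps 3-4
        self.low_proj = nn.Conv2d(low_ch, 48, 1)         # low-level path (step 2)
        self.fuse = nn.Conv2d(mid_ch + 48, mid_ch, 3, padding=1)  # after merge (step 5)
        self.classifier = nn.Conv2d(mid_ch, n_classes, 1)         # step 7

    def forward(self, low_feat, high_feat, out_size):
        x = self.high_proj(high_feat)                              # steps 3-4
        x = F.interpolate(x, size=low_feat.shape[-2:],
                          mode='bilinear', align_corners=False)    # step 5: upsample
        x = torch.cat([x, self.low_proj(low_feat)], dim=1)         # step 5: merge
        x = self.fuse(x)
        x = F.interpolate(x, size=out_size,
                          mode='bilinear', align_corners=False)    # step 6
        return self.classifier(x)                                  # step 7
```

Feeding in a 64×64 low-level map and a 16×16 high-level map with a 256×256 output size yields one score map per class at full resolution.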
Further, in step 2, the image is fed into a deep convolutional network with atrous (dilated) convolution to extract high-level and low-level semantic features; the convolution is:

y[i] = Σ_k x[i + r·k] · w[k]

where y[i] denotes the atrous convolution output at each position i, x[i] the input at position i, w[k] a convolution filter of length K indexed by k, and r the sampling rate of the input signal, here set to r = 2.
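The formula above can be checked with a minimal one-dimensional sketch (the helper name dilated_conv1d is hypothetical); only output positions where the whole dilated filter fits inside the input are computed:

```python
def dilated_conv1d(x, w, r):
    """y[i] = sum_k x[i + r*k] * w[k], computed at valid positions only."""
    n, K = len(x), len(w)
    out_len = n - r * (K - 1)  # positions where the dilated filter fits
    return [sum(x[i + r * k] * w[k] for k in range(K)) for i in range(out_len)]
```

With filter w = [1, 0, -1] and rate r = 2 each output is x[i] - x[i+4], so the effective receptive field widens without adding filter taps, which is the point of the dilation.
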
Further, in step 3, the extracted high-level semantic features are sent to the atrous spatial pyramid pooling module, where they are convolved and pooled by atrous convolution layers with different dilation rates and a pooling layer to obtain five feature maps, which are then concatenated into a five-layer input feature map F; the atrous convolution rates are 1, 6, 12 and 18.
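A minimal PyTorch sketch of the pyramid pooling module described above, following the common DeepLabv3+ ASPP layout; the 256 output channels per branch and the global image-pooling branch are assumptions, since the text does not fix them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel atrous branches at rates 1, 6, 12, 18 plus an image-pooling branch."""
    def __init__(self, in_ch, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        # rate 1 becomes a 1x1 conv; the others are 3x3 convs with dilation = padding = rate
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1 if r == 1 else 3,
                       padding=0 if r == 1 else r, dilation=r) for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.pool(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return torch.cat(feats, dim=1)  # five maps concatenated along channels
```

Because padding equals the dilation rate for the 3×3 branches, every branch preserves the spatial size, so the five maps concatenate cleanly.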
Further, step 4 specifically comprises the following steps:
Step 4.1: convolve the first feature map with 1×1, 3×3 and 5×5 kernels respectively to obtain three branch feature maps F1, F2, F3;
Step 4.2: apply global max pooling and global average pooling over the width and height dimensions to each branch feature map F (H × W × C), obtaining two 1 × 1 × C feature maps per branch;
Step 4.3: feed the two 1 × 1 × C feature maps obtained in step 4.2 into a two-layer multilayer perceptron (MLP): the first layer has C/r neurons, where r is the reduction ratio, with ReLU activation; the second layer has C neurons; the MLP weights are shared between the two pooled features. The MLP outputs are added element-wise and passed through a sigmoid activation to produce the channel attention feature Mc:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W1(W0(F_avg)) + W1(W0(F_max)))

where σ denotes the sigmoid activation, and W0 ∈ R^(C/r × C) and W1 ∈ R^(C × C/r) are the MLP weights.
Step 4.4: fuse each branch's channel attention feature with the branch feature map to obtain the recalibrated feature maps:

Fi' = Fi × Mci,  i = 1, …, k

where i indexes the branches, so k = 3.
Step 4.5: add the recalibrated feature maps to obtain the final recalibrated feature map:

F' = Σ_{i=1…m} Fi'

where i indexes the branches, so m = 3.
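Steps 4.1 to 4.5 can be sketched as a single PyTorch module. The branch kernels, the shared two-layer MLP with reduction ratio r, and the sigmoid follow the text; whether the MLP is also shared across the three branches is not stated, so sharing it here is an assumption, and the channel count and reduction ratio are illustrative:

```python
import torch
import torch.nn as nn

class MultiLevelChannelAttention(nn.Module):
    """Three conv branches (1x1, 3x3, 5x5), CBAM-style channel attention per
    branch, recalibrated maps summed (steps 4.1-4.5)."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch, k, padding=k // 2) for k in (1, 3, 5)])
        # shared 2-layer MLP: C -> C/r (ReLU) -> C  (step 4.3)
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1))

    def channel_attention(self, f):
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))  # global average pool
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))   # global max pool
        return torch.sigmoid(avg + mx)                    # Mc, shape (N, C, 1, 1)

    def forward(self, x):
        out = 0
        for branch in self.branches:
            f = branch(x)                              # step 4.1: branch feature Fi
            out = out + f * self.channel_attention(f)  # steps 4.2-4.4: Fi' = Fi x Mci
        return out                                     # step 4.5: sum of Fi'
```

Each branch keeps the spatial size (padding = k // 2), so the three recalibrated maps can be summed directly.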
Further, in step 5, the bilinear interpolation upsampling is given by:

f(x, y) = f(Q11)·ω11 + f(Q21)·ω21 + f(Q12)·ω12 + f(Q22)·ω22

where f(·) denotes the linear interpolation between the selected points, Qij are the selected points, and ωij is the weight of f(Qij).
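The interpolation formula can be verified with a small pure-Python sketch over the unit square, taking Q11 = (0,0), Q21 = (1,0), Q12 = (0,1), Q22 = (1,1) and the standard bilinear weights (an assumption consistent with the formula above):

```python
def bilinear(x, y, q11, q21, q12, q22):
    """Interpolate at (x, y) in the unit square with corner values
    f(Q11)=q11 at (0,0), f(Q21)=q21 at (1,0),
    f(Q12)=q12 at (0,1), f(Q22)=q22 at (1,1)."""
    w11 = (1 - x) * (1 - y)
    w21 = x * (1 - y)
    w12 = (1 - x) * y
    w22 = x * y
    return q11 * w11 + q21 * w21 + q12 * w12 + q22 * w22
```

At the corners the weights collapse to the corner value, and at the center (0.5, 0.5) the result is the mean of the four corners, which matches interpolating first along X and then along Y.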
Further, in step 6, the merged feature map in step 5 is convolved by 3 × 3 and then upsampled by bilinear interpolation again.
Further, in step 7, the final prediction result is obtained through the loss function:

fl(p_t) = -α_t · (1 - p_t)^γ · log(p_t)

where α_t is the class weight and (1 - p_t)^γ is the modulating factor.
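This is the focal loss; a scalar sketch follows (the defaults α_t = 0.25 and γ = 2 are common choices, not values fixed by the text):

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """fl(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p_t is the predicted probability of the true class; the modulating
    factor (1 - p_t)**gamma down-weights easy, well-classified examples.
    """
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)
```

A confident correct prediction (p_t near 1) contributes almost nothing, while a poorly classified pixel (p_t = 0.5) keeps a sizeable loss, which is why the focal loss helps with hard or rare classes.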
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses a semantic segmentation method based on a multilevel channel attention mechanism deplab 3+ network, which comprises the steps of obtaining high-level and low-level semantic features of an input image through a deep convolutional neural network, sending the high-level semantic features to a hollow pyramid pooling module to obtain a first feature map, sending the first feature map to a multilevel channel attention mechanism module, performing bilinear interpolation up-sampling, merging with the low-level semantic features to obtain a merged feature map, performing bilinear interpolation up-sampling again to obtain a more accurate feature map, reducing the size of a network model while improving the accuracy of table semantic segmentation, improving the identification speed and meeting the real-time requirement of mobile application.
Since target regions are complex and variable, and in order to select information at different spatial scales across channels, the feature map is convolved with three kernel sizes to form three branches. Each branch undergoes global max pooling and global average pooling and then passes through a shared network layer to obtain the channel weights, which recalibrate the channel features of the original branch feature maps; finally, the feature maps processed by the three branches are added to obtain a more accurate feature map. This enhances the features of complex segmentation target regions, reduces the loss of important feature information during training, and makes the segmentation results more accurate.
Drawings
FIG. 1 is a flow chart of a method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the deep convolution and atrous spatial pyramid pooling module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-level channel attention mechanism module in accordance with an embodiment of the present invention;
FIG. 4 is a diagram illustrating bilinear interpolation in accordance with an embodiment of the present invention;
FIG. 5 is a diagram illustrating loss training in an embodiment of the present invention;
FIG. 6 is a diagram illustrating pixel accuracy training in an embodiment of the present invention;
FIG. 7 is a diagram illustrating a final predicted result under a general data set according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
As shown in fig. 1, the semantic segmentation method based on the DeepLabv3+ network with the multi-level channel attention mechanism according to this embodiment of the present invention comprises the following steps:
step 1: input an image;
step 2: acquire high-level and low-level semantic features of the input image through a deep convolutional neural network;
a feature map of the input image is acquired; to test the method in a complex environment, the Cityscapes dataset is adopted, whose road scenes contain a large amount of occlusion, posture tilt, uneven illumination and similar conditions, and the final prediction of the method remains excellent on this dataset;
the image is fed into a deep convolutional network with atrous convolution to extract high-level and low-level semantic features; the convolution process is:

y[i] = Σ_k x[i + r·k] · w[k]

where y[i] denotes the atrous convolution output at each position i, x[i] the input at position i, w[k] a convolution filter of length K indexed by k, and r the sampling rate of the input signal, here set to r = 2.
As shown in fig. 2, step 3: the extracted high-level semantic features are sent into the atrous spatial pyramid pooling module, where they are convolved and pooled by atrous convolution layers with different dilation rates and a pooling layer to obtain five feature maps, which are then concatenated into a five-layer input feature map F; the atrous convolution rates are 1, 6, 12 and 18. FIG. 2 is a schematic diagram of the deep convolution and atrous spatial pyramid pooling module in this embodiment of the invention;
as shown in fig. 3, step 4: sending the first characteristic diagram obtained in the step 3 into a multi-level channel attention mechanism module to obtain a second characteristic diagram, and specifically comprising the following steps:
step 4.1: respectively carrying out convolution with convolution kernels of 1x1,3x3 and 5x5 on the first characteristic diagram to obtain a 3-branch characteristic diagram F1、F2、F3
Step 4.2: respectively performing global maximum pooling and global average pooling on the basis of width and height on the feature map F (H multiplied by W multiplied by C) of each branch to respectively obtain two feature maps of 1 multiplied by C;
step 4.3: sending the two 1 × 1 × C feature graphs obtained in the step 4.2 into a multilayer perceptron with 2 layers, wherein the number of first-layer neurons is C/r, r is a reduction rate, an activation function Relu, the number of second-layer neurons is C, the neural networks of the first-layer neurons and the second-layer neurons are mutually shared, then carrying out addition operation based on element-wise on features output by the multilayer perceptron, and finally generating a channel attention feature M through sigmoid activationcThe process is represented as follows:
Figure BDA0003105880060000061
in the formula: σ denotes sigmoid activation, W0∈RC/r×C,W1∈RC×C/r,W0And W1Is MLPThe weight of (c).
Step 4.4: fuse each branch's channel attention feature with the branch feature map to obtain the recalibrated feature maps:

Fi' = Fi × Mci,  i = 1, …, k

where i indexes the branches, so k = 3.
Step 4.5: add the recalibrated feature maps to obtain the final recalibrated feature map:

F' = Σ_{i=1…m} Fi'

where i indexes the branches, so m = 3.
Fig. 3 is a schematic diagram of the multi-level channel attention mechanism module according to this embodiment of the invention, which recalibrates the high-level semantic features to strengthen the connections between pixels and between local and global features, thereby improving segmentation accuracy.
As shown in fig. 4, step 5: perform bilinear interpolation upsampling on the second feature map obtained in step 4 and merge it with the low-level semantic features obtained in step 2 to obtain a merged feature map;
in step 5, the bilinear interpolation upsampling is given by:

f(x, y) = f(Q11)·ω11 + f(Q21)·ω21 + f(Q12)·ω12 + f(Q22)·ω22

where f(·) denotes the linear interpolation between the selected points, Qij are the selected points, and ωij is the weight of f(Qij). FIG. 4 illustrates the bilinear interpolation method in this embodiment: interpolation is performed in the X direction first, and the results are then interpolated in the Y direction;
step 6: perform bilinear interpolation upsampling on the merged feature map from step 5 again; the merged feature map is convolved by 3×3 and then upsampled by bilinear interpolation.
Step 7: the final prediction result is obtained through the loss function:

fl(p_t) = -α_t · (1 - p_t)^γ · log(p_t)

where α_t is the class weight and (1 - p_t)^γ is the modulating factor.
Fig. 5 and 6 show the loss training and pixel accuracy training curves of the experiment, respectively; measured in practice, the final overall accuracy of the invention reaches up to 93.73%. Fig. 7 shows the final prediction results of the experiment on a general dataset; measured in practice, the final mean IoU reaches 71.89%.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (7)

1. A semantic segmentation method based on a DeepLabv3+ network with a multi-level channel attention mechanism, characterized by comprising the following steps:
step 1: input an image;
step 2: acquire high-level and low-level semantic features of the input image through a deep convolutional neural network;
step 3: send the high-level semantic features to the atrous spatial pyramid pooling module to obtain a first feature map;
step 4: send the first feature map obtained in step 3 into the multi-level channel attention mechanism module to obtain a second feature map;
step 5: perform bilinear interpolation upsampling on the second feature map obtained in step 4 and merge it with the low-level semantic features obtained in step 2 to obtain a merged feature map;
step 6: perform bilinear interpolation upsampling on the merged feature map from step 5 again;
step 7: output the final prediction result.
2. The semantic segmentation method based on the DeepLabv3+ network with the multi-level channel attention mechanism as claimed in claim 1, wherein in step 2 the image is fed into a deep convolutional network with atrous convolution to extract high-level and low-level semantic features, the convolution being:

y[i] = Σ_k x[i + r·k] · w[k]

where y[i] denotes the atrous convolution output at each position i, x[i] the input at position i, w[k] a convolution filter of length K indexed by k, and r the sampling rate of the input signal, here set to r = 2.
3. The semantic segmentation method based on the DeepLabv3+ network with the multi-level channel attention mechanism as claimed in claim 1, wherein in step 3 the extracted high-level semantic features are sent to the atrous spatial pyramid pooling module, are convolved and pooled by atrous convolution layers with different dilation rates and a pooling layer to obtain five feature maps, and are then concatenated into a five-layer input feature map F, the atrous convolution rates being 1, 6, 12 and 18.
4. The semantic segmentation method based on the DeepLabv3+ network with the multi-level channel attention mechanism as claimed in claim 1, wherein step 4 specifically comprises the following steps:
Step 4.1: convolve the first feature map with 1×1, 3×3 and 5×5 kernels respectively to obtain three branch feature maps F1, F2, F3;
Step 4.2: apply global max pooling and global average pooling over the width and height dimensions to each branch feature map F (H × W × C), obtaining two 1 × 1 × C feature maps per branch;
Step 4.3: feed the two 1 × 1 × C feature maps obtained in step 4.2 into a two-layer multilayer perceptron (MLP): the first layer has C/r neurons, where r is the reduction ratio, with ReLU activation; the second layer has C neurons; the MLP weights are shared between the two pooled features. The MLP outputs are added element-wise and passed through a sigmoid activation to produce the channel attention feature Mc:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W1(W0(F_avg)) + W1(W0(F_max)))

where σ denotes the sigmoid activation, and W0 ∈ R^(C/r × C) and W1 ∈ R^(C × C/r) are the MLP weights.
Step 4.4: fuse each branch's channel attention feature with the branch feature map to obtain the recalibrated feature maps:

Fi' = Fi × Mci,  i = 1, …, k

where i indexes the branches, so k = 3.
Step 4.5: add the recalibrated feature maps to obtain the final recalibrated feature map:

F' = Σ_{i=1…m} Fi'

where i indexes the branches, so m = 3.
5. The semantic segmentation method based on the DeepLabv3+ network with the multi-level channel attention mechanism as claimed in claim 1, wherein in step 5 the bilinear interpolation upsampling is given by:

f(x, y) = f(Q11)·ω11 + f(Q21)·ω21 + f(Q12)·ω12 + f(Q22)·ω22

where f(·) denotes the linear interpolation between the selected points, Qij are the selected points, and ωij is the weight of f(Qij).
6. The semantic segmentation method based on the DeepLabv3+ network with the multi-level channel attention mechanism as claimed in claim 1, wherein in step 6 the merged feature map from step 5 is convolved by 3×3 and then upsampled by bilinear interpolation again.
7. The semantic segmentation method based on the DeepLabv3+ network with the multi-level channel attention mechanism as claimed in claim 1, wherein in step 7 the final prediction result is obtained through the loss function:

fl(p_t) = -α_t · (1 - p_t)^γ · log(p_t)

where α_t is the class weight and (1 - p_t)^γ is the modulating factor.
CN202110637809.5A 2021-06-08 2021-06-08 Semantic segmentation method based on DeepLabv3+ network with multi-level channel attention mechanism Active CN113421268B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110637809.5A CN113421268B (en) 2021-06-08 2021-06-08 Semantic segmentation method based on DeepLabv3+ network with multi-level channel attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110637809.5A CN113421268B (en) 2021-06-08 2021-06-08 Semantic segmentation method based on DeepLabv3+ network with multi-level channel attention mechanism

Publications (2)

Publication Number Publication Date
CN113421268A true CN113421268A (en) 2021-09-21
CN113421268B CN113421268B (en) 2022-09-16

Family

ID=77787974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110637809.5A Active CN113421268B (en) 2021-06-08 2021-06-08 Semantic segmentation method based on DeepLabv3+ network with multi-level channel attention mechanism

Country Status (1)

Country Link
CN (1) CN113421268B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140469A (en) * 2021-12-02 2022-03-04 北京交通大学 Depth hierarchical image semantic segmentation method based on multilayer attention
CN114913436A (en) * 2022-06-15 2022-08-16 中科弘云科技(北京)有限公司 Ground object classification method and device based on multi-scale attention mechanism, electronic equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN112541503A (en) * 2020-12-11 2021-03-23 南京邮电大学 Real-time semantic segmentation method based on context attention mechanism and information fusion


Also Published As

Publication number Publication date
CN113421268B (en) 2022-09-16

Similar Documents

Publication Publication Date Title
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
WO2020253416A1 (en) Object detection method and device, and computer storage medium
CN111291809B (en) Processing device, method and storage medium
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN113421268B (en) Semantic segmentation method based on DeepLabv3+ network with multi-level channel attention mechanism
CN112288011B (en) Image matching method based on self-attention deep neural network
CN112069868A (en) Unmanned aerial vehicle real-time vehicle detection method based on convolutional neural network
CN111898439B (en) Deep learning-based traffic scene joint target detection and semantic segmentation method
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
CN110223304B (en) Image segmentation method and device based on multipath aggregation and computer-readable storage medium
CN111723829B (en) Full-convolution target detection method based on attention mask fusion
CN110222717A (en) Image processing method and device
WO2022111617A1 (en) Model training method and apparatus
CN110956119B (en) Method for detecting target in image
CN115082293A (en) Image registration method based on Swin transducer and CNN double-branch coupling
CN111768415A (en) Image instance segmentation method without quantization pooling
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN110633640A (en) Method for identifying complex scene by optimizing PointNet
CN112329801A (en) Convolutional neural network non-local information construction method
CN114943893A (en) Feature enhancement network for land coverage classification
CN115482518A (en) Extensible multitask visual perception method for traffic scene
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
US20220215617A1 (en) Viewpoint image processing method and related device
CN114170231A (en) Image semantic segmentation method and device based on convolutional neural network and electronic equipment
CN113205102B (en) Vehicle mark identification method based on memristor neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant