CN114693929A - Semantic segmentation method for RGB-D bimodal feature fusion - Google Patents

Semantic segmentation method for RGB-D bimodal feature fusion

Info

Publication number
CN114693929A
CN114693929A (application CN202210330691.6A)
Authority
CN
China
Prior art keywords
feature
rgb
channels
attention
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210330691.6A
Other languages
Chinese (zh)
Inventor
方艳红
罗盆琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN202210330691.6A priority Critical patent/CN114693929A/en
Publication of CN114693929A publication Critical patent/CN114693929A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an RGB-D bimodal feature fusion semantic segmentation method. First, a lightweight ResNet34 architecture is used as the dual-branch encoding backbone, and the feature information of the two modalities is extracted in four stages; the feature maps output by each stage are passed layer by layer into a bimodal feature fusion structure, in which attention identifies the positional and spatial features that should be strengthened or suppressed, and the fused features are sent to a skip-link module that supplies shallow detail information to the decoding network. Next, a dual-attention context module enriches the global information of the deepest feature map and connects it to the decoder. Finally, the shallow, low-level, fine-grained features from the encoder subnetwork are combined with the deep, semantic, coarse-grained feature maps of the same scale from the decoder subnetwork to obtain global features containing both low-level spatial and high-level semantic information. The method makes full use of the complementary characteristics of RGB-D images to achieve excellent semantic segmentation performance, with good segmentation quality, high computational efficiency and good robustness.

Description

Semantic segmentation method for RGB-D bimodal feature fusion
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an indoor scene image semantic segmentation method that performs multi-stage feature extraction and fusion on the different modal features of RGB-D image data.
Background
Early pixel-level semantic segmentation was mainly based on traditional machine learning methods, notably graph-cut methods, conditional random field (CRF) algorithms, and their Markov random field refinements. However, these algorithms can only perform binary segmentation of the input image, their segmentation accuracy is low, and, in terms of both quality and running speed, they fall far short of end-to-end processing of large data volumes. To accomplish an accurate semantic understanding task, a computer must process larger volumes of image data in order to extract more object (obstacle) feature information; yet as the data volume grows, the required processing accuracy and speed greatly exceed the information-processing capability of conventional algorithms.
In recent years, with the rapid growth of computing power and the rise of artificial intelligence, feature fusion methods based on deep learning have been proposed to maximize the complementary advantages of multiple data sources and reduce the difficulty of data fusion. At present, RGB image segmentation based on deep learning already performs well. In complex indoor scenes, however, factors such as many object categories, large illumination changes and frequent occlusion leave the RGB semantic segmentation task with severe boundary blurring, intra-class misclassification, loss of small target objects and similar problems.
Research shows that during semantic segmentation the depth image can provide complementary geometric information for the RGB image and thereby improve the segmentation result. With the wide deployment of depth cameras, scene depth information has become easy to acquire. However, because RGB and depth are different data types, RGB-D feature fusion easily makes the network model overly complex; an effective heterogeneous feature fusion method that better exploits the cooperative information of RGB features and depth features is therefore a key point for improving semantic segmentation accuracy.
According to the stage at which feature fusion takes place in the segmentation model, fusion methods can be divided into early fusion, late fusion and multi-stage fusion. Early fusion performs feature fusion before the data enter the deep learning network and then feeds the fused features into the network for feature extraction and semantic prediction. Late fusion passes the RGB and depth images through a dual-branch encoding network that extracts RGB features and depth features independently, and fuses the features only in the final stage to provide the basis for prediction. Multi-stage fusion takes two forms: one fuses the depth features extracted at each layer of the deep convolutional neural network into the RGB features; the other differs in that its two feature extraction branches concentrate on extracting single-modality features, after which the RGB features and depth features are fused layer by layer. It can be seen that the multi-stage fusion mode supplies multi-level fused feature information to the decoder and exploits the low-, mid- and high-level information of the bimodal data, so its utilization of feature information in semantic segmentation tasks is better than that of early and late fusion.
Disclosure of Invention
The invention aims to solve the image semantic segmentation problem: given an RGB image and a depth image of the same scene, a multi-level bimodal feature fusion convolutional neural network model produces a semantic segmentation image in which pixels are classified according to the given labels, so that objects of different categories can be distinguished in the image.
To achieve the above object, the present invention provides an RGB-D bimodal feature fusion semantic segmentation method that mainly comprises five parts: the first part preprocesses the RGB-D data set; the second part performs feature extraction and feature fusion on the preprocessed images; the third part establishes rich context relations for the fused image features; the fourth part fuses multi-level image features in a decoder and up-samples them to obtain a semantic segmentation image with the same resolution as the input image; the fifth part covers training and testing of the network.
The first part comprises two steps:
step 1, downloading a public indoor RGB-D image data set, selecting images with complex scenes, varied details and complete category coverage as training samples, and using the remaining images as test samples;
step 2, to further increase the number of samples, performing random augmentation using image scaling, cropping and rotation, and also applying slight color jitter in HSV space to the RGB images.
The second part comprises two steps:
step 3, performing multi-level feature extraction on the RGB image and the depth image obtained in step 2 using a dual-branch encoding network;
step 4, fusing the multi-level, multi-scale features of the different modalities obtained in step 3 through a bimodal feature fusion structure to obtain fused image features, which are sent to the decoder module through skip links.
The third part comprises one step:
step 5, applying a dual-attention perception module to the highest-level feature map obtained in step 4 to construct local context dependencies and integrate the correlations of local features; the resulting feature map is then processed by the multi-branch aggregation structure of the context module, which combines feature information from regions of different scales to further enhance the global information of the features.
The fourth part comprises four steps:
step 6, fusing the feature map obtained in step 5 with the third-layer bimodal features obtained in step 4, then up-sampling to enlarge the resolution and obtain the output of the first decoder layer;
step 7, fusing the feature map obtained in step 6 with the second-layer bimodal features obtained in step 4, then up-sampling to enlarge the resolution and obtain the output of the second decoder layer;
step 8, fusing the feature map obtained in step 7 with the first-layer bimodal features obtained in step 4, then up-sampling to enlarge the resolution and obtain the output of the third decoder layer;
step 9, applying two further 2× up-sampling operations to the image obtained in step 8, restoring its resolution to that of the input image and outputting the final semantic segmentation result map.
The fifth part comprises two steps:
step 10, training the network model of steps 3 to 9 and adjusting its parameters and weights to obtain the optimal network model parameter file;
step 11, inputting the test image data set of step 1 into the network model obtained in step 10 to obtain the semantic segmentation images.
The invention provides an RGB-D bimodal feature fusion image semantic segmentation method. To balance efficiency and accuracy, the method uses a lightweight ResNet34 architecture incorporating the Non-Bottleneck-1D design as the backbone network. First, the RGB and depth data each use a separate encoder branch, and the feature information of the RGB image and the depth image is extracted in four stages. While the depth and RGB features are being processed, the output feature maps are passed layer by layer into an attention-based bimodal fusion module, which lets the model learn which positional and spatial features of the input should be enhanced or suppressed; the fused features are then passed into the multi-scale skip-connection modules of each layer, providing additional shallow detail information for the decoder network. Next, a dual-attention context module enriches the global information of the highest-level (deepest) feature map and connects it to the decoder. Finally, the shallow, low-level, fine-grained features from the encoder subnetwork are combined by 1 × 1 convolution with the deep, semantic, coarse-grained feature maps of the same scale from the decoder subnetwork, yielding global features that contain both low-level spatial information and high-level semantic information. The method makes full use of the complementary characteristics of RGB-D images while removing redundant feature information, achieving excellent semantic segmentation performance with good segmentation quality, high computational efficiency and good robustness.
Drawings
Fig. 1 is an overall structure diagram of a network model of the present invention.
FIG. 2 is a diagram of a bimodal feature fusion architecture of the present invention.
FIG. 3 is a diagram of a dual attention aware context architecture of the present invention.
Fig. 4 is an original captured RGB image and depth image.
FIG. 5 is a semantically segmented image from FIG. 4 processed using the present invention.
Detailed Description
For a better understanding of the present invention, the RGB-D bimodal feature fusion semantic segmentation method is described in more detail below with reference to specific embodiments. Detailed descriptions of known prior art are omitted where they would obscure the subject matter of the present invention.
In a specific embodiment, the method is carried out according to the following steps:
step 1, downloading indoor RGB-D data sets including NYUv2 and SUNRGB-D, obtaining training set and test set samples of different data sets, and generating txt files for summarizing picture names through data processing.
Step 2, set the number of one-to-one corresponding RGB and depth images to be loaded each time, and perform data set augmentation such as scaling and cropping to obtain the network input samples 101 (a sketch of paired augmentation follows).
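The following is a minimal sketch of paired RGB-D augmentation consistent with steps 1-2. The scale range, output size and jitter amplitudes are illustrative assumptions, not values specified in this description; random rotation and flipping are omitted for brevity.

import numpy as np
import cv2

def augment_pair(rgb, depth, out_size=(480, 640)):
    """Apply the same random scale/crop to RGB and depth; jitter HSV on RGB only.
    Assumes the input image is at least out_size and rgb is uint8."""
    h, w = rgb.shape[:2]
    # random scaling (assumed range 1.0-1.4), identical factor for both modalities
    s = np.random.uniform(1.0, 1.4)
    rgb = cv2.resize(rgb, (int(w * s), int(h * s)), interpolation=cv2.INTER_LINEAR)
    depth = cv2.resize(depth, (int(w * s), int(h * s)), interpolation=cv2.INTER_NEAREST)
    # random crop to the target size, identical window for both modalities
    th, tw = out_size
    y0 = np.random.randint(0, rgb.shape[0] - th + 1)
    x0 = np.random.randint(0, rgb.shape[1] - tw + 1)
    rgb = rgb[y0:y0 + th, x0:x0 + tw]
    depth = depth[y0:y0 + th, x0:x0 + tw]
    # slight HSV color jitter applied to the RGB image only
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV).astype(np.float32)
    hsv[..., 1:] *= np.random.uniform(0.9, 1.1, size=2)   # saturation / value
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    rgb = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
    return rgb, depth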
Fig. 1 is an overall structure diagram of an RGB-D bimodal feature fusion model based on a convolutional neural network according to the present embodiment, which is performed according to the following steps.
Step 3, use a lightweight dual-branch residual structure incorporating the Non-Bottleneck-1D design as the encoder backbone 102: the 3 × 3 convolution in the original residual network block is decomposed into two one-dimensional convolutions of 3 × 1 and 1 × 3, and each of the two encoding branches is divided into four stages that focus on extracting the modality-specific features of the RGB and depth images (a sketch of the block follows).
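As an illustration, a hedged PyTorch sketch of a Non-Bottleneck-1D residual block of the kind referenced in step 3 is given below; the placement of batch normalization and ReLU follows the common ERFNet-style layout and is an assumption, not a detail stated in this specification.

import torch
import torch.nn as nn

class NonBottleneck1D(nn.Module):
    """Residual block whose 3x3 convolutions are factorized into 3x1 and 1x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x1_1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False)
        self.conv1x3_1 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv3x1_2 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False)
        self.conv1x3_2 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv3x1_1(x))
        out = self.relu(self.bn1(self.conv1x3_1(out)))
        out = self.relu(self.conv3x1_2(out))
        out = self.bn2(self.conv1x3_2(out))
        return self.relu(out + x)   # residual connection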
Step 4, after each encoding stage of step 3, fuse the RGB image features and the depth image features obtained in step 3 using the bimodal feature fusion structure 103. The bimodal feature fusion structure is shown in Fig. 2 and is implemented as follows (a code sketch follows the list):
(1) the RGB feature map and depth feature map 201 of each layer are passed through a coordinate attention mechanism 202, which extracts the features of interest from the two modalities according to the same rule and coordinates the weight of each piece of feature information in the two different types of feature maps;
(2) the feature information obtained through the attention mechanism is then cooperatively optimized 203 using the complementary characteristics of the RGB and depth features, with a 1:1 weight ratio between the different modal features, realizing bimodal feature fusion;
(3) finally, the fusion result 204 is passed through multi-scale skip links: 1 × 1 convolutions convert the feature maps of the different layers to 512, 256 and 128 channels, respectively, before they are sent to the decoder, realizing multi-level, multi-scale mixing of the modal information.
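A minimal PyTorch sketch of one such fusion block is given below. The internal layout of the coordinate attention (reduction ratio, pooling arrangement) and the use of a plain element-wise sum for the 1:1 cooperative optimization are assumptions made for illustration; only the overall structure (per-modality coordinate attention, equal-weight fusion, 1 × 1 projection for the skip link) follows the description above.

import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Compact coordinate-attention sketch: pool along H and W, share a 1x1 conv,
    then re-weight the input along both spatial directions."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                          # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                       # attention along H
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))   # attention along W
        return x * ah * aw

class BimodalFusion(nn.Module):
    """Fuse same-scale RGB and depth features: coordinate attention on each branch,
    an equal-weight (1:1) element-wise sum, then a 1x1 conv to the skip-link channel count."""
    def __init__(self, channels, out_channels):
        super().__init__()
        self.att_rgb = CoordinateAttention(channels)
        self.att_depth = CoordinateAttention(channels)
        self.proj = nn.Conv2d(channels, out_channels, 1)

    def forward(self, f_rgb, f_depth):
        fused = self.att_rgb(f_rgb) + self.att_depth(f_depth)   # 1:1 weight ratio
        return self.proj(fused)                                 # e.g. to 512/256/128 channels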
Step 5, as shown in Fig. 3, the dual-attention-aware context structure 104 fuses local and global context information for the highest-level features from step 4: the input feature map 301 is processed by the position attention branch 302 and the channel attention branch 303, the aggregated output feature map 304 is fed into a pyramid-like pooling context module 305, and the feature map output by the context module serves as the first input 306 of the decoder. The specific steps are as follows (a sketch of the pyramid-like pooling module follows the list):
(1) The input feature map is fed into the position attention branch of the dual-attention mechanism to obtain the position attention output; at the same time, the feature map is fed into the channel attention branch to obtain the channel attention output.
(2) The outputs of the two attention branches are aggregated with a 1:1 weight ratio to obtain a representation X_T better suited to pixel-level prediction.
(3) X_T is passed through the b branches of different sizes of the pyramid-like pooling model. Each branch applies pooling at a different scale, a 1 × 1 convolution reduces the number of channels to 1/b of the channels of the input feature map, and nearest-neighbour up-sampling restores each branch to the input size. The original feature map is concatenated with every scale, and the channel count of the concatenated feature map is finally adjusted to obtain the context-rich feature map X_OUT, which is passed to the decoder. In the present invention, b = 4 is recommended.
Step 6, take the 512-channel bimodal feature map from step 5 and the 512-channel third-layer bimodal feature map from step 4 together as the two input feature maps of the first decoder layer, perform the first up-sampling, and double the resolution of the output feature map 105.
Step 7, take the 256-channel output feature map from step 6 and the 256-channel second-layer bimodal feature map from step 4 together as the two input feature maps of the second decoder layer, perform the second up-sampling, and double the resolution of the output feature map 106.
Step 8, take the 128-channel output feature map from step 7 and the 128-channel first-layer bimodal feature map from step 4 together as the two input feature maps of the third decoder layer, up-sample again, and double the resolution of the output feature map 107.
Step 9, enlarge the output semantic segmentation result map to the resolution of the input image 108 by two further 2× up-sampling operations (a sketch of one decoder stage follows).
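The sketch below illustrates one decoder stage of steps 6-8 in PyTorch. How the two same-scale inputs are combined (concatenation followed by a 1 × 1 merge convolution, echoing the 1 × 1 combination mentioned in the summary) and the use of nearest-neighbour 2× up-sampling are assumptions for illustration; the channel progression 512 → 256 → 128 follows the steps above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One decoder layer: concatenate the previous decoder output with the same-scale
    fused encoder features from the skip link, merge them with a 1x1 conv, then
    up-sample the result by a factor of two."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.merge = nn.Sequential(
            nn.Conv2d(2 * in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True))

    def forward(self, x_decoder, x_skip):
        x = self.merge(torch.cat([x_decoder, x_skip], dim=1))     # same-scale fusion
        return F.interpolate(x, scale_factor=2, mode='nearest')   # 2x up-sampling

# Assumed usage, following the channel counts of steps 6-8 (512 -> 256 -> 128):
# d1 = DecoderStage(512, 256)(x_out, skip_3)
# d2 = DecoderStage(256, 128)(d1, skip_2)
# d3 = DecoderStage(128, 128)(d2, skip_1)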
Step 10, for the network model of steps 3 to 9, set the training batch size to 4, i.e. 4 random images are processed as one batch, and run one test per training epoch with a batch size of 8 during testing; use the SGD (stochastic gradient descent) optimizer with momentum 0.9 and an initial learning rate of 0.01, adjust the learning rate in each epoch with a poly learning-rate strategy, and obtain the optimal parameter model file after 500 training epochs (a training-schedule sketch follows).
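A minimal sketch of the training schedule in step 10 is given below. The poly power of 0.9 and the cross-entropy loss are assumptions; `model` and `train_loader` are placeholders for the network of steps 3-9 and the RGB-D data loader of steps 1-2.

import torch

def poly_lr(base_lr, epoch, max_epoch, power=0.9):
    """Poly learning-rate decay applied once per epoch (power is an assumption)."""
    return base_lr * (1 - epoch / max_epoch) ** power

# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# criterion = torch.nn.CrossEntropyLoss()
# for epoch in range(500):                           # 500 training epochs
#     for g in optimizer.param_groups:
#         g["lr"] = poly_lr(0.01, epoch, 500)
#     for rgb, depth, label in train_loader:         # batch size 4
#         optimizer.zero_grad()
#         loss = criterion(model(rgb, depth), label)
#         loss.backward()
#         optimizer.step()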
Step 11, input the RGB image and depth image to be tested, shown in Fig. 4, into the trained model to obtain the image semantic segmentation output shown in Fig. 5.
The invention provides an RGB-D bimodal feature fusion semantic segmentation method based on the complementary feature information of RGB and depth images. Because the information carried by an RGB image alone is limited, an attention-based fusion module is constructed on the basis of multi-stage multimodal fusion; it strengthens the attention paid to the position and channel information of the RGB and depth features, reduces the difference between the two modalities, and makes full use of their complementarity for cooperative optimization. In addition, a dual-attention-aware context module is built to connect the encoder and decoder and to strengthen the context analysis of the encoder output feature map. The decoder part uses multi-scale skip connections to fully exploit the local context semantics of features at different scales during unpooling, and finally outputs the semantic segmentation result. The method has a simple algorithm, strong operability and wide applicability.
While the invention has been described with reference to illustrative embodiments, it is to be understood that the invention is not limited thereto and is intended to cover the various changes and modifications obvious to those skilled in the art that fall within the spirit and scope of the invention as defined in the appended claims.

Claims (3)

1. A semantic segmentation method for RGB-D bimodal feature fusion, mainly comprising five parts: the first part preprocesses an RGB-D data set; the second part performs feature extraction and feature fusion on the preprocessed images; the third part establishes rich context relations for the fused image features; the fourth part fuses multi-level, multi-scale image features in a decoder and up-samples them to obtain a semantic segmentation image with the same resolution as the input; the fifth part covers training and testing of the network.
The first part comprises two steps:
step 1, downloading indoor RGB-D data sets including NYUv2 and SUNRGB-D, obtaining training set and test set samples of different data sets, and generating txt files for summarizing picture names through data processing.
step 2, setting the number of one-to-one corresponding RGB and depth images to be loaded each time, and performing data set augmentation such as scaling and cropping to obtain the input samples of the network.
The second part comprises two steps:
and 3, using a lightweight double-branch residual structure introduced with a Non-Bottleneck-1D framework as a coder backbone, decomposing a 3 x 3 convolution in an original residual network residual block into two one-dimensional 3 x 1 and 1 x 3 convolutions, and dividing two coding branches into four stages to focus on extracting different modal characteristics of RGB and depth images.
step 4, after each encoding stage of step 3, fusing the RGB image features and the depth image features obtained in step 3 through the bimodal feature fusion structure, implemented as follows:
(1) the RGB feature map and depth feature map of each layer are passed through a coordinate attention mechanism, which extracts the features of interest from the two modalities according to the same rule and coordinates the weight of each piece of feature information in the two different types of feature maps;
(2) the feature information obtained through the attention mechanism is then cooperatively optimized using the complementary characteristics of the RGB and depth features, with a 1:1 weight ratio between the different modal features, realizing bimodal feature fusion;
(3) finally, the fusion result is passed through multi-scale skip links: 1 × 1 convolutions convert the feature maps of the different layers to 512, 256 and 128 channels, respectively, before they are sent to the decoder, realizing multi-level, multi-scale mixing of the modal information.
The third part comprises one step:
and 5, carrying out local and global context information fusion on the highest-level features in the step 4 through a double attention perception context structure, respectively processing and aggregating the input feature graph through a position attention branch and a channel attention branch, inputting the aggregated output feature graph as a pyramid-like pooling context module, and finally, taking the feature graph output by the context module as the first input of a decoder. The method comprises the following specific steps:
(1) The input feature map is fed into the position attention branch of the dual-attention mechanism to obtain the position attention output; at the same time, the feature map is fed into the channel attention branch to obtain the channel attention output.
(2) The outputs of the two attention branches are aggregated with a 1:1 weight ratio to obtain a representation X_T better suited to pixel-level prediction.
(3) X_T is passed through the b branches of different sizes of the pyramid-like pooling model. Each branch applies pooling at a different scale, a 1 × 1 convolution reduces the number of channels to 1/b of the channels of the input feature map, and nearest-neighbour up-sampling restores each branch to the input size. The original feature map is concatenated with every scale, and the channel count of the concatenated feature map is finally adjusted to obtain the context-rich feature map X_OUT, which is passed to the decoder; preferably b = 4.
The fourth part comprises four steps:
step 6, taking the 512-channel bimodal feature map from step 5 and the 512-channel third-layer bimodal feature map from step 4 together as the two input feature maps of the first decoder layer, performing the first up-sampling, and doubling the resolution of the output feature map;
step 7, taking the 256-channel output feature map from step 6 and the 256-channel second-layer bimodal feature map from step 4 together as the two input feature maps of the second decoder layer, performing the second up-sampling, and doubling the resolution of the output feature map;
step 8, taking the 128-channel output feature map from step 7 and the 128-channel first-layer bimodal feature map from step 4 together as the two input feature maps of the third decoder layer, up-sampling again, and doubling the resolution of the output feature map;
step 9, enlarging the output semantic segmentation result map to the resolution of the input image by two further 2× up-sampling operations.
The fifth part comprises two steps:
and step 10, setting the training batch processing size to be 4 for the network models from the step 3 to the step 9, namely processing 4 random pictures as one batch, carrying out a test once each training period, wherein the batch processing size is 8 during the test, setting the momentum to be 0.9 and the initial learning rate to be 0.01 by using an SGD (generalized minimum delay) optimization method, adjusting the learning rate by using a poly learning rate strategy in each period, and obtaining an optimal model parameter file after training for 500 times.
And 11, inputting the RGB image to be tested and the depth image into the trained model to obtain an image semantic segmentation output result.
2. The RGB-D bimodal feature fusion semantic segmentation method according to claim 1, wherein in step 4 (1) the multi-level, multi-scale features of the different modalities pass through the same attention mechanism so that the feature weights are coordinated according to the same rule; in step 4 (2) the feature maps of different modalities at the same scale in each encoding stage are cooperatively optimized with a 1:1 weight ratio; and in step 4 (3) skip links and 1 × 1 convolutions send the three feature maps of different levels into the decoder to realize multi-level, multi-scale feature fusion, the numbers of channels of the different-level feature maps after the 1 × 1 convolutions being 512, 256 and 128, respectively.
3. The RGB-D bimodal feature fusion semantic segmentation method according to claim 1, wherein in step 5 (1) a dual-attention mechanism is used to extract position and channel features; in step 5 (2) the dual-attention feature maps are aggregated with a 1:1 weight ratio between the attention branches, integrating the similarity of local features at any scale and adaptively integrating the dependency between local and global features, thereby enhancing the ability to identify details and providing rich feature information for the context module; and in step 5 (3) the pyramid-like pooling context model fuses the feature maps of different sizes output by the b branches, reducing the loss of context information of features in different regions and providing global context information for the decoder module, with b = 4.
CN202210330691.6A 2022-03-31 2022-03-31 Semantic segmentation method for RGB-D bimodal feature fusion Pending CN114693929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210330691.6A CN114693929A (en) 2022-03-31 2022-03-31 Semantic segmentation method for RGB-D bimodal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210330691.6A CN114693929A (en) 2022-03-31 2022-03-31 Semantic segmentation method for RGB-D bimodal feature fusion

Publications (1)

Publication Number Publication Date
CN114693929A true CN114693929A (en) 2022-07-01

Family

ID=82140626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210330691.6A Pending CN114693929A (en) 2022-03-31 2022-03-31 Semantic segmentation method for RGB-D bimodal feature fusion

Country Status (1)

Country Link
CN (1) CN114693929A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861635A (en) * 2023-02-17 2023-03-28 武汉大学 Transmission distortion resistant unmanned aerial vehicle oblique image semantic information extraction method and device
CN116935052A (en) * 2023-07-24 2023-10-24 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment
CN116935052B (en) * 2023-07-24 2024-03-01 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment
CN117765378A (en) * 2024-02-22 2024-03-26 成都信息工程大学 Method and device for detecting forbidden articles in complex environment with multi-scale feature fusion
CN117765378B (en) * 2024-02-22 2024-04-26 成都信息工程大学 Method and device for detecting forbidden articles in complex environment with multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN111582316B (en) RGB-D significance target detection method
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN111178316B (en) High-resolution remote sensing image land coverage classification method
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN111179167B (en) Image super-resolution method based on multi-stage attention enhancement network
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN112598053B (en) Active significance target detection method based on semi-supervised learning
CN114943963A (en) Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN112084859B (en) Building segmentation method based on dense boundary blocks and attention mechanism
CN111832453B (en) Unmanned scene real-time semantic segmentation method based on two-way deep neural network
CN111325165A (en) Urban remote sensing image scene classification method considering spatial relationship information
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN112364838B (en) Method for improving handwriting OCR performance by utilizing synthesized online text image
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN115311555A (en) Remote sensing image building extraction model generalization method based on batch style mixing
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN116109920A (en) Remote sensing image building extraction method based on transducer
CN112200817A (en) Sky region segmentation and special effect processing method, device and equipment based on image
CN116485860A (en) Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features
CN116778165A (en) Remote sensing image disaster detection method based on multi-scale self-adaptive semantic segmentation
CN116543155A (en) Semantic segmentation method and device based on context cascading and multi-scale feature refinement
CN116091918A (en) Land utilization classification method and system based on data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination