CN114283315A - RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion - Google Patents

RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion

Info

Publication number
CN114283315A
CN114283315A
Authority
CN
China
Prior art keywords
rgb
fusion
modal
features
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111565805.7A
Other languages
Chinese (zh)
Inventor
段松松
夏晨星
黄荣梅
孙延光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202111565805.7A priority Critical patent/CN114283315A/en
Publication of CN114283315A publication Critical patent/CN114283315A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and provides an RGB-D saliency target detection method based on interactive guidance attention and trapezoidal pyramid fusion, which comprises the following steps: 1) acquiring an RGB-D data set for training and testing the task, and defining the algorithm target of the invention; 2) constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features; 3) establishing a cross-modal feature fusion network, and guiding the RGB image features and the Depth image features to carry out cross fusion through an interactively guided attention mechanism; 4) constructing an ultra-large-scale receptive field fusion mechanism to enhance the high-level semantic information of the multi-modal features; 5) building a decoder based on a trapezoidal pyramid feature fusion network to generate the saliency map P_est; 6) calculating the loss between the predicted saliency map P_est and the manually labeled salient object segmentation map P_GT; 7) testing on the test data set to generate saliency maps P_test and performing performance evaluation using the evaluation indexes.

Description

RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion
The technical field is as follows:
the invention relates to the field of computer vision and image processing, in particular to an RGB-D saliency target detection method based on interactive guidance attention and trapezoidal pyramid fusion.
Background art:
saliency target detection aims at locating the most striking targets or regions in given data (such as RGB images, RGB-D images, video, etc.) by simulating the human visual attention mechanism. In recent years, salient object detection has developed rapidly owing to its wide applicability and has been applied in many computer vision fields, such as image retrieval, video segmentation, semantic segmentation, video tracking, person reconstruction, thumbnail creation and quality evaluation.
Because a single-modality RGB salient object detection algorithm struggles in challenging scenes (e.g., complex backgrounds, salient objects highly similar to the background, low-contrast scenes), it is difficult for it to accurately and completely separate salient objects from the background. To address this problem, Depth images are introduced into salient object detection, which is then performed on RGB-D data formed by pairing an RGB image with a Depth image.
The Depth Map can provide much useful information, such as spatial structure, 3D layout and object edges. Introducing a Depth map into the salient object detection (SOD) task can therefore help SOD models handle challenging scenes such as complex backgrounds, low contrast, and salient objects similar in appearance to the background. How to use the Depth Map to help an RGB-D salient object detection model accurately locate salient objects is thus very important. Most previous RGB-D saliency target detection methods either extract features from the Depth Map as a data stream independent of the RGB image, or feed the Depth image into the RGB-D saliency detection model as a fourth channel of the RGB image. Such methods treat the RGB image and the Depth image indiscriminately and ignore the fact that different regions of the RGB image and the Depth image carry very different saliency information, and that the two modalities represent the information of the salient object differently.
Considering the ambiguity that exists between RGB image data and Depth image data across modalities, the invention explores an efficient cross-modal feature fusion method and uses it to effectively eliminate this cross-modal ambiguity. In addition, to further exploit the connection and cooperation among multi-scale features, the invention uses multi-scale feature information to improve detection performance, taking both high-level semantic information and low-level detail information into account so that the edge details and the overall completeness of the salient object can be perceived. The method further exploits the effect of the feature pyramid on multi-scale feature fusion, helping the saliency detection model predict the salient object more accurately.
The invention content is as follows:
aiming at the problems described above, the invention provides an RGB-D saliency target detection method based on interactive guidance attention and trapezoidal pyramid fusion, which specifically adopts the following technical scheme:
1. an RGB-D dataset is acquired that trains and tests the task.
1.1) taking the NJUD data set, the NLPR data set and the DUT-RGBD data set as the training set, and taking the remaining part of the NLPR data set, the remaining part of the DUT-RGBD data set, the SIP data set, the STERE data set and the SSD data set as the test sets.
1.2) each sample of the RGB-D image data set comprises a single RGB image P_RGB, a corresponding Depth image P_Depth and a corresponding manually labeled salient object segmentation map P_GT.
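For illustration only, the sketch below shows a minimal PyTorch-style dataset that loads such (P_RGB, P_Depth, P_GT) triplets; the directory layout, file naming and the 352×352 input size are assumptions of this sketch and are not specified by the invention.

import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class RGBDSODDataset(Dataset):
    # Loads (P_RGB, P_Depth, P_GT) triplets from parallel "RGB"/"Depth"/"GT" folders
    # (folder layout assumed; adapt to however the NJUD/NLPR/DUT-RGBD files are stored).
    def __init__(self, root, size=352):
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, "RGB")))
        self.to_tensor = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        rgb = Image.open(os.path.join(self.root, "RGB", name)).convert("RGB")
        # The Depth map is opened as a 3-channel image because the model feeds a
        # three-channel Depth image to the Depth encoder (replication is assumed here).
        depth = Image.open(os.path.join(self.root, "Depth", name)).convert("RGB")
        gt = Image.open(os.path.join(self.root, "GT", name)).convert("L")
        return self.to_tensor(rgb), self.to_tensor(depth), self.to_tensor(gt)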
2. Constructing a salient object detection model network for extracting RGB image features and Depth image features by using a convolutional neural network;
2.1) using VGG16 as the backbone network of the model of the invention for extracting the RGB image features and the corresponding Depth image features, denoted respectively as f_1^r, f_2^r, ..., f_5^r and f_1^d, f_2^d, ..., f_5^d.
2.2) the VGG16 weights of the backbone networks are initialized with VGG16 parameter weights pre-trained on the ImageNet data set.
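As a sketch of how the two backbones of step 2 can be organized (assuming torchvision's VGG16 and the usual five convolutional stages; the exact stage boundaries are an assumption of this sketch, not a statement of the invention's implementation):

import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Encoder(nn.Module):
    # Splits the VGG16 feature extractor into 5 stages producing f_1..f_5 (boundaries assumed).
    def __init__(self):
        super().__init__()
        feats = vgg16(pretrained=True).features  # ImageNet-pretrained weights (newer torchvision uses the `weights=` argument)
        self.stage1 = feats[:4]
        self.stage2 = feats[4:9]
        self.stage3 = feats[9:16]
        self.stage4 = feats[16:23]
        self.stage5 = feats[23:30]

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return [f1, f2, f3, f4, f5]

rgb_encoder, depth_encoder = VGG16Encoder(), VGG16Encoder()
# rgb_encoder(rgb)[i] and depth_encoder(depth)[i] correspond to f_(i+1)^r and f_(i+1)^d.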
3. Based on the multi-scale RGB image features f_1^r, ..., f_5^r extracted in step 2 and the corresponding Depth image features f_1^d, ..., f_5^d, multi-scale cross-modal feature interactive fusion is performed, and a cross-modal feature fusion network is constructed to generate the multi-modal features.
3.1) the cross-modal feature fusion network is composed of 5 levels of CMAF modules; from the 5 levels of RGB image features f_1^r, ..., f_5^r and the corresponding Depth image features f_1^d, ..., f_5^d it generates 5 levels of multi-modal features f_1^rd, ..., f_5^rd.
3.2) the input data of the CMAF module at the i-th level are f_i^r and f_i^d, which generate the multi-modal feature f_i^rd at level i through an interactively guided attention mechanism, where i ∈ {1, 2, 3, 4, 5}.
3.3) CMAF module generates multi-modal features through an interactively guided attention mechanism as follows:
3.3.1) firstly, a residual convolution module is constructed to enlarge the receptive field and enrich the semantic information of the features and to enhance their saliency expression capability; the RGB image features and the corresponding Depth image features are further enhanced by this residual convolution module.
3.3.2) the RGB image features and the corresponding Depth image features are further fused using element-wise matrix multiplication and element-wise matrix addition, and the fused features are then converted into a global context-aware attention weight W_s and a channel-aware attention weight W_c using the softmax activation function:

W_s = softmax(multi(Resconv(f_i^r), Resconv(f_i^d)))

W_c = softmax(GAP(add(Resconv(f_i^r), Resconv(f_i^d))))

where Resconv denotes the residual convolution module, multi denotes the element-wise matrix multiplication operation, add denotes the element-wise matrix addition operation, GAP denotes global average pooling, and softmax denotes the softmax activation function.
3.3.3) after obtaining the global context-aware attention weight W_s and the channel-aware attention weight W_c, W_s and W_c are combined with the enhanced RGB image features and the corresponding enhanced Depth image features respectively, and the weight matrices generated by the attention mechanism guide the features to focus on the salient regions, yielding the filtered features:

f'_i^α = add(multi(W_s, Resconv(f_i^α)), multi(W_c, Resconv(f_i^α)))

where α ∈ {r, d}; through this operation the filtered RGB image features f'_i^r and the corresponding filtered Depth image features f'_i^d are obtained.
3.3.4) the filtered RGB image features f'_i^r and the corresponding filtered Depth image features f'_i^d are fused across modalities by a cross-interactive fusion method to obtain the fused feature f_i^rd:

f_i^rd = conv3(cat(f'_i^r, f'_i^d))

where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, conv3 denotes a convolution with a 3 × 3 kernel, and cat denotes the feature concatenation operation.
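A minimal single-level sketch of the CMAF module, as one possible reading of steps 3.3.1–3.3.4, is given below; the residual-block design, the channel counts and the exact way the softmax is applied are assumptions of this sketch rather than the module actually claimed.

import torch
import torch.nn as nn

class ResConv(nn.Module):
    # Residual convolution block used to enhance f_i^r and f_i^d (block design assumed).
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class CMAF(nn.Module):
    # Cross-modal attention fusion for one level i (step 3.3).
    def __init__(self, channels):
        super().__init__()
        self.res_r = ResConv(channels)
        self.res_d = ResConv(channels)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)  # conv3 of step 3.3.4

    def forward(self, f_r, f_d):
        r, d = self.res_r(f_r), self.res_d(f_d)              # step 3.3.1: enhancement
        # step 3.3.2: W_s from element-wise multiplication (softmax over spatial positions)
        w_s = torch.softmax((r * d).flatten(2), dim=-1).view_as(r)
        # step 3.3.2: W_c from element-wise addition + global average pooling (softmax over channels)
        w_c = torch.softmax(self.gap(r + d), dim=1)
        # step 3.3.3: apply both attention weights to each enhanced modality
        r_hat = w_s * r + w_c * r
        d_hat = w_s * d + w_c * d
        # step 3.3.4: cross-interactive fusion by concatenation and a 3x3 convolution
        return self.fuse(torch.cat([r_hat, d_hat], dim=1))

One CMAF instance per level i then maps (f_i^r, f_i^d) to f_i^rd.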
4) through the above operations, the multi-modal features of 5 levels, f_1^rd, ..., f_5^rd, are extracted. The 5 levels are input into a dense dilated (hole) convolution module, and the receptive field information and the high-level semantic information of the multi-modal features are enhanced through multi-layer hole convolution operations.
4.1) ultra-large-scale receptive field information is extracted from the multi-scale multi-modal features through hole convolution operations, using hole convolutions with different hole rates:

f_i^(j) = DLA_j(f_i^rd), j ∈ {1, 2, 4, 8}

where i ∈ {1, 2, 3, 4, 5} denotes the level at which the multi-modal feature resides, DLA_j() denotes a hole convolution operation with hole rate j, so that DLA_1(), DLA_2(), DLA_4() and DLA_8() are hole convolution operations with hole rates of 1, 2, 4 and 8 respectively, and f_i^(1), f_i^(2), f_i^(4) and f_i^(8) denote the features with hole rate j generated from the multi-modal feature of the i-th level.
4.2) the multi-receptive-field multi-modal features generated above are input into the trapezoidal pyramid feature fusion network, which fuses the multi-modal features of different receptive fields:

f_i = TPNet(f_i^(1), f_i^(2), f_i^(4), f_i^(8))

where TPNet denotes the trapezoidal pyramid feature fusion network.
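The receptive-field enhancement of step 4 can be sketched as the multi-rate dilated (hole) convolution module below; the channel widths are assumptions, and the trapezoidal fusion of step 4.2 is only approximated here by concatenation followed by a 3×3 convolution rather than the actual TPNet.

import torch
import torch.nn as nn

class LargeScaleRF(nn.Module):
    # Hole (dilated) convolutions with rates 1, 2, 4 and 8 applied to one multi-modal feature f_i^rd.
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate)  # DLA_j of step 4.1
            for rate in (1, 2, 4, 8)
        ])
        # Stand-in for the trapezoidal fusion of step 4.2: concatenate the four branches and reduce.
        self.merge = nn.Conv2d(4 * channels, channels, 3, padding=1)

    def forward(self, f_rd):
        branch_feats = [branch(f_rd) for branch in self.branches]
        return self.merge(torch.cat(branch_feats, dim=1))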
5) the 5 levels of ultra-large-scale receptive field multi-modal features obtained in step 4 are input into a decoder formed by the trapezoidal pyramid feature fusion network to obtain the final fused features, and the predicted saliency map P_est is obtained after sigmoid activation:

P_est = sigmoid(TPNet(f_1, f_2, f_3, f_4, f_5))    Equation (7)
6) the loss function is computed between the saliency map P_est predicted by the invention and the manually labeled salient object segmentation map P_GT, the parameter weights of the model provided by the invention are gradually updated through SGD (stochastic gradient descent) and the back-propagation algorithm, and the structure and parameter weights of the RGB-D salient object detection algorithm are finally determined.
7) with the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate saliency maps P_test, which are evaluated using the MAE, S-measure, F-measure and E-measure evaluation indexes.
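As an illustration of steps 6 and 7, a minimal training and evaluation sketch is given below; binary cross-entropy is assumed as the loss function (the patent does not name it), the SGD hyper-parameters are placeholders, and only the MAE index is shown since S-measure, F-measure and E-measure follow their standard definitions.

import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cuda"):
    # Step 6: one epoch of SGD training against the manually labeled maps P_GT (BCE loss assumed).
    model.train()
    criterion = nn.BCELoss()  # P_est is already sigmoid-activated by the decoder
    for rgb, depth, gt in loader:
        rgb, depth, gt = rgb.to(device), depth.to(device), gt.to(device)
        p_est = model(rgb, depth)        # predicted saliency map P_est
        loss = criterion(p_est, gt)      # loss between P_est and P_GT
        optimizer.zero_grad()
        loss.backward()                  # back propagation
        optimizer.step()                 # SGD parameter update

@torch.no_grad()
def mean_absolute_error(model, loader, device="cuda"):
    # Step 7: MAE between predicted saliency maps P_test and ground truth on a test set.
    model.eval()
    total, batches = 0.0, 0
    for rgb, depth, gt in loader:
        p_test = model(rgb.to(device), depth.to(device))
        total += torch.abs(p_test - gt.to(device)).mean().item()
        batches += 1
    return total / batches

# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)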
The invention realizes multi-modal salient object detection based on a deep convolutional neural network. It exploits the rich spatial structure information in the Depth image and fuses it with the features extracted from the RGB image through interactively guided attention, so it can adapt to the requirements of salient object detection in different scenes and in particular shows robustness in challenging scenes (complex backgrounds, low contrast, transparent objects, etc.). Compared with prior RGB-D salient object detection methods, the method has the following benefits:
firstly, deep learning is used to model the relationship between an RGB-D image pair and the salient object in the image through an encoder and a decoder, and the saliency prediction is obtained through the extraction and fusion of cross-modal features.
Secondly, through the interactive fusion mode, the complementary information that the Depth image features provide to the RGB image features is effectively modulated; the depth distribution information of the Depth features guides the cross-modal feature fusion, interference from background information in the RGB image is suppressed, and a foundation is laid for the salient object prediction of the next stage.
Finally, multi-scale multi-modal feature fusion is performed through the constructed trapezoidal pyramid feature fusion network, and the final saliency map is predicted.
Drawings
FIG. 1 is a schematic diagram of the model structure of the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of a very large scale receptive field fusion module
FIG. 4 is a schematic diagram of a trapezoidal pyramid feature fusion network (TPNet)
FIG. 5 is a schematic diagram of model training and testing
FIG. 6 is a comparison graph of results of the present invention and other RGB-D saliency target detection methods
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the examples of the present invention, and the described examples are only a part of the examples of the present invention, but not all of the examples. Based on the examples of the present invention, all other examples obtained by a person of ordinary skill in the art without any creative effort belong to the protection scope of the present invention.
Referring to fig. 1, an RGB-D saliency target detection method based on interactive attention-guiding and trapezoidal pyramid fusion mainly includes the following steps:
1. an RGB-D dataset is acquired for training and testing the task, the algorithm goals of the present invention are defined, and a training set and a test set for training and testing the algorithm are determined. The NJUD data set, the NLPR data set and the DUT-RGBD data set are used as the training set, and the remaining data are used as the test sets, comprising the SIP data set, the remaining part of the NLPR data set, the remaining part of the DUT-RGBD data set, the STERE data set and the SSD data set.
2. Constructing a salient object detection model network for extracting RGB image features and Depth image features by utilizing a convolutional neural network, wherein the salient object detection model network comprises an RGB encoder for extracting the RGB image features and a Depth encoder for extracting the Depth image features:
2.1. the three-channel RGB image is input into the RGB encoder to generate RGB image features at 5 levels, namely f_1^r, f_2^r, f_3^r, f_4^r and f_5^r.
2.2. the three-channel Depth image is input into the Depth encoder to generate Depth image features at 5 levels, namely f_1^d, f_2^d, f_3^d, f_4^d and f_5^d.
3. referring to FIG. 2, the 5 levels of RGB image features f_1^r, ..., f_5^r generated in step 2 and the Depth image features f_1^d, ..., f_5^d are interactively fused by the cross-modal fusion module to obtain 5 levels of multi-modal features f_1^rd, ..., f_5^rd.
the main steps are as follows:
3.1. the cross-modal feature fusion network is composed of 5 levels of CMAF modules; from the 5 levels of RGB image features f_1^r, ..., f_5^r and the corresponding Depth image features f_1^d, ..., f_5^d it generates 5 levels of multi-modal features f_1^rd, ..., f_5^rd.
3.2. the input data of the CMAF module at the i-th level are f_i^r and f_i^d, and the multi-modal feature f_i^rd at level i is output via an interactively guided attention mechanism, where i ∈ {1, 2, 3, 4, 5}.
The CMAF module generates the multi-modal features through an interactively guided attention mechanism by the following specific process:
3.3.1. firstly, the invention constructs a residual convolution module to enlarge the receptive field and enrich the semantic information of the features and to enhance their saliency expression capability; the RGB image features and the corresponding Depth image features are further enhanced by this residual convolution module.
3.3.2. the RGB image features and the corresponding Depth image features are further fused using element-wise matrix multiplication and element-wise matrix addition, and the fused features are then converted into a global context-aware attention weight W_s and a channel-aware attention weight W_c using the softmax activation function:

W_s = softmax(multi(Resconv(f_i^r), Resconv(f_i^d)))

W_c = softmax(GAP(add(Resconv(f_i^r), Resconv(f_i^d))))

where Resconv denotes the residual convolution module, multi denotes the element-wise matrix multiplication operation, add denotes the element-wise matrix addition operation, GAP denotes global average pooling, and softmax denotes the softmax activation function.
3.3.3. after obtaining the global context-aware attention weight W_s and the channel-aware attention weight W_c, W_s and W_c are combined with the enhanced RGB image features and the corresponding enhanced Depth image features respectively, and the weight matrices generated by the attention mechanism guide the features to focus on the salient regions, yielding the filtered features:

f'_i^α = add(multi(W_s, Resconv(f_i^α)), multi(W_c, Resconv(f_i^α)))

where α ∈ {r, d}; through this operation the filtered RGB image features f'_i^r and the corresponding filtered Depth image features f'_i^d are obtained.
3.3.4. the filtered RGB image features f'_i^r and the corresponding filtered Depth image features f'_i^d are fused across modalities by a cross-interactive fusion method to obtain the fused feature f_i^rd:

f_i^rd = conv3(cat(f'_i^r, f'_i^d))

where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, conv3 denotes a convolution with a 3 × 3 kernel, and cat denotes the feature concatenation operation.
4. Referring to fig. 3, the super-large scale receptive field fusion module is used to enhance the receptive field information and high-level semantic information of the multi-modal features:
4.1) ultra-large-scale receptive field information is extracted from the multi-scale multi-modal features through hole convolution operations, using hole convolutions with different hole rates:

f_i^(j) = DLA_j(f_i^rd), j ∈ {1, 2, 4, 8}

where i ∈ {1, 2, 3, 4, 5} denotes the level at which the multi-modal feature resides, DLA_j() denotes a hole convolution operation with hole rate j, so that DLA_1(), DLA_2(), DLA_4() and DLA_8() are hole convolution operations with hole rates of 1, 2, 4 and 8 respectively, and f_i^(1), f_i^(2), f_i^(4) and f_i^(8) denote the features with hole rate j generated from the multi-modal feature of the i-th level.
4.2) the multi-receptive-field multi-modal features generated above are input into the trapezoidal pyramid feature fusion network, which fuses the multi-modal features of different receptive fields:

f_i = TPNet(f_i^(1), f_i^(2), f_i^(4), f_i^(8))

where TPNet() denotes the trapezoidal pyramid feature fusion network.
5. Referring to FIG. 4, the decoder of the proposed algorithm is built on the trapezoidal pyramid: the 5 levels of enhanced multi-modal features f_1, f_2, f_3, f_4 and f_5 are input into the decoder, and the predicted saliency map P_est is obtained after sigmoid activation:

P_est = sigmoid(TPNet(f_1, f_2, f_3, f_4, f_5))    Equation (7)
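To make the FIG. 4 decoder concrete, a simplified trapezoidal-pyramid-style fusion sketch is given below: the five enhanced features are merged level by level from deep to shallow, so the set of remaining feature maps shrinks like a trapezoid at each stage. The channel widths, the bilinear upsampling and the 1×1 prediction head are assumptions of this sketch and not the exact TPNet of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrapezoidDecoder(nn.Module):
    # Progressive deep-to-shallow fusion of f_1..f_5 (a simplified stand-in for TPNet).
    def __init__(self, channels=(64, 128, 256, 512, 512), mid=64):
        super().__init__()
        self.squeeze = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in channels])
        self.fuse = nn.ModuleList([nn.Conv2d(2 * mid, mid, 3, padding=1) for _ in range(4)])
        self.head = nn.Conv2d(mid, 1, 1)

    def forward(self, feats):                        # feats = [f_1, f_2, f_3, f_4, f_5]
        feats = [s(f) for s, f in zip(self.squeeze, feats)]
        x = feats[-1]                                # start from the deepest level f_5
        for i in range(3, -1, -1):                   # fuse f_4, f_3, f_2, f_1 in turn
            x = F.interpolate(x, size=feats[i].shape[-2:], mode="bilinear",
                              align_corners=False)
            x = self.fuse[i](torch.cat([x, feats[i]], dim=1))
        return torch.sigmoid(self.head(x))           # predicted saliency map P_est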
6) the loss function is computed between the saliency map P_est predicted by the invention and the manually labeled salient object segmentation map P_GT, the parameter weights of the model provided by the invention are gradually updated through SGD (stochastic gradient descent) and the back-propagation algorithm, and the structure and parameter weights of the RGB-D saliency detection algorithm are finally determined.
7) with the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate saliency maps P_test, which are evaluated using the MAE, S-measure, F-measure and E-measure evaluation indexes.
The above description is for the purpose of illustrating preferred embodiments of the present application and is not intended to limit the present application, and it will be apparent to those skilled in the art that various modifications and variations can be made in the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (5)

1. An RGB-D saliency target detection method based on interactive attention guidance and trapezoidal pyramid fusion is characterized by comprising the following steps:
1) acquiring an RGB-D data set for training and testing the task, defining an algorithm target of the invention, and determining a training set and a testing set for training and testing the algorithm;
2) constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features;
3) establishing a cross-modal characteristic fusion network, and guiding the RGB image characteristics and Depth image characteristics to carry out cross fusion through an attention mechanism guided by an interactive mode;
4) constructing an ultra-large-scale receptive field fusion mechanism based on the multi-modal features fused by the cross-modal features to enhance the receptive field information and the high-level semantic information of the multi-modal features;
5) establishing a decoder based on a trapezoidal pyramid feature fusion network, and obtaining a final predicted saliency map through an activation function;
6) calculating a loss function between the predicted saliency map P_est and the manually labeled salient object segmentation map P_GT, gradually updating the parameter weights of the model provided by the invention through SGD (stochastic gradient descent) and the back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D saliency detection algorithm;
7) on the basis of the model structure and parameter weights determined in step 6), testing the RGB-D image pairs of the test set to generate saliency maps P_test, and performing performance evaluation using the evaluation indexes.
2. The RGB-D saliency target detection method based on interactive attention-guiding and trapezoidal pyramid fusion of claim 1, characterized in that: the specific method of the step 2) is as follows:
2.1) taking the NJUD data set, the NLPR data set and the DUT-RGBD data set as the training set, and taking the remaining part of the NLPR data set, the remaining part of the DUT-RGBD data set, the SIP data set, the STERE data set and the SSD data set as the test sets.
2.2) each sample of the RGB-D image data set comprises a single RGB image P_RGB, a corresponding Depth image P_Depth and a corresponding manually labeled salient object segmentation map P_GT.
3. The RGB-D saliency target detection method based on interactive attention-guiding and trapezoidal pyramid fusion of claim 1, characterized in that: the specific method of the step 3) is as follows:
3.1) using VGG16 as the backbone network of the model of the invention for extracting the RGB image features and the corresponding Depth image features, denoted respectively as f_1^r, ..., f_5^r and f_1^d, ..., f_5^d;
3.2) initializing the VGG16 weights of the backbone networks with VGG16 parameter weights pre-trained on the ImageNet data set.
4. The RGB-D saliency target detection method based on interactive attention-guiding and trapezoidal pyramid fusion of claim 1, characterized in that: the specific method of the step 4) is as follows:
4.1) the cross-modal feature fusion network is composed of 5 levels of CMAF modules and generates 5 levels of multi-modal features f_1^rd, ..., f_5^rd;
4.2) the input data of the CMAF module at the i-th level are f_i^r and f_i^d, and the multi-modal feature f_i^rd at level i is output via an interactively guided attention mechanism, where i ∈ {1, 2, 3, 4, 5}.
5. The RGB-D saliency target detection method based on interactive guiding attention and trapezoidal pyramid fusion of claim 1, characterized in that: the specific method of the step 5) is as follows:
5.1) extracting ultra-large-scale receptive field information from the multi-scale multi-modal features through hole convolution operations, using hole convolutions with different hole rates:

f_i^(j) = DLA_j(f_i^rd), j ∈ {1, 2, 4, 8}    equation (1)

where i ∈ {1, 2, 3, 4, 5} denotes the level at which the multi-modal feature resides, DLA_j() denotes a hole convolution operation with hole rate j, so that DLA_1(), DLA_2(), DLA_4() and DLA_8() are hole convolution operations with hole rates of 1, 2, 4 and 8 respectively, and f_i^(1), f_i^(2), f_i^(4) and f_i^(8) denote the features with hole rate j generated from the multi-modal feature of the i-th level;
5.2) inputting the multi-receptive-field multi-modal features generated above into the trapezoidal pyramid feature fusion network, which fuses the multi-modal features of different receptive fields:

f_i = TPNet(f_i^(1), f_i^(2), f_i^(4), f_i^(8))    equation (2)

where TPNet() denotes the trapezoidal pyramid feature fusion network;
6) inputting the 5 levels of ultra-large-scale receptive field multi-modal features obtained in step 5 into a decoder formed by the trapezoidal pyramid feature fusion network to obtain the final fused features, and obtaining the predicted saliency map P_est after sigmoid activation:

P_est = sigmoid(TPNet(f_1, f_2, f_3, f_4, f_5))    Equation (3)
7) calculating a loss function between the predicted saliency map P_est and the manually labeled salient object segmentation map P_GT, gradually updating the parameter weights of the model provided by the invention through SGD (stochastic gradient descent) and the back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
CN202111565805.7A 2021-12-17 2021-12-17 RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion Pending CN114283315A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111565805.7A CN114283315A (en) 2021-12-17 2021-12-17 RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111565805.7A CN114283315A (en) 2021-12-17 2021-12-17 RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion

Publications (1)

Publication Number Publication Date
CN114283315A true CN114283315A (en) 2022-04-05

Family

ID=80873250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111565805.7A Pending CN114283315A (en) 2021-12-17 2021-12-17 RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion

Country Status (1)

Country Link
CN (1) CN114283315A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082553A (en) * 2022-08-23 2022-09-20 青岛云智聚智能科技有限公司 Logistics package position detection method and system
CN115439726A (en) * 2022-11-07 2022-12-06 腾讯科技(深圳)有限公司 Image detection method, device, equipment and storage medium
CN117854009A (en) * 2024-01-29 2024-04-09 南通大学 Cross-collaboration fusion light-weight cross-modal crowd counting method

Similar Documents

Publication Publication Date Title
CN113240691B (en) Medical image segmentation method based on U-shaped network
CN114283315A (en) RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion
CN110880019B (en) Method for adaptively training target domain classification model through unsupervised domain
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN109874053A (en) The short video recommendation method with user's dynamic interest is understood based on video content
CN112287170B (en) Short video classification method and device based on multi-mode joint learning
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN113284100B (en) Image quality evaluation method based on recovery image to mixed domain attention mechanism
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN114283316A (en) Image identification method and device, electronic equipment and storage medium
CN113033454B (en) Method for detecting building change in urban video shooting
CN111275784A (en) Method and device for generating image
CN113435269A (en) Improved water surface floating object detection and identification method and system based on YOLOv3
CN111724400A (en) Automatic video matting method and system
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114693952A (en) RGB-D significance target detection method based on multi-modal difference fusion network
CN111368142A (en) Video intensive event description method based on generation countermeasure network
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN116310396A (en) RGB-D significance target detection method based on depth quality weighting
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN115965968A (en) Small sample target detection and identification method based on knowledge guidance
CN114926734A (en) Solid waste detection device and method based on feature aggregation and attention fusion
CN114299305A (en) Salient object detection algorithm for aggregating dense and attention multi-scale features
CN116486112A (en) RGB-D significance target detection method based on lightweight cross-modal fusion network
Li et al. HRVQA: A Visual Question Answering benchmark for high-resolution aerial images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination