CN114283315A - RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion - Google Patents

RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion

Info

Publication number
CN114283315A
CN114283315A
Authority
CN
China
Prior art keywords
rgb
fusion
modal
features
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111565805.7A
Other languages
Chinese (zh)
Inventor
段松松
夏晨星
黄荣梅
孙延光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202111565805.7A priority Critical patent/CN114283315A/en
Publication of CN114283315A publication Critical patent/CN114283315A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and provides an RGB-D saliency target detection method based on interactive guidance attention and trapezoidal pyramid fusion, which comprises the following steps: 1) acquiring an RGB-D data set for training and testing the task, and defining the algorithm target of the invention; 2) constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features; 3) establishing a cross-modal feature fusion network, and guiding the RGB image features and the Depth image features to carry out cross fusion through an interactively guided attention mechanism; 4) constructing an ultra-large-scale receptive field fusion mechanism to enhance the high-level semantic information of the multi-modal features; 5) building a decoder based on a trapezoidal pyramid feature fusion network to generate the saliency map P_est; 6) calculating the loss between the predicted saliency map P_est and the manually labeled salient object segmentation map P_GT; 7) testing on the test data set to generate saliency maps P_test and performing performance evaluation using the evaluation indexes.

Description

RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion
The technical field is as follows:
the invention relates to the field of computer vision and image processing, in particular to an RGB-D saliency target detection method based on interactive guidance attention and trapezoidal pyramid fusion.
Background art:
saliency target detection aims at locating the most striking targets or regions in given data (such as RGB images, RGB-D images, video, etc.) by simulating the human visual attention mechanism. In recent years, salient object detection has developed rapidly owing to its wide applicability and has been applied in many computer vision fields, such as image retrieval, video segmentation, semantic segmentation, video tracking, person reconstruction, thumbnail creation and quality evaluation.
Because a single-modality RGB salient object detection algorithm struggles in challenging scenes (e.g., complex backgrounds, salient objects highly similar to the background, low-contrast scenes), it is difficult for it to accurately and completely separate salient objects from the background. To address this problem, Depth images are introduced into salient object detection, which is then performed on RGB-D data formed by pairing an RGB image with a Depth image.
The Depth Map can provide much useful information, such as spatial structure, 3D layout and object edges. Introducing a Depth map into the salient object detection (SOD) task can therefore help SOD models handle challenging scenes such as complex backgrounds, low contrast, and salient objects similar in appearance to the background. How to use the Depth Map to help an RGB-D salient object detection model accurately locate salient objects is thus very important. Most previous RGB-D saliency target detection methods either extract features from the Depth Map as a data stream independent of the RGB image, or feed the Depth image into the RGB-D saliency detection model as a fourth channel of the RGB image. Such methods treat the RGB image and the Depth image indiscriminately and ignore the fact that different regions of the RGB image and the Depth image carry very different saliency information, and that the two modalities represent the information of the salient object differently.
Considering the ambiguity that exists between RGB image data and Depth image data across modalities, the invention explores an efficient cross-modal feature fusion method and uses it to effectively eliminate this cross-modal ambiguity. In addition, to further exploit the connection and cooperation among multi-scale features, the invention uses multi-scale feature information to improve detection performance, taking both high-level semantic information and low-level detail information into account so that the edge details and the overall completeness of the salient object can be perceived. The method further exploits the effect of the feature pyramid on multi-scale feature fusion, helping the saliency detection model predict the salient object more accurately.
The invention content is as follows:
aiming at the problems described above, the invention provides an RGB-D saliency target detection method based on interactive guidance attention and trapezoidal pyramid fusion, which specifically adopts the following technical scheme:
1. an RGB-D dataset is acquired that trains and tests the task.
1.1) taking the NJUD data set, the NLPR data set and the DUT-RGBD data set as the training set, and taking the remaining part of the NLPR data set, the remaining part of the DUT-RGBD data set, the SIP data set, the STERE data set and the SSD data set as the test sets.
1.2) each sample of the RGB-D image data set comprises a single RGB image P_RGB, a corresponding Depth image P_Depth and a corresponding manually labeled salient object segmentation map P_GT.
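For illustration only, the sketch below shows a minimal PyTorch-style dataset that loads such (P_RGB, P_Depth, P_GT) triplets; the directory layout, file naming and the 352×352 input size are assumptions of this sketch and are not specified by the invention.

import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class RGBDSODDataset(Dataset):
    # Loads (P_RGB, P_Depth, P_GT) triplets from parallel "RGB"/"Depth"/"GT" folders
    # (folder layout assumed; adapt to however the NJUD/NLPR/DUT-RGBD files are stored).
    def __init__(self, root, size=352):
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, "RGB")))
        self.to_tensor = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        rgb = Image.open(os.path.join(self.root, "RGB", name)).convert("RGB")
        # The Depth map is opened as a 3-channel image because the model feeds a
        # three-channel Depth image to the Depth encoder (replication is assumed here).
        depth = Image.open(os.path.join(self.root, "Depth", name)).convert("RGB")
        gt = Image.open(os.path.join(self.root, "GT", name)).convert("L")
        return self.to_tensor(rgb), self.to_tensor(depth), self.to_tensor(gt)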
2. Constructing a salient object detection model network for extracting RGB image features and Depth image features by using a convolutional neural network;
2.1) using VGG16 as the backbone network of the model of the invention for extracting the RGB image features and the corresponding Depth image features, denoted respectively as f_1^r, f_2^r, ..., f_5^r and f_1^d, f_2^d, ..., f_5^d.
2.2) the VGG16 weights of the backbone networks are initialized with VGG16 parameter weights pre-trained on the ImageNet data set.
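As a sketch of how the two backbones of step 2 can be organized (assuming torchvision's VGG16 and the usual five convolutional stages; the exact stage boundaries are an assumption of this sketch, not a statement of the invention's implementation):

import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Encoder(nn.Module):
    # Splits the VGG16 feature extractor into 5 stages producing f_1..f_5 (boundaries assumed).
    def __init__(self):
        super().__init__()
        feats = vgg16(pretrained=True).features  # ImageNet-pretrained weights (newer torchvision uses the `weights=` argument)
        self.stage1 = feats[:4]
        self.stage2 = feats[4:9]
        self.stage3 = feats[9:16]
        self.stage4 = feats[16:23]
        self.stage5 = feats[23:30]

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return [f1, f2, f3, f4, f5]

rgb_encoder, depth_encoder = VGG16Encoder(), VGG16Encoder()
# rgb_encoder(rgb)[i] and depth_encoder(depth)[i] correspond to f_(i+1)^r and f_(i+1)^d.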
3. Based on the multi-scale RGB image features f_1^r, ..., f_5^r extracted in step 2 and the corresponding Depth image features f_1^d, ..., f_5^d, multi-scale cross-modal feature interactive fusion is performed, and a cross-modal feature fusion network is constructed to generate the multi-modal features.
3.1) the cross-modal feature fusion network is composed of 5 levels of CMAF modules; from the 5 levels of RGB image features f_1^r, ..., f_5^r and the corresponding Depth image features f_1^d, ..., f_5^d it generates 5 levels of multi-modal features f_1^rd, ..., f_5^rd.
3.2) the input data of the CMAF module at the i-th level are f_i^r and f_i^d, which generate the multi-modal feature f_i^rd at level i through an interactively guided attention mechanism, where i ∈ {1, 2, 3, 4, 5}.
3.3) CMAF module generates multi-modal features through an interactively guided attention mechanism as follows:
3.3.1) firstly, a residual convolution module is constructed to enlarge the receptive field and enrich the semantic information of the features and to enhance their saliency expression capability; the RGB image features and the corresponding Depth image features are further enhanced by this residual convolution module.
3.3.2) the RGB image features and the corresponding Depth image features are further fused using element-wise matrix multiplication and element-wise matrix addition, and the fused features are then converted into a global context-aware attention weight W_s and a channel-aware attention weight W_c using the softmax activation function:

W_s = softmax(multi(Resconv(f_i^r), Resconv(f_i^d)))

W_c = softmax(GAP(add(Resconv(f_i^r), Resconv(f_i^d))))

where Resconv denotes the residual convolution module, multi denotes the element-wise matrix multiplication operation, add denotes the element-wise matrix addition operation, GAP denotes global average pooling, and softmax denotes the softmax activation function.
3.3.3) after obtaining the global context-aware attention weight W_s and the channel-aware attention weight W_c, W_s and W_c are combined with the enhanced RGB image features and the corresponding enhanced Depth image features respectively, and the weight matrices generated by the attention mechanism guide the features to focus on the salient regions, yielding the filtered features:

f'_i^α = add(multi(W_s, Resconv(f_i^α)), multi(W_c, Resconv(f_i^α)))

where α ∈ {r, d}; through this operation the filtered RGB image features f'_i^r and the corresponding filtered Depth image features f'_i^d are obtained.
3.3.4) the filtered RGB image features f'_i^r and the corresponding filtered Depth image features f'_i^d are fused across modalities by a cross-interactive fusion method to obtain the fused feature f_i^rd:

f_i^rd = conv3(cat(f'_i^r, f'_i^d))

where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, conv3 denotes a convolution with a 3 × 3 kernel, and cat denotes the feature concatenation operation.
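A minimal single-level sketch of the CMAF module, as one possible reading of steps 3.3.1–3.3.4, is given below; the residual-block design, the channel counts and the exact way the softmax is applied are assumptions of this sketch rather than the module actually claimed.

import torch
import torch.nn as nn

class ResConv(nn.Module):
    # Residual convolution block used to enhance f_i^r and f_i^d (block design assumed).
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class CMAF(nn.Module):
    # Cross-modal attention fusion for one level i (step 3.3).
    def __init__(self, channels):
        super().__init__()
        self.res_r = ResConv(channels)
        self.res_d = ResConv(channels)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)  # conv3 of step 3.3.4

    def forward(self, f_r, f_d):
        r, d = self.res_r(f_r), self.res_d(f_d)              # step 3.3.1: enhancement
        # step 3.3.2: W_s from element-wise multiplication (softmax over spatial positions)
        w_s = torch.softmax((r * d).flatten(2), dim=-1).view_as(r)
        # step 3.3.2: W_c from element-wise addition + global average pooling (softmax over channels)
        w_c = torch.softmax(self.gap(r + d), dim=1)
        # step 3.3.3: apply both attention weights to each enhanced modality
        r_hat = w_s * r + w_c * r
        d_hat = w_s * d + w_c * d
        # step 3.3.4: cross-interactive fusion by concatenation and a 3x3 convolution
        return self.fuse(torch.cat([r_hat, d_hat], dim=1))

One CMAF instance per level i then maps (f_i^r, f_i^d) to f_i^rd.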
4) through the above operations, the multi-modal features of 5 levels, f_1^rd, ..., f_5^rd, are extracted. The 5 levels are input into a dense dilated (hole) convolution module, and the receptive field information and the high-level semantic information of the multi-modal features are enhanced through multi-layer hole convolution operations.
4.1) ultra-large-scale receptive field information is extracted from the multi-scale multi-modal features through hole convolution operations, using hole convolutions with different hole rates:

f_i^(j) = DLA_j(f_i^rd), j ∈ {1, 2, 4, 8}

where i ∈ {1, 2, 3, 4, 5} denotes the level at which the multi-modal feature resides, DLA_j() denotes a hole convolution operation with hole rate j, so that DLA_1(), DLA_2(), DLA_4() and DLA_8() are hole convolution operations with hole rates of 1, 2, 4 and 8 respectively, and f_i^(1), f_i^(2), f_i^(4) and f_i^(8) denote the features with hole rate j generated from the multi-modal feature of the i-th level.
4.2) the multi-receptive-field multi-modal features generated above are input into the trapezoidal pyramid feature fusion network, which fuses the multi-modal features of different receptive fields:

f_i = TPNet(f_i^(1), f_i^(2), f_i^(4), f_i^(8))

where TPNet denotes the trapezoidal pyramid feature fusion network.
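The receptive-field enhancement of step 4 can be sketched as the multi-rate dilated (hole) convolution module below; the channel widths are assumptions, and the trapezoidal fusion of step 4.2 is only approximated here by concatenation followed by a 3×3 convolution rather than the actual TPNet.

import torch
import torch.nn as nn

class LargeScaleRF(nn.Module):
    # Hole (dilated) convolutions with rates 1, 2, 4 and 8 applied to one multi-modal feature f_i^rd.
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate)  # DLA_j of step 4.1
            for rate in (1, 2, 4, 8)
        ])
        # Stand-in for the trapezoidal fusion of step 4.2: concatenate the four branches and reduce.
        self.merge = nn.Conv2d(4 * channels, channels, 3, padding=1)

    def forward(self, f_rd):
        branch_feats = [branch(f_rd) for branch in self.branches]
        return self.merge(torch.cat(branch_feats, dim=1))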
5) the 5 levels of ultra-large-scale receptive field multi-modal features obtained in step 4 are input into a decoder formed by the trapezoidal pyramid feature fusion network to obtain the final fused features, and the predicted saliency map P_est is obtained after sigmoid activation:

P_est = sigmoid(TPNet(f_1, f_2, f_3, f_4, f_5))    Equation (7)
6) the loss function is computed between the saliency map P_est predicted by the invention and the manually labeled salient object segmentation map P_GT, the parameter weights of the model provided by the invention are gradually updated through SGD (stochastic gradient descent) and the back-propagation algorithm, and the structure and parameter weights of the RGB-D salient object detection algorithm are finally determined.
7) with the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate saliency maps P_test, which are evaluated using the MAE, S-measure, F-measure and E-measure evaluation indexes.
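As an illustration of steps 6 and 7, a minimal training and evaluation sketch is given below; binary cross-entropy is assumed as the loss function (the patent does not name it), the SGD hyper-parameters are placeholders, and only the MAE index is shown since S-measure, F-measure and E-measure follow their standard definitions.

import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cuda"):
    # Step 6: one epoch of SGD training against the manually labeled maps P_GT (BCE loss assumed).
    model.train()
    criterion = nn.BCELoss()  # P_est is already sigmoid-activated by the decoder
    for rgb, depth, gt in loader:
        rgb, depth, gt = rgb.to(device), depth.to(device), gt.to(device)
        p_est = model(rgb, depth)        # predicted saliency map P_est
        loss = criterion(p_est, gt)      # loss between P_est and P_GT
        optimizer.zero_grad()
        loss.backward()                  # back propagation
        optimizer.step()                 # SGD parameter update

@torch.no_grad()
def mean_absolute_error(model, loader, device="cuda"):
    # Step 7: MAE between predicted saliency maps P_test and ground truth on a test set.
    model.eval()
    total, batches = 0.0, 0
    for rgb, depth, gt in loader:
        p_test = model(rgb.to(device), depth.to(device))
        total += torch.abs(p_test - gt.to(device)).mean().item()
        batches += 1
    return total / batches

# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)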
The invention realizes multi-modal salient object detection based on a deep convolutional neural network. It exploits the rich spatial structure information in the Depth image and fuses it with the features extracted from the RGB image through interactively guided attention, so it can adapt to the requirements of salient object detection in different scenes and in particular shows robustness in challenging scenes (complex backgrounds, low contrast, transparent objects, etc.). Compared with prior RGB-D salient object detection methods, the method has the following benefits:
firstly, deep learning is used to model the relationship between an RGB-D image pair and the salient object in the image through an encoder and a decoder, and the saliency prediction is obtained through the extraction and fusion of cross-modal features.
Secondly, through the interactive fusion mode, the complementary information that the Depth image features provide to the RGB image features is effectively modulated; the depth distribution information of the Depth features guides the cross-modal feature fusion, interference from background information in the RGB image is suppressed, and a foundation is laid for the salient object prediction of the next stage.
Finally, multi-scale multi-modal feature fusion is performed through the constructed trapezoidal pyramid feature fusion network, and the final saliency map is predicted.
Drawings
FIG. 1 is a schematic diagram of the model structure of the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of a very large scale receptive field fusion module
FIG. 4 is a schematic diagram of a trapezoidal pyramid feature fusion network (TPNet)
FIG. 5 is a schematic diagram of model training and testing
FIG. 6 is a comparison graph of results of the present invention and other RGB-D saliency target detection methods
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the examples of the present invention, and the described examples are only a part of the examples of the present invention, but not all of the examples. Based on the examples of the present invention, all other examples obtained by a person of ordinary skill in the art without any creative effort belong to the protection scope of the present invention.
Referring to fig. 1, an RGB-D saliency target detection method based on interactive attention-guiding and trapezoidal pyramid fusion mainly includes the following steps:
1. an RGB-D dataset is acquired for training and testing the task, the algorithm goals of the present invention are defined, and a training set and a test set for training and testing the algorithm are determined. The NJUD data set, the NLPR data set and the DUT-RGBD data set are used as the training set, and the remaining data are used as the test sets, comprising the SIP data set, the remaining part of the NLPR data set, the remaining part of the DUT-RGBD data set, the STERE data set and the SSD data set.
2. Constructing a salient object detection model network for extracting RGB image features and Depth image features by utilizing a convolutional neural network, wherein the salient object detection model network comprises an RGB encoder for extracting the RGB image features and a Depth encoder for extracting the Depth image features:
2.1. the three-channel RGB image is input into the RGB encoder to generate RGB image features at 5 levels, namely f_1^r, f_2^r, f_3^r, f_4^r and f_5^r.
2.2. the three-channel Depth image is input into the Depth encoder to generate Depth image features at 5 levels, namely f_1^d, f_2^d, f_3^d, f_4^d and f_5^d.
3. referring to FIG. 2, the 5 levels of RGB image features f_1^r, ..., f_5^r generated in step 2 and the Depth image features f_1^d, ..., f_5^d are interactively fused by the cross-modal fusion module to obtain 5 levels of multi-modal features f_1^rd, ..., f_5^rd.
the main steps are as follows:
3.1. the cross-modal feature fusion network is composed of 5 levels of CMAF modules; from the 5 levels of RGB image features f_1^r, ..., f_5^r and the corresponding Depth image features f_1^d, ..., f_5^d it generates 5 levels of multi-modal features f_1^rd, ..., f_5^rd.
3.2. the input data of the CMAF module at the i-th level are f_i^r and f_i^d, and the multi-modal feature f_i^rd at level i is output via an interactively guided attention mechanism, where i ∈ {1, 2, 3, 4, 5}.
The CMAF module generates the multi-modal features through an interactively guided attention mechanism by the following specific process:
3.3.1. firstly, the invention constructs a residual convolution module to enlarge the receptive field and enrich the semantic information of the features and to enhance their saliency expression capability; the RGB image features and the corresponding Depth image features are further enhanced by this residual convolution module.
3.3.2. the RGB image features and the corresponding Depth image features are further fused using element-wise matrix multiplication and element-wise matrix addition, and the fused features are then converted into a global context-aware attention weight W_s and a channel-aware attention weight W_c using the softmax activation function:

W_s = softmax(multi(Resconv(f_i^r), Resconv(f_i^d)))

W_c = softmax(GAP(add(Resconv(f_i^r), Resconv(f_i^d))))

where Resconv denotes the residual convolution module, multi denotes the element-wise matrix multiplication operation, add denotes the element-wise matrix addition operation, GAP denotes global average pooling, and softmax denotes the softmax activation function.
3.3.3. after obtaining the global context-aware attention weight W_s and the channel-aware attention weight W_c, W_s and W_c are combined with the enhanced RGB image features and the corresponding enhanced Depth image features respectively, and the weight matrices generated by the attention mechanism guide the features to focus on the salient regions, yielding the filtered features:

f'_i^α = add(multi(W_s, Resconv(f_i^α)), multi(W_c, Resconv(f_i^α)))

where α ∈ {r, d}; through this operation the filtered RGB image features f'_i^r and the corresponding filtered Depth image features f'_i^d are obtained.
3.3.4. the filtered RGB image features f'_i^r and the corresponding filtered Depth image features f'_i^d are fused across modalities by a cross-interactive fusion method to obtain the fused feature f_i^rd:

f_i^rd = conv3(cat(f'_i^r, f'_i^d))

where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, conv3 denotes a convolution with a 3 × 3 kernel, and cat denotes the feature concatenation operation.
4. Referring to fig. 3, the super-large scale receptive field fusion module is used to enhance the receptive field information and high-level semantic information of the multi-modal features:
4.1) ultra-large-scale receptive field information is extracted from the multi-scale multi-modal features through hole convolution operations, using hole convolutions with different hole rates:

f_i^(j) = DLA_j(f_i^rd), j ∈ {1, 2, 4, 8}

where i ∈ {1, 2, 3, 4, 5} denotes the level at which the multi-modal feature resides, DLA_j() denotes a hole convolution operation with hole rate j, so that DLA_1(), DLA_2(), DLA_4() and DLA_8() are hole convolution operations with hole rates of 1, 2, 4 and 8 respectively, and f_i^(1), f_i^(2), f_i^(4) and f_i^(8) denote the features with hole rate j generated from the multi-modal feature of the i-th level.
4.2) the multi-receptive-field multi-modal features generated above are input into the trapezoidal pyramid feature fusion network, which fuses the multi-modal features of different receptive fields:

f_i = TPNet(f_i^(1), f_i^(2), f_i^(4), f_i^(8))

where TPNet() denotes the trapezoidal pyramid feature fusion network.
5. Referring to FIG. 4, the decoder of the proposed algorithm is built on the trapezoidal pyramid: the 5 levels of enhanced multi-modal features f_1, f_2, f_3, f_4 and f_5 are input into the decoder, and the predicted saliency map P_est is obtained after sigmoid activation:

P_est = sigmoid(TPNet(f_1, f_2, f_3, f_4, f_5))    Equation (7)
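To make the FIG. 4 decoder concrete, a simplified trapezoidal-pyramid-style fusion sketch is given below: the five enhanced features are merged level by level from deep to shallow, so the set of remaining feature maps shrinks like a trapezoid at each stage. The channel widths, the bilinear upsampling and the 1×1 prediction head are assumptions of this sketch and not the exact TPNet of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrapezoidDecoder(nn.Module):
    # Progressive deep-to-shallow fusion of f_1..f_5 (a simplified stand-in for TPNet).
    def __init__(self, channels=(64, 128, 256, 512, 512), mid=64):
        super().__init__()
        self.squeeze = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in channels])
        self.fuse = nn.ModuleList([nn.Conv2d(2 * mid, mid, 3, padding=1) for _ in range(4)])
        self.head = nn.Conv2d(mid, 1, 1)

    def forward(self, feats):                        # feats = [f_1, f_2, f_3, f_4, f_5]
        feats = [s(f) for s, f in zip(self.squeeze, feats)]
        x = feats[-1]                                # start from the deepest level f_5
        for i in range(3, -1, -1):                   # fuse f_4, f_3, f_2, f_1 in turn
            x = F.interpolate(x, size=feats[i].shape[-2:], mode="bilinear",
                              align_corners=False)
            x = self.fuse[i](torch.cat([x, feats[i]], dim=1))
        return torch.sigmoid(self.head(x))           # predicted saliency map P_est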
6) the loss function is computed between the saliency map P_est predicted by the invention and the manually labeled salient object segmentation map P_GT, the parameter weights of the model provided by the invention are gradually updated through SGD (stochastic gradient descent) and the back-propagation algorithm, and the structure and parameter weights of the RGB-D saliency detection algorithm are finally determined.
7) with the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate saliency maps P_test, which are evaluated using the MAE, S-measure, F-measure and E-measure evaluation indexes.
The above description is for the purpose of illustrating preferred embodiments of the present application and is not intended to limit the present application, and it will be apparent to those skilled in the art that various modifications and variations can be made in the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (5)

1. An RGB-D saliency target detection method based on interactive attention guidance and trapezoidal pyramid fusion is characterized by comprising the following steps:
1) acquiring an RGB-D data set for training and testing the task, defining an algorithm target of the invention, and determining a training set and a testing set for training and testing the algorithm;
2) constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features;
3) establishing a cross-modal characteristic fusion network, and guiding the RGB image characteristics and Depth image characteristics to carry out cross fusion through an attention mechanism guided by an interactive mode;
4) constructing an ultra-large-scale receptive field fusion mechanism based on the multi-modal features fused by the cross-modal features to enhance the receptive field information and the high-level semantic information of the multi-modal features;
5) establishing a decoder based on a trapezoidal pyramid feature fusion network, and obtaining a final predicted saliency map through an activation function;
6) calculating a loss function between the predicted saliency map P_est and the manually labeled salient object segmentation map P_GT, gradually updating the parameter weights of the model provided by the invention through SGD (stochastic gradient descent) and the back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D saliency detection algorithm;
7) on the basis of the model structure and parameter weights determined in step 6), testing the RGB-D image pairs of the test set to generate saliency maps P_test, and performing performance evaluation using the evaluation indexes.
2. The RGB-D saliency target detection method based on interactive attention-guiding and trapezoidal pyramid fusion of claim 1, characterized in that: the specific method of the step 2) is as follows:
2.1) taking the NJUD data set, the NLPR data set and the DUT-RGBD data set as the training set, and taking the remaining part of the NLPR data set, the remaining part of the DUT-RGBD data set, the SIP data set, the STERE data set and the SSD data set as the test sets.
2.2) each sample of the RGB-D image data set comprises a single RGB image P_RGB, a corresponding Depth image P_Depth and a corresponding manually labeled salient object segmentation map P_GT.
3. The RGB-D saliency target detection method based on interactive attention-guiding and trapezoidal pyramid fusion of claim 1, characterized in that: the specific method of the step 3) is as follows:
3.1) using VGG16 as the backbone network of the model of the invention for extracting the RGB image features and the corresponding Depth image features, denoted respectively as f_1^r, ..., f_5^r and f_1^d, ..., f_5^d;
3.2) initializing the VGG16 weights of the backbone networks with VGG16 parameter weights pre-trained on the ImageNet data set.
4. The RGB-D saliency target detection method based on interactive attention-guiding and trapezoidal pyramid fusion of claim 1, characterized in that: the specific method of the step 4) is as follows:
4.1) the cross-modal feature fusion network is composed of 5 levels of CMAF modules and generates 5 levels of multi-modal features f_1^rd, ..., f_5^rd;
4.2) the input data of the CMAF module at the i-th level are f_i^r and f_i^d, and the multi-modal feature f_i^rd at level i is output via an interactively guided attention mechanism, where i ∈ {1, 2, 3, 4, 5}.
5. The RGB-D saliency target detection method based on interactive guiding attention and trapezoidal pyramid fusion of claim 1, characterized in that: the specific method of the step 5) is as follows:
5.1) extracting ultra-large-scale receptive field information from the multi-scale multi-modal features through hole convolution operations, using hole convolutions with different hole rates:

f_i^(j) = DLA_j(f_i^rd), j ∈ {1, 2, 4, 8}    equation (1)

where i ∈ {1, 2, 3, 4, 5} denotes the level at which the multi-modal feature resides, DLA_j() denotes a hole convolution operation with hole rate j, so that DLA_1(), DLA_2(), DLA_4() and DLA_8() are hole convolution operations with hole rates of 1, 2, 4 and 8 respectively, and f_i^(1), f_i^(2), f_i^(4) and f_i^(8) denote the features with hole rate j generated from the multi-modal feature of the i-th level;
5.2) inputting the multi-receptive-field multi-modal features generated above into the trapezoidal pyramid feature fusion network, which fuses the multi-modal features of different receptive fields:

f_i = TPNet(f_i^(1), f_i^(2), f_i^(4), f_i^(8))    equation (2)

where TPNet() denotes the trapezoidal pyramid feature fusion network;
6) inputting the 5 levels of ultra-large-scale receptive field multi-modal features obtained in step 5 into a decoder formed by the trapezoidal pyramid feature fusion network to obtain the final fused features, and obtaining the predicted saliency map P_est after sigmoid activation:

P_est = sigmoid(TPNet(f_1, f_2, f_3, f_4, f_5))    Equation (3)
7) calculating a loss function between the predicted saliency map P_est and the manually labeled salient object segmentation map P_GT, gradually updating the parameter weights of the model provided by the invention through SGD (stochastic gradient descent) and the back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
CN202111565805.7A 2021-12-17 2021-12-17 RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion Pending CN114283315A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111565805.7A CN114283315A (en) 2021-12-17 2021-12-17 RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111565805.7A CN114283315A (en) 2021-12-17 2021-12-17 RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion

Publications (1)

Publication Number Publication Date
CN114283315A true CN114283315A (en) 2022-04-05

Family

ID=80873250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111565805.7A Pending CN114283315A (en) 2021-12-17 2021-12-17 RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion

Country Status (1)

Country Link
CN (1) CN114283315A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082553A (en) * 2022-08-23 2022-09-20 青岛云智聚智能科技有限公司 Logistics package position detection method and system
CN115439726A (en) * 2022-11-07 2022-12-06 腾讯科技(深圳)有限公司 Image detection method, device, equipment and storage medium
CN117854009A (en) * 2024-01-29 2024-04-09 南通大学 Cross-collaboration fusion light-weight cross-modal crowd counting method

Similar Documents

Publication Publication Date Title
CN113240691B (en) Medical image segmentation method based on U-shaped network
CN114283315A (en) RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion
CN110880019B (en) Method for adaptively training target domain classification model through unsupervised domain
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN109874053A (en) The short video recommendation method with user's dynamic interest is understood based on video content
CN112287170B (en) Short video classification method and device based on multi-mode joint learning
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN113284100B (en) Image quality evaluation method based on recovery image to mixed domain attention mechanism
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN114283316A (en) Image identification method and device, electronic equipment and storage medium
CN113033454B (en) Method for detecting building change in urban video shooting
CN111275784A (en) Method and device for generating image
CN113435269A (en) Improved water surface floating object detection and identification method and system based on YOLOv3
CN111724400A (en) Automatic video matting method and system
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114693952A (en) RGB-D significance target detection method based on multi-modal difference fusion network
CN111368142A (en) Video intensive event description method based on generation countermeasure network
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN116310396A (en) RGB-D significance target detection method based on depth quality weighting
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN115965968A (en) Small sample target detection and identification method based on knowledge guidance
CN114926734A (en) Solid waste detection device and method based on feature aggregation and attention fusion
CN114299305A (en) Salient object detection algorithm for aggregating dense and attention multi-scale features
CN116486112A (en) RGB-D significance target detection method based on lightweight cross-modal fusion network
Li et al. HRVQA: A Visual Question Answering benchmark for high-resolution aerial images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination