CN114359162A - Saliency detection method, saliency detection device, and storage medium - Google Patents

Saliency detection method, saliency detection device, and storage medium

Info

Publication number
CN114359162A
CN114359162A
Authority
CN
China
Prior art keywords
determining
result
output
importance
modal
Prior art date
Legal status
Pending
Application number
CN202111507286.9A
Other languages
Chinese (zh)
Inventor
高伟
廖桂标
李革
Current Assignee
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN202111507286.9A priority Critical patent/CN114359162A/en
Publication of CN114359162A publication Critical patent/CN114359162A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a saliency detection method, a saliency detection device, and a computer-readable storage medium, wherein the method comprises the following steps: determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network; determining a first global descriptor according to the first output result and a second global descriptor according to the second output result; determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor; determining a first modal characteristic according to the importance and the first output result, and a second modal characteristic according to the importance and the second output result; and determining a saliency detection result based on the first modal characteristic and the second modal characteristic. The invention aims to improve the robustness of the saliency detection result.

Description

Saliency detection method, saliency detection device, and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a saliency detection method, a saliency detection device, and a computer-readable storage medium.
Background
In the related art, multimodal RGB-D (RGB-Depth) and RGB-T (RGB-Thermal) salient object detection mainly relies on low-level cues for fusion. This class of methods uses depth saliency or temperature saliency as additional information. That is, the former extracts depth features from a depth map and combines them with features extracted from an RGB (color) image to obtain a saliency map, while the latter extracts temperature features from a thermal map and combines them with features extracted from an RGB image to obtain a saliency map. For example, in RGB-D salient object detection, the process generally computes priors in the RGB and depth images, such as background priors, center priors, and depth priors, and then detects salient objects by comparing features such as color, brightness, texture, and depth across different regions and fusing them by multiplication or addition together with some post-processing techniques.
Since the quality of the depth map is susceptible to noise from sensor temperature, background illumination, and the distance and reflectivity of the observed object, erroneous or missing regions may appear in the depth map. Similarly, the quality of the thermal map is susceptible to camera noise and sensor noise, and the thermal map may contain thermal-crossover or blurred regions. In addition, because real environments present challenging scenes such as varying lighting conditions, similar foregrounds and backgrounds, and complex backgrounds, the acquired RGB image may not provide accurate information. The saliency detection method described in the above related art cannot handle such defects in the original images, so the robustness of its saliency detection results is low.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a saliency detection method, a saliency detection device, and a computer-readable storage medium, in order to solve the technical problem in the related art that the robustness of saliency detection results is low.
In order to achieve the above object, the present invention provides a saliency detection method comprising the steps of:
determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network;
determining a first global descriptor according to the first output result and determining a second global descriptor according to the second output result;
determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor;
determining a first modal characteristic according to the importance and the first output result, and determining a second modal characteristic according to the importance and the second output result;
determining a significance detection result based on the first modal characteristics and the second modal characteristics.
Optionally, before the steps of determining the first global descriptor according to the first output result and determining the second global descriptor according to the second output result, the method further includes:
acquiring the length, width and channel information of the first target convolutional layer and the length, width and channel information of the second target convolutional layer;
the step of determining the first global descriptor according to the first output result and determining the second global descriptor according to the second output result comprises:
determining the first global descriptor according to the first output result and the length, width and channel information of the first target convolutional layer;
and determining the second global descriptor according to the second output result and the length, the width and the channel information of the second target convolutional layer.
Optionally, the step of determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor includes:
determining a first concatenation result of the first global descriptor and the second global descriptor;
determining the importance based on a preset activation function and the first concatenation result.
Optionally, the determining the first modality characteristics according to the importance and the first output result, and the determining the second modality characteristics according to the importance and the second output result include:
determining the first modal characteristic according to an element product of the first importance and the first output result, and determining the second modal characteristic according to an element product of the second importance and the second output result.
Optionally, before the step of determining the first modal characteristic according to the element product of the first importance and the first output result, and determining the second modal characteristic according to the element product of the second importance and the second output result, the method further comprises:
determining a second concatenation result between the element product of the first importance and the first output result and the element product of the second importance and the second output result;
determining a convolution output of the second concatenation result;
the step of determining the first modal characteristic according to the element product of the first importance and the first output result, and determining the second modal characteristic according to the element product of the second importance and the second output result comprises:
determining the first modal characteristic from the convolution output and an elemental product of the first importance and the first output result; and
determining the second modal characteristic based on the element product of the second importance and the second output result, and the convolution output.
Optionally, the step of determining a saliency detection result based on the first modal characteristics and the second modal characteristics comprises:
when the first target convolutional layer and the second target convolutional layer are the last convolutional layers, fusing the first modal characteristic and the second modal characteristic, and determining the significance detection result according to the fusion result;
when the first target convolutional layer and the second target convolutional layer are not the last convolutional layers, taking the first modal characteristic as the input of the convolutional layer following the first target convolutional layer, and taking the second modal characteristic as the input of the convolutional layer following the second target convolutional layer, so as to obtain the first output characteristic of the first backbone network and the second output characteristic of the second backbone network; and
and fusing the first output characteristic and the second output characteristic, and determining the significance detection result according to the fusion result.
Optionally, the step of determining a saliency detection result based on the first modal characteristics and the second modal characteristics comprises:
determining a first decision output of a modality corresponding to a first backbone network according to the first modality characteristics, and determining a second decision output of a modality corresponding to a second backbone network according to the second modality characteristics;
determining the significance detection result according to the first decision output and the second decision output.
Optionally, the step of determining the significance detection result according to the first decision output and the second decision output comprises:
taking the element-wise sum of the first decision output and the second decision output as a preliminary fusion feature;
performing up-sampling on the preliminary fusion features to obtain a sampling result;
and performing convolution and global pooling on the sampling result, and determining the significance detection result according to the convolution and pooling result.
The present invention also provides a saliency detection apparatus comprising: a memory, a processor, and a saliency detection program stored on the memory and executable on the processor, the saliency detection program when executed by the processor implementing the steps of the saliency detection method as described above.
According to the saliency detection method, the saliency detection device, and the computer-readable storage medium provided by the embodiments of the present invention, a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network are determined; then a first global descriptor is determined according to the first output result and a second global descriptor according to the second output result; the importance of different channels of each modality is determined according to the first global descriptor and the second global descriptor; a first modal feature is determined according to the importance and the first output result, and a second modal feature according to the importance and the second output result; and a saliency detection result is determined based on the first modal feature and the second modal feature. Unlike methods that directly extract RGB features and auxiliary modal features with independent backbone networks, which indiscriminately introduce noise or redundant information, the scheme provided by this embodiment extracts RGB features and auxiliary modal features simultaneously in a weight-sharing backbone feature extraction manner that accounts for the quality defects of different modal inputs, and makes full use of the complementarity among the multi-modal features to enhance accurate feature expression, thereby improving the robustness of the saliency detection result.
Drawings
Fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram illustrating a significance detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an overall architecture of a saliency detection network according to an embodiment of the present invention;
FIG. 4 is a schematic data processing flow diagram of an FCE according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a detailed step of step S50 according to an embodiment of the significance detection method of the present invention;
fig. 6 is a schematic data processing flow diagram of a detection module according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.
The terminal in the embodiment of the invention may be a saliency detection device, such as a PC (personal computer) or a server.
As shown in fig. 1, the terminal may include: a processor 1001, an interface 1003, a memory 1004, and a communication bus 1002, wherein the communication bus 1002 is used to implement connection and communication between these components. The interface 1003 communicates with other devices or components. The memory 1004 may be NAND flash memory, or alternatively a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1004, which is a kind of computer storage medium, may include therein a control system, an interface module, and a saliency detection program.
In the terminal shown in fig. 1, the processor 1001 may be configured to call the saliency detection program stored in the memory 1004 and perform the following operations:
determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network;
determining a first global descriptor according to the first output result and determining a second global descriptor according to the second output result;
determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor;
determining a first modal characteristic according to the importance and the first output result, and determining a second modal characteristic according to the importance and the second output result;
determining a significance detection result based on the first modal characteristics and the second modal characteristics.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
acquiring the length, width and channel information of the first target convolutional layer and the length, width and channel information of the second target convolutional layer;
the step of determining the first global descriptor according to the first output result and determining the second global descriptor according to the second output result comprises:
determining the first global descriptor according to the first output result and the length, width and channel information of the first target convolutional layer;
and determining the second global descriptor according to the second output result and the length, the width and the channel information of the second target convolutional layer.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
determining a first concatenation result of the first global descriptor and the second global descriptor;
determining the importance based on a preset activation function and the first concatenation result.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
determining the first modal characteristic according to an element product of the first importance and the first output result, and determining the second modal characteristic according to an element product of the second importance and the second output result.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
determining a second concatenation result between the element product of the first importance and the first output result and the element product of the second importance and the second output result;
determining a convolution output of the second concatenation result;
the step of determining the first modal characteristic according to the element product of the first importance and the first output result, and determining the second modal characteristic according to the element product of the second importance and the second output result comprises:
determining the first modal characteristic from the convolution output and an elemental product of the first importance and the first output result; and
determining the second modal characteristic based on the element product of the second importance and the second output result, and the convolution output.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
when the first target convolutional layer and the second target convolutional layer are the last convolutional layers, fusing the first modal characteristic and the second modal characteristic, and determining the significance detection result according to the fusion result;
when the first target convolutional layer and the second target convolutional layer are not the last convolutional layers, taking the first modal characteristic as the input of the convolutional layer following the first target convolutional layer, and taking the second modal characteristic as the input of the convolutional layer following the second target convolutional layer, so as to obtain the first output characteristic of the first backbone network and the second output characteristic of the second backbone network; and
and fusing the first output characteristic and the second output characteristic, and determining the significance detection result according to the fusion result.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
determining a first decision output of a modality corresponding to a first backbone network according to the first modality characteristics, and determining a second decision output of a modality corresponding to a second backbone network according to the second modality characteristics;
determining the significance detection result according to the first decision output and the second decision output.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
taking the element-wise sum of the first decision output and the second decision output as a preliminary fusion feature;
performing up-sampling on the preliminary fusion features to obtain a sampling result;
and performing convolution and global pooling on the sampling result, and determining the significance detection result according to the convolution and pooling result.
In order to overcome the drawbacks described in the background above, an embodiment of the present invention provides a saliency detection method, aiming to improve the robustness of saliency detection results. For ease of understanding, the saliency detection method proposed by the present invention is explained below through specific embodiments.
Referring to fig. 2, the present invention provides a first embodiment of a saliency detection method comprising:
step S10: determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network;
step S20: determining a first global descriptor according to the first output result, and determining a second global descriptor according to the second output result;
step S30: determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor;
step S40: determining a first modal characteristic according to the importance and the first output result, and determining a second modal characteristic according to the importance and the second output result;
step S50: determining a significance detection result based on the first modal characteristics and the second modal characteristics.
In this embodiment, an RGB image to be processed and an auxiliary image may be acquired first, where the auxiliary image may be a thermodynamic diagram or a depth map.
After the RGB image and the auxiliary image are acquired, if the first backbone network is set as a backbone network for processing the RGB image, the RGB image is used as an input of the first backbone network, and the auxiliary image is used as an input of the second backbone network. If the second backbone network is set as a backbone network for processing RGB images, the RGB images are used as input of the second backbone network, and the auxiliary images are used as input of the first backbone network.
Illustratively, referring to fig. 3, the present embodiment provides a saliency detection network, wherein the saliency detection network includes a feature extraction network and a fusion network. The feature extraction network comprises a first backbone network and a second backbone network.
The first backbone network and the second backbone network may include a plurality of convolutional layers, and the number of convolutional layers included in the first backbone network is the same as the number of convolutional layers included in the second backbone network. For example, the first backbone network and the second backbone network may each include 5 convolutional layers.
The feature extraction network further includes at least one FCE (flow collaboration enhancement module). An FCE may be placed after any one of the convolutional layers. In an alternative embodiment, the second, third, fourth, and fifth convolutional layers are each followed by an FCE, as shown in fig. 3. It should be understood that this does not mean the feature extraction network presented in this embodiment is limited to setting FCEs after the second, third, fourth, and fifth convolutional layers. In other embodiments, a single FCE may be placed after any one convolutional layer, or one FCE may be placed after each convolutional layer.
Based on the saliency detection network provided in this embodiment, a first output result of a first target convolutional layer of the first backbone network and a second output result of a second target convolutional layer of the second backbone network may be determined first. The first target convolutional layer and the second target convolutional layer are corresponding convolutional layers of the first backbone network and the second backbone network, respectively; for example, they may be the second convolutional layer Conv2 of the first backbone network and the second convolutional layer Conv2 of the second backbone network.
The first output result and the second output result may be input to the FCE and processed to obtain the first modal characteristic and the second modal characteristic. The FCE processing includes: determining a first global descriptor according to the first output result and a second global descriptor according to the second output result, determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor, determining a first modal feature according to the importance and the first output result, and determining a second modal feature according to the importance and the second output result.
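To make the overall data flow concrete, the following is a minimal PyTorch sketch of the paired-backbone feature extraction with FCEs inserted after selected stages. It is a sketch under assumptions, not the patent's implementation: the stage modules, the FCE placement (after stages 2 through 5), and all names are illustrative, and weight sharing between the two branches can be obtained by passing the same stage modules for both arguments.

```python
import torch.nn as nn

class SaliencyEncoder(nn.Module):
    """Paired-backbone encoder: each stage processes both modalities, and an
    FCE (if configured for that stage) enhances the two outputs before they
    are fed to the next stage. All structural choices here are assumptions."""

    def __init__(self, rgb_stages, aux_stages, fce_by_stage):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)   # e.g., 5 conv stages
        self.aux_stages = nn.ModuleList(aux_stages)   # same depth as rgb_stages
        self.fces = nn.ModuleDict(fce_by_stage)       # keys: stage indices as str

    def forward(self, rgb, aux):
        x_r, x_t = rgb, aux
        for i, (stage_r, stage_t) in enumerate(zip(self.rgb_stages,
                                                   self.aux_stages)):
            x_r, x_t = stage_r(x_r), stage_t(x_t)     # first/second output results
            if str(i) in self.fces:                   # enhance, then feed forward
                x_r, x_t = self.fces[str(i)](x_r, x_t)
        return x_r, x_t                               # first/second output features
```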
Referring to fig. 4, fig. 4 is a schematic diagram of an FCE. After the FCE obtains a first output result $X_r^l$ and a second output result $X_t^l$, the first global descriptor may be determined according to the first output result $X_r^l$, and the second global descriptor according to the second output result $X_t^l$.
Optionally, when determining the first global descriptor and the second global descriptor, the length, the width, and the channel information of the first target convolutional layer and the length, the width, and the channel information of the second target convolutional layer may be obtained first, then the first global descriptor is determined according to the first output result and the length, the width, and the channel information of the first target convolutional layer, and the second global descriptor is determined according to the second output result and the length, the width, and the channel information of the second target convolutional layer.
In a specific implementation, the l-th layer output of the RGB backbone network is

$$X_r^l \in \mathbb{R}^{H \times W \times C}$$

where H, W, and C denote the height, width, and number of channels of the l-th layer, respectively. First, global descriptors may be computed for the original RGB features and the auxiliary modal features, i.e., the first global descriptor and the second global descriptor. The following explanation takes the case where the first backbone network processes the RGB image and the second backbone network processes the auxiliary image, so that the first and second global descriptors are the global descriptors of the original RGB features and the auxiliary modal features, respectively. The global descriptor is computed as follows:

$$g_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$$

where (i, j) denotes the coordinate location and $g_c$ represents the spatial statistic of the c-th channel.
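As a concrete illustration, this per-channel statistic is a global average pooling over the spatial dimensions; the following minimal PyTorch sketch (tensor names and shapes are assumptions) computes it for both modalities.

```python
import torch

def global_descriptor(x: torch.Tensor) -> torch.Tensor:
    """g_c = (1 / (H * W)) * sum_{i,j} x_c(i, j) for each channel c.

    x: feature map of shape (N, C, H, W); returns descriptors of shape (N, C).
    """
    return x.mean(dim=(2, 3))

# Hypothetical layer-l outputs of the two backbones (shapes are assumptions).
x_r = torch.randn(1, 64, 56, 56)   # RGB-branch features
x_t = torch.randn(1, 64, 56, 56)   # auxiliary-branch features
g_r, g_t = global_descriptor(x_r), global_descriptor(x_t)   # each (1, 64)
```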
It can be understood that the first global descriptor and the second global descriptor are computed in the same manner; each can be obtained from the above formula by substituting the parameters of the corresponding network.
After computing a first global descriptor and a second global descriptor, a first concatenation result of the first global descriptor and the second global descriptor may be determined, and then the importance may be determined based on a preset activation function and the first concatenation result.
Illustratively, to enhance the interaction along the channel dimension and exploit the complementarity between the modalities to suppress noisy feature responses, the multi-modal global descriptors $g_r^l$ and $g_t^l$ are adaptively aggregated and learned from each other through the collaborative enhancement layer, as follows:

$$\left[W_r^l, W_t^l\right] = FC_2\left(\mathrm{ReLU}\left(FC_1\left(g_r^l \oplus g_t^l\right)\right)\right)$$

where ReLU denotes the ReLU activation function, $\oplus$ denotes concatenation along the channel dimension, and $FC_1$ and $FC_2$ denote learnable fully connected layers. The resulting $W_r^l$ and $W_t^l$ represent the importance of the different channels of each modality.
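A minimal PyTorch sketch of this collaborative enhancement layer is given below. The bottleneck reduction ratio and the final sigmoid gate (which maps the importance values into (0, 1)) are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class CollaborativeEnhancement(nn.Module):
    """Two fully connected layers over the concatenated global descriptors
    produce per-channel importance weights for each modality."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channels = channels
        self.fc1 = nn.Linear(2 * channels, 2 * channels // reduction)
        self.fc2 = nn.Linear(2 * channels // reduction, 2 * channels)

    def forward(self, g_r: torch.Tensor, g_t: torch.Tensor):
        # [W_r, W_t] = FC2(ReLU(FC1(g_r (+) g_t))); the sigmoid is an assumption.
        w = torch.sigmoid(
            self.fc2(torch.relu(self.fc1(torch.cat([g_r, g_t], dim=1))))
        )
        return w[:, : self.channels], w[:, self.channels :]
```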
Further, the first modal characteristic may be determined according to an element product of the first importance and the first output result, and the second modal characteristic may be determined according to an element product of the second importance and the second output result.
Illustratively, a second concatenation result between the element product of the first importance and the first output result and the element product of the second importance and the second output result may be determined, and a convolution output of the second concatenation result computed. The first modal characteristic is then determined from the convolution output and the element product of the first importance and the first output result, and the second modal characteristic from the convolution output and the element product of the second importance and the second output result.
For example, in an alternative embodiment, to obtain a robust multi-modal feature representation, the original modal feature inputs are recalibrated based on the channel importance computed above and then further fused, as follows:

$$\tilde{X}_r^l = W_r^l \odot X_r^l$$

$$\tilde{X}_t^l = W_t^l \odot X_t^l$$

$$X_f^l = \mathrm{Conv}\left(\tilde{X}_r^l \oplus \tilde{X}_t^l\right)$$

where $\odot$ denotes element-wise multiplication, Conv is a convolutional layer, and $\oplus$ denotes concatenation along the channel dimension.
Optionally, in some embodiments, a residual connection makes the network easier to train and optimize, as follows:

$$\hat{X}_r^l = X_r^l + \tilde{X}_r^l + X_f^l$$

$$\hat{X}_t^l = X_t^l + \tilde{X}_t^l + X_f^l$$

The enhanced $\hat{X}_r^l$ and $\hat{X}_t^l$ are the first modal characteristic and the second modal characteristic, respectively.
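Putting the pieces together, the following is a minimal PyTorch sketch of one FCE, assuming the reconstruction above (recalibration, fusion convolution, shared fused term, and residual connections); it reuses the CollaborativeEnhancement sketch from earlier, and the 3 × 3 fusion kernel is an assumption.

```python
import torch
import torch.nn as nn

class FCE(nn.Module):
    """Flow collaboration enhancement module (sketch): recalibrate each
    modality by its channel importance, fuse the recalibrated features with a
    convolution over their concatenation, and add residual inputs."""

    def __init__(self, channels: int):
        super().__init__()
        self.enhance = CollaborativeEnhancement(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x_r: torch.Tensor, x_t: torch.Tensor):
        w_r, w_t = self.enhance(x_r.mean(dim=(2, 3)), x_t.mean(dim=(2, 3)))
        xr_tilde = w_r[:, :, None, None] * x_r   # element product W_r (.) X_r
        xt_tilde = w_t[:, :, None, None] * x_t   # element product W_t (.) X_t
        x_f = self.fuse(torch.cat([xr_tilde, xt_tilde], dim=1))
        # Residual connections ease optimization, per the embodiment above.
        return x_r + xr_tilde + x_f, x_t + xt_tilde + x_f
```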
Optionally, considering that context information is beneficial for detecting salient objects of different sizes and positions, a multi-scale flow collaboration enhancement module (MFCE) may be added to the network.
The MFCE may first compute a contextual feature embedding and then use the FCE to generate the enhanced multi-modal feature expression. Specifically, the multi-scale flow collaboration enhancement module can be represented as follows:

$$\left(\hat{X}_r^l, \hat{X}_t^l\right) = \mathrm{FCE}\left(X_r^l \odot \sigma\left(U\left(P\left(X_r^l\right)\right)\right),\ X_t^l \odot \sigma\left(U\left(P\left(X_t^l\right)\right)\right)\right)$$

where $P$ is a 4 × 4 global average pooling operation, $U$ is a 4× upsampling operation, and $\sigma$ is the Sigmoid activation function. Similarly, the enhanced feature $\hat{X}^l$ of each layer serves as the input to the next layer for continued enhancement. As the network grows deeper, effective context expression is highlighted at the encoding stage by using the MFCE.
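The following minimal PyTorch sketch shows one way to realize this context gating, assuming the reconstruction above; pooling to a 4 × 4 grid and upsampling back to the input resolution stand in for P and U, and the composition with the FCE sketch from earlier is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFCE(nn.Module):
    """Multi-scale flow collaboration enhancement (sketch): a 4x4 global
    average pooling computes a context embedding, a sigmoid gate re-weights
    the features after upsampling, then an FCE fuses the two modalities."""

    def __init__(self, channels: int):
        super().__init__()
        self.fce = FCE(channels)   # from the sketch above

    def _context_gate(self, x: torch.Tensor) -> torch.Tensor:
        ctx = F.adaptive_avg_pool2d(x, 4)   # P: pool to a 4x4 context grid
        gate = torch.sigmoid(               # sigma after U (upsample back)
            F.interpolate(ctx, size=x.shape[2:], mode="bilinear",
                          align_corners=False)
        )
        return x * gate

    def forward(self, x_r: torch.Tensor, x_t: torch.Tensor):
        return self.fce(self._context_gate(x_r), self._context_gate(x_t))
```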
In an optional implementation, when the first target convolutional layer and the second target convolutional layer are the last convolutional layers of the feature extraction network, the first modal feature and the second modal feature may be input to a decision fusion network, and the saliency detection result is obtained through the decision fusion network.
When the first target convolutional layer and the second target convolutional layer are not the last convolutional layers, the first modal feature is taken as the input of the convolutional layer following the first target convolutional layer and the second modal feature as the input of the convolutional layer following the second target convolutional layer, so as to obtain the first output feature of the first backbone network and the second output feature of the second backbone network; the first output feature and the second output feature are then fused, and the significance detection result is determined according to the fusion result.
It will be appreciated that the convolutional layers subsequent to the first and second target convolutional layers are also provided with FCEs, and may be processed again according to the above manner until the first and second output characteristics are obtained. The specific processing manner is similar to that described above, and is not described herein again.
In the technical solution disclosed in this embodiment, a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network are determined; a first global descriptor is then determined according to the first output result and a second global descriptor according to the second output result; the importance of different channels of each modality is determined according to the first global descriptor and the second global descriptor; a first modal feature is determined according to the importance and the first output result, and a second modal feature according to the importance and the second output result; and a significance detection result is determined based on the first modal feature and the second modal feature. Unlike methods that directly extract RGB features and auxiliary modal features with independent backbone networks, which indiscriminately introduce noise or redundant information, the scheme provided by this embodiment extracts RGB features and auxiliary modal features simultaneously in a weight-sharing backbone feature extraction manner that accounts for the quality defects of different modal inputs, and makes full use of the complementarity among the multi-modal features to enhance accurate feature expression, thereby improving the robustness of the saliency detection result.
Optionally, referring to fig. 5, based on the foregoing embodiment, the step S50 includes:
step S51: determining a first decision output of a modality corresponding to a first backbone network according to the first modality characteristics, and determining a second decision output of a modality corresponding to a second backbone network according to the second modality characteristics;
step S52: determining the significance detection result according to the first decision output and the second decision output.
In this embodiment, the first decision output is the output of the first backbone network, and the second decision output is the output of the second backbone network. After the first decision output and the second decision output are obtained, their element-wise sum can be used as a preliminary fusion feature; the preliminary fusion feature is then upsampled to obtain a sampling result, convolution and global pooling are performed on the sampling result, and the significance detection result is determined according to the convolution and pooling results.
Illustratively, a conventional feature-based fusion strategy relies on the accuracy of the extracted features and is sensitive to noise: if the extracted features are unreliable, directly fusing the noisy features can lead to sub-optimal results. Unlike the feature-based fusion strategy described in the above embodiment, a decision-based fusion strategy (DFS) may be designed from the viewpoint of the reliability of the modal outputs, aiming to generate a more robust and reliable salient object detection result based on the accuracy of the modal outputs.
As shown in figs. 3 and 6, the RGB features $\hat{X}_R$ from the RGB backbone network are first fed to a 3 × 3 convolutional layer (Conv) to obtain the features $F_R$, and $F_R$ is then fed into a modality-specific RGB decoder $G_R$ to generate the decision output $S_R$ of the RGB modality, as follows:

$$S_R = U\left(G_R\left(\mathrm{Conv}\left(\hat{X}_R\right)\right)\right)$$

where Conv is a 3 × 3 convolutional layer, $\oplus$ denotes concatenation along the channel dimension, and $U$ is an upsampling operation. Similarly, the decision output $S_T$ of the auxiliary modality may be obtained by the following formula:

$$S_T = U\left(G_T\left(\mathrm{Conv}\left(\hat{X}_T\right)\right)\right)$$
Then, the decision outputs of the different modalities are fed into a lightweight Detection Module (Det) to obtain the final multi-modal salient object detection output $S_f$:

$$S_f = \mathrm{Det}\left(S_R, S_T\right)$$
As shown in fig. 6, $S_R$ and $S_T$ may first undergo an element-wise addition to obtain the preliminary fusion feature. The fused feature is then upsampled by a factor of 4 using bilinear interpolation and deconvolution; 1 × 1, 3 × 3, and 5 × 5 convolutions and global pooling are applied to it in parallel to capture multi-scale multi-modal information; finally, the multi-scale information is concatenated along the channel dimension, and a deconvolution converts the size and channels back to the original label size, generating the final salient object detection output $S_f$. This further improves the robustness of the saliency detection result.
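For illustration, the following is a minimal PyTorch sketch of such a detection module, under the assumptions that the decision outputs are single-channel maps, the branch width is arbitrary, and bilinear interpolation alone stands in for the interpolation-plus-deconvolution upsampling; it is not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionModule(nn.Module):
    """Lightweight detection module Det (sketch): element-wise sum of the two
    decision outputs, 4x upsampling, parallel 1x1/3x3/5x5 convolutions plus a
    global-pooling branch, channel concatenation, and a final deconvolution."""

    def __init__(self, width: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(1, width, kernel_size=1)
        self.conv3 = nn.Conv2d(1, width, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(1, width, kernel_size=5, padding=2)
        self.deconv = nn.ConvTranspose2d(3 * width + 1, 1, kernel_size=4,
                                         stride=2, padding=1)

    def forward(self, s_r: torch.Tensor, s_t: torch.Tensor) -> torch.Tensor:
        s = s_r + s_t                                  # preliminary fusion feature
        s = F.interpolate(s, scale_factor=4, mode="bilinear",
                          align_corners=False)
        pooled = F.adaptive_avg_pool2d(s, 1).expand(-1, -1, *s.shape[2:])
        multi = torch.cat([self.conv1(s), self.conv3(s), self.conv5(s), pooled],
                          dim=1)                       # multi-scale concatenation
        return self.deconv(multi)                      # final output S_f
```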
The present invention also provides a saliency detection apparatus comprising: a memory, a processor, and a saliency detection program stored on the memory and executable on the processor, the saliency detection program when executed by the processor implementing the steps of the saliency detection method as described above.
The present invention also provides a computer-readable storage medium having a saliency detection program stored thereon, which when executed by a processor implements the steps of the saliency detection method as described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a device to execute the methods according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A significance detection method, characterized by comprising the steps of:
determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network;
determining a first global descriptor according to the first output result and determining a second global descriptor according to the second output result;
determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor;
determining a first modal characteristic according to the importance and the first output result, and determining a second modal characteristic according to the importance and the second output result;
determining a significance detection result based on the first modal characteristics and the second modal characteristics.
2. The saliency detection method of claim 1, wherein said steps of determining said first global descriptor from said first output result and determining a second global descriptor from said second output result are preceded by further comprising:
acquiring the length, width and channel information of the first target convolutional layer and the length, width and channel information of the second target convolutional layer;
the step of determining the first global descriptor according to the first output result and determining the second global descriptor according to the second output result comprises:
determining the first global descriptor according to the first output result and the length, width and channel information of the first target convolutional layer;
and determining the second global descriptor according to the second output result and the length, the width and the channel information of the second target convolutional layer.
3. The saliency detection method according to claim 1, characterized in that said step of determining the importance of the different channels of the respective modalities from said first global descriptor and said second global descriptor comprises:
determining a first concatenation result of the first global descriptor and the second global descriptor;
determining the importance based on a preset activation function and the first concatenation result.
4. The significance detection method according to claim 1, wherein the importance includes a first importance of each channel of a first modality corresponding to the first backbone network and a second importance of each channel of a second modality corresponding to the second backbone network, the determining a first modality feature according to the importance and the first output result, and the determining a second modality feature according to the importance and the second output result includes:
determining the first modal characteristic according to an element product of the first importance and the first output result, and determining the second modal characteristic according to an element product of the second importance and the second output result.
5. The significance detection method of claim 4, wherein prior to the steps of determining the first modal feature according to the element product of the first importance and the first output result, and determining the second modal feature according to the element product of the second importance and the second output result, the method further comprises:
determining a second concatenation result between the element product of the first importance and the first output result and the element product of the second importance and the second output result;
determining a convolution output of the second concatenation result;
the step of determining the first modal characteristic according to the element product of the first importance and the first output result, and determining the second modal characteristic according to the element product of the second importance and the second output result comprises:
determining the first modal characteristic from the convolution output and an elemental product of the first importance and the first output result; and
determining the second modal characteristic based on the element product of the second importance and the second output result, and the convolution output.
6. The significance detection method of claim 1, wherein the step of determining a significance detection result based on the first modal characteristics and the second modal characteristics comprises:
when the first target convolutional layer and the second target convolutional layer are the last convolutional layers, fusing the first modal characteristic and the second modal characteristic, and determining the significance detection result according to the fusion result;
when the first target convolutional layer and the second target convolutional layer are not the last convolutional layers, taking the first modal characteristic as the input of the convolutional layer following the first target convolutional layer, and taking the second modal characteristic as the input of the convolutional layer following the second target convolutional layer, so as to obtain the first output characteristic of the first backbone network and the second output characteristic of the second backbone network; and
and fusing the first output characteristic and the second output characteristic, and determining the significance detection result according to the fusion result.
7. The significance detection method of claim 1, wherein the step of determining a significance detection result based on the first modal characteristics and the second modal characteristics comprises:
determining a first decision output of a modality corresponding to a first backbone network according to the first modality characteristics, and determining a second decision output of a modality corresponding to a second backbone network according to the second modality characteristics;
determining the significance detection result according to the first decision output and the second decision output.
8. The significance detection method of claim 7, wherein the step of determining the significance detection result based on the first decision output and the second decision output comprises:
taking the element-wise sum of the first decision output and the second decision output as a preliminary fusion feature;
performing up-sampling on the preliminary fusion features to obtain a sampling result;
and performing convolution and global pooling on the sampling result, and determining the significance detection result according to the convolution and pooling result.
9. A saliency detection device characterized in that it comprises: memory, a processor and a saliency detection program stored on said memory and executable on said processor, said saliency detection program when executed by said processor implementing the steps of the saliency detection method as claimed in any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a saliency detection program that, when executed by a processor, implements the steps of the saliency detection method of any one of claims 1 to 8.
CN202111507286.9A 2021-12-10 2021-12-10 Saliency detection method, saliency detection device, and storage medium Pending CN114359162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111507286.9A CN114359162A (en) 2021-12-10 2021-12-10 Saliency detection method, saliency detection device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111507286.9A CN114359162A (en) 2021-12-10 2021-12-10 Saliency detection method, saliency detection device, and storage medium

Publications (1)

Publication Number Publication Date
CN114359162A true CN114359162A (en) 2022-04-15

Family

ID=81099521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111507286.9A Pending CN114359162A (en) 2021-12-10 2021-12-10 Saliency detection method, saliency detection device, and storage medium

Country Status (1)

Country Link
CN (1) CN114359162A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116823908A (en) * 2023-06-26 2023-09-29 北京邮电大学 Monocular image depth estimation method based on multi-scale feature correlation enhancement


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination