CN114359162A - Saliency detection method, saliency detection device, and storage medium - Google Patents

Saliency detection method, saliency detection device, and storage medium

Info

Publication number
CN114359162A
CN114359162A
Authority
CN
China
Prior art keywords
determining
result
output
importance
modal
Prior art date
Legal status
Pending
Application number
CN202111507286.9A
Other languages
Chinese (zh)
Inventor
高伟
廖桂标
李革
Current Assignee
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN202111507286.9A priority Critical patent/CN114359162A/en
Publication of CN114359162A publication Critical patent/CN114359162A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a saliency detection method, a saliency detection device, and a computer-readable storage medium, wherein the method comprises the following steps: determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network; determining a first global descriptor according to the first output result and a second global descriptor according to the second output result; determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor; determining a first modal characteristic according to the importance and the first output result, and a second modal characteristic according to the importance and the second output result; and determining a saliency detection result based on the first modal characteristic and the second modal characteristic. The invention aims to improve the robustness of the saliency detection result.

Description

Saliency detection method, saliency detection device, and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a saliency detection method, a saliency detection device, and a computer-readable storage medium.
Background
In the related art, multimodal RGB-D (RGB-Depth) and RGB-T (RGB-Thermal) salient object detection mainly relies on low-level cues for fusion. This class of methods uses depth saliency or temperature saliency as additional information. That is, the former extracts depth features from a depth map and combines them with features extracted from an RGB (color) image to obtain a saliency map, while the latter extracts temperature features from a thermal map and combines them with features extracted from an RGB image to obtain a saliency map. For example, in RGB-D salient object detection, the process generally computes priors in the RGB and depth images, such as background priors, center priors, and depth priors, and then detects salient objects by comparing features such as color, brightness, texture, and depth across different regions and fusing them by multiplication or addition together with some post-processing techniques.
Since the quality of the depth map is susceptible to noise from sensor temperature, background illumination, and the distance and reflectivity of the observed object, erroneous or missing regions may appear in the depth map. Similarly, the quality of the thermal map is susceptible to camera noise and sensor noise, and the thermal map may contain thermal-crossover or blurred regions. In addition, because real environments present challenging scenes such as varying lighting conditions, similar foregrounds and backgrounds, and complex backgrounds, the acquired RGB image may not provide accurate information. The saliency detection method described in the above related art cannot handle such defects in the original images, so the robustness of its saliency detection results is low.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a saliency detection method, a saliency detection device, and a computer-readable storage medium, in order to solve the technical problem in the related art that the robustness of saliency detection results is low.
In order to achieve the above object, the present invention provides a saliency detection method comprising the steps of:
determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network;
determining a first global descriptor according to the first output result and determining a second global descriptor according to the second output result;
determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor;
determining a first modal characteristic according to the importance and the first output result, and determining a second modal characteristic according to the importance and the second output result;
determining a significance detection result based on the first modal characteristics and the second modal characteristics.
Optionally, before the steps of determining the first global descriptor according to the first output result and determining the second global descriptor according to the second output result, the method further includes:
acquiring the length, width and channel information of the first target convolutional layer and the length, width and channel information of the second target convolutional layer;
the step of determining the first global descriptor according to the first output result and determining the second global descriptor according to the second output result comprises:
determining the first global descriptor according to the first output result and the length, width and channel information of the first target convolutional layer;
and determining the second global descriptor according to the second output result and the length, the width and the channel information of the second target convolutional layer.
Optionally, the step of determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor includes:
determining a first concatenation result of the first global descriptor and the second global descriptor;
determining the importance based on a preset activation function and the first concatenation result.
Optionally, the determining the first modality characteristics according to the importance and the first output result, and the determining the second modality characteristics according to the importance and the second output result include:
determining the first modal characteristic according to an element product of the first importance and the first output result, and determining the second modal characteristic according to an element product of the second importance and the second output result.
Optionally, before the step of determining the first modal characteristic according to the element product of the first importance and the first output result, and determining the second modal characteristic according to the element product of the second importance and the second output result, the method further comprises:
determining a second concatenation result between the element product of the first importance and the first output result and the element product of the second importance and the second output result;
determining a convolution output of the second concatenation result;
the step of determining the first modal characteristic according to the element product of the first importance and the first output result, and determining the second modal characteristic according to the element product of the second importance and the second output result comprises:
determining the first modal characteristic from the convolution output and an elemental product of the first importance and the first output result; and
determining the second modal characteristic based on the element product of the second importance and the second output result, and the convolution output.
Optionally, the step of determining a saliency detection result based on the first modal characteristics and the second modal characteristics comprises:
when the first target convolutional layer and the second target convolutional layer are the last convolutional layers, fusing the first modal characteristic and the second modal characteristic, and determining the significance detection result according to the fusion result;
when the first target convolutional layer and the second target convolutional layer are not the last convolutional layers, taking the first modal characteristic as the input of the convolutional layer following the first target convolutional layer, and taking the second modal characteristic as the input of the convolutional layer following the second target convolutional layer, so as to obtain the first output characteristic of the first backbone network and the second output characteristic of the second backbone network; and
and fusing the first output characteristic and the second output characteristic, and determining the significance detection result according to the fusion result.
Optionally, the step of determining a saliency detection result based on the first modal characteristics and the second modal characteristics comprises:
determining a first decision output of a modality corresponding to a first backbone network according to the first modality characteristics, and determining a second decision output of a modality corresponding to a second backbone network according to the second modality characteristics;
determining the significance detection result according to the first decision output and the second decision output.
Optionally, the step of determining the significance detection result according to the first decision output and the second decision output comprises:
taking the element-wise sum of the first decision output and the second decision output as a preliminary fusion feature;
performing up-sampling on the preliminary fusion features to obtain a sampling result;
and performing convolution and global pooling on the sampling result, and determining the significance detection result according to the convolution and pooling result.
The present invention also provides a saliency detection apparatus comprising: a memory, a processor, and a saliency detection program stored on the memory and executable on the processor, the saliency detection program when executed by the processor implementing the steps of the saliency detection method as described above.
According to the saliency detection method, the saliency detection device, and the computer-readable storage medium provided by the embodiments of the present invention, a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network are determined; then a first global descriptor is determined according to the first output result and a second global descriptor according to the second output result; the importance of different channels of each modality is determined according to the first global descriptor and the second global descriptor; a first modal feature is determined according to the importance and the first output result, and a second modal feature according to the importance and the second output result; and a saliency detection result is determined based on the first modal feature and the second modal feature. Unlike methods that directly extract RGB features and auxiliary modal features with independent backbone networks, which indiscriminately introduce noise or redundant information, the scheme provided by this embodiment extracts RGB features and auxiliary modal features simultaneously in a weight-sharing backbone feature extraction manner that accounts for the quality defects of different modal inputs, and makes full use of the complementarity among the multi-modal features to enhance accurate feature expression, thereby improving the robustness of the saliency detection result.
Drawings
Fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram illustrating a significance detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an overall architecture of a saliency detection network according to an embodiment of the present invention;
FIG. 4 is a schematic data processing flow diagram of an FCE according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a detailed step of step S50 according to an embodiment of the significance detection method of the present invention;
fig. 6 is a schematic data processing flow diagram of a detection module according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.
The terminal in the embodiment of the invention may be a saliency detection device, such as a PC (personal computer) or a server.
As shown in fig. 1, the terminal may include: a processor 1001, an interface 1003, a memory 1004, and a communication bus 1002, wherein the communication bus 1002 is used to implement connection and communication between these components. The interface 1003 communicates with other devices or components. The memory 1004 may be NAND flash memory, or alternatively a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1004, which is a kind of computer storage medium, may include therein a control system, an interface module, and a saliency detection program.
In the terminal shown in fig. 1, the processor 1001 may be configured to call the saliency detection program stored in the memory 1004 and perform the following operations:
determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network;
determining a first global descriptor according to the first output result and determining a second global descriptor according to the second output result;
determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor;
determining a first modal characteristic according to the importance and the first output result, and determining a second modal characteristic according to the importance and the second output result;
determining a significance detection result based on the first modal characteristics and the second modal characteristics.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
acquiring the length, width and channel information of the first target convolutional layer and the length, width and channel information of the second target convolutional layer;
the step of determining the first global descriptor according to the first output result and determining the second global descriptor according to the second output result comprises:
determining the first global descriptor according to the first output result and the length, width and channel information of the first target convolutional layer;
and determining the second global descriptor according to the second output result and the length, the width and the channel information of the second target convolutional layer.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
determining a first concatenation result of the first global descriptor and the second global descriptor;
determining the importance based on a preset activation function and the first concatenation result.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
determining the first modal characteristic according to an element product of the first importance and the first output result, and determining the second modal characteristic according to an element product of the second importance and the second output result.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
determining a second concatenation result between the element product of the first importance and the first output result and the element product of the second importance and the second output result;
determining a convolution output of the second concatenation result;
the step of determining the first modal characteristic according to the element product of the first importance and the first output result, and determining the second modal characteristic according to the element product of the second importance and the second output result comprises:
determining the first modal characteristic from the convolution output and an elemental product of the first importance and the first output result; and
determining the second modal characteristic based on the element product of the second importance and the second output result, and the convolution output.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
when the first target convolutional layer and the second target convolutional layer are the last convolutional layers, fusing the first modal characteristic and the second modal characteristic, and determining the significance detection result according to the fusion result;
when the first target convolutional layer and the second target convolutional layer are not the last convolutional layers, taking the first modal characteristic as the input of the convolutional layer following the first target convolutional layer, and taking the second modal characteristic as the input of the convolutional layer following the second target convolutional layer, so as to obtain the first output characteristic of the first backbone network and the second output characteristic of the second backbone network; and
and fusing the first output characteristic and the second output characteristic, and determining the significance detection result according to the fusion result.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
determining a first decision output of a modality corresponding to a first backbone network according to the first modality characteristics, and determining a second decision output of a modality corresponding to a second backbone network according to the second modality characteristics;
determining the significance detection result according to the first decision output and the second decision output.
Optionally, in some embodiments, the processor 1001 may be further configured to invoke a saliency detection program stored in the memory 1004 and perform the following operations:
taking the element-wise sum of the first decision output and the second decision output as a preliminary fusion feature;
performing up-sampling on the preliminary fusion features to obtain a sampling result;
and performing convolution and global pooling on the sampling result, and determining the significance detection result according to the convolution and pooling result.
In order to overcome the drawbacks described in the background above, an embodiment of the present invention provides a saliency detection method, aiming to improve the robustness of saliency detection results. For ease of understanding, the saliency detection method proposed by the present invention is explained below through specific embodiments.
Referring to fig. 2, the present invention provides a first embodiment of a saliency detection method comprising:
step S10: determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network;
step S20: determining a first global descriptor according to the first output result, and determining a second global descriptor according to the second output result;
step S30: determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor;
step S40: determining a first modal characteristic according to the importance and the first output result, and determining a second modal characteristic according to the importance and the second output result;
step S50: determining a significance detection result based on the first modal characteristics and the second modal characteristics.
In this embodiment, an RGB image to be processed and an auxiliary image may be acquired first, where the auxiliary image may be a thermodynamic diagram or a depth map.
After the RGB image and the auxiliary image are acquired, if the first backbone network is set as a backbone network for processing the RGB image, the RGB image is used as an input of the first backbone network, and the auxiliary image is used as an input of the second backbone network. If the second backbone network is set as a backbone network for processing RGB images, the RGB images are used as input of the second backbone network, and the auxiliary images are used as input of the first backbone network.
Illustratively, referring to fig. 3, the present embodiment provides a saliency detection network, wherein the saliency detection network includes a feature extraction network and a fusion network. The feature extraction network comprises a first backbone network and a second backbone network.
The first backbone network and the second backbone network may include a plurality of convolutional layers, and the number of convolutional layers included in the first backbone network is the same as the number of convolutional layers included in the second backbone network. For example, the first backbone network and the second backbone network may each include 5 convolutional layers.
The feature extraction network further includes at least one FCE (flow collaboration enhancement module). An FCE may be placed after any one of the convolutional layers. In an alternative embodiment, the second, third, fourth, and fifth convolutional layers are each followed by an FCE, as shown in fig. 3. It should be understood that this does not mean the feature extraction network presented in this embodiment is limited to setting FCEs after the second, third, fourth, and fifth convolutional layers. In other embodiments, a single FCE may be placed after any one convolutional layer, or one FCE may be placed after each convolutional layer.
Based on the saliency detection network provided in this embodiment, a first output result of a first target convolutional layer of the first backbone network and a second output result of a second target convolutional layer of the second backbone network may be determined first. The first target convolutional layer and the second target convolutional layer are corresponding convolutional layers of the first backbone network and the second backbone network, respectively; for example, they may be the second convolutional layer Conv2 of the first backbone network and the second convolutional layer Conv2 of the second backbone network.
The first output result and the second output result may be input to the FCE and processed to obtain the first modal characteristic and the second modal characteristic. The FCE processing includes: determining a first global descriptor according to the first output result and a second global descriptor according to the second output result, determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor, determining a first modal feature according to the importance and the first output result, and determining a second modal feature according to the importance and the second output result.
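To make the overall data flow concrete, the following is a minimal PyTorch sketch of the paired-backbone feature extraction with FCEs inserted after selected stages. It is a sketch under assumptions, not the patent's implementation: the stage modules, the FCE placement (after stages 2 through 5), and all names are illustrative, and weight sharing between the two branches can be obtained by passing the same stage modules for both arguments.

```python
import torch.nn as nn

class SaliencyEncoder(nn.Module):
    """Paired-backbone encoder: each stage processes both modalities, and an
    FCE (if configured for that stage) enhances the two outputs before they
    are fed to the next stage. All structural choices here are assumptions."""

    def __init__(self, rgb_stages, aux_stages, fce_by_stage):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)   # e.g., 5 conv stages
        self.aux_stages = nn.ModuleList(aux_stages)   # same depth as rgb_stages
        self.fces = nn.ModuleDict(fce_by_stage)       # keys: stage indices as str

    def forward(self, rgb, aux):
        x_r, x_t = rgb, aux
        for i, (stage_r, stage_t) in enumerate(zip(self.rgb_stages,
                                                   self.aux_stages)):
            x_r, x_t = stage_r(x_r), stage_t(x_t)     # first/second output results
            if str(i) in self.fces:                   # enhance, then feed forward
                x_r, x_t = self.fces[str(i)](x_r, x_t)
        return x_r, x_t                               # first/second output features
```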
Referring to fig. 4, fig. 4 is a schematic diagram of an FCE. After the FCE obtains a first output result $X_r^l$ and a second output result $X_t^l$, the first global descriptor may be determined according to the first output result $X_r^l$, and the second global descriptor according to the second output result $X_t^l$.
Optionally, when determining the first global descriptor and the second global descriptor, the length, the width, and the channel information of the first target convolutional layer and the length, the width, and the channel information of the second target convolutional layer may be obtained first, then the first global descriptor is determined according to the first output result and the length, the width, and the channel information of the first target convolutional layer, and the second global descriptor is determined according to the second output result and the length, the width, and the channel information of the second target convolutional layer.
In a specific implementation, the l-th layer output of the RGB backbone network is

$$X_r^l \in \mathbb{R}^{H \times W \times C}$$

where H, W, and C denote the height, width, and number of channels of the l-th layer, respectively. First, global descriptors may be computed for the original RGB features and the auxiliary modal features, i.e., the first global descriptor and the second global descriptor. The following explanation takes the case where the first backbone network processes the RGB image and the second backbone network processes the auxiliary image, so that the first and second global descriptors are the global descriptors of the original RGB features and the auxiliary modal features, respectively. The global descriptor is computed as follows:

$$g_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$$

where (i, j) denotes the coordinate location and $g_c$ represents the spatial statistic of the c-th channel.
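As a concrete illustration, this per-channel statistic is a global average pooling over the spatial dimensions; the following minimal PyTorch sketch (tensor names and shapes are assumptions) computes it for both modalities.

```python
import torch

def global_descriptor(x: torch.Tensor) -> torch.Tensor:
    """g_c = (1 / (H * W)) * sum_{i,j} x_c(i, j) for each channel c.

    x: feature map of shape (N, C, H, W); returns descriptors of shape (N, C).
    """
    return x.mean(dim=(2, 3))

# Hypothetical layer-l outputs of the two backbones (shapes are assumptions).
x_r = torch.randn(1, 64, 56, 56)   # RGB-branch features
x_t = torch.randn(1, 64, 56, 56)   # auxiliary-branch features
g_r, g_t = global_descriptor(x_r), global_descriptor(x_t)   # each (1, 64)
```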
It can be understood that the first global descriptor and the second global descriptor are computed in the same manner; each can be obtained from the above formula by substituting the parameters of the corresponding network.
After computing a first global descriptor and a second global descriptor, a first concatenation result of the first global descriptor and the second global descriptor may be determined, and then the importance may be determined based on a preset activation function and the first concatenation result.
Illustratively, to enhance the interaction along the channel dimension and exploit the complementarity between the modalities to suppress noisy feature responses, the multi-modal global descriptors $g_r^l$ and $g_t^l$ are adaptively aggregated and learned from each other through the collaborative enhancement layer, as follows:

$$\left[W_r^l, W_t^l\right] = FC_2\left(\mathrm{ReLU}\left(FC_1\left(g_r^l \oplus g_t^l\right)\right)\right)$$

where ReLU denotes the ReLU activation function, $\oplus$ denotes concatenation along the channel dimension, and $FC_1$ and $FC_2$ denote learnable fully connected layers. The resulting $W_r^l$ and $W_t^l$ represent the importance of the different channels of each modality.
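A minimal PyTorch sketch of this collaborative enhancement layer is given below. The bottleneck reduction ratio and the final sigmoid gate (which maps the importance values into (0, 1)) are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class CollaborativeEnhancement(nn.Module):
    """Two fully connected layers over the concatenated global descriptors
    produce per-channel importance weights for each modality."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channels = channels
        self.fc1 = nn.Linear(2 * channels, 2 * channels // reduction)
        self.fc2 = nn.Linear(2 * channels // reduction, 2 * channels)

    def forward(self, g_r: torch.Tensor, g_t: torch.Tensor):
        # [W_r, W_t] = FC2(ReLU(FC1(g_r (+) g_t))); the sigmoid is an assumption.
        w = torch.sigmoid(
            self.fc2(torch.relu(self.fc1(torch.cat([g_r, g_t], dim=1))))
        )
        return w[:, : self.channels], w[:, self.channels :]
```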
Further, the first modal characteristic may be determined according to an element product of the first importance and the first output result, and the second modal characteristic may be determined according to an element product of the second importance and the second output result.
Illustratively, a second concatenation result between the element product of the first importance and the first output result and the element product of the second importance and the second output result may be determined, and a convolution output of the second concatenation result computed. The first modal characteristic is then determined from the convolution output and the element product of the first importance and the first output result, and the second modal characteristic from the convolution output and the element product of the second importance and the second output result.
For example, in an alternative embodiment, to obtain a robust multi-modal feature representation, the original modal feature inputs are recalibrated based on the channel importance computed above and then further fused, as follows:

$$\tilde{X}_r^l = W_r^l \odot X_r^l$$

$$\tilde{X}_t^l = W_t^l \odot X_t^l$$

$$X_f^l = \mathrm{Conv}\left(\tilde{X}_r^l \oplus \tilde{X}_t^l\right)$$

where $\odot$ denotes element-wise multiplication, Conv is a convolutional layer, and $\oplus$ denotes concatenation along the channel dimension.
Optionally, in some embodiments, a residual connection makes the network easier to train and optimize, as follows:

$$\hat{X}_r^l = X_r^l + \tilde{X}_r^l + X_f^l$$

$$\hat{X}_t^l = X_t^l + \tilde{X}_t^l + X_f^l$$

The enhanced $\hat{X}_r^l$ and $\hat{X}_t^l$ are the first modal characteristic and the second modal characteristic, respectively.
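Putting the pieces together, the following is a minimal PyTorch sketch of one FCE, assuming the reconstruction above (recalibration, fusion convolution, shared fused term, and residual connections); it reuses the CollaborativeEnhancement sketch from earlier, and the 3 × 3 fusion kernel is an assumption.

```python
import torch
import torch.nn as nn

class FCE(nn.Module):
    """Flow collaboration enhancement module (sketch): recalibrate each
    modality by its channel importance, fuse the recalibrated features with a
    convolution over their concatenation, and add residual inputs."""

    def __init__(self, channels: int):
        super().__init__()
        self.enhance = CollaborativeEnhancement(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x_r: torch.Tensor, x_t: torch.Tensor):
        w_r, w_t = self.enhance(x_r.mean(dim=(2, 3)), x_t.mean(dim=(2, 3)))
        xr_tilde = w_r[:, :, None, None] * x_r   # element product W_r (.) X_r
        xt_tilde = w_t[:, :, None, None] * x_t   # element product W_t (.) X_t
        x_f = self.fuse(torch.cat([xr_tilde, xt_tilde], dim=1))
        # Residual connections ease optimization, per the embodiment above.
        return x_r + xr_tilde + x_f, x_t + xt_tilde + x_f
```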
Optionally, considering that context information is beneficial for detecting salient objects of different sizes and positions, a multi-scale flow collaboration enhancement module (MFCE) may be added to the network.
The MFCE may first compute a contextual feature embedding and then use the FCE to generate the enhanced multi-modal feature expression. Specifically, the multi-scale flow collaboration enhancement module can be represented as follows:

$$\left(\hat{X}_r^l, \hat{X}_t^l\right) = \mathrm{FCE}\left(X_r^l \odot \sigma\left(U\left(P\left(X_r^l\right)\right)\right),\ X_t^l \odot \sigma\left(U\left(P\left(X_t^l\right)\right)\right)\right)$$

where $P$ is a 4 × 4 global average pooling operation, $U$ is a 4× upsampling operation, and $\sigma$ is the Sigmoid activation function. Similarly, the enhanced feature $\hat{X}^l$ of each layer serves as the input to the next layer for continued enhancement. As the network grows deeper, effective context expression is highlighted at the encoding stage by using the MFCE.
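The following minimal PyTorch sketch shows one way to realize this context gating, assuming the reconstruction above; pooling to a 4 × 4 grid and upsampling back to the input resolution stand in for P and U, and the composition with the FCE sketch from earlier is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFCE(nn.Module):
    """Multi-scale flow collaboration enhancement (sketch): a 4x4 global
    average pooling computes a context embedding, a sigmoid gate re-weights
    the features after upsampling, then an FCE fuses the two modalities."""

    def __init__(self, channels: int):
        super().__init__()
        self.fce = FCE(channels)   # from the sketch above

    def _context_gate(self, x: torch.Tensor) -> torch.Tensor:
        ctx = F.adaptive_avg_pool2d(x, 4)   # P: pool to a 4x4 context grid
        gate = torch.sigmoid(               # sigma after U (upsample back)
            F.interpolate(ctx, size=x.shape[2:], mode="bilinear",
                          align_corners=False)
        )
        return x * gate

    def forward(self, x_r: torch.Tensor, x_t: torch.Tensor):
        return self.fce(self._context_gate(x_r), self._context_gate(x_t))
```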
In an optional implementation, when the first target convolutional layer and the second target convolutional layer are the last convolutional layers of the feature extraction network, the first modal feature and the second modal feature may be input to a decision fusion network, and the saliency detection result is obtained through the decision fusion network.
When the first target convolutional layer and the second target convolutional layer are not the last convolutional layers, the first modal feature is taken as the input of the convolutional layer following the first target convolutional layer and the second modal feature as the input of the convolutional layer following the second target convolutional layer, so as to obtain the first output feature of the first backbone network and the second output feature of the second backbone network; the first output feature and the second output feature are then fused, and the significance detection result is determined according to the fusion result.
It will be appreciated that the convolutional layers subsequent to the first and second target convolutional layers are also provided with FCEs, and may be processed again according to the above manner until the first and second output characteristics are obtained. The specific processing manner is similar to that described above, and is not described herein again.
In the technical solution disclosed in this embodiment, a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network are determined; a first global descriptor is then determined according to the first output result and a second global descriptor according to the second output result; the importance of different channels of each modality is determined according to the first global descriptor and the second global descriptor; a first modal feature is determined according to the importance and the first output result, and a second modal feature according to the importance and the second output result; and a significance detection result is determined based on the first modal feature and the second modal feature. Unlike methods that directly extract RGB features and auxiliary modal features with independent backbone networks, which indiscriminately introduce noise or redundant information, the scheme provided by this embodiment extracts RGB features and auxiliary modal features simultaneously in a weight-sharing backbone feature extraction manner that accounts for the quality defects of different modal inputs, and makes full use of the complementarity among the multi-modal features to enhance accurate feature expression, thereby improving the robustness of the saliency detection result.
Optionally, referring to fig. 5, based on the foregoing embodiment, the step S50 includes:
step S51: determining a first decision output of a modality corresponding to a first backbone network according to the first modality characteristics, and determining a second decision output of a modality corresponding to a second backbone network according to the second modality characteristics;
step S52: determining the significance detection result according to the first decision output and the second decision output.
In this embodiment, the first decision output is the output of the first backbone network, and the second decision output is the output of the second backbone network. After the first decision output and the second decision output are obtained, their element-wise sum can be used as a preliminary fusion feature; the preliminary fusion feature is then upsampled to obtain a sampling result, convolution and global pooling are performed on the sampling result, and the significance detection result is determined according to the convolution and pooling results.
Illustratively, a conventional feature-based fusion strategy relies on the accuracy of the extracted features and is sensitive to noise: if the extracted features are unreliable, directly fusing the noisy features can lead to sub-optimal results. Unlike the feature-based fusion strategy described in the above embodiment, a decision-based fusion strategy (DFS) may be designed from the viewpoint of the reliability of the modal outputs, aiming to generate a more robust and reliable salient object detection result based on the accuracy of the modal outputs.
As shown in figs. 3 and 6, the RGB features $\hat{X}_R$ from the RGB backbone network are first fed to a 3 × 3 convolutional layer (Conv) to obtain the features $F_R$, and $F_R$ is then fed into a modality-specific RGB decoder $G_R$ to generate the decision output $S_R$ of the RGB modality, as follows:

$$S_R = U\left(G_R\left(\mathrm{Conv}\left(\hat{X}_R\right)\right)\right)$$

where Conv is a 3 × 3 convolutional layer, $\oplus$ denotes concatenation along the channel dimension, and $U$ is an upsampling operation. Similarly, the decision output $S_T$ of the auxiliary modality may be obtained by the following formula:

$$S_T = U\left(G_T\left(\mathrm{Conv}\left(\hat{X}_T\right)\right)\right)$$
Then, the decision outputs of the different modalities are fed into a lightweight Detection Module (Det) to obtain the final multi-modal salient object detection output $S_f$:

$$S_f = \mathrm{Det}\left(S_R, S_T\right)$$
As shown in fig. 6, $S_R$ and $S_T$ may first undergo an element-wise addition to obtain the preliminary fusion feature. The fused feature is then upsampled by a factor of 4 using bilinear interpolation and deconvolution; 1 × 1, 3 × 3, and 5 × 5 convolutions and global pooling are applied to it in parallel to capture multi-scale multi-modal information; finally, the multi-scale information is concatenated along the channel dimension, and a deconvolution converts the size and channels back to the original label size, generating the final salient object detection output $S_f$. This further improves the robustness of the saliency detection result.
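For illustration, the following is a minimal PyTorch sketch of such a detection module, under the assumptions that the decision outputs are single-channel maps, the branch width is arbitrary, and bilinear interpolation alone stands in for the interpolation-plus-deconvolution upsampling; it is not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionModule(nn.Module):
    """Lightweight detection module Det (sketch): element-wise sum of the two
    decision outputs, 4x upsampling, parallel 1x1/3x3/5x5 convolutions plus a
    global-pooling branch, channel concatenation, and a final deconvolution."""

    def __init__(self, width: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(1, width, kernel_size=1)
        self.conv3 = nn.Conv2d(1, width, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(1, width, kernel_size=5, padding=2)
        self.deconv = nn.ConvTranspose2d(3 * width + 1, 1, kernel_size=4,
                                         stride=2, padding=1)

    def forward(self, s_r: torch.Tensor, s_t: torch.Tensor) -> torch.Tensor:
        s = s_r + s_t                                  # preliminary fusion feature
        s = F.interpolate(s, scale_factor=4, mode="bilinear",
                          align_corners=False)
        pooled = F.adaptive_avg_pool2d(s, 1).expand(-1, -1, *s.shape[2:])
        multi = torch.cat([self.conv1(s), self.conv3(s), self.conv5(s), pooled],
                          dim=1)                       # multi-scale concatenation
        return self.deconv(multi)                      # final output S_f
```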
The present invention also provides a saliency detection apparatus comprising: a memory, a processor, and a saliency detection program stored on the memory and executable on the processor, the saliency detection program when executed by the processor implementing the steps of the saliency detection method as described above.
The present invention also provides a computer-readable storage medium having a saliency detection program stored thereon, which when executed by a processor implements the steps of the saliency detection method as described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a device to execute the methods according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A significance detection method, characterized by comprising the steps of:
determining a first output result of a first target convolutional layer of a first backbone network and a second output result of a second target convolutional layer of a second backbone network;
determining a first global descriptor according to the first output result and determining a second global descriptor according to the second output result;
determining the importance of different channels of each modality according to the first global descriptor and the second global descriptor;
determining a first modal characteristic according to the importance and the first output result, and determining a second modal characteristic according to the importance and the second output result;
determining a significance detection result based on the first modal characteristics and the second modal characteristics.
2. The saliency detection method of claim 1, wherein said steps of determining said first global descriptor from said first output result and determining a second global descriptor from said second output result are preceded by further comprising:
acquiring the length, width and channel information of the first target convolutional layer and the length, width and channel information of the second target convolutional layer;
the step of determining the first global descriptor according to the first output result and determining the second global descriptor according to the second output result comprises:
determining the first global descriptor according to the first output result and the length, width and channel information of the first target convolutional layer;
and determining the second global descriptor according to the second output result and the length, the width and the channel information of the second target convolutional layer.
3. The saliency detection method according to claim 1, characterized in that said step of determining the importance of the different channels of the respective modalities from said first global descriptor and said second global descriptor comprises:
determining a first concatenation result of the first global descriptor and the second global descriptor;
determining the importance based on a preset activation function and the first concatenation result.
4. The significance detection method according to claim 1, wherein the importance includes a first importance of each channel of a first modality corresponding to the first backbone network and a second importance of each channel of a second modality corresponding to the second backbone network, the determining a first modality feature according to the importance and the first output result, and the determining a second modality feature according to the importance and the second output result includes:
determining the first modal characteristic according to an element product of the first importance and the first output result, and determining the second modal characteristic according to an element product of the second importance and the second output result.
5. The significance detection method of claim 4, wherein prior to the steps of determining the first modal feature according to the element product of the first importance and the first output result, and determining the second modal feature according to the element product of the second importance and the second output result, the method further comprises:
determining a second concatenation result between the element product of the first importance and the first output result and the element product of the second importance and the second output result;
determining a convolution output of the second concatenation result;
the step of determining the first modal characteristic according to the element product of the first importance and the first output result, and determining the second modal characteristic according to the element product of the second importance and the second output result comprises:
determining the first modal characteristic from the convolution output and an elemental product of the first importance and the first output result; and
determining the second modal characteristic based on the element product of the second importance and the second output result, and the convolution output.
6. The significance detection method of claim 1, wherein the step of determining a significance detection result based on the first modal characteristics and the second modal characteristics comprises:
when the first target convolutional layer and the second target convolutional layer are the last convolutional layers, fusing the first modal characteristic and the second modal characteristic, and determining the significance detection result according to the fusion result;
when the first target convolutional layer and the second target convolutional layer are not the last convolutional layers, taking the first modal characteristic as the input of the convolutional layer following the first target convolutional layer, and taking the second modal characteristic as the input of the convolutional layer following the second target convolutional layer, so as to obtain the first output characteristic of the first backbone network and the second output characteristic of the second backbone network; and
and fusing the first output characteristic and the second output characteristic, and determining the significance detection result according to the fusion result.
7. The significance detection method of claim 1, wherein the step of determining a significance detection result based on the first modal characteristics and the second modal characteristics comprises:
determining a first decision output of a modality corresponding to a first backbone network according to the first modality characteristics, and determining a second decision output of a modality corresponding to a second backbone network according to the second modality characteristics;
determining the significance detection result according to the first decision output and the second decision output.
8. The significance detection method of claim 7, wherein the step of determining the significance detection result based on the first decision output and the second decision output comprises:
taking the element-wise sum of the first decision output and the second decision output as a preliminary fusion feature;
performing up-sampling on the preliminary fusion features to obtain a sampling result;
and performing convolution and global pooling on the sampling result, and determining the significance detection result according to the convolution and pooling result.
9. A saliency detection device characterized in that it comprises: memory, a processor and a saliency detection program stored on said memory and executable on said processor, said saliency detection program when executed by said processor implementing the steps of the saliency detection method as claimed in any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a saliency detection program that, when executed by a processor, implements the steps of the saliency detection method of any one of claims 1 to 8.
CN202111507286.9A 2021-12-10 2021-12-10 Saliency detection method, saliency detection device, and storage medium Pending CN114359162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111507286.9A CN114359162A (en) 2021-12-10 2021-12-10 Saliency detection method, saliency detection device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111507286.9A CN114359162A (en) 2021-12-10 2021-12-10 Saliency detection method, saliency detection device, and storage medium

Publications (1)

Publication Number Publication Date
CN114359162A true CN114359162A (en) 2022-04-15

Family

ID=81099521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111507286.9A Pending CN114359162A (en) 2021-12-10 2021-12-10 Saliency detection method, saliency detection device, and storage medium

Country Status (1)

Country Link
CN (1) CN114359162A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116823908A (en) * 2023-06-26 2023-09-29 北京邮电大学 Monocular image depth estimation method based on multi-scale feature correlation enhancement


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination