CN117237188A - Multi-scale attention network saliency target detection method based on remote sensing image

Multi-scale attention network saliency target detection method based on remote sensing image

Info

Publication number
CN117237188A
Authority
CN
China
Prior art keywords: tensor, feature, scale, upsampling, input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311185844.3A
Other languages
Chinese (zh)
Inventor
霍丽娜
王咏梅
王威
刘金生
李欢
高学渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Normal University
Original Assignee
Hebei Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Normal University filed Critical Hebei Normal University
Priority to CN202311185844.3A priority Critical patent/CN117237188A/en
Publication of CN117237188A publication Critical patent/CN117237188A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale attention network salient object detection method based on remote sensing images. The method adopts a lightweight backbone network for feature encoding; mines high-level semantic information at the top of the encoder through a compression operation and propagates it top-down; adds a semantic-information-guided multi-scale extraction module after the encoder to expand the receptive field and enhance the expression of multi-scale features; adds a residual fusion module to eliminate redundant information and noise from the encoder; and adds a two-dimensional residual attention module to capture the dependencies between the spatial and channel dimensions, thereby improving the accuracy of small-target detection. The invention can extract the most attention-grabbing content in any given image, pays more attention to small objects, and suppresses background interference.

Description

Multi-scale attention network saliency target detection method based on remote sensing image
Technical Field
The invention relates to a method for detecting a salient object of a multi-scale attention network, in particular to a method for detecting a salient object of a multi-scale attention network based on a remote sensing image, and belongs to the technical field of computer vision.
Background
With the rapid development and wide application of information technology and intelligent technology, a wave of enthusiasm for "intelligence" has risen worldwide, and artificial intelligence and deep learning have attracted great attention in recent years. The amount of data generated by applications in various fields has been growing explosively, so extracting important information from massive data accurately and rapidly has become essential. In particular, short videos and graphic messages have become an integral part of people's work and entertainment. Video and image information is a highly intuitive mode of expression; when processing such information, the human brain tends to focus on the more important regions and reduce attention to irrelevant regions in order to obtain more detailed information about the salient object. Therefore, enabling a computer to simulate the human visual system and quickly locate the most attention-grabbing object in an image or video has become a focus for researchers tackling complex visual problems. In deep learning, the more training samples there are, the better the trained model performs and the stronger its generalization ability. The journal paper "The theoretical research of generative adversarial networks: an overview" by Li et al. (2021) discloses the SRGAN model and applies data enhancement to two data sets respectively, improving image clarity and reducing noise interference. SRGAN has good data-distribution modeling capability: two neural networks learn by playing a game against each other, and after continuous optimization iterations the model reaches a Nash equilibrium, so small-target information is well preserved and small-target detection capability is improved. This opens up a new way to detect complete salient objects.
To date, salient object detection has been extended from the field of natural images to the field of optical remote sensing images, which are usually captured from a high-altitude bird's-eye view; most objects may be small in size, and the scene structure is more complex. Many researchers have proposed deep models for salient object detection in optical remote sensing images. In the paper "RRNet: relational reasoning network with parallel multi-scale attention for salient object detection in optical remote sensing images" by Cong et al. (IEEE Transactions on Geoscience and Remote Sensing, 2022), a parallel multi-scale attention module is introduced to recover salient-object detail information and address the multi-scale variation problem, but incomplete detection remains a limitation. Lin et al., in the international conference paper "Attention guided network for salient object detection in optical remote sensing images" (Artificial Neural Networks and Machine Learning, 2022), combined channel attention and spatial attention to extract multiple features from different dimensions, but the salient-object boundaries were not sufficiently sharp. All of these methods operate on a single dimension, ignoring the correlations and dependencies between dimensions. Therefore, modeling the correlation between the channel dimension and the spatial dimension is needed to improve the accuracy of small-target detection.
Disclosure of Invention
The invention aims to provide a multi-scale attention network saliency target detection method based on a remote sensing image.
In order to solve the technical problems, the invention adopts the following technical scheme: a method for detecting a salient object of a multi-scale attention network based on a remote sensing image comprises the following steps:
step 1, image preprocessing: using an SRGAN model to adjust the size of an input image to an input tensor X with a preset size;
step 2, establishing a two-dimensional multi-scale attention network: the two-dimensional multi-scale attention network comprises first to fifth feature processing units E1-E5 and a semantic compression operation; the semantic compression operation comprises a depth separable convolution layer DSConv and an adaptive average pooling layer AP. Tensor X is processed sequentially by feature encoders E1-E5 to obtain first to fifth feature tensors; the fifth feature tensor is subjected to the semantic compression operation to obtain global semantic information K. The global semantic information K is fed forward into the multi-scale extraction modules SMEM1-SMEM4: the fifth feature tensor, after upsampling and splicing with the fourth feature tensor, is input into SMEM4; the output of SMEM4, after upsampling and splicing with the third feature tensor, is input into SMEM3; the output of SMEM3, after upsampling and splicing with the second feature tensor, is input into SMEM2; the output of SMEM2, after upsampling and splicing with the first feature tensor, is input into SMEM1; and the modules SMEM1-SMEM4 respectively output decoded feature tensors F1^s-F4^s. Feature tensor F4^s after upsampling is spliced with decoded feature tensor F3^s to obtain joint decoding feature tensor F34^s; feature tensor F3^s after upsampling is spliced with decoded feature tensor F2^s to obtain joint decoding feature tensor F23^s; feature tensor F2^s after upsampling is spliced with decoded feature tensor F1^s to obtain joint decoding feature tensor F12^s. The joint decoding feature tensors F12^s, F23^s and F34^s are input into the residual fusion module RFM; the RFM outputs a multi-scale feature tensor Fr. The multi-scale feature tensor Fr is input into the two-dimensional residual attention module BRAM, which outputs a decoding feature tensor F^sa as the saliency map;
step 3, detecting a significance map: the image to be detected is input into a two-dimensional multi-scale attention network, and a saliency map is output.
Further, the size of the input tensor X is 224×224×3, and the sizes of the first to fifth feature tensors are 112×112×16, 56×56×24, 28×28×32, 14×14×96 and 7×7×320, respectively.
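As a quick sanity check (a sketch, not part of the patent: the backbone is unnamed, though these channel widths resemble a MobileNetV2-style lightweight encoder), the stage shapes follow from halving the resolution at each of the five stages:

```python
# Sketch: the encoder halves the spatial resolution at each of its five stages.
input_size = 224
stage_sizes = [input_size // 2 ** i for i in range(1, 6)]
print(stage_sizes)  # [112, 56, 28, 14, 7]

# Channel widths of the five feature tensors as stated above.
stage_channels = [16, 24, 32, 96, 320]
shapes = [(s, s, c) for s, c in zip(stage_sizes, stage_channels)]
print(shapes)  # [(112, 112, 16), (56, 56, 24), (28, 28, 32), (14, 14, 96), (7, 7, 320)]
```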
Further, the fifth feature tensor is subjected to semantic compression operation, and first, 3×3 depth separable convolution is applied to the fifth feature tensor to obtain a 7×7×64 feature map, and then, adaptive average pooling is applied to obtain a 5×5×64 feature map.
Further, the fifth feature tensor is reduced to the advanced semantic feature tensor K through convolution and pooling:
K = AvgPool(DSConv_{3×3}(E5))
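The compression operation can be sketched in PyTorch as follows; the class name and layer arrangement are assumptions, with only the 3×3 depth separable convolution, the 64-channel output, and the 5×5 adaptive average pooling taken from the text:

```python
import torch
import torch.nn as nn

class SemanticCompression(nn.Module):
    """Sketch of the semantic compression operation: depthwise-separable 3x3
    convolution (DSConv) followed by adaptive average pooling (AP)."""
    def __init__(self, in_ch=320, out_ch=64):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.pool = nn.AdaptiveAvgPool2d(5)     # spatial compression to 5x5

    def forward(self, e5):
        x = self.pointwise(self.depthwise(e5))  # 7x7x320 -> 7x7x64
        return self.pool(x)                     # 7x7x64  -> 5x5x64

e5 = torch.randn(1, 320, 7, 7)   # fifth feature tensor E5
k = SemanticCompression()(e5)    # global semantic information K
print(tuple(k.shape))  # (1, 64, 5, 5)
```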
Further, the output of the decoder features is:
F_i^s = SMEM_i(cat(UP(F_{i+1}^s), E_i), K), i = 4, 3, 2, 1 (with F_5^s taken as E_5)
where UP denotes upsampling using bilinear interpolation and cat denotes concatenation in the channel dimension. The deep coding features are upsampled and combined pairwise with the previous-layer features in the channel dimension for splicing, then processed by the SMEM to obtain the multi-scale features F_i^s; this operation is repeated on F_i^s, and the results are input into the RFM for optimization and correction.
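The upsample-and-concatenate step that feeds each SMEM can be sketched as below; `decode_step` is a hypothetical helper, with shapes taken from the SMEM4-to-SMEM3 hand-off:

```python
import torch
import torch.nn.functional as F

def decode_step(deep, shallow):
    """UP then cat: bilinear upsampling of the deeper feature, followed by
    concatenation with the previous-layer feature in the channel dimension."""
    up = F.interpolate(deep, scale_factor=2, mode='bilinear', align_corners=False)
    return torch.cat([up, shallow], dim=1)

deep = torch.randn(1, 64, 14, 14)     # e.g. output of SMEM4
shallow = torch.randn(1, 32, 28, 28)  # e.g. third feature tensor E3
fused = decode_step(deep, shallow)    # input handed to SMEM3
print(tuple(fused.shape))  # (1, 96, 28, 28)
```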
Further, after BRAM decoding, the output feature map F^sa of the original size is obtained by upsampling using bilinear interpolation.
By adopting the technical scheme, the invention has the following technical effects:
according to the invention, by introducing data enhancement and multi-scale attention strategies, the definition of an original image can be improved, and the practicability and accuracy of the model are improved. The semantic information representation capability is enhanced by compressing the advanced features, the semantic information is fed into the SMEM for multi-scale feature interaction, more effective multi-scale information is captured, and the problem of complex scale is solved; constructing RFM fusion semantic information and detail information, further capturing remarkable clues, and improving detail representation capability of low-level features; and the correlation between the BRAM modeling space and the channel is designed, and the BRAM modeling space and the channel are fused to inhibit background interference, so that the accuracy of small target detection is improved.
Drawings
Fig. 1 is a frame diagram of the present invention.
Fig. 2 is a block diagram of the semantic information guided multi-scale extraction module SMEM of the invention.
Fig. 3 is a block diagram of the residual fusion module RFM of the invention.
Fig. 4 is a block diagram of a two-dimensional residual attention module BRAM of the present invention.
Fig. 5 is an input image of embodiment 1 of the present invention.
Fig. 6 is a graph of the significance of the test of example 1 of the present invention.
Detailed Description
The following examples serve to illustrate the invention.
Example 1
Referring to fig. 1, a method for detecting a salient object of a two-dimensional multi-scale attention network based on an optical remote sensing image includes the following steps:
step 1, image preprocessing: the SRGAN model is used to enhance the fine-grained detail of the image, and the input image is adjusted to a tensor X of a preset size, in this embodiment 224×224×3;
step 2, establishing a two-dimensional multi-scale attention network: the two-dimensional multi-scale attention network comprises first to fifth feature processing units E1-E5 and a semantic compression operation; the semantic compression operation comprises a depth separable convolution layer DSConv and an adaptive average pooling layer AP. Tensor X is processed sequentially by feature encoders E1-E5 to obtain first to fifth feature tensors; the fifth feature tensor is subjected to the semantic compression operation to obtain global semantic information K. The global semantic information K is fed forward into the multi-scale extraction modules SMEM1-SMEM4: the fifth feature tensor, after upsampling and splicing with the fourth feature tensor, is input into SMEM4; the output of SMEM4, after upsampling and splicing with the third feature tensor, is input into SMEM3; the output of SMEM3, after upsampling and splicing with the second feature tensor, is input into SMEM2; the output of SMEM2, after upsampling and splicing with the first feature tensor, is input into SMEM1; and the modules SMEM1-SMEM4 respectively output decoded feature tensors F1^s-F4^s. Feature tensor F4^s after upsampling is spliced with decoded feature tensor F3^s to obtain joint decoding feature tensor F34^s; feature tensor F3^s after upsampling is spliced with decoded feature tensor F2^s to obtain joint decoding feature tensor F23^s; feature tensor F2^s after upsampling is spliced with decoded feature tensor F1^s to obtain joint decoding feature tensor F12^s. The joint decoding feature tensors F12^s, F23^s and F34^s are input into the residual fusion module RFM; the RFM outputs a multi-scale feature tensor Fr. The multi-scale feature tensor Fr is input into the two-dimensional residual attention module BRAM, which outputs a decoding feature tensor F^sa as the saliency map;
step 3, detecting the saliency map: the tensor is input, processed by the two-dimensional multi-scale attention network, and the saliency map is obtained.
The sizes of the first to fifth feature tensors in the present embodiment are 112×112×16, 56×56×24, 28×28×32, 14×14×96, 7×7×320, respectively.
The fifth feature tensor undergoes the semantic compression operation: first, a 3×3 depth separable convolution is applied for dimension reduction, yielding a 7×7×64 depth-separated feature map; then, adaptive average pooling is applied for spatial compression, yielding a 5×5×64 compressed feature map.
The fifth feature tensor is reduced to the advanced semantic feature tensor K through convolution and pooling:
K = AvgPool(DSConv_{3×3}(E5))
and the global semantic information K is sequentially fed forward and fused with each level of coding feature tensor and is input into a module SMEM to relieve interference caused by scale change and complex background. Referring to fig. 3, two feature tensors are obtained by respectively performing two 1×1 convolutions on the input, one feature tensor is input into three parallel dynamic depth convolutions to extract multi-scale information, and the void ratio is r=1, 2 and 3; another feature tensor is refined by adopting the traditional 3×3 convolution; the input feature tensor is compressed into a one-dimensional vector by the operations of maximum pooling and average pooling. The two one-dimensional vectors are then passed through a 1 x 1 convolution, sigmoid function and SELayThe er layer enhances the salient features; finally, the three pieces of context information are aggregated, and 4 groups of characteristic tensors F with different resolution sizes are output 1 s -F 4 s Sizes are 112×112×64, 56×56×64, 28×28×64, 14×14×64, respectively;
The feature tensors F1^s-F4^s are then connected pairwise in the channel dimension to obtain three groups of decoding feature tensors F12^s, F23^s and F34^s as inputs to the residual fusion module RFM, with sizes 112×112×64, 56×56×64 and 28×28×64, respectively. Referring to fig. 3, the RFM fuses semantic information and detail information layer by layer to achieve effective multi-scale feature fusion and obtains the multi-scale feature tensor Fr, with size 112×112×64;
tensor of multi-scale features F r Input into a two-dimensional residual attention module BRAM for processing. Referring to fig. 5, the input is first subjected to multi-scale feature extraction by adopting a multi-scale strategy, so as to obtain multi-scale features; then explore two-dimensional attention along channel and space dimension, solve the problems of attention loss and multi-scale change, recover detail information of remarkable object, output decoding characteristic tensor F sa As a significance map, the size thereof was 112×112×64. The output process of each module in the decoder is as follows:
wherein UP is UP sampled by bilinear interpolation, cat denotes connection in channel dimension, SMEM, RFM and BRAM denote operation of three modules.
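A hedged sketch of a BRAM-style module consistent with the description (channel attention plus spatial attention with a residual connection); the actual internals of the patented module are not disclosed in detail here, so every layer choice below is an assumption:

```python
import torch
import torch.nn as nn

class BRAM(nn.Module):
    """Hypothetical sketch: channel attention then spatial attention on F_r,
    combined through a residual connection; all layer choices are assumptions."""
    def __init__(self, ch=64):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)  # dependency along the channel dimension
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x + x * self.spatial(s)  # spatial dependency + residual

f_r = torch.randn(1, 64, 112, 112)  # multi-scale feature tensor F_r
f_sa = BRAM()(f_r)                  # decoding feature tensor F_sa
print(tuple(f_sa.shape))  # (1, 64, 112, 112)
```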
Finally, the output feature map is upsampled by a factor of 2 to the original spatial size, i.e., 224×224×64.
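The final 2× restoration can be expressed directly with bilinear interpolation:

```python
import torch
import torch.nn.functional as F

f_sa = torch.randn(1, 64, 112, 112)  # BRAM output
restored = F.interpolate(f_sa, scale_factor=2, mode='bilinear', align_corners=False)
print(tuple(restored.shape))  # (1, 64, 224, 224)
```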
It should be noted that the technical scheme of the invention has so far been applied and studied on a small scale, and the results show high user satisfaction. Preparations for technology-transfer application have now begun, together with an intellectual-property risk early-warning investigation.

Claims (9)

1. The method for detecting the saliency target of the multi-scale attention network based on the remote sensing image is characterized by comprising the following steps of:
step 1, image preprocessing: using an SRGAN model to adjust the size of an input image to an input tensor X with a preset size;
step 2, establishing a two-dimensional multi-scale attention network: the two-dimensional multi-scale attention network comprises first to fifth feature processing units E1-E5 and a semantic compression operation; the semantic compression operation comprises a depth separable convolution layer DSConv and an adaptive average pooling layer AP. Tensor X is processed sequentially by feature encoders E1-E5 to obtain first to fifth feature tensors; the fifth feature tensor is subjected to the semantic compression operation to obtain global semantic information K. The global semantic information K is fed forward into the multi-scale extraction modules SMEM1-SMEM4: the fifth feature tensor, after upsampling and splicing with the fourth feature tensor, is input into SMEM4; the output of SMEM4, after upsampling and splicing with the third feature tensor, is input into SMEM3; the output of SMEM3, after upsampling and splicing with the second feature tensor, is input into SMEM2; the output of SMEM2, after upsampling and splicing with the first feature tensor, is input into SMEM1; and the modules SMEM1-SMEM4 respectively output decoded feature tensors F1^s-F4^s. Feature tensor F4^s after upsampling is spliced with decoded feature tensor F3^s to obtain joint decoding feature tensor F34^s; feature tensor F3^s after upsampling is spliced with decoded feature tensor F2^s to obtain joint decoding feature tensor F23^s; feature tensor F2^s after upsampling is spliced with decoded feature tensor F1^s to obtain joint decoding feature tensor F12^s. The joint decoding feature tensors F12^s, F23^s and F34^s are input into the residual fusion module RFM; the RFM outputs a multi-scale feature tensor Fr. The multi-scale feature tensor Fr is input into the two-dimensional residual attention module BRAM, which outputs a decoding feature tensor F^sa as the saliency map;
step 3, detecting a significance map: the image to be detected is input into a two-dimensional multi-scale attention network, and a saliency map is output.
2. The method for detecting a saliency target of a multi-scale attention network based on remote sensing images according to claim 1, wherein the size of the input tensor X is 224×224×3, and the sizes of the first to fifth feature tensors are 112×112×16, 56×56×24, 28×28×32, 14×14×96 and 7×7×320, respectively.
3. The method for detecting the saliency target of the multi-scale attention network based on the remote sensing image according to claim 2, wherein the semantic compression operation is performed on the fifth feature tensor: first, a 3×3 depth separable convolution is applied for dimension reduction, yielding a 7×7×64 depth-separated feature map; then, adaptive average pooling is applied for spatial compression, yielding a 5×5×64 compressed feature map.
4. The method for detecting a saliency target of a remote sensing image-based multi-scale attention network of claim 1, wherein the output of the decoder features is:
F_i^s = SMEM_i(cat(UP(F_{i+1}^s), E_i), K), i = 4, 3, 2, 1
the deep coding features are upsampled and combined pairwise with the previous-layer features in the channel dimension for splicing, then processed by the SMEM to obtain the multi-scale features F_i^s; this operation is repeated on F_i^s, and the results are input into the RFM for optimization and correction.
5. The method for detecting a saliency target of a multi-scale attention network based on remote sensing images according to claim 1, wherein the feature tensor F4^s is upsampled by bilinear interpolation and spliced with the decoded feature tensor F3^s to obtain the joint decoding feature tensor F34^s.
6. The method for detecting a saliency target of a multi-scale attention network based on remote sensing images according to claim 1, wherein the feature tensor F3^s is upsampled by bilinear interpolation and spliced with the decoded feature tensor F2^s to obtain the joint decoding feature tensor F23^s.
7. The method for detecting a saliency target of a multi-scale attention network based on remote sensing images according to claim 1, wherein the feature tensor F2^s is upsampled by bilinear interpolation and spliced with the decoded feature tensor F1^s to obtain the joint decoding feature tensor F12^s.
8. The method for detecting a saliency target of a multi-scale attention network based on remote sensing images according to claim 1, wherein the feature tensors F34^s, F23^s and F12^s are input into the RFM for optimization and correction.
9. The method for detecting the saliency target of the multi-scale attention network based on the remote sensing image according to claim 1, wherein after BRAM decoding, the output feature map F^sa of the original size is obtained by upsampling using bilinear interpolation.
CN202311185844.3A 2023-09-14 2023-09-14 Multi-scale attention network saliency target detection method based on remote sensing image Pending CN117237188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311185844.3A CN117237188A (en) 2023-09-14 2023-09-14 Multi-scale attention network saliency target detection method based on remote sensing image


Publications (1)

Publication Number Publication Date
CN117237188A true CN117237188A (en) 2023-12-15

Family

ID=89092400



Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994506A (en) * 2024-04-07 2024-05-07 厦门大学 Remote sensing image saliency target detection method based on dynamic knowledge integration



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination