CN118015332A - Remote sensing image saliency target detection method - Google Patents

Remote sensing image saliency target detection method

Info

Publication number
CN118015332A
CN118015332A (application CN202410008809.2A)
Authority
CN
China
Prior art keywords
saliency
attention
map
branch
convolution
Prior art date
Legal status
Pending
Application number
CN202410008809.2A
Other languages
Chinese (zh)
Inventor
王鑫
张之露
左光汋
李黎
宁晨
Current Assignee
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202410008809.2A priority Critical patent/CN118015332A/en
Publication of CN118015332A publication Critical patent/CN118015332A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image salient object detection method. The method mainly comprises three parts: an attention feature encoder, a three-branch decoder and a dual-supervision loss module. In the encoder, an encoder structure in the form of a cascade pyramid is built, and a global context-aware attention module is designed to guide the encoder to generate attention-enhanced encoding feature maps. In the decoder, three branches are designed: the first branch is a decoding convolution module branch used to generate the original decoding prediction map; the second branch is a dilated reverse attention module branch used to generate a target region saliency map; and the third branch is a fused dense convolution module branch used to generate a target boundary saliency map. Finally, a dual-supervision loss module is designed to produce the final salient object detection result. The method designed by the invention can learn complete and highly discriminative deep features, which helps to accurately detect salient objects in remote sensing images.

Description

Remote sensing image saliency target detection method
Technical Field
The invention belongs to the field of image processing, and particularly relates to a remote sensing image saliency target detection method.
Background
The purpose of salient object detection is to identify the most visually prominent objects or regions in an image and to segment these prominent parts from the background, generating a pixel-level saliency probability map. Visual saliency refers to the property of an object or region to attract attention, depending on the human visual attention mechanism. Currently, salient object detection is widely used in various computer vision tasks, such as instance segmentation, object tracking, person re-identification and image cropping.
Currently, in the remote sensing field, salient object detection on remote sensing images is widely used as a preprocessing technique because of its important practical application value, assisting various downstream vision applications such as change detection, semantic segmentation, object detection and scene classification. Salient object detection has made tremendous progress in recent decades, moving from hand-crafted features to end-to-end deep neural networks. However, these methods basically focus on natural scene salient object detection, and few works address salient object detection in remote sensing images.
Unlike natural scene images captured by handheld cameras, remote sensing images are usually high-angle top views acquired automatically by satellite or airborne sensors outdoors, so there are large differences between remote sensing images and natural scene images, and existing salient object detection methods designed for natural scene images cannot be applied directly to remote sensing images.
Publication CN108629286B discloses a remote sensing airport target detection method based on a subjective-perception saliency model. First, a model saliency map is constructed on the super-pixel-segmented remote sensing image using a latent topic semantic model; then an airport target feature map is computed based on linear density features, the obtained target feature map is fused with the model saliency map to generate a subjective-perception characteristic driving map, and a learning-based saliency map is obtained from the driving map; finally, the learning-based saliency map is fused with the background region to obtain the final airport target saliency map. The method can accurately detect remote sensing airport targets under different sizes and illumination conditions, but complicated manual processing and feature extraction operations must be carried out on the images, so efficient end-to-end application cannot be realized.
Publication CN114241308A discloses a lightweight remote sensing image saliency detection method based on a compression module: the input image is preprocessed to obtain the corresponding saliency information and multi-receptive-field information, this information is output to the compression module, and a lightweight model is constructed from the compression module. The method uses the compression module to compress the image information, reduces the required parameters and builds a lightweight model to improve detection speed. However, the method loses part of the important target detail information while becoming lightweight, so the detection accuracy decreases; meanwhile, it adopts multi-level intermediate feature fusion to acquire richer semantic features, fusing redundant information indiscriminately without considering the semantic relations among different spatial positions, i.e., its acquisition of context information is incomplete.
In summary, the existing remote sensing optical image saliency detection methods have several limitations, mainly as follows:
(1) Remote sensing images generally suffer from background redundancy and excessive interference. Existing methods extract various types of image features by means of different feature extraction algorithms; the process is complicated, the interference of background noise cannot be effectively suppressed, and the extracted remote sensing image features are incomplete.
(2) Salient objects in remote sensing images have complex structures and topologies and wide coverage: the object structures are complex, the textures are diverse, and the structural difference between different parts of the same object is large, so it is difficult to detect the complete object. Existing methods refine the feature information by superimposing and fusing features, but they are likely to learn only redundant features inside the object, cannot effectively complete the extracted features, and reduce algorithm efficiency.
(3) Some salient objects in remote sensing images are too small and the available features are limited; their semantic information appears only in shallower feature maps, and as the network deepens, their detail information may disappear completely, so the objects can be neither well localized nor their structural details recovered.
Disclosure of Invention
The invention aims to: aiming at the problems in the prior art, the invention provides a remote sensing image salient object detection method. The method introduces context information encoding and an attention mechanism to build a global context-aware module that acquires global features and suppresses background interference; it introduces dilated convolution and a reverse attention mechanism to build a dilated reverse attention module that enlarges the receptive field, reduces feature redundancy and completes the feature information; and it introduces a fused dense up-sampling module to recover feature information and capture the structural details of the target location.
The technical scheme is as follows: in order to achieve the purpose of the invention, the technical scheme adopted by the invention is as follows: the remote sensing image salient object detection method comprises a training stage and a testing stage, and specifically comprises the following steps:
(1) Constructing a remote sensing image salient object detection data set, producing the salient pixel-level label corresponding to each input sample, randomly shuffling the data set, and dividing it into a training set Train and a test set Test;
(2) Building the proposed attention-aware three-branch network model, wherein the network has an encoder-decoder structure and comprises three main parts: an attention feature encoder, a three-branch decoder and a dual-supervision loss; the encoder part comprises five encoding convolution modules and five global context-aware attention modules, and the decoder part comprises four decoding convolution modules in the first branch, the dilated convolution part and three reverse attention modules of the dilated reverse attention module in the second branch, and four fused dense up-sampling modules in the third branch;
(3) Inputting the training set to the attention feature encoder part in the step (2), and obtaining an aggregate attention feature map of each image through the global context awareness attention module;
(4) Inputting the aggregate attention feature map obtained in the step (3) to a three-branch decoder part, and obtaining an original decoding feature map through a first branch convolution decoder module;
(5) Inputting the original decoding feature map obtained in the step (4) into the second-branch dilated reverse attention module and the third-branch fused dense up-sampling module to obtain a preliminary target region saliency map and a preliminary target boundary saliency map, respectively;
(6) Performing convolution fusion operation on the preliminary target region saliency map and the preliminary target boundary saliency map obtained in the step (5) and the original decoding feature map obtained in the step (4) to generate a final target saliency region prediction map and a final target saliency boundary prediction map;
(7) Calculating the salient region loss and the boundary enhancement loss respectively from the target salient region prediction graph and the target salient boundary prediction graph obtained in the step (6), and training the network in a double loss supervision mode;
(8) Inputting the test set into the trained network model in the step (2) to obtain a saliency target area prediction graph of each image.
The method for constructing the data set sample set in the step (1) is as follows:
(1.1) Construct the pixel-level label set corresponding to the input remote sensing image data samples: let X = {x_i | i = 1, 2, ..., N} be the input remote sensing image salient object detection data samples and Y = {y_i | i = 1, 2, ..., N} the corresponding pixel-level label set, where y_i ∈ R^I is the label vector, R^I denotes the label dimension space, I is the total number of label categories (two in salient object detection, namely the salient object foreground class and the background class), and N is the total number of training samples;
(1.2) Divide the data set into a training set part Train and a test set part Test: randomly extract m pictures from the remote sensing image salient object detection data set to construct the training set Train = {x_i^train | i = 1, 2, ..., m}, and let the remaining N - m pictures form the test set Test = {x_i^test | i = 1, 2, ..., N - m}, where the subscript i indexes the picture samples.
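As an illustration of step (1), a minimal sketch of the shuffle-and-split described above is given below; it assumes the samples are referenced by string identifiers, and the `make_split` helper and identifier format are hypothetical, not part of the patent.

```python
import random

def make_split(sample_ids, m, seed=0):
    """Randomly shuffle the data set and split it into Train (m samples) and Test (N - m samples)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)   # random shuffling of the data set
    return ids[:m], ids[m:]            # Train, Test

# Example: N = 2000 image/label pairs, m = 1400 training samples (the EORSSD split mentioned later)
all_ids = [f"{i:04d}" for i in range(2000)]
train_ids, test_ids = make_split(all_ids, m=1400)
```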
The structure of the attention-aware three-branch network model constructed in the step (2) is as follows:
(2.1) normalizing each remote sensing scene image in the input section to an RGB image format of size 224 × 224 × 3;
(2.2) in the attention profile encoder section, the first five convolution modules of the VGG16 network are employed as five-layer primary encoder modules of the encoder, with one global context-aware attention module being introduced at each layer of encoder modules;
(2.3) in a first branch portion of the three-branch decoder, constructed primarily of four decoder modules;
(2.4) in the second branch portion of the three-branch decoder, constructed primarily from four dilated convolution layers with different dilation rates and three reverse attention modules;
(2.5) in the third branch portion of the three-branch decoder, consisting essentially of four fused dense upsampling modules;
(2.6) the first and second branches of the three-branch decoder are fused by one convolution module, and the first and third branches of the three-branch decoder are fused by another convolution module.
In the step (3), the method for obtaining the aggregate attention profile of each image is as follows:
(3.1) Denote the five groups of encoder convolution modules as En(l), l = {1,2,3,4,5}, where l is the layer index of the encoder part, and let the side output feature map of the l-th encoder module En(l) be X_l ∈ R^(C_l×H_l×W_l), where R^(·) denotes the dimension space and C_l, H_l and W_l are the channel number, height and width of the l-th layer. Let the pixel spatial correlation map between any two positions in the l-th encoder module be S_l ∈ R^(P_l×P_l), where P_l = H_l×W_l is the number of pixels. S_l is defined as:
S_l = r(norm(X_l))^T ⊗ r(norm(X_l))
where norm(X_l) is the side output feature map after regularization, r(·) is the size conversion operation that turns R^(D_1×D_2×D_3) into R^(D_1×D_23) with D_23 = D_2×D_3, ⊗ is the matrix multiplication operation, and T is the matrix transposition operation.
Then, the pixel spatial correlation map S_l is converted into a global pixel relationship map R_l ∈ R^(P_l×P_l), which measures the relative influence of the i-th pixel on the j-th pixel:
R_l(i,j) = e^(S_l(i,j)) / Σ_k e^(S_l(k,j))
where S_l(i,j) represents the cosine-distance-based feature similarity of the two pixel embedding vectors, the denominator computes the exponentially weighted (Gaussian-weighted) sum of all elements in the j-th column of S_l, and e is the natural constant;
The global pixel relationship map R_l and the reshaped original feature map r(X_l) are combined by the matrix multiplication operation ⊗, and the resulting coding feature map G_l with global context is defined as:
G_l = r^(-1)(r(X_l) ⊗ R_l)
where r^(-1)(·) denotes the inverse of the size conversion operation r(·).
(3.2) Performing pixel-by-pixel multiplication operation on the new global context coding relationship feature map G l and the original feature map X l, and then introducing residual connection to perform feature enhancement, so as to obtain a final global aggregation feature map F l, which is defined as:
F_l = X_l + α·(G_l ⊙ X_l)
where ⊙ denotes element-wise (pixel-by-pixel) multiplication and α denotes a set weight factor.
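A minimal PyTorch-style sketch of the global context computation in (3.1)-(3.2) is given below. It assumes the softmax-style column normalization reconstructed above; the class and variable names are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextAttention(nn.Module):
    """Sketch of the global context step: pixel correlation -> relation map -> residual fusion."""
    def __init__(self, alpha=1.0):
        super().__init__()
        self.alpha = alpha  # weight factor α in F_l = X_l + α·(G_l ⊙ X_l)

    def forward(self, x):                      # x: (B, C, H, W) side output feature map X_l
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)             # size conversion r(·): (B, C, P) with P = H·W
        normed = F.normalize(flat, dim=1)      # regularize so dot products become cosine similarities
        s = torch.bmm(normed.transpose(1, 2), normed)   # S_l: (B, P, P) pixel spatial correlation map
        r = F.softmax(s, dim=1)                # column-wise exponential weighting -> relation map R_l
        g = torch.bmm(flat, r).view(b, c, h, w)          # G_l: coding feature map with global context
        return x + self.alpha * (g * x)        # F_l = X_l + α·(G_l ⊙ X_l)
```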
(3.3) Introduce a cascade pyramid attention mechanism that gradually guides and refines the features and attention cues from coarse to fine. Average pooling and max pooling are performed on the obtained global aggregation feature map F_l along the channel dimension, generating two pooled channel descriptors F_l^avg and F_l^max:
F_l^avg = avepool(F_l),  F_l^max = maxpool(F_l)
where avepool(·) denotes the average pooling operation and maxpool(·) denotes the max pooling operation. The two descriptors are concatenated to obtain a new channel-spatial descriptor, and a convolution then generates the spatial attention map A_l, defined as:
A_l = Att(F_l) = σ(conv(concat(F_l^avg, F_l^max); θ))
where Att(·) denotes the custom convolution attention operation, σ(·) denotes the sigmoid activation function, conv(·; θ) denotes the convolution layer operation, θ are the parameters of the convolution layer, and concat(·) denotes the channel concatenation operation.
(3.4) Multiple downsampling operations ↓ are performed on the original global aggregation feature map F_l together with channel compression, obtaining feature maps of different resolutions that form a feature map pyramid {F_l^1, F_l^2, ..., F_l^K}, where F_l^1 = F_l has the highest resolution and F_l^K the lowest. Starting from the lowest resolution F_l^K, the custom convolution attention operation Att(·) of the above formula is applied to obtain the corresponding attention map A_l^K, and the attention maps of the finer levels are then generated in turn according to:
A_l^k = Att(F_l^k ⊛ (A_l^(k+1)↑)),  k = K-1, ..., 1
where ⊛ denotes channel-by-channel and pixel-by-pixel multiplication, ↑ denotes the upsampling operation, and concat(·) denotes the channel concatenation operation used inside Att(·). Following this procedure, the final aggregate attention map of the l-th layer is obtained, i.e. the attention map A_l^1 corresponding to the highest resolution F_l^1.
(3.5) The aggregate attention map A_l^1 of each layer is combined with the downsampled attention maps A_k^1↓, k < l, of the previous layers: they are concatenated along the channel dimension with concat(·) and then passed through the convolution conv(·; θ) and the Sigmoid activation function σ(·) to generate the final global aggregate attention map A_l^G, defined as:
A_l^G = σ(conv(concat(A_l^1, A_1^1↓, ..., A_(l-1)^1↓); θ))
where ↓ denotes the downsampling operation.
(3.6) The global aggregation feature map F_l is multiplied pixel by pixel with the global aggregate attention map A_l^G, and the final aggregate attention feature map F*_l is obtained through a residual connection, defined as:
F*_l = F_l + F_l ⊙ A_l^G
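The channel-pooling attention Att(·) in (3.3) and the coarse-to-fine pyramid loop in (3.4)-(3.6) can be sketched as follows, assuming bilinear up-sampling, average pooling for the pyramid, and a shared ConvAttention instance per layer; the cross-layer aggregation of (3.5) is omitted for brevity and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttention(nn.Module):
    """Att(·): spatial attention from average- and max-pooled channel descriptors."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):                                   # f: (B, C, H, W)
        avg = f.mean(dim=1, keepdim=True)                   # avepool along the channel dimension
        mx, _ = f.max(dim=1, keepdim=True)                  # maxpool along the channel dimension
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # A_l = σ(conv(concat(·)))

def cascade_pyramid_attention(f, att, levels=3):
    """Coarse-to-fine attention over a downsampled pyramid of the aggregation feature map F_l."""
    pyramid = [f] + [F.avg_pool2d(f, 2 ** k) for k in range(1, levels)]   # {F_l^1, ..., F_l^K}
    a = att(pyramid[-1])                                    # start from the lowest resolution
    for fk in reversed(pyramid[:-1]):
        a_up = F.interpolate(a, size=fk.shape[-2:], mode="bilinear", align_corners=False)
        a = att(fk * a_up)                                  # A_l^k = Att(F_l^k ⊛ A_l^{k+1}↑)
    return f + f * a                                        # residual fusion F*_l = F_l + F_l ⊙ A_l
```

Here `att` is a ConvAttention instance applied at every pyramid level; in the full design the per-layer maps would additionally be concatenated across encoder layers as in (3.5).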
In the step (4), the method for inputting the aggregate attention profile to the first branch convolution decoder module of the decoder to obtain the original decoding profile is as follows:
(4.1) The obtained aggregate attention feature maps F*_l are concatenated along the channel dimension with concat(·) and then passed through convolution conv(·; θ) and sigmoid activation σ(·) operations, yielding the decoding feature maps D_m, m = {1,2,3,4}, where m indexes the decoder modules and θ are the parameters of the convolution layer.
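A sketch of one first-branch decoding convolution module is shown below. The patent does not spell out the exact fusion, so the U-Net-style combination of a layer's aggregate attention feature with the up-sampled deeper decoding feature is an assumption, as are the class name and channel arguments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """First-branch decoding convolution module De(m): concat -> conv -> sigmoid (assumed fusion)."""
    def __init__(self, enc_channels, dec_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(enc_channels + dec_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, f_star, d_deeper):                    # f_star: F*_m, d_deeper: D_{m+1}
        d_up = F.interpolate(d_deeper, size=f_star.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(self.conv(torch.cat([f_star, d_up], dim=1)))   # D_m
```

The deepest decoder input is assumed to be the deepest aggregate attention feature map itself.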
In the step (5), the obtained original decoding feature map is input into the second-branch dilated reverse attention module and the third-branch fused dense up-sampling module, and the methods for obtaining the preliminary target region saliency map and the preliminary target boundary saliency map, respectively, are as follows:
(5.1) Let the original decoding feature maps be D_m, m = {1,2,3,4}, where m indexes the decoder layers, and input them into the second-branch dilated reverse attention module. Starting from the lowest-resolution feature map D_4, a multi-scale dilated convolution structure with four parallel branches and dilation rates r = {1,2,4,6} is applied; the four branch outputs are concatenated along the channel dimension and fed into a 3×3 convolution layer to generate the single-channel global saliency region mask prediction map M_4.
(5.2) Taking M_4 as the guiding prediction map, the Sigmoid activation function followed by an inversion operation is applied to generate a reverse attention map. This reverse attention map is multiplied element by element with the decoding feature map D_3 of the upper layer and passed through a convolution layer, and the saliency region mask prediction map M_3 is obtained through a residual connection. Following the same steps, M_2 and M_1 are obtained in turn. The overall process of obtaining the saliency region mask prediction map M_m of each layer is thus defined as:
M_m = M_(m+1)↑ + conv(D_m ⊙ (1 - σ(M_(m+1)↑)); θ)
where ↑ denotes the up-sampling operation, conv(·; θ) denotes the convolution layer operation with parameters θ, ⊙ denotes element-wise multiplication, σ(·) denotes the sigmoid activation function, and M_1 is the preliminary saliency region prediction map.
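A sketch of the second branch in (5.1)-(5.2) follows: a four-branch multi-scale dilated convolution produces the coarsest mask M_4, and the reverse-attention recursion M_m = M_{m+1}↑ + conv(D_m ⊙ (1 − σ(M_{m+1}↑))) refines it layer by layer. The per-branch channel count and 3×3 kernels inside the dilated branches are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDilated(nn.Module):
    """Four parallel dilated convolutions (r = 1, 2, 4, 6), concatenated and fused to one channel."""
    def __init__(self, in_ch, branch_ch=32, rates=(1, 2, 4, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r) for r in rates])
        self.fuse = nn.Conv2d(branch_ch * len(rates), 1, 3, padding=1)   # 3x3 conv -> M_4

    def forward(self, d4):
        return self.fuse(torch.cat([b(d4) for b in self.branches], dim=1))

def reverse_attention_step(m_deeper, d_m, conv):
    """M_m = M_{m+1}↑ + conv(D_m ⊙ (1 - σ(M_{m+1}↑)))."""
    m_up = F.interpolate(m_deeper, size=d_m.shape[-2:], mode="bilinear", align_corners=False)
    return m_up + conv(d_m * (1.0 - torch.sigmoid(m_up)))
```

Here `conv` stands for a per-layer convolution that maps D_m to a single channel; applying `reverse_attention_step` for m = 3, 2, 1 yields the preliminary saliency region prediction map M_1.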
(5.3) Input the original decoding feature map D_m into the third-branch fused dense up-sampling module, and let the output decoding feature map be U_m ∈ R^(C_m×H_m×W_m), where C_m, H_m and W_m are the channel number, height and width of the m-th decoding layer, and d is the downsampling factor describing the scale between adjacent-level feature maps, defined as:
d = H_(m-1)/H_m = W_(m-1)/W_m
Then a dense convolution operation is used to generate a dense convolution channel feature map with d²·C channels, up-sampling is realized through an array reconstruction operation, and the result is fused with the original decoding feature map D_(m-1) of the shallower layer to generate the updated decoding feature map U_(m-1). The whole process is defined as:
U_(m-1) = D_(m-1) + reshape(conv(U_m; θ))
where reshape(·) denotes the array reconstruction operation that converts R^((d²·C)×H×W) into R^(C×(d·H)×(d·W)), conv(·; θ) denotes the convolution layer operation, and θ are the parameters of the convolution layer. Starting from D_4, the updated decoding feature maps are thus obtained in turn, and the preliminary saliency boundary prediction map D is generated.
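The fused dense up-sampling of (5.3) matches the dense upsampling convolution idea: a convolution produces d²·C channels at low resolution, which are rearranged into a d×-larger feature map and fused with the shallower decoding feature. The sketch below assumes additive fusion and uses PixelShuffle for the array reconstruction; these are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class FusedDenseUpsample(nn.Module):
    """Fused dense up-sampling module FDUP(m): dense conv -> array reconstruction -> fuse with D_{m-1}."""
    def __init__(self, in_ch, out_ch, d=2):
        super().__init__()
        self.dense = nn.Conv2d(in_ch, out_ch * d * d, kernel_size=3, padding=1)  # dense convolution channel map
        self.reshape = nn.PixelShuffle(d)   # reconstruct array: (d²·C, H, W) -> (C, d·H, d·W)

    def forward(self, u_m, d_prev):          # u_m: deeper updated feature, d_prev: D_{m-1}
        up = self.reshape(self.dense(u_m))   # spatial size recovered from the channel dimension
        return d_prev + up                   # assumed additive fusion -> U_{m-1}
```

`d_prev` is assumed to already have `out_ch` channels; in practice a 1×1 convolution could match channels before the addition.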
In the step (6), the method for generating the final target salient region prediction graph and the target salient boundary prediction graph comprises the following steps:
(6.1) Let the original decoding feature maps be D_m, m = {1,2,3,4}, and let M_1 be the preliminary saliency region prediction map obtained in the step (5). M_1 after a 2× up-sampling operation ↑ is fused by convolution with D_1 after a convolution operation, generating the final saliency region prediction map M:
M = M_1↑ + conv(D_1; θ)
wherein conv (·; θ) represents the operation of the convolutional layer, and θ is a parameter of the convolutional layer.
(6.2) Fusing the preliminary saliency boundary prediction graph D obtained in the step (5) with the D 1 subjected to convolution operation to generate a final saliency boundary prediction graph E, wherein the final saliency boundary prediction graph E is defined as:
E = D + conv(D_1; θ)
conv (·; θ) represents the operation of the convolutional layer, θ being a parameter of the convolutional layer.
In the step (7), the method for training the network by performing dual-loss supervision on the obtained saliency region prediction map and saliency boundary prediction map is as follows:
(7.1) Let the saliency region prediction map output in the step (6) be M and the saliency region mask ground-truth map be G_M; they are processed with a class-balanced binary cross-entropy loss function to obtain the saliency region mask cross-entropy loss L_M, defined as:
L_M = -[γ_1·G_M·log(M) + γ_2·(1 - G_M)·log(1 - M)]
where the parameters γ_1 and γ_2 denote the weights, within the total pixels B, of the target foreground pixels B_M and of the background pixels B - B_M in G_M, respectively.
(7.2) Let the saliency boundary prediction map output in the step (6) be E and the saliency boundary ground-truth map be G_E; they are processed with a class-balanced binary cross-entropy loss function to obtain the saliency boundary enhancement loss L_E, defined as:
L_E = -[μ_1·G_E·log(E) + μ_2·(1 - G_E)·log(1 - E)]
where the parameters μ_1 and μ_2 denote the weights of the target foreground pixels and the background pixels in G_E, respectively, within the total pixels.
(7.3) Combining the saliency region mask cross entropy loss L M with the saliency boundary enhancement loss L E to obtain the final saliency total loss L, defined as:
L = η_1·L_M + η_2·L_E
where η_1 and η_2 are the weight factors set for the two losses.
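A sketch of the dual-supervision objective in (7.1)-(7.3) is given below. It assumes the class-balance weights are derived from the foreground/background pixel ratio of each ground-truth map (the exact weighting scheme is not spelled out above) and that the predictions are already sigmoid probabilities.

```python
import torch

def class_balanced_bce(pred, gt, eps=1e-6):
    """Class-balanced binary cross-entropy; weights assumed to be the opposite class's pixel ratio."""
    fg_ratio = gt.mean()                       # |B_M| / |B|
    w_fg, w_bg = 1.0 - fg_ratio, fg_ratio      # γ_1, γ_2 (assumed inverse-frequency weighting)
    pred = pred.clamp(eps, 1.0 - eps)
    loss = -(w_fg * gt * torch.log(pred) + w_bg * (1.0 - gt) * torch.log(1.0 - pred))
    return loss.mean()

def dual_supervision_loss(region_pred, region_gt, boundary_pred, boundary_gt, eta1=1.0, eta2=1.0):
    """L = η_1·L_M + η_2·L_E: region mask loss plus boundary enhancement loss."""
    l_m = class_balanced_bce(region_pred, region_gt)      # saliency region mask cross-entropy loss L_M
    l_e = class_balanced_bce(boundary_pred, boundary_gt)  # saliency boundary enhancement loss L_E
    return eta1 * l_m + eta2 * l_e
```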
In the step (8), the method for obtaining the saliency target area prediction graph of each image comprises the following steps:
(8.1) The images of the test set Test = {x_i^test | i = 1, 2, ..., N - m} are input into the trained network model in the order of their sample indices; the required salient object region prediction map is the final saliency region prediction map M obtained in the step (6).
The beneficial effects are that: by adopting the above technical scheme, the invention has the following beneficial effects:
(1) The invention provides an attention-aware three-branch network model, mainly comprising an encoder and a decoder, which can generate a target region saliency prediction map and a target boundary saliency prediction map respectively; by adopting dual-loss supervision, the network can be trained better, so the model performs well in remote sensing image salient object detection;
(2) A global context-aware attention module is introduced into the encoder part. It first computes the cosine distances of pixel pairs at different positions from the context information to obtain feature similarities, and realizes feature alignment by integrating the point-to-point relations to obtain the corresponding coding feature maps; then, through a cascade pyramid attention framework, it focuses on different visual contents and establishes the correlation between attention maps of different layers to obtain the final global aggregate attention feature map. In this way saliency is effectively enhanced and complete salient objects can be detected;
(3) In the decoder part, a framework combining the decoding convolution modules, the dilated reverse attention module and the fused dense up-sampling module is constructed. The dilated reverse attention module enlarges the receptive field with dilated convolution and uses the reverse attention mechanism to find the missing salient object regions and details layer by layer, thereby generating the target region saliency prediction map; the fused dense up-sampling module compensates the missing image height and width through the channel dimension and forms the target boundary saliency prediction map through up-sampling, which helps to better locate the target, refine structural details and promote the learning of the target region saliency prediction map.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, the technical scheme of the present invention is described in further detail as follows:
(1) The invention provides a novel attention-aware three-branch network model, which has an encoder-decoder structure and comprises three main parts: an attention feature encoder, a three-branch decoder and a dual-supervision loss.
(1.1) Construct the pixel-level label set corresponding to the input remote sensing image data samples: let X = {x_i | i = 1, 2, ..., N} be the input remote sensing image salient object detection data samples and Y = {y_i | i = 1, 2, ..., N} the corresponding pixel-level label set, where y_i ∈ R^I is the label vector, R^I denotes the label dimension space, I is the total number of label categories (two in salient object detection, namely the salient object foreground class and the background class), and N is the total number of training samples. The data sets selected by the invention are the EORSSD and ORSSD remote sensing image saliency detection data sets.
(1.2) Divide the data set into a training set part Train and a test set part Test: randomly extract m pictures from the remote sensing image salient object detection data set to construct the training set Train = {x_i^train | i = 1, 2, ..., m}, and let the remaining N - m pictures form the test set Test = {x_i^test | i = 1, 2, ..., N - m}, where the subscript i indexes the picture samples. In the EORSSD dataset, m = 1400 and N = 2000; in the ORSSD dataset, m = 600 and N = 800.
(1.3) The constructed attention-aware three-branch network model structure is as follows:
(a) In the input section, normalizing each remote sensing scene image to an RGB image format of size 224 × 224 × 3;
(b) In the attention feature encoder part, the first five convolution modules of the VGG16 network are adopted as the five main encoder modules En(l), l = {1,2,3,4,5}, of the encoder, and a global context-aware attention module GCA(l) is introduced at each encoder layer;
(c) In the first branch part of the three-branch decoder, mainly built by four decoder modules De (m), m= {1,2,3,4 };
(d) In the second branch part of the three-branch decoder, mainly built from four dilated convolution layers with dilation rates r = {1,2,4,6} and three reverse attention modules RA(m), m = {1,2,3};
(e) In the third branch part of the three-branch decoder, the three-branch decoder is mainly composed of four fused dense up-sampling modules FDUP (m), m= {1,2,3,4 };
(f) The first branch and the second branch of the three-branch decoder are fused by one convolution module CONV, and the first branch and the third branch of the three-branch decoder are fused by another convolution module CONV.
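To make item (b) of the list above concrete, the following sketch shows how the five-stage VGG16 encoder could be assembled with torchvision (assumed available); the stage split indices and class name are assumptions, and the GCA modules and the three decoder branches described next would attach to the returned side outputs.

```python
import torch.nn as nn
from torchvision.models import vgg16

class AttentionFeatureEncoder(nn.Module):
    """Five VGG16 convolution stages used as En(1..5); each stage output feeds a GCA(l) module."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features
        cuts = [4, 9, 16, 23, 30]            # each slice groups one VGG conv block (pool folded into stages 2-5)
        self.stages = nn.ModuleList()
        prev = 0
        for c in cuts:
            self.stages.append(nn.Sequential(*feats[prev:c]))
            prev = c

    def forward(self, x):                    # x: (B, 3, 224, 224) normalized RGB image
        side_outputs = []
        for stage in self.stages:
            x = stage(x)
            side_outputs.append(x)           # X_1 ... X_5, the side output feature maps
        return side_outputs
```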
(2) In the encoder part, the encoding convolution modules and the global context-aware attention modules are used to capture the relations between pixels at different positions and generate coding feature maps; attention-enhanced coding feature maps are then generated based on the cascade pyramid attention, and the aggregate attention feature map of each image is obtained by fusing them.
(2.1) The five groups of encoder convolution modules are denoted as En(l), l = {1,2,3,4,5}, where l is the layer index of the encoder part. Let the side output feature map of the l-th encoder module En(l) be X_l ∈ R^(C_l×H_l×W_l), where R^(·) denotes the dimension space and C_l, H_l and W_l are the channel number, height and width of the l-th layer. Let the pixel spatial correlation map between any two positions in the l-th encoder module be S_l ∈ R^(P_l×P_l), where P_l = H_l×W_l is the number of pixels. S_l is defined as:
S_l = r(norm(X_l))^T ⊗ r(norm(X_l))
where norm(X_l) is the side output feature map after regularization, r(·) is the size conversion operation that turns R^(D_1×D_2×D_3) into R^(D_1×D_23) with D_23 = D_2×D_3, ⊗ is the matrix multiplication operation, and T is the matrix transposition operation.
Then, the pixel spatial correlation map S_l is converted into a global pixel relationship map R_l ∈ R^(P_l×P_l), which measures the relative influence of the i-th pixel on the j-th pixel:
R_l(i,j) = e^(S_l(i,j)) / Σ_k e^(S_l(k,j))
where S_l(i,j) represents the cosine-distance-based feature similarity of the two pixel embedding vectors, the denominator computes the exponentially weighted (Gaussian-weighted) sum of all elements in the j-th column of S_l, and e is the natural constant;
The global pixel relationship map R_l and the reshaped original feature map r(X_l) are combined by the matrix multiplication operation ⊗, and the resulting coding feature map G_l with global context is defined as:
G_l = r^(-1)(r(X_l) ⊗ R_l)
where r^(-1)(·) denotes the inverse of the size conversion operation r(·).
(2.2) Performing pixel-by-pixel multiplication operation on the new global context coding relationship feature map G l and the original feature map X l, and then introducing residual connection to perform feature enhancement, so as to obtain a final global aggregation feature map F l, which is defined as:
F_l = X_l + α·(G_l ⊙ X_l)
where ⊙ denotes element-wise (pixel-by-pixel) multiplication and α denotes a set weight factor.
(2.3) Introduce a cascade pyramid attention mechanism that gradually guides and refines the features and attention cues from coarse to fine. Average pooling and max pooling are performed on the obtained global aggregation feature map F_l along the channel dimension, generating two pooled channel descriptors F_l^avg and F_l^max:
F_l^avg = avepool(F_l),  F_l^max = maxpool(F_l)
where avepool(·) denotes the average pooling operation and maxpool(·) denotes the max pooling operation. The two descriptors are concatenated to obtain a new channel-spatial descriptor, and a convolution then generates the spatial attention map A_l, defined as:
A_l = Att(F_l) = σ(conv(concat(F_l^avg, F_l^max); θ))
where Att(·) denotes the custom convolution attention operation, σ(·) denotes the sigmoid activation function, conv(·; θ) denotes the convolution layer operation, θ are the parameters of the convolution layer, and concat(·) denotes the channel concatenation operation.
(2.4) Multiple downsampling operations ↓ are performed on the original global aggregation feature map F_l together with channel compression, obtaining feature maps of different resolutions that form a feature map pyramid {F_l^1, F_l^2, ..., F_l^K}, where F_l^1 = F_l has the highest resolution and F_l^K the lowest. Starting from the lowest resolution F_l^K, the custom convolution attention operation Att(·) of the above formula is applied to obtain the corresponding attention map A_l^K, and the attention maps of the finer levels are then generated in turn according to:
A_l^k = Att(F_l^k ⊛ (A_l^(k+1)↑)),  k = K-1, ..., 1
where ⊛ denotes channel-by-channel and pixel-by-pixel multiplication, ↑ denotes the upsampling operation, and concat(·) denotes the channel concatenation operation used inside Att(·). Following this procedure, the final aggregate attention map of the l-th layer is obtained, i.e. the attention map A_l^1 corresponding to the highest resolution F_l^1.
(2.5) The aggregate attention map A_l^1 of each layer is combined with the downsampled attention maps A_k^1↓, k < l, of the previous layers: they are concatenated along the channel dimension with concat(·) and then passed through the convolution conv(·; θ) and the Sigmoid activation function σ(·) to generate the final global aggregate attention map A_l^G, defined as:
A_l^G = σ(conv(concat(A_l^1, A_1^1↓, ..., A_(l-1)^1↓); θ))
where ↓ denotes the downsampling operation.
(2.6) Finally, the global aggregation feature map F_l is multiplied pixel by pixel with the global aggregate attention map A_l^G, and the final aggregate attention feature map F*_l is obtained through a residual connection:
F*_l = F_l + F_l ⊙ A_l^G
(3) In the decoder section, the following three branches are used: the first branch is a traditional decoding convolution module that generates the original decoding prediction map; the second branch is a dilated reverse attention module that generates a global information guidance map through hybrid dilated convolution and then focuses reverse attention on low-response areas from top to bottom, so that overlooked details are detected more effectively, and its result is fused with the first-branch result to generate the final target region saliency map; the third branch is a fused dense convolution module that uses the generated channels to compensate for the missing image size and recover the lost boundary details of the target, and its result is fused with the first-branch result to generate the final target boundary saliency map.
(3.1) The obtained aggregate attention feature maps F*_l are concatenated along the channel dimension with concat(·) and then passed through convolution conv(·; θ) and sigmoid activation σ(·) operations, yielding the decoding feature maps D_m, m = {1,2,3,4}, where m indexes the decoder modules and θ are the parameters of the convolution layer.
(3.2) Let the original decoding feature maps be D_m, m = {1,2,3,4}, where m indexes the decoder layers, and input them into the second-branch dilated reverse attention module. Starting from the lowest-resolution feature map D_4, a multi-scale dilated convolution structure with four parallel branches and dilation rates r = {1,2,4,6} is applied; the four branch outputs are concatenated along the channel dimension and fed into a 3×3 convolution layer to generate the single-channel global saliency region mask prediction map M_4. Taking M_4 as the guiding prediction map, the Sigmoid activation function followed by an inversion operation is applied to generate a reverse attention map; this map is multiplied element by element with the decoding feature map D_3 of the upper layer and passed through a convolution layer, and the saliency region mask prediction map M_3 is obtained through a residual connection. Following the same steps, M_2 and M_1 are obtained in turn. The overall process of obtaining the saliency region mask prediction map M_m of each layer is thus defined as:
M_m = M_(m+1)↑ + conv(D_m ⊙ (1 - σ(M_(m+1)↑)); θ)
where ↑ denotes the up-sampling operation, conv(·; θ) denotes the convolution layer operation with parameters θ, ⊙ denotes element-wise multiplication, σ(·) denotes the sigmoid activation function, and M_1 is the preliminary saliency region prediction map.
(3.3) Input the original decoding feature map D_m into the third-branch fused dense up-sampling module, and let the output decoding feature map be U_m ∈ R^(C_m×H_m×W_m), where C_m, H_m and W_m are the channel number, height and width of the m-th decoding layer, and d is the downsampling factor describing the scale between adjacent-level feature maps, defined as:
d = H_(m-1)/H_m = W_(m-1)/W_m
Then a dense convolution operation is used to generate a dense convolution channel feature map with d²·C channels, up-sampling is realized through an array reconstruction operation, and the result is fused with the original decoding feature map D_(m-1) of the shallower layer to generate the updated decoding feature map U_(m-1). The whole process is defined as:
U_(m-1) = D_(m-1) + reshape(conv(U_m; θ))
where reshape(·) denotes the array reconstruction operation that converts R^((d²·C)×H×W) into R^(C×(d·H)×(d·W)), conv(·; θ) denotes the convolution layer operation, and θ are the parameters of the convolution layer. Starting from D_4, the updated decoding feature maps are thus obtained in turn, and the preliminary saliency boundary prediction map D is generated.
(3.4) Let the preliminary saliency region prediction map be M_1. M_1 after a 2× up-sampling operation ↑ is fused by convolution with D_1 after a convolution operation, generating the final saliency mask prediction map M, defined as:
M = M_1↑ + conv(D_1; θ)
wherein conv (·; θ) represents the operation of the convolutional layer, and θ is a parameter of the convolutional layer.
Fusing the preliminary saliency boundary prediction graph D and the D 1 subjected to convolution operation to generate a final saliency boundary prediction graph E, wherein the definition is as follows:
E = D + conv(D_1; θ)
(4) The network is trained by applying dual-loss supervision to the obtained preliminary saliency region prediction map and preliminary saliency boundary prediction map.
(4.1) Let the output preliminary saliency region prediction map be M and the saliency region mask ground-truth map be G_M; they are processed with a class-balanced binary cross-entropy loss function to obtain the saliency region mask cross-entropy loss L_M, defined as:
L_M = -[γ_1·G_M·log(M) + γ_2·(1 - G_M)·log(1 - M)]
where the parameters γ_1 and γ_2 denote the weights, within the total pixels B, of the target foreground pixels B_M and of the background pixels B - B_M in G_M, respectively.
Let the output preliminary saliency boundary prediction map be E and the saliency boundary ground-truth map be G_E; they are processed with a class-balanced binary cross-entropy loss function to obtain the saliency boundary enhancement loss L_E, defined as:
L_E = -[μ_1·G_E·log(E) + μ_2·(1 - G_E)·log(1 - E)]
where the parameters μ_1 and μ_2 denote the weights of the target foreground pixels and the background pixels in G_E, respectively, within the total pixels.
(4.2) Combining the saliency region mask cross entropy loss L M with the saliency boundary enhancement loss L E to obtain the final saliency total loss L, defined as:
L = η_1·L_M + η_2·L_E
where η_1 and η_2 are the weight factors set for the two losses.
(5) The images of the test set Test = {x_i^test | i = 1, 2, ..., N - m}, where the subscript i indexes the picture samples, are input into the trained network model, and the required salient object region prediction map is the obtained final saliency region prediction map M.
The invention selects an existing remote sensing image salient object detection method for comparison with the proposed method; the selected comparison method is as follows:
The remote sensing image salient object detection method based on a dense attention network, proposed by Zhang et al. in "Dense Attention Fluid Network for Salient Object Detection in Optical Remote Sensing Images" (IEEE Transactions on Image Processing, 2021), is referred to as Method 1.
Table 1 compares the performance of the two methods on the public high-resolution remote sensing salient object detection dataset ORSSD, and Table 2 compares their performance on the public dataset EORSSD. The MAE metric reflects the saliency detection error: the smaller its value, the better the performance. The F-measure metric reflects the saliency detection accuracy: the larger its value, the better the performance. The results show that the proposed method detects salient objects in remote sensing images more accurately.
Table 1 comparison of performance of two methods on ORSSD published datasets
Method MAE metric F-measure metric
Method 1 0.0116 0.9121
The method of the invention 0.0073 0.9470
Table 2 comparison of performance of two methods on EORSSD published datasets
Method MAE metric F-measure metric
Method 1 0.0063 0.8884
The method of the invention 0.0054 0.9105

Claims (9)

1. The remote sensing image saliency target detection method comprises a training stage and a testing stage, and is characterized in that:
(1) Constructing a remote sensing image salient target detection data set, manufacturing a salient pixel level label corresponding to each input sample, randomly disturbing the data set, and dividing the remote sensing image salient target detection data set into a training set Train and a Test set Test;
(2) Constructing an attention-aware three-branch network model, wherein the network has an encoder-decoder structure and comprises three main parts: an attention feature encoder, a three-branch decoder and a dual-supervision loss;
(3) Inputting the training set to the attention feature encoder section in step (2), calculating an aggregate attention feature map for each image by the global context aware attention module;
(4) Inputting the aggregate attention feature map obtained in the step (3) to a three-branch decoder part, and obtaining an original decoding feature map through a first branch convolution decoder module;
(5) Inputting the original decoding feature map obtained in the step (4) into the second-branch dilated reverse attention module and the third-branch fused dense up-sampling module to obtain a preliminary target region saliency map and a preliminary target boundary saliency map, respectively;
(6) Performing convolution fusion operation on the preliminary target region saliency map and the preliminary target boundary saliency map obtained in the step (5) and the original decoding feature map obtained in the step (4) to generate a final target saliency region prediction map and a final target saliency boundary prediction map;
(7) Training the network with dual-loss supervision, using the region saliency loss and the boundary saliency loss computed from the target salient region prediction map and the target salient boundary prediction map obtained in the step (6);
(8) Inputting the test set into the trained network model in the step (2) to obtain a saliency target area prediction graph of each image.
2. The method for detecting the significance target of the remote sensing image according to claim 1, wherein the method for constructing the data set sample set in the step (1) comprises the following steps:
(1.1) constructing the pixel-level label set corresponding to the input remote sensing image data samples: X = {x_i | i = 1, 2, ..., N} is the set of input remote sensing image salient object detection data samples, Y = {y_i | i = 1, 2, ..., N} is the corresponding pixel-level label set, y_i ∈ R^I represents the label vector, R^I represents the label dimension space, I is the total number of label categories, divided into two categories in salient object detection, namely the salient object foreground class and the background class, and N is the total number of training samples;
(1.2) dividing the data set into a training set part Train and a test set part Test: randomly extracting m pictures from the remote sensing image salient object detection data set to construct the training set Train = {x_i^train | i = 1, 2, ..., m}, the remaining N - m pictures forming the test set Test = {x_i^test | i = 1, 2, ..., N - m}, where the subscript i indicates the ordering of the picture samples.
3. The remote sensing image saliency target detection method according to claim 1, wherein in the step (2), the constructed attention-aware three-branch network model structure is as follows:
(2.1) normalizing each remote sensing scene image in the input section to an RGB image format of size 224 × 224 × 3;
(2.2) in the attention profile encoder section, the first five convolution modules of the VGG16 network are employed as five-layer primary encoder modules of the encoder, with one global context-aware attention module being introduced at each layer of encoder modules;
(2.3) in the first branch portion of the three-branch decoder, constructed primarily from four decoder modules;
(2.4) in the second branch part of the three-branch decoder, mainly constructed from four dilated convolution layers with different dilation rates and three reverse attention modules;
(2.5) in the third branch portion of the three-branch decoder, consisting essentially of four fused dense upsampling modules;
(2.6) the first and second branches of the three-branch decoder are fused by one convolution module, and the first and third branches of the three-branch decoder are fused by another convolution module.
4. The method for detecting a salient object of a remote sensing image according to claim 1, wherein in the step (3), the method for calculating the aggregate attention profile of each image through the global context aware attention module is as follows:
(3.1) denoting the five groups of encoder convolution modules as En(l), l = {1,2,3,4,5}, where l is the layer index of the encoder part, and letting the side output feature map of the l-th encoder module En(l) be X_l ∈ R^(C_l×H_l×W_l), where R^(·) denotes the dimension space and C_l, H_l, W_l are the channel number, height and width of the l-th layer; the pixel spatial correlation map between any two positions in the l-th encoder module is S_l ∈ R^(P_l×P_l), where P_l = H_l×W_l is the number of pixels, and S_l is defined as:
S_l = r(norm(X_l))^T ⊗ r(norm(X_l))
where norm(X_l) is the side output feature map after regularization, r(·) is the size conversion operation that turns R^(D_1×D_2×D_3) into R^(D_1×D_23) with D_23 = D_2×D_3, ⊗ is the matrix multiplication operation, and T is the matrix transposition operation;
Then, the pixel spatial correlation map S_l is converted into a global pixel relationship map R_l ∈ R^(P_l×P_l), which measures the relative influence of the i-th pixel on the j-th pixel:
R_l(i,j) = e^(S_l(i,j)) / Σ_k e^(S_l(k,j))
where S_l(i,j) represents the cosine-distance-based feature similarity of the two pixel embedding vectors, the denominator computes the exponentially weighted (Gaussian-weighted) sum of all elements in the j-th column of S_l, and e is the natural constant;
The global pixel relationship map R_l and the reshaped original feature map r(X_l) are combined by the matrix multiplication operation ⊗, and the coding feature map G_l with global context is obtained as follows:
G_l = r^(-1)(r(X_l) ⊗ R_l)
where r^(-1)(·) denotes the inverse of the size conversion operation r(·);
(3.2) performing pixel-by-pixel multiplication operation on the new global context coding relationship feature map G l and the original feature map X l, and then introducing residual connection to perform feature enhancement, so as to obtain a final global aggregation feature map F l, which is defined as:
F_l = X_l + α·(G_l ⊙ X_l)
where ⊙ denotes element-wise (pixel-by-pixel) multiplication and α represents the set weight factor;
(3.3) introducing a cascade pyramid attention mechanism that gradually guides and refines the features and attention cues from coarse to fine: average pooling and max pooling are performed on the obtained global aggregation feature map F_l along the channel dimension, generating two pooled channel descriptors F_l^avg and F_l^max:
F_l^avg = avepool(F_l),  F_l^max = maxpool(F_l)
where avepool(·) denotes the average pooling operation and maxpool(·) denotes the max pooling operation; the two descriptors are concatenated to obtain a new channel-spatial descriptor, and a convolution then generates the spatial attention map A_l, defined as:
A_l = Att(F_l) = σ(conv(concat(F_l^avg, F_l^max); θ))
where Att(·) denotes the custom convolution attention operation, σ(·) denotes the sigmoid activation function, conv(·; θ) denotes the convolution layer operation, θ are the parameters of the convolution layer, and concat(·) denotes the channel concatenation operation;
(3.4) performing multiple downsampling operations ↓ on the original global aggregation feature map F_l together with channel compression, obtaining feature maps of different resolutions that form a feature map pyramid {F_l^1, F_l^2, ..., F_l^K}, where F_l^1 = F_l has the highest resolution and F_l^K the lowest; starting from the lowest resolution F_l^K, the custom convolution attention operation Att(·) of the above formula is applied to obtain the corresponding attention map A_l^K, and the attention maps of the finer levels are generated in turn according to:
A_l^k = Att(F_l^k ⊛ (A_l^(k+1)↑)),  k = K-1, ..., 1
where ⊛ denotes channel-by-channel and pixel-by-pixel multiplication, ↑ denotes the upsampling operation, and concat(·) denotes the channel concatenation operation used inside Att(·); following this procedure, the final aggregate attention map of the l-th layer is obtained, i.e. the attention map A_l^1 corresponding to the highest resolution F_l^1;
(3.5) combining the aggregate attention map A_l^1 of each layer with the downsampled attention maps A_k^1↓, k < l, of the previous layers: they are concatenated along the channel dimension with concat(·) and then passed through the convolution conv(·; θ) and the Sigmoid activation function σ(·) to generate the final global aggregate attention map A_l^G, defined as:
A_l^G = σ(conv(concat(A_l^1, A_1^1↓, ..., A_(l-1)^1↓); θ))
where ↓ denotes the downsampling operation;
(3.6) multiplying the global aggregation feature map F_l pixel by pixel with the global aggregate attention map A_l^G, and obtaining the final aggregate attention feature map F*_l through a residual connection, defined as:
F*_l = F_l + F_l ⊙ A_l^G.
5. The method for detecting a salient object of a remote sensing image according to claim 1, wherein in the step (4), the method for inputting the aggregate attention profile to the first branch convolution decoder module of the decoder to obtain the original decoding profile is as follows:
(4.1) the obtained aggregate attention feature maps F*_l are concatenated along the channel dimension with concat(·) and then passed through convolution conv(·; θ) and sigmoid activation σ(·) operations, yielding the decoding feature maps D_m, m = {1,2,3,4}, where m indexes the decoder modules and θ are the parameters of the convolution layer.
6. The method for detecting the salient object of the remote sensing image according to claim 1, wherein in the step (5), the obtained original decoding feature map is input into the second-branch dilated reverse attention module and the third-branch fused dense up-sampling module, and the methods for obtaining the preliminary target region saliency map and the preliminary target boundary saliency map, respectively, are as follows:
(5.1) letting the original decoding feature maps be D_m, m = {1,2,3,4}, where m indexes the decoder layers, and inputting them into the second-branch dilated reverse attention module: starting from the lowest-resolution feature map D_4, a multi-scale dilated convolution structure with four parallel branches and dilation rates r = {1,2,4,6} is applied, and the four branch outputs are concatenated along the channel dimension and fed into a 3×3 convolution layer to generate the single-channel global saliency region mask prediction map M_4;
(5.2) taking M_4 as the guiding prediction map, applying the Sigmoid activation function followed by an inversion operation to generate a reverse attention map, multiplying this map element by element with the decoding feature map D_3 of the upper layer and passing it through a convolution layer, obtaining the saliency region mask prediction map M_3 through a residual connection, and obtaining M_2 and M_1 in turn according to the same steps, so that the overall process of obtaining the saliency region mask prediction map M_m of each layer is:
M_m = M_(m+1)↑ + conv(D_m ⊙ (1 - σ(M_(m+1)↑)); θ)
where ↑ denotes the up-sampling operation, conv(·; θ) denotes the convolution layer operation with parameters θ, ⊙ denotes element-wise multiplication, σ(·) denotes the sigmoid activation function, and M_1 is the preliminary saliency region prediction map;
(5.3) inputting the original decoding feature map D_m into the third-branch fused dense up-sampling module, letting the output decoding feature map be U_m ∈ R^(C_m×H_m×W_m), where C_m, H_m, W_m are the channel number, height and width of the m-th decoding layer, and d is the downsampling factor describing the scale between adjacent-level feature maps, defined as:
d = H_(m-1)/H_m = W_(m-1)/W_m
then using a dense convolution operation to generate a dense convolution channel feature map with d²·C channels, realizing up-sampling through an array reconstruction operation, and fusing the result with the original decoding feature map D_(m-1) of the shallower layer to generate the updated decoding feature map U_(m-1), the whole process being defined as:
U_(m-1) = D_(m-1) + reshape(conv(U_m; θ))
where reshape(·) denotes the array reconstruction operation that converts R^((d²·C)×H×W) into R^(C×(d·H)×(d·W)), conv(·; θ) denotes the convolution layer operation, and θ are the parameters of the convolution layer; starting from D_4, the updated decoding feature maps are thus obtained in turn, and the preliminary saliency boundary prediction map D is generated.
7. The method for detecting the saliency target of the remote sensing image according to claim 1, in the step (6), the method for generating the final target saliency region prediction map and the target saliency boundary prediction map comprises the following steps:
(6.1) letting the original decoding feature maps be D_m, m = {1,2,3,4}, and letting M_1 be the preliminary saliency region prediction map obtained in the step (5): M_1 after a 2× up-sampling operation ↑ is fused by convolution with D_1 after a convolution operation, generating the final saliency region prediction map M:
M = M_1↑ + conv(D_1; θ)
wherein conv (·; θ) represents the operation of the convolutional layer, θ being a parameter of the convolutional layer;
(6.2) fusing the preliminary saliency boundary prediction graph D obtained in the step (5) with the D 1 subjected to convolution operation to generate a final saliency boundary prediction graph E, wherein the final saliency boundary prediction graph E is defined as:
E = D + conv(D_1; θ)
wherein conv (·; θ) represents the operation of the convolutional layer, and θ is a parameter of the convolutional layer.
8. The method for performing double-loss supervision on the obtained preliminary saliency region prediction graph and the preliminary saliency boundary prediction graph to train a network in the step (7) comprises the following steps:
(7.1) letting the saliency region prediction map output in the step (6) be M and the saliency region mask ground-truth map be G_M, and processing them with a class-balanced binary cross-entropy loss function to obtain the saliency region mask cross-entropy loss L_M, defined as:
L_M = -[γ_1·G_M·log(M) + γ_2·(1 - G_M)·log(1 - M)]
where the parameters γ_1 and γ_2 denote the weights, within the total pixels B, of the target foreground pixels B_M and of the background pixels B - B_M in G_M, respectively;
(7.2) letting the saliency boundary prediction map output in the step (6) be E and the saliency boundary ground-truth map be G_E, and processing them with a class-balanced binary cross-entropy loss function to obtain the saliency boundary enhancement loss L_E, defined as:
L_E = -[μ_1·G_E·log(E) + μ_2·(1 - G_E)·log(1 - E)]
where the parameters μ_1 and μ_2 denote the weights of the target foreground pixels and the background pixels in G_E, respectively, within the total pixels;
(7.3) combining the saliency region mask cross entropy loss L M with the saliency boundary enhancement loss L E to obtain the final saliency total loss L, defined as:
L = η_1·L_M + η_2·L_E
where η_1 and η_2 are the weight factors set for the two losses.
9. The method for detecting the salient object of the remote sensing image according to claim 1, wherein in the step (8), the method for obtaining the predicted map of the salient object area of each image comprises the following steps:
(8.1) The images of the test set Test = {x_i^test | i = 1, 2, ..., N - m} are input into the trained network model in the order of their sample indices; the required salient object region prediction map is the final saliency region prediction map M obtained in the step (6).
CN202410008809.2A 2024-01-03 2024-01-03 Remote sensing image saliency target detection method Pending CN118015332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410008809.2A CN118015332A (en) 2024-01-03 2024-01-03 Remote sensing image saliency target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410008809.2A CN118015332A (en) 2024-01-03 2024-01-03 Remote sensing image saliency target detection method

Publications (1)

Publication Number Publication Date
CN118015332A true CN118015332A (en) 2024-05-10

Family

ID=90959664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410008809.2A Pending CN118015332A (en) 2024-01-03 2024-01-03 Remote sensing image saliency target detection method

Country Status (1)

Country Link
CN (1) CN118015332A (en)

Similar Documents

Publication Publication Date Title
CN112287940B (en) Semantic segmentation method of attention mechanism based on deep learning
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN111401384B (en) Transformer equipment defect image matching method
CN111738363B (en) Alzheimer disease classification method based on improved 3D CNN network
CN110930378B (en) Emphysema image processing method and system based on low data demand
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN111161271A (en) Ultrasonic image segmentation method
CN110599502A (en) Skin lesion segmentation method based on deep learning
CN115641473A (en) Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN114463340B (en) Agile remote sensing image semantic segmentation method guided by edge information
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN116612283A (en) Image semantic segmentation method based on large convolution kernel backbone network
CN117475216A (en) Hyperspectral and laser radar data fusion classification method based on AGLT network
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN114581789A (en) Hyperspectral image classification method and system
CN116258914B (en) Remote Sensing Image Classification Method Based on Machine Learning and Local and Global Feature Fusion
CN116486183B (en) SAR image building area classification method based on multiple attention weight fusion characteristics
CN118015332A (en) Remote sensing image saliency target detection method
CN115100409A (en) Video portrait segmentation algorithm based on twin network
CN114926691A (en) Insect pest intelligent identification method and system based on convolutional neural network
CN113192076A (en) MRI brain tumor image segmentation method combining classification prediction and multi-scale feature extraction
CN117095277A (en) Edge-guided multi-attention RGBD underwater salient object detection method
CN118351538A (en) Remote sensing image road segmentation method combining channel attention mechanism and multilayer axial transducer feature fusion structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination