CN115063685B - Remote sensing image building feature extraction method based on attention network - Google Patents

Remote sensing image building feature extraction method based on attention network

Info

Publication number
CN115063685B
CN115063685B (application CN202210810000.2A)
Authority
CN
China
Prior art keywords
size
module
attention
channel
convolution
Prior art date
Legal status
Active
Application number
CN202210810000.2A
Other languages
Chinese (zh)
Other versions
CN115063685A (en)
Inventor
周亚男
汪顺营
冯莉
杨先增
Current Assignee
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202210810000.2A priority Critical patent/CN115063685B/en
Publication of CN115063685A publication Critical patent/CN115063685A/en
Application granted granted Critical
Publication of CN115063685B publication Critical patent/CN115063685B/en


Classifications

    • G06V 20/176 — Scenes; scene-specific elements; terrestrial scenes; urban or other man-made structures
    • G06N 3/02 — Computing arrangements based on biological models; neural networks
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
    • G06V 10/806 — Fusion, i.e. combining data from various sources, of extracted features
    • G06V 10/82 — Image or video recognition or understanding using neural networks


Abstract

The invention discloses a remote sensing image building feature extraction method based on an attention network, which comprises the following steps: acquiring a remote sensing image building picture containing ground feature elements and preprocessing it to obtain a preprocessed picture; inputting the preprocessed picture into an attention network, in which a regular stream comprises 5 convolution blocks of identical structure and outputs a binary semantic map F_s, while the input of the shape stream is the output feature maps of the 5 convolution blocks of the regular stream; the binary maps F_s and F_e obtained from the regular stream and the shape stream respectively are concatenated along the channel dimension and input into a fusion module, and after convolutional downsampling and upsampling operations the final output is a target binary semantic map of size H×W. This output is loss-supervised by the semantic labels used in the regular stream, so that a binary semantic map with clear edges is obtained. The invention improves the traditional attention mechanism, so that a smaller network structure can be adopted to improve the extraction effect for buildings in remote sensing images.

Description

Remote sensing image building feature extraction method based on attention network
Technical Field
The invention relates to the technical field of feature extraction, in particular to a remote sensing image building feature extraction method based on an attention network.
Background
Existing remote sensing image building extraction methods suffer from excessive parameters and high GPU memory occupation in the self-attention mechanism.
Disclosure of Invention
The invention aims to: in order to overcome the defects of the prior art, the invention provides a remote sensing image building feature extraction method based on an attention network that solves the above technical problems.
The technical scheme is as follows: the invention provides a remote sensing image building feature extraction method based on an attention network, which comprises the following steps:
firstly, obtaining a remote sensing image building picture containing ground feature elements, and preprocessing it to obtain a preprocessed picture;
secondly, inputting the preprocessed picture of size C×H×W into an attention network comprising a regular stream, a shape stream and a fusion module, wherein the regular stream comprises 5 convolution blocks of identical structure and its output feature map is a binary semantic map F_s; the inputs of the shape stream are the output feature maps of the 5 convolution blocks of the regular stream, of sizes C×H/2^i×W/2^i with 0≤i≤4, which are upsampled to C×H×W and input into 4 serially connected gated convolution layers (GCLs) to obtain a binary edge map F_e of size 1×H×W;
finally, the binary maps F_s and F_e obtained from the regular stream and the shape stream respectively are concatenated along the channel dimension and input into the fusion module; after convolutional downsampling and upsampling operations, the final output is a target binary semantic map of size H×W, and this output is loss-supervised by the semantic labels used in the regular stream, so that a binary semantic map with clear edges is obtained.
Further, the method comprises the steps of:
Each convolution block comprises an encoder, an attention part and a decoder, and the regular stream processes the image through the following steps:
(1) Inputting the preprocessed picture into an encoder, wherein the encoder comprises 5 coding layers, and each of the four coding layers after the first halves the size of the feature map: when the input feature map has size C×H×W, it still has size C1×H×W after the first coding layer, size C2×H/2×W/2 after the second, size C3×H/2^2×W/2^2 after the third, size C4×H/2^3×W/2^3 after the fourth, and size C5×H/2^4×W/2^4 after the fifth; after the five coding layers, the spatial size becomes 1/16 of the original, giving the encoded feature map F_e;
(2) The attention part comprises a channel attention module, a position attention module and a superposition module. First, a 1×1 convolution reduces the number of channels of the feature map F_e from C5 to C' while keeping the spatial dimensions unchanged, giving a feature map F_i of size C'×H/2^4×W/2^4, which is input into the channel attention module to obtain a feature map F_o1 of the same size as F_i; the channel attention module processes the input feature map using feature-channel squeeze and excitation;
(3) The feature map F_e of size C5×H/2^4×W/2^4 is likewise first passed through a 1×1 convolution that reduces the number of channels from C5 to C' while keeping the spatial dimensions unchanged, giving the feature map F_i (the same F_i as in step (2); the two attention branches of steps (2) and (3) are parallel), which is then input into the position attention module to obtain a feature map F_o2 of the same size as F_i; the position attention module uses the Criss-Cross Attention Module, stacked twice, to obtain the spatial global context information of the pixels;
(4) The feature maps F_o1 and F_o2 are input into the superposition module, which fuses F_o1 and F_o2 using a short-cut structure to finally obtain a feature map of size 3C'×H/2^4×W/2^4; the superposition module superposes the different input feature maps along the channel dimension;
(5) The feature map obtained in step (4) is upsampled by the decoder block to a binary semantic map F_s of size 1×H×W.
Further, the method comprises the steps of:
the shape flow carries out loss supervision on the output binary edge graph by the edge label.
Further, the method comprises the steps of:
the channel attention module specifically comprises:
the extrusion is performed in space dimension, and C' H/2 is compressed 4 W/2 4 Feature map F of (1) i Performing global average pooling operation, obtaining a scalar for each channel, outputting the scalar as C ' multiplied by 1, exciting the obtained C ' multiplied by 1 characteristic diagram, sending the characteristic diagram into a two-layer fully connected neural network, keeping the size unchanged, and obtaining weights M between 0 and 1 of C ' through a Sigmoid function c As the respective weights of the C 'channels, the weights are multiplied by each element of the corresponding channel respectively to realize the enhancement of important features and the attenuation of unimportant features, so that the directivity of the extracted features is stronger, and the output of the module is C' x H/2 4 ×W/2 4 Feature map F of (1) o1
Further, the method comprises the steps of:
the position attention module specifically comprises:
for convolution, a size of C'. Times.H/2 is obtained 4 ×W/2 4 Feature map F of (1) i The characteristic graphs Q, K, V are obtained by respectively carrying out 1×1 convolution, wherein the magnitudes of the characteristic graphs Q and K are C' ×H/2 4 ×W/2 4 The size of the characteristic diagram V is C'. Times.H/2 4 ×W/2 4 Then, affinicy operation is carried out on Q and K, namely, a channel vector Q is obtained at any pixel point u on Q u The total number of the pixels is H, W, the shape is C' multiplied by 1, and at the same time, a characteristic vector K is obtained at all positions of the pixel points u on K and Q in the same row and column u 、V u With K u 、V u ∈[(H/2 4 +W/2 4 -1)×C’],K i,u 、V i,u Represent K u 、V u The channel vector of the ith pixel is C' ×1×1, for Q u And K u Vector multiplication operation is carried out to obtain a vector D i,u Then, a softmax layer is applied to the channel dimension to obtain attention graph A, wherein the attention graph A represents the correlation degree of any pixel point and other pixels points in the same row and the same column; then re-combining A i,u And V is equal to i,u The matrix multiplication is carried out, and the contextual information of any pixel point and the horizontal and vertical directions can be captured through the operation; the connection between a pixel and its surrounding pixels not in the cross-path is still lost, and global context information can be collected from all pixels of a given image at each location by two CCAM module overlap operations, the output of the module being of size C' x H/2 4 ×W/2 4 Feature map F of (1) o2
Further, the method comprises the steps of:
the superposition module comprises a Short-Cut structure, when the original characteristics are input into the module, the channel dimension reduction processing is performed by using 1X 1 convolution, and then the output results of the two modules and the Short-Cut structure are connected in series on the channel scale according to the proportion of 1:1:1, so that the output of the module is formed together.
The beneficial effects are that: compared with the prior art, the invention has the remarkable advantage that it improves the traditional attention mechanism, so that a smaller network structure can be adopted to improve the extraction effect for buildings in remote sensing images.
Drawings
Fig. 1 is a schematic diagram of the lightweight attention network LANet according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the dual-stream lightweight attention network DLANet according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the fusion module according to an embodiment of the invention.
Detailed Description
The technical scheme of the invention is described in detail below.
As shown in fig. 1, the invention provides a dual-stream lightweight attention network for remote sensing image building extraction, which aims to acquire the context information of complicated ground feature elements in remote sensing images from the perspectives of space and channel, to solve the problems of excessive parameters and high GPU memory occupation in the traditional self-attention mechanism, and to achieve a smaller network structure with a better extraction effect.
The present invention includes two networks. One is the lightweight attention network LANet, whose detailed architecture is shown in FIG. 1. The other is the dual-stream lightweight attention network DLANet, which is composed of a regular stream, a shape stream and a fusion module, as shown in FIG. 2. The dual-stream LANet (DLANet) takes LANet with halved network width as its regular stream and combines it with a shape stream in one network structure, realizing parallel processing of texture information, spectral information, shape information and semantic information. Finally, an encoder-decoder sub-network fuses the semantic information and the edge information, further adjusting the shape of the extraction result and filtering noise, so that a more accurate extraction result is obtained; the network finally obtained in this way is DLANet. Using LANet as a component of the regular stream reduces the number of parameters, which is why the network width is halved. Moreover, the method uses both channel and position attention, and can find key, important information in the two dimensions of spatial position and channel, thereby enhancing the network's processing of semantic information.
The lightweight attention network LANet, shown in the upper left part of FIG. 2 (module 5), serves as the regular stream of DLANet. LANet is subdivided into an encoder block, an intermediate attention part (mid) and a decoder block. The input of the shape stream is the output feature maps of the regular stream, i.e. of the LANet encoder block. The binary semantic map output by the regular stream and the binary edge map output by the shape stream are input into the fusion module; after convolutional downsampling and upsampling operations, the final output is a target binary semantic map of size H×W, realizing parallel processing of texture, spectral, shape and semantic information. This output is loss-supervised by the semantic labels used in the regular stream, yielding a binary semantic map with clear edges.
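To make the data flow concrete, the following PyTorch-style sketch shows how the two streams and the fusion module could be wired together. It is an illustrative reconstruction under stated assumptions, not the patented implementation: RegularStream, ShapeStream and FusionModule are hypothetical placeholders for the components detailed below, and the assumed interface (the regular stream returning both its semantic map and its five block outputs) simply mirrors the description.

```python
import torch
import torch.nn as nn

class DLANet(nn.Module):
    """Schematic dual-stream wiring: regular stream -> F_s, shape stream -> F_e,
    channel concatenation, then the fusion module. A sketch, not reference code."""
    def __init__(self, regular_stream: nn.Module, shape_stream: nn.Module,
                 fusion_module: nn.Module):
        super().__init__()
        self.regular = regular_stream   # LANet with halved network width
        self.shape = shape_stream       # residual modules + 4 gated conv layers
        self.fusion = fusion_module     # encoder-decoder fusion sub-network

    def forward(self, x):                   # x: (B, C, H, W)
        f_s, block_feats = self.regular(x)  # F_s: (B, 1, H, W) + 5 block outputs
        f_e = self.shape(block_feats)       # F_e: (B, 1, H, W) binary edge map
        out = self.fusion(f_s, f_e)         # target binary semantic map (B, 1, H, W)
        return out, f_s, f_e                # all three maps are loss-supervised
```

During training, out and f_s would be supervised by the semantic labels and f_e by the edge labels, matching the supervision scheme described in this specification.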
Based on the above network structure, the invention provides a remote sensing image building feature extraction method based on an attention network, comprising the following steps:
firstly, obtaining a remote sensing image building picture containing ground feature elements, and preprocessing it to obtain a preprocessed picture;
secondly, the preprocessed picture of size C×H×W is input into the attention network, which comprises a regular stream, a shape stream and a fusion module; the regular stream comprises 5 convolution blocks of identical structure, and its output feature map is a binary semantic map F_s. The shape stream uses the gated convolution layers and local supervision of Gated-SCNN to filter shape information from the feature maps; it is a combination of a series of residual modules and gated convolution layers (GCLs), its labels are obtained by edge extraction from the original sample labels, and its inputs are the output of the first layer of the regular stream and the gradients at different levels.
The GCL performs pixel-by-pixel dot multiplication of two feature maps in order to filter out information other than edges; four GCLs connect the second to fifth layers of the regular-stream encoder so as to process multi-scale shape information;
the input of the shape stream is the output characteristic diagram of 5 convolution blocks of the conventional stream, and the sizes are CHW 1/2 respectively i The up-sampled CHW is input into 4 gating convolution layers GCL connected in series to obtain a binary edge graph F with the size of 1HW e I is more than or equal to 0 and less than or equal to 4; the stream is loss-supervised by the edge labels for the output binary edge map.
Taking the first gated convolution layer (GCL) as an example, its inputs are the output feature map F_level1 of the first convolution block in the LANet encoder block and the upsampled output feature map F_level2 of the second convolution block. F_level1 first passes through a 1×1 convolution layer and a residual convolution layer; it is then channel-concatenated with F_level2, and a 1×1 convolution followed by a Sigmoid activation produces an edge-information probability map of size C×H×W. This map is dot-multiplied with F_level1 to obtain a feature map of size C×H×W. The other three GCLs perform similar operations, finally yielding a binary edge map of size 1×H×W.
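A minimal sketch of one such gated convolution layer, assuming PyTorch; the channel count and the exact form of the residual convolution layer are assumptions, while the gating pattern (1×1 convolution plus Sigmoid on the channel concatenation, then pixel-wise multiplication) follows the description above.

```python
import torch
import torch.nn as nn

class GatedConvLayer(nn.Module):
    """Sketch of the first GCL described above: a Sigmoid edge-probability
    map gates the shape-stream feature pixel by pixel."""
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, kernel_size=1)           # 1x1 convolution layer
        self.res = nn.Conv2d(c, c, kernel_size=3, padding=1)  # residual convolution
        self.gate = nn.Sequential(
            nn.Conv2d(2 * c, c, kernel_size=1),               # 1x1 conv on the concat
            nn.Sigmoid(),                                     # edge probability map
        )

    def forward(self, f_level1, f_level2_up):
        # f_level1: shape-stream feature; f_level2_up: upsampled regular-stream
        # block output; both (B, C, H, W)
        f = self.conv1(f_level1)
        f = f + self.res(f)                                   # residual conv layer
        alpha = self.gate(torch.cat([f, f_level2_up], dim=1))
        return f * alpha                                      # keep edge evidence only
```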
Finally, the binary maps F_s and F_e obtained from the regular stream and the shape stream respectively are concatenated along the channel dimension and input into the fusion module. As shown in FIG. 3, after convolutional downsampling and upsampling operations, the final output is a target binary semantic map of size H×W. This output is loss-supervised by the semantic labels used in the regular stream, yielding a binary semantic map with clear edges, while the shape stream's output binary edge map is loss-supervised by edge labels.
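The fusion module of FIG. 3 can be sketched as follows, assuming PyTorch. Only the overall pattern is taken from the text — channel concatenation of F_s and F_e, convolutional downsampling, then upsampling back to H×W; the hidden width and the number of downsampling stages are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Sketch of the encoder-decoder fusion of F_s and F_e (FIG. 3)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(2, hidden, kernel_size=3, stride=2, padding=1),       # H/2 x W/2
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),  # H/4 x W/4
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, f_s, f_e):             # each (B, 1, H, W)
        x = torch.cat([f_s, f_e], dim=1)     # channel connection -> (B, 2, H, W)
        y = self.down(x)                     # convolutional downsampling
        y = F.interpolate(y, size=x.shape[-2:], mode='bilinear',
                          align_corners=False)   # upsample back to H x W
        return self.head(y)                  # target binary semantic map (B, 1, H, W)
```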
In this embodiment, each convolution block of the regular stream includes an encoder, an attention part and a decoder, and the regular stream processes the image through the following steps:
the coding block uses a classical convolution pooling component to perform downsampling operation, the final obtained characteristic diagram is 1/16 of the original diagram, the decoding block adopts a lightweight decoder structure, namely, a quadruple upsampling operation is used for replacing common secondary upsampling operation, and meanwhile, low-level characteristics are connected, so that information loss in the downsampling process is recovered.
(1) Inputting the preprocessed picture into an encoder, wherein the encoder comprises 5 coding layers, and each of the four coding layers after the first halves the size of the feature map: when the input feature map has size C×H×W, it still has size C1×H×W after the first coding layer, size C2×H/2×W/2 after the second, size C3×H/2^2×W/2^2 after the third, size C4×H/2^3×W/2^3 after the fourth, and size C5×H/2^4×W/2^4 after the fifth; after the five coding layers, the spatial size becomes 1/16 of the original, giving the encoded feature map F_e;
(2) The attention part comprises a channel attention module, a position attention module and a superposition module. First, a 1×1 convolution reduces the number of channels of the feature map F_e from C5 to C' while keeping the spatial dimensions unchanged, giving a feature map F_i of size C'×H/2^4×W/2^4, which is input into the channel attention module to obtain a feature map F_o1 of the same size as F_i; the channel attention module processes the input feature map using feature-channel squeeze and excitation;
(3) The feature map F_e of size C5×H/2^4×W/2^4 is likewise first passed through a 1×1 convolution that reduces the number of channels from C5 to C' while keeping the spatial dimensions unchanged, giving the feature map F_i (the same F_i as in step (2); the two attention branches of steps (2) and (3) are parallel), which is then input into the position attention module to obtain a feature map F_o2 of the same size as F_i; the position attention module uses the Criss-Cross Attention Module, stacked twice, to obtain the spatial global context information of the pixels;
(4) The feature maps F_o1 and F_o2 are input into the superposition module, which fuses F_o1 and F_o2 using a short-cut structure to finally obtain a feature map of size 3C'×H/2^4×W/2^4; the superposition module superposes the different input feature maps along the channel dimension;
(5) The feature map obtained in step (4) is upsampled by the decoder block to a binary semantic map F_s of size 1×H×W.
The channel attention module specifically comprises: squeezing is performed along the spatial dimensions: a global average pooling operation is applied to the feature map F_i of size C'×H/2^4×W/2^4, yielding one scalar per channel, i.e. an output of size C'×1×1; for excitation, the resulting C'×1×1 feature map is fed into a two-layer fully connected neural network that keeps its size unchanged, and a Sigmoid function produces C' weights M_c between 0 and 1, one per channel; each weight is multiplied with every element of its corresponding channel, enhancing important features and attenuating unimportant ones so that the extracted features have stronger directivity; the output of the module is a feature map F_o1 of size C'×H/2^4×W/2^4.
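A minimal sketch of this channel attention in PyTorch. Following the text, the two fully connected layers keep the C' size unchanged (a classic squeeze-and-excitation block would insert a channel-reduction ratio here):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation as described: per-channel global average
    pooling, a two-layer fully connected network, and Sigmoid weights M_c
    that rescale each channel."""
    def __init__(self, c_prime):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: one scalar per channel
        self.fc = nn.Sequential(                 # excitation, C' -> C' -> C'
            nn.Linear(c_prime, c_prime),
            nn.ReLU(inplace=True),
            nn.Linear(c_prime, c_prime),
            nn.Sigmoid(),                        # weights M_c in (0, 1)
        )

    def forward(self, f_i):                      # f_i: (B, C', H/16, W/16)
        b, c, _, _ = f_i.shape
        m_c = self.fc(self.pool(f_i).view(b, c)) # (B, C')
        return f_i * m_c.view(b, c, 1, 1)        # F_o1: important channels enhanced
```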
The position attention module specifically comprises: the feature map F_i of size C'×H/2^4×W/2^4 is passed through three separate 1×1 convolutions to obtain feature maps Q, K and V, where Q and K have size C'×H/2^4×W/2^4 and V has size C'×H/2^4×W/2^4. An affinity operation is then carried out on Q and K: at any pixel u on Q, a channel vector Q_u of shape C'×1×1 is obtained, one for each spatial position; at the same time, feature vectors K_u and V_u are collected from all positions in the same row and column as pixel u on K and V, with K_u, V_u ∈ [(H/2^4+W/2^4-1)×C'], where K_{i,u} and V_{i,u} denote the channel vector of the i-th pixel in K_u and V_u, each of shape C'×1×1. A vector multiplication of Q_u and K_{i,u} gives the vector D_{i,u}, and a softmax layer applied along the channel dimension yields the attention map A, which represents the degree of correlation between any pixel and the other pixels in the same row and column. A_{i,u} is then matrix-multiplied with V_{i,u}; this operation captures the context information of any pixel in the horizontal and vertical directions. The connection between a pixel and the surrounding pixels not on its criss-cross path is still missing, however; by superposing two CCAM modules, global context information can be collected from all pixels of the given image at each position. The output of the module is a feature map F_o2 of size C'×H/2^4×W/2^4.
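A sketch of one criss-cross attention pass in PyTorch, following the common CCNet formulation of the affinity, softmax and aggregation steps described above. The learnable residual weight gamma and the -inf masking of the vertical diagonal (so that a pixel's own position is not counted twice) are CCNet conventions rather than details stated in this patent; Q and K keep the full C' channels, as the text specifies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """One criss-cross pass: each pixel attends to the pixels in its own
    row and column (H + W - 1 positions)."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c, kernel_size=1)
        self.k = nn.Conv2d(c, c, kernel_size=1)
        self.v = nn.Conv2d(c, c, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))     # residual weight (assumption)

    def forward(self, x):                             # x: (B, C', H, W)
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # vertical (same-column) affinities: one (H, H) matrix per column
        q_v = q.permute(0, 3, 2, 1).reshape(b * w, h, c)
        k_v = k.permute(0, 3, 2, 1).reshape(b * w, h, c)
        diag = torch.diag(torch.full((h,), float('-inf'), device=x.device))
        e_v = torch.bmm(q_v, k_v.transpose(1, 2)) + diag
        # horizontal (same-row) affinities: one (W, W) matrix per row
        q_h = q.permute(0, 2, 3, 1).reshape(b * h, w, c)
        k_h = k.permute(0, 2, 3, 1).reshape(b * h, w, c)
        e_h = torch.bmm(q_h, k_h.transpose(1, 2))
        # joint softmax over the criss-cross candidates -> attention map A
        e = torch.cat([e_v.view(b, w, h, h).permute(0, 2, 1, 3),
                       e_h.view(b, h, w, w)], dim=3)  # (B, H, W, H+W)
        a = F.softmax(e, dim=3)
        a_v = a[..., :h].permute(0, 2, 1, 3).reshape(b * w, h, h)
        a_h = a[..., h:].reshape(b * h, w, w)
        # aggregate V along each column and each row
        v_v = v.permute(0, 3, 2, 1).reshape(b * w, h, c)
        v_h = v.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out_v = torch.bmm(a_v, v_v).view(b, w, h, c).permute(0, 3, 2, 1)
        out_h = torch.bmm(a_h, v_h).view(b, h, w, c).permute(0, 3, 1, 2)
        return self.gamma * (out_v + out_h) + x       # same size as the input
```

Superposing the module twice, e.g. f_o2 = cca(cca(f_i)), lets every position collect context from all pixels of the image, as described.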
The superposition module comprises a Short-Cut structure: when the original feature is input into the module, it first undergoes channel dimensionality reduction by a 1×1 convolution, and then the output results of the two attention modules and the Short-Cut branch are concatenated along the channel dimension in a 1:1:1 ratio to jointly form the output of the module.
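This superposition can be sketched directly; whether the short-cut branch starts from the C5-channel feature F_e or from F_i is not fixed by the text, so the input channel count is left as a parameter here:

```python
import torch
import torch.nn as nn

class SuperpositionModule(nn.Module):
    """Short-Cut superposition: the original feature is reduced to C' channels
    by a 1x1 convolution, then concatenated with the two attention outputs in
    a 1:1:1 channel ratio, giving 3C' channels."""
    def __init__(self, c_in, c_prime):
        super().__init__()
        self.shortcut = nn.Conv2d(c_in, c_prime, kernel_size=1)  # channel reduction

    def forward(self, f_orig, f_o1, f_o2):
        # f_orig: (B, c_in, h, w); f_o1, f_o2: (B, C', h, w)
        s = self.shortcut(f_orig)
        return torch.cat([f_o1, f_o2, s], dim=1)   # 1:1:1 concat -> (B, 3C', h, w)
```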
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims and the equivalents thereof, the present invention is also intended to include such modifications and variations.

Claims (4)

1. A remote sensing image building feature extraction method based on an attention network, characterized by comprising the following steps:
firstly, obtaining a remote sensing image building picture containing ground feature elements, and preprocessing it to obtain a preprocessed picture;
secondly, inputting the preprocessed picture of size C×H×W into an attention network, wherein the attention network comprises a regular stream, a shape stream and a fusion module, the regular stream comprises 5 convolution blocks of identical structure, and the output feature map of the regular stream is a binary semantic map F_s; the inputs of the shape stream are the output feature maps of the 5 convolution blocks of the regular stream, of sizes C×H/2^i×W/2^i with 0≤i≤4, which are upsampled to C×H×W and input into 4 serially connected gated convolution layers GCL to obtain a binary edge map F_e of size 1×H×W;
finally, the binary maps F_s and F_e obtained from the regular stream and the shape stream respectively are concatenated along the channel dimension and input into the fusion module; after convolutional downsampling and upsampling operations, the final output is a target binary semantic map of size H×W, and this output is loss-supervised by the semantic labels used in the regular stream, so that a binary semantic map with clear edges is obtained;
each convolution block comprises an encoder, an attention part and a decoder, and the regular stream processes the image through the following steps:
(1) Inputting the preprocessed picture into an encoder, wherein the encoder comprises 5 coding layers, and each of the four coding layers after the first halves the size of the feature map: when the input feature map has size C×H×W, it still has size C1×H×W after the first coding layer, size C2×H/2×W/2 after the second, size C3×H/2^2×W/2^2 after the third, size C4×H/2^3×W/2^3 after the fourth, and size C5×H/2^4×W/2^4 after the fifth; after the five coding layers, the spatial size becomes 1/16 of the original, giving the encoded feature map F_e;
(2) The attention part comprises a channel attention module, a position attention module and a superposition module. First, a 1×1 convolution reduces the number of channels of the feature map F_e from C5 to C' while keeping the spatial dimensions unchanged, giving a feature map F_i of size C'×H/2^4×W/2^4, which is input into the channel attention module to obtain a feature map F_o1 of the same size as F_i; the channel attention module processes the input feature map using feature-channel squeeze and excitation;
(3) The feature map F_e of size C5×H/2^4×W/2^4 is likewise first passed through a 1×1 convolution that reduces the number of channels from C5 to C' while keeping the spatial dimensions unchanged, giving the feature map F_i (the same F_i as in step (2); the two attention branches of steps (2) and (3) are parallel), which is then input into the position attention module to obtain a feature map F_o2 of the same size as F_i; the position attention module uses the Criss-Cross Attention Module, stacked twice, to obtain the spatial global context information of the pixels;
(4) The feature maps F_o1 and F_o2 are input into the superposition module, which fuses F_o1 and F_o2 using a short-cut structure to finally obtain a feature map of size 3C'×H/2^4×W/2^4; the superposition module superposes the different input feature maps along the channel dimension;
(5) The feature map obtained in step (4) is upsampled by the decoder block to a binary semantic map F_s of size 1×H×W;
The position attention module specifically comprises:
for convolution, a size C is obtained ×H/2 4 ×W/2 4 Is characterized by (a)F i The characteristic graphs Q, K, V are obtained by respectively carrying out 1×1 convolution, wherein the magnitudes of the characteristic graphs Q and K are C ×H/2 4 ×W/2 4 The size of the characteristic diagram V is C ×H/2 4 ×W/2 4 Then, affinicy operation is carried out on Q and K, namely, a channel vector Q is obtained at any pixel point u on Q u Sharing HW is C in shape X 1, and at the same time, obtaining a feature vector K at all positions of the same row and column of the pixel points u on K and Q u 、V u With K u 、V u ∈[(H/2 4 +W/2 4 -1) ×C ],K i,u 、V i,u Represent K u 、V u Upper firstiChannel vector of each pixel, shape C X 1, pair Q u And K u Vector multiplication operation is carried out to obtain a vector D i,u Then, a softmax layer is applied to the channel dimension to obtain attention graph A, wherein the attention graph A represents the correlation degree of any pixel point and other pixels points in the same row and the same column; then re-combining A i,u And V is equal to i,u The matrix multiplication is carried out, and the contextual information of any pixel point and the horizontal and vertical directions can be captured through the operation; but one imageThe connection between a pixel and its surrounding pixels not in the cross path is still lost, and global context information can be collected from all pixels of a given image at each location by two CCAM module superposition operations, the output of the module being of size C ×H/2 4 ×W/2 4 Feature map F of (1) o2
2. The remote sensing image building feature extraction method based on the attention network according to claim 1, wherein the shape stream performs loss supervision on the output binary edge map using edge labels.
3. The remote sensing image building feature extraction method based on the attention network according to claim 1, wherein the channel attention module specifically comprises:
the extrusion is performed in space dimension, and C' H/2 is compressed 4 W/2 4 Feature map F of (1) i Performing global average pooling operation, obtaining a scalar for each channel, outputting the scalar as C ' multiplied by 1, exciting the obtained C ' multiplied by 1 characteristic diagram, sending the characteristic diagram into a two-layer fully connected neural network, keeping the size unchanged, and obtaining weights M between 0 and 1 of C ' through a Sigmoid function c As the respective weights of the C 'channels, the weights are multiplied by each element of the corresponding channel respectively to realize the enhancement of important features and the attenuation of unimportant features, so that the directivity of the extracted features is stronger, and the output of the module is C' x H/2 4 ×W/2 4 Feature map F of (1) o1
4. The remote sensing image building feature extraction method based on the attention network according to claim 1, wherein the superposition module comprises a Short-Cut structure: when the original feature is input into the module, it first undergoes channel dimensionality reduction by a 1×1 convolution, and then the output results of the two attention modules and the Short-Cut structure are concatenated along the channel dimension in a 1:1:1 ratio to jointly form the output of the module.
CN202210810000.2A 2022-07-11 2022-07-11 Remote sensing image building feature extraction method based on attention network Active CN115063685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210810000.2A CN115063685B (en) 2022-07-11 2022-07-11 Remote sensing image building feature extraction method based on attention network


Publications (2)

Publication Number Publication Date
CN115063685A CN115063685A (en) 2022-09-16
CN115063685B (en) 2023-10-03

Family

ID=83205953


Country Status (1)

CN — CN115063685B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
CN114022785A (en) * 2021-11-15 2022-02-08 中国华能集团清洁能源技术研究院有限公司 Remote sensing image semantic segmentation method, system, equipment and storage medium
WO2022073452A1 (en) * 2020-10-07 2022-04-14 武汉大学 Hyperspectral remote sensing image classification method based on self-attention context network
CN114495210A (en) * 2022-01-07 2022-05-13 中北大学南通智能光机电研究院 Posture change face recognition method based on attention mechanism


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Takikawa, Towaki; Acuna, David; et al. "Gated-SCNN: Gated Shape CNNs for Semantic Segmentation." arXiv:1907.05740v1 [cs.CV], pp. 1-10. *
Ding, Lei; Tang, Hao; et al. "LANet: Local Attention Embedding to Improve the Semantic Segmentation of Remote Sensing Images." IEEE Transactions on Geoscience and Remote Sensing, pp. 1-10. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant