CN115063685B - Remote sensing image building feature extraction method based on attention network - Google Patents

Remote sensing image building feature extraction method based on attention network

Info

Publication number
CN115063685B
CN115063685B (application CN202210810000.2A)
Authority
CN
China
Prior art keywords
size
module
attention
channel
convolution
Prior art date
Legal status
Active
Application number
CN202210810000.2A
Other languages
Chinese (zh)
Other versions
CN115063685A (en)
Inventor
周亚男
汪顺营
冯莉
杨先增
Current Assignee
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202210810000.2A priority Critical patent/CN115063685B/en
Publication of CN115063685A publication Critical patent/CN115063685A/en
Application granted granted Critical
Publication of CN115063685B publication Critical patent/CN115063685B/en


Classifications

    • G06V 20/176 — Scenes; scene-specific elements; terrestrial scenes; urban or other man-made structures
    • G06N 3/02 — Computing arrangements based on biological models; neural networks
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
    • G06V 10/806 — Fusion, i.e. combining data from various sources, of extracted features
    • G06V 10/82 — Image or video recognition or understanding using neural networks


Abstract

The invention discloses a remote sensing image building feature extraction method based on an attention network, which comprises the following steps: acquiring a remote sensing image building picture containing ground feature elements and preprocessing it to obtain a preprocessed picture; inputting the preprocessed picture into an attention network, in which a regular stream comprises 5 convolution blocks of identical structure and outputs a binary semantic map F_s, while the input of the shape stream is the output feature maps of the 5 convolution blocks of the regular stream; the binary maps F_s and F_e obtained from the regular stream and the shape stream respectively are concatenated along the channel dimension and input into a fusion module, and after convolutional downsampling and upsampling operations the final output is a target binary semantic map of size H×W. This output is loss-supervised by the semantic labels used in the regular stream, so that a binary semantic map with clear edges is obtained. The invention improves the traditional attention mechanism, so that a smaller network structure can be adopted to improve the extraction effect for buildings in remote sensing images.

Description

Remote sensing image building feature extraction method based on attention network
Technical Field
The invention relates to the technical field of feature extraction, in particular to a remote sensing image building feature extraction method based on an attention network.
Background
Existing remote sensing image building extraction methods suffer from excessive parameters and high GPU memory occupation in the self-attention mechanism.
Disclosure of Invention
The invention aims to: in order to overcome the defects of the prior art, the invention provides a remote sensing image building feature extraction method based on an attention network that solves the above technical problems.
The technical scheme is as follows: the invention provides a remote sensing image building feature extraction method based on an attention network, which comprises the following steps:
firstly, obtaining a remote sensing image building picture containing ground feature elements, and preprocessing it to obtain a preprocessed picture;
secondly, inputting the preprocessed picture of size C×H×W into an attention network comprising a regular stream, a shape stream and a fusion module, wherein the regular stream comprises 5 convolution blocks of identical structure and its output feature map is a binary semantic map F_s; the inputs of the shape stream are the output feature maps of the 5 convolution blocks of the regular stream, of sizes C×H/2^i×W/2^i with 0≤i≤4, which are upsampled to C×H×W and input into 4 serially connected gated convolution layers (GCLs) to obtain a binary edge map F_e of size 1×H×W;
finally, the binary maps F_s and F_e obtained from the regular stream and the shape stream respectively are concatenated along the channel dimension and input into the fusion module; after convolutional downsampling and upsampling operations, the final output is a target binary semantic map of size H×W, and this output is loss-supervised by the semantic labels used in the regular stream, so that a binary semantic map with clear edges is obtained.
Further, the method comprises the steps of:
Each convolution block comprises an encoder, an attention part and a decoder, and the regular stream processes the image through the following steps:
(1) Inputting the preprocessed picture into an encoder, wherein the encoder comprises 5 coding layers, and each of the four coding layers after the first halves the size of the feature map: when the input feature map has size C×H×W, it still has size C1×H×W after the first coding layer, size C2×H/2×W/2 after the second, size C3×H/2^2×W/2^2 after the third, size C4×H/2^3×W/2^3 after the fourth, and size C5×H/2^4×W/2^4 after the fifth; after the five coding layers, the spatial size becomes 1/16 of the original, giving the encoded feature map F_e;
(2) The attention part comprises a channel attention module, a position attention module and a superposition module. First, a 1×1 convolution reduces the number of channels of the feature map F_e from C5 to C' while keeping the spatial dimensions unchanged, giving a feature map F_i of size C'×H/2^4×W/2^4, which is input into the channel attention module to obtain a feature map F_o1 of the same size as F_i; the channel attention module processes the input feature map using feature-channel squeeze and excitation;
(3) The feature map F_e of size C5×H/2^4×W/2^4 is likewise first passed through a 1×1 convolution that reduces the number of channels from C5 to C' while keeping the spatial dimensions unchanged, giving the feature map F_i (the same F_i as in step (2); the two attention branches of steps (2) and (3) are parallel), which is then input into the position attention module to obtain a feature map F_o2 of the same size as F_i; the position attention module uses the Criss-Cross Attention Module, stacked twice, to obtain the spatial global context information of the pixels;
(4) The feature maps F_o1 and F_o2 are input into the superposition module, which fuses F_o1 and F_o2 using a short-cut structure to finally obtain a feature map of size 3C'×H/2^4×W/2^4; the superposition module superposes the different input feature maps along the channel dimension;
(5) The feature map obtained in step (4) is upsampled by the decoder block to a binary semantic map F_s of size 1×H×W.
Further, the method comprises the steps of:
the shape flow carries out loss supervision on the output binary edge graph by the edge label.
Further, the method comprises the steps of:
the channel attention module specifically comprises:
the extrusion is performed in space dimension, and C' H/2 is compressed 4 W/2 4 Feature map F of (1) i Performing global average pooling operation, obtaining a scalar for each channel, outputting the scalar as C ' multiplied by 1, exciting the obtained C ' multiplied by 1 characteristic diagram, sending the characteristic diagram into a two-layer fully connected neural network, keeping the size unchanged, and obtaining weights M between 0 and 1 of C ' through a Sigmoid function c As the respective weights of the C 'channels, the weights are multiplied by each element of the corresponding channel respectively to realize the enhancement of important features and the attenuation of unimportant features, so that the directivity of the extracted features is stronger, and the output of the module is C' x H/2 4 ×W/2 4 Feature map F of (1) o1
Further, the method comprises the steps of:
the position attention module specifically comprises:
for convolution, a size of C'. Times.H/2 is obtained 4 ×W/2 4 Feature map F of (1) i The characteristic graphs Q, K, V are obtained by respectively carrying out 1×1 convolution, wherein the magnitudes of the characteristic graphs Q and K are C' ×H/2 4 ×W/2 4 The size of the characteristic diagram V is C'. Times.H/2 4 ×W/2 4 Then, affinicy operation is carried out on Q and K, namely, a channel vector Q is obtained at any pixel point u on Q u The total number of the pixels is H, W, the shape is C' multiplied by 1, and at the same time, a characteristic vector K is obtained at all positions of the pixel points u on K and Q in the same row and column u 、V u With K u 、V u ∈[(H/2 4 +W/2 4 -1)×C’],K i,u 、V i,u Represent K u 、V u The channel vector of the ith pixel is C' ×1×1, for Q u And K u Vector multiplication operation is carried out to obtain a vector D i,u Then, a softmax layer is applied to the channel dimension to obtain attention graph A, wherein the attention graph A represents the correlation degree of any pixel point and other pixels points in the same row and the same column; then re-combining A i,u And V is equal to i,u The matrix multiplication is carried out, and the contextual information of any pixel point and the horizontal and vertical directions can be captured through the operation; the connection between a pixel and its surrounding pixels not in the cross-path is still lost, and global context information can be collected from all pixels of a given image at each location by two CCAM module overlap operations, the output of the module being of size C' x H/2 4 ×W/2 4 Feature map F of (1) o2
Further, the method comprises the steps of:
the superposition module comprises a Short-Cut structure, when the original characteristics are input into the module, the channel dimension reduction processing is performed by using 1X 1 convolution, and then the output results of the two modules and the Short-Cut structure are connected in series on the channel scale according to the proportion of 1:1:1, so that the output of the module is formed together.
The beneficial effects are that: compared with the prior art, the invention has the remarkable advantage that it improves the traditional attention mechanism, so that a smaller network structure can be adopted to improve the extraction effect for buildings in remote sensing images.
Drawings
Fig. 1 is a schematic diagram of the lightweight attention network LANet according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the dual-stream lightweight attention network DLANet according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the fusion module according to an embodiment of the invention.
Detailed Description
The technical scheme of the invention is described in detail below.
As shown in fig. 1, the invention provides a dual-stream lightweight attention network for remote sensing image building extraction, which aims to acquire the context information of complicated ground feature elements in remote sensing images from the perspectives of space and channel, to solve the problems of excessive parameters and high GPU memory occupation in the traditional self-attention mechanism, and to achieve a smaller network structure with a better extraction effect.
The present invention includes two networks. One is the lightweight attention network LANet, whose detailed architecture is shown in FIG. 1. The other is the dual-stream lightweight attention network DLANet, which is composed of a regular stream, a shape stream and a fusion module, as shown in FIG. 2. The dual-stream LANet (DLANet) takes LANet with halved network width as its regular stream and combines it with a shape stream in one network structure, realizing parallel processing of texture information, spectral information, shape information and semantic information. Finally, an encoder-decoder sub-network fuses the semantic information and the edge information, further adjusting the shape of the extraction result and filtering noise, so that a more accurate extraction result is obtained; the network finally obtained in this way is DLANet. Using LANet as a component of the regular stream reduces the number of parameters, which is why the network width is halved. Moreover, the method uses both channel and position attention, and can find key, important information in the two dimensions of spatial position and channel, thereby enhancing the network's processing of semantic information.
The lightweight attention network LANet, shown in the upper left part of FIG. 2 (module 5), serves as the regular stream of DLANet. LANet is subdivided into an encoder block, an intermediate attention part (mid) and a decoder block. The input of the shape stream is the output feature maps of the regular stream, i.e. of the LANet encoder block. The binary semantic map output by the regular stream and the binary edge map output by the shape stream are input into the fusion module; after convolutional downsampling and upsampling operations, the final output is a target binary semantic map of size H×W, realizing parallel processing of texture, spectral, shape and semantic information. This output is loss-supervised by the semantic labels used in the regular stream, yielding a binary semantic map with clear edges.
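To make the data flow concrete, the following PyTorch-style sketch shows how the two streams and the fusion module could be wired together. It is an illustrative reconstruction under stated assumptions, not the patented implementation: RegularStream, ShapeStream and FusionModule are hypothetical placeholders for the components detailed below, and the assumed interface (the regular stream returning both its semantic map and its five block outputs) simply mirrors the description.

```python
import torch
import torch.nn as nn

class DLANet(nn.Module):
    """Schematic dual-stream wiring: regular stream -> F_s, shape stream -> F_e,
    channel concatenation, then the fusion module. A sketch, not reference code."""
    def __init__(self, regular_stream: nn.Module, shape_stream: nn.Module,
                 fusion_module: nn.Module):
        super().__init__()
        self.regular = regular_stream   # LANet with halved network width
        self.shape = shape_stream       # residual modules + 4 gated conv layers
        self.fusion = fusion_module     # encoder-decoder fusion sub-network

    def forward(self, x):                   # x: (B, C, H, W)
        f_s, block_feats = self.regular(x)  # F_s: (B, 1, H, W) + 5 block outputs
        f_e = self.shape(block_feats)       # F_e: (B, 1, H, W) binary edge map
        out = self.fusion(f_s, f_e)         # target binary semantic map (B, 1, H, W)
        return out, f_s, f_e                # all three maps are loss-supervised
```

During training, out and f_s would be supervised by the semantic labels and f_e by the edge labels, matching the supervision scheme described in this specification.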
Based on the above network structure, the invention provides a remote sensing image building feature extraction method based on an attention network, comprising the following steps:
firstly, obtaining a remote sensing image building picture containing ground feature elements, and preprocessing it to obtain a preprocessed picture;
secondly, the preprocessed picture of size C×H×W is input into the attention network, which comprises a regular stream, a shape stream and a fusion module; the regular stream comprises 5 convolution blocks of identical structure, and its output feature map is a binary semantic map F_s. The shape stream uses the gated convolution layers and local supervision of Gated-SCNN to filter shape information from the feature maps; it is a combination of a series of residual modules and gated convolution layers (GCLs), its labels are obtained by edge extraction from the original sample labels, and its inputs are the output of the first layer of the regular stream and the gradients at different levels.
The GCL performs pixel-by-pixel dot multiplication of two feature maps in order to filter out information other than edges; four GCLs connect the second to fifth layers of the regular-stream encoder so as to process multi-scale shape information;
the input of the shape stream is the output characteristic diagram of 5 convolution blocks of the conventional stream, and the sizes are CHW 1/2 respectively i The up-sampled CHW is input into 4 gating convolution layers GCL connected in series to obtain a binary edge graph F with the size of 1HW e I is more than or equal to 0 and less than or equal to 4; the stream is loss-supervised by the edge labels for the output binary edge map.
Taking the first gated convolution layer (GCL) as an example, its inputs are the output feature map F_level1 of the first convolution block in the LANet encoder block and the upsampled output feature map F_level2 of the second convolution block. F_level1 first passes through a 1×1 convolution layer and a residual convolution layer; it is then channel-concatenated with F_level2, and a 1×1 convolution followed by a Sigmoid activation produces an edge-information probability map of size C×H×W. This map is dot-multiplied with F_level1 to obtain a feature map of size C×H×W. The other three GCLs perform similar operations, finally yielding a binary edge map of size 1×H×W.
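A minimal sketch of one such gated convolution layer, assuming PyTorch; the channel count and the exact form of the residual convolution layer are assumptions, while the gating pattern (1×1 convolution plus Sigmoid on the channel concatenation, then pixel-wise multiplication) follows the description above.

```python
import torch
import torch.nn as nn

class GatedConvLayer(nn.Module):
    """Sketch of the first GCL described above: a Sigmoid edge-probability
    map gates the shape-stream feature pixel by pixel."""
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, kernel_size=1)           # 1x1 convolution layer
        self.res = nn.Conv2d(c, c, kernel_size=3, padding=1)  # residual convolution
        self.gate = nn.Sequential(
            nn.Conv2d(2 * c, c, kernel_size=1),               # 1x1 conv on the concat
            nn.Sigmoid(),                                     # edge probability map
        )

    def forward(self, f_level1, f_level2_up):
        # f_level1: shape-stream feature; f_level2_up: upsampled regular-stream
        # block output; both (B, C, H, W)
        f = self.conv1(f_level1)
        f = f + self.res(f)                                   # residual conv layer
        alpha = self.gate(torch.cat([f, f_level2_up], dim=1))
        return f * alpha                                      # keep edge evidence only
```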
Finally, the binary maps F_s and F_e obtained from the regular stream and the shape stream respectively are concatenated along the channel dimension and input into the fusion module. As shown in FIG. 3, after convolutional downsampling and upsampling operations, the final output is a target binary semantic map of size H×W. This output is loss-supervised by the semantic labels used in the regular stream, yielding a binary semantic map with clear edges, while the shape stream's output binary edge map is loss-supervised by edge labels.
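The fusion module of FIG. 3 can be sketched as follows, assuming PyTorch. Only the overall pattern is taken from the text — channel concatenation of F_s and F_e, convolutional downsampling, then upsampling back to H×W; the hidden width and the number of downsampling stages are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Sketch of the encoder-decoder fusion of F_s and F_e (FIG. 3)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(2, hidden, kernel_size=3, stride=2, padding=1),       # H/2 x W/2
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),  # H/4 x W/4
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, f_s, f_e):             # each (B, 1, H, W)
        x = torch.cat([f_s, f_e], dim=1)     # channel connection -> (B, 2, H, W)
        y = self.down(x)                     # convolutional downsampling
        y = F.interpolate(y, size=x.shape[-2:], mode='bilinear',
                          align_corners=False)   # upsample back to H x W
        return self.head(y)                  # target binary semantic map (B, 1, H, W)
```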
In this embodiment, each convolution block of the regular stream includes an encoder, an attention part and a decoder, and the regular stream processes the image through the following steps:
the coding block uses a classical convolution pooling component to perform downsampling operation, the final obtained characteristic diagram is 1/16 of the original diagram, the decoding block adopts a lightweight decoder structure, namely, a quadruple upsampling operation is used for replacing common secondary upsampling operation, and meanwhile, low-level characteristics are connected, so that information loss in the downsampling process is recovered.
(1) Inputting the preprocessed picture into an encoder, wherein the encoder comprises 5 coding layers, and each of the four coding layers after the first halves the size of the feature map: when the input feature map has size C×H×W, it still has size C1×H×W after the first coding layer, size C2×H/2×W/2 after the second, size C3×H/2^2×W/2^2 after the third, size C4×H/2^3×W/2^3 after the fourth, and size C5×H/2^4×W/2^4 after the fifth; after the five coding layers, the spatial size becomes 1/16 of the original, giving the encoded feature map F_e;
(2) The attention part comprises a channel attention module, a position attention module and a superposition module. First, a 1×1 convolution reduces the number of channels of the feature map F_e from C5 to C' while keeping the spatial dimensions unchanged, giving a feature map F_i of size C'×H/2^4×W/2^4, which is input into the channel attention module to obtain a feature map F_o1 of the same size as F_i; the channel attention module processes the input feature map using feature-channel squeeze and excitation;
(3) The feature map F_e of size C5×H/2^4×W/2^4 is likewise first passed through a 1×1 convolution that reduces the number of channels from C5 to C' while keeping the spatial dimensions unchanged, giving the feature map F_i (the same F_i as in step (2); the two attention branches of steps (2) and (3) are parallel), which is then input into the position attention module to obtain a feature map F_o2 of the same size as F_i; the position attention module uses the Criss-Cross Attention Module, stacked twice, to obtain the spatial global context information of the pixels;
(4) The feature maps F_o1 and F_o2 are input into the superposition module, which fuses F_o1 and F_o2 using a short-cut structure to finally obtain a feature map of size 3C'×H/2^4×W/2^4; the superposition module superposes the different input feature maps along the channel dimension;
(5) The feature map obtained in step (4) is upsampled by the decoder block to a binary semantic map F_s of size 1×H×W.
The channel attention module specifically comprises: squeezing is performed along the spatial dimensions: a global average pooling operation is applied to the feature map F_i of size C'×H/2^4×W/2^4, yielding one scalar per channel, i.e. an output of size C'×1×1; for excitation, the resulting C'×1×1 feature map is fed into a two-layer fully connected neural network that keeps its size unchanged, and a Sigmoid function produces C' weights M_c between 0 and 1, one per channel; each weight is multiplied with every element of its corresponding channel, enhancing important features and attenuating unimportant ones so that the extracted features have stronger directivity; the output of the module is a feature map F_o1 of size C'×H/2^4×W/2^4.
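A minimal sketch of this channel attention in PyTorch. Following the text, the two fully connected layers keep the C' size unchanged (a classic squeeze-and-excitation block would insert a channel-reduction ratio here):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation as described: per-channel global average
    pooling, a two-layer fully connected network, and Sigmoid weights M_c
    that rescale each channel."""
    def __init__(self, c_prime):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: one scalar per channel
        self.fc = nn.Sequential(                 # excitation, C' -> C' -> C'
            nn.Linear(c_prime, c_prime),
            nn.ReLU(inplace=True),
            nn.Linear(c_prime, c_prime),
            nn.Sigmoid(),                        # weights M_c in (0, 1)
        )

    def forward(self, f_i):                      # f_i: (B, C', H/16, W/16)
        b, c, _, _ = f_i.shape
        m_c = self.fc(self.pool(f_i).view(b, c)) # (B, C')
        return f_i * m_c.view(b, c, 1, 1)        # F_o1: important channels enhanced
```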
The position attention module specifically comprises: the feature map F_i of size C'×H/2^4×W/2^4 is passed through three separate 1×1 convolutions to obtain feature maps Q, K and V, where Q and K have size C'×H/2^4×W/2^4 and V has size C'×H/2^4×W/2^4. An affinity operation is then carried out on Q and K: at any pixel u on Q, a channel vector Q_u of shape C'×1×1 is obtained, one for each spatial position; at the same time, feature vectors K_u and V_u are collected from all positions in the same row and column as pixel u on K and V, with K_u, V_u ∈ [(H/2^4+W/2^4-1)×C'], where K_{i,u} and V_{i,u} denote the channel vector of the i-th pixel in K_u and V_u, each of shape C'×1×1. A vector multiplication of Q_u and K_{i,u} gives the vector D_{i,u}, and a softmax layer applied along the channel dimension yields the attention map A, which represents the degree of correlation between any pixel and the other pixels in the same row and column. A_{i,u} is then matrix-multiplied with V_{i,u}; this operation captures the context information of any pixel in the horizontal and vertical directions. The connection between a pixel and the surrounding pixels not on its criss-cross path is still missing, however; by superposing two CCAM modules, global context information can be collected from all pixels of the given image at each position. The output of the module is a feature map F_o2 of size C'×H/2^4×W/2^4.
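A sketch of one criss-cross attention pass in PyTorch, following the common CCNet formulation of the affinity, softmax and aggregation steps described above. The learnable residual weight gamma and the -inf masking of the vertical diagonal (so that a pixel's own position is not counted twice) are CCNet conventions rather than details stated in this patent; Q and K keep the full C' channels, as the text specifies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """One criss-cross pass: each pixel attends to the pixels in its own
    row and column (H + W - 1 positions)."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c, kernel_size=1)
        self.k = nn.Conv2d(c, c, kernel_size=1)
        self.v = nn.Conv2d(c, c, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))     # residual weight (assumption)

    def forward(self, x):                             # x: (B, C', H, W)
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # vertical (same-column) affinities: one (H, H) matrix per column
        q_v = q.permute(0, 3, 2, 1).reshape(b * w, h, c)
        k_v = k.permute(0, 3, 2, 1).reshape(b * w, h, c)
        diag = torch.diag(torch.full((h,), float('-inf'), device=x.device))
        e_v = torch.bmm(q_v, k_v.transpose(1, 2)) + diag
        # horizontal (same-row) affinities: one (W, W) matrix per row
        q_h = q.permute(0, 2, 3, 1).reshape(b * h, w, c)
        k_h = k.permute(0, 2, 3, 1).reshape(b * h, w, c)
        e_h = torch.bmm(q_h, k_h.transpose(1, 2))
        # joint softmax over the criss-cross candidates -> attention map A
        e = torch.cat([e_v.view(b, w, h, h).permute(0, 2, 1, 3),
                       e_h.view(b, h, w, w)], dim=3)  # (B, H, W, H+W)
        a = F.softmax(e, dim=3)
        a_v = a[..., :h].permute(0, 2, 1, 3).reshape(b * w, h, h)
        a_h = a[..., h:].reshape(b * h, w, w)
        # aggregate V along each column and each row
        v_v = v.permute(0, 3, 2, 1).reshape(b * w, h, c)
        v_h = v.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out_v = torch.bmm(a_v, v_v).view(b, w, h, c).permute(0, 3, 2, 1)
        out_h = torch.bmm(a_h, v_h).view(b, h, w, c).permute(0, 3, 1, 2)
        return self.gamma * (out_v + out_h) + x       # same size as the input
```

Superposing the module twice, e.g. f_o2 = cca(cca(f_i)), lets every position collect context from all pixels of the image, as described.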
The superposition module comprises a Short-Cut structure: when the original feature is input into the module, it first undergoes channel dimensionality reduction by a 1×1 convolution, and then the output results of the two attention modules and the Short-Cut branch are concatenated along the channel dimension in a 1:1:1 ratio to jointly form the output of the module.
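This superposition can be sketched directly; whether the short-cut branch starts from the C5-channel feature F_e or from F_i is not fixed by the text, so the input channel count is left as a parameter here:

```python
import torch
import torch.nn as nn

class SuperpositionModule(nn.Module):
    """Short-Cut superposition: the original feature is reduced to C' channels
    by a 1x1 convolution, then concatenated with the two attention outputs in
    a 1:1:1 channel ratio, giving 3C' channels."""
    def __init__(self, c_in, c_prime):
        super().__init__()
        self.shortcut = nn.Conv2d(c_in, c_prime, kernel_size=1)  # channel reduction

    def forward(self, f_orig, f_o1, f_o2):
        # f_orig: (B, c_in, h, w); f_o1, f_o2: (B, C', h, w)
        s = self.shortcut(f_orig)
        return torch.cat([f_o1, f_o2, s], dim=1)   # 1:1:1 concat -> (B, 3C', h, w)
```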
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims and the equivalents thereof, the present invention is also intended to include such modifications and variations.

Claims (4)

1. A remote sensing image building feature extraction method based on an attention network, characterized by comprising the following steps:
firstly, obtaining a remote sensing image building picture containing ground feature elements, and preprocessing it to obtain a preprocessed picture;
secondly, inputting the preprocessed picture of size C×H×W into an attention network, wherein the attention network comprises a regular stream, a shape stream and a fusion module, the regular stream comprises 5 convolution blocks of identical structure, and the output feature map of the regular stream is a binary semantic map F_s; the inputs of the shape stream are the output feature maps of the 5 convolution blocks of the regular stream, of sizes C×H/2^i×W/2^i with 0≤i≤4, which are upsampled to C×H×W and input into 4 serially connected gated convolution layers GCL to obtain a binary edge map F_e of size 1×H×W;
finally, the binary maps F_s and F_e obtained from the regular stream and the shape stream respectively are concatenated along the channel dimension and input into the fusion module; after convolutional downsampling and upsampling operations, the final output is a target binary semantic map of size H×W, and this output is loss-supervised by the semantic labels used in the regular stream, so that a binary semantic map with clear edges is obtained;
each convolution block comprises an encoder, an attention part and a decoder, and the regular stream processes the image through the following steps:
(1) Inputting the preprocessed picture into an encoder, wherein the encoder comprises 5 coding layers, and each of the four coding layers after the first halves the size of the feature map: when the input feature map has size C×H×W, it still has size C1×H×W after the first coding layer, size C2×H/2×W/2 after the second, size C3×H/2^2×W/2^2 after the third, size C4×H/2^3×W/2^3 after the fourth, and size C5×H/2^4×W/2^4 after the fifth; after the five coding layers, the spatial size becomes 1/16 of the original, giving the encoded feature map F_e;
(2) The attention part comprises a channel attention module, a position attention module and a superposition module. First, a 1×1 convolution reduces the number of channels of the feature map F_e from C5 to C' while keeping the spatial dimensions unchanged, giving a feature map F_i of size C'×H/2^4×W/2^4, which is input into the channel attention module to obtain a feature map F_o1 of the same size as F_i; the channel attention module processes the input feature map using feature-channel squeeze and excitation;
(3) The feature map F_e of size C5×H/2^4×W/2^4 is likewise first passed through a 1×1 convolution that reduces the number of channels from C5 to C' while keeping the spatial dimensions unchanged, giving the feature map F_i (the same F_i as in step (2); the two attention branches of steps (2) and (3) are parallel), which is then input into the position attention module to obtain a feature map F_o2 of the same size as F_i; the position attention module uses the Criss-Cross Attention Module, stacked twice, to obtain the spatial global context information of the pixels;
(4) The feature maps F_o1 and F_o2 are input into the superposition module, which fuses F_o1 and F_o2 using a short-cut structure to finally obtain a feature map of size 3C'×H/2^4×W/2^4; the superposition module superposes the different input feature maps along the channel dimension;
(5) The feature map obtained in step (4) is upsampled by the decoder block to a binary semantic map F_s of size 1×H×W;
The position attention module specifically comprises:
for convolution, a size C is obtained ×H/2 4 ×W/2 4 Is characterized by (a)F i The characteristic graphs Q, K, V are obtained by respectively carrying out 1×1 convolution, wherein the magnitudes of the characteristic graphs Q and K are C ×H/2 4 ×W/2 4 The size of the characteristic diagram V is C ×H/2 4 ×W/2 4 Then, affinicy operation is carried out on Q and K, namely, a channel vector Q is obtained at any pixel point u on Q u Sharing HW is C in shape X 1, and at the same time, obtaining a feature vector K at all positions of the same row and column of the pixel points u on K and Q u 、V u With K u 、V u ∈[(H/2 4 +W/2 4 -1) ×C ],K i,u 、V i,u Represent K u 、V u Upper firstiChannel vector of each pixel, shape C X 1, pair Q u And K u Vector multiplication operation is carried out to obtain a vector D i,u Then, a softmax layer is applied to the channel dimension to obtain attention graph A, wherein the attention graph A represents the correlation degree of any pixel point and other pixels points in the same row and the same column; then re-combining A i,u And V is equal to i,u The matrix multiplication is carried out, and the contextual information of any pixel point and the horizontal and vertical directions can be captured through the operation; but one imageThe connection between a pixel and its surrounding pixels not in the cross path is still lost, and global context information can be collected from all pixels of a given image at each location by two CCAM module superposition operations, the output of the module being of size C ×H/2 4 ×W/2 4 Feature map F of (1) o2
2. The remote sensing image building feature extraction method based on the attention network according to claim 1, wherein the shape stream performs loss supervision on the output binary edge map using edge labels.
3. The remote sensing image building feature extraction method based on the attention network according to claim 1, wherein the channel attention module specifically comprises:
the extrusion is performed in space dimension, and C' H/2 is compressed 4 W/2 4 Feature map F of (1) i Performing global average pooling operation, obtaining a scalar for each channel, outputting the scalar as C ' multiplied by 1, exciting the obtained C ' multiplied by 1 characteristic diagram, sending the characteristic diagram into a two-layer fully connected neural network, keeping the size unchanged, and obtaining weights M between 0 and 1 of C ' through a Sigmoid function c As the respective weights of the C 'channels, the weights are multiplied by each element of the corresponding channel respectively to realize the enhancement of important features and the attenuation of unimportant features, so that the directivity of the extracted features is stronger, and the output of the module is C' x H/2 4 ×W/2 4 Feature map F of (1) o1
4. The remote sensing image building feature extraction method based on the attention network according to claim 1, wherein the superposition module comprises a Short-Cut structure: when the original feature is input into the module, it first undergoes channel dimensionality reduction by a 1×1 convolution, and then the output results of the two attention modules and the Short-Cut structure are concatenated along the channel dimension in a 1:1:1 ratio to jointly form the output of the module.
CN202210810000.2A 2022-07-11 2022-07-11 Remote sensing image building feature extraction method based on attention network Active CN115063685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210810000.2A CN115063685B (en) 2022-07-11 2022-07-11 Remote sensing image building feature extraction method based on attention network


Publications (2)

Publication Number Publication Date
CN115063685A CN115063685A (en) 2022-09-16
CN115063685B (en) 2023-10-03

Family

ID=83205953


Country Status (1)

CN — CN115063685B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
CN114022785A (en) * 2021-11-15 2022-02-08 中国华能集团清洁能源技术研究院有限公司 Remote sensing image semantic segmentation method, system, equipment and storage medium
WO2022073452A1 (en) * 2020-10-07 2022-04-14 武汉大学 Hyperspectral remote sensing image classification method based on self-attention context network
CN114495210A (en) * 2022-01-07 2022-05-13 中北大学南通智能光机电研究院 Posture change face recognition method based on attention mechanism


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Takikawa, Towaki; Acuna, David; et al. "Gated-SCNN: Gated Shape CNNs for Semantic Segmentation." arXiv:1907.05740v1 [cs.CV], pp. 1-10. *
Ding, Lei; Tang, Hao; et al. "LANet: Local Attention Embedding to Improve the Semantic Segmentation of Remote Sensing Images." IEEE Transactions on Geoscience and Remote Sensing, pp. 1-10. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant