CN115641473A - Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture - Google Patents

Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture

Info

Publication number
CN115641473A
CN115641473A CN202211292933.3A
Authority
CN
China
Prior art keywords
attention
feature map
feature
layer
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211292933.3A
Other languages
Chinese (zh)
Inventor
王威
李希杰
王新
李骥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202211292933.3A priority Critical patent/CN115641473A/en
Publication of CN115641473A publication Critical patent/CN115641473A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a remote sensing image classification method based on a CNN-self-attention mechanism hybrid architecture in the technical field of image recognition. The method comprises the following steps: labeling acquired remote sensing images to obtain training samples; constructing a remote sensing image classification model comprising an input network, a feature extraction network and a classification network, wherein the feature extraction network performs multi-scale global and local feature extraction on a convolution feature map through 4 sequentially connected stages to obtain a multi-scale feature map, each stage stacking ADC modules and CPA modules, both of which are constructed on the basis of the MetaFormer paradigm; training the remote sensing image classification model with the training samples; and classifying remote sensing images to be classified with the trained model to obtain classification results. The method can improve the accuracy of remote sensing image classification.

Description

Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture
Technical Field
The application relates to the technical field of image recognition, in particular to a remote sensing image classification method based on a CNN-self-attention mechanism hybrid architecture.
Background
In the task of remote sensing image classification, identifying image information is essential. Remote sensing images contain rich image information, complex scene compositions and detailed ground-object elements; because of these characteristics, classification results often suffer from missed and incorrect classifications, which lowers classification accuracy.
Convolutional Neural Networks (CNNs) are the most fundamental pillar of general visual tasks in the prior art. They effectively extract high-frequency representations by covering local information with convolutions inside the receptive field, but they are relatively weak at learning global information. The Transformer can effectively capture low-frequency information in visual data through the attention mechanism, mainly the global shape and structure of a scene or object, but its ability to learn local edges and textures is limited.
Disclosure of Invention
In view of the above, it is necessary to provide a remote sensing image classification method based on a CNN-self-attention mechanism hybrid architecture.
A remote sensing image classification method based on a CNN-self-attention mechanism hybrid architecture comprises the following steps:
Acquiring a remote sensing image, and labeling the remote sensing image to obtain training samples.
Constructing a remote sensing image classification model based on a CNN-self-attention mechanism hybrid architecture, wherein the remote sensing image classification model comprises an input network, a feature extraction network and a classification network. The input network processes the training samples with a plurality of convolution downsampling layers of the same scale to obtain a convolution feature map. The feature extraction network extracts multi-scale global and local features from the convolution feature map through 4 sequentially connected stages to obtain a multi-scale feature map; each stage comprises an average pooling downsampling layer followed by a stack of several ADC modules and several CPA modules. The ADC module is constructed from an asymmetric convolution group and a multi-layer perceptron-two-dimensional attention layer within the MetaFormer paradigm, and the CPA module from a convolution-attention parallel block and a multi-layer perceptron-two-dimensional attention layer within the MetaFormer paradigm; the multi-layer perceptron-two-dimensional attention layer extracts features of the input feature map with a multi-layer perceptron and a two-dimensional attention layer. The classification network classifies the multi-scale feature map to obtain a remote sensing image classification prediction result.
Training the remote sensing image classification model using the labels of the training samples and the classification prediction results obtained by inputting the training samples into the model, to obtain a trained remote sensing image classification model.
Inputting the remote sensing image to be classified into the trained remote sensing image classification model to obtain a remote sensing image classification result.
In one embodiment, the input network comprises 3 convolutional downsampling layers of the same scale connected in sequence.
Training the remote sensing image classification model using the labels of the training samples and the classification prediction results obtained by inputting the training samples into the model comprises the following steps:
Inputting the training samples into the input network to obtain a convolution feature map.
Inputting the convolution feature map into the first stage of the feature extraction network to obtain a first-layer feature map.
Inputting the first-layer feature map into the second stage of the feature extraction network to obtain a second-layer feature map.
Inputting the second-layer feature map into the third stage of the feature extraction network to obtain a third-layer feature map.
Inputting the third-layer feature map into the fourth stage of the feature extraction network to obtain a multi-scale feature map.
Inputting the multi-scale feature map into the classification network to obtain a remote sensing image classification prediction result.
Reverse-training the remote sensing image classification model using the labels of the training samples and the classification prediction results to obtain the trained remote sensing image classification model.
In one embodiment, the number of ADC modules in the first stage of the feature extraction network is 2 and the number of CPA modules is 0.
Inputting the convolution feature map into the first stage of the feature extraction network to obtain a first-layer feature map comprises the following steps:
inputting the convolution feature map into the average pooling downsampling layer of the first stage of the feature extraction network for sampling to obtain a downsampled feature map;
inputting the downsampled feature map into the asymmetric convolution group of the first ADC module of the first stage of the feature extraction network for feature extraction, and adding the extracted features to the downsampled feature map to obtain an enhanced feature map;
inputting the enhanced feature map into the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a shallow feature map fusing spatial attention and channel attention;
and inputting the shallow feature map into the second ADC module of the first stage of the feature extraction network to obtain the first-layer feature map.
In one embodiment, the multi-layer perceptron-two-dimensional attention layer comprises a first normalization layer, a multi-layer perceptron, and a two-dimensional attention module.
Inputting the enhanced feature map into the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a shallow feature map fusing spatial attention and channel attention comprises:
inputting the enhanced feature map into the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network, and processing it with the first normalization layer to obtain a normalized enhanced feature map;
inputting the normalized enhanced feature map into the multi-layer perceptron of the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a multi-layer perception feature map;
inputting the multi-layer perception feature map into the two-dimensional attention module of the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a first two-dimensional attention feature;
and adding the enhanced feature map and the first two-dimensional attention feature to obtain the shallow feature map fusing spatial attention and channel attention.
In one embodiment, the two-dimensional attention module comprises a channel attention branch and a spatial attention branch.
Inputting the multi-layer perception feature map into the two-dimensional attention module of the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a first two-dimensional attention feature comprises the following steps:
Inputting the multi-layer perception feature map into the channel attention branch of the two-dimensional attention module: first performing global average pooling on the multi-layer perception feature map, processing the pooled result with a fully connected layer and activating it with the GELU activation function to obtain a left-branch global feature, and processing the left-branch global feature with a fully connected layer to obtain the channel attention feature.
Inputting the multi-layer perception feature map into the spatial attention branch of the two-dimensional attention module: processing it with a fully connected layer and activating it with the GELU activation function to obtain a right-branch global feature, splicing the left-branch global feature with the right-branch global feature, and processing the result with a fully connected layer to obtain the spatial attention feature.
Adding the channel attention feature and the spatial attention feature, activating the sum with a Sigmoid function, and multiplying the resulting activation by the multi-layer perception feature map to obtain the first two-dimensional attention feature.
In one embodiment, the number of ADC modules in the second stage of the feature extraction network is 3 and the number of CPA modules is 3.
Inputting the first-layer feature map into the second stage of the feature extraction network to obtain a second-layer feature map comprises:
sampling the first-layer feature map through the average pooling downsampling layer of the second stage of the feature extraction network, and inputting the sampled feature map into the three sequentially connected ADC modules of the second stage to obtain a convolution attention feature map;
inputting the convolution attention feature map into the first CPA module of the second stage of the feature extraction network, and performing channel segmentation after normalization-layer processing to obtain a first segmentation feature and a pooling segmentation feature map;
inputting the first segmentation feature into the convolution branch of the convolution-attention parallel block of the first CPA module of the second stage of the feature extraction network, and processing it with depthwise convolution to obtain a depthwise convolution feature map;
inputting the pooling segmentation feature map into the attention branch of the convolution-attention parallel block of the first CPA module of the second stage of the feature extraction network, and processing it with average pooling and the attention module to obtain an attention branch feature map;
performing channel splicing on the depthwise convolution feature map and the attention branch feature map, adding the result to the convolution attention feature map, and inputting the sum into the multi-layer perceptron-two-dimensional attention layer of the first CPA module of the second stage of the feature extraction network to obtain a first shallow feature map;
inputting the first shallow feature map into the second CPA module of the second stage of the feature extraction network to obtain a second deep feature map;
and inputting the second deep feature map into the third CPA module of the second stage of the feature extraction network to obtain the second-layer feature map.
In one embodiment, inputting the pooling segmentation feature map into the attention branch of the convolution-attention parallel block of the first CPA module of the second stage of the feature extraction network, and processing it with average pooling and the attention module to obtain an attention branch feature map, comprises:
inputting the pooling segmentation feature map into the attention branch of the convolution-attention parallel block of the first CPA module of the second stage of the feature extraction network, and average-pooling it to obtain a pooled feature map X.
Applying the embedding matrices W_K, W_Q, W_V to X by pointwise convolution, keys K = XW_K, queries Q = XW_Q and values V = XW_V are obtained.
From K, Q and V, the attention branch feature map is obtained, expressed as:
Y = Linear(Softmax(Q)((Softmax(K^T)V)W))
wherein Y is the attention branch feature map, Q is the query, K is the key, V is the value, W is a learnable parameter matrix, Softmax() is an activation function, and Linear() is a linear transformation.
In one embodiment, the number of ADC modules in the third stage of the feature extraction network is 4 and the number of CPA modules is 3.
In one embodiment, the number of ADC modules in the fourth stage of the feature extraction network is 1 and the number of CPA modules is 3.
In one embodiment, the classification network includes an average pooling layer and a full connectivity layer.
The remote sensing image classification method based on the CNN-self-attention mechanism hybrid architecture comprises the following steps: obtaining a remote sensing image and labeling it to obtain training samples; constructing a remote sensing image classification model comprising an input network, a feature extraction network and a classification network, wherein the feature extraction network extracts multi-scale global and local features from the convolution feature map through 4 sequentially connected stages to obtain a multi-scale feature map; each stage comprises an average pooling downsampling layer followed by a stack of several ADC modules and several CPA modules; the ADC module is constructed from an asymmetric convolution group and a multi-layer perceptron-two-dimensional attention layer within the MetaFormer paradigm, and the CPA module from a convolution-attention parallel block and a multi-layer perceptron-two-dimensional attention layer within the MetaFormer paradigm; training the model with the training samples, and classifying the remote sensing image to be classified with the trained model to obtain a classification result. The method can improve the accuracy of remote sensing image classification.
Drawings
FIG. 1 is a schematic flow chart illustrating a remote sensing image classification method based on a CNN-self-attention mechanism hybrid architecture in an embodiment;
FIG. 2 is a block diagram of a remote sensing image classification model in another embodiment;
FIG. 3 is a schematic diagram of a process for training a classification model of a remote sensing image according to another embodiment;
FIG. 4 is a block diagram of an ADC module in another embodiment;
FIG. 5 is a block diagram of a two-dimensional attention module in another embodiment;
FIG. 6 is a structural diagram of a CPA module in another embodiment;
FIG. 7 is a block diagram of an attention module in another embodiment;
FIG. 8 is a comparison of classification accuracy for several network models in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Convolution-attention parallel module: Convolution Paralleling Attention Block, abbreviated as CPA module;
Aggregation depthwise convolution module: Aggregation Depthwise Convolution Block, abbreviated as ADC module;
Two-dimensional attention module: Two-dimensional Attention Block, abbreviated as TdAtten module.
In one embodiment, as shown in fig. 1, a method for classifying remote sensing images based on a CNN-attention mechanism hybrid architecture is provided, which includes the following steps:
step 100: and acquiring a remote sensing image, and labeling the remote sensing image to obtain a training sample.
Specifically, the remote sensing image may be, but is not limited to: grassland remote sensing images, field remote sensing images, industrial area remote sensing images, river and lake remote sensing images, forest remote sensing images, residential area remote sensing images and parking lot remote sensing images.
Step 102: constructing a remote sensing image classification model based on a CNN-self-attention mechanism hybrid architecture.
The remote sensing image classification model comprises an input network, a feature extraction network and a classification network.
The input network processes the training samples with a plurality of convolution downsampling layers of the same scale to obtain a convolution feature map.
The feature extraction network extracts multi-scale global and local features from the convolution feature map through 4 sequentially connected stages to obtain a multi-scale feature map; each stage comprises an average pooling downsampling layer followed by a stack of several ADC modules and several CPA modules; the ADC module is constructed from an asymmetric convolution group and a multi-layer perceptron-two-dimensional attention layer within the MetaFormer paradigm, and the CPA module from a convolution-attention parallel block and a multi-layer perceptron-two-dimensional attention layer within the MetaFormer paradigm; the multi-layer perceptron-two-dimensional attention layer extracts features of the input feature map with a multi-layer perceptron and a two-dimensional attention layer.
The classification network classifies the multi-scale feature map to obtain a remote sensing image classification prediction result.
Specifically, the ADC module is an efficient convolution block whose overall architecture follows the general MetaFormer architecture, which has been verified to be what matters most in Transformer-style modules.
The CPA module efficiently combines CNN and Transformer in a parallel structure; its overall architecture also follows the general MetaFormer architecture.
Two-dimensional attention (TdAtten) is an efficient attention mechanism that extracts the spatial and channel information of the input feature map using a spatial attention branch and a channel attention branch.
ADC module: effectively extracts local information while improving robustness to image flipping and rotation; its two-dimensional attention can effectively extract useful information when the number of channels is small.
CPA module: effectively extracts global and local information and combines them; its two-dimensional attention can likewise extract useful information effectively when the number of channels is small.
The network structure of the remote sensing image classification model based on the CNN-self-attention mechanism hybrid architecture is shown in fig. 2.
Step 104: training the remote sensing image classification model using the labels of the training samples and the classification prediction results obtained by inputting the training samples into the model, to obtain a trained remote sensing image classification model.
Step 106: inputting the remote sensing images to be classified into the trained remote sensing image classification model to obtain remote sensing image classification results.
The above remote sensing image classification method based on the CNN-self-attention mechanism hybrid architecture comprises: obtaining a remote sensing image and labeling it to obtain training samples; constructing a remote sensing image classification model comprising an input network, a feature extraction network and a classification network, wherein the feature extraction network extracts multi-scale global and local features from the convolution feature map through 4 sequentially connected stages to obtain a multi-scale feature map; each stage comprises an average pooling downsampling layer followed by a stack of several ADC modules and several CPA modules; the ADC module is constructed from an asymmetric convolution group and a multi-layer perceptron-two-dimensional attention layer within the MetaFormer paradigm, and the CPA module from a convolution-attention parallel block and a multi-layer perceptron-two-dimensional attention layer within the MetaFormer paradigm; training the model with the training samples, and classifying the remote sensing image to be classified with the trained model to obtain a classification result. This method can improve the accuracy of remote sensing image classification.
In one embodiment, the input network comprises 3 convolution downsampling layers of the same scale connected in sequence; as shown in fig. 3, step 104 specifically includes the following steps:
step 300: and inputting the training sample into an input network to obtain a convolution characteristic diagram.
Specifically, the training sample is input into the first convolution downsampling layer of the input network, processed with batch normalization and activated with the GELU function; the activated result is input into the second convolution downsampling layer, batch-normalized and activated with the GELU function; the result is then input into the third convolution downsampling layer, batch-normalized and activated with the GELU function to obtain the convolution feature map. The 3 same-scale convolution downsampling layers are 3 × 3 convolution layers.
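As a concrete illustration, the following is a minimal PyTorch sketch of this input network. The patent does not state the strides or channel widths of the three 3 × 3 layers, so the defaults here (stride 2 in the first layer, an assumed channel ramp) are assumptions rather than the claimed configuration.

```python
import torch.nn as nn

class ConvStem(nn.Module):
    """Input network: three same-scale 3x3 convolution downsampling layers,
    each followed by batch normalization and GELU, per the description.
    Strides and channel widths are assumptions (not given in the text)."""
    def __init__(self, in_ch=3, out_ch=64, strides=(2, 1, 1)):
        super().__init__()
        chs = [in_ch, out_ch // 2, out_ch // 2, out_ch]  # assumed channel ramp
        layers = []
        for i, s in enumerate(strides):
            layers += [
                nn.Conv2d(chs[i], chs[i + 1], kernel_size=3, stride=s, padding=1),
                nn.BatchNorm2d(chs[i + 1]),
                nn.GELU(),
            ]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):  # (B, 3, H, W) -> (B, out_ch, H/2, W/2) with defaults
        return self.stem(x)
```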
Step 302: inputting the convolution feature map into the first stage of the feature extraction network to obtain a first-layer feature map.
Step 304: inputting the first-layer feature map into the second stage of the feature extraction network to obtain a second-layer feature map.
Step 306: inputting the second-layer feature map into the third stage of the feature extraction network to obtain a third-layer feature map.
Step 308: inputting the third-layer feature map into the fourth stage of the feature extraction network to obtain a multi-scale feature map.
Step 310: inputting the multi-scale feature map into the classification network to obtain a remote sensing image classification prediction result.
Step 312: reverse-training the remote sensing image classification model using the labels of the training samples and the remote sensing image classification prediction results to obtain the trained remote sensing image classification model.
In one embodiment, the number of ADC modules in the first stage of the feature extraction network is 2, and the number of CPA modules is 0. Step 302 comprises: inputting the convolution feature map into the average pooling downsampling layer of the first stage for sampling to obtain a downsampled feature map; inputting the downsampled feature map into the asymmetric convolution group of the first ADC module of the first stage for feature extraction, and adding the extracted features to the downsampled feature map to obtain an enhanced feature map; inputting the enhanced feature map into the multi-layer perceptron-two-dimensional attention layer of the first stage to obtain a shallow feature map fusing spatial attention and channel attention; and inputting the shallow feature map into the second ADC module of the first stage to obtain the first-layer feature map. The structure of the ADC module is shown in fig. 4.
Specifically, the input convolution feature map is first sampled by the average pooling downsampling layer; the resulting downsampled feature map is then processed by the asymmetric convolution group and added to the downsampled feature map to obtain an enhanced feature map. The asymmetric convolution group consists of a 3 × 3 depthwise convolution layer, a 3 × 1 depthwise convolution layer and a 1 × 3 depthwise convolution layer connected in parallel, each followed by a GELU activation function and batch normalization (BN). The asymmetric convolution group improves the robustness of the model to image flipping and rotation; research also shows that adding 3 × 1 and 1 × 3 depthwise convolutions on top of a 3 × 3 depthwise convolution kernel can be regarded as explicitly enhancing the skeleton part of the kernel, which improves performance. The enhanced feature map then enters the multi-layer perceptron-two-dimensional attention layer (MLP-TdAtten layer): it first passes through a normalization layer (LayerNorm) and a multi-layer perceptron (MLP, default expansion ratio 4), and then enters the TdAtten (two-dimensional attention) module. Because the channel dimension of the network model is small, adding the two-dimensional attention module effectively improves model performance.
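A minimal PyTorch sketch of this asymmetric convolution group follows. The text does not say how the three parallel depthwise branches are merged; summing them, in the ACNet style, is assumed here, as is performing the residual addition inside the module.

```python
import torch.nn as nn

class AsymmetricConvGroup(nn.Module):
    """Asymmetric convolution group of the ADC module: parallel 3x3, 3x1
    and 1x3 depthwise convolutions, each followed by GELU and BatchNorm.
    Merging by summation and the residual add are assumptions."""
    def __init__(self, dim):
        super().__init__()
        def dw_branch(kernel, pad):
            return nn.Sequential(
                nn.Conv2d(dim, dim, kernel, padding=pad, groups=dim),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )
        self.b33 = dw_branch((3, 3), (1, 1))
        self.b31 = dw_branch((3, 1), (1, 0))
        self.b13 = dw_branch((1, 3), (0, 1))

    def forward(self, x):
        # the extracted features added back to the downsampled input
        # yield the enhanced feature map
        return x + self.b33(x) + self.b31(x) + self.b13(x)
```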
In one embodiment, as shown in FIG. 4, the multi-layer perceptron-two-dimensional attention layer comprises a first normalization layer, a multi-layer perceptron and a two-dimensional attention module. Inputting the enhanced feature map into the multi-layer perceptron-two-dimensional attention layer of the first stage to obtain a shallow feature map fusing spatial attention and channel attention comprises: processing the enhanced feature map with the first normalization layer to obtain a normalized enhanced feature map; inputting the normalized enhanced feature map into the multi-layer perceptron of the layer to obtain a multi-layer perception feature map; inputting the multi-layer perception feature map into the two-dimensional attention module of the layer to obtain a first two-dimensional attention feature; and adding the enhanced feature map and the first two-dimensional attention feature to obtain the shallow feature map fusing spatial attention and channel attention.
In one embodiment, the structure of the two-dimensional attention module is shown in FIG. 5. The two-dimensional attention module comprises a channel attention branch and a spatial attention branch. Inputting the multi-layer perception feature map into the two-dimensional attention module of the first stage to obtain a first two-dimensional attention feature comprises: inputting the multi-layer perception feature map into the channel attention branch, first performing global average pooling, processing the pooled result with a fully connected layer and activating it with the GELU activation function to obtain a left-branch global feature, then processing the left-branch global feature with a fully connected layer to obtain the channel attention feature; inputting the multi-layer perception feature map into the spatial attention branch, processing it with a fully connected layer and activating it with the GELU activation function to obtain a right-branch global feature, splicing the left-branch global feature with the right-branch global feature and processing the result with a fully connected layer to obtain the spatial attention feature; and adding the channel attention feature and the spatial attention feature, activating the sum with the Sigmoid function, and multiplying the resulting activation by the multi-layer perception feature map to obtain the first two-dimensional attention feature.
Specifically, the multi-layer perception feature map has shape (H, W, C), where H and W are the height and width of the feature map and C is the number of channels. The left side of the two-dimensional attention module is the channel attention branch, which captures global information; the right side is the spatial attention branch, which captures local information.
In the channel attention branch, global average pooling is applied first; the first fully connected layer changes the size to (1, 1, C/r), where r is the reduction rate (default 4), and after the GELU activation function and a fully connected layer the size becomes (1, 1, C). This keeps the computational burden light in the small channel dimension while increasing information interaction between channels.
In the spatial attention branch, the first fully connected layer changes the size to (H, W, C/r). The global information output of the channel attention branch is then spliced with the local information output of the spatial attention branch, giving size (H, W, 2C/r); this information interaction further improves model performance, and a fully connected layer then reduces the size to (H, W, 1). Finally the two branches are added, and the result, after a Sigmoid, is multiplied by the input to obtain the final result.
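Under those shape conventions, a PyTorch sketch of the TdAtten module might look as follows. Implementing the fully connected layers as 1 × 1 convolutions over channel-first tensors, and broadcasting the left-branch global feature before splicing, are implementation assumptions.

```python
import torch
import torch.nn as nn

class TdAtten(nn.Module):
    """Two-dimensional attention: channel branch (GAP -> FC -> GELU -> FC)
    and spatial branch (FC -> GELU), concatenated at (H, W, 2C/r) and fused
    to (H, W, 1); the branch sum gates the input through a Sigmoid.
    FC layers are realized as 1x1 convolutions (an assumption)."""
    def __init__(self, dim, r=4):
        super().__init__()
        hidden = dim // r
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.ch_fc1 = nn.Conv2d(dim, hidden, 1)    # (1,1,C) -> (1,1,C/r)
        self.ch_fc2 = nn.Conv2d(hidden, dim, 1)    # (1,1,C/r) -> (1,1,C)
        self.sp_fc1 = nn.Conv2d(dim, hidden, 1)    # (H,W,C) -> (H,W,C/r)
        self.sp_fc2 = nn.Conv2d(2 * hidden, 1, 1)  # (H,W,2C/r) -> (H,W,1)
        self.act = nn.GELU()

    def forward(self, x):                # x: (B, C, H, W)
        b, c, h, w = x.shape
        g = self.act(self.ch_fc1(self.gap(x)))      # left-branch global feature
        ch_att = self.ch_fc2(g)                     # (B, C, 1, 1)
        s = self.act(self.sp_fc1(x))                # right-branch feature
        fused = torch.cat([g.expand(-1, -1, h, w), s], dim=1)
        sp_att = self.sp_fc2(fused)                 # (B, 1, H, W)
        return x * torch.sigmoid(ch_att + sp_att)   # broadcast add, gate, multiply
```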
In one embodiment, the number of ADC modules in the second stage of the feature extraction network is 3 and the number of CPA modules is 3. Step 304 comprises: sampling the first-layer feature map through the average pooling downsampling layer of the second stage, and inputting the sampled feature map into the three sequentially connected ADC modules of the second stage to obtain a convolution attention feature map; inputting the convolution attention feature map into the first CPA module of the second stage, and performing channel segmentation after normalization-layer processing to obtain a first segmentation feature and a pooling segmentation feature map; inputting the first segmentation feature into the convolution branch of the convolution-attention parallel block of the first CPA module, and processing it with depthwise convolution to obtain a depthwise convolution feature map; inputting the pooling segmentation feature map into the attention branch of the convolution-attention parallel block, and processing it with average pooling and the attention module to obtain an attention branch feature map; performing channel splicing on the depthwise convolution feature map and the attention branch feature map, adding the result to the convolution attention feature map, and inputting the sum into the multi-layer perceptron-two-dimensional attention layer of the first CPA module to obtain a first shallow feature map; inputting the first shallow feature map into the second CPA module of the second stage to obtain a second deep feature map; and inputting the second deep feature map into the third CPA module of the second stage to obtain the second-layer feature map. The structure of the CPA module is shown in fig. 6.
Specifically, the convolution attention feature map has size (H, W, C), where H and W are the height and width of the feature map and C is the number of channels. The channel segmentation operation splits the C channels according to a ratio t (default split ratio t = 1), yielding two feature maps of the same size and channel count, each (H, W, C/2), which pass through the two branches. The convolution branch performs feature extraction through depthwise convolution and outputs a result of (H, W, C/2); the attention branch performs average pooling, extracts global information with the attention module (Attention), and outputs a result of (H, W, C/2). Finally, the feature maps of the two branches are spliced along the channel dimension into a feature map of shape (H, W, C); the convolution-attention parallel block thus effectively combines local and global information. The feature map then enters the MLP-TdAtten layer.
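The following PyTorch sketch of the convolution-attention parallel block follows this description. The text does not say how the attention branch, which average-pools its input, restores the (H, W) resolution before channel splicing; nearest-neighbor upsampling is assumed here, and `attention` stands for the linearized attention module described below.

```python
import torch
import torch.nn as nn

class CPAParallelBlock(nn.Module):
    """Convolution-attention parallel block: normalize, split channels 1:1,
    run a depthwise-convolution branch and a pooled-attention branch in
    parallel, splice the outputs and add the input back. The upsampling
    that restores (H, W) after average pooling is an assumption."""
    def __init__(self, dim, attention, pool_size=2):
        super().__init__()
        half = dim // 2
        self.norm = nn.LayerNorm(dim)
        self.dwconv = nn.Conv2d(half, half, 3, padding=1, groups=half)
        self.pool = nn.AvgPool2d(pool_size)
        self.attn = attention                      # operates on (B, n, d) tokens
        self.up = nn.Upsample(scale_factor=pool_size, mode='nearest')

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        y = self.norm(x.flatten(2).transpose(1, 2))          # LayerNorm on tokens
        y = y.transpose(1, 2).reshape(b, c, h, w)
        x1, x2 = torch.chunk(y, 2, dim=1)                    # channel segmentation
        y1 = self.dwconv(x1)                                 # local branch
        p = self.pool(x2)                                    # (B, C/2, H/s, W/s)
        tokens = p.flatten(2).transpose(1, 2)                # (B, n, C/2)
        a = self.attn(tokens).transpose(1, 2).reshape(p.shape)
        y2 = self.up(a)                                      # back to (H, W)
        return x + torch.cat([y1, y2], dim=1)                # splice and residual
```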
In one embodiment, the structure of the attention module is shown in FIG. 7. Inputting the pooling segmentation feature map into the attention branch of the convolution-attention parallel block of the first CPA module of the second stage of the feature extraction network, and processing it with average pooling and the attention module to obtain an attention branch feature map, comprises: inputting the pooling segmentation feature map into the attention branch and average-pooling it to obtain a pooled feature map X; applying the embedding matrices W_K, W_Q, W_V to X by pointwise convolution to obtain keys K = XW_K, queries Q = XW_Q and values V = XW_V; and obtaining from K, Q and V the attention branch feature map, expressed as:
Y = Linear(Softmax(Q)((Softmax(K^T)V)W))    (1)
wherein Y is the attention branch feature map, Q is the query, K is the key, V is the value, W is a learnable parameter matrix, Softmax() is an activation function, and Linear() is a linear transformation (preferably implemented with a fully connected layer).
Specifically, the original multi-head attention mechanism first generates the corresponding Query, Key and Value, then produces an n × n attention matrix from the dot product of Query and Key, and finally takes the dot product with Value:
Attention(Q, K, V) = Softmax(QK^T / sqrt(d))V    (2)
the process usually consumes a large amount of computing resources (video memory) due to the large size of the input features, which brings difficulty to the training and deployment of the network, and meanwhile, as the network continuously deepens, the attention features gradually become similar or even the same, which causes that the effective content for representing learning cannot be learned by the self-attention mechanism of the model in the deep layer, and prevents the model from obtaining the expected effect.
The original multi-head attention module is improved in the invention, and the original multi-head attention module is subjected to linearization measures and added with a learnable matrix method to prevent attention collapse.
In the attention Module (preferably, the head number of the attention Module is 8), the feature map (size R) is pooled and segmented n×d ) As input to the attention module, first the corresponding key K = XW is generated K Query Q = XW Q Sum value V = XW V Then calculating to obtain Softmax (Q), softmax (K) T ) At this time, softmax (K) T ) And performing dot product on the attention feature and the value V matrix to obtain a semantic matrix with the size of d multiplied by d, performing dot product on the semantic matrix with the size of d multiplied by d and a learnable matrix W, performing dot product on the semantic matrix with Softmax (Q), and performing linear transformation on dot product results to obtain an attention branch feature map. Here it can be noted that the complexity is from the original O (n) 2 ) To O (d) 2 ) The complexity can be effectively reduced, and meanwhile, the learnable matrix is added in the middle, so that the association between the patches in a larger range can be learnt.
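A single-head PyTorch sketch of this linearized attention follows (the patent prefers 8 heads). Applying the softmax over the channel axis for Q and over the token axis for K^T, and initializing the learnable matrix W to the identity, are assumptions the text leaves open.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Linearized attention with a learnable matrix:
    Y = Linear(Softmax(Q)((Softmax(K^T)V)W)).
    Computing Softmax(K^T)V first yields a d x d semantic matrix,
    so the cost is O(d^2) rather than O(n^2) in the token count n."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)  # embedding W_Q
        self.to_k = nn.Linear(dim, dim, bias=False)  # embedding W_K
        self.to_v = nn.Linear(dim, dim, bias=False)  # embedding W_V
        self.W = nn.Parameter(torch.eye(dim))        # learnable d x d matrix
        self.out = nn.Linear(dim, dim)               # final Linear()

    def forward(self, x):                  # x: (B, n, d)
        q = self.to_q(x).softmax(dim=-1)   # Softmax(Q), channel axis (assumed)
        k = self.to_k(x).softmax(dim=1)    # Softmax(K^T), token axis (assumed)
        v = self.to_v(x)
        ctx = k.transpose(1, 2) @ v        # (B, d, d) semantic matrix
        return self.out(q @ (ctx @ self.W))
```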
In one embodiment, the number of ADC modules in the third stage of the feature extraction network is 4 and the number of CPA modules is 3.
In one embodiment, the number of ADC modules in the fourth stage of the feature extraction network is 1 and the number of CPA modules is 3.
In one embodiment, the classification network includes an average pooling layer and a full connectivity layer.
In a specific embodiment, the structure of the remote sensing image classification model (ADC-CPANet) based on the CNN-self-attention mechanism hybrid architecture is shown in fig. 2, and its parameters are listed in table 1.
TABLE 1 Main parameters of the remote sensing image classification model (the table is reproduced as an image in the original publication)
Training samples input into the remote sensing image classification model first undergo downsampling and channel-dimension lifting through the 3 same-scale convolution layers, then multi-scale feature extraction through the 4 stages, and finally the classifier outputs the label. ADC-CPANet stacks several ADC modules followed by several CPA modules in each stage; the latter three stages use an (ADC module × N + CPA module × 3) stacking pattern. Unlike the traditional CNN-Transformer hybrid architecture, which places the Transformer modules only at the end, the architecture of the invention can learn local and global information in the shallow, middle and deep layers of the feature map. Before each stage, downsampling with 2 × 2 average pooling of stride 2 and a 1 × 1 convolution changes the number of channels. The first stage extracts feature information with two ADC modules; the second stage with (ADC module × 3 + CPA module × 3); the third stage with (ADC module × 4 + CPA module × 3); and the fourth stage with (ADC module × 1 + CPA module × 3). In general, the most direct way to improve network performance is to increase depth and width: the proposed hybrid stacking increases network depth, ensuring that deep information can be learned, while the CPA module increases network width, ensuring that feature information can be learned at different scales. The hybrid stacking also gives the features global dependency relationships and lets local information supply the details.
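Putting the pieces together, the sketch below shows the overall ADC-CPANet skeleton implied by this description. ADCBlock and CPABlock are simple stand-ins for the modules sketched earlier (stubs keep the snippet runnable on its own), and the stem strides and channel widths are assumptions, since Table 1 is only available as an image.

```python
import torch.nn as nn

class ADCBlock(nn.Module):
    """Stub standing in for the full ADC module sketched earlier."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
    def forward(self, x):
        return x + self.body(x)

CPABlock = ADCBlock  # stub alias; see the CPA sketch above

class ADCCPANet(nn.Module):
    """ADC-CPANet skeleton: a three-layer convolutional stem, four stages
    stacking (ADC, CPA) modules as (2,0), (3,3), (4,3), (1,3), each stage
    preceded by 2x2 stride-2 average pooling plus a 1x1 convolution, and
    an average-pool + fully connected classifier."""
    def __init__(self, num_classes=7, dims=(64, 128, 256, 512)):
        super().__init__()
        stem_cfg = [(3, 32, 2), (32, 32, 1), (32, dims[0], 1)]  # assumed strides
        self.stem = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(ci, co, 3, stride=s, padding=1),
                          nn.BatchNorm2d(co), nn.GELU())
            for ci, co, s in stem_cfg])
        counts = [(2, 0), (3, 3), (4, 3), (1, 3)]  # (ADC, CPA) per stage
        stages, in_ch = [], dims[0]
        for dim, (n_adc, n_cpa) in zip(dims, counts):
            blocks = [nn.AvgPool2d(2, 2), nn.Conv2d(in_ch, dim, 1)]
            blocks += [ADCBlock(dim) for _ in range(n_adc)]
            blocks += [CPABlock(dim) for _ in range(n_cpa)]
            stages.append(nn.Sequential(*blocks))
            in_ch = dim
        self.stages = nn.Sequential(*stages)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(dims[-1], num_classes))

    def forward(self, x):
        return self.head(self.stages(self.stem(x)))
```

The default `num_classes=7` matches the seven RSSCN7 categories used in the verification embodiment below.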
It should be understood that although the steps in the flowcharts of fig. 1 and 3 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 and 3 may comprise multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In a verification embodiment, the experiments were written in Python under the PyTorch framework, and the model was implemented on an NVIDIA GeForce RTX 2080Ti server. The experimental data come from the RSSCN7 DataSet, a remote sensing image data set published by Wuhan University in 2015 that is commonly used for remote sensing scene classification. It comprises 7 classes: grassland (Grass), field (Field), industrial area (Industry), river and lake (RiverLake), forest (Forest), residential area (Resident) and parking lot (Parking); Grass, Field, RiverLake and Forest represent natural elements, while Industry, Resident and Parking represent scenes of human production and life, giving the data set wide coverage. From the data set, 2800 photos were randomly selected, of which 2240 served as the training set and 560 as the test set.
To demonstrate the recognition performance of the remote sensing image classification model on this data set, comparison experiments were run against the ConvNeXt, CoAtNet and ResNet50 network models. Table 2 lists the number of parameters, the computation cost, the highest accuracy and the average of the five highest accuracies. As shown below, the proposed remote sensing image classification network achieves higher accuracy than the other networks even though its computation cost and parameter count are lower than those of the classical networks. The accuracy comparison results are shown in fig. 8.
Table 2 Comparison of results for each index (the table is reproduced as an image in the original publication)
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but as long as a combination contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application; their description is specific and detailed but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A remote sensing image classification method based on a CNN-attention mechanism hybrid architecture is characterized by comprising the following steps:
obtaining a remote sensing image, and labeling the remote sensing image to obtain a training sample;
constructing a remote sensing image classification model based on a CNN-self-attention mechanism hybrid architecture, wherein the remote sensing image classification model comprises an input network, a feature extraction network and a classification network, and the input network is used for processing training samples by adopting a plurality of convolution downsampling layers with the same scale to obtain a convolution feature map; the feature extraction network is used for extracting multi-scale global features and local features of the convolution feature map by adopting 4 stages which are connected in sequence to obtain a multi-scale feature map; the stage comprises an average pooling downsampling layer and a module stacked by a plurality of ADC modules and a plurality of CPA modules; the ADC module is constructed by adopting an asymmetric convolution group and a multi-layer perceptron-two-dimensional attention layer in a MetaFormer paradigm, and the CPA module is constructed by adopting a convolution-attention parallel block and a multi-layer perceptron-two-dimensional attention layer in the MetaFormer paradigm; the multilayer perceptron-two-dimensional attention layer is used for extracting the characteristics of the input characteristic graph by adopting the multilayer perceptron and the two-dimensional attention layer; the classification network is used for classifying the multi-scale characteristic graph to obtain a remote sensing image classification prediction result;
training the remote sensing image classification model by adopting the label of the training sample and the remote sensing image classification prediction result obtained by inputting the training sample into the remote sensing image classification model to obtain the trained remote sensing image classification model;
and inputting the remote sensing images to be classified into the trained remote sensing image classification model to obtain a remote sensing image classification result.
2. The method of claim 1, wherein the input network comprises 3 same-scale convolutional downsampling layers connected in sequence;
training the remote sensing image classification model by adopting the marking of the training sample and the remote sensing image classification prediction result obtained by inputting the training sample into the remote sensing image classification model to obtain the trained remote sensing image classification model, and the method comprises the following steps:
inputting the training sample into the input network to obtain a convolution characteristic diagram;
inputting the convolution feature map into a first stage of the feature extraction network to obtain a first-layer feature map;
inputting the first feature map into a second stage of the feature extraction network to obtain a second-layer feature map;
inputting the second feature map into a third stage of the feature extraction network to obtain a third-layer feature map;
inputting the third feature map into a fourth stage of the feature extraction network to obtain a multi-scale feature map;
inputting the multi-scale characteristic graph into a classification network to obtain a remote sensing image classification prediction result;
and carrying out reverse training on the remote sensing image classification model by adopting the marking of the training sample and the remote sensing image classification prediction result to obtain the trained remote sensing image classification model.
3. The method of claim 2, wherein the number of ADC modules in the first stage of the feature extraction network is 2 and the number of CPA modules is 0;
inputting the convolution feature map into a first stage of the feature extraction network to obtain a first-layer feature map, wherein the method comprises the following steps of:
inputting the convolution feature map into an average pooling downsampling layer of a first stage of the feature extraction network for sampling to obtain a downsampling feature map;
inputting the downsampled feature map into an asymmetric convolution group of a first ADC module of a first stage of the feature extraction network for feature extraction, and adding the extracted features and the downsampled feature map to obtain an enhanced feature map;
inputting the enhanced feature map into a multi-layer perceptron-two-dimensional attention layer of a first stage of the feature extraction network to obtain a shallow feature map fusing spatial attention and channel attention;
and inputting the shallow feature map into a second ADC module of a first stage of the feature extraction network to obtain a first-layer feature map.
4. The method of claim 3, wherein the multi-layer perceptron-two-dimensional attention layer comprises a first normalization layer, a multi-layer perceptron, and a two-dimensional attention module;
inputting the enhanced feature map into the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a shallow feature map fusing spatial attention and channel attention comprises:
inputting the enhanced feature map into the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network, and processing it with the first normalization layer to obtain a normalized enhanced feature map;
inputting the normalized enhanced feature map into the multi-layer perceptron of the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a multi-layer perception feature map;
inputting the multi-layer perception feature map into the two-dimensional attention module of the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a first two-dimensional attention feature;
and adding the enhanced feature map and the first two-dimensional attention feature to obtain the shallow feature map fusing spatial attention and channel attention.
5. The method of claim 4, wherein the two-dimensional attention module comprises a channel attention branch and a spatial attention branch;
inputting the multi-layer perception feature map into the two-dimensional attention module of the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network to obtain a first two-dimensional attention feature comprises the following steps:
inputting the multi-layer perception feature map into the channel attention branch of the two-dimensional attention module of the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network, first performing global average pooling on the multi-layer perception feature map, processing the pooled result with a fully connected layer and activating it with the GELU activation function to obtain a left-branch global feature, and processing the left-branch global feature with a fully connected layer to obtain a channel attention feature;
inputting the multi-layer perception feature map into the spatial attention branch of the two-dimensional attention module of the multi-layer perceptron-two-dimensional attention layer of the first stage of the feature extraction network, processing it with a fully connected layer and activating it with the GELU activation function to obtain a right-branch global feature, splicing the left-branch global feature with the right-branch global feature, and processing the result with a fully connected layer to obtain a spatial attention feature;
and adding the channel attention feature and the spatial attention feature, activating the sum with a Sigmoid function, and multiplying the resulting activation by the multi-layer perception feature map to obtain the first two-dimensional attention feature.
6. The method according to claim 2, wherein the number of ADC modules in the second stage of the feature extraction network is 3, the number of cpa modules is 3;
inputting the first feature map into a second stage of the feature extraction network to obtain a second-layer feature map, wherein the second-layer feature map comprises:
sampling the first feature map through an average pooling downsampling layer of a second stage of the feature extraction network, and inputting the sampled first feature map into three sequentially connected ADC modules of the second stage to obtain a convolution attention feature map;
inputting the convolution attention feature map into a first CPA module of a second stage of the feature extraction network, and performing channel segmentation after standardized layer processing to obtain a first segmentation feature and a pooling segmentation feature map;
inputting the first cut feature into a convolution branch of a convolution-attention parallel block of a first CPA module of a second stage of the feature extraction network, and processing by adopting deep convolution to obtain a deep convolution feature map;
inputting the pooling segmentation feature map into an attention branch of a convolution-attention parallel block of a first CPA module of a second stage of the feature extraction network, and obtaining an attention branch feature map after processing by an average pooling and attention module;
channel splicing is carried out on the depth convolution feature map and the attention branch feature map, then the obtained result is added to the convolution attention feature map, and the obtained result is input into a multi-layer perceptron-two-dimensional attention layer of a first CPA module of a second stage of the feature extraction network to obtain a first shallow feature map;
inputting the first shallow feature map into a second CPA module of the second stage of the feature extraction network to obtain a second deep feature map;
and inputting the second deep feature map into a third CPA module of the second stage of the feature extraction network to obtain the second-layer feature map.
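A rough PyTorch sketch of the convolution-attention parallel block recited in this claim follows. The image-style (B, C, H, W) layout, the even channel split, the pooling stride, and the nearest-neighbor upsampling used to re-align the attention branch before splicing are all assumptions made for illustration; the subsequent multilayer perceptron-two-dimensional attention layer is omitted.

```python
import torch
import torch.nn as nn

class CPAParallelBlock(nn.Module):
    """Rough sketch of the claimed convolution-attention parallel block.

    Assumes an image-style input (B, C, H, W) with an even channel count;
    the split ratio, pooling stride, and attention sub-module are
    illustrative assumptions rather than claimed details.
    """
    def __init__(self, dim: int, attention: nn.Module, pool_stride: int = 2):
        super().__init__()
        half = dim // 2
        self.norm = nn.GroupNorm(1, dim)  # stand-in for the normalization layer
        # Convolution branch: depthwise 3x3 convolution on the first half
        self.dwconv = nn.Conv2d(half, half, 3, padding=1, groups=half)
        # Attention branch: average pooling, then the attention sub-module
        self.pool = nn.AvgPool2d(pool_stride)
        self.attention = attention
        self.up = nn.Upsample(scale_factor=pool_stride, mode="nearest")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize, then split the channels into two halves
        x1, x2 = self.norm(x).chunk(2, dim=1)
        conv_out = self.dwconv(x1)                 # (B, C/2, H, W)
        # Pool, run attention over the pooled tokens, restore resolution
        p = self.pool(x2)                          # (B, C/2, H/s, W/s)
        b, c, h, w = p.shape
        t = p.flatten(2).transpose(1, 2)           # (B, hw, C/2) tokens
        t = self.attention(t)                      # token-wise attention
        attn_out = self.up(t.transpose(1, 2).reshape(b, c, h, w))
        # Channel-splice the branches, then add the residual input
        return torch.cat([conv_out, attn_out], dim=1) + x

# Smoke test with a placeholder attention sub-module:
block = CPAParallelBlock(64, attention=nn.Identity())
y = block(torch.randn(2, 64, 32, 32))  # -> (2, 64, 32, 32)
```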
7. The method of claim 6, wherein inputting the pooling segmentation feature map into the attention branch of the convolution-attention parallel block of the first CPA module of the second stage of the feature extraction network, and processing it with average pooling and the attention module to obtain the attention branch feature map comprises:
inputting the pooling segmentation feature map into the attention branch of the convolution-attention parallel block of the first CPA module of the second stage of the feature extraction network, and performing average pooling to obtain a pooled feature map X;

obtaining, through pointwise convolution with the embedding matrices W_K, W_Q and W_V, the key K = XW_K, the query Q = XW_Q and the value V = XW_V;

and obtaining the attention branch feature map from the key K = XW_K, the query Q = XW_Q and the value V = XW_V, wherein the expression of the attention branch feature map is:

Y = Linear(Softmax(Q)((Softmax(K^T)V)W))

wherein Y is the attention branch feature map, Q is the query, K is the key, V is the value, W is a learnable parameter matrix, Softmax() is the activation function, and Linear() is a linear transformation.
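The claimed expression reads as an efficient (linear) attention; a minimal sketch under that reading follows, where the softmax axes (over features for Q, over tokens for K) and the realization of the pointwise convolutions as bias-free linear layers are interpretive assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Sketch of the claimed Y = Linear(Softmax(Q)((Softmax(K^T)V)W)).

    Q, K, V come from pointwise projections, matching the claimed point
    convolution; the softmax axes are an assumption.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Embedding matrices W_Q, W_K, W_V realized as pointwise projections
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w = nn.Linear(dim, dim, bias=False)  # learnable parameter matrix W
        self.proj = nn.Linear(dim, dim)           # final Linear(...)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, n, d) tokens from the pooled feature map X
        q = F.softmax(self.w_q(x), dim=-1)        # Softmax(Q) over features
        k = F.softmax(self.w_k(x), dim=1)         # Softmax(K^T) over tokens
        v = self.w_v(x)
        context = k.transpose(1, 2) @ v           # (B, d, d) global context
        return self.proj(q @ self.w(context))     # (B, n, d)
```

Because Softmax(K^T)V collapses the token dimension into a d-by-d context matrix before Q is applied, the cost grows linearly with the number of tokens, which is presumably why the branch operates on the average-pooled feature map.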
8. The method of claim 1, wherein the number of ADC modules in the third stage of the feature extraction network is 4 and the number of CPA modules is 3.
9. The method of claim 1, wherein the number of ADC modules in the fourth stage of the feature extraction network is 1 and the number of CPA modules is 3.
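Read together, claims 6, 8 and 9 fix the per-stage block counts for stages 2 through 4; a simple way to record them is sketched below (stage 1's counts are recited earlier in the claim set and are omitted here).

```python
# Per-stage ADC/CPA block counts as recited in claims 6, 8 and 9.
stage_blocks = {
    "stage2": {"adc": 3, "cpa": 3},
    "stage3": {"adc": 4, "cpa": 3},
    "stage4": {"adc": 1, "cpa": 3},
}
```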
10. The method of claim 1, wherein the classification network comprises an average pooling layer and a fully connected layer.
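A minimal sketch of the claim-10 classification network follows; the channel width (512) and the class count (45, e.g. a RESISC45-style scene set) are illustrative guesses, not values recited in the claims.

```python
import torch
import torch.nn as nn

num_classes = 45  # illustrative; depends on the remote sensing dataset
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),      # average pooling layer
    nn.Flatten(),
    nn.Linear(512, num_classes),  # fully connected layer
)
logits = head(torch.randn(2, 512, 7, 7))  # -> (2, 45)
```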
CN202211292933.3A 2022-10-21 2022-10-21 Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture Pending CN115641473A (en)

Priority Applications (1)

Application Number: CN202211292933.3A | Priority Date: 2022-10-21 | Filing Date: 2022-10-21
Title: Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture

Applications Claiming Priority (1)

Application Number: CN202211292933.3A | Priority Date: 2022-10-21 | Filing Date: 2022-10-21
Title: Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture

Publications (1)

Publication Number: CN115641473A | Publication Date: 2023-01-24

Family ID: 84945040

Family Applications (1)

Application Number: CN202211292933.3A | Priority Date: 2022-10-21 | Filing Date: 2022-10-21
Title: Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture

Country Status (1)

Country: CN | Link: CN115641473A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115984714A * | 2023-03-21 | 2023-04-18 | Shandong University of Science and Technology | Cloud detection method based on dual-branch network model
CN115984714B * | 2023-03-21 | 2023-05-23 | Shandong University of Science and Technology | Cloud detection method based on dual-branch network model
CN116593980A * | 2023-04-20 | 2023-08-15 | Unit 93209 of the Chinese People's Liberation Army | Radar target recognition model training method, radar target recognition method and device
CN116593980B * | 2023-04-20 | 2023-12-12 | Unit 93209 of the Chinese People's Liberation Army | Radar target recognition model training method, radar target recognition method and device
CN116958752A * | 2023-09-20 | 2023-10-27 | Economic and Technology Research Institute of State Grid Hubei Electric Power Co., Ltd. | Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN116958752B * | 2023-09-20 | 2023-12-15 | Economic and Technology Research Institute of State Grid Hubei Electric Power Co., Ltd. | Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM

Similar Documents

Publication Title
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN115641473A (en) Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture
CN112906720B (en) Multi-label image identification method based on graph attention network
Gao et al. MLNet: Multichannel feature fusion lozenge network for land segmentation
CN107679250A (en) A kind of multitask layered image search method based on depth own coding convolutional neural networks
CN107451565B (en) Semi-supervised small sample deep learning image mode classification and identification method
CN106023065A (en) Tensor hyperspectral image spectrum-space dimensionality reduction method based on deep convolutional neural network
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN110210534B (en) Multi-packet fusion-based high-resolution remote sensing image scene multi-label classification method
CN113378854A (en) Point cloud target detection method integrating original point cloud and voxel division
Su et al. LodgeNet: Improved rice lodging recognition using semantic segmentation of UAV high-resolution remote sensing images
Yin et al. Attention-guided siamese networks for change detection in high resolution remote sensing images
US11941865B2 (en) Hyperspectral image classification method based on context-rich networks
CN109492610B (en) Pedestrian re-identification method and device and readable storage medium
Antwi-Bekoe et al. A deep learning approach for insulator instance segmentation and defect detection
CN109657082B (en) Remote sensing image multi-label retrieval method and system based on full convolution neural network
CN108388904B (en) Dimensionality reduction method based on convolutional neural network and covariance tensor matrix
CN113034506A (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN114842351A (en) Remote sensing image semantic change detection method based on twin transforms
CN112580480A (en) Hyperspectral remote sensing image classification method and device
Abbas et al. Deep neural networks for automatic flower species localization and recognition
CN114898157A (en) Global learning device and method for hyperspectral image classification
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN117475216A (en) Hyperspectral and laser radar data fusion classification method based on AGLT network
CN116758415A (en) Lightweight pest identification method based on two-dimensional discrete wavelet transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination