CN115545166A - Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof - Google Patents

Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof

Info

Publication number
CN115545166A
CN115545166A
Authority
CN
China
Prior art keywords
remote sensing
features
layer
sensing image
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211342737.2A
Other languages
Chinese (zh)
Inventor
王坤
杜景林
高文凯
杨陆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202211342737.2A priority Critical patent/CN115545166A/en
Publication of CN115545166A publication Critical patent/CN115545166A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an improved ConvNeXt convolutional neural network and a method for classifying remote sensing images with it. The method comprises: obtaining a remote sensing image and processing it into a 224 × 224 × 3 image; performing down-sampling, global feature extraction and local feature extraction on the processed remote sensing image, then inputting the result into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000; in parallel, performing context information modeling, random node sampling and graph convolution on the processed remote sensing image, then inputting the result into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000; fusing the first feature vector and the second feature vector using an addition strategy to obtain fused features; and inputting the fused features into a fully connected layer and a Softmax classification layer to predict the final classification result.

Description

Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof
Technical Field
The invention relates to an improved ConvNeXt convolutional neural network and a remote sensing image classification method thereof, and belongs to the technical field of image classification.
Background
High-resolution remote sensing image scene classification, that is, automatically assigning a fixed semantic label to each scene image, is an important component of remote sensing data processing and is widely applied in urban planning, disaster emergency response, land use, environmental monitoring and other fields. Early remote sensing image classification mainly relied on hand-crafted features: experts had to carefully design and explicitly extract features suited to the characteristics of different scenes, which were then encoded for the classification task. Such features, however, are typically low-level dense features containing a large amount of redundant information, which degrades classification accuracy.
ConvNeXt is currently one of the best-performing models in image classification. In its macroscopic design, the computation distribution across stages is optimized, and the Patchify operation from ViT replaces the initial down-sampling operation. Following the grouped convolution of ResNeXt, depthwise separable convolution is used to reduce the parameter count while the number of channels is widened to compensate for the capacity loss. The inverted bottleneck structure of MobileNetV2 is adopted to avoid information loss, and a 7 × 7 convolution kernel replaces the 3 × 3 kernel to obtain a larger receptive field. However, when extracting features, ConvNeXt assigns the same weight to all channels, which limits the classification performance of the algorithm, and it cannot accurately extract local features and long-distance spatial features.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an improved ConvNeXt convolutional neural network and a remote sensing image classification method based on it, which can effectively fuse local key features and long-distance spatial features for classification.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, the invention provides an improved ConvNeXt convolutional neural network which consists, from front to back, of a Conv1 layer, layer normalization, a ConvNeXt network, an attention mechanism module and a fully connected layer, with a parallel branch of a random node sampler, a graph convolution network and a fully connected layer;
the ConvNeXt network comprises Stage 1 to Stage 4, each of which comprises a plurality of ConvNeXt blocks.
In a second aspect, the invention provides a method for classifying remote sensing images, applied to the above improved ConvNeXt convolutional neural network, comprising:
acquiring a remote sensing image and processing it into a 224 × 224 × 3 image;
performing down-sampling, global feature extraction and local feature extraction on the processed remote sensing image, then inputting the result into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000;
performing context information modeling, random node sampling and graph convolution on the processed remote sensing image, then inputting the result into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000;
fusing the first feature vector and the second feature vector using an addition strategy to obtain fused features;
and inputting the fused features into a fully connected layer and a Softmax classification layer to predict the final classification result.
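The two-branch pipeline above can be sketched end to end at the shape level; the weights, class count and random branch outputs below are hypothetical placeholders standing in for the trained branches, not the patent's parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical branch outputs: each branch ends in a 1 x 1 x 1000 vector.
f_cnn = rng.standard_normal(1000)   # ConvNeXt + attention branch
f_gcn = rng.standard_normal(1000)   # graph-convolution branch

# Addition fusion strategy described above.
f_fused = f_cnn + f_gcn

# Final fully connected layer (placeholder weights) and Softmax prediction.
n_classes = 21                       # e.g. the UCM dataset has 21 scene classes
W = rng.standard_normal((1000, n_classes)) * 0.01
logits = f_fused @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = int(np.argmax(probs))         # predicted scene class index
```

The addition fusion keeps the fused vector at the same 1000 dimensions as each branch, so the final classifier size is independent of the number of branches.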
Further, performing down-sampling, global feature extraction and local feature extraction on the processed remote sensing image and then inputting the result into the average pooling layer and the fully connected layer to obtain a first feature vector of size 1 × 1 × 1000 comprises:
down-sampling the processed remote sensing image to obtain a 56 × 56 × 96 feature map;
extracting global features from the down-sampled 56 × 56 × 96 feature map using depthwise convolution;
extracting local features from the feature map using an attention mechanism after the global features are extracted, obtaining a feature map containing both global and local features;
and inputting the feature map containing global and local features into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000.
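The 224 × 224 × 3 to 56 × 56 × 96 reduction follows from a Patchify-style 4 × 4, stride-4 stem (the ConvNeXt design referenced earlier); a minimal shape check, where the 96 output channels are taken as the ConvNeXt-T default rather than a value derived here:

```python
# Output spatial size of a convolution: floor((size + 2*pad - kernel)/stride) + 1.
def conv_out(size, kernel, stride, pad=0):
    return (size + 2 * pad - kernel) // stride + 1

# Patchify stem: 4x4 kernel, stride 4, no padding -> 224 becomes 56.
assert conv_out(224, kernel=4, stride=4) == 56
# Later 2x2 stride-2 down-sampling layers would halve the map again: 56 -> 28.
assert conv_out(56, kernel=2, stride=2) == 28
```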
Further, performing context information modeling, random node sampling and graph convolution on the processed remote sensing image and then inputting the result into the average pooling layer and the fully connected layer to obtain a second feature vector of size 1 × 1 × 1000 comprises:
modeling the context information of the processed remote sensing image through a graph structure to obtain image spatial information;
constructing a vertex set from the pixel points in the remote sensing image, determining the relations between vertices according to the image spatial information, and constructing an adjacency graph;
inputting the adjacency graph into a random node sampler, and repeatedly sampling vertices in the adjacency graph until all vertices have been sampled, generating a group of subgraphs;
inputting the subgraphs into a graph convolution network and extracting their context features;
and inputting the subgraphs with extracted context features into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000.
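The sampler step above can be sketched as follows. The patent does not specify the sampling distribution, so drawing batches of at most M vertices from the not-yet-covered set (which guarantees every vertex is eventually sampled) is an assumption:

```python
import numpy as np

def sample_subgraphs(num_vertices, m, rng):
    # Shuffle all vertex indices, then cut them into chunks of at most m;
    # every vertex ends up in exactly one subgraph, so the whole graph
    # is covered after the loop.
    order = rng.permutation(num_vertices).tolist()
    return [sorted(order[i:i + m]) for i in range(0, num_vertices, m)]

subgraphs = sample_subgraphs(num_vertices=10, m=4,
                             rng=np.random.default_rng(0))
```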
Further, extracting local features from the feature map using an attention mechanism after the global features are extracted, obtaining a feature map containing both global and local features, comprises:
splitting the feature map with extracted global features into S parts and then extracting spatial information by grouped convolution with multi-scale convolution kernels, where the convolution kernel size K and the number of groups G are set as follows:
Figure BDA0003916972800000031
after extracting the spatial information, concatenating the feature maps of all parts to obtain a multi-scale fusion feature map; the whole process is calculated as:
F_i = Conv(K_i, K_i, G_i)(X_i),  i = 0, ..., S−1
F = Concat([F_0, ..., F_{S−1}])
obtaining channel-level attention vector scale features using ECA after the attention mechanism module;
re-calibrating the channel-level attention vector scale features with a Softmax function and applying the corrected attention vector to the multi-scale fusion feature map to obtain a feature map with richer multi-scale information;
and feeding the feature map with richer multi-scale information into a pooling layer and a fully connected layer, finally obtaining a feature vector F_{CNN,AM} ∈ R^1000 containing global and local features.
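The split-and-concatenate flow of the multi-scale step can be sketched as below. The (K, G) schedule used here is the one published for the SPC module in EPSANet (K_i = 2(i+1)+1, G_i = 2^((K_i−1)/2)); the patent defines its own schedule only in an unreproduced figure, so this schedule is an assumption:

```python
import numpy as np

C, S = 96, 4                          # channels and number of splits
x = np.zeros((C, 56, 56))             # placeholder feature map
parts = np.split(x, S, axis=0)        # S parts of C/S channels each

# Per-part kernel sizes and group counts (EPSANet schedule, assumed).
kernels = [2 * (i + 1) + 1 for i in range(S)]     # 3, 5, 7, 9
groups = [2 ** ((k - 1) // 2) for k in kernels]   # 2, 4, 8, 16

# After each part is convolved at its own scale, the parts are concatenated
# back into the multi-scale fusion feature map (convolutions omitted here).
fused = np.concatenate(parts, axis=0)
```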
Further, constructing a vertex set from the pixel points in the remote sensing image, determining the relations between vertices according to the image spatial information and constructing an adjacency graph comprises:
constructing a vertex set V from the pixel points in the remote sensing image, where the edge set E consists of the relations between any two vertices v_i and v_j, and constructing the adjacency graph G(V, E);
describing the relations between vertices with an adjacency matrix A, where the weight a_{i,j} of an edge in the adjacency matrix is obtained from the following Gaussian kernel function:
a_{i,j} = exp(−‖x_i − x_j‖² / σ²)
where x_i and x_j are the feature vectors associated with vertices v_i and v_j, and σ is the width parameter of the function.
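A minimal sketch of the edge-weight computation, assuming the weight function is the standard Gaussian (RBF) kernel exp(−‖x_i − x_j‖² / σ²) that the description of σ as a width parameter suggests:

```python
import numpy as np

def adjacency(X, sigma=1.0):
    # Pairwise squared distances between vertex feature vectors, mapped
    # through the Gaussian kernel to edge weights a_ij.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)

# Two vertices at unit distance: weight exp(-1) off the diagonal.
A = adjacency(np.array([[0.0, 0.0], [1.0, 0.0]]))
```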
Further, inputting the subgraphs into a graph convolution network and extracting their context features comprises:
inputting each subgraph into the graph convolution network, which aggregates the features between a vertex v and all vertices u ∈ V_s to propagate neighborhood relations; the propagation equation of vertex v at the l-th layer is defined as:
H^(l+1) = h( D̂^(−1/2) Â D̂^(−1/2) H^(l) W^(l) + b )
where s denotes the s-th subgraph and the s-th batch of network training, W is a parameter matrix, h(·) is an activation function, b is a bias parameter, Â = A + I is the adjacency matrix with self-connections, and D̂ is the degree matrix of Â, defined as:
D̂_{i,i} = Σ_j Â_{i,j}
where i, j index rows and columns;
and concatenating the output results of all subgraphs to obtain the subgraphs after context feature extraction.
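One propagation step can be sketched with the standard Kipf and Welling renormalisation, which matches the self-connected adjacency and degree matrices described above; ReLU stands in for the unspecified activation h(·):

```python
import numpy as np

def gcn_layer(A, H, W, b=0.0):
    # H' = h(D^{-1/2} (A + I) D^{-1/2} H W + b) with h = ReLU.
    A_hat = A + np.eye(A.shape[0])            # adjacency with self-connections
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W + b, 0.0)

# Two connected vertices, 3-dim input features, 4-dim output features.
out = gcn_layer(np.array([[0.0, 1.0], [1.0, 0.0]]),
                np.ones((2, 3)), np.ones((3, 4)))
```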
In a third aspect, the invention provides a device for classifying remote sensing images, comprising modules configured for:
acquiring a remote sensing image and processing it into a 224 × 224 × 3 image;
performing down-sampling, global feature extraction and local feature extraction on the processed remote sensing image, then inputting the result into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000;
performing context information modeling, random node sampling and graph convolution on the processed remote sensing image, then inputting the result into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000;
fusing the first feature vector and the second feature vector using an addition strategy to obtain fused features;
and inputting the fused features into a fully connected layer and a Softmax classification layer to predict the final classification result.
In a fourth aspect, the present invention provides a device for classifying remote sensing images, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate according to the instructions to execute the steps of the method of any of the above aspects.
In a fifth aspect, the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the method of any of the above aspects.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides an improved ConvNeXt convolutional neural network and a classification method of remote sensing images thereof, which can fuse global feature information of different scales and assign different weights to the importance degrees of different channel feature maps, so that a model can more easily extract distinguishable features, and long-distance spatial information in the remote sensing images can be effectively modeled; the invention has the characteristics of high classification accuracy, small calculation parameter quantity and high speed.
Drawings
FIG. 1 is a structural diagram of SPCECA provided in an embodiment of the present invention;
FIG. 2 is a block diagram of an improved ConvNeXt convolutional neural network provided by an embodiment of the present invention;
FIG. 3 is a flow chart of an improved ConvNeXt convolutional neural network and a method for classifying remote sensing images thereof according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a ConvNeXt convolution structure and a downsampling layer structure according to an embodiment of the present invention;
fig. 5 is a structural diagram of an improved ConvNeXt convolutional neural network according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
As shown in fig. 5, this embodiment introduces an improved ConvNeXt convolutional neural network consisting, from front to back, of a Conv1 layer, layer normalization, a ConvNeXt network, an attention mechanism module and a fully connected layer, with a parallel branch of a random node sampler, a graph convolution network and a fully connected layer;
the ConvNeXt network comprises Stage 1 to Stage 4, each of which comprises a plurality of ConvNeXt blocks.
The network comprises two branches. Specifically, an image of size 224 × 224 × 3 is input into the convolution branch, where a first down-sampling layer produces a 56 × 56 × 96 feature map that is fed into a ConvNeXt block; the ConvNeXt block extracts global features through depthwise convolution, after which the feature map containing global features is input into the local feature extraction module, which uses the attention mechanism (AM) to extract local features. Features are then extracted in the same way through 3 further layers of this module, after which an average pooling layer and a fully connected layer produce a feature vector of size 1 × 1 × 1000. In parallel, the image is input into the context feature extraction module, which models the context information of the deep image through a graph structure; after a random node sampler and two GCN layers, an average pooling layer and a fully connected layer produce a feature vector of size 1 × 1 × 1000. Finally, the classification output is obtained by feature fusion.
Example 2
This embodiment provides a method for classifying remote sensing images, applied to the improved ConvNeXt convolutional neural network of embodiment 1, comprising:
acquiring a remote sensing image and processing it into a 224 × 224 × 3 image;
performing down-sampling, global feature extraction and local feature extraction on the processed remote sensing image, then inputting the result into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000;
performing context information modeling, random node sampling and graph convolution on the processed remote sensing image, then inputting the result into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000;
fusing the first feature vector and the second feature vector using an addition strategy to obtain fused features;
and inputting the fused features into a fully connected layer and a Softmax classification layer to predict the final classification result.
As shown in fig. 1 to fig. 3, the application process of the improved ConvNeXt convolutional neural network and the classification method of the remote sensing image thereof provided by this embodiment specifically involves the following steps:
the method comprises the following steps: input of 224X 3 remote sensing image
The invention is tested on remote sensing scene classification data sets of different scales, namely UC Merced Land-Use (UCM) and the Aerial Image Dataset (AID). The selected data sets contain multiple types of scene images, each scene class containing up to thousands of images, and all selected images are processed into 224 × 224 × 3 images for input.
Step two: down-sample the input image
The image from step one is input into a down-sampling layer, yielding a 56 × 56 × 96 feature map. The down-sampling layer structure is shown in fig. 4 (right).
Step three: model deep image context information through a graph structure
After the image is input in step one, it is fed into the context feature extraction module in parallel with the down-sampling, and long-distance spatial information is modeled with a graph structure using the GCN method. As can be seen from fig. 4, a vertex set V is constructed from the pixel points in the remote sensing image, the edge set E consists of the relations between any two vertices v_i and v_j, and the adjacency graph G(V, E) is constructed. The relations between vertices are described with an adjacency matrix A, where the weight a_{i,j} of an edge in the adjacency matrix is obtained from the following Gaussian kernel function:
a_{i,j} = exp(−‖x_i − x_j‖² / σ²)
where x_i and x_j are the feature vectors associated with vertices v_i and v_j, and σ is the width parameter of the function.
Step four: extract global features using depthwise convolution
After the down-sampling of step two, a 56 × 56 × 96 feature map is obtained. Following the grouped convolution of ResNeXt, depthwise separable convolution is used to reduce the parameter count while the number of channels is widened to compensate for the capacity loss. The inverted bottleneck structure of MobileNetV2 is adopted to avoid information loss, and a 7 × 7 convolution kernel replaces the 3 × 3 kernel to obtain a larger receptive field. Meanwhile, the smoother GELU activation is adopted, fewer activation and normalization functions are used, and layer normalization is employed; this design lets the model train more accurately and efficiently. The network depth is increased and feature maps with a larger receptive field are extracted, better expressing the overall content of the scene image.
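The parameter saving from the depthwise-separable design above is easy to quantify (bias terms omitted; the 96 channels and the 7 × 7 kernel are taken from the text):

```python
# Standard KxK convolution vs. depthwise KxK + pointwise 1x1 convolution.
def standard_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out

std = standard_params(7, 96, 96)              # 451,584 weights
sep = depthwise_separable_params(7, 96, 96)   # 13,920 weights
```

Even with the large 7 × 7 kernel, the separable form needs roughly 3% of the weights of the standard convolution, which is why the kernel can be enlarged without blowing up the parameter count.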
Step five: extract local features using the attention mechanism (AM)
The AM noticeably improves the performance of ConvNeXt. ECA is a local cross-channel interaction attention mechanism without dimensionality reduction; the range of local cross-channel interaction is determined by a one-dimensional convolution. However, ground-object targets in remote sensing images are usually small and dispersed, and judging key areas from each feature map alone often leads to misjudgment. ECA can only capture channel information and neglects the importance of spatial information, so it needs improvement. An SPCECA mechanism is therefore proposed: an SPC module is added before ECA to split the channels and perform multi-scale feature extraction of spatial information on each channel feature map, effectively combining channel and spatial information. The channels of the input feature map are split into S parts, and spatial information is then extracted by grouped convolution with multi-scale kernels; the grouped convolution reduces the parameter count. The convolution kernel size K and the number of groups G are set as follows:
Figure BDA0003916972800000091
after extraction, all feature maps are concatenated to obtain the multi-scale fusion feature map; the whole process is calculated as:
F_i = Conv(K_i, K_i, G_i)(X_i),  i = 0, ..., S−1
F = Concat([F_0, ..., F_{S−1}])
after the SPC module, the ECA is used for obtaining the scale characteristics of the channel-level attention vector, the Softmax function is used for re-correcting the attention vector, the corrected attention vector acts on the multi-scale characteristic diagram, and the characteristic diagram with rich multi-scale information is obtained. Adding a GAP layer and a full connection layer (Fc layer) at the end of the network to finally obtain a feature vector F containing global features and local features CNN,AM ∈R 1000
Step six: sample with the random node sampler
After the context information modeling of step three, to reduce the computational cost of the GCN, a random node sampler of size M is used before each GCN iteration; the vertices in graph G are repeatedly sampled until all vertices have been sampled, generating a group of subgraphs.
Step seven: extract image context features with the GCN
After the sampling of step six, the subgraphs are input into the GCN, which comprises a GCN layer and an Fc layer. To improve the stability of the model, the adjacency matrix with self-connections Â = A + I and its degree matrix D̂ are used. The GCN propagates neighborhood relations by aggregating the features between a vertex v and all vertices u ∈ V_s; the propagation equation of vertex v at the l-th layer is defined as:
H^(l+1) = h( D̂^(−1/2) Â D̂^(−1/2) H^(l) W^(l) + b )
where s denotes the s-th subgraph and the s-th batch of network training, W is a parameter matrix, h(·) is an activation function, b is a bias parameter, and the degree matrix is defined as:
D̂_{i,i} = Σ_j Â_{i,j}
where i, j index rows and columns. The output results of all subgraphs are concatenated to obtain the final output. The miniGCN reduces the GCN complexity from O(NDP + N²D) to O(NDP + NMD), where N is the number of GCN vertices, D and P are the input and output feature dimensions, and M is the number of subgraph vertices, while reaching a better local optimum of the network. The context features output by the GCN layer are further compressed through the Fc layer, finally yielding the 1000-dimensional long-distance spatial feature F_GCN ∈ R^1000.
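The complexity reduction can be checked with illustrative sizes; the values of N, D, P and M below are hypothetical, not figures from the patent:

```python
# Full-graph GCN cost O(N*D*P + N^2*D) versus miniGCN cost O(N*D*P + N*M*D).
N, D, P, M = 10_000, 96, 96, 64   # vertices, in/out feature dims, subgraph size
full_cost = N * D * P + N * N * D
mini_cost = N * D * P + N * M * D
reduction = full_cost / mini_cost  # sampling replaces the quadratic N^2 term
```

Because M is fixed while N grows with image size, the miniGCN term N·M·D scales linearly in N instead of quadratically.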
Step eight: obtain the final classification result by feature fusion
Different network structures extract different feature representations from remote sensing images; in general, owing to a lack of feature diversity, a single model often cannot reach the best performance. Joint training of the AM-augmented ConvNeXt and the GCN enhances feature discrimination, and the features extracted by the two are fused with an addition strategy; the fused feature F_fusion is expressed as:
F_fusion = F_{CNN,AM} + F_{GCN}
The result of this formula is input into the Fc layer and the Softmax classification layer to predict the final classification result.
The experimental results of the invention compared with baseline models are shown in table 1; the baseline models comprise a classic CNN network and a vision Transformer network:
TABLE 1

Model          Accuracy   Model parameters
ViT            96.40%     89,237,152
ConvNeXt       97.25%     87,597,214
The invention  99.18%     32,754,842
The invention provides an improved ConvNeXt convolutional neural network and a remote sensing image classification method based on it. The network fuses global feature information at different scales and assigns different weights to channel feature maps according to their importance, so the model extracts distinguishable features more easily, and long-distance spatial information in remote sensing images is modeled effectively. The invention features high classification accuracy, a small number of parameters and high speed.
Example 3
This embodiment provides a remote sensing image classification device comprising modules configured to:
acquire a remote sensing image and process it into a 224 × 224 × 3 image;
sequentially perform down-sampling, global feature extraction and local feature extraction on the processed remote sensing image, then input the result into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000;
sequentially perform context information modeling, random node sampling and graph convolution on the processed remote sensing image, then input the result into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000;
fuse the first feature vector and the second feature vector using an addition strategy to obtain fused features;
and input the fused features into a fully connected layer and a Softmax classification layer to predict the final classification result.
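The data flow of the device above can be traced at the shape level. The two branch functions below are hypothetical stubs standing in for the learned ConvNeXt + AM and graph paths; only the input and output shapes are taken from the text:

```python
import numpy as np

def convnext_am_branch(img):
    # Stub: stem (224 -> 56, 96 ch), four stages with attention,
    # average pooling + Fc -> 1 x 1 x 1000 feature vector.
    assert img.shape == (224, 224, 3)
    return np.zeros((1, 1, 1000))

def graph_branch(img):
    # Stub: graph modeling, random node sampling, GCN,
    # average pooling + Fc -> 1 x 1 x 1000 feature vector.
    assert img.shape == (224, 224, 3)
    return np.zeros((1, 1, 1000))

image = np.zeros((224, 224, 3))                           # preprocessed input
fused = convnext_am_branch(image) + graph_branch(image)   # addition fusion
print(fused.shape)                                        # (1, 1, 1000)
```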
Example 4
The embodiment provides a classification device of remote sensing images, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method of embodiment 2.
Example 5
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the method of embodiment 2.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, it is possible to make various improvements and modifications without departing from the technical principle of the present invention, and those improvements and modifications should be considered as the protection scope of the present invention.

Claims (10)

1. An improved ConvNeXt convolutional neural network, characterized in that the improved ConvNeXt convolutional neural network comprises, from front to back, a Conv1 layer, layer normalization, a ConvNeXt network, an attention mechanism module and a fully connected layer, together with a parallel branch consisting of a random node sampler, a graph convolution network and a fully connected layer;
the ConvNeXt network comprises Stage1, Stage2, Stage3 and Stage4, each of which comprises a plurality of ConvNeXt blocks.
2. A method for classifying remote sensing images, applied to the improved ConvNeXt convolutional neural network of claim 1, characterized by comprising the following steps:
acquiring a remote sensing image and processing it into a 224 × 224 × 3 image;
sequentially performing down-sampling, global feature extraction and local feature extraction on the processed remote sensing image, and then inputting the result into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000;
sequentially performing context information modeling, random node sampling and graph convolution on the processed remote sensing image, and then inputting the result into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000;
fusing the first feature vector and the second feature vector using an addition strategy to obtain fused features;
and inputting the fused features into a fully connected layer and a Softmax classification layer to predict the final classification result.
3. The method for classifying remote sensing images according to claim 2, wherein performing down-sampling, global feature extraction and local feature extraction on the processed remote sensing image, and then inputting the result into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000 comprises:
down-sampling the processed remote sensing image to obtain a feature map of size 56 × 56 × 96;
extracting global features from the down-sampled 56 × 56 × 96 feature map using depthwise convolution;
extracting local features from the feature map after global feature extraction using an attention mechanism, to obtain a feature map containing both global and local features;
and inputting the feature map containing global and local features into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000.
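The first two steps of claim 3 follow the ConvNeXt design: a 4 × 4, stride-4 patchify stem takes the 224 × 224 × 3 image to a 56 × 56 × 96 feature map, and a 7 × 7 depthwise convolution mixes spatial information per channel. A minimal NumPy sketch with random illustrative weights (only the sizes and shapes are taken from the text):

```python
import numpy as np

def patchify_stem(img, W, b):
    """ConvNeXt-style stem: a 4x4 convolution with stride 4 is equivalent to
    splitting the image into non-overlapping 4x4 patches and applying one
    linear map, taking 224x224x3 -> 56x56x96."""
    H, Wd, C = img.shape                     # 224, 224, 3
    p = 4
    patches = img.reshape(H // p, p, Wd // p, p, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(H // p, Wd // p, p * p * C)   # (56, 56, 48)
    return patches @ W + b                                  # (56, 56, 96)

def depthwise_conv7x7(x, k):
    """7x7 depthwise convolution (one 7x7 kernel per channel, 'same' padding),
    the large-kernel spatial mixing used for global feature extraction."""
    Hc, Wc, Cc = x.shape
    pad = 3
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(7):
        for j in range(7):
            # shift-and-add form of per-channel cross-correlation
            out += xp[i:i + Hc, j:j + Wc, :] * k[i, j, :]
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
W = rng.standard_normal((48, 96)) * 0.1
fmap = patchify_stem(img, W, np.zeros(96))               # 56 x 56 x 96
dw = depthwise_conv7x7(fmap, rng.standard_normal((7, 7, 96)) * 0.1)
print(fmap.shape, dw.shape)                              # (56, 56, 96) (56, 56, 96)
```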
4. The method for classifying remote sensing images according to claim 2, wherein performing context information modeling, random node sampling and graph convolution on the processed remote sensing image and then inputting the result into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000 comprises:
performing context information modeling on the processed remote sensing image through a graph structure to obtain image spatial information;
constructing a vertex set from the pixel points in the remote sensing image, determining the relations between vertices according to the image spatial information, and constructing an adjacency graph;
inputting the adjacency graph into a random node sampler, and repeatedly sampling vertices in the adjacency graph until all vertices have been sampled, to generate a group of subgraphs;
inputting the subgraphs into a graph convolution network and extracting context features of the subgraphs;
and inputting the subgraphs after context feature extraction into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000.
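The repeated-sampling step in claim 4 can be sketched as follows. The vertex count and subgraph size are illustrative; a real sampler would also extract each subgraph's induced adjacency submatrix (e.g. `A[np.ix_(nodes, nodes)]`):

```python
import numpy as np

def random_node_sampler(num_vertices, subgraph_size, seed=0):
    """Repeatedly draw vertices at random (without replacement within a draw)
    until every vertex has been sampled at least once, yielding a group of
    possibly overlapping subgraph vertex lists."""
    rng = np.random.default_rng(seed)
    unseen = set(range(num_vertices))
    subgraphs = []
    while unseen:
        nodes = rng.choice(num_vertices, size=subgraph_size, replace=False)
        subgraphs.append(sorted(nodes.tolist()))
        unseen -= set(nodes.tolist())   # stop once all vertices are covered
    return subgraphs

subs = random_node_sampler(num_vertices=100, subgraph_size=20)
covered = set().union(*subs)
print(len(subs), len(covered))          # every vertex appears in some subgraph
```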
5. The method for classifying remote sensing images according to claim 3, wherein the step of extracting local features from the feature map after global features are extracted by adopting an attention mechanism to obtain the feature map containing the global features and the local features comprises the following steps:
the feature map from which global features have been extracted is split into S parts, and spatial information is then extracted by grouped convolution with multi-scale convolution kernels; the convolution kernel size K and group number G are set as follows:
K_i = 2 × (i + 1) + 1
G_i = 2^((K_i − 1) / 2)
after the spatial information is extracted, the feature maps of all parts are concatenated to obtain a multi-scale fused feature map; the whole process is calculated as follows:
F_i = Conv(K_i, K_i, G_i)(X_i),  i = 0, ..., S − 1
F = Concat([F_0, ..., F_{S−1}])
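A sketch of the split-and-concatenate structure above. The per-part grouped convolutions Conv(K_i, K_i, G_i) are stubbed out as identities to keep the sketch short; the point is the split into S channel groups, the K_i/G_i schedule, and the final concatenation. The schedule K_i = 2(i+1)+1, G_i = 2^((K_i−1)/2) is an assumption consistent with multi-scale split-attention designs:

```python
import numpy as np

def split_conv_params(S=4):
    """Kernel size and group count per split:
    K_i = 2*(i+1) + 1 and G_i = 2**((K_i - 1) / 2), for i = 0..S-1."""
    Ks = [2 * (i + 1) + 1 for i in range(S)]
    Gs = [2 ** ((k - 1) // 2) for k in Ks]
    return Ks, Gs

def multi_scale_split(x, S=4):
    """Split the C channels into S equal parts X_0..X_{S-1}; each part would be
    processed by a grouped convolution Conv(K_i, K_i, G_i) at its own scale
    (stubbed here as identity), then all parts are concatenated back."""
    parts = np.split(x, S, axis=-1)     # X_i, each H x W x C/S
    Ks, Gs = split_conv_params(S)
    feats = [p for p in parts]          # F_i = Conv(K_i, K_i, G_i)(X_i), stubbed
    return np.concatenate(feats, axis=-1), list(zip(Ks, Gs))

x = np.zeros((56, 56, 96))
F, kg = multi_scale_split(x)
print(F.shape, kg)   # (56, 56, 96) [(3, 2), (5, 4), (7, 8), (9, 16)]
```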
after passing through the attention mechanism module, a channel-level attention vector is obtained using ECA;
the channel-level attention vector is re-corrected with a Softmax function, and the corrected attention vector is applied to the multi-scale fused feature map to obtain a feature map with richer multi-scale information;
and inputting the feature map with richer multi-scale information into a pooling layer and a fully connected layer finally yields a feature vector F_CNN,AM ∈ R^1000 containing global and local features.
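ECA computes channel attention by global average pooling followed by a small 1-D convolution across neighbouring channels; the Softmax re-correction then normalises the attention vector before it reweights the fused feature map. A sketch with a fixed illustrative 1-D kernel (a trained ECA layer would learn it):

```python
import numpy as np

def eca_softmax_attention(F, k=3):
    """ECA-style channel attention: global average pooling gives one descriptor
    per channel; a 1-D convolution of size k over neighbouring channels yields
    a channel attention vector, which is re-corrected with Softmax and applied
    back onto the multi-scale fused feature map F (H x W x C)."""
    C = F.shape[-1]
    y = F.mean(axis=(0, 1))                     # squeeze: one value per channel
    pad = k // 2
    yp = np.pad(y, pad, mode='edge')
    w = np.ones(k) / k                          # illustrative fixed 1-D kernel
    att = np.array([yp[c:c + k] @ w for c in range(C)])
    att = np.exp(att - att.max())
    att /= att.sum()                            # Softmax re-correction
    return F * att                              # reweight the channels

F = np.random.default_rng(0).standard_normal((56, 56, 96))
out = eca_softmax_attention(F)
print(out.shape)   # (56, 56, 96)
```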
6. The method for classifying remote sensing images according to claim 4, wherein the step of constructing a vertex set by using pixel points in the remote sensing images, determining the relation between the vertices according to the image space information and constructing an adjacency graph comprises the steps of:
constructing a vertex set V from the pixel points in the remote sensing image, the edge set E consisting of the edges between any two vertices v_i and v_j, to construct an adjacency graph G(V, E);
describing the relations between vertices with an adjacency matrix A, the weight a_i,j of an edge in the adjacency matrix being obtained from the following function:
a_i,j = exp(−‖x_i − x_j‖² / σ²)
where x_i and x_j are the feature vectors associated with vertices v_i and v_j, and σ is the width parameter of the function.
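The weight function above is a Gaussian kernel on the vertex feature vectors. A direct NumPy implementation (the vertex count, feature dimension and σ are illustrative):

```python
import numpy as np

def gaussian_adjacency(X, sigma=1.0):
    """Adjacency weights a_ij = exp(-||x_i - x_j||^2 / sigma^2) between the
    feature vectors x_i attached to each vertex (pixel); sigma is the width
    parameter of the kernel."""
    # pairwise squared Euclidean distances via broadcasting
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(A, 0.0)   # no self-loops here; self-connections added later as A + I
    return A

X = np.random.default_rng(0).standard_normal((6, 3))   # 6 vertices, 3-d features
A = gaussian_adjacency(X)
print(A.shape)   # (6, 6) symmetric weight matrix
```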
7. The method for classifying remote sensing images according to claim 4, wherein inputting the subgraph into a graph convolution network and extracting context features of the subgraph comprises the following steps:
inputting the subgraph into a graph convolution network, the graph convolution network aggregating the features between a vertex v and all vertices u ∈ V_s to realize the transfer of neighborhood relations; the conduction equation of vertex v at the l-th layer is defined as follows:
x_v^(l+1) = h( Σ_{u ∈ V_s} Â_{v,u} W^(l) x_u^(l) + b^(l) )
where s denotes the s-th subgraph and the s-th batch of network training, W is a parameter matrix, h(·) is an activation function, b is a bias parameter, Â = D̃^(−1/2) Ã D̃^(−1/2) is the normalized adjacency matrix with self-connections Ã = A + I, and D̃ is the degree matrix of Ã, defined as follows:
D̃_{i,i} = Σ_j Ã_{i,j}
D̃_{i,j} = 0, i ≠ j
where i and j are the row and column indices;
and cascading the output results of all the subgraphs to obtain the subgraphs after context feature extraction.
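One propagation step of the conduction equation above, for a single sampled subgraph, can be sketched as follows; ReLU stands in for the activation h(·), and the graph and weights are random illustrative values:

```python
import numpy as np

def gcn_layer(A, X, W, b):
    """One graph-convolution step on a sampled subgraph: add self-connections
    (A_tilde = A + I), symmetrically normalise with the degree matrix D_tilde,
    then propagate features and apply a ReLU activation h(.)."""
    A_tilde = A + np.eye(A.shape[0])               # self-connections
    d = A_tilde.sum(axis=1)                        # diagonal of D_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # normalised adjacency
    return np.maximum(A_hat @ X @ W + b, 0.0)      # h(A_hat X W + b)

rng = np.random.default_rng(0)
A = (rng.random((8, 8)) > 0.5).astype(float)       # random 8-vertex subgraph
A = np.triu(A, 1); A = A + A.T                     # symmetric, no self-loops
X = rng.standard_normal((8, 16))                   # 16-d vertex features
W = rng.standard_normal((16, 32)) * 0.1
out = gcn_layer(A, X, W, np.zeros(32))
print(out.shape)   # (8, 32) updated vertex features
```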
8. A device for classifying remote sensing images, characterized by comprising modules configured to:
acquire a remote sensing image and process it into a 224 × 224 × 3 image;
sequentially perform down-sampling, global feature extraction and local feature extraction on the processed remote sensing image, then input the result into an average pooling layer and a fully connected layer to obtain a first feature vector of size 1 × 1 × 1000;
sequentially perform context information modeling, random node sampling and graph convolution on the processed remote sensing image, then input the result into an average pooling layer and a fully connected layer to obtain a second feature vector of size 1 × 1 × 1000;
fuse the first feature vector and the second feature vector using an addition strategy to obtain fused features;
and input the fused features into a fully connected layer and a Softmax classification layer to predict the final classification result.
9. A classification device for remote sensing images is characterized in that: comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 2 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the program when executed by a processor implements the steps of the method of any one of claims 2 to 7.
CN202211342737.2A 2022-10-31 2022-10-31 Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof Pending CN115545166A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211342737.2A CN115545166A (en) 2022-10-31 2022-10-31 Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211342737.2A CN115545166A (en) 2022-10-31 2022-10-31 Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof

Publications (1)

Publication Number Publication Date
CN115545166A true CN115545166A (en) 2022-12-30

Family

ID=84718320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211342737.2A Pending CN115545166A (en) 2022-10-31 2022-10-31 Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof

Country Status (1)

Country Link
CN (1) CN115545166A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051913A (en) * 2023-04-03 2023-05-02 吉林农业大学 Pilose antler decoction piece classification recognition model, method and system
CN116051913B (en) * 2023-04-03 2023-05-30 吉林农业大学 Pilose antler decoction piece classification recognition model, method and system
CN116403056A (en) * 2023-06-07 2023-07-07 吉林农业大学 Ginseng grading system and method
CN116403056B (en) * 2023-06-07 2023-10-20 无锡学院 Ginseng grading system and method
CN116758360A (en) * 2023-08-21 2023-09-15 江西省国土空间调查规划研究院 Land space use management method and system thereof
CN116758360B (en) * 2023-08-21 2023-10-20 江西省国土空间调查规划研究院 Land space use management method and system thereof

Similar Documents

Publication Publication Date Title
CN110188685B (en) Target counting method and system based on double-attention multi-scale cascade network
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
Wang et al. Ultra-dense GAN for satellite imagery super-resolution
CN115545166A (en) Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof
CN108288270B (en) Target detection method based on channel pruning and full convolution deep learning
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN108875076B (en) Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network
CN109284741A (en) A kind of extensive Remote Sensing Image Retrieval method and system based on depth Hash network
CN114332104B (en) Power grid power transmission scene RGB point cloud semantic segmentation multi-stage model joint optimization method
CN112329801A (en) Convolutional neural network non-local information construction method
CN115527036A (en) Power grid scene point cloud semantic segmentation method and device, computer equipment and medium
CN112766102A (en) Unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion
CN115410087A (en) Transmission line foreign matter detection method based on improved YOLOv4
CN110264483B (en) Semantic image segmentation method based on deep learning
CN115272670A (en) SAR image ship instance segmentation method based on mask attention interaction
CN114359902A (en) Three-dimensional point cloud semantic segmentation method based on multi-scale feature fusion
CN116434039B (en) Target detection method based on multiscale split attention mechanism
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
Babaee et al. Assessment of dimensionality reduction based on communication channel model; application to immersive information visualization
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN112990336B (en) Deep three-dimensional point cloud classification network construction method based on competitive attention fusion
CN115527082A (en) Deep learning small target detection method based on image multi-preprocessing
CN115165363A (en) CNN-based light bearing fault diagnosis method and system
CN115424012A (en) Lightweight image semantic segmentation method based on context information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination