CN115861824A - Remote sensing image identification method based on improved Transformer


Info

Publication number: CN115861824A (granted as CN115861824B)
Application number: CN202310155748.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 李兵, 梁嘉鸿, 王琪文, 杨露, 熊振华, 余珂
Applicant / Current assignee: Shantou University
Legal status: Active (granted)

Classifications

    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a remote sensing image identification method based on an improved Transformer. The method obtains a remote sensing image to be classified and preprocesses it, and then uses a trained improved neural network model to recognize and classify the ground-object targets in the preprocessed image. The model comprises a multi-size convolution feature extraction module, a Gaussian-weighted feature tokenizer, a Transformer encoder and a classification module: the feature extraction module extracts discriminative spatial features, the tokenizer balances the differences between categories and converts the spatial features into high-level semantic features, the encoder extracts global information, and discriminative classification information is generated in combination with the classification module, so that ground objects are classified more efficiently and accurately. The model of the invention has stronger generalization ability and higher training, recognition and classification speed, and improves the recognition accuracy of remote sensing images; it also has fewer model parameters, so it can be better deployed on small-memory devices and has high practical availability.

Description

Remote sensing image identification method based on improved Transformer
Technical Field
The invention relates to the technical field of image recognition, in particular to a remote sensing image recognition method based on an improved Transformer.
Background
Remote sensing image classification is an important branch of computer vision. With the development of deep learning, deep learning networks have been introduced into the field of remote sensing image classification to improve classification accuracy and reduce the cost of classifying remote sensing images manually. At present, most deep learning methods commonly used for remote sensing image classification are based on convolutional neural networks, and only a few are based on emerging neural networks such as the Transformer.
Disclosure of Invention
The invention aims to provide a remote sensing image recognition method based on an improved Transformer, so as to solve one or more technical problems in the prior art and at least provide a beneficial alternative or create favorable conditions.
The solution of the invention for solving the technical problem is as follows: the application provides a remote sensing image recognition method based on an improved Transformer, which comprises the following steps:
obtaining a remote sensing image to be classified containing a plurality of ground object targets, and preprocessing the remote sensing image to be classified to obtain a plurality of pixel blocks;
according to the pixel blocks, the remote sensing images to be classified are identified and classified by utilizing a trained improved neural network model, and a classification result of the ground object target is obtained;
the trained improved neural network model is obtained by training marked sample remote sensing images and corresponding marking results; the improved neural network model comprises a multi-size convolution feature extraction module, a Gaussian weighting word segmentation device, a transform encoder and a classification module which are connected in sequence;
the multi-size convolution feature extraction module is used for extracting features of the pixel blocks based on a two-dimensional convolution structure to obtain a feature graph to be processed and outputting the feature graph to the Gaussian weighted word segmentation device;
the Gaussian weighting word segmentation device is used for processing the characteristic graph to be processed through a Gaussian weighting matrix to obtain high-level semantic characteristics and outputting the high-level semantic characteristics to the transform encoder;
the Transformer encoder is used for learning the high-level semantic features through a multi-layer self-attention layer to obtain a first feature matrix, performing de-linearization processing on the first feature matrix to obtain a second feature matrix, generating a third feature matrix according to the first feature matrix and the second feature matrix and outputting the third feature matrix to the classification module;
and the classification module is used for calculating the maximum probability of the category to which the surface feature target belongs according to the third feature matrix and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability.
The invention has the beneficial effects that: a remote sensing image recognition method based on an improved Transformer is provided, in which the classification and recognition of the ground objects in a remote sensing image are achieved through an improved neural network model (the ITFormer) comprising a multi-size convolution feature extraction module, a Gaussian-weighted feature tokenizer, a Transformer encoder and a classification module. Compared with traditional convolutional neural networks, the ITFormer not only has better effectiveness, higher efficiency and stronger generalization capability, but also has a higher training speed and recognition and classification speed, and can improve the speed and accuracy of remote sensing image recognition. In addition, the ITFormer has fewer model parameters, can be better deployed on small-memory devices, reduces the unnecessary cost of occupying large amounts of memory, and has high practical availability.
Drawings
FIG. 1 is a data flow diagram of an ITFormer provided herein;
fig. 2 is a schematic structural diagram of a multi-size convolution feature extraction module provided in the present application;
FIG. 3 is a schematic structural diagram of a Gaussian weighted feature tokenizer provided in the present application;
FIG. 4 is a schematic structural diagram of a Transformer encoder provided in the present application;
FIG. 5 is a representation of the predicted plots for the ITFormer and three other comparison networks on an IP dataset;
FIG. 6 is a representation of the prediction maps of the ITFormer and three other comparison networks on a PU data set;
FIG. 7 is a schematic diagram of the ground objects selected and labeled with the Labelme software according to the present application;
FIG. 8 is the ground-truth label map corresponding to the marked ground objects, generated from the manually marked positions, as provided herein;
FIG. 9 is a schematic of a training curve of an ITFormer on a self-constructed data set;
FIG. 10 is a schematic of a training curve of GoogleNet on a self-constructed dataset;
FIG. 11 is a schematic of a training curve of MobileNet on a self-constructed data set;
FIG. 12 is a schematic of the training curve of ResNet on a self-constructed data set;
FIG. 13 is a classification representation of the ITFormer and three other comparison networks on a self-constructed data set;
FIG. 14 is a prediction plot of ITFormer and three other comparison networks on a self-constructed data set.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method is realized based on an improved Transformer network with multi-size convolution and Gaussian-weighted feature extraction, the ITFormer; the ITFormer is combined with a grid method and applied to the classification task of remote sensing images to realize the recognition and classification of the ground-object targets in a remote sensing image. In addition, the feasibility and effectiveness of the ITFormer are verified on two standard hyperspectral remote sensing datasets and a self-constructed visible-light remote sensing dataset. The experimental results show that the ITFormer has better effectiveness, higher efficiency and stronger generalization capability, with clear advantages over traditional convolutional neural networks. In addition, the ITFormer has a lighter design, can be better deployed on small-memory devices to reduce the unnecessary cost of occupying large amounts of memory, and has high practical availability.
In one embodiment of the present application, the method for identifying a remote sensing image may include, but is not limited to, the following steps.
S100, obtaining a remote sensing image to be classified containing a plurality of ground object targets, and preprocessing the remote sensing image to be classified to obtain a plurality of pixel blocks.
It should be noted that the pixel block is a two-dimensional pixel block.
And S200, recognizing and classifying the ground object target of the preprocessed remote sensing image to be classified by using the trained improved neural network model to obtain a classification result of the ground object target.
It should be noted that the model used for classifying the ground objects of the remote sensing image to be classified is a trained improved Transformer model, which is obtained by training on marked sample remote sensing images and the corresponding marking results.
One embodiment of S100 of the present application will be further described and illustrated below. A remote sensing image is the image formed when a sensor or remote sensor, far away from and not in contact with the target, detects and measures the target ground objects, acquires the reflected, radiated or scattered electromagnetic wave information in the process, and extracts and processes it. The present application refers to remote sensing data derived from visible-light remote sensing; optionally, the detection band lies within the visible-light range. After the remote sensing image to be classified is obtained, the remote sensing image to be classified is preprocessed.
Further, preprocessing the remote sensing image to be classified may include, but is not limited to, the following steps.
Define the size of the remote sensing image to be classified as M × N × d, where M is the height of the remote sensing image, N is the width of the remote sensing image, and d is the number of channels, corresponding to the spectral depth of the remote sensing image to be classified. Pixel blocks are extracted from the remote sensing image to be classified by a grid method to obtain M × N pixel blocks P ∈ R^(S×S×d), where S × S is the spatial size of each pixel block and the label information of each pixel block is determined by the original label of its central pixel. In this particular embodiment, d = 3.
Furthermore, the extraction of pixel blocks from the remote sensing image to be classified by the grid method comprises the following steps:
Define the central pixel of a pixel block as (x, y), with 1 ≤ x ≤ M and 1 ≤ y ≤ N. Extract from the remote sensing image to be classified all pixel points whose height lies between x - (S - 1)/2 and x + (S - 1)/2 and whose width lies between y - (S - 1)/2 and y + (S - 1)/2, and perform filling for the pixel points at the edge when the central pixel (x, y) lies near the image border; all the extracted pixel points together with the filled pixel points are used as the extracted pixel block.
The filling process is a filling process with a filling length of (S - 1)/2, and the central pixel (x, y) lies near the border when x < (S - 1)/2, x > M - (S - 1)/2, y < (S - 1)/2 or y > N - (S - 1)/2.
In this embodiment, since the visible-light remote sensing image is an RGB image, it has three bands, i.e. a spectral depth of three. In the classification of remote sensing images, each pixel point of the image represents one type of ground object, so classifying a remote sensing image actually means classifying each of its pixel points. Moreover, the features of a pixel point generally vary continuously with those of the surrounding pixel points, so in the preprocessing operation of the present application a grid method based on pixel blocks is used to learn the ground-object features; that is, the grid method is used to extract pixel blocks from the remote sensing image to be classified, obtaining a plurality of pixel blocks. This further facilitates the subsequent classification.
Pixel blocks are extracted from the remote sensing image to be classified by the grid method; for each central pixel the extraction covers all pixel points from height x - (S - 1)/2 to x + (S - 1)/2 and from width y - (S - 1)/2 to y + (S - 1)/2. In addition, the pixel points at the edge of a central pixel are not extracted but obtained by the filling operation. Finally, for a remote sensing image to be classified of size M × N × d, M × N pixel blocks P ∈ R^(S×S×d) are obtained.
Referring to FIG. 1, FIG. 1 is a data flow diagram of the ITFormer of the present application. In one embodiment of the present application, the structure of the ITFormer provided in the present application will be described and illustrated.
The ITFormer comprises a multi-size convolution feature extraction (MCFE) module, a Gaussian-weighted feature tokenizer, a Transformer encoder and a classification module which are connected in sequence. The multi-size convolution feature extraction module, the Gaussian-weighted feature tokenizer and the Transformer encoder are the key modules of the ITFormer. Specifically, the method comprises the following steps:
the feature extraction module is used for extracting features of the pixel blocks based on the two-dimensional convolution structure to obtain a feature graph to be processed. The Gaussian weighted feature word segmentation device is used for flattening the feature graph to be processed, and processing the flattened feature graph to be processed through the Gaussian weighted matrix to obtain high-level semantic features, so that the balance of the number of samples is realized.
The function of the Transformer encoder is to learn high-level semantic features through multiple layers of self-attention layers to obtain a first feature matrix, perform de-linearization processing on the first feature matrix to obtain a second feature matrix, generate a third feature matrix according to the first feature matrix and the second feature matrix, and output the third feature matrix to the classification module. In this application, the transform encoder carries out semantic information's classification and combination, can utilize different spatial feature to carry out discriminative characteristic identification to distinguish different ground objects better, realize more accurate classification.
And the classification module is used for calculating the maximum probability of the category to which the surface feature target belongs according to the third characteristic matrix and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability.
Further, the classification module comprises a flattening layer, a first fully connected layer and a classifier (Softmax) connected in sequence. In particular, the flattening layer functions to flatten the third feature matrix into one-dimensional features. The first full-connection layer is used for converting the one-dimensional features into the number of types corresponding to the ground object target and the label value of the type. The classifier is used for calculating the maximum probability of the category to which the surface feature target belongs through a Softmax function according to the number of the categories corresponding to the surface feature target and the label value of the category, and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability. Wherein, the value of the maximum probability is in the range of [0,1 ].
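A minimal sketch of the classification module (flattening layer, first fully connected layer and Softmax classifier) is given below, assuming PyTorch; the feature dimensions and class count are illustrative placeholders, not values taken from the patent.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Flattening layer, first fully connected layer and Softmax classifier."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.flatten = nn.Flatten()                    # third feature matrix -> one-dimensional features
        self.fc = nn.Linear(in_features, num_classes)  # one label value per ground-object category

    def forward(self, third_feature_matrix: torch.Tensor) -> torch.Tensor:
        logits = self.fc(self.flatten(third_feature_matrix))
        return torch.softmax(logits, dim=-1)           # probabilities, each in [0, 1]

# Usage: the class with the maximum probability is the predicted ground object.
head = ClassificationHead(in_features=5 * 64, num_classes=16)
probs = head(torch.randn(2, 5, 64))                    # (batch, tokens, dim) placeholder input
prediction = probs.argmax(dim=-1)
```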
Referring to FIG. 2, the multi-size convolution feature extraction (MCFE) module will be further described and illustrated. The multi-size convolution feature extraction module provided in the present application comprises a feature input layer, a two-dimensional convolution structure and a splicing layer which are connected in sequence, and the two-dimensional convolution structure is composed of four convolution layers connected in parallel. Wherein:
The role of the feature input layer is to copy a pixel block into four identical sub-pixel blocks, which are respectively input into the four convolution layers. Each convolution layer outputs a corresponding feature map according to its padding, and the feature maps output by the four convolution layers have the same size. The splicing layer is used for splicing the feature maps output by the four convolution layers into the feature map to be processed.
It should be noted that the purpose of a convolution layer is to maintain the spatial relationship between pixels through the convolution operation and to extract the feature information in an image; the result of the convolution operation is generally called a feature map. The main parameters of a convolution layer are the convolution kernel and the padding. For a two-dimensional convolution, the value at position (x, y) of the j-th feature map in the i-th layer can be given by the following formula:

v_{i,j}^{x,y} = f( b_{i,j} + \sum_m \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} w_{i,j,m}^{p,q} v_{i-1,m}^{x+p, y+q} )

where f(·) is the activation function, b_{i,j} is the bias, P_i × Q_i is the size of the two-dimensional convolution kernel, and w_{i,j,m}^{p,q} is the weight parameter at position (p, q).
As the above formula shows, the size of the convolution kernel is an important parameter of a convolution layer and directly influences the feature values it extracts. Convolution layers with different convolution kernel sizes can extract different feature information, so in the design of the multi-size convolution feature extraction module the present application uses convolution layers with four different convolution kernel sizes to extract feature information.
Further, in the convolution module, the four convolution layers are respectively a first convolution layer, a second convolution layer, a third convolution layer and a fourth convolution layer. The main parameters of each convolution layer are the convolution kernel and the padding, and the padding is chosen so that the feature maps output by the four convolution layers have the same size. Specifically:
the first winding layer is composed of two layers of 3 connected in sequence
Figure SMS_33
3 convolution kernel and convolution layer with 0 filling pixel. In this embodiment, the two convolutional layers with the same structure are adopted, so that the spatial dimensions of the feature maps output by the first convolutional layer and the other three convolutional layers are consistent.
The second convolution layer is 5
Figure SMS_34
5 convolution kernel, convolution layer with 0 fill pixel.
The third convolution layer is 7
Figure SMS_35
7 convolution kernel, convolution layer with fill pixel 1.
The fourth convolution layer is 9
Figure SMS_36
9 convolution kernel, convolution layer with 2 filled pixels.
Optionally, to make the training of the network easier and more stable, a batch normalization layer is added after each of the four convolution layers in the convolution module; that is, batch normalization layers are connected between the four convolution layers and the splicing layer. The batch normalization layer normalizes each batch of data, which accelerates the convergence of the model and, to a certain extent, alleviates the problem of scattered feature distributions in deep networks.
In this embodiment, each pixel block input to the multi-size convolution feature extraction module is first copied into four identical sub-pixel blocks, which are respectively input into the convolution layers with four different convolution kernel sizes. Four feature maps with the same output size are obtained by controlling the padding of each convolution layer. Finally, the four feature maps are spliced together and input into the Gaussian-weighted feature tokenizer to complete the subsequent feature weighting and classification processing.
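A sketch of this module under the kernel-size and padding settings listed above is shown below, assuming PyTorch; the number of output channels per branch (64) and the placement of batch normalization are assumptions consistent with the text rather than exact implementation details.

```python
import torch
import torch.nn as nn

class MultiSizeConvFeatureExtraction(nn.Module):
    """Four parallel 2-D convolution branches (3x3 + 3x3, 5x5, 7x7, 9x9) whose
    paddings (0, 0, 1, 2) all reduce the spatial size by 4, so the four feature
    maps can be concatenated ("spliced") along the channel dimension."""

    def __init__(self, in_channels: int = 3, channels: int = 64):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=0),
            nn.Conv2d(channels, channels, kernel_size=3, padding=0),
            nn.BatchNorm2d(channels),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=5, padding=0),
            nn.BatchNorm2d(channels),
        )
        self.branch7 = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=7, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.branch9 = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=9, padding=2),
            nn.BatchNorm2d(channels),
        )

    def forward(self, pixel_block: torch.Tensor) -> torch.Tensor:
        feats = [self.branch3(pixel_block), self.branch5(pixel_block),
                 self.branch7(pixel_block), self.branch9(pixel_block)]
        return torch.cat(feats, dim=1)  # splicing layer

# Usage: a 13 x 13 pixel block gives a 9 x 9 feature map with 4 * 64 channels.
out = MultiSizeConvFeatureExtraction()(torch.randn(2, 3, 13, 13))
print(out.shape)  # torch.Size([2, 256, 9, 9])
```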
Referring to FIG. 3, the multiplication symbol in the figure denotes a matrix multiplication operation. In the foregoing multi-size convolution feature extraction module, although the features extracted by the 2-D convolutions contain a large number of spatial features, spatial features alone are difficult to use to describe the characteristics of the ground-object targets. Therefore, in order to express the ground-object characteristics further and more clearly, the spatial features are treated as semantic features by the Gaussian-weighted feature tokenizer, and high-level semantic features are obtained through Gaussian-weighted processing. In other words, the Gaussian-weighted feature tokenizer further represents the spatial features extracted by the multi-size convolution feature extraction module as high-level semantic features. With reference to FIG. 3, the specific processing procedure is as follows:
First, the feature map to be processed is flattened, and the flattened feature map is defined as X ∈ R^((h·w)×c), i.e. a one-dimensional semantic feature, where h is the height, w is the width and c is the number of channels (spectral depth). Next, a Gaussian matrix W_g is defined and multiplied with X to obtain the semantic data X·W_g. A Softmax function is introduced, and the semantic data are processed by the Softmax function to obtain the weight information. The weight information is transposed and multiplied with the one-dimensional semantic feature X to obtain the final Gaussian-weighted high-level semantic features. The high-level semantic features can be expressed as:

T = softmax(X·W_g)^T · X

where T is the high-level semantic feature, X represents the flattened feature map, W_g represents the Gaussian matrix, and X·W_g represents the semantic data.
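The tokenizer can be sketched as follows, assuming PyTorch; the number of semantic tokens and the dimension along which Softmax is applied are implementation assumptions not spelled out in the text.

```python
import torch
import torch.nn as nn

class GaussianWeightedTokenizer(nn.Module):
    """Flatten the feature map to X in R^{(h*w) x c}, multiply by a normally
    initialized Gaussian matrix W_g, normalize with Softmax to get weights,
    and return T = softmax(X @ W_g)^T @ X as high-level semantic features."""

    def __init__(self, channels: int, num_tokens: int = 4):
        super().__init__()
        self.w_g = nn.Parameter(torch.randn(channels, num_tokens))  # Gaussian matrix

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feature_map.shape
        x = feature_map.flatten(2).transpose(1, 2)   # X: (batch, h*w, c)
        a = x @ self.w_g                             # semantic data: (batch, h*w, num_tokens)
        weights = torch.softmax(a, dim=1)            # weight information over spatial positions
        return weights.transpose(1, 2) @ x           # T: (batch, num_tokens, c)

# Usage on the concatenated feature map from the previous module (256 channels).
tokens = GaussianWeightedTokenizer(channels=256)(torch.randn(2, 256, 9, 9))
print(tokens.shape)  # torch.Size([2, 4, 256])
```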
Referring to FIG. 4, the Transformer encoder of one embodiment of the present application will be further described and illustrated below. The Transformer encoder comprises an embedded input layer, a multi-head attention layer, an MLP layer and an output layer which are connected in sequence, wherein the multi-head attention layer comprises a plurality of stacked self-attention layers. A regularization layer is connected between the embedded input layer and the multi-head attention layer, and another regularization layer and a first residual structure are connected between the multi-head attention layer and the MLP layer. In addition, the output of the MLP layer is connected with a second residual structure, and the second residual structure is connected with the first fully connected layer. Specifically:
the high-level semantic features are input into an embedded input layer. The role of the embedded input layer is: to high-level semantic features
Figure SMS_43
And learning flag>
Figure SMS_44
Concatenates together and connects the result of the concatenation with the location information ≧ initialized by the normal distribution>
Figure SMS_45
Adding to obtain the final input->
Figure SMS_46
. wherein ,/>
Figure SMS_47
The following formula is satisfied: />
Figure SMS_48
wherein ,
Figure SMS_49
is AND>
Figure SMS_50
Null matrices of the same type. />
Figure SMS_51
For showing
Figure SMS_52
The input order of (2).
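This step can be sketched in a few lines, assuming PyTorch; the token dimension and count are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of the embedded input layer: a learnable (initially null)
# class token T_cls is concatenated in front of the semantic tokens T, and a
# normally distributed position embedding P_pos is added.
dim, num_tokens, batch = 64, 4, 2
T = torch.randn(batch, num_tokens, dim)                          # high-level semantic features
T_cls = nn.Parameter(torch.zeros(1, 1, dim))                     # learning token (null matrix)
P_pos = nn.Parameter(torch.randn(1, num_tokens + 1, dim))        # position information
Z0 = torch.cat([T_cls.expand(batch, -1, -1), T], dim=1) + P_pos  # final input Z_0
print(Z0.shape)  # torch.Size([2, 5, 64])
```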
The role of the multi-head attention layer is as follows: the internal relevance of the final input Z_0 is learned through multiple self-attention layers, each self-attention layer producing a sub-feature matrix head_i; the first feature matrix is generated by connecting the several sub-feature matrices head_i and is output to the MLP layer. It should be noted that the output of the i-th self-attention layer satisfies the following formula:

head_i = Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V

where Q, K and V are respectively the query matrix, the key matrix and the value matrix, and d_k represents the dimension of the input.
Further, the step of obtaining Q, K and V comprises: processing the final input Z_0 through a preset shared matrix W to obtain a plurality of feature embeddings a_i, and multiplying each feature embedding a_i with three different learnable weights W_Q, W_K and W_V to obtain the query matrix, the key matrix and the value matrix.
It should be noted that the output of the multi-head attention layer, i.e. the first feature matrix, satisfies the following formula:

Z_MSA = Concat(head_1, head_2, ..., head_h) · W

where head_i, 1 ≤ i ≤ h, is the output of the i-th self-attention layer, h is the number of self-attention layers, and W is a parameter matrix.
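The two formulas above can be sketched as follows, assuming PyTorch; the head splitting and the single output projection are simplifying assumptions about details the text leaves open.

```python
import torch

def self_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def multi_head(z, w_q, w_k, w_v, w_o, num_heads):
    """Project Z_0 to Q, K, V with learnable weights, run num_heads
    self-attention heads, and mix the concatenated heads with W (w_o)."""
    b, n, d = z.shape
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    split = lambda t: t.view(b, n, num_heads, d // num_heads).transpose(1, 2)
    heads = self_attention(split(q), split(k), split(v))       # (b, heads, n, d/heads)
    concat = heads.transpose(1, 2).reshape(b, n, d)            # Concat(head_1, ..., head_h)
    return concat @ w_o                                        # first feature matrix Z_MSA

# Usage with illustrative shapes.
z = torch.randn(2, 5, 64)
w_q, w_k, w_v, w_o = (torch.randn(64, 64) for _ in range(4))
print(multi_head(z, w_q, w_k, w_v, w_o, num_heads=8).shape)    # torch.Size([2, 5, 64])
```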
The effect of the first residual structure is to alleviate the vanishing-gradient problem, i.e. to prevent the loss of accuracy caused by the improved neural network being too deep.
The function of the MLP layer is to perform de-linearization processing on the first feature matrix to obtain the second feature matrix.
The function of the second residual structure is to connect the first feature matrix output by the multi-head attention layer and the second feature matrix output by the MLP layer to obtain the third feature matrix. The third feature matrix satisfies:

Z_out = Z_MSA + Z_MLP

where Z_out is the third feature matrix, Z_MSA is the first feature matrix and Z_MLP is the second feature matrix.
Further, the MLP layer includes a second fully connected layer, a GELU nonlinear activation function layer, and a third fully connected layer, which are connected in sequence. It should be noted that a dropout layer is connected after each of the second fully connected layer and the third fully connected layer; these dropout layers are not labeled in FIG. 4.
The first feature matrix output by the multi-head attention layer is input into the MLP layer. The function of the second fully connected layer is to compress the number of channels of the first feature matrix Z_MSA to one eighth of the original number. The function of the GELU nonlinear activation function layer is to perform de-linearization on the channel-compressed first feature matrix through the GELU function, obtaining the de-linearized first feature matrix. The function of the third fully connected layer is to restore the number of channels of the de-linearized first feature matrix to the original number of channels, generating the second feature matrix Z_MLP.
In this embodiment, in the Transformer encoder, the high-level semantic features are first connected with a learnable null token, i.e. the learning token T_cls, and then added to the position information P_pos initialized from a normal distribution to obtain the final input Z_0. The multiple self-attention layers in the multi-head attention layer learn Z_0 to obtain the sub-feature matrices head_i, from which the first feature matrix Z_MSA is obtained. The MLP layer performs de-linearization on the first feature matrix Z_MSA to obtain the second feature matrix Z_MLP. Z_MSA and Z_MLP are connected through the second residual structure to form the third feature matrix Z_out, which is output to the first fully connected layer.
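A sketch of one such encoder block, assuming PyTorch, is given below; the token dimension, the head count and the exact placement of the normalization and dropout layers are assumptions consistent with, but not fully specified by, the description above.

```python
import torch
import torch.nn as nn

class ITFormerEncoderBlock(nn.Module):
    """Learnable class token + normally distributed position embedding,
    multi-head self-attention with a residual connection, and an MLP that
    compresses the channels to 1/8, applies GELU, and restores them, with
    dropout after the second and third fully connected layers."""

    def __init__(self, dim: int = 64, num_tokens: int = 4, heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))               # learning token T_cls
        self.pos_embed = nn.Parameter(torch.randn(1, num_tokens + 1, dim))  # position information P_pos
        self.norm1 = nn.LayerNorm(dim)                                      # regularization layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                                      # regularization layer
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // 8), nn.Dropout(p_drop),  # second fully connected layer
            nn.GELU(),                                     # de-linearization
            nn.Linear(dim // 8, dim), nn.Dropout(p_drop),  # third fully connected layer
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b = tokens.size(0)
        z0 = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1) + self.pos_embed
        h = self.norm1(z0)
        z_msa, _ = self.attn(h, h, h)            # first feature matrix
        z_msa = z_msa + z0                       # first residual structure
        z_mlp = self.mlp(self.norm2(z_msa))      # second feature matrix
        return z_msa + z_mlp                     # third feature matrix (second residual structure)

# Usage on the tokens produced by the Gaussian-weighted tokenizer.
print(ITFormerEncoderBlock(dim=256)(torch.randn(2, 4, 256)).shape)  # torch.Size([2, 5, 256])
```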
Based on the above embodiments, and referring again to FIG. 1, the data flow of the present application is described and illustrated below, taking the classification of the ground-object targets of a remote sensing image to be classified as an example. First, in the preprocessing, pixel blocks are extracted from the M × N × d remote sensing image to be classified by the grid method, obtaining M × N pixel blocks P, and these pixel blocks are input into the ITFormer.
In the multi-size convolution feature extraction module, the feature input layer first copies each pixel block into four identical sub-pixel blocks; feature extraction is performed by the convolution layers with different convolution kernels and paddings to obtain four feature maps, and the splicing layer outputs the feature map to be processed formed by the four feature maps.
In the Gaussian-weighted feature tokenizer, the feature map to be processed is first flattened into the one-dimensional semantic features X; X is multiplied with the Gaussian matrix W_g, and the weight information is obtained through the Softmax function; the weight information is then transposed and multiplied with X to obtain the high-level semantic features T.
In the Transformer encoder, the high-level semantic features T are first processed with the learning token T_cls and the position information P_pos to obtain the final input Z_0. Then the multi-head attention layer learns Z_0 to obtain several sub-feature matrices head_i, which form the first feature matrix. The first feature matrix is de-linearized by the MLP layer to obtain the second feature matrix. Then the first feature matrix and the second feature matrix are connected through the second residual structure to obtain the third feature matrix.
In the classification module, the third feature matrix is sent into the first full-connection layer after being subjected to flattening layer processing to obtain the number of types corresponding to the surface feature target and the label value of the surface feature target, and the classification result of the surface feature target in the remote sensing image to be classified is obtained through calculation of a Softmax function.
Based on the above embodiments, the validity of the ITFormer of the present application is verified and explained below through Example 1 and Example 2. First, in order to better train and verify the ITFormer of the present application, four evaluation parameters relevant to remote sensing image classification are selected to evaluate the classification performance of the trained ITFormer: the overall accuracy (OA), the average accuracy (AA), the Kappa coefficient and the per-class accuracy (EA).
The overall accuracy OA represents the number of correctly classified test pixels divided by the total number of test pixels. Define the number of correctly classified pixels of class i as x_i, the number of test pixels of class i as N_i, the number of categories as C, and the total number of pixels in the test set as N. EA represents the percentage of samples in each class that are accurately classified, so the per-class accuracy satisfies:

EA_i = x_i / N_i

The overall accuracy OA can then be calculated as:

OA = (x_1 + x_2 + ... + x_C) / N

The average accuracy AA represents the average class accuracy, obtained by dividing the sum of the per-class accuracies by the number of classes:

AA = (EA_1 + EA_2 + ... + EA_C) / C

The Kappa coefficient is a statistical measure of the agreement between the ground-truth map and the predicted classification map, a large value indicating strong consistency. The Kappa coefficient can be expressed as:

Kappa = (OA - P_e) / (1 - P_e)

where P_e is the agreement expected by chance, computed from the confusion matrix.
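A sketch of these evaluation parameters computed from a confusion matrix is given below (NumPy); the variable names and the chance-agreement term used for Kappa are standard choices rather than quotations from the patent.

```python
import numpy as np

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Per-class accuracy EA, overall accuracy OA, average accuracy AA and
    the Kappa coefficient, all derived from the confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1

    n = cm.sum()                                  # total number of test pixels
    ea = np.diag(cm) / cm.sum(axis=1)             # EA_i = correct in class i / pixels of class i
    oa = np.diag(cm).sum() / n                    # OA = correctly classified / total
    aa = ea.mean()                                # AA = sum of EA_i / number of classes
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement P_e
    kappa = (oa - pe) / (1 - pe)
    return ea, oa, aa, kappa

# Usage with dummy labels for 3 classes.
ea, oa, aa, kappa = classification_metrics(np.array([0, 1, 2, 2]), np.array([0, 1, 2, 1]), 3)
```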
example 1:
the ITFormer of the application is trained and verified on two standard hyperspectral remote sensing image data sets, the recognition result of the ITFormer is compared with the recognition results of three excellent convolutional neural networks Resnet, mobilenetV3 and GoogleNet in the field, the Resnet, mobilenetV3 and GoogleNet are collectively called as comparison networks in the embodiment, and the effectiveness of the ITFormer is verified through visual analysis.
The two standard datasets are the Indian Pines (IP) dataset and the Pavia University (PU) dataset. The IP dataset contains imagery of the Indian Pines test site; 20 water-absorption bands are discarded, and the corrected image contains 200 spectral bands, a size of 145 × 145 pixels, 16 different classes of vegetation and land cover, and a spatial resolution of 20 m. The PU dataset contains imagery of a city acquired by a reflective optics spectrographic imaging system; the data size is 610 × 340 pixels, and the data contain 103 spectral bands, 9 different classes of urban land cover and a spatial resolution of 1.3 m. The label types and numbers of the IP dataset and the PU dataset are shown in Table 1 below, in which "Class" is the class, "Training" is the partitioned training set and "Test" is the partitioned test set.
Table 1: Types and numbers of data labels of the IP dataset and the PU dataset
[Table 1 is available only as an image in the original publication.]
The ITFormer is trained under the PyTorch framework, using the IP dataset and the PU dataset respectively for training and verification. Each dataset is randomly partitioned into 10% as the training set and the remaining 90% as the test set. The network uses a stochastic gradient descent (SGD) optimizer, with an initial learning rate of 0.005, a momentum of 0.8, a batch size of 16, a pixel block size of 13, a maximum of 100 training epochs and a cross-entropy loss function (CrossEntropyLoss). The ITFormer is trained according to the parameters defined above, and ResNet, MobileNetV3 and GoogleNet are trained at the same time.
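The training configuration above can be sketched as follows, assuming PyTorch; the model stand-in, the dummy tensors and the class count are placeholders, not the actual ITFormer or datasets.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

num_classes = 16
model = nn.Sequential(nn.Flatten(), nn.Linear(13 * 13 * 3, num_classes))  # stand-in for the ITFormer

blocks = torch.randn(1000, 3, 13, 13)                 # pixel blocks (placeholder data)
labels = torch.randint(0, num_classes, (1000,))
dataset = TensorDataset(blocks, labels)
n_train = int(0.1 * len(dataset))                     # 10% training / 90% test split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.8)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                              # maximum of 100 training epochs
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```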
After the model training is finished, the verification stage begins. The verification stage of the present application is divided into two parts: evaluation parameter analysis and visual result analysis. First, the evaluation parameters described above, namely the overall accuracy OA, the average accuracy AA, the Kappa coefficient and the per-class accuracy EA, are used to complete the verification of the model. The models are then used to render prediction maps of the datasets for visual analysis. Specifically:
(1) Evaluation parameter analysis:
first is the validation of the IP data set.
The classification results of the ITFormer and the other comparison networks on the IP dataset are shown in Table 2 below. As can be seen from Tables 1 and 2, the ITFormer proposed in this application achieves the best results, with the highest OA, AA and Kappa coefficient, and obtains the best value for 9 of the 16 land-cover classes in terms of EA. It is worth noting that the other three networks (ResNet, MobileNetV3, GoogleNet) perform poorly on the small-sample classes (No. 1 Alfalfa, No. 9 Oats) because of the extreme sample imbalance of the IP dataset. In particular, MobileNetV3 reaches a class accuracy of only 19.51% on the No. 1 Alfalfa class, which seriously affects its final classification result. In the ITFormer proposed in this application, the Gaussian-weighted feature tokenizer balances the features of the ground-object samples, so that the learned features of each class are not seriously affected by the number of samples of that class and the final accuracy does not deteriorate; therefore, the ITFormer of the present application performs best on the two small-sample classes, No. 1 Alfalfa and No. 9 Oats. This demonstrates that the ITFormer proposed in this application can also perform efficiently on a dataset with extremely imbalanced samples.
Table 2: classification results of different methods on IP data sets
[Table 2 is available only as an image in the original publication.]
Then verification of the PU data set.
The classification results of the ITFormer, ResNet, MobileNetV3 and GoogleNet on the PU dataset are shown in Table 3 below. As can be seen from Tables 1 and 3, unlike the IP dataset, the PU dataset has fewer ground-object classes, a larger number of samples and a relatively more balanced distribution. Therefore, the ITFormer, ResNet, MobileNetV3 and GoogleNet can all achieve good classification results on the PU dataset. However, the ITFormer provided in the present application includes a multi-size convolution feature extraction module, which can extract the features of the ground objects of the remote sensing image, and a Transformer encoder, which can perform global feature learning; consequently, compared with the other three convolutional neural networks, the ITFormer obtains the best classification results on the PU dataset. The performance of the ITFormer on the PU dataset again demonstrates that the ITFormer can achieve the best effectiveness on different datasets.
Table 3: classification of PU data sets by different methods
[Table 3 is available only as an image in the original publication.]
(2) Visual result analysis:
referring to fig. 5-6, fig. 5 illustrates predicted graphical representations of four different methods on an IP dataset. FIG. 6 shows the predicted graphical representation of four different methods on a PU data set. Wherein, (a) is GoogleNet, (b) is MobilenetV3, (c) is Resnet, and (d) is ITFormer.
As can be seen from fig. 5: all four methods have certain salt and pepper points and continuous classification error areas. The MobileNetV3 performs the worst, and has a large number of salt and pepper spots and misclassification areas. The ITFormer has the best performance, has fewer classification error areas and can distinguish various ground objects more discriminatively. Moreover, classification error points of the ITFormer are concentrated in the edge zone of the region, and further the classification continuity and effectiveness of the ITFormer are proved. Similarly, the PU data set prediction graph of FIG. 6 also demonstrates the effectiveness of the ITFormer proposed in the present application, and the ITFormer has the least classification error salt and pepper points and has classification continuous effectiveness.
The verification stage proves that the ITFormer provided by the application has superiority, advancement and effectiveness.
Example 2:
In order to further verify the effectiveness of the proposed ITFormer, the present application also trains and verifies the ITFormer on a self-constructed dataset. In this example, the experimental data is a self-constructed dataset covering the streets of a certain city and containing seven different ground-object categories: roads, forests, houses, lakes and marshes, lawns, bare soil and coastal beaches. First, the whole map of the city streets is downloaded with the BigeMap software, the data is then cut with the ENVI software, and the cut data is converted into a data map of size 2500 × 2500 × 3.
After the original dataset is obtained through the above steps, the dataset needs to be manually pre-classified. In this step, the Labelme software is used to select and mark parts of the ground objects. The seven ground-object categories are determined during the manual marking: ground objects are selected at random positions and the true label categories are assigned in the cut data-map image. FIG. 7 shows the ground objects selected and marked with the Labelme software.
After the selection of the ground objects is completed, JSON files containing the ground-object marks are obtained; these JSON files with the ground-object category information are then converted by a program into ground-truth label map files that correspond to the real ground objects and can be used directly. FIG. 8 shows the ground-truth label map corresponding to the marked ground objects, generated from the manually marked positions.
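A sketch of such a conversion program is shown below, assuming polygon annotations in the standard Labelme JSON layout and an illustrative name-to-id mapping; Labelme's own conversion scripts could be used instead, and all names here are assumptions rather than the patent's actual tooling.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

CLASS_IDS = {"road": 1, "forest": 2, "house": 3, "lake_marsh": 4,
             "lawn": 5, "bare_soil": 6, "beach": 7}   # 0 = unlabeled (illustrative mapping)

def labelme_json_to_label_map(json_path: str) -> np.ndarray:
    """Rasterize the polygons of one Labelme annotation file into a label map."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    label_map = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(label_map)
    for shape in ann["shapes"]:
        class_id = CLASS_IDS.get(shape["label"], 0)
        polygon = [tuple(pt) for pt in shape["points"]]
        draw.polygon(polygon, fill=class_id)          # fill the marked region with its class id
    return np.array(label_map)

# Usage: label_map = labelme_json_to_label_map("street_scene.json")
```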
In order to fully verify the effectiveness, efficiency and generalization of the ITFormer, the ITFormer is built under the PyTorch framework together with the comparison networks ResNet, MobileNetV3 and GoogleNet, and the self-constructed dataset is used as the verification dataset of the models, so as to fully verify the effectiveness, efficiency and generalization of the proposed model. The self-constructed dataset is randomly divided into 30% as the training set and the remaining 70% as the test set. The label types of the dataset and the corresponding numbers of pixels are shown in Table 4 below.
Table 4: self-building dataset tag types and numbers
[Table 4 is available only as an image in the original publication.]
In this example, during training, the ITFormer and the other three comparison networks use the Adam optimizer, with an initial learning rate of 0.001, a learning-rate decay of 0 and a batch size of 64. The CPU of the device used for the training and verification experiments is an AMD Ryzen 7 3700X 8-Core Processor, and the GPU is an NVIDIA GeForce RTX 3060 graphics card with 12 GB of GPU memory.
In terms of the other parameter settings, the sample remote sensing images in the self-constructed dataset are preprocessed in a manner substantially consistent with S100. However, unlike Example 1, the size of the pixel block is set to 21 × 21 × 3 in this example, i.e. the spatial size of the pixel block is 21 × 21. The convolution kernel size of each convolution layer in the multi-size convolution feature extraction module is set to 64, i.e. the convolution kernels in the first to fourth convolution layers are all of size 8 × 8. The maximum number of training epochs is set to 100.
With respect to verification, the evaluation parameters used are the overall accuracy OA, the average accuracy AA, the Kappa coefficient and the per-class accuracy EA described above. In addition, the prediction maps of the ITFormer and the other three comparison networks are plotted for visual comparison and analysis. Further, this example verifies the ITFormer in three parts: evaluation parameter analysis, visual effect analysis and loss (overhead) analysis. The specific verification process is as follows:
(1) Evaluation parameter analysis:
table of classification results for ITFormer and the other three comparative networks as shown in table 5 below. As can be seen from tables 4 and 5: the best performing is GoogleNet, while the proposed ITFormer reaches the second highest in three main evaluation parameters of global accuracy OA, average accuracy AA and Kappa coefficient. It is worth noting that ITFormer has 3 types of ground features to achieve the highest class accuracy.
Table 5: classification results of different methods on self-constructed datasets
[Table 5 is available only as an image in the original publication.]
Referring to fig. 9-12, fig. 9-12 show training graphs of the ITFormer and three other comparative networks, including training loss curves, training and testing accuracy curves, and evaluation parameter curves for each network, respectively. (a) is a training loss curve; (b) is a training and testing accuracy curve; and (c) is an evaluation parameter curve.
As can be seen from FIGS. 9 to 12, MobileNetV3 performs the worst in terms of convergence speed, requiring more than 40 epochs to begin to converge, while the remaining three methods converge to around their highest accuracy within about 20 epochs. From the test accuracy curves of the four methods, GoogleNet, based on the deep Inception structure, is the most stable, followed by ResNet, based on the residual structure, and the ITFormer proposed in this application; MobileNetV3, based on inverted residuals and SE modules, is the least stable. The reason is that GoogleNet uses multi-layer, multi-size convolutions, which improve the stability of the model, and ResNet adopts a deep residual structure, which better alleviates the vanishing-gradient problem. The ITFormer provided in this application adopts multi-size convolutions to extract discriminative spatial features and converts the spatial features into high-level semantic features, so it can explore long-term dependency information and possesses generalization capability and stability.
In addition, as shown in the curves (b) of fig. 9 to 12, the verification accuracy of the ITFormer proposed by the present application can have a higher value than the training accuracy, and the verification accuracy of the other three comparison networks is below the training accuracy. Therefore, the ITFormer has higher generalization capability and can better adapt to the situation of less data volume in reality.
(2) Visual effect analysis:
referring to FIG. 13, FIG. 13 shows classification performance of ITFormer and three other comparative networks on a self-constructed data set. Wherein, (a) is GoogleNet, (b) is MobilenetV3, (c) is Resnet, and (d) is ITFormer. As can be seen from fig. 13: the best performance is ITFormer, which has fewer salt and pepper points on the classification chart and does not have large-area classification error continuous areas. Therefore, the ITFormer with the global information capable of being better learned can have better effect on the performance of classification. And the misclassification areas of the other three comparison networks are relatively concentrated and have more salt and pepper spots. In addition, the four models described above were used for the rendering of a prediction map of the self-constructed data set for visual analysis. Referring to FIG. 14, FIG. 14 illustrates a prediction graph of different methods on a self-constructed data set. Wherein, (a) is GoogleNet, (b) is MobilenetV3, (c) is Resnet, and (d) is ITFormer. As can be seen from fig. 14: of the four methods, the ITFormer achieves the best results in fewer misclassified regions, relatively continuous regions and relatively few salt and pepper spots. Therefore, the ITFormer provided by the application is proved to have superiority, advancement and effectiveness.
(3) Loss (overhead) analysis:
although the accuracy of the model is the most important, the training loss, the prediction loss and the memory loss of the model are also important indexes for evaluating the superiority of the model. In order to guarantee the fairness of verification, the verification experiment that this application provided is based on going on under same experiment platform and the same laboratory glassware. Experimental results training and testing times and model sizes on self-constructed data sets for ITFormer and the other three comparative networks are shown in table 6 below.
Table 6: training and testing time and model size on self-building data sets by different methods
[Table 6 is available only as an image in the original publication.]
As can be seen from Table 6, the training time, testing time and model parameters of the ITFormer are all the lowest, and its number of parameters is only 1/8 of that of ResNet, so the ITFormer can be better deployed on devices with small memory. Meanwhile, the training time of the ITFormer is only half of that of ResNet, and its testing time is about half of that of ResNet. Although the accuracy of the ITFormer is 0.1% lower than that of GoogleNet, this is tolerable. Therefore, the ITFormer has clear advantages in terms of efficiency and light weight, can be better deployed on lightweight devices, and is more suitable for actual working scenarios.
In summary, the present application provides a Transformer network with multi-size convolution and Gaussian-weighted feature extraction, the ITFormer, for remote sensing image classification. The multi-size convolution module better extracts discriminative spatial features, and the Gaussian-weighted feature tokenizer better balances the differences between categories and converts the spatial features into high-level semantic features. Finally, the Transformer-encoder-based classification module extracts global information and generates discriminative classification information, so that ground objects are classified more efficiently and accurately. In the experiments on the hyperspectral remote sensing datasets, the proposed ITFormer achieves the best results compared with excellent convolutional neural networks (ResNet, MobileNetV3 and GoogleNet). On the visible-light remote sensing dataset, although GoogleNet achieves a slightly better classification result, the difference between the ITFormer and GoogleNet is only 0.1%, while the ITFormer has the fastest training and test classification speed and an extremely small number of model parameters. Therefore, the ITFormer has better effectiveness, efficiency and generalization when applied to actual remote sensing image classification, and its relatively low deployment cost makes it more suitable for practical applications.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (10)

1. The remote sensing image identification method based on the improved Transformer is characterized by comprising the following steps of:
obtaining a remote sensing image to be classified containing a plurality of ground object targets, and preprocessing the remote sensing image to be classified to obtain a plurality of pixel blocks;
according to the pixel blocks, the remote sensing images to be classified are identified and classified by utilizing a trained improved neural network model, and a classification result of the ground object target is obtained;
the trained improved neural network model is obtained by training on marked sample remote sensing images and the corresponding marking results; the improved neural network model comprises a multi-size convolution feature extraction module, a Gaussian-weighted feature tokenizer, a Transformer encoder and a classification module which are connected in sequence;
the multi-size convolution feature extraction module is used for extracting features of the pixel blocks based on a two-dimensional convolution structure to obtain a feature map to be processed and outputting the feature map to the Gaussian-weighted feature tokenizer;
the Gaussian-weighted feature tokenizer is used for processing the feature map to be processed through a Gaussian weighting matrix to obtain high-level semantic features and outputting the high-level semantic features to the Transformer encoder;
the Transformer encoder is used for learning the high-level semantic features through a plurality of self-attention layers to obtain a first feature matrix, performing de-linearization processing on the first feature matrix to obtain a second feature matrix, generating a third feature matrix according to the first feature matrix and the second feature matrix, and outputting the third feature matrix to the classification module;
and the classification module is used for calculating the maximum probability of the category to which the surface feature target belongs according to the third feature matrix and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability.
2. The improved Transformer-based remote sensing image recognition method according to claim 1, wherein the classification module comprises a flattening layer, a first full-connected layer and a classifier which are connected in sequence;
the flattening layer is used for flattening the third feature matrix into one-dimensional features;
the first full-connection layer is used for converting the one-dimensional features into the number of types corresponding to the surface feature target and the label value of the type;
the classifier is used for calculating the maximum probability of the category to which the surface feature target belongs according to the number of the categories corresponding to the surface feature target and the label value of the category, and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability;
wherein, the value of the maximum probability is in the range of [0,1 ].
3. The remote sensing image recognition method based on the improved Transformer according to claim 1, wherein preprocessing the remote sensing image to be classified to obtain a plurality of pixel blocks comprises:
defining the size of the remote sensing image to be classified as M × N × d, wherein M is the height of the remote sensing image to be classified, N is the width of the remote sensing image to be classified, and d is the number of channels, corresponding to the spectral depth of the remote sensing image to be classified; and extracting pixel blocks from the remote sensing image to be classified by a grid method to obtain M × N pixel blocks P ∈ R^(S×S×d), wherein the label information of each pixel block is determined by the original label of its central pixel and S × S is the spatial size of the pixel block.
4. The remote sensing image recognition method based on the improved Transformer as claimed in claim 3, wherein extracting pixel blocks from the remote sensing image to be classified by using the grid method comprises:
defining the central pixel of a pixel block as $p_{i,j}$, with $i \in [1, M]$ and $j \in [1, N]$;
extracting from the remote sensing image to be classified the region whose height ranges from $i-(S-1)/2$ to $i+(S-1)/2$ and whose width ranges from $j-(S-1)/2$ to $j+(S-1)/2$;
for central pixels $p_{i,j}$ lying at the edge of the image, filling the missing pixel points, and taking all the extracted pixel points together with the filled pixel points as the extracted pixel block;
wherein the filling process uses a filling length of $(S-1)/2$, and the edges of the block centred on the central pixel $p_{i,j}$ are $i-(S-1)/2$, $i+(S-1)/2$, $j-(S-1)/2$ and $j+(S-1)/2$.
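The grid extraction of claims 3 and 4 can be pictured with the NumPy sketch below: one $S \times S \times d$ block per pixel, with a border of $(S-1)/2$ filled pixels so that edge pixels also receive full blocks. The zero fill value and the odd $S$ are assumptions, since the claims do not state them.

```python
import numpy as np

def extract_pixel_blocks(image: np.ndarray, S: int) -> np.ndarray:
    """Grid-method sketch: `image` has shape (M, N, d); returns M*N blocks of shape (S, S, d)."""
    assert S % 2 == 1, "assumed odd so that (S - 1) / 2 is an integer filling length"
    r = (S - 1) // 2                                                    # filling length (S - 1) / 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="constant")   # fill edge pixels (assumed zeros)
    M, N, d = image.shape
    blocks = np.empty((M, N, S, S, d), dtype=image.dtype)
    for i in range(M):
        for j in range(N):
            # window centred on pixel (i, j): heights i-r..i+r and widths j-r..j+r of the original image
            blocks[i, j] = padded[i:i + S, j:j + S, :]
    return blocks.reshape(M * N, S, S, d)                               # M x N pixel blocks, labelled by their central pixel
```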
5. The remote sensing image recognition method based on the improved Transformer as claimed in claim 1, wherein the multi-size convolution feature extraction module comprises a feature input layer, a two-dimensional convolution structure and a splicing layer which are connected in sequence, and the two-dimensional convolution structure is formed by connecting four convolution layers in parallel; wherein:
the feature input layer is used for copying the pixel block into four identical sub-pixel blocks, the four sub-pixel blocks are respectively input into the four convolution layers, each convolution layer outputs a corresponding feature map according to its own padding, and the feature maps output by the four convolution layers have the same size; the splicing layer is used for splicing the feature maps output by the four convolution layers into the feature map to be processed;
wherein the four convolution layers are a first convolution layer, a second convolution layer, a third convolution layer and a fourth convolution layer;
the first convolution layer comprises two sequentially connected convolution layers each with a $3 \times 3$ convolution kernel and a padding of 0; the second convolution layer is a convolution layer with a $5 \times 5$ convolution kernel and a padding of 0; the third convolution layer is a convolution layer with a $7 \times 7$ convolution kernel and a padding of 1; and the fourth convolution layer is a convolution layer with a $9 \times 9$ convolution kernel and a padding of 2.
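A sketch of the two-dimensional convolution structure of claim 5, assuming the "splicing" is channel-wise concatenation and picking an arbitrary number of output channels per branch. With these kernel/padding pairs each branch shrinks the spatial size by the same 4 pixels, so the four feature maps can indeed be spliced.

```python
import torch
import torch.nn as nn

class MultiSizeConvExtractor(nn.Module):
    """Four parallel convolution branches; `out_ch` per branch is an illustrative assumption."""
    def __init__(self, in_ch: int, out_ch: int = 16):
        super().__init__()
        self.branch3 = nn.Sequential(                   # first layer: two stacked 3x3 convolutions, padding 0
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=0),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=0),
        )
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=0)   # second layer
        self.branch7 = nn.Conv2d(in_ch, out_ch, kernel_size=7, padding=1)   # third layer
        self.branch9 = nn.Conv2d(in_ch, out_ch, kernel_size=9, padding=2)   # fourth layer

    def forward(self, x):                               # the feature input layer copies x into four branches
        outs = [self.branch3(x), self.branch5(x), self.branch7(x), self.branch9(x)]
        return torch.cat(outs, dim=1)                   # splicing layer: all outputs share the same spatial size
```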
6. The remote sensing image recognition method based on the improved Transformer as claimed in claim 1, wherein processing the feature map to be processed through the Gaussian weighting matrix to obtain the high-level semantic features comprises:
flattening the feature map to be processed into one-dimensional semantic features $X_f \in \mathbb{R}^{(h \cdot w) \times c}$, where $h$ is the height, $w$ is the width, and $c$ is the spectral depth (number of bands) of the image;
defining a Gaussian matrix $G$, multiplying the Gaussian matrix by the one-dimensional semantic features, and obtaining weight information through a Softmax function;
transposing the weight information and multiplying it by the one-dimensional semantic features to obtain the Gaussian-weighted high-level semantic features, i.e. the high-level semantic features satisfy $T = \mathrm{Softmax}(G X_f)^{\mathsf T} X_f$.
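The Gaussian weighting of claim 6 amounts to an attention-like pooling of the flattened feature map. The sketch below assumes the Gaussian matrix is a learnable parameter initialised from a normal distribution and that it produces a fixed number of tokens; both choices, and the softmax dimension, go beyond what the claim states.

```python
import torch
import torch.nn as nn

class GaussianTokenizer(nn.Module):
    """Sketch of the Gaussian weighting word segmentation device (claim 6)."""
    def __init__(self, channels: int, num_tokens: int = 4):
        super().__init__()
        self.G = nn.Parameter(torch.randn(num_tokens, channels))    # assumed Gaussian matrix of shape (L, c)

    def forward(self, feature_map):                                  # (B, c, h, w) feature map to be processed
        X = feature_map.flatten(2).transpose(1, 2)                   # (B, h*w, c) one-dimensional semantic features
        weights = torch.softmax(X @ self.G.t(), dim=1)               # weight information via Softmax
        tokens = weights.transpose(1, 2) @ X                         # transpose and multiply back: (B, L, c)
        return tokens                                                # Gaussian-weighted high-level semantic features
```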
7. The remote sensing image recognition method based on the improved Transformer, wherein the Transformer encoder comprises an embedded input layer, a multi-head attention layer and an MLP layer which are connected in sequence; a regularization layer is connected between the embedded input layer and the multi-head attention layer, and another regularization layer and a first residual structure are connected between the multi-head attention layer and the MLP layer;
the high-level semantic features are input into the embedded input layer; the embedded input layer is used for concatenating the high-level semantic features with a preset learning mark, and adding the concatenation result to position information initialized from a normal distribution to obtain the final input; wherein the final input satisfies:
$z_0 = [x_{\mathrm{class}};\, T] + E_{\mathrm{pos}}$
where $T$ is the high-level semantic features, $x_{\mathrm{class}}$ is the learning mark, a null matrix of the same type as $T$, and $E_{\mathrm{pos}}$ is the position information, used to represent the input order of $[x_{\mathrm{class}};\, T]$;
the multi-head attention layer comprises a plurality of mutually stacked self-attention layers and is used for learning the internal correlation of the final input through the plurality of self-attention layers to obtain sub-feature matrices $Z_i$, $i = 1, \dots, H$; the sub-feature matrices are concatenated to generate the first feature matrix, which is output to the MLP layer;
the MLP layer is used for performing de-linearization processing on the first feature matrix to obtain the second feature matrix; a second residual structure is connected behind the MLP layer and is used for adding the first feature matrix and the second feature matrix to obtain the third feature matrix, i.e. the third feature matrix satisfies:
$Z_{\mathrm{out}} = Z_{\mathrm{MSA}} + Z_{\mathrm{MLP}}$
where $Z_{\mathrm{out}}$ is the third feature matrix, $Z_{\mathrm{MSA}}$ is the first feature matrix, and $Z_{\mathrm{MLP}}$ is the second feature matrix.
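Claim 7 describes a pre-norm Transformer encoder block plus a class-token input. The sketch below is one way to realise it, assuming LayerNorm for the "regularization layer" and PyTorch's built-in multi-head attention; the exact placement of the residuals relative to the first, second and third feature matrices is an interpretation, not a quotation of the patent.

```python
import torch
import torch.nn as nn

class EmbeddedInput(nn.Module):
    """Prepends the learning mark and adds normally initialised position information."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))               # learning mark: null matrix like T
        self.pos = nn.Parameter(torch.randn(1, num_tokens + 1, dim))  # E_pos drawn from a normal distribution

    def forward(self, T):                                             # (B, L, dim) high-level semantic features
        cls = self.cls.expand(T.size(0), -1, -1)
        return torch.cat([cls, T], dim=1) + self.pos                  # z0 = [x_class; T] + E_pos

class EncoderBlock(nn.Module):
    """Multi-head attention + MLP with the two residual structures of claim 7."""
    def __init__(self, dim: int, heads: int, mlp: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                                # regularization layer before attention
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                                # regularization layer before the MLP
        self.mlp = mlp                                                # see the sketch after claim 8

    def forward(self, z):
        zn = self.norm1(z)
        a, _ = self.attn(zn, zn, zn)
        z1 = z + a                                                    # first residual -> "first feature matrix"
        z2 = self.mlp(self.norm2(z1))                                 # de-linearised "second feature matrix"
        return z1 + z2                                                # second residual -> "third feature matrix"
```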
8. The remote sensing image recognition method based on the improved Transformer as claimed in claim 7, wherein the MLP layer comprises a second fully connected layer, a GELU non-linear activation function layer and a third fully connected layer which are connected in sequence, and a dropout layer is connected behind each of the second fully connected layer and the third fully connected layer;
the first feature matrix output by the multi-head attention layer is input into the MLP layer; the second fully connected layer is used for compressing the number of channels of the first feature matrix to one eighth of the original number of channels; the GELU non-linear activation function layer is used for performing de-linearization processing on the channel-compressed first feature matrix through the GELU function to obtain a de-linearized first feature matrix; and the third fully connected layer is used for restoring the number of channels of the de-linearized first feature matrix to the original number of channels to generate the second feature matrix.
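A minimal sketch of the MLP layer of claim 8. The dropout probability and the placement of each dropout layer immediately after its fully connected layer are assumptions read off the claim wording.

```python
import torch.nn as nn

def make_mlp(dim: int, p_drop: float = 0.1) -> nn.Sequential:
    """Compress channels to 1/8, de-linearise with GELU, then restore the channel count."""
    hidden = dim // 8
    return nn.Sequential(
        nn.Linear(dim, hidden),    # second fully connected layer: compress to one eighth
        nn.Dropout(p_drop),        # dropout behind the second fully connected layer
        nn.GELU(),                 # de-linearisation via the GELU function
        nn.Linear(hidden, dim),    # third fully connected layer: restore the original channels
        nn.Dropout(p_drop),        # dropout behind the third fully connected layer
    )
```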
9. The remote sensing image recognition method based on the improved Transformer as claimed in claim 7, wherein the output of the $i$-th self-attention layer satisfies the following formula:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{Q K^{\mathsf T}}{\sqrt{d_k}}\right) V$
where $Q$, $K$ and $V$ are respectively the query matrix, the key matrix and the value matrix, and $d_k$ denotes the dimension of the input;
the steps of obtaining the query matrix, the key matrix and the value matrix comprise: processing the final input through a shared matrix $W$ to obtain a plurality of feature embeddings $a_i$, and multiplying each feature embedding $a_i$ by three different learnable weights $W^{Q}$, $W^{K}$ and $W^{V}$ to obtain the query matrix, the key matrix and the value matrix.
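The formula in claim 9 is the usual scaled dot-product attention; below is a small functional sketch, with the shared-matrix step and all shapes assumed.

```python
import math
import torch

def self_attention(x, Wq, Wk, Wv):
    """One self-attention layer: x is (B, L, d) feature embeddings a_i obtained from the
    shared matrix W; Wq, Wk, Wv are (d, d_k) learnable weights (assumed shapes)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                        # query, key and value matrices
    d_k = Q.size(-1)                                        # dimension of the input
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # Q K^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V                # Softmax(.) V
```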
10. The remote sensing image recognition method based on the improved Transformer as claimed in claim 9, wherein the output of the multi-head attention layer satisfies the following formula:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W$
where $\mathrm{head}_i$, $i = 1, \dots, H$, is the output of the $i$-th self-attention layer, $H$ is the number of self-attention layers, and $W$ is a parameter matrix.
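And the multi-head combination of claim 10, reusing the `self_attention` sketch above; the list of per-layer weight triples and the shape of the projection matrix `W` are assumptions.

```python
import torch

def multi_head_attention(x, head_weights, W):
    """Concatenate the H self-attention outputs and project them with the parameter matrix W."""
    heads = [self_attention(x, Wq, Wk, Wv) for (Wq, Wk, Wv) in head_weights]   # head_1 ... head_H
    return torch.cat(heads, dim=-1) @ W                                        # Concat(head_1, ..., head_H) W
```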
CN202310155748.8A 2023-02-23 2023-02-23 Remote sensing image recognition method based on improved Transformer Active CN115861824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310155748.8A CN115861824B (en) 2023-02-23 2023-02-23 Remote sensing image recognition method based on improved Transformer

Publications (2)

Publication Number Publication Date
CN115861824A true CN115861824A (en) 2023-03-28
CN115861824B CN115861824B (en) 2023-06-06

Family

ID=85658758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310155748.8A Active CN115861824B (en) 2023-02-23 Remote sensing image recognition method based on improved Transformer

Country Status (1)

Country Link
CN (1) CN115861824B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190171862A1 (en) * 2017-12-05 2019-06-06 Transport Planning and Research Institute Ministry of Transport Method of extracting image of port wharf through multispectral interpretation
CN108614997A (en) * 2018-04-04 2018-10-02 南京信息工程大学 A kind of remote sensing images recognition methods based on improvement AlexNet
US20200250428A1 (en) * 2019-02-04 2020-08-06 Farmers Edge Inc. Shadow and cloud masking for remote sensing images in agriculture applications using a multilayer perceptron
CN110796105A (en) * 2019-11-04 2020-02-14 中国矿业大学 Remote sensing image semantic segmentation method based on multi-modal data fusion
US20220130145A1 (en) * 2019-12-01 2022-04-28 Pointivo Inc. Systems and methods for generating of 3d information on a user display from processing of sensor data for objects, components or features of interest in a scene and user navigation thereon
CN111860235A (en) * 2020-07-06 2020-10-30 中国科学院空天信息创新研究院 Method and system for generating high-low-level feature fused attention remote sensing image description
CN112084842A (en) * 2020-07-28 2020-12-15 北京工业大学 Hydrological remote sensing image target identification method based on depth semantic model
US20220308848A1 (en) * 2021-03-25 2022-09-29 Microsoft Technology Licensing, Llc. Semi-supervised translation of source code programs using neural transformers
US20220415203A1 (en) * 2021-06-28 2022-12-29 ACADEMIC MERIT LLC d/b/a FINETUNE LEARNING Interface to natural language generator for generation of knowledge assessment items
CN114005028A (en) * 2021-07-30 2022-02-01 北京航空航天大学 Anti-interference light-weight model and method for remote sensing image target detection
CN114120102A (en) * 2021-11-03 2022-03-01 中国华能集团清洁能源技术研究院有限公司 Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN114283285A (en) * 2021-11-17 2022-04-05 华能盐城大丰新能源发电有限责任公司 Cross consistency self-training remote sensing image semantic segmentation network training method and device
CN114943963A (en) * 2022-04-29 2022-08-26 南京信息工程大学 Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN114937173A (en) * 2022-05-17 2022-08-23 中国地质大学(武汉) Hyperspectral image rapid classification method based on dynamic graph convolution network
CN115049922A (en) * 2022-05-18 2022-09-13 山东师范大学 Method and system for detecting change of remote sensing image
CN115049936A (en) * 2022-08-12 2022-09-13 武汉大学 High-resolution remote sensing image-oriented boundary enhancement type semantic segmentation method
CN115601549A (en) * 2022-12-07 2023-01-13 山东锋士信息技术有限公司(Cn) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蔡肖 et al.: "Remote Sensing Image Object Detection Based on Shifted-Window Pyramid Transformer", Computer Science (《计算机科学》) *

Also Published As

Publication number Publication date
CN115861824B (en) 2023-06-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant