CN115861824A - Remote sensing image identification method based on improved Transformer


Info

Publication number: CN115861824A (granted as CN115861824B)
Application number: CN202310155748.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 李兵, 梁嘉鸿, 王琪文, 杨露, 熊振华, 余珂
Applicant / Current assignee: Shantou University
Legal status: Active (granted)

Classifications

    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a remote sensing image identification method based on an improved Transformer. The method obtains a remote sensing image to be classified and preprocesses it, and then uses a trained improved neural network model to recognize and classify the ground-object targets in the preprocessed image. The model comprises a multi-size convolution feature extraction module, a Gaussian-weighted feature tokenizer, a Transformer encoder and a classification module: the feature extraction module extracts discriminative spatial features, the tokenizer balances the differences between categories and converts the spatial features into high-level semantic features, the encoder extracts global information, and discriminative classification information is generated in combination with the classification module, so that ground objects are classified more efficiently and accurately. The model of the invention has stronger generalization ability and higher training, recognition and classification speed, and improves the recognition accuracy of remote sensing images; it also has fewer model parameters, so it can be better deployed on small-memory devices and has high practical availability.

Description

Remote sensing image identification method based on improved Transformer
Technical Field
The invention relates to the technical field of image recognition, in particular to a remote sensing image recognition method based on an improved Transformer.
Background
Remote sensing image classification is an important branch of computer vision. With the development of deep learning, deep learning networks have been introduced into the field of remote sensing image classification to improve classification accuracy and reduce the cost of classifying remote sensing images manually. At present, most deep learning methods commonly used for remote sensing image classification are based on convolutional neural networks, and only a few are based on emerging neural networks such as the Transformer.
Disclosure of Invention
The invention aims to provide a remote sensing image recognition method based on an improved Transformer, so as to solve one or more technical problems in the prior art and at least provide a beneficial alternative or create favorable conditions.
The solution of the invention for solving the technical problem is as follows: the application provides a remote sensing image recognition method based on an improved Transformer, which comprises the following steps:
obtaining a remote sensing image to be classified containing a plurality of ground object targets, and preprocessing the remote sensing image to be classified to obtain a plurality of pixel blocks;
according to the pixel blocks, the remote sensing images to be classified are identified and classified by utilizing a trained improved neural network model, and a classification result of the ground object target is obtained;
the trained improved neural network model is obtained by training marked sample remote sensing images and corresponding marking results; the improved neural network model comprises a multi-size convolution feature extraction module, a Gaussian weighting word segmentation device, a transform encoder and a classification module which are connected in sequence;
the multi-size convolution feature extraction module is used for extracting features of the pixel blocks based on a two-dimensional convolution structure to obtain a feature graph to be processed and outputting the feature graph to the Gaussian weighted word segmentation device;
the Gaussian weighting word segmentation device is used for processing the characteristic graph to be processed through a Gaussian weighting matrix to obtain high-level semantic characteristics and outputting the high-level semantic characteristics to the transform encoder;
the Transformer encoder is used for learning the high-level semantic features through a multi-layer self-attention layer to obtain a first feature matrix, performing de-linearization processing on the first feature matrix to obtain a second feature matrix, generating a third feature matrix according to the first feature matrix and the second feature matrix and outputting the third feature matrix to the classification module;
and the classification module is used for calculating the maximum probability of the category to which the surface feature target belongs according to the third feature matrix and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability.
The invention has the beneficial effects that: a remote sensing image recognition method based on an improved Transformer is provided, in which the classification and recognition of the ground objects in a remote sensing image are achieved through an improved neural network model (the ITFormer) comprising a multi-size convolution feature extraction module, a Gaussian-weighted feature tokenizer, a Transformer encoder and a classification module. Compared with traditional convolutional neural networks, the ITFormer not only has better effectiveness, higher efficiency and stronger generalization capability, but also has a higher training speed and recognition and classification speed, and can improve the speed and accuracy of remote sensing image recognition. In addition, the ITFormer has fewer model parameters, can be better deployed on small-memory devices, reduces the unnecessary cost of occupying large amounts of memory, and has high practical availability.
Drawings
FIG. 1 is a data flow diagram of an ITFormer provided herein;
fig. 2 is a schematic structural diagram of a multi-size convolution feature extraction module provided in the present application;
FIG. 3 is a schematic structural diagram of a Gaussian weighted feature tokenizer provided in the present application;
FIG. 4 is a schematic structural diagram of a Transformer encoder provided in the present application;
FIG. 5 is a representation of the predicted plots for the ITFormer and three other comparison networks on an IP dataset;
FIG. 6 is a representation of the prediction maps of the ITFormer and three other comparison networks on a PU data set;
FIG. 7 is a schematic diagram of the ground objects selected and labeled with the Labelme software according to the present application;
FIG. 8 is the ground-truth label map corresponding to the marked ground objects, generated from the manually marked positions, as provided herein;
FIG. 9 is a schematic of a training curve of an ITFormer on a self-constructed data set;
FIG. 10 is a schematic of a training curve of GoogleNet on a self-constructed dataset;
FIG. 11 is a schematic of a training curve of MobileNet on a self-constructed data set;
FIG. 12 is a schematic of the training curve of ResNet on a self-constructed data set;
FIG. 13 is a classification representation of the ITFormer and three other comparison networks on a self-constructed data set;
FIG. 14 is a prediction plot of ITFormer and three other comparison networks on a self-constructed data set.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method is realized based on an improved Transformer network with multi-size convolution and Gaussian-weighted feature extraction, the ITFormer; the ITFormer is combined with a grid method and applied to the classification task of remote sensing images to realize the recognition and classification of the ground-object targets in a remote sensing image. In addition, the feasibility and effectiveness of the ITFormer are verified on two standard hyperspectral remote sensing datasets and a self-constructed visible-light remote sensing dataset. The experimental results show that the ITFormer has better effectiveness, higher efficiency and stronger generalization capability, with clear advantages over traditional convolutional neural networks. In addition, the ITFormer has a lighter design, can be better deployed on small-memory devices to reduce the unnecessary cost of occupying large amounts of memory, and has high practical availability.
In one embodiment of the present application, the method for identifying a remote sensing image may include, but is not limited to, the following steps.
S100, obtaining a remote sensing image to be classified containing a plurality of ground object targets, and preprocessing the remote sensing image to be classified to obtain a plurality of pixel blocks.
It should be noted that the pixel block is a two-dimensional pixel block.
And S200, recognizing and classifying the ground object target of the preprocessed remote sensing image to be classified by using the trained improved neural network model to obtain a classification result of the ground object target.
It should be noted that the model used for classifying the ground objects of the remote sensing image to be classified is a trained improved Transformer model, which is obtained by training on marked sample remote sensing images and the corresponding marking results.
One embodiment of S100 of the present application will be further described and illustrated below. A remote sensing image is the image formed when a sensor or remote sensor, far away from and not in contact with the target, detects and measures the target ground objects, acquires the reflected, radiated or scattered electromagnetic wave information in the process, and extracts and processes it. The present application refers to remote sensing data derived from visible-light remote sensing; optionally, the detection band lies within the visible-light range. After the remote sensing image to be classified is obtained, the remote sensing image to be classified is preprocessed.
Further, preprocessing the remote sensing image to be classified may include, but is not limited to, the following steps.
Define the size of the remote sensing image to be classified as M × N × d, where M is the height of the remote sensing image, N is the width of the remote sensing image, and d is the number of channels, corresponding to the spectral depth of the remote sensing image to be classified. Pixel blocks are extracted from the remote sensing image to be classified by a grid method to obtain M × N pixel blocks P ∈ R^(S×S×d), where S × S is the spatial size of each pixel block and the label information of each pixel block is determined by the original label of its central pixel. In this particular embodiment, d = 3.
Furthermore, the extraction of pixel blocks from the remote sensing image to be classified by the grid method comprises the following steps:
Define the central pixel of a pixel block as (x, y), with 1 ≤ x ≤ M and 1 ≤ y ≤ N. Extract from the remote sensing image to be classified all pixel points whose height lies between x - (S - 1)/2 and x + (S - 1)/2 and whose width lies between y - (S - 1)/2 and y + (S - 1)/2, and perform filling for the pixel points at the edge when the central pixel (x, y) lies near the image border; all the extracted pixel points together with the filled pixel points are used as the extracted pixel block.
The filling process is a filling process with a filling length of (S - 1)/2, and the central pixel (x, y) lies near the border when x < (S - 1)/2, x > M - (S - 1)/2, y < (S - 1)/2 or y > N - (S - 1)/2.
In this embodiment, since the visible-light remote sensing image is an RGB image, it has three bands, i.e. a spectral depth of three. In the classification of remote sensing images, each pixel point of the image represents one type of ground object, so classifying a remote sensing image actually means classifying each of its pixel points. Moreover, the features of a pixel point generally vary continuously with those of the surrounding pixel points, so in the preprocessing operation of the present application a grid method based on pixel blocks is used to learn the ground-object features; that is, the grid method is used to extract pixel blocks from the remote sensing image to be classified, obtaining a plurality of pixel blocks. This further facilitates the subsequent classification.
Pixel blocks are extracted from the remote sensing image to be classified by the grid method; for each central pixel the extraction covers all pixel points from height x - (S - 1)/2 to x + (S - 1)/2 and from width y - (S - 1)/2 to y + (S - 1)/2. In addition, the pixel points at the edge of a central pixel are not extracted but obtained by the filling operation. Finally, for a remote sensing image to be classified of size M × N × d, M × N pixel blocks P ∈ R^(S×S×d) are obtained.
Referring to FIG. 1, FIG. 1 is a data flow diagram of the ITFormer of the present application. In one embodiment of the present application, the structure of the ITFormer provided in the present application will be described and illustrated.
The ITFormer comprises a multi-size convolution feature extraction (MCFE) module, a Gaussian-weighted feature tokenizer, a Transformer encoder and a classification module which are connected in sequence. The multi-size convolution feature extraction module, the Gaussian-weighted feature tokenizer and the Transformer encoder are the key modules of the ITFormer. Specifically, the method comprises the following steps:
the feature extraction module is used for extracting features of the pixel blocks based on the two-dimensional convolution structure to obtain a feature graph to be processed. The Gaussian weighted feature word segmentation device is used for flattening the feature graph to be processed, and processing the flattened feature graph to be processed through the Gaussian weighted matrix to obtain high-level semantic features, so that the balance of the number of samples is realized.
The function of the Transformer encoder is to learn high-level semantic features through multiple layers of self-attention layers to obtain a first feature matrix, perform de-linearization processing on the first feature matrix to obtain a second feature matrix, generate a third feature matrix according to the first feature matrix and the second feature matrix, and output the third feature matrix to the classification module. In this application, the transform encoder carries out semantic information's classification and combination, can utilize different spatial feature to carry out discriminative characteristic identification to distinguish different ground objects better, realize more accurate classification.
And the classification module is used for calculating the maximum probability of the category to which the surface feature target belongs according to the third characteristic matrix and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability.
Further, the classification module comprises a flattening layer, a first fully connected layer and a classifier (Softmax) connected in sequence. In particular, the flattening layer functions to flatten the third feature matrix into one-dimensional features. The first full-connection layer is used for converting the one-dimensional features into the number of types corresponding to the ground object target and the label value of the type. The classifier is used for calculating the maximum probability of the category to which the surface feature target belongs through a Softmax function according to the number of the categories corresponding to the surface feature target and the label value of the category, and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability. Wherein, the value of the maximum probability is in the range of [0,1 ].
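A minimal sketch of the classification module (flattening layer, first fully connected layer and Softmax classifier) is given below, assuming PyTorch; the feature dimensions and class count are illustrative placeholders, not values taken from the patent.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Flattening layer, first fully connected layer and Softmax classifier."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.flatten = nn.Flatten()                    # third feature matrix -> one-dimensional features
        self.fc = nn.Linear(in_features, num_classes)  # one label value per ground-object category

    def forward(self, third_feature_matrix: torch.Tensor) -> torch.Tensor:
        logits = self.fc(self.flatten(third_feature_matrix))
        return torch.softmax(logits, dim=-1)           # probabilities, each in [0, 1]

# Usage: the class with the maximum probability is the predicted ground object.
head = ClassificationHead(in_features=5 * 64, num_classes=16)
probs = head(torch.randn(2, 5, 64))                    # (batch, tokens, dim) placeholder input
prediction = probs.argmax(dim=-1)
```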
Referring to FIG. 2, the multi-size convolution feature extraction (MCFE) module will be further described and illustrated. The multi-size convolution feature extraction module provided in the present application comprises a feature input layer, a two-dimensional convolution structure and a splicing layer which are connected in sequence, and the two-dimensional convolution structure is composed of four convolution layers connected in parallel. Wherein:
The role of the feature input layer is to copy a pixel block into four identical sub-pixel blocks, which are respectively input into the four convolution layers. Each convolution layer outputs a corresponding feature map according to its padding, and the feature maps output by the four convolution layers have the same size. The splicing layer is used for splicing the feature maps output by the four convolution layers into the feature map to be processed.
It should be noted that the purpose of a convolution layer is to maintain the spatial relationship between pixels through the convolution operation and to extract the feature information in an image; the result of the convolution operation is generally called a feature map. The main parameters of a convolution layer are the convolution kernel and the padding. For a two-dimensional convolution, the value at position (x, y) of the j-th feature map in the i-th layer can be given by the following formula:

v_{i,j}^{x,y} = f( b_{i,j} + \sum_m \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} w_{i,j,m}^{p,q} v_{i-1,m}^{x+p, y+q} )

where f(·) is the activation function, b_{i,j} is the bias, P_i × Q_i is the size of the two-dimensional convolution kernel, and w_{i,j,m}^{p,q} is the weight parameter at position (p, q).
As the above formula shows, the size of the convolution kernel is an important parameter of a convolution layer and directly influences the feature values it extracts. Convolution layers with different convolution kernel sizes can extract different feature information, so in the design of the multi-size convolution feature extraction module the present application uses convolution layers with four different convolution kernel sizes to extract feature information.
Further, in the convolution module, the four convolution layers are respectively a first convolution layer, a second convolution layer, a third convolution layer and a fourth convolution layer. The main parameters of each convolution layer are the convolution kernel and the padding, and the padding is chosen so that the feature maps output by the four convolution layers have the same size. Specifically:
the first winding layer is composed of two layers of 3 connected in sequence
Figure SMS_33
3 convolution kernel and convolution layer with 0 filling pixel. In this embodiment, the two convolutional layers with the same structure are adopted, so that the spatial dimensions of the feature maps output by the first convolutional layer and the other three convolutional layers are consistent.
The second convolution layer is 5
Figure SMS_34
5 convolution kernel, convolution layer with 0 fill pixel.
The third convolution layer is 7
Figure SMS_35
7 convolution kernel, convolution layer with fill pixel 1.
The fourth convolution layer is 9
Figure SMS_36
9 convolution kernel, convolution layer with 2 filled pixels.
Optionally, to make the training of the network easier and more stable, a batch normalization layer is added after each of the four convolution layers in the convolution module; that is, batch normalization layers are connected between the four convolution layers and the splicing layer. The batch normalization layer normalizes each batch of data, which accelerates the convergence of the model and, to a certain extent, alleviates the problem of scattered feature distributions in deep networks.
In this embodiment, each pixel block input to the multi-size convolution feature extraction module is first copied into four identical sub-pixel blocks, which are respectively input into the convolution layers with four different convolution kernel sizes. Four feature maps with the same output size are obtained by controlling the padding of each convolution layer. Finally, the four feature maps are spliced together and input into the Gaussian-weighted feature tokenizer to complete the subsequent feature weighting and classification processing.
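A sketch of this module under the kernel-size and padding settings listed above is shown below, assuming PyTorch; the number of output channels per branch (64) and the placement of batch normalization are assumptions consistent with the text rather than exact implementation details.

```python
import torch
import torch.nn as nn

class MultiSizeConvFeatureExtraction(nn.Module):
    """Four parallel 2-D convolution branches (3x3 + 3x3, 5x5, 7x7, 9x9) whose
    paddings (0, 0, 1, 2) all reduce the spatial size by 4, so the four feature
    maps can be concatenated ("spliced") along the channel dimension."""

    def __init__(self, in_channels: int = 3, channels: int = 64):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=0),
            nn.Conv2d(channels, channels, kernel_size=3, padding=0),
            nn.BatchNorm2d(channels),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=5, padding=0),
            nn.BatchNorm2d(channels),
        )
        self.branch7 = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=7, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.branch9 = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=9, padding=2),
            nn.BatchNorm2d(channels),
        )

    def forward(self, pixel_block: torch.Tensor) -> torch.Tensor:
        feats = [self.branch3(pixel_block), self.branch5(pixel_block),
                 self.branch7(pixel_block), self.branch9(pixel_block)]
        return torch.cat(feats, dim=1)  # splicing layer

# Usage: a 13 x 13 pixel block gives a 9 x 9 feature map with 4 * 64 channels.
out = MultiSizeConvFeatureExtraction()(torch.randn(2, 3, 13, 13))
print(out.shape)  # torch.Size([2, 256, 9, 9])
```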
Referring to FIG. 3, the multiplication symbol in the figure denotes a matrix multiplication operation. In the foregoing multi-size convolution feature extraction module, although the features extracted by the 2-D convolutions contain a large number of spatial features, spatial features alone are difficult to use to describe the characteristics of the ground-object targets. Therefore, in order to express the ground-object characteristics further and more clearly, the spatial features are treated as semantic features by the Gaussian-weighted feature tokenizer, and high-level semantic features are obtained through Gaussian-weighted processing. In other words, the Gaussian-weighted feature tokenizer further represents the spatial features extracted by the multi-size convolution feature extraction module as high-level semantic features. With reference to FIG. 3, the specific processing procedure is as follows:
First, the feature map to be processed is flattened, and the flattened feature map is defined as X ∈ R^((h·w)×c), i.e. a one-dimensional semantic feature, where h is the height, w is the width and c is the number of channels (spectral depth). Next, a Gaussian matrix W_g is defined and multiplied with X to obtain the semantic data X·W_g. A Softmax function is introduced, and the semantic data are processed by the Softmax function to obtain the weight information. The weight information is transposed and multiplied with the one-dimensional semantic feature X to obtain the final Gaussian-weighted high-level semantic features. The high-level semantic features can be expressed as:

T = softmax(X·W_g)^T · X

where T is the high-level semantic feature, X represents the flattened feature map, W_g represents the Gaussian matrix, and X·W_g represents the semantic data.
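The tokenizer can be sketched as follows, assuming PyTorch; the number of semantic tokens and the dimension along which Softmax is applied are implementation assumptions not spelled out in the text.

```python
import torch
import torch.nn as nn

class GaussianWeightedTokenizer(nn.Module):
    """Flatten the feature map to X in R^{(h*w) x c}, multiply by a normally
    initialized Gaussian matrix W_g, normalize with Softmax to get weights,
    and return T = softmax(X @ W_g)^T @ X as high-level semantic features."""

    def __init__(self, channels: int, num_tokens: int = 4):
        super().__init__()
        self.w_g = nn.Parameter(torch.randn(channels, num_tokens))  # Gaussian matrix

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feature_map.shape
        x = feature_map.flatten(2).transpose(1, 2)   # X: (batch, h*w, c)
        a = x @ self.w_g                             # semantic data: (batch, h*w, num_tokens)
        weights = torch.softmax(a, dim=1)            # weight information over spatial positions
        return weights.transpose(1, 2) @ x           # T: (batch, num_tokens, c)

# Usage on the concatenated feature map from the previous module (256 channels).
tokens = GaussianWeightedTokenizer(channels=256)(torch.randn(2, 256, 9, 9))
print(tokens.shape)  # torch.Size([2, 4, 256])
```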
Referring to FIG. 4, the Transformer encoder of one embodiment of the present application will be further described and illustrated below. The Transformer encoder comprises an embedded input layer, a multi-head attention layer, an MLP layer and an output layer which are connected in sequence, wherein the multi-head attention layer comprises a plurality of stacked self-attention layers. A regularization layer is connected between the embedded input layer and the multi-head attention layer, and another regularization layer and a first residual structure are connected between the multi-head attention layer and the MLP layer. In addition, the output of the MLP layer is connected with a second residual structure, and the second residual structure is connected with the first fully connected layer. Specifically:
the high-level semantic features are input into an embedded input layer. The role of the embedded input layer is: to high-level semantic features
Figure SMS_43
And learning flag>
Figure SMS_44
Concatenates together and connects the result of the concatenation with the location information ≧ initialized by the normal distribution>
Figure SMS_45
Adding to obtain the final input->
Figure SMS_46
. wherein ,/>
Figure SMS_47
The following formula is satisfied: />
Figure SMS_48
wherein ,
Figure SMS_49
is AND>
Figure SMS_50
Null matrices of the same type. />
Figure SMS_51
For showing
Figure SMS_52
The input order of (2).
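This step can be sketched in a few lines, assuming PyTorch; the token dimension and count are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of the embedded input layer: a learnable (initially null)
# class token T_cls is concatenated in front of the semantic tokens T, and a
# normally distributed position embedding P_pos is added.
dim, num_tokens, batch = 64, 4, 2
T = torch.randn(batch, num_tokens, dim)                          # high-level semantic features
T_cls = nn.Parameter(torch.zeros(1, 1, dim))                     # learning token (null matrix)
P_pos = nn.Parameter(torch.randn(1, num_tokens + 1, dim))        # position information
Z0 = torch.cat([T_cls.expand(batch, -1, -1), T], dim=1) + P_pos  # final input Z_0
print(Z0.shape)  # torch.Size([2, 5, 64])
```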
The role of the multi-head attention layer is as follows: the internal relevance of the final input Z_0 is learned through multiple self-attention layers, each self-attention layer producing a sub-feature matrix head_i; the first feature matrix is generated by connecting the several sub-feature matrices head_i and is output to the MLP layer. It should be noted that the output of the i-th self-attention layer satisfies the following formula:

head_i = Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V

where Q, K and V are respectively the query matrix, the key matrix and the value matrix, and d_k represents the dimension of the input.
Further, the step of obtaining Q, K and V comprises: processing the final input Z_0 through a preset shared matrix W to obtain a plurality of feature embeddings a_i, and multiplying each feature embedding a_i with three different learnable weights W_Q, W_K and W_V to obtain the query matrix, the key matrix and the value matrix.
It should be noted that the output of the multi-head attention layer, i.e. the first feature matrix, satisfies the following formula:

Z_MSA = Concat(head_1, head_2, ..., head_h) · W

where head_i, 1 ≤ i ≤ h, is the output of the i-th self-attention layer, h is the number of self-attention layers, and W is a parameter matrix.
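The two formulas above can be sketched as follows, assuming PyTorch; the head splitting and the single output projection are simplifying assumptions about details the text leaves open.

```python
import torch

def self_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def multi_head(z, w_q, w_k, w_v, w_o, num_heads):
    """Project Z_0 to Q, K, V with learnable weights, run num_heads
    self-attention heads, and mix the concatenated heads with W (w_o)."""
    b, n, d = z.shape
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    split = lambda t: t.view(b, n, num_heads, d // num_heads).transpose(1, 2)
    heads = self_attention(split(q), split(k), split(v))       # (b, heads, n, d/heads)
    concat = heads.transpose(1, 2).reshape(b, n, d)            # Concat(head_1, ..., head_h)
    return concat @ w_o                                        # first feature matrix Z_MSA

# Usage with illustrative shapes.
z = torch.randn(2, 5, 64)
w_q, w_k, w_v, w_o = (torch.randn(64, 64) for _ in range(4))
print(multi_head(z, w_q, w_k, w_v, w_o, num_heads=8).shape)    # torch.Size([2, 5, 64])
```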
The effect of the first residual structure is to alleviate the vanishing-gradient problem, i.e. to prevent the loss of accuracy caused by the improved neural network being too deep.
The function of the MLP layer is to perform de-linearization processing on the first feature matrix to obtain the second feature matrix.
The function of the second residual structure is to connect the first feature matrix output by the multi-head attention layer and the second feature matrix output by the MLP layer to obtain the third feature matrix. The third feature matrix satisfies:

Z_out = Z_MSA + Z_MLP

where Z_out is the third feature matrix, Z_MSA is the first feature matrix and Z_MLP is the second feature matrix.
Further, the MLP layer includes a second fully connected layer, a GELU nonlinear activation function layer, and a third fully connected layer, which are connected in sequence. It should be noted that a dropout layer is connected after each of the second fully connected layer and the third fully connected layer; these dropout layers are not labeled in FIG. 4.
The first feature matrix output by the multi-head attention layer is input into the MLP layer. The function of the second fully connected layer is to compress the number of channels of the first feature matrix Z_MSA to one eighth of the original number. The function of the GELU nonlinear activation function layer is to perform de-linearization on the channel-compressed first feature matrix through the GELU function, obtaining the de-linearized first feature matrix. The function of the third fully connected layer is to restore the number of channels of the de-linearized first feature matrix to the original number of channels, generating the second feature matrix Z_MLP.
In this embodiment, in the Transformer encoder, the high-level semantic features are first connected with a learnable null token, i.e. the learning token T_cls, and then added to the position information P_pos initialized from a normal distribution to obtain the final input Z_0. The multiple self-attention layers in the multi-head attention layer learn Z_0 to obtain the sub-feature matrices head_i, from which the first feature matrix Z_MSA is obtained. The MLP layer performs de-linearization on the first feature matrix Z_MSA to obtain the second feature matrix Z_MLP. Z_MSA and Z_MLP are connected through the second residual structure to form the third feature matrix Z_out, which is output to the first fully connected layer.
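A sketch of one such encoder block, assuming PyTorch, is given below; the token dimension, the head count and the exact placement of the normalization and dropout layers are assumptions consistent with, but not fully specified by, the description above.

```python
import torch
import torch.nn as nn

class ITFormerEncoderBlock(nn.Module):
    """Learnable class token + normally distributed position embedding,
    multi-head self-attention with a residual connection, and an MLP that
    compresses the channels to 1/8, applies GELU, and restores them, with
    dropout after the second and third fully connected layers."""

    def __init__(self, dim: int = 64, num_tokens: int = 4, heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))               # learning token T_cls
        self.pos_embed = nn.Parameter(torch.randn(1, num_tokens + 1, dim))  # position information P_pos
        self.norm1 = nn.LayerNorm(dim)                                      # regularization layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                                      # regularization layer
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // 8), nn.Dropout(p_drop),  # second fully connected layer
            nn.GELU(),                                     # de-linearization
            nn.Linear(dim // 8, dim), nn.Dropout(p_drop),  # third fully connected layer
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b = tokens.size(0)
        z0 = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1) + self.pos_embed
        h = self.norm1(z0)
        z_msa, _ = self.attn(h, h, h)            # first feature matrix
        z_msa = z_msa + z0                       # first residual structure
        z_mlp = self.mlp(self.norm2(z_msa))      # second feature matrix
        return z_msa + z_mlp                     # third feature matrix (second residual structure)

# Usage on the tokens produced by the Gaussian-weighted tokenizer.
print(ITFormerEncoderBlock(dim=256)(torch.randn(2, 4, 256)).shape)  # torch.Size([2, 5, 256])
```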
Based on the above embodiments, and referring again to FIG. 1, the data flow of the present application is described and illustrated below, taking the classification of the ground-object targets of a remote sensing image to be classified as an example. First, in the preprocessing, pixel blocks are extracted from the M × N × d remote sensing image to be classified by the grid method, obtaining M × N pixel blocks P, and these pixel blocks are input into the ITFormer.
In the multi-size convolution feature extraction module, the feature input layer first copies each pixel block into four identical sub-pixel blocks; feature extraction is performed by the convolution layers with different convolution kernels and paddings to obtain four feature maps, and the splicing layer outputs the feature map to be processed formed by the four feature maps.
In the Gaussian-weighted feature tokenizer, the feature map to be processed is first flattened into the one-dimensional semantic features X; X is multiplied with the Gaussian matrix W_g, and the weight information is obtained through the Softmax function; the weight information is then transposed and multiplied with X to obtain the high-level semantic features T.
In the Transformer encoder, the high-level semantic features T are first processed with the learning token T_cls and the position information P_pos to obtain the final input Z_0. Then the multi-head attention layer learns Z_0 to obtain several sub-feature matrices head_i, which form the first feature matrix. The first feature matrix is de-linearized by the MLP layer to obtain the second feature matrix. Then the first feature matrix and the second feature matrix are connected through the second residual structure to obtain the third feature matrix.
In the classification module, the third feature matrix is sent into the first full-connection layer after being subjected to flattening layer processing to obtain the number of types corresponding to the surface feature target and the label value of the surface feature target, and the classification result of the surface feature target in the remote sensing image to be classified is obtained through calculation of a Softmax function.
Based on the above embodiments, the validity of the ITFormer of the present application is verified and explained below through Example 1 and Example 2. First, in order to better train and verify the ITFormer of the present application, four evaluation parameters relevant to remote sensing image classification are selected to evaluate the classification performance of the trained ITFormer: the overall accuracy (OA), the average accuracy (AA), the Kappa coefficient and the per-class accuracy (EA).
The overall accuracy OA represents the number of correctly classified test pixels divided by the total number of test pixels. Define the number of correctly classified pixels of class i as x_i, the number of test pixels of class i as N_i, the number of categories as C, and the total number of pixels in the test set as N. EA represents the percentage of samples in each class that are accurately classified, so the per-class accuracy satisfies:

EA_i = x_i / N_i

The overall accuracy OA can then be calculated as:

OA = (x_1 + x_2 + ... + x_C) / N

The average accuracy AA represents the average class accuracy, obtained by dividing the sum of the per-class accuracies by the number of classes:

AA = (EA_1 + EA_2 + ... + EA_C) / C

The Kappa coefficient is a statistical measure of the agreement between the ground-truth map and the predicted classification map, a large value indicating strong consistency. The Kappa coefficient can be expressed as:

Kappa = (OA - P_e) / (1 - P_e)

where P_e is the agreement expected by chance, computed from the confusion matrix.
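A sketch of these evaluation parameters computed from a confusion matrix is given below (NumPy); the variable names and the chance-agreement term used for Kappa are standard choices rather than quotations from the patent.

```python
import numpy as np

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Per-class accuracy EA, overall accuracy OA, average accuracy AA and
    the Kappa coefficient, all derived from the confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1

    n = cm.sum()                                  # total number of test pixels
    ea = np.diag(cm) / cm.sum(axis=1)             # EA_i = correct in class i / pixels of class i
    oa = np.diag(cm).sum() / n                    # OA = correctly classified / total
    aa = ea.mean()                                # AA = sum of EA_i / number of classes
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement P_e
    kappa = (oa - pe) / (1 - pe)
    return ea, oa, aa, kappa

# Usage with dummy labels for 3 classes.
ea, oa, aa, kappa = classification_metrics(np.array([0, 1, 2, 2]), np.array([0, 1, 2, 1]), 3)
```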
example 1:
the ITFormer of the application is trained and verified on two standard hyperspectral remote sensing image data sets, the recognition result of the ITFormer is compared with the recognition results of three excellent convolutional neural networks Resnet, mobilenetV3 and GoogleNet in the field, the Resnet, mobilenetV3 and GoogleNet are collectively called as comparison networks in the embodiment, and the effectiveness of the ITFormer is verified through visual analysis.
The two standard datasets are the Indian Pines (IP) dataset and the Pavia University (PU) dataset. The IP dataset contains imagery of the Indian Pines test site; 20 water-absorption bands are discarded, and the corrected image contains 200 spectral bands, a size of 145 × 145 pixels, 16 different classes of vegetation and land cover, and a spatial resolution of 20 m. The PU dataset contains imagery of a city acquired by a reflective optics spectrographic imaging system; the data size is 610 × 340 pixels, and the data contain 103 spectral bands, 9 different classes of urban land cover and a spatial resolution of 1.3 m. The label types and numbers of the IP dataset and the PU dataset are shown in Table 1 below, in which "Class" is the class, "Training" is the partitioned training set and "Test" is the partitioned test set.
Table 1: Types and numbers of data labels of the IP dataset and the PU dataset
[Table 1 is available only as an image in the original publication.]
The ITFormer is trained under the PyTorch framework, using the IP dataset and the PU dataset respectively for training and verification. Each dataset is randomly partitioned into 10% as the training set and the remaining 90% as the test set. The network uses a stochastic gradient descent (SGD) optimizer, with an initial learning rate of 0.005, a momentum of 0.8, a batch size of 16, a pixel block size of 13, a maximum of 100 training epochs and a cross-entropy loss function (CrossEntropyLoss). The ITFormer is trained according to the parameters defined above, and ResNet, MobileNetV3 and GoogleNet are trained at the same time.
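The training configuration above can be sketched as follows, assuming PyTorch; the model stand-in, the dummy tensors and the class count are placeholders, not the actual ITFormer or datasets.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

num_classes = 16
model = nn.Sequential(nn.Flatten(), nn.Linear(13 * 13 * 3, num_classes))  # stand-in for the ITFormer

blocks = torch.randn(1000, 3, 13, 13)                 # pixel blocks (placeholder data)
labels = torch.randint(0, num_classes, (1000,))
dataset = TensorDataset(blocks, labels)
n_train = int(0.1 * len(dataset))                     # 10% training / 90% test split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.8)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                              # maximum of 100 training epochs
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```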
After the model training is finished, the verification stage begins. The verification stage of the present application is divided into two parts: evaluation parameter analysis and visual result analysis. First, the evaluation parameters described above, namely the overall accuracy OA, the average accuracy AA, the Kappa coefficient and the per-class accuracy EA, are used to complete the verification of the model. The models are then used to render prediction maps of the datasets for visual analysis. Specifically:
(1) Evaluation parameter analysis:
first is the validation of the IP data set.
The classification results of the ITFormer and the other comparison networks on the IP dataset are shown in Table 2 below. As can be seen from Tables 1 and 2, the ITFormer proposed in this application achieves the best results, with the highest OA, AA and Kappa coefficient, and obtains the best value for 9 of the 16 land-cover classes in terms of EA. It is worth noting that the other three networks (ResNet, MobileNetV3, GoogleNet) perform poorly on the small-sample classes (No. 1 Alfalfa, No. 9 Oats) because of the extreme sample imbalance of the IP dataset. In particular, MobileNetV3 reaches a class accuracy of only 19.51% on the No. 1 Alfalfa class, which seriously affects its final classification result. In the ITFormer proposed in this application, the Gaussian-weighted feature tokenizer balances the features of the ground-object samples, so that the learned features of each class are not seriously affected by the number of samples of that class and the final accuracy does not deteriorate; therefore, the ITFormer of the present application performs best on the two small-sample classes, No. 1 Alfalfa and No. 9 Oats. This demonstrates that the ITFormer proposed in this application can also perform efficiently on a dataset with extremely imbalanced samples.
Table 2: classification results of different methods on IP data sets
[Table 2 is available only as an image in the original publication.]
Then verification of the PU data set.
The classification results of the ITFormer, ResNet, MobileNetV3 and GoogleNet on the PU dataset are shown in Table 3 below. As can be seen from Tables 1 and 3, unlike the IP dataset, the PU dataset has fewer ground-object classes, a larger number of samples and a relatively more balanced distribution. Therefore, the ITFormer, ResNet, MobileNetV3 and GoogleNet can all achieve good classification results on the PU dataset. However, the ITFormer provided in the present application includes a multi-size convolution feature extraction module, which can extract the features of the ground objects of the remote sensing image, and a Transformer encoder, which can perform global feature learning; consequently, compared with the other three convolutional neural networks, the ITFormer obtains the best classification results on the PU dataset. The performance of the ITFormer on the PU dataset again demonstrates that the ITFormer can achieve the best effectiveness on different datasets.
Table 3: classification of PU data sets by different methods
[Table 3 is available only as an image in the original publication.]
(2) Visual result analysis:
referring to fig. 5-6, fig. 5 illustrates predicted graphical representations of four different methods on an IP dataset. FIG. 6 shows the predicted graphical representation of four different methods on a PU data set. Wherein, (a) is GoogleNet, (b) is MobilenetV3, (c) is Resnet, and (d) is ITFormer.
As can be seen from fig. 5: all four methods have certain salt and pepper points and continuous classification error areas. The MobileNetV3 performs the worst, and has a large number of salt and pepper spots and misclassification areas. The ITFormer has the best performance, has fewer classification error areas and can distinguish various ground objects more discriminatively. Moreover, classification error points of the ITFormer are concentrated in the edge zone of the region, and further the classification continuity and effectiveness of the ITFormer are proved. Similarly, the PU data set prediction graph of FIG. 6 also demonstrates the effectiveness of the ITFormer proposed in the present application, and the ITFormer has the least classification error salt and pepper points and has classification continuous effectiveness.
The verification stage proves that the ITFormer provided by the application has superiority, advancement and effectiveness.
Example 2:
In order to further verify the effectiveness of the proposed ITFormer, the present application also trains and verifies the ITFormer on a self-constructed dataset. In this example, the experimental data is a self-constructed dataset covering the streets of a certain city and containing seven different ground-object categories: roads, forests, houses, lakes and marshes, lawns, bare soil and coastal beaches. First, the whole map of the city streets is downloaded with the BigeMap software, the data is then cut with the ENVI software, and the cut data is converted into a data map of size 2500 × 2500 × 3.
After the original dataset is obtained through the above steps, the dataset needs to be manually pre-classified. In this step, the Labelme software is used to select and mark parts of the ground objects. The seven ground-object categories are determined during the manual marking: ground objects are selected at random positions and the true label categories are assigned in the cut data-map image. FIG. 7 shows the ground objects selected and marked with the Labelme software.
After the selection of the ground objects is completed, JSON files containing the ground-object marks are obtained; these JSON files with the ground-object category information are then converted by a program into ground-truth label map files that correspond to the real ground objects and can be used directly. FIG. 8 shows the ground-truth label map corresponding to the marked ground objects, generated from the manually marked positions.
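A sketch of such a conversion program is shown below, assuming polygon annotations in the standard Labelme JSON layout and an illustrative name-to-id mapping; Labelme's own conversion scripts could be used instead, and all names here are assumptions rather than the patent's actual tooling.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

CLASS_IDS = {"road": 1, "forest": 2, "house": 3, "lake_marsh": 4,
             "lawn": 5, "bare_soil": 6, "beach": 7}   # 0 = unlabeled (illustrative mapping)

def labelme_json_to_label_map(json_path: str) -> np.ndarray:
    """Rasterize the polygons of one Labelme annotation file into a label map."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    label_map = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(label_map)
    for shape in ann["shapes"]:
        class_id = CLASS_IDS.get(shape["label"], 0)
        polygon = [tuple(pt) for pt in shape["points"]]
        draw.polygon(polygon, fill=class_id)          # fill the marked region with its class id
    return np.array(label_map)

# Usage: label_map = labelme_json_to_label_map("street_scene.json")
```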
In order to fully verify the effectiveness, efficiency and generalization of the ITFormer, the ITFormer is built under the PyTorch framework together with the comparison networks ResNet, MobileNetV3 and GoogleNet, and the self-constructed dataset is used as the verification dataset of the models, so as to fully verify the effectiveness, efficiency and generalization of the proposed model. The self-constructed dataset is randomly divided into 30% as the training set and the remaining 70% as the test set. The label types of the dataset and the corresponding numbers of pixels are shown in Table 4 below.
Table 4: self-building dataset tag types and numbers
[Table 4 is available only as an image in the original publication.]
In this example, during training, the ITFormer and the other three comparison networks use the Adam optimizer, with an initial learning rate of 0.001, a learning-rate decay of 0 and a batch size of 64. The CPU of the device used for the training and verification experiments is an AMD Ryzen 7 3700X 8-Core Processor, and the GPU is an NVIDIA GeForce RTX 3060 graphics card with 12 GB of GPU memory.
In terms of the other parameter settings, the sample remote sensing images in the self-constructed dataset are preprocessed in a manner substantially consistent with S100. However, unlike Example 1, the size of the pixel block is set to 21 × 21 × 3 in this example, i.e. the spatial size of the pixel block is 21 × 21. The convolution kernel size of each convolution layer in the multi-size convolution feature extraction module is set to 64, i.e. the convolution kernels in the first to fourth convolution layers are all of size 8 × 8. The maximum number of training epochs is set to 100.
With respect to verification, the evaluation parameters used are the overall accuracy OA, the average accuracy AA, the Kappa coefficient and the per-class accuracy EA described above. In addition, the prediction maps of the ITFormer and the other three comparison networks are plotted for visual comparison and analysis. Further, this example verifies the ITFormer in three parts: evaluation parameter analysis, visual effect analysis and loss (overhead) analysis. The specific verification process is as follows:
(1) Evaluation parameter analysis:
table of classification results for ITFormer and the other three comparative networks as shown in table 5 below. As can be seen from tables 4 and 5: the best performing is GoogleNet, while the proposed ITFormer reaches the second highest in three main evaluation parameters of global accuracy OA, average accuracy AA and Kappa coefficient. It is worth noting that ITFormer has 3 types of ground features to achieve the highest class accuracy.
Table 5: classification results of different methods on self-constructed datasets
[Table 5 is available only as an image in the original publication.]
Referring to fig. 9-12, fig. 9-12 show training graphs of the ITFormer and three other comparative networks, including training loss curves, training and testing accuracy curves, and evaluation parameter curves for each network, respectively. (a) is a training loss curve; (b) is a training and testing accuracy curve; and (c) is an evaluation parameter curve.
As can be seen from FIGS. 9 to 12, MobileNetV3 performs the worst in terms of convergence speed, requiring more than 40 epochs to begin to converge, while the remaining three methods converge to around their highest accuracy within about 20 epochs. From the test accuracy curves of the four methods, GoogleNet, based on the deep Inception structure, is the most stable, followed by ResNet, based on the residual structure, and the ITFormer proposed in this application; MobileNetV3, based on inverted residuals and SE modules, is the least stable. The reason is that GoogleNet uses multi-layer, multi-size convolutions, which improve the stability of the model, and ResNet adopts a deep residual structure, which better alleviates the vanishing-gradient problem. The ITFormer provided in this application adopts multi-size convolutions to extract discriminative spatial features and converts the spatial features into high-level semantic features, so it can explore long-term dependency information and possesses generalization capability and stability.
In addition, as shown in the curves (b) of fig. 9 to 12, the verification accuracy of the ITFormer proposed by the present application can have a higher value than the training accuracy, and the verification accuracy of the other three comparison networks is below the training accuracy. Therefore, the ITFormer has higher generalization capability and can better adapt to the situation of less data volume in reality.
(2) Visual effect analysis:
referring to FIG. 13, FIG. 13 shows classification performance of ITFormer and three other comparative networks on a self-constructed data set. Wherein, (a) is GoogleNet, (b) is MobilenetV3, (c) is Resnet, and (d) is ITFormer. As can be seen from fig. 13: the best performance is ITFormer, which has fewer salt and pepper points on the classification chart and does not have large-area classification error continuous areas. Therefore, the ITFormer with the global information capable of being better learned can have better effect on the performance of classification. And the misclassification areas of the other three comparison networks are relatively concentrated and have more salt and pepper spots. In addition, the four models described above were used for the rendering of a prediction map of the self-constructed data set for visual analysis. Referring to FIG. 14, FIG. 14 illustrates a prediction graph of different methods on a self-constructed data set. Wherein, (a) is GoogleNet, (b) is MobilenetV3, (c) is Resnet, and (d) is ITFormer. As can be seen from fig. 14: of the four methods, the ITFormer achieves the best results in fewer misclassified regions, relatively continuous regions and relatively few salt and pepper spots. Therefore, the ITFormer provided by the application is proved to have superiority, advancement and effectiveness.
(3) Loss (overhead) analysis:
although the accuracy of the model is the most important, the training loss, the prediction loss and the memory loss of the model are also important indexes for evaluating the superiority of the model. In order to guarantee the fairness of verification, the verification experiment that this application provided is based on going on under same experiment platform and the same laboratory glassware. Experimental results training and testing times and model sizes on self-constructed data sets for ITFormer and the other three comparative networks are shown in table 6 below.
Table 6: training and testing time and model size on self-building data sets by different methods
[Table 6 is available only as an image in the original publication.]
As can be seen from Table 6, the training time, testing time and model parameters of the ITFormer are all the lowest, and its number of parameters is only 1/8 of that of ResNet, so the ITFormer can be better deployed on devices with small memory. Meanwhile, the training time of the ITFormer is only half of that of ResNet, and its testing time is about half of that of ResNet. Although the accuracy of the ITFormer is 0.1% lower than that of GoogleNet, this is tolerable. Therefore, the ITFormer has clear advantages in terms of efficiency and light weight, can be better deployed on lightweight devices, and is more suitable for actual working scenarios.
In summary, the present application provides a Transformer network with multi-size convolution and Gaussian-weighted feature extraction, the ITFormer, for remote sensing image classification. The multi-size convolution module better extracts discriminative spatial features, and the Gaussian-weighted feature tokenizer better balances the differences between categories and converts the spatial features into high-level semantic features. Finally, the Transformer-encoder-based classification module extracts global information and generates discriminative classification information, so that ground objects are classified more efficiently and accurately. In the experiments on the hyperspectral remote sensing datasets, the proposed ITFormer achieves the best results compared with excellent convolutional neural networks (ResNet, MobileNetV3 and GoogleNet). On the visible-light remote sensing dataset, although GoogleNet achieves a slightly better classification result, the difference between the ITFormer and GoogleNet is only 0.1%, while the ITFormer has the fastest training and test classification speed and an extremely small number of model parameters. Therefore, the ITFormer has better effectiveness, efficiency and generalization when applied to actual remote sensing image classification, and its relatively low deployment cost makes it more suitable for practical applications.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (10)

1. The remote sensing image identification method based on the improved Transformer is characterized by comprising the following steps of:
obtaining a remote sensing image to be classified containing a plurality of ground object targets, and preprocessing the remote sensing image to be classified to obtain a plurality of pixel blocks;
according to the pixel blocks, the remote sensing images to be classified are identified and classified by utilizing a trained improved neural network model, and a classification result of the ground object target is obtained;
the trained improved neural network model is obtained by training on marked sample remote sensing images and the corresponding marking results; the improved neural network model comprises a multi-size convolution feature extraction module, a Gaussian-weighted feature tokenizer, a Transformer encoder and a classification module which are connected in sequence;
the multi-size convolution feature extraction module is used for extracting features of the pixel blocks based on a two-dimensional convolution structure to obtain a feature map to be processed and outputting the feature map to the Gaussian-weighted feature tokenizer;
the Gaussian-weighted feature tokenizer is used for processing the feature map to be processed through a Gaussian weighting matrix to obtain high-level semantic features and outputting the high-level semantic features to the Transformer encoder;
the Transformer encoder is used for learning the high-level semantic features through a plurality of self-attention layers to obtain a first feature matrix, performing de-linearization processing on the first feature matrix to obtain a second feature matrix, generating a third feature matrix according to the first feature matrix and the second feature matrix, and outputting the third feature matrix to the classification module;
and the classification module is used for calculating the maximum probability of the category to which the surface feature target belongs according to the third feature matrix and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability.
2. The improved Transformer-based remote sensing image recognition method according to claim 1, wherein the classification module comprises a flattening layer, a first full-connected layer and a classifier which are connected in sequence;
the flattening layer is used for flattening the third feature matrix into one-dimensional features;
the first full-connection layer is used for converting the one-dimensional features into the number of types corresponding to the surface feature target and the label value of the type;
the classifier is used for calculating the maximum probability of the category to which the surface feature target belongs according to the number of the categories corresponding to the surface feature target and the label value of the category, and outputting the classification result of the surface feature target in the remote sensing image to be classified according to the maximum probability;
wherein, the value of the maximum probability is in the range of [0,1 ].
3. The remote sensing image recognition method based on the improved Transformer according to claim 1, wherein preprocessing the remote sensing image to be classified to obtain a plurality of pixel blocks comprises:
defining the size of the remote sensing image to be classified as M × N × d, wherein M is the height of the remote sensing image to be classified, N is the width of the remote sensing image to be classified, and d is the number of channels, corresponding to the spectral depth of the remote sensing image to be classified; and extracting pixel blocks from the remote sensing image to be classified by a grid method to obtain M × N pixel blocks P ∈ R^(S×S×d), wherein the label information of each pixel block is determined by the original label of its central pixel and S × S is the spatial size of the pixel block.
4. The remote sensing image recognition method based on the improved Transformer as claimed in claim 3, wherein extracting pixel blocks from the remote sensing image to be classified by using the grid method comprises:
defining the central pixel of a pixel block as $p_{i,j}$, with $i \in [1, M]$ and $j \in [1, N]$;
extracting from the remote sensing image to be classified the region whose height ranges from $i-(S-1)/2$ to $i+(S-1)/2$ and whose width ranges from $j-(S-1)/2$ to $j+(S-1)/2$;
for central pixels $p_{i,j}$ lying at the edge of the image, filling the missing pixel points, and taking all the extracted pixel points together with the filled pixel points as the extracted pixel block;
wherein the filling process uses a filling length of $(S-1)/2$, and the edges of the block centred on the central pixel $p_{i,j}$ are $i-(S-1)/2$, $i+(S-1)/2$, $j-(S-1)/2$ and $j+(S-1)/2$.
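The grid extraction of claims 3 and 4 can be pictured with the NumPy sketch below: one $S \times S \times d$ block per pixel, with a border of $(S-1)/2$ filled pixels so that edge pixels also receive full blocks. The zero fill value and the odd $S$ are assumptions, since the claims do not state them.

```python
import numpy as np

def extract_pixel_blocks(image: np.ndarray, S: int) -> np.ndarray:
    """Grid-method sketch: `image` has shape (M, N, d); returns M*N blocks of shape (S, S, d)."""
    assert S % 2 == 1, "assumed odd so that (S - 1) / 2 is an integer filling length"
    r = (S - 1) // 2                                                    # filling length (S - 1) / 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="constant")   # fill edge pixels (assumed zeros)
    M, N, d = image.shape
    blocks = np.empty((M, N, S, S, d), dtype=image.dtype)
    for i in range(M):
        for j in range(N):
            # window centred on pixel (i, j): heights i-r..i+r and widths j-r..j+r of the original image
            blocks[i, j] = padded[i:i + S, j:j + S, :]
    return blocks.reshape(M * N, S, S, d)                               # M x N pixel blocks, labelled by their central pixel
```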
5. The remote sensing image recognition method based on the improved Transformer as claimed in claim 1, wherein the multi-size convolution feature extraction module comprises a feature input layer, a two-dimensional convolution structure and a splicing layer which are connected in sequence, and the two-dimensional convolution structure is formed by connecting four convolution layers in parallel; wherein:
the feature input layer is used for copying the pixel block into four identical sub-pixel blocks, the four sub-pixel blocks are respectively input into the four convolution layers, each convolution layer outputs a corresponding feature map according to its own padding, and the feature maps output by the four convolution layers have the same size; the splicing layer is used for splicing the feature maps output by the four convolution layers into the feature map to be processed;
wherein the four convolution layers are a first convolution layer, a second convolution layer, a third convolution layer and a fourth convolution layer;
the first convolution layer comprises two sequentially connected convolution layers each with a $3 \times 3$ convolution kernel and a padding of 0; the second convolution layer is a convolution layer with a $5 \times 5$ convolution kernel and a padding of 0; the third convolution layer is a convolution layer with a $7 \times 7$ convolution kernel and a padding of 1; and the fourth convolution layer is a convolution layer with a $9 \times 9$ convolution kernel and a padding of 2.
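A sketch of the two-dimensional convolution structure of claim 5, assuming the "splicing" is channel-wise concatenation and picking an arbitrary number of output channels per branch. With these kernel/padding pairs each branch shrinks the spatial size by the same 4 pixels, so the four feature maps can indeed be spliced.

```python
import torch
import torch.nn as nn

class MultiSizeConvExtractor(nn.Module):
    """Four parallel convolution branches; `out_ch` per branch is an illustrative assumption."""
    def __init__(self, in_ch: int, out_ch: int = 16):
        super().__init__()
        self.branch3 = nn.Sequential(                   # first layer: two stacked 3x3 convolutions, padding 0
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=0),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=0),
        )
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=0)   # second layer
        self.branch7 = nn.Conv2d(in_ch, out_ch, kernel_size=7, padding=1)   # third layer
        self.branch9 = nn.Conv2d(in_ch, out_ch, kernel_size=9, padding=2)   # fourth layer

    def forward(self, x):                               # the feature input layer copies x into four branches
        outs = [self.branch3(x), self.branch5(x), self.branch7(x), self.branch9(x)]
        return torch.cat(outs, dim=1)                   # splicing layer: all outputs share the same spatial size
```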
6. The remote sensing image recognition method based on the improved Transformer as claimed in claim 1, wherein processing the feature map to be processed through the Gaussian weighting matrix to obtain the high-level semantic features comprises:
flattening the feature map to be processed into one-dimensional semantic features $X_f \in \mathbb{R}^{(h \cdot w) \times c}$, where $h$ is the height, $w$ is the width, and $c$ is the spectral depth (number of bands) of the image;
defining a Gaussian matrix $G$, multiplying the Gaussian matrix by the one-dimensional semantic features, and obtaining weight information through a Softmax function;
transposing the weight information and multiplying it by the one-dimensional semantic features to obtain the Gaussian-weighted high-level semantic features, i.e. the high-level semantic features satisfy $T = \mathrm{Softmax}(G X_f)^{\mathsf T} X_f$.
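The Gaussian weighting of claim 6 amounts to an attention-like pooling of the flattened feature map. The sketch below assumes the Gaussian matrix is a learnable parameter initialised from a normal distribution and that it produces a fixed number of tokens; both choices, and the softmax dimension, go beyond what the claim states.

```python
import torch
import torch.nn as nn

class GaussianTokenizer(nn.Module):
    """Sketch of the Gaussian weighting word segmentation device (claim 6)."""
    def __init__(self, channels: int, num_tokens: int = 4):
        super().__init__()
        self.G = nn.Parameter(torch.randn(num_tokens, channels))    # assumed Gaussian matrix of shape (L, c)

    def forward(self, feature_map):                                  # (B, c, h, w) feature map to be processed
        X = feature_map.flatten(2).transpose(1, 2)                   # (B, h*w, c) one-dimensional semantic features
        weights = torch.softmax(X @ self.G.t(), dim=1)               # weight information via Softmax
        tokens = weights.transpose(1, 2) @ X                         # transpose and multiply back: (B, L, c)
        return tokens                                                # Gaussian-weighted high-level semantic features
```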
7. The remote sensing image recognition method based on the improved Transformer, wherein the Transformer encoder comprises an embedded input layer, a multi-head attention layer and an MLP layer which are connected in sequence; a regularization layer is connected between the embedded input layer and the multi-head attention layer, and another regularization layer and a first residual structure are connected between the multi-head attention layer and the MLP layer;
the high-level semantic features are input into the embedded input layer; the embedded input layer is used for concatenating the high-level semantic features with a preset learning mark, and adding the concatenation result to position information initialized from a normal distribution to obtain the final input; wherein the final input satisfies:
$z_0 = [x_{\mathrm{class}};\, T] + E_{\mathrm{pos}}$
where $T$ is the high-level semantic features, $x_{\mathrm{class}}$ is the learning mark, a null matrix of the same type as $T$, and $E_{\mathrm{pos}}$ is the position information, used to represent the input order of $[x_{\mathrm{class}};\, T]$;
the multi-head attention layer comprises a plurality of mutually stacked self-attention layers and is used for learning the internal correlation of the final input through the plurality of self-attention layers to obtain sub-feature matrices $Z_i$, $i = 1, \dots, H$; the sub-feature matrices are concatenated to generate the first feature matrix, which is output to the MLP layer;
the MLP layer is used for performing de-linearization processing on the first feature matrix to obtain the second feature matrix; a second residual structure is connected behind the MLP layer and is used for adding the first feature matrix and the second feature matrix to obtain the third feature matrix, i.e. the third feature matrix satisfies:
$Z_{\mathrm{out}} = Z_{\mathrm{MSA}} + Z_{\mathrm{MLP}}$
where $Z_{\mathrm{out}}$ is the third feature matrix, $Z_{\mathrm{MSA}}$ is the first feature matrix, and $Z_{\mathrm{MLP}}$ is the second feature matrix.
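Claim 7 describes a pre-norm Transformer encoder block plus a class-token input. The sketch below is one way to realise it, assuming LayerNorm for the "regularization layer" and PyTorch's built-in multi-head attention; the exact placement of the residuals relative to the first, second and third feature matrices is an interpretation, not a quotation of the patent.

```python
import torch
import torch.nn as nn

class EmbeddedInput(nn.Module):
    """Prepends the learning mark and adds normally initialised position information."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))               # learning mark: null matrix like T
        self.pos = nn.Parameter(torch.randn(1, num_tokens + 1, dim))  # E_pos drawn from a normal distribution

    def forward(self, T):                                             # (B, L, dim) high-level semantic features
        cls = self.cls.expand(T.size(0), -1, -1)
        return torch.cat([cls, T], dim=1) + self.pos                  # z0 = [x_class; T] + E_pos

class EncoderBlock(nn.Module):
    """Multi-head attention + MLP with the two residual structures of claim 7."""
    def __init__(self, dim: int, heads: int, mlp: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                                # regularization layer before attention
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                                # regularization layer before the MLP
        self.mlp = mlp                                                # see the sketch after claim 8

    def forward(self, z):
        zn = self.norm1(z)
        a, _ = self.attn(zn, zn, zn)
        z1 = z + a                                                    # first residual -> "first feature matrix"
        z2 = self.mlp(self.norm2(z1))                                 # de-linearised "second feature matrix"
        return z1 + z2                                                # second residual -> "third feature matrix"
```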
8. The remote sensing image recognition method based on the improved Transformer as claimed in claim 7, wherein the MLP layer comprises a second fully connected layer, a GELU non-linear activation function layer and a third fully connected layer which are connected in sequence, and a dropout layer is connected behind each of the second fully connected layer and the third fully connected layer;
the first feature matrix output by the multi-head attention layer is input into the MLP layer; the second fully connected layer is used for compressing the number of channels of the first feature matrix to one eighth of the original number of channels; the GELU non-linear activation function layer is used for performing de-linearization processing on the channel-compressed first feature matrix through the GELU function to obtain a de-linearized first feature matrix; and the third fully connected layer is used for restoring the number of channels of the de-linearized first feature matrix to the original number of channels to generate the second feature matrix.
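A minimal sketch of the MLP layer of claim 8. The dropout probability and the placement of each dropout layer immediately after its fully connected layer are assumptions read off the claim wording.

```python
import torch.nn as nn

def make_mlp(dim: int, p_drop: float = 0.1) -> nn.Sequential:
    """Compress channels to 1/8, de-linearise with GELU, then restore the channel count."""
    hidden = dim // 8
    return nn.Sequential(
        nn.Linear(dim, hidden),    # second fully connected layer: compress to one eighth
        nn.Dropout(p_drop),        # dropout behind the second fully connected layer
        nn.GELU(),                 # de-linearisation via the GELU function
        nn.Linear(hidden, dim),    # third fully connected layer: restore the original channels
        nn.Dropout(p_drop),        # dropout behind the third fully connected layer
    )
```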
9. The remote sensing image recognition method based on the improved Transformer as claimed in claim 7, wherein the output of the $i$-th self-attention layer satisfies the following formula:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{Q K^{\mathsf T}}{\sqrt{d_k}}\right) V$
where $Q$, $K$ and $V$ are respectively the query matrix, the key matrix and the value matrix, and $d_k$ denotes the dimension of the input;
the steps of obtaining the query matrix, the key matrix and the value matrix comprise: processing the final input through a shared matrix $W$ to obtain a plurality of feature embeddings $a_i$, and multiplying each feature embedding $a_i$ by three different learnable weights $W^{Q}$, $W^{K}$ and $W^{V}$ to obtain the query matrix, the key matrix and the value matrix.
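The formula in claim 9 is the usual scaled dot-product attention; below is a small functional sketch, with the shared-matrix step and all shapes assumed.

```python
import math
import torch

def self_attention(x, Wq, Wk, Wv):
    """One self-attention layer: x is (B, L, d) feature embeddings a_i obtained from the
    shared matrix W; Wq, Wk, Wv are (d, d_k) learnable weights (assumed shapes)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                        # query, key and value matrices
    d_k = Q.size(-1)                                        # dimension of the input
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # Q K^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V                # Softmax(.) V
```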
10. The remote sensing image recognition method based on the improved Transformer as claimed in claim 9, wherein the output of the multi-head attention layer satisfies the following formula:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W$
where $\mathrm{head}_i$, $i = 1, \dots, H$, is the output of the $i$-th self-attention layer, $H$ is the number of self-attention layers, and $W$ is a parameter matrix.
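And the multi-head combination of claim 10, reusing the `self_attention` sketch above; the list of per-layer weight triples and the shape of the projection matrix `W` are assumptions.

```python
import torch

def multi_head_attention(x, head_weights, W):
    """Concatenate the H self-attention outputs and project them with the parameter matrix W."""
    heads = [self_attention(x, Wq, Wk, Wv) for (Wq, Wk, Wv) in head_weights]   # head_1 ... head_H
    return torch.cat(heads, dim=-1) @ W                                        # Concat(head_1, ..., head_H) W
```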
CN202310155748.8A 2023-02-23 2023-02-23 Remote sensing image recognition method based on improved Transformer Active CN115861824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310155748.8A CN115861824B (en) 2023-02-23 2023-02-23 Remote sensing image recognition method based on improved Transformer

Publications (2)

Publication Number Publication Date
CN115861824A true CN115861824A (en) 2023-03-28
CN115861824B CN115861824B (en) 2023-06-06

Family

ID=85658758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310155748.8A Active CN115861824B (en) 2023-02-23 Remote sensing image recognition method based on improved Transformer

Country Status (1)

Country Link
CN (1) CN115861824B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190171862A1 (en) * 2017-12-05 2019-06-06 Transport Planning and Research Institute Ministry of Transport Method of extracting image of port wharf through multispectral interpretation
CN108614997A (en) * 2018-04-04 2018-10-02 南京信息工程大学 A kind of remote sensing images recognition methods based on improvement AlexNet
US20200250428A1 (en) * 2019-02-04 2020-08-06 Farmers Edge Inc. Shadow and cloud masking for remote sensing images in agriculture applications using a multilayer perceptron
CN110796105A (en) * 2019-11-04 2020-02-14 中国矿业大学 Remote sensing image semantic segmentation method based on multi-modal data fusion
US20220130145A1 (en) * 2019-12-01 2022-04-28 Pointivo Inc. Systems and methods for generating of 3d information on a user display from processing of sensor data for objects, components or features of interest in a scene and user navigation thereon
CN111860235A (en) * 2020-07-06 2020-10-30 中国科学院空天信息创新研究院 Method and system for generating high-low-level feature fused attention remote sensing image description
CN112084842A (en) * 2020-07-28 2020-12-15 北京工业大学 Hydrological remote sensing image target identification method based on depth semantic model
US20220308848A1 (en) * 2021-03-25 2022-09-29 Microsoft Technology Licensing, Llc. Semi-supervised translation of source code programs using neural transformers
US20220415203A1 (en) * 2021-06-28 2022-12-29 ACADEMIC MERIT LLC d/b/a FINETUNE LEARNING Interface to natural language generator for generation of knowledge assessment items
CN114005028A (en) * 2021-07-30 2022-02-01 北京航空航天大学 Anti-interference light-weight model and method for remote sensing image target detection
CN114120102A (en) * 2021-11-03 2022-03-01 中国华能集团清洁能源技术研究院有限公司 Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN114283285A (en) * 2021-11-17 2022-04-05 华能盐城大丰新能源发电有限责任公司 Cross consistency self-training remote sensing image semantic segmentation network training method and device
CN114943963A (en) * 2022-04-29 2022-08-26 南京信息工程大学 Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN114937173A (en) * 2022-05-17 2022-08-23 中国地质大学(武汉) Hyperspectral image rapid classification method based on dynamic graph convolution network
CN115049922A (en) * 2022-05-18 2022-09-13 山东师范大学 Method and system for detecting change of remote sensing image
CN115049936A (en) * 2022-08-12 2022-09-13 武汉大学 High-resolution remote sensing image-oriented boundary enhancement type semantic segmentation method
CN115601549A (en) * 2022-12-07 2023-01-13 山东锋士信息技术有限公司(Cn) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蔡肖 et al.: "Remote Sensing Image Object Detection Based on Shifted-Window Pyramid Transformer", Computer Science (《计算机科学》) *

Also Published As

Publication number Publication date
CN115861824B (en) 2023-06-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant