CN115565071A - Hyperspectral image Transformer network training and classifying method - Google Patents

Hyperspectral image Transformer network training and classifying method

Info

Publication number
CN115565071A
Authority
CN
China
Prior art keywords
embedding
spectrum
scale
features
transformer
Prior art date
Legal status
Pending
Application number
CN202211318387.6A
Other languages
Chinese (zh)
Inventor
贾森
王一帆
徐萌
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202211318387.6A
Publication of CN115565071A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/194 - Terrestrial scenes using hyperspectral data, i.e. more or other wavelengths than RGB
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Remote Sensing (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a hyperspectral image Transformer network training and classification method. A hyperspectral image sample, which is an unlabeled training sample, is first divided along the spectral dimension to obtain a plurality of spectral sub-bands. Each spectral sub-band is input into a separate embedding module to obtain the local spatial-spectral embedded features output by each embedding module, the embedding module extracting the spatial-spectral features of the spectral sub-band at multiple scales. All the local spatial-spectral embedded features are fused according to the positions of the spectral sub-bands to obtain a global spatial-spectral embedded feature. The global spatial-spectral embedded feature is input into a Transformer encoder, and the central region token is masked and reconstructed in a Transformer decoder to train the Transformer encoder in a self-supervised manner. Compared with the prior art, the invention effectively improves the training of the Transformer network using unlabeled samples and classifies hyperspectral images with high accuracy.

Description

Hyperspectral image Transformer network training and classifying method
Technical Field
The invention relates to the technical field of hyperspectral image classification, in particular to a hyperspectral image Transformer network training and classifying method.
Background
Hyperspectral remote sensing images capture ground-object information in hundreds of contiguous spectral bands and therefore have a strong ability to distinguish ground targets. Over the past decades, hyperspectral images have played an important role in military target detection, ocean monitoring, disaster prevention and control, and other applications. Recognition and classification of hyperspectral images is a key problem in hyperspectral image analysis and plays an important role in advancing hyperspectral remote sensing technology.
With the wide application of deep learning in various fields, a variety of deep learning classification methods have appeared in hyperspectral classification, such as the autoencoder (AE) and the convolutional neural network (CNN). However, because labeled hyperspectral image samples are scarce, it is difficult to obtain an ideal training effect using labeled samples alone under small-sample conditions. As a result, existing hyperspectral image classification models are poorly trained and, when actually applied to hyperspectral image classification, struggle to fully extract the spatial-spectral feature information, so their classification accuracy is low.
Thus, the prior art is in need of improvement and enhancement.
Disclosure of Invention
The invention mainly aims to provide a hyperspectral image Transformer network training and classification method, so as to solve the problem that hyperspectral image classification models in the prior art are poorly trained.
In order to achieve the above object, the present invention provides a training method for a Transformer network for hyperspectral images, in which an adaptive multi-scale embedding layer is arranged in the Transformer network, the adaptive multi-scale embedding layer comprising a plurality of embedding modules. The training method comprises:
dividing a hyperspectral image sample along the spectral dimension to obtain a plurality of spectral sub-bands, wherein the hyperspectral image sample is an unlabeled training sample;
inputting each spectral sub-band into one embedding module to obtain the local spatial-spectral embedded features output by each embedding module, wherein the embedding module extracts the spatial-spectral features of the spectral sub-band at multiple scales;
fusing all the local spatial-spectral embedded features according to the positions of the spectral sub-bands to obtain a global spatial-spectral embedded feature;
inputting the global spatial-spectral embedded feature into a Transformer encoder, and masking and reconstructing the central region token in a Transformer decoder to train the Transformer encoder in a self-supervised manner.
Optionally, the embedding module comprises a small-scale branch for embedding small-scale features, a medium-scale branch for embedding medium-scale features, and a large-scale branch for embedding large-scale features, and inputting each spectral sub-band into one embedding module to obtain the local spatial-spectral embedded features output by each embedding module comprises:
inputting the spectral sub-band into the embedding module to obtain the small-scale embedded feature output by the small-scale branch, the medium-scale embedded feature output by the medium-scale branch, and the large-scale embedded feature output by the large-scale branch;
and fusing the large-scale embedded feature, the medium-scale embedded feature and the small-scale embedded feature to obtain the local spatial-spectral embedded feature.
Optionally, obtaining the large-scale embedded feature output by the large-scale branch comprises: performing a three-dimensional convolution operation on the spectral sub-band to extract a first spatial-spectral feature, performing a three-dimensional convolution operation on the first spatial-spectral feature to extract a second spatial-spectral feature, and splicing the second spatial-spectral features in the spectral dimension and flattening them to obtain the large-scale embedded feature;
obtaining the medium-scale embedded feature output by the medium-scale branch comprises: performing a three-dimensional convolution operation on the spectral sub-band to extract a third spatial-spectral feature, and fusing and flattening the third spatial-spectral features to obtain the medium-scale embedded feature.
Optionally, fusing the large-scale embedded feature, the medium-scale embedded feature and the small-scale embedded feature to obtain the local spatial-spectral embedded feature comprises:
fusing the large-scale embedded feature, the medium-scale embedded feature and the small-scale embedded feature using learnable weight parameters to obtain the local spatial-spectral embedded feature.
Optionally, after training the Transformer encoder, the method further comprises:
resetting the weight parameters, inputting the output of the trained Transformer encoder into a classifier, and retraining the trained Transformer encoder using labeled hyperspectral image samples.
Optionally, masking and reconstructing the central region token in the Transformer decoder to train the Transformer encoder in a self-supervised manner comprises:
replacing the central pixel of the training sample with a learnable vector;
reconstructing the central pixel and the training sample in the Transformer decoder;
and training the Transformer encoder according to the reconstruction loss of the central pixel and the reconstruction loss of the training sample.
In order to achieve the above object, the present invention further provides a hyperspectral image classification method based on a multiscale convolutional Transformer network, where the multiscale convolutional Transformer network includes an adaptive multiscale embedding layer and a Transformer encoder, the adaptive multiscale embedding layer includes multiple embedding modules, and the classification method includes:
training the multi-scale convolution Transformer network by adopting any one of the training methods for the Transformer network of the hyperspectral image in advance;
inputting a hyperspectral image into the multi-scale convolution Transformer network to obtain classification characteristics;
and inputting the classification features into a classifier to obtain a hyperspectral image classification result.
In order to achieve the above object, the present invention further provides a training apparatus for a Transformer network of a hyperspectral image, where the Transformer network includes an adaptive multi-scale embedding layer, the adaptive multi-scale embedding layer includes multiple embedding modules, and the apparatus includes:
a spectral sub-band module, configured to divide a hyperspectral image along the spectral dimension to obtain a plurality of spectral sub-bands, the hyperspectral image being an unlabeled training sample of the Transformer network;
a local spatial-spectral embedding feature module, configured to input each spectral sub-band into one embedding module to obtain the local spatial-spectral embedded features output by each embedding module, the embedding module extracting the spatial-spectral features of the spectral sub-band at multiple scales;
a global spatial-spectral embedding feature module, configured to fuse all the local spatial-spectral embedded features according to the positions of the spectral sub-bands to obtain a global spatial-spectral embedded feature;
and a self-supervised training module, configured to input the global spatial-spectral embedded feature into the Transformer encoder and to mask and reconstruct the central region token in the Transformer decoder so as to train the Transformer encoder in a self-supervised manner.
In order to achieve the above object, the present invention further provides an intelligent terminal, where the intelligent terminal includes a memory, a processor, and a training program of a Transformer network for hyperspectral images, the training program of the Transformer network for hyperspectral images being stored in the memory and being executable on the processor, and the training program of the Transformer network for hyperspectral images implements any one of the steps of the training method of the Transformer network for hyperspectral images when executed by the processor.
In order to achieve the above object, the present invention further provides a computer-readable storage medium, where a training program of a Transformer network for a hyperspectral image is stored, and when the training program of the Transformer network for the hyperspectral image is executed by a processor, the method includes any one of the steps of the training method of the Transformer network for the hyperspectral image.
According to the invention, the hyperspectral image bands are grouped and the spatial-spectral features of the spectral sub-bands are extracted at multiple scales, which effectively reduces the difficulty of feature extraction and allows the sample information to be fully extracted. Masking and reconstructing the central region token makes effective use of unlabeled samples, improves the Transformer network's ability to model the relations within a neighbourhood through self-supervised learning, and further improves the performance of the Transformer network under small-sample conditions. A better-trained Transformer network is thus obtained, and hyperspectral images can be classified with high accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the embodiments or the prior art description will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without inventive labor.
FIG. 1 is a schematic flow chart of an embodiment of a training method of a Transformer network for hyperspectral images provided by the invention;
FIG. 2 is a schematic diagram of a network model architecture of the multi-scale embedding layer of the embodiment of FIG. 1;
FIG. 3 is a schematic diagram of the embedded module of FIG. 2;
FIG. 4 is a schematic flow chart of obtaining large-scale embedded features in the embodiment of FIG. 1;
FIG. 5 is a schematic flow chart of obtaining mesoscale embedded features in the embodiment of FIG. 1;
FIG. 6 is a schematic diagram of a model architecture during training of a Transformer network;
FIG. 7 is a schematic flow chart of self-supervised training of the Transformer network;
FIG. 8 is a schematic structural diagram of a training apparatus of a transform network for hyperspectral images according to an embodiment of the present invention;
fig. 9 is a schematic block diagram of an internal structure of an intelligent terminal according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "once", "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining" or "in response to determining" or "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and it will be appreciated by those skilled in the art that the present invention may be practiced without departing from the spirit and scope of the present invention, and therefore the present invention is not limited by the specific embodiments disclosed below.
Hyperspectral images are acquired at long imaging distances, many interference factors are present during imaging, and the phenomena of the same object showing different spectra and different objects showing the same spectrum are widespread. Traditional classification models cannot model long-range dependencies within input samples during training, extract sample information insufficiently, and struggle to achieve an ideal training effect relying only on labeled samples under small-sample conditions. Traditional classification models are therefore poorly trained and can hardly classify hyperspectral images accurately.
In order to solve the problems of poor training, low classification accuracy and poor ground-object classification in prior-art hyperspectral image classification models, the invention provides a training method for a Transformer network for hyperspectral images. Divide-and-conquer feature extraction based on band grouping effectively reduces the difficulty of feature embedding and fully extracts the spatial-spectral features. Masking and reconstructing the central region token makes effective use of unlabeled samples, improves the Transformer network's ability to model the relations within a neighbourhood through self-supervised learning, and further improves the performance of the Transformer network under small-sample conditions. The spatial-spectral feature information of hyperspectral images can therefore be fully extracted, the training of the Transformer network is improved, and hyperspectral images are classified accurately.
Exemplary method
As shown in fig. 1, an embodiment of the present invention provides a training method for a Transformer network for hyperspectral images, deployed on an intelligent terminal. In the network architecture, an adaptive multi-scale embedding layer comprising a plurality of embedding modules is first arranged in front of the Transformer encoder in the Transformer network, and the spatial-spectral features extracted by the adaptive multi-scale embedding layer are input into the Transformer encoder. Each embedding module is configured to extract the spatial-spectral features of one spectral sub-band at multiple scales.
Specifically, the training method comprises the following steps:
step S100: segmenting a hyperspectral image sample based on spectral dimensionality to obtain a plurality of spectral sub-bands, wherein the hyperspectral image sample is a label-free training sample;
specifically, the training sample of the embodiment is an unlabeled hyperspectral image, and the unlabeled hyperspectral image is used for training the Transformer network so as to sufficiently learn the empty spectrum characteristics of the hyperspectral image, thereby improving the classification performance of the Transformer network.
Assume the hyperspectral image is H ∈ ℝ^{X×Y×B}, where X and Y are the height and width of the spatial dimensions of the hyperspectral image and B is its spectral dimension (i.e. the number of spectral bands). By dividing the hyperspectral image equally along the spectral dimension, a plurality of spectral sub-bands can be obtained, as shown in fig. 2.
Since the hyperspectral image is high-dimensional data, in order to avoid the curse of dimensionality, in this embodiment the original hyperspectral image is reduced in dimension by principal component analysis (PCA) before segmentation. Suppose the dimension-reduced hyperspectral image is H ∈ ℝ^{X×Y×K}, where X and Y are the length and width of the spatial dimensions (generally X = Y) and K is the spectral dimension retained after PCA dimensionality reduction. Specifically, in this embodiment X = Y = 13 and K = 80, i.e. 80 spectral bands remain after dimensionality reduction. The dimension-reduced hyperspectral image H ∈ ℝ^{X×Y×K} is then divided into k equal-length spectral sub-bands H'_i ∈ ℝ^{X×Y×K/k} (i = 1, 2, …, k).
In this example k = 8, giving the set of spectral sub-bands (H'_1, H'_2, H'_3, …, H'_8). The specific value of k is not limited and can be adjusted according to the actual situation.
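As a minimal Python sketch of this preprocessing and segmentation step (array names and the use of scikit-learn's PCA are illustrative assumptions; the embodiment's values K = 80 and k = 8 are used):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_and_split(hsi, K=80, k=8):
    """Reduce the spectral dimension to K with PCA, then split into k equal sub-bands."""
    X, Y, B = hsi.shape
    flat = hsi.reshape(-1, B)                                   # (X*Y, B) pixel spectra
    reduced = PCA(n_components=K).fit_transform(flat).reshape(X, Y, K)
    # equal-length segmentation along the spectral dimension: k sub-bands of K/k bands each
    return np.split(reduced, k, axis=2)

sub_bands = pca_and_split(np.random.rand(13, 13, 200))
print(len(sub_bands), sub_bands[0].shape)                       # 8 (13, 13, 10)
```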
Step S200: inputting each spectral sub-band into one embedding module to obtain the local spatial-spectral embedded features output by each embedding module, wherein the embedding module extracts the spatial-spectral features of the spectral sub-band at multiple scales;
Specifically, the number of embedding modules in the adaptive multi-scale embedding layer is set according to the number of spectral sub-bands, so that each embedding module extracts the spatial-spectral features of one spectral sub-band. The Transformer network has the advantage of extracting global information and can model long-range dependencies, but its extraction of local information is insufficient. In order to extract the spatial-spectral features of the spectral sub-bands more fully, the embedding module of the invention extracts the spatial-spectral features of a spectral sub-band at multiple scales and fuses all of them to obtain and output the local spatial-spectral embedded feature. This is better suited to feature extraction from hyperspectral images, and excellent model performance can be obtained both under small-sample training and with any number of training samples. The way in which the spatial-spectral features of the spectral sub-bands are extracted at multiple scales is not limited; for example, convolution operations of different scales can be used to extract spatial-spectral features of the spectral sub-bands at different scales.
In this embodiment, each embedding module extracts spatial-spectral features at three scales and comprises a small-scale branch for embedding small-scale features, a medium-scale branch for embedding medium-scale features, and a large-scale branch for embedding large-scale features. After a spectral sub-band is input into the embedding module, the small-scale embedded feature output by the small-scale branch, the medium-scale embedded feature output by the medium-scale branch, and the large-scale embedded feature output by the large-scale branch are obtained; the large-scale, medium-scale and small-scale embedded features are then fused to obtain the local spatial-spectral embedded feature.
In the small-scale branch, for an input spectral sub-band H'_i ∈ ℝ^{X×Y×K/k}, linear embedding is first used to embed the features of each spectrum in the sub-band, and the embedding result is then flattened in the spatial dimension to obtain the small-scale embedded feature:

Emb_small = Flatten(LinearEmbed(H'_i)),

where the flattened result Emb_small ∈ ℝ^{EmbNum×EmbDim}, the value of EmbNum is X × Y, and EmbDim is the model embedding dimension.
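As described above, the small-scale branch amounts to a per-pixel linear embedding of the sub-band spectrum followed by spatial flattening. A minimal PyTorch sketch is given below; the module name, the embedding dimension of 64 and the batch dimension are illustrative assumptions rather than the embodiment's exact settings:

```python
import torch
import torch.nn as nn

class SmallScaleBranch(nn.Module):
    """Linearly embed each pixel's sub-band spectrum, then flatten the spatial grid into tokens."""
    def __init__(self, band: int, emb_dim: int):
        super().__init__()
        self.embed = nn.Linear(band, emb_dim)

    def forward(self, x):            # x: (batch, X, Y, band)
        tokens = self.embed(x)       # (batch, X, Y, emb_dim)
        return tokens.flatten(1, 2)  # (batch, X*Y, emb_dim) = (batch, EmbNum, EmbDim)

x = torch.randn(2, 13, 13, 10)       # one spectral sub-band: 13x13 patch, K/k = 10 bands
print(SmallScaleBranch(band=10, emb_dim=64)(x).shape)  # torch.Size([2, 169, 64])
```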
Considering that the medium-scale branch and the large-scale branch have partial feature dependence, the feature embedding of the medium-scale branch and the large-scale branch is different from that of the small-scale branch, and as shown in fig. 3, a 3D convolution operation is also included.
As shown in fig. 4, obtaining the large-scale embedded feature output by the large-scale branch specifically comprises the following steps:
Step A210: performing a three-dimensional convolution operation on the spectral sub-band to extract a first spatial-spectral feature;
Step A220: performing a three-dimensional convolution operation on the first spatial-spectral feature to extract a second spatial-spectral feature;
Step A230: splicing the second spatial-spectral features in the spectral dimension and flattening them to obtain the large-scale embedded feature.
Specifically, a 3D convolution operation with a kernel size of (3, 3, 3) is first used to extract local spatial-spectral features, giving the first spatial-spectral feature; a 3D convolution operation with a kernel size of (3, 3, 3) is then performed again to obtain the second spatial-spectral feature; after splicing in the spectral dimension, the result is flattened in the spatial dimension to obtain the large-scale embedded feature Emb_large ∈ ℝ^{EmbNum×EmbDim}.
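A rough PyTorch sketch of this large-scale branch follows: two stacked 3-D convolutions extract spatial-spectral features, the channel and spectral axes are spliced, and the spatial grid is flattened into tokens. The channel count, padding and the final linear projection to EmbDim are assumptions that are not fixed by the embodiment:

```python
import torch
import torch.nn as nn

class LargeScaleBranch(nn.Module):
    """Two 3-D convolutions, spectral splicing, spatial flattening into tokens (sketch)."""
    def __init__(self, band: int, emb_dim: int, ch: int = 8):
        super().__init__()
        self.conv1 = nn.Conv3d(1, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(ch, ch, kernel_size=3, padding=1)
        self.proj = nn.Linear(ch * band, emb_dim)   # projection to EmbDim (assumed)

    def forward(self, x):                            # x: (batch, X, Y, band)
        v = x.permute(0, 3, 1, 2).unsqueeze(1)       # (batch, 1, band, X, Y)
        v = torch.relu(self.conv1(v))                # first spatial-spectral feature
        v = torch.relu(self.conv2(v))                # second spatial-spectral feature
        b, c, d, h, w = v.shape
        v = v.reshape(b, c * d, h * w).transpose(1, 2)  # splice channels/spectra, flatten spatially
        return self.proj(v)                          # (batch, X*Y, emb_dim)

x = torch.randn(2, 13, 13, 10)
print(LargeScaleBranch(band=10, emb_dim=64)(x).shape)  # torch.Size([2, 169, 64])
```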
as shown in fig. 5, the obtaining of the mesoscale embedded feature of the mesoscale branch output specifically includes:
step B210: performing three-dimensional convolution operation on the spectral sub-bands to extract a third spatial spectrum characteristic;
step B220: and fusing and flattening the third spatial spectrum characteristic to obtain a mesoscale embedded characteristic.
Specifically, a 3D convolution operation with a kernel size of (3, 3, 3) is first used for local spatial-spectral feature extraction, giving four third spatial-spectral feature maps; a one-dimensional convolution is then used to fuse and flatten these third spatial-spectral features to obtain the medium-scale embedded feature Emb_mid ∈ ℝ^{EmbNum×EmbDim}, where the value of EmbNum is X × Y and EmbDim is the model embedding dimension.
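A corresponding PyTorch sketch of the medium-scale branch is shown below: one 3-D convolution producing several feature maps (four in the embodiment), fused by a one-dimensional (1×1) convolution and flattened into tokens. Channel counts, kernel size and padding are assumptions:

```python
import torch
import torch.nn as nn

class MidScaleBranch(nn.Module):
    """One 3-D convolution, 1-D convolutional fusion, spatial flattening into tokens (sketch)."""
    def __init__(self, band: int, emb_dim: int, ch: int = 4):
        super().__init__()
        self.conv3d = nn.Conv3d(1, ch, kernel_size=3, padding=1)
        self.fuse = nn.Conv1d(ch * band, emb_dim, kernel_size=1)   # fuses the feature maps

    def forward(self, x):                            # x: (batch, X, Y, band)
        v = x.permute(0, 3, 1, 2).unsqueeze(1)       # (batch, 1, band, X, Y)
        v = torch.relu(self.conv3d(v))               # third spatial-spectral features
        b, c, d, h, w = v.shape
        v = self.fuse(v.reshape(b, c * d, h * w))    # (batch, emb_dim, X*Y)
        return v.transpose(1, 2)                     # (batch, X*Y, emb_dim)

x = torch.randn(2, 13, 13, 10)
print(MidScaleBranch(band=10, emb_dim=64)(x).shape)  # torch.Size([2, 169, 64])
```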
Further, considering that data sets differ in spatial resolution, learnable weight parameters (w_small, w_mid, w_large) are used in the multi-scale embedded feature fusion to fuse the large-scale embedded feature Emb_large, the medium-scale embedded feature Emb_mid and the small-scale embedded feature Emb_small, thereby adjusting the embedding weight of each branch feature to obtain the local spatial-spectral embedded feature:

Emb_local_i = w_small · Emb_small + w_mid · Emb_mid + w_large · Emb_large.
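A minimal sketch of this learnable-weight fusion of the three branch outputs (the initialisation and scalar parameterisation of w_small, w_mid and w_large are assumptions):

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Weighted sum of the three branch outputs with learnable scalars (sketch)."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))   # w_small, w_mid, w_large

    def forward(self, emb_small, emb_mid, emb_large):
        return self.w[0] * emb_small + self.w[1] * emb_mid + self.w[2] * emb_large

emb = torch.randn(2, 169, 64)
print(MultiScaleFusion()(emb, emb, emb).shape)   # torch.Size([2, 169, 64])
```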
After the k spectral sub-bands (H'_1, H'_2, …, H'_k) are input into the parallel embedding modules, the local spatial-spectral embedded features corresponding to all spectral sub-bands are obtained:

Emb_local = (Emb_local_1, Emb_local_2, …, Emb_local_k),

where Emb_local ∈ ℝ^{k×X'×Y'×Z'/k}, X' and Y' are the length and width of the spatial dimensions after local spatial-spectral feature extraction, and Z' is the spectral dimension after local spatial-spectral feature extraction.
It should be noted that the multi-scale branches of the embedding module can be freely selected according to the data characteristics of the hyperspectral image, giving the embedding module good adaptive capability. The embedding module can be expressed as:

Emb_local_i = SSEM_i(H'_i), i = 1, 2, …, k,

where Emb_local_i denotes the local spatial-spectral embedded feature corresponding to the i-th spectral sub-band, and the function SSEM_i denotes the local spatial-spectral embedding function corresponding to the i-th spectral sub-band. The local spatial-spectral embedding function SSEM_i (i = 1, 2, …, k) can be freely chosen according to the characteristics of the sample data, which gives great freedom to the feature embedding.
Step S300: fusing all the local spatial-spectral embedded features according to the positions of the spectral sub-bands to obtain a global spatial-spectral embedded feature;
Specifically, as shown in fig. 2, the local spatial-spectral embedded features Emb_local_i are fused in the order of the positions of the spectral sub-bands, for example by simply concatenating all the local spatial-spectral embedded features in the order of their sub-band positions and then flattening in the spatial dimension through two reshaping operations to obtain the global spatial-spectral embedded feature Emb_global:

Emb'_local = Reshape_1(Emb_local), Emb_global = Reshape_2(Emb'_local),

where Reshape_1 and Reshape_2 denote the flattening operations, the reshaped local spatial-spectral embedded feature Emb'_local ∈ ℝ^{X'×Y'×Z'}, the flattened global spatial-spectral embedded feature Emb_global ∈ ℝ^{EmbSize×Z'}, and the number of feature embeddings EmbSize = X' × Y'. Flattening is a routine technique in neural network modelling and is not described in detail here.
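A short sketch of step S300 under the shapes above: the k local embeddings are concatenated in sub-band order along the feature axis (Reshape_1) and then flattened over the spatial grid (Reshape_2). The per-sub-band feature size Z'/k = 64 and the batch dimension are illustrative assumptions:

```python
import torch

k, Xp, Yp, Zk = 8, 13, 13, 64                        # k sub-bands, X', Y', Z'/k (assumed)
local_embs = [torch.randn(2, Xp, Yp, Zk) for _ in range(k)]   # one local embedding per sub-band

emb_local = torch.cat(local_embs, dim=-1)            # Reshape_1: (batch, X', Y', Z')
emb_global = emb_local.flatten(1, 2)                 # Reshape_2: (batch, X'*Y', Z') = (batch, EmbSize, Z')
print(emb_global.shape)                              # torch.Size([2, 169, 512])
```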
Step S400: inputting the global spatial-spectral embedded feature into the Transformer encoder, and masking and reconstructing the central region token in the Transformer decoder to train the Transformer encoder in a self-supervised manner.
Specifically, to address the problem that existing classification models do not make effective use of unlabeled samples, the invention adopts masking and reconstruction based on the central region token. This makes effective use of unlabeled samples, improves the model's ability to model the relations within its neighbourhood through self-supervised learning, and further improves the performance of the classification model under small-sample conditions; the effective use of unlabeled samples also enhances the subsequent classification training with labeled samples.
Referring to fig. 6, in this embodiment the decoder consists of two layers of standard Transformer encoder layers. The global spatial-spectral embedded feature is input into the network, and the central region token is masked and reconstructed in the decoder to train the Transformer encoder in a self-supervised manner.
As shown in fig. 7, masking and reconstructing the central region token in the decoder to train the Transformer encoder in a self-supervised manner comprises the following specific steps:
Step S410: replacing the central pixel of the training sample with a learnable vector;
Step S420: reconstructing the central pixel and the training sample in the decoder;
Step S430: training the Transformer encoder according to the reconstruction loss of the central pixel and the reconstruction loss of the training sample.
Specifically, unlike self-supervised training in the field of computer vision, where an RGB image has no region that obviously needs to be attended to, the neighbourhood of a hyperspectral image (HSI) sample exists to enrich the feature expression of the central pixel. The mask target can therefore be the most important part of the training sample, namely the central pixel: the token L_center in the middle of the global spatial-spectral embedded feature sequence is replaced by a learnable vector V_learn to obtain a new feature representation sequence, and the training sample is then reconstructed. This is written as:

L = (L_1, L_2, …, L_center, …, L_k) → L' = (L_1, L_2, …, V_learn, …, L_k).

The masked sequence L' is input into the Transformer decoder, and an MLP head (MLP: a fully-connected network comprising a hidden layer) performs pixel-level sequence reconstruction in the Transformer decoder to obtain the overall reconstruction result R = (R_1, R_2, …, R_center, …, R_k) of the central pixel and the training sample:

R = Decoder(L'),

where R_i is the reconstruction result corresponding to the sequence element L_i.
That is, the training process of this embodiment is divided into two tasks. Task one is the main task: sequence reconstruction of the central pixel. Task two is an auxiliary task: reconstruction of the neighbourhood sample. This multi-task training strategy enables the model to better learn the relation between the central pixel and its neighbouring pixels without labels; at the same time, the auxiliary task acts as a regulariser and prevents the main task from collapsing into a trivial mode. The loss function of the training model can be expressed as:

Loss_center = MSE(R_center, Spectral_center),
Loss_sample = MSE(R, H),
Loss_total = w_center · Loss_center + w_sample · Loss_sample,

where Spectral_center denotes the spectral sequence of the central pixel of the input sample H, w_center and w_sample are the weights of the central-pixel reconstruction task and the sample reconstruction task in the total loss function, Loss_center is the loss value of the central-pixel reconstruction task, and Loss_sample is the loss value of the sample reconstruction task.
Most hyperspectral image classification methods are patch-based. The input of the classification model is not only the spectral curve of the central pixel, but also includes its neighboring region (generally a square region), making the input more distinctive. Therefore, it can be seen that the training method of the embodiment is easier to implement and is more suitable for hyperspectral data.
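The central-token masking and reconstruction objective described above can be sketched as follows. The encoder and decoder stand-ins, the MLP head, the mask position and the loss weights (w_center, w_sample) are illustrative assumptions rather than the embodiment's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, n_tokens, band = 64, 169, 80
center = n_tokens // 2                               # index of the central-pixel token

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True), 2)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True), 2)
mlp_head = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.GELU(), nn.Linear(emb_dim, band))
v_learn = nn.Parameter(torch.zeros(emb_dim))         # learnable vector replacing the center token

tokens = torch.randn(2, n_tokens, emb_dim)           # global spatial-spectral embedding
target = torch.randn(2, n_tokens, band)              # pixel spectra of the input sample H

encoded = encoder(tokens)
masked = encoded.clone()
masked[:, center, :] = v_learn                       # L -> L': mask the central-region token
recon = mlp_head(decoder(masked))                    # R = Decoder(L'), pixel-level reconstruction

loss_center = F.mse_loss(recon[:, center], target[:, center])   # main task
loss_sample = F.mse_loss(recon, target)                          # auxiliary task
loss_total = 0.7 * loss_center + 0.3 * loss_sample               # w_center, w_sample assumed
loss_total.backward()
```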
After training is completed, the Transformer decoder is stripped off for actual application; it is used only in the training stage. The Transformer encoder together with the adaptive multi-scale embedding layer forms the classification model, which can be used to classify and identify hyperspectral images. Training the Transformer network with unlabeled samples may also be called pre-training; on this basis, the classification model (i.e. the Transformer network with the Transformer decoder stripped off) can be further trained for classification with labeled samples, further improving the training effect and the classification accuracy.
In one embodiment, the classification training specifically comprises the following steps: and resetting the weight parameters adopted when the large-scale embedded features, the medium-scale embedded features and the small-scale embedded features are fused, inputting the hyperspectral image samples with labels into a Transformer network, and inputting the output results of the Transformer encoder into a classifier to train the Transformer encoder again. Through resetting the multi-branch weight parameters, a small amount of labeled samples are used for fine adjustment of the classification model, the training effect is further enhanced, and the classification precision of the trained classification model is higher. Training the neural network model by using the labeled samples is a conventional technical means in the field, and is not described herein again.
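A possible sketch of this fine-tuning stage, assuming the encoder and the MultiScaleFusion module from the sketches above; the data loader, pooling, optimiser settings and classifier head are all assumptions:

```python
import torch
import torch.nn as nn

def fine_tune(encoder, fusion, labelled_loader, n_classes, emb_dim=64, epochs=10):
    """Reset the multi-branch weights, attach a classifier, and retrain with labeled samples."""
    fusion.w.data.fill_(1.0)                         # reset the learnable branch weights
    classifier = nn.Linear(emb_dim, n_classes)
    params = list(encoder.parameters()) + list(fusion.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for tokens, labels in labelled_loader:       # tokens: (batch, n_tokens, emb_dim)
            logits = classifier(encoder(tokens).mean(dim=1))   # pooled features -> class scores
            loss = ce(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```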
In summary, in this embodiment an unlabeled hyperspectral image sample is first divided along the spectral dimension into a plurality of spectral sub-bands; each spectral sub-band is input into one embedding module, which extracts its spatial-spectral features at multiple scales to obtain the local spatial-spectral embedded features; all the local spatial-spectral embedded features are fused to obtain the global spatial-spectral embedded feature; the global spatial-spectral embedded feature is then input into the Transformer encoder, the long-range modelling capability of the Transformer is used to reconstruct the global spatial-spectral embedded feature, and the Transformer encoder is trained in a self-supervised manner by masking and reconstructing the central region token, yielding a trained Transformer network. The method fully extracts the feature information of unlabeled samples, enhances the training of the Transformer network, and achieves higher classification accuracy.
It should be noted that the specific type of Transformer encoder is not limited and can be chosen according to the practical application. In this embodiment, only the position encoding part of the Transformer encoder is removed.
On the basis of the Transformer network model constructed by the above training method, an embodiment of the invention also provides a hyperspectral image classification method based on a multi-scale convolutional Transformer network, where the multi-scale convolutional Transformer network is the hyperspectral image classification model and comprises the adaptive multi-scale embedding layer and the Transformer encoder. The adaptive multi-scale embedding layer is the same as in the training method and comprises a plurality of embedding modules. The classification method comprises the following steps: training the multi-scale convolutional Transformer network in advance using the above training method; inputting a hyperspectral image into the multi-scale convolutional Transformer network, where the spatial-spectral features are extracted at multiple scales, to obtain the classification features output by the network; and inputting the classification features into a classifier to obtain the hyperspectral image classification result. The classifier is not limited; for example, logistic regression, decision trees, random forests, support vector machines (SVM) and other classifiers can be used.
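A brief sketch of this classification stage using an SVM, one of the classifier options listed above; the network handle, pooling of the classification features and array names are assumptions:

```python
import torch
from sklearn.svm import SVC

def classify(network, train_patches, train_labels, test_patches):
    """Extract classification features with the trained network, then fit and apply an SVM."""
    with torch.no_grad():
        f_train = network(train_patches).mean(dim=1).numpy()   # pooled classification features
        f_test = network(test_patches).mean(dim=1).numpy()
    clf = SVC().fit(f_train, train_labels)
    return clf.predict(f_test)                                  # predicted land-cover labels
```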
In this embodiment, the hyperspectral image is divided into equal-length segments along the spectral dimension and input into the parallel embedding modules; after the local spatial-spectral embedded features are obtained, they are spliced according to the positions before segmentation and then input into the Transformer encoder layer to extract features for classification, giving the hyperspectral image classification result. Extracting the global spatial-spectral embedded features of the hyperspectral image with this divide-and-conquer strategy, i.e. divide-and-conquer feature embedding followed by global spatial-spectral feature reconstruction, effectively reduces the difficulty of spatial-spectral feature extraction, allows the feature information to be fully extracted, and improves the classification accuracy of hyperspectral images.
The complete implementation process is as follows. First, pre-training: principal component analysis is used to reduce the dimensionality of the original hyperspectral image, and the hyperspectral image data set is divided into a training set and a test set. Hyperspectral image samples are divided into equal-length segments and input into the parallel embedding modules; after the corresponding local spatial-spectral embedded features are obtained, they are spliced accordingly; the token generated from the central region is then removed and replaced at its position by a learnable vector; the whole sequence is input into a decoding module consisting of two layers of Transformer decoders, where the self-attention module captures the relations between tokens and exchanges information among them, so that the learnable vector can restore the spectral characteristics of the central region as far as possible. The spectral curve of the central pixel serves as the supervision target, and in this process the relation between the neighbourhood and the central pixel is learned. Then, the Transformer encoder is used as the classification model, the multi-branch weight parameters are reset, and the classification model is fine-tuned with a small number of labeled samples to obtain the trained classification model, which can then be used to classify ground objects.
Tables I and II below give the experimental results of several hyperspectral image classification methods on the Pavia University data set and the Gaofen-5 (GF-5) data set, respectively, with five training samples per class. Compared with existing hyperspectral image classification methods, the classification method of the invention achieves better classification accuracy.
Table I: Pavia University data set [results table not reproduced; available only as an image in the original]
Table II: Gaofen-5 (GF-5) data set

| Category | CNNHSI | HybridSN | ViT | SpectralFormer | SSFTT | SPRLT | Method of the invention |
|---|---|---|---|---|---|---|---|
| 1 | 73.13 | 73.10 | 49.52 | 66.96 | 80.76 | 90.61 | 88.83 |
| 2 | 98.91 | 99.39 | 88.06 | 99.65 | 99.46 | 98.60 | 99.09 |
| 3 | 60.89 | 63.83 | 51.45 | 82.79 | 89.74 | 80.13 | 85.27 |
| 4 | 64.17 | 65.84 | 79.18 | 90.43 | 91.84 | 93.49 | 95.79 |
| 5 | 70.94 | 76.11 | 81.96 | 98.94 | 94.60 | 97.23 | 98.40 |
| 6 | 82.00 | 81.30 | 74.08 | 84.59 | 87.06 | 83.34 | 86.38 |
| 7 | 49.24 | 53.39 | 56.59 | 63.93 | 57.77 | 64.97 | 59.19 |
| 8 | 74.03 | 71.46 | 71.86 | 82.80 | 81.37 | 85.58 | 84.67 |
| 9 | 87.24 | 86.38 | 71.48 | 69.56 | 85.94 | 83.88 | 88.63 |
| 10 | 93.38 | 78.79 | 67.99 | 76.94 | 82.89 | 81.67 | 91.21 |
| 11 | 79.57 | 83.79 | 47.55 | 83.07 | 92.92 | 81.42 | 92.47 |
| 12 | 79.10 | 79.43 | 63.60 | 73.97 | 80.98 | 77.28 | 85.04 |
| 13 | 75.31 | 83.64 | 70.84 | 84.24 | 91.49 | 87.26 | 92.38 |
| 14 | 77.43 | 72.18 | 62.29 | 88.33 | 79.60 | 86.76 | 83.12 |
| 15 | 80.42 | 62.96 | 59.97 | 92.86 | 70.18 | 88.63 | 63.28 |
| 16 | 89.64 | 88.24 | 74.55 | 77.82 | 84.09 | 87.70 | 90.66 |
| 17 | 93.76 | 92.49 | 78.40 | 86.28 | 94.44 | 84.65 | 97.98 |
| 18 | 56.53 | 55.43 | 50.83 | 66.32 | 69.16 | 57.62 | 69.09 |
| 19 | 74.82 | 81.22 | 59.47 | 61.99 | 82.97 | 75.79 | 84.21 |
| 20 | 95.14 | 81.23 | 75.08 | 77.69 | 90.46 | 72.31 | 85.85 |
| OA | 72.96 | 74.86 | 72.60 | 87.78 | 89.07 | 88.57 | 91.38 |
| AA | 77.78 | 76.51 | 66.74 | 80.46 | 84.39 | 82.95 | 86.08 |
| Kappa | 69.30 | 71.54 | 68.46 | 85.86 | 87.37 | 86.78 | 90.02 |
Exemplary device
As shown in fig. 8, corresponding to a training method for a Transformer network for a hyperspectral image, an embodiment of the present invention further provides a training apparatus for a Transformer network for a hyperspectral image, where the Transformer network includes an adaptive multi-scale embedding layer, the adaptive multi-scale embedding layer includes multiple embedding modules, and specifically, the apparatus includes:
a spectral sub-band module 600, configured to divide a hyperspectral image along the spectral dimension to obtain a plurality of spectral sub-bands, the hyperspectral image being an unlabeled training sample of the Transformer network;
a local spatial-spectral embedding feature module 610, configured to input each spectral sub-band into one embedding module to obtain the local spatial-spectral embedded features output by each embedding module, the embedding module extracting the spatial-spectral features of the spectral sub-band at multiple scales;
a global spatial-spectral embedding feature module 620, configured to fuse all the local spatial-spectral embedded features according to the positions of the spectral sub-bands to obtain a global spatial-spectral embedded feature;
a self-supervised training module 630, configured to input the global spatial-spectral embedded feature into the Transformer encoder and to mask and reconstruct the central region token in the Transformer decoder so as to train the Transformer encoder in a self-supervised manner.
In this embodiment, the training apparatus for the Transformer network for the hyperspectral image may refer to the corresponding description in the training method for the Transformer network for the hyperspectral image, and is not described herein again.
Based on the above embodiment, the present invention further provides an intelligent terminal, and a schematic block diagram thereof may be as shown in fig. 9. The intelligent terminal comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. Wherein, the processor of the intelligent terminal is used for providing calculation and control capability. The memory of the intelligent terminal comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a training program for a Transformer network of hyperspectral images. The internal memory provides an environment for the operating system in the non-volatile storage medium and the running of the training program for the Transformer network of the hyperspectral image. The network interface of the intelligent terminal is used for being connected and communicated with an external terminal through a network. The training program of the Transformer network for the hyperspectral image realizes the steps of any one of the training methods of the Transformer network for the hyperspectral image when being executed by a processor. The display screen of the intelligent terminal can be a liquid crystal display screen or an electronic ink display screen.
It will be understood by those skilled in the art that the block diagram of fig. 9 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the intelligent terminal to which the solution of the present invention is applied, and a specific intelligent terminal may include more or less components than those shown in the figure, or combine some components, or have different arrangements of components.
In one embodiment, an intelligent terminal is provided, where the intelligent terminal includes a memory, a processor, and a training program of a Transformer network for hyperspectral images, stored on the memory and executable on the processor, and when executed by the processor, the training program of the Transformer network for hyperspectral images performs the following operations:
dividing a hyperspectral image sample along the spectral dimension to obtain a plurality of spectral sub-bands, wherein the hyperspectral image sample is an unlabeled training sample;
inputting each spectral sub-band into one embedding module to obtain the local spatial-spectral embedded features output by each embedding module, wherein the embedding module extracts the spatial-spectral features of the spectral sub-band at multiple scales;
fusing all the local spatial-spectral embedded features according to the positions of the spectral sub-bands to obtain a global spatial-spectral embedded feature;
inputting the global spatial-spectral embedded feature into a Transformer encoder, and masking and reconstructing the central region token in a Transformer decoder to train the Transformer encoder in a self-supervised manner.
Optionally, the embedding module comprises a small-scale branch for embedding small-scale features, a medium-scale branch for embedding medium-scale features, and a large-scale branch for embedding large-scale features, and inputting each spectral sub-band into one embedding module to obtain the local spatial-spectral embedded features output by each embedding module comprises:
inputting the spectral sub-band into the embedding module to obtain the small-scale embedded feature output by the small-scale branch, the medium-scale embedded feature output by the medium-scale branch, and the large-scale embedded feature output by the large-scale branch;
and fusing the large-scale embedded feature, the medium-scale embedded feature and the small-scale embedded feature to obtain the local spatial-spectral embedded feature.
Optionally, obtaining the large-scale embedded feature output by the large-scale branch comprises: performing a three-dimensional convolution operation on the spectral sub-band to extract a first spatial-spectral feature, performing a three-dimensional convolution operation on the first spatial-spectral feature to extract a second spatial-spectral feature, and splicing the second spatial-spectral features in the spectral dimension and flattening them to obtain the large-scale embedded feature;
obtaining the medium-scale embedded feature output by the medium-scale branch comprises: performing a three-dimensional convolution operation on the spectral sub-band to extract a third spatial-spectral feature, and fusing and flattening the third spatial-spectral features to obtain the medium-scale embedded feature.
Optionally, fusing the large-scale embedded feature, the medium-scale embedded feature and the small-scale embedded feature to obtain the local spatial-spectral embedded feature comprises:
fusing the large-scale embedded feature, the medium-scale embedded feature and the small-scale embedded feature using learnable weight parameters to obtain the local spatial-spectral embedded feature.
Optionally, after training the Transformer encoder, the method further comprises:
resetting the weight parameters, inputting the output of the trained Transformer encoder into a classifier, and retraining the trained Transformer encoder using labeled hyperspectral image samples.
Optionally, masking and reconstructing the central region token in the Transformer decoder to train the Transformer encoder in a self-supervised manner comprises:
replacing the central pixel of the training sample with a learnable vector;
reconstructing the central pixel and the training sample in the Transformer decoder;
and training the Transformer encoder according to the reconstruction loss of the central pixel and the reconstruction loss of the training sample.
The embodiment of the invention further provides a computer-readable storage medium, wherein a training program of a Transformer network for a hyperspectral image is stored in the computer-readable storage medium, and when the training program of the Transformer network for the hyperspectral image is executed by a processor, the steps of any one of the methods for training the Transformer network for the hyperspectral image provided by the embodiment of the invention are realized.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art would appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the above modules or units is only one logical division, and the actual implementation may be implemented by another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The integrated modules/units described above may be stored in a computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the methods of the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the contents of the computer-readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction.
The above-mentioned embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features thereof may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be construed as being included therein.

Claims (10)

1. A training method for a Transformer network of hyperspectral images, characterized in that an adaptive multi-scale embedding layer is arranged in the Transformer network, the adaptive multi-scale embedding layer comprises a plurality of embedding modules, and the training method comprises the following steps:
segmenting a hyperspectral image sample based on spectral dimensionality to obtain a plurality of spectral sub-bands, wherein the hyperspectral image sample is a label-free training sample;
inputting each spectrum sub-band into one embedding module respectively to obtain the local space spectrum embedding features output by each embedding module, wherein the embedding module extracts space spectrum features of the spectrum sub-band at a plurality of scales;
fusing all the local space spectrum embedding features according to the positions of the spectrum sub-bands to obtain global space spectrum embedding features;
inputting the global space spectrum embedding features into a Transformer encoder, and performing masking and reconstruction on a central region token in a Transformer decoder to train the Transformer encoder in a self-supervised manner.
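Purely as an illustration of the flow in claim 1 above, the sketch below splits an input cube along the spectral dimension, embeds each sub-band with its own embedding module, and concatenates the local embeddings by sub-band position into the global embedding fed to the encoder. The even band grouping via torch.chunk, the hypothetical factory make_embedding_module, and the class name MultiSubbandEmbedding are assumptions, not details fixed by the claim:

```python
import torch
import torch.nn as nn

class MultiSubbandEmbedding(nn.Module):
    """Sketch: one embedding module per spectral sub-band; local embeddings are
    fused by sub-band position into the global space spectrum embedding features."""

    def __init__(self, num_subbands, make_embedding_module):
        super().__init__()
        # make_embedding_module() is a hypothetical factory returning one embedding module.
        self.embedders = nn.ModuleList(make_embedding_module() for _ in range(num_subbands))

    def forward(self, cube):                                        # cube: (B, bands, H, W)
        subbands = torch.chunk(cube, len(self.embedders), dim=1)    # split along the spectral dimension
        local_feats = [emb(sb) for emb, sb in zip(self.embedders, subbands)]  # local embeddings
        return torch.cat(local_feats, dim=1)                        # concatenate by sub-band position
```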
2. The training method of the Transformer network for hyperspectral images according to claim 1, wherein each embedding module comprises a small-scale branch for embedding small-scale features, a medium-scale branch for embedding medium-scale features and a large-scale branch for embedding large-scale features, and the inputting each spectrum sub-band into one embedding module to obtain the local space spectrum embedding features output by each embedding module comprises the following steps:
inputting the spectrum sub-band into an embedding module, and respectively obtaining a small-scale embedding feature output by the small-scale branch, a medium-scale embedding feature output by the medium-scale branch and a large-scale embedding feature output by the large-scale branch;
and fusing the large-scale embedding feature, the medium-scale embedding feature and the small-scale embedding feature to obtain the local space spectrum embedding features.
3. The training method of the Transformer network for hyperspectral images according to claim 2, wherein the obtaining of the large-scale embedding feature output by the large-scale branch comprises: performing a three-dimensional convolution operation on the spectrum sub-band to extract first space spectrum features, performing a three-dimensional convolution operation on the first space spectrum features to extract second space spectrum features, and concatenating and flattening the second space spectrum features along the spectral dimension to obtain the large-scale embedding feature;
and the obtaining of the medium-scale embedding feature output by the medium-scale branch comprises: performing a three-dimensional convolution operation on the spectrum sub-band to extract third space spectrum features, and fusing and flattening the third space spectrum features to obtain the medium-scale embedding feature.
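The sketch below gives one possible reading of the large-scale and medium-scale branches in claims 2 and 3: two stacked three-dimensional convolutions for the large-scale branch and a single three-dimensional convolution for the medium-scale branch, each followed by flattening into token embeddings. Kernel sizes, channel counts, the ReLU activations, and the class names are assumptions:

```python
import torch
import torch.nn as nn

class LargeScaleBranch(nn.Module):
    """Sketch: two stacked 3-D convolutions, then flatten into tokens."""
    def __init__(self, out_channels: int = 8):
        super().__init__()
        self.conv1 = nn.Conv3d(1, out_channels, kernel_size=(7, 3, 3), padding=(3, 1, 1))            # first space spectrum features
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=(7, 3, 3), padding=(3, 1, 1)) # second space spectrum features

    def forward(self, subband):                         # subband: (B, bands, H, W)
        x = subband.unsqueeze(1)                        # add a channel axis for Conv3d
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        return x.flatten(start_dim=2).transpose(1, 2)   # (B, tokens, features)

class MediumScaleBranch(nn.Module):
    """Sketch: a single 3-D convolution, then flatten into tokens."""
    def __init__(self, out_channels: int = 8):
        super().__init__()
        self.conv = nn.Conv3d(1, out_channels, kernel_size=(5, 3, 3), padding=(2, 1, 1))             # third space spectrum features

    def forward(self, subband):
        x = torch.relu(self.conv(subband.unsqueeze(1)))
        return x.flatten(start_dim=2).transpose(1, 2)
```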
4. The training method of the Transformer network for hyperspectral images according to claim 2, wherein the fusing the large-scale embedding feature, the medium-scale embedding feature and the small-scale embedding feature to obtain the local space spectrum embedding features comprises:
fusing the large-scale embedding feature, the medium-scale embedding feature and the small-scale embedding feature by adopting learnable weight parameters to obtain the local space spectrum embedding features.
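One way to realize the learnable-weight fusion in claim 4 is a small module holding one learnable weight per scale, as in the sketch below; the softmax normalization of the weights is an added assumption, not something stated in the claim:

```python
import torch
import torch.nn as nn

class LearnableScaleFusion(nn.Module):
    """Sketch: fuse small-, medium- and large-scale embeddings with learnable weights."""
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(3))       # one learnable weight per scale

    def forward(self, small, medium, large):
        w = torch.softmax(self.weights, dim=0)           # keep the weights normalized (assumption)
        return w[0] * small + w[1] * medium + w[2] * large
```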
5. The training method of the Transformer network for hyperspectral images of claim 4, wherein after training the Transformer encoder, further comprising:
and resetting the weight parameters, inputting the output result of the trained Transformer encoder into a classifier, and retraining the trained Transformer encoder by adopting the hyperspectral image sample with the label.
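The two-stage use described in claim 5 (self-supervised pre-training followed by supervised retraining with a classifier on labeled samples) could look roughly like the following sketch; the optimizer, the linear classifier head, the mean-pooling of tokens, and the loop structure are assumptions:

```python
import torch
import torch.nn as nn

def fine_tune(embed_layer, encoder, num_classes, labeled_loader, embed_dim, epochs=50, lr=1e-3):
    """Sketch: reuse the pre-trained encoder, add a classifier, retrain on labeled samples.
    Resetting the learnable fusion weights (claim 5) would be done before this call."""
    classifier = nn.Linear(embed_dim, num_classes)
    params = list(embed_layer.parameters()) + list(encoder.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for patches, labels in labeled_loader:
            tokens = encoder(embed_layer(patches))        # global space spectrum embeddings -> encoded tokens
            logits = classifier(tokens.mean(dim=1))       # pool tokens before classification (assumption)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```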
6. The training method of the Transformer network for hyperspectral images according to claim 1, wherein the masking and reconstruction of the central region token in the Transformer decoder to train the Transformer encoder in a self-supervised manner comprises:
replacing the central pixel of the training sample with a learnable vector;
reconstructing the central pixel and the training sample in the Transformer decoding layer;
and training the Transformer encoder according to the reconstruction loss of the central pixel and the reconstruction loss of the training sample.
7. A hyperspectral image classification method based on a multiscale convolutional Transformer network, wherein the multiscale convolutional Transformer network comprises an adaptive multiscale embedding layer and a Transformer encoder, the adaptive multiscale embedding layer comprises a plurality of embedding modules, and the classification method comprises the following steps:
training the multiscale convolutional Transformer network in advance by using the training method of the Transformer network for hyperspectral images according to any one of claims 1 to 6;
inputting a hyperspectral image into the multi-scale convolution Transformer network to obtain classification characteristics;
and inputting the classification features into a classifier to obtain a hyperspectral image classification result.
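For claim 7, the inference path could be as simple as the sketch below, reusing the pre-trained embedding layer, encoder, and classifier from the training method; the mean-pooling of tokens before the classifier is an assumption:

```python
import torch

@torch.no_grad()
def classify(embed_layer, encoder, classifier, patches):
    """Sketch of the classification path: embed, encode, then classify."""
    tokens = encoder(embed_layer(patches))       # classification features
    logits = classifier(tokens.mean(dim=1))      # pooled tokens -> class scores
    return logits.argmax(dim=1)                  # per-patch class indices
```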
8. Training apparatus for a Transformer network of hyperspectral images, the Transformer network comprising an adaptive multi-scale embedding layer, the adaptive multi-scale embedding layer comprising a plurality of embedding modules, the apparatus comprising:
the spectrum sub-band module is used for segmenting a hyperspectral image based on spectrum dimensionality to obtain a plurality of spectrum sub-bands, and the hyperspectral image is an unlabeled training sample of the Transformer network;
the local space spectrum embedding feature module is used for inputting each spectrum sub-band into one embedding module respectively to obtain the local space spectrum embedding features output by each embedding module, wherein the embedding module extracts space spectrum features of the spectrum sub-band at multiple scales;
the global space spectrum embedding feature module is used for fusing all the local space spectrum embedding features according to the positions of the spectrum sub-bands to obtain global space spectrum embedding features;
and the self-supervised training module is used for inputting the global space spectrum embedding features into the Transformer encoder and performing masking and reconstruction on the central region token in the Transformer decoder, so as to train the Transformer encoder in a self-supervised manner.
9. An intelligent terminal, characterized in that the intelligent terminal comprises a memory, a processor and a training program of a Transformer network for hyperspectral images, which is stored on the memory and can run on the processor, wherein the training program of the Transformer network for hyperspectral images realizes the steps of the training method of the Transformer network for hyperspectral images according to any one of claims 1 to 6 when executed by the processor.
10. Computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a training program of a Transformer network for hyperspectral images, which training program, when executed by a processor, implements the steps of the training method of a Transformer network for hyperspectral images as claimed in any of claims 1 to 6.
CN202211318387.6A 2022-10-26 2022-10-26 Hyperspectral image transform network training and classifying method Pending CN115565071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211318387.6A CN115565071A (en) 2022-10-26 2022-10-26 Hyperspectral image transform network training and classifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211318387.6A CN115565071A (en) 2022-10-26 2022-10-26 Hyperspectral image transform network training and classifying method

Publications (1)

Publication Number Publication Date
CN115565071A true CN115565071A (en) 2023-01-03

Family

ID=84767823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211318387.6A Pending CN115565071A (en) 2022-10-26 2022-10-26 Hyperspectral image transform network training and classifying method

Country Status (1)

Country Link
CN (1) CN115565071A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115983280A (en) * 2023-01-31 2023-04-18 烟台大学 Multi-modal emotion analysis method and system for uncertain modal loss
CN115983280B (en) * 2023-01-31 2023-08-15 烟台大学 Multi-mode emotion analysis method and system for uncertain mode deletion
CN116486160A (en) * 2023-04-25 2023-07-25 北京卫星信息工程研究所 Hyperspectral remote sensing image classification method, equipment and medium based on spectrum reconstruction
CN116486160B (en) * 2023-04-25 2023-12-19 北京卫星信息工程研究所 Hyperspectral remote sensing image classification method, equipment and medium based on spectrum reconstruction
CN117115553A (en) * 2023-09-13 2023-11-24 南京审计大学 Hyperspectral remote sensing image classification method based on mask spectral space feature prediction
CN117115553B (en) * 2023-09-13 2024-01-30 南京审计大学 Hyperspectral remote sensing image classification method based on mask spectral space feature prediction
CN117934975A (en) * 2024-03-21 2024-04-26 安徽大学 Full-variation regular guide graph convolution unsupervised hyperspectral image classification method
CN117934975B (en) * 2024-03-21 2024-06-07 安徽大学 Full-variation regular guide graph convolution unsupervised hyperspectral image classification method
CN118247588A (en) * 2024-05-29 2024-06-25 南京信息工程大学 Hyperspectral image classification method based on multi-scale feature attention

Similar Documents

Publication Publication Date Title
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN107564025B (en) Electric power equipment infrared image semantic segmentation method based on deep neural network
CN115565071A (en) Hyperspectral image transform network training and classifying method
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
Mao et al. Deep residual pooling network for texture recognition
Yuan et al. Gated CNN: Integrating multi-scale feature layers for object detection
Kang et al. Deep learning-based weather image recognition
CN109740686A (en) A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features
CN111080678B (en) Multi-temporal SAR image change detection method based on deep learning
CN111461039A (en) Landmark identification method based on multi-scale feature fusion
Tan et al. Multi-branch convolutional neural network for built-up area extraction from remote sensing image
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114694039A (en) Remote sensing hyperspectral and laser radar image fusion classification method and device
CN112686902A (en) Two-stage calculation method for brain glioma identification and segmentation in nuclear magnetic resonance image
CN111639697B (en) Hyperspectral image classification method based on non-repeated sampling and prototype network
CN110866938A (en) Full-automatic video moving object segmentation method
CN115661777A (en) Semantic-combined foggy road target detection algorithm
James et al. Malayalam handwritten character recognition using AlexNet based architecture
Bressan et al. Semantic segmentation with labeling uncertainty and class imbalance
Ates et al. Multi-hypothesis contextual modeling for semantic segmentation
CN114463772B (en) Deep learning-based traffic sign detection and identification method and system
CN113269171B (en) Lane line detection method, electronic device and vehicle
CN115713624A (en) Self-adaptive fusion semantic segmentation method for enhancing multi-scale features of remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination