CN114758170A - Three-branch three-attention mechanism hyperspectral image classification method combined with D3D - Google Patents

Three-branch three-attention mechanism hyperspectral image classification method combined with D3D

Info

Publication number
CN114758170A
CN114758170A (application CN202210344115.7A; granted as CN114758170B)
Authority
CN
China
Prior art keywords
branch
model
attention
spatial
training
Prior art date
Legal status
Granted
Application number
CN202210344115.7A
Other languages
Chinese (zh)
Other versions
CN114758170B (en)
Inventor
潘新 (Pan Xin)
唐婷 (Tang Ting)
刘江平 (Liu Jiangping)
罗小玲 (Luo Xiaoling)
Current Assignee
Inner Mongolia Agricultural University
Original Assignee
Inner Mongolia Agricultural University
Priority date
Filing date
Publication date
Application filed by Inner Mongolia Agricultural University filed Critical Inner Mongolia Agricultural University
Priority: CN202210344115.7A
Publication of CN114758170A
Application granted
Publication of CN114758170B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/10Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture


Abstract

The invention belongs to the technical field of spectral image classification and discloses a three-branch, three-attention-mechanism hyperspectral image classification method combined with deformable 3D convolution (D3D). A three-branch three-attention network combined with deformable 3D convolution, D3DTBTA-Net, is constructed to extract the spectral and spatial information of a hyperspectral image. D3DTBTA-Net is divided into a spectral branch, a spatial X branch, and a spatial Y branch; after the spectral, spatial X, and spatial Y feature maps are respectively extracted, the feature maps from the three branches are fused for classification. The method classifies automatically with a trained deep learning model, requiring no manual parameter input and avoiding the large time and labor cost of labeling data. The deformable 3D convolution and the three-branch three-attention mechanism extract more discriminative features, improving classification accuracy and maintaining good classification performance even when the number of training samples is limited.

Description

Three-branch three-attention mechanism hyperspectral image classification method combined with D3D
Technical Field
The invention belongs to the technical field of spectral image classification, and particularly relates to a three-branch three-attention mechanism hyperspectral image classification method combined with D3D.
Background
At present, hyperspectral images offer nanoscale spectral resolution, can reflect subtle differences between ground objects in the spectral dimension, and greatly improve the ability to distinguish and identify ground objects. Hyperspectral image classification uses the rich information contained in a hyperspectral image to assign a unique class label to each pixel, and is an important aspect of hyperspectral image application. However, hyperspectral data is high-dimensional, and the phenomena of "same object, different spectra" and "same spectrum, different objects" make the image data structure highly nonlinear, with strong correlation between adjacent bands and between adjacent pixels. Meanwhile, labels in hyperspectral images are scarce, the number of training samples is often limited, and the curse of dimensionality easily arises. Therefore, how to extract strongly discriminative features and achieve accurate classification under small-sample conditions is the key to hyperspectral image classification.
Traditional machine learning methods generally use only spectral information and ignore the rich spatial information of hyperspectral images, so classification accuracy is low; in addition, labeling data requires a great deal of time and labor. When methods based on convolutional neural networks (CNNs), or on improved deeper networks, extract features, the sampling positions of the convolution kernel are usually fixed and the receptive field cannot be dynamically adjusted to the actual conditions of the image, so features cannot be extracted well and classification performance is limited. Existing deep-learning-based hyperspectral image classification methods also achieve low accuracy on small-sample data. Therefore, it is necessary to design a new hyperspectral image classification method to overcome these defects of the prior art.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Traditional machine learning methods use only spectral information and ignore the rich spatial information of hyperspectral images, so classification accuracy is low, and labeling data requires a great deal of time and labor.
(2) Deep-learning-based hyperspectral image classification methods achieve low classification accuracy on small-sample data.
(3) When methods based on CNNs or improved deeper networks extract features, the sampling positions of the convolution kernel are usually fixed and the receptive field cannot be dynamically adjusted to the actual conditions of the image, so features cannot be extracted well and classification performance is limited.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a three-branch, three-attention-mechanism hyperspectral image classification method combined with D3D (D3DTBTA-Net), together with a corresponding system, medium, device, and terminal, with the goal of solving the low classification accuracy of small-sample hyperspectral image classification algorithms in the prior art.
The invention is realized in such a way that a three-branch three-attention mechanism hyperspectral image classification method combined with D3D comprises the following steps: constructing a three-branch three-attention mechanism network D3DTBTA-Net combined with deformable 3D convolution for extracting spectral information and spatial information of the hyperspectral image; the three-branch three-attention mechanism network D3DTBTA-Net respectively extracts a spectral feature map, a spatial X feature map and a spatial Y feature map by utilizing three branches, and performs feature map fusion and classification; the three branches are a spectral branch, a spatial X branch and a spatial Y branch.
Further, the three-branch three-attention hyperspectral image classification method combined with the D3D comprises the following steps of:
step one, generating a data set: generating a set of three-dimensional cubes, and randomly dividing the set of three-dimensional cubes into a training set, a verification set and a test set;
step two, training a model and verifying the model: the training set is used for updating parameters of multiple iterations, and the verification set is used for monitoring the performance of the model and selecting the model which is best trained;
step three, prediction: and selecting a test set to verify the effectiveness of the training model to obtain a classification result.
Further, the data set generation in the first step includes:
selecting a central pixel $x_i$ and its $p \times p$ neighboring pixels from the raw data, generating a set of three-dimensional cube blocks $Z = \{z_1, z_2, \ldots, z_N\} \in \mathbb{R}^{p \times p \times b}$; if the target pixel is located at the edge of the image, setting the missing neighboring pixel values to zero; in the D3DTBTA-Net algorithm, p is the patch size, set to 9, and b is the number of spectral bands; randomly dividing the three-dimensional cube set into a training set $X_{train}$, a validation set $X_{val}$, and a test set $X_{test}$, the corresponding label vectors being divided into $Y_{train}$, $Y_{val}$, and $Y_{test}$; only spatial information around the target pixel is used.
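The patch-generation step above can be sketched as follows. This is a minimal illustration in Python/NumPy with hypothetical names (`extract_cubes`, `random_split`), assuming the HSI is stored as an H × W × b array and label 0 marks unlabeled background:

```python
import numpy as np

def extract_cubes(hsi, labels, patch_size=9):
    """Cut a p x p x b cube around every labeled pixel; zero-pad at image edges."""
    h, w, b = hsi.shape
    r = patch_size // 2
    padded = np.pad(hsi, ((r, r), (r, r), (0, 0)), mode="constant")  # zeros at edges
    cubes, ys = [], []
    for i in range(h):
        for j in range(w):
            if labels[i, j] == 0:            # 0 = unlabeled background
                continue
            cubes.append(padded[i:i + patch_size, j:j + patch_size, :])
            ys.append(labels[i, j] - 1)
    return np.stack(cubes), np.array(ys)

def random_split(n, train_frac=0.03, val_frac=0.03, seed=0):
    """Random train / validation / test split (e.g. 3% / 3% / 94%)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_val = int(n * train_frac), int(n * val_frac)
    return idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
```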
Further, the training model and the verification model in the second step include:
training a model and verifying the model by using a D3DTBTA-Net algorithm, wherein the D3DTBTA-Net algorithm is divided into three branches: the spectrum branch, the space X branch and the space Y branch are respectively used for capturing a spectrum characteristic diagram, a space X characteristic diagram and a space Y characteristic diagram and fusing the three acquired characteristic diagrams for classification; wherein the spectrum branch comprises a Dense spectrum block and a spectrum attention block; the space X branch comprises a Dense space X block and a space X attention block; the space Y branch contains a Dense space Y block and a space Y attention block.
Further, the following basic modules are used in the second step:
(1) 3D-CNN with BN: the 3D-CNN with batch normalization (BN) is a common element in deep learning models based on 3D cube blocks. For $n_m$ input feature maps of size $p_m \times p_m \times b_m$, a 3D-CNN layer containing $k_{m+1}$ kernels of size $\alpha_{m+1} \times \alpha_{m+1} \times d_{m+1}$ produces $n_{m+1}$ output feature maps of size $p_{m+1} \times p_{m+1} \times b_{m+1}$. The i-th output of the (m+1)-th 3D-CNN layer with BN is calculated as:

$$\hat{x}_j^m = \frac{x_j^m - \mathrm{E}(x_j^m)}{\sqrt{\mathrm{Var}(x_j^m)}}$$

$$x_i^{m+1} = R\Big(\sum_{j=1}^{n_m} \hat{x}_j^m * w_i^{m+1} + b_i^{m+1}\Big)$$

where $x_j^m$ is the j-th input feature map of layer (m+1) and $\hat{x}_j^m$ is its output after the BN of layer m; $\mathrm{E}(\cdot)$ and $\mathrm{Var}(\cdot)$ represent the expectation and variance functions of the input, respectively; $w_i^{m+1}$ and $b_i^{m+1}$ respectively represent the weights and bias of the (m+1)-th 3D-CNN layer; $*$ is the 3D convolution operation; and $R(\cdot)$ is the activation function that introduces the nonlinearity of the network.
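This unit maps directly onto standard PyTorch primitives. A minimal sketch follows (the class name `Conv3dBN` is ours; Mish is used as the activation R(·), matching the activation adopted later in the document):

```python
import torch
import torch.nn as nn

class Conv3dBN(nn.Module):
    """One 3D convolution + batch norm + activation, as in the formulas above."""
    def __init__(self, in_ch, out_ch, kernel, stride=(1, 1, 1), padding=0):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride=stride, padding=padding)
        self.bn = nn.BatchNorm3d(out_ch)   # normalizes with E(x) and Var(x)
        self.act = nn.Mish()               # the activation R(.)

    def forward(self, x):                  # x: (batch, in_ch, D, H, W)
        return self.act(self.bn(self.conv(x)))
```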
(2) DenseNet dense connection: the dense block is the basic unit in DenseNet, and the output of the l-th dense block is calculated as:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where $H_l$ is a block containing a convolutional layer, an activation layer, and a BN layer, and $x_0, x_1, \ldots, x_{l-1}$ are the feature maps generated by the preceding dense blocks; the more connections, the more information flows in the dense network. A dense network with L layers has L(L+1)/2 connections, whereas a conventional convolutional network with the same number of layers has only L direct connections.
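A sketch of such a dense stack, reusing the `Conv3dBN` sketch above (the class name `DenseBlock3d` is ours; with 24 input channels, growth 12, and 3 layers, it reproduces the 24 → 60 channel growth described later in the embodiments; mapping the spectral axis to the depth dimension of Conv3d is an assumption):

```python
class DenseBlock3d(nn.Module):
    """l layers; layer i consumes the channel-wise concat of all previous outputs."""
    def __init__(self, in_ch=24, growth=12, n_layers=3,
                 kernel=(7, 1, 1), padding=(3, 0, 0)):
        super().__init__()
        self.layers = nn.ModuleList(
            Conv3dBN(in_ch + i * growth, growth, kernel, padding=padding)
            for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            # x_l = H_l([x_0, x_1, ..., x_{l-1}])
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)      # 24 + 3*12 = 60 channels out
```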
(3) An attention mechanism is as follows:
spectral attention mapping
Figure RE-GDA0003666577280000041
Is directly inputted from the initial
Figure RE-GDA0003666577280000042
Calculating, wherein p × p is the size of the input block, and c represents the number of input channels; a and A are reactedTPerforming matrix multiplication to obtain channel attention mapping
Figure RE-GDA0003666577280000043
Connecting the softmax layer as:
Figure RE-GDA0003666577280000044
wherein x isjiRepresenting the influence of the ith channel on the jth channel; mixing XTThe result of the matrix multiplication with A is transformed into
Figure RE-GDA0003666577280000045
Weighting the reconstructed result through the parameter of the scale alpha, and adding the input A to obtain a final spectrum attention chart
Figure RE-GDA0003666577280000046
Figure RE-GDA0003666577280000047
Where α is initialized to zero and learned step by step. The final plot E contains a weighted sum of all channel features to describe a dependency relationship that enhances the discriminability of the features.
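A sketch of this channel-wise (spectral) attention in PyTorch, following the two formulas above (the class name `SpectralAttention` is assumed; the max-subtraction stability trick used by some implementations is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralAttention(nn.Module):
    """Channel attention: E = alpha * softmax(A A^T) A + A."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))   # initialized to zero, learned

    def forward(self, a):                   # a: (batch, c, p, p)
        b, c, h, w = a.shape
        flat = a.view(b, c, h * w)          # (b, c, n)
        attn = torch.bmm(flat, flat.transpose(1, 2))   # (b, c, c): A . A^T
        attn = F.softmax(attn, dim=-1)      # x_ji over source channels
        out = torch.bmm(attn, flat).view(b, c, h, w)
        return self.alpha * out + a
```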
The spatial attention block: given an input feature map $A \in \mathbb{R}^{c \times p \times p}$, two convolution layers respectively generate new feature maps $B, C \in \mathbb{R}^{c \times p \times p}$, which are reshaped to $\mathbb{R}^{c \times n}$, where $n = p \times p$ is the number of pixels. Matrix multiplication is performed between $B^T$ and C, a softmax layer is added, and the spatial attention map $S \in \mathbb{R}^{n \times n}$ is calculated:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{n} \exp(B_i \cdot C_j)}$$

where $s_{ji}$ represents the influence of the i-th pixel on the j-th pixel; the closer the feature representations of two pixels, the stronger the correlation between them.

At the same time, the initial input feature A is fed into a convolution layer to obtain a new feature map $D \in \mathbb{R}^{c \times p \times p}$, which is reshaped to $\mathbb{R}^{c \times n}$. Matrix multiplication is performed between D and $S^T$, and the result is reshaped back to $\mathbb{R}^{c \times p \times p}$:

$$E_j = \beta \sum_{i=1}^{n} (s_{ji} D_i) + A_j$$

where the initial value of $\beta$ is zero, and more weight is gradually learned and assigned. The final feature $E \in \mathbb{R}^{c \times p \times p}$ is obtained by adding the weighted features at all positions to the original features, so the context information in the spatial dimension is modeled by E.
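A matching sketch of the position (spatial) attention block (the class name `SpatialAttention` is assumed; 1 × 1 convolutions stand in for the unspecified convolution layers):

```python
class SpatialAttention(nn.Module):
    """Position attention: E = beta * D softmax(B^T C)^T + A."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, 1)   # produces B
        self.key = nn.Conv2d(channels, channels, 1)     # produces C
        self.value = nn.Conv2d(channels, channels, 1)   # produces D
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, a):                    # a: (batch, c, p, p)
        b, c, h, w = a.shape
        n = h * w
        B = self.query(a).view(b, c, n)
        C = self.key(a).view(b, c, n)
        D = self.value(a).view(b, c, n)
        s = F.softmax(torch.bmm(B.transpose(1, 2), C), dim=-1)  # (b, n, n): s_ji
        out = torch.bmm(D, s.transpose(1, 2)).view(b, c, h, w)
        return self.beta * out + a
```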
(4) Deformable 3D convolution: a deformable convolution dynamically adjusts the size of the receptive field according to the actual conditions of the image. An input feature of size C × H × W is passed through a 3D-CNN with kernel size p × q × r to generate an offset feature of size 3N × C × H × W, where N = p × q × r is the size of the sampling grid; along the channel dimension there are 3N values, which represent the deformation of the D3D sampling grid. The learned offset features are applied to deform the 3D-CNN sampling grid into the D3D sampling grid, which is then used to generate the output features.
D3D is represented by the following formula:

$$y(p_0) = \sum_{n=1}^{N} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $\Delta p_n$ represents the offset corresponding to the n-th value in the p × q × r convolutional sampling grid. Since the offsets are usually fractional, bilinear interpolation is used to generate exact values. The bilinear interpolation formula is:

$$x(p) = \sum_{q} G(q, p) \cdot x(q)$$

where q enumerates the integral spatial positions of the feature map and $G(\cdot, \cdot)$ is the bilinear interpolation kernel.
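A full deformable 3D convolution is involved to implement; the sketch below is a simplified, assumption-laden illustration of the mechanism only: an offset convolution predicts 3 displacements per sampling-grid point, the input is resampled with trilinear `grid_sample`, and (for brevity) the channel-mixing of a true convolution is reduced to a per-channel weighting. The (d, h, w) offset channel ordering is an assumption.

```python
class DeformableConv3d(nn.Module):
    """Simplified D3D sketch: learned offsets deform the sampling grid."""
    def __init__(self, channels, kernel=(3, 3, 3)):
        super().__init__()
        self.kernel = kernel
        n = kernel[0] * kernel[1] * kernel[2]           # N grid points
        pad = tuple(k // 2 for k in kernel)
        self.offset_conv = nn.Conv3d(channels, 3 * n, kernel, padding=pad)
        nn.init.zeros_(self.offset_conv.weight)          # start as a regular grid
        nn.init.zeros_(self.offset_conv.bias)
        self.weight = nn.Parameter(torch.randn(n, channels) * 0.01)

    def forward(self, x):                                # x: (b, c, D, H, W)
        b, c, D, H, W = x.shape
        n = self.weight.shape[0]
        offsets = self.offset_conv(x).view(b, n, 3, D, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates (x, y, z order).
        zs = torch.linspace(-1, 1, D, device=x.device)
        ys = torch.linspace(-1, 1, H, device=x.device)
        xs = torch.linspace(-1, 1, W, device=x.device)
        gz, gy, gx = torch.meshgrid(zs, ys, xs, indexing="ij")
        base = torch.stack((gx, gy, gz), dim=-1)         # (D, H, W, 3)

        out = 0.0
        kd, kh, kw = self.kernel
        idx = 0
        for dz in range(kd):
            for dy in range(kh):
                for dx in range(kw):
                    off = offsets[:, idx]                # (b, 3, D, H, W), (d,h,w)
                    # fixed kernel displacement + learned offset, normalized
                    disp = torch.stack((
                        (dx - kw // 2 + off[:, 2]) * 2 / max(W - 1, 1),
                        (dy - kh // 2 + off[:, 1]) * 2 / max(H - 1, 1),
                        (dz - kd // 2 + off[:, 0]) * 2 / max(D - 1, 1),
                    ), dim=-1)                           # (b, D, H, W, 3)
                    grid = base.unsqueeze(0) + disp
                    sampled = F.grid_sample(x, grid, align_corners=True)
                    out = out + sampled * self.weight[idx].view(1, c, 1, 1, 1)
                    idx += 1
        return out
```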
(5) Mish activation function: the activation function adopted by D3DTBTA is Mish, a self-regularized non-monotonic activation function, rather than the traditional relu(x) = max(0, x). The formula of Mish is:

$$\mathrm{mish}(x) = x \times \tanh(\mathrm{softplus}(x)) = x \times \tanh(\ln(1 + e^x))$$

where x represents the input. Mish is unbounded above, and its lower bound is approximately −0.31, i.e. its range is [≈ −0.31, ∞). The derivative of Mish is defined as:

$$\frac{d}{dx}\,\mathrm{mish}(x) = \frac{e^x \omega}{\delta^2}$$

where $\omega = 4(x+1) + 4e^{2x} + e^{3x} + e^x(4x + 6)$ and $\delta = 2e^x + e^{2x} + 2$.
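Mish is a one-liner on top of standard primitives (equivalent to PyTorch's built-in nn.Mish):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    return x * torch.tanh(F.softplus(x))
```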
(6) Regarding the selection of the optimal weights: during training, the model with the highest accuracy on the validation set is selected as the output; if validation accuracies are equal, the model with the smallest loss on the validation set is selected. The best model is saved at each iteration; if the model of the next iteration is better, it replaces the previously saved model, and otherwise the saved model is kept.
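A sketch of this selection rule as a checkpointing helper (the names are ours):

```python
import torch

best_acc, best_loss = -1.0, float("inf")

def maybe_save(model, val_acc, val_loss, path="best_model.pt"):
    """Keep the checkpoint with the highest val accuracy; break ties by lower loss."""
    global best_acc, best_loss
    if val_acc > best_acc or (val_acc == best_acc and val_loss < best_loss):
        best_acc, best_loss = val_acc, val_loss
        torch.save(model.state_dict(), path)   # replaces the previous best
```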
The learning rate is dynamically adjusted by cosine annealing, as shown in the following formula:

$$\eta_t = \eta_{\min}^i + \frac{1}{2}\left(\eta_{\max}^i - \eta_{\min}^i\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$

where $\eta_t$ is the learning rate within the i-th run and lies in the range $[\eta_{\min}^i, \eta_{\max}^i]$; $T_{cur}$ counts the number of iterations that have been performed, and $T_i$ controls the number of iterations in one adjustment period.
Further, the predicting in step three includes:
An HSI data set $A = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{1 \times 1 \times p}$ is composed of N labeled pixels, where p is the number of bands, and the corresponding class label set is $Y = \{y_1, y_2, \ldots, y_N\}$, where each label is a one-hot vector over the q land-cover categories.
In HSI classification, the quantitative measure of the difference between the predicted result and the true value is the cross-entropy loss function, defined as:

$$\mathcal{L}(y, \hat{y}) = -\sum_{i=1}^{L} y_i \log \hat{y}_i$$

where $\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L]$ is the label vector predicted by the model, $y = [y_1, y_2, \ldots, y_L]$ is the true label vector, and L is the length of the label vector.
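The loss can be written explicitly for one-hot labels; in practice PyTorch's nn.CrossEntropyLoss, which fuses the softmax and the log, would normally be used instead. This explicit form mirrors the formula:

```python
import torch

def cross_entropy(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """y_true: one-hot labels (batch, q); y_pred: softmax probabilities (batch, q)."""
    return -(y_true * torch.log(y_pred + 1e-12)).sum(dim=1).mean()
```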
Another object of the present invention is to provide a three-branch three-attention hyperspectral image classification system combined with D3D, which includes:
the data set generating module is used for generating a set of three-dimensional cubes and randomly dividing the set of three-dimensional cubes into a training set, a verification set and a test set;
the model training and verifying module is used for updating parameters of multiple iterations through a training set, monitoring the performance of the model by using a verifying set and selecting the model with the best training;
and the prediction module is used for selecting the test set to verify the effectiveness of the training model and obtain a classification result.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
constructing a three-branch three-attention mechanism network D3DTBTA-Net combined with deformable 3D convolution for extracting the spectral and spatial information of the hyperspectral image; the D3DTBTA-Net is divided into three branches: a spectral branch, a spatial X branch, and a spatial Y branch; after the spectral, spatial X, and spatial Y feature maps are respectively extracted, the feature maps extracted by the three branches are fused for classification.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
constructing a three-branch three-attention mechanism network D3DTBTA-Net combined with deformable 3D convolution for extracting the spectral and spatial information of a hyperspectral image; the D3DTBTA-Net is divided into three branches: the spectral branch, the spatial X branch, and the spatial Y branch respectively extract the spectral, spatial X, and spatial Y feature maps, after which the feature maps extracted by the three branches are fused for classification.
Another object of the present invention is to provide an information data processing terminal for implementing the three-branch three-attention hyperspectral image classification system in combination with D3D.
With regard to the technical solutions and the technical problems to be solved, the advantages and positive effects of the technical solutions to be protected by the present invention are analyzed from the following aspects:
First, regarding the technical problems in the prior art and the difficulty of solving them: the technical problems solved by the technical solution of the present invention are closely combined with the results and data obtained during research and development, bringing creative technical effects once solved. The specific description is as follows:
the invention provides a three-branch three-attention mechanism network D3DTBTA-Net combined with deformable 3D convolution, which can enhance feature extraction and fully extract spectral information and spatial information of a hyperspectral image, thereby improving the classification precision of the hyperspectral image on the premise of a small sample. The D3DTBTA-Net of the invention is divided into three branches: and respectively extracting a spectral feature map, a spatial X feature map and a spatial Y feature map from the spectral branch, the spatial X branch and the spatial Y branch, and fusing the feature maps extracted from the three branches for classification. A comparison experiment with other classification methods shows that the D3DTBTA-Net is suitable for classifying the hyperspectral images of small samples and can obtain better classification performance.
Secondly, considering the technical solution as a whole or from the perspective of products, the technical effects and advantages of the technical solution to be protected by the present invention are specifically described as follows:
the method can automatically classify according to the trained deep learning model without inputting any parameter and consuming a large amount of time cost and labor cost to label data; the characteristics with more discriminative power can be extracted through the deformable 3D convolution and the three-branch three-attention mechanism, so that the classification precision is improved, the problem of dimension disasters is solved, and the good classification performance can be still kept under the condition that the number of training samples is limited.
Third, as an inventive supplementary proof of the claims of the present invention, there are also presented several important aspects:
the expected income and commercial value after the technical scheme of the invention is converted are as follows: at present, the remote sensing technology is widely applied to the fields of agriculture, forestry, geology, oceans, meteorology, hydrology, military affairs, environmental protection and the like. The method improves the classification precision of the remote sensing image, can be applied to various fields, for example, the method is applied to agricultural production, can dynamically monitor the growth vigor of crops, monitor the diseases and the insect pests of the crops, estimate the yield of the crops and the like, and in the agricultural production, the remote sensing technology can periodically observe and cover a large area to obtain ground information, thereby greatly saving the labor cost and reducing the errors caused by the labor factors, and further promoting the agricultural modernization process of China.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a three-branch three-attention hyperspectral image classification method in combination with D3D according to an embodiment of the invention;
FIG. 2 is a block diagram of a three-branch three-attention hyperspectral image classification system combined with D3D according to an embodiment of the invention;
FIG. 3 is a flow chart of the D3DTBTA-Net algorithm provided by the embodiment of the present invention;
FIG. 4 is a schematic diagram of a calculation process of a spectral attention map provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a deformable 3D convolution provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a network structure of a D3DTBTA according to an embodiment of the present invention;
FIG. 7 is an experimental plot on an Indian Pines (IP) data set provided by an embodiment of the present invention; wherein FIG. 7(a) is a pseudo-color image; FIG. 7(b) corresponds to a label; fig. 7(c) SVM (68.75%); FIG. 7(d) CDCNN (64.21%); FIG. 7(e) SSRN (91.59%); FIG. 7(f) FDSSC (93.85%); FIG. 7(g) DBDA (91.32%); FIG. 7(h) D3DTBTA (95.74%);
in the figure: 1. a data set generation module; 2. a model training and verification module; 3. and a prediction module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to solve the problems in the prior art, the invention provides a three-branch three-attention hyperspectral image classification method combined with D3D, and the invention is described in detail with reference to the accompanying drawings.
First, the embodiments are explained. This section provides explanatory embodiments expanding on the claims so that those skilled in the art can fully understand how the present invention is realized.
Example 1
This embodiment addresses the low classification accuracy of small-sample hyperspectral image classification algorithms in the prior art. The invention provides a three-branch three-attention mechanism network combined with deformable 3D convolution, D3DTBTA-Net, which enhances feature extraction and fully extracts the spectral and spatial information of a hyperspectral image, thereby improving the classification accuracy of hyperspectral images under small-sample conditions. The D3DTBTA-Net is divided into three branches: the spectral branch, the spatial X branch, and the spatial Y branch, which respectively extract the spectral, spatial X, and spatial Y feature maps; the feature maps extracted by the three branches are then fused for classification. Comparison experiments with other classification methods show that D3DTBTA-Net is suitable for classifying small-sample hyperspectral images and achieves better classification performance.
As shown in fig. 1, the three-branch three-attention hyperspectral image classification method combined with D3D according to the embodiment of the invention includes the following steps:
s101, generating a data set: randomly dividing a three-dimensional cube set into a training set, a verification set and a test set;
s102, training a model and verifying the model: the training set is used for updating parameters of multiple iterations, and the verification set is used for monitoring the performance of the model and selecting the model which is best trained;
s103, predicting: and selecting a test set to verify the effectiveness of the training model to obtain a classification result.
As shown in fig. 2, the three-branch three-attention hyperspectral image classification system combined with D3D provided by the embodiment of the invention includes:
the data set generating module 1 is used for generating a set of three-dimensional cubes and randomly dividing the set of three-dimensional cubes into a training set, a verification set and a test set;
the model training and verifying module 2 is used for updating parameters of multiple iterations through a training set, monitoring the performance of the model by using a verifying set and selecting the model with the best training;
and the prediction module 3 is used for selecting the test set to verify the effectiveness of the training model and obtain a classification result.
The process of the D3DTBTA-Net algorithm provided by the embodiment of the invention comprises three steps: data set generation, training and validation, and prediction. Fig. 3 illustrates the overall algorithm flow of the method of the present invention.
Suppose the HSI data set $A = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{1 \times 1 \times p}$ is made up of N labeled pixels, where p is the number of bands, and the corresponding class label set is $Y = \{y_1, y_2, \ldots, y_N\}$, where each label is a one-hot vector over the q land-cover categories.
Step 1, generating a data set. Select a central pixel $x_i$ and its $p \times p$ neighboring pixels from the raw data, generating a set of three-dimensional cube blocks $Z = \{z_1, z_2, \ldots, z_N\} \in \mathbb{R}^{p \times p \times b}$. If the target pixel is located at an edge of the image, the missing neighboring pixel values are set to zero. In the D3DTBTA-Net algorithm, p is the patch size, set to 9 in this method, and b is the number of spectral bands. The three-dimensional cube set is then randomly divided into a training set $X_{train}$, a validation set $X_{val}$, and a test set $X_{test}$; the corresponding label vectors are divided into $Y_{train}$, $Y_{val}$, and $Y_{test}$. Since the labels of neighboring pixels are not visible to the network, only spatial information around the target pixel is used.
Step 2, training the model and validating the model. The training set is used to update the parameters over multiple iterations, while the validation set is used to monitor the performance of the model and select the best-trained model. In step 2, training and validation use the D3DTBTA-Net algorithm of the invention, which is divided into three branches: the spectral branch, the spatial X branch, and the spatial Y branch, used respectively to capture the spectral, spatial X, and spatial Y feature maps, after which the three feature maps are fused for classification. The spectral branch contains a Dense spectral block and a spectral attention block; the spatial X branch contains a Dense spatial X block and a spatial X attention block; the spatial Y branch contains a Dense spatial Y block and a spatial Y attention block.
Step 3, prediction. The test set is used to verify the effectiveness of the trained model, obtaining the classification result. In HSI classification, a common quantitative measure of the difference between the predicted result and the true value is the cross-entropy loss function, defined as:

$$\mathcal{L}(y, \hat{y}) = -\sum_{i=1}^{L} y_i \log \hat{y}_i$$

where $\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L]$ is the label vector predicted by the model and $y = [y_1, y_2, \ldots, y_L]$ is the true label vector.
Further, the following basic modules are used in step 2.
(1) 3D-CNN with BN. The 3D-CNN with BN is a common element in deep learning models based on 3D cube blocks. For $n_m$ input feature maps of size $p_m \times p_m \times b_m$, a 3D-CNN layer containing $k_{m+1}$ kernels of size $\alpha_{m+1} \times \alpha_{m+1} \times d_{m+1}$ produces $n_{m+1}$ output feature maps of size $p_{m+1} \times p_{m+1} \times b_{m+1}$. The i-th output of the (m+1)-th 3D-CNN layer with BN is calculated as:

$$\hat{x}_j^m = \frac{x_j^m - \mathrm{E}(x_j^m)}{\sqrt{\mathrm{Var}(x_j^m)}}$$

$$x_i^{m+1} = R\Big(\sum_{j=1}^{n_m} \hat{x}_j^m * w_i^{m+1} + b_i^{m+1}\Big)$$

where $x_j^m$ is the j-th input feature map of layer (m+1) and $\hat{x}_j^m$ is its output after the BN of layer m. $\mathrm{E}(\cdot)$ and $\mathrm{Var}(\cdot)$ represent the expectation and variance functions of the input, respectively. $w_i^{m+1}$ and $b_i^{m+1}$ represent the weights and bias of the (m+1)-th 3D-CNN layer, $*$ is the 3D convolution operation, and $R(\cdot)$ represents the activation function that introduces the nonlinearity of the network.
(2) DenseNet dense connections. Generally, the more convolutional layers, the better the performance of the network. However, once the network reaches a certain depth, adding layers no longer improves performance but degrades the network: the accuracy on the training set gradually saturates, or even decreases, as the number of layers grows. DenseNet is an effective way to solve this problem.
The dense block is the basic unit in DenseNet, and the output of the l-th dense block is calculated as:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \qquad (4)$$

where $H_l$ is a block containing a convolutional layer, an activation layer, and a BN layer, and $x_0, x_1, \ldots, x_{l-1}$ are the feature maps generated by the preceding dense blocks; the more connections, the more information flows in the dense network. Specifically, a dense network with L layers has L(L+1)/2 connections, whereas a conventional convolutional network with the same number of layers has only L direct connections.
(3) Attention mechanism. One drawback of 3D-CNN is that all spatial pixels and spectral bands carry equal weight in the spatial and spectral domains. Obviously, different spectral bands and spatial pixels contribute differently to the extracted features.
As shown in FIG. 4, the spectral attention map $X \in \mathbb{R}^{c \times c}$ is calculated directly from the initial input $A \in \mathbb{R}^{c \times p \times p}$, where $p \times p$ is the size of the input block and c represents the number of input channels. First, A (reshaped to $\mathbb{R}^{c \times n}$, with $n = p \times p$) and $A^T$ are matrix-multiplied to obtain the channel attention map $X \in \mathbb{R}^{c \times c}$, to which a softmax layer is connected:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{c} \exp(A_i \cdot A_j)} \qquad (5)$$

where $x_{ji}$ indicates the influence of the i-th channel on the j-th channel. Second, the result of the matrix multiplication of $X^T$ with A is reshaped back to $\mathbb{R}^{c \times p \times p}$. Finally, the reshaped result is weighted by the scale parameter $\alpha$ and added to the input A to obtain the final spectral attention map $E \in \mathbb{R}^{c \times p \times p}$:

$$E_j = \alpha \sum_{i=1}^{c} (x_{ji} A_i) + A_j \qquad (6)$$

where $\alpha$ is initialized to zero and can be learned step by step. The final map E contains a weighted sum of all channel features, which describes the dependencies between channels and enhances the discriminability of the features.
The spatial attention block. Given an input feature map $A \in \mathbb{R}^{c \times p \times p}$, two convolution layers respectively generate new feature maps $B, C \in \mathbb{R}^{c \times p \times p}$. First, B and C are reshaped to $\mathbb{R}^{c \times n}$, where $n = p \times p$ is the number of pixels. Second, matrix multiplication is performed between $B^T$ and C, a softmax layer is added, and the spatial attention map $S \in \mathbb{R}^{n \times n}$ is calculated:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{n} \exp(B_i \cdot C_j)} \qquad (7)$$

where $s_{ji}$ represents the influence of the i-th pixel on the j-th pixel; the closer the feature representations of two pixels, the stronger the correlation between them.

At the same time, the initial input feature A is fed into a convolution layer to obtain a new feature map $D \in \mathbb{R}^{c \times p \times p}$, which is reshaped to $\mathbb{R}^{c \times n}$. Finally, matrix multiplication is performed between D and $S^T$, and the result is reshaped back to $\mathbb{R}^{c \times p \times p}$:

$$E_j = \beta \sum_{i=1}^{n} (s_{ji} D_i) + A_j \qquad (8)$$

where the initial value of $\beta$ is zero, and more weight can be learned and assigned step by step. According to equation (8), the final feature $E \in \mathbb{R}^{c \times p \times p}$ is obtained by adding the weighted features at all positions to the original features. Thus, the context information in the spatial dimension is modeled by E.
(4) Deformable 3D convolution. When CNN-based methods, or improved deeper networks, extract features, the sampling positions of the convolution kernels are usually fixed grids; for very complex objects with different scales or shapes, traditional convolutional-neural-network-based methods cannot effectively extract features from complex structures, which limits classification performance. A deformable convolution can dynamically adjust the size of the receptive field according to the actual conditions of the image, and thus extract features better. Deformable convolution is generally two-dimensional; Deformable 3D Convolution (D3D Convolution, D3D) fuses deformable convolution and the 3D-CNN together, thereby significantly improving the deformation modeling capability of CNNs. D3D can expand the spatial field of view through learnable offset variables. As shown in fig. 5, an input feature of size C × H × W is first passed through a 3D-CNN with kernel size p × q × r to generate an offset feature of size 3N × C × H × W (where N = p × q × r is the size of the sampling grid). Along its channel dimension there are 3N values, representing the deformation of the D3D sampling grid. The learned offset features are then used to deform the 3D-CNN sampling grid to generate the D3D sampling grid. Finally, the D3D sampling grid is used to produce the output features. D3D can be expressed by the following formula:

$$y(p_0) = \sum_{n=1}^{N} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \qquad (9)$$

where $\Delta p_n$ represents the offset corresponding to the n-th value in the p × q × r convolutional sampling grid. Since offset variables are usually fractional, bilinear interpolation is used to generate accurate values. The bilinear interpolation formula is:

$$x(p) = \sum_{q} G(q, p) \cdot x(q) \qquad (10)$$

where q enumerates the integral spatial positions of the feature map and $G(\cdot, \cdot)$ is the bilinear interpolation kernel.
(5) The Mish activation function. The activation function adopted by D3DTBTA is Mish, a self-regularized non-monotonic activation function, rather than the traditional relu(x) = max(0, x). The formula of Mish is:

$$\mathrm{mish}(x) = x \times \tanh(\mathrm{softplus}(x)) = x \times \tanh(\ln(1 + e^x)) \qquad (11)$$

where x represents the input. Mish is unbounded above, and its lower bound is approximately −0.31. The derivative of Mish is defined as:

$$\frac{d}{dx}\,\mathrm{mish}(x) = \frac{e^x \omega}{\delta^2} \qquad (12)$$

where $\omega = 4(x+1) + 4e^{2x} + e^{3x} + e^x(4x + 6)$ and $\delta = 2e^x + e^{2x} + 2$.
(6) Regarding the selection of the optimal weights: during training, the model with the highest accuracy on the validation set is selected as the output; if validation accuracies are equal, the model with the lowest validation loss is selected. The best model is saved at each iteration; if the model of the next iteration is better, it replaces the previously saved model, and otherwise the saved model is kept.
The learning rate is an important hyperparameter for training the network, and a dynamic learning rate can help the network avoid local minima. The learning rate is dynamically adjusted by cosine annealing as follows:

$$\eta_t = \eta_{\min}^i + \frac{1}{2}\left(\eta_{\max}^i - \eta_{\min}^i\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right) \qquad (13)$$

where $\eta_t$ is the learning rate within the i-th run and lies in the range $[\eta_{\min}^i, \eta_{\max}^i]$; $T_{cur}$ counts the number of iterations that have been performed, and $T_i$ controls the number of iterations in one adjustment period.
Example 2
The network structure of the D3DTBTA is shown in fig. 6. For convenience, the upper branch is called the spectral branch, and the lower branch is called the spatial X branch and the spatial Y branch, respectively. And respectively inputting the spectrum branch, the space X branch and the space Y branch to obtain a spectrum characteristic diagram and a space characteristic diagram. And then, obtaining a classification result by adopting the fusion operation of the spectrum, the space X characteristic diagram and the space Y characteristic diagram.
The following section describes the spectral branch, the spatial X branch, the spatial Y branch, and the operation of fusing spectral and spatial features, taking the Indian Pines (IP) data set as an example. The sample cube size is 9 × 9 × 200. In matrices such as (9 × 9 × 97, 24) mentioned below, 9 × 9 × 97 denotes the height, width, and depth of the 3D cube, and 24 denotes the number of 3D cubes generated by the 3D-CNN. The IP data set contains 145 × 145 pixels with 200 spectral bands, i.e. the size of IP is 145 × 145 × 200. Only 10249 pixels have a corresponding label; the other pixels are background.
Because an HSI has a very large number of spectral channels, many of which are redundant for classification, HSI classification algorithms generally perform a dimensionality reduction operation first, reducing redundancy and improving classification accuracy. D3DTBTA first uses a 3D-CNN layer with convolution kernel size 1 × 1 × 7 and stride (1, 1, 2) to reduce the number of channels, obtaining a (9 × 9 × 97, 8) feature map; it then enhances the features with a deformable 3D convolution with kernel size 3 × 3 × 3, and then uses a 3D-CNN layer with kernel size 1 × 1 × 7 to capture a (9 × 9 × 97, 24) feature map as the input feature map of the three branches.
The captured feature map of size (9 × 9 × 97, 24) is input to the spectral branch, first passing through 3D-CNN Dense spectral blocks with BN; each Dense spectral block has 12 channels in its 3D-CNN, with convolution kernel size 1 × 1 × 7. After the Dense spectral blocks, the number of channels of the feature map, following the dense connection of equation (4), increases to 60, and the size of the feature map is (9 × 9 × 97, 60). Next, after a last 3D-CNN with convolution kernel size 1 × 1 × 97, a (9 × 9 × 1, 60) feature map is generated. However, these 60 channels contribute differently to the classification. To refine the spectral features, a spectral attention block is employed, which emphasizes the weight of useful information and reduces the weight of redundant information. After the weighted spectral feature map is obtained, the features are enhanced by a deformable 3D convolution of size 3 × 3 × 1, and BN and dropout layers are then used to improve stability and robustness. Finally, a 1 × 60 feature map is obtained through a global average pooling layer. The implementation details of the spectral branch are shown in Table 1.
Table 1. Implementation details of the spectral branch.
Meanwhile, the (9 × 9 × 97, 24) feature map is input to the spatial X branch and passed through 3D-CNN Dense spatial X blocks with BN. Each 3D-CNN in a Dense spatial X block has 12 channels, with convolution kernel size 3 × 1 × 1. Next, the (9 × 9 × 1, 60) feature map is input to the spatial X attention block, which weights the coefficients of each pixel, thereby obtaining a more discriminative spatial X feature map. After the weighted spatial X feature map is obtained, the features are enhanced by a deformable 3D convolution of size 3 × 3 × 1, and a 1 × 60 spatial X feature map is then obtained through the BN layer, the dropout layer, and the global average pooling layer. The implementation details of the spatial X branch are shown in Table 2.
Table 2. Implementation details of the spatial X branch.
Likewise, the (9 × 9 × 97, 24) feature map is input to the spatial Y branch and passed through 3D-CNN Dense spatial Y blocks with BN. Each 3D-CNN in a Dense spatial Y block has 12 channels, with convolution kernel size 1 × 3 × 1. The (9 × 9 × 1, 60) feature map is input to the spatial Y attention block, which weights the coefficients of each pixel, thereby obtaining a more discriminative spatial Y feature map. After the weighted spatial Y feature map is obtained, the features are enhanced by a deformable 3D convolution of size 3 × 3 × 1, and a 1 × 60 spatial Y feature map is then obtained through the BN layer, the dropout layer, and the global average pooling layer. The implementation details of the spatial Y branch are shown in Table 3.
Table 3. Implementation details of the spatial Y branch.
The spectral, spatial X, and spatial Y feature maps are obtained through the spectral, spatial X, and spatial Y branches, and the three feature maps are then concatenated for classification. Concatenation is used rather than addition because the spectral, spatial X, and spatial Y features lie in unrelated domains: concatenation keeps them independent, whereas addition would mix them together. Finally, the classification result is obtained through a fully connected layer and a softmax layer, as sketched below.
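A sketch of this fusion head (the class name `FusionHead` is ours; each branch supplies a 1 × 60 feature vector, as described above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Concatenate the three 1 x 60 branch features, then FC + softmax."""
    def __init__(self, n_classes, feat_dim=60):
        super().__init__()
        self.fc = nn.Linear(3 * feat_dim, n_classes)

    def forward(self, f_spec, f_x, f_y):               # each: (batch, 60)
        fused = torch.cat([f_spec, f_x, f_y], dim=1)   # keeps the domains independent
        return F.softmax(self.fc(fused), dim=1)
```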
The method of the invention was tested on 4 public hyperspectral data sets: the Indian Pines (IP) data set, the Pavia University (UP) data set, the Salinas Valley (SV) data set, and the Kennedy Space Center (KSC) data set. It was compared with 5 other methods: SVM, CDCNN, SSRN, FDSSC, and DBDA. These are effective, well-recognized methods for small-sample hyperspectral image classification.
The experiments were all performed on the same platform, configured with 16 GB of memory and an NVIDIA GeForce RTX 1080 Ti GPU. All deep-learning-based classifiers were implemented using PyTorch, and the support vector machine was implemented using scikit-learn.
Since the SVM directly uses spectral information for classification, the input sample size is 1 × 1 × p. For better comparative experiments, other deep learning based methods use the same input sample size of 9 × 9 × p, where p is the number of spectral bands.
The batch size of CDCNN, SSRN, FDSSC, DBDA, and the proposed D3DTBTA was set to 16, the optimizer was Adam, and the learning rate was 0.0005. Each method was run 10 times independently, and the experimental result is the average of the 10 runs. The total number of epochs was set to 150, with a step size of 30 per epoch. The experiments used the optimal-weight selection method described above.
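These settings translate directly into a PyTorch training loop; in the sketch below, `train_set` and `D3DTBTANet` are assumed names for the cube data set and the network:

```python
import torch
import torch.nn as nn

# Assumed: `train_set` yields (cube, label) pairs; `D3DTBTANet` is the network
# assembled from the sketches above. Both names are hypothetical.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)
model = D3DTBTANet(n_classes=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
criterion = nn.CrossEntropyLoss()

for epoch in range(150):
    for cubes, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(cubes), labels)
        loss.backward()
        optimizer.step()
```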
The size of the training and validation samples was 3% of the total sample. The number of training, validation and test samples in the Indian Pines (IP) dataset is shown in table 4.
Table 4. Number of training, validation, and test samples in the IP data set.
II. Application embodiment. To demonstrate the creativity and technical value of the technical solution of the invention, this part gives application examples of the claimed technical solution on specific products or related technologies.
Embodiments of the present invention may be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.
III. Evidence of the relevant effects of the embodiments. The embodiments of the invention achieved positive effects during research, development, and use, and offer significant advantages over the prior art; these are described below in combination with data and figures from the testing process.
In the examples, the experimental results on the Indian Pines (IP) data set are shown in fig. 7 and in table 5 below, where the training set size is 3%.
Table 5. Classification results on the IP data set (training set size 3%).
Wherein FIG. 7(a) is a pseudo-color image; FIG. 7(b) corresponds to a label; fig. 7(c) SVM (68.75%); FIG. 7(d) CDCNN (64.21%); FIG. 7(e) SSRN (91.59%); FIG. 7(f) FDSSC (93.85%); FIG. 7(g) DBDA (91.32%); fig. 7(h) D3DTBTA (95.74%).
The Overall Accuracy (OA) of the four data sets at different training sample ratios is shown in table 6.
Table 6. Overall accuracy (OA) at different training sample ratios.
Under each training sample proportion, the best classification result is shown in bold. As the table shows, the classification performance of the proposed method is superior to that of the other methods. Except on the IP data set with a training sample proportion of 1%, where D3DTBTA is not the best but differs only slightly from the best classification accuracy, the proposed method achieves the best classification accuracy on all other data sets and training sample proportions. Moreover, as the proportion of training samples increases, the classification accuracy becomes higher and higher. Even with few training samples, the proposed method maintains good classification performance.
The above description is only for the purpose of illustrating the present invention and the appended claims are not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A three-branch three-attention hyperspectral image classification method combined with D3D is characterized by comprising the following steps of: constructing a three-branch three-attention mechanism network D3DTBTA-Net combined with deformable 3D convolution for extracting spectral information and spatial information of the hyperspectral image; the three-branch three-attention mechanism network D3DTBTA-Net respectively extracts a spectral feature map, a spatial X feature map and a spatial Y feature map by utilizing three branches, and performs feature map fusion and classification.
2. The three-branch three-attention mechanism hyperspectral image classification method combined with D3D according to claim 1, wherein the three-branch three-attention mechanism hyperspectral image classification method combined with D3D comprises the following steps:
step one, generating a data set: generating a set of three-dimensional cubes, and randomly dividing the set of three-dimensional cubes into a training set, a verification set and a test set;
step two, training a model and verifying the model: the training set is used for updating parameters of multiple iterations, and the verification set is used for monitoring the performance of the model and selecting the model which is best trained;
step three, prediction: and selecting a test set to verify the effectiveness of the training model to obtain a classification result.
3. The three-branch three-attention mechanism hyperspectral image classification method in combination with D3D according to claim 2, wherein the data set generation in the first step comprises:
selecting a central pixel $x_i$ and its $p \times p$ neighboring pixels from the raw data, generating a set of three-dimensional cube blocks $Z = \{z_1, z_2, \ldots, z_N\} \in \mathbb{R}^{p \times p \times b}$; if the target pixel is located at the edge of the image, setting the missing neighboring pixel values to zero; in the D3DTBTA-Net algorithm, p is the patch size, set to 9, and b is the number of spectral bands; randomly dividing the three-dimensional cube set into a training set $X_{train}$, a validation set $X_{val}$, and a test set $X_{test}$, the corresponding label vectors being divided into $Y_{train}$, $Y_{val}$, and $Y_{test}$; only spatial information around the target pixel is used.
4. The method for classifying hyperspectral images with a three-branch three-attention mechanism combined with D3D according to claim 2, wherein the training model and the verification model in the second step comprise:
training a model and a verification model by using a D3DTBTA-Net algorithm, wherein the D3DTBTA-Net algorithm is divided into three branches: the spectrum branch, the space X branch and the space Y branch are respectively used for capturing a spectrum characteristic diagram, a space X characteristic diagram and a space Y characteristic diagram and fusing the three acquired characteristic diagrams for classification; wherein the spectrum branch comprises a Dense spectrum block and a spectrum attention block; the space X branch comprises a Dense space X block and a space X attention block; the space Y branch contains a Dense space Y block and a space Y attention block.
5. The method for classifying hyperspectral images by using a three-branch three-attention mechanism in combination with D3D according to claim 2, wherein the following basic modules are used in the second step:
(1) 3D-CNN with BN: the 3D-CNN with BN is a common element in a depth learning model based on 3D cubic blocks; for pm×pm×bmN of sizemFeature map, in a 3D-CNN layer, comprising a size of αm+1×αm+1×dm+1K of (a)m+1A channel of size pm+1×pm+1×bm+1N of (A) to (B)m+1Outputting a characteristic diagram; the ith output of the (m +1) th 3D-CNN layer with BN is calculated as:
Figure FDA0003580396290000021
Figure FDA0003580396290000022
wherein the content of the first and second substances,
Figure FDA0003580396290000023
is the jth input feature map of the (m +1) layer,
Figure FDA0003580396290000024
is the output after BN of m layers; e (-) and Var (-) represent the expectation and variance functions of the input, respectively; hi m+1And bi m+1Respectively representing the weight and bias of (m +1) layer 3D-CNN, which is a 3D convolution operation, and R (-) is an activation function introduced into a network nonlinear unit;
(2) DenseNet dense connection: the dense block is the basic unit in DenseNet, and the output of the lth layer in a dense block is calculated as:
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$
wherein $H_l$ is a composite block containing a convolutional layer, an activation layer and a BN layer, and $x_0, x_1, \ldots, x_{l-1}$ represent the feature maps generated by the preceding layers; the more connections, the more information flows through the dense network; a dense network with L layers has L(L+1)/2 connections, while a traditional convolutional network with the same number of layers has only L direct connections;
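A sketch of such a dense block in PyTorch, reusing the `conv3d_bn` unit above; the growth rate and depth are illustrative:

```python
import torch
import torch.nn as nn

class DenseBlock3d(nn.Module):
    def __init__(self, in_ch, growth=12, layers=3):
        super().__init__()
        # Layer l sees in_ch + l * growth channels: the concatenation of
        # the block input and all previously generated feature maps.
        self.layers = nn.ModuleList(
            conv3d_bn(in_ch + i * growth, growth) for i in range(layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            # x_l = H_l([x_0, x_1, ..., x_{l-1}])
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```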
(3) Attention mechanism:
the spectral attention block: the spectral attention map $X \in \mathbb{R}^{c \times c}$ is calculated directly from the initial input $A \in \mathbb{R}^{p \times p \times c}$, where $p \times p$ is the size of the input block and c represents the number of input channels; A and $A^T$ are matrix-multiplied to obtain the channel attention map $X \in \mathbb{R}^{c \times c}$, and a softmax layer is connected:
$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{c} \exp(A_i \cdot A_j)}$$
wherein $x_{ji}$ represents the influence of the ith channel on the jth channel; the result of the matrix multiplication between $X^T$ and A is reshaped into $\mathbb{R}^{p \times p \times c}$; the reshaped result is weighted by the scale parameter α and the input A is added to obtain the final spectral attention map $E \in \mathbb{R}^{p \times p \times c}$:
$$E_j = \alpha \sum_{i=1}^{c} (x_{ji} A_i) + A_j$$
wherein α is initialized to zero and gradually learned; the final map E contains a weighted sum of the features of all channels plus the original features, which models the dependencies between channels and enhances the discriminability of the features;
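A PyTorch sketch of this spectral (channel) attention block following the formulas above; shapes assume a (B, c, p, p) input:

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))   # alpha starts at zero

    def forward(self, A):                           # A: (B, c, p, p)
        b, c, h, w = A.shape
        a = A.view(b, c, -1)                        # (B, c, n), n = p * p
        attn = torch.softmax(torch.bmm(a, a.transpose(1, 2)), dim=-1)  # x_ji
        out = torch.bmm(attn, a).view(b, c, h, w)   # weighted channel sum
        return self.alpha * out + A                 # E_j = alpha * sum + A_j
```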
the spatial attention block: given an input feature map $A \in \mathbb{R}^{c \times p \times p}$, two convolutional layers are used to generate new feature maps B and C respectively, where $B, C \in \mathbb{R}^{c \times p \times p}$; B and C are reshaped into $\mathbb{R}^{c \times n}$, where $n = p \times p$ is the number of pixels; matrix multiplication is performed between B and C and a softmax layer is added to calculate the spatial attention map $S \in \mathbb{R}^{n \times n}$:
$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{n} \exp(B_i \cdot C_j)}$$
wherein $s_{ji}$ represents the influence of the ith pixel on the jth pixel; the closer the feature representations of two pixels are, the stronger the correlation between them; meanwhile, the initial input feature A is fed into a convolutional layer to obtain a new feature map $D \in \mathbb{R}^{c \times p \times p}$, which is reshaped into $\mathbb{R}^{c \times n}$; a matrix multiplication operation is performed between D and $S^T$, and the result is reshaped into $\mathbb{R}^{c \times p \times p}$:
$$E_j = \beta \sum_{i=1}^{n} (s_{ji} D_i) + A_j$$
wherein the initial value of β is zero and more weight is gradually learned and assigned; the features of all positions, each given a certain weight, are added to the original features to obtain the final feature $E \in \mathbb{R}^{c \times p \times p}$, so the contextual information in the spatial dimension is modeled in E;
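A PyTorch sketch of this spatial (position) attention block; the channel reduction by a factor of 8 in the B and C convolutions is a common choice, not mandated by the text:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv_b = nn.Conv2d(c, c // 8, 1)      # produces B
        self.conv_c = nn.Conv2d(c, c // 8, 1)      # produces C
        self.conv_d = nn.Conv2d(c, c, 1)           # produces D
        self.beta = nn.Parameter(torch.zeros(1))   # beta starts at zero

    def forward(self, A):                          # A: (B, c, p, p)
        bsz, c, h, w = A.shape
        b = self.conv_b(A).view(bsz, -1, h * w)    # (B, c/8, n)
        cm = self.conv_c(A).view(bsz, -1, h * w)   # (B, c/8, n)
        # s_ji: pairwise pixel affinities, softmax-normalized (n x n)
        S = torch.softmax(torch.bmm(b.transpose(1, 2), cm), dim=-1)
        d = self.conv_d(A).view(bsz, c, h * w)     # (B, c, n)
        out = torch.bmm(d, S.transpose(1, 2)).view(bsz, c, h, w)
        return self.beta * out + A                 # E_j = beta * sum + A_j
```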
(4) Deformable 3D convolution: the deformable convolution dynamically adjusts the size of the receptive field according to the actual content of the image; an input feature of size C × H × W is passed through a 3D-CNN with a p × q × r kernel to generate an offset feature of size 3N × C × H × W, where N is the size of the sampling grid; the 3N values along the channel dimension represent the deformation values of the D3D sampling grid; the learned offset features are applied to deform the 3D-CNN sampling grid, generating the D3D sampling grid, which is then used to generate the output features;
D3D is represented by the following formula:
$$y(p_0) = \sum_{n=1}^{N} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$
wherein $\Delta p_n$ represents the offset corresponding to the nth value in the p × q × r convolutional sampling grid; since the learned offsets are fractional, bilinear interpolation is used to generate accurate values; the bilinear interpolation formula is:
$$x(p) = \sum_{q} G(q, p) \cdot x(q)$$
wherein q enumerates the integral positions of the feature map neighboring the fractional position p and G(·,·) is the bilinear interpolation kernel.
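A compact PyTorch sketch of a deformable 3D convolution under these formulas; the (dz, dy, dx) ordering of the predicted offset channels is an assumption, and `F.grid_sample` performs the (tri)linear interpolation of the sampling formula above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k, self.N = k, k ** 3
        # Offset branch: 3N channels, one (dz, dy, dx) triple per grid tap.
        self.offset_conv = nn.Conv3d(in_ch, 3 * self.N, k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)  # start as a regular conv
        nn.init.zeros_(self.offset_conv.bias)
        # w(p_n): mixes the N sampled maps, applied as a 1x1x1 convolution.
        self.weight_conv = nn.Conv3d(in_ch * self.N, out_ch, 1)

    def forward(self, x):                          # x: (B, C, D, H, W)
        B, C, D, H, W = x.shape
        off = self.offset_conv(x).view(B, self.N, 3, D, H, W)
        zs, ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, D), torch.linspace(-1, 1, H),
            torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack((xs, ys, zs), dim=-1).to(x.device)  # (D, H, W, 3)
        r, taps, n = self.k // 2, [], 0
        for dz in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    # Fixed grid displacement p_n plus learned offset dp_n,
                    # converted to grid_sample's [-1, 1] coordinates.
                    disp = torch.stack(
                        ((dx + off[:, n, 2]) * 2 / max(W - 1, 1),
                         (dy + off[:, n, 1]) * 2 / max(H - 1, 1),
                         (dz + off[:, n, 0]) * 2 / max(D - 1, 1)), dim=-1)
                    grid = base.unsqueeze(0) + disp      # (B, D, H, W, 3)
                    taps.append(F.grid_sample(x, grid, mode="bilinear",
                                              align_corners=True))
                    n += 1
        return self.weight_conv(torch.cat(taps, dim=1))
```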
(5) Mish activation function: the activation function adopted by D3DTBTA is Mish, and the formula of Mish is as follows:
$$\mathrm{mish}(x) = x \times \tanh(\mathrm{softplus}(x)) = x \times \tanh(\ln(1 + e^x))$$
wherein x represents the input; Mish is unbounded above and bounded below, with a lower bound of approximately −0.31; the derivative of Mish is defined as:
$$\mathrm{mish}'(x) = \frac{e^x \omega}{\delta^2}$$
wherein $\omega = 4(x + 1) + 4e^{2x} + e^{3x} + e^x(4x + 6)$ and $\delta = 2e^x + e^{2x} + 2$.
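A two-line PyTorch sketch of Mish; the sample points are arbitrary:

```python
import torch
import torch.nn.functional as F

def mish(x):
    # mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-4.0, 4.0, 9)
print(mish(x))   # the minimum of about -0.31 is reached near x = -1.19
```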
(6) Selection of the optimal weights: during training, the model with the highest accuracy on the verification set is selected as the output; if the accuracies on the verification set are equal, the model with the smallest loss on the verification set is selected; the best model is saved at each iteration: if the model of the next iteration is better, the previously saved model is replaced, otherwise it is kept;
dynamically adjusting the learning rate by adopting a cosine annealing method as shown in the following formula:
$$\eta_t = \eta_{min}^i + \frac{1}{2}\left(\eta_{max}^i - \eta_{min}^i\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$
wherein $\eta_t$ is the learning rate, which lies within the range $[\eta_{min}^i, \eta_{max}^i]$ during the ith run; $T_{cur}$ accounts for the number of iterations performed since the last restart, and $T_i$ controls the number of iterations in one adjustment period.
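A sketch of this bookkeeping in PyTorch, assuming `model`, `train_one_epoch`, and `evaluate` exist; PyTorch's built-in `CosineAnnealingWarmRestarts` implements the cosine formula above, with `T_0` playing the role of $T_i$:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=15, eta_min=1e-5)

best_acc, best_loss = 0.0, float("inf")
for epoch in range(200):
    train_one_epoch(model, optimizer)
    scheduler.step()
    val_acc, val_loss = evaluate(model)
    # Keep the checkpoint with the highest verification accuracy,
    # breaking ties by the lower verification loss.
    if val_acc > best_acc or (val_acc == best_acc and val_loss < best_loss):
        best_acc, best_loss = val_acc, val_loss
        torch.save(model.state_dict(), "best_model.pt")
```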
6. The three-branch three-attention mechanism hyperspectral image classification method in combination with D3D according to claim 2, wherein the prediction in step three comprises:
the HSI data set A is composed of N labeled pixels $\{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^{1 \times 1 \times b}$, where b is the number of spectral bands, and the corresponding class label set is $Y = \{y_1, y_2, \ldots, y_N\} \subset \mathbb{R}^{1 \times 1 \times q}$, where q is the number of land cover categories;
in HSI classification, the quantitative measure of the difference between the predicted result and the true value is the cross-entropy loss function, defined as:
$$\mathcal{L}(\hat{y}, y) = -\sum_{i=1}^{L} y_i \log(\hat{y}_i)$$
wherein $\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L]$ represents the label vector predicted by the model, and $y = [y_1, y_2, \ldots, y_L]$ represents the true label vector.
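A short PyTorch check of this loss; with one-hot true labels, `F.cross_entropy` on raw logits computes exactly $-\sum_i y_i \log(\hat{y}_i)$ (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 16)             # 8 samples, q = 16 classes
targets = torch.randint(0, 16, (8,))    # true class indices
loss = F.cross_entropy(logits, targets)
print(loss.item())
```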
7. A three-branch three-attention hyperspectral image classification system combined with D3D applying the three-branch three-attention hyperspectral image classification method combined with D3D according to any of claims 1 to 6, wherein the three-branch three-attention hyperspectral image classification system combined with D3D comprises:
the data set generating module is used for generating a set of three-dimensional cubes and randomly dividing the set of three-dimensional cubes into a training set, a verification set and a test set;
the model training and verifying module is used for updating the parameters over multiple iterations with the training set, monitoring the performance of the model with the verification set, and selecting the best-trained model;
and the prediction module is used for selecting the test set to verify the effectiveness of the training model and obtain a classification result.
8. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
constructing a three-branch three-attention mechanism network D3DTBTA-Net combined with deformable 3D convolution for extracting the spectral information and the spatial information of a hyperspectral image, wherein the D3DTBTA-Net is divided into three branches; after the spectral feature map, the spatial X feature map and the spatial Y feature map are respectively extracted, the feature maps extracted by the three branches are fused for classification.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
constructing a three-branch three-attention mechanism network D3DTBTA-Net combined with deformable 3D convolution for extracting the spectral information and the spatial information of the hyperspectral image, wherein the D3DTBTA-Net is divided into three branches; after the spectral feature map, the spatial X feature map and the spatial Y feature map are respectively extracted, the feature maps extracted by the three branches are fused for classification.
10. An information data processing terminal characterized by being configured to implement the three-branch three-attention hyperspectral image classification system in combination with D3D of claim 7.
CN202210344115.7A 2022-04-02 2022-04-02 Three-branch three-attention mechanism hyperspectral image classification method combined with D3D Active CN114758170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210344115.7A CN114758170B (en) 2022-04-02 2022-04-02 Three-branch three-attention mechanism hyperspectral image classification method combined with D3D

Publications (2)

Publication Number Publication Date
CN114758170A true CN114758170A (en) 2022-07-15
CN114758170B CN114758170B (en) 2023-04-18

Family

ID=82329787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210344115.7A Active CN114758170B (en) 2022-04-02 2022-04-02 Three-branch three-attention mechanism hyperspectral image classification method combined with D3D

Country Status (1)

Country Link
CN (1) CN114758170B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170090068A1 (en) * 2014-09-12 2017-03-30 The Climate Corporation Estimating soil properties within a field using hyperspectral remote sensing
CN109978041A (en) * 2019-03-19 2019-07-05 上海理工大学 A kind of hyperspectral image classification method based on alternately update convolutional neural networks
CN111191736A (en) * 2020-01-05 2020-05-22 西安电子科技大学 Hyperspectral image classification method based on depth feature cross fusion
CN111539447A (en) * 2020-03-17 2020-08-14 广东省智能制造研究所 Hyperspectrum and terahertz data depth fusion-based classification method
WO2021262129A1 (en) * 2020-06-26 2021-12-30 Ceylan Murat An artificial intelligence analysis based on hyperspectral imaging for a quick determination of the health conditions of newborn premature babies without any contact
CN112052755A (en) * 2020-08-24 2020-12-08 西安电子科技大学 Semantic convolution hyperspectral image classification method based on multi-path attention mechanism
CN112116563A (en) * 2020-08-28 2020-12-22 南京理工大学 Hyperspectral image target detection method and system based on spectral dimension and space cooperation neighborhood attention
CN112232280A (en) * 2020-11-04 2021-01-15 安徽大学 Hyperspectral image classification method based on self-encoder and 3D depth residual error network
CN113139515A (en) * 2021-05-14 2021-07-20 辽宁工程技术大学 Hyperspectral image classification method based on conditional random field and depth feature learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Kang Yongchao et al.: "Research progress of hyperspectral image classification methods", 《新产经》 (New Industrial Economy) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310810A (en) * 2022-12-06 2023-06-23 青岛柯锐思德电子科技有限公司 Cross-domain hyperspectral image classification method based on spatial attention-guided variable convolution
CN116310810B (en) * 2022-12-06 2023-09-15 青岛柯锐思德电子科技有限公司 Cross-domain hyperspectral image classification method based on spatial attention-guided variable convolution
CN116883726A (en) * 2023-06-25 2023-10-13 内蒙古农业大学 Hyperspectral image classification method and system based on multi-branch and improved Dense2Net
CN117036821A (en) * 2023-08-22 2023-11-10 翔鹏佑康(北京)科技有限公司 Single-cell rapid detection and identification method based on laser Raman spectrum

Also Published As

Publication number Publication date
CN114758170B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
Shook et al. Crop yield prediction integrating genotype and weather variables using deep learning
Folberth et al. Spatio-temporal downscaling of gridded crop model yield estimates based on machine learning
Srivastava et al. Winter wheat yield prediction using convolutional neural networks from environmental and phenological data
CN114758170B (en) Three-branch three-attention mechanism hyperspectral image classification method combined with D3D
e Lucas et al. Reference evapotranspiration time series forecasting with ensemble of convolutional neural networks
Rußwurm et al. Multi-temporal land cover classification with long short-term memory neural networks
Wang et al. Evaluation of a deep-learning model for multispectral remote sensing of land use and crop classification
Plaza et al. A new approach to mixed pixel classification of hyperspectral imagery based on extended morphological profiles
CN102938072B (en) A kind of high-spectrum image dimensionality reduction and sorting technique based on the tensor analysis of piecemeal low-rank
Wylie et al. Geospatial data mining for digital raster mapping
Sun et al. Mapping plant functional types from MODIS data using multisource evidential reasoning
Hu et al. Integrating coarse-resolution images and agricultural statistics to generate sub-pixel crop type maps and reconciled area estimates
Adeluyi et al. Estimating the phenological dynamics of irrigated rice leaf area index using the combination of PROSAIL and Gaussian Process Regression
Cheng et al. Wheat yield estimation using remote sensing data based on machine learning approaches
Lin et al. Large-scale rice mapping using multi-task spatiotemporal deep learning and sentinel-1 sar time series
Ayaz et al. Estimation of reference evapotranspiration using machine learning models with limited data
Liu et al. An algorithm for early rice area mapping from satellite remote sensing data in southwestern Guangdong in China based on feature optimization and random Forest
von Bloh et al. Machine learning for soybean yield forecasting in Brazil
Zhang et al. Enhancing model performance in detecting lodging areas in wheat fields using UAV RGB Imagery: Considering spatial and temporal variations
Lang et al. Integrating environmental and satellite data to estimate county-level cotton yield in Xinjiang Province
Saravi et al. Reducing deep learning network structure through variable reduction methods in crop modeling
Zhong et al. Detect and attribute the extreme maize yield losses based on spatio-temporal deep learning
Haining Specification and estimation problems in models of spatial dependence.
Toomula et al. An Extensive Survey of Deep learning-based Crop Yield Prediction Models for Precision Agriculture
Hosseini et al. Areal precipitation coverage ratio for enhanced AI modelling of monthly runoff: a new satellite data-driven scheme for semi-arid mountainous climate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant