CN116258976A - Hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system - Google Patents

Hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system

Info

Publication number
CN116258976A
Authority
CN
China
Prior art keywords
remote sensing
sensing image
resolution remote
resolution
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310298438.1A
Other languages
Chinese (zh)
Inventor
付勇泉
吴宏林
贾勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202310298438.1A
Publication of CN116258976A
Legal status: Pending

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 20/00 Scenes; Scene-specific elements
            • G06V 20/10 Terrestrial scenes
              • G06V 20/13 Satellite images
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/20 Image preprocessing
              • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
            • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
              • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
              • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of semantic segmentation of remote sensing images, and specifically provides a hierarchical Transformer semantic segmentation method and system for high-resolution remote sensing images. The method comprises the following steps: obtaining an original remote sensing image and performing preliminary processing to obtain high-resolution remote sensing images of uniform size; preprocessing the data to construct a sample set; building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images; training the semantic segmentation model according to a preset training scheme to obtain an improved model; and cropping the high-resolution remote sensing image to be segmented to a fixed size and loading it into the improved model to realize rapid segmentation. By designing a lightweight, efficient Transformer block as the backbone network, the computational overhead of the model is reduced and an effective global feature map is obtained. An attention mechanism and a multi-scale feature aggregation strategy are added to the backbone network to enhance the discrimination of target classes of different scales and sizes and to refine boundaries, thereby achieving better overall performance.

Description

Hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system
Technical Field
The invention relates to the technical field of semantic segmentation of remote sensing images, and in particular to a hierarchical Transformer semantic segmentation method and system for high-resolution remote sensing images.
Background
Remote sensing techniques may be used to obtain various information about the earth's surface, such as terrain, weather, hydrology, vegetation, and so forth. The information can be used in agriculture, forestry, city planning, resource management, disaster monitoring and other fields. With the continuous progress of remote sensing technology and the continuous expansion of application fields, the remote sensing technology has an increasingly profound effect on the life and work of people.
Remote sensing image processing includes semantic segmentation, change detection, cover classification, and the like. Semantic segmentation of high-resolution remote sensing images is an important and fundamental research topic in remote sensing image processing: the region to be segmented in the image is processed and a label category is assigned to each pixel using the spectrum and abstract semantic features of different positions, laying a solid foundation for remote sensing image analysis and understanding. Traditional image processing algorithms face numerous difficulties in semantic segmentation of high-resolution remote sensing images, including the complexity of feature design, difficulty in processing large-scale data, complex and varied scenes, and poor algorithm adaptability. Consequently, semantic segmentation of high-resolution remote sensing images with conventional image processing algorithms not only requires a great deal of time and effort but also often struggles to achieve satisfactory results.
In recent years, the explosive development of deep learning has brought notable progress to semantic segmentation of high-resolution remote sensing images. Compared with traditional image processing algorithms, deep learning based neural networks excel across many fields of computer vision and have therefore attracted growing attention and study. As the resolution of remote sensing images improves, traditional remote sensing image segmentation and machine learning methods cannot effectively handle large-scale feature extraction. Deep convolutional neural networks have been continuously explored for this problem and have alleviated it to a certain extent, but directly applying general semantic segmentation methods to high-resolution remote sensing images still yields unsatisfactory results. Therefore, in recent years, deep learning semantic segmentation methods for remote sensing images, including CNNs-based models and Transformer-based models, have been continuously developed to solve four main problems in remote sensing image segmentation: obtaining spatial information, reconstructing edge details, establishing global relationships, and designing lightweight architectures.
Networks based on CNNs structures are often used in image segmentation tasks: CNNs structures are simple, train quickly, and can process large amounts of data. In addition, CNNs can improve model accuracy through techniques such as pooling, convolution, and attention mechanisms. However, because CNNs capture only local image features, they may fall short in processing global information. CNNs models therefore solve, to a certain extent, the problems of spatial information extraction, edge reconstruction, and lightweight design; the establishment of global relationships, however, remains far from resolved.
As the Transformer is increasingly applied in computer vision, it has gradually been applied to semantic segmentation of remote sensing images. The Transformer structure has a global attention mechanism that captures global context information and is not limited by the size of a convolution kernel, which allows the Transformer to better understand the global information of an image when processing remote sensing image semantic segmentation. The Transformer architecture is adept at establishing global relationships, because the basic Transformer unit is built around the attention mechanism, but it is weaker at extracting local information.
In summary, segmentation models based on the CNNs architecture generally consist of a basic convolutional neural network plus feature extraction and aggregation strategies, which solve the problems of multi-scale information extraction, edge enhancement, and efficient segmentation to a certain extent. However, as image resolution increases and the number of network model parameters grows, models based on the CNNs architecture hit a ceiling on the semantic segmentation task. The Transformer can continue to improve performance because its global self-attention mechanism captures global context information without being limited by the size of a convolution kernel, allowing it to better understand the global information of an image in remote sensing semantic segmentation; however, its huge computational overhead limits its development in the field of semantic segmentation of high-resolution remote sensing images.
Disclosure of Invention
The invention addresses the technical problems in the prior art of complex feature design, difficulty in processing large-scale data, complex and variable scenes, and poor algorithm adaptability when traditional image processing algorithms perform semantic segmentation of high-resolution remote sensing images.
The invention provides a hierarchical Transformer high-resolution remote sensing image semantic segmentation method, which comprises the following steps:
s1, acquiring an original remote sensing image, and performing preliminary processing to obtain a high-resolution remote sensing image with uniform size;
s2, data preprocessing is carried out to construct a sample set;
S3, building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images;
s4, training the semantic segmentation model of the high-resolution remote sensing image according to a preset training scheme to obtain an improved model;
s5, cutting the high-resolution remote sensing image to be subjected to segmentation processing according to a fixed size, and loading the cut high-resolution remote sensing image into an improved model to realize rapid segmentation.
Preferably, the S1 specifically includes:
s11, scanning and shooting a preset ground by using a high-resolution remote sensing sensor carried by an unmanned aerial vehicle to obtain an original remote sensing image;
s12, performing image correction, stitching and denoising on the original remote sensing image to generate a high-quality high-resolution remote sensing image;
s13, performing artificial visual annotation on part of the high-resolution remote sensing image data set to improve the generalization capability of the model under the condition of collecting data set samples;
and S14, unifying the sizes of the high-resolution remote sensing dataset images, and cropping the original images and the corresponding labels to 512 × 512 to fit the network input.
Preferably, the S2 specifically includes:
s21, normalizing all the processed images for the convenience of training;
s22, using one-hot coding to carry out vectorization coding for each pixel category of the label;
s23, enhancing the image by adopting a spatial data enhancement mode to obtain a data set;
s24, dividing the data set into a training set, a verification set and a test set according to the ratio of 3:1:1.
Preferably, the step S3 specifically includes:
S31, constructing a lightweight backbone network, with a multi-scale feature aggregation segmentation head above it and residual axial attention (Residual Axial Attention, RAA) blocks below it, correspondingly connected with the backbone; the backbone network comprises four stages, each comprising a convolutional token embedding block (Convolutional Token Embedding, Conv S2) and an EST block (ESwin Transformer Block);
S32, inputting an H × W × 3 image into the backbone network to establish global relationships, where H and W are the input height and width and 3 represents the three RGB channels;
S33, the convolutional embedding block adjusts the feature resolution before each stage to generate features of appropriate resolution; the EST block adopts a lightweight self-attention design to reduce the number of parameters, halves the input size through a linear embedding layer, then keeps the resolution unchanged through self-attention computation, layer normalization (LN), and a multi-layer perceptron (MLP), and feeds the result to the next stage; step S33 is repeated 4 times in total to output (C1, C2, C3, C4);
S34, depth-separable convolution (DWConv) is used to perform hierarchical multi-scale feature aggregation on (C1, C2, C3, C4) respectively, and C4 is upsampled through the PPM module and matrix-added to the features of each layer; each layer is then mapped to the H/4 × W/4 resolution by a 3 × 3 convolution, and the feature fusion maps of the different layers are stacked cross-channel via Concat and fed into a 1 × 1 convolution for channel combination;
S35, a residual axial attention mechanism is adopted to compensate for edge loss: the attention mechanism is decomposed into two attention modules, the first performing self-attention along the height axis of the feature map and the second operating along the width axis;
S36, the UPerHead segmentation head output and the RAA result are fused via Concat and fed into a 1 × 1 convolution for channel combination; finally, softmax normalization yields the predicted value of each target class, and the predictions are combined into a predicted segmentation map for evaluating segmentation performance.
Preferably, the step S32 specifically includes:
firstly, the convolutional embedding block divides the image into (H/4) × (W/4) patches of size 4 × 4, each flattened into a sequence of length 4 × 4 × 3 = 48;
the feature dimension is then mapped from 48 to C through the linear embedding layer and the features are fed to the EST blocks to establish a global relationship.
Preferably, the step S33 specifically includes:
S331, a 3 × 3 convolution with stride 2 downsamples the feature map of each stage, halving its height and width while doubling its channels, completing the downsampling of the feature map;
S332, the EST block adopts a lightweight self-attention design to reduce the number of parameters: the input size (H × W) is mapped to (H′ × W′) through a linear embedding layer, followed by self-attention computation (projection + reshaping + multiplication), layer normalization (LN), and a multi-layer perceptron (MLP); the resolution is kept unchanged and the result is fed to the next stage;
the above steps are repeated, and the four stages output (C1, C2, C3, C4) at scales of 1/4, 1/8, 1/16, and 1/32 of the input resolution respectively.
Preferably, the step S35 specifically includes:
for the height-axis axial attention, X is uniformly divided into non-overlapping stripes [X^1, ..., X^M] of equal width sw, and each stripe contains sw × W tokens, where sw is the stripe width;
after the height-axis attention features are obtained, the feature maps are recombined and fed into the width-axis attention for feature aggregation in the horizontal direction, which only requires dividing the stripes horizontally.
The invention also provides a hierarchical Transformer high-resolution remote sensing image semantic segmentation system, which is used for implementing the hierarchical Transformer high-resolution remote sensing image semantic segmentation method described above and comprises:
the image acquisition module is used for acquiring an original remote sensing image, and performing preliminary processing to obtain a high-resolution remote sensing image with uniform size;
the data preprocessing module is used for constructing a sample set through data preprocessing;
the model building module is used for building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images;
the model training module is used for training the semantic segmentation model of the high-resolution remote sensing image according to a preset training scheme to obtain an improved model;
the semantic segmentation module is used for cutting the high-resolution remote sensing image to be segmented according to a fixed size and loading the cut high-resolution remote sensing image into the improved model to realize rapid segmentation.
The invention also provides an electronic device, which comprises a memory and a processor, wherein the processor is used for realizing the steps of the high-resolution remote sensing image semantic segmentation method of the hierarchical Transformer when executing the computer management program stored in the memory.
The invention also provides a computer readable storage medium, on which a computer management class program is stored, which when executed by a processor, implements the steps of the hierarchical Transformer high-resolution remote sensing image semantic segmentation method as described above.
The beneficial effects are that: the invention provides a hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system, the method comprising: obtaining an original remote sensing image and performing preliminary processing to obtain high-resolution remote sensing images of uniform size; preprocessing the data to construct a sample set; building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images; training the semantic segmentation model according to a preset training scheme to obtain an improved model; and cropping the high-resolution remote sensing image to be segmented to a fixed size and loading it into the improved model to realize rapid segmentation. By designing a lightweight, efficient Transformer block as the backbone network, the computational overhead of the model is reduced and an effective global feature map is obtained. An attention mechanism and a multi-scale feature aggregation strategy are added to the backbone network to enhance the discrimination of target classes of different scales and sizes and to refine boundaries, thereby achieving better overall performance.
Drawings
FIG. 1 is a flow chart of the hierarchical Transformer high-resolution remote sensing image semantic segmentation method provided by the invention;
FIG. 2 is a schematic diagram of the hierarchical Transformer high-resolution remote sensing image semantic segmentation method provided by the invention;
FIG. 3 is a diagram of a high efficiency backbone architecture provided by the present invention;
FIG. 4 is a block diagram of the UPerHead multi-scale feature aggregation provided by the invention;
FIG. 5 is a residual axial attention diagram provided by the present invention;
fig. 6 is a schematic hardware structure of one possible electronic device according to the present invention;
FIG. 7 is a schematic diagram of a possible hardware configuration of a computer readable storage medium according to the present invention;
FIG. 8 is a main structure diagram of the Swin Transformer provided by the invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
Fig. 1 is a schematic diagram of the hierarchical Transformer semantic segmentation method for high-resolution remote sensing images, which includes:
S1, acquiring an original remote sensing image and performing preliminary processing to obtain high-resolution remote sensing dataset images of uniform size. An unmanned aerial vehicle acquires the data autonomously; the data are first manually labeled and then cropped to the model input size of 512 × 512.
S2, data preprocessing is conducted to construct a sample set. General data set preprocessing including data classification, cropping, enhancement, etc.
And S3, building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images.
And S4, training the high-resolution remote sensing image semantic segmentation model according to a preset training scheme. During training, the input to the model is a 512 × 512 high-resolution remote sensing crop, and the output is a final segmentation image of the same size as the input.
S5, the trained model is deployed on a PC; the PC receives the high-resolution remote sensing image to be segmented, crops it to a fixed size, and then loads the cropped images into the model, which realizes rapid segmentation of the input image.
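A minimal sketch of what this S5 tiled-inference step could look like in code, assuming a PyTorch model trained on 512 × 512 tiles; the function name, the non-overlapping tiling policy, and the zero padding at the borders are illustrative assumptions rather than details from the patent.

```python
# A sketch of S5, assuming a trained PyTorch model that maps a 512x512 RGB
# tile to per-pixel class logits. Non-overlapping tiling and zero padding
# are illustrative choices.
import torch
import torch.nn.functional as F

TILE = 512

def segment_large_image(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) float tensor; returns an (H, W) label map."""
    _, H, W = image.shape
    pad_h, pad_w = (-H) % TILE, (-W) % TILE      # pad to a multiple of 512
    image = F.pad(image, (0, pad_w, 0, pad_h))
    out = torch.zeros(image.shape[1], image.shape[2], dtype=torch.long)
    model.eval()
    with torch.no_grad():
        for top in range(0, image.shape[1], TILE):
            for left in range(0, image.shape[2], TILE):
                tile = image[:, top:top + TILE, left:left + TILE].unsqueeze(0)
                logits = model(tile)             # (1, num_classes, 512, 512)
                out[top:top + TILE, left:left + TILE] = logits.argmax(dim=1)[0]
    return out[:H, :W]                           # drop the padded border
```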
The scheme solves four problems to realize fine segmentation of the high-resolution remote sensing image. The specific contents are summarized as follows:
A semantic segmentation model fusing Transformer and CNN: Transformer blocks perform feature extraction, while a CNN attention mechanism and a multi-scale aggregation strategy realize global-local feature fusion. A lightweight Efficient Transformer backbone network is proposed to reduce the computation of the Swin Transformer and accelerate inference. A multi-scale feature aggregation network is introduced as the segmentation head to enhance pixel-level classification of image target classes. Residual axial attention is fused into the hierarchical network to address the object edge extraction problem in the Transformer architecture.
Table 1 below compares segmentation metrics on the Vaihingen and Potsdam public high-resolution remote sensing datasets (the table is reproduced as an image in the original document).
Preferably, step S1 specifically includes the following steps:
S11, scanning and shooting a preset ground by using a high-resolution remote sensing sensor carried by the unmanned aerial vehicle, and obtaining an original remote sensing image.
And S12, performing preliminary data processing, including image correction, stitching, and denoising, to generate high-quality high-resolution remote sensing images.
S13, performing artificial visual annotation on part of the high-resolution remote sensing image data set to improve the generalization capability of the model under the condition of collecting data set samples.
And S14, unifying the sizes of the high-resolution remote sensing dataset images, and cropping the original images and the corresponding labels to 512 × 512 to fit the network input.
The preferred scheme S2 specifically includes the following steps:
s21, normalization processing is carried out on all the processed images for the convenience of training.
S22, using one-hot coding to carry out vectorization coding for each pixel category of the label.
S23, enhancing the images with spatial data augmentation to obtain the data set, including random rotations at different angles (90°, 180°, 270°, 360°) and random vertical or horizontal mirror flipping.
S24, dividing the data set into a training set, a verification set and a test set according to the ratio of 3:1:1.
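A compact sketch of the S21-S24 pipeline under the stated choices (normalization, one-hot label encoding, rotation-and-flip augmentation, 3:1:1 split); the helper names and the class count are hypothetical.

```python
# A sketch of the S21-S24 preprocessing, under the stated choices; the class
# count and helper names are hypothetical.
import random
import numpy as np

NUM_CLASSES = 6  # assumption: set to the actual number of label categories

def normalize(img: np.ndarray) -> np.ndarray:
    return img.astype(np.float32) / 255.0                  # S21

def one_hot(label: np.ndarray) -> np.ndarray:
    return np.eye(NUM_CLASSES, dtype=np.float32)[label]    # S22: (H, W, classes)

def augment(img: np.ndarray, label: np.ndarray):
    k = random.choice([0, 1, 2, 3])                        # S23: rotate in 90 deg steps
    img, label = np.rot90(img, k), np.rot90(label, k)
    if random.random() < 0.5:                              # vertical mirror
        img, label = np.flipud(img), np.flipud(label)
    if random.random() < 0.5:                              # horizontal mirror
        img, label = np.fliplr(img), np.fliplr(label)
    return img.copy(), label.copy()

def split(samples: list):
    random.shuffle(samples)                                # S24: 3:1:1 split
    n = len(samples) // 5
    return samples[2 * n:], samples[n:2 * n], samples[:n]  # train, val, test
```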
Preferably, the step S3 specifically includes the following steps:
S31, in order to process large-scale feature extraction more efficiently, an efficient hierarchical vision Transformer block (Efficient Swin Transformer block, hereinafter ESWin for convenience of description) is presented herein, which includes a convolution operation to reduce the feature-map resolution stage by stage and obtain more spatial information. A hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images is constructed, as shown in fig. 2. The model mainly comprises a lightweight backbone network (four cascaded EST blocks (ESwin Transformer Blocks)), a multi-scale feature aggregation segmentation head above, and residual axial attention (Residual Axial Attention, RAA) blocks below, correspondingly connected with the backbone network. As shown in FIG. 3, the EST backbone has four stages in total, with (N1, N2, N3, N4) = (2, 2, 2, 2) blocks respectively, where each stage comprises a convolutional token embedding block (Conv S2) and an EST block.
S32, an H × W × 3 image (3 represents the RGB channels) is input into the backbone network. As shown in fig. 3, the image is first divided by a convolutional token embedding (Conv S2) block into 4 × 4 patches, each flattened into a vector of length 4 × 4 × 3 = 48. The original H × W resolution therefore becomes H/4 × W/4, and the dimension changes from 3 to 48. The feature dimension is then mapped from 48 to C through the linear embedding layer, and the features are fed to the ESwin Transformer blocks (EST blocks) to establish global relationships.
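The following sketch illustrates one way the convolutional token embedding of S32 could be written, assuming PyTorch and the 7 × 7, stride-4 convolution described later in the text; the embedding dimension C = 96 is a placeholder.

```python
# A sketch of the convolutional token embedding of S32, assuming PyTorch and
# the 7x7 / stride-4 convolution described later; embed_dim = 96 is a placeholder.
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    def __init__(self, in_ch: int = 3, embed_dim: int = 96):
        super().__init__()
        # Stride-4 convolution: an HxWx3 image becomes an (H/4)x(W/4) token grid.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=7, stride=4, padding=3)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, C, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)      # (B, H/4 * W/4, C) token sequence
        return self.norm(x)

tokens = ConvTokenEmbedding()(torch.randn(1, 3, 512, 512))
print(tokens.shape)   # torch.Size([1, 16384, 96]): 128 x 128 tokens of dim 96
```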
S33, the two blocks process features as follows:
S331, the convolutional embedding block adjusts the feature resolution before each stage to generate features of appropriate resolution. Specifically, a 3 × 3 convolution with stride 2 downsamples the feature map of each stage, reducing the height and width to half and expanding the channels to twice the original.
S332, the EST block adopts a lightweight self-attention design to reduce the number of parameters: the input size (H × W) is mapped to (H′ × W′) through a linear embedding layer (Linear Embedding), followed by self-attention computation (projection + reshaping + multiplication), layer normalization (LN), and a multi-layer perceptron (MLP); the resolution is kept unchanged and the result is fed to the next stage.
The above steps are repeated, and the four stages output (C1, C2, C3, C4) at scales of 1/4, 1/8, 1/16, and 1/32 of the input resolution respectively.
In contrast to the ViT encoder, the encoder of this embodiment generates multi-level, multi-scale features for a given input image, enabling hierarchical feature representation. These feature layers provide high-resolution coarse features and low-resolution fine-grained features that improve semantic segmentation performance. Specifically, given an input image of size H × W × 3, stride-2 convolutions generate hierarchical feature maps F_i of resolution (H / 2^(i+1)) × (W / 2^(i+1)) × C_i, where i ∈ {1, 2, 3, 4} and C_(i+1) > C_i, i.e., the number of channels expands layer by layer.
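A minimal sketch of the Conv S2 downsampling just described: a 3 × 3, stride-2 convolution that halves the spatial resolution and doubles the channels between stages. The stage chaining below is an illustrative assumption.

```python
# A sketch of the Conv S2 downsampling of S331: a 3x3, stride-2 convolution
# halves the spatial size and doubles the channels between stages.
import torch
import torch.nn as nn

class ConvS2(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.down = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(x)                   # (B, 2C, H/2, W/2)

x = torch.randn(1, 96, 128, 128)              # stage-1 features of a 512x512 image
for _ in range(3):                            # stages 2-4 reproduce the 1/8, 1/16,
    x = ConvS2(x.shape[1])(x)                 # and 1/32 pyramid scales
    print(tuple(x.shape))                     # (1,192,64,64) (1,384,32,32) (1,768,16,16)
```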
S34, the UPerHead structure is shown in FIG. 4. Depth-separable convolution (DWConv) performs hierarchical multi-scale feature aggregation on (C1, C2, C3, C4) respectively; C4 is upsampled through the PPM module and matrix-added to the features of each layer, and a 3 × 3 convolution then maps each layer to the H/4 × W/4 resolution. The feature fusion maps of the different layers are stacked cross-channel via Concat and fed into a 1 × 1 convolution for channel combination.
As shown in fig. 4, a semantic segmentation model is typically composed of a backbone and a segmentation head. The backbone network extracts features from the image so that the model can distinguish between different pixel classes. The segmentation head maps the features extracted by the backbone to specific classification categories and restores the downsampled features to the resolution of the input image.
UPerHead is proposed to address the differing importance of texture and category in semantic segmentation tasks. In this task both texture and category matter for distinguishing objects, so the feature maps of different stages need to be fused to improve model performance. UPerHead also borrows the idea of UPerNet that features of different resolutions are sensitive to different targets, improving the model's ability to discriminate between categories.
The overall UPerHead pipeline includes a Pyramid Pooling Module (PPM), a cascading addition architecture, and a fusion block. In the PPM module, global pooling at different pooling scales (1, 2, 3, and 6) gathers information from different receptive fields, strengthening the model's ability to distinguish categories. The C4 feature map becomes feature map F4 after passing through the PPM. The cascading addition architecture fuses the outputs of the different stages through step-wise addition: F4 (1/32) is upsampled and added to input C3 (1/16) to obtain the fused feature F3 (1/16), and F2 (1/8) and F1 (1/4) are obtained by the same operation. This gradual addition ensures that feature maps of different resolutions are fused effectively, improving model performance. Finally, the fusion block fuses the F1, F2, F3, and F4 feature maps to generate a fused feature at 1/4 resolution, and a convolution layer performs dimension mapping on the features to obtain the final segmentation map. Throughout this process, the cascading addition architecture and the pyramid pooling module fuse feature maps of different resolutions so that the model better understands object texture and category, improving segmentation accuracy and robustness.
Experimental comparison shows that UPerHead brings more efficient and accurate feature fusion to semantic segmentation tasks: global pooling at multiple scales and gradual addition fuse feature maps of different resolutions, letting the model better understand object texture and category.
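The following condensed sketch shows an UPerHead-style aggregation as described above (PPM on the deepest map, cascading top-down addition, mapping every level to 1/4 resolution, then concat and 1 × 1 fusion); the channel counts, bilinear upsampling, and module layout are assumptions, not the patent's exact head.

```python
# A condensed, assumed sketch of UPerHead-style aggregation: PPM on C4,
# top-down addition, mapping all levels to 1/4 resolution, concat, 1x1 fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleUPerHead(nn.Module):
    def __init__(self, in_chs=(96, 192, 384, 768), mid=128, num_classes=6):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, mid, 1) for c in in_chs)
        self.ppm = nn.ModuleList(                      # pooling scales 1, 2, 3, 6
            nn.Sequential(nn.AdaptiveAvgPool2d(s), nn.Conv2d(in_chs[-1], mid, 1))
            for s in (1, 2, 3, 6))
        self.smooth = nn.ModuleList(nn.Conv2d(mid, mid, 3, padding=1) for _ in in_chs)
        self.fuse = nn.Conv2d(mid * len(in_chs), num_classes, 1)

    def up(self, x, size):
        return F.interpolate(x, size=size, mode='bilinear', align_corners=False)

    def forward(self, c1, c2, c3, c4):
        f4 = self.lateral[3](c4) + sum(self.up(p(c4), c4.shape[-2:]) for p in self.ppm)
        f3 = self.lateral[2](c3) + self.up(f4, c3.shape[-2:])  # cascading addition
        f2 = self.lateral[1](c2) + self.up(f3, c2.shape[-2:])
        f1 = self.lateral[0](c1) + self.up(f2, c1.shape[-2:])
        tgt = c1.shape[-2:]                   # map every level to 1/4 resolution
        feats = [self.smooth[i](self.up(f, tgt)) for i, f in enumerate((f1, f2, f3, f4))]
        return self.fuse(torch.cat(feats, dim=1))      # (B, num_classes, H/4, W/4)
```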
S35, a residual axial attention mechanism is adopted to compensate for edge loss. The RAA blocks correspond to the backbone blocks in a layered manner, and each block is cascaded using bilinear-interpolation upsampling, which strengthens the boundary constraints of the backbone network and restores local spatial context information while keeping the computational complexity manageable. The attention mechanism is decomposed into two attention modules: the first performs self-attention along the height axis of the feature map, and the second operates along the width axis. The specific steps are as follows:
Referring to fig. 5, for the height-axis attention (Height-Axis Attention), X is uniformly divided into non-overlapping stripes [X^1, ..., X^M] of equal width sw, and each stripe contains sw × W tokens. Here sw is the stripe width, which can be adjusted manually to balance learning capacity against computational complexity. The output of the stripe attention is then defined as:

X = [X^1, X^2, ..., X^M];
Z^i = Attention(X^i W_Q, X^i W_K, X^i W_V), i = 1, 2, ..., M;
H-Attention(X) = [Z^1, Z^2, ..., Z^M]

where FC denotes a linear mapping; W_Q, W_K, and W_V are the projection weights of the query, key, and value of X respectively; and Z^i is the attention output of the i-th stripe.
After the height-axis attention features are obtained, the feature maps are recombined and fed into the width-axis attention (Width-Axis Attention) for feature aggregation in the horizontal direction; only the division direction changes, and the computation mirrors that of the height axis.
Edges are difficult to identify in semantic segmentation tasks. First, it is hard to label the boundary pixels of edge object classes accurately. Second, during remote sensing image acquisition, relative motion between the camera and ground objects introduces a certain amount of distortion. To address this, the boundaries of the feature map are enhanced by a residual axial attention mechanism (RAA for short), further improving the segmentation accuracy of high-resolution remote sensing images. Fig. 5 shows a functional diagram of the residual axial attention.
Under an ordinary spatial attention mechanism, obtaining the contextual relations of a target pixel requires computing attention over the whole image, so the computational complexity is quadratic in the input image size. To overcome this computational cost, the RAA is decomposed into two modules that perform self-attention along the vertical axis and the horizontal axis of a block feature map respectively. Finally, the input feature map and the attention output features are combined through a residual connection to obtain the module's final output. The RAA adopts a special attention design: the intersection region is split into two stripe regions in the horizontal and vertical directions, and attention is computed within each stripe region separately. The calculation process is as follows:
Given an input tensor X ∈ R^(H×W×C), where H, W, and C are the height, width, and channel dimensions of the input feature map, a convolution with kernel size 1 × 1 first adjusts the channel dimension. The result is then divided along the vertical direction into N stripe regions [X^1, ..., X^N] of adjustable width. The output Y_o of the vertical-direction attention at each position o = (i, j) is:

Y_o = Σ_{p ∈ S(o)} softmax_p(q_o^T k_p) v_p

where the query term q_o = W_Q x_o, key term k_o = W_K x_o, and value term v_o = W_V x_o are all linear transformations of the input x_o; W_Q and W_K have shape d_q × C, and W_V has shape d_out × C (in practice d_q and d_out are much smaller than C); all are learnable weight parameter matrices. softmax_p denotes the softmax function applied over all positions p in the stripe region S(o). A stripe region is easily extended over the entire feature map, i.e., over the whole column through o:

Y_o = Σ_{p ∈ col(o)} softmax_p(q_o^T k_p) v_p

In the concrete implementation, multiple stripe attentions operate in parallel on the input x_(o=(i,j)), and the outputs of the multiple channels are then concatenated to obtain the final output of the vertical-direction attention:

Y_H = Concat(Y^1, Y^2, ..., Y^N)

When performing the horizontal-axis attention computation, the output of the horizontal-axis attention is obtained simply by adjusting the division direction:

Y_W = W-Attention(Y_H)

Y = Conv(Y_W, C, kernel_size = 1) + X

Y_W contains the attention mask; a final 1 × 1 convolution restores the channels to the input size C, and the residual connection yields the final RAA output Y. Applying axial attention along the height and width axes effectively models the spatial linear relationships of ground-object targets with better computational efficiency. Experiments show that this efficient residual axial attention mechanism focuses more on local boundary alignment while its design reduces computational cost, making the RAA a more efficient design.
S36, the UPerHead segmentation head output and the RAA result are fused via Concat and fed into a 1 × 1 convolution for channel combination; finally, softmax normalization yields the predicted value of each target class, and the predictions are combined into a predicted segmentation map for evaluating segmentation performance.
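A minimal sketch of this S36 fusion head, with assumed channel sizes: cross-channel concatenation of the two branch outputs, a 1 × 1 convolution for channel combination, and softmax normalization.

```python
# A minimal sketch of the S36 fusion head under assumed channel sizes.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, uper_ch: int, raa_ch: int, num_classes: int = 6):
        super().__init__()
        self.combine = nn.Conv2d(uper_ch + raa_ch, num_classes, kernel_size=1)

    def forward(self, uper_out: torch.Tensor, raa_out: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([uper_out, raa_out], dim=1)   # cross-channel Concat
        logits = self.combine(fused)                    # 1x1 channel combination
        return torch.softmax(logits, dim=1)             # per-pixel class probabilities
```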
The structure of the Swin Transformer backbone is shown in FIG. 8. The overall framework consists of one patch partition (Patch Partition) module and four cascaded stages that produce four resolution outputs. Stage 1 contains a linear embedding (Linear Embedding) layer and two Swin Transformer blocks; each remaining stage contains one patch merging (Patch Merging) module and an even number of blocks (e.g., ×2). Every pair of blocks consists of a window multi-head self-attention (W-MSA) block and a shifted-window multi-head self-attention (SW-MSA) block for computing global attention. A W-MSA block specifically includes layer normalization (LN), the W-MSA module, and a multi-layer perceptron (MLP). LN normalizes the features to make training more stable; W-MSA and SW-MSA compute the attention relations between pixels; and the MLP contains many learnable parameters that record the coefficients learned by W-MSA. The specific procedure is as follows. Given an H × W × C input, the Patch Partition module first divides the image into 4 × 4 patches, each flattened into a vector of length 4 × 4 × C = 16C. The original H × W resolution therefore becomes H/4 × W/4, and the dimension changes from C to 16C. The linear embedding layer then maps the feature dimension from 16C to C_in and sends the feature map to the Swin Transformer blocks to establish global relationships. Patch Merging adjusts the feature resolution before each stage to generate features of appropriate resolution.
Fig. 8 shows the tiny version of the Swin Transformer, with 2, 2, 6, and 2 blocks in its four stages. By adjusting the number of blocks per stage and varying the value of the dimension C, four versions of the Swin Transformer are provided: Tiny, Small, Base, and Large, abbreviated Swin-T, Swin-S, Swin-B, and Swin-L, denoting model scales from small to large. The key innovative design of the Swin Transformer is the window self-attention mechanism (W-MSA): whereas the traditional self-attention mechanism (MSA) computes relations over the entire H × W image, W-MSA computes attention relations within multiple windows of size 7 × 7, greatly reducing the amount of computation. However, this also shrinks the receptive field, which is detrimental to segmenting large objects. The Swin Transformer therefore adds another clever design, shifted-window self-attention (SW-MSA), to solve this problem (as shown in FIG. 3): by dividing and re-merging the feature maps between two consecutive Transformer blocks, the local receptive field is extended to a global one. The Swin Transformer has four stages in total, with a Patch Merging layer performing 2× spatial downsampling at each stage; the four outputs C1, C2, C3, and C4 represent different resolutions: H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32 respectively. These multi-scale outputs have receptive fields ranging from small to large and are therefore sensitive to objects of different sizes.
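To make the W-MSA idea concrete, the sketch below shows the standard window partitioning step: the feature map is cut into non-overlapping M × M windows, and attention is then computed inside each window only. The helper follows the usual Swin-style reshape and is illustrative, not the patent's code.

```python
# A sketch of standard Swin-style window partitioning: the map is cut into
# non-overlapping M x M windows, and attention is computed inside each window.
import torch

def window_partition(x: torch.Tensor, M: int = 7) -> torch.Tensor:
    """x: (B, H, W, C) with H and W divisible by M -> (num_windows*B, M*M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

windows = window_partition(torch.randn(1, 56, 56, 96), M=7)
print(windows.shape)   # torch.Size([64, 49, 96]): 64 windows of 49 tokens each
```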
For ease of description, the abbreviation ESWin is used below in place of Efficient Swin Transformer. The detailed model structure is shown in fig. 3; there are four points of difference from the Swin Transformer structure.
1) Slice initialization and combination are performed using a convolution operation instead of the Patch Partition operation in the Swin Transformer. A convolution layer with a 7 × 7 kernel and stride 4 divides the input H × W × 3 picture (3 representing the RGB channels) into 4 × 4 patches, producing an (H/4) × (W/4) feature map that extracts the local features and position information of each window; the number of channels is then mapped from 3 to C and fed into an ESWin block.
2) Before each Transformer block, a 3 × 3 convolution with stride 2 (Conv S2) is used for slice reorganization and downsampling of the feature map. Specifically, the resolution of the upper-layer feature map is halved and the number of channels is doubled.
3) In each ESWin block, a depthwise convolution (DWConv) is used for position embedding to learn the position information of each pixel, instead of the relative position embedding in the original Swin Transformer.
4) Lighter ESWin blocks with input size H′ × W′ are used: a fully connected layer (FC) performs the reduced-dimension matrix multiplication to compute the attention relations, while the LN and MLP follow the conventional Transformer design.
By applying this structure, ESWin achieves accuracy comparable to the Swin Transformer but with lower computational complexity. The ESWin module is described in detail below.
One of the challenging problems in applying the Transformer structure to computer vision is the heavy computational load. The computational complexity of a self-attention block can be expressed as:

Ω_MSA = 4HWC^2 + 2(HW)^2 C

where C represents the dimension of the feature map, typically between tens and hundreds. From the above equation, the computational complexity of the self-attention mechanism is determined mainly by the image size, growing quadratically with it, so training the self-attention module requires enormous computational overhead. Efficiently compressing the input size while preserving accuracy therefore becomes the first choice for a lightweight structural design. The Swin Transformer divides an H × W image into M × M slices (patches), and the computational complexity after this cutting becomes:

Ω_W-MSA = 4HWC^2 + 2M^2 HWC

Through this simple manual design, the quadratic complexity is converted to linear complexity.
However, when performing semantic segmentation on high-resolution remote sensing images, the dataset itself is large, and protecting the continuity of spatial information and the integrity of large target samples requires input images that are as large as possible, so the choice of M is limited to a certain range. To further increase segmentation efficiency on high-resolution remote sensing images, the ESWin attention mechanism described herein (as shown in fig. 3) maps the H × W inputs to a smaller H′ × W′ to establish local and global attention relations, greatly reducing computation. When H′ × W′ is small enough, ESWin trains far more efficiently than the Swin Transformer, though performance is affected to some extent; through experimental comparison, H′ and W′ are set to achieve the best tradeoff between performance and efficiency. Furthermore, when the segmentation task is relatively easy, the number of ESWin blocks per stage and the channel dimension C can be further reduced to avoid overfitting and cut the consumption of computational resources, so model complexity decreases with almost no loss of accuracy. The attention computational complexity of ESWin can then be expressed as:

Ω_ESWin = 4H′W′C^2 + 2M^2 H′W′C
in addition, the number of blocks and the dimension of C can be further reduced when the segmentation task is relatively easy, thereby reducing model complexity, avoiding overfitting problems, and reducing the consumption of computational resources. Experimental results show that the method can remarkably reduce the complexity of the model under the condition of not losing the precision.
The embodiment of the invention also provides a hierarchical Transformer high-resolution remote sensing image semantic segmentation system, which is used for implementing the hierarchical Transformer high-resolution remote sensing image semantic segmentation method described above and comprises:
the image acquisition module is used for acquiring an original remote sensing image, and performing preliminary processing to obtain a high-resolution remote sensing image with uniform size;
the data preprocessing module is used for constructing a sample set through data preprocessing;
the model building module is used for building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images;
the model training module is used for training the semantic segmentation model of the high-resolution remote sensing image according to a preset training scheme to obtain an improved model;
the semantic segmentation module is used for cutting the high-resolution remote sensing image to be segmented according to a fixed size and loading the cut high-resolution remote sensing image into the improved model to realize rapid segmentation.
Fig. 6 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the present invention. As shown in fig. 6, an embodiment of the present invention provides an electronic device, including a memory 1310, a processor 1320, and a computer program 1311 stored in the memory 1310 and executable on the processor 1320, wherein the processor 1320 executes the computer program 1311 to implement the following steps: s1, acquiring an original remote sensing image, and performing preliminary processing to obtain a high-resolution remote sensing image with uniform size;
s2, data preprocessing is carried out to construct a sample set;
s3, building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images;
s4, training the semantic segmentation model of the high-resolution remote sensing image according to a preset training scheme to obtain an improved model;
s5, cutting the high-resolution remote sensing image to be subjected to segmentation processing according to a fixed size, and loading the cut high-resolution remote sensing image into an improved model to realize rapid segmentation.
Fig. 7 is a schematic diagram of an embodiment of a computer readable storage medium according to the present invention. As shown in fig. 7, the present embodiment provides a computer-readable storage medium 1400 having stored thereon a computer program 1411, which computer program 1411, when executed by a processor, performs the steps of: s1, acquiring an original remote sensing image, and performing preliminary processing to obtain a high-resolution remote sensing image with uniform size;
s2, data preprocessing is carried out to construct a sample set;
s3, building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images;
s4, training the semantic segmentation model of the high-resolution remote sensing image according to a preset training scheme to obtain an improved model;
s5, cutting the high-resolution remote sensing image to be subjected to segmentation processing according to a fixed size, and loading the cut high-resolution remote sensing image into an improved model to realize rapid segmentation.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A hierarchical Transformer high-resolution remote sensing image semantic segmentation method, characterized by comprising the following steps:
s1, acquiring an original remote sensing image, and performing preliminary processing to obtain a high-resolution remote sensing image with uniform size;
s2, data preprocessing is carried out to construct a sample set;
s3, building a hierarchical efficient Transformer semantic segmentation model for high-resolution remote sensing images;
s4, training the semantic segmentation model of the high-resolution remote sensing image according to a preset training scheme to obtain an improved model;
s5, cutting the high-resolution remote sensing image to be subjected to segmentation processing according to a fixed size, and loading the cut high-resolution remote sensing image into an improved model to realize rapid segmentation.
2. The hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to claim 1, wherein S1 specifically comprises:
s11, scanning and shooting a preset ground by using a high-resolution remote sensing sensor carried by an unmanned aerial vehicle to obtain an original remote sensing image;
s12, performing image correction, stitching and denoising on the original remote sensing image to generate a high-quality high-resolution remote sensing image;
s13, performing artificial visual annotation on part of the high-resolution remote sensing image data set to improve the generalization capability of the model under the condition of collecting data set samples;
and S14, unifying the sizes of the high-resolution remote sensing dataset images, and cropping the original images and the corresponding labels to 512 × 512 to fit the network input.
3. The hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to claim 1, wherein the S2 specifically comprises:
s21, normalizing all the processed images for the convenience of training;
s22, using one-hot coding to carry out vectorization coding for each pixel category of the label;
s23, enhancing the image by adopting a spatial data enhancement mode to obtain a data set;
S24, the data set is divided into a training set, a validation set, and a test set according to a ratio of 3:1:1.
4. The hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to claim 1, wherein the step S3 specifically comprises:
S31, constructing a lightweight backbone network fitted with a multi-scale feature aggregation segmentation head and correspondingly connected residual axial attention blocks (Residual Axial Attention, RAA), the backbone network comprising four stages, each stage containing a convolutional token embedding block (Convolutional Token Embedding, Conv S2) and an EST block (ESwin Transformer Block);
S32, inputting an H × W × 3 image into the backbone network to establish global relations, where H and W are the input height and width and 3 denotes the three RGB channels;
S33, the convolutional token embedding block adjusts the feature resolution before each stage to generate features of the appropriate resolution; the EST block adopts a lightweight self-attention design to reduce the number of parameters: the input size is halved through a linear embedding layer, after which self-attention computation, layer normalization (LN) and a multi-layer perceptron (MLP) are applied with the resolution kept unchanged, and the result is fed to the next stage; step S33 is repeated a total of 4 times to output (C1, C2, C3, C4);
S34, performing hierarchical multi-scale feature aggregation on (C1, C2, C3, C4) with depth-wise separable convolution (DWConv), and matrix-adding C4, upsampled through the PPM module, to the features of each layer; each layer is then mapped by a 3 × 3 convolution to the H/4 × W/4 resolution, yielding feature fusion maps of the different layers, which are stacked across channels by Concat and fed into a 1 × 1 convolution for channel merging;
S35, compensating edge loss with a residual axial attention mechanism;
S36, performing Concat fusion on the UPerHead segmentation head output and the RAA result, feeding it into a 1 × 1 convolution for channel merging, and finally obtaining predicted values of the target classes through softmax normalization; the predictions are assembled into a predicted segmentation map for evaluating segmentation performance.
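To make the aggregation in S34 and S36 concrete, the following PyTorch sketch applies a depth-wise separable convolution per level, upsamples every level to the H/4 × W/4 resolution, concatenates across channels and merges with a 1 × 1 convolution before softmax; the channel widths are assumptions, and the PPM and RAA branches of the disclosed head are deliberately omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHeadSketch(nn.Module):
    def __init__(self, in_chs=(64, 128, 256, 512), num_classes=6):
        super().__init__()
        # one depth-wise separable conv (DWConv) per feature level, as in S34
        self.dw = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),  # depth-wise
                          nn.Conv2d(c, 64, 1))                      # point-wise
            for c in in_chs)
        self.fuse = nn.Conv2d(64 * len(in_chs), num_classes, 1)     # 1x1 channel merge

    def forward(self, feats):            # feats = (C1, C2, C3, C4)
        size = feats[0].shape[2:]        # common H/4 x W/4 resolution
        outs = [F.interpolate(dw(f), size=size, mode='bilinear', align_corners=False)
                for dw, f in zip(self.dw, feats)]
        logits = self.fuse(torch.cat(outs, dim=1))  # Concat + 1x1 conv
        return logits.softmax(dim=1)                # softmax normalization (S36)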
5. The hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to claim 4, wherein the step S32 specifically comprises:
firstly, the convolutional token embedding block divides the image into H/4 × W/4 non-overlapping patches of size 4 × 4, each flattened into a vector of length 4 × 4 × 3 = 48;
the feature dimension is then mapped from 48 to C through the linear embedding layer, and the features are fed into the EST blocks to establish global relations.
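A minimal sketch of such a convolutional token embedding, assuming it is realized as a single 4 × 4 strided convolution (the claim does not fix this realization):

import torch.nn as nn

class ConvEmbedSketch(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        # each 4x4 RGB patch (4*4*3 = 48 values) is projected to embed_dim = C
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, C) token sequence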
6. The hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to claim 4, wherein the step S33 specifically comprises:
S331, a 3 × 3 convolution with stride 2 scales the feature map to half its size, H/2 × W/2, and maps it to the channel dimension of the next stage, completing the downsampling of the feature map;
S332, the EST block adopts a lightweight self-attention design to reduce the parameter count, maps the input size (H × W) to (H′ × W′) through a linear embedding layer, and then performs projection, reshaping and multiplication, layer normalization (LN) and a multi-layer perceptron (MLP), keeping the resolution unchanged, and feeds the result to the next stage;
the above steps are repeated, and the four stages output (C1, C2, C3, C4) at scales of H/4 × W/4, H/8 × W/8, H/16 × W/16 and H/32 × W/32, respectively.
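The stride-2 downsampling of S331 might be sketched as follows; the doubling of channels between stages and the choice of normalization layer are assumptions consistent with common hierarchical Transformer backbones, not details fixed by the claim.

import torch.nn as nn

class DownSampleSketch(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 3x3 convolution with stride 2 halves the spatial resolution (S331)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.norm = nn.GroupNorm(1, out_ch)  # layer-norm-like normalization

    def forward(self, x):                    # (B, C, H, W) -> (B, out_ch, H/2, W/2)
        return self.norm(self.conv(x))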
7. The hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to claim 4, wherein the step S35 specifically comprises:
for height-axis (vertical-stripe) axial attention, X is uniformly divided into non-overlapping horizontal stripes [X1, ..., XM] of equal width sw, each stripe containing sw × W tokens, where sw is the stripe width;
after the height-axis attention features are obtained, the feature maps are recombined and fed into the width-axis (horizontal) attention to perform feature aggregation in the horizontal direction, for which only the corresponding stripe division along the width axis is needed.
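The horizontal-stripe partition used for the height-axis attention in S35 can be sketched as below; the attention computation itself is omitted, and the (B, H, W, C) tensor layout is an assumption.

import torch

def split_horizontal_stripes(x, sw):
    # x: (B, H, W, C) feature map -> (B*M, sw*W, C) stripes, M = H // sw
    B, H, W, C = x.shape
    assert H % sw == 0, "H must be divisible by the stripe width sw"
    x = x.reshape(B, H // sw, sw, W, C)          # M stripes of height sw
    return x.reshape(B * (H // sw), sw * W, C)   # each stripe holds sw*W tokens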
8. A hierarchical Transformer high-resolution remote sensing image semantic segmentation system, characterized in that the system is configured to implement the hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to any one of claims 1-7, and comprises:
the image acquisition module is used for acquiring an original remote sensing image, and performing preliminary processing to obtain a high-resolution remote sensing image with uniform size;
the data preprocessing module is used for constructing a sample set through data preprocessing;
the model building module is used for building a hierarchical efficient Transformer high-resolution remote sensing image semantic segmentation model;
the model training module is used for training the semantic segmentation model of the high-resolution remote sensing image according to a preset training scheme to obtain an improved model;
the semantic segmentation module is used for cutting the high-resolution remote sensing image to be segmented according to a fixed size and loading the cut high-resolution remote sensing image into the improved model to realize rapid segmentation.
9. An electronic device, comprising a memory and a processor, wherein the processor is configured to implement the steps of the hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to any one of claims 1-7 when executing a computer management class program stored in the memory.
10. A computer readable storage medium, having stored thereon a computer management class program which, when executed by a processor, implements the steps of the hierarchical Transformer high-resolution remote sensing image semantic segmentation method according to any one of claims 1-7.
CN202310298438.1A 2023-03-24 2023-03-24 Hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system Pending CN116258976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310298438.1A CN116258976A (en) Hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310298438.1A CN116258976A (en) Hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system

Publications (1)

Publication Number Publication Date
CN116258976A true CN116258976A (en) 2023-06-13

Family

ID=86682540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310298438.1A Pending CN116258976A (en) 2023-03-24 2023-03-24 Hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system

Country Status (1)

Country Link
CN (1) CN116258976A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993756A (en) * 2023-07-05 2023-11-03 石河子大学 Method for dividing verticillium wilt disease spots of field cotton
CN117274608A (en) * 2023-11-23 2023-12-22 太原科技大学 Remote sensing image semantic segmentation method based on space detail perception and attention guidance
CN117274608B (en) * 2023-11-23 2024-02-06 太原科技大学 Remote sensing image semantic segmentation method based on space detail perception and attention guidance
CN117372701A (en) * 2023-12-07 2024-01-09 厦门瑞为信息技术有限公司 Interactive image segmentation method based on Transformer
CN117593716A (en) * 2023-12-07 2024-02-23 山东大学 Lane line identification method and system based on unmanned aerial vehicle inspection image
CN117372701B (en) * 2023-12-07 2024-03-12 厦门瑞为信息技术有限公司 Interactive image segmentation method based on Transformer

Similar Documents

Publication Publication Date Title
Zhang et al. Remote sensing image spatiotemporal fusion using a generative adversarial network
CN113688813B (en) Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage
CN110443842B (en) Depth map prediction method based on visual angle fusion
US11544900B2 (en) Primitive-based 3D building modeling, sensor simulation, and estimation
CN116258976A (en) Hierarchical Transformer high-resolution remote sensing image semantic segmentation method and system
Wang et al. A review of image super-resolution approaches based on deep learning and applications in remote sensing
CN107918776B (en) Land planning method and system based on machine vision and electronic equipment
Cheng et al. ResLap: Generating high-resolution climate prediction through image super-resolution
CN109117894B (en) Large-scale remote sensing image building classification method based on full convolution neural network
Chen et al. A landslide extraction method of channel attention mechanism U-Net network based on Sentinel-2A remote sensing images
US20220044072A1 (en) Systems and methods for aligning vectors to an image
CN116645592B (en) Crack detection method based on image processing and storage medium
CN113610070A (en) Landslide disaster identification method based on multi-source data fusion
Wang et al. Dilated projection correction network based on autoencoder for hyperspectral image super-resolution
CN112950780A (en) Intelligent network map generation method and system based on remote sensing image
CN113378897A (en) Neural network-based remote sensing image classification method, computing device and storage medium
Dai et al. Gated convolutional networks for cloud removal from bi-temporal remote sensing images
CN115457396A (en) Surface target ground object detection method based on remote sensing image
CN106204507A (en) A kind of unmanned plane image split-joint method
CN115240066A (en) Remote sensing image mining area greening monitoring method and system based on deep learning
CN114550014A (en) Road segmentation method and computer device
Li Segment any building
CN117788296A (en) Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network
CN111311732B (en) 3D human body grid acquisition method and device
CN117496347A (en) Remote sensing image building extraction method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wu Honglin

Inventor after: Fu Yongquan

Inventor after: Jia Yong

Inventor before: Fu Yongquan

Inventor before: Wu Honglin

Inventor before: Jia Yong