CN115205672A

CN115205672A - Remote sensing building semantic segmentation method and system based on multi-scale regional attention

Info

Publication number: CN115205672A
Application number: CN202210577106.2A
Authority: CN
Inventors: 徐胜军; 邓博文; 孟月波; 刘光辉; 赵敏华; 韩九强; 钟德星; 吕红强
Original assignee: Xian University of Architecture and Technology
Current assignee: Xian University of Architecture and Technology
Priority date: 2022-05-25
Filing date: 2022-05-25
Publication date: 2022-10-18

Abstract

A remote sensing building semantic segmentation method and system based on multi-scale regional attention, comprising the following steps: step 1, obtaining images containing remote sensing buildings and constructing a remote sensing building data set; step 2, training a pre-constructed semantic segmentation network with the acquired remote sensing building data set to obtain a trained semantic segmentation network, wherein the semantic segmentation network comprises an encoder-decoder network based on regional attention and a multi-scale regional attention module; step 3, using the trained semantic segmentation network to segment and extract buildings in the remote sensing image to be processed. The method can effectively locate discriminative feature regions, extract global node features and local semantic information, segment high-resolution remote sensing building images more effectively, and has better robustness.

Description

Remote sensing building semantic segmentation method and system based on multi-scale regional attention

Technical Field

The invention belongs to the technical field of high-resolution remote sensing building image extraction, and particularly relates to a remote sensing building semantic segmentation method and system based on multi-scale regional attention.

Background

Semantic segmentation of high-resolution remote sensing buildings is an important component of remote sensing earth observation technology. Its main task is to extract building-related feature information from an acquired remote sensing image and classify the target represented by each pixel, thereby extracting the buildings in the image. With the development of computer vision, more and more researchers have studied semantic segmentation of high-resolution remote sensing buildings in depth. Existing methods fall into two main groups: methods based on traditional machine learning and methods based on deep learning. Methods based on traditional machine learning mainly train classifiers on hand-crafted features, such as basic image features of shape, texture, color, spectrum and spatial detail. Although hand-crafted features can effectively represent various attributes of an image, classical algorithms built on them often suffer from poor generalization and complex design, and they struggle to solve semantic segmentation of remote sensing images in real, complex environments.

In recent years, deep learning, with its strong generalization and its ability to learn target features automatically, has achieved good results in semantic segmentation of high-resolution remote sensing building images. To enhance the network's ability to represent the target to be extracted in different scenes, the mainstream approach is to add an attention mechanism to the encoder-decoder modules, so that the network can capture long-range dependencies and context information and thereby improve segmentation and classification accuracy. Although deep learning and attention mechanisms have produced significant results in this task, mainstream attention-based methods are still limited to associating and classifying pixels at different positions. They pay little attention to consistency within local sub-regions and correlation between regions, so the network lacks learning and supervision of the consistency of the target region and its edges, which greatly affects the accuracy of the segmentation result. Therefore, designing an effective regional attention mechanism that further strengthens the network's attention to the correlation between neighborhoods of a remote sensing building image and to the consistency of pixels within each neighborhood remains a very challenging problem.

Disclosure of Invention

The invention aims to provide a remote sensing building semantic segmentation method and system based on multi-scale regional attention, and overcomes the defects in the prior art.

In order to achieve the purpose, the invention adopts the technical scheme that:

the invention provides a remote sensing building semantic segmentation method based on multi-scale regional attention, which comprises the following steps of:

step 1, obtaining an image containing a remote sensing building, and constructing a remote sensing building data set;

step 2, training a pre-constructed semantic segmentation network by using the acquired remote sensing building data set to obtain the trained semantic segmentation network, wherein the semantic segmentation network comprises a coding and decoding structure network based on regional attention and a multi-scale regional attention module;

step 3, segmenting and extracting buildings in the remote sensing image to be extracted by using the trained semantic segmentation network.

Preferably, the regional attention-based encoder-decoder network comprises an encoder and a decoder, wherein the encoder comprises a convolution block and four residual blocks based on a residual structure; the decoder comprises four upsampling blocks, four multi-scale region attention modules and a convolution block.

Preferably, the convolution block of the encoder comprises two convolution layers, each followed by a batch normalization layer and a leaky rectified linear unit (Leaky ReLU).

Preferably, each residual block comprises a max pooling layer whose output is connected to two convolution layers, and the output of each convolution layer is connected in sequence to a batch normalization layer and a Leaky ReLU.

Preferably, the four upsampling blocks and the four multi-scale region attention modules of the decoder are connected alternately, and the output of the last multi-scale region attention module is connected to a convolution block.

Preferably, each upsampling block comprises two convolution layers, each followed by a batch normalization layer and a Leaky ReLU.

Preferably, the multi-scale region attention module comprises a multi-scale neighborhood extraction module, a region embedding module, a self-attention module and a local weighting module.

Preferably, the multi-scale neighborhood extraction module comprises two stages, wherein the first stage comprises four dilated convolution layers; the second stage comprises a convolution layer whose output is connected in sequence to a batch normalization layer and a Leaky ReLU;

the region embedding module comprises a max pooling layer and a convolution layer, and the output of the convolution layer is connected to a batch normalization layer;

the self-attention module comprises three convolution layers and a Softmax layer;

the local weighting module comprises an upsampling layer and two convolution layers, and the output of each convolution layer is connected to a batch normalization layer and a Leaky ReLU.

Preferably, in step 2, the pre-constructed semantic segmentation network is trained with the acquired remote sensing building data set to obtain the trained semantic segmentation network, and the specific method is as follows:

the pre-constructed semantic segmentation network is trained by iterative optimization on the acquired remote sensing building data set, combined with a loss function for region consistency supervision, to obtain the trained semantic segmentation network.

A remote sensing building semantic segmentation system based on multi-scale regional attention, comprising:

the data acquisition unit is used for acquiring images containing the remote sensing buildings and constructing a remote sensing building data set;

the network training unit is used for training a pre-constructed semantic segmentation network by utilizing the acquired remote sensing building data set to obtain the trained semantic segmentation network, wherein the semantic segmentation network comprises a coding and decoding structure network based on regional attention and a multi-scale regional attention module;

and the segmentation and extraction unit is used for segmenting and extracting buildings in the remote sensing image to be extracted by utilizing the trained semantic segmentation network.

Compared with the prior art, the invention has the beneficial effects that:

the invention provides a remote sensing building semantic segmentation method based on multi-scale regional attention, in which the network ReA-Net first uses an encoder to extract features of the buildings in the remote sensing image such as texture, boundaries and deep semantics. Second, a progressively upsampling decoder restores the resolution of the extracted feature map; at the same time, a multi-scale region attention mechanism introduced in the feature fusion stage of upsampling strengthens the network's attention to the correlation between neighborhoods of the remote sensing building image and to pixel consistency within each neighborhood, which improves the network's ability to extract region and boundary feature information of the target to be segmented. Finally, by introducing a region consistency supervision loss with a weighted penalty term, the consistency of the neighborhood and higher-order neighborhood of each pixel between the observation field and the label field is approximated, which strengthens the continuity of neighborhood labels in the classification result and increases both the sensitivity of the model to building boundaries and the accuracy of pixel classification;

in conclusion, the method can effectively locate the distinguishing characteristic region, extract the global node characteristics and the local semantic information, can more effectively segment the high-resolution remote sensing building image, and has better robustness.

Drawings

FIG. 1 is the structure of the remote sensing building semantic segmentation network based on multi-scale regional attention;

FIG. 2 is a structural diagram of the multi-scale region attention module;

FIG. 3 is a structural diagram of the region consistency loss;

FIG. 4 is a flow chart of the present invention;

FIG. 5 shows the segmentation results.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

Referring to FIG. 1, the present invention provides a remote sensing building semantic segmentation method based on multi-scale regional attention, which includes the following steps:

Step 1, constructing a remote sensing building semantic segmentation network based on multi-scale regional attention, the structure of which is shown in FIG. 1.

Based on a semantic segmentation network with a U-Net encoder-decoder structure, the method constructs a multi-scale region attention (MRA) module and thereby provides a semantic segmentation network based on multi-scale regional attention (ReA-Net). To make the labels that the network assigns to pixels within a local region smoother, a local region consistency supervision module is established at the network output layer, wherein:

the proposed ReA-Net network mainly consists of three parts: a remote sensing image feature extraction module based on the U-Net encoder-decoder structure, a decoder-based resolution recovery module, and a multi-scale regional attention (MRA) module, wherein:

the remote sensing image feature extraction module extracts features of the buildings in the remote sensing image such as texture, boundaries and deep semantics, and passes the extracted features to the decoder-based resolution recovery module;

the decoder-based resolution recovery module restores the resolution of the input features. To strengthen the network's ability to represent correlation between regions of the remote sensing image, a multi-scale region attention module is introduced into the resolution recovery module: region-level features at different scales are constructed with dilated convolution and pooling operations and input into the self-attention module to obtain an enhancement map of the region-level correlation of the feature map; finally, a local weighting module fuses the region correlation enhancement map with the input features, thereby expressing the region-level correlation of the remote sensing image.

Meanwhile, to make the remote sensing image segmentation result smoother, a loss function for multi-scale neighborhood consistency supervision is proposed, based on the assumption that adjacent pixels in a local region tend to take the same segmentation label. This strengthens the consistency constraint on local regions and yields smoother labels within each segmented local region.

The main contributions of the invention are: (1) a multi-scale regional attention module is constructed on the U-Net decoding structure, and a region-attention-based encoder-decoder network (Region-Attention Net, ReA-Net) is designed, which strengthens the regional attention of semantic features to the target region; (2) a loss function for multi-scale neighborhood consistency supervision is proposed, based on the spatial consistency assumption that pixels in a local region tend to take the same label; (3) the multi-scale regional attention and the neighborhood consistency supervision mechanism are combined into a remote sensing building semantic segmentation algorithm based on multi-scale regional consistency attention supervision.

Step 2, constructing a remote sensing building data set, which mainly comprises the following steps:

First, high-resolution remote sensing building image data are collected; the experiments use the initial image Dataset and the Massachusetts Building image Dataset.

Then, data augmentation is applied to the collected remote sensing building images by jointly using random cropping, random horizontal flipping, vertical flipping and similar methods, and the images are divided into a training data set and a test data set in a fixed proportion.
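The augmentation and split described in Step 2 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation: the function names `augment_pair` and `split_dataset`, the 0.5 flip probability and the 0.8 training fraction are assumptions, since the text only states that cropping and flipping are applied and that the data are divided "in a certain proportion". The point it illustrates is that the image and its label mask must receive the same crop window and the same flips so that pixels and labels stay aligned.

```python
import random


def augment_pair(image, mask, crop_size, rng):
    """Apply one random crop and the same random flips to an image and its mask.

    image, mask: 2-D lists (rows of pixel values) with identical shape.
    crop_size: (height, width) of the crop window.
    rng: a random.Random instance, so that results are reproducible.
    """
    height, width = len(image), len(image[0])
    crop_h, crop_w = crop_size
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    flip_horizontal = rng.random() < 0.5  # assumed probability
    flip_vertical = rng.random() < 0.5    # assumed probability

    def transform(grid):
        out = [row[left:left + crop_w] for row in grid[top:top + crop_h]]
        if flip_horizontal:
            out = [row[::-1] for row in out]
        if flip_vertical:
            out = out[::-1]
        return out

    # The same window and flips are used for both inputs.
    return transform(image), transform(mask)


def split_dataset(samples, train_fraction, rng):
    """Shuffle the samples and split them into training and test lists."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


rng = random.Random(0)
image = [[r * 10 + c for c in range(8)] for r in range(8)]
mask = [row[:] for row in image]  # stand-in mask, used to check alignment
cropped_image, cropped_mask = augment_pair(image, mask, (4, 4), rng)
train, test = split_dataset(range(10), 0.8, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs, which makes training experiments easier to compare.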

Step 3, constructing a coder-decoder part of ReA-Net, which mainly comprises the following steps:

The encoder of ReA-Net mainly comprises two stages. The first stage uses a convolution block (Conv1) to extract low-level texture features of the remote sensing image; the block mainly consists of two convolution layers with 3 × 3 kernels and a padding of 1, each followed by a batch normalization (BN) layer and a leaky rectified linear unit (Leaky ReLU). The second stage comprises 4 residual blocks (ResConv1-ResConv4) based on a residual structure. Each residual block contains a max pooling layer with a kernel size of 2 × 2 for down-sampling, followed by two convolution layers with 3 × 3 kernels and a padding of 1, each followed by a BN layer and a Leaky ReLU, which increases the network's ability to extract deeper semantic features of the remote sensing image. The encoder parameters of ReA-Net are shown in Table 1.

Table 1. ReA-Net encoder parameters

In the table, Kernel denotes the convolution kernel size; H and W denote the height and width of the input image; Max denotes max pooling.

The decoder mainly uses a progressive upsampling strategy to restore the resolution of the extracted feature map and complete dense pixel classification.

The decoder mainly comprises 4 upsampling blocks (Upsample1-Upsample4), 4 multi-scale region attention modules (MRA1-MRA4) and a convolution block (Outconv), and is divided into five stages. Each of the first four stages contains an upsampling block and a multi-scale region attention module. To eliminate the checkerboard artifacts caused by deconvolution, each upsampling block consists of a bilinear upsampling layer and two convolution layers with 3 × 3 kernels and a padding of 1, each followed by a BN and a Leaky ReLU. In the upsampling feature fusion stage, the multi-scale region attention module fuses region-level correlation enhancement maps of different scales, which strengthens the network's ability to express large-scale spatial correlation between regions of the remote sensing image. The multi-scale region attention block mainly consists of several pooling and convolution layers. Finally, the fifth stage consists of a convolution block that performs the segmentation of the output feature map. The decoder parameters of ReA-Net are shown in Table 2.

Table 2. ReA-Net decoder parameters

In the table, Kernel denotes the convolution kernel size; H and W denote the height and width of the input image; scale_factor denotes the upsampling rate.
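Because the parameter tables are not reproduced in this text, the spatial sizes implied by the encoder and decoder descriptions can be traced with a short calculation. The sketch below is illustrative only and rests on two assumptions that the text does not state: the 2 × 2 max pooling uses a stride of 2, and each upsampling block uses a scale_factor of 2. Channel counts are omitted because they are not given. The helper names `conv_out`, `encoder_sizes` and `decoder_sizes` are hypothetical.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(size):
    """Spatial size after Conv1 and after each of ResConv1-ResConv4."""
    current = conv_out(conv_out(size))  # Conv1: two 3x3 convolutions, padding 1
    sizes = [("Conv1", current)]
    for index in range(1, 5):
        # 2x2 max pooling; a stride of 2 is an assumption of this sketch.
        current = conv_out(current, kernel=2, padding=0, stride=2)
        current = conv_out(conv_out(current))  # two 3x3 convolutions, padding 1
        sizes.append((f"ResConv{index}", current))
    return sizes


def decoder_sizes(size, scale_factor=2):
    """Spatial size after each Upsample/MRA stage (scale_factor is assumed)."""
    sizes = []
    current = size
    for index in range(1, 5):
        current = current * scale_factor       # bilinear upsampling
        current = conv_out(conv_out(current))  # two 3x3 convolutions, padding 1
        # The MRA pools by 4 and upsamples by 4, so the size must divide by 4
        # for its output to match its input.
        if current % 4 != 0:
            raise ValueError(f"stage {index}: size {current} is not divisible by 4")
        sizes.append((f"Upsample{index}/MRA{index}", current))
    return sizes


encoder = encoder_sizes(512)
decoder = decoder_sizes(encoder[-1][1])
```

Under these assumptions a 512 × 512 input is reduced to 32 × 32 at ResConv4 and restored to 512 × 512 after the fourth decoder stage.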

Step 4, constructing a multi-scale region attention Module (MRA), and specifically comprising the following steps:

Let the feature map input to the MRA be $F_{in} \in \mathbb{R}^{W_f \times H_f \times C}$ and the output multi-scale region attention feature map be $F_{out} \in \mathbb{R}^{W_f \times H_f \times C}$, where $W_f$, $H_f$ and $C$ are respectively the height, width and number of channels of the feature map input to the MRA.

The proposed MRA mainly consists of a multi-scale neighborhood extraction (MNE) module, a Region Embedding (RE) module, a Self-Attention (SA) module, and a Local Weighting (LW) module.

Specifically, the multi-scale neighborhood extraction (MNE) module consists of five convolution layers divided into two stages. The first stage consists of four convolution layers with a kernel size of 3 × 3, dilation rates of [1,3,5,7] and paddings of [1,3,5,7]; it extracts multi-scale information, and the results are concatenated. The second stage consists of a convolution layer with a kernel size of 1 × 1 that restores the feature dimension, and this layer is followed by a BN and a Leaky ReLU;

the region embedding (RE) module constructs region-level descriptors; it mainly consists of a max pooling layer with a kernel size of 4 and a stride of 4 and a convolution layer with a kernel size of 3 × 3, the convolution layer being followed by a BN;

the self-attention (SA) module constructs the correlation between feature regions; it mainly consists of three convolution layers with a kernel size of 1 × 1 and a Softmax layer;

the local weighting (LW) module weights the region-level correlation features into the original input feature map to generate the multi-scale region attention feature map. The LW mainly consists of an upsampling block made up of an upsampling layer with a scale factor of 4 in nearest-neighbor mode and two convolution layers with 3 × 3 kernels and a padding of 1, each convolution layer in the block being followed by a BN and a Leaky ReLU.

The specific flow of the MRA can be described as follows.

First, dilated convolution blocks with different dilation rates extract neighborhood features from the input remote sensing building feature map $F_{in}$, and the neighborhood features extracted by the convolution layers with different dilation rates are concatenated to obtain the multi-scale neighborhood feature map $A$. Namely:

$$F_{in,d} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{d,k,pad}(F_{in}))) \tag{1}$$

$$A = \mathrm{Concat}(F_{in,1}, F_{in,3}, F_{in,5}, F_{in,7}) \tag{2}$$

where $\mathrm{Conv}_{d,k,pad}(\cdot)$ denotes the dilated convolution layers with dilation rate $d \in [1,3,5,7]$, kernel size $k = 3$ and padding $pad \in [1,3,5,7]$; BN denotes a batch normalization layer; ReLU denotes a leaky rectified linear layer; Concat denotes the concatenation operation.
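Setting the padding equal to the dilation rate in equation (1) is what allows the four branches to be concatenated in equation (2): with a 3 × 3 kernel and stride 1, every branch keeps the input's spatial size while covering a progressively larger neighborhood. The sketch below checks this with the standard output-size formula for dilated convolution; a stride of 1 is assumed, and the function names are illustrative.

```python
def effective_kernel(kernel, dilation):
    """Side length of the neighborhood covered by a dilated kernel."""
    return dilation * (kernel - 1) + 1


def dilated_conv_out(size, kernel, dilation, padding, stride=1):
    """Output length of a dilated convolution along one axis."""
    return (size + 2 * padding - effective_kernel(kernel, dilation)) // stride + 1


def mne_branch_sizes(size, dilations=(1, 3, 5, 7), kernel=3):
    """Output size of each first-stage MNE branch, using padding = dilation."""
    return {d: dilated_conv_out(size, kernel, d, padding=d) for d in dilations}


branch_sizes = mne_branch_sizes(64)
neighborhoods = [effective_kernel(3, d) for d in (1, 3, 5, 7)]
```

For a 64 × 64 feature map all four branches return 64 × 64, and the covered neighborhoods are 3, 7, 11 and 15 pixels wide.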

Second, a convolution layer with a kernel size of 1 performs dimension selection on the feature map $A$ to obtain the feature map $B$, which reduces the redundancy of the multi-scale features and strengthens their representation capability. At the same time, to obtain feature representation information for all regions, a region average pooling operation is applied to the feature map $B$ to obtain the region-level descriptors $C$. Namely:

$$C = \mathrm{Avgpool}_{k,s}(\mathrm{BN}(\mathrm{Conv}_{d,k,pad}(A))) \tag{3}$$

where $\mathrm{Conv}_{d,k,pad}(\cdot)$ denotes a convolution layer with dilation rate $d = 1$, kernel size $k = 1$ and padding $pad = 0$, with a dimension reduction rate of 0.25; BN denotes a batch normalization layer; $\mathrm{Avgpool}_{k,s}(\cdot)$ denotes the pooling layer (a max pooling layer in the region embedding module described above) with kernel size $k = 4$ and stride $s = 4$; $H_r = \frac{1}{4}H$, $W_r = \frac{1}{4}W$.
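The pooling step in equation (3) turns each non-overlapping 4 × 4 window of the feature map into one region-level descriptor, which is why $H_r$ and $W_r$ are one quarter of the input size. The sketch below shows this on a single channel. Since the text names the operation both as average pooling and as max pooling, the reduction is left as a parameter; the function names are illustrative and the 1 × 1 convolution and BN are omitted.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def region_pool(feature_map, kernel=4, stride=4, reduce=max):
    """Pool a single-channel 2-D feature map into region-level descriptors.

    feature_map: list of rows; reduce: function applied to each window
    (max or mean), because the source text mentions both variants.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - kernel + 1, stride):
        row = []
        for left in range(0, width - kernel + 1, stride):
            window = [
                feature_map[r][c]
                for r in range(top, top + kernel)
                for c in range(left, left + kernel)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]
max_descriptors = region_pool(feature_map)                 # 2 x 2 regions
mean_descriptors = region_pool(feature_map, reduce=mean)   # 2 x 2 regions
```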

Third, to obtain the correlation between feature regions, the region-level feature map $C$ is input to the self-attention (SA) module. Convolution layers with a kernel size of 1 encode and reshape the feature map $C$ to obtain three encoded feature matrices $V$, $G$ and $I$. The two feature matrices $V$ and $G$ are multiplied and passed through a Softmax activation function to obtain the spatial attention matrix $Z$. Finally, the feature matrix $I$ is multiplied by the spatial attention matrix $Z$ and reshaped to obtain the inter-region correlation enhancement map $K$.

Here, $\mathrm{Conv}_{d,k,pad}(\cdot)$ denotes a convolution layer with dilation rate $d = 1$, kernel size $k = 1$ and padding $pad = 0$; the product operator denotes multiplication of corresponding elements; $V_i$, $G_i$ denote the attention scores at spatial position $i$; $M = W_p \times H_p$.
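The self-attention step can be illustrated on a handful of region descriptors. The sketch below is one common reading of the description, not the patent's exact formula, which is not reproduced in this text: it assumes dot-product scores between the rows of $V$ and $G$, a Softmax over each row to form $Z$, and a weighted sum of the rows of $I$ to form $K$. The 1 × 1 convolutions are modelled as per-region linear maps with hypothetical weight matrices `w_v`, `w_g` and `w_i`.

```python
import math


def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_self_attention(descriptors, w_v, w_g, w_i):
    """Return (Z, K) for M region descriptors.

    descriptors: list of M feature vectors, one per region.
    w_v, w_g, w_i: stand-ins for the three 1x1 convolutions (assumed linear).
    """
    v = [matvec(w_v, d) for d in descriptors]
    g = [matvec(w_g, d) for d in descriptors]
    i_feat = [matvec(w_i, d) for d in descriptors]
    # Z[a][b]: how strongly region a attends to region b (each row sums to 1).
    z = [
        softmax([sum(x * y for x, y in zip(v_row, g_row)) for g_row in g])
        for v_row in v
    ]
    # K[a]: attention-weighted combination of the encoded region features.
    channels = len(i_feat[0])
    k = [
        [sum(weight * i_feat[b][c] for b, weight in enumerate(z_row)) for c in range(channels)]
        for z_row in z
    ]
    return z, k


identity = [[1.0, 0.0], [0.0, 1.0]]
z, k = region_self_attention([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
```

In this toy case each region attends most strongly to itself, because its own descriptor gives the largest dot product.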

Next, the inter-region correlation enhancement map $K$ is input to the local weighting (LW) module, which feeds the region-level correlation back into the original feature map $F_{in}$ by weighting. The enhancement map $K$ is upsampled to obtain the region correlation weighting map $H$, and $H$ and $F_{in}$ are multiplied pixel by pixel to obtain the regional attention feature map $Q$. Namely:

$$H = \mathrm{Upsample}(K) \tag{6}$$

where $\mathrm{Upsample}_{scale,mode}(\cdot)$ denotes the upsampling layer with $scale = 4$ and nearest-neighbor upsampling mode, and $pad = 0$.
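The local weighting step can be sketched on a single channel: nearest-neighbor upsampling by 4 copies each region-level value over its 4 × 4 block of pixels, and the pixel-by-pixel product then scales every pixel of the original feature map by the weight of the region it belongs to. This is a simplified illustration; the two 3 × 3 convolution layers of the LW module are omitted, and the function names are assumptions.

```python
def upsample_nearest(grid, scale=4):
    """Nearest-neighbor upsampling of a 2-D list by an integer scale."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def local_weighting(enhanced, original, scale=4):
    """Weight the original feature map by the upsampled region-level map."""
    weights = upsample_nearest(enhanced, scale)
    if len(weights) != len(original) or len(weights[0]) != len(original[0]):
        raise ValueError("upsampled map and original feature map differ in size")
    return [
        [w * x for w, x in zip(w_row, x_row)]
        for w_row, x_row in zip(weights, original)
    ]


region_map = [[1, 2], [3, 4]]               # one value per 4x4 region
features = [[1] * 8 for _ in range(8)]      # original 8x8 feature map
weighted = local_weighting(region_map, features)
```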

Finally, to integrate global context semantic information with regional correlation, the multi-scale regional attention feature map $Q$ is fused with a global attention feature map; the fused result is passed through a Sigmoid activation function and multiplied pixel by pixel with the original features to obtain the multi-scale region attention feature map $F_{out}$ output by the MRA module.

Step 5, training ReA-Net, with the following specific steps:

the established high-resolution remote sensing building training data set is input into the network, and the loss is computed by forward propagation; the partial derivatives of the objective function with respect to the features are computed; and the gradients obtained by back propagation are used to update the parameters.

The invention provides a loss function for region consistency supervision, which quantitatively evaluates the error between the real pixel labels and ReA-Net's segmentation result for the remote sensing building image in terms of region consistency and edge continuity; the network loss is back-propagated through the network model to optimize the network weights iteratively. During training, the established high-resolution remote sensing building training data set is input into the network, and the trained network is optimal when the loss function value is smallest, that is, when the difference between the network's prediction for the input training image and its label is smallest. The structure of the region consistency loss is shown in FIG. 3.

The loss function for region consistency supervision, $Loss_{lc}$, is defined as:

where

representing a result set of a prediction probability graph output by the ReA-Net;

represents a set of tags, wherein

The number of training images is set for a single input.

Let $S = \{ s \mid s \le B \times R \}$ denote a finite set of lattice points defined on each training image, and let the neighborhood node set of node $s$ be given, where $B$ and $R$ are the dimensions of the set and $d$ denotes the Euclidean distance between a neighborhood node and the central node. The penalty weight term may be defined as:

then there are:

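The invention also provides a remote sensing building semantic segmentation system based on multi-scale regional attention, comprising:

a data acquisition unit for acquiring images containing remote sensing buildings and constructing a remote sensing building data set;

a network training unit for training a pre-constructed semantic segmentation network with the acquired remote sensing building data set to obtain the trained semantic segmentation network, wherein the semantic segmentation network comprises an encoder-decoder network based on regional attention and a multi-scale regional attention module;

a segmentation and extraction unit for segmenting and extracting buildings in the remote sensing image to be extracted by using the trained semantic segmentation network.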

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 5, which shows the segmentation results, semantic segmentation is affected by adhesion between buildings and by similarity between foreground and background colors. For example, in the remote sensing image of FIG. 5 (a), buildings adhere to one another and the adhering part is small relative to the buildings themselves. FIG. 5 (b), FIG. 5 (d), FIG. 5 (e) and FIG. 5 (g) show buildings whose color is similar to the background color, and the same color similarity also occurs at the adhering part in FIG. 5 (a). Illumination and shadow also significantly affect segmentation performance; FIG. 5 (c) shows building boundaries affected by illumination and shadow. FIG. 5 (f) shows the segmentation of a remote sensing image with complex foreground colors. FIG. 5 (k) mainly illustrates the segmentation of a remote sensing image containing small target buildings and contiguous buildings.

As shown in FIG. 5, the ReA-Net provided by the invention extracts the spatial detail of buildings better and, because it is constrained by spatial consistency, further reduces the inaccurate segmentation caused by building adhesion. For buildings whose color is similar to the background color, as in FIG. 5 (a), 5 (b), 5 (d), 5 (e) and 5 (g), the proposed multi-scale region attention mechanism attends accurately to building edges and region shapes that are not salient, and thus segments such images accurately. For the frequently occurring problem that building boundaries in remote sensing images are affected by illumination, shadow and complex foreground colors, as in FIG. 5 (c) and 5 (f), the algorithm attends effectively to building boundaries under these kinds of noise and thus segments the buildings correctly. As shown in FIG. 5 (k), buildings of different sizes in remote sensing building images cause small targets to be under-segmented; the proposed multi-scale regional attention and region consistency supervision allow the network to attend to the region and boundary information of small targets as well, which strengthens its ability to extract the semantic features of buildings.
In conclusion, the remote sensing building semantic segmentation method based on multi-scale regional attention provided by the invention can effectively produce high-quality segmentation under challenging conditions in complex scenes, such as building adhesion, illumination and shadow interference, complex foreground and background color interference, and small targets.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A remote sensing building semantic segmentation method based on multi-scale regional attention is characterized by comprising the following steps:

2. The remote sensing building semantic segmentation method based on multi-scale regional attention according to claim 1, wherein the regional attention-based encoder-decoder network comprises an encoder and a decoder, wherein the encoder comprises a convolution block and four residual blocks based on a residual structure; the decoder comprises four upsampling blocks, four multi-scale region attention modules and a convolution block.

3. The remote sensing building semantic segmentation method based on multi-scale regional attention according to claim 2, wherein the convolution block of the encoder comprises two convolution layers, each followed by a batch normalization layer and a leaky rectified linear unit (Leaky ReLU).

4. The remote sensing building semantic segmentation method based on multi-scale regional attention according to claim 2, wherein each residual block comprises a max pooling layer whose output is connected to two convolution layers, and the output of each convolution layer is connected in sequence to a batch normalization layer and a Leaky ReLU.

5. The remote sensing building semantic segmentation method based on multi-scale regional attention according to claim 2, wherein the four upsampling blocks and the four multi-scale region attention modules of the decoder are connected alternately, and the output of the last multi-scale region attention module is connected to a convolution block.

6. The remote sensing building semantic segmentation method based on multi-scale regional attention according to claim 2 or 5, wherein each upsampling block comprises two convolution layers, each followed by a batch normalization layer and a Leaky ReLU.

7. The remote sensing building semantic segmentation method based on multi-scale regional attention according to claim 1, wherein the multi-scale region attention module comprises a multi-scale neighborhood extraction module, a region embedding module, a self-attention module and a local weighting module.

8. The remote sensing building semantic segmentation method based on multi-scale regional attention according to claim 1, wherein the multi-scale neighborhood extraction module comprises two stages, wherein the first stage comprises four dilated convolution layers; the second stage comprises a convolution layer whose output is connected in sequence to a batch normalization layer and a Leaky ReLU;

9. The remote sensing building semantic segmentation method based on multi-scale regional attention according to claim 1, characterized in that in step 2, the pre-constructed semantic segmentation network is trained with the acquired remote sensing building data set to obtain the trained semantic segmentation network, and the specific method is as follows:

10. A remote sensing building semantic segmentation system based on multi-scale regional attention is characterized by comprising: