CN112183360B

CN112183360B - Lightweight semantic segmentation method for high-resolution remote sensing image

Info

Publication number: CN112183360B
Application number: CN202011049591.3A
Authority: CN
Inventors: 霍宏; 吕亮; 傅陈钦; 沙拉依丁·斯热吉丁; 方涛
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2022-11-08
Anticipated expiration: 2040-09-29
Also published as: CN112183360A

Abstract

A lightweight semantic segmentation method for a high-resolution remote sensing image comprises the following steps: the method comprises the steps of building, training and testing a network, wherein the network specifically builds a deep semantic segmentation network of an encoder-decoder structure for a pytoch deep learning framework, and obtains a segmentation result of the remote sensing image by taking the remote sensing image to be tested as network input after network training is carried out based on a remote sensing image data sample set. On one hand, the method reduces model parameters by decomposing depth separable convolution, reduces the calculation complexity, shortens the time of semantic segmentation of the high-resolution remote sensing image, and improves the efficiency of semantic segmentation of the high-resolution remote sensing image. On the other hand, the semantic segmentation precision is improved through multi-scale feature aggregation, a spatial attention module and gating convolution, so that the proposed lightweight deep semantic segmentation network can accurately and efficiently realize the semantic segmentation of the high-resolution remote sensing image.

Description

Lightweight semantic segmentation method for high-resolution remote sensing image

Technical Field

The invention relates to a technology in the field of remote sensing image processing, in particular to a light-weight semantic segmentation method for a high-resolution remote sensing image.

Background

With the development of aerospace technology, high-resolution remote sensing images are easier to obtain in large quantities, and the ground object boundaries in the remote sensing images are extracted by image segmentation, which is the basis for further analysis and utilization of the high-resolution remote sensing images. The traditional high-resolution remote sensing image segmentation algorithm usually realizes the extraction of the ground object boundary in the image by means of the manually designed characteristics such as texture, color and the like, but only the ground object boundary itself can be obtained, and the semantic information of the region defined by the boundary, namely the type of the ground object, cannot be obtained at the same time. In recent years, semantic segmentation based on a deep network has attracted much attention because it can extract feature boundaries and determine feature semantics at the same time. Since the complete Convolutional network (full volumetric Networks) for semantic segmentation proposed by Jonathan Long et al in 2015, a large number of semantic segmentation methods such as UNet, pspNet, deep lab series, etc. have been widely demonstrated to have advantages over the traditional remote sensing image segmentation algorithm, and have been widely used for automatic extraction of remote sensing image information.

However, when the method is used for semantic segmentation of the high-resolution remote sensing image, the high-resolution remote sensing image generally has the characteristic of large size, and the problems of slow training, low efficiency and the like often exist, so the light-weight semantic segmentation method is designed for semantic segmentation of the high-resolution remote sensing image, and the semantic segmentation efficiency of the high-resolution remote sensing image is greatly improved while the segmentation precision is ensured.

Disclosure of Invention

The invention provides a light-weight semantic segmentation method for a high-resolution remote sensing image, which solves the problem that the operation efficiency of a segmentation algorithm is low due to the quantity and the calculated amount of a large-amplitude high-resolution remote sensing image in the existing semantic segmentation network. On the other hand, the semantic segmentation precision is improved through multi-scale feature aggregation, a spatial attention module and gate control convolution, so that the proposed lightweight depth network can accurately and efficiently realize the semantic segmentation of the high-resolution remote sensing image.

The invention is realized by the following technical scheme:

the invention relates to a lightweight semantic segmentation method for a high-resolution remote sensing image, which comprises the following steps: the method comprises the steps of building, training and testing a network, wherein the network specifically constructs a deep semantic segmentation network of an encoder-decoder structure for a pytorch deep learning framework, and after network training is carried out based on a remote sensing image data sample set, a remote sensing image to be detected is used as network input to obtain a segmentation result of the remote sensing image.

The encoder is constructed by utilizing a multi-scale feature fusion and attention mechanism technology and comprises two sub-networks with the same structure and an attention module for capturing context information of a feature map, wherein: the image data is input into a first sub-network of the two sub-networks, the low-level feature map output by the first sub-network is subjected to up-sampling by 4 times and then fused with the first-level feature map to serve as the input of a second sub-network, the input of each level of feature map of the second sub-network is fused with the feature map of the first sub-network with the same scale, namely the first-level feature map is formed by fusing the output of the first sub-network and the first-level feature map, the high-level feature map output by the second sub-network is input into the spatial attention module, and the output result of the spatial attention module is input into the decoder.

Said first and second sub-networks each comprising: and each feature extraction layer consists of a downsampling layer and four decomposition depth separable convolution residual blocks.

The down-sampling layer is composed of a convolution kernel with the size of 1 multiplied by 1, a convolution layer with the step length of 2, a batch normalization layer and a Relu activation layer.

The decomposition depth separable convolution residual block extracts features through two groups of decomposition depth separable convolution kernels of 3 x 1 and 1 x 3, and residual connection is added in order to reduce gradient dispersion and facilitate network training. When the number of input feature map channels is c _in Using c _out The convolution kernels perform convolution operations. The standard 3 × 3 convolution kernel performs convolution operation on all channels, and the parameter number is 3 × 3 × c _in ×c _out . Decomposition ofConvolution decomposes a standard 3 × 3 convolution kernel into 3 × 1 and 1 × 3 convolution kernels with a parameter quantity of 2 × 3 × c _in ×c _out . The depth separable convolution includes:

i) And (3) longitudinal convolution: each 3 × 3 convolution kernel is only responsible for performing convolution operation on one channel, and the parameter number of the longitudinal convolution is 3 × 3 × c _in ；

ii) point convolution: performing convolution operation by using 1 × 1 convolution kernel to realize information interaction between channels, wherein the parameter number of point convolution is 1 × 1 × c _in ×c _out 。

The total parameter for the depth separable convolution is therefore 3 × 3 × c _in +c _in ×c _out . The method combines the decomposition convolution and the depth separable convolution and provides a decomposition depth separable convolution kernel with the parameter number of 2 multiplied by 3 multiplied by c _in In contrast, the decomposition depth separable convolution kernel effectively reduces the parameter quantity and the calculated quantity, but because each channel is independently subjected to convolution operation, information interaction between the channels is lacked, and finally, the technology of channel random shuffling is introduced to improve the information interaction between the channels and improve the network performance.

The calculation process of the space attention module is as follows:

wherein: theta (X),

And

are all new feature maps generated by 1X 1 convolution from the input feature map, theta (X),

The multiplication result is fed into the softmax layer to obtain a spatial correlation coefficient matrix S,

the convolution with δ being 1 × 1 to recover the channel number by restoring to the size of the input feature map, and the final result Y is retainedGlobal context information.

The decoder comprises: three gated convolution modules for fusing high-level features and low-level features and four upsampling units, wherein: the input of the decoder is from the output of the encoder, each level of gated convolution module receives a low-level feature map from a first sub-network with the same scale, the gated convolution module performs refinement processing on the low-level features, information is input to each level of up-sampling units, the input of the first level of up-sampling units is from the output of the decoder, the other up-sampling units receive the feature map obtained by fusing the gated convolution and the high-level features, then double-bilinear interpolation processing is performed, the feature map obtained after interpolation is output to the next level of up-sampling units, and a semantic segmentation result map is output until the original image size is recovered.

The up-sampling unit comprises: the method comprises the steps of 1 multiplied by 1 convolution layer, batch normalization layer, activation layer and double-time bilinear interpolation layer, wherein a feature map is input into an up-sampling unit for decoding and carrying out double-time bilinear interpolation to obtain a feature map with improved resolution as the input of the next up-sampling unit.

The calculation process of the gate control convolution module is as follows:

wherein: x is an input feature map, i, j represents the position of each pixel, sigma is a sigmoid function, the weight of each pixel is set to be 0-1, and the edge part of the ground object in the image and the small-size ground object can obtain higher weight through learning, so that the semantic segmentation precision can be improved.

The training of the lightweight semantic segmentation network comprises the following steps of:

a1, dividing a remote sensing image data sample set into a training set, a verification set and a test set.

And A2, reading in the remote sensing images and the corresponding label images of the training set, randomly sampling the original large-format remote sensing images and the label data in order to fully utilize the training sample set, setting the sampling frequency of each round of training, setting sampling size parameters according to the size of a display memory, and simultaneously sampling the random positions of the remote sensing images and the label images.

The input sampling size parameter is the size of the image to be cut.

And A3, setting data enhancement parameters, and performing the same data enhancement on the remote sensing image and the corresponding label image.

The data enhancement parameters comprise an image rotation angle, an image turning angle, a brightness enhancement coefficient, a contrast enhancement coefficient, a chrominance enhancement coefficient and a scaling coefficient.

And A4, setting a learning rate, an exponential decay rate and a regularization coefficient to train a deep network, and selecting the deep network with the highest precision of a verification set as a lightweight semantic segmentation network obtained by training.

Technical effects

The invention integrally solves the problem that the existing semantic segmentation network faces the low operation efficiency of the segmentation algorithm caused by the parameter quantity and the calculated quantity of the large-amplitude high-resolution remote sensing image.

Compared with the prior art, the lightweight semantic segmentation network built by the decomposition depth separable convolution residual blocks reduces the parameter amount and the calculated amount, and greatly improves the operation speed of semantic segmentation. According to the invention, a multi-scale feature aggregation technology is adopted in an encoder and a decoder, low-level and high-level feature maps are aggregated to encode and decode multi-scale ground features in a high-resolution remote sensing image, a spatial attention module is combined to capture context information, and gated convolution is used to emphatically learn the edges of the ground features and small-size ground features when the low-level feature maps are aggregated, so that the semantic segmentation precision is improved. Therefore, the method can keep higher semantic segmentation precision, improves the efficiency of a semantic segmentation algorithm, and is an effective solution for performing high-resolution remote sensing image semantic segmentation.

Drawings

FIG. 1 is a flow chart of the method;

FIG. 2 is an exemplary diagram of a lightweight semantic segmentation network according to an embodiment;

FIG. 3 decomposes an example diagram of a depth separable residual convolution block;

FIG. 4 is a schematic diagram of an embodiment of a semantic segmentation dataset of a remote sensing image;

FIG. 5 is a diagram illustrating a comparison of semantic segmentation results of an embodiment;

in the figure: columns 1-5 are: original graph, label graph, DFANet prediction result graph, ENet prediction result graph and network prediction result graph of the embodiment.

Detailed Description

As shown in fig. 1, the method for lightweight semantic segmentation based on high-resolution remote sensing images according to the present invention includes the following steps:

step A, dividing a remote sensing image sample data set into a training set, a verification set and a test set according to the proportion of 0.5.

Step B, a deep semantic segmentation network is built and trained by using a pytoreh deep learning framework, original large-amplitude high-resolution remote sensing images and label data are read into a memory, in order to improve the data utilization rate, large-amplitude remote sensing images are randomly sampled into small graphs for batch training, the sampling frequency of each round of training is set to be 450, the sampling size and the training batch size are set according to the size of a display memory, the default size of an input image is 512 multiplied by 512, the batch size is defaulted to be 10, the original large-amplitude remote sensing image data and corresponding label data are randomly sampled, the remote sensing images and corresponding label graphs with the sizes of 512 multiplied by 512 are obtained in each round of sampling, training samples of each round are obtained after multiple sampling, a training sample enhancement parameter range is set, the random contrast is enhanced by 0.5 times to 1.5 times, the random saturation is enhanced by 0.5 times to 1.5 times, the random brightness is enhanced by 0.5 times to 1.5 times, the random scaling is enhanced by 0.5 times to 1.5 times, the semantic segmentation capability of the random sample and the maximum depth segmentation capability of the random sample and the training data of each round of the random depth segmentation is improved. And after each iteration, verifying the precision of the deep semantic segmentation network by using a verification data set to obtain the deep semantic segmentation network with the highest precision. And inputting the high-resolution remote sensing image in the test set into the obtained depth semantic segmentation network to obtain the semantic segmentation result of the image.

As shown in fig. 2, the lightweight semantic segmentation network has an encoder-decoder structure.

The encoder is constructed by utilizing a multi-scale feature fusion and attention mechanism technology, and comprises two sub-networks with the same structure and an attention module for capturing context information of a feature map, wherein the two sub-networks are respectively a first sub-network and a second sub-network, and the method comprises the following steps of: the data is firstly input into a first sub-network, the output result is up-sampled by 4 times and then fused with the feature map of the first layer to be used as the input of a second sub-network, the input of each level of feature map of the second sub-network is fused with the feature map of the first sub-network with the same scale, the output result of the second sub-network is input into a spatial attention module, and the output result of the spatial attention module is input into a decoder.

The sub-network comprises three feature extraction layers. Each feature extraction layer consists of one downsampled layer and four decomposition depth separable convolution residual blocks. The down-sampling layer is composed of a convolution layer with convolution kernel size of 1 multiplied by 1 and step length of 2, a batch normalization layer and a Relu activation function. The separable convolution residual block of the decomposition depth uses separable convolution kernels of the decomposition depth of 3 x 1 and 1 x 3 to extract features, in order to reduce gradient dispersion and facilitate network training, residual connection is used, and finally the technology of channel random mixing is introduced to improve information exchange among channels and improve network performance, wherein the sequence of the separable convolution residual block of the decomposition depth is as follows: 1 × 3 depth separable convolution- > Relu activation function- >3 × 1 depth separable convolution- > Relu activation function + batch normalization- >1 × 1 convolution- >1 × 3 depth separable convolution- > Relu activation function- >3 × 1 depth separable convolution- > Relu activation function + batch normalization- > residual concatenation- > channel random shuffle.

In order to keep a sufficiently large receptive field, the output of the first sub-network is up-sampled by 4 times and then used as the input of the second sub-network, and features with the same scale between the two sub-networks are fused at the same time so as to learn high-dimensional structure information of different ground object targets, and low-level and high-level features are aggregated to encode the ground object targets with different scales in the remote sensing image. The output of the second sub-network is sent to the spatial attention module.

The spatial attention module is used for capturing context information of the feature map, focusing on the most important part for semantic segmentation in the feature map, and suppressing useless information so as to improve segmentation performance. The input of the space attention module is from the feature map extracted by the sub-network, and the input is passed through theta (X),

And

three 1X 1 convolutional layers yield three new feature maps, θ (X),

convolution is carried out to recover the channel number by recovering the dimension of the input feature diagram and delta is 1 multiplied by 1, and the whole process is as follows:

the resulting output Y retains the global context information and is input to the decoder.

The decoder comprises: three gated convolution modules for fusing high-level features with low-level features and four upsampling units, wherein: the input of the decoder is from the output of the encoder, each level of gated convolution module receives a low-level feature map from a first sub-network with the same scale, the gated convolution module performs thinning processing on the low-level features and then aggregates the low-level features with deep features, the feature maps obtained after aggregation are input into each level of up-sampling units, and finally, the feature maps are restored to the original image size and then semantic segmentation result maps are output.

In order to retain the detail information lost in the continuous down-sampling process of the deep feature map and prevent the over-segmentation easily caused by redundant information contained in the low-level features, a gated convolution module is used for dynamically selecting weight values of the features to refine the low-level feature map in the fusion process, the input of the gated convolution is from the low-level feature map of a first sub-network with the same scale, the refined feature map and the deep feature map are aggregated and then input into an upper sampling unit, and the whole process is as follows:

wherein: x is an input feature map, i, j represents the position of each pixel, sigma is a sigmoid function, the weight of the pixel is set between 0 and 1, and the whole module gives higher weight to the edge of the target and the small-size target through automatic learning.

The calculation sequence of the up-sampling unit is as follows: 1 × 1 convolutional layer, batch layer, active layer, and two-fold bilinear interpolation layer, where: the input of the first up-sampling unit is from the output of the decoder, the other up-sampling units receive the feature map after the gate control convolution and the high-level feature fusion, bilinear interpolation processing is carried out, the feature map after interpolation is output to the next-level up-sampling unit, and finally the segmentation result map with the same size as the original image is output.

In this embodiment, it is preferable to adopt a data set Potsdam of a remote sensing image 2D semantic segmentation competition of the international photogrammetry and remote sensing society, where the data set is an aerial image, each image has 3 bands of red, green, and blue, and surface features are divided into six categories, including an impermeable ground surface, buildings, low and short vegetation, trees, vehicles, and sundries, and have a truth diagram of labeling semantics pixel by pixel, and the truth diagram can be used for precision evaluation of semantic segmentation results, and is shown in fig. 3: the non-permeable ground surface is white (RGB value: 255, 255, 255), the buildings are blue (RGB value: 0, 255), the short vegetation is bright blue (RGB value: 0, 255, 255), the trees are green (RGB value: 0, 255, 0), the vehicles are yellow (RGB value: 255, 255, 0), and the sundries are red (RGB value: 255, 0). The semantic segmentation precision adopts total pixel precision and average F ₁ And value evaluation, namely evaluating the semantic segmentation efficiency by adopting the model parameter quantity and the model prediction time.

The truth diagram of the semantics marked pixel by pixel is taken as reference, 14 test set remote sensing images of 6000 × 6000 are utilized, and the method is compared with four semantic segmentation methods in two aspects of precision and efficiency, wherein the four semantic segmentation methods comprise ENet, DFANet, PSPNet and deep LabV3+. Average F with overall pixel precision ₁ The value serves as a criterion for measuring the accuracy of semantic segmentation. Overall pixel precision and average F ₁ The higher the value of (A), the closer the semantic segmentation result is to the truth map, and the higher the semantic segmentation precision is. The prediction time (unit: second, s) of a remote sensing image with the size of 512 multiplied by 512 is used as a standard for measuring semantic segmentation efficiency by using the model parameter quantity (unit: mega, M), and the smaller the parameter quantity, the shorter the prediction time and the higher the efficiency. The results of the different semantic segmentation methods are shown in table 1:

TABLE 1 comparison of the present Process with the existing Process

Method	Overall pixel accuracy	Average F ₁	Time(s)	Reference quantity (M)
					ENet	82.2％	82.8％	0.26	0.36
DFANet	83.3％	83.6％	0.23	7.8
					PSPNet	86.9％	87.6％	1.03	48.7
DeepLabV3+	88.1％	89.1％	4.13	56.7
					Method for producing a composite material	86.7％	87.2％	0.32	1.29

As can be seen from the table, the method achieves an overall pixel accuracy of 86.7% and an average F of 87.2% ₁ The value, predicted time, was 0.32s and parameter number was 1.29M. Although the ENet parameters are the least and the DFANet prediction time is the shortest, the pixel population precision (82.2%, 83.3%) and average F of ENet and DFANet are ₁ The values (82.8%, 83.6%) are much lower than in this example. Despite the overall pixel accuracy and average F of the method ₁ The values were slightly lower than DeepLanV3+ (88.1%, 89.1%) and PSPNet (86.9%, 87.6%). However, the parameter sizes of DeepLabV3+ and PSPNet (56.7M, 48.7M) were almost ten times as large as those of the method (1.29M) and the predicted time (4.13s, 1.03s) was several times as large as that of the method (0.32 s), so the integration was carried outThe method is superior to other deep semantic segmentation networks in semantic segmentation precision and efficiency.

From the visual effect, as shown in fig. 4, the present embodiment can accurately extract various feature boundaries and determine the semantics thereof, and can effectively reduce the false extraction compared with DFANet and ENet, and is closer to a true value map.

According to specific practical experiments, a model is trained and built under a pytoch deep learning framework, the learning rate is set to be 0.0001, the iteration times are 1500, the exponential decay rate is (0.9, 0.99), the regularization coefficient is 0.0002, the loss function is a cross entropy loss function, the sampling frequency of each round of training is set to be 450, the size of an input image is 512 x 512, the batch size is 10, the data enhancement parameter range comprises random rotation n x 90 degrees (n =0,1,2, 3), random horizontal direction and vertical direction 180 degrees are turned over, the random scale is scaled by 0.5 to 1.5 times, the random brightness is enhanced by 0.5 to 1.5 times, the random contrast is enhanced by 0.5 to 1.5 times, the random saturation is enhanced by 0.5 to 1.5 times, the learning rate is set to be 0.0001, and on a Potsdam data set of 2D semantic segmentation of remote sensing images of the international photography measurement and remote sensing society, the overall pixel accuracy of 86.7% and the average F2% of the remote sensing society are obtained ₁ The value is that only 0.32s is needed for dividing one 512X 512 remote sensing image, and the parameter number is only 1.29M.

Compared with the prior art, the method obtains 86.7% of total pixel precision and 87.2% of average F in the data set of the 2D semantic segmentation competition of the remote sensing image of the international photogrammetry and remote sensing society ₁ However, the prediction time is only 0.32s, the parameter number is only 1.29M, the prediction speed is faster compared with the smaller parameter numbers of DeepLabV3+ and PSPNet, the precision is higher compared with ENet and DFANet, and the balance between the semantic segmentation precision and the efficiency is better considered.

The foregoing embodiments may be modified in many different ways by one skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and not by the preceding embodiments, and all embodiments within their scope are intended to be limited by the scope of the invention.

Claims

1. A high-resolution remote sensing image-oriented lightweight semantic segmentation method is characterized by comprising the following steps: building, training and testing a network, wherein the network specifically constructs a deep semantic segmentation network of an encoder-decoder structure for a pytorch deep learning framework, and after network training is carried out based on a remote sensing image data sample set, a remote sensing image to be tested is used as network input to obtain a segmentation result of the remote sensing image;

the encoder is constructed by utilizing a multi-scale feature fusion and attention mechanism technology and comprises two sub-networks with the same structure and an attention module for capturing context information of a feature map, wherein: the image data is input into a first sub-network of the two sub-networks, a low-level feature map output by the first sub-network is fused with the first-level feature map after being up-sampled and serves as the input of a second sub-network, the input of each level of feature map of the second sub-network is fused with the feature map of the first sub-network with the same scale, a high-level feature map output by the second sub-network is input into the spatial attention module, and the output result of the spatial attention module is input into the decoder;

said first and second sub-networks each comprising: three feature extraction layers, wherein each feature extraction layer consists of a downsampling layer and four separable convolution residual blocks with resolution depths;

the down-sampling layer consists of a convolution kernel with the size of 1 multiplied by 1, a convolution layer with the step length of 2, a batch normalization layer and a Relu activation layer;

the decomposition depth separable convolution residual block extracts features through two groups of decomposition depth separable convolution kernels of 3 x 1 and 1 x 3, and residual connection is added for reducing gradient dispersion and facilitating network training;

when the number of input feature map channels is c _in Using c _out Carrying out convolution operation on each convolution kernel;

the standard 3 × 3 convolution kernel performs convolution operation on all channels, and the parameter number is 3 × 3 × c _in ×c _out ；

Deconvolution the standard 3 × 3 convolution kernel is decomposed into 3 × 1 and 1 × 3 convolution kernels with a parameter quantity of 2 × 3 × c _in ×c _out ；

The depth separable convolution comprises:

ii) point convolution: performing convolution operation by using 1 × 1 convolution kernel to realize information interaction between channels, wherein the parameter number of point convolution is 1 × 1 × c _in ×c _out ；

The total parameter quantity of the depth separable convolution is 3 multiplied by c _in +c _in ×c _out ；

The calculation process of the space attention module is as follows:

wherein: theta (X),

And with

convolution is carried out to recover the number of channels by recovering the size of the input feature diagram and the delta is 1 multiplied by 1, and the obtained final result Y keeps the global context information;

the decoder comprises: three gated convolution modules for fusing high-level features and low-level features and four upsampling units, wherein: the input of the decoder is from the output of the encoder, each level of gated convolution module receives a low-level feature map from a first sub-network in the same scale, the gated convolution module carries out refinement processing on the low-level features, information is input to each level of up-sampling unit, the input of the first level of up-sampling unit is from the output of the decoder, the other up-sampling units receive the feature map formed by fusing the gated convolution and the high-level features, then double-linear interpolation processing is carried out, the feature map after interpolation is output to the next level of up-sampling unit, and a semantic segmentation result map is output until the original image size is recovered;

the up-sampling unit comprises: the method comprises the following steps that 1 x 1 convolutional layers, batch normalization layers, active layers and two-time bilinear interpolation layers are input into an up-sampling unit to be decoded, two-time bilinear interpolation is carried out, and a feature diagram with improved resolution is obtained and is used as the input of a next up-sampling unit;

the calculation process of the gated convolution module is as follows:

wherein: x is an input feature graph, i and j represent the position of each pixel, sigma is a sigmoid function, the weight of each pixel is set to be between 0 and 1, and through learning, the edge part of a ground object in an image and a small-size ground object obtain higher weight, which is beneficial to improving the precision of semantic segmentation;

the network training comprises the following steps:

a1, dividing a remote sensing image data sample set into a training set, a verification set and a test set;

a2, reading in a remote sensing image and a corresponding label image of a training set, randomly sampling an original large-amplitude remote sensing image and label data in order to fully utilize a training sample set, setting sampling frequency of each round of training, setting sampling size parameters according to the size of a video memory, and simultaneously sampling the remote sensing image and the label image at random positions;

a3, setting data enhancement parameters, and performing the same data enhancement on the remote sensing image and the corresponding label image;

a4, setting a learning rate, an exponential decay rate and a regularization coefficient to train a depth network, and selecting the depth network with the highest precision of a verification set for semantic segmentation of the high-resolution remote sensing image to be detected;

the data enhancement parameters comprise an image rotation angle, an image turning angle, a brightness enhancement coefficient, a contrast enhancement coefficient, a chroma enhancement coefficient and a scaling coefficient.