CN113159051B - Remote sensing image lightweight semantic segmentation method based on edge decoupling - Google Patents


Info

Publication number
CN113159051B
CN113159051B (application CN202110456921.9A)
Authority
CN
China
Prior art keywords
edge
semantic segmentation
feature
feature map
remote sensing
Prior art date
Legal status
Active
Application number
CN202110456921.9A
Other languages
Chinese (zh)
Other versions
CN113159051A (en)
Inventor
段锦
刘高天
祝勇
赵言
***
Current Assignee
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202110456921.9A priority Critical patent/CN113159051B/en
Publication of CN113159051A publication Critical patent/CN113159051A/en
Application granted granted Critical
Publication of CN113159051B publication Critical patent/CN113159051B/en
Legal status: Active

Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253 Pattern recognition: fusion techniques of extracted features
    • G06T7/13 Image analysis: segmentation; edge detection
    • G06V10/267 Image or video recognition: segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight semantic segmentation method for remote sensing images based on edge decoupling. It belongs to the field of computer vision and can be used for intelligent interpretation of remote sensing images. On one hand, the Ghost bottleneck module and depthwise separable convolutions reduce the number of model parameters and the computational overhead of the network, effectively improving the efficiency of remote sensing image semantic segmentation and making the proposed semantic segmentation network lightweight; on the other hand, the multi-scale feature pyramid, the global context module and the edge decoupling module improve segmentation precision, so that the proposed lightweight network can segment remote sensing images accurately and efficiently while further refining their edge details.

Description

Remote sensing image lightweight semantic segmentation method based on edge decoupling
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a light-weight semantic segmentation method for a remote sensing image based on edge decoupling, which can be used for intelligent interpretation in the field of remote sensing images.
Background
The high-resolution remote sensing image contains information such as detailed color and texture characteristics of targets such as roads and buildings, and the intelligent interpretation of the information has important significance in various fields such as military, agriculture and environmental science. To accomplish the task of analytically classifying the remote sensing image, each pixel in the image should be assigned a label associated with the class to which it belongs, which is consistent with the purpose of semantic segmentation of the image.
Deep learning has given this task a promising direction; in particular, since the fully convolutional network was proposed, deep-learning-based image semantic segmentation has become mainstream. Methods such as UNet, SegNet, PSPNet and the DeepLab series have appeared and offer clear advantages over traditional remote sensing image segmentation algorithms. When these algorithms are applied to semantic segmentation of high-resolution remote sensing images, they can guarantee a relatively good segmentation effect, but the large image size and complex network structures make training and inference slow. In addition, remote sensing images contain highly diverse targets with unbalanced class distributions, and the edges of different target classes easily overlap, so such images cannot be finely segmented.
Disclosure of Invention
The invention aims to provide a lightweight semantic segmentation method for remote sensing images based on edge decoupling, solving two technical problems of existing semantic segmentation methods on high-resolution remote sensing images: slow inference and low segmentation efficiency caused by large parameter counts and computational overhead, and unsatisfactory edge segmentation caused by the easily overlapping edges of different classes of targets.
In order to achieve the purpose, the invention provides a remote sensing image light-weight semantic segmentation method based on edge decoupling, which has the following specific technical scheme:
a remote sensing image lightweight semantic segmentation method based on edge decoupling comprises the steps of building, training and testing a semantic segmentation network, wherein the semantic segmentation network is a lightweight coding and decoding network with a double-branch structure, after training of the semantic segmentation network is completed based on training samples, a remote sensing image to be tested is input into the semantic segmentation network, and a final remote sensing image semantic segmentation result is output;
comprises the following steps which are carried out in sequence:
step 1, acquiring a remote sensing image data set, and preparing a training and testing sample;
step 2, constructing a lightweight coding and decoding semantic segmentation network with a double-branch structure;
step 3, inputting the training samples into an encoder, performing feature encoding through feature extraction, and obtaining an encoded feature map F_E;
step 4, inputting the encoded feature map F_E into a decoder to perform edge feature refinement and up-sampling, obtaining a decoded feature map F_D;
Step 5, inputting the decoding characteristic graph into a classifier to perform pixel-level classification prediction, outputting a segmentation result, and performing supervised training on the semantic segmentation network through a supervision mechanism;
step 6, training the semantic segmentation network built in step 2 with the training samples according to steps 3 to 5;
and 7, inputting the sample to be tested into the trained semantic segmentation network, outputting a final remote sensing image semantic segmentation result, and completing the test of the semantic segmentation network.
Further, the step 2 builds a lightweight coding and decoding semantic segmentation network with a double-branch structure, and the network comprises an encoder, a decoder and a classifier;
the encoder is of a double-branch structure and comprises a global downsampling block, a lightweight double-branch sub-network and a global feature fusion module;
the decoder consists of a lightweight edge decoupling module and an up-sampling module;
the classifier is composed of a conventional convolutional layer and a SoftMax layer.
Further, obtaining the encoded feature map F_E in step 3 comprises the following steps:
step 3.1, inputting the training sample into a global downsampling block of an encoder to obtain a low-level feature map;
step 3.2, inputting the low-level feature map into a lightweight double-branch sub-network in an encoder to obtain a space detail feature map and an abstract semantic feature map;
step 3.3, performing multi-level feature fusion on the obtained spatial detail feature map and abstract semantic feature map through the global feature fusion block of the encoder, and outputting the encoded feature map F_E.
Further, the global downsampling block in step 3.1 is composed of 3 parts: 1 conventional convolution, 1 Ghost bottleneck module and 1 global context module;
after the input sample passes through the global downsampling block, a low-level feature map with the output resolution of 1/4 of the original input is generated and used as the input of the subsequent process.
Further, the lightweight dual-branch sub-network in step 3.2 comprises two branches, namely a trunk deep branch for obtaining abstract semantic features and a spatial preservation branch for obtaining spatial detail features, and the two branches share the low-level feature map output by the global downsampling block;
the trunk deep branch is constructed based on the GhostNet feature extraction network and comprises two structures: one is the branch main body composed of 16 Ghost bottleneck modules, which performs 4 downsampling steps to extract deep features; the other is a lightweight feature pyramid composed of four parts, namely depthwise separable convolution, an up-sampling block, a lightweight atrous spatial pyramid pooling (ASPP) module and element fusion, which takes the 4 deep feature maps of different scales formed by the main body as input and finally outputs abstract semantic features with an enlarged receptive field and multi-scale information;
the spatial preservation branch is composed of 3 depthwise separable convolutions, performs downsampling 1 time on the input low-level features, and outputs a spatial detail feature map whose resolution is 1/2 of the input.
Further, the global feature fusion module of step 3.3 comprises 3 parts: two parallel depthwise separable convolutions with 1 × 1 kernels; element fusion; and a global context module;
the input abstract semantic features and spatial detail features are dimension-adjusted by the two parallel convolutions, element fusion outputs a feature map rich in spatial detail and abstract semantic information, and finally lightweight context modeling is performed by the global context module to form the encoded feature map F_E, which better fuses global information.
Further, in step 4 the decoded feature map F_D is obtained as follows: first, the encoded feature map F_E is input into the lightweight edge decoupling module of the decoder for edge feature refinement, generating a fine feature map with refined edges; the fine feature map is then input into the up-sampling module of the decoder, up-sampled, and restored to the size of the original input remote sensing image, and the restored fine feature map serves as the decoded feature map F_D output by the decoder.
Further, the lightweight edge decoupling module consists of 3 parts, namely a lightweight atrous spatial pyramid pooling (ASPP) module, a main body feature generator and an edge preserver. First, the encoded features pass through the lightweight ASPP to generate a feature map F_aspp with multi-scale information and a larger receptive field; then the main body generator produces a more consistent feature representation for pixels within the same object, forming the main body feature map F_body of the target object; F_body, F_aspp and F_E are input into the edge preserver, which outputs a refined edge feature map F_edge through an explicit subtraction operation, channel-stacking fusion and 1 × 1 conventional convolution dimensionality reduction; finally, the main body feature map and the refined edge feature map are fused, and the refined output feature map used for up-sampling recovery is output and denoted F_final. The overall process can be expressed as:

F_aspp = f_dsaspp(F_E)
F_body = φ(F_aspp)
F_edge = ψ(F_body, F_aspp, F_E)
F_final = F_body + F_edge

where f_dsaspp denotes the lightweight atrous spatial pyramid pooling function, φ the main body feature generating function, and ψ the edge-preserving function;
the up-sampling module comprises two steps, a 1 × 1 conventional convolution operation and an up-sampling operation; after the fine feature map F_final is output by this module, it is restored to the size of the original input remote sensing image, i.e., the decoder output feature map F_D.
Further, regarding the supervision mechanism in step 5: after the decoded feature map F_D is processed by the classifier, pixel-level classification prediction is completed and the semantic segmentation result is output; the network is trained under supervision formed by the semantic segmentation result and the real labels, so that the semantic segmentation network achieves optimal segmentation performance.
Further, the supervision mechanism in step 5 is an edge-based supervision method implemented by a designed loss function; the total loss function, denoted L, is:

L = λ1·L_body + λ2·L_edge + λ3·L_final + λ4·L_G

where L_body, L_edge, L_final and L_G denote the main body feature loss, edge feature loss, fine feature loss and global encoding loss respectively; the inputs of these 4 loss functions are the segmentation results formed from the main body feature map, the refined edge feature map, the refined output feature map and the encoded feature map after up-sampling recovery and a SoftMax layer, together with the corresponding real labels;
the loss function L_edge is a comprehensive loss function that obtains a boundary edge prior based on the edge prediction part and comprises two terms: the binary cross-entropy loss L_bce for boundary pixel classification and the cross-entropy loss L_ce of the edge parts in the scene, i.e. L_edge = λ5·L_bce + λ6·L_ce; the hyperparameters λ1, λ2, λ3, λ4, λ5, λ6 control the weighting between the losses.
The method of the invention has the following advantages: it fully considers the total parameter count and total computation of the semantic segmentation network and the influence of a large number of redundant features on segmentation efficiency and accuracy, and fully exploits the relation between a target's body and its edge to refine the segmentation result;
firstly, the invention combines the idea of feature sharing and designs a global down-sampling block based on the global context module and the Ghost bottleneck module as the first part of the encoder in the semantic segmentation network, which effectively reduces the parameter scale of early low-level feature extraction, lowers the computation cost, and better fuses global context information into the low-level features.
Secondly, the invention combines a dual-branch structure with a global feature fusion mode based on the global context module. A lightweight dual-branch sub-network is first built from the Ghost bottleneck module and depthwise separable convolutions, which markedly reduces the parameter scale and computational complexity of the feature extraction stage while keeping the output encoded features rich in spatial detail and abstract semantic information. The outputs of the two branches are then fused through global feature fusion based on the global context module, so that the finally output encoded features deepen the understanding of global information and reduce the network's loss of weak feature information.
Thirdly, the lightweight edge decoupling module is built with depthwise separable convolutions, and the relation between an object's body and its edge is introduced by modeling the main body and edge of the target object; this effectively alleviates the coarse edge segmentation of existing remote sensing image semantic segmentation algorithms and improves the segmentation of edge details in remote sensing images.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a semantic segmentation network structure constructed by the method of the present invention.
Fig. 3 is a schematic structural diagram of a Ghost bottleneck module.
FIG. 4 is a block diagram of a global context module.
Fig. 5 is a schematic diagram of a lightweight feature pyramid structure in a trunk feature extraction branch.
Fig. 6 is a schematic structural view of a lightweight edge decoupling module.
Fig. 7 is an exemplary diagram of a remote sensing image in a data set and corresponding semantic tags.
FIG. 8 is a comparison of semantic segmentation results in an embodiment of the method of the present invention ((a) and (b) are an input sample and the corresponding label; (c) to (g) are the semantic segmentation results of Fast-SCNN, Sem-FPN, the method of the invention, UNet and PSPNet in sequence).
Detailed Description
In order to better understand the purpose, structure and function of the invention, the following describes a remote sensing image lightweight semantic segmentation method based on edge decoupling in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention designs a remote sensing image lightweight semantic segmentation method based on edge decoupling, which is applied to a high-resolution remote sensing image, so that the edge segmentation effect is refined while the precision is ensured, and the segmentation efficiency is greatly improved.
As shown in fig. 1, the invention relates to a remote sensing image light-weight semantic segmentation method based on edge decoupling, which specifically comprises the following steps:
step 1, acquiring a remote sensing image data set, and preparing a training and testing sample;
firstly, a high-resolution remote sensing image dataset with semantic annotations is obtained, and the labels and data are cropped correspondingly: sliding-window cropping is performed with a fixed 512 × 512 window and a sliding step of 384 pixels (a coverage ratio of 0.75). The cropped data and labels stay in correspondence, and data augmentation is performed by rotation, color enhancement and similar means; sufficient samples effectively weaken the influence of overfitting. Finally, the training and test sets are divided in a 4:1 ratio;
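As a concrete illustration, the cropping scheme can be sketched as follows (a minimal NumPy sketch under the stated window and stride; the function and variable names are ours, and handling of the border remainder is omitted):

```python
import numpy as np

def sliding_window_crops(image, label, window=512, stride=384):
    """Crop aligned (image, label) patches with a fixed window and stride.

    A stride of 384 over a 512-pixel window corresponds to the coverage
    ratio of 0.75 described above (each step advances 75% of the window).
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, max(h - window, 0) + 1, stride):
        for left in range(0, max(w - window, 0) + 1, stride):
            img_patch = image[top:top + window, left:left + window]
            lbl_patch = label[top:top + window, left:left + window]
            patches.append((img_patch, lbl_patch))
    return patches

# Example: one 6000 x 6000 tile yields a regular grid of 512 x 512 samples.
image = np.zeros((6000, 6000, 3), dtype=np.uint8)
label = np.zeros((6000, 6000), dtype=np.uint8)
samples = sliding_window_crops(image, label)
```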
step 2, building a lightweight coding and decoding semantic segmentation network with a double-branch structure;
the structure of the constructed semantic segmentation network is shown in FIG. 2. The network is a lightweight coding and decoding semantic segmentation network with a double-branch structure and comprises an encoder, a decoder and a classifier. The encoder is of a double-branch structure and comprises a global downsampling block, a lightweight double-branch sub-network and a global feature fusion module; the decoder is composed of a lightweight edge decoupling module and an up-sampling module. The classifier is composed of a conventional convolution layer and a SoftMax layer;
step 3, inputting the training samples into the encoder and performing feature extraction and encoding to obtain the encoded feature map F_E; this involves the following three substeps:
step 3.1, inputting the training sample into a global downsampling block of an encoder to obtain a low-level feature map;
the obtained training samples, with an input scale of 512 × 512, are input into the encoder of the network and first pass through the global downsampling block. This block consists of 3 parts: 1 conventional convolution, 1 Ghost bottleneck module and 1 global context module. The low-level feature map output by the global downsampling block better fuses global context information and also contains rich spatial detail information;
the conventional convolution is a convolution block with a 3 × 3 kernel and a stride of 2, followed by a batch normalization layer and a ReLU activation layer; after this first down-sampling, the training sample yields a feature map with a resolution of 256 × 256;
the Ghost bottleneck module is a lightweight module originating from the GhostNet network. It is composed of Ghost modules, which can generate feature maps of greater depth with fewer parameters; its structure is shown in FIG. 3. The module's structure depends on the stride: with a stride of 1 it contains two stacked Ghost modules, and with a stride of 2 a channel-by-channel (depthwise) convolution with stride 2 is inserted between the two Ghost modules. In the global downsampling block this module uses stride 2, implementing the second down-sampling and further reducing the feature map resolution to 128 × 128;
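For reference, a minimal PyTorch sketch of a Ghost module and a stride-2 Ghost bottleneck of the kind described above (the channel widths, expansion ratio and shortcut form are illustrative assumptions, not the patent's exact configuration):

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """A small primary convolution produces 'intrinsic' feature maps; a
    cheap depthwise convolution derives extra 'ghost' maps from them, and
    the two sets are concatenated along the channel dimension."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio               # intrinsic maps
        cheap_ch = out_ch - init_ch             # ghost maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),  # depthwise: one filter per channel
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class GhostBottleneck(nn.Module):
    """Stride-2 variant: two Ghost modules with a depthwise stride-2
    convolution between them, plus a matching projection shortcut."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=2):
        super().__init__()
        self.ghost1 = GhostModule(in_ch, mid_ch)
        self.down = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1,
                              groups=mid_ch, bias=False)
        self.ghost2 = GhostModule(mid_ch, out_ch)
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.ghost2(self.down(self.ghost1(x))) + self.shortcut(x)
```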
the global context module, whose structure is shown in FIG. 4, comprises 3 processes. The first is a global attention-pooling mechanism for context modeling: a 1 × 1 conventional convolution and a SoftMax layer obtain self-attention weights of the input feature map, and an attention-pooling operation on the input feature map then yields a global background feature map. The second is feature transformation to capture channel dependence, consisting of two 1 × 1 convolution layers connected by a batch normalization layer and a ReLU activation function. The third is element fusion: the original input feature map is fused with the channel-dependent feature map so that the global context features are aggregated onto the features of every position. The output keeps the same size as the input, so the final low-level feature map has a scale of 128 × 128;
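The three processes can be sketched as follows (a PyTorch sketch written from the description above, with an assumed channel-reduction ratio; it is not tied to any particular published implementation):

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """1x1 conv + softmax form a global attention-pooling weight, the
    pooled feature passes through a two-layer 1x1 transform, and the
    result is broadcast-added back to every position."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)  # attention logits
        mid = channels // reduction
        self.transform = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.BatchNorm2d(mid),           # the text describes BN + ReLU here
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        weights = self.attn(x).view(b, 1, h * w).softmax(dim=-1)        # (B,1,HW)
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))  # (B,C,1)
        context = self.transform(context.view(b, c, 1, 1))
        return x + context   # element fusion: aggregate context at each position
```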
step 3.2, inputting the low-level feature map into a lightweight double-branch subnetwork in the encoder to obtain a space detail feature map and an abstract semantic feature map;
the dual-branch sub-network comprises two branches: a trunk deep branch for acquiring abstract semantic features and a spatial preservation branch for acquiring spatial detail features. The two branches share the low-level features output by the global downsampling block; compared with a traditional dual-branch network, one input path is removed, reducing the parameter scale and computation cost of early low-level feature extraction;
the trunk deep branch is constructed on the GhostNet network, whose main body comprises 16 Ghost bottleneck modules performing 4 down-sampling stages to extract deep features. The method keeps the 16 Ghost bottleneck modules of GhostNet and converts them into a fully convolutional network serving as the main body of the trunk deep branch. The input low-level feature map processed by this branch finally yields deep feature maps at 4 scales: 64 × 64, 32 × 32, 16 × 16 and 8 × 8. The 4 scales correspond to 4 stages, with [3, 2, 6, 5] Ghost bottleneck modules and convolution kernel sizes [3, 5, 3, 5] respectively. Since down-sampling is performed 4 times, each stage contains one Ghost bottleneck module with stride 2.
Meanwhile, to obtain rich abstract semantic features, the method combines a depthwise separable convolution module, an up-sampling block module and a lightweight atrous spatial pyramid pooling (ASPP) module to build a lightweight feature pyramid from these 4 feature maps; its structure is shown in FIG. 5. The newly generated 4 levels are closely connected, the receptive field is enlarged, the feature maps carrying multi-scale information are up-sampled to the 64 × 64 scale, and the final abstract semantic feature group formed by element fusion is output by the trunk deep branch;
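A minimal sketch of such a pyramid fusion (the lightweight ASPP stage is omitted for brevity, and the stage channel widths are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

def depthwise_separable(in_ch, out_ch, stride=1):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class LightFeaturePyramid(nn.Module):
    """Unify the channel widths of the four stage outputs with depthwise
    separable convs, upsample everything to the 64x64 grid and fuse by
    element-wise addition."""
    def __init__(self, in_chs=(40, 80, 160, 320), out_ch=128):
        super().__init__()
        self.laterals = nn.ModuleList(
            [depthwise_separable(c, out_ch) for c in in_chs])

    def forward(self, feats):   # feats: the 64x64, 32x32, 16x16, 8x8 maps
        target = feats[0].shape[-2:]
        fused = 0
        for lateral, f in zip(self.laterals, feats):
            y = lateral(f)
            fused = fused + F.interpolate(y, size=target, mode='bilinear',
                                          align_corners=False)
        return fused
```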
the spatial preservation branch consists of 3 depthwise separable convolutions, each with a 3 × 3 kernel and with strides [1, 2, 1] respectively. It performs 1 down-sampling of the input low-level feature map, and the output spatial detail feature map has a resolution of 64 × 64; this branch preserves the spatial scale of the input image with few parameters and little computation, and can encode rich spatial information;
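The branch itself is then only three layers (a sketch reusing the depthwise_separable helper from the pyramid sketch above; the channel widths are again assumptions):

```python
import torch.nn as nn

# Strides [1, 2, 1] give a single 2x downsampling: 128x128 -> 64x64.
# depthwise_separable is the helper defined in the previous sketch.
spatial_branch = nn.Sequential(
    depthwise_separable(16, 32, stride=1),
    depthwise_separable(32, 64, stride=2),
    depthwise_separable(64, 64, stride=1))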
step 3.3, performing multi-level feature fusion on the obtained spatial detail feature map and abstract semantic feature map through the global feature fusion block of the encoder, and outputting the encoded feature map F_E;
The global feature fusion module comprises 3 parts: two parallel 1 × 1 depthwise separable convolutions; element fusion; and a global context module. The abstract semantic features and spatial detail features from the dual-branch sub-network are dimension-adjusted by the two parallel 1 × 1 convolutions, and element fusion outputs a feature map rich in spatial detail and abstract semantic information. Finally, lightweight context modeling is performed by the global context module to form the encoded feature map F_E, which better fuses global information. Because no down-sampling is involved, the output encoded feature map is consistent with the input size;
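A sketch of this fusion step, reusing the GlobalContextBlock class sketched earlier (the channel widths are placeholders):

```python
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    """Two parallel 1x1 convolutions align the channel widths of the two
    branch outputs, element-wise addition fuses them, and a global context
    block models long-range dependencies on the fused map."""
    def __init__(self, semantic_ch, spatial_ch, out_ch):
        super().__init__()
        self.proj_semantic = nn.Conv2d(semantic_ch, out_ch, 1, bias=False)
        self.proj_spatial = nn.Conv2d(spatial_ch, out_ch, 1, bias=False)
        self.context = GlobalContextBlock(out_ch)  # defined in the earlier sketch

    def forward(self, f_semantic, f_spatial):
        fused = self.proj_semantic(f_semantic) + self.proj_spatial(f_spatial)
        return self.context(fused)    # encoded feature map F_E, 64 x 64
```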
finally, through the above steps the encoder generates an encoded feature map with a scale of 64 × 64, used as the input of the subsequent process;
step 4, inputting the encoded feature map F_E into the decoder, performing edge feature refinement and up-sampling to obtain the decoded feature map F_D;
The decoder consists of two modules, namely the edge decoupling module and the up-sampling module. The lightweight edge decoupling module comprises 3 parts, with the structure shown in FIG. 6: a lightweight atrous spatial pyramid pooling (ASPP) module, a main body feature generator and an edge preserver. The specific process of acquiring the decoded feature map is as follows: first, the encoded feature map F_E is input into the lightweight edge decoupling module of the decoder for edge feature refinement, generating a fine feature map with refined edges; the fine feature map is then input into the up-sampling module of the decoder, up-sampled, and restored to the size of the original input remote sensing image as the decoded feature map F_D output by the decoder;
The main body feature generator comprises two processes of flow field generation and feature deformation, wherein the flow field generation is composed of a micro coding and decoding structure containing one-time down-sampling and one-time up-sampling and a conventional convolution with a convolution kernel of 3 x 3 and is used for generating flow field feature representation with prominent features at the central part of a target object. The characteristic deformation is to obtain the obvious main characteristic representation of the target object by carrying out deformation operation on the flow field characteristics; therefore, the main feature generator is responsible for generating more consistent feature representation for pixels in the same object, and the extracted main feature is the main feature of the target object;
the edge preserver comprises two steps. The first is a subtracter, which performs an explicit subtraction between the receptive-field-expanded encoded feature map and the main body feature map to obtain a coarse edge feature map. The second is an edge feature refiner, which supplements the edge features with a low-level feature map containing fine details: the low-level feature map from the encoder is fused into the coarse edge feature map by channel stacking, supplementing high-frequency information, after which a 1 × 1 conventional convolution reduces the dimensionality and the refined edge feature map is output;
the input encoded feature map F_E first passes through the lightweight ASPP to generate a feature map F_aspp with multi-scale information and a larger receptive field, and the main body generator then forms the main body feature map F_body of the target. F_body, F_aspp and F_E are input into the edge preserver: F_aspp and F_body undergo an explicit subtraction to generate a preliminary edge feature map, which is channel-stacked with F_E, reduced in dimensionality by a 1 × 1 conventional convolution, and output as the refined edge feature map F_edge. Finally, F_body and F_edge are fused element-wise to obtain the fine output feature map F_final used for up-sampling recovery; the whole process can be expressed as:
F_aspp = f_dsaspp(F_E)
F_body = φ(F_aspp)
F_edge = ψ(F_body, F_aspp, F_E)
F_final = F_body + F_edge

where f_dsaspp denotes the lightweight atrous spatial pyramid pooling function, φ the main body feature generating function, and ψ the edge-preserving function;
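The edge preserver portion of these equations can be sketched as follows (assuming, for simplicity, that F_aspp, F_body and the stacked encoder features share one channel width):

```python
import torch
import torch.nn as nn

class EdgePreserver(nn.Module):
    """Explicit subtraction -> channel stacking -> 1x1 dimensionality reduction."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, f_aspp, f_body, f_enc):
        coarse_edge = f_aspp - f_body                     # explicit subtraction
        stacked = torch.cat([coarse_edge, f_enc], dim=1)  # channel stacking with F_E
        return self.reduce(stacked)                       # refined F_edge

# F_final = F_body + F_edge (element fusion); the up-sampling module
# (1x1 conv + interpolation) then restores the 512 x 512 resolution.
```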
to obtain the final decoded feature map F_D, F_final is input into the up-sampling module, which contains a 1 × 1 conventional convolution and an up-sampling operation, and is restored to the size of the original input image; the finally generated feature map is the decoded feature map F_D with a scale of 512 × 512;
step 5, inputting the decoding characteristic graph into a classifier to perform pixel level classification prediction, outputting a segmentation result, and performing supervised training on the semantic segmentation network through a supervision mechanism;
the main body of the classifier is a SoftMax layer. After the decoded feature map F_D is processed by the SoftMax layer, pixel-level classification prediction is completed and the semantic segmentation result is obtained. Network training is supervised by a mechanism formed by the segmentation result and the real labels, so that the semantic segmentation network achieves optimal segmentation performance;
the supervision mechanism does not supervise only the final segmentation result; F_body, F_edge, F_final and F_E are supervised jointly. The mechanism is realized by a designed loss function, and the total loss function, denoted L, is given by:
L = λ1·L_body + λ2·L_edge + λ3·L_final + λ4·L_G, with L_edge = λ5·L_bce + λ6·L_ce
where L_body, L_edge, L_final and L_G denote the main body feature loss, edge feature loss, fine feature loss and global encoding loss respectively. L_final and L_G adopt the cross-entropy loss commonly used in semantic segmentation tasks. L_body adopts a boundary relaxation loss, which relaxes the classification of boundary pixels during training and allows the segmentation network to predict a boundary pixel as several classes. L_edge is a comprehensive loss function that obtains a boundary edge prior based on the edge prediction part and comprises two terms: the binary cross-entropy loss L_bce for boundary pixel classification, and the cross-entropy loss L_ce of the edge parts in the scene. The hyperparameters λ1, λ2, λ3, λ4, λ5, λ6 control the weighting between the losses; the first three default to 1 and the last three to 0.4, 20 and 1 respectively.
Here, y denotes the real semantic label, the binary boundary mask generated from y serves as the ground truth for the boundary prediction result b, and s_body, s_final and s_E denote the segmentation map predictions obtained from F_body, F_final and F_E respectively;
step 6, training the semantic segmentation network built in step 2 with the training samples according to steps 3 to 5;
according to the above process, after the semantic segmentation network is built, training samples are continuously input to train the network following steps 3 to 5; before training, relevant training parameters such as the network input scale, batch size and learning rate need to be set.
Step 7, inputting a sample to be tested into the trained semantic segmentation network, outputting a final semantic segmentation result of the remote sensing image, and completing the test of the semantic segmentation network;
the following is a specific example experiment, which is not intended to limit the use of the method of the present invention, but is merely a better example for analysis.
The experiment used the Vaihingen dataset provided by ISPRS, which contains 3-channel IRRG images, DSM images and NDSM images: 16 remote sensing images of size 6000 × 6000 with corresponding labels. The corresponding visualization results are shown in FIG. 7. The semantic labels of the 6 target classes contained in the annotations are determined by RGB values, as shown in Table 1 below:
TABLE 1 Semantic annotation information (the class-to-RGB mapping is reproduced as an image in the original publication)
In this embodiment, the dataset is preprocessed according to the sliding-window cropping and data augmentation described in step 1, and the obtained data are multi-channel images of 512 × 512 × 3. Training and test samples are divided in a 4:1 ratio;
and then the semantic segmentation network of the method is built, and relevant parameters are set before training. The input scale of the network is 512 × 512, the batch size is set to 10 (according to available GPU memory), the optimizer is SGD with an initial learning rate of 0.001, a minimum learning rate of 0.00001, momentum of 0.9 and a weight decay coefficient of 0.0005.
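In PyTorch terms this setup might look as follows (the decay schedule between the initial and minimum learning rates is not specified in the text, so cosine annealing is an assumption; the model is a stand-in):

```python
import torch

model = torch.nn.Conv2d(3, 6, 1)   # stand-in for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=0.00001)  # decay toward the minimum lr
```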
In this embodiment, the selected semantic segmentation evaluation indexes are mean intersection over union (mIoU), mean pixel accuracy (mAcc), GFLOPs (floating-point operations), parameter count, and single-image segmentation inference time. 4 semantic segmentation methods are selected for comparison in terms of segmentation precision and efficiency: UNet, PSPNet, Fast-SCNN and Sem-FPN. mIoU and mAcc serve as measures of segmentation accuracy: the higher they are, the closer the segmentation result is to the real label and the higher the precision. GFLOPs, parameter count and single-image inference time serve as measures of segmentation efficiency: the smaller they are, the higher the efficiency. The experimental results of the different methods are shown in Table 2:
TABLE 2 Comparison of the proposed method with existing methods

Method       mIoU (%)   mAcc (%)   GFLOPs   Params (M)   Inference time (s)
UNet          86.19      91.16     203.04     29.06          0.067
PSPNet        86.40      92.19     178.48     48.98          0.066
Fast-SCNN     76.23      83.83       0.91      1.21          0.015
Sem-FPN       83.57      90.91      45.48     28.50          0.029
Proposed      85.33      90.98       6.63      4.17          0.031
As can be seen from the results in Table 2, the method achieves 85.33% mIoU and 90.98% mAcc, with 6.63 GFLOPs, 4.17M parameters and a single-image segmentation inference time of 0.031 s. Fast-SCNN has the fewest parameters, the lowest floating-point computation and the shortest inference time, but its precision is far below that of the method. Compared with Sem-FPN, the method is slightly inferior in inference time but higher in both mIoU and mAcc, while Sem-FPN's parameter count and GFLOPs are far higher. UNet and PSPNet are classical semantic segmentation networks; the method is slightly inferior to them in precision, but their parameter counts, GFLOPs and inference times are several times those of the method. Considering segmentation precision and efficiency together, the proposed semantic segmentation network is therefore superior to the other networks; moreover, its parameter count, GFLOPs and inference speed verify that the invention is a lightweight semantic segmentation method for remote sensing images;
FIG. 8 shows the visualized semantic segmentation results obtained after inputting a test sample. Compared with the results of Fast-SCNN and Sem-FPN, the method classifies pixels more accurately, effectively reducing segmentation errors caused by misclassification, and it handles edge details more accurately, coming closer to the real semantic labels. Compared with UNet and PSPNet, although its overall segmentation precision is lower, the method is closer to the semantic labels in the segmentation of edge details.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (7)

1. A remote sensing image lightweight semantic segmentation method based on edge decoupling is characterized by comprising the steps of building, training and testing a semantic segmentation network, wherein the semantic segmentation network is a lightweight coding and decoding network with a double-branch structure, after training of the semantic segmentation network is completed based on a training sample, a remote sensing image to be tested is input into the semantic segmentation network, and a final remote sensing image semantic segmentation result is output;
the method comprises the following steps in sequence:
step 1, acquiring a remote sensing image data set, and preparing a training and testing sample;
step 2, building a lightweight coding and decoding semantic segmentation network with a double-branch structure;
step 2, building a lightweight coding and decoding semantic segmentation network with a double-branch structure, wherein the network comprises an encoder, a decoder and a classifier;
the encoder is of a double-branch structure and comprises a global downsampling block, a lightweight double-branch sub-network and a global feature fusion module;
the decoder consists of a lightweight edge decoupling module and an up-sampling module;
the classifier is composed of a conventional convolutional layer and a SoftMax layer;
step 3, inputting the training samples into the encoder, performing feature encoding through feature extraction, and obtaining the encoded feature map F_E;
obtaining the encoded feature map F_E in step 3 comprises the following steps:
step 3.1, inputting the training sample into a global downsampling block of an encoder to obtain a low-level feature map;
the global downsampling block in the step 3.1 consists of 3 parts, wherein one part is 1 conventional convolution, the other part is 1 Ghost bottleneck module, and the third part is 1 global context module;
after an input sample passes through a global downsampling block, a low-level feature map with the output resolution being 1/4 of the original input is generated and used as the input of the subsequent process;
step 3.2, inputting the low-level feature map into a lightweight double-branch sub-network in an encoder to obtain a space detail feature map and an abstract semantic feature map;
step 3.3, performing multi-level feature fusion on the obtained spatial detail feature map and abstract semantic feature map through the global feature fusion block of the encoder, and outputting the encoded feature map F_E;
step 4, inputting the encoded feature map F_E into the decoder, performing edge feature refinement and up-sampling to obtain the decoded feature map F_D;
Step 5, inputting the decoding characteristic graph into a classifier to perform pixel-level classification prediction, outputting a segmentation result, and performing supervised training on the semantic segmentation network through a supervision mechanism;
step 6, training the semantic segmentation network built in step 2 with the training samples according to steps 3 to 5;
and 7, inputting the sample to be tested into the trained semantic segmentation network, outputting a final remote sensing image semantic segmentation result, and completing the test of the semantic segmentation network.
2. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, wherein the lightweight dual-branch sub-network in step 3.2 comprises two branches, namely a trunk deep branch for obtaining abstract semantic features and a spatial preservation branch for obtaining spatial detail features, and the two branches share the low-level feature map output by the global downsampling block;
the trunk deep branch is constructed based on the GhostNet feature extraction network and comprises two structures: one is the branch main body composed of 16 Ghost bottleneck modules, which performs 4 downsampling steps to extract deep features; the other is a lightweight feature pyramid composed of four parts, namely depthwise separable convolution, an up-sampling block, a lightweight atrous spatial pyramid pooling (ASPP) module and element fusion, which takes the 4 deep feature maps of different scales formed by the main body as input and finally outputs abstract semantic features with an enlarged receptive field and multi-scale information;
the spatial preservation branch is composed of 3 depthwise separable convolutions, performs downsampling 1 time on the input low-level feature map, and outputs a spatial detail feature map whose resolution is 1/2 of the input.
3. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, wherein the global feature fusion module of step 3.3 comprises 3 parts: two parallel depthwise separable convolutions with 1 × 1 kernels; element fusion; and 1 global context module;
the input abstract semantic features and spatial detail features are dimension-adjusted by the two parallel convolutions, element fusion outputs a feature map rich in spatial detail and abstract semantic information, and finally lightweight context modeling is performed by the global context module to form the encoded feature map F_E, which better fuses global information.
4. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, wherein in step 4 the decoded feature map F_D is obtained as follows: first, the encoded feature map F_E is input into the lightweight edge decoupling module of the decoder for edge feature refinement, generating a fine feature map with refined edges; the fine feature map is then input into the up-sampling module of the decoder, up-sampled, and restored to the size of the original input remote sensing image as the decoded feature map F_D output by the decoder.
5. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, wherein the lightweight edge decoupling module consists of 3 parts, namely a lightweight atrous spatial pyramid pooling (ASPP) module, a main body feature generator and an edge preserver; first, the encoded features pass through the lightweight ASPP to generate a feature map F_aspp with multi-scale information and a larger receptive field; then the main body generator produces a more consistent feature representation for pixels within the same object, forming the main body feature map F_body of the target object; F_body, F_aspp and F_E are input into the edge preserver, which outputs a refined edge feature map F_edge through an explicit subtraction operation, channel-stacking fusion and 1 × 1 conventional convolution dimensionality reduction; finally, the main body feature map and the refined edge feature map are fused, and the refined output feature map F_final used for up-sampling recovery is output; the whole process can be expressed as:

F_aspp = f_dsaspp(F_E)
F_body = φ(F_aspp)
F_edge = ψ(F_body, F_aspp, F_E)
F_final = F_body + F_edge

where f_dsaspp denotes the lightweight atrous spatial pyramid pooling function, φ the main body feature generating function, and ψ the edge-preserving function;
the up-sampling module comprises two steps, a 1 × 1 conventional convolution operation and an up-sampling operation; after the fine feature map F_final is output by this module, it is restored to the size of the original input remote sensing image, i.e., the decoder output feature map F_D.
6. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, wherein, under the supervision mechanism in step 5, after the decoded feature map F_D is processed by the classifier, pixel-level classification is completed and the semantic segmentation result is output; the network is trained under supervision formed by the semantic segmentation result and the real labels, so that the semantic segmentation network achieves the best segmentation performance.
7. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, wherein the supervision mechanism in step 5 is an edge-based supervision method implemented by a designed loss function; the total loss function, denoted L, is:

L = λ1·L_body + λ2·L_edge + λ3·L_final + λ4·L_G

where L_body, L_edge, L_final and L_G denote the main body feature loss, edge feature loss, fine feature loss and global encoding loss respectively; the inputs of these 4 loss functions are the segmentation results formed from the main body feature map, the refined edge feature map, the refined output feature map and the encoded feature map after up-sampling recovery and a SoftMax layer, together with the corresponding real labels;
wherein the loss function L_edge is a comprehensive loss function that obtains a boundary edge prior based on the edge prediction part and comprises two terms: the binary cross-entropy loss L_bce for boundary pixel classification and the cross-entropy loss L_ce of the edge parts in the scene, i.e. L_edge = λ5·L_bce + λ6·L_ce; the hyperparameters λ1, λ2, λ3, λ4, λ5, λ6 control the weighting between the losses.
CN202110456921.9A 2021-04-27 2021-04-27 Remote sensing image lightweight semantic segmentation method based on edge decoupling Active CN113159051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110456921.9A CN113159051B (en) 2021-04-27 2021-04-27 Remote sensing image lightweight semantic segmentation method based on edge decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110456921.9A CN113159051B (en) 2021-04-27 2021-04-27 Remote sensing image lightweight semantic segmentation method based on edge decoupling

Publications (2)

Publication Number Publication Date
CN113159051A CN113159051A (en) 2021-07-23
CN113159051B true CN113159051B (en) 2022-11-25

Family

ID=76871278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110456921.9A Active CN113159051B (en) 2021-04-27 2021-04-27 Remote sensing image lightweight semantic segmentation method based on edge decoupling

Country Status (1)

Country Link
CN (1) CN113159051B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658200B (en) * 2021-07-29 2024-01-02 东北大学 Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN113706546B (en) * 2021-08-23 2024-03-19 浙江工业大学 Medical image segmentation method and device based on lightweight twin network
CN113762396A (en) * 2021-09-10 2021-12-07 西南科技大学 Two-dimensional image semantic segmentation method
CN113706561B (en) * 2021-10-29 2022-03-29 华南理工大学 Image semantic segmentation method based on region separation
CN114426069B (en) * 2021-12-14 2023-08-25 哈尔滨理工大学 Indoor rescue vehicle based on real-time semantic segmentation and image semantic segmentation method
CN114398979A (en) * 2022-01-13 2022-04-26 四川大学华西医院 Ultrasonic image thyroid nodule classification method based on feature decoupling
CN114463542A (en) * 2022-01-22 2022-05-10 仲恺农业工程学院 Orchard complex road segmentation method based on lightweight semantic segmentation algorithm
CN114863094A (en) * 2022-05-31 2022-08-05 征图新视(江苏)科技股份有限公司 Industrial image region-of-interest segmentation algorithm based on double-branch network
CN115240041A (en) * 2022-07-13 2022-10-25 北京理工大学 Shale electron microscope scanning image crack extraction method based on deep learning segmentation network
CN115147703B (en) * 2022-07-28 2023-11-03 广东小白龙环保科技有限公司 Garbage segmentation method and system based on GinTrans network
CN115272681B (en) * 2022-09-22 2022-12-20 中国海洋大学 Ocean remote sensing image semantic segmentation method and system based on high-order feature class decoupling
CN116342884B (en) * 2023-03-28 2024-02-06 阿里云计算有限公司 Image segmentation and model training method and server
CN117078967B (en) * 2023-09-04 2024-03-01 石家庄铁道大学 Efficient and lightweight multi-scale pedestrian re-identification method
CN117475305B (en) * 2023-10-26 2024-07-19 广西壮族自治区自然资源遥感院 Multi-class building contour intelligent extraction and regularization method and application system
CN117475155B (en) * 2023-12-26 2024-04-02 厦门瑞为信息技术有限公司 Lightweight remote sensing image segmentation method based on semi-supervised learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330681A (en) * 2020-11-06 2021-02-05 北京工业大学 Attention mechanism-based lightweight network real-time semantic segmentation method
CN112580654A (en) * 2020-12-25 2021-03-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Semantic segmentation method for ground objects of remote sensing image

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8001527B1 (en) * 2004-12-21 2011-08-16 Zenprise, Inc. Automated root cause analysis of problems associated with software application deployments
CN104392209B (en) * 2014-11-07 2017-09-29 长春理工大学 A kind of image complexity evaluation method of target and background
CN104574296B (en) * 2014-12-24 2017-07-04 长春理工大学 A kind of method for polarizing the m ultiwavelet fusion treatment picture for removing haze
CN113424154A (en) * 2019-05-23 2021-09-21 西门子股份公司 Method of edge side model processing, edge calculation apparatus, and computer readable medium
CN110674866B (en) * 2019-09-23 2021-05-07 兰州理工大学 Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111079649B (en) * 2019-12-17 2023-04-07 西安电子科技大学 Remote sensing image ground feature classification method based on lightweight semantic segmentation network
CN111797676B (en) * 2020-04-30 2022-10-28 南京理工大学 High-resolution remote sensing image target on-orbit lightweight rapid detection method
CN111666836B (en) * 2020-05-22 2023-05-02 北京工业大学 High-resolution remote sensing image target detection method of M-F-Y type light convolutional neural network
CN112149547B (en) * 2020-09-17 2023-06-02 南京信息工程大学 Remote sensing image water body identification method based on image pyramid guidance and pixel pair matching
CN112183360B (en) * 2020-09-29 2022-11-08 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330681A (en) * 2020-11-06 2021-02-05 北京工业大学 Attention mechanism-based lightweight network real-time semantic segmentation method
CN112580654A (en) * 2020-12-25 2021-03-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Semantic segmentation method for ground objects of remote sensing image

Also Published As

Publication number Publication date
CN113159051A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113159051B (en) Remote sensing image lightweight semantic segmentation method based on edge decoupling
CN112183360B (en) Lightweight semantic segmentation method for high-resolution remote sensing image
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN108596248B (en) Remote sensing image classification method based on improved deep convolutional neural network
CN111695467B (en) Spatial spectrum full convolution hyperspectral image classification method based on super-pixel sample expansion
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN113888550B (en) Remote sensing image road segmentation method combining super-resolution and attention mechanism
CN114120102A (en) Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN113807210A (en) Remote sensing image semantic segmentation method based on pyramid segmentation attention module
CN111369563A (en) Semantic segmentation method based on pyramid void convolutional network
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN109035267B (en) Image target matting method based on deep learning
CN110852369B (en) Hyperspectral image classification method combining 3D/2D convolutional network and adaptive spectrum unmixing
CN110517272B (en) Deep learning-based blood cell segmentation method
CN112560966B (en) Polarized SAR image classification method, medium and equipment based on scattering map convolution network
CN111178438A (en) ResNet 101-based weather type identification method
CN104700100A (en) Feature extraction method for high spatial resolution remote sensing big data
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN114359297A (en) Attention pyramid-based multi-resolution semantic segmentation method and device
CN111815526B (en) Rain image rainstrip removing method and system based on image filtering and CNN
CN114821340A (en) Land utilization classification method and system
CN116935043A (en) Typical object remote sensing image generation method based on multitasking countermeasure network
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net
CN114511785A (en) Remote sensing image cloud detection method and system based on bottleneck attention module
CN111179272A (en) Rapid semantic segmentation method for road scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant