CN112364699A - Remote sensing image segmentation method, device and medium based on weighted loss fusion network - Google Patents

Remote sensing image segmentation method, device and medium based on weighted loss fusion network

Info

Publication number
CN112364699A
Authority
CN
China
Prior art keywords
network
loss
remote sensing
sensing image
training
Prior art date
Legal status (assumption only; not a legal conclusion)
Pending
Application number
CN202011097624.1A
Other languages
Chinese (zh)
Inventor
颜军
张永军
刘文杰
邓剑文
吴明朗
郑忠良
郝梦
Current Assignee (the listed assignees may be inaccurate)
Guangdong Obit Artificial Intelligence Research Institute Co ltd
Zhuhai Orbita Aerospace Technology Co ltd
Guizhou University
Original Assignee
Guangdong Obit Artificial Intelligence Research Institute Co ltd
Zhuhai Orbita Aerospace Technology Co ltd
Guizhou University
Priority date (assumption only; not a legal conclusion)
Filing date
Publication date
Application filed by Guangdong Obit Artificial Intelligence Research Institute Co ltd, Zhuhai Orbita Aerospace Technology Co ltd and Guizhou University
Priority to CN202011097624.1A
Publication of CN112364699A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/176 - Urban or other man-made structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464 - Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20016 - Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a remote sensing image segmentation method, device and medium based on a weighted loss fusion network, comprising the following steps: preprocessing the remote sensing image to obtain training data; constructing a convolutional neural network comprising an encoder with multi-channel training branches, and performing context extraction on the training data through the encoder; constructing a double pyramid module, extracting feature maps of the training data through the double pyramid module, and outputting the corresponding feature maps; up-sampling the obtained feature maps to obtain feature maps of different sizes and fusing them; and constructing a perceptual loss network, calculating the perceptual loss, weight-fusing it with the training loss, and back-propagating the fused loss to the training network to update the parameters. The invention has the beneficial effects that the network extracts high-quality deep features and scale features, avoids the loss of spatial features to the greatest extent, effectively improves the segmentation of remote sensing images, and trains and fits faster.

Description

Remote sensing image segmentation method, device and medium based on weighted loss fusion network
Technical Field
The invention relates to the fields of remote sensing imagery and deep learning, and in particular to a remote sensing image segmentation method, device and medium based on a weighted loss fusion network.
Background
With the development of remote sensing technology, the data volume of remote sensing images keeps growing and their resolution keeps increasing. Because remote sensing images contain a large amount of information, they have many applications, including target detection, scene classification and semantic segmentation, and these applications are increasingly diversified: city planning, building extraction, road extraction, vehicle detection and illegal-building extraction. All of these fields demand high segmentation quality, and although many segmentation methods for remote sensing images exist, their segmentation effect still needs to be improved.
Semantic segmentation of remote sensing images is a research hotspot, and with the development of deep learning, semantic segmentation based on fully convolutional neural networks has greatly improved segmentation precision. Remote sensing images carry a large amount of information, but the amount of data per class of sample is extremely uneven. Common networks can therefore segment remote sensing images to a certain extent, yet there remains much room to improve segmentation precision: such networks are deepened to raise classification accuracy, at the cost of a heavy loss of target spatial features and scale features.
Disclosure of Invention
The invention aims to solve at least one of the technical problems in the prior art, and provides a remote sensing image segmentation method, device and medium based on a weighted loss fusion network that, for high-precision remote sensing images, avoid the loss of spatial features and effectively improve the segmentation effect.
The technical scheme of the invention comprises a remote sensing image segmentation method based on a weighted loss fusion network, comprising the following steps: S100, preprocessing the remote sensing image to obtain training data for a convolutional neural network; S200, constructing a convolutional neural network comprising an encoder with multi-channel training branches, and performing context extraction on the training data through the encoder; S300, constructing a double pyramid module, extracting feature maps of the training data through its two groups of spatial pyramids with different convolution expansion rates, and outputting corresponding first feature maps; S400, up-sampling the feature maps obtained in step S300 to obtain second feature maps of different sizes, and fusing the first feature maps with the second feature maps; S500, constructing a perceptual loss network, calculating the loss of the fused feature map through this loss network, weight-fusing the calculated loss with the loss of the training network, and back-propagating it to update the parameters.
According to the remote sensing image segmentation method based on the weighted loss fusion network, the preprocessing in S100 comprises: performing image cutting, normalization and data expansion on the remote sensing image, wherein the data expansion operations are flips in the horizontal and vertical directions, respectively.
According to the remote sensing image segmentation method based on the weighted loss fusion network, S200 comprises: the convolutional neural network uses Inception V-4 pre-trained on the ImageNet dataset as the network backbone, removes the average pooling layer at its end, and adds network branches to the backbone to obtain an encoder with multi-channel training branches, wherein the network branches are used to retain shallow image features.
According to the remote sensing image segmentation method based on the weighted loss fusion network, S200 comprises the following steps: removing the average pooling layer and all subsequent layers, establishing a separate network branch for the output after the Reduction-A layer, and fusing the branch with the end of the trunk network after a 2 × 2 max pooling layer.
According to the remote sensing image segmentation method based on the weighted loss fusion network, S300 comprises the following steps: S310, establishing the first spatial pyramid behind the Stem module as a parallel training network, extracting the first-stage multi-scale features of the training data with the first spatial pyramid, fusing the feature maps of the five branches of the ASPP1 module into a corresponding fused feature map, and applying a 1 × 1 convolution to the fused feature map; S320, performing a 4 × 4 max pooling operation to reduce the feature map size; S330, constructing the second spatial pyramid at the Reduction-A module of the Inception V-4 network, receiving the corresponding feature map through the ASPP2 module, fusing the four branches of the ASPP2 module, outputting the feature map through a 1 × 1 convolution layer, and adding a 2 × 2 max pooling layer after the convolution layer to obtain the feature map.
According to the remote sensing image segmentation method based on the weighted loss fusion network, the expansion rates of the first spatial pyramid and the second spatial pyramid are set to [1, 6, 12, 18] and [1, 6, 12], respectively.
According to the remote sensing image segmentation method based on the weighted loss fusion network, S400 comprises: recovering the image size through four groups of convolution-block up-sampling modules and obtaining the classification prediction result through Softmax.
According to the remote sensing image segmentation method based on the weighted loss fusion network, S500 comprises: constructing a perceptual loss network using a pre-trained VGG16 network, feeding the prediction map produced by the segmentation network into the loss network to calculate a loss, then weight-fusing this loss with the loss calculated by the segmentation network and back-propagating it to update the parameters.
The technical scheme of the invention also comprises a remote sensing image segmentation device based on the weighted loss fusion network, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any of the above method steps when executing the computer program.
The technical solution of the present invention further includes a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements any of the above method steps.
The invention has the beneficial effects that the network extracts high-quality deep features and scale features, avoids the loss of spatial features to the greatest extent, effectively improves the segmentation of remote sensing images, and achieves good results on the ISPRS 2D dataset.
Drawings
The invention is further described below with reference to the accompanying drawings and examples.
FIG. 1 shows a general flow diagram according to an embodiment of the invention.
Fig. 2 is an overall architecture diagram according to an embodiment of the present invention.
Fig. 3 is a diagram of a training network architecture according to an embodiment of the present invention.
Fig. 4 is a diagram of the Inception V-4 backbone network architecture according to an embodiment of the present invention.
Fig. 5 is a structural view of a spatial pyramid according to an embodiment of the present invention.
FIG. 6 is a comparison of predicted results on a Vaihingen validation set, according to an embodiment of the present invention.
FIG. 7 is a diagram of a media device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the present preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, while "above", "below", "within" and the like are understood as including the stated number.
In the description of the present invention, the consecutive numbering of the method steps is only for convenience of description and understanding; in combination with the technical solution as a whole and the logical relationships between the steps, the order of implementation of the steps may be adjusted without affecting the technical effect achieved.
FIG. 1 shows a general flow diagram according to an embodiment of the invention. The process comprises the following steps: S100, preprocessing the remote sensing image to obtain training data for a convolutional neural network; S200, constructing a convolutional neural network comprising an encoder with multi-channel training branches, and performing context extraction on the training data through the encoder; S300, constructing a double pyramid module, extracting feature maps of the training data through its two groups of spatial pyramids with different convolution expansion rates, and outputting corresponding first feature maps; S400, up-sampling the feature maps obtained in step S300 to obtain second feature maps of different sizes, and fusing the first feature maps with the second feature maps; S500, constructing a perceptual loss network, calculating the loss of the fused feature map through this loss network, weight-fusing the calculated loss with the loss of the training network, and back-propagating it to update the parameters.
For the preprocessing, the present invention provides the following embodiments:
the validity of the method is verified using the real data. The experiment was trained using the ISPRS 2D Semantic Label control Vaihingen dataset.
The Vaihingen dataset is a high-resolution aerial image dataset with complete semantic labels, including a true orthophoto (TOP) and a digital surface model (DSM). The image files are composed of different channels in the IRRG (IR-R-G, 3-channel) image format. Here, only the TOP IRRG images are used for training. The dataset comprises 16 labeled image tiles of different sizes, and the labels are divided into six classes: Impervious Surfaces, Building, Low Vegetation, Tree, Car and background.
The 16 images with semantic labels are cut into patches of size 299 × 299. Considering the depth of the network, this amount of data is too small to provide enough feature information, so data expansion is performed: the images are flipped in the horizontal and vertical directions and then rotated, yielding 14824 images of size 299 × 299. 75% of the total samples are randomly selected as the training set, 20% as the test set, and the remainder as the validation set.
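This tiling and expansion can be sketched as follows (a minimal illustration with PIL; `top_irrg_paths`, the 90-degree rotation and the resulting expansion factor are assumptions, since the text does not state them):

```python
import random
from PIL import Image

PATCH = 299  # network input size

top_irrg_paths = []  # hypothetical: paths to the 16 labeled TOP IRRG tiles

def cut_patches(image):
    """Cut one labeled tile into non-overlapping 299 x 299 patches."""
    w, h = image.size
    return [image.crop((x, y, x + PATCH, y + PATCH))
            for y in range(0, h - PATCH + 1, PATCH)
            for x in range(0, w - PATCH + 1, PATCH)]

def expand(patch):
    """Data expansion: horizontal flip, vertical flip, then rotation."""
    flipped = [patch,
               patch.transpose(Image.FLIP_LEFT_RIGHT),   # horizontal flip
               patch.transpose(Image.FLIP_TOP_BOTTOM)]   # vertical flip
    return flipped + [p.rotate(90) for p in flipped]     # rotation angle assumed

tiles = [Image.open(p) for p in top_irrg_paths]
samples = [a for t in tiles for p in cut_patches(t) for a in expand(p)]
random.shuffle(samples)

# Random 75% / 20% / remainder split into training, test and validation sets.
n = len(samples)
train = samples[:int(0.75 * n)]
test = samples[int(0.75 * n):int(0.95 * n)]
val = samples[int(0.95 * n):]
```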
Because the training data is huge and computing power is limited, the Adam algorithm is selected for training optimization so that the model converges more quickly.
Fig. 2 is an overall architecture diagram according to an embodiment of the present invention: a perceptual loss network is added alongside the training network, and while the training network trains, the perceptual loss is calculated, weight-fused with the training loss, and back-propagated to the training network.
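A rough PyTorch sketch of this scheme (not the patent's exact implementation): `seg_net` stands for the segmentation network built in the steps below, a frozen pre-trained VGG16 acts as the perceptual loss network per step 5, and the fusion weight `alpha`, the VGG layer cut-off and the 3-channel projection are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen, pre-trained VGG16 features act as the perceptual loss network
# (newer torchvision versions use the weights= argument instead).
perc_net = vgg16(pretrained=True).features[:16].to(device).eval()
for p in perc_net.parameters():
    p.requires_grad = False

seg_loss_fn = nn.CrossEntropyLoss()  # training (segmentation) loss
alpha = 0.1                          # fusion weight for the perceptual loss (assumption)

def to_vgg_input(class_maps):
    """Map an N x C x H x W class-probability map to 3 channels for VGG (assumption)."""
    c = class_maps.shape[1]
    return class_maps[:, :3] if c >= 3 else class_maps.repeat(1, 3, 1, 1)

def train_step(seg_net, optimizer, images, labels):
    optimizer.zero_grad()
    logits = seg_net(images)               # prediction map from the training network
    train_loss = seg_loss_fn(logits, labels)
    # Perceptual loss: distance between VGG features of the prediction map
    # and of the one-hot ground-truth map.
    one_hot = F.one_hot(labels, logits.shape[1]).permute(0, 3, 1, 2).float()
    perc_loss = F.mse_loss(perc_net(to_vgg_input(logits.softmax(1))),
                           perc_net(to_vgg_input(one_hot)))
    # Weighted fusion of the two losses, back-propagated to the training network.
    loss = train_loss + alpha * perc_loss
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(seg_net.parameters(), lr=1e-4)  # Adam, as chosen above
```

Because the VGG16 features are frozen, gradients from the fused loss flow only into `seg_net`, which is how the weighted loss is back-propagated to the training network.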
Fig. 3 is a diagram of a neural network architecture according to an embodiment of the present invention. The flow of the network structure is as follows:
step 1: and preprocessing the remote sensing image, including image cutting, normalization and data expansion, wherein the operations of the data expansion are respectively turning from the horizontal direction and the vertical direction.
This step provides the data for network training. After data expansion, overfitting can be avoided as training deepens, and a large amount of information is provided so that the network can better extract features.
Step 2: remove the average pooling layer and all subsequent layers to adapt the network to the semantic segmentation task, then establish a separate network branch for the output after the Reduction-A layer, and fuse the branch with the end of the trunk network after a 2 × 2 max pooling layer.
This step modifies the backbone network, avoiding the loss of shallow features caused by excessive network depth and enabling the network to learn image features better.
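A schematic of this backbone surgery (a sketch, not the patent's exact code), assuming an Inception V-4 implementation whose layers can be grouped into a part ending at Reduction-A and a part after it; the names `front` and `back` are placeholders for those groupings:

```python
import torch
import torch.nn as nn

class MultiBranchEncoder(nn.Module):
    """Inception V-4 trunk with the terminal average pooling (and all later
    layers) removed, plus a shallow-feature branch taken after Reduction-A."""
    def __init__(self, front, back):
        super().__init__()
        self.front = front                  # Stem ... Reduction-A (placeholder grouping)
        self.back = back                    # Inception-B ... Inception-C, avg pool removed
        self.branch_pool = nn.MaxPool2d(2)  # 2 x 2 max pooling on the branch

    def forward(self, x):
        mid = self.front(x)                 # output after the Reduction-A layer
        deep = self.back(mid)               # end of the trunk network
        shallow = self.branch_pool(mid)     # branch retaining shallower features
        # Fuse the branch with the trunk end along channels; for 299 x 299
        # inputs the 17 x 17 mid map pooled 2 x 2 gives 8 x 8, matching deep.
        return torch.cat([deep, shallow], dim=1)
```

Usage would be `encoder = MultiBranchEncoder(front, back)`, where `front` and `back` come from splitting a concrete Inception V-4 implementation around its Reduction-A stage.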
Step 3: construct the double pyramid module from two groups of spatial pyramids with different expansion rates. The convolution expansion rates of the first group of pyramid modules are 1, 6, 12 and 18, and its output feature map is fused with the output of the second group of up-sampling convolution blocks; the convolution expansion rates of the second group are 1, 6 and 12, and its output feature map is fused with the output of the first group of up-sampling convolution blocks.
In this step, introducing two groups of spatial pyramids allows the scale features of the target to be extracted better, so classification precision improves greatly. This differs from the usual application of a spatial pyramid: a common network simply appends the module to the end of the decoder network, which deepens the network in a certain sense and causes some feature loss. The method instead connects the double pyramid module to different stages of the backbone network and then fuses it with the corresponding layers of the up-sampling module, taking into account both the retention of shallow features and the extraction of more scale features.
Step 4: design the up-sampling module, which consists of four groups of convolution blocks and gradually restores the image size; finally, the classification prediction result is obtained through Softmax.
Step 5: while the features of the training network are being extracted, calculate the perceptual loss through the perceptual loss network, weight-fuse it with the loss of the training network, and finally back-propagate the result to the training network.
These steps restore the feature size by up-sampling, complete the final segmentation prediction, and update the network parameters through the fused loss.
As a semantic segmentation task, IoU and F1 are used as the evaluation indexes; the formulas are as follows:
IoU = TP / (TP + FP + FN)

F1 = 2TP / (2TP + FP + FN)
where P is the number of positive samples, N is the number of negative samples, TP is the number of correctly predicted positive samples, FP is the number of samples incorrectly predicted as positive, TN is the number of correctly predicted negative samples, FN is the number of samples incorrectly predicted as negative, and the number of samples is the number of pixels per picture.
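Both indexes follow directly from per-pixel counts; a small sketch for one class, with boolean prediction and ground-truth masks, using the formulas above:

```python
import numpy as np

def iou_f1(pred, gt):
    """Per-class IoU and F1 from boolean pixel masks of equal shape."""
    tp = np.logical_and(pred, gt).sum()    # correctly predicted positive pixels
    fp = np.logical_and(pred, ~gt).sum()   # pixels wrongly predicted positive
    fn = np.logical_and(~pred, gt).sum()   # positive pixels that were missed
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1
```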
Fig. 4 is a diagram of the Inception V-4 backbone network architecture according to an embodiment of the present invention. The network mainly comprises a down-sampling module and an up-sampling module: the down-sampling module consists of a backbone based on the Inception V-4 network and the double pyramid module, and the up-sampling module consists of four groups of convolution blocks.
Fig. 5 is a structural diagram of the spatial pyramid according to an embodiment of the present invention. Based on this structure and the embodiment of fig. 3, the segmentation process is as follows:
step 1: the method comprises the steps of taking an increment V-4 network as a backbone, abandoning the last Average Power, Drapout and Softmax, then constructing a branch of the network, combining a feature map of an increment-A module with an output feature map of an increment-C module to form an encoder with a multi-channel training branch, and fully extracting context information of the network.
Step 2: add the double pyramid module and then fuse the features to form the encoder module. The first pyramid module is established after the Stem module to form a parallel training network: a 35 × 35 × 384 feature map is fed into it, atrous convolutions with sampling rates of 1, 6, 12 and 18 fully extract the first-stage multi-scale features, the feature maps of the five branches of the ASPP1 module are then fused into a 35 × 35 × 256 feature map, and a 1 × 1 convolution is applied to the fused map. To match the size of the feature map to be fused, a 4 × 4 max pooling operation is added after the 1 × 1 convolution, reducing the feature map size to 32 × 32. The second pyramid module is established after the Reduction-A module of the Inception V-4 network; the feature map received by the ASPP2 module is 17 × 17 × 1024, and to match the feature map size the convolution sampling rates in the ASPP2 module are set to 1, 6, 8 and 12. The four branches of the ASPP2 module are then fused together and output as a 17 × 17 × 512 feature map through a 1 × 1 convolution layer, after which a 2 × 2 max pooling layer is added to match the size of the fused feature map, giving a 16 × 16 × 512 feature map. After the double pyramid module finishes training, the feature maps are fused with the correspondingly sized feature maps in the decoder.
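A sketch of one pyramid group in PyTorch. Reading ASPP1's "five branches" as a rate-1 (1 × 1) convolution, three atrous convolutions and an image-pooling branch follows the usual ASPP layout and is an assumption; note also that the claims give the ASPP2 rates as [1, 6, 12] while this paragraph says 1, 6, 8 and 12, and the sketch follows the claims. The stride-1 max poolings reproduce the stated size matches (35 to 32 and 17 to 16):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """One group of the double pyramid: a rate-1 (1 x 1) branch, parallel
    atrous convolutions, and an image-pooling branch, fused by a 1 x 1 conv."""
    def __init__(self, in_ch, out_ch, rates):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                      # rate-1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous branches
             for r in rates])
        self.image_pool = nn.Sequential(                         # global-context branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d((len(self.branches) + 1) * out_ch, out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))  # 1 x 1 fusion

# ASPP1 after the Stem: 35 x 35 x 384 in, 256 channels out; a 4 x 4 max
# pooling with stride 1 then reduces 35 x 35 to 32 x 32 (35 - 4 + 1 = 32).
aspp1 = nn.Sequential(ASPP(384, 256, rates=(6, 12, 18)), nn.MaxPool2d(4, stride=1))

# ASPP2 after Reduction-A: 17 x 17 x 1024 in, 512 channels out; a 2 x 2 max
# pooling with stride 1 then reduces 17 x 17 to 16 x 16 (17 - 2 + 1 = 16).
aspp2 = nn.Sequential(ASPP(1024, 512, rates=(6, 12)), nn.MaxPool2d(2, stride=1))
```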
Step 3: design the decoder as four convolution modules, each convolution block containing an up-sampling operation, and finally restore the image size through bilinear up-sampling. The decoder parameters are shown in Table 1 below:
Table 1: decoder parameters (presented as an image in the original publication; the values are not recoverable here).
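Since Table 1 survives only as an image, the channel widths below are illustrative assumptions; the structure itself (four convolution blocks, each with an up-sampling operation, finished by bilinear up-sampling and Softmax over the six Vaihingen classes) follows the text:

```python
import torch.nn as nn
import torch.nn.functional as F

def up_block(in_ch, out_ch):
    """One decoder block: 3 x 3 convolution followed by 2x up-sampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))

class Decoder(nn.Module):
    def __init__(self, in_ch=1536, num_classes=6, widths=(512, 256, 128, 64)):
        super().__init__()  # widths are assumptions; Table 1 is not recoverable
        chs = (in_ch,) + widths
        self.blocks = nn.Sequential(*[up_block(chs[i], chs[i + 1]) for i in range(4)])
        self.classify = nn.Conv2d(widths[-1], num_classes, 1)

    def forward(self, x, out_size=(299, 299)):
        x = self.classify(self.blocks(x))
        # Final bilinear up-sampling back to the input size, then Softmax.
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        return x.softmax(dim=1)
```

In the full network, the fused double-pyramid feature maps would also be concatenated into the correspondingly sized decoder blocks; that wiring is omitted here for brevity.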
FIG. 6 is a comparison of prediction results on the Vaihingen validation set according to an embodiment of the present invention. With reference to fig. 5, table 2 and table 3, experiments were conducted to compare mainstream segmentation networks, including FCN32, SegNet, PspNet and U-Net.
Table 2: IoU score comparison on the Vaihingen dataset (presented as an image in the original publication; the values are not recoverable here).
Method   Imp.S.  Build.  Low.V.  Tree   Car    Overall
FCN      85.15   94.42   79.26   76.51  80.16  84.25
SegNet   82.65   89.76   76.04   73.74  71.63  82.69
PspNet   71.39   75.58   65.71   45.32  40.61  63.36
U-Net    83.49   90.14   78.36   75.41  73.10  83.20
Ours     93.97   97.77   90.06   88.23  90.94  96.43

Table 3: F1 score comparison on the Vaihingen dataset
To verify whether the modifications to the network play a positive role, the unmodified backbone network alone is also used as a segmentation network for testing and comparison; see fig. 6 for details.
As can be seen from the data comparison, the high-precision remote sensing image segmentation method based on the weighted loss fusion network provided by the invention achieves good results in remote sensing image segmentation.
FIG. 7 is a diagram of a media device according to an embodiment of the present invention, including a memory 100 and a processor 200. The memory 100 stores the data used while the processor runs, and the processor 200 is configured to execute the following steps: preprocessing the remote sensing image to obtain training data for a convolutional neural network; constructing a convolutional neural network comprising an encoder with multi-channel training branches, and performing context extraction on the training data through the encoder; constructing a double pyramid module, extracting feature maps of the training data through its two groups of spatial pyramids with different convolution expansion rates, and outputting corresponding first feature maps; up-sampling the obtained feature maps to obtain second feature maps of different sizes, and fusing the first feature maps with the second feature maps; and constructing a perceptual loss network, calculating the loss of the fused feature map through this loss network, weight-fusing the calculated loss with the loss of the training network, and back-propagating it to update the parameters.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (11)

1. A remote sensing image segmentation method based on a weighted loss fusion network is characterized by comprising the following steps:
S100, preprocessing the remote sensing image to obtain training data for a convolutional neural network;
S200, constructing a convolutional neural network comprising an encoder with multi-channel training branches, and performing context extraction on the training data through the encoder;
S300, constructing a double pyramid module, extracting feature maps of the training data through its two groups of spatial pyramids with different convolution expansion rates, and outputting corresponding first feature maps;
S400, up-sampling the feature maps obtained in step S300 to obtain second feature maps of different sizes, and fusing the first feature maps with the second feature maps;
S500, constructing a perceptual loss network, calculating the loss of the fused feature map through this loss network, weight-fusing the calculated loss with the loss of the training network, and back-propagating it to update the parameters.
2. The remote sensing image segmentation method based on the weighted loss fusion network as claimed in claim 1, wherein the preprocessing in S100 comprises: performing image cutting, normalization and data expansion on the remote sensing image, wherein the data expansion operations are flips in the horizontal and vertical directions, respectively.
3. The remote sensing image segmentation method based on the weighted loss fusion network as claimed in claim 1, wherein S200 comprises: the convolutional neural network uses Inception V-4 pre-trained on the ImageNet dataset as the network backbone, removes the average pooling layer at its end, and adds network branches to the backbone to obtain an encoder with multi-channel training branches, wherein the network branches are used to retain shallow image features.
4. The remote sensing image segmentation method based on the weighted loss fusion network as claimed in claim 3, wherein the S200 comprises:
removing the average pooling layer and all subsequent layers, establishing a separate network branch for the output after the Reduction-A layer, and fusing the branch with the end of the trunk network after a 2 × 2 max pooling layer.
5. The remote sensing image segmentation method based on the weighted loss fusion network as claimed in claim 3, wherein the S300 comprises:
S310, establishing the first spatial pyramid behind the Stem module as a parallel training network, extracting the first-stage multi-scale features of the training data with the first spatial pyramid, fusing the feature maps of the five branches of the ASPP1 module into a corresponding fused feature map, and applying a 1 × 1 convolution to the fused feature map;
S320, performing a 4 × 4 max pooling operation to reduce the feature map size;
S330, constructing the second spatial pyramid at the Reduction-A module of the Inception V-4 network, receiving the corresponding feature map through the ASPP2 module, fusing the four branches of the ASPP2 module, outputting the feature map through a 1 × 1 convolution layer, and adding a 2 × 2 max pooling layer after the convolution layer to obtain the feature map.
6. The remote sensing image segmentation method based on the weighted loss fusion network as claimed in claim 5, wherein the expansion rates of the first spatial pyramid and the second spatial pyramid are set to [1, 6, 12, 18] and [1, 6, 12], respectively.
7. The remote sensing image segmentation method based on the weighted loss fusion network as claimed in claim 1, wherein the S400 comprises:
the image size is recovered by four sets of convolution block sampling modules.
8. The remote sensing image segmentation method based on the weighted loss fusion network as claimed in claim 1, wherein S500 comprises: constructing a perceptual loss network using a pre-trained VGG16 network, feeding the prediction map produced by the segmentation network into the loss network to calculate a loss, then weight-fusing this loss with the loss calculated by the segmentation network and back-propagating it to update the parameters.
9. The remote sensing image segmentation method based on the weighted loss fusion network as claimed in claim 1, wherein the method further comprises:
the segmentation method of S100 to S400 is evaluated using IoU and F1 as evaluation indexes, wherein
IoU = TP / (TP + FP + FN)

F1 = 2TP / (2TP + FP + FN)
where P is the number of positive samples, N is the number of negative samples, TP is the number of correctly predicted positive samples, FP is the number of samples incorrectly predicted as positive, TN is the number of correctly predicted negative samples, FN is the number of samples incorrectly predicted as negative, and the number of samples is the number of pixels per picture.
10. A remote sensing image segmentation apparatus based on a weighted loss fusion network, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method steps of any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 8.
CN202011097624.1A, filed 2020-10-14: Remote sensing image segmentation method, device and medium based on weighted loss fusion network (pending, published as CN112364699A)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011097624.1A | 2020-10-14 | 2020-10-14 | Remote sensing image segmentation method, device and medium based on weighted loss fusion network


Publications (1)

Publication Number Publication Date
CN112364699A | 2021-02-12

Family

ID=74506688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011097624.1A | Remote sensing image segmentation method, device and medium based on weighted loss fusion network | 2020-10-14 | 2020-10-14 (pending as CN112364699A)

Country Status (1)

Country Link
CN (1) CN112364699A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861829A (en) * 2021-04-13 2021-05-28 山东大学 Water body extraction method and system based on deep convolutional neural network
CN113298817A (en) * 2021-07-02 2021-08-24 贵阳欧比特宇航科技有限公司 High-accuracy semantic segmentation method for remote sensing image
CN113298092A (en) * 2021-05-28 2021-08-24 有米科技股份有限公司 Neural network training method and device for extracting multi-level image contour information
CN113378897A (en) * 2021-05-27 2021-09-10 浙江省气候中心 Neural network-based remote sensing image classification method, computing device and storage medium
CN113822428A (en) * 2021-08-06 2021-12-21 中国工商银行股份有限公司 Neural network training method and device and image segmentation method
CN113989287A (en) * 2021-09-10 2022-01-28 国网吉林省电力有限公司 Urban road remote sensing image segmentation method and device, electronic equipment and storage medium
CN114092815A (en) * 2021-11-29 2022-02-25 自然资源部国土卫星遥感应用中心 Remote sensing intelligent extraction method for large-range photovoltaic power generation facility
CN114387512A (en) * 2021-12-28 2022-04-22 南京邮电大学 Remote sensing image building extraction method based on multi-scale feature fusion and enhancement

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
WO2020143323A1 (en) * 2019-01-08 2020-07-16 平安科技(深圳)有限公司 Remote sensing image segmentation method and device, and storage medium and server
US20200250462A1 (en) * 2018-11-16 2020-08-06 Beijing Sensetime Technology Development Co., Ltd. Key point detection method and apparatus, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
US20200250462A1 (en) * 2018-11-16 2020-08-06 Beijing Sensetime Technology Development Co., Ltd. Key point detection method and apparatus, and storage medium
WO2020143323A1 (en) * 2019-01-08 2020-07-16 平安科技(深圳)有限公司 Remote sensing image segmentation method and device, and storage medium and server

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861829A (en) * 2021-04-13 2021-05-28 山东大学 Water body extraction method and system based on deep convolutional neural network
CN112861829B (en) * 2021-04-13 2023-06-30 山东大学 Water body extraction method and system based on deep convolutional neural network
CN113378897A (en) * 2021-05-27 2021-09-10 浙江省气候中心 Neural network-based remote sensing image classification method, computing device and storage medium
CN113298092A (en) * 2021-05-28 2021-08-24 有米科技股份有限公司 Neural network training method and device for extracting multi-level image contour information
CN113298817A (en) * 2021-07-02 2021-08-24 贵阳欧比特宇航科技有限公司 High-accuracy semantic segmentation method for remote sensing image
CN113822428A (en) * 2021-08-06 2021-12-21 中国工商银行股份有限公司 Neural network training method and device and image segmentation method
CN113989287A (en) * 2021-09-10 2022-01-28 国网吉林省电力有限公司 Urban road remote sensing image segmentation method and device, electronic equipment and storage medium
CN114092815A (en) * 2021-11-29 2022-02-25 自然资源部国土卫星遥感应用中心 Remote sensing intelligent extraction method for large-range photovoltaic power generation facility
CN114092815B (en) * 2021-11-29 2022-04-15 自然资源部国土卫星遥感应用中心 Remote sensing intelligent extraction method for large-range photovoltaic power generation facility
CN114387512A (en) * 2021-12-28 2022-04-22 南京邮电大学 Remote sensing image building extraction method based on multi-scale feature fusion and enhancement
CN114387512B (en) * 2021-12-28 2024-04-19 南京邮电大学 Remote sensing image building extraction method based on multi-scale feature fusion and enhancement


Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination