CN112489054A - Remote sensing image semantic segmentation method based on deep learning - Google Patents
- Publication number: CN112489054A (application CN202011359068.0A)
- Authority: CN (China)
- Prior art keywords: convolution, remote sensing, network, sensing image
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/11—Region-based segmentation (under G06T7/00—Image analysis; G06T7/10—Segmentation; Edge detection)
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks (under G06N3/04—Architecture, e.g. interconnection topology)
- G06N3/08—Learning methods
- G06T3/4007—Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
- G06T5/00—Image enhancement or restoration
- G06T7/90—Determination of colour characteristics
- G06T2207/20021—Dividing image into blocks, subimages or windows (under G06T2207/20—Special algorithmic details)
Abstract
The invention discloses a remote sensing image semantic segmentation method based on deep learning, belonging to the technical field of machine vision. Aiming at the difficulty that mainstream deep convolutional neural network segmentation methods have in capturing the features of small objects, and at their insufficient segmentation precision, the method improves the Deeplabv3 algorithm: the single upsampling layer is replaced by multi-layer upsampling that uses residual features taken from the backbone network, so that the image retains complete semantics at full resolution; meanwhile, the dilation rates of the 4 dilated convolutions in the ASPP layer are modified so that the network segments small objects better. The results show that the improved Deeplabv3 semantic segmentation algorithm achieves an mIoU of 94.92% and a pixel accuracy of 98.01% on a self-made data set, improvements of 3.77% and 2.40% respectively over the original algorithm; it is more accurate, more robust on the segmentation of various terrains, suited to complex urban remote sensing image environments, and applicable in fields such as urban planning, agricultural planning and military simulation.
Description
Technical Field
The invention belongs to the technical field of machine vision, and particularly relates to a remote sensing image semantic segmentation method based on deep learning.
Background
With the continuous development of remote sensing technology, the semantic information contained in remote sensing images is increasingly rich. How to perform semantic segmentation on remote sensing images, quickly and accurately extract the important semantic information, and put it to later use is therefore a very important research topic. Semantic segmentation of remote sensing images has a wide range of applications, including urban planning, geological disaster prevention and control, and military war simulation. In military war simulation in particular, the semantic information segmented from remote sensing images plays an extremely important role in the rapid generation of realistic battlefield terrain and the rapid construction of environments.
Semantic segmentation methods for remote sensing images generally fall into two categories: traditional graphics algorithms and deep learning-based algorithms. Traditional algorithms include edge detection-based, threshold-based and region-based image segmentation. Edge detection-based segmentation imitates the human visual process: it separates image edges from the background and perceives image details, thereby recognizing object contours. Threshold-based segmentation exploits the difference in grey-level characteristics between the target of interest and the background, using one or more thresholds to divide the grey levels of the image into several classes; pixels in the same class are identified as the same object. Region-based segmentation selects a small region inside the target to be segmented as a seed region, according to a criterion of consistent regional attribute characteristics; on that basis it determines the regional membership of each pixel, repeatedly adds surrounding pixels as new seed regions according to a given criterion, and finally merges all pixels with the specified characteristics into the region. Although these methods can segment a complete scene, their segmentation accuracy is far inferior to deep learning methods.
Disclosure of Invention
Aiming at the difficulty that mainstream deep convolutional neural network semantic segmentation methods have in capturing the features of small objects, and at their insufficient segmentation precision, the invention provides a remote sensing image semantic segmentation method based on deep learning. The method is suitable for segmenting complex urban-surface remote sensing images and is intended for machine-vision semantic segmentation.
In order to achieve the purpose, the invention adopts the following technical scheme:
a remote sensing image semantic segmentation method based on deep learning comprises the following steps:
step 1, marking collected remote sensing data by using a labelme tool to obtain a marking result;
step 2, performing data enhancement on the labeling result obtained in the step 1 to obtain a data set;
step 3, designing a network;
step 4, reading the data set obtained in the step 2 into the network designed in the step 3 for training;
step 5, selecting through evaluation the network weight trained in the step 4, reading it into the network, reading a picture to be predicted into the network, and calculating to obtain a Logit;
step 6, analyzing the Logit score, giving each pixel its corresponding color to represent a specific classification, and finally obtaining a segmentation result.
Further, the specific method for data enhancement in the step 2 is as follows: randomly crop the original remote sensing image and the labelled mask, each crop yielding a 256 × 256-pixel picture; rotate, flip, blur, Gaussian-filter, bilaterally filter and add white noise to each cropped picture to obtain the enhanced pictures, and then build the data set.
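The enhancement pipeline of step 2 can be sketched as follows. This is a minimal NumPy illustration, not the patent's actual implementation: blurring, Gaussian filtering and bilateral filtering would come from a library such as OpenCV and are omitted here, and the noise level and helper names are assumptions.

```python
import numpy as np

def random_crop(img, mask, size=256, rng=None):
    """Crop the same random size x size window from image and mask."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    return img[y:y + size, x:x + size], mask[y:y + size, x:x + size]

def augment(img, mask, rng=None):
    """Return rotated, flipped and white-noise variants of one crop."""
    rng = rng or np.random.default_rng()
    out = []
    for k in range(4):                                   # 0/90/180/270 degrees
        r_img, r_mask = np.rot90(img, k), np.rot90(mask, k)
        out.append((r_img, r_mask))
        out.append((np.fliplr(r_img), np.fliplr(r_mask)))
    # white noise corrupts the image only; the label mask must stay untouched
    noisy = np.clip(img.astype(np.float64) + rng.normal(0, 10, img.shape), 0, 255)
    out.append((noisy.astype(img.dtype), mask))
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (512, 512, 3), dtype=np.uint8)
mask = rng.integers(0, 5, (512, 512), dtype=np.uint8)
crop, crop_mask = random_crop(img, mask, 256, rng)
variants = augment(crop, crop_mask, rng)
print(len(variants))  # 9 variants per crop
```

Note that geometric transforms (rotation, flip) are applied identically to image and mask, while photometric corruptions touch only the image.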
Further, the specific method for designing the network in step 3 is as follows:
the backbone network is formed by a ResNet-50, starting with a 7 × 7 convolution with stride 2 and 64 output channels, followed by 3 × 3 max pooling with stride 2;
then come three bottleneck blocks whose 1 × 1, 3 × 3 and 1 × 1 convolutions (stride 1) output 64, 64 and 256 channels respectively; four blocks with output channels 128, 128 and 512; six blocks with output channels 256, 256 and 1024; and three blocks with output channels 512, 512 and 2048;
after the ASPP module with modified dilation rates, the five parallel sub-modules are respectively:
a 1 × 1 convolution with stride 1 and 256 output channels;
a 3 × 3 convolution with stride 1, dilation rate 3 and 256 output channels;
a 3 × 3 convolution with stride 1, dilation rate 6 and 256 output channels;
a 3 × 3 convolution with stride 1, dilation rate 9 and 256 output channels;
the last layer is global average pooling with 256 output channels. A larger dilation rate segments large objects better, but is a disadvantage for small objects. Moreover, a high dilation rate samples the input sparsely, so the information gathered by the long-distance convolution is uncorrelated, which harms the classification result. For remote sensing images, dilation rates such as 1, 6, 12 and 18 are too large: the large receptive field is unfavourable for segmenting the tiny objects in the image. How to adjust the dilation rates in the ASPP module to handle the relation between large and small objects is therefore the key to designing the dilated convolution network, so dilated convolutions with rates 1, 3, 6 and 9 are used respectively. The modified rates reduce the receptive field of the ASPP module to a certain extent and balance the sensitivity of the network to large and small objects. At the same time, the reduced rates make the sampled input signal dense, avoiding the convolution failure caused by an excessive dilation rate. The network can thus obtain a finer segmentation result for small objects.
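To make the dilation-rate discussion concrete, here is a single-channel NumPy sketch of the ASPP branch layout. The patent's module uses learned 256-channel convolutions; the uniform kernel below is only a stand-in to show the structure. A 3 × 3 convolution with dilation rate r covers a (2r + 1) × (2r + 1) window, so rate 9 sees 19 × 19 pixels where rate 18 would see 37 × 37.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """'Same'-padded 3 x 3 dilated convolution on a single-channel array."""
    k = kernel.shape[0]
    pad = rate * (k - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=np.float64)
    h, w = x.shape
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * xp[i * rate:i * rate + h, j * rate:j * rate + w]
    return out

def aspp(x, rates=(3, 6, 9)):
    """Five parallel ASPP-style branches stacked on a new channel axis."""
    k3 = np.full((3, 3), 1.0 / 9.0)          # stand-in for a learned kernel
    branches = [x.astype(np.float64)]         # 1 x 1 convolution branch
    branches += [dilated_conv2d(x, k3, r) for r in rates]
    branches.append(np.full_like(x, x.mean(), dtype=np.float64))  # global avg pool
    return np.stack(branches)

feat = np.random.default_rng(1).normal(size=(64, 64))
out = aspp(feat)
print(out.shape)  # (5, 64, 64): five branches at the same spatial resolution
```

Because all five branches keep the input resolution, they can be concatenated along the channel axis, exactly as the following paragraph describes.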
Because the five sub-modules output feature maps of the same resolution, they are concatenated along the channel dimension into a 1280-channel feature, which a 1 × 1 convolution fuses down to 256 output channels; bilinear interpolation upsampling then restores the feature to 64 × 64 pixels; this is concatenated on the channel dimension with the feature map of the initial 7 × 7 convolution, giving a feature with 512 output channels. The feature map produced by the first 7 × 7 convolution of ResNet-50 has been max-pooled only once, so it retains higher resolution and more complete spatial position information; merging the post-ASPP feature map with it along the channel dimension builds a Decoder-like module, and the rich spatial position information in the lower layer gives the segmentation result a finer pixel-position recovery. Compared with the original network, the improved upsampling module adds only 256 + 256 × 3 × 3 × 2 = 4864 parameters, with negligible effect on the computational cost of the whole network.
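The feature-map sizes and the 4864-parameter figure above can be checked in a few lines. The padding values are assumptions (the patent does not state them), chosen to match the usual ResNet-50 stem.

```python
def conv_out(n, k, s, p):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * p - k) // s + 1

n = conv_out(256, 7, 2, 3)    # 7 x 7 conv, stride 2, assumed padding 3
print("after stem conv:", n)          # 128
n = conv_out(n, 3, 2, 1)      # 3 x 3 max pooling, stride 2, assumed padding 1
print("after max pooling:", n)        # 64: matches the 64 x 64 upsampling target

# parameter increase of the improved upsampling module, as stated in the text
extra = 256 + 256 * 3 * 3 * 2
print("extra parameters:", extra)     # 4864
```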
Finally, two 3 × 3 convolutions with stride 1 are applied, bilinear interpolation upsampling restores the image resolution to 256 × 256, and a 1 × 1 convolution changes the number of channels to 5, yielding the Logit.
Further, the specific method for evaluating the network training in the step 5 is as follows: the mean intersection-over-union (mIoU), the per-class IoU of buildings, vegetation, water systems and roads, and the pixel accuracy are taken as detection evaluation indexes. Since remote sensing semantic segmentation is a classification task, each prediction falls into one of four cases: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The IoU of a class is the ratio of the intersection to the union of the ground-truth and predicted sets, i.e. IoU = TP / (TP + FP + FN), and the mean over all classes is

mIoU = (1 / (k + 1)) · Σ_{i=0}^{k} [ p_{ii} / ( Σ_{j=0}^{k} p_{ij} + Σ_{j=0}^{k} p_{ji} − p_{ii} ) ]

where k + 1 is the number of categories including the background class, p_{ii} is the number of correctly predicted pixels, and p_{ij} and p_{ji} denote numbers of falsely detected pixels. mIoU considers all classes: the IoU of each class is summed and averaged to obtain a global evaluation. The larger the mIoU, the better the network is trained and the closer its output is to the correct segmentation. The convergence of the network can also be judged from the change of the mIoU: the smaller the change, the closer the network is to convergence. The invention therefore uses the mIoU to monitor the training situation and to find the optimal set of weights.
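The evaluation above can be sketched with a confusion matrix; a NumPy illustration follows, where the class count and the small sample arrays are made up for the demo.

```python
import numpy as np

def evaluate(pred, gt, num_classes):
    """mIoU and pixel accuracy from a confusion matrix.

    IoU_i = p_ii / (sum_j p_ij + sum_j p_ji - p_ii); mIoU averages over classes.
    """
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)   # rows: truth, cols: prediction
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=1) + cm.sum(axis=0) - inter
    iou = inter / np.maximum(union, 1)
    pixel_acc = inter.sum() / cm.sum()
    return iou.mean(), pixel_acc

gt   = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
miou, acc = evaluate(pred, gt, 3)
print(round(miou, 4), round(acc, 4))  # 0.7222 0.8333
```

In training, the weights of the epoch with the highest mIoU would be kept, matching the selection rule described in the text.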
Further, the specific method for reading the picture to be predicted into the network in the step 5 is as follows: cut 256 × 256-pixel pictures from the remote sensing image, moving from the upper-left corner rightwards and downwards, so that the first columns of two horizontally adjacent pictures are 256 pixels apart and the first rows of two vertically adjacent pictures are also 256 pixels apart; when a pre-cut picture at the edge of the remote sensing image is smaller than 256 × 256 pixels, cut 256 × 256 pixels in the opposite direction, taking that picture as reference. After the cut pictures have been predicted, splice them back according to the cutting rule to obtain the complete Logit score map of the remote sensing image.
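A sketch of this tiling rule (function name is illustrative, and images are assumed to be at least 256 pixels on each side): tiles are laid out at a 256-pixel stride, and an undersized edge tile is re-cut backwards so it still measures 256 × 256.

```python
def tile_origins(h, w, tile=256):
    """Top-left corners of the 256 x 256 prediction tiles for an h x w image."""
    ys = list(range(0, h - tile + 1, tile))
    xs = list(range(0, w - tile + 1, tile))
    if ys[-1] + tile < h:
        ys.append(h - tile)   # edge tile cut backwards from the bottom
    if xs[-1] + tile < w:
        xs.append(w - tile)   # edge tile cut backwards from the right
    return [(y, x) for y in ys for x in xs]

origins = tile_origins(600, 520)
print(len(origins), origins[-1])  # 9 tiles; the last starts at (344, 264)
```

Every tile lies fully inside the image, so edge tiles overlap their neighbours slightly; on splicing, later tiles simply overwrite the overlap.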
Further, the specific method in the step 6 of analysing the Logit score, giving each pixel its corresponding colour to represent a specific classification, and finally obtaining the segmentation result is as follows: the score map has 5 channels, and the 5 channels of each pixel hold the scores of the building, vegetation, water system, road and other classes respectively; the highest score gives the category of the current pixel. Create a zero matrix whose resolution is the original resolution of the test picture and whose number of channels is 3. Examine the class of each pixel in the score map: for a building the pixel value is [31,102,156]; for vegetation, [0,255,0]; for a water system, [255,0,0]; for a road, [192,192,192]; for the other class, [255,255,255]. Each pixel is coloured in this way, and exporting the resulting matrix gives the segmentation result. The segmentation result is thus well visualised and easy to read and understand.
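The colouring step can be sketched as a palette lookup. Note that the water-system and other-class colour values appear truncated in the source, so the [255,0,0] and [255,255,255] entries below are assumed reconstructions.

```python
import numpy as np

# class index -> RGB colour; water-system and other-class values are assumptions
PALETTE = np.array([
    [31, 102, 156],    # building
    [0, 255, 0],       # vegetation
    [255, 0, 0],       # water system (assumed reconstruction)
    [192, 192, 192],   # road
    [255, 255, 255],   # other (assumed reconstruction)
], dtype=np.uint8)

def colorize(logits):
    """Argmax over the 5 score channels, then paint each pixel's class colour."""
    labels = logits.argmax(axis=-1)   # (H, W) class indices
    return PALETTE[labels]            # (H, W, 3) uint8 segmentation image

scores = np.random.default_rng(0).normal(size=(256, 256, 5))
seg = colorize(scores)
print(seg.shape, seg.dtype)  # (256, 256, 3) uint8
```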
Compared with the prior art, the invention has the following advantages:
the invention provides an algorithm, which aims at optimizing the segmentation effect on small objects, and constructs a network model more suitable for the semantic segmentation of remote sensing images from changing a single up-sampling structure and reducing the overlarge receptive field of an ASPP (automatic sequence protocol) module, so that the problems of difficult segmentation, low segmentation precision and the like of the small objects are solved.
Mainstream semantic segmentation networks are usually evaluated on data sets such as MS-COCO, whose objects are large; in remote sensing images the segmentation targets are small, so these networks often segment small objects poorly. Aiming at the semantic segmentation of remote sensing images and the high difficulty of segmenting tiny objects in complex environments, the invention proposes an improved Deeplabv3 algorithm based on deep learning that modifies the upsampling module and adjusts the dilation rates of the ASPP module to construct a network model suitable for remote sensing image segmentation, thereby enhancing the ability to segment small objects in complex environments. It effectively solves the poor segmentation of small objects such as vegetation and buildings, and improves segmentation precision with a good segmentation effect.
Drawings
FIG. 1 is a data set annotation interface diagram;
FIG. 2 is a network layout of the present method;
FIG. 3 is a diagram of an ASPP module architecture;
FIG. 4 shows the mIoU convergence during network training;
FIG. 5 is an original image of a remote sensing image for testing;
FIG. 6 is the segmentation result of the present invention.
Detailed Description
Example 1
The invention relates to a remote sensing image semantic segmentation method based on deep learning, which is used for carrying out semantic segmentation on a high-precision image and comprises the following specific steps:
Step 1, labelling the data set: label the collected high-precision remote sensing images with the professional labelme software to obtain the corresponding mask images. Process the obtained masks and convert them into 8-bit grey-scale maps as the labels used by the training network.
Step 2, performing data enhancement on the labelling result obtained in the step 1: randomly crop the original remote sensing images and the labelled masks, each crop yielding a 256 × 256-pixel picture. Then rotate, flip, blur, Gaussian-filter, bilaterally filter and add white noise to each cropped picture to obtain the enhanced data set, as shown in the data set annotation interface diagram of fig. 1.
Wherein, before enhancement, part of the data is selected as the training set and the remainder as the test set. Gaussian filtering, bilateral filtering, etc. are prior art and are not described in detail here.
As shown in the network design diagram of fig. 2, the network structure is designed as follows: the backbone network consists of a 7 × 7 convolution with stride 2 and 64 output channels, followed by 3 × 3 max pooling with stride 2. Then come three bottleneck blocks whose 1 × 1, 3 × 3 and 1 × 1 convolutions (stride 1) output 64, 64 and 256 channels respectively; four blocks with output channels 128, 128 and 512; six blocks with output channels 256, 256 and 1024; and three blocks with output channels 512, 512 and 2048. After the ASPP module with modified dilation rates, the five parallel sub-modules are: a 1 × 1 convolution with stride 1 and 256 output channels; a 3 × 3 convolution with stride 1, dilation rate 3 and 256 output channels; a 3 × 3 convolution with stride 1, dilation rate 6 and 256 output channels; a 3 × 3 convolution with stride 1, dilation rate 9 and 256 output channels; and global average pooling with 256 output channels, as shown in the ASPP module structure diagram of fig. 3. Because the five sub-modules output feature maps of the same resolution, they are concatenated along the channel dimension into a 1280-channel feature, which a 1 × 1 convolution fuses down to 256 output channels. Bilinear interpolation upsampling then restores the feature to 64 × 64 pixels.
The feature is then concatenated on the channel dimension with the output of the initial 7 × 7 convolution, giving a feature with 512 output channels. Finally, two 3 × 3 convolutions with stride 1 are applied, bilinear interpolation upsampling restores the image resolution to 256 × 256, and a 1 × 1 convolution changes the number of channels to 5, yielding the Logit.
The network evaluation method is as follows: the mean intersection-over-union (mIoU), the per-class IoU of buildings, vegetation, water systems and roads, and the pixel accuracy are taken as detection evaluation indexes. The weights for which the mIoU reaches 94.92% and the pixel accuracy 98.01% are finally selected, as shown by the mIoU convergence during network training in fig. 4.
Step 5, input the high-precision remote sensing image for testing into the network for prediction: cut 256 × 256-pixel pictures from the remote sensing image, moving from the upper-left corner rightwards and downwards, so that the first columns of two horizontally adjacent pictures are 256 pixels apart and the first rows of two vertically adjacent pictures are also 256 pixels apart; when a pre-cut picture at the image edge is smaller than 256 × 256 pixels, cut 256 × 256 pixels in the opposite direction, taking that picture as reference. After the cut pictures have been predicted, splice them back according to the cutting rule to obtain the complete Logit score map of the remote sensing image.
Step 6, analyse the obtained Logit score and draw the final segmentation picture: the Logit has 5 channels, and the 5 channels of each pixel hold the scores of the building, vegetation, water system, road and other classes respectively; the highest score gives the category of the current pixel. Accordingly, create a zero matrix of size 256 × 256 with 3 channels. Examine the class of each pixel in the score map: for the other class the pixel value is [255,255,255]; for a building, [31,102,156]; for vegetation, [0,255,0]; for a water system, [255,0,0]; for a road, [192,192,192]; then draw the image segmentation result. Fig. 5 shows the original remote sensing image used for testing, and fig. 6 the segmentation result of the invention.
Aiming at the semantic segmentation of remote sensing images and the high difficulty of segmenting tiny objects in complex environments, the invention proposes an improved Deeplabv3 algorithm based on deep learning that modifies the upsampling module and adjusts the dilation rates of the ASPP module to construct a network model suitable for remote sensing image segmentation, thereby enhancing the ability to segment small objects in complex environments. The experimental results show that the proposed network model effectively solves the poor segmentation of small objects such as vegetation and buildings, improves the segmentation precision, and has a good segmentation effect.
TABLE 1 analysis of efficiency
As can be seen from Table 1, in terms of mean intersection-over-union (mIoU), the original Deeplabv3 algorithm reaches 91.15%, U-Net 87.95%, SegNet 86.88%, HR-Net 92.88% and DANet 95.16%, while the improved method of the invention reaches 94.92%, slightly lower than DANet but 3.77% and 2.04% higher than the original Deeplabv3 and HR-Net algorithms.
In terms of vegetation intersection-over-union (IoU), the original Deeplabv3 algorithm reaches 85.25%, the DANet algorithm 90.84% and the HR-Net algorithm 82.83%, while the method of the invention reaches 88.66%, improvements of 3.41% and 1.9% over the original Deeplabv3 and DANet algorithms.
In terms of building intersection-over-union (IoU), the original Deeplabv3 algorithm reaches 90.06%, the DANet algorithm 90.50% and the HR-Net algorithm 91.64%, while the improved method reaches 93.83%, which is 3.77% and 3.33% higher than the original Deeplabv3 and DANet algorithms.
In pixel accuracy, the improved method of the invention reaches 98.01%, which is 2.40%, 0.67% and 4.16% higher than the original Deeplabv3 algorithm, the DANet algorithm, and the SegNet algorithm with the worst segmentation effect, respectively.
Those skilled in the art will appreciate that the invention may be practised without these specific details. Although illustrative embodiments of the invention have been described above to help those skilled in the art understand it, the invention is not limited to the scope of these embodiments; to those skilled in the art, any change that makes use of the inventive concepts within the spirit and scope of the invention as defined and determined by the appended claims is protected.
Claims (6)
1. A remote sensing image semantic segmentation method based on deep learning, characterized by comprising the following steps:
step 1, marking collected remote sensing data by using a labelme tool to obtain a marking result;
step 2, performing data enhancement on the labeling result obtained in the step 1 to obtain a data set;
step 3, designing a network;
step 4, reading the data set in the step 2 into the design network in the step 3 for training;
step 5, selecting through evaluation the network weight trained in the step 4, reading it into the network, reading a picture to be predicted into the network, and calculating to obtain a Logit;
step 6, analyzing the Logit score, giving each pixel its corresponding color to represent a specific classification, and finally obtaining a segmentation result.
2. The remote sensing image semantic segmentation method based on deep learning of claim 1, characterized in that the specific method for data enhancement in the step 2 is as follows: randomly crop the original remote sensing image and the labelled mask, each crop yielding a 256 × 256-pixel picture; rotate, flip, blur, Gaussian-filter, bilaterally filter and add white noise to each cropped picture to obtain the enhanced pictures, and then build the data set.
3. The remote sensing image semantic segmentation method based on deep learning of claim 2, characterized in that the specific method of designing the network in step 3 is as follows:
the backbone network is built from ResNet-50: it begins with a convolution whose kernel is 7 × 7, with stride 2 and 64 output channels, followed by 3 × 3 max pooling with stride 2;
then come three bottleneck blocks whose convolution kernels are 1 × 1, 3 × 3 and 1 × 1, all with stride 1 and with 64, 64 and 256 output channels respectively; four blocks with kernels 1 × 1, 3 × 3 and 1 × 1, stride 1 and 128, 128 and 512 output channels; six blocks with kernels 1 × 1, 3 × 3 and 1 × 1, stride 1 and 256, 256 and 1024 output channels; and three blocks with kernels 1 × 1, 3 × 3 and 1 × 1, stride 1 and 512, 512 and 2048 output channels;
this is followed by the ASPP module with modified void (dilation) rates, whose five parallel sub-modules are respectively:
a convolution with a 1 × 1 kernel, stride 1 and 256 output channels;
a convolution with a 3 × 3 kernel, stride 1, void rate 3 and 256 output channels;
a convolution with a 3 × 3 kernel, stride 1, void rate 6 and 256 output channels;
a convolution with a 3 × 3 kernel, stride 1, void rate 9 and 256 output channels;
the last branch is global average pooling with 256 output channels;
because the five sub-modules output feature maps of the same resolution, they are concatenated along the channel dimension to obtain a feature with 1280 channels, which a 1 × 1 convolution fuses down to 256 output channels; bilinear interpolation then upsamples this feature to 64 × 64 pixels; the result is concatenated along the channel dimension with the feature from the initial 7 × 7 convolution, giving a feature with 512 output channels; this passes through two 3 × 3 convolutions with stride 1; finally, bilinear interpolation upsamples the feature back to the 256 × 256 image resolution and a 1 × 1 convolution changes the channel number to 5, yielding the Logit.
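The void rates 3, 6 and 9 above enlarge the receptive field without adding parameters: a k × k kernel with void rate r samples input points spaced r apart, so it covers k + (k − 1)(r − 1) pixels per side. A small sketch of that arithmetic (illustrative, not part of the claim):

```python
def effective_kernel(kernel: int, rate: int) -> int:
    """Pixels per side covered by a dilated (atrous) convolution kernel.

    A kernel x kernel filter with void rate `rate` samples input points
    spaced `rate` apart, spanning kernel + (kernel - 1) * (rate - 1) pixels.
    """
    return kernel + (kernel - 1) * (rate - 1)

# The ASPP branches above: 3 x 3 kernels with void rates 3, 6 and 9
coverage = {rate: effective_kernel(3, rate) for rate in (3, 6, 9)}
```

So the three dilated branches see 7-, 13- and 19-pixel neighborhoods respectively, which is why combining them captures objects at several scales.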
4. The remote sensing image semantic segmentation method based on deep learning of claim 3, characterized in that the method for evaluating the network training in step 5 is as follows: the mean intersection over union (mIoU) of the building, vegetation, water-system and road classes and the pixel accuracy are taken as the detection evaluation indices; since the semantic segmentation of remote sensing images is a classification task, a prediction result falls into one of four cases: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN); the IoU is the ratio of the intersection to the union of the set of true values and the set of predicted values, i.e.

IoU_i = p_ii / (Σ_j p_ij + Σ_j p_ji − p_ii),  mIoU = (1/(k+1)) · Σ_{i=0}^{k} IoU_i,

where k+1 is the number of categories including the background class, p_ii is the number of correctly predicted pixels, and p_ij and p_ji both denote numbers of falsely detected pixels; mIoU considers all classes: the IoU of each class is summed and averaged to obtain a global evaluation.
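The mIoU and pixel-accuracy indices above can be computed from a confusion matrix; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def iou_per_class(confusion):
    """Per-class IoU from a (k+1) x (k+1) confusion matrix.

    confusion[i, j] counts pixels of true class i predicted as class j,
    so IoU_i = p_ii / (sum_j p_ij + sum_j p_ji - p_ii).
    """
    confusion = np.asarray(confusion, dtype=float)
    diag = np.diag(confusion)
    union = confusion.sum(axis=1) + confusion.sum(axis=0) - diag
    return diag / np.maximum(union, 1e-12)   # guard against empty classes

def mean_iou(confusion):
    """mIoU: average the per-class IoU over all k+1 classes."""
    return iou_per_class(confusion).mean()

def pixel_accuracy(confusion):
    """Fraction of all pixels predicted correctly."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion).sum() / confusion.sum()
```

For example, for two classes with confusion matrix [[3, 1], [1, 3]], each class has IoU 3/5 = 0.6 and the pixel accuracy is 6/8 = 0.75.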
5. The remote sensing image semantic segmentation method based on deep learning of claim 4, characterized in that the specific method of reading the picture to be predicted into the network in step 5 is as follows: starting from the upper-left corner of the remote sensing image and proceeding left to right and top to bottom, crop a number of pictures of 256 × 256 pixels, such that the first columns of two adjacent pictures in the same row are 256 pixels apart and the first rows of two adjacent pictures in the same column are also 256 pixels apart; when a pre-cut picture at the edge of the remote sensing image would be smaller than 256 × 256 pixels, take that pre-cut picture as the reference and crop 256 × 256 pixels in the opposite direction; after the cropped pictures have been predicted, stitch them together according to the cropping rule, thereby obtaining the complete Logit score map of the remote sensing image.
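The cropping rule of claim 5 — fixed 256-pixel steps, with edge tiles shifted back so they still measure 256 × 256 — reduces to coordinate arithmetic. A sketch under the assumption that the image is at least 256 pixels on each side (function names are illustrative):

```python
def tile_origins(length, tile=256):
    """1-D crop origins: step `tile` pixels; if a strip shorter than `tile`
    remains at the edge, shift the last origin back so the final window ends
    exactly at the border ("crop in the opposite direction").
    Assumes length >= tile."""
    origins = list(range(0, length - tile + 1, tile))
    if origins[-1] + tile < length:          # leftover strip at the edge
        origins.append(length - tile)
    return origins

def tile_boxes(height, width, tile=256):
    """(top, left) corners of every tile x tile crop of an H x W image."""
    return [(top, left)
            for top in tile_origins(height, tile)
            for left in tile_origins(width, tile)]
```

For a 512 × 600 image this yields vertical origins (0, 256) and horizontal origins (0, 256, 344): the right-edge tiles overlap their neighbors, and stitching reuses the same coordinates.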
6. The remote sensing image semantic segmentation method based on deep learning of claim 5, characterized in that the specific method in step 6 of analyzing the Logit score, assigning each pixel the color of its class to represent the specific classification, and finally obtaining the segmentation result is as follows: the score map has 5 channels, whose values at each pixel are the scores of the building, vegetation, water-system, road and other classes respectively, and the highest score determines the category of the current pixel; create a zero matrix with the resolution of the original test image and 3 channels; for each pixel, judge the class given by the score map and assign the pixel value accordingly: building [31,102,156]; vegetation [0,255,0]; water system [255, 0]; road [192,192,192]; other [255,255]; color every pixel by this method, and the finally exported matrix is the segmentation result.
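The coloring step can be sketched as an argmax over the 5 score channels followed by a palette lookup. Two pixel values in the claim text are truncated ("[255, 0]" for water systems and "[255,255]" for other), so those two palette entries below are assumed placeholders, not values from the source:

```python
import numpy as np

# Class order from the claim: building, vegetation, water system, road, other.
PALETTE = np.array([
    [31, 102, 156],     # building (from the claim)
    [0, 255, 0],        # vegetation (from the claim)
    [0, 0, 255],        # water system -- placeholder, value truncated in source
    [192, 192, 192],    # road (from the claim)
    [255, 255, 255],    # other -- placeholder, value truncated in source
], dtype=np.uint8)

def colorize(logit):
    """Map an (H, W, 5) Logit score map to an (H, W, 3) color image:
    each pixel takes the palette color of its highest-scoring channel."""
    labels = np.argmax(logit, axis=-1)   # (H, W) winning class per pixel
    return PALETTE[labels]               # fancy indexing into the palette
```

Fancy indexing with the (H, W) label array broadcasts the 3-channel palette rows over the image, so no per-pixel loop is needed.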
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011359068.0A CN112489054A (en) | 2020-11-27 | 2020-11-27 | Remote sensing image semantic segmentation method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011359068.0A CN112489054A (en) | 2020-11-27 | 2020-11-27 | Remote sensing image semantic segmentation method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112489054A true CN112489054A (en) | 2021-03-12 |
Family
ID=74936403
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011359068.0A Pending CN112489054A (en) | 2020-11-27 | 2020-11-27 | Remote sensing image semantic segmentation method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112489054A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112801929A (en) * | 2021-04-09 | 2021-05-14 | 宝略科技(浙江)有限公司 | Local background semantic information enhancement method for building change detection |
CN113256649A (en) * | 2021-05-11 | 2021-08-13 | 国网安徽省电力有限公司经济技术研究院 | Remote sensing image station selection and line selection semantic segmentation method based on deep learning |
CN113496221A (en) * | 2021-09-08 | 2021-10-12 | 湖南大学 | Point supervision remote sensing image semantic segmentation method and system based on depth bilateral filtering |
CN113537033A (en) * | 2021-07-12 | 2021-10-22 | 哈尔滨理工大学 | Building rubbish remote sensing image identification method based on deep learning |
CN113688956A (en) * | 2021-10-26 | 2021-11-23 | 西南石油大学 | Sandstone slice segmentation and identification method based on depth feature fusion network |
CN113837972A (en) * | 2021-10-14 | 2021-12-24 | 中铁十九局集团矿业投资有限公司 | Mining method based on multispectral remote sensing technology |
CN114494910A (en) * | 2022-04-18 | 2022-05-13 | 陕西自然资源勘测规划设计院有限公司 | Facility agricultural land multi-class identification and classification method based on remote sensing image |
CN114782406A (en) * | 2022-05-21 | 2022-07-22 | 上海贝特威自动化科技有限公司 | RESNEXT50 deep segmentation network-based automobile gluing visual detection method |
CN115222734A (en) * | 2022-09-20 | 2022-10-21 | 山东大学齐鲁医院 | Image analysis method and system for gastric mucosa intestinal metaplasia |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110119728A (en) * | 2019-05-23 | 2019-08-13 | 哈尔滨工业大学 | Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network |
CN110390251A (en) * | 2019-05-15 | 2019-10-29 | 上海海事大学 | A kind of pictograph semantic segmentation method based on the processing of multiple neural network Model Fusion |
CN111462124A (en) * | 2020-03-31 | 2020-07-28 | 武汉卓目科技有限公司 | Remote sensing satellite cloud detection method based on DeepLabV3+ |
CN113256649A (en) * | 2021-05-11 | 2021-08-13 | 国网安徽省电力有限公司经济技术研究院 | Remote sensing image station selection and line selection semantic segmentation method based on deep learning |
- 2020-11-27 CN CN202011359068.0A patent/CN112489054A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390251A (en) * | 2019-05-15 | 2019-10-29 | 上海海事大学 | A kind of pictograph semantic segmentation method based on the processing of multiple neural network Model Fusion |
CN110119728A (en) * | 2019-05-23 | 2019-08-13 | 哈尔滨工业大学 | Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network |
CN111462124A (en) * | 2020-03-31 | 2020-07-28 | 武汉卓目科技有限公司 | Remote sensing satellite cloud detection method based on DeepLabV3+ |
CN113256649A (en) * | 2021-05-11 | 2021-08-13 | 国网安徽省电力有限公司经济技术研究院 | Remote sensing image station selection and line selection semantic segmentation method based on deep learning |
Non-Patent Citations (3)
Title |
---|
LIANG-CHIEH CHEN et al.: "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", ECCV 2018 *
XIONG Fengguang et al.: "Research on Improved Semantic Segmentation of Remote Sensing Images", HTTPS://KNS.CNKI.NET/KCMS/DETAIL/11.2127.TP.20210327.1608.010.HTML *
QING Chen et al.: "Research Progress on Image Semantic Segmentation Based on Deep Convolutional Neural Networks", Journal of Image and Graphics *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112801929A (en) * | 2021-04-09 | 2021-05-14 | 宝略科技(浙江)有限公司 | Local background semantic information enhancement method for building change detection |
CN113256649A (en) * | 2021-05-11 | 2021-08-13 | 国网安徽省电力有限公司经济技术研究院 | Remote sensing image station selection and line selection semantic segmentation method based on deep learning |
CN113256649B (en) * | 2021-05-11 | 2022-07-01 | 国网安徽省电力有限公司经济技术研究院 | Remote sensing image station selection and line selection semantic segmentation method based on deep learning |
CN113537033A (en) * | 2021-07-12 | 2021-10-22 | 哈尔滨理工大学 | Building rubbish remote sensing image identification method based on deep learning |
CN113496221A (en) * | 2021-09-08 | 2021-10-12 | 湖南大学 | Point supervision remote sensing image semantic segmentation method and system based on depth bilateral filtering |
CN113496221B (en) * | 2021-09-08 | 2022-02-01 | 湖南大学 | Point supervision remote sensing image semantic segmentation method and system based on depth bilateral filtering |
CN113837972A (en) * | 2021-10-14 | 2021-12-24 | 中铁十九局集团矿业投资有限公司 | Mining method based on multispectral remote sensing technology |
CN113688956A (en) * | 2021-10-26 | 2021-11-23 | 西南石油大学 | Sandstone slice segmentation and identification method based on depth feature fusion network |
CN114494910A (en) * | 2022-04-18 | 2022-05-13 | 陕西自然资源勘测规划设计院有限公司 | Facility agricultural land multi-class identification and classification method based on remote sensing image |
CN114782406A (en) * | 2022-05-21 | 2022-07-22 | 上海贝特威自动化科技有限公司 | RESNEXT50 deep segmentation network-based automobile gluing visual detection method |
CN115222734A (en) * | 2022-09-20 | 2022-10-21 | 山东大学齐鲁医院 | Image analysis method and system for gastric mucosa intestinal metaplasia |
CN115222734B (en) * | 2022-09-20 | 2023-01-17 | 山东大学齐鲁医院 | Image analysis method and system for gastric mucosa enteroepithelization |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112489054A (en) | Remote sensing image semantic segmentation method based on deep learning | |
CN109934200B (en) | RGB color remote sensing image cloud detection method and system based on improved M-Net | |
CN111986099B (en) | Tillage monitoring method and system based on convolutional neural network with residual error correction fused | |
CN110705457A (en) | Remote sensing image building change detection method | |
CN111797716A (en) | Single target tracking method based on Siamese network | |
CN111723693B (en) | Crowd counting method based on small sample learning | |
CN112084869B (en) | Compact quadrilateral representation-based building target detection method | |
CN113011329A (en) | Pyramid network based on multi-scale features and dense crowd counting method | |
CN111611861B (en) | Image change detection method based on multi-scale feature association | |
CN110929621B (en) | Road extraction method based on topology information refinement | |
CN111738113A (en) | Road extraction method of high-resolution remote sensing image based on double-attention machine system and semantic constraint | |
CN111160205A (en) | Embedded multi-class target end-to-end unified detection method for traffic scene | |
CN114821069A (en) | Building semantic segmentation method for double-branch network remote sensing image fused with rich scale features | |
CN115471467A (en) | High-resolution optical remote sensing image building change detection method | |
CN114299383A (en) | Remote sensing image target detection method based on integration of density map and attention mechanism | |
CN116246169A (en) | SAH-Unet-based high-resolution remote sensing image impervious surface extraction method | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
CN114519819B (en) | Remote sensing image target detection method based on global context awareness | |
CN116206112A (en) | Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM | |
CN116524189A (en) | High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization | |
CN116778318A (en) | Convolutional neural network remote sensing image road extraction model and method | |
CN114926826A (en) | Scene text detection system | |
CN114943902A (en) | Urban vegetation unmanned aerial vehicle remote sensing classification method based on multi-scale feature perception network | |
CN112801021B (en) | Method and system for detecting lane line based on multi-level semantic information | |
CN113920421A (en) | Fast-classification full convolution neural network model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210312 |
|