CN114973011A - High-resolution remote sensing image building extraction method based on deep learning - Google Patents

High-resolution remote sensing image building extraction method based on deep learning

Info

Publication number
CN114973011A
Authority
CN
China
Prior art keywords
network
convolution
encoder
model
enhancement
Prior art date
Legal status
Pending
Application number
CN202210538076.4A
Other languages
Chinese (zh)
Inventor
陆相竹
孟上九
王淼
孙义强
Current Assignee
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN202210538076.4A
Publication of CN114973011A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/13 - Satellite images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

A high-resolution remote sensing image building extraction method based on deep learning belongs to the field of image extraction methods. Existing deep learning methods for building extraction lack end-to-end model designs, and their extraction accuracy still needs improvement. In the method, a feature enhancement structure is added to the basic U-Net network model to form an encoder-feature enhancement-decoder network structure model; the activation function ReLU is then replaced with ELU. In addition, a U-Net++ network is combined with dilated convolution, and a residual network is introduced to acquire contextual feature information. The invention designs and realizes a U-Net network with feature enhancement and a changed activation function, thereby improving building extraction accuracy. The designed U-Net++ network model combined with hybrid dilated convolution can realize building extraction from a small number of remote sensing samples.

Description

High-resolution remote sensing image building extraction method based on deep learning
Technical Field
The invention relates to a high-resolution remote sensing image building extraction method based on deep learning.
Background
In image processing, points where the gray level changes sharply are generally regarded as edges, and edge detection is the process of locating such positions in an image. For building extraction, the edge information of buildings is first acquired with an edge detection algorithm according to the gray-value changes at the boundaries of different ground-object targets in the remote sensing image; line segments are then grouped by spatial relationship, and the contour and spatial structure of the building are completed by combining human prior knowledge, thereby realizing building extraction. At present, extraction methods based on region segmentation mainly fall into three types: first, methods based on the region-growing idea, which merge single pixels one by one until a target object is formed; second, splitting algorithms, represented by the quadtree, which successively split the whole image into small objects; and third, segmentation algorithms, represented by the watershed algorithm, which take the texture features of marked regions as the merging criterion and local region homogeneity as the segmentation basis. Among the three, the most common is the region segmentation method based on the watershed algorithm, which some scholars have studied to a certain extent. In recent years, with the improvement of computer computing power and the continuous development of deep learning, more and more researchers have applied deep learning to the remote sensing field. Convolutional neural networks in particular, having come to prominence in deep learning, are popular in remote sensing applications, and building extraction methods based on deep learning have opened a new era of remote sensing image information extraction.
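As a minimal illustration of the gradient-based edge idea described above (a sketch for this document only, not the patent's method; the Sobel kernels and the toy image are assumptions chosen for demonstration):

```python
import numpy as np

def edge_magnitude(img):
    """Approximate the gray-level gradient with 3 x 3 Sobel kernels and
    return its magnitude; large values mark positions where the gray
    level changes sharply, i.e. candidate edges."""
    kx = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])
    ky = kx.T
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            out[i, j] = np.hypot((patch * kx).sum(), (patch * ky).sum())
    return out

# Toy image with a vertical gray-level step between columns 2 and 3:
img = np.zeros((5, 6))
img[:, 3:] = 1.0
edges = edge_magnitude(img)  # responds only around the step
```

Only the two output columns that straddle the step produce a nonzero response, which is exactly the "sharp gray-level change" criterion the paragraph describes.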
At present, there is a method for automatic segmentation of remote sensing images using a patch-based learning network, which carries out experiments on the Massachusetts roads dataset and the Massachusetts buildings dataset and compares three models: a basic neural network model, a network model with a conditional random field added, and a neural network model with post-processing added; the highest building extraction accuracy reaches 92.03%. Saito et al. later improved the structure of the extraction convolutional neural network and created a new loss function, CIS, which improved extraction accuracy. Liu Wentao et al. used a fully convolutional neural network to realize automatic extraction of building roofs; the designed network structure enlarges the receptive field of the neurons by adding dilated convolution, and extracting building roofs with this network achieved good results. Xu et al. proposed a convolutional neural network model called Res-U-Net based on the ResNet network and introduced a guided filter as post-processing to fine-tune the building extraction results, because the convolutional network blurs object boundaries and degrades result visualization. Although these neural network models obtain good building extraction results and effectively reduce the influence of salt-and-pepper noise, they still have shortcomings: for example, the shapes of buildings covered by trees cannot be accurately detected, and some fuzzy, irregular building boundaries can hardly be classified.
To effectively perform semantic segmentation of building roofs in dense urban environments, Qin et al. extracted building roofs based on a deep convolutional neural network using Chinese Gaofen-2 (GF-2) remote sensing image data, established a pixel-wise DCNN (deep convolutional neural network) model based on VHR (very high resolution) satellite imagery, and optimized the building extraction results with a conditional random field, although the post-processing did not improve the segmentation results much. Ye et al. automatically extracted buildings using a combined-attention deep neural network method based on ultrahigh-resolution aerial imagery; network training with the proposed convolutional neural network RFA-UNet obtained better results. Aiming at accurate segmentation of buildings in high-resolution remote sensing images, Wangyu et al. proposed a deep neural network segmentation algorithm, ResNetCRF, which takes Encoder-Decoder as the framework and ResNet as the base network and combines a fully connected conditional random field; the algorithm can accurately extract the edge information of buildings, but it still has shortcomings: ResNetCRF cannot identify small buildings, and buildings whose color is similar to the background and whose edge information is not obvious suffer missed detections. Aiming at the low accuracy and incomplete building boundaries of traditional building extraction methods, Vanlongshuang et al. proposed an adaptive pooling model based on high-resolution remote sensing images, placed the model in a convolutional neural network framework, and obtained high building extraction accuracy; however, the activation function used cannot activate all neurons, and the network structure is simple.
To address this, Liu et al. proposed a lightweight deep learning model that adds spatial pyramid pooling, which can capture and aggregate multi-scale context information, and Yankee et al. proposed a building extraction method using a convolutional neural network based on local features, which separates buildings from the image and then inputs the separated building patches for recognition, reducing model complexity, although the model depends heavily on the separated images. Although the above deep learning based building extraction methods perform well, the following disadvantages remain:
(1) some network structures are complex, and small-scale buildings cannot be identified;
(2) buildings with irregular shapes are extracted incompletely or cannot be classified, the extraction results depend heavily on post-processing, and end-to-end network models are few;
(3) some network methods easily lose a large amount of detail features during image training, which degrades the accuracy of building information extraction.
In conclusion, existing information extraction methods can hardly obtain and utilize the relevant features of buildings in high-spatial-resolution remote sensing images comprehensively, so building information extraction from such images struggles to make breakthrough progress in both efficiency and accuracy.
Disclosure of Invention
The invention aims to solve the problem that existing information extraction methods can hardly obtain and utilize the relevant features of buildings in high-spatial-resolution remote sensing images comprehensively, and provides a high-resolution remote sensing image building extraction method based on deep learning.
A high-resolution remote sensing image building extraction method based on deep learning is realized through the following steps:
adding a feature enhancement structure to the basic U-Net network model to form an encoder-feature enhancement-decoder network structure model; then improving the activation function of the network structure model with the feature enhancement structure added: replacing the activation function ReLU with ELU;
designing a U-Net++ network model combined with hybrid dilated convolution to extract buildings, namely: combining the U-Net++ network with dilated convolution and introducing a residual network to acquire contextual feature information; the remote sensing image building extraction process comprises a training stage and a testing stage;
training stage: training the improved model using backpropagation and the stochastic gradient descent algorithm, then verifying and evaluating the accuracy of the trained model and adjusting the model parameters through backpropagation according to the verification results; training and parameter adjustment are then repeated until the model becomes stable;
testing stage: testing the trained model with the remote sensing image data of the test set; the data are input into the model in turn for building extraction to obtain prediction results, and the accuracy of the extraction results is evaluated against the ground truth data.
Preferably, in the step of adding the feature enhancement structure to the basic U-Net network model to form the encoder-feature enhancement-decoder network structure model,
(1) the U-Net network consists of an encoder, a decoder and skip connections; in the encoder, each level performs two convolutions and then downsamples; in the decoder, each level upsamples with a deconvolution, is skip-connected with the encoder features of the corresponding size, and then performs two convolutions before the next deconvolution;
(2) in the step of adding the feature enhancement structure to the basic U-Net network model, the encoder-feature enhancement-decoder structure mainly comprises an encoder part, a feature enhancement part and a decoder part; the network is an end-to-end model whose activation function is ReLU; during training, the network model takes image data at the input end and outputs prediction results at the output end, with all intermediate operations performed inside the neural network rather than divided into separate modules for processing;
the encoder-feature enhancement-decoder network structure model adds a feature enhancement structure to the basic encoder-decoder network model; the feature enhancement structure is specifically as follows:
dilated convolution is added in the middle of the network structure to replace the pooling operation; on this basis, a feature enhancement structure is added in the middle of the network, a structure in which serial and parallel paths coexist and in which the feature maps obtained by dilated convolutions with different dilation rates are connected both in series and in parallel; the number of channels and the size of the feature maps in the enhancement part depend on the operations in the encoder. The structure has four dilated convolution operations in total, with dilation rates of 1, 2, 4 and 8; dilating the standard 3 x 3 convolution yields effective kernel sizes of 3 x 3, 5 x 5, 9 x 9 and 17 x 17 respectively, the receptive fields corresponding to each feature map are 3, 7, 15 and 31 respectively, and the receptive field of the final output feature map with respect to the first intermediate feature map in the feature enhancement is 31 x 31. The dilated convolution operations do not reduce the resolution of the feature maps, and because the dilation rates differ, the structure obtains multi-scale feature information; the multi-scale features are added and combined through skip connections, realizing the enhancement of the feature information.
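The kernel sizes and receptive fields quoted above follow from the standard dilated convolution formulas; a small sketch (illustration only, not the patent's code) that reproduces those numbers:

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

# Dilation rates from the text and the resulting effective kernels.
rates = [1, 2, 4, 8]
eff = [effective_kernel(3, d) for d in rates]  # expected 3, 5, 9, 17

# Receptive field after stacking the four stride-1 dilated convolutions
# in series: it grows by (effective kernel - 1) at each layer.
rf, rfs = 1, []
for k in eff:
    rf += k - 1
    rfs.append(rf)  # expected 3, 7, 15, 31
```

The final value of `rf` is 31, matching the 31 x 31 receptive field of the output feature map stated in the text.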
Preferably, the step of improving the activation function of the network structure model with the feature enhancement structure added is as follows:
on the basis of the encoder-feature enhancement-decoder network structure, the activation function of the network model is replaced, i.e. the activation function ReLU is replaced by the activation function ELU, obtaining a feature-enhanced U-Net network model whose activation function is ELU;
ELU is an exponential linear unit, and the expression is as follows:
ELU(x) = x,             when x > 0
ELU(x) = α(e^x − 1),    when x ≤ 0
where α > 0 is a hyperparameter of the ELU.
the ELU function applies an exponential correction to the negative part of the ReLU activation function, which reduces the differences between gradients and markedly improves stability in the region where the input is negative; the encoder part, the feature enhancement part and the decoder part are combined to form the encoder-feature enhancement-decoder network structure;
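A minimal scalar sketch of the two activations, assuming the common default α = 1.0 (the patent does not state which α it uses):

```python
import math

ALPHA = 1.0  # assumed default; the patent does not specify alpha

def elu(x, alpha=ALPHA):
    """Exponential Linear Unit: identity for x > 0, saturating
    exponential correction alpha * (exp(x) - 1) for x <= 0."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def relu(x):
    """Rectified Linear Unit: zero for all negative inputs."""
    return max(0.0, x)
```

Unlike ReLU, whose output (and gradient) is exactly zero for every negative input, ELU stays negative and differentiable there, which is what keeps neurons from "dying" as the text argues.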
the improved encoder-feature enhancement-decoder network structure model has 4 layers in total; images are input at the encoder end and pass through convolution and pooling operations, then enter the feature enhancement part for dilated convolution, where dilated convolutions with different dilation rates are connected in series and in parallel to obtain feature maps of different scales while retaining detail information, and the feature maps of different scales are added to enhance the feature information; the feature-enhanced feature maps then enter the decoder part and pass through transposed convolution, image concatenation and convolution operations, and finally a Sigmoid function maps the feature map into the range [0, 1] and the building extraction result map is output.
Preferably, in the step of designing the U-Net++ network model combined with hybrid dilated convolution and performing building extraction,
(1) U-Net++ further reduces the semantic gap between encoder and decoder by introducing nested, dense skip connections on top of the U-Net network; U-Net++ skip-connects different network layers, upsampling deep features to introduce them into shallow layers, or downsampling shallow features to introduce them into deep layers, so as to make up the semantic loss between encoder and decoder;
(2) dilated convolution uses a dilated sparse matrix in place of the traditional convolution kernel, strengthening the semantic relation of the context;
(3) the U-Net++ network model combined with hybrid dilated convolution is as follows:
dilated convolution is used in the FCN, replacing the traditional convolution kernel with a dilated sparse matrix so as to enlarge the receptive field of the convolution and link the feature information of the image context;
the encoder features in the U-Net++ network are fused with the upsampled features from the layer below, and the fused features are in turn fused with the upsampled features of the next layer, iterating until the next layer has no upsampling module; the output of each module is:
x^(i,j) = C(x^(i−1,j)),  when j = 0
x^(i,j) = C([x^(i,0), x^(i,1), …, x^(i,j−1), U(x^(i+1,j−1))]),  when j > 0
in the above formula, x^(i,j) denotes the output of a feature extraction module, where i denotes the encoder downsampling layer index and j denotes the index of the module within the same layer, with j = 0 denoting an encoder feature extraction module; C(·) denotes a convolution operation; U(·) denotes upsampling; [·] denotes feature channel concatenation; a Dropout layer is added after each convolutional layer, so that during training each intermediate neuron in the neural network is set to 0 with a certain probability and does not participate in forward propagation or backward feedback;
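The dense skip pattern that the formula describes can be enumerated explicitly; an illustrative sketch (the tags "input"/"down"/"skip"/"up" are invented for this example, not the patent's notation):

```python
def node_inputs(i, j):
    """Inputs feeding feature node x[i, j] in a U-Net++-style graph:
    j == 0 nodes take the downsampled output of x[i-1, 0]; deeper
    nodes concatenate all same-level predecessors x[i, 0..j-1] with
    the upsampled output of x[i+1, j-1]."""
    if j == 0:
        return [("down", (i - 1, 0))] if i > 0 else [("input", None)]
    same_level = [("skip", (i, k)) for k in range(j)]
    return same_level + [("up", (i + 1, j - 1))]

# Node x[0, 2] fuses x[0,0], x[0,1] and the upsampled x[1,1]:
print(node_inputs(0, 2))
```

This is the "fused features are in turn fused with the upsampled features of the next layer" iteration from the text, written out as a dependency list.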
Focal Loss is used as the loss function to reduce the weight of negative samples in training; the formula is:
L_fl = −α(1 − y′)^γ · log(y′),          when y = 1
L_fl = −(1 − α) · (y′)^γ · log(1 − y′), when y = 0
in the above formula, y denotes the label and y′ denotes the predicted probability that the sample is correct; two factors are introduced: γ > 0 reduces the loss of easily classified samples so that the model pays more attention to difficult, misclassified samples, and the α factor balances the imbalance between the numbers of positive and negative samples;
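A single-sample sketch of the binary Focal Loss above; γ = 2 and α = 0.25 are commonly used defaults assumed for illustration, not values stated in the patent:

```python
import math

def focal_loss(y, p, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample. y is the label (0 or 1) and p
    the predicted probability of the positive class; gamma down-weights
    easy samples and alpha balances class frequencies."""
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)

easy = focal_loss(1, 0.9)  # confidently correct: tiny loss
hard = focal_loss(1, 0.1)  # confidently wrong: large loss
```

The (1 − y′)^γ factor is what makes `easy` orders of magnitude smaller than `hard`, focusing training on the misclassified samples as the text describes; with γ = 0 and α = 1 the loss reduces to ordinary cross-entropy.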
a hybrid dilated convolution strategy is introduced to reduce the gridding effect produced by dilated convolution while letting the enlarged receptive field gather global information; taking the minimum of a group of dilation rates of the hybrid dilated convolution as k, the dilation rates are chosen as k, k+1 and k+2, and this three-layer consecutive hybrid dilated convolution scheme is retained.
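The gridding effect that the mixed rates k, k+1, k+2 are meant to avoid can be checked by tracking which input offsets a stack of 3-tap dilated convolutions actually touches (an illustrative sketch, not the patent's code):

```python
def covered_offsets(dilations, kernel=3):
    """Input offsets reachable by stacking stride-1 dilated
    convolutions: a layer with dilation d and a 3-tap kernel reaches
    offsets {-d, 0, d}, and stacking layers sums the offsets."""
    reach = {0}
    for d in dilations:
        taps = range(-(kernel // 2) * d, (kernel // 2) * d + 1, d)
        reach = {r + t for r in reach for t in taps}
    return reach

def has_gridding(dilations):
    """True if some position inside the receptive field is never
    touched, i.e. the 'gridding' artifact appears."""
    reach = covered_offsets(dilations)
    span = max(reach)
    return any(o not in reach for o in range(-span, span + 1))

hdc = has_gridding([1, 2, 3])  # mixed rates k, k+1, k+2 with k = 1
bad = has_gridding([2, 2, 2])  # equal even rates leave holes
```

With rates [1, 2, 3] every input position in the receptive field contributes, while repeating the single rate 2 only ever touches even offsets, which is exactly the gap pattern the hybrid strategy eliminates.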
Preferably, in the step of introducing the residual network and acquiring contextual feature information, the residual network is composed of residual units, each constructed from a convolution layer, a batch normalization layer and a nonlinear activation function layer; identity shortcut connections are added in the residual network to construct the residual modules.
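A toy sketch of the identity-shortcut idea, with a single scalar weight standing in for the convolution and batch normalization layers (an assumption made purely for illustration):

```python
import numpy as np

def residual_unit(x, weight, activate=lambda v: np.maximum(v, 0.0)):
    """Sketch of a residual unit: a learned branch f(x) (here one
    linear map, standing in for conv + batch-norm) plus an identity
    shortcut, followed by the nonlinearity."""
    fx = x * weight          # placeholder for the learned branch
    return activate(fx + x)  # identity shortcut added before activation

x = np.array([1.0, 2.0, 3.0])
# If the learned branch contributes nothing, the unit passes x through
# unchanged (for non-negative x): stacking many such units cannot make
# the signal worse, which is the point of the shortcut connection.
y = residual_unit(x, weight=0.0)
```

Real residual units replace the scalar multiply with convolution and batch normalization, exactly as the preceding paragraph lists.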
The invention has the beneficial effects that:
with the help of deep learning technology, and through comparative analysis of several common building extraction methods and convolutional neural network models, building information is extracted quickly, efficiently and automatically by adjusting the parameters of existing models and integrating network types; the multi-layer features of objects can be learned autonomously, the speed and efficiency of building extraction are improved, the application range of the extraction method is appropriately widened, and the generalization ability for remote sensing image building segmentation scenes is enhanced. The concrete implementation means are as follows:
(1) a new end-to-end semantic segmentation network model is provided; by adding the feature enhancement structure, the feature information is enhanced, the extraction accuracy is improved, and building extraction from high-resolution remote sensing images is well realized, and by replacing the ReLU function with the ELU activation function, the network model can effectively extract multi-scale building information with a good extraction effect.
(2) based on the combination of the U-Net++ network and dilated convolution, a residual network is introduced, the receptive field is enlarged, and contextual feature information is acquired. For a small number of samples, a Dropout structure is used to avoid overfitting and learn a more stable model, and the dynamically scaled loss function Focal Loss is adopted to handle sample imbalance. The method is suitable for building extraction with a small number of samples, and the extraction effect is superior to that of the classical U-Net and U-Net++ networks.
Drawings
FIG. 1 is an improved U-Net network according to the present invention;
FIG. 2 is the remote sensing image building extraction flow according to the present invention;
FIG. 3 is a U-Net network model according to the present invention;
FIG. 4 is a diagram of a feature enhancement architecture to which the present invention relates;
FIG. 5 is a U-Net + + network architecture according to the present invention;
FIG. 6a shows a dilated convolution (dilation rate 1) according to the present invention;
FIG. 6b shows a dilated convolution (dilation rate 2) according to the present invention;
FIG. 6c shows a dilated convolution (dilation rate 3) according to the present invention;
FIG. 7a shows a dilated convolution with dilation rate 2 according to the present invention;
FIG. 7b shows a hybrid dilated convolution according to the present invention;
FIG. 8 is a two-layer network residual block structure according to the present invention;
fig. 9 shows a three-layer network residual block structure according to the present invention.
Detailed Description
The first embodiment is as follows:
the method for extracting the high-resolution remote sensing image building based on the deep learning is realized by the following steps:
adding a feature enhancement structure to the basic U-Net network model to form an encoder-feature enhancement-decoder network structure model, so as to improve building extraction accuracy; then improving the activation function of the network structure model with the feature enhancement structure added: replacing the activation function ReLU with ELU so that every neuron in the neural network stays active, avoiding dying neurons whose weights cannot be updated, and further improving building extraction accuracy; the improved U-Net network structure is shown in FIG. 1;
after the problem of building extraction accuracy is addressed, analysis shows that high-resolution remote sensing image data suffer from insufficient training samples and imbalanced data categories; a building segmentation method using a small amount of sample data is therefore provided for the multi-scale and detail features of buildings, and a U-Net++ network model combined with hybrid dilated convolution is designed to perform building extraction on a small number of remote sensing samples, namely: combining the U-Net++ network with dilated convolution and introducing a residual network to enlarge the receptive field and acquire contextual feature information; for a small number of samples, a Dropout structure is used to avoid overfitting and learn a more stable model, and the dynamically scaled loss function Focal Loss is used to handle sample imbalance.
The remote sensing image building extraction process comprises a training stage and a testing stage; FIG. 2 shows the remote sensing image building extraction process based on the improved U-Net++ network;
training stage: training the improved model using backpropagation and the stochastic gradient descent algorithm, then verifying and evaluating the accuracy of the trained model and adjusting the model parameters through backpropagation according to the verification results; training and parameter adjustment are then repeated until the model becomes stable;
testing stage: testing the trained model with the remote sensing image data of the test set; the data are input into the model in turn for building extraction to obtain prediction results, and the accuracy of the extraction results is evaluated against the ground truth data.
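The two stages above can be sketched end to end on a toy logistic model trained with backpropagation and gradient descent; all data, shapes and hyperparameters here are illustrative assumptions and stand in for the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the pipeline: linearly separable synthetic data
# split into a training set and a held-out test set.
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = (X @ true_w > 0).astype(float)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training stage: forward pass, cross-entropy gradient (the result of
# backpropagating through the loss), gradient-descent weight update.
w, lr = np.zeros(4), 0.5
for epoch in range(200):
    p = sigmoid(X_train @ w)
    grad = X_train.T @ (p - y_train) / len(y_train)
    w -= lr * grad

# Testing stage: predict on the held-out data and score against the
# ground truth labels, mirroring the accuracy evaluation in the text.
pred = (sigmoid(X_test @ w) > 0.5).astype(float)
accuracy = (pred == y_test).mean()
```

The loop structure (train, evaluate, adjust, repeat until stable, then test against ground truth) is the same shape as the patent's two stages, just on a model small enough to verify by hand.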
The second embodiment is as follows:
Different from the first embodiment, in the method for extracting high-resolution remote sensing image buildings based on deep learning of this embodiment, in the step of adding the feature enhancement structure to the basic U-Net network model to form the encoder-feature enhancement-decoder network structure model,
(1) the U-Net network is a fully convolutional network improved from the FCN, initially applied to medical image semantic segmentation, with a symmetrical and clear network structure comprising an encoder, a decoder and skip connections; the encoder extracts shallow, low-level, fine-grained features of the image, the decoder restores the feature maps of each layer, and the skip connections combine the high-level semantic feature maps from the decoder with the low-level semantic feature maps from the encoder at the corresponding scale. In the encoder, each downsampling is preceded by two convolutions; in the decoder, each upsampling uses a deconvolution and a skip connection with the downsampled features of corresponding size, followed by two convolutions before the next deconvolution. The network requires little sample data, converges fast and segments with high accuracy, and is a classical semantic segmentation algorithm.
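A small sketch tracking feature-map size and channel count through a U-Net-style encoder; the 256-pixel input, 64 base channels and 4-level depth are illustrative assumptions, not figures taken from the patent:

```python
def encoder_shapes(size, channels, depth):
    """Track (spatial size, channels) through a U-Net-style encoder:
    at each level the two convolutions keep the spatial size (assuming
    'same' padding) and the pooling that follows halves it while the
    channel count doubles."""
    shapes = [(size, channels)]
    for _ in range(depth):
        size //= 2
        channels *= 2
        shapes.append((size, channels))
    return shapes

# A 256 x 256 input with 64 base channels through a 4-level encoder:
print(encoder_shapes(256, 64, 4))
```

The decoder then mirrors this sequence in reverse, which is why each upsampled map has a same-sized encoder map available for the skip connection.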
(2) In the step of adding the feature enhancement structure to the basic U-Net network model, the encoder-feature enhancement-decoder structure mainly comprises three parts: an encoder part, a feature enhancement part and a decoder part. The network is an end-to-end model whose activation function is ReLU. In contrast to traditional building extraction, in which image input, image processing, image analysis and image output must be completed as independent tasks, this network model only needs image data to be fed in at the input end during training and outputs the prediction result at the output end; all intermediate operations take place inside the neural network, which is not divided into separate modules for processing;
the encoder-feature enhancement-decoder network structure model adds a feature enhancement structure on the basis of the basic encoder-decoder network model; the feature enhancement structure is specifically as follows:
On a high-resolution remote sensing image, buildings exhibit multi-scale features. Buildings that are large and regularly shaped are easy to extract, but some buildings are irregularly shaped and some are extremely small; for such buildings, retaining detailed spatial information is crucial. If pooling operations are used, detailed information is easily lost: for irregularly shaped buildings, edge information is lost, the outline of the building becomes indistinct, and small protruding edge details cannot be detected; very small buildings may become undetectable because their information is lost entirely during pooling. Therefore, to increase the receptive field while avoiding reduced feature-map resolution and loss of spatial information, the invention adds hole (dilated) convolution in the middle of the network structure in place of the pooling operation. The advantage of hole convolution is that, like pooling, it expands the receptive field of a single pixel in the feature map, yet it keeps the feature-map resolution unchanged and preserves detailed spatial information. Because buildings in remote sensing images have multi-scale features, in order to acquire the multi-scale information of buildings more accurately, a feature enhancement structure based on hole convolution, such as the structure shown in fig. 4, is added in the middle of the network for feature enhancement;
As shown in fig. 4, the feature enhancement structure is a network structure in which serial and parallel connections coexist; it connects, in serial and parallel fashion, the feature maps obtained by hole convolution operations with different hole rates. The number of channels and the size of the feature maps in the feature enhancement part are determined by the operations in the encoder and are not fixed. In the figure, green arrows represent hole convolution operations and the numbers on the arrows give the hole rates. The structure contains four hole convolution operations with hole rates of 1, 2, 4 and 8; after dilating a standard 3 × 3 convolution, effective kernel sizes of 3 × 3, 5 × 5, 9 × 9 and 17 × 17 are obtained, and the receptive fields corresponding to each feature map are 3, 7, 15 and 31 respectively, so the receptive field of the final output feature map over the first intermediate feature map in the feature enhancement is 31 × 31. The resolution of the feature maps obtained by the hole convolution operations is not reduced, and because the hole convolutions use different hole rates, multi-scale feature information is obtained through this structure; the multi-scale features are added and combined via skip connections, thereby realizing the enhancement of the feature information.
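The effective kernel sizes (3, 5, 9, 17) and receptive fields (3, 7, 15, 31) quoted above can be checked with a few lines of arithmetic. This sketch assumes the four dilated convolutions are applied in series with stride 1, as the serial branch of the structure suggests:

```python
def effective_kernel(k, rate):
    # A k x k kernel dilated by `rate` spans k + (k - 1) * (rate - 1) pixels.
    return k + (k - 1) * (rate - 1)

def stacked_receptive_fields(kernel, rates):
    """Receptive field after each layer of a stride-1 stack of dilated convs."""
    rf, fields = 1, []
    for r in rates:
        rf += effective_kernel(kernel, r) - 1  # stride-1 composition rule
        fields.append(rf)
    return fields

kernels = [effective_kernel(3, r) for r in (1, 2, 4, 8)]  # effective kernel sizes
fields = stacked_receptive_fields(3, (1, 2, 4, 8))        # cumulative receptive fields
```

The last entry of `fields` reproduces the 31 × 31 receptive field of the final output over the first intermediate feature map.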
The third embodiment is as follows:
Different from the second embodiment, in the method for extracting a building from high-resolution remote sensing images based on deep learning of this embodiment, the step of improving the activation function of the network structure model with the added feature enhancement structure is as follows:
on the basis of the encoder-feature enhancement-decoder network structure, the activation function of the network model is replaced: the activation function ReLU is replaced by the activation function ELU, yielding a U-Net network model with ELU activation and feature enhancement; the expression of the activation function ReLU is as follows:
f(x)=max(0,x) (1)
The activation function ReLU effectively alleviates the vanishing-gradient problem; because the function is non-exponential and involves only a linear relation, its computation cost is greatly reduced and it is fast to evaluate, allowing the neural network to converge more quickly under stochastic gradient descent. Although ReLU has many advantages, it still has a drawback: during training, neurons can die, after which their weights can no longer be updated. To compensate for this deficiency of ReLU, the ELU function is used.
ELU is an exponential linear unit, and the expression is as follows:
f(x) = x,             x > 0
f(x) = α(e^x − 1),    x ≤ 0        (2)
The ELU function solves the neuron-death problem: it applies an exponential correction to the negative part of the ReLU activation function, which reduces the difference between gradients and markedly improves the stability of the corresponding region when the input value is negative. Its advantages are that it avoids neuron death, retains all the advantages of the ReLU function, and accelerates network convergence because the mean of its output values is close to 0. The structure of the improved network model is shown in fig. 1. The encoding part, the feature enhancement part and the decoding part are combined to form the encoder-feature enhancement-decoder network structure;
The improved encoder-feature enhancement-decoder network structure model has 4 layers in total. Images are input at the encoder end, pass through convolution and pooling operations, and then enter the feature enhancement part for hole convolution operations; hole convolutions with different hole rates are connected in serial and parallel fashion, producing feature maps of different scales while retaining detailed information, and the feature maps of different scales are added to enhance the feature information. The enhanced feature maps enter the decoder part and, after transposed convolution, image concatenation and convolution operations, a Sigmoid function finally maps the feature map into the [0, 1] range and the building extraction result map is output. Although the feature enhancement part uses several hole convolutions, compared with ordinary convolution these enlarge the receptive field without increasing the number of training parameters, because they selectively skip some pixel values during the convolution operation;
Traditional building extraction methods are time-consuming and labor-intensive and cannot realize end-to-end semantic segmentation. The present method is based on the U-Net network model: a feature enhancement structure based on hole convolution is added in the middle of the model, which acquires multi-scale features while retaining the detailed information of the feature maps, and the fusion and enhancement of the multi-scale features improves the building extraction accuracy. Replacing ReLU with ELU in the network model further improves the building extraction precision.
The fourth embodiment is as follows:
Different from the third embodiment, in the method for extracting a building from high-resolution remote sensing images based on deep learning of this embodiment, in the step of designing a U-Net + + network model combined with mixed hole convolution and extracting the building,
(1) On the basis of the U-Net network, U-Net + + further reduces the semantic gap between the encoder and the decoder by introducing nested and dense skip connections. Compared with the U-Net network, U-Net + + skip-connects different network layers, upsampling deep features to introduce them into shallow layers, or downsampling shallow features to introduce them into deep layers, thereby compensating for the semantic loss between encoder and decoder and improving model performance. U-Net + + performs well in image semantic segmentation, but it still cannot extract enough multi-scale information from remote sensing images. In remote sensing image semantic segmentation, feature maps of different scales carry different information: low-level feature maps capture rich spatial information and can highlight the boundaries of ground objects, while high-level semantic feature maps reflect the location information of ground objects.
(2) The hole convolution
If a convolutional layer is to enlarge its receptive field and obtain more contextual feature information, this can generally be achieved in 3 ways: increasing the size of the convolution kernel; increasing the number of layers (for example, two 3 × 3 convolution layers approximate the effect of one 5 × 5 convolution); or applying a pooling operation before the convolution. The first two methods increase the number of parameters, while the third loses some information. Hole convolution is a method that can enlarge the receptive field without increasing the number of parameters. As shown in fig. 6, a dilated sparse matrix replaces the traditional convolution kernel so as to enlarge the receptive field of the convolution; with the same number of parameters, the hole convolution covers a larger range of information and thus strengthens the semantic relation of the context;
Compared with ordinary convolution, hole convolution has one additional hyper-parameter, the dilation rate, i.e. the spacing between the values sampled by the convolution kernel. In fig. 6(a) the dilation rate is 1 and the receptive field after convolution is 3; in fig. 6(b) the dilation rate is 2 and the receptive field is 5; in fig. 6(c) the dilation rate is 3 and the receptive field is 7. The same 3 × 3 convolution can thus achieve the effect of 5 × 5 or 7 × 7 kernels. Hole convolution enlarges the receptive field without increasing the number of parameters, and ordinary convolution can be regarded as a special hole convolution with dilation rate 1.
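The dilation-rate idea can be made concrete in one dimension. The sketch below, a naive "valid" cross-correlation that is not code from the patent, shows that a 3-tap kernel with rate 2 reads samples spanning 5 positions while still using only 3 parameters:

```python
def dilated_conv1d(signal, kernel, rate):
    """Naive 'valid' 1-D dilated convolution (cross-correlation form):
    kernel taps are spaced `rate` samples apart, so a 3-tap kernel
    spans 2 * rate + 1 input samples without adding parameters."""
    span = (len(kernel) - 1) * rate + 1
    out = []
    for i in range(len(signal) - span + 1):
        out.append(sum(kernel[j] * signal[i + j * rate]
                       for j in range(len(kernel))))
    return out

sig = [1, 2, 3, 4, 5, 6, 7]
same = dilated_conv1d(sig, [1, 1, 1], rate=1)  # ordinary 3-tap sums
wide = dilated_conv1d(sig, [1, 1, 1], rate=2)  # each output spans 5 samples
```

With rate 1 the kernel behaves as an ordinary convolution; with rate 2 each output sums samples i, i + 2 and i + 4, so fewer "valid" positions remain but the context per output is wider.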
(3) The U-Net + + network model combined with the mixed hole convolution is as follows:
To address the problem that U-Net + + cannot extract enough multi-scale information from remote sensing images, the method uses hole convolution as in the FCN, replacing the traditional convolution kernel with a dilated sparse matrix so as to enlarge the receptive field of the convolution. This effectively connects the contextual feature information of the image and avoids the coarse segmentation results and discontinuous boundaries caused by an imbalance between the utilization of local and global information.
In the U-Net + + network, the encoder features are fused with the upsampled features of the encoder in the next layer, and the fused features are in turn fused with the upsampled features of the following layer, iterating until the next layer contains no upsampling module. The output of each module is as follows:
x^{i,j} = C(x^{i-1,j}),                                            j = 0
x^{i,j} = C([x^{i,0}, x^{i,1}, ..., x^{i,j-1}, U(x^{i+1,j-1})]),   j > 0        (3)
In the above formula, x^{i,j} represents the output of a feature extraction module, where i is the index of the encoder downsampling layer and j is the index of the module within the same layer; j = 0 denotes a feature extraction module of the encoder; C(·) represents a convolution operation; U(·) represents upsampling; [·] represents feature channel concatenation. A Dropout strategy is used to avoid overfitting and learn a more stable model: a Dropout layer is added after each convolutional layer, and during training each intermediate-layer neuron is set to 0 with a certain probability, i.e. it does not participate in forward propagation or backward feedback. Each time an image is input, the network therefore randomly samples a new feature map, which effectively increases the independence between different training passes, avoids overfitting, makes the trained model more stable and improves remote sensing image segmentation performance.
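The nested skip pattern of the formula above can be made explicit with a small symbolic helper. The x[i][j] naming is hypothetical, chosen only to match the indices of the formula; it is not code from the patent:

```python
def unetpp_inputs(i, j):
    """Names of the feature maps concatenated at U-Net++ node x[i][j]:
    for j > 0 these are all same-row predecessors x[i][0..j-1] plus the
    upsampled output U(x[i+1][j-1]); j == 0 is a plain encoder node fed
    only by downsampling from the layer above."""
    if j == 0:
        return []
    preds = [f"x[{i}][{k}]" for k in range(j)]
    preds.append(f"U(x[{i+1}][{j-1}])")
    return preds

node_02 = unetpp_inputs(0, 2)   # dense skips into the third node of row 0
```

Each node thus concatenates j + 1 inputs, which is exactly the "fused features are in turn fused with the upsampled features of the following layer" iteration described above.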
To address the overfitting easily caused by the unbalanced distribution of buildings and background, Focal Loss is used as the loss function; it reduces the weight of the large number of easy negative samples in training and thus effectively improves segmentation precision. The formula is as follows:
L_fl = -α(1 - y′)^γ log(y′),         y = 1
L_fl = -(1 - α)(y′)^γ log(1 - y′),   y = 0        (4)
In the above formula, y represents the label and y′ represents the predicted probability p that the sample is correct. Two factors are introduced: γ > 0 reduces the loss of easily classified samples, so that the model focuses more on difficult, misclassified samples; the factor α balances the unequal ratio of positive to negative samples;
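A minimal sketch of equation (4) in pure Python; the defaults α = 0.25 and γ = 2 are common values from the Focal Loss literature, not parameters specified in the patent:

```python
import math

def focal_loss(y, p, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample: y is the label (0 or 1) and
    p the predicted probability of the positive class.  The modulating
    factor (1 - p_t) ** gamma shrinks the loss of well-classified
    samples; alpha re-balances positives against negatives."""
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)

easy = focal_loss(1, 0.95)   # confident correct positive: tiny loss
hard = focal_loss(1, 0.10)   # badly misclassified positive: large loss
```

With γ = 0 and α = 0.5 the expression reduces to half the ordinary binary cross-entropy, which is a quick sanity check on the implementation.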
If a fixed expansion coefficient is used throughout the down-sampling process, the convolution may act like a "sieve" when extracting image features, causing information loss: when the expansion coefficient is too small, the sampled data are too dense and global information may be lost; when it is too large, the sampled data are too sparse and local information is lost.
The invention introduces a hybrid dilated convolution (HDC) strategy to alleviate the effects produced by dilated convolution and to expand the global information gathered by the receptive field. If the image resolution of the data set is k m, then, in order to reflect the local detail structure of building outlines more accurately, the minimum of the group of expansion coefficients of the mixed hole convolution is set to k; that is, the expansion coefficients are chosen as k, k + 1 and k + 2, maintaining a three-layer continuous mixed hole convolution. This improves the information utilization rate while keeping the receptive field size unchanged, and simultaneously meets the requirements of segmenting buildings of different scales; the receptive fields and information utilization rates of mixed hole convolution and ordinary hole convolution are shown in fig. 7.
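The benefit of the k, k + 1, k + 2 coefficient pattern over a fixed rate can be verified directly by tracking which input offsets a stack of 3-tap dilated convolutions actually reads. This is a 1-D sketch under the assumption k = 1, not code from the patent:

```python
def touched_offsets(rates):
    """Input offsets (relative to one output pixel) read by a stack of
    stride-1, 3-tap dilated convolutions with the given dilation rates."""
    offs = {0}
    for r in rates:
        offs = {o + t * r for o in offs for t in (-1, 0, 1)}
    return offs

def has_gridding(rates):
    offs = touched_offsets(rates)
    lo, hi = min(offs), max(offs)
    return len(offs) != hi - lo + 1  # holes in the coverage -> gridding artifacts

mixed = has_gridding((1, 2, 3))   # HDC-style k, k+1, k+2 with k = 1
plain = has_gridding((2, 2, 2))   # fixed rate of the same magnitude
```

Both stacks reach offsets out to ±6, but the fixed-rate stack only ever touches even offsets (the "sieve" effect described above), while the mixed coefficients cover every pixel in the receptive field.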
The fifth embodiment is as follows:
Different from the fourth embodiment, in the method for extracting a building from high-resolution remote sensing images based on deep learning of this embodiment, in the step of introducing a residual network and acquiring contextual feature information, the residual network is composed of residual units, and each residual unit is constructed from a convolution (Conv) layer, a Batch Normalization (BN) layer and a nonlinear activation function (Rectified Linear Unit, ReLU) layer; the residual network differs from a general neural network in that an identity shortcut connection is added to construct the residual module.
The embodiments disclosed above are preferred embodiments of the present invention, but the invention is not limited thereto; those skilled in the art can readily make various extensions and changes according to the spirit of the present invention without departing from it.

Claims (5)

1. A method for extracting a building from high-resolution remote sensing images based on deep learning, characterized in that the method is realized by the following steps:
adding a characteristic enhancement structure based on a basic network structure U-Net network model to form a network structure model of an encoder-characteristic enhancement-decoder; and then, improving the activation function of the network structure model added with the characteristic enhancement structure: replacing the activation function ReLU with an ELU;
designing a U-Net + + network model combined with the mixed cavity convolution to extract the building; namely: combining the U-Net + + network with the expansion convolution, introducing a residual error network, and acquiring the characteristic information of the contact context; the process of extracting the remote sensing image building comprises a training stage and a testing stage;
a training stage: training the improved model by using back propagation and a stochastic gradient descent algorithm, then verifying and evaluating the precision of the trained model, and adjusting the model parameters through back propagation according to the verification result; the training and parameter adjustment are repeated until the model tends to be stable;
a testing stage: testing the trained model image by image using the remote sensing image data of the test set; the data are sequentially input into the model for building extraction to obtain prediction results, and the accuracy of the extraction results is evaluated against the ground truth data.
2. The method for extracting the high-resolution remote sensing image building based on the deep learning as claimed in claim 1, wherein the method comprises the following steps: in the step of adding the characteristic enhancement structure based on the basic network structure U-Net network model to form the network structure model of the encoder-characteristic enhancement-decoder,
(1) the U-Net network consists of an encoder, a decoder and skip connections; in the encoder, each stage performs two convolutions followed by a downsampling; in the decoder, upsampling uses deconvolution and skip-connects with the downsampled features of the corresponding size, after which two convolutions are performed before the next deconvolution;
(2) in the step of adding the feature enhancement structure to the basic U-Net network model, the encoder-feature enhancement-decoder structure mainly comprises: an encoder part, a feature enhancement part and a decoder part; the network is an end-to-end network model whose activation function is ReLU; during training the network model takes image data at the input end and outputs the prediction result at the output end, and all intermediate operations take place inside the neural network, which is not divided into separate modules for processing;
the encoder-feature enhancement-decoder network structure model adds a feature enhancement structure on the basis of the basic encoder-decoder network model; the feature enhancement structure is specifically as follows:
a hole convolution is added in the middle of the network structure in place of the pooling operation; on the basis of the hole convolution, a feature enhancement structure is added in the middle of the network; the feature enhancement structure is a network structure in which serial and parallel connections coexist, and it connects in serial and parallel fashion the feature maps obtained by hole convolution operations with different hole rates; the number of channels and the size of the feature maps in the feature enhancement part depend on the operations in the encoder; the structure contains four hole convolution operations with hole rates of 1, 2, 4 and 8 respectively, so that after dilating a standard 3 × 3 convolution, effective kernel sizes of 3 × 3, 5 × 5, 9 × 9 and 17 × 17 are obtained respectively, the receptive fields corresponding to each feature map are 3, 7, 15 and 31 respectively, and the receptive field of the final output feature map over the first intermediate feature map in the feature enhancement is 31 × 31; the resolution of the feature maps obtained by the hole convolution operations is not reduced, and because the hole convolutions use different hole rates, multi-scale feature information is obtained through the structure; the multi-scale features are added and combined via skip connections, thereby realizing the enhancement of the feature information.
3. The method for extracting a building from high-resolution remote sensing images based on deep learning according to claim 1 or 2, characterized in that: the step of improving the activation function of the network structure model with the added feature enhancement structure is: on the basis of the encoder-feature enhancement-decoder network structure, the activation function of the network model is replaced, i.e. the activation function ReLU is replaced by the activation function ELU, obtaining a U-Net network model with ELU activation and feature enhancement;
ELU is an exponential linear unit, and the expression is as follows:
f(x) = x,             x > 0
f(x) = α(e^x − 1),    x ≤ 0
the ELU function applies an exponential correction to the negative part of the ReLU activation function, which reduces the difference between gradients and markedly improves the stability of the corresponding region when the input value is negative; the encoding part, the feature enhancement part and the decoding part are combined to form the encoder-feature enhancement-decoder network structure;
the improved encoder-feature enhancement-decoder network structure model has 4 layers in total; images are input at the encoder end, pass through convolution and pooling operations, and then enter the feature enhancement part for hole convolution operations; hole convolutions with different hole rates are connected in serial and parallel fashion, producing feature maps of different scales while retaining detailed information, and the feature maps of different scales are added to enhance the feature information; the enhanced feature maps enter the decoder part and, after transposed convolution, image concatenation and convolution operations, a Sigmoid function finally maps the feature map into the [0, 1] range and the building extraction result map is output.
4. The method for extracting the high-resolution remote sensing image building based on the deep learning as claimed in claim 3, wherein the method comprises the following steps: designing a U-Net + + network model combined with the mixed hole convolution, in the step of extracting the building,
(1) on the basis of the U-Net network, U-Net + + further reduces the semantic gap between the encoder and the decoder by introducing nested and dense skip connections; U-Net + + skip-connects different network layers, upsampling deep features to introduce them into shallow layers, or downsampling shallow features to introduce them into deep layers, so as to compensate for the semantic loss between encoder and decoder;
(2) the hole convolution uses a dilated sparse matrix in place of the traditional convolution kernel and strengthens the semantic relation of the context;
(3) the U-Net + + network model combined with the mixed hole convolution is as follows:
hole convolution is used as in the FCN, and a dilated sparse matrix replaces the traditional convolution kernel so as to enlarge the receptive field of the convolution and connect the contextual feature information of the image;
in the U-Net + + network, the encoder features are fused with the upsampled features of the encoder in the next layer, and the fused features are in turn fused with the upsampled features of the following layer, iterating until the next layer contains no upsampling module; the output of each module is as follows:
x^{i,j} = C(x^{i-1,j}),                                            j = 0
x^{i,j} = C([x^{i,0}, x^{i,1}, ..., x^{i,j-1}, U(x^{i+1,j-1})]),   j > 0
in the above formula, x^{i,j} represents the output of a feature extraction module, where i is the index of the encoder downsampling layer and j is the index of the module within the same layer; j = 0 denotes a feature extraction module of the encoder; C(·) represents a convolution operation; U(·) represents upsampling; [·] represents feature channel concatenation; a Dropout layer is added after each convolutional layer, and during training each intermediate-layer neuron is set to 0 with a certain probability, not participating in forward propagation or backward feedback;
Focal Loss is used as the loss function to reduce the weight of easy negative samples in training; the formula is:
L_fl = -α(1 - y′)^γ log(y′),         y = 1
L_fl = -(1 - α)(y′)^γ log(1 - y′),   y = 0
in the above formula, y represents the label and y′ represents the predicted probability p that the sample is correct; two factors are introduced: γ > 0 reduces the loss of easily classified samples, so that the model focuses more on difficult, misclassified samples; the factor α balances the unequal ratio of positive to negative samples;
a mixed hole convolution strategy is introduced to alleviate the effects produced by dilated convolution and to expand the global information gathered by the receptive field; the minimum of the group of expansion coefficients of the mixed hole convolution is set to k, i.e. the expansion coefficients are chosen as k, k + 1 and k + 2, maintaining the three-layer continuous mixed hole convolution.
5. The method for extracting a building from high-resolution remote sensing images based on deep learning according to claim 1, 2 or 4, characterized in that: in the step of introducing the residual network and acquiring contextual feature information, the residual network consists of residual units, and each residual unit is constructed from a convolution layer, a batch normalization layer and a nonlinear activation function layer; an identity shortcut connection is added in the residual network to construct the residual module.
CN202210538076.4A 2022-05-18 2022-05-18 High-resolution remote sensing image building extraction method based on deep learning Pending CN114973011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210538076.4A CN114973011A (en) 2022-05-18 2022-05-18 High-resolution remote sensing image building extraction method based on deep learning

Publications (1)

Publication Number Publication Date
CN114973011A true CN114973011A (en) 2022-08-30

Family

ID=82982466


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115841625A (en) * 2023-02-23 2023-03-24 杭州电子科技大学 Remote sensing building image extraction method based on improved U-Net model
CN116452901A (en) * 2023-06-19 2023-07-18 中国科学院海洋研究所 Automatic extraction method for ocean culture area of remote sensing image based on deep learning
CN116452901B (en) * 2023-06-19 2023-09-15 中国科学院海洋研究所 Automatic extraction method for ocean culture area of remote sensing image based on deep learning
CN116883679A (en) * 2023-07-04 2023-10-13 中国科学院地理科学与资源研究所 Ground object target extraction method and device based on deep learning
CN116883679B (en) * 2023-07-04 2024-01-12 中国科学院地理科学与资源研究所 Ground object target extraction method and device based on deep learning
CN116844051A (en) * 2023-07-10 2023-10-03 贵州师范大学 Remote sensing image building extraction method integrating ASPP and depth residual
CN116844051B (en) * 2023-07-10 2024-02-23 贵州师范大学 Remote sensing image building extraction method integrating ASPP and depth residual
CN116644674A (en) * 2023-07-27 2023-08-25 北京理工大学 Method, device, equipment and medium for predicting residual stress of L-shaped component
CN116644674B (en) * 2023-07-27 2023-09-29 北京理工大学 Method, device, equipment and medium for predicting residual stress of L-shaped component


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination