CN116503744B - Height grade-guided single-view remote sensing image building height estimation method and device - Google Patents


Info

Publication number
CN116503744B
CN116503744B (application CN202310770597.7A)
Authority
CN
China
Prior art keywords
height
building
convolution
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310770597.7A
Other languages
Chinese (zh)
Other versions
CN116503744A (en)
Inventor
陆超然
王宇翔
张攀
沈均平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Hongtu Information Technology Co Ltd
Original Assignee
Aerospace Hongtu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Hongtu Information Technology Co Ltd filed Critical Aerospace Hongtu Information Technology Co Ltd
Priority to CN202310770597.7A
Publication of application CN116503744A
Application granted
Publication of CN116503744B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/10: Terrestrial scenes
    • G06V20/176: Urban or other man-made structures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/10: Terrestrial scenes
    • G06V20/13: Satellite images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The application provides a height-level-guided method and device for estimating building heights from single-view remote sensing images, in the technical field of building height estimation, comprising the following steps: acquiring high-resolution optical satellite remote sensing images and a normalized digital surface model, and preprocessing them to obtain a sample data set; inputting the optical images of the sample data set into a building height estimation model, computing losses between the predictions and the ground-truth height-level classes and height values, and iteratively optimizing the model parameters to obtain a trained target building height estimation model; and inputting a high-resolution optical image to be predicted into the target building height estimation model to obtain a height-level class prediction and a building height prediction, then suppressing regions whose height-level class prediction meets a first threshold and regions whose building height prediction is below a second threshold, to obtain the final height prediction result. The method and device improve the robustness and accuracy of building height estimation from single-view remote sensing images.

Description

Height grade-guided single-view remote sensing image building height estimation method and device
Technical Field
The application relates to the technical field of building height estimation, and in particular to a height-level-guided method and device for estimating building heights from single-view remote sensing images.
Background
Digital city construction relies on three-dimensional urban reconstruction, in which building height is essential basic information. In the related art, convolutional-network methods for height estimation from single-view remote sensing images mainly fall into two types: (1) directly regressing height with an encoding-decoding structure; (2) adding a land-cover category branch to assist height regression via multi-task learning. The first type suffers from low regression accuracy and slow convergence; the second type currently regresses over many land-cover categories and requires fine manual annotation, and its robustness and prediction accuracy still leave room for improvement.
Disclosure of Invention
The application aims to provide a height-level-guided method and device for estimating building heights from single-view remote sensing images, which improve the robustness and accuracy of building height estimation from single-view remote sensing images.
In a first aspect, the present application provides a height-level-guided method for estimating building heights from single-view remote sensing images, comprising:
acquiring a high-resolution optical satellite remote sensing image and a normalized digital surface model (nDSM), normalizing the nDSM and labeling different height levels, and cropping the optical image and the nDSM to obtain a sample data set;
inputting the optical images of the sample data set into a pre-built height-level-guided building height estimation model, computing losses between the predictions and the ground-truth height-level classes and height values, and iteratively optimizing the model parameters to obtain a trained target building height estimation model; the pre-built height-level-guided building height estimation model comprises a shared feature extractor and a preset number of decoders, the decoders being used for height-level classification and height regression;
and inputting the high-resolution optical image to be predicted into the target building height estimation model to obtain a height-level class prediction and a building height prediction, and suppressing regions whose height-level class prediction meets a first threshold and regions whose building height prediction is below a second threshold, to obtain the final height prediction result.
In an alternative embodiment, normalizing the nDSM, labeling different height levels, and cropping the optical image and the nDSM to obtain a sample data set comprises:
taking the natural logarithm of the nDSM and mapping it to a preset range using maximum normalization;
classifying the nDSM pixel by pixel according to building height, labeling each class as a different height classification label;
and cropping the optical image, the nDSM, and the height classification labels to a fixed size to obtain the sample data set.
In an alternative embodiment, the shared feature extractor comprises a preset number of convolution units at successive scales, adjacent convolution units being connected through downsampling modules; the method further comprises:
inputting the optical image of the sample data set into the first downsampling module and the first convolution unit to obtain a first feature map;
inputting the first feature map into the second downsampling module and the second convolution unit to obtain a second feature map;
and so on, until every convolution unit has been computed, obtaining an nth feature map;
and applying layer normalization to the first feature map, the second feature map, ..., and the nth feature map respectively to obtain the feature information corresponding to each feature map.
In an alternative embodiment, each convolution unit comprises several cascaded convolution modules, each module being an inverted-bottleneck separable-convolution residual structure;
a feature map X first passes through a depthwise convolution and then two pointwise convolution layers to obtain a feature map X1; X1 is added to the original feature X and fed to the next convolution module or downsampling module. The depthwise convolution is followed by layer normalization, and the two pointwise convolutions are connected through a GELU activation function and Global Response Normalization (GRN). The first pointwise convolution expands the feature dimension to 4 times the original, and the second reduces it back to the original dimension; the feature map X is the output of the previous convolution module or downsampling module.
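The Global Response Normalization step between the two pointwise convolutions can be illustrated with a minimal NumPy sketch (the formula follows the published GRN definition; the learnable gamma and beta parameters are reduced to scalars here purely for illustration):

```python
import numpy as np

def grn(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Global Response Normalization sketch for a feature map x of
    shape (H, W, C). gamma/beta are learnable per-channel vectors in
    practice; scalars here are an illustrative simplification."""
    # Per-channel global L2 norm over the spatial dimensions.
    gx = np.sqrt((x ** 2).sum(axis=(0, 1)))   # shape (C,)
    # Divisive normalization across channels.
    nx = gx / (gx.mean() + eps)               # shape (C,)
    # Scale, shift, and residual connection.
    return gamma * (x * nx) + beta + x
```

Applied to a constant feature map, each channel's norm equals the cross-channel mean, so the normalized response is close to 1 and the output is roughly twice the input.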
In an alternative embodiment, the first convolution unit comprises 3 convolution modules, the second convolution unit comprises 3 convolution modules, the third convolution unit comprises 27 convolution modules, and the fourth convolution unit comprises 3 convolution modules;
correspondingly, the nth feature map is a fourth feature map.
In an alternative embodiment, the preset number of decoders comprises a height-level classification decoder and a height regression decoder; the two decoders comprise a structurally identical part, wherein:
the structurally identical part comprises: first, inputting the fourth feature map into a pyramid pooling module to obtain a fourth enhanced feature;
applying a 1×1 convolution to the third feature map and adding it to the fourth enhanced feature upsampled by a factor of 2, obtaining a third enhanced feature;
applying a 1×1 convolution to the second feature map to raise its dimension and adding it to the third enhanced feature upsampled by a factor of 2, obtaining a second enhanced feature;
applying a 1×1 convolution to the first feature map to raise its dimension and adding it to the second enhanced feature upsampled by a factor of 2, obtaining a first enhanced feature;
upsampling the second, third, and fourth enhanced features to the same size as the first enhanced feature, concatenating them along the channel dimension, fusing the multi-scale features through convolution, and upsampling the fused feature back to the input resolution;
wherein the structurally identical parts of the height-level classification decoder and the height regression decoder do not share model parameters.
In an alternative embodiment, the height-level classification decoder and the height regression decoder also comprise structurally different parts:
the two decoders reduce the fused feature dimension to n classes and 1 channel respectively, and the height regression decoder is followed by a sigmoid activation function that limits its output to a preset interval.
In an alternative embodiment, inputting the optical images of the sample data set into the pre-built height-level-guided building height estimation model, computing losses between the predictions and the ground-truth height-level classes and height values, and iteratively optimizing the model parameters to obtain the trained target building height estimation model comprises:
applying geometric and color augmentation to a preset number of optical remote sensing samples in the sample data set;
inputting the augmented samples into the pre-built height-level-guided building height estimation model to obtain height-level classification scores and normalized height predictions;
computing the degree of difference between the height-level classification scores and the normalized height predictions and their respective ground truths in the sample;
optimizing the model parameters with a preset optimizer according to these differences to obtain an optimized building height estimation model;
and repeating the above steps until the model parameters have been updated for the preset number of rounds, obtaining the target building height estimation model.
In an alternative embodiment, inputting the high-resolution optical image to be predicted into the target building height estimation model to obtain a height-level class prediction and a building height prediction, and suppressing regions whose height-level class prediction meets a first threshold and regions whose building height prediction is below a second threshold to obtain the final height prediction result, comprises:
dividing the high-resolution optical image to be predicted into image blocks of a preset window size with a preset stride, and inputting the blocks into the target building height estimation model to obtain height-level class scores and normalized height estimates;
restoring true height values from the normalized heights (inverting the normalization) to obtain initial height predictions;
determining the height-level class prediction from the class scores via an argmax function, and suppressing regions of the class prediction that meet the first threshold and regions whose initial height prediction is below the second threshold, obtaining the height prediction for the current image block;
and repeating the above steps until the high-resolution optical image to be predicted is fully covered, obtaining the final height prediction result.
In a second aspect, the present application provides a height-level-guided building height estimation device for single-view remote sensing images, comprising:
a sample data set determination module, configured to acquire a high-resolution optical satellite remote sensing image and a normalized digital surface model, normalize the nDSM and label different height levels, and crop the optical image and the nDSM to obtain a sample data set;
a model training module, configured to input the optical images of the sample data set into a pre-built height-level-guided building height estimation model, compute losses between the predictions and the ground-truth height-level classes and height values, and iteratively optimize the model parameters to obtain a trained target building height estimation model; the pre-built model comprises a shared feature extractor and a preset number of decoders, the decoders being used for height-level classification and height regression;
a building height estimation module, configured to input the high-resolution optical image to be predicted into the target building height estimation model to obtain a height-level class prediction and a building height prediction, and suppress regions whose height-level class prediction meets a first threshold and regions whose building height prediction is below a second threshold, obtaining the final height prediction result.
According to the height-level-guided method and device for estimating building heights from single-view remote sensing images, a multi-task learning framework combines the height-level classification and height estimation tasks and exploits their latent consistency to improve the robustness and accuracy of the model; the classification branch uses building height levels as the supervision signal, avoiding additional manual annotation; a top-down pathway with lateral connections enhances the multi-scale semantic features and yields more accurate building boundaries; to address the imbalanced distribution of building heights, the heights are normalized so that the model converges more stably; and the height-level classification result is used as a mask on the height estimate to further suppress signals from non-building regions. The method and device thus improve the robustness and accuracy of building height estimation from single-view remote sensing images.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for estimating building height of a single-view remote sensing image guided by a height level according to an embodiment of the present application;
FIG. 2 is a schematic view of a building height estimation model with height level guidance according to an embodiment of the present application;
FIG. 3 is a flowchart of a specific method for estimating building height of a single-view remote sensing image guided by a height level according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a structure of a height level guided building height estimation model sharing feature extractor according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a convolution module and a downsampling module of a shared feature extractor according to an embodiment of the present application;
FIG. 6 is a block diagram of a height-level-guided single-view remote sensing image building height estimation device according to an embodiment of the present application;
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
Digital city construction relies on three-dimensional urban reconstruction, yet most traditional remote sensing applications focus on extracting two-dimensional surface information and lack three-dimensional information, which hinders understanding of a scene's vertical structure and relationships. Building height is essential basic information in three-dimensional reconstruction, usually expressed as a normalized digital surface model (nDSM).
Currently, digital surface models are mainly obtained through methods such as laser radar (LiDAR), radar interferometry (InSAR), structure from motion (SfM), and stereo-pair photogrammetry. Airborne LiDAR provides high-precision three-dimensional models of targets but has high flight costs and limited coverage, making it difficult to apply over large areas. InSAR requires high-resolution airborne radar or multi-baseline time-series analysis to obtain building heights, placing certain demands on data quality and quantity. SfM estimates a target's three-dimensional structure from an unordered sequence of two-dimensional images, but the reconstructed terrain can suffer from deformation and over-smoothing, and ground control points are needed during acquisition. Stereo-pair photogrammetry requires complex image acquisition techniques and accurate registration of multi-view images. These mainstream height estimation methods all place demands on data quality, technical expertise, and resource investment; building height estimation from single-view remote sensing images offers an economical and fast solution with potential for large-scale application, and is an important research direction for three-dimensional urban reconstruction.
Deep learning is a powerful machine learning technique that plays an irreplaceable role in the vision field. Deep learning methods extract semantic features of an image through convolutional neural networks, mining deep information from complex imagery to the greatest extent, and are widely applied in remote sensing, e.g. land-cover classification, object detection, and change detection. Building height estimation from a single-view remote sensing image is an ill-posed problem, but the strength of convolutional neural networks at extracting high-level features can capture long-range context and model the target's own characteristics and its relationships to other targets. At present, convolutional-network methods for height estimation from single-view remote sensing images mainly fall into two types: (1) directly regressing height with an encoding-decoding structure; (2) adding a land-cover category branch to assist height regression via multi-task learning. The first type suffers from low regression accuracy and slow convergence; the second type currently regresses over many land-cover categories and requires fine manual annotation, and its robustness and prediction accuracy still leave room for improvement.
Based on the above, the embodiments of the present application provide a height-level-guided method and device for estimating building heights from single-view remote sensing images, which adopt a multi-task learning framework combining the height-level classification and height estimation tasks and exploit their latent consistency to improve the robustness and accuracy of the model; the classification branch uses building height levels as the supervision signal, avoiding additional manual annotation; a top-down pathway with lateral connections enhances the multi-scale semantic features and yields more accurate building boundaries; to address the imbalanced distribution of building heights, the heights are normalized so that the model converges more stably; and the height-level classification result is used as a mask on the height estimate to further suppress signals from non-building regions. The method and device thus improve the robustness and accuracy of building height estimation from single-view remote sensing images.
The embodiment of the application provides a method for estimating the building height of a single-view remote sensing image guided by a height grade, which is shown in fig. 1, and mainly comprises the following steps:
step S110, obtaining a high-resolution optical satellite remote sensing image and a normalized digital surface model, normalizing the normalized digital surface model, marking different height grades, and cutting the optical image and the normalized digital surface model to obtain a sample data set.
In one embodiment, normalizing the normalized digital surface model and marking different height grades, and cutting the optical image and the normalized digital surface model to obtain a sample data set, which may include the following steps 1.1 to 1.3:
step 1.1, taking natural logarithms of the normalized digital surface model, and carrying out preset range mapping by adopting maximum value normalization;
step 1.2, classifying the normalized digital surface model pixel by pixel based on the height of the building, and marking each type as a different height classification label respectively;
and 1.3, cutting the optical image, the normalized digital surface model and the high-classification label to a fixed size to obtain a sample data set.
Step S120, inputting the optical images of the sample data set into a pre-built height-level-guided building height estimation model, computing losses between the predictions and the ground-truth height-level classes and height values, and iteratively optimizing the model parameters to obtain a trained target building height estimation model.
The pre-built height-level-guided building height estimation model comprises a shared feature extractor and a preset number of decoders used for height-level classification and height regression; refer to the model structure schematic shown in fig. 2, described in detail below.
The shared feature extractor comprises a preset number of convolution units at successive scales, adjacent convolution units being connected through downsampling modules. The shared feature extractor is used through the following steps (1) to (4):
step (1), inputting the optical image of the sample data set into the first downsampling module and the first convolution unit to obtain a first feature map;
step (2), inputting the first feature map into the second downsampling module and the second convolution unit to obtain a second feature map;
step (3), and so on, until every convolution unit has been computed, obtaining an nth feature map;
and step (4), applying layer normalization to the first feature map, the second feature map, ..., and the nth feature map respectively to obtain the feature information corresponding to each feature map.
Each convolution unit comprises several cascaded convolution modules, each module being an inverted-bottleneck separable-convolution residual structure. The operations within each convolution module are: a feature map X first passes through a depthwise convolution and then two pointwise convolution layers to obtain a feature map X1; X1 is added to the original feature X and fed to the next convolution module or downsampling module. The depthwise convolution is followed by layer normalization, and the two pointwise convolutions are connected through a GELU activation function and Global Response Normalization (GRN). The first pointwise convolution expands the feature dimension to 4 times the original, and the second reduces it back to the original dimension.
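A minimal NumPy sketch of one such convolution module follows. The 3×3 depthwise kernel size, the weight shapes, and the simplified helper functions (no learnable layer-norm or GRN parameters) are assumptions for illustration; a real implementation would use a deep learning framework:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def grn(x, eps=1e-6):
    # Global Response Normalization with a residual connection
    gx = np.sqrt((x ** 2).sum(axis=(0, 1)))
    return x * (gx / (gx.mean() + eps)) + x

def conv_block(x, dw_kernel, w1, w2):
    """One inverted-bottleneck residual block (sketch).
    x: (H, W, C); dw_kernel: (k, k, C) depthwise filter;
    w1: (C, 4C) first pointwise conv; w2: (4C, C) second pointwise conv."""
    h, w, c = x.shape
    k = dw_kernel.shape[0]; p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    # depthwise convolution: each channel filtered independently
    y = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            y[i, j] = (xp[i:i+k, j:j+k] * dw_kernel).sum(axis=(0, 1))
    y = layer_norm(y)
    y = gelu(y @ w1)      # first pointwise conv: C -> 4C, then GELU
    y = grn(y)            # Global Response Normalization
    y = y @ w2            # second pointwise conv: 4C -> C
    return x + y          # residual connection to the input X
```

A pointwise (1×1) convolution acts independently on each pixel's channel vector, which is why it reduces to a matrix product here.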
Further, the first convolution unit includes 3 convolution modules, the second convolution unit includes 3 convolution modules, the third convolution unit includes 27 convolution modules, and the fourth convolution unit includes 3 convolution modules; correspondingly, the nth feature map is a fourth feature map.
In an alternative embodiment, the predetermined number of decoders includes a height level classification decoder and a height regression decoder; the height level classification decoder and the height regression decoder comprise a part of identical structure and a part of different structure, wherein:
1. The parts with the same structure comprise: firstly, inputting a fourth feature map into a pyramid pooling module to obtain fourth enhancement features;
the third characteristic diagram is subjected to 1X 1 convolution and added with the fourth enhancement characteristic which is doubled by up-sampling, so that a third enhancement characteristic is obtained;
the second feature map is subjected to 1X 1 convolution to carry out dimension ascending, and is added with the third enhancement feature which is doubled by up-sampling, so that a second enhancement feature is obtained;
the first feature map is subjected to 1X 1 convolution to carry out dimension ascending, and is added with the second enhancement feature which is doubled by up-sampling, so that the first enhancement feature is obtained;
upsampling the second enhancement feature, the third enhancement feature and the fourth enhancement feature to the same size as the first enhancement feature, splicing in a channel dimension, fusing the multi-scale features through convolution, and upsampling and restoring the fused features;
wherein the structurally identical parts of the height level classification decoder and the height regression decoder do not share model parameters.
2. The structurally different part comprises:
the height level classification decoder and the height regression decoder reduce the dimension of the fused feature to n channels and 1 channel respectively, and the height regression decoder is followed by a sigmoid activation function that limits the output to a preset interval.
Inputting the optical images in the sample data set into the pre-built height-level-guided building height estimation model, calculating the losses of the predictions against the height grade class and the height value respectively, and iteratively optimizing the model parameters to obtain a trained target building height estimation model may, in a specific implementation, comprise the following steps 2.1 to 2.5:
step 2.1, performing geometric and color enhancement on a preset number of optical remote sensing samples in a sample data set;
step 2.2, inputting the enhanced data sample into a pre-built height grade guided building height estimation model to obtain a height grade classification score and a normalized height prediction value;
step 2.3, respectively calculating the difference degree between the height grade classification score and the normalized height predicted value and the true value in the sample;
step 2.4, optimizing parameters in the model according to the difference degree and a preset optimizer to obtain an optimized building height estimation model;
step 2.5, iterating the above steps until the preset number of rounds of model parameter updates is completed, to obtain the target building height estimation model.
Step S130, inputting the high-resolution optical image to be predicted into the target building height estimation model to obtain a height grade class prediction and a building height prediction, and suppressing regions where the height grade class prediction meets a first threshold and regions where the building height prediction is smaller than a second threshold, to obtain the final height prediction result.
In an alternative embodiment, the following steps 3.1 to 3.4 may be included:
step 3.1, dividing the high-resolution optical image to be predicted into image blocks with a preset window size according to a preset step length, and inputting the image blocks into a target building height estimation model to obtain a height grade class score and a normalized height estimation value;
step 3.2, restoring the height value from the normalized height estimate to obtain an initial height prediction;
step 3.3, determining a height grade category predicted value through an argmax function, and suppressing a region meeting a first threshold value and a region with an initial height predicted value smaller than a second threshold value in a height grade category predicted result to obtain a height predicted value of the current image block;
and 3.4, iteratively calculating the steps until the high-resolution optical image to be predicted is completely covered, and obtaining a final height prediction result.
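The tiling loop in steps 3.1 to 3.4 can be sketched as follows. This is an illustrative Python helper, not code from the patent; in particular, shifting the final window left so that it ends exactly at the image edge is an assumption about how "completely covered" is achieved.

```python
def tile_origins(length, window, stride):
    """Window start offsets covering [0, length) with the given stride.

    The last window is shifted left so it ends exactly at the image edge,
    mirroring the 'until the image is completely covered' condition.
    """
    if length <= window:
        return [0]
    origins = list(range(0, length - window + 1, stride))
    if origins[-1] + window < length:
        origins.append(length - window)
    return origins


def tile_image(height, width, window=512, stride=512):
    """(row, col) origins of all blocks for an H x W image."""
    return [(r, c)
            for r in tile_origins(height, window, stride)
            for c in tile_origins(width, window, stride)]
```

Each origin pair then indexes one image block to be fed through the model; adjacent blocks overlap whenever the stride is smaller than the window, or at the shifted final window.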
In view of the above method, the present application further provides a specific embodiment, and referring to fig. 3, the method may include the following steps S101 to S104:
step S101: acquiring a high-resolution optical satellite remote sensing image and a normalized digital surface model (nDSM), carrying out normalization processing on the nDSM, marking different height grades, and cutting the optical image and the processed nDSM to obtain sample data;
In real scenes, building heights essentially follow a long-tailed distribution: most buildings are concentrated at lower height levels, while tall and super-tall buildings are rare. This poses a challenge for building height regression—models cannot achieve good performance on high-rise buildings, and the imbalance also degrades the height regression of low-rise buildings. Therefore, the nDSM is normalized and buildings are divided into different grades according to height, which is used for the subsequent building height grade classification. The processing specifically comprises:
step 1a: taking the natural logarithm of the nDSM, which maps the skewed building heights toward a normal distribution, stretches them into a range that is easier to regress, and reduces the influence of the few super-tall buildings. For example, a building 100 m tall has a logarithmic height value of only 4.6. Areas below 1 m in height become negative after taking the logarithm and are set to 0, i.e., such areas are considered not to belong to a building; this accords with common understanding of building heights and also avoids the adverse effect of negative values on model optimization;
step 1b: dividing the natural logarithmic height obtained in the step 1a by the maximum value thereof, limiting the value in the range of [0,1], facilitating the regression of the model height, and recording the maximum value M;
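Steps 1a and 1b can be sketched with NumPy as below. This is a hypothetical helper, not the patent's code; the guard against log(0) and the exact rule for clipping sub-1 m areas are assumptions, and the function assumes at least one pixel above 1 m so that M is nonzero.

```python
import numpy as np

def normalize_ndsm(ndsm):
    """Steps 1a/1b: log-stretch the nDSM, zero out sub-1 m areas,
    then scale to [0, 1] by the recorded maximum M.

    Assumes the scene contains at least one pixel above 1 m (M > 0).
    """
    log_h = np.log(np.maximum(ndsm, 1e-12))  # natural logarithm; guard log(0)
    log_h[log_h < 0] = 0.0                   # heights below 1 m -> non-building
    M = float(log_h.max())                   # recorded for later restoration
    return log_h / M, M
```

A 100 m building maps to log(100) ≈ 4.6 before scaling, matching the example in step 1a.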
The normalized nDSM obtained through steps 1a and 1b is processed as follows:
step 1c: the nDSM is divided into n classes pixel by pixel according to the height of the building, and marked as different labels respectively. In practical application, reasonable height division sections can be selected in combination with the distribution of building heights, in the embodiment, the heights are divided into 4 categories, namely non-building, low-rise building, middle-rise building and high-rise building, and the corresponding height sections are respectively [0, 1e-6 ], [1e-6, 24 ], [24, 50 ], [50, 187);
step 1d: clipping the obtained optical image, the normalized nsm obtained in step 1b and the high-level classification label obtained in step 1c to a fixed size to obtain a sample data set, wherein the sample size is 512×512 in this example.
Step S102, building a height level guided building height estimation multi-task model, wherein the model is provided with a shared feature extractor and two similar decoders, and the two decoders are respectively used for height level classification and height regression;
The feature extractor of the model comprises convolution units at 4 scales, connected through downsampling modules that control the changes in feature-map resolution and dimension. Referring to fig. 4, the image is input to an initial downsampling module, downsampled by a factor of 4 and raised to 128 channels, then enters the first convolution unit to obtain feature map F1; F1 is input to the second downsampling module, downsampled by a factor of 2 and raised to 256 channels, then enters the second convolution unit to obtain feature map F2; F2 is input to the third downsampling module, downsampled by a factor of 2 and raised to 512 channels, then enters the third convolution unit to obtain feature map F3; F3 is input to the fourth downsampling module, downsampled by a factor of 2 and raised to 1024 channels, then enters the fourth convolution unit to obtain feature map F4. Layer normalization is applied to F1–F4 to obtain the features at 4 scales, namely B1, B2, B3 and B4.
Further, the initial downsampling module is composed of a convolution layer with a convolution kernel size of 4×4 and a step length of 4 and an LN layer, and the second, third and fourth downsampling modules are composed of the LN layer and a convolution layer with a convolution kernel size of 2×2 and a step length of 2.
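Assuming a 512×512 input, the channel/resolution schedule implied by the downsampling description can be tabulated programmatically. This is a bookkeeping sketch only, not model code:

```python
def extractor_shapes(in_size=512):
    """(channels, spatial size) of F1..F4 per the described schedule:
    initial 4x downsample to 128 channels, then three 2x stages that
    double the channel count each time."""
    shapes, size, ch = [], in_size // 4, 128
    for _ in range(4):
        shapes.append((ch, size))
        ch, size = ch * 2, size // 2
    return shapes
```

For a 512×512 sample this gives F1 at 128×128 with 128 channels down to F4 at 16×16 with 1024 channels.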
Further, the first convolution unit includes 3 convolution modules, the second convolution unit includes 3 convolution modules, the third convolution unit includes 27 convolution modules, and the fourth convolution unit includes 3 convolution modules.
Further, the convolution modules within each convolution unit are cascade-connected, and each module adopts an inverted-bottleneck separable-convolution residual structure: first a depth-wise convolution with kernel size 7×7 and stride 1, followed by layer normalization; then a 1×1 point-wise convolution that expands the feature dimension to 4 times the original; then a GELU activation function and Global Response Normalization (GRN); then a second 1×1 point-wise convolution that reduces the feature dimension back to the original, yielding feature map X1. X1 is added to the original feature X to obtain a new feature map Y, which is input to the next convolution module or downsampling module; the feature map X is the output feature of the previous convolution module or downsampling module. See fig. 5.
The model contains two decoders, for height grade classification and height regression respectively; the height grade classification decoder D1 and the height regression decoder D2 are similar. The structurally identical part of D1 and D2 is: first, feature map B4 is fed into a pyramid pooling module to obtain enhanced feature E4; feature map B3 passes through a 1×1 convolution and is added to E4 upsampled by a factor of 2 to obtain enhanced feature E3; feature map B2 is raised to 512 dimensions by a 1×1 convolution and added to E3 upsampled by a factor of 2 to obtain enhanced feature E2; feature map B1 is raised to 512 dimensions by a 1×1 convolution and added to E2 upsampled by a factor of 2 to obtain enhanced feature E1. E2, E3 and E4 are upsampled to the size of E1 and concatenated along the channel dimension; the multi-scale features are fused through a 3×3 convolution, and the fused features are upsampled by a factor of 4 back to the original image size.
The structurally different part of decoders D1 and D2 is: D1 and D2 reduce the dimension of the fused feature to n channels and 1 channel respectively, and D2 is followed by a sigmoid activation function that limits the output to the range [0,1]. Note that the structurally identical parts of the two decoders do not share model parameters. In this example, n is 4.
Further, pyramid pooling pools feature map B4 to sizes 1×1, 2×2, 3×3 and 6×6 respectively; the features at the four sizes are reduced to 512 dimensions through 1×1 convolutions, upsampled back to the original size of B4, concatenated along the channel dimension, and finally fused through a 3×3 convolution to obtain the enhanced feature E4, whose channel count is 512.
Step S103: inputting the optical images in the sample data set into the network model constructed in the step 102, calculating losses of the predicted values and the height grade type and the height values respectively, and iteratively optimizing model parameters;
the method specifically comprises the following steps:
step 3a: performing data enhancement, namely performing geometric and color enhancement on a preset number of optical remote sensing samples in the sample data set obtained in the step 101, and performing corresponding geometric transformation on the height level label and the normalized nDSM at the same time;
further, geometric enhancement includes scaling, flipping, cropping, etc., and color enhancement includes channel normalization, photometric distortion, etc. In this example, 2 samples are selected each time to participate in the training.
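Consistent geometric transformation of the image and both supervision targets (step 3a) can be illustrated with a horizontal flip. This is a minimal sketch with a hypothetical `rng` interface, not the patent's augmentation pipeline:

```python
import numpy as np

def random_hflip(image, class_label, ndsm, rng):
    """Apply the same horizontal flip to the image, the height class
    labels and the normalized nDSM, so geometry stays aligned."""
    if rng.random() < 0.5:
        image = image[:, ::-1].copy()
        class_label = class_label[:, ::-1].copy()
        ndsm = ndsm[:, ::-1].copy()
    return image, class_label, ndsm
```

The key design point is that a single random draw drives all three arrays; drawing separately per array would silently misalign labels and heights.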
Step 3b: the forward estimation is carried out, a sample with enhanced data is input into a building height estimation model guided by a height grade, and a height grade classification score S and a normalized height prediction value H are obtained;
step 3c: loss calculation, computing the discrepancy between the height grade classification score S and the class label, and between the normalized height prediction H and the normalized true height. Cross-entropy loss and smooth L1 loss are used for the two tasks respectively, and the total loss is the weighted sum of the two:

$L = \lambda_{cls}\,L_{ce} + \lambda_{reg}\,L_{s1}$

The cross-entropy loss is calculated as:

$L_{ce} = -\sum_{c=1}^{C} y_c \log p_c$

where $C$ is the number of classes, $p_c$ is the probability predicted for height grade class $c$, and $y_c$ is a one-hot indicator that is 1 when $c$ equals the sample's class and 0 otherwise.

The smooth L1 loss is calculated as:

$L_{s1} = \begin{cases} 0.5\,(h - \hat{h})^2, & |h - \hat{h}| < 1 \\ |h - \hat{h}| - 0.5, & \text{otherwise} \end{cases}$

where $h$ is the normalized true height and $\hat{h}$ is the height estimate; since both lie in $[0,1]$, only the first (quadratic) branch of the formula is ever used.

In this example, the coefficient of the cross-entropy loss is 5 and the coefficient of the smooth L1 loss is 30.
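The weighted loss described above can be sketched per pixel in NumPy. The function names are illustrative, and the class-probability input is assumed to be already softmax-normalized:

```python
import numpy as np

def cross_entropy(p, onehot):
    """L_ce = -sum_c y_c log(p_c) for one pixel; p is a probability vector."""
    return -float(np.sum(onehot * np.log(p)))

def smooth_l1(h_true, h_pred):
    """Smooth L1; with both heights in [0, 1] only the quadratic branch applies."""
    d = abs(h_true - h_pred)
    return 0.5 * d * d if d < 1.0 else d - 0.5

def total_loss(p, onehot, h_true, h_pred, w_cls=5.0, w_reg=30.0):
    """Weighted sum with the coefficients quoted in this example (5 and 30)."""
    return w_cls * cross_entropy(p, onehot) + w_reg * smooth_l1(h_true, h_pred)
```

In practice both terms are averaged over all pixels of a batch before weighting; this sketch shows a single pixel for clarity.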
Step 3d: backward propagation, optimizing parameters in the model based on the sum of the losses and an AdamW optimizer, and carrying out backward propagation gradient to obtain an optimized height estimation model;
the AdamW optimizer is specifically calculated as follows:

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$

$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$

$\hat{m}_t = m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t)$

$\theta_t = \theta_{t-1} - \eta\left( \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda\,\theta_{t-1} \right)$

where $t$ is the current step, $g_t$ is the gradient, $\theta$ denotes the model parameters, $m$ is the first momentum, $v$ is the second momentum, and $\eta$ is the learning rate. The momenta are bias-corrected: because they are initialized to 0 and $\beta_1$, $\beta_2$ are close to 1, $m_t$ and $v_t$ are biased toward small values early in training; dividing by $1-\beta^t$, which is small in the first few iterations, enlarges the corrected momenta, while as the iteration count grows $\beta^t$ approaches 0 and $\hat{m}_t$, $\hat{v}_t$ become essentially equal to $m_t$ and $v_t$. In this example, the initial learning rate $\eta$ is 0.001, $\beta_1$ and $\beta_2$ are 0.9 and 0.999 respectively, and the weight decay factor $\lambda$ is set to 0.05. New model weights are obtained through AdamW optimization.
The steps 3a–3d are executed iteratively until the preset number of iterations is reached, giving the final building height estimation model. In this embodiment, training runs for 160000 iterations in total, and the initial model weights are pre-trained on the ImageNet dataset.
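A single scalar AdamW update with bias correction, using the hyperparameters quoted in this example (learning rate 0.001, betas 0.9/0.999, weight decay 0.05), can be written as follows. This is a pedagogical sketch, not the framework implementation; the epsilon value is an assumption:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.05):
    """One AdamW update for a scalar parameter (step 3d).

    Bias correction divides the momenta by (1 - beta^t), which enlarges
    them in early iterations and fades out as t grows. Weight decay is
    decoupled: it acts on theta directly, not through the gradient.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

At t = 1 the corrections exactly cancel the (1 - beta) factors, so the first step moves by roughly the full learning rate regardless of how small the raw momenta are.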
Step 104: inputting the remote sensing image to be predicted into the height estimation model obtained in the step 103, and performing post-processing to obtain final height prediction;
the method specifically comprises the following steps:
step 4a: dividing the remote sensing image into image blocks of a preset window size according to a preset step length, and inputting the blocks into the building height estimation model obtained in step 103 to obtain a height grade class score S and a normalized height estimate H1. In this embodiment, the image is divided into 512×512 blocks, which are enlarged to 640×640 before being input into the model;
step 4b: multiplying the normalized height estimate H1 by the maximum value M recorded in step 1b to restore the original range, then applying the natural exponential to obtain the height estimate H2;
step 4c: determining the height grade class prediction from the score S via the argmax function, and suppressing regions whose predicted class is 0 as well as regions where H2 is below 2 m, to obtain the height prediction H of the current image block;
step 4d: iteratively executing steps 4a–4c until the remote sensing image to be predicted is completely covered. The height prediction results are written to the corresponding positions of the original image; for regions predicted repeatedly by two adjacent blocks, the larger height estimate is kept.
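Steps 4b and 4c combined can be sketched in NumPy as follows. Note that a normalized height of 0 restores to exp(0) = 1 m, which the 2 m floor then suppresses. The array layout (class scores on the leading axis) is an assumption for illustration:

```python
import numpy as np

def restore_and_mask(h_norm, scores, M, min_height=2.0):
    """Steps 4b/4c: undo the normalisation (multiply by M, then exp) and
    zero out pixels classified as non-building or below the height floor.

    h_norm: (H, W) normalized heights in [0, 1]
    scores: (n_classes, H, W) height grade class scores
    M:      maximum log-height recorded in step 1b
    """
    h = np.exp(h_norm * M)            # back to metres
    cls = np.argmax(scores, axis=0)   # height grade class per pixel
    h[(cls == 0) | (h < min_height)] = 0.0
    return h
```

Using the class mask and the height floor together means a pixel survives only if both heads agree it is a building, which is the non-building suppression described in the summary below.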
With the above model structure, the building height estimation evaluation index on the test data set reaches 0.8.
In summary, the height-grade-guided single-view remote sensing image building height estimation method provided by the embodiments of the application adopts a multi-task learning framework that combines the height grade classification and height estimation tasks and exploits the latent consistency between the two to improve the robustness and accuracy of the model. The classification branch uses the building height grade as a supervision signal, avoiding additional manual labeling; the top-down pathway and lateral connections enhance the multi-scale semantic features and yield more accurate building boundaries; to address the imbalanced distribution of building heights, the heights are normalized so that the model converges more stably; and the obtained height grade classification result is used as a mask on the height estimate to further suppress signals from non-building areas. The method and device can thus improve the robustness and accuracy of building height estimation from single-view remote sensing images.
Based on the above method embodiment, the embodiment of the present application further provides a device for estimating building height of a single-view remote sensing image guided by a height level, as shown in fig. 6, where the device includes the following parts:
the sample data set determining module 610 is configured to obtain a high-resolution optical satellite remote sensing image and a normalized digital surface model, normalize the normalized digital surface model and mark different height levels, and cut the optical image and the normalized digital surface model after normalization to obtain a sample data set;
the model training module 620 is configured to input the optical image in the sample data set to a pre-built height level guided building height estimation model, and calculate losses of the predicted value and the height level class and the height value respectively, and iterate and optimize model parameters to obtain a trained target building height estimation model; the pre-built height level guided building height estimation model comprises a shared feature extractor and a preset number of decoders, wherein the decoders are used for height level classification and height regression;
the building height estimation module 630 is configured to input the high-resolution optical image to be predicted into a target building height estimation model to obtain a height class predicted value and a building height predicted value, and inhibit a region where the height class predicted value meets a first threshold and a region where the building height predicted value is less than a second threshold to obtain a final height predicted result.
In a possible embodiment, the sample data set determining module 610 is further configured to:
taking natural logarithm of the normalized digital surface model, and carrying out preset range mapping by adopting maximum normalization;
classifying the normalized digital surface model pixel by pixel based on the height of the building, each class being respectively labeled as a different height classification label;
and clipping the optical image, the normalized digital surface model and the height classification labels to a fixed size to obtain the sample data set.
In a possible implementation manner, the shared feature extractor comprises convolution units with preset scales, and each convolution unit is connected through a downsampling module; the device further comprises a feature extraction module for:
inputting the optical image in the sample data set into a first downsampling module and a first convolution unit to obtain a first feature map;
inputting the first characteristic diagram into a second downsampling module and a second convolution unit to obtain a second characteristic diagram;
and the same is repeated until each convolution unit is calculated, and an nth characteristic diagram is obtained;
and respectively carrying out layer normalization on the first feature map, the second feature map, … … and the nth feature map to obtain feature information corresponding to each feature map.
In a possible implementation manner, each convolution unit comprises a plurality of convolution modules, the convolution modules in each convolution unit are connected in cascade, and the convolution modules are in an inverse bottleneck separable convolution residual error structure;
The input feature map X first undergoes a depth-wise convolution, followed by two point-wise convolutions that produce a feature map X1; X1 is added to the original feature X and the sum is passed to the next convolution module or downsampling module. Layer normalization follows the depth-wise convolution, and the two point-wise convolutions are connected through a GELU activation function and Global Response Normalization (GRN). The first point-wise convolution expands the feature dimension to 4 times the original, and the second reduces it back to the original dimension; the feature map X is the output feature of the previous convolution module or downsampling module.
In a possible embodiment, the first convolution unit comprises 3 convolution modules, the second convolution unit comprises 3 convolution modules, the third convolution unit comprises 27 convolution modules, and the fourth convolution unit comprises 3 convolution modules;
correspondingly, the nth feature map is a fourth feature map.
In a possible embodiment, the predetermined number of decoders includes a height level classification decoder and a height regression decoder; the height level classification decoder and the height regression decoder comprise structurally identical parts, wherein:
The parts with the same structure comprise: firstly, inputting a fourth feature map into a pyramid pooling module to obtain fourth enhancement features;
the third characteristic diagram is subjected to 1X 1 convolution and added with the fourth enhancement characteristic which is doubled by up-sampling, so that a third enhancement characteristic is obtained;
the second feature map is subjected to 1X 1 convolution to carry out dimension ascending, and is added with the third enhancement feature which is doubled by up-sampling, so that a second enhancement feature is obtained;
the first feature map is subjected to 1X 1 convolution to carry out dimension ascending, and is added with the second enhancement feature which is doubled by up-sampling, so that the first enhancement feature is obtained;
upsampling the second enhancement feature, the third enhancement feature and the fourth enhancement feature to the same size as the first enhancement feature, splicing in a channel dimension, fusing the multi-scale features through convolution, and upsampling and restoring the fused features;
wherein the same structural parts of the high-level classification decoder and the high-level regression decoder do not share model parameters.
In a possible embodiment, the height level classification decoder and the height regression decoder further comprise structurally different parts, including:
the height level classification decoder and the height regression decoder reduce the dimension of the fused feature to n channels and 1 channel respectively, and the height regression decoder is followed by a sigmoid activation function that limits the output to a preset interval.
In a possible implementation, the model training module 620 is further configured to:
performing geometric and color enhancement on a preset number of optical remote sensing samples in a sample data set;
inputting the enhanced data sample into a pre-built height level guided building height estimation model to obtain a height level classification score and a normalized height prediction value;
respectively calculating the classification score of the height grade and the difference degree between the normalized height predicted value and the true value in the sample;
optimizing parameters in the model according to the difference degree and a preset optimizer to obtain an optimized building height estimation model;
and iterating the above steps until the preset number of rounds of model parameter updates is completed, to obtain the target building height estimation model.
In a possible embodiment, the building height estimation module 630 is further configured to:
dividing the high-resolution optical image to be predicted into image blocks with a preset window size according to a preset step length, and inputting the image blocks into a target building height estimation model to obtain a height grade class score and a normalized height estimation value;
carrying out height value reduction based on the normalized height value to obtain an initial height predicted value;
Determining a height class predicted value of the height class score through an argmax function, and suppressing a region meeting a first threshold value and a region with an initial height predicted value smaller than a second threshold value in the height class predicted result to obtain the height predicted value of the current image block;
and iteratively calculating the steps until the high-resolution optical image to be predicted is completely covered, so as to obtain a final height prediction result.
The implementation principle and the generated technical effects of the height-level-guided single-view remote sensing image building height estimation device provided by the embodiment of the application are the same as those of the embodiment of the method, and for the sake of brief description, reference can be made to corresponding contents in the embodiment of the height-level-guided single-view remote sensing image building height estimation method where the embodiment of the height-level-guided single-view remote sensing image building height estimation device is not mentioned.
An embodiment of the present application further provides an electronic device, as shown in fig. 7, which is a schematic structural diagram of the electronic device, where the electronic device 100 includes a processor 71 and a memory 70, the memory 70 stores computer executable instructions that can be executed by the processor 71, and the processor 71 executes the computer executable instructions to implement any one of the above-mentioned height level guided single view remote sensing image building height estimation methods.
In the embodiment shown in fig. 7, the electronic device further comprises a bus 72 and a communication interface 73, wherein the processor 71, the communication interface 73 and the memory 70 are connected by the bus 72.
The memory 70 may include a high-speed random access memory (RAM, random Access Memory), and may further include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. The communication connection between the system network element and the at least one other network element is achieved via at least one communication interface 73 (which may be wired or wireless), which may use the internet, a wide area network, a local network, a metropolitan area network, etc. Bus 72 may be an ISA (Industry Standard Architecture ) bus, PCI (Peripheral Component Interconnect, peripheral component interconnect standard) bus, or EISA (Extended Industry Standard Architecture ) bus, among others. The bus 72 may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one bi-directional arrow is shown in FIG. 7, but not only one bus or type of bus.
The processor 71 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or instructions in software in the processor 71. The processor 71 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory, and the processor 71 reads the information in the memory, and combines the hardware to complete the steps of the height level guided single view remote sensing image building height estimation method of the foregoing embodiment.
The embodiment of the application also provides a computer readable storage medium, which stores computer executable instructions that, when being called and executed by a processor, cause the processor to implement the above-mentioned method for estimating the building height of the single-view remote sensing image guided by the height level, and the specific implementation can refer to the foregoing method embodiment, and will not be described herein.
The computer program product of the method and apparatus for estimating building height of single view remote sensing image guided by height level provided by the embodiment of the application includes a computer readable storage medium storing program codes, and the instructions included in the program codes can be used to execute the method described in the foregoing method embodiment, and specific implementation can be referred to the method embodiment and will not be repeated herein.
The relative steps, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In the description of the present application, it should be noted that, directions or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., are directions or positional relationships based on those shown in the drawings, or are directions or positional relationships conventionally put in use of the inventive product, are merely for convenience of describing the present application and simplifying the description, and are not indicative or implying that the apparatus or element to be referred to must have a specific direction, be constructed and operated in a specific direction, and thus should not be construed as limiting the present application. Furthermore, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (10)

1. A height level-guided single-view remote sensing image building height estimation method, characterized by comprising the following steps:
acquiring a high-resolution optical satellite remote sensing image and a normalized digital surface model, normalizing the normalized digital surface model and marking different height levels, and cropping the optical image and the normalized digital surface model to obtain a sample data set;
inputting the optical images in the sample data set into a pre-built height level-guided building height estimation model, calculating losses between the predicted values and the height level categories and height values respectively, and iteratively optimizing model parameters to obtain a trained target building height estimation model; the pre-built height level-guided building height estimation model comprises a shared feature extractor and a preset number of decoders, the decoders being used for height level classification and height regression;
inputting the high-resolution optical image to be predicted into the target building height estimation model to obtain a height level class predicted value and a building height predicted value, and suppressing the region where the height level class predicted value satisfies a first threshold and the region where the building height predicted value is smaller than a second threshold, to obtain a final height prediction result;
wherein normalizing the normalized digital surface model and marking different height levels comprises:
step 1a: taking the natural logarithm of the normalized digital surface model to map the skewed building height distribution toward a normal distribution;
step 1b: dividing the natural logarithmic height obtained in step 1a by its maximum value, limiting the values to the range [0, 1];
the normalized digital surface model nDSM is processed through steps 1a and 1b by the formula nDSM' = ln(H) / max(ln(H)), where H denotes the building height in the normalized digital surface model and nDSM' denotes the normalized result;
step 1c: dividing the normalized digital surface model pixel by pixel into n classes according to building height, each class being marked with a different label; the height classes comprise non-building, low-rise building, mid-rise building and high-rise building.
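As an illustrative sketch (not part of the claimed method), the preprocessing of steps 1a–1c can be expressed as follows. The use of `log1p` (so that zero heights stay zero) and the metre thresholds separating the four height classes are assumptions, since the claim does not fix them:

```python
import numpy as np

def normalize_ndsm(ndsm):
    """Steps 1a-1b: natural log, then divide by the maximum to reach [0, 1].
    log1p is assumed so that zero-height (non-building) pixels map to zero."""
    log_h = np.log1p(ndsm)
    max_val = log_h.max()
    return log_h / max_val if max_val > 0 else log_h

def label_height_classes(ndsm, bins=(0.0, 12.0, 27.0)):
    """Step 1c: pixel-wise labels 0..3 for non-building / low-rise /
    mid-rise / high-rise. The metre thresholds are illustrative assumptions."""
    return np.digitize(ndsm, bins, right=True)

heights = np.array([0.0, 3.0, 20.0, 80.0])   # metres
norm = normalize_ndsm(heights)                # monotone, in [0, 1]
labels = label_height_classes(heights)        # [0, 1, 2, 3]
```

With `right=True`, a pixel of exactly 0 m falls in the non-building class, and each threshold is inclusive on its upper side.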
2. The height level-guided single-view remote sensing image building height estimation method according to claim 1, wherein normalizing the normalized digital surface model and marking different height levels, and cropping the optical image and the normalized digital surface model to obtain the sample data set, comprises:
taking natural logarithm of the normalized digital surface model, and carrying out preset range mapping by adopting maximum normalization;
classifying the normalized digital surface model pixel by pixel based on building height, each class being marked with a different height classification label; and
cropping the optical image, the normalized digital surface model and the height classification labels to a fixed size to obtain the sample data set.
3. The height level-guided single-view remote sensing image building height estimation method according to claim 1, wherein the shared feature extractor comprises convolution units of a preset scale, and each convolution unit is connected through a downsampling module; the method further comprises the steps of:
inputting the optical image in the sample data set into a first downsampling module and a first convolution unit to obtain a first feature map;
inputting the first feature map to a second downsampling module and a second convolution unit to obtain a second feature map;
and so on, until every convolution unit has been computed, obtaining an nth feature map;
and respectively carrying out layer normalization on the first feature map, the second feature map, …, and the nth feature map to obtain feature information corresponding to each feature map.
4. The height level-guided single-view remote sensing image building height estimation method according to claim 3, wherein each convolution unit comprises a plurality of convolution modules, the convolution modules in each convolution unit are connected in cascade, and the convolution modules are of an inverse bottleneck separable convolution residual structure;
a feature map X first undergoes a depthwise convolution and then two layers of point-wise convolution to obtain a feature map X1, and X1 is added to the original feature X and input to the next convolution module or downsampling module; the depthwise convolution is followed by layer normalization, and the two point-wise convolutions are connected through a GELU activation function and Global Response Normalization (GRN); the first point-wise convolution increases the feature dimension to 4 times that of the original feature, and the second point-wise convolution reduces it back to the original dimension; the feature map X is the output feature of the previous convolution module or downsampling module.
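The Global Response Normalization named in claim 4 can be sketched as follows. This is an illustrative NumPy version operating on a single (H, W, C) feature map; the scalar `gamma`/`beta` are a simplification of what would normally be learnable per-channel parameters:

```python
import numpy as np

def global_response_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """GRN between the two point-wise convolutions of an
    inverted-bottleneck block. x: feature map of shape (H, W, C)."""
    gx = np.sqrt((x ** 2).sum(axis=(0, 1)))   # per-channel L2 norm over space
    nx = gx / (gx.mean() + eps)               # divisive normalization across channels
    return gamma * (x * nx) + beta + x        # scale, shift, and residual connection

out = global_response_norm(np.ones((2, 2, 3)))  # shape preserved: (2, 2, 3)
```

When all channels carry equal energy (as in the constant input above), the normalization factor is about 1 and the residual simply doubles the activation; channels with above-average energy are amplified relative to the rest.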
5. The height level-guided single view remote sensing image building height estimation method according to claim 3 or 4, wherein the first convolution unit comprises 3 convolution modules, the second convolution unit comprises 3 convolution modules, the third convolution unit comprises 27 convolution modules, and the fourth convolution unit comprises 3 convolution modules;
correspondingly, the nth feature map is a fourth feature map.
6. The method for building height estimation of height level-guided single view remote sensing images according to claim 5, wherein the predetermined number of decoders comprises a height level classification decoder and a height regression decoder; the height level classification decoder and the height regression decoder comprise structurally identical parts, wherein:
the structurally identical parts comprise: firstly, inputting the fourth feature map into a pyramid pooling module to obtain a fourth enhancement feature;
subjecting the third feature map to a 1×1 convolution and adding it to the fourth enhancement feature upsampled by a factor of 2, to obtain a third enhancement feature;
subjecting the second feature map to a 1×1 convolution for dimension raising and adding it to the third enhancement feature upsampled by a factor of 2, to obtain a second enhancement feature;
subjecting the first feature map to a 1×1 convolution for dimension raising and adding it to the second enhancement feature upsampled by a factor of 2, to obtain a first enhancement feature;
upsampling the second, third and fourth enhancement features to the same size as the first enhancement feature, concatenating them along the channel dimension, fusing the multi-scale features through convolution, and upsampling the fused features to restore resolution;
wherein the same parts of the height level classification decoder and the height regression decoder structure do not share model parameters.
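The top-down fusion of claim 6 can be sketched with nearest-neighbour upsampling. In this illustrative version the 1×1 lateral convolutions and the pyramid pooling module are assumed to have been applied already, so all inputs share one channel dimension:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse_pyramid(f1, f2, f3, f4):
    """Top-down fusion: each deeper enhancement feature is upsampled 2x and
    added to the shallower one; finally every level is brought to the finest
    resolution and concatenated channel-wise (the fusing convolution is omitted)."""
    e4 = f4
    e3 = f3 + upsample2x(e4)
    e2 = f2 + upsample2x(e3)
    e1 = f1 + upsample2x(e2)
    up = lambda x, k: x.repeat(k, axis=0).repeat(k, axis=1)
    return np.concatenate([e1, up(e2, 2), up(e3, 4), up(e4, 8)], axis=2)

fused = fuse_pyramid(np.ones((16, 16, 8)), np.ones((8, 8, 8)),
                     np.ones((4, 4, 8)), np.ones((2, 2, 8)))
# fused.shape == (16, 16, 32)
```

With all-ones inputs the channel groups carry the accumulated sums 4, 3, 2, 1, showing how each pyramid level contributes once per top-down step.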
7. The method of claim 6, wherein the height level classification decoder and the height regression decoder further comprise structurally different portions, comprising:
the height level classification decoder and the height regression decoder reduce the dimension of the fused features to n classes and 1 channel respectively, and the height regression decoder is followed by a sigmoid activation function limiting the output to a preset interval.
8. The height level-guided single-view remote sensing image building height estimation method according to claim 1, wherein inputting the optical images in the sample data set into the pre-built height level-guided building height estimation model, calculating losses between the predicted values and the height level categories and height values respectively, and iteratively optimizing model parameters to obtain the trained target building height estimation model, comprises:
performing geometric and color enhancement on a preset number of optical remote sensing samples in the sample data set;
inputting the enhanced data sample into a pre-built height level guided building height estimation model to obtain a height level classification score and a normalized height prediction value;
respectively calculating the difference degree between the height grade classification score and the normalized height predicted value and the true value in the sample;
optimizing parameters in the model according to the difference degree and a preset optimizer to obtain an optimized building height estimation model;
and iteratively executing the above steps until the preset number of rounds of model parameter updates is completed, to obtain the target building height estimation model.
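A minimal sketch of the joint objective implied by claim 8: per-pixel cross-entropy for the height level head plus a regression term for the height head. The equal loss weights and the choice of L1 for the regression term are assumptions; the claim only states that both losses are computed:

```python
import numpy as np

def joint_loss(class_logits, height_pred, class_gt, height_gt,
               w_cls=1.0, w_reg=1.0):
    """class_logits: (N, K) raw scores; class_gt: (N,) integer labels;
    height_pred / height_gt: (N,) normalized heights in [0, 1]."""
    z = class_logits - class_logits.max(axis=1, keepdims=True)  # stability shift
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(class_gt)), class_gt].mean()  # cross-entropy
    l1 = np.abs(height_pred - height_gt).mean()                 # regression term
    return w_cls * ce + w_reg * l1
```

For uniform logits over K classes and a perfect height prediction, the loss reduces to ln K, a handy sanity check when wiring up training.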
9. The height level-guided single-view remote sensing image building height estimation method according to claim 1, wherein inputting the high-resolution optical image to be predicted into the target building height estimation model to obtain the height level class predicted value and the building height predicted value, and suppressing the region where the height level class predicted value satisfies the first threshold and the region where the building height predicted value is smaller than the second threshold to obtain the final height prediction result, comprises:
dividing the high-resolution optical image to be predicted into image blocks of a preset window size according to a preset step length, and inputting the image blocks into the target building height estimation model to obtain height level class scores and a normalized height estimation value;
restoring the height value from the normalized height value to obtain an initial height predicted value;
determining the height level class predicted value from the height level class scores through an argmax function, and suppressing the region satisfying the first threshold and the region where the initial height predicted value is smaller than the second threshold in the height level prediction result, to obtain the height predicted value of the current image block;
and iteratively calculating the steps until the high-resolution optical image to be predicted is completely covered, so as to obtain a final height prediction result.
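The suppression step of claim 9 can be sketched as a per-pixel mask. Here the non-building class index (as the first threshold) and a 2 m minimum height (as the second threshold) are illustrative assumptions:

```python
import numpy as np

def suppress(height_pred, class_scores, non_building_class=0, min_height=2.0):
    """Zero out heights where the argmax height level is the non-building
    class, or where the predicted height falls below the minimum."""
    level = class_scores.argmax(axis=-1)          # per-pixel height level
    out = height_pred.copy()
    out[(level == non_building_class) | (height_pred < min_height)] = 0.0
    return out

heights = np.array([1.0, 10.0, 30.0])
scores = np.array([[0.1, 0.9],    # building, but below min height -> 0
                   [0.9, 0.1],    # non-building class wins        -> 0
                   [0.2, 0.8]])   # building, plausible height     -> kept
result = suppress(heights, scores)
```

Running each image block through this mask and mosaicking the blocks over the full scene yields the final height prediction result.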
10. A height level-guided single-view remote sensing image building height estimation device, characterized by comprising:
a sample data set determining module, used for acquiring a high-resolution optical satellite remote sensing image and a normalized digital surface model, normalizing the normalized digital surface model and marking different height levels, and cropping the optical image and the normalized digital surface model to obtain a sample data set;
a model training module, used for inputting the optical images in the sample data set into a pre-built height level-guided building height estimation model, calculating losses between the predicted values and the height level categories and height values respectively, and iteratively optimizing model parameters to obtain a trained target building height estimation model; the pre-built height level-guided building height estimation model comprises a shared feature extractor and a preset number of decoders, the decoders being used for height level classification and height regression;
a building height estimation module, used for inputting the high-resolution optical image to be predicted into the target building height estimation model to obtain a height level class predicted value and a building height predicted value, and suppressing the region where the height level class predicted value satisfies a first threshold and the region where the building height predicted value is smaller than a second threshold, to obtain a final height prediction result;
the sample dataset determination module is further configured to:
step 1a: taking the natural logarithm of the normalized digital surface model to map the skewed building height distribution toward a normal distribution;
step 1b: dividing the natural logarithmic height obtained in step 1a by its maximum value, limiting the values to the range [0, 1];
the normalized digital surface model nDSM is processed through steps 1a and 1b by the formula nDSM' = ln(H) / max(ln(H)), where H denotes the building height in the normalized digital surface model and nDSM' denotes the normalized result;
step 1c: dividing the normalized digital surface model pixel by pixel into n classes according to building height, each class being marked with a different label; the height classes comprise non-building, low-rise building, mid-rise building and high-rise building.
CN202310770597.7A 2023-06-28 2023-06-28 Height grade-guided single-view remote sensing image building height estimation method and device Active CN116503744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310770597.7A CN116503744B (en) 2023-06-28 2023-06-28 Height grade-guided single-view remote sensing image building height estimation method and device

Publications (2)

Publication Number Publication Date
CN116503744A CN116503744A (en) 2023-07-28
CN116503744B true CN116503744B (en) 2023-09-29

Family

ID=87328757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310770597.7A Active CN116503744B (en) 2023-06-28 2023-06-28 Height grade-guided single-view remote sensing image building height estimation method and device

Country Status (1)

Country Link
CN (1) CN116503744B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704376B (en) * 2023-08-04 2023-10-20 航天宏图信息技术股份有限公司 nDSM extraction method and device based on single satellite image and electronic equipment

Citations (10)

Publication number Priority date Publication date Assignee Title
TW200840993A (en) * 2007-04-04 2008-10-16 Univ Nat Central Ortho-rectification method of photogrammetry with high-spatial resolution
CN109830296A (en) * 2018-12-18 2019-05-31 东软集团股份有限公司 Method, apparatus, storage medium and the electronic equipment of data classification
CN112232280A (en) * 2020-11-04 2021-01-15 安徽大学 Hyperspectral image classification method based on self-encoder and 3D depth residual error network
CN113139994A (en) * 2021-04-13 2021-07-20 宁波四象径宇科技有限公司 High-resolution optical remote sensing satellite image building height monitoring method based on angular points
CN114005042A (en) * 2021-10-20 2022-02-01 青岛浩海网络科技股份有限公司 Remote sensing image urban building extraction method based on shadow compensation and U-net
CN114092832A (en) * 2022-01-20 2022-02-25 武汉大学 High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN114639013A (en) * 2022-02-17 2022-06-17 西安电子科技大学 Remote sensing image airplane target detection and identification method based on improved Orient RCNN model
CN114972989A (en) * 2022-05-18 2022-08-30 中国矿业大学(北京) Single remote sensing image height information estimation method based on deep learning algorithm
CN115546649A (en) * 2022-10-24 2022-12-30 中国矿业大学(北京) Single-view remote sensing image height estimation and semantic segmentation multi-task prediction method
CN115759515A (en) * 2022-11-21 2023-03-07 武汉大学 Power transmission line power transmission channel discharge risk evaluation method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11587249B2 (en) * 2020-09-24 2023-02-21 Eagle Technology, Llc Artificial intelligence (AI) system and methods for generating estimated height maps from electro-optic imagery

Non-Patent Citations (1)

Title
Multi-scale urban building type classification from high-resolution remote sensing images fusing height features; Chu Guozhong, Li Mengmeng, Wang Xiaoqin; Journal of Geo-information Science; Vol. 2021 (No. 23); 2073-2085 *

Similar Documents

Publication Publication Date Title
CN111652217B (en) Text detection method and device, electronic equipment and computer storage medium
WO2023077816A1 (en) Boundary-optimized remote sensing image semantic segmentation method and apparatus, and device and medium
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN113850825B (en) Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN111860235B (en) Method and system for generating high-low-level feature fused attention remote sensing image description
CN109784283B (en) Remote sensing image target extraction method based on scene recognition task
CN108230329B (en) Semantic segmentation method based on multi-scale convolution neural network
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN110110729B (en) Building example mask extraction method for realizing remote sensing image based on U-shaped CNN model
CN110443258B (en) Character detection method and device, electronic equipment and storage medium
CN116503744B (en) Height grade-guided single-view remote sensing image building height estimation method and device
CN116783620A (en) Efficient three-dimensional object detection from point clouds
CN113065594A (en) Road network extraction method and device based on Beidou data and remote sensing image fusion
CN113313094B (en) Vehicle-mounted image target detection method and system based on convolutional neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113204608A (en) Automatic map updating method, storage medium and system based on remote sensing image
CN117152414A (en) Target detection method and system based on scale attention auxiliary learning method
CN115330703A (en) Remote sensing image cloud and cloud shadow detection method based on context information fusion
CN112634174B (en) Image representation learning method and system
CN111898608B (en) Natural scene multi-language character detection method based on boundary prediction
CN116012709B (en) High-resolution remote sensing image building extraction method and system
CN114913330B (en) Point cloud component segmentation method and device, electronic equipment and storage medium
Feng et al. Improved deep fully convolutional network with superpixel-based conditional random fields for building extraction
CN114241470A (en) Natural scene character detection method based on attention mechanism
CN114820423A (en) Automatic cutout method based on saliency target detection and matching system thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant