CN113486928B - Multi-view image alignment method based on rational polynomial model differentiable tensor expression

Info

Publication number: CN113486928B
Application number: CN202110666281.4A
Authority: CN (China)
Prior art keywords: tensor, multiplied, image, coordinate, view
Legal status: Active (granted)
Other versions: CN113486928A (Chinese-language publication)
Inventors: 季顺平 (Ji Shunping), 高建 (Gao Jian), 刘瑾 (Liu Jin)
Current Assignee: Wuhan University WHU
Original Assignee: Wuhan University WHU
Application filed by Wuhan University WHU; priority to CN202110666281.4A (priority date 2021-06-16)

Classifications

    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06T7/10: Segmentation; Edge detection
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/20021: Dividing image into blocks, subimages or windows


Abstract

The invention discloses a multi-view image alignment method based on a differentiable tensor expression of the rational polynomial model. Starting directly from the multi-view images and their imaging geometry, the method makes end-to-end dense matching of multi-view satellite images possible and requires no complex preprocessing steps such as epipolar resampling of satellite images. It is highly portable: it can be embedded as an independent module into existing deep learning multi-view dense matching networks without changing other structures or training strategies. Its theory is also stricter: compared with the idea of locally fitting the rational polynomial model with a perspective imaging model, the method has stricter theoretical support, and experiments show that it achieves higher accuracy.

Description

Multi-view image alignment method based on rational polynomial model differentiable tensor expression
Technical Field
The invention relates to a multi-view image alignment method based on a differentiable tensor expression of the rational polynomial model, which realizes fast coordinate mapping and feature alignment among multi-view images and is applied to the field of three-dimensional surface reconstruction from multi-view optical satellite images.
Background
Processing multi-view optical satellite stereo images is currently the most prominent way to produce large-scale digital surface models. The basic principle is to search for corresponding points between stereo image pairs by dense image matching and then extract elevation information according to the imaging geometric model.
In recent years, many deep learning based multi-view dense matching methods have been proposed and successfully applied to multi-view aerial images, achieving accuracy and efficiency superior to traditional dense matching algorithms in surface model reconstruction tasks. Deep learning methods with such potential on multi-view aerial imagery are very likely to bring a similar revolution to satellite image dense matching. However, the current mainstream deep learning multi-view dense matching methods, such as MVSNet, RED-Net and UCS-Net, are designed for the pinhole camera imaging model. These methods use differentiable homography warping to align multi-view image features: given a set of fronto-parallel planes of the reference camera at different depths, the image features of all views are transformed to the reference view through the differentiable homography, in preparation for subsequent multi-view feature fusion, cost volume construction, cost volume regularization and depth map regression.
The above alignment of multi-view image features based on differentiable homography relies on the perspective camera model (i.e. the pinhole camera model), which applies to close-range and aerial images but not to multi-view satellite images. Satellite images are formed by linear-array push-broom imaging, for which the Rational Polynomial Camera (RPC) model is widely adopted. This difference in imaging models prevents satellite image dense matching from benefiting from the most advanced deep learning algorithms. A large gap therefore exists between the current mainstream deep learning multi-view dense matching technology and the multi-view satellite image dense matching task.
Some researchers propose to fit the rational polynomial model with a perspective imaging model within a local range of the satellite image, so that the imaging model of the satellite image changes from one rational polynomial model into several perspective imaging models. With this idea, the mainstream multi-view dense matching frameworks gain the ability to process satellite images. However, such a fitting strategy clearly destroys the strict geometric correspondence of the imaging model and inevitably introduces fitting errors, while also requiring additional image preprocessing. Moreover, three-dimensional surface reconstruction from optical satellite images is at present still mainly realized by traditional geometric methods, and deep learning multi-view dense matching technology has not been effectively applied.
Therefore, in order to apply the current advanced deep learning multi-view dense matching technology to multi-view satellite images, cross the gap between the prior art and the multi-view satellite image dense matching task, and realize end-to-end large-scale intelligent reconstruction of the global surface, a multi-view image alignment method expressed in tensor operations and based on a differentiable tensor expression of the rational polynomial model needs to be studied, extending the deep learning methods that perform excellently on close-range and aerial data to satellite images.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a multi-view image alignment method based on a differentiable tensor expression of the rational polynomial model; on this basis, a differentiable mapping module based on the rational polynomial model is constructed and applied to multi-view satellite image dense matching. The module takes accurate rational polynomial model parameters and satellite image features as input and, using the mathematical description of a quaternary cubic form, expresses the geometric transformation between the multi-view satellite images as a series of differentiable tensor operations, thereby realizing alignment between multi-view satellite image features.
The technical scheme adopted to achieve the aim of the invention is a multi-view image alignment method based on the differentiable tensor expression of the rational polynomial model, comprising the following steps:
step 1, partitioning the multi-view satellite images into corresponding blocks: taking the rational polynomial model as the geometric constraint, find all corresponding target image block areas and partition them, each reference image block and its corresponding target image blocks forming a group of multi-view units;
step 2, mapping the coordinates of all pixels in the reference image obtained in step 1 to object space through the differentiable mapping based on the inverse solution form of the rational polynomial model, obtaining the corresponding object-space coordinates;
step 3, mapping the object-space coordinates obtained in step 2 to the target images through the differentiable mapping based on the forward solution form of the rational polynomial model, obtaining the corresponding target image coordinates;
step 4, aligning the features of the multi-view satellite images: form a set of coordinate mappings from the coordinates of all pixels in the reference image obtained in step 1 and the target image coordinates from step 3, warp the multi-view satellite image features with this coordinate mapping, and transform the target image features to the reference image, realizing feature alignment of the multi-view satellite images.
Further, the specific implementation of step 1 includes the following sub-steps:
step 1.1, selecting the corresponding block areas of the multi-view satellite images, adopting either a blocking method based on object space or a blocking method based on the reference image space;
step 1.2, multi-view image segmentation: compute the minimum bounding rectangle of each block area on the images of all views, take the image area within this rectangle as the final image block, and form a group of multi-view units from the reference image block and all source image blocks.
Further, in the blocking method based on object space, the object space is evenly divided within the designated object-space range according to the specified overlap and block size, and the divided object-space block areas are then projected onto the image plane of each view through the rational polynomial model, giving the corresponding block areas of the multi-view satellite images; in the blocking method based on the reference image space, the designated object-space range is projected onto the reference image plane through the rational polynomial model, the resulting image-plane area is evenly divided according to the specified overlap and block size, and the divided reference image block areas are then projected onto the other views through the rational polynomial model, giving the corresponding image block areas on the source images; a concrete sketch follows.
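As an illustration of the object-space blocking of step 1.1 and the minimum bounding rectangle of step 1.2, the following Python sketch tiles a specified object-space range and projects each block into a view. The helper rpc_forward (projecting (lat, lon, hei) to image coordinates (line, samp) for one view's RPC) and all parameter names are assumptions introduced here for illustration, not names from the patent.

```python
def object_space_blocks(lat_range, lon_range, block_size, overlap):
    """Evenly tile the specified object-space range with the given overlap
    (overlap assumed in [0, 1))."""
    stride = block_size * (1.0 - overlap)
    lat0, lat1 = lat_range
    lon0, lon1 = lon_range
    blocks, lat = [], lat0
    while lat < lat1:
        lon = lon0
        while lon < lon1:
            blocks.append((lat, lon, min(lat + block_size, lat1),
                           min(lon + block_size, lon1)))
            lon += stride
        lat += stride
    return blocks

def image_block(rpc, block, hei_min, hei_max):
    """Minimum bounding rectangle of one object block on one view's image plane."""
    lat0, lon0, lat1, lon1 = block
    pts = [rpc_forward(rpc, la, lo, h)   # hypothetical RPC projection helper
           for la in (lat0, lat1) for lo in (lon0, lon1)
           for h in (hei_min, hei_max)]
    lines, samps = zip(*pts)
    return min(lines), min(samps), max(lines), max(samps)
```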
Further, the specific implementation of step 2 includes the following sub-steps:
step 2.1, constructing the hypothetical elevation planes: determine the elevation search range of the designated area and divide it into D parts to obtain D hypothetical elevation planes;
step 2.2, constructing the coordinate tensor: assume the height and width of the image block are H and W, respectively. First construct an all-ones tensor X_1 of dimension H × W × D. Obtain the normalization parameters from the rational polynomial model and normalize the row coordinates of all pixels in the reference image block obtained in step 1; form the normalized row coordinates into a tensor of dimension H × W, add one dimension, and copy it D times along that dimension to obtain a tensor X_line of dimension H × W × D. Apply the same process to the column coordinates of all pixels in the reference image block obtained in step 1 to obtain a tensor X_samp of dimension H × W × D. Combine the elevation values of all hypothetical elevation planes obtained in step 2.1 into a one-dimensional tensor of length D, normalize it, add two dimensions, and copy it H and W times along the two new dimensions to obtain a tensor X_hei of dimension H × W × D. Stack X_1, X_line, X_samp and X_hei to form the reference image coordinate tensor X of dimension H × W × D × 4;
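A minimal PyTorch sketch of steps 2.1-2.2 follows; the dictionary keys for the RPC normalization offsets and scales are assumptions, not names fixed by the patent.

```python
import torch

def build_ref_coord_tensor(rows, cols, heights, rpc_norm):
    """Stack [1, line_n, samp_n, hei_n] into an H x W x D x 4 coordinate tensor X.

    rows, cols : 1-D tensors of length H and W with pixel row/column coordinates.
    heights    : 1-D tensor of length D with the hypothetical elevations (step 2.1).
    rpc_norm   : dict of RPC normalization offsets/scales (assumed key names).
    """
    H, W, D = rows.numel(), cols.numel(), heights.numel()
    line_n = (rows - rpc_norm['line_off']) / rpc_norm['line_scale']    # (H,)
    samp_n = (cols - rpc_norm['samp_off']) / rpc_norm['samp_scale']    # (W,)
    hei_n  = (heights - rpc_norm['hei_off']) / rpc_norm['hei_scale']   # (D,)

    X1    = torch.ones(H, W, D)                        # all-ones channel
    Xline = line_n.view(H, 1, 1).expand(H, W, D)       # copy along W and D
    Xsamp = samp_n.view(1, W, 1).expand(H, W, D)       # copy along H and D
    Xhei  = hei_n.view(1, 1, D).expand(H, W, D)        # copy along H and W
    return torch.stack([X1, Xline, Xsamp, Xhei], dim=-1)   # (H, W, D, 4)

# Example: D = 64 hypothetical elevation planes over an assumed 0-500 m range.
heights = torch.linspace(0.0, 500.0, steps=64)
```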
step 2.3, constructing the coefficient tensors: construct the numerator coefficients and the denominator coefficients of the two equations in the inverse solution form of the rational polynomial model into coefficient tensors of dimension 4 × 4 × 4;
the known inverse solution form of the rational polynomial model is:

$$\mathrm{lon}_n=\frac{P_1^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}{P_2^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)},\qquad \mathrm{lat}_n=\frac{P_3^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}{P_4^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}\tag{1}$$
where lon and lat denote the longitude and latitude coordinates, P denotes a ternary cubic polynomial, line, samp and hei denote the row coordinate, column coordinate and elevation variables, the subscript n denotes normalization, and the superscript inv denotes the inverse solution form; before describing the construction process, the following conventions are made: let T be a coefficient tensor of dimension 4 × 4 × 4, where T(i, j, k) denotes the element at position (i, j, k) in the tensor T and i, j, k are all taken from the integer set {0, 1, 2, 3}; form the variables in the numerator into a variable sequence V = [1, line_n, samp_n, hei_n], denoting the i-th element of the sequence by V_i, where the constant 1 is treated as a variable in its own right and line_n, samp_n and hei_n denote the row coordinate, column coordinate and elevation variables in the numerator; denote the coefficients of the polynomial P by a_ijk, where a_ijk is the coefficient of the term V_i V_j V_k in the polynomial P;
the coefficient tensor corresponding to each polynomial P in formula (1) is constructed in the same manner, as follows: when i, j, k are all equal, T(i, j, k) = a_ijk; when exactly two of i, j, k are equal, T(i, j, k) = a_ijk / 3; when i, j, k are all distinct, T(i, j, k) = a_ijk / 6;
step 2.4, computing the corresponding object-space coordinates: first compute the value of each polynomial P in formula (1), that is, combine the numerator coefficient tensor T obtained in step 2.3 with the coordinate tensor X as follows:

$$F(X)^{(h)(w)(d)}=T_{ijk}\,X_i^{(h)(w)(d)}\,X_j^{(h)(w)(d)}\,X_k^{(h)(w)(d)}\tag{2}$$

the tensor F(X) of dimension H × W × D formed by the values of the polynomial P is thereby obtained in a single batched computation; the operation in step 2.4 follows the Einstein summation convention, with the stipulation that a parenthesized superscript marks a dimension that is not summed: in the formula, the components of the first three dimensions of X, namely the H, W and D dimensions, are kept without summation, while the last dimension of X is multiplied against the three indices of the coefficient tensor T and summed; each polynomial adopts the same computation; then divide the tensor computed from each numerator polynomial element-wise by the tensor computed from the corresponding denominator polynomial and remove the normalization, obtaining longitude and latitude coordinate tensors of dimension H × W × D.
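Formula (2) maps directly onto a tensor-library einsum. A minimal PyTorch sketch of step 2.4 follows; the helper name and the normalization key names are assumptions.

```python
import torch

def eval_rpc_polynomial(T, X):
    """Batched evaluation of one ternary cubic polynomial, formula (2).
    The h, w, d axes of X are kept (not summed); the last axis of X is
    contracted three times against the i, j, k axes of T."""
    return torch.einsum('ijk,hwdi,hwdj,hwdk->hwd', T, X, X, X)   # (H, W, D)

def rpc_inverse_map(T_lon_num, T_lon_den, T_lat_num, T_lat_den, X, rpc_norm):
    """Step 2.4: element-wise numerator/denominator division, then de-normalize."""
    lon_n = eval_rpc_polynomial(T_lon_num, X) / eval_rpc_polynomial(T_lon_den, X)
    lat_n = eval_rpc_polynomial(T_lat_num, X) / eval_rpc_polynomial(T_lat_den, X)
    lon = lon_n * rpc_norm['lon_scale'] + rpc_norm['lon_off']
    lat = lat_n * rpc_norm['lat_scale'] + rpc_norm['lat_off']
    return lon, lat    # each (H, W, D), fully differentiable
```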
Further, the elevation search range in step 2.1 is calculated from the elevation normalization coefficients in the rational polynomial model parameters, or obtained by suitably extending a sparse reconstruction result or a public DEM in the height direction.
Further, the specific implementation of step 3 includes the following sub-steps:
step 3.1, constructing the coordinate tensor: first construct an all-ones tensor X_1 of dimension H × W × D; normalize each element of the longitude coordinate tensor obtained in step 2.4 to obtain a normalized longitude coordinate tensor X_lon of dimension H × W × D; apply the same process to the latitude coordinate tensor obtained in step 2.4 to obtain a normalized latitude coordinate tensor X_lat of dimension H × W × D; combine the elevation values of all hypothetical elevation planes obtained in step 2.1 into a one-dimensional tensor of length D, normalize it, add two dimensions, and copy it H and W times along the two new dimensions to obtain a tensor X_hei of dimension H × W × D; stack X_1, X_lon, X_lat and X_hei to form a new coordinate tensor X of dimension H × W × D × 4;
step 3.2, constructing the coefficient tensors: construct the numerator coefficients and the denominator coefficients of the two equations in the forward solution form of the rational polynomial model into coefficient tensors of dimension 4 × 4 × 4;
the known forward solution form of the rational polynomial model is:

$$\mathrm{line}_n=\frac{P_1^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}{P_2^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)},\qquad \mathrm{samp}_n=\frac{P_3^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}{P_4^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}\tag{3}$$
where the superscript fwd denotes the forward solution form; unlike the definition in step 2.3, the variable sequence here becomes V = [1, lon_n, lat_n, hei_n], the other conventions remain unchanged, and the coefficient tensors are constructed exactly as in step 2.3;
step 3.3, computing the corresponding target image coordinates: proceed exactly as in step 2.4, finally obtaining the row coordinate tensor and the column coordinate tensor of the target image, each of dimension H × W × D.
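Chaining the inverse mapping of step 2 and the forward mapping of step 3 gives the complete reference-to-target coordinate mapping. The sketch below reuses the hypothetical helpers from the earlier sketches; the ref_rpc / tgt_rpc attribute names are likewise assumptions.

```python
import torch

def rpc_coordinate_mapping(ref_rpc, tgt_rpc, rows, cols, heights):
    # Step 2: reference pixels -> object space (inverse form, reference RPC).
    X = build_ref_coord_tensor(rows, cols, heights, ref_rpc.norm)    # (H, W, D, 4)
    lon, lat = rpc_inverse_map(ref_rpc.T_lon_num, ref_rpc.T_lon_den,
                               ref_rpc.T_lat_num, ref_rpc.T_lat_den,
                               X, ref_rpc.norm)
    # Step 3.1: re-normalize with the *target* image's RPC parameters and restack.
    lon_n = (lon - tgt_rpc.norm['lon_off']) / tgt_rpc.norm['lon_scale']
    lat_n = (lat - tgt_rpc.norm['lat_off']) / tgt_rpc.norm['lat_scale']
    hei_n = (heights - tgt_rpc.norm['hei_off']) / tgt_rpc.norm['hei_scale']
    hei_n = hei_n.view(1, 1, -1).expand_as(lon_n)
    X_fwd = torch.stack([torch.ones_like(lon_n), lon_n, lat_n, hei_n], dim=-1)
    # Steps 3.2-3.3: the same einsum evaluation with the forward-form coefficients.
    line_n = eval_rpc_polynomial(tgt_rpc.T_line_num, X_fwd) / \
             eval_rpc_polynomial(tgt_rpc.T_line_den, X_fwd)
    samp_n = eval_rpc_polynomial(tgt_rpc.T_samp_num, X_fwd) / \
             eval_rpc_polynomial(tgt_rpc.T_samp_den, X_fwd)
    line = line_n * tgt_rpc.norm['line_scale'] + tgt_rpc.norm['line_off']
    samp = samp_n * tgt_rpc.norm['samp_scale'] + tgt_rpc.norm['samp_off']
    return line, samp    # target row/column coordinate tensors, each (H, W, D)
```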
Further, step 4 is implemented as follows: form a set of coordinate mappings from the row and column coordinate tensors of the reference image obtained in step 2.2 and the row and column coordinate tensors of the target image computed in step 3.3; according to this coordinate mapping, transform the image or image features of the target image to the reference image through differentiable bilinear interpolation, realizing alignment of the multi-view satellite images or image features.
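The differentiable warp of step 4 matches what torch.nn.functional.grid_sample provides. A sketch, under the assumption that the coordinates from step 3.3 are in pixel units of a target feature map of size H_t × W_t:

```python
import torch
import torch.nn.functional as F

def warp_target_features(tgt_feat, line, samp):
    """tgt_feat: (C, H_t, W_t) target-image feature map.
    line, samp: (H, W, D) target row/column coordinates from step 3.3.
    Returns features aligned to the reference view, shape (C, D, H, W)."""
    C, H_t, W_t = tgt_feat.shape
    H, W, D = line.shape
    # grid_sample expects sampling locations normalized to [-1, 1].
    gx = 2.0 * samp / (W_t - 1) - 1.0          # column -> x
    gy = 2.0 * line / (H_t - 1) - 1.0          # row    -> y
    grid = torch.stack([gx, gy], dim=-1)       # (H, W, D, 2)
    grid = grid.permute(2, 0, 1, 3)            # one sampling grid per elevation plane
    feat = tgt_feat.unsqueeze(0).expand(D, C, H_t, W_t)
    warped = F.grid_sample(feat, grid, mode='bilinear', align_corners=True)
    return warped.permute(1, 0, 2, 3)          # (C, D, H, W)
```

Because bilinear sampling is differentiable, gradients flow from the cost volume back through the warp into the feature extraction network.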
The invention has the following advantages:
the method is used for feature alignment between multi-view satellite images and is a core key technology for establishing a multi-view satellite image dense matching framework.
Directly starting from the multi-view image and imaging geometric relation, the end-to-end multi-view satellite image dense matching becomes possible, and complex preprocessing steps such as satellite image epipolar resampling and the like are not needed.
The portability is strong, and the method can be embedded into the existing deep learning multi-view dense matching network as an independent module without changing other structures and training strategies.
The theory is tighter, and the method has tighter theoretical support compared with the idea of locally fitting a rational polynomial model into a perspective imaging model, and proves that the technology can achieve higher precision in experiments, see the attached figure 2.
Drawings
Fig. 1 is an overall schematic view of the present invention.
FIG. 2 is an error distribution plot of a rational polynomial camera model fitted to a perspective projection model.
FIG. 3 is a schematic diagram of a deep learning-based multi-view dense matching network model framework according to an embodiment of the present invention.
FIG. 4 shows the Digital Surface Model (DSM) visualization results of three methods in an embodiment of the invention.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
The invention provides a multi-view image alignment method based on the differentiable tensor expression of the rational polynomial model, which comprises the following steps:
Step 1, partition the multi-view satellite images into corresponding blocks. Partition the satellite images with the rational polynomial model as the geometric constraint, finding all source image feature block areas corresponding to each reference image feature block area to form a group of multi-view units. If enough GPU memory is available to process the satellite image features at their original size directly, this step can be omitted.
Step 2, map the coordinates of all pixels in the reference image obtained in step 1 to object space through the differentiable mapping based on the inverse solution form of the rational polynomial model, obtaining the corresponding object-space coordinates.
Step 3, map the object-space coordinates obtained in step 2 to the source images through the differentiable mapping based on the forward solution form of the rational polynomial model, obtaining the corresponding source image coordinates.
Step 4, align the features of the multi-view satellite images. Form a set of coordinate mappings from the coordinates of all pixels in the reference image obtained in step 1 and the source image coordinates from step 3, warp the multi-view satellite image features with this coordinate mapping, and transform the features on the source images to the reference image, realizing feature alignment of the multi-view satellite images.
Further, the specific implementation of step 1 includes the following sub-steps:
Step 1.1, selecting the corresponding block areas of the multi-view satellite images. Either a blocking method based on object space or a blocking method based on the reference image space may be adopted. In the blocking method based on object space, the object space is evenly divided within the designated object-space range according to the specified overlap and block size, and the divided object-space block areas are then projected onto the image plane of each view through the rational polynomial model, giving the corresponding block areas of the multi-view satellite images. In the blocking method based on the reference image space, the designated object-space range is projected onto the reference image plane through the rational polynomial model, the resulting image-plane area is evenly divided according to the specified overlap and block size, and the divided reference image block areas are then projected onto the other views through the rational polynomial model, giving the corresponding image block areas on the source images.
Step 1.2, multi-view image segmentation. Compute the minimum bounding rectangle of each block area on the images of all views, take the image area within this rectangle as the final image block, and form a group of multi-view units from the reference image block and all source image blocks.
Further, the specific implementation of step 2 includes the following sub-steps:
step 2.1, constructing a hypothetical elevation plane: firstly, determining an elevation search range in a designated area. The range can be calculated by an elevation normalization coefficient in rational polynomial model parameters, or obtained by properly extending a sparse reconstruction result or a public DEM in the height direction. The high-range search space range is divided into D parts to obtain D imaginary high-range planes.
Step 2.2, constructing the coordinate tensor: assume the height and width of the image block are H and W, respectively. First construct an all-ones tensor X_1 of dimension H × W × D. Obtain the normalization parameters from the rational polynomial model and normalize the row coordinates of all pixels in the reference image block obtained in step 1; form the normalized row coordinates into a tensor of dimension H × W, add one dimension, and copy it D times along that dimension to obtain a tensor X_line of dimension H × W × D. Apply the same process to the column coordinates of all pixels in the reference image block obtained in step 1 to obtain a tensor X_samp of dimension H × W × D. Combine the elevation values of all hypothetical elevation planes obtained in step 2.1 into a one-dimensional tensor of length D, normalize it, add two dimensions, and copy it H and W times along the two new dimensions to obtain a tensor X_hei of dimension H × W × D. Stack X_1, X_line, X_samp and X_hei to form the reference image coordinate tensor X of dimension H × W × D × 4.
Step 2.3, constructing the coefficient tensors: construct the numerator coefficients and the denominator coefficients of the two equations in the inverse solution form of the rational polynomial model into coefficient tensors of dimension 4 × 4 × 4.
The known inverse solution form of the rational polynomial model is:

$$\mathrm{lon}_n=\frac{P_1^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}{P_2^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)},\qquad \mathrm{lat}_n=\frac{P_3^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}{P_4^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}\tag{4}$$
where lon and lat denote the longitude and latitude coordinates, P denotes a ternary cubic polynomial, line, samp and hei denote the row coordinate, column coordinate and elevation variables, the subscript n denotes normalization, and the superscript inv denotes the inverse solution form. Before describing the construction process, the following conventions are made: let T be a coefficient tensor of dimension 4 × 4 × 4, where T(i, j, k) denotes the element at position (i, j, k) in the tensor T and i, j, k are all taken from the integer set {0, 1, 2, 3}; form the variables in the numerator into a variable sequence V = [1, line_n, samp_n, hei_n], denoting the i-th element of the sequence by V_i, where the constant 1 is treated as a variable in its own right and line_n, samp_n and hei_n denote the row coordinate, column coordinate and elevation variables in the numerator; denote the coefficients of the polynomial P by a_ijk, where a_ijk is the coefficient of the term V_i V_j V_k in the polynomial P.
The coefficient tensor corresponding to each polynomial P in formula (4) is constructed in the same manner, as follows: when i, j, k are all equal, T(i, j, k) = a_ijk; when exactly two of i, j, k are equal, T(i, j, k) = a_ijk / 3; when i, j, k are all distinct, T(i, j, k) = a_ijk / 6.
Step 2.4, computing the corresponding object-space coordinates: first compute the value of each polynomial P in formula (4), that is, combine the numerator coefficient tensor T obtained in step 2.3 with the coordinate tensor X as follows:

$$F(X)^{(h)(w)(d)}=T_{ijk}\,X_i^{(h)(w)(d)}\,X_j^{(h)(w)(d)}\,X_k^{(h)(w)(d)}\tag{5}$$

The tensor F(X) of dimension H × W × D formed by the values of the polynomial P can thus be obtained in a single batched computation. Note that the above operation follows the Einstein summation convention, with the stipulation that a parenthesized superscript marks a dimension that is not summed. In the formula, the components of the first three dimensions of X (i.e. the H, W and D dimensions) are kept without summation, while the last dimension of X is multiplied against the three indices of the coefficient tensor T and summed. Each polynomial uses the same computation as above. Then divide the tensor computed from each numerator polynomial element-wise by the tensor computed from the corresponding denominator polynomial and remove the normalization, obtaining longitude and latitude coordinate tensors of dimension H × W × D.
Further, the specific implementation of step 3 includes the following sub-steps:
Step 3.1, constructing the coordinate tensor: first construct an all-ones tensor X_1 of dimension H × W × D; normalize each element of the longitude coordinate tensor obtained in step 2.4 to obtain a normalized longitude coordinate tensor X_lon of dimension H × W × D; apply the same process to the latitude coordinate tensor obtained in step 2.4 to obtain a normalized latitude coordinate tensor X_lat of dimension H × W × D; combine the elevation values of all hypothetical elevation planes obtained in step 2.1 into a one-dimensional tensor of length D, normalize it, add two dimensions, and copy it H and W times along the two new dimensions to obtain a tensor X_hei of dimension H × W × D. Stack X_1, X_lon, X_lat and X_hei to form a new coordinate tensor X of dimension H × W × D × 4.
Step 3.2, constructing a coefficient tensor: the numerator coefficients and denominator coefficients of two equations in the positive solution form of the rational polynomial model are respectively constructed as 4-dimensional 4 × 4 × 4 coefficient tensors.
The known forward solution form of the rational polynomial model is:

$$\mathrm{line}_n=\frac{P_1^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}{P_2^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)},\qquad \mathrm{samp}_n=\frac{P_3^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}{P_4^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}\tag{6}$$
where the superscript fwd denotes the forward solution form. In contrast to the definition in step 2.3, the variable sequence here becomes V = [1, lon_n, lat_n, hei_n]; the other conventions remain unchanged. The coefficient tensors are constructed as in step 2.3.
Step 3.3, computing the corresponding source image coordinates: the same as step 2.4 and not repeated here; finally the row coordinate tensor and the column coordinate tensor of the source image, each of dimension H × W × D, are obtained.
Further, step 4 is implemented as follows: form a set of coordinate mappings from the row and column coordinate tensors of the reference image obtained in step 2.2 and the row and column coordinate tensors of the source image computed in step 3.3. According to this coordinate mapping, the image features on the source images are transformed to the reference image through differentiable bilinear interpolation, realizing alignment of the multi-view satellite image features.
Embodiment:
Using the multi-view image alignment method based on the differentiable tensor expression of the rational polynomial model, end-to-end intelligent reconstruction of large-area surface models from multi-view satellite images is realized under a deep learning framework. Taking the three state-of-the-art network structures RED-Net, Cas-MVSNet and UCS-Net as base frameworks, the constructed differentiable mapping module based on the rational polynomial model is embedded into the three networks, replacing their differentiable homography modules, to build cost volumes across multi-view satellite image features. In this embodiment, we denote a multi-view dense matching method based on differentiable homography by "*(homo)" and one based on differentiable rational polynomial coordinate mapping by "*(rpc)".
The general structure of the three base network frameworks is shown in fig. 3. Each framework comprises a feature extraction module, a cost volume construction module, a cost volume regularization module, a loss computation module and a multi-scale prediction module. The feature extraction module consists of several weight-sharing 2D convolutional network branches. The cost volume construction module is implemented either with the differentiable homography technique (homo) or with the differentiable rational polynomial coordinate mapping technique (rpc) described above. The cost volume regularization module takes two forms: a recurrent encoder-decoder regularization module (RED-Net, consisting of 2D convolutional layers at 4 scales and convolutional gated recurrent units) or a 3D convolution regularization module (Cas-MVSNet and UCS-Net, consisting of 3D convolutional layers at 4 scales), which learns to regularize the cost volume in the depth and spatial directions. The loss computation module converts the regularized cost volume into a depth map and computes the loss value, whose back-propagation guides network training. The multi-scale prediction module comprises a three-scale pyramid structure: the low-resolution matching result of the previous scale constrains the matching at the next scale and adaptively determines the depth search interval used when the next scale builds its cost volume, realizing coarse-to-fine matching.
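How the cost volume construction module consumes the rational polynomial warp can be sketched as follows. The variance-based multi-view fusion is an assumption borrowed from MVSNet-style networks (the patent text does not fix the fusion operator), and the helper functions come from the earlier sketches.

```python
import torch

def build_cost_volume(ref_feat, src_feats, ref_rpc, src_rpcs, rows, cols, heights):
    """Warp every source view onto the reference view's elevation planes, then
    fuse the feature volumes with the per-view variance."""
    C, H, W = ref_feat.shape
    D = heights.numel()
    volumes = [ref_feat.unsqueeze(1).expand(C, D, H, W)]    # reference volume
    for feat, rpc in zip(src_feats, src_rpcs):
        line, samp = rpc_coordinate_mapping(ref_rpc, rpc, rows, cols, heights)
        volumes.append(warp_target_features(feat, line, samp))
    return torch.stack(volumes).var(dim=0)                  # (C, D, H, W) cost volume
```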
A multi-view dense matching network for multi-view satellite images is built according to this framework, and experiments and accuracy evaluation are then carried out on the open-source multi-view satellite image dense matching dataset TLC SatMVS. The TLC SatMVS imagery comes from the ZY3-02 satellite carrying a Three-Line Camera (TLC); the nadir view resolution is 2.1 m and the forward and backward view resolution is 2.5 m, suitable for large-scale terrain reconstruction. The dataset provides 173 groups of three-line-array satellite image data, of which 127 groups are training data and 46 groups are test data; each image is 5120 × 5120 pixels, and the overlap of the three views is around 95%. Limited by available memory, the training data are cropped into 5011 three-view sub-blocks of size 768 × 384.
For a network framework adopting the differentiable rational polynomial coordinate mapping (denoted "*(rpc)"), the satellite images and the corresponding rational polynomial coefficients are taken directly as input, and a height map corresponding to the reference view is output. For a network framework adopting the differentiable homography (denoted "*(homo)"), the satellite images based on the rational polynomial camera model are locally fitted to central projection images, the fitted block images and central projection camera parameters are taken as input, and a depth map corresponding to the nadir view is output.
The constructed deep learning dense matching networks are trained with the training set until the training loss no longer decreases and the models reach their optimum. All network models are implemented on the deep learning framework PyTorch and follow the same hardware, software and hyper-parameter settings: training on a single NVIDIA RTX 2080Ti GPU, training batch size 1, RMSprop optimizer, 35 iterations over all training data, and an initial learning rate of 0.001. A three-level pyramid structure is adopted to realize coarse-to-fine depth map (or height map) prediction. For the three-line-array images, the number of input views is fixed at N = 3; across the three levels, the numbers of hypothetical elevation planes are {64, 32, 8}, and except for the UCS-Net framework, which uses its own adaptive sampling-interval strategy, the sampling intervals of the other two frameworks are {(d_max - d_min)/64, 5 m, 2.5 m}, where d_max and d_min denote the maximum and minimum elevation (or depth) of the search range.
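The stated training configuration corresponds, in outline, to a standard PyTorch loop; SatMVSNet and the batch/loss interfaces below are placeholders for illustration, not names from the patent.

```python
import torch

model = SatMVSNet(num_views=3, depth_planes=(64, 32, 8)).cuda()  # placeholder class
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)    # initial lr 0.001
for epoch in range(35):                  # all training data iterated 35 times
    for batch in train_loader:           # batch size 1 on a single RTX 2080Ti
        optimizer.zero_grad()
        loss = model.compute_loss(batch) # placeholder loss interface
        loss.backward()
        optimizer.step()
```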
After model training is finished, the multi-view images of adjacent views in the test area and the corresponding rational polynomial coefficients (or fitted central projection camera parameters) are input into the network model to obtain the height map (or depth map) corresponding to the reference image. Outliers in the per-pixel height (or depth) values are filtered through a consistency-check post-processing step, the remaining reliable points are back-projected into three-dimensional object space using the imaging model parameters, and a Digital Surface Model (DSM) of the test area is generated by resampling.
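The final resampling step can be sketched as a simple gridded average; the mean-per-cell operator, the helper names and the north-up grid convention are assumptions (the patent does not fix the resampling operator).

```python
import numpy as np

def points_to_dsm(lats, lons, heis, cell, lat_max, lon_min, H, W):
    """Average the heights of all back-projected points falling into each cell."""
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    rows = ((lat_max - np.asarray(lats)) / cell).astype(int)   # north-up rows
    cols = ((np.asarray(lons) - lon_min) / cell).astype(int)
    for r, c, h in zip(rows, cols, heis):
        if 0 <= r < H and 0 <= c < W:
            acc[r, c] += h
            cnt[r, c] += 1
    dsm = np.full((H, W), np.nan)                              # empty cells stay NaN
    valid = cnt > 0
    dsm[valid] = acc[valid] / cnt[valid]
    return dsm
```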
Because no end-to-end deep learning multi-view dense matching method directly applicable to satellite images exists at present, the differentiable mapping module based on the rational polynomial model is embedded into several advanced multi-view dense matching deep learning frameworks and compared with network models based on the differentiable homography module in their base form (which fit the RPC model with a pinhole camera model when applied to satellite images). Meanwhile, the surface reconstruction results of the deep learning methods on multi-view satellite images are compared with the result of the traditional method Adapted-COLMAP.
Table 1 gives the quantitative evaluation results on the TLC SatMVS test set of the image alignment method based on differentiable homography, the image alignment method based on the rational polynomial differentiable tensor expression, and the traditional method Adapted-COLMAP. Four measures serve as indexes of DSM quality: the Mean Absolute Error (MAE), i.e. the mean L1 distance between the ground truth and the valid estimated height values over all grid cells of the DSM; the Root Mean Square Error (RMSE), i.e. the standard deviation between the ground-truth height and the estimated height; the percentages of grid cells with L1 error below 2.5 m and below 7.5 m (<2.5 m, <7.5 m); and the completeness of the DSM results (Comp.). The quantitative results show that, under the same network framework, the differentiable mapping module based on the strict rational polynomial model generally performs better than the homography module using the pinhole fitting model, and that all deep learning based multi-view dense matching methods show a clear advantage in performance over the traditional method Adapted-COLMAP.
TABLE 1 comparison of quantitative results of different multi-view dense matching methods (or network models) on TLC SatMVS test set
Fig. 4 shows, taking the RED-Net framework as an example, the DSM visualization results of the network models using the image alignment method based on differentiable homography and the image alignment method based on the rational polynomial differentiable tensor expression, together with Adapted-COLMAP. It can be seen that the DSM generated by the deep learning method with rational polynomial tensor-expression image alignment is more complete, with fewer hollow regions and clearer object edges.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope defined in the appended claims.

Claims (7)

1. A multi-view image alignment method based on rational polynomial model differentiable tensor expression is characterized in that: the method comprises the following steps:
step 1, partitioning the multi-view satellite images into corresponding blocks: taking the rational polynomial model as the geometric constraint, finding all corresponding target image block areas and partitioning them, each reference image block and its corresponding target image blocks forming a group of multi-view units;
step 2, mapping the coordinates of all pixels in the reference image obtained in step 1 to object space through the differentiable mapping based on the inverse solution form of the rational polynomial model, obtaining the corresponding object-space coordinates;
step 3, mapping the object-space coordinates obtained in step 2 to the target images through the differentiable mapping based on the forward solution form of the rational polynomial model, obtaining the corresponding target image coordinates;
step 4, aligning the features of the multi-view satellite images: forming a set of coordinate mappings from the coordinates of all pixels in the reference image obtained in step 1 and the target image coordinates from step 3, warping the multi-view satellite image features with this coordinate mapping, and transforming the target image features to the reference image to realize feature alignment of the multi-view satellite images.
2. The method for multi-view image alignment based on differentiable tensor expression of the rational polynomial model as claimed in claim 1, wherein: the specific implementation of step 1 comprises the following sub-steps:
step 1.1, selecting the corresponding block areas of the multi-view satellite images, adopting either a blocking method based on object space or a blocking method based on the reference image space;
step 1.2, multi-view image segmentation: computing the minimum bounding rectangle of each block area on the images of all views, taking the image area within this rectangle as the final image block, and forming a group of multi-view units from the reference image block and all source image blocks.
3. The method for multi-view image alignment based on differentiable tensor expression of the rational polynomial model as claimed in claim 2, wherein: in the blocking method based on object space, the object space is evenly divided within the designated object-space range according to the specified overlap and block size, and the divided object-space block areas are then projected onto the image plane of each view through the rational polynomial model, giving the corresponding block areas of the multi-view satellite images; in the blocking method based on the reference image space, the designated object-space range is projected onto the reference image plane through the rational polynomial model, the resulting image-plane area is evenly divided according to the specified overlap and block size, and the divided reference image block areas are then projected onto the other views through the rational polynomial model, giving the corresponding image block areas on the source images.
4. The method for multi-view image alignment based on differentiable tensor expression of the rational polynomial model as claimed in claim 1, wherein: the specific implementation of step 2 comprises the following sub-steps:
step 2.1, constructing the hypothetical elevation planes: determining the elevation search range of the designated area and dividing it into D parts to obtain D hypothetical elevation planes;
step 2.2, constructing the coordinate tensor: assuming the height and width of the image block are H and W, respectively; first construct an all-ones tensor X_1 of dimension H × W × D; obtain the normalization parameters from the rational polynomial model and normalize the row coordinates of all pixels in the reference image block obtained in step 1, form the normalized row coordinates into a tensor of dimension H × W, add one dimension, and copy it D times along that dimension to obtain a tensor X_line of dimension H × W × D; apply the same process to the column coordinates of all pixels in the reference image block obtained in step 1 to obtain a tensor X_samp of dimension H × W × D; combine the elevation values of all hypothetical elevation planes obtained in step 2.1 into a one-dimensional tensor of length D, normalize it, add two dimensions, and copy it H and W times along the two new dimensions to obtain a tensor X_hei of dimension H × W × D; stack X_1, X_line, X_samp and X_hei to form the reference image coordinate tensor X of dimension H × W × D × 4;
step 2.3, constructing the coefficient tensors: constructing the numerator coefficients and the denominator coefficients of the two equations in the inverse solution form of the rational polynomial model into coefficient tensors of dimension 4 × 4 × 4;
the known inverse solution form of the rational polynomial model is:

$$\mathrm{lon}_n=\frac{P_1^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}{P_2^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)},\qquad \mathrm{lat}_n=\frac{P_3^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}{P_4^{inv}(\mathrm{line}_n,\mathrm{samp}_n,\mathrm{hei}_n)}\tag{1}$$
where lon and lat denote the longitude and latitude coordinates, P denotes a ternary cubic polynomial, line, samp and hei denote the row coordinate, column coordinate and elevation variables, the subscript n denotes normalization, and the superscript inv denotes the inverse solution form; before describing the construction process, the following conventions are made: let T be a coefficient tensor of dimension 4 × 4 × 4, where T(i, j, k) denotes the element at position (i, j, k) in the tensor T and i, j, k are all taken from the integer set {0, 1, 2, 3}; form the variables in the numerator into a variable sequence V = [1, line_n, samp_n, hei_n], denoting the i-th element of the sequence by V_i, where the constant 1 is treated as a variable in its own right and line_n, samp_n and hei_n denote the row coordinate, column coordinate and elevation variables in the numerator; denote the coefficients of the polynomial P by a_ijk, where a_ijk is the coefficient of the term V_i V_j V_k in the polynomial P;
the coefficient tensor corresponding to each polynomial P in formula (1) is constructed in the same manner, as follows: when i, j, k are all equal, T(i, j, k) = a_ijk; when exactly two of i, j, k are equal, T(i, j, k) = a_ijk / 3; when i, j, k are all distinct, T(i, j, k) = a_ijk / 6;
step 2.4, computing the corresponding object-space coordinates: first computing the value of each polynomial P in formula (1), that is, combining the numerator coefficient tensor T obtained in step 2.3 with the coordinate tensor X as follows:

$$F(X)^{(h)(w)(d)}=T_{ijk}\,X_i^{(h)(w)(d)}\,X_j^{(h)(w)(d)}\,X_k^{(h)(w)(d)}\tag{2}$$

the tensor F(X) of dimension H × W × D formed by the values of the polynomial P being obtained in a single batched computation; the operation in step 2.4 follows the Einstein summation convention, with the stipulation that a parenthesized superscript marks a dimension that is not summed: in the formula, the components of the first three dimensions of X, namely the H, W and D dimensions, are kept without summation, while the last dimension of X is multiplied against the three indices of the coefficient tensor T and summed; each polynomial adopts the same computation; then the tensor computed from each numerator polynomial is divided element-wise by the tensor computed from the corresponding denominator polynomial and the normalization is removed, yielding longitude and latitude coordinate tensors of dimension H × W × D.
5. The method for multi-view image alignment based on differentiable tensor expression of the rational polynomial model as claimed in claim 4, wherein: the elevation search range in step 2.1 is calculated from the elevation normalization coefficients in the rational polynomial model parameters, or obtained by suitably extending a sparse reconstruction result or a public DEM in the height direction.
6. The method for multi-view image alignment based on differentiable tensor expression of the rational polynomial model as claimed in claim 4, wherein: the specific implementation of step 3 comprises the following sub-steps:
step 3.1, constructing the coordinate tensor: first construct an all-ones tensor X_1 of dimension H × W × D; normalize each element of the longitude coordinate tensor obtained in step 2.4 to obtain a normalized longitude coordinate tensor X_lon of dimension H × W × D; apply the same process to the latitude coordinate tensor obtained in step 2.4 to obtain a normalized latitude coordinate tensor X_lat of dimension H × W × D; combine the elevation values of all hypothetical elevation planes obtained in step 2.1 into a one-dimensional tensor of length D, normalize it, add two dimensions, and copy it H and W times along the two new dimensions to obtain a tensor X_hei of dimension H × W × D; stack X_1, X_lon, X_lat and X_hei to form a new coordinate tensor X of dimension H × W × D × 4;
step 3.2, constructing the coefficient tensors: constructing the numerator coefficients and the denominator coefficients of the two equations in the forward solution form of the rational polynomial model into coefficient tensors of dimension 4 × 4 × 4;
the known forward solution form of the rational polynomial model is:

$$\mathrm{line}_n=\frac{P_1^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}{P_2^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)},\qquad \mathrm{samp}_n=\frac{P_3^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}{P_4^{fwd}(\mathrm{lon}_n,\mathrm{lat}_n,\mathrm{hei}_n)}\tag{3}$$
where the superscript fwd denotes the forward solution form; unlike the definition in step 2.3, the variable sequence here becomes V = [1, lon_n, lat_n, hei_n], the other conventions remain unchanged, and the coefficient tensors are constructed as in step 2.3;
step 3.3, computing the corresponding target image coordinates: proceeding exactly as in step 2.4, finally obtaining the row coordinate tensor and the column coordinate tensor of the target image, each of dimension H × W × D.
7. The method for multi-view image alignment based on differentiable tensor expression of the rational polynomial model as claimed in claim 6, wherein: step 4 is implemented as follows: forming a set of coordinate mappings from the row and column coordinate tensors of the reference image obtained in step 2.2 and the row and column coordinate tensors of the target image computed in step 3.3, and, according to this coordinate mapping, transforming the image or image features of the target image to the reference image through differentiable bilinear interpolation to realize alignment of the multi-view satellite images or image features.
CN202110666281.4A 2021-06-16 2021-06-16 Multi-view image alignment method based on rational polynomial model differentiable tensor expression Active CN113486928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110666281.4A CN113486928B (en) 2021-06-16 2021-06-16 Multi-view image alignment method based on rational polynomial model differentiable tensor expression

Publications (2)

Publication Number Publication Date
CN113486928A CN113486928A (en) 2021-10-08
CN113486928B true CN113486928B (en) 2022-04-12

Family

ID=77934954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110666281.4A Active CN113486928B (en) 2021-06-16 2021-06-16 Multi-view image alignment method based on rational polynomial model differentiable tensor expression

Country Status (1)

Country Link
CN (1) CN113486928B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109485A (en) * 2023-02-22 2023-05-12 中科星图数字地球合肥有限公司 Remote sensing image updating method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231754A (en) * 2008-02-03 2008-07-30 四川虹微技术有限公司 Multi-visual angle video image depth detecting method and depth estimating method
CN101901502A (en) * 2010-08-17 2010-12-01 黑龙江科技学院 Global optimal registration method of multi-viewpoint cloud data during optical three-dimensional measurement
CN102855628A (en) * 2012-08-20 2013-01-02 武汉大学 Automatic matching method for multisource multi-temporal high-resolution satellite remote sensing image
CN104361590A (en) * 2014-11-12 2015-02-18 河海大学 High-resolution remote sensing image registration method with control points distributed in adaptive manner
CN108415871A (en) * 2017-02-10 2018-08-17 北京吉威时代软件股份有限公司 Based on the half matched intensive DSM generation methods of global multi-view images of object space
CN111127538A (en) * 2019-12-17 2020-05-08 武汉大学 Multi-view image three-dimensional reconstruction method based on convolution cyclic coding-decoding structure
CN113962858A (en) * 2021-10-22 2022-01-21 沈阳工业大学 Multi-view depth acquisition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on image registration based on least squares support vector machine; Liu Ding et al.; Chinese Journal of Scientific Instrument; 2008-12-31 *
Highlight removal method for multi-view image feature matching; Wen Peizhi et al.; Computer Engineering and Applications; 2018-07-25, No. 23 *

Also Published As

Publication number Publication date
CN113486928A (en) 2021-10-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant