CN115423978A - Image laser data fusion method based on deep learning and used for building reconstruction - Google Patents

Image laser data fusion method based on deep learning and used for building reconstruction

Info

Publication number
CN115423978A
Authority
CN
China
Prior art keywords
depth map
visible light
point cloud
reconstruction
image
Prior art date
Legal status
Pending
Application number
CN202211059667.XA
Other languages
Chinese (zh)
Inventor
谢红梅
曾田子
徐梓雲
邱文
蒋晓悦
姚冠宇
冯晓毅
彭进业
文明
苗阿新
夏召强
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211059667.XA priority Critical patent/CN115423978A/en
Publication of CN115423978A publication Critical patent/CN115423978A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based method for fusing visible light images with LiDAR (laser radar) data that can be used for three-dimensional reconstruction of outdoor buildings, comprising the following steps: first, visible light images and LiDAR data are acquired and preprocessed; second, sparse reconstruction and camera pose estimation are performed with the structure-from-motion (SfM) framework COLMAP; next, a depth map completion network combining an explicit three-dimensional representation with a spatial propagation network (SPN) is constructed, a data set consisting of visible light images, LiDAR depth maps, and LiDAR point clouds is used to train the network, and the sparse LiDAR depth map to be completed, the point cloud, and the visible light image are fed into the trained model to estimate a dense depth map; finally, dense reconstruction, mesh reconstruction, and texture mapping are carried out with the estimated depth maps using the open-source multi-view stereo (MVS) framework OpenMVS. The invention provides a depth map completion method (i.e., fusion of visible light images and LiDAR data) based on the combination of an explicit representation and an SPN, which makes full use of two-dimensional image information and three-dimensional spatial structure information, improves the accuracy of depth map estimation, and improves the precision of three-dimensional reconstruction.

Description

Image laser data fusion method based on deep learning for building reconstruction
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a deep-learning-based method for fusing visible light images with LiDAR (laser radar) data that can be used for three-dimensional reconstruction of outdoor buildings.
Background
Dense, large-scale, high-precision three-dimensional reconstruction is one of the classic problems in photogrammetry and computer vision, and it is important for applications such as autonomous driving, quality-control monitoring, virtual tourism, augmented reality, and cultural heritage protection.
Whether the goal is to render realistic objects or to further analyze the three-dimensional model, both accurate spatial geometric information and high-fidelity color and texture information are indispensable. Laser-scanning reconstruction can recover the true dimensions of a target object with millimeter-level accuracy, but the point clouds acquired by a LiDAR are inherently sparse and vary with the environment; they lack texture and are prone to defects and holes at the edges and corners of the object. Multi-view image-based reconstruction produces visually convincing models but lacks true depth information and is strongly affected by environmental factors such as illumination, so the resulting three-dimensional model is often of poor quality. Each method used alone therefore has limitations. A three-dimensional reconstruction strategy that fuses multiple data sources can repair holes and noise according to the different data sources, reduce the coordinate deviation of the three-dimensional point cloud, and provide more targeted three-dimensional data support and services for structure protection and information retention.
Research on three-dimensional reconstruction from visible light images and LiDAR data mainly involves depth estimation (depth map completion) and three-dimensional mesh reconstruction. Depth map completion, which fuses the visible light image with the sparse depth map derived from the LiDAR to predict a dense depth map, is the most critical step.
Over the past decades, scholars have put great effort into depth map completion for tasks such as model construction and 3D object detection. Deep-learning-based methods have shown remarkable performance on the depth completion task and have driven its development. Early work showed that a network with several convolutional layers, or a simple autoencoder, can fill in the missing depth. Depth completion can be further improved by using RGB information: a typical approach uses dual encoders to extract features from the sparse depth map and its corresponding RGB image separately and then fuses them in the decoder. To push depth map completion further, recent methods tend to use more complex network structures and learning strategies. Besides multi-branch architectures that extract features from multi-modal data (e.g., images and sparse depths), researchers have begun to integrate surface normals, affinity matrices, residual depth maps, and the like into their frameworks. In addition, to address the shortage of supervised pixels, some work introduces multi-view geometric constraints and adversarial regularization.
Depth map completion methods can be roughly divided into five categories: 1. early fusion models; 2. late fusion models; 3. explicit three-dimensional representation models; 4. residual depth models; 5. models based on a spatial propagation network (SPN).
1. Early fusion model: such methods typically concatenate the image and the sparse depth map directly as input, or fuse the multi-modal features at the first convolutional layer, as in Fig. 1 (a).
2. Late fusion model: as in Fig. 1 (b), this approach generally consists of a dual encoder or two sub-networks, one extracting RGB features and the other extracting depth features. Fusion is performed in intermediate layers, for example by fusing the features extracted by the encoders.
3. Explicit three-dimensional representation model: such methods typically apply three-dimensional convolutions, embed surface cues, or learn directly from three-dimensional point clouds to predict dense depth maps, as in Fig. 1 (c).
4. Residual depth model: as shown in Fig. 1 (d), this method generally learns a coarse depth map and a residual depth map and combines them to generate the final depth map.
5. SPN-based model: as shown in Fig. 1 (e), such methods typically first learn an affinity matrix and an initial coarse depth map through an encoder-decoder network, and then use an SPN for iterative affinity-based refinement of the depth map.
Disclosure of Invention
As can be seen from models published by scholars, explicit three-dimensional representation models, SPN-based models, and residual depth models exhibit the most advanced performance and are often superior to other methods. SPN-based models learn three-dimensional geometric relationships in an implicit way, while explicit three-dimensional representation models have greatly advanced depth map completion. The invention therefore provides a depth map completion method based on the combination of an explicit three-dimensional representation with an SPN; it fuses the visible light RGB image with the sparse depth information from the LiDAR to obtain a dense and accurate depth map, and has potential application value in scenarios such as the reconstruction of large buildings for cultural relic protection and street-view reconstruction for autonomous driving.
The technical scheme adopted by the invention is as follows: a deep-learning-based visible light image and LiDAR data fusion method applicable to three-dimensional reconstruction of outdoor buildings, comprising the following steps:
Step one: acquiring a visible light image and LiDAR data set and preprocessing them;
step 101: acquiring multi-angle visible light images and point clouds of the same scene with a visible light camera and a LiDAR/laser scanner, respectively;
step 102: projecting the point cloud acquired by the LiDAR onto the imaging plane of the visible light camera to obtain the corresponding sparse depth map;
step 103: acquiring the ground-truth depth map: superposing each sparse depth map with its 2n + 1 temporally adjacent sparse depth maps (n = 5 when processing KITTI) to increase the density of the generated depth map, cleaning the accumulated laser-scan projections with semi-global matching (SGM) to remove outliers caused by occlusion, dynamic motion, and measurement artifacts, and taking the final accumulated depth map as the ground truth;
step 104: after the above processing, each visible light RGB image corresponds one-to-one to a sparse depth map and a point cloud; forming a data set from the visible light RGB images and their corresponding sparse depth maps and point clouds, and dividing the data set into a training set and a test set;
Step two: performing sparse reconstruction and camera pose estimation with the open-source SfM framework COLMAP;
step 201: inputting the multi-view visible light images into COLMAP and obtaining the camera poses and a sparse point cloud through sparse reconstruction;
step 202: converting the binary reconstruction files generated in step 201 into text format, and changing camera models of other types in the cameras file to the type required by the subsequent processing.
Step three: constructing the image feature extraction module of the depth map completion network and extracting features from the image and the depth map; Fig. 3 shows the encoder-decoder framework of the network;
step 301: constructing an image encoder based on a residual network: the encoder for visible light image and sparse depth map features uses ResNet as its basic structure, processes the two inputs with separate additional convolutional layers, and concatenates the two feature maps from the different sources after the first convolutional layer as the input to ResNet;
step 302: passing the result of step 301 through five ResNet convolutional blocks to obtain an intermediate feature representation;
Step four: constructing the point cloud feature extraction module of the depth map completion network and extracting features from the point cloud;
step 401: selecting the classic point cloud network PointNet++ as the point cloud feature encoder;
step 402: grouping the input point cloud and extracting features for each group: changing the dimensions, applying convolution operations, and finally applying max pooling in the PointNet manner to obtain the group features;
step 403: repeating the sampling and grouping of step 402 several times and applying the PointNet operation to obtain the final global features;
Step five: constructing the decoder module of the depth map completion network and up-sampling the obtained features;
step 501: the decoder uses the multi-scale image features from the image encoder and the point features from the point cloud encoder; it consists of four transposed convolutional layers together with convolutional layers, and the transposed convolutions up-sample the features;
step 502: projecting the point features onto each transposed-convolution block at the same scale as the image features via feature projection;
step 503: sharing the decoder features to estimate the initial dense depth, the confidence, the non-local neighborhood, and the raw affinity: the output of the last transposed-convolution block is processed by convolutional layers to predict the initial dense depth map, the initial confidence, and the raw affinity;
Step six: constructing the SPN module of the depth map completion network and performing iterative optimization to obtain the final depth map;
step 601: an SPN propagates information from high-confidence regions to low-confidence regions according to the affinity of the data, but a fixed propagation neighborhood ignores the depth distribution within the local region, so a non-fixed local SPN is adopted;
step 602: the non-fixed local SPN estimates the neighborhood of each pixel beyond a fixed local region (i.e., non-fixed local) from the color and depth information; the non-fixed local neighborhood is defined as:
N_{m,n} = { (m + p, n + q) | (p, q) ∈ f_φ(I, D, m, n) }
where I and D are the visible light RGB image and the sparse depth map, respectively; f_φ(·) is the non-fixed local neighborhood prediction network, which estimates K neighbors for each pixel under the learnable parameters φ; and p and q are real numbers;
step 603: applying the non-fixed local SPN to the initial dense depth map obtained in step 503 by means of deformable convolution; the iterative optimization formula is:
D^{t+1}_{m,n} = w^{0}_{m,n} · D^{t}_{m,n} + Σ_{(i,j)∈N_{m,n}} w_{m,n}(i,j) · D^{t}_{i,j}
where (m, n) and (i, j) are the coordinates of the reference pixel and of a neighboring pixel, respectively, w^{0}_{m,n} denotes the affinity of the reference pixel, and w_{m,n}(i,j) denotes the affinity between the pixels at (m, n) and (i, j); the first term on the right-hand side represents the propagation of the reference pixel, and the second term represents the propagation of the neighborhood weighted by the corresponding affinities;
step 604: to ensure stable propagation, normalizing the affinities in combination with the confidence before propagation, using the Tanh-γ-Abs-Sum* procedure:
w_{m,n}(i,j) = c_{i,j} · tanh(ŵ_{m,n}(i,j)) / γ
where c_{i,j} ∈ [0, 1] denotes the confidence of the pixel at (i, j), ŵ_{m,n}(i,j) is the raw affinity, and γ is a normalization constant;
Constructing a loss function of the depth map completion network;
step 701: the reconstruction loss formula in the depth map completion is as follows:
Figure BDA0003826163090000062
wherein D gt Is a groudtuth depth map, D pred Is a depth map predicted by an algorithm, d υ V and | v | represent the depth value at the pixel index v, D, respectively gt P is 1 denotes l 1 Loss, 2 denotes l 2 Loss;
step 702: introducing the Chamfer Distance (CD) from 3D point cloud processing; the CD averages the distances between the mutually nearest points of two point sets and is computed as:
L_CD(S_1, S_2) = (1 / |S_1|) · Σ_{x∈S_1} min_{y∈S_2} ‖x − y‖_2 + (1 / |S_2|) · Σ_{y∈S_2} min_{x∈S_1} ‖x − y‖_2
where S_1 and S_2 are two 3D point sets; the dense depth map predicted by the depth map completion network is back-projected into 3D space to obtain pseudo laser points, the same operation is applied to the ground-truth dense depth map, and the CD loss between the two point clouds is computed;
step 703: combining the reconstruction loss of depth map completion with the CD loss from point cloud processing, the final loss function is:
L = μ · L_recon + (1 − μ) · L_CD
where μ is the weight coefficient of the depth map reconstruction loss;
Step eight: predicting on the test set data with the trained model to obtain the depth map completion results;
Step nine: performing dense reconstruction, mesh reconstruction, and texture mapping with the estimated depth maps using the open-source MVS framework OpenMVS;
step 901: replacing the depth map computation stage of OpenMVS: from the camera poses computed by COLMAP and the dense depth maps predicted by the depth map completion network, obtaining a dense point cloud by fusing multi-frame depth maps;
step 902: continuing with the mesh reconstruction in OpenMVS to obtain a three-dimensional mesh model from the dense point cloud of step 901, using the mesh refinement module to obtain a finer mesh model, and using the texture mapping module to obtain the final textured three-dimensional surface model.
Compared with the prior art, the invention has the following beneficial effects:
1. The method has simple steps and a reasonable design, and is convenient to implement, use, and operate.
2. Depth map completion is performed by combining an explicit three-dimensional representation with an implicit (SPN-based) model, capturing 3D geometric cues from the sparse and irregular depth distribution and improving the accuracy of depth map completion.
3. Two-dimensional image information and three-dimensional information are used simultaneously and their features are fused, improving the precision of three-dimensional reconstruction.
4. A loss function from 3D point cloud processing is introduced into depth map completion, feeding more three-dimensional structural information back to the model.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
Fig. 1 shows the classification of the main depth map completion methods.
FIG. 2 is a flow chart of the method of the present invention.
Fig. 3 is a block diagram of an encoder-decoder of a depth map completion network.
Detailed Description
The method of the present invention is further described in detail below with reference to the accompanying drawings and embodiments of the invention.
As shown in fig. 2, the present invention comprises the steps of:
Step one: acquiring a visible light image and LiDAR data set and preprocessing them;
step 101: acquiring multi-angle visible light images and point clouds of the same scene with a visible light camera and a LiDAR/laser scanner, respectively;
step 102: projecting the point cloud acquired by the LiDAR onto the imaging plane of the visible light camera to obtain the corresponding sparse depth map (a minimal projection sketch is given after step 104);
step 103: acquiring the ground-truth depth map: superposing each sparse depth map with its 2n + 1 temporally adjacent sparse depth maps (n = 5 when processing KITTI) to increase the density of the generated depth map, cleaning the accumulated laser-scan projections with semi-global matching (SGM) to remove outliers caused by occlusion, dynamic motion, and measurement artifacts, and taking the final accumulated depth map as the ground truth;
step 104: after the above processing, each visible light RGB image corresponds one-to-one to a sparse depth map and a point cloud; forming a data set from the visible light RGB images and their corresponding sparse depth maps and point clouds, and dividing the data set into a training set and a test set;
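By way of non-limiting illustration of step 102, the following Python/NumPy sketch projects a LiDAR point cloud into the visible light camera and rasterizes a sparse depth map. The intrinsic matrix K, the LiDAR-to-camera rotation R and translation t, and the image size are assumed to come from calibration; all names are illustrative and not part of the claimed method.

import numpy as np

def lidar_to_sparse_depth(points_xyz, K, R, t, height, width):
    """Project Nx3 LiDAR points into the camera and build a sparse depth map.

    points_xyz : (N, 3) LiDAR points in the LiDAR frame
    K          : (3, 3) camera intrinsic matrix
    R, t       : LiDAR-to-camera rotation (3, 3) and translation (3,)
    """
    # Transform points into the camera coordinate frame.
    pts_cam = points_xyz @ R.T + t          # (N, 3)
    z = pts_cam[:, 2]
    valid = z > 0                           # keep points in front of the camera
    pts_cam, z = pts_cam[valid], z[valid]

    # Perspective projection onto the image plane.
    uv = (K @ pts_cam.T).T                  # (M, 3) homogeneous image coordinates
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # Keep the nearest depth when several points fall on the same pixel.
    depth = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-z)                  # write far points first, near points last
    depth[v[order], u[order]] = z[order]
    return depth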
Step two: performing sparse reconstruction and camera pose estimation with the open-source SfM framework COLMAP;
step 201: inputting the multi-view visible light images into COLMAP and obtaining the camera poses and a sparse point cloud through sparse reconstruction;
step 202: converting the binary reconstruction files generated in step 201 into text format, and changing camera models of other types in the cameras file to the type required by the subsequent processing.
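Step two is typically driven from the COLMAP command line. The sketch below shows one plausible sequence (feature extraction, matching, mapping, and conversion of the binary model to text files) wrapped in Python; the paths are placeholders and the options should be checked against the installed COLMAP version.

import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Assumed layout: input images in ./images, outputs under ./colmap
run(["colmap", "feature_extractor", "--database_path", "colmap/db.db",
     "--image_path", "images"])
run(["colmap", "exhaustive_matcher", "--database_path", "colmap/db.db"])
run(["colmap", "mapper", "--database_path", "colmap/db.db",
     "--image_path", "images", "--output_path", "colmap/sparse"])
# Convert the binary model (cameras.bin, images.bin, points3D.bin) to text files.
run(["colmap", "model_converter", "--input_path", "colmap/sparse/0",
     "--output_path", "colmap/sparse_txt", "--output_type", "TXT"])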
Step three: constructing the image feature extraction module of the depth map completion network and extracting features from the image and the depth map; Fig. 3 shows the encoder-decoder framework of the network;
step 301: constructing an image encoder based on a residual network: the encoder for visible light image and sparse depth map features uses ResNet as its basic structure, processes the two inputs with separate additional convolutional layers, and concatenates the two feature maps from the different sources after the first convolutional layer as the input to ResNet;
step 302: passing the result of step 301 through five ResNet convolutional blocks to obtain an intermediate feature representation;
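A minimal PyTorch sketch of the image branch of steps 301-302 is given below: the RGB image and the sparse depth map each pass through their own stem convolution, the two feature maps are concatenated, and the result is fed through the ResNet stages. The channel sizes and the choice of ResNet-34 from torchvision (version 0.13 or later) are assumptions made for illustration only.

import torch
import torch.nn as nn
import torchvision

class ImageDepthEncoder(nn.Module):
    """Encodes an RGB image and a sparse depth map into multi-scale features."""

    def __init__(self):
        super().__init__()
        # Separate stem convolutions for the two modalities.
        self.rgb_conv = nn.Sequential(nn.Conv2d(3, 32, 3, 1, 1), nn.ReLU(inplace=True))
        self.dep_conv = nn.Sequential(nn.Conv2d(1, 32, 3, 1, 1), nn.ReLU(inplace=True))
        base = torchvision.models.resnet34(weights=None)
        # Replace the first ResNet conv so it accepts the 64 concatenated channels.
        self.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1, self.relu, self.maxpool = base.bn1, base.relu, base.maxpool
        self.layers = nn.ModuleList([base.layer1, base.layer2, base.layer3, base.layer4])

    def forward(self, rgb, sparse_depth):
        x = torch.cat([self.rgb_conv(rgb), self.dep_conv(sparse_depth)], dim=1)
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        feats = []
        for layer in self.layers:        # collect multi-scale features for the decoder
            x = layer(x)
            feats.append(x)
        return feats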
Step four: constructing the point cloud feature extraction module of the depth map completion network and extracting features from the point cloud;
step 401: selecting the classic point cloud network PointNet++ as the point cloud feature encoder;
step 402: grouping the input point cloud and extracting features for each group: changing the dimensions, applying convolution operations, and finally applying max pooling in the PointNet manner to obtain the group features;
step 403: repeating the sampling and grouping of step 402 several times and applying the PointNet operation to obtain the final global features;
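The following simplified sketch illustrates the grouping, shared-MLP, and max-pooling pattern of steps 401-403 as a single PointNet++-style set-abstraction level; it uses random sampling and k-nearest-neighbour grouping for brevity and is not the full PointNet++ implementation.

import torch
import torch.nn as nn

class SetAbstraction(nn.Module):
    """Simplified PointNet++ style set abstraction: sample, group, shared MLP, max-pool."""

    def __init__(self, in_dim, out_dim, num_groups=256, group_size=32):
        super().__init__()
        self.num_groups, self.group_size = num_groups, group_size
        self.mlp = nn.Sequential(
            nn.Conv2d(in_dim + 3, out_dim, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_dim, out_dim, 1), nn.ReLU(inplace=True))

    def forward(self, xyz, feats):
        # xyz: (B, N, 3) point coordinates, feats: (B, N, C) per-point features.
        B, N, _ = xyz.shape
        centers = xyz[:, torch.randperm(N, device=xyz.device)[:self.num_groups]]
        dists = torch.cdist(centers, xyz)                              # (B, G, N)
        idx = dists.topk(self.group_size, largest=False).indices       # nearest neighbours
        grouped_xyz = torch.gather(
            xyz.unsqueeze(1).expand(B, self.num_groups, N, 3), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, 3)) - centers.unsqueeze(2)
        grouped_feat = torch.gather(
            feats.unsqueeze(1).expand(B, self.num_groups, N, feats.shape[-1]), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, feats.shape[-1]))
        x = torch.cat([grouped_xyz, grouped_feat], dim=-1)             # (B, G, S, 3+C)
        x = self.mlp(x.permute(0, 3, 1, 2))                            # shared MLP as 1x1 conv
        return centers, x.max(dim=-1).values.permute(0, 2, 1)          # max-pool over each group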
Step five: constructing the decoder module of the depth map completion network and up-sampling the obtained features;
step 501: the decoder uses the multi-scale image features from the image encoder and the point features from the point cloud encoder; it consists of four transposed convolutional layers together with convolutional layers, and the transposed convolutions up-sample the features;
step 502: projecting the point features onto each transposed-convolution block at the same scale as the image features via feature projection;
step 503: sharing the decoder features to estimate the initial dense depth, the confidence, the non-local neighborhood, and the raw affinity: the output of the last transposed-convolution block is processed by convolutional layers to predict the initial dense depth map, the initial confidence, and the raw affinity;
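A hedged sketch of the decoder of steps 501-503: four transposed-convolution blocks progressively up-sample the fused features, projected point features are added at matching scales, and convolutional heads predict the initial dense depth, the confidence, and the per-pixel neighbour offsets with their raw affinities. All channel counts and the fusion-by-addition choice are illustrative assumptions.

import torch
import torch.nn as nn

class CompletionDecoder(nn.Module):
    """Upsamples fused features and predicts initial depth, confidence, and affinity."""

    def __init__(self, in_ch=512, num_neighbors=8):
        super().__init__()
        chans = [in_ch, 256, 128, 64, 32]
        self.blocks = nn.ModuleList()
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            self.blocks.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),  # x2 upsampling
                nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True)))
        self.depth_head = nn.Conv2d(32, 1, 3, padding=1)
        self.conf_head = nn.Sequential(nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())
        # 3 values per neighbour: 2-D offset (non-local neighbourhood) + raw affinity.
        self.aff_head = nn.Conv2d(32, 3 * num_neighbors, 3, padding=1)

    def forward(self, image_feat, point_feats=None):
        x = image_feat
        for i, block in enumerate(self.blocks):
            x = block(x)
            if point_feats is not None and i < len(point_feats):
                x = x + point_feats[i]     # projected point features at the matching scale
        return self.depth_head(x), self.conf_head(x), self.aff_head(x)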
Step six: constructing the SPN module of the depth map completion network and performing iterative optimization to obtain the final depth map;
step 601: an SPN propagates information from high-confidence regions to low-confidence regions according to the affinity of the data, but a fixed propagation neighborhood ignores the depth distribution within the local region, so a non-fixed local SPN is adopted;
step 602: the non-fixed local SPN estimates the neighborhood of each pixel beyond a fixed local region (i.e., non-fixed local) from the color and depth information; the non-fixed local neighborhood is defined as:
N_{m,n} = { (m + p, n + q) | (p, q) ∈ f_φ(I, D, m, n) }
where I and D are the visible light RGB image and the sparse depth map, respectively; f_φ(·) is the non-fixed local neighborhood prediction network, which estimates K neighbors for each pixel under the learnable parameters φ; and p and q are real numbers;
step 603: applying the non-fixed local SPN to the initial dense depth map obtained in step 503 by means of deformable convolution; the iterative optimization formula is:
D^{t+1}_{m,n} = w^{0}_{m,n} · D^{t}_{m,n} + Σ_{(i,j)∈N_{m,n}} w_{m,n}(i,j) · D^{t}_{i,j}
where (m, n) and (i, j) are the coordinates of the reference pixel and of a neighboring pixel, respectively, w^{0}_{m,n} denotes the affinity of the reference pixel, and w_{m,n}(i,j) denotes the affinity between the pixels at (m, n) and (i, j); the first term on the right-hand side represents the propagation of the reference pixel, and the second term represents the propagation of the neighborhood weighted by the corresponding affinities;
Step 604: in order to ensure the stability of the transmission, the affinity is normalized by combining confidence coefficient before the transmission, and Tanh-gamma-Abs-Sum is adopted * The process of (1) according to the formula:
Figure BDA0003826163090000105
in the formula, c i,j ∈[0,1]Representing the confidence of the pixel at (i, j).
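The following PyTorch sketch illustrates one way to realize the propagation of steps 601-604, using bilinear sampling (F.grid_sample) at learned real-valued offsets in place of a deformable convolution. The offsets, raw affinities, and confidence are assumed to come from the decoder heads of step 503, and γ is taken on the order of K so that the absolute weights sum to at most one; this is a sketch of the idea, not the patented implementation.

import torch
import torch.nn.functional as F

def nonlocal_propagate(depth, offsets, raw_affinity, confidence, gamma=None, steps=6):
    """Sketch of non-fixed local SPN refinement (steps 601-604) via bilinear sampling.

    depth        : (B, 1, H, W) initial dense depth from the decoder
    offsets      : (B, 2K, H, W) learned per-pixel neighbour offsets (x, y), in pixels
    raw_affinity : (B, K, H, W) raw affinities before normalisation
    confidence   : (B, 1, H, W) confidence map in [0, 1]
    """
    B, _, H, W = depth.shape
    K = raw_affinity.shape[1]
    gamma = float(K) if gamma is None else gamma          # keeps |weights| summing below 1

    ys, xs = torch.meshgrid(torch.arange(H, device=depth.device),
                            torch.arange(W, device=depth.device), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float()          # (H, W, 2) pixel coordinates

    # Step 604: tanh/gamma normalisation combined with the neighbour confidence below.
    aff = torch.tanh(raw_affinity) / gamma                # (B, K, H, W)

    # Sampling grids for the K non-local (real-valued offset) neighbours.
    grids = []
    for k in range(K):
        loc = base.unsqueeze(0) + offsets[:, 2 * k:2 * k + 2].permute(0, 2, 3, 1)
        loc = torch.stack([2 * loc[..., 0] / (W - 1) - 1,  # to [-1, 1] for grid_sample
                           2 * loc[..., 1] / (H - 1) - 1], dim=-1)
        grids.append(loc)

    neigh_conf = torch.cat([F.grid_sample(confidence, g, align_corners=True) for g in grids], 1)
    w = aff * neigh_conf                                   # confidence-weighted affinities
    w0 = 1.0 - w.sum(dim=1, keepdim=True)                  # weight kept by the reference pixel

    # Step 603: iterative propagation D^{t+1} = w0 * D^t + sum_k w_k * D^t(neighbour_k).
    for _ in range(steps):
        neigh_depth = torch.cat([F.grid_sample(depth, g, align_corners=True) for g in grids], 1)
        depth = w0 * depth + (w * neigh_depth).sum(dim=1, keepdim=True)
    return depth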
Constructing a loss function of the depth map completion network;
step 701: the reconstruction loss formula in the depth map completion is as follows:
Figure BDA0003826163090000106
wherein D gt Is a groudtuth depth map, D pred Is a depth map predicted by an algorithm, d υ V and | v | represent depth values at the pixel index v, D, respectively gt P is 1 denotes l 1 Loss, 2 represents l 2 Loss;
step 702: introducing the Chamfer Distance (CD) from 3D point cloud processing; the CD averages the distances between the mutually nearest points of two point sets and is computed as:
L_CD(S_1, S_2) = (1 / |S_1|) · Σ_{x∈S_1} min_{y∈S_2} ‖x − y‖_2 + (1 / |S_2|) · Σ_{y∈S_2} min_{x∈S_1} ‖x − y‖_2
where S_1 and S_2 are two 3D point sets; the dense depth map predicted by the depth map completion network is back-projected into 3D space to obtain pseudo laser points, the same operation is applied to the ground-truth dense depth map, and the CD loss between the two point clouds is computed;
step 703: combining the reconstruction loss of depth map completion with the CD loss from point cloud processing, the final loss function is:
L = μ · L_recon + (1 − μ) · L_CD
where μ is the weight coefficient of the depth map reconstruction loss;
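A compact sketch of the loss of steps 701-703, assuming the predicted and ground-truth depth maps have already been back-projected into point sets (the pseudo laser points of step 702). The brute-force chamfer distance and the value of mu are illustrative choices.

import torch

def chamfer_distance(p1, p2):
    """Average nearest-neighbour distance between two point sets (B, N, 3) and (B, M, 3)."""
    d = torch.cdist(p1, p2)                              # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def completion_loss(pred_depth, gt_depth, pred_points, gt_points, mu=0.9, p=1):
    """L = mu * L_recon + (1 - mu) * L_CD  (steps 701-703)."""
    valid = gt_depth > 0                                 # only supervise valid GT pixels
    diff = (pred_depth - gt_depth)[valid].abs()
    l_recon = diff.mean() if p == 1 else (diff ** 2).mean()
    l_cd = chamfer_distance(pred_points, gt_points)      # pseudo LiDAR vs. GT back-projection
    return mu * l_recon + (1.0 - mu) * l_cd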
Step eight: predicting on the test set data with the trained model to obtain the depth map completion results;
Step nine: performing dense reconstruction, mesh reconstruction, and texture mapping with the estimated depth maps using the open-source MVS framework OpenMVS;
step 901: replacing the depth map computation stage of OpenMVS: from the camera poses computed by COLMAP and the dense depth maps predicted by the depth map completion network, obtaining a dense point cloud by fusing multi-frame depth maps;
step 902: continuing with the mesh reconstruction in OpenMVS to obtain a three-dimensional mesh model from the dense point cloud of step 901, using the mesh refinement module to obtain a finer mesh model, and using the texture mapping module to obtain the final textured three-dimensional surface model.
The above description is only one embodiment of the present invention and does not limit the invention in any way; any simple modifications, alterations, and equivalent structural changes made to the above embodiment according to the technical essence of the invention still fall within the protection scope of the technical solution of the invention.

Claims (1)

1. A deep-learning-based visible light image and LiDAR (laser radar) data fusion method applicable to three-dimensional reconstruction of outdoor buildings, comprising the following steps:
Step one: acquiring a visible light image and LiDAR data set and preprocessing them;
step 101: acquiring multi-angle visible light images and point clouds of the same scene with a visible light camera and a LiDAR/laser scanner, respectively;
step 102: projecting the point cloud acquired by the LiDAR onto the imaging plane of the visible light camera to obtain the corresponding sparse depth map;
step 103: acquiring the ground-truth depth map: superposing each sparse depth map with its 2n + 1 temporally adjacent sparse depth maps (n = 5 when processing KITTI) to increase the density of the generated depth map, cleaning the accumulated laser-scan projections with semi-global matching (SGM) to remove outliers caused by occlusion, dynamic motion, and measurement artifacts, and taking the final accumulated depth map as the ground truth;
step 104: after the above processing, each visible light RGB image corresponds one-to-one to a sparse depth map and a point cloud; forming a data set from the visible light RGB images and their corresponding sparse depth maps and point clouds, and dividing the data set into a training set and a test set;
Step two: performing sparse reconstruction and camera pose estimation with the open-source SfM framework COLMAP;
step 201: inputting the multi-view visible light images into COLMAP and obtaining the camera poses and a sparse point cloud through sparse reconstruction;
step 202: converting the binary reconstruction files generated in step 201 into text format, and changing camera models of other types in the cameras file to the type required by the subsequent processing.
Step three: constructing the image feature extraction module of the depth map completion network and extracting features from the image and the depth map; Fig. 3 shows the encoder-decoder framework of the network;
step 301: constructing an image encoder based on a residual network: the encoder for visible light image and sparse depth map features uses ResNet as its basic structure, processes the two inputs with separate additional convolutional layers, and concatenates the two feature maps from the different sources after the first convolutional layer as the input to ResNet;
step 302: passing the result of step 301 through five ResNet convolutional blocks to obtain an intermediate feature representation;
Step four: constructing the point cloud feature extraction module of the depth map completion network and extracting features from the point cloud;
step 401: selecting the classic point cloud network PointNet++ as the point cloud feature encoder;
step 402: grouping the input point cloud and extracting features for each group: changing the dimensions, applying convolution operations, and finally applying max pooling in the PointNet manner to obtain the group features;
step 403: repeating the sampling and grouping of step 402 several times and applying the PointNet operation to obtain the final global features;
Step five: constructing the decoder module of the depth map completion network and up-sampling the obtained features;
step 501: the decoder uses the multi-scale image features from the image encoder and the point features from the point cloud encoder; it consists of four transposed convolutional layers together with convolutional layers, and the transposed convolutions up-sample the features;
step 502: projecting the point features onto each transposed-convolution block at the same scale as the image features via feature projection;
step 503: sharing the decoder features to estimate the initial dense depth, the confidence, the non-local neighborhood, and the raw affinity: the output of the last transposed-convolution block is processed by convolutional layers to predict the initial dense depth map, the initial confidence, and the raw affinity;
Step six: constructing the SPN module of the depth map completion network and performing iterative optimization to obtain the final depth map;
step 601: an SPN propagates information from high-confidence regions to low-confidence regions according to the affinity of the data, but a fixed propagation neighborhood ignores the depth distribution within the local region, so a non-fixed local SPN is adopted;
step 602: the non-fixed local SPN estimates the neighborhood of each pixel beyond a fixed local region (i.e., non-fixed local) from the color and depth information; the non-fixed local neighborhood is defined as:
N_{m,n} = { (m + p, n + q) | (p, q) ∈ f_φ(I, D, m, n) }
where I and D are the visible light RGB image and the sparse depth map, respectively; f_φ(·) is the non-fixed local neighborhood prediction network, which estimates K neighbors for each pixel under the learnable parameters φ; and p and q are real numbers;
step 603: applying the non-fixed local SPN to the initial dense depth map obtained in step 503 by means of deformable convolution; the iterative optimization formula is:
D^{t+1}_{m,n} = w^{0}_{m,n} · D^{t}_{m,n} + Σ_{(i,j)∈N_{m,n}} w_{m,n}(i,j) · D^{t}_{i,j}
where (m, n) and (i, j) are the coordinates of the reference pixel and of a neighboring pixel, respectively, w^{0}_{m,n} denotes the affinity of the reference pixel, and w_{m,n}(i,j) denotes the affinity between the pixels at (m, n) and (i, j); the first term on the right-hand side represents the propagation of the reference pixel, and the second term represents the propagation of the neighborhood weighted by the corresponding affinities;
Step 604: in order to ensure the stability of the transmission, the affinity is normalized by combining confidence coefficient before the transmission, and Tanh-gamma-Abs-Sum is adopted * The process of (1), formula:
Figure FDA0003826163080000035
in the formula, c i,j ∈[0,1]Representing the confidence of the pixel at (i, j).
Constructing a loss function of the depth map completion network;
step 701: the reconstruction loss formula in the depth map completion is as follows:
Figure FDA0003826163080000036
wherein D gt Is a groudtuth depth map, D pred Is a depth map predicted by the algorithm, d υ V and | v | represent the depth value at the pixel index v, D, respectively gt P is 1 denotes l 1 Loss, 2 denotes l 2 Loss;
step 702: introducing the Chamfer Distance (CD) from 3D point cloud processing; the CD averages the distances between the mutually nearest points of two point sets and is computed as:
L_CD(S_1, S_2) = (1 / |S_1|) · Σ_{x∈S_1} min_{y∈S_2} ‖x − y‖_2 + (1 / |S_2|) · Σ_{y∈S_2} min_{x∈S_1} ‖x − y‖_2
where S_1 and S_2 are two 3D point sets; the dense depth map predicted by the depth map completion network is back-projected into 3D space to obtain pseudo laser points, the same operation is applied to the ground-truth dense depth map, and the CD loss between the two point clouds is computed;
step 703: combining the reconstruction loss of depth map completion with the CD loss from point cloud processing, the final loss function is:
L = μ · L_recon + (1 − μ) · L_CD
where μ is the weight coefficient of the depth map reconstruction loss;
Step eight: predicting on the test set data with the trained model to obtain the depth map completion results;
Step nine: performing dense reconstruction, mesh reconstruction, and texture mapping with the estimated depth maps using the open-source MVS framework OpenMVS;
step 901: replacing the depth map computation stage of OpenMVS: from the camera poses computed by COLMAP and the dense depth maps predicted by the depth map completion network, obtaining a dense point cloud by fusing multi-frame depth maps;
step 902: continuing with the mesh reconstruction in OpenMVS to obtain a three-dimensional mesh model from the dense point cloud of step 901, using the mesh refinement module to obtain a finer mesh model, and using the texture mapping module to obtain the final textured three-dimensional surface model.
CN202211059667.XA 2022-08-30 2022-08-30 Image laser data fusion method based on deep learning and used for building reconstruction Pending CN115423978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211059667.XA CN115423978A (en) 2022-08-30 2022-08-30 Image laser data fusion method based on deep learning and used for building reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211059667.XA CN115423978A (en) 2022-08-30 2022-08-30 Image laser data fusion method based on deep learning and used for building reconstruction

Publications (1)

Publication Number Publication Date
CN115423978A true CN115423978A (en) 2022-12-02

Family

ID=84199811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211059667.XA Pending CN115423978A (en) 2022-08-30 2022-08-30 Image laser data fusion method based on deep learning and used for building reconstruction

Country Status (1)

Country Link
CN (1) CN115423978A (en)


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468768A (en) * 2023-04-20 2023-07-21 南京航空航天大学 Scene depth completion method based on conditional variation self-encoder and geometric guidance
CN116468768B (en) * 2023-04-20 2023-10-17 南京航空航天大学 Scene depth completion method based on conditional variation self-encoder and geometric guidance
CN116255976A (en) * 2023-05-15 2023-06-13 长沙智能驾驶研究院有限公司 Map fusion method, device, equipment and medium
CN116255976B (en) * 2023-05-15 2023-10-31 长沙智能驾驶研究院有限公司 Map fusion method, device, equipment and medium
CN116740300A (en) * 2023-06-16 2023-09-12 广东工业大学 Multi-mode-based prime body and texture fusion furniture model reconstruction method
CN116740300B (en) * 2023-06-16 2024-05-03 广东工业大学 Multi-mode-based prime body and texture fusion furniture model reconstruction method
CN116579955A (en) * 2023-07-13 2023-08-11 厦门微图软件科技有限公司 New energy battery cell weld reflection point denoising and point cloud complement method and system
CN116579955B (en) * 2023-07-13 2023-10-20 厦门微图软件科技有限公司 New energy battery cell weld reflection point denoising and point cloud complement method and system
CN118172422A (en) * 2024-05-09 2024-06-11 武汉大学 Method and device for positioning and imaging interest target by combining vision, inertia and laser


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination