CN115564888A - Visible light multi-view image three-dimensional reconstruction method based on deep learning - Google Patents

Visible light multi-view image three-dimensional reconstruction method based on deep learning

Info

Publication number
CN115564888A
Authority
CN
China
Prior art keywords
depth
map
depth map
feature
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210845580.9A
Other languages
Chinese (zh)
Inventor
罗欣
冯倩
吴禹萱
韦祖棋
宋依芸
冷庚
许文波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202210845580.9A priority Critical patent/CN115564888A/en
Publication of CN115564888A publication Critical patent/CN115564888A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Biophysics (AREA)
  • Geometry (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visible light multi-view image three-dimensional reconstruction method based on deep learning, improved from the MVSNet network. The batch normalization layers and nonlinear activation function layers in the network are replaced by fused Inplace-ABN layers, reducing video memory occupation. A weighted mean measurement method based on grouped similarity is designed to reduce the feature dimension of the cost volume, yielding a more lightweight cost volume, compressing the network parameters, and cutting both computation and video memory consumption. To address the problem that the depth map resolution is lower than that of the input image because MVSNet works on a low-scale feature map, a feature pyramid module is used to extract multi-scale feature maps, and a staged multi-scale iterative refinement of the depth estimate is designed. While preserving accuracy, multiple rounds of depth iteration reduce the average number of depth planes in the cost volume, so the cost volume attains higher spatial resolution and depth map estimation becomes more accurate. Finally, the output depth maps are filtered and fused to complete the three-dimensional scene reconstruction task.

Description

Visible light multi-view image three-dimensional reconstruction method based on deep learning
Technical Field
The invention belongs to the field of computer image processing, and relates to a method for performing three-dimensional reconstruction of visible light multi-view images based on deep learning and outputting a three-dimensional point cloud.
Background
As a technology for finely restoring real-world scenes, three-dimensional reconstruction plays an important role in people's daily life and production work. In three-dimensional reconstruction, the depth of an imaging pixel refers to the projection distance between the spatial three-dimensional point corresponding to that pixel and the camera's focal point. A depth map is a data format that records the depth of every pixel of an image; according to the depth map corresponding to an image, the pixels of the image can be restored into three-dimensional space to obtain a small patch of point cloud, and with enough images and depth maps a sufficiently dense point cloud can be obtained. MVSNet is a classical deep-learning-based MVS method that follows the idea of the plane-sweep method. Its main advantages are that feature extraction is performed by a convolutional neural network, the high-dimensional cost volume it constructs retains high-dimensional spatial-structure semantic information, the cost volume is regularized by a 3D CNN, its running speed is much higher than that of traditional methods, and it handles low-texture regions better; however, it also has some obvious shortcomings. MVSNet abandons the pixel map and instead performs depth estimation on a feature map. It uses a VGG-style feature extraction network in which multi-layer convolutions progressively shrink the size while extracting image features of different levels; after two downsamplings the feature map resolution is reduced to 1/16 of that of the original image (1/4 in each of width and height), so the width and height of the constructed cost volume are only 1/4 of those of the original image. Since the width and height of the depth map equal those of the cost volume, the area of the finally predicted depth map is only 1/16 of that of the reference image, and the convolution operations also make the edges of the target object overly smooth. To alleviate the reduced resolution and smoothed edges of the depth map, MVSNet adopts an additional 2D CNN upsampling module that refines and upsamples the H/4 × W/4 initial depth map; this process interpolates with the edge features contained in the original image and finally produces an H × W full-size depth map. Because this refinement operates on the initial depth map at the two-dimensional level, the three-dimensional high-level semantic information contained in the cost volume is not effectively utilized.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a visible light multi-view image three-dimensional reconstruction method based on deep learning, improved from the MVSNet network. The batch normalization layers and nonlinear activation function layers in the network are replaced by fused Inplace-ABN layers, reducing video memory occupation. A weighted mean measurement method based on grouped similarity is designed to reduce the feature dimension of the cost volume, yielding a more lightweight cost volume, compressing the network parameters, and cutting both computation and video memory consumption. To address the problem that the depth map resolution is lower than that of the input image because MVSNet works on a low-scale feature map, a feature pyramid module is used to extract multi-scale feature maps, and a staged multi-scale iterative refinement of the depth estimate is designed. While preserving accuracy, multiple rounds of depth iteration reduce the average number of depth planes in the cost volume, so the cost volume attains higher spatial resolution and depth map estimation becomes more accurate. Finally, the output depth maps are filtered and fused to complete the scene three-dimensional reconstruction task.
The technical route adopted by the invention is as follows:
A multi-view image three-dimensional reconstruction method based on deep learning comprises the following steps:
Step 1: performing incremental SfM on the image group of the scene to be predicted, and calculating the camera parameters of each image and the sparse point cloud of the scene to be predicted;
Step 1.1: reading in the image group of the scene to be predicted with the COLMAP program, running the incremental structure-from-motion algorithm, and calculating the camera parameters of each image and the sparse point cloud of the scene to be predicted.
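For illustration only, the SfM of step 1.1 can be scripted around the COLMAP command-line tools roughly as follows; the workspace layout and the choice of the exhaustive matcher are assumptions, not specified by the patent, and the flags should be checked against the installed COLMAP version.

```python
import os
import subprocess

def run_incremental_sfm(image_dir: str, workspace: str) -> str:
    """Sketch of step 1.1: run COLMAP incremental SfM to obtain per-image
    camera parameters and a sparse point cloud. Paths are placeholders."""
    db = os.path.join(workspace, "database.db")
    sparse_dir = os.path.join(workspace, "sparse")
    os.makedirs(sparse_dir, exist_ok=True)
    # 1) detect and describe keypoints in every image
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", db, "--image_path", image_dir], check=True)
    # 2) match features between image pairs (exhaustive matching chosen for simplicity)
    subprocess.run(["colmap", "exhaustive_matcher", "--database_path", db], check=True)
    # 3) incremental mapping: registers images one by one and triangulates the
    #    sparse point cloud, yielding per-image intrinsics and extrinsics
    subprocess.run(["colmap", "mapper", "--database_path", db,
                    "--image_path", image_dir, "--output_path", sparse_dir], check=True)
    return sparse_dir
```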
Step 2: designing an improved depth estimation network based on MVSNet, feeding the images of the scene to be predicted into the network for computation, and obtaining the depth map and probability map corresponding to each image;
Step 2.1: for an original image of size H × W, adopting the same extraction process as the MVSNet feature extractor; after the 32-channel high-dimensional feature map is obtained, applying multi-layer convolution and two rounds of 2× interpolation upsampling, aggregating the result after each interpolation upsampling with the feature map of the same resolution from the previous stage, and finally obtaining feature maps of sizes H/4 × W/4 × 32, H/2 × W/2 × 16 and H × W × 8.
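A minimal PyTorch sketch of the three-scale feature pyramid of step 2.1 is given below. The backbone layers are a simplified stand-in for the MVSNet-style extractor, and the aggregation here uses concatenation followed by a convolution, an assumption since the patent does not spell out the exact fusion operator; only the output sizes (H/4 × W/4 × 32, H/2 × W/2 × 16, H × W × 8) follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Sketch of step 2.1: three-scale FPN yielding H/4 x W/4 x 32,
    H/2 x W/2 x 16 and H x W x 8 feature maps."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 8, 3, 1, 1), nn.ReLU(True),
                                    nn.Conv2d(8, 8, 3, 1, 1), nn.ReLU(True))    # H   x W
        self.stage2 = nn.Sequential(nn.Conv2d(8, 16, 5, 2, 2), nn.ReLU(True),
                                    nn.Conv2d(16, 16, 3, 1, 1), nn.ReLU(True))  # H/2 x W/2
        self.stage3 = nn.Sequential(nn.Conv2d(16, 32, 5, 2, 2), nn.ReLU(True),
                                    nn.Conv2d(32, 32, 3, 1, 1), nn.ReLU(True))  # H/4 x W/4
        self.out_low = nn.Conv2d(32, 32, 1)               # low-scale output, 32 channels
        self.merge_mid = nn.Conv2d(32 + 16, 16, 3, 1, 1)  # fuse upsampled low with stage2
        self.merge_high = nn.Conv2d(16 + 8, 8, 3, 1, 1)   # fuse upsampled mid with stage1

    def forward(self, img):
        c1 = self.stage1(img)                              # H   x W   x 8
        c2 = self.stage2(c1)                               # H/2 x W/2 x 16
        c3 = self.stage3(c2)                               # H/4 x W/4 x 32
        f_low = self.out_low(c3)
        up = F.interpolate(c3, scale_factor=2, mode="bilinear", align_corners=False)
        f_mid = self.merge_mid(torch.cat([up, c2], dim=1))
        up = F.interpolate(f_mid, scale_factor=2, mode="bilinear", align_corners=False)
        f_high = self.merge_high(torch.cat([up, c1], dim=1))
        return f_low, f_mid, f_high                        # used by stages 2.4 / 2.5 / 2.6
```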
Step 2.2: for each adjacent view, extracting from the sparse point cloud, according to the camera parameters and sparse point cloud of the scene obtained in step 1, the point cloud set of the area co-visible with the reference view. For each point in this set, the baseline angle formed at the point with respect to the optical centers and principal optical axes of the two cameras is computed, a score is assigned to the point with a piecewise Gaussian function, and the scores of all points are summed into a total score representing the matching degree between the two images.
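The view matching score of step 2.2 can be sketched as follows; the piecewise Gaussian parameters theta0, sigma1 and sigma2 are illustrative values (the patent does not give them), and the score form follows MVSNet-style view selection.

```python
import numpy as np

def pairwise_view_score(points, center_ref, center_src,
                        theta0=5.0, sigma1=1.0, sigma2=10.0):
    """Sketch of step 2.2: score one (reference, adjacent) view pair from the
    sparse points they both observe. theta0/sigma1/sigma2 are illustrative.
    points: (N, 3) co-visible sparse points; center_*: (3,) camera optical centers."""
    score = 0.0
    for p in points:
        v1, v2 = center_ref - p, center_src - p
        cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        theta = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))  # baseline angle at p
        sigma = sigma1 if theta <= theta0 else sigma2              # piecewise Gaussian
        score += np.exp(-((theta - theta0) ** 2) / (2.0 * sigma ** 2))
    return score
```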
Step 2.3: the 32-channel feature volume obtained by applying the differentiable homography warping to the feature map extracted from each adjacent view is divided into G channel groups, and the similarity between each group and the corresponding channel group of the reference-view feature volume is computed as an inner product, giving a G-channel similarity volume for each adjacent view. The similarity volumes of all adjacent views are then aggregated by a normalized weighted mean, with the matching degree scores as weighting coefficients, finally yielding the G-channel cost volume of the grouped mean measurement.
Step 2.4: using the lowest-scale feature map extracted by the feature pyramid module, 64 depth planes are set uniformly over the depth range of the whole scene, and an H/4 × W/4 × 64 × G cost volume is constructed with the grouped-similarity mean measurement method of step 2.3, where G is the number of groups. The cost volume is then regularized with a 3D CNN to obtain a probability volume, and an H/4 × W/4 coarse depth map is estimated. The batch normalization layer and nonlinear activation function layer after each convolution layer in the 3D CNN are replaced by an Inplace-ABN layer.
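The coarse stage of step 2.4 can be summarized by the sketch below: 64 uniformly spaced depth hypotheses over the scene range, a small 3D CNN regularizer whose Conv3d layers are each followed by an Inplace-ABN layer, and a soft-argmin expectation over the probability volume (the MVSNet convention for depth regression). The 3D CNN is a toy stand-in for the 3D U-Net, and InPlaceABN is assumed to come from the third-party inplace_abn package and to accept the 5-D cost-volume tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from inplace_abn import InPlaceABN  # fused batch norm + activation (third-party package)

def uniform_depth_planes(d_min, d_max, num_planes=64):
    """Step 2.4: 64 depth hypotheses spread uniformly over the scene depth range."""
    return torch.linspace(d_min, d_max, num_planes)

class CostRegularization(nn.Module):
    """Toy stand-in for the 3D U-Net regularizer: every Conv3d is followed by an
    InPlaceABN layer instead of separate BatchNorm3d + activation layers."""
    def __init__(self, groups=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(groups, 8, 3, padding=1), InPlaceABN(8),
            nn.Conv3d(8, 8, 3, padding=1), InPlaceABN(8),
            nn.Conv3d(8, 1, 3, padding=1))                 # -> (B, 1, D, H, W)

    def forward(self, cost):                               # cost: (B, G, D, H, W)
        return F.softmax(self.net(cost).squeeze(1), dim=1)  # probability volume over D

def regress_depth(prob, depth_values):
    """Soft-argmin: expected depth under the probability volume.
    prob: (B, D, H, W); depth_values: (D,)"""
    return torch.sum(prob * depth_values.view(1, -1, 1, 1), dim=1)
```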
Step 2.5: the coarse depth map estimated in step 2.4 is upsampled by a factor of 2 with the help of the middle-scale feature map extracted by the feature pyramid module, giving an H/2 × W/2 upsampled depth map. This depth map is taken as a prior depth surface, and 32 equidistant relative depth surfaces are placed in front of and behind it, with 1/128 of the scene depth range as the interval. After the relative depth surfaces are set up, an H/2 × W/2 × 32 × G cost volume is constructed with the grouped-similarity mean measurement method. The cost volume is regularized with the 3D CNN module of step 2.4 to obtain a probability volume, an H/2 × W/2 relative depth map is estimated, and it is superimposed on the bilinearly upsampled prior depth map to obtain the H/2 × W/2 intermediate-level depth map.
Step 2.6: similarly to step 2.5, the intermediate-level depth map output by step 2.5 is upsampled by a factor of 2 with the high-scale feature map extracted by the feature pyramid module to obtain an H × W upsampled depth map. Taking this depth map as the prior depth surface, 8 equidistant relative depth planes are placed in front of and behind it with a plane interval of 1/256 of the scene depth; an H × W × 8 × G cost volume is constructed with the grouped-similarity mean measurement method and regularized with the 3D CNN module of step 2.5 to obtain a probability volume, a relative depth map of size H × W is estimated, and it is superimposed on the bilinearly upsampled intermediate-level depth map to obtain the final depth map.
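Steps 2.5 and 2.6 share the same pattern, sketched below: upsample the previous depth map, place a small number of depth planes at fixed offsets around it, build a new grouped cost volume on the current-scale features, and add the regressed relative depth back onto the upsampled prior. The offsets centered on the prior surface and the callables build_cost_volume and regularize are illustrative placeholders, not names from the patent.

```python
import torch
import torch.nn.functional as F

def refinement_stage(prior_depth, depth_range, num_planes, interval_frac,
                     build_cost_volume, regularize):
    """Sketch of one refinement stage (steps 2.5 / 2.6).
    prior_depth:   (B, h, w) depth map from the previous, coarser stage.
    depth_range:   d_max - d_min for the whole scene.
    num_planes:    32 for the middle stage, 8 for the final stage.
    interval_frac: 1/128 for the middle stage, 1/256 for the final stage.
    build_cost_volume / regularize are placeholders that close over the current-scale
    feature maps (grouped-similarity cost construction and 3D CNN regularization)."""
    # 1) 2x bilinear upsampling of the prior depth map
    up = F.interpolate(prior_depth.unsqueeze(1), scale_factor=2,
                       mode="bilinear", align_corners=False).squeeze(1)    # (B, 2h, 2w)
    # 2) equidistant relative depth surfaces in front of and behind the prior surface
    step = interval_frac * depth_range
    offsets = (torch.arange(num_planes) - num_planes // 2).float() * step  # (D,)
    hypotheses = up.unsqueeze(1) + offsets.view(1, -1, 1, 1)               # (B, D, 2h, 2w)
    # 3) grouped-similarity cost volume -> probability volume -> relative depth
    prob = regularize(build_cost_volume(hypotheses))                       # (B, D, 2h, 2w)
    relative = torch.sum(prob * offsets.view(1, -1, 1, 1), dim=1)
    # 4) depth of this stage = upsampled prior + estimated relative depth
    return up + relative
```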
Step 3: filtering and fusing the depth maps of all images according to geometric consistency to generate the three-dimensional point cloud data of the scene to be predicted.
Step 4: generating the scene's three-dimensional point cloud data;
Step 4.1: after the depth map and probability map of each image are obtained, the depth map is screened with a threshold on the probability map; the pixel depths that pass the threshold are further filtered by two-view geometric consistency, and the filtered depth pixels are fused to obtain the point cloud data.
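A sketch of the filtering in steps 3 and 4 follows; the confidence threshold of 0.5 and the 1-pixel / 1% reprojection-consistency thresholds are illustrative assumptions rather than values from the patent, and reproject is a placeholder for the two-view consistency check.

```python
import numpy as np

def filter_depth(depth_ref, prob_ref, depth_src, reproject,
                 prob_thresh=0.5, pix_thresh=1.0, depth_thresh=0.01):
    """Sketch of steps 3-4: probability thresholding plus two-view geometric
    consistency. reproject(depth_ref, depth_src) is a placeholder expected to
    return, for every reference pixel, the reprojection error in pixels and the
    relative depth difference against the source view."""
    conf_mask = prob_ref > prob_thresh                        # keep confident pixels only
    pix_err, rel_depth_err = reproject(depth_ref, depth_src)  # forward-backward check
    geo_mask = (pix_err < pix_thresh) & (rel_depth_err < depth_thresh)
    mask = conf_mask & geo_mask
    return np.where(mask, depth_ref, 0.0), mask               # filtered depth + validity mask
```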
The method is suitable for three-dimensional reconstruction engineering with visible light multi-view images, such as building model reconstruction and unmanned aerial vehicle photogrammetry.
Compared with the prior art, the invention has the following advantages:
(1) When the MVSNet network reconstructs images, its video memory consumption is excessive, which greatly limits application to high-resolution scenes. In the improved deep-learning MVSNet method of the invention, the batch normalization layers and nonlinear activation function layers in the network are replaced by fused Inplace-ABN layers, reducing video memory occupation.
(2) The designed weighted mean measurement method based on grouped similarity reduces the feature dimension of the cost volume, yielding a more lightweight cost volume, compressing the network parameters, and reducing computation and video memory consumption.
(3) To address the problem that the depth map resolution is lower than that of the input image because MVSNet works on a low-scale feature map, a feature pyramid module is used to extract multi-scale feature maps, and a staged multi-scale iterative refinement of the depth estimate is designed. While preserving accuracy, multiple rounds of depth iteration reduce the average number of depth planes in the cost volume, so the cost volume attains higher spatial resolution and depth map estimation becomes more accurate.
Drawings
Fig. 1 is a diagram of a network architecture of the present invention.
Fig. 2 is a diagram of the feature pyramid network structure of the present invention.
Fig. 3 is a flow chart of the construction of the grouped mean cost volume according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The network structure of this patent is shown in fig. 1. First, the feature extraction structure of MVSNet is improved: pyramid feature extraction is performed on the input reference-view image and adjacent-view images with a Feature Pyramid Network (FPN), yielding a series of feature maps at different scales; these feature maps are then built, from low scale to high scale, into cost volumes of different scales by the mean measurement method based on grouped similarity. After the low-scale cost volume is regularized in 3D and the estimated depth map of that scale is output, this depth map serves as prior depth information for iteratively correcting the depth in the higher-scale cost volume, and after multi-stage multi-scale iteration the depth map with the same resolution as the reference-view image is finally obtained.
Based on the FPN idea, the feature extraction network of MVSNet is improved to extract several feature maps of different scales. As shown in fig. 2, the same extraction process as the original MVSNet feature extractor is first adopted; after the 32-channel high-dimensional feature map is obtained, multi-layer convolution and two rounds of interpolation upsampling are applied, and after each interpolation upsampling the result is aggregated with the feature map of the same resolution from the previous stage, finally producing three feature maps of different scales. For an original image of size H × W, the FPN output feature maps have sizes H/4 × W/4 × 32, H/2 × W/2 × 16 and H × W × 8; they aggregate high-level semantic features and are used to construct the cost volumes of the different stages.
Drawing on the grouped mean measurement mechanism used in binocular matching tasks, the multi-view image depth estimation network is improved by replacing the original variance-based measurement with a mean measurement method based on grouped similarity for constructing the cost volume; the specific flow is shown in fig. 3.
Denote the feature map of the reference view by F_0 and the feature map of the i-th adjacent view by F_i, and write the result of warping F_i by the differentiable homography onto the j-th depth hypothesis plane d_j as F_{i,j}. F_0 and F_{i,j} are each divided into G groups along the channel dimension, and the similarity between corresponding channel groups is computed; the similarity of the g-th group is denoted S^g_{i,j}, with g ∈ {0, 1, ..., G-1}, and is computed as

S^g_{i,j} = (G / 32) · < F_0^g , F_{i,j}^g >

where F_0^g is the g-th channel group of the reference-view feature map F_0, F_{i,j}^g is the g-th channel group of F_{i,j}, <·,·> denotes the inner product, and the factor G/32 normalizes by the 32/G channels contained in each group. Once the similarities of all G groups are computed, they are stacked into the G-channel feature similarity map S_{i,j}. Let the total number of depth hypothesis planes be D, so that j ∈ {0, 1, ..., D-1}; the D feature similarity maps S_{i,j} between the reference image and the i-th adjacent image can then be combined into a W × H × D × G similarity volume V_i. Unlike the feature volume of MVSNet, V_i records how similar the feature map of the adjacent view is to that of the reference view. Accordingly, instead of the variance-based aggregation that MVSNet applies to the feature volumes of the different adjacent views, the similarity volumes V_i are aggregated by a mean-based scheme to obtain the lightweight matching cost volume C. In a variance-based cost volume, a smaller variance at depth plane d indicates a higher probability that the depth value is d; in the mean-based cost volume, a larger mean at depth plane d indicates that the views agree more strongly at d, and hence that the depth value d is more probable. With N adjacent views and the matching degree score w_i of the i-th adjacent view as the weighting coefficient, the aggregation formula is

C = ( Σ_{i=1}^{N} w_i · V_i ) / ( Σ_{i=1}^{N} w_i )

which reduces to the simple mean (1/N) Σ_i V_i when all weights are equal.
the size of the cost body C is W multiplied by H multiplied by D multiplied by G, the size of the cost body can be reduced to the original G/F based on the average value measurement of the grouping similarity, G =8 is set, and compared with the original 32-channel cost body, the operation consumption of a 3D U-Net regularization link is reduced.

Claims (4)

1. A visible light multi-view image three-dimensional reconstruction method based on deep learning, characterized by comprising the following steps:
Step 1: performing incremental SfM on the image group of the scene to be predicted, and calculating the camera parameters of each image and the sparse point cloud of the scene to be predicted;
Step 1.1: reading in the image group of the scene to be predicted with the COLMAP program, running the incremental structure-from-motion algorithm, and calculating the camera parameters of each image and the sparse point cloud of the scene to be predicted.
Step 2: designing an improved depth estimation network based on MVSNet, feeding the images of the scene to be predicted into the network for computation, and obtaining the depth map and probability map corresponding to each image;
Step 2.1: for an original image of size H × W, adopting the same extraction process as the MVSNet feature extractor; after the 32-channel high-dimensional feature map is obtained, applying multi-layer convolution and two rounds of 2× interpolation upsampling, aggregating the result after each interpolation upsampling with the feature map of the same resolution from the previous stage, and finally obtaining feature maps of sizes H/4 × W/4 × 32, H/2 × W/2 × 16 and H × W × 8.
Step 2.2: for each adjacent view, extracting from the sparse point cloud, according to the camera parameters and sparse point cloud of the scene obtained in step 1, the point cloud set of the area co-visible with the reference view; computing, for each point in this set, the baseline angle formed at the point with respect to the optical centers and principal optical axes of the two cameras, assigning the point a score with a piecewise Gaussian function, and summing the scores of all points into a total score representing the matching degree between the two images.
Step 2.3: dividing the 32-channel feature volume obtained by applying the differentiable homography warping to the feature map extracted from each adjacent view into G channel groups, computing the similarity between each group and the corresponding channel group of the reference-view feature volume as an inner product to give a G-channel similarity volume for each adjacent view, and aggregating the similarity volumes of all adjacent views by a normalized weighted mean with the matching degree scores as weighting coefficients, finally obtaining the G-channel cost volume of the grouped mean measurement.
Step 2.4: using the lowest-scale feature map extracted by the feature pyramid module, setting 64 depth planes uniformly over the depth range of the whole scene, and constructing an H/4 × W/4 × 64 × G cost volume with the grouped-similarity mean measurement method of step 2.3, where G is the number of groups; regularizing the cost volume with a 3D CNN to obtain a probability volume and estimating an H/4 × W/4 coarse depth map, the batch normalization layer and nonlinear activation function layer after each convolution layer in the 3D CNN being replaced by an Inplace-ABN layer.
Step 2.5: upsampling the coarse depth map estimated in step 2.4 by a factor of 2 with the middle-scale feature map extracted by the feature pyramid module to obtain an H/2 × W/2 upsampled depth map; taking this depth map as a prior depth surface and placing 32 equidistant relative depth surfaces in front of and behind it, with 1/128 of the scene depth range as the interval; constructing an H/2 × W/2 × 32 × G cost volume with the grouped-similarity mean measurement method; regularizing the cost volume with the 3D CNN module of step 2.4 to obtain a probability volume, estimating an H/2 × W/2 relative depth map, and superimposing it on the bilinearly upsampled prior depth map to obtain the H/2 × W/2 intermediate-level depth map.
Step 2.6: similarly to step 2.5, upsampling the intermediate-level depth map output by step 2.5 by a factor of 2 with the high-scale feature map extracted by the feature pyramid module to obtain an H × W upsampled depth map; taking this depth map as the prior depth surface and placing 8 equidistant relative depth planes in front of and behind it with a plane interval of 1/256 of the scene depth; constructing an H × W × 8 × G cost volume with the grouped-similarity mean measurement method, regularizing it with the 3D CNN module of step 2.5 to obtain a probability volume, estimating a relative depth map of size H × W, and superimposing it on the bilinearly upsampled intermediate-level depth map to obtain the final depth map.
Step 3: filtering and fusing the depth maps of all images according to geometric consistency to generate the three-dimensional point cloud data of the scene to be predicted.
Step 4: generating the scene's three-dimensional point cloud data;
Step 4.1: after the depth map and probability map of each image are obtained, screening the depth map with a threshold on the probability map, filtering the pixel depths that pass the threshold by two-view geometric consistency, and fusing the filtered depth pixels to obtain the point cloud data.
2. The method as claimed in claim 1, wherein step 2.1 uses a feature pyramid network structure to improve the feature extraction network of the MVSNet, extracts the multi-scale image features, and replaces the batch normalization layer and the activation function layer with the Inplace-ABN layer, thereby reducing the consumption of video memory.
3. The method according to claim 1, wherein the matching-weighted mean measurement method designed in step 2.2 converts the feature maps obtained after the differentiable homography warping into G-channel similarity volumes based on grouped similarity, designs the view matching degree algorithm, and aggregates the similarity volumes of the adjacent views into a lightweight cost volume by a matching-degree weighted mean.
4. The method of claim 1, wherein step 2.3 performs multi-stage iteration through multi-scale mean cost volumes aggregated from the multi-scale feature maps, refines the depth map by increasing the spatial resolution, and finally outputs a depth map and a probability map of the same size as the original image.
CN202210845580.9A 2022-07-18 2022-07-18 Visible light multi-view image three-dimensional reconstruction method based on deep learning Pending CN115564888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210845580.9A CN115564888A (en) 2022-07-18 2022-07-18 Visible light multi-view image three-dimensional reconstruction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210845580.9A CN115564888A (en) 2022-07-18 2022-07-18 Visible light multi-view image three-dimensional reconstruction method based on deep learning

Publications (1)

Publication Number Publication Date
CN115564888A true CN115564888A (en) 2023-01-03

Family

ID=84738586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210845580.9A Pending CN115564888A (en) 2022-07-18 2022-07-18 Visible light multi-view image three-dimensional reconstruction method based on deep learning

Country Status (1)

Country Link
CN (1) CN115564888A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091712A (en) * 2023-04-12 2023-05-09 安徽大学 Multi-view three-dimensional reconstruction method and system for computing resource limited equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination