CN113066168B - Multi-view stereo network three-dimensional reconstruction method and system


Info

Publication number
CN113066168B
Authority
CN
China
Prior art keywords: depth, stage, image, interval, determining
Legal status
Active
Application number
CN202110378393.XA
Other languages
Chinese (zh)
Other versions
CN113066168A (en)
Inventor
柏正尧
程威
Current Assignee
Yunnan University YNU
Original Assignee
Yunnan University YNU
Application filed by Yunnan University (YNU)
Priority to CN202110378393.XA
Publication of CN113066168A
Application granted
Publication of CN113066168B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects


Abstract

The invention relates to a multi-view stereo network three-dimensional reconstruction method and system. The method comprises the following steps: acquiring a reference image and a plurality of actually captured images of a target object to be reconstructed, and extracting image features of the captured images with a multi-scale feature extraction module; grouping the image features according to the feature similarity between the reference image and the captured images, constructing a depth cost volume, and determining a depth map; in the coarse stage, dividing the depth range with a fixed depth interval to determine the coarse-stage depth prediction interval, and in the refinement stage, determining an adaptive depth interval from the uncertainty of the coarse-stage depth prediction and using it to determine the refinement-stage depth prediction interval; inferring, through a cascaded depth architecture and according to the depth prediction intervals of the two stages, a final depth map with the same resolution as the reference image; and generating a dense three-dimensional point cloud from the final depth map with a depth filtering and fusion script. The invention can improve reconstruction accuracy and reconstruction effect.

Description

Multi-view stereo network three-dimensional reconstruction method and system
Technical Field
The invention relates to the field of multi-view stereo network three-dimensional reconstruction, and in particular to a multi-view stereo network three-dimensional reconstruction method and system.
Background
Multi-view stereo (MVS) aims to recover three-dimensional scene surfaces from a set of calibrated two-dimensional images and estimated camera parameters, and is widely applied in fields such as autonomous driving, augmented reality, digital presentation and protection of cultural relics, and urban-scale measurement. Compared with active three-dimensional reconstruction methods, which require an expensive depth camera or structured-light camera, MVS has the advantages of low cost, convenience and high efficiency, and can generate a high-quality dense point cloud simply by fusing accurate depth maps; acquiring high-quality depth maps therefore plays a crucial role in generating an accurate dense point cloud in multi-view three-dimensional reconstruction.
Conventional methods generally introduce a hand-crafted similarity measurement for image association and then perform iterative optimization to generate a dense point cloud, for example using normalized cross-correlation as the similarity measurement followed by semi-global matching for optimization. Such methods work well on ideal Lambertian surfaces, but reconstruct low-texture, highlight and reflective areas of real scenes poorly, leaving the reconstructed scene incomplete and its accuracy hard to guarantee.
To address this problem, some recent deep-learning-based MVS methods use convolutional neural networks (CNNs) to extract features and construct a three-dimensional depth cost volume from which the depth map of each view is inferred; the final three-dimensional scene is then built by depth map fusion. These learnable methods introduce global semantic information, such as illumination and reflection priors, into depth map inference, yielding more robust feature matching, and photometric matching in three-dimensional space also alleviates the image distortion caused by perspective transformation and occlusion. However, one key step of such methods is building the 3D depth cost volume through a plane-sweep process, and the cost volume generally needs multi-scale 3D CNNs for regularization, which consumes a great deal of GPU memory and computation. Some approaches reduce GPU memory occupation by using down-sampled images; this effectively lowers memory use but loses feature information, so the estimated depth map has very low resolution and the reconstruction accuracy and completeness drop sharply. In addition, in the depth prediction stage these methods mostly adopt fixed depth intervals, dividing the depth estimation range into a number of depth planes at fixed spacing; this division can only set the same hypothesized depth planes for all scenes and cannot choose an optimal depth interval for each scene to be estimated, so the reconstructed result differs greatly from the actual scene and the reconstruction effect is poor.
Disclosure of Invention
The invention aims to provide a multi-view stereo network three-dimensional reconstruction method and system, to solve the problems that the depth map estimated by existing reconstruction methods has low resolution, and that dividing the depth estimation range into a number of fixed-spacing depth intervals leads to poor reconstruction accuracy and reconstruction effect.
In order to achieve the above object, the invention provides the following scheme:
a multi-view stereo network three-dimensional reconstruction method comprises the following steps:
acquiring a reference image and a plurality of actually captured images of a target object to be reconstructed, and extracting image features of the plurality of actually captured images with a multi-scale feature extraction module; the plurality of actually captured images are self-acquired images obtained by shooting around the target object to be reconstructed; the multi-scale feature extraction module comprises a down-sampling encoder and an up-sampling decoder;
introducing a similarity measurement, grouping the image features according to the feature similarity between the reference image and the actually captured images, and constructing a depth cost volume;
performing a regularization operation on the depth cost volume to determine a depth map;
based on a coarse-to-fine depth inference strategy, dividing the depth range with a fixed depth interval in the coarse stage to determine the coarse-stage depth prediction interval, and, in the refinement stage, determining an adaptive depth interval from the uncertainty of the coarse-stage depth prediction, dividing the depth range with the adaptive depth interval, and determining the refinement-stage depth prediction interval;
inferring, through a cascaded depth architecture, a final depth map with the same resolution as the reference image according to the coarse-stage and refinement-stage depth prediction intervals; the cascaded depth architecture comprises one coarse stage and two refinement stages;
generating a dense three-dimensional point cloud from the final depth map with a depth filtering and fusion script; the dense three-dimensional point cloud is used for displaying the target object to be reconstructed.
Optionally, the down-sampling encoder comprises convolutional layers, each followed by a BN layer and an activation function; two convolutional layers with a stride of 2 and a 5x5 convolution kernel down-sample the actually captured image twice;
the up-sampling decoder comprises 2 skip-connected up-sampling layers and 4 convolutional layers for unifying the number of output channels;
an image matrix of the actually captured image is input; successive convolution operations in the encoder extract image feature maps at three scales, and the decoder's convolutional layers, combined with the skip-connected up-sampling layers, extract the final image feature maps at three scales; the final image feature maps include full-size, 1/2-size and 1/4-size image features of the actually captured image.
Optionally, the introducing of the similarity measurement, the grouping of the image features according to the feature similarity between the reference image and the actually captured images, and the constructing of the depth cost volume specifically comprise:
dividing the feature channels of the final image feature map into a plurality of groups, and computing, at each hypothesized depth plane, the feature similarity of the feature maps between the reference image and the actually captured image within each group of feature channels;
compressing the final image feature map into a similarity tensor over the plurality of groups of feature channels based on the feature similarity within each group; the set of similarity tensors over the feature channels is the depth cost volume.
Optionally, performing the regularization operation on the depth cost volume to determine the depth map specifically comprises:
inputting the depth cost volume into a 3D UNet model and outputting a regularized depth cost volume; the 3D UNet model comprises a plurality of down-sampling and up-sampling 3D convolutional layers;
performing a Softmax operation along the depth direction of the regularized depth cost volume, computing the depth probability of each pixel in the regularized depth cost volume, and determining a depth probability volume containing the depth probability distribution information;
computing the weighted average of the hypothesized depths assigned to each pixel and the depth probability volume to determine the depth map.
Optionally, determining the adaptive depth interval in the refinement stage from the uncertainty of the coarse-stage depth prediction, dividing the depth range with the adaptive depth interval, and determining the refinement-stage depth prediction interval specifically comprises:
obtaining the depth predicted in the coarse stage and the set number of depth planes;
computing the standard deviation of the depth probability distribution at each pixel in the coarse stage from the coarse-stage predicted depth and the set number of depth planes;
computing the uncertainty of the coarse-stage depth prediction from the sum of, and the difference between, the coarse-stage predicted depth and the standard deviation of the depth probability distribution;
obtaining the upper and lower uncertainty boundaries of the coarse-stage depth prediction;
determining the adaptive depth interval from the upper and lower boundaries, dividing the depth range with the adaptive depth interval, and determining the refinement-stage depth prediction interval.
Optionally, after generating the dense three-dimensional point cloud with the depth filtering and fusion script according to the final depth map, the method further comprises:
evaluating the dense three-dimensional point cloud using the DTU dataset and the Tanks & Temples dataset.
A multi-view stereo network three-dimensional reconstruction system comprises:
an image feature extraction module, configured to acquire a reference image and a plurality of actually captured images of a target object to be reconstructed, and to extract image features of the plurality of actually captured images with the multi-scale feature extraction module; the plurality of actually captured images are self-acquired images obtained by shooting around the target object to be reconstructed; the multi-scale feature extraction module comprises a down-sampling encoder and an up-sampling decoder;
a depth cost volume construction module, configured to introduce a similarity measurement, group the image features according to the feature similarity between the reference image and the actually captured images, and construct a depth cost volume;
a depth map determination module, configured to perform a regularization operation on the depth cost volume and determine a depth map;
a coarse-stage depth prediction interval determination module and a refinement-stage depth prediction interval determination module, configured to, based on a coarse-to-fine depth inference strategy, divide the depth range with a fixed depth interval in the coarse stage, determine the coarse-stage depth prediction interval, determine an adaptive depth interval in the refinement stage from the uncertainty of the coarse-stage depth prediction, divide the depth range with the adaptive depth interval, and determine the refinement-stage depth prediction interval;
a final depth map inference module, configured to infer, through a cascaded depth architecture, a final depth map with the same resolution as the reference image according to the coarse-stage and refinement-stage depth prediction intervals; the cascaded depth architecture comprises one coarse stage and two refinement stages;
a dense three-dimensional point cloud construction module, configured to generate a dense three-dimensional point cloud from the final depth map with a depth filtering and fusion script; the dense three-dimensional point cloud is used for displaying the target object to be reconstructed.
Optionally, the down-sampling encoder comprises convolutional layers, each followed by a BN layer and an activation function; two convolutional layers with a stride of 2 and a 5x5 convolution kernel down-sample the actually captured image twice;
the up-sampling decoder comprises 2 skip-connected up-sampling layers and 4 convolutional layers for unifying the number of output channels;
an image matrix of the actually captured image is input; successive convolution operations in the encoder extract image feature maps at three scales, and the decoder's convolutional layers, combined with the skip-connected up-sampling layers, extract the final image feature maps at three scales; the final image feature maps include full-size, 1/2-size and 1/4-size image features of the actually captured image.
Optionally, the depth cost volume construction module specifically comprises:
a feature similarity determination unit, configured to divide the feature channels of the final image feature map into a plurality of groups and to compute, at each hypothesized depth plane, the feature similarity of the feature maps between the reference image and the actually captured image within each group of feature channels;
a similarity tensor determination unit, configured to compress the final image feature map into a similarity tensor over the plurality of groups of feature channels based on the feature similarity within each group; the set of similarity tensors over the feature channels is the depth cost volume.
Optionally, the depth map determination module specifically comprises:
a regularization unit, configured to input the depth cost volume into a 3D UNet model and output a regularized depth cost volume; the 3D UNet model comprises a plurality of down-sampling and up-sampling 3D convolutional layers;
a depth probability volume determination unit, configured to perform a Softmax operation along the depth direction of the regularized depth cost volume, compute the depth probability of each pixel in the regularized depth cost volume, and determine a depth probability volume containing the depth probability distribution information;
a depth map determination unit, configured to compute the weighted average of the hypothesized depths assigned to each pixel and the depth probability volume to determine the depth map.
According to the specific embodiments provided by the invention, the following technical effects are disclosed: the invention provides a multi-view stereo network three-dimensional reconstruction method and system that group image features by a similarity measurement and construct a depth cost volume. The similarity measurement's average group correlation replaces variance-based feature cost accumulation for feature grouping, which improves the effective utilization of features, discards redundant feature information in the image, and, compared with variance-based construction, reduces GPU memory occupation while improving feature utilization efficiency and reconstruction quality. Meanwhile, compared with a fixed depth interval, the designed adaptive depth interval module applies pixel-level weighting to the depth prediction interval, so the refinement stages adopt different adaptive depth intervals and achieve more finely subdivided prediction intervals. The coarse-to-fine depth prediction framework also exploits the cascaded hierarchy effectively: the coarse-stage depth prediction information guides the refinement stages in dividing adaptive depth intervals, and the coarse and fine predictions complement each other, making the final depth estimation finer and improving the reconstruction effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a multi-view stereo network three-dimensional reconstruction method provided by the present invention;
FIG. 2 is a schematic diagram of a network structure of a multi-view stereo network three-dimensional reconstruction method provided by the present invention;
FIG. 3 is a schematic diagram of the 3D UNet network structure provided by the present invention;
FIG. 4 is a schematic diagram of an adaptive depth interval provided by the present invention;
FIG. 5 is a block diagram of a multi-view stereo network three-dimensional reconstruction system provided by the present invention;
FIG. 6 is a schematic diagram of the three-dimensional reconstruction result of DTU dataset scene 9 provided by the present invention;
FIG. 7 is a schematic diagram of the three-dimensional reconstruction results of DTU dataset scenes 77 and 49 provided by the present invention;
FIG. 8 is a schematic diagram of the three-dimensional reconstruction results on the Tanks & Temples intermediate set provided by the present invention;
FIG. 9 is a schematic diagram of the reconstruction results on a self-acquired dataset provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a multi-view stereo network three-dimensional reconstruction method and system that can improve reconstruction accuracy and reconstruction effect.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a multi-view stereo network three-dimensional reconstruction method provided by the present invention, and as shown in fig. 1, a multi-view stereo network three-dimensional reconstruction method includes:
step 101: acquiring a reference image and a plurality of actual shot images of a target object to be reconstructed, and extracting image features of the plurality of actual shot images by using a multi-scale feature extraction module; the plurality of actual shot images are self-acquisition images obtained by performing surrounding shooting on the target object to be reconstructed; the multi-scale feature extraction module includes a down-sampling encoder and an up-sampling decoder.
The multi-scale feature extraction module comprises a down-sampling encoder and an up-sampling decoder. The down-sampling part consists of 8 convolutional layers, each followed by a Batch Normalization (BN) layer and a Rectified Linear Unit (ReLU) activation; two of these layers, with a stride of 2 and a 5x5 convolution kernel, down-sample the original image twice. The up-sampling part comprises 2 skip-connected up-sampling layers and 4 convolutional layers that unify the number of output channels. An input image matrix passes through a convolution kernel to extract the feature matrix (feature map) of the image corresponding to that kernel; through the successive convolutions of the down-sampling part, feature maps at three scales are extracted, and the up-sampling part's convolutions and skip connections (the small-scale feature map is bilinearly interpolated and added to the larger-scale feature map) then sequentially produce the final feature maps at three scales: full-size, 1/2-size and 1/4-size features of the original image.
In the prior art, a pyramid structure is built on the images themselves: images at several scales are input and CNN features are extracted from each separately. As shown in FIG. 2, the invention replaces this image pyramid with a feature pyramid: a single-scale image is input, the down-sampling part of the UNet extracts feature maps at three scales through successive convolutions, and the up-sampling part outputs them through convolutions and skip connections. In addition, the number of channels of the larger-scale feature maps is reduced so that the network parameters do not grow too large.
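The following PyTorch sketch illustrates one way such an encoder-decoder feature pyramid could be assembled; the channel widths, exact layer counts and class name are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1, k=3):
    # convolution followed by BN and ReLU, as in the encoder described above
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class FeaturePyramid(nn.Module):
    """Encoder with two stride-2 5x5 convolutions, decoder with two
    skip-connected bilinear up-sampling steps; outputs features at
    full, 1/2 and 1/4 resolution (channel counts are assumptions)."""
    def __init__(self, base=8):
        super().__init__()
        self.enc0 = nn.Sequential(conv_bn_relu(3, base), conv_bn_relu(base, base))
        self.enc1 = nn.Sequential(conv_bn_relu(base, base * 2, stride=2, k=5),
                                  conv_bn_relu(base * 2, base * 2))
        self.enc2 = nn.Sequential(conv_bn_relu(base * 2, base * 4, stride=2, k=5),
                                  conv_bn_relu(base * 4, base * 4))
        # convolutions that unify the number of output channels
        self.lat1 = nn.Conv2d(base * 2, base * 4, 1)
        self.lat0 = nn.Conv2d(base, base * 4, 1)
        self.out2 = nn.Conv2d(base * 4, base * 4, 3, padding=1)
        self.out1 = nn.Conv2d(base * 4, base * 2, 3, padding=1)
        self.out0 = nn.Conv2d(base * 4, base, 3, padding=1)

    def forward(self, x):
        f0 = self.enc0(x)   # full resolution
        f1 = self.enc1(f0)  # 1/2 resolution
        f2 = self.enc2(f1)  # 1/4 resolution
        u1 = F.interpolate(f2, scale_factor=2, mode="bilinear",
                           align_corners=False) + self.lat1(f1)
        u0 = F.interpolate(u1, scale_factor=2, mode="bilinear",
                           align_corners=False) + self.lat0(f0)
        return self.out0(u0), self.out1(u1), self.out2(f2)
```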
Step 102: introducing a similarity measurement, grouping the image features according to the feature similarity between the reference image and the actually captured images, and constructing a depth cost volume.
First, the feature channels of the feature maps obtained by multi-scale feature extraction are divided into G groups, and the similarity of the feature maps between the reference image and the warped image is computed at each hypothesized depth plane. After the G groups of feature similarities are computed, the original feature map is compressed into a similarity tensor with G channels; the cost volume of a source image is the set of these G-channel similarity tensors, and the final total cost volume is computed as the average of the cost volumes of all views.
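As a concrete illustration, the average group correlation can be sketched as below; the tensor shapes and function names are assumptions, and the differentiable homography warping that produces `warped_src` is omitted.

```python
import torch

def group_correlation(ref_feat, warped_src, G=8):
    # ref_feat: (B, C, H, W) reference-view features
    # warped_src: (B, C, D, H, W) source features warped onto D depth planes
    B, C, D, H, W = warped_src.shape
    ref = ref_feat.view(B, G, C // G, 1, H, W)
    src = warped_src.view(B, G, C // G, D, H, W)
    # per-group inner product averaged over the group's channels:
    # the C-channel features are compressed to a G-channel similarity tensor
    return (ref * src).mean(dim=2)  # (B, G, D, H, W)

def cost_volume(ref_feat, warped_srcs, G=8):
    # the final total cost volume is the average over all source views
    sims = [group_correlation(ref_feat, w, G) for w in warped_srcs]
    return torch.stack(sims, dim=0).mean(dim=0)
```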
Step 103: performing a regularization operation on the depth cost volume to determine a depth map.
As shown in FIG. 3, a depth probability volume for depth estimation is obtained by regularizing the depth cost volume. The regularization operation inputs the cost volume into a 3D UNet composed of several down-sampling and up-sampling 3D convolutional layers and outputs the regularized cost volume; a Softmax operation is then applied along the depth direction to compute the depth probability of each pixel, giving a probability volume that contains the depth probability distribution information; finally, the weighted average of the hypothesized depths assigned to each pixel and the probability volume is computed to obtain a continuous depth estimate, that is, the depth map.
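The Softmax-and-expectation step can be made concrete with a short sketch (tensor shapes are assumptions; the 3D UNet regularizer itself is not reproduced here):

```python
import torch
import torch.nn.functional as F

def depth_regression(reg_cost, depth_hypotheses):
    # reg_cost: (B, D, H, W) cost volume after 3D UNet regularization
    # depth_hypotheses: (B, D, H, W) per-pixel hypothesized depth planes
    prob = F.softmax(reg_cost, dim=1)              # depth probability volume
    depth = (prob * depth_hypotheses).sum(dim=1)   # expectation -> depth map
    return depth, prob
```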
Step 104: based on a coarse-to-fine depth inference strategy, dividing the depth range with a fixed depth interval in the coarse stage to determine the coarse-stage depth prediction interval, and, in the refinement stage, determining an adaptive depth interval from the uncertainty of the coarse-stage depth prediction, dividing the depth range with the adaptive depth interval, and determining the refinement-stage depth prediction interval.
In a coarse-to-fine cascaded depth architecture, the depth prediction interval of the coarsest stage needs to cover the whole scene, and each refinement stage infers an uncertainty-based depth interval from the depth predicted by the previous stage.
As shown in FIG. 4, in a refinement stage of depth estimation the adaptive depth interval module computes the uncertainty of the previous depth prediction. Given the depth estimated in the previous stage and the number of hypothesized depth planes determined there, these two known quantities are first used to compute the standard deviation of the depth probability distribution at each pixel of the previous stage; the previous stage's predicted depth plus and minus this standard deviation then gives the uncertainty of that prediction, and the resulting upper and lower uncertainty boundaries delimit the refinement stage's depth prediction interval. In addition, because the standard deviation (the square root of the variance) is computed in a differentiable way, the adaptive adjustment of the depth interval can optimize the weights in its calculation, that is, the divided depth prediction interval, through the back-propagation of the deep learning network.
The uncertainty of the coarse-stage estimated depth thus guides the refinement stage to divide a better-fitting depth prediction interval.
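A minimal sketch of this computation, under assumed tensor shapes, shows how the interval is obtained and why it stays differentiable:

```python
import torch

def adaptive_depth_hypotheses(prev_depth, prev_prob, prev_hyps, num_planes):
    # prev_depth: (B, H, W); prev_prob, prev_hyps: (B, D, H, W), all taken
    # from the previous (coarser) stage
    var = (prev_prob * (prev_hyps - prev_depth.unsqueeze(1)) ** 2).sum(dim=1)
    sigma = var.clamp(min=1e-8).sqrt()   # standard deviation; differentiable,
                                         # so it can be tuned by back-prop
    low = prev_depth - sigma             # lower uncertainty boundary
    high = prev_depth + sigma            # upper uncertainty boundary
    steps = torch.linspace(0.0, 1.0, num_planes,
                           device=prev_depth.device).view(1, -1, 1, 1)
    # subdivide [low, high] into the refinement stage's depth planes
    return low.unsqueeze(1) + (high - low).unsqueeze(1) * steps
```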
Step 105: inferring, through a cascaded depth architecture, a final depth map with the same resolution as the reference image according to the coarse-stage and refinement-stage depth prediction intervals; the cascaded depth architecture includes one coarse stage and two refinement stages.
Step 106: generating a dense three-dimensional point cloud from the final depth map with a depth filtering and fusion script; the dense three-dimensional point cloud is used for displaying the target object to be reconstructed.
Photometric matching and geometric consistency between the reference image and the source images are computed, pixels whose values fall below the thresholds and redundant pixels are filtered out, and the images with the most overlapping regions, selected iteratively in pairs, are then back-projected into three-dimensional space to generate the three-dimensional point cloud model.
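A simplified sketch of such a filtering step follows; the threshold values and input arrays are illustrative assumptions, and the actual fusion script may differ.

```python
import numpy as np

def filter_depth(depth, confidence, reproj_depth, reproj_err,
                 conf_thresh=0.8, rel_depth_thresh=0.01, pix_thresh=1.0):
    # photometric check: keep pixels whose depth probability is high enough
    photometric_ok = confidence > conf_thresh
    # geometric check: reprojection into a source view and back must agree
    geometric_ok = (reproj_err < pix_thresh) & (
        np.abs(reproj_depth - depth) / depth < rel_depth_thresh)
    return np.where(photometric_ok & geometric_ok, depth, 0.0)
```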
Step 106 is followed by: evaluating the dense three-dimensional point cloud using the DTU dataset and the Tanks & Temples dataset.
The DTU dataset, published by the Technical University of Denmark for evaluating multi-view three-dimensional reconstruction algorithms, comprises 124 different indoor scenes, each containing multiple views taken from 49 or 64 camera positions under 7 different lighting conditions. The 3D reconstruction performance of the invention on the DTU dataset is quantitatively evaluated with the MATLAB scripts officially provided with the dataset, by computing the average accuracy (Acc), the average completeness (Comp) and the overall accuracy (OA = (Acc + Comp)/2); smaller values represent higher reconstruction quality.
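A simplified, hedged illustration of these metrics (the official MATLAB scripts apply additional distance thresholding and observability masks not reproduced here):

```python
import numpy as np
from scipy.spatial import cKDTree

def dtu_metrics(pred_pts, gt_pts):
    # pred_pts, gt_pts: (N, 3) point clouds in millimetres
    acc = cKDTree(gt_pts).query(pred_pts)[0].mean()    # accuracy (Acc)
    comp = cKDTree(pred_pts).query(gt_pts)[0].mean()   # completeness (Comp)
    return acc, comp, (acc + comp) / 2                 # overall accuracy (OA)
```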
The Tanks & Temples dataset, open-sourced by the Intel Intelligent Systems Lab, consists of large-scale outdoor scenes in complex environments under real illumination, and reflects the real world better than the DTU dataset, which is captured with precisely controlled, fixed camera trajectories. It is divided into an intermediate set and an advanced set according to reconstruction difficulty; reconstruction performance on Tanks & Temples uses the F-score as the evaluation index, with larger values indicating better reconstruction. The two datasets provide a complete evaluation pipeline for multi-view three-dimensional reconstruction.
The method replaces the traditional pipeline of constructing matching costs from geometric and photometric consistency, accumulating them, and then estimating depth values. It takes one reference image and several actually captured images as input, combines camera geometry with a deep learning network by embedding it through a differentiable homography warping operation, and connects a 2D image feature network with a 3D spatial regularization network, so that the whole multi-view three-dimensional reconstruction can be trained end to end.
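For reference, a common form of the differentiable homography used in learned plane-sweep MVS (the MVSNet formulation; offered here as an assumption about the warping the invention refers to, not quoted from the patent) maps reference-view pixels to source view $i$ at hypothesized depth $d$:

$$ H_i(d) = K_i \cdot R_i \cdot \left( I - \frac{(t_1 - t_i)\, n_1^{T}}{d} \right) \cdot R_1^{T} \cdot K_1^{-1} $$

where $K$, $R$ and $t$ are the calibrated camera intrinsics, rotations and translations, and $n_1$ is the principal axis of the reference camera.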
The method adopts an improved multi-scale feature extraction network to extract multi-scale depth features of the image. The coarse level uses low-level features to construct, after cost volume regularization, a coarse depth map predicted for the reference image; each refinement level uses higher-level features, combined with the depth map estimated by the previous level, to determine an adaptive depth interval and estimate a higher-resolution depth map; cascaded depth inference finally yields a depth map whose resolution matches the reference image.
The invention designs a similarity-measurement-based average group correlation to replace variance-based feature cost accumulation for feature grouping, improving the effective utilization of features and eliminating redundant feature information in the image; the grouped features are then built into a 3D cost volume based on the similarity measurement, which, compared with variance-based construction, reduces GPU memory occupation while improving feature utilization efficiency and reconstruction quality.
The invention estimates the per-pixel uncertainty interval from the change of the probability distribution and constructs the adaptive depth interval from it; a differentiable computation method lets the network learn to adjust the probability prediction of each stage, realizing end-to-end training of the refinement stages' adaptive depth intervals and optimizing the corresponding depth prediction intervals so that the predicted depth values approximate the true values more closely.
First, multi-scale image features are extracted by the multi-scale feature extraction module; second, a similarity measurement is introduced to group the features and construct the cost volume; finally, an adaptive depth interval module is designed on the basis of the coarse-to-fine depth inference strategy, and the uncertainty of the coarse-stage estimated depth guides the refinement stages to divide better-fitting depth prediction intervals. All estimated depth maps are turned into a dense point cloud by the depth filtering and fusion script. Extensive experiments on the DTU, BlendedMVS and Tanks & Temples datasets show that the method is clearly superior to existing learning-based methods and traditional MVS methods in accuracy, real-time performance and reconstruction quality, and has broad application prospects in fields such as autonomous driving, digital presentation of cultural relics, and urban-scale measurement.
Fig. 5 is a structural diagram of a multi-view stereo network three-dimensional reconstruction system provided by the present invention, and as shown in fig. 5, a multi-view stereo network three-dimensional reconstruction system includes:
the image feature extraction module 501 is configured to acquire a reference image of a target object to be reconstructed and a plurality of actually captured images, and extract image features of the plurality of actually captured images by using the multi-scale feature extraction module; the plurality of actual shot images are self-acquisition images obtained by performing surrounding shooting on the target object to be reconstructed; the multi-scale feature extraction module includes a down-sampling encoder and an up-sampling decoder.
The downsampled encoder includes a subsequent BN layer and a convolutional layer with an activation function; two convolution layers with the step length of 2 and the convolution kernel size of 5x5 downsample the actual shot image for two times; the up-sampling decoder comprises 2 up-sampling layers with jump connection and 4 convolutional layers for unifying the number of output channels; inputting an image matrix of the actually shot image, sequentially performing convolution operation through the encoder to extract an image characteristic diagram containing three scales, and sequentially extracting a final image characteristic diagram containing three scales through the convolution layer of the decoder in combination with the upper sampling layer in jump connection; the final image feature map includes full-size image features, 1/2-size image features, and 1/4-size image features of the actual captured image.
The depth cost volume construction module 502 is configured to introduce a similarity measurement, group the image features according to the feature similarity between the reference image and the actually captured images, and construct a depth cost volume.
The depth cost volume construction module 502 specifically comprises: a feature similarity determination unit, configured to divide the feature channels of the final image feature map into a plurality of groups and to compute, at each hypothesized depth plane, the feature similarity of the feature maps between the reference image and the actually captured image within each group of feature channels; and a similarity tensor determination unit, configured to compress the final image feature map into a similarity tensor over the plurality of groups of feature channels based on the feature similarity within each group; the set of similarity tensors over the feature channels is the depth cost volume.
The depth map determination module 503 is configured to perform a regularization operation on the depth cost volume and determine a depth map.
The depth map determination module 503 specifically comprises: a regularization unit, configured to input the depth cost volume into a 3D UNet model and output a regularized depth cost volume, the 3D UNet model comprising a plurality of down-sampling and up-sampling 3D convolutional layers; a depth probability volume determination unit, configured to perform a Softmax operation along the depth direction of the regularized depth cost volume, compute the depth probability of each pixel in the regularized depth cost volume, and determine a depth probability volume containing the depth probability distribution information; and a depth map determination unit, configured to compute the weighted average of the hypothesized depths assigned to each pixel and the depth probability volume to determine the depth map.
The coarse-stage depth prediction interval determination module 504 and the refinement-stage depth prediction interval determination module 504 are configured to, based on a coarse-to-fine depth inference strategy, divide the depth range with a fixed depth interval in the coarse stage, determine the coarse-stage depth prediction interval, determine an adaptive depth interval in the refinement stage from the uncertainty of the coarse-stage depth prediction, divide the depth range with the adaptive depth interval, and determine the refinement-stage depth prediction interval.
The final depth map inference module 505 is configured to infer, through a cascaded depth architecture, a final depth map with the same resolution as the reference image according to the coarse-stage and refinement-stage depth prediction intervals; the cascaded depth architecture includes one coarse stage and two refinement stages.
The dense three-dimensional point cloud construction module 506 is configured to generate a dense three-dimensional point cloud from the final depth map with a depth filtering and fusion script; the dense three-dimensional point cloud is used for displaying the target object to be reconstructed.
As can be seen from FIG. 6 to FIG. 9, in the qualitative comparison of DTU dataset scenes 9, 77 and 49 between the invention, CasMVSNet and R-MVSNet under the same input image resolution setting, the dense point cloud reconstructed by the invention is more complete, and color averaging can additionally be applied to generate a smooth point cloud.
Tables 1-2 show the results of 3 traditional methods, 8 learning-based methods and the invention on the DTU dataset under the same experimental parameter settings; table 1 is the DTU benchmark result table, from which it can be seen that the overall performance of the adaptive depth interval multi-view stereo network proposed by the invention is the best. The average completeness (Comp. = 0.298 mm) reaches the state of the art, an improvement of 0.031 mm over the previously best AttMVS (Comp. = 0.329 mm), and the overall average is improved by 0.023 mm over the previously best UCSNet (OA = 0.344 mm).
TABLE 1
(Table 1 is reproduced as an image in the original patent publication; its contents are not available as text.)
Table 2 is the Tanks & Temples benchmark quantitative result table. As shown in table 2, the invention has a clear advantage in F-score over other published deep-learning-based multi-view stereo methods: the average F-score is raised from CasMVSNet's 56.84 to 58.60, and the Horse scene score of 55.14 is the highest among all currently registered methods, demonstrating the effectiveness and robustness of the invention's network framework in complex scenes.
(Table 2 is reproduced as an image in the original patent publication; its contents are not available as text.)
In summary, the general process of the invention is as follows. Given a reference image and a group of adjacent source images, the algorithm regresses a fine-grained depth map with the same resolution as the reference image in a coarse-to-fine strategy. First, all input images are fed to the feature extraction module to extract multi-scale image features. Depth prediction is then divided into three stages from coarse to fine: for the three image feature scales, three cost volumes of different resolutions are constructed by average group correlation over different depth intervals. For the coarsest stage, the depth interval is fixed to ensure that the plane-sweep algorithm covers the whole scene; the depth intervals of the two refinement stages adapt to the depth predicted by the previous stage and are constrained by a minimum depth interval condition. Finally, depth maps with the same resolution as the reference image are obtained by stepwise regression and refinement of the cost volumes regularized by the 3D CNNs of the three stages. After the depth maps of all views are obtained, they can be filtered with an open-source depth fusion toolbox and fused to generate a dense point cloud.
The method uses a cascaded depth inference framework instead of single-stage depth inference: the coarse level uses low-level features to construct, after cost volume regularization, a coarse depth map predicted for the reference image; the refinement levels use higher-level features combined with the depth map from the previous level to determine adaptive depth intervals and estimate higher-resolution depth maps; cascaded depth inference finally yields a depth map consistent with the reference image resolution.
The invention introduces a similarity-measurement-based average group correlation to group the features and construct the depth cost volume, replacing variance-based feature cost accumulation for feature grouping; this improves the effective utilization of features and eliminates redundant feature information in the image. The grouped features are then stacked into a 3D cost volume based on the similarity measurement, and the final total cost volume is computed as the average similarity over all views; compared with variance-based construction, this reduces GPU memory occupation while improving feature utilization efficiency and reconstruction quality.
The invention designs the adaptive depth interval module to improve depth prediction accuracy. At the coarse level, the depth prediction interval is divided with a fixed depth interval according to the actual depth range of the scene, ensuring that the initial-stage depth prediction covers the whole scene. Determining the depth interval essentially delimits, at that stage, the physical spacing between the hypothesized depth planes at each pixel. As shown in FIG. 4, the depth prediction interval of stage 1 needs to cover the whole scene, while each refinement stage infers an uncertainty-based depth interval from the previously predicted depth and adaptively divides upper and lower curved boundaries with spatially varying depth hypotheses. The per-pixel uncertainty interval is estimated from the change of the probability distribution and used to construct the adaptive depth interval; a differentiable computation lets the network learn to adjust the probability prediction of each stage, realizing end-to-end training of the refinement stages' adaptive depth intervals and optimizing the corresponding depth prediction intervals so that the predicted depth values come closer to the true values.
The method is implemented on the deep learning framework PyTorch and runs on a GPU workstation with an NVIDIA GeForce RTX 2080 Ti graphics card. For quantitative comparison with existing methods, the invention adopts the public DTU dataset and the official evaluation pipeline provided by Tanks & Temples to evaluate its reconstruction effect.
The method adopts the feature pyramid to extract features of the image at different scales more effectively, and on this basis introduces average group correlation to construct the cost volume by similarity measurement instead of the variance-based construction, reducing GPU memory occupation while achieving better accuracy and completeness. Compared with a fixed depth interval, the adaptive depth interval module designed by the invention applies pixel-level weighting to the depth prediction interval, realizing a more finely subdivided prediction interval; meanwhile, the coarse-to-fine depth prediction framework effectively exploits the cascaded hierarchy, the coarse-level depth prediction information guides the refinement levels in dividing adaptive depth intervals, and the two complement each other to make the final depth estimation finer.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (7)

1. A multi-view stereo network three-dimensional reconstruction method is characterized by comprising the following steps:
acquiring a reference image and a plurality of actually captured images of a target object to be reconstructed, and extracting image features of the plurality of actually captured images with a multi-scale feature extraction module; the plurality of actually captured images are self-acquired images obtained by shooting around the target object to be reconstructed; the multi-scale feature extraction module comprises a down-sampling encoder and an up-sampling decoder;
introducing a similarity measurement, grouping the image features according to the feature similarity between the reference image and the actually captured images, and constructing a depth cost volume; wherein the introducing of the similarity measurement, the grouping of the image features according to the feature similarity between the reference image and the actually captured images, and the constructing of the depth cost volume specifically comprise:
dividing the feature channels of the final image feature map into a plurality of groups, and computing, at each hypothesized depth plane, the feature similarity of the feature maps between the reference image and the actually captured image within each group of feature channels;
compressing the final image feature map into a similarity tensor over the plurality of groups of feature channels based on the feature similarity within each group; the set of similarity tensors over the feature channels is the depth cost volume;
performing a regularization operation on the depth cost volume to determine a depth map;
based on a coarse-to-fine depth inference strategy, dividing the depth range with a fixed depth interval in the coarse stage to determine the coarse-stage depth prediction interval, and, in the refinement stage, determining an adaptive depth interval from the uncertainty of the coarse-stage depth prediction, dividing the depth range with the adaptive depth interval, and determining the refinement-stage depth prediction interval; wherein determining the adaptive depth interval in the refinement stage from the uncertainty of the coarse-stage depth prediction, dividing the depth range with the adaptive depth interval, and determining the refinement-stage depth prediction interval specifically comprise:
obtaining the depth predicted in the coarse stage and the set number of depth planes;
computing the standard deviation of the depth probability distribution at each pixel in the coarse stage from the coarse-stage predicted depth and the set number of depth planes;
computing the uncertainty of the coarse-stage depth prediction from the sum of, and the difference between, the coarse-stage predicted depth and the standard deviation of the depth probability distribution;
obtaining the upper and lower uncertainty boundaries of the coarse-stage depth prediction;
determining the adaptive depth interval from the upper and lower boundaries, dividing the depth range with the adaptive depth interval, and determining the refinement-stage depth prediction interval;
inferring, through a cascaded depth architecture, a final depth map with the same resolution as the reference image according to the coarse-stage and refinement-stage depth prediction intervals; the cascaded depth architecture comprises one coarse stage and two refinement stages;
generating a dense three-dimensional point cloud from the final depth map with a depth filtering and fusion script; the dense three-dimensional point cloud is used for displaying the target object to be reconstructed.
2. The multi-view stereo network three-dimensional reconstruction method of claim 1, wherein the down-sampling encoder comprises convolutional layers, each followed by a BN layer and an activation function; two convolutional layers with a stride of 2 and a 5x5 convolution kernel down-sample the actually captured image twice;
the up-sampling decoder comprises 2 skip-connected up-sampling layers and 4 convolutional layers for unifying the number of output channels;
an image matrix of the actually captured image is input; successive convolution operations in the encoder extract image feature maps at three scales, and the decoder's convolutional layers, combined with the skip-connected up-sampling layers, extract the final image feature maps at three scales; the final image feature maps include full-size, 1/2-size and 1/4-size image features of the actually captured image.
3. The multi-view stereo network three-dimensional reconstruction method of claim 1, wherein performing the regularization operation on the depth cost volume to determine the depth map specifically comprises:
inputting the depth cost volume into a 3D UNet model and outputting a regularized depth cost volume; the 3D UNet model comprises a plurality of down-sampling and up-sampling 3D convolutional layers;
performing a Softmax operation along the depth direction of the regularized depth cost volume, computing the depth probability of each pixel in the regularized depth cost volume, and determining a depth probability volume containing the depth probability distribution information;
computing the weighted average of the hypothesized depths assigned to each pixel and the depth probability volume to determine the depth map.
4. The multi-view stereo network three-dimensional reconstruction method of claim 1, wherein, after generating the dense three-dimensional point cloud with the depth filtering and fusion script according to the final depth map, the method further comprises:
evaluating the dense three-dimensional point cloud using the DTU dataset and the Tanks & Temples dataset.
5. A multi-view stereoscopic network three-dimensional reconstruction system, comprising:
the image feature extraction module is used for acquiring a reference image of a target object to be reconstructed and a plurality of actually shot images and extracting image features of the actually shot images by using the multi-scale feature extraction module; the plurality of actual shot images are self-acquisition images obtained by performing surrounding shooting on the target object to be reconstructed; the multi-scale feature extraction module comprises a down-sampling encoder and an up-sampling decoder;
the depth cost body construction module is used for introducing similarity measurement, grouping the image features according to the feature similarity between the reference image and the actually shot image and constructing a depth cost body; the depth cost body construction module specifically comprises:
the characteristic similarity determining unit is used for dividing characteristic channels of a final image characteristic image into a plurality of groups and calculating the characteristic similarity of the characteristic image between the reference image and the actually shot image in each group of characteristic channels at a set depth plane;
a similarity tensor determining unit, configured to compress the final image feature map to a similarity tensor of multiple sets of feature channels based on the feature similarity in each set of feature channels; the set of the similarity tensors of the characteristic channels is a depth cost body;
a depth map determining module, configured to perform a regularization operation on the depth cost body and determine a depth map;
a rough-stage depth prediction interval determination module and a refinement-stage depth prediction interval determination module, configured to, based on a rough-to-fine depth inference strategy, divide the depth map with a fixed depth interval in the rough stage and determine the depth prediction interval of the rough stage, and, in the refinement stage, determine an adaptive depth interval using the uncertainty of the rough-stage depth prediction, divide the depth map with the adaptive depth interval, and determine the depth prediction interval of the refinement stage; determining the adaptive depth interval and the refinement-stage depth prediction interval specifically comprises:
obtaining the depth predicted in the rough stage and the set number of depth planes;
calculating the mean square error of the depth probability distribution of each pixel in the rough stage according to the depth predicted in the rough stage and the set number of depth planes;
calculating the uncertainty of the rough-stage depth prediction as the sum of, and the difference between, the predicted depth and the mean square error of the depth probability distribution;
acquiring the upper boundary and the lower boundary of the uncertainty of the rough-stage depth prediction;
determining the adaptive depth interval according to the upper boundary and the lower boundary, dividing the depth map by using the adaptive depth interval, and determining the depth prediction interval of the refinement stage;
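Concretely, the uncertainty interval is [d − σ, d + σ], where d is the rough-stage depth and σ derives from the variance (mean square error) of the depth probability distribution, and the refinement-stage hypotheses evenly subdivide this per-pixel interval. A minimal sketch under those assumptions (PyTorch; `adaptive_depth_range` and all shapes are illustrative):

```python
import torch

def adaptive_depth_range(prob_volume: torch.Tensor,
                         depth_values: torch.Tensor,
                         coarse_depth: torch.Tensor,
                         num_planes: int) -> torch.Tensor:
    """Adaptive per-pixel depth hypotheses for the refinement stage.

    prob_volume:  (B, D, H, W) depth probability body from the rough stage
    depth_values: (B, D)       rough-stage depth hypotheses
    coarse_depth: (B, H, W)    depth predicted in the rough stage
    returns:      (B, num_planes, H, W) refinement-stage depth hypotheses
    """
    hyp = depth_values[:, :, None, None]                            # (B, D, 1, 1)
    # Variance (mean square error) of the depth probability distribution.
    variance = torch.sum(prob_volume * (hyp - coarse_depth.unsqueeze(1)) ** 2, dim=1)
    sigma = variance.sqrt()                                         # per-pixel uncertainty
    low, high = coarse_depth - sigma, coarse_depth + sigma          # lower/upper boundary
    # Evenly divide the per-pixel [low, high] interval into num_planes planes.
    steps = torch.linspace(0, 1, num_planes, device=prob_volume.device)
    return low.unsqueeze(1) + (high - low).unsqueeze(1) * steps[None, :, None, None]
```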
a final depth map inference module, configured to infer, through a cascaded depth architecture, a final depth map with the same resolution as the reference image according to the depth prediction intervals of the rough stage and the refinement stage; the cascaded depth architecture comprises one rough stage and two refinement stages;
and a dense three-dimensional point cloud construction module, configured to generate a dense three-dimensional point cloud through a depth filtering fusion script according to the final depth map; the dense three-dimensional point cloud is used for displaying the target object to be reconstructed.
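The patent does not spell out the filtering script itself; a common choice in MVSNet-style pipelines is to keep only pixels that pass both a photometric confidence test and a geometric reprojection-consistency test before back-projecting them into the point cloud. A hedged sketch of that mask combination (NumPy; all names and thresholds are assumptions):

```python
import numpy as np

def fuse_mask(confidence: np.ndarray,
              reproj_err_px: np.ndarray,
              depth_rel_err: np.ndarray,
              conf_thresh: float = 0.8,
              px_thresh: float = 1.0,
              depth_thresh: float = 0.01) -> np.ndarray:
    """Combine photometric and geometric tests; only pixels passing
    both are fused into the dense point cloud.

    confidence:    (H, W) probability of the predicted depth per pixel
    reproj_err_px: (H, W) pixel distance after forward-backward reprojection
    depth_rel_err: (H, W) relative depth difference against a source view
    """
    photometric = confidence > conf_thresh
    geometric = (reproj_err_px < px_thresh) & (depth_rel_err < depth_thresh)
    return photometric & geometric
```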
6. The multi-view stereo network three-dimensional reconstruction system of claim 5, wherein the down-sampling encoder comprises batch normalization (BN) layers and convolutional layers with activation functions, and two convolutional layers with a stride of 2 and a 5x5 convolution kernel down-sample the actually shot image twice;
the up-sampling decoder comprises 2 up-sampling layers with skip connections and 4 convolutional layers for unifying the number of output channels;
an image matrix of the actually shot image is input; the encoder sequentially performs convolution operations to extract image feature maps at three scales, and the convolutional layers of the decoder, combined with the skip-connected up-sampling layers, sequentially extract the final image feature maps at three scales; the final image feature maps include full-size, 1/2-size, and 1/4-size image features of the actually shot image.
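A hedged sketch of such an encoder-decoder feature extractor (FPN-style), assuming PyTorch; the channel widths, the 1x1 lateral convolutions, and the name `MultiScaleFeatureNet` are illustrative assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1, k=3):
    # Convolution followed by BN and an activation, per the encoder description.
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

class MultiScaleFeatureNet(nn.Module):
    """Encoder-decoder emitting full-, 1/2- and 1/4-resolution features."""
    def __init__(self, base=8):
        super().__init__()
        self.enc0 = conv_bn_relu(3, base)                            # full resolution
        self.enc1 = conv_bn_relu(base, base * 2, stride=2, k=5)      # 1/2, 5x5 stride-2
        self.enc2 = conv_bn_relu(base * 2, base * 4, stride=2, k=5)  # 1/4, 5x5 stride-2
        self.lat1 = nn.Conv2d(base * 2, base * 4, 1)                 # skip connections
        self.lat0 = nn.Conv2d(base, base * 4, 1)
        self.out2 = nn.Conv2d(base * 4, base * 4, 3, padding=1)      # unify channels
        self.out1 = nn.Conv2d(base * 4, base * 2, 3, padding=1)
        self.out0 = nn.Conv2d(base * 4, base, 3, padding=1)

    def forward(self, x):
        f0 = self.enc0(x)
        f1 = self.enc1(f0)
        f2 = self.enc2(f1)
        # Decoder: upsample and merge with the skip-connected encoder features.
        p2 = f2
        p1 = self.lat1(f1) + F.interpolate(p2, scale_factor=2, mode="bilinear",
                                           align_corners=False)
        p0 = self.lat0(f0) + F.interpolate(p1, scale_factor=2, mode="bilinear",
                                           align_corners=False)
        # Final feature maps at full, 1/2 and 1/4 size of the input image.
        return self.out0(p0), self.out1(p1), self.out2(p2)
```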
7. The multi-view stereo network three-dimensional reconstruction system according to claim 5, wherein the depth map determining module specifically comprises:
a regularization unit, configured to input the depth cost body into a 3D UNet model and output the regularized depth cost body; the 3D UNet model comprises a plurality of downsampling and upsampling 3D convolutional layers;
a depth probability body determining unit, configured to perform a Softmax operation along the depth direction of the regularized depth cost body, calculate the depth probability of each pixel in the regularized depth cost body, and determine a depth probability body containing depth probability distribution information;
and a depth map determining unit, configured to calculate, for each pixel, the weighted average of the set depth hypotheses and the depth probability body to determine the depth map.
CN202110378393.XA 2021-04-08 2021-04-08 Multi-view stereo network three-dimensional reconstruction method and system Active CN113066168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110378393.XA CN113066168B (en) 2021-04-08 2021-04-08 Multi-view stereo network three-dimensional reconstruction method and system

Publications (2)

Publication Number Publication Date
CN113066168A CN113066168A (en) 2021-07-02
CN113066168B true CN113066168B (en) 2022-08-26

Family

ID=76566333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110378393.XA Active CN113066168B (en) 2021-04-08 2021-04-08 Multi-view stereo network three-dimensional reconstruction method and system

Country Status (1)

Country Link
CN (1) CN113066168B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724379B (en) * 2021-07-08 2022-06-17 中国科学院空天信息创新研究院 Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN113592913B (en) * 2021-08-09 2023-12-26 中国科学院深圳先进技术研究院 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
CN114463409B (en) * 2022-02-11 2023-09-26 北京百度网讯科技有限公司 Image depth information determining method and device, electronic equipment and medium
CN114820755B (en) * 2022-06-24 2022-10-04 武汉图科智能科技有限公司 Depth map estimation method and system
CN115082540B (en) * 2022-07-25 2022-11-15 武汉图科智能科技有限公司 Multi-view depth estimation method and device suitable for unmanned aerial vehicle platform
CN115170746B (en) * 2022-09-07 2022-11-22 中南大学 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
CN116258817B (en) * 2023-02-16 2024-01-30 浙江大学 Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN117058049B (en) * 2023-05-04 2024-01-09 广州图语信息科技有限公司 New view image synthesis method, synthesis model training method and storage medium
CN117218089B (en) * 2023-09-18 2024-04-19 中南大学 Asphalt pavement structure depth detection method
CN117036448B (en) * 2023-10-10 2024-04-02 深圳纷来智能有限公司 Scene construction method and system of multi-view camera

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109461180B (en) * 2018-09-25 2022-08-30 北京理工大学 Three-dimensional scene reconstruction method based on deep learning
CN110176060B (en) * 2019-04-28 2020-09-18 华中科技大学 Dense three-dimensional reconstruction method and system based on multi-scale geometric consistency guidance
CN110570522B (en) * 2019-08-22 2023-04-07 天津大学 Multi-view three-dimensional reconstruction method
CN111063021B (en) * 2019-11-21 2021-08-27 西北工业大学 Method and device for establishing three-dimensional reconstruction model of space moving target
CN111462329B (en) * 2020-03-24 2023-09-29 南京航空航天大学 Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
CN111353940B (en) * 2020-03-31 2021-04-02 成都信息工程大学 Image super-resolution reconstruction method based on deep learning iterative up-down sampling

Similar Documents

Publication Publication Date Title
CN113066168B (en) Multi-view stereo network three-dimensional reconstruction method and system
CN111652966B (en) Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle
US10353271B2 (en) Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
Flynn et al. Deepstereo: Learning to predict new views from the world's imagery
CN113345082B (en) Characteristic pyramid multi-view three-dimensional reconstruction method and system
CN111931787A (en) RGBD significance detection method based on feature polymerization
JP2023545199A (en) Model training method, human body posture detection method, apparatus, device and storage medium
CN113592026A (en) Binocular vision stereo matching method based on void volume and cascade cost volume
CN110570522A (en) Multi-view three-dimensional reconstruction method
CN112529794A (en) High dynamic range structured light three-dimensional measurement method, system and medium
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114937073A (en) Image processing method of multi-view three-dimensional reconstruction network model MA-MVSNet based on multi-resolution adaptivity
Liu et al. Deep learning based multi-view stereo matching and 3D scene reconstruction from oblique aerial images
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN113705796A (en) Light field depth acquisition convolutional neural network based on EPI feature enhancement
CN115359191A (en) Object three-dimensional reconstruction system based on deep learning
CN115546442A (en) Multi-view stereo matching reconstruction method and system based on perception consistency loss
CN115587987A (en) Storage battery defect detection method and device, storage medium and electronic equipment
CN110889868A (en) Monocular image depth estimation method combining gradient and texture features
Yu et al. A review of single image super-resolution reconstruction based on deep learning
CN116934972A (en) Three-dimensional human body reconstruction method based on double-flow network
CN114820901B (en) Large scene free viewpoint interpolation method based on neural network
Zhao et al. A survey for light field super-resolution
CN114648604A (en) Image rendering method, electronic device, storage medium and program product
CN116883524A (en) Image generation model training, image generation method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant