CN116912405A - Three-dimensional reconstruction method and system based on improved MVSNet - Google Patents

Three-dimensional reconstruction method and system based on improved MVSNet

Info

Publication number
CN116912405A
CN116912405A (application CN202310831912.2A)
Authority
CN
China
Prior art keywords: depth, image, MVSNet, feature, improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310831912.2A
Other languages
Chinese (zh)
Inventor
彭艳 (Peng Yan)
赖鸿伟 (Lai Hongwei)
周洋 (Zhou Yang)
瞿栋 (Qu Dong)
谢少荣 (Xie Shaorong)
蒲华燕 (Pu Huayan)
罗均 (Luo Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202310831912.2A
Publication of CN116912405A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems


Abstract

The invention discloses a multi-view three-dimensional reconstruction method and system based on an improved MVSNet, belonging to the technical field of three-dimensional reconstruction. The method comprises the following steps: inputting a reference image and source images into the improved MVSNet network and extracting image features through convolution layers and ECA modules to obtain feature maps of the reference and source images; obtaining feature volumes through differentiable homography transformation and aggregating them into a matching cost volume through a group similarity measurement module; regularizing the matching cost volume with a 3D convolution module to obtain a probability volume; performing depth regression on the probability volume to obtain a depth map; optimizing the depth map with a Gauss-Newton optimization module to obtain the final depth map; and fusing all obtained depth maps and applying coordinate transformation to obtain a dense point cloud and visualize the three-dimensional model. Compared with the traditional MVSNet method, this method substantially improves the feature expression capability of the images, and the generated three-dimensional point cloud has higher accuracy and completeness as well as stronger robustness in regions with weak or repeated textures.

Description

Three-dimensional reconstruction method and system based on improved MVSNet
Technical Field
The invention belongs to the technical field of three-dimensional reconstruction, and particularly relates to a three-dimensional reconstruction method and system based on an improved MVSNet.
Background
Three-dimensional reconstruction is the process of converting an object or scene in the real world into a three-dimensional model. It is an important technology in computer vision and computer graphics, with wide applications in fields such as virtual reality, augmented reality, game development, and digital cultural heritage preservation. Three-dimensional reconstruction usually requires acquiring and processing information about the object or scene by means of sensors, image processing algorithms, and computer graphics techniques. Common three-dimensional reconstruction techniques include point cloud reconstruction, multi-view three-dimensional reconstruction, structured-light reconstruction, and RGB-D reconstruction.
Multi-view stereo (MVS, Multiple View Stereo) refers to observing and imaging a scene from multiple viewpoints and using these images together with the corresponding camera parameters to perform stereo matching and depth estimation, recovering a three-dimensional representation of the scene. That is, a three-dimensional model is reconstructed from image information of multiple viewpoints, which typically involves image feature extraction, camera pose estimation, and three-dimensional point cloud reconstruction. Multi-view stereo reconstruction is widely applied in photogrammetry, three-dimensional reconstruction, virtual reality, and other fields.
Traditional multi-view three-dimensional reconstruction methods mainly use geometric and photometric consistency to compute a matching cost volume and then perform depth estimation. Traditional MVS reconstruction algorithms include COLMAP, Furu, and Gipuma, which recover a three-dimensional scene from structure from motion, but they suffer from long running times, low reconstruction accuracy in textureless, weak-texture, and repeated-texture regions, and poor robustness. With the development of deep learning, multi-view three-dimensional reconstruction algorithms based on deep learning can extract image features with convolutional neural networks, learn global semantic information and fully exploit its context, markedly improve reconstruction accuracy and completeness on repeated textures, weak textures, and non-Lambertian surfaces, and generalize well when trained on large datasets, making them applicable to a variety of scenes. Deep-learning-based multi-view three-dimensional reconstruction has therefore become a research focus for many scholars.
MVSNet, proposed by YAO et al. in 2018, was the first end-to-end deep-learning framework for multi-view three-dimensional reconstruction. The network takes a reference image and several source images as input and obtains a depth map after feature extraction, matching cost volume construction, cost volume regularization, and depth regression, but its cost volume regularization consumes excessive GPU memory. YAO et al. therefore proposed R-MVSNet in 2019 as an improvement on MVSNet, adding a recurrent neural network for cost volume regularization; this reduces memory consumption without lowering the accuracy and completeness of the result, at the price of longer running time. Point-MVSNet then proposed operating directly on the point cloud, first converting a coarsely generated depth map into a point cloud to guide depth refinement. Cascade-MVSNet proposed a cascade scheme in which the depth map predicted from a low-resolution image guides, via upsampling, the generation of the high-resolution depth map. PatchMatchNet introduced the PatchMatch idea into multi-view three-dimensional reconstruction, using propagation to let each pixel in the image probe the depth values of neighboring pixels on the same object surface, reducing running time while improving the accuracy and completeness of the result. CVP-MVSNet currently achieves the highest accuracy among deep-learning networks, but its model is complex, and constructing the matching cost volume with an image pyramid makes computation expensive and reconstruction inefficient. The current direction of deep-learning-based multi-view three-dimensional reconstruction is mainly to reduce the computational and memory cost of 3D convolution, optimize the network structure, and improve the accuracy and completeness of the reconstructed three-dimensional model under limited resources.
Disclosure of Invention
Aiming at the problems and shortcomings of the prior art, the invention provides a three-dimensional reconstruction method and system based on an improved MVSNet in order to improve the accuracy and completeness of depth estimation.
In order to achieve the aim of the invention, the technical scheme adopted by the invention is as follows:
the invention provides a three-dimensional reconstruction method based on an improved MVSNet, which comprises the following steps:
step 1, acquiring a set of N multi-view images of the same object in the same scene, and selecting one of the N images as the reference image and the remaining (N-1) images as source images;
step 2, inputting the reference image and the source images into the improved MVSNet network, and extracting features from each image to obtain the feature maps of the reference and source images;
step 3, performing differentiable homography transformation on the feature maps over hypothesis planes at different depths to obtain feature volumes; dividing the obtained feature volumes into G channel groups, calculating the group similarity through a group similarity measurement module, and obtaining a matching cost volume based on the group similarity;
step 4, regularizing the matching cost volume with 3D convolution to obtain a probability volume;
step 5, calculating the depth expectation of the probability volume along the depth direction to obtain an initially estimated depth map;
step 6, optimizing the initially estimated depth map with a Gauss-Newton optimization method to obtain an optimized depth map;
and step 7, fusing the optimized depth maps, obtaining a dense point cloud through coordinate transformation, and visualizing the dense point cloud as a three-dimensional model to obtain the three-dimensional reconstruction model.
According to the improved MVSNet-based three-dimensional reconstruction method, preferably, the specific operation of feature extraction for each image is: performing feature extraction on the input source images and reference image through 8 convolution layers and 2 ECA modules to obtain the feature maps of the reference and source images.
According to the three-dimensional reconstruction method based on the improved MVSNet, preferably, the feature maps are subjected to differentiable homography transformation over hypothesis planes at different depths to obtain the feature volumes, the specific process being as follows: with planes perpendicular to the main optical axis of the reference camera as depth hypothesis planes, the feature maps are warped onto the different frontal planes of the reference camera, mapping all pixel coordinates on the source images to pixel coordinates on the reference image, from the minimum depth $d_{min}$ to the maximum depth $d_{max}$ at the depth hypothesis interval, thereby obtaining the feature volumes.
According to the improved MVSNet-based three-dimensional reconstruction method, preferably, the warping process is as shown in formula (1):

$$p_{i,j} = K_i \left( R_{ref,i} \left( K_{ref}^{-1} \, p \, d_j \right) + t_{ref,i} \right) \tag{1}$$

where $K_i$ and $K_{ref}$ are the camera intrinsic parameters of the source and reference images respectively, $R_{ref,i}$ is the corresponding rotation matrix, $t_{ref,i}$ is the corresponding translation vector, $d_j$ is the corresponding sampling depth value, and $p_{i,j}$ denotes the coordinates of pixel $p$ in source image $I_i$.
According to the improved MVSNet-based three-dimensional reconstruction method, preferably, the group similarity is obtained by formula (2):

$$S^g = \frac{G}{C} \left\langle f_{ref}^{\,g}, \; f_i^{\,g} \right\rangle \tag{2}$$

where $g$ takes values in $[0, G-1]$; $f_{ref}^{\,g}$ is the $g$-th channel group of the reference feature volume; $f_i^{\,g}$ is the $g$-th channel group of the $i$-th source feature volume; and $\langle \cdot , \cdot \rangle$ is the inner product operation.
According to the improved MVSNet-based three-dimensional reconstruction method, preferably, the depth expectation in step 5 is obtained by formula (3):

$$D = \sum_{d = d_{min}}^{d_{max}} d \times P(d) \tag{3}$$

where $P(d)$ denotes the estimated probability of each pixel at hypothesized depth $d$, and $[d_{min}, d_{max}]$ is the range of sampling depths.
According to the improved MVSNet-based three-dimensional reconstruction method, preferably, the Gauss-Newton optimization method in step 6 includes: performing feature extraction on the reference image and the source images with 2D convolution to obtain feature maps; denoting the depth map predicted by the initial estimation as $\hat{D}$ and the depth value of pixel $p$ as $\hat{D}(p)$; projecting each point $p$ on the reference image onto each source image at $p'_i$; and computing and minimizing the difference between the feature map values at $p$ and $p'_i$, with the objective function shown in formula (4):

$$E(p) = \sum_{i=1}^{N-1} \left\| F_i(p'_i) - F_0(p) \right\|^2 \tag{4}$$

where $F_i$ is the feature map of the $i$-th source image, $F_0$ is the feature map of the reference image, and $p'_i$ is the projection of pixel $p$ onto the $i$-th source image.
According to the improved MVSNet-based three-dimensional reconstruction method, preferably, $p'_i$ is obtained from formula (5):

$$p'_i = K_i \left( R_i \left( K_0^{-1} \, p \, \hat{D}(p) \right) + t_i \right) \tag{5}$$

where $K_i$, $R_i$, and $t_i$ respectively denote the camera intrinsic parameters, rotation matrix, and translation vector of the corresponding image.
According to the improved MVSNet-based three-dimensional reconstruction method, preferably, the optimized depth map in step 6 is obtained by formula (6):

$$\hat{D}(p)^{*} = \hat{D}(p) + \delta \tag{6}$$

where $\hat{D}(p)^{*}$ denotes the optimized depth value, $\hat{D}(p)$ denotes the depth value of pixel $p$ in the initially estimated depth map, and $\delta$ denotes the increment of the current depth.
According to the improved MVSNet-based three-dimensional reconstruction method, preferably, the increment $\delta$ of the current depth is calculated by formula (7):

$$\delta = -(J^T J)^{-1} J^T r \tag{7}$$

where $J$ is the stacked Jacobian matrix $[J_1(p), \dots, J_{N-1}(p)]^T$ and $r$ is the stacked residual vector $[r_1(p), \dots, r_{N-1}(p)]^T$.
A second aspect of the present invention provides a three-dimensional reconstruction system based on an improved MVSNet, comprising:
the data acquisition module is used for acquiring an image set corresponding to a target object or scene;
the data processing module is used for inputting the image set obtained by the data acquisition module into a neural network model to obtain depth maps, wherein the neural network model is the improved MVSNet network according to the first aspect of the invention;
the three-dimensional reconstruction module is used for fusing all the depth maps obtained by the data processing module through a preset algorithm and coordinate transformation to generate a three-dimensional point cloud of the target object or scene, so as to obtain a three-dimensional model.
According to the improved MVSNet-based three-dimensional reconstruction system, preferably, the data processing module comprises a feature extraction unit, a matching cost volume construction unit, a matching cost volume regularization unit, a depth regression unit, and a depth map optimization unit, which are as follows:
the feature extraction unit is used for extracting features from the input source images and reference image, processing each image through 2D convolution and ECA modules to obtain the feature maps of the source and reference images;
the matching cost volume construction unit is used for further processing the feature maps obtained by the feature extraction unit and constructing the matching cost volume: the feature volumes are first obtained through differentiable homography transformation and then aggregated into the final matching cost volume by the group similarity measurement module;
the matching cost volume regularization unit is used for removing the noise that the matching cost volume obtained by the construction unit contains due to non-Lambertian surfaces, occlusion, and other causes, regularizing the matching cost volume through a four-scale 3D-UNet to obtain the probability volume;
the depth regression unit is used for performing depth regression on the probability volume obtained by the regularization unit to obtain the initially estimated depth map;
and the depth map optimization unit is used for further optimizing the initially estimated depth map obtained by the depth regression unit, iteratively refining it with the Gauss-Newton optimization module to generate a finer depth map.
Compared with the prior art, the invention has the following beneficial effects:
1. By improving the traditional MVSNet network architecture, the invention provides a three-dimensional reconstruction method based on an improved MVSNet. To improve the completeness and accuracy of the reconstruction result, the method fuses ECA (Efficient Channel Attention) modules into the feature extraction stage, extracting image features through 8 convolution layers and 2 ECA modules, which learns channel attention more effectively and improves the feature expression capability.
2. In the invention, a group similarity measurement module is added during matching cost volume construction to build a lightweight cost volume and reduce memory consumption. In addition, the invention adds a Gauss-Newton optimization module to optimize the initially estimated depth map into a better-quality depth map. The three-dimensional model reconstructed by the method was tested on the DTU dataset, and compared with the three-dimensional point cloud reconstructed by the traditional MVSNet network, the point cloud reconstructed by the method has higher accuracy and completeness.
Drawings
Fig. 1 is a schematic diagram of a network structure of a modified MVSNet;
FIG. 2 is a schematic diagram of an ECA module;
FIG. 3 is a schematic diagram of a Gaussian Newton optimization module;
fig. 4 is a schematic structural diagram of the three-dimensional reconstruction system based on the improved MVSNet of the present invention.
Detailed Description
The following examples serve only to further illustrate the invention. It should be noted that, unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Experimental methods in the following examples for which specific conditions are not given are conventional in the art or follow the conditions suggested by the manufacturer; reagents or apparatus whose manufacturer is not indicated are conventional, commercially available products.
In order to enable those skilled in the art to more clearly understand the technical scheme of the present invention, the technical scheme of the present invention will be described in detail with reference to specific embodiments.
Example 1
In order to improve the accuracy and completeness of the three-dimensional point cloud generated in multi-view three-dimensional reconstruction, this embodiment provides a three-dimensional reconstruction method based on an improved MVSNet, whose network structure is shown in FIG. 1. The method mainly comprises feature extraction, matching cost volume construction, matching cost volume regularization, depth regression, depth map optimization, conversion to a three-dimensional point cloud, and three-dimensional model visualization. The three-dimensional reconstruction method based on the improved MVSNet comprises the following steps:
Step 1, acquiring a set of multi-view images of the same object in the same scene, and selecting from it one reference image and (N-1) source images; the specific process is as follows:
N images are selected in total, each of height H and width W. One of them is taken as the reference image, denoted $I_0$; the remaining (N-1) images are taken as source images, denoted $\{I_i\}_{i=1}^{N-1}$; together they form the input to the network.
Step 2, inputting the reference image and the source image into an improved MVSNet network, and extracting the characteristics of each image through an ECA module and a convolution layer to obtain characteristic images of all the images; the specific process is as follows:
the size of the input source and reference images is 3 XH W, and the final output is of size through 8 convolution layers and 2 ECA modules (as shown in FIG. 2)The weight is shared among the convolution layers, the convolution layers of the third layer and the sixth layer are downsampled to be one half of the original height and width, and the ECA module is used for learning the channel attention after downsampling. After learning the attention of the channel, the feature extraction is further carried out, so that the features can be better aggregated, and the expression capability of the features is improved.
In general, the development of attention modules has followed two directions: strategies that enhance feature aggregation, and combinations of channel and spatial attention. Adding an attention mechanism, however, usually complicates the model, increases its computation, and raises the computational cost. The ECA module is an extremely lightweight channel attention module that learns channel attention efficiently and at low cost, so it adds only a small number of parameters while accelerating convergence. In this embodiment, ECA modules are added to the feature extraction part to improve the expressive power of the source and reference feature maps, which benefits subsequent steps such as matching cost volume construction.
The ECA module is implemented as follows: after aggregating the convolution features using global average pooling (GAP) without dimensionality reduction, the ECA module adaptively determines the kernel size k, performs a 1D convolution, and applies a Sigmoid function to learn channel attention. The ECA module does not change the size of the input and output tensors; its structure is shown in FIG. 2. Through the 1D convolution, the ECA module realizes a local cross-channel interaction strategy without dimensionality reduction: avoiding channel dimensionality reduction helps learn effective channel attention, since reducing the channel dimension would destroy the correspondence between channels and weights. Cross-channel interaction is achieved by the 1D convolution fusing the channel features of a neighborhood, with all channels sharing the same learnable parameters. To avoid the significant computational resources that manually tuning k through cross-validation would consume, the kernel size k, which represents the coverage of local cross-channel interaction, is selected adaptively from the channel count.
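As an illustration, a minimal PyTorch sketch of the ECA module described above follows. The adaptive kernel-size rule (deriving k from the channel count C with constants gamma = 2 and b = 1) follows the original ECA-Net paper and is an assumption here, since this description states only that k is chosen adaptively:

    import math
    import torch
    import torch.nn as nn

    class ECA(nn.Module):
        """Efficient Channel Attention: global average pooling -> 1D conv
        (no dimensionality reduction) -> Sigmoid gate. Input and output
        tensors have identical shapes."""
        def __init__(self, channels: int, gamma: int = 2, b: int = 1):
            super().__init__()
            # Adaptive kernel size (constants assumed from the ECA-Net paper).
            t = int(abs(math.log2(channels) / gamma + b / gamma))
            k = t if t % 2 else t + 1
            self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, C, H, W)
            y = x.mean(dim=(2, 3))            # GAP over H, W -> (B, C)
            y = self.conv(y.unsqueeze(1))     # local cross-channel interaction
            w = torch.sigmoid(y).squeeze(1)   # (B, C) channel weights
            return x * w.unsqueeze(-1).unsqueeze(-1)

Because the gate only rescales channels, the module can be inserted after any convolution layer without changing tensor shapes.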
The basic structure of the entire feature extraction network is shown in Table 1; the final output reference feature map is denoted $f_0$, and the source feature maps are denoted $\{f_i\}_{i=1}^{N-1}$.
TABLE 1 Basic structure of the feature extraction network
Compared with the original design, feature aggregation in the feature maps extracted by the 8 convolution layers and 2 ECA modules is greatly enhanced, bringing a clear performance gain while adding almost no model complexity.
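To make the arrangement concrete, the sketch below wires 8 convolution layers with stride-2 downsampling at the third and sixth layers, each downsampling followed by an ECA module (the ECA class from the sketch above). The channel widths and the BatchNorm/ReLU choices are assumptions, since Table 1 is not reproduced here:

    import torch.nn as nn

    def conv_bn_relu(cin: int, cout: int, stride: int = 1) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True))

    class FeatureExtractor(nn.Module):
        """8 convolution layers + 2 ECA modules; the output feature map is
        at one quarter of the input resolution (H/4 x W/4)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                conv_bn_relu(3, 8), conv_bn_relu(8, 8),
                conv_bn_relu(8, 16, stride=2),   # 3rd conv: H/2 x W/2
                ECA(16),                         # channel attention
                conv_bn_relu(16, 16), conv_bn_relu(16, 16),
                conv_bn_relu(16, 32, stride=2),  # 6th conv: H/4 x W/4
                ECA(32),
                conv_bn_relu(32, 32), conv_bn_relu(32, 32))

        def forward(self, x):
            return self.net(x)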
Step 3, performing differentiable homography transformation on the feature maps extracted in step 2 over hypothesis planes at different depths and computing the matching cost at each depth to obtain the feature volumes; uniformly dividing the feature volumes into G channel groups, calculating the group similarity through the group similarity measurement module, and constructing and aggregating the matching cost volume based on the group similarity. The specific process is as follows:
According to the principle of the plane-sweep algorithm, with planes perpendicular to the main optical axis of the reference camera as depth hypothesis planes, all feature maps are warped onto the different frontal planes of the reference camera, mapping all pixel coordinates on the source images to pixel coordinates on the reference image, from the minimum depth $d_{min}$ to the maximum depth $d_{max}$ at the depth hypothesis interval. This warping process is called differentiable homography, and the corresponding calculation is given by formula (1):

$$p_{i,j} = K_i \left( R_{ref,i} \left( K_{ref}^{-1} \, p \, d_j \right) + t_{ref,i} \right) \tag{1}$$

where $K_i$ and $K_{ref}$ are the camera intrinsic parameters of the source and reference images respectively, $R_{ref,i}$ is the corresponding rotation matrix, $t_{ref,i}$ is the corresponding translation vector, $d_j$ is the corresponding sampling depth value, and $p_{i,j}$ denotes the coordinates of pixel $p$ in source image $I_i$.
By warping the pixel coordinates on all source images to coordinates on the reference image with formula (1), the depth features of each source image at every sampling depth $d_j$ are obtained, and the depth feature maps of all source images form the feature volumes, denoted $\{V_i\}_{i=1}^{N-1}$. Each feature volume at this point has dimension $C \times D \times H \times W$, where $C$ denotes the channel dimension and $D$ the depth dimension. It is uniformly divided into $G$ groups along the channel dimension, each group containing $C/G$ channels, and the group similarity between the reference feature map $f_0$ and the feature volume $V_i$ is calculated by formula (2):

$$S_i^g = \frac{G}{C} \left\langle f_0^{\,g}, \; V_i^{\,g} \right\rangle \tag{2}$$

where $g$ takes values in $[0, G-1]$; $f_0^{\,g}$ is the $g$-th channel group of the reference feature map; $V_i^{\,g}$ is the $g$-th channel group of the $i$-th feature volume; and $\langle \cdot , \cdot \rangle$ is the inner product operation.
The similarity volumes output by the group similarity measurement module have dimension $G \times D \times H \times W$, and the $N-1$ source images yield $N-1$ of them, denoted $\{S_i\}_{i=1}^{N-1}$. To adapt to an arbitrary number of input source images, the similarity volumes are averaged, and the final matching cost volume after aggregation is calculated by formula (3):

$$M = \frac{1}{N-1} \sum_{i=1}^{N-1} S_i \tag{3}$$
in this embodiment, g=8 is set, and a lightweight cost body of 8 channels can be obtained, which can greatly reduce memory consumption.
Step 4, regularizing the matching cost volume through 3D convolution to obtain the probability volume; the specific process is as follows:
The aggregated matching cost volume contains noise caused by non-Lambertian surfaces, occlusion, and other factors, and is regularized by 3D convolution to obtain the probability volume. The probability volume can be used not only for pixel-wise depth prediction but also to measure the confidence of the estimate. In this embodiment, a network containing a four-scale 3D-UNet encodes and decodes the original matching cost volume and finally compresses the number of channels to 1, after which the probability values of the probability volume are normalized by a SoftMax operation along the depth direction. The regularized probability volume has dimension $D \times \frac{H}{4} \times \frac{W}{4}$.
Step 5, calculating the depth expectation of the probability volume along the depth direction to obtain the initially estimated depth map; the specific process is as follows:
A soft-argmin operation regresses, from the regularized probability volume, the probability of each pixel at each depth $d$, giving the confidence of every pixel of the reference image along the depth direction; the higher the confidence, the more likely the pixel lies at that depth, which yields the initial depth value of the corresponding pixel. The output depth map has size $\frac{H}{4} \times \frac{W}{4}$. Meanwhile, to avoid gaps and non-smoothness along the depth direction, the whole depth map is smoothed by using the depth expectation as the depth estimate of each pixel; the depth expectation $D$ is calculated by formula (4):

$$D = \sum_{d = d_{min}}^{d_{max}} d \times P(d) \tag{4}$$

where $P(d)$ denotes the estimated probability of each pixel at depth $d$, and $[d_{min}, d_{max}]$ is the range of depth values.
Step 6, optimizing the depth map obtained in step 5 through the Gauss-Newton optimization module to obtain the optimized depth map; a schematic diagram of the Gauss-Newton optimization module is shown in FIG. 3. The specific process is as follows:
Feature extraction is performed on the original reference image and the source images through 2D convolution to obtain the corresponding feature maps $\{F_i\}_{i=0}^{N-1}$. The depth map predicted by the initial estimation obtained after depth regression is denoted $\hat{D}$, and the depth value of pixel $p$ is denoted $\hat{D}(p)$. Each point $p$ on the reference image is projected to the point $p'_i$ on each source image, and the difference of the feature map values at $p$ and $p'_i$ is computed and minimized; the corresponding objective function is shown in formula (5):

$$E(p) = \sum_{i=1}^{N-1} \left\| F_i(p'_i) - F_0(p) \right\|^2 \tag{5}$$

where $F_i$ is the feature map of the $i$-th source image, $F_0$ is the feature map of the reference image, and $p'_i$ is the projection of pixel $p$ onto the $i$-th source image, calculated by formula (6):

$$p'_i = K_i \left( R_i \left( K_0^{-1} \, p \, \hat{D}(p) \right) + t_i \right) \tag{6}$$

where $K_i$, $R_i$, and $t_i$ respectively denote the camera intrinsic parameters, rotation matrix, and translation vector of the corresponding image.
Specifically, starting from $\hat{D}(p)$, the residual $r_i(p)$ of pixel $p$ in each source image is calculated; the residual is described by formula (7):

$$r_i(p) = F_i(p'_i) - F_0(p) \tag{7}$$

For each residual $r_i(p)$, the corresponding Jacobian with respect to the depth is calculated by formula (8):

$$J_i(p) = \frac{\partial r_i(p)}{\partial \hat{D}(p)} \tag{8}$$

Finally, the increment $\delta$ of the current depth is obtained as shown in formula (9):

$$\delta = -(J^T J)^{-1} J^T r \tag{9}$$

where $J$ is the stacked Jacobian matrix $[J_1(p), \dots, J_{N-1}(p)]^T$ and $r$ is the stacked residual vector $[r_1(p), \dots, r_{N-1}(p)]^T$, from which the optimized depth map is calculated as described by formula (10):

$$\hat{D}(p)^{*} = \hat{D}(p) + \delta \tag{10}$$
step 7, fusing the optimized depth map, obtaining a dense point cloud through coordinate transformation, and performing three-dimensional model visualization on the dense point cloud to obtain a three-dimensional reconstruction model; the specific process is as follows: and converting each pixel point on the optimized depth map from a pixel coordinate system to a camera coordinate system, wherein the camera coordinate system is a world coordinate system, and the constraint of coordinate transformation is a camera parameter. The conversion relationship is shown in the formula (11):
$$z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/dx & 0 & u_0 \\ 0 & f/dy & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \tag{11}$$

where $u$ and $v$ are arbitrary coordinates in the image coordinate system; $f/dx$ and $f/dy$ are the focal lengths of the camera along the x and y directions; $u_0$ and $v_0$ are the image center coordinates; $x_w$, $y_w$, $z_w$ denote a three-dimensional point in the world coordinate system; $z_c$ is the z-axis value of the camera coordinates, i.e. the distance (depth value) of the object from the camera; and $R$ and $T$ are the $3 \times 3$ rotation matrix and $3 \times 1$ translation vector of the extrinsic matrix.
Since the world coordinate origin and the camera origin coincide, i.e. there is no rotation or translation, the extrinsic matrix is set here to

$$\begin{bmatrix} R & T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
and converting the optimized depth map into dense point cloud through coordinate transformation, storing the dense point cloud into a ply format file, and visualizing a three-dimensional model of the stored three-dimensional point cloud by using MeshLab software.
The three-dimensional point cloud model generated by the improved MVSNet-based three-dimensional reconstruction method was evaluated by comparing the accuracy and completeness of the generated point cloud. A total of 8 scans (Scan1, Scan4, Scan9, Scan10, Scan11, Scan12, Scan13, Scan15) were selected from the validation set of the DTU dataset; the experimental results are shown in Table 2.
TABLE 2 precision results
Network Model Accuracy/mm Completeness/mm Overall/mm
MVSNet 0.3969 0.3767 0.3868
Improved MVSNet 0.3248 0.3426 0.3337
As can be seen from Table 2, compared with the original MVSNet, the three-dimensional point cloud model generated by the improved MVSNet-based reconstruction method improves accuracy by 22.23% and completeness by 9.96%, and the overall score combining point cloud reconstruction accuracy and completeness improves by 15.92%.
In summary, the three-dimensional reconstruction method based on the improved MVSNet enables the network to achieve higher accuracy and completeness while suppressing growth in model complexity. The model was trained and tested on the DTU dataset, and experiments verify that the improved MVSNet-based three-dimensional reconstruction method improves the reconstruction results.
Example 2
This embodiment provides a three-dimensional reconstruction system based on the improved MVSNet, whose structural diagram is shown in FIG. 4. The system comprises a data acquisition module, a data processing module, and a three-dimensional reconstruction module, which are as follows:
the data acquisition module is used for acquiring an image set corresponding to a target object or scene;
the data processing module is used for processing the obtained image set of the target object or scene, inputting the reference image and source images into the neural network model to obtain depth maps, wherein the neural network model is the improved MVSNet model described in Example 1; specifically, it is obtained by introducing the ECA efficient channel attention module, the group similarity measurement module, and the Gauss-Newton optimization module into the original network;
the three-dimensional reconstruction module is used for fusing all the obtained depth maps through a preset algorithm and coordinate transformation to generate a three-dimensional point cloud of the target object or scene, thereby obtaining a three-dimensional model and realizing three-dimensional reconstruction of the corresponding target or scene.
The data processing module comprises a feature extraction unit, a matching cost volume construction unit, a matching cost volume regularization unit, a depth regression unit, and a depth map optimization unit, which are specifically as follows:
the feature extraction unit is used for extracting features from the input source images and reference image, processing each image through 2D convolution and ECA modules to obtain the feature maps of the source and reference images;
the matching cost volume construction unit is used for further processing the feature maps obtained by the feature extraction unit and constructing the matching cost volume: the feature volumes are first obtained through differentiable homography transformation and then aggregated into the final matching cost volume by the group similarity measurement module;
the matching cost volume regularization unit is used for removing the noise that the matching cost volume obtained by the construction unit contains due to non-Lambertian surfaces, occlusion, and other causes, regularizing the matching cost volume through a four-scale 3D-UNet to obtain the probability volume;
the depth regression unit is used for performing depth regression on the probability volume obtained by the regularization unit to obtain the initially estimated depth map;
and the depth map optimization unit is used for further optimizing the initially estimated depth map obtained by the depth regression unit, iteratively refining it with the Gauss-Newton optimization module to generate a finer depth map.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the above-described system, which is not described herein again.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (10)

1. The three-dimensional reconstruction method based on the improved MVSNet is characterized by comprising the following steps of:
step 1, acquiring a set of N multi-view images of the same object in the same scene, and selecting one of the N images as the reference image and the remaining (N-1) images as source images;
step 2, inputting the reference image and the source images into the improved MVSNet network, and extracting features from each image to obtain the feature maps of the reference and source images;
step 3, performing differentiable homography transformation on the feature maps over hypothesis planes at different depths to obtain feature volumes; dividing the obtained feature volumes into G channel groups, calculating the group similarity through a group similarity measurement module, and obtaining a matching cost volume based on the group similarity;
step 4, regularizing the matching cost volume with 3D convolution to obtain a probability volume;
step 5, calculating the depth expectation of the probability volume along the depth direction to obtain an initially estimated depth map;
step 6, optimizing the initially estimated depth map with a Gauss-Newton optimization method to obtain an optimized depth map;
and step 7, fusing the optimized depth maps, obtaining a dense point cloud through coordinate transformation, and visualizing the dense point cloud as a three-dimensional model to obtain the three-dimensional reconstruction model.
2. The improved MVSNet-based three-dimensional reconstruction method according to claim 1, wherein the specific operation of feature extraction for each image is: performing feature extraction on the input source images and reference image through 8 convolution layers and 2 ECA modules to obtain the feature maps of the reference and source images.
3. The improved MVSNet-based three-dimensional reconstruction method of claim 1, wherein the feature maps are subjected to differentiable homography over hypothesis planes at different depths to obtain the feature volumes, the specific process being as follows: with planes perpendicular to the main optical axis of the reference camera as depth hypothesis planes, the feature maps are warped onto the different frontal planes of the reference camera, mapping all pixel coordinates on the source images to pixel coordinates on the reference image, from the minimum depth $d_{min}$ to the maximum depth $d_{max}$ at the depth hypothesis interval, thereby obtaining the feature volumes.
4. The improved MVSNet-based three-dimensional reconstruction method according to claim 3, wherein the warping process is as shown in formula (1):

$$p_{i,j} = K_i \left( R_{ref,i} \left( K_{ref}^{-1} \, p \, d_j \right) + t_{ref,i} \right) \tag{1}$$

where $K_i$ and $K_{ref}$ are the camera intrinsic parameters of the source and reference images respectively, $R_{ref,i}$ is the corresponding rotation matrix, $t_{ref,i}$ is the corresponding translation vector, $d_j$ is the corresponding sampling depth value, and $p_{i,j}$ denotes the coordinates of pixel $p$ in source image $I_i$.
5. The improved MVSNet-based three-dimensional reconstruction method according to claim 4, wherein the group similarity is obtained by formula (2):

$$S^g = \frac{G}{C} \left\langle f_{ref}^{\,g}, \; f_i^{\,g} \right\rangle \tag{2}$$

where $g$ takes values in $[0, G-1]$; $f_{ref}^{\,g}$ is the $g$-th channel group of the reference feature volume; $f_i^{\,g}$ is the $g$-th channel group of the $i$-th source feature volume; and $\langle \cdot , \cdot \rangle$ is the inner product operation.
6. The improved MVSNet-based three-dimensional reconstruction method according to claim 5, wherein the depth expectation in step 5 is obtained by formula (3):

$$D = \sum_{d = d_{min}}^{d_{max}} d \times P(d) \tag{3}$$

where $P(d)$ denotes the estimated probability of each pixel at hypothesized depth $d$, and $[d_{min}, d_{max}]$ is the range of sampling depths.
7. The improved MVSNet-based three-dimensional reconstruction method according to claim 5, wherein the Gauss-Newton optimization method in step 6 comprises: performing feature extraction on the reference image and the source images with 2D convolution to obtain feature maps; denoting the depth map predicted by the initial estimation as $\hat{D}$ and the depth value of pixel $p$ as $\hat{D}(p)$; projecting each point $p$ on the reference image onto each source image at $p'_i$; and computing and minimizing the difference between the feature map values at $p$ and $p'_i$, with the objective function shown in formula (4):

$$E(p) = \sum_{i=1}^{N-1} \left\| F_i(p'_i) - F_0(p) \right\|^2 \tag{4}$$

where $F_i$ is the feature map of the $i$-th source image, $F_0$ is the feature map of the reference image, and $p'_i$ is the projection of pixel $p$ onto the $i$-th source image.
8. The improved MVSNet-based three-dimensional reconstruction method according to claim 5, 6, or 7, wherein the optimized depth map in step 6 is obtained by formula (6):

$$\hat{D}(p)^{*} = \hat{D}(p) + \delta \tag{6}$$

where $\hat{D}(p)^{*}$ denotes the optimized depth value, $\hat{D}(p)$ denotes the depth value of pixel $p$ in the initially estimated depth map, and $\delta$ denotes the increment of the current depth.
9. A three-dimensional reconstruction system based on an improved MVSNet, comprising:
the data acquisition module is used for acquiring an image set corresponding to a target object or scene;
the data processing module is used for inputting the image set obtained by the data obtaining module into a neural network model to obtain a depth map, wherein the neural network model is the improved MVSNet network according to any one of claims 1-8;
the three-dimensional reconstruction module is used for fusing all the depth maps obtained by the data processing module through a preset algorithm and coordinate transformation to generate a three-dimensional point cloud of the target object or scene, so as to obtain a three-dimensional model.
10. The improved MVSNet based three dimensional reconstruction system of claim 9, wherein the data processing module comprises:
the feature extraction unit, used for extracting features from the input source images and reference image, processing each image through 2D convolution and ECA modules to obtain the feature maps of the source and reference images;
the matching cost volume construction unit, used for further processing the feature maps obtained by the feature extraction unit and constructing the matching cost volume: the feature volumes are first obtained through differentiable homography transformation and then aggregated into the final matching cost volume by the group similarity measurement module;
the matching cost volume regularization unit, used for removing the noise that the matching cost volume contains due to non-Lambertian surfaces and occlusion, regularizing the matching cost volume through a four-scale 3D-UNet to obtain the probability volume;
the depth regression unit, used for performing depth regression on the probability volume obtained by the matching cost volume regularization unit to obtain the initially estimated depth map;
and the depth map optimization unit, used for further optimizing the initially estimated depth map obtained by the depth regression unit, iteratively refining it with the Gauss-Newton optimization module to generate a finer depth map.
CN202310831912.2A 2023-07-07 2023-07-07 Three-dimensional reconstruction method and system based on improved MVSNet Pending CN116912405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310831912.2A CN116912405A (en) 2023-07-07 2023-07-07 Three-dimensional reconstruction method and system based on improved MVSNet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310831912.2A CN116912405A (en) 2023-07-07 2023-07-07 Three-dimensional reconstruction method and system based on improved MVSNet

Publications (1)

Publication Number Publication Date
CN116912405A 2023-10-20

Family

ID=88364057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310831912.2A Pending CN116912405A (en) 2023-07-07 2023-07-07 Three-dimensional reconstruction method and system based on improved MVSNet

Country Status (1)

Country Link
CN (1) CN116912405A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689813A (en) * 2023-12-08 2024-03-12 华北电力大学(保定) Infrared three-dimensional modeling method and system for high-precision power transformer of transformer substation
CN117671163A (en) * 2024-02-02 2024-03-08 苏州立创致恒电子科技有限公司 Multi-view three-dimensional reconstruction method and system
CN117671163B (en) * 2024-02-02 2024-04-26 苏州立创致恒电子科技有限公司 Multi-view three-dimensional reconstruction method and system
CN117911631A (en) * 2024-03-19 2024-04-19 广东石油化工学院 Three-dimensional reconstruction method based on heterogeneous image matching
CN117911631B (en) * 2024-03-19 2024-05-28 广东石油化工学院 Three-dimensional reconstruction method based on heterogeneous image matching

Similar Documents

Publication Publication Date Title
Wang et al. Patchmatchnet: Learned multi-view patchmatch stereo
CN109410307B (en) Scene point cloud semantic segmentation method
Wynn et al. Diffusionerf: Regularizing neural radiance fields with denoising diffusion models
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN115690324A (en) Neural radiation field reconstruction optimization method and device based on point cloud
CN116912405A (en) Three-dimensional reconstruction method and system based on improved MVSNet
CN115205489A (en) Three-dimensional reconstruction method, system and device in large scene
CN113963117B (en) Multi-view three-dimensional reconstruction method and device based on variable convolution depth network
CN112767478B (en) Appearance guidance-based six-degree-of-freedom pose estimation method
WO2022198684A1 (en) Methods and systems for training quantized neural radiance field
CN115239870A (en) Multi-view stereo network three-dimensional reconstruction method based on attention cost body pyramid
CN115170741A (en) Rapid radiation field reconstruction method under sparse visual angle input
CN115359191A (en) Object three-dimensional reconstruction system based on deep learning
CN113705796A (en) Light field depth acquisition convolutional neural network based on EPI feature enhancement
CN113610912B (en) System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction
Xu et al. Learning inverse depth regression for pixelwise visibility-aware multi-view stereo networks
Hou et al. Joint learning of image deblurring and depth estimation through adversarial multi-task network
CN116721216A (en) Multi-view three-dimensional reconstruction method based on GCF-MVSNet network
CN116758219A (en) Region-aware multi-view stereo matching three-dimensional reconstruction method based on neural network
CN116310228A (en) Surface reconstruction and new view synthesis method for remote sensing scene
CN115631223A (en) Multi-view stereo reconstruction method based on self-adaptive learning and aggregation
CN113808006B (en) Method and device for reconstructing three-dimensional grid model based on two-dimensional image
Li et al. Point-Based Neural Scene Rendering for Street Views
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
CN114155406A (en) Pose estimation method based on region-level feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination