CN110021069B - Three-dimensional model reconstruction method based on grid deformation

Three-dimensional model reconstruction method based on grid deformation

Info

Publication number
CN110021069B
CN110021069B (application CN201910298089.7A)
Authority
CN
China
Prior art keywords
dimensional
visual angle
discrete
loss function
grid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910298089.7A
Other languages
Chinese (zh)
Other versions
CN110021069A (en)
Inventor
Yao Jian (姚剑)
Pan Tao (潘涛)
Chen Kai (陈凯)
Tu Jing (涂静)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority application: CN201910298089.7A
Publication of CN110021069A
Application granted
Publication of CN110021069B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/20 Finite element generation, e.g. wire-frame surface description, tessellation
    • G06T 17/205 Re-meshing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a three-dimensional model reconstruction method based on mesh deformation. A training sample set is constructed by producing discrete-view pictures and corresponding three-dimensional point cloud data for a number of models. A deep learning network model based on a graph convolutional neural network is set up, comprising a discrete-view feature fusion module and a mesh deformation module, with the output of the discrete-view feature fusion module connected to the input of the matching mesh deformation module. A loss function is set, and the network is trained on the training sample set. Discrete-view pictures of the object to be reconstructed are then input to the trained network model, which automatically reconstructs the three-dimensional mesh model, and the accuracy is evaluated. By learning from discrete-view pictures and three-dimensional point cloud data sets of objects, the method supports stable and accurate automatic reconstruction of three-dimensional mesh models for objects of different types and sizes.

Description

Three-dimensional model reconstruction method based on grid deformation
Technical Field
The invention belongs to the field of three-dimensional reconstruction, and particularly relates to a three-dimensional model reconstruction method based on mesh deformation.
Background
Three-dimensional reconstruction is a very challenging problem that has been studied in the field of computer vision for decades. Traditional methods based on multi-view geometry, such as many SfM and SLAM algorithms, involve a complex series of processes including feature extraction, feature matching, and matching-point triangulation. As a result, these methods are not robust in challenging scenarios where features cannot be extracted or matched effectively. Furthermore, their result is typically a sparse reconstruction that cannot be used directly. To overcome these limitations, some learning-based approaches have emerged, which regress depth for the scene from deep visual image features to obtain a dense reconstruction. However, like the earlier traditional methods, these approaches still require the input images to have sufficient overlap, which is awkward and inconvenient in practical applications.
With the rapid development of deep learning, many neural networks that infer a three-dimensional shape from only a single image have been proposed in recent years. Voxels are one of the most common representations of three-dimensional models, and many methods extend the traditional two-dimensional convolutional neural network to a three-dimensional version and estimate a voxelized three-dimensional model from one RGB image. Unfortunately, these methods are generally limited by the high computational complexity and memory overhead of three-dimensional convolutional neural networks, so the limited voxel resolution cannot capture the structural details of the object. Another way to express a three-dimensional model is to use an unordered point cloud directly. Here the network usually adopts an encoder-decoder architecture, acquiring geometric information from the RGB image in the encoder stage and regressing point cloud coordinates in the decoder stage. The point cloud usually enters the network as a set of unordered vectors, but conventional convolutional neural networks can only process ordered inputs, so the output point cloud lacks the geometric information of neighboring three-dimensional points. Recently, some deep learning methods have proposed classifying and analyzing three-dimensional models in non-Euclidean space, an idea naturally applicable to reconstructing a three-dimensional model of an object directly from a single RGB image: the output three-dimensional mesh model is generated by progressively deforming an initial ellipsoid mesh under a graph convolutional neural network (GCN) framework. Inferring a three-dimensional model of an object from only one RGB image is certainly very attractive, but the above approaches are inherently limited to a single view, because the geometric prior obtained from an image taken from only one angle is often too limited to accurately reconstruct the parts of the object not seen in the picture.
Disclosure of Invention
To address the problems of the above methods, the present invention strikes a compromise and proposes a discrete-view mesh generation network (DMGNet) that reconstructs a three-dimensional mesh model from discrete-view RGB images. On the one hand, the invention is more flexible than traditional multi-view geometry approaches based on multiple views, since it does not require any overlap between the input images. On the other hand, by exploiting the fused information of the discrete views, it obtains more accurate reconstruction results than single-view methods.
The invention provides a three-dimensional model reconstruction method based on mesh deformation, comprising the following steps:
step 1, constructing a training sample set, including producing discrete-view pictures of a number of models and the corresponding three-dimensional point cloud data;
step 2, setting up a deep learning network model based on a graph convolutional neural network, the model comprising a discrete-view feature fusion module and a mesh deformation module, with the output of the discrete-view feature fusion module connected to the input of the matching mesh deformation module;
the discrete perspective feature fusion module comprises 6 convolution calculation blocks Block1, Block2, Block3, Block4, Block5 and Block6, each Block having 2-3 convolution layers, respectively denoted as Block1_ conv1, Block1_ conv2, Block2_ conv1, Block2_ conv2, Block2_ conv3, Block3_ conv1, Block3_ conv2, Block2_ conv 2; respectively projecting the vertexes of the initial three-dimensional shape mesh in 4 layers of image feature graphs, namely block3_ conv3, block4_ conv3, block5_ conv3 and block6_ conv3, obtained from each visual angle by using camera external parameters of each discrete visual angle, and respectively obtaining image features corresponding to the vertex projection positions in each layer of image feature graph; fusing the image characteristics obtained under all the visual angles, adding the three-dimensional shape characteristics of the vertex as the multi-visual angle fusion characteristics of the vertex, and inputting the multi-visual angle fusion characteristics into a mesh deformation module;
the grid deformation module is formed by stacking 14 graph volume layers, and short connection is added between every two graph volume layers;
step 3, setting a loss function and training the deep learning network model based on the graph convolutional neural network on the training sample set; the loss function consists of an error loss function and a Chamfer loss function, the error loss function being established from the discrete-view back-projection error, and the Chamfer loss function from the Chamfer distance error between the reconstructed mesh vertex set and the ground-truth point cloud set;
and 4, inputting the discrete visual angle picture of the object to be reconstructed to the deep learning network model based on the graph convolution neural network obtained by training in the step 3, automatically reconstructing the three-dimensional grid model and evaluating the precision.
Furthermore, in the discrete-view feature fusion module, when the vertices of the initial three-dimensional shape mesh are projected into the image feature maps of each view, a vertex generally falls inside a grid cell on each layer of feature map; to ensure the accuracy of the projected image features, bilinear interpolation is used to obtain the image feature corresponding to the vertex projection position in each layer of feature map.
Furthermore, in step 3, the discrete-view back-projection error is defined and the error loss function is established as

E_bp = Σ_i Σ_(p,q) ||Ii(p') - Ii(q')||²

where Ii represents the input picture of the i-th view; p' and q' are the two-dimensional projections, obtained by back-projecting onto the picture with the corresponding camera parameters, of a vertex p on the output three-dimensional mesh and of its nearest-neighbor point q in the sample's ground-truth point cloud; Ii(p') and Ii(q') are the pixel values at p' and q' in the i-th view's input picture. The formula accumulates the pixel-value errors ||Ii(p') - Ii(q')||² over all discrete views as the total back-projection error.
Furthermore, in step 3, the Chamfer loss function is defined as

dCD(S1, S2) = Σ_(x∈S1) min_(y∈S2) ||x - y||² + Σ_(y∈S2) min_(x∈S1) ||x - y||²

where S1 and S2 respectively represent the vertex set of the reconstructed mesh and the ground-truth point cloud set of the model. For each point x in S1, the spatially nearest point y in the other set S2 is found; for each point y in S2, the spatially nearest point x in the other set S1 is found; and the squared distances are summed to give the Chamfer distance error dCD(S1, S2), which serves as the Chamfer loss function.
Furthermore, in step 3, a stochastic gradient descent algorithm is used to iteratively solve for the network parameters that minimize the error loss function and the Chamfer loss function, obtaining the trained network model.
Furthermore, in step 4, the three-dimensional mesh model reconstruction result is evaluated quantitatively using the Chamfer distance error and the F-score.
By means of deep learning, through learning and training on discrete-view pictures and three-dimensional point cloud data of objects, the deep learning model finally obtained can accurately reconstruct three-dimensional models of objects of different types and sizes. Compared with the prior art, the invention has the following advantages:
(1) The learning-based method avoids the complex and time-consuming pipeline of traditional multi-view geometry.
the invention uses a deep learning method to automatically learn the discrete visual angle picture of the object and the training sample formed by the corresponding three-dimensional model, thereby obtaining a network which can directly generate the three-dimensional grid model from the picture. The method does not need a series of operation processes such as feature extraction, feature matching, matching point triangularization and the like, and has the characteristics of convenience and rapidness. And the input pictures are not required to have certain overlapping degree, so the number of the required pictures is small, and the data acquisition has higher flexibility.
(2) It effectively solves the problem that the three-dimensional model formats output by traditional convolutional networks cannot be put directly into production use.
the invention utilizes a new graph convolution neural network structure to directly deform on the three-dimensional grid, the finally output model format is triangular grid, compared with the traditional convolution network reconstruction result which is disordered point cloud or voxel grid, the three-dimensional model expression mode of the invention is close to the actual three-dimensional model production operation, and can be directly put into use.
(3) It solves the problem that structures invisible in a single picture cannot be reconstructed well.
according to the invention, by designing a discrete visual angle feature fusion module, more geometric prior information can be obtained from a limited number of discrete pictures compared with a single-view picture, so that a part of object structure which cannot be seen in a single picture can be better reconstructed, the applicability of the invention is stronger, and the geometric structure of an object to be reconstructed is not required to have strong symmetry.
Drawings
Fig. 1 is a structure diagram of the three-dimensional mesh generation network based on the graph convolutional neural network according to an embodiment of the present invention.
Fig. 2 is a structure diagram of the discrete-view feature fusion module according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the discrete-view back-projection error according to an embodiment of the present invention.
Fig. 4 is a structure diagram of the mesh deformation module according to an embodiment of the present invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, it is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are merely illustrative and explanatory and do not limit the invention.
The deep learning network takes only images from several views as input and outputs a reconstructed three-dimensional mesh model. As shown in Fig. 1, the basic network of the invention mainly comprises 2 parts: (1) a mesh deformation module based on a graph convolutional neural network, and (2) a discrete-view image feature fusion module. As the network learns, the graph-convolution-based mesh deformation module gradually deforms the initial three-dimensional shape mesh into the three-dimensional shape of the target object. The discrete-view image feature fusion module extracts two-dimensional image features from each discrete-view image, uses the camera extrinsics of each view to obtain, by projection, the features corresponding to each mesh vertex on the two-dimensional feature maps of the different views, and fuses the multi-view information as the vertex's shape feature in the mesh deformation module. The invention stacks this basic network and adds graph-based unpooling operations between modules to increase the number of mesh vertices, realizing a coarse-to-fine refinement of the deformed mesh and finally obtaining the reconstructed three-dimensional mesh model.
The embodiment of the invention provides a three-dimensional model reconstruction method based on mesh deformation, comprising the following steps:
Step 1: construct a training sample set, including producing discrete-view pictures of a number of models and the corresponding three-dimensional point cloud data, and pack the pictures with the corresponding point cloud data as the training data set.
In the embodiment, this includes the following sub-steps:
Step 1.1: simulate photographing the objects in the ShapeNet data set from different angles to obtain discrete-view picture data.
The open-source ShapeNet data set contains 3D models and their category labels and is a commonly used 3D data set. The embodiment builds its training sample set on ShapeNet.
Step 1.2: obtain the corresponding ground-truth three-dimensional point cloud data by uniform sampling on the model surface (a sketch of this sampling is given after step 1.3).
Step 1.3: pack the discrete-view pictures and point cloud data of the same model to construct the training sample set.
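The uniform sampling in step 1.2 can be realized as area-weighted sampling on the mesh triangles. Below is a minimal NumPy sketch; the function name and the point count of 2048 are assumptions for illustration, since the embodiment does not fix them.

```python
import numpy as np

def sample_surface(vertices, faces, n_points=2048):
    """Uniformly sample points on a triangle mesh (area-weighted).

    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    n_points=2048 is an assumed count, not specified by the embodiment.
    """
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    # Triangle areas from the cross product; sample faces in proportion to area.
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    idx = np.random.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates inside each chosen triangle.
    u = np.random.rand(n_points, 1)
    v = np.random.rand(n_points, 1)
    flip = (u + v) > 1.0  # reflect samples that fall outside the triangle
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    return v0[idx] + u * (v1[idx] - v0[idx]) + v * (v2[idx] - v0[idx])
```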
Step 2: set up the deep learning network model based on the graph convolutional neural network.
In the embodiment, a three-dimensional mesh model generation network based on a graph convolutional neural network, shown in Fig. 1, is designed. The network mainly consists of a discrete-view feature fusion module and a mesh deformation module, with the output of the discrete-view feature fusion module connected to the input of the mesh deformation module.
The method specifically comprises the following substeps:
and 2.1, uniformly adjusting the sizes of the discrete view angle pictures to 224 multiplied by 224, and carrying out normalization processing on pixel values. They are input to a discrete perspective feature fusion module.
Step 2.2: set up the discrete-view feature fusion module. As shown in Fig. 2, it contains 6 convolution blocks Block1, Block2, Block3, Block4, Block5 and Block6, each block having 2-3 convolution layers, denoted block1_conv1, block1_conv2, block2_conv1, block2_conv2, ..., block6_conv3. The initial three-dimensional shape mesh of the object in the embodiment is an initial ellipsoid mesh, generated with MeshLab software; it has 156 vertices, and its three axial lengths are 0.2 m, 0.2 m and 0.4 m. The cameras for view 1, view 2 and view 3 can be placed at the vertices of an equilateral triangle centered on the object (the initial ellipsoid mesh). Using the camera extrinsics of each discrete view, the vertices of the initial three-dimensional shape mesh are projected into the 4 layers of image feature maps block3_conv3, block4_conv3, block5_conv3 and block6_conv3 obtained from that view. On each layer of feature map, a projected vertex falls inside a grid cell; to ensure the accuracy of the projected image features, bilinear interpolation is used to obtain the image feature at each vertex projection. The 4 feature vectors of different dimensions obtained by projecting a vertex are concatenated into an image feature with 960 channels in total for that view. The image features obtained under all views are fused and concatenated with the vertex's three-dimensional shape feature to form the vertex's multi-view fusion feature, which is input to the mesh deformation module. A sketch of this projection-and-pooling step is given below.
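A minimal PyTorch sketch of the projection-and-pooling step for one view follows. The tensor layout, the use of F.grid_sample for the bilinear lookup, and the simple pinhole projection with intrinsics cam_K and extrinsics cam_RT are illustrative assumptions; the patent fixes only the four source layers and the 960-channel concatenation.

```python
import torch
import torch.nn.functional as F

def pool_vertex_features(verts, feat_maps, cam_K, cam_RT, img_size=224):
    """Project mesh vertices into the multi-scale feature maps of one view
    and pool the features at the projections by bilinear interpolation.

    verts:     (V, 3) vertex coordinates in world space
    feat_maps: list of 4 tensors (C_k, H_k, W_k), e.g. block3..6_conv3
    cam_K:     (3, 3) camera intrinsics; cam_RT: (3, 4) camera extrinsics
    """
    # World -> camera -> pixel coordinates (simple pinhole model).
    v_h = torch.cat([verts, torch.ones(len(verts), 1)], dim=1)  # (V, 4)
    cam = (cam_RT @ v_h.T).T                                    # (V, 3)
    pix = (cam_K @ cam.T).T
    xy = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)               # (V, 2)
    # Normalize to [-1, 1] as required by grid_sample (bilinear lookup).
    grid = (2.0 * xy / (img_size - 1) - 1.0).view(1, 1, -1, 2)
    feats = []
    for fm in feat_maps:
        s = F.grid_sample(fm.unsqueeze(0), grid, mode='bilinear',
                          align_corners=True)                   # (1, C, 1, V)
        feats.append(s.squeeze(0).squeeze(1).T)                 # (V, C)
    return torch.cat(feats, dim=1)  # (V, 960) when the channels sum to 960
```

The per-view features returned here can then be fused across the three views, for example by pooling or concatenation, and joined with each vertex's current shape feature; the patent does not fix the exact fusion operator.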
Step 2.3: set up the mesh deformation module. It is stacked from 14 graph convolution layers, with short connections added between every other two layers to ease the training difficulty that comes with increasing network depth and to make information flow more efficient. The stacked graph convolution operations update the mesh vertex coordinates to achieve mesh deformation, so that the mesh shape gradually fits the target model.
Referring to Fig. 4, among the 14 graph convolution layers, the output of layer 1 is connected to the output of layer 3, layer 3 to layer 5, layer 5 to layer 7, layer 7 to layer 9, layer 9 to layer 11, and layer 11 to layer 13.
That is:
L3 = (L1 + L3)/2
L5 = (L3 + L5)/2
L7 = (L5 + L7)/2
L9 = (L7 + L9)/2
L11 = (L9 + L11)/2
L13 = (L11 + L13)/2
where the Lc on the right-hand side of each equation is the output of graph convolution layer c before the short connection is added, the Lc on the left-hand side is the output after the short connection, and c = 3, 5, 7, ..., 13. A sketch of these layers follows.
Step 3: set the loss function and train the deep learning network model based on the graph convolutional neural network on the training sample set.
In an embodiment, the loss function is composed of an error loss function and a Chamfer loss function, and the step 3 implementation includes the following sub-steps:
step 3.1, defining a discrete view angle back projection error, and establishing an error loss function as follows:
Figure BDA0002027296080000061
as shown in FIG. 3, wherein IiAnd representing an input picture of the ith visual angle, wherein p 'and q' are a vertex p on the output three-dimensional grid and a point q on sample real point cloud data of the nearest neighbor, and a two-dimensional projection point is obtained by back projection on the picture by using corresponding camera parameters. E.g. with a vertex p in fig. 31,p2,p3,p4,p5Corresponding to the cloud data q1,q2,q3,q4,q5。Ii(p') and Ii(q ') is the corresponding pixel value of the two-dimensional projection point p ', q ' on the input picture of the ith view angle, and the formula is to count the error of the corresponding pixel value under all discrete view angles
Figure BDA0002027296080000062
Accumulated as total back projection error.
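A sketch of this loss in PyTorch follows; the helpers are assumptions: project(img, pts) stands for back-projecting 3D points into a view and reading the pixel values bilinearly, and nn_points holds the ground-truth point q nearest to each mesh vertex p.

```python
import torch

def backprojection_loss(images, proj_fns, verts, nn_points):
    """Accumulate || Ii(p') - Ii(q') ||^2 over all discrete views.

    images:    list of view pictures; proj_fns: matching list of callables,
               proj_fns[i](img, pts) -> pixel values of pts projected in view i
    verts:     (V, 3) mesh vertices p; nn_points: (V, 3) nearest points q
    """
    loss = 0.0
    for img, project in zip(images, proj_fns):
        pix_p = project(img, verts)      # Ii(p'), pixel values at vertex projections
        pix_q = project(img, nn_points)  # Ii(q'), pixel values at nearest-point projections
        loss = loss + ((pix_p - pix_q) ** 2).sum()
    return loss
```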
The Chamfer loss function is defined as

dCD(S1, S2) = Σ_(x∈S1) min_(y∈S2) ||x - y||² + Σ_(y∈S2) min_(x∈S1) ||x - y||²

where S1 and S2 respectively represent the vertex set of the reconstructed mesh and the ground-truth point cloud set of the model. For each point x in S1, the spatially nearest point y in the other set S2 is found; for each point y in S2, the spatially nearest point x in the other set S1 is found; and the squared distances are summed to give the Chamfer distance error dCD(S1, S2), which serves as the Chamfer loss function.
The sum of the error loss function and the Chamfer loss function is taken as the total loss.
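A direct PyTorch sketch of dCD(S1, S2), assuming the dense (N, M) distance matrix fits in memory:

```python
import torch

def chamfer_distance(s1, s2):
    """Chamfer distance between an (N, 3) vertex set and an (M, 3) point
    cloud: squared distances to the nearest neighbor in the other set,
    summed in both directions, as in dCD(S1, S2) above."""
    d2 = torch.cdist(s1, s2) ** 2  # pairwise squared distances, (N, M)
    return d2.min(dim=1).values.sum() + d2.min(dim=0).values.sum()
```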
Step 3.2: use the stochastic gradient descent algorithm to iteratively solve for the network parameters that minimize the error loss function and the Chamfer loss function, obtaining the trained network model. Stochastic gradient descent is prior art and is not described in detail here.
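Step 3.2 can be sketched schematically as below, reusing the chamfer_distance and backprojection_loss sketches above; the hyperparameters (learning rate, momentum, loss weight lam, epoch count) are assumptions not fixed by the text.

```python
import torch

def nearest_neighbors(verts, points):
    """For each mesh vertex p, its nearest ground-truth point q."""
    return points[torch.cdist(verts, points).argmin(dim=1)]

def train(model, loader, lam=1.0, lr=1e-4, epochs=50):
    """Schematic stochastic-gradient-descent training loop for step 3.2."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, proj_fns, gt_points in loader:  # one training sample
            verts = model(images)                   # reconstructed mesh vertices
            loss = (chamfer_distance(verts, gt_points)
                    + lam * backprojection_loss(images, proj_fns, verts,
                                                nearest_neighbors(verts, gt_points)))
            opt.zero_grad()
            loss.backward()
            opt.step()
```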
Step 4: input the discrete-view pictures of the object to be reconstructed into the deep learning network model based on the graph convolutional neural network trained in step 3, automatically reconstruct the three-dimensional mesh model, and evaluate the accuracy:
Step 4.1: unify the size of the discrete-view picture data to be reconstructed and normalize the pixel values;
Step 4.2: input the test data into the trained network model to obtain the three-dimensional mesh model reconstruction result;
and 4.3, carrying out quantitative evaluation on the final reconstruction result by using the Chamfer distance error and the F-fraction. The calculation formula of the Chamfer distance error is the same as the Chamfer loss function.
The F-score is calculated as

F(d) = 2 · P(d) · R(d) / (P(d) + R(d))

where F(d) denotes the F-score and d a threshold on the Euclidean distance between two points in space (a value can be preset in a specific implementation, e.g. 0.0001). P(d) denotes the percentage of points in the reconstructed mesh vertex set whose distance to the ground-truth point cloud is smaller than d, and R(d) the percentage of points in the ground-truth point cloud whose distance to the reconstructed mesh vertex set is smaller than d. The F-score measures both how close the reconstructed mesh vertices are to the ground-truth point cloud and how complete the reconstruction is.
With the above method, a three-dimensional model can be reconstructed stably and accurately from pictures of an actual object.
In a specific implementation, the above technical solution can be run automatically using computer software technology, and a hardware device running the method also falls within the protection scope of the present invention.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A three-dimensional model reconstruction method based on mesh deformation, characterized by comprising the following steps:
step 1, constructing a training sample set, including producing discrete-view pictures of a number of models and the corresponding three-dimensional point cloud data;
step 2, setting up a deep learning network model based on a graph convolutional neural network, the model comprising a discrete-view feature fusion module and a mesh deformation module, with the output of the discrete-view feature fusion module connected to the input of the matching mesh deformation module;
the discrete perspective feature fusion module comprises 6 convolution calculation blocks Block1, Block2, Block3, Block4, Block5 and Block6, each Block having 2-3 convolution layers, respectively denoted as Block1_ conv1, Block1_ conv2, Block2_ conv1, Block2_ conv2, Block2_ conv3, Block3_ conv1, Block3_ conv2, Block2_ conv 2; respectively projecting the vertexes of the initial three-dimensional shape mesh in 4 layers of image feature graphs, namely block3_ conv3, block4_ conv3, block5_ conv3 and block6_ conv3, obtained from each visual angle by using camera external parameters of each discrete visual angle, and respectively obtaining image features corresponding to the vertex projection positions in each layer of image feature graph; fusing the image characteristics obtained under all the visual angles, adding the three-dimensional shape characteristics of the vertex as the multi-visual angle fusion characteristics of the vertex, and inputting the multi-visual angle fusion characteristics into a mesh deformation module;
the grid deformation module is formed by stacking 14 graph volume layers, and short connection is added between every two graph volume layers;
step 3, setting a loss function and training the deep learning network model based on the graph convolutional neural network on the training sample set; the loss function consists of an error loss function and a Chamfer loss function, the error loss function being established from the discrete-view back-projection error, and the Chamfer loss function from the Chamfer distance error between the reconstructed mesh vertex set and the ground-truth point cloud set;
and 4, inputting the discrete visual angle picture of the object to be reconstructed to the deep learning network model based on the graph convolution neural network obtained by training in the step 3, automatically reconstructing the three-dimensional grid model and evaluating the precision.
2. The mesh deformation-based three-dimensional model reconstruction method according to claim 1, wherein: in the discrete-view feature fusion module, when the vertices of the initial three-dimensional shape mesh are projected into the image feature maps of each view, a vertex falls inside a grid cell on each layer of feature map; to ensure the accuracy of the projected image features, bilinear interpolation is used to obtain the image feature corresponding to the vertex projection position in each layer of feature map.
3. The mesh deformation-based three-dimensional model reconstruction method according to claim 1 or 2, characterized in that: in step 3, the discrete-view back-projection error is defined and the error loss function is established as

E_bp = Σ_i Σ_(p,q) ||Ii(p') - Ii(q')||²

where Ii represents the input picture of the i-th view; p' and q' are the two-dimensional projections, obtained by back-projecting onto the picture with the corresponding camera parameters, of a vertex p on the output three-dimensional mesh and of its nearest-neighbor point q in the sample's ground-truth point cloud; Ii(p') and Ii(q') are the pixel values at p' and q' in the i-th view's input picture; and the formula accumulates the pixel-value errors ||Ii(p') - Ii(q')||² over all discrete views as the total back-projection error.
4. The mesh deformation-based three-dimensional model reconstruction method according to claim 1 or 2, characterized in that: in step 3, the Chamfer loss function is defined as

dCD(S1, S2) = Σ_(x∈S1) min_(y∈S2) ||x - y||² + Σ_(y∈S2) min_(x∈S1) ||x - y||²

where S1 and S2 respectively represent the vertex set of the reconstructed mesh and the ground-truth point cloud set of the model; for each point x in S1, the spatially nearest point y in the other set S2 is found; for each point y in S2, the spatially nearest point x in the other set S1 is found; and the squared distances are summed to give the Chamfer distance error dCD(S1, S2), which serves as the Chamfer loss function.
5. The mesh deformation-based three-dimensional model reconstruction method according to claim 1 or 2, characterized in that: in step 3, a stochastic gradient descent algorithm is used to iteratively solve for the network parameters that minimize the error loss function and the Chamfer loss function, obtaining the trained network model.
6. The mesh deformation-based three-dimensional model reconstruction method according to claim 1 or 2, characterized in that: in step 4, the three-dimensional mesh model reconstruction result is quantitatively evaluated using the Chamfer distance error and the F-score.
CN201910298089.7A 2019-04-15 2019-04-15 Three-dimensional model reconstruction method based on grid deformation Active CN110021069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910298089.7A CN110021069B (en) 2019-04-15 2019-04-15 Three-dimensional model reconstruction method based on grid deformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910298089.7A CN110021069B (en) 2019-04-15 2019-04-15 Three-dimensional model reconstruction method based on grid deformation

Publications (2)

Publication Number Publication Date
CN110021069A CN110021069A (en) 2019-07-16
CN110021069B true CN110021069B (en) 2022-04-15

Family

ID=67191317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910298089.7A Active CN110021069B (en) 2019-04-15 2019-04-15 Three-dimensional model reconstruction method based on grid deformation

Country Status (1)

Country Link
CN (1) CN110021069B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096234B (en) * 2019-12-23 2022-09-06 复旦大学 Method and device for generating three-dimensional grid model by using multiple color pictures
CN110751730B (en) * 2019-07-24 2023-02-24 叠境数字科技(上海)有限公司 Dressing human body shape estimation method based on deep neural network
CN110443842B (en) * 2019-07-24 2022-02-15 大连理工大学 Depth map prediction method based on visual angle fusion
CN110443892B (en) * 2019-07-25 2021-06-04 北京大学 Three-dimensional grid model generation method and device based on single image
CN110458957B (en) * 2019-07-31 2023-03-10 浙江工业大学 Image three-dimensional model construction method and device based on neural network
CN110570522B (en) * 2019-08-22 2023-04-07 天津大学 Multi-view three-dimensional reconstruction method
CN111340866B (en) * 2020-02-26 2024-03-01 腾讯科技(深圳)有限公司 Depth image generation method, device and storage medium
CN111612891B (en) * 2020-05-22 2023-08-08 北京京东乾石科技有限公司 Model generation method, point cloud data processing method, device, equipment and medium
CN113822982B (en) * 2020-06-19 2023-10-27 北京达佳互联信息技术有限公司 Human body three-dimensional model construction method and device, electronic equipment and storage medium
CN112396703B (en) * 2020-11-18 2024-01-12 北京工商大学 Reconstruction method of single-image three-dimensional point cloud model
CN112489216B (en) * 2020-11-27 2023-07-28 北京百度网讯科技有限公司 Evaluation method, device and equipment of facial reconstruction model and readable storage medium
CN112489202A (en) * 2020-12-08 2021-03-12 甘肃智通科技工程检测咨询有限公司 Pavement macroscopic texture reconstruction method based on multi-view deep learning
CN112634281A (en) * 2020-12-10 2021-04-09 浙江大学 Grid segmentation method based on graph convolution network
CN112862949B (en) * 2021-01-18 2022-08-19 北京航空航天大学 Object 3D shape reconstruction method based on multiple views
CN113096249B (en) * 2021-03-30 2023-02-17 Oppo广东移动通信有限公司 Method for training vertex reconstruction model, image reconstruction method and electronic equipment
CN113077554A (en) * 2021-04-08 2021-07-06 华南理工大学 Three-dimensional structured model reconstruction method based on any visual angle picture
CN113610711B (en) * 2021-08-02 2023-05-23 南京信息工程大学 Single-image-guided three-dimensional surface reconstruction method and device
CN113808006B (en) * 2021-09-01 2023-05-23 南京信息工程大学 Method and device for reconstructing three-dimensional grid model based on two-dimensional image
CN113808275B (en) * 2021-09-24 2023-10-13 南京信息工程大学 Single image three-dimensional reconstruction method based on GCN and topology modification
CN114373056A (en) * 2021-12-17 2022-04-19 云南联合视觉科技有限公司 Three-dimensional reconstruction method and device, terminal equipment and storage medium
CN116109799B (en) * 2023-04-13 2023-08-04 深圳思谋信息科技有限公司 Method, device, computer equipment and storage medium for training adjustment model
CN116168137B (en) * 2023-04-21 2023-07-11 湖南马栏山视频先进技术研究院有限公司 New view angle synthesis method, device and memory based on nerve radiation field
CN117953544A (en) * 2024-03-26 2024-04-30 安徽农业大学 Target behavior monitoring method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133960A (en) * 2017-04-21 2017-09-05 武汉大学 Image crack dividing method based on depth convolutional neural networks
WO2018126396A1 (en) * 2017-01-05 2018-07-12 General Electric Company Deep learning based estimation of data for use in tomographic reconstruction
CN108510573A (en) * 2018-04-03 2018-09-07 南京大学 A method of the multiple views human face three-dimensional model based on deep learning is rebuild
CN109389671A (en) * 2018-09-25 2019-02-26 南京大学 A kind of single image three-dimensional rebuilding method based on multistage neural network
CN109410321A (en) * 2018-10-17 2019-03-01 大连理工大学 Three-dimensional rebuilding method based on convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018126396A1 (en) * 2017-01-05 2018-07-12 General Electric Company Deep learning based estimation of data for use in tomographic reconstruction
CN107133960A (en) * 2017-04-21 2017-09-05 武汉大学 Image crack dividing method based on depth convolutional neural networks
CN108510573A (en) * 2018-04-03 2018-09-07 南京大学 A method of the multiple views human face three-dimensional model based on deep learning is rebuild
CN109389671A (en) * 2018-09-25 2019-02-26 南京大学 A kind of single image three-dimensional rebuilding method based on multistage neural network
CN109410321A (en) * 2018-10-17 2019-03-01 大连理工大学 Three-dimensional rebuilding method based on convolutional neural networks

Also Published As

Publication number Publication date
CN110021069A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN110021069B (en) Three-dimensional model reconstruction method based on grid deformation
CN110458939B (en) Indoor scene modeling method based on visual angle generation
CN110458957B (en) Image three-dimensional model construction method and device based on neural network
JP5778237B2 (en) Backfill points in point cloud
Whelan et al. Deformation-based loop closure for large scale dense RGB-D SLAM
US9715761B2 (en) Real-time 3D computer vision processing engine for object recognition, reconstruction, and analysis
US8452081B2 (en) Forming 3D models using multiple images
CN110009691B (en) Parallax image generation method and system based on binocular stereo vision matching
CN110176032B (en) Three-dimensional reconstruction method and device
EP3905194A1 (en) Pose estimation method and apparatus
JP2016075637A (en) Information processing apparatus and method for the same
Ramalingam et al. A theory of minimal 3D point to 3D plane registration and its generalization
JP5106375B2 (en) 3D shape restoration device and program thereof
Saini et al. NURBS-based geometric inverse reconstruction of free-form shapes
El Hazzat et al. 3D reconstruction system based on incremental structure from motion using a camera with varying parameters
CN115035235A (en) Three-dimensional reconstruction method and device
CN114219890A (en) Three-dimensional reconstruction method, device and equipment and computer storage medium
Gadasin et al. Reconstruction of a Three-Dimensional Scene from its Projections in Computer Vision Systems
Ly et al. Extrinsic calibration of heterogeneous cameras by line images
Vong et al. How to build a 2d and 3d aerial multispectral map?—all steps deeply explained
CN104796624A (en) Method for editing and propagating light fields
CN117132737B (en) Three-dimensional building model construction method, system and equipment
Hu et al. Learning structural graph layouts and 3D shapes for long span bridges 3D reconstruction
Yuan et al. Incremental 3D reconstruction using Bayesian learning
CN113066165B (en) Three-dimensional reconstruction method and device for multi-stage unsupervised learning and electronic equipment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant