CN117671146A - Single-view three-dimensional reconstruction method based on cyclic diffusion model - Google Patents


Info

Publication number
CN117671146A
Authority
CN
China
Prior art keywords
point cloud
noise
denoising
network
diffusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311650864.3A
Other languages
Chinese (zh)
Inventor
周燕
叶德旺
周月霞
刘翔宇
许业文
李文俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan University
Original Assignee
Foshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan University filed Critical Foshan University
Priority to CN202311650864.3A priority Critical patent/CN117671146A/en
Publication of CN117671146A publication Critical patent/CN117671146A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a single-view three-dimensional reconstruction method based on a cyclic diffusion model, which comprises the following steps: extracting and fusing view features to obtain a fused feature vector, and predicting the mean and variance of an image condition feature vector from it; sampling from the predicted mean and variance to obtain the image condition feature vector; performing farthest point sampling to obtain a three-dimensional model point cloud, and training the denoising network in a diffusion model; improving the guiding capability of the image condition feature vector over the denoising network through cyclic denoising; and guiding the denoising network with a single input view to gradually denoise a pure-noise point cloud drawn from the standard Gaussian distribution, finally obtaining a three-dimensional model point cloud consistent with the geometric structure of the single view. The invention aims to provide a single-view three-dimensional reconstruction method based on a cyclic diffusion model that trains stably, runs efficiently, and improves the guiding capability of the view.

Description

Single-view three-dimensional reconstruction method based on cyclic diffusion model
Technical Field
The invention relates to the technical field of machine learning, in particular to a single-view three-dimensional reconstruction method based on a cyclic diffusion model.
Background
Single view three-dimensional model reconstruction is a challenging task in the fields of computer vision, augmented reality, and industrial manufacturing, with the goal of generating a corresponding three-dimensional structure from a single image.
In recent years, with the continued development of deep learning, several voxel-based reconstruction methods have been proposed. These methods can generate shapes from a single image, but voxel-based methods have a significant drawback: it is difficult to balance sampling resolution against network efficiency. To overcome this limitation, researchers have explored point-cloud-based reconstruction methods, which employ generative models (e.g., variational auto-encoders, generative adversarial networks, normalizing flow models, and diffusion probability models) for reconstruction. However, in the earlier methods using variational auto-encoders or generative adversarial networks, the number of output points is coupled to the network architecture design, so a different network must be retrained to obtain point clouds with different numbers of points. Methods based on normalizing flow models and diffusion probability models instead model the distribution of the point cloud with a network, so generating a point cloud can be regarded as sampling from that distribution, and the number of sampled points can be set arbitrarily. The main drawbacks of current normalizing-flow and diffusion-based methods are that they introduce more noise, which limits the quality of the generated point cloud, and that both network training and sample generation take a long time.
Disclosure of Invention
The invention aims to provide a single-view three-dimensional reconstruction method based on a cyclic diffusion model, which adopts a diffusion model to generate the three-dimensional model point cloud and therefore trains stably; improves the guiding capability of the view by training the denoising network with a cyclic denoising scheme; and maintains the quality of the generated three-dimensional model point cloud while achieving high operation efficiency.
To achieve this purpose, the invention adopts the following technical scheme: a single-view three-dimensional reconstruction method based on a cyclic diffusion model comprises the following steps:
step S1: extracting and fusing features with an image feature extraction algorithm fused by random inactivation (dropout) to obtain a fused feature vector, and predicting the mean and variance of the image condition feature vector from the fused feature vector;
step S2: sampling from the predicted mean and variance via the reparameterization trick to obtain the image condition feature vector;
step S3: performing farthest point sampling on the triangular patches of the three-dimensional model to obtain a three-dimensional model point cloud, gradually adding noise to the point cloud according to a noise-adding strategy, and training the denoising network in the diffusion model on the noised point cloud;
step S4: in each iteration of training the diffusion model, improving the guiding capability of the image condition feature vector over the denoising network through cyclic denoising;
step S5: guiding the denoising network with the single input view to gradually denoise a pure-noise point cloud drawn from the standard Gaussian distribution, finally obtaining a three-dimensional model point cloud consistent with the geometric structure of the single view.
Preferably, step S1 comprises the following sub-steps:
substep S11: input the i rendered views {v_1, v_2, ..., v_i} of the three-dimensional model into a two-dimensional backbone network ψ and extract features to obtain a set {f_1, f_2, ..., f_i} of i view feature vectors, the dimension of each view feature vector being set to 512; randomly inactivate the set of view feature vectors with probability p to obtain the set F of view feature vectors after random inactivation, with the specific formula:

F = VD_p({f_1, f_2, ..., f_i});

wherein: F represents the set of view feature vectors after random inactivation, VD_p denotes the random inactivation (dropout) operation, and p represents the probability of random inactivation;
substep S12: fuse the set F of randomly inactivated view feature vectors by max pooling to obtain the final fused feature vector.
Preferably, step S2 comprises the following sub-steps:
substep S21: construct a network for the mean and a network for the variance of the image condition feature vector, each comprising two fully connected layers, a batch normalization layer, and a linear rectification (ReLU) layer; the number of network output channels is set to 256;
substep S22: sample from the mean and variance predicted by the networks to obtain the image condition feature vector used to guide the denoising network, with the specific formula:

z = μ + σ ⊙ ε;

wherein: z represents the image condition feature vector finally used to guide the denoising network, σ represents the variance predicted by the network, μ represents the mean predicted by the network, ε represents noise sampled from the standard Gaussian distribution N(0, I), and I represents the identity matrix.
Preferably, step S3 comprises the following sub-steps:
substep S31: the original three-dimensional model point cloud is M^(0) = {p_i}_{i=1}^{N}, where p_i = (x_i, y_i, z_i) ∈ R^3, N represents the number of points, x_i represents the value of the ith point in the point cloud on the x-axis, y_i its value on the y-axis, and z_i its value on the z-axis;
substep S32: the process of gradually adding noise to the three-dimensional model point cloud is regarded as a Markov chain, and the formula for gradually adding noise to the three-dimensional model point cloud is:

q(M^(1:T) | M^(0)) = ∏_{t=1}^{T} q(M^(t) | M^(t-1));

wherein: T represents the total number of diffusion steps, M^(0) represents the three-dimensional model point cloud, M^(T) is the pure-noise point cloud following the standard Gaussian distribution N(0, I), and M^(t) represents the intermediate noise point cloud at diffusion step t during noise addition; q(M^(t) | M^(t-1)) represents the noise-adding operation, which models the distribution of the noised point cloud M^(t) obtained by adding noise to the point cloud M^(t-1);
substep S33: the noise-adding operation q(M^(t) | M^(t-1)) on the three-dimensional model point cloud at each diffusion step is:

q(M^(t) | M^(t-1)) := N(M^(t); √(1-β_t) M^(t-1), β_t I);

wherein: β_t denotes the hyperparameter controlling the noise intensity according to time step t, and β_t increases linearly from β_1 = 0.0004 to β_T = 0.02; I represents the identity matrix; N represents the number of points; M^(t) represents the intermediate point cloud during noise addition.
Preferably, step S4 comprises the following sub-steps:
substep S41: regard M^(0) as an independent sample from the original three-dimensional model point cloud distribution q(M^(0)); z is the image condition feature vector obtained by reparameterization in step S2;
substep S42: the process of gradually denoising towards the three-dimensional model point cloud is likewise treated as a Markov chain, and the formula for gradually denoising the initial three-dimensional Gaussian noise under the guidance of the image condition feature vector is:

p_θ(M^(0:T) | z) = p(M^(T)) ∏_{t=1}^{T} p_θ(M^(t-1) | M^(t), z);

wherein: T represents the total number of diffusion steps, M^(0) represents the original three-dimensional model point cloud, M^(T) is the pure-noise point cloud, M^(t) represents the intermediate point cloud during denoising, and p_θ(M^(t-1) | M^(t), z) represents the denoising operation;
the formula for denoising the point set M^(t) at the current moment into the point set M^(t-1) at the next moment, according to the image condition feature vector z, is:

p_θ(M^(t-1) | M^(t), z) := N(M^(t-1); μ_θ(M^(t), t, z), β_t I);

wherein: μ_θ is the mean of the noise added to the point set at the current time step, as predicted by the denoising network; M^(t) represents the noise point cloud at diffusion step t; β_t represents the noise intensity at diffusion step t; N represents the number of points in the point cloud; I represents the identity matrix;
substep S43: the guiding capability of the image over network denoising at each time step is improved through cyclic diffusion; the cyclic diffusion comprises: first, given the point set M^(t) at a certain time step, take it as the first input point set M^(t)_1 of the cyclic denoising network; one denoising operation yields the input point set M^(t)_2 of the second cycle; after C cycles M^(t)_C is obtained, i.e. the point set M^(t-1) of the next time step is finally output;
substep S44: the formula for optimizing the network parameters by the gradient descent algorithm is:

L = E_q [ Σ_{t=2}^{T} D_KL( q(M^(t-1) | M^(t), M^(0)) ‖ p_θ(M^(t-1) | M^(t), z) ) − log p_θ(M^(0) | M^(1), z) ];

wherein: D_KL represents the KL divergence between the two probability distributions q(M^(t-1) | M^(t), M^(0)) and p_θ(M^(t-1) | M^(t), z); p_θ(M^(t-1) | M^(t), z) models the probability distribution of the noise point cloud at the next diffusion step t-1 by feeding the noise point cloud at diffusion step t and the image condition feature vector into the denoising network; p_θ(M^(0) | M^(1), z) models the probability distribution of the point cloud M^(0) at the next moment by feeding the noise point cloud M^(1) at diffusion step 1 and the image condition feature vector into the denoising network;
q(M^(t-1) | M^(t), M^(0)) is the posterior probability distribution of the forward process, computed as:

q(M^(t-1) | M^(t), M^(0)) = N(M^(t-1); μ̃_t(M^(t), M^(0)), β̃_t I);

wherein: μ̃_t(M^(t), M^(0)) = (√ᾱ_{t-1} β_t / (1 − ᾱ_t)) M^(0) + (√α_t (1 − ᾱ_{t-1}) / (1 − ᾱ_t)) M^(t), β̃_t = ((1 − ᾱ_{t-1}) / (1 − ᾱ_t)) β_t, with α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s.
preferably, step S5 comprises the following sub-steps:
substep S51: randomly select a view v, v ∈ V, from the atlas V of rendered views in step S1, and set the total number of iteration steps of the diffusion model to T = 100;
substep S52: extract the image features of the single view and sample the image condition feature vector z as in steps S1 and S2;
substep S53: sample a point set M^(T) from the standard Gaussian distribution;
substep S54: based on the Euler method, use the trained denoising model to denoise M^(T) step by step until the number of denoising steps reaches the total number of iteration steps T, obtaining the three-dimensional model point cloud M^(0).
The technical scheme of the invention has the following beneficial effects: compared with existing methods based on variational auto-encoders and generative adversarial networks, the diffusion-model-based method for generating the three-dimensional model point cloud trains stably; because it models the data distribution, the number of points output by the network can be set arbitrarily, without modifying network parameters and retraining.
To improve the correlation between the generated three-dimensional model point cloud and the input view, a cyclic denoising scheme is adopted, which improves the guiding capability of the view while training the denoising network.
The method maintains the quality of the generated three-dimensional model point cloud while using far fewer sampling steps than other existing diffusion-model-based methods, and therefore has higher operation efficiency.
Drawings
FIG. 1 is a schematic flow diagram of one embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-view feature extraction and fusion process for random inactivation according to one embodiment of the invention;
FIG. 3 is a flow chart of image conditional feature vector sampling according to one embodiment of the present invention;
FIG. 4 is a schematic flow chart of cyclic denoising according to one embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further described below by the specific embodiments with reference to the accompanying drawings.
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Referring to fig. 1 to 4, a single-view three-dimensional reconstruction method based on a cyclic diffusion model comprises the following steps:
step S1: extracting and fusing features with an image feature extraction algorithm fused by random inactivation (dropout) to obtain a fused feature vector, and predicting the mean and variance of the image condition feature vector from the fused feature vector;
step S2: sampling from the predicted mean and variance via the reparameterization trick to obtain the image condition feature vector;
step S3: performing farthest point sampling on the triangular patches of the three-dimensional model to obtain a three-dimensional model point cloud, gradually adding noise to the point cloud according to a noise-adding strategy, and training the denoising network in the diffusion model on the noised point cloud; the noise point cloud M^(t) at each diffusion step t of the diffusion process is thereby obtained, and is used to train the denoising network to predict the noise point cloud M^(t-1) at the next diffusion step t-1;
step S4: in each iteration of training the diffusion model, improving the guiding capability of the image condition feature vector over the denoising network through cyclic denoising;
step S5: guiding the denoising network with the single input view to gradually denoise a pure-noise point cloud drawn from the standard Gaussian distribution, finally obtaining a three-dimensional model point cloud consistent with the geometric structure of the single view.
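Step S3 above begins with farthest point sampling of the mesh surface. As an illustrative sketch (not the patent's implementation; the function name, random first pick, and greedy strategy are assumptions), the classic greedy algorithm repeatedly selects the point farthest from those already chosen:

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy farthest-point sampling: repeatedly pick the point
    farthest from the set of points already chosen."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]
    # distance from every point to the nearest chosen point
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(n_samples - 1):
        idx = int(np.argmax(dist))  # farthest remaining point
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return points[np.array(chosen)]

cloud = np.random.default_rng(1).normal(size=(500, 3))
sampled = farthest_point_sampling(cloud, 64)
print(sampled.shape)  # (64, 3)
```

This greedy scheme gives an even spatial cover of the surface samples, which is why it is the usual choice for turning dense mesh samples into a fixed-size training point cloud.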
Compared with existing methods based on variational auto-encoders and generative adversarial networks, the diffusion-model-based method trains stably; because it models the data distribution, the number of points output by the network can be set arbitrarily, without modifying network parameters and retraining.
To improve the correlation between the generated three-dimensional model point cloud and the input view, a cyclic denoising scheme is adopted, which improves the guiding capability of the view while training the denoising network.
The method maintains the quality of the generated three-dimensional model point cloud while using far fewer sampling steps than other existing diffusion-model-based methods, and therefore has higher operation efficiency.
Preferably, step S1 comprises the following sub-steps:
substep S11: input the i rendered views {v_1, v_2, ..., v_i} of the three-dimensional model into a two-dimensional backbone network ψ and extract features to obtain a set {f_1, f_2, ..., f_i} of i view feature vectors, the dimension of each view feature vector being set to 512; randomly inactivate the set of view feature vectors with probability p to obtain the set F of view feature vectors after random inactivation, with the specific formula:

F = VD_p({f_1, f_2, ..., f_i});

wherein: F represents the set of view feature vectors after random inactivation, VD_p denotes the random inactivation (dropout) operation, and p represents the probability of random inactivation;
substep S12: fuse the set F of randomly inactivated view feature vectors by max pooling to obtain the final fused feature vector.
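The random-inactivation fusion of substeps S11 and S12 can be sketched as follows (a minimal NumPy illustration; the function name and the guard that keeps at least one surviving view are assumptions not stated in the patent):

```python
import numpy as np

def fuse_views(features, p=0.25, seed=0):
    """features: (i, 512) array of view feature vectors.
    Randomly drop whole view vectors with probability p (keeping at
    least one), then fuse the survivors by element-wise max pooling."""
    rng = np.random.default_rng(seed)
    keep = rng.random(features.shape[0]) >= p
    if not keep.any():  # guard: never drop every view
        keep[rng.integers(features.shape[0])] = True
    return features[keep].max(axis=0)

views = np.random.default_rng(2).normal(size=(4, 512))
fused = fuse_views(views, p=0.25)
print(fused.shape)  # (512,)
```

Max pooling over whichever views survive the dropout keeps the fused vector's dimension fixed at 512 regardless of how many views are dropped.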
Specifically, step S2 includes the following sub-steps:
substep S21: construct a network for the mean and a network for the variance of the image condition feature vector, each comprising two fully connected layers, a batch normalization layer, and a linear rectification (ReLU) layer; the number of network output channels is set to 256;
substep S22: sample from the mean and variance predicted by the networks to obtain the image condition feature vector used to guide the denoising network, with the specific formula:

z = μ + σ ⊙ ε;

wherein: z represents the image condition feature vector finally used to guide the denoising network, σ represents the variance predicted by the network, μ represents the mean predicted by the network, ε represents noise sampled from the standard Gaussian distribution N(0, I), and I represents the identity matrix.
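The reparameterization trick of substep S22 moves the randomness into an external noise variable ε so that sampling z stays differentiable with respect to the predicted μ and σ. A minimal sketch (function name assumed):

```python
import numpy as np

def reparameterize(mu, sigma, seed=0):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); gradients can
    then flow through mu and sigma during training."""
    eps = np.random.default_rng(seed).standard_normal(mu.shape)
    return mu + sigma * eps

mu = np.zeros(256)
sigma = np.ones(256)
z = reparameterize(mu, sigma)
print(z.shape)  # (256,)
```

With σ = 0 the sample collapses to μ, which is a quick sanity check that the noise enters only through the scale term.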
Preferably, step S3 comprises the following sub-steps:
substep S31: the original three-dimensional model point cloud is M^(0) = {p_i}_{i=1}^{N}, where p_i = (x_i, y_i, z_i) ∈ R^3, N represents the number of points, x_i represents the value of the ith point in the point cloud on the x-axis, y_i its value on the y-axis, and z_i its value on the z-axis;
substep S32: the process of gradually adding noise to the three-dimensional model point cloud is regarded as a Markov chain, and the formula for gradually adding noise to the three-dimensional model point cloud is:

q(M^(1:T) | M^(0)) = ∏_{t=1}^{T} q(M^(t) | M^(t-1));

wherein: T represents the total number of diffusion steps, M^(0) represents the three-dimensional model point cloud, M^(T) is the pure-noise point cloud following the standard Gaussian distribution N(0, I), and M^(t) represents the intermediate noise point cloud at diffusion step t during noise addition; q(M^(t) | M^(t-1)) represents the noise-adding operation, which models the distribution of the noised point cloud M^(t) obtained by adding noise to the point cloud M^(t-1);
substep S33: the noise-adding operation q(M^(t) | M^(t-1)) on the three-dimensional model point cloud at each diffusion step is:

q(M^(t) | M^(t-1)) := N(M^(t); √(1-β_t) M^(t-1), β_t I);

wherein: β_t denotes the hyperparameter controlling the noise intensity according to time step t, and β_t increases linearly from β_1 = 0.0004 to β_T = 0.02; I represents the identity matrix; N represents the number of points; M^(t) represents the intermediate point cloud during noise addition.
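The stepwise noising of substep S33 admits the standard closed form M^(t) = √ᾱ_t · M^(0) + √(1 − ᾱ_t) · ε with α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s, which lets any step be sampled directly. A sketch using the patent's linear schedule (T = 100 is taken from substep S51; the closed form itself is a standard consequence, not stated in the patent):

```python
import numpy as np

T = 100
betas = np.linspace(0.0004, 0.02, T)  # beta_1 .. beta_T, linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)        # abar_t = prod of alphas up to t

def add_noise(m0, t, seed=0):
    """Closed-form forward diffusion: sample M^(t) directly from M^(0)
    as sqrt(abar_t) * M^(0) + sqrt(1 - abar_t) * eps, eps ~ N(0, I).
    t is 1-based."""
    eps = np.random.default_rng(seed).standard_normal(m0.shape)
    ab = alpha_bar[t - 1]
    return np.sqrt(ab) * m0 + np.sqrt(1.0 - ab) * eps

m0 = np.random.default_rng(3).normal(size=(2048, 3))
mT = add_noise(m0, T)
print(mT.shape)  # (2048, 3)
```

At t = T the coefficient √ᾱ_T is small, so M^(T) is close to pure standard Gaussian noise, matching the definition of M^(T) above.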
Specifically, step S4 includes the following substeps:
substep S41: regard M^(0) as an independent sample from the original three-dimensional model point cloud distribution q(M^(0)); z is the image condition feature vector obtained by reparameterization in step S2;
substep S42: the process of gradually denoising towards the three-dimensional model point cloud is likewise treated as a Markov chain, and the formula for gradually denoising the initial three-dimensional Gaussian noise under the guidance of the image condition feature vector is:

p_θ(M^(0:T) | z) = p(M^(T)) ∏_{t=1}^{T} p_θ(M^(t-1) | M^(t), z);

wherein: T represents the total number of diffusion steps, M^(0) represents the original three-dimensional model point cloud, M^(T) is the pure-noise point cloud, M^(t) represents the intermediate point cloud during denoising, and p_θ(M^(t-1) | M^(t), z) represents the denoising operation;
the formula for denoising the point set M^(t) at the current moment into the point set M^(t-1) at the next moment, according to the image condition feature vector z, is:

p_θ(M^(t-1) | M^(t), z) := N(M^(t-1); μ_θ(M^(t), t, z), β_t I);

wherein: μ_θ is the mean of the noise added to the point set at the current time step, as predicted by the denoising network; M^(t) represents the noise point cloud at diffusion step t; β_t represents the noise intensity at diffusion step t; N represents the number of points in the point cloud; I represents the identity matrix;
substep S43: the guiding capability of the image over network denoising at each time step is improved through cyclic diffusion; the cyclic diffusion comprises: first, given the point set M^(t) at a certain time step, take it as the first input point set M^(t)_1 of the cyclic denoising network; one denoising operation yields the input point set M^(t)_2 of the second cycle; after C cycles M^(t)_C is obtained, i.e. the point set M^(t-1) of the next time step is finally output.
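The cyclic diffusion of substep S43 amounts to applying the denoising operation C times within a single time step, feeding each output back as the next input. A sketch (the `denoise_step` callable and its signature are hypothetical stand-ins for the trained denoising network):

```python
import numpy as np

def cyclic_denoise(m_t, t, z, denoise_step, cycles=3):
    """Apply the denoising operation `cycles` times at one time step:
    each pass's output becomes the next pass's input, and the final
    pass yields the point set for time step t-1."""
    x = m_t
    for _ in range(cycles):
        x = denoise_step(x, t, z)
    return x

# toy denoiser that just shrinks the input a little per pass
toy = lambda x, t, z: 0.9 * x
out = cyclic_denoise(np.ones((8, 3)), t=50, z=None, denoise_step=toy, cycles=3)
print(round(float(out[0, 0]), 3))  # 0.729
```

Each extra cycle gives the image condition vector z another chance to steer the intermediate point set, which is the stated purpose of the cyclic scheme.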
Substep S44: the formula for optimizing the network parameters by the gradient descent algorithm is as follows:
wherein: d (D) KL Represents q (M) (t-1) |M (t) ,M (0) ) AndKL divergence between two probability distributions;modeling probability distribution of noise point cloud of the next diffusion step number t-1 by inputting noise point cloud of the diffusion step number t and image condition feature vector into a denoising network; p is p θ (M (0) |M (1) Z) is a noise point cloud M with a spread number of 1 by input (1) Modeling noise point cloud M of next moment by image condition feature vector into denoising network (0) Probability distribution of (2);
q(M (t-1) |M (t) ,M (0) ) For the prior probability distribution, the calculation formula is:
wherein:
preferably, step S5 comprises the following sub-steps:
substep S51: randomly select a view v, v ∈ V, from the atlas V of rendered views in step S1, and set the total number of iteration steps of the diffusion model to T = 100;
substep S52: extract the image features of the single view and sample the image condition feature vector z as in steps S1 and S2;
substep S53: sample a point set M^(T) from the standard Gaussian distribution N(0, I);
substep S54: based on the Euler method, use the trained denoising model to denoise M^(T) step by step until the number of denoising steps reaches the total number of iteration steps T, obtaining the three-dimensional model point cloud.
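Substeps S51 to S54 can be sketched as a single reverse-sampling loop: start from pure Gaussian noise M^(T) and apply the denoising step T times under the guidance of z. The `denoise_step` callable below is a hypothetical stand-in for the trained network (a constant shrink for illustration), not the patent's model:

```python
import numpy as np

def sample_point_cloud(denoise_step, z, n_points=2048, T=100, seed=0):
    """Start from pure Gaussian noise M^(T) and apply the denoising
    step T times, conditioned on the image feature z, to obtain M^(0)."""
    m = np.random.default_rng(seed).standard_normal((n_points, 3))
    for t in range(T, 0, -1):  # t = T, T-1, ..., 1
        m = denoise_step(m, t, z)
    return m

toy = lambda m, t, z: 0.99 * m  # placeholder denoiser
cloud = sample_point_cloud(toy, z=None, n_points=1024, T=100)
print(cloud.shape)  # (1024, 3)
```

Because the number of points appears only as the shape of the initial noise, the same trained network can emit point clouds of any size, which is the decoupling advantage the beneficial-effects section claims.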
In the description herein, reference to the term "embodiment," "example," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The technical principle of the present invention is described above in connection with the specific embodiments. The description is made for the purpose of illustrating the general principles of the invention and should not be taken in any way as limiting the scope of the invention. Other embodiments of the invention will be apparent to those skilled in the art from consideration of this specification without undue burden.

Claims (6)

1. The single-view three-dimensional reconstruction method based on the cyclic diffusion model is characterized by comprising the following steps of:
step S1: extracting and fusing features with an image feature extraction algorithm fused by random inactivation (dropout) to obtain a fused feature vector, and predicting the mean and variance of the image condition feature vector from the fused feature vector;
step S2: sampling from the predicted mean and variance via the reparameterization trick to obtain the image condition feature vector;
step S3: performing farthest point sampling on the triangular patches of the three-dimensional model to obtain a three-dimensional model point cloud, gradually adding noise to the point cloud according to a noise-adding strategy, and training the denoising network in the diffusion model on the noised point cloud;
step S4: in each iteration of training the diffusion model, improving the guiding capability of the image condition feature vector over the denoising network through cyclic denoising;
step S5: guiding the denoising network with the single input view to gradually denoise a pure-noise point cloud drawn from the standard Gaussian distribution, finally obtaining a three-dimensional model point cloud consistent with the geometric structure of the single view.
2. The single view three dimensional reconstruction method based on a cyclic diffusion model according to claim 1, wherein step S1 comprises the sub-steps of:
substep S11: input the i rendered views {v_1, v_2, ..., v_i} of the three-dimensional model into a two-dimensional backbone network ψ and extract features to obtain a set {f_1, f_2, ..., f_i} of i view feature vectors, the dimension of each view feature vector being set to 512; randomly inactivate the set of view feature vectors with probability p to obtain the set F of view feature vectors after random inactivation, with the specific formula:

F = VD_p({f_1, f_2, ..., f_i});

wherein: F represents the set of view feature vectors after random inactivation, VD_p denotes the random inactivation (dropout) operation, and p represents the probability of random inactivation;
substep S12: fuse the set F of randomly inactivated view feature vectors by max pooling to obtain the final fused feature vector.
3. The single view three dimensional reconstruction method based on a cyclic diffusion model according to claim 1, wherein step S2 comprises the sub-steps of:
substep S21: construct a network for the mean and a network for the variance of the image condition feature vector, each comprising two fully connected layers, a batch normalization layer, and a linear rectification (ReLU) layer; the number of network output channels is set to 256;
substep S22: sample from the mean and variance predicted by the networks to obtain the image condition feature vector used to guide the denoising network, with the specific formula:

z = μ + σ ⊙ ε;

wherein: z represents the image condition feature vector finally used to guide the denoising network, σ represents the variance predicted by the network, μ represents the mean predicted by the network, ε represents noise sampled from the standard Gaussian distribution N(0, I), and I represents the identity matrix.
4. The single view three dimensional reconstruction method based on a cyclic diffusion model according to claim 1, wherein step S3 comprises the sub-steps of:
substep S31: the original three-dimensional model point cloud isWherein->N represents the number of points and,representing the value of the ith point in the point cloud on the x-axis; />Representing the value of the ith point in the point cloud on the y-axis; />Representing the value of the ith point in the point cloud on the z-axis;
substep S32: the process of gradually adding noise to the three-dimensional model point cloud is regarded as a Markov chain, and the formula for gradually adding noise to the three-dimensional model point cloud is as follows:
wherein: t represents the total number of diffusion steps, M (0) Representing a three-dimensional model point cloud, M (T) To follow the pure noise point cloud of the standard Gaussian distribution N (0,I), M (t) Indicating the number of diffusion steps in the noise adding process ast, middle noise point cloud; q (M) (t) |M (t-1) ) Representing a noise adding operation, adding noise points to point M (t-1) On the noise point cloud M (t) Modeling the distribution of (2);
substep S33: the formula of the noise-adding operation q(M^(t) | M^(t-1)) applied to the three-dimensional model point cloud at each diffusion step is:

q(M^(t) | M^(t-1)) = N(M^(t); √(1 − β_t) · M^(t-1), β_t · I);

wherein: β_t represents the hyperparameter controlling the noise-adding intensity at time step t, increasing linearly from β_1 = 0.0004 to β_T = 0.02; N denotes the Gaussian distribution; I represents the identity matrix; M^(t) represents the intermediate point cloud during noise addition.
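The forward noising chain of substeps S32–S33 can be sketched directly from the formula; the point count N = 2048 and the random starting cloud are illustrative assumptions:

```python
import numpy as np

T = 100                               # total diffusion steps (matches substep S51)
betas = np.linspace(0.0004, 0.02, T)  # beta_t rises linearly from 0.0004 to 0.02

def q_step(M_prev, beta, rng):
    """One Markov noising step: M^(t) ~ N(sqrt(1 - beta_t) * M^(t-1), beta_t * I)."""
    eps = rng.standard_normal(M_prev.shape)
    return np.sqrt(1.0 - beta) * M_prev + np.sqrt(beta) * eps

rng = np.random.default_rng(0)
M = rng.standard_normal((2048, 3))    # assumed point cloud with N = 2048 points
for beta in betas:
    M = q_step(M, beta, rng)          # after T steps M is close to pure noise
```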
5. The single view three dimensional reconstruction method based on a cyclic diffusion model according to claim 1, wherein step S4 comprises the sub-steps of:
substep S41: denote by M^(0) the original three-dimensional model point cloud, and let z be the image condition feature vector obtained by reparameterization in step S2;
substep S42: the process of gradually denoising the three-dimensional model point cloud is likewise regarded as a Markov chain, and the formula for gradually denoising the initial three-dimensional Gaussian noise under the guidance of the image condition feature vector is:

p_θ(M^(0:T-1) | M^(T), z) = ∏_{t=1}^{T} p_θ(M^(t-1) | M^(t), z);

wherein: T represents the total number of diffusion steps, M^(0) represents the original three-dimensional model point cloud, M^(T) is the pure-noise point cloud, M^(t) represents the intermediate point cloud in the denoising process, and p_θ(M^(t-1) | M^(t), z) represents the denoising operation;
the formula for denoising the point set M^(t) at the current moment according to the image condition feature vector z, obtaining the point set M^(t-1) at the next moment, is:

p_θ(M^(t-1) | M^(t), z) := N(M^(t-1); μ_θ(M^(t), t, z), β_t · I);

wherein: μ_θ is the noise mean predicted by the denoising network for the current time step; M^(t) represents the noise point cloud at diffusion step t; β_t represents the noise-adding intensity at diffusion step t; N denotes the Gaussian distribution; I represents the identity matrix;
substep S43: the guiding capability of the image on network denoising at each time step is improved through cyclic diffusion; the cyclic diffusion comprises: first, given the point set M^(t) at a certain time step, taking it as the first input point set M_1^(t) of the cyclic denoising network; a denoising operation yields the input point set M_2^(t) of the second cycle; after C cycles, M_C^(t) is obtained and output as the point set M^(t-1) of the next time step;
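A minimal sketch of the cyclic denoising of substep S43: the denoising operation is applied C times at one time step and the C-th output is taken as M^(t-1). Here `denoiser(M, t, z)` is a hypothetical stand-in for the trained network predicting μ_θ, and C = 3 is an assumed cycle count:

```python
import numpy as np

def cyclic_denoise_step(M_t, z, t, denoiser, beta_t, C=3, rng=None):
    """Apply the denoising operation C times at one time step (cyclic diffusion);
    each pass samples from N(mu_theta, beta_t * I)."""
    rng = np.random.default_rng(rng)
    M = M_t
    for _ in range(C):
        mu = denoiser(M, t, z)
        noise = rng.standard_normal(M.shape) if t > 1 else np.zeros_like(M)
        M = mu + np.sqrt(beta_t) * noise
    return M

# usage with a dummy denoiser that shrinks the cloud toward the origin
dummy = lambda M, t, z: 0.9 * M
M_prev = cyclic_denoise_step(np.ones((16, 3)), None, t=5, denoiser=dummy,
                             beta_t=0.01, rng=0)
```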
Substep S44: the formula for optimizing the network parameters by the gradient descent algorithm is as follows:
wherein: d (D) KL Represents q (M) (t-1) |M (t) ,M (0) ) AndKL divergence between two probability distributions;modeling probability distribution of noise point cloud of the next diffusion step number t-1 by inputting noise point cloud of the diffusion step number t and image condition feature vector into a denoising network;
p θ (M (0) |M (1) z) is a noise point cloud M with a spread number of 1 by input (1) Modeling noise point cloud M of next moment by image condition feature vector into denoising network (0) Probability distribution of (2);
q(M^(t-1) | M^(t), M^(0)) is the prior probability distribution, with the calculation formula:

q(M^(t-1) | M^(t), M^(0)) = N(M^(t-1); μ̃_t(M^(t), M^(0)), β̃_t · I),
μ̃_t(M^(t), M^(0)) = (√ᾱ_{t-1} · β_t / (1 − ᾱ_t)) · M^(0) + (√α_t · (1 − ᾱ_{t-1}) / (1 − ᾱ_t)) · M^(t),
β̃_t = ((1 − ᾱ_{t-1}) / (1 − ᾱ_t)) · β_t;

wherein: α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s.
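For the prior distribution q(M^(t-1) | M^(t), M^(0)), its mean and variance follow in closed form from α_t = 1 − β_t; a numpy sketch, assuming the standard DDPM-style posterior and 1-based indexing of t:

```python
import numpy as np

def posterior_params(M0, Mt, t, betas):
    """Mean and variance of q(M^(t-1) | M^(t), M^(0)) for a chain with
    alpha_t = 1 - beta_t and alpha_bar_t = prod_{s<=t} alpha_s."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    ab_t = alpha_bars[t - 1]                        # alpha_bar_t (t is 1-based)
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0   # alpha_bar_{t-1}
    beta_t, a_t = betas[t - 1], alphas[t - 1]
    mean = (np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)) * M0 \
         + (np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * Mt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return mean, var

betas = np.linspace(0.0004, 0.02, 100)
mean, var = posterior_params(np.zeros((4, 3)), np.ones((4, 3)), t=10, betas=betas)
```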
6. The single view three dimensional reconstruction method based on a cyclic diffusion model according to claim 1, wherein step S5 comprises the sub-steps of:
substep S51: randomly selecting a view v, v ∈ V, from the atlas V of rendered views in step S1, and setting the total number of iteration steps of the diffusion model to T = 100;
substep S52: extracting the image features of the single view through step S2 and step S3, and sampling to obtain the image condition feature vector z;
substep S53: sampling an initial point set M^(T) from the standard Gaussian distribution N(0, I);
substep S54: based on the Euler method, using the trained denoising model, denoising M^(T) step by step until the number of denoising steps reaches the total number of iteration steps T, obtaining the three-dimensional model point cloud M^(0).
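The inference loop of substeps S53–S54 can be sketched as follows; `denoiser(M, t, z)` is again a hypothetical stand-in for the trained network's predicted mean μ_θ, and the point count is an assumption:

```python
import numpy as np

def sample_point_cloud(denoiser, z, n_points=2048, T=100, rng=None):
    """Reverse diffusion: start from M^(T) ~ N(0, I) and denoise step by step
    to M^(0), sampling from N(mu_theta, beta_t * I) at each step."""
    rng = np.random.default_rng(rng)
    betas = np.linspace(0.0004, 0.02, T)
    M = rng.standard_normal((n_points, 3))        # initial Gaussian point set M^(T)
    for t in range(T, 0, -1):
        mu = denoiser(M, t, z)
        noise = rng.standard_normal(M.shape) if t > 1 else np.zeros_like(M)
        M = mu + np.sqrt(betas[t - 1]) * noise    # one reverse diffusion step
    return M

# usage with a dummy denoiser that contracts the cloud
cloud = sample_point_cloud(lambda M, t, z: 0.95 * M, z=None, n_points=64, rng=0)
```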
CN202311650864.3A 2023-12-04 2023-12-04 Single-view three-dimensional reconstruction method based on cyclic diffusion model Pending CN117671146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311650864.3A CN117671146A (en) 2023-12-04 2023-12-04 Single-view three-dimensional reconstruction method based on cyclic diffusion model

Publications (1)

Publication Number Publication Date
CN117671146A true CN117671146A (en) 2024-03-08

Family

ID=90076446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311650864.3A Pending CN117671146A (en) 2023-12-04 2023-12-04 Single-view three-dimensional reconstruction method based on cyclic diffusion model

Country Status (1)

Country Link
CN (1) CN117671146A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953544A (en) * 2024-03-26 2024-04-30 安徽农业大学 Target behavior monitoring method and system


Similar Documents

Publication Publication Date Title
Labach et al. Survey of dropout methods for deep neural networks
CN110378844B (en) Image blind motion blur removing method based on cyclic multi-scale generation countermeasure network
WO2021027759A1 (en) Facial image processing
WO2022267641A1 (en) Image defogging method and system based on cyclic generative adversarial network
CN109165735B (en) Method for generating sample picture based on generation of confrontation network and adaptive proportion
CN110992351B (en) sMRI image classification method and device based on multi-input convolution neural network
Li et al. Exploring compositional high order pattern potentials for structured output learning
CN110516724B (en) High-performance multi-layer dictionary learning characteristic image processing method for visual battle scene
CN113112534B (en) Three-dimensional biomedical image registration method based on iterative self-supervision
CN112215339B (en) Medical data expansion method based on generation countermeasure network
CN114663685B (en) Pedestrian re-recognition model training method, device and equipment
CN112883756A (en) Generation method of age-transformed face image and generation countermeasure network model
JP2019197311A (en) Learning method, learning program, and learning device
Sun et al. Deep Evolutionary 3D Diffusion Heat Maps for Large-pose Face Alignment.
CN112509154B (en) Training method of image generation model, image generation method and device
CN117671146A (en) Single-view three-dimensional reconstruction method based on cyclic diffusion model
CN116258632A (en) Text image super-resolution reconstruction method based on text assistance
CN116091885A (en) RAU-GAN-based lung nodule data enhancement method
WO2022236647A1 (en) Methods, devices, and computer readable media for training a keypoint estimation network using cgan-based data augmentation
CN104134091B (en) Neural network training method
CN113393582A (en) Three-dimensional object reconstruction algorithm based on deep learning
CN112784800A (en) Face key point detection method based on neural network and shape constraint
CN112581513A (en) Cone beam computed tomography image feature extraction and corresponding method
US20220343162A1 (en) Method for structure learning and model compression for deep neural network
KR20240005426A (en) Method for extracting the center line of a heart wall

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination