CN113936117B - High-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning - Google Patents

High-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning

Info

Publication number
CN113936117B
CN113936117B (application CN202111524515.8A)
Authority
CN
China
Prior art keywords
layer
surface normal
attention weight
network
reconstructed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111524515.8A
Other languages
Chinese (zh)
Other versions
CN113936117A (en)
Inventor
Yakun Ju (举雅琨)
Junyu Dong (董军宇)
Feng Gao (高峰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China
Priority to CN202111524515.8A
Publication of CN113936117A
Application granted
Publication of CN113936117B
Active legal status
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/30: Polynomial surface description
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The method takes multiple images of an object to be reconstructed with a photometric stereo system and uses a deep learning algorithm to output an accurate surface normal three-dimensional reconstruction. A surface normal generation network generates the surface normal of the object to be reconstructed from the images and illuminations; an attention weight generation network generates an attention weight map of the object from the images; an attention weight loss function is processed pixel by pixel; the trained network is then used for surface normal reconstruction of photometric stereo images. The invention learns the surface normal and the high-frequency information separately through the proposed surface normal generation network and attention weight generation network, and trains with the proposed attention weight loss, thereby improving reconstruction accuracy on high-frequency surface regions such as wrinkles and edges. Compared with traditional photometric stereo methods, the three-dimensional reconstruction accuracy is improved, particularly in the surface details of the object to be reconstructed.

Description

High-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning
Technical Field
The invention relates to a high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning, and belongs to the field of photometric three-dimensional reconstruction.
Background
Three-dimensional reconstruction is a fundamental and important problem in computer vision. Photometric stereo is a high-precision, pixel-wise three-dimensional reconstruction method that recovers the surface normal of an object from the intensity-variation cues provided by images taken under different illumination directions. Photometric stereo plays an irreplaceable role in many high-precision three-dimensional reconstruction tasks and has important application value in archaeological exploration, pipeline inspection, fine seabed mapping, and similar areas.
However, existing deep learning-based photometric stereo methods produce large errors in high-frequency regions of the object surface, such as wrinkles and edges; they generate blurred three-dimensional reconstruction results in exactly those regions where accurate reconstruction matters most.
Disclosure of Invention
In view of the above problems, the present invention provides a high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning, so as to overcome the disadvantages of the prior art.
The high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning comprises the following steps:
1) using a photometric stereo system, taking several images of the object to be reconstructed:
an image of the object to be reconstructed is taken under the illumination of a single parallel white light source; a Cartesian coordinate system is established with the center of the object to be reconstructed as the origin, and the position of the white light source is represented by a vector l = [x, y, z] in this coordinate system;
changing the position of the light source yields an image under another illumination direction; usually at least 10 images under different illumination directions are taken, recorded as m_1, m_2, ..., m_j, with the corresponding light source positions recorded as l_1, l_2, ..., l_j, where j is a natural number greater than or equal to 10;
2) using a deep learning algorithm with inputs m_1, m_2, ..., m_j and l_1, l_2, ..., l_j, outputting an accurate surface normal three-dimensional reconstruction:
the deep learning algorithm is divided into four parts: (1) a surface normal generation network, (2) an attention weight generation network, (3) joint training with an attention weight loss function, and (4) network training; wherein:
(1) the surface normal generation network is designed to generate the surface normal $\tilde{n}$ of the object to be reconstructed from the images m_1, m_2, ..., m_j and the illuminations l_1, l_2, ..., l_j;
(2) the attention weight generation network is designed to generate an attention weight map P of the object to be reconstructed from the images m_1, m_2, ..., m_j;
(3) the attention weight loss L is a pixel-wise loss function, obtained by averaging the per-pixel losses L_k:

$L = \frac{1}{p \times q} \sum_{k=1}^{p \times q} L_k$

where p×q is the resolution of image m, with p, q ≥ 2^n and n ≥ 4;
the loss L_k at each pixel position comprises two parts: the first part is the gradient loss L_gradient with its coefficient term, and the second part is the normal loss L_normal with its coefficient term, i.e.

$L_k = P_k L_{gradient} + \lambda (1 - P_k) L_{normal}$

wherein

$L_{gradient} = \left\| \nabla n_k - \nabla \tilde{n}_k \right\|^2$

$\nabla n_k$ is the gradient of the true surface normal n of the object to be reconstructed at position k, ζ is the neighborhood pixel range used in computing the gradient and is set within {1, 2, 3, 4, 5}, and $\nabla \tilde{n}_k$ is the gradient of the predicted surface normal $\tilde{n}$ at position k; $\tilde{n}$ denotes the surface normal predicted by the network, and n denotes the true surface normal;

the gradient loss sharpens the high-frequency representation of the surface normal in the network; P_k is the value of the attention weight map at pixel position k;

second,

$L_{normal} = 1 - n_k \bullet \tilde{n}_k$

where ● denotes the dot product operation, and λ is a hyperparameter that balances the gradient loss and the normal loss, set within the range {7, 8, 9, 10};
the surface normal generation network (1) and the attention weight generation network (2) are linked through the attention weight loss (3);
(4) network training
when training the network, a back propagation algorithm continuously adjusts and optimizes the parameters to minimize the loss function; training stops when the set number of epochs is reached, at which point the optimal effect is achieved; alternatively, when L_normal falls below 0.03, training is considered to have reached its best result and is stopped;
3) the trained network is used for surface normal reconstruction of photometric stereo images:
first, s images under different illumination directions are taken, s ≥ 10; then m_1, m_2, ..., m_s and l_1, l_2, ..., l_s are input into the trained network to obtain the predicted surface normal $\tilde{n}$.
The surface normal generation network (1) is designed to generate the surface normal $\tilde{n}$ of the object to be reconstructed from the images m_1, m_2, ..., m_j and the illuminations l_1, l_2, ..., l_j, with the following specific steps:
the resolution of image m is denoted p×q, with p, q ≥ 2^n and n ≥ 4, so that m ∈ ℝ^(p×q×3), where 3 denotes the RGB channels; the surface normal generation network first tiles the illumination l = [x, y, z] ∈ ℝ^3 repeatedly to fill a space ℝ^(p×q×3) according to the resolution p×q of m; the filled illumination is recorded as h, so h ∈ ℝ^(p×q×3) and h and m have the same spatial size; h and m are concatenated along the third dimension to form a new tensor belonging to ℝ^(p×q×6); with j input images and illuminations, j fused tensors are obtained;
these tensors each pass through 4 convolution layers; the kernel sizes of convolution layers 1, 2, 3 and 4 are all 3×3 with 'relu' activation functions, layers 2 and 4 are convolutions with stride 2, layers 1 and 3 are convolutions with stride 1, and the numbers of feature channels of convolution layers 1, 2, 3 and 4 are 64, 128, 128 and 256 respectively;
then a max pooling layer pools the j tensors ∈ ℝ^(p/4×q/4×256) produced by the 4 convolution layers into one tensor ∈ ℝ^(p/4×q/4×256);
computation continues through convolution layers 5, 6, 7 and 8, whose kernels are all 3×3 with 'relu' activation functions; layers 5 and 7 are transposed convolutions, layers 6 and 8 are convolutions with stride 1, and the numbers of feature channels of convolution layers 5, 6, 7 and 8 are 128, 128, 64 and 3 respectively;
finally, the tensor produced by convolution layer 8 is normalized to modulus 1, giving the surface normal $\tilde{n}$ of the object to be reconstructed.
The attention weight generation network (2) is designed to generate the attention weight map P of the object to be reconstructed from the images m_1, m_2, ..., m_j, with the following specific steps:
the attention weight generation network computes, for each image m ∈ ℝ^(p×q×3), its gradient values, which also belong to the space ℝ^(p×q×3); the gradient of the image is concatenated and fused with the image along the third dimension to form a new tensor belonging to ℝ^(p×q×6); with j input images and illuminations, j fused tensors are obtained;
first, the fused tensors each pass through 3 convolution layers; the kernel sizes of all 3 layers are 3×3 with 'relu' activation functions, layer 2 has stride 2, layers 1 and 3 have stride 1, and the numbers of feature channels of the three convolution layers are 64, 128 and 128 respectively;
then a max pooling layer pools the j tensors ∈ ℝ^(p/2×q/2×128) produced by the 3 convolution layers into one tensor ∈ ℝ^(p/2×q/2×128);
computation continues through convolution layers 5, 6 and 7, whose kernels are all 3×3 with 'relu' activation functions; layer 6 is a transposed convolution, layers 5 and 7 are convolutions with stride 1, and the numbers of feature channels of convolution layers 5, 6 and 7 are 128, 64 and 1 respectively, yielding the attention weight map P of the object to be reconstructed.
In the above high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning, in the resolution p×q of image m, p takes a value of 16, 32, 48 or 64, and q takes a value of 16, 32, 48 or 64.
In the above high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning, ζ is set to 1.
In the above high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning, λ is set to 8.
In the above high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning, the number of training cycles is set to 30 epochs.
In the above high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning, p takes the value 32 and q takes the value 32.
According to the high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning provided by the invention, the surface normal and the high-frequency information are learned separately through the surface normal generation network and the attention weight generation network, and training with the proposed attention weight loss improves the reconstruction accuracy on high-frequency surface regions such as wrinkles and edges. Compared with traditional photometric stereo methods, the three-dimensional reconstruction accuracy is improved, particularly in the surface details of the object to be reconstructed.
The attention weight loss provided by the invention can also be applied to various low-level vision tasks, such as depth estimation, image deblurring, and image dehazing, improving task accuracy and enriching image details.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of the surface normal generation network in step 2).
Fig. 3 is a schematic diagram of the attention weight generation network in step 2).
Fig. 4 is a schematic diagram of the application effect of the present invention, in which the first row shows the input images, the second row the generated attention weight maps, and the third row the generated surface normals.
Detailed Description
As shown in fig. 1, the high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning comprises the following steps:
1) using a photometric stereo system, taking several images of the object to be reconstructed:
an image of the object to be reconstructed is taken under the illumination of a single parallel white light source; a Cartesian coordinate system is established with the center of the object to be reconstructed as the origin, and the position of the white light source is represented by a vector l = [x, y, z] in this coordinate system;
changing the position of the light source yields an image under another illumination direction; usually at least 10 images under different illumination directions are taken, recorded as m_1, m_2, ..., m_j, with the corresponding light source positions recorded as l_1, l_2, ..., l_j, where j is a natural number greater than or equal to 10;
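The capture layout of step 1) can be summarized in a short sketch; this is a non-authoritative illustration, and the array names (images, lights) are hypothetical:

```python
# A minimal sketch of the step 1) data layout; names are illustrative only.
import numpy as np

j = 10                  # number of illumination directions, j >= 10
p, q = 32, 32           # image resolution p*q, with p, q in {16, 32, 48, 64}

# j RGB images m_1, ..., m_j, one per light position
images = np.zeros((j, p, q, 3), dtype=np.float32)

# j light positions l_1, ..., l_j = [x, y, z] in the object-centered
# Cartesian coordinate system
lights = np.zeros((j, 3), dtype=np.float32)
```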
2) using a deep learning algorithm with inputs m_1, m_2, ..., m_j and l_1, l_2, ..., l_j, outputting an accurate surface normal three-dimensional reconstruction:
the deep learning algorithm is divided into four parts: (1) a surface normal generation network, (2) an attention weight generation network, (3) joint training with an attention weight loss function, and (4) network training;
(1) the surface normal generation network is designed to generate the surface normal $\tilde{n}$ of the object to be reconstructed from the images m_1, m_2, ..., m_j and the illuminations l_1, l_2, ..., l_j;
the resolution of image m is denoted p×q, with p, q ≥ 2^n and n ≥ 4, so that m ∈ ℝ^(p×q×3), where 3 denotes the RGB channels; as shown in FIG. 2, the surface normal generation network first tiles the illumination l = [x, y, z] ∈ ℝ^3 repeatedly to fill a space ℝ^(p×q×3) according to the resolution p×q of m; the filled illumination is recorded as h, so h ∈ ℝ^(p×q×3) and h and m have the same spatial size; h and m are concatenated along the third dimension to form a new tensor belonging to ℝ^(p×q×6); with j input images and illuminations, j fused tensors are obtained;
these tensors each pass through 4 convolution layers; the kernel sizes of convolution layers 1, 2, 3 and 4 are all 3×3 with 'relu' activation functions, layers 2 and 4 are convolutions with stride 2, layers 1 and 3 are convolutions with stride 1, and the numbers of feature channels of convolution layers 1, 2, 3 and 4 are 64, 128, 128 and 256 respectively;
then a max pooling layer pools the j tensors ∈ ℝ^(p/4×q/4×256) produced by the 4 convolution layers into one tensor ∈ ℝ^(p/4×q/4×256);
computation continues through convolution layers 5, 6, 7 and 8, whose kernels are all 3×3 with 'relu' activation functions; layers 5 and 7 are transposed convolutions, layers 6 and 8 are convolutions with stride 1, and the numbers of feature channels of convolution layers 5, 6, 7 and 8 are 128, 128, 64 and 3 respectively;
finally, the tensor produced by convolution layer 8 is normalized to modulus 1, giving the predicted surface normal $\tilde{n}$.
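A minimal PyTorch sketch of the surface normal generation network described above is given below. It follows the stated kernel sizes, strides, and channel counts; the decoder channel sequence 128, 128, 64, 3 and the transposed-convolution padding are inferred rather than given in the source, so treat this as an assumption-laden illustration rather than the definitive implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SurfaceNormalNet(nn.Module):
    def __init__(self):
        super().__init__()
        # layers 1-4: 3x3 kernels, ReLU; layers 2 and 4 use stride 2,
        # layers 1 and 3 use stride 1; channels 64, 128, 128, 256
        self.enc = nn.Sequential(
            nn.Conv2d(6, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        # layers 5-8: layers 5 and 7 are transposed convolutions (stride 2
        # assumed, to undo the two stride-2 encoder layers), layers 6 and 8
        # are stride-1 convolutions; channels 128, 128, 64, 3 (assumed)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            # final ReLU follows the text as written; note it constrains
            # the raw normal components to be non-negative
            nn.Conv2d(64, 3, 3, stride=1, padding=1), nn.ReLU(),
        )

    def forward(self, images, lights):
        # images: (j, 3, p, q); lights: (j, 3)
        j, _, p, q = images.shape
        # tile each light l = [x, y, z] over the p*q resolution -> h
        h = lights.view(j, 3, 1, 1).expand(j, 3, p, q)
        fused = torch.cat([images, h], dim=1)    # j tensors in R^(p*q*6)
        feat = self.enc(fused)                   # j tensors in R^(p/4*q/4*256)
        # max-pool the j feature tensors into a single tensor
        feat = feat.max(dim=0, keepdim=True).values
        out = self.dec(feat)                     # (1, 3, p, q)
        # normalize to modulus 1 to obtain the predicted surface normal
        return F.normalize(out, p=2, dim=1)
```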
(2) the attention weight generation network is designed to generate the attention weight map P of the object to be reconstructed from the images m_1, m_2, ..., m_j:
the attention weight generation network computes, for each image m ∈ ℝ^(p×q×3), its gradient values, which also belong to the space ℝ^(p×q×3); as shown in FIG. 3, the gradient is fused with the image along the third dimension to form a new tensor belonging to ℝ^(p×q×6); with j input images and illuminations, j fused tensors are obtained;
first, the fused tensors each pass through 3 convolution layers; the kernel sizes of all 3 layers are 3×3 with 'relu' activation functions, layer 2 has stride 2, layers 1 and 3 have stride 1, and the numbers of feature channels of the three convolution layers are 64, 128 and 128 respectively;
then a max pooling layer pools the j tensors ∈ ℝ^(p/2×q/2×128) produced by the 3 convolution layers into one tensor ∈ ℝ^(p/2×q/2×128);
computation continues through convolution layers 5, 6 and 7, whose kernels are all 3×3 with 'relu' activation functions; layer 6 is a transposed convolution, layers 5 and 7 are convolutions with stride 1, and the numbers of feature channels of convolution layers 5, 6 and 7 are 128, 64 and 1 respectively, yielding the attention weight map P of the object to be reconstructed.
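The attention weight generation network admits a similar sketch. The forward-difference image gradient is one plausible choice the text leaves open, and the final sigmoid (so that P_k and 1 - P_k are valid weights; the text itself states 'relu') is an assumption:

```python
import torch
import torch.nn as nn

def image_gradient(m):
    # m: (j, 3, p, q); forward differences along x and y, summed per channel,
    # giving a gradient tensor that also lies in R^(p*q*3)
    gx = torch.zeros_like(m)
    gy = torch.zeros_like(m)
    gx[:, :, :, :-1] = m[:, :, :, 1:] - m[:, :, :, :-1]
    gy[:, :, :-1, :] = m[:, :, 1:, :] - m[:, :, :-1, :]
    return gx + gy

class AttentionWeightNet(nn.Module):
    def __init__(self):
        super().__init__()
        # layers 1-3: 3x3 kernels, ReLU; layer 2 stride 2; channels 64, 128, 128
        self.enc = nn.Sequential(
            nn.Conv2d(6, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.ReLU(),
        )
        # layers 5-7: layer 6 is a transposed convolution (stride 2 assumed),
        # layers 5 and 7 are stride-1 convolutions; channels 128, 64, 1
        self.dec = nn.Sequential(
            nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, stride=1, padding=1),
            nn.Sigmoid(),  # assumed, so that P_k lies in [0, 1]
        )

    def forward(self, images):
        # fuse each image with its gradient along the channel dimension
        fused = torch.cat([images, image_gradient(images)], dim=1)  # R^(p*q*6)
        feat = self.enc(fused)                  # j tensors in R^(p/2*q/2*128)
        feat = feat.max(dim=0, keepdim=True).values
        return self.dec(feat)                   # attention weight map P, (1, 1, p, q)
```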
(3) the attention weight loss L is a pixel-wise loss function, obtained by averaging the per-pixel losses L_k:

$L = \frac{1}{p \times q} \sum_{k=1}^{p \times q} L_k$
the loss L_k at each pixel position comprises two parts: the first part is the gradient loss L_gradient with its coefficient term, and the second part is the normal loss L_normal with its coefficient term, i.e.

$L_k = P_k L_{gradient} + \lambda (1 - P_k) L_{normal}$

wherein

$L_{gradient} = \left\| \nabla n_k - \nabla \tilde{n}_k \right\|^2$

$\nabla n_k$ is the gradient of the true surface normal n of the object to be reconstructed at position k, ζ is the neighborhood pixel range used in computing the gradient, set within {1, 2, 3, 4, 5} with a default of 1 in the invention, and $\nabla \tilde{n}_k$ is the gradient of the predicted surface normal $\tilde{n}$ at position k; $\tilde{n}$ denotes the surface normal predicted by the network, and n denotes the true surface normal;

the gradient loss sharpens the high-frequency representation of the surface normal in the network; P_k is the value of the attention weight map at pixel position k, and in the pixel-wise attention weight loss L_k it weights the first component, the gradient loss L_gradient: where the attention weight is large, the gradient loss receives a large weight;
second,

$L_{normal} = 1 - n_k \bullet \tilde{n}_k$

where ● denotes the dot product operation; λ is a hyperparameter that balances the gradient loss and the normal loss and is set to 8 here; in general it can be set within {7, 8, 9, 10}, and taking 8 gives the best effect;
the surface normal generation network (1) and the attention weight generation network (2) are linked through the attention weight loss (3);
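Under the reconstruction of the formulas above (a squared gradient difference for L_gradient and one minus the dot product for L_normal; the exact norm in the original equation images is not recoverable), the attention weight loss can be sketched as:

```python
import torch

def gradient(n, zeta=1):
    # finite-difference gradient of a normal map n: (1, 3, p, q),
    # with neighborhood pixel range zeta
    gx = n[:, :, :, zeta:] - n[:, :, :, :-zeta]
    gy = n[:, :, zeta:, :] - n[:, :, :-zeta, :]
    return gx, gy

def attention_weight_loss(n_pred, n_true, P, lam=8.0, zeta=1):
    # n_pred, n_true: (1, 3, p, q); P: (1, 1, p, q)
    gx_p, gy_p = gradient(n_pred, zeta)
    gx_t, gy_t = gradient(n_true, zeta)
    # pixel-wise gradient loss, accumulated back onto the p x q grid
    l_grad = torch.zeros_like(P)
    l_grad[:, :, :, :-zeta] += ((gx_p - gx_t) ** 2).sum(dim=1, keepdim=True)
    l_grad[:, :, :-zeta, :] += ((gy_p - gy_t) ** 2).sum(dim=1, keepdim=True)
    # pixel-wise normal loss: 1 - n_k . n~_k
    l_norm = 1.0 - (n_pred * n_true).sum(dim=1, keepdim=True)
    # L_k = P_k * L_gradient + lambda * (1 - P_k) * L_normal, averaged over p*q
    return (P * l_grad + lam * (1.0 - P) * l_norm).mean()
```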
(4) network training
when training the network, a back propagation algorithm continuously adjusts and optimizes the parameters to minimize the loss function; training stops on reaching 30 epochs (cycles), at which point the optimal effect is achieved; alternatively, when L_normal falls below 0.03, training is considered to have reached its best result and is stopped;
in the invention, training of the network finishes after 30 epochs, at which point training is considered to have achieved the optimal effect;
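A training loop consistent with part (4), reusing the sketches above, might look as follows; the dataset loader, the Adam optimizer, and the learning rate are assumptions not stated in the source:

```python
import torch

normal_net = SurfaceNormalNet()
attn_net = AttentionWeightNet()
params = list(normal_net.parameters()) + list(attn_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)   # optimizer and lr assumed

for epoch in range(30):                          # training stops after 30 epochs
    # train_loader (assumed) yields one object per step:
    # images (j, 3, p, q), lights (j, 3), n_true (1, 3, p, q)
    for images, lights, n_true in train_loader:
        n_pred = normal_net(images, lights)
        P = attn_net(images)
        loss = attention_weight_loss(n_pred, n_true, P, lam=8.0, zeta=1)
        optimizer.zero_grad()
        loss.backward()                          # back propagation
        optimizer.step()
    # alternatively, stop early once the normal loss falls below 0.03
```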
3) the trained network is used for surface normal reconstruction of photometric stereo images:
first, s images under different illumination directions are taken, s ≥ 10; then m_1, m_2, ..., m_s and l_1, l_2, ..., l_s are input into the trained network to obtain the predicted surface normal $\tilde{n}$,
where p, q ∈ {16, 32, 48, 64}, λ ∈ {7, 8, 9, 10}, and ζ can be 1, 2, 3, 4 or 5.
The reconstruction effect is shown in fig. 4. The first row shows the images taken of the objects to be reconstructed, the second row the generated attention weight maps P, and the third row the generated surface normals $\tilde{n}$.
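Inference, step 3), then reduces to a single forward pass through the trained surface normal generation network; images_s and lights_s below are assumed, pre-loaded inputs:

```python
import torch

normal_net.eval()
with torch.no_grad():
    # images_s: (s, 3, p, q); lights_s: (s, 3), with s >= 10
    n_pred = normal_net(images_s, lights_s)  # predicted surface normal, (1, 3, p, q)
```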

Claims (8)

1. A high-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning, characterized by comprising the following steps:
1) using a photometric stereo system, taking several images of the object to be reconstructed:
an image of the object to be reconstructed is taken under the illumination of a single parallel white light source; a Cartesian coordinate system is established with the center of the object to be reconstructed as the origin, and the position of the white light source is represented by a vector l = [x, y, z] in this coordinate system;
changing the position of the light source yields an image under another illumination direction; usually at least 10 images under different illumination directions are taken, recorded as m_1, m_2, ..., m_j, with the corresponding light source positions recorded as l_1, l_2, ..., l_j, where j is a natural number greater than or equal to 10;
2) using a deep learning algorithm with inputs m_1, m_2, ..., m_j and l_1, l_2, ..., l_j, outputting an accurate surface normal three-dimensional reconstruction:
the deep learning algorithm is divided into four parts: (1) a surface normal generation network, (2) an attention weight generation network, (3) joint training with an attention weight loss function, and (4) network training; wherein:
(1) the surface normal generation network is designed to generate the surface normal $\tilde{n}$ of the object to be reconstructed from the images m_1, m_2, ..., m_j and the illuminations l_1, l_2, ..., l_j;
(2) the attention weight generation network is designed to generate an attention weight map P of the object to be reconstructed from the images m_1, m_2, ..., m_j;
(3) the attention weight loss L is a pixel-wise loss function, obtained by averaging the per-pixel losses L_k:

$L = \frac{1}{p \times q} \sum_{k=1}^{p \times q} L_k$

where p×q is the resolution of image m, with p, q ≥ 2^n and n ≥ 4;
the loss L_k at each pixel position comprises two parts: the first part is the gradient loss L_gradient with its coefficient term, and the second part is the normal loss L_normal with its coefficient term, i.e.

$L_k = P_k L_{gradient} + \lambda (1 - P_k) L_{normal}$

wherein

$L_{gradient} = \left\| \nabla n_k - \nabla \tilde{n}_k \right\|^2$

$\nabla n_k$ is the gradient of the true surface normal n of the object to be reconstructed at position k;

ζ is the neighborhood pixel range used in computing the gradient, set within {1, 2, 3, 4, 5};

$\nabla \tilde{n}_k$ is the gradient of the predicted surface normal $\tilde{n}$ at position k;

$\tilde{n}$ denotes the surface normal predicted by the network, and n denotes the true surface normal;

P_k is the value of the attention weight map at pixel position k;

second,

$L_{normal} = 1 - n_k \bullet \tilde{n}_k$

where ● denotes the dot product operation, and λ is a hyperparameter set within the range {7, 8, 9, 10};
the surface normal generation network (1) and the attention weight generation network (2) are linked through the attention weight loss (3);
(4) network training
when training the network, a back propagation algorithm continuously adjusts and optimizes the parameters to minimize the loss function; training stops when the set number of epochs is reached, at which point the optimal effect is achieved; alternatively, when L_normal falls below 0.03, training is considered to have reached its best result and is stopped;
3) the trained network is used for surface normal reconstruction of photometric stereo images:
first, s images under different illumination directions are taken, s ≥ 10; then m_1, m_2, ..., m_s and l_1, l_2, ..., l_s are input into the trained network to obtain the predicted surface normal $\tilde{n}$.
2. The deep learning-based high-frequency region enhanced photometric stereo three-dimensional reconstruction method according to claim 1, wherein the surface normal generation network (1) is designed to generate the surface normal $\tilde{n}$ of the object to be reconstructed from the images m_1, m_2, ..., m_j and the illuminations l_1, l_2, ..., l_j, with the following specific steps:
the resolution of image m is denoted p×q, with p, q ≥ 2^n and n ≥ 4, so that m ∈ ℝ^(p×q×3), where 3 denotes the RGB channels; the surface normal generation network first tiles the illumination l = [x, y, z] ∈ ℝ^3 repeatedly to fill a space ℝ^(p×q×3) according to the resolution p×q of m; the filled illumination is recorded as h, so h ∈ ℝ^(p×q×3) and h and m have the same spatial size; h and m are concatenated along the third dimension to form a new tensor belonging to ℝ^(p×q×6); with j input images and illuminations, j fused tensors are obtained;
these tensors each pass through 4 convolution layers; the kernel sizes of convolution layers 1, 2, 3 and 4 are all 3×3 with 'relu' activation functions, layers 2 and 4 are convolutions with stride 2, layers 1 and 3 are convolutions with stride 1, and the numbers of feature channels of convolution layers 1, 2, 3 and 4 are 64, 128, 128 and 256 respectively;
then a max pooling layer pools the j tensors ∈ ℝ^(p/4×q/4×256) produced by the 4 convolution layers into one tensor ∈ ℝ^(p/4×q/4×256);
computation continues through convolution layers 5, 6, 7 and 8, whose kernels are all 3×3 with 'relu' activation functions; layers 5 and 7 are transposed convolutions, layers 6 and 8 are convolutions with stride 1, and the numbers of feature channels of convolution layers 5, 6, 7 and 8 are 128, 128, 64 and 3 respectively;
finally, the tensor produced by convolution layer 8 is normalized to modulus 1, giving the surface normal $\tilde{n}$ of the object to be reconstructed.
3. The deep learning-based high-frequency region enhanced photometric stereo three-dimensional reconstruction method according to claim 1, wherein the attention weight generation network (2) is designed to generate the attention weight map P of the object to be reconstructed from the images m_1, m_2, ..., m_j, with the following specific steps:
the attention weight generation network computes, for each image m ∈ ℝ^(p×q×3), its gradient values, which also belong to the space ℝ^(p×q×3); the gradient of the image is concatenated and fused with the image along the third dimension to form a new tensor belonging to ℝ^(p×q×6); with j input images and illuminations, j fused tensors are obtained;
first, the fused tensors each pass through 3 convolution layers; the kernel sizes of all 3 layers are 3×3 with 'relu' activation functions, layer 2 has stride 2, layers 1 and 3 have stride 1, and the numbers of feature channels of the 3 convolution layers are 64, 128 and 128 respectively;
then a max pooling layer pools the j tensors ∈ ℝ^(p/2×q/2×128) produced by the 3 convolution layers into one tensor ∈ ℝ^(p/2×q/2×128);
computation continues through convolution layers 5, 6 and 7, whose kernels are all 3×3 with 'relu' activation functions; layer 6 is a transposed convolution, layers 5 and 7 are convolutions with stride 1, and the numbers of feature channels of convolution layers 5, 6 and 7 are 128, 64 and 1 respectively, yielding the attention weight map P of the object to be reconstructed.
4. The deep learning-based high-frequency region enhanced photometric stereo three-dimensional reconstruction method according to claim 1, wherein in the resolution p×q of image m, p takes a value of 16, 32, 48 or 64, and q takes a value of 16, 32, 48 or 64.
5. The deep learning-based high-frequency region enhanced photometric stereo three-dimensional reconstruction method according to claim 1, wherein ζ is set to 1.
6. The deep learning-based high-frequency region enhanced photometric stereo three-dimensional reconstruction method according to claim 1, wherein λ is set to 8.
7. The deep learning-based high-frequency region enhanced photometric stereo three-dimensional reconstruction method according to claim 1, wherein the number of training cycles is set to 30 epochs.
8. The deep learning-based high-frequency region enhanced photometric stereo three-dimensional reconstruction method according to claim 4, wherein p takes the value 32 and q takes the value 32.
CN202111524515.8A 2021-12-14 2021-12-14 High-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning Active CN113936117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111524515.8A CN113936117B (en) 2021-12-14 2021-12-14 High-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111524515.8A CN113936117B (en) 2021-12-14 2021-12-14 High-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning

Publications (2)

Publication Number Publication Date
CN113936117A (en) 2022-01-14
CN113936117B (en) 2022-03-08

Family

ID=79288969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111524515.8A Active CN113936117B (en) 2021-12-14 2021-12-14 High-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning

Country Status (1)

Country Link
CN (1) CN113936117B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098563B (en) * 2022-07-14 2022-11-11 中国海洋大学 Time sequence abnormity detection method and system based on GCN and attention VAE

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862741A (en) * 2017-12-10 2018-03-30 中国海洋大学 A kind of single-frame images three-dimensional reconstruction apparatus and method based on deep learning
CN109146934A (en) * 2018-06-04 2019-01-04 成都通甲优博科技有限责任公司 A kind of face three-dimensional rebuilding method and system based on binocular solid and photometric stereo
CN110060212A (en) * 2019-03-19 2019-07-26 中国海洋大学 A kind of multispectral photometric stereo surface normal restoration methods based on deep learning
CN113538675A (en) * 2021-06-30 2021-10-22 同济人工智能研究院(苏州)有限公司 Neural network for calculating attention weight for laser point cloud and training method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510573B (en) * 2018-04-03 2021-07-30 南京大学 Multi-view face three-dimensional model reconstruction method based on deep learning
EP4100925A4 (en) * 2020-02-03 2024-03-06 Nanotronics Imaging, Inc. Deep photometric learning (dpl) systems, apparatus and methods
CN113762358B (en) * 2021-08-18 2024-05-14 江苏大学 Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862741A (en) * 2017-12-10 2018-03-30 中国海洋大学 A kind of single-frame images three-dimensional reconstruction apparatus and method based on deep learning
CN109146934A (en) * 2018-06-04 2019-01-04 成都通甲优博科技有限责任公司 A kind of face three-dimensional rebuilding method and system based on binocular solid and photometric stereo
CN110060212A (en) * 2019-03-19 2019-07-26 中国海洋大学 A kind of multispectral photometric stereo surface normal restoration methods based on deep learning
CN113538675A (en) * 2021-06-30 2021-10-22 同济人工智能研究院(苏州)有限公司 Neural network for calculating attention weight for laser point cloud and training method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Constrained Independent Component Analysis Based Photometric Stereo for 3D Human Face Reconstruction; Cheng-Jian Lin et al.; 2012 International Symposium on Computer, Consumer and Control; 2012-07-02; 710-712 *
Application of Deep Learning in Three-Dimensional Reconstruction of Objects from a Single Image; Chen Jia et al.; Acta Automatica Sinica; 2018-11-28 (No. 04); 23-34 *

Also Published As

Publication number Publication date
CN113936117A (en) 2022-01-14

Similar Documents

Publication Publication Date Title
WO2021232687A1 (en) Deep learning-based point cloud upsampling method
CN106355570B (en) A kind of binocular stereo vision matching method of combination depth characteristic
Liu et al. Exemplar-based image inpainting using multiscale graph cuts
CN111627019A (en) Liver tumor segmentation method and system based on convolutional neural network
CN112634149B (en) Point cloud denoising method based on graph convolution network
Pottmann et al. The isophotic metric and its application to feature sensitive morphology on surfaces
CN112348959A (en) Adaptive disturbance point cloud up-sampling method based on deep learning
CN113962858A (en) Multi-view depth acquisition method
CN113936117B (en) High-frequency region enhanced photometric stereo three-dimensional reconstruction method based on deep learning
Sulc et al. Reflection Separation in Light Fields based on Sparse Coding and Specular Flow.
CN107103610A (en) Stereo mapping satellite image matches suspicious region automatic testing method
CN114549669B (en) Color three-dimensional point cloud acquisition method based on image fusion technology
Shen et al. 3D shape reconstruction from images in the frequency domain
CN112991504B (en) Improved hole filling method based on TOF camera three-dimensional reconstruction
CN115631223A (en) Multi-view stereo reconstruction method based on self-adaptive learning and aggregation
JP2856661B2 (en) Density converter
Gallardo et al. Using Shading and a 3D Template to Reconstruct Complex Surface Deformations.
CN115239559A (en) Depth map super-resolution method and system for fusion view synthesis
CN114972937A (en) Feature point detection and descriptor generation method based on deep learning
US20220172421A1 (en) Enhancement of Three-Dimensional Facial Scans
Gong et al. Multi-view stereo point clouds visualization
Tabb et al. Camera calibration correction in shape from inconsistent silhouette
Schouten et al. Timed fast exact euclidean distance (tfeed) maps
US20230177722A1 (en) Apparatus and method with object posture estimating
JP7508673B2 (en) Computer vision method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant