CN117132704A - Three-dimensional reconstruction method of dynamic structured light, system and computing equipment thereof - Google Patents

Three-dimensional reconstruction method of dynamic structured light, system and computing equipment thereof

Info

Publication number
CN117132704A
Authority
CN
China
Prior art keywords
module
layer
domain
generator
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310643012.5A
Other languages
Chinese (zh)
Inventor
胡海洋
高凌峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202310643012.5A priority Critical patent/CN117132704A/en
Publication of CN117132704A publication Critical patent/CN117132704A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional reconstruction method of dynamic structured light, together with a corresponding system and computing equipment. The invention constructs a dynamic structured light model, RPSNet; establishes a generator network model; establishes a discriminator network model; jointly optimizes a loss function; trains the generator and the discriminator alternately, first training a discriminator and then a generator, until a preset number of training rounds is reached; and finally realizes three-dimensional reconstruction with the optimized structured light model. In reconstruction quality the method is close to the result of the twelve-step phase shift combined with multi-frequency heterodyne method, with an accuracy error of 0.11 mm, and in general it can meet the three-dimensional reconstruction requirements of workpiece sorting in a factory environment.

Description

Three-dimensional reconstruction method of dynamic structured light, system and computing equipment thereof
Technical Field
The invention belongs to the technical field of machine vision, and particularly relates to a three-dimensional reconstruction method of dynamic structured light, a system and computing equipment thereof.
Background
Structured light, as an optical non-contact three-dimensional shape measurement technology, is widely used in intelligent manufacturing, reverse engineering, heritage digitization and other fields. Structured light projection based on FPP (Fringe Projection Profilometry) is one of the most popular optical three-dimensional imaging technologies at present because of its simple hardware, flexible implementation and high measurement precision. With the improvement of imaging and projection devices, high-speed three-dimensional shape measurement based on FPP structured light has become possible. At the same time, the importance of acquiring high-quality three-dimensional information in high-speed scenes is self-evident for applications such as online quality inspection, stress deformation analysis and rapid reverse molding. To realize three-dimensional measurement in high-speed scenes, the number of images required for each reconstruction generally has to be reduced to improve measurement efficiency. In the traditional structured light measurement method, at least 3 phase-shift images are theoretically required to complete one reconstruction; in practice, 5 or more phase-shift images are often needed to obtain higher reconstruction quality, and the quality of the three-dimensional reconstruction is directly related to the number of phase-shift images. The traditional methods work well in static scenes and in scenes not measured in real time, but cannot reach the expected effect in dynamic or real-time scenes: acquiring multiple phase-shift images takes time, which is unacceptable for applications requiring high real-time performance, and, more importantly, because the measured object is moving, motion errors exist between the phase-shift images, which makes the final three-dimensional reconstruction unsatisfactory.
In recent years, with the improvement of neural network architectures and of computing power, deep learning has shown strong fitting capability. Numerous studies have demonstrated that deep learning is superior to traditional algorithms in speed and robustness. In the research direction of structured light, deep learning is also widely applicable, for example to fringe denoising, fringe analysis and phase unwrapping.
In a complex factory environment, the uniform motion of the measured object can be affected by external vibration, electromagnetic interference and other factors, so the step length of the relative phase shift is not uniform. The conventional three-step and twelve-step phase shift methods can only be used when the phase shift step length is guaranteed to be uniform, so a method that can adapt to a random phase shift step length is needed.
Disclosure of Invention
The invention aims at solving the defects of the prior art and provides a novel dynamic structured light three-dimensional reconstruction method.
The invention provides a phase shift method adapting to a random phase shift step length, realized with an RPSNet (Random Phase-shifting Network) model, which takes CycleGAN (Cycle Generative Adversarial Network) as its network framework and comprises a generator and a discriminator.
The generator network adopts an AIR2U-net, which takes the U-net model as its basic architecture and integrates an attention mechanism and the IRR module. U-net is an improvement on the FCN (fully convolutional network) and consists of an encoder, a bottleneck module and a decoder: the network obtains feature maps through the convolution and downsampling operations of the encoder, and during deconvolution and upsampling the feature maps from the encoding process are merged into the decoding process to produce the final output. Adding the attention mechanism greatly enhances the ability of the whole network to focus on target features and suppress irrelevant features. The invention further provides an IRR module, which adds a recurrent convolutional layer (RCL, Recurrent Convolutional Layer) on top of an Inception-ResNet module to improve the model's ability to recognize object features.
The discriminator adopts a simple convolutional network structure; its input is the generator output G(x) or the data-set label y, and after learning it judges whether the input data are real. When the generator's output is judged to be false, the generator learns the characteristics of the input data, updates its parameters through back propagation, and outputs data to the discriminator again for a true/false judgment; this is repeated. This adversarial learning process helps the generator continually learn the feature distribution of the real data and thereby generate more realistic data. In addition, the network structure and parameters of the discriminator can be optimized to improve its discrimination accuracy and, in turn, the generation capability of the generator.
In a first aspect, the present invention provides a method for three-dimensional reconstruction of dynamic structured light, comprising the steps of:
step S1, acquiring a plurality of grating images of an object to be measured at consecutive moments under working conditions;
step S2, performing three-dimensional reconstruction on the plurality of grating images by using a dynamic structured light model RPSNet to obtain a depth image of the object to be measured;
the dynamic structured light model RPSNet is a generative adversarial network comprising a generator AIR2U-net and a discriminator;
the generator AIR2U-net adopts the basic U-net architecture with an encoding-decoding structure; feature maps are obtained through the feature extraction and downsampling operations of the IRR modules in the encoder, and during the feature extraction and upsampling of the IRR modules in the decoder the feature maps from the encoding process are added into the decoding process through skip connections passing through the attention mechanism module RCAM, finally yielding the output result.
The discriminator adopts a multi-layer convolutional network.
In a second aspect, the present invention provides a dynamic structured light three-dimensional reconstruction system, comprising:
a data acquisition module, which acquires a plurality of grating images of an object to be measured at consecutive moments under working conditions;
and a three-dimensional reconstruction module, which performs three-dimensional reconstruction on the plurality of grating images by using the dynamic structured light model RPSNet to obtain a depth image of the object to be measured.
In a third aspect, the present invention provides a computing device comprising a memory having executable code stored therein and a processor which, when executing the executable code, implements the method.
The beneficial effects of the invention are as follows:
the invention provides a novel three-dimensional reconstruction method of dynamic structured light which adopts reverse thinking: the projected structured light grating pattern is kept unchanged, and the relative phase shift is formed by the movement of the object in a certain direction; the proposed RPSNet neural network model solves the problem of the uncertain movement step length of the object. After the adversarial training of the generator and the discriminator, the final generator can output an ideal depth map that passes the judgment of the discriminator.
Drawings
FIG. 1 is a diagram of the RPSNet model training process and operation of the present invention;
FIG. 2 is a diagram of a network architecture of a generator in accordance with the present invention;
FIG. 3 is a block diagram of an IRR module of the present invention;
FIG. 4 is a view of the RCAM model structure;
FIG. 5 is a diagram of a discriminator network architecture;
FIG. 6 shows some models from the Thingi10K three-dimensional model dataset;
FIG. 7 is a virtual FPP scene diagram;
FIG. 8 is a Blender shader node setup;
FIG. 9 is a Blender synthetic node setup diagram;
FIG. 10 is a graphical representation of a generated partial data set;
FIG. 11 shows the patterns to be projected for the 3Step & Gray scheme, where (a) shows the 3 phase-shift pictures, (b) the 4 Gray-code pictures, and (c) the two binary pictures;
FIG. 12 compares the experimental results: (a) 3Step & Gray point cloud; (b) 12Step & MultiFreq point cloud; (c) RPSNet point cloud; (d) GT point cloud.
Detailed Description
The present invention will be specifically described below.
A new dynamic structured light three-dimensional reconstruction method comprises the following steps:
step S1, acquiring a plurality of grating images of an object to be measured at consecutive moments under working conditions;
step S2, performing three-dimensional reconstruction on the plurality of grating images by using a dynamic structured light model RPSNet to obtain a depth image of the object to be measured;
the dynamic structured light model RPSNet is a generative adversarial network comprising a generator AIR2U-net and a discriminator;
the generator AIR2U-net adopts the basic U-net architecture and, as shown in FIG. 2, uses an encoding-decoding structure; feature maps are acquired through the feature extraction and downsampling operations of the IRR modules in the encoder, and during the feature extraction and upsampling of the IRR modules in the decoder the feature maps from the encoding process are added into the decoding process through skip connections passing through an attention mechanism, finally producing the output result;
in layers 2 to N of the encoder downsampling path, an IRR module is connected in series at the rear end of each layer of the existing U-net encoder, N denoting the total number of encoder layers;
in layers 2 to N of the decoder upsampling path, a connection module is connected in series at the front end of each layer of the existing U-net decoder, and an IRR module is connected in series at the rear end;
layers 2 to N of the encoder and decoder are linked by skip connections, and an attention mechanism module RCAM is connected in series in each skip connection;
the connection module concatenates the output of the corresponding encoder layer with the output of the previous decoder layer;
the IRR module uses the Inception-ResNet module as its basic framework and adds recurrent convolution blocks (Recurrent Convolution Block, RCB) to strengthen the features; specifically it comprises a parallel residual connection, a 1×1 RCB, a 3×3 RCB and a 5×5 RCB, a 1×1 bottleneck layer and a splicing layer;
the 1×1 bottleneck layer receives the outputs of the 1×1, 3×3 and 5×5 RCBs and performs channel dimension reduction and restoration on them, which effectively reduces the number of network parameters, helps avoid over-fitting and reduces the computational burden;
the splicing layer receives the original input features carried by the residual connection and the features output by the 1×1 bottleneck layer and performs residual splicing on them;
the specific process of the RCB is as follows:
O_k^s = w^f * x^f + w_k^r * x_s^r + b_k (1)
where O_k^s is the output of the RCB with a convolution kernel of size s, x^f and x_s^r are the inputs of the standard convolutional layer and of the RCB with kernel size s respectively, w^f and w_k^r are the weights of the standard convolutional layer and of the k-th RCB layer, and b_k is the bias.
The output is then fed into a standard ReLU activation function f, expressed as follows:
F(x_s, w_s) = f(O_k^s) = max(0, O_k^s) (2)
where F(x_s, w_s) denotes the output of the RCB with a convolution kernel of size s.
The specific process of the IRR module is as follows: the input is effectively accumulated through the multi-scale RCBs using the idea of recurrent convolution, the results are fed into the 1×1 bottleneck layer, and finally the input x_l is residually spliced with the output of the 1×1 bottleneck convolution, the spliced result serving as the output x_{l+1} of the whole IRR module; see formula (3):
x_{l+1} = x_l ⊕ B([x_l^{1×1}, x_l^{3×3}, x_l^{5×5}]) (3)
where x_l^{1×1}, x_l^{3×3} and x_l^{5×5} are the outputs of the RCB modules with convolution kernel sizes 1×1, 3×3 and 5×5 respectively, B(·) is the bottleneck layer function, [·,·,·] denotes concatenation of the matrices along the depth direction, and ⊕ denotes the residual splice with the input;
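By way of illustration only, the RCB and IRR modules described above can be sketched in PyTorch (the framework used in the experiments below); the channel counts, the number of recurrence steps and the use of addition for the residual splice are assumptions of the sketch, not details fixed by the invention:

```python
import torch
import torch.nn as nn

class RCB(nn.Module):
    """Recurrent convolution block: the accumulation of eq. (1) followed by the ReLU of eq. (2)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, steps=3):
        super().__init__()
        pad = kernel_size // 2
        self.forward_conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)    # weights w^f
        self.recurrent_conv = nn.Conv2d(out_ch, out_ch, kernel_size, padding=pad)  # weights w^r_k
        self.relu = nn.ReLU(inplace=True)
        self.steps = steps

    def forward(self, x):
        state = self.relu(self.forward_conv(x))          # first pass: standard convolution only
        for _ in range(self.steps - 1):                  # recurrent passes add the previous response
            state = self.relu(self.forward_conv(x) + self.recurrent_conv(state))
        return state

class IRR(nn.Module):
    """IRR module of eq. (3): parallel 1x1/3x3/5x5 RCBs, a 1x1 bottleneck, and a residual splice."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.rcb1 = RCB(in_ch, out_ch, kernel_size=1)
        self.rcb3 = RCB(in_ch, out_ch, kernel_size=3)
        self.rcb5 = RCB(in_ch, out_ch, kernel_size=5)
        self.bottleneck = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)  # B(.) in eq. (3)
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        multi = torch.cat([self.rcb1(x), self.rcb3(x), self.rcb5(x)], dim=1)  # depth-wise splice
        return self.skip(x) + self.bottleneck(multi)     # residual splice -> x_{l+1}
```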
the attention mechanism module RCAM as described in fig. 4 includes a channel attention module and a spatial attention module;
the channel attention module firstly carries out global maximum value pooling and average value pooling operation on each input channel respectively, then uses a full-connection layer (namely a multi-layer perceptron) sharing weight to respectively carry out weight calculation on different channels of the feature images after pooling operation to obtain weights of the two feature images in different channels, adds the weights and obtains a final weight result through an activation function;
the spatial attention module firstly uses maximum value pooling and average value pooling to pool channels, uses a convolution network to calculate weights after two obtained feature graphs are spliced, and finally outputs the obtained weights through an activation function;
the whole process of the attention mechanism module RCAM is expressed using the following formula:
wherein f d ,f e Characteristic diagrams respectively representing decoder and encoder outputs, conv (·) representing convolving an input, M c (·),M s (. Cndot.) channel attention and spatial attention operations are performed on the input,representing element-by-element multiplication of the left and right inputs, O representing the final output of the module;
the channel attention module compresses the feature map in the space dimension to obtain a one-dimensional vector and then operates; the average pooling and the maximum pooling in the model are used for aggregating the spatial information of the feature graphs, then the spatial information is sent to a multi-layer perceptron network sharing weights to compress the spatial dimension of the input feature graphs, and finally the spatial dimension of the input feature graphs are summed element by element and combined to generate a channel attention graph; for a single picture, the channel attention mechanism mainly learns which content on the picture is important; carrying out gradient back propagation calculation by maximum pooling; the whole channel attention mechanism is described by the formula:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) (5)
where AvgPool(·) and MaxPool(·) denote average pooling and maximum pooling of the input, MLP(·) denotes the weight-sharing multi-layer perceptron operation, and σ(·) is the activation function;
the spatial attention module compresses channels and performs average pooling and maximum pooling on the dimension of the channels; the operation of maximum pooling is to extract the maximum value on the channel, the number of times of extraction is the height multiplied by the width; the operation of average pooling is to extract an average value over the channel, the number of times of extraction also being the height times the width; combining the feature images extracted before to obtain a final output feature image; the process is expressed by using the formula:
M_s(F') = σ(Conv(AvgPool(F'); MaxPool(F'))) (6)
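For illustration, a CBAM-style sketch of the RCAM module corresponding to formulas (4)-(6) is given below; the fusion convolution, reduction ratio and 7×7 spatial kernel are assumptions of the sketch rather than requirements of the invention:

```python
import torch
import torch.nn as nn

class RCAM(nn.Module):
    """Channel then spatial attention applied to fused decoder/encoder features, cf. eqs. (4)-(6)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # shared-weight MLP for channel attention (eq. 5)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # conv over pooled maps (eq. 6)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_d, f_e):
        feat = self.fuse(torch.cat([f_d, f_e], dim=1))
        # channel attention: global average and max pooling, shared MLP, sum, sigmoid
        mc = self.sigmoid(self.mlp(feat.mean(dim=(2, 3), keepdim=True))
                          + self.mlp(feat.amax(dim=(2, 3), keepdim=True)))
        feat_c = mc * feat
        # spatial attention: channel-wise average and max maps, convolution, sigmoid
        ms = self.sigmoid(self.spatial(torch.cat(
            [feat_c.mean(dim=1, keepdim=True), feat_c.amax(dim=1, keepdim=True)], dim=1)))
        return ms * feat_c
```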
the training and testing process of the dynamic structural optical model RPSNet is as shown in fig. 1:
the dynamic structure light model RPSNet comprises two parts during training;
the first part is that the grating image x of the measured object passes through a generator G XY Converted into a depth map y' of the measured object, and then a generator G YX A raster image x 'is generated, and the two conversion results x and y' of the process respectively use a discriminator D X 、D Y Judging whether the data are true or false, and calculating the countermeasures according to the output result of the discriminator by the generator so as to adjust the output; in addition, in order to prevent the output of the generator from being too aggressive, the characteristics of the original input image are lost, and consistency loss is added in the process of generating the image
The second part is similar to the first part, converts the depth map y into a grating map x ', and then converts the grating map x ' into a depth map y '; the process also requires constraints using a arbiter and a consistency loss function; through the alternate training of the process, the whole network model can learn key features and output high-quality depth images;
in the training process of the discriminant, the discriminant D of the X domain and the Y domain needs to be trained X And D Y Inputting a real X-domain or Y-domain image and an image generated by a generator respectively, calculating a loss function by an obtained result of the discriminator and an actual label, and then multiplying the two loss functions by a coefficient of 0.5 and adding the two loss functions; finally updating the model parameters through back propagation;
the training of the constructed generators and the training of the discriminators are alternately performed during training, one discriminator is trained firstly, and then one generator is trained, and the training is alternately performed until the number of preset training rounds is reached;
using test data sets only for generator G during testing XY Verification and testing is performed.
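The alternating procedure can be sketched as follows; the data loader, optimizers and loss helpers are placeholders, and this is an outline of the training order rather than the exact training code of the invention:

```python
def train_rpsnet(G_XY, G_YX, D_X, D_Y, loader, opt_g, opt_d,
                 g_loss_fn, d_loss_fn, epochs, device="cuda"):
    for _ in range(epochs):                              # preset number of training rounds
        for x, y in loader:                              # x: raster image, y: depth map
            x, y = x.to(device), y.to(device)
            # 1) train the discriminators on real images and detached generated images
            opt_d.zero_grad()
            d_loss = 0.5 * (d_loss_fn(D_Y, y, G_XY(x).detach())
                            + d_loss_fn(D_X, x, G_YX(y).detach()))
            d_loss.backward()
            opt_d.step()
            # 2) then train the generators against the (now fixed) discriminators
            opt_g.zero_grad()
            g_loss = g_loss_fn(G_XY, G_YX, D_X, D_Y, x, y)
            g_loss.backward()
            opt_g.step()
```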
Both generator G_XY and generator G_YX employ the AIR2U-net generator.
The loss function of the dynamic structured light model RPSNet during training consists of the loss function of the generator and the loss function of the discriminator.
The generator's loss functions include an adversarial loss (Adversarial loss), a cycle consistency loss (Cycle consistency loss) and an identity loss (Identity loss); since two generators are used in RPSNet training to convert between the original domain and the target domain, each of the three loss functions consists of two parts, so the total loss function consists of six parts, as in formula (7):
L = L_GAN^X + L_GAN^Y + λ_X L_cyc^X + λ_Y L_cyc^Y + μ_X L_idt^X + μ_Y L_idt^Y (7)
where L_GAN^X and L_GAN^Y denote the adversarial generation loss functions of the X domain and the Y domain respectively; L_cyc^X and L_cyc^Y denote the cycle consistency loss functions of the X domain and the Y domain, and λ_X and λ_Y their weights; L_idt^X and L_idt^Y denote the identity loss functions of the X domain and the Y domain, and μ_X and μ_Y their weights;
since there are two generators of CycleGAN, the challenge generation loss function of CycleGAN is expressed by the following formula:
wherein f w (. Cndot.) represents a series of function sets that satisfy the K-Lipschitz condition; p (P) gxy Representation generator G XY A resulting sample distribution; p (P) gyx Representation generator G YX A resulting sample distribution; e represents taking the average value;
the purpose of the cyclical consistency loss function is to prevent the image generated by the generator from being too biased towards the destination domain and losing the image information in the original domain; the inputs to the cyclic consistency loss function are the image in the original domain and the image in the target domain output by the generator and the inverse generator, which should be kept as similar as possible; the loop consistency loss function is therefore defined using the following formula:
wherein X is as followsThe original input raster pattern is shown as such, Y represents the original input depth map and, I.I 1 Representing distance calculation functions, e.g. G YX (G XY (X)) to X, G YX (G XY (X)) represents a raster pattern regenerated from the raster pattern generated depth map;
the above-mentioned loss functions are all just inputs of the original domain, and inputs of the target domain are not considered; for this purpose, the identity loss function adds the input of the target domain, the formula of which is defined as follows:
the loss function of the discriminator also uses the Wasserstein distance as a similarity measurement index; the discriminator is designed to correctly recognize the true image and the generated image, so that the distance between the true image and the generated image is as large as possible in the design of the loss function, and therefore the loss function of the discriminator can be expressed by the following formula;
as shown in FIG. 5, the input of the discriminator adopts a simple multi-layer convolution network structure, and the output result G (x) of the generator or the label y of the data set can be used for judging whether the input data is real data or not after learning. When the output of the generator is judged to be false, the generator learns the characteristics of the input data, performs parameter updating through back propagation, outputs the data to the discriminator again to perform true and false judgment, and repeatedly performs the above steps. Specifically, the discriminators have two, one for discriminating a real image and a generated image, and the other for discriminating a conversion result of a real style image and a generated image. The two discriminators have the same structure and are deep neural networks formed by a plurality of convolution layers and full connection layers. During training, the goal of the arbiter is to distinguish as far as possible between true samples and generated samples, so its loss function typically employs a binary cross entropy loss function.
Experimental results and analysis
Training a neural network model cannot be done without a large data set. To make the model better suited to three-dimensional reconstruction tasks in a factory environment, a data set was built from the Thingi10K 3D model library, and training and testing of the model were carried out on this data set. To verify the effectiveness of the model, the invention also compares it with a measurement method using three-step phase shift plus Gray code and a measurement method using twelve-step phase shift plus multi-frequency heterodyne. Experiments show that the proposed model has certain performance advantages and is better suited to the measurement requirements of workpieces in a factory environment.
1) Experimental data set
The 3D model data sets commonly used today include ModelNet, ShapeNet, ABC, Thingi10K, etc. When selecting a data set, two points are mainly considered: first, the effective working distance of the FPP system under visible light is generally 1-2 meters, so the volume of the 3D models should not be excessively large; second, the application scenario is the industrial production process, so the selected models should resemble common workpieces in industrial scenes. Based on these two points, the invention selects the Thingi10K data set as its 3D model data set; it contains 3D models of many common objects such as workpieces, sculptures and vases, some of which are shown in FIG. 6. The diversity and scale of these models help to generate large-scale and diverse data samples, and thereby to train a more versatile and generalizable model.
The construction of the virtual FPP system is shown in FIG. 7. Blender is an open-source 3D scene production software that can process images in batches through Python scripts. With Blender's simulation capability, a real-world scene can be simulated in a virtual environment. In the virtual environment, two virtual cameras and a virtual projector are placed; the projector is set to project sinusoidal fringes onto the object, and the fringes deformed by the object's height modulation are captured by the left and right cameras. The raster images captured by the virtual cameras are nearly indistinguishable from real ones, and the virtual system faithfully simulates a real FPP system.
With the virtual FPP system, the data set required for model training can be generated. In the virtual FPP system, the projector first has to project a raster pattern, which is achieved by setting up shader nodes. A shader node is a module used in Blender for shading and rendering the model, with which different rendering effects can be realized. As shown in FIG. 8, a node named "image texture" is set first to select the source of the picture projected by the projector; this node selects a sinusoidal fringe picture as the input of the projected pattern. In this way the projection of the raster pattern is implemented in the virtual FPP system and the corresponding fringe and depth images can be generated.
Then, the raster image captured by the camera needs to be rendered. Specifically, the image or depth attribute in the render layer of the compositing nodes is passed through a normalization node and output to the composite node, after which the raster image and the depth image are rendered. The compositing node arrangement is shown in FIG. 9.
To make the virtual FPP system more similar to a real factory environment, the invention uses several methods to simulate the real environment. For example, to further enrich the data set, the models in the three-dimensional model data set are rotated several times in all directions, which also simulates the untidy placement of real workpieces. To enhance the realism of the virtual FPP system, a background plate is added to the scene and photos of a real factory production line are rendered onto the background plate as a texture. The data set obtained by these simulation means is as close as possible to one collected in a real environment, so the network model can learn the input features better during training and over-fitting is avoided.
With the settings described above for building the FPP virtual system in Blender, a large number of simulation data sets can be acquired. In practice, however, doing this by hand is extremely cumbersome and inefficient. For example, to let the model learn grating patterns with a random phase shift step, the measured object must be displaced randomly along a single direction, and if this adjustment were performed manually in the graphical interface the amount of work would be enormous. Blender, however, provides a Python scripting interface for building the entire simulation system, which is extremely convenient. The construction of the FPP virtual system is therefore implemented with Python scripts, and part of the generated data set is shown in FIG. 10. For proper and efficient training and evaluation of the model, the data set is divided into a training data set and a test data set at a 3:1 ratio.
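As an illustrative sketch of such a script (not the script actually used), one capture sequence could be generated as follows with Blender's bpy API; the STL import operator, object handling and file paths are assumptions that depend on the Blender version and on the scene prepared as in FIGS. 7-9:

```python
import random
import bpy

def render_sample(model_path, out_dir, index, step_range=(0.5, 2.0)):
    # load one Thingi10K model (operator name is an assumption; newer Blender versions differ)
    bpy.ops.import_mesh.stl(filepath=model_path)
    obj = bpy.context.selected_objects[0]
    # random orientation, simulating untidily placed workpieces
    obj.rotation_euler = [random.uniform(0.0, 6.2832) for _ in range(3)]
    scene = bpy.context.scene
    for frame in range(3):                                   # consecutive captures
        # move the object a random step along one direction -> random relative phase shift
        obj.location.x += random.uniform(*step_range)
        scene.render.filepath = f"{out_dir}/model{index:04d}_frame{frame}"
        bpy.ops.render.render(write_still=True)              # compositor writes raster + depth images
    bpy.data.objects.remove(obj, do_unlink=True)
```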
2) Model implementation details
This section explains the design and training details of the RPSNet model. When training RPSNet, the data set used was built from the Thingi10K data set. The hardware environment of the experiment is a 64-bit Windows system with an Intel Core i7-11700 CPU, an NVIDIA RTX 2080 Ti graphics card and 16 GB of memory; all code in the experiment is implemented with the PyTorch framework. Since RPSNet is a variant of CycleGAN, its training is the same as CycleGAN's, i.e. the model is trained by alternately training the generators and the discriminators.
When training the generators, the generator G_XY that converts X-domain input into the Y domain and the generator G_YX that converts Y-domain input into the X domain must be trained at the same time. Both generators are trained with the adversarial loss function L_GAN, the cycle consistency loss function L_cyc and the identity loss function L_idt, so there are six loss terms in total; the cycle consistency and identity losses both use the L1 loss, while the adversarial loss is defined using the Wasserstein distance. To control the weight of each loss term, when computing the total loss the coefficient of the cycle consistency term is set to 10, the coefficient of the identity term to 5 and the coefficient of the adversarial term to 1. In the training of the discriminators, the discriminators D_X and D_Y of the X and Y domains are trained: a real X-domain or Y-domain image and an image produced by the generator are input respectively, the loss function is computed from the discriminator output and the real label, the two loss terms are each multiplied by a coefficient of 0.5 and added, and the model parameters are finally updated through back propagation. To ensure the training effect, the batch size of the training samples is set to 8 and the Adam optimization algorithm is selected for back propagation. The Adam optimizer is a common optimization algorithm for gradient descent that updates the parameters by dynamically adjusting first- and second-moment estimates of the gradient. In the invention, the learning rate of the Adam optimizer is set to 0.002 and the momentum parameters are set to β1 = 0.5 and β2 = 0.999 to find the global optimum and improve training efficiency and model performance. Through this training process the generalization ability of the model can be improved, so that a better effect is achieved in real application scenarios.
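The stated configuration can be summarized in the following sketch; the data-set class and model objects are placeholders:

```python
import itertools
import torch
from torch.utils.data import DataLoader

def build_training(G_XY, G_YX, D_X, D_Y, dataset):
    loader = DataLoader(dataset, batch_size=8, shuffle=True)           # batch size 8
    opt_g = torch.optim.Adam(itertools.chain(G_XY.parameters(), G_YX.parameters()),
                             lr=0.002, betas=(0.5, 0.999))             # lr and momenta from the text
    opt_d = torch.optim.Adam(itertools.chain(D_X.parameters(), D_Y.parameters()),
                             lr=0.002, betas=(0.5, 0.999))
    loss_weights = {"adversarial": 1.0, "cycle": 10.0, "identity": 5.0}
    return loader, opt_g, opt_d, loss_weights
```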
3) Performance comparison experiment
The experiments in this section aim to verify the effectiveness of the proposed RPSNet model. Therefore, the method combining Gray code with three-step phase shift, the method combining multi-frequency heterodyne with twelve-step phase shift and the proposed method are compared. Since there is currently no method that is universally applicable to the measurement of dynamic objects and offers high reconstruction quality, the reconstruction algorithm of multi-frequency heterodyne combined with twelve-step phase shift, measured while the object is kept static, is taken as the Ground Truth. Twelve-step phase shift suppresses errors caused by factors such as projector nonlinearity and object reflection well, so it can reconstruct the three-dimensional object with high quality.
The specific reconstruction algorithm adopted for each scheme in the comparison experiment and the relevant settings of the corresponding experiment are described as follows:
3Step & Gray: this scheme computes the wrapped phase with three-step phase shift and projects additional Gray code patterns to label the phase for unwrapping. In this scheme the number of Gray code patterns is set to 4, supporting at most 16 periods in the field of view, which is sufficient for images of 512 pixels. The projected patterns are encoded with Gray code to reduce the influence of reflections from the object on the grating pattern; the encoded patterns are shown in FIG. 11. In practical applications, considering the reflections from the surface of the measured object and the non-uniformity of ambient light, two additional patterns, all black and all white, are projected to normalize the brightness so that the brightness of a pixel can be judged under different lighting conditions. In total this scheme requires projecting 3 + 4 + 2 = 9 patterns.
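For reference, the wrapped phase of the three-step (2π/3) phase shift used by this baseline follows the textbook relation sketched below; this is general background rather than code of the invention:

```python
import numpy as np

def wrapped_phase_3step(i1, i2, i3):
    # phi = arctan( sqrt(3)*(I1 - I3) / (2*I2 - I1 - I3) ), wrapped to (-pi, pi]
    return np.arctan2(np.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)
```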
12Step & MultiFreq: this scheme uses twelve-step phase shift to compute the wrapped phase and the multi-frequency heterodyne method for phase unwrapping. To ensure that the total period obtained by multi-frequency heterodyne covers the whole field of view, a three-frequency heterodyne algorithm is adopted, and the unwrapping calculation is carried out on sinusoidal patterns whose periods are 25, 27 and 29 pixels respectively; since twelve-step phase shift is used to compute the wrapped phase, 12 × 3 = 36 grating patterns need to be projected in total.
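Likewise, the N-step wrapped-phase relation (here N = 12) and the pairwise heterodyne step that the three-frequency scheme applies repeatedly can be sketched as follows; this again illustrates the baseline only:

```python
import numpy as np

def wrapped_phase_nstep(images):
    # images: sequence of N fringe images with phase shifts 2*pi*n/N
    n = len(images)
    shifts = 2.0 * np.pi * np.arange(n) / n
    num = sum(img * np.sin(s) for img, s in zip(images, shifts))
    den = sum(img * np.cos(s) for img, s in zip(images, shifts))
    return -np.arctan2(num, den)

def heterodyne(phi1, phi2):
    # equivalent wrapped phase of the beat fringe with period T1*T2/(T2 - T1)
    return np.mod(phi1 - phi2, 2.0 * np.pi)
```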
RPSNet: the scheme uses the dynamic structured light measuring model RPSNet adapting to the random phase shift step length, and the model directly converts an input grating pattern into a depth image without the processes of wrapping phase calculation and unwrapping phase.
GT: to highlight the reconstruction effect of each scheme, the comparison experiment also uses the reconstruction algorithm of twelve-step phase shift and multi-frequency heterodyne, measuring the object while it is kept static; the experimental settings of this reconstruction are consistent with the 12Step & MultiFreq scheme.
FIG. 12 shows the results of the comparison experiment. From FIG. 12(a) it can be seen that when the 3Step & Gray structured light reconstruction scheme is used to measure an object in a dynamic scene, motion errors produce uneven planes in the point cloud, the whole point cloud is sparse, and the overall reconstruction precision is low. This is because the three-step phase shift algorithm cannot cope with errors caused by factors such as projector nonlinearity and light reflection from the object surface. FIG. 12(b) is the reconstruction result of the 12Step & MultiFreq scheme, which suppresses environment and device errors well but suffers a large motion error because of the large number of phase steps. The RPSNet model provided by the invention overcomes the shortcomings of both reconstruction schemes; the final reconstructed point cloud is shown in FIG. 12(c). It can be seen that, although the point cloud reconstructed by this scheme is not as dense as that of the twelve-step phase shift, it accurately reconstructs the basic features of the workpiece.
To evaluate the reconstruction effect of each method quantitatively, a comparison experiment was carried out on a standard part with each of the above methods; the experimental results are shown in the following table:
table 1 comparative experiments of the methods measured on standard parts
From the table above, it can be seen that the structured light measurement scheme combining three-step phase shift and Gray code has a larger error and larger error fluctuation when measuring a moving object. From the comparison of 12Step & MultiFreq and GT, the measurement mode of twelve-step phase shift with multi-frequency heterodyne has high measurement accuracy in a static scene, but a larger error occurs when measuring an object in motion. The RPSNet provided by the invention can complete three-dimensional reconstruction while the object is moving, with higher precision, and can support the grasping task.
4) Ablation experiments
To prove the effectiveness of each module in the RPSNet model, ablation experiments are designed for the different modules; a standard part is used as the measured object, and the reconstruction quality of the different models is characterized by the error between the measured length of the standard part and its actual length. The experimental settings are based on those of the RPSNet item in the performance comparison experiment, and any settings not mentioned are left unmodified.
(1) Effectiveness of the IRR module
To verify the effectiveness of the IRR module, three groups of experiments are designed with the following settings:
conv: in the set of experiments, IRR modules in the RPSNet model provided by the invention are replaced by common 3×3 convolution operation, the packing is set to 1, the stride is set to 1, and other parameters are kept unchanged.
Recurrent Conv: the IRR module is replaced by a recurrent convolution operation whose convolution settings are consistent with the Conv experiment; specifically, the number of recurrences is set to 3, and in every convolution after the first the result of the previous convolution is added to the input of the recurrent convolution before being fed into the convolution operation.
IRR: the relevant settings for this set of experiments remained consistent with the RPSNet experimental settings in the performance comparison experiments.
Table 2 comparative experimental results on standard for ablation experiments of IRR modules
Table 2 shows the comparison results of the ablation experiment of the IRR module on the standard part. It can be seen that the proposed IRR module reduces the measurement error to 0.36 mm at the cost of an increase of about 0.3 seconds in time consumption, obtaining the best measurement result in this ablation experiment and effectively proving the effectiveness of the module.
(2) Effectiveness of the RCAM module
In the experiments verifying the effectiveness of the RCAM module, three groups of comparison experiments are designed with the following settings:
none: the results of each stage of encoder are directly input into the corresponding decoder without using any attention module in the set of experiments.
AG: this set of experiments replaced the RCAM module in RPSNet with the Attention Gate module in Attention U-net proposed by Oktay et al.
RCAM: the relevant settings for this set of experiments remained consistent with the settings for the RPSNet experiment in the performance comparison experiment.
Table 3 comparative experimental results on standard for ablation experiments of RCAM modules
The comparison results of the ablation experiment of the RCAM module on the standard part are shown in Table 3. They show that the RCAM module improves the reconstruction quality compared with the other settings in this ablation experiment while keeping the overall time consumption of the model within an acceptable range, which proves its effectiveness.
In summary, the invention addresses the problem of three-dimensional reconstruction of a measured object with structured light in a dynamic scene, where dynamic measurement means that the measured object is moving and its direction and speed are not fixed. For such dynamic three-dimensional measurement with structured light there is currently no universal, complete solution. The invention designs an RPSNet-based dynamic structured light three-dimensional reconstruction method for the specific scenario of three-dimensional inspection of objects on a conveyor belt. In this scenario the measured object can be considered to move in one direction, which is exactly what allows a relative phase shift to be formed. Although this converts the source of motion error into a relative phase shift, it introduces a new error, namely a non-uniform phase shift, because the conveyor belt is mechanically driven and external disturbances inevitably cause speed inconsistencies. For this reason the invention designs a structured light measurement mode that handles a random phase shift step length; the ensuing problem is that such a data-driven model faces, above all, a lack of effective data. To build an effective data set, the invention creates a virtual FPP system in the software Blender and builds a data set from the Thingi10K 3D model data set. To verify the effectiveness of the method, it is compared with three-step phase shift combined with Gray code, twelve-step phase shift combined with multi-frequency heterodyne, and other methods; experiments show that the proposed method is close in reconstruction quality to the twelve-step phase shift combined with multi-frequency heterodyne method, with an accuracy error of 0.11 mm, which can meet the three-dimensional reconstruction requirements of workpiece sorting in a factory environment and creates conditions for subsequent three-dimensional positioning.

Claims (10)

1. A three-dimensional reconstruction method of dynamic structured light, characterized by comprising the following steps:
step S1, acquiring a plurality of grating images of an object to be measured at consecutive moments under working conditions;
step S2, performing three-dimensional reconstruction on the plurality of grating images by using a dynamic structured light model RPSNet to obtain a depth image of the object to be measured;
the dynamic structured light model RPSNet is a generative adversarial network comprising a generator AIR2U-net and a discriminator;
the generator AIR2U-net adopts the basic U-net architecture with an encoding-decoding structure; feature maps are acquired through the feature extraction and downsampling operations of the IRR modules in the encoder, and during the feature extraction and upsampling of the IRR modules in the decoder the feature maps from the encoding process are added into the decoding process through skip connections passing through the attention mechanism module RCAM, finally yielding the output result;
the discriminator adopts a multi-layer convolution network.
2. The method according to claim 1, wherein in layers 2 to N of the encoder downsampling path an IRR module is connected in series at the rear end of each layer of the existing U-net encoder, N denoting the total number of encoder layers;
in layers 2 to N of the decoder upsampling path, a connection module is connected in series at the front end of each layer of the existing U-net decoder, and an IRR module is connected in series at the rear end;
layers 2 to N of the encoder and decoder are linked by skip connections, and an attention mechanism module RCAM is connected in series in each skip connection;
the connection module concatenates the output of the corresponding encoder layer with the output of the previous decoder layer;
the IRR module uses the Inception-ResNet module as its basic framework and adds recurrent convolution blocks RCB to strengthen the features.
3. The method according to claim 2, characterized in that the IRR module comprises in particular a parallel residual connection, a 1×1 RCB, a 3×3 RCB and a 5×5 RCB, a 1×1 bottleneck layer and a splicing layer;
the 1×1 bottleneck layer receives the outputs of the 1×1, 3×3 and 5×5 RCBs and performs channel dimension reduction and restoration on them, which effectively reduces the number of network parameters, helps avoid over-fitting and reduces the computational burden;
the splicing layer receives the original input features carried by the residual connection and the features output by the 1×1 bottleneck layer and performs residual splicing on them;
the specific process of the RCB is as follows:
O_k^s = w^f * x^f + w_k^r * x_s^r + b_k (1)
where O_k^s is the output of the RCB with a convolution kernel of size s, x^f and x_s^r are the inputs of the standard convolutional layer and of the RCB with kernel size s respectively, w^f and w_k^r are the weights of the standard convolutional layer and of the k-th RCB layer, and b_k is the bias;
the output is then fed into a standard ReLU activation function f, expressed as follows:
F(x_s, w_s) = f(O_k^s) = max(0, O_k^s) (2)
where F(x_s, w_s) denotes the output of the RCB with a convolution kernel of size s.
4. The method according to claim 3, characterized in that the specific process of the IRR module is as follows:
the input is effectively accumulated through the multi-scale RCBs using the idea of recurrent convolution, the results are fed into the 1×1 bottleneck layer, and finally the input x_l is residually spliced with the output of the 1×1 bottleneck convolution, the spliced result serving as the output x_{l+1} of the whole IRR module; see formula (3):
x_{l+1} = x_l ⊕ B([x_l^{1×1}, x_l^{3×3}, x_l^{5×5}]) (3)
where x_l^{1×1}, x_l^{3×3} and x_l^{5×5} are the outputs of the RCB modules with convolution kernel sizes 1×1, 3×3 and 5×5 respectively, B(·) is the bottleneck layer function, [·,·,·] denotes concatenation of the matrices along the depth direction, and ⊕ denotes the residual splice with the input.
5. The method according to claim 1, characterized in that the attention mechanism module RCAM comprises a channel attention module and a spatial attention module;
the channel attention module first performs global maximum pooling and average pooling on each input channel, then uses a weight-sharing fully connected layer to compute weights for the different channels of each pooled feature map, obtaining the weights of the two feature maps over the channels, adds them, and obtains the final weight result through an activation function;
the spatial attention module first pools along the channel dimension with maximum pooling and average pooling, concatenates the two resulting feature maps and computes weights with a convolutional network, and finally outputs the obtained weights through an activation function.
6. The method of claim 5, wherein the whole process of the attention mechanism module RCAM is expressed using the following formula:
F = Conv([f_d, f_e]), F' = M_c(F) ⊗ F, O = M_s(F') ⊗ F' (4)
where f_d and f_e are the feature maps output by the decoder and the encoder respectively, Conv(·) denotes convolving the input, M_c(·) and M_s(·) denote the channel attention and spatial attention operations on the input, ⊗ denotes element-by-element multiplication of the left and right inputs, and O is the final output of the module;
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) (5)
wherein AvgPool (·) and MaxPool (·) represent average pooling and maximum pooling of inputs, respectively, MLP (·) represents multi-layer perceptron operation with weight sharing, σ (·) is an activation function;
M_s(F') = σ(Conv(AvgPool(F'); MaxPool(F'))) (6).
7. The method according to claim 1, characterized in that the training and testing procedure of the dynamic structured light model RPSNet is as follows:
during training the dynamic structured light model RPSNet comprises two parts;
the first part: the raster image x of the measured object is converted by generator G_XY into a depth map y' of the measured object, after which generator G_YX generates a raster image x″; the two conversion results x″ and y' of this process are judged true or false by discriminators D_X and D_Y respectively, and the generator computes the adversarial loss from the discriminator output so as to adjust its output; in addition, to prevent the generator output from drifting too far and losing the characteristics of the original input image, a consistency loss is added to the image generation process;
the second part is similar to the first: it converts the depth map y into a raster image x' and then into a depth map y″; this process likewise requires constraints from the discriminators and the consistency loss function; through the alternating training of these two parts, the whole network model can learn the key features and output high-quality depth images;
in the training of the discriminators, the discriminators D_X and D_Y of the X domain and the Y domain are trained: a real X-domain or Y-domain image and an image generated by the generator are input respectively, the loss function is computed from the discriminator's output and the real label, the two loss terms are each multiplied by a coefficient of 0.5 and added, and the model parameters are finally updated through back propagation;
during training, the constructed generators and discriminators are trained alternately: a discriminator is trained first, then a generator, and this alternation continues until the preset number of training rounds is reached;
during testing, only generator G_XY is verified and tested with the test data set.
8. The method according to claim 7, characterized in that the loss function of the dynamic structured light model RPSNet during training consists of the loss function of the generator and the loss function of the discriminator;
the generator's loss functions include an adversarial loss, a cycle consistency loss and an identity loss; since two generators are used in RPSNet training to convert between the original domain and the target domain, each of the three loss functions consists of two parts, so the total loss function is expressed as follows:
L = L_GAN^X + L_GAN^Y + λ_X L_cyc^X + λ_Y L_cyc^Y + μ_X L_idt^X + μ_Y L_idt^Y (7)
where L_GAN^X and L_GAN^Y denote the adversarial generation loss functions of the X domain and the Y domain respectively; L_cyc^X and L_cyc^Y denote the cycle consistency loss functions of the X domain and the Y domain, and λ_X and λ_Y their weights; L_idt^X and L_idt^Y denote the identity loss functions of the X domain and the Y domain, and μ_X and μ_Y their weights;
the adversarial generation loss function is defined by the following formula:
L_GAN = −E_{x∼P_gxy}[f_w(x)] − E_{x∼P_gyx}[f_w(x)] (8)
where f_w(·) denotes a family of functions satisfying the K-Lipschitz condition, P_gxy denotes the sample distribution produced by generator G_XY, P_gyx denotes the sample distribution produced by generator G_YX, and E denotes taking the expectation;
the cycle consistency loss function is defined by the following formula:
L_cyc = E[‖G_YX(G_XY(X)) − X‖_1] + E[‖G_XY(G_YX(Y)) − Y‖_1] (9)
where X denotes the original input raster image, Y denotes the original input depth map, and ‖·‖_1 denotes the distance calculation function;
the identity loss function is defined using the following formula:
L_idt = E[‖G_XY(Y) − Y‖_1] + E[‖G_YX(X) − X‖_1] (10)
the loss function of the discriminator is defined by the following formula:
L_D = E_{x∼P_gxy}[f_w(x)] − E_{y∼P_data(Y)}[f_w(y)] + E_{x∼P_gyx}[f_w(x)] − E_{x∼P_data(X)}[f_w(x)] (11).
9. A dynamic structured light three-dimensional reconstruction system implementing the method according to any one of claims 1-8, characterized by comprising:
a data acquisition module, which acquires a plurality of grating images of an object to be measured at consecutive moments under working conditions;
and a three-dimensional reconstruction module, which performs three-dimensional reconstruction on the plurality of grating images by using the dynamic structured light model RPSNet to obtain a depth image of the object to be measured.
10. A computing device comprising a memory having executable code stored therein and a processor, which when executing the executable code, implements the method of any of claims 1-8.
CN202310643012.5A 2023-06-01 2023-06-01 Three-dimensional reconstruction method of dynamic structured light, system and computing equipment thereof Pending CN117132704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310643012.5A CN117132704A (en) 2023-06-01 2023-06-01 Three-dimensional reconstruction method of dynamic structured light, system and computing equipment thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310643012.5A CN117132704A (en) 2023-06-01 2023-06-01 Three-dimensional reconstruction method of dynamic structured light, system and computing equipment thereof

Publications (1)

Publication Number Publication Date
CN117132704A true CN117132704A (en) 2023-11-28

Family

ID=88861769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310643012.5A Pending CN117132704A (en) 2023-06-01 2023-06-01 Three-dimensional reconstruction method of dynamic structured light, system and computing equipment thereof

Country Status (1)

Country Link
CN (1) CN117132704A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117914656A (en) * 2024-03-13 2024-04-19 北京航空航天大学 End-to-end communication system design method based on neural network
CN117914656B (en) * 2024-03-13 2024-05-10 北京航空航天大学 End-to-end communication system design method based on neural network

Similar Documents

Publication Publication Date Title
CN104867135B (en) A kind of High Precision Stereo matching process guided based on guide image
US20200057831A1 (en) Real-time generation of synthetic data from multi-shot structured light sensors for three-dimensional object pose estimation
Bednarik et al. Learning to reconstruct texture-less deformable surfaces from a single view
Yao et al. Neilf: Neural incident light field for physically-based material estimation
CN106155299B (en) A kind of pair of smart machine carries out the method and device of gesture control
CN105046743A (en) Super-high-resolution three dimensional reconstruction method based on global variation technology
CN111951372B (en) Three-dimensional face model generation method and equipment
Weng et al. Vid2actor: Free-viewpoint animatable person synthesis from video in the wild
CN111043988B (en) Single stripe projection measurement method based on graphics and deep learning
WO2021063271A1 (en) Human body model reconstruction method and reconstruction system, and storage medium
CN114152217B (en) Binocular phase expansion method based on supervised learning
CN117132704A (en) Three-dimensional reconstruction method of dynamic structured light, system and computing equipment thereof
CN109946753A (en) The coded system and method for ghost imaging are calculated based on low order Hadamard basic vector
CN114396877B (en) Intelligent three-dimensional displacement field and strain field measurement method for mechanical properties of materials
CN116468769A (en) Depth information estimation method based on image
JP2024510230A (en) Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture
CN117011478B (en) Single image reconstruction method based on deep learning and stripe projection profilometry
CN115917597A (en) Promoting 2D representations to 3D using attention models
TWI712002B (en) A 3d human face reconstruction method
CN111582310A (en) Decoding method and device of implicit structured light
Fu et al. Single-shot colored speckle pattern for high accuracy depth sensing
Peng et al. Fringe pattern inpainting based on dual-exposure fused fringe guiding CNN denoiser prior
Wu et al. Depth acquisition from dual-frequency fringes based on end-to-end learning
Deng et al. Ray Deformation Networks for Novel View Synthesis of Refractive Objects
Wang et al. Application of Structured Light 3D Reconstruction Technology in Industrial Automation Scenarios in the Context of Digital Transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination