CN115170628A - Multi-view modeling method and device based on ray implicit field and modeling equipment

Multi-view modeling method and device based on ray implicit field and modeling equipment

Info

Publication number
CN115170628A
Authority
CN
China
Prior art keywords
view
fusion
light
obtaining
dimensional
Prior art date
Legal status
Pending
Application number
CN202210768179.XA
Other languages
Chinese (zh)
Inventor
徐凯
惠军华
施逸飞
蔡志平
陈垚
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210768179.XA
Publication of CN115170628A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The application relates to a multi-view modeling method, apparatus and modeling device based on a ray implicit field. The method obtains an initial depth map of the reference view among the multiple views from the three-dimensional features of a cost volume constructed from the camera parameters and the multi-view features; projects a group of rays along the camera viewing direction of the reference view; obtains the initial depth of each ray from the initial depth map and samples uniformly within a preset range around that initial depth to obtain a plurality of sampling points per ray; obtains the multi-view fusion feature of each sampling point from the correlation among its multi-view features computed by the self-attention mechanism of the epipolar perceptron; superimposes the multi-view fusion feature and the three-dimensional feature of the cost volume to obtain the fusion feature of each sampling point; feeds the fusion features into a sequence model to predict the depth value of the corresponding ray implicit field; and performs multi-view modeling from the accurate depth map obtained from the ray depth values. The ray-based depth estimation of the method is simpler and lighter.

Description

Multi-view modeling method and device based on ray implicit field and modeling equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a multi-view modeling method, apparatus, and modeling device based on a ray implicit field.
Background
Since the advent of the pioneering work MVSNet, learning-based multi-view reconstruction has received much attention. The core idea of MVSNet and most subsequent work is to construct a three-dimensional cost volume in the frustum of the reference view by warping the image features of multiple source views onto a set of fronto-parallel planes at hypothesized depths, and then to apply 3D convolutions to the cost volume to extract 3D geometric features and regress the final depth map of the reference view.
Since 3D convolution typically has large computational and memory consumption, most existing methods are limited to low-resolution cost volumes. Recent work improves the resolution of the output depth map by denser sampling or by refining the cost volume; however, such improvements still require a trade-off between depth resolution and spatial (image) resolution. For example, CasMVSNet narrows the depth range so that the high-resolution depth map matches the spatial resolution of the input RGB image, but this also limits the three-dimensional convolution to a narrow band, reducing the efficiency of three-dimensional feature learning.
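For orientation, the plane-sweep warping at the heart of this cost-volume pipeline can be sketched as follows. This is a minimal PyTorch-style illustration; the function name, tensor layout and camera conventions are assumptions of the sketch, not details taken from MVSNet.

```python
import torch
import torch.nn.functional as F

def warp_src_features(src_feat, K_ref, K_src, R, t, depth_hypotheses):
    """Warp source-view features onto fronto-parallel planes of the reference
    frustum, one slice per hypothesised depth (plane-sweep warping).

    src_feat:          (C, H, W) source-view feature map
    K_ref, K_src:      (3, 3) intrinsics of reference and source cameras
    R, t:              (3, 3) rotation and (3, 1) translation from reference to source frame
    depth_hypotheses:  (D,) sampled depths
    returns:           (D, C, H, W) warped features
    """
    C, H, W = src_feat.shape
    D = depth_hypotheses.shape[0]
    # Pixel grid of the reference view in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)   # (3, H*W)
    # Back-project each pixel to every depth plane, then project into the source view.
    cam = torch.linalg.inv(K_ref) @ pix                                      # (3, H*W)
    cam = cam.unsqueeze(0) * depth_hypotheses.view(D, 1, 1)                  # (D, 3, H*W)
    src = K_src @ (R @ cam + t)                                              # (D, 3, H*W)
    uv = src[:, :2] / src[:, 2:3].clamp(min=1e-6)                            # (D, 2, H*W)
    # Normalise pixel coordinates to [-1, 1] for grid_sample (x first, then y).
    uv[:, 0] = uv[:, 0] / (W - 1) * 2 - 1
    uv[:, 1] = uv[:, 1] / (H - 1) * 2 - 1
    grid = uv.permute(0, 2, 1).reshape(D, H, W, 2)
    warped = F.grid_sample(src_feat.expand(D, -1, -1, -1), grid,
                           align_corners=True, padding_mode="zeros")
    return warped                                                            # (D, C, H, W)
```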
Disclosure of Invention
Based on this, it is necessary to provide a multi-view modeling method, apparatus and modeling device based on ray implicit field to improve the speed and accuracy of multi-view modeling.
A method of multi-view modeling based on a ray implicit field, the method comprising:
constructing a cost volume according to camera parameters and the two-dimensional characteristics of multiple views, and obtaining an initial depth map of a reference view according to the three-dimensional characteristics of the cost volume; wherein the multiple views include a reference view and a plurality of source views;
projecting a group of light rays from the camera view finding direction of the reference view, obtaining the initial depth of each light ray according to the initial depth map, and uniformly sampling in the preset range of the initial depth of each light ray respectively to obtain a plurality of sampling points corresponding to each light ray;
obtaining matching correlation among the two-dimensional characteristics of multiple views of each sampling point through a self-attention mechanism layer of the polar line perceptron, obtaining multi-view fusion characteristics of each sampling point according to the matching correlation, and superposing the multi-view fusion characteristics and the three-dimensional characteristics of the cost volume to obtain fusion characteristics of each sampling point;
sequentially inputting the fusion characteristics of the sampling points into a pre-trained sequence model to obtain the sequence characteristics of the sampling points and the corresponding whole light characteristics, and predicting according to the light characteristics to obtain the depth value of the corresponding light implicit field;
and obtaining an accurate depth map of the reference view according to the depth values of all the ray implicit fields, and performing multi-view modeling according to the accurate depth map.
In one embodiment, the matching correlation between the two-dimensional features of the multiple views at each sampling point, obtained through the self-attention mechanism layer of the epipolar perceptron, is:

S = SelfAttention(Q, K, V) = Softmax(QK^T)V
Q = XW_Q
K = XW_K
V = XW_V
X = [X^1, ..., X^N]

wherein S is the matching correlation score, Q is the query vector, K is the key vector, V is the value vector, X is the input multi-view two-dimensional feature, W_Q, W_K, W_V are respectively the weights of the query vector, the key vector and the value vector learned by the self-attention mechanism layer, X^p is the two-dimensional feature of the multiple views at the p-th sampling point, N is the number of sampling points, and I is the number of views of the multiple views.

The multi-view fusion feature of each sampling point is obtained from the matching correlation as:

Z = AddNorm(X) = LayerNorm(X + S)

where LayerNorm(·) is the layer normalization function.
In one embodiment, the epipolar perceptron comprises 4 self-attention mechanism layers; each self-attention mechanism layer is followed by 2 AddNorm layers and 1 feed-forward layer.
In one embodiment, the fusion feature of each sampling point, obtained by superimposing the multi-view fusion feature and the three-dimensional feature of the cost volume, is:

F_p = Z_p ⊕ C_p

wherein Z_p is the multi-view fusion feature of the sampling point, C_p is the three-dimensional feature of the cost volume at the sampling point, F_p is the fusion feature of the sampling point, and ⊕ denotes feature concatenation.
In one embodiment, the fusion features of the sampling points are sequentially input into a pre-trained sequence model, and the corresponding whole-ray feature is obtained as:

z = tanh(W[h_{k-1}, F_k] + b)
z_f = σ(W_f[h_{k-1}, F_k] + b_f)
z_u = σ(W_u[h_{k-1}, F_k] + b_u)
z_o = σ(W_o[h_{k-1}, F_k] + b_o)
c_k = z_f ∘ c_{k-1} + z_u ∘ z
h_k = z_o ∘ tanh(c_k)

wherein F_k is the sequence feature of the k-th sampling point, h_{k-1} is the (k-1)-th hidden state, z is the cell input activation vector, z_f is the forget-gate activation vector, z_u is the update-gate activation vector, z_o is the output-gate activation vector, c_k is the ray feature prediction output at the k-th step, σ(·) is the sigmoid function, W, W_f, W_u, W_o are respectively the weight matrices of the cell input, the forget gate, the update gate and the output gate, b, b_f, b_u, b_o are the corresponding bias vectors, and ∘ denotes the element-wise (Hadamard) product.
In one embodiment, the step of training the sequence model comprises:
and (3) predicting the depth value of the light implicit field by adopting a multilayer perceptron with light characteristics as input:
l=MLP l (c K )
wherein MLP is a multilayer perceptron, c K For the output ray feature prediction value, l is the depth value of the ray implicit field.
And (3) taking the light characteristic predicted value output by the current moment k, the sequence characteristic of the sampling point and the depth value predicted by the current moment k as input, and predicting the symbolic distance of the sampling point on the light by adopting a multilayer perceptron:
Figure BDA0003726408340000037
Figure BDA0003726408340000038
wherein,
Figure BDA0003726408340000039
in order to be a normalized depth value,
Figure BDA00037264083400000310
for normalized symbol distance, s max Is the maximum symbol distance on the ray;
constructing a loss function of the sequence model from the predicted depth value and signed distance:

L = w_s L_s + w_l L_l + w_sl L_sl
L_s = Σ_{k=1}^{K} L1(ŝ_k - s_k)
L_l = L1(l̂ - l)

wherein L is the loss function of the sequence model, L_s is the loss function of the signed distance, L_l is the loss function of the depth value, L_sl is the consistency penalty loss function, L1 is the L1 norm, s_k is the ground-truth signed distance, ŝ_k is the predicted signed distance, l is the ground-truth depth value, and l̂ is the predicted depth value;
and obtaining a trained sequence model by optimizing the loss function.
A multi-view modeling apparatus based on ray implicit fields, the apparatus comprising:
the initial depth map acquisition module is used for constructing a cost volume according to camera parameters and the two-dimensional characteristics of multiple views and obtaining an initial depth map of a reference view according to the three-dimensional characteristics of the cost volume; wherein the multiple views include a reference view and a plurality of source views;
the sampling module is used for projecting a group of light rays from the camera view finding direction of the reference view, obtaining the initial depth of each light ray according to the initial depth map, uniformly sampling in the preset range of the initial depth of each light ray respectively, and obtaining a plurality of sampling points corresponding to each light ray;
the fusion characteristic acquisition module is used for acquiring matching correlation among the two-dimensional characteristics of the multiple views of each sampling point through an attention mechanism layer of the polar line sensor, acquiring the multi-view fusion characteristics of each sampling point according to the matching correlation, and superposing the multi-view fusion characteristics and the three-dimensional characteristics of the cost volume to acquire the fusion characteristics of each sampling point;
the prediction module is used for sequentially inputting the fusion characteristics of the sampling points into a pre-trained sequence model to obtain the sequence characteristics of the sampling points and the corresponding whole light characteristics, and predicting the depth value of the corresponding light implicit field according to the light characteristics;
and the multi-view modeling module is used for obtaining an accurate depth map of the reference view according to the depth values of all the ray implicit fields and carrying out multi-view modeling according to the accurate depth map.
A modelling apparatus comprising a memory and a processor, the memory storing a computer program which when executed by the processor effects the steps of:
constructing a cost volume according to camera parameters and the two-dimensional characteristics of multiple views, and obtaining an initial depth map of a reference view according to the three-dimensional characteristics of the cost volume; wherein the multiple views include a reference view and a plurality of source views;
projecting a group of light rays from the camera view finding direction of the reference view, obtaining the initial depth of each light ray according to the initial depth map, and uniformly sampling in the preset range of the initial depth of each light ray respectively to obtain a plurality of sampling points corresponding to each light ray;
obtaining matching correlation among the two-dimensional characteristics of multiple views of each sampling point through a self-attention mechanism layer of the polar line perceptron, obtaining multi-view fusion characteristics of each sampling point according to the matching correlation, and superposing the multi-view fusion characteristics and the three-dimensional characteristics of the cost volume to obtain fusion characteristics of each sampling point;
sequentially inputting the fusion characteristics of each sampling point into a pre-trained sequence model to obtain the sequence characteristics of each sampling point and the corresponding whole light characteristics, and predicting the depth value of the corresponding light implicit field according to the light characteristics;
and obtaining an accurate depth map of the reference view according to the depth values of all the ray implicit fields, and performing multi-view modeling according to the accurate depth map.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
constructing a cost volume according to camera parameters and the two-dimensional characteristics of multiple views, and obtaining an initial depth map of a reference view according to the three-dimensional characteristics of the cost volume; wherein the multiple views include a reference view and a plurality of source views;
projecting a group of light rays from the camera view finding direction of the reference view, obtaining the initial depth of each light ray according to the initial depth map, and uniformly sampling in the preset range of the initial depth of each light ray respectively to obtain a plurality of sampling points corresponding to each light ray;
obtaining matching correlation among the two-dimensional characteristics of multiple views of each sampling point through a self-attention mechanism layer of the polar line perceptron, obtaining multi-view fusion characteristics of each sampling point according to the matching correlation, and superposing the multi-view fusion characteristics and the three-dimensional characteristics of the cost volume to obtain fusion characteristics of each sampling point;
sequentially inputting the fusion characteristics of the sampling points into a pre-trained sequence model to obtain the sequence characteristics of the sampling points and the corresponding whole light characteristics, and predicting according to the light characteristics to obtain the depth value of the corresponding light implicit field;
and obtaining an accurate depth map of the reference view according to the depth values of all the ray implicit fields, and performing multi-view modeling according to the accurate depth map.
Prior-art efforts to adaptively optimize the cost volume run into the problem of limited output depth resolution. In fact, the depth map is view dependent while the cost volume is view independent; since the goal is a depth map, refining the cost volume is neither economical nor necessary, and a significant portion of the cost volume is not even visible from the reference view. The present application therefore provides a solution that directly optimizes depth values along camera rays, specifically comprising: first obtaining an initial depth map of the reference view among the multiple views from the three-dimensional features of a cost volume constructed from the camera parameters and the multi-view features; then projecting a group of rays along the camera viewing direction of the reference view, obtaining the initial depth of each ray from the initial depth map, and sampling uniformly within a preset range around the initial depth to obtain a plurality of sampling points for each ray; then obtaining the multi-view fusion feature of each sampling point from the matching correlation among its multi-view features computed by the self-attention mechanism layer of the epipolar perceptron, and superimposing the multi-view fusion feature and the three-dimensional feature of the cost volume to obtain the fusion feature of each sampling point; inputting the fusion features into the trained sequence model to predict the depth value of the corresponding ray implicit field; and finally performing multi-view modeling from the accurate depth map obtained from the ray depth values.
Compared with estimating depth on a three-dimensional cost volume, the method has the following advantages:
1. because the depth map depends on the view, the ray-based depth optimization is simpler and lighter;
2. the one-dimensional implicit fields of all the rays have the same spatial characteristics, and the learning of the sequence model is simplified and standardized, so that efficient network training and accurate depth estimation are realized.
Drawings
FIG. 1 is a schematic flow chart of a multi-view modeling method based on implicit ray field in one embodiment;
FIG. 2 is a schematic diagram of the generation of sample points in one embodiment;
FIG. 3 is a diagram illustrating the computation of de-noised view features in one embodiment;
FIG. 4 is a schematic diagram of a network architecture of a recurrent neural network LSTM in one embodiment;
FIG. 5 is a diagram illustrating a preferred implementation of the implicit ray field-based multi-view modeling method in one embodiment;
FIG. 6 is a diagram illustrating fusion of transform features in one embodiment;
FIG. 7 is a graphical comparison of visual results of a surface reconstructed by the present method and a comparative method in one embodiment.
FIG. 8 is a block diagram of an apparatus for multi-view modeling based on implicit fields of light in one embodiment;
FIG. 9 is a diagram of an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a multi-view modeling method based on ray implicit field, comprising the following steps:
and 102, constructing a cost volume according to the camera parameters and the two-dimensional characteristics of the multiple views, and obtaining an initial depth map of the reference image according to the three-dimensional characteristics of the cost volume.
Wherein the multiple views comprise a reference view and a plurality of source views; each time, the method takes a reference image I_1 and N-1 source images {I_i}, i = 2, ..., N, as input.
The depth map encodes the distance between each point in the scene and the camera. The method starts by constructing a three-dimensional cost volume and estimating an initial depth map; this initial depth map is coarse and inaccurate and is refined by the subsequent operations.
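A minimal sketch of this step is given below, assuming the source-view features have already been warped onto the depth planes of the reference frustum; the variance cost, the 3D regularizer interface and the soft-argmin depth regression follow the standard cost-volume recipe, and all tensor shapes are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def initial_depth_from_cost_volume(warped_feats, depth_hypotheses, regularizer):
    """Build a variance-based cost volume from the warped multi-view features
    and regress a coarse (initial) depth map by a soft-argmin over depth.

    warped_feats:      (V, D, C, H, W) features of V views warped onto D planes
    depth_hypotheses:  (D,) sampled depths
    regularizer:       a 3D CNN (e.g. a 3D U-Net) mapping (1, C, D, H, W) to (1, 1, D, H, W)
    returns:           (H, W) initial depth map and the (1, C, D, H, W) cost volume
    """
    cost = warped_feats.var(dim=0, unbiased=False)            # (D, C, H, W) matching cost
    cost = cost.permute(1, 0, 2, 3).unsqueeze(0)              # (1, C, D, H, W)
    score = regularizer(cost).squeeze(0).squeeze(0)           # (D, H, W) regularized volume
    prob = F.softmax(score, dim=0)                            # probability along depth
    depth = (prob * depth_hypotheses.view(-1, 1, 1)).sum(0)   # soft-argmin expectation
    return depth, cost
```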
Step 104, projecting a group of rays along the camera viewing direction of the reference view, obtaining the initial depth of each ray according to the initial depth map, and uniformly sampling within the preset range of the initial depth of each ray to obtain a plurality of sampling points corresponding to each ray.
Wherein the number of rays projected is determined by the resolution required to recover the accurate depth map obtained from the final optimization.
As shown in fig. 2, any one of the rays is selected for detailed description. Ideally, as many sampling points as possible could be generated along the ray, but most of those points would lie far from the surface and contribute little to depth estimation. In order to make the subsequent recurrent neural network easier to train, the method obtains the initial depth of the selected ray from the initial depth map and uniformly samples K points {x_k}, k = 1, ..., K, within a range of ±δ around the initial (rough) depth estimated along the ray, where K is the number of sampling points; for example, in fig. 2, K = 6.
That is, the method limits the modeling of each ray to within ±δ, a fixed-length range centered on the estimated surface intersection computed by an existing multi-view modeling method.
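A minimal sketch of this sampling step (the function name and tensor layout are assumptions of the sketch):

```python
import torch

def sample_points_along_rays(cam_center, ray_dirs, init_depth, delta, K=16):
    """Uniformly sample K points per ray inside +/- delta around the initial
    depth estimated for that ray from the coarse depth map.

    cam_center:  (3,) reference-camera centre in world coordinates
    ray_dirs:    (R, 3) unit viewing directions of the R projected rays
    init_depth:  (R,) initial depth of each ray read from the coarse depth map
    delta:       half-width of the sampling range (e.g. 20 mm on DTU)
    returns:     sample depths (R, K) and sample positions (R, K, 3)
    """
    steps = torch.linspace(-delta, delta, K)                      # (K,)
    depths = init_depth.unsqueeze(1) + steps.unsqueeze(0)         # (R, K)
    points = cam_center.view(1, 1, 3) + depths.unsqueeze(-1) * ray_dirs.unsqueeze(1)
    return depths, points                                         # (R, K), (R, K, 3)
```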
Step 106, obtaining the matching correlation among the two-dimensional features of the multiple views of each sampling point through the self-attention mechanism layer of the epipolar perceptron, obtaining the multi-view fusion feature of each sampling point according to the matching correlation, and superimposing the multi-view fusion feature and the three-dimensional feature of the cost volume to obtain the fusion feature of each sampling point.
A simple way to gather features at a sampling point p is to project p into each view, extract the image features there and take their variance, but image features are susceptible to defects such as specular reflection and illumination changes. Naive variance-based averaging of the multi-view image features can therefore lead to unreliable features and inaccurate cross-view feature correlation.
To solve this problem, the method provides an epipolar-aware approach that learns cross-view feature correlation with a self-attention mechanism. The epipolar perceptron learns the importance of the features from the multiple views, yielding a denoised multi-view fusion feature for each sampling point and reducing the influence of image defects.
To further improve feature quality and make the search along the ray easy to learn, the method concatenates the denoised multi-view feature with the three-dimensional feature of the cost volume to obtain the fusion feature of each sampling point; the fusion features are then assembled into a one-dimensional implicit field along the corresponding ray and fed to the prediction model, so that an accurate ray depth value can be obtained.
Step 108, sequentially inputting the fusion features of each sampling point into a pre-trained sequence model to obtain the sequence features of each sampling point and the corresponding whole-ray feature, and predicting the depth value of the corresponding ray implicit field according to the ray feature.
The method casts the depth estimation of each ray as a one-dimensional implicit field learned along the ray. Firstly, because the depth map depends on the view, the ray-based depth optimization is simpler and lighter; secondly, the one-dimensional implicit fields of all rays share the same spatial characteristics, which simplifies and normalizes learning, so that efficient network training and accurate prediction are achieved.
The sequence features of the sampling points are taken as input one by one to obtain the ray feature at the corresponding step, until the final ray feature is output.
Step 110, obtaining an accurate depth map of the reference view according to the depth values of all the ray implicit fields, and performing multi-view modeling according to the accurate depth map.
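For illustration, the refined depth map can be back-projected into a point cloud as the per-view input to the final multi-view modeling; the sketch below assumes a pinhole camera model and a camera-to-world extrinsic matrix.

```python
import torch

def depth_map_to_points(depth, K_ref, cam_to_world):
    """Back-project the refined depth map of the reference view into a 3D
    point cloud for subsequent fusion / surface modeling.

    depth:         (H, W) refined depth map
    K_ref:         (3, 3) reference-camera intrinsics
    cam_to_world:  (4, 4) camera-to-world extrinsic matrix
    returns:       (H*W, 3) world-space points
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)   # (3, H*W)
    cam = torch.linalg.inv(K_ref) @ pix * depth.reshape(1, -1)               # (3, H*W)
    cam_h = torch.cat([cam, torch.ones(1, cam.shape[1])], dim=0)             # (4, H*W)
    world = (cam_to_world @ cam_h)[:3]                                       # (3, H*W)
    return world.t()
```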
According to the multi-view modeling method based on the ray implicit field, an initial depth map of the reference view among the multiple views is obtained from the three-dimensional features of a cost volume constructed from the camera parameters and the multi-view features; a group of rays is projected along the camera viewing direction of the reference view; the initial depth of each ray is obtained from the initial depth map, and uniform sampling is performed within the preset range of each initial depth to obtain a plurality of sampling points corresponding to each ray; the multi-view fusion feature of each sampling point is then obtained from the matching correlation among its multi-view features computed by the self-attention mechanism layer of the epipolar perceptron, and is superimposed with the three-dimensional feature of the cost volume to obtain the fusion feature of each sampling point; the fusion features are input into the trained sequence model to predict the depth value of the corresponding ray implicit field; and finally multi-view modeling is performed on the accurate depth map obtained from the depth values of the rays. Compared with estimating depth on a three-dimensional cost volume, the method has the following advantages: because the depth map depends on the view, the ray-based depth optimization is simpler and lighter; and the one-dimensional implicit fields of all rays share the same spatial characteristics, which simplifies and normalizes the learning of the sequence model, so that efficient network training and accurate depth estimation are achieved.
In one embodiment, before the cost volume is constructed from the camera parameters and the two-dimensional features of the multiple views, a two-dimensional convolutional network is used to extract the two-dimensional features of the reference view and of each source view; the two-dimensional convolutional network may be a 2D U-Net.
In one embodiment, before the initial depth map of the reference view is obtained from the three-dimensional features of the cost volume, the three-dimensional features of the cost volume are extracted by a three-dimensional convolutional network; the three-dimensional convolutional network may be a 3D U-Net.
Preferably, as shown in fig. 3, the matching correlation between the two-dimensional features of the multiple views at each sampling point, obtained through the self-attention mechanism layer of the epipolar perceptron, is:

S = SelfAttention(Q, K, V) = Softmax(QK^T)V
Q = XW_Q
K = XW_K
V = XW_V
X = [X^1, ..., X^N]

wherein S is the matching correlation score, Q is the query vector, K is the key vector, V is the value vector, X is the input multi-view two-dimensional feature, W_Q, W_K, W_V are respectively the weights of the query vector, the key vector and the value vector learned by the self-attention mechanism layer, X^p is the two-dimensional feature of the multiple views at the p-th sampling point, N is the number of sampling points, and I is the number of views of the multiple views.

The multi-view fusion feature of each sampling point is obtained from the matching correlation as:

Z = AddNorm(X) = LayerNorm(X + S)

where LayerNorm(·) is the layer normalization function.
Specifically, the network architecture of the epipolar transformer (epipolar perceptron) comprises four self-attention layers, each followed by two AddNorm layers and one feed-forward layer.
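A minimal sketch of one such block, with the AddNorm and feed-forward sub-layers wired as described above; the feature dimensions are assumptions taken from the implementation details further below.

```python
import torch
import torch.nn as nn

class EpipolarAttentionBlock(nn.Module):
    """One self-attention layer of the epipolar perceptron followed by its
    AddNorm and feed-forward sub-layers; the full perceptron stacks four blocks.
    """
    def __init__(self, dim=8, hidden=32):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.k = nn.Linear(dim, dim, bias=False)   # W_K
        self.v = nn.Linear(dim, dim, bias=False)   # W_V
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (N, I, C) -- per-sample-point two-dimensional features of the I views.
        Q, K, V = self.q(x), self.k(x), self.v(x)
        S = torch.softmax(Q @ K.transpose(-2, -1), dim=-1) @ V   # cross-view correlation
        z = self.norm1(x + S)                                    # AddNorm after attention
        return self.norm2(z + self.ffn(z))                       # AddNorm after feed-forward
```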
In one embodiment, the fusion feature of each sampling point, obtained by superimposing the multi-view fusion feature and the three-dimensional feature of the cost volume, is:

F_p = Z_p ⊕ C_p

wherein Z_p is the multi-view fusion feature of the sampling point, C_p is the three-dimensional feature of the cost volume at the sampling point, F_p is the fusion feature of the sampling point, and ⊕ denotes feature concatenation.
Preferably, the fusion features of each sampling point are sequentially input into the pre-trained sequence model, and the corresponding whole-ray feature is obtained as:

z = tanh(W[h_{k-1}, F_k] + b)
z_f = σ(W_f[h_{k-1}, F_k] + b_f)
z_u = σ(W_u[h_{k-1}, F_k] + b_u)
z_o = σ(W_o[h_{k-1}, F_k] + b_o)
c_k = z_f ∘ c_{k-1} + z_u ∘ z
h_k = z_o ∘ tanh(c_k)

wherein F_k is the sequence feature of the k-th sampling point, h_{k-1} is the (k-1)-th hidden state, z is the cell input activation vector, z_f is the forget-gate activation vector, z_u is the update-gate activation vector, z_o is the output-gate activation vector, c_k is the ray feature prediction output at the k-th step, σ(·) is the sigmoid function, W, W_f, W_u, W_o are respectively the weight matrices of the cell input, the forget gate, the update gate and the output gate, b, b_f, b_u, b_o are the corresponding bias vectors, and ∘ denotes the element-wise (Hadamard) product.
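A minimal sketch of this recurrence using an LSTM cell; the feature length (32) and hidden size (50) follow the implementation details given further below, and the module name and wiring are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class RayLSTM(nn.Module):
    """Sequence model that consumes the K fused features of one ray in order
    and produces the per-step hidden states and the final ray feature c_K."""
    def __init__(self, feat_dim=32, hidden=50):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)

    def forward(self, fused):                     # fused: (K, feat_dim), ordered along the ray
        K = fused.shape[0]
        h = fused.new_zeros(self.cell.hidden_size)
        c = fused.new_zeros(self.cell.hidden_size)
        hs = []
        for k in range(K):                        # recurrent update per sampling point
            h, c = self.cell(fused[k].unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
            h, c = h.squeeze(0), c.squeeze(0)
            hs.append(h)
        return torch.stack(hs), c                 # per-step states and final ray feature c_K
```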
Preferably, the step of training the sequence model comprises:
predicting the depth value of the ray implicit field with a multilayer perceptron that takes the ray feature as input:

l = MLP_l(c_K)

wherein MLP_l is a multilayer perceptron, c_K is the output ray feature prediction value, and l is the depth value of the ray implicit field;

taking the ray feature prediction c_k output at the current step k, the sequence feature F_k of the sampling point and the currently predicted depth value as input, predicting the signed distance of the sampling point on the ray with a multilayer perceptron:

ŝ_k = MLP_s(c_k, F_k, l̂)

wherein l̂ is the normalized depth value, ŝ_k is the normalized signed distance, normalized by the maximum signed distance s_max on the ray;
constructing a loss function of the sequence model from the predicted depth value and signed distance:

L = w_s L_s + w_l L_l + w_sl L_sl
L_s = Σ_{k=1}^{K} L1(ŝ_k - s_k)
L_l = L1(l̂ - l)

wherein L is the loss function of the sequence model, L_s is the loss function of the signed distance, L_l is the loss function of the depth value, L_sl is the consistency penalty loss function, L1 is the L1 norm, s_k is the ground-truth signed distance, ŝ_k is the predicted signed distance, l is the ground-truth depth value, and l̂ is the predicted depth value;
and obtaining a trained sequence model by optimizing the loss function.
The method designs two learning tasks: 1) sequence prediction of the signed distances over the sequence of points sampled within the fixed-length range (i.e., the sampling points), and 2) regression of the ray depth value. A carefully designed loss function correlates the two tasks, and this multi-task learning obtains a high-precision estimate of the intersection of each ray with the surface of the scene object.
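A minimal sketch of the multi-task loss; the weights follow the values quoted in the implementation details below, while the exact form of the consistency penalty L_sl is not given in the text, so the version shown here is only an assumed placeholder.

```python
import torch
import torch.nn.functional as F

def ray_losses(sdf_pred, sdf_gt, depth_pred, depth_gt,
               w_s=0.1, w_l=0.8, w_sl=0.1):
    """L1 loss on the signed distances of the sampling points, L1 loss on the
    ray depth, and a consistency term tying the two tasks together.

    sdf_pred, sdf_gt:      (R, K) predicted / ground-truth signed distances
    depth_pred, depth_gt:  (R,)   predicted / ground-truth ray depths
    """
    L_s = F.l1_loss(sdf_pred, sdf_gt)
    L_l = F.l1_loss(depth_pred, depth_gt)
    # Assumed placeholder for L_sl: the SDF profile of a ray should pass
    # through zero somewhere inside the sampled range.
    L_sl = sdf_pred.abs().min(dim=1).values.mean()
    return w_s * L_s + w_l * L_l + w_sl * L_sl
```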
As shown in fig. 4, a network architecture of a recurrent neural network LSTM (long short-term memory network) is provided. The sampling points on a ray are sequentially input to the LSTM, which estimates the position of the zero-crossing point (i.e., the intersection of the ray with the modeled object) and the SDF (signed distance) of each point on the ray. The one-dimensional implicit fields of all rays share the same spatial characteristic, namely the monotonicity of the SDF along the ray direction, which simplifies and normalizes learning, thereby realizing efficient network training and accurate results.
As shown in fig. 5, a schematic diagram of a preferred implementation of the present method is provided.
The first part is Multi-view feature extraction (Multi-view feature extraction), the Multi-view comprises a Reference view (Reference image) and a plurality of Source views (Source images), and 2D U-Net is adopted to perform Multi-view two-dimensional feature extraction.
And the second part is three-dimensional convolution (3D cost volume convolution) of the cost volume, three-dimensional feature extraction of the cost volume (cost volume) is carried out by adopting 3D U-Net, and a rough depth map of the reference view is obtained through the extracted three-dimensional feature.
The third part is the matching of the two-dimensional features: the multi-view fusion features are obtained through the self-attention (Self-Attention) layer and the AddNorm layer of the epipolar perceptron.
The fourth part is the depth value estimation of the one-dimensional ray-based implicit field: the fusion features, obtained by superimposing the three-dimensional features of the cost volume from the second part and the multi-view fusion features from the third part, are input into the trained sequence model, and the depth value of the corresponding ray is estimated within the previously determined sampling range ±δ. As shown in fig. 6, a schematic illustration of fusing the multi-view features is provided. The prediction of the 1D implicit field is lighter, the monotonicity of the ray-based signed distance field (SDF) around the intersection with the target surface helps robust learning, and the depth estimation of the method is more accurate than existing purely cost-volume-based methods.
And the fifth part is the optimization of a recurrent neural network, and a multi-task learning strategy is adopted to optimize the light implicit field multi-view reconstruction method. These two tasks, SDF estimation (SDF estimate) and Zero-crossing estimation (Zero-crossing position estimate) of the sample points on the ray, are essentially related.
To validate the method, we provide implementation details of training and inference. During inference, three images of size 640 × 512 are input, and the output feature size is 640 × 512 × 8. The two-dimensional convolutional network consists of 6 convolutional layers and 6 deconvolution layers, each followed by a ReLU layer except the last. The three-dimensional cost volume is fed into a three-dimensional convolutional network consisting of three-dimensional convolutional and deconvolution layers. The number of sampling points K on each ray is 16. The point-sampling range δ on DTU is 20 mm, and the features of the views and of the cost volume are gathered by bilinear and trilinear interpolation respectively. The fusion feature F_k has a length of 32. The hidden dimension of z, z_f, z_u, z_o, c_k and h_k is 50. MLP_l and MLP_s each contain 4 fully connected layers. The weights w_s, w_l, w_sl of the multi-task learning loss function are 0.1, 0.8 and 0.1 respectively. The epipolar perceptron and the LSTM are trained jointly in an end-to-end fashion. We use the Adam optimizer with an initial learning rate of 0.0005, decayed by a factor of 0.9 every two stages. Training takes 48 hours.
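The bilinear and trilinear feature lookups mentioned above can be sketched with grid_sample; the tensor layouts and the coordinate normalization are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def sample_point_features(view_feat, cost_feat, uv, xyz_norm):
    """Gather per-sample-point features: bilinear interpolation of the 2D view
    feature map at the projected pixel position, trilinear interpolation of the
    3D cost-volume features at the normalised 3D position.

    view_feat:  (1, C, H, W) 2D feature map of one view
    cost_feat:  (1, C3, D, H, W) 3D feature volume of the reference frustum
    uv:         (N, 2) projected pixel coordinates, normalised to [-1, 1] (x, y)
    xyz_norm:   (N, 3) sample positions normalised to [-1, 1] (x, y, z)
    returns:    (N, C) and (N, C3) interpolated features
    """
    # Default mode 'bilinear' gives bilinear sampling on 4D input and
    # trilinear sampling on 5D input.
    f2d = F.grid_sample(view_feat, uv.view(1, 1, -1, 2), align_corners=True)
    f3d = F.grid_sample(cost_feat, xyz_norm.view(1, 1, 1, -1, 3), align_corners=True)
    return f2d[0, :, 0].t(), f3d[0, :, 0, 0].t()
```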
To evaluate the proposed method on the DTU dataset, we use the distance metric to compare the accuracy and completeness of the final reconstruction. The quantitative results are shown in Table 1.
All methods are compared using a distance metric. Numbers are reported in millimeters.
It can be seen that the present method not only produces competitive results in terms of accuracy and completeness, but also achieves state-of-the-art overall performance. This demonstrates the effectiveness of the proposed ray implicit field, particularly in balancing accuracy and completeness. A qualitative comparison is shown in fig. 7, which compares the visual results of surfaces reconstructed by the present method and the comparative methods. The results show that the method achieves high-quality shape reconstruction in a variety of scenes. In particular, our approach outperforms the others in scenes with texture-less areas, occlusions and complex geometry; note the challenging regions highlighted in the figure.
TABLE 1 quantitative results on DTU data set
The method is also compared with baselines on the Tanks & Temples dataset. The network trained on the DTU dataset is used to test on the Tanks & Temples scenes without any fine-tuning. The F-score is used as the evaluation index. The quantitative results are shown in Table 2. The results show that the method has the best overall performance, demonstrating the good generality of epipolar-aware fusion and zero-crossing estimation based on the ray implicit field in large-scale scenes.
TABLE 2 Performance results on the Tanks & Temples dataset
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least a portion of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a multi-view modeling apparatus based on a ray implicit field, including: the system comprises an initial depth map acquisition module, a sampling module, a fusion feature acquisition module, a prediction module and a multi-view modeling module, wherein:
the initial depth map acquisition module is used for constructing a cost volume according to the camera parameters and the two-dimensional characteristics of the multiple views and obtaining an initial depth map of a reference view according to the three-dimensional characteristics of the cost volume; wherein the multiple views include a reference view and a plurality of source views;
the sampling module is used for projecting a group of light rays from the camera view finding direction of the reference view, obtaining the initial depth of each light ray according to the initial depth map, uniformly sampling in the preset range of the initial depth of each light ray respectively, and obtaining a plurality of sampling points corresponding to each light ray;
the fusion characteristic acquisition module is used for acquiring matching correlation among the two-dimensional characteristics of the multiple views of each sampling point through a self-attention mechanism layer of the polar line sensor, acquiring the multi-view fusion characteristics of each sampling point according to the matching correlation, and superposing the multi-view fusion characteristics and the three-dimensional characteristics of the cost volume to acquire the fusion characteristics of each sampling point;
the prediction module is used for sequentially inputting the fusion characteristics of each sampling point into a pre-trained sequence model to obtain the sequence characteristics of each sampling point and the corresponding whole light characteristics, and predicting the depth value of the corresponding light implicit field according to the light characteristics;
and the multi-view modeling module is used for obtaining an accurate depth map of the reference view according to the depth values of all the ray implicit fields and carrying out multi-view modeling according to the accurate depth map.
In one embodiment, the fusion feature acquisition module is further configured to obtain, through the self-attention mechanism layer of the epipolar perceptron, the matching correlation among the two-dimensional features of the multiple views of each sampling point as follows:

S = SelfAttention(Q, K, V) = Softmax(QK^T)V
Q = XW_Q
K = XW_K
V = XW_V
X = [X^1, ..., X^N]

wherein S is the matching correlation score, Q is the query vector, K is the key vector, V is the value vector, X is the input multi-view two-dimensional feature, W_Q, W_K, W_V are respectively the weights of the query vector, the key vector and the value vector learned by the self-attention mechanism layer, X^p is the two-dimensional feature of the multiple views at the p-th sampling point, N is the number of sampling points, and I is the number of views of the multiple views.
In one embodiment, the prediction module is further configured to concatenate the denoised multi-view feature and the three-dimensional feature of the cost volume to obtain the fusion feature of each sampling point as:

F_p = Z_p ⊕ C_p

wherein Z_p is the multi-view fusion feature of the sampling point, C_p is the three-dimensional feature of the cost volume at the sampling point, F_p is the fusion feature of the sampling point, and ⊕ denotes feature concatenation.
In one embodiment, the prediction module is further configured to sequentially input the fusion features of each sampling point into the pre-trained sequence model to obtain the corresponding whole-ray feature as follows:

z = tanh(W[h_{k-1}, F_k] + b)
z_f = σ(W_f[h_{k-1}, F_k] + b_f)
z_u = σ(W_u[h_{k-1}, F_k] + b_u)
z_o = σ(W_o[h_{k-1}, F_k] + b_o)
c_k = z_f ∘ c_{k-1} + z_u ∘ z
h_k = z_o ∘ tanh(c_k)

wherein F_k is the sequence feature of the k-th sampling point, h_{k-1} is the (k-1)-th hidden state, z is the cell input activation vector, z_f is the forget-gate activation vector, z_u is the update-gate activation vector, z_o is the output-gate activation vector, c_k is the ray feature prediction output at the k-th step, σ(·) is the sigmoid function, W, W_f, W_u, W_o are respectively the weight matrices of the cell input, the forget gate, the update gate and the output gate, b, b_f, b_u, b_o are the corresponding bias vectors, and ∘ denotes the element-wise (Hadamard) product.
For specific definition of the multi-view modeling apparatus based on the ray-implicit field, reference may be made to the definition of the multi-view modeling method based on the ray-implicit field, and details are not repeated here. The modules in the multi-view modeling apparatus based on implicit ray field can be implemented in whole or in part by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of multi-view modeling based on implicit fields of light. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a modeling apparatus is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A multi-view modeling method based on a ray implicit field is characterized by comprising the following steps:
constructing a cost volume according to camera parameters and two-dimensional features of multiple views, and obtaining an initial depth map of a reference view according to the three-dimensional features of the cost volume; wherein the multiple views include a reference view and a plurality of source views;
projecting a group of light rays from the camera viewing direction of the reference view, obtaining the initial depth of each light ray according to the initial depth map, and uniformly sampling in the preset range of the initial depth of each light ray respectively to obtain a plurality of sampling points corresponding to each light ray;
obtaining matching correlation among the two-dimensional features of the multiple views of each sampling point through a self-attention mechanism layer of the polar line perceptron, obtaining the multi-view fusion features of each sampling point according to the matching correlation, and superposing the multi-view fusion features and the three-dimensional features of the cost volume to obtain the fusion features of each sampling point;
sequentially inputting the fusion characteristics of each sampling point into a pre-trained sequence model to obtain the sequence characteristics of each sampling point and the corresponding whole light characteristics, and predicting the depth value of the corresponding light implicit field according to the light characteristics;
and obtaining an accurate depth map of the reference view according to the depth values of all the ray implicit fields, and performing multi-view modeling according to the accurate depth map.
2. The method according to claim 1, wherein the obtaining, through a self-attention mechanism layer of an epipolar perceptron, a matching correlation between two-dimensional features of multiple views of each sampling point, and obtaining a multi-view fused feature of each sampling point according to the matching correlation comprises:
the matching correlation among the two-dimensional features of the multiple views at each sampling point, obtained through the self-attention mechanism layer of the epipolar perceptron, is:

S = SelfAttention(Q, K, V) = Softmax(QK^T)V
Q = XW_Q
K = XW_K
V = XW_V
X = [X^1, ..., X^N]

wherein S is the matching correlation score, Q is the query vector, K is the key vector, V is the value vector, X is the input multi-view two-dimensional feature, W_Q, W_K, W_V are respectively the weights of the query vector, the key vector and the value vector learned by the self-attention mechanism layer, X^p is the two-dimensional feature of the multiple views at the p-th sampling point, N is the number of sampling points, and I is the number of views of the multiple views;

the multi-view fusion feature of each sampling point is obtained from the matching correlation as:

Z = AddNorm(X) = LayerNorm(X + S)

where LayerNorm(·) is the layer normalization function.
3. The method according to claim 1, wherein the epipolar perceptron comprises 4 self-attention mechanism layers; each self-attention mechanism layer is followed by 2 AddNorm layers and 1 feed-forward layer.
4. The method of claim 1, wherein the superimposing the multi-view fusion feature and the three-dimensional feature of the cost volume to obtain a fusion feature of each sample point comprises:
superposing the multi-view fusion characteristics and the three-dimensional characteristics of the cost volume to obtain the fusion characteristics of each sampling point as follows:
F_p = Z_p ⊕ C_p

wherein Z_p is the multi-view fusion feature of the sampling point, C_p is the three-dimensional feature of the cost volume at the sampling point, F_p is the fusion feature of the sampling point, and ⊕ denotes feature concatenation.
5. The method of claim 1, wherein the sequentially inputting the fusion features of the sampling points into a pre-trained sequence model to obtain the sequence features of the sampling points and the corresponding whole light features comprises:
and sequentially inputting the fusion characteristics of each sampling point into a pre-trained sequence model to obtain the corresponding whole light characteristic as follows:
z = tanh(W[h_{k-1}, F_k] + b)
z_f = σ(W_f[h_{k-1}, F_k] + b_f)
z_u = σ(W_u[h_{k-1}, F_k] + b_u)
z_o = σ(W_o[h_{k-1}, F_k] + b_o)
c_k = z_f ∘ c_{k-1} + z_u ∘ z
h_k = z_o ∘ tanh(c_k)

wherein F_k is the sequence feature of the k-th sampling point, h_{k-1} is the (k-1)-th hidden state, z is the cell input activation vector, z_f is the forget-gate activation vector, z_u is the update-gate activation vector, z_o is the output-gate activation vector, c_k is the ray feature prediction output at the k-th step, σ(·) is the sigmoid function, W, W_f, W_u, W_o are respectively the weight matrices of the cell input, the forget gate, the update gate and the output gate, b, b_f, b_u, b_o are the corresponding bias vectors, and ∘ denotes the element-wise (Hadamard) product.
6. The method of claim 1, wherein the step of training the sequence model comprises:
and (3) predicting the depth value of the light implicit field by adopting a multilayer perceptron with light characteristics as input:
ι=MLP l (c K )
wherein MLP is a multilayer perceptron, c K And iota is the depth value of the light implicit field.
taking the light feature prediction value c_k output at the current moment k, the sequence feature F_k of the sampling point and the depth value predicted at the current moment k as input, predicting the signed distance of the sampling point on the light by a multilayer perceptron:

ŝ_k = MLP_s(c_k, F_k, l̂)

wherein l̂ is the normalized depth value, ŝ_k is the normalized signed distance predicted by the multilayer perceptron MLP_s, and s_max is the maximum signed distance on the light, by which the signed distance is normalized;
constructing a loss function of the sequence model according to the predicted depth value and the predicted signed distance:

L = w_s·L_s + w_l·L_l + w_sl·L_sl

L_s = Σ_k L_1(s_k, ŝ_k)

L_l = L_1(l, l̂)

wherein L is the loss function of the sequence model, L_s is the loss function of the signed distance, L_l is the loss function of the depth value, L_sl is the consistency penalty loss function, w_s, w_l and w_sl are the corresponding weights, L_1 is the L_1 norm, s_k is the true value of the signed distance, ŝ_k is the predicted value of the signed distance, l is the true value of the depth, and l̂ is the predicted value of the depth;
and obtaining a trained sequence model by optimizing the loss function.
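A rough sketch of the claim-6 prediction heads and loss; the MLP widths, the zero placeholder for the consistency term L_sl and the loss weights are assumptions, since the filing does not spell them out in this text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp(in_dim: int, out_dim: int, hidden: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

depth_head = make_mlp(64, 1)          # MLP_l: whole-ray feature c_K -> depth l
sdf_head = make_mlp(64 + 40 + 1, 1)   # MLP_s: (c_k, F_k, normalized depth) -> signed distance

def sequence_loss(pred_sdf, gt_sdf, pred_depth, gt_depth,
                  w_s: float = 1.0, w_l: float = 1.0, w_sl: float = 0.1):
    """L = w_s*L_s + w_l*L_l + w_sl*L_sl with L1 terms for signed distance
    and depth; the consistency penalty L_sl is left as a placeholder."""
    loss_s = F.l1_loss(pred_sdf, gt_sdf)
    loss_l = F.l1_loss(pred_depth, gt_depth)
    loss_sl = pred_sdf.new_zeros(())   # placeholder for the consistency penalty
    return w_s * loss_s + w_l * loss_l + w_sl * loss_sl
```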
7. The method according to any one of claims 1 to 6, wherein the number of light rays is determined according to the resolution required for the finally recovered accurate depth map.
8. A multi-view modeling apparatus based on ray implicit field, the apparatus comprising:
the initial depth map acquisition module is used for constructing a cost volume according to camera parameters and the two-dimensional characteristics of multiple views and obtaining an initial depth map of a reference view according to the three-dimensional characteristics of the cost volume; wherein the multiple views include a reference view and a plurality of source views;
the sampling module is used for projecting a group of light rays along the camera viewing direction of the reference view, obtaining the initial depth of each light ray according to the initial depth map, uniformly sampling within a preset range around the initial depth of each light ray, and obtaining a plurality of sampling points corresponding to each light ray;
the fusion feature acquisition module is used for obtaining the matching correlation among the two-dimensional features of the multiple views of each sampling point through the self-attention mechanism layer of the epipolar perceptron, obtaining the multi-view fusion feature of each sampling point according to the matching correlation, and superposing the multi-view fusion feature and the three-dimensional feature of the cost volume to obtain the fusion feature of each sampling point;
the prediction module is used for sequentially inputting the fusion characteristics of the sampling points into a pre-trained sequence model to obtain the sequence characteristics of the sampling points and the corresponding whole light characteristics, and predicting the depth value of the corresponding light implicit field according to the light characteristics;
and the multi-view modeling module is used for obtaining an accurate depth map of the reference view according to the depth values of all the ray implicit fields and carrying out multi-view modeling according to the accurate depth map.
9. A modelling device comprising a memory and a processor, said memory storing a computer program, wherein the steps of the method of any one of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202210768179.XA 2022-07-01 2022-07-01 Multi-view modeling method and device based on ray implicit field and modeling equipment Pending CN115170628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210768179.XA CN115170628A (en) 2022-07-01 2022-07-01 Multi-view modeling method and device based on ray implicit field and modeling equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210768179.XA CN115170628A (en) 2022-07-01 2022-07-01 Multi-view modeling method and device based on ray implicit field and modeling equipment

Publications (1)

Publication Number Publication Date
CN115170628A true CN115170628A (en) 2022-10-11

Family

ID=83488784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210768179.XA Pending CN115170628A (en) 2022-07-01 2022-07-01 Multi-view modeling method and device based on ray implicit field and modeling equipment

Country Status (1)

Country Link
CN (1) CN115170628A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953544A (en) * 2023-03-16 2023-04-11 浪潮电子信息产业股份有限公司 Three-dimensional reconstruction method and device, electronic equipment and readable storage medium
CN115953544B (en) * 2023-03-16 2023-05-09 浪潮电子信息产业股份有限公司 Three-dimensional reconstruction method, three-dimensional reconstruction device, electronic equipment and readable storage medium
CN117557615A (en) * 2024-01-09 2024-02-13 埃洛克航空科技(北京)有限公司 Data processing method and device for light field depth estimation
CN117557615B (en) * 2024-01-09 2024-04-05 埃洛克航空科技(北京)有限公司 Data processing method and device for light field depth estimation

Similar Documents

Publication Publication Date Title
Postels et al. Sampling-free epistemic uncertainty estimation using approximated variance propagation
CN111311685B (en) Motion scene reconstruction unsupervised method based on IMU and monocular image
Chen et al. The face image super-resolution algorithm based on combined representation learning
US11182644B2 (en) Method and apparatus for pose planar constraining on the basis of planar feature extraction
CN115170628A (en) Multi-view modeling method and device based on ray implicit field and modeling equipment
CN110738697A (en) Monocular depth estimation method based on deep learning
CN112541864A (en) Image restoration method based on multi-scale generation type confrontation network model
CN111047548A (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN111696196B (en) Three-dimensional face model reconstruction method and device
Shao et al. Uncertainty guided multi-scale attention network for raindrop removal from a single image
US11010948B2 (en) Agent navigation using visual inputs
CN113012169B (en) Full-automatic image matting method based on non-local attention mechanism
CN113592715B (en) Super-resolution image reconstruction method for small sample image set
CN112435331A (en) Model training method, point cloud generating method, device, equipment and storage medium
CN112967388A (en) Training method and device for three-dimensional time sequence image neural network model
CN114937125B (en) Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium
CN115359191A (en) Object three-dimensional reconstruction system based on deep learning
CN116977872A (en) CNN+ transducer remote sensing image detection method
Sun et al. Two-stage deep regression enhanced depth estimation from a single RGB image
WO2022042865A1 (en) Lifting 2d representations to 3d using attention models
Baby et al. Face depth estimation and 3D reconstruction
CN116758419A (en) Multi-scale target detection method, device and equipment for remote sensing image
CN116758092A (en) Image segmentation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination