CN117036586A - Global feature modeling-based MPI new viewpoint synthesis method - Google Patents

Global feature modeling-based MPI new viewpoint synthesis method

Info

Publication number
CN117036586A
CN117036586A (application CN202310634252.9A)
Authority
CN
China
Prior art keywords
mpi
global
transformer
encoder
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310634252.9A
Other languages
Chinese (zh)
Inventor
霍智勇
魏俊宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310634252.9A priority Critical patent/CN117036586A/en
Publication of CN117036586A publication Critical patent/CN117036586A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

A new viewpoint synthesis method (TransMPI) for multiplane images (Multiplane Images, MPI) based on global feature modeling. In this method, the MPI generation network first captures local spatial features across multiple depth planes with a 3D encoder, improving the prediction of occluded regions in the MPI depth planes. To overcome the limitation of the 3D convolutional neural network (Convolutional Neural Network, CNN) in learning global semantic information, a Transformer self-attention mechanism is introduced: the extracted local features are combined with a Transformer encoder to model global feature representations and establish long-range dependencies in the global space. Experimental results show that by using the self-attention mechanism to learn global features together with local features between consecutive depth planes, TransMPI further improves the inference quality of the MPI scene representation and the quality of the synthesized new viewpoint images.

Description

Global feature modeling-based MPI new viewpoint synthesis method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an MPI new viewpoint synthesis method based on global feature modeling.
Background
Viewpoint synthesis from sparse unstructured or structured input images is a challenging task in computer vision. The task requires an accurate understanding of the scene: the 3D structural and semantic information of the input images must be acquired, and the scene geometry and object surface properties inferred from it. Because scenes contain partially overlapping objects and varying lighting conditions, the estimated 3D scene structure (e.g., a dense depth map) tends to be inaccurate. Changes in viewpoint position also alter the occluded regions (previously visible background becomes invisible at the new viewpoint) and disoccluded regions (previously invisible background becomes visible), so captured images often exhibit significant parallax and occlusion. In addition, because parts of the scene are dynamic and the images are often acquired asynchronously, there is frequently noticeable object motion between acquired images, which leads to more pronounced foreground-background occlusion and inconsistencies.
In the view-synthesis task, the MPI scene representation can model and synthesize seamlessly transitioning content between multiple viewpoints, effectively capturing complex scene information and even representing moving objects. The MPI representation renders complex spatial structures and dynamic scenes realistically, reduces rendering time and storage space, and enables an interactive view-synthesis experience. For example, in augmented reality (Augmented Reality, AR) applications, virtual objects created with the MPI representation can be placed in a real environment and interact with it; in game development, the MPI representation can be used to construct game scenes and generate new viewpoint images, yielding a better gaming experience.
Studies have found that increasing the number of depth planes effectively widens the range of viewpoints an MPI can reproduce and improves the quality of the rendered image. However, because the MPI is an over-parameterized representation requiring tens or even hundreds of output channels, it is difficult for a neural network to learn. As the number of depth planes grows, more global features are needed to predict the MPI scene representation accurately. Although 3D CNN-based methods have good representational capacity, the limited receptive field of convolution kernels makes it difficult to establish explicit long-range dependencies: a 3D CNN typically extracts only the local spatial features of the MPI and ignores global features between consecutive depth planes. This limitation of the convolution operation is a challenge for learning global semantic information, which is critical to the MPI prediction task.
Disclosure of Invention
Inspired by the attention mechanism in natural language processing, the invention overcomes the limitation on learning global semantic information by fusing an attention mechanism with the CNN model and establishing explicit long-range dependencies. A Transformer encoder is introduced on top of a 3D CNN network structure for global feature modeling, and an MPI new viewpoint synthesis algorithm based on the self-attention mechanism (TransMPI) is provided. The Transformer encoder module achieves high-quality inference of the MPI scene representation by learning both global and local features, further improving the quality of new viewpoint image synthesis.
The following technical scheme is adopted to solve the problems existing in the prior art:
a new MPI viewpoint synthesis method based on global feature modeling comprises the following steps:
step 1, acquiring training image data, and preprocessing the input of the MPI generation network to obtain plane sweep volumes (PSV);
step 2, inputting the training image data obtained in step 1 into the MPI generation network of the global feature modeling-based MPI new viewpoint synthesis method for training (a code sketch of this pipeline follows the technical scheme below), wherein the process comprises:
(1) 3D CNN encoder: TransMPI is built on a 3D residual encoder-decoder structure; the residual encoder of the network first downsamples the input PSV with 3D convolutions to extract spatial volume features, obtaining a compact volume feature map and effectively capturing local three-dimensional context; (2) Transformer encoder: each spatial feature is reshaped into a vector (i.e., a token), and a Transformer encoder is used to model long-range dependencies in the global space; (3) 3D CNN decoder: the feature embedding is taken from the Transformer encoder, and the feature map is restored to the same size as in the feature-encoding part by repeatedly stacking upsampling layers and convolution layers;
step 3, based on the trained generation network, inputting reference images for testing, and synthesizing the target viewpoint image I_t with the predicted alpha values and blending weights.
Further, the homography transformations used in the preprocessing of step 1 all employ the same set of depths, so that different input images can be compared to infer the scene geometry.
Further, the local spatial features obtained by the 3D CNN in step 2 are first encoded, via a linear mapping, into low-resolution/high-dimensional feature representations of the input image, and are then sent to a Transformer encoder to further learn long-range dependency modeling over the global space.
Further, the Transformer encoder in step 2 is composed of 4 Transformer layers, each of which includes two parts: a multi-head attention (MHA) module and a feed-forward network (FFN).
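For concreteness, the end-to-end flow of steps 1-3 can be summarized by the following minimal sketch (PyTorch is used for illustration; the module names, the three-channel output layout, and the sigmoid/softmax activations are assumptions made for the sketch and are not specified by the method itself):

```python
# Minimal sketch of the TransMPI forward pass described in steps 1-3.
# Module names, output layout and activations are illustrative assumptions.
import torch
import torch.nn as nn

class TransMPISketch(nn.Module):
    def __init__(self, encoder3d, transformer, decoder3d, num_planes=32):
        super().__init__()
        self.encoder3d = encoder3d        # 3D CNN encoder (local volume features)
        self.transformer = transformer    # Transformer bottleneck (global features)
        self.decoder3d = decoder3d        # 3D CNN decoder (restores resolution)
        self.num_planes = num_planes

    def forward(self, psv):               # psv: [B, 3N, D, H, W] plane sweep volumes
        feat = self.encoder3d(psv)        # compact volume feature map
        feat = self.transformer(feat)     # long-range dependencies in global space
        out = self.decoder3d(feat)        # assumed output: [B, 3, D, H, W]
        alpha, w1, w2 = out.split(1, dim=1)   # per-plane alpha + two blend weights
        weights = torch.softmax(torch.cat([w1, w2], dim=1), dim=1)
        return alpha.sigmoid(), weights
```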
The invention adopts the technical scheme and has the following beneficial effects:
(1) The method introduces a Transformer module into the 3D CNN network architecture, overcoming the limitation on learning global semantic information and enabling the network to model local and global features in both the spatial and depth dimensions effectively.
(2) A 3D encoder captures local spatial features across multiple depth planes, improving the prediction of occluded regions in the MPI depth planes.
(3) A Transformer self-attention mechanism is introduced, overcoming the limitation of the 3D convolutional neural network (CNN) in learning global semantic information.
(4) By using the self-attention mechanism to learn global features together with local features between consecutive depth planes, the inference quality of the MPI scene representation is further improved, distortion and artifacts in new viewpoint images are reduced, and the quality of the synthesized new viewpoint images is improved.
Drawings
Fig. 1 is a flowchart of the view-synthesis algorithm, based on the MPI scene representation, that performs global feature modeling with a Transformer self-attention mechanism, in an embodiment of the present invention.
Fig. 2 is a schematic diagram of an MPI generation network architecture according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the Transformer encoder in an embodiment of the present invention.
FIG. 4 is a diagram of a multi-head self-attention mechanism in an embodiment of the present invention.
Fig. 5 is a subjective result comparison chart of a viewpoint extrapolation synthesis algorithm in an embodiment of the present invention.
Fig. 6 is a subjective result comparison chart of a viewpoint extrapolation synthesis algorithm in the embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described below with reference to the drawings.
The overall structure of the invention, an MPI new viewpoint synthesis framework based on global feature modeling, is shown in Fig. 1. The method specifically comprises the following steps:
Step 1, acquiring training image data and preprocessing the input of the MPI generation network.
In step 11, because the network requires many training iterations and must generalize to a variety of application scenarios, the prepared training data must reach a certain order of magnitude. Two datasets are selected for the numerical experiments. The RealEstate10K dataset contains roughly 10 million frames extracted from about 80,000 video clips, covering indoor and outdoor scenes (e.g., bedrooms, streets, churches, canyons); it is divided into a training set of 54,000 scene images and a test set of 13,500 scene images, and is used here for the view-extrapolation study. The Spaces dataset consists of 100 indoor and outdoor scenes captured with a 16-camera rig; for each scene, sets of images are captured at 5-10 slightly different rig positions (no more than 10 cm apart). Because views from different camera positions of the same scene can be mixed during training, this jitter in camera position makes the dataset flexible for view synthesis, and it is typically used for view-interpolation studies on wide-baseline images. Training uses 90 scenes of this dataset and evaluation uses the remaining 10, with the image resolution set to 800 x 480.
Step 12, the flow of the view-synthesis algorithm that performs global feature modeling with the Transformer self-attention mechanism on top of the MPI scene representation is shown in FIG. 1. To encode the geometric information of the input reference images I_1 and I_2, the PSV projected from each reference viewpoint to the target viewpoint is computed and denoted P_i (i = 1, 2). The camera parameters C_1 = (A_1, [R_1, t_1]) and C_2 = (A_2, [R_2, t_2]) are known, where A_i and [R_i, t_i] (i = 1, 2) denote the intrinsic and extrinsic parameters (rotation matrix and translation vector) of each camera. Consider a pixel p_i(u_i, v_i, 1) in the reference-view image I_i (i = 1, 2) whose corresponding voxel lies at depth z_i in the reference camera coordinate system. If this voxel lies at depth z_v in the target camera coordinate system, then the matching pixel p_v(u_v, v_v, 1) in the target view is obtained from formula (1).
A three-dimensional scene can be segmented into multiple planes, each at a fixed distance (i.e., disparity value) from the reference camera. For points on such a depth plane, their projections in the reference view and the target view are related by the homography matrix H_{vi,z} (where z is the distance of the depth plane), as in formula (2):
p_v = A_v H_{vi,z} A_i^{-1} p_i (2)
Applying this series of homography matrices H_{vi,z} to a reference view yields the PSV, i.e., the set of re-projections onto the different depth planes. Each PSV tensor has size [3, D, H, W]; the two PSVs are concatenated along the color channel to obtain a [3N, D, H, W] tensor as the network input, where H and W are the height and width of the image, D is the number of depth planes, and N is the number of input images. The network learns to infer the scene geometry by comparing the PSVs of the two view images.
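A minimal sketch of this PSV construction, assuming the per-plane 3x3 pixel-to-pixel homographies (mapping target-view pixels back to the reference view) are already available; the bilinear sampling via grid_sample and the helper function names are illustrative assumptions:

```python
# Sketch: building a plane sweep volume (PSV) by warping a reference image
# with one homography per depth plane and sampling with grid_sample.
import torch
import torch.nn.functional as F

def warp_with_homography(image, H, height, width):
    """image: [B, 3, H, W]; H: [B, 3, 3] pixel homography (target -> reference)."""
    b = image.shape[0]
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=-1).float().reshape(-1, 3)   # [H*W, 3]
    src = H @ pix.T.unsqueeze(0).expand(b, -1, -1)                      # [B, 3, H*W]
    src = (src[:, :2] / src[:, 2:3].clamp(min=1e-8)).reshape(b, 2, height, width)
    # normalize pixel coordinates to [-1, 1] for grid_sample
    grid_x = 2.0 * src[:, 0] / (width - 1) - 1.0
    grid_y = 2.0 * src[:, 1] / (height - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)                        # [B, H, W, 2]
    return F.grid_sample(image, grid, align_corners=True)

def build_psv(image, homographies):
    """homographies: list of D per-plane [B, 3, 3] matrices -> PSV [B, 3, D, H, W]."""
    _, _, h, w = image.shape
    planes = [warp_with_homography(image, H, h, w) for H in homographies]
    return torch.stack(planes, dim=2)
```

Stacking the PSVs of the two reference images along the color channel then gives the [3N, D, H, W] network input described above.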
Step 2, as shown in fig. 2, the MPI generation network consists of three parts: a 3D CNN encoder, a Transformer encoder, and a 3D CNN decoder. The spatial features pass through this encoder-decoder architecture to produce the final MPI output. Each level of the encoder-decoder consists of an encoding block and a decoding block, for example encoding block 2 and decoding block 2. Encoding block 2 consists of four three-dimensional convolutions with a skip connection between every two convolutions; a three-dimensional convolution with a 1x1x1 kernel is applied on the first skip connection of encoding block 2 to downsample the input tensor. Note that only encoding block 1 does not downsample the input feature tensor. Decoding block 2 consists of two three-dimensional convolution layers with 3x3x3 kernels and one upsampling layer; the other decoding blocks are identical to decoding block 2. The parameters of the individual modules in fig. 2 denote the change in tensor size 3N@(D, H, W), where "upsampling 2x" means that the resolution and depth-channel parameters are doubled.
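The structure of a downsampling encoding block such as encoding block 2 can be sketched as follows; the stride-2 downsampling, ReLU activations, and channel counts are assumptions, since only the number of convolutions, the kernel sizes, and the skip connections are stated above:

```python
# Sketch of one downsampling 3D residual encoding block: four 3x3x3 convolutions
# grouped into two residual units, with a 1x1x1 convolution on the first skip
# connection handling the downsampling.
import torch.nn as nn

class EncodingBlock3D(nn.Module):
    def __init__(self, in_ch, out_ch, downsample=True):
        super().__init__()
        stride = 2 if downsample else 1
        self.conv1 = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1))
        self.skip1 = nn.Conv3d(in_ch, out_ch, 1, stride=stride)  # 1x1x1 conv on first skip
        self.conv2 = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv1(x) + self.skip1(x))   # first residual unit (downsamples)
        return self.relu(self.conv2(y) + y)            # second residual unit (identity skip)
```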
Step 21, as shown in fig. 3, the Transformer encoder module is composed of 4 Transformer layers, each of which contains two parts: an MHA module and an FFN. Given the feature map F output by the 3D CNN encoder, a linear mapping (a 3x3 convolution layer) increases the channel dimension from K = 128 to d = 512 to guarantee a comprehensive representation of each volume. Since a Transformer layer takes a sequence as input, the spatial and depth dimensions are reshaped into one dimension, producing a feature map f of size d x n; that is, f can be regarded as n tokens of dimension d. The position information is encoded with a learnable position embedding PE, which is fused directly with the feature map F to create the feature embedding, as shown in formula (3):
z_0 = f + PE = W x F + PE (3)
where W is the linear mapping operation, PE ∈ R^{d x n} denotes the position embedding, and z_0 ∈ R^{d x n} denotes the feature embedding. The output of the l-th Transformer layer (l ∈ [1, 2, ..., L]) is given by formulas (4) and (5):
z'_l = MHA(LN(z_{l-1})) + z_{l-1} (4)
z_l = FFN(LN(z'_l)) + z'_l (5)
where LN denotes layer normalization, z_l is the output of the l-th Transformer layer, and z'_l is an intermediate result of the computation.
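The Transformer bottleneck of formulas (3)-(5) can be sketched as below, using PyTorch's nn.MultiheadAttention as a stand-in for the MHA module; the FFN width (4d), the GELU activation, and the fixed token budget of the learnable positional embedding are illustrative assumptions:

```python
# Sketch: lift channels K=128 -> d=512, flatten depth/space into n tokens,
# add a learnable positional embedding PE, then apply 4 pre-norm layers
# implementing z'_l = MHA(LN(z_{l-1})) + z_{l-1} and z_l = FFN(LN(z'_l)) + z'_l.
import torch
import torch.nn as nn

class TransformerBottleneck(nn.Module):
    def __init__(self, k=128, d=512, n_tokens=512, num_layers=4, num_heads=8):
        super().__init__()
        self.proj = nn.Conv3d(k, d, kernel_size=3, padding=1)     # linear mapping W
        self.pos_emb = nn.Parameter(torch.zeros(1, n_tokens, d))  # learnable PE
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "ln1": nn.LayerNorm(d),
                "mha": nn.MultiheadAttention(d, num_heads, batch_first=True),
                "ln2": nn.LayerNorm(d),
                "ffn": nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)),
            }) for _ in range(num_layers)])

    def forward(self, feat):                              # feat: [B, K, D, H, W]
        b, _, dpt, h, w = feat.shape
        z = self.proj(feat).flatten(2).transpose(1, 2)    # [B, n, d] tokens, n = D*H*W
        z = z + self.pos_emb[:, : z.shape[1]]             # z_0 = f + PE  (formula 3)
        for layer in self.layers:
            y = layer["ln1"](z)
            a, _ = layer["mha"](y, y, y)
            z = z + a                                      # formula (4)
            z = z + layer["ffn"](layer["ln2"](z))          # formula (5)
        return z.transpose(1, 2).reshape(b, -1, dpt, h, w)
```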
Step 22, as shown in fig. 4, MHA addresses a shortcoming of plain self-attention, namely that the model tends to focus excessively on its own position when encoding the information at the current position. Given a query q ∈ R^{d_q}, a key k ∈ R^{d_k}, and a value v ∈ R^{d_v}, each set of linearly projected vector representations is treated as a head. Each attention head h_i (i = 1, ..., n) is computed as in formula (6):
h_i = f(W_i^(q) q, W_i^(k) k, W_i^(v) v) ∈ R^{p_v} (6)
where the learnable parameters include W_i^(q) ∈ R^{p_q x d_q}, W_i^(k) ∈ R^{p_k x d_k}, and W_i^(v) ∈ R^{p_v x d_v}, and f denotes the attention-pooling function. The output of multi-head attention undergoes another linear transformation applied to the concatenation of the n heads h_i (i = 1, ..., n); its learnable parameter is W_o ∈ R^{p_o x (n·p_v)}, giving the multi-head attention output W_o [h_1; ...; h_n] ∈ R^{p_o}, as in formula (7).
With this design, each head attends to a different part of the input, allowing the module to express more complex functions than a simple weighted average.
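A from-scratch sketch of the multi-head attention of formula (6) follows, assuming scaled dot-product attention as the attention-pooling function f and equal per-head dimensions p = d_model / num_heads; both choices are common but are assumptions here:

```python
# Sketch: each head applies its own projections to q, k, v; the pooling
# function f is scaled dot-product attention; concatenated heads pass
# through a final linear layer W_o.
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.p = num_heads, d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)   # stacked per-head W_i^(q)
        self.wk = nn.Linear(d_model, d_model)   # stacked per-head W_i^(k)
        self.wv = nn.Linear(d_model, d_model)   # stacked per-head W_i^(v)
        self.wo = nn.Linear(d_model, d_model)   # output transform W_o

    def forward(self, q, k, v):                 # [B, n, d_model] each
        b, n, _ = q.shape
        split = lambda x: x.reshape(b, -1, self.h, self.p).transpose(1, 2)  # [B, h, n, p]
        q, k, v = split(self.wq(q)), split(self.wk(k)), split(self.wv(v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.p), dim=-1)
        heads = attn @ v                                              # h_i = f(...)
        heads = heads.transpose(1, 2).reshape(b, n, self.h * self.p)  # concat heads
        return self.wo(heads)                                         # W_o [h_1; ...; h_n]
```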
Step 3, as shown in fig. 1, the output module of the MPI generation network directly predicts the alpha values of the MPI and two blending weights w_i (i = 1, 2), while the RGB values of the MPI are modeled by the blending weights and the PSVs, where P is obtained directly from the homography matrices. Thus, for each plane, the RGB image c is computed as in formula (8):
c = Σ w_i ⊙ P_i (i = 1, 2) (8)
The new viewpoint image can then be rendered from the MPI scene representation M = {c_i, α_i} (i = 1, 2, ..., N) by alpha compositing, and the rendering process is differentiable.
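The blending of formula (8) and the subsequent rendering can be sketched as follows; the back-to-front "over" compositing operator is a common choice for MPI rendering and is an assumption here, since the exact compositing formula is not reproduced above:

```python
# Sketch: per-plane RGB images from blending the two PSVs (formula 8),
# then differentiable alpha compositing of the MPI planes.
import torch

def blend_rgb(psv1, psv2, w1, w2):
    """psv*: [B, 3, D, H, W]; w*: [B, 1, D, H, W] blend weights -> per-plane RGB c."""
    return w1 * psv1 + w2 * psv2                  # c = sum_i w_i (element-wise) P_i

def alpha_composite(rgb, alpha):
    """rgb: [B, 3, D, H, W], alpha: [B, 1, D, H, W], planes ordered back-to-front."""
    out = torch.zeros_like(rgb[:, :, 0])
    for i in range(rgb.shape[2]):                 # iterate depth planes back-to-front
        a = alpha[:, :, i]
        out = rgb[:, :, i] * a + out * (1.0 - a)  # "over" compositing
    return out                                    # target viewpoint image I_t
```

Calling alpha_composite(blend_rgb(P_1, P_2, w_1, w_2), alpha) then yields the target viewpoint image I_t.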
in summary, the invention provides a new MPI viewpoint synthesis method (TransMPI) based on global feature modeling, and in order to overcome the limitation of convolution operation on global semantic information learning, a transducer module is added in an MPI generation network module in the algorithm, so that the network can effectively simulate local and global features in space and depth dimensions. The TransMPI network uses the obtained local features to perform global feature representation modeling in combination with a transducer encoder to establish long-distance dependency in global space. Experimental results show that the TransMPI further improves the reasoning quality of MPI scene representation and improves the quality of new viewpoint synthesized images by utilizing a self-attention mechanism to learn global features and local features between continuous depth planes.
To verify the quality of the new viewpoint images synthesized by this method in the view-extrapolation task, the method is compared with the Stereo-Mag and 3D-Photo algorithms on the validation set of the RealEstate10K dataset; for a fair comparison, all algorithms use the same number of depth planes (D = 32). Subjective results of the viewpoint-extrapolation algorithms are shown in figs. 5 and 6; the proposed method gives better results in the tabletop reflection regions and the lighting regions.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (9)

1. A global feature modeling-based MPI new viewpoint synthesis method, characterized in that the method comprises the following steps:
step 1, acquiring training data, and preprocessing the input of the MPI generation network to obtain plane sweep volumes (PSV);
step 2, inputting the training image data obtained in the step 1 into an established TransMPI network based on global feature modeling for training, wherein the process comprises the following steps:
(1) 3D CNN encoder: transMPI is built on the structure of a 3D residual encoder-decoder, a network residual encoder firstly utilizes 3D convolution to downsample an input PSV, so that space volume characteristics are extracted, a compact volume characteristic diagram is obtained, and local three-dimensional environment information is captured; (2) a transducer encoder: each spatial feature is reshaped into a vector, token, and long-range dependencies are modeled in global space using a transfomer encoder; (3) 3D CNN decoder: acquiring feature embedding from a transducer encoder, and recovering the feature map to be the same as the feature encoding part in size by repeatedly superposing an up-sampling layer and a convolution layer;
step 3, based on the trained MPI generation network, inputting reference images for testing; through the predicted alpha values and blending weights, the network selectively uses the reference image pair I_1 and I_2 at different depths, and the target viewpoint image I_t is thus obtained by differentiable rendering.
2. The global feature modeling-based MPI new viewpoint synthesis method according to claim 1, characterized in that: the data preprocessing operation of step 1 comprises encoding the geometric information of the input reference image pair I_1 and I_2 using homography transformations, and computing the PSV P = {(C_i, d_i)} (i = 1, ..., D) projected from each reference viewpoint to the target viewpoint, which consists of D front-to-back parallel planes, each depth plane d_i consisting of an RGB image C_i; the input PSVs are fused over the color channels, stacking their multiple depth planes into a cube so that the MPI generation network can capture spatial features between the planes.
3. The global feature modeling-based MPI new viewpoint synthesis method according to claim 2, characterized in that: the camera parameters C_1 = (A_1, [R_1, t_1]) and C_2 = (A_2, [R_2, t_2]) are known, where A_i and [R_i, t_i] (i = 1, 2) denote the intrinsic and extrinsic parameters of each camera; consider a pixel p_i(u_i, v_i, 1) in the reference-view image I_i whose corresponding voxel lies at depth z_i in the reference camera coordinate system; if this voxel lies at depth z_v in the target camera coordinate system, then the matching pixel p_v(u_v, v_v, 1) in the target view is obtained accordingly.
4. The global feature modeling-based MPI new viewpoint synthesis method according to claim 1, characterized in that: the homography transformations employed in step 1 all use the same set of depths, so that different input reference images can be compared to infer the scene geometry.
5. The global feature modeling-based MPI new viewpoint synthesis method according to claim 4, characterized in that: a three-dimensional scene is segmented into multiple planes, each at a fixed distance from the reference camera; for points on such a depth plane, their projections in the reference view and the target view are related by the homography matrix H_{vi,z}, where z is the distance of the depth plane:
p_v = A_v H_{vi,z} A_i^{-1} p_i
applying this series of homography matrices H_{vi,z} to the reference view yields the PSV, i.e., the re-projections onto the different depth planes; each PSV tensor has size [3, D, H, W]; the two PSVs are concatenated along the color channel to obtain a [3N, D, H, W] tensor as the network input, where H and W are the height and width of the image, D is the number of depth planes, and N is the number of input images; the network learns to infer the scene geometry by comparing the PSVs of the two view images.
6. The global feature modeling-based MPI new viewpoint synthesis method according to claim 1, characterized in that: the local spatial features obtained by the 3D CNN in step 2 are first encoded, via a linear mapping, into low-resolution/high-dimensional feature representations of the input image, and are then sent to a Transformer encoder to further learn long-range dependency modeling over the global space.
7. The global feature modeling-based MPI new viewpoint synthesis method according to claim 1, characterized in that: the Transformer encoder in step 2 consists of 4 Transformer layers, each of which comprises two parts: a multi-head attention (MHA) module and a feed-forward network (FFN).
8. The global feature modeling-based MPI new viewpoint synthesis method according to claim 7, characterized in that: step 2 comprises the following sub-steps:
step 21, given the feature map F output by the 3D CNN encoder, a linear mapping using a 3x3 convolution layer increases the channel dimension from K = 128 to d = 512; since a Transformer layer requires a sequence as input, the spatial and depth dimensions are reshaped into one dimension, generating a feature map f of size d x n, i.e., f can be regarded as n tokens of dimension d; the position information is encoded with a learnable position embedding PE fused directly with the feature map F, creating the feature embedding as follows:
z_0 = f + PE = W x F + PE
where W is the linear mapping operation, PE ∈ R^{d x n} denotes the position embedding, and z_0 ∈ R^{d x n} denotes the feature embedding; the output of the l-th Transformer layer (l ∈ [1, 2, ..., L]) is given below:
z'_l = MHA(LN(z_{l-1})) + z_{l-1}
z_l = FFN(LN(z'_l)) + z'_l
where LN denotes layer normalization, z_l is the output of the l-th Transformer layer, and z'_l is an intermediate result of the computation;
step 22, given a query q ∈ R^{d_q}, a key k ∈ R^{d_k}, and a value v ∈ R^{d_v}, each set of linearly projected vector representations is treated as a head; each attention head h_i (i = 1, ..., n) is computed as follows:
h_i = f(W_i^(q) q, W_i^(k) k, W_i^(v) v) ∈ R^{p_v}
where the learnable parameters include W_i^(q) ∈ R^{p_q x d_q}, W_i^(k) ∈ R^{p_k x d_k}, and W_i^(v) ∈ R^{p_v x d_v}, and f denotes the attention-pooling function; the output of multi-head attention undergoes another linear transformation applied to the concatenation of the n heads h_i, so its learnable parameter is W_o ∈ R^{p_o x (n·p_v)}.
9. The global feature modeling-based MPI new viewpoint synthesis method according to claim 1, characterized in that: in step 3, the MPI generation network directly predicts the alpha values of the MPI and two blending weights w_i (i = 1, 2), while the RGB values of the MPI are modeled by the blending weights and the PSVs, where P is obtained from the homography matrices; thus, for each plane, the RGB image c is computed as:
c = Σ w_i ⊙ P_i (i = 1, 2)
the new viewpoint image is rendered from the MPI scene representation M = {c_i, α_i} (i = 1, 2, ..., N) by alpha compositing, and the rendering process is differentiable.
CN202310634252.9A 2023-05-31 2023-05-31 Global feature modeling-based MPI new viewpoint synthesis method Pending CN117036586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310634252.9A CN117036586A (en) 2023-05-31 2023-05-31 Global feature modeling-based MPI new viewpoint synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310634252.9A CN117036586A (en) 2023-05-31 2023-05-31 Global feature modeling-based MPI new viewpoint synthesis method

Publications (1)

Publication Number Publication Date
CN117036586A true CN117036586A (en) 2023-11-10

Family

ID=88632424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310634252.9A Pending CN117036586A (en) 2023-05-31 2023-05-31 Global feature modeling-based MPI new viewpoint synthesis method

Country Status (1)

Country Link
CN (1) CN117036586A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination