CN110769242A

CN110769242A - Full-automatic 2D video to 3D video conversion method based on space-time information modeling

Info

Publication number: CN110769242A
Application number: CN201910952610.4A
Authority: CN
Inventors: 陈蓓; 袁家斌; 包秀平
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2019-10-09
Filing date: 2019-10-09
Publication date: 2020-02-07

Abstract

The invention discloses a fully automatic method for converting 2D video into 3D video based on spatio-temporal information modeling. First, an encoder network in a neural network extracts the spatial information of the 2D video and, at the same time, the temporal information across multiple frames; together, the spatial and temporal information serve as the representation of the video. A decoder network in the neural network then decodes this spatial and temporal information into displacement information. Next, a spatial transformer combines the displacement information with the pixel information of each video frame to obtain the corresponding video frame from another viewing angle. Finally, the video frames of the two viewing angles are stitched into a 3D video. The invention is applied to 2D-to-3D video conversion, and its technical scheme can effectively improve both conversion quality and conversion efficiency.

Description

Full-automatic 2D video to 3D video conversion method based on space-time information modeling

Technical Field

The invention belongs to the technical field of video processing, and in particular relates to fully automatic conversion of 2D video into 3D video using spatio-temporal information modeling.

Background

Existing 2D-to-3D video conversion methods comprise two steps: 1) extracting a depth map from the input image; 2) generating a stereoscopic image pair using a virtual viewpoint synthesis technique. Depth map extraction can be divided into semi-automatic and fully automatic methods according to whether an operator takes part. Semi-automatic methods require manual participation and are therefore costly in both time and money. Fully automatic methods save labor cost and greatly increase conversion speed, but their conversion quality does not yet meet users' requirements; moreover, virtual viewpoint synthesis is still required afterwards, which limits the conversion efficiency.

The rapid development of deep learning offers a new approach to converting 2D video into 3D video. In the prior art, "J. Lee, H. Jung, Y. Kim, and K. Sohn. Automatic 2D-to-3D conversion using multi-scale local prediction on Image Processing IEEE, 2018" proposes a fully automatic, end-to-end 2D-to-3D video conversion model that extracts the spatial information of the video with a multi-scale deep convolutional neural network, which simplifies the conversion process. However, because a multi-scale model is used, the problem of conversion efficiency remains unsolved; at the same time, the absence of temporal information limits the conversion quality.

Disclosure of Invention

To address the limited conversion quality and conversion efficiency of prior-art 2D-to-3D video conversion algorithms, the invention provides a fully automatic 2D video to 3D video conversion method based on spatio-temporal information modeling.

To achieve this purpose, the invention adopts the following technical scheme:

A fully automatic 2D video to 3D video conversion method based on spatio-temporal information modeling comprises the following steps:

Step 1: extract the temporal information f_t and the spatial information f_s of a plurality of video frames using an encoder network;

Step 2: feed the temporal information f_t and the spatial information f_s into a decoder network to obtain the displacement information d_i corresponding to each video frame;

Step 3: feed each video frame and its corresponding displacement information d_i into a spatial transformer, and use the transformation matrix A_θ and the coordinate transformation formula to obtain the other view of the video frame;

Step 4: stitch the video frame and the generated corresponding view into a 3D video frame;

Step 5: repeat steps 1-4 to obtain the complete 3D video.

Further, the encoder network used in step 1 is a densely connected neural network in which the 2D convolutions are replaced with 3D convolutions.

Further, the input of each layer of the decoder network used in step 2 is the sum of the output of the previous decoder layer and the output of the corresponding layer of the encoder network used in step 1.

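Further, the transformation matrix A_θ and the coordinate transformation formula used in step 3 are, respectively:

$$A_\theta = \begin{pmatrix} 1 & 0 & d \\ 0 & 1 & 0 \end{pmatrix}$$

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = A_\theta \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix} = \begin{pmatrix} x^t + d \\ y^t \end{pmatrix}$$

where d is the displacement corresponding to each pixel, (x^s, y^s) are the original pixel coordinates, and (x^t, y^t) are the target pixel coordinates.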

Compared with the prior art, the invention has the following beneficial effects:

Features are extracted with a 3D densely connected neural network instead of a multi-scale deep neural network. The 3D densely connected neural network extracts both the spatial information and the temporal information of the video, which improves the conversion quality of the 3D video. In addition, the 3D densely connected neural network reduces the amount of computation in each layer, reuses features, and requires fewer networks than the multi-scale deep neural network, so the conversion efficiency is greatly improved.

Drawings

FIG. 1 is the overall flow diagram of the present invention;

FIG. 2 shows the input and output format of the present invention;

FIG. 3 is a structural diagram of the dense block in the present invention;

FIG. 4 is a structural diagram of the transition layer in the present invention;

FIG. 5 is a structural diagram of the spatial transformer in the present invention;

FIG. 6 is a schematic diagram of the 3D convolution operation in the present invention;

FIG. 7 is a schematic diagram of the deconvolution operation in the present invention.

Detailed Description

Step 1: extract the temporal information f_t and the spatial information f_s of multiple video frames using an encoder network. The encoder network is a densely connected neural network in which the 2D convolutions are replaced with 3D convolutions.

1.1 Densely connected neural network

Suppose the input is an image X₀ that passes through an L-layer neural network. In a densely connected neural network, the input of the j-th layer depends not only on the output of layer j-1 but also on the outputs of all preceding layers, which is written as:

$$X_j = H_j([X_0, X_1, \ldots, X_{j-1}])$$

where X_j is the output of the j-th layer of the neural network and H_j(·) is the nonlinear transformation of the j-th layer.

The densely connected neural network consists of dense blocks and transition layers. As shown in FIG. 3, in a dense block each layer takes the outputs of all preceding layers together as its input; as shown in FIG. 4, a transition layer consists of batch normalization, a rectified linear unit, convolution, and average pooling.
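The connectivity rule X_j = H_j([X₀, X₁, …, X_{j-1}]) can be illustrated with a minimal sketch. It assumes that the bracket denotes channel-wise concatenation, as is common for densely connected networks, and it uses a toy stand-in for H_j; the names `make_layer`, `dense_block` and `growth` are hypothetical and do not come from the patent.

```python
from typing import Callable, List

Feature = List[float]  # one feature map, flattened to a list for illustration


def make_layer(growth: int) -> Callable[[List[Feature]], List[Feature]]:
    """Return a toy H_j that maps all incoming feature maps to `growth` new ones."""
    def h(inputs: List[Feature]) -> List[Feature]:
        # Stand-in for normalization -> ReLU -> convolution (an assumption):
        # average the inputs element-wise, apply ReLU, emit `growth` scaled copies.
        n = len(inputs)
        mean = [sum(vals) / n for vals in zip(*inputs)]
        relu = [max(0.0, v) for v in mean]
        return [[(k + 1) * v for v in relu] for k in range(growth)]
    return h


def dense_block(x0: List[Feature], num_layers: int, growth: int) -> List[Feature]:
    """Compute X_j = H_j([X_0, ..., X_{j-1}]) for j = 1..num_layers."""
    features = list(x0)  # running concatenation [X_0, ..., X_{j-1}]
    for _ in range(num_layers):
        h_j = make_layer(growth)
        x_j = h_j(features)   # layer j sees every earlier output
        features.extend(x_j)  # keep X_j available to all later layers
    return features


x0 = [[1.0, -2.0, 3.0]]  # one input channel holding three values
out = dense_block(x0, num_layers=3, growth=2)
channels = len(out)  # input channel plus 3 layers x 2 new channels each
```

Because every layer only adds a small number of new feature maps and reuses the earlier ones, each layer can stay narrow, which is the feature reuse the description refers to.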

1.2 3D convolution

Ordinary 2D convolution extracts the spatial features of a single static image and, combined with neural networks, works well for tasks such as image classification and detection. It falls short on video, i.e. multi-frame images, because 2D convolution does not take into account the motion of objects between images along the time dimension, i.e. the optical flow field. 3D convolution was therefore proposed to characterize video for classification and other tasks; it adds a time dimension to the convolution kernel.

Existing 2D-to-3D video conversion methods mostly use 2D convolution to extract the spatial information of a single image and ignore the characteristics of video frames along the time dimension. To extract the temporal information of the video, all convolution and deconvolution layers in the model use 3D convolution. As shown in FIG. 6, a 3D convolution convolves over the video spatially and across multiple time steps at once.
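To make the extra time dimension concrete, the following sketch slides a 3D kernel over a small single-channel video stored as nested lists. It is an illustration only: the function name `conv3d_valid`, the stride of 1, the absence of padding and the example kernel are all assumptions, and, as in most deep learning libraries, the operation is implemented as cross-correlation.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [t][y][x]


def conv3d_valid(video: Volume, kernel: Volume) -> Volume:
    """Slide a 3D kernel over (time, height, width) with stride 1 and no padding."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel spans several frames, so temporal change contributes.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Three 2x2 frames in which every pixel brightens by 1 per frame.
video = [[[float(t)] * 2 for _ in range(2)] for t in range(3)]
# A 2x1x1 temporal-difference kernel: next frame minus current frame.
kernel = [[[-1.0]], [[1.0]]]
motion = conv3d_valid(video, kernel)
```

A 2D kernel applied frame by frame could not produce this response, since the change between frames is only visible to a kernel that extends along the time axis.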

Step 2: feed the temporal information f_t and the spatial information f_s into a decoder network to obtain the displacement information d_i corresponding to each video frame. The decoder network uses 5 deconvolution layers. The input of each decoder layer is the sum of the output of the previous decoder layer and the output of the corresponding layer of the encoder network used in step 1.
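The summed skip connections between encoder and decoder can be sketched on a 1D signal. Everything below is a simplification under stated assumptions: pair averaging stands in for an encoder layer, value repetition stands in for a deconvolution layer, and the helper names `downsample`, `upsample` and `add_skip` are hypothetical.

```python
from typing import List

Signal = List[float]


def downsample(x: Signal) -> Signal:
    """Toy encoder layer: average neighbouring pairs, halving the length."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def upsample(x: Signal) -> Signal:
    """Toy decoder layer: repeat each value twice (stand-in for deconvolution)."""
    return [v for v in x for _ in range(2)]


def add_skip(prev: Signal, skip: Signal) -> Signal:
    """Element-wise sum of the previous decoder output and the encoder output."""
    if len(prev) != len(skip):
        raise ValueError("skip connection needs feature maps of equal size")
    return [a + b for a, b in zip(prev, skip)]


frame = [float(i) for i in range(32)]

# Encoder: keep every intermediate output for the skip connections.
encoder_outputs = []
x = frame
for _ in range(5):
    x = downsample(x)
    encoder_outputs.append(x)  # lengths 16, 8, 4, 2, 1

# Decoder: 5 layers; from the second layer on, the input is
# (previous decoder output) + (encoder output of the same size).
x = upsample(encoder_outputs[-1])
for skip in reversed(encoder_outputs[:-1]):
    x = upsample(add_skip(x, skip))
decoded_length = len(x)
```

The sum requires both operands to have the same shape, which is why each decoder layer is paired with the encoder layer of matching resolution.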

2.1 Deconvolution

A deconvolution layer is essentially still a convolution operation, but its input and output are related in exactly the opposite way to those of a convolution layer, so that the forward and backward passes are swapped: the forward pass of a convolution layer is the backward pass of a deconvolution layer, and the backward pass of a convolution layer is the forward pass of a deconvolution layer. Deconvolution can therefore restore a feature map whose size was reduced by a convolution layer to its original size. As shown in FIG. 7, the deconvolution operation performs the corresponding enlargement for each convolution block.
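The size relationship between a convolution and the matching deconvolution (transposed convolution) can be checked with a 1D sketch. The kernel values, the stride of 2 and the function names are assumptions chosen for illustration; the patent itself uses 3D layers.

```python
from typing import List


def conv1d(x: List[float], kernel: List[float], stride: int) -> List[float]:
    """Strided convolution without padding: shrinks the signal."""
    n = (len(x) - len(kernel)) // stride + 1
    return [
        sum(x[i * stride + k] * kernel[k] for k in range(len(kernel)))
        for i in range(n)
    ]


def conv_transpose1d(x: List[float], kernel: List[float], stride: int) -> List[float]:
    """Transposed convolution: each input value scatters a scaled copy of the
    kernel into the output, so the signal is enlarged."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
small = conv1d(signal, [0.5, 0.5], stride=2)             # length 8 -> 4
restored = conv_transpose1d(small, [1.0, 1.0], stride=2)  # length 4 -> 8
```

Note that only the size is restored here; the original values are not recovered exactly, which is why the decoder weights have to be learned.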

Step 3: feed each video frame and its corresponding displacement information d_i into the spatial transformer, and use the transformation matrix A_θ to obtain the other view of the video frame.

3.1 Spatial transformer

As shown in FIG. 5, U and V in the spatial transformer structure denote the input image and the output image, which correspond to the video frame and its other view, respectively. First, parameter prediction yields the transformation matrix A_θ; next, coordinate mapping is performed according to the parameters θ; finally, the result is obtained by sampling. The coordinate transformation formula T_θ(G) of the spatial transformer is:

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = T_\theta(G) = A_\theta \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix}$$

where (x^s, y^s) are the original pixel coordinates, (x^t, y^t) are the target pixel coordinates, and A_θ is the transformation matrix.

Since the pixel positions of the left and right views differ only by a shift, without rotation or scaling, the transformation matrix A_θ can be simplified to:

$$A_\theta = \begin{pmatrix} 1 & 0 & d_1 \\ 0 & 1 & d_2 \end{pmatrix}$$

where d₁ and d₂ are the displacements in the horizontal and vertical directions, respectively.

The coordinate transformation formula becomes:

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = \begin{pmatrix} 1 & 0 & d_1 \\ 0 & 1 & d_2 \end{pmatrix} \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix} = \begin{pmatrix} x^t + d_1 \\ y^t + d_2 \end{pmatrix}$$

Further analysis of stereo image pairs shows that the pixel positions of the left and right views are displaced only in the horizontal direction and are perfectly aligned in the vertical direction, i.e. d₂ = 0, so the transformation matrix can be further simplified to:

$$A_\theta = \begin{pmatrix} 1 & 0 & d_1 \\ 0 & 1 & 0 \end{pmatrix}$$

Writing d₁ simply as d gives the final coordinate transformation formula:

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = \begin{pmatrix} x^t + d \\ y^t \end{pmatrix}$$

Each pixel has its own displacement d, so the displacement map d is dense at pixel level, and the task of the neural network is to estimate the optimal displacement d.
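As a small worked example of the horizontal-only mapping, the sketch below applies a translation-only matrix to one homogeneous target coordinate. It assumes the usual spatial transformer convention that the source coordinate is A_θ times the target coordinate, with A_θ = [[1, 0, d], [0, 1, 0]]; the function name `source_coords` and the sample numbers are hypothetical.

```python
from typing import Tuple


def source_coords(x_t: float, y_t: float, d: float) -> Tuple[float, float]:
    """Map a target pixel (x_t, y_t) to its source position (x_s, y_s).

    Assumes A_theta = [[1, 0, d], [0, 1, 0]], i.e. a purely horizontal shift.
    """
    a_theta = [[1.0, 0.0, d], [0.0, 1.0, 0.0]]
    target = [x_t, y_t, 1.0]  # homogeneous coordinate
    x_s, y_s = (sum(a * c for a, c in zip(row, target)) for row in a_theta)
    return x_s, y_s


# A pixel at (10, 4) with displacement d = -2.5 samples from x = 7.5 on the same row.
x_s, y_s = source_coords(10.0, 4.0, -2.5)
```

The source position is generally fractional, which is why an interpolation step is needed to read a pixel value from it.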

Once the transformed pixel positions are obtained, the pixels are placed at the corresponding positions using bilinear interpolation:

$$I^r = B\{I^l, (x^t, y^t)\}$$

where I^l and I^r denote the left-view and right-view pixels, respectively, and B{·} denotes bilinear interpolation.

Since (x^t, y^t) depends only on the displacement d, I^r can be re-expressed as:

$$I^r = B\{I^l, d\}$$

Because bilinear interpolation is differentiable, this formula is also differentiable with respect to d, so the error can be back-propagated. Through the spatial transformer network, 2D-to-3D video conversion thus changes from two independent stages into an end-to-end system.
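The warping step I^r = B{I^l, d} can be sketched as follows. Because the displacement is purely horizontal and rows stay on integer positions, bilinear interpolation reduces to linear interpolation along x in this sketch. The border clamping, the sign convention (the target pixel at x samples the left view at x + d) and the function names are assumptions, not details given in the patent.

```python
import math
from typing import List


def sample_row(row: List[float], x: float) -> float:
    """Linearly interpolate `row` at fractional position x, clamping at the borders."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    alpha = x - x0
    return (1.0 - alpha) * row[x0] + alpha * row[x1]


def warp_view(left: List[List[float]], disp: List[List[float]]) -> List[List[float]]:
    """Build the right view: each target pixel (x, y) samples the left view at (x + d, y)."""
    return [
        [sample_row(row, x + d_row[x]) for x in range(len(row))]
        for row, d_row in zip(left, disp)
    ]


left = [[0.0, 10.0, 20.0, 30.0]]  # one image row of the left view
disp = [[0.5, 0.5, 0.5, 0.5]]     # dense displacement map, one d per pixel
right = warp_view(left, disp)
```

Between two neighbouring source pixels the sampled value is linear in d, with slope `row[x1] - row[x0]`; this piecewise-linear dependence is what allows the error to be propagated back to the displacement map.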

3.2 Loss function

The L1 loss uses the mean absolute error (MAE) as the measure of the error between the prediction and the label, so the loss function l is:

$$l = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where n denotes the number of input video frames, and y_i and ŷ_i denote the true value and the predicted value of the i-th video frame, respectively.
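A minimal sketch of the mean absolute error over a batch of frames is given below. The patent does not state how the absolute difference of two frames is reduced to a scalar; averaging over the pixels of each frame is an assumption made here, as are the names `l1_loss` and `Frame`.

```python
from typing import List

Frame = List[float]  # pixels of one frame, flattened


def l1_loss(truth: List[Frame], pred: List[Frame]) -> float:
    """Mean absolute error over n frames.

    Assumption: |y_i - y_hat_i| is taken as the mean absolute pixel
    difference within frame i.
    """
    if not truth or len(truth) != len(pred):
        raise ValueError("need the same non-zero number of frames")
    per_frame = [
        sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)
        for y, y_hat in zip(truth, pred)
    ]
    return sum(per_frame) / len(per_frame)


truth = [[1.0, 2.0], [3.0, 4.0]]  # ground-truth right views
pred = [[1.5, 2.0], [2.0, 4.0]]   # predicted right views
loss = l1_loss(truth, pred)
```

Here the two frames contribute errors of 0.25 and 0.5, so the loss is their average.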

Step 4: stitch the video frame and the generated corresponding view into a 3D video frame.

Step 5: repeat steps 1-4 to obtain the complete 3D video.
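Steps 4 and 5 can be sketched as a loop that pairs each frame with its generated view. The side-by-side layout is an assumption, since the patent only says the two views are stitched, and `make_right_view` is a hypothetical placeholder for the encoder, decoder and spatial transformer of steps 1 to 3.

```python
from typing import Callable, List

Image = List[List[float]]  # rows of pixel values


def stitch_side_by_side(left: Image, right: Image) -> Image:
    """Place the two views next to each other, row by row (assumed layout)."""
    if len(left) != len(right):
        raise ValueError("both views need the same number of rows")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def convert_video(frames: List[Image],
                  make_right_view: Callable[[Image], Image]) -> List[Image]:
    """Repeat view generation and stitching for every frame of the video."""
    return [stitch_side_by_side(f, make_right_view(f)) for f in frames]


def shift_one_pixel(frame: Image) -> Image:
    """Placeholder view generator: shift each row left by one pixel."""
    return [row[1:] + row[-1:] for row in frame]


frames = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
video_3d = convert_video(frames, shift_one_pixel)
```

In the described method the placeholder would be replaced by the trained network, which predicts a per-pixel displacement instead of a fixed shift.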

In summary, an encoder network in a neural network first extracts the spatial information of the 2D video together with the temporal information across multiple frames, and the two are used jointly as the representation of the video. A decoder network in the neural network decodes this spatial and temporal information into displacement information. A spatial transformer then combines the displacement information with the pixel information of each video frame to obtain the corresponding video frame from another viewing angle. Finally, the video frames of the two viewing angles are stitched into a 3D video. The invention is applied to 2D-to-3D video conversion, and its technical scheme can effectively improve both conversion quality and conversion efficiency.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A fully automatic 2D video to 3D video conversion method based on spatio-temporal information modeling, characterized by comprising the following steps:

Step 1: extract the temporal information f_t and the spatial information f_s of a plurality of video frames using an encoder network;

Step 2: feed the temporal information f_t and the spatial information f_s into a decoder network to obtain the displacement information d_i corresponding to each video frame;

Step 3: feed each video frame and its corresponding displacement information d_i into a spatial transformer, and use the transformation matrix A_θ and the coordinate transformation formula to obtain the other view of the video frame;

Step 4: stitch the video frame and the generated corresponding view into a 3D video frame;

Step 5: repeat steps 1-4 to obtain the complete 3D video.

2. The fully automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein: the encoder network used in step 1 is a densely connected neural network in which the 2D convolutions are replaced with 3D convolutions.

3. The fully automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein: the input of each layer of the decoder network used in step 2 is the sum of the output of the previous decoder layer and the output of the corresponding layer of the encoder network used in step 1.

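4. The fully automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein: the transformation matrix A_θ and the coordinate transformation formula used in step 3 are, respectively:

$$A_\theta = \begin{pmatrix} 1 & 0 & d \\ 0 & 1 & 0 \end{pmatrix}$$

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = A_\theta \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix} = \begin{pmatrix} x^t + d \\ y^t \end{pmatrix}$$

where d is the displacement corresponding to each pixel, (x^s, y^s) are the original pixel coordinates, and (x^t, y^t) are the target pixel coordinates.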