CN110769242A - Full-automatic 2D video to 3D video conversion method based on space-time information modeling - Google Patents

Full-automatic 2D video to 3D video conversion method based on space-time information modeling Download PDF

Info

Publication number
CN110769242A
Authority
CN
China
Prior art keywords
video
information
network
time information
automatic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910952610.4A
Other languages
Chinese (zh)
Inventor
陈蓓
袁家斌
包秀平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201910952610.4A priority Critical patent/CN110769242A/en
Publication of CN110769242A publication Critical patent/CN110769242A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/161 Encoding, multiplexing or demultiplexing different image signal components
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/261 Image signal generators with monoscopic-to-stereoscopic image conversion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/275 Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention discloses a full-automatic method for converting a 2D video into a 3D video based on space-time information modeling. Firstly, the spatial information of the 2D video is extracted by an encoder network in a neural network; at the same time, the temporal information among the multiple frames of the video is extracted, and the spatial and temporal information together serve as the representation of the video. A decoder network in the neural network then decodes the spatial and temporal information of the video into displacement information, and a space transformer combines the displacement information with the pixel information of each video frame to obtain the video frame of the other viewpoint corresponding to that frame. Finally, the video frames of the two viewpoints are spliced into a 3D video. The invention is applied to the conversion from 2D video to 3D video, and the technical scheme of the invention can effectively improve both the conversion quality and the conversion efficiency.

Description

Full-automatic 2D video to 3D video conversion method based on space-time information modeling
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to full-automatic conversion from a 2D video to a 3D video by using spatiotemporal information modeling.
Background
The existing 2D video to 3D video conversion methods comprise two steps: 1) extracting a depth map from the input image; 2) generating a stereoscopic image pair using a virtual viewpoint synthesis technique. The extraction of the depth map can be divided into semi-automatic and full-automatic approaches according to whether an operator participates. The semi-automatic approach requires manual participation, so its time and cost overhead is high; the full-automatic approach saves labor cost and greatly improves the conversion speed, but its conversion quality does not yet meet people's requirements. In addition, virtual viewpoint synthesis is still required afterwards, which limits the conversion efficiency of the video.
The vigorous development of deep learning provides a new idea for converting 2D video into 3D video. In the prior art, "J. Lee, H. Jung, Y. Kim, and K. Sohn, Automatic 2D-to-3D conversion using multi-scale deep neural network, Image Processing, IEEE, 2018" proposes a full-automatic end-to-end 2D-to-3D video conversion model that extracts the spatial information of the video with a multi-scale deep convolutional neural network, which simplifies the 2D-to-3D video conversion process. However, because a multi-scale model is used, the problem of conversion efficiency is not solved; at the same time, the absence of temporal information also limits the conversion quality.
Disclosure of Invention
In order to solve the problems of the prior-art 2D video to 3D video conversion algorithms, the invention provides a full-automatic 2D video to 3D video conversion method based on space-time information modeling.
In order to achieve this purpose, the invention adopts the following technical scheme:
A full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling comprises the following steps:
Step 1, extracting the temporal information f_t and the spatial information f_s of a plurality of input video frames by using an encoder network;
Step 2, taking the temporal information f_t and the spatial information f_s as the input of a decoder network and obtaining the displacement information d_i corresponding to each video frame;
Step 3, taking each video frame and its corresponding displacement information d_i as the input of a space transformer and obtaining, through the transformation matrix A_θ and the coordinate transformation formula, the video frame of the other viewpoint corresponding to it;
Step 4, splicing each video frame with the generated corresponding view into a 3D video frame;
Step 5, repeating steps 1-4 to obtain the complete 3D video.
Further, the encoder network used in step 1 is a densely connected neural network, in which the 2D convolution is replaced with 3D convolution.
Further, the input of each layer of the decoder used in step 2 is the sum of the output of the previous layer and the output of the corresponding network layer of the encoder used in step 1.
Further, the transformation matrix A_θ and the coordinate transformation formula used in step 3 are respectively:
A_θ = [1 0 d; 0 1 0]
x_t = x_s + d,  y_t = y_s
wherein: d is the displacement corresponding to each pixel, (x_s, y_s) are the coordinates of the original pixel point, and (x_t, y_t) are the coordinates of the target pixel point.
Compared with the prior art, the invention has the following beneficial effects:
when the features are extracted, a 3D dense connection neural network is adopted instead of a multi-scale deep neural network. The 3D dense connection neural network can extract the spatial information of the video and the time information of the video, and the conversion quality of the 3D video is improved. Meanwhile, the 3D dense connection neural network can reduce the calculated amount of each layer and multiplex the characteristics, and the number of the networks is less than that of the multi-scale deep neural network, so that the conversion efficiency is greatly improved.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is an input and output form of the present invention;
FIG. 3 is a diagram of a Dense Block structure according to the present invention;
FIG. 4 is a diagram showing a transition structure in the present invention;
FIG. 5 is a structural view of a space transformer in the present invention;
FIG. 6 is a schematic diagram of the 3D convolution operation of the present invention;
FIG. 7 is a schematic diagram of the deconvolution operation of the present invention.
Detailed Description
A full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling comprises the following steps:
Step 1: extract the temporal information f_t and the spatial information f_s of a plurality of input video frames using an encoder network. For the encoder network, we use a densely connected neural network and replace the 2D convolution in it with 3D convolution;
1.1 Densely connected neural network
Suppose the input is a picture X_0 that passes through an L-layer neural network. In the densely connected neural network, the input of the j-th layer is related not only to the output of layer j-1 but also to the outputs of all preceding layers, which is recorded as:
X_j = H_j([X_0, X_1, ..., X_{j-1}])
wherein: X_j is the output of the j-th layer of the neural network, and H_j(·) is the nonlinear transformation of the j-th layer.
The densely connected neural network consists of dense blocks and transition layers. As shown in FIG. 3, inside a dense block the input of each layer is the concatenation of the outputs of all preceding layers; as shown in FIG. 4, the transition layer consists of batch normalization, a rectified linear unit (ReLU), convolution and average pooling.
1.2 3D convolution
An ordinary 2D convolution extracts the spatial features of a single static image; combined with a neural network, it achieves good results on tasks such as image classification and detection. It is, however, ill-suited to video, i.e., multi-frame image sequences, because 2D convolution does not take into account the object motion information between images in the time dimension, i.e., the optical flow field. Therefore, in order to be able to characterize video for classification and other tasks, 3D convolution was proposed, which adds a time dimension to the convolution kernel.
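A quick shape check (illustrative only, assuming PyTorch) makes the difference concrete: a 2D convolution processes a single frame, while a 3D convolution also slides over the time dimension of the whole clip:

    import torch
    import torch.nn as nn

    frames = torch.randn(1, 3, 8, 128, 128)   # (batch, channels, time = 8 frames, height, width)

    conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)
    one_frame = conv2d(frames[:, :, 0])        # a single frame only -> (1, 16, 128, 128)

    conv3d = nn.Conv3d(3, 16, kernel_size=3, padding=1)
    whole_clip = conv3d(frames)                # the whole clip -> (1, 16, 8, 128, 128)
    print(one_frame.shape, whole_clip.shape)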
In existing 2D-to-3D video conversion, 2D convolution is basically used to extract the spatial information of a single picture, and the characteristics of the video frames in the time dimension are ignored. In order to extract the information of the video in the time dimension, all convolution layers and deconvolution layers in the present model adopt 3D convolution. As shown in FIG. 6, the 3D convolution convolves the video over the information of multiple time instants at once.
Step 2: take the temporal information f_t and the spatial information f_s as the input of a decoder network and obtain the displacement information d_i corresponding to each video frame. For the decoder network, 5 deconvolution layers are used. The input of each layer of the decoder is the sum of the output of the previous layer and the output of the corresponding network layer of the encoder used in step 1.
2.1 deconvolution
The deconvolution layer essentially also performs a convolution operation, but its input-output relationship is exactly the opposite of a convolution layer's: forward propagation and backward propagation are swapped, so the forward propagation process of a convolution layer is the backward propagation process of a deconvolution layer, and vice versa. An image whose size was reduced by the convolution layers can therefore be restored to its original size by deconvolution. As shown in FIG. 7, the deconvolution operation performs a corresponding enlargement for each feature block.
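A single decoder stage of the kind described in step 2 could be sketched as follows (assuming PyTorch; the kernel and stride values are placeholder choices that double the spatial size, not the exact parameters of the invention):

    import torch
    import torch.nn as nn

    class DecoderStage3D(nn.Module):
        def __init__(self, in_channels, out_channels):
            super().__init__()
            # the deconvolution doubles height and width while keeping the time dimension
            self.deconv = nn.ConvTranspose3d(in_channels, out_channels,
                                             kernel_size=(1, 4, 4),
                                             stride=(1, 2, 2),
                                             padding=(0, 1, 1))

        def forward(self, x, encoder_feature):
            up = torch.relu(self.deconv(x))
            return up + encoder_feature   # element-wise sum with the corresponding encoder output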
And step 3: video frame
and the displacement information d_i corresponding thereto are taken as the input of the space transformer; the transformation matrix A_θ is then used to obtain the video frame of the other viewpoint corresponding to that frame.
3.1 space transformer
As shown in FIG. 5, U and V in the space transformer structure represent the input image and the output image respectively, corresponding to the input video frame and the generated view of the other viewpoint. First, parameter prediction is performed to obtain the transformation matrix A_θ; then coordinate mapping is performed with the parameter θ; finally, the result is obtained by sampling. The coordinate transformation T_θ(G) performed by the space transformer is:
(x_t, y_t)^T = T_θ(G) = A_θ · (x_s, y_s, 1)^T
wherein: (x_s, y_s) are the coordinates of the original pixel point, (x_t, y_t) are the coordinates of the target pixel point, and A_θ is the transformation matrix.
Since there is only a displacement between the left-view and right-view pixel positions, with no rotation or scaling, the transformation matrix A_θ can be simplified to:
A_θ = [1 0 d_1; 0 1 d_2]
wherein: d_1 and d_2 are the displacements in the horizontal and vertical directions, respectively.
The coordinate transformation formula then becomes:
x_t = x_s + d_1,  y_t = y_s + d_2
further analysis of the stereo image pair reveals that there is only a displacement between the left and right view pixel positions in the horizontal direction, but is perfectly parallel in the vertical direction, i.e. d2The transform matrix can be further simplified to 0:
Figure BDA0002226250630000049
Writing d_1 simply as d, the final coordinate transformation formula is obtained:
x_t = x_s + d,  y_t = y_s
each pixel corresponds to a displacement d, so the displacement map d is dense in pixels, and the task of the neural network is to estimate the optimal displacement d.
After the transformed pixel positions are obtained, the pixels are inserted into the corresponding positions using bilinear interpolation:
I_r = B{I_l, (x_t, y_t)}
wherein: I_l and I_r represent the left-view and right-view pixels respectively, and B{·} denotes bilinear interpolation.
Then, since (x_t, y_t) is related only to the displacement d, I_r can be re-expressed as:
I_r = B{I_l, d}
Since bilinear interpolation is differentiable, this formula is also differentiable with respect to d, so errors can be back-propagated; in this way, through the space transformer network, the 2D-to-3D video conversion changes from two independent stages into an end-to-end system;
3.2 loss function
The L1 loss uses the mean absolute error (MAE) as the measure of the error between the prediction and the label, so the loss function l is:
l = (1/n) * Σ_{i=1}^{n} |y_i − ŷ_i|
where n denotes the number of input video frames, and y_i and ŷ_i denote the ground-truth value and the predicted value of the i-th video frame, respectively.
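A minimal training-step sketch using this L1 loss might look as follows (assuming PyTorch; the model and optimizer objects are placeholders for the encoder-decoder-transformer pipeline described above):

    import torch
    import torch.nn as nn

    l1 = nn.L1Loss()   # mean absolute error over all pixels and frames

    def train_step(model, optimizer, left_frames, right_frames):
        # left_frames: input 2D frames; right_frames: ground-truth frames of the other view
        optimizer.zero_grad()
        predicted_right = model(left_frames)       # steps 1-3: encoder, decoder, space transformer
        loss = l1(predicted_right, right_frames)   # l = (1/n) * sum_i |y_i - y_hat_i|
        loss.backward()                            # gradients flow through the bilinear sampler
        optimizer.step()
        return loss.item()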
Step 4, the video frame is processed
and spliced with the generated corresponding view into a 3D video frame;
Step 5: steps 1-4 are repeated to obtain the complete 3D video.
In summary, the spatial information of the 2D video is first extracted by the encoder network in the neural network; at the same time, the temporal information among the multiple frames of the video is extracted, and the spatial and temporal information together serve as the representation of the video. The decoder network in the neural network then decodes the spatial and temporal information of the video into displacement information, and the space transformer combines the displacement information with the pixel information of each video frame to obtain the video frame of the other viewpoint corresponding to that frame. Finally, the video frames of the two viewpoints are spliced into a 3D video. Applied to the conversion from 2D video to 3D video, the technical scheme of the invention can effectively improve both the conversion quality and the conversion efficiency.
The above description covers only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications and adaptations are also intended to fall within the scope of the invention.

Claims (4)

1. A full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling is characterized by comprising the following steps:
step 1, extracting the temporal information f_t and the spatial information f_s of a plurality of input video frames by using an encoder network;
step 2, taking the temporal information f_t and the spatial information f_s as the input of a decoder network and obtaining the displacement information d_i corresponding to each video frame;
step 3, taking each video frame and its corresponding displacement information d_i as the input of a space transformer and obtaining, through the transformation matrix A_θ and the coordinate transformation formula, the video frame of the other viewpoint corresponding to it;
step 4, splicing each video frame with the generated corresponding view into a 3D video frame;
and step 5, repeating steps 1-4 to obtain the complete 3D video.
2. The full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein: the encoder network used in step 1 is a densely connected neural network, in which the 2D convolution is replaced with 3D convolution.
3. The full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein: the input of each layer of the decoder used in step 2 is the sum of the output of the previous layer and the output of the corresponding network layer of the encoder used in step 1.
4. The full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein the transformation matrix A_θ and the coordinate transformation formula used in step 3 are respectively:
A_θ = [1 0 d; 0 1 0]
x_t = x_s + d,  y_t = y_s
wherein: d is the displacement corresponding to each pixel, (x_s, y_s) are the coordinates of the original pixel point, and (x_t, y_t) are the coordinates of the target pixel point.
CN201910952610.4A 2019-10-09 2019-10-09 Full-automatic 2D video to 3D video conversion method based on space-time information modeling Pending CN110769242A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910952610.4A CN110769242A (en) 2019-10-09 2019-10-09 Full-automatic 2D video to 3D video conversion method based on space-time information modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910952610.4A CN110769242A (en) 2019-10-09 2019-10-09 Full-automatic 2D video to 3D video conversion method based on space-time information modeling

Publications (1)

Publication Number Publication Date
CN110769242A true CN110769242A (en) 2020-02-07

Family

ID=69331077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910952610.4A Pending CN110769242A (en) 2019-10-09 2019-10-09 Full-automatic 2D video to 3D video conversion method based on space-time information modeling

Country Status (1)

Country Link
CN (1) CN110769242A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114268782A (en) * 2020-09-16 2022-04-01 镇江多游网络科技有限公司 Attention migration-based 2D-to-3D video conversion method and device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120007960A1 (en) * 2010-06-04 2012-01-12 Samsung Electronics Co., Ltd. Video processing method for 3D display based on multi-cue process
CN105122810A (en) * 2013-04-11 2015-12-02 Lg电子株式会社 Method and apparatus for processing video signal
CN106504190A (en) * 2016-12-29 2017-03-15 浙江工商大学 A kind of three-dimensional video-frequency generation method based on 3D convolutional neural networks
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN108921942A (en) * 2018-07-11 2018-11-30 北京聚力维度科技有限公司 The method and device of 2D transformation of ownership 3D is carried out to image

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120007960A1 (en) * 2010-06-04 2012-01-12 Samsung Electronics Co., Ltd. Video processing method for 3D display based on multi-cue process
CN105122810A (en) * 2013-04-11 2015-12-02 Lg电子株式会社 Method and apparatus for processing video signal
CN106504190A (en) * 2016-12-29 2017-03-15 浙江工商大学 A kind of three-dimensional video-frequency generation method based on 3D convolutional neural networks
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN108921942A (en) * 2018-07-11 2018-11-30 北京聚力维度科技有限公司 The method and device of 2D transformation of ownership 3D is carried out to image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIYOUNG LEE et al.: "Automatic 2D-to-3D conversion using multi-scale deep neural network", School of Electrical and Electronic Engineering *
XIANGWEN LU et al.: "Atomic resolution tomography reconstruction of tilt series based on a GPU accelerated hybrid input-output algorithm using polar Fourier transform", Ultramicroscopy *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114268782A (en) * 2020-09-16 2022-04-01 镇江多游网络科技有限公司 Attention migration-based 2D-to-3D video conversion method and device and storage medium

Similar Documents

Publication Publication Date Title
CN111739078B (en) Monocular unsupervised depth estimation method based on context attention mechanism
CN113362223B (en) Image super-resolution reconstruction method based on attention mechanism and two-channel network
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN110930309B (en) Face super-resolution method and device based on multi-view texture learning
CN111739082B (en) Stereo vision unsupervised depth estimation method based on convolutional neural network
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN112019828B (en) Method for converting 2D (two-dimensional) video into 3D video
CN102857739A (en) Distributed panorama monitoring system and method thereof
CN110335222B (en) Self-correction weak supervision binocular parallax extraction method and device based on neural network
CN111899295B (en) Monocular scene depth prediction method based on deep learning
CN102438167B (en) Three-dimensional video encoding method based on depth image rendering
CN110706155B (en) Video super-resolution reconstruction method
CN105046725B (en) Head shoulder images method for reconstructing in low-bit rate video call based on model and object
Pan et al. RDEN: Residual distillation enhanced network-guided lightweight synthesized view quality enhancement for 3D-HEVC
CN104506872A (en) Method and device for converting planar video into stereoscopic video
CN113850718A (en) Video synchronization space-time super-resolution method based on inter-frame feature alignment
CN114170286A (en) Monocular depth estimation method based on unsupervised depth learning
CN116542889A (en) Panoramic video enhancement method with stable view point
CN113393382B (en) Binocular picture super-resolution reconstruction method based on multi-dimensional parallax prior
CN113610912B (en) System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction
CN112634127B (en) Unsupervised stereo image redirection method
CN112927348B (en) High-resolution human body three-dimensional reconstruction method based on multi-viewpoint RGBD camera
CN102223545A (en) Rapid multi-view video color correction method
CN107330856B (en) Panoramic imaging method based on projective transformation and thin plate spline

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200207