CN110769242A - Full-automatic 2D video to 3D video conversion method based on space-time information modeling - Google Patents
- Publication number
- CN110769242A
- Authority
- CN
- China
- Prior art keywords
- video
- information
- network
- time information
- automatic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/10—Processing, recording or transmission of stereoscopic or multi-view image signals
- H04N13/106—Processing image signals
- H04N13/161—Encoding, multiplexing or demultiplexing different image signal components
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/261—Image signal generators with monoscopic-to-stereoscopic image conversion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/275—Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
Abstract
The invention discloses a full-automatic method for converting 2D video into 3D video based on spatio-temporal information modeling. First, an encoder network in a neural network extracts the spatial information of the 2D video and, at the same time, the temporal information across multiple frames; together, the spatial and temporal information serve as the video representation. A decoder network in the neural network then decodes the spatial and temporal information into displacement information. Next, a spatial transformer combines the displacement information with the pixel information of each video frame to obtain the corresponding frame from the other viewpoint. Finally, the frames of the two viewpoints are spliced into a 3D video. Applied to 2D-to-3D video conversion, the technical scheme of the invention can effectively improve conversion quality and conversion efficiency.
Description
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to full-automatic conversion from a 2D video to a 3D video by using spatiotemporal information modeling.
Background
Existing 2D-to-3D video conversion methods comprise two steps: 1) extracting a depth map from the input image; 2) generating a stereoscopic image pair with a virtual viewpoint synthesis technique. Depth map extraction is either semi-automatic or full-automatic, depending on whether an operator takes part. Semi-automatic methods require manual participation, so they are expensive in both time and cost. Full-automatic methods save labor cost and greatly increase conversion speed, but their conversion quality does not yet meet users' requirements. In both cases virtual viewpoint synthesis is still required afterwards, which limits the conversion efficiency.
The rapid development of deep learning offers a new approach to converting 2D video into 3D video. In the prior art, "J. Lee, H. Jung, Y. Kim, and K. Sohn. Automatic 2D-to-3D conversion using multi-scale local prediction on Image Processing IEEE, 2018" proposes a full-automatic end-to-end 2D-to-3D video conversion model that extracts the spatial information of the video with a multi-scale deep convolutional neural network, which simplifies the conversion process. However, because it uses a multi-scale model, the conversion efficiency problem remains unsolved; in addition, the lack of temporal information limits the conversion quality.
Disclosure of Invention
In order to solve the problems of the 2D-to-3D video conversion algorithms in the prior art, the invention provides a full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling.
In order to achieve the purpose, the invention adopts the technical scheme that:
A full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling comprises the following steps:
Step 1, using an encoder network to extract the temporal information $f_t$ and the spatial information $f_s$ of a plurality of video frames;
Step 2, taking the temporal information $f_t$ and the spatial information $f_s$ as input to a decoder network to obtain the displacement information $d_i$ corresponding to each video frame;
Step 3, taking each video frame and its corresponding displacement information $d_i$ as input to the spatial transformer, and using the transformation matrix $A_\theta$ and the coordinate transformation formula to obtain the other view of the video frame;
Step 4, splicing the video frame and the generated corresponding view into a 3D video frame;
Step 5, repeating steps 1-4 to obtain the complete 3D video.
Further, the encoder network used in step 1 is a densely connected neural network in which the 2D convolutions are replaced by 3D convolutions.
Further, the input of each decoder layer used in step 2 is the sum of the output of the previous layer and the output of the corresponding encoder layer used in step 1.
Further, the transformation matrix $A_\theta$ and the coordinate transformation formula used in step 3 are, respectively:
where $d$ is the displacement corresponding to each pixel, $(x_s, y_s)$ are the original pixel coordinates, and $(x_t, y_t)$ are the target pixel coordinates.
Compared with the prior art, the invention has the following beneficial effects:
For feature extraction, a 3D densely connected neural network is used instead of a multi-scale deep neural network. The 3D densely connected neural network extracts both the spatial information and the temporal information of the video, which improves the conversion quality of the 3D video. It also reduces the computation in each layer and reuses features, and it needs fewer networks than the multi-scale deep neural network, so the conversion efficiency is greatly improved.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is an input and output form of the present invention;
FIG. 3 is a diagram of a Dense Block structure according to the present invention;
FIG. 4 is a diagram of the transition layer structure in the present invention;
FIG. 5 is a structural view of the spatial transformer in the present invention;
FIG. 6 is a schematic diagram of the 3D convolution operation of the present invention;
FIG. 7 is a schematic diagram of the deconvolution operation of the present invention.
Detailed Description
A full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling comprises the following steps:
Step 1: Use an encoder network to extract the temporal information $f_t$ and the spatial information $f_s$ of a plurality of video frames. For the encoder network, a densely connected neural network is used, with its 2D convolutions replaced by 3D convolutions.
1.1 Densely connected neural network
Suppose the input is an image $X_0$ that passes through an $L$-layer neural network. In a densely connected neural network, the input of the $j$-th layer depends not only on the output of layer $j-1$ but also on the outputs of all earlier layers, written as:
$X_j = H_j([X_0, X_1, \ldots, X_{j-1}])$
where $X_j$ is the output of the $j$-th layer of the neural network and $H_j(\cdot)$ is the nonlinear transformation of the $j$-th layer.
The densely connected neural network consists of dense blocks and transition layers. As shown in FIG. 3, in a dense block the input of each layer is the sum of the outputs of all preceding layers. As shown in FIG. 4, a transition layer consists of normalization, a rectified linear unit, convolution, and average pooling.
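The connectivity rule $X_j = H_j([X_0, \ldots, X_{j-1}])$ can be illustrated with a minimal pure-Python sketch. It assumes that the bracket denotes concatenation of all earlier outputs, as in DenseNet-style networks (the text above calls it a "sum"), and `make_toy_layer` is a hypothetical stand-in for the real normalization, activation, and convolution stack.

```python
from typing import Callable, List

Feature = List[float]  # toy feature map: a flat list of channel values


def dense_block(x0: Feature, layers: List[Callable[[Feature], Feature]]) -> Feature:
    """Apply X_j = H_j([X_0, ..., X_{j-1}]) for each layer H_j in turn."""
    outputs = [x0]
    for h in layers:
        # Assumption: [.] means concatenation of every earlier output.
        concatenated = [v for out in outputs for v in out]
        outputs.append(h(concatenated))
    # The block returns the concatenation of all features produced so far.
    return [v for out in outputs for v in out]


def make_toy_layer(growth: int) -> Callable[[Feature], Feature]:
    """Hypothetical H_j: emits `growth` channels, each the ReLU of the input mean."""
    def layer(x: Feature) -> Feature:
        mean = sum(x) / len(x)
        return [max(0.0, mean)] * growth
    return layer


block_out = dense_block([1.0, 3.0], [make_toy_layer(2), make_toy_layer(2)])
```

Each layer adds only a small number of new channels while still seeing every earlier feature, which is the feature reuse the description refers to.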
1.2 3D convolution
Ordinary 2D convolution extracts the spatial features of a single static image and, combined with neural networks, works well on tasks such as image classification and detection. It falls short on video, that is, on multi-frame input, because 2D convolution ignores the motion of objects along the time dimension between images (the optical flow field). 3D convolution was therefore proposed to characterize video for classification and other tasks: it adds a time dimension to the convolution kernel.
Existing 2D-to-3D video conversion mostly uses 2D convolution to extract the spatial information of a single picture and ignores the characteristics of video frames in the time dimension. To extract the information of the video in the time dimension, all convolution layers and deconvolution layers in this model use 3D convolution. As shown in FIG. 6, a 3D convolution convolves the video across several time steps at once as well as across space.
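A single-channel 3D convolution can be sketched as follows. This is an illustrative pure-Python version, assuming stride 1, no padding, and the cross-correlation form commonly used in deep learning; the function name `conv3d_valid` and the toy kernel are not taken from the patent.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][row][column]


def conv3d_valid(video: Volume, kernel: Volume) -> Volume:
    """Slide a 3D kernel over (time, height, width) with stride 1 and no padding."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel spans several frames, so each output value
                # mixes information from neighbouring time steps.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Three 2x2 frames whose brightness grows by 1 per frame.
frames = [[[float(k)] * 2 for _ in range(2)] for k in range(3)]
# A 2x1x1 kernel that takes the difference between consecutive frames.
temporal_diff = [[[-1.0]], [[1.0]]]
motion = conv3d_valid(frames, temporal_diff)
```

With this kernel the output responds only to change over time, something a 2D kernel applied to one frame cannot measure.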
Step 2: Take the temporal information $f_t$ and the spatial information $f_s$ as input to a decoder network to obtain the displacement information $d_i$ corresponding to each video frame. The decoder network uses 5 deconvolution layers. The input of each decoder layer is the sum of the output of the previous layer and the output of the corresponding encoder layer used in step 1.
2.1 Deconvolution
A deconvolution layer is essentially a convolution whose input-output relationship is the reverse of that of a convolution layer, so the forward and backward passes are exchanged: the forward pass of a convolution layer is the backward pass of a deconvolution layer, and the backward pass of a convolution layer is the forward pass of a deconvolution layer. Deconvolution can therefore restore an image whose size was reduced by a convolution layer to its original size. As shown in FIG. 7, the deconvolution operation performs the corresponding enlargement for each convolution block.
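The enlargement performed by deconvolution, and the encoder-decoder sum described in step 2, can be sketched in one dimension. This is a simplified illustration: the patent uses 3D deconvolution layers, and the kernel, stride, and encoder feature values below are hypothetical.

```python
from typing import List


def conv_transpose1d(signal: List[float], kernel: List[float], stride: int) -> List[float]:
    """Transposed 1D convolution: each input sample scatters a scaled copy of the kernel."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


def add_skip(decoder_feature: List[float], encoder_feature: List[float]) -> List[float]:
    """Element-wise sum of a decoder output and the matching encoder output."""
    if len(decoder_feature) != len(encoder_feature):
        raise ValueError("skip connection requires features of equal size")
    return [a + b for a, b in zip(decoder_feature, encoder_feature)]


# A length-2 feature is enlarged to length 4 (stride 2, kernel of ones).
upsampled = conv_transpose1d([1.0, 2.0], kernel=[1.0, 1.0], stride=2)
# Hypothetical encoder feature of the same size, added before the next layer.
next_input = add_skip(upsampled, [0.5, 0.5, 0.5, 0.5])
```

The output length `(n - 1) * stride + len(kernel)` is the inverse of the size reduction of a strided convolution without padding, which is why the decoder can return to the input resolution.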
Step 3: Take each video frame and its corresponding displacement information $d_i$ as input to the spatial transformer, and use the transformation matrix $A_\theta$ to obtain the other view of the video frame.
3.1 Spatial transformer
As shown in FIG. 5, U and V in the spatial transformer structure denote the input image and the output image, corresponding to the input video frame and the generated video frame, respectively. First, parameter prediction yields the transformation matrix $A_\theta$; then coordinate mapping is carried out with the parameters $\theta$; finally, sampling produces the result. The coordinate transformation formula $T_\theta(G)$ of the spatial transformer is:
where $(x_s, y_s)$ are the original pixel coordinates, $(x_t, y_t)$ are the target pixel coordinates, and $A_\theta$ is the transformation matrix.
Since the left and right views differ only by a shift in pixel position, with no rotation or scaling, the transformation matrix $A_\theta$ can be simplified to:
where $d_1$ and $d_2$ are the displacements in the horizontal and vertical directions, respectively.
The coordinate transformation formula becomes:
Further analysis of the stereo image pair shows that the left and right view pixel positions are displaced only in the horizontal direction and are perfectly aligned in the vertical direction, i.e. $d_2 = 0$. The transformation matrix can therefore be further simplified to:
where $d_1$ is abbreviated to $d$, which gives the final coordinate transformation formula:
Each pixel has its own displacement $d$, so the displacement map is dense (per-pixel), and the task of the neural network is to estimate the optimal displacement $d$.
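The formulas referred to in this subsection are not reproduced in the text. One consistent reading, assuming the affine convention of the spatial transformer literature in which the original (source) coordinates are computed from the target coordinates, is sketched below; the direction of the mapping is an assumption, while the symbols are those defined above.

```latex
% General coordinate transformation (assumed affine form)
\begin{pmatrix} x_s \\ y_s \end{pmatrix}
  = T_\theta(G)
  = A_\theta \begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}

% Translation only: no rotation, no scaling
A_\theta = \begin{pmatrix} 1 & 0 & d_1 \\ 0 & 1 & d_2 \end{pmatrix}
\quad\Longrightarrow\quad
x_s = x_t + d_1, \qquad y_s = y_t + d_2

% Horizontal displacement only: d_2 = 0, and d_1 is written as d
A_\theta = \begin{pmatrix} 1 & 0 & d \\ 0 & 1 & 0 \end{pmatrix}
\quad\Longrightarrow\quad
x_s = x_t + d, \qquad y_s = y_t
```

Under this reading the only learned quantity per pixel is the scalar $d$, which matches the statement that the network estimates a dense displacement map.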
After the transformed pixel positions are obtained, the pixels are placed at the corresponding positions by bilinear interpolation:
$I_r = B\{I_l, (x_t, y_t)\}$
where $I_l$ and $I_r$ denote the left-view and right-view pixels, respectively, and $B\{\cdot\}$ denotes bilinear interpolation.
Since $(x_t, y_t)$ depends only on the displacement $d$, $I_r$ can be rewritten as:
$I_r = B\{I_l, d\}$
Bilinear interpolation is differentiable, so the formula is also differentiable with respect to $d$ and errors can be back-propagated. Through the spatial transformer network, 2D-to-3D video conversion thus changes from two independent stages into an end-to-end system.
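The relation $I_r = B\{I_l, d\}$ can be sketched for a single-channel image. This is a minimal illustration under two assumptions that the text does not fix: each output pixel at column `x` samples the left view at `x + d`, and samples that fall outside the image are clamped to the border. Because the displacement is purely horizontal, bilinear interpolation reduces to linear interpolation along the row.

```python
import math
from typing import List

Image = List[List[float]]  # indexed as [row][column]


def warp_horizontal(left: Image, disp: Image) -> Image:
    """Build the other view by sampling `left` at x + d with linear interpolation."""
    height, width = len(left), len(left[0])
    right = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Assumed convention: the output pixel at x reads the input at x + d.
            xs = min(max(x + disp[y][x], 0.0), width - 1.0)  # clamp to the border
            x0 = int(math.floor(xs))
            x1 = min(x0 + 1, width - 1)
            w = xs - x0
            # The result is linear in w, hence differentiable with respect to d.
            right[y][x] = (1.0 - w) * left[y][x0] + w * left[y][x1]
    return right


left_view = [[0.0, 10.0, 20.0, 30.0]]
half_pixel = [[0.5, 0.5, 0.5, 0.5]]
right_view = warp_horizontal(left_view, half_pixel)
```

A deep learning framework would perform the same sampling on tensors so that gradients flow back to the displacement map automatically.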
3.2 Loss function
The L1 loss uses the mean absolute error (MAE) as the measure of the error between the prediction and the label, so the loss function $l$ is:
$l = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where $n$ denotes the number of input video frames, and $y_i$ and $\hat{y}_i$ denote the real value and the predicted value of the $i$-th video frame, respectively.
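The mean absolute error over a batch of frames can be sketched as follows. As an assumption for illustration, each frame is flattened to a list of pixel values and the absolute difference is averaged over the pixels of a frame before averaging over the $n$ frames.

```python
from typing import List

Frame = List[float]  # a frame flattened to a list of pixel values


def l1_loss(targets: List[Frame], predictions: List[Frame]) -> float:
    """Mean absolute error between real and predicted frames."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("need the same non-zero number of target and predicted frames")
    per_frame = []
    for y, y_hat in zip(targets, predictions):
        # Average absolute pixel difference inside one frame.
        per_frame.append(sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y))
    # Average over the n frames.
    return sum(per_frame) / len(per_frame)


loss = l1_loss(
    targets=[[0.0, 2.0], [4.0, 4.0]],
    predictions=[[1.0, 2.0], [4.0, 2.0]],
)
```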
Step 4: Splice the video frame and the generated corresponding view into a 3D video frame.
Step 5: Repeat steps 1-4 to obtain the complete 3D video.
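Steps 4 and 5 can be sketched as follows. The patent does not state the layout of the spliced frame; this example assumes a side-by-side stereo layout, and the function names are illustrative only.

```python
from typing import List

Image = List[List[float]]  # indexed as [row][column]


def stitch_side_by_side(left: Image, right: Image) -> Image:
    """Place the two views next to each other, row by row (assumed layout)."""
    if len(left) != len(right):
        raise ValueError("both views must have the same height")
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def stitch_video(lefts: List[Image], rights: List[Image]) -> List[Image]:
    """Repeat the per-frame splice over the whole sequence (step 5)."""
    return [stitch_side_by_side(l, r) for l, r in zip(lefts, rights)]


frame_3d = stitch_side_by_side([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
video_3d = stitch_video([[[1.0]], [[2.0]]], [[[9.0]], [[8.0]]])
```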
In summary, an encoder network in a neural network first extracts the spatial information of the 2D video and, at the same time, the temporal information across multiple frames; together, the spatial and temporal information serve as the video representation. A decoder network in the neural network then decodes the spatial and temporal information into displacement information. Next, a spatial transformer combines the displacement information with the pixel information of each video frame to obtain the corresponding frame from the other viewpoint. Finally, the frames of the two viewpoints are spliced into a 3D video. Applied to 2D-to-3D video conversion, the technical scheme of the invention can effectively improve conversion quality and conversion efficiency.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (4)
1. A full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling is characterized by comprising the following steps:
Step 1, using an encoder network to extract the temporal information $f_t$ and the spatial information $f_s$ of a plurality of video frames;
Step 2, taking the temporal information $f_t$ and the spatial information $f_s$ as input to a decoder network to obtain the displacement information $d_i$ corresponding to each video frame;
Step 3, taking each video frame and its corresponding displacement information $d_i$ as input to the spatial transformer, and using the transformation matrix $A_\theta$ and the coordinate transformation formula to obtain the other view of the video frame;
Step 4, splicing the video frame and the generated corresponding view into a 3D video frame;
Step 5, repeating steps 1-4 to obtain the complete 3D video.
2. The full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein: the encoder network used in step 1 is a densely connected neural network in which the 2D convolutions are replaced by 3D convolutions.
3. The full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein: the input of each decoder layer used in step 2 is the sum of the output of the previous layer and the output of the corresponding encoder layer used in step 1.
4. The full-automatic 2D video to 3D video conversion method based on spatio-temporal information modeling as claimed in claim 1, wherein: the transformation matrix $A_\theta$ and the coordinate transformation formula used in step 3 are, respectively:
where $d$ is the displacement corresponding to each pixel, $(x_s, y_s)$ are the original pixel coordinates, and $(x_t, y_t)$ are the target pixel coordinates.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910952610.4A CN110769242A (en) | 2019-10-09 | 2019-10-09 | Full-automatic 2D video to 3D video conversion method based on space-time information modeling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910952610.4A CN110769242A (en) | 2019-10-09 | 2019-10-09 | Full-automatic 2D video to 3D video conversion method based on space-time information modeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110769242A true CN110769242A (en) | 2020-02-07 |
Family
ID=69331077
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910952610.4A Pending CN110769242A (en) | 2019-10-09 | 2019-10-09 | Full-automatic 2D video to 3D video conversion method based on space-time information modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110769242A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114268782A (en) * | 2020-09-16 | 2022-04-01 | 镇江多游网络科技有限公司 | Attention migration-based 2D-to-3D video conversion method and device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120007960A1 (en) * | 2010-06-04 | 2012-01-12 | Samsung Electronics Co., Ltd. | Video processing method for 3D display based on multi-cue process |
CN105122810A (en) * | 2013-04-11 | 2015-12-02 | Lg电子株式会社 | Method and apparatus for processing video signal |
CN106504190A (en) * | 2016-12-29 | 2017-03-15 | 浙江工商大学 | A kind of three-dimensional video-frequency generation method based on 3D convolutional neural networks |
CN108388900A (en) * | 2018-02-05 | 2018-08-10 | 华南理工大学 | The video presentation method being combined based on multiple features fusion and space-time attention mechanism |
CN108921942A (en) * | 2018-07-11 | 2018-11-30 | 北京聚力维度科技有限公司 | The method and device of 2D transformation of ownership 3D is carried out to image |
-
2019
- 2019-10-09 CN CN201910952610.4A patent/CN110769242A/en active Pending
Non-Patent Citations (2)
Title |
---|
JIYOUNG LEE et al.: "AUTOMATIC 2D-TO-3D CONVERSION USING MULTI-SCALE DEEP NEURAL NETWORK", 《SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING》 * |
XIANGWEN LU et al.: "Atomic resolution tomography reconstruction of tilt series based on a GPU accelerated hybrid input–output algorithm using polar Fourier transform", 《ULTRAMICROSCOPY》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111739078B (en) | Monocular unsupervised depth estimation method based on context attention mechanism | |
CN113362223B (en) | Image super-resolution reconstruction method based on attention mechanism and two-channel network | |
CN110782490B (en) | Video depth map estimation method and device with space-time consistency | |
CN111028150B (en) | Rapid space-time residual attention video super-resolution reconstruction method | |
CN110930309B (en) | Face super-resolution method and device based on multi-view texture learning | |
CN111739082B (en) | Stereo vision unsupervised depth estimation method based on convolutional neural network | |
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
CN112019828B (en) | Method for converting 2D (two-dimensional) video into 3D video | |
CN102857739A (en) | Distributed panorama monitoring system and method thereof | |
CN110335222B (en) | Self-correction weak supervision binocular parallax extraction method and device based on neural network | |
CN111899295B (en) | Monocular scene depth prediction method based on deep learning | |
CN102438167B (en) | Three-dimensional video encoding method based on depth image rendering | |
CN110706155B (en) | Video super-resolution reconstruction method | |
CN105046725B (en) | Head shoulder images method for reconstructing in low-bit rate video call based on model and object | |
Pan et al. | RDEN: Residual distillation enhanced network-guided lightweight synthesized view quality enhancement for 3D-HEVC | |
CN104506872A (en) | Method and device for converting planar video into stereoscopic video | |
CN113850718A (en) | Video synchronization space-time super-resolution method based on inter-frame feature alignment | |
CN114170286A (en) | Monocular depth estimation method based on unsupervised depth learning | |
CN116542889A (en) | Panoramic video enhancement method with stable view point | |
CN113393382B (en) | Binocular picture super-resolution reconstruction method based on multi-dimensional parallax prior | |
CN113610912B (en) | System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction | |
CN112634127B (en) | Unsupervised stereo image redirection method | |
CN112927348B (en) | High-resolution human body three-dimensional reconstruction method based on multi-viewpoint RGBD camera | |
CN102223545A (en) | Rapid multi-view video color correction method | |
CN107330856B (en) | Panoramic imaging method based on projective transformation and thin plate spline |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200207 |