CN115588153B - Video frame generation method based on 3D-DoubleU-Net - Google Patents
Video frame generation method based on 3D-DoubleU-Net
- Publication number: CN115588153B (application CN202211234067.2A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides a video frame generation method based on 3D-DoubleU-Net. Aiming at the problem that frame generation methods struggle to accurately capture the spatio-temporal characteristics between video frames when the video scene is complex, objects move rapidly, or objects are occluded, the method combines a three-dimensional convolutional neural network with a double U-Net architecture to extract richer spatio-temporal features from the video and to generate an intermediate frame closer to the real frame. With this technical scheme, the 3D-DoubleU-Net network simultaneously exploits the ability of the three-dimensional convolutional neural network to explore the spatio-temporal dimensions and the ability of the double U-Net network to capture inter-frame context information, so that more accurate motion information and richer spatio-temporal features between frames can be captured in extreme scenes, and finer intermediate frame results can be generated.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a video frame generation method based on 3D-DoubleU-Net.
Background
With the upgrading of video display equipment and the improvement of video transmission bandwidth, people's requirements on video visual quality keep increasing. The frame rate is one of the important indicators of video quality, representing the number of frame images played per second. A video with a low frame rate can exhibit picture delay and jitter when played, degrading the viewing experience of users.
Video frame generation is a technology that, using video/image processing techniques and taking the original video frame images as reference, generates and inserts one or more frames between two consecutive frames, thereby converting the video frame rate from low to high. Video frame generation is one of the key technologies in the video processing field; it has attracted wide attention from researchers and is widely applied in video enhancement, data compression, video special-effect processing, and other fields.
With the development of deep learning technology in recent years, a large number of video frame generation methods based on deep learning have been proposed, mainly including methods based on optical flow estimation, methods based on kernel estimation, and methods combining optical flow estimation with kernel estimation.
The most widely used methods are based on estimating the optical flow between input frames, but under challenging conditions such algorithms cannot estimate the optical flow accurately, producing blurry results. Kernel-estimation-based methods generally estimate a kernel adaptively for each pixel and then convolve the estimated kernel with the input frame image to obtain the intermediate frame; however, such methods cannot attend to arbitrary positions and therefore cannot handle object motion beyond the kernel size. Methods combining optical flow estimation with kernel estimation use optical flow to perform motion estimation on the input frames and sample pixel information around a reference point. But the available reference points of this type of method are still few, and the disadvantages of the optical-flow-estimation and kernel-estimation methods are not significantly improved.
Video scenes collected in practice commonly suffer from complex scenes, rapid object movement, object occlusion, severe illumination changes, and similar problems, which pose great challenges to video frame generation research. Therefore, video frame generation remains one of the difficult problems in the current computer vision field, and research on robust and accurate video frame generation methods has important theoretical significance and application value.
Disclosure of Invention
In order to make up for the defects of the prior art, the invention provides a video frame generation method based on 3D-DoubleU-Net.
The invention is realized by the following technical scheme: the video frame generation method based on the 3D-DoubleU-Net is characterized by comprising the following steps of:
s1, constructing a data set: both the training and test data sets contain a plurality of triplets, one triplet consisting of three consecutive frames in the time domain, denoted as (I0, It, I1), where I0 is the previous frame, It is the real intermediate frame, and I1 is the latter frame;
s2, designing a 3D-DoubleU-Net network model: the model comprises two three-dimensional U-Net networks with a dual cross-view spatial attention mechanism (VISTA), wherein each three-dimensional U-Net network consists of a three-dimensional Encoder (3D-Encoder), atrous spatial pyramid pooling (ASPP), and a three-dimensional Decoder (3D-Decoder);
the spliced adjacent frames (I0, I1) are input to the two three-dimensional U-Net networks in sequence; the first three-dimensional U-Net network yields the result Ît1; subsequently, Ît1 and (I0, I1) are input together into the second three-dimensional U-Net network, yielding the result Ît2; finally, Ît1 and Ît2 are spliced and input into a two-dimensional convolution to obtain the final result Ît;
S3, training a model: the invention trains the optimal model by minimizing the difference between the initial results Ît1 and Ît2, as well as the final result Ît, and the true intermediate frame It; the loss function used is as follows:

L = λ1·l1 + λ2·l2 + λ3·lp (1);

l1 = ‖Ît1 − It‖1 + ‖Ît2 − It‖1 (2);

l2 = ρ(Ît − It), ρ(x) = √(x² + ε²) (3);

lp = ‖φ(Ît) − φ(It)‖2 (4);

wherein the invention uses the total loss L to train the network, with weights λ1, λ2, λ3; l1 uses the L1 norm to measure the differences between Ît1, Ît2 and It; l2 measures the difference between Ît and It with the L1 norm optimized by the Charbonnier function ρ, where ε is a small constant; lp is the perceptual loss: the conv4_3 convolutional layer of a VGG-16 network pre-trained on ImageNet is used as the feature extractor φ to obtain the perceptual loss between Ît and It;
s4, testing a model: inputting the front frame and the rear frame of the test set into a trained model, and directly generating an intermediate frame result;
s5, using a model: the real video is input into a trained network model, and a high-frame-rate video can be obtained.
Preferably, the step S1 specifically includes the following steps:
s1-1, model training uses the Vimeo-90K data set containing 51312 triplets, where I0 and I1 are the adjacent frames serving as input to the network, and the second frame It is the real frame used to supervise the training of the network;
s1-2, the UCF101 and the DAVIS data set are selected for testing the model.
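The triplet construction described in step S1 can be sketched as follows; a minimal illustration only, not the patent's actual Vimeo-90K loader (the overlapping grouping and the `make_triplets` name are assumptions for exposition):

```python
def make_triplets(frames):
    """Group a decoded frame sequence into triplets
    (previous frame, real intermediate frame, latter frame).
    The middle frame supervises training; the outer two are network input."""
    return [(frames[i], frames[i + 1], frames[i + 2])
            for i in range(len(frames) - 2)]

# Example: a sequence of 5 frames yields 3 overlapping triplets.
triplets = make_triplets(["f0", "f1", "f2", "f3", "f4"])
```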
Preferably, the step S2 specifically includes the following steps:
s2-1, designing a 3D-DoubleU-Net network model: the model contains two three-dimensional U-Net networks with a dual cross-view spatial attention mechanism (VISTA);
s2-2, the first three-dimensional U-Net consists of an encoder E1 and a decoder D1, where encoder E1 is a pretrained ResNet18-3D (R3D-18) three-dimensional convolutional neural network; the pooling operation and the last classification layer are removed from R3D-18, and a three-dimensional convolution with a spatial stride of 2 is used;
s2-3: the first generated frame Ît1 is fused with the input frames (I0, I1) using a pixel-level multiplication operation, and the fused result is input to the second three-dimensional U-Net network; feature extraction and upsampling yield the result output 2, denoted Ît2;
S2-4: the second three-dimensional U-Net has the same structure as the first and consists of an encoder E2, atrous spatial pyramid pooling (ASPP), and a decoder D2;
s2-5, the fused result is input to encoder E2 to obtain the extracted feature F2, which is then input to ASPP to obtain multi-scale context information;
S2-6, decoder D2 also contains four decoding blocks; but unlike decoder D1, which uses only skip connections from its own encoder, D2 uses skip connections from both encoders. The multi-scale context information is input to decoder D2 to obtain the second generated frame result Ît2;
S2-7, in the second three-dimensional U-Net, the last layer of each encoding block and decoding block applies a dual cross-view spatial attention mechanism (VISTA) to the features;
s2-8, Ît1 and Ît2 are spliced, and the spliced result is input into a two-dimensional convolution to obtain the final result Ît.
Further, the step S2-3 specifically comprises the following steps:
s2-3-1, the cascaded input frames (I0, I1) are input to encoder E1 for feature extraction, obtaining the feature F1.
S2-3-2, atrous spatial pyramid pooling (ASPP) is adopted to capture multi-scale context, obtaining the feature A1.
S2-3-3, a decoder D1 containing four decoding blocks is used to reconstruct the preliminary intermediate frame result output 1, i.e. Ît1.
Further, the decoder D1 in step S2-3-3 uses three-dimensional transposed convolution layers (3D TransConv) with a stride of 2, and a three-dimensional convolution layer is added after the last transposed convolution layer to handle the common checkerboard artifacts; the last layer of each decoding block applies a dual cross-view spatial attention mechanism (VISTA) to the features.
Further, the ResNet18-3D (R3D-18) three-dimensional convolutional neural network in step S2-4 is the backbone structure of encoder E2; unlike encoder E1, encoder E2 is trained from scratch and comprises four encoding blocks.
By adopting the above technical scheme, compared with the prior art, the invention has the following beneficial effects. The method mainly comprises two parts: first, intermediate frames are generated without an intermediate motion estimation step, so that the inaccuracy of motion estimation in conventional methods under extreme scenes is avoided; second, a three-dimensional convolutional neural network and a double U-Net network are combined for frame generation for the first time, and a 3D-DoubleU-Net network is proposed. The 3D-DoubleU-Net network simultaneously exploits the ability of the three-dimensional convolutional neural network to explore the spatio-temporal dimensions and the ability of the double U-Net network to capture inter-frame context information, capturing more accurate motion information and richer spatio-temporal features between frames in extreme scenes and generating finer intermediate frame results.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a data set format;
FIG. 2 is a flow chart of the method of the present invention;
fig. 3 is a schematic diagram of a network structure according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
The 3D-DoubleU-Net based video frame generation method according to the embodiment of the present invention will now be specifically described with reference to figs. 1 to 3. Aiming at the problem that frame generation methods struggle to accurately obtain the spatio-temporal characteristics between video frames when the video scene is complex, objects move rapidly, and objects are occluded, the invention combines a three-dimensional convolutional neural network with a double U-Net architecture to extract richer spatio-temporal features from the video and generate an intermediate frame closer to the real frame. In specific implementation, the technical scheme of the invention can be realized as an automatic operation flow using computer software technology.
As shown in fig. 2, the invention provides a video frame generation method based on 3D-DoubleU-Net, which specifically includes the following steps:
s1, constructing a data set: both the training and test data sets contain a plurality of triplets, one triplet consisting of three consecutive frames in the time domain, denoted as (I0, It, I1), where I0 is the previous frame, It is the real intermediate frame, and I1 is the latter frame; the method specifically comprises the following steps:
s1-1, model training uses the Vimeo-90K data set containing 51312 triplets, where I0 and I1 are the adjacent frames serving as input to the network, and the second frame It is the real frame used to supervise the training of the network;
s1-2, the invention selects UCF101 and DAVIS data sets to test the model.
S2, designing a 3D-DoubleU-Net network model: the model comprises two three-dimensional U-Net networks with a dual cross-view spatial attention mechanism (VISTA), wherein each three-dimensional U-Net network consists of a three-dimensional Encoder (3D-Encoder), atrous spatial pyramid pooling (ASPP), and a three-dimensional Decoder (3D-Decoder);
the spliced adjacent frames (I0, I1) are input to the two three-dimensional U-Net networks in sequence; the first three-dimensional U-Net network yields the result Ît1; subsequently, Ît1 and (I0, I1) are input together into the second three-dimensional U-Net network, yielding the result Ît2; finally, Ît1 and Ît2 are spliced and input into a two-dimensional convolution to obtain the final result Ît;
The method specifically comprises the following steps:
s2-1, designing a 3D-DoubleU-Net network model: the model contains two three-dimensional U-Net networks with a dual cross-view spatial attention mechanism (VISTA);
s2-2, the first three-dimensional U-Net consists of an encoder E1 and a decoder D1, where encoder E1 is a pretrained ResNet18-3D (R3D-18) three-dimensional convolutional neural network; the pooling operation and the last classification layer are removed from R3D-18, and a three-dimensional convolution with a spatial stride of 2 is used;
s2-3: the first generated frame Ît1 is fused with the input frames (I0, I1) using a pixel-level multiplication operation, and the fused result is input to the second three-dimensional U-Net network; feature extraction and upsampling yield the result output 2, denoted Ît2; the method specifically comprises the following steps:
s2-3-1, the cascaded input frames (I0, I1) are input to encoder E1 for feature extraction, obtaining the feature F1.
S2-3-2, atrous spatial pyramid pooling (ASPP) is adopted to capture multi-scale context, obtaining the feature A1.
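The ASPP step above relies on atrous (dilated) convolution, whose kernel taps are spaced apart to enlarge the receptive field without adding parameters. A one-dimensional toy sketch of the idea only, not the patent's actual ASPP module (the averaging kernel and the dilation rates 1, 2, 4 are illustrative assumptions):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1-D atrous convolution: kernel taps are spaced
    `dilation` samples apart, enlarging the receptive field."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([sum(kernel[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

def aspp_1d(x, rates=(1, 2, 4)):
    """Toy ASPP: the same averaging kernel applied at several dilation
    rates; the stacked rows form multi-scale context (one row per rate)."""
    kernel = np.array([1 / 3, 1 / 3, 1 / 3])
    return np.stack([dilated_conv1d(x, kernel, r) for r in rates])

ctx = aspp_1d(np.arange(9.0))  # shape (3, 9): three scales over 9 samples
```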
S2-3-3, a decoder D1 containing four decoding blocks is used to reconstruct the preliminary intermediate frame result output 1, i.e. Ît1; decoder D1 uses three-dimensional transposed convolution layers (3D TransConv) with a stride of 2, and, in order to handle the common checkerboard artifacts, a three-dimensional convolution layer is added after the last transposed convolution layer; the last layer of each decoding block applies a dual cross-view spatial attention mechanism (VISTA) to the features.
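The checkerboard artifacts mentioned in step S2-3-3 arise because a strided transposed convolution covers output positions unevenly. A one-dimensional sketch of the effect (illustrative only; the patent's layers are three-dimensional and learned):

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Stride-2 transposed convolution implemented directly: each input
    sample scatters a scaled copy of the kernel into the output."""
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

# A constant input through a constant kernel still yields an alternating
# output: with stride 2 and kernel size 3, interior even positions are
# covered by two kernel taps and odd positions by one -- the 1-D analogue
# of the checkerboard artifact that the convolution layer added after the
# last transposed convolution is meant to smooth out.
y = transposed_conv1d(np.ones(5), np.ones(3), stride=2)
```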
S2-4: the second three-dimensional U-Net has the same structure as the first and consists of an encoder E2, atrous spatial pyramid pooling (ASPP), and a decoder D2; the ResNet18-3D (R3D-18) three-dimensional convolutional neural network is the backbone structure of encoder E2; unlike encoder E1, encoder E2 is trained from scratch and comprises four encoding blocks.
S2-5, the fused result is input to encoder E2 to obtain the extracted feature F2, which is then input to ASPP to obtain multi-scale context information;
S2-6, decoder D2 also contains four decoding blocks; but unlike decoder D1, which uses only skip connections from its own encoder, D2 uses skip connections from both encoders. The multi-scale context information is input to decoder D2 to obtain the second generated frame result Ît2;
S2-7, in the second three-dimensional U-Net, the last layer of each encoding block and decoding block applies a dual cross-view spatial attention mechanism (VISTA) to the features;
s2-8, Ît1 and Ît2 are spliced, and the spliced result is input into a two-dimensional convolution to obtain the final result Ît.
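The cascade described in step S2 can be traced end to end with shape-preserving stand-ins. This is a sketch of the data flow only: the three averaging functions below are placeholders for the learned three-dimensional U-Nets and the final two-dimensional convolution, and exist solely so the shapes and composition can be checked.

```python
import numpy as np

def unet3d_1(stacked):    # (2, H, W) -> (H, W): first intermediate estimate
    return stacked.mean(axis=0)

def unet3d_2(stacked):    # (3, H, W) -> (H, W): second intermediate estimate
    return stacked.mean(axis=0)

def conv2d_head(stacked): # (2, H, W) -> (H, W): final fusion
    return stacked.mean(axis=0)

def generate_intermediate(i0, i1):
    """Trace the cascade of step S2: the first network on the spliced
    inputs, the second network on its output plus the inputs, then a
    2-D convolution fusing both intermediate results."""
    out1 = unet3d_1(np.stack([i0, i1]))        # first result
    out2 = unet3d_2(np.stack([out1, i0, i1]))  # second result
    return conv2d_head(np.stack([out1, out2])) # final frame

i0, i1 = np.zeros((4, 4)), np.ones((4, 4))
it = generate_intermediate(i0, i1)             # same spatial shape as inputs
```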
S3, training a model: the invention trains the optimal model by minimizing the difference between the initial results Ît1 and Ît2, as well as the final result Ît, and the true intermediate frame It; the loss function used in the invention is as follows:

L = λ1·l1 + λ2·l2 + λ3·lp (1);

l1 = ‖Ît1 − It‖1 + ‖Ît2 − It‖1 (2);

l2 = ρ(Ît − It), ρ(x) = √(x² + ε²) (3);

lp = ‖φ(Ît) − φ(It)‖2 (4);

wherein the invention uses the total loss L to train the network, with weights λ1, λ2, λ3; l1 uses the L1 norm to measure the differences between Ît1, Ît2 and It; l2 measures the difference between Ît and It with the L1 norm optimized by the Charbonnier function ρ, where ε is a small constant; lp is the perceptual loss, which helps the network effectively produce a more visually realistic result: the conv4_3 convolutional layer of a VGG-16 network pre-trained on ImageNet is used as the feature extractor φ to obtain the perceptual loss between Ît and It;
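The loss terms described above can be sketched on toy arrays as follows. This is a hedged illustration, not the patent's trained configuration: the unit weights, the value of ε, and the identity function standing in for the VGG-16 conv4_3 feature extractor are all assumptions.

```python
import numpy as np

def l1_term(out1, out2, gt):
    """L1-norm differences between both intermediate results and ground truth."""
    return np.abs(out1 - gt).mean() + np.abs(out2 - gt).mean()

def charbonnier_term(final, gt, eps=1e-6):
    """Charbonnier-smoothed L1 difference between the final result and GT."""
    return np.sqrt((final - gt) ** 2 + eps ** 2).mean()

def perceptual_term(final, gt, features=lambda x: x):
    """Distance in a feature space; `features` stands in for VGG-16 conv4_3."""
    return ((features(final) - features(gt)) ** 2).mean()

def total_loss(out1, out2, final, gt, w=(1.0, 1.0, 1.0)):
    return (w[0] * l1_term(out1, out2, gt)
            + w[1] * charbonnier_term(final, gt)
            + w[2] * perceptual_term(final, gt))

gt = np.zeros((2, 2))
loss = total_loss(np.full((2, 2), 0.1), np.full((2, 2), 0.2),
                  np.full((2, 2), 0.1), gt)   # 0.3 + ~0.1 + 0.01 ≈ 0.41
```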
s4, testing a model: the front and rear frames of the test set are input into the trained model, and the intermediate frame result is generated directly; the invention evaluates the generated intermediate frame result using the objective metrics peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), as well as a subjective method.
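Peak signal-to-noise ratio, one of the objective metrics named in step S4, can be computed as follows (a standard formulation, assuming 8-bit frames with a peak value of 255; not code from the patent):

```python
import numpy as np

def psnr(ref, gen, peak=255.0):
    """Peak signal-to-noise ratio between a real intermediate frame and a
    generated one, in dB; higher means closer to the ground truth."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

a = np.zeros((4, 4))
b = np.full((4, 4), 16.0)   # constant error of 16 -> MSE = 256
val = psnr(a, b)            # 10*log10(255^2 / 256) ≈ 24.05 dB
```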
S5, using a model: the real video is input into a trained network model, and a high-frame-rate video can be obtained.
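The low-to-high frame-rate conversion of step S5 amounts to inserting one generated frame between every pair of consecutive frames, so N input frames become 2N − 1 output frames. A sketch with a numeric midpoint standing in for the trained network (the `interpolate` callable is a placeholder, not the patent's model):

```python
def double_frame_rate(frames, interpolate):
    """Insert one generated frame between each pair of consecutive frames:
    N input frames become 2N - 1 output frames."""
    out = [frames[0]]
    for prev, nxt in zip(frames, frames[1:]):
        out.append(interpolate(prev, nxt))  # model-generated intermediate frame
        out.append(nxt)
    return out

# Toy stand-in for the trained model: the numeric midpoint of two "frames".
video = double_frame_rate([0.0, 1.0, 2.0], lambda a, b: (a + b) / 2)
```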
In the description of the present invention, the term "plurality" means two or more, unless explicitly defined otherwise, the orientation or positional relationship indicated by the terms "upper", "lower", etc. are based on the orientation or positional relationship shown in the drawings, merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention; the terms "coupled," "mounted," "secured," and the like are to be construed broadly, and may be fixedly coupled, detachably coupled, or integrally connected, for example; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description of the present specification, the terms "one embodiment," "some embodiments," "particular embodiments," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (6)
1. The video frame generation method based on the 3D-DoubleU-Net is characterized by comprising the following steps of:
s1, constructing a data set: both the training and test data sets contain a plurality of triplets, one triplet consisting of three consecutive frames in the time domain, denoted as (I0, It, I1), where I0 is the previous frame, It is the real intermediate frame, and I1 is the latter frame;
s2, designing a 3D-DoubleU-Net network model: the model comprises two three-dimensional U-Net networks with a dual cross-view spatial attention mechanism VISTA, wherein each three-dimensional U-Net network consists of a three-dimensional Encoder 3D-Encoder, atrous spatial pyramid pooling ASPP, and a three-dimensional Decoder 3D-Decoder;
the spliced adjacent frames (I0, I1) are input to the two three-dimensional U-Net networks in sequence; the first three-dimensional U-Net network yields the result Ît1; subsequently, Ît1 and (I0, I1) are input together into the second three-dimensional U-Net network, yielding the result Ît2; finally, Ît1 and Ît2 are spliced and input into a two-dimensional convolution to obtain the final result Ît;
S3, training a model: the optimal model is trained by minimizing the difference between the initial results Ît1 and Ît2, as well as the final result Ît, and the true intermediate frame It; the loss function used is as follows:

L = λ1·l1 + λ2·l2 + λ3·lp (1);

l1 = ‖Ît1 − It‖1 + ‖Ît2 − It‖1 (2);

l2 = ρ(Ît − It), ρ(x) = √(x² + ε²) (3);

lp = ‖φ(Ît) − φ(It)‖2 (4);

wherein the total loss L is used to train the network, with weights λ1, λ2, λ3; l1 uses the L1 norm to measure the differences between Ît1, Ît2 and It; l2 measures the difference between Ît and It with the L1 norm optimized by the Charbonnier function ρ, where ε is a small constant; lp is the perceptual loss: the conv4_3 convolutional layer of a VGG-16 network pre-trained on ImageNet is used as the feature extractor φ to obtain the perceptual loss between Ît and It;
s4, testing a model: inputting the front frame and the rear frame of the test set into a trained model, and directly generating an intermediate frame result;
s5, using a model: the real video is input into a trained network model, and a high-frame-rate video can be obtained.
2. The method for generating a video frame based on 3D-double u-Net according to claim 1, wherein said step S1 specifically comprises the steps of:
s1-1, model training uses the Vimeo-90K data set containing 51312 triplets, where I0 and I1 are the adjacent frames serving as input to the network, and the second frame It is the real frame used to supervise the training of the network;
s1-2, the UCF101 and the DAVIS data set are selected for testing the model.
3. The method for generating a video frame based on 3D-double u-Net according to claim 1, wherein said step S2 specifically comprises the steps of:
s2-1, designing a 3D-DoubleU-Net network model: the model comprises two three-dimensional U-Net networks with a dual cross-view spatial attention mechanism VISTA;
s2-2, the first three-dimensional U-Net consists of an encoder E1 and a decoder D1, where encoder E1 is a pretrained ResNet18-3D three-dimensional convolutional neural network; the pooling operation and the last classification layer are removed from ResNet18-3D, and a three-dimensional convolution with a spatial stride of 2 is used;
s2-3: the first generated frame Ît1 is fused with the input frames (I0, I1) using a pixel-level multiplication operation, and the fused result is input to the second three-dimensional U-Net network; feature extraction and upsampling yield the result output 2, denoted Ît2;
S2-4: the second three-dimensional U-Net has the same structure as the first and consists of an encoder E2, atrous spatial pyramid pooling ASPP, and a decoder D2;
s2-5, the fused result is input to encoder E2 to obtain the extracted feature F2, which is then input to ASPP to obtain multi-scale context information;
S2-6, decoder D2 also contains four decoding blocks; but unlike decoder D1, which uses only skip connections from its own encoder, D2 uses skip connections from both encoders; the multi-scale context information is input to decoder D2 to obtain the second generated frame result Ît2;
S2-7, in the second three-dimensional U-Net, the last layer of each encoding block and decoding block applies a dual cross-view spatial attention mechanism VISTA to the features;
s2-8, Ît1 and Ît2 are spliced, and the spliced result is input into a two-dimensional convolution to obtain the final result Ît.
4. A method for generating a video frame based on 3D-double u-Net according to claim 3, wherein said step S2-3 specifically comprises the steps of:
s2-3-1, the cascaded input frames (I0, I1) are input to encoder E1 for feature extraction, obtaining the feature F1;
S2-3-2, atrous spatial pyramid pooling ASPP is adopted to capture multi-scale contexts, obtaining the feature A1;
S2-3-3, a decoder D1 containing four decoding blocks is used to reconstruct the preliminary intermediate frame result output 1, i.e. Ît1.
5. The method for generating video frames based on 3D-DoubleU-Net as recited in claim 4, wherein said decoder D1 in step S2-3-3 uses three-dimensional transposed convolution layers (3D TransConv) with a stride of 2, and a three-dimensional convolution layer is added after the last transposed convolution layer; the last layer of each decoding block applies a dual cross-view spatial attention mechanism VISTA to the features.
6. The method for generating 3D-DoubleU-Net based video frames according to claim 3, wherein said ResNet18-3D three-dimensional convolutional neural network in step S2-4 is the backbone structure of encoder E2; unlike encoder E1, encoder E2 is trained from scratch and comprises four encoding blocks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211234067.2A CN115588153B (en) | 2022-10-10 | 2022-10-10 | Video frame generation method based on 3D-DoubleU-Net |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115588153A CN115588153A (en) | 2023-01-10 |
CN115588153B true CN115588153B (en) | 2024-02-02 |
Family
ID=84779524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211234067.2A Active CN115588153B (en) | 2022-10-10 | 2022-10-10 | Video frame generation method based on 3D-DoubleU-Net |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115588153B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109379550A (en) * | 2018-09-12 | 2019-02-22 | 上海交通大学 | Video frame rate upconversion method and system based on convolutional neural networks |
CN109905624A (en) * | 2019-03-01 | 2019-06-18 | 北京大学深圳研究生院 | A kind of video frame interpolation method, device and equipment |
CN111489372A (en) * | 2020-03-11 | 2020-08-04 | 天津大学 | Video foreground and background separation method based on cascade convolution neural network |
CN113542651A (en) * | 2021-05-28 | 2021-10-22 | 北京迈格威科技有限公司 | Model training method, video frame interpolation method and corresponding device |
CN113808106A (en) * | 2021-09-17 | 2021-12-17 | 浙江大学 | Ultra-low dose PET image reconstruction system and method based on deep learning |
CN114842400A (en) * | 2022-05-23 | 2022-08-02 | 山东海量信息技术研究院 | Video frame generation method and system based on residual block and feature pyramid |
Non-Patent Citations (4)
Title |
---|
Avinash Paliwal et al., "Deep Slow Motion Video Reconstruction With Hybrid Imaging System," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 7, pp. 1557-1569 * |
Debesh Jha et al., "DoubleU-Net: A Deep Convolutional Neural Network for Medical Image Segmentation," arXiv:2006.04868v2 [eess.IV], pp. 1-7 * |
Shengheng Deng et al., "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention," arXiv:2203.09704v1 [cs.CV], pp. 1-12 * |
Long Gucan et al., "Deep convolutional neural network for inter-frame motion compensation in video images," Journal of National University of Defense Technology, vol. 38, no. 5, pp. 143-148 * |
Also Published As
Publication number | Publication date |
---|---|
CN115588153A (en) | 2023-01-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109064507B (en) | Multi-motion-stream deep convolution network model method for video prediction | |
CN109671023B (en) | Face image super-resolution secondary reconstruction method | |
CN111260560B (en) | Multi-frame video super-resolution method fused with attention mechanism | |
CN111028150B (en) | Rapid space-time residual attention video super-resolution reconstruction method | |
WO2022267641A1 (en) | Image defogging method and system based on cyclic generative adversarial network | |
Cao et al. | Semi-automatic 2D-to-3D conversion using disparity propagation | |
CN112149459B (en) | Video saliency object detection model and system based on cross attention mechanism | |
CN111739082B (en) | Stereo vision unsupervised depth estimation method based on convolutional neural network | |
CN111489372B (en) | Video foreground and background separation method based on cascaded convolutional neural networks | |
CN113139898B (en) | Light field image super-resolution reconstruction method based on frequency domain analysis and deep learning | |
CN114463218B (en) | Video deblurring method based on event data driving | |
CN110751649A (en) | Video quality evaluation method and device, electronic equipment and storage medium | |
CN110225260B (en) | Three-dimensional high dynamic range imaging method based on generation countermeasure network | |
WO2023231535A1 (en) | Monochrome image-guided joint denoising and demosaicing method for color raw image | |
CN114170286A (en) | Monocular depth estimation method based on unsupervised deep learning | |
CN116248955A (en) | VR cloud rendering image enhancement method based on AI frame extraction and frame supplement | |
CN115496663A (en) | Video super-resolution reconstruction method based on D3D convolution intra-group fusion network | |
CN114494050A (en) | Self-supervision video deblurring and image frame inserting method based on event camera | |
CN115588153B (en) | Video frame generation method based on 3D-DoubleU-Net | |
CN112862675A (en) | Video enhancement method and system for space-time super-resolution | |
CN116402908A (en) | Dense light field image reconstruction method based on heterogeneous imaging | |
CN117011357A (en) | Human body depth estimation method and system based on 3D motion flow and normal map constraint | |
CN116208812A (en) | Video frame inserting method and system based on stereo event and intensity camera | |
CN116563155A (en) | Method for converting priori semantic image into picture | |
CN114612305B (en) | Event-driven video super-resolution method based on stereogram modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||