CN112040311B - Video image frame supplementing method, device and equipment and storage medium - Google Patents

Video image frame supplementing method, device and equipment and storage medium

Info

Publication number
CN112040311B
CN112040311B (application CN202010720883.9A)
Authority
CN
China
Prior art keywords
optical flow
coarse
frame
grained
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010720883.9A
Other languages
Chinese (zh)
Other versions
CN112040311A (en)
Inventor
李甲
许豪
马中行
赵沁平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010720883.9A
Publication of CN112040311A
Application granted
Publication of CN112040311B
Legal status: Active (current)
Anticipated expiration: legal status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/149Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8455Structuring of content, e.g. decomposing content into time segments involving pointers to the content, e.g. pointers to the I-frames of the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention provides a video image frame supplementing method, device, equipment and storage medium. The method comprises the following steps: extracting two adjacent frames (a previous frame and a subsequent frame) from a target video and inputting them into a coarse-grained optical flow generation model trained to convergence, so as to output coarse-grained optical flow data corresponding to the two adjacent frames; inputting pre-configured frame supplementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to convergence, so as to output intermediate frame optical flow data; and generating a target intermediate frame image from the two adjacent frames and the intermediate frame optical flow data. Because the intermediate frame optical flow generation model trained to convergence fuses the time information with the motion information, the generated intermediate frame optical flow data is more strongly correlated with the coarse-grained optical flow data, which improves the correlation between the target intermediate frame image and the preceding and following frames and thus the continuity of the whole video.

Description

Video image frame supplementing method, device and equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of video image processing, in particular to a method, a device and equipment for supplementing a frame of a video image and a storage medium.
Background
In recent years video applications have proliferated and the amount of video data on the network has grown explosively, bringing new opportunities and challenges to video optimization research. The video frame rate is an important attribute of a video and is highly significant in video content optimization. First, for high-frame-rate video, the modern film industry largely relies on persistence of vision to create the viewing experience: static pictures are played in succession, and when the playback speed exceeds what the human eye can resolve, the viewer perceives motion. The frame rate of a film therefore determines the viewing experience, and raising the frame rate, which increases the amount of image information and smooths fast-moving shots so as to enhance the audio-visual impact, is of great significance to the film and television industry. Second, in the field of slow-motion video, industries built around video content have grown with the development of the internet, and slow-motion playback is an important product feature with great application value.
How to increase the frame rate has therefore become a key problem in video optimization. The frame rate is generally increased by frame-supplementing the video, but the intermediate frames obtained with prior-art methods correlate poorly with the preceding and following frames, which results in poor continuity of the whole video.
Disclosure of Invention
The invention provides a video image frame supplementing method, device, equipment and storage medium, which solve the problem that an intermediate frame obtained by existing video image frame supplementing approaches correlates poorly with the preceding and following frames, leading to poor continuity of the whole video.
The first aspect of the embodiments of the present invention provides a method for frame interpolation of a video image, where the method is applied to an electronic device, and the method includes:
extracting front and rear adjacent frames of images in a target video, and respectively inputting the front and rear adjacent frames of images into a coarse-grained optical flow generation model trained to be convergent so as to output coarse-grained optical flow data corresponding to the front and rear adjacent frames of images;
inputting the pre-configured frame supplementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to be converged to output intermediate frame optical flow data;
and generating a target intermediate frame image according to the front and back adjacent two frame images and the intermediate frame optical flow data.
Further, the method for extracting two adjacent frames of images in front and back of the target video and inputting the two adjacent frames of images into the coarse-grained optical flow generation model trained to converge respectively to output the coarse-grained optical flow data corresponding to the two adjacent frames of images comprises:
extracting front and back adjacent two frames of images in a target video, and respectively inputting the images into a coarse-grained optical flow generation model trained to be convergent;
and extracting corresponding image characteristic parameters from the front and rear adjacent two frames of images through the coarse-grained optical flow generation model, and outputting coarse-grained optical flow data corresponding to the front and rear adjacent two frames of images according to the image characteristic parameters.
Further, the method as described above, wherein the coarse-grained optical flow generation model comprises a codec network and a reversed convolution structure;
the extracting, by the coarse-grained optical flow generation model, corresponding image feature parameters from the front and rear two frames of images, and outputting coarse-grained optical flow data corresponding to the front and rear two adjacent frames of images according to the image feature parameters, includes:
extracting the image characteristic parameters from the front and back adjacent two frames of images through a coding network and coding to obtain corresponding coding results;
inputting the image characteristic parameters into the turning convolution structure to obtain the alignment characteristic graphs of the front and rear adjacent frames of images;
and inputting the alignment feature map and the coding result into a decoding network to output coarse-grained optical flow data corresponding to the front and rear adjacent two frames of images.
Further, the method as described above, the coarse-grained optical flow data comprising coarse-grained bi-directional optical flow data; the intermediate frame optical flow generation model comprises a fusion function and an object motion track fitting function;
the inputting the pre-configured frame-complementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to converge to output intermediate frame optical flow data comprises:
fusing pre-configured frame supplementing time data and the coarse-grained bidirectional optical flow data through the fusion function to output frame supplementing time bidirectional optical flow data corresponding to the frame supplementing time data;
and inputting the frame supplementing time bidirectional optical flow data into the object motion track fitting function to output intermediate frame optical flow data.
Further, the method as described above, before the input into the coarse-grained optical flow generation model trained to converge to output the coarse-grained optical flow data corresponding to the two adjacent frames of images, further includes:
obtaining a first training sample, wherein the first training sample is a training sample corresponding to a coarse-grained optical flow generation model, and the first training sample comprises: a previous frame image and a subsequent frame image;
inputting the first training sample into a preset coarse-grained optical flow generation model to train the preset coarse-grained optical flow generation model;
judging whether the preset coarse-grained optical flow generation model meets a convergence condition or not by adopting a reconstruction loss function;
and if the preset coarse-grained optical flow generation model meets the convergence condition, determining the coarse-grained optical flow generation model meeting the convergence condition as a coarse-grained optical flow generation model trained to be converged.
Further, the method as described above, before inputting the pre-configured frame-complementing time data and the coarse-grained optical flow data into the intermediate-frame optical flow generation model trained to converge to output intermediate-frame optical flow data, further comprising:
acquiring a second training sample, wherein the second training sample is a training sample corresponding to the intermediate frame optical flow generation model, and the second training sample comprises: a first standard inter-frame image and a first actual inter-frame image;
inputting the second training sample into a preset intermediate frame optical flow generation model so as to train the preset intermediate frame optical flow generation model;
judging whether the preset intermediate frame optical flow generation model meets a convergence condition or not by adopting a perception loss function;
and if the preset intermediate frame optical flow generation model meets the convergence condition, determining the intermediate frame optical flow generation model meeting the convergence condition as the intermediate frame optical flow generation model trained to be converged.
Further, the method as described above, the generating a target intermediate frame image from the two adjacent front and back frame images and the intermediate frame optical flow data includes:
acquiring a weight of the proportion of the intermediate frame optical flow data occupied by the front and rear adjacent frames of images according to the intermediate frame optical flow data;
and generating a target intermediate frame image through mapping operation according to the weight and the two adjacent frames of images.
A second aspect of the embodiments of the present invention provides a video image frame complementing apparatus, where the apparatus is located in an electronic device, and includes:
the coarse-grained optical flow generation module is used for extracting front and back adjacent two frames of images in the target video, and respectively inputting the front and back adjacent two frames of images into a coarse-grained optical flow generation model trained to be converged so as to output coarse-grained optical flow data corresponding to the front and back adjacent two frames of images;
the intermediate frame optical flow generation module is used for inputting pre-configured frame supplementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to be converged so as to output intermediate frame optical flow data;
and the intermediate frame image generation module is used for generating a target intermediate frame image according to the front and back adjacent two frame images and the intermediate frame optical flow data.
Further, in the apparatus described above, the coarse-grained optical flow generation module is specifically configured to:
extracting front and back adjacent two frames of images in a target video, and respectively inputting the images into a coarse-grained optical flow generation model trained to be convergent; and extracting corresponding image characteristic parameters from the front and rear adjacent two frames of images through the coarse-grained optical flow generation model, and outputting coarse-grained optical flow data corresponding to the front and rear adjacent two frames of images according to the image characteristic parameters.
Further, the apparatus as described above, the coarse-grained optical flow generation model comprises a codec network and a reversed convolution structure;
the coarse-grained optical flow generation module is specifically configured to, when extracting corresponding image feature parameters from the two previous and subsequent frames of images through the coarse-grained optical flow generation model and outputting coarse-grained optical flow data corresponding to the two previous and subsequent adjacent frames of images according to the image feature parameters:
extracting the image characteristic parameters from the front and back adjacent two frames of images through a coding network and coding to obtain corresponding coding results; inputting the image characteristic parameters into the turning convolution structure to obtain the alignment characteristic graphs of the front and rear adjacent frames of images; and inputting the alignment feature map and the coding result into a decoding network to output coarse-grained optical flow data corresponding to the front and rear adjacent two frames of images.
Further, the apparatus as described above, the coarse-grained optical flow data comprising coarse-grained bi-directional optical flow data; the intermediate frame optical flow generation model comprises a fusion function and an object motion track fitting function;
the intermediate-frame optical flow generation module is specifically configured to:
fusing pre-configured frame supplementing time data and the coarse-grained bidirectional optical flow data through the fusion function to output frame supplementing time bidirectional optical flow data corresponding to the frame supplementing time data; and inputting the frame supplementing time bidirectional optical flow data into the object motion track fitting function to output intermediate frame optical flow data.
Further, the apparatus as described above, the apparatus further comprising: a first training module;
the first training module is configured to obtain a first training sample, where the first training sample is a training sample corresponding to a coarse-grained optical flow generation model, and the first training sample includes: a previous frame image and a subsequent frame image; inputting the first training sample into a preset coarse-grained optical flow generation model to train the preset coarse-grained optical flow generation model; judging whether the preset coarse-grained light stream generation model meets a convergence condition or not by adopting a reconstruction loss function; and if the preset coarse-grained optical flow generation model meets the convergence condition, determining the coarse-grained optical flow generation model meeting the convergence condition as a coarse-grained optical flow generation model trained to be converged.
Further, the apparatus as described above, the apparatus further comprising: a second training module;
the second training module is to: acquiring a second training sample, wherein the second training sample is a training sample corresponding to the intermediate frame optical flow generation model, and the second training sample comprises: a first standard inter-frame image and a first actual inter-frame image; inputting the second training sample into a preset intermediate frame optical flow generation model so as to train the preset intermediate frame optical flow generation model; judging whether the preset intermediate frame optical flow generation model meets a convergence condition or not by adopting a perception loss function; and if the preset intermediate frame optical flow generation model meets the convergence condition, determining the intermediate frame optical flow generation model meeting the convergence condition as the intermediate frame optical flow generation model trained to be converged.
Further, in the apparatus as described above, the intermediate frame image generating module is specifically configured to:
acquiring a weight of the proportion of the intermediate frame optical flow data occupied by the front and rear adjacent frames of images according to the intermediate frame optical flow data;
and generating a target intermediate frame image through mapping operation according to the weight and the two adjacent frames of images.
A third aspect of the embodiments of the present invention provides a video image frame supplementing device, including: a memory and a processor;
the memory is configured to store instructions executable by the processor;
wherein the processor is configured to perform the video image frame supplementing method of any one of the first aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the method for complementing video frames according to any one of the first aspect is implemented.
Embodiments of the invention provide a video image frame supplementing method, device, equipment and storage medium. The method is applied to an electronic device and comprises: extracting two adjacent frames (a previous frame and a subsequent frame) from a target video and inputting them into a coarse-grained optical flow generation model trained to convergence, so as to output coarse-grained optical flow data corresponding to the two adjacent frames; inputting pre-configured frame supplementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to convergence, so as to output intermediate frame optical flow data; and generating a target intermediate frame image from the two adjacent frames and the intermediate frame optical flow data. The coarse-grained optical flow generation model trained to convergence outputs the corresponding coarse-grained optical flow data from the two adjacent frames of the target video, and the intermediate frame optical flow generation model trained to convergence outputs the intermediate frame optical flow data from the pre-configured frame supplementing time data and the coarse-grained optical flow data; the target intermediate frame image is then generated from the two adjacent frames and the intermediate frame optical flow data. Because the intermediate frame optical flow generation model trained to convergence fuses the time information of the frame supplementing time data with the motion information contained in the coarse-grained optical flow data, the generated intermediate frame optical flow data is more strongly correlated with the coarse-grained optical flow data, which improves the correlation between the target intermediate frame image and the preceding and following frames, and thus the continuity of the whole video.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a scene diagram of a video image frame interpolation method that can implement an embodiment of the present invention;
fig. 2 is a flowchart illustrating a video image frame interpolation method according to a first embodiment of the present invention;
fig. 3 is a flowchart illustrating a video image frame interpolation method according to a second embodiment of the present invention;
fig. 4 is a flowchart illustrating step 202 of a video image frame interpolation method according to a second embodiment of the present invention;
fig. 5 is a schematic training flow chart of a video image frame interpolation method according to a fourth embodiment of the present invention;
fig. 6 is a schematic training flow chart of a video image frame interpolation method according to a fifth embodiment of the present invention;
fig. 7 is a schematic structural diagram of a video image frame interpolation apparatus according to a sixth embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to a seventh embodiment of the present invention.
The above drawings illustrate specific embodiments of the invention, which are described in more detail below. The drawings and the description are not intended to limit the scope of the inventive concept in any way, but to illustrate it to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
First, terms related to embodiments of the present invention are explained:
frame: the single image picture is the minimum unit in the image animation, which is equivalent to each frame of lens on the motion picture film, one frame is a static picture, and continuous frames form the animation.
Frame supplementing: adding at least one frame image between two adjacent frames of images.
Optical flow: which refers to the apparent motion of the luminance pattern, the optical flow contains information of the motion of the object.
An application scenario of the video image frame interpolation method provided by the embodiment of the present invention is described below. As shown in fig. 1, 1 is a first electronic device, 2 is an adjacent subsequent frame image, 3 is an adjacent previous frame image, 4 is a second electronic device, and 5 is a third electronic device. The network architecture of the application scene corresponding to the video image frame supplementing method provided by the embodiment of the invention comprises the following steps: a first electronic device 1, a second electronic device 4 and a third electronic device 5. The second electronic device 4 stores a target video which needs to be subjected to frame complementing. The first electronic device 1 acquires the adjacent previous frame image 3 and the adjacent next frame image 2 of the target video from the second electronic device 4, and inputs the two adjacent previous and next frame images into a coarse-grained optical flow generation model trained to be convergent respectively so as to output coarse-grained optical flow data corresponding to the two adjacent previous and next frame images. And inputting the pre-configured frame supplementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to be converged to output the intermediate frame optical flow data. And generating a target intermediate frame image according to the front and back adjacent two frame images and the intermediate frame optical flow data. After the first electronic device 1 generates the target inter-frame image, the target inter-frame image may be output to the third electronic device 5. And acquiring the adjacent previous frame image 3 and the adjacent next frame image 2 of the target video in the second electronic equipment 4 by the third electronic equipment 5, and combining the target intermediate frame images to generate a video after frame supplement. Alternatively, the first electronic device 1 may combine the target inter-frame image with the target video to generate a video after frame interpolation.
In the video image frame supplementing method provided by the embodiment of the invention, the intermediate frame optical flow generation model trained to convergence fuses the time information of the frame supplementing time data with the motion information in the optical flow data, so that the generated intermediate frame optical flow data is more strongly correlated with the coarse-grained optical flow data; this improves the correlation between the target intermediate frame image and the preceding and following frames, and thus the continuity of the whole video.
The embodiments of the present invention will be described with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating a video image frame complementing method according to a first embodiment of the present invention, and as shown in fig. 2, an implementation subject of the embodiment of the present invention is a video image frame complementing device, which can be integrated in an electronic device. The video image frame interpolation method provided by the embodiment includes the following steps:
step S101, two adjacent frames of images in the target video are extracted and input into a coarse-grained optical flow generation model trained to be converged respectively, so as to output coarse-grained optical flow data corresponding to the two adjacent frames of images.
First, in this embodiment, the two adjacent frames refer to a previous frame image and the subsequent frame image adjacent to it, for example the first frame image (the previous frame) and the second frame image (the adjacent subsequent frame) of the target video. The target video is the video that needs frame supplementing.
The target video may be obtained from a database, obtained from another electronic device, or input manually, which is not limited in this embodiment. In this embodiment, the coarse-grained optical flow generation model trained to convergence is a trained model used to generate corresponding coarse-grained optical flow data from two adjacent frames. Coarse-grained optical flow data refers to optical flow data at a coarser logical granularity, i.e., flow that captures the larger-scale motion between the two frames.
Wherein, the coarse-grained optical flow generation model can be a network structure model, such as a U-net network structure. Wherein the U-net network structure is a convolutional network structure. Meanwhile, the network structure model can adopt artificial intelligence to carry out deep learning so as to obtain the learned network structure model.
Step S102, inputting the pre-configured frame-complementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to be converged to output intermediate frame optical flow data.
In this embodiment, the pre-configured frame supplementing time data depends on the frame supplementing requirement of the video, so it can be configured in advance according to that requirement; its value range is [0, 1]. For example, if the target video has 30 frames and 60 frames are required after frame supplementing, the pre-configured frame supplementing time data can be set to one half; if the target video needs to be supplemented to 90 frames, the pre-configured frame supplementing time data can be set to one third.
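For illustration only, the following Python sketch (not part of the patent; the function name is hypothetical) enumerates the evenly spaced frame supplementing times in [0, 1] for an N-fold frame-rate increase, matching the one-half and one-third examples above.

```python
# Hypothetical helper: frame supplementing times for an N-fold frame-rate increase.
def frame_supplement_times(multiplier: int) -> list:
    """For an N-fold increase, insert (N - 1) intermediate frames between each
    pair of adjacent frames, at evenly spaced times in (0, 1)."""
    return [k / multiplier for k in range(1, multiplier)]

print(frame_supplement_times(2))  # [0.5]               e.g. 30 frames -> 60 frames
print(frame_supplement_times(3))  # approx. [1/3, 2/3]  e.g. 30 frames -> 90 frames
```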
In this embodiment, the intermediate frame optical flow generation model trained to converge is a trained model, and is used to generate intermediate frame optical flow data according to the pre-configured frame-complementing time data and coarse-grained optical flow data.
Wherein, the intermediate frame optical flow generation model can be a network structure model. The intermediate frame optical flow generation model can adopt a convolution network structure with spatial and temporal characteristics.
In this embodiment, the intermediate-frame optical flow data is obtained by fusing time information in the pre-configured frame-complementing time data and motion information of the object motion trajectory in the coarse-grained optical flow data, which may also be referred to as spatial information of the object motion trajectory.
In step S103, a target intermediate frame image is generated from the front and rear adjacent two frame images and the intermediate frame optical flow data.
In this embodiment, the intermediate frame optical flow data is combined with the two adjacent frames; since the intermediate frame optical flow data carries the motion information of the intermediate frame, the target intermediate frame image can be generated by combining it with the two adjacent frames. The operation may specifically be a mapping (warping) operation applied to the two adjacent frames together with the intermediate frame optical flow data to generate the target intermediate frame image.
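A minimal end-to-end sketch of steps S101 to S103 is shown below; the callables coarse_flow_model, mid_flow_model and synthesize are hypothetical stand-ins for the two models trained to convergence and the mapping operation described above.

```python
# Hypothetical wiring of steps S101-S103; the three callables are assumptions.
def supplement_frame(frame0, frame1, t, coarse_flow_model, mid_flow_model, synthesize):
    f_0to1, f_1to0 = coarse_flow_model(frame0, frame1)        # step S101: coarse-grained optical flow
    f_t_to_0, f_t_to_1 = mid_flow_model(t, f_0to1, f_1to0)    # step S102: intermediate frame optical flow
    return synthesize(frame0, frame1, f_t_to_0, f_t_to_1, t)  # step S103: target intermediate frame image
```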
The embodiment of the invention provides a video image frame supplementing method, applied to an electronic device, comprising: extracting two adjacent frames (a previous frame and a subsequent frame) from a target video and inputting them into a coarse-grained optical flow generation model trained to convergence, so as to output coarse-grained optical flow data corresponding to the two adjacent frames; inputting pre-configured frame supplementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to convergence, so as to output intermediate frame optical flow data; and generating a target intermediate frame image from the two adjacent frames and the intermediate frame optical flow data. The coarse-grained optical flow generation model trained to convergence outputs the corresponding coarse-grained optical flow data from the two adjacent frames of the target video, and the intermediate frame optical flow generation model trained to convergence outputs the intermediate frame optical flow data from the pre-configured frame supplementing time data and the coarse-grained optical flow data; the target intermediate frame image is then generated from the two adjacent frames and the intermediate frame optical flow data. In the method provided by this embodiment, the intermediate frame optical flow generation model trained to convergence fuses the time information of the frame supplementing time data with the motion information contained in the coarse-grained optical flow data, so that the generated intermediate frame optical flow data is more strongly correlated with the coarse-grained optical flow data; this improves the correlation between the target intermediate frame image and the preceding and following frames, and thus the continuity of the whole video.
Fig. 3 is a schematic flow chart of a video image frame complementing method according to a second embodiment of the present invention, and as shown in fig. 3, the video image frame complementing method according to the present embodiment is further refined in each step based on the video image frame complementing method according to the first embodiment of the present invention. The video image frame interpolation method provided by the embodiment includes the following steps.
Wherein step 201-202 is a further refinement of step 101.
Step S201, two adjacent frames of images in the target video are extracted and input into a coarse-grained optical flow generation model trained to be converged respectively.
In this embodiment, the implementation manner of step 201 is similar to that of step 101 in the first embodiment of the present invention, and is not described in detail here.
Step S202, extracting corresponding image characteristic parameters from the two adjacent frames of images through the coarse-grained optical flow generation model, and outputting coarse-grained optical flow data corresponding to the two adjacent frames of images according to the image characteristic parameters.
In this embodiment, the image characteristic parameters include object motion vectors, pixel coordinates, and the like. The corresponding image characteristic parameters are extracted from the two adjacent frames so that the parameters of the two frames can be compared and the coarse-grained optical flow data corresponding to the two adjacent frames can be output. Therefore, in this embodiment, outputting the coarse-grained optical flow data corresponding to the two adjacent frames from the image feature parameters may specifically be: outputting the coarse-grained optical flow data corresponding to the two adjacent frames from the object motion vectors and the pixel coordinates.
Outputting the coarse-grained optical flow data from the image characteristic parameters makes the obtained coarse-grained optical flow data correspond more closely to the two adjacent frames.
It should be noted that step 203-204 is a further refinement of step 102. Also, optionally, the coarse-grained optical flow data comprises coarse-grained bi-directional optical flow data. The intermediate frame optical flow generation model comprises a fusion function and an object motion track fitting function.
Step S203, fusing the pre-configured frame-complementing time data and the coarse-grained bidirectional optical flow data through a fusion function to output frame-complementing time bidirectional optical flow data corresponding to the frame-complementing time data.
In this embodiment, the pre-configured frame-complementing time data is similar to step 102 in the first embodiment of the present invention, and is not described in detail here.
In this embodiment, the fusion function consists of two formulas (shown as figures in the original publication) that combine the pre-configured frame supplementing time data with the coarse-grained bidirectional optical flow data, where:
$\hat{F}_{t\to 0}$ denotes the optical flow from the adjacent subsequent frame image to the adjacent previous frame image in the frame-supplementing-time bidirectional optical flow data;
$\hat{F}_{t\to 1}$ denotes the optical flow from the adjacent previous frame image to the adjacent subsequent frame image in the frame-supplementing-time bidirectional optical flow data;
$t$ denotes the pre-configured frame supplementing time data;
$F_{0\to 1}$ denotes the optical flow from the adjacent previous frame image to the adjacent subsequent frame image in the coarse-grained bidirectional optical flow data; and
$F_{1\to 0}$ denotes the optical flow from the adjacent subsequent frame image to the adjacent previous frame image in the coarse-grained bidirectional optical flow data.
In this embodiment, the pre-configured frame supplementing time data and the coarse-grained bidirectional optical flow data are fused by the fusion function to output frame-supplementing-time bidirectional optical flow data corresponding to the frame supplementing time data. The frame-supplementing-time bidirectional optical flow data therefore carries time information, which later provides the basis for determining the weights when the target intermediate frame image is generated.
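The exact fusion formulas appear as figures in the original publication; as a sketch only, the linear combination below (a commonly used approximation, assumed here rather than taken from the patent) shows how the frame supplementing time $t$ can be fused with the coarse-grained bidirectional optical flow to obtain the frame-supplementing-time bidirectional optical flow.

```python
import torch

# Assumed linear fusion of the coarse-grained bidirectional flows with time t;
# the patent's own fusion formulas are given in its figures and may differ.
def fuse_flows(f_0to1: torch.Tensor, f_1to0: torch.Tensor, t: float):
    f_t_to_0 = -(1.0 - t) * t * f_0to1 + t * t * f_1to0           # flow pointing toward the previous frame
    f_t_to_1 = (1.0 - t) ** 2 * f_0to1 - t * (1.0 - t) * f_1to0   # flow pointing toward the subsequent frame
    return f_t_to_0, f_t_to_1
```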
Step S204, inputting the frame-complementing time bidirectional optical flow data into an object motion track fitting function to output intermediate frame optical flow data.
In this embodiment, the object motion trajectory fitting function may adopt an object motion trajectory fitting function in a Conv-LSTM network or another object motion trajectory fitting function, which is not limited in this embodiment. The Conv-LSTM network is a new network structure formed by adding convolution operation capable of extracting spatial features to an LSTM network capable of extracting time sequence features, and the network structure takes an object motion track fitting function as a core.
In this embodiment, the intermediate frame optical flow data is optimized by the object motion trajectory fitting function, so that the relevance between the subsequently generated target intermediate frame image and the two adjacent frames of images is higher.
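The patent only states that a Conv-LSTM-style fitting function may be used; the following sketch, with assumed layer sizes and wiring, illustrates how a convolutional recurrent cell can refine the frame-supplementing-time bidirectional optical flow into intermediate frame optical flow data.

```python
import torch
import torch.nn as nn

# Minimal ConvLSTM cell (sizes and surrounding wiring are assumptions).
class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        # A single convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, (h, c)

# Usage sketch: feed the fused bidirectional flows (4 channels) for each frame
# supplementing time step and read out a refined 4-channel flow field.
cell = ConvLSTMCell(in_ch=4, hid_ch=16)
to_flow = nn.Conv2d(16, 4, kernel_size=3, padding=1)
h = torch.zeros(1, 16, 64, 64)
c = torch.zeros(1, 16, 64, 64)
for flows_t in [torch.randn(1, 4, 64, 64)]:        # one entry per frame supplementing time
    h, (h, c) = cell(flows_t, (h, c))
refined_flows = to_flow(h)                         # intermediate frame optical flow data
```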
It should be noted that the steps 205-206 are further detailed for the step 103.
And step S205, acquiring the weight of the proportion of the intermediate frame optical flow data occupied by the front and rear adjacent frames of images according to the intermediate frame optical flow data.
In this embodiment, the weight of the proportion of the intermediate frame optical flow data occupied by the two adjacent frames of images in the intermediate frame optical flow data can be confirmed according to the time information included in the intermediate frame optical flow data. The mapping relationship between the time information and the weight value can be preset. For example, as the time information is closer to 0, the weight representing the adjacent previous frame image is higher. In contrast, as the time information is closer to 1, the weight representing the adjacent subsequent frame image is higher.
And step S206, generating a target intermediate frame image through mapping operation according to the weight and the two adjacent frames of images.
In this embodiment, the target intermediate frame image may be generated as follows: corresponding target previous and target subsequent frames are generated from the two adjacent frames and the intermediate frame optical flow data through the mapping operation, and the target intermediate frame image is obtained by fusing the target previous frame and the target subsequent frame according to the weights. The fusion specifically comprises: obtaining a weighted previous frame from the weight and the spatial information of the target previous frame, obtaining a weighted subsequent frame from the weight and the spatial information of the target subsequent frame, and fusing the weighted previous frame and the weighted subsequent frame to obtain the target intermediate frame image.
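A sketch of this warp-and-fuse step is given below. The grid_sample-based backward warping and the (1 - t, t) weighting are assumptions consistent with the weighting rule described in step S205, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Sample img (N,C,H,W) at positions displaced by flow (N,2,H,W)."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1   # normalise x to [-1, 1]
    grid_y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1   # normalise y to [-1, 1]
    grid = torch.stack([grid_x, grid_y], dim=-1)
    return F.grid_sample(img, grid, align_corners=True)

def synthesize_intermediate(frame0, frame1, f_t_to_0, f_t_to_1, t: float):
    warped0 = backward_warp(frame0, f_t_to_0)   # target previous-frame candidate
    warped1 = backward_warp(frame1, f_t_to_1)   # target subsequent-frame candidate
    w0, w1 = 1.0 - t, t                         # t closer to 0 -> previous frame weighted higher
    return (w0 * warped0 + w1 * warped1) / (w0 + w1)
```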
The embodiment of the invention provides a video image frame supplementing method, which extracts corresponding image characteristic parameters from two adjacent frames of images through a coarse-grained optical flow generation model and outputs coarse-grained optical flow data corresponding to the two adjacent frames of images according to the image characteristic parameters. And fusing the pre-configured frame supplementing time data and the coarse-grained bidirectional optical flow data through a fusion function to output frame supplementing time bidirectional optical flow data corresponding to the frame supplementing time data. Meanwhile, the frame-complementing time bidirectional optical flow data is input to an object motion trajectory fitting function to output intermediate frame optical flow data. And then, acquiring a weight value of the proportion of the front and rear adjacent frames of images in the optical flow data of the intermediate frame according to the optical flow data of the intermediate frame, and generating a target intermediate frame image through mapping operation according to the weight value and the front and rear adjacent frames of images.
The method provided by the embodiment of the invention fuses pre-configured frame supplementing time data and coarse-grained bidirectional optical flow data through a fusion function included in an intermediate frame optical flow generation model trained to be convergent, generates corresponding frame supplementing time bidirectional optical flow data by fusing time information and motion information, and then correspondingly optimizes the frame supplementing time bidirectional optical flow data through an object motion track fitting function so as to output the intermediate frame optical flow data. Because the intermediate frame optical flow data comprises the time information, the weight of the proportion of the front and rear adjacent frame images in the intermediate frame optical flow data can be obtained according to the intermediate frame optical flow data, so that the generated target intermediate frame image is combined with the content of the front and rear adjacent frame images according to the weight, the accuracy of the generated target intermediate frame image is higher, the relevance between the target intermediate frame image and the front and rear frames is improved, and the continuity of the whole video image is improved.
Fig. 4 is a flowchart illustrating step 202 of a video image frame interpolation method according to a second embodiment of the present invention. As shown in fig. 4, the video image frame interpolation method provided in this embodiment is a further refinement of step 202 on the basis of the video image frame interpolation method provided in the second embodiment of the present invention. The video image frame interpolation method provided by the embodiment includes the following steps.
Step S2021, extracting image feature parameters from two adjacent frames of images and encoding the image feature parameters to obtain corresponding encoding results. The coarse-grained optical flow generation model comprises a coding and decoding network and a reversed convolution structure.
In this embodiment, the image feature parameters are similar to those in step 202 in the second embodiment of the present invention, and are not described in detail herein.
The encoding network may be a network structure that converts image characteristic parameters and the like in an image into code information data. The flipped convolutional structure is a convolutional network structure.
Step S2022, inputting the image feature parameters into the flipped convolution structure to obtain the alignment feature map of the two adjacent frames.
In this embodiment, the flipped convolution structure is mainly used to obtain spatially related information of the two adjacent frames, such as motion information, from the image characteristic parameters, and to express it in the form of an alignment feature map. The specific process is as follows: the flipped convolution structure generates the alignment feature map from the spatial information of the previous frame image and the corresponding spatial information of the subsequent frame image. For example, the pixel coordinates of the previous frame image are aligned one by one with those of the subsequent frame image, and the object motion vectors of the previous frame image are aligned with the corresponding object motion vectors of the subsequent frame image, so as to generate the alignment feature map.
Step S2023, inputting the alignment feature map and the encoding result into a decoding network to output coarse-grained optical flow data corresponding to two adjacent frames of images.
The decoding network is a network structure for converting code information data into optical flow data.
In this embodiment, the alignment feature map and the encoding result are input to the decoding network, which converts this information into the coarse-grained optical flow data corresponding to the two adjacent frames. The alignment feature map generated by the flipped convolution structure makes the obtained change of image features between the previous frame image and the subsequent frame image more accurate, so that the subsequently generated coarse-grained optical flow data is more accurate.
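The structural sketch below illustrates this coding network / flipped-convolution / decoding network arrangement. Channel counts, layer depths, and the interpretation of the flipped convolution as a transposed convolution are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CoarseFlowNet(nn.Module):
    """Sketch of the coarse-grained optical flow generation model (assumed sizes)."""
    def __init__(self):
        super().__init__()
        # Encoding network: extracts and encodes image feature parameters.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Flipped-convolution branch: produces the alignment feature map.
        self.align = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoding network: consumes the encoding result and the alignment
        # feature map, and outputs 4-channel coarse-grained bidirectional flow.
        self.decode_up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.flow_head = nn.Conv2d(16 + 16, 4, 3, padding=1)

    def forward(self, frame0, frame1):
        x = torch.cat([frame0, frame1], dim=1)    # two adjacent frames, 3 + 3 channels
        code = self.encoder(x)                    # encoding result
        aligned = self.align(code)                # alignment feature map
        decoded = self.decode_up(code)
        flows = self.flow_head(torch.cat([decoded, aligned], dim=1))
        return flows[:, :2], flows[:, 2:]         # F_0->1 and F_1->0
```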
Fig. 5 is a schematic diagram of a training flow of a video image frame interpolation method according to a fourth embodiment of the present invention, and as shown in fig. 5, the video image frame interpolation method according to the present embodiment is a training flow that adds a coarse-grained optical flow generation model to the video image frame interpolation methods according to the first to third embodiments of the present invention. The video image frame interpolation method provided by the embodiment includes the following steps.
Step S301, a first training sample is obtained, wherein the first training sample is a training sample corresponding to the coarse-grained optical flow generation model. The first training sample comprises: a previous frame image and a subsequent frame image.
In this embodiment, the previous frame image may be, for example, the 1st frame of the target video, with the corresponding subsequent frame image being the 3rd frame of the target video; alternatively, the previous frame image may be the 1st frame and the corresponding subsequent frame image the 5th frame. This ensures that the actual intermediate frame image obtained later has a standard intermediate frame image in the target video to compare against. For example, when the previous frame image is the 1st frame of the target video and the corresponding subsequent frame image is the 3rd frame, the standard intermediate frame image is the 2nd frame. The generated actual intermediate frame image can thus be compared with the standard intermediate frame image to determine their similarity, which improves the accuracy of the generated actual intermediate frame image.
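A minimal sketch of assembling such first training samples is shown below (the function name and the fixed gap are assumptions); with gap=2 it pairs frame 1 with frame 3 and keeps frame 2 as the standard intermediate frame.

```python
# Hypothetical helper for building (previous frame, subsequent frame, standard
# intermediate frame) triplets from a decoded video.
def make_training_triplets(frames, gap: int = 2):
    triplets = []
    for i in range(len(frames) - gap):
        prev_frame = frames[i]
        next_frame = frames[i + gap]
        standard_mid = frames[i + gap // 2]   # ground-truth intermediate frame for comparison
        triplets.append((prev_frame, next_frame, standard_mid))
    return triplets
```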
Step S302, inputting the first training sample into a preset coarse-grained optical flow generation model so as to train the preset coarse-grained optical flow generation model.
In this embodiment, the first training sample is input into a preset coarse-grained optical flow generation model that needs to be trained, and the generated coarse-grained optical flow data is used to determine whether a convergence condition is satisfied through a reconstruction loss function.
Step S303, adopting a reconstruction loss function to judge whether the preset coarse-grained optical flow generation model meets a convergence condition.
In this embodiment, the reconstruction loss function is:
$L_{r1} = \left| \mathrm{warp\_op}(\text{frame}\,1,\, -F_{0\to 1}) - \text{frame}\,0 \right|$
$L_{r2} = \left| \mathrm{warp\_op}(\text{frame}\,0,\, -F_{1\to 0}) - \text{frame}\,1 \right|$
$L_{r} = \frac{1}{N} \sum_{x,y} \left( L_{r1}(x, y) + L_{r2}(x, y) \right)$
where $L_{r1}$ denotes the intermediate function for the subsequent frame, $L_{r2}$ denotes the intermediate function for the previous frame, $L_{r}$ denotes the reconstruction loss function, frame 1 denotes the subsequent frame image, frame 0 denotes the previous frame image, $F_{0\to 1}$ denotes the optical flow from the previous frame image to the subsequent frame image in the coarse-grained optical flow data, $F_{1\to 0}$ denotes the optical flow from the subsequent frame image to the previous frame image in the coarse-grained bidirectional optical flow data, $\mathrm{warp\_op}$ denotes the operation that generates an output frame image from optical flow data and an input frame image, $x$ and $y$ denote the pixel coordinates in the image, and $N$ denotes the number of image pixels.
In this embodiment, the reconstruction loss function is mainly used to monitor the generation effect of the coarse-grained optical flow data.
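As a sketch of the reconstruction loss above (the per-pixel normalisation follows the symbol definitions; warp_op is passed in rather than re-implemented here):

```python
import torch

def reconstruction_loss(frame0, frame1, f_0to1, f_1to0, warp_op):
    # L_r1 and L_r2: absolute reconstruction errors of the two adjacent frames.
    l_r1 = torch.abs(warp_op(frame1, -f_0to1) - frame0)
    l_r2 = torch.abs(warp_op(frame0, -f_1to0) - frame1)
    n = frame0.shape[-1] * frame0.shape[-2]               # number of image pixels N
    # Sum over pixel coordinates, average over batch and channels, divide by N.
    return (l_r1 + l_r2).sum(dim=(-1, -2)).mean() / n
```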
In step S304, if the preset coarse-grained optical flow generation model satisfies the convergence condition, the coarse-grained optical flow generation model satisfying the convergence condition is determined as the coarse-grained optical flow generation model trained to converge.
The convergence condition corresponding to the preset coarse-grained optical flow generation model is when the reconstruction loss function reaches the minimum value which can be optimized. At this time, the preset coarse-grained optical flow generation model meets the convergence condition, and the generated coarse-grained optical flow has high accuracy.
In this embodiment, by using the trained coarse-grained optical flow generation model, the coarse-grained optical flow data generated by the coarse-grained optical flow generation model can be made more accurate.
Fig. 6 is a schematic diagram of a training flow of a video image frame interpolation method according to a fifth embodiment of the present invention. As shown in fig. 6, the method of this embodiment adds a training flow for the intermediate frame optical flow generation model on the basis of the fourth embodiment of the present invention. The video image frame interpolation method provided by this embodiment includes the following steps.
Step S401, a second training sample is obtained, wherein the second training sample is a training sample corresponding to the intermediate frame optical flow generation model. The second training sample comprises: a first standard intermediate frame image and a first actual intermediate frame image.
In this embodiment, the first standard intermediate frame image is a standard intermediate frame image in the target video used for comparison. The first actual intermediate frame image is an actual intermediate frame image generated by the intermediate frame optical flow generation model during training.
Step S402, inputting a second training sample into the preset intermediate frame optical flow generation model to train the preset intermediate frame optical flow generation model.
In this embodiment, the training process is similar to that of step S302 in the fourth embodiment of the present invention, and is not described in detail herein.
And step S403, judging whether the preset intermediate frame optical flow generation model meets a convergence condition by adopting a perception loss function.
In this embodiment, the perceptual loss function is:
Lp = (1/N) Σ ‖φ(I) − φ(gt)‖²
wherein Lp represents the perceptual loss function, φ(I) represents the feature output of the first actual intermediate frame image, φ(gt) represents the feature output of the first standard intermediate frame image, and N represents the number of image pixels.
In this embodiment, the perceptual loss function mainly evaluates the two images at a higher semantic level, and the loss calculation generally includes two steps: high-level feature extraction and feature difference calculation. The high-level features are generally taken from a pre-trained deep neural network, for example, the output of the conv4_3 convolutional layer in a VGG16 network pre-trained on ImageNet, wherein ImageNet is a large-scale visual database used for visual object recognition research, the VGG16 network is a network structure, and conv4_3 is a convolutional layer in the VGG16 network.
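For illustration only, the following is a hedged PyTorch sketch of this perceptual loss, using the conv4_3 output of a VGG16 network pre-trained on ImageNet as the high-level feature. The torchvision layer index used to truncate the network at conv4_3, and the assumption that the inputs are already ImageNet-normalised RGB tensors, are assumptions of this sketch rather than details given in this embodiment.

```python
import torch
import torchvision

# Feature extractor: VGG16 pre-trained on ImageNet, truncated at conv4_3
# (index 22 of vgg16.features is an assumption of this sketch).
vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
phi = torch.nn.Sequential(*list(vgg.features[:22])).eval()
for p in phi.parameters():
    p.requires_grad_(False)  # the feature extractor is fixed during training


def perceptual_loss(actual_mid, standard_mid):
    """Mean squared difference between conv4_3 features of the two intermediate frames."""
    feat_actual = phi(actual_mid)      # phi(I): features of the first actual intermediate frame image
    feat_standard = phi(standard_mid)  # phi(gt): features of the first standard intermediate frame image
    return torch.mean((feat_actual - feat_standard) ** 2)
```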
In step S404, if the preset intermediate frame optical flow generation model satisfies the convergence condition, the intermediate frame optical flow generation model satisfying the convergence condition is determined as the intermediate frame optical flow generation model trained to converge.
The convergence condition of the preset intermediate frame optical flow generation model is that the perceptual loss function reaches the minimum value. When this condition is met, the intermediate frame optical flow generated by the model has high accuracy.
Optionally, in this embodiment, an inter-frame loss function may also be used to judge whether the preset intermediate frame optical flow generation model meets the convergence condition, so as to train the intermediate frame optical flow generation model. In this case, the training samples include: a second standard intermediate frame image and a second actual intermediate frame image.
The interframe loss function is:
Lf = (1/N) Σx,y |I(x,y) − gt(x,y)|
wherein Lf represents the inter-frame loss function, I represents the second actual intermediate frame image, gt represents the second standard intermediate frame image, x and y represent the coordinates of pixel points in the image, and N represents the number of image pixels.
The convergence condition of the preset intermediate frame optical flow generation model is that the inter-frame loss function reaches the minimum value. When this condition is met, the preset intermediate frame optical flow generation model is determined as the intermediate frame optical flow generation model trained to converge.
Meanwhile, in this embodiment, the inter-frame loss function also makes the intermediate frame optical flow generated by the trained intermediate frame optical flow generation model more accurate.
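For illustration only, a minimal sketch of this inter-frame loss under the same assumptions as the sketches above is:

```python
import torch


def interframe_loss(actual_mid, standard_mid):
    # |I(x, y) - gt(x, y)| averaged over the N pixels (and channels).
    return torch.mean(torch.abs(actual_mid - standard_mid))
```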
Fig. 7 is a schematic structural diagram of a video image frame complementing apparatus according to a sixth embodiment of the present invention, as shown in fig. 7, in this embodiment, the apparatus is located in an electronic device, and the video image frame complementing apparatus 500 includes:
the coarse-grained optical flow generation module 501 is configured to extract two adjacent frames of images in the target video, and input the two adjacent frames of images into a coarse-grained optical flow generation model trained to be convergent to output coarse-grained optical flow data corresponding to the two adjacent frames of images.
An intermediate frame optical flow generating module 502, configured to input the pre-configured frame-complementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generating model trained to converge, so as to output the intermediate frame optical flow data.
An intermediate frame image generating module 503, configured to generate a target intermediate frame image according to the two adjacent front and back frame images and the intermediate frame optical flow data.
The video image frame interpolation apparatus provided in this embodiment may implement the technical solution of the method embodiment shown in fig. 2, and the implementation principle and technical effect thereof are similar to those of the method embodiment shown in fig. 2, and are not described in detail herein.
Meanwhile, another embodiment of the video image frame complementing apparatus provided by the present invention further refines the video image frame complementing apparatus 500 on the basis of the previous embodiment.
Optionally, in this embodiment, the coarse-grained optical flow generating module 501 is specifically configured to:
and extracting front and back adjacent two frames of images in the target video, and respectively inputting the images into a coarse-grained optical flow generation model trained to be convergent.
Meanwhile, corresponding image characteristic parameters are extracted from the front and the back adjacent two frames of images through the coarse-grained optical flow generation model, and coarse-grained optical flow data corresponding to the front and the back adjacent two frames of images are output according to the image characteristic parameters.
Optionally, in this embodiment, the coarse-grained optical flow generation model includes a coding and decoding network and a turning convolution structure.
The coarse-grained optical flow generation module 501 is specifically configured to, when extracting corresponding image feature parameters from two previous and next frames of images through the coarse-grained optical flow generation model and outputting coarse-grained optical flow data corresponding to two previous and next frames of images according to the image feature parameters:
and extracting image characteristic parameters from the front and back adjacent two frames of images through a coding network and coding to obtain a corresponding coding result.
Meanwhile, inputting the image characteristic parameters into a turning convolution structure to obtain the alignment characteristic graphs of the two adjacent frames of images.
And inputting the alignment feature map and the coding result into a decoding network to output coarse-grained optical flow data corresponding to two adjacent frames of images.
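For illustration only, the following sketch shows one possible overall shape of such a coarse-grained optical flow generation model. The layer sizes are arbitrary, and the turning convolution structure is stood in for by a transposed convolution; both are assumptions of this sketch rather than the structure defined in this embodiment.

```python
import torch
import torch.nn as nn


class CoarseFlowNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Coding network: extracts image characteristic parameters from the
        # two concatenated adjacent frames (2 x 3 = 6 input channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Stand-in for the turning convolution structure that yields the
        # alignment feature map of the two adjacent frames.
        self.align = nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1)
        # Decoding network: combines the coding result and the alignment
        # features and outputs 4 channels of coarse-grained bidirectional
        # optical flow (F0->1 and F1->0, two channels each).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 + 64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 4, 3, padding=1),
        )

    def forward(self, frame0, frame1):
        # Frame height and width are assumed divisible by 4 in this sketch.
        x = torch.cat([frame0, frame1], dim=1)
        code = self.encoder(x)                       # coding result
        aligned = self.align(code)                   # alignment feature map
        # Bring the coding result to the alignment resolution before fusing.
        code_up = nn.functional.interpolate(code, size=aligned.shape[-2:])
        flows = self.decoder(torch.cat([code_up, aligned], dim=1))
        return flows[:, :2], flows[:, 2:]            # F0->1, F1->0
```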
Optionally, in this embodiment, the coarse-grained optical flow data includes coarse-grained bidirectional optical flow data. The intermediate frame optical flow generation model comprises a fusion function and an object motion track fitting function.
The intermediate-frame optical flow generation module 502 is specifically configured to:
and fusing the pre-configured frame supplementing time data and the coarse-grained bidirectional optical flow data through a fusion function to output frame supplementing time bidirectional optical flow data corresponding to the frame supplementing time data.
Meanwhile, the frame-complementing time bidirectional optical flow data is input to an object motion trajectory fitting function to output intermediate frame optical flow data.
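For illustration only, the following sketch collapses the fusion function and the object motion trajectory fitting function into the linear-motion approximation commonly used in optical-flow-based frame interpolation; the actual functions of this embodiment may differ.

```python
def fuse_time_and_flow(t, flow_0to1, flow_1to0):
    """Combine the frame-supplementing time t in (0, 1) with the coarse-grained
    bidirectional optical flow to approximate the intermediate frame optical flow.

    Assumes objects move approximately linearly between the two adjacent frames."""
    flow_t_to_0 = -(1.0 - t) * t * flow_0to1 + t * t * flow_1to0
    flow_t_to_1 = (1.0 - t) ** 2 * flow_0to1 - t * (1.0 - t) * flow_1to0
    return flow_t_to_0, flow_t_to_1
```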
Optionally, in this embodiment, the video image frame interpolation apparatus 500 further includes: a first training module.
The first training module is used for acquiring a first training sample, the first training sample is a training sample corresponding to the coarse-grained optical flow generation model, and the first training sample comprises: a previous frame image and a subsequent frame image.
Meanwhile, inputting the first training sample into a preset coarse-grained optical flow generation model so as to train the preset coarse-grained optical flow generation model.
And then, judging whether the preset coarse-grained optical flow generation model meets the convergence condition or not by adopting a reconstruction loss function.
And if the preset coarse-grained optical flow generation model meets the convergence condition, determining the coarse-grained optical flow generation model meeting the convergence condition as the coarse-grained optical flow generation model trained to be converged.
Optionally, in this embodiment, the video image frame interpolation apparatus 500 further includes: a second training module.
The second training module is used for obtaining a second training sample, the second training sample is a training sample corresponding to the intermediate frame optical flow generation model, and the second training sample comprises: a first standard intermediate frame image and a first actual intermediate frame image.
Meanwhile, inputting a second training sample into the preset intermediate frame optical flow generation model to train the preset intermediate frame optical flow generation model.
And then, judging whether the preset intermediate frame optical flow generation model meets the convergence condition or not by adopting a perception loss function.
And if the preset intermediate frame optical flow generation model meets the convergence condition, determining the intermediate frame optical flow generation model meeting the convergence condition as the intermediate frame optical flow generation model trained to be converged.
Optionally, in this embodiment, the intermediate frame image generating module 503 is specifically configured to:
and acquiring the weight of the proportion of the front and rear adjacent frames of images in the intermediate frame optical flow data according to the intermediate frame optical flow data.
And generating a target intermediate frame image through mapping operation according to the weight and the two adjacent frames of images.
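For illustration only, the following sketch generates the target intermediate frame image by mapping (warping) the two adjacent frames with the intermediate frame optical flow data and blending them with time-based weights; the weighting scheme is an assumption of this sketch, and backward_warp refers to the reconstruction-loss sketch above.

```python
def synthesize_intermediate_frame(frame0, frame1, flow_t_to_0, flow_t_to_1, t):
    warped0 = backward_warp(frame0, flow_t_to_0)  # preceding frame mapped to time t
    warped1 = backward_warp(frame1, flow_t_to_1)  # following frame mapped to time t
    w0, w1 = (1.0 - t), t                         # assumed proportion weights of the two frames
    return (w0 * warped0 + w1 * warped1) / (w0 + w1)
```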
The video image frame interpolation apparatus provided in this embodiment may implement the technical solutions of the method embodiments shown in fig. 2 to 6, and the implementation principles and technical effects thereof are similar to those of the method embodiments shown in fig. 2 to 6, and are not described in detail herein.
The invention also provides an electronic device and a computer-readable storage medium according to the embodiments of the invention.
Fig. 8 is a schematic structural diagram of an electronic device according to a seventh embodiment of the invention. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are exemplary only and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: a processor 601, a memory 602. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device.
The memory 602 is a non-transitory computer readable storage medium provided by the present invention. The memory stores instructions executable by the at least one processor, so that the at least one processor executes the video image frame complementing method provided by the invention. The non-transitory computer-readable storage medium of the present invention stores computer instructions for causing a computer to execute the video image frame interpolation method provided by the present invention.
The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the video image frame interpolation method in the embodiment of the present invention (for example, the coarse-grained optical flow generation module 501, the intermediate frame optical flow generation module 502, and the intermediate frame image generation module 503 shown in fig. 7). The processor 601 executes various functional applications and data processing of the server by running non-transitory software programs, instructions and modules stored in the memory 602, that is, implementing the video image frame complementing method in the above method embodiment.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the embodiments of the invention following, in general, the principles of the embodiments of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the embodiments of the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of embodiments of the invention being indicated by the following claims.
It is to be understood that the embodiments of the present invention are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of embodiments of the invention is limited only by the appended claims.

Claims (8)

1. A video image frame complementing method is applied to an electronic device, and comprises the following steps:
extracting front and rear adjacent frames of images in a target video, and respectively inputting the front and rear adjacent frames of images into a coarse-grained optical flow generation model trained to be convergent so as to output coarse-grained optical flow data corresponding to the front and rear adjacent frames of images;
inputting the pre-configured frame supplementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to be converged to output intermediate frame optical flow data;
generating a target intermediate frame image according to the front and back adjacent two frame images and the intermediate frame optical flow data;
the method for extracting front and rear adjacent frames of images in a target video and respectively inputting the front and rear adjacent frames of images into a coarse-grained optical flow generation model trained to be convergent so as to output coarse-grained optical flow data corresponding to the front and rear adjacent frames of images comprises the following steps:
extracting front and back adjacent two frames of images in a target video, and respectively inputting the images into a coarse-grained optical flow generation model trained to be convergent;
extracting corresponding image characteristic parameters from the front and rear adjacent two frames of images through the coarse-grained optical flow generation model, and outputting coarse-grained optical flow data corresponding to the front and rear adjacent two frames of images according to the image characteristic parameters;
the coarse-grained optical flow generation model comprises a coding and decoding network and a turning convolution structure;
the extracting, by the coarse-grained optical flow generation model, corresponding image feature parameters from the front and rear two frames of images, and outputting coarse-grained optical flow data corresponding to the front and rear two adjacent frames of images according to the image feature parameters, includes:
extracting the image characteristic parameters from the front and back adjacent two frames of images through a coding network and coding to obtain corresponding coding results;
inputting the image characteristic parameters into the turning convolution structure to obtain the alignment characteristic graphs of the front and rear adjacent frames of images;
and inputting the alignment feature map and the coding result into a decoding network to output coarse-grained optical flow data corresponding to the front and rear adjacent two frames of images.
2. The method of claim 1, wherein the coarse-grained optical flow data comprises coarse-grained bi-directional optical flow data; the intermediate frame optical flow generation model comprises a fusion function and an object motion track fitting function;
the inputting the pre-configured frame-complementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to converge to output intermediate frame optical flow data comprises:
fusing pre-configured frame supplementing time data and the coarse-grained bidirectional optical flow data through the fusion function to output frame supplementing time bidirectional optical flow data corresponding to the frame supplementing time data;
and inputting the frame supplementing time bidirectional optical flow data into the object motion track fitting function to output intermediate frame optical flow data.
3. The method according to claim 1, wherein before the input into the coarse-grained optical flow generation model trained to converge respectively to output the coarse-grained optical flow data corresponding to the two adjacent frames of images, the method further comprises:
obtaining a first training sample, wherein the first training sample is a training sample corresponding to a coarse-grained optical flow generation model, and the first training sample comprises: a previous frame image and a subsequent frame image;
inputting the first training sample into a preset coarse-grained optical flow generation model to train the preset coarse-grained optical flow generation model;
judging whether the preset coarse-grained light stream generation model meets a convergence condition or not by adopting a reconstruction loss function;
and if the preset coarse-grained optical flow generation model meets the convergence condition, determining the coarse-grained optical flow generation model meeting the convergence condition as a coarse-grained optical flow generation model trained to be converged.
4. The method of claim 1, wherein before inputting the preconfigured complement temporal data and the coarse-grained optical flow data into an inter-frame optical flow generation model trained to converge to output inter-frame optical flow data, further comprising:
acquiring a second training sample, wherein the second training sample is a training sample corresponding to the intermediate frame optical flow generation model, and the second training sample comprises: a first standard intermediate frame image and a first actual intermediate frame image;
inputting the second training sample into a preset intermediate frame optical flow generation model so as to train the preset intermediate frame optical flow generation model;
judging whether the preset intermediate frame optical flow generation model meets a convergence condition or not by adopting a perception loss function;
and if the preset intermediate frame optical flow generation model meets the convergence condition, determining the intermediate frame optical flow generation model meeting the convergence condition as the intermediate frame optical flow generation model trained to be converged.
5. The method of claim 1, wherein generating a target inter-frame image from the two consecutive frame images and the inter-frame optical flow data comprises:
acquiring a weight of the proportion of the intermediate frame optical flow data occupied by the front and rear adjacent frames of images according to the intermediate frame optical flow data;
and generating a target intermediate frame image through mapping operation according to the weight and the two adjacent frames of images.
6. An apparatus for frame-filling a video image, the apparatus being located in an electronic device, comprising:
the coarse-grained optical flow generation module is used for extracting front and back adjacent two frames of images in the target video, and respectively inputting the front and back adjacent two frames of images into a coarse-grained optical flow generation model trained to be converged so as to output coarse-grained optical flow data corresponding to the front and back adjacent two frames of images;
the intermediate frame optical flow generation module is used for inputting pre-configured frame supplementing time data and the coarse-grained optical flow data into an intermediate frame optical flow generation model trained to be converged so as to output intermediate frame optical flow data;
the intermediate frame image generation module is used for generating a target intermediate frame image according to the front and back adjacent two frame images and the intermediate frame optical flow data;
the coarse-grained optical flow generation module is specifically configured to:
extracting front and back adjacent two frames of images in a target video, and respectively inputting the images into a coarse-grained optical flow generation model trained to be convergent; extracting corresponding image characteristic parameters from the front and rear adjacent two frames of images through the coarse-grained optical flow generation model, and outputting coarse-grained optical flow data corresponding to the front and rear adjacent two frames of images according to the image characteristic parameters;
the coarse-grained optical flow generation model comprises a coding and decoding network and a turning convolution structure;
the coarse-grained optical flow generation module is specifically configured to, when extracting corresponding image feature parameters from the two previous and subsequent frames of images through the coarse-grained optical flow generation model and outputting coarse-grained optical flow data corresponding to the two previous and subsequent adjacent frames of images according to the image feature parameters:
extracting the image characteristic parameters from the front and back adjacent two frames of images through a coding network and coding to obtain corresponding coding results; inputting the image characteristic parameters into the turning convolution structure to obtain the alignment characteristic graphs of the front and rear adjacent frames of images; and inputting the alignment feature map and the coding result into a decoding network to output coarse-grained optical flow data corresponding to the front and rear adjacent two frames of images.
7. A video image frame complementing apparatus, comprising: a memory, a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the video image frame interpolation method of any one of claims 1 to 5.
8. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, are configured to implement the video image frame complementing method according to any one of claims 1 to 5.
CN202010720883.9A 2020-07-24 2020-07-24 Video image frame supplementing method, device and equipment and storage medium Active CN112040311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010720883.9A CN112040311B (en) 2020-07-24 2020-07-24 Video image frame supplementing method, device and equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010720883.9A CN112040311B (en) 2020-07-24 2020-07-24 Video image frame supplementing method, device and equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112040311A CN112040311A (en) 2020-12-04
CN112040311B true CN112040311B (en) 2021-10-26

Family

ID=73582988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010720883.9A Active CN112040311B (en) 2020-07-24 2020-07-24 Video image frame supplementing method, device and equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112040311B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066478A (en) * 2020-12-07 2021-07-02 泰州市朗嘉馨网络科技有限公司 Dialect recognition system based on model training
CN112584234B (en) * 2020-12-09 2023-06-16 广州虎牙科技有限公司 Frame supplementing method and related device for video image
CN112804561A (en) * 2020-12-29 2021-05-14 广州华多网络科技有限公司 Video frame insertion method and device, computer equipment and storage medium
CN114066730B (en) * 2021-11-04 2022-10-28 西北工业大学 Video frame interpolation method based on unsupervised dual learning
CN116033225A (en) * 2023-03-16 2023-04-28 深圳市微浦技术有限公司 Digital signal processing method, device, equipment and storage medium based on set top box

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109068174A (en) * 2018-09-12 2018-12-21 上海交通大学 Video frame rate upconversion method and system based on cyclic convolution neural network
CN109672886A (en) * 2019-01-11 2019-04-23 京东方科技集团股份有限公司 A kind of picture frame prediction technique, device and head show equipment
CN110267098A (en) * 2019-06-28 2019-09-20 连尚(新昌)网络科技有限公司 A kind of method for processing video frequency and terminal
CN110351511A (en) * 2019-06-28 2019-10-18 上海交通大学 Video frame rate upconversion system and method based on scene depth estimation
US20200160495A1 (en) * 2017-05-01 2020-05-21 Gopro, Inc. Apparatus and methods for artifact detection and removal using frame interpolation techniques
CN111277826A (en) * 2020-01-22 2020-06-12 腾讯科技(深圳)有限公司 Video data processing method and device and storage medium
CN111405316A (en) * 2020-03-12 2020-07-10 北京奇艺世纪科技有限公司 Frame insertion method, electronic device and readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110324664B (en) * 2019-07-11 2021-06-04 南开大学 Video frame supplementing method based on neural network and training method of model thereof
CN110798630B (en) * 2019-10-30 2020-12-29 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN110933497B (en) * 2019-12-10 2022-03-22 Oppo广东移动通信有限公司 Video image data frame insertion processing method and related equipment
CN111311490B (en) * 2020-01-20 2023-03-21 陕西师范大学 Video super-resolution reconstruction method based on multi-frame fusion optical flow
CN111327926B (en) * 2020-02-12 2022-06-28 北京百度网讯科技有限公司 Video frame insertion method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112040311A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN112040311B (en) Video image frame supplementing method, device and equipment and storage medium
US11610122B2 (en) Generative adversarial neural network assisted reconstruction
US11625613B2 (en) Generative adversarial neural network assisted compression and broadcast
CN112995652B (en) Video quality evaluation method and device
US11641446B2 (en) Method for video frame interpolation, and electronic device
US20230281833A1 (en) Facial image processing method and apparatus, device, and storage medium
CN113379877A (en) Face video generation method and device, electronic equipment and storage medium
CN115471658A (en) Action migration method and device, terminal equipment and storage medium
US20230237713A1 (en) Method, device, and computer program product for generating virtual image
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN117576248B (en) Image generation method and device based on gesture guidance
CN116757923B (en) Image generation method and device, electronic equipment and storage medium
CN111488886B (en) Panoramic image significance prediction method, system and terminal for arranging attention features
CN113822114A (en) Image processing method, related equipment and computer readable storage medium
CN113542758A (en) Generating antagonistic neural network assisted video compression and broadcast
CN116977392A (en) Image generation method, device, electronic equipment and storage medium
CN115035219A (en) Expression generation method and device and expression generation model training method and device
Jiang et al. Analyzing and Optimizing Virtual Reality Classroom Scenarios: A Deep Learning Approach.
CN115761565B (en) Video generation method, device, equipment and computer readable storage medium
CN115984094B (en) Face safety generation method and equipment based on multi-loss constraint visual angle consistency
US20240169701A1 (en) Affordance-based reposing of an object in a scene
Shah et al. Generative Adversarial Networks for Inpainting Occluded Face Images
Qin Single-Image Animation by Cumulative Training and Motion Residual Prediction
Chen et al. LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant