CN112995433A - Time sequence video generation method and device, computing equipment and storage medium - Google Patents

Time sequence video generation method and device, computing equipment and storage medium

Info

Publication number
CN112995433A
CN112995433A
Authority
CN
China
Prior art keywords
frame
network
generator network
image
level generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110169891.3A
Other languages
Chinese (zh)
Other versions
CN112995433B (en)
Inventor
Sun Teng (孙腾)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp ltd
Original Assignee
Beijing Moviebook Technology Corp ltd
Application filed by Beijing Moviebook Technology Corp ltd
Priority to CN202110169891.3A
Publication of CN112995433A
Application granted
Publication of CN112995433B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/04 Synchronising
    • H04N 5/06 Generation of synchronising signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/04 Synchronising
    • H04N 5/08 Separation of synchronising signals from picture signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/76 Television signal recording
    • H04N 5/765 Interface circuits between an apparatus for recording and another apparatus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a time-series video generation method and apparatus, a computing device, and a storage medium. The method comprises the following steps: extracting a semantic segmentation map of each frame image of each video clip in a training data set, and calculating optical flow estimation maps between consecutive frames; training a multi-level generator network with the semantic segmentation maps and the inter-frame optical flow estimation maps to obtain a trained multi-level generator network; and inputting each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video. The apparatus comprises a semantic segmentation map extraction module, a training module, and a time-series video generation module. The computing device comprises a memory, a processor, and a computer program stored in the memory and executable by the processor, the processor implementing the above method when executing the computer program. The storage medium stores a computer program which, when executed by a processor, implements the above method.

Description

Time sequence video generation method and device, computing equipment and storage medium
Technical Field
The present application relates to the field of time-series video generation, and in particular to techniques such as video feature extraction, temporal analysis, and image generation.
Background
The rapid development of neural networks in the field of Artificial Intelligence (AI) has promoted the cross-fusion of information from multiple domains such as images, text, and speech, and users' expectations for image and video processing technology keep rising. Virtual reality application scenarios are becoming more and more complex, and data under specified conditions needs to be visualized and simulated by computers, so the emergence of generative adversarial network (GAN) technology has driven the development of photo-realistic rendering and generation of virtual scenes. Against this background, GAN technology can generate images of specified content from conditional inputs to a level at which it is difficult for the naked eye to distinguish real from fake, but there are few generation schemes for video sequences of consecutive frames. The commonly used pix2pix and pix2pixHD image translation algorithms are designed only for translating static pictures and do not model the temporal dimension; if they are applied directly to video translation, the resulting frames are discontinuous, so they cannot be used for video generation.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least to partially solve or mitigate them.
According to an aspect of the present application, there is provided a time-series video generation method, including:
extracting a semantic segmentation map of each frame image of each video clip in the training data set, and calculating optical flow estimation maps between consecutive frames;
training a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, wherein the multi-level generator network G'_{N+1} has the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures;
and inputting each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
Optionally, the loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
Optionally, the content consistency loss is obtained by:
extracting features from the image output by the multi-level generator network and from the ground truth using the multi-level discriminator network to obtain two feature maps, wherein the ground truth is each frame image of each video clip in the training data set;
and calculating the error between the two feature maps and taking the error as the content consistency loss.
Optionally, the feature matching loss is obtained by:
extracting features from the image output by the multi-level generator network and from the ground truth using VGG16 to obtain two feature maps, wherein the ground truth is each frame image of each video clip in the training data set;
and calculating the error between the two feature maps and taking the error as the feature matching loss.
Optionally, the specific method for training the multi-level generator network is as follows: generator networks of different spatial sizes are trained separately, and the number of frames participating in training is gradually increased along the time dimension.
According to another aspect of the present application, there is provided a time-series video generating apparatus including:
a semantic segmentation map extraction module configured to extract a semantic segmentation map of each frame image of each video clip in the training data set and calculate optical flow estimation maps between consecutive frames;
a training module configured to train a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, the multi-level generator network G'_{N+1} having the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures; and
and a time-series video generation module configured to input each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
Optionally, the loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
Optionally: generator networks of different sizes are trained respectively in space, and the dimensionality of the frames participating in training is gradually increased in the time dimension.
According to a third aspect of the present application, there is provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of the present application when executing the computer program.
According to a fourth aspect of the present application, a storage medium, preferably a non-volatile readable storage medium, is provided, having stored therein a computer program which, when executed by a processor, implements the method described herein.
The time-series video generation method, apparatus, computing device, and storage medium of the present application generate a video sequence with realistic content from the specified conditional input. They constitute a general video image translation and generation framework that can generate various types of video, ensuring both the realism of each generated frame and the continuity of picture changes between consecutive frames.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow chart diagram of a method of time-series video generation according to one embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of a first level generator subnetwork in a multi-level generator network according to one embodiment of the present application;
FIG. 3 is a schematic diagram of a method of training a multi-level generator network according to one embodiment of the present application;
fig. 4 is a schematic structural diagram of a time-series video generation apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a computing device according to one embodiment of the present application;
FIG. 6 is a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
Commonly used image translation and generation algorithms include pix2pix and pix2pixHD. The present method improves on them: a continuous sequence of semantic maps is taken as input, the semantic information in the maps and the temporal dynamics are modeled, and the video content is generated by rendering.
Fig. 1 is a schematic flow chart of a method of time-series video generation according to an embodiment of the present application. The time-series video generating method may generally include the following steps S1 to S3.
Step S1: extract the semantic segmentation map of each frame image of each video clip in the training data set, and calculate the optical flow estimation maps between consecutive frames.
For each video clip in the training data set, the semantic segmentation map of each frame image is extracted and the optical flow estimation maps between consecutive frames are calculated; these serve as the model training input, while each frame image of the original video serves as the ground truth.
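As an illustration of this preprocessing step, the following is a minimal sketch assuming OpenCV's Farneback estimator for the inter-frame optical flow and a placeholder segment() function standing in for the semantic segmentation model; the patent does not name either component.

```python
# A minimal sketch of Step S1, assuming OpenCV's Farneback estimator for the inter-frame
# optical flow and a placeholder segment() function for the semantic segmentation model.
import cv2
import numpy as np

def segment(frame_bgr: np.ndarray) -> np.ndarray:
    """Hypothetical semantic segmentation; returns a per-pixel label map."""
    raise NotImplementedError  # replace with a real segmentation network

def prepare_clip(frames_bgr: list[np.ndarray]):
    """Build training inputs for one clip: (segmentation map, flow from previous frame, ground-truth frame)."""
    samples = []
    prev_gray = None
    for frame in frames_bgr:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = None
        if prev_gray is not None:
            # Dense optical flow between the previous and current frame (H x W x 2).
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
        samples.append({"seg": segment(frame), "flow": flow, "gt": frame})
        prev_gray = gray
    return samples
```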
Step S2: train the multi-level generator network G'_{N+1} with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network G'_{N+1}, where the multi-level generator network G'_{N+1} has the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N is an integer with N >= 1.
The multi-level generator network comprises generator sub-networks that are nested with each other in sequence, with G'_1 being the first-level generator sub-network G_1 itself. As shown in Fig. 2, the first-level generator network G'_1 is the first-level generator sub-network G_1; the first-level generator sub-network G_1 is embedded in the second-level generator sub-network G_2 to form the second-level generator network G'_2; the second-level generator network G'_2 is embedded in the third-level generator sub-network G_3 to form the third-level generator network G'_3; and so on, until the N-th-level generator network G'_N is embedded in the (N+1)-th-level generator sub-network G_{N+1} to form the (N+1)-th-level generator network G'_{N+1}.
For the two-level generator network G'_2 (the case N = 1), the structure is G'_2 = G_2^{down} + G'_1 + G_2^{up}, where G'_1 (i.e. G_1) is a generator network. The principle is to split the generator sub-network G_2 in the middle into a down-sampling part G_2^{down} and an up-sampling part G_2^{up}, placed respectively before and after the generator network G'_1. The output of G_2^{down} serves as the input of G'_1; the output of G_2^{down} and the output of G'_1 are superimposed, the superimposed result serves as the input of G_2^{up}, and the output of G_2^{up} is the output of the two-level generator network G'_2.
For the three-level generator network G'_3 (N = 2), the structure is G'_3 = G_3^{down} + G'_2 + G_3^{up}; the principle is the same as for the two-level generator network G'_2.
By analogy, the (N+1)-level generator network G'_{N+1} has the structure G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}, following the same principle.
In this embodiment, the generator network G'_1 is also called the global perception network and the generator network G'_2 is also called the local enhancement network; the global perception network is integrated as the middle-layer structure of the local enhancement network. The resolution of the global perception network G'_1 is 256 px, the resolution of the local enhancement network G'_2 is twice that of the global perception network G'_1, and in general the input image of each level of generator sub-network in the multi-level generator network has twice the resolution of that of the previous level.
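The nesting just described can be sketched in PyTorch roughly as follows; the layer counts, channel widths, and class names are illustrative assumptions rather than the patent's exact architecture, but the data flow (the output of G_2^{down} feeds G'_1, the two outputs are superimposed, and the result feeds G_2^{up}) follows the description above.

```python
# A rough sketch of the nested coarse-to-fine generator G'_{N+1} = G_{N+1}^down + G'_N + G_{N+1}^up.
# Layer counts, channel widths, and class names are assumptions for illustration only.
import torch
import torch.nn as nn

class LevelWrapper(nn.Module):
    """Wraps an inner generator G'_N with the down/up-sampling halves of G_{N+1}."""
    def __init__(self, inner: nn.Module, in_ch=3, feat_ch=32, out_ch=3):
        super().__init__()
        self.down = nn.Sequential(                       # G_{N+1}^down: halves the resolution
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1),
            nn.InstanceNorm2d(feat_ch), nn.ReLU(inplace=True))
        self.inner = inner                               # G'_N, operating at half resolution
        self.up = nn.Sequential(                         # G_{N+1}^up: restores the resolution
            nn.ConvTranspose2d(feat_ch, feat_ch, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(feat_ch), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, out_ch, 3, padding=1), nn.Tanh())

    def forward(self, x):
        d = self.down(x)
        fused = d + self.inner(d)      # superimpose the outputs of G_{N+1}^down and G'_N
        return self.up(fused)

def base_generator(ch=32, n_blocks=3):
    """G'_1: a small convolutional generator working on feature maps (illustrative)."""
    blocks = []
    for _ in range(n_blocks):
        blocks += [nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*blocks)

# Two-level generator G'_2 = G_2^down + G'_1 + G_2^up; deeper levels wrap g2_prime the same way.
g2_prime = LevelWrapper(inner=base_generator(ch=32))
out = g2_prime(torch.randn(1, 3, 512, 512))
```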
During training, the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network G'_{N+1} has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
The multi-level discriminator network is similar in structure to the multi-level generator network: D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network.
The residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures.
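A minimal sketch of such a residual block with 3D convolutions, assuming a PyTorch implementation; the kernel size and normalization are not specified by the patent and are chosen here for illustration.

```python
# A minimal sketch of a residual block whose convolutions are 3D, so features are extracted
# jointly along the spatial (H, W) and temporal (T) axes. Kernel size and normalization are
# assumptions; the patent only states that the residual convolution layers use 3D convolution.
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),  # (T, H, W) convolution
            nn.InstanceNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels))

    def forward(self, x):            # x: (B, C, T, H, W) — consecutive frames stacked along T
        return x + self.body(x)      # residual connection

# Example: 4 consecutive 64-channel feature maps of size 128x128
y = ResBlock3D(64)(torch.randn(1, 64, 4, 128, 128))
```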
The loss function of the multi-level generator network consists of several parts, with the aim of strengthening the constraints on the multi-level generator network. The first part is the image distribution loss, which guarantees the realism of the generated images; the second part is the temporal loss, which guarantees the temporal consistency of the generated video; the third part is the optical flow loss, which guarantees the correctness of the estimated optical flow; the fourth part is the feature matching loss; and the fifth part is the content consistency loss.
The content consistency loss is obtained by separately feeding the image output by the multi-level generator network (the generated sample) and the ground truth (each frame image of each video clip in the training data set, i.e. the real data sample) into the multi-level discriminator network to extract features, obtaining a feature map of the generated sample and a feature map of the real data sample, and then computing the element-wise loss between the two feature maps. This guarantees the consistency of the image content and improves training stability.
The feature matching loss is obtained by separately feeding the image output by the multi-level generator network (the generated sample) and the ground truth (the real data sample) into VGG16 to extract features, obtaining a feature map of the generated sample and a feature map of the real data sample, and then computing the element-wise loss between the two feature maps.
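A minimal sketch of these two element-wise feature losses, assuming PyTorch, torchvision's pretrained VGG16, and L1 as the element-wise error; the tapped VGG layer and the error norm are assumptions, since the patent only states that features are extracted and an element-wise loss is computed.

```python
# Sketch of the two feature-based losses, assuming L1 as the element-wise error.
# Which VGG16 layer is tapped, and whether L1 or L2 is used, are assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

_vgg_feats = vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # up to relu3_3 (assumed layer)
for p in _vgg_feats.parameters():
    p.requires_grad_(False)

def feature_matching_loss(fake_img, real_img):
    """Element-wise error between VGG16 features of the generated image and the ground truth."""
    return F.l1_loss(_vgg_feats(fake_img), _vgg_feats(real_img))

def content_consistency_loss(discriminator, fake_img, real_img):
    """Element-wise error between discriminator features of the generated image and the ground truth.
    `discriminator` is assumed to expose a feature-extraction call returning a feature map."""
    return F.l1_loss(discriminator(fake_img), discriminator(real_img))
```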
Because the structure of the multi-level generator network is complex, and in view of hardware capability and training convergence speed, the training process uses cross training of the time domain (denoted T and T' in the figure) and the space domain (denoted S and S' in the figure), as shown in Fig. 3: generator sub-networks of different spatial sizes are trained separately, and the number of frames participating in training is increased step by step along the time dimension.
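The cross-training schedule can be sketched at a high level as follows; the scale list, frame counts, and the train_step() helper are placeholders, since the patent describes the alternation only conceptually.

```python
# Schematic training schedule: alternate between spatial scales and progressively lengthen
# the temporal window. Frame counts, scale indexing, and train_step() are placeholders.
def train_schedule(dataset, generators, discriminators, train_step):
    temporal_windows = [2, 4, 8]          # frames per sample, grown over time (assumed values)
    for n_frames in temporal_windows:
        for level, (G, D) in enumerate(zip(generators, discriminators)):
            for clip in dataset.sample_clips(length=n_frames, scale=level):
                train_step(G, D, clip)    # one adversarial update at this scale / window length
```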
Step S3: input each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video. (When the trained multi-level generator network is used for prediction, only the corresponding semantic segmentation maps need to be provided as required; during frame-by-frame generation, the multi-level generator network computes the optical flow map from the previous frame as an input for predicting the current frame.)
The process of generating a time-series video from input conditions using the trained multi-level generator network is described in detail below:
Step S31: input the first frame semantic image to the trained multi-level generator network;
Step S32: input the second frame semantic image to the trained multi-level generator network;
Step S33: the trained multi-level generator network calculates the optical flow map from the first frame to the second frame as the prediction for the second frame, and computes the second frame image from this prediction and the second frame semantic image input by the user;
Step S34: input the third frame semantic image to the trained multi-level generator network;
Step S35: the trained multi-level generator network calculates the optical flow map from the second frame to the third frame as the prediction for the third frame, and computes the third frame image from this prediction and the third frame semantic image input by the user;
and so on: for the n-th frame image of the time-series video, the user inputs its semantic image, the trained multi-level generator network calculates the optical flow map from the (n-1)-th frame to the n-th frame as the prediction for the n-th frame, and the n-th frame image is computed from this prediction and the n-th frame semantic image input by the user.
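A minimal sketch of this frame-by-frame generation loop, assuming the trained generator exposes two hypothetical calls, predict_flow() and synthesize(); both method names are placeholders for the network's two outputs described above (the predicted optical flow and the generated frame).

```python
# Frame-by-frame inference loop. `predict_flow` and `synthesize` are hypothetical method
# names standing in for the network's optical flow prediction and frame synthesis outputs.
def generate_video(generator, semantic_maps):
    frames = []
    prev_frame = None
    for seg in semantic_maps:
        flow = None
        if prev_frame is not None:
            # Predicted optical flow from the previous generated frame to the current frame.
            flow = generator.predict_flow(prev_frame)
        frame = generator.synthesize(seg, flow_prediction=flow, prev_frame=prev_frame)
        frames.append(frame)
        prev_frame = frame
    return frames
```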
In summary, the time-series video generation method of this embodiment improves on the prior art mainly in the following aspects:
1. The conditional input of the multi-level generator network is not only the semantic map of the current frame; it also includes the input semantic maps of the preceding frames and the generated output images of the preceding frames, i.e. the network reviews the input images and the generated images of the previous frames.
2. Estimation of continuous inter-frame change information: the change of each pixel of the image is estimated by calculating the optical flow changes between the preceding frames so as to predict the optical flow from the previous frame to the current frame, and the output of the multi-level generator network contains a prediction of the optical flow change of the next frame, which is learned through training.
3. The time-series video generation method is based on a generative adversarial network, whose basic structure consists of two paired sub-networks, a generator and a discriminator. An optical flow constraint is added to the generator and optical flow information is added to the discriminator. The generator obtains a prediction map of the current frame; the generated map of the previous frame is warped by the optical flow estimated for the current frame to obtain an evolution map; and the prediction map and the evolution map are weighted and combined to form the final generated map of the current frame (a minimal sketch of this warp-and-blend step is given after this list). This design preserves a large amount of shared information between consecutive frames.
4. The input images of several consecutive frames are treated as a three-dimensional data structure (the third dimension being the temporal ordering), the residual convolution layers of the generator and discriminator networks are upgraded to a 3D convolution structure, and convolution is performed along both the spatial coordinate directions and the time direction of the images to extract features, so that local spatial features and continuous temporal features are learned.
5. High-definition video is generated using a multi-level generator sub-network structure: a low-resolution global perception network is constructed first, and then, with the global perception network as the basic structure, convolution layers are gradually stacked at its front and back ends to build local enhancement sub-networks and raise the image resolution.
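As referenced in item 3 above, the following is a minimal sketch of the warp-and-blend step in PyTorch; using grid_sample for the optical-flow warp and a soft per-pixel mask for the weighting are assumptions about how the weighted synthesis is realized.

```python
# Warp the previously generated frame with the estimated optical flow ("evolution map") and
# blend it with the current frame's prediction map. grid_sample-based warping and a soft mask
# are assumptions; the patent only states that the two maps are weighted and combined.
import torch
import torch.nn.functional as F

def warp(prev_frame, flow):
    """prev_frame: (B,3,H,W); flow: (B,2,H,W) in pixels. Returns the evolution map."""
    b, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(prev_frame.device)   # (H,W,2), xy order
    grid = grid + flow.permute(0, 2, 3, 1)                               # displaced sampling positions
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0                    # normalize x to [-1,1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0                    # normalize y to [-1,1]
    return F.grid_sample(prev_frame, grid, align_corners=True)

def blend(prediction_map, evolution_map, mask):
    """mask: (B,1,H,W) in [0,1], weighting flow-warped content against hallucinated content."""
    return mask * evolution_map + (1.0 - mask) * prediction_map
```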
Fig. 4 is a schematic structural diagram of a time-series video generation apparatus according to an embodiment of the present application. The time-series video generating apparatus may generally include:
a semantic segmentation map extraction module 1 configured to extract a semantic segmentation map of each frame image of each video clip in a training data set and calculate optical flow estimation maps between consecutive frames;
a training module 2 configured to train a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, the multi-level generator network G'_{N+1} having the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures; and
and a time-series video generation module 3 configured to input each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
The loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
The specific method for training the multi-level generator network is as follows: generator networks of different spatial sizes are trained separately, and the number of frames participating in training is gradually increased along the time dimension.
An embodiment of the present application also provides a computing device. Referring to Fig. 5, the computing device comprises a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements the method steps 1131 for performing any of the methods according to the present application.
An embodiment of the present application also provides a computer-readable storage medium. Referring to Fig. 6, the storage medium comprises a storage unit for program code, the storage unit being provided with a program 1131' for performing the steps of the method according to the present application, which program is executed by a processor.
An embodiment of the present application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of time-series video generation, comprising:
extracting a semantic segmentation map of each frame image of each video clip in the training data set, and calculating optical flow estimation maps between consecutive frames;
training a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, wherein the multi-level generator network G'_{N+1} has the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures;
and inputting each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
2. The method of claim 1, wherein the loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
3. The method of claim 2, wherein the content consistency loss is obtained by:
extracting features from the image output by the multi-level generator network and from the ground truth using the multi-level discriminator network to obtain two feature maps, wherein the ground truth is each frame image of each video clip in the training data set;
and calculating the error between the two feature maps and taking the error as the content consistency loss.
4. The method of claim 2, wherein the feature matching loss is obtained by:
extracting features from the image output by the multi-level generator network and from the ground truth using VGG16 to obtain two feature maps, wherein the ground truth is each frame image of each video clip in the training data set;
and calculating the error between the two feature maps and taking the error as the feature matching loss.
5. The method of claim 2, wherein the specific method of training the multi-level generator network is: generator networks of different spatial sizes are trained separately, and the number of frames participating in training is gradually increased along the time dimension.
6. A time-series video generation apparatus comprising:
a semantic segmentation map extraction module configured to extract a semantic segmentation map of each frame image of each video clip in the training data set and calculate optical flow estimation maps between consecutive frames;
a training module configured to train a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, the multi-level generator network G'_{N+1} having the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures; and
and a time-series video generation module configured to input each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
7. The apparatus of claim 6, wherein the loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
8. The apparatus of claim 7, wherein the specific method for training the multi-level generator network is: generator networks of different spatial sizes are trained separately, and the number of frames participating in training is gradually increased along the time dimension.
9. A computing device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of any of claims 1-5 when executing the computer program.
10. A storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1-5.
CN202110169891.3A 2021-02-08 2021-02-08 Time sequence video generation method and device, computing equipment and storage medium Active CN112995433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110169891.3A CN112995433B (en) 2021-02-08 2021-02-08 Time sequence video generation method and device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110169891.3A CN112995433B (en) 2021-02-08 2021-02-08 Time sequence video generation method and device, computing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112995433A true CN112995433A (en) 2021-06-18
CN112995433B CN112995433B (en) 2023-04-28

Family

ID=76348988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110169891.3A Active CN112995433B (en) 2021-02-08 2021-02-08 Time sequence video generation method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112995433B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115061770A (en) * 2022-08-10 2022-09-16 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107968962A (en) * 2017-12-12 2018-04-27 华中科技大学 A kind of video generation method of the non-conterminous image of two frames based on deep learning
CN109993820A (en) * 2019-03-29 2019-07-09 合肥工业大学 A kind of animated video automatic generation method and its device
CN110381268A (en) * 2019-06-25 2019-10-25 深圳前海达闼云端智能科技有限公司 method, device, storage medium and electronic equipment for generating video
CN110868598A (en) * 2019-10-17 2020-03-06 上海交通大学 Video content replacement method and system based on countermeasure generation network
US20200357099A1 (en) * 2019-05-09 2020-11-12 Adobe Inc. Video inpainting with deep internal learning
CN112149545A (en) * 2020-09-16 2020-12-29 珠海格力电器股份有限公司 Sample generation method and device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107968962A (en) * 2017-12-12 2018-04-27 华中科技大学 A kind of video generation method of the non-conterminous image of two frames based on deep learning
CN109993820A (en) * 2019-03-29 2019-07-09 合肥工业大学 A kind of animated video automatic generation method and its device
US20200357099A1 (en) * 2019-05-09 2020-11-12 Adobe Inc. Video inpainting with deep internal learning
CN110381268A (en) * 2019-06-25 2019-10-25 深圳前海达闼云端智能科技有限公司 method, device, storage medium and electronic equipment for generating video
CN110868598A (en) * 2019-10-17 2020-03-06 上海交通大学 Video content replacement method and system based on countermeasure generation network
CN112149545A (en) * 2020-09-16 2020-12-29 珠海格力电器股份有限公司 Sample generation method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YU Haitao et al.: "Adversarial Video Generation Method Based on Multimodal Input", Journal of Computer Research and Development (《计算机研究与发展》) *
LIU Shihao et al.: "A Video Translation Model from Virtual to Real Driving Scenes Based on Dual Generative Adversarial Networks", Journal of Computer Applications (《计算机应用》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115061770A (en) * 2022-08-10 2022-09-16 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper
CN115061770B (en) * 2022-08-10 2023-01-13 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper

Also Published As

Publication number Publication date
CN112995433B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
Zhang et al. Uncertainty inspired RGB-D saliency detection
Ye et al. PMBANet: Progressive multi-branch aggregation network for scene depth super-resolution
TWI739151B (en) Method, device and electronic equipment for image generation network training and image processing
CN114339409B (en) Video processing method, device, computer equipment and storage medium
Zuo et al. Frequency-dependent depth map enhancement via iterative depth-guided affine transformation and intensity-guided refinement
JP2023545189A (en) Image processing methods, devices, and electronic equipment
CN112149545B (en) Sample generation method, device, electronic equipment and storage medium
Johari et al. Context-aware colorization of gray-scale images utilizing a cycle-consistent generative adversarial network architecture
CN116797768A (en) Method and device for reducing reality of panoramic image
Junayed et al. Consistent video inpainting using axial attention-based style transformer
Li et al. Image super-resolution reconstruction based on multi-scale dual-attention
Chang et al. Disentangling audio content and emotion with adaptive instance normalization for expressive facial animation synthesis
CN112995433B (en) Time sequence video generation method and device, computing equipment and storage medium
Quan et al. Deep learning-based image and video inpainting: A survey
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
CN114998814B (en) Target video generation method and device, computer equipment and storage medium
CN115018734B (en) Video restoration method and training method and device of video restoration model
Wang et al. Decomposed guided dynamic filters for efficient rgb-guided depth completion
CN116975347A (en) Image generation model training method and related device
Li et al. Feature pre-inpainting enhanced transformer for video inpainting
CN113658231A (en) Optical flow prediction method, optical flow prediction device, electronic device, and storage medium
CN112052863A (en) Image detection method and device, computer storage medium and electronic equipment
Alshehri et al. Self‐Attention‐Based Edge Computing Model for Synthesis Image to Text through Next‐Generation AI Mechanism
Wang et al. Dynamic context-driven progressive image inpainting with auxiliary generative units
Chen et al. Contrastive structure and texture fusion for image inpainting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A temporal video generation method, device, computing device, and storage medium

Effective date of registration: 20230713

Granted publication date: 20230428

Pledgee: Bank of Jiangsu Limited by Share Ltd. Beijing branch

Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023110000278