CN112995433A - Time sequence video generation method and device, computing equipment and storage medium - Google Patents

Time sequence video generation method and device, computing equipment and storage medium

Info

Publication number
CN112995433A
CN112995433A
Authority
CN
China
Prior art keywords
frame
network
generator network
image
level generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110169891.3A
Other languages
Chinese (zh)
Other versions
CN112995433B (en)
Inventor
Sun Teng (孙腾)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp ltd
Original Assignee
Beijing Moviebook Technology Corp ltd
Application filed by Beijing Moviebook Technology Corp ltd
Priority to CN202110169891.3A
Publication of CN112995433A
Application granted
Publication of CN112995433B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/04 Synchronising
    • H04N 5/06 Generation of synchronising signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/04 Synchronising
    • H04N 5/08 Separation of synchronising signals from picture signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/76 Television signal recording
    • H04N 5/765 Interface circuits between an apparatus for recording and another apparatus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a time-series video generation method and apparatus, a computing device, and a storage medium. The method comprises the following steps: extracting a semantic segmentation map of each frame image of each video clip in a training data set, and calculating optical flow estimation maps between consecutive frames; training a multi-level generator network with the semantic segmentation maps and the inter-frame optical flow estimation maps to obtain a trained multi-level generator network; and inputting each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video. The apparatus comprises a semantic segmentation map extraction module, a training module, and a time-series video generation module. The computing device comprises a memory, a processor, and a computer program stored in the memory and executable by the processor, the processor implementing the above method when executing the computer program. The storage medium stores a computer program which, when executed by a processor, implements the above method.

Description

Time sequence video generation method and device, computing equipment and storage medium
Technical Field
The present application relates to the field of time-series video generation, and in particular to techniques such as video feature extraction, temporal analysis, and image generation.
Background
The rapid development of neural networks in the field of Artificial Intelligence (AI) has promoted the cross-fusion of information from multiple domains such as images, text, and speech, and users' expectations for image and video processing technology keep rising. Virtual reality application scenarios are becoming more and more complex, and data under specified conditions needs to be visualized and simulated by computers, so the emergence of generative adversarial network (GAN) technology has driven the development of photo-realistic rendering and generation of virtual scenes. Against this background, GAN technology can generate images of specified content from conditional inputs to a level at which it is difficult for the naked eye to distinguish real from fake, but there are few generation schemes for video sequences of consecutive frames. The commonly used pix2pix and pix2pixHD image translation algorithms are designed only for translating static pictures and do not model the temporal dimension; if they are applied directly to video translation, the resulting frames are discontinuous, so they cannot be used for video generation.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least to partially solve or mitigate them.
According to an aspect of the present application, there is provided a time-series video generation method, including:
extracting a semantic segmentation map of each frame image of each video clip in the training data set, and calculating optical flow estimation maps between consecutive frames;
training a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, wherein the multi-level generator network G'_{N+1} has the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures;
and inputting each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
Optionally, the loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
Optionally, the content consistency loss is obtained by:
extracting features from the image output by the multi-level generator network and from the ground truth using the multi-level discriminator network to obtain two feature maps, wherein the ground truth is each frame image of each video clip in the training data set;
and calculating the error between the two feature maps and taking the error as the content consistency loss.
Optionally, the feature matching loss is obtained by:
extracting features from the image output by the multi-level generator network and from the ground truth using VGG16 to obtain two feature maps, wherein the ground truth is each frame image of each video clip in the training data set;
and calculating the error between the two feature maps and taking the error as the feature matching loss.
Optionally, the specific method for training the multi-level generator network is as follows: generator networks of different spatial sizes are trained separately, and the number of frames participating in training is gradually increased along the time dimension.
According to another aspect of the present application, there is provided a time-series video generating apparatus including:
a semantic segmentation map extraction module configured to extract a semantic segmentation map of each frame image of each video clip in the training data set and calculate optical flow estimation maps between consecutive frames;
a training module configured to train a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, the multi-level generator network G'_{N+1} having the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures; and
and a time-series video generation module configured to input each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
Optionally, the loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
Optionally: generator networks of different sizes are trained respectively in space, and the dimensionality of the frames participating in training is gradually increased in the time dimension.
According to a third aspect of the present application, there is provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of the present application when executing the computer program.
According to a fourth aspect of the present application, a storage medium, preferably a non-volatile readable storage medium, is provided, having stored therein a computer program which, when executed by a processor, implements the method described herein.
The time-series video generation method, apparatus, computing device, and storage medium of the present application generate a video sequence with realistic content from the specified conditional input. They constitute a general video image translation and generation framework that can generate various types of video, ensuring both the realism of each generated frame and the continuity of picture changes between consecutive frames.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow chart diagram of a method of time-series video generation according to one embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of a first level generator subnetwork in a multi-level generator network according to one embodiment of the present application;
FIG. 3 is a schematic diagram of a method of training a multi-level generator network according to one embodiment of the present application;
fig. 4 is a schematic structural diagram of a time-series video generation apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a computing device according to one embodiment of the present application;
FIG. 6 is a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
Commonly used image translation and generation algorithms include pix2pix and pix2pixHD. The present method improves on them: a continuous sequence of semantic maps is taken as input, the semantic information in the maps and the temporal dynamics are modeled, and the video content is generated by rendering.
Fig. 1 is a schematic flow chart of a method of time-series video generation according to an embodiment of the present application. The time-series video generating method may generally include the following steps S1 to S3.
Step S1: extract the semantic segmentation map of each frame image of each video clip in the training data set, and calculate the optical flow estimation maps between consecutive frames.
For each video clip in the training data set, the semantic segmentation map of each frame image is extracted and the optical flow estimation maps between consecutive frames are calculated; these serve as the model training input, while each frame image of the original video serves as the ground truth.
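As an illustration of this preprocessing step, the following is a minimal sketch assuming OpenCV's Farneback estimator for the inter-frame optical flow and a placeholder segment() function standing in for the semantic segmentation model; the patent does not name either component.

```python
# A minimal sketch of Step S1, assuming OpenCV's Farneback estimator for the inter-frame
# optical flow and a placeholder segment() function for the semantic segmentation model.
import cv2
import numpy as np

def segment(frame_bgr: np.ndarray) -> np.ndarray:
    """Hypothetical semantic segmentation; returns a per-pixel label map."""
    raise NotImplementedError  # replace with a real segmentation network

def prepare_clip(frames_bgr: list[np.ndarray]):
    """Build training inputs for one clip: (segmentation map, flow from previous frame, ground-truth frame)."""
    samples = []
    prev_gray = None
    for frame in frames_bgr:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = None
        if prev_gray is not None:
            # Dense optical flow between the previous and current frame (H x W x 2).
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
        samples.append({"seg": segment(frame), "flow": flow, "gt": frame})
        prev_gray = gray
    return samples
```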
Step S2: train the multi-level generator network G'_{N+1} with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network G'_{N+1}, where the multi-level generator network G'_{N+1} has the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N is an integer with N >= 1.
The multi-level generator network comprises generator sub-networks that are nested with each other in sequence, with G'_1 being the first-level generator sub-network G_1 itself. As shown in Fig. 2, the first-level generator network G'_1 is the first-level generator sub-network G_1; the first-level generator sub-network G_1 is embedded in the second-level generator sub-network G_2 to form the second-level generator network G'_2; the second-level generator network G'_2 is embedded in the third-level generator sub-network G_3 to form the third-level generator network G'_3; and so on, until the N-th-level generator network G'_N is embedded in the (N+1)-th-level generator sub-network G_{N+1} to form the (N+1)-th-level generator network G'_{N+1}.
For the two-level generator network G'_2 (the case N = 1), the structure is G'_2 = G_2^{down} + G'_1 + G_2^{up}, where G'_1 (i.e. G_1) is a generator network. The principle is to split the generator sub-network G_2 in the middle into a down-sampling part G_2^{down} and an up-sampling part G_2^{up}, placed respectively before and after the generator network G'_1. The output of G_2^{down} serves as the input of G'_1; the output of G_2^{down} and the output of G'_1 are superimposed, the superimposed result serves as the input of G_2^{up}, and the output of G_2^{up} is the output of the two-level generator network G'_2.
For the three-level generator network G'_3 (N = 2), the structure is G'_3 = G_3^{down} + G'_2 + G_3^{up}; the principle is the same as for the two-level generator network G'_2.
By analogy, the (N+1)-level generator network G'_{N+1} has the structure G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}, following the same principle.
In this embodiment, the generator network G'_1 is also called the global perception network and the generator network G'_2 is also called the local enhancement network; the global perception network is integrated as the middle-layer structure of the local enhancement network. The resolution of the global perception network G'_1 is 256 px, the resolution of the local enhancement network G'_2 is twice that of the global perception network G'_1, and in general the input image of each level of generator sub-network in the multi-level generator network has twice the resolution of that of the previous level.
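The nesting just described can be sketched in PyTorch roughly as follows; the layer counts, channel widths, and class names are illustrative assumptions rather than the patent's exact architecture, but the data flow (the output of G_2^{down} feeds G'_1, the two outputs are superimposed, and the result feeds G_2^{up}) follows the description above.

```python
# A rough sketch of the nested coarse-to-fine generator G'_{N+1} = G_{N+1}^down + G'_N + G_{N+1}^up.
# Layer counts, channel widths, and class names are assumptions for illustration only.
import torch
import torch.nn as nn

class LevelWrapper(nn.Module):
    """Wraps an inner generator G'_N with the down/up-sampling halves of G_{N+1}."""
    def __init__(self, inner: nn.Module, in_ch=3, feat_ch=32, out_ch=3):
        super().__init__()
        self.down = nn.Sequential(                       # G_{N+1}^down: halves the resolution
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1),
            nn.InstanceNorm2d(feat_ch), nn.ReLU(inplace=True))
        self.inner = inner                               # G'_N, operating at half resolution
        self.up = nn.Sequential(                         # G_{N+1}^up: restores the resolution
            nn.ConvTranspose2d(feat_ch, feat_ch, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(feat_ch), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, out_ch, 3, padding=1), nn.Tanh())

    def forward(self, x):
        d = self.down(x)
        fused = d + self.inner(d)      # superimpose the outputs of G_{N+1}^down and G'_N
        return self.up(fused)

def base_generator(ch=32, n_blocks=3):
    """G'_1: a small convolutional generator working on feature maps (illustrative)."""
    blocks = []
    for _ in range(n_blocks):
        blocks += [nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*blocks)

# Two-level generator G'_2 = G_2^down + G'_1 + G_2^up; deeper levels wrap g2_prime the same way.
g2_prime = LevelWrapper(inner=base_generator(ch=32))
out = g2_prime(torch.randn(1, 3, 512, 512))
```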
During training, the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network G'_{N+1} has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
The multi-level discriminator network is similar in structure to the multi-level generator network: D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network.
The residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures.
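A minimal sketch of such a residual block with 3D convolutions, assuming a PyTorch implementation; the kernel size and normalization are not specified by the patent and are chosen here for illustration.

```python
# A minimal sketch of a residual block whose convolutions are 3D, so features are extracted
# jointly along the spatial (H, W) and temporal (T) axes. Kernel size and normalization are
# assumptions; the patent only states that the residual convolution layers use 3D convolution.
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),  # (T, H, W) convolution
            nn.InstanceNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels))

    def forward(self, x):            # x: (B, C, T, H, W) — consecutive frames stacked along T
        return x + self.body(x)      # residual connection

# Example: 4 consecutive 64-channel feature maps of size 128x128
y = ResBlock3D(64)(torch.randn(1, 64, 4, 128, 128))
```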
The loss function of the multi-level generator network consists of several parts, with the aim of strengthening the constraints on the multi-level generator network. The first part is the image distribution loss, which guarantees the realism of the generated images; the second part is the temporal loss, which guarantees the temporal consistency of the generated video; the third part is the optical flow loss, which guarantees the correctness of the estimated optical flow; the fourth part is the feature matching loss; and the fifth part is the content consistency loss.
The content consistency loss is obtained by separately feeding the image output by the multi-level generator network (the generated sample) and the ground truth (each frame image of each video clip in the training data set, i.e. the real data sample) into the multi-level discriminator network to extract features, obtaining a feature map of the generated sample and a feature map of the real data sample, and then computing the element-wise loss between the two feature maps. This guarantees the consistency of the image content and improves training stability.
The feature matching loss is obtained by separately feeding the image output by the multi-level generator network (the generated sample) and the ground truth (the real data sample) into VGG16 to extract features, obtaining a feature map of the generated sample and a feature map of the real data sample, and then computing the element-wise loss between the two feature maps.
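A minimal sketch of these two element-wise feature losses, assuming PyTorch, torchvision's pretrained VGG16, and L1 as the element-wise error; the tapped VGG layer and the error norm are assumptions, since the patent only states that features are extracted and an element-wise loss is computed.

```python
# Sketch of the two feature-based losses, assuming L1 as the element-wise error.
# Which VGG16 layer is tapped, and whether L1 or L2 is used, are assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

_vgg_feats = vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # up to relu3_3 (assumed layer)
for p in _vgg_feats.parameters():
    p.requires_grad_(False)

def feature_matching_loss(fake_img, real_img):
    """Element-wise error between VGG16 features of the generated image and the ground truth."""
    return F.l1_loss(_vgg_feats(fake_img), _vgg_feats(real_img))

def content_consistency_loss(discriminator, fake_img, real_img):
    """Element-wise error between discriminator features of the generated image and the ground truth.
    `discriminator` is assumed to expose a feature-extraction call returning a feature map."""
    return F.l1_loss(discriminator(fake_img), discriminator(real_img))
```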
Because the structure of the multi-level generator network is complex, and in view of hardware capability and training convergence speed, the training process uses cross training of the time domain (denoted T and T' in the figure) and the space domain (denoted S and S' in the figure), as shown in Fig. 3: generator sub-networks of different spatial sizes are trained separately, and the number of frames participating in training is increased step by step along the time dimension.
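The cross-training schedule can be sketched at a high level as follows; the scale list, frame counts, and the train_step() helper are placeholders, since the patent describes the alternation only conceptually.

```python
# Schematic training schedule: alternate between spatial scales and progressively lengthen
# the temporal window. Frame counts, scale indexing, and train_step() are placeholders.
def train_schedule(dataset, generators, discriminators, train_step):
    temporal_windows = [2, 4, 8]          # frames per sample, grown over time (assumed values)
    for n_frames in temporal_windows:
        for level, (G, D) in enumerate(zip(generators, discriminators)):
            for clip in dataset.sample_clips(length=n_frames, scale=level):
                train_step(G, D, clip)    # one adversarial update at this scale / window length
```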
Step S3: input each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video. (When the trained multi-level generator network is used for prediction, only the corresponding semantic segmentation maps need to be provided as required; during frame-by-frame generation, the multi-level generator network computes the optical flow map from the previous frame as an input for predicting the current frame.)
The process of generating a time-series video from input conditions using the trained multi-level generator network is described in detail below:
Step S31: input the first frame semantic image to the trained multi-level generator network;
Step S32: input the second frame semantic image to the trained multi-level generator network;
Step S33: the trained multi-level generator network calculates the optical flow map from the first frame to the second frame as the prediction for the second frame, and computes the second frame image from this prediction and the second frame semantic image input by the user;
Step S34: input the third frame semantic image to the trained multi-level generator network;
Step S35: the trained multi-level generator network calculates the optical flow map from the second frame to the third frame as the prediction for the third frame, and computes the third frame image from this prediction and the third frame semantic image input by the user;
and so on: for the n-th frame image of the time-series video, the user inputs its semantic image, the trained multi-level generator network calculates the optical flow map from the (n-1)-th frame to the n-th frame as the prediction for the n-th frame, and the n-th frame image is computed from this prediction and the n-th frame semantic image input by the user.
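A minimal sketch of this frame-by-frame generation loop, assuming the trained generator exposes two hypothetical calls, predict_flow() and synthesize(); both method names are placeholders for the network's two outputs described above (the predicted optical flow and the generated frame).

```python
# Frame-by-frame inference loop. `predict_flow` and `synthesize` are hypothetical method
# names standing in for the network's optical flow prediction and frame synthesis outputs.
def generate_video(generator, semantic_maps):
    frames = []
    prev_frame = None
    for seg in semantic_maps:
        flow = None
        if prev_frame is not None:
            # Predicted optical flow from the previous generated frame to the current frame.
            flow = generator.predict_flow(prev_frame)
        frame = generator.synthesize(seg, flow_prediction=flow, prev_frame=prev_frame)
        frames.append(frame)
        prev_frame = frame
    return frames
```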
In summary, the time-series video generation method of this embodiment improves on the prior art mainly in the following aspects:
1. The conditional input of the multi-level generator network is not only the semantic map of the current frame; it also includes the input semantic maps of the preceding frames and the generated output images of the preceding frames, i.e. the network reviews the input images and the generated images of the previous frames.
2. Estimation of continuous inter-frame change information: the change of each pixel of the image is estimated by calculating the optical flow changes between the preceding frames so as to predict the optical flow from the previous frame to the current frame, and the output of the multi-level generator network contains a prediction of the optical flow change of the next frame, which is learned through training.
3. The time-series video generation method is based on a generative adversarial network, whose basic structure consists of two paired sub-networks, a generator and a discriminator. An optical flow constraint is added to the generator and optical flow information is added to the discriminator. The generator obtains a prediction map of the current frame; the generated map of the previous frame is warped by the optical flow estimated for the current frame to obtain an evolution map; and the prediction map and the evolution map are weighted and combined to form the final generated map of the current frame (a minimal sketch of this warp-and-blend step is given after this list). This design preserves a large amount of shared information between consecutive frames.
4. The input images of several consecutive frames are treated as a three-dimensional data structure (the third dimension being the temporal ordering), the residual convolution layers of the generator and discriminator networks are upgraded to a 3D convolution structure, and convolution is performed along both the spatial coordinate directions and the time direction of the images to extract features, so that local spatial features and continuous temporal features are learned.
5. High-definition video is generated using a multi-level generator sub-network structure: a low-resolution global perception network is constructed first, and then, with the global perception network as the basic structure, convolution layers are gradually stacked at its front and back ends to build local enhancement sub-networks and raise the image resolution.
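As referenced in item 3 above, the following is a minimal sketch of the warp-and-blend step in PyTorch; using grid_sample for the optical-flow warp and a soft per-pixel mask for the weighting are assumptions about how the weighted synthesis is realized.

```python
# Warp the previously generated frame with the estimated optical flow ("evolution map") and
# blend it with the current frame's prediction map. grid_sample-based warping and a soft mask
# are assumptions; the patent only states that the two maps are weighted and combined.
import torch
import torch.nn.functional as F

def warp(prev_frame, flow):
    """prev_frame: (B,3,H,W); flow: (B,2,H,W) in pixels. Returns the evolution map."""
    b, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(prev_frame.device)   # (H,W,2), xy order
    grid = grid + flow.permute(0, 2, 3, 1)                               # displaced sampling positions
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0                    # normalize x to [-1,1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0                    # normalize y to [-1,1]
    return F.grid_sample(prev_frame, grid, align_corners=True)

def blend(prediction_map, evolution_map, mask):
    """mask: (B,1,H,W) in [0,1], weighting flow-warped content against hallucinated content."""
    return mask * evolution_map + (1.0 - mask) * prediction_map
```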
Fig. 4 is a schematic structural diagram of a time-series video generation apparatus according to an embodiment of the present application. The time-series video generating apparatus may generally include:
a semantic segmentation map extraction module 1 configured to extract a semantic segmentation map of each frame image of each video clip in a training data set and calculate optical flow estimation maps between consecutive frames;
a training module 2 configured to train a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, the multi-level generator network G'_{N+1} having the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures; and
and a time-series video generation module 3 configured to input each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
The loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
The specific method for training the multi-level generator network is as follows: generator networks of different spatial sizes are trained separately, and the number of frames participating in training is gradually increased along the time dimension.
An embodiment of the present application also provides a computing device. Referring to Fig. 5, the computing device comprises a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements the method steps 1131 for performing any of the methods according to the present application.
An embodiment of the present application also provides a computer-readable storage medium. Referring to Fig. 6, the storage medium comprises a storage unit for program code, the storage unit being provided with a program 1131' for performing the steps of the method according to the present application, which program is executed by a processor.
An embodiment of the present application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of time-series video generation, comprising:
extracting a semantic segmentation map of each frame image of each video clip in the training data set, and calculating optical flow estimation maps between consecutive frames;
training a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, wherein the multi-level generator network G'_{N+1} has the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures;
and inputting each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
2. The method of claim 1, wherein the loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
3. The method of claim 2, wherein the content consistency loss is obtained by:
extracting features from the image output by the multi-level generator network and from the ground truth using the multi-level discriminator network to obtain two feature maps, wherein the ground truth is each frame image of each video clip in the training data set;
and calculating the error between the two feature maps and taking the error as the content consistency loss.
4. The method of claim 2, wherein the feature matching loss is obtained by:
extracting features from the image output by the multi-level generator network and from the ground truth using VGG16 to obtain two feature maps, wherein the ground truth is each frame image of each video clip in the training data set;
and calculating the error between the two feature maps and taking the error as the feature matching loss.
5. The method of claim 2, wherein the specific method of training the multi-level generator network is: generator networks of different spatial sizes are trained separately, and the number of frames participating in training is gradually increased along the time dimension.
6. A time-series video generation apparatus comprising:
a semantic segmentation map extraction module configured to extract a semantic segmentation map of each frame image of each video clip in the training data set and calculate optical flow estimation maps between consecutive frames;
a training module configured to train a multi-level generator network with the semantic segmentation map of each frame image of each video clip in the training data set and the optical flow estimation maps between consecutive frames to obtain a trained multi-level generator network, the multi-level generator network G'_{N+1} having the structure:
G'_{N+1} = G_{N+1}^{down} + G'_N + G_{N+1}^{up}
where G_{N+1} is a generator sub-network, G_{N+1}^{down} is the down-sampling part of G_{N+1}, G_{N+1}^{up} is the up-sampling part of G_{N+1}, G'_1 is a generator network, and N >= 1;
the multi-level discriminator network D'_{N+1} used in cooperation with the multi-level generator network has the structure:
D'_{N+1} = D_{N+1}^{down} + D'_N + D_{N+1}^{up}
where D_{N+1} is a discriminator sub-network, D_{N+1}^{down} is the down-sampling part of D_{N+1}, D_{N+1}^{up} is the up-sampling part of D_{N+1}, and D'_1 is a discriminator network;
the residual convolution layers of the multi-level generator network and the multi-level discriminator network are both 3D convolution structures; and
and a time-series video generation module configured to input each frame of semantic image of the time-series video into the trained multi-level generator network to obtain the time-series video.
7. The apparatus of claim 6, wherein the loss function of the multi-level generator network includes an image distribution loss, a temporal loss, an optical flow loss, a feature matching loss, and a content consistency loss.
8. The apparatus of claim 7, wherein the specific method for training the multi-level generator network is: generator networks of different spatial sizes are trained separately, and the number of frames participating in training is gradually increased along the time dimension.
9. A computing device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of any of claims 1-5 when executing the computer program.
10. A storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1-5.
CN202110169891.3A 2021-02-08 2021-02-08 Time sequence video generation method and device, computing equipment and storage medium Active CN112995433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110169891.3A CN112995433B (en) 2021-02-08 2021-02-08 Time sequence video generation method and device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110169891.3A CN112995433B (en) 2021-02-08 2021-02-08 Time sequence video generation method and device, computing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112995433A true CN112995433A (en) 2021-06-18
CN112995433B CN112995433B (en) 2023-04-28

Family

ID=76348988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110169891.3A Active CN112995433B (en) 2021-02-08 2021-02-08 Time sequence video generation method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112995433B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115061770A (en) * 2022-08-10 2022-09-16 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107968962A (en) * 2017-12-12 2018-04-27 华中科技大学 A kind of video generation method of the non-conterminous image of two frames based on deep learning
CN109993820A (en) * 2019-03-29 2019-07-09 合肥工业大学 A kind of animated video automatic generation method and its device
CN110381268A (en) * 2019-06-25 2019-10-25 深圳前海达闼云端智能科技有限公司 method, device, storage medium and electronic equipment for generating video
CN110868598A (en) * 2019-10-17 2020-03-06 上海交通大学 Video content replacement method and system based on countermeasure generation network
US20200357099A1 (en) * 2019-05-09 2020-11-12 Adobe Inc. Video inpainting with deep internal learning
CN112149545A (en) * 2020-09-16 2020-12-29 珠海格力电器股份有限公司 Sample generation method and device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107968962A (en) * 2017-12-12 2018-04-27 华中科技大学 A kind of video generation method of the non-conterminous image of two frames based on deep learning
CN109993820A (en) * 2019-03-29 2019-07-09 合肥工业大学 A kind of animated video automatic generation method and its device
US20200357099A1 (en) * 2019-05-09 2020-11-12 Adobe Inc. Video inpainting with deep internal learning
CN110381268A (en) * 2019-06-25 2019-10-25 深圳前海达闼云端智能科技有限公司 method, device, storage medium and electronic equipment for generating video
CN110868598A (en) * 2019-10-17 2020-03-06 上海交通大学 Video content replacement method and system based on countermeasure generation network
CN112149545A (en) * 2020-09-16 2020-12-29 珠海格力电器股份有限公司 Sample generation method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YU Haitao et al.: "Adversarial Video Generation Method Based on Multimodal Input", Journal of Computer Research and Development (《计算机研究与发展》) *
LIU Shihao et al.: "A Video Translation Model from Virtual to Real Driving Scenes Based on Dual Generative Adversarial Networks", Journal of Computer Applications (《计算机应用》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115061770A (en) * 2022-08-10 2022-09-16 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper
CN115061770B (en) * 2022-08-10 2023-01-13 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper

Also Published As

Publication number Publication date
CN112995433B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
Zhang et al. Uncertainty inspired RGB-D saliency detection
Ye et al. PMBANet: Progressive multi-branch aggregation network for scene depth super-resolution
TWI739151B (en) Method, device and electronic equipment for image generation network training and image processing
CN114339409B (en) Video processing method, device, computer equipment and storage medium
Zuo et al. Frequency-dependent depth map enhancement via iterative depth-guided affine transformation and intensity-guided refinement
JP2023545189A (en) Image processing methods, devices, and electronic equipment
CN112149545B (en) Sample generation method, device, electronic equipment and storage medium
Johari et al. Context-aware colorization of gray-scale images utilizing a cycle-consistent generative adversarial network architecture
CN116797768A (en) Method and device for reducing reality of panoramic image
Junayed et al. Consistent video inpainting using axial attention-based style transformer
Li et al. Image super-resolution reconstruction based on multi-scale dual-attention
Chang et al. Disentangling audio content and emotion with adaptive instance normalization for expressive facial animation synthesis
CN112995433B (en) Time sequence video generation method and device, computing equipment and storage medium
Quan et al. Deep learning-based image and video inpainting: A survey
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
CN114998814B (en) Target video generation method and device, computer equipment and storage medium
CN115018734B (en) Video restoration method and training method and device of video restoration model
Wang et al. Decomposed guided dynamic filters for efficient rgb-guided depth completion
CN116975347A (en) Image generation model training method and related device
Li et al. Feature pre-inpainting enhanced transformer for video inpainting
CN113658231A (en) Optical flow prediction method, optical flow prediction device, electronic device, and storage medium
CN112052863A (en) Image detection method and device, computer storage medium and electronic equipment
Alshehri et al. Self‐Attention‐Based Edge Computing Model for Synthesis Image to Text through Next‐Generation AI Mechanism
Wang et al. Dynamic context-driven progressive image inpainting with auxiliary generative units
Chen et al. Contrastive structure and texture fusion for image inpainting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A temporal video generation method, device, computing device, and storage medium

Effective date of registration: 20230713

Granted publication date: 20230428

Pledgee: Bank of Jiangsu Limited by Share Ltd. Beijing branch

Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023110000278