CN111353069A - Character scene video generation method, system, device and storage medium

Character scene video generation method, system, device and storage medium

Info

Publication number
CN111353069A
Authority
CN
China
Prior art keywords
image
sample
generation
label
video
Prior art date
Legal status
Pending
Application number
CN202010079892.4A
Other languages
Chinese (zh)
Inventor
李�权
叶俊杰
王伦基
黄桂芳
任勇
韩蓝青
Current Assignee
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Original Assignee
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Priority date
Filing date
Publication date
Application filed by CYAGEN BIOSCIENCES (GUANGZHOU) Inc, Research Institute Of Tsinghua Pearl River Delta filed Critical CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Priority to CN202010079892.4A priority Critical patent/CN111353069A/en
Publication of CN111353069A publication Critical patent/CN111353069A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 - Querying
    • G06F16/738 - Presentation of query results
    • G06F16/739 - Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content, the detected or recognised objects being people
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a character scene video generation method, system, device and storage medium. A generative adversarial network model is trained, and a label image carrying constraint conditions is input into the trained model, so that a realistic person picture corresponding to those constraints can be output. The constraints guide the model to generate a real image that matches them, so the generated content can be controlled at a fine level and more controllable high-definition images can be generated. New constraints can be added as new generation requirements arise in subsequent use, allowing the generated content to be extended more richly as needed; and since no real person has to record each video, the method achieves higher production efficiency and richer forms of extension. The invention is widely applicable in the field of computer technology.

Description

Character scene video generation method, system, device and storage medium
Technical Field
The invention relates to the field of computer technology, and in particular to a character scene video generation method, system, device and storage medium.
Background
With the continuous development of virtual reality and augmented reality technology, more and more three-dimensional models are used in shared applications, where three-dimensional scenes constructed from these models are widely applied in many fields. To a great extent this provides users with richer visual enjoyment and improves the user experience.
Most existing methods for synthesizing character images adopt computer graphics (CG) techniques: through modules such as modeling, compositing, materials and rendering, an object model is first built up block by block, then the different parts are texture-mapped and rendered to achieve a more realistic effect, and finally the model is fused with the real environment. Every step demands substantial effort from professionals, each image must be finely processed, the overall production time is long, and the labor cost is high, so the demands of high quality and high efficiency cannot be met at the same time. Likewise, in the existing mode of producing video content from a script, a real person must record every video, which consumes a large amount of time and results in low working efficiency.
Disclosure of Invention
In order to solve at least one of the above problems, the present invention provides a method, a system, an apparatus and a storage medium for generating a character scene video.
The technical scheme adopted by the invention is as follows. In one aspect, an embodiment of the present invention provides a character scene video generation method, comprising:
acquiring a first image, the first image being a label image carrying constraint conditions, wherein the constraint conditions include a face contour, a human-body key-point skeleton, a human-body contour, a head contour and a background;
receiving and processing the first image with a trained generative adversarial network (GAN) model to output a second image, the second image being a real image corresponding to the constraint conditions;
acquiring a voice signal;
and combining the second image with the voice signal to generate a character scene video.
Further, the method includes training the generative adversarial network model, comprising:
constructing a training set, the training set consisting of person image samples, person video samples and label samples, wherein the label samples are obtained by extracting key points and masks from the person image samples and the person video samples;
and acquiring the training set to train the generative adversarial network model.
Further, the method comprises testing the trained generative adversarial network model, comprising:
modifying a label sample;
feeding the modified label sample to the generative adversarial network model;
and detecting whether the model outputs the image and/or video corresponding to the modified label.
Further, the step of modifying the label sample specifically comprises:
extracting key points and masks from the person image samples and person video samples to obtain the label sample;
and changing the key-point coordinate locations and the mask shape to modify the label sample.
Further, the generative adversarial network model comprises a generation network and a discrimination network;
the generation network is used to receive the first image and generate the second image;
and the discrimination network is used to judge the degree of reality of the second image.
Further, the generation network comprises a plurality of sub-networks, including a first sub-network and a second sub-network;
the first sub-network is used to generate an image containing global information;
and the second sub-network is used to perform local detail enhancement on the image generated by the first sub-network, so as to output an image containing local detail features.
Further, the step in which the discrimination network judges the degree of reality of the second image specifically comprises:
cropping the second image into a plurality of images of different scales;
discriminating on the images of different scales with a multi-scale discriminator to obtain a plurality of discrimination result values;
calculating the average of the plurality of discrimination result values;
and judging the degree of reality of the second image according to the calculated average.
In another aspect, an embodiment of the invention provides a character scene video generation system comprising a test module and a training module.
The test module is used for:
acquiring a first image, the first image being a label image carrying constraint conditions, wherein the constraint conditions include a face contour, a human-body key-point skeleton, a human-body contour, a head contour and a background;
receiving and processing the first image with a trained generative adversarial network model to output a second image, the second image being a real image corresponding to the constraint conditions;
acquiring a voice signal;
and combining the second image with the voice signal to generate a character scene video.
The training module is used to train the generative adversarial network model through the following process:
constructing a training set, the training set consisting of person image samples, person video samples and label samples, wherein the label samples are obtained by extracting key points and masks from the person image samples and the person video samples;
acquiring the training set to train the generative adversarial network model;
and detecting whether the model outputs the image and/or video corresponding to a given label.
In another aspect, an embodiment of the present invention provides a character scene video generation apparatus comprising a processor and a memory, wherein
the memory is used to store program instructions;
and the processor is used to read the program instructions in the memory and execute the character scene video generation method according to those instructions.
In another aspect, embodiments of the present invention also include a computer-readable storage medium, wherein
the computer-readable storage medium stores a computer program which, when executed by a processor, performs the character scene video generation method of the embodiments.
The invention has the following beneficial effects. By training a generative adversarial network model and feeding the trained model a label image carrying constraint conditions, a realistic person picture corresponding to those constraints can be output. The constraints guide the model to generate a real image that matches them, so the generated content can be controlled at a fine level and more controllable high-definition images can be generated. New constraints can be added as new generation requirements arise in subsequent use, so that the generated content can be extended more richly as needed; and since no real person has to record each video, the method achieves higher production efficiency and richer forms of extension.
Drawings
Fig. 1 is a flowchart of a method for generating a character scene video according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a character scene video generation system according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a character scene video generation apparatus according to an embodiment of the present invention.
Detailed Description
Fig. 1 is a flowchart of a character scene video generation method according to an embodiment of the present invention. As shown in fig. 1, the method comprises the following steps:
S1, acquiring a first image, the first image being a label image carrying constraint conditions, wherein the constraint conditions include a face contour, a human-body key-point skeleton, a human-body contour, a head contour and a background;
S2, receiving and processing the first image with a trained generative adversarial network model to output a second image, the second image being a real image corresponding to the constraint conditions;
S3, acquiring a voice signal;
and S4, combining the second image with the voice signal to generate a character scene video.
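As a concrete illustration of step S4, the following minimal sketch muxes a sequence of generated frames with the acquired voice signal using the ffmpeg command-line tool; the file names, frame rate and codec settings are illustrative assumptions, not values given by this embodiment.

```python
import subprocess

def combine_frames_with_speech(frame_pattern="frames/frame_%04d.png",
                               audio_path="speech.wav",
                               out_path="character_scene.mp4",
                               fps=25):
    """Mux the generated image sequence (second images) with the voice signal."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,  # video: generated frames
        "-i", audio_path,                             # audio: acquired voice signal
        "-c:v", "libx264", "-pix_fmt", "yuv420p",     # widely playable encoding
        "-c:a", "aac",
        "-shortest",                                  # stop at the shorter stream
        out_path,
    ], check=True)

combine_frames_with_speech()
```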
In the present embodiment, the conversion of the constraint-carrying label image into a real image matching those constraints is performed mainly by a trained generative adversarial network (GAN) model. The constraint conditions include a face contour, a human-body key-point skeleton, a human-body contour, a head contour and a background. For example, a face-contour condition can guide the trained model to generate a lifelike face at the corresponding position of the contour, a clothing-contour condition can guide it to generate the corresponding upper body and clothing at the corresponding position, and a human-body key-point condition can guide it to generate a real human body of the corresponding height at the corresponding position.
In this embodiment, acquiring the first image, that is, acquiring the label image carrying the constraint conditions, specifically comprises the following process:
extracting key points and masks from a character scene image or video to construct the label image. For example, to acquire a label image with a face-contour condition, a key-point detection method is applied to the character scene image or video and the detected key points are connected, which yields a label image carrying the face-contour constraint. Similarly, to acquire a label image with a clothing-contour condition, an image segmentation method is used to segment the clothing in the character scene image or video and obtain the mask of the clothing and/or tie, which yields a label image carrying the clothing-contour constraint.
In this embodiment, the training process of the generative adversarial network model comprises the following steps:
P1, constructing a training set, the training set consisting of person image samples, person video samples and label samples, wherein the label samples are obtained by extracting key points and masks from the person image samples and the person video samples;
P2, acquiring the training set to train the generative adversarial network model.
In this embodiment, after the generative adversarial network model has been trained, it is also tested; this process specifically comprises the following steps:
D1, modifying a label sample;
D2, feeding the modified label sample to the generative adversarial network model;
D3, detecting whether the model outputs the image and/or video corresponding to the modified label.
In this embodiment, key points and masks are extracted from the person image samples and person video samples to obtain the label samples;
by changing the key-point coordinate locations and the mask shape, the label samples can be modified.
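A small sketch of this label modification, assuming the key points are an (N, 2) array of (x, y) coordinates and the mask is a binary array; the 40-pixel shift and the 1.1 scale factor are arbitrary illustrative values, not parameters given by this embodiment.

```python
import numpy as np

def modify_label(keypoints: np.ndarray, mask: np.ndarray):
    """Return a modified label sample: shifted key points and a rescaled mask."""
    # Change the key-point coordinate locations, e.g. move the skeleton 40 px right.
    moved = keypoints + np.array([40, 0])

    # Change the mask shape by scaling it by 1.1 about its centroid.
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    scaled = np.zeros_like(mask)
    new_ys = np.clip(((ys - cy) * 1.1 + cy).astype(int), 0, mask.shape[0] - 1)
    new_xs = np.clip(((xs - cx) * 1.1 + cx).astype(int), 0, mask.shape[1] - 1)
    scaled[new_ys, new_xs] = 1
    return moved, scaled
```

The modified label is then fed to the trained model, and the output is checked for a person whose pose and clothing follow the edited key points and mask.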
In this embodiment, the generative adversarial network model comprises a generation network and a discrimination network: the generation network receives the first image and generates the second image, and the discrimination network judges the degree of reality of the second image. That is, after the model receives a label image carrying constraint conditions as input, the generation network produces a real image corresponding to those constraints; for example, when an image carrying a face contour is input, the generation network generates a lifelike face at the corresponding position of the contour.
In this embodiment, the generation network comprises a plurality of sub-networks, including a first sub-network and a second sub-network. That is, the generation network G can be split into two sub-networks G = {G1, G2}, where G1 is an end-to-end network with a U-net structure used to generate a lower-resolution image (e.g. 1024x512) containing global information, and G2 takes the output of G1 and performs local detail enhancement to output a high-resolution image (e.g. 2048x1024). By analogy, if an even higher-definition image needs to be generated, it suffices to add a further detail-enhancement generation network (e.g. G = {G1, G2, G3}).
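The following PyTorch sketch shows the G = {G1, G2} split in miniature; layer counts, widths and the toy fusion of G1's output with the high-resolution label are placeholder choices in the spirit of pix2pixHD, not the exact networks of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalGenerator(nn.Module):
    """G1: end-to-end U-net-style network producing the global, low-res image."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, feat, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.up = nn.Sequential(
            nn.ConvTranspose2d(feat, feat, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, 3, 3, padding=1), nn.Tanh())

    def forward(self, label_lr):
        return self.up(self.down(label_lr))

class LocalEnhancer(nn.Module):
    """G2: enhances G1's output with local detail at twice the resolution."""
    def __init__(self, feat=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(inplace=True))
        self.decode = nn.Sequential(nn.Conv2d(feat, 3, 3, padding=1), nn.Tanh())

    def forward(self, label_hr, coarse):
        coarse_up = F.interpolate(coarse, scale_factor=2, mode="bilinear",
                                  align_corners=False)
        # fuse high-resolution label features with the upsampled global image
        return self.decode(self.encode(label_hr) + self.encode(coarse_up))

label_hr = torch.randn(1, 3, 256, 512)   # stands in for a 2048x1024 label image
label_lr = F.avg_pool2d(label_hr, 2)     # stands in for the 1024x512 input to G1
g1, g2 = GlobalGenerator(), LocalEnhancer()
image_hr = g2(label_hr, g1(label_lr))    # detail-enhanced high-resolution output
```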
As an optional specific implementation, the step in which the discrimination network judges the degree of reality of the second image specifically comprises:
cropping the second image into a plurality of images of different scales;
discriminating on the images of different scales with a multi-scale discriminator to obtain a plurality of discrimination result values;
calculating the average of the plurality of discrimination result values;
and judging the degree of reality of the second image according to the calculated average.
In this embodiment, the second image, i.e. the image output by the generation network, is cropped into 3 images of different scales; the discrimination network D uses a multi-scale discriminator to produce discrimination values at the three image scales, and finally the patch discrimination result values of the three scales are merged into an average value. The three scales of the discrimination network are the original size, 1/2 size and 1/4 size.
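A simplified sketch of this multi-scale discrimination follows; it uses average pooling to produce the 1/2 and 1/4 scales and a deliberately shallow patch discriminator, so it illustrates the averaging scheme rather than the exact network of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Outputs a map of per-patch real/fake scores (toy depth for brevity)."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(feat, 1, 4, stride=2, padding=1))

    def forward(self, x):
        return self.net(x)

class MultiScaleDiscriminator(nn.Module):
    """One patch discriminator per scale: original, 1/2 and 1/4 size."""
    def __init__(self, num_scales=3):
        super().__init__()
        self.scales = nn.ModuleList(PatchDiscriminator() for _ in range(num_scales))

    def forward(self, img):
        scores = []
        for k, d in enumerate(self.scales):
            scaled = F.avg_pool2d(img, 2 ** k) if k > 0 else img  # 1, 1/2, 1/4 size
            scores.append(d(scaled).mean())   # average over the patch score map
        return torch.stack(scores).mean()     # merge the three scale values

d = MultiScaleDiscriminator()
degree_of_reality = d(torch.randn(1, 3, 256, 256))  # single averaged score
```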
In this embodiment, the high-definition character scene video is generated based on a pix2pixHD network, following the conditional-GAN idea. pix2pixHD adds a feature-matching technique: the feature maps of every layer of the discrimination network except the output layer are taken for feature matching, and with the feature-matching loss term added, the loss function of pix2pixHD is:
$$\min_{G}\Bigl(\Bigl(\max_{D_1,D_2,D_3}\sum_{k=1,2,3}\mathcal{L}_{\mathrm{GAN}}(G,D_k)\Bigr)+\lambda\sum_{k=1,2,3}\mathcal{L}_{\mathrm{FM}}(G,D_k)\Bigr)$$
the formula is divided into GAN loss and Feature matching loss, a network D is judged in the GAN loss to continuously and iteratively maximize an objective function, and a network G is generated to continuously and iteratively minimize the GAN loss and Feature matching loss so as to ensure that a clearer and more detailed image is generated.
In summary, the character scene video generation method of this embodiment has the following advantages:
a generative adversarial network model is trained, and a label image carrying constraint conditions is input into the trained model, so that a realistic person picture corresponding to those constraints can be output. The constraints guide the model to generate a real image that matches them, so the generated content can be controlled at a fine level and more controllable high-definition images can be generated. New constraints can be added as new generation requirements arise in subsequent use, so that the generated content can be extended more richly as needed; and since no real person has to record each video, the method achieves higher production efficiency and richer forms of extension.
Referring to fig. 2, an embodiment of the present invention further provides a character scene video generation system comprising a test module and a training module.
The test module is used for:
acquiring a first image, the first image being a label image carrying constraint conditions, wherein the constraint conditions include a face contour, a human-body key-point skeleton, a human-body contour, a head contour and a background;
receiving and processing the first image with a trained generative adversarial network model to output a second image, the second image being a real image corresponding to the constraint conditions;
acquiring a voice signal;
and combining the second image with the voice signal to generate a character scene video.
The training module is used to train the generative adversarial network model through the following process:
constructing a training set, the training set consisting of person image samples, person video samples and label samples, wherein the label samples are obtained by extracting key points and masks from the person image samples and the person video samples;
acquiring the training set to train the generative adversarial network model;
and detecting whether the model outputs the image and/or video corresponding to a given label.
The test module and the training module each refer to a hardware module, a software module, or a combination of hardware and software having the corresponding function; different modules may share the same hardware or software components.
The character scene video generation system may be a server or a personal computer. By writing the character scene video generation method as a computer program and loading that program onto the server or personal computer, running the system achieves the same technical effect as the method itself.
Fig. 3 is a schematic structural diagram of a character scene video generation apparatus according to an embodiment of the present invention. Referring to fig. 3, the apparatus 60 may include a processor 601 and a memory 602, wherein:
the memory 602 is used to store program instructions;
the processor 601 is used to read the program instructions in the memory 602 and, according to those instructions, execute the character scene video generation method of the embodiments shown above.
The memory may also be produced separately and used to store the computer program corresponding to the character scene video generation method. When such a memory is connected to a processor, the stored computer program is read out and executed by the processor, so that the character scene video generation method is implemented and the technical effect of the embodiments is achieved.
The present embodiment also includes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the character scene video generation method of the embodiments shown above.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of upper, lower, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the constituent parts of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language ("e.g.," such as "or the like") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described in the present embodiment to convert the input data to generate output data that is stored to a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (10)

1. A character scene video generation method, comprising:
acquiring a first image, the first image being a label image carrying constraint conditions, wherein the constraint conditions include a face contour, a human-body key-point skeleton, a human-body contour, a head contour and a background;
receiving and processing the first image with a trained generative adversarial network model to output a second image, the second image being a real image corresponding to the constraint conditions;
acquiring a voice signal;
and combining the second image with the voice signal to generate a character scene video.
2. The character scene video generation method of claim 1, further comprising training the generative adversarial network model, comprising:
constructing a training set, the training set consisting of person image samples, person video samples and label samples, wherein the label samples are obtained by extracting key points and masks from the person image samples and the person video samples;
and acquiring the training set to train the generative adversarial network model.
3. The character scene video generation method of claim 2, further comprising testing the generative adversarial network model, comprising:
modifying a label sample;
feeding the modified label sample to the generative adversarial network model;
and detecting whether the model outputs the image and/or video corresponding to the modified label.
4. The character scene video generation method of claim 3, wherein the step of modifying the label sample specifically comprises:
extracting key points and masks from the person image samples and person video samples to obtain the label sample;
and changing the key-point coordinate locations and the mask shape to modify the label sample.
5. The character scene video generation method of claim 3, wherein the generative adversarial network model comprises a generation network and a discrimination network;
the generation network is used to receive the first image and generate the second image;
and the discrimination network is used to judge the degree of reality of the second image.
6. The character scene video generation method of claim 5, wherein the generation network comprises a plurality of sub-networks, including a first sub-network and a second sub-network;
the first sub-network is used to generate an image containing global information;
and the second sub-network is used to perform local detail enhancement on the image generated by the first sub-network, so as to output an image containing local detail features.
7. The character scene video generation method of claim 5, wherein the step in which the discrimination network judges the degree of reality of the second image specifically comprises:
cropping the second image into a plurality of images of different scales;
discriminating on the images of different scales with a multi-scale discriminator to obtain a plurality of discrimination result values;
calculating the average of the plurality of discrimination result values;
and judging the degree of reality of the second image according to the calculated average.
8. A character scene video generation system, comprising a test module and a training module;
the test module is used for:
acquiring a first image, the first image being a label image carrying constraint conditions, wherein the constraint conditions include a face contour, a human-body key-point skeleton, a human-body contour, a head contour and a background;
receiving and processing the first image with a trained generative adversarial network model to output a second image, the second image being a real image corresponding to the constraint conditions;
acquiring a voice signal;
and combining the second image with the voice signal to generate a character scene video;
the training module is used to train the generative adversarial network model through the following process:
constructing a training set, the training set consisting of person image samples, person video samples and label samples, wherein the label samples are obtained by extracting key points and masks from the person image samples and the person video samples;
acquiring the training set to train the generative adversarial network model;
and detecting whether the model outputs the image and/or video corresponding to a given label.
9. A character scene video generation apparatus, comprising a processor and a memory, wherein
the memory is used to store program instructions;
and the processor is used to read the program instructions in the memory and execute the character scene video generation method of any one of claims 1 to 7 according to those instructions.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the character scene video generation method of any one of claims 1 to 7.
CN202010079892.4A 2020-02-04 2020-02-04 Character scene video generation method, system, device and storage medium Pending CN111353069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010079892.4A CN111353069A (en) 2020-02-04 2020-02-04 Character scene video generation method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010079892.4A CN111353069A (en) 2020-02-04 2020-02-04 Character scene video generation method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN111353069A (en) 2020-06-30

Family

ID=71195684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010079892.4A Pending CN111353069A (en) 2020-02-04 2020-02-04 Character scene video generation method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN111353069A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679502A (en) * 2017-10-12 2018-02-09 南京行者易智能交通科技有限公司 A kind of Population size estimation method based on the segmentation of deep learning image, semantic
CN108205659A (en) * 2017-11-30 2018-06-26 深圳市深网视界科技有限公司 Face occluder removes and its method, equipment and the medium of model construction
CN109377448A (en) * 2018-05-20 2019-02-22 北京工业大学 A kind of facial image restorative procedure based on generation confrontation network
CN109635745A (en) * 2018-12-13 2019-04-16 广东工业大学 A method of Multi-angle human face image is generated based on confrontation network model is generated
CN109819313A (en) * 2019-01-10 2019-05-28 腾讯科技(深圳)有限公司 Method for processing video frequency, device and storage medium
CN110008832A (en) * 2019-02-27 2019-07-12 西安电子科技大学 Based on deep learning character image automatic division method, information data processing terminal
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network
CN110349081A (en) * 2019-06-17 2019-10-18 达闼科技(北京)有限公司 Generation method, device, storage medium and the electronic equipment of image

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270651A (en) * 2020-10-15 2021-01-26 西安工程大学 Image restoration method for generating countermeasure network based on multi-scale discrimination
CN112270651B (en) * 2020-10-15 2023-12-15 西安工程大学 Image restoration method for generating countermeasure network based on multi-scale discrimination
CN112329932A (en) * 2020-10-30 2021-02-05 深圳市优必选科技股份有限公司 Training method and device for generating countermeasure network and terminal equipment
CN112734657A (en) * 2020-12-28 2021-04-30 杨文龙 Cloud group photo method and device based on artificial intelligence and three-dimensional model and storage medium
CN112734657B (en) * 2020-12-28 2023-04-07 杨文龙 Cloud group photo method and device based on artificial intelligence and three-dimensional model and storage medium

Similar Documents

Publication Publication Date Title
CN109359538B (en) Training method of convolutional neural network, gesture recognition method, device and equipment
CN111243093B (en) Three-dimensional face grid generation method, device, equipment and storage medium
CN110610453B (en) Image processing method and device and computer readable storage medium
KR102304674B1 (en) Facial expression synthesis method and apparatus, electronic device, and storage medium
US8861800B2 (en) Rapid 3D face reconstruction from a 2D image and methods using such rapid 3D face reconstruction
JP5645079B2 (en) Image processing apparatus and method, program, and recording medium
CN100407798C (en) Three-dimensional geometric mode building system and method
TWI484444B (en) Non-transitory computer readable medium, electronic device, and computer system for face feature vector construction
CN111710036B (en) Method, device, equipment and storage medium for constructing three-dimensional face model
CN109753885A (en) A kind of object detection method, device and pedestrian detection method, system
US8207987B2 (en) Method and apparatus for producing digital cartoons
CN111353069A (en) Character scene video generation method, system, device and storage medium
CN111046763A (en) Portrait cartoon method and device
CN106068537A (en) For the method and apparatus processing image
CN112132739A (en) 3D reconstruction and human face posture normalization method, device, storage medium and equipment
CN111291674A (en) Method, system, device and medium for extracting expression and action of virtual character
CN111667005A (en) Human body interaction system adopting RGBD visual sensing
JP6052533B2 (en) Feature amount extraction apparatus and feature amount extraction method
Liu et al. Stereo video object segmentation using stereoscopic foreground trajectories
JP2017033556A (en) Image processing method and electronic apparatus
KR101305725B1 (en) Augmented reality of logo recognition and the mrthod
CN114862716A (en) Image enhancement method, device and equipment for face image and storage medium
CN116862920A (en) Portrait segmentation method, device, equipment and medium
CN116228850A (en) Object posture estimation method, device, electronic equipment and readable storage medium
CN111368853A (en) Label construction method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200630