CN113886643A - Digital human video generation method and device, electronic equipment and storage medium - Google Patents

Digital human video generation method and device, electronic equipment and storage medium

Info

Publication number
CN113886643A
Authority
CN
China
Prior art keywords
audio
target
region image
audio frame
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111173208.XA
Other languages
Chinese (zh)
Inventor
王鑫宇
杨国基
刘致远
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202111173208.XA
Publication of CN113886643A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure disclose a digital human video generation method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a target audio and a target face image; for an audio frame in the target audio, inputting the audio frame into a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame, where the mouth region image generation model is used to represent the correspondence between audio frames and mouth region images; for an audio frame in the target audio, inputting the mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model to generate a target image corresponding to the audio frame, where the target region image is the region of the target face image other than the mouth region image; and generating a digital human video based on the generated target images. The embodiments of the present disclosure can improve the digital human generation effect.

Description

Digital human video generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of digital human video generation technologies, and in particular, to a digital human video generation method, an apparatus, an electronic device, and a storage medium.
Background
The technology of digital human generation is becoming increasingly mature. A large number of digital human generation technologies already exist, for example, digital human generation methods based on pix2pix, pix2pixHD, Vid2Vid, few-shot video2video, NeRF, StyleGAN, and the like.
However, in these conventional schemes, if the generated face key points are inaccurate or the generated sketch is of poor quality, the finally generated digital human picture is also of poor quality.
Disclosure of Invention
In view of the above, to solve some or all of the technical problems, embodiments of the present disclosure provide a digital human video generation method, apparatus, electronic device and storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for generating a digital human video, where the method includes:
acquiring a target audio and a target face image;
for an audio frame in the target audio, inputting the audio frame into a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame, wherein the mouth region image generation model is used to represent the correspondence between audio frames and mouth region images;
for an audio frame in the target audio, inputting the mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model to generate a target image corresponding to the audio frame, wherein the target image corresponding to the audio frame depicts the person indicated by the target face image uttering the audio indicated by the audio frame, and the target region image is the region of the target face image other than the mouth region image;
and generating a digital human video based on the generated target images.
Optionally, in the method according to any embodiment of the present disclosure, the inputting a mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model to generate a target image corresponding to the audio frame includes:
carrying out channel combination on the mouth region image corresponding to the audio frame and the target region image in the target face image to generate a synthetic image corresponding to the audio frame;
the synthetic image corresponding to the audio frame is input to a pre-trained target image generation model, and a target image corresponding to the audio frame is generated.
Optionally, in the method according to any embodiment of the present disclosure, the mouth region image generation model is obtained by training as follows:
acquiring video data;
extracting an audio frame and a face image corresponding to the audio frame from the video data, taking the extracted audio frame as a sample audio, and taking the extracted face image as a sample face image;
and using a machine learning algorithm with the sample audio as input data of a first generator in a first generative adversarial network, obtaining the mouth region image generated by the first generator for the sample audio, and, if a first discriminator in the first generative adversarial network determines that the mouth region image generated by the first generator satisfies a first preset training end condition, taking the current first generator as the mouth region image generation model.
Optionally, in the method of any embodiment of the present disclosure, the mouth region image is extracted from the sample face image corresponding to the sample audio by:
extracting face key points and mouth contour lines from a sample face image corresponding to the sample audio;
extracting key points of the mouth from the key points of the face;
and generating a mouth region image based on the mouth contour line and the mouth key point.
Optionally, in the method according to any embodiment of the present disclosure, the mouth region image generation model is obtained by training as follows:
acquiring video data;
extracting audio frames and the face images corresponding to the audio frames from the video data, taking a preset number of consecutive audio frames in the video data that contain the extracted audio frame as the sample audio, and taking the extracted face images as sample face images;
and using a machine learning algorithm with the sample audio as input data of a first generator in a first generative adversarial network, obtaining the mouth region image generated by the first generator for the sample audio, and, if a first discriminator in the first generative adversarial network determines that the mouth region image generated by the first generator satisfies a first preset training end condition, taking the current first generator as the mouth region image generation model.
Optionally, in the method according to any embodiment of the present disclosure, the step of training to obtain the mouth region image generation model further includes:
the following training steps are performed:
inputting a sample audio into an initial model to obtain a predicted mouth key point corresponding to the sample audio, wherein the initial model comprises a first sub-model, a second sub-model and a third sub-model, input data of the first sub-model is the sample audio, input data of the second sub-model and input data of the third sub-model are both output data of the first sub-model, output data of the second sub-model is the mouth key point, and output data of the third sub-model is a mouth region image;
calculating a function value of a preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
and if the calculated function value is less than or equal to a preset threshold value, determining a first sub-model and a third sub-model included in the current initial model as the trained mouth region image generation model.
Optionally, in the method according to any embodiment of the present disclosure, the step of training to obtain the mouth region image generation model further includes:
if the function value is larger than the preset threshold value, updating the parameters of the current initial model, and continuing to execute the training step based on the initial model after the parameters are updated.
Optionally, in the method of any embodiment of the present disclosure, after the mouth region image generation model is trained, the target image generation model is trained by:
using a machine learning algorithm with the mouth region image output by the mouth region image generation model and the corresponding target region image as input data of a second generator in a second generative adversarial network, obtaining the target image generated by the second generator for the sample audio, and, if a second discriminator in the second generative adversarial network determines that the target image generated by the second generator satisfies a second preset training end condition, taking the current second generator as the target image generation model.
Optionally, in the method of any embodiment of the present disclosure, the method further includes:
if the second discriminator determines that the target image generated by the second generator does not satisfy the second preset training end condition, updating the current model parameters of the second generator, and continuing training based on the second generative adversarial network with the updated model parameters.
Optionally, in the method according to any embodiment of the present disclosure, the inputting the audio frame to a mouth region image generation model trained in advance to obtain a mouth region image corresponding to the audio frame includes:
extracting the audio features of the audio frame;
and inputting the extracted audio features into a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame.
Optionally, in the method according to any embodiment of the present disclosure, the extracting the audio feature of the audio frame includes:
extracting frequency cepstral coefficient features of the audio frame as the audio features of the audio frame; or
inputting the audio frame into a pre-trained feature extraction model to obtain the audio features of the audio frame, wherein the feature extraction model represents the correspondence between audio frames and their audio features.
In a second aspect, an embodiment of the present disclosure provides a digital human video generating apparatus, where the apparatus includes:
an acquisition unit configured to acquire a target audio and a target face image;
a first input unit, configured to input, for an audio frame in the target audio, the audio frame to a mouth region image generation model trained in advance, so as to obtain a mouth region image corresponding to the audio frame, where the mouth region image generation model is used to represent a correspondence between the audio frame and the mouth region image;
a second input unit configured to, for an audio frame in the target audio, input the mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model to generate a target image corresponding to the audio frame, where the target image corresponding to the audio frame depicts the person indicated by the target face image uttering the audio indicated by the audio frame, and the target region image is the region of the target face image other than the mouth region image;
a generating unit configured to generate a digital human video based on the generated target image.
Optionally, in the apparatus according to any embodiment of the present disclosure, the second input unit is further configured to:
carrying out channel combination on the mouth region image corresponding to the audio frame and the target region image in the target face image to generate a synthetic image corresponding to the audio frame;
the synthetic image corresponding to the audio frame is input to a pre-trained target image generation model, and a target image corresponding to the audio frame is generated.
Optionally, in the apparatus according to any embodiment of the present disclosure, the mouth region image generation model is trained as follows:
acquiring video data;
extracting an audio frame and a face image corresponding to the audio frame from the video data, taking the extracted audio frame as a sample audio, and taking the extracted face image as a sample face image;
and using a machine learning algorithm with the sample audio as input data of a first generator in a first generative adversarial network, obtaining the mouth region image generated by the first generator for the sample audio, and, if a first discriminator in the first generative adversarial network determines that the mouth region image generated by the first generator satisfies a first preset training end condition, taking the current first generator as the mouth region image generation model.
Optionally, in the apparatus according to any embodiment of the present disclosure, the mouth region image is extracted from the sample face image corresponding to the sample audio by:
extracting face key points and mouth contour lines from a sample face image corresponding to the sample audio;
extracting key points of the mouth from the key points of the face;
and generating a mouth region image based on the mouth contour line and the mouth key point.
Optionally, in the apparatus according to any embodiment of the present disclosure, the mouth region image generation model is trained as follows:
acquiring video data;
extracting audio frames and the face images corresponding to the audio frames from the video data, taking a preset number of consecutive audio frames in the video data that contain the extracted audio frame as the sample audio, and taking the extracted face images as sample face images;
and using a machine learning algorithm with the sample audio as input data of a first generator in a first generative adversarial network, obtaining the mouth region image generated by the first generator for the sample audio, and, if a first discriminator in the first generative adversarial network determines that the mouth region image generated by the first generator satisfies a first preset training end condition, taking the current first generator as the mouth region image generation model.
Optionally, in the apparatus according to any embodiment of the present disclosure, the step of training to obtain the mouth region image generation model further includes:
the following training steps are performed:
inputting a sample audio into an initial model to obtain a predicted mouth key point corresponding to the sample audio, wherein the initial model comprises a first sub-model, a second sub-model and a third sub-model, input data of the first sub-model is the sample audio, input data of the second sub-model and input data of the third sub-model are both output data of the first sub-model, output data of the second sub-model is the mouth key point, and output data of the third sub-model is a mouth region image;
calculating a function value of a preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
and if the calculated function value is less than or equal to a preset threshold value, determining a first sub-model and a third sub-model included in the current initial model as the trained mouth region image generation model.
Optionally, in the apparatus according to any embodiment of the present disclosure, the step of training to obtain the mouth region image generation model further includes:
if the function value is larger than the preset threshold value, updating the parameters of the current initial model, and continuing to execute the training step based on the initial model after the parameters are updated.
Optionally, in the apparatus according to any embodiment of the present disclosure, after the mouth region image generation model is trained, the target image generation model is trained as follows:
using a machine learning algorithm with the mouth region image output by the mouth region image generation model and the corresponding target region image as input data of a second generator in a second generative adversarial network, obtaining the target image generated by the second generator for the sample audio, and, if a second discriminator in the second generative adversarial network determines that the target image generated by the second generator satisfies a second preset training end condition, taking the current second generator as the target image generation model.
Optionally, in an apparatus according to any embodiment of the present disclosure, the apparatus further includes:
if the second discriminator determines that the target image generated by the second generator does not satisfy the second preset training end condition, updating the current model parameters of the second generator, and continuing training based on the second generative adversarial network with the updated model parameters.
Optionally, in the apparatus according to any embodiment of the present disclosure, the inputting the audio frame to a mouth region image generation model trained in advance to obtain a mouth region image corresponding to the audio frame includes:
extracting the audio features of the audio frame;
and inputting the extracted audio features into a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame.
Optionally, in an apparatus according to any embodiment of the present disclosure, the extracting an audio feature of the audio frame includes:
extracting frequency cepstral coefficient features of the audio frame as the audio features of the audio frame; or
inputting the audio frame into a pre-trained feature extraction model to obtain the audio features of the audio frame, wherein the feature extraction model represents the correspondence between audio frames and their audio features.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory for storing a computer program;
a processor, configured to execute the computer program stored in the memory, and when the computer program is executed, implement the method of any embodiment of the digital human video generation method of the first aspect of the present disclosure.
In a fourth aspect, the disclosed embodiments provide a computer-readable medium storing a computer program which, when executed by a processor, implements the method of any embodiment of the digital human video generation method of the first aspect.
In a fifth aspect, embodiments of the present disclosure provide a computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the steps in the method as in any of the embodiments of the digital human video generation method of the first aspect described above.
According to the digital human video generation method provided by the above embodiments of the present disclosure, a target audio and a target face image are first acquired. Then, for each audio frame in the target audio, the audio frame is input into a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame, where the mouth region image generation model represents the correspondence between audio frames and mouth region images. Next, for each audio frame in the target audio, the mouth region image corresponding to the audio frame and a target region image in the target face image are input into a pre-trained target image generation model to generate a target image corresponding to the audio frame, where the target image depicts the person indicated by the target face image uttering the audio indicated by the audio frame, and the target region image is the region of the target face image other than the mouth region image. Finally, a digital human video is generated based on the generated target images. In this way, the target image is generated from the mouth region image obtained from the audio frame together with the target region image in the face image, and the digital human video is then generated, so that the generation effect of the digital human video can be improved.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is an exemplary system architecture diagram of a digital human video generation method or a digital human video generation apparatus provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method for generating a digital human video provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of one application scenario for the embodiment of FIG. 2;
FIG. 4A is a flow chart of another method for generating digital human video provided by embodiments of the present disclosure;
FIG. 4B is a flowchart of a further method for generating a digital human video according to an embodiment of the disclosure;
fig. 4C is a schematic structural diagram of a mouth region image generation model in a digital human video generation method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a digital human video generating device provided by an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of parts and steps, numerical expressions, and values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those within the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one object, step, device, or module from another object, and do not denote any particular technical meaning or logical order therebetween.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is an exemplary system architecture diagram of a digital human video generation method or a digital human video generation apparatus provided by an embodiment of the present disclosure.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or transmit data (e.g., target audio and target facial images), etc. Various client applications, such as audio/video processing software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a background server that processes data transmitted by the terminal devices 101, 102, 103. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be further noted that the digital human video generation method provided by the embodiments of the present disclosure may be executed by a server, may also be executed by a terminal device, and may also be executed by the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit, sub-unit, module, sub-module) included in the digital human video generating device may be entirely disposed in the server, may be entirely disposed in the terminal device, and may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the digital human video generation method operates does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which the digital human video generation method operates.
Fig. 2 shows a flow 200 of a digital human video generation method provided by an embodiment of the present disclosure. The digital human video generation method comprises the following steps:
step 201, acquiring a target audio and a target face image.
In this embodiment, an execution subject (for example, a server or a terminal device shown in fig. 1) of the digital human video generation method may acquire the target audio and the target face image from other electronic devices or locally.
The target audio may be any audio. The target audio is the audio to be uttered by the digital human in the video generated in the subsequent steps. For example, the target audio may be speech audio, or audio generated by a machine from text.
The target face image can be any face image. As an example, the target face image may be a shot image containing a face, or a frame of face image extracted from a video.
In some cases, there may be no association between the target audio and the target face image. For example, the target audio may be audio uttered by a first person, and the target face image may be a face image of a second person, where the second person may be a person other than the first person; alternatively, the target audio may be audio emitted by the first person at a first time, and the target facial image may be a facial image of the first person at a second time, where the second time may be any time different from the first time.
Step 202, for the audio frame in the target audio, inputting the audio frame to a mouth region image generation model trained in advance, and obtaining a mouth region image corresponding to the audio frame.
In this embodiment, the executing subject may input, for an audio frame in the target audio, the audio frame to a mouth region image generation model trained in advance, and obtain a mouth region image corresponding to the audio frame. The mouth region image generation model is used for representing the corresponding relation between the audio frame and the mouth region image.
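As an example, the per-frame inference of step 202 may be organized as in the following minimal sketch, which assumes a PyTorch generator taking one audio feature vector per video frame; the model class and the feature format are illustrative assumptions rather than the reference implementation of the present disclosure.

```python
# Minimal sketch of step 202 (illustrative, not the reference implementation):
# each audio frame of the target audio is mapped to a mouth region image by a
# pre-trained mouth region image generation model, assumed here to be a
# PyTorch module taking one feature vector per video frame.
import torch

def generate_mouth_images(audio_frame_features, mouth_model):
    """audio_frame_features: list of 1-D float tensors, one per video frame."""
    mouth_model.eval()
    mouth_images = []
    with torch.no_grad():
        for feats in audio_frame_features:
            mouth_img = mouth_model(feats.unsqueeze(0))   # (1, 1, H, W)
            mouth_images.append(mouth_img.squeeze(0))     # (1, H, W)
    return mouth_images
```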
In some optional implementation manners of this embodiment, the execution subject or the electronic device communicatively connected to the execution subject may train the mouth region image generation model in the following manner:
step one, video data is obtained.
The video data may be any video data containing speech and face images. In the video data, each video frame includes an audio frame and a face image, that is, each audio frame has a corresponding face image. For example, if one second of video data includes 5 frames, i.e. 5 audio frames and 5 face images, the audio frames correspond to the face images one to one.
And step two, extracting audio frames and the face images corresponding to the audio frames from the video data, taking the extracted audio frames as sample audio, and taking the extracted face images as sample face images.
And step three, adopting a machine learning algorithm, with the sample audio as input data of a first generator in a first generative adversarial network, obtaining the mouth region image generated by the first generator for the sample audio, and, if a first discriminator in the first generative adversarial network determines that the mouth region image generated by the first generator satisfies a first preset training end condition, using the current first generator as the mouth region image generation model.
The first preset training end condition may include at least one of the following: the calculated loss function value is less than or equal to a preset threshold; the probability, determined by the first discriminator, that the mouth region image generated by the first generator is the real mouth region image of the sample face image corresponding to the sample audio is 50%.
It is to be understood that, in the above alternative implementation, the mouth region image generation model is obtained based on a generative adversarial network, so that the generation effect of the digital human video can be improved by improving the accuracy of the mouth region images generated by the first generator.
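As an example, the adversarial training of the first generator and the first discriminator described above may be organized as in the following minimal sketch; the network definitions, optimizer settings, and data loader are illustrative assumptions rather than the reference implementation of the present disclosure.

```python
# Hedged sketch of training the first generative adversarial network: the
# first generator maps sample audio to a mouth region image, and the first
# discriminator scores real versus generated images (assumed to output
# probabilities in [0, 1]). Networks, optimizers and the data loader are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def train_first_gan(generator, discriminator, loader, epochs=10, lr=2e-4):
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _ in range(epochs):
        for sample_audio, real_mouth in loader:
            # update the first discriminator on real and generated mouth images
            real_score = discriminator(real_mouth)
            fake_score = discriminator(generator(sample_audio).detach())
            d_loss = (F.binary_cross_entropy(real_score, torch.ones_like(real_score))
                      + F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score)))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
            # update the first generator so its output fools the discriminator
            gen_score = discriminator(generator(sample_audio))
            g_loss = F.binary_cross_entropy(gen_score, torch.ones_like(gen_score))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return generator  # used as the mouth region image generation model
```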
In some application scenarios in the above-described alternative implementations, the mouth region image may be extracted from a sample face image corresponding to the sample audio by:
first, face key points and mouth contours are extracted from a sample face image corresponding to a sample audio.
Then, mouth key points (e.g., 26 key points including the mouth and the chin) are extracted from the face key points (e.g., 68 face key points).
Finally, a mouth region image is generated based on the mouth contour lines and the mouth key points.
As an example, the execution subject may generate the mouth region image based on the mouth contour line and the mouth key points by using an image generation model obtained through supervised or unsupervised training.
It can be understood that, in the above optional implementation manner, a large number of face key points can be obtained from a single frame, and then a small number of mouth key points of the target face image can be obtained from the obtained face key points, so that the accuracy of the mouth key points corresponding to the audio frame can be improved, and the generation effect and speed of the digital human video can be further improved through the subsequent steps.
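As an example, the extraction of mouth key points and of a mouth contour image from a sample face image may be sketched as follows, assuming dlib's 68-point landmark scheme (indices 48-67 are the 20 mouth points); the particular 6 jaw indices used as chin key points are an assumption for illustration only.

```python
# Illustrative sketch of extracting mouth key points and a mouth contour image
# from a sample face image, assuming dlib's 68-point landmark scheme where
# indices 48-67 are the 20 mouth points; the 6 jaw indices used as chin key
# points are an assumption for illustration.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model path

def mouth_region_image(face_img):
    rects = detector(face_img, 1)
    shape = predictor(face_img, rects[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)], dtype=np.int32)
    mouth_pts = pts[48:68]   # 20 mouth key points
    chin_pts = pts[6:12]     # 6 jaw points taken as chin key points (assumption)
    canvas = np.zeros(face_img.shape[:2], dtype=np.uint8)
    cv2.polylines(canvas, [mouth_pts], isClosed=True, color=255, thickness=1)
    cv2.polylines(canvas, [chin_pts], isClosed=False, color=255, thickness=1)
    return canvas, mouth_pts, chin_pts
```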
In some optional implementation manners of this embodiment, the execution subject or the electronic device communicatively connected to the execution subject may also train the mouth region image generation model in the following manner:
step one, video data is obtained.
The video data may be any video data containing speech and face images. In the video data, each video frame includes an audio frame and a face image, that is, each audio frame has a corresponding face image. For example, if one second of video data includes 5 frames, i.e. 5 audio frames and 5 face images, the audio frames correspond to the face images one to one.
And step two, extracting audio frames and the face images corresponding to the audio frames from the video data, taking a preset number (for example, 4) of consecutive audio frames in the video data that contain the extracted audio frame as the sample audio, and taking the extracted face images as sample face images.
And step three, adopting a machine learning algorithm, with the sample audio as input data of a first generator in a first generative adversarial network, obtaining the mouth region image generated by the first generator for the sample audio, and, if a first discriminator in the first generative adversarial network determines that the mouth region image generated by the first generator satisfies a first preset training end condition, using the current first generator as the mouth region image generation model.
The first preset training end condition may include at least one of the following: the calculated loss function value is less than or equal to a preset threshold; the probability, determined by the first discriminator, that the mouth region image generated by the first generator is the real mouth region image of the sample face image corresponding to the sample audio is 50%.
It is understood that, in the above alternative implementation, the mouth region image generation model is obtained based on a generative adversarial network, so that the generation effect of the digital human video can be improved by improving the accuracy of the mouth region image generated by the first generator.
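As an example, assembling the sample audio from a preset number of consecutive audio frames may be sketched as follows; the padding of the first frames is an assumption for illustration.

```python
# Hedged sketch of building the sample audio from a preset number N of
# consecutive audio frames: for the t-th video frame, the features of frames
# t-(N-1) ... t are stacked into one sample. Padding at the start of the
# sequence is an assumption for illustration.
import numpy as np

def make_audio_windows(frame_features, n=4):
    """frame_features: array of shape (num_frames, feat_dim)."""
    windows = []
    for t in range(len(frame_features)):
        window = frame_features[max(0, t - (n - 1)):t + 1]
        if len(window) < n:   # pad the first frames by repeating the earliest one
            pad = np.repeat(window[:1], n - len(window), axis=0)
            window = np.concatenate([pad, window], axis=0)
        windows.append(window)
    return np.stack(windows)  # (num_frames, n, feat_dim)
```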
In some cases, the step of training the mouth region image generation model further includes:
the following training steps (including the first, second and third steps) are performed:
the first step is to input the sample audio into the initial model to obtain the predicted key points of the mouth corresponding to the sample audio. The initial model comprises a first sub-model, a second sub-model and a third sub-model, input data of the first sub-model is sample audio, input data of the second sub-model and input data of the third sub-model are output data of the first sub-model, output data of the second sub-model are mouth key points, and output data of the third sub-model are mouth region images.
A second step of calculating a function value of a preset loss function based on the predicted mouth key point corresponding to the sample audio and the mouth key point extracted from the sample face image corresponding to the sample audio.
And a third step of determining the first sub-model and the third sub-model included in the current initial model as the trained mouth region image generation model if the calculated function value is less than or equal to a preset threshold value.
Optionally, the step of training the mouth region image generation model further includes:
if the function value is larger than the preset threshold value, updating the parameters of the current initial model, and continuing to execute the training step based on the initial model after the parameters are updated.
It can be understood that, in the above alternative implementation, at the stage of using the mouth region image generation model, the second sub-model is not needed to obtain the mouth key points, so the generation efficiency of the digital human video can be improved.
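As an example, the training step described above may be sketched as follows, with the first sub-model as an audio encoder, the second sub-model as a key-point head, and the third sub-model as an image decoder; the module definitions and the loss threshold are illustrative assumptions.

```python
# Minimal sketch of the training step above: an initial model made of a first
# sub-model (audio encoder), a second sub-model (key-point head) and a third
# sub-model (image decoder). Training stops once the key-point loss is at or
# below the preset threshold, and only the first and third sub-models are kept.
# Module definitions are assumptions; in practice the third sub-model would
# also be trained with the image or adversarial losses described elsewhere.
import torch
import torch.nn as nn

def train_initial_model(first, second, third, loader, threshold=1e-3, lr=1e-4):
    params = list(first.parameters()) + list(second.parameters()) + list(third.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    while True:
        for sample_audio, true_keypoints in loader:
            code = first(sample_audio)        # shared audio encoding
            pred_keypoints = second(code)     # predicted mouth key points
            loss = loss_fn(pred_keypoints, true_keypoints)
            if loss.item() <= threshold:
                # the mouth region image generation model keeps only first + third
                return nn.Sequential(first, third)
            opt.zero_grad()
            loss.backward()
            opt.step()
```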
In some optional implementations of this embodiment, the execution subject may perform step 202, that is, input the audio frame into the pre-trained mouth region image generation model to obtain the mouth region image corresponding to the audio frame, in the following manner:
first, the audio features of the audio frame are extracted. The audio features of the audio frame may include, but are not limited to: frequency cepstral coefficient features, timbre features, tonal features, and the like.
In some application scenarios of the foregoing optional implementations, the execution subject may extract the audio features of the audio frame in the following manner: extracting the frequency cepstral coefficient features of the audio frame as the audio features of the audio frame.
In some application scenarios of the foregoing optional implementations, the execution subject may also extract the audio features of the audio frame in the following manner: inputting the audio frame into a pre-trained feature extraction model to obtain the audio features of the audio frame. The feature extraction model represents the correspondence between audio frames and their audio features.
Then, the extracted audio features are input to a mouth region image generation model trained in advance, and a mouth region image corresponding to the audio frame is obtained. The mouth region image generation model may include a sub-model representing the correspondence between the audio features and the mouth region images corresponding to the audio frames.
It is to be understood that, in the above alternative implementation manner, the mouth region image corresponding to the audio frame may be obtained by extracting the audio features of the audio frame, and thus, the generation effect of the digital human video may be further improved through subsequent steps.
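As an example, extracting frequency cepstral coefficient (MFCC) features aligned with the video frame rate may be sketched as follows using librosa; the sampling rate, hop length, and number of coefficients are assumptions for illustration.

```python
# Hedged sketch of extracting frequency cepstral coefficient (MFCC) features
# aligned with the video frame rate using librosa; sampling rate, hop length
# and the number of coefficients are assumptions for illustration.
import librosa

def audio_frame_features(wav_path, fps=25, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // fps   # one feature column per video frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T     # shape: (num_frames, n_mfcc)
```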
Step 203, for the audio frame in the target audio, inputting the mouth region image corresponding to the audio frame and the target region image in the target face image into a pre-trained target image generation model, and generating a target image corresponding to the audio frame.
In this embodiment, for an audio frame in the target audio, the execution subject may input the mouth region image corresponding to the audio frame and the target region image in the target face image into a pre-trained target image generation model to generate a target image corresponding to the audio frame. The target image corresponding to the audio frame depicts the person indicated by the target face image uttering the audio indicated by the audio frame, and the target region image is the region of the target face image other than the mouth region image.
In some optional implementation manners of this embodiment, the executing subject may execute step 203 in the following manner, so as to input the mouth region image corresponding to the audio frame and the target region image in the target face image into a pre-trained target image generation model, and generate a target image corresponding to the audio frame:
firstly, the mouth region image corresponding to the audio frame and the target region image in the target face image are subjected to channel combination to generate a synthetic image corresponding to the audio frame.
Then, the synthetic image corresponding to the audio frame is input to a pre-trained target image generation model, and a target image corresponding to the audio frame is generated. The target image generation model may include a sub-model for characterizing the correspondence between the composite image and the target image.
It can be understood that, in the above alternative implementation manner, the generation effect of the digital human video is further improved by performing channel merging on the mouth region image corresponding to the audio frame and the target region image in the target face image.
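As an example, the channel merging of step 203 may be sketched as follows: the single-channel mouth region image and the three-channel target region image are concatenated along the channel axis into a four-channel composite image that is fed to the target image generation model. The shapes, value ranges, and model interface below are illustrative assumptions.

```python
# Minimal sketch of the channel merging in step 203: the single-channel mouth
# region image and the 3-channel target region image are concatenated along
# the channel axis into a 4-channel composite, which is fed to the target
# image generation model. Shapes, value ranges and the model interface are
# illustrative assumptions.
import numpy as np
import torch

def make_composite(mouth_img, target_region_img):
    """mouth_img: (H, W) uint8; target_region_img: (H, W, 3) uint8."""
    return np.concatenate([target_region_img, mouth_img[..., None]], axis=-1)  # (H, W, 4)

def generate_target_image(composite, target_model):
    x = torch.from_numpy(composite).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    with torch.no_grad():
        return target_model(x)   # target image for this audio frame
```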
In some optional implementation manners of this embodiment, after the mouth region image generation model is trained, the execution subject or an electronic device communicatively connected to the execution subject may obtain the target image generation model by training as follows:
using a machine learning algorithm with the mouth region image output by the mouth region image generation model and the corresponding target region image as input data of a second generator in a second generative adversarial network, the target image generated by the second generator for the sample audio is obtained; if a second discriminator in the second generative adversarial network determines that the target image generated by the second generator satisfies a second preset training end condition, the current second generator is taken as the target image generation model.
Optionally, if the second discriminator determines that the target image generated by the second generator does not satisfy the second preset training end condition, the current model parameters of the second generator are updated, and training continues based on the second generative adversarial network with the updated model parameters.
The second preset training end condition may include at least one of the following: the calculated loss function value is less than or equal to a preset threshold; the probability, determined by the second discriminator, that the target image generated by the second generator is the real target image corresponding to the sample audio is 50%.
It will be appreciated that, in the above alternative implementation, the target image generation model is obtained based on a generative adversarial network, so that the generation effect of the digital human video can be improved by improving the accuracy of the target image generated by the second generator.
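As an example, one update of the second generative adversarial network may be sketched as follows, mirroring the first network but conditioning the second generator on the composite of the mouth region image and the target region image; the networks, optimizers, and end-condition check are illustrative assumptions.

```python
# Hedged sketch of one update of the second generative adversarial network,
# mirroring the first network but conditioning the second generator on the
# composite of mouth region image and target region image. Networks,
# optimizers and the end-condition check are illustrative assumptions.
import torch
import torch.nn.functional as F

def second_gan_step(gen2, disc2, composite, real_target, g_opt, d_opt):
    # update the second discriminator on real and generated target images
    real_score = disc2(real_target)
    fake_score = disc2(gen2(composite).detach())
    d_loss = (F.binary_cross_entropy(real_score, torch.ones_like(real_score))
              + F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # update the second generator; training continues while the end condition is unmet
    gen_score = disc2(gen2(composite))
    g_loss = F.binary_cross_entropy(gen_score, torch.ones_like(gen_score))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return g_loss.item()
```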
And step 204, generating a digital human video based on the generated target image.
In the present embodiment, the execution body described above may generate a digital human video based on the respective target images generated.
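As an example, assembling the generated target images and the target audio into the digital human video may be sketched as follows; the use of OpenCV for frame writing and ffmpeg for audio muxing is an assumption rather than the toolchain of the present disclosure.

```python
# Hedged sketch of step 204: the generated target images are written as video
# frames and the target audio is muxed in afterwards. Using OpenCV for frame
# writing and ffmpeg for audio muxing is an assumption, not the toolchain of
# the present disclosure.
import subprocess
import cv2

def write_digital_human_video(target_images, audio_path, out_path, fps=25):
    h, w = target_images[0].shape[:2]
    writer = cv2.VideoWriter("frames_only.mp4",
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in target_images:   # frames assumed to be BGR uint8 arrays
        writer.write(frame)
    writer.release()
    # mux the target audio with the rendered frames
    subprocess.run(["ffmpeg", "-y", "-i", "frames_only.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path], check=True)
```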
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the digital human video generation method according to this embodiment. In fig. 3, a server 310 (i.e., the execution subject) first acquires a target audio 301 and a target face image 305. Then, for the audio frame 302 in the target audio 301, the server 310 inputs the audio frame 302 into the pre-trained mouth region image generation model 303 to obtain the mouth region image 304 corresponding to the audio frame 302, where the mouth region image generation model 303 represents the correspondence between audio frames and mouth region images. Next, for the audio frame 302 in the target audio 301, the server 310 inputs the mouth region image 304 corresponding to the audio frame 302 and the target region image 306 in the target face image 305 into the pre-trained target image generation model 307 to generate the target image 308 corresponding to the audio frame 302, where the target image 308 depicts the person indicated by the target face image 305 uttering the audio indicated by the audio frame 302, and the target region image 306 is the region of the target face image 305 other than the mouth region image. Finally, the server 310 generates the digital human video 309 based on the generated target images 308.
The method provided by the above embodiment of the present disclosure first acquires a target audio and a target face image. Then, for each audio frame in the target audio, the audio frame is input into a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame, where the mouth region image generation model represents the correspondence between audio frames and mouth region images. Next, for each audio frame in the target audio, the mouth region image corresponding to the audio frame and a target region image in the target face image are input into a pre-trained target image generation model to generate a target image corresponding to the audio frame, where the target image depicts the person indicated by the target face image uttering the audio indicated by the audio frame, and the target region image is the region of the target face image other than the mouth region image. Finally, a digital human video is generated based on the generated target images. In this way, the target image is generated from the mouth region image obtained from the audio frame and the target region image in the face image, and the digital human video is then generated, so that the generation effect of the digital human video can be improved.
With further reference to fig. 4A, a flow 400 of yet another embodiment of a digital human video generation method is shown. The process of the digital human video generation method comprises the following steps:
step 401, acquiring a target audio and a target face image.
Step 402, for the audio frame in the target audio, inputting the audio frame to a mouth region image generation model trained in advance, and obtaining a mouth region image corresponding to the audio frame. The mouth region image generation model is used for representing the corresponding relation between the audio frame and the mouth region image.
Step 403, for the audio frame in the target audio, performing channel merging on the mouth region image corresponding to the audio frame and the target region image in the target face image, and generating a synthetic image corresponding to the audio frame. The target area image is an area image of the target face image except for the mouth area image.
Step 404, for the audio frame in the target audio, inputting the synthetic image corresponding to the audio frame into a pre-trained target image generation model, and generating a target image corresponding to the audio frame. The target image corresponding to the audio frame depicts the person indicated by the target face image uttering the audio indicated by the audio frame.
Step 405, generating a digital human video based on the generated target image.
As an example, the digital human video generation method in the present embodiment may be performed as follows:
first, the format of data is described:
in this embodiment, the size of the face sketch in the digital human video generation method is 512 × 1; the size of the target face image is 512 × 3; the face sketch and the target face image are combined to form a whole with the size of 512 x 4.
Referring to fig. 4B, the implementation process of the specific scheme is described as follows:
After the user audio (i.e., the target audio) is obtained, audio features are extracted from it. Based on the video picture frame corresponding to the user audio (i.e., the target face image), 68 face key points are extracted, and the key points of the mouth region (for example, 20 mouth key points and 6 chin key points) are taken from them as the input of the LMGAN model (i.e., the mouth region image generation model); the output of the LMGAN model is a picture of the mouth contour line (i.e., the mouth region image). The mouth contour line picture and the real picture (the target region image in the target face image) are combined and input into the GAN (i.e., the target image generation model), which outputs the digital human fake image (i.e., the target image). A corresponding digital human video (one video includes multiple frames of pictures) can then be output based on the multiple frames of digital human fake images output by the GAN generation model.
The sound inference model may be configured to extract audio features from the audio, where the input sound may be in WAV format and the frame rate may be 100, 50, or 25. WAV is a lossless audio file format. The sound features may be MFCC features, or features extracted by a model such as DeepSpeech, an ASR model, or wav2vec. The sound inference model may be an LSTM, BERT (Bidirectional Encoder Representations from Transformers), Transformer, CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), or the like.
In the training phase, this can be performed by:
first, video data is prepared, the video data including audio (i.e., sample audio) and pictures (i.e., sample face images corresponding to the sample audio).
Then, the data are processed at a frame rate of 25 frames per second: audio features are extracted from the audio, and face key points and the corresponding Canny lines are extracted from the pictures. That is, for each video frame, audio features are extracted from the video audio (the sample audio) and 68 face key points are extracted from the video picture (i.e., the sample face image corresponding to the sample audio). The audio features may be MFCC features obtained via the Fourier transform, features extracted with a DeepSpeech model, or features extracted with another algorithm (e.g., an ASR speech recognition model).
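A hedged sketch of the 68-key-point extraction step, assuming the dlib frontal face detector and its standard 68-landmark predictor are used (the embodiment does not mandate a particular key point extractor):

```python
import dlib
import numpy as np

# Assumed: the standard dlib 68-landmark predictor file is available locally.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_68_keypoints(image: np.ndarray) -> np.ndarray:
    """Return the 68 face key points of the first detected face as a (68, 2) array.

    Assumes at least one face is present in the picture.
    """
    faces = detector(image, 1)
    shape = predictor(image, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()])
```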
Then, the key points of the mouth region (26 key points in total for the mouth and the chin) are intercepted and used as training input of the LMGAN model. Specifically, after the 68 face key points are extracted, the 20 mouth key points and 6 chin key points (26 key points in total) are taken and connected to form a mouth contour line graph, and the LMGAN is trained with the mouth contour line graph and the audio (or the extracted audio features), so that the LMGAN model can be trained. The LMGAN model is a generator from sound to mouth region images: with sound or sound features as input, it outputs a picture of the mouth contour line (i.e., the mouth region image).
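One possible way to connect the 26 key points into a mouth contour line graph, sketched with OpenCV (the landmark indices and drawing parameters are illustrative assumptions):

```python
import cv2
import numpy as np

def mouth_contour_sketch(landmarks: np.ndarray, size: int = 512) -> np.ndarray:
    """Draw a single-channel mouth contour sketch from 68 face key points.

    Index choices are illustrative: 48-67 are the 20 mouth key points in the
    common 68-point layout, and the 6 chin key points are assumed to be taken
    from the centre of the jaw line.
    """
    mouth = landmarks[48:68].astype(np.int32).reshape(-1, 1, 2)   # 20 mouth key points
    chin = landmarks[6:12].astype(np.int32).reshape(-1, 1, 2)     # 6 chin key points (assumed indices)
    sketch = np.zeros((size, size), dtype=np.uint8)
    cv2.polylines(sketch, [mouth], isClosed=True, color=255, thickness=1)
    cv2.polylines(sketch, [chin], isClosed=False, color=255, thickness=1)
    return sketch[..., np.newaxis]                                # 512 x 512 x 1
```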
The LMGAN training mode may be: a mouth picture (i.e., a mouth region image) of one frame of picture is trained using one frame of audio data or multiple frames of audio data. Specifically, when a mouth picture of one frame is trained using N frames of audio data, for example when the mouth region image of the t-th frame picture is trained, the mouth region image of the t-th frame picture may be trained using the audio data corresponding to the t-th frame and the t-1, t-2, …, t-(N-1)-th frames, so that the generation effect of the mouth region image is improved and the generation effect of the digital human picture is better. N may be greater than 1; the larger N is, the better the mouth generation effect. In addition, the current frame of audio and the previous 4 frames of audio may be used to generate the mouth region image of the current frame, which balances the generation effect and the generation efficiency.
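A small helper illustrating the N-frame audio window described above (N = 5, i.e., the current frame plus the previous 4 frames; the padding strategy at the start of the sequence is an assumption):

```python
import numpy as np

def audio_window(features: np.ndarray, t: int, n: int = 5) -> np.ndarray:
    """Stack the audio features of frame t and the previous n - 1 frames.

    features is a (num_frames, feature_dim) array; frames before index 0 are
    padded by repeating the first frame. n = 5 corresponds to the current frame
    plus the previous 4 frames.
    """
    indices = [max(t - i, 0) for i in range(n - 1, -1, -1)]   # [t-(n-1), ..., t-1, t]
    return features[indices]                                   # shape: (n, feature_dim)
```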
In addition, the computation of a loss function for an additional submodel of the LMGAN (the second submodel 412 in fig. 4C) may be added. As shown in fig. 4C, the LMGAN may include a first submodel 411, a second submodel 412, and a third submodel 413, where the first submodel 411 may be an encoder and the third submodel 413 may be a decoder. After the sound encoding vector output by the encoder passes through an LSTM layer and a fully connected layer, the corresponding 26 mouth key points can be generated, i.e., 26 inferred key points are obtained. A function value of the loss function is calculated based on the 26 inferred key points and the true key points; this function value serves as the loss of the mouth generator of the GAN, and can further be used to judge whether the LMGAN model has converged and thus whether the training of the LMGAN model is complete.
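A PyTorch sketch of the second submodel and its key point loss, under the assumption that the encoder outputs fixed-size sound encoding vectors and that the preset loss function is a mean squared error over the 26 key points (all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Illustrative second submodel: sound encoding -> LSTM -> fully connected layer -> 26 mouth key points."""

    def __init__(self, enc_dim: int = 256, hidden: int = 128, num_points: int = 26):
        super().__init__()
        self.lstm = nn.LSTM(enc_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_points * 2)            # an (x, y) pair per key point

    def forward(self, enc_seq: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(enc_seq)                            # (batch, seq_len, hidden)
        return self.fc(out[:, -1]).view(-1, 26, 2)             # 26 inferred key points

# Loss between the inferred key points and the true key points extracted from the picture.
head = KeypointHead()
enc_seq = torch.randn(8, 5, 256)                               # batch of 5-frame sound encoding vectors
true_points = torch.randn(8, 26, 2)
loss = nn.functional.mse_loss(head(enc_seq), true_points)      # function value of the loss function
```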
When the training of the face key point model is finished, the GAN (i.e., the target image generation model) is trained. The input of the GAN is the mouth contour line diagram and the face picture (without the region around the mouth), and the output of the GAN is the final generated picture (i.e., the target image).
In the inference stage:
The audio (i.e., the target audio) is obtained; its audio features are input into the LMGAN model to obtain the mouth contour line picture (i.e., the mouth region image); the contour line picture and the real picture (i.e., the target region image) are then input into the GAN to obtain the final picture (i.e., the target image).
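Putting the inference stage together, a hedged end-to-end sketch (the model call signatures are assumptions standing in for the trained LMGAN and GAN):

```python
import numpy as np

def generate_frame(audio_features: np.ndarray,
                   face_without_mouth: np.ndarray,
                   lmgan_model,
                   target_gan_model) -> np.ndarray:
    """One inference step: audio features -> mouth contour sketch -> channel merge -> final picture.

    lmgan_model and target_gan_model stand in for the trained LMGAN and GAN;
    their call signatures and output shapes are assumptions for illustration.
    """
    mouth_sketch = lmgan_model(audio_features)                           # 512 x 512 x 1 contour picture
    merged = np.concatenate([mouth_sketch, face_without_mouth], axis=-1) # 512 x 512 x 4 input
    return target_gan_model(merged)                                      # final digital human frame
```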
In this embodiment, the specific implementation manners of the steps 401 to 405 may refer to the related descriptions of the embodiment corresponding to fig. 2, and are not repeated herein. In addition, besides the above-mentioned contents, the embodiment of the present disclosure may further include the same or similar features and effects as the embodiment corresponding to fig. 2, and details are not repeated herein.
In this embodiment, the digital human video generation method can generate Canny pictures (i.e., mouth region images) directly from sound in the application stage, without needing to generate face key points, and is therefore more efficient. The mouth contour line graph of the current frame is generated using N (N > 1) frames of audio, then combined with the face picture of the current frame (a picture without the region around the mouth) by channel merging, and the merged result is input to the GAN to obtain the final picture, so that the generation effect is more natural.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a digital human video generating apparatus, which corresponds to the above-described method embodiment, and which may include the same or corresponding features as the above-described method embodiment and produce the same or corresponding effects as the above-described method embodiment, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 5, the digital human video generating apparatus 500 of the present embodiment includes: an acquisition unit 501, a first input unit 502, a second input unit 503, and a generation unit 504. The acquisition unit 501 is configured to acquire a target audio and a target face image; the first input unit 502 is configured to input, for an audio frame in the target audio, the audio frame to a mouth region image generation model trained in advance, so as to obtain a mouth region image corresponding to the audio frame, where the mouth region image generation model is used to represent a correspondence between the audio frame and the mouth region image; the second input unit 503 is configured to input, for an audio frame in the target audio, a mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model, and generate a target image corresponding to the audio frame, where the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame, and the target region image is a region image in the target face image except for the mouth region image; the generation unit 504 is configured to generate a digital human video based on the generated target image.
In the present embodiment, the acquisition unit 501 of the digital human video generating apparatus 500 may acquire a target audio and a target face image.
In this embodiment, the first input unit 502 may input, for an audio frame in the target audio, the audio frame to a mouth region image generation model trained in advance, and obtain a mouth region image corresponding to the audio frame, where the mouth region image generation model is used to represent a correspondence between the audio frame and the mouth region image.
In this embodiment, the second input unit 503 may input, for an audio frame in the target audio, a mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model, and generate a target image corresponding to the audio frame, where the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame, and the target region image is a region image in the target face image except for the mouth region image.
In the present embodiment, the generation unit 504 may generate a digital human video based on the generated target image.
In some optional implementations of the present embodiment, the second input unit 503 is further configured to:
carrying out channel combination on the mouth region image corresponding to the audio frame and the target region image in the target face image to generate a synthetic image corresponding to the audio frame;
the synthetic image corresponding to the audio frame is input to a pre-trained target image generation model, and a target image corresponding to the audio frame is generated.
In some optional implementations of the present embodiment, the mouth region image generation model is trained as follows:
acquiring video data;
extracting an audio frame and a face image corresponding to the audio frame from the video data, taking the extracted audio frame as a sample audio, and taking the extracted face image as a sample face image;
and obtaining a mouth region image generated by a first generator corresponding to the sample audio by using a machine learning algorithm and taking the sample audio as input data of the first generator in the first generative confrontation network, and taking the current first generator as a mouth region image generation model if a first discriminator in the first generative confrontation network determines that the mouth region image generated by the first generator satisfies a first preset training end condition.
In some alternative implementations of the present embodiment, the mouth region image is extracted from the sample face image corresponding to the sample audio by:
extracting face key points and mouth contour lines from a sample face image corresponding to the sample audio;
extracting key points of the mouth from the key points of the face;
and generating a mouth region image based on the mouth contour line and the mouth key point.
In some optional implementations of the present embodiment, the mouth region image generation model is trained as follows:
acquiring video data;
extracting audio frames and face images corresponding to the audio frames from the video data, taking a preset number of continuous audio frames containing the extracted audio frames in the video data as sample audio, and taking the extracted face images as sample face images;
and obtaining a mouth region image generated by a first generator corresponding to the sample audio by using a machine learning algorithm and taking the sample audio as input data of the first generator in the first generative confrontation network, and taking the current first generator as a mouth region image generation model if a first discriminator in the first generative confrontation network determines that the mouth region image generated by the first generator satisfies a first preset training end condition.
In some optional implementations of this embodiment, the step of training to obtain the mouth region image generation model further includes:
the following training steps are performed:
inputting a sample audio into an initial model to obtain a predicted mouth key point corresponding to the sample audio, wherein the initial model comprises a first sub-model, a second sub-model and a third sub-model, input data of the first sub-model is the sample audio, input data of the second sub-model and input data of the third sub-model are both output data of the first sub-model, output data of the second sub-model is the mouth key point, and output data of the third sub-model is a mouth region image;
calculating a function value of a preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
and if the calculated function value is less than or equal to a preset threshold value, determining a first sub-model and a third sub-model included in the current initial model as the trained mouth region image generation model.
In some optional implementations of this embodiment, the step of training to obtain the mouth region image generation model further includes:
if the function value is larger than the preset threshold value, updating the parameters of the current initial model, and continuing to execute the training step based on the initial model after the parameters are updated.
In some optional implementations of the embodiment, after the mouth region image generation model is trained, the target image generation model is trained by:
and obtaining a target image generated by a second generator corresponding to the sample audio by using a machine learning algorithm and taking the mouth region image output by the mouth region image generation model and the corresponding target region image as input data of the second generator in a second generation type confrontation network, and taking the current second generator as a target image generation model if a second discriminator in the second generation type confrontation network determines that the target image generated by the second generator meets a second preset training end condition.
In some optional implementations of this embodiment, the apparatus 500 further includes:
if the second discriminator determines that the target image generated by the second generator does not meet the second preset training end condition, updating the current model parameters of the second generator, and continuing training based on the second generative confrontation network after the model parameters are updated.
In some optional implementation manners of this embodiment, the inputting the audio frame to a mouth region image generation model trained in advance to obtain a mouth region image corresponding to the audio frame includes:
extracting the audio features of the audio frame;
and inputting the extracted audio features into a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame.
In some optional implementations of this embodiment, the extracting the audio feature of the audio frame includes:
extracting the frequency cepstrum coefficient characteristics of the audio frame to serve as the audio characteristics of the audio frame; or
And inputting the audio frame into a pre-trained feature extraction model to obtain the audio features of the audio frame, wherein the feature extraction model represents the corresponding relation between the audio frame and the audio features of the audio frame.
In the apparatus 500 provided by the above embodiment of the present disclosure, the obtaining unit 501 may obtain a target audio and a target face image, then the first input unit 502 may input, for an audio frame in the target audio, the audio frame into a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame, where the mouth region image generation model is used to represent a corresponding relationship between the audio frame and the mouth region image, then the second input unit 503 may input, for an audio frame in the target audio, a mouth region image corresponding to the audio frame and a target region image in the target face image into the pre-trained target image generation model to generate a target image corresponding to the audio frame, where the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame, the target area image is an area image of the target face image excluding the mouth area image, and finally, the generating unit 504 may generate a digital human video based on the generated target image. Therefore, the target image is generated through the mouth area image obtained by the audio frame and the target area image in the face image, and the digital human video is further generated, so that the generation effect of the digital human video can be improved.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, where the electronic device 600 shown in fig. 6 includes: at least one processor 601, memory 602, and at least one network interface 604 and other user interfaces 603. The various components in the electronic device 600 are coupled together by a bus system 605. It is understood that the bus system 605 is used to enable communications among the components. The bus system 605 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 605 in fig. 6.
The user interface 603 may include a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen), among others.
It will be appreciated that the memory 602 in embodiments of the disclosure may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 602 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 602 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 6021 and application programs 6022.
The operating system 6021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program 6022 includes various application programs, such as a media player and a browser, and is used to implement various application services. Programs that implement methods of embodiments of the disclosure can be included in the application program 6022.
In the embodiment of the present disclosure, by calling a program or an instruction stored in the memory 602, specifically, a program or an instruction stored in the application program 6022, the processor 601 is configured to execute the method steps provided by the method embodiments, for example, including: acquiring a target audio and a target face image; inputting the audio frame into a mouth region image generation model trained in advance aiming at the audio frame in the target audio to obtain a mouth region image corresponding to the audio frame, wherein the mouth region image generation model is used for representing the corresponding relation between the audio frame and the mouth region image; for an audio frame in the target audio, inputting a mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model, and generating a target image corresponding to the audio frame, wherein the target image corresponding to the audio frame is used for indicating a person indicated by the target face image to send out audio indicated by the audio frame, and the target region image is a region image except the mouth region image in the target face image; based on the generated target image, a digital human video is generated.
The method disclosed by the embodiment of the present disclosure can be applied to the processor 601 or implemented by the processor 601. The processor 601 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 601. The processor 601 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software elements in the decoding processor. The software elements may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 602, and the processor 601 reads the information in the memory 602 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented in one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be the electronic device shown in fig. 6, and may execute all the steps of the digital human video generation method shown in fig. 2, so as to achieve the technical effect of the digital human video generation method shown in fig. 2.
The disclosed embodiments also provide a storage medium (computer-readable storage medium). The storage medium herein stores one or more programs. Among others, the storage medium may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; the memory may also comprise a combination of memories of the kind described above.
When one or more programs in the storage medium are executable by one or more processors, the above-described digital human video generation method executed on the electronic device side is implemented.
The processor is configured to execute the communication program stored in the memory to implement the following steps of the digital human video generation method executed on the electronic device side: acquiring a target audio and a target face image; inputting the audio frame into a mouth region image generation model trained in advance aiming at the audio frame in the target audio to obtain a mouth region image corresponding to the audio frame, wherein the mouth region image generation model is used for representing the corresponding relation between the audio frame and the mouth region image; for an audio frame in the target audio, inputting a mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model, and generating a target image corresponding to the audio frame, wherein the target image corresponding to the audio frame is used for indicating a person indicated by the target face image to send out audio indicated by the audio frame, and the target region image is a region image except the mouth region image in the target face image; based on the generated target image, a digital human video is generated.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present disclosure in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present disclosure, and are not intended to limit the scope of the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (11)

1. A method for generating a digital human video, the method comprising:
acquiring a target audio and a target face image;
for an audio frame in the target audio, inputting the audio frame to a pre-trained mouth region image generation model to obtain a mouth region image corresponding to the audio frame, wherein the mouth region image generation model is used for representing a corresponding relation between the audio frame and the mouth region image;
for an audio frame in the target audio, inputting a mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model, and generating a target image corresponding to the audio frame, wherein the target image corresponding to the audio frame is used for indicating a person indicated by the target face image to send out audio indicated by the audio frame, and the target region image is a region image except the mouth region image in the target face image;
based on the generated target image, a digital human video is generated.
2. The method according to claim 1, wherein the inputting the mouth region image corresponding to the audio frame and the target region image in the target face image into a pre-trained target image generation model to generate the target image corresponding to the audio frame comprises:
carrying out channel combination on the mouth region image corresponding to the audio frame and the target region image in the target face image to generate a synthetic image corresponding to the audio frame;
the synthetic image corresponding to the audio frame is input to a pre-trained target image generation model, and a target image corresponding to the audio frame is generated.
3. The method of claim 1, wherein the mouth region image generation model is trained by:
acquiring video data;
extracting an audio frame and a face image corresponding to the audio frame from the video data, taking the extracted audio frame as a sample audio, and taking the extracted face image as a sample face image;
and using a machine learning algorithm, using a sample audio as input data of a first generator in a first generative confrontation network, obtaining a mouth region image which corresponds to the sample audio and is generated by the first generator, and using the current first generator as a mouth region image generation model if a first discriminator in the first generative confrontation network determines that the mouth region image generated by the first generator meets a first preset training end condition.
4. The method according to claim 3, characterized in that the mouth region image is extracted from the sample face image corresponding to the sample audio by:
extracting face key points and mouth contour lines from a sample face image corresponding to the sample audio;
extracting key points of a mouth from the key points of the face;
generating a mouth region image based on the mouth contour line and the mouth key point.
5. The method of claim 1, wherein the mouth region image generation model is trained by:
acquiring video data;
extracting audio frames and face images corresponding to the audio frames from the video data, taking a preset number of continuous audio frames containing the extracted audio frames in the video data as sample audio, and taking the extracted face images as sample face images;
and using a machine learning algorithm, using a sample audio as input data of a first generator in a first generative confrontation network, obtaining a mouth region image which corresponds to the sample audio and is generated by the first generator, and using the current first generator as a mouth region image generation model if a first discriminator in the first generative confrontation network determines that the mouth region image generated by the first generator meets a first preset training end condition.
6. The method according to any one of claims 3-5, wherein the step of training the mouth region image generative model further comprises:
the following training steps are performed:
inputting a sample audio into an initial model to obtain a predicted mouth key point corresponding to the sample audio, wherein the initial model comprises a first sub-model, a second sub-model and a third sub-model, input data of the first sub-model is the sample audio, input data of the second sub-model and input data of the third sub-model are both output data of the first sub-model, output data of the second sub-model is the mouth key point, and output data of the third sub-model is a mouth region image;
calculating a function value of a preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
and if the calculated function value is less than or equal to a preset threshold value, determining a first sub-model and a third sub-model included in the current initial model as the trained mouth region image generation model.
7. The method of claim 6, wherein the step of training the mouth region image generation model further comprises:
and if the function value is larger than the preset threshold value, updating the parameters of the current initial model, and continuing to execute the training step based on the initial model after the parameters are updated.
8. The method according to any one of claims 3 to 5, wherein after the mouth region image generation model is trained, the target image generation model is trained by:
and using a machine learning algorithm, taking the mouth region image output by the mouth region image generation model and the corresponding target region image as input data of a second generator in a second generation type confrontation network to obtain a target image which corresponds to the sample audio and is generated by the second generator, and taking the current second generator as a target image generation model if a second discriminator in the second generation type confrontation network determines that the target image generated by the second generator meets a second preset training end condition.
9. A digital human video generating apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a target audio and a target face image;
a first input unit, configured to input, for an audio frame in the target audio, the audio frame to a mouth region image generation model trained in advance, so as to obtain a mouth region image corresponding to the audio frame, where the mouth region image generation model is used to represent a correspondence between the audio frame and the mouth region image;
a second input unit configured to input, for an audio frame in the target audio, a mouth region image corresponding to the audio frame and a target region image in the target face image into a pre-trained target image generation model, and generate a target image corresponding to the audio frame, where the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame, and the target region image is a region image in the target face image except for the mouth region image;
a generating unit configured to generate a digital human video based on the generated target image.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing a computer program stored in the memory, and when executed, implementing the method of any of the preceding claims 1-8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 8.
CN202111173208.XA 2021-09-30 2021-09-30 Digital human video generation method and device, electronic equipment and storage medium Pending CN113886643A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111173208.XA CN113886643A (en) 2021-09-30 2021-09-30 Digital human video generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111173208.XA CN113886643A (en) 2021-09-30 2021-09-30 Digital human video generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113886643A true CN113886643A (en) 2022-01-04

Family

ID=79005555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111173208.XA Pending CN113886643A (en) 2021-09-30 2021-09-30 Digital human video generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113886643A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114845160A (en) * 2022-04-28 2022-08-02 北京瑞莱智慧科技有限公司 Voice-driven video processing method, related device and storage medium
CN114845160B (en) * 2022-04-28 2024-04-23 北京生数科技有限公司 Voice-driven video processing method, related device and storage medium
CN115278297A (en) * 2022-06-14 2022-11-01 北京达佳互联信息技术有限公司 Data processing method, device and equipment based on drive video and storage medium
CN115278297B (en) * 2022-06-14 2023-11-28 北京达佳互联信息技术有限公司 Data processing method, device, equipment and storage medium based on drive video
WO2024078293A1 (en) * 2022-10-14 2024-04-18 北京字跳网络技术有限公司 Image processing method and apparatus, electronic device, and storage medium
CN116071472A (en) * 2023-02-08 2023-05-05 华院计算技术(上海)股份有限公司 Image generation method and device, computer readable storage medium and terminal
CN116071472B (en) * 2023-02-08 2024-04-30 华院计算技术(上海)股份有限公司 Image generation method and device, computer readable storage medium and terminal
CN117593473A (en) * 2024-01-17 2024-02-23 淘宝(中国)软件有限公司 Method, apparatus and storage medium for generating motion image and video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination