CN111401101A - Video generation system based on portrait

Info

Publication number
CN111401101A
Authority
CN
China
Prior art keywords: portrait, sequence, image, feature, expression
Legal status
Pending
Application number
CN201811635958.2A
Other languages
Chinese (zh)
Inventor
王慧
朱频频
Current Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Application filed by Shanghai Xiaoi Robot Technology Co Ltd

Classifications

    • G06V40/174 Facial expression recognition
    • G06F18/253 Fusion techniques of extracted features
    • G06V40/168 Feature extraction; Face representation
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L25/63 Speech or voice analysis specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Psychiatry (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An embodiment of the invention provides a portrait-based video generation system, which comprises: a first input unit adapted to acquire a target face static image; a second input unit adapted to acquire portrait expression control data; a target portrait generation unit, comprising a trained generative adversarial network model, adapted to perform corresponding feature extraction and feature fusion on the target face static image input by the first input unit and the portrait expression control data input by the second input unit to generate an image sequence, in which the actions of the portrait match the expression features of the portrait expression control data; and an output unit adapted to output the image sequence generated by the target portrait generation unit. The system can improve the universality of portrait-based video generation.

Description

Video generation system based on portrait
Technical Field
The embodiments of the invention relate to the technical field of video processing, and in particular to a portrait-based video generation system.
Background
A virtual portrait is a portrait generated by a computer. At present, portrait-based video generation mainly synthesizes a three-dimensional animated character through computer graphics techniques and uses animation parameters to drive the expressions and movements of the face, head and other parts of the figure.
However, this method requires modeling for a specific portrait, and if the portrait is replaced the model must be readjusted, so the method is not universal.
Disclosure of Invention
The embodiment of the invention provides a video generation system based on a portrait, which is used for improving the universality of video generation based on the portrait.
An embodiment of the invention provides a portrait-based video generation system, which comprises: a first input unit adapted to acquire a target face static image; a second input unit adapted to acquire portrait expression control data; a target portrait generation unit, comprising a trained generative adversarial network model, adapted to perform corresponding feature extraction and feature fusion on the target face static image input by the first input unit and the portrait expression control data input by the second input unit to generate an image sequence, in which the actions of the portrait match the expression features of the portrait expression control data; and an output unit adapted to output the image sequence generated by the target portrait generation unit.
Optionally, the second input unit comprises at least one of: a first input subunit adapted to input emotion data; and a second input subunit adapted to input voice data.
Optionally, the second input subunit includes: and the text-to-speech module is suitable for acquiring text data and converting the text data into speech data.
Optionally, the first input subunit comprises at least one of: the emotion tag input module is suitable for inputting emotion tags as the emotion data; and the emotion recognition module is suitable for recognizing the emotion characteristics of the voice data or the text data and taking the recognized emotion characteristic sequence as the emotion data.
Optionally, the target portrait generating unit includes: and the portrait generator is suitable for respectively carrying out corresponding feature extraction processing and feature fusion on the target face static image and the portrait expression control data to generate the image sequence.
Optionally, the portrait generator comprises: the first image encoder is suitable for encoding the static image of the target face and extracting to obtain an image feature set; and the portrait emotional expression feature extractor is suitable for inputting the emotional data into a preset portrait expression feature extraction model and extracting to obtain a portrait emotional expression feature sequence.
Optionally, the portrait emotional expression feature extractor comprises at least one of the following: a facial expression feature extractor adapted to input the emotion data into a preset expression feature extraction model and extract a portrait facial expression feature sequence; and a posture feature extractor adapted to input the emotion data into a preset posture feature extraction model and extract a portrait posture feature sequence.
Optionally, the human image emotion expression feature extractor further includes: and the time sequence converter is suitable for carrying out time sequence conversion on the portrait facial expression characteristic sequence or the portrait posture characteristic sequence according to a preset rule.
Optionally, the human image emotion expression feature extractor further includes: and the audio encoder is suitable for performing audio feature extraction processing on the input voice data to obtain an audio feature sequence.
Optionally, the portrait generator further comprises a feature fuser adapted to perform time-sequence matching and dimension fusion on the feature sequences obtained after feature extraction, the feature fuser comprising: a time-sequence matcher adapted to perform time-sequence matching on the image feature set and each portrait expression feature sequence; a dimension fuser adapted to perform dimension fusion on the image feature set and each portrait expression feature sequence to obtain a joint feature vector; and an image decoder adapted to decode the joint feature vector to obtain the image sequence.
Optionally, the time sequence matcher is adapted to perform time sequence matching on the audio feature sequence and the image feature set, so that the mouth shape of the portrait in the image sequence is matched with the audio feature sequence.
Optionally, the target portrait generation unit further includes a discriminator adapted to be coupled to the portrait generator and jointly and iteratively trained, wherein: the portrait generator is adapted to acquire, from a training data set, a target face static image and portrait expression control data to be input to the portrait generator, and to generate an image sequence matching the portrait expression control data as a training-generated image sequence; the discriminator is adapted to compare, in the training stage, the image sequence generated by the portrait generator with the acquired target face dynamic image; in each round of the discriminator's iteration the parameters of the portrait generator are first fixed so that the discriminator reaches its optimum, then the parameters at which the discriminator reaches its optimum are fixed and the parameters of the portrait generator are updated; the iteration loops until the difference between the training-generated image sequence and the target face dynamic image converges to a preset threshold, at which point the adversarial network model is determined to be trained.
Optionally, the discriminator comprises at least one of: an identity discriminator adapted to perform identity discrimination on the portrait in the generated image sequence; an expression discriminator adapted to perform emotion discrimination on expression features in the generated image sequence; an audio discriminator adapted to perform audio discrimination on audio features in the generated image sequence; and a posture discriminator adapted to perform posture discrimination on posture features in the generated image sequence.
Optionally, the discriminator is adapted to discriminate a difference value between the generated image sequence and the target face dynamic image through a preset difference loss function, and constrain a weight of the corresponding discriminator through a corresponding coefficient in the difference loss function.
With the embodiments of the invention, a trained generative adversarial network model performs corresponding feature extraction and feature fusion on the input target face static image and portrait expression control data, and an image sequence is generated and output. Because this image generation scheme requires no dedicated modeling of any specific portrait, the trained generative adversarial network model does not need to be readjusted when the target portrait is changed; the video generation scheme therefore has stronger universality, and, since no dedicated modeling of a specific portrait is needed, the video generation cost can also be reduced.
Further, the portrait expression control data can include emotion data and voice data, so that emotional expressions such as facial expressions and postures of the portrait can be controlled in the generated video in synchronization with the voice (mouth shape), making the portrait in the generated video more real and vivid and optimizing the auditory and visual experience of users.
Furthermore, voice data can be directly acquired, and acquired text data can also be converted into voice data, so that various possible input requirements of a user can be met, and the operation of the user is facilitated.
Furthermore, by acquiring the emotion label input by the user as emotion data, the expression of the portrait in the generated video can more accurately meet the requirement of the user on the emotional expression of the portrait, so that the user experience can be further improved.
Furthermore, by identifying the emotion characteristics of the acquired voice data or text data and taking the identified emotion characteristic sequence as the emotion data, the expression of the portrait is more consistent with the emotion expressed by the voice data, so that the portrait in the generated video is more real and natural.
Furthermore, the emotion data is input into a preset expression feature extraction model, a portrait facial expression feature sequence is extracted and obtained, and time sequence transformation is carried out on the portrait facial expression feature sequence according to a preset rule, so that the consistency of portrait expressions in a video can be improved.
Furthermore, the expression characteristics of the portrait in the generated video can be more real and natural by respectively carrying out time sequence matching on the image characteristic set and the portrait expression characteristic sequence and carrying out dimension fusion on the image characteristic set and the portrait expression characteristic sequence.
Further, the mouth shape of the portrait in the image sequence is matched with the portrait facial expression feature sequence by performing time sequence matching on the audio feature sequence and the portrait facial expression feature sequence, or the mouth shape of the portrait in the image sequence is matched with the portrait posture feature sequence by performing time sequence matching on the audio feature sequence and the portrait posture feature sequence, so that the mouth shape and the expression of the portrait in the video are better matched, and the mouth shape and the portrait posture are better matched, so that the reality of the portrait in the generated video can be further improved.
Further, in the training process of the generative adversarial network model, the discriminator and the portrait generator are jointly and iteratively trained: in each round of the discriminator's iteration, the parameters of the portrait generator are first fixed so that the discriminator reaches its optimum, then the parameters at which the discriminator reaches its optimum are fixed and the parameters of the portrait generator are updated; the iteration loops until the difference between the training-generated image sequence and the target face dynamic image converges to a preset threshold.
Furthermore, in the training process, through identity discrimination, emotion discrimination or audio discrimination of the portrait in the training generated image sequence, multi-dimensional identification of the authenticity of the portrait in the generated video can be realized, so that the authenticity of the portrait in the generated video can be further improved.
Further, whether the generated image sequence reaches a preset truth threshold value is judged through a preset difference loss function, and the weight of the corresponding discrimination type is constrained through the coefficient in the difference loss function, so that the corresponding portrait expression characteristic can be strengthened according to the user requirement, and the individuation of the portrait in the generated video can be enhanced.
Drawings
FIG. 1 is a flow chart illustrating a method for portrait based video generation in an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an architecture of a video generation system based on human images according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a portrait generator according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a portrait generator;
FIG. 5 is a schematic structural diagram of a target portrait generation unit in the embodiment of the present invention;
FIG. 6 is a schematic diagram of a structure of an arbiter according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another discriminator according to an embodiment of the present invention.
Detailed Description
As mentioned above, the current way of generating a portrait-based video based on modeling a specific portrait is not universal enough.
In order to solve the above problem, in the embodiments of the present invention a trained generative adversarial network (GANs) model is used to perform corresponding feature extraction and feature fusion on an input target face static image and portrait expression control data to generate an image sequence, and the generated image sequence is output when it is determined to reach a preset degree-of-truth threshold. With the video generation method of the embodiments of the invention, no dedicated modeling of any specific portrait is needed, so the trained generative adversarial network model does not need to be readjusted when the target portrait is changed; the method therefore has stronger universality, and, since no dedicated modeling of a specific portrait is needed, the video generation cost can be reduced.
In order that those skilled in the art may better understand and implement the embodiments of the present invention, a detailed description will be given below by way of specific implementation with reference to the accompanying drawings.
Referring to the flowchart of the video generation method based on human images shown in fig. 1, in the embodiment of the present invention, the video generation may be performed by the following steps, including:
and S11, acquiring the static image of the target face.
In a specific implementation, the static image of the target face may be one static image or multiple static images.
In specific implementation, the target face still image may be downloaded locally or from the internet, or may be directly obtained by using a camera or a shooting device such as a camera.
And S12, acquiring portrait expression control data.
In a specific implementation, the portrait expression control data may be emotion data, voice data, or both emotion data and voice data.
In specific implementation, the voice data can be directly acquired; text data may also be obtained and converted to speech data.
In particular implementations, the emotion data may be acquired in one or more ways. For example, an emotion tag input by the user may be acquired as the emotion data. The emotion tag may be: smile, laugh, melancholy, sadness, apathy, joy, and so on, and can be set or defined during the training process. As another example, the emotion features of the voice data or the text data may be recognized, and the recognized emotion feature sequence used as the emotion data. Alternatively, the expression features of the target face static image may be recognized and used as the emotion data.
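Purely as an illustration of how an emotion tag might be turned into emotion data for the model, the sketch below (in Python/PyTorch, which the patent does not prescribe) maps a tag from an assumed label list to a one-hot vector; the tag list and encoding are assumptions, since the patent leaves them to be defined during training.

```python
import torch

# Hypothetical emotion tag list; the patent only gives examples and lets the
# actual list be defined during the training process.
EMOTION_TAGS = ["smile", "laugh", "melancholy", "sadness", "apathy", "joy"]

def encode_emotion_tag(tag: str) -> torch.Tensor:
    """Map an emotion tag to a one-hot vector usable as emotion data."""
    one_hot = torch.zeros(len(EMOTION_TAGS))
    one_hot[EMOTION_TAGS.index(tag)] = 1.0
    return one_hot

emotion_data = encode_emotion_tag("smile")   # tensor([1., 0., 0., 0., 0., 0.])
```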
And S13, inputting the target face static image and the portrait expression control data into a trained generative adversarial network model, performing corresponding feature extraction and feature fusion respectively, and generating an image sequence, wherein the action posture of the portrait in the image sequence matches the expression features of the portrait expression control data.
In order to generate the portrait-based video according to the embodiment of the present invention, the generative adversarial network may include a portrait generator; that is, the generator may perform corresponding feature extraction and feature fusion on the target face static image and the portrait expression control data, respectively, to generate the image sequence. How the portrait generator in the embodiment of the present invention generates a portrait-based video is described below through specific embodiments.
In specific implementation, the target face static image may be encoded, and an image feature set may be extracted. For example, a preset convolutional neural network may be adopted to encode the static image of the target face, and the image feature set may be extracted by a preset image encoder.
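Purely for illustration, a minimal sketch (using PyTorch, which the patent does not prescribe) of a convolutional image encoder that maps a target face still image to an image feature set X1; the input resolution, layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Hypothetical 2-D CNN encoder: target face still image -> image feature set X1."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 128x128 -> 64x64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 64x64 -> 32x32
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 32x32 -> 16x16
        )
        self.fc = nn.Linear(256 * 16 * 16, feat_dim)

    def forward(self, img):                      # img: (batch, 3, 128, 128)
        h = self.conv(img).flatten(1)
        return self.fc(h)                        # X1: (batch, feat_dim)

x1 = ImageEncoder()(torch.randn(1, 3, 128, 128))   # encode one example still image
```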
In particular implementations, the motion pose of the portrait in the image sequence may include one or more of facial expressions, mouth shape, head pose and the like. The type of motion pose of the portrait in the generated image sequence corresponds to the type of portrait expression control data that was acquired.
In specific implementation, for emotion data, the emotion data can be input into a preset portrait expression feature extraction model, and a portrait emotion expression feature sequence is extracted and obtained. Based on the rich manner of human image emotional expression, in the embodiment of the invention, the emotional data can be input into one or more selected human image emotional expression characteristic extraction models to perform characteristic extraction processing according to the requirement. For example, the emotion data can be input into a preset expression feature extraction model, and a portrait facial expression feature sequence is extracted. For another example, the emotion data can be input into a preset posture feature extraction model, and a portrait posture feature sequence is extracted. For example, portrait pose features may include: head shaking, head nodding, head skewing, etc.
In a specific implementation, in order to enhance the coherence of the facial expression of the portrait, the extracted portrait facial expression feature sequence may be subjected to time-sequence transformation according to a preset rule. For example, the portrait facial expression features may be transformed in time sequence according to sentence intervals in the speech data, or according to a preset time period. The sentence-interval transformation can be estimated by linear interpolation of the portrait facial expression features between different expressions, or a corresponding neural network can be used to learn the feature transformation from an acquired portrait facial expression data set. In an embodiment of the present invention, a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) are used to perform the time-sequence transformation according to sentence intervals. The portrait facial expression data set may be acquired during the training phase.
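As a rough illustration of the linear-interpolation variant of this time-sequence transformation (frame indices, dimensions and function names are assumptions, and the CNN/RNN variant is not shown), per-sentence expression features can be interpolated into a per-frame sequence as follows.

```python
import numpy as np

def interpolate_expression_features(sentence_feats, sentence_start_frames):
    """Linearly interpolate expression features between consecutive sentence intervals.

    sentence_feats:        (num_sentences, feat_dim) expression feature per sentence
    sentence_start_frames: frame index at which each sentence starts (increasing)
    Returns a per-frame portrait facial expression feature sequence.
    """
    frames = np.arange(sentence_start_frames[-1] + 1)
    per_frame = np.empty((len(frames), sentence_feats.shape[1]))
    for d in range(sentence_feats.shape[1]):          # interpolate each dimension
        per_frame[:, d] = np.interp(frames, sentence_start_frames, sentence_feats[:, d])
    return per_frame

# e.g. three sentences starting at frames 0, 40 and 90, with 16-dim expression features
x3 = interpolate_expression_features(np.random.rand(3, 16), np.array([0, 40, 90]))
```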
In a particular implementation, the portrait expression control data may include voice data. The voice data can be input into the trained adversarial network model for audio feature extraction, and an audio feature sequence is extracted. In an embodiment of the present invention, audio features are extracted from the voice data using Mel Frequency Cepstrum Coefficients (MFCCs), input into a one-dimensional Convolutional Neural Network (CNN) for encoding, and passed through a preset Recurrent Neural Network (RNN) to capture timing features, so as to obtain the audio feature sequence.
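A minimal sketch of this audio path, assuming librosa for the MFCC step and PyTorch for the one-dimensional CNN and recurrent layers; the sampling rate, layer sizes and names are assumptions rather than values fixed by the patent.

```python
import librosa
import torch
import torch.nn as nn

def encode_audio(wav_path, n_mfcc=13, hidden=128):
    """Hypothetical audio encoder: waveform -> MFCC -> 1-D CNN -> RNN -> audio features X2."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)         # (n_mfcc, frames)
    feats = torch.tensor(mfcc, dtype=torch.float32).unsqueeze(0)   # (1, n_mfcc, frames)

    conv = nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2)         # 1-D CNN over time
    rnn = nn.GRU(64, hidden, batch_first=True)                     # captures timing features

    h = torch.relu(conv(feats)).transpose(1, 2)                    # (1, frames, 64)
    x2, _ = rnn(h)                                                 # (1, frames, hidden)
    return x2
```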
In a specific implementation, feature fusion may be performed as follows: respectively carrying out time sequence matching on the image feature set and the portrait expression feature sequence; and carrying out dimension fusion on the image feature set and the portrait expression feature sequence to obtain a joint feature vector, and carrying out image decoding on the joint feature vector to obtain the image sequence.
In the process of time sequence matching, corresponding time sequence matching operation can be carried out according to the characteristics of the input portrait expression characteristic sequence.
For the audio feature sequence, in a specific implementation, the following time-sequence matching operation may be performed: the audio feature sequence may be matched in time sequence with the image feature set, so that the mouth shape of the portrait in the image sequence matches the audio feature sequence. In one embodiment of the invention, the mouth shape of the portrait may include lip movements; in other embodiments it may include lip movements together with the associated facial muscle movements when speaking.
For a scene without voice input, the portrait facial expression feature sequence may be matched in time sequence with the portrait posture feature sequence, so that the facial expression of the portrait in the image sequence matches the portrait posture.
In the dimension fusion process, the image feature set can be directly connected with each portrait expression feature sequence to generate a joint feature vector. For example, the image feature set X1 is generally a two-dimensional CNN output, the voice feature sequence X2 is generally an RNN output sequence, and the expression feature sequence X3 is generally a two-dimensional CNN sequence; the image feature set X1, the voice feature sequence X2 and the expression feature sequence X3 may be directly connected to obtain the corresponding joint feature vector. In a specific implementation, the obtained joint feature vector may be input into a preset image decoder to obtain the generated image sequence, that is, the portrait-based video generated by the embodiment of the present invention.
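As a rough sketch of the time-sequence matching and dimension fusion just described (all shapes and names are assumptions): the static image features are repeated along the time axis so that every stream shares the same frame count, and the streams are then concatenated per frame into the joint feature vector.

```python
import torch

def fuse_features(x1, x2, x3):
    """x1: (B, D1) static image features X1; x2: (B, T, D2) audio features X2;
    x3: (B, T, D3) expression features X3, already matched to T frames."""
    T = x2.shape[1]
    x1_seq = x1.unsqueeze(1).expand(-1, T, -1)       # time-sequence matching for X1
    return torch.cat([x1_seq, x2, x3], dim=-1)       # per-frame joint feature vector

joint = fuse_features(torch.randn(1, 256),            # X1
                      torch.randn(1, 75, 128),         # X2 (75 frames)
                      torch.randn(1, 75, 16))          # X3
# joint: (1, 75, 400), to be fed into the image decoder to produce the image sequence
```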
S14, the generated image sequence is output.
In order to make those skilled in the art better understand and implement the embodiments of the present invention, the following describes in detail the video generation method based on human images according to the above embodiments of the present invention through several specific application scenarios:
in an embodiment of the present invention, a user may only input one or more target face still images, and facial expression data of the user may be acquired from the target face still images as emotion data. Then, an image feature set can be extracted from the one or more static images of the target face through the GANs, and for emotion data, the emotion data can be input into an expression feature extraction model preset by the GANs to extract and obtain a human face expression feature sequence. And then, the facial expression feature sequence of the portrait and the image feature set can be subjected to feature fusion to generate an image sequence, and then the image sequence can be output as a video based on the portrait in the embodiment of the invention.
In another embodiment of the present invention, the user can input the selected static image of the target face and input the emotion label. In specific implementation, time sequence transformation can also be performed on the human image facial expression feature sequence to enhance fluency of human image expression change in the generated video, for example, time sequence transformation can be performed according to a preset time period. The GANs may then perform feature fusion on the facial expression feature sequence and the image feature set to generate the image sequence.
In another embodiment of the invention, the user can input the selected target face static image and input an emotion tag, voice or text data. For text data, a Text-To-Speech (TTS) module may be provided in or outside the GANs to convert the input text data into speech data. The trained GANs model can then correspondingly extract an image feature set, a portrait facial expression feature sequence and an audio feature sequence, and feature fusion can be performed on them, including matching of the audio and expression time sequences and dimension fusion, to generate an image sequence.
It is to be understood that the above is merely for ease of understanding, exemplified by some specific scenarios. The embodiments of the present invention are not limited to the above application scenarios or specific implementation manners.
In a specific application process, for example, the user may not input an emotion tag, and the trained GANs model may acquire emotion data from the audio or text data input by the user and then generate the corresponding facial expression feature sequence, portrait posture feature sequence and so on. Feature fusion and degree-of-truth judgment are then performed, and the image sequence reaching the preset degree-of-truth threshold is output.
In order to make the embodiment of the present invention better understood and realized by those skilled in the art, a detailed description will be given below of how the GANs model used in the embodiment of the present invention is trained.
As described above, in order to achieve the universality of the video generation method based on the portrait, the GANs model used in the embodiment of the present invention may include a portrait generator.
In order to make the portrait in the generated image sequence meet the authenticity requirement, the GANs model can be trained in advance.
In the training stage, the GANs model may further include a discriminator, in addition to the portrait generator, the discriminator being adapted to be coupled to the portrait generator and perform joint iterative training, wherein the portrait generator is adapted to obtain a static image of a target face from a training data set and portrait expression control data to be input to the portrait generator, and generate an image sequence matching the portrait expression control data as a training generation image sequence;
the discriminator is suitable for comparing the image sequence generated by the portrait generator with the obtained target face dynamic image during the network model generation training, the parameters of the portrait generator are firstly fixed in each iteration of the discriminator iteration process, so that the discriminator reaches the optimal value, then the parameters of the discriminator reaches the optimal value are fixed, the parameters of the portrait generator are updated, and the confrontation network model training is determined to be completed when the iteration is circulated until the difference value between the training generated image sequence and the target face dynamic image is converged to a preset threshold value.
In a specific implementation, the difference between the training-generated image sequence and the target face dynamic image can be judged through a preset difference loss function.
In a specific implementation, the discriminator may perform at least one of the following discrimination operations: identity discrimination of the portrait in the generated image sequence; emotion discrimination of the expression features in the generated image sequence; audio discrimination of the audio features in the generated image sequence; and posture discrimination of the posture features in the generated image sequence.
In a specific implementation, the weight of the corresponding discriminant type may be constrained by coefficients in the difference loss function.
In an embodiment of the present invention, during the training process, the identity discriminator is used to discriminate the portrait in the generated image sequence, the expression discriminator is used to discriminate the emotion of the expression feature in the generated image sequence, and the audio discriminator is used to discriminate the audio of the generated image sequence.
The overall difference loss function may be defined as follows:

LOSS = ∑_{i=1,2,3} λ_i L_i + λ_r L_r

Each difference loss term can be defined as follows:

L_i = E_i{log[D_i(X_i)]} + E_i{log[1 − D_i(X_i′)]}, i = 1, 2, 3;

L_r = ∑_{all pixels} |G − T|,

where X_i denotes the features taken from the real target face video, X_i′ denotes the corresponding features of the generated image sequence, and E_i{·} is the average over a sequence or segment. For example, for the identity discriminator,

E_1{·} = (1/(T − T_0)) ∑_{t=T_0}^{T} (·)_t,

wherein T_0 is the starting time of the generated video and T is the ending time of the generated video; for the speech discriminator, E_2 is the average within the defined speech frame; for the expression discriminator, E_3 is the average within the defined expression time interval. G is the generated image result obtained by the portrait generator, and T is the true video result of the target portrait.

L_r is the L_p-norm difference of G − T at the pixel level of the reconstructed image; for a vector X = [x_1, x_2, …, x_N], its p-norm can be calculated by the following formula:

||X||_p = (|x_1|^p + |x_2|^p + … + |x_N|^p)^{1/p}, p = 0, 1, 2, …

The overall LOSS function is usually set using the cases p = 0, 1, 2, i.e. the L0, L1 and L2 norms, which can be calculated by the above formula.

During training, one looks for the discriminator that maximizes LOSS and the portrait generator that minimizes LOSS, i.e. argmin_{gen} max_{discr} LOSS, where gen denotes the portrait generator and discr denotes the discriminator. The portrait generator obtained when training is complete can then be used as the portrait generator for generating the portrait mouth-shape and expression video in the embodiment of the invention.
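For illustration only, a simplified PyTorch-style sketch of this alternating min-max training with the weighted loss above; the generator and discriminator interfaces, the optimizers and the λ values are assumptions, and the several discriminators are treated uniformly as a list to keep the example short.

```python
import torch

def train_round(G, Ds, opt_G, opt_D, still_img, control_data, real_video,
                lambdas=(1.0, 1.0, 1.0), lambda_r=10.0):
    """One hypothetical alternating round: bring the discriminators towards their
    optimum with the generator fixed, then update the generator with them fixed."""
    # 1) Fix G, maximize LOSS w.r.t. the discriminators (minimize its negative).
    with torch.no_grad():
        fake = G(still_img, control_data)                 # training-generated sequence
    d_loss = 0.0
    for lam, D in zip(lambdas, Ds):                       # D outputs a probability in (0, 1)
        d_loss = d_loss - lam * (torch.log(D(real_video)).mean()
                                 + torch.log(1 - D(fake)).mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Fix the discriminators, minimize LOSS w.r.t. the generator.
    fake = G(still_img, control_data)
    g_loss = sum(lam * torch.log(1 - D(fake)).mean() for lam, D in zip(lambdas, Ds))
    g_loss = g_loss + lambda_r * (fake - real_video).abs().sum()   # L_r = sum |G - T|
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

    # 3) The round is repeated until |G - T| converges below a preset threshold.
    return g_loss.item()
```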
In order to make the embodiments of the present invention better understood and realized by those skilled in the art, the following provides a detailed description of the video generation system based on human images and the models of the GANs used in the embodiments of the present invention.
As shown in fig. 2, an embodiment of the present invention provides a video generation system 20 based on human images, including: a first input unit 21, a second input unit 22, a target portrait generating unit 23, and an output unit 24, wherein:
a first input unit 21 adapted to acquire a still image of a target face;
a second input unit 22 adapted to acquire portrait expression control data;
a target portrait generation unit 23, comprising a trained generative adversarial network model, adapted to perform corresponding feature extraction and feature fusion on the target face static image input by the first input unit 21 and the portrait expression control data input by the second input unit 22, respectively, so as to generate an image sequence, where the action pose of the portrait in the image sequence matches the expression features of the portrait expression control data;
an output unit 24 adapted to output the sequence of images generated by the target portrait generating unit 23.
In a specific implementation, the image acquired by the first input unit 21 may be one image, or may be an image sequence formed by a plurality of still images, and the image may be a two-dimensional image or a three-dimensional image.
In a specific implementation, the second input unit 22 may include at least one of a first input subunit 221 and a second input subunit 222, wherein:
a first input subunit 221 adapted to input emotion data;
the second input subunit 222 is adapted to input voice data.
In a specific implementation, the second input subunit 222 may further include a text-to-speech module (not shown) adapted to acquire text data and convert the text data into speech data.
In a specific implementation, the first input subunit 221 may include at least one of:
an emotion tag input module (not shown) adapted to input an emotion tag as the emotion data;
and the emotion recognition module (not shown) is suitable for recognizing the emotion characteristics in the voice data or the text data or the target portrait static picture, and using the recognized emotion characteristic sequence as the emotion data.
In a specific implementation, the emotion tag input by the user can be acquired through the emotion tag input module. The emotion feature sequence can be identified from the input voice data or text data or the target portrait static image through an emotion identification module.
The target portrait generation unit 23 according to the embodiment of the present invention may include a GANs model. In a specific implementation, as shown in fig. 2, the target portrait generation unit 23 may include a portrait generator 231, and the portrait generator 231 is adapted to perform corresponding feature extraction and feature fusion on the target face static image and the portrait expression control data, respectively, to generate the image sequence.
The following detailed description of the structure and operation of the portrait generator according to the present invention is provided for those skilled in the art to better understand and implement the present invention.
Referring to the schematic structural diagram of a portrait generator in the embodiment of the present invention shown in fig. 3, in this embodiment the portrait generator 30 may include a first image encoder 31, a portrait emotional expression feature extractor 32 and a feature fuser 33, wherein:
the first image encoder 31 is adapted to encode the static face image and extract an image feature set;
the portrait emotional expression feature extractor 32 is suitable for inputting the emotional data into a preset portrait expression feature extraction model and extracting to obtain a portrait emotional expression feature sequence;
the feature fusion device 33 is adapted to perform time sequence matching and dimension fusion on the feature sequence obtained after feature extraction.
In a specific implementation, the human emotion expression feature extractor 32 may include at least one of the following:
the facial expression feature extractor 321 is adapted to input the emotion data into a preset expression feature extraction model, and extract to obtain a portrait facial expression feature sequence;
and the posture feature extractor 322 is adapted to input the emotion data into a preset posture feature extraction model and extract a portrait posture feature sequence.
In a specific implementation, the human emotion expression feature extractor 32 may further include: and the time sequence converter 323 is suitable for carrying out time sequence conversion on the portrait facial expression characteristic sequence or the portrait posture characteristic sequence according to a preset rule.
In a specific implementation, the feature sequences extracted by the facial expression feature extractor 321 and the pose feature extractor 322 may also be input to different time sequence converters for processing.
In a specific implementation, the human emotion expression feature extractor 32 may further include: the audio encoder 324 is adapted to perform audio feature extraction processing on the input voice data to obtain an audio feature sequence.
In a specific implementation, as shown in fig. 3, the feature fuser 33 may include a timing matcher 331, a dimension fuser 332 and an image decoder 333, wherein:
The timing matcher 331 is adapted to perform time-sequence matching on the image feature set and each portrait expression feature sequence;
The dimension fuser 332 is adapted to perform dimension fusion on the image feature set and each portrait expression feature sequence to obtain a joint feature vector;
an image decoder 333, adapted to perform image decoding on the joint feature vector to obtain the image sequence.
In a specific implementation, the timing matcher 331 is adapted to perform timing matching on the audio feature sequence and the image feature set, so that the mouth shape of the portrait in the image sequence matches the audio feature sequence.
The audio feature sequence may also be matched in time sequence with the portrait facial expression feature sequence so that the mouth shape of the portrait in the image sequence matches the facial expression, which can further improve the realism of the portrait in the generated video.
In order that those skilled in the art will better understand and realize the embodiments of the present invention, the following detailed description is given by way of the structure of the portrait generator in the target portrait generating unit employed in one specific application.
Fig. 4 shows a schematic structural diagram of a portrait generator in an embodiment of the present invention. In this embodiment, referring to fig. 4, the portrait generator 40 may include: a first image encoder 41, an audio encoder 42, an expression feature extractor 43, an expression feature model M, a feature fuser 44 and an image decoder 45. In particular implementations, an RNN 46 may also be included. In addition, in order to enhance the coherence of the portrait expression, a timing converter 47 may further be included.
In a specific implementation, the still picture of the target face may be input into the first image encoder 41 for encoding, and a (two-dimensional) image feature set X1 is extracted, for example, the first image encoder 41 may employ a (two-dimensional) CNN for encoding, and extract the (two-dimensional) image feature set X1.
In a specific implementation, the voice data already contains timing and audio information, so audio encoding can be performed directly to obtain an audio feature sequence X2 within a certain time period (audio frame T2). For example, the audio encoder 42 may extract the audio feature sequence X2 using MFCCs. The result can then be input into the one-dimensional RNN 46 for encoding, and an RNN output sequence is generated by using the RNN to capture timing features.
For input text data, as shown in fig. 4, it needs to be converted into voice data by the TTS module 48, and the converted voice data is fed to the audio encoder 42 for subsequent processing. In particular implementations, the TTS module 48 may also be located outside the portrait generator, as shown in fig. 4.
In a specific implementation, with continued reference to fig. 4, the user may further input an emotion tag sequence, and for the input emotion tag sequence, the facial expression feature of the portrait may be extracted by the expression feature extractor 43 using the preset expression feature model M, so as to obtain a facial expression feature sequence X3 of the portrait. Then, in order to enhance the consistency of the expression of the human image in the generated video, the sequence of facial expression features X3 can be time-series transformed by using a time-series transformer 47.
The embodiments of the present invention generate a portrait-based video based on the GANs model. As described above, before the image sequence is generated by the target portrait generation unit 23, the portrait generator 231 of the target portrait generation unit 23 may be trained in advance to improve the realism of the generated image sequence.
Referring to the target portrait generation unit 23 shown in fig. 5, in a specific implementation the target portrait generation unit 23 may further include, in addition to the portrait generator 231, a discriminator 232 adapted to be coupled to the portrait generator 231 and jointly and iteratively trained, where:
the portrait generator 231 is adapted to obtain a static image of a target face and portrait expression control data from a training data set, and generate an image sequence matching the portrait expression control data as a training generated image sequence;
the discriminator 232 is adapted to compare the image sequence generated by the portrait generator 231 with the obtained target face dynamic image in the training stage, and fix the parameter of the portrait generator 231 in each iteration of the discriminator 232, so that the discriminator 232 reaches the optimal value, then fix the parameter when the discriminator 232 reaches the optimal value, update the parameter of the portrait generator 231, and repeat the iteration in a loop until the difference between the training generated image sequence and the target face dynamic image converges to the preset threshold, thereby determining that the training of the confrontation network model is completed.
In the training stage, the emotion labels can be automatically generated by using a preset expression recognition algorithm, and can also be manually labeled, for example, the emotion labels in a certain sentence or a certain time period are manually labeled. The emotion label list obtained in the training process is consistent with the emotion label list adopted in the specific application process of the human figure generator based on the GANs.
For the expression feature model M preset in the embodiment of the present invention, an expression data set may be used to pre-train it and to extract features of different expressions; for example, features of different expressions may be extracted from a preset face database (such as the CMU PIE database) by using a CNN-based expression recognition algorithm, and the emotion tag list used to train the expression feature model M matches the emotion tag list of the video generated in the embodiment of the present invention. The CNN structure employed here may be different from the CNN structure used to extract the (two-dimensional) image feature set X1.
With continued reference to fig. 4, the extracted (two-dimensional) image feature set X1, audio feature sequence X2 and portrait facial expression feature sequence X3 are fused by the feature fuser 44 and then output to the image decoder 45 for processing, so as to generate the portrait-based image sequence.
By adopting the portrait generator 40, the control of the portrait expression can be realized, and under the condition of voice input, the synchronous control of the portrait expression and the voice (mouth shape) can be realized, so that the reality of the portrait in the generated video can be improved, the portrait in the generated video is more real and vivid, and the auditory sense and visual experience of a user can be optimized.
In a particular implementation, feature fuser 44 may include a timing matcher (not shown) and a dimension fuser (not shown). The audio characteristic sequence and the expression characteristic sequence can be subjected to time sequence matching through the time sequence matcher, so that the expression and the mouth shape of the portrait in the generated video are synchronously controlled, and the facial action of the portrait in the video is more natural.
Since the extracted image feature set X1 is usually a two-dimensional CNN output sequence, the voice feature sequence X2 is usually an RNN output sequence, and the portrait facial expression feature sequence X3 is usually a two-dimensional CNN sequence, in a specific implementation the dimension fuser may directly connect the image feature set X1, the voice feature sequence X2 and the portrait facial expression feature sequence X3 to obtain a joint feature vector, which is then input to the image decoder 45, so that the image sequence X can be automatically generated to form the portrait-based video.
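A minimal sketch of such an image decoder, assuming a transposed-convolution design in PyTorch; the joint feature dimension, layer sizes and output resolution are assumptions chosen only to match the earlier fusion example.

```python
import torch
import torch.nn as nn

class ImageDecoder(nn.Module):
    """Hypothetical decoder: per-frame joint feature vector -> image frame."""
    def __init__(self, joint_dim=400):
        super().__init__()
        self.fc = nn.Linear(joint_dim, 256 * 16 * 16)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 64
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),     # 64 -> 128
        )

    def forward(self, joint):                        # joint: (B, T, joint_dim)
        B, T, _ = joint.shape
        h = self.fc(joint.reshape(B * T, -1)).view(B * T, 256, 16, 16)
        frames = self.deconv(h)                      # (B*T, 3, 128, 128)
        return frames.view(B, T, 3, 128, 128)        # image sequence X

video = ImageDecoder()(torch.randn(1, 75, 400))      # decode the fused 75-frame example
```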
In specific implementation, in order to ensure that the adopted portrait generator meets the preset requirement of the degree of truth, a corresponding discriminator can be selected and designed according to the requirement.
In a specific implementation, the discriminator may discriminate a difference value between the generated image sequence and the target face dynamic image through a preset difference loss function. In a specific implementation, the weight of the corresponding discriminator may also be constrained by the corresponding coefficient in the difference loss function.
In a specific implementation, the target portrait generating unit 23 may be trained as follows: a still image of a target face and portrait expression control data obtained from a training data set are input to the portrait generator 231, and an image sequence matching the portrait expression control data is generated as training generation data. Then, the training generation data and the moving video image in the training data set are input to the discriminator 232. In order to determine the training effect, a preset difference loss function can be adopted for discrimination, and when the difference loss function of the two is determined to be smaller than a preset value, the GANs training is determined to be completed.
In the training phase of the applied GANs model, as shown in fig. 5, training data may be obtained from a predetermined training data set 50. A still image of the target person's face may be obtained from the training data as input to the portrait generator, and speech data or text data may also be obtained from the training data set 50. In the training process, public data sets such as GRID or TCD-TIMIT can be selected as data sources, and video clips, pictures and corresponding subtitles captured from movies and television shows can also be used as data sources as needed. The embodiment of the invention does not limit the specific type and source of the selected training data set.
Fig. 6 shows a schematic structural diagram of a discriminator according to an embodiment of the present invention. The arbiter 60 may include any one or more of the following arbiters, as desired:
an identity discriminator 61 adapted to discriminate the identity of the figures in the generated sequence of images;
an expression discriminator 62 adapted to perform emotion discrimination on the expression features in the generated image sequence;
an audio discriminator 63 adapted to perform audio discrimination on audio features in the generated image sequence;
and a posture discriminator 64 adapted to discriminate the posture of the posture feature in the generated image sequence.
For example, in the case where there is no voice input, only the identity discriminator 61, only the expression discriminator 62, or both the identity discriminator 61 and the expression discriminator 62 may be employed. For the case of voice input, any one of the identity discriminator 61, the expression discriminator 62 and the audio discriminator 63 may be adopted alone; the identity discriminator 61 and the audio discriminator 63 may be adopted together; the identity discriminator 61 and the expression discriminator 62 may be adopted together; or the identity discriminator 61, the expression discriminator 62 and the audio discriminator 63 may all be adopted together.
When a plurality of discriminators are used in a matched mode, the coefficient of the difference loss function corresponding to each discriminator can be set according to requirements, so that corresponding portrait expression characteristics can be strengthened according to user requirements, and the individuation of the portrait in the generated video is enhanced.
Fig. 7 is a schematic structural diagram of another discriminator according to an embodiment of the present invention. In one embodiment of the present invention, as shown in FIG. 7, during the training phase, the discriminator 70 may be used with the portrait generator 40 shown in FIG. 4. As needed, the discriminator 70 may include at least one of an identity discriminator 71, an audio discriminator 72, and an expression discriminator 73, wherein:
the identity discriminator 71 judges that each image of the image sequence generated by the portrait generator is a real person and outputs a first discrimination result D1.
The audio discriminator 72 may judge whether the generated image sequence is real or not by using the generated image sequence (video) and audio features, and output a second judgment result D2.
The expression discriminator 73 may determine whether the generated image sequence is real or not using the generated image sequence and the expression feature, and output a third discrimination result D3.
In a specific implementation, the authenticity of the generated image sequence may be determined by comparing the difference loss function with a corresponding threshold value of authenticity.
In one embodiment of the present invention, the overall difference loss function may be defined as follows:
LOSS = ∑_{i=1,2,3} λ_i L_i + λ_r L_r

Each difference loss term can be defined as follows:

L_i = E_i{log[D_i(X_i)]} + E_i{log[1 − D_i(X_i′)]}, i = 1, 2, 3;

L_r = ∑_{all pixels} |G − T|,

where X_i denotes the features taken from the real target face video, X_i′ denotes the corresponding features of the generated image sequence, and E_i{·} is the average over a sequence or segment. For example, for the identity discriminator,

E_1{·} = (1/(T − T_0)) ∑_{t=T_0}^{T} (·)_t,

wherein T_0 is the starting time of the generated video and T is the ending time of the generated video; for the speech discriminator, E_2 is the average within the defined speech frame; for the expression discriminator, E_3 is the average within the defined expression time interval. G is the generated image result obtained by the portrait generator, and T is the true video result of the target portrait.

L_r is the L_p-norm difference of G − T at the pixel level of the reconstructed image; for a vector X = [x_1, x_2, …, x_N], its p-norm can be calculated by the following formula:

||X||_p = (|x_1|^p + |x_2|^p + … + |x_N|^p)^{1/p}, p = 0, 1, 2, …

The overall LOSS function is usually designed using the cases p = 0, 1, 2, i.e. the L0, L1 and L2 norms, which can be calculated by the above formula.

During training, one can look for the discriminator that maximizes LOSS and the portrait generator that minimizes LOSS, i.e. argmin_{gen} max_{discr} LOSS, where gen denotes the portrait generator and discr denotes the discriminator. The portrait generator obtained when training is complete can then be used as the portrait generator for generating the portrait mouth-shape and expression video in the embodiment of the invention.
The above embodiment shows the calculation process of the overall difference loss function with separate discriminators. In a specific implementation, the audio discriminator 72 and the expression discriminator 73 may also be combined into one, and only one difference loss function used for discrimination.
In a specific implementation, with reference to fig. 7, the generated image sequence may be preprocessed to extract a corresponding feature sequence, and then the extracted feature sequence and the input data input into the target portrait model are input into corresponding discriminators respectively for comparison and determination, and a determination result is output. The following is described in detail with reference to fig. 6 and 7.
For the identity discrimination, the image sequence generated by the portrait generator 40 through the image decoder 45 and the input target portrait static picture are respectively input into the second image encoder 74 for processing to obtain the corresponding image feature sets X1 and X1′, which are respectively input into the identity discriminator 71 to obtain the first discrimination result D1.
For the audio discrimination, the image sequence generated by the portrait generator 40 through the image decoder 45 may be input into the third image encoder 75 for processing to obtain a feature sequence X2′, which is then input into the RNN 77 for encoding; the audio feature sequence obtained from the training data set and the sequence output by the RNN 77 are respectively input into the audio discriminator 72 for comparison to obtain the second discrimination result D2.
For the expression discrimination, the image sequence generated by the portrait generator 40 through the image decoder 45 may be input into the fourth image encoder 76 to extract an expression feature sequence X3′; the expression feature sequence X3 from the portrait generator 40 and the extracted sequence X3′ are respectively input into the sequence converter 78 for sequence conversion and then into the expression discriminator 73 to obtain the third discrimination result D3.
The obtained first, second and third discrimination results D1, D2 and D3 may be combined by means of the above-mentioned overall difference loss function into a unified discrimination result D; when the unified discrimination result D reaches the preset authenticity threshold, the generated image sequence X may be output.
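As a minimal sketch only, the final output decision may look as follows; the way D1, D2 and D3 are aggregated into D and the threshold value are assumed forms for illustration and are not specified by this embodiment:

```python
# Illustrative sketch: aggregate the three discrimination results into a unified
# result D and release the generated sequence once D reaches the authenticity threshold.
def unified_result(d1, d2, d3, weights=(1.0, 1.0, 1.0)):
    """Weighted aggregation of the identity, audio and expression results (assumed form)."""
    return (weights[0] * d1 + weights[1] * d2 + weights[2] * d3) / sum(weights)

def maybe_output(image_sequence, d1, d2, d3, threshold=0.8):
    """Return the generated image sequence only if the unified result passes the threshold."""
    return image_sequence if unified_result(d1, d2, d3) >= threshold else None
```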
In a specific implementation, the image encoders, the RNN, the timing converter and the like used by the discriminator 70 in the above embodiment may be the same as, or have the same structure and parameters as, the corresponding image encoder, RNN and timing converter in the connected portrait generator 40.
In order to facilitate implementation of the portrait-based video generation method according to the embodiments of the present invention, an embodiment of the present invention further provides a video generation device. The video generation device may include a memory and a processor, the memory storing computer instructions executable on the processor; when the processor executes the computer instructions, the steps of the portrait-based video generation method according to any of the above embodiments may be performed. For the specific implementation, reference may be made to the description of the above embodiments, which is not repeated here.
To facilitate implementation of the portrait-based video generation method according to the above embodiments of the present invention, an embodiment of the present invention further provides a computer-readable storage medium on which computer instructions are stored; when the computer instructions are executed, the steps of the portrait-based video generation method according to any of the above embodiments may be performed. For the specific implementation, reference may be made to the description of the above embodiments, which is not repeated here. The computer storage medium may include: a ROM, a RAM, a magnetic disk, an optical disk, and the like.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (14)

1. A portrait-based video generation system, comprising:
the first input unit is suitable for acquiring a static image of a target face;
the second input unit is suitable for acquiring portrait expression control data;
the target portrait generating unit comprises a trained generative adversarial network model and is suitable for respectively performing corresponding feature extraction processing and feature fusion on the target face static image input by the first input unit and the portrait expression control data input by the second input unit to generate an image sequence, wherein the action of the portrait in the image sequence matches the expression features of the portrait expression control data;
and the output unit is suitable for outputting the image sequence generated by the target portrait generation unit.
2. The portrait based video generation system of claim 1, wherein the second input unit comprises at least one of:
a first input subunit adapted to input emotion data;
and the second input subunit is suitable for inputting voice data.
3. The portrait based video generation system of claim 2, wherein the second input subunit comprises: a text-to-speech module, suitable for acquiring text data and converting the text data into voice data.
4. The portrait based video generation system of claim 3, wherein the first input subunit includes at least one of:
the emotion tag input module is suitable for inputting emotion tags as the emotion data;
and the emotion recognition module is suitable for recognizing the emotion characteristics of the voice data or the text data and taking the recognized emotion characteristic sequence as the emotion data.
5. The portrait based video generation system of claim 4, wherein the target portrait generation unit includes:
and the portrait generator is suitable for respectively carrying out corresponding feature extraction processing and feature fusion on the target face static image and the portrait expression control data to generate the image sequence.
6. The portrait based video generation system of claim 5, wherein the portrait generator comprises:
the first image encoder is suitable for encoding the static image of the target face and extracting to obtain an image feature set;
and the portrait emotion expression feature extractor is suitable for inputting the emotion data into a preset portrait expression feature extraction model and extracting a portrait emotion expression feature sequence.
7. The portrait based video generation system of claim 6, wherein the portrait emotion expression feature extractor comprises at least one of:
the facial expression feature extractor is suitable for inputting the emotion data into a preset expression feature extraction model and extracting to obtain a portrait facial expression feature sequence;
and the attitude feature extractor is suitable for inputting the emotion data into a preset attitude feature extraction model and extracting to obtain a portrait attitude feature sequence.
8. The portrait based video generation system of claim 7, wherein the portrait emotion expression feature extractor further comprises:
a time sequence converter, suitable for performing time sequence conversion on the portrait facial expression feature sequence or the portrait posture feature sequence according to a preset rule.
9. The portrait based video generation system of claim 7, wherein the portrait emotion expression feature extractor further comprises:
an audio encoder, suitable for performing audio feature extraction processing on the input voice data to obtain an audio feature sequence.
10. The portrait based video generation system of claim 9, wherein the portrait generator further comprises: a feature fusion device, suitable for performing time sequence matching and dimension fusion on the feature sequences obtained after feature extraction, the feature fusion device comprising:
the time sequence matcher is suitable for respectively carrying out time sequence matching on the image feature set and the portrait expression feature sequence;
the dimension fusion device is suitable for performing dimension fusion on the image feature set and each portrait expression feature sequence to obtain a joint feature vector;
and the image decoder is suitable for performing image decoding on the joint feature vector to obtain the image sequence.
11. The portrait-based video generation system of claim 10, wherein the time sequence matcher is suitable for performing time sequence matching on the audio feature sequence and the image feature set, so that the mouth shape of the portrait in the image sequence matches the audio feature sequence.
12. The portrait based video generation system of any of claims 5-11, wherein the target portrait generation unit further comprises: a discriminator adapted to be coupled to the portrait generator and jointly iteratively trained, wherein:
the portrait generator is suitable for acquiring a static image of a target face and portrait expression control data from a training data set, and generating an image sequence matched with the portrait expression control data as a training generated image sequence;
the discriminator is suitable for comparing, in the training stage, the image sequence generated by the portrait generator with the acquired target face dynamic image; in each iteration, the parameters of the portrait generator are first fixed so that the discriminator reaches its optimum, then the parameters of the discriminator at the optimum are fixed and the parameters of the portrait generator are updated; the iteration is repeated until the difference between the image sequence generated in training and the target face dynamic image converges to a preset threshold, whereupon the generative adversarial network model is determined to be trained.
13. The portrait based video generation system of claim 12, wherein the discriminator includes at least one of:
the identity discriminator is suitable for discriminating the identity of the portrait in the generated image sequence;
the expression discriminator is suitable for performing emotion discrimination on expression features in the generated image sequence;
an audio discriminator adapted to perform audio discrimination on audio features in the generated image sequence;
and the posture discriminator is suitable for carrying out posture discrimination on the posture characteristics in the generated image sequence.
14. The system according to claim 13, wherein the discriminator is suitable for discriminating the difference between the generated image sequence and the target face dynamic image by a preset difference loss function, and for constraining the weight of the corresponding discriminator by the corresponding coefficient in the difference loss function.
CN201811635958.2A 2018-12-29 2018-12-29 Video generation system based on portrait Pending CN111401101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811635958.2A CN111401101A (en) 2018-12-29 2018-12-29 Video generation system based on portrait

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811635958.2A CN111401101A (en) 2018-12-29 2018-12-29 Video generation system based on portrait

Publications (1)

Publication Number Publication Date
CN111401101A true CN111401101A (en) 2020-07-10

Family

ID=71435813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811635958.2A Pending CN111401101A (en) 2018-12-29 2018-12-29 Video generation system based on portrait

Country Status (1)

Country Link
CN (1) CN111401101A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
US9786084B1 (en) * 2016-06-23 2017-10-10 LoomAi, Inc. Systems and methods for generating computer ready animation models of a human head from captured data images
US20180288431A1 (en) * 2017-03-31 2018-10-04 Nvidia Corporation System and method for content and motion controlled action video generation
CN107437077A (en) * 2017-08-04 2017-12-05 深圳市唯特视科技有限公司 A kind of method that rotation face based on generation confrontation network represents study
CN108268845A (en) * 2018-01-17 2018-07-10 深圳市唯特视科技有限公司 A kind of dynamic translation system using generation confrontation network synthesis face video sequence
CN108171770A (en) * 2018-01-18 2018-06-15 中科视拓(北京)科技有限公司 A kind of human face expression edit methods based on production confrontation network
CN108389239A (en) * 2018-02-23 2018-08-10 深圳市唯特视科技有限公司 A kind of smile face video generation method based on condition multimode network
CN108962216A (en) * 2018-06-12 2018-12-07 北京市商汤科技开发有限公司 A kind of processing method and processing device, equipment and the storage medium of video of speaking
CN108900788A (en) * 2018-07-12 2018-11-27 北京市商汤科技开发有限公司 Video generation method, video-generating device, electronic device and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FENGCHUN QIAO et al.: "Geometry-Contrastive GAN for Facial Expression Transfer", pages 1-14 *
KONSTANTINOS VOUGIOUKAS et al.: "End-to-End Speech-Driven Facial Animation with Temporal GANs", pages 1-14 *
许鑫 (Xu Xin): 《基于文本特征计算的信息分析方法》 [Information Analysis Methods Based on Text Feature Computation], 上海科学技术文献出版社 (Shanghai Scientific and Technological Literature Press), 30 November 2015, pages 195-197 *
赵力 (Zhao Li): 《语音信号处理》 [Speech Signal Processing], 机械工业出版社 (China Machine Press), 31 March 2003, pages 261-268 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508830A (en) * 2020-11-30 2021-03-16 北京百度网讯科技有限公司 Training method, device and equipment of image processing model and storage medium
CN112508830B (en) * 2020-11-30 2023-10-13 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of image processing model
CN112650399A (en) * 2020-12-22 2021-04-13 科大讯飞股份有限公司 Expression recommendation method and device
CN112650399B (en) * 2020-12-22 2023-12-01 科大讯飞股份有限公司 Expression recommendation method and device
CN114793300A (en) * 2021-01-25 2022-07-26 天津大学 Virtual video customer service robot synthesis method and system based on generation countermeasure network
CN113395569A (en) * 2021-05-29 2021-09-14 北京优幕科技有限责任公司 Video generation method and device
CN113395569B (en) * 2021-05-29 2022-12-09 北京优幕科技有限责任公司 Video generation method and device
WO2022262473A1 (en) * 2021-06-16 2022-12-22 北京字跳网络技术有限公司 Image processing method and apparatus, and device and storage medium
CN113490058A (en) * 2021-08-20 2021-10-08 云知声(上海)智能科技有限公司 Intelligent subtitle matching system applied to later stage of movie and television
CN114071204A (en) * 2021-11-16 2022-02-18 湖南快乐阳光互动娱乐传媒有限公司 Data processing method and device
CN114071204B (en) * 2021-11-16 2024-05-03 湖南快乐阳光互动娱乐传媒有限公司 Data processing method and device
CN115205949A (en) * 2022-09-05 2022-10-18 腾讯科技(深圳)有限公司 Image generation method and related device
WO2024051445A1 (en) * 2022-09-05 2024-03-14 腾讯科技(深圳)有限公司 Image generation method and related device

Similar Documents

Publication Publication Date Title
CN111383307A (en) Video generation method and device based on portrait and storage medium
CN111401101A (en) Video generation system based on portrait
US11114086B2 (en) Text and audio-based real-time face reenactment
US10789453B2 (en) Face reenactment
CN111145322B (en) Method, apparatus, and computer-readable storage medium for driving avatar
Chuang et al. Mood swings: expressive speech animation
EP3912159B1 (en) Text and audio-based real-time face reenactment
JP2023546173A (en) Facial recognition type person re-identification system
CN115588224A (en) Face key point prediction method, virtual digital person generation method and device
KR102409988B1 (en) Method and apparatus for face swapping using deep learning network
Rebol et al. Passing a non-verbal turing test: Evaluating gesture animations generated from speech
CN113903067A (en) Virtual object video generation method, device, equipment and medium
Rebol et al. Real-time gesture animation generation from speech for virtual human interaction
Tan et al. Style2talker: High-resolution talking head generation with emotion style and art style
Wang et al. Talking faces: Audio-to-video face generation
Jha et al. Cross-language speech dependent lip-synchronization
KR20230151155A (en) An apparatus for providing avatar speech services and a method for operating it
RU2720361C1 (en) Multi-frame training of realistic neural models of speakers heads
Mattos et al. Towards view-independent viseme recognition based on CNNs and synthetic data
Liu et al. A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation
CN117153195B (en) Method and system for generating speaker face video based on adaptive region shielding
RU2748779C1 (en) Method and system for automated generation of video stream with digital avatar based on text
CN117373455B (en) Audio and video generation method, device, equipment and storage medium
KR102584484B1 (en) Apparatus and method for generating speech synsthesis image
CN118250529A (en) Voice-driven 2D digital human video generation method and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination