CN115035219A - Expression generation method and device and expression generation model training method and device - Google Patents


Info

Publication number
CN115035219A
CN115035219A (Application CN202210540239.2A)
Authority
CN
China
Prior art keywords
frame
image
expression
training video
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210540239.2A
Other languages
Chinese (zh)
Inventor
石凡
刘颖璐
左佳伟
王林芳
张炜
梅涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd
Priority to CN202210540239.2A
Publication of CN115035219A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00: Animation
    • G06T 13/20: 3D [Three Dimensional] animation
    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168: Feature extraction; Face representation
    • G06V 40/174: Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to an expression generation method and device and an expression generation model training method and device, in the field of computer technology. The method of the present disclosure comprises: acquiring, for each frame of image in an original video, feature information of the image, feature information of face key points, and classification information of the original expression; fusing the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression, and preset classification information corresponding to a target expression to obtain feature information of a fused image corresponding to each frame of image; and generating a fused image corresponding to each frame of image according to the feature information of the fused image corresponding to that frame, the fused images corresponding to all frames forming a target video in which the facial expression is the target expression.

Description

Expression generation method and device and expression generation model training method and device
Technical Field
The disclosure relates to the technical field of computers, and in particular relates to an expression generation method and device and an expression generation model training method and device.
Background
Driven by new technology waves such as artificial intelligence and virtual reality, the production of digital humans has been greatly simplified and their performance has improved dramatically, gradually deepening from digitized appearance to interactive behavior and intelligent emotion. Digital humans such as virtual anchors and virtual employees have successfully entered the public eye and are flourishing in many forms across fields such as film, games, media, cultural tourism, and finance.
Interactive digital human customization strives for realism and personalization; under photorealistic requirements, every detail of the figure matters to the user. When producing a photorealistic interactive digital human, a video of a human model can be recorded first, and the model's expressions and actions are then edited based on that video to generate expressions, actions, and the like that match the interactive scene.
Disclosure of Invention
One technical problem to be solved by the present disclosure is: in the process of producing a digital human, how to edit the expression of a person in a video so as to generate a digital-human video with a target expression matching the scene, while keeping the generated video stable and clear.
According to some embodiments of the present disclosure, there is provided an expression generation method including: acquiring feature information of each frame of image in an original video, feature information of face key points, and classification information of an original expression; fusing the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression, and preset classification information corresponding to a target expression to obtain feature information of a fused image corresponding to each frame of image; and generating a fused image corresponding to each frame of image according to the feature information of the fused image corresponding to that frame, to obtain a target video, formed by the fused images corresponding to all the frames, in which the facial expression is the target expression.
In some embodiments, obtaining the feature information of each frame of image in the original video and the feature information of the face key points includes: inputting each frame of image in the original video into a face feature extraction model to obtain the output feature information of each frame of image; inputting the feature information of each frame of image into a face key point detection model to obtain the coordinate information of the face key points of each frame of image; and reducing the dimensionality of the coordinate information of all the face key points by a principal component analysis method to obtain information of a preset dimension as the feature information of the face key points.
In some embodiments, obtaining the classification information of the original expression of each frame of image in the original video includes: inputting the feature information of each frame of image into an expression classification model to obtain the classification information of the original expression of each frame of image.
In some embodiments, fusing the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes: adding the classification information of the original expression of each frame of image to the preset classification information corresponding to the target expression and averaging to obtain the classification information of the fusion expression corresponding to each frame of image; and concatenating the feature information of the face key points of each frame of image multiplied by a first weight obtained by training, the feature information of each frame of image multiplied by a second weight obtained by training, and the classification information of the fusion expression corresponding to each frame of image.
In some embodiments, generating the fused image corresponding to each frame of image according to the feature information of the fused image corresponding to each frame of image includes: and inputting the feature information of the fused image corresponding to each frame of image into a decoder, and outputting the generated fused image corresponding to each frame of image, wherein the human face feature extraction model comprises a convolution layer, and the decoder comprises a deconvolution layer.
According to other embodiments of the present disclosure, there is provided an expression generation model training method, including: acquiring a training pair composed of each frame image of an original training video and each frame image of a target training video; inputting each frame of image of the original training video into a first generator, acquiring feature information of each frame of image of the original training video, feature information of the face key points and classification information of the original expression, fusing the feature information of each frame of image of the original training video, the feature information of the face key points, the classification information of the original expression and preset classification information corresponding to a target expression to obtain feature information of each frame of fused image corresponding to the original training video, and obtaining each frame of fused image corresponding to the original training video output by the first generator according to the feature information of each frame of fused image corresponding to the original training video; inputting each frame image of the target training video into a second generator, acquiring feature information of each frame image of the target training video, feature information of the face key points and classification information of the target expression, fusing the feature information of each frame image of the target training video, the feature information of the face key points, the classification information of the target expression and preset classification information corresponding to the original expression to obtain the feature information of each frame fused image corresponding to the target training video, and obtaining each frame fused image corresponding to the target training video output by the second generator according to the feature information of each frame fused image corresponding to the target training video; determining an adversarial loss and a cycle consistency loss according to the fused images of the frames corresponding to the original training video and the fused images of the frames corresponding to the target training video; and training the first generator and the second generator according to the adversarial loss and the cycle consistency loss.
In some embodiments, the method further comprises: determining a pixel-to-pixel loss according to the pixel difference between each two adjacent frames of fused images corresponding to the original training video and the pixel difference between each two adjacent frames of fused images corresponding to the target training video; wherein training the first generator and the second generator according to the adversarial loss and the cycle consistency loss comprises: training the first generator and the second generator according to the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss.
In some embodiments, determining the adversarial loss according to the fused image of each frame corresponding to the original training video and the fused image of each frame corresponding to the target training video includes: inputting each frame of fused image corresponding to the original training video into a first discriminator to obtain a first discrimination result of each frame of fused image corresponding to the original training video; inputting each frame of fused image corresponding to the target training video into a second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the target training video; and determining a first adversarial loss according to the first discrimination result of each frame of fused image corresponding to the original training video, and determining a second adversarial loss according to the second discrimination result of each frame of fused image corresponding to the target training video.
In some embodiments, inputting each frame of fused image corresponding to the original training video into the first discriminator to obtain the first discrimination result of each frame of fused image corresponding to the original training video includes: inputting each frame of fused image corresponding to the original training video into a first face feature extraction model in the first discriminator to obtain the output feature information of each frame of fused image corresponding to the original training video; and inputting the feature information of each frame of fused image corresponding to the original training video into a first expression classification model in the first discriminator to obtain the expression classification information of each frame of fused image corresponding to the original training video as the first discrimination result. Inputting each frame of fused image corresponding to the target training video into the second discriminator to obtain the second discrimination result of each frame of fused image corresponding to the target training video includes: inputting each frame of fused image corresponding to the target training video into a second face feature extraction model in the second discriminator to obtain the output feature information of each frame of fused image corresponding to the target training video; and inputting the feature information of each frame of fused image corresponding to the target training video into a second expression classification model in the second discriminator to obtain the expression classification information of each frame of fused image corresponding to the target training video as the second discrimination result.
In some embodiments, the cycle consistency loss is determined using the following method: inputting each frame of fused image corresponding to the original training video into the second generator to generate each frame of reconstructed image of the original training video, and inputting each frame of fused image corresponding to the target training video into the first generator to generate each frame of reconstructed image of the target training video; and determining the cycle consistency loss according to the difference between each frame of reconstructed image of the original training video and each frame of image of the original training video and the difference between each frame of reconstructed image of the target training video and each frame of image of the target training video.
In some embodiments, the pixel-to-pixel loss is determined using the following method: for each position in each two adjacent frames of fused images corresponding to the original training video, determining the distance between the representation vectors of the two pixels at that position, and summing the distances over all positions to obtain a first loss; for each position in each two adjacent frames of fused images corresponding to the target training video, determining the distance between the representation vectors of the two pixels at that position, and summing the distances over all positions to obtain a second loss; and adding the first loss and the second loss to obtain the pixel-to-pixel loss.
In some embodiments, the obtaining feature information of each frame of image of the original training video and feature information of the face key point includes: inputting each frame of image in the original training video into a third facial feature extraction model in a first generator to obtain feature information of each output frame of image; inputting the characteristic information of each frame of image into a first face key point detection model in a first generator to obtain the coordinate information of the face key points of each frame of image; reducing the dimensions of the coordinate information of all face key points by adopting a principal component analysis method to obtain first information with preset dimensions, wherein the first information is used as feature information of the face key points of each frame image of the original training video; the method for acquiring the feature information of each frame of image and the feature information of the key points of the human face of the target training video comprises the following steps: inputting each frame of image in the target training video into a fourth face feature extraction model in a second generator to obtain feature information of each output frame of image; inputting the feature information of each frame of image into a second face key point detection model in a second generator to obtain the coordinate information of the face key points of each frame of image; and reducing the dimensions of the coordinate information of all the face key points by adopting a principal component analysis method to obtain second information with preset dimensions, wherein the second information is used as the characteristic information of the face key points of each frame of image of the target training video.
In some embodiments, obtaining the classification information of the original expression of each frame of image in the original training video includes: inputting the feature information of each frame of image in the original training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame of image in the original training video; and obtaining the classification information of the target expression of each frame of image in the target training video includes: inputting the feature information of each frame of image in the target training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame of image in the target training video.
In some embodiments, fusing the feature information of each frame of image of the original training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes: adding the classification information of the original expression of each frame of image of the original training video to the preset classification information corresponding to the target expression and averaging to obtain the classification information of the fusion expression corresponding to each frame of image of the original training video; and concatenating the feature information of the face key points of each frame of image of the original training video multiplied by the first weight to be trained, the feature information of each frame of image of the original training video multiplied by the second weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the original training video. Fusing the feature information of each frame of image of the target training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression includes: adding the classification information of the target expression of each frame of image of the target training video to the preset classification information corresponding to the original expression and averaging to obtain the classification information of the fusion expression corresponding to each frame of image of the target training video; and concatenating the feature information of the face key points of each frame of image of the target training video multiplied by the third weight to be trained, the feature information of each frame of image of the target training video multiplied by the fourth weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the target training video.
In some embodiments, training the first generator and the second generator according to the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss comprises: performing a weighted summation of the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss to obtain a total loss; and training the first generator and the second generator according to the total loss.
According to still other embodiments of the present disclosure, there is provided an expression generation apparatus including: an acquisition module configured to acquire the feature information of each frame of image in the original video, the feature information of the face key points and the classification information of the original expression; a fusion module configured to fuse the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame of image; and a generation module configured to generate a fused image corresponding to each frame of image according to the feature information of the fused image corresponding to that frame, to obtain a target video, formed by the fused images corresponding to all the frames, in which the facial expression is the target expression.
According to still other embodiments of the present disclosure, there is provided an expression generation model training apparatus, including: an acquisition module configured to acquire a training pair composed of each frame image of an original training video and each frame image of a target training video; a first generation module configured to input each frame of image of the original training video into the first generator, acquire feature information of each frame of image of the original training video, feature information of the face key points and classification information of the original expression, fuse the feature information of each frame of image of the original training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain feature information of each frame of fused image corresponding to the original training video, and obtain each frame of fused image corresponding to the original training video output by the first generator according to the feature information of each frame of fused image corresponding to the original training video; a second generation module configured to input each frame of image of the target training video into the second generator, acquire the feature information of each frame of image of the target training video, the feature information of the face key points and the classification information of the target expression, fuse the feature information of each frame of image of the target training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression to obtain the feature information of each frame of fused image corresponding to the target training video, and obtain each frame of fused image corresponding to the target training video output by the second generator according to the feature information of each frame of fused image corresponding to the target training video; a determination module configured to determine the adversarial loss and the cycle consistency loss according to the fused images of the frames corresponding to the original training video and the fused images of the frames corresponding to the target training video; and a training module configured to train the first generator and the second generator according to the adversarial loss and the cycle consistency loss.
According to still further embodiments of the present disclosure, there is provided an electronic device including: a processor; and a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the expression generation method according to any of the foregoing embodiments, or the training method of the expression generation model according to any of the foregoing embodiments.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the expression generation method of any of the foregoing embodiments, or the training method of the expression generation model of any of the foregoing embodiments.
According to still other embodiments of the present disclosure, there is provided an expression generation system including: the expression generation device of any of the foregoing embodiments and the training device of the expression generation model of any of the foregoing embodiments.
In the above method, the feature information of each frame of image in the original video, the feature information of the face key points and the classification information of the original expression are extracted and fused with the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame of image; a fused image corresponding to each frame of image is then generated according to that feature information, and all the fused images form a target video in which the facial expression is the target expression. Because the feature information of the face key points is extracted and used in the feature fusion, the expression in the fused image is more realistic and smooth; because the target expression is generated directly by fusing in the preset classification information corresponding to the target expression, the target expression is compatible with the facial actions and mouth shape of the person in the original image, so the person's mouth shape, head movements and the like are not affected, the clarity of the original image is not affected, and the generated video is stable, clear and smooth.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present disclosure, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 illustrates a flow diagram of an expression generation method of some embodiments of the present disclosure.
Fig. 2 illustrates a schematic diagram of an expression generation method of some embodiments of the present disclosure.
Fig. 3 illustrates a flow diagram of a training method of an expression generation model of some embodiments of the present disclosure.
Fig. 4 illustrates a schematic diagram of a training method of an expression generation model of some embodiments of the present disclosure.
Fig. 5 shows a schematic structural diagram of an expression generation apparatus of some embodiments of the present disclosure.
Fig. 6 illustrates a schematic structural diagram of a training apparatus for an expression generation model according to some embodiments of the present disclosure.
Fig. 7 shows a schematic structural diagram of an electronic device of some embodiments of the present disclosure.
Fig. 8 shows a schematic structural diagram of an electronic device of further embodiments of the disclosure.
Fig. 9 illustrates a schematic structural diagram of an expression generation system of some embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments obtained by a person skilled in the art from the disclosed embodiments without creative effort fall within the protection scope of the present disclosure.
The present disclosure provides an expression generation method, which is described below with reference to fig. 1 to 2.
FIG. 1 is a flow chart of some embodiments of the expression generation methods of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S106.
In step S102, feature information of each frame of image in the original video, feature information of a face key point, and classification information of the original expression are obtained.
The original video may be a prerecorded video in which the facial expression is the original expression. The facial expressions across the frames are generally consistent; for example, the expression in most frames is a calm expression, i.e., the proportion of frames whose original expression is the calm (preset) expression exceeds a preset proportion.
In some embodiments, each frame of image in the original video is input into a face feature extraction model to obtain the output feature information of each frame of image; the feature information of each frame of image is input into a face key point detection model to obtain the coordinate information of the face key points of each frame of image; and the dimensionality of the coordinate information of all the face key points is reduced by principal component analysis (PCA) to obtain information of a preset dimension as the feature information of the face key points. The feature information of each frame of image is input into an expression classification model to obtain the classification information of the original expression of each frame of image.
The overall expression generation model comprises an encoder and a decoder. The encoder may include a face feature extraction model, a face key point detection model and an expression classification model, with the face feature extraction model connected to the face key point detection model and to the expression classification model. The face feature extraction model may adopt an existing model, for example a deep learning model with feature extraction capability such as VGG-19, ResNet or Transformer; the part of VGG-19 before block 5 can be used as the face feature extraction model. The face key point detection model and the expression classification model may also adopt existing models, such as an MLP (multi-layer perceptron), and may specifically be a 3-layer MLP. Expressions are generated after the training of the expression generation model is completed; the training process is described in detail later.
The feature information of each frame of image in the original video is, for example, the feature map output by the face feature extraction model. The key points include, for example, 68 key points such as the chin, the center of the eyebrows and the corners of the mouth, each represented by the horizontal and vertical coordinates of its position. After the coordinate information of each key point is obtained through the face key point detection model, in order to reduce redundant information and improve efficiency, PCA is used to reduce the dimensionality of the coordinate information of all face key points, and information of a preset dimension (for example, 6 dimensions, which achieves the best effect) is obtained as the feature information of the face key points. The expression classification model can output classifications such as neutral, happy and sad, which can be represented by one-hot coded vectors. The classification information of the original expression may be the code of the classification of the original expression in each frame of image of the original video, obtained through the expression classification model.
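As a rough illustration of how such an encoder could be assembled, the PyTorch sketch below uses a truncated VGG-19 backbone, a 3-layer MLP for the 68-point landmark regression, a 3-layer MLP for expression classification, and a frozen 6-dimensional PCA projection. The class name, layer sizes, input resolution (224x224) and the number of expression classes are all illustrative assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FaceEncoder(nn.Module):
    """Illustrative encoder: face feature map -> keypoint features + expression probabilities."""
    def __init__(self, num_keypoints=68, num_expressions=4, pca_dim=6):
        super().__init__()
        vgg = models.vgg19(weights=None)
        # Keep the VGG-19 feature layers before block 5 as the face feature extractor (assumption).
        self.backbone = nn.Sequential(*list(vgg.features.children())[:28])
        feat_dim = 512 * 14 * 14  # assuming 224x224 input frames
        self.keypoint_mlp = nn.Sequential(  # 3-layer MLP -> 68 (x, y) coordinates
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_keypoints * 2),
        )
        self.expr_mlp = nn.Sequential(      # 3-layer MLP -> expression logits
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_expressions),
        )
        # Fixed PCA projection from 136-d landmark coordinates to 6 dims,
        # fitted offline and never updated during training (assumption).
        self.register_buffer("pca_mean", torch.zeros(num_keypoints * 2))
        self.register_buffer("pca_components", torch.zeros(pca_dim, num_keypoints * 2))

    def forward(self, frame):
        fmap = self.backbone(frame)                                    # B x 512 x 14 x 14
        flat = fmap.flatten(1)
        coords = self.keypoint_mlp(flat)                               # B x 136
        kp_feat = (coords - self.pca_mean) @ self.pca_components.T     # B x 6 keypoint feature
        expr_prob = torch.softmax(self.expr_mlp(flat), dim=1)          # B x num_expressions
        return fmap, kp_feat, expr_prob
```

In this sketch the feature map, the 6-dimensional keypoint feature and the expression probability vector correspond to the three kinds of information that are later fused.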
In step S104, the feature information of each frame of image, the feature information of the key points of the face, the classification information of the original expression, and the preset classification information corresponding to the target expression are fused to obtain the feature information of the fused image corresponding to each frame of image.
In some embodiments, the classification information of the original expression of each frame of image is added to the preset classification information corresponding to the target expression and averaged to obtain the classification information of the fusion expression corresponding to each frame of image; and the feature information of the face key points of each frame of image multiplied by the first weight obtained by training, the feature information of each frame of image multiplied by the second weight obtained by training, and the classification information of the fusion expression corresponding to each frame of image are concatenated.
The target expression is different from the original expression, for example a smiling expression, and the preset classification information corresponding to the target expression is, for example, the preset one-hot code of the target expression. The preset classification information does not need to be obtained through a model; it is coded directly according to the preset (one-hot) coding rule. For example, the calm expression is coded as (1, 0, 0, 0) and the smiling expression as (0, 1, 0, 0). The classification information of the original expression is obtained through the expression classification model and may differ from the preset classification information corresponding to the original expression; for example, the original expression is a calm expression whose preset one-hot code is (1, 0, 0, 0), but the vector output by the expression classification model may be (0.8, 0.2, 0, 0).
The encoder may further include a feature fusion model, into which the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression are input for fusion. The parameters to be trained in the feature fusion model include a first weight and a second weight. For each frame of image, the first weight obtained by training is multiplied by the feature information of the face key points of the image to obtain a first feature vector, the second weight obtained by training is multiplied by the feature information of the image to obtain a second feature vector, and the first feature vector, the second feature vector and the classification information of the fusion expression corresponding to the image are concatenated to obtain the feature information of the fused image corresponding to the image. The first weight and the second weight serve to bring the value ranges of the three kinds of information onto a common scale.
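A minimal sketch of this fusion step follows, assuming the two learned weights are scalar parameters and the feature map is flattened before concatenation; the class name, tensor shapes and the 4-class expression codes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse image features, keypoint features and expression codes (illustrative)."""
    def __init__(self):
        super().__init__()
        self.w_keypoint = nn.Parameter(torch.tensor(1.0))  # first (trainable) weight
        self.w_image = nn.Parameter(torch.tensor(1.0))     # second (trainable) weight

    def forward(self, image_feat, keypoint_feat, source_expr_prob, target_expr_onehot):
        # Average the predicted source-expression vector with the preset target one-hot code.
        fused_expr = 0.5 * (source_expr_prob + target_expr_onehot)
        # Scale the two feature sources so their value ranges are comparable,
        # then concatenate everything into one fused feature vector.
        parts = [
            self.w_keypoint * keypoint_feat,          # B x 6
            self.w_image * image_feat.flatten(1),     # B x (C*H*W)
            fused_expr,                               # B x num_expressions
        ]
        return torch.cat(parts, dim=1)

# Example: a calm frame pushed toward a smiling target expression (assumed 4 classes).
fusion = FeatureFusion()
image_feat = torch.randn(1, 512, 14, 14)
keypoint_feat = torch.randn(1, 6)
source_prob = torch.tensor([[0.8, 0.2, 0.0, 0.0]])    # output of the expression classifier
target_onehot = torch.tensor([[0.0, 1.0, 0.0, 0.0]])  # preset code for the target expression
fused = fusion(image_feat, keypoint_feat, source_prob, target_onehot)
```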
In step S106, a fused image corresponding to each frame of image is generated according to the feature information of the fused image corresponding to that frame, and a target video, formed by the fused images corresponding to all the frames, in which the facial expression is the target expression is obtained.
In some embodiments, the feature information of the fused image corresponding to each frame of image is input into a decoder, which outputs the generated fused image corresponding to each frame of image. The face feature extraction model includes convolutional layers and the decoder includes a deconvolution (transposed convolution) layer, so that an image can be generated from the features. The decoder is, for example, block 5 of VGG-19 with the last convolutional layer replaced by a deconvolution layer. The fused image is an image in which the facial expression is the target expression, and the fused frames form the target video.
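The decoder could, very roughly, look like the sketch below: a few convolutions followed by a transposed convolution that upsamples back to the input resolution. The exact layer configuration, channel counts and output size are assumptions rather than the patent's concrete design.

```python
import torch
import torch.nn as nn

class FusedImageDecoder(nn.Module):
    """Decode a fused feature map back into an RGB face image (illustrative)."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_channels, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
            # Last convolution replaced by a transposed convolution that
            # upsamples the 14x14 feature map back to 224x224.
            nn.ConvTranspose2d(128, 3, kernel_size=16, stride=16),
            nn.Tanh(),
        )

    def forward(self, fused_feature_map):
        return self.decode(fused_feature_map)

decoder = FusedImageDecoder()
frame = decoder(torch.randn(1, 512, 14, 14))   # -> 1 x 3 x 224 x 224
```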
Some application examples of the present disclosure are described below in conjunction with fig. 2.
As shown in fig. 2, feature extraction is performed on a frame of image in the original video to obtain a feature map. Face key point detection and expression classification are then performed on the feature map. PCA is applied to the coordinate information of the key points obtained by face key point detection, and the information reduced to the preset dimension is used as the key point feature. The one-hot coded classification information of the original expression is fused with the preset classification information corresponding to the target expression to obtain an expression classification vector (the classification information of the fusion expression). The feature map, the expression classification vector and the key point feature of the face are then fused to obtain the feature information of the fused image, which is decoded to obtain a face image with the target expression.
The scheme of this embodiment extracts the feature information of each frame of image in the original video, the feature information of the face key points and the classification information of the original expression, fuses the extracted information with the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame of image, and then generates the fused image corresponding to each frame of image according to that feature information; all the fused images form a target video in which the facial expression is the target expression. In the above embodiment, the feature information of the face key points is extracted and used in the feature fusion, so the expression in the fused image is more realistic and smooth; the target expression is generated directly by fusing in the preset classification information corresponding to the target expression and is compatible with the facial actions and mouth shape of the person in the original image, so the person's mouth shape, head movements and the like are not affected, the clarity of the original image is not affected, and the generated video is stable, clear and smooth.
The following describes a training method of the expression generation model with reference to fig. 3.
FIG. 3 is a flow diagram of some embodiments of a training method of the expression generation model of the present disclosure. As shown in fig. 3, the method of this embodiment includes: steps S302 to S310.
In step S302, a training pair composed of each frame image of the original training video and each frame image of the target training video is obtained.
The original training video is a video in which the facial expression is the original expression, and the target training video is a video in which the facial expression is the target expression; the frames of the original training video do not need to correspond one-to-one with the frames of the target training video. The classification information of the original expression and the classification information of the target expression are labeled.
In step S304, each frame image of the original training video is input into the first generator, the feature information of each frame image of the original training video, the feature information of the face key point, and the classification information of the original expression are obtained, the feature information of each frame image of the original training video, the feature information of the face key point, the classification information of the original expression, and the preset classification information corresponding to the target expression are fused to obtain the feature information of each frame fused image corresponding to the original training video, and each frame fused image corresponding to the original training video output by the first generator is obtained according to the feature information of each frame fused image corresponding to the original training video.
After training is completed, the first generator is used as the expression generation model. In some embodiments, each frame of image in the original training video is input into a third face feature extraction model in the first generator to obtain the output feature information of each frame of image; the feature information of each frame of image is input into a first face key point detection model in the first generator to obtain the coordinate information of the face key points of each frame of image; the dimensionality of the coordinate information of all face key points is reduced by principal component analysis to obtain first information of a preset dimension, which is used as the feature information of the face key points of each frame of image of the original training video; and the feature information of each frame of image in the original training video is input into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame of image in the original training video.
Principal component analysis (PCA) is applied to the coordinate information of the face key points, reducing it to 6 dimensions (6 dimensions gave the best effect in extensive experiments). The PCA involves no trainable parameters: the PCA projection and the correspondence between input and output feature dimensions do not change during training, and during backpropagation the gradient is passed through the fixed correspondence obtained from the initial PCA to the preceding parameters.
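One way to realize a PCA projection that is fitted once and then kept fixed, while still letting gradients flow back through it to the keypoint detector, is to implement it as a frozen linear map. The sketch below assumes scikit-learn for the offline fit and uses placeholder landmark data; it is an illustrative implementation choice, not the patent's prescribed one.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA

# Fit PCA offline on landmark coordinates gathered from the training frames
# (placeholder random data here: 1000 samples of 68 (x, y) points).
landmarks = np.random.rand(1000, 136).astype(np.float32)
pca = PCA(n_components=6).fit(landmarks)

# Store the fitted projection as constant tensors; they are never updated.
mean = torch.from_numpy(pca.mean_)
components = torch.from_numpy(pca.components_)  # 6 x 136

def project_keypoints(coords: torch.Tensor) -> torch.Tensor:
    """Project 136-d landmark coordinates to 6 dims with the frozen PCA.

    The matrix multiply is differentiable, so gradients reach the layers
    that produced `coords`, while `mean` and `components` stay constant.
    """
    return (coords - mean) @ components.T

coords = torch.randn(4, 136, requires_grad=True)
feat6 = project_keypoints(coords)      # 4 x 6 keypoint features
feat6.sum().backward()                 # gradient flows back to `coords` only
```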
In some embodiments, the classification information of the original expression of each frame of image of the original training video is added to the preset classification information corresponding to the target expression and averaged to obtain the classification information of the fusion expression corresponding to each frame of image of the original training video; and the feature information of the face key points of each frame of image of the original training video multiplied by the first weight to be trained, the feature information of each frame of image of the original training video multiplied by the second weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the original training video are concatenated to obtain the feature information of each frame of fused image corresponding to the original training video.
The first generator comprises a first feature fusion model, and the first weight and the second weight are parameters to be trained in the first feature fusion model. The above-mentioned processes of feature extraction and feature fusion can refer to the foregoing embodiments.
The first generator includes a first encoder and a first decoder. The first encoder includes the third face feature extraction model, the first face key point detection model, the third expression classification model and the first feature fusion model. The feature information of each frame of fused image corresponding to the original training video is input into the first decoder to obtain the generated fused image of each frame corresponding to the original training video.
In step S306, each frame of image of the target training video is input into the second generator, the feature information of each frame of image of the target training video, the feature information of the face key point, and the classification information of the target expression are obtained, the feature information of each frame of image of the target training video, the feature information of the face key point, the classification information of the target expression, and the preset classification information corresponding to the original expression are fused to obtain the feature information of each frame of fused image corresponding to the target training video, and each frame of fused image corresponding to the target training video output by the second generator is obtained according to the feature information of each frame of fused image corresponding to the target training video.
The second generator has the same or a similar structure to the first generator; its training objective is to generate, from the target training video, video whose expression is the same as that of the original training video.
In some embodiments, each frame of image in the target training video is input into a fourth face feature extraction model in the second generator, and feature information of each output frame of image is obtained; inputting the feature information of each frame of image into a second face key point detection model in a second generator to obtain the coordinate information of the face key points of each frame of image; and reducing the dimension of the coordinate information of all the face key points by adopting a principal component analysis method to obtain second information of a preset dimension, wherein the second information is used as the characteristic information of the face key points of each frame of image of the target training video. And inputting the feature information of each frame of image in the target training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame of image in the target training video.
The dimension of the feature information of the face key points of each frame image of the target training video is the same as that of the feature information of the face key points of each frame image of the original training video, for example, 6 dimensions.
In some embodiments, the classification information of the target expression of each frame of image of the target training video is added to the preset classification information corresponding to the original expression and averaged to obtain the classification information of the fusion expression corresponding to each frame of image of the target training video; and the feature information of the face key points of each frame of image of the target training video multiplied by the third weight to be trained, the feature information of each frame of image of the target training video multiplied by the fourth weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the target training video are concatenated to obtain the feature information of each frame of fused image corresponding to the target training video.
The preset classification information corresponding to the original expression does not need to be obtained through a model; it is coded directly according to the preset coding rule. The second generator includes a second feature fusion model, and the third weight and the fourth weight are parameters to be trained in the second feature fusion model. The above processes of feature extraction and feature fusion may refer to the foregoing embodiments and are not repeated.
The second generator includes a second encoder and a second decoder. The second encoder includes the fourth face feature extraction model, the second face key point detection model, the fourth expression classification model and the second feature fusion model. The feature information of each frame of fused image corresponding to the target training video is input into the second decoder to obtain the generated fused image of each frame corresponding to the target training video.
In step S308, the adversarial loss and the cycle consistency loss are determined according to the fused image of each frame corresponding to the original training video and the fused image of each frame corresponding to the target training video.
End-to-end training is performed based on generative adversarial learning and cross-domain transfer learning, which improves both the accuracy of the model and the training efficiency.
In some embodiments, the adversarial loss is determined using the following method: inputting each frame of fused image corresponding to the original training video into a first discriminator to obtain a first discrimination result of each frame of fused image corresponding to the original training video; inputting each frame of fused image corresponding to the target training video into a second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the target training video; and determining a first adversarial loss according to the first discrimination result of each frame of fused image corresponding to the original training video, and determining a second adversarial loss according to the second discrimination result of each frame of fused image corresponding to the target training video.
Further, in some embodiments, each frame of fused image corresponding to the original training video is input into the first face feature extraction model in the first discriminator to obtain the output feature information of each frame of fused image corresponding to the original training video; the feature information of each frame of fused image corresponding to the original training video is input into the first expression classification model in the first discriminator to obtain the expression classification information of each frame of fused image corresponding to the original training video as the first discrimination result; each frame of fused image corresponding to the target training video is input into the second face feature extraction model in the second discriminator to obtain the output feature information of each frame of fused image corresponding to the target training video; and the feature information of each frame of fused image corresponding to the target training video is input into the second expression classification model in the second discriminator to obtain the expression classification information of each frame of fused image corresponding to the target training video as the second discrimination result.
The overall model comprises two sets of generators and discriminators in the training process. The first discriminator and the second discriminator have the same or similar structures and both comprise a face feature extraction model and an expression classification model. The first facial feature extraction model and the second facial feature extraction model are the same as or similar to the third facial feature extraction model and the fourth facial feature extraction model in structure, and the first expression classification model and the second expression classification model are the same as or similar to the third expression classification model and the fourth expression classification model in structure.
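A schematic sketch of such a discriminator is given below, reusing the same truncated-VGG feature extractor and 3-layer MLP classifier shapes assumed in the encoder sketch earlier. Per the description, the expression classification of the input frame serves as the discrimination result; the class name, backbone choice and layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ExpressionDiscriminator(nn.Module):
    """Discriminator = face feature extraction model + expression classification model (illustrative)."""
    def __init__(self, num_expressions=4):
        super().__init__()
        vgg = models.vgg19(weights=None)
        self.backbone = nn.Sequential(*list(vgg.features.children())[:28])
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 14 * 14, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_expressions),
        )

    def forward(self, frame):
        # Expression logits of the (real or generated) frame serve as the judgment result.
        return self.classifier(self.backbone(frame))

d_y = ExpressionDiscriminator()            # judges frames that should show the target expression
logits = d_y(torch.randn(1, 3, 224, 224))  # 1 x num_expressions
```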
For example, let X = {x_i} denote the data of the original training video and Y = {y_i} denote the data of the target training video. The first generator G realizes X → Y and is trained so that G(x) is as close to Y as possible; the first discriminator D_Y is used to judge the authenticity of each frame of fused image corresponding to the original training video. The first adversarial loss can be expressed by the following equation:

L_GAN(G, D_Y, X, Y) = E_{y~p(y)}[log D_Y(y)] + E_{x~p(x)}[log(1 − D_Y(G(x)))]    (1)
The second generator F realizes Y → X and is trained so that F(y) is as close to X as possible; the second discriminator D_X is used to judge the authenticity of each frame of fused image corresponding to the target training video. The second adversarial loss can be expressed by the following equation:

L_GAN(F, D_X, Y, X) = E_{x~p(x)}[log D_X(x)] + E_{y~p(y)}[log(1 − D_X(F(y)))]    (2)
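For reference, the two terms above follow the standard CycleGAN adversarial formulation. The sketch below computes them with binary cross-entropy, treating the discriminator output as a single real/fake score; this is the common simplification of the generic GAN objective, not the patent's exact discriminator output (which is an expression classification).

```python
import torch
import torch.nn.functional as F

def adversarial_losses(d_score_real, d_score_fake):
    """Standard GAN losses from discriminator logits (illustrative).

    d_score_real: D(y) on real frames of the target domain.
    d_score_fake: D(G(x)) on generated (fused) frames.
    Returns (discriminator_loss, generator_loss).
    """
    real_labels = torch.ones_like(d_score_real)
    fake_labels = torch.zeros_like(d_score_fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_score_real, real_labels)
              + F.binary_cross_entropy_with_logits(d_score_fake, fake_labels))
    # The generator tries to make the discriminator label its outputs as real.
    g_loss = F.binary_cross_entropy_with_logits(d_score_fake, real_labels)
    return d_loss, g_loss
```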
In some embodiments, the cycle consistency loss is determined using the following method: inputting each frame of fused image corresponding to the original training video into the second generator to generate each frame of reconstructed image of the original training video, and inputting each frame of fused image corresponding to the target training video into the first generator to generate each frame of reconstructed image of the target training video; and determining the cycle consistency loss according to the difference between each frame of reconstructed image of the original training video and each frame of image of the original training video and the difference between each frame of reconstructed image of the target training video and each frame of image of the target training video.
To further improve the accuracy of the model, the images generated by the first generator are input into the second generator to obtain each frame of reconstructed image of the original training video, and each frame of reconstructed image generated by the second generator is expected to be as consistent as possible with the corresponding frame of the original training video, i.e., F(G(x)) ≈ x. Similarly, the images generated by the second generator are input into the first generator to obtain each frame of reconstructed image of the target training video, and each frame of reconstructed image generated by the first generator is expected to be as consistent as possible with the corresponding frame of the target training video, i.e., G(F(y)) ≈ y.
The difference between each frame of reconstructed image of the original training video and each frame of image of the original training video can be determined as follows: for each frame of reconstructed image of the original training video and the corresponding frame of the original training video, determine the distance (for example, the Euclidean distance) between the representation vectors of the pixels at each identical position in the two images, and sum all the distances.
The difference between each frame of reconstructed image of the target training video and each frame of image of the target training video can be determined in the same way: for each frame of reconstructed image of the target training video and the corresponding frame of the target training video, determine the distance (for example, the Euclidean distance) between the representation vectors of the pixels at each identical position, and sum all the distances.
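A minimal sketch of this cycle consistency loss is shown below, assuming frame tensors of shape (frames, channels, height, width) and the Euclidean distance over the per-pixel channel vectors; the names are illustrative assumptions.

```python
import torch

def cycle_consistency_loss(G, F, original_frames, target_frames):
    # F(G(x)) should reconstruct the original frames, G(F(y)) the target frames.
    reconstructed_x = F(G(original_frames))
    reconstructed_y = G(F(target_frames))
    # Euclidean distance between the per-pixel vectors at the same positions,
    # summed over all positions and frames.
    loss_x = torch.linalg.norm(reconstructed_x - original_frames, dim=1).sum()
    loss_y = torch.linalg.norm(reconstructed_y - target_frames, dim=1).sum()
    return loss_x + loss_y
```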
In step S310, the first generator and the second generator are trained according to the adversarial loss and the cycle consistency loss.
The first adversarial loss, the second adversarial loss and the cycle consistency loss can be weighted and summed to obtain a total loss, and the first generator and the second generator are trained according to the total loss. For example, the total loss may be determined using the following equation:
L = L_{GAN}(G, D_Y, X, Y) + L_{GAN}(F, D_X, Y, X) + \lambda L_{cyc}(G, F)    (3)
where L_{cyc}(G, F) denotes the cycle consistency loss, and λ is a weight that can be determined experimentally.
To further improve the accuracy of the model and ensure the stability and continuity of the output video, a loss based on the pixel difference between adjacent frames of the video is added during training. In some embodiments, the pixel-to-pixel loss is determined according to the pixel differences between every two adjacent frames of fused images corresponding to the original training video and the pixel differences between every two adjacent frames of fused images corresponding to the target training video, and the first generator and the second generator are trained according to the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss.
Further, in some embodiments, for each position in every two adjacent frames of fused images corresponding to the original training video, the distance between the representation vectors of the two pixels at that position in the two adjacent frames is determined, and the distances for all positions are summed to obtain a first loss; for each position in every two adjacent frames of fused images corresponding to the target training video, the distance between the representation vectors of the two pixels at that position is determined, and the distances for all positions are summed to obtain a second loss; the first loss and the second loss are added to obtain the pixel-to-pixel loss, as sketched below. The pixel-to-pixel loss prevents adjacent frames of the generated video from changing too much.
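A minimal sketch of the pixel-to-pixel loss under the same tensor-shape assumption as above; the names are illustrative, not the patent's code.

```python
import torch

def pixel_to_pixel_loss(fused_x: torch.Tensor, fused_y: torch.Tensor) -> torch.Tensor:
    """fused_x / fused_y: generated frame sequences of shape (frames, channels, H, W)."""
    def adjacent_distance(frames: torch.Tensor) -> torch.Tensor:
        diff = frames[1:] - frames[:-1]              # difference of each pair of adjacent frames
        return torch.linalg.norm(diff, dim=1).sum()  # per-pixel Euclidean distance, summed
    first_loss = adjacent_distance(fused_x)    # frames generated from the original video
    second_loss = adjacent_distance(fused_y)   # frames generated from the target video
    return first_loss + second_loss
```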
In some embodiments, the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss are weighted and summed to obtain a total loss, and the first generator and the second generator are trained according to the total loss. For example, the total loss may be determined using the following equation:
L = L_{GAN}(G, D_Y, X, Y) + L_{GAN}(F, D_X, Y, X) + \lambda_1 L_{cyc}(G, F) + \lambda_2 L_{P2P}(G(x_i), G(x_{i+1})) + \lambda_3 L_{P2P}(F(y_j), F(y_{j+1}))    (4)
where λ_1, λ_2 and λ_3 are weights that can be determined experimentally, L_{P2P}(G(x_i), G(x_{i+1})) denotes the first loss, and L_{P2P}(F(y_j), F(y_{j+1})) denotes the second loss.
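Combining the terms of equation (4) is then a plain weighted sum, as in the sketch below; the λ default values are placeholders to be tuned experimentally.

```python
def total_loss(loss_gan_g, loss_gan_f, loss_cyc, loss_p2p_g, loss_p2p_f,
               lambda_1=10.0, lambda_2=1.0, lambda_3=1.0):
    """Weighted sum of equation (4); lambda_* are experimentally chosen weights."""
    return (loss_gan_g + loss_gan_f
            + lambda_1 * loss_cyc
            + lambda_2 * loss_p2p_g
            + lambda_3 * loss_p2p_f)
```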
As shown in fig. 4, before the end-to-end training, the models of each part may be pre-trained. For example, a large amount of open-source face recognition data is used to pre-train a face recognition model, and the part before the output feature map is taken as the face feature extraction model (this choice is not unique; for example, with VGG-19 the part before block5 may be taken, which can output a feature map of dimensions 8 × 8 × 512). The face feature extraction model and its parameters are then fixed, and the network after it is split into two branches, a face key point detection model and an expression classification model, which are fine-tuned separately with a face key point detection data set and expression classification data to train the parameters of these two models. The face key point detection model is not unique; any convolutional-network-based model that yields accurate key points can be used. The expression classification model performs a single-label classification task based on a convolutional network. After this pre-training, the end-to-end training process of the previous embodiments may be performed, which improves training efficiency. A possible setup is sketched below.
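As a hedged sketch only: one way to arrange this pre-training in PyTorch/torchvision, assuming a VGG-19 backbone truncated before block5, 68 key points and 7 expression classes; the truncation index, head structures and the prior face-recognition pre-training are all assumptions.

```python
import torch.nn as nn
from torchvision.models import vgg19

# Backbone assumed to be pre-trained elsewhere (e.g. on open-source face
# recognition data), truncated before block5 so it outputs a 512-channel feature map.
backbone = vgg19(weights=None).features[:28]   # assumed truncation index for "before block5"
for param in backbone.parameters():
    param.requires_grad = False                # fix the face feature extraction model

num_keypoints, num_expressions = 68, 7         # illustrative assumptions
keypoint_head = nn.Sequential(                 # face key point detection branch (fine-tuned)
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, num_keypoints * 2))
expression_head = nn.Sequential(               # expression classification branch (fine-tuned)
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, num_expressions))
```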
According to the above method, the overall model is trained with the adversarial loss, the cycle consistency loss and the pixel loss between adjacent frames of the video, which improves the accuracy of the model, increases the efficiency of the end-to-end training process and saves computing resources.
The scheme of the disclosure is suitable for editing the facial expression in a single image. The disclosure adopts a dedicated deep learning model that fuses techniques such as expression recognition and key point detection; through training on data, it learns how the facial key points move under different expressions, and the facial expression state output by the model is ultimately controlled by inputting the classification information of the target expression to the model. The expression exists only as a style state, so it can be cleanly superimposed on actions such as speaking, tilting the head or blinking, and the facial action video finally output for the person looks natural and not jarring. The output can have the same resolution and level of detail as the input image, and remains stable, clear and flawless even at 1080p or 2K resolution.
The present disclosure also provides an expression generating apparatus, which is described below with reference to fig. 5.
Fig. 5 is a block diagram of some embodiments of the expression generation apparatus of the present disclosure. As shown in fig. 5, the apparatus 50 of this embodiment includes: an obtaining module 510, a fusing module 520, and a generating module 530.
The obtaining module 510 is configured to obtain feature information of each frame of image in an original video, feature information of a face key point, and classification information of an original expression.
In some embodiments, the obtaining module 510 is configured to input each frame of image in the original video into a face feature extraction model, so as to obtain feature information of each frame of output image; inputting the characteristic information of each frame of image into a face key point detection model to obtain the coordinate information of the face key points of each frame of image; and reducing the dimensions of the coordinate information of all the face key points by adopting a principal component analysis method to obtain the information of preset dimensions as the characteristic information of the face key points.
In some embodiments, the obtaining module 510 is configured to input the feature information of each frame of image into the expression classification model, so as to obtain classification information of the original expression of each frame of image.
The fusion module 520 is configured to fuse the feature information of each frame of image, the feature information of the key points of the face, the classification information of the original expression, and the preset classification information corresponding to the target expression to obtain the feature information of the fusion image corresponding to each frame of image.
In some embodiments, the fusion module 520 is configured to add and average the classification information of the original expression of each frame of image and the preset classification information corresponding to the target expression to obtain the classification information of the fused expression corresponding to each frame of image, and to splice the feature information of the face key points of each frame of image multiplied by the trained first weight, the feature information of each frame of image multiplied by the trained second weight, and the classification information of the fused expression corresponding to each frame of image, as sketched below.
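A minimal sketch of this fusion step, assuming all inputs are already flattened feature vectors and that w1 and w2 are the trained scalar weights; the names are illustrative, not the patent's code.

```python
import torch

def fuse_features(image_features, keypoint_features,
                  original_expr_cls, target_expr_cls, w1, w2):
    # Add and average the original-expression classification info with the preset
    # classification info of the target expression.
    fused_expr_cls = (original_expr_cls + target_expr_cls) / 2.0
    # Splice (concatenate) weighted key point features, weighted image features
    # and the fused expression classification info.
    return torch.cat([w1 * keypoint_features, w2 * image_features, fused_expr_cls], dim=-1)
```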
The generating module 530 is configured to generate a fused image corresponding to each frame of image according to the feature information of the fused image corresponding to each frame of image, and to obtain a target video, composed of the fused images corresponding to all the frames, in which the facial expression is the target expression.
In some embodiments, the generating module 530 is configured to input the feature information of the fused image corresponding to each frame of image into a decoder and output the generated fused image corresponding to each frame of image, where the face feature extraction model comprises convolution layers and the decoder comprises deconvolution layers.
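A minimal sketch of such a decoder built from deconvolution (transposed convolution) layers; the number of layers and channel widths are assumptions, not the patent's architecture.

```python
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1), nn.Tanh(),  # RGB image output
)
```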
The present disclosure also provides an expression generation model training apparatus, which is described below with reference to fig. 6.
FIG. 6 is a block diagram of some embodiments of a training apparatus for expression generation models according to the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes: an acquisition module 610, a first generation module 620, a second generation module 630, a determination module 640, and a training module 650.
The obtaining module 610 is configured to obtain a training pair composed of each frame image of the original training video and each frame image of the target training video.
The first generating module 620 is configured to input each frame image of the original training video into the first generator, acquire feature information of each frame image of the original training video, feature information of a face key point, and classification information of an original expression, fuse the feature information of each frame image of the original training video, the feature information of the face key point, the classification information of the original expression, and preset classification information corresponding to the target expression to obtain feature information of each frame fusion image corresponding to the original training video, and obtain each frame fusion image corresponding to the original training video output by the first generator according to the feature information of each frame fusion image corresponding to the original training video.
In some embodiments, the first generating module 620 is configured to input each frame of image in the original training video into the third facial feature extraction model in the first generator, so as to obtain feature information of each output frame of image; inputting the characteristic information of each frame of image into a first face key point detection model in a first generator to obtain the coordinate information of the face key points of each frame of image; reducing the dimension of coordinate information of all face key points by adopting a principal component analysis method to obtain first information of a preset dimension, wherein the first information is used as feature information of the face key points of each frame of image of an original training video; and inputting the characteristic information of each frame of image in the original training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame of image in the original training video.
In some embodiments, the first generation module 620 is configured to add and average the classification information of the original expression of each frame of image of the original training video and the preset classification information corresponding to the target expression to obtain the classification information of the fused expression corresponding to each frame of image of the original training video; and splicing the feature information of the face key points of each frame of image of the original training video multiplied by the first weight to be trained, the feature information of each frame of image of the original training video multiplied by the second weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the original training video.
The second generating module 630 is configured to input each frame image of the target training video into the second generator, obtain feature information of each frame image of the target training video, feature information of a face key point, and classification information of a target expression, fuse the feature information of each frame image of the target training video, the feature information of the face key point, the classification information of the target expression, and preset classification information corresponding to the original expression to obtain feature information of each frame fusion image corresponding to the target training video, and obtain each frame fusion image corresponding to the target training video output by the second generator according to the feature information of each frame fusion image corresponding to the target training video.
In some embodiments, the second generating module 630 is configured to input each frame of image in the target training video into a fourth face feature extraction model in the second generator, so as to obtain feature information of each frame of image that is output; inputting the feature information of each frame of image into a second face key point detection model in a second generator to obtain the coordinate information of the face key points of each frame of image; reducing the dimension of the coordinate information of all face key points by adopting a principal component analysis method to obtain second information of a preset dimension, wherein the second information is used as the feature information of the face key points of each frame of image of the target training video; and inputting the feature information of each frame of image in the target training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame of image in the target training video.
In some embodiments, the second generating module 630 is configured to add and average the classification information of the target expression of each frame of image of the target training video and the preset classification information corresponding to the original expression to obtain the classification information of the fusion expression corresponding to each frame of image of the target training video; and splicing the feature information of the face key points of each frame of image of the target training video multiplied by the third weight to be trained, the feature information of each frame of image of the target training video multiplied by the fourth weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the target training video.
The determining module 640 is configured to determine the adversarial loss and the cycle consistency loss according to each frame of fused image corresponding to the original training video and each frame of fused image corresponding to the target training video.
The training module 650 is configured to train the first generator and the second generator according to the adversarial loss and the cycle consistency loss.
In some embodiments, the determining module 640 is configured to determine the pixel-to-pixel loss according to the pixel difference between every two adjacent frames of fused images corresponding to the original training video and the pixel difference between every two adjacent frames of fused images corresponding to the target training video; the training module 650 is configured to train the first generator and the second generator according to the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss.
In some embodiments, the determining module 640 is configured to input each frame of fused image corresponding to the original training video into the first discriminator to obtain a first discrimination result of each frame of fused image corresponding to the original training video; input each frame of fused image corresponding to the target training video into the second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the target training video; and determine a first adversarial loss according to the first discrimination result of each frame of fused image corresponding to the original training video and a second adversarial loss according to the second discrimination result of each frame of fused image corresponding to the target training video.
In some embodiments, the determining module 640 is configured to input each frame of fused image corresponding to the original training video into the first face feature extraction model in the first discriminator to obtain the feature information of each frame of fused image corresponding to the original training video; input the feature information of each frame of fused image corresponding to the original training video into the first expression classification model in the first discriminator to obtain the expression classification information of each frame of fused image corresponding to the original training video as the first discrimination result; input each frame of fused image corresponding to the target training video into the second face feature extraction model in the second discriminator to obtain the feature information of each frame of fused image corresponding to the target training video; and input the feature information of each frame of fused image corresponding to the target training video into the second expression classification model in the second discriminator to obtain the expression classification information of each frame of fused image corresponding to the target training video as the second discrimination result.
In some embodiments, the determining module 640 is configured to input each frame of fused image corresponding to the original training video into the second generator to generate each frame of reconstructed image of the original training video, input each frame of fused image corresponding to the target training video into the first generator to generate each frame of reconstructed image of the target training video, and determine the cycle consistency loss according to the difference between each frame of reconstructed image of the original training video and each frame of image of the original training video and the difference between each frame of reconstructed image of the target training video and each frame of image of the target training video.
In some embodiments, the determining module 640 is configured to determine, for each position in every two adjacent frames of fused images corresponding to the original training video, the distance between the representation vectors of the two pixels at that position in the two adjacent frames, and sum the distances for all positions to obtain a first loss; determine, for each position in every two adjacent frames of fused images corresponding to the target training video, the distance between the representation vectors of the two pixels at that position, and sum the distances for all positions to obtain a second loss; and add the first loss and the second loss to obtain the pixel-to-pixel loss.
In some embodiments, the training module 650 is configured to weight and sum the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss to obtain a total loss, and to train the first generator and the second generator according to the total loss.
The expression generation apparatus and the expression generation model training apparatus in the embodiments of the present disclosure may be implemented by various computing devices or computer systems, and are described below with reference to fig. 7 and 8.
Fig. 7 is a block diagram of some embodiments of an electronic device of the present disclosure. As shown in fig. 7, the electronic apparatus 70 of this embodiment includes: a memory 710 and a processor 720 coupled to the memory 710, the processor 720 being configured to perform an expression generation method or a training method of an expression generation model in any of some embodiments of the present disclosure based on instructions stored in the memory 710.
Memory 710 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), a database, and other programs.
Fig. 8 is a block diagram of further embodiments of an electronic device of the present disclosure. As shown in fig. 8, the electronic apparatus 80 of this embodiment includes: a memory 810 and a processor 820, similar to the memory 710 and the processor 720, respectively. It may also include an input/output interface 830, a network interface 840, a storage interface 850, and the like. These interfaces 830, 840, 850, the memory 810 and the processor 820 may be connected, for example, by a bus 860. The input/output interface 830 provides a connection interface for input/output devices such as a display, a mouse, a keyboard and a touch screen. The network interface 840 provides a connection interface for various networking devices, such as a database server or a cloud storage server. The storage interface 850 provides a connection interface for external storage devices such as an SD card or a USB flash drive.
The present disclosure also provides an expression generation system, as shown in fig. 9, the expression generation system 9 includes the expression generation apparatus 50 of any of the foregoing embodiments, and the training apparatus 60 of the expression generation model of any of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is meant to be illustrative of the preferred embodiments of the present disclosure and not to be taken as limiting the disclosure, and any modifications, equivalents, improvements and the like that are within the spirit and scope of the present disclosure are intended to be included therein.

Claims (20)

1. An expression generation method comprising:
acquiring feature information of each frame of image in an original video, feature information of a face key point and classification information of an original expression;
fusing the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame of image;
and generating a fusion image corresponding to each frame of image according to the feature information of the fusion image corresponding to each frame of image, and obtaining a target video with a target expression, wherein the target expression is a facial expression formed by the fusion images corresponding to all the images.
2. The expression generation method of claim 1, wherein the obtaining of feature information of each frame of image in an original video and feature information of key points of a human face comprises:
inputting each frame of image in the original video into a face feature extraction model to obtain the feature information of each frame of output image;
inputting the feature information of each frame of image into a face key point detection model to obtain the coordinate information of the face key points of each frame of image;
and reducing the dimensions of the coordinate information of all the key points of the face by adopting a principal component analysis method to obtain the information of preset dimensions as the characteristic information of the key points of the face.
3. The expression generation method of claim 2, wherein the obtaining of the classification information of the original expression of each frame of image in the original video comprises:
and inputting the characteristic information of each frame of image into an expression classification model to obtain the classification information of the original expression of each frame of image.
4. The expression generation method of claim 1, wherein the fusing the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression comprises:
adding and averaging the classification information of the original expression of each frame of image and preset classification information corresponding to the target expression to obtain the classification information of the fusion expression corresponding to each frame of image;
and splicing the feature information of the face key points of each frame of image multiplied by the trained first weight, the feature information of each frame of image multiplied by the trained second weight and the classification information of the fusion expression corresponding to each frame of image.
5. The expression generation method according to claim 2, wherein the generating of the fused image corresponding to each frame of image according to the feature information of the fused image corresponding to each frame of image includes:
inputting the feature information of the fused image corresponding to each frame of image into a decoder, and outputting the generated fused image corresponding to each frame of image;
the face feature extraction model comprises a convolution layer, and the decoder comprises an anti-convolution layer.
6. A training method of an expression generation model comprises the following steps:
acquiring a training pair consisting of each frame image of an original training video and each frame image of a target training video;
inputting each frame of image of the original training video into a first generator, acquiring feature information of each frame of image of the original training video, feature information of a face key point and classification information of an original expression, fusing the feature information of each frame of image of the original training video, the feature information of the face key point, the classification information of the original expression and preset classification information corresponding to a target expression to obtain feature information of each frame of fused image corresponding to the original training video, and obtaining each frame of fused image corresponding to the original training video output by the first generator according to the feature information of each frame of fused image corresponding to the original training video;
inputting each frame of image of the target training video into a second generator, acquiring feature information of each frame of image of the target training video, feature information of a face key point and classification information of a target expression, fusing the feature information of each frame of image of the target training video, the feature information of the face key point, the classification information of the target expression and preset classification information corresponding to an original expression to obtain the feature information of each frame of fused image corresponding to the target training video, and obtaining each frame of fused image corresponding to the target training video output by the second generator according to the feature information of each frame of fused image corresponding to the target training video;
determining the adversarial loss and the cycle consistency loss according to each frame of fused image corresponding to the original training video and each frame of fused image corresponding to the target training video;
training the first generator and the second generator according to the adversarial loss and the cycle consistency loss.
7. The training method of claim 6, further comprising:
determining pixel-to-pixel loss according to the pixel difference between every two adjacent frames of fused images corresponding to the original training video and the pixel difference between every two adjacent frames of fused images corresponding to the target training video;
wherein training the first generator and the second generator according to the adversarial loss and the cycle consistency loss comprises:
training the first generator and the second generator according to the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss.
8. The training method according to claim 6 or 7, wherein determining the adversarial loss according to each frame of fused image corresponding to the original training video and each frame of fused image corresponding to the target training video comprises:
inputting each frame of fused image corresponding to the original training video into a first discriminator to obtain a first discrimination result of each frame of fused image corresponding to the original training video;
inputting each frame of fused image corresponding to the target training video into a second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the target training video;
and determining a first adversarial loss according to the first discrimination result of each frame of fused image corresponding to the original training video, and determining a second adversarial loss according to the second discrimination result of each frame of fused image corresponding to the target training video.
9. The training method according to claim 8, wherein the inputting of the fused image of each frame corresponding to the original training video into a first discriminator to obtain a first discrimination result of the fused image of each frame corresponding to the original training video comprises:
inputting each frame of fused image corresponding to the original training video into a first face feature extraction model in the first discriminator to obtain feature information of each frame of fused image corresponding to the output original training video;
inputting the feature information of each frame of fused image corresponding to the original training video into a first expression classification model in the first discriminator to obtain the classification information of the expression of each frame of fused image corresponding to the original training video as a first discrimination result;
inputting each frame of fused image corresponding to the target training video into a second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the target training video comprises:
inputting each frame of fused image corresponding to the target training video into a second face feature extraction model in the second discriminator to obtain feature information of each frame of fused image corresponding to the output target training video;
and inputting the feature information of each frame of fused image corresponding to the target training video into a second expression classification model in the second discriminator to obtain the classification information of the expression of each frame of fused image corresponding to the target training video as the second discrimination result.
10. The training method according to claim 6 or 7, wherein the cycle consistency loss is determined using the following method:
inputting each frame fusion image corresponding to the original training video into the second generator to generate each frame reconstruction image of the original training video, and inputting each frame fusion image corresponding to the target training video into the first generator to generate each frame reconstruction image of the target training video;
and determining the cycle consistency loss according to the difference between each frame of reconstructed image of the original training video and each frame of image of the original training video and the difference between each frame of reconstructed image of the target training video and each frame of image of the target training video.
11. The training method of claim 7, wherein the pixel-to-pixel loss is determined using the following method:
determining, for each position in every two adjacent frames of fused images corresponding to the original training video, the distance between the representation vectors of the two pixels at that position, and summing the distances corresponding to all positions to obtain a first loss;
determining, for each position in every two adjacent frames of fused images corresponding to the target training video, the distance between the representation vectors of the two pixels at that position, and summing the distances corresponding to all positions to obtain a second loss;
and adding the first loss and the second loss to obtain the pixel-to-pixel loss.
12. The training method according to claim 6, wherein the obtaining of the feature information of each frame of image of the original training video and the feature information of the face key point comprises:
inputting each frame of image in the original training video into a third facial feature extraction model in the first generator to obtain the output feature information of each frame of image; inputting the feature information of each frame of image into a first face key point detection model in the first generator to obtain coordinate information of the face key points of each frame of image; reducing the dimension of the coordinate information of all face key points by adopting a principal component analysis method to obtain first information of a preset dimension, wherein the first information is used as the characteristic information of the face key points of each frame of image of the original training video;
the obtaining of the feature information of each frame of image of the target training video and the feature information of the face key point includes:
inputting each frame of image in the target training video into a fourth face feature extraction model in the second generator to obtain the output feature information of each frame of image; inputting the feature information of each frame of image into a second face key point detection model in the second generator to obtain the coordinate information of the face key points of each frame of image; and reducing the dimension of the coordinate information of all the face key points by adopting a principal component analysis method to obtain second information of a preset dimension, wherein the second information is used as the feature information of the face key points of each frame of image of the target training video.
13. The training method of claim 12, wherein obtaining classification information of an original expression of each frame of image in the original training video comprises:
inputting the feature information of each frame of image in the original training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame of image in the original training video;
the obtaining of the classification information of the target expression of each frame of image in the target training video includes:
and inputting the feature information of each frame of image in the target training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame of image in the target training video.
14. The training method of claim 6, wherein the fusing the feature information of each frame of image of the original training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression comprises:
adding and averaging the classification information of the original expression of each frame of image of the original training video and the preset classification information corresponding to the target expression to obtain the classification information of the fusion expression corresponding to each frame of image of the original training video; splicing the feature information of the face key points of each frame image of the original training video multiplied by the first weight to be trained, the feature information of each frame image of the original training video multiplied by the second weight to be trained, and the classification information of the fusion expression corresponding to each frame image of the original training video;
the fusing the feature information of each frame of image of the target training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression comprises the following steps:
adding and averaging the classification information of the target expression of each frame of image of the target training video and the preset classification information corresponding to the original expression to obtain the classification information of the fusion expression corresponding to each frame of image of the target training video; and splicing the feature information of the face key points of each frame of image of the target training video multiplied by the third weight to be trained, the feature information of each frame of image of the target training video multiplied by the fourth weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the target training video.
15. The training method of claim 7, wherein said training the first generator and the second generator according to the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss comprises:
weighting and summing the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss to obtain a total loss;
training the first generator and the second generator according to the total loss.
16. An expression generation apparatus comprising:
the acquisition module is used for acquiring the feature information of each frame of image in the original video, the feature information of the key points of the human face and the classification information of the original expression;
the fusion module is used for fusing the feature information of each frame of image, the feature information of the key points of the face, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame of image;
and the generating module is used for generating a fused image corresponding to each frame of image according to the feature information of the fused image corresponding to each frame of image, and obtaining a target video with a target expression, wherein the target expression is a facial expression formed by the fused images corresponding to all the images.
17. An expression generation model training device, comprising:
the acquisition module is used for acquiring a training pair consisting of each frame image of an original training video and each frame image of a target training video;
the first generation module is used for inputting each frame image of the original training video into a first generator, acquiring feature information of each frame image of the original training video, feature information of a face key point and classification information of an original expression, fusing the feature information of each frame image of the original training video, the feature information of the face key point, the classification information of the original expression and preset classification information corresponding to a target expression to obtain feature information of each frame fused image corresponding to the original training video, and obtaining each frame fused image corresponding to the original training video output by the first generator according to the feature information of each frame fused image corresponding to the original training video;
the second generation module is used for inputting each frame of image of the target training video into a second generator, acquiring feature information of each frame of image of the target training video, feature information of a face key point and classification information of a target expression, fusing the feature information of each frame of image of the target training video, the feature information of the face key point, the classification information of the target expression and preset classification information corresponding to an original expression to obtain feature information of each frame of fused image corresponding to the target training video, and obtaining each frame of fused image corresponding to the target training video output by the second generator according to the feature information of each frame of fused image corresponding to the target training video;
the determining module is used for determining the adversarial loss and the cycle consistency loss according to each frame of fused image corresponding to the original training video and each frame of fused image corresponding to the target training video;
a training module for training the first generator and the second generator according to the adversarial loss and the cycle consistency loss.
18. An electronic device, comprising:
a processor; and
a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the expression generation method of any of claims 1-5 or the training method of the expression generation model of any of claims 6-15.
19. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the expression generation method of any one of claims 1 to 5, or the training method of the expression generation model of any one of claims 6 to 15.
20. An expression generation system comprising: the expression generation apparatus of claim 16 and the training apparatus of the expression generation model of claim 17.
CN202210540239.2A 2022-05-18 2022-05-18 Expression generation method and device and expression generation model training method and device Pending CN115035219A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210540239.2A CN115035219A (en) 2022-05-18 2022-05-18 Expression generation method and device and expression generation model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210540239.2A CN115035219A (en) 2022-05-18 2022-05-18 Expression generation method and device and expression generation model training method and device

Publications (1)

Publication Number Publication Date
CN115035219A true CN115035219A (en) 2022-09-09

Family

ID=83121873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210540239.2A Pending CN115035219A (en) 2022-05-18 2022-05-18 Expression generation method and device and expression generation model training method and device

Country Status (1)

Country Link
CN (1) CN115035219A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984944A (en) * 2023-01-20 2023-04-18 北京字跳网络技术有限公司 Expression information identification method, device, equipment, readable storage medium and product

Similar Documents

Publication Publication Date Title
Sabir et al. Recurrent convolutional strategies for face manipulation detection in videos
CN112215927B (en) Face video synthesis method, device, equipment and medium
CN111401216B (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
US20230049533A1 (en) Image gaze correction method, apparatus, electronic device, computer-readable storage medium, and computer program product
CN111553267B (en) Image processing method, image processing model training method and device
CN111696028A (en) Method and device for processing cartoon of real scene image, computer equipment and storage medium
Liu et al. Psgan++: Robust detail-preserving makeup transfer and removal
CN111598051B (en) Face verification method, device, equipment and readable storage medium
CN113850168A (en) Fusion method, device and equipment of face pictures and storage medium
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN111914811B (en) Image data processing method, image data processing device, computer equipment and storage medium
WO2023221684A1 (en) Digital human generation method and apparatus, and storage medium
Wang et al. Learning how to smile: Expression video generation with conditional adversarial recurrent nets
CN112085835A (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
CN114973349A (en) Face image processing method and training method of face image processing model
CN111754622A (en) Face three-dimensional image generation method and related equipment
CN115035219A (en) Expression generation method and device and expression generation model training method and device
CN113822114A (en) Image processing method, related equipment and computer readable storage medium
CN109657589B (en) Human interaction action-based experiencer action generation method
CN114943799A (en) Face image processing method and device and computer readable storage medium
CN113393545A (en) Image animation processing method and device, intelligent device and storage medium
CN112309181A (en) Dance teaching auxiliary method and device
CN116596752B (en) Face image replacement method, device, equipment and storage medium
US20240169701A1 (en) Affordance-based reposing of an object in a scene
CN116030167B (en) Virtual character driving method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination