CN116797877A - Training method and device for image generation model, and image generation method and device - Google Patents

Training method and device for image generation model, and image generation method and device

Info

Publication number
CN116797877A
Authority
CN
China
Prior art keywords
image frame
frame sequence
loss
reconstructed
image
Prior art date
Legal status
Pending
Application number
CN202310762872.0A
Other languages
Chinese (zh)
Inventor
邹城
王萌
陈景东
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310762872.0A priority Critical patent/CN116797877A/en
Publication of CN116797877A publication Critical patent/CN116797877A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of this specification provide a training method and device for an image generation model, and an image generation method and device. The image generation model comprises a generator, and the training method comprises the following steps: obtaining a training sample, wherein the training sample comprises voice information, a real image frame sequence whose lip shape is synchronized with the voice information, an occlusion image frame sequence obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence, a face reference image frame sequence, and a tooth reference image, and the image frames in the real image frame sequence and the face reference image frame sequence are face images of the same object; inputting the voice information, the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image into the generator for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed; determining a prediction loss based on the reconstructed image frame sequence and the real image frame sequence; and adjusting the parameters of the generator with the aim of minimizing the prediction loss.

Description

Training method and device for image generation model, and image generation method and device
Technical Field
The embodiments of this specification relate to the field of computer technology, and in particular to a training method and device for an image generation model, and an image generation method and device.
Background
In existing speaking-video generation technology, a speaking video can be synthesized from a segment of speech and a video containing a portrait, with the lip shape of the portrait in the speaking video synchronized with the content of that speech. Speaking-video generation technology can be applied to scenarios such as broadcasting and/or intelligent customer service.
In broadcasting and/or intelligent customer service scenarios, digital human technology has emerged to give the user experience more warmth: it can make otherwise dry information delivery more lively and easier for users to accept. Among the many types of digital humans, 2D high-fidelity digital humans have broad application prospects because their appearance is indistinguishable from that of real people, so they can inspire the same sense of trust in clients as real people do. Speaking-video generation technology can be used to synthesize the speaking video of a digital human.
A reasonable and reliable scheme is therefore needed to help improve the sharpness of the synthesized speaking video.
Disclosure of Invention
The embodiments of this specification provide a training method and device for an image generation model, and an image generation method and device, which can help improve the sharpness of the synthesized speaking video.
In a first aspect, embodiments of this specification provide a training method for an image generation model, the image generation model comprising a generator, the method comprising: acquiring a training sample, wherein the training sample comprises voice information, a real image frame sequence, an occlusion image frame sequence, a face reference image frame sequence and a tooth reference image; the image frames in the real image frame sequence and the face reference image frame sequence are face images of the same object, the lip shape of the real image frame sequence is synchronized with the voice information, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence; inputting the voice information, the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image into the generator for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed; determining a prediction loss based on the reconstructed image frame sequence and the real image frame sequence; and adjusting the parameters of the generator with the aim of minimizing the prediction loss.
In some embodiments, the generator comprises a speech encoder, an image encoder and a decoder, and the model processing comprises: encoding the voice information with the speech encoder to obtain speech features; performing image encoding on the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image with the image encoder to obtain image features; performing feature fusion on the speech features and the image features to obtain a fusion result; and decoding the fusion result with the decoder to obtain the reconstructed image frame sequence.
In some embodiments, the image generation model further comprises a visual quality discriminator, and determining the prediction loss based on the reconstructed image frame sequence and the real image frame sequence comprises: determining a reconstruction loss based on the difference between a first lower half-face region of the image frames in the reconstructed image frame sequence and a second lower half-face region of the image frames in the real image frame sequence; discriminating the authenticity of the reconstructed image frame sequence with the visual quality discriminator to obtain a reconstruction fidelity; and calculating the prediction loss, which is positively correlated with the reconstruction loss and negatively correlated with the reconstruction fidelity.
In some embodiments, the method further comprises: discriminating the authenticity of the real image frame sequence with the visual quality discriminator to obtain a predicted fidelity; determining a discriminator loss that is positively correlated with the reconstruction fidelity and negatively correlated with the predicted fidelity; and adjusting the parameters of the visual quality discriminator with the aim of minimizing the discriminator loss.
In some embodiments, determining the reconstruction loss comprises: determining a first sub-loss based on the difference between a first mouth region of the first lower half-face region and a second mouth region of the second lower half-face region; determining a second sub-loss based on the difference between the other regions of the first lower half-face region and the other regions of the second lower half-face region; and performing a weighted summation of the first sub-loss and the second sub-loss based on first weight values preset respectively for the mouth region and the other regions of the lower half-face region, to obtain the reconstruction loss; wherein the first weight value of the mouth region is greater than the first weight value of the other regions.
In some embodiments, the visual quality discriminator comprises a first discriminator, and discriminating the authenticity of the reconstructed image frame sequence with the visual quality discriminator to obtain the reconstruction fidelity comprises: inputting the first lower half-face region into the first discriminator to obtain a first fidelity output by the first discriminator; and determining the reconstruction fidelity based on the first fidelity.
In some embodiments, the visual quality discriminator further comprises a second discriminator, and the method further comprises: inputting a first mouth region of the first lower half-face region into the second discriminator to obtain a second fidelity output by the second discriminator; and determining the reconstruction fidelity comprises: determining the reconstruction fidelity based on the first fidelity and the second fidelity.
In some embodiments, the visual quality discriminator further comprises a third discriminator, and the method further comprises: inputting the reconstructed image frame sequence into the third discriminator to obtain a third fidelity output by the third discriminator, wherein the third fidelity is used in determining the reconstruction fidelity.
In some embodiments, the method further comprises: determining, with a pre-trained lip sync discriminator, whether the lip shape of the reconstructed image frame sequence is synchronized with the voice information, to obtain a lip sync loss; and calculating the prediction loss comprises: determining the prediction loss, which is positively correlated with the lip sync loss, based on the reconstruction loss, the reconstruction fidelity and the lip sync loss.
In some embodiments, the lip sync discriminator comprises a plurality of discriminators of different sizes, and determining, with the pre-trained lip sync discriminator, whether the lip shape of the reconstructed image frame sequence is synchronized with the voice information to obtain the lip sync loss comprises: for each discriminator among the plurality of discriminators, inputting the voice information and the reconstructed image frame sequence into that discriminator to obtain the discrimination result it outputs; and generating the lip sync loss based on the discrimination results respectively output by the plurality of discriminators.
In some embodiments, the reconstruction loss, the reconstruction fidelity and the lip sync loss are associated with a plurality of preset second weight values, and determining the prediction loss based on the reconstruction loss, the reconstruction fidelity and the lip sync loss comprises: performing a weighted summation of the reconstruction loss, the reconstruction fidelity and the lip sync loss based on the plurality of second weight values to obtain the prediction loss, wherein the second weight value corresponding to the reconstruction fidelity is negative and the remaining second weight values are positive.
In some embodiments, the same object comprises the same persona object.
In some embodiments, the dental reference image is one or more images.
In some embodiments, the tooth reference image comprises any one of the following: a tooth image showing tooth details, or a reference image containing tooth contours.
In a second aspect, embodiments of this specification provide an image generation method, comprising: acquiring target voice information, a target image frame sequence, an occlusion image frame sequence and a tooth reference image; the target image frame sequence serves as a face reference image frame sequence, the image frames in the target image frame sequence are face images of a target object, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the target image frame sequence; and inputting the target voice information, the target image frame sequence, the occlusion image frame sequence and the tooth reference image into the generator of a pre-trained image generation model for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed; wherein the lip shape of the reconstructed image frame sequence is synchronized with the target voice information.
In some embodiments, the target image frame sequence corresponds to a first image frame sequence of a first video of the target object and is formed from face images extracted from the first image frame sequence; and the method further comprises: replacing the face region of the corresponding image frames in the first image frame sequence with the image frames in the reconstructed image frame sequence; and generating a second video based on the first image frame sequence after the replacement processing.
In a third aspect, embodiments of this specification provide a training apparatus for an image generation model, the image generation model comprising a generator, the apparatus comprising: an acquisition unit configured to acquire a training sample comprising voice information, a real image frame sequence, an occlusion image frame sequence, a face reference image frame sequence and a tooth reference image, wherein the image frames in the real image frame sequence and the face reference image frame sequence are face images of the same object, the lip shape of the real image frame sequence is synchronized with the voice information, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence; an image generation unit configured to input the voice information, the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image into the generator for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed; a loss determination unit configured to determine a prediction loss based on the reconstructed image frame sequence and the real image frame sequence; and a parameter adjustment unit configured to adjust the parameters of the generator with the aim of minimizing the prediction loss.
In a fourth aspect, embodiments of this specification provide an image generation apparatus, comprising: an acquisition unit configured to acquire target voice information, a target image frame sequence, an occlusion image frame sequence and a tooth reference image, wherein the target image frame sequence serves as a face reference image frame sequence, the image frames in the target image frame sequence are face images of a target object, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the target image frame sequence; and an image generation unit configured to input the target voice information, the target image frame sequence, the occlusion image frame sequence and the tooth reference image into the generator of a pre-trained image generation model for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed, wherein the lip shape of the reconstructed image frame sequence is synchronized with the target voice information.
In a fifth aspect, embodiments of the present specification provide a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform a method as described in any implementation of the first and second aspects.
In a sixth aspect, embodiments of the present specification provide a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements a method as described in any implementation of the first and second aspects.
In a seventh aspect, the present description provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the method as described in any one of the implementations of the first and second aspects.
The above embodiments of this specification provide a solution in which the face reference image frame sequence and the tooth reference image are input into the generator for model processing together with the voice information and the occlusion image frame sequence. When the generator reconstructs the lower half-face region of the image frames in the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image provide the necessary reference information, so that the sharpness of the reconstructed lower half-face region and of its tooth region can be improved, and therefore the sharpness of the image frames in the reconstructed image frame sequence generated by the generator can be improved. Using the reconstructed image frame sequence generated by the trained image generation model to synthesize a speaking video can improve the sharpness of the synthesized speaking video.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in this specification, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only examples of the embodiments disclosed in this specification, and a person skilled in the art could obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of one application scenario in which embodiments of the present description may be applied;
FIG. 2 is a flowchart of a training method of an image generation model in an embodiment of the present description;
FIG. 3 is a schematic diagram of a structure of a generator;
FIG. 4 is a schematic diagram of a predictive loss determination process;
FIG. 5 is a schematic diagram of a reconstruction loss determination process;
FIG. 6 is a schematic diagram of a reconstruction fidelity determination process;
FIG. 7 is a flowchart of an image generation method in an embodiment of the present description;
FIG. 8 is a schematic structural diagram of a training apparatus for an image generation model in an embodiment of this specification;
FIG. 9 is a schematic structural diagram of an image generation apparatus in an embodiment of this specification.
Detailed Description
The present specification is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to illustrate the application and do not limit it. The described embodiments are only some, rather than all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without inventive effort fall within the scope of protection of this application.
For ease of description, only the portions related to the application are shown in the drawings. Embodiments in this specification, and the features of those embodiments, may be combined with one another as long as they do not conflict. In addition, the terms "first", "second", "third" and the like in the embodiments of this specification are used only to distinguish information and do not impose any limitation.
As described above, with existing speaking-video generation technology, a speaking video in which the lip shape of a portrait is synchronized with the content of a segment of speech can be synthesized from that speech and a video containing the portrait.
To help improve the sharpness of the synthesized speaking video, embodiments of this specification provide a training scheme and an image generation scheme for an image generation model.
FIG. 1 is a schematic diagram of one application scenario to which the embodiments of this specification may be applied. The application scenario shown in FIG. 1 includes an image generation model 101 to be trained, which includes a generator 102, as well as the voice information 103, real image frame sequence 104, occlusion image frame sequence 105, face reference image frame sequence 106 and tooth reference image 107 in a training sample S1. The image frames in the real image frame sequence 104 and the face reference image frame sequence 106 are face images of the same object, the lip shape of the real image frame sequence 104 is synchronized with the voice information 103, and the occlusion image frame sequence 105 is obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence 104.
The voice information 103 may be voice recorded by a real person or voice synthesized by speech synthesis technology. The same object may include, but is not limited to, the same person object. The person object may be a real person or a virtual character (e.g., a digital human). The real image frame sequence 104, the occlusion image frame sequence 105 and the face reference image frame sequence 106 may each include one or more image frames, and the tooth reference image 107 may be one or more images; no particular limitation is imposed here. In addition, the tooth reference image 107 may be any image used to improve tooth detail, for example a tooth image showing tooth details or a reference image containing tooth contours.
It is noted that only one real image frame, one occlusion image frame, one face reference image frame, and one tooth reference image are shown in fig. 1 for ease of illustration. It should be understood that fig. 1 is merely an exemplary schematic diagram, which is not intended to be limiting in any way.
The training process of the image generation model 101 may be performed by any device, platform or device cluster with data storage, computing and processing capabilities. The executing body may hold the training sample S1 and the image generation model 101. In this training process, as shown in FIG. 1, the voice information 103, the occlusion image frame sequence 105, the face reference image frame sequence 106 and the tooth reference image 107 in the training sample S1 may be input into the generator 102 for model processing, resulting in a reconstructed image frame sequence 108 in which the lower half-face region of the image frames in the occlusion image frame sequence 105 is reconstructed. The prediction loss may then be determined based on the reconstructed image frame sequence 108 and the real image frame sequence 104 in the training sample S1, and the parameters of the generator 102 may be adjusted with the aim of minimizing the prediction loss.
In practice, the executing body may store a training sample set including the training sample S1, and the executing body may train the image generation model 101 using at least some training samples in the training sample set until the model converges.
After the image generation model 101 has been trained, it may be applied to different scenarios, such as broadcasting and/or intelligent customer service scenarios, for speaking-video synthesis. Specifically, the trained image generation model 101 may be deployed on any device, platform or device cluster with data storage, computing and processing capabilities. That device, platform or device cluster may acquire target voice information, a target image frame sequence, an occlusion image frame sequence and a tooth reference image. The target voice information may be voice recorded by a real person or voice synthesized by speech synthesis technology. The target image frame sequence serves as the face reference image frame sequence, and its image frames are face images of the target object. The occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the target image frame sequence. The target object belongs to the same object class as the aforementioned same object, for example the person class. The target voice information, the target image frame sequence, the occlusion image frame sequence and the tooth reference image may then be input into the generator 102 of the image generation model 101 for model processing to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed, with the lip shape of the reconstructed image frame sequence synchronized with the target voice information. The reconstructed image frame sequence may then be used to synthesize a speaking video of the target object.
In the following, specific implementation steps of the above method are described in connection with specific embodiments.
Referring to FIG. 2, a flowchart of a training method for an image generation model in an embodiment of this specification is shown. The image generation model includes a generator. The executing body of the method may be any device, platform or device cluster with data storage, computing and processing capabilities. The method comprises the following steps: step S201, acquiring a training sample, wherein the training sample comprises voice information, a real image frame sequence, an occlusion image frame sequence, a face reference image frame sequence and a tooth reference image; the image frames in the real image frame sequence and the face reference image frame sequence are face images of the same object, the lip shape of the real image frame sequence is synchronized with the voice information, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence; step S203, inputting the voice information, the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image into the generator for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed; step S205, determining a prediction loss based on the reconstructed image frame sequence and the real image frame sequence; step S207, adjusting the parameters of the generator with the aim of minimizing the prediction loss.
The above steps are further described below.
In step S201, a training sample may be acquired, for example from a preset training sample set. The training sample may include voice information, a real image frame sequence, an occlusion image frame sequence, a face reference image frame sequence and a tooth reference image. The voice information may be voice recorded by a real person or voice synthesized by speech synthesis technology. The image frames in the real image frame sequence and the face reference image frame sequence are face images of the same object (hereinafter referred to as object A). Object A may be, for example, a person object or an animal object, and is not particularly limited here. The lip shape of the real image frame sequence is synchronized with the voice information. The occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence.
In one example, the voice information may correspond to a video segment whose lip shape is synchronized with it, where the image frames of the video segment include the face region of object A. The real image frame sequence may be formed from face images of object A extracted from that video segment, and the face reference image frame sequence may be formed from face images of object A extracted from the original video containing the segment. It is noted that the pose and/or mouth shape of object A may differ between the face reference image frame sequence and the real image frame sequence.
The sequence of facial reference image frames may be used to provide the necessary reference information when the generator reconstructs the lower half face region of the image frames in the sequence of occlusion image frames. The tooth reference image may be used to provide the necessary reference information for tooth generation when the generator reconstructs the lower half-face region.
The tooth reference image may be an image of object A or an image of another object belonging to the same object class as object A; no particular limitation is imposed here. Note that the tooth reference image may be one or more images. In addition, the tooth reference image may be any image used to improve tooth detail, for example a tooth image showing tooth details or a reference image containing tooth contours. When the tooth reference image is a plurality of tooth images showing tooth details, these images may take the form of a high-quality sequence of tooth images under different mouth shapes, as illustrated in the sketch below.
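The following sketch illustrates one simple way the occlusion processing described above could be carried out; it is an illustration under assumptions (frame size, a plain half-height split, zero-filling as the occlusion), not the patent's exact preprocessing.

```python
# Illustrative sketch: build an occlusion image frame sequence by zeroing out the
# lower half-face region of each real face crop. The frame size and the simple
# half-height split are assumptions for illustration only.
import numpy as np

def occlude_lower_half(real_frames: np.ndarray) -> np.ndarray:
    """real_frames: (T, H, W, 3) uint8 face crops; returns occluded copies."""
    occluded = real_frames.copy()
    h = real_frames.shape[1]
    occluded[:, h // 2:, :, :] = 0  # mask the lower half of every frame
    return occluded

# Example: a dummy sequence of 5 frames of 256x256 face crops
frames = np.random.randint(0, 256, size=(5, 256, 256, 3), dtype=np.uint8)
occluded_frames = occlude_lower_half(frames)
```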
Next, in step S203, the speech information, the sequence of occlusion image frames, the sequence of facial reference image frames, and the tooth reference image may be input to a generator for model processing to obtain a sequence of reconstructed image frames reconstructed for the lower half-face region of the image frames in the sequence of occlusion image frames.
In one example, the generator may include a speech encoder, an image encoder and a decoder, as shown in FIG. 3, which is a schematic structural diagram of the generator. The speech encoder may, for example, consist of a stack of residual convolution layers and encode the input speech. The image encoder may, for example, consist of a stack of 2D convolution layers and encode the input images. The decoder may, for example, consist of a stack of convolution layers that upsample using transposed convolutions and decode the input features. It should be noted that, in order to increase the resolution of the reconstructed image frame sequence generated by the generator, the decoder may include a larger number of upsampling steps than prior-art schemes.
Based on this, as shown in FIG. 3, the speech encoder encodes the voice information to obtain speech features, and the image encoder encodes the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image to obtain image features. Feature fusion is then performed on the speech features and the image features to obtain a fusion result, which is decoded by the decoder to obtain the reconstructed image frame sequence.
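A minimal PyTorch sketch of one possible realization of the generator just described is given below. The layer counts, channel sizes, mel-spectrogram input shape and channel-wise concatenation of the occluded frame, face reference frame and tooth reference image are all assumptions for illustration, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, img_channels=9, audio_channels=1):
        super().__init__()
        # Speech encoder: stacked convolutions over a mel-spectrogram chunk
        self.speech_encoder = nn.Sequential(
            nn.Conv2d(audio_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Image encoder: stacked 2D convolutions over the concatenated image inputs
        # (3 occluded + 3 face reference + 3 tooth reference channels, assumed)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1), nn.ReLU(),   # 128
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),            # 64
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),           # 32
        )
        # Decoder: transposed convolutions upsample the fused features to 256x256
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 128, 4, stride=2, padding=1), nn.ReLU(),  # 64
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 128
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 256
        )

    def forward(self, mel, images):
        a = self.speech_encoder(mel)                  # (B, 256, 1, 1) speech features
        v = self.image_encoder(images)                # (B, 256, 32, 32) image features
        a = a.expand(-1, -1, v.shape[2], v.shape[3])  # broadcast the speech feature map
        fused = torch.cat([v, a], dim=1)              # feature fusion by concatenation
        return self.decoder(fused)                    # reconstructed image frame

gen = Generator()
mel = torch.randn(2, 1, 80, 16)       # assumed mel-spectrogram chunk per frame
imgs = torch.randn(2, 9, 256, 256)    # occluded + face reference + tooth reference
out = gen(mel, imgs)                  # (2, 3, 256, 256)
```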
Next, in step S205, a prediction loss may be determined based on the reconstructed image frame sequence and the real image frame sequence.
In practice, the image generation model may also include a visual quality discriminator, which is used to discriminate the authenticity of an input image. The visual quality discriminator may, for example, consist of a stack of convolution blocks, each of which may consist of a convolution layer and an activation layer (e.g., a LeakyReLU activation function), and serves to improve visual quality and synchronization accuracy.
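A minimal sketch of such a visual quality discriminator is shown below, assuming stacked convolution blocks (a convolution layer followed by a LeakyReLU activation) and a final probability head; the specific layer sizes and input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # One convolution block: convolution layer + LeakyReLU activation layer
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.2))

class VisualQualityDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 512),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(512, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.blocks(x))   # probability that the input region is real

d = VisualQualityDiscriminator()
score = d(torch.randn(2, 3, 128, 256))     # e.g. a batch of lower half-face crops
```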
When determining the prediction loss, the prediction loss determination process shown in FIG. 4 may be performed, where FIG. 4 is a schematic diagram of the prediction loss determination process. The process includes: step S401, determining a reconstruction loss based on the difference between the first lower half-face region of the image frames in the reconstructed image frame sequence and the second lower half-face region of the image frames in the real image frame sequence; step S403, discriminating the authenticity of the reconstructed image frame sequence with the visual quality discriminator to obtain the reconstruction fidelity; step S407, calculating the prediction loss.
Steps S401 to S407 are further described below.
In step S401, a reconstruction loss may be determined based on a difference between a first lower half-face region of an image frame in the reconstructed image frame sequence and a second lower half-face region of an image frame in the real image frame sequence.
In one embodiment, the reconstruction loss may be determined directly based on the difference between the first lower half-face region and the second lower half-face region.
In another embodiment, in order to pay more attention to the mouth region, i.e., the region containing the lips and teeth, first weight values may be preset for the mouth region and other regions of the lower half-face region, respectively. Wherein the first weight value of the mouth region is greater than the first weight values of the other regions. In one example, the first weight value of the mouth region may be 2 times or 3 times, etc., the first weight value of the other region.
Based on this, when determining the reconstruction loss, the reconstruction loss determination process shown in FIG. 5 may be performed, where FIG. 5 is a schematic diagram of the reconstruction loss determination process. The process includes: step S501, determining a first sub-loss based on the difference between the first mouth region of the first lower half-face region and the second mouth region of the second lower half-face region; step S503, determining a second sub-loss based on the difference between the other regions of the first lower half-face region and the other regions of the second lower half-face region; step S505, performing a weighted summation of the first sub-loss and the second sub-loss based on the first weight values preset respectively for the mouth region and the other regions of the lower half-face region, to obtain the reconstruction loss, wherein the first weight value of the mouth region is greater than the first weight value of the other regions.
In the reconstruction loss determination process, a mouth region mask may be introduced, and the mouth region and other regions of the first and second lower half-face regions may be weighted differently to calculate the sub-loss. In addition, the sub-loss may be calculated using an L1 or L2 loss function.
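A minimal sketch of this mouth-weighted reconstruction loss follows, assuming an L1 sub-loss and a binary mouth region mask; the 2:1 weight ratio and the mask coordinates in the example are illustrative assumptions.

```python
import torch

def reconstruction_loss(recon_lower, real_lower, mouth_mask,
                        w_mouth=2.0, w_other=1.0):
    """recon_lower, real_lower: (B, 3, H, W); mouth_mask: (B, 1, H, W) in {0, 1}."""
    diff = torch.abs(recon_lower - real_lower)            # per-pixel L1 difference
    loss_mouth = (diff * mouth_mask).mean()               # first sub-loss (mouth region)
    loss_other = (diff * (1 - mouth_mask)).mean()         # second sub-loss (other regions)
    return w_mouth * loss_mouth + w_other * loss_other    # weighted summation

recon = torch.rand(2, 3, 128, 256)
real = torch.rand(2, 3, 128, 256)
mask = torch.zeros(2, 1, 128, 256)
mask[:, :, 40:100, 80:180] = 1.0                          # assumed mouth region box
loss = reconstruction_loss(recon, real, mask)
```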
With continued reference to FIG. 4, in step S403, the authenticity of the reconstructed image frame sequence may be discriminated using the visual quality discriminator, resulting in the reconstruction fidelity.
In one embodiment, to ensure the realism of the lower half-face region of the reconstructed image frames, authenticity discrimination may be performed on the lower half-face region. Specifically, the visual quality discriminator may include a first discriminator, which is a discriminator for the lower half-face region. The first lower half-face region of the image frames in the reconstructed image frame sequence may be input into the first discriminator to obtain the first fidelity output by the first discriminator, where the first fidelity may be the probability that the first lower half-face region is a real lower half-face region. The reconstruction fidelity may then be determined based on the first fidelity; in one example, the first fidelity may be taken directly as the reconstruction fidelity.
In another embodiment, to ensure the realism of both the lower half-face region of the reconstructed image frames and the mouth region within it, authenticity discrimination may be performed on both the lower half-face region and the mouth region. Specifically, the visual quality discriminator may include a second discriminator, which is a discriminator for the mouth region, in addition to the first discriminator described above. The first lower half-face region of the image frames in the reconstructed image frame sequence may be input into the first discriminator to obtain the first fidelity output by the first discriminator, and the first mouth region of the first lower half-face region may be input into the second discriminator to obtain the second fidelity output by the second discriminator, where the second fidelity may be the probability that the first mouth region is a real mouth region. The reconstruction fidelity may then be determined based on the first fidelity and the second fidelity. For example, the average of the first fidelity and the second fidelity may be taken as the reconstruction fidelity. As another example, the first and second discriminators may be pre-associated with weight values, and the first fidelity and the second fidelity may be weighted and summed based on those weight values to obtain the reconstruction fidelity.
In still another embodiment, to ensure the realism of both the lower half-face region and the full-face region of the reconstructed image frames, authenticity discrimination may be performed on both the lower half-face region and the full-face region. Specifically, the visual quality discriminator may include a third discriminator, which is a discriminator for the full-face region, in addition to the first discriminator described above. The first lower half-face region of the image frames in the reconstructed image frame sequence may be input into the first discriminator to obtain the first fidelity output by the first discriminator, and the reconstructed image frame sequence may be input into the third discriminator to obtain the third fidelity output by the third discriminator, where the third fidelity may be the probability that the reconstructed image frame sequence is a real image frame sequence. The reconstruction fidelity may then be determined based on the first fidelity and the third fidelity. For example, the average of the first fidelity and the third fidelity may be taken as the reconstruction fidelity. As another example, the first and third discriminators may be pre-associated with weight values, and the first fidelity and the third fidelity may be weighted and summed based on those weight values to obtain the reconstruction fidelity.
In yet another embodiment, to ensure the realism of the lower half-face region, the mouth region and the full-face region of the reconstructed image frames, authenticity discrimination may be performed on all three. Specifically, the visual quality discriminator may include the first discriminator, the second discriminator and the third discriminator described above, and when determining the reconstruction fidelity, the reconstruction fidelity determination process shown in FIG. 6 may be performed, where FIG. 6 is a schematic diagram of the reconstruction fidelity determination process. The process includes: step S601, inputting the first lower half-face region of the image frames in the reconstructed image frame sequence into the first discriminator to obtain the first fidelity output by the first discriminator; step S603, inputting the mouth region of the first lower half-face region into the second discriminator to obtain the second fidelity output by the second discriminator; step S605, inputting the reconstructed image frame sequence into the third discriminator to obtain the third fidelity output by the third discriminator; step S607, determining the reconstruction fidelity based on the first fidelity, the second fidelity and the third fidelity.
In step S607, the average of the first fidelity, the second fidelity and the third fidelity may be determined as the reconstruction fidelity. Alternatively, the first discriminator, the second discriminator and the third discriminator may be pre-associated with weight values, and the first fidelity, the second fidelity and the third fidelity may be weighted and summed based on those weight values to obtain the reconstruction fidelity.
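A minimal sketch of combining the three fidelities into the reconstruction fidelity; equal weights are shown (the averaging case), and any per-discriminator weights would be design choices rather than values given in the text.

```python
def combine_fidelity(first, second, third, weights=(1/3, 1/3, 1/3)):
    """first/second/third: fidelity scores in [0, 1] from the lower half-face,
    mouth, and full-face discriminators; returns the reconstruction fidelity."""
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third

recon_fidelity = combine_fidelity(0.82, 0.75, 0.90)   # example scores
```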
With continued reference to FIG. 4, in step S407, the prediction loss may be calculated.
In one embodiment, the prediction loss, which is positively correlated with the reconstruction loss and negatively correlated with the reconstruction fidelity, may be determined based on the reconstruction loss and the reconstruction fidelity. As an example, the reconstruction loss and the reconstruction fidelity may be associated with a preset plurality of third weight values, the third weight value corresponding to the reconstruction loss being a positive value, and the third weight value corresponding to the reconstruction fidelity being a negative value. Based on this, the reconstruction loss and the reconstruction fidelity may be weighted and summed based on the plurality of third weight values, resulting in a prediction loss.
In another embodiment, step S405 may also be performed before step S407 in order to ensure the synchronicity of the lips of the reconstructed image frame sequence generated by the generator with the input speech information. In step S405, a pre-trained lip sync discriminator may be used to determine whether the lips of the reconstructed image frame sequence are synchronized with the speech information, resulting in a lip sync loss. Based on this, in step S407, a prediction loss, which is positively correlated with the reconstruction loss and the lip-sync loss and negatively correlated with the reconstruction fidelity, may be determined based on the reconstruction loss, the reconstruction fidelity, and the lip-sync loss.
Further, the reconstruction loss, reconstruction fidelity, and lip sync loss may be associated with a preset plurality of second weight values. The second weight value corresponding to the reconstruction fidelity is a negative value, and the rest second weight values are positive values. Based on this, the reconstruction loss, the reconstruction fidelity, and the lip sync loss may be weighted summed based on the plurality of second weight values to obtain the predicted loss.
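A minimal sketch of this weighted summation follows; the specific second weight values are illustrative assumptions, with a negative weight on the reconstruction fidelity and positive weights on the two losses, as described above.

```python
def prediction_loss(recon_loss, recon_fidelity, lip_sync_loss,
                    w_recon=1.0, w_fidelity=-0.1, w_sync=0.03):
    # Weighted sum: the fidelity term is subtracted (negative weight) so that a
    # higher reconstruction fidelity lowers the generator's prediction loss.
    return w_recon * recon_loss + w_fidelity * recon_fidelity + w_sync * lip_sync_loss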
The lip sync discriminator in the embodiments of this specification may be a pre-trained expert lip sync discriminator that can accurately detect lip sync errors. In addition, the lip sync discriminator in the embodiments of this specification may be a single discriminator, or may include a plurality of discriminators of different sizes.
When the lip sync discriminator is a single discriminator, in step S405, the speech information and the reconstructed image frame sequence may be input into the lip sync discriminator to obtain the discrimination result it outputs. The discrimination result may be the probability that the lip shape of the reconstructed image frame sequence is synchronized with the speech information. The lip sync loss can then be generated based on the discrimination result, for example using a cross-entropy loss function. The cross-entropy loss function, also referred to as the BCE loss function, is a negative log-likelihood function over probability estimates and is mainly used for classification problems.
When the lip sync discriminator includes a plurality of discriminators of different sizes, in step S405, for each discriminator among them, the speech information and the reconstructed image frame sequence may be input into that discriminator to obtain its discrimination result. The lip sync loss can then be generated based on the discrimination results output by the plurality of discriminators respectively. It should be noted that by using a plurality of discriminators of different sizes, whether the lip shape of the image frames in the reconstructed image frame sequence is correct can be discriminated at different sizes separately, which further ensures the correctness of the synthesized lip shape.
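A minimal sketch of computing the lip sync loss with a BCE (negative log-likelihood) loss averaged over the outputs of one or more lip sync discriminators; the discriminators themselves are assumed to be already trained and are not shown, and averaging over discriminators is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def lip_sync_loss(sync_probs):
    """sync_probs: list of (B, 1) tensors, one per lip sync discriminator, each the
    probability that the reconstructed frames are in sync with the speech."""
    losses = []
    for p in sync_probs:
        target = torch.ones_like(p)                       # "in sync" label
        losses.append(F.binary_cross_entropy(p, target))  # BCE / negative log-likelihood
    return torch.stack(losses).mean()

probs = [torch.rand(4, 1), torch.rand(4, 1)]              # outputs of two discriminators
loss = lip_sync_loss(probs)
```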
With continued reference to fig. 2, after determining the prediction loss by performing step S205, step S207 may be performed next. In step S207, the parameters of the generator are adjusted with the aim of minimizing the prediction loss.
In the scheme provided by the embodiment corresponding to FIG. 2, the face reference image frame sequence and the tooth reference image are input into the generator for model processing together with the voice information and the occlusion image frame sequence. When the generator reconstructs the lower half-face region of the image frames in the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image provide the necessary reference information, so that the sharpness of the reconstructed lower half-face region and of its tooth region can be improved, and therefore the sharpness of the image frames in the reconstructed image frame sequence generated by the generator can be improved. Using the reconstructed image frame sequence generated by the trained image generation model to synthesize a speaking video can improve the sharpness of the synthesized speaking video.
The foregoing mainly describes the training process for the generator included in the image generation model. In one embodiment, the visual quality discriminator included in the image generation model may also be trained while the generator is being trained. As an example, when the authenticity of the reconstructed image frame sequence is discriminated using the visual quality discriminator to obtain the reconstruction fidelity, the authenticity of the real image frame sequence may also be discriminated using the visual quality discriminator to obtain the predicted fidelity. A discriminator loss can then be determined, which is positively correlated with the reconstruction fidelity and negatively correlated with the predicted fidelity, and the parameters of the visual quality discriminator are adjusted with the aim of minimizing the discriminator loss. It should be noted that the determination process of the predicted fidelity is similar to that of the reconstruction fidelity, and reference may be made to the related description above, which is not repeated here. In addition, a BCE loss function may be used to determine the discriminator loss based on the reconstruction fidelity and the predicted fidelity.
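A minimal sketch of such a discriminator loss in BCE form is given below: it is positively correlated with the reconstruction fidelity and negatively correlated with the predicted fidelity. The exact formulation is not specified above, so this particular BCE arrangement is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(recon_fidelity, predicted_fidelity):
    """Both arguments are (B, 1) probabilities output by the visual quality
    discriminator for reconstructed frames and real frames respectively."""
    real_loss = F.binary_cross_entropy(predicted_fidelity,
                                       torch.ones_like(predicted_fidelity))
    fake_loss = F.binary_cross_entropy(recon_fidelity,
                                       torch.zeros_like(recon_fidelity))
    return real_loss + fake_loss

loss_d = discriminator_loss(torch.rand(4, 1), torch.rand(4, 1))
```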
After the image generation model is trained, the image generation model can be applied to different scenes, such as scenes of a broadcasting class and/or an intelligent customer service class, for speaking video synthesis.
Referring to FIG. 7, a flowchart of an image generation method in an embodiment of this specification is shown. The executing body of the method may be any device, platform or device cluster with data storage, computing and processing capabilities. The method comprises the following steps: step S701, acquiring target voice information, a target image frame sequence, an occlusion image frame sequence and a tooth reference image; the target image frame sequence serves as a face reference image frame sequence, the image frames in the target image frame sequence are face images of a target object, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the target image frame sequence; step S703, inputting the target voice information, the target image frame sequence, the occlusion image frame sequence and the tooth reference image into the generator of a pre-trained image generation model for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed, wherein the lip shape of the reconstructed image frame sequence is synchronized with the target voice information.
Next, steps S701 to S703 will be further described.
In step S701, target voice information, a target image frame sequence, an occlusion image frame sequence and a tooth reference image may be acquired. The target voice information may be voice recorded by a real person or voice synthesized by speech synthesis technology. The target image frame sequence serves as the face reference image frame sequence, and its image frames are face images of the target object, which belongs to the same object class as object A above. The occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the target image frame sequence.
It is to be noted that the target image frame sequence may correspond to a first image frame sequence of a first video of the target object, and be formed of face images extracted from the first image frame sequence. The target speech information and the first video may be two separate parts without any relation.
Next, in step S703, the target voice information, the target image frame sequence, the occlusion image frame sequence and the tooth reference image may be input into the generator of a pre-trained image generation model for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed, with the lip shape of the reconstructed image frame sequence synchronized with the target voice information.
Further, the generator may include a speech encoder, an image encoder, and a decoder. In step S703, the target speech information may be encoded by a speech encoder to obtain speech features, and the occlusion image frame sequence, the target image frame sequence, and the tooth reference image may be image-encoded by an image encoder to obtain image features. And then, carrying out feature fusion on the voice features and the image features to obtain a fusion result. The fusion result may then be decoded by a decoder to obtain a sequence of reconstructed image frames.
In the scheme provided by the embodiment corresponding to FIG. 7, the target image frame sequence and the tooth reference image are input into the generator for model processing together with the target voice information and the occlusion image frame sequence. When the generator reconstructs the lower half-face region of the image frames in the occlusion image frame sequence, the target image frame sequence and the tooth reference image provide the necessary reference information, so that the sharpness of the reconstructed lower half-face region and of its tooth region can be improved, and therefore the sharpness of the image frames in the reconstructed image frame sequence generated by the generator can be improved. Using the reconstructed image frame sequence to synthesize a speaking video can improve the sharpness of the synthesized speaking video.
In one embodiment, after the reconstructed image frame sequence is obtained by performing step S703, the face region of the corresponding image frames in the first image frame sequence may be replaced with the image frames in the reconstructed image frame sequence, and a second video may be generated based on the first image frame sequence after the replacement processing. The second video is the synthesized speaking video of the target object; when the target object is a target digital person, the second video is the synthesized speaking video of that target digital person.
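A minimal sketch of this replacement step, assuming the face bounding boxes from the earlier face extraction step are available; the helper name and box format are hypothetical, and writing the frames out to a video file (e.g. with OpenCV or ffmpeg) is left out.

```python
import numpy as np

def replace_faces(original_frames, reconstructed_faces, face_boxes):
    """original_frames: list of (H, W, 3) arrays from the first video;
    reconstructed_faces: list of reconstructed face crops;
    face_boxes: list of (y1, y2, x1, x2) boxes, one per frame."""
    out = []
    for frame, face, (y1, y2, x1, x2) in zip(original_frames,
                                             reconstructed_faces, face_boxes):
        frame = frame.copy()
        # Assumes the reconstructed crop already matches the box size; resize first otherwise.
        frame[y1:y2, x1:x2] = face
        out.append(frame)
    return out  # frames of the second video, ready to be encoded
```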
As described above, the scheme provided by the embodiments of this specification can introduce a tooth reference image at the model input to guide the generation of tooth details, which avoids the situation where, with no reference to condition on, only blurry average teeth can be generated. In addition, in terms of model structure, the decoder can have a higher resolution and support larger-size output, avoiding the blur introduced by scaling up from a small size to a large one. By introducing several different masks when calculating the loss, such as a lower half-face region mask, a mouth region mask and a full-face region mask, and using the reconstruction fidelity determined from the first fidelity, the second fidelity and the third fidelity to determine the prediction loss, the realism of the lower half-face region, the mouth region and the full-face region of the reconstructed image frames can be ensured. Therefore, the scheme can help improve the overall resolution and sharpness of the synthesized speaking video.
Referring further to FIG. 8, a schematic structural diagram of a training apparatus for an image generation model in an embodiment of this specification is shown. The image generation model includes a generator. The apparatus may be applied to any device, platform or device cluster with data storage, computing and processing capabilities.
As shown in FIG. 8, the training apparatus 800 for an image generation model of this embodiment includes: an acquisition unit 801, an image generation unit 802, a loss determination unit 803 and a parameter adjustment unit 804. The acquisition unit 801 is configured to acquire a training sample comprising voice information, a real image frame sequence, an occlusion image frame sequence, a face reference image frame sequence and a tooth reference image, wherein the image frames in the real image frame sequence and the face reference image frame sequence are face images of the same object, the lip shape of the real image frame sequence is synchronized with the voice information, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence. The image generation unit 802 is configured to input the voice information, the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image into the generator for model processing, to obtain a reconstructed image frame sequence in which the lower half-face region of the image frames in the occlusion image frame sequence is reconstructed. The loss determination unit 803 is configured to determine a prediction loss based on the reconstructed image frame sequence and the real image frame sequence. The parameter adjustment unit 804 is configured to adjust the parameters of the generator with the aim of minimizing the prediction loss.
In one embodiment, the generator includes a speech encoder, an image encoder and a decoder, and the model processing includes: encoding the voice information with the speech encoder to obtain speech features; performing image encoding on the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image with the image encoder to obtain image features; performing feature fusion on the speech features and the image features to obtain a fusion result; and decoding the fusion result with the decoder to obtain the reconstructed image frame sequence.
In one embodiment, the image generation model includes a visual quality discriminator, and the loss determination unit 803 may include: a first determining subunit (not shown in the figure) configured to determine a reconstruction loss based on the difference between the first lower half-face region of the image frames in the reconstructed image frame sequence and the second lower half-face region of the image frames in the real image frame sequence; a second determining subunit (not shown in the figure) configured to discriminate the authenticity of the reconstructed image frame sequence with the visual quality discriminator to obtain the reconstruction fidelity; and a third determining subunit (not shown in the figure) configured to calculate the prediction loss, which is positively correlated with the reconstruction loss and negatively correlated with the reconstruction fidelity.
In one embodiment, the second determining subunit may be further configured to: discriminate the fidelity of the real image frame sequence by using the visual quality discriminator to obtain a predicted fidelity. The loss determination unit 803 may further include a fourth determining subunit (not shown in the figure) configured to determine a discriminator loss that is positively correlated with the reconstruction fidelity and negatively correlated with the predicted fidelity. The parameter adjustment unit 804 may be further configured to adjust the parameters of the visual quality discriminator with the aim of minimizing the discriminator loss.
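As a minimal sketch of how such a discriminator loss could be computed, assuming a binary-classification visual quality discriminator trained with binary cross-entropy (the loss below rises with the fidelity assigned to reconstructed frames and falls with the fidelity assigned to real frames, matching the stated correlations); all function and variable names are hypothetical.

import torch
import torch.nn.functional as F

def discriminator_loss(visual_quality_disc, real_frames, reconstructed_frames):
    # Fidelity scores in (0, 1): higher means "judged more real" by the discriminator.
    predicted_fidelity = visual_quality_disc(real_frames)
    reconstruction_fidelity = visual_quality_disc(reconstructed_frames.detach())
    # Real frames are pushed towards label 1, reconstructed frames towards label 0.
    loss_real = F.binary_cross_entropy(predicted_fidelity,
                                       torch.ones_like(predicted_fidelity))
    loss_fake = F.binary_cross_entropy(reconstruction_fidelity,
                                       torch.zeros_like(reconstruction_fidelity))
    return loss_real + loss_fake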
In one embodiment, the first determining subunit may be further configured to: determine a first sub-loss based on a difference between a first mouth region of the first lower half-face region and a second mouth region of the second lower half-face region; determine a second sub-loss based on a difference between the other regions of the first lower half-face region and the other regions of the second lower half-face region; and perform weighted summation on the first sub-loss and the second sub-loss based on first weight values preset respectively for the mouth region and the other regions of the lower half-face region, to obtain the reconstruction loss, wherein the first weight value of the mouth region is greater than the first weight value of the other regions.
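The weighted reconstruction loss described above could, for example, be realized with an L1 distance and a binary mouth mask as in the sketch below; the use of L1, the mask-based split and the specific weight values are illustrative assumptions only.

import torch

def reconstruction_loss(recon_half, real_half, mouth_mask, w_mouth=2.0, w_other=1.0):
    # recon_half / real_half: lower half-face crops, shape (B, 3, H, W)
    # mouth_mask: 1 inside the mouth region, 0 elsewhere, shape (B, 1, H, W)
    diff = torch.abs(recon_half - real_half)
    first_sub_loss = (diff * mouth_mask).sum() / mouth_mask.sum().clamp(min=1)
    second_sub_loss = (diff * (1 - mouth_mask)).sum() / (1 - mouth_mask).sum().clamp(min=1)
    # Weighted summation; the mouth weight exceeds the weight of the other regions.
    return w_mouth * first_sub_loss + w_other * second_sub_loss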
In one embodiment, the visual quality discriminator includes a first discriminator, and the second determining subunit may be further configured to: input the first lower half-face region into the first discriminator to obtain a first fidelity output by the first discriminator; and determine the reconstruction fidelity based on the first fidelity.
Further, the visual quality discriminator may also include a second discriminator, and the second determining subunit may be further configured to: input the first mouth region of the first lower half-face region into the second discriminator to obtain a second fidelity output by the second discriminator; and determine the reconstruction fidelity based on the first fidelity and the second fidelity.
In one embodiment, the visual quality discriminator further includes a third discriminator, and the second determining subunit may be further configured to: input the reconstructed image frame sequence into the third discriminator to obtain a third fidelity output by the third discriminator, the third fidelity being used for determining the reconstruction fidelity.
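Assuming each of the three discriminators outputs a scalar fidelity per image, the reconstruction fidelity could be aggregated from the first, second and third fidelities roughly as follows; the simple averaging is just one possible aggregation, and all names are hypothetical.

def reconstruction_fidelity(d_half_face, d_mouth, d_full, half_face, mouth, frames):
    # Each discriminator returns a fidelity score in (0, 1) per input image.
    first_fidelity = d_half_face(half_face)  # first lower half-face region
    second_fidelity = d_mouth(mouth)         # first mouth region of that half-face region
    third_fidelity = d_full(frames)          # whole reconstructed image frame sequence
    return (first_fidelity + second_fidelity + third_fidelity) / 3.0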
In one embodiment, the loss determination unit 803 may further include a fifth determining subunit (not shown in the figure) configured to determine, by using a pre-trained lip-sync discriminator, whether the lip shape of the reconstructed image frame sequence is synchronized with the voice information, to obtain a lip-sync loss; and the third determining subunit may be further configured to determine the prediction loss based on the reconstruction loss, the reconstruction fidelity and the lip-sync loss, the prediction loss being positively correlated with the lip-sync loss.
In one embodiment, the lip-sync discriminator includes a plurality of discriminators having different sizes, and the fifth determining subunit may be further configured to: for each discriminator among the plurality of discriminators, input the voice information and the reconstructed image frame sequence into that discriminator to obtain a discrimination result output by it; and generate the lip-sync loss based on the discrimination results respectively output by the plurality of discriminators.
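One plausible way to aggregate the outputs of lip-sync discriminators operating at different input sizes is sketched below, assuming each discriminator returns a probability that the lips are in sync with the speech and that frames are simply resized to each discriminator's expected size; the pairing of discriminators with sizes and the averaging scheme are assumptions.

import torch
import torch.nn.functional as F

def lip_sync_loss(sync_discriminators, voice_features, reconstructed_frames):
    # sync_discriminators: list of (discriminator, input_size) pairs, each pre-trained
    # at a different resolution and returning an in-sync probability per sample.
    losses = []
    for disc, size in sync_discriminators:
        frames = F.interpolate(reconstructed_frames, size=size, mode="bilinear",
                               align_corners=False)
        sync_prob = disc(voice_features, frames)
        target = torch.ones_like(sync_prob)      # the generator wants "in sync"
        losses.append(F.binary_cross_entropy(sync_prob, target))
    return torch.stack(losses).mean()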
In one embodiment, the reconstruction loss, the reconstruction fidelity and the lip-sync loss are associated with a plurality of preset second weight values, and the third determining subunit may be further configured to: perform weighted summation on the reconstruction loss, the reconstruction fidelity and the lip-sync loss based on the plurality of second weight values to obtain the prediction loss, wherein the second weight value corresponding to the reconstruction fidelity is a negative value and the remaining second weight values are positive values.
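Putting the terms together, the prediction loss as a weighted sum could look like the sketch below, where the second weight value attached to the reconstruction fidelity is negative so that a higher fidelity lowers the loss; the particular weight values are placeholders, not values taught by this embodiment.

def prediction_loss(reconstruction_loss, reconstruction_fidelity, lip_sync_loss,
                    w_recon=1.0, w_fidelity=-0.07, w_sync=0.03):
    # Second weight values: the weight of the reconstruction fidelity is negative,
    # the remaining weights are positive.
    return (w_recon * reconstruction_loss
            + w_fidelity * reconstruction_fidelity
            + w_sync * lip_sync_loss)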
In one embodiment, the same object comprises the same person object.
In one embodiment, the tooth reference image is one or more images.
In one embodiment, the tooth reference image comprises a tooth image showing tooth details, or a reference image comprising tooth contours, or the like.
Fig. 9 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present specification. The apparatus may be applied to any device, platform or device cluster having data storage, computing and processing capabilities.
As shown in fig. 9, the image generating apparatus 900 of this embodiment includes an acquisition unit 901 and an image generation unit 902. The acquisition unit 901 is configured to acquire target voice information, a target image frame sequence, an occlusion image frame sequence and a tooth reference image; the target image frame sequence serves as the face reference image frame sequence, the image frames in the target image frame sequence are face images of a target object, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the target image frame sequence. The image generation unit 902 is configured to input the target voice information, the target image frame sequence, the occlusion image frame sequence and the tooth reference image into a generator in a pre-trained image generation model for model processing, to obtain a reconstructed image frame sequence reconstructed for the lower half-face region of the image frames in the occlusion image frame sequence, wherein the lip shape of the reconstructed image frame sequence is synchronized with the target voice information.
In one embodiment, the target image frame sequence corresponds to a first image frame sequence of a first video of the target object and is formed from face images extracted from the first image frame sequence; and the apparatus 900 may further include: a replacing unit (not shown in the figure) configured to replace, with the image frames in the reconstructed image frame sequence, the face regions of the corresponding image frames in the first image frame sequence; and a video generation unit (not shown in the figure) configured to generate a second video based on the first image frame sequence after the replacement processing.
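One way the replacement and the generation of the second video might be implemented, assuming OpenCV is used for video I/O and that the face bounding boxes recorded during face extraction are available, is sketched below; the helper name, the (x, y, w, h) box format and the codec choice are hypothetical.

import cv2

def render_second_video(first_frames, recon_faces, face_boxes, out_path, fps=25):
    # first_frames: original BGR frames of the first video (first image frame sequence)
    # recon_faces:  reconstructed face crops (BGR), one per frame
    # face_boxes:   (x, y, w, h) boxes recorded when the faces were extracted
    h, w = first_frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame, face, (x, y, bw, bh) in zip(first_frames, recon_faces, face_boxes):
        face = cv2.resize(face, (bw, bh))
        frame = frame.copy()
        frame[y:y + bh, x:x + bw] = face   # replace the face region of the original frame
        writer.write(frame)
    writer.release()

In a complete pipeline, the original audio track would still need to be multiplexed back onto the written video, for example with a separate tool such as ffmpeg.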
For the apparatus embodiments corresponding to fig. 8 and fig. 9, respectively, the specific processing of each unit and the technical effects it brings can be found in the related descriptions of the corresponding method embodiments, and are not repeated here.
The embodiments of the present specification also provide a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to perform the training method of the image generation model or the image generation method described in the above method embodiments.
The embodiments of the present specification also provide a computing device, including a memory and a processor, where the memory stores executable code, and the processor implements the training method of the image generation model or the image generation method described in the foregoing method embodiments when executing the executable code.
The embodiments of the present specification also provide a computer program which, when executed in a computer, causes the computer to perform the training method of the image generation model or the image generation method described in the above method embodiments.
Those of skill in the art will appreciate that in one or more of the above examples, the functions described in the various embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
While the foregoing detailed description has further described the objects, technical solutions and advantages of the embodiments disclosed herein, it should be understood that the above is merely a detailed description of these embodiments and is not intended to limit their scope; any modifications, equivalents, improvements or the like made on the basis of the embodiments disclosed herein shall fall within their scope.

Claims (20)

1. A training method of an image generation model, the image generation model comprising a generator, the method comprising:
acquiring a training sample, wherein the training sample comprises voice information, a real image frame sequence, an occlusion image frame sequence, a face reference image frame sequence and a tooth reference image; the image frames in the real image frame sequence and in the face reference image frame sequence are face images of the same object, the lip shape of the real image frame sequence is synchronized with the voice information, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence;
inputting the voice information, the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image into the generator for model processing, to obtain a reconstructed image frame sequence reconstructed for the lower half-face region of the image frames in the occlusion image frame sequence;
determining a prediction loss based on the reconstructed image frame sequence and the real image frame sequence;
the parameters of the generator are adjusted with the aim of minimizing the prediction loss.
2. The method of claim 1, wherein the generator comprises a speech encoder, an image encoder, and a decoder; and the model processing comprises:
encoding the voice information by using the speech encoder to obtain voice features;
performing image encoding processing on the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image by using the image encoder to obtain image features;
performing feature fusion on the voice features and the image features to obtain a fusion result;
and decoding the fusion result by using the decoder to obtain the reconstructed image frame sequence.
3. The method of claim 1, wherein the image generation model further comprises a visual quality discriminator; and
the determining a prediction loss based on the reconstructed image frame sequence and the real image frame sequence, comprising:
determining a reconstruction loss based on a difference between a first lower half-face region of an image frame in the reconstructed image frame sequence and a second lower half-face region of an image frame in the real image frame sequence;
discriminating the fidelity of the reconstructed image frame sequence by using the visual quality discriminator to obtain a reconstruction fidelity;
calculating a prediction loss that is positively correlated with the reconstruction loss and negatively correlated with the reconstruction fidelity.
4. A method according to claim 3, further comprising:
discriminating the fidelity of the real image frame sequence by using the visual quality discriminator to obtain a predicted fidelity;
determining a discriminator loss that is positively correlated with the reconstruction fidelity and negatively correlated with the predicted fidelity;
adjusting the parameters of the visual quality discriminator with the aim of minimizing the discriminator loss.
5. The method of claim 3, wherein the determining a reconstruction loss comprises:
determining a first sub-loss based on a difference between a first mouth region of the first lower half-face region and a second mouth region of the second lower half-face region;
determining a second sub-loss based on a difference between the other regions of the first lower half-face region and the other regions of the second lower half-face region;
based on first weight values respectively preset for a mouth region and other regions of a lower half face region, carrying out weighted summation on the first sub-loss and the second sub-loss to obtain the reconstruction loss; wherein the first weight value of the mouth region is greater than the first weight values of the other regions.
6. A method according to claim 3, wherein the visual quality discriminator comprises a first discriminator; and
the discriminating the fidelity of the reconstructed image frame sequence by using the visual quality discriminator to obtain the reconstruction fidelity comprises:
inputting the first lower half-face region into the first discriminator to obtain a first fidelity output by the first discriminator;
determining the reconstruction fidelity based on the first fidelity.
7. The method of claim 6, wherein the visual quality discriminator further comprises a second discriminator; and
the method further comprises the steps of:
inputting a first mouth region of the first lower half face region into the second discriminator to obtain a second fidelity output by the second discriminator;
the determining the reconstruction fidelity comprises:
determining the reconstruction fidelity based on the first fidelity and the second fidelity.
8. The method according to one of claims 6-7, wherein the visual quality discriminator further comprises a third discriminator; and
the method further comprises the steps of:
inputting the reconstructed image frame sequence into the third discriminator to obtain a third fidelity output by the third discriminator, the third fidelity being used for determining the reconstruction fidelity.
9. The method of one of claims 3-7, further comprising:
determining, by using a pre-trained lip-sync discriminator, whether the lip shape of the reconstructed image frame sequence is synchronized with the voice information, to obtain a lip-sync loss; and
the calculating a prediction loss comprises:
determining the prediction loss based on the reconstruction loss, the reconstruction fidelity and the lip-sync loss, the prediction loss being positively correlated with the lip-sync loss.
10. The method of claim 9, wherein the lip-sync discriminator comprises a plurality of discriminators having different sizes; and
the determining, by using the pre-trained lip-sync discriminator, whether the lip shape of the reconstructed image frame sequence is synchronized with the voice information to obtain the lip-sync loss comprises:
for each discriminator among the plurality of discriminators, inputting the voice information and the reconstructed image frame sequence into that discriminator to obtain a discrimination result output by that discriminator;
generating the lip-sync loss based on the discrimination results respectively output by the plurality of discriminators.
11. The method of claim 9, wherein the reconstruction loss, the reconstruction fidelity and the lip-sync loss are associated with a plurality of preset second weight values; and
the determining the prediction loss based on the reconstruction loss, the reconstruction fidelity and the lip-sync loss comprises:
performing weighted summation on the reconstruction loss, the reconstruction fidelity and the lip-sync loss based on the plurality of second weight values to obtain the prediction loss, wherein the second weight value corresponding to the reconstruction fidelity is a negative value and the remaining second weight values are positive values.
12. The method of claim 1, wherein the same object comprises the same persona object.
13. The method of claim 1, wherein the tooth reference image is one or more images.
14. The method of claim 1, wherein the tooth reference image comprises any one of: a tooth image showing tooth details, a reference image of tooth contours.
15. An image generation method, comprising:
acquiring target voice information, a target image frame sequence, an occlusion image frame sequence and a tooth reference image; wherein the target image frame sequence serves as a face reference image frame sequence, the image frames in the target image frame sequence are face images of a target object, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the target image frame sequence; and
inputting the target voice information, the target image frame sequence, the occlusion image frame sequence and the tooth reference image into a generator in a pre-trained image generation model for model processing, to obtain a reconstructed image frame sequence reconstructed for the lower half-face region of the image frames in the occlusion image frame sequence; wherein the lip shape of the reconstructed image frame sequence is synchronized with the target voice information.
16. The method of claim 15, wherein the target image frame sequence corresponds to a first image frame sequence of a first video of the target object and is formed from facial images extracted from the first image frame sequence; and
the method further comprises the steps of:
replacing the face regions of the corresponding image frames in the first image frame sequence with the image frames in the reconstructed image frame sequence; and
generating a second video based on the first image frame sequence after the replacement processing.
17. A training apparatus of an image generation model, the image generation model comprising a generator, the apparatus comprising:
an acquisition unit configured to acquire a training sample comprising voice information, a real image frame sequence, an occlusion image frame sequence, a face reference image frame sequence and a tooth reference image; wherein the image frames in the real image frame sequence and in the face reference image frame sequence are face images of the same object, the lip shape of the real image frame sequence is synchronized with the voice information, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the real image frame sequence;
an image generation unit configured to input the voice information, the occlusion image frame sequence, the face reference image frame sequence and the tooth reference image into the generator for model processing, to obtain a reconstructed image frame sequence reconstructed for the lower half-face region of the image frames in the occlusion image frame sequence;
a loss determination unit configured to determine a prediction loss based on the reconstructed image frame sequence and the real image frame sequence; and
a parameter adjustment unit configured to adjust parameters of the generator with the aim of minimizing the prediction loss.
18. An image generating apparatus comprising:
an acquisition unit configured to acquire target voice information, a target image frame sequence, an occlusion image frame sequence and a tooth reference image; wherein the target image frame sequence serves as a face reference image frame sequence, the image frames in the target image frame sequence are face images of a target object, and the occlusion image frame sequence is obtained by performing occlusion processing on the lower half-face region of the image frames in the target image frame sequence; and
an image generation unit configured to input the target voice information, the target image frame sequence, the occlusion image frame sequence and the tooth reference image into a generator in a pre-trained image generation model for model processing, to obtain a reconstructed image frame sequence reconstructed for the lower half-face region of the image frames in the occlusion image frame sequence; wherein the lip shape of the reconstructed image frame sequence is synchronized with the target voice information.
19. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-16.
20. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-16.
CN202310762872.0A 2023-06-26 2023-06-26 Training method and device for image generation model, and image generation method and device Pending CN116797877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310762872.0A CN116797877A (en) 2023-06-26 2023-06-26 Training method and device for image generation model, and image generation method and device

Publications (1)

Publication Number Publication Date
CN116797877A true CN116797877A (en) 2023-09-22

Family

ID=88046638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310762872.0A Pending CN116797877A (en) 2023-06-26 2023-06-26 Training method and device for image generation model, and image generation method and device

Country Status (1)

Country Link
CN (1) CN116797877A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785258A * 2019-01-10 2019-05-21 华南理工大学 A facial image inpainting method based on a multi-discriminator generative adversarial network
CN114255496A (en) * 2021-11-30 2022-03-29 北京达佳互联信息技术有限公司 Video generation method and device, electronic equipment and storage medium
CN115620371A (en) * 2022-10-25 2023-01-17 贝壳找房(北京)科技有限公司 Training method and device for speaking video generation model, electronic equipment and storage medium
CN115713579A (en) * 2022-10-25 2023-02-24 贝壳找房(北京)科技有限公司 Wav2Lip model training method, image frame generation method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination