CN112966607A - Model training method, face video generation method, device, equipment and medium - Google Patents

Model training method, face video generation method, device, equipment and medium

Info

Publication number
CN112966607A
Authority
CN
China
Prior art keywords
model
loss value
training
face
lip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110246125.2A
Other languages
Chinese (zh)
Inventor
胡天舒
洪智滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110246125.2A priority Critical patent/CN112966607A/en
Publication of CN112966607A publication Critical patent/CN112966607A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a model training method, a face video generation method, a device, equipment, a medium and a program product, relates to the field of artificial intelligence, and in particular to computer vision and deep learning technologies. The specific implementation scheme is as follows: inputting input data for training a main model into the main model to obtain output data of the main model; calculating an initial loss value according to the input data and the output data by utilizing a pre-trained expert model; calculating the matching probability of the input data and the corresponding training annotation result by utilizing the expert model; and fusing the matching probability and the initial loss value to obtain a target loss value, wherein the target loss value is used for performing supervised training on the main model. By dynamically adjusting the loss value, the method and the device can improve the accuracy of model training regardless of whether the expert model performs well enough.

Description

Model training method, face video generation method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence, in particular to computer vision and deep learning technologies, and specifically to a model training method, a face video generation method, an apparatus, a device, a medium, and a program product.
Background
With the development of theory and computing power, deep learning has become the most important technology in the field of artificial intelligence. In training a deep learning model, the loss function plays a crucial role: a good loss function is required to correctly measure the quality of the current model and to guide the model to converge in a direction with positive gain during training.
In the prior art, a common approach is to use an already trained model as an "expert" to supervise the training of another model, with the expert model serving as one of the loss functions. This approach works well for many tasks. However, on some data sets the performance of these "expert" models is itself not particularly stable, so a correct loss value cannot be calculated; the loss value that is finally obtained may even differ so much from the true value that it harms the training of the model.
Disclosure of Invention
The application provides a model training method, a face video generation method, an apparatus, a device, a medium, and a program product, so as to improve the accuracy of model training.
In a first aspect, the present application provides a model training method, including:
inputting input data for training a main model into the main model to obtain output data of the main model;
calculating an initial loss value according to the input data and the output data by utilizing a pre-trained expert model;
calculating the matching probability of the input data and the corresponding training annotation result by utilizing the expert model;
and fusing the matching probability and the initial loss value to obtain a target loss value, wherein the target loss value is used for performing supervised training on the main model.
In a second aspect, the present application further provides a method for generating a face video based on voice driving, including:
outputting a face lip sequence according to an input face image and voice data by utilizing a pre-trained voice-driven lip model, wherein the voice-driven lip model is obtained by training according to a model training method as described in any embodiment of the application;
and generating the face video driven by the voice data on the face image based on the face lip sequence.
In a third aspect, the present application further provides a model training apparatus, including:
the output data determining module is used for inputting input data for training a main model into the main model to obtain output data of the main model;
the initial loss value determining module is used for calculating an initial loss value according to the input data and the output data by utilizing a pre-trained expert model;
the matching probability calculation module is used for calculating the matching probability of the input data and the corresponding training annotation result by utilizing the expert model;
and the target loss value determining module is used for fusing the matching probability and the initial loss value to obtain a target loss value, wherein the target loss value is used for performing supervised training on the main model.
In a fourth aspect, the present application further provides a face video generating device based on voice driving, including:
a face lip sequence determination module, configured to use a pre-trained voice-driven lip model to obtain a face lip sequence according to an input face image and voice data, where the voice-driven lip model is trained by the model training apparatus according to any embodiment of the present application;
and the face video generation module is used for generating the face video driven by the voice data on the face image based on the face lip sequence.
In a fifth aspect, the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a model training method as described in any of the embodiments of the present application.
In a sixth aspect, the present application further provides a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the model training method according to any of the embodiments of the present application.
In a seventh aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the model training method according to any of the embodiments of the present application.
In an eighth aspect, the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for generating a face video based on voice driving according to any of the embodiments of the present application.
In a ninth aspect, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method for generating a face video based on voice driving according to any of the embodiments of the present application.
In a tenth aspect, the present application further provides a computer program product, including a computer program, where the computer program, when executed by a processor, implements the method for generating a face video based on voice driving according to any embodiment of the present application.
It should be understood that the statements in this section are not intended to identify key or critical features of the present application, nor to limit the scope of the present application. Other features of the present application will become readily apparent from the following description, and other effects of the above alternatives are described below in conjunction with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow diagram of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic flow diagram of a model training method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a model training method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for generating a face video based on voice driving according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a model training apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a face video generation device based on voice driving according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device for implementing a model training method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Fig. 1 is a schematic flowchart of a model training method according to an embodiment of the present application, which is applicable to a case where an expert model is used to perform supervised training on a main model, and relates to computer vision and deep learning technologies. The method may be performed by a model training apparatus, which is implemented in software and/or hardware, and is preferably configured in an electronic device, such as a server or a computer device. As shown in fig. 1, the method specifically includes the following steps:
s101, inputting input data for training the main model into the main model to obtain output data of the main model.
The main model is the model to be trained, for example a voice-driven lip model; it may perform any deep learning task, and it produces output data from input data.
S102, calculating an initial loss value according to the input data and the output data by using a pre-trained expert model.
The expert model is a pre-trained model, such as a lip synchronization model. After the expert model is trained, its parameters are fixed and it is switched to a test mode. The input data and the output data of the main model are then input into the expert model, which calculates the degree of matching between them, and an initial loss value is determined according to this matching degree. The expert model may be trained, and the loss value calculated, by any existing method, which is not described here again.
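For illustration only, the following is a minimal PyTorch-style sketch of S102. The function and variable names (compute_initial_loss, expert_model) and the choice of 1.0 minus the matching degree as the pre-configured loss are assumptions made for this example, not details fixed by the present disclosure.

    import torch

    def compute_initial_loss(expert_model, input_data, output_data):
        # The pre-trained expert is frozen and switched to test (evaluation) mode;
        # gradients still flow back to the main model through output_data.
        expert_model.eval()
        for p in expert_model.parameters():
            p.requires_grad_(False)
        # The expert scores how well the main model's output matches the input,
        # e.g. as a value in [0, 1].
        match_degree = expert_model(input_data, output_data)
        # Example pre-configured loss: drive the matching degree towards 1.0.
        return 1.0 - match_degree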
S103, calculating the matching probability of the input data and the corresponding training annotation result by using an expert model.
The training annotation result is the pre-labeled ground-truth result corresponding to the input data used for training. The input data and the training annotation result are input into the expert model, which calculates the matching probability between them.
Because the training annotation result is the true result corresponding to the input data, the two match completely, so in theory the matching probability calculated by the expert model should be equal to 1.0; in practice, however, the expert model is not stable on some data sets. Therefore, the closer the matching probability is to 1.0, the more accurately the expert model judges this data pair and the more the expert can be trusted; conversely, the farther the matching probability is from 1.0, the less accurate and less trustworthy the expert model's evaluation is. In the latter case, if the expert model were still used to calculate the loss value and supervise the training of the main model, the training effect and the accuracy of model training would suffer.
S104, fusing the matching probability and the initial loss value to obtain a target loss value, wherein the target loss value is used for performing supervised training on the main model.
In the embodiment of the present application, the matching probability of the input data and the training annotation result is calculated by the expert model, so the matching probability can be regarded as a confidence that indicates how much the expert model can be trusted. Using this matching probability to dynamically adjust the initial loss value turns the loss function into an adaptive dynamic function, and the fused target loss value is likewise a dynamically adjusted value. This solves the problem that the training effect of the main model suffers when the expert model does not perform well on certain data sets. The method can also accelerate model convergence and increase training speed, thereby improving hardware execution efficiency overall and saving hardware resources.
According to the technical solution of this embodiment, after the initial loss value is calculated, the matching probability of the input data and the corresponding training annotation result is calculated with the expert model, and the matching probability and the initial loss value are fused into a dynamically adjusted target loss value. Introducing this adaptive dynamic loss function better guides the training of the main model: even when the expert model or the loss function is not good enough, the final loss value can be adaptively adjusted to the current situation, improving the effect and accuracy of model training.
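As a purely illustrative sketch, one training iteration combining S101 to S104 might look as follows (PyTorch-style; main_model, expert_model, optimizer and the compute_initial_loss helper from the previous sketch are all hypothetical names, and the expert is assumed to return a scalar score in [0, 1]):

    import torch

    def training_step(main_model, expert_model, optimizer, input_data, annotation):
        # S101: forward pass of the main model on the training input.
        output_data = main_model(input_data)
        # S102: initial loss from the frozen expert model.
        initial_loss = compute_initial_loss(expert_model, input_data, output_data)
        # S103: matching probability between the input data and its ground-truth
        # annotation, interpreted as the expert's confidence on this sample.
        with torch.no_grad():
            match_prob = expert_model(input_data, annotation)
        # S104: fuse the confidence with the initial loss into the target loss.
        target_loss = (match_prob * initial_loss).mean()
        # Supervised training of the main model with the dynamic target loss.
        optimizer.zero_grad()
        target_loss.backward()
        optimizer.step()
        return target_loss.item()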
Fig. 2 is a schematic flow chart of a model training method according to an embodiment of the present application, and the present embodiment further performs optimization based on the above embodiment. As shown in fig. 2, the method specifically includes the following steps:
s201, inputting input data for training the main model into the main model to obtain output data of the main model.
S202, calculating the matching degree of the input data and the output data by using an expert model.
S203, calculating an initial loss value according to the matching degree and a pre-configured loss function.
S204, calculating the matching probability of the input data and the corresponding training annotation result by using the expert model.
S205, judging whether the matching probability is higher than a preset threshold value; if so, executing S206, otherwise executing S207.
S206, multiplying the initial loss value by the matching probability serving as a weight to obtain a target loss value, wherein the target loss value is used for performing supervised training on the main model.
S207, setting the matching probability to zero and not calculating the target loss value for the current iteration.
The threshold may be preset as needed; for example, it may be set to 0.5. When the matching probability is higher than 0.5, the expert model performs well on the current sample and can be trusted. When the matching probability is not higher than 0.5, the expert model performs poorly on the current data set and cannot be trusted: its confidence is too low, and it may even guide the main model to iterate in a bad direction. Therefore, when the matching probability is not higher than the threshold, the matching probability is set to zero and the current target loss value is not calculated, so as to avoid affecting the training of the main model.
When the matching probability is higher than the threshold, the matching probability is multiplied, as a weight, by the initial loss value. The resulting target loss value is therefore dynamically adjusted according to the confidence of the expert model, so it is more accurate, and training the main model with it is more accurate as well.
According to this technical solution, the threshold is used to judge whether the confidence of the expert model is too low, which prevents the expert model from guiding the main model to iterate in a bad direction and harming its training, thereby improving the training effect of the main model.
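A minimal sketch of the thresholding in S205 to S207, using the same hypothetical scalar tensors as the sketches above (the value 0.5 is only the example threshold mentioned in this embodiment):

    def fuse_with_threshold(match_prob, initial_loss, threshold=0.5):
        # S205/S206: if the expert is confident enough on this sample, weight
        # the initial loss by the matching probability.
        if match_prob > threshold:
            return match_prob * initial_loss
        # S207: otherwise the weight is zeroed and no target loss is computed
        # for the current iteration.
        return None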
Fig. 3 is a flowchart of a model training method according to an embodiment of the present application, which takes a voice-driven lip task as an example, and further performs optimization based on the above embodiment. As shown in fig. 3, the method specifically includes the following steps:
s301, inputting the face image and the voice data for training the voice-driven lip model into the voice-driven lip model to obtain a face lip sequence output by the voice-driven lip model.
In this embodiment, the voice-driven lip model is the main model to be trained. The input data used for training include a face image and voice data; the voice-driven lip model outputs a face lip sequence from this input data, and a face video driven by the input voice data can then be generated on the face image according to the face lip sequence.
S302, taking a pre-trained voice-lip synchronization model as the expert model, and calculating the matching degree of the voice data and the face lip sequence by using the expert model.
The expert model is a lip synchronization model and is used to calculate the degree of matching between the input voice data and the face lip sequence.
S303, calculating an initial loss value according to the matching degree and a pre-configured loss function.
For example, the expert model is used as one of the loss functions, and the initial loss value is calculated based on the matching degree and the loss function. The loss function can be configured according to the actual situation.
S304, calculating the matching probability of the voice data and the corresponding training annotation result by using the expert model.
S305, judging whether the matching probability is higher than a preset threshold value; if so, executing S306 and then S307, otherwise executing S308.
S306, multiplying the initial loss value by the matching probability serving as a weight to obtain a target loss value.
S307, performing supervised training on the voice-driven lip model by using the target loss value.
S308, setting the matching probability to zero and not calculating the target loss value for the current round.
To measure the reliability of the voice-lip synchronization model serving as the expert model, the voice data and the labeled ground-truth training annotation result are input into the expert model to calculate their matching probability. The closer the matching probability is to 1.0, the more accurately the voice-lip synchronization model judges the input data pair and the more the expert can be trusted; conversely, the farther the matching probability is from 1.0, the less accurate and less trustworthy its evaluation is. The matching probability can therefore be viewed as a confidence of the lip synchronization model.
In addition, to avoid the case where the confidence of the lip synchronization model is too low and the voice-driven lip model is even guided to iterate in a bad direction, a threshold is preset in this embodiment. When the matching probability is not higher than the threshold, the matching probability is set to zero and the current target loss value is not calculated, so that the training of the voice-driven lip model is not affected. When the matching probability is higher than the threshold, the matching probability is multiplied, as a weight, by the initial loss value to obtain the target loss value. In this way a dynamic loss is introduced according to the confidence of the lip synchronization model, and the voice-driven lip model is supervised with this dynamic target loss value, which improves the accuracy of model training and the prediction accuracy of the voice-driven lip model.
According to the technical solution of this embodiment, the loss value is dynamically adjusted by introducing an adaptive dynamic loss function. Whether or not the voice-lip synchronization model performs well enough, the accuracy of training the voice-driven lip model can be improved and the training can be better guided; even when the so-called "expert" model or the loss function is not good enough, the final loss value can be adaptively adjusted to the current situation. Moreover, the technical solution of this embodiment is not limited to the voice-driven lip task and can be extended to different deep learning tasks, so it is highly portable.
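Purely as an illustration of this embodiment, one training iteration of the voice-driven lip model supervised by a lip-sync expert could be sketched as follows (PyTorch-style; the interfaces of lip_model and sync_expert, and the example loss 1.0 - sync_score, are assumptions rather than disclosed details):

    import torch

    def lip_training_step(lip_model, sync_expert, optimizer,
                          face_image, voice_data, gt_lip_sequence, threshold=0.5):
        # S301: predict a face lip sequence from the face image and voice data.
        pred_lip_sequence = lip_model(face_image, voice_data)
        # S302/S303: the lip-sync expert scores how well the voice matches the
        # predicted lips; the example loss pushes that score towards 1.0.
        sync_score = sync_expert(voice_data, pred_lip_sequence)
        initial_loss = 1.0 - sync_score
        # S304: the expert's confidence, measured against the ground-truth lips.
        with torch.no_grad():
            match_prob = sync_expert(voice_data, gt_lip_sequence)
        # S305/S308: skip this sample when the expert cannot be trusted.
        if match_prob <= threshold:
            return None
        # S306/S307: weight the loss by the confidence and update the lip model.
        target_loss = (match_prob * initial_loss).mean()
        optimizer.zero_grad()
        target_loss.backward()
        optimizer.step()
        return target_loss.item()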
Fig. 4 is a schematic flowchart of a method for generating a face video based on voice driving according to an embodiment of the present application, which is applicable to a case where an expert model is used to perform supervised training on a voice-driven lip model and a trained voice-driven lip model is used to generate a face video based on voice driving, and relates to computer vision and deep learning technologies. The method can be executed by a human face video generation device based on voice driving, which is implemented by software and/or hardware, and is preferably configured in electronic equipment, such as a server or computer equipment. As shown in fig. 4, the method specifically includes the following steps:
s401, outputting a face lip sequence according to an input face image and voice data by using a pre-trained voice driven lip model, wherein the voice driven lip model is obtained by training according to the model training method in any embodiment of the application.
And S402, generating a voice data-driven face video on the face image based on the face lip sequence.
According to the model training method disclosed by any embodiment of the application, the loss value is dynamically adjusted by introducing the adaptive dynamic loss function, the accuracy of the training of the voice-driven lip model can be improved no matter whether the performance of the voice-lip synchronous model is good enough, the training of the voice-driven lip model can be better guided, and the final loss value can be adaptively adjusted according to the current situation even under the condition that a so-called 'expert' model or the loss function is not good enough. Therefore, the voice-driven lip-shaped model trained by the model training method according to any embodiment of the application has better performance on a voice-driven lip-shaped task, and a face lip-shaped sequence can be predicted more accurately according to the input face image and voice data, so that a more accurate voice data-driven face video is obtained.
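For completeness, a hypothetical sketch of the generation stage (S401 and S402); render_frame stands in for whatever rendering or compositing step produces video frames from the predicted lips, which this application does not fix:

    import torch

    def generate_face_video(lip_model, render_frame, face_image, voice_data):
        # S401: predict the face lip sequence with the trained voice-driven lip model.
        lip_model.eval()
        with torch.no_grad():
            lip_sequence = lip_model(face_image, voice_data)
        # S402: composite each predicted lip shape onto the face image to obtain
        # the frames of the face video driven by the voice data.
        return [render_frame(face_image, lips) for lips in lip_sequence]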
Fig. 5 is a schematic structural diagram of a model training device according to an embodiment of the present application, which is applicable to a case where an expert model is used to perform supervised training on a model, and relates to computer vision and deep learning technologies. The device can realize the model training method in any embodiment of the application. As shown in fig. 5, the apparatus 500 specifically includes:
an output data determining module 501, configured to input data for training a master model into the master model to obtain output data of the master model;
an initial loss value determining module 502, configured to calculate an initial loss value according to the input data and the output data by using a pre-trained expert model;
a matching probability calculation module 503, configured to calculate, by using the expert model, a matching probability of the input data and the training annotation result corresponding to the input data;
a target loss value determining module 504, configured to fuse the matching probability and the initial loss value to obtain a target loss value, where the target loss value is used to perform supervised training on the main model.
Optionally, the initial loss value determining module 502 includes:
a matching degree calculation unit for calculating a matching degree of the input data and the output data using the expert model;
and the initial loss value calculating unit is used for calculating the initial loss value according to the matching degree and a pre-configured loss function.
Optionally, the target loss value determining module 504 is specifically configured to:
and multiplying the matching probability serving as a weight by the initial loss value to obtain the target loss value.
Optionally, the target loss value determining module 504 includes:
a first determining unit, configured to, if the matching probability is higher than a preset threshold, multiply the initial loss value by the matching probability as a weight to obtain the target loss value;
a second determining unit, configured to set the matching probability to zero if the matching probability is not higher than the threshold.
Optionally, the input data includes a face image and voice data; the main model is a voice-driven lip model; the output data includes a face lip sequence.
Optionally, the expert model is a lip synchronization model, and is configured to calculate a matching degree between the voice data and the face lip sequence.
The model training device 500 provided in the embodiment of the present application can execute the model training method provided in any embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method. Reference may be made to the description of any method embodiment of the present application for details not explicitly described in this embodiment.
Fig. 6 is a schematic structural diagram of a face video generation device based on voice driving according to an embodiment of the present application, which is applicable to a case where an expert model is used to perform supervised training on a voice-driven lip model, and a face video based on voice driving is generated by using the trained voice-driven lip model, and relates to computer vision and deep learning technologies. The device can realize the face video generation method based on the voice drive in any embodiment of the application. As shown in fig. 6, the apparatus 600 specifically includes:
the face lip sequence determining module 601 is configured to use a pre-trained voice-driven lip model to obtain a face lip sequence according to an input face image and voice data, where the voice-driven lip model is obtained by training with a model training device according to any embodiment of the present application;
and a face video generating module 602, configured to generate a voice data-driven face video on the face image based on the face lip sequence.
The face video generation device 600 based on voice driving provided by the embodiment of the present application can execute the face video generation method based on voice driving provided by any embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method. Reference may be made to the description of any method embodiment of the present application for details not explicitly described in this embodiment.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the model training method. For example, in some embodiments, the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome. The server may also be a server of a distributed system, or a server incorporating a blockchain.
In addition, according to the embodiments of the present application, another electronic device, another readable storage medium, and another computer program product are provided for executing one or more steps of the method for generating a face video based on voice driving according to any embodiment of the present application. The specific structure and program code thereof can be referred to the content description of the embodiment shown in fig. 7, and are not described herein again.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (20)

1. A model training method, comprising:
inputting input data for training a main model into the main model to obtain output data of the main model;
calculating an initial loss value according to the input data and the output data by utilizing a pre-trained expert model;
calculating the matching probability of the input data and the corresponding training annotation result by utilizing the expert model;
and fusing the matching probability and the initial loss value to obtain a target loss value, wherein the target loss value is used for performing supervised training on the main model.
2. The method of claim 1, wherein said calculating an initial loss value from said input data and output data using a pre-trained expert model comprises:
calculating the matching degree of the input data and the output data by utilizing the expert model;
and calculating the initial loss value according to the matching degree and a pre-configured loss function.
3. The method of claim 1, wherein said fusing the match probability with the initial loss value to obtain a target loss value comprises:
and multiplying the matching probability serving as a weight by the initial loss value to obtain the target loss value.
4. The method of claim 3, wherein said multiplying said match probability as a weight with said initial loss value to obtain said target loss value comprises:
if the matching probability is higher than a preset threshold value, multiplying the matching probability serving as weight by the initial loss value to obtain the target loss value;
and if the matching probability is not higher than the threshold value, setting the matching probability to zero.
5. The method of any of claims 1-4, wherein the input data includes face images and voice data; the main model is a voice-driven lip model; the output data includes a face lip sequence.
6. The method of claim 5, wherein the expert model is a lip synchronization model for calculating a degree of matching of the speech data with a lip sequence of a human face.
7. A face video generation method based on voice driving, comprising:
outputting a face lip sequence according to an input face image and voice data by utilizing a pre-trained voice-driven lip model, wherein the voice-driven lip model is trained according to the model training method of any one of claims 1-6;
and generating the face video driven by the voice data on the face image based on the face lip sequence.
8. A model training apparatus comprising:
the output data determining module is used for inputting input data for training a main model into the main model to obtain output data of the main model;
the initial loss value determining module is used for calculating an initial loss value according to the input data and the output data by utilizing a pre-trained expert model;
the matching probability calculation module is used for calculating the matching probability of the input data and the corresponding training annotation result by utilizing the expert model;
and the target loss value determining module is used for fusing the matching probability and the initial loss value to obtain a target loss value, wherein the target loss value is used for performing supervised training on the main model.
9. The apparatus of claim 8, wherein the initial loss value determination module comprises:
a matching degree calculation unit for calculating a matching degree of the input data and the output data using the expert model;
and the initial loss value calculating unit is used for calculating the initial loss value according to the matching degree and a pre-configured loss function.
10. The apparatus of claim 8, wherein the target loss value determination module is specifically configured to:
and multiplying the matching probability serving as a weight by the initial loss value to obtain the target loss value.
11. The apparatus of claim 10, wherein the target loss value determination module comprises:
a first determining unit, configured to, if the matching probability is higher than a preset threshold, multiply the initial loss value by the matching probability as a weight to obtain the target loss value;
a second determining unit, configured to set the matching probability to zero if the matching probability is not higher than the threshold.
12. The apparatus of any of claims 8-11, wherein the input data comprises face images and voice data; the main model is a voice-driven lip model; the output data includes a face lip sequence.
13. The apparatus of claim 12, wherein the expert model is a lip synchronization model for calculating a degree of matching of the voice data with a lip sequence of a human face.
14. A face video generation device based on voice driving, comprising:
a face lip sequence determination module, configured to use a pre-trained voice-driven lip model to obtain a face lip sequence according to an input face image and voice data, where the voice-driven lip model is trained by the model training apparatus according to any one of claims 8-13;
and the face video generation module is used for generating the face video driven by the voice data on the face image based on the face lip sequence.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method of any one of claims 1-6.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the model training method of any one of claims 1-6.
17. A computer program product comprising a computer program which, when executed by a processor, implements a model training method according to any one of claims 1-6.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of claim 7.
19. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the face video generation method based on voice driving according to claim 7.
20. A computer program product comprising a computer program which, when executed by a processor, implements the face video generation method based on voice driving according to claim 7.
CN202110246125.2A 2021-03-05 2021-03-05 Model training method, face video generation method, device, equipment and medium Pending CN112966607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110246125.2A CN112966607A (en) 2021-03-05 2021-03-05 Model training method, face video generation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110246125.2A CN112966607A (en) 2021-03-05 2021-03-05 Model training method, face video generation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112966607A true CN112966607A (en) 2021-06-15

Family

ID=76276689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110246125.2A Pending CN112966607A (en) 2021-03-05 2021-03-05 Model training method, face video generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112966607A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913187A (en) * 2022-05-25 2022-08-16 北京百度网讯科技有限公司 Image segmentation method, training method, device, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291755A (en) * 2020-02-13 2020-06-16 腾讯科技(深圳)有限公司 Object detection model training method, object detection device, computer equipment and storage medium
WO2020199693A1 (en) * 2019-03-29 2020-10-08 中国科学院深圳先进技术研究院 Large-pose face recognition method and apparatus, and device
CN111832290A (en) * 2020-05-25 2020-10-27 北京三快在线科技有限公司 Model training method and device for determining text relevancy, electronic equipment and readable storage medium
WO2020232909A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Pedestrian visual tracking method, model training method and device, apparatus and storage medium
CN112348073A (en) * 2020-10-30 2021-02-09 北京达佳互联信息技术有限公司 Polyphone recognition method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020199693A1 (en) * 2019-03-29 2020-10-08 中国科学院深圳先进技术研究院 Large-pose face recognition method and apparatus, and device
WO2020232909A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Pedestrian visual tracking method, model training method and device, apparatus and storage medium
CN111291755A (en) * 2020-02-13 2020-06-16 腾讯科技(深圳)有限公司 Object detection model training method, object detection device, computer equipment and storage medium
CN111832290A (en) * 2020-05-25 2020-10-27 北京三快在线科技有限公司 Model training method and device for determining text relevancy, electronic equipment and readable storage medium
CN112348073A (en) * 2020-10-30 2021-02-09 北京达佳互联信息技术有限公司 Polyphone recognition method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘天卓; 水源; 焦慧芳: "Insurance contract design based on the policyholder's risk preference type", Journal of University of Science and Technology of China (中国科学技术大学学报), no. 08, 15 August 2016 (2016-08-15) *
张桂梅; 潘国峰; 刘建新: "Domain-adaptive semantic segmentation of urban scenes", Journal of Image and Graphics (中国图象图形学报), no. 05, 16 May 2020 (2020-05-16) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913187A (en) * 2022-05-25 2022-08-16 北京百度网讯科技有限公司 Image segmentation method, training method, device, electronic device and storage medium
CN114913187B (en) * 2022-05-25 2023-04-07 北京百度网讯科技有限公司 Image segmentation method, training method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN113961351A (en) Distributed training method, device, equipment and storage medium for deep learning model
CN112988727B (en) Data annotation method, device, equipment, storage medium and computer program product
CN115147687A (en) Student model training method, device, equipment and storage medium
CN114881129A (en) Model training method and device, electronic equipment and storage medium
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN115456167A (en) Lightweight model training method, image processing device and electronic equipment
CN114819079A (en) Model training method and device, electronic equipment and readable storage medium
CN114818913A (en) Decision generation method and device
CN114417780A (en) State synchronization method and device, electronic equipment and storage medium
CN112966607A (en) Model training method, face video generation method, device, equipment and medium
CN115481594B (en) Scoreboard implementation method, scoreboard, electronic equipment and storage medium
CN113435718B (en) Glass generating method and device, electronic equipment and storage medium
CN114445682A (en) Method, device, electronic equipment, storage medium and product for training model
CN114139605A (en) Distributed model training method, system, device and storage medium
CN114819095A (en) Method and device for generating business data processing model and electronic equipment
CN113327602A (en) Method and device for speech recognition, electronic equipment and readable storage medium
CN113361719A (en) Incremental learning method based on image processing model and image processing method
CN113361575A (en) Model training method and device and electronic equipment
CN115641481A (en) Method and device for training image processing model and image processing
CN116416500B (en) Image recognition model training method, image recognition device and electronic equipment
CN113963433B (en) Motion search method, motion search device, electronic equipment and storage medium
CN113900734B (en) Application program file configuration method, device, equipment and storage medium
CN114222073B (en) Video output method, video output device, electronic equipment and storage medium
CN115600671B (en) Data processing method, device, equipment and storage medium of deep learning framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination