CN112132912A - Method and device for establishing face generation model and generating face image

Info

Publication number
CN112132912A
CN112132912A (application number CN201910556085.4A)
Authority
CN
China
Prior art keywords
image
face
model
generation
images
Prior art date
Legal status
Granted
Application number
CN201910556085.4A
Other languages
Chinese (zh)
Other versions
CN112132912B (en)
Inventor
李鑫
刘霄
张赫男
赵翔
李甫
何栋梁
龙翔
周志超
孙昊
文石磊
丁二锐
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority claimed from CN201910556085.4A
Publication of CN112132912A
Application granted
Publication of CN112132912B
Status: Active

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a method for establishing a face generation model, comprising: acquiring face images; extracting an image of a preset part and a face edge image from each face image and stitching the extracted images into a stitched image corresponding to that face image, wherein the image of the preset part is a mouth image; constructing a generative adversarial network comprising a generative model and a discriminative model; and training the generative adversarial network on the face images and their corresponding stitched images, the generative model in the trained network serving as the face generation model. The invention also provides a method for generating a face image, comprising: acquiring a mouth image; extracting a face edge image of the face in a template image and stitching it with the mouth image to obtain an input image; and feeding the input image into the face generation model to obtain a face image from its output. The invention can generate high-definition, realistic face images.

Description

Method and device for establishing face generation model and generating face image
[ technical field ]
The present invention relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a computer storage medium for establishing a face generation model and generating a face image.
[ background of the invention ]
In the related art, face images are generally generated using 2D or 3D techniques. However, face images generated by 2D techniques are blurry, and the expressions of face images generated by 3D techniques are stiff. A method capable of generating high-definition, realistic face images is therefore desirable.
[ summary of the invention ]
In view of the above, the present invention provides a method, an apparatus, a device, and a computer storage medium for establishing a face generation model and generating a face image, which are used to generate high-definition, realistic face images.
The technical solution adopted by the invention is a method for establishing a face generation model, the method comprising: acquiring face images; extracting an image of a preset part and a face edge image from each face image, and stitching the extracted images into a stitched image corresponding to that face image, wherein the image of the preset part is a mouth image; constructing a generative adversarial network comprising a generative model and a discriminative model; and training the generative adversarial network on the face images and their corresponding stitched images, the generative model in the trained network serving as the face generation model.
According to a preferred embodiment of the present invention, after the face images are acquired, the method further comprises: acquiring the resolution of each face image, and filtering out face images whose resolution is below a preset threshold.
According to a preferred embodiment of the present invention, the image of the preset part further includes an eye image and an eyebrow image; the face edge image is an image with the mouth, nose, and chin of the face removed.
According to a preferred embodiment of the present invention, constructing a generative adversarial network comprising a generative model and a discriminative model includes: combining N discriminators into the discriminative model, where the input of each discriminator corresponds to image blocks of a different scale and N is a positive integer greater than or equal to 2.
According to a preferred embodiment of the present invention, training the generative adversarial network on the face images and their corresponding stitched images comprises: taking each face image as a real sample; inputting the stitched image into the generative model and taking its output as a generated sample; feeding the real sample and its corresponding generated sample to the discriminative model and deriving the loss functions of the discriminative and generative models from the discriminative model's output; and adjusting parameters in the network structures of the generative and discriminative models according to these loss functions until the generative adversarial network converges.
According to a preferred embodiment of the present invention, taking the real sample and its corresponding generated sample as the input of the discriminative model comprises: extracting N image blocks of different scales from the real sample; extracting N image blocks of the same scales from the same positions of the generated sample; feeding each pair of image blocks of the same scale to the discriminator of that scale; and concatenating the outputs of the per-scale discriminators as the output of the discriminative model.
The technical solution adopted by the invention further provides a method for generating a face image, the method comprising: acquiring a mouth image; extracting a face edge image of the face in a template image, and stitching the face edge image with the mouth image to obtain an input image; and inputting the input image into a face generation model and obtaining a face image from the model's output.
According to a preferred embodiment of the present invention, acquiring a mouth image comprises: acquiring a text; converting the text into speech; and generating a mouth image based on the converted speech.
According to a preferred embodiment of the invention, the method further comprises: extracting an eye image and an eyebrow image of the face in the template image; and stitching the mouth image, the eye image, the eyebrow image, and the face edge image to obtain the input image.
The technical solution adopted by the invention further provides an apparatus for establishing a face generation model, the apparatus comprising: a first acquisition unit configured to acquire face images; a first stitching unit configured to extract an image of a preset part and a face edge image from each face image and stitch the extracted images into a stitched image corresponding to that face image, wherein the image of the preset part is a mouth image; a construction unit configured to construct a generative adversarial network comprising a generative model and a discriminative model; and a training unit configured to train the generative adversarial network on the face images and their corresponding stitched images and to take the generative model in the trained network as the face generation model.
According to a preferred embodiment of the present invention, after acquiring the face images, the first acquisition unit further: acquires the resolution of each face image and filters out face images whose resolution is below a preset threshold.
According to a preferred embodiment of the present invention, the image of the preset part further includes an eye image and an eyebrow image; the face edge image is an image with the mouth, nose, and chin of the face removed.
According to a preferred embodiment of the present invention, when constructing the generative adversarial network comprising a generative model and a discriminative model, the construction unit specifically: combines N discriminators into the discriminative model, where the input of each discriminator corresponds to image blocks of a different scale and N is a positive integer greater than or equal to 2.
According to a preferred embodiment of the present invention, when training the generative adversarial network on the face images and their corresponding stitched images, the training unit specifically: takes each face image as a real sample; inputs the stitched image into the generative model and takes its output as a generated sample; feeds the real sample and its corresponding generated sample to the discriminative model and derives the loss functions of the discriminative and generative models from the discriminative model's output; and adjusts parameters in the network structures of the generative and discriminative models according to these loss functions until the generative adversarial network converges.
According to a preferred embodiment of the present invention, when taking the real sample and its corresponding generated sample as the input of the discriminative model, the training unit specifically: extracts N image blocks of different scales from the real sample; extracts N image blocks of the same scales from the same positions of the generated sample; feeds each pair of image blocks of the same scale to the discriminator of that scale; and concatenates the outputs of the per-scale discriminators as the output of the discriminative model.
The technical solution adopted by the invention further provides an apparatus for generating a face image, the apparatus comprising: a second acquisition unit configured to acquire a mouth image; a second stitching unit configured to extract a face edge image of the face in a template image and stitch the face edge image with the mouth image to obtain an input image; and a processing unit configured to input the input image into a face generation model and obtain a face image from the model's output.
According to a preferred embodiment of the present invention, when acquiring the mouth image, the second acquisition unit specifically: acquires a text; converts the text into speech; and generates a mouth image based on the converted speech.
According to a preferred embodiment of the present invention, the second stitching unit is further configured to: extract an eye image and an eyebrow image of the face in the template image; and stitch the mouth image, the eye image, the eyebrow image, and the face edge image to obtain the input image.
As can be seen from the above technical solution, the method and apparatus train the generative adversarial network on stitched images obtained from the images of preset parts and the face edge images extracted from the face images. This fully accounts for the fact that differences in these facial parts during speech affect the face image, so the generative model in the trained network can generate clearer and more realistic face images.
[ description of the drawings ]
Fig. 1 is a flowchart of a method for establishing a face generation model according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for generating a face image according to an embodiment of the present invention;
Fig. 3 is a structural diagram of an apparatus for establishing a face generation model according to an embodiment of the present invention;
Fig. 4 is a structural diagram of an apparatus for generating a face image according to an embodiment of the present invention;
Fig. 5 is a block diagram of a computer system/server according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association between related objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the related objects before and after it are in an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," "in response to determining," or "in response to detecting." Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined," "in response to determining," "when (a stated condition or event) is detected," or "in response to detecting (a stated condition or event)."
Fig. 1 is a flowchart of a method for establishing a face generation model according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
In 101, face images are acquired.
In this step, face images are acquired for training the generative adversarial network that establishes the face generation model. Face images may be collected from the Internet by a web crawler, or extracted from the individual frames of a given video. The manner of acquiring face images is not limited here.
In addition, to enable the established face generation model to generate high-definition face images, the following may further be performed after the face images are acquired: acquiring the resolution of each face image, and filtering out face images whose resolution is below a preset threshold. That is, lower-resolution face images are discarded in this step so that the face generation model is built from sharper face images.
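For illustration, a minimal sketch of this resolution filter (the threshold value, file layout, and use of Pillow are assumptions; the patent only specifies filtering by a preset threshold):

```python
from pathlib import Path

from PIL import Image  # pip install pillow

MIN_SIDE = 512  # assumed preset threshold in pixels

def filter_low_resolution(image_dir: str) -> list[Path]:
    """Keep only face images whose shorter side meets the preset threshold."""
    kept = []
    for path in Path(image_dir).glob("*.jpg"):
        with Image.open(path) as img:
            width, height = img.size
        # Discard blurrier, low-resolution faces so training uses sharp images.
        if min(width, height) >= MIN_SIDE:
            kept.append(path)
    return kept
```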
At 102, an image of a preset part and a face edge image are extracted from each face image, and the extracted images are stitched into a stitched image corresponding to that face image, wherein the image of the preset part is a mouth image.
In this step, an image of the preset part and a face edge image are extracted from each face image acquired in step 101, and the extracted images are stitched into the stitched image corresponding to that face image. The image of the preset part is the mouth image in the face image, and may further include the eye image and the eyebrow image.
Since the lower half of the face changes with the mouth shape as the user speaks, the face edge image in this step is an image with the mouth, nose, and chin removed from the face image.
In this step, a face keypoint detection technique may be used to extract the eye, eyebrow, and mouth images from the face image, and an edge detection technique may be used to obtain a face edge image with the mouth, nose, and chin removed. Both extractions use existing techniques and are not detailed further here.
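As a sketch of this extraction step (the patent names no specific detector; dlib's 68-point landmark model and a Canny edge detector are assumed stand-ins, and the landmark model file is the one distributed with dlib's examples):

```python
import cv2    # pip install opencv-python
import dlib   # pip install dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_parts_and_edges(bgr: np.ndarray):
    """Return the mouth crop and a face edge map with mouth/nose/chin removed."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    face = detector(gray, 1)[0]  # assume one face per training image
    pts = np.array([(p.x, p.y) for p in predictor(gray, face).parts()],
                   dtype=np.int32)

    # 68-point layout: jaw/chin 0-16, nose 27-35, mouth 48-67.
    mouth_pts = pts[48:68]
    x, y, w, h = cv2.boundingRect(mouth_pts)
    mouth_crop = bgr[y:y + h, x:x + w]

    edges = cv2.Canny(gray, 100, 200)  # edge map of the whole face
    # Blank out the lower-face region so the edge image omits mouth/nose/chin.
    lower_face = np.concatenate([pts[3:14], pts[27:36], pts[48:68]])
    cv2.fillConvexPoly(edges, cv2.convexHull(lower_face), 0)
    return mouth_crop, edges
```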
At 103, a generative adversarial network comprising a generative model and a discriminative model is constructed.
In this step, a generative adversarial network comprising a generative model and a discriminative model is constructed, so that once its training is complete, the face generation model for producing high-definition face images is obtained from the generative model in the trained network.
In the generative adversarial network constructed in this step, the role of the generative model is to produce generated samples as similar as possible to the real samples, while the role of the discriminative model is to distinguish real samples from generated samples as reliably as possible. The network is trained through an adversarial game between the two models, so that the generated samples output by the generative model become realistic enough that the discriminative model cannot tell whether a given input is a generated sample or a real sample.
In general, the discriminative model of a generative adversarial network contains only one discriminator, so the discriminative model of the related art cannot attend to both the texture details and the overall quality of the input image. Therefore, to let the discriminative model account for both, the discriminative model constructed in this step is composed of N discriminators, where the input of each discriminator corresponds to image blocks of a different scale in the face image and N is a positive integer greater than or equal to 2. Discriminators of small scales attend more to the texture details of the image, while discriminators of large scales attend more to its overall quality.
For example, the discriminative model constructed in this step may include 3 discriminators, namely discriminator 1, discriminator 2, and discriminator 3, where the input of discriminator 1 may be image blocks of 32 × 32 pixels from the face image, the input of discriminator 2 may be image blocks of 64 × 64 pixels, and the input of discriminator 3 may be image blocks of 128 × 128 pixels.
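A minimal PyTorch sketch of such a combined discriminative model, assuming PatchGAN-style convolutional discriminators and the 32/64/128-pixel scales of the example above (the patent does not specify the per-discriminator architecture):

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """A small conv discriminator; one copy is built per input scale."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.InstanceNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # patch-wise real/fake map
        )

    def forward(self, x):
        return self.net(x)

class MultiScaleDiscriminator(nn.Module):
    """Discriminative model combining N discriminators, one per patch scale."""
    def __init__(self, scales=(32, 64, 128)):
        super().__init__()
        self.scales = scales
        self.discriminators = nn.ModuleList(PatchDiscriminator() for _ in scales)

    def forward(self, patches):
        # `patches` is a list of N crops, one per scale, from the same image.
        outputs = [d(p) for d, p in zip(self.discriminators, patches)]
        # Concatenate the flattened per-scale outputs as the model's output.
        return torch.cat([o.flatten(1) for o in outputs], dim=1)
```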
At 104, the generative adversarial network is trained on the face images and their corresponding stitched images, and the face generation model is obtained from the generative model in the trained network.
The generative adversarial network composed of the generative model and the discriminative model is trained by alternating updates. Training is considered complete when the whole network converges, and the generative model in the trained network is then taken as the face generation model; input data can be fed through this face generation model to obtain the corresponding high-definition face image.
Specifically, when training the generative adversarial network on the face images and their corresponding stitched images, the following approach may be adopted: take each acquired face image as a real sample; input the corresponding stitched image into the generative model and take its output as a generated sample; feed the real sample and its corresponding generated sample into the discriminative model and derive the loss functions of the discriminative and generative models from the discriminative model's output; and adjust the parameters in the network structures of both models according to these loss functions until the network converges.
It can be understood that if the constructed discriminative model includes N discriminators, the real sample and its corresponding generated sample may be input into the discriminative model as follows: extract N image blocks of different scales from the real sample; extract N image blocks of the same scales from the same positions of the generated sample (for example, a 32 × 32 block from the upper-left corner of the real sample and a 32 × 32 block from the upper-left corner of the generated sample); feed each pair of image blocks of the same scale to the discriminator of that scale; and concatenate the outputs of the per-scale discriminators as the output of the discriminative model.
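The alternating update with same-position, per-scale patches might look as follows (a sketch under assumptions: binary cross-entropy adversarial losses and one random crop position per scale, neither of which the patent fixes):

```python
import torch
import torch.nn.functional as F

def paired_crops(real, fake, scales=(32, 64, 128)):
    """Cut one patch per scale from the same random position of both images."""
    real_p, fake_p = [], []
    _, _, h, w = real.shape
    for s in scales:
        top = torch.randint(0, h - s + 1, (1,)).item()
        left = torch.randint(0, w - s + 1, (1,)).item()
        real_p.append(real[:, :, top:top + s, left:left + s])
        fake_p.append(fake[:, :, top:top + s, left:left + s])
    return real_p, fake_p

def train_step(gen, disc, opt_g, opt_d, stitched, face):
    # Discriminator update: real faces vs. generator output on stitched input.
    fake = gen(stitched).detach()
    real_p, fake_p = paired_crops(face, fake)
    real_out, fake_out = disc(real_p), disc(fake_p)
    d_loss = (F.binary_cross_entropy_with_logits(real_out, torch.ones_like(real_out))
              + F.binary_cross_entropy_with_logits(fake_out, torch.zeros_like(fake_out)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: make the discriminator label generated samples as real.
    fake = gen(stitched)
    _, fake_p = paired_crops(face, fake)
    g_out = disc(fake_p)
    g_loss = F.binary_cross_entropy_with_logits(g_out, torch.ones_like(g_out))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Here `disc` is the MultiScaleDiscriminator sketched above, which accepts the list of per-scale patches.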
Convergence of the generative adversarial network in this step means the minimization of the loss functions of the generative and discriminative models. Optionally, in a specific implementation of this embodiment, the loss function may be considered minimized if the loss values obtained over a preset number of iterations are equal; it may also be considered minimized if the differences between the loss values obtained over a preset number of iterations are less than or equal to a preset threshold; or it may be considered minimized once the number of training iterations exceeds a preset count.
When the loss functions of the generative model and the discriminative model are minimized, that is, when the generative adversarial network converges, training is considered complete, and the generative model in the trained network is taken as the face generation model.
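A simple convergence check along these lines (the window size, tolerance, and step budget are assumed values for the patent's "preset number" and "preset threshold"):

```python
def has_converged(losses: list[float], window: int = 100,
                  eps: float = 1e-4, max_steps: int = 200_000) -> bool:
    """Plateau- or budget-based stand-in for GAN convergence detection."""
    if len(losses) >= max_steps:          # preset iteration budget reached
        return True
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) <= eps  # losses stopped changing
```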
Fig. 2 is a flowchart of a method for generating a face image according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
In 201, a mouth image is acquired.
In this step, a mouth image is acquired to serve as part of the input to the face generation model that will produce the face image.
Specifically, this step may acquire the mouth image as follows: acquire a text, where the acquired text may be a single Chinese character or a single letter and different characters correspond to different mouth shapes; convert the acquired text into speech; and generate a mouth image based on the converted speech. The mouth image may also be obtained from a preset image sequence, whose images may be mouth images themselves or images containing a mouth.
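Purely for illustration, a hypothetical character-to-mouth-shape lookup in the spirit of this step (the viseme table and image paths below are invented stand-ins; the patent only states that text is converted to speech and a mouth image is generated from it, or taken from a preset image sequence):

```python
# Hypothetical viseme table mapping characters to preset mouth images.
VISEME_OF_CHAR = {"a": "open_wide", "o": "round", "m": "closed", "e": "spread"}
MOUTH_IMAGE_OF_VISEME = {
    "open_wide": "mouths/open_wide.png",
    "round": "mouths/round.png",
    "closed": "mouths/closed.png",
    "spread": "mouths/spread.png",
}

def mouth_images_for_text(text: str) -> list[str]:
    """Map each character of the input text to a preset mouth-image path."""
    paths = []
    for ch in text.lower():
        viseme = VISEME_OF_CHAR.get(ch, "closed")  # default: closed mouth
        paths.append(MOUTH_IMAGE_OF_VISEME[viseme])
    return paths
```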
In 202, a face edge image of the face in a template image is extracted, and the face edge image and the mouth image are stitched to obtain an input image.
In this step, a face edge image of the face in the template image is extracted and stitched with the mouth image acquired in step 201, and the stitched result is taken as the input image.
It can be understood that when extracting the face edge image from the template image, the eye image and the eyebrow image of the face may also be extracted; the extracted eye image, eyebrow image, face edge image, and mouth image are then stitched, and the stitched result is taken as the input image.
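One way to realize the stitching (a sketch; the patent does not fix a stitching scheme, so pasting the mouth at its landmark-derived location and stacking it with the edge map as channels is an assumption):

```python
import numpy as np

def build_input(edge_map: np.ndarray, mouth_crop: np.ndarray,
                mouth_box: tuple[int, int, int, int]) -> np.ndarray:
    """Stitch the template's face edge image with a mouth crop into one input."""
    x, y, w, h = mouth_box                    # where the mouth sits in the face
    canvas = np.zeros((*edge_map.shape[:2], 3), dtype=np.uint8)
    canvas[y:y + h, x:x + w] = mouth_crop     # paste the mouth in place
    # Channel-wise concatenation: 1 edge channel + 3 mouth channels.
    return np.concatenate([edge_map[..., None], canvas], axis=-1)
```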
In 203, the input image is input into the face generation model obtained by pre-training, and a face image is obtained from the output of the model.
In this step, the input image obtained in step 202 is fed into the pre-trained face generation model, and the face image is obtained from the model's output.
It can be understood that if a plurality of mouth images were acquired in step 201, the following may be performed after the corresponding face images are obtained: combine the face images in a preset order, for example the order of the images in the preset image sequence or the character order of the input text, to obtain a face image sequence; acquire the speech corresponding to each mouth image to obtain a speech sequence; and superpose the speech sequence and the face image sequence synchronously to obtain virtual video data. That is, after high-definition face images are acquired in this step, virtual video data with a high-definition visual effect can further be obtained.
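A sketch of assembling the generated frames into video (writing the silent track with OpenCV; muxing the speech sequence onto it would use an external tool such as ffmpeg, which the patent leaves unspecified):

```python
import cv2
import numpy as np

def frames_to_video(frames: list[np.ndarray],
                    out_path: str = "virtual_face.mp4", fps: int = 25) -> None:
    """Write the generated face image sequence to a video file."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (w, h))
    for frame in frames:
        writer.write(frame.astype(np.uint8))  # frames in BGR, uint8
    writer.release()
```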
Fig. 3 is a structural diagram of an apparatus for establishing a face generation model according to an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a first acquisition unit 31, a first stitching unit 32, a construction unit 33, and a training unit 34.
The first acquisition unit 31 is configured to acquire face images.
The first acquisition unit 31 acquires face images for training the generative adversarial network that establishes the face generation model. It may collect face images from the Internet by a web crawler, or extract them from the individual frames of a given video. The manner of acquiring face images is not limited here.
In addition, to enable the established face generation model to generate high-definition face images, the first acquisition unit 31 may further perform the following after acquiring the face images: acquiring the resolution of each face image, and filtering out face images whose resolution is below a preset threshold. That is, the first acquisition unit 31 discards lower-resolution face images so that the face generation model is built from sharper face images.
The first stitching unit 32 is configured to extract an image of a preset part and a face edge image from each face image, and to stitch the extracted images into a stitched image corresponding to that face image, wherein the image of the preset part is a mouth image.
The first stitching unit 32 extracts an image of the preset part and a face edge image from each face image acquired by the first acquisition unit 31, and stitches them into the stitched image corresponding to that face image. The image of the preset part extracted by the first stitching unit 32 is the mouth image in the face image, and may further include the eye image and the eyebrow image.
Since the lower half of the face changes with the mouth shape as the user speaks, the face edge image used by the first stitching unit 32 is an image with the mouth, nose, and chin removed from the face image.
The first stitching unit 32 may extract the eye, eyebrow, and mouth images from the face image using a face keypoint detection technique, and may obtain the face edge image with the mouth, nose, and chin removed using an edge detection technique.
The construction unit 33 is configured to construct a generative adversarial network comprising a generative model and a discriminative model.
The construction unit 33 constructs a generative adversarial network comprising a generative model and a discriminative model, so that once training of the network is complete, the face generation model for producing high-definition face images is obtained from the generative model in the trained network.
In the generative adversarial network constructed by the construction unit 33, the generative model is responsible for producing generated samples as similar as possible to the real samples, while the discriminative model is responsible for distinguishing real samples from generated samples as reliably as possible. The network is trained through an adversarial game between the two models, so that the generated samples output by the generative model become realistic enough that the discriminative model cannot tell whether a given input is a generated sample or a real sample.
In general, the discriminative model of a generative adversarial network contains only one discriminator, so the discriminative model of the related art cannot attend to both the texture details and the overall quality of the input image. Therefore, to let the discriminative model account for both, the discriminative model in the network constructed by the construction unit 33 is composed of N discriminators, where the input of each discriminator corresponds to image blocks of a different scale in the face image and N is a positive integer greater than or equal to 2. Discriminators of small scales attend more to the texture details of the image, while discriminators of large scales attend more to its overall quality.
The training unit 34 is configured to train the generative adversarial network on the face images and their corresponding stitched images, and to obtain the face generation model from the generative model in the trained network.
The training unit 34 trains the generative adversarial network composed of the generative model and the discriminative model by alternating updates. Training is considered complete when the whole network converges, and the generative model in the trained network is then taken as the face generation model; input data can be fed through this face generation model to obtain the corresponding high-definition face image.
Specifically, when training the network on the face images and their corresponding stitched images, the training unit 34 may adopt the following approach: take each acquired face image as a real sample; input the corresponding stitched image into the generative model and take its output as a generated sample; feed the real sample and its corresponding generated sample into the discriminative model and derive the loss functions of the discriminative and generative models from the discriminative model's output; and adjust the parameters in the network structures of both models according to these loss functions until the network converges.
It can be understood that if the constructed discriminative model includes N discriminators, the training unit 34 may input the real sample and its corresponding generated sample into the discriminative model as follows: extract N image blocks of different scales from the real sample; extract N image blocks of the same scales from the same positions of the generated sample; feed each pair of image blocks of the same scale to the discriminator of that scale; and concatenate the outputs of the per-scale discriminators as the output of the discriminative model.
For the training unit 34, convergence of the generative adversarial network means the minimization of the loss functions of the generative and discriminative models. Optionally, the loss function may be considered minimized if the loss values obtained over a preset number of iterations are equal, if the differences between them are less than or equal to a preset threshold, or if the number of training iterations exceeds a preset count.
When the loss functions of the generative model and the discriminative model are minimized, that is, when the generative adversarial network converges, training is considered complete, and the generative model in the trained network is taken as the face generation model.
Fig. 4 is a structural diagram of an apparatus for generating a face image according to an embodiment of the present invention. As shown in Fig. 4, the apparatus includes: a second acquisition unit 41, a second stitching unit 42, and a processing unit 43.
The second acquisition unit 41 is configured to acquire a mouth image.
The second acquisition unit 41 acquires a mouth image, which serves as part of the input to the face generation model that produces the face image.
Specifically, the second acquisition unit 41 may acquire the mouth image as follows: acquire a text, where the acquired text may be a single Chinese character or a single letter and different characters correspond to different mouth shapes; convert the acquired text into speech; and generate a mouth image based on the converted speech. The second acquisition unit 41 may also obtain the mouth image from a preset image sequence, whose images may be mouth images themselves or images containing a mouth.
The second stitching unit 42 is configured to extract a face edge image of the face in a template image, and to stitch the face edge image with the mouth image to obtain an input image.
The second stitching unit 42 extracts a face edge image of the face in the template image and stitches it with the mouth image acquired by the second acquisition unit 41, taking the stitched result as the input image.
It can be understood that when extracting the face edge image from the template image, the second stitching unit 42 may also extract the eye image and the eyebrow image of the face, stitch the extracted eye image, eyebrow image, face edge image, and mouth image, and take the stitched result as the input image.
The processing unit 43 is configured to input the input image into the face generation model obtained by pre-training, and to obtain a face image from the output of the model.
The processing unit 43 feeds the input image obtained by the second stitching unit 42 into the pre-trained face generation model and obtains the face image from the model's output.
It can be understood that if the second acquisition unit 41 acquires a plurality of mouth images, the processing unit 43 may further perform the following after obtaining the corresponding face images: combine the face images in a preset order, for example the order of the images in the preset image sequence or the character order of the input text, to obtain a face image sequence; acquire the speech corresponding to each mouth image to obtain a speech sequence; and superpose the speech sequence and the face image sequence synchronously to obtain virtual video data. That is, after acquiring high-definition face images, the processing unit 43 can further obtain virtual video data with a high-definition visual effect.
As shown in Fig. 5, the computer system/server 012 is embodied as a general-purpose computing device. The components of the computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 that couples the various system components, including the system memory 028 and the processing unit 016.
Bus 018 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 012 typically includes a variety of computer-system-readable media. Such media may be any available media accessible by the computer system/server 012, and include both volatile and nonvolatile media, removable and non-removable.
The system memory 028 can include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/nonvolatile computer-system storage media. By way of example only, the storage system 034 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in Fig. 5, commonly referred to as a "hard drive"). Although not shown in Fig. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be connected to bus 018 via one or more data media interfaces. The memory 028 can include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present invention.
A program/utility 040 having a set (at least one) of program modules 042 can be stored, for example, in the memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each of which, or some combination of which, might include an implementation of a network environment. The program modules 042 generally perform the functions and/or methodologies of the embodiments of the present invention described herein.
The computer system/server 012 may also communicate with one or more external devices 014 (e.g., a keyboard, a pointing device, a display 024, etc.). In the present invention, the computer system/server 012 communicates with an external radar device, and may also communicate with one or more devices that enable a user to interact with the computer system/server 012, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer system/server 012 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 022. Moreover, the computer system/server 012 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 020. As shown, the network adapter 020 communicates with the other modules of the computer system/server 012 via bus 018. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 016 executes the programs stored in the system memory 028, thereby performing various functional applications and data processing, such as implementing the method flows provided by the embodiments of the present invention.
With the development of technology over time, the meaning of "media" has broadened, and the propagation path of a computer program is no longer limited to tangible media; it may, for example, be downloaded directly from a network. Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
By means of the technical solution provided by the invention, the generative adversarial network is trained on stitched images obtained from the images of preset parts and the face edge images extracted from the face images. This fully accounts for the fact that differences in these facial parts during speech affect the face image, so the generative model in the trained network can generate clearer and more realistic face images.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (20)

1. A method of establishing a face generation model, the method comprising:
acquiring face images;
extracting an image of a preset part and a face edge image from each face image, and stitching the extracted images into a stitched image corresponding to that face image, wherein the image of the preset part is a mouth image;
constructing a generative adversarial network comprising a generative model and a discriminative model;
and training the generative adversarial network on the face images and their corresponding stitched images, and obtaining the face generation model from the generative model in the trained generative adversarial network.
2. The method of claim 1, further comprising, after acquiring the face images:
acquiring the resolution of each face image, and filtering out face images whose resolution is below a preset threshold.
3. The method of claim 1, wherein the image of the preset part further includes an eye image and an eyebrow image;
and wherein the face edge image is an image with the mouth, nose, and chin of the face removed.
4. The method of claim 1, wherein constructing a generative adversarial network comprising a generative model and a discriminative model comprises:
combining N discriminators into the discriminative model, wherein the input of each discriminator corresponds to image blocks of a different scale, and N is a positive integer greater than or equal to 2.
5. The method of claim 4, wherein training the generative adversarial network on the face images and their corresponding stitched images comprises:
taking each face image as a real sample;
inputting the stitched image into the generative model, and taking the output of the generative model as a generated sample;
taking the real sample and its corresponding generated sample as the input of the discriminative model, and obtaining the loss functions of the discriminative model and the generative model from the output of the discriminative model;
and adjusting parameters in the network structures of the generative model and the discriminative model according to the loss functions until the generative adversarial network converges.
6. The method of claim 5, wherein taking the real sample and its corresponding generated sample as the input of the discriminative model comprises:
extracting N image blocks of different scales from the real sample;
extracting N image blocks of the same scales from the same positions of the generated sample;
and feeding each pair of image blocks of the same scale to the discriminator of the corresponding scale, and concatenating the outputs of the per-scale discriminators as the output of the discriminative model.
7. A method of generating a face image, the method comprising:
acquiring a mouth image;
extracting a face edge image of the face in a template image, and stitching the face edge image with the mouth image to obtain an input image;
and inputting the input image into a face generation model, and obtaining a face image from the output of the face generation model;
wherein the face generation model is pre-established according to the method of any one of claims 1 to 6.
8. The method of claim 7, wherein acquiring a mouth image comprises:
acquiring a text;
converting the text into speech, and generating a mouth image based on the converted speech.
9. The method of claim 7, further comprising:
extracting an eye image and an eyebrow image of the face in the template image;
and stitching the mouth image, the eye image, the eyebrow image, and the face edge image to obtain the input image.
10. An apparatus for establishing a face generation model, the apparatus comprising:
a first acquisition unit configured to acquire face images;
a first stitching unit configured to extract an image of a preset part and a face edge image from each face image, and to stitch the extracted images into a stitched image corresponding to that face image, wherein the image of the preset part is a mouth image;
a construction unit configured to construct a generative adversarial network comprising a generative model and a discriminative model;
and a training unit configured to train the generative adversarial network on the face images and their corresponding stitched images, and to obtain the face generation model from the generative model in the trained generative adversarial network.
11. The apparatus of claim 10, wherein the first acquisition unit further performs, after acquiring the face images:
acquiring the resolution of each face image, and filtering out face images whose resolution is below a preset threshold.
12. The apparatus of claim 10, wherein the image of the preset part further includes an eye image and an eyebrow image;
and wherein the face edge image is an image with the mouth, nose, and chin of the face removed.
13. The apparatus of claim 10, wherein, when constructing the generative adversarial network comprising a generative model and a discriminative model, the construction unit specifically performs:
combining N discriminators into the discriminative model, wherein the input of each discriminator corresponds to image blocks of a different scale, and N is a positive integer greater than or equal to 2.
14. The apparatus of claim 13, wherein, when training the generative adversarial network on the face images and their corresponding stitched images, the training unit specifically performs:
taking each face image as a real sample;
inputting the stitched image into the generative model, and taking the output of the generative model as a generated sample;
taking the real sample and its corresponding generated sample as the input of the discriminative model, and obtaining the loss functions of the discriminative model and the generative model from the output of the discriminative model;
and adjusting parameters in the network structures of the generative model and the discriminative model according to the loss functions until the generative adversarial network converges.
15. The apparatus of claim 14, wherein, when taking the real sample and its corresponding generated sample as the input of the discriminative model, the training unit specifically performs:
extracting N image blocks of different scales from the real sample;
extracting N image blocks of the same scales from the same positions of the generated sample;
and feeding each pair of image blocks of the same scale to the discriminator of the corresponding scale, and concatenating the outputs of the per-scale discriminators as the output of the discriminative model.
16. An apparatus for generating a face image, the apparatus comprising:
a second acquisition unit configured to acquire a mouth image;
a second stitching unit configured to extract a face edge image of the face in a template image, and to stitch the face edge image and the mouth image into an input image;
and a processing unit configured to input the input image into a face generation model and to obtain a face image from the output of the face generation model;
wherein the face generation model is established in advance by the apparatus according to any one of claims 10 to 15.
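[Illustrative sketch, not part of the claims. At generation time the flow reduces to stitch-then-forward-pass; face_edge_image is the hypothetical helper sketched after claim 12, and the tensor layout conversion assumes an H x W x C uint8 input.]

    import numpy as np
    import torch

    def generate_face(model, template, mouth, region_polygons):
        # Overlay the new mouth on the template's face edge image, then
        # run one forward pass of the trained face generation model.
        edge = face_edge_image(template, region_polygons)
        stitched = np.where(mouth.sum(-1, keepdims=True) > 0, mouth, edge)
        x = torch.from_numpy(stitched).permute(2, 0, 1).float().unsqueeze(0)
        with torch.no_grad():
            return model(x)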
17. The apparatus according to claim 16, wherein, when acquiring the mouth image, the second acquisition unit specifically:
acquires a text;
and converts the text into speech and generates the mouth image based on the converted speech.
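[Illustrative sketch, not part of the claims. This unit chains two conversions, and the sketch only shows the shape of that pipeline: tts_engine and mouth_model are hypothetical placeholders for any text-to-speech backend and any speech-to-mouth-image model, with invented method names.]

    def mouth_images_from_text(text, tts_engine, mouth_model):
        # text -> waveform -> one mouth image per audio frame
        waveform = tts_engine.synthesize(text)        # hypothetical API
        return [mouth_model.render(frame)             # hypothetical API
                for frame in waveform.frames()]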
18. The apparatus according to claim 16, wherein the second stitching unit is further configured to:
extract an eye image and an eyebrow image of the face in the template image;
and stitch the mouth image, the eye image, the eyebrow image and the face edge image to obtain the input image.
19. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1 to 9.
20. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the method according to any one of claims 1 to 9.
CN201910556085.4A 2019-06-25 2019-06-25 Method and device for establishing face generation model and generating face image Active CN112132912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910556085.4A CN112132912B (en) 2019-06-25 2019-06-25 Method and device for establishing face generation model and generating face image

Publications (2)

Publication Number Publication Date
CN112132912A 2020-12-25
CN112132912B 2024-02-13

Family

ID=73849756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910556085.4A Active CN112132912B (en) 2019-06-25 2019-06-25 Method and device for establishing face generation model and generating face image

Country Status (1)

Country Link
CN (1) CN112132912B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115155058A (en) * 2022-09-06 2022-10-11 北京澜舟科技有限公司 Face pinching method, face pinching system and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609481A (en) * 2017-08-14 2018-01-19 百度在线网络技术(北京)有限公司 The method, apparatus and computer-readable storage medium of training data are generated for recognition of face
US20190050632A1 (en) * 2017-08-14 2019-02-14 Baidu Online Network Technology (Beijing) Co., Ltd . Method and apparatus for generating training data for human face recognition, device and computer storage medium
US20190114748A1 (en) * 2017-10-16 2019-04-18 Adobe Systems Incorporated Digital Image Completion Using Deep Learning
CN108062546A (en) * 2018-02-11 2018-05-22 厦门华厦学院 A kind of computer face Emotion identification system
CN108491775A (en) * 2018-03-12 2018-09-04 维沃移动通信有限公司 A kind of image correcting method and mobile terminal
CN109635745A (en) * 2018-12-13 2019-04-16 广东工业大学 A method of Multi-angle human face image is generated based on confrontation network model is generated
CN109886873A (en) * 2019-01-22 2019-06-14 华中科技大学 A kind of simulated portrait generation method and device based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yao Naiming; Guo Qingpei; Qiao Fengchun; Chen Hui; Wang Hong'an: "Robust Facial Expression Recognition Based on Generative Adversarial Networks", Acta Automatica Sinica, no. 05 *

Also Published As

Publication number Publication date
CN112132912B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
EP3885965B1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN107563283B (en) Method, device, equipment and storage medium for generating attack sample
CN108091328A (en) Speech recognition error correction method, device and readable medium based on artificial intelligence
CN110798636B (en) Subtitle generating method and device and electronic equipment
CN112541957B (en) Animation generation method, device, electronic equipment and computer readable medium
CN107609463B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN110174942B (en) Eye movement synthesis method and device
CN113870395A (en) Animation video generation method, device, equipment and storage medium
US20210118232A1 (en) Method and System for Translating Air Writing To An Augmented Reality Device
CN114187624A (en) Image generation method, image generation device, electronic equipment and storage medium
CN110619334A (en) Portrait segmentation method based on deep learning, architecture and related device
CN109784128A (en) Mixed reality intelligent glasses with text and language process function
CN110188303A (en) Page fault recognition methods and device
US20220292690A1 (en) Data generation method, data generation apparatus, model generation method, model generation apparatus, and program
CN114255737B (en) Voice generation method and device and electronic equipment
CN115049016A (en) Model driving method and device based on emotion recognition
CN117152363A (en) Three-dimensional content generation method, device and equipment based on pre-training language model
AU2015259120A1 (en) Detecting conformance of graphical output data from an application to a convention
CN114049290A (en) Image processing method, device, equipment and storage medium
CN112132912B (en) Method and device for establishing face generation model and generating face image
CN112328088A (en) Image presenting method and device
CN109461203B (en) Gesture three-dimensional image generation method and device, computer equipment and storage medium
CN116703797A (en) Image fusion method, image fusion system, computer device and storage medium
Trujillo-Romero et al. Mexican Sign Language corpus: Towards an automatic translator
CN109857244B (en) Gesture recognition method and device, terminal equipment, storage medium and VR glasses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant