CN115810215A - Face image generation method, device, equipment and storage medium - Google Patents

Face image generation method, device, equipment and storage medium

Info

Publication number
CN115810215A
Authority
CN
China
Prior art keywords
face
image
model
map
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310083541.4A
Other languages
Chinese (zh)
Inventor
左童春
周良
何山
胡金水
刘聪
殷兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202310083541.4A
Publication of CN115810215A
Legal status: Pending

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application provides a face image generation method, device, equipment and storage medium, and relates to the technical field of neural networks. The face image generation method includes: acquiring face material data, wherein the face material data comprises at least one of a face wireframe image, a face mask image, a face description text and a face reference image; and inputting the face material data into a pre-trained face generation model to obtain a target face image that is generated by the face generation model and matches the face material data. Because one or more types of face material data serve as the input of the face generation model, the user can express the requirements on the target face image with face material data of different modalities, which reduces the operation difficulty for the user and improves the generation efficiency and accuracy of the target face image.

Description

Face image generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of neural network technology, and in particular, to a method, an apparatus, a device, and a storage medium for generating a facial image.
Background
With the rapid development of virtual human technology, users' requirements on the realism, convenience and accuracy of virtual human creation have also risen remarkably, especially the requirements on the realism of the virtual human's face.
In the related art, a face image is generated by a 3D (3 Dimensions) face generation technique that takes an image provided by a user as a reference, or by a 2D (2 Dimensions) face generation technique that takes a text description provided by the user as a reference. Generating a face image from such a single modality makes it difficult to capture the user's demand truly and comprehensively, so the generated face image usually differs greatly from the target face image the user demands.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. The embodiment of the application provides a face image generation method, a face image generation device, face image generation equipment and a storage medium.
In a first aspect, an embodiment of the present application provides a face image generation method, including: acquiring face material data, wherein the face material data comprises at least one of a face wireframe image, a face mask image, a face description text and a face reference image; the facial line frame graph comprises a contour marking line of a target facial region, and the facial mask graph is obtained by masking the facial region to be adjusted; inputting the face material data into a pre-trained face generation model to obtain a target face image which is generated by the face generation model and is matched with the face material data; the face generation model is obtained by performing face image generation training by taking a face wire frame image, a face mask image, a face description text and a face reference image as input.
With reference to the first aspect, in certain implementations of the first aspect, the face material data includes at least a face reference map, and further includes at least one of a face wireframe map, a face mask map, and a face description text; wherein the face mask image comprises a face mask image obtained by masking a first face region in the face reference image; the face description text comprises description text of a second face region in the face reference map; the face wire frame map includes a face wire frame map obtained by outlining a third face region in the face reference map and/or the face mask map with a wire frame.
With reference to the first aspect, in certain implementations of the first aspect, the face reference map includes a face image generated by the face generation model.
With reference to the first aspect, in some implementations of the first aspect, each type of face material data is encoded separately, so as to obtain an encoding feature corresponding to each type of face material data; performing feature fusion processing on coding features corresponding to each kind of face material data based on predetermined weight corresponding to each kind of face material data to obtain fusion features; and decoding to obtain a target face image based on the fusion characteristics.
With reference to the first aspect, in certain implementations of the first aspect, the face generation model includes a codec model and a diffusion model; the coding and decoding model comprises a coding model and a decoding model, and the coding model, the diffusion model and the decoding model are connected in series in sequence.
With reference to the first aspect, in certain implementations of the first aspect, the training process of the codec model includes: extracting a sample feature map of a first sample image by using a coding model, and calculating to obtain a first image generation loss between the sample feature map and specific data distribution based on the mean value and the variance of the sample feature map; generating a sampling feature map corresponding to the sample feature map based on the mean and variance of the sample feature map and a noise map; inputting the sampling feature map into a decoding model to obtain a decoded image, and calculating to obtain a second image generation loss based on the decoded image and the first sample image; performing parameter correction on the encoding model and the decoding model based on the first image generation loss and the second image generation loss.
With reference to the first aspect, in certain implementations of the first aspect, the noise map comprises a Gaussian noise map sampled from a standard normal distribution; and/or the second image generation loss comprises a mean square error (MSE) loss and/or a generative adversarial network (GAN) loss.
With reference to the first aspect, in some implementations of the first aspect, a second sample image is input to a coding model, so as to obtain a sampling feature map corresponding to the second sample image; the second sample image includes a sample face mask image obtained by performing face region mask processing on a sample face reference image; inputting the sampling feature map, a sample face wireframe map, a noise map and the text feature of the sample face description text into a diffusion model, and enabling the diffusion model to reconstruct an image through predicting image noise; and performing parameter correction on the diffusion model at least based on the noise prediction loss in the process of the diffusion model training.
With reference to the first aspect, in certain implementations of the first aspect, the text features of the sample face description text include sentence-wise features of the sample face description text and/or sentence-wise features of individual sentences of the sample face description text.
In a second aspect, an embodiment of the present application provides a face image generation apparatus, including: an acquisition module for acquiring face material data including at least one of a face wireframe map, a face mask map, a face description text, and a face reference map; the face wireframe map comprises a contour marking line of a target face area, and the face mask map is obtained by masking the face area to be adjusted; and a generation module for inputting the face material data into a pre-trained face generation model to obtain a target face image that is generated by the face generation model and matches the face material data; the face generation model is obtained by performing face image generation training with a face wireframe map, a face mask map, a face description text and a face reference map as input.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program for executing the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor; a memory for storing processor-executable instructions; the processor is configured to perform the method of the first aspect.
According to the face image generation method provided by the embodiment of the application, face material data is obtained, the face material data comprising at least one of a face wireframe map, a face mask map, a face description text and a face reference map, and the face material data is then input into a pre-trained face generation model to obtain a target face image that is generated by the face generation model and matches the face material data. Because one or more types of face material data serve as the input of the face generation model, the user can express the requirements on the target face image with face material data of different modalities, which reduces the operation difficulty for the user and improves the generation efficiency and accuracy of the target face image.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally indicate like parts or steps.
Fig. 1 is a schematic view of a scenario applicable to the embodiment of the present application.
Fig. 2a is a schematic flowchart illustrating a method for generating a face image according to an exemplary embodiment of the present application.
Fig. 2b is a face reference diagram provided in an exemplary embodiment of the present application.
Fig. 2c is a diagram of a face mask provided in an exemplary embodiment of the present application.
Fig. 2d is a facial wire frame diagram provided in an exemplary embodiment of the present application.
Fig. 2e is a schematic diagram illustrating generation of a target face image according to an exemplary embodiment of the present application.
Fig. 3 is a flowchart illustrating a training process of a codec model according to an exemplary embodiment of the present application.
Fig. 4 is a flowchart illustrating a training process of a diffusion model according to an exemplary embodiment of the present application.
Fig. 5 is a schematic structural diagram of a face image generation apparatus according to an exemplary embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
With the development of science and technology, virtual human technology has come into public view. Through virtual human technology, characters in a game can be modeled, and facial images meeting given requirements can be generated for related departments.
In one related art, a face image is generated by an automatic 3D face generation technology that works from a face photograph; modeling similarity is high only when an actual photograph is used, and the generated face image deviates greatly from the user's requirement when a hand-drawn image is input instead.
In another related art, a face image is generated by a 2D (2 Dimensions) face generation technique that takes a text description provided by the user as a reference, but one text description generally corresponds to a plurality of possible face images, so the generated face image cannot be guaranteed to meet the user's needs.
The face image generated by using the single modality as a reference is usually different from a target face image required by a user and cannot be adjusted in combination with other modality information. In order to be able to generate a face image that meets the needs of the user, the inventors have proposed a face image generation method through a series of studies. In an embodiment of the present application, the method may include acquiring face material data, the face material data including at least one of a face wireframe map, a face mask map, a face description text, and a face reference map; the facial line frame image comprises a contour marking line of a target facial area, and the facial mask image is obtained by masking the facial area to be adjusted; inputting the face material data into a pre-trained face generation model to obtain a target face image which is generated by the face generation model and is matched with the face material data; the face generation model is obtained by performing face image generation training by taking a face wire frame image, a face mask image, a face description text and a face reference image as input. One or more types of face material data are used as the input of the face generation model, so that the requirements of a user on a target face image can be expressed by using the face material data in different modes, the operation difficulty of the user is reduced, and the generation efficiency and accuracy of the target face image are improved.
The technical scheme of the embodiment of the application can be applied to face image generation, face pinching of game characters and the like in some related industries, and is not limited herein.
Fig. 1 is a schematic view of a scenario applicable to the embodiment of the present application. The scene comprises an input interface 101, face generation means 102, an image output 103.
Illustratively, the input interface 101 may include a text input box and an image input box for the user to input a demand for the target face image.
Illustratively, the input interface 101 may also be used to select and/or edit facial images. After receiving the facial image sent by the image output terminal 103, the user may select one of the facial images that meets the needs of the user as the target facial image, or may select one of the facial images that best meets the needs of the user for editing.
Illustratively, the face generating device 102 may generate a face image according to the demand for a target face image input by the user through the input interface 101, and send the face image to the image output terminal 103. The face generating device 102 is pre-trained.
Illustratively, the image output terminal 103 may be configured to receive the face images sent by the face generating device 102 and forward them to the input interface 101, so that the user can select a face image that meets his or her requirements as the target face image, or select the face image that best meets those requirements for editing.
In practical application, when a user needs to generate a target face image, the user may, according to his or her requirements, fill a face description text into the text input box of the input interface 101, such as "a rosy face" or "big, round black eyes", and/or add a preselected face reference image in the image input box of the input interface 101 by selecting the face image most similar to the requirement as the face reference image, or directly upload a face reference image such as a complete face image or a simple sketch, and then select the face area to be adjusted in the face reference image to obtain a face mask image. After the input is complete, the data is fed to the face generating device 102, which generates at least one face image and forwards it via the image output terminal 103 to the input interface 101 for the user to select or edit. The user may select a face image that meets his or her requirements as the target face image, or select the face image that best meets the requirements as a face reference image for editing; the edited face image is again sent to the image output terminal 103 and forwarded to the input interface 101 for selection or editing, until the user selects a face image that meets the requirements as the target face image. The above is only one of the cases in the process of generating the target face image in the embodiment of the present application; other cases will be described in detail in the following embodiments and are not repeated here.
Fig. 2a is a schematic flow chart of a face image generation method according to an exemplary embodiment of the present application. Illustratively, as shown in fig. 2a, the facial image generation method provided by the embodiment of the present application may include the following steps.
Step S201: face material data is acquired.
Illustratively, the face material data includes at least one of a face wire frame map, a face mask map, a face description text, and a face reference map.
Illustratively, the face description text may be text describing the target face image as a whole, such as "a lovely girl with big, round black eyes", or text describing a part of the target face image, such as "big, round black eyes".
Illustratively, fig. 2b is a face reference diagram provided in an exemplary embodiment of the present application. As shown in fig. 2b, the face reference map is a complete face image; like the face mask map and the face wire frame map, it may be directly input by the user, or it may be a face image generated by the face generation model. Specifically, the face generation model may simultaneously generate a plurality of face images, and the user selects one of them as the face reference map according to his or her needs.
Illustratively, fig. 2c shows a face mask map provided by an exemplary embodiment of the present application. As shown in fig. 2c, the face mask map may be obtained by masking the face area to be adjusted; in the example of fig. 2c, the face areas to be adjusted are the eye area and the hair area.
Illustratively, fig. 2d is a facial wire frame diagram provided in an exemplary embodiment of the present application. As shown in fig. 2d, the facial wire frame diagram may include a contour marking line for a target facial region, where the target facial region may be the masked portion after the facial region to be adjusted is masked; the contour marking line in fig. 2d is for an eye region. The target facial region may also be any region of the face, such as an eye region, a hair region or a mouth region, and labeling the contour of at least one of these regions yields a facial wire frame diagram. The contour marking line of the target facial region can also be generated automatically: a face semantic segmentation network is used to extract a face semantic segmentation map, segmenting the face into a plurality of semantic regions such as eyes, nose, mouth and hair; a relatively high gradient threshold is then set and edge detection is performed with the Canny operator to obtain the facial wire frame diagram of each facial region. The gradient threshold controls the refinement degree of the edge detection: the higher the gradient threshold, the coarser the extracted facial wire frame diagram.
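By way of illustration only, the following sketch shows one possible way to implement the automatic wire frame extraction described above, assuming a pre-computed face semantic segmentation map; the OpenCV function calls are real, but the threshold values and the toy segmentation map are illustrative assumptions rather than values taken from this embodiment.

```python
import cv2
import numpy as np

def extract_face_wireframe(segmentation_map: np.ndarray,
                           low_threshold: int = 200,
                           high_threshold: int = 400) -> np.ndarray:
    """Derive a face wire frame map from a face semantic segmentation map.

    segmentation_map: H x W uint8 map in which each face region
    (eyes, nose, mouth, hair, ...) is encoded as a distinct label value.
    Higher gradient thresholds keep only the coarse region contours.
    """
    # Canny edge detection with a comparatively high gradient threshold,
    # so only the outer contours of each semantic region survive.
    edges = cv2.Canny(segmentation_map.astype(np.uint8),
                      low_threshold, high_threshold)
    # White contour lines on a black background serve as the wire frame map.
    return edges

if __name__ == "__main__":
    # Toy segmentation map: background 0, "face" 80, "eye" 160, "mouth" 240.
    seg = np.zeros((256, 256), dtype=np.uint8)
    cv2.circle(seg, (128, 128), 100, 80, -1)
    cv2.circle(seg, (100, 100), 15, 160, -1)
    cv2.ellipse(seg, (128, 170), (30, 10), 0, 0, 360, 240, -1)
    wireframe = extract_face_wireframe(seg)
    print(wireframe.shape, int(wireframe.max()))
```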
Further, the face material data includes at least a face reference map, and also includes at least one of a face wire frame map, a face mask map, and a face description text.
The face mask image comprises a face mask image obtained by masking a first face area in the face reference image. The first face area is a face area to be adjusted in the face reference image.
Wherein the face description text includes description text of a second face region in the face reference map. The description text of the second face area is a description text for describing features of the face area to be adjusted in the face reference image.
The facial wire frame map comprises a facial wire frame map obtained by using a wire frame to carry out contour marking on a third face area in the face reference map and/or the face mask map. The third face area of the face reference map is an area obtained by masking the face area to be adjusted in the face reference map, or the third face area of the face mask map may be a masking portion of the face mask map.
Specifically, the face reference map, the face wire frame map and the face mask map may be directly input by the user, or may be generated by a face generation model, and the face description text may be directly input only by the user.
Step S202: and inputting the face material data into a pre-trained face generation model to obtain a target face image which is generated by the face generation model and is matched with the face material data.
Illustratively, the face generation model is trained by face image generation using as input a face wire frame map, a face mask map, a face description text, and a face reference map.
Specifically, the embodiment of the present application trains a face generation model in advance, and is configured to generate a target face image matched with face material data input by a user according to the face material data input by the user.
Unlike a conventional face generation model, the face generation model proposed in the embodiment of the present application is obtained by performing face image generation training using a face wire frame diagram, a face mask diagram, a face description text, and a face reference diagram as training samples.
That is, in the training process, a face wire frame image, a face mask image, a face description text, and a face reference image are provided as model inputs, and the model is caused to generate, from these inputs, face images that simultaneously conform to the input face wire frame image, face mask image, face description text, and face reference image.
After the training, the face generation model can take any one or more of a face wire frame image, a face mask image, a face description text and a face reference image as input to generate a corresponding face image. Illustratively, the face generation model may include a codec model and a diffusion model. The codec model comprises a coding model and a decoding model, and the coding model, the diffusion model and the decoding model are sequentially connected in series. The codec model can be a variational auto-encoder, and the diffusion model can be a denoising diffusion probabilistic model.
Specifically, the face generation model generates a target face image that matches the face material data, and may include the following steps.
Step A: and respectively coding each kind of face material data to obtain the coding characteristics corresponding to each kind of face material data.
Illustratively, each kind of face material data is encoded by using an encoding model, and an encoding feature corresponding to each kind of face material data is output.
And B: and performing feature fusion processing on the coding features corresponding to each kind of face material data based on the predetermined weight corresponding to each kind of face material data to obtain fusion features.
For example, the weight may be an attention weight, a preset weight, or the like, and is not limited herein. When only one kind of face material data is provided, the weight of that face material data may be set to 1 and the weights of the remaining kinds may be set to 0; when plural kinds of face material data are provided, the weights may be generated adaptively.
Illustratively, the attention mechanism refers to task-dependent attention with a predetermined purpose, focused on a target intentionally and actively.
Illustratively, the diffusion model is used for carrying out fusion processing on the coding features corresponding to each kind of face material data, and a noise map is randomly added in the fusion processing process to obtain the fusion features.
In the training process, the face generation model can automatically adjust the weights of different types of face material data, so as to generate face images matched with the different types of face material data. On the basis of the training, for the combination of any different types of face material data, the face generation model can determine the weights of the different types of face material data based on the training result, so that useful features with proper proportions are extracted from the different types of face material data for weighted fusion, and the fusion features are obtained. Generating a face image based on the fused feature enables the generated face image to be matched with the input face material data.
And C: and decoding to obtain a target face image based on the fusion characteristics.
Illustratively, the diffusion model is used to predict the noise in the fusion features, and the target face image is then obtained by decoding with the decoding model.
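As a minimal sketch of steps A to C, the snippet below encodes each provided kind of face material data separately and fuses the encoded features with per-modality weights. The module names, feature dimension and weight values are assumptions for illustration, not this embodiment's actual encoders or fusion network.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Illustrative weighted fusion of per-modality encoding features.

    Each kind of face material data (wire frame map, mask map, description
    text, reference map) is encoded separately; the encoded features are then
    combined with per-modality weights before being passed on for decoding.
    """

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # One toy encoder per modality; a real system would use image and
        # text encoders with different architectures.
        self.encoders = nn.ModuleDict({
            name: nn.Linear(feat_dim, feat_dim)
            for name in ("wireframe", "mask", "text", "reference")
        })

    def forward(self, inputs: dict, weights: dict) -> torch.Tensor:
        fused = 0.0
        for name, feature in inputs.items():
            encoded = self.encoders[name](feature)            # per-modality encoding
            fused = fused + weights.get(name, 0.0) * encoded  # weighted fusion
        return fused

if __name__ == "__main__":
    model = MultiModalFusion()
    # Only a mask map and a description text are provided in this toy example.
    inputs = {"mask": torch.randn(1, 256), "text": torch.randn(1, 256)}
    weights = {"mask": 0.6, "text": 0.4}   # illustrative weights
    print(model(inputs, weights).shape)    # torch.Size([1, 256])
```

Setting the weight of a single provided modality to 1 and the remaining weights to 0 reproduces the single-input case described above.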
Specifically, fig. 2e is a schematic diagram illustrating generation of a target face image according to an exemplary embodiment of the present application.
As shown in fig. 2e, in practical applications, when the user inputs only the facial wire-frame diagram into the facial generation model, the target facial region is replaced by the shape of the target facial region marked by the contour marking line in the facial wire-frame diagram, so as to obtain the target facial image. When the user only inputs the face mask image into the face generation model, the face area to be adjusted in the face mask image is replaced randomly to obtain a target face image. When a user inputs a face mask image and a face description text into a face generation model, text features in the face description text are extracted, and a face area to be adjusted is replaced according to the text features to obtain a target face image. When the user only inputs the face description text into the face generation model, the text features in the face description text are extracted, and a target face image matched with the face description text is generated according to the text features.
After the face material data is input into the face generation model, at least one face image can be obtained. The user may select a face image that meets the requirements from the generated face images as the target face image; if none of the generated face images meets the requirements, the user selects the face image that best meets them as a face reference image for editing. During editing, the unsatisfactory face area to be adjusted in the face reference image can be masked to obtain a face mask image, and the user can also choose whether to input face material data corresponding to the face area to be adjusted. If no such data is input, the face area to be adjusted is replaced randomly, at least one face image is generated, and the user selects a satisfactory face image from the generated images as the target face image. If such data is to be input, the user further chooses whether to input a face description text. If a face description text is input, its text features are extracted and the face area to be adjusted is replaced according to the text features; otherwise a face wire frame image is input, by marking a contour marking line that meets the user's requirements in the area to be adjusted of the face mask image, and the face area to be adjusted is replaced according to the face wire frame image. At least one face image is then generated, and the user selects a satisfactory face image from the generated images as the target face image.
Illustratively, before the face mask map, the face reference map or the face wire frame map is input into the face generation model, face key points need to be detected with the facial feature point tool MTCNN (Multi-task Cascaded Convolutional Network). Using a preset unified facial feature point template, all the face key points are aligned through an affine transformation with the facial feature points in the corresponding template, yielding an aligned face mask map, face reference map or face wire frame map; it is this aligned map that is then input into the face generation model. The facial key points can be at five positions such as the nose, eyes, mouth, eyebrows and hair; for a face wire frame map, the facial key points within the range of the contour marking lines can be determined according to the contour marking lines and aligned with the facial feature points in the corresponding template.
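A hedged sketch of this key point alignment step follows. It assumes the facenet-pytorch implementation of MTCNN, which returns five key points (two eyes, nose and two mouth corners) rather than the five positions named above, and the template coordinates and output size are illustrative assumptions.

```python
import cv2
import numpy as np
from facenet_pytorch import MTCNN  # one available MTCNN implementation

# Unified facial feature point template: five points (eyes, nose, mouth corners)
# in the coordinate frame of the aligned output image. Values are illustrative.
TEMPLATE = np.float32([
    [38.3, 51.7], [73.5, 51.5], [56.0, 71.7], [41.5, 92.4], [70.7, 92.2]
])
OUTPUT_SIZE = (112, 112)

def align_face(image_bgr: np.ndarray) -> np.ndarray:
    """Detect five face key points with MTCNN and align them to the template
    through an affine (similarity) transformation."""
    detector = MTCNN()
    # facenet-pytorch expects RGB input.
    boxes, probs, landmarks = detector.detect(
        cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB), landmarks=True)
    if landmarks is None:
        raise ValueError("no face detected")
    points = np.float32(landmarks[0])          # five key points of the first face
    matrix, _ = cv2.estimateAffinePartial2D(points, TEMPLATE)
    return cv2.warpAffine(image_bgr, matrix, OUTPUT_SIZE)
```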
In the embodiment of the application, the user can use one kind of face material data as the input of the face generation model, and also can use various kinds of face material data as the input of the face generation model, so that the user can express the requirement of the target face image by using the face material data in different modes, the operation difficulty of the user is reduced, and the generation efficiency and the accuracy of the target face image are improved.
Fig. 3 is a flowchart illustrating a training process of a codec model according to an exemplary embodiment of the present application. As shown in fig. 3, the training process of the codec model provided in the embodiment of the present application may include the following steps.
Step 301: and extracting a sample feature map of the first sample image by using a coding model, and calculating to obtain a first image generation loss between the sample feature map and the specific data distribution based on the mean value and the variance of the sample feature map.
The specific data distribution may be an arbitrarily set data distribution, or a data distribution determined by automatically learning the distribution characteristics of the sample feature map using a codebook. In the embodiment of the present application, the specific data distribution adopts a standard Gaussian distribution, and the first image generation loss is a loss between the sample feature map and the specific data distribution. This loss may be calculated with any loss function; for example, a KL divergence loss may be calculated, or the Wasserstein distance between the two may be used as the loss.
The embodiment of the application calculates KL divergence loss between the sample feature map and the standard Gaussian distribution as the first image generation loss.
Illustratively, the first sample image is a complete face image used for training. After the first sample image is input into the coding model, it is down-sampled step by step using convolution layers with residual connections to obtain a sample feature map.
Illustratively, the dimensions of the first sample image are B × C × H × W, where B (batch) is the number of first sample images, C (channel) is the number of channels, H (height) is the height and W (width) is the width of the first sample image. Correspondingly, the sample feature map of the first sample image is obtained by down-sampling the first sample image to a resolution of B × C × H/8 × W/8.
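The following minimal sketch illustrates this step-by-step down-sampling to one eighth of the spatial resolution using stride-2 convolution blocks with residual connections; the channel count and block structure are assumptions, not this embodiment's actual coding model.

```python
import torch
import torch.nn as nn

class ResidualDownBlock(nn.Module):
    """Stride-2 convolution block with a residual connection (halves H and W)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.skip = nn.Conv2d(channels, channels, 1, stride=2)

    def forward(self, x):
        return self.conv(x) + self.skip(x)

# Three stride-2 blocks reduce a B x C x H x W input to B x C x H/8 x W/8.
encoder = nn.Sequential(*[ResidualDownBlock(4) for _ in range(3)])
x = torch.randn(2, 4, 64, 64)            # B=2, C=4, H=W=64
print(encoder(x).shape)                  # torch.Size([2, 4, 8, 8])
```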
Exemplarily, taking the first image generation loss as KL divergence loss as an example, the following formula (1) shows a formula for calculating KL divergence loss.
KL_loss = -1/(2·c·h·w) · Σ_{c,h,w} (1 + log σ² − μ² − σ²) (1)

wherein KL_loss represents the first image generation loss, c, h and w are respectively the number of channels, the height and the width of the sample feature map, μ is the mean and σ² is the variance of the sample feature map, the sample feature map of the first sample image being constrained toward a standard Gaussian distribution.
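A small sketch of this first image generation loss, written against the reconstruction of formula (1) above and assuming the coding model outputs a mean map and a log-variance map, is given below.

```python
import torch

def kl_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """First image generation loss: KL divergence between N(mu, sigma^2)
    and a standard Gaussian, averaged over channels, height and width."""
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
    return kl.mean(dim=(1, 2, 3)).mean()   # average over C, H, W, then batch

mu = torch.zeros(2, 4, 8, 8)
logvar = torch.zeros(2, 4, 8, 8)
print(kl_loss(mu, logvar))                 # tensor(0.) when already standard normal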
Step 302: and generating a sampling feature map corresponding to the sample feature map based on the mean and the variance of the sample feature map and the noise map.
Illustratively, the noise map may comprise a gaussian noise map sampled from a standard normal distribution;
exemplarily, the following formula (2) shows a calculation formula of the sampling feature map of the sample feature map.
z = μ + σ · ε (2)

wherein z is the sampling feature map of the sample feature map, μ is the mean, σ² is the variance (σ being the standard deviation), and ε is the Gaussian noise map.
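A corresponding sketch of formula (2), assuming the variance is carried as a log-variance map, might look as follows.

```python
import torch

def sample_feature_map(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sampling feature map via formula (2): mean plus standard deviation
    times a Gaussian noise map (the reparameterization trick)."""
    eps = torch.randn_like(mu)             # Gaussian noise map from N(0, 1)
    return mu + torch.exp(0.5 * logvar) * eps

z = sample_feature_map(torch.zeros(1, 4, 8, 8), torch.zeros(1, 4, 8, 8))
print(z.shape)                             # torch.Size([1, 4, 8, 8])
```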
Step 303: and inputting the sampling feature map into a decoding model to obtain a decoded image, and calculating to obtain a second image generation loss based on the decoded image and the first sample image.
Illustratively, the resolution of the decoded image is B × C × H × W.
Illustratively, the second image generation loss comprises a mean square error (MSE) loss and/or a generative adversarial network (GAN) loss. The following formulas (3-1), (3-2) and (3-3) show specific formulas of the second image generation loss.
Rec_loss = MSE(x̂, x) (3-1) or

Rec_loss = GAN(x̂, x) (3-2) or

Rec_loss = MSE(x̂, x) + GAN(x̂, x) (3-3)

wherein Rec_loss is the second image generation loss, MSE(x̂, x) is the mean square error loss, GAN(x̂, x) is the generative adversarial network loss, x̂ is the decoded image and x is the first sample image.
Step 304: the encoding model and the decoding model are parameter-corrected based on the first image generation loss and the second image generation loss.
Specifically, the total training loss of the coding and decoding model is obtained based on the first image generation loss and the second image generation loss, and the coding and decoding model is subjected to parameter correction by using the total training loss. The following formula (4) shows a calculation formula of the total loss of training of the codec model.
Training total loss = Rec_loss + λ · KL_loss (4)

wherein KL_loss is the first image generation loss, Rec_loss is the second image generation loss, and λ is a scale factor that can be adjusted according to actual conditions and can be, for example, 0.0001.
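The total training loss of formula (4) can be assembled as in the sketch below; the GAN term of formulas (3-2)/(3-3) is omitted here for brevity, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def codec_training_loss(decoded: torch.Tensor,
                        sample: torch.Tensor,
                        mu: torch.Tensor,
                        logvar: torch.Tensor,
                        kl_weight: float = 1e-4) -> torch.Tensor:
    """Total codec training loss of formula (4): Rec_loss + kl_weight * KL_loss.
    Rec_loss is the MSE term of formula (3-1); the optional GAN term is omitted."""
    rec_loss = F.mse_loss(decoded, sample)
    kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).mean()
    return rec_loss + kl_weight * kl

decoded = torch.rand(2, 3, 64, 64)    # decoded image
sample = torch.rand(2, 3, 64, 64)     # first sample image
mu = torch.zeros(2, 4, 8, 8)
logvar = torch.zeros(2, 4, 8, 8)
print(codec_training_loss(decoded, sample, mu, logvar))
```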
By training the codec model, the embodiment of the application achieves the purpose that, on the premise that the similarity between the first sample image input into the codec model and the decoded image generated by it is ensured, the decoded image has higher definition than the first sample image; in addition, the sample feature map generated by the coding model is made to follow the specific data distribution, so that sampling feature maps can be provided for the diffusion model.
Fig. 4 is a flowchart illustrating a training process of a diffusion model according to an exemplary embodiment of the present application. As shown in fig. 4, the training process of the diffusion model provided in the embodiment of the present application may include the following steps.
Step 401: and inputting the second sample image into the coding model to obtain a sampling feature map corresponding to the second sample image.
For example, the second sample image may include a sample face mask map obtained by performing face region masking processing on the sample face reference map.
Step 402: and inputting the sampling feature map, the sample face wire frame map, the noise map and the text feature of the sample face description text into a diffusion model, and enabling the diffusion model to reconstruct the image through predicting the image noise.
Illustratively, the noise map may be a gaussian noise map, a poisson noise map, or the like.
Illustratively, the text features of the sample face description text include sentence features of the whole sentence of the sample face description text and/or sentence features of individual sentences of the sample face description text.
Specifically, the determination of the text feature of the sample face description text may include the following steps.
Step A: and acquiring text description data corresponding to the first sample image.
Illustratively, the text description data corresponding to the first sample image is a text description of the first sample image, such as "a lovely girl with big, round black eyes".
And B, step B: and performing global description feature extraction on the text description data corresponding to the first sample image to obtain the text description feature corresponding to the first sample image.
Illustratively, the text description feature corresponding to the first sample image is a complete sentence feature.
And C: text description data corresponding to the target face region in the first sample image is acquired.
Illustratively, the target face region in the first sample image may be the nose, eyes, mouth, or the like. The text description data corresponding to the target face region in the first sample image is a textual description of that region, e.g., "a rosy face", "big, round black eyes", "a pouting mouth", and so on.
Step D: and performing local description feature extraction on the text description data corresponding to the target face area in the first sample image to obtain the text description feature corresponding to the target face area in the first sample image.
Illustratively, the text description feature corresponding to the target face region in the first sample image is a clause feature.
Illustratively, a text extractor is used to extract the text description features corresponding to the first sample image, e.g. "a lovely girl with big, round black eyes" is extracted into girl, black eyes, round eyes, and the text description features corresponding to the target face area in the first sample image, e.g. "a rosy face" is extracted into red face. The text extractor may be a CLIP (Contrastive Language-Image Pre-training, an image-text multimodal model) text extractor, which is not limited herein.
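A hedged sketch of text feature extraction with a CLIP text encoder follows; each whole-sentence description and each clause is encoded separately, with the pooled output serving as its feature vector. The checkpoint name is only an example, and the example sentences are illustrative.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Example checkpoint; the embodiment only requires a CLIP-style text extractor.
MODEL_NAME = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(MODEL_NAME)
text_model = CLIPTextModel.from_pretrained(MODEL_NAME)

sentences = [
    "a lovely girl with big, round black eyes",   # whole-sentence description
    "a rosy face",                                # clause describing one face region
]
tokens = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = text_model(**tokens)

token_features = outputs.last_hidden_state    # per-token features
sentence_features = outputs.pooler_output     # one feature vector per sentence / clause
print(token_features.shape, sentence_features.shape)
```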
Illustratively, the determining manner of the noise map may include: and determining a preset constraint interval based on the mean value and the variance corresponding to the standard Gaussian distribution. Further, a noise map is randomly selected within a preset constraint interval. Wherein, the mean value and the variance corresponding to the standard Gaussian distribution are respectively 0 and 1, and the preset constraint interval is (0, 1).
Specifically, inputting a diffusion model to a sampling feature map, a sample face wire frame map, a noise map, and a text feature of a sample face description text, and causing the diffusion model to perform image reconstruction by predicting image noise may include the following steps.
Step A: and performing superposition processing on the sampling feature map, the sample face wire frame map and the noise map to obtain a noise-added feature map.
And B: and denoising the noise-added characteristic image by using a diffusion model based on the text characteristics of the sample face description text to obtain the noise prediction loss.
For example, in order to completely remove noise, the process of removing noise is usually iterated, and may be, for example, 50 times.
Exemplarily, the following formula (5) shows a formula of calculating the noise prediction loss.
L = E_{ε,t} [ ‖ ε − ε_θ(z_t, t, z_s, z_w) ‖² ] (5)

wherein L is the noise prediction loss, E denotes the expectation, ε is the noise map, ε_θ is the noise prediction neural network implemented as a U-Net, z_t is the noise-added feature map at time t, z_s is the sampling feature map, z_w is the sample face wire frame map, and t is the time.
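A minimal sketch of this noise prediction loss is given below, assuming a DDPM-style noising step and a tiny convolutional stand-in for the U-Net. Conditioning is done here by channel concatenation, whereas the embodiment additionally injects the text features through an attention mechanism; all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNoisePredictor(nn.Module):
    """Stand-in for the U-Net noise prediction network epsilon_theta.
    Conditioning maps are concatenated along the channel dimension here."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, z_t, z_sample, z_wire):
        return self.net(torch.cat([z_t, z_sample, z_wire], dim=1))

def noise_prediction_loss(model, z_sample, z_wire, alpha_bar_t: float = 0.5):
    """Noise prediction loss of formula (5): MSE between the added Gaussian
    noise and the noise predicted from the noise-added feature map."""
    eps = torch.randn_like(z_sample)                        # Gaussian noise map
    z_t = (alpha_bar_t ** 0.5) * z_sample + ((1 - alpha_bar_t) ** 0.5) * eps
    eps_hat = model(z_t, z_sample, z_wire)                  # predicted noise
    return F.mse_loss(eps_hat, eps)

model = TinyNoisePredictor()
loss = noise_prediction_loss(model, torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8))
loss.backward()
print(float(loss))
```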
Specifically, based on the text features of the sample face description text, denoising the noisy feature map by using the diffusion model to obtain the noise prediction loss, which may include the following steps.
Step B1: and predicting noise in the noise-added feature map by using a diffusion model based on the text features of the sample face description text to obtain noise prediction data.
Step B2: and denoising the noise-added feature map based on the noise prediction data to obtain the noise prediction loss.
Illustratively, the noise-added feature map is input into a U-Net network, text features of the sample face description text are embedded into a network layer of the U-Net network by using an attention mechanism, noise at the time t is predicted by using the U-Net network to obtain noise prediction data, and the noise-added feature map is denoised by using the noise prediction data to train a diffusion model.
Illustratively, the diffusion model has a forward diffusion process and a reverse diffusion process, the forward diffusion process is a process of gradually adding noise, and the reverse diffusion process is a denoising process.
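The reverse (denoising) process can be sketched as an iterative loop such as the one below, run for example for 50 steps. The noise schedule and the simplified DDIM-style update are assumptions for illustration, and the dummy predictor merely stands in for the trained U-Net.

```python
import torch

@torch.no_grad()
def reverse_diffusion(model, z_sample, z_wire, steps: int = 50):
    """Illustrative reverse (denoising) process: start from a pure Gaussian
    noise map and iteratively remove the noise predicted by the model."""
    alpha_bars = torch.linspace(0.001, 0.999, steps)   # from heavy noise toward clean
    z_t = torch.randn_like(z_sample)                   # start from pure noise
    for i in range(steps):
        a_t = alpha_bars[i]
        a_next = alpha_bars[i + 1] if i + 1 < steps else torch.tensor(1.0)
        eps_hat = model(z_t, z_sample, z_wire)         # predicted noise at this step
        z0_hat = (z_t - (1 - a_t).sqrt() * eps_hat) / a_t.sqrt()
        z_t = a_next.sqrt() * z0_hat + (1 - a_next).sqrt() * eps_hat
    return z_t                                         # denoised feature map for the decoder

if __name__ == "__main__":
    dummy = lambda z_t, zs, zw: torch.zeros_like(z_t)  # stands in for the U-Net
    out = reverse_diffusion(dummy, torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8))
    print(out.shape)
```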
Step 403: and performing parameter correction on the diffusion model at least based on the noise prediction loss in the process of the diffusion model training.
Illustratively, the degree of training of the diffusion model is evaluated through the noise prediction loss, and when the noise prediction loss does not reach the expectation, parameter correction is carried out until the noise prediction loss reaches the expectation, and the training is considered to be finished.
In the embodiment of the application, the diffusion model is trained by utilizing the sampling feature map, the sample face wireframe map, the noise map and the text features of the sample face description text, so that the diffusion model can obtain a matched target face image through at least one of the input face mask map, the face wireframe map, the sample face description text and the face reference map, and the multi-modal combined face generation method is realized.
The embodiment of the face image generation method of the present application is described in detail above with reference to fig. 2a to 4, and the embodiment of the face image generation apparatus of the present application is described in detail below with reference to fig. 5. It is to be understood that the description of the embodiments of the face image generation method corresponds to the description of the embodiments of the face image generation apparatus, and therefore, parts not described in detail may be referred to the foregoing method embodiments.
Fig. 5 is a schematic structural diagram of a face image generation apparatus according to an exemplary embodiment of the present application. As shown in fig. 5, the facial image generation apparatus provided in the embodiment of the present application may include the following modules.
An obtaining module 501 is configured to obtain face material data.
Illustratively, the face material data includes at least one of a face wireframe map, a face mask map, a face description text, and a face reference map; the facial line frame image comprises a contour marking line of a target facial area, and the facial mask image is obtained by masking the facial area to be adjusted.
Illustratively, the face material data includes at least a face reference map, and further includes at least one of a face wire frame map, a face mask map, and a face description text; the face mask image comprises a face mask image obtained by masking a first face area in the face reference image; the face description text includes description text of a second face region in the face reference map; the facial wire frame map includes a facial wire frame map obtained by outlining a third face region in the facial reference map and/or the facial mask map with a wire frame.
Illustratively, the face reference map includes a face image generated by a face generation model.
The generating module 502 is configured to input the face material data into a pre-trained face generation model, so as to obtain a target face image generated by the face generation model and matched with the face material data.
Illustratively, the face generation model includes a codec model and a diffusion model; the codec model comprises a coding model and a decoding model, and the coding model, the diffusion model and the decoding model are sequentially connected in series.
The face generation model is obtained by performing face image generation training by taking a face wire frame image, a face mask image, a face description text and a face reference image as input.
In an embodiment of the present application, the generating module 502 is further configured to encode each type of facial material data respectively to obtain a corresponding encoding feature of each type of facial material data; performing feature fusion processing on coding features corresponding to each type of face material data based on predetermined weight corresponding to each type of face material data to obtain fusion features; and decoding to obtain a target face image based on the fusion characteristics.
In an embodiment of the present application, the facial image generation apparatus may further include the following modules.
The first training module is used for extracting a sample characteristic diagram of the first sample image by using the coding model, and calculating to obtain first image generation loss between the sample characteristic diagram and specific data distribution based on the mean value and the variance of the sample characteristic diagram; generating a sampling feature map corresponding to the sample feature map based on the mean and the variance of the sample feature map and the noise map; inputting the sampling feature map into a decoding model to obtain a decoded image, and calculating to obtain a second image generation loss based on the decoded image and the first sample image; the encoding model and the decoding model are parameter-corrected based on the first image generation loss and the second image generation loss.
Illustratively, the noise map may comprise a Gaussian noise map sampled from a standard normal distribution; and/or the second image generation loss comprises a mean square error (MSE) loss and/or a generative adversarial network (GAN) loss.
The second training module is used for inputting the second sample image into the coding model to obtain a sampling characteristic diagram corresponding to the second sample image; the second sample image includes a sample face mask image obtained by performing face region mask processing on the sample face reference image; inputting a diffusion model into the sampling feature map, the sample face wireframe map, the noise map and the text feature of the sample face description text, and enabling the diffusion model to reconstruct an image through predicting image noise; and performing parameter correction on the diffusion model at least based on the noise prediction loss in the process of the diffusion model training.
Illustratively, the text features of the sample face description text include sentence-wise features of the sample face description text and/or sentence-wise features of individual sentences of the sample face description text.
It should be understood that the operations and functions of the acquiring module 501 and the generating module 502 in the face image generating apparatus provided in fig. 5 may refer to the face image generating method provided in fig. 2a to 4, and are not described herein again to avoid repetition.
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 6. Fig. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
As shown in fig. 6, the electronic device 60 includes one or more processors 601 and memory 602.
Processor 601 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in electronic device 60 to perform desired functions.
Memory 602 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 601 to implement the methods of the various embodiments of the application described above and/or other desired functions. Various contents such as a frame map including a face, a face mask map, a face description text, and a face reference map may also be stored in the computer-readable storage medium.
In one example, the electronic device 60 may further include: an input device 603 and an output device 604, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 603 may include, for example, a keyboard, mouse, etc.
The output device 604 can output various information including a face wire frame image, a face mask image, a face description text, a face reference image, and the like to the outside. The output devices 604 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device 60 relevant to the present application are shown in fig. 6, and components such as buses, input/output interfaces, and the like are omitted. In addition, electronic device 60 may include any other suitable components depending on the particular application.
In addition to the above described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the methods according to the various embodiments of the present application described above in this specification.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform the steps in the methods according to the various embodiments of the present application described above in the present specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably herein. As used herein, the words "or" and "and" refer to, and are used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
It should also be noted that in the devices, apparatuses, and methods of the present application, each component or step can be decomposed and/or re-combined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (12)

1. A face image generation method characterized by comprising:
acquiring face material data, wherein the face material data comprises at least one of a face wireframe image, a face mask image, a face description text and a face reference image; the face wireframe image comprises a contour annotation line of a target face region, and the face mask image is obtained by masking the face region to be adjusted;
inputting the face material data into a pre-trained face generation model to obtain a target face image which is generated by the face generation model and matched with the face material data;
the face generation model is obtained through face image generation training that takes a face wireframe image, a face mask image, a face description text and a face reference image as input.
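As an informal illustration of the flow recited in claim 1 (and not part of the claimed subject matter), a minimal inference sketch is given below. The generate_target_face wrapper, the model's keyword interface and the tensor shapes are assumptions introduced purely for illustration; any subset of the four modalities may be supplied.

from typing import Optional
import torch

def generate_target_face(
    face_generation_model: torch.nn.Module,
    wireframe_map: Optional[torch.Tensor] = None,   # contour annotation lines of the target face region
    mask_map: Optional[torch.Tensor] = None,        # image with the face region to be adjusted masked out
    description_text: Optional[str] = None,         # natural-language description of the desired face
    reference_map: Optional[torch.Tensor] = None,   # reference face image
) -> torch.Tensor:
    """Feed whichever of the four material modalities were supplied into the pre-trained model."""
    with torch.no_grad():
        return face_generation_model(
            wireframe=wireframe_map,
            mask=mask_map,
            text=description_text,
            reference=reference_map,
        )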
2. The method according to claim 1, wherein the face material data includes at least a face reference image, and further includes at least one of a face wireframe image, a face mask image, and a face description text;
wherein the face mask image comprises a face mask image obtained by masking a first face region in the face reference image;
the face description text comprises a description text of a second face region in the face reference image;
the face wireframe image comprises a face wireframe image obtained by outlining a third face region in the face reference image and/or the face mask image with a wireframe.
3. The method of claim 1, wherein the face reference image comprises a face image generated by the face generation model.
4. The method of claim 1, wherein generating, by the face generation model, the target face image matched with the face material data comprises:
encoding each kind of face material data separately to obtain a coding feature corresponding to each kind of face material data;
performing feature fusion processing on the coding features corresponding to the kinds of face material data, based on a predetermined weight corresponding to each kind of face material data, to obtain a fused feature;
and decoding, based on the fused feature, to obtain the target face image.
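The weighted feature fusion recited in claim 4 might, under simplifying assumptions, look like the following sketch. The dictionary-based interface, the assumption that all coding features share the same shape, and the example weights are illustrative placeholders rather than the claimed implementation.

import torch

def fuse_and_decode(coding_features: dict, weights: dict, decoder: torch.nn.Module) -> torch.Tensor:
    """Weighted fusion of per-modality coding features followed by decoding."""
    fused = None
    for name, feature in coding_features.items():       # feature: (B, C, H, W) coding feature
        weighted = weights[name] * feature               # predetermined weight for this modality
        fused = weighted if fused is None else fused + weighted
    return decoder(fused)                                # decode the fused feature into the target face image

# Illustrative call (tensors, weights and the decoder are placeholders):
# image = fuse_and_decode({"text": f_text, "wireframe": f_wire},
#                         {"text": 0.6, "wireframe": 0.4}, decoder)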
5. The method according to any one of claims 1 to 4, wherein the face generation model comprises a codec model and a diffusion model;
the codec model comprises a coding model and a decoding model, and the coding model, the diffusion model, and the decoding model are connected in series in that order.
6. The method of claim 5, wherein the training process of the codec model comprises:
extracting a sample feature map of a first sample image by using the coding model, and calculating a first image generation loss between the sample feature map and a specific data distribution based on the mean and variance of the sample feature map;
generating a sampling feature map corresponding to the sample feature map based on the mean and variance of the sample feature map and a noise map;
inputting the sampling feature map into the decoding model to obtain a decoded image, and calculating a second image generation loss based on the decoded image and the first sample image;
performing parameter correction on the coding model and the decoding model based on the first image generation loss and the second image generation loss.
7. The method of claim 6, wherein the noise map comprises a Gaussian noise map sampled from a standard normal distribution;
and/or,
the second image generation loss comprises a mean square error (MSE) loss and/or a generative adversarial network (GAN) loss.
8. The method of claim 5, wherein the training process of the diffusion model comprises:
inputting a second sample image into a coding model to obtain a sampling feature map corresponding to the second sample image; the second sample image includes a sample face mask image obtained by performing face region mask processing on a sample face reference image;
inputting the sampling feature map, a sample face wireframe image, a noise map and text features of a sample face description text into a diffusion model, so that the diffusion model reconstructs an image by predicting image noise;
and performing parameter correction on the diffusion model based at least on a noise prediction loss during the training of the diffusion model.
9. The method of claim 8, wherein the text features of the sample face description text comprise whole-text features of the sample face description text and/or sentence-level features of individual sentences of the sample face description text.
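Claims 8 and 9 describe training the diffusion model to predict image noise. The following sketch uses the standard noise-prediction (epsilon) objective; the presence of a separate target latent to be denoised and the way the masked-image latent, wireframe map and text features are wired into the network are an assumed conditioning interface, not the claimed architecture.

import torch
import torch.nn.functional as F

def diffusion_training_step(diffusion_model, target_latent, masked_latent,
                            wireframe_map, text_features, alphas_cumprod):
    b = target_latent.shape[0]
    # Sample a diffusion timestep and a Gaussian noise map per example.
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=target_latent.device)
    noise = torch.randn_like(target_latent)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy_latent = a_bar.sqrt() * target_latent + (1.0 - a_bar).sqrt() * noise
    # The model predicts the injected noise, conditioned on the masked-image latent,
    # the wireframe map and the description-text features.
    pred_noise = diffusion_model(noisy_latent, t,
                                 masked_latent=masked_latent,
                                 wireframe=wireframe_map,
                                 text=text_features)
    return F.mse_loss(pred_noise, noise)   # noise prediction loss

At inference time the same network would be run iteratively starting from pure noise, but that sampling loop is outside what claim 8 recites.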
10. A face image generation apparatus characterized by comprising:
an acquisition model for acquiring face material data including at least one of a face wireframe, a face mask, a face description text, and a face reference; the facial line frame image comprises a contour marking line of a target facial area, and the facial mask image is obtained by masking the facial area to be adjusted;
the generation model is used for inputting the face material data into a pre-trained face generation model to obtain a target face image which is generated by the face generation model and is matched with the face material data;
the face generation model is obtained by performing face image generation training by taking a face wire frame image, a face mask image, a face description text and a face reference image as input.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor configured to perform the method of any of the preceding claims 1 to 9.
12. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the method of any of the preceding claims 1 to 9.
CN202310083541.4A 2023-02-08 2023-02-08 Face image generation method, device, equipment and storage medium Pending CN115810215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310083541.4A CN115810215A (en) 2023-02-08 2023-02-08 Face image generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310083541.4A CN115810215A (en) 2023-02-08 2023-02-08 Face image generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115810215A true CN115810215A (en) 2023-03-17

Family

ID=85487766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310083541.4A Pending CN115810215A (en) 2023-02-08 2023-02-08 Face image generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115810215A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190035149A1 (en) * 2015-08-14 2019-01-31 Metail Limited Methods of generating personalized 3d head models or 3d body models
CN113240115A (en) * 2021-06-08 2021-08-10 深圳数联天下智能科技有限公司 Training method for generating face change image model and related device
CN113516136A (en) * 2021-07-09 2021-10-19 中国工商银行股份有限公司 Handwritten image generation method, model training method, device and equipment
US20230026278A1 (en) * 2021-07-22 2023-01-26 Cape Analytics, Inc. Method and system for occlusion correction
CN113470182A (en) * 2021-09-03 2021-10-01 中科计算技术创新研究院 Face geometric feature editing method and deep face remodeling editing method
CN114187165A (en) * 2021-11-09 2022-03-15 阿里巴巴云计算(北京)有限公司 Image processing method and device
CN114663274A (en) * 2022-02-24 2022-06-24 浙江大学 Portrait image hair removing method and device based on GAN network
CN115278106A (en) * 2022-06-20 2022-11-01 中国科学院计算技术研究所 Deep face video editing method and system based on sketch
CN115223012A (en) * 2022-07-11 2022-10-21 湖南中科助英智能科技研究院有限公司 Method, device, computer equipment and medium for restoring unmasked face

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
CHITWAN SAHARIA et al.: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding"
GUILLAUME COUAIRON et al.: "DiffEdit: Diffusion-Based Semantic Image Editing with Mask Guidance"
OMRI AVRAHAMI et al.: "SpaText: Spatio-Textual Representation for Controllable Image Generation"
ROBIN ROMBACH et al.: "High-Resolution Image Synthesis with Latent Diffusion Models", arXiv:2112.10752v2
TENGFEI WANG et al.: "Pretraining is All You Need for Image-to-Image Translation", arXiv:2205.12952v1
TIZIANO PORTENIER et al.: "FaceShop: Deep Sketch-based Face Image Editing"
TIZIANO PORTENIER et al.: "FaceShop: Deep Sketch-based Face Image Editing", arXiv:1804.08972v2
WEIHAO XIA et al.: "Towards Open-World Text-Guided Face Image Generation and Manipulation"
ZHENG Kai et al.: "Application of Artificial Intelligence in the Field of Image Generation: Stable Diffusion and ERNIE-ViLG as Examples" (in Chinese)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597039A (en) * 2023-05-22 2023-08-15 阿里巴巴(中国)有限公司 Image generation method and server
CN116597039B (en) * 2023-05-22 2023-12-26 阿里巴巴(中国)有限公司 Image generation method and server

Similar Documents

Publication Publication Date Title
JP7193252B2 (en) Captioning image regions
CN108549850B (en) Image identification method and electronic equipment
CN110555896B (en) Image generation method and device and storage medium
CN111598979B (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN115132313A (en) Automatic generation method of medical image report based on attention mechanism
CN113516152B (en) Image description method based on composite image semantics
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN113392791A (en) Skin prediction processing method, device, equipment and storage medium
CN110619334A (en) Portrait segmentation method based on deep learning, architecture and related device
CN116363261A (en) Training method of image editing model, image editing method and device
CN116797868A (en) Text image generation method and diffusion generation model training method
CN115810215A (en) Face image generation method, device, equipment and storage medium
KR102504722B1 (en) Learning apparatus and method for creating emotion expression video and apparatus and method for emotion expression video creation
CN113887169A (en) Text processing method, electronic device, computer storage medium, and program product
CN116361502B (en) Image retrieval method, device, computer equipment and storage medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN117541668A (en) Virtual character generation method, device, equipment and storage medium
CN117496099A (en) Three-dimensional image editing method, system, electronic device and storage medium
CN115203415A (en) Resume document information extraction method and related device
KR102476334B1 (en) Diary generator using deep learning
CN114357164A (en) Emotion-reason pair extraction method, device and equipment and readable storage medium
WO2021137942A1 (en) Pattern generation
CN112966150A (en) Video content extraction method and device, computer equipment and storage medium
CN114091662B (en) Text image generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20230317)