CN115393177A - Face image processing method and electronic equipment - Google Patents

Face image processing method and electronic equipment

Info

Publication number
CN115393177A
Authority
CN
China
Prior art keywords
algorithm model
stylized
training
image
face image
Prior art date
Legal status
Pending
Application number
CN202210860811.3A
Other languages
Chinese (zh)
Inventor
郑煜伟
谢飞
查俊莉
黄志星
蔡佳然
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210860811.3A
Publication of CN115393177A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/04 Context-preserving transformations, e.g. by using an importance map
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application discloses a face image processing method and electronic equipment, wherein the method comprises the following steps: acquiring a first data set consisting of a plurality of stylized face images with a target style and a second data set consisting of a plurality of real face images; taking the first data set and the second data set as training samples, and training a first algorithm model in an unsupervised learning mode to obtain a third data set consisting of a real face image and a stylized face image which have a pairing relation; and taking the third data set as a training sample, training a second algorithm model in a supervised learning mode so as to distribute the second algorithm model to the terminal equipment where the client is located, wherein the client is used for converting the real face image acquired by the terminal equipment into the stylized face image with the target style through the second algorithm model. Through the embodiment of the application, the real-time face stylization processing at the mobile terminal can be realized at lower cost.

Description

Face image processing method and electronic equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a face image processing method and an electronic device.
Background
In the fields of live broadcast, short video, image-text scenes, digital humans and the like, the representation of the face always plays an important role; facial stylization can not only realize scene functions such as atmosphere creation and impression improvement, but also serve purposes such as face privacy protection, adding fun and IP-style promotion. Face stylization converts a real face image collected in reality into a face image of a certain style, for example a Disney cartoon style, while retaining the identifying attribute features of the real face in the converted face image.
In the process of implementing facial stylization, an algorithm model needs to be trained in advance, so that the algorithm model learns an image processing mode in the process of mapping a real facial image to a certain style of facial image, and after a real facial image is input into the algorithm model, the algorithm model can output the style of facial image.
In the prior art, training a face stylization algorithm model requires paired training samples, that is, a real face image together with a stylized face image corresponding to it. However, stylized face images corresponding to real face images do not exist in reality, so a large number of designers in professional fields are often needed to carry out long cycles of style design, drawing and repeated modification of stylized face images for the real face images, after which the resulting pairs are used to train the algorithm model. This mode of production not only requires a large investment of labor and time for each stylized algorithm model, but also makes it difficult to keep up with the ever-changing Internet world.
Disclosure of Invention
The application provides a face image processing method and electronic equipment, which can realize real-time face stylization processing at a mobile terminal with lower cost.
The application provides the following scheme:
a facial image processing method, comprising:
acquiring a first data set consisting of a plurality of stylized face images with a target style and a second data set consisting of a plurality of real face images, wherein the stylized face images in the first data set are not in pairing relationship with the real face images in the second data set;
taking the first data set and the second data set as training samples, and training a first algorithm model in an unsupervised learning mode so as to obtain a third data set consisting of a real face image and a stylized face image which have a pairing relation through the first algorithm model;
and taking the third data set as a training sample, training a second algorithm model in a supervised learning mode so as to distribute the second algorithm model to the terminal equipment where the client is located, wherein the client is used for converting the real facial image acquired by the terminal equipment into the stylized facial image with the target style through the second algorithm model.
Wherein said obtaining a first data set comprised of a plurality of stylized facial images having a target style comprises:
collecting a first quantity of stylized facial image source material relating to the target style;
training a third algorithm model by using the original materials;
obtaining a second number of stylized face images according to a trained third algorithm model to form the first data set;
wherein the first number is less than the second number.
Wherein the method further comprises:
pre-training the third algorithm model using a plurality of real facial images;
the training of the third algorithm model by using the raw materials comprises the following steps:
and performing secondary training on the basis of the pre-trained third algorithm model by using the original materials.
Wherein the method further comprises:
before the secondary training, fixing the values of some parameters in the third algorithm model that were obtained by the pre-training, wherein the fixed parameters are: parameters related to common features between the real face image and the stylized face image.
Wherein the method further comprises:
parameter values obtained by the secondary training are subjected to error correction by performing parameter value fusion on the pre-training result and the secondary training result;
the obtaining a second number of stylized facial images according to a trained third algorithm model includes:
and obtaining a second number of stylized face images according to the third algorithm model after the parameter value fusion.
Wherein the method further comprises:
training a fourth algorithm model for generating random vectors on the basis of the third algorithm model after pre-training;
and when a second number of stylized face images are obtained according to a third algorithm model obtained by secondary training, generating a random vector through the fourth algorithm model, and using the random vector as the input of the third algorithm model to control the distribution of the stylized face images output by the third algorithm model.
Wherein said obtaining a first data set consisting of a plurality of stylized facial images having a target style comprises:
collecting stylized face image raw materials of at least two styles respectively;
respectively training a third algorithm model by using the raw materials corresponding to the at least two styles to obtain at least two groups of parameter values corresponding to the at least two styles;
fusing the at least two groups of parameter values to obtain fused parameter values;
and obtaining a plurality of stylized face images with the target style according to the third algorithm model and the fused parameter values.
The first algorithm model comprises a generation network part and a discrimination network part;
the method further comprises the following steps:
adding a mean-error (L1) loss term for the background area in the generation network part, ignoring the adversarial loss term for the background area, and removing the branch responsible for discriminating the whole image from the discrimination network part, so as to prevent the image of the background area from being stylized in the process of converting the real face image into the stylized face image.
Wherein the loss function of the second algorithm model comprises, in addition to a pixel loss, an adversarial loss and a perceptual loss.
Wherein the method further comprises:
performing edge recognition on the stylized face images in the third data set, and blurring the edge portions;
the training of the second algorithm model by using the third data set as a training sample in a supervised learning manner comprises the following steps:
and taking the third data set together with the edge-blurred stylized face images as training samples, training the second algorithm model in a supervised learning mode, and providing an edge-enhancement adversarial loss in the second algorithm model, so that the stylized face images generated by the second algorithm model have enhanced edges.
The second algorithm model comprises a discrimination network, and the discrimination network has global discrimination capability, local discrimination capability and an attention mechanism.
Wherein the method further comprises:
and performing data enhancement processing on the face image in the third data set, wherein the data enhancement processing comprises random cropping, random scaling or random optical distortion processing.
The client comprises a client provided by a commodity information service system;
the client is used for:
after a request of a user for live broadcasting or short video/photo shooting of a target commodity is received, stylized processing options are provided;
responding to a request initiated by the stylized processing option, determining a target style, and intercepting a real face image from an original image acquired by terminal equipment;
and converting the real face image into a stylized face image of the target style through a second algorithm model corresponding to the target style, and pasting the stylized face image back to the original image.
A facial image processing apparatus comprising:
the system comprises a data generation unit and a face recognition unit, wherein the data generation unit is used for acquiring a first data set consisting of a plurality of stylized face images with target styles and a second data set consisting of a plurality of real face images, and the stylized face images in the first data set and the real face images in the second data set have no pairing relation;
the unsupervised learning unit is used for taking the first data set and the second data set as training samples and training a first algorithm model in an unsupervised learning mode so as to obtain a third data set consisting of a real face image and a stylized face image which have a pairing relation through the first algorithm model;
and the supervised learning unit is used for taking the third data set as a training sample and training a second algorithm model in a supervised learning mode so as to distribute the second algorithm model to the terminal equipment where the client is located, wherein the client is used for converting the real face image acquired by the terminal equipment into the stylized face image with the target style through the second algorithm model.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the preceding claims.
An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding claims.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
according to the embodiment of the application, the generation process of the stylized processing model can be divided into a plurality of stages, in the first stage, a first data set formed by a plurality of stylized face images with target styles and a second data set formed by a plurality of real face images can be obtained, and the stylized face images in the first data set and the real face images in the second data set do not need to have a pairing relationship. Then, in the second stage, the first data set and the second data set can be used as training samples, and the first algorithm model is trained in an unsupervised learning manner, so that a third data set consisting of a real face image and a stylized face image with a pairing relationship is obtained through the first algorithm model. In the third stage, the third data set can be used as a training sample, a second algorithm model is trained in a supervised learning mode, and then the second algorithm model can be distributed to the terminal equipment where the client is located, so that the client can convert the acquired real face image into the stylized face image with the target style by using the second algorithm model. By the method, the three tasks of stylized data production, paired image data production and mobile terminal image translation can be decoupled. Moreover, the training of the second algorithm model can be completed at lower cost without performing stylized face image design for matching the real face images by designers, experts and the like, so that stylized face images with certain styles can be correspondingly output under the condition of inputting one real face image; and because the training of the second algorithm model can be realized in a supervision mode, the control of the operation amount can be realized, so that the operation can be realized at the client side, and the real-time stylized processing can be realized at the client side.
In an optional embodiment, optimization and improvement can be performed on the algorithm within each stage, the algorithms between stages, or the data. For example, in the first stage, pre-training the algorithm model with real face data, fixing some parameters, and the like, reduce the quantity of stylized face image raw material required, so that the third algorithm model may be trained with a minimal quantity of raw material and still generate a large quantity of stylized face data. Style innovation can also be realized by fusing the parameters of different styles. In the second stage, the background image can be kept fixed during stylization by adding a mean-error (L1) loss term for the background area in the generation network part, ignoring the adversarial loss term for the background area, and removing the branch that discriminates the whole image from the discrimination network part, thereby avoiding background blurring caused by stylizing the background area. In the third stage, the stylized face images generated by the second algorithm model are given edge enhancement by adding an adversarial loss and a perceptual loss, and by adding edge-blurred stylized face images to the training data; and the robustness of the algorithm model can be improved by performing data enhancement processing on the face images in the third data set, and so on.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for practicing the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method provided by an embodiment of the present application;
FIG. 3 is a schematic view of an apparatus provided by an embodiment of the present application;
fig. 4 is a schematic view of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.
In order to facilitate understanding of the specific implementation scheme provided in the embodiment of the present application, it is first noted that, in an actual application scenario, a user usually initiates a request for face stylization while broadcasting live or shooting a short video through a specific application client. At this time, the actually acquired real face image needs to be recognized and cropped out, and after stylization, the resulting stylized face image is pasted back into the image acquisition interface. In this process, in order to better support real-time performance, the specific processing procedure generally needs to be completed by the client. That is to say, the server needs to issue the specific stylization algorithm model to the client, and the client runs the model locally on the terminal device to generate the stylized face image, and so on. The terminal device is usually a mobile terminal such as a mobile phone, so the amount of computation required to run the algorithm model cannot be too large, otherwise the performance of the mobile terminal may not support it.
Therefore, the improvement objective of the embodiment of the present application is not only to reduce the production cost of the stylized algorithm model, but also to meet the requirement of performing the facial stylization processing in real time at the mobile terminal. Therefore, in the scheme provided by the embodiment of the application, the training process of the stylized algorithm model is divided into three stages, wherein:
the first stage is used to produce a plurality of stylized face images and collect a plurality of real face images, however, in the embodiment of the present application, the stylized face images that are produced specifically do not need to correspond to the real face images specifically, that is, although image acquisition of two domains, namely, the stylized face images and the real face images, is involved, the two domains do not need to have a pairing relationship. Also, because the stylized facial image need not have a mating relationship with the real facial image, the production of the stylized facial image need not be designed by a designer or the like, but rather may be generated by way of an algorithmic model.
The second stage may train the first algorithm model according to the stylized face image and the real face image without a pairing relationship, so that the first algorithm model learns the first mapping relationship of the stylized face image converted from the real face image to the target style from these data, and thus may output the stylized face image data having a pairing relationship with the real face image in a case where one real face image is input. Of course, since the stylized face image and the real face image as training data do not have a pairing relationship, the learning process of the first algorithm model belongs to unsupervised learning, and accordingly, relatively high requirements are imposed on operators, depths, widths, resolutions and the like of the algorithm, so that although the first algorithm model can output the stylized face image with the target style according to the input real face image, the first algorithm model is usually difficult to operate at a mobile terminal due to too high calculation amount. For this reason, the embodiment of the present application further provides a third stage.
In the third stage, the plurality of real face images can be converted into the stylized face image by using the first algorithm model obtained by the training in the second stage, so that a data set consisting of the real face image and the stylized face image with a pairing relation can be obtained, and the second algorithm model is trained by using the paired real face image and stylized face image in the data set, so that the real face image can be converted into the stylized face image by using the second algorithm model. In this case, the second algorithm model may be a supervised model, and thus, the calculation amount during operation is lower than that of the first algorithm model, and the second algorithm model is more suitable for operation at the mobile terminal.
In each specific stage, training of the model can be performed by using training data suitable for the application scenario of the embodiment of the application on the basis of the existing algorithm model. In addition, during specific implementation, some optimization or improvement can be performed on the existing algorithm model according to specific application scene characteristics so as to improve the effect of the image output by the algorithm. These specific optimization or improvement points will be described in detail later.
From the perspective of system architecture, referring to fig. 1, the embodiment of the present application may relate to a client and a server of a related application, where the server is mainly configured to generate a specific stylized algorithm model, that is, the three stages may all be completed at the server, and after the second algorithm model is obtained, the second algorithm model may be issued to the client, so that the client may locally complete a processing process of converting a real facial image into a stylized facial image at a terminal device according to the specific algorithm model.
The following describes in detail specific implementations provided in embodiments of the present application.
First, the embodiment provides a method for processing a face image from the perspective of the aforementioned service end, and referring to fig. 2, the method may specifically include:
s201: the method comprises the steps of obtaining a first data set consisting of a plurality of stylized face images with target styles and a second data set consisting of a plurality of real face images, wherein the stylized face images in the first data set are not in pairing relation with the real face images in the second data set.
The step S201 is the first stage. Specifically, a specific desired target style, e.g., a series of animated facial styles, etc., may first be determined, and then a first data set of a plurality of stylized facial images having the target style, and a second data set of a plurality of real facial images, may be obtained. The stylized face image in the first data set and the real face image in the second data set do not need to have a pairing relationship, that is, the specific stylized face image does not need to have the characteristics of a certain real face image. Thus, such stylized facial data may be obtained by collection from some library of related pictures, official websites, and the like.
Of course, the collected stylized face images are used to train a specific model, and therefore a relatively large number of them is typically required (e.g., at least a thousand images), while the number of same-style stylized face images that can actually be collected may be relatively small, e.g., only a few dozen images may be available, and so on. Therefore, in an alternative embodiment, when the number of collected stylized face images is small, such stylized face images may be generated by a specific algorithm model (referred to as the third algorithm model, to distinguish it from the algorithm models introduced later).
Specifically, the third algorithm model may be trained using a first number of collected stylized facial images as training samples. Particularly, when such stylized facial images are collected, the data distribution may be optimally adjusted, for example, a specific training sample preferably includes a plurality of different facial features such as different angles (front, side, face up, etc.), whether glasses are worn, and the like. For example, by collecting from an official website, etc., 100 stylized face images of a certain target style are acquired, data distribution optimization adjustment can be performed on the 100 raw materials, so that images of various facial features are evenly distributed, and so on. After the third algorithm model is trained using the source material, a second number of stylized facial images may be obtained from the trained third algorithm model to form the first dataset. Wherein the first number is less than the second number. That is, the amount of raw material may be small, but after training of the third algorithm model is completed, more stylized face images about a certain style may be acquired through the third algorithm model.
The third algorithm model may be implemented as a generative adversarial network model (e.g., StyleGAN), which includes a generation network part and a discrimination network part. The generation network part can take arbitrary data (such as a random vector of length N) as input, with the goal of outputting a stylized face image; of course, at the initial stage of training, the output of the generation network will not really have the required style. The discrimination network part is used to compare the image output by the generation network with the real stylized face images in the samples, and if the two are not close enough, the generation network is driven to continue learning and updating its parameter values. After multiple rounds of iteration, training can be stopped once the images output by the generation network are close enough to the real stylized face images in the samples. Afterwards, stylized face images are produced by the generation network part with the corresponding parameter values: that is, by inputting a random vector into the generation network part, a stylized face image having the target style can be output.
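For illustration only, the following minimal PyTorch sketch shows one adversarial training round of the kind described above; the G and D modules, the optimizer setup and the non-saturating BCE losses are assumptions for the sketch rather than the disclosed StyleGAN configuration.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, real_styled, z_dim=512):
    """One adversarial round: D separates real stylized images from
    generated ones; G learns to fool D."""
    z = torch.randn(real_styled.size(0), z_dim, device=real_styled.device)

    # Discriminator step: push real stylized images toward 1, fakes toward 0.
    logits_real = D(real_styled)
    logits_fake = D(G(z).detach())
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: make D score freshly generated images as real.
    logits_fake = D(G(z))
    g_loss = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```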
Because the number of same-style stylized face raw materials that can be collected may be relatively small, in order to ensure the training effect on the third algorithm model with few training samples, in an optional embodiment, before the third algorithm model is trained with such raw material, a plurality of real face images are first used to pre-train the third algorithm model; then, the stylized face image raw material may be used for secondary training on the basis of the pre-trained third algorithm model.
The pre-training acquires a set of parameter values for the third algorithm model such that, under this set of parameter values, the third algorithm model can output a real face image from an input random vector. Because real face images can be collected in large numbers, they can all be used as samples for pre-training, and an ideal pre-training effect can thus be obtained. Moreover, because a set of parameter values has already been obtained, performing secondary training on this basis with the stylized face image raw material as training samples reduces the number of iterations in the training process and lowers the requirement on the number of stylized training samples.
Since the stylized face images and the real face images may also share some common features, such as face contours, in an optional manner, after the pre-training is completed and before the secondary training is performed, the values of some parameters in the third algorithm model obtained through the pre-training may be fixed, where those parameters are: parameters related to common features between the real face images and the stylized face images. That is to say, through analysis and comparison, it can be found which parameters affect features such as the face contour, and the values of these parameters can be fixed, so that during the secondary training only the values of the other parameters need to be learned; the number of iterations and the required number of training samples are therefore both reduced.
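A minimal sketch of this parameter-fixing step, assuming a PyTorch generator whose early synthesis blocks carry the shared structural features; the layer-name prefixes are purely illustrative and would in practice be found by the analysis and comparison mentioned above.

```python
def freeze_shared_layers(generator, shared_prefixes=("synthesis.b4.", "synthesis.b8.")):
    """Fix the pre-trained values of parameters assumed to encode features
    shared between real and stylized faces (e.g. early synthesis blocks),
    so secondary training only updates the style-specific rest."""
    for name, param in generator.named_parameters():
        if name.startswith(shared_prefixes):
            param.requires_grad = False  # keep the pre-trained value fixed
    # Hand only the trainable (style-specific) parameters to the optimizer.
    return [p for p in generator.parameters() if p.requires_grad]

# usage sketch: opt = torch.optim.Adam(freeze_shared_layers(G), lr=2e-4)
```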
In addition, because the raw material of the collected stylized facial image is usually drawn and does not need to correspond to the real facial image, some features may not be sufficiently apparent, for example, most stylized facial images may not include teeth, so that in the process of pre-training with the real facial image and then performing secondary training with the stylized facial image as training data, when features related to teeth are involved, the model may not be able to correctly learn the processing mode related to the teeth features. Therefore, in a preferred mode, error correction can be performed on the parameter values obtained by the secondary training in a mode of performing parameter value fusion on the pre-training result and the secondary training result, and then a second number of stylized face images can be obtained subsequently according to a third algorithm model after parameter value fusion. When the parameter fusion is performed, a set of parameter values obtained after pre-training is fused with a set of parameter values obtained after secondary training, for example, corresponding values of the same parameter in two sets of parameter values are averaged, or weighted average processing is performed, and the like. When the weighted average is performed, the selection of a specific weight value can be determined by trial and error.
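As a sketch, the fusion itself can be a per-parameter weighted average of the two sets of parameter values; the blend weight is an assumption to be tuned by trial and error, and the same helper applies equally to blending two different styles as described below.

```python
import torch

def fuse_state_dicts(sd_a, sd_b, w=0.5):
    """Per-parameter weighted average: w * sd_a + (1 - w) * sd_b.
    Non-float buffers (e.g. step counters) are taken from sd_b as-is."""
    assert sd_a.keys() == sd_b.keys()
    return {k: (w * sd_a[k] + (1.0 - w) * sd_b[k])
            if sd_a[k].is_floating_point() else sd_b[k]
            for k in sd_a}

# e.g. error-correct the secondary-training result with the pre-trained weights:
# G.load_state_dict(fuse_state_dicts(pretrained_sd, finetuned_sd, w=0.3))
```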
Further, since the input of the generation network part of the generative adversarial network model is a random vector, if vector input is performed in a completely random manner, the distribution of the output face images cannot be controlled. For this reason, in an alternative implementation, distribution control over the generated stylized face data may be implemented using GAN inversion and similar techniques. Specifically, a coding network may be trained according to algorithms such as e4e (Encoder4Editing); this coding network is the fourth algorithm model used for generating random vectors, and it is likewise obtained through training. For example, in a specific implementation, the fourth algorithm model for generating random vectors may be trained on the basis of the pre-trained third algorithm model. That is, inputting a real face image to the fourth algorithm model (i.e., the coding network) yields a hidden-space code, and feeding this code into a pre-trained StyleGAN generator yields an output image that substantially corresponds to the input real face image. Therefore, the training of the fourth algorithm model can be completed on the basis of the pre-trained third algorithm model. Furthermore, when the second number of stylized face images is obtained from the third algorithm model produced by secondary training, a random vector may be generated through the fourth algorithm model and used as the input of the third algorithm model, so as to control the distribution of the stylized face images output by the third algorithm model.
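A minimal sketch of the inversion-driven generation step, assuming an e4e-style encoder (the fourth algorithm model) and generator modules trained as described; this is not the e4e reference API.

```python
import torch

@torch.no_grad()
def stylize_with_inversion(encoder, styled_generator, real_face):
    """Encode a real face into the generator's hidden space, then decode
    with the style-fine-tuned generator, so the output distribution follows
    the distribution of the input real faces."""
    latent = encoder(real_face)       # hidden-space code for this face
    return styled_generator(latent)   # stylized image steered by that code
```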
Where the stylized face images are generated by the third algorithm model, style innovation can be performed by fusing several different styles, so that the generation of stylized face images in more styles can be supported. For example, raw face image materials of 20 styles may be collected in reality, 20 sets of parameter values may be obtained by training the third algorithm model on those 20 sets of raw material, and more combined styles may be obtained by fusing the 20 sets of parameter values pairwise or fusing multiple sets together. That is, the parameter-fused third algorithm model can produce stylized face images of a combined style, and the conversion from real face images to stylized face images of that combined style can subsequently be achieved. For example, if style A is a Disney style and style B is a children's style, then after the parameter values corresponding to style A and style B are fused, the third algorithm model can use the fused parameter values to produce stylized face images with a "Disney children" style; in the subsequent facial stylization processing, not only the Disney style and the children style but also the Disney-children style can be provided.
That is, in the above manner, stylized face image raw materials of at least two styles may be collected respectively, and the third algorithm model may then be trained with the raw materials corresponding to the at least two styles respectively, to obtain at least two sets of parameter values corresponding to the at least two styles. The at least two sets of parameter values can then be fused, and a plurality of stylized face images with the target style can be obtained according to the third algorithm model and the fused parameter values.
In the process of creating a style by fusing two or more different styles, a specific parameter fusion method may include various fusion methods such as averaging or weighted averaging of parameter values on corresponding parameters. The weights specifically used in the weighted average process may also be determined by repeated tests, and the like, and are not described in detail here.
S202: and taking the first data set and the second data set as training samples, and training a first algorithm model in an unsupervised learning mode so as to obtain a third data set consisting of a real face image and a stylized face image which have a pairing relation through the first algorithm model.
After the first data set and the second data set are obtained, the second stage may be entered; that is, the first algorithm model is trained with the first data set and the second data set as training samples. The stylized face images in the first data set and the real face images in the second data set are not in a pairing relationship, so the first algorithm model is trained in an unsupervised manner. The training target is: the first algorithm model learns, in an unsupervised manner, the mapping relationship for converting a real face image into a stylized face image of the target style, so that when a real face image is input into the first algorithm model, a stylized face image with the target style can be output. Of course, a pairing relationship exists between the output stylized face image and the input real face image; that is, the output stylized face image retains some features of the input real face image, such as the face shape (so that the user in the real face image, or someone relatively familiar with that person, can still roughly tell from the stylized output who it is).
Such a first algorithm model may specifically be U-GAT-IT (unpaired image translation), or the like. However, in the embodiment of the present application, considering that there may be a need for fixing a background image in some scenes, some modifications may be made to the structure of an existing algorithm model, etc., to adapt to the need.
Specifically, since the input and output images of the first algorithm model are usually rectangular images of the same size, the actual face occupies only a partial region of the input image, and the remainder belongs to the background (a face region cannot be rectangular, and because the input image must contain the complete face, part of the background is inevitably captured when the rectangular input image is cropped). If all pixels in such a rectangular input image were directly stylized, the background part would be stylized as well; but the background does not need stylization, and stylizing it makes it blurred and distorted. In some scenes (for example, real-time stylization while shooting a short video), the stylized face image generated by the algorithm model needs to be pasted back into the original image, and a blurred background would make it difficult to blend the stylized face image smoothly into the original image, harming the final visual effect. In these scenes, the background image therefore needs to be kept fixed during stylization.
To meet this need, the embodiment of the present application improves the existing generative adversarial network model. Since the first algorithm model includes a generation network part and a discrimination network part, the generation network part may add a mean-error (L1) loss term for the background region and ignore the adversarial loss for the background region, constraining the image generated by the generation network to have the same background as the original image. Meanwhile, because the image produced by the generation network is fed to the discriminator, a discrimination loss term exists in the process, and this loss is usually computed on the whole image; if the generated background were found to differ from the stylized appearance, the discriminator would penalize the generated background part, pushing the generation network to stylize the background as well. Therefore, while adding the mean-error (L1) loss term for the background area in the generation network, the branch responsible for discriminating the whole image can also be removed from the discrimination network part, so that the discriminator no longer needs to judge the background portion of the images output by the generation network. In this way, the image of the background area can be prevented from being stylized in the process of converting the real face image into the stylized face image.
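A sketch of the background-fixing generator loss under these assumptions: a face/background mask is available from a separate segmentation step, the adversarial logits come from a discriminator whose whole-image branch has been removed, and the background weight is illustrative.

```python
import torch
import torch.nn.functional as F

def generator_loss(output, input_img, face_mask, logits_fake, bg_weight=10.0):
    """Adversarial term plus an L1 term that ties background pixels to the
    input; with the whole-image discriminator branch removed, the background
    receives no adversarial gradient and so stays un-stylized."""
    bg_mask = 1.0 - face_mask
    l1_bg = (bg_mask * (output - input_img).abs()).sum() / bg_mask.sum().clamp(min=1.0)
    adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    return adv + bg_weight * l1_bg
```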
In summary, after a first algorithm model is trained using a first data set and a second data set, a stylized face image having a specific style can be output by inputting a real face image into such a first algorithm model, and there is a pairing relationship between the input image and the output image of the first algorithm model.
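Producing the third data set then amounts to running the trained first algorithm model over the real face images and storing each input/output pair, roughly as in the following sketch (the model interface and data loader are assumptions):

```python
import torch

@torch.no_grad()
def build_paired_dataset(first_model, real_face_loader):
    """Run every real face through the trained first model and keep the
    (real, stylized) pair; the result is the supervised third data set."""
    first_model.eval()
    pairs = []
    for real in real_face_loader:
        styled = first_model(real)           # stylized output paired to its input
        pairs.append((real.cpu(), styled.cpu()))
    return pairs
```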
S203: and taking the third data set as a training sample, training a second algorithm model in a supervised learning mode so as to distribute the second algorithm model to the terminal equipment where the client is located, wherein the client is used for converting the real facial image acquired by the terminal equipment into the stylized facial image with the target style through the second algorithm model.
After a third data set consisting of the real face image and the stylized face image in the pairing relationship is obtained through the first algorithm model, the second algorithm model can be trained by taking the third data set as a training sample, and the training target is to learn the mapping relationship of the stylized face image converted from the real face image to the target style, so that when one real face image is input into the second algorithm model, how to process the real face image can be known, and the stylized face image with the target style can be output. That is, after training is completed, when a real face image is input to the second algorithm model, a stylized face image having a target style may be output.
And because the images of the two domains in the training sample have a pairing relation, supervised learning of the second algorithm model can be realized. The second algorithm model based on supervised learning can be different from the first algorithm model in terms of operators, modules, depth, width, resolution and the like, and can be greatly simplified in terms of operation amount. Specifically, the operator, the module, the depth, the width and the resolution can be determined appropriately by analyzing the operating efficiency of the operator, the module, the depth, the width, the resolution and the like on a central processor, a graphic processor and the like of the terminal device, so as to obtain a lightweight second algorithm model, and make the second algorithm model suitable for operating on the mobile terminal device.
It should be noted here that, since different mobile terminal devices have different hardware resource configurations (including the aforementioned central processor, graphics processor, etc.), the capability of running the algorithm model may also be different, and for the second algorithm model, although a smaller model structure may be selected than the first algorithm model, under the condition that the mobile terminal side can bear the load, if the second algorithm model is increased as much as possible, it is also beneficial to obtain a better processing effect. Therefore, during specific implementation, the models of various mobile terminal devices can be collected, hardware conditions of the models are analyzed respectively, suitable model scales on the models are determined, second algorithm models with different sizes can be trained respectively, and the trained second algorithm models can be issued to the terminal devices of the corresponding models to operate respectively. Thus, the face stylization processing effect can be enhanced as much as possible while achieving weight reduction of the end-side model. Of course, when the second algorithm models with different sizes are specifically trained, the training data and the training targets used may be consistent as a whole, but may be slightly different in terms of the number of training data, the number of iterations, the accuracy of the final output face stylization result, and the like.
In summary, after the second algorithm model is designed, it can be trained using the stylized facial image and the real facial image having a pairing relationship in the aforementioned third data set. Moreover, since the fixed background image is already realized when the first algorithm model is trained, that is, the background part in the image is not stylized, when such data is used as a training sample, the processing result of the second algorithm model can also realize the fixed background image.
Specifically, a model such as U-Net can be used as the second algorithm model. However, if an existing U-Net model is used directly, the L1 loss is generally computed over the whole of the input and output images; that is, every pixel is considered equally important and is pushed toward the original image, which results in a smoother (blurrier) overall image. Yet the human eye perceives different contents of an image differently: edges are perceived sharply, so a lack of clarity in non-edge regions may go unnoticed, while the foreground is perceived more clearly than the background, whose blurring is less noticeable. Judged by human perception, the uniform L1 loss is therefore problematic. For this reason, an adversarial loss and a perceptual loss can be added to the loss function of the existing model to improve the clarity of image edges and the foreground, and thereby obtain a better experience.
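A sketch of such a combined supervised loss, assuming a PyTorch setup with a frozen VGG16 feature extractor standing in for the perceptual term; the layer cut-off and loss weights are illustrative, and input normalization is omitted for brevity.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG16 features up to roughly relu3_3 act as the perceptual extractor.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad = False

def supervised_loss(output, target, logits_fake, w_adv=0.1, w_perc=1.0):
    l1 = F.l1_loss(output, target)                  # per-pixel term
    adv = F.binary_cross_entropy_with_logits(       # adversarial term
        logits_fake, torch.ones_like(logits_fake))
    perc = F.l1_loss(vgg(output), vgg(target))      # feature-space term
    return l1 + w_adv * adv + w_perc * perc
```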
In addition, in order to perform edge enhancement on the generated stylized face images (including the edges of the face, glasses, nose, ornaments such as hair accessories, and the like), edge recognition may be performed on the stylized face images in the third data set and the edge portions may be blurred, producing stylized face images with blurred edges. Thus, in training the second algorithm model, the training data may include three types: the original stylized face images in the third data set, the paired real face images, and the edge-blurred stylized face images. The sharp-edged and blurred-edged stylized face images are both input to the discriminator for training, so that the discriminator penalizes images with blurred edges. In this way, the trained second algorithm model achieves edge enhancement in the stylized face images it outputs.
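A sketch of producing the edge-blurred negatives with OpenCV; the Canny thresholds, dilation band and blur kernel are illustrative assumptions.

```python
import cv2
import numpy as np

def blur_edges(styled_bgr, low=100, high=200, band=7, ksize=(9, 9)):
    """Detect edges, dilate them into a band, and blur only that band,
    producing the blurred-edge negative sample for the discriminator."""
    gray = cv2.cvtColor(styled_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    band_mask = cv2.dilate(edges, np.ones((band, band), np.uint8)) > 0
    blurred = cv2.GaussianBlur(styled_bgr, ksize, 0)
    out = styled_bgr.copy()
    out[band_mask] = blurred[band_mask]   # blur only around the edges
    return out
```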
In addition, a discrimination network can be included in the second algorithm model, and this discrimination network can have global discrimination capability, local discrimination capability and an attention mechanism. Global discrimination capability refers to processing the whole image and outputting a single value representing whether the image is a real or fake stylized face image, then applying a loss so that the generated image as a whole tends toward a real stylized image. Local discrimination capability means that after the generated image is input to the discriminator, the discriminator produces a grid of values, each corresponding to a pixel block of the original image (for example, a block covering a region near the nose), and judges whether the generated stylized image is real or fake at the pixel-block level. The attention mechanism allows the algorithm model to selectively focus on part of the information while ignoring the rest, so that limited computing resources are used more reasonably.
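A compact sketch of a discriminator combining the three capabilities, with a squeeze-and-excite block standing in for the attention mechanism; all channel counts and layer choices are assumptions, not the disclosed network.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excite style channel attention."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze: global average pool
        return x * w[:, :, None, None]       # excite: re-weight channels

class DualHeadDiscriminator(nn.Module):
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2),
            ChannelAttention(ch * 2),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.LeakyReLU(0.2),
        )
        self.local_head = nn.Conv2d(ch * 4, 1, 3, 1, 1)  # one logit per patch
        self.global_head = nn.Linear(ch * 4, 1)          # one logit per image

    def forward(self, x):
        f = self.backbone(x)
        return self.local_head(f), self.global_head(f.mean(dim=(2, 3)))
```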
Furthermore, in order to improve the robustness of the second algorithm model, data enhancement processing may be performed on the face images in the third data set; for example, random cropping, random scaling, or random optical distortion. Random cropping here means cropping a pair of paired face images consistently, with the same offset direction and the same number of offset pixels; for example, the leftmost ten pixels in the x-axis direction are cropped away to obtain a new image. The direction and amount of offset may then be random across different face image pairs. Random scaling enlarges or reduces the paired face images in the third data set, applied identically within the same image pair but with a scale that may be random across different pairs. The optical distortion processing is mainly directed at the real face images in the third data set: their quality may be deliberately degraded by adding some noise, brightening or darkening them, and so on. After the data in the third data set is processed in the above manner and used to train the second algorithm model, the model gains higher robustness. For example, when the second algorithm model performs real-time stylization on an actually acquired real face image, even if that image is of poor quality, contains noise, or was captured in a dark environment and is therefore dark, the stylized image generated by the second algorithm model can still be of high quality.
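A sketch of the paired augmentation, assuming NumPy image arrays: the crop offsets are shared within a pair, while the optical distortion touches only the real image; all ranges are illustrative.

```python
import random
import numpy as np

def augment_pair(real, styled, crop=16):
    """Apply the SAME random crop to both images of a pair, then degrade
    only the real image with noise and a brightness shift."""
    h, w = real.shape[:2]
    dx, dy = random.randint(0, crop), random.randint(0, crop)
    real = real[dy:h - crop + dy, dx:w - crop + dx]
    styled = styled[dy:h - crop + dy, dx:w - crop + dx]

    noise = np.random.normal(0, 5, real.shape)
    real = np.clip(real.astype(np.float32) + noise + random.uniform(-20, 20),
                   0, 255).astype(np.uint8)
    return real, styled
```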
After the second algorithm model is trained, the server may issue it to the terminal devices on which the corresponding client is installed. Certainly, in a specific implementation, the server may train the second algorithm model separately for multiple target styles to obtain multiple different sets of parameter values, and these sets may all be issued to the terminal device where the client is located. Thus, while using the client for live broadcast or for shooting short videos and photos, a stylization option can be provided in the shooting interface; after the user selects the stylization function, several selectable styles can be offered, and once the user selects one, the second algorithm model is run locally on the terminal device where the client is located. Correspondingly, the real face image is cropped out of the image collected by the terminal device, and a stylized face image of the corresponding style is generated using the second algorithm model with the parameter values of the currently selected style. Then, the stylized face image can be pasted back into the image acquired by the terminal device, so that the user who is shooting can see the stylized face image on the screen of the terminal device; in addition, in the generated short video or the video recorded during the live broadcast, the stylized face image replaces the real face image in each frame.
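On the client, the per-frame flow reduces to crop, stylize, paste back, roughly as below; detect_face_box and the tensor conversion helpers are hypothetical stand-ins for whatever on-device face detector and preprocessing the client actually uses, and the model is assumed to preserve the crop resolution.

```python
import torch

@torch.no_grad()
def stylize_frame(frame, second_model, detect_face_box, to_tensor, to_image):
    """Crop the face, stylize it with the downloaded model, paste it back."""
    x0, y0, x1, y1 = detect_face_box(frame)   # hypothetical face detector
    face = frame[y0:y1, x0:x1]
    styled = to_image(second_model(to_tensor(face)))
    frame[y0:y1, x0:x1] = styled              # paste the stylized face back
    return frame
```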
The stylization process may be applied in a variety of actual scenarios. For example, a commodity information service system may provide services to a variety of users such as buyers and sellers. Facing buyer users, a user can publish a "buyer show" based on the service, that is, shoot and publish short videos, photos and the like of clothes bought from the system for other buyer users to view. When shooting such short videos or photos, for purposes such as protecting user privacy and adding fun, a stylization option can be provided after the user starts shooting; at this moment, the client can use the algorithm model provided by the embodiment of the present application to stylize any real face entering the shot picture. Facing seller users, a seller can shoot a "seller show" and publish videos or photos of the goods they sell, so that buyer users can browse them to help make purchase decisions. Seller users may need to appear on camera in person while shooting their goods; stylizing their real faces, where the stylized face images are usually more attractive, means sellers do not need to hire good-looking models, so the stylization function plays a beautifying, appearance-improving role. In addition, for sellers whose goods target a specific crowd, the anchor's face can be converted through stylization into a face image in the style of the corresponding crowd while shooting short videos or live broadcasting, thereby creating a suitable store atmosphere. For example, a seller who mainly sells children's clothing may still need an adult anchor to explain the goods when recording explanation videos or live streaming; converting the anchor's face image into a child-style face image makes it consistent with the atmosphere of the current selling scene, and thus helps create the atmosphere of the marketplace.
In summary, according to the embodiment of the present application, the generation process of the stylized processing model can be divided into a plurality of stages, in the first stage, a first data set composed of a plurality of stylized face images with a target style and a second data set composed of a plurality of real face images can be obtained, and there may not be a pairing relationship between the stylized face images in the first data set and the real face images in the second data set. Then, in the second stage, the first data set and the second data set can be used as training samples, and the first algorithm model is trained in an unsupervised learning manner, so that a third data set consisting of a real face image and a stylized face image with a pairing relationship is obtained through the first algorithm model. In the third stage, the third data set can be used as a training sample, a second algorithm model is trained in a supervised learning mode, and then the second algorithm model can be distributed to the terminal equipment where the client is located, so that the client can convert the acquired real face image into the stylized face image with the target style by using the second algorithm model. By the method, the three tasks of stylized data production, paired image data production and mobile terminal image translation can be decoupled. Moreover, the training of the second algorithm model can be completed at lower cost without performing stylized face image design for matching the real face images by designers, experts and the like, so that stylized face images with certain styles can be correspondingly output under the condition of inputting one real face image; and because the training of the second algorithm model can be realized in a supervision mode, the control of the operation amount can be realized, so that the operation can be realized at the client side, and the real-time stylized processing can be realized at the client side.
In optional embodiments, the algorithms within each stage, the hand-off between stages, or the data can be further optimized. For example, in the first stage, pre-training the algorithm model on real face data and fixing some of its parameters reduces the amount of stylized raw material needed, so a third model can be trained from a very small number of stylized face image raw materials and then used to generate a large quantity of stylized face data; in addition, new styles can be created by fusing the parameters of models trained on different styles. In the second stage, the background can be kept fixed during stylization by adding a mean absolute error (L1) loss term for the background region in the generation network part, ignoring the adversarial loss term for the background region, and removing the branch of the discrimination network part that judges the whole image, which avoids problems such as a blurred background caused by stylizing the background region. In the third stage, the edges of the stylized face images generated by the second algorithm model can be improved by adding adversarial and perceptual loss terms and by including edge-blurred stylized face images in the training data; the robustness of the algorithm model can also be improved by applying data enhancement to the face images in the third data set, and so on.
It should be noted that the embodiments of this application may use user data. In practical applications, user-specific personal data may be used in the schemes described here only within the scope permitted by the applicable laws and regulations of the relevant country and subject to their requirements (for example, with the user's explicit consent, after informing the user, and so on).
Corresponding to the foregoing method embodiments, an embodiment of the present application further provides a facial image processing apparatus. Referring to fig. 3, the apparatus may include:
a data generating unit 301, configured to obtain a first data set composed of a plurality of stylized face images with a target style, and a second data set composed of a plurality of real face images, where there is no pairing relationship between the stylized face images in the first data set and the real face images in the second data set;
an unsupervised learning unit 302, configured to train a first algorithm model in an unsupervised learning manner by using the first data set and the second data set as training samples, so as to obtain a third data set composed of a real face image and a stylized face image with a pairing relationship through the first algorithm model;
the supervised learning unit 303 is configured to train the second algorithm model in a supervised learning manner by using the third data set as a training sample so as to distribute the second algorithm model to the terminal device where the client is located, where the client is configured to convert the real face image acquired by the terminal device into the stylized face image of the target style through the second algorithm model.
The data generation unit may specifically include:
a raw material collection subunit, configured to collect a first number of stylized face image raw materials related to the target style;
a generative model training subunit, configured to train a third algorithm model by using the raw materials;
a stylized face image generation subunit, configured to obtain a second number of stylized face images according to the trained third algorithm model, so as to form the first data set;
wherein the first number is less than the second number.
In order to reduce the amount of training data required by the third algorithm model, the apparatus may further comprise:
a pre-training unit, configured to pre-train the third algorithm model using a plurality of real face images;
The generative model training subunit may be specifically configured to:
perform secondary training, by using the raw materials, on the basis of the pre-trained third algorithm model.
In addition, before the secondary training, the values of some parameters obtained by the pre-training may be fixed in the third algorithm model; these parameters are the ones related to features common to real face images and stylized face images.
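As a hedged illustration of fixing pre-trained parameter values, the PyTorch sketch below freezes the early layers of a toy generator before secondary training; the assumption that exactly those layers carry the features common to real and stylized faces, and the layer sizes, are illustrative:

```python
import torch
import torch.nn as nn

# Toy generator; the first two conv blocks stand in for the parameters
# assumed to encode features shared by real and stylized faces.
generator = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),   # shared low-level features
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),  # shared mid-level features
    nn.Conv2d(32, 3, 3, padding=1),              # style-specific output
)

# Fix the values obtained by pre-training on real faces...
for layer in list(generator.children())[:4]:
    for p in layer.parameters():
        p.requires_grad = False

# ...so secondary training on the small stylized raw-material set only
# updates the remaining, style-specific parameters.
optimizer = torch.optim.Adam(
    (p for p in generator.parameters() if p.requires_grad), lr=1e-4)
```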
Furthermore, the apparatus may further include:
a first parameter fusion unit, configured to fuse the parameter values of the pre-training result with those of the secondary training result, so as to correct errors in the parameter values obtained by the secondary training;
The stylized face image generation subunit may be specifically configured to:
obtain the second number of stylized face images according to the third algorithm model after the parameter value fusion.
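A minimal sketch of parameter value fusion, assuming it amounts to element-wise linear interpolation between the pre-trained and secondarily trained parameter values; the weight of 0.3 and the toy linear layers are illustrative assumptions:

```python
import torch
import torch.nn as nn

def fuse_state_dicts(pretrained, finetuned, alpha=0.3):
    """Interpolate each parameter tensor; alpha weights the pre-trained
    values and acts as an error-correcting pull-back on the values
    obtained by secondary training."""
    return {k: alpha * pretrained[k] + (1.0 - alpha) * finetuned[k]
            for k in finetuned}

pre = nn.Linear(4, 4)    # stands in for the pre-trained third model
fine = nn.Linear(4, 4)   # stands in for the secondarily trained model
fine.load_state_dict(fuse_state_dicts(pre.state_dict(), fine.state_dict()))
```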
In addition, the apparatus may further include:
a fourth algorithm model training unit, configured to train, on the basis of the pre-trained third algorithm model, a fourth algorithm model for generating random vectors;
a random vector generation unit, configured to generate, when the second number of stylized face images are obtained from the third algorithm model produced by secondary training, random vectors through the fourth algorithm model, the random vectors serving as input to the third algorithm model so as to control the distribution of the stylized face images output by the third algorithm model.
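The structure of the fourth algorithm model is not spelled out here; one hedged reading is a small mapping network that reshapes Gaussian noise into latent vectors for the third model, so that the latent distribution (and hence the output distribution) can be controlled. The architecture and the 512-dimensional latent size below are assumptions:

```python
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """Hypothetical fourth model: maps Gaussian noise to latent vectors
    whose distribution can be shaped during its own training, then feeds
    them to the third (generator) model."""
    def __init__(self, dim=512):
        super().__init__()
        self.dim = dim
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2),
                                 nn.Linear(dim, dim))

    def forward(self, n):
        z = torch.randn(n, self.dim)
        return self.net(z)   # random vectors for the third model's input

latents = LatentSampler()(16)   # shape (16, 512)
```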
Alternatively, the data generation unit may specifically include:
a multi-style raw material collection subunit, configured to collect stylized face image raw materials of at least two styles;
a multi-style training subunit, configured to train the third algorithm model with the raw materials corresponding to each of the at least two styles, so as to obtain at least two groups of parameter values corresponding to the at least two styles;
a fusion subunit, configured to fuse the at least two groups of parameter values into a fused group of parameter values;
a generation subunit, configured to obtain a plurality of stylized face images with the target style according to the third algorithm model loaded with the fused parameter values.
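A sketch of fusing two groups of style parameters, again assuming element-wise linear interpolation; the optional per-parameter weights (for example, taking coarse structure mostly from one style and texture mostly from the other) and all names are illustrative assumptions:

```python
def fuse_styles(style_a, style_b, alpha=0.5, per_layer=None):
    """Blend two groups of parameter values trained on two different
    stylized raw-material sets into parameters for a new, fused style."""
    per_layer = per_layer or {}
    return {name: per_layer.get(name, alpha) * style_a[name]
                  + (1.0 - per_layer.get(name, alpha)) * style_b[name]
            for name in style_a}

mixed = fuse_styles({"w": 1.0}, {"w": 3.0}, alpha=0.25)   # {"w": 2.5}
```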
In addition, in the second stage, the first algorithm model may include a generation network part and a discrimination network part. A mean absolute error (L1) loss term for the background region can be added in the generation network part while the adversarial loss term for the background region is ignored, and the branch that judges the whole image can be removed from the discrimination network part, so that the background region is not stylized while the real face image is converted into a stylized face image.
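As a hedged sketch of these loss modifications, the generator loss below combines an L1 term restricted to the background region with an adversarial term evaluated only on the face region; the mask source (e.g. a face parser), the tensor shapes, and the patch-level discriminator score are assumptions:

```python
import torch
import torch.nn.functional as F

def generator_losses(fake, real_input, face_mask, d_face_score):
    """face_mask: 1 inside the face region, 0 in the background."""
    bg = 1.0 - face_mask
    # L1 term pinning the generated background to the input background.
    l1_background = F.l1_loss(fake * bg, real_input * bg)
    # Adversarial term on the face region only; the background's
    # adversarial contribution is simply never computed.
    adv_face = F.binary_cross_entropy_with_logits(
        d_face_score, torch.ones_like(d_face_score))
    return l1_background + adv_face

loss = generator_losses(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
                        torch.rand(2, 1, 64, 64), torch.randn(2, 1, 8, 8))
```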
In addition, besides a pixel loss, the loss function of the second algorithm model may further include an adversarial loss term and a perceptual loss term.
Furthermore, the apparatus may further include:
a blurring unit, configured to perform edge recognition on the stylized face images in the third data set and blur the edge portions;
The supervised learning unit may be specifically configured to:
train the second algorithm model in a supervised learning manner using, as training samples, the third data set together with the stylized face images whose edge portions were blurred, and provide an edge-improvement adversarial loss in the second algorithm model, so that the stylized face images generated by the second algorithm model have improved edges.
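One plausible way to produce the edge-blurred stylized samples is to detect edges, dilate them into a band, and replace the band with a blurred version of the image; the Canny thresholds and kernel sizes below are assumptions:

```python
import cv2
import numpy as np

def blur_edges(img_bgr, ksize=9, edge_dilate=5):
    """Return a copy of the image whose edge band has been blurred,
    usable as a negative sample for an edge-improvement adversarial loss."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    band = cv2.dilate(edges, np.ones((edge_dilate, edge_dilate), np.uint8))
    blurred = cv2.GaussianBlur(img_bgr, (ksize, ksize), 0)
    out = img_bgr.copy()
    out[band > 0] = blurred[band > 0]
    return out

demo = blur_edges((np.random.rand(128, 128, 3) * 255).astype(np.uint8))
```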
The second algorithm model may include a discrimination network having global discrimination capability, local discrimination capability, and an attention mechanism.
Furthermore, the apparatus may further include:
a data enhancement processing unit, configured to perform data enhancement on the face images in the third data set, where the data enhancement includes random cropping, random scaling, or random optical distortion.
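A sketch of these augmentations using the albumentations library; the crop size, scale limit, and distortion limit are illustrative, and the additional_targets entry reflects the assumption that both images of a real/stylized pair must receive identical random parameters to stay aligned:

```python
import numpy as np
import albumentations as A

augment = A.Compose([
    A.RandomScale(scale_limit=0.2, p=0.5),            # random scaling
    A.RandomCrop(192, 192),                           # random cropping
    A.OpticalDistortion(distort_limit=0.05, p=0.5),   # optical distortion
], additional_targets={"stylized": "image"})

real = np.zeros((256, 256, 3), dtype=np.uint8)
out = augment(image=real, stylized=real.copy())
aug_real, aug_stylized = out["image"], out["stylized"]
```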
The client may include a client provided by a commodity information service system;
the client is configured to:
provide a stylized processing option after receiving a user's request to shoot a short video/photo for a target commodity;
determine a target style in response to a request initiated through the stylized processing option, and crop a real face image from an original image collected by the terminal device;
convert the real face image into a stylized face image of the target style through the second algorithm model corresponding to the target style, and paste the stylized face image back into the original image.
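A minimal sketch of this client-side flow, assuming a face box supplied by an unspecified upstream detector and a model that preserves the crop's spatial size; all names are hypothetical:

```python
import numpy as np

def stylize_frame(frame, face_box, model):
    """frame: HxWx3 uint8 image; face_box: (x, y, w, h); model: the
    distributed second algorithm model, mapping a face crop to its
    stylized version of the same size."""
    x, y, w, h = face_box
    face = frame[y:y + h, x:x + w]
    stylized = model(face)
    out = frame.copy()
    out[y:y + h, x:x + w] = stylized   # paste back into the original image
    return out

frame = np.zeros((480, 640, 3), dtype=np.uint8)
result = stylize_frame(frame, (100, 80, 128, 128), lambda f: f)  # identity stand-in
```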
In addition, the present application further provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the steps of the method described in any of the preceding method embodiments.
And an electronic device comprising:
one or more processors; and
memory associated with the one or more processors for storing program instructions which, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
Fig. 4 schematically shows an architecture of an electronic device, which may specifically include a processor 410, a video display adapter 411, a disk drive 412, an input/output interface 413, a network interface 414, and a memory 420. The processor 410, the video display adapter 411, the disk drive 412, the input/output interface 413, the network interface 414, and the memory 420 may be communicatively connected by a communication bus 430.
The processor 410 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the present application.
The memory 420 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 420 may store an operating system 421 for controlling the operation of the electronic device 400 and a basic input/output system (BIOS) for controlling low-level operations of the electronic device 400. In addition, a web browser 423, a data storage management system 424, a facial image processing system 425, and the like may also be stored. The facial image processing system 425 may be an application program that implements the operations of the foregoing steps of this embodiment. In summary, when the technical solution provided by the present application is implemented in software or firmware, the relevant program code is stored in the memory 420 and is called and executed by the processor 410.
The input/output interface 413 is used to connect an input/output module for inputting and outputting information. The input/output module may be configured in the device as a component (not shown in the figure) or externally connected to the device to provide the corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, and various sensors, and output devices may include a display, a speaker, a vibrator, indicator lights, and the like.
The network interface 414 is used to connect a communication module (not shown in the figure) to implement communication between this device and other devices. The communication module may communicate in a wired manner (for example, USB or a network cable) or wirelessly (for example, via a mobile network, WiFi, or Bluetooth).
Bus 430 includes a path that transfers information between the various components of the device, such as processor 410, video display adapter 411, disk drive 412, input/output interface 413, network interface 414, and memory 420.
It should be noted that although only the processor 410, the video display adapter 411, the disk drive 412, the input/output interface 413, the network interface 414, the memory 420, and the bus 430 are shown above, in a specific implementation the device may also include other components necessary for normal operation. Moreover, those skilled in the art will understand that the device may also include only the components necessary to implement the embodiments of the present application, without all of the components shown in the figure.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the descriptions of the method embodiments for the relevant points. The system embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of a given embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
The face image processing method and the electronic device provided by the present application have been introduced in detail above. Specific examples have been used to explain the principles and implementations of the present application, and the descriptions of the above embodiments are intended only to aid understanding of the method of the present application and of its core ideas. Meanwhile, those skilled in the art may, following the ideas of the present application, make changes to the specific implementations and the scope of application. In view of the above, the content of this specification should not be construed as limiting the present application.

Claims (14)

1. A facial image processing method, comprising:
acquiring a first data set consisting of a plurality of stylized face images with a target style and a second data set consisting of a plurality of real face images, wherein there is no pairing relationship between the stylized face images in the first data set and the real face images in the second data set;
taking the first data set and the second data set as training samples, and training a first algorithm model in an unsupervised learning manner, so as to obtain, through the first algorithm model, a third data set consisting of real face images and stylized face images having a pairing relationship;
and taking the third data set as a training sample, and training a second algorithm model in a supervised learning manner, so as to distribute the second algorithm model to the terminal device where a client is located, wherein the client is configured to convert the real face image collected by the terminal device into a stylized face image of the target style through the second algorithm model.
2. The method of claim 1,
the obtaining a first data set consisting of a plurality of stylized facial images having a target style comprises:
collecting a first number of stylized face image raw materials related to the target style;
training a third algorithm model by using the raw materials;
obtaining a second number of stylized face images according to the trained third algorithm model, so as to form the first data set;
wherein the first number is less than the second number.
3. The method of claim 2, further comprising:
pre-training the third algorithm model using a plurality of real facial images;
the training of the third algorithm model by using the raw materials comprises:
performing secondary training, by using the raw materials, on the basis of the pre-trained third algorithm model.
4. The method of claim 3, further comprising:
before the secondary training, fixing, in the third algorithm model, the values of some parameters obtained by the pre-training, wherein the parameters are those related to features common to the real face images and the stylized face images.
5. The method of claim 3, further comprising:
performing parameter value fusion on the pre-training result and the secondary training result, so as to correct errors in the parameter values obtained by the secondary training;
the obtaining a second number of stylized facial images according to a trained third algorithm model includes:
and obtaining a second number of stylized face images according to the third algorithm model after the parameter value fusion.
6. The method of claim 3, further comprising:
training a fourth algorithm model for generating random vectors on the basis of the third algorithm model after pre-training;
and when the second number of stylized face images are obtained according to the third algorithm model obtained by the secondary training, generating a random vector through the fourth algorithm model and using the random vector as the input of the third algorithm model, so as to control the distribution of the stylized face images output by the third algorithm model.
7. The method of claim 1,
the obtaining a first data set consisting of a plurality of stylized facial images having a target style comprises:
collecting stylized face image raw materials of at least two styles respectively;
respectively training a third algorithm model by using the raw materials corresponding to the at least two styles to obtain at least two groups of parameter values corresponding to the at least two styles;
fusing the at least two groups of parameter values to obtain fused parameter values;
and obtaining a plurality of stylized face images with the target style according to the third algorithm model and the fused parameter values.
8. The method of claim 1,
the first algorithm model comprises a generation network part and a judgment network part;
the method further comprises the following steps:
adding a mean absolute error (L1) loss term for the background region in the generation network part, ignoring the adversarial loss term for the background region, and removing, in the discrimination network part, the branch involved in discriminating the whole image, so as to prevent the image of the background region from being stylized in the process of converting the real face image into the stylized face image.
9. The method of claim 1,
in addition to a pixel loss, the loss function of the second algorithm model comprises an adversarial loss and a perceptual loss.
10. The method of claim 1, further comprising:
performing edge recognition on the stylized face images in the third data set, and blurring the edge portions;
the training of the second algorithm model in a supervised learning manner by taking the third data set as a training sample comprises:
training the second algorithm model in a supervised learning manner by taking, as training samples, the third data set together with the stylized face images whose edge portions were blurred, and providing an edge-improvement adversarial loss in the second algorithm model, so that the stylized face images generated by the second algorithm model have improved edges.
11. The method of claim 1,
the second algorithm model comprises a discrimination network, and the discrimination network has global discrimination capability, local discrimination capability, and an attention mechanism.
12. The method according to any one of claims 1 to 11,
the client comprises a client provided by a commodity information service system;
the client is used for:
providing a stylized processing option after receiving a user's request for live broadcasting or for shooting a short video/photo of a target commodity;
determining a target style in response to a request initiated through the stylized processing option, and cropping a real face image from an original image collected by the terminal device;
and converting the real face image into a stylized face image of the target style through a second algorithm model corresponding to the target style, and pasting the stylized face image back into the original image.
13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 12.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 12.
CN202210860811.3A 2022-07-21 2022-07-21 Face image processing method and electronic equipment Pending CN115393177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210860811.3A CN115393177A (en) 2022-07-21 2022-07-21 Face image processing method and electronic equipment


Publications (1)

Publication Number Publication Date
CN115393177A true CN115393177A (en) 2022-11-25

Family

ID=84116019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210860811.3A Pending CN115393177A (en) 2022-07-21 2022-07-21 Face image processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN115393177A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576245A (en) * 2024-01-15 2024-02-20 腾讯科技(深圳)有限公司 Method and device for converting style of image, electronic equipment and storage medium
CN117576245B (en) * 2024-01-15 2024-05-07 腾讯科技(深圳)有限公司 Method and device for converting style of image, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination