CN114202615A - Facial expression reconstruction method, device, equipment and storage medium - Google Patents

Facial expression reconstruction method, device, equipment and storage medium

Info

Publication number
CN114202615A
Authority
CN
China
Prior art keywords
face
expression
key point
facial
coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111503555.4A
Other languages
Chinese (zh)
Inventor
姚粤汉
张�雄
彭国柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202111503555.4A priority Critical patent/CN114202615A/en
Publication of CN114202615A publication Critical patent/CN114202615A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T 2219/20 Indexing scheme for editing of 3D models
    • G06T 2219/2021 Shape modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a facial expression reconstruction method, device, equipment and storage medium. A face image is acquired and input into a pre-trained deep neural network model, which outputs an expression coefficient and shooting parameters of the face image. The expression coefficient represents the weight of each basic expression template; every weight is greater than or equal to zero and at least one weight is greater than zero, so at least one basic expression template can be selected from a preset basic expression template library according to the expression coefficient. Facial expression reconstruction is then performed according to the expression coefficient, the at least one basic expression template and the shooting parameters. Because the method captures the facial expression of a single face image in a deep learning manner, the obtained expression coefficient and shooting parameters are highly accurate, the result obtained in the subsequent facial expression reconstruction is more accurate, the extensive optimization required by a traditional model is avoided, and the amount of computation is greatly reduced.

Description

Facial expression reconstruction method, device, equipment and storage medium
Technical Field
The application relates to the technical field of network live broadcast, in particular to a facial expression reconstruction method, a device, equipment and a storage medium.
Background
With the development of computer vision technology, facial expression reconstruction has been widely applied in fields such as games, live broadcast and AR. Current facial expression reconstruction methods can be roughly divided into two types according to their input data: facial expression capture based on a depth image and facial expression capture based on a single picture. The depth-image-based method usually requires expensive depth camera equipment, while the single-picture-based method, although requiring only simple equipment, needs additional key point detection and a complex optimization process, which makes its operation cumbersome and inefficient.
Disclosure of Invention
In view of this, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for reconstructing a facial expression.
In a first aspect, an embodiment of the present application provides a method for reconstructing a facial expression, where the method includes:
acquiring a face image;
inputting the face image into a pre-trained deep neural network model to output an expression coefficient and a shooting parameter of the face image;
selecting at least one basic expression template from a preset basic expression template library according to the expression coefficient;
and reconstructing the facial expression according to the expression coefficient, at least one basic expression template and the shooting parameters.
In a second aspect, an embodiment of the present application provides an apparatus for reconstructing a facial expression, where the apparatus includes:
the face image acquisition module is used for acquiring a face image;
the coefficient and parameter output module is used for inputting the face image into a pre-trained deep neural network model so as to output the expression coefficient and shooting parameters of the face image;
the template selection module is used for selecting at least one basic expression template from a preset basic expression template library according to the expression coefficient;
and the facial expression reconstruction module is used for reconstructing facial expressions according to the expression coefficients, at least one basic expression template and the shooting parameters.
In a third aspect, an embodiment of the present application provides a terminal device, including: a memory; one or more processors coupled with the memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to perform the facial expression reconstruction method provided by the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code may be called by a processor to execute the facial expression reconstruction method provided in the first aspect.
According to the method, the device, the equipment and the storage medium for reconstructing the facial expression, firstly, a facial image is obtained; then inputting the face image into a pre-trained deep neural network model to output an expression coefficient and a shooting parameter of the face image; the expression coefficients represent the weights of all basic expression templates, the value of each weight is greater than or equal to zero, and the value of at least one weight is greater than zero, so that at least one basic expression template can be selected from a preset basic expression template library according to the expression coefficients; and finally, facial expression reconstruction is carried out according to the expression coefficient, at least one basic expression template and the shooting parameters.
The facial expression reconstruction method captures the facial expression of a single face image in a deep learning manner, so the obtained expression coefficient and shooting parameters are highly accurate, the result obtained when subsequently reconstructing the facial expression is more accurate, the extensive optimization required by a traditional model is avoided, and the amount of computation is greatly reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a reconstruction method of a facial expression provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for reconstructing a facial expression according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a reconstructed facial expression according to an embodiment of the present application;
FIG. 4 is a schematic view of a Blendshape structure according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a face image sample according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a structure for selecting a curve on a face contour according to an embodiment of the present application;
fig. 7 is a block diagram of a facial expression reconstruction apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer-readable storage medium provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below, and it should be understood that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For more detailed explanation of the present application, a facial expression reconstruction method, an apparatus, a terminal device and a computer storage medium provided in the present application are specifically described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application scenario of a reconstruction method of a facial expression provided in an embodiment of the present application, where the application scenario includes a server 102, a live broadcast end 104, and a client 106 provided in an embodiment of the present application. Wherein, a network is arranged among the server 102, the live end 104 and the client 106. The network is used to provide a medium for communication links between the server 102, the live end 104, and the client 106. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The server 102 can communicate with the live end 104 and the client 106 to provide live services to the live end 104 or/and the client 106. For example, the live end 104 may send a live video stream of a live room to the server 102, and a user may access the server 102 through the client 106 to view the live video of the live room. For another example, the server 102 may also send a notification message to the user's client 106 when the user subscribes to a live room. The live video stream can be a video stream currently live in a live platform or a complete video stream formed after the live broadcast is completed.
In some implementation scenarios, the live end 104 and the client 106 may be used interchangeably. For example, an anchor may use the live end 104 to provide live video services to viewers, and may also act as a user to view live video provided by other anchors. For another example, a user may use the client 106 to view live video provided by an anchor of interest, or may serve as an anchor to provide live video services to other viewers.
In this embodiment, the live broadcast end 104 and the client 106 are both terminals, and may be various electronic devices with display screens, including but not limited to smart phones, personal digital assistants, tablet computers, personal computers, notebook computers, virtual reality terminal devices, augmented reality terminal devices, and the like. The live broadcast end 104 and the client 106 may have internet products installed therein for providing live internet services, for example, the internet products may be applications APP, Web pages, applets, and the like used in a computer or a smart phone and related to live internet services.
It is understood that the application scenario shown in fig. 1 is only one possible example, and in other possible embodiments, the application scenario may include only some of the components shown in fig. 1 or may also include other components. For example, the application scenario shown in fig. 1 may further include a video capture terminal 108 for capturing a live video frame of the anchor, where the video capture terminal 108 may be directly installed or integrated in the live end 104, or may be independent of the live end 104, and the like, and this embodiment is not limited herein.
It should be understood that the numbers of live ends 104, clients 106, networks, and servers 102 are merely illustrative. There may be any number of live ends 104, clients 106, networks, and servers 102, as required by the implementation. For example, the server may be a server cluster composed of a plurality of servers. The live broadcast end 104 and the client 106 interact with the server through the network to receive or send messages and the like. The server 102 may be a server that provides various services. The live broadcast end 104 or the client 106 may be configured to execute the steps of the facial expression reconstruction method provided in the embodiments of the present application.
Based on this, an embodiment of the present application provides a facial expression reconstruction method. Referring to fig. 2, fig. 2 is a schematic flowchart of a method for reconstructing a facial expression according to an embodiment of the present application. Taking the application of the method to the live broadcast end in fig. 1 as an example, the method includes the following steps:
step S110, a face image is acquired.
The face image refers to a picture including face data. Facial data refers to some data that is rich in facial features (e.g., facial features, expressive features, camera angle features, etc.) or information. The facial feature, the expression feature, the shooting angle feature and the like of the human face can be obtained by analyzing the facial data.
Alternatively, the face image may be any picture of the face of the user whose expression needs to be reconstructed. In live broadcast, particularly in virtual live broadcast, the face image is usually a picture of the anchor's face; the picture of the anchor's face can be analyzed to capture the anchor's expression, and the expression is reconstructed on the virtual anchor to complete the anchor's expression-driven control of the virtual anchor, that is, the virtual anchor makes the same expression as the anchor. Refer to fig. 3 for details.
In an alternative embodiment, the face image is usually a 2D picture, and may be a face photograph directly acquired by using a shooting device; or the face photos can be extracted from the video acquired by the video acquisition terminal.
In this embodiment, the facial image is usually one, that is, the facial expression of the user can be reconstructed by one facial image. However, in live broadcasting, when the facial expression of the anchor needs to be reconstructed, the facial image of the anchor can be collected in real time or at regular time, so that the facial expression of the anchor can be reconstructed in real time. In different moments, the shooting angle, illumination, color, expression and the like of the acquired face image can be different.
And step S120, inputting the face image into a pre-trained deep neural network model so as to output the expression coefficient and the shooting parameter of the face image.
The Deep Neural Network (DNN) model is a discriminant model with at least one hidden layer, and it can be trained with a back-propagation algorithm. The layers inside a DNN can be divided into three categories according to their positions: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the middle layers are hidden layers. Adjacent layers are fully connected, that is, any neuron of the i-th layer is connected to every neuron of the (i+1)-th layer.
In the present embodiment, the deep neural network model may be a deep convolutional neural network model (DCNN). The convolutional neural network model has good effect in the related tasks of feature recognition, and is commonly used for image recognition and voice recognition. Alternative deep convolutional neural network models may be a residual network variant structure based on deep residual network optimization, a residual network variant structure employing a new training method, a residual network variant structure based on increased width, and a residual network variant structure employing new dimensions.
The essence of model training is as follows: given an input vector and a target output value, the input vector is fed into one or more network structures or functions to obtain an actual output value; the deviation between the target output value and the actual output value is calculated, and it is judged whether the deviation is within an allowable range. If it is within the allowable range, the training ends and the related parameters are fixed; if not, parameters in the network structure or function are adjusted continually until the deviation falls within the allowable range or a termination condition is reached, at which point the training ends and the related parameters are fixed, and the trained model is finally obtained from the fixed parameters. In this embodiment, face image samples with different expressions and poses (i.e., shooting angles) are input to the deep neural network model, the loss function of the deep neural network model is calculated, and the network parameters of the deep neural network model are updated until the network converges, so that the trained deep neural network model is obtained.
The deep neural network model has strong nonlinear fitting capability, strong feature extraction capability, a strong ability to process high-dimensional data, and strong fault tolerance: even if some neurons are damaged, the overall training result is not greatly affected. Therefore, in the embodiments of the present application, the trained deep neural network model is adopted to extract features from the face image; more accurate features can be obtained, so the output expression coefficient and shooting parameters are more accurate.
The expression coefficient refers to data or information related to the facial expression and is commonly used to fit the facial expression. Facial expressions include anger, surprise, fear, disgust, happiness, sadness and the like; at present, as many as 52 expressions are commonly used. Alternatively, the expression coefficient may be a data set in which each value represents the weight of its corresponding expression. For example, if there are 52 expressions, namely anger, surprise, fear, disgust, happiness, sadness, ..., and the expression coefficients are 1, 0.2, 0, ..., 0, then the weight of anger is 1, the weight of surprise is 0.2, and the remaining weights are all 0.
The shooting parameters refer to some shooting equipment related parameters when the face image is shot, and include but are not limited to camera rotation parameters and translation parameters.
Step S130, selecting at least one basic expression template from a preset basic expression template library according to the expression coefficient.
Specifically, a basic expression template, also called a basic expression base, is also known as a Blendshape (deformation target or expression deformation). Blendshapes can be used to control detailed facial expressions. In general, a face can be provided with several or dozens of Blendshapes, that is, the facial expression of a person can be combined from several or dozens of Blendshapes. Each Blendshape generally controls only one facial detail; for example, the eyes, mouth corners, eyebrows and nose can each be controlled by a different Blendshape. Each Blendshape takes a value ranging from 0 to 1; for example, for a Blendshape controlling the mouth, 0 indicates a tightly closed mouth, 1 indicates a fully open mouth, and intermediate values indicate the degree to which the mouth is open. On this basis, very complex facial expressions can be formed by combining several or dozens of Blendshapes.
In an alternative embodiment, each expression may generate one Blendshape, and there may be multiple Blendshapes, for example 52, as shown in fig. 4 (only some of the Blendshapes are shown in fig. 4). The plurality of Blendshapes form a set, which is recorded as the preset basic expression template library.
Because the expression coefficient represents the weight corresponding to each expression (i.e., each Blendshape), selecting at least one basic expression template from the preset basic expression template library according to the expression coefficient amounts to checking the value corresponding to each Blendshape: when the value of a Blendshape is greater than 0, that Blendshape is extracted from the preset basic expression template library; when the value is equal to 0, the Blendshape can be ignored. All relevant Blendshapes are retrieved in this way.
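For illustration only, the following is a minimal sketch, assuming Python with NumPy, of selecting the Blendshapes whose weights are greater than zero; the function and variable names are illustrative and not part of this application.

```python
import numpy as np

def select_templates(expression_coeff, template_library):
    """expression_coeff: length-52 array of non-negative Blendshape weights;
    template_library: preset basic expression template library (sequence of Blendshapes)."""
    idx = np.nonzero(np.asarray(expression_coeff) > 0)[0]
    return [(i, template_library[i]) for i in idx]   # Blendshapes with weight > 0
```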
And step S140, facial expression reconstruction is carried out according to the expression coefficient, the at least one basic expression template and the shooting parameters.
After all the basic expression templates (namely Blendshape) are found, the expression coefficients, the shooting parameters and all the basic expression templates are combined to form a 3D face model, and the facial expression reconstruction is completed.
In one embodiment, in performing step S140, the photographing parameters include a rotation parameter and a translation parameter; reconstructing the facial expression according to the expression coefficient, at least one basic expression template and the shooting parameters, comprising the following steps: carrying out face synthesis according to the expression coefficient and at least one basic expression model to obtain an initial face expression model; and adjusting the initial facial expression model according to the rotation parameters and the translation parameters to form a final facial expression model so as to complete facial expression reconstruction.
Specifically, all basic expression templates (i.e., Blendshape) are found, then all basic expression templates are combined according to the weight corresponding to each basic expression template to form a 3D face model, and then the 3D face model is adjusted according to the rotation parameters and the translation parameters, so that facial expression reconstruction is completed.
A detailed example is given for ease of understanding. Suppose the expression coefficient is W_exp; the initial facial expression model is V = B × W_exp^T, and the final facial expression model is V' = R × V + T, where B ∈ R^(n×3×52) represents the 52 Blendshapes, n is the number of vertices of the 3D facial expression, 3 represents the 3D coordinates [x, y, z], 52 corresponds to the 52 expressions, and W_exp can be understood as the weight of each expression. After the initial facial expression model V is obtained, the pose of the face can be controlled by V' = R × V + T, so that the estimated face pose is consistent with the pose of the face in the input face image.
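A minimal NumPy sketch of these two formulas, under the assumption that B is stored as an (n, 3, 52) array; the function name is illustrative.

```python
import numpy as np

def reconstruct_expression(B, w_exp, R, T):
    """B: (n, 3, 52) Blendshapes; w_exp: (52,) expression weights;
    R: (3, 3) rotation; T: (3,) or (1, 3) translation."""
    V = np.einsum("nce,e->nc", B, w_exp)   # V = B x W_exp^T, weighted Blendshape combination
    return V @ R.T + T                     # V' = R x V + T, posed face model
```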
In another alternative embodiment, the face form factor of the input facial image may also be considered comprehensively during the facial expression reconstruction process according to the expression coefficients, the at least one basic expression template and the shooting parameters. The specific process is as follows: the shape coefficient of the input face image can be obtained, and then facial expression reconstruction is carried out according to the shape coefficient, the expression coefficient, at least one basic expression template and the shooting parameters. The human face shape factor is considered when the human face expression is reconstructed, so that the reconstructed human face shape is closer to the human face shape in the input human face image, and the accuracy of the human face expression reconstruction can be further improved.
According to the method, the device, the equipment and the storage medium for reconstructing the facial expression, firstly, a facial image is obtained; then inputting the face image into a pre-trained deep neural network model to output an expression coefficient and a shooting parameter of the face image; the expression coefficients represent the weights of the basic expression templates, and at least one weight is a positive value, so that at least one basic expression template can be selected from a preset basic expression template library according to the expression coefficients; and finally, facial expression reconstruction is carried out according to the expression coefficient, at least one basic expression template and the shooting parameters.
The facial expression reconstruction method captures the facial expression of a single face image in a deep learning manner, so the obtained expression coefficient and shooting parameters are highly accurate, the result obtained when subsequently reconstructing the facial expression is more accurate, the extensive optimization required by a traditional model is avoided, and the amount of computation is greatly reduced.
Further, a specific embodiment of model training is given, which is described as follows:
in one embodiment, the pre-trained deep neural network model is obtained by:
in step S1, a face image sample is obtained.
Specifically, a relatively large number (e.g., several thousand, several tens of thousands, etc.) of face image samples are prepared first. The face image sample can be shot and collected by adopting a shooting device. Generally, the more image samples, the more accurate the model trained; too many face image samples can slow down model training. Therefore, in practical applications, an appropriate number of facial image samples may be selected, but the samples are diversified as much as possible when preparing the facial image samples, and the facial image samples include images of various expressions and various shooting angles, for example, the facial image samples include both normal-posture expressions and postures of a side face, a head-down posture, a head-up posture, and various expressions of eyes opening and closing, mouth tilting, and the like. In addition, a data training set can be established when the face image sample is prepared, and the face image sample is stored in the data training set.
And step S2, performing face key point labeling on the face image sample to obtain a first face key point set.
After the face image samples are obtained, face key point labeling is performed on the face image samples, and for each face image sample, a plurality of key points may be labeled, for example, 278 may be labeled, specifically, refer to fig. 5. These face keypoints form a first set of face keypoints.
And step S3, inputting the face image sample after the key point is marked into the deep neural network model to output the shape coefficient, the expression coefficient and the shooting parameter.
Specifically, a deep neural network model needs to be constructed first. A convolutional neural network can be used as the backbone network of the model, and four fully connected layers are added after the backbone network, which are used to output the shape coefficient W_id, the expression coefficient W_exp, and the shooting parameters (namely the camera rotation parameter R and the translation parameter T), respectively. Alternatively, the convolutional neural network may be mobilenetv3, resnet, densenet, or the like.
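A minimal sketch of such a network, assuming PyTorch and a MobileNetV3-Small backbone; the head sizes follow the dimensions given in this text (W_id: 150, W_exp: 52, R: 3×3, T: 1×3), while the class name, the ReLU used to keep the expression weights non-negative, and the flat 9-dimensional rotation output are illustrative assumptions rather than details fixed by this application.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class ExpressionRegressor(nn.Module):
    def __init__(self, feat_dim=576):           # 576 is the MobileNetV3-Small feature width
        super().__init__()
        backbone = mobilenet_v3_small()
        self.features = backbone.features        # convolutional backbone
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc_id = nn.Linear(feat_dim, 150)     # shape coefficient W_id
        self.fc_exp = nn.Linear(feat_dim, 52)     # expression coefficient W_exp
        self.fc_rot = nn.Linear(feat_dim, 9)      # rotation R, reshaped to 3x3
        self.fc_trans = nn.Linear(feat_dim, 3)    # translation T

    def forward(self, img):                       # img: (B, 3, H, W) face image
        f = self.pool(self.features(img)).flatten(1)
        w_id = self.fc_id(f)
        w_exp = torch.relu(self.fc_exp(f))        # keep expression weights non-negative
        rot = self.fc_rot(f).view(-1, 3, 3)
        trans = self.fc_trans(f)
        return w_id, w_exp, rot, trans
```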
The expression coefficient refers to data or information related to the facial expression and is commonly used to fit the facial expression. The shooting parameters refer to parameters of the shooting equipment when the face image is captured, including but not limited to the camera rotation parameter and translation parameter. The shape coefficient refers to data or information related to the shape of the face and is commonly used to fit the face shape. Because the face image samples include face images of many different people, and different people may have different face shapes, the same expression is usually expressed differently on different face shapes; that is, the face shape may interfere with the expression of the expression. This factor is fully considered when training the deep neural network model: the shape coefficient related to the face shape is included in the training of the deep neural network model, which makes the trained model more accurate, makes the extracted expression coefficient more accurate when the facial expression is subsequently reconstructed, and further improves the accuracy of facial expression reconstruction.
And step S4, carrying out face reconstruction according to the shape coefficient, the expression coefficient, the shooting parameters and the three-dimensional deformation model to obtain a face three-dimensional model.
The three-dimensional deformation model is a basic model used for three-dimensional reconstruction of the human face. Alternatively, the three-dimensional deformation model may be FaceWarehouse (a multilinear face model), which is a three-dimensional deformable face model commonly used in the fields of computer vision and computer graphics.
The specific process is as follows: the shape coefficient W_id ∈ R^(150×1), the expression coefficient W_exp ∈ R^(52×1), the camera rotation parameter R ∈ R^(3×3), the translation parameter T ∈ R^(1×3) and the three-dimensional deformation model Cr ∈ R^(n×3×52×150) are used for face synthesis, i.e. V1 = Cr × W_id^T × W_exp^T and V1' = R × V1 + T, where Cr denotes the three-dimensional deformation model, V1 ∈ R^(n×3) is the coordinate representation of the vertex set of the face model before processing according to the shooting parameters, and V1' ∈ R^(n×3) is the face three-dimensional model after pose control such as rotation and translation through the camera parameters (namely the shooting parameters).
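A minimal NumPy sketch of this synthesis, under the assumption that the three-dimensional deformation model Cr is stored as an (n, 3, 52, 150) array; the function name is illustrative.

```python
import numpy as np

def synthesize_face(Cr, w_id, w_exp, R, T):
    """Cr: (n, 3, 52, 150) multilinear model; w_id: (150,); w_exp: (52,);
    R: (3, 3) rotation; T: (3,) or (1, 3) translation."""
    V1 = np.einsum("nces,e,s->nc", Cr, w_exp, w_id)   # face vertices before pose control
    return V1 @ R.T + T                               # posed vertices V1'
```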
Step S5, selecting face key points on the face three-dimensional model, and projecting the selected face key points onto a two-dimensional pixel plane to form a second face key point set.
After the face three-dimensional model is obtained, face key points are selected on the face three-dimensional model, and therefore a plurality of face key points are obtained. The number of the face key points selected in the face three-dimensional model is the same as that of the face key points of the input face image sample. After the key points of the face are selected, the selected key points can be projected to a two-dimensional pixel plane P through perspective projection, and therefore a second face key point set is generated.
Optionally, when key points are selected on the face three-dimensional model, the selection is mainly divided into two parts: one part is selected from the fixed five sense organs of the face, and the other part is selected from the face contour. The specific process is to select a plurality of face key points from different positions in the five sense organs and in the face contour of the face three-dimensional model, respectively.
The five sense organs of the human face mainly include the nose, eyes, mouth and eyebrows. When key points are selected on the five sense organs, relatively fixed points are usually selected, such as the mouth corners, the eyelids and the like. When key points are selected on the face contour, they need to be selected from different positions of the face, and the face key point positions of the input face image sample need to be referred to; that is, the positions of the second face key points substantially correspond to the positions of the first face key points.
Further, an embodiment of selecting face key points from a face contour is provided, which is described in detail below.
In one embodiment, selecting a plurality of face key points from different positions in the face contour of the three-dimensional model of the face comprises: randomly selecting a plurality of curves on the face contour of the face three-dimensional model, wherein the curves are distributed at different positions of the face contour and do not intersect with one another; selecting one or more points from each curve as candidate key points; and selecting a plurality of face key points from the candidate key points.
Specifically, some curves are first selected on the three-dimensional model of the face. As shown in fig. 6, some curves are selected on the left and right cheeks, respectively, in the horizontal or vertical direction, and a point on each curve is used as a candidate key point of the cheek contour under different poses. Similarly, on the chin portion, i.e., from the mouth to the chin, some curves are taken, and a point on each curve is used as a candidate key point of the chin contour under different poses. The selected curves must not intersect one another; if they intersected, the selected key points could repeat, and the key point selection could ultimately be inaccurate.
After obtaining a plurality of curves, when selecting candidate key points from the curves, different selection modes are provided for curve selection on the cheek and the chin, and the specific mode is as follows:
in one embodiment, selecting one or more points from each curve as candidate keypoints comprises: when the curve is positioned on the cheek, selecting a point with the maximum absolute value of the abscissa from the curve as a candidate key point; and/or; and when the curve is positioned on the chin, selecting a point with the largest sum of the square value of the abscissa and the square value of the ordinate from the curve as a candidate key point.
Specifically, the point with the maximum absolute value of x can be selected from each left and right cheek curve as a candidate key point, and the point with the maximum x^2 + y^2 can be selected from each chin curve as a candidate key point.
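A minimal sketch of this selection rule, under the assumption that each contour curve is available as an array whose first two columns are the x and y coordinates of its points; the function names are illustrative.

```python
import numpy as np

def cheek_candidate(curve):
    """curve: (m, 2) or (m, 3) array of curve points; pick the point with maximum |x|."""
    return curve[np.argmax(np.abs(curve[:, 0]))]

def chin_candidate(curve):
    """Pick the point with the largest x^2 + y^2 on a chin curve."""
    return curve[np.argmax(curve[:, 0] ** 2 + curve[:, 1] ** 2)]
```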
This candidate key point selection mode, i.e., a dynamic selection algorithm for the face contour key points, solves the problem that the face contour key points are not fixed on the 3D model under different poses, which otherwise easily leads to wrong key point selection during model training.
In addition, because the points on a selected curve may be sparse and unevenly spaced, the candidate key points selected from the curve are correspondingly sparse and uneven, and their number may not match the required number. Therefore, a fixed number of evenly spaced face contour key points are obtained by interpolation based on the candidate key points. The interpolation process is as follows:
in one embodiment, selecting one or more points from each curve as candidate keypoints comprises: performing interpolation operation on each curve to obtain a plurality of interpolation points; and taking a plurality of interpolation points as candidate key points.
Specifically, for any curve, the distances between adjacent points on the curve are calculated first and recorded as D = {d_0, d_1, ..., d_n}, where d_0 is the distance between the first and second points on the curve, and so on, and d_n is the distance between the last point and the second-to-last point. Then the accumulated distance along one direction of the curve is calculated as

s_i = d_0 + d_1 + ... + d_(i-1),

that is, s_1 = d_0 is the accumulated distance between the first and second points, and s_i is the distance from the first point to the second point, plus the distance from the second point to the third point, and so on, accumulated up to the i-th inter-point distance. The number of interpolation points is determined from the accumulated distance, and the spacing between interpolation points is calculated from the curve length and the number of interpolation points. Then, according to the accumulated distance of each interpolation point P', the two curve points P_i, P_(i+1) between which P' falls are determined; the distances w_i, w_(i+1) from P' to P_i and P_(i+1) are computed; and the coordinates of the interpolation point are calculated as

P' = (w_(i+1) × P_i + w_i × P_(i+1)) / (w_i + w_(i+1)).
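A minimal NumPy sketch of this interpolation, resampling a curve to a fixed number of points evenly spaced along its accumulated distance; the function name and the use of m evenly spaced arc-length targets are illustrative assumptions.

```python
import numpy as np

def resample_curve(points, m):
    """points: (k, 2 or 3) ordered curve points; m: number of interpolation points."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)     # d_0 ... d_{k-2}
    cum = np.concatenate([[0.0], np.cumsum(seg)])             # accumulated distances s_i
    targets = np.linspace(0.0, cum[-1], m)                    # evenly spaced arc lengths
    out = []
    for s in targets:
        i = np.searchsorted(cum, s, side="right") - 1         # P' falls between P_i and P_{i+1}
        i = min(i, len(points) - 2)
        w_i, w_ip1 = s - cum[i], cum[i + 1] - s                # distances to the two end points
        denom = w_i + w_ip1
        t = 0.0 if denom == 0 else w_i / denom
        out.append((1 - t) * points[i] + t * points[i + 1])   # weighted interpolation
    return np.stack(out)
```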
And step S6, determining a mean square loss function according to the first face key point set and the second face key point set.
Specifically, a mean square loss function, i.e., the MSE loss, is calculated from the face key points in the second face key point set and the face key points in the first face key point set:

L_2d = (1/N) × Σ_(j=1)^(N) || P_j − P'_j ||^2,

where N is the number of face key points, P_j is the j-th key point selected on the three-dimensional model V1' (projected onto the two-dimensional pixel plane), and P'_j is the j-th face key point in the first face key point set, i.e., the j-th face key point labeled on the input face image sample.
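A minimal sketch of this mean square key point loss, assuming PyTorch tensors of shape (N, 2) holding the projected model key points and the labeled key points; the function name is illustrative.

```python
import torch

def keypoint_mse_loss(proj_kpts, gt_kpts):
    """proj_kpts: (N, 2) projected key points P_j; gt_kpts: (N, 2) labeled key points P'_j."""
    return ((proj_kpts - gt_kpts) ** 2).sum(dim=-1).mean()   # L_2d
```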
And step S7, performing affine transformation on the face image sample after the key point labeling, and determining a third face key point set in the face image sample after the affine transformation.
Specifically, let the face image sample after key point labeling be I, and apply an affine transformation with matrix M to it to obtain I'. If the j-th key point on the face three-dimensional model V1' constructed from the face image sample I is recorded as P_(I,j), then P_(I,j) is transformed by the affine matrix M to obtain P_(I',j). P_(I',j) is recorded as a third face key point, and all third face key points together form the third face key point set.
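A minimal sketch of this affine augmentation, assuming OpenCV for the image warp and an (N, 2) array of key point pixel coordinates; the function name is illustrative.

```python
import cv2
import numpy as np

def affine_augment(image, keypoints, M):
    """M: (2, 3) affine matrix; returns the transformed image I' and transformed key points."""
    h, w = image.shape[:2]
    warped_img = cv2.warpAffine(image, M, (w, h))            # I' = affine transform of I
    ones = np.ones((keypoints.shape[0], 1))
    warped_kpts = np.hstack([keypoints, ones]) @ M.T         # third face key point set
    return warped_img, warped_kpts
```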
And step S8, determining a consistency loss function according to the first face key point set, the second face key point set and the third face key point set.
Specifically, the consistency loss function L_consistency is computed over the N face key points from the first, second and third face key point sets, where P_(I,j) denotes the j-th key point on the face three-dimensional model V1' constructed from the face image sample I, P'_(I,j) denotes the j-th face key point labeled on the input face image sample, and P'_(I',j) is the j-th face key point obtained by applying the affine transformation M to P_(I,j).
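The exact formula of the consistency loss is not reproduced in this text; the sketch below is therefore only one plausible formulation, assuming PyTorch: the key points predicted for the affine-transformed sample should match the affine-transformed key points predicted for the original sample.

```python
import torch

def consistency_loss(kpts_from_transformed, kpts_from_original, M):
    """kpts_*: (N, 2) predicted key points; M: (2, 3) affine matrix."""
    ones = torch.ones_like(kpts_from_original[:, :1])
    warped = torch.cat([kpts_from_original, ones], dim=1) @ M.t()   # M applied to predictions
    return ((kpts_from_transformed - warped) ** 2).sum(dim=-1).mean()
```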
In step S9, a total loss function is determined according to the mean square loss function and the consistency loss function.
Step S10, updating the network parameters of the deep neural network model according to the total loss function until convergence so as to obtain a pre-trained deep neural network model; and the number of the face key points in the first face key point set, the second face key point set and the third face key point set is the same.
The total loss function is composed of the following two parts: L = L_2d + L_consistency. The total loss function is calculated, and the network parameters of the deep neural network model are updated according to the total loss function and the back-propagation algorithm until the network converges, so as to obtain the pre-trained deep neural network model.
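A minimal sketch of one training update combining the two losses with back-propagation, under the assumption that pred_kpts and pred_kpts_aug are the projected 2D key points obtained from the model's outputs for the original and the affine-transformed sample, and that keypoint_mse_loss and consistency_loss are the sketches given above; all names are illustrative.

```python
def train_step(optimizer, pred_kpts, pred_kpts_aug, gt_kpts, M):
    loss = keypoint_mse_loss(pred_kpts, gt_kpts) + consistency_loss(pred_kpts_aug, pred_kpts, M)
    optimizer.zero_grad()
    loss.backward()      # back-propagation of the total loss L = L_2d + L_consistency
    optimizer.step()     # update the network parameters
    return loss.item()
```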
It should be understood that although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The embodiment disclosed in the present application describes a method for reconstructing a facial expression in detail, and the method disclosed in the present application can be implemented by devices in various forms, so that the present application also discloses a device for reconstructing a facial expression corresponding to the method, and a detailed description is given below with respect to a specific embodiment.
Please refer to fig. 7, which is a device for reconstructing a facial expression according to an embodiment of the present application, and the device mainly includes:
a face image obtaining module 710, configured to obtain a face image.
And the coefficient and parameter output module 720 is configured to input the facial image into a pre-trained deep neural network model to output an expression coefficient and a shooting parameter of the facial image.
The template selecting module 730 is configured to select at least one basic expression template from a preset basic expression template library according to the expression coefficient.
And a facial expression reconstruction module 740, configured to perform facial expression reconstruction according to the expression coefficient, the at least one basic expression template, and the shooting parameter.
In one embodiment, the apparatus further comprises:
and the sample acquisition module is used for acquiring a face image sample.
The first key point set obtaining module is used for carrying out face key point labeling on the face image sample to obtain a first face key point set;
and the parameter output module is used for inputting the face image sample after the key point is marked into the deep neural network model so as to output the shape coefficient, the expression coefficient and the shooting parameter.
And the human face three-dimensional model reconstruction module is used for reconstructing a human face according to the shape coefficient, the expression coefficient, the shooting parameters and the three-dimensional deformation model to obtain a human face three-dimensional model.
And the second key point set obtaining module is used for selecting face key points on the face three-dimensional model and projecting the selected face key points onto the two-dimensional pixel plane to form a second face key point set.
And the mean square loss function calculation module is used for determining a mean square loss function according to the first face key point set and the second face key point set.
And the third key point set obtaining module is used for performing affine transformation on the face image sample after the key point labeling and determining a third face key point set in the face image sample after the affine transformation.
And the consistency loss function calculation module is used for determining a consistency loss function according to the first face key point set, the second face key point set and the third face key point set.
And the total loss function determining module is used for determining a total loss function according to the mean square loss function and the consistency loss function.
The model obtaining module is used for updating the network parameters of the deep neural network model according to the total loss function until convergence so as to obtain a pre-trained deep neural network model; and the number of the face key points in the first face key point set, the second face key point set and the third face key point set is the same.
In one embodiment, the second key point set obtaining module is configured to select a plurality of face key points from different positions in the five sense organs and the face contour of the three-dimensional face model.
In one embodiment, the second keypoint set obtaining module is configured to randomly select a plurality of curves on a face contour of the three-dimensional face model, where the curves are distributed at different positions of the face contour and do not intersect with each other; selecting one or more points from each curve as candidate key points; and selecting a plurality of face key points from the candidate key points.
In one embodiment, the second keypoint set obtaining module is configured to, when the curve is located on the cheek, select a point with a largest absolute value of abscissa from the curve as a candidate keypoint; and/or; and when the curve is positioned on the chin, selecting a point with the largest sum of the square value of the abscissa and the square value of the ordinate from the curve as a candidate key point.
In one embodiment, the second keypoint set obtaining module is configured to perform an interpolation operation on each curve to obtain a plurality of interpolation points; and taking a plurality of interpolation points as candidate key points.
In an embodiment, the facial expression reconstruction module 740 is configured to perform facial synthesis according to the expression coefficient and at least one basic expression model to obtain an initial facial expression model; and adjusting the initial facial expression model according to the rotation parameters and the translation parameters to form a final facial expression model so as to complete facial expression reconstruction.
For specific limitations of the facial expression reconstruction device, see the above limitations on the method, which are not described herein again. The various modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent of a processor in the terminal device, and can also be stored in a memory in the terminal device in a software form, so that the processor can call and execute operations corresponding to the modules.
Referring to fig. 8, fig. 8 is a block diagram illustrating a structure of a terminal device according to an embodiment of the present application. The terminal device 80 may be a computer device. The terminal device 80 in the present application may include one or more of the following components: a processor 82, a memory 84, and one or more applications, wherein the one or more applications may be stored in the memory 84 and configured to be executed by the one or more processors 82, the one or more applications configured to perform the methods described in the above-described embodiments of the method for reconstructing a facial expression.
The processor 82 may include one or more processing cores. The processor 82 connects various parts within the overall terminal device 80 using various interfaces and lines, and performs various functions of the terminal device 80 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 84, and calling data stored in the memory 84. Alternatively, the processor 82 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 82 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing display content; and the modem is used to handle wireless communication. It is understood that the modem may also be implemented by a separate communication chip rather than being integrated into the processor 82.
The Memory 84 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). The memory 84 may be used to store instructions, programs, code sets or instruction sets. The memory 84 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing various method embodiments described below, and the like. The storage data area may also store data created by the terminal device 80 in use, and the like.
Those skilled in the art will appreciate that the structure shown in fig. 8 is a block diagram of only a portion of the structure relevant to the present application, and does not constitute a limitation on the terminal device to which the present application is applied, and a particular terminal device may include more or less components than those shown in the drawings, or combine some components, or have a different arrangement of components.
In summary, the terminal device provided in the embodiment of the present application is used to implement the method for reconstructing a facial expression corresponding to the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Referring to fig. 9, a block diagram of a computer-readable storage medium according to an embodiment of the present disclosure is shown. The computer-readable storage medium 90 stores program codes, and the program codes can be called by a processor to execute the methods described in the above embodiments of the facial expression reconstruction method.
The computer-readable storage medium 90 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 90 includes a non-transitory computer-readable storage medium. The computer readable storage medium 90 has storage space for program code 92 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 92 may be compressed, for example, in a suitable form.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for reconstructing a facial expression, the method comprising:
acquiring a face image;
inputting the face image into a pre-trained deep neural network model to output an expression coefficient and a shooting parameter of the face image;
selecting at least one basic expression template from a preset basic expression template library according to the expression coefficient;
and reconstructing the facial expression according to the expression coefficient, at least one basic expression template and the shooting parameters.
2. The method of claim 1, wherein the pre-trained deep neural network model is obtained by:
acquiring a face image sample;
performing face key point labeling on the face image sample to obtain a first face key point set;
inputting the face image sample after the key point is labeled into a deep neural network model so as to output a shape coefficient, an expression coefficient and a shooting parameter;
carrying out face reconstruction according to the shape coefficient, the expression coefficient, the shooting parameters and the three-dimensional deformation model to obtain a face three-dimensional model;
selecting face key points on the face three-dimensional model, and projecting the selected face key points onto a two-dimensional pixel plane to form a second face key point set;
determining a mean square loss function according to the first face key point set and the second face key point set;
carrying out affine transformation on the face image sample after the key point labeling, and determining a third face key point set in the face image sample subjected to the affine transformation;
determining a consistency loss function according to the first face key point set, the second face key point set and the third face key point set;
determining a total loss function according to the mean square loss function and the consistency loss function;
updating the network parameters of the deep neural network model according to the total loss function until convergence so as to obtain a pre-trained deep neural network model;
wherein the first face key point set, the second face key point set and the third face key point set contain the same number of face key points.
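Claim 2 names a mean square loss and a consistency loss without fixing their formulas, so the sketch below is only one consistent NumPy reading: the first set holds the labeled key points, the second the 3D key points projected onto the pixel plane, and the third the key points of the affine-transformed sample. The least-squares affine fit and the weighting factor lam are assumptions.

```python
import numpy as np

def mse_keypoint_loss(kp_labeled, kp_projected):
    """Mean square loss between the first (labeled) and second (projected) key point sets."""
    return np.mean(np.sum((kp_labeled - kp_projected) ** 2, axis=1))

def fit_affine(src, dst):
    """Least-squares 2D affine map (A, b) taking src key points onto dst key points."""
    X = np.hstack([src, np.ones((src.shape[0], 1))])   # N x 3
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)   # 3 x 2
    return params[:2].T, params[2]                     # A (2x2), b (2,)

def consistency_loss(kp_labeled, kp_projected, kp_affine):
    """One possible reading: the affine map recovered from the first and third sets,
    applied to the second set, should land on the third set."""
    A, b = fit_affine(kp_labeled, kp_affine)
    mapped = kp_projected @ A.T + b
    return np.mean(np.sum((mapped - kp_affine) ** 2, axis=1))

def total_loss(kp_labeled, kp_projected, kp_affine, lam=1.0):
    # Weighted sum of the two terms; the weight lam is an assumption.
    return (mse_keypoint_loss(kp_labeled, kp_projected)
            + lam * consistency_loss(kp_labeled, kp_projected, kp_affine))
```

In training, this total loss would then be back-propagated to update the network parameters until convergence, as the claim requires.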
3. The method of claim 2, wherein selecting face key points on the three-dimensional model of the face comprises:
and respectively selecting a plurality of face key points from different positions on the facial features and the face contour of the face three-dimensional model.
4. The method of claim 3, wherein the selecting a plurality of face key points from different positions on the face contour of the face three-dimensional model comprises:
randomly selecting a plurality of curves on the face contour of the face three-dimensional model, wherein the curves are distributed at different positions of the face contour and do not intersect with each other;
respectively selecting one or more points from each curve as candidate key points;
and selecting a plurality of face key points from the candidate key points.
5. The method of claim 4, wherein the respectively selecting one or more points from each curve as candidate key points comprises:
when the curve is positioned on the cheek, selecting a point with the largest absolute value of the abscissa from the curve as the candidate key point; and/or
and when the curve is positioned on the chin, selecting a point with the largest sum of the square value of the abscissa and the square value of the ordinate from the curve as the candidate key point.
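Assuming each contour curve is available as an N x 2 array of (x, y) coordinates in a face-centered frame (the claim does not fix the coordinate system), the two selection rules of claim 5 reduce to simple arg-max operations, as sketched below.

```python
import numpy as np

def pick_cheek_point(curve_xy):
    """Cheek rule: the point with the largest absolute abscissa |x|."""
    return curve_xy[np.argmax(np.abs(curve_xy[:, 0]))]

def pick_chin_point(curve_xy):
    """Chin rule: the point with the largest x**2 + y**2."""
    return curve_xy[np.argmax(curve_xy[:, 0] ** 2 + curve_xy[:, 1] ** 2)]
```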
6. The method of claim 5, wherein the respectively selecting one or more points from each curve as candidate key points comprises:
performing interpolation operation on each curve to obtain a plurality of interpolation points;
and taking a plurality of interpolation points as the candidate key points.
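A minimal sketch of the interpolation step of claim 6, assuming each curve is given as a polyline of distinct vertices; the arc-length parameterization and linear interpolation are illustrative choices, and a spline would serve equally well.

```python
import numpy as np

def interpolate_curve(curve, n_points):
    """Resample a contour polyline at n_points evenly spaced arc-length positions;
    the resampled points can then be taken as candidate key points."""
    seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)   # segment lengths
    t = np.concatenate([[0.0], np.cumsum(seg)])
    t /= t[-1]                                             # normalize to [0, 1]
    t_new = np.linspace(0.0, 1.0, n_points)
    return np.stack(
        [np.interp(t_new, t, curve[:, d]) for d in range(curve.shape[1])], axis=1)
```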
7. The method according to any one of claims 1-6, wherein the shooting parameters include a rotation parameter and a translation parameter; the reconstructing the facial expression according to the expression coefficient, the at least one basic expression template and the shooting parameters comprises:
carrying out face synthesis according to the expression coefficient and the at least one basic expression template to obtain an initial facial expression model;
and adjusting the initial facial expression model according to the rotation parameter and the translation parameter to form a final facial expression model so as to complete facial expression reconstruction.
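Claim 7 leaves the rotation parameterization open; purely for illustration, the sketch below assumes the rotation parameter is a set of Euler angles and the translation parameter a 3-vector, and applies them to the vertices of the initial facial expression model.

```python
import numpy as np

def euler_to_matrix(yaw, pitch, roll):
    """Rotation matrix from Euler angles in radians (assumed parameterization)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

def adjust_expression_model(initial_vertices, rotation_params, translation):
    """Rotate and translate the initial facial expression model (V x 3 vertices)
    to obtain the final facial expression model."""
    R = euler_to_matrix(*rotation_params)
    return initial_vertices @ R.T + np.asarray(translation)
```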
8. An apparatus for reconstructing a facial expression, the apparatus comprising:
the face image acquisition module is used for acquiring a face image;
the coefficient and parameter output module is used for inputting the face image into a pre-trained deep neural network model so as to output the expression coefficient and shooting parameters of the face image;
the template selection module is used for selecting at least one basic expression template from a preset basic expression template library according to the expression coefficient;
and the facial expression reconstruction module is used for reconstructing facial expressions according to the expression coefficients, at least one basic expression template and the shooting parameters.
9. A terminal device, comprising:
a memory; one or more processors coupled with the memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 7.
CN202111503555.4A 2021-12-09 2021-12-09 Facial expression reconstruction method, device, equipment and storage medium Pending CN114202615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111503555.4A CN114202615A (en) 2021-12-09 2021-12-09 Facial expression reconstruction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111503555.4A CN114202615A (en) 2021-12-09 2021-12-09 Facial expression reconstruction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114202615A (en) 2022-03-18

Family

ID=80652017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111503555.4A Pending CN114202615A (en) 2021-12-09 2021-12-09 Facial expression reconstruction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114202615A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582067A (en) * 2020-04-22 2020-08-25 西南大学 Facial expression recognition method, system, storage medium, computer program and terminal
CN111582067B (en) * 2020-04-22 2022-11-29 西南大学 Facial expression recognition method, system, storage medium, computer program and terminal
CN114783022A (en) * 2022-04-08 2022-07-22 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114783022B (en) * 2022-04-08 2023-07-21 马上消费金融股份有限公司 Information processing method, device, computer equipment and storage medium
CN115346262A (en) * 2022-08-23 2022-11-15 北京字跳网络技术有限公司 Method, device and equipment for determining expression driving parameters and storage medium

Similar Documents

Publication Publication Date Title
WO2020192568A1 (en) Facial image generation method and apparatus, device and storage medium
CN108961369B (en) Method and device for generating 3D animation
US11049310B2 (en) Photorealistic real-time portrait animation
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN114202615A (en) Facial expression reconstruction method, device, equipment and storage medium
EP3992919B1 (en) Three-dimensional facial model generation method and apparatus, device, and medium
CN113287118A (en) System and method for face reproduction
US20230073340A1 (en) Method for constructing three-dimensional human body model, and electronic device
US11514638B2 (en) 3D asset generation from 2D images
CN110458924B (en) Three-dimensional face model establishing method and device and electronic equipment
WO2020211347A1 (en) Facial recognition-based image modification method and apparatus, and computer device
CN115601484B (en) Virtual character face driving method and device, terminal equipment and readable storage medium
CN111754622B (en) Face three-dimensional image generation method and related equipment
CN112085835A (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
CN113095206A (en) Virtual anchor generation method and device and terminal equipment
CN111127309A (en) Portrait style transfer model training method, portrait style transfer method and device
CN113808277A (en) Image processing method and related device
CN113313631B (en) Image rendering method and device
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
US20230079478A1 (en) Face mesh deformation with detailed wrinkles
CN114373033A (en) Image processing method, image processing apparatus, image processing device, storage medium, and computer program
WO2023169023A1 (en) Expression model generation method and apparatus, device, and medium
CN117218300B (en) Three-dimensional model construction method, three-dimensional model construction training method and device
WO2024066549A1 (en) Data processing method and related device
CN117422797A (en) Expression feature extraction method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination