CN113688753A - Static face dynamic method, system, computer equipment and readable storage medium - Google Patents

Static face dynamic method, system, computer equipment and readable storage medium Download PDF

Info

Publication number
CN113688753A
CN113688753A (Application CN202111002870.9A)
Authority
CN
China
Prior art keywords
face
target
image
motion
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111002870.9A
Other languages
Chinese (zh)
Other versions
CN113688753B (en)
Inventor
谌竟成
齐镗泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wondershare Software Co Ltd
Original Assignee
Shenzhen Wondershare Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wondershare Software Co Ltd filed Critical Shenzhen Wondershare Software Co Ltd
Priority to CN202111002870.9A priority Critical patent/CN113688753B/en
Publication of CN113688753A publication Critical patent/CN113688753A/en
Application granted granted Critical
Publication of CN113688753B publication Critical patent/CN113688753B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a static face dynamization method, system, computer device and readable storage medium. The method comprises: performing face detection on a face sample image to obtain face frame information and face key point information, and cropping the image to obtain an initial face image; calculating a face rotation angle and rotating the image to obtain a target face image; extracting the motion key point coordinate information and motion key point parameters of the target face image and of each frame of picture in an initial motion video, performing coordinate conversion, and performing convolution sampling to generate multiple frames of face action images; and reversely rotating the face action images according to the face rotation angle and attaching them to the corresponding frames to obtain a target motion video. The invention first detects the face, processes the face image and the motion video with convolutional networks, and finally attaches the face action images to the corresponding frames of the motion video to obtain the target motion video.

Description

Static face dynamic method, system, computer equipment and readable storage medium
Technical Field
The invention relates to the technical field of video editing, in particular to a static face dynamic method, a static face dynamic system, computer equipment and a readable storage medium.
Background
At present, some short video editing software and short-video social apps in China offer a function of animating static images. For faces, the mainstream approach is face swapping: the face in each frame of an action video is extracted and then replaced or fused into the static image, and the static image is combined with the synthesized video, thereby animating the static image. Other apps simply splice 8 segments of different face action videos around the static image in different spatial directions, so that the whole video looks dynamic. Although existing short-video apps can offer such an AI function for animating static images, there is still a large gap with respect to what users actually need: 1. The user wants the face in the static image to reproduce the motion of the face in the action video. Extracting the face from the action video and pasting it back onto the static image, even when face fusion preserves part of the features of the original static face, visibly changes the main face, so the result is not natural or smooth enough. 2. Spatially splicing different face action videos around the static face only makes the whole video look dynamic; the static image in the middle is not actually animated, and the dynamic quality and fluency of the whole video are low.
Disclosure of Invention
The embodiment of the invention provides a method, a system, computer equipment and a readable storage medium for dynamizing a static face, aiming to solve the problems in the prior art that the animation of a static face is not smooth enough and its quality is not high.
In a first aspect, an embodiment of the present invention provides a static face dynamic method, including:
carrying out face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and shearing the face sample image according to the face frame information to obtain an initial face image;
calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolutional network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
performing coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face action image;
and reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
In a second aspect, an embodiment of the present invention provides a static face dynamic system, which includes:
an initial face image obtaining unit, configured to perform face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and cut the face sample image according to the face frame information to obtain an initial face image;
the target face image acquisition unit is used for calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
the information parameter extraction unit is used for extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolution network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
the target motion video generation unit is used for carrying out coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, carrying out convolution sampling on the converted target face image and the target face image before conversion, and generating a multi-frame face motion image;
and the target motion video acquisition unit is used for reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the static human face dynamic method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the static face dynamic method according to the first aspect.
The embodiment of the invention provides a static face dynamic method, a system, computer equipment and a readable storage medium, wherein the method comprises the following steps: carrying out face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and shearing the face sample image according to the face frame information to obtain an initial face image; calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image; extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolutional network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network; performing coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face action image; and reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video. The embodiment of the invention detects all faces in the image by using the face detection network, processes the face image and the motion video by using the convolution network, and then attaches the face action image to the corresponding frame of the motion video to obtain the target motion video, so that the whole process is simpler, and the generated target motion video is smoother and has better visual effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a static face dynamic method according to an embodiment of the present invention;
fig. 2 is a diagram of a target convolutional network architecture of the static face dynamic method according to the embodiment of the present invention;
fig. 3 is a fourth convolution network architecture diagram of the static face dynamic method according to the embodiment of the present invention;
fig. 4 is a schematic block diagram of a static face dynamization system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart of a static face dynamic method according to an embodiment of the present invention, where the method includes steps S101 to S105.
S101, carrying out face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and cutting the face sample image according to the face frame information to obtain an initial face image;
In this step, a face sample image containing a face is input into a face detection network for face detection to obtain face frame information and face key point information, and the face sample image is cropped according to the face frame information to obtain an initial face image containing the face. The face frame information comprises the coordinates of the upper-left corner of the face frame, the face frame width and the face frame height; the face key point information comprises a left-eye key point, a right-eye key point, a nose key point, a left mouth corner key point and a right mouth corner key point. If the face sample image contains several faces, all of them are detected and the face frame information and face key point information of every face are acquired. If some of these faces are partially occluded or overlapped, then after the initial face images corresponding to all the faces are obtained, it is judged whether the initial face image of each face is complete; incomplete initial face images are repaired with a face repairing algorithm to obtain complete faces.
In this embodiment, the MTCNN network is used as the face detection network to perform face detection on the face sample image and obtain the face frame information and face key point information. The face sample image is cropped according to the width and height of the face frame: the width and height of the face frame are compensated, and the width and height of the initial face image are calculated from the compensation results. Specifically, the width of the initial face image is W = w + offsetW, where W is the width of the initial face image, w is the width of the face frame and offsetW is the width compensation result; the height of the initial face image is H = h + offsetH, where H is the height of the initial face image, h is the height of the face frame and offsetH is the height compensation result. In this embodiment, offsetW = (3/8)·w and offsetH = (3/8)·h.
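For illustration only (this sketch does not appear in the original filing), the compensated crop described above could be written in Python as follows; the function name crop_initial_face and the centring of the enlarged frame on the detected face frame are assumptions, since the text only specifies the enlarged width and height:

import numpy as np

def crop_initial_face(image, x, y, w, h, pad_ratio=3/8):
    """Crop the face region from the sample image, enlarging the detected
    face frame (top-left (x, y), width w, height h) by pad_ratio * w and
    pad_ratio * h, clipping the result to the image bounds."""
    offset_w = pad_ratio * w          # offsetW = (3/8) * w
    offset_h = pad_ratio * h          # offsetH = (3/8) * h
    W = int(round(w + offset_w))      # width of the initial face image
    H = int(round(h + offset_h))      # height of the initial face image
    # Assumption: keep the enlarged box centred on the detected face frame.
    x0 = max(0, int(round(x - offset_w / 2)))
    y0 = max(0, int(round(y - offset_h / 2)))
    img_h, img_w = image.shape[:2]
    x1 = min(img_w, x0 + W)
    y1 = min(img_h, y0 + H)
    return image[y0:y1, x0:x1]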
In an embodiment, the performing face detection on a face sample image by using a face detection network to obtain face frame information and face key point information includes:
carrying out multi-stage scaling on the face sample image to obtain a plurality of input images with different sizes, inputting the input images into a first convolution network for convolution processing, and generating a face candidate frame with a corresponding size;
inputting the face candidate frames with the corresponding sizes into a second convolution network for training, and screening out qualified face candidate frames;
and inputting the qualified face candidate box into a third convolutional network for training to obtain final face frame information and face key point information.
In this embodiment, after the face sample image is subjected to multi-stage scaling, a plurality of input images with different sizes are obtained, and are input into the first convolution network for convolution processing, then the convolution result of the first convolution network is input into the second convolution network for convolution processing, and finally the convolution result of the second convolution network is input into the third convolution network for convolution processing, so that the final face frame information and the face key point information are obtained.
Specifically, the method comprises the following steps: the face sample image is input into an image pyramid of preset size for multi-stage scaling to obtain a plurality of input images of different sizes, where the scaling factor is 0.71 and the minimum size (minsize) is 20. The input images of different sizes are input into the first convolution network for convolution processing; feature maps are generated successively by convolution layers and pooling layers of different sizes, face contour points are judged from the feature maps, face candidate frames and frame regression vectors are generated after analysis by the first convolution network, and unqualified face candidate frames are removed by non-maximum suppression (threshold 0.707), thereby obtaining face candidate frames of several sizes. The obtained face candidate frames are input into the second convolutional network for training; unqualified face candidate frames are further removed by thresholding, highly overlapped face candidate frames are removed by non-maximum suppression (threshold 0.707), and a number of qualified face candidate frames are obtained after calibration. The qualified face candidate frames are then input into the third convolutional network for training; highly overlapped face candidate frames are again removed by non-maximum suppression (threshold 0.707) and calibrated, and the face frame information and face key point information containing the face position information are finally output.
When input images of various sizes are input into the first convolution network, convolution is performed first by a 3 × 64 convolution layer, then by a 3 × 32 convolution layer, and finally by a 3 × 16 convolution layer to obtain the face candidate frames of various sizes. When the face candidate frames of various sizes are input into the second convolution network, convolution is performed by a 3 × 16 convolution layer, then a 3 × 32 convolution layer, then a 3 × 64 convolution layer, and finally a 1 × 128 fully-connected layer to obtain the qualified face candidate frames. When the qualified face candidate frames are input into the third convolutional network, convolution is performed by a 3 × 32 convolution layer, then a 3 × 64 convolution layer, then a 2 × 128 convolution layer, and finally a 1 × 256 fully-connected layer to obtain the face frame information and face key point information.
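As a rough sketch (not part of the original filing), the multi-stage scaling of the image pyramid could be computed as below; the function name pyramid_scales is hypothetical and the 12-pixel first-stage input size is an assumption borrowed from the standard MTCNN setup, while the scaling factor 0.71 and minsize 20 follow the values given above:

def pyramid_scales(img_w, img_h, min_size=20, factor=0.71, net_input=12):
    """Compute the multi-stage scaling factors of the image pyramid: keep
    shrinking the image by `factor` until the smaller side falls below the
    first-stage network input size."""
    scales = []
    scale = net_input / min_size          # largest useful scale
    min_side = min(img_w, img_h) * scale
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales

# Example: pyramid scales for a 640 x 480 face sample image
print(pyramid_scales(640, 480))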
After the face frame information and face key point information are obtained, the loss functions of the first, second and third convolutional networks are calculated, and the three networks are trained by gradient-descent back propagation using these loss functions. The loss function of the first convolutional network is LP = 0.8 × L1 + 0.1 × L2 + 0.1 × L3; the loss function of the second convolutional network is LR = 0.1 × L1 + 0.6 × L2 + 0.3 × L3; and the loss function of the third convolutional network is LO = 0.2 × L1 + 0.4 × L2 + 0.4 × L3. Here L1 is the face classification loss (its formula appears as an image in the original filing), in which the network's final face output is compared against pi, the label of the real picture. L2 is the face candidate frame regression loss (formula likewise given as an image), in which the true label coordinates or width and height are compared against yi, the output face frame coordinates or width and height; i = 1 denotes the x value of the upper-left corner of the face frame, i = 2 the y value of the upper-left corner, i = 3 the width of the face frame, and i = 4 the height of the face frame. L3 is the face key point loss (formula likewise given as an image), in which the true key point coordinates are compared against yij, the output face key point coordinates; i = 1, 2, 3, 4, 5 denote the five face key points, namely the left-eye, right-eye, nose, left-mouth-corner and right-mouth-corner key points, j = 1 denotes the abscissa of the key point and j = 2 the ordinate.
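As a minimal illustration (not in the original filing), the per-stage weighting of the three loss terms described above can be expressed as follows; the function name stage_loss is hypothetical, and l1_cls, l2_box and l3_kpt stand in for the classification, box-regression and key-point losses whose exact formulas are given only as figures:

def stage_loss(l1_cls, l2_box, l3_kpt, stage):
    """Weighted sum of the classification (L1), box-regression (L2) and
    key-point (L3) losses for each of the three detection networks,
    using the weights stated above."""
    weights = {
        "first":  (0.8, 0.1, 0.1),   # LP
        "second": (0.1, 0.6, 0.3),   # LR
        "third":  (0.2, 0.4, 0.4),   # LO
    }
    a, b, c = weights[stage]
    return a * l1_cls + b * l2_box + c * l3_kpt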
S102, calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
in the step, the rotation angle is calculated by using the face key point information, so that the initial face image is rotated to obtain the target face image.
In one embodiment, the step S102 includes:
acquiring coordinate information of a left-eye key point and a right-eye key point, and calculating a face rotation angle based on the coordinate information of the left-eye key point and the right-eye key point;
and rotating the initial face image according to the face rotation angle by adopting a bilinear interpolation method to obtain a target face image.
In this embodiment, the face rotation angle is calculated using the coordinate information of the left-eye and right-eye key points, and the initial face image is rotated by this angle using bilinear interpolation to obtain the target face image. In this embodiment, let the coordinates of the left-eye key point be (x1, y1) and the coordinates of the right-eye key point be (x2, y2); the face rotation angle is then calculated as angle = arctan((y1 − y2) / (x1 − x2)).
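A possible Python sketch of this eye-based alignment is shown below (not part of the original filing); the use of OpenCV's warpAffine, the rotation centre and the sign convention are assumptions, while the angle formula follows the text above:

import math
import cv2  # assumption: OpenCV provides the bilinear rotation

def align_face(initial_face, left_eye, right_eye):
    """Rotate the initial face image so that the eyes become horizontal,
    using angle = arctan((y1 - y2) / (x1 - x2)) and bilinear interpolation."""
    (x1, y1), (x2, y2) = left_eye, right_eye
    angle_rad = math.atan2(y1 - y2, x1 - x2)   # atan2 avoids division by zero
    angle_deg = math.degrees(angle_rad)
    h, w = initial_face.shape[:2]
    # Assumption: rotate about the image centre.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(initial_face, M, (w, h), flags=cv2.INTER_LINEAR)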
S103, extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolution network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
in this step, a target convolution network is used to extract the coordinate information of the motion key points and the parameters of the motion key points corresponding to each frame of picture in the target face image and the initial motion video. And the motion key point parameter information is a 2 x 2 matrix, and the motion key point coordinate information and the motion key point parameters are 10.
In one embodiment, the step S103 includes:
inputting the target face image into a first target convolutional layer for convolution operation to obtain a first target convolution result, inputting the first target convolution result into a second target convolutional layer for convolution operation to obtain a second target convolution result, and inputting the second target convolution result into a third target convolutional layer for convolution operation to obtain a third target convolution result; inputting the third target convolution result into a full-link layer to carry out convolution operation, and obtaining the motion key point coordinate information and the motion key point parameters of the target face image;
inputting each frame of picture in the initial motion video into a first target convolutional layer for convolution operation to obtain a fourth target convolution result, inputting the fourth target convolution result into a second target convolutional layer for convolution operation to obtain a fifth target convolution result, and inputting the fifth target convolution result into a sixth target convolutional layer for convolution operation to obtain a sixth target convolution result; and inputting the sixth target convolution result into a full-link layer for convolution operation to obtain the motion key point coordinate information and the motion key point parameters of each frame of picture in the initial motion video.
In this embodiment, as shown in fig. 2, the target convolutional network is composed of 3 consecutive convolution layers and a fully-connected layer. When the motion key point coordinate information and motion key point parameters of the target face image are extracted, the target face image is input into a first target convolution layer (a 3 × 16 convolution layer) to obtain a first target convolution result; the first target convolution result is input into a second target convolution layer (a 3 × 32 convolution layer) to obtain a second target convolution result; the second target convolution result is input into a third target convolution layer (a 2 × 128 convolution layer) to obtain a third target convolution result; and the third target convolution result is input into a 1 × 256 fully-connected layer, yielding the motion key point coordinate information and motion key point parameters of the target face image. Similarly, when the motion key point coordinate information and motion key point parameters of each frame of picture in the initial motion video are extracted, each frame of picture is passed in turn through the first target convolution layer, the second target convolution layer, the third target convolution layer and the fully-connected layer, yielding its motion key point coordinate information and motion key point parameters.
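A PyTorch sketch of a network with this shape is given below (not part of the original filing); the kernel sizes, strides and pooling are assumptions interpreting the "3 × 16", "3 × 32", "2 × 128" and "1 × 256" layer sizes above, and the output is split into 10 key point coordinates and 10 2 × 2 parameter matrices as described:

import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Three convolution layers followed by a fully-connected layer that
    outputs, for 10 motion key points, a 2-D coordinate and a 2x2 parameter
    matrix each (10 * 2 + 10 * 4 = 60 values)."""
    def __init__(self, num_kpts=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 128, kernel_size=2, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, 256)
        self.head = nn.Linear(256, num_kpts * (2 + 4))

    def forward(self, x):
        f = self.features(x).flatten(1)
        f = torch.relu(self.fc(f))
        out = self.head(f)
        coords = out[:, :20].view(-1, 10, 2)      # motion key point coordinates
        params = out[:, 20:].view(-1, 10, 2, 2)   # 2x2 motion key point parameters
        return coords, params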
S104, performing coordinate conversion on the target face image to obtain a converted target face image based on the coordinate information and the motion key point parameters of the motion key point of the target face image and the coordinate information and the motion key point parameters of the motion key point of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face motion image;
in this step, according to the coordinate information and the parameters of the motion key points of the target face image and the coordinate information and the parameters of the motion key points of the current frame picture in the initial motion video, the coordinate conversion is performed on the target face image to obtain a converted target face image, then the convolution sampling is performed on the target face image before and after the conversion to generate a face action image corresponding to the current frame, and the face action image corresponding to each frame picture in the initial motion video is generated according to the method.
In an embodiment, the performing coordinate transformation on the target face image based on the coordinate information and the parameter of the motion key point of the target face image and the coordinate information and the parameter of the motion key point of each frame of picture in the initial motion video to obtain the transformed target face image includes:
and carrying out coordinate conversion on the target face image according to the following formula:
Ztn = SKn + SPn × (1/DfPn) × (z − DfKn)
where SKn is the coordinate information of the nth motion key point of the target face image, SPn is the nth motion key point parameter of the target face image, DfPn is the nth motion key point parameter of one of the frames in the initial motion video, DfKn is the coordinate information of the nth motion key point of that frame, z is an original pixel point of the target face image, and Ztn is the current coordinate information of that original pixel point of the target face image;
and based on the number of the motion key points, extracting pixel values at corresponding positions of the target face image according to the coordinate information of the motion key points after coordinate conversion, and filling the pixel values to original pixel points to obtain a target face image after multi-frame conversion corresponding to one frame in the initial motion video.
In this embodiment, coordinate conversion is performed on the target face image according to the above formula; pixel values are then extracted at the corresponding positions of the target face image according to the converted motion key point coordinates and filled back to the original pixel points, so as to obtain the multi-frame converted target face images corresponding to one frame in the initial motion video. Specifically, the motion key points of the target face image are SK1, SK2, …, SK10 and its motion key point parameter information is SP1, SP2, …, SP10; the motion key points of a certain frame f in the initial motion video are DfK1, DfK2, …, DfKn and its motion key point parameter information is DfP1, DfP2, …, DfPn. The 10 key points of the target face image are transformed in turn according to the above formula, yielding 10 coordinate-transformed target face images corresponding to frame f in the initial motion video.
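The per-key-point coordinate conversion could be sketched as follows (not part of the original filing); reading the 1/DfPn term as a 2 × 2 matrix inverse and applying the formula over a dense coordinate grid are assumptions:

import numpy as np

def warp_coordinates(grid, SK, SP, DfK, DfP):
    """Apply Zt_n = SK_n + SP_n * (1/DfP_n) * (z - DfK_n) to every original
    pixel coordinate z of the target face image, for each motion key point.
    grid has shape (H, W, 2); SK, DfK have shape (10, 2); SP, DfP have
    shape (10, 2, 2)."""
    warped = []
    for n in range(SK.shape[0]):
        inv_dfp = np.linalg.inv(DfP[n])      # 1 / DfP_n, read as matrix inverse
        A = SP[n] @ inv_dfp                  # SP_n * (1/DfP_n)
        z = grid.reshape(-1, 2) - DfK[n]     # z - DfK_n
        zt = SK[n] + z @ A.T                 # Zt_n
        warped.append(zt.reshape(grid.shape))
    return np.stack(warped)                  # (10, H, W, 2) warped coordinates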
In an embodiment, the performing convolution sampling on the transformed target face image and the target face image before transformation to generate a multi-frame face motion image includes:
inputting the target face image before transformation and a plurality of frames of transformed target face images into a fourth convolution network for convolution operation to generate a frame of face action image corresponding to one frame in the initial motion video;
inputting the generated human face motion image and the corresponding frame picture in the initial motion video into a trained VGG19 network for convolution processing, obtaining each layer of output result of the VGG19 network, and calculating an L1 loss function according to each layer of output result of the VGG19 network;
and performing gradient descent back propagation training on the target convolutional network and the fourth convolutional network by using the L1 loss function to obtain the optimized target convolutional network and the optimized fourth convolutional network.
In this embodiment, the multiple converted target face images corresponding to each frame of the initial motion video and the target face image before conversion are input into a fourth convolution network for convolution operation, so as to generate the face motion image corresponding to that frame; the face motion image corresponding to each frame and the corresponding frame picture in the initial motion video are then input into a pre-trained VGG19 network for convolution processing, an L1 loss function is calculated from the output of each layer of the VGG19 network, and gradient-descent back propagation training is performed on the target convolutional network and the fourth convolutional network using this L1 loss function. Let VGGn(f) denote the output of the nth layer of the VGG19 network for the corresponding frame picture in the initial motion video, and VGGn(g) the output of the nth layer for the face motion image generated for that frame; the L1 loss function is then: L1loss = Σ |VGGn(f) − VGGn(g)|.
In this embodiment, as shown in fig. 3, the fourth convolution network is composed of 3 continuous convolution layers and 3 continuous upsampling layers. When the multiple transformed target face images corresponding to a frame of the initial motion video and the target face image before transformation are input into the fourth convolution network, they are first convolved by a 3 × 16 convolution layer, then by a 3 × 32 convolution layer, and then by a 2 × 128 convolution layer; the result is then passed through a 3 × 128 upsampling layer and a 3 × 16 upsampling layer, and one frame of face motion image is finally output.
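A PyTorch sketch of the VGG19-based L1 loss described above is given below (not part of the original filing); which feature layers are tapped, and the torchvision weights argument, are assumptions:

import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGG19PerceptualL1(nn.Module):
    """L1loss = sum_n |VGGn(f) - VGGn(g)|: compare the layer outputs of a
    frozen, pre-trained VGG19 for the real frame f and the generated face
    motion image g."""
    def __init__(self, taps=(3, 8, 17, 26, 35)):  # assumed block-end ReLUs
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.taps = set(taps)

    def forward(self, f, g):
        loss, xf, xg = 0.0, f, g
        for i, layer in enumerate(self.vgg):
            xf, xg = layer(xf), layer(xg)
            if i in self.taps:
                loss = loss + (xf - xg).abs().sum()
        return loss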
And S105, reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
In this step, after each face motion image is reversely rotated according to the face rotation angle, the face motion images are attached to corresponding frames to obtain a target motion video.
In one embodiment, the step S105 includes:
reversely rotating the face action image according to the face rotation angle, and attaching the rotated face action image to a corresponding frame of the target motion video;
and acquiring the vertex coordinates of each frame of the face action image, and performing edge processing on the face action image according to the vertex coordinates to obtain a final target motion video.
In this embodiment, after all the face motion images are reversely rotated according to the face rotation angle, they are attached back to the corresponding frames in the initial motion video and edge processing is performed on them, so as to obtain the final output target motion video. After a face action image is attached back to its corresponding frame, the following filter is used to process the roughly 10 pixel points around the attachment position; pixel points located at the boundary are filled with 0 values. The filter is specifically:
Kernel = [[0.2, 0.1, 0.1],
          [0.1, 0.0, 0.1],
          [0.1, 0.1, 0.2]]
the coordinates of the vertices of the top left corner, top right corner, bottom left corner and bottom right corner after the face motion image is pasted back are represented by (x0, y0), (x0, y1), (x1, y0), (x1, y1), the pixel value of the vertex of the top left corner is represented by V (x0, y0), and (x) is usedn,yn) The representation is located at (x)0-5,y0-5)~(x0+5,y1+5),(x0-5,y0-5)~(x1+5,y0+5),(x0-5,y1-5)~(x1+5,y1+5),(x1-5,y0-5)~(x1+5,y1+5) internal pixel points, and calculating the values of all the points in the internal pixel points as follows:
V(xn,yn)=0.2*V(xn-1,yn-1)+0.1*V(xn,yn-1)+0.1*V(xn,yn+1)+0.1*V(xn-1,yn+1)+0.0*V(xn,yn)+0.1*V(xn+1,yn)+0.1*V(xn-1,yn+1)+0.1*V(xn,yn+1)+0.2*V(xn+1,yn+1)
if the calculation result exceeds the image boundary, the value is represented by V-0.
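Applying this seam filter to a single pixel could be sketched as follows (not part of the original filing); handling a single grayscale channel is a simplification, and out-of-boundary neighbours contribute 0 as stated above:

import numpy as np

KERNEL = np.array([[0.2, 0.1, 0.1],
                   [0.1, 0.0, 0.1],
                   [0.1, 0.1, 0.2]])

def smooth_seam_pixel(image, x, y):
    """Recompute one pixel near the paste seam as the kernel-weighted sum of
    its 3x3 neighbourhood; neighbours outside the image contribute V = 0."""
    h, w = image.shape
    value = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            xx, yy = x + dx, y + dy
            v = image[yy, xx] if 0 <= xx < w and 0 <= yy < h else 0.0
            value += KERNEL[dy + 1, dx + 1] * v
    return value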
Referring to fig. 4, fig. 4 is a schematic block diagram of a static face dynamic system according to an embodiment of the present invention, where the static face dynamic system 200 includes:
an initial face image obtaining unit 201, configured to perform face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and cut the face sample image according to the face frame information to obtain an initial face image;
a target face image obtaining unit 202, configured to calculate a face rotation angle in the initial face image according to the face key point information and rotate the initial face image to obtain a target face image;
an information parameter extraction unit 203, configured to extract, by using a target convolutional network, motion key point coordinate information and motion key point parameters of the target face image; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
a face motion image generating unit 204, configured to perform coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and perform convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face motion image;
and the target motion video acquiring unit 205 is configured to perform reverse rotation on the face motion image according to the face rotation angle and attach the face motion image to a corresponding frame to obtain a target motion video.
In one embodiment, the initial face image obtaining unit 201 includes:
the face candidate frame generating unit is used for carrying out multi-stage scaling on the face sample image to obtain a plurality of input images with different sizes, inputting the input images into a first convolution network for convolution processing, and generating a face candidate frame with a corresponding size;
the face candidate frame screening unit is used for inputting the face candidate frames with the corresponding sizes into a second convolutional network for training, and screening qualified face candidate frames;
and the face information acquisition unit is used for inputting the qualified face candidate box into a third convolutional network for training to obtain final face frame information and face key point information.
In one embodiment, the target face image obtaining unit 202 includes:
the face rotation angle calculation unit is used for acquiring coordinate information of the left eye key point and the right eye key point and calculating a face rotation angle based on the coordinate information of the left eye key point and the right eye key point;
and the image rotating unit is used for rotating the initial face image according to the face rotation angle by adopting a bilinear interpolation method to obtain a target face image.
In an embodiment, the information parameter extracting unit 203 includes:
the face image convolution unit is used for inputting the target face image into a first target convolution layer for convolution operation to obtain a first target convolution result, inputting the first target convolution result into a second target convolution layer for convolution operation to obtain a second target convolution result, and inputting the second target convolution result into a third target convolution layer for convolution operation to obtain a third target convolution result; inputting the third target convolution result into a full-link layer to carry out convolution operation, and obtaining the motion key point coordinate information and the motion key point parameters of the target face image;
the motion video frame convolution unit is used for inputting each frame of picture in the initial motion video into a first target convolution layer for convolution operation to obtain a first target convolution result, inputting the first target convolution result into a second target convolution layer for convolution operation to obtain a second target convolution result, and inputting the second target convolution result into a third target convolution layer for convolution operation to obtain a third target convolution result; and inputting the third target convolution result to a full-link layer for convolution operation to obtain the motion key point coordinate information and the motion key point parameters of each frame of picture in the initial motion video.
In one embodiment, the facial motion image generation unit 204 includes:
the formula calculation unit is used for carrying out coordinate conversion on the target face image according to the following formula:
Ztn = SKn + SPn × (1/DfPn) × (z − DfKn)
where SKn is the coordinate information of the nth motion key point of the target face image, SPn is the nth motion key point parameter of the target face image, DfPn is the nth motion key point parameter of one of the frames in the initial motion video, DfKn is the coordinate information of the nth motion key point of that frame, z is an original pixel point of the target face image, and Ztn is the current coordinate information of that original pixel point of the target face image;
and the pixel extraction unit is used for extracting pixel values at corresponding positions of the target face image according to the coordinate information of the motion key points after coordinate conversion based on the number of the motion key points, and filling the pixel values to original pixel points to obtain a target face image after multi-frame conversion corresponding to one frame in the initial motion video.
In one embodiment, the facial motion image generation unit 204 includes:
the convolution network convolution unit is used for inputting the target face image before transformation and the target face image after multi-frame transformation into a fourth convolution network for convolution operation to generate a frame of face action image corresponding to one frame in the initial motion video;
and the back propagation training unit is used for inputting the generated face motion image and the corresponding frame picture in the initial motion video into the VGG19 for classification, and performing gradient descent back propagation training by adopting a loss function to obtain a plurality of frames of face motion images.
In one embodiment, the target motion video acquiring unit 205 includes:
the image reverse rotation unit is used for reversely rotating the face action image according to the face rotation angle and attaching the rotated face action image to the corresponding frame of the target motion video;
and the edge processing unit is used for acquiring the vertex coordinates of each frame of the human face action image and carrying out edge processing on the human face action image according to the vertex coordinates to obtain a final target motion video.
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the static human face dynamic method as described above is implemented.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the static human face dynamic method is implemented as described above.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A static face dynamization method is characterized by comprising the following steps:
carrying out face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and shearing the face sample image according to the face frame information to obtain an initial face image;
calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolutional network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
performing coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face action image;
and reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
2. The static human face dynamic method according to claim 1, wherein the obtaining human face frame information and human face key point information by performing human face detection on the human face sample image by using a human face detection network comprises:
carrying out multi-stage scaling on the face sample image to obtain a plurality of input images with different sizes, inputting the input images into a first convolution network for convolution processing, and generating a face candidate frame with a corresponding size;
inputting the face candidate frames with the corresponding sizes into a second convolution network for training, and screening out qualified face candidate frames;
and inputting the qualified face candidate box into a third convolutional network for training to obtain final face frame information and face key point information.
3. The static human face dynamic method according to claim 1, wherein the calculating a human face rotation angle in the initial human face image according to the human face key point information and rotating the initial human face image to obtain a target human face image comprises:
acquiring coordinate information of a left-eye key point and a right-eye key point, and calculating a face rotation angle based on the coordinate information of the left-eye key point and the right-eye key point;
and rotating the initial face image according to the face rotation angle by adopting a bilinear interpolation method to obtain a target face image.
4. The static human face dynamic method according to claim 1, wherein the extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolutional network, and the extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network, comprise:
inputting the target face image into a first target convolutional layer for convolution operation to obtain a first target convolution result, inputting the first target convolution result into a second target convolutional layer for convolution operation to obtain a second target convolution result, and inputting the second target convolution result into a third target convolutional layer for convolution operation to obtain a third target convolution result; inputting the third target convolution result into a full-link layer to carry out convolution operation, and obtaining the motion key point coordinate information and the motion key point parameters of the target face image;
inputting each frame of picture in the initial motion video into a first target convolutional layer for convolution operation to obtain a fourth target convolution result, inputting the fourth target convolution result into a second target convolutional layer for convolution operation to obtain a fifth target convolution result, and inputting the fifth target convolution result into a sixth target convolutional layer for convolution operation to obtain a sixth target convolution result; and inputting the sixth target convolution result into a full-link layer for convolution operation to obtain the motion key point coordinate information and the motion key point parameters of each frame of picture in the initial motion video.
5. The method according to claim 1, wherein the transforming the coordinates of the target face image into the transformed target face image based on the coordinate information and the parameter of the motion key point of the target face image and the coordinate information and the parameter of the motion key point of each frame of image in the initial motion video comprises:
and carrying out coordinate conversion on the target face image according to the following formula:
Ztn = SKn + SPn × (1/DfPn) × (z − DfKn)
where SKn is the coordinate information of the nth motion key point of the target face image, SPn is the nth motion key point parameter of the target face image, DfPn is the nth motion key point parameter of one of the frames in the initial motion video, DfKn is the coordinate information of the nth motion key point of that frame, z is an original pixel point of the target face image, and Ztn is the current coordinate information of that original pixel point of the target face image;
and based on the number of the motion key points, extracting pixel values at corresponding positions of the target face image according to the coordinate information of the motion key points after coordinate conversion, and filling the pixel values to original pixel points to obtain a target face image after multi-frame conversion corresponding to one frame in the initial motion video.
6. The static human face dynamic method according to claim 1, wherein the performing convolution sampling on the transformed target human face image and the target human face image before transformation to generate a multi-frame human face motion image comprises:
inputting the target face image before transformation and a plurality of frames of transformed target face images into a fourth convolution network for convolution operation to generate a frame of face action image corresponding to one frame in the initial motion video;
inputting the generated human face motion image and the corresponding frame picture in the initial motion video into a trained VGG19 network for convolution processing, obtaining each layer of output result of the VGG19 network, and calculating an L1 loss function according to each layer of output result of the VGG19 network;
and performing gradient descent back propagation training on the target convolutional network and the fourth convolutional network by using the L1 loss function to obtain the optimized target convolutional network and the optimized fourth convolutional network.
7. The static human face dynamic method according to claim 1, wherein the reversely rotating the human face motion image according to the human face rotation angle and attaching the human face motion image to a corresponding frame to obtain a final target motion video comprises:
reversely rotating the face action image according to the face rotation angle, and attaching the rotated face action image to a corresponding frame of the target motion video;
and acquiring the vertex coordinates of each frame of the face action image, and performing edge processing on the face action image according to the vertex coordinates to obtain a final target motion video.
8. A static face dynamization system, comprising:
an initial face image obtaining unit, configured to perform face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and cut the face sample image according to the face frame information to obtain an initial face image;
the target face image acquisition unit is used for calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
the information parameter extraction unit is used for extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolution network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
the face motion image generating unit is used for performing coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face motion image;
and the target motion video acquisition unit is used for reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
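A hedged Python sketch of how the units of this system could be composed; every injected callable is a placeholder standing in for the corresponding network or processing step named in the claims, not a real library API:

from typing import Callable, Iterable, List

class StaticFaceDynamizationSystem:
    def __init__(self, detector: Callable, rotate: Callable,
                 keypoint_net: Callable, generator: Callable, paste: Callable):
        self.detector = detector          # initial face image acquisition unit
        self.rotate = rotate              # target face image acquisition unit
        self.keypoint_net = keypoint_net  # information parameter extraction unit
        self.generator = generator        # face motion image generating unit
        self.paste = paste                # target motion video acquisition unit

    def run(self, face_image, driving_video: Iterable) -> List:
        box, angle = self.detector(face_image)                 # face frame info + rotation angle
        face = self.rotate(face_image, box, angle)             # cropped, rotated target face image
        src_kp, src_param = self.keypoint_net(face)
        frames = []
        for drv in driving_video:
            drv_kp, drv_param = self.keypoint_net(drv)
            moved = self.generator(face, src_kp, src_param, drv_kp, drv_param)
            frames.append(self.paste(moved, drv, angle, box))  # rotate back and attach to the frame
        return frames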
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the static face dynamization method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the static face dynamization method according to any one of claims 1 to 7.
CN202111002870.9A 2021-08-30 2021-08-30 Static face dynamic method, system, computer equipment and readable storage medium Active CN113688753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111002870.9A CN113688753B (en) 2021-08-30 2021-08-30 Static face dynamic method, system, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111002870.9A CN113688753B (en) 2021-08-30 2021-08-30 Static face dynamic method, system, computer equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113688753A true CN113688753A (en) 2021-11-23
CN113688753B CN113688753B (en) 2023-09-29

Family

ID=78583973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111002870.9A Active CN113688753B (en) 2021-08-30 2021-08-30 Static face dynamic method, system, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113688753B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070052698A1 (en) * 2003-07-11 2007-03-08 Ryuji Funayama Image processing apparatus, image processing method, image processing program, and recording medium
CN105184860A (en) * 2015-09-30 2015-12-23 南京邮电大学 Method for reconstructing dense three-dimensional structure and motion field of dynamic face simultaneously
CN109711258A (en) * 2018-11-27 2019-05-03 哈尔滨工业大学(深圳) Lightweight face critical point detection method, system and storage medium based on convolutional network
CN111696185A (en) * 2019-03-12 2020-09-22 北京奇虎科技有限公司 Method and device for generating dynamic expression image sequence by using static face image
CN110647865A (en) * 2019-09-30 2020-01-03 腾讯科技(深圳)有限公司 Face gesture recognition method, device, equipment and storage medium
CN111753782A (en) * 2020-06-30 2020-10-09 西安深信科创信息技术有限公司 False face detection method and device based on double-current network and electronic equipment
CN112733616A (en) * 2020-12-22 2021-04-30 北京达佳互联信息技术有限公司 Dynamic image generation method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523636A (en) * 2023-11-24 2024-02-06 北京远鉴信息技术有限公司 Face detection method and device, electronic equipment and storage medium
CN117523636B (en) * 2023-11-24 2024-06-18 北京远鉴信息技术有限公司 Face detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113688753B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN110490896B (en) Video frame image processing method and device
CN112184585B (en) Image completion method and system based on semantic edge fusion
CN111861872B (en) Image face changing method, video face changing method, device, equipment and storage medium
CN107316286B (en) Method and device for synchronously synthesizing and removing rain and fog in image
JP2023539691A (en) Human image restoration methods, devices, electronic devices, storage media, and program products
KR101028628B1 (en) Image texture filtering method, storage medium of storing program for executing the same and apparatus performing the same
CN112991165B (en) Image processing method and device
WO2023066173A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN113688753A (en) Static face dynamic method, system, computer equipment and readable storage medium
CN111311732B (en) 3D human body grid acquisition method and device
CN111932594B (en) Billion pixel video alignment method and device based on optical flow and medium
CN110298229B (en) Video image processing method and device
CN113240584A (en) Multitask gesture picture super-resolution method based on picture edge information
US7522189B2 (en) Automatic stabilization control apparatus, automatic stabilization control method, and computer readable recording medium having automatic stabilization control program recorded thereon
CN116563497A (en) Virtual person driving method, device, equipment and readable storage medium
Hongying et al. Image completion by a fast and adaptive exemplar-based image inpainting
Wang Single image super-resolution with u-net generative adversarial networks
CN115116468A (en) Video generation method and device, storage medium and electronic equipment
JP2004264919A (en) Image processing method
Cho et al. Example-based super-resolution using self-patches and approximated constrained least squares filter
CN108364273B (en) Method for multi-focus image fusion in spatial domain
CN105469399A (en) Face super-resolution reconstruction method facing mixed noises and apparatus thereof
CN108133459B (en) Depth map enhancement method and depth map enhancement device
CN112884664B (en) Image processing method, device, electronic equipment and storage medium
WO2024034388A1 (en) Image processing device, image processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant