CN113688753A - Static face dynamic method, system, computer equipment and readable storage medium - Google Patents

Static face dynamic method, system, computer equipment and readable storage medium Download PDF

Info

Publication number
CN113688753A
CN113688753A (Application CN202111002870.9A)
Authority
CN
China
Prior art keywords
face
target
image
motion
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111002870.9A
Other languages
Chinese (zh)
Other versions
CN113688753B (en)
Inventor
谌竟成
齐镗泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wondershare Software Co Ltd
Original Assignee
Shenzhen Wondershare Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wondershare Software Co Ltd filed Critical Shenzhen Wondershare Software Co Ltd
Priority to CN202111002870.9A priority Critical patent/CN113688753B/en
Publication of CN113688753A publication Critical patent/CN113688753A/en
Application granted granted Critical
Publication of CN113688753B publication Critical patent/CN113688753B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a static face dynamization method, system, computer device and readable storage medium. The method comprises: performing face detection on a face sample image to obtain face frame information and face key point information, and cropping the image to obtain an initial face image; calculating a face rotation angle and rotating the image to obtain a target face image; extracting the motion key point coordinate information and motion key point parameters of the target face image and of each frame of picture in an initial motion video, performing coordinate conversion, and performing convolution sampling to generate multiple frames of face action images; and reversely rotating the face action images according to the face rotation angle and attaching them to the corresponding frames to obtain a target motion video. The invention first detects the face, processes the face image and the motion video with convolutional networks, and finally attaches the face action images to the corresponding frames of the motion video to obtain the target motion video.

Description

Static face dynamic method, system, computer equipment and readable storage medium
Technical Field
The invention relates to the technical field of video editing, in particular to a static face dynamic method, a static face dynamic system, computer equipment and a readable storage medium.
Background
At present, some short video editing software and short-video social apps in China offer a function of animating static images. For faces, the mainstream approach is face swapping: the face in each frame of an action video is extracted and then replaced or fused into the static image, and the static image is combined with the synthesized video, thereby animating the static image. Other apps simply splice 8 segments of different face action videos around the static image in different spatial directions, so that the whole video looks dynamic. Although existing short-video apps can offer such an AI function for animating static images, there is still a large gap with respect to what users actually need: 1. The user wants the face in the static image to reproduce the motion of the face in the action video. Extracting the face from the action video and pasting it back onto the static image, even when face fusion preserves part of the features of the original static face, visibly changes the main face, so the result is not natural or smooth enough. 2. Spatially splicing different face action videos around the static face only makes the whole video look dynamic; the static image in the middle is not actually animated, and the dynamic quality and fluency of the whole video are low.
Disclosure of Invention
The embodiment of the invention provides a method, a system, computer equipment and a readable storage medium for dynamizing a static face, aiming to solve the problems in the prior art that the animation of a static face is not smooth enough and its quality is not high.
In a first aspect, an embodiment of the present invention provides a static face dynamic method, including:
carrying out face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and shearing the face sample image according to the face frame information to obtain an initial face image;
calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolutional network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
performing coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face action image;
and reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
In a second aspect, an embodiment of the present invention provides a static face dynamic system, which includes:
an initial face image obtaining unit, configured to perform face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and cut the face sample image according to the face frame information to obtain an initial face image;
the target face image acquisition unit is used for calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
the information parameter extraction unit is used for extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolution network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
the target motion video generation unit is used for carrying out coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, carrying out convolution sampling on the converted target face image and the target face image before conversion, and generating a multi-frame face motion image;
and the target motion video acquisition unit is used for reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the static human face dynamic method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the static face dynamic method according to the first aspect.
The embodiment of the invention provides a static face dynamic method, a system, computer equipment and a readable storage medium, wherein the method comprises the following steps: carrying out face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and shearing the face sample image according to the face frame information to obtain an initial face image; calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image; extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolutional network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network; performing coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face action image; and reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video. The embodiment of the invention detects all faces in the image by using the face detection network, processes the face image and the motion video by using the convolution network, and then attaches the face action image to the corresponding frame of the motion video to obtain the target motion video, so that the whole process is simpler, and the generated target motion video is smoother and has better visual effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a static face dynamic method according to an embodiment of the present invention;
fig. 2 is a diagram of a target convolutional network architecture of the static face dynamic method according to the embodiment of the present invention;
fig. 3 is a fourth convolution network architecture diagram of the static face dynamic method according to the embodiment of the present invention;
fig. 4 is a schematic block diagram of a static face dynamization system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart of a static face dynamic method according to an embodiment of the present invention, where the method includes steps S101 to S105.
S101, carrying out face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and cutting the face sample image according to the face frame information to obtain an initial face image;
In this step, a face sample image containing a face is input into a face detection network for face detection to obtain face frame information and face key point information, and the face sample image is cropped according to the face frame information to obtain an initial face image containing the face. The face frame information comprises the coordinates of the upper-left corner of the face frame, the face frame width and the face frame height; the face key point information comprises a left-eye key point, a right-eye key point, a nose key point, a left mouth corner key point and a right mouth corner key point. If the face sample image contains several faces, all of them are detected and the face frame information and face key point information of every face are acquired. If some of these faces are partially occluded or overlapped, then after the initial face images corresponding to all the faces are obtained, it is judged whether the initial face image of each face is complete; incomplete initial face images are repaired with a face repairing algorithm to obtain complete faces.
In this embodiment, the MTCNN network is used as the face detection network to perform face detection on the face sample image and obtain the face frame information and face key point information. The face sample image is cropped according to the width and height of the face frame: the width and height of the face frame are compensated, and the width and height of the initial face image are calculated from the compensation results. Specifically, the width of the initial face image is W = w + offsetW, where W is the width of the initial face image, w is the width of the face frame and offsetW is the width compensation result; the height of the initial face image is H = h + offsetH, where H is the height of the initial face image, h is the height of the face frame and offsetH is the height compensation result. In this embodiment, offsetW = (3/8)·w and offsetH = (3/8)·h.
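For illustration only (this sketch does not appear in the original filing), the compensated crop described above could be written in Python as follows; the function name crop_initial_face and the centring of the enlarged frame on the detected face frame are assumptions, since the text only specifies the enlarged width and height:

import numpy as np

def crop_initial_face(image, x, y, w, h, pad_ratio=3/8):
    """Crop the face region from the sample image, enlarging the detected
    face frame (top-left (x, y), width w, height h) by pad_ratio * w and
    pad_ratio * h, clipping the result to the image bounds."""
    offset_w = pad_ratio * w          # offsetW = (3/8) * w
    offset_h = pad_ratio * h          # offsetH = (3/8) * h
    W = int(round(w + offset_w))      # width of the initial face image
    H = int(round(h + offset_h))      # height of the initial face image
    # Assumption: keep the enlarged box centred on the detected face frame.
    x0 = max(0, int(round(x - offset_w / 2)))
    y0 = max(0, int(round(y - offset_h / 2)))
    img_h, img_w = image.shape[:2]
    x1 = min(img_w, x0 + W)
    y1 = min(img_h, y0 + H)
    return image[y0:y1, x0:x1]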
In an embodiment, the performing face detection on a face sample image by using a face detection network to obtain face frame information and face key point information includes:
carrying out multi-stage scaling on the face sample image to obtain a plurality of input images with different sizes, inputting the input images into a first convolution network for convolution processing, and generating a face candidate frame with a corresponding size;
inputting the face candidate frames with the corresponding sizes into a second convolution network for training, and screening out qualified face candidate frames;
and inputting the qualified face candidate box into a third convolutional network for training to obtain final face frame information and face key point information.
In this embodiment, after the face sample image is subjected to multi-stage scaling, a plurality of input images with different sizes are obtained, and are input into the first convolution network for convolution processing, then the convolution result of the first convolution network is input into the second convolution network for convolution processing, and finally the convolution result of the second convolution network is input into the third convolution network for convolution processing, so that the final face frame information and the face key point information are obtained.
Specifically, the method comprises the following steps: the face sample image is input into an image pyramid of preset size for multi-stage scaling to obtain a plurality of input images of different sizes, where the scaling factor is 0.71 and the minimum size (minsize) is 20. The input images of different sizes are input into the first convolution network for convolution processing; feature maps are generated successively by convolution layers and pooling layers of different sizes, face contour points are judged from the feature maps, face candidate frames and frame regression vectors are generated after analysis by the first convolution network, and unqualified face candidate frames are removed by non-maximum suppression (threshold 0.707), thereby obtaining face candidate frames of several sizes. The obtained face candidate frames are input into the second convolutional network for training; unqualified face candidate frames are further removed by thresholding, highly overlapped face candidate frames are removed by non-maximum suppression (threshold 0.707), and a number of qualified face candidate frames are obtained after calibration. The qualified face candidate frames are then input into the third convolutional network for training; highly overlapped face candidate frames are again removed by non-maximum suppression (threshold 0.707) and calibrated, and the face frame information and face key point information containing the face position information are finally output.
When input images of various sizes are input into the first convolution network, convolution is performed first by a 3 × 64 convolution layer, then by a 3 × 32 convolution layer, and finally by a 3 × 16 convolution layer to obtain the face candidate frames of various sizes. When the face candidate frames of various sizes are input into the second convolution network, convolution is performed by a 3 × 16 convolution layer, then a 3 × 32 convolution layer, then a 3 × 64 convolution layer, and finally a 1 × 128 fully-connected layer to obtain the qualified face candidate frames. When the qualified face candidate frames are input into the third convolutional network, convolution is performed by a 3 × 32 convolution layer, then a 3 × 64 convolution layer, then a 2 × 128 convolution layer, and finally a 1 × 256 fully-connected layer to obtain the face frame information and face key point information.
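As a rough sketch (not part of the original filing), the multi-stage scaling of the image pyramid could be computed as below; the function name pyramid_scales is hypothetical and the 12-pixel first-stage input size is an assumption borrowed from the standard MTCNN setup, while the scaling factor 0.71 and minsize 20 follow the values given above:

def pyramid_scales(img_w, img_h, min_size=20, factor=0.71, net_input=12):
    """Compute the multi-stage scaling factors of the image pyramid: keep
    shrinking the image by `factor` until the smaller side falls below the
    first-stage network input size."""
    scales = []
    scale = net_input / min_size          # largest useful scale
    min_side = min(img_w, img_h) * scale
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales

# Example: pyramid scales for a 640 x 480 face sample image
print(pyramid_scales(640, 480))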
After the face frame information and face key point information are obtained, the loss functions of the first, second and third convolutional networks are calculated, and the three networks are trained by gradient-descent back propagation using these loss functions. The loss function of the first convolutional network is LP = 0.8 × L1 + 0.1 × L2 + 0.1 × L3; the loss function of the second convolutional network is LR = 0.1 × L1 + 0.6 × L2 + 0.3 × L3; and the loss function of the third convolutional network is LO = 0.2 × L1 + 0.4 × L2 + 0.4 × L3. Here L1 is the face classification loss (its formula appears as an image in the original filing), in which the network's final face output is compared against pi, the label of the real picture. L2 is the face candidate frame regression loss (formula likewise given as an image), in which the true label coordinates or width and height are compared against yi, the output face frame coordinates or width and height; i = 1 denotes the x value of the upper-left corner of the face frame, i = 2 the y value of the upper-left corner, i = 3 the width of the face frame, and i = 4 the height of the face frame. L3 is the face key point loss (formula likewise given as an image), in which the true key point coordinates are compared against yij, the output face key point coordinates; i = 1, 2, 3, 4, 5 denote the five face key points, namely the left-eye, right-eye, nose, left-mouth-corner and right-mouth-corner key points, j = 1 denotes the abscissa of the key point and j = 2 the ordinate.
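As a minimal illustration (not in the original filing), the per-stage weighting of the three loss terms described above can be expressed as follows; the function name stage_loss is hypothetical, and l1_cls, l2_box and l3_kpt stand in for the classification, box-regression and key-point losses whose exact formulas are given only as figures:

def stage_loss(l1_cls, l2_box, l3_kpt, stage):
    """Weighted sum of the classification (L1), box-regression (L2) and
    key-point (L3) losses for each of the three detection networks,
    using the weights stated above."""
    weights = {
        "first":  (0.8, 0.1, 0.1),   # LP
        "second": (0.1, 0.6, 0.3),   # LR
        "third":  (0.2, 0.4, 0.4),   # LO
    }
    a, b, c = weights[stage]
    return a * l1_cls + b * l2_box + c * l3_kpt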
S102, calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
in the step, the rotation angle is calculated by using the face key point information, so that the initial face image is rotated to obtain the target face image.
In one embodiment, the step S102 includes:
acquiring coordinate information of a left-eye key point and a right-eye key point, and calculating a face rotation angle based on the coordinate information of the left-eye key point and the right-eye key point;
and rotating the initial face image according to the face rotation angle by adopting a bilinear interpolation method to obtain a target face image.
In this embodiment, the face rotation angle is calculated using the coordinate information of the left-eye and right-eye key points, and the initial face image is rotated by this angle using bilinear interpolation to obtain the target face image. In this embodiment, let the coordinates of the left-eye key point be (x1, y1) and the coordinates of the right-eye key point be (x2, y2); the face rotation angle is then calculated as angle = arctan((y1 − y2) / (x1 − x2)).
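A possible Python sketch of this eye-based alignment is shown below (not part of the original filing); the use of OpenCV's warpAffine, the rotation centre and the sign convention are assumptions, while the angle formula follows the text above:

import math
import cv2  # assumption: OpenCV provides the bilinear rotation

def align_face(initial_face, left_eye, right_eye):
    """Rotate the initial face image so that the eyes become horizontal,
    using angle = arctan((y1 - y2) / (x1 - x2)) and bilinear interpolation."""
    (x1, y1), (x2, y2) = left_eye, right_eye
    angle_rad = math.atan2(y1 - y2, x1 - x2)   # atan2 avoids division by zero
    angle_deg = math.degrees(angle_rad)
    h, w = initial_face.shape[:2]
    # Assumption: rotate about the image centre.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(initial_face, M, (w, h), flags=cv2.INTER_LINEAR)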
S103, extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolution network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
in this step, a target convolution network is used to extract the coordinate information of the motion key points and the parameters of the motion key points corresponding to each frame of picture in the target face image and the initial motion video. And the motion key point parameter information is a 2 x 2 matrix, and the motion key point coordinate information and the motion key point parameters are 10.
In one embodiment, the step S103 includes:
inputting the target face image into a first target convolutional layer for convolution operation to obtain a first target convolution result, inputting the first target convolution result into a second target convolutional layer for convolution operation to obtain a second target convolution result, and inputting the second target convolution result into a third target convolutional layer for convolution operation to obtain a third target convolution result; inputting the third target convolution result into a full-link layer to carry out convolution operation, and obtaining the motion key point coordinate information and the motion key point parameters of the target face image;
inputting each frame of picture in the initial motion video into a first target convolutional layer for convolution operation to obtain a fourth target convolution result, inputting the fourth target convolution result into a second target convolutional layer for convolution operation to obtain a fifth target convolution result, and inputting the fifth target convolution result into a sixth target convolutional layer for convolution operation to obtain a sixth target convolution result; and inputting the sixth target convolution result into a full-link layer for convolution operation to obtain the motion key point coordinate information and the motion key point parameters of each frame of picture in the initial motion video.
In this embodiment, as shown in fig. 2, the target convolutional network is composed of 3 consecutive convolution layers and a fully-connected layer. When the motion key point coordinate information and motion key point parameters of the target face image are extracted, the target face image is input into a first target convolution layer (a 3 × 16 convolution layer) to obtain a first target convolution result; the first target convolution result is input into a second target convolution layer (a 3 × 32 convolution layer) to obtain a second target convolution result; the second target convolution result is input into a third target convolution layer (a 2 × 128 convolution layer) to obtain a third target convolution result; and the third target convolution result is input into a 1 × 256 fully-connected layer, yielding the motion key point coordinate information and motion key point parameters of the target face image. Similarly, when the motion key point coordinate information and motion key point parameters of each frame of picture in the initial motion video are extracted, each frame of picture is passed in turn through the first target convolution layer, the second target convolution layer, the third target convolution layer and the fully-connected layer, yielding its motion key point coordinate information and motion key point parameters.
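A PyTorch sketch of a network with this shape is given below (not part of the original filing); the kernel sizes, strides and pooling are assumptions interpreting the "3 × 16", "3 × 32", "2 × 128" and "1 × 256" layer sizes above, and the output is split into 10 key point coordinates and 10 2 × 2 parameter matrices as described:

import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Three convolution layers followed by a fully-connected layer that
    outputs, for 10 motion key points, a 2-D coordinate and a 2x2 parameter
    matrix each (10 * 2 + 10 * 4 = 60 values)."""
    def __init__(self, num_kpts=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 128, kernel_size=2, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, 256)
        self.head = nn.Linear(256, num_kpts * (2 + 4))

    def forward(self, x):
        f = self.features(x).flatten(1)
        f = torch.relu(self.fc(f))
        out = self.head(f)
        coords = out[:, :20].view(-1, 10, 2)      # motion key point coordinates
        params = out[:, 20:].view(-1, 10, 2, 2)   # 2x2 motion key point parameters
        return coords, params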
S104, performing coordinate conversion on the target face image to obtain a converted target face image based on the coordinate information and the motion key point parameters of the motion key point of the target face image and the coordinate information and the motion key point parameters of the motion key point of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face motion image;
in this step, according to the coordinate information and the parameters of the motion key points of the target face image and the coordinate information and the parameters of the motion key points of the current frame picture in the initial motion video, the coordinate conversion is performed on the target face image to obtain a converted target face image, then the convolution sampling is performed on the target face image before and after the conversion to generate a face action image corresponding to the current frame, and the face action image corresponding to each frame picture in the initial motion video is generated according to the method.
In an embodiment, the performing coordinate transformation on the target face image based on the coordinate information and the parameter of the motion key point of the target face image and the coordinate information and the parameter of the motion key point of each frame of picture in the initial motion video to obtain the transformed target face image includes:
and carrying out coordinate conversion on the target face image according to the following formula:
Ztn = SKn + SPn × (1/DfPn) × (z − DfKn)
where SKn is the coordinate information of the nth motion key point of the target face image, SPn is the nth motion key point parameter of the target face image, DfPn is the nth motion key point parameter of one of the frames in the initial motion video, DfKn is the coordinate information of the nth motion key point of that frame, z is an original pixel point of the target face image, and Ztn is the current coordinate information of that original pixel point of the target face image;
and based on the number of the motion key points, extracting pixel values at corresponding positions of the target face image according to the coordinate information of the motion key points after coordinate conversion, and filling the pixel values to original pixel points to obtain a target face image after multi-frame conversion corresponding to one frame in the initial motion video.
In this embodiment, coordinate conversion is performed on the target face image according to the above formula; pixel values are then extracted at the corresponding positions of the target face image according to the converted motion key point coordinates and filled back to the original pixel points, so as to obtain the multi-frame converted target face images corresponding to one frame in the initial motion video. Specifically, the motion key points of the target face image are SK1, SK2, …, SK10 and its motion key point parameter information is SP1, SP2, …, SP10; the motion key points of a certain frame f in the initial motion video are DfK1, DfK2, …, DfKn and its motion key point parameter information is DfP1, DfP2, …, DfPn. The 10 key points of the target face image are transformed in turn according to the above formula, yielding 10 coordinate-transformed target face images corresponding to frame f in the initial motion video.
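The per-key-point coordinate conversion could be sketched as follows (not part of the original filing); reading the 1/DfPn term as a 2 × 2 matrix inverse and applying the formula over a dense coordinate grid are assumptions:

import numpy as np

def warp_coordinates(grid, SK, SP, DfK, DfP):
    """Apply Zt_n = SK_n + SP_n * (1/DfP_n) * (z - DfK_n) to every original
    pixel coordinate z of the target face image, for each motion key point.
    grid has shape (H, W, 2); SK, DfK have shape (10, 2); SP, DfP have
    shape (10, 2, 2)."""
    warped = []
    for n in range(SK.shape[0]):
        inv_dfp = np.linalg.inv(DfP[n])      # 1 / DfP_n, read as matrix inverse
        A = SP[n] @ inv_dfp                  # SP_n * (1/DfP_n)
        z = grid.reshape(-1, 2) - DfK[n]     # z - DfK_n
        zt = SK[n] + z @ A.T                 # Zt_n
        warped.append(zt.reshape(grid.shape))
    return np.stack(warped)                  # (10, H, W, 2) warped coordinates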
In an embodiment, the performing convolution sampling on the transformed target face image and the target face image before transformation to generate a multi-frame face motion image includes:
inputting the target face image before transformation and a plurality of frames of transformed target face images into a fourth convolution network for convolution operation to generate a frame of face action image corresponding to one frame in the initial motion video;
inputting the generated human face motion image and the corresponding frame picture in the initial motion video into a trained VGG19 network for convolution processing, obtaining each layer of output result of the VGG19 network, and calculating an L1 loss function according to each layer of output result of the VGG19 network;
and performing gradient descent back propagation training on the target convolutional network and the fourth convolutional network by using the L1 loss function to obtain the optimized target convolutional network and the optimized fourth convolutional network.
In this embodiment, the multiple converted target face images corresponding to each frame of the initial motion video and the target face image before conversion are input into a fourth convolution network for convolution operation, so as to generate the face motion image corresponding to that frame; the face motion image corresponding to each frame and the corresponding frame picture in the initial motion video are then input into a pre-trained VGG19 network for convolution processing, an L1 loss function is calculated from the output of each layer of the VGG19 network, and gradient-descent back propagation training is performed on the target convolutional network and the fourth convolutional network using this L1 loss function. Let VGGn(f) denote the output of the nth layer of the VGG19 network for the corresponding frame picture in the initial motion video, and VGGn(g) the output of the nth layer for the face motion image generated for that frame; the L1 loss function is then: L1loss = Σ |VGGn(f) − VGGn(g)|.
In this embodiment, as shown in fig. 3, the fourth convolution network is composed of 3 continuous convolution layers and 3 continuous upsampling layers. When the multiple transformed target face images corresponding to a frame of the initial motion video and the target face image before transformation are input into the fourth convolution network, they are first convolved by a 3 × 16 convolution layer, then by a 3 × 32 convolution layer, and then by a 2 × 128 convolution layer; the result is then passed through a 3 × 128 upsampling layer and a 3 × 16 upsampling layer, and one frame of face motion image is finally output.
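A PyTorch sketch of the VGG19-based L1 loss described above is given below (not part of the original filing); which feature layers are tapped, and the torchvision weights argument, are assumptions:

import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGG19PerceptualL1(nn.Module):
    """L1loss = sum_n |VGGn(f) - VGGn(g)|: compare the layer outputs of a
    frozen, pre-trained VGG19 for the real frame f and the generated face
    motion image g."""
    def __init__(self, taps=(3, 8, 17, 26, 35)):  # assumed block-end ReLUs
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.taps = set(taps)

    def forward(self, f, g):
        loss, xf, xg = 0.0, f, g
        for i, layer in enumerate(self.vgg):
            xf, xg = layer(xf), layer(xg)
            if i in self.taps:
                loss = loss + (xf - xg).abs().sum()
        return loss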
And S105, reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
In this step, after each face motion image is reversely rotated according to the face rotation angle, the face motion images are attached to corresponding frames to obtain a target motion video.
In one embodiment, the step S105 includes:
reversely rotating the face action image according to the face rotation angle, and attaching the rotated face action image to a corresponding frame of the target motion video;
and acquiring the vertex coordinates of each frame of the face action image, and performing edge processing on the face action image according to the vertex coordinates to obtain a final target motion video.
In this embodiment, after all the face motion images are reversely rotated according to the face rotation angle, they are attached back to the corresponding frames in the initial motion video and edge processing is performed on them, so as to obtain the final output target motion video. After a face action image is attached back to its corresponding frame, the following filter is used to process the roughly 10 pixel points around the attachment position; pixel points located at the boundary are filled with 0 values. The filter is specifically:
Kernel = [[0.2, 0.1, 0.1],
          [0.1, 0.0, 0.1],
          [0.1, 0.1, 0.2]]
the coordinates of the vertices of the top left corner, top right corner, bottom left corner and bottom right corner after the face motion image is pasted back are represented by (x0, y0), (x0, y1), (x1, y0), (x1, y1), the pixel value of the vertex of the top left corner is represented by V (x0, y0), and (x) is usedn,yn) The representation is located at (x)0-5,y0-5)~(x0+5,y1+5),(x0-5,y0-5)~(x1+5,y0+5),(x0-5,y1-5)~(x1+5,y1+5),(x1-5,y0-5)~(x1+5,y1+5) internal pixel points, and calculating the values of all the points in the internal pixel points as follows:
V(xn,yn)=0.2*V(xn-1,yn-1)+0.1*V(xn,yn-1)+0.1*V(xn,yn+1)+0.1*V(xn-1,yn+1)+0.0*V(xn,yn)+0.1*V(xn+1,yn)+0.1*V(xn-1,yn+1)+0.1*V(xn,yn+1)+0.2*V(xn+1,yn+1)
if the calculation result exceeds the image boundary, the value is represented by V-0.
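Applying this seam filter to a single pixel could be sketched as follows (not part of the original filing); handling a single grayscale channel is a simplification, and out-of-boundary neighbours contribute 0 as stated above:

import numpy as np

KERNEL = np.array([[0.2, 0.1, 0.1],
                   [0.1, 0.0, 0.1],
                   [0.1, 0.1, 0.2]])

def smooth_seam_pixel(image, x, y):
    """Recompute one pixel near the paste seam as the kernel-weighted sum of
    its 3x3 neighbourhood; neighbours outside the image contribute V = 0."""
    h, w = image.shape
    value = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            xx, yy = x + dx, y + dy
            v = image[yy, xx] if 0 <= xx < w and 0 <= yy < h else 0.0
            value += KERNEL[dy + 1, dx + 1] * v
    return value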
Referring to fig. 4, fig. 4 is a schematic block diagram of a static face dynamic system according to an embodiment of the present invention, where the static face dynamic system 200 includes:
an initial face image obtaining unit 201, configured to perform face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and cut the face sample image according to the face frame information to obtain an initial face image;
a target face image obtaining unit 202, configured to calculate a face rotation angle in the initial face image according to the face key point information and rotate the initial face image to obtain a target face image;
an information parameter extraction unit 203, configured to extract, by using a target convolutional network, motion key point coordinate information and motion key point parameters of the target face image; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
a face motion image generating unit 204, configured to perform coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and perform convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face motion image;
and the target motion video acquiring unit 205 is configured to perform reverse rotation on the face motion image according to the face rotation angle and attach the face motion image to a corresponding frame to obtain a target motion video.
In one embodiment, the initial face image obtaining unit 201 includes:
the face candidate frame generating unit is used for carrying out multi-stage scaling on the face sample image to obtain a plurality of input images with different sizes, inputting the input images into a first convolution network for convolution processing, and generating a face candidate frame with a corresponding size;
the face candidate frame screening unit is used for inputting the face candidate frames with the corresponding sizes into a second convolutional network for training, and screening qualified face candidate frames;
and the face information acquisition unit is used for inputting the qualified face candidate box into a third convolutional network for training to obtain final face frame information and face key point information.
In one embodiment, the target face image obtaining unit 202 includes:
the face rotation angle calculation unit is used for acquiring coordinate information of the left eye key point and the right eye key point and calculating a face rotation angle based on the coordinate information of the left eye key point and the right eye key point;
and the image rotating unit is used for rotating the initial face image according to the face rotation angle by adopting a bilinear interpolation method to obtain a target face image.
In an embodiment, the information parameter extracting unit 203 includes:
the face image convolution unit is used for inputting the target face image into a first target convolution layer for convolution operation to obtain a first target convolution result, inputting the first target convolution result into a second target convolution layer for convolution operation to obtain a second target convolution result, and inputting the second target convolution result into a third target convolution layer for convolution operation to obtain a third target convolution result; inputting the third target convolution result into a full-link layer to carry out convolution operation, and obtaining the motion key point coordinate information and the motion key point parameters of the target face image;
the motion video frame convolution unit is used for inputting each frame of picture in the initial motion video into a first target convolution layer for convolution operation to obtain a first target convolution result, inputting the first target convolution result into a second target convolution layer for convolution operation to obtain a second target convolution result, and inputting the second target convolution result into a third target convolution layer for convolution operation to obtain a third target convolution result; and inputting the third target convolution result to a full-link layer for convolution operation to obtain the motion key point coordinate information and the motion key point parameters of each frame of picture in the initial motion video.
In one embodiment, the facial motion image generation unit 204 includes:
the formula calculation unit is used for carrying out coordinate conversion on the target face image according to the following formula:
Ztn = SKn + SPn × (1/DfPn) × (z − DfKn)
where SKn is the coordinate information of the nth motion key point of the target face image, SPn is the nth motion key point parameter of the target face image, DfPn is the nth motion key point parameter of one of the frames in the initial motion video, DfKn is the coordinate information of the nth motion key point of that frame, z is an original pixel point of the target face image, and Ztn is the current coordinate information of that original pixel point of the target face image;
and the pixel extraction unit is used for extracting pixel values at corresponding positions of the target face image according to the coordinate information of the motion key points after coordinate conversion based on the number of the motion key points, and filling the pixel values to original pixel points to obtain a target face image after multi-frame conversion corresponding to one frame in the initial motion video.
In one embodiment, the facial motion image generation unit 204 includes:
the convolution network convolution unit is used for inputting the target face image before transformation and the target face image after multi-frame transformation into a fourth convolution network for convolution operation to generate a frame of face action image corresponding to one frame in the initial motion video;
and the back propagation training unit is used for inputting the generated face motion image and the corresponding frame picture in the initial motion video into the VGG19 for classification, and performing gradient descent back propagation training by adopting a loss function to obtain a plurality of frames of face motion images.
In one embodiment, the target motion video acquiring unit 205 includes:
the image reverse rotation unit is used for reversely rotating the face action image according to the face rotation angle and attaching the rotated face action image to the corresponding frame of the target motion video;
and the edge processing unit is used for acquiring the vertex coordinates of each frame of the human face action image and carrying out edge processing on the human face action image according to the vertex coordinates to obtain a final target motion video.
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the static human face dynamic method as described above is implemented.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the static human face dynamic method is implemented as described above.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A static face dynamization method is characterized by comprising the following steps:
carrying out face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and shearing the face sample image according to the face frame information to obtain an initial face image;
calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolutional network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
performing coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face action image;
and reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
2. The static human face dynamic method according to claim 1, wherein the obtaining human face frame information and human face key point information by performing human face detection on the human face sample image by using a human face detection network comprises:
carrying out multi-stage scaling on the face sample image to obtain a plurality of input images with different sizes, inputting the input images into a first convolution network for convolution processing, and generating a face candidate frame with a corresponding size;
inputting the face candidate frames with the corresponding sizes into a second convolution network for training, and screening out qualified face candidate frames;
and inputting the qualified face candidate box into a third convolutional network for training to obtain final face frame information and face key point information.
3. The static human face dynamic method according to claim 1, wherein the calculating a human face rotation angle in the initial human face image according to the human face key point information and rotating the initial human face image to obtain a target human face image comprises:
acquiring coordinate information of a left-eye key point and a right-eye key point, and calculating a face rotation angle based on the coordinate information of the left-eye key point and the right-eye key point;
and rotating the initial face image according to the face rotation angle by adopting a bilinear interpolation method to obtain a target face image.
4. The static human face dynamic method according to claim 1, wherein the extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolutional network, and the extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network, comprise:
inputting the target face image into a first target convolutional layer for convolution operation to obtain a first target convolution result, inputting the first target convolution result into a second target convolutional layer for convolution operation to obtain a second target convolution result, and inputting the second target convolution result into a third target convolutional layer for convolution operation to obtain a third target convolution result; inputting the third target convolution result into a full-link layer to carry out convolution operation, and obtaining the motion key point coordinate information and the motion key point parameters of the target face image;
inputting each frame of picture in the initial motion video into a first target convolutional layer for convolution operation to obtain a fourth target convolution result, inputting the fourth target convolution result into a second target convolutional layer for convolution operation to obtain a fifth target convolution result, and inputting the fifth target convolution result into a sixth target convolutional layer for convolution operation to obtain a sixth target convolution result; and inputting the sixth target convolution result into a full-link layer for convolution operation to obtain the motion key point coordinate information and the motion key point parameters of each frame of picture in the initial motion video.
5. The method according to claim 1, wherein the transforming the coordinates of the target face image into the transformed target face image based on the coordinate information and the parameter of the motion key point of the target face image and the coordinate information and the parameter of the motion key point of each frame of image in the initial motion video comprises:
and carrying out coordinate conversion on the target face image according to the following formula:
Ztn = SKn + SPn × (1/DfPn) × (z − DfKn)
where SKn is the coordinate information of the nth motion key point of the target face image, SPn is the nth motion key point parameter of the target face image, DfPn is the nth motion key point parameter of one of the frames in the initial motion video, DfKn is the coordinate information of the nth motion key point of that frame, z is an original pixel point of the target face image, and Ztn is the current coordinate information of that original pixel point of the target face image;
and based on the number of the motion key points, extracting pixel values at corresponding positions of the target face image according to the coordinate information of the motion key points after coordinate conversion, and filling the pixel values to original pixel points to obtain a target face image after multi-frame conversion corresponding to one frame in the initial motion video.
6. The static human face dynamic method according to claim 1, wherein the performing convolution sampling on the transformed target human face image and the target human face image before transformation to generate a multi-frame human face motion image comprises:
inputting the target face image before transformation and a plurality of frames of transformed target face images into a fourth convolution network for convolution operation to generate a frame of face action image corresponding to one frame in the initial motion video;
inputting the generated human face motion image and the corresponding frame picture in the initial motion video into a trained VGG19 network for convolution processing, obtaining each layer of output result of the VGG19 network, and calculating an L1 loss function according to each layer of output result of the VGG19 network;
and performing gradient descent back propagation training on the target convolutional network and the fourth convolutional network by using the L1 loss function to obtain the optimized target convolutional network and the optimized fourth convolutional network.
7. The static human face dynamic method according to claim 1, wherein the reversely rotating the human face motion image according to the human face rotation angle and attaching the human face motion image to a corresponding frame to obtain a final target motion video comprises:
reversely rotating the face action image according to the face rotation angle, and attaching the rotated face action image to a corresponding frame of the target motion video;
and acquiring the vertex coordinates of each frame of the face action image, and performing edge processing on the face action image according to the vertex coordinates to obtain a final target motion video.
8. A static face dynamization system, comprising:
an initial face image obtaining unit, configured to perform face detection on a face sample image by using a face detection network to obtain face frame information and face key point information, and cut the face sample image according to the face frame information to obtain an initial face image;
the target face image acquisition unit is used for calculating a face rotation angle in the initial face image according to the face key point information and rotating the initial face image to obtain a target face image;
the information parameter extraction unit is used for extracting the coordinate information and the parameters of the motion key points of the target face image by using a target convolution network; extracting the coordinate information and the parameters of the motion key points of each frame of picture in the initial motion video by using the target convolutional network;
the face motion image generating unit is used for performing coordinate conversion on the target face image to obtain a converted target face image based on the motion key point coordinate information and the motion key point parameter of the target face image and the motion key point coordinate information and the motion key point parameter of each frame of image in the initial motion video, and performing convolution sampling on the converted target face image and the target face image before conversion to generate a multi-frame face motion image;
and the target motion video acquisition unit is used for reversely rotating the face action image according to the face rotation angle and attaching the face action image to a corresponding frame to obtain a target motion video.
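A hedged Python sketch of how the units of this system could be composed; every injected callable is a placeholder standing in for the corresponding network or processing step named in the claims, not a real library API:

from typing import Callable, Iterable, List

class StaticFaceDynamizationSystem:
    def __init__(self, detector: Callable, rotate: Callable,
                 keypoint_net: Callable, generator: Callable, paste: Callable):
        self.detector = detector          # initial face image acquisition unit
        self.rotate = rotate              # target face image acquisition unit
        self.keypoint_net = keypoint_net  # information parameter extraction unit
        self.generator = generator        # face motion image generating unit
        self.paste = paste                # target motion video acquisition unit

    def run(self, face_image, driving_video: Iterable) -> List:
        box, angle = self.detector(face_image)                 # face frame info + rotation angle
        face = self.rotate(face_image, box, angle)             # cropped, rotated target face image
        src_kp, src_param = self.keypoint_net(face)
        frames = []
        for drv in driving_video:
            drv_kp, drv_param = self.keypoint_net(drv)
            moved = self.generator(face, src_kp, src_param, drv_kp, drv_param)
            frames.append(self.paste(moved, drv, angle, box))  # rotate back and attach to the frame
        return frames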
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the static face dynamization method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the static face dynamization method according to any one of claims 1 to 7.
CN202111002870.9A 2021-08-30 2021-08-30 Static face dynamic method, system, computer equipment and readable storage medium Active CN113688753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111002870.9A CN113688753B (en) 2021-08-30 2021-08-30 Static face dynamic method, system, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111002870.9A CN113688753B (en) 2021-08-30 2021-08-30 Static face dynamic method, system, computer equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113688753A true CN113688753A (en) 2021-11-23
CN113688753B CN113688753B (en) 2023-09-29

Family

ID=78583973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111002870.9A Active CN113688753B (en) 2021-08-30 2021-08-30 Static face dynamic method, system, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113688753B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070052698A1 (en) * 2003-07-11 2007-03-08 Ryuji Funayama Image processing apparatus, image processing method, image processing program, and recording medium
CN105184860A (en) * 2015-09-30 2015-12-23 南京邮电大学 Method for reconstructing dense three-dimensional structure and motion field of dynamic face simultaneously
CN109711258A (en) * 2018-11-27 2019-05-03 哈尔滨工业大学(深圳) Lightweight face critical point detection method, system and storage medium based on convolutional network
CN111696185A (en) * 2019-03-12 2020-09-22 北京奇虎科技有限公司 Method and device for generating dynamic expression image sequence by using static face image
CN110647865A (en) * 2019-09-30 2020-01-03 腾讯科技(深圳)有限公司 Face gesture recognition method, device, equipment and storage medium
CN111753782A (en) * 2020-06-30 2020-10-09 西安深信科创信息技术有限公司 False face detection method and device based on double-current network and electronic equipment
CN112733616A (en) * 2020-12-22 2021-04-30 北京达佳互联信息技术有限公司 Dynamic image generation method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523636A (en) * 2023-11-24 2024-02-06 北京远鉴信息技术有限公司 Face detection method and device, electronic equipment and storage medium
CN117523636B (en) * 2023-11-24 2024-06-18 北京远鉴信息技术有限公司 Face detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113688753B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN110490896B (en) Video frame image processing method and device
CN112184585B (en) Image completion method and system based on semantic edge fusion
CN111861872B (en) Image face changing method, video face changing method, device, equipment and storage medium
CN107316286B (en) Method and device for synchronously synthesizing and removing rain and fog in image
JP2023539691A (en) Human image restoration methods, devices, electronic devices, storage media, and program products
KR101028628B1 (en) Image texture filtering method, storage medium of storing program for executing the same and apparatus performing the same
CN112991165B (en) Image processing method and device
WO2023066173A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN113688753A (en) Static face dynamic method, system, computer equipment and readable storage medium
CN111311732B (en) 3D human body grid acquisition method and device
CN111932594B (en) Billion pixel video alignment method and device based on optical flow and medium
CN110298229B (en) Video image processing method and device
CN113240584A (en) Multitask gesture picture super-resolution method based on picture edge information
US7522189B2 (en) Automatic stabilization control apparatus, automatic stabilization control method, and computer readable recording medium having automatic stabilization control program recorded thereon
CN116563497A (en) Virtual person driving method, device, equipment and readable storage medium
Hongying et al. Image completion by a fast and adaptive exemplar-based image inpainting
Wang Single image super-resolution with u-net generative adversarial networks
CN115116468A (en) Video generation method and device, storage medium and electronic equipment
JP2004264919A (en) Image processing method
Cho et al. Example-based super-resolution using self-patches and approximated constrained least squares filter
CN108364273B (en) Method for multi-focus image fusion in spatial domain
CN105469399A (en) Face super-resolution reconstruction method facing mixed noises and apparatus thereof
CN108133459B (en) Depth map enhancement method and depth map enhancement device
CN112884664B (en) Image processing method, device, electronic equipment and storage medium
WO2024034388A1 (en) Image processing device, image processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant