CN112883826B - Face cartoon generation method based on learning geometry and texture style migration - Google Patents


Info

Publication number
CN112883826B
CN112883826B CN202110118105.7A
Authority
CN
China
Prior art keywords
cartoon
style
matrix
face
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110118105.7A
Other languages
Chinese (zh)
Other versions
CN112883826A (en)
Inventor
霍静
刘祥德
***
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110118105.7A priority Critical patent/CN112883826B/en
Publication of CN112883826A publication Critical patent/CN112883826A/en
Application granted granted Critical
Publication of CN112883826B publication Critical patent/CN112883826B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G06V40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00: 2D [Two Dimensional] image generation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face cartoon generation method based on learning geometry and texture style migration. A geometric deformation module obtains a face deformation map; a texture migration module, following the manifold-alignment style-transfer assumption, migrates a new style through neural-network optimization. For locally similar semantic regions of the deformation map and the cartoon, the texture migration module constrains the stylized output to have feature maps similar to those of the cartoon style map. StyleGAN is used to generate an artistic cartoon data set that augments the available styles, and diverse cartoons are obtained by interpolation in the latent space, realizing diversified face cartoon generation. The invention can exaggerate a person's facial features and obtain, in a customized way, the geometric deformation styles of different artists; geometric distortion and texture are rendered in a targeted manner, and combined with interpolation-controllable cartoon style maps, so the generated facial cartoon images are more vivid and diverse.

Description

Face cartoon generation method based on learning geometry and texture style migration
Technical Field
The invention belongs to the field of computer application, and particularly relates to a face cartoon generation method based on learning geometry and texture style migration.
Background
The facial cartoon is an art form that expresses specific emotion and leaves an impressive impression by exaggerating a subject's features; it has rich and flexible diversity and is popular with the public. On the one hand, facial cartoons can take different depiction forms, such as simple line drawing, sketch, oil painting and the like; on the other hand, facial cartoons can express different emotions through different exaggeration styles. Meanwhile, cartoon creators also have their own artistic styles and modes of expression, further increasing the diversity of the facial cartoon art form.
Face cartoons are usually created by professionally skilled artists, so often only celebrities and the like have their own cartoon portraits. With the development and popularization of the internet and the mobile internet, more and more ordinary people want a cartoon image of themselves, yet commissioning a professional artist is inconvenient and costly. Therefore, automatically generating cartoon images from face photos by computer technology is attracting attention and favor. To produce a realistic caricature, two key issues need to be addressed. One is to perform facial geometry morphing to exaggerate certain key features of a person's face. The other is to synthesize a texture or style similar to a real caricature. There has been a great deal of prior work on caricature generation. Traditional automatic facial-cartoon generation methods fall into two major types: rule-based methods and sample-based methods. Rule-based methods adjust photos through manually preset rules to generate the facial cartoon, for example by computing the differences between the input face photo and an average face and exaggerating the most prominent differences; sample-based methods collect a cartoon sample library, detect the shape of the facial features and contours of each input photo, then search the library for the best-matching cartoon facial features and contours to compose a new cartoon image. These methods are limited in that the exaggeration style is constrained by the predefined rules and examples.
In recent years, with the wide application of deep learning in computer vision, cross-domain image translation methods based on deep learning, such as Cycle-GAN and MUNIT, can convert a face photo into a cartoon style, but the cartoons they generate lack exaggeration in shape. In addition, with the development of generative adversarial networks (GANs), some researchers have tried to generate caricatures using GANs. A major disadvantage of using a GAN to generate caricatures, however, is the large amount of data required for training; moreover, the resulting caricature is not customizable or modifiable by the user. For example, Warp-GAN exaggerates the shape of the face by warping on top of deep-learning-based style conversion, so the generated cartoon is more realistic in both color and shape. However, such methods can only generate one fixed shape-exaggeration style for a given input photo and cannot meet the demand for diverse cartoon styles. In general, automatic facial-cartoon generation faces the following difficulties: 1) generating a face cartoon from a face photo requires not only converting the color style of the image but also exaggerating its shape according to the input face's characteristics, artistic creation style, and so on; 2) the facial-cartoon art form is richly diverse: when cartoons are created from the same photo, different creation means, emotions, and artist styles produce many different color and shape styles; 3) furthermore, for the texture style migration process, the primary limitation is that generated caricature images may be restricted to the styles provided by the dataset.
Disclosure of Invention
The invention aims to: aiming at the task of automatically generating face cartoons, the invention provides a face cartoon generation method based on learning geometry and texture style migration, which uses a pure style-migration method to migrate the deformation and texture of cartoons, allows users to customize flexibly, and can generate cartoons with a variety of deformation styles for a single face photo.
The technical scheme is as follows: the invention discloses a face cartoon generation method based on learning geometry and texture style migration, which comprises the following steps:
(1) Obtaining key points of faces and cartoon: obtaining key points of a face photo and a cartoon photo through a face key point extraction algorithm;
(2) Building face and cartoon distribution: dividing a pre-acquired facial photo cartoon image data set into a training set and a testing set, and loading a representation matrix of facial photo key point distribution and cartoon key point distribution;
(3) Calculating a projection matrix: acquiring covariance matrixes of key point distribution of the face photo and cartoon key point distribution, acquiring linear transformation capable of enabling the key point distribution and the cartoon key point distribution to be aligned through a WCT algorithm, and marking the linear transformation as a projection matrix;
(4) Obtaining key points of a cartoon domain: projecting key points of the human face to the cartoon domain by utilizing the projection matrix, so as to obtain the key points corresponding to the human face in the cartoon domain;
(5) Acquiring a human face deformation graph: obtaining a Warp affine matrix from a human face to a cartoon according to the human face key points and the cartoon key points, and applying the affine matrix on the whole human face image so as to obtain a human face deformation graph;
(6) Calculating a feature-based neighbor matrix: respectively inputting the human face deformation graph and the cartoon style graph into a VGG-19 network to extract content representation and style representation; calculating cosine similarity of each position feature of the content representation and each position feature of the style representation to obtain a position-to-position neighbor matrix, wherein each element of the matrix encodes the similarity of the features in different spatial positions, and the matching relation of the content to the style semantic level is described;
(7) Semantic-based style loss: for the content representation of the facial deformation graph and the style representation of the cartoon, calculating a Euclidean distance matrix from position to position of the facial deformation graph and the style representation of the cartoon, wherein the Euclidean distance matrix and the neighbor matrix have equal large dimensions; defining K as the number of neighbors, assigning 1 to K positions corresponding to the Euclidean distance matrix according to the position relation of the first K neighbors in the neighbor matrix, and assigning 0 to the rest positions, and finally summing the distance matrix to be used as style loss;
(8) Iterative generation of cartoon pictures: based on the back-propagation algorithm, the gradient of the style loss is propagated back in an iterative update scheme, penalizing the distance between matched content and style features so that matched features are pulled ever closer; the input deformation map is gradually rendered with the style texture of the cartoon, yielding the final cartoon picture;
(9) Feature-based face content structure retention: re-inputting the generated cartoon to VGG-19 to extract stylized content representation, and using the mean square error of the stylized content representation and the deformation graph content representation as content loss, thereby keeping the original face structure of the deformation graph from being damaged by the texture migration process;
(10) Obtaining diversified cartoon styles: massive numbers of cartoons are generated by means of StyleGAN, while latent-space interpolation provides control over the generated cartoon, constructing a cartoon style distribution with rich characteristics;
(11) Deforming different input face photos with the projection matrix obtained in the training stage, and rendering textures with the semantic-level style migration method.
Further, the number of the key points in the step (1) is 128.
Further, the step (3) is implemented by the following formula:
$$L_{pw}=E_p D_p^{-1/2}E_p^{\top},\qquad P=L_{pw}\,E_c D_c^{1/2}E_c^{\top},\qquad \hat L_p=\bar L_p P$$

wherein $L_{pw}$ is the whitening matrix; the mean vector of the photo keypoints is $\mu_p$ and the mean vector of the cartoon keypoints is $\mu_c$; the centered picture keypoint matrix is $\bar L_p=L_p-\mathbf{1}\mu_p$ and the centered cartoon keypoint matrix is $\bar L_c=L_c-\mathbf{1}\mu_c$; eigenvalue decomposition of $\bar L_p^{\top}\bar L_p$ yields $D_p$, $E_p$, $E_p^{\top}$, where $D_p$ is the diagonal matrix whose diagonal elements are the photo-domain eigenvalues, $E_p$ is the orthogonal matrix whose column vectors are the photo-domain eigenvectors, and $E_p^{\top}$ is the transpose of $E_p$; eigenvalue decomposition of $\bar L_c^{\top}\bar L_c$ yields $D_c$, $E_c$, $E_c^{\top}$, where $D_c$ is the diagonal matrix of cartoon-domain eigenvalues, $E_c$ is the orthogonal matrix whose column vectors are the cartoon-domain eigenvectors, and $E_c^{\top}$ is the transpose of $E_c$; $\hat L_p$ is the aligned picture keypoint matrix, whose covariance matrix is the same as that of $\bar L_c$; $P$ is the projection matrix from the picture to the caricature.
Further, the keypoints corresponding to the face in the cartoon domain in the step (4) are:

$$l_{pc}=(l_p-\mu_p)\,P+\mu_c$$

wherein $P$ is the projection matrix, $l_p$ is the face keypoint vector of the test picture, $\mu_p$ is the mean vector of the face keypoints, and $\mu_c$ is the mean vector of the cartoon keypoints.
Further, the implementation process of the step (7) is as follows:

construct a graph matrix $A^l\in\{0,1\}^{(W_l H_l)\times(W_l H_l)}$, where $W_l$ and $H_l$ are the width and height of the feature map and $l$ refers to the features of the $l$-th layer; each element of the matrix encodes the similarity of features at different spatial positions and is defined as:

$$A^l(i,j)=\begin{cases}1, & C_i^l\in \mathrm{NN}_k(S_j^l)\ \text{and}\ S_j^l\in \mathrm{NN}_k(C_i^l)\\ 0, & \text{otherwise}\end{cases}$$

wherein $C_i^l$ is the feature vector of the content representation $C^l$ at the $i$-th position and $S_j^l$ is the feature vector of the style representation $S^l$ at the $j$-th position; $\mathrm{NN}_k(S_j^l)$ denotes the $k$ nearest neighbors of $S_j^l$ and $\mathrm{NN}_k(C_i^l)$ denotes the $k$ nearest neighbors of $C_i^l$;

the distance measure used to compute the $k$ nearest neighbors is the cosine distance; to achieve semantically aligned style migration, the following objective function is optimized:

$$\mathcal{L}_{style}=\sum_l\sum_{i,j}A^l(i,j)\,\bigl\|G_i^l-S_j^l\bigr\|_2^2$$

if $A^l(i,j)=1$, the feature at the $i$-th position of the content map shares the same semantics as the feature at the $j$-th position of the style map; the objective of stylization is therefore to force $G_i^l$ to be similar to $S_j^l$, and the final objective function is:

$$\mathcal{L}=\alpha\,\mathcal{L}_{content}+\beta\,\mathcal{L}_{style}$$

wherein $\mathcal{L}_{content}=\|G^{l_{con}}-C^{l_{con}}\|_2^2$ is the content loss defined on the layer-$l_{con}$ features, and $\alpha$ and $\beta$ are hyperparameters that balance content loss and style loss.
Further, in the step (10), massive StyleGAN-based cartoon generation is performed by training a generative adversarial network on a large-scale cartoon data set, so that cartoon style maps of different styles are generated.
The beneficial effects are that: compared with the prior art, 1. a WCT-based deformation style migration method is proposed for the first time, obtaining reasonably exaggerated and diversified deformation styles from the face keypoints and the cartoon keypoints; 2. the semantic-level style migration method based on neural-network optimization obtains cartoon textures with semantic consistency, and the cartoon can be produced iteratively and controllably, making the visual effect of the facial cartoon more vivid and interesting; 3. people's facial features can be exaggerated, and the geometric deformation styles of different artists can be obtained in a customized way; 4. a novel cartoon generation method with separable deformation migration and texture migration is provided, so the model can render geometric distortion and texture in a targeted manner; combined with interpolation-controllable cartoon style maps, the generated facial cartoon images are more vivid and diverse.
Drawings
FIG. 1 is a flow chart of the invention;
FIG. 2 is a schematic diagram of a geometry deforming network according to the present invention;
FIG. 3 is a drawing of an example of a facial cartoon generated using the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides a face cartoon generation method based on learning geometry and texture style migration, which uses a geometric deformation module and a texture migration module to perform geometric deformation and texture rendering on the image respectively, generates an artistic cartoon data set with various styles through StyleGAN to augment the available styles, and allows various cartoons to be obtained by latent-space interpolation, thereby realizing diversified face cartoon generation. As shown in fig. 1, the method specifically comprises the following steps:
(1) Obtaining key points of faces and cartoons: obtaining 128 keypoints of the face photo and the cartoon photo through a face keypoint extraction algorithm.
For any photo in the data set, the invention adopts the face key point detection algorithm to extract the key points of the photo, so that the key point pairs from the face domain to the cartoon domain maintain the spatial consistency. Let us now assume that there are two sets of keypoints from the face domain and the caricature domain, respectively.
(2) Building face and cartoon distribution: dividing a training set and a testing set from a pre-acquired facial photo cartoon image data set, and loading a representation matrix of facial photo key point distribution and cartoon key point distribution.
The keypoint matrix of the face photos is denoted $L_p\in\mathbb{R}^{n_p\times d}$ and the cartoon keypoint matrix is denoted $L_c\in\mathbb{R}^{n_c\times d}$, where $n_p$ is the number of face photos, $n_c$ is the number of cartoons, and $d$ is the number of keypoint coordinates (keypoints $\times\,2$). When there are sufficiently many photos and cartoons, the two matrices delineate the distributions of the photo and cartoon keypoints.
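The keypoint matrices above can be assembled as follows. This is a minimal numpy sketch under our own naming conventions (the patent does not prescribe code); the random arrays merely stand in for detector output.

```python
import numpy as np

def build_keypoint_matrix(keypoint_sets):
    """Stack per-image keypoints (128 points, x/y each) into one matrix.

    keypoint_sets: iterable of (128, 2) arrays, one per image.
    Returns the (n, 256) matrix, its mean row vector, and the
    centered matrix later consumed by the WCT projection step.
    """
    L = np.stack([k.reshape(-1) for k in keypoint_sets])  # (n, 256)
    mu = L.mean(axis=0, keepdims=True)                    # (1, 256)
    return L, mu, L - mu

# toy example: random "keypoints" standing in for detector output
rng = np.random.default_rng(0)
faces = [rng.normal(size=(128, 2)) for _ in range(10)]
L_p, mu_p, Lp_centered = build_keypoint_matrix(faces)
```

The same helper would be applied to the cartoon keypoints to obtain $L_c$ and its centered form.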
(3) Calculating a projection matrix: and obtaining covariance matrixes of the key point distribution of the face photo and the key point distribution of the cartoon, obtaining linear transformation capable of enabling the key point distribution and the key point distribution to be aligned through a WCT algorithm, and marking the linear transformation as a projection matrix.
The geometric deformation network is responsible for learning the transformation from the photo face key points to the cartoon face key points, and deforming the image through the Warping operation. The detailed structure of the geometrically deformed network is shown in fig. 2:
given a face map, this module aims to perform a distorted photograph based on geometric deformation. The invention provides a WCT algorithm based on key points. Heretofore, the WCT algorithm was merely a texture for image style migration to obtain a style map. The invention uses the domain knowledge of key point distribution to apply the WCT algorithm in the geometric deformation migration domain for the first time by following the WCT theory idea.
By using WCT, the object of the invention is to find a projection such that the covariance matrix of the projected $L_p$ is the same as that of $L_c$. Thus, the projection maps the photo keypoints into the cartoon keypoint space. The detailed process is as follows:

the mean vector of the picture keypoints is $\mu_p$ and the mean vector of the cartoon keypoints is $\mu_c$; the centered picture keypoint matrix is $\bar L_p=L_p-\mathbf{1}\mu_p$ and the centered cartoon keypoint matrix is $\bar L_c=L_c-\mathbf{1}\mu_c$. The whitening operation first obtains the eigenvalues and eigenvectors of the picture covariance matrix $\bar L_p^{\top}\bar L_p$: eigenvalue decomposition yields $D_p$, $E_p$, $E_p^{\top}$, with $D_p$ the diagonal matrix whose diagonal elements are the eigenvalues, $E_p$ the orthogonal matrix whose column vectors are the eigenvectors, and $E_p^{\top}$ the transpose of $E_p$; evidently $\bar L_p^{\top}\bar L_p=E_p D_p E_p^{\top}$. The whitening of the picture keypoint matrix is then:

$$\hat L_{pw}=\bar L_p L_{pw}$$

wherein the whitening matrix $L_{pw}$ is

$$L_{pw}=E_p D_p^{-1/2}E_p^{\top}$$

and the whitened matrix satisfies $\hat L_{pw}^{\top}\hat L_{pw}=I$.

Similarly to the whitening process, eigenvalue decomposition is used on the cartoon keypoint covariance matrix: decomposition of $\bar L_c^{\top}\bar L_c$ yields $D_c$, $E_c$, $E_c^{\top}$, with $D_c$ the diagonal matrix whose diagonal elements are the eigenvalues, $E_c$ the orthogonal matrix whose column vectors are the eigenvectors, and $E_c^{\top}$ the transpose of $E_c$; evidently $\bar L_c^{\top}\bar L_c=E_c D_c E_c^{\top}$. The coloring process is:

$$\hat L_p=\hat L_{pw}\,E_c D_c^{1/2}E_c^{\top}$$

$\hat L_p$ is the aligned picture keypoint matrix, whose covariance matrix is the same as that of $\bar L_c$; a simple mathematical derivation shows that $\hat L_p^{\top}\hat L_p=\bar L_c^{\top}\bar L_c$, that is, by applying WCT to the keypoint matrices of the two domains, the algorithm aligns the two distributions. Finally, adding $\mu_c$ back to $\hat L_p$ gives the cartoon keypoints $L_{pc}$ corresponding to the transformed photos. Rearranging yields the projection matrix from picture to cartoon:

$$P=E_p D_p^{-1/2}E_p^{\top}\,E_c D_c^{1/2}E_c^{\top}$$
(4) Obtaining key points of a cartoon domain: and projecting the key points of the human face to the cartoon domain by using the projection matrix, thereby obtaining the key points corresponding to the human face in the cartoon domain.
In the testing stage, the invention uses the projection matrix $P$ to project the keypoints of the test picture into the cartoon keypoint space; the corresponding transformed cartoon keypoints are:

$$l_{pc}=(l_p-\mu_p)\,P+\mu_c$$

wherein $P$ is the projection matrix obtained in step (3), $l_p$ is the face keypoint vector of the test picture, $\mu_p$ is the mean vector of the face keypoints, and $\mu_c$ is the mean vector of the cartoon keypoints.
(5) Acquiring a human face deformation graph: and obtaining a Warp affine matrix from the human face to the cartoon according to the human face key points and the cartoon key points, and applying the affine matrix on the whole human face image so as to obtain the human face deformation graph.
From the cartoon-domain keypoints $l_{pc}$ obtained in step (4) and the face-domain keypoints $l_p$, an affine matrix $H$ from the face domain to the cartoon domain is computed, and the face photo is warped with this matrix to obtain the deformation map.
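Estimating the affine matrix from the paired keypoints is a least-squares fit. The sketch below uses plain numpy in place of an image-warping library (in practice the warp would be applied to pixels, e.g. with OpenCV; that part is omitted here), and all names are illustrative.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src points onto dst
    (the keypoint-fitting half of step (5)).

    src, dst: (n, 2) keypoint arrays. Returns a (3, 3) matrix H with
    [x', y', 1]^T = H [x, y, 1]^T.
    """
    n = len(src)
    A = np.hstack([src, np.ones((n, 1))])             # homogeneous src, (n, 3)
    coeffs, *_ = np.linalg.lstsq(A, dst, rcond=None)  # (3, 2) solution
    H = np.eye(3)
    H[:2, :] = coeffs.T
    return H

# sanity check: recover a known rotation + scale + translation
theta = 0.3
R = 1.5 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
t = np.array([10.0, -4.0])
src = np.random.default_rng(2).normal(size=(128, 2))
dst = src @ R.T + t
H = fit_affine(src, dst)
mapped = (np.hstack([src, np.ones((128, 1))]) @ H.T)[:, :2]
```

When the correspondences are exactly affine, `mapped` reproduces `dst` up to numerical precision; with real keypoint pairs the fit is the least-squares best approximation.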
(6) Calculating a feature-based neighbor matrix: respectively inputting the human face deformation graph and the cartoon style graph into a VGG-19 network to extract content representation and style representation; and calculating cosine similarity of each position feature of the content representation and each position feature of the style representation to obtain a position-to-position neighbor matrix, wherein each element of the matrix encodes the similarity of the features in different spatial positions, and the matching relation of the content to the style semantic level is described.
(7) Semantic-based style loss: for the content representation of the facial deformation graph and the style representation of the cartoon, calculating a Euclidean distance matrix from position to position of the facial deformation graph and the style representation of the cartoon, wherein the Euclidean distance matrix and the neighbor matrix have equal large dimensions; and defining K as the number of neighbors, assigning 1 to K positions corresponding to the Euclidean distance matrix according to the position relation of the first K neighbors in the neighbor matrix, and assigning 0 to the rest positions, and finally summing the distance matrix to obtain the style loss.
For the deformation map generated by the WCT, the next step is to render the caricature texture on it. Based on the manifold assumption, the invention provides an optimization-based semantically aligned style migration network.

Specifically, in the field of style migration, the picture providing the overall content structure is called the content map and the picture providing the stylized texture is called the style map; style migration aims to obtain stylized texture of the highest possible quality while preserving the original content structure. Here, the deformation map (content map) is denoted $I_p$, the cartoon (style map) is denoted $I_c$, and the generated cartoon is denoted $I_g$. Feeding the three images into the VGG-19 feature-extraction network yields output features $C^l$, $S^l$, $G^l\in\mathbb{R}^{D_l\times(W_l H_l)}$ respectively, where $D_l$ is the number of feature channels and $W_l$ and $H_l$ are the width and height of the layer-$l$ feature map.

$I_g$ is first initialized to $I_p$. The invention constructs a graph matrix $A^l\in\{0,1\}^{(W_l H_l)\times(W_l H_l)}$, each element of which encodes the similarity of features at different spatial positions:

$$A^l(i,j)=\begin{cases}1, & C_i^l\in \mathrm{NN}_k(S_j^l)\ \text{and}\ S_j^l\in \mathrm{NN}_k(C_i^l)\\ 0, & \text{otherwise}\end{cases}$$

wherein $C_i^l$ is the feature vector of $C^l$ at the $i$-th position and $S_j^l$ is the feature vector of $S^l$ at the $j$-th position; $\mathrm{NN}_k(\cdot)$ denotes the set of $k$ nearest neighbors, and the distance measure used to compute the $k$ nearest neighbors is the cosine distance. To achieve semantically aligned style migration, the algorithm needs to optimize the following objective function:

$$\mathcal{L}_{style}=\sum_l\sum_{i,j}A^l(i,j)\,\bigl\|G_i^l-S_j^l\bigr\|_2^2$$

If $A^l(i,j)=1$, the feature at the $i$-th position of the content map shares the same semantics as the feature at the $j$-th position of the style map; the objective of stylization is therefore to force $G_i^l$ to be similar to $S_j^l$. The final objective function is:

$$\mathcal{L}=\alpha\,\mathcal{L}_{content}+\beta\,\mathcal{L}_{style}$$

wherein $\mathcal{L}_{content}=\|G^{l_{con}}-C^{l_{con}}\|_2^2$ is the content loss defined on the layer-$l_{con}$ features, and $\alpha$ and $\beta$ are hyperparameters that balance content loss and style loss.
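The mutual-nearest-neighbor matrix and the masked style loss can be sketched on small feature arrays. This numpy illustration is ours (in the method these would be VGG-19 feature maps optimized with backpropagation; here random arrays stand in and no gradient step is taken).

```python
import numpy as np

def mutual_knn_matrix(C, S, k):
    """Binary graph matrix from step (6): A[i, j] = 1 when content
    feature i and style feature j are mutually among each other's
    k nearest neighbors under cosine distance."""
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    sim = Cn @ Sn.T                          # cosine similarity, (Nc, Ns)
    top_s = np.argsort(-sim, axis=1)[:, :k]  # k nearest style positions per content
    top_c = np.argsort(-sim.T, axis=1)[:, :k]
    A = np.zeros(sim.shape, dtype=bool)
    for i, js in enumerate(top_s):
        A[i, js] = True
    B = np.zeros_like(A)
    for j, idx in enumerate(top_c):
        B[idx, j] = True
    return (A & B).astype(float)             # mutual neighbors only

def style_loss(G, S, A):
    """Step (7): squared Euclidean distances between feature pairs,
    masked by the neighbor matrix and summed."""
    d2 = ((G[:, None, :] - S[None, :, :]) ** 2).sum(-1)  # (Nc, Ns)
    return (A * d2).sum()

rng = np.random.default_rng(3)
C = rng.normal(size=(16, 8))   # content features, one row per spatial position
S = rng.normal(size=(16, 8))   # style features
A = mutual_knn_matrix(C, S, k=3)
loss = style_loss(C.copy(), S, A)  # G is initialized to the content features
```

In the full method this loss, combined with the content loss, would be minimized over the stylized features by gradient descent (step (8)).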
(8) Iterative generation of cartoon pictures: based on the back-propagation algorithm, the gradient of the style loss is propagated back in an iterative update scheme, penalizing the distance between matched content and style features so that matched features are pulled ever closer; the input deformation map is gradually rendered with the style texture of the cartoon, yielding the final cartoon picture.
(9) Feature-based face content structure retention: and re-inputting the generated cartoon to the VGG-19, extracting stylized content representation, and using the mean square error of the stylized content representation and the deformation graph content representation as content loss, so that the original face structure of the deformation graph is kept from being damaged by the texture migration process.
(10) Obtaining diversified cartoon styles: massive numbers of cartoons are generated by means of StyleGAN, while latent-space interpolation provides control over the generated cartoon, constructing a cartoon style distribution with rich characteristics.
The disadvantage of most style migration methods is that style types are limited to those in the dataset. To obtain more style types, the invention can use StyleGAN to generate synthetic cartoons, which the user may then take as target style cartoons. Specifically, StyleGAN is trained on a large cartoon data set and then used to produce large numbers of synthetic cartoons of various styles. In addition, the user can apply latent-space interpolation between two synthetic cartoons to generate an intermediate style map and thus obtain the intended target cartoon.
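The interpolation itself is a simple linear blend of latent codes. A minimal sketch, assuming a 512-dimensional StyleGAN latent space (the pretrained generator that maps each code to a cartoon image is not included here):

```python
import numpy as np

def interpolate_latents(z1, z2, steps):
    """Linear interpolation in the latent space (step (10)).
    Each intermediate code, fed through a pretrained StyleGAN
    generator, yields an in-between cartoon style image."""
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - ts) * z1 + ts * z2   # (steps, dim)

# two latent codes standing in for sampled StyleGAN inputs
z1 = np.zeros(512)
z2 = np.ones(512)
codes = interpolate_latents(z1, z2, steps=5)
```

The endpoints reproduce the two source cartoons, and intermediate rows give a controllable gradation between their styles.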
(11) Deforming different input face photos with the projection matrix obtained in the training stage, and rendering textures with the semantic-level style migration method.
The face photos and cartoon images are preprocessed: the face is aligned and cropped according to the keypoints in the image, and the image is resized to 512×512 pixels; the keypoints of the face and cartoon images are obtained with a face keypoint detection algorithm; a facial-photo/cartoon-image data set is selected and divided into a training set and a test set. In the training stage, a suitable number of face photo keypoints and cartoon keypoints are loaded to obtain the centered face keypoint matrix and the centered cartoon keypoint matrix. In the geometric deformation module, the whitening algorithm is applied to the covariance matrix of the face keypoint matrix, so that the covariance matrix of the whitened face keypoints is diagonal; then, to align the whitened face keypoint covariance with the cartoon keypoint covariance, the coloring algorithm solves for the eigenvectors and the diagonal eigenvalue matrix of the cartoon keypoint covariance matrix, and rearranging yields the projection matrix from face to cartoon. In the test stage, for a given face photograph and its keypoints, the projection matrix is used to transform its keypoints into cartoon keypoints. From the paired face keypoints and cartoon keypoints an affine matrix is solved, and this matrix is finally used to warp the whole face map into the face deformation map. In the texture migration module, there are two ways to obtain the target cartoon style map: latent-space interpolation with StyleGAN, or random sampling from the real cartoon collection.
In the test stage, for a given cartoon style map, the neighbor relation matrix between the face deformation map (content map) and the cartoon (style map) is computed based on the manifold hypothesis, and the gradient of the semantically aligned style loss is back-propagated iteratively to optimize the input deformation map, rendering the artistic colors or textures of the target cartoon. During the iterative optimization, the content loss for content-structure preservation prevents damage to the inherent structural features of the face. After a suitable number of iterations, the final generated caricature is output, as shown in fig. 3.

Claims (5)

1. A face cartoon generation method based on learning geometry and texture style migration, characterized by comprising the following steps:
(1) Obtaining face and cartoon key points: obtaining the key points of a face photo and a cartoon image through a face key-point extraction algorithm;
(2) Building face and cartoon distributions: dividing a pre-acquired face-photo/cartoon-image dataset into a training set and a test set, and loading the matrices representing the face-photo key-point distribution and the cartoon key-point distribution;
(3) Calculating the projection matrix: acquiring the covariance matrices of the face-photo key-point distribution and the cartoon key-point distribution, obtaining through the WCT algorithm a linear transformation that aligns the two distributions, and recording this transformation as the projection matrix;
(4) Obtaining cartoon-domain key points: projecting the face key points into the cartoon domain with the projection matrix, thereby obtaining the key points corresponding to the face in the cartoon domain;
(5) Acquiring the face deformation map: solving the affine warp matrix from the face to the cartoon according to the face key points and the cartoon key points, and applying the affine matrix to the whole face image, thereby obtaining the face deformation map;
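The affine matrix of step (5) can be solved by least squares over the paired key points; a minimal sketch, assuming (n, 2) point arrays. Warping the actual image (e.g. with `cv2.warpAffine`) is omitted, and the point values are invented.

```python
import numpy as np

# Sketch of step (5): solve the least-squares 2x3 affine matrix mapping face
# key points to their projected cartoon key points.

def solve_affine(src, dst):
    """src, dst: (n, 2) paired key points. Returns the 2x3 affine matrix M
    with dst ≈ src @ M[:, :2].T + M[:, 2]."""
    A = np.hstack([src, np.ones((src.shape[0], 1))])   # homogeneous coordinates
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)        # (3, 2) least-squares solution
    return M.T                                          # (2, 3)

src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dst = src * 2.0 + np.array([10.0, 20.0])   # a known affine: scale by 2, then translate
M = solve_affine(src, dst)
print(np.round(M, 6))   # recovers [[2, 0, 10], [0, 2, 20]]
```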
(6) Calculating the feature-based neighbor matrix: inputting the face deformation map and the cartoon style map separately into a VGG-19 network to extract the content representation and the style representation; computing the cosine similarity between each position feature of the content representation and each position feature of the style representation to obtain a position-to-position neighbor matrix, wherein each element of the matrix encodes the similarity of features at different spatial positions and describes the semantic-level matching relation from content to style;
(7) Semantic-based style loss: for the content representation of the face deformation map and the style representation of the cartoon, computing a position-to-position Euclidean distance matrix with the same dimensions as the neighbor matrix; defining K as the number of neighbors, assigning 1 to the K positions of the Euclidean distance matrix indicated by the first K neighbors in the neighbor matrix and 0 to the remaining positions, and finally summing the masked distance matrix to obtain the style loss;
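Steps (6) and (7) can be sketched on tiny random feature arrays. For simplicity this sketch masks by the one-sided top-K style neighbors of each content position; the claim's neighbor matrix additionally requires the match to be mutual. All shapes and the K value are illustrative assumptions.

```python
import numpy as np

# Sketch of steps (6)-(7): cosine-similarity neighbors select which
# content/style feature pairs contribute to the style loss.

def semantic_style_loss(C, S, k=2):
    """C: (nc, d) content features; S: (ns, d) style features,
    one feature vector per spatial position."""
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    sim = Cn @ Sn.T                                  # cosine similarity, (nc, ns)
    mask = np.zeros_like(sim)                        # neighbor matrix
    nn = np.argsort(-sim, axis=1)[:, :k]             # top-k style neighbors per row
    np.put_along_axis(mask, nn, 1.0, axis=1)         # 1 at matched pairs, 0 elsewhere
    dist = ((C[:, None, :] - S[None, :, :]) ** 2).sum(-1)  # squared Euclidean distances
    return (mask * dist).sum()                       # sum only over matched pairs

rng = np.random.default_rng(1)
C = rng.normal(size=(5, 8))   # 5 content positions, 8-dim features
S = rng.normal(size=(7, 8))   # 7 style positions
loss = semantic_style_loss(C, S, k=2)
print(loss > 0)   # → True
```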
(8) Iteratively generating the cartoon picture: based on the back-propagation algorithm, returning the gradient of the style loss in an iterative-update manner and penalizing the distances between matched content and style features so that similar features are drawn ever closer; the input deformation map is thus gradually rendered with the style texture of the cartoon, yielding the final cartoon picture;
(9) Feature-based preservation of the face content structure: re-inputting the generated cartoon into VGG-19 to extract the stylized content representation, and using the mean squared error between the stylized content representation and the deformation-map content representation as the content loss, thereby keeping the original face structure of the deformation map from being damaged by the texture migration process;
(10) Obtaining diversified cartoon styles: generating a large number of cartoons by means of StyleGAN while providing control over the generated cartoons through latent-space interpolation, so as to construct a feature-rich cartoon style distribution;
(11) Deforming different input face photos with the projection matrix learned in the training stage, and rendering textures with the semantic-level style migration method;
the step (3) is realized by the following formula:
$$L_{pw} = E_p D_p^{-1/2} E_p^{T} \bar{L}_p, \qquad \bar{L}_{pc} = E_c D_c^{1/2} E_c^{T} L_{pw}, \qquad P = E_c D_c^{1/2} E_c^{T} E_p D_p^{-1/2} E_p^{T}$$

wherein $L_{pw}$ is the whitened picture key-point matrix; the mean vector of the picture key points is denoted $\bar{\mu}_p$ and the mean vector of the cartoon key points is denoted $\bar{\mu}_c$; the centered picture key-point matrix is $\bar{L}_p$ and the centered cartoon key-point matrix is $\bar{L}_c$; eigenvalue decomposition of $\bar{L}_p \bar{L}_p^{T}$ yields $D_p$, $E_p$ and $E_p^{T}$, where $D_p$ is the diagonal matrix whose diagonal elements are the eigenvalues in the picture domain, $E_p$ is the orthogonal matrix whose column vectors are the picture-domain eigenvectors, and $E_p^{T}$ is the transpose of $E_p$; eigenvalue decomposition of $\bar{L}_c \bar{L}_c^{T}$ yields $D_c$, $E_c$ and $E_c^{T}$, where $D_c$ is the diagonal matrix whose diagonal elements are the eigenvalues in the cartoon domain, $E_c$ is the orthogonal matrix whose column vectors are the cartoon-domain eigenvectors, and $E_c^{T}$ is the transpose of $E_c$; $\bar{L}_{pc}$ is the aligned picture key-point matrix, whose covariance matrix equals that of $\bar{L}_c$; $P$ is the projection matrix from pictures to cartoons.
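The whitening-coloring projection of step (3) can be sketched in numpy; the random key-point matrices below are stand-ins for the real centered distributions, and all shapes and sample counts are assumptions.

```python
import numpy as np

# Numpy sketch of the WCT projection in step (3).

def wct_projection(Lp, Lc):
    """Lp: (2, n) centered photo key points; Lc: (2, m) centered cartoon key points.
    Returns P such that the covariance of P @ Lp equals the covariance of Lc."""
    cov_p = Lp @ Lp.T / Lp.shape[1]
    cov_c = Lc @ Lc.T / Lc.shape[1]
    # Whitening: remove the photo covariance (E_p D_p^{-1/2} E_p^T)
    dp, Ep = np.linalg.eigh(cov_p)
    whiten = Ep @ np.diag(dp ** -0.5) @ Ep.T
    # Coloring: impose the cartoon covariance (E_c D_c^{1/2} E_c^T)
    dc, Ec = np.linalg.eigh(cov_c)
    color = Ec @ np.diag(dc ** 0.5) @ Ec.T
    return color @ whiten

rng = np.random.default_rng(0)
Lp = rng.normal(size=(2, 200)); Lp -= Lp.mean(axis=1, keepdims=True)
Lc = rng.normal(size=(2, 300)) * np.array([[3.0], [0.5]]); Lc -= Lc.mean(axis=1, keepdims=True)

P = wct_projection(Lp, Lc)
aligned = P @ Lp
# The projected photo key points now share the cartoon covariance
print(np.allclose(aligned @ aligned.T / aligned.shape[1], Lc @ Lc.T / Lc.shape[1]))  # → True
```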
2. The method for generating facial cartoon based on learning geometry and texture style migration of claim 1 wherein the number of key points in step (1) is 128.
3. The method for generating a facial cartoon based on learning geometry and texture style migration according to claim 1, wherein the key points corresponding to the facial cartoon in the step (4) are:
$$l_c = P\,(l_p - \bar{\mu}_p) + \bar{\mu}_c$$

wherein $P$ is the projection matrix, $l_p$ is the face key-point matrix of the test picture, the mean vector of the face key points is denoted $\bar{\mu}_p$, and the mean vector of the cartoon key points is denoted $\bar{\mu}_c$.
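Applying the projection of claim 3 to a test face's key points looks like this in numpy; every numeric value (the matrix $P$, the means, the key points) is invented purely for illustration.

```python
import numpy as np

# Center the test key points, project with P, then shift to the cartoon mean.
P = np.array([[1.5, 0.0],
              [0.0, 0.8]])           # assumed 2x2 projection matrix
mu_p = np.array([[256.0], [256.0]])  # assumed mean of photo key points
mu_c = np.array([[240.0], [270.0]])  # assumed mean of cartoon key points

l_p = np.array([[300.0, 200.0],      # x-coordinates of two test key points
                [310.0, 220.0]])     # y-coordinates

l_c = P @ (l_p - mu_p) + mu_c        # center, project, shift to the cartoon mean
print(l_c)   # → [[306.  156.] [313.2 241.2]]
```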
4. The face cartoon generating method based on learning geometry and texture style migration of claim 1, wherein the implementation process of the step (7) is as follows:
constructing a graph matrix $A^{l} \in \{0,1\}^{(W_l H_l) \times (W_l H_l)}$, where $W_l$ and $H_l$ are respectively the width and height of the feature map and $l$ refers to the features of the $l$-th layer; each element of the matrix encodes the similarity of features at different spatial positions and is defined as follows:

$$A^{l}(i,j) = \begin{cases} 1, & \text{if } S_j^{l} \in \mathrm{NN}_k\!\left(C_i^{l}\right) \text{ and } C_i^{l} \in \mathrm{NN}_k\!\left(S_j^{l}\right) \\ 0, & \text{otherwise} \end{cases}$$

wherein $C_i^{l}$ is the feature vector of $C^{l}$ at the $i$-th position and $S_j^{l}$ is the feature vector of $S^{l}$ at the $j$-th position; $\mathrm{NN}_k(C_i^{l})$ denotes the $k$ nearest neighbors of $C_i^{l}$ and $\mathrm{NN}_k(S_j^{l})$ denotes the $k$ nearest neighbors of $S_j^{l}$; the distance measure used to compute the $k$ neighbors is the cosine distance; to achieve semantically aligned style migration, the following objective function is optimized:

$$\mathcal{L}_{style} = \sum_{l} \sum_{i,j} A^{l}(i,j)\, \left\| G_i^{l} - S_j^{l} \right\|_2^2$$

if $A^{l}(i,j)=1$, the feature at the $i$-th position of the content map shares the same semantics as the feature at the $j$-th position of the style map; the stylization objective forces $G^{l}$ to be similar to $S^{l}$ at the matched positions; the final objective function is as follows:

$$\mathcal{L} = \alpha\, \mathcal{L}_{content} + \beta\, \mathcal{L}_{style}$$

wherein $\mathcal{L}_{content}$ is the content loss defined on the features of layer $l_{con}$, and $\alpha$ and $\beta$ are hyper-parameters that balance the content loss and the style loss.
5. The method according to claim 1, wherein the step (10) of generating a large number of cartoons by means of StyleGAN consists in generating cartoon style maps of different styles with a generative adversarial network trained on a large-scale cartoon dataset.
CN202110118105.7A 2021-01-28 2021-01-28 Face cartoon generation method based on learning geometry and texture style migration Active CN112883826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118105.7A CN112883826B (en) 2021-01-28 2021-01-28 Face cartoon generation method based on learning geometry and texture style migration


Publications (2)

Publication Number Publication Date
CN112883826A CN112883826A (en) 2021-06-01
CN112883826B true CN112883826B (en) 2024-04-09

Family

ID=76052999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110118105.7A Active CN112883826B (en) 2021-01-28 2021-01-28 Face cartoon generation method based on learning geometry and texture style migration

Country Status (1)

Country Link
CN (1) CN112883826B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989441B (en) * 2021-11-16 2024-05-24 北京航空航天大学 Automatic three-dimensional cartoon model generation method and system based on single face image
CN114897672A (en) * 2022-05-31 2022-08-12 北京外国语大学 Image cartoon style migration method based on equal deformation constraint
CN115358917B (en) * 2022-07-14 2024-05-07 北京汉仪创新科技股份有限公司 Method, equipment, medium and system for migrating non-aligned faces of hand-painted styles
CN116310008B (en) * 2023-05-11 2023-09-19 深圳大学 Image processing method based on less sample learning and related equipment

Citations (5)

Publication number Priority date Publication date Assignee Title
CN110415308A (en) * 2019-06-21 2019-11-05 浙江大学 A kind of human-face cartoon generation method based on cyclic space switching network
CN111508048A (en) * 2020-05-22 2020-08-07 南京大学 Automatic generation method for human face cartoon with interactive arbitrary deformation style
CN111508069A (en) * 2020-05-22 2020-08-07 南京大学 Three-dimensional face reconstruction method based on single hand-drawn sketch
CN112232485A (en) * 2020-10-15 2021-01-15 中科人工智能创新技术研究院(青岛)有限公司 Cartoon style image conversion model training method, image generation method and device
CN112258387A (en) * 2020-10-30 2021-01-22 北京航空航天大学 Image conversion system and method for generating cartoon portrait based on face photo

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10529115B2 (en) * 2017-03-20 2020-01-07 Google Llc Generating cartoon images from photos
US10607065B2 (en) * 2018-05-03 2020-03-31 Adobe Inc. Generation of parameterized avatars

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN110415308A (en) * 2019-06-21 2019-11-05 浙江大学 A kind of human-face cartoon generation method based on cyclic space switching network
CN111508048A (en) * 2020-05-22 2020-08-07 南京大学 Automatic generation method for human face cartoon with interactive arbitrary deformation style
CN111508069A (en) * 2020-05-22 2020-08-07 南京大学 Three-dimensional face reconstruction method based on single hand-drawn sketch
CN112232485A (en) * 2020-10-15 2021-01-15 中科人工智能创新技术研究院(青岛)有限公司 Cartoon style image conversion model training method, image generation method and device
CN112258387A (en) * 2020-10-30 2021-01-22 北京航空航天大学 Image conversion system and method for generating cartoon portrait based on face photo

Non-Patent Citations (4)

Title
Yu Qian; Gao Yang; Huo Jing; Zhuang Yunkai. Discriminative joint multi-manifold analysis in video face recognition. Journal of Software. 2015, full text. *
Research and application of automatic face caricature generation algorithms; Wang Deqiu; China Excellent Masters' Theses Database; full text *
Example-based portrait line-drawing generation ***; Chen Hong, Zheng Nanning, Xu Yingqing, Shen Xiangyang; Journal of Software; 2003-02-23 (No. 02); full text *
Cartoon-style face portrait generation algorithm; Yan Fang; Fei Guangzheng; Liu Tingting; Ma Wenhui; Shi Minyong; Journal of Computer-Aided Design & Computer Graphics (No. 04); full text *


Similar Documents

Publication Publication Date Title
CN112883826B (en) Face cartoon generation method based on learning geometry and texture style migration
CN111632374B (en) Method and device for processing face of virtual character in game and readable storage medium
Güçlütürk et al. Convolutional sketch inversion
CN106971414B (en) Three-dimensional animation generation method based on deep cycle neural network algorithm
Zhang et al. Facial expression retargeting from human to avatar made easy
Zhong et al. Towards practical sketch-based 3d shape generation: The role of professional sketches
Shim et al. A subspace model-based approach to face relighting under unknown lighting and poses
KR20200052438A (en) Deep learning-based webtoons auto-painting programs and applications
CN104732506A (en) Character picture color style converting method based on face semantic analysis
CN105354248A (en) Gray based distributed image bottom-layer feature identification method and system
Yoo et al. Local color transfer between images using dominant colors
KR20200064591A (en) Webtoons color customizing programs and applications of deep learning
CN107392213B (en) Face portrait synthesis method based on depth map model feature learning
CN110097615B (en) Stylized and de-stylized artistic word editing method and system
CN109325994B (en) Method for enhancing data based on three-dimensional face
CN102013020B (en) Method and system for synthesizing human face image
CN112837210A (en) Multi-form-style face cartoon automatic generation method based on feature image blocks
CN110428404B (en) Artificial intelligence-based auxiliary culture and auxiliary appreciation formulation system
CN110288667A (en) A kind of image texture moving method based on structure guidance
CN114037644B (en) Artistic word image synthesis system and method based on generation countermeasure network
Liu et al. Palette-based recoloring of natural images under different illumination
Yu et al. Deep semantic space guided multi-scale neural style transfer
CN112489218B (en) Single-view three-dimensional reconstruction system and method based on semi-supervised learning
CN114742890A (en) 6D attitude estimation data set migration method based on image content and style decoupling
Way et al. TwinGAN: Twin generative adversarial network for Chinese landscape painting style transfer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant