CN114612289A - Stylized image generation method and device and image processing equipment - Google Patents

Stylized image generation method and device and image processing equipment

Info

Publication number
CN114612289A
Authority
CN
China
Prior art keywords
image
stylized
converted
hidden vector
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210207313.9A
Other languages
Chinese (zh)
Inventor
金成彬
韩欣彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202210207313.9A priority Critical patent/CN114612289A/en
Publication of CN114612289A publication Critical patent/CN114612289A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/04 Context-preserving transformations, e.g. by using an importance map

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a stylized image generation method, a stylized image generation apparatus, and an image processing device. Multi-scale feature extraction is performed on an input image by a variational self-encoder; the mean and variance of the hidden vector of each feature dimension of the input image are calculated from the feature extraction result; a hidden vector of the target space is then sampled based on the mean and variance and used to generate a corresponding stylized image. In this way, the image quality of the generated stylized image can be improved.

Description

Stylized image generation method and device and image processing equipment
Technical Field
The application relates to the technical field of graphic image processing, in particular to a stylized image generation method and device and image processing equipment.
Background
Stylization of an image, also called image style transfer, is a technique that migrates a characteristic image style (such as an artistic style) onto another image, so that the original image keeps its content while taking on a distinctive artistic style such as cartoon, oil painting, watercolor, or ink wash. In a typical application scenario of the image stylization technique, a face picture input by a user can be stylized and converted to output a face picture of a specific style, for example a Disney style, an anime style, or a cartoon style, so as to meet the user's specific requirements.
As users' demands on image stylization keep growing, how to improve the quality of the stylized picture obtained by stylized conversion of an image has been a continuing subject of research for those skilled in the art.
Disclosure of Invention
Based on the above, in order to solve at least part of the problems described above, in a first aspect, an embodiment of the present application provides a stylized image generating method, including:
inputting an image to be converted into a variational self-encoder obtained by training in advance, and processing the image to be converted through the variational self-encoder to obtain the mean value and the variance of the implicit vectors of each characteristic dimension of the image to be converted;
obtaining a target space hidden vector of the image to be converted in each characteristic dimension according to the mean value and the variance of the hidden vector of each characteristic dimension;
and inputting the target space hidden vector into a trained stylized image generator to perform image stylized conversion, and generating a stylized image with a set image style.
In a possible implementation manner of this embodiment, obtaining the target space hidden vector of the image to be converted in each feature dimension according to the mean and the variance of the hidden vectors in each feature dimension includes:
sampling the mean value and the variance of the hidden vectors of each characteristic dimension through the variational self-encoder to obtain a first spatial hidden vector of the image to be converted in each characteristic dimension;
and correcting the first spatial hidden vector of the image to be converted in each characteristic dimension to obtain a second spatial hidden vector of the image to be converted in each characteristic dimension, wherein the second spatial hidden vector is used as the target spatial hidden vector.
The correction formula for correcting the first spatial hidden vector of the image to be converted in each feature dimension is as follows:
$$s^{+} = \arg\min_{s}\; \left\| G(s) - x \right\|_{2} + w_{vgg} \left\| F(G(s)) - F(x) \right\|_{2} \quad \text{s.t.} \quad \left\| s - s_{0} \right\| \le \delta$$

wherein s+ is the second spatial hidden vector, s is the first spatial hidden vector, x represents the input image, ε_θ represents the variational self-encoder (whose sampling provides the initial value s_0 of s), and G represents the stylized image generator. The first term keeps the stylized image generated by the stylized image generator consistent with the image to be converted at the pixel level; the second term represents the semantic feature loss of the image to be converted, where F is a VGG network used to calculate the perceptual similarity of the image generated from the first-space vector and w_vgg is a preset weight parameter. The constraint requires the difference between the second spatial hidden vector and the initial value s_0 of the first spatial hidden vector to be within a set range, which is realized by iteratively fine-adjusting with small amplitude a set number of times starting from s_0.
In a possible implementation manner of this embodiment, the variational self-encoder includes a plurality of coding layers which are sequentially cascaded, a plurality of decoding layers which are sequentially cascaded, and a full-connection layer which is respectively connected to the decoding layers, where each coding layer is connected to a corresponding decoding layer, and each decoding layer is connected to one full-connection layer;
the image to be converted is input from a first coding layer in a plurality of coding layers, each coding layer sequentially performs scale-reduction coding processing on the image to be converted to obtain a coding feature map, and the obtained coding feature map is output to a next coding layer and a decoding layer corresponding to the coding layer;
the input of the first decoding layer is the output of its corresponding coding layer, and the input of each other decoding layer is the output of the preceding decoding layer plus the output of the corresponding coding layer; each decoding layer decodes its input data to output feature maps of different dimension sizes, and the feature maps are processed by the fully-connected layers to obtain the mean and variance of the hidden vectors of the different dimensions corresponding to the image to be converted;
and in the decoding layers, the dimension of the characteristic diagram output by each decoding layer is gradually reduced from the first decoding layer.
In a possible implementation manner of this embodiment, the method further includes a step of training the variational self-encoder, where the step includes:
acquiring a first training data set, wherein the first training data set comprises a plurality of sample images;
sequentially inputting the sample images into the variational self-encoder to obtain target spatial hidden vectors corresponding to the sample images;
inputting the target space hidden vector corresponding to the sample image into the trained stylized image generator to obtain a stylized image corresponding to the sample image;
and calculating to obtain a loss function of the variational self-encoder according to the stylized image and the sample image, and adjusting model parameters of the variational self-encoder according to the loss function until a training convergence condition is met.
In a possible implementation manner of this embodiment, the loss function includes a pixel-level loss, a semantic similarity loss, an identity information loss and a KL-divergence term, and an expression of the loss function is as follows:

$$L = L_{rec} + w_{per} L_{per} + w_{id} L_{id} + w_{kl} L_{kl}$$

wherein $L_{rec} = \left\| x - G(\varepsilon_{\theta}(x)) \right\|_{2}$ represents the pixel-level loss, x is the sample image, ε_θ represents the variational self-encoder, G represents the trained stylized image generator, and the L_2 norm measures an image distance between the sample image and the stylized image generated by the stylized image generator;

$L_{per} = L_{lpips}$ represents the semantic similarity loss, where L_lpips is the semantic feature similarity between the sample image and its generated stylized image, calculated after semantic features are extracted from the two images by a semantic extraction model;

$L_{id} = L_{arc}$ represents the identity information loss, where L_arc is the similarity between the identity features obtained by performing identity information recognition on the sample image and on its generated stylized image;

and w_per, w_id and w_kl are weight parameters set in advance for L_per, L_id and the KL-divergence term L_kl, respectively.
In a possible implementation manner of this embodiment, the method further includes a step of training the stylized image generator in advance, where the step includes:
acquiring a stylized image dataset, wherein the stylized image dataset comprises a plurality of stylized sample images;
sequentially inputting the stylized sample images into a stylized image generator to be trained to obtain a generated image corresponding to the stylized sample image, and calculating to obtain a loss function value of the stylized image generator;
optimizing the network parameters of the stylized image generator according to the loss function value until the calculated loss function value meets a training convergence condition to obtain a trained stylized image generator;
the stylized image generator comprises convolution layers corresponding to different image resolutions respectively, and when network parameters of the stylized image generator are optimized, the network parameters of the convolution layers corresponding to the image resolutions smaller than the set image resolution are kept unchanged.
Wherein the calculation formula of the loss function value of the stylized image generator is as follows:
$$\min_{G} \max_{D}\; \mathbb{E}_{x \sim p_{d}}\left[\log D(x)\right] + \mathbb{E}_{\hat{x} \sim p_{g}}\left[\log\left(1 - D(\hat{x})\right)\right] - \frac{\gamma}{2}\, \mathbb{E}_{x \sim p_{d}}\left[\left\| \nabla_{x} D(x) \right\|^{2}\right]$$

wherein x ~ p_d represents the distribution of the stylized image dataset, x̂ ~ p_g represents the distribution of the dataset composed of the stylized images generated by the stylized image generator from each of the stylized sample images, D is a discriminator, and ∇_x D(x) represents the discriminator gradient operator on a stylized sample image, whose squared norm forms a gradient penalty term weighted by γ.
In a second aspect, the present embodiment further provides a stylized image generation apparatus applied to an image processing device, the stylized image generation apparatus including:
the residual error calculation module is used for inputting the image to be converted into a variational self-encoder obtained by training in advance, and processing the image to be converted through the variational self-encoder to obtain the mean value and the variance of the implicit vectors of each characteristic dimension of the image to be converted;
the hidden vector processing module is used for obtaining a target space hidden vector of the image to be converted in each characteristic dimension according to the mean value and the variance of the hidden vector of each characteristic dimension;
and the image generation module is used for inputting the target space hidden vector into a trained stylized image generator to perform image stylized conversion, and generating a stylized image with a set image style.
In a third aspect, the present embodiments also provide an image processing apparatus comprising a machine-readable storage medium and one or more processors, the machine-readable storage medium storing machine-executable instructions that, when executed by the one or more processors, implement the above-described method.
Based on the above content of the embodiments of the present application, compared with the prior art, the method, the apparatus, and the image processing device for generating a stylized image provided by the embodiments of the present application perform multi-scale feature extraction on an input image through a variational self-encoder (e.g., a variational self-encoder having a feature pyramid residual network structure), calculate (estimate) a mean value and a variance of a hidden vector of each feature dimension of the input image according to a feature extraction result, sample and obtain a hidden vector of a target space (e.g., S space) based on the mean value and the variance, and generate a corresponding stylized image based on the hidden vector by a stylized image generator. Therefore, the features extracted by carrying out multi-scale feature extraction on the image can be more accurately expressed, and the image quality of the generated stylized image can be improved.
Further, in the embodiment, a concept of spatial hidden vector modification is introduced, for example, an S spatial hidden vector (a first spatial hidden vector) is modified to obtain an S + spatial hidden vector (a second spatial hidden vector) for generating a stylized image, so that the generated stylized image has a good ID (identity information) retention capability, the generated stylized image is closer to the real feature data distribution of the input image, and the image quality of the generated stylized image is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic flow chart of a stylized image generation method according to an embodiment of the present application.
Fig. 2 is a schematic diagram provided in an embodiment of the present application to describe a workflow and a principle of a variational self-encoder in the embodiment.
Fig. 3 is a schematic flowchart for training the variational self-encoder according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating an exemplary process for training a variational self-encoder as described above.
Fig. 5 is a schematic diagram of a training flow for the stylized image generator used in the present embodiment.
Fig. 6 is a schematic comparison diagram before and after training of the stylized image generator provided in the embodiment of the present application.
Fig. 7 is a schematic diagram of an exemplary process of generating a stylized image of a corresponding style by using a trained variational auto-encoder and stylized image generation to perform stylization on an image to be converted in an embodiment of the present application.
Fig. 8 is a schematic diagram of an image processing apparatus provided in an embodiment of the present application.
Fig. 9 is a functional block diagram of a stylized image generating apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Before the related technical solutions provided in the embodiments of the present application are described in detail, the related technical terms involved herein are first explained, so that the technical solutions of the embodiments can be understood more clearly.
The latent (hidden) space of a style-based generative adversarial network (StyleGAN2) generally includes a Z space, a W space, a W+ space and an S space.

The Z space is the most primitive input space, typically a standard normal distribution or a uniform distribution, and may be called the random noise space.

The W space is a latent space derived from the Z space by a series of fully-connected layer transformations, and is generally considered to reflect the learned disentangled properties better than the Z space.

The W+ space is constructed similarly to the W space, but the latent vectors fed to each layer of the generator differ; it is often used for Style Mixing and Image Inversion.

The S space is obtained by a further transformation based on the W space: for each layer of the generator, a different Affine Transformation (a linear transformation followed by a translation) maps ω ∈ W into channel-level style parameters s.
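As a minimal illustration of how these spaces relate in code (a sketch with assumed module names, layer counts and dimensions, not the StyleGAN2 implementation), the following maps a noise vector z through fully-connected layers to w, then through per-layer affine heads to channel-level style vectors s:

```python
import torch
import torch.nn as nn

class LatentSpacesSketch(nn.Module):
    """Illustrative mapping z (Z space) -> w (W space) -> per-layer s (S space)."""

    def __init__(self, z_dim=512, w_dim=512, layer_channels=(512, 512, 256, 128, 64)):
        super().__init__()
        # W space: derived from Z with a stack of fully-connected layers.
        self.mapping = nn.Sequential(
            nn.Linear(z_dim, w_dim), nn.LeakyReLU(0.2),
            nn.Linear(w_dim, w_dim), nn.LeakyReLU(0.2),
        )
        # S space: a different learned affine head per generator layer, giving
        # channel-level style parameters.
        self.affines = nn.ModuleList(nn.Linear(w_dim, c) for c in layer_channels)

    def forward(self, z):
        w = self.mapping(z)                          # omega in W space
        s = [affine(w) for affine in self.affines]   # channel-level styles, S space
        return w, s

z = torch.randn(4, 512)                              # Z space: random noise
w, s = LatentSpacesSketch()(z)
print(w.shape, [tuple(t.shape) for t in s])
```

Each nn.Linear head (a linear map plus a bias, i.e. a translation) plays the role of the per-layer Affine Transformation described above.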
The inventors of the present application have found that, in one possible image stylization technique, a single input image (for example, a face image) can be converted into a stylized image of a set style (for example, an image of a Disney style) based on transfer learning and convolutional-layer swapping (layer-swapping). This approach generally computes the latent space vector (latent) of the input image by iterative optimization of an inverse mapping and then generates the stylized image from that vector. Its disadvantages are a slow computation speed, and the quality of the generated stylized image may be affected because the distribution of the hidden vector space of the input image itself cannot be well maintained.
In view of this, an embodiment of the present application innovatively provides a stylized image generation scheme, where a variational self-encoder (e.g., a variational self-encoder with a feature pyramid residual network structure) is used to perform multi-scale feature extraction on an input image, a mean value and a variance of a hidden vector of each feature dimension of the input image are calculated (estimated) according to a feature extraction result, a hidden vector of a target space (e.g., S space) is obtained based on mean value and variance sampling, and a stylized image generator is used to generate a corresponding stylized image based on the hidden vector. Therefore, the generated stylized image can be more close to the real characteristic data distribution of the input image, and the generation quality and effect of the stylized image are improved. Specific implementation methods of the embodiments of the present application will be described below in conjunction with the accompanying drawings.
Fig. 1 is a schematic flow chart of a stylized image generation method according to an embodiment of the present application. It should be understood that, in the stylized image generation method provided in this embodiment, the order of some steps included in the stylized image generation method may be interchanged according to actual needs in actual implementation, or some steps may be omitted or deleted, which is not specifically limited in this embodiment.
The steps of the stylized image generation method of this embodiment are described in detail below by way of example. As shown in fig. 1, the method may include the contents described in steps S100 to S300 below.
Step S100, inputting an image to be converted into a variational self-encoder obtained by training in advance, and processing the image to be converted through the variational self-encoder to obtain the mean value and the variance of the implicit vectors of each characteristic dimension of the image to be converted.
In a possible implementation manner of this embodiment, the variational self-encoder may include a plurality of coding layers that are sequentially cascaded, a plurality of decoding layers that are sequentially cascaded, and a full-connection layer that is respectively connected to the decoding layers, where each coding layer is connected to one corresponding decoding layer, and each decoding layer is connected to one full-connection layer.
The input of the first decoding layer is the output of its corresponding coding layer, and the input of each other decoding layer is the output of the preceding decoding layer plus the output of the corresponding coding layer; each decoding layer decodes its input data to output feature maps of different dimension sizes, and the feature maps are processed by the fully-connected layers to obtain the mean and variance of the hidden vectors of the different dimensions corresponding to the image to be converted.
In this embodiment, in the plurality of decoding layers, starting from the first decoding layer, the dimension of the feature map output by each decoding layer is gradually reduced.
For example, fig. 2 is a schematic diagram provided in this application to describe the working flow and principle of the variational self-encoder. As an example, the plurality of coding layers included in the variational self-encoder may be five coding layers which are sequentially cascaded, such as E1, E2, E3, E4, and E5 shown in fig. 2, where E1 is the first coding layer and E5 is the last coding layer. The plurality of decoding layers included in the variational self-encoder can be five decoding layers which are sequentially cascaded, such as D1, D2, D3, D4 and D5 shown in fig. 2, wherein D1 is used as a first decoding layer, and D5 is used as a last decoding layer. The encoding layers E1, E2, E3, E4, and E5 are connected to the decoding layers D5, D4, D3, D2, and D1, respectively.
The image to be converted is input from a first layer of the coding layers, each coding layer sequentially performs scale-down coding processing on the image to be converted to obtain a coding feature map, and the obtained coding feature map is output to a coding layer of a next layer and a decoding layer corresponding to the coding layer.
For example, as shown in fig. 2, the image Z1 to be converted may be an image of scale H (e.g., a pixel resolution of 1024). When it is input at the first coding layer E1, E1 may down-scale it by down-sampling; for example, it may be processed into a feature map of scale H/2 (e.g., the pixel resolution becomes 512) by 2× down-sampling, and the resulting feature map is input into the next coding layer E2 and the corresponding decoding layer D5.

The second coding layer E2 may down-scale the feature map output by E1 by down-sampling; for example, 2× down-sampling yields a feature map of scale H/4, which is input into the next coding layer E3 and the corresponding decoding layer D4.

The third coding layer E3 may down-scale the feature map output by E2 by down-sampling; for example, 2× down-sampling yields a feature map of scale H/8, which is input into the next coding layer E4 and the corresponding decoding layer D3.

The fourth coding layer E4 may down-scale the feature map output by E3 by down-sampling; for example, 2× down-sampling yields a feature map of scale H/16, which is input into the next coding layer E5 and the corresponding decoding layer D2.

The fifth coding layer E5 is the last coding layer and may down-scale the feature map output by E4 by down-sampling; for example, 2× down-sampling yields a feature map of scale H/32, which is input into the corresponding decoding layer D1.
Correspondingly, the decoding layer D1, the decoding layer D2, the decoding layer D3, the decoding layer D4, and the decoding layer D5 respectively perform decoding processing (for example, upsampling processing may be performed) on the input feature map to obtain feature maps of feature dimensions of the image to be converted, and respectively transmit the feature maps to corresponding full-connection layers (FC as shown in the figure). Wherein, the output of each of the decoding layer D1, decoding layer D2, decoding layer D3 and decoding layer D4 is also used as the input of the corresponding next decoding layer. For example, the output of the decoding layer D1 is used as the input of the decoding layer D2, the output of the decoding layer D2 is used as the input of the decoding layer D3, the output of the decoding layer D3 is used as the input of the decoding layer D4, and the output of the decoding layer D4 is used as the input of the decoding layer D5.
The dimensions of the feature maps output by the decoding layers D1 to D5 decrease gradually; for example, the feature maps output by the decoding layers D1, D2, D3, D4 and D5 may have feature dimensions of 15 × 512, 3 × 256, 3 × 128, 3 × 64 and 2 × 32, respectively. In this embodiment, each coding layer may include at least one convolutional layer (e.g., three down-sampling convolutional layers), and each decoding layer may likewise include at least one convolutional layer (e.g., three up-sampling convolutional layers), so each decoding layer can output feature map data for each of its three convolutional layers. Based on the example of fig. 2 with five decoding layers, the variational self-encoder accordingly has outputs in five feature dimensions.
The fully-connected layers respectively process the input feature maps to obtain the mean (Zμ in fig. 2) and variance (Zσ in fig. 2) of the hidden vectors of the different feature dimensions corresponding to the image to be converted. A hidden vector S is then sampled based on the mean Zμ and variance Zσ and input into the stylized image generator, which can generate the corresponding stylized image Z2.
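A minimal sketch of this encoder-decoder arrangement in code might look as follows; the layer counts follow fig. 2, but the channel sizes, activations and the global-average-pooled fully-connected heads are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidVAEEncoder(nn.Module):
    """Sketch of the feature-pyramid variational encoder (assumed sizes)."""

    def __init__(self, latent_dims=(512, 256, 128, 64, 32)):
        super().__init__()
        e_ch = [3, 64, 128, 256, 512, 512]    # channels for E1..E5
        d_ch = [512, 512, 256, 128, 64, 64]   # channels for D1..D5
        # E1..E5: each coding layer halves the spatial scale (H/2 ... H/32).
        self.enc = nn.ModuleList(
            nn.Conv2d(e_ch[i], e_ch[i + 1], 3, stride=2, padding=1) for i in range(5)
        )
        # D1..D5: each decoding layer is fused with the skip connection from its
        # corresponding coding layer (E5 -> D1, ..., E1 -> D5).
        self.dec = nn.ModuleList(
            nn.Conv2d(d_ch[i], d_ch[i + 1], 3, padding=1) for i in range(5)
        )
        # One fully-connected head per decoding layer predicting the mean and
        # log-variance of the hidden vector at that feature dimension.
        self.heads = nn.ModuleList(
            nn.Linear(d_ch[i + 1], 2 * latent_dims[i]) for i in range(5)
        )

    def forward(self, x):
        skips = []
        for e in self.enc:                    # scale-reducing coding
            x = F.leaky_relu(e(x), 0.2)
            skips.append(x)
        stats, h = [], skips[-1]              # D1 takes E5's output
        for i, (d, head) in enumerate(zip(self.dec, self.heads)):
            h = F.leaky_relu(d(h), 0.2)
            if i < 4:                         # upsample, then add the skip
                h = F.interpolate(h, scale_factor=2, mode="bilinear",
                                  align_corners=False)
                h = h + skips[3 - i]
            mu, logvar = head(h.mean(dim=(2, 3))).chunk(2, dim=1)
            stats.append((mu, logvar))
        return stats                          # per-dimension (mean, log-variance)

stats = PyramidVAEEncoder()(torch.randn(1, 3, 256, 256))
print([tuple(m.shape) for m, _ in stats])
```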
Step S200, obtaining a target space hidden vector of the image to be converted in each characteristic dimension according to the mean value and the variance of the hidden vectors of each characteristic dimension.
In a possible implementation manner of this embodiment, the fully-connected layers may process the obtained mean and variance of the hidden vectors of each feature dimension to obtain the target space hidden vector of the image to be converted in each feature dimension.

Based on step S200, the mean and variance of the hidden vectors of each feature dimension may be sampled by the variational self-encoder, so as to obtain a first spatial hidden vector of the image to be converted in each feature dimension. For example, the mean and the variance may be sampled by the fully-connected layers to obtain the first spatial hidden vector of each feature dimension of the image to be converted, where the first spatial hidden vector may be an S-space hidden vector.
Wherein the formula for sampling from the mean and variance may be:

$$s = z_{\mu} + z_{\sigma} \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where s is the first spatial hidden vector and ε is standard normal noise.
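In code, this sampling is the standard reparameterization trick (a sketch; predicting the log-variance rather than the variance itself is an assumption made for numerical convenience):

```python
import torch

def sample_hidden_vector(z_mu, z_logvar):
    """s = z_mu + z_sigma * eps, eps ~ N(0, I): the reparameterization trick."""
    eps = torch.randn_like(z_mu)
    return z_mu + torch.exp(0.5 * z_logvar) * eps   # z_sigma = exp(logvar / 2)

s = sample_hidden_vector(torch.zeros(2, 512), torch.zeros(2, 512))  # s ~ N(0, I)
```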
Then, the first spatial hidden vector of the image to be converted in each feature dimension is corrected to obtain a second spatial hidden vector of the image to be converted in each feature dimension, which serves as the target spatial hidden vector. The second spatial hidden vector may be a hidden vector obtained by modifying the S-space hidden vector, and this embodiment may denote it as the S+ spatial hidden vector.
In some possible implementations, the target spatial hidden vector may also directly be the S-space hidden vector. However, through experimental research, the inventors found that during image stylization conversion, for example when converting from the real-person domain to the stylized domain, the ID retention for a person is poor, so that the stylized image generated after stylized conversion is inconsistent in fine details with the person's ID information (such as face features) in the input real-person image. Based on the discovery of this technical problem, the embodiments of the present application introduce the concept of spatial hidden vector correction, for example correcting the S-space hidden vector, so that the stylized image obtained in subsequent generation has better ID retention.
As a possible example, in this embodiment, the formula for correcting the first spatial hidden vector of each feature dimension of the image to be converted may be:
$$s^{+} = \arg\min_{s}\; \left\| G(s) - x \right\|_{2} + w_{vgg} \left\| F(G(s)) - F(x) \right\|_{2} \quad \text{s.t.} \quad \left\| s - s_{0} \right\| \le \delta$$

wherein s+ is the second spatial hidden vector, s is the first spatial hidden vector, x represents the image to be converted, ε_θ represents the variational self-encoder (whose sampling provides the initial value s_0 of s), and G represents the stylized image generator. The first term characterizes the differences at the pixel level between the stylized image generated by the stylized image generator and the image to be converted; the second term represents the semantic feature loss of the image to be converted, where F is a VGG network used to calculate the perceptual similarity of the image generated from the first-space vector and w_vgg is a preset weight parameter. The constraint requires the difference between the second spatial hidden vector and the initial value s_0 of the first spatial hidden vector to be within a set range, which is realized by iteratively fine-adjusting with small amplitude a set number of times starting from s_0.
Step S300, inputting the target space hidden vector into a trained stylized image generator to perform image stylized conversion, and generating a stylized image with a set image style.
In summary, in this embodiment, the variational self-encoder down-scales the image to be converted layer by layer, from bottom to top, across the coding layers, and extracts features from the feature maps of different scales corresponding to the image to be converted, thereby obtaining a feature pyramid. The feature map at the bottom of the feature pyramid has a high scale (e.g., high resolution) and the feature map at the top has a low scale (e.g., low resolution); the higher the level, the smaller the image and the smaller the scale. In this way, the feature dimensions can be increased and high-dimensional features constructed based on the scale changes of the image.

Meanwhile, the lower coding layers of the variational self-encoder attend more to the detail features of the image, while the higher coding layers attend more to its deep semantic information; an encoder with this structure can therefore express the image more accurately through the features obtained by multi-scale feature extraction.
In the embodiment, the mean and the variance of each feature dimension hidden vector of the image are predicted according to the extracted features, then the target space hidden vector is obtained based on the predicted mean and variance and then is input into the stylized image generator to generate a corresponding stylized image, and the quality of the generated stylized image can be improved.
Since the variational self-encoder of the present embodiment performs residual prediction based on the feature pyramid for subsequent generation of style images, the variational self-encoder of the present embodiment may be defined as a "feature pyramid residual variational self-encoder".
Further, the variational self-encoder provided by the embodiment may be obtained in advance through training. For example, as shown in fig. 3, the step of training the variational self-encoder may include the following steps S310 to S340, which are exemplarily described as follows.
Step S310, a first training data set is obtained, where the first training data set includes a plurality of sample images.
In this embodiment, the sample image may be a pre-collected face image with face features. The variational self-encoder can be trained on a reconstruction principle in combination with any pre-trained stylized image generator.
Step S320, sequentially inputting the sample images into the variational self-encoder to obtain spatial hidden vectors corresponding to the sample images.
In this embodiment, the spatial hidden vector corresponding to the sample image may be the S-spatial hidden vector, and does not need to be corrected in the training process.
Step S330, inputting the spatial hidden vector corresponding to the sample image into the trained stylized image generator to obtain the stylized image corresponding to the sample image.
In this embodiment, the trained stylized image generator may be a generator for generating a stylized image of any style. For example, the stylized image generator used in the training process of the variational self-encoder may be StyleGANv2-ADA; the style of the stylized image it generates may be the original style of the sample image or another style different from it, and during training the original style may be used. Meanwhile, during training the model parameters (such as the weights) of the stylized image generator are kept fixed and only the model parameters of the variational self-encoder need to be adjusted, the training target of the variational self-encoder being that the stylized image generated (or reconstructed) by the stylized image generator is consistent with the sample image. In this way, z_μ (the mean of the hidden vector) and z_σ (the variance of the hidden vector) obtained by the encoder can fit a normal distribution as closely as possible; for example, the mean and variance of the samples may be required to satisfy the following Kullback-Leibler divergence formula:

$$D_{kl}\left(q(z \mid x)\,\middle\|\,p(z)\right) = -\frac{1}{2} \sum_{i} \left( 1 + \log z_{\sigma,i}^{2} - z_{\mu,i}^{2} - z_{\sigma,i}^{2} \right)$$

where D_kl is the Kullback-Leibler divergence, p(z) is the standard normal distribution, z_{μ,i} is the mean of the hidden vector z in the i-th dimension, and z_{σ,i} is the variance of the hidden vector z in the i-th dimension.
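This KL term translates directly into code (a sketch assuming, as before, that the encoder predicts the log-variance):

```python
import torch

def kl_to_standard_normal(z_mu, z_logvar):
    """D_kl( N(z_mu, z_sigma^2) || N(0, I) ), summed over latent dimensions i."""
    return -0.5 * torch.sum(1 + z_logvar - z_mu.square() - z_logvar.exp(), dim=1)
```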
Step S340, calculating a loss function of the variational self-encoder according to the stylized image and the sample image, and adjusting the model parameters of the variational self-encoder according to the loss function until a training convergence condition is met, so as to obtain the trained variational self-encoder.
In this embodiment, the loss function may include a pixel-level loss, a semantic similarity loss, an identity information loss and a KL-divergence term, and an expression of the loss function is as follows:

$$L = L_{rec} + w_{per} L_{per} + w_{id} L_{id} + w_{kl} L_{kl}$$

wherein $L_{rec} = \left\| x - G(\varepsilon_{\theta}(x)) \right\|_{2}$ represents the pixel-level loss, x is the sample image, ε_θ represents the variational self-encoder, G represents the pre-trained stylized image generator, and the L_2 norm measures an image distance between the sample image and the stylized image generated by the stylized image generator. For example, the image distance may represent the similarity between the sample image and the stylized image generated by the stylized image generator, expressed by a cosine distance or a Euclidean distance; the training process needs to make the image distance between the two images as close to 0 as possible.

$L_{per} = L_{lpips}$ represents the semantic similarity loss, where L_lpips is the semantic feature similarity between the sample image and its generated stylized image, calculated after semantic features are extracted from the two images by a semantic extraction model. In this embodiment, the semantic extraction model may be any model capable of image semantic extraction, for example an LPIPS (Learned Perceptual Image Patch Similarity) model.

$L_{id} = L_{arc}$ represents the identity information loss, where L_arc is the similarity between the identity features obtained by performing identity information recognition on the sample image and on its generated stylized image. For example, it may be the similarity between the face features obtained by performing face recognition separately on the sample image and on the generated stylized image, and may likewise be expressed by a cosine similarity.

w_per, w_id and w_kl are weight parameters set in advance for L_per, L_id and the KL-divergence term L_kl, respectively.
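Assembled into code, the encoder's training loss might be computed as follows; this is a sketch in which lpips_model, arcface_embed and the weight values are assumptions for illustration rather than the patent's concrete choices:

```python
import torch

def vae_training_loss(x, x_styled, z_mu, z_logvar, lpips_model, arcface_embed,
                      w_per=0.8, w_id=0.1, w_kl=1e-4):
    """L = L_rec + w_per * L_per + w_id * L_id + w_kl * L_kl (illustrative weights)."""
    l_rec = (x - x_styled).square().mean()        # pixel-level loss L_rec
    l_per = lpips_model(x, x_styled).mean()       # LPIPS semantic loss L_per
    # Identity loss L_id: 1 - cosine similarity of the two face embeddings.
    l_id = 1 - torch.cosine_similarity(
        arcface_embed(x), arcface_embed(x_styled), dim=1).mean()
    # KL term L_kl keeping (z_mu, z_sigma) close to the standard normal.
    l_kl = (-0.5 * (1 + z_logvar - z_mu.square() - z_logvar.exp())).sum(1).mean()
    return l_rec + w_per * l_per + w_id * l_id + w_kl * l_kl
```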
The above training process for the variational self-encoder may refer to the exemplary flow of fig. 4. In detail, in this embodiment, the sample image P0 may be input into the variational self-encoder to be trained for mean and variance prediction, obtaining the mean z_μ and variance z_σ of the hidden vector of each feature dimension of the sample image P0; a corresponding hidden vector S is then sampled based on the mean z_μ and variance z_σ and input into an existing (pre-trained) stylized image generator to generate a stylized image P1. Finally, the loss function value of the variational self-encoder (characterizing the consistency of the reconstructed/generated image P1 with P0) is calculated from the stylized image P1 and the sample image P0, and the model parameters (or weights) of the variational self-encoder are iteratively updated or optimized according to the loss function value to complete the training. The stylized image generator used in the training process may be a generator for generating any style of image, for example one whose generated image keeps the style of the input image, which this embodiment does not limit.
Next, the stylized image generator used in step S300 may be obtained through a training process as shown in fig. 5, which may include steps S510 to S530, exemplarily described below.
Step S510, a stylized image dataset is obtained, where the stylized image dataset includes a plurality of stylized sample images.
For example, in one possible implementation, a plurality (e.g., 110) of images of a set style (e.g., the avatar style of an anime character) may be crawled from the network as the stylized image dataset.
Step S520, sequentially inputting the stylized sample images into a stylized image generator to be trained to obtain a generated image corresponding to each stylized sample image, and calculating a loss function value of the stylized image generator.
In this embodiment, the loss function value may be constructed according to the similarity between the generated image corresponding to the stylized sample image and the stylized sample image itself; for example, the corresponding loss function value may be obtained from the Euclidean distance, cosine similarity, Pearson similarity and the like of the two. The loss function value may be used to characterize whether the image reconstructed by the stylized image generator is consistent, or nearly consistent, with the original image.
Step S530, optimizing the network parameters of the stylized image generator according to the loss function value until the calculated loss function value meets the training convergence condition, so as to obtain the trained stylized image generator.
The training convergence condition may be that the loss function value is smaller than a set threshold, or that the number of iterative training reaches a set training number.
The stylized image generator comprises convolution layers corresponding to different image resolutions respectively, and when network parameters of the stylized image generator are optimized, the network parameters of the convolution layers corresponding to the image resolutions smaller than the set image resolution are kept unchanged.
For example, as shown in fig. 6, in this embodiment the stylized image generator to be trained (the pre-trained model in the figure) may include convolutional layers of different resolutions, for example convolutional layers with resolutions of 8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128, 256 × 256, 512 × 512, 1024 × 1024, and the like. Taking a stylized image generator for face avatars as an example, the low-resolution convolutional layers such as 8 × 8 and 16 × 16 control geometric structure (geometry) information such as face orientation and face shape; when a face is stylized and converted, an important requirement of the step from the real-person domain to the stylized domain is to keep the face orientation (pose) and face shape consistent with the input image. Therefore, in this embodiment, the network parameters of the 8 × 8 and 16 × 16 low-resolution convolutional layers can be fixed, and only the network parameters of the high-resolution convolutional layers, such as 32 × 32 up to 1024 × 1024, need to be optimized. In other embodiments, the specific convolutional layers whose network parameters are kept fixed during training can be determined according to the actual situation and are not limited to the above examples.
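In code, fixing the low-resolution layers amounts to disabling their gradients before fine-tuning. A minimal sketch follows; the per-resolution blocks attribute is an assumed layout, not StyleGANv2-ADA's actual module structure:

```python
def freeze_low_resolution_layers(generator, min_trainable_res=32):
    """Fix geometry-controlling low-resolution layers during fine-tuning (sketch).

    Assumes the generator stores its synthesis blocks in a dict keyed by output
    resolution (8, 16, ..., 1024), which is an illustrative layout.
    """
    for res, block in generator.blocks.items():
        trainable = int(res) >= min_trainable_res
        for p in block.parameters():
            p.requires_grad = trainable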
In a possible implementation of this embodiment, the calculation formula of the loss function value of the stylized image generator is as follows:
$$\min_{G} \max_{D}\; \mathbb{E}_{x \sim p_{d}}\left[\log D(x)\right] + \mathbb{E}_{\hat{x} \sim p_{g}}\left[\log\left(1 - D(\hat{x})\right)\right] - \frac{\gamma}{2}\, \mathbb{E}_{x \sim p_{d}}\left[\left\| \nabla_{x} D(x) \right\|^{2}\right]$$

wherein x ~ p_d represents the distribution of the stylized image dataset, x̂ ~ p_g represents the distribution of the dataset composed of the stylized images generated by the stylized image generator from each of the stylized sample images, D is a discriminator, and ∇_x D(x) represents the discriminator gradient calculation operator on a stylized sample image, whose squared norm forms a gradient penalty term weighted by γ.
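Read this way, the objective is the familiar adversarial loss with an R1-style gradient penalty on real stylized samples. A minimal sketch in code (the softplus-based non-saturating form and the γ value are assumptions, not confirmed by the patent):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake, gamma=10.0):
    """Non-saturating GAN loss for D with an R1 penalty on real samples (sketch)."""
    real = real.detach().requires_grad_(True)
    real_logits = D(real)
    fake_logits = D(fake.detach())
    loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    # R1 term: squared norm of the discriminator gradient on stylized samples.
    grad, = torch.autograd.grad(real_logits.sum(), real, create_graph=True)
    return loss + 0.5 * gamma * grad.square().sum(dim=(1, 2, 3)).mean()

def generator_loss(D, fake):
    """Generator side of the non-saturating objective."""
    return F.softplus(-D(fake)).mean()
```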
The training process of the stylized image generator can be an unsupervised, small-sample image stylization training method: with only a small number of samples (e.g., 110), generators of different styles, such as a martial-arts style, a middle-aged style and the like, can be developed rapidly and at low cost, providing a low-cost, high-efficiency standardized pipeline for new stylization effects in the various application scenarios of image stylization conversion (such as live streaming and short video).
The process of stylizing the image to be converted with the trained variational self-encoder and stylized image generator to generate a stylized image of the corresponding style can be seen in fig. 7.
In detail, in this embodiment, the image M0 to be converted may be input into the trained variational self-encoder for mean and variance prediction, obtaining the mean z_μ and variance z_σ of the hidden vector of each feature dimension of M0; a corresponding hidden vector S is then sampled based on the mean z_μ and variance z_σ. The hidden vector S (the first spatial hidden vector) is then corrected to obtain the corrected hidden vector S+ (the second spatial hidden vector). Finally, the corrected hidden vector S+ is input into the trained stylized image generator to generate the stylized image M1. As described above, during training of the variational self-encoder the hidden vector S of the sample image does not need to be corrected, which speeds up training and improves training efficiency. In the application stage after training, correcting the hidden vector S (the first spatial hidden vector) of the input image (the image to be converted) before generating the stylized image further improves the ID retention of the object to be converted (such as a face image) during stylized conversion, so that the generated stylized image retains the fine details of the input image.
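Tying the pieces together, the inference path of fig. 7 can be written in a few lines. This sketch reuses the sample_hidden_vector and correct_hidden_vector helpers sketched earlier and is simplified to a single feature dimension, with the encoder assumed to return a (mean, log-variance) pair:

```python
import torch

def stylize(x, encoder, generator, vgg_features):
    """Image to be converted M0 -> mean/variance -> s -> s+ -> stylized M1."""
    with torch.no_grad():
        z_mu, z_logvar = encoder(x)                # mean and variance prediction
        s0 = sample_hidden_vector(z_mu, z_logvar)  # first spatial hidden vector
    s_plus = correct_hidden_vector(s0, x, generator, vgg_features)  # corrected S+
    with torch.no_grad():
        return generator(s_plus)                   # generated stylized image M1
```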
Referring to fig. 8, fig. 8 is a schematic diagram of an image processing apparatus 100 for implementing the stylized image generation method according to the embodiment of the present application. In detail, the image processing apparatus 100 may include one or more processors 110, a machine-readable storage medium 120, and a stylized image generation apparatus 130. The processor 110 and the machine-readable storage medium 120 may be communicatively connected via a system bus. The machine-readable storage medium 120 stores machine-executable instructions, and the processor 110 implements the stylized image generation method described above by reading and executing the machine-executable instructions in the machine-readable storage medium 120. In this embodiment, the image processing apparatus 100 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, a server, a cloud service platform, and other computer apparatuses with image processing capability.
The machine-readable storage medium 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The machine-readable storage medium 120 is used for storing a program, and the processor 110 executes the program after receiving an execution instruction.
The processor 110 may be an integrated circuit chip having signal processing capabilities. The Processor may be, but is not limited to, a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), and the like.
Fig. 9 is a schematic diagram of the functional modules of the stylized image generating apparatus 130. In this embodiment, the stylized image generating apparatus 130 may include one or more software functional modules running on the image processing device 100; these software functional modules may be stored in the machine-readable storage medium 120 in the form of a computer program, so that when they are called and executed by the processor 110, the stylized image generation method described in this embodiment is implemented.
In detail, the stylized image generating apparatus 130 may include a residual calculating module 131, a hidden vector processing module 132, and an image generating module 133.
The residual calculation module 131 is configured to input an image to be converted into a variational self-encoder obtained through training in advance, and process the image to be converted through the variational self-encoder to obtain a mean value and a variance of implicit vectors of feature dimensions of the image to be converted.
In this embodiment, the residual calculation module 131 is configured to execute step S100 in the above method embodiment, and for more details of the residual calculation module 131, reference may be made to the above detailed description of step S100, which is not repeated herein.
The hidden vector processing module 132 is configured to obtain a target space hidden vector of the image to be converted in each feature dimension according to the mean and the variance of the hidden vectors of each feature dimension.
In this embodiment, the hidden vector processing module 132 may be configured to execute step S200 in the above method embodiment, and for more details of the hidden vector processing module 132, reference may be made to the detailed description of step S200, which is not repeated herein.
The image generating module 133 is configured to input the target spatial hidden vector into a trained stylized image generator to perform image stylized conversion, so as to generate a stylized image with a set image style.
In this embodiment, the image generation module 133 is configured to execute step S300 in the foregoing method embodiment, and for more details about the image generation module 133, reference may be made to the above detailed description of step S300, which is not repeated herein.
In summary, according to the method, the apparatus, and the image processing device for generating a stylized image provided in the embodiment of the present application, a variational self-encoder (for example, a variational self-encoder having a feature pyramid residual error network structure) is used to perform multi-scale feature extraction on an input image, a mean value and a variance of hidden vectors of each feature dimension of the input image are calculated (pre-estimated) according to a feature extraction result, a hidden vector of a target space (such as an S space) is obtained based on mean value and variance sampling, and finally, a stylized image generator generates a corresponding stylized image based on the hidden vector. Therefore, the features extracted by carrying out multi-scale feature extraction on the image can be more accurately expressed, and the image quality of the generated stylized image can be improved.
Further, in the embodiment, a concept of spatial hidden vector modification is introduced, for example, an S spatial hidden vector (a first spatial hidden vector) is modified to obtain an S + spatial hidden vector (a second spatial hidden vector) for generating a stylized image, so that the generated stylized image has a good ID (identity information) retention capability, the generated stylized image is closer to the real feature data distribution of the input image, and the image quality of the generated stylized image is further improved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A stylized image generation method applied to an image processing apparatus, the method comprising:
inputting an image to be converted into a pre-trained variational self-encoder, and processing the image to be converted through the variational self-encoder to obtain the mean and variance of the hidden vectors of each feature dimension of the image to be converted;
obtaining a target spatial hidden vector of the image to be converted in each feature dimension according to the mean and variance of the hidden vectors of each feature dimension;
and inputting the target spatial hidden vector into a trained stylized image generator to perform image stylized conversion, generating a stylized image with a set image style.
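As a reading aid only (not part of the claim), a minimal PyTorch sketch of the claimed pipeline might look as follows; `encoder` and `generator` stand for the pretrained variational self-encoder and stylized image generator, and a single-scale hidden vector is assumed for brevity.

```python
import torch

@torch.no_grad()
def stylize(x, encoder, generator):
    # Step 1: the encoder yields the per-dimension mean and (log-)variance.
    mu, logvar = encoder(x)
    # Step 2: sample the target-space hidden vector (reparameterization trick).
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    # Step 3: the generator converts the hidden vector into a stylized image.
    return generator(z)
```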
2. The stylized image generation method of claim 1, wherein obtaining the target spatial hidden vector of the image to be converted in each feature dimension according to the mean and variance of the hidden vectors of each feature dimension comprises:
sampling from the mean and variance of the hidden vectors of each feature dimension through the variational self-encoder to obtain a first spatial hidden vector of the image to be converted in each feature dimension;
and correcting the first spatial hidden vector of the image to be converted in each feature dimension to obtain a second spatial hidden vector of the image to be converted in each feature dimension, the second spatial hidden vector serving as the target spatial hidden vector.
3. The stylized image generation method of claim 2, wherein the correction formula for correcting the first spatial hidden vector of each feature dimension of the image to be converted is:

$s^{+} = \arg\min_{s}\ \lVert \hat{G}(s) - x \rVert_2 + w_{vgg}\,\lVert F(\hat{G}(s)) - F(x) \rVert_2$, subject to $\lVert s^{+} - s_0 \rVert$ remaining within a set range,

wherein s+ is the second spatial hidden vector, s is the first spatial hidden vector, x represents the input image, εθ represents the variational self-encoder from which the initial value s0 of the first spatial hidden vector is sampled, and Ĝ represents the stylized image generator; the term ‖Ĝ(s) − x‖2 keeps the stylized image generated by the stylized image generator consistent with the image to be converted at the pixel level; the term ‖F(Ĝ(s)) − F(x)‖2 represents the semantic feature loss of the image to be converted, where F represents a VGG network used to calculate the perceptual similarity of the image generated from the first spatial hidden vector, and wvgg is a preset weight parameter; the requirement that the difference between the second spatial hidden vector and the initial value s0 of the first spatial hidden vector stay within a set range is met by applying small-amplitude fine adjustments starting from s0, iterated a set number of times.
4. The stylized image generation method of claim 1, wherein the variational self-encoder comprises a plurality of sequentially cascaded coding layers, a plurality of sequentially cascaded decoding layers, and fully connected layers respectively connected to the decoding layers, each coding layer being connected to a corresponding decoding layer and each decoding layer being connected to one fully connected layer;
the image to be converted is input to the first of the plurality of coding layers; each coding layer in turn performs down-scaling coding on its input to obtain a coding feature map, and outputs the obtained coding feature map to the next coding layer and to the decoding layer corresponding to that coding layer;
the input of the first decoding layer is the output of its corresponding coding layer, and the input of each remaining decoding layer is the output of the preceding decoding layer plus the output of its corresponding coding layer; each decoding layer decodes its input data to output feature maps of different dimensional sizes, and the feature maps are processed by the fully connected layers to obtain the mean and variance of the hidden vectors of the different dimensions corresponding to the image to be converted;
and among the plurality of decoding layers, the dimensionality of the feature map output by each decoding layer decreases progressively starting from the first decoding layer.
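For illustration only, a minimal PyTorch sketch of this cascaded encoder/decoder layout is given below; the channel widths, three-layer depth, global pooling step and hidden-vector size are illustrative assumptions, not the patent's actual configuration.

```python
import torch
import torch.nn as nn

class MultiScaleVAE(nn.Module):
    def __init__(self, z_dim=512):
        super().__init__()
        # Coding layers: each halves the spatial resolution (3 -> 32 -> 64 -> 128 channels).
        self.enc1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.enc2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        self.enc3 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        # Decoding layers: the first takes only the deepest encoder output; the
        # others take the previous decoder output plus the matching skip connection.
        self.dec1 = nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1)
        self.dec2 = nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1)
        self.dec3 = nn.ConvTranspose2d(32, 32, 3, stride=2, padding=1, output_padding=1)
        # One fully connected head per decoding layer, predicting the mean and
        # (log-)variance of the hidden vector at that feature scale.
        self.head1 = nn.Linear(64, 2 * z_dim)
        self.head2 = nn.Linear(32, 2 * z_dim)
        self.head3 = nn.Linear(32, 2 * z_dim)

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))        # 1/2 resolution
        e2 = torch.relu(self.enc2(e1))       # 1/4 resolution
        e3 = torch.relu(self.enc3(e2))       # 1/8 resolution
        d1 = torch.relu(self.dec1(e3))       # input: deepest coding layer output only
        d2 = torch.relu(self.dec2(d1 + e2))  # previous decoder output + matching skip
        d3 = torch.relu(self.dec3(d2 + e1))
        stats = []
        for d, head in ((d1, self.head1), (d2, self.head2), (d3, self.head3)):
            pooled = d.mean(dim=(2, 3))      # global average pooling before the FC head
            mu, logvar = head(pooled).chunk(2, dim=1)
            stats.append((mu, logvar))
        return stats                         # per-scale (mean, variance) pairs
```

In this sketch the channel count of the decoder outputs shrinks as the resolution grows, standing in for the claim's progressively reduced feature-map dimensionality.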
5. The stylized image generation method of claim 4, further comprising a step of training the variational self-encoder, comprising:
acquiring a first training data set, wherein the first training data set comprises a plurality of sample images;
sequentially inputting the sample images into the variational self-encoder to obtain target spatial hidden vectors corresponding to the sample images;
inputting the target space hidden vector corresponding to the sample image into the trained stylized image generator to obtain a stylized image corresponding to the sample image;
and calculating a loss function of the variational self-encoder according to the stylized image and the sample image, and adjusting the model parameters of the variational self-encoder according to the loss function until a training convergence condition is met.
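A minimal PyTorch sketch of this training procedure might look as follows, assuming a frozen pretrained `generator`, a data `loader` of sample images, a single-scale encoder output, and the composite `loss_fn` sketched after claim 6; all names are illustrative.

```python
import torch

def train_encoder(encoder, generator, loader, loss_fn, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    generator.eval()                          # generator weights stay fixed
    for _ in range(epochs):
        for x in loader:                      # x: batch of sample images
            mu, logvar = encoder(x)
            # Reparameterization: sample the target-space hidden vector.
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            y = generator(z)                  # stylized image for the sample
            loss = loss_fn(y, x)              # composite loss vs. the sample image
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder
```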
6. The stylized image generation method of claim 5, wherein the loss function comprises a pixel-level loss, a semantic similarity loss and an identity information loss, expressed as:

$L = L_{rec} + w_{per}L_{per} + w_{kl}L_{arc}$, wherein:

$L_{rec} = L_2(\hat{G}(\varepsilon_\theta(x)), x)$ represents the pixel-level loss, where x is the sample image, εθ represents the variational self-encoder, Ĝ represents the trained stylized image generator, and L2 represents the image distance between the sample image and the stylized image generated by the stylized image generator;

$L_{per} = L_{lpips}(\hat{G}(\varepsilon_\theta(x)), x)$ represents the semantic similarity loss, where Llpips is the semantic feature similarity between the sample image and the generated stylized image of the sample image, calculated after semantic features are extracted from the two images by a semantic extraction model;

$L_{id} = L_{arc}(\hat{G}(\varepsilon_\theta(x)), x)$ represents the identity information loss, where Larc is the similarity between the identity features obtained by performing identity recognition on the sample image and on the generated stylized image of the sample image;

and wper, wid and wkl are weight parameters set in advance for Lper, Lid and Larc, respectively.
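As an illustration, the composite loss could be assembled as below, assuming the open-source `lpips` package for the semantic similarity term and a hypothetical `id_embed` network (an ArcFace-style identity embedder) for the identity term; the weights shown are placeholders.

```python
import torch
import lpips

percep = lpips.LPIPS(net='vgg')                    # LPIPS perceptual similarity measure

def composite_loss(y, x, id_embed, w_per=0.8, w_id=0.1):
    l_rec = torch.nn.functional.mse_loss(y, x)     # pixel-level loss (L2 distance)
    l_per = percep(y, x).mean()                    # semantic similarity loss
    # Identity information loss: one minus the cosine similarity between the
    # identity embeddings of the generated image and the sample image.
    l_id = 1 - torch.nn.functional.cosine_similarity(id_embed(y), id_embed(x)).mean()
    return l_rec + w_per * l_per + w_id * l_id
```

Bound to a concrete identity embedder (for example with `functools.partial`), this function can serve as the `loss_fn` in the training sketch above.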
7. The stylized image generation method of any of claims 1-6, further comprising a step of pre-training the stylized image generator, comprising:
acquiring a stylized image dataset, wherein the stylized image dataset comprises a plurality of stylized sample images;
sequentially inputting the stylized sample images into a stylized image generator to be trained to obtain a generated image corresponding to each stylized sample image, and calculating a loss function value of the stylized image generator;
optimizing the network parameters of the stylized image generator according to the loss function value until the calculated loss function value meets a training convergence condition to obtain a trained stylized image generator;
the stylized image generator comprises convolution layers corresponding to different image resolutions respectively, and when network parameters of the stylized image generator are optimized, the network parameters of the convolution layers corresponding to the image resolutions smaller than the set image resolution are kept unchanged.
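A minimal sketch of this selective freezing is given below, assuming a generator whose synthesis blocks are indexed by the output resolution they produce (a common StyleGAN-style layout); the `blocks` attribute and the 32-pixel threshold are illustrative assumptions.

```python
import torch

def trainable_params(generator, min_resolution=32):
    """Freeze convolution blocks below the set resolution; return the rest."""
    params = []
    for res, block in generator.blocks.items():   # e.g. {4: block, 8: block, ...}
        trainable = res >= min_resolution
        for p in block.parameters():
            p.requires_grad_(trainable)           # low-resolution layers stay unchanged
            if trainable:
                params.append(p)
    return params

# opt = torch.optim.Adam(trainable_params(generator), lr=2e-3)
```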
8. The stylized image generation method of claim 7, wherein the loss function value of the stylized image generator is calculated as follows:

$L = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{\hat{x} \sim p_g}[\log(1 - D(\hat{x}))] + \lambda\,\mathbb{E}_{x \sim p_d}\big[\lVert \nabla_x D(x) \rVert^2\big]$

wherein x ∼ pd represents the distribution of the stylized image dataset, x̂ ∼ pg represents the distribution of the data set composed of the stylized images generated by the stylized image generator from the stylized sample images, D is a discriminator, ∇xD(x) represents the discriminator gradient calculation operator on a stylized sample image, and λ is a preset weight parameter for the gradient term.
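For illustration, the discriminator side of such a loss is often implemented as a logistic GAN loss with a gradient penalty on real stylized samples (the R1 regularizer); a minimal PyTorch sketch under that assumption follows, with the penalty weight as a placeholder.

```python
import torch
import torch.nn.functional as F

def d_loss(D, real, fake, gp_weight=10.0):
    # Logistic terms over the real (p_d) and generated (p_g) distributions.
    loss = F.softplus(D(fake.detach())).mean() + F.softplus(-D(real)).mean()
    # Gradient penalty: squared norm of the discriminator gradient at real samples.
    real = real.detach().requires_grad_(True)
    grad, = torch.autograd.grad(D(real).sum(), real, create_graph=True)
    return loss + gp_weight * grad.pow(2).sum(dim=(1, 2, 3)).mean()
```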
9. A stylized image generation apparatus applied to an image processing device, characterized by comprising:
the residual error calculation module is used for inputting an image to be converted into a pre-trained variational self-encoder, and processing the image to be converted through the variational self-encoder to obtain the mean and variance of the hidden vectors of each feature dimension of the image to be converted;
the hidden vector processing module is used for obtaining a target spatial hidden vector of the image to be converted in each feature dimension according to the mean and variance of the hidden vectors of each feature dimension;
and the image generation module is used for inputting the target spatial hidden vector into a trained stylized image generator to perform image stylized conversion, generating a stylized image with a set image style.
10. An image processing apparatus comprising a machine-readable storage medium and one or more processors, the machine-readable storage medium having stored thereon machine-executable instructions that, when executed by the one or more processors, implement the method of any one of claims 1-8.
CN202210207313.9A 2022-03-03 2022-03-03 Stylized image generation method and device and image processing equipment Pending CN114612289A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210207313.9A CN114612289A (en) 2022-03-03 2022-03-03 Stylized image generation method and device and image processing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210207313.9A CN114612289A (en) 2022-03-03 2022-03-03 Stylized image generation method and device and image processing equipment

Publications (1)

Publication Number Publication Date
CN114612289A true CN114612289A (en) 2022-06-10

Family

ID=81860107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210207313.9A Pending CN114612289A (en) 2022-03-03 2022-03-03 Stylized image generation method and device and image processing equipment

Country Status (1)

Country Link
CN (1) CN114612289A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024017093A1 (en) * 2022-07-18 2024-01-25 维沃移动通信有限公司 Image generation method, model training method, related apparatus, and electronic device
CN116310008A (en) * 2023-05-11 2023-06-23 深圳大学 Image processing method based on less sample learning and related equipment
CN116310008B (en) * 2023-05-11 2023-09-19 深圳大学 Image processing method based on less sample learning and related equipment
CN116862803A (en) * 2023-07-13 2023-10-10 北京中科闻歌科技股份有限公司 Reverse image reconstruction method, device, equipment and readable storage medium
CN116862803B (en) * 2023-07-13 2024-05-24 北京中科闻歌科技股份有限公司 Reverse image reconstruction method, device, equipment and readable storage medium
CN117274316A (en) * 2023-10-31 2023-12-22 广东省水利水电科学研究院 River surface flow velocity estimation method, device, equipment and storage medium
CN117274316B (en) * 2023-10-31 2024-05-03 广东省水利水电科学研究院 River surface flow velocity estimation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US11501415B2 (en) Method and system for high-resolution image inpainting
JP7373554B2 (en) Cross-domain image transformation
CN114612289A (en) Stylized image generation method and device and image processing equipment
CN111260586B (en) Correction method and device for distorted document image
Chen et al. Nas-dip: Learning deep image prior with neural architecture search
CN110322400B (en) Image processing method and device, image processing system and training method thereof
CN111079532A (en) Video content description method based on text self-encoder
CN111476719A (en) Image processing method, image processing device, computer equipment and storage medium
US8605118B2 (en) Matrix generation apparatus, method, and program, and information processing apparatus
WO2014144306A1 (en) Ensemble sparse models for image analysis and restoration
CN112308866B (en) Image processing method, device, electronic equipment and storage medium
Li et al. Example-based image super-resolution with class-specific predictors
WO2023035531A1 (en) Super-resolution reconstruction method for text image and related device thereof
Zhang et al. Image denoising based on sparse representation and gradient histogram
CN111698508B (en) Super-resolution-based image compression method, device and storage medium
Rajput Mixed gaussian-impulse noise robust face hallucination via noise suppressed low-and-high resolution space-based neighbor representation
Pérez-Pellitero et al. Antipodally invariant metrics for fast regression-based super-resolution
US20220270209A1 (en) Removing compression artifacts from digital images and videos utilizing generative machine-learning models
Luo et al. Piecewise linear regression-based single image super-resolution via Hadamard transform
CN117593187A (en) Remote sensing image super-resolution reconstruction method based on meta-learning and transducer
CN116503686B (en) Training method of image correction model, image correction method, device and medium
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium
CN116934591A (en) Image stitching method, device and equipment for multi-scale feature extraction and storage medium
US20230073175A1 (en) Method and system for processing image based on weighted multiple kernels
CN115187834A (en) Bill identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination