CN114038055A - Image generation method based on contrastive learning and generative adversarial networks - Google Patents

Image generation method based on contrastive learning and generative adversarial networks

Info

Publication number
CN114038055A
CN114038055A
Authority
CN
China
Prior art keywords
image
network
discriminator
generator
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111254371.9A
Other languages
Chinese (zh)
Inventor
张亮
王博文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
University of Electronic Science and Technology of China
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China and Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111254371.9A
Publication of CN114038055A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image generation method based on contrastive learning and generative adversarial networks, and belongs to the field of computer vision. The method first selects a generative adversarial network as the basic framework. When training it, positive and negative samples are constructed for the query objects of the discriminator and of the generator respectively. The aim is to use the discriminator itself to map images into a representation space, so that the discriminator learns a reasonable representation of the images under self-supervision while avoiding the introduction of additional model parameters, and so that the generator, under self-supervision, maps similar input random vectors to similar images and dissimilar random vectors to different images, thereby improving the diversity of the generated images. After the generative adversarial network is trained, an image can be generated by feeding noise into the generator. The method fully exploits the advantages of contrastive learning and generative adversarial networks and improves the diversity of the images produced by existing generation methods.

Description

Image generation method based on contrastive learning and generative adversarial networks
Technical Field
The invention belongs to the field of computer vision and mainly addresses the problem of improving the diversity of generated images; it is mainly applied to film and television entertainment, human-computer interaction, machine vision understanding, and the like.
Background
At present, the demand for image generation keeps growing in fields such as film and television entertainment and computer vision understanding. For example, in a role-playing game a player can control parameters to generate a character avatar according to personal preference, and in early education matching images can be generated from text to guide infants in recognizing a wide and varied world. Common image generation models include autoregressive models, variational autoencoders, Generative Adversarial Networks (GAN), and flow models. GANs have the advantages of a small computational load, high quality of the generated images, and a simple model structure, and are therefore widely applied to image generation. In recent years, model improvements for GANs have focused mainly on two directions: structural improvements and loss function improvements.
Many methods improve the structure of the GAN, mainly to strengthen its modeling capability. For example, SAGAN introduces a self-attention module into the model so that it can balance the modeling of long-range dependencies, efficiency, and computational cost, improving the generation capability of the model. Reference: Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019, May). Self-attention generative adversarial networks. In International Conference on Machine Learning (pp. 7354-7363). PMLR.
Improvements to the GAN loss function mainly address unstable training, vanishing gradients, mode collapse, and similar problems. Arjovsky et al. use the Wasserstein distance to construct the loss function and impose a Lipschitz continuity constraint on the discriminator, significantly improving the quality and richness of the images generated by the GAN. Such loss-based improvements are effective and easy to adopt, and the present method follows this line by combining the idea of contrastive learning to improve the GAN loss function. Reference: Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.
Contrastive learning is a common self-supervised method: it uses supervisory signals generated from the data itself to train the model in a supervised fashion, and has become an effective approach to deep representation learning. Aiming at the insufficient diversity of the images generated by conventional GANs, the invention combines contrastive learning to improve the GAN loss function and obtains excellent results.
Disclosure of Invention
The invention discloses an image generation method based on contrastive learning and generative adversarial networks, which solves the problem of insufficient diversity of generated images in the prior art.
The method first selects a generative adversarial network as the basic framework, normalizes and scales the training pictures to 32 × 32 × 3, and samples a normal distribution to obtain the input random vectors. It also draws on the core idea of contrastive learning and its InfoNCE loss function. When training the generative adversarial network, positive and negative samples are constructed for the query objects of the discriminator and of the generator respectively. The aim is to use the discriminator itself to map images into a representation space, so that the discriminator learns a reasonable representation of the images under self-supervision while avoiding the introduction of additional model parameters, and so that the generator maps similar input random vectors to similar images and dissimilar random vectors to different images, thereby improving the diversity of the generated images. After the generative adversarial network is trained, an image can be generated by feeding noise into the generator. The method fully exploits the advantages of contrastive learning and generative adversarial networks and improves the diversity of the images produced by existing generation methods. The general structure of the algorithm is shown schematically in Fig. 1.
For convenience in describing the present disclosure, certain terms are first defined.
Definition 1: normal distribution. Also called the Gaussian distribution, it is a probability distribution of great importance in mathematics, physics, engineering, and other fields, and has a significant influence on many aspects of statistics. A random variable x follows a normal distribution if its probability density function satisfies

f(x) = 1 / (√(2π) σ) · exp( -(x - μ)² / (2σ²) )

where μ is the mathematical expectation (mean) of the distribution and σ² is its variance; the distribution is commonly denoted N(μ, σ²).
Definition 2: generative adversarial network. A generative adversarial network comprises two different neural networks, a generator G and a discriminator D, which oppose each other during training: the discriminator tries to distinguish the real data distribution p_r from the generated data distribution p_g, while the generator tries to make the two distributions indistinguishable to the discriminator, so that finally the generated data distribution matches the real one: p_r = p_g.
Definition 3: contrastive learning. Contrastive learning learns representations of samples by making comparisons between inputs. Instead of learning a signal from a single data sample at a time, it learns by comparing different samples: a positive pair of "similar" inputs (the query object and its positive sample) is contrasted with negative pairs of "different" inputs (the query object and its negative samples).
Definition 4: batch normalization layer. A technique for training deep neural networks that normalizes each batch of data. It speeds up model convergence and, more importantly, alleviates the gradient dispersion problem in deep networks to some extent, making deep models easier and more stable to train.
Definition 5: ReLU activation layer. Also called the rectified linear unit, it is an activation function commonly used in artificial neural networks, usually a nonlinear function represented by a ramp function and its variants, expressed as f(x) = max(0, x).
Definition 6: Tanh activation layer. Defined as

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
Definition 7: global average pooling layer. Pooling layers are often used in convolutional networks to reduce model size, speed up computation, and improve the robustness of the extracted features. The global average pooling layer averages each input feature map and is generally used to replace the final fully connected layer of a neural network, achieving dimensionality reduction directly and greatly reducing the number of network parameters.
Definition 8: IS index. The IS index is mainly used to measure the quality and diversity of generated images. When an image is sufficiently clear, its class can be identified with confidence; the InceptionV3 model is used to predict the probability that a generated image x belongs to class y, i.e. the conditional probability p(y|x). The InceptionV3 model divides images into 1000 classes, so the higher the probability that image x belongs to a single class, the better its quality is considered to be. Similarly, to account for the richness of the images, the classes of the generated images should be uniformly distributed, which is captured by the marginal probability

p(y) = (1/N) · Σ_{i=1}^{N} p(y | x_i)

where N is the number of generated images, and the number of classes of y depends on the number of image classes N_class in the training data set; ideally p(y) = 1 / N_class. The IS index can then be computed as:

IS = exp( E_{x~p_g} [ D_KL( p(y|x) || p(y) ) ] ).
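For illustration, a minimal NumPy sketch of this computation; `probs` is assumed to be an N × 1000 array of InceptionV3 class probabilities p(y|x_i) for the generated images (producing it with the model is outside the sketch):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp(E_x[D_KL(p(y|x) || p(y))]) for rows of class probabilities."""
    p_y = probs.mean(axis=0)                                        # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                                 # higher is better
```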
Definition 9: FID index. Computing the FID index also uses the InceptionV3 model: the classification head is discarded and the 2048-dimensional activations of the final pooling layer are taken as features, hereinafter called n-dimensional features. For real images, the n-dimensional features follow a certain distribution; likewise, the n-dimensional features of the images generated by the GAN follow another distribution. The FID index is the Fréchet distance between the two distributions and can be computed as:

FID = ||μ_data - μ_g||² + Tr( Σ_data + Σ_g - 2 (Σ_data Σ_g)^(1/2) )

where Tr denotes the trace of a matrix (the sum of its diagonal elements), and μ_data, μ_g and Σ_data, Σ_g are the means and covariances of the n-dimensional features of the real images and of the generated images, respectively. A lower FID means the two distributions are closer, which means higher quality and better diversity of the generated pictures.
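For illustration, a minimal NumPy/SciPy sketch of this computation, assuming `feat_real` and `feat_gen` are arrays of n-dimensional InceptionV3 features (rows are samples):

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_gen):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):                          # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```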
Definition 10: InceptionV3 model. The InceptionV3 model is a deep neural network used to extract image features; the features taken after its final pooling layer are fed to a classification layer that outputs the class of the image.
Therefore, the technical scheme of the invention is an image generation method based on contrastive learning and generative adversarial networks, comprising the following steps:
Step 1: preprocess the data set;
acquire real images, classify and label them according to the objects shown in the images, and normalize the pixel values of all pictures;
Step 2: construct the generator network and the discriminator network of the generative adversarial network;
1) the input of the generator network is a random vector and its output is a picture; the first layer of the generator network is a linear layer, followed by three upsampling residual network blocks, then a batch normalization layer, a ReLU activation layer and a convolution layer in turn, and finally a Tanh activation layer;
2) the input of the discriminator network is a picture and its outputs are a scalar and a vector; the discriminator network is divided into three modules: a feature extraction module D1, an adversarial loss module D2, and a representation mapping module H. The input of D1 is a picture and its output is the feature vector of the picture; the first layer of D1 is an optimized down-sampling residual network block, followed by three standard down-sampling residual network blocks, then a ReLU activation layer and a global average pooling layer in turn. The input of D2 is the output of D1 and its output is a scalar: real or fake; D2 consists of a linear layer. The input of H is the output of D1 and its output is a representation vector; H also consists of a linear layer. The overall network structure is shown in Fig. 2 and the residual network block structure in Fig. 5; a code sketch of these networks is given below;
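For illustration, a minimal PyTorch sketch of this layout. Only the module order comes from the text; the residual-block internals (depicted only in Fig. 5), the channel widths, kernel sizes, shortcut convolutions, and the latent dimension of 128 are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Upsampling residual block (internals assumed)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.bn1, self.bn2 = nn.BatchNorm2d(cin), nn.BatchNorm2d(cout)
        self.c1 = nn.Conv2d(cin, cout, 3, padding=1)
        self.c2 = nn.Conv2d(cout, cout, 3, padding=1)
        self.sc = nn.Conv2d(cin, cout, 1)

    def forward(self, x):
        h = F.interpolate(F.relu(self.bn1(x)), scale_factor=2)
        h = self.c2(F.relu(self.bn2(self.c1(h))))
        return h + self.sc(F.interpolate(x, scale_factor=2))

class DownBlock(nn.Module):
    """Down-sampling residual block; `first=True` approximates the 'optimized' variant."""
    def __init__(self, cin, cout, first=False):
        super().__init__()
        self.c1 = nn.Conv2d(cin, cout, 3, padding=1)
        self.c2 = nn.Conv2d(cout, cout, 3, padding=1)
        self.sc = nn.Conv2d(cin, cout, 1)
        self.first = first

    def forward(self, x):
        h = self.c1(x) if self.first else self.c1(F.relu(x))
        h = F.avg_pool2d(self.c2(F.relu(h)), 2)
        return h + self.sc(F.avg_pool2d(x, 2))

class Generator(nn.Module):
    """Linear -> 3 upsampling residual blocks -> BN -> ReLU -> conv -> Tanh (step 2.1)."""
    def __init__(self, z_dim=128, ch=256):
        super().__init__()
        self.fc = nn.Linear(z_dim, 4 * 4 * ch)
        self.blocks = nn.Sequential(UpBlock(ch, ch), UpBlock(ch, ch), UpBlock(ch, ch))
        self.out = nn.Sequential(nn.BatchNorm2d(ch), nn.ReLU(),
                                 nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, z):
        h = self.fc(z).view(z.size(0), -1, 4, 4)
        return self.out(self.blocks(h))          # (B, 3, 32, 32)

class Discriminator(nn.Module):
    """D1 (feature extractor) + D2 (adversarial head) + H (representation head) (step 2.2)."""
    def __init__(self, ch=128, rep_dim=128):
        super().__init__()
        self.d1 = nn.Sequential(DownBlock(3, ch, first=True), DownBlock(ch, ch),
                                DownBlock(ch, ch), DownBlock(ch, ch))
        self.d2 = nn.Linear(ch, 1)
        self.h = nn.Linear(ch, rep_dim)

    def forward(self, x):
        f = F.relu(self.d1(x)).mean(dim=[2, 3])  # ReLU + global average pooling
        return self.d2(f), self.h(f), f          # scalar, representation, feature
```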
Step 3: construct positive and negative samples for the query objects (a code sketch of both constructions follows this list);
1) when constructing positive and negative samples for the discriminator, apply a random image transformation (image rotation; image flipping; adjustment of image saturation, contrast and brightness) to the real query image x to obtain its positive sample x+, and randomly draw other real images as its negative samples x_i^-, i = 1, …, N, where N is the number of negative samples; the construction is illustrated in Fig. 3.
2) when constructing positive and negative samples for the generator, define a hypersphere of radius R centered on the query variable z; a positive sample is obtained by random sampling inside the hypersphere, z+ : ||z+ - z||_2 ≤ R, and negative samples are obtained by random sampling outside the hypersphere, z_i^- : ||z_i^- - z||_2 > R, i = 1, …, N, where N is the number of negative samples; the construction is illustrated in Fig. 4.
Step 4: design the loss functions;
1) for the loss function of the discriminator network, let x_g ~ p_g be an image randomly produced by the generator, and let x_r be a query image randomly drawn from the real images. For the query image x_r, construct a positive sample x+ and negative samples x_i^-, i = 1, …, N, according to method 1) of step 3. Use the feature extraction module D1 of the discriminator to extract features, obtaining the query image feature f, the positive sample feature f+, and the negative sample features f_i^-, i = 1, …, N. The query feature f is sent to the adversarial loss module D2 of the discriminator for adversarial training, and all image features are mapped into the representation space through the representation mapping module H, giving h = H(f), h+ = H(f+), h_i^- = H(f_i^-), i = 1, …, N. The loss function of the discriminator L_D is:

L_D = L_adv^D + α_d · L_con^D

L_adv^D = E_{x_g~p_g}[D(x_g)] - E_{x_r~p_r}[D(x_r)] + λ · E_{x̂~p_x̂}[ ( ||∇_x̂ D(x̂)||_2 - 1 )² ]

L_con^D = -E[ log( exp(h·h+ / τ) / ( exp(h·h+ / τ) + Σ_{i=1}^{N} exp(h·h_i^- / τ) ) ) ]

where L_adv^D is the adversarial loss function of the discriminator of the generative adversarial network, α_d is the weight of the contrastive term, and L_con^D is the contrastive loss function; D(x_g) is the output value of the discriminator for a generated image (the larger the output value, the more real the discriminator judges the input to be), and E[·] denotes the expectation of the output value; x̂ = ε·x_r + (1 - ε)·x_g, i.e. the distribution p_x̂ is a linear mixture of the data set distribution p_r and the generated image distribution p_g, with ε the linear mixing coefficient; ∇_x̂ D(x̂) is the gradient of the discriminator function with respect to the mixed image; the last term is a gradient penalty that constrains the discriminator parameters to satisfy the Lipschitz continuity condition, λ is the gradient penalty coefficient, and τ is the temperature coefficient;
2) for the loss function of the generator network, let z be a random vector drawn from the standard multivariate normal distribution N(0, I), with I the identity matrix, and construct for it a positive sample z+ and negative samples z_i^-, i = 1, …, N, according to method 2) of step 3. Input these vectors into the generator to obtain the corresponding generated images G(z), G(z+), G(z_i^-). The generated images are then input to the feature extraction module D1 of the discriminator to obtain the corresponding generated image features f_g = D1(G(z)), f_g^+ = D1(G(z+)), f_{g,i}^- = D1(G(z_i^-)); the feature f_g is sent to the adversarial loss module D2 of the discriminator for adversarial training, and all generated image features are mapped into the representation space through the representation mapping module H, giving h_g = H(f_g), h_g^+ = H(f_g^+), h_{g,i}^- = H(f_{g,i}^-). The loss function of the generator L_G is:

L_G = L_adv^G + α_g · L_con^G

L_adv^G = -E_{z~N(0,I)}[D(G(z))]

L_con^G = -E[ log( exp(h_g·h_g^+ / τ) / ( exp(h_g·h_g^+ / τ) + Σ_{i=1}^{N} exp(h_g·h_{g,i}^- / τ) ) ) ]

where L_adv^G is the adversarial loss function of the generator of the generative adversarial network, α_g is the weight of the contrastive term, and L_con^G is the contrastive loss function introduced by the invention; G(z) is the image generated from the random variable z, D(G(z)) is the output value of the discriminator for the generated image, and E_{z~N(0,I)}[D(G(z))] is the mathematical expectation of that output value when the input random variable z of the generator follows the standard multivariate normal distribution;
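For illustration, a PyTorch sketch of the two non-standard pieces of these losses: the InfoNCE contrastive term shared by L_con^D and L_con^G, and the gradient penalty term of L_adv^D. It assumes the Discriminator sketched under step 2, which returns (score, representation, feature); whether the representations are L2-normalized before the dot products is not stated in the text and is omitted here:

```python
import torch
import torch.nn.functional as F

def info_nce(h, h_pos, h_neg, tau=0.3):
    """-log( exp(h.h+/tau) / (exp(h.h+/tau) + sum_i exp(h.h_i^-/tau)) ), batch-averaged.
    h: (B, d), h_pos: (B, d), h_neg: (B, N, d)."""
    pos = (h * h_pos).sum(dim=1, keepdim=True) / tau      # (B, 1)
    neg = torch.einsum('bd,bnd->bn', h, h_neg) / tau      # (B, N)
    logits = torch.cat([pos, neg], dim=1)                 # positive sits at index 0
    target = torch.zeros(h.size(0), dtype=torch.long, device=h.device)
    return F.cross_entropy(logits, target)

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    """WGAN-GP term of L_adv^D on mixed images x_hat = eps*x_r + (1-eps)*x_g."""
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1.0 - eps) * x_fake).requires_grad_(True)
    d_hat, _, _ = D(x_hat)
    grad = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    return lam * ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
```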
Step 5: train the generative adversarial network constructed in step 2 using the loss functions constructed in step 4; when updating the generator network G, fix the parameters of the discriminator network D, and when updating the discriminator network D, fix the parameters of the generator network G; in each iteration the discriminator is updated 5 times, then the generator is updated once;
Step 6: the trained generator network G is used to generate images.
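Putting steps 4 and 5 together, a condensed training-loop sketch under the assumptions of the earlier code blocks. `data_iter` (yielding preprocessed CIFAR-10 batches), the use of the other images in the batch as the discriminator's negatives, the radius R = 1.0, and the optimizer settings beyond the stated learning rate are illustrative assumptions; the linear learning-rate decay of the implementation section is omitted:

```python
G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
alpha_d = alpha_g = 1.0                                   # contrastive loss weights

def batchmate_negatives(h):
    """Treat every other sample in the batch as a negative (an assumption): (B, B-1, d)."""
    B = h.size(0)
    idx = torch.tensor([[j for j in range(B) if j != i] for i in range(B)])
    return h[idx]

for it in range(100_000):
    for _ in range(5):                                    # 5 D updates per G update (step 5)
        x_r = next(data_iter)                             # real images scaled to [-1, 1]
        with torch.no_grad():
            x_g = G(torch.randn(x_r.size(0), 128))
        d_real, h, _ = D(x_r)
        d_fake, _, _ = D(x_g)
        _, h_pos, _ = D(augment(x_r))                     # positives: transformed queries
        loss_d = (d_fake.mean() - d_real.mean()
                  + gradient_penalty(D, x_r, x_g)
                  + alpha_d * info_nce(h, h_pos, batchmate_negatives(h)))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    z = torch.randn(64, 128)
    z_pos, z_neg = make_latent_pairs(z, R=1.0, n_neg=64)  # R is a hypothetical value
    d_fake, h_g, _ = D(G(z))
    _, h_gp, _ = D(G(z_pos))
    _, h_gn, _ = D(G(z_neg.flatten(0, 1)))
    loss_g = -d_fake.mean() + alpha_g * info_nce(h_g, h_gp, h_gn.view(64, 64, -1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```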
The innovations here are:
1) the image is projected into the representation space directly by the discriminator, without introducing an additional mapping model; this also pushes the discriminator to distinguish images by learning their features rather than by memorizing them directly.
2) constructing positive and negative samples in the input vector space makes the generator map vectors that are far apart to different images, improving the diversity of the generated images, while the image features generated from very close input vectors become more similar, which also improves the robustness of the generator.
3) the idea of contrastive learning is used to improve the loss function of the generative adversarial network, achieving excellent experimental results: an IS score of 8.046 and an FID score of 15.60 on the CIFAR-10 dataset.
Drawings
FIG. 1 is a main flow chart of the method of the present invention.
Fig. 2 is a diagram of the main network structure of the method of the present invention.
FIG. 3 is a schematic diagram of constructing positive and negative samples for a query image according to the present invention.
FIG. 4 is a diagram illustrating the construction of positive and negative samples for a query vector according to the present invention.
Fig. 5 is a schematic diagram of an upsampled residual network block, a standard downsampled residual network block, and an optimized downsampled residual network block of the present invention.
Detailed Description
Step 1: preprocess the data set;
The CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz) is obtained; it contains 60000 RGB natural images of size 32 × 32 in 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. 50000 images are used as the training set, the training order is randomly shuffled, and finally the picture pixel values are normalized to the range [-1, 1].
Step 2: construct the generator network and the discriminator network of the generative adversarial network;
1) the input of the generator network is a random vector and its output is a picture; the first layer of the generator network is a linear layer, followed by three upsampling residual network blocks, then a batch normalization layer, a ReLU activation layer and a convolution layer in turn, and finally a Tanh activation layer;
2) the input of the discriminator network is a picture and its outputs are a scalar and a vector; the discriminator network is divided into three modules: a feature extraction module D1, an adversarial loss module D2, and a representation mapping module H. The input of the feature extraction module is a picture and its output is the feature vector of the picture; the first layer of D1 is an optimized down-sampling residual network block, followed by three standard down-sampling residual network blocks, then a ReLU activation layer and a global average pooling layer. The input of the adversarial loss module D2 is a feature vector and its output is a scalar; D2 consists of a linear layer. The input of the representation mapping module H is a feature vector and its output is a representation vector; H consists of a linear layer. The overall network structure is shown in Fig. 2 and the residual network block structure in Fig. 5.
Step 3: construct positive and negative samples for the query objects;
1) when constructing positive and negative samples for the discriminator part, a random image transformation (image rotation; image flipping; adjustment of image saturation, contrast and brightness) is applied to the queried real image x to obtain its positive sample x+, and other real images are randomly drawn from the training data set as the negative samples of the query image, x_i^-, i = 1, …, N, where N is the number of negative samples; the construction is illustrated in Fig. 3.
2) when constructing positive and negative samples for the generator part, a hypersphere of radius R centered on the query variable z ~ N(0, I) is defined; a positive sample is obtained by random sampling inside the hypersphere, z+ : ||z+ - z||_2 ≤ R, and negative samples are obtained by random sampling outside the hypersphere, z_i^- : ||z_i^- - z||_2 > R, i = 1, …, N, where N is the number of negative samples; the construction is illustrated in Fig. 4.
Step 4: design the loss functions;
1) for the loss function designed for the discriminator, let x_r ~ p_r be a real image randomly drawn from the training data set distribution p_r, and let x_g ~ p_g be an image randomly produced by the generator. For the query real image x_r, construct positive and negative samples x+ and x_i^-, i = 1, …, N, according to method 1) of step 3. Use the feature extraction module D1 of the discriminator to extract image features, obtaining the query image feature f, the positive sample feature f+, and the negative sample features f_i^-, i = 1, …, N. The query feature f is sent to the adversarial loss module D2 of the discriminator for adversarial training, and all image features are mapped into the representation space through the representation mapping module H, giving h = H(f), h+ = H(f+), h_i^- = H(f_i^-), i = 1, …, N. The loss function of the discriminator can be described as:

L_D = L_adv^D + α_d · L_con^D

L_adv^D = E_{x_g~p_g}[D(x_g)] - E_{x_r~p_r}[D(x_r)] + λ · E_{x̂~p_x̂}[ ( ||∇_x̂ D(x̂)||_2 - 1 )² ]

L_con^D = -E[ log( exp(h·h+ / τ) / ( exp(h·h+ / τ) + Σ_{i=1}^{N} exp(h·h_i^- / τ) ) ) ]

where L_adv^D is the adversarial loss function of the discriminator of the generative adversarial network, x̂ = ε·x_r + (1 - ε)·x_g, i.e. the distribution p_x̂ is a linear mixture of the data set distribution p_r and the generated image distribution p_g, and λ is the gradient penalty coefficient, generally set to 10. L_con^D is the contrastive loss function introduced by the invention, where α_d is the weight of this term and τ is the temperature coefficient.
2) for the loss function designed for the generator, let z be a random vector drawn from the standard multivariate normal distribution N(0, I), and construct for it positive and negative samples z+ and z_i^-, i = 1, …, N, according to method 2) of step 3. Input these vectors into the generator to obtain the corresponding generated images G(z), G(z+), G(z_i^-). The generated images are then input to the feature extraction module D1 of the discriminator to obtain the corresponding generated image features f_g = D1(G(z)), f_g^+ = D1(G(z+)), f_{g,i}^- = D1(G(z_i^-)); the feature of the image G(z) is sent to the adversarial loss module D2 of the discriminator for adversarial training, and all generated image features are mapped into the representation space through the representation mapping module H, giving h_g = H(f_g), h_g^+ = H(f_g^+), h_{g,i}^- = H(f_{g,i}^-), i = 1, …, N. The loss function of the generator can be described as:

L_G = L_adv^G + α_g · L_con^G

L_adv^G = -E_{z~N(0,I)}[D(G(z))]

L_con^G = -E[ log( exp(h_g·h_g^+ / τ) / ( exp(h_g·h_g^+ / τ) + Σ_{i=1}^{N} exp(h_g·h_{g,i}^- / τ) ) ) ]

where L_adv^G is the adversarial loss function of the generator of the generative adversarial network, and L_con^G is the contrastive loss function introduced by the invention, with α_g the weight of this term and τ the temperature coefficient.
Step 5: train the generative adversarial network constructed in step 2 using the loss functions constructed in step 4; when updating G, fix the parameters of D, and when updating D, fix the parameters of G; in each iteration the discriminator is updated 5 times, then the generator is updated once.
Step 6: in the testing stage, after the model has been trained in step 5, only the generator network of the generative adversarial network is used; 50000 random vectors are sampled from the standard multivariate normal distribution N(0, I) and input to the generator network to obtain 50000 generated images, and the IS index and FID index of the generated images are computed to evaluate the generation capability of the generator network.
The experimental settings are as follows:
Picture size: 32 × 32 × 3
Learning rate: 0.0002, decayed linearly with the number of iterations
Training batch size: 64
Number of iterations: 100000
Contrastive loss weight of the discriminator α_d: 1
Contrastive loss weight of the generator α_g: 1
Temperature coefficient τ in the contrastive loss: 0.3
Number of negative samples N: 64.
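A short sketch of the step 6 evaluation under the assumptions of the earlier code blocks (the InceptionV3 forward pass that produces `probs` and the pooled features is outside the sketch):

```python
with torch.no_grad():
    z = torch.randn(50_000, 128)
    imgs = torch.cat([G(z[i:i + 100]) for i in range(0, 50_000, 100)])  # generate in batches
# run `imgs` (and the real images) through InceptionV3 to obtain `probs`,
# `feat_real`, `feat_gen`, then score with the earlier helpers:
# inception_score(probs); fid(feat_real, feat_gen)
```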

Claims (1)

1. An image generation method based on contrastive learning and generative adversarial networks, the method comprising:
step 1: preprocessing the data set;
acquiring real images, classifying and labeling the real images according to objects displayed by the images, and normalizing pixel values of all pictures;
step 2: constructing the generator network and the discriminator network of the generative adversarial network;
1) the input of the generator network is a random vector and its output is a picture; the first layer of the generator network is a linear layer, followed by three upsampling residual network blocks, then a batch normalization layer, a ReLU activation layer and a convolution layer in turn, and finally a Tanh activation layer;
2) the input of the discriminator network is a picture and its outputs are a scalar and a vector; the discriminator network is divided into three modules: a feature extraction module D1, an adversarial loss module D2, and a representation mapping module H; the input of D1 is a picture and its output is the feature vector of the picture; the first layer of D1 is an optimized down-sampling residual network block, followed by three standard down-sampling residual network blocks, then a ReLU activation layer and a global average pooling layer in turn; the input of D2 is the output of D1 and its output is a scalar: real or fake; D2 consists of a linear layer; the input of H is the output of D1 and its output is a representation vector; H consists of a linear layer;
step 3: constructing positive and negative samples for the query objects;
1) when constructing positive and negative samples for the discriminator, applying a random image transformation (image rotation; image flipping; adjustment of image saturation, contrast and brightness) to the real query image x to obtain its positive sample x+, and randomly drawing other real images as its negative samples x_i^-, i = 1, …, N, where N is the number of negative samples;
2) when constructing positive and negative samples for the generator, defining a hypersphere of radius R centered on the query variable z; a positive sample is obtained by random sampling inside the hypersphere, z+ : ||z+ - z||_2 ≤ R, and negative samples are obtained by random sampling outside the hypersphere, z_i^- : ||z_i^- - z||_2 > R, i = 1, …, N, where N is the number of negative samples;
step 4: designing the loss functions;
1) designing a loss function for the discriminator network: let x_g ~ p_g be an image randomly produced by the generator, and let x_r be a query image randomly drawn from the real images; for the query image x_r, construct a positive sample x+ and negative samples x_i^-, i = 1, …, N, according to method 1) of step 3; use the feature extraction module D1 of the discriminator to extract features, obtaining the query image feature f, the positive sample feature f+, and the negative sample features f_i^-, i = 1, …, N; the query feature f is sent to the adversarial loss module D2 of the discriminator for adversarial training, and all image features are mapped into the representation space through the representation mapping module H, giving h = H(f), h+ = H(f+), h_i^- = H(f_i^-), i = 1, …, N; the loss function of the discriminator L_D is:

L_D = L_adv^D + α_d · L_con^D

L_adv^D = E_{x_g~p_g}[D(x_g)] - E_{x_r~p_r}[D(x_r)] + λ · E_{x̂~p_x̂}[ ( ||∇_x̂ D(x̂)||_2 - 1 )² ]

L_con^D = -E[ log( exp(h·h+ / τ) / ( exp(h·h+ / τ) + Σ_{i=1}^{N} exp(h·h_i^- / τ) ) ) ]

wherein L_adv^D is the adversarial loss function of the discriminator of the generative adversarial network, α_d is the weight of the contrastive term, and L_con^D is the contrastive loss function; D(x_g) is the output value of the discriminator for a generated image (the larger the output value, the more real the discriminator judges the input to be), and E[·] denotes the expectation of the output value; x̂ = ε·x_r + (1 - ε)·x_g, i.e. the distribution p_x̂ is a linear mixture of the data set distribution p_r and the generated image distribution p_g, with ε the linear mixing coefficient; ∇_x̂ D(x̂) is the gradient of the discriminator function with respect to the mixed image; the last term is a gradient penalty constraining the discriminator parameters to satisfy the Lipschitz continuity condition, λ is the gradient penalty coefficient, and τ is the temperature coefficient;
2) designing a loss function for the generator network: let z be a random vector drawn from the standard multivariate normal distribution N(0, I), with I the identity matrix, and construct for it a positive sample z+ and negative samples z_i^-, i = 1, …, N, according to method 2) of step 3; input these vectors into the generator to obtain the corresponding generated images G(z), G(z+), G(z_i^-); the generated images are then input to the feature extraction module D1 of the discriminator to obtain the corresponding generated image features f_g = D1(G(z)), f_g^+ = D1(G(z+)), f_{g,i}^- = D1(G(z_i^-)); the feature f_g is sent to the adversarial loss module D2 of the discriminator for adversarial training, and all generated image features are mapped into the representation space through the representation mapping module H, giving h_g = H(f_g), h_g^+ = H(f_g^+), h_{g,i}^- = H(f_{g,i}^-); the loss function of the generator L_G is:

L_G = L_adv^G + α_g · L_con^G

L_adv^G = -E_{z~N(0,I)}[D(G(z))]

L_con^G = -E[ log( exp(h_g·h_g^+ / τ) / ( exp(h_g·h_g^+ / τ) + Σ_{i=1}^{N} exp(h_g·h_{g,i}^- / τ) ) ) ]

wherein L_adv^G is the adversarial loss function of the generator of the generative adversarial network, α_g is the weight of the contrastive term, and L_con^G is the contrastive loss function introduced by the invention; G(z) is the image generated by the generator from the random variable z, D(G(z)) is the output value of the discriminator for the generated image, and E_{z~N(0,I)}[D(G(z))] is the mathematical expectation of that output value when the input random variable z of the generator follows the standard multivariate normal distribution;
step 5: training the generative adversarial network constructed in step 2 using the loss functions constructed in step 4, fixing the parameters of the discriminator network D when updating the generator network G and fixing the parameters of the generator network G when updating the discriminator network D, with the discriminator updated 5 times and then the generator updated once in each iteration;
step 6: the trained generator network G is used to generate images.
CN202111254371.9A 2021-10-27 2021-10-27 Image generation method based on contrastive learning and generative adversarial networks Pending CN114038055A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111254371.9A CN114038055A (en) 2021-10-27 2021-10-27 Image generation method based on contrastive learning and generative adversarial networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111254371.9A CN114038055A (en) 2021-10-27 2021-10-27 Image generation method based on contrastive learning and generative adversarial networks

Publications (1)

Publication Number Publication Date
CN114038055A true CN114038055A (en) 2022-02-11

Family

ID=80135466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111254371.9A Pending CN114038055A (en) 2021-10-27 2021-10-27 Image generation method based on contrastive learning and generative adversarial networks

Country Status (1)

Country Link
CN (1) CN114038055A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943322A (en) * 2022-04-11 2022-08-26 山东大学 Automatic generation method and system from layout to scene image based on deep learning
CN115063862A (en) * 2022-06-24 2022-09-16 电子科技大学 Age estimation method based on feature contrast loss
CN115063862B (en) * 2022-06-24 2024-04-23 电子科技大学 Age estimation method based on feature contrast loss
CN114863225A (en) * 2022-07-06 2022-08-05 腾讯科技(深圳)有限公司 Image processing model training method, image processing model generation device, image processing equipment and image processing medium
CN114863225B (en) * 2022-07-06 2022-10-04 腾讯科技(深圳)有限公司 Image processing model training method, image processing model generation device, image processing model equipment and image processing model medium
CN115860113A (en) * 2023-03-03 2023-03-28 深圳精智达技术股份有限公司 Training method and related device for self-antagonistic neural network model
CN116309913A (en) * 2023-03-16 2023-06-23 沈阳工业大学 Method for generating image based on ASG-GAN text description of generation countermeasure network
CN116309913B (en) * 2023-03-16 2024-01-26 沈阳工业大学 Method for generating image based on ASG-GAN text description of generation countermeasure network
CN116503502A (en) * 2023-04-28 2023-07-28 长春理工大学重庆研究院 Unpaired infrared image colorization method based on contrast learning

Similar Documents

Publication Publication Date Title
CN114038055A (en) Image generation method based on contrastive learning and generative adversarial networks
Batson et al. Noise2self: Blind denoising by self-supervision
Gu et al. Stack-captioning: Coarse-to-fine learning for image captioning
CN108399428B (en) Triple loss function design method based on trace ratio criterion
Besserve et al. Counterfactuals uncover the modular structure of deep generative models
CN106845529B (en) Image feature identification method based on multi-view convolution neural network
CN109711426B (en) Pathological image classification device and method based on GAN and transfer learning
CN114398961B (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
Wen et al. Image recovery via transform learning and low-rank modeling: The power of complementary regularizers
CN111429340A (en) Cyclic image translation method based on self-attention mechanism
CN112115967B (en) Image increment learning method based on data protection
CN114998602B (en) Domain adaptive learning method and system based on low confidence sample contrast loss
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN112464004A (en) Multi-view depth generation image clustering method
Chen et al. Automated design of neural network architectures with reinforcement learning for detection of global manipulations
CN114494489A (en) Self-supervision attribute controllable image generation method based on depth twin network
CN116704079B (en) Image generation method, device, equipment and storage medium
Gangloff et al. Deep parameterizations of pairwise and triplet Markov models for unsupervised classification of sequential data
CN116109656A (en) Interactive image segmentation method based on unsupervised learning
CN113450313B (en) Image significance visualization method based on regional contrast learning
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium
CN114936890A (en) Counter-fact fairness recommendation method based on inverse tendency weighting method
CN112446345A (en) Low-quality three-dimensional face recognition method, system, equipment and storage medium
EP4386657A1 (en) Image optimization method and apparatus, electronic device, medium, and program product
Saaim et al. Generative Models for Data Synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination