CN110866958A - Method for text to image - Google Patents

Method for text to image

Info

Publication number
CN110866958A
Authority
CN
China
Prior art keywords
network
text
image
training
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911033265.0A
Other languages
Chinese (zh)
Other versions
CN110866958B (en)
Inventor
袁春
吴航昊
贲有成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN201911033265.0A priority Critical patent/CN110866958B/en
Publication of CN110866958A publication Critical patent/CN110866958A/en
Application granted granted Critical
Publication of CN110866958B publication Critical patent/CN110866958B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a method for generating an image from text, comprising the following steps: S1: training an adversarial visual semantic embedding model, the adversarial visual semantic embedding model comprising: an image encoder network, an image decoder network, a generator network and a discriminator network paired with the generator network; S2: inputting text into the generator network, the generator network outputting a text feature embedding; S3: inputting the text feature embedding into the decoder network, the decoder network outputting an image that conforms to the semantic description of the text. By enhancing the visual feature embedding of existing text data through adversarial training, the method reduces the gap between the distributions of text-modality data and image-modality data in the semantic space.

Description

Method for text to image
Technical Field
The invention relates to the technical field of deep learning, in particular to a text-to-image method.
Background
Text-to-image generation has been a popular research topic in the field of computer vision in recent years. Deep generative models based on generative adversarial networks (GANs) are of particular importance among existing classes of methods, since in theory they are able to generate a variety of realistic images with relatively few model parameters, which means that they have the ability to capture the nature of natural images. GANs have attracted extensive attention as a class of generative models that are able to fit the distribution of natural images, and they are widely used in various image generation tasks such as image inpainting, super-resolution, image-to-image translation and future frame prediction.
In recent years, there have been many attempts to extract semantic embeddings of text, such as the classical word2vec. In the field of computer vision, the focus is on embedding the visual semantics of text, such as the colors, properties, textures and locations mentioned in a text description, into a semantic space. Most existing methods pre-train the deep neural network that extracts the text semantic embedding with a discriminative task; specifically, the discriminative task is to determine whether a picture and a text description are semantically matched.
With the development of deep generative models, especially the theoretical and practical advances in generative adversarial networks, the task of text-to-image generation has made staged progress. The existing mainstream methods generally adopt the framework of a conditional generative adversarial network, taking a text description as the condition under which an image conforming to that description is generated. However, in the cross-modal task of text-to-image generation, data of the two modalities are distributed unevenly in the semantic space: text-modality data is sparse while image-modality data is dense. The methods described above do not fully exploit the potential of text feature extraction when handling text-modality features, and therefore cannot bridge this gap.
Disclosure of Invention
The invention provides a text-to-image method to solve the above problems of text-to-image generation in the prior art.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A method of generating an image from text, comprising the following steps: S1: training an adversarial visual semantic embedding model, the adversarial visual semantic embedding model comprising: an image encoder network, an image decoder network, a generator network and a discriminator network paired with the generator network; S2: inputting text into the generator network, the generator network outputting a text feature embedding; S3: inputting the text feature embedding into the decoder network, the decoder network outputting an image that conforms to the semantic description of the text.
Preferably, training the adversarial visual semantic embedding model comprises the following steps: S11: constructing the adversarial visual semantic embedding model; S12: training the image encoder network and the image decoder network using a reconstruction loss function; S13: training the generator network and the decoder network using the reconstruction loss function; S14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
Preferably, the reconstruction loss function is:
L = ||x - Dec(Enc(x))|| + D_KL(Enc(x) || P(z))
wherein Enc represents the image encoder network, Dec represents the image decoder network, z represents the features extracted by the encoder, x represents the training image samples, D_KL represents the KL divergence, and P(z) is the prior distribution.
Preferably, perceptual loss is added to the reconstruction loss function.
Preferably, the Wasserstein distance is:
min_G max_{D ∈ 1-Lipschitz} E_{x~P_data}[D(Enc(x))] - E_{t, nz}[D(G(t, nz))]
wherein G is the generator network, D is the discriminator network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz represents first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
Preferably, the method further comprises the following step: S4: constructing a new discriminator network, the new discriminator network being used to distinguish the images output by the decoder network from the images in the training set.
Preferably, the encoder network uses the convolutional layers of ResNet101, with two fully-connected layers added after the last convolutional layer; the decoder network is a deconvolution network symmetric to ResNet101; the generator network is the channel convolution network of GAN-INT-CLS; and the discriminator network is a convolutional network.
Preferably, the discriminator network is constrained by means of spectral normalization.
Preferably, the generator network has the same output format as the encoder network.
The invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method as set forth in any of the above.
The beneficial effects of the invention are as follows: a method for generating an image from text is provided, in which text-to-image generation is realized by training an adversarial visual semantic embedding model and then using that model. By enhancing the visual feature embedding of existing text data through adversarial training, the method reduces the gap between the distributions of text-modality data and image-modality data in the semantic space.
The existing text-to-image generation pipeline is improved: the unstable conventional CGAN training framework is avoided and the more advanced WGAN training framework is adopted instead. This greatly enhances training stability, accelerates convergence and reduces the probability of training falling into mode collapse.
Quantitative measures of image generation quality, such as the Inception Score, are significantly improved.
Drawings
Fig. 1 is a schematic diagram of a method for generating an image from a text in an embodiment of the present invention.
FIG. 2 is a schematic diagram of a method for training the adversarial visual semantic embedding model according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a method for generating an image according to another embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the advantageous effects of the embodiments of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. The connection may be for fixation or for circuit connection.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships shown in the drawings for convenience in describing the embodiments of the present invention and to simplify the description; they are not intended to indicate or imply that the referenced device or element must have a particular orientation or be constructed in a particular orientation, and are not to be construed as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Example 1
In deep learning, the auto-encoder is a widely used method for generative modelling and for extracting features from data. There are many improvements and variants of the auto-encoder, such as the denoising auto-encoder and the sparse auto-encoder. Among them, the most widely used is the variational auto-encoder, which is based on variational inference.
An auto-encoder is composed of two deep neural networks: an encoder, responsible for compressing high-dimensional input samples into low-dimensional data features, and a decoder, responsible for restoring the low-dimensional features into high-dimensional data samples. To achieve this, the training goal of the auto-encoder is to minimize the reconstruction loss between its input and output, for which the L2 loss or the binary cross-entropy loss is generally adopted.
The variational auto-encoder further constrains the distribution of the features output by the encoder on the basis of the auto-encoder: this distribution is required to be as similar as possible to a known prior distribution (in practice, a multidimensional Gaussian or uniform distribution is generally adopted). The loss function of the variational auto-encoder therefore adds this constraint on the encoder output to the reconstruction loss. Specifically, the KL divergence between the distribution defined by the mean and variance output by the encoder and the known prior distribution is computed, and this measure is taken as the prior loss, which together with the reconstruction loss forms the training loss function of the variational auto-encoder.
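As an illustration of the variational auto-encoder training loss just described, the following is a minimal sketch assuming PyTorch, a standard Gaussian prior P(z), and a mean/log-variance parameterization of the encoder output; the function and variable names are illustrative rather than the concrete implementation of the invention.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss between input and output (L2 here; BCE is also common).
    rec_loss = F.mse_loss(x_recon, x, reduction="sum")
    # Prior loss: KL divergence between N(mu, sigma^2) and the prior N(0, I),
    # computed in closed form from the mean and log-variance.
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_loss + kl_loss
```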
Because the invention needs to extract visual semantic features for the generative task of text-to-image generation, the model adopts the structure and training mode of a variational auto-encoder to extract image features and to convert text descriptions into images.
GAN and text-to-image generation
As a promising branch of the generative model, GAN treats the training process as a zero-sum game between two competitors, the generator G and the discriminator D. In particular, the generator G is intended to generate realistic images, while the discriminator D tries to distinguish real images from false images generated by the generator G. Training the GAN is equivalent to optimizing the following objectives:
min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z}[log(1 - D(G(z)))]   (1)
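As an illustration of objective (1), the following minimal sketch alternates one discriminator update and one generator update; it assumes PyTorch, a discriminator with sigmoid output of shape (batch, 1), and the commonly used non-saturating generator loss. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real_images, nz=100):
    b = real_images.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: push D(x) towards 1 and D(G(z)) towards 0.
    z = torch.randn(b, nz)
    fake_images = G(z).detach()
    d_loss = F.binary_cross_entropy(D(real_images), ones) + \
             F.binary_cross_entropy(D(fake_images), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating form): push D(G(z)) towards 1.
    z = torch.randn(b, nz)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```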
the condition GAN (cgan) extends GAN by providing additional label information as input conditions to generator G and arbiter D.
The GAN-based text-to-image generation method GAN-INT-CLS further enhances the use of the conditional supervision signal on the basis of CGAN. During training, GAN-INT-CLS provides the discriminator with three types of sample pairs: a real image with text that matches the image content, a real image with text that does not match the image content, and a generated image with text that matches the image content. Of these three types of sample pairs, only the first is a positive sample; the remaining two are negative samples. In this case, the training target is:
min_G max_D V(D, G) = E[log D(x, t)] + (1/2)(E[log(1 - D(x, t'))] + E[log(1 - D(G(z, t), t))])   (2)
where t denotes a text description matching the image x and t' denotes a mismatched text description.
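The three sample pairs can be made concrete with the following minimal sketch of a GAN-INT-CLS-style discriminator loss, assuming a conditional discriminator D(image, text_embedding) with sigmoid output; the 1/2 weighting of the two negative pairs follows the common GAN-CLS formulation, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_cls_d_loss(D, real_img, matching_txt, mismatching_txt, fake_img):
    b = real_img.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)
    # Positive pair: real image + text matching the image content.
    loss_real = F.binary_cross_entropy(D(real_img, matching_txt), ones)
    # Negative pair 1: real image + text not matching the image content.
    loss_mismatch = F.binary_cross_entropy(D(real_img, mismatching_txt), zeros)
    # Negative pair 2: generated image + text matching the image content.
    loss_fake = F.binary_cross_entropy(D(fake_img, matching_txt), zeros)
    return loss_real + 0.5 * (loss_mismatch + loss_fake)
```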
WGAN based on optimal transport theory
Based on game theory, GAN treats the image generation process as a zero-sum game between two competitors, the generator G and the discriminator D. The generator G aims to generate realistic images, while the discriminator D tries to distinguish the "fake" images generated by the generator G from real, natural images. Training the GAN amounts to reaching a Nash equilibrium between the generator G and the discriminator D. In theory, the training target of the existing GAN is to minimize the JS divergence between the generated distribution and the target distribution, but practical tricks or improvements, such as the method of formula (2), break this training target, and model training often suffers from instability and mode collapse.
To address this problem, the present invention replaces the JS-divergence-based training target with a training target based on optimal transport theory, namely the Wasserstein distance. The training target of the Wasserstein-distance-based WGAN (Wasserstein GAN) is as follows:
EMD(P_r, P_θ) = inf_{γ∈Π} Σ_{x,y} γ(x, y)·||x - y||   (3)
wherein EMD is an abbreviation of Earth Mover Distance, having the same meaning as W in formula (4); P_r denotes a probability distribution and P_θ likewise; x and y are samples subject to P_r and P_θ respectively; γ(x, y) represents the optimal transport plan between the two distributions; and F represents the Frobenius inner product. Since solving the optimal transport plan is itself an optimization problem, through the K-R (Kantorovich-Rubinstein) dual transformation the final training objective can be described as:
W(P_r, P_θ) = sup_{D ∈ 1-Lipschitz} E_{x~P_r}[D(x)] - E_{y~P_θ}[D(y)]   (4)
wherein 1-Lipschitz represents first-order Lipschitz continuity.
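A minimal sketch of training under the Wasserstein objective (4) is given below, assuming the critic D has already been constrained to be 1-Lipschitz (for example by the spectral normalization adopted later in this description); names are illustrative.

```python
import torch

def wgan_losses(D, G, real, z):
    fake = G(z)
    # Critic maximizes E[D(real)] - E[D(fake)]; written here as a loss to minimize.
    d_loss = -(D(real).mean() - D(fake.detach()).mean())
    # Generator minimizes -E[D(G(z))], i.e. tries to raise the critic score of fakes.
    g_loss = -D(fake).mean()
    return d_loss, g_loss
```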
To address the above problem, the present invention improves the existing text visual semantic embedding method by means of adversarial training with a generative adversarial network. Specifically, the invention first uses a variational auto-encoder to extract image features. Then a discriminator is used to discriminate between the image features extracted by the encoder and the text features, with the image features taken as positive samples and the text features as negative samples, so that a deep neural network capable of mapping text features onto image features is learned.
Under this training framework, texts and images are input as matched pairs and the traditional conditional adversarial training mode is no longer needed, so the training target can be changed to the more advanced Wasserstein-distance-based target, which alleviates the mode collapse and convergence difficulties of the traditional generative adversarial network.
The invention proposes a new text visual feature embedding method, called adversarial visual semantic embedding (AdViSE). The model of the invention consists of four deep neural networks: an image encoder, an image decoder, a text feature embedding network and a discriminator paired with the text feature embedding network.
As shown in Fig. 1, the present invention provides a method for generating an image from text, comprising the following steps:
S1: training an adversarial visual semantic embedding model, the adversarial visual semantic embedding model comprising: an image encoder network, an image decoder network, a generator network and a discriminator network paired with the generator network;
S2: inputting text into the generator network, the generator network outputting a text feature embedding;
S3: inputting the text feature embedding into the decoder network, the decoder network outputting an image that conforms to the semantic description of the text.
The AdViSE model first trains a variational auto-encoder on images, so that the model is able to extract semantic features of images. The input and output of the variational auto-encoder are both images of 128x128x3 resolution. The output of the encoder is two 1024-dimensional vectors μ and σ, which respectively represent the mean and variance of the image feature distribution.
After the auto-encoder has been trained for a period, once the generated images are no longer mere noise, adversarial training of the text semantic embedding network can begin. The word sequence of the text is first fed into the pre-trained text feature extraction network φ adopted from GAN-INT-CLS; the output of φ is then concatenated with noise sampled from a 100-dimensional Gaussian distribution, and the concatenated vector is input into the Generator network. Because the text features output by the Generator network need to fit the visual features as closely as possible, the output format of the Generator network is the same as that of the encoder.
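The following is a minimal sketch of such a Generator network: the output of the pre-trained text encoder φ is concatenated with 100-dimensional Gaussian noise and mapped to the same output format as the image encoder (a 1024-dimensional mean/variance pair). A fully-connected stand-in is used here in place of the channel convolution network for brevity; the text embedding dimension and hidden width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextFeatureGenerator(nn.Module):
    def __init__(self, txt_dim=1024, noise_dim=100, feat_dim=1024):
        super().__init__()
        # Fully-connected stand-in for the channel convolution network;
        # outputs mean/variance heads matching the image encoder's format.
        self.net = nn.Sequential(
            nn.Linear(txt_dim + noise_dim, 2048),
            nn.LeakyReLU(0.2),
        )
        self.fc_mu = nn.Linear(2048, feat_dim)
        self.fc_logvar = nn.Linear(2048, feat_dim)

    def forward(self, txt_embedding, noise):
        h = self.net(torch.cat([txt_embedding, noise], dim=1))
        return self.fc_mu(h), self.fc_logvar(h)

# Usage: phi_t is the output of the pre-trained GAN-INT-CLS text encoder.
# mu_t, logvar_t = generator(phi_t, torch.randn(batch_size, 100))
```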
As shown in Fig. 2, training the adversarial visual semantic embedding model includes the following steps (a simplified training-loop sketch follows the list):
S11: constructing the adversarial visual semantic embedding model;
S12: training the image encoder network and the image decoder network using a reconstruction loss function;
S13: training the generator network and the decoder network using the reconstruction loss function;
S14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
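The training-loop sketch referenced above is given here; it assumes PyTorch, the vae_loss function sketched earlier, and separate optimizers over the relevant parameter groups. The update order and optimizer setup are illustrative assumptions rather than the exact procedure of the invention.

```python
import torch

# One illustrative training iteration; vae_loss is the sketch given earlier.
def advise_train_step(enc, dec, gen, disc,
                      opt_enc_dec, opt_gen_dec, opt_gen, opt_disc,
                      images, txt_embeddings, nz=100):
    noise = torch.randn(images.size(0), nz)

    # S12: train encoder + decoder with the reconstruction (VAE) loss.
    mu, logvar = enc(images)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    loss_s12 = vae_loss(images, dec(z), mu, logvar)
    opt_enc_dec.zero_grad()
    loss_s12.backward()
    opt_enc_dec.step()

    # S13: train generator + decoder with the same reconstruction loss,
    # so that decoding a text embedding reconstructs the matching image.
    mu_t, logvar_t = gen(txt_embeddings, noise)
    z_t = mu_t + torch.exp(0.5 * logvar_t) * torch.randn_like(mu_t)
    loss_s13 = vae_loss(images, dec(z_t), mu_t, logvar_t)
    opt_gen_dec.zero_grad()
    loss_s13.backward()
    opt_gen_dec.step()

    # S14: generator vs. discriminator with the Wasserstein loss, using
    # image features as positive samples and text features as negative samples.
    feat_img = enc(images)[0].detach()
    feat_txt = gen(txt_embeddings, noise)[0]
    loss_d = -(disc(feat_img).mean() - disc(feat_txt.detach()).mean())
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()
    loss_g = -disc(gen(txt_embeddings, noise)[0]).mean()
    opt_gen.zero_grad()
    loss_g.backward()
    opt_gen.step()
```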
In one embodiment of the invention, the entire AdViSE model consists of four networks. The encoder uses the convolutional layers of the classical ResNet101; in order to output the mean and variance vectors μ and σ, two fully-connected layers are added after the last convolutional layer. The corresponding decoder is a deconvolution network symmetric to ResNet101. The text semantic embedding network (Generator) adopts the GAN-INT-CLS channel convolution network, and the Discriminator adopts an ordinary convolutional network.
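A minimal sketch of this encoder, assuming torchvision's ResNet101 and a log-variance parameterization of the second head, is as follows; the names are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        backbone = resnet101()
        # Keep the convolutional trunk and global pooling, drop the classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc_mu = nn.Linear(2048, feat_dim)      # mean vector head
        self.fc_logvar = nn.Linear(2048, feat_dim)  # variance vector head

    def forward(self, x):                 # x: (B, 3, 128, 128)
        h = self.features(x).flatten(1)   # (B, 2048) after global pooling
        return self.fc_mu(h), self.fc_logvar(h)
```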
Regarding the training target of the model: because the generator network and the discriminator network are trained as a pair, and the encoder network and the decoder network are likewise trained as a pair, the training target of the whole model is divided into two parts.
In the first part, the training targets of the encoder network and the decoder network are the training targets of the variational self-encoder. The loss function for this target is composed of the addition of two components as shown in equation (5).
L = ||x - Dec(Enc(x))|| + D_KL(Enc(x) || P(z))   (5)
wherein Enc represents the image encoder network, Dec represents the image decoder network, z represents the features extracted by the encoder, x represents the training image samples, D_KL represents the KL divergence, and P(z) is the prior distribution. The former term, referred to as the reconstruction loss, measures the difference between the input image and the output image. The latter term, called the prior loss, constrains the difference between the encoder features and the prior distribution P(z); here the KL divergence is used as the measure of this difference.
In the second part, the training targets of the generator network and the discriminator network are those of the Wasserstein-distance-based WGAN:
min_G max_{D ∈ 1-Lipschitz} E_{x~P_data}[D(Enc(x))] - E_{t, nz}[D(G(t, nz))]
wherein G is the generator network, D is the discriminator network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz represents first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
The method originally used by WGAN to enforce the 1-Lipschitz constraint causes a large loss of expressive capacity in the model. We therefore use spectral normalization to constrain the discriminator D so that it satisfies first-order Lipschitz continuity.
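A minimal sketch of a spectrally normalized feature-level discriminator, using PyTorch's built-in torch.nn.utils.spectral_norm, is shown below; a fully-connected layout over 1024-dimensional features is used here for simplicity, and the layer sizes are illustrative assumptions.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Feature-level discriminator: every linear layer is wrapped in spectral
# normalization so that the network approximately satisfies the 1-Lipschitz constraint.
feature_discriminator = nn.Sequential(
    spectral_norm(nn.Linear(1024, 512)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(512, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),   # raw critic score; no sigmoid for the Wasserstein loss
)
```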
In the above process of obtaining the visual semantic embedding of text, if the embedding is input directly to the Decoder, a text-to-image generative model is naturally obtained. Therefore, after the four networks of AdViSE have been trained, the generator network and the decoder network can be paired to serve as a text-to-image generation model, and fine-tuning can be performed on this basis, for example by adding a perceptual loss. The common reconstruction loss directly computes a norm ||x - x'|| between an input sample x and an output sample x'; with a perceptual loss, a pre-trained convolutional network (generally VGG16 trained on ImageNet), denoted f, is used and the loss function becomes ||f(x) - f(x')||, further improving the text-to-image generation effect.
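A minimal sketch of such a perceptual loss, assuming torchvision's pre-trained VGG16 as the fixed feature network f, is as follows; the choice of truncation layer and the L1 norm are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Fixed, pre-trained VGG16 feature extractor f (ImageNet weights,
        # downloaded by torchvision), truncated after an intermediate conv block.
        self.f = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.f.parameters():
            p.requires_grad = False

    def forward(self, x, x_recon):
        # ||f(x) - f(x')|| instead of the pixel-space norm ||x - x'||.
        return F.l1_loss(self.f(x_recon), self.f(x))
```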
As shown in Fig. 3, the method for generating an image from text further includes the following step:
S4: constructing a new discriminator network, the new discriminator network being used to distinguish the images output by the decoder network from the images in the training set.
The effect of text-to-image generation can be further improved by adding a discriminator.
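A minimal sketch of such an additional image-space discriminator for step S4, distinguishing 128x128x3 decoder outputs from training-set images and again constrained with spectral normalization, is as follows; the convolutional layout is an illustrative assumption.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Image-space discriminator for 128x128x3 images produced by the decoder.
image_discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),    # 128 -> 64
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),  # 64 -> 32
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(128, 256, 4, stride=2, padding=1)), # 32 -> 16
    nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    spectral_norm(nn.Linear(256, 1)),  # critic score: real image vs. decoded image
)
```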
Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
To verify the effectiveness of the present invention, it is compared with existing text-to-image generation models on two public data sets (CUB and Oxford-102).
TABLE 1 text-to-image generative model comparison
In quantitative experiments, the AdViSE method obtained the highest scores on both CUB and Oxford-102, as shown in Table 1.
In qualitative experiments, basic text-to-image generation is performed on CUB and Oxford-102. Observing the generated images, although its pipeline differs from prior methods, the method of the invention generates pictures that satisfy the text descriptions, and the details mentioned in the text all correspond to the generated images.
Text-to-image generation can be realized simply by applying the Decoder to the AdViSE embedding, and good results are obtained. Introducing a perceptual loss on this basis yields a further improvement, while adding a new discriminator after the Decoder of the basic AdViSE brings a significant improvement. On the other hand, replacing the text semantic embedding of an existing method with the pre-trained AdViSE visual semantic embedding also achieves a large improvement, which shows that, compared with improving the image generation capacity of existing methods, AdViSE better exploits the potential of text-modality features.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications with the same properties or uses can be made without departing from the spirit of the invention, and all of them are considered to be within the scope of the invention.

Claims (10)

1. A method for generating an image from text, comprising the steps of:
S1: training an adversarial visual semantic embedding model, the adversarial visual semantic embedding model comprising: an image encoder network, an image decoder network, a generator network and a discriminator network paired with the generator network;
S2: inputting text into the generator network, the generator network outputting a text feature embedding;
S3: inputting the text feature embedding into the decoder network, the decoder network outputting an image that conforms to the semantic description of the text.
2. The method of generating an image from text of claim 1, wherein training the adversarial visual semantic embedding model comprises the steps of:
S11: constructing the adversarial visual semantic embedding model;
S12: training the image encoder network and the image decoder network using a reconstruction loss function;
S13: training the generator network and the decoder network using the reconstruction loss function;
S14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
3. The method of generating an image from text of claim 2, wherein the reconstruction loss function is:
L = ||x - Dec(Enc(x))|| + D_KL(Enc(x) || P(z))
wherein Enc represents the image encoder network, Dec represents the image decoder network, z represents the features extracted by the encoder, x represents the training image samples, D_KL represents the KL divergence, and P(z) is the prior distribution.
4. The method of generating an image from text of claim 3, wherein a perceptual loss is added to the reconstruction loss function.
5. The method of generating an image from text of claim 2, wherein the Wasserstein distance is:
min_G max_{D ∈ 1-Lipschitz} E_{x~P_data}[D(Enc(x))] - E_{t, nz}[D(G(t, nz))]
wherein G is the generator network, D is the discriminator network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz represents first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
6. The method of generating an image from text of claim 1, further comprising the step of:
S4: constructing a new discriminator network, the new discriminator network being used to distinguish the images output by the decoder network from the images in the training set.
7. The method of generating an image from text as claimed in any one of claims 1 to 6, wherein said encoder network uses the convolutional layers of ResNet101, two fully-connected layers being added after the last convolutional layer;
the decoder network is a deconvolution network symmetric to ResNet101;
the generator network is a channel convolution network of the GAN-INT-CLS;
the discriminator network is a convolutional network.
8. The method of generating an image from text of any of claims 1-6, wherein the discriminator network is constrained using spectral normalization.
9. The method of generating an image from text of any of claims 1-6, wherein the generator network has the same output format as the encoder network.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN201911033265.0A 2019-10-28 2019-10-28 Method for text to image Active CN110866958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911033265.0A CN110866958B (en) 2019-10-28 2019-10-28 Method for text to image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911033265.0A CN110866958B (en) 2019-10-28 2019-10-28 Method for text to image

Publications (2)

Publication Number Publication Date
CN110866958A true CN110866958A (en) 2020-03-06
CN110866958B CN110866958B (en) 2023-04-18

Family

ID=69653498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911033265.0A Active CN110866958B (en) 2019-10-28 2019-10-28 Method for text to image

Country Status (1)

Country Link
CN (1) CN110866958B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062865A (en) * 2020-03-18 2020-04-24 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111402365A (en) * 2020-03-17 2020-07-10 湖南大学 Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111768326A (en) * 2020-04-03 2020-10-13 南京信息工程大学 High-capacity data protection method based on GAN amplification image foreground object
CN112418310A (en) * 2020-11-20 2021-02-26 第四范式(北京)技术有限公司 Text style migration model training method and system and image generation method and system
CN112990302A (en) * 2021-03-11 2021-06-18 北京邮电大学 Model training method and device based on text generated image and image generation method
CN113191375A (en) * 2021-06-09 2021-07-30 北京理工大学 Text-to-multi-object image generation method based on joint embedding
CN113298895A (en) * 2021-06-18 2021-08-24 上海交通大学 Convergence guarantee-oriented unsupervised bidirectional generation automatic coding method and system
CN113449491A (en) * 2021-07-05 2021-09-28 思必驰科技股份有限公司 Pre-training framework for language understanding and generation with two-stage decoder
CN113627567A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Picture processing method, text processing method, related equipment and storage medium
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN114638905A (en) * 2022-01-30 2022-06-17 中国科学院自动化研究所 Image generation method, device, equipment, storage medium and computer program product
CN114648681A (en) * 2022-05-20 2022-06-21 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN114677569A (en) * 2022-02-17 2022-06-28 之江实验室 Character-image pair generation method and device based on feature decoupling
CN115879515A (en) * 2023-02-20 2023-03-31 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium
CN116208772A (en) * 2023-05-05 2023-06-02 浪潮电子信息产业股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN116710910A (en) * 2020-12-29 2023-09-05 迪真诺有限公司 Design generating method based on condition generated by learning and device thereof
CN116939320A (en) * 2023-06-12 2023-10-24 南京邮电大学 Method for generating multimode mutually-friendly enhanced video semantic communication

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
CN109543159A (en) * 2018-11-12 2019-03-29 南京德磐信息科技有限公司 A kind of text generation image method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
CN109543159A (en) * 2018-11-12 2019-03-29 南京德磐信息科技有限公司 A kind of text generation image method and device

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402365A (en) * 2020-03-17 2020-07-10 湖南大学 Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111402365B (en) * 2020-03-17 2023-02-10 湖南大学 Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111062865A (en) * 2020-03-18 2020-04-24 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111768326A (en) * 2020-04-03 2020-10-13 南京信息工程大学 High-capacity data protection method based on GAN amplification image foreground object
CN111768326B (en) * 2020-04-03 2023-08-25 南京信息工程大学 High-capacity data protection method based on GAN (gas-insulated gate bipolar transistor) amplified image foreground object
CN112418310A (en) * 2020-11-20 2021-02-26 第四范式(北京)技术有限公司 Text style migration model training method and system and image generation method and system
CN116710910A (en) * 2020-12-29 2023-09-05 迪真诺有限公司 Design generating method based on condition generated by learning and device thereof
CN112990302A (en) * 2021-03-11 2021-06-18 北京邮电大学 Model training method and device based on text generated image and image generation method
CN113191375A (en) * 2021-06-09 2021-07-30 北京理工大学 Text-to-multi-object image generation method based on joint embedding
CN113298895A (en) * 2021-06-18 2021-08-24 上海交通大学 Convergence guarantee-oriented unsupervised bidirectional generation automatic coding method and system
CN113449491A (en) * 2021-07-05 2021-09-28 思必驰科技股份有限公司 Pre-training framework for language understanding and generation with two-stage decoder
CN113449491B (en) * 2021-07-05 2023-12-26 思必驰科技股份有限公司 Pre-training framework for language understanding and generation with two-stage decoder
CN113627567A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Picture processing method, text processing method, related equipment and storage medium
CN113627567B (en) * 2021-08-24 2024-04-02 北京达佳互联信息技术有限公司 Picture processing method, text processing method, related device and storage medium
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN113837229B (en) * 2021-08-30 2024-03-15 厦门大学 Knowledge-driven text-to-image generation method
CN114638905A (en) * 2022-01-30 2022-06-17 中国科学院自动化研究所 Image generation method, device, equipment, storage medium and computer program product
CN114638905B (en) * 2022-01-30 2023-02-21 中国科学院自动化研究所 Image generation method, device, equipment and storage medium
CN114677569A (en) * 2022-02-17 2022-06-28 之江实验室 Character-image pair generation method and device based on feature decoupling
CN114677569B (en) * 2022-02-17 2024-05-10 之江实验室 Character-image pair generation method and device based on feature decoupling
CN114648681B (en) * 2022-05-20 2022-10-28 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN114648681A (en) * 2022-05-20 2022-06-21 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN115879515B (en) * 2023-02-20 2023-05-12 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium
CN115879515A (en) * 2023-02-20 2023-03-31 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium
CN116208772A (en) * 2023-05-05 2023-06-02 浪潮电子信息产业股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN116939320A (en) * 2023-06-12 2023-10-24 南京邮电大学 Method for generating multimode mutually-friendly enhanced video semantic communication

Also Published As

Publication number Publication date
CN110866958B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110866958B (en) Method for text to image
CN111754596B (en) Editing model generation method, device, equipment and medium for editing face image
CN107480144B (en) Method and device for generating image natural language description with cross-language learning capability
CN111191075B (en) Cross-modal retrieval method, system and storage medium based on dual coding and association
CN112817914A (en) Attention-based deep cross-modal Hash retrieval method and device and related equipment
CN110798636B (en) Subtitle generating method and device and electronic equipment
CN111402257A (en) Medical image automatic segmentation method based on multi-task collaborative cross-domain migration
US20190303499A1 (en) Systems and methods for determining video content relevance
CN111275784A (en) Method and device for generating image
CN112584062B (en) Background audio construction method and device
Fried et al. Patch2vec: Globally consistent image patch representation
CN111078940A (en) Image processing method, image processing device, computer storage medium and electronic equipment
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
CN113361646A (en) Generalized zero sample image identification method and model based on semantic information retention
CN111178039A (en) Model training method and device, and method and device for realizing text processing
CN115115540A (en) Unsupervised low-light image enhancement method and unsupervised low-light image enhancement device based on illumination information guidance
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN111582066A (en) Heterogeneous face recognition model training method, face recognition method and related device
KR20210047467A (en) Method and System for Auto Multiple Image Captioning
KR20210058059A (en) Unsupervised text summarization method based on sentence embedding and unsupervised text summarization device using the same
CN117197268A (en) Image generation method, device and storage medium
Yang et al. Finding badly drawn bunnies

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant