CN110866958A - Method for text to image - Google Patents

Method for text to image

Info

Publication number
CN110866958A
Authority
CN
China
Prior art keywords
network
text
image
training
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911033265.0A
Other languages
Chinese (zh)
Other versions
CN110866958B (en)
Inventor
袁春
吴航昊
贲有成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN201911033265.0A priority Critical patent/CN110866958B/en
Publication of CN110866958A publication Critical patent/CN110866958A/en
Application granted granted Critical
Publication of CN110866958B publication Critical patent/CN110866958B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a method for generating an image from text, comprising the following steps: S1: training an adversarial visual semantic embedding model, the adversarial visual semantic embedding model comprising: an image encoder network, an image decoder network, a generator network and a discriminator network paired with the generator network; S2: inputting text into the generator network, the generator network outputting a text feature embedding; S3: inputting the text feature embedding into the decoder network, the decoder network outputting an image that conforms to the semantic description of the text. By enhancing the visual feature embedding of existing text data through adversarial training, the method reduces the gap between the distributions of text-modality data and image-modality data in the semantic space.

Description

Method for text to image
Technical Field
The invention relates to the technical field of deep learning, in particular to a text-to-image method.
Background
Text-to-image generation has been a popular research topic in the field of computer vision in recent years. Deep generative models based on generative adversarial networks (GANs) are of particular importance among existing classes of methods, since in theory they are able to generate a variety of realistic images with relatively few model parameters, which means that they have the ability to capture the nature of natural images. GANs have attracted extensive attention as a class of generative models that are able to fit the distribution of natural images, and they are widely used in various image generation tasks such as image inpainting, super-resolution, image-to-image translation and future frame prediction.
In recent years, there have been many attempts to extract semantic embeddings of text, such as the classical word2vec. In the field of computer vision, the focus is on embedding the visual semantics of text, such as the colors, properties, textures and locations mentioned in a text description, into a semantic space. Most existing methods pre-train the deep neural network that extracts the text semantic embedding with a discriminative task; specifically, the discriminative task is to determine whether a picture and a text description are semantically matched.
With the development of deep generative models, especially the theoretical and practical advances in generative adversarial networks, the task of text-to-image generation has made staged progress. The existing mainstream methods generally adopt the framework of a conditional generative adversarial network, taking a text description as the condition under which an image conforming to that description is generated. However, in the cross-modal task of text-to-image generation, data of the two modalities are distributed unevenly in the semantic space: text-modality data is sparse while image-modality data is dense. The methods described above do not fully exploit the potential of text feature extraction when handling text-modality features, and therefore cannot bridge this gap.
Disclosure of Invention
The invention provides a text-to-image method to solve the above problems of text-to-image generation in the prior art.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A method of generating an image from text, comprising the following steps: S1: training an adversarial visual semantic embedding model, the adversarial visual semantic embedding model comprising: an image encoder network, an image decoder network, a generator network and a discriminator network paired with the generator network; S2: inputting text into the generator network, the generator network outputting a text feature embedding; S3: inputting the text feature embedding into the decoder network, the decoder network outputting an image that conforms to the semantic description of the text.
Preferably, training the adversarial visual semantic embedding model comprises the following steps: S11: constructing the adversarial visual semantic embedding model; S12: training the image encoder network and the image decoder network using a reconstruction loss function; S13: training the generator network and the decoder network using the reconstruction loss function; S14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
Preferably, the reconstruction loss function is:
L = ||x - Dec(Enc(x))|| + D_KL(Enc(x) || P(z))
wherein Enc represents the image encoder network, Dec represents the image decoder network, z represents the features extracted by the encoder, x represents the training image samples, D_KL represents the KL divergence, and P(z) is the prior distribution.
Preferably, perceptual loss is added to the reconstruction loss function.
Preferably, the Wasserstein distance is:
min_G max_{D ∈ 1-Lipschitz} E_{x~P_data}[D(Enc(x))] - E_{t, nz}[D(G(t, nz))]
wherein G is the generator network, D is the discriminator network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz represents first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
Preferably, the method further comprises the following step: S4: constructing a new discriminator network, the new discriminator network being used to distinguish the images output by the decoder network from the images in the training set.
Preferably, the encoder network uses the convolutional layers of ResNet101, with two fully-connected layers added after the last convolutional layer; the decoder network is a deconvolution network symmetric to ResNet101; the generator network is the channel convolution network of GAN-INT-CLS; and the discriminator network is a convolutional network.
Preferably, the discriminator network is constrained by means of spectral normalization.
Preferably, the generator network has the same output format as the encoder network.
The invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method as set forth in any of the above.
The beneficial effects of the invention are as follows: a method for generating an image from text is provided, in which text-to-image generation is realized by training an adversarial visual semantic embedding model and then using that model. By enhancing the visual feature embedding of existing text data through adversarial training, the method reduces the gap between the distributions of text-modality data and image-modality data in the semantic space.
The existing text-to-image generation pipeline is improved: the unstable conventional CGAN training framework is avoided and the more advanced WGAN training framework is adopted instead. This greatly enhances training stability, accelerates convergence and reduces the probability of training falling into mode collapse.
Quantitative measures of image generation quality, such as the Inception Score, are significantly improved.
Drawings
Fig. 1 is a schematic diagram of a method for generating an image from a text in an embodiment of the present invention.
FIG. 2 is a schematic diagram of a method for training the adversarial visual semantic embedding model according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a method for generating an image according to another embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the advantageous effects of the embodiments of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. The connection may be for fixation or for circuit connection.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships shown in the drawings for convenience in describing the embodiments of the present invention and to simplify the description; they are not intended to indicate or imply that the referenced device or element must have a particular orientation or be constructed in a particular orientation, and are not to be construed as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Example 1
In deep learning, the auto-encoder is a widely used method for generative modelling and for extracting features from data. There are many improvements and variants of the auto-encoder, such as the denoising auto-encoder and the sparse auto-encoder. Among them, the most widely used is the variational auto-encoder, which is based on variational inference.
An auto-encoder is composed of two deep neural networks: an encoder, responsible for compressing high-dimensional input samples into low-dimensional data features, and a decoder, responsible for restoring the low-dimensional features into high-dimensional data samples. To achieve this, the training goal of the auto-encoder is to minimize the reconstruction loss between its input and output, for which the L2 loss or the binary cross-entropy loss is generally adopted.
The variational auto-encoder further constrains the distribution of the features output by the encoder on the basis of the auto-encoder: this distribution is required to be as similar as possible to a known prior distribution (in practice, a multidimensional Gaussian or uniform distribution is generally adopted). The loss function of the variational auto-encoder therefore adds this constraint on the encoder output to the reconstruction loss. Specifically, the KL divergence between the distribution defined by the mean and variance output by the encoder and the known prior distribution is computed, and this measure is taken as the prior loss, which together with the reconstruction loss forms the training loss function of the variational auto-encoder.
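As an illustration of the variational auto-encoder training loss just described, the following is a minimal sketch assuming PyTorch, a standard Gaussian prior P(z), and a mean/log-variance parameterization of the encoder output; the function and variable names are illustrative rather than the concrete implementation of the invention.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss between input and output (L2 here; BCE is also common).
    rec_loss = F.mse_loss(x_recon, x, reduction="sum")
    # Prior loss: KL divergence between N(mu, sigma^2) and the prior N(0, I),
    # computed in closed form from the mean and log-variance.
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_loss + kl_loss
```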
Because the invention needs to extract visual semantic features for the generative task of text-to-image generation, the model adopts the structure and training mode of a variational auto-encoder to extract image features and to convert text descriptions into images.
GAN and text-to-image generation
As a promising branch of the generative model, GAN treats the training process as a zero-sum game between two competitors, the generator G and the discriminator D. In particular, the generator G is intended to generate realistic images, while the discriminator D tries to distinguish real images from false images generated by the generator G. Training the GAN is equivalent to optimizing the following objectives:
min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z}[log(1 - D(G(z)))]   (1)
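As an illustration of objective (1), the following minimal sketch alternates one discriminator update and one generator update; it assumes PyTorch, a discriminator with sigmoid output of shape (batch, 1), and the commonly used non-saturating generator loss. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real_images, nz=100):
    b = real_images.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: push D(x) towards 1 and D(G(z)) towards 0.
    z = torch.randn(b, nz)
    fake_images = G(z).detach()
    d_loss = F.binary_cross_entropy(D(real_images), ones) + \
             F.binary_cross_entropy(D(fake_images), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating form): push D(G(z)) towards 1.
    z = torch.randn(b, nz)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```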
the condition GAN (cgan) extends GAN by providing additional label information as input conditions to generator G and arbiter D.
The GAN-based text-to-image generation method GAN-INT-CLS further enhances the use of the conditional supervision signal on the basis of CGAN. During training, GAN-INT-CLS provides the discriminator with three types of sample pairs: a real image with text that matches the image content, a real image with text that does not match the image content, and a generated image with text that matches the image content. Of these three types of sample pairs, only the first is a positive sample; the remaining two are negative samples. In this case, the training target is:
min_G max_D V(D, G) = E[log D(x, t)] + (1/2)(E[log(1 - D(x, t'))] + E[log(1 - D(G(z, t), t))])   (2)
where t denotes a text description matching the image x and t' denotes a mismatched text description.
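The three sample pairs can be made concrete with the following minimal sketch of a GAN-INT-CLS-style discriminator loss, assuming a conditional discriminator D(image, text_embedding) with sigmoid output; the 1/2 weighting of the two negative pairs follows the common GAN-CLS formulation, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_cls_d_loss(D, real_img, matching_txt, mismatching_txt, fake_img):
    b = real_img.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)
    # Positive pair: real image + text matching the image content.
    loss_real = F.binary_cross_entropy(D(real_img, matching_txt), ones)
    # Negative pair 1: real image + text not matching the image content.
    loss_mismatch = F.binary_cross_entropy(D(real_img, mismatching_txt), zeros)
    # Negative pair 2: generated image + text matching the image content.
    loss_fake = F.binary_cross_entropy(D(fake_img, matching_txt), zeros)
    return loss_real + 0.5 * (loss_mismatch + loss_fake)
```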
WGAN based on optimal transport theory
Based on game theory, GAN treats the image generation process as a zero-sum game between two competitors, the generator G and the discriminator D. The generator G aims to generate realistic images, while the discriminator D tries to distinguish the "fake" images generated by the generator G from real, natural images. Training the GAN amounts to reaching a Nash equilibrium between the generator G and the discriminator D. In theory, the training target of the existing GAN is to minimize the JS divergence between the generated distribution and the target distribution, but practical tricks or improvements, such as the method of formula (2), break this training target, and model training often suffers from instability and mode collapse.
To address this problem, the present invention replaces the JS-divergence-based training target with a training target based on optimal transport theory, namely the Wasserstein distance. The training target of the Wasserstein-distance-based WGAN (Wasserstein GAN) is as follows:
EMD(P_r, P_θ) = inf_{γ∈Π} Σ_{x,y} γ(x, y)·||x - y||   (3)
wherein EMD is an abbreviation of Earth Mover Distance, having the same meaning as W in formula (4); P_r denotes a probability distribution and P_θ likewise; x and y are samples subject to P_r and P_θ respectively; γ(x, y) represents the optimal transport plan between the two distributions; and F represents the Frobenius inner product. Since solving the optimal transport plan is itself an optimization problem, through the K-R (Kantorovich-Rubinstein) dual transformation the final training objective can be described as:
W(P_r, P_θ) = sup_{D ∈ 1-Lipschitz} E_{x~P_r}[D(x)] - E_{y~P_θ}[D(y)]   (4)
wherein 1-Lipschitz represents first-order Lipschitz continuity.
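A minimal sketch of training under the Wasserstein objective (4) is given below, assuming the critic D has already been constrained to be 1-Lipschitz (for example by the spectral normalization adopted later in this description); names are illustrative.

```python
import torch

def wgan_losses(D, G, real, z):
    fake = G(z)
    # Critic maximizes E[D(real)] - E[D(fake)]; written here as a loss to minimize.
    d_loss = -(D(real).mean() - D(fake.detach()).mean())
    # Generator minimizes -E[D(G(z))], i.e. tries to raise the critic score of fakes.
    g_loss = -D(fake).mean()
    return d_loss, g_loss
```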
To address the above problem, the present invention improves the existing text visual semantic embedding method by means of adversarial training with a generative adversarial network. Specifically, the invention first uses a variational auto-encoder to extract image features. Then a discriminator is used to discriminate between the image features extracted by the encoder and the text features, with the image features taken as positive samples and the text features as negative samples, so that a deep neural network capable of mapping text features onto image features is learned.
Under this training framework, texts and images are input as matched pairs and the traditional conditional adversarial training mode is no longer needed, so the training target can be changed to the more advanced Wasserstein-distance-based target, which alleviates the mode collapse and convergence difficulties of the traditional generative adversarial network.
The invention proposes a new text visual feature embedding method, called adversarial visual semantic embedding (AdViSE). The model of the invention consists of four deep neural networks: an image encoder, an image decoder, a text feature embedding network and a discriminator paired with the text feature embedding network.
As shown in Fig. 1, the present invention provides a method for generating an image from text, comprising the following steps:
S1: training an adversarial visual semantic embedding model, the adversarial visual semantic embedding model comprising: an image encoder network, an image decoder network, a generator network and a discriminator network paired with the generator network;
S2: inputting text into the generator network, the generator network outputting a text feature embedding;
S3: inputting the text feature embedding into the decoder network, the decoder network outputting an image that conforms to the semantic description of the text.
The AdViSE model first trains a variational auto-encoder on images, so that the model is able to extract semantic features of images. The input and output of the variational auto-encoder are both images of 128x128x3 resolution. The output of the encoder is two 1024-dimensional vectors μ and σ, which respectively represent the mean and variance of the image feature distribution.
After the auto-encoder has been trained for a period, once the generated images are no longer mere noise, adversarial training of the text semantic embedding network can begin. The word sequence of the text is first fed into the pre-trained text feature extraction network φ adopted from GAN-INT-CLS; the output of φ is then concatenated with noise sampled from a 100-dimensional Gaussian distribution, and the concatenated vector is input into the Generator network. Because the text features output by the Generator network need to fit the visual features as closely as possible, the output format of the Generator network is the same as that of the encoder.
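The following is a minimal sketch of such a Generator network: the output of the pre-trained text encoder φ is concatenated with 100-dimensional Gaussian noise and mapped to the same output format as the image encoder (a 1024-dimensional mean/variance pair). A fully-connected stand-in is used here in place of the channel convolution network for brevity; the text embedding dimension and hidden width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextFeatureGenerator(nn.Module):
    def __init__(self, txt_dim=1024, noise_dim=100, feat_dim=1024):
        super().__init__()
        # Fully-connected stand-in for the channel convolution network;
        # outputs mean/variance heads matching the image encoder's format.
        self.net = nn.Sequential(
            nn.Linear(txt_dim + noise_dim, 2048),
            nn.LeakyReLU(0.2),
        )
        self.fc_mu = nn.Linear(2048, feat_dim)
        self.fc_logvar = nn.Linear(2048, feat_dim)

    def forward(self, txt_embedding, noise):
        h = self.net(torch.cat([txt_embedding, noise], dim=1))
        return self.fc_mu(h), self.fc_logvar(h)

# Usage: phi_t is the output of the pre-trained GAN-INT-CLS text encoder.
# mu_t, logvar_t = generator(phi_t, torch.randn(batch_size, 100))
```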
As shown in Fig. 2, training the adversarial visual semantic embedding model includes the following steps (a simplified training-loop sketch follows the list):
S11: constructing the adversarial visual semantic embedding model;
S12: training the image encoder network and the image decoder network using a reconstruction loss function;
S13: training the generator network and the decoder network using the reconstruction loss function;
S14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
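The training-loop sketch referenced above is given here; it assumes PyTorch, the vae_loss function sketched earlier, and separate optimizers over the relevant parameter groups. The update order and optimizer setup are illustrative assumptions rather than the exact procedure of the invention.

```python
import torch

# One illustrative training iteration; vae_loss is the sketch given earlier.
def advise_train_step(enc, dec, gen, disc,
                      opt_enc_dec, opt_gen_dec, opt_gen, opt_disc,
                      images, txt_embeddings, nz=100):
    noise = torch.randn(images.size(0), nz)

    # S12: train encoder + decoder with the reconstruction (VAE) loss.
    mu, logvar = enc(images)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    loss_s12 = vae_loss(images, dec(z), mu, logvar)
    opt_enc_dec.zero_grad()
    loss_s12.backward()
    opt_enc_dec.step()

    # S13: train generator + decoder with the same reconstruction loss,
    # so that decoding a text embedding reconstructs the matching image.
    mu_t, logvar_t = gen(txt_embeddings, noise)
    z_t = mu_t + torch.exp(0.5 * logvar_t) * torch.randn_like(mu_t)
    loss_s13 = vae_loss(images, dec(z_t), mu_t, logvar_t)
    opt_gen_dec.zero_grad()
    loss_s13.backward()
    opt_gen_dec.step()

    # S14: generator vs. discriminator with the Wasserstein loss, using
    # image features as positive samples and text features as negative samples.
    feat_img = enc(images)[0].detach()
    feat_txt = gen(txt_embeddings, noise)[0]
    loss_d = -(disc(feat_img).mean() - disc(feat_txt.detach()).mean())
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()
    loss_g = -disc(gen(txt_embeddings, noise)[0]).mean()
    opt_gen.zero_grad()
    loss_g.backward()
    opt_gen.step()
```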
In one embodiment of the invention, the entire AdViSE model consists of four networks. The encoder uses the convolutional layers of the classical ResNet101; in order to output the mean and variance vectors μ and σ, two fully-connected layers are added after the last convolutional layer. The corresponding decoder is a deconvolution network symmetric to ResNet101. The text semantic embedding network (Generator) adopts the GAN-INT-CLS channel convolution network, and the Discriminator adopts an ordinary convolutional network.
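A minimal sketch of this encoder, assuming torchvision's ResNet101 and a log-variance parameterization of the second head, is as follows; the names are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        backbone = resnet101()
        # Keep the convolutional trunk and global pooling, drop the classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc_mu = nn.Linear(2048, feat_dim)      # mean vector head
        self.fc_logvar = nn.Linear(2048, feat_dim)  # variance vector head

    def forward(self, x):                 # x: (B, 3, 128, 128)
        h = self.features(x).flatten(1)   # (B, 2048) after global pooling
        return self.fc_mu(h), self.fc_logvar(h)
```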
Regarding the training target of the model: because the generator network and the discriminator network are trained as a pair, and the encoder network and the decoder network are likewise trained as a pair, the training target of the whole model is divided into two parts.
In the first part, the training targets of the encoder network and the decoder network are the training targets of the variational self-encoder. The loss function for this target is composed of the addition of two components as shown in equation (5).
L = ||x - Dec(Enc(x))|| + D_KL(Enc(x) || P(z))   (5)
wherein Enc represents the image encoder network, Dec represents the image decoder network, z represents the features extracted by the encoder, x represents the training image samples, D_KL represents the KL divergence, and P(z) is the prior distribution. The former term, referred to as the reconstruction loss, measures the difference between the input image and the output image. The latter term, called the prior loss, constrains the difference between the encoder features and the prior distribution P(z); here the KL divergence is used as the measure of this difference.
In the second part, the training targets of the generator network and the discriminator network are those of the Wasserstein-distance-based WGAN:
min_G max_{D ∈ 1-Lipschitz} E_{x~P_data}[D(Enc(x))] - E_{t, nz}[D(G(t, nz))]
wherein G is the generator network, D is the discriminator network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz represents first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
The method originally used by WGAN to enforce the 1-Lipschitz constraint causes a large loss of expressive capacity in the model. We therefore use spectral normalization to constrain the discriminator D so that it satisfies first-order Lipschitz continuity.
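A minimal sketch of a spectrally normalized feature-level discriminator, using PyTorch's built-in torch.nn.utils.spectral_norm, is shown below; a fully-connected layout over 1024-dimensional features is used here for simplicity, and the layer sizes are illustrative assumptions.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Feature-level discriminator: every linear layer is wrapped in spectral
# normalization so that the network approximately satisfies the 1-Lipschitz constraint.
feature_discriminator = nn.Sequential(
    spectral_norm(nn.Linear(1024, 512)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(512, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),   # raw critic score; no sigmoid for the Wasserstein loss
)
```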
In the above process of obtaining the visual semantic embedding of text, if the embedding is input directly to the Decoder, a text-to-image generative model is naturally obtained. Therefore, after the four networks of AdViSE have been trained, the generator network and the decoder network can be paired to serve as a text-to-image generation model, and fine-tuning can be performed on this basis, for example by adding a perceptual loss. The common reconstruction loss directly computes a norm ||x - x'|| between an input sample x and an output sample x'; with a perceptual loss, a pre-trained convolutional network (generally VGG16 trained on ImageNet), denoted f, is used and the loss function becomes ||f(x) - f(x')||, further improving the text-to-image generation effect.
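A minimal sketch of such a perceptual loss, assuming torchvision's pre-trained VGG16 as the fixed feature network f, is as follows; the choice of truncation layer and the L1 norm are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Fixed, pre-trained VGG16 feature extractor f (ImageNet weights,
        # downloaded by torchvision), truncated after an intermediate conv block.
        self.f = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.f.parameters():
            p.requires_grad = False

    def forward(self, x, x_recon):
        # ||f(x) - f(x')|| instead of the pixel-space norm ||x - x'||.
        return F.l1_loss(self.f(x_recon), self.f(x))
```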
As shown in Fig. 3, the method for generating an image from text further includes the following step:
S4: constructing a new discriminator network, the new discriminator network being used to distinguish the images output by the decoder network from the images in the training set.
The effect of text-to-image generation can be further improved by adding a discriminator.
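A minimal sketch of such an additional image-space discriminator for step S4, distinguishing 128x128x3 decoder outputs from training-set images and again constrained with spectral normalization, is as follows; the convolutional layout is an illustrative assumption.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Image-space discriminator for 128x128x3 images produced by the decoder.
image_discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),    # 128 -> 64
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),  # 64 -> 32
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(128, 256, 4, stride=2, padding=1)), # 32 -> 16
    nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    spectral_norm(nn.Linear(256, 1)),  # critic score: real image vs. decoded image
)
```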
Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
To verify the effectiveness of the present invention, it is compared with existing text-to-image generation models on two public data sets (CUB and Oxford-102).
TABLE 1 text-to-image generative model comparison
In quantitative experiments, the AdViSE method obtained the highest scores on both CUB and Oxford-102, as shown in Table 1.
In qualitative experiments, basic text-to-image generation is performed on CUB and Oxford-102. Observing the generated images, although its pipeline differs from prior methods, the method of the invention generates pictures that satisfy the text descriptions, and the details mentioned in the text all correspond to the generated images.
Text-to-image generation can be realized simply by applying the Decoder to the AdViSE embedding, and good results are obtained. Introducing a perceptual loss on this basis yields a further improvement, while adding a new discriminator after the Decoder of the basic AdViSE brings a significant improvement. On the other hand, replacing the text semantic embedding of an existing method with the pre-trained AdViSE visual semantic embedding also achieves a large improvement, which shows that, compared with improving the image generation capacity of existing methods, AdViSE better exploits the potential of text-modality features.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications with the same properties or uses can be made without departing from the spirit of the invention, and all of them are considered to be within the scope of the invention.

Claims (10)

1. A method for generating an image from text, comprising the steps of:
S1: training an adversarial visual semantic embedding model, the adversarial visual semantic embedding model comprising: an image encoder network, an image decoder network, a generator network and a discriminator network paired with the generator network;
S2: inputting text into the generator network, the generator network outputting a text feature embedding;
S3: inputting the text feature embedding into the decoder network, the decoder network outputting an image that conforms to the semantic description of the text.
2. The method of generating an image from text of claim 1, wherein training the adversarial visual semantic embedding model comprises the steps of:
S11: constructing the adversarial visual semantic embedding model;
S12: training the image encoder network and the image decoder network using a reconstruction loss function;
S13: training the generator network and the decoder network using the reconstruction loss function;
S14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
3. The method of generating an image from text of claim 2, wherein the reconstruction loss function is:
L = ||x - Dec(Enc(x))|| + D_KL(Enc(x) || P(z))
wherein Enc represents the image encoder network, Dec represents the image decoder network, z represents the features extracted by the encoder, x represents the training image samples, D_KL represents the KL divergence, and P(z) is the prior distribution.
4. The method of generating an image from text of claim 3, wherein a perceptual loss is added to the reconstruction loss function.
5. The method of generating an image from text of claim 2, wherein the Wasserstein distance is:
min_G max_{D ∈ 1-Lipschitz} E_{x~P_data}[D(Enc(x))] - E_{t, nz}[D(G(t, nz))]
wherein G is the generator network, D is the discriminator network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz represents first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
6. The method of generating an image from text of claim 1, further comprising the step of:
S4: constructing a new discriminator network, the new discriminator network being used to distinguish the images output by the decoder network from the images in the training set.
7. The method of generating an image from text as claimed in any one of claims 1 to 6, wherein said encoder network uses the convolutional layers of ResNet101, two fully-connected layers being added after the last convolutional layer;
the decoder network is a deconvolution network symmetric to ResNet101;
the generator network is a channel convolution network of the GAN-INT-CLS;
the discriminator network is a convolutional network.
8. The method of generating an image from text of any of claims 1-6, wherein the discriminator network is constrained using spectral normalization.
9. The method of generating an image from text of any of claims 1-6, wherein the generator network has the same output format as the encoder network.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN201911033265.0A 2019-10-28 2019-10-28 Method for text to image Active CN110866958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911033265.0A CN110866958B (en) 2019-10-28 2019-10-28 Method for text to image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911033265.0A CN110866958B (en) 2019-10-28 2019-10-28 Method for text to image

Publications (2)

Publication Number Publication Date
CN110866958A true CN110866958A (en) 2020-03-06
CN110866958B CN110866958B (en) 2023-04-18

Family

ID=69653498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911033265.0A Active CN110866958B (en) 2019-10-28 2019-10-28 Method for text to image

Country Status (1)

Country Link
CN (1) CN110866958B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062865A (en) * 2020-03-18 2020-04-24 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111402365A (en) * 2020-03-17 2020-07-10 湖南大学 Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111768326A (en) * 2020-04-03 2020-10-13 南京信息工程大学 High-capacity data protection method based on GAN amplification image foreground object
CN112418310A (en) * 2020-11-20 2021-02-26 第四范式(北京)技术有限公司 Text style migration model training method and system and image generation method and system
CN112990302A (en) * 2021-03-11 2021-06-18 北京邮电大学 Model training method and device based on text generated image and image generation method
CN113191375A (en) * 2021-06-09 2021-07-30 北京理工大学 Text-to-multi-object image generation method based on joint embedding
CN113298895A (en) * 2021-06-18 2021-08-24 上海交通大学 Convergence guarantee-oriented unsupervised bidirectional generation automatic coding method and system
CN113449491A (en) * 2021-07-05 2021-09-28 思必驰科技股份有限公司 Pre-training framework for language understanding and generation with two-stage decoder
CN113627567A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Picture processing method, text processing method, related equipment and storage medium
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN114638905A (en) * 2022-01-30 2022-06-17 中国科学院自动化研究所 Image generation method, device, equipment, storage medium and computer program product
CN114648681A (en) * 2022-05-20 2022-06-21 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN114677569A (en) * 2022-02-17 2022-06-28 之江实验室 Character-image pair generation method and device based on feature decoupling
CN115879515A (en) * 2023-02-20 2023-03-31 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium
CN116208772A (en) * 2023-05-05 2023-06-02 浪潮电子信息产业股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN116710910A (en) * 2020-12-29 2023-09-05 迪真诺有限公司 Design generating method based on condition generated by learning and device thereof
CN116939320A (en) * 2023-06-12 2023-10-24 南京邮电大学 Method for generating multimode mutually-friendly enhanced video semantic communication

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
CN109543159A (en) * 2018-11-12 2019-03-29 南京德磐信息科技有限公司 A kind of text generation image method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
CN109543159A (en) * 2018-11-12 2019-03-29 南京德磐信息科技有限公司 A kind of text generation image method and device

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402365A (en) * 2020-03-17 2020-07-10 湖南大学 Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111402365B (en) * 2020-03-17 2023-02-10 湖南大学 Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111062865A (en) * 2020-03-18 2020-04-24 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111768326A (en) * 2020-04-03 2020-10-13 南京信息工程大学 High-capacity data protection method based on GAN amplification image foreground object
CN111768326B (en) * 2020-04-03 2023-08-25 南京信息工程大学 High-capacity data protection method based on GAN (gas-insulated gate bipolar transistor) amplified image foreground object
CN112418310A (en) * 2020-11-20 2021-02-26 第四范式(北京)技术有限公司 Text style migration model training method and system and image generation method and system
CN116710910A (en) * 2020-12-29 2023-09-05 迪真诺有限公司 Design generating method based on condition generated by learning and device thereof
CN112990302A (en) * 2021-03-11 2021-06-18 北京邮电大学 Model training method and device based on text generated image and image generation method
CN113191375A (en) * 2021-06-09 2021-07-30 北京理工大学 Text-to-multi-object image generation method based on joint embedding
CN113298895A (en) * 2021-06-18 2021-08-24 上海交通大学 Convergence guarantee-oriented unsupervised bidirectional generation automatic coding method and system
CN113449491A (en) * 2021-07-05 2021-09-28 思必驰科技股份有限公司 Pre-training framework for language understanding and generation with two-stage decoder
CN113449491B (en) * 2021-07-05 2023-12-26 思必驰科技股份有限公司 Pre-training framework for language understanding and generation with two-stage decoder
CN113627567A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Picture processing method, text processing method, related equipment and storage medium
CN113627567B (en) * 2021-08-24 2024-04-02 北京达佳互联信息技术有限公司 Picture processing method, text processing method, related device and storage medium
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN113837229B (en) * 2021-08-30 2024-03-15 厦门大学 Knowledge-driven text-to-image generation method
CN114638905A (en) * 2022-01-30 2022-06-17 中国科学院自动化研究所 Image generation method, device, equipment, storage medium and computer program product
CN114638905B (en) * 2022-01-30 2023-02-21 中国科学院自动化研究所 Image generation method, device, equipment and storage medium
CN114677569A (en) * 2022-02-17 2022-06-28 之江实验室 Character-image pair generation method and device based on feature decoupling
CN114677569B (en) * 2022-02-17 2024-05-10 之江实验室 Character-image pair generation method and device based on feature decoupling
CN114648681B (en) * 2022-05-20 2022-10-28 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN114648681A (en) * 2022-05-20 2022-06-21 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN115879515B (en) * 2023-02-20 2023-05-12 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium
CN115879515A (en) * 2023-02-20 2023-03-31 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium
CN116208772A (en) * 2023-05-05 2023-06-02 浪潮电子信息产业股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN116939320A (en) * 2023-06-12 2023-10-24 南京邮电大学 Method for generating multimode mutually-friendly enhanced video semantic communication

Also Published As

Publication number Publication date
CN110866958B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110866958B (en) Method for text to image
CN111754596B (en) Editing model generation method, device, equipment and medium for editing face image
CN107480144B (en) Method and device for generating image natural language description with cross-language learning capability
CN111191075B (en) Cross-modal retrieval method, system and storage medium based on dual coding and association
CN112817914A (en) Attention-based deep cross-modal Hash retrieval method and device and related equipment
CN110798636B (en) Subtitle generating method and device and electronic equipment
CN111402257A (en) Medical image automatic segmentation method based on multi-task collaborative cross-domain migration
US20190303499A1 (en) Systems and methods for determining video content relevance
CN111275784A (en) Method and device for generating image
CN112584062B (en) Background audio construction method and device
Fried et al. Patch2vec: Globally consistent image patch representation
CN111078940A (en) Image processing method, image processing device, computer storage medium and electronic equipment
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
CN113361646A (en) Generalized zero sample image identification method and model based on semantic information retention
CN111178039A (en) Model training method and device, and method and device for realizing text processing
CN115115540A (en) Unsupervised low-light image enhancement method and unsupervised low-light image enhancement device based on illumination information guidance
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN111582066A (en) Heterogeneous face recognition model training method, face recognition method and related device
KR20210047467A (en) Method and System for Auto Multiple Image Captioning
KR20210058059A (en) Unsupervised text summarization method based on sentence embedding and unsupervised text summarization device using the same
CN117197268A (en) Image generation method, device and storage medium
Yang et al. Finding badly drawn bunnies

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant