CN110866958A - Method for text to image - Google Patents
- Publication number
- CN110866958A (application CN201911033265.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/001—Texturing; Colouring; Generation of texture or colour
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Editing Of Facsimile Originals (AREA)
- Image Processing (AREA)
Abstract
The invention provides a method for generating an image from text, comprising the following steps: S1: training an adversarial visual semantic embedding model, the model comprising an image encoder network, an image decoder network, a generator network, and a discriminator network paired with the generator network; S2: inputting text into the generator network, which outputs a text feature embedding; S3: inputting the text feature embedding into the decoder network, which outputs an image conforming to the semantic description of the text. By refining the visual feature embedding of text data through adversarial training, the method reduces the gap between the distributions of text-modality and image-modality data in the semantic space.
Description
Technical Field
The invention relates to the technical field of deep learning, and in particular to a method for generating an image from text.
Background
Text-to-image generation has been a popular research topic in computer vision in recent years. Deep generative models based on generative adversarial networks (GANs) are of particular importance among existing methods, since in theory they can generate a variety of realistic images with relatively few model parameters, which suggests that they can capture the essence of natural images. As a class of generative models with the ability to fit the distribution of natural images, GANs have attracted extensive attention and are widely used in various image generation tasks such as image inpainting, super-resolution, image-to-image translation, and future frame prediction.
In recent years, there have been many attempts to extract semantic embeddings of text, such as the classical word2vec. In computer vision, attention has focused on embedding the visual semantics of text, such as the color, properties, texture, and location information mentioned in a text description, into a semantic space. Most existing methods pre-train a deep neural network for extracting text semantic embeddings with a discriminative task, namely determining whether a picture and a text description are semantically matched.
With the development of deep generative models, especially the theoretical and practical advances in generative adversarial networks, the task of text-to-image generation has achieved staged results. The existing mainstream methods generally adopt the framework of a conditional generative adversarial network, taking a text description as the condition to generate an image conforming to it. However, in the cross-modal task of text-to-image generation, the data of the two modalities are not equally distributed in the semantic space: text-modality data is sparse, while image-modality data is dense. The methods described above do not fully exploit the potential of text feature extraction when handling text-modality features, and therefore fail to bridge this gap.
Disclosure of Invention
The invention provides a method for generating an image from text, to address the shortcomings of existing text-to-image methods.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
a method of generating an image from text, comprising the following steps: S1: training an adversarial visual semantic embedding model, the model comprising an image encoder network, an image decoder network, a generator network, and a discriminator network paired with the generator network; S2: inputting text into the generator network, which outputs a text feature embedding; S3: inputting the text feature embedding into the decoder network, which outputs an image conforming to the semantic description of the text.
Preferably, training the adversarial visual semantic embedding model comprises the steps of: S11: constructing the adversarial visual semantic embedding model; S12: training the image encoder network and the image decoder network using a reconstruction loss function; S13: training the generator network and the decoder network using the reconstruction loss function; S14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
Preferably, the reconstruction loss function is:
L = ||x − Dec(Enc(x))||² + D_KL(Enc(x) || P(z))
wherein Enc represents the image encoder network, Dec represents the image decoder network, z represents the features extracted by the encoder, x represents the training image samples, D_KL represents the KL divergence, and P(z) is the prior distribution.
Preferably, perceptual loss is added to the reconstruction loss function.
Preferably, the Wasserstein distance is:
min_G max_{D∈1-Lipschitz} E_{x~P_data}[D(Enc(x))] − E_{nz, t}[D(G(nz, t))]
wherein G is the generator network, D is the discriminator network, Enc is the image encoder network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz denotes first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
Preferably, the method further comprises the following step: S4: constructing a new discriminator network for discriminating the images output by the decoder network from the images in the training set.
Preferably, the encoder network uses the convolutional layers of Resnet101, with two fully-connected layers added after the convolutional layers; the decoder network is a deconvolution network symmetric to Resnet101; the generator network is the channel convolution network of GAN-INT-CLS; the discriminator network is a convolutional network.
Preferably, the discriminator network is constrained in a spectral normalization manner.
Preferably, the generator network is in the same output format as the encoder network.
The invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method as set forth in any of the above.
The invention has the following beneficial effects: a method for generating an image from text is provided, which realizes text-to-image generation by training an adversarial visual semantic embedding model. By refining the visual feature embedding of existing text data through adversarial training, the method reduces the gap between the distributions of text-modality and image-modality data in the semantic space.
The existing text-to-image generation pipeline is improved: the unstable conventional CGAN training framework is avoided in favor of the more advanced WGAN training framework. This greatly enhances training stability, accelerates convergence, and reduces the probability of mode collapse.
Quantitative metrics of image generation quality, such as the Inception Score, are significantly improved.
Drawings
Fig. 1 is a schematic diagram of a method for generating an image from a text in an embodiment of the present invention.
FIG. 2 is a schematic diagram of a method for training the adversarial visual semantic embedding model according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a method for generating an image from text according to another embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. The connection may be for fixation or for circuit connection.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Example 1
In deep learning, the autoencoder (Auto-encoder) is a widely used method for building generative models and extracting data features. Many improvements and variants of the autoencoder exist, such as the denoising autoencoder and the sparse autoencoder. Among them, the most widely used is the Variational Auto-encoder, which is based on variational inference.
An autoencoder is composed of two deep neural networks: an encoder, responsible for compressing high-dimensional input samples into low-dimensional features, and a decoder, responsible for restoring the low-dimensional features to high-dimensional samples. To achieve this, the training objective of the autoencoder is to minimize the reconstruction loss between input and output; the L2 loss or the binary cross-entropy loss is generally adopted as the reconstruction loss.
The variational autoencoder further constrains the distribution of the encoder's output features, requiring it to be as close as possible to a known prior distribution (in practice, generally a multidimensional Gaussian or uniform distribution). Its loss function therefore adds a constraint on the encoder's output features to the reconstruction loss. Specifically, the KL divergence between the encoder's output distribution, described by its mean and variance, and the known prior distribution is used as the prior loss, which together with the reconstruction loss forms the training loss function of the variational autoencoder.
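The variational autoencoder loss described above can be sketched numerically. Below is a minimal illustration, not the patent's implementation, assuming a diagonal Gaussian posterior so that the KL term against a standard normal prior has a closed form:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Toy VAE objective: L2 reconstruction loss plus the closed-form
    KL divergence between N(mu, sigma^2) and the standard normal prior."""
    recon = np.sum((x - x_recon) ** 2)
    # KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl

# When the encoder output already matches the prior (mu = 0, sigma = 1)
# and reconstruction is perfect, the loss is zero.
x = np.ones((4, 8))
loss = vae_loss(x, x, np.zeros(8), np.zeros(8))
```

Shifting the posterior mean away from zero makes only the prior term grow, which is exactly the constraint the paragraph above describes.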
Because the invention needs to extract visual semantic features for the generative task of text-to-image synthesis, the invention adopts the structure and training mode of the variational autoencoder to extract image features and to convert text descriptions into images.
GAN and text-to-image generation
As a promising branch of generative models, GAN treats the training process as a zero-sum game between two competitors, the generator G and the discriminator D. The generator G aims to generate realistic images, while the discriminator D tries to distinguish real images from fake images generated by G. Training the GAN is equivalent to optimizing the following objective:
min_G max_D E_{x~P_data}[log D(x)] + E_{z~P(z)}[log(1 − D(G(z)))]  (1)
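For discrete distributions, the value of the GAN objective at the known optimal discriminator D*(x) = P_data(x) / (P_data(x) + P_G(x)) can be checked numerically; it reaches its minimum of -log 4 exactly when the generator matches the data distribution. A small illustrative sketch (the function name is hypothetical):

```python
import numpy as np

def gan_value_with_optimal_D(p, q):
    """Value of the inner maximum of the GAN objective for discrete
    distributions p (data) and q (generator), plugging in the known
    optimal discriminator D*(x) = p(x) / (p(x) + q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    d_star = p / (p + q)
    return np.sum(p * np.log(d_star)) + np.sum(q * np.log(1.0 - d_star))

# When the generator matches the data distribution, D* = 1/2 everywhere
# and the objective reaches its minimum, -log 4.
uniform = np.ones(4) / 4
v_min = gan_value_with_optimal_D(uniform, uniform)
v_off = gan_value_with_optimal_D(uniform, np.array([0.7, 0.1, 0.1, 0.1]))
```

Any mismatch between the two distributions raises the value above -log 4, which is what makes the objective usable as a training signal.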
the condition GAN (cgan) extends GAN by providing additional label information as input conditions to generator G and arbiter D.
The GAN-based text-to-image generation method GAN-INT-CLS further strengthens the use of the conditional supervision signal on the basis of CGAN: during training, GAN-INT-CLS provides the discriminator with three types of sample pairs, namely a real image with a text matching the image content, a real image with a text not matching the image content, and a generated image with a text matching the image content. Of these three types, only the first is a positive sample; the remaining two are negative samples. In this case, the discriminator's training objective is:
L_D = log D(x, φ(t)) + (1/2)[log(1 − D(x, φ(t̂))) + log(1 − D(x̂, φ(t)))]  (2)
wherein x is a real image, t its matching text description, t̂ a mismatched text description, φ the text encoder, and x̂ a generated image.
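The three kinds of discrimination sample pairs described above can be sketched as follows; `make_discriminator_pairs` and the toy data are hypothetical names for illustration only:

```python
import random

def make_discriminator_pairs(images, texts, fake_images):
    """Matching-aware sample pairs in the spirit of GAN-INT-CLS: for each
    index i, (real image, matching text) is the positive pair, while
    (real image, mismatched text) and (fake image, matching text) are the
    two kinds of negatives. The three lists are parallel (toy data)."""
    pairs = []
    n = len(images)
    for i in range(n):
        j = random.choice([k for k in range(n) if k != i])  # mismatched text
        pairs.append((images[i], texts[i], 1))        # positive: matched
        pairs.append((images[i], texts[j], 0))        # negative: wrong text
        pairs.append((fake_images[i], texts[i], 0))   # negative: fake image
    return pairs

pairs = make_discriminator_pairs(["img0", "img1"], ["txt0", "txt1"],
                                 ["fake0", "fake1"])
```

One positive and two negatives per training example is what lets the discriminator penalize both unrealistic images and text mismatches.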
WGAN based on optimal transport theory
From the viewpoint of game theory, GAN treats the image generation process as a zero-sum game between two competitors, the generator G and the discriminator D. The generator G aims to generate realistic images, while the discriminator D tries to distinguish the "fake" images generated by G from real, natural images. Training the GAN amounts to reaching a Nash equilibrium between G and D. In theory, the existing GAN training objective minimizes the JS divergence between the generated distribution and the target distribution, but practical tricks and improvements, such as the method of formula (2), break this objective, and model training often suffers from instability and mode collapse.
To address this problem, the present invention replaces the training objective based on the JS divergence between distributions with one based on optimal transport theory: the Wasserstein distance. The training objective of the Wasserstein-distance-based WGAN (Wasserstein GAN) is as follows:
EMD(P_r, P_θ) = inf_{γ∈Π} Σ_{x,y} γ(x, y)·||x − y||  (3)
wherein EMD is an abbreviation of Earth Mover Distance, with the same meaning as W in formula (4); P_r denotes a probability distribution, and likewise P_θ; x and y are samples subject to P_r and P_θ respectively; and γ(x, y) represents the optimal transport plan between the distributions. Since the optimal transport plan is itself the solution of an optimization problem, via the Kantorovich-Rubinstein dual transformation the final training objective can be described as:
W(P_r, P_θ) = sup_{||f||_L ≤ 1} E_{x~P_r}[f(x)] − E_{x~P_θ}[f(x)]  (4)
wherein 1-Lipschitz denotes first-order Lipschitz continuity.
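In one dimension, the Earth Mover Distance of equation (3) has a simple closed form via cumulative distribution functions. A small sketch, assuming both discrete distributions place their mass on the same sorted support points:

```python
import numpy as np

def emd_1d(p, q, positions):
    """Earth Mover's Distance between two discrete distributions whose
    mass sits on the same sorted 1-D `positions`. In one dimension the
    optimal transport cost equals the CDF difference integrated against
    the gaps between neighbouring positions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    cdf_gap = np.cumsum(p - q)[:-1]              # CDF difference per bin
    widths = np.diff(np.asarray(positions, float))
    return np.sum(np.abs(cdf_gap) * widths)

# Moving all mass one unit to the right costs exactly 1.
d = emd_1d([1.0, 0.0], [0.0, 1.0], positions=[0.0, 1.0])
```

Unlike the JS divergence, this distance grows smoothly with how far apart the distributions sit, which is the property WGAN exploits for stable gradients.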
To solve this problem, the present invention improves the existing text visual semantic embedding method by means of adversarial training with a generative adversarial network. Specifically, the invention first uses a variational autoencoder to extract image features. A discriminator is then used to discriminate between the image features extracted by the encoder, taken as positive samples, and the text features, taken as negative samples, thereby learning a deep neural network that maps text features onto image features.
Under this training framework, texts and images are input in matched pairs and the traditional conditional adversarial training mode is not needed, so the training objective can be switched to the more advanced Wasserstein-distance-based objective, alleviating the mode collapse and convergence difficulties of traditional generative adversarial networks.
The invention proposes a new text visual feature embedding method, called Adversarial Visual Semantic Embedding (AdViSE). The model consists of four deep neural networks: an image encoder, an image decoder, a text feature embedding network, and a discriminator paired with the text feature embedding network.
As shown in fig. 1, the present invention provides a method for generating an image from a text, comprising the following steps:
s1: training an adversarial visual semantic embedding model, the model comprising an image encoder network, an image decoder network, a generator network, and a discriminator network paired with the generator network;
s2: inputting text into the generator network, which outputs a text feature embedding;
s3: inputting the text feature embedding into the decoder network, which outputs an image conforming to the semantic description of the text.
The AdViSE model first trains a variational autoencoder on images, giving the model the ability to extract image semantic features. The input and output of the variational autoencoder are both images of 128x128x3 resolution. The output of the encoder is two 1024-dimensional vectors μ and σ, representing the mean and variance of the image feature distribution respectively.
After the autoencoder has trained for a period, once the generated images are no longer mere noise, the adversarial training of the text semantic embedding network can begin. The text's word sequence is first input to the pre-trained text feature extraction network φ adopted in GAN-INT-CLS; the output of φ is then concatenated with noise sampled from a 100-dimensional Gaussian distribution and fed into the Generator network. The output format of the Generator network is the same as that of the encoder, since the text features output by the Generator need to fit the visual features as closely as possible.
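The generator input just described, a pretrained text feature concatenated with 100-dimensional Gaussian noise, can be sketched as follows; the 1024-dimensional embedding size and the function name are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_input(text_embedding, nz_dim=100):
    """Sketch of the generator's input: the pretrained text feature
    phi(t) concatenated with nz_dim-dimensional Gaussian noise, as
    described above. The 1024-d embedding size mirrors the encoder's
    output dimension (an assumption for this illustration)."""
    noise = rng.standard_normal(nz_dim)
    return np.concatenate([text_embedding, noise])

phi_t = np.zeros(1024)            # stand-in for the pretrained text feature
z = generator_input(phi_t)        # 1024 + 100 = 1124-dimensional input
```

The noise component is what lets one text description map to many plausible images rather than a single deterministic output.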
As shown in fig. 2, training the adversarial visual semantic embedding model comprises the following steps:
s11: constructing the adversarial visual semantic embedding model;
s12: training the image encoder network and the image decoder network using a reconstruction loss function;
s13: training the generator network and the decoder network using the reconstruction loss function;
s14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
In one embodiment of the invention, the entire AdViSE model consists of four networks. The encoder uses the convolutional layers of the classical Resnet101; to output the mean and variance vectors μ and σ, two fully-connected layers are added after the last convolutional layer. The corresponding decoder is a deconvolution network symmetric to Resnet101. The text semantic embedding network (the Generator) adopts the channel convolution network of GAN-INT-CLS, and the Discriminator adopts an ordinary convolutional network.
Regarding the training objective of the model: because the generator network and the discriminator network are trained as a pair, and the encoder network and the decoder network are also trained as a pair, the training objective of the whole model is divided into two parts.
In the first part, the training objective of the encoder and decoder networks is that of the variational autoencoder. The loss function for this objective is the sum of two terms, as shown in equation (5):
L = ||x − Dec(Enc(x))||² + D_KL(Enc(x) || P(z))  (5)
wherein Enc represents the image encoder network, Dec the image decoder network, z the features extracted by the encoder, x the training image samples, D_KL the KL divergence, and P(z) the prior distribution. The former term, the reconstruction loss, measures the difference between the input and output images. The latter term, the prior loss, constrains the difference between the encoder features and the prior distribution P(z); the KL divergence is used as the measure of both.
In the second part, the training objective of the generator and discriminator networks is that of the Wasserstein-distance-based WGAN:
min_G max_{D∈1-Lipschitz} E_{x~P_data}[D(Enc(x))] − E_{nz, t}[D(G(nz, t))]  (6)
wherein G is the generator network, D is the discriminator network, Enc is the image encoder network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz denotes first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
The original WGAN method of enforcing the 1-Lipschitz constraint causes a large loss of model expressiveness. We therefore use spectral normalization to constrain the discriminator D so that it satisfies first-order Lipschitz continuity.
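Spectral normalization divides each weight matrix by an estimate of its largest singular value, usually obtained by power iteration, so that each layer becomes approximately 1-Lipschitz. A minimal numpy sketch (not the patent's implementation):

```python
import numpy as np

def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration.
    Spectral normalization divides the weights by this quantity so the
    layer's Lipschitz constant is (approximately) 1."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v

W = np.diag([3.0, 1.0, 0.5])
sigma = spectral_norm(W)
W_sn = W / sigma     # the normalized weights have spectral norm ~1
```

In practice one power-iteration step per training update is usually enough, since the weights change slowly between updates.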
In the above process of obtaining the visual semantic embedding of text, if the embedding is input directly to the Decoder, a text-to-image generative model is obtained naturally. Therefore, after training of the four AdViSE networks finishes, the generator network and the decoder network can be paired as a text-to-image generation model and fine-tuned on this basis, for example by adding a perceptual loss. The ordinary reconstruction loss directly takes a norm ||x − x′|| between the input sample x and the output sample x′; for the perceptual reconstruction loss, a pre-trained convolutional network (typically VGG16 trained on ImageNet), denoted f, is used, and the loss function becomes ||f(x) − f(x′)||, further improving the text-to-image generation effect.
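The contrast between the plain pixel-space loss ||x − x′|| and the perceptual loss ||f(x) − f(x′)|| can be sketched with a stand-in feature extractor; the random linear map below is only a placeholder for a pretrained network such as VGG16:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a frozen, pretrained feature extractor f (placeholder only).
W_feat = rng.standard_normal((16, 64)) * 0.1

def pixel_loss(x, x_prime):
    """Plain reconstruction loss: a norm taken directly on pixels."""
    return np.linalg.norm(x - x_prime)

def perceptual_loss(x, x_prime):
    """Perceptual loss: the same norm, computed in the feature space of a
    fixed network f (here a small random map standing in for VGG16)."""
    f = lambda v: np.tanh(W_feat @ v)
    return np.linalg.norm(f(x) - f(x_prime))

x = rng.standard_normal(64)
```

Because f is frozen, the perceptual term penalizes differences the feature network deems salient rather than raw per-pixel error.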
As shown in fig. 3, the method for generating an image from a text further includes the following steps:
s4: constructing a new discriminator network for discriminating the images output by the decoder network from the images in the training set.
The effect of text-to-image generation can be further improved by adding a discriminator.
Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
To verify its effectiveness, the invention is compared with existing text-to-image generation models on two public datasets (CUB and Oxford-102).
TABLE 1 text-to-image generative model comparison
In quantitative experiments, the AdViSE method obtained the highest scores on both CUB and Oxford-102, as shown in Table 1.
In qualitative experiments, basic text-to-image generation is performed on CUB and Oxford-102. Observing the generated images, the method of the invention, although differing in pipeline from prior methods, can generate pictures satisfying the image descriptions, and the details described in the text all correspond to the generated images.
Simply pairing the AdViSE embedding with the Decoder already realizes text-to-image conversion with good results. Introducing a perceptual loss on this basis yields a further improvement, and adding a new discriminator after the Decoder of the basic AdViSE yields a significant improvement. On the other hand, replacing the text semantic embedding of an existing method with the pre-trained AdViSE visual semantic embedding brings a large improvement in image generation capability, showing that AdViSE can better exploit the potential of text-modality features than existing methods.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such substitutions and modifications are considered to be within the scope of the invention.
Claims (10)
1. A method for generating an image from text, comprising the steps of:
s1: training an adversarial visual semantic embedding model, the model comprising an image encoder network, an image decoder network, a generator network, and a discriminator network paired with the generator network;
s2: inputting text into the generator network, which outputs a text feature embedding;
s3: inputting the text feature embedding into the decoder network, which outputs an image conforming to the semantic description of the text.
2. The method of text-generating images of claim 1, wherein training the adversarial visual semantic embedding model comprises the steps of:
s11: constructing the adversarial visual semantic embedding model;
s12: training the image encoder network and the image decoder network using a reconstruction loss function;
s13: training the generator network and the decoder network using the reconstruction loss function;
s14: training the generator network and the discriminator network using the Wasserstein distance as a loss function.
3. The method of text-generating images of claim 2, wherein the reconstruction loss function is:
L = ||x − Dec(Enc(x))||² + D_KL(Enc(x) || P(z))
wherein Enc represents the image encoder network, Dec represents the image decoder network, z represents the features extracted by the encoder, x represents the training image samples, D_KL represents the KL divergence, and P(z) is the prior distribution.
4. The method of text-generating an image of claim 3, wherein perceptual loss is added to the reconstruction loss function.
5. The method of text-generating images of claim 2, wherein the Wasserstein distance is:
min_G max_{D∈1-Lipschitz} E_{x~P_data}[D(Enc(x))] − E_{nz, t}[D(G(nz, t))]
wherein G is the generator network, D is the discriminator network, Enc is the image encoder network, nz is 100-dimensional noise sampled from a standard Gaussian distribution, x is a training set picture, t is a training set text description, 1-Lipschitz denotes first-order Lipschitz continuity, P_data is the probability distribution of the training set pictures, and E denotes expectation.
6. The method of text-generating an image of claim 1, further comprising the steps of:
s4: constructing a new discriminator network for discriminating the images output by the decoder network from the images in the training set.
7. The method of text-generating images as claimed in any one of claims 1 to 6, wherein said encoder network employs the convolutional layers of Resnet101, with two fully-connected layers added after the convolutional layers;
the decoder network is a deconvolution network symmetric to Resnet 101;
the generator network is a channel convolution network of the GAN-INT-CLS;
the discriminator network is a convolutional network.
8. The method of text-generating images of any of claims 1-6, wherein the network of discriminators is constrained using spectral normalization.
9. The method of text-generating images of any of claims 1-6, wherein the generator network has the same output format as the encoder network.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
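Claims 3 and 4 describe a reconstruction loss combining the encoder, decoder, and a KL term against the prior P(z); the formula itself is not reproduced in this text. As a hedged sketch, assuming the common VAE-style formulation (squared reconstruction error plus the closed-form KL divergence between a diagonal Gaussian posterior and the standard Gaussian prior; the patent's exact weighting may differ):

```python
import math

def reconstruction_loss(x, x_rec, mu, log_var):
    """VAE-style reconstruction loss sketch (assumed formulation).

    x, x_rec: flattened input sample and decoder output Dec(Enc(x)).
    mu, log_var: mean and log-variance of the encoder's latent Gaussian.
    """
    # pixel-wise squared error between the sample and its reconstruction
    mse = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    # closed-form D_KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return mse + kl
```

With a perfect reconstruction and a posterior equal to the prior (mu = 0, log_var = 0), the loss is zero, which matches the intuition behind claim 4's option of adding a perceptual term on top.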
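Claims 2 and 5 train the generator and discriminator with the Wasserstein distance as the loss. A minimal sketch of the resulting WGAN-style losses, assuming the usual critic objective E[D(x, t)] − E[D(G(z, t), t)] with the 1-Lipschitz constraint enforced separately (for example by the spectral normalization of claim 8); D(...) scores are taken here as precomputed scalars:

```python
def wgan_losses(real_scores, fake_scores):
    """WGAN-style losses sketch (assumed standard formulation).

    real_scores: D(x, t) on training-set pictures with their text.
    fake_scores: D(G(z, t), t) on generated samples.
    """
    mean = lambda xs: sum(xs) / len(xs)
    # the critic maximises E[D(x)] - E[D(G(z))], so its loss is the negation
    d_loss = -(mean(real_scores) - mean(fake_scores))
    # the generator maximises E[D(G(z))]
    g_loss = -mean(fake_scores)
    return d_loss, g_loss
```

In a training loop the two losses are minimised alternately, as claim 2's step s14 alternates updates of the generator and discriminator networks.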
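Claim 8 constrains the discriminator with spectral normalization, i.e. dividing each weight matrix by its largest singular value so the layer is approximately 1-Lipschitz. A sketch of the power-iteration estimate commonly used for that singular value (the iteration count and initial vector are illustrative assumptions, not from the patent):

```python
def spectral_norm(w, n_iter=50):
    """Estimate the largest singular value of matrix w (list of rows)
    by power iteration, as used for spectral normalization."""
    rows, cols = len(w), len(w[0])
    u = [1.0] * rows  # arbitrary non-zero starting vector
    for _ in range(n_iter):
        # v <- normalize(W^T u)
        v = [sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        nv = max(sum(x * x for x in v) ** 0.5, 1e-12)
        v = [x / nv for x in v]
        # u <- normalize(W v)
        u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        nu = max(sum(x * x for x in u) ** 0.5, 1e-12)
        u = [x / nu for x in u]
    # sigma = u^T W v, the dominant singular value
    return sum(u[i] * w[i][j] * v[j] for i in range(rows) for j in range(cols))
```

Dividing the discriminator's weight matrices by this sigma after each update keeps their spectral norm near 1, which is one standard way to meet the 1-Lipschitz requirement of the Wasserstein loss in claim 5.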
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911033265.0A CN110866958B (en) | 2019-10-28 | 2019-10-28 | Method for text to image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110866958A true CN110866958A (en) | 2020-03-06 |
CN110866958B CN110866958B (en) | 2023-04-18 |
Family
ID=69653498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911033265.0A Active CN110866958B (en) | 2019-10-28 | 2019-10-28 | Method for text to image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110866958B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180373979A1 (en) * | 2017-06-22 | 2018-12-27 | Adobe Systems Incorporated | Image captioning utilizing semantic text modeling and adversarial learning |
CN109543159A (en) * | 2018-11-12 | 2019-03-29 | 南京德磐信息科技有限公司 | A kind of text generation image method and device |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111402365A (en) * | 2020-03-17 | 2020-07-10 | 湖南大学 | Method for generating picture from characters based on bidirectional architecture confrontation generation network |
CN111402365B (en) * | 2020-03-17 | 2023-02-10 | 湖南大学 | Method for generating picture from characters based on bidirectional architecture confrontation generation network |
CN111062865A (en) * | 2020-03-18 | 2020-04-24 | 腾讯科技(深圳)有限公司 | Image processing method, image processing device, computer equipment and storage medium |
CN111768326A (en) * | 2020-04-03 | 2020-10-13 | 南京信息工程大学 | High-capacity data protection method based on GAN amplification image foreground object |
CN111768326B (en) * | 2020-04-03 | 2023-08-25 | 南京信息工程大学 | High-capacity data protection method based on GAN (gas-insulated gate bipolar transistor) amplified image foreground object |
CN112418310A (en) * | 2020-11-20 | 2021-02-26 | 第四范式(北京)技术有限公司 | Text style migration model training method and system and image generation method and system |
CN116710910A (en) * | 2020-12-29 | 2023-09-05 | 迪真诺有限公司 | Design generating method based on condition generated by learning and device thereof |
CN112990302A (en) * | 2021-03-11 | 2021-06-18 | 北京邮电大学 | Model training method and device based on text generated image and image generation method |
CN113191375A (en) * | 2021-06-09 | 2021-07-30 | 北京理工大学 | Text-to-multi-object image generation method based on joint embedding |
CN113298895A (en) * | 2021-06-18 | 2021-08-24 | 上海交通大学 | Convergence guarantee-oriented unsupervised bidirectional generation automatic coding method and system |
CN113449491A (en) * | 2021-07-05 | 2021-09-28 | 思必驰科技股份有限公司 | Pre-training framework for language understanding and generation with two-stage decoder |
CN113449491B (en) * | 2021-07-05 | 2023-12-26 | 思必驰科技股份有限公司 | Pre-training framework for language understanding and generation with two-stage decoder |
CN113627567A (en) * | 2021-08-24 | 2021-11-09 | 北京达佳互联信息技术有限公司 | Picture processing method, text processing method, related equipment and storage medium |
CN113627567B (en) * | 2021-08-24 | 2024-04-02 | 北京达佳互联信息技术有限公司 | Picture processing method, text processing method, related device and storage medium |
CN113837229A (en) * | 2021-08-30 | 2021-12-24 | 厦门大学 | Knowledge-driven text-to-image generation method |
CN113837229B (en) * | 2021-08-30 | 2024-03-15 | 厦门大学 | Knowledge-driven text-to-image generation method |
CN114638905A (en) * | 2022-01-30 | 2022-06-17 | 中国科学院自动化研究所 | Image generation method, device, equipment, storage medium and computer program product |
CN114638905B (en) * | 2022-01-30 | 2023-02-21 | 中国科学院自动化研究所 | Image generation method, device, equipment and storage medium |
CN114677569A (en) * | 2022-02-17 | 2022-06-28 | 之江实验室 | Character-image pair generation method and device based on feature decoupling |
CN114677569B (en) * | 2022-02-17 | 2024-05-10 | 之江实验室 | Character-image pair generation method and device based on feature decoupling |
CN114648681B (en) * | 2022-05-20 | 2022-10-28 | 浪潮电子信息产业股份有限公司 | Image generation method, device, equipment and medium |
CN114648681A (en) * | 2022-05-20 | 2022-06-21 | 浪潮电子信息产业股份有限公司 | Image generation method, device, equipment and medium |
CN115879515B (en) * | 2023-02-20 | 2023-05-12 | 江西财经大学 | Document network theme modeling method, variation neighborhood encoder, terminal and medium |
CN115879515A (en) * | 2023-02-20 | 2023-03-31 | 江西财经大学 | Document network theme modeling method, variation neighborhood encoder, terminal and medium |
CN116208772A (en) * | 2023-05-05 | 2023-06-02 | 浪潮电子信息产业股份有限公司 | Data processing method, device, electronic equipment and computer readable storage medium |
CN116939320A (en) * | 2023-06-12 | 2023-10-24 | 南京邮电大学 | Method for generating multimode mutually-friendly enhanced video semantic communication |
Also Published As
Publication number | Publication date |
---|---|
CN110866958B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110866958B (en) | Method for text to image | |
CN111754596B (en) | Editing model generation method, device, equipment and medium for editing face image | |
CN107480144B (en) | Method and device for generating image natural language description with cross-language learning capability | |
CN111191075B (en) | Cross-modal retrieval method, system and storage medium based on dual coding and association | |
CN112817914A (en) | Attention-based deep cross-modal Hash retrieval method and device and related equipment | |
CN110798636B (en) | Subtitle generating method and device and electronic equipment | |
CN111402257A (en) | Medical image automatic segmentation method based on multi-task collaborative cross-domain migration | |
US20190303499A1 (en) | Systems and methods for determining video content relevance | |
CN111275784A (en) | Method and device for generating image | |
CN112584062B (en) | Background audio construction method and device | |
Fried et al. | Patch2vec: Globally consistent image patch representation | |
CN111078940A (en) | Image processing method, image processing device, computer storage medium and electronic equipment | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
CN113961736A (en) | Method and device for generating image by text, computer equipment and storage medium | |
CN116304984A (en) | Multi-modal intention recognition method and system based on contrast learning | |
CN116564338B (en) | Voice animation generation method, device, electronic equipment and medium | |
CN113361646A (en) | Generalized zero sample image identification method and model based on semantic information retention | |
CN111178039A (en) | Model training method and device, and method and device for realizing text processing | |
CN115115540A (en) | Unsupervised low-light image enhancement method and unsupervised low-light image enhancement device based on illumination information guidance | |
CN112668608A (en) | Image identification method and device, electronic equipment and storage medium | |
CN111582066A (en) | Heterogeneous face recognition model training method, face recognition method and related device | |
KR20210047467A (en) | Method and System for Auto Multiple Image Captioning | |
KR20210058059A (en) | Unsupervised text summarization method based on sentence embedding and unsupervised text summarization device using the same | |
CN117197268A (en) | Image generation method, device and storage medium | |
Yang et al. | Finding badly drawn bunnies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||