CN113961736A - Method and device for generating image by text, computer equipment and storage medium

Method and device for generating image by text, computer equipment and storage medium

Info

Publication number
CN113961736A
Authority
CN
China
Prior art keywords
text
image
network
inputting
generator
Prior art date
Legal status
Pending
Application number
CN202111072292.6A
Other languages
Chinese (zh)
Inventor
陆璐
叶锡洪
冼允廷
Current Assignee
Guangdong Yousuan Technology Co ltd
South China University of Technology SCUT
Original Assignee
Guangdong Yousuan Technology Co ltd
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by Guangdong Yousuan Technology Co ltd, South China University of Technology SCUT filed Critical Guangdong Yousuan Technology Co ltd
Priority to CN202111072292.6A
Publication of CN113961736A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a method, an apparatus, a computer device and a storage medium for generating an image from text, wherein the method comprises the following steps: acquiring a text-image pair in a database and taking the text in the text-image pair as the original text; inputting the original text into a multi-stage generative adversarial network to obtain a corresponding image; inputting the corresponding image into a trained image annotation network to generate a predicted text; inputting the predicted text and the original text into a trained twin neural network to obtain the similarity between the predicted text and the original text; training the multi-stage generative adversarial network according to the similarity to obtain a trained multi-stage generative adversarial network; and inputting text entered by a user into the trained multi-stage generative adversarial network to generate the image corresponding to that text. The invention adopts a multi-stage generative adversarial network to progressively increase the resolution and quality of the generated image, and improves the realism of the generated image by adding an attention mechanism, thereby improving the semantic consistency between the generated image and the text.

Description

Method and device for generating image by text, computer equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing and computer vision, and in particular, to a method and apparatus for generating an image from a text, a computer device, and a storage medium.
Background
Computer vision and natural language processing each deal with a single type of data, images or text respectively. Computer vision focuses mainly on the understanding of pictures and includes subtasks such as image semantic segmentation, image classification, and target retrieval, while natural language processing focuses mainly on modeling text information and includes subtasks such as machine translation, named entity recognition, and word segmentation. In recent years, multi-modal tasks that combine several data types such as images, text, and video have attracted increasing attention from researchers, since they can relate different types of data through mappings, fusion, and the like. The two most common data types in multi-modal tasks are text and images, and cross-modal retrieval and image caption generation are common research directions within them.
Text and images are two different types of information carriers, and both play an important role in daily life. An image visually shows the content it contains and presents details that text omits, whereas text is concise and complete, able to express through a brief description what would otherwise require many images. Combining the two therefore allows objects to be described comprehensively, with pictures and words together. Such scenarios are ubiquitous: pictures produced by designers often fail to match a customer's description, and even repeated revision may not meet the customer's requirements; at a crime scene, witnesses can describe a suspect's appearance only verbally, and converting such descriptions into pictures for public reference requires professional sketch artists, which is time-consuming, labor-intensive, and not necessarily effective.
The text-to-image generation task takes a passage of textual description as input and generates a corresponding image. GAN-INT-CLS, proposed by Reed et al. in 2016 for this task, made it possible to convert manually written descriptive text into corresponding images. StackGAN pioneered stacking two cGANs together: the first-stage cGAN generates a low-resolution image containing the contours and colors of the main subject, and the second stage enlarges that low-resolution image into a high-resolution image with a more vivid subject. AttnGAN was proposed to improve semantic consistency: the model encodes the text description into both sentence features and word features, uses the sentence features as network input to generate an initial low-resolution image, and uses the word features in the subsequent generation process to extract important words, find the image sub-regions corresponding to them, raise the attention paid to those regions, and generate fine-grained details in the important sub-regions, thereby improving the semantic consistency of the image.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a method, an apparatus, a computer device, and a storage medium for generating an image from text. The method adopts a multi-stage generative adversarial network to progressively increase the resolution and quality of the generated image, avoiding the low resolution and poor quality of images produced by a single generative adversarial network. At the same time, an attention mechanism is added between the cascaded generators to focus on the important parts of the output features, further improving the realism of the generated image and the semantic consistency between the generated image and the text.
A first object of the present invention is to provide a method of generating an image from text.
A second object of the present invention is to provide an apparatus for generating an image from text.
It is a third object of the invention to provide a computer device.
It is a fourth object of the present invention to provide a storage medium.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a method of text generating an image, the method comprising:
acquiring a text image pair in a database; the text image pair comprises a text and an image, and the text is a descriptive text of the image and serves as an original text;
inputting the original text into a plurality of stages to generate a confrontation network, and obtaining a corresponding image;
inputting the corresponding image into a trained image labeling network to generate a prediction text;
inputting the predicted text and the original text into a trained twin neural network to obtain the similarity between the predicted text and the original text;
training the multistage generation countermeasure network according to the similarity between the predicted text and the original text to obtain a well-trained multistage generation countermeasure network;
inputting the text input by the user into the trained multistage generation countermeasure network, and generating an image corresponding to the text.
Further, the inputting the original text into a multi-stage generative adversarial network to obtain a corresponding image specifically comprises:
before inputting the original text into the multi-stage generative adversarial network, first inputting the original text into a text encoder to obtain a text feature vector;
and inputting the sentence embedding feature vector of the text feature vector into the multi-stage generative adversarial network to obtain the corresponding image.
Further, the multi-stage generative adversarial network comprises n generators and n-1 attention mechanism modules, where n is a positive integer greater than 1;
the inputting the sentence embedding feature vector of the text feature vector into the multi-stage generative adversarial network to obtain the corresponding image specifically comprises:
before inputting the sentence embedding feature vector into a generator of the multi-stage generative adversarial network, performing condition enhancement on the sentence embedding feature vector to obtain an enhanced sentence embedding feature vector;
when i = 1, inputting the enhanced sentence embedding feature vector into the i-th generator to obtain the output features of the i-th generator;
when i is a positive integer greater than 1 and less than or equal to n, inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important part of the output features of the (i-1)-th generator, and then inputting that important part together with the output features of the (i-1)-th generator into the i-th generator to obtain the output features of the i-th generator;
taking the output features of the n-th generator as the corresponding image;
as the number of generators increases, the resolution of the generator output image gradually increases (a sketch of this cascade is given below).
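The cascade just described can be illustrated with the following minimal Python/PyTorch sketch; the call signatures of `generators` and `attn_modules` are assumed interfaces for exposition, not the patented implementation:

```python
import torch

def multi_stage_generate(sent_emb, noise, word_embs, generators, attn_modules):
    """Cascade of n generators with n-1 attention modules between them.
    sent_emb: enhanced sentence embedding; word_embs: word embedding matrix."""
    # Stage 1: the enhanced sentence embedding (concatenated with noise)
    # drives the first, lowest-resolution generator.
    feat = generators[0](torch.cat([sent_emb, noise], dim=1))
    for i in range(1, len(generators)):
        # Attention over the previous stage's features picks out the
        # important parts relative to the keywords of the original text.
        important = attn_modules[i - 1](feat, word_embs)
        feat = generators[i](feat, important)
    return feat  # output features of the n-th generator = the generated image
```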
Further, the inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important part of the output features of the (i-1)-th generator specifically comprises:
inputting the word embedding feature matrix of the text feature vector and the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module;
and computing, through the attention mechanism of the (i-1)-th attention mechanism module, the parts of the output features of the (i-1)-th generator most relevant to the keywords of the original text, thereby obtaining the important part of the output features of the (i-1)-th generator.
Further, the multi-stage generative adversarial network further comprises n discriminators, each discriminator corresponding to one generator;
the training the multi-stage generative adversarial network according to the similarity between the predicted text and the original text to obtain the trained multi-stage generative adversarial network specifically comprises:
in the training process of the multi-stage generative adversarial network, one round of training comprises the following two processes:
fixing the parameters of all generators, and updating the parameters of the discriminators using the loss function of the discriminators;
fixing the parameters of all discriminators, and updating the parameters of the generators using the loss function of the generators and the similarity between the predicted text and the original text;
and performing multiple rounds of training on the multi-stage generative adversarial network using the similarities between the plurality of predicted texts and the original texts, thereby obtaining the trained multi-stage generative adversarial network.
Further, the image in the text-image pair corresponding to the original text is taken as the real image;
the input of the k-th discriminator comprises the output features of the k-th generator and the real image, where k is a positive integer greater than or equal to 1 and less than n+1;
when k = 1, the input of the k-th discriminator further comprises the sentence embedding feature vector;
when 1 < k ≤ n, the input of the k-th discriminator further comprises the word embedding feature matrix of the text feature vector;
the loss function of the discriminator is as follows:

$$\mathcal{L}_D = -\mathbb{E}_{I \sim p_{\mathrm{data}}}\left[\log D(I, e)\right] - \mathbb{E}_{s_0 \sim p_G}\left[\log\left(1 - D(G(s_0, c), e)\right)\right]$$

wherein e is the sentence embedding feature vector or the word embedding feature matrix, I is the real image, s_0 represents the output features of the previous generator, c is the sentence embedding feature vector, G(s_0, c) is the output features of the generator, IC is the image annotation network, D(·) is the output of the discriminator, and sim is the similarity between the predicted text and the original text;
the loss function of the generator is as follows:

$$\mathcal{L}_G = -\mathbb{E}_{s_0 \sim p_G}\left[\log D(G(s_0, c), e)\right] - \lambda\,\mathrm{sim}\left(IC(G(s_0, c)),\, T\right) + D_{KL}\left(\mathcal{N}(\mu(e), \Sigma(e))\,\|\,\mathcal{N}(0, I)\right)$$

wherein T is the original text, λ is a weighting coefficient, and D_KL(·‖·) is the KL divergence between the text feature vector and the Gaussian distribution.
Further, the image annotation network comprises an encoder and a decoder, wherein the encoder comprises a convolutional neural network and a linear transform, and the decoder comprises an LSTM network;
the inputting the corresponding image into a trained image annotation network to generate a predicted text specifically comprises:
inputting the corresponding image into the convolutional neural network to obtain a feature matrix of the image;
applying the linear transform to the feature matrix of the image to obtain a transformed feature matrix;
and inputting the transformed feature matrix into the LSTM network to generate the predicted text.
Further, the twin neural network comprises a text feature extraction network and a pooling layer;
the inputting the predicted text and the original text into a trained twin neural network to obtain the similarity between the predicted text and the original text specifically comprises:
inputting the predicted text and the original text separately into the text feature extraction network to obtain their respective extracted text features;
inputting the respective extracted text features into the pooling layer to obtain a feature vector U and a feature vector V;
and calculating the similarity of the feature vector U and the feature vector V by cosine similarity, with the formula:

$$\mathrm{sim}(U, V) = \frac{\sum_i U_i V_i}{\sqrt{\sum_i U_i^2}\,\sqrt{\sum_i V_i^2}}$$

wherein U_i and V_i are the i-th components of U and V, respectively.
The second purpose of the invention can be achieved by adopting the following technical scheme:
an apparatus for text-generating an image, the apparatus comprising:
the text image pair acquisition module is used for acquiring a text image pair in the database; the text image pair comprises a text and an image, and the text is a descriptive text of the image and serves as an original text;
the predicted image generation module is used for inputting the original text into a multilevel generation countermeasure network to obtain a corresponding image;
the predicted text generation module is used for inputting the corresponding image into a trained image labeling network to generate a predicted text;
the similarity calculation module is used for inputting the predicted text and the original text into the trained twin neural network to obtain the similarity between the predicted text and the original text;
the multi-stage generation confrontation network training module is used for training the multi-stage generation confrontation network according to the similarity between the predicted text and the original text to obtain a trained multi-stage generation confrontation network;
and the text generation image module is used for inputting the text input by the user into the trained multistage generation countermeasure network and generating an image corresponding to the text.
The third purpose of the invention can be achieved by adopting the following technical scheme:
a computer device comprising a processor and a memory for storing a program executable by the processor, the processor implementing the method of text generation image described above when executing the program stored in the memory.
The fourth purpose of the invention can be achieved by adopting the following technical scheme:
a storage medium stores a program which, when executed by a processor, implements the method of generating an image from a text as described above.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention adopts a progressive multi-stage generative adversarial network to gradually increase the resolution and quality of the generated image, avoiding the low resolution and poor quality of images produced by a single generative adversarial network. Meanwhile, an attention mechanism is added between the cascaded generators to focus on the important parts of the output features, further improving the realism of the generated images.
2. The invention adopts text alignment: an image annotation network and a twin neural network are pre-trained, and during the training of the progressive multi-stage generative adversarial network the supervision signal is strengthened through text alignment, adding a text-alignment constraint on top of the conditional constraint of the discriminator and further improving the semantic consistency between the generated image and the text.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
Fig. 1 is a flowchart of a method for generating an image from a text according to embodiment 1 of the present invention.
Fig. 2 is a schematic structural diagram of the entire network according to embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of an image annotation network structure in embodiment 1 of the present invention.
Fig. 4 is a schematic structural diagram of a twin neural network in embodiment 1 of the present invention.
Fig. 5 is a block diagram showing a configuration of an apparatus for generating an image from a text according to embodiment 2 of the present invention.
Fig. 6 is a block diagram of a computer device according to embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention. It should be understood that the description of the specific embodiments is intended to be illustrative only and is not intended to be limiting.
Example 1:
as shown in fig. 1, the present embodiment provides a method for generating an image from a text, including the following steps:
s101, acquiring a text image pair in a database; the text image pair comprises a text and an image, and the text is descriptive text of the image and serves as an original text.
Image-text pair data are collected from the Internet, for example by means of a web crawler, and stored as text-image pairs in the database. The text in a text-image pair consists of descriptive sentences for the image, so the text and the image are semantically consistent.
In this embodiment, the entire network comprises a multi-stage generative adversarial network, an image annotation network, and a twin neural network.
S102, inputting the original text into the multi-stage generative adversarial network to obtain a corresponding image.
Further, step S102 includes:
(1) Before the original text is input into the multi-stage generative adversarial network, it is input into a text encoder to obtain a text feature vector.
(2) The multi-stage generative adversarial network.
The multi-stage generative adversarial network is a progressive multi-stage generative adversarial network comprising n generators, n discriminators, and n-1 attention mechanism modules, where each discriminator corresponds to one generator.
As shown in fig. 2, in this embodiment n = 3, i.e., there are three generators. The first generator is composed of 4 deconvolution blocks, each consisting of an upsampling layer and a spectral normalization layer. Each upsampling layer halves the number of channels of the three-dimensional feature vector while doubling the width and height of the feature. The feature vector generated by the first generator has dimensions 3 × 64 × 64. The spectral normalization layer improves the stability of the generative adversarial network during training and avoids problems such as mode collapse. The two subsequent generators are each composed of 4 deconvolution blocks, each consisting mainly of a convolution layer, a residual layer, and an upsampling layer; these produce a new feature matrix while increasing the size of the output image. Specifically, the convolution layer and the residual layer process the feature map generated by the previous generator, and the upsampling layer increases the image resolution (see the sketch below).
The output features of the previous generator serve as the input of the next generator. The output image resolution of the generators in the multi-stage generative adversarial network increases gradually, from 128 × 128 to 256 × 256 and finally 512 × 512.
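A minimal sketch of one such deconvolution block, assuming a PyTorch implementation (the exact layer composition and channel counts here are illustrative assumptions, not the patented architecture itself):

```python
import torch.nn as nn

def deconv_block(in_ch, out_ch):
    """One deconvolution block of the first generator: upsampling doubles
    width and height while the convolution halves the channel count, and
    spectral normalization stabilizes adversarial training."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.utils.spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Four stacked blocks, e.g. 512 -> 256 -> 128 -> 64 -> 32 channels (illustrative),
# followed by a 3x3 convolution to 3 output channels for the 3 x 64 x 64 image.
```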
(3) The text feature vector is input into the multi-stage generative adversarial network to obtain the corresponding image.
Further, step (3) specifically comprises:
(3-1) Before the text feature vector is input into the multi-stage generative adversarial network, condition enhancement is first performed to generate additional condition variables, yielding the enhanced sentence embedding feature vector.
That is, the sentence embedding feature vector of the text feature vector is conditionally enhanced, producing additional condition variables randomly sampled from an independent Gaussian distribution, before being input into the multi-stage generative adversarial network.
Let the sentence embedding feature vector of the text feature vector be $e \in \mathbb{R}^D$, where $\mathbb{R}^D$ denotes the space of D-dimensional vectors. Because the amount of data is limited, condition enhancement is performed before the sentence embedding feature vector is input into the network, in order to improve the generalization ability of the network model. The condition enhancement is computed as:

$$c = \mu(e) + \sigma(e) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

where c denotes the conditionally enhanced sentence embedding feature vector, μ(e) and σ(e) are the mean and standard deviation derived from e, and ε is the condition variable randomly sampled from an independent Gaussian distribution.
(3-2) The sentence embedding feature vector is input into the multi-stage generative adversarial network to obtain the corresponding image.
The sentence embedding feature vector is concatenated with random noise of mean 0 and variance 1 and input into the 1st generator to obtain the output features of the 1st generator.
Before the output of a generator is input into the next generator, an attention mechanism module is applied to obtain the important parts of the generated image. The attention mechanism module takes two inputs: the word embedding feature matrix of the text feature vector, $w \in \mathbb{R}^{D \times T}$, where D is the dimension of each word and T is the text length, and the output features of the generator. Through the attention mechanism, the parts of each stage's generated sub-image most relevant to the keywords of the original text can be computed, improving the quality of the sub-image (a sketch of this computation follows).
The output of the 2nd attention mechanism module is input into the 3rd generator, and the output features of the 3rd generator are taken as the corresponding image, i.e., the generated image corresponding to the original text.
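The word-level attention can be sketched as follows, assuming an AttnGAN-style formulation in PyTorch; the projection layer `proj` (e.g., `nn.Linear(D, C)`) and all shapes are assumptions consistent with the description above:

```python
import torch
import torch.nn.functional as F

def word_attention(img_feat, word_embs, proj):
    """img_feat: (B, C, H, W) generator output features;
    word_embs: (B, D, T) word embedding feature matrix;
    proj: linear layer mapping D -> C so words and regions are comparable."""
    B, C, H, W = img_feat.shape
    regions = img_feat.view(B, C, H * W)                      # each column = one sub-region
    words = proj(word_embs.transpose(1, 2)).transpose(1, 2)   # (B, C, T)
    scores = torch.bmm(words.transpose(1, 2), regions)        # (B, T, H*W) word-region relevance
    attn = F.softmax(scores, dim=1)                           # weight of each word per region
    context = torch.bmm(words, attn)                          # (B, C, H*W) word-weighted context
    return context.view(B, C, H, W)                           # important part of the features
```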
The discriminator involves two kinds of constraints, unconditional constraints and conditional constraints, and takes three inputs. The input of the first discriminator is the sentence embedding feature vector, the generated image of the first generator, and the real image; the inputs of the other discriminators are the word embedding feature matrix, the generated image of the corresponding generator, and the real image. The unconditional constraint judges whether the generated image is a realistic natural image, using the generated image and the real image as the judgment conditions; the conditional constraint judges whether the generated image is consistent with the text description, using the sentence embedding feature vector or the word embedding feature matrix together with the generated image as the judgment conditions.
Conditional constraints are employed in this embodiment.
The discriminator loss function is:

$$\mathcal{L}_D = -\mathbb{E}_{I \sim p_{\mathrm{data}}}\left[\log D(I, e)\right] - \mathbb{E}_{s_0 \sim p_G}\left[\log\left(1 - D(G(s_0, c), e)\right)\right]$$

wherein e is the sentence embedding feature vector or the word embedding feature matrix of the text feature vector, I is the real image, s_0 represents the output features of the previous generator, c is the sentence embedding feature vector, G(s_0, c) is the output features of the generator, IC is the image annotation network, D(·) is the output of the discriminator, and sim is the similarity between the predicted text and the original text.

The loss function of the generator is:

$$\mathcal{L}_G = -\mathbb{E}_{s_0 \sim p_G}\left[\log D(G(s_0, c), e)\right] - \lambda\,\mathrm{sim}\left(IC(G(s_0, c)),\, T\right) + D_{KL}\left(\mathcal{N}(\mu(e), \Sigma(e))\,\|\,\mathcal{N}(0, I)\right)$$

wherein T is the original text, λ is a weighting coefficient, and D_KL(·‖·) is the KL divergence between the text feature vector and the Gaussian distribution, included to avoid overfitting.
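Assuming standard binary cross-entropy forms for the adversarial terms, the two losses could be computed as in this sketch (the weighting coefficient `lam` and the exact reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def d_loss(D, real_img, fake_img, e):
    """Conditional discriminator loss: real pairs scored 1, generated pairs 0."""
    real_logit = D(real_img, e)
    fake_logit = D(fake_img.detach(), e)  # detach: generator parameters stay fixed
    return (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
            + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

def g_loss(D, fake_img, e, sim, mu, log_sigma, lam=1.0):
    """Adversarial term, text-alignment term (-lam * sim), and the KL term
    that pulls N(mu, sigma^2) toward N(0, I) to avoid overfitting."""
    fake_logit = D(fake_img, e)
    adv = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    kl = 0.5 * torch.sum(mu.pow(2) + torch.exp(2 * log_sigma) - 1 - 2 * log_sigma, dim=1).mean()
    return adv - lam * sim + kl
```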
S103, inputting the corresponding image into the trained image annotation network to generate a predicted text.
The image annotation network is used to generate descriptive text that is semantically consistent with the input image.
As shown in fig. 3, the image annotation network mainly comprises two parts: an encoder and a decoder. The encoder comprises a convolutional neural network and a linear transform; the convolutional neural network uses the pre-trained model ResNet-152 with its last fully connected layer removed. The encoder first obtains the feature matrix of the input image with the convolutional neural network, then applies the linear transform to convert the feature matrix (feature vector) into a representation suitable as decoder input, and passes it to the decoder. The decoder uses an LSTM (long short-term memory) network comprising multiple LSTM units, in which the descriptive text is predicted and the text output encoding is obtained.
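A minimal sketch of this encoder-decoder captioner in PyTorch (hyperparameters and the teacher-forcing interface are assumptions; in practice pre-trained ResNet-152 weights would be loaded):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionEncoder(nn.Module):
    """ResNet-152 with its classification head removed, followed by a
    linear transform into the decoder's embedding space."""
    def __init__(self, embed_dim):
        super().__init__()
        resnet = models.resnet152(weights=None)  # pre-trained weights assumed in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop final FC layer
        self.linear = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)
        return self.linear(feats)

class CaptionDecoder(nn.Module):
    """LSTM decoder that turns the image feature into a predicted text."""
    def __init__(self, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):
        # Prepend the image feature as the first "token" of the sequence.
        inputs = torch.cat([img_feat.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)  # per-step vocabulary logits
```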
Training the image annotation network specifically comprises:
inputting a real image into the image annotation network to obtain a text output encoding;
and comparing the text output encoding with the text features corresponding to the real image for training, thereby obtaining the trained image annotation network.
The corresponding image is then input into the trained image annotation network to generate the predicted text.
S104, inputting the predicted text and the original text into the trained twin neural network to obtain the similarity between the predicted text and the original text.
As shown in fig. 4, the twin neural network (a Siamese network) mainly comprises a text feature extraction network and a pooling layer.
In the text-image pair database, texts belonging to different images are used as negative sample pairs and texts of the same image as positive sample pairs; the target similarity of positive sample pairs is set to 0.8 and that of negative sample pairs to 0.5. The twin neural network takes two texts as input, obtains their representations in an embedded high-dimensional space, and then calculates the degree of similarity between the two representations. Specifically, the two texts are each input into the text feature extraction network, for which this embodiment selects BERT; so that the extracted features have the same dimension and the similarity can be computed, the features are input into the pooling layer, yielding two feature vectors U and V. The similarity of the two feature vectors is then calculated with cosine similarity, which should be higher between positive samples and lower between negative samples. The cosine similarity is computed as:

$$\mathrm{sim}(U, V) = \frac{\sum_i U_i V_i}{\sqrt{\sum_i U_i^2}\,\sqrt{\sum_i V_i^2}}$$

wherein U_i and V_i denote the i-th components of U and V, respectively.
The positive and negative sample pairs are input into the twin neural network to obtain predicted similarities, and the parameters of the twin neural network are updated through the difference between the target similarity and the predicted similarity, thereby obtaining the trained twin neural network.
The predicted text and the original text are then input into the trained twin neural network, which calculates the similarity of the two texts.
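A minimal sketch of the twin-network similarity computation, assuming a Hugging Face BERT backbone with mean pooling (the model name and pooling choice are assumptions):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer  # assumed BERT backbone

def text_similarity(text_a, text_b, model_name="bert-base-chinese"):
    """Twin-network similarity: both texts pass through the same BERT
    encoder, are mean-pooled into fixed-size vectors U and V, and scored
    with cosine similarity."""
    tok = AutoTokenizer.from_pretrained(model_name)
    bert = AutoModel.from_pretrained(model_name)
    with torch.no_grad():
        batch = tok([text_a, text_b], padding=True, return_tensors="pt")
        hidden = bert(**batch).last_hidden_state         # (2, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding in pooling
        pooled = (hidden * mask).sum(1) / mask.sum(1)    # mean pooling -> U, V
    u, v = pooled[0], pooled[1]
    return F.cosine_similarity(u, v, dim=0).item()
```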
S105, training the multi-stage generative adversarial network according to the similarity between the predicted text and the original text to obtain the trained multi-stage generative adversarial network.
Through steps S101-S104, the similarities between a plurality of predicted texts and the original texts are obtained, and the multi-stage generative adversarial network is trained according to these similarities.
During the training of the multi-stage generative adversarial network, one round of training is divided into two processes:
(1) first, the parameters of all generators are fixed, and the parameters of the discriminators are updated using the discriminator loss function;
(2) then, the parameters of all discriminators are fixed, and the parameters of the generators are updated using the generator loss function and the similarity obtained by the twin neural network.
Proceeding in this way, 600 rounds of training are completed with the learning rate set to 0.0002, yielding the trained multi-stage generative adversarial network.
S106, inputting the text entered by the user into the trained multi-stage generative adversarial network to generate the image corresponding to the input text.
The user only needs to input text into the trained multi-stage generative adversarial network; the corresponding target image is generated without using the discriminators, the image annotation network, or the twin neural network.
Those skilled in the art will appreciate that all or part of the steps in the method for implementing the above embodiments may be implemented by a program to instruct associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
It should be noted that although the method operations of the above-described embodiments are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the depicted steps may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Example 2:
as shown in fig. 5, the present embodiment provides an apparatus for generating an image from a text, the apparatus including a text image pair obtaining module 501, a prediction image generating module 502, a prediction text generating module 503, a similarity calculating module 504, a multi-stage generation countermeasure network training module 505, and a text generating image module 506, wherein:
a text image pair obtaining module 501, configured to obtain a text image pair in a database; the text image pair comprises a text and an image, and the text is a descriptive text of the image and serves as an original text;
a prediction image generation module 502, configured to input the original text into a multi-level generation countermeasure network to obtain a corresponding image;
a predicted text generation module 503, configured to input the corresponding image into a trained image labeling network, and generate a predicted text;
a similarity calculation module 504, configured to input the predicted text and the original text into a trained twin neural network, so as to obtain a similarity between the predicted text and the original text;
a multistage generation countermeasure network training module 505, configured to train the multistage generation countermeasure network according to the similarity between the predicted text and the original text, so as to obtain a trained multistage generation countermeasure network;
and a text generation image module 506, configured to input a text input by a user into the trained multi-level generation countermeasure network, and generate an image corresponding to the text.
The specific implementation of each module in this embodiment may refer to embodiment 1 and is not repeated here. It should be noted that the apparatus provided in this embodiment is only illustrated by the above division of functional modules; in practical applications, the functions may be assigned to different functional modules as needed, i.e., the internal structure may be divided into different functional modules to complete all or part of the functions described above.
Example 3:
the present embodiment provides a computer device, which may be a computer, as shown in fig. 6, and includes a processor 602, a memory, an input device 603, a display 604, and a network interface 605 connected by a system bus 601, where the processor is used to provide computing and control capabilities, the memory includes a nonvolatile storage medium 606 and an internal memory 607, the nonvolatile storage medium 606 stores an operating system, a computer program, and a database, the internal memory 607 provides an environment for the operating system and the computer program in the nonvolatile storage medium to run, and when the processor 602 executes the computer program stored in the memory, the method for generating an image from a text in the foregoing embodiment 1 is implemented as follows:
acquiring a text image pair in a database; the text image pair comprises a text and an image, and the text is a descriptive text of the image and serves as an original text;
inputting the original text into a plurality of stages to generate a confrontation network, and obtaining a corresponding image;
inputting the corresponding image into a trained image labeling network to generate a prediction text;
inputting the predicted text and the original text into a trained twin neural network to obtain the similarity between the predicted text and the original text;
training the multistage generation countermeasure network according to the similarity between the predicted text and the original text to obtain a well-trained multistage generation countermeasure network;
inputting the text input by the user into the trained multistage generation countermeasure network, and generating an image corresponding to the text.
Example 4:
the present embodiment provides a storage medium, which is a computer-readable storage medium, and stores a computer program, and when the computer program is executed by a processor, the method for generating an image from a text of the above embodiment 1 is implemented as follows:
acquiring a text image pair in a database; the text image pair comprises a text and an image, and the text is a descriptive text of the image and serves as an original text;
inputting the original text into a plurality of stages to generate a confrontation network, and obtaining a corresponding image;
inputting the corresponding image into a trained image labeling network to generate a prediction text;
inputting the predicted text and the original text into a trained twin neural network to obtain the similarity between the predicted text and the original text;
training the multistage generation countermeasure network according to the similarity between the predicted text and the original text to obtain a well-trained multistage generation countermeasure network;
inputting the text input by the user into the trained multistage generation countermeasure network, and generating an image corresponding to the text.
It should be noted that the computer readable storage medium of the present embodiment may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In summary, the multi-stage generative adversarial network constructed by the invention mainly comprises the generators, the discriminators, and the attention mechanism modules: the generators generate the corresponding images, the discriminators judge whether a generated image is a real image, and the attention mechanism obtains attention feature maps of the images generated by the lower-stage generators. The text alignment mainly involves the image annotation network and the twin neural network: the image annotation network takes the image produced by the generator as input and outputs a descriptive text corresponding to that image, and the twin neural network calculates the degree of similarity between two text inputs. The original input text of the generative adversarial network and the text output by the image annotation network are taken as the two inputs of the twin neural network, yielding the degree of similarity between the two texts, i.e., the degree of text alignment, which is used to update the parameters of the generative adversarial network. Compared with the prior art, the invention improves the semantic consistency between the generated image and the text.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any substitution or change made by a person skilled in the art according to the technical solution and the inventive concept of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method of generating an image from text, the method comprising:
acquiring a text-image pair in a database, the text-image pair comprising a text and an image, wherein the text is a descriptive text of the image and serves as the original text;
inputting the original text into a multi-stage generative adversarial network to obtain a corresponding image;
inputting the corresponding image into a trained image annotation network to generate a predicted text;
inputting the predicted text and the original text into a trained twin neural network to obtain the similarity between the predicted text and the original text;
training the multi-stage generative adversarial network according to the similarity between the predicted text and the original text to obtain a trained multi-stage generative adversarial network;
and inputting text entered by a user into the trained multi-stage generative adversarial network to generate the image corresponding to that text.
2. The method of generating an image from text of claim 1, wherein the inputting the original text into a multi-stage generative adversarial network to obtain a corresponding image specifically comprises:
before inputting the original text into the multi-stage generative adversarial network, first inputting the original text into a text encoder to obtain a text feature vector;
and inputting the sentence embedding feature vector of the text feature vector into the multi-stage generative adversarial network to obtain the corresponding image.
3. The method of generating an image from text of claim 2, wherein the multi-stage generative adversarial network comprises n generators and n-1 attention mechanism modules, where n is a positive integer greater than 1;
the inputting the sentence embedding feature vector of the text feature vector into the multi-stage generative adversarial network to obtain the corresponding image specifically comprises:
before inputting the sentence embedding feature vector into a generator of the multi-stage generative adversarial network, performing condition enhancement on the sentence embedding feature vector to obtain an enhanced sentence embedding feature vector;
when i = 1, inputting the enhanced sentence embedding feature vector into the i-th generator to obtain the output features of the i-th generator;
when i is a positive integer greater than 1 and less than or equal to n, inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important part of the output features of the (i-1)-th generator, and then inputting that important part together with the output features of the (i-1)-th generator into the i-th generator to obtain the output features of the i-th generator;
taking the output features of the n-th generator as the corresponding image;
as the number of generators increases, the resolution of the generator output image gradually increases.
4. The method of generating an image from text of claim 3, wherein the inputting the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module to obtain the important part of the output features of the (i-1)-th generator specifically comprises:
inputting the word embedding feature matrix of the text feature vector and the output features of the (i-1)-th generator into the (i-1)-th attention mechanism module;
and computing, through the attention mechanism of the (i-1)-th attention mechanism module, the parts of the output features of the (i-1)-th generator most relevant to the keywords of the original text, thereby obtaining the important part of the output features of the (i-1)-th generator.
5. The method of generating an image from text of claim 3, wherein the multi-stage generative adversarial network further comprises n discriminators, each discriminator corresponding to one generator;
the training the multi-stage generative adversarial network according to the similarity between the predicted text and the original text to obtain the trained multi-stage generative adversarial network specifically comprises:
in the training process of the multi-stage generative adversarial network, one round of training comprises the following two processes:
fixing the parameters of all generators, and updating the parameters of the discriminators using the loss function of the discriminators;
fixing the parameters of all discriminators, and updating the parameters of the generators using the loss function of the generators and the similarity between the predicted text and the original text;
and performing multiple rounds of training on the multi-stage generative adversarial network using the similarities between the plurality of predicted texts and the original texts, thereby obtaining the trained multi-stage generative adversarial network.
6. The method of generating an image from text of claim 5, wherein the image in the text-image pair corresponding to the original text is taken as the real image;
the input of the k-th discriminator comprises the output features of the k-th generator and the real image, where k is a positive integer greater than or equal to 1 and less than n+1;
when k = 1, the input of the k-th discriminator further comprises the sentence embedding feature vector;
when 1 < k ≤ n, the input of the k-th discriminator further comprises the word embedding feature matrix of the text feature vector;
the loss function of the discriminator is as follows:

$$\mathcal{L}_D = -\mathbb{E}_{I \sim p_{\mathrm{data}}}\left[\log D(I, e)\right] - \mathbb{E}_{s_0 \sim p_G}\left[\log\left(1 - D(G(s_0, c), e)\right)\right]$$

wherein e is the sentence embedding feature vector or the word embedding feature matrix, I is the real image, s_0 represents the output features of the previous generator, c is the sentence embedding feature vector, G(s_0, c) is the output features of the generator, IC is the image annotation network, D(·) is the output of the discriminator, and sim is the similarity between the predicted text and the original text;
the loss function of the generator is as follows:

$$\mathcal{L}_G = -\mathbb{E}_{s_0 \sim p_G}\left[\log D(G(s_0, c), e)\right] - \lambda\,\mathrm{sim}\left(IC(G(s_0, c)),\, T\right) + D_{KL}\left(\mathcal{N}(\mu(e), \Sigma(e))\,\|\,\mathcal{N}(0, I)\right)$$

wherein T is the original text, λ is a weighting coefficient, and D_KL(·‖·) is the KL divergence between the text feature vector and the Gaussian distribution.
7. The method of generating an image from text of claim 1, wherein the image annotation network comprises an encoder and a decoder, the encoder comprising a convolutional neural network and a linear transform, and the decoder comprising an LSTM network;
the inputting the corresponding image into a trained image annotation network to generate a predicted text specifically comprises:
inputting the corresponding image into the convolutional neural network to obtain a feature matrix of the image;
applying the linear transform to the feature matrix of the image to obtain a transformed feature matrix;
and inputting the transformed feature matrix into the LSTM network to generate the predicted text.
8. The method of generating an image from text of claim 1, wherein the twin neural network comprises a text feature extraction network and a pooling layer;
the inputting the predicted text and the original text into a trained twin neural network to obtain the similarity between the predicted text and the original text specifically comprises:
inputting the predicted text and the original text separately into the text feature extraction network to obtain their respective extracted text features;
inputting the respective extracted text features into the pooling layer to obtain a feature vector U and a feature vector V;
and calculating the similarity of the feature vector U and the feature vector V by cosine similarity, with the formula:

$$\mathrm{sim}(U, V) = \frac{\sum_i U_i V_i}{\sqrt{\sum_i U_i^2}\,\sqrt{\sum_i V_i^2}}$$

wherein U_i and V_i are the i-th components of U and V, respectively.
9. An apparatus for generating an image from text, the apparatus comprising:
a text-image pair acquisition module, configured to acquire a text-image pair in a database, the text-image pair comprising a text and an image, wherein the text is a descriptive text of the image and serves as the original text;
a predicted image generation module, configured to input the original text into a multi-stage generative adversarial network to obtain a corresponding image;
a predicted text generation module, configured to input the corresponding image into a trained image annotation network to generate a predicted text;
a similarity calculation module, configured to input the predicted text and the original text into a trained twin neural network to obtain the similarity between the predicted text and the original text;
a multi-stage generative adversarial network training module, configured to train the multi-stage generative adversarial network according to the similarity between the predicted text and the original text to obtain a trained multi-stage generative adversarial network;
and a text-to-image module, configured to input text entered by a user into the trained multi-stage generative adversarial network and generate the image corresponding to that text.
10. A storage medium storing a program, wherein the program, when executed by a processor, implements the method of generating an image from text of any one of claims 1 to 8.
CN202111072292.6A 2021-09-14 2021-09-14 Method and device for generating image by text, computer equipment and storage medium Pending CN113961736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111072292.6A CN113961736A (en) 2021-09-14 2021-09-14 Method and device for generating image by text, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111072292.6A CN113961736A (en) 2021-09-14 2021-09-14 Method and device for generating image by text, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113961736A true CN113961736A (en) 2022-01-21

Family

ID=79461560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111072292.6A Pending CN113961736A (en) 2021-09-14 2021-09-14 Method and device for generating image by text, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113961736A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119811A (en) * 2022-01-28 2022-03-01 北京智谱华章科技有限公司 Image generation method and device and electronic equipment
CN114119811B (en) * 2022-01-28 2022-04-01 北京智谱华章科技有限公司 Image generation method and device and electronic equipment
CN115622806A (en) * 2022-12-06 2023-01-17 南京众智维信息科技有限公司 Network intrusion detection method based on BERT-CGAN
CN115622806B (en) * 2022-12-06 2023-03-31 南京众智维信息科技有限公司 Network intrusion detection method based on BERT-CGAN
CN116309913A (en) * 2023-03-16 2023-06-23 沈阳工业大学 Method for generating image based on ASG-GAN text description of generation countermeasure network
CN116309913B (en) * 2023-03-16 2024-01-26 沈阳工业大学 Method for generating image based on ASG-GAN text description of generation countermeasure network
CN116645668A (en) * 2023-07-21 2023-08-25 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN116645668B (en) * 2023-07-21 2023-10-20 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN117953108A (en) * 2024-03-20 2024-04-30 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination