CN112669215A - Training text image generation model, text image generation method and device

Training text image generation model, text image generation method and device

Info

Publication number: CN112669215A
Application number: CN202110008742.9A
Authority: CN (China)
Prior art keywords: text, image, vector, target, kth
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 李虎
Current and original assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202110008742.9A
Publication of CN112669215A

Classifications

  • Image Analysis (AREA)
Abstract

The application discloses a method and an apparatus for training a text image generation model and for generating a text image. The training method includes: inputting a text to be trained into a text encoder to obtain a text vector; and training a multi-stage generative adversarial network with the text vector and the label image of the text to be trained to obtain a text image generation model. The text image generation method includes: determining a target text for which an image is to be generated; processing the target text with a text encoder to obtain a target text vector; and inputting the target text vector into the text image generation model to generate a target image corresponding to the target text. Because the text image generation model is a multi-stage generative adversarial network that progressively learns the association between the text to be trained and its label image, it can generate high-resolution images; therefore, when the target text vector corresponding to the target text is processed with the text image generation model, the resulting target image corresponding to the target text has a high resolution.

Description

Training text image generation model, text image generation method and device
Technical Field
The application relates to the technical field of data processing, and in particular to a method and an apparatus for training a text image generation model and for generating a text image.
Background
With the rapid development of machine learning, images corresponding to text content are commonly generated with a generative adversarial network (GAN). A GAN comprises a generator network and a discriminator network; the features of the text are input into the GAN, and the adversarial game between the generator and the discriminator can produce quite good image output.
However, the inventors have found through research that such a GAN is a single-stage GAN, and that generating an image corresponding to text content with only a single-stage GAN very easily produces abrupt texture information, so that the generated image has low resolution and poor image quality.
Disclosure of Invention
In view of this, embodiments of the present application provide a method and an apparatus for training a text image generation model, so that the trained text image generation model can generate high-resolution images, and an image generated with the text image generation model has higher resolution and higher image quality.
In a first aspect, an embodiment of the present application provides a method for training a text image generation model, where the method includes:
processing a text to be trained by using a text encoder to obtain a text vector;
and training a multi-stage generative adversarial network based on the text vector and the label image of the text to be trained to obtain a text image generation model.
Optionally, the training a multi-stage generative adversarial network based on the text vector and the label image of the text to be trained to obtain a text image generation model includes:
concatenating the text vector with a first random vector to obtain a first concatenated vector;
training a first-stage generative adversarial network based on the first concatenated vector and the label image of the text to be trained;
obtaining a kth concatenated vector based on the text vector and a (k-1)th generated image output by a (k-1)th-stage generative adversarial network, wherein k is a positive integer and k ≥ 2;
training a kth-stage generative adversarial network based on the kth concatenated vector and the label image of the text to be trained;
and taking the trained multi-stage generative adversarial network as the text image generation model.
Optionally, the training a first-stage generative adversarial network based on the first concatenated vector and the label image of the text to be trained includes:
processing the first concatenated vector with the generator network in the first-stage generative adversarial network to generate a first generated image;
sampling the label image of the text to be trained according to the size of the first generated image to obtain a first label image;
processing the first generated image and the first label image with the discriminator network in the first-stage generative adversarial network to obtain the discrimination probability of the first generated image and the discrimination probability of the first label image;
and adjusting the network parameters of the first-stage generative adversarial network based on the discrimination probability of the first generated image and the discrimination probability of the first label image.
Optionally, the training a kth-stage generative adversarial network based on the kth concatenated vector and the label image of the text to be trained includes:
processing the kth concatenated vector with the generator network in the kth-stage generative adversarial network to generate a kth generated image;
sampling the label image of the text to be trained according to the size of the kth generated image to obtain a kth label image;
processing the kth generated image and the kth label image with the discriminator network in the kth-stage generative adversarial network to obtain the discrimination probability of the kth generated image and the discrimination probability of the kth label image;
and adjusting the network parameters of the kth-stage generative adversarial network based on the discrimination probability of the kth generated image and the discrimination probability of the kth label image.
Optionally, the obtaining a kth concatenated vector based on the text vector and a (k-1)th generated image output by the (k-1)th-stage generative adversarial network includes:
processing the (k-1)th generated image with an image encoder to obtain a (k-1)th image vector;
and concatenating the text vector with the (k-1)th image vector to obtain the kth concatenated vector.
Optionally, the processing a text to be trained by using a text encoder to obtain a text vector includes:
processing the text to be trained by utilizing an embedding layer of the text encoder to obtain a text embedding vector;
compressing the text embedding vector with a fully connected layer of the text encoder to obtain the text vector.
Optionally, the training completion condition is convergence of the multi-stage generative adversarial network; or, the training completion condition is that the number of training iterations of the multi-stage generative adversarial network is greater than or equal to a preset number of training iterations.
In a second aspect, an embodiment of the present application provides a method for generating a text image, where the method includes:
determining a target text of an image to be generated;
processing the target text by using a text encoder to obtain a target text vector;
inputting the target text vector into a text image generation model to generate a target image corresponding to the target text;
wherein the text image generation model is the text image generation model obtained by training according to any one of the implementations of the first aspect.
Optionally, the inputting the target text vector into the text image generation model to generate a target image corresponding to the target text includes:
concatenating the target text vector with a second random vector to obtain a first target concatenated vector;
processing the first target concatenated vector with the first-stage generative adversarial network in the text image generation model to generate a first target generated image;
obtaining a kth target concatenated vector based on the target text vector and a (k-1)th target generated image output by the (k-1)th-stage generative adversarial network in the text image generation model, wherein k is a positive integer and k ≥ 2;
processing the kth target concatenated vector with the kth-stage generative adversarial network in the text image generation model to generate a kth target generated image;
and taking the target generated image generated by the last-stage generative adversarial network in the text image generation model as the target image corresponding to the target text.
Optionally, the obtaining a kth target concatenated vector based on the target text vector and a (k-1)th target generated image output by the (k-1)th-stage generative adversarial network in the text image generation model includes:
processing the (k-1)th target generated image with an image encoder to obtain a (k-1)th target image vector;
and concatenating the target text vector with the (k-1)th target image vector to obtain the kth target concatenated vector.
In a third aspect, an embodiment of the present application provides an apparatus for training a text image generation model, where the apparatus includes:
the first obtaining unit is used for processing the text to be trained by utilizing a text encoder to obtain a text vector;
and the second obtaining unit is used for training a multi-stage generative adversarial network based on the text vector and the label image of the text to be trained to obtain a text image generation model.
Optionally, the second obtaining unit includes:
the first obtaining subunit is used for concatenating the text vector with the first random vector to obtain a first concatenated vector;
the first training subunit is used for training a first-stage generative adversarial network based on the first concatenated vector and the label image of the text to be trained;
the second obtaining subunit is used for obtaining a kth concatenated vector based on the text vector and a (k-1)th generated image output by the (k-1)th-stage generative adversarial network, wherein k is a positive integer and k ≥ 2;
the second training subunit is used for training the kth-stage generative adversarial network based on the kth concatenated vector and the label image of the text to be trained;
and the first application subunit is configured to use the trained multi-stage generative adversarial network as the text image generation model.
Optionally, the first training subunit includes:
the first generation module is used for processing the first concatenated vector with the generator network in the first-stage generative adversarial network to generate a first generated image;
the first obtaining module is used for sampling the label image of the text to be trained according to the size of the first generated image to obtain a first label image;
a second obtaining module, configured to process the first generated image and the first label image with the discriminator network in the first-stage generative adversarial network to obtain the discrimination probability of the first generated image and the discrimination probability of the first label image;
and the first adjusting module is used for adjusting the network parameters of the first-stage generative adversarial network based on the discrimination probability of the first generated image and the discrimination probability of the first label image.
Optionally, the second training subunit includes:
the second generation module is used for processing the kth concatenated vector with the generator network in the kth-stage generative adversarial network to generate a kth generated image;
a third obtaining module, configured to sample the label image of the text to be trained according to the size of the kth generated image to obtain a kth label image;
a fourth obtaining module, configured to process the kth generated image and the kth label image with the discriminator network in the kth-stage generative adversarial network to obtain the discrimination probability of the kth generated image and the discrimination probability of the kth label image;
and the second adjusting module is used for adjusting the network parameters of the kth-stage generative adversarial network based on the discrimination probability of the kth generated image and the discrimination probability of the kth label image.
Optionally, the second obtaining subunit includes:
a fifth obtaining module, configured to process the (k-1)th generated image with an image encoder to obtain a (k-1)th image vector;
and the sixth obtaining module is used for concatenating the text vector with the (k-1)th image vector to obtain the kth concatenated vector.
Optionally, the first obtaining unit includes:
a third obtaining subunit, configured to process the text to be trained by using an embedding layer of the text encoder to obtain a text embedding vector;
a fourth obtaining subunit, configured to compress the text embedding vector using a fully connected layer of the text encoder to obtain the text vector.
Optionally, the training completion condition is convergence of the multi-stage generative adversarial network; or, the training completion condition is that the number of training iterations of the multi-stage generative adversarial network is greater than or equal to a preset number of training iterations.
In a fourth aspect, an embodiment of the present application provides an apparatus for text image generation, where the apparatus includes:
the determining unit is used for determining a target text of an image to be generated;
a third obtaining unit, configured to process the target text with a text encoder to obtain a target text vector;
the generating unit is used for inputting the target text vector into a text image generating model and generating a target image corresponding to the target text;
wherein the text image generation model is the text image generation model obtained by training according to any one of the implementations of the first aspect.
Optionally, the generating unit includes:
a fifth obtaining subunit, configured to concatenate the target text vector with the second random vector to obtain a first target concatenated vector;
the first generation subunit is used for processing the first target concatenated vector with the first-stage generative adversarial network in the text image generation model to generate a first target generated image;
a sixth obtaining subunit, configured to obtain a kth target concatenated vector based on the target text vector and a (k-1)th target generated image output by the (k-1)th-stage generative adversarial network in the text image generation model, where k is a positive integer and k ≥ 2;
the second generation subunit is used for processing the kth target concatenated vector with the kth-stage generative adversarial network in the text image generation model to generate a kth target generated image;
and the second application subunit is configured to use the target generated image generated by the last-stage generative adversarial network in the text image generation model as the target image corresponding to the target text.
Optionally, the sixth obtaining subunit includes:
a seventh obtaining module, configured to process the (k-1)th target generated image with an image encoder to obtain a (k-1)th target image vector;
and the eighth obtaining module is used for concatenating the target text vector with the (k-1)th target image vector to obtain the kth target concatenated vector.
In a fifth aspect, an embodiment of the present application provides a terminal device, where the terminal device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute, according to instructions in the program code, the method for training a text image generation model according to any one of the first aspect or the method for generating a text image according to any one of the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium for storing program code for executing the method for training a text image generation model according to any one of the first aspect or the method for generating a text image according to any one of the second aspect.
Compared with the prior art, the method has the advantages that:
by adopting the technical solutions of the embodiments of the present application, the text to be trained is input into a text encoder to obtain a text vector, and a multi-stage generative adversarial network is trained with the text vector and the label image of the text to be trained to obtain a text image generation model. In this way, each stage of the multi-stage generative adversarial network progressively learns the association between the text to be trained and its label image, so that the resolution of the image generated by each stage gradually increases, abrupt texture information is avoided, and the trained text image generation model can generate high-resolution images.
In addition, in another embodiment of the present application, a target text for which an image is to be generated is determined; the target text is processed with a text encoder to obtain a target text vector; and the target text vector is input into the text image generation model to generate a target image corresponding to the target text. Because the text image generation model is a multi-stage generative adversarial network, it can generate high-resolution images; therefore, when the target text vector corresponding to the target text is processed with the text image generation model, the resulting target image has higher resolution, that is, higher image quality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic diagram of a system framework related to an application scenario in an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for training a text image generation model according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of an architecture of a training text image generation model according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a method for generating a text image according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus for training a text image generation model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for generating a text image according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, an image corresponding to text content is generally generated with a single-stage generative adversarial network. When only a single-stage generative adversarial network is used to learn the association between a text and its corresponding image, abrupt texture information is very easily produced when generating an image, so that the generated image has low resolution and poor image quality.
In order to solve this problem, in the embodiments of the present application, a text to be trained is input into a text encoder to obtain a text vector, and a multi-stage generative adversarial network is trained with the text vector and the label image of the text to be trained to obtain a text image generation model. In this way, each stage of the multi-stage generative adversarial network progressively learns the association between the text to be trained and its label image, so that the resolution of the image generated by each stage gradually increases, abrupt texture information is avoided, and the obtained text image generation model can generate high-resolution images.
In addition, in another embodiment of the present application, a target text for which an image is to be generated is determined; the target text is processed with a text encoder to obtain a target text vector; and the target text vector is input into the text image generation model to generate a target image corresponding to the target text. Because the text image generation model is a multi-stage generative adversarial network, it can generate high-resolution images; therefore, when the target text vector corresponding to the target text is processed with the text image generation model, the resulting target image has high resolution, that is, high image quality.
For example, the technical solutions of the embodiments of the present application may be applied to the scenario shown in fig. 1, which includes the terminal device 101 and the processor 102. The terminal device 101 collects a text to be trained and its label image and sends them to the processor 102, and the processor 102 obtains and stores a text image generation model using the implementation provided by the embodiments of the present application. When the terminal device 101 sends the target text of an image to be generated to the processor 102, the processor 102 generates a target image corresponding to the target text with the text image generation model according to another implementation provided by the embodiments of the present application, and returns the target image to the terminal device 101, so that the terminal device 101 displays the target image.
First, in the application scenario described above, although the actions of the embodiments provided herein are described as being performed by the processor 102, the present application is not limited with respect to the executing subject, as long as the actions disclosed in the embodiments provided herein are executed.
Next, the above scenario is only one example of the scenario provided in the embodiment of the present application, and the embodiment of the present application is not limited to this scenario.
The following describes in detail a specific implementation of the method and apparatus for training a text image generation model and generating a text image in the embodiments of the present application with reference to the drawings.
Exemplary method
Referring to fig. 2, a flowchart of a method for training a text image generation model in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the steps of:
step 201: and processing the text to be trained by using a text encoder to obtain a text vector.
In the embodiments of the present application, in order to generate images corresponding to text content with a generative adversarial network, a large number of texts, together with images corresponding to their text content, need to be collected as texts to be trained and label images of the texts to be trained; these data are used to train the generative adversarial network so that it learns the association between a text to be trained and its label image, thereby obtaining a text image generation model for generating images corresponding to text content.
In the process of training the generative adversarial network with a text to be trained and its label image, the relevant features of the text content in the text to be trained first need to be extracted to form a text vector; a text encoder is generally used to process the text to be trained to obtain the text vector. That is, step 201 is performed.
In specific implementations of step 201, the text encoder may, for example, be constructed to include an embedding layer and a fully connected layer, where the embedding layer is connected to the fully connected layer. The embedding layer is used to encode the text to be trained, extracting the relevant features of its text content to obtain a text embedding vector; the fully connected layer is used to compress the text embedding vector to obtain a text vector meeting the dimension requirement of the generative adversarial network's input vector. Therefore, in an optional implementation manner of the embodiments of the present application, step 201 may include, for example, the following steps:
step A: and processing the text to be trained by utilizing an embedding layer of the text encoder to obtain a text embedding vector.
Step B: compressing the text embedding vector with the fully connected layer of the text encoder to obtain the text vector.
In the embodiments of the present application, the text encoder may be, for example, a character-level convolutional-recurrent neural network (char-CNN-RNN).
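For concreteness, a minimal sketch of such a two-layer text encoder follows, in PyTorch; the module name, vocabulary size, and all dimensions are hypothetical choices rather than values fixed by the application, and a char-CNN-RNN as mentioned above would replace the plain embedding used here:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embedding layer followed by a fully connected compression layer (steps A and B)."""
    def __init__(self, vocab_size=5000, embed_dim=256, text_dim=128, max_len=32):
        super().__init__()
        # Embedding layer: extracts features of the text content (step A).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Fully connected layer: compresses the embedding to the dimension
        # required by the GAN input vector (step B).
        self.fc = nn.Linear(max_len * embed_dim, text_dim)

    def forward(self, token_ids):        # token_ids: (batch, max_len) integer tokens
        emb = self.embedding(token_ids)  # (batch, max_len, embed_dim)
        return self.fc(emb.flatten(1))   # text vector: (batch, text_dim)
```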
Step 202: training a multi-stage generative adversarial network based on the text vector and the label image of the text to be trained to obtain a text image generation model.
As noted above, when the generative adversarial network is a single-stage network, using only that single stage to learn the association between a text and its corresponding image easily produces abrupt texture information when generating an image, so that the generated image has low resolution and poor image quality. The embodiments of the present application therefore consider a multi-stage generative adversarial network: each stage progressively learns the association between the text to be trained and its label image, so that the resolution of the image generated by each stage gradually increases, abrupt texture information is avoided, and the resulting text image generation model can generate high-resolution images. That is, after step 201, step 202 is performed.
In specific implementations of step 202, for the first-stage generative adversarial network in the multi-stage generative adversarial network, the training data of the first stage is obtained first. To enable the generator network in the first stage to generate a corresponding image based on the text vector, a random vector also needs to be obtained, as the first random vector, so that the generator can produce detail information in the image from it. Specifically, the text vector and the first random vector are concatenated to obtain a concatenated vector as the first concatenated vector, where the first random vector follows a normal distribution and has the same vector dimension as the text vector. Then, the first concatenated vector and the label image of the text to be trained are used as the training data of the first-stage generative adversarial network to train the first stage.
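As a small illustrative sketch of this input construction (again in PyTorch; the batch size and vector dimension are hypothetical):

```python
import torch

text_vec = torch.randn(8, 128)                    # stand-in for the text vector (batch, dim)
z1 = torch.randn_like(text_vec)                   # first random vector ~ N(0, I), same dimension
first_concat = torch.cat([text_vec, z1], dim=1)   # first concatenated vector: (batch, 2 * dim)
```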
For the second-stage, third-stage, and subsequent generative adversarial networks in the multi-stage generative adversarial network, that is, for the kth-stage generative adversarial network with k a positive integer and k ≥ 2: first, the training data of the kth stage is obtained. To enable the generator network in the kth stage to generate a corresponding image based on the text vector, the (k-1)th generated image output by the (k-1)th stage also needs to be obtained, in addition to the text vector, so that the generator in the kth stage can produce detail information in the image from the (k-1)th generated image. Specifically, a concatenated vector is obtained from the text vector and the (k-1)th generated image, as the kth concatenated vector. Then, the kth concatenated vector and the label image of the text to be trained are used as the training data of the kth-stage generative adversarial network to train the kth stage.
Therefore, in an alternative implementation manner of this embodiment of the present application, the step 202 may include the following steps:
and C: and splicing the text vector and the first random vector to obtain a first spliced vector.
Step D: training a first-stage generative adversarial network based on the first concatenated vector and the label image of the text to be trained.
In specific implementations of step D, the first concatenated vector is first input into the generator network in the first-stage generative adversarial network; the generator outputs a generated image as the first generated image, whose size is smaller than that of the label image of the text to be trained. Next, based on the above description, an image of the same size as the first generated image needs to be sampled from the label image of the text to be trained, as the first label image. Then, the first generated image and the first label image are input into the discriminator network in the first-stage generative adversarial network, and the discriminator outputs the discrimination probability of the first generated image and the discrimination probability of the first label image. Finally, the goal of the generator is to generate images realistic enough to deceive the discriminator, while the goal of the discriminator is to distinguish the images generated by the generator as fake and the label images as real; the network parameters of the first-stage generative adversarial network therefore need to be adjusted according to the discrimination probability of the first generated image, the discrimination probability of the first label image, and the corresponding expected probabilities, so as to train the first stage. Therefore, in an optional implementation manner of the embodiments of the present application, step D may include, for example, the following steps:
step D1: and processing the first splicing vector by utilizing a generating network in the first-stage generating type countermeasure network to generate a first generated image.
Step D2: and sampling the label image of the text to be trained according to the size of the first generated image to obtain a first label image.
Step D3: and processing the first generated image and the first label image by using a discrimination network in the first-stage generation type countermeasure network to obtain the discrimination probability of the first generated image and the discrimination probability of the first label image.
Step D4: and adjusting the network parameters of the first-stage generation type countermeasure network based on the discrimination probability of the first generated image and the discrimination probability of the first label image.
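A minimal sketch of one such parameter update follows. It assumes PyTorch, discriminators that output probabilities, and a binary cross-entropy loss as the "expected probability" criterion; none of these choices, nor the names G, D, opt_g, opt_d, are fixed by the application. The same routine applies unchanged to steps F1 to F4 of the kth stage below:

```python
import torch
import torch.nn.functional as F

def train_stage_step(G, D, opt_g, opt_d, concat_vec, label_image):
    # Step D1/F1: the generator network produces a generated image from the concatenated vector.
    fake = G(concat_vec)
    # Step D2/F2: sample the label image down to the generated image's size.
    real = F.interpolate(label_image, size=fake.shape[-2:],
                         mode="bilinear", align_corners=False)
    # Step D3/F3: discrimination probabilities of the generated image and the label image.
    p_fake, p_real = D(fake.detach()), D(real)
    # Step D4/F4: adjust the discriminator's parameters toward the expected
    # probabilities (label image -> 1, generated image -> 0).
    d_loss = (F.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
              F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # The generator is then adjusted so that its images fool the discriminator.
    p_fake = D(fake)
    g_loss = F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return fake.detach()
```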
Step E: obtaining a kth concatenated vector based on the text vector and the (k-1)th generated image output by the (k-1)th-stage generative adversarial network, where k is a positive integer and k ≥ 2.
In specific implementations of step E, the (k-1)th generated image can be converted into a vector and then concatenated with the text vector to obtain the kth concatenated vector. First, the relevant features of the image content in the (k-1)th generated image need to be extracted to form an image vector as the (k-1)th image vector; an image encoder is generally used to process the (k-1)th generated image to obtain the (k-1)th image vector, whose vector dimension is the same as that of the text vector. Then, the text vector and the (k-1)th image vector are concatenated to obtain a concatenated vector as the kth concatenated vector. Therefore, in an optional implementation manner of the embodiments of the present application, step E may include, for example, the following steps:
Step E1: processing the (k-1)th generated image with an image encoder to obtain a (k-1)th image vector.
Step E2: concatenating the text vector with the (k-1)th image vector to obtain the kth concatenated vector.
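A sketch of steps E1 and E2 under the same assumptions; the convolutional ImageEncoder is a hypothetical stand-in, since the application does not specify the image encoder's architecture:

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Encodes the (k-1)th generated image into a vector with the text vector's dimension."""
    def __init__(self, text_dim=128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, text_dim)

    def forward(self, image):                       # image: (batch, 3, H, W)
        return self.fc(self.conv(image).flatten(1))

def kth_concat_vector(text_vec, prev_image, image_encoder):
    img_vec = image_encoder(prev_image)             # Step E1: (k-1)th image vector
    return torch.cat([text_vec, img_vec], dim=1)    # Step E2: kth concatenated vector
```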
Step F: training a kth-stage generative adversarial network based on the kth concatenated vector and the label image of the text to be trained.
In specific implementations of step F, the kth concatenated vector is first input into the generator network in the kth-stage generative adversarial network; the generator outputs a generated image as the kth generated image, whose size is smaller than or equal to that of the label image of the text to be trained. Next, based on the above description, an image of the same size as the kth generated image needs to be sampled from the label image of the text to be trained, as the kth label image. Then, the kth generated image and the kth label image are input into the discriminator network in the kth-stage generative adversarial network, and the discriminator outputs the discrimination probability of the kth generated image and the discrimination probability of the kth label image. Finally, the network parameters of the kth-stage generative adversarial network are likewise adjusted according to the discrimination probability of the kth generated image, the discrimination probability of the kth label image, and the corresponding expected probabilities, so as to train the kth stage. Therefore, in an optional implementation manner of the embodiments of the present application, step F may include, for example, the following steps:
step F1: and processing the kth stitching vector by utilizing a generating network in the kth-level generating countermeasure network to generate a kth generated image.
Step F2: and sampling the label image of the text to be trained according to the size of the kth generated image to obtain the kth label image.
Step F3: and processing the kth generated image and the kth label image by using a discrimination network in the kth-level generation countermeasure network to obtain the discrimination probability of the kth generated image and the discrimination probability of the kth label image.
Step F4: and adjusting the network parameters of the kth generation type countermeasure network based on the discrimination probability of the kth generated image and the discrimination probability of the kth label image.
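Combining steps C through F, one training pass over all stages might then be sketched as follows, reusing the hypothetical train_stage_step and kth_concat_vector from above:

```python
def train_multistage_step(generators, discriminators, opts_g, opts_d,
                          text_vec, label_image, image_encoder):
    # Stage 1 (steps C and D): concatenate the text vector with the first random vector.
    z = torch.randn_like(text_vec)
    prev_image = train_stage_step(generators[0], discriminators[0], opts_g[0], opts_d[0],
                                  torch.cat([text_vec, z], dim=1), label_image)
    # Stages k = 2 .. K (steps E and F): concatenate the text vector with the
    # encoded (k-1)th generated image.
    for k in range(1, len(generators)):
        concat = kth_concat_vector(text_vec, prev_image, image_encoder)
        prev_image = train_stage_step(generators[k], discriminators[k], opts_g[k], opts_d[k],
                                      concat, label_image)
```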
Note that the size of the kth generated image is larger than that of the first generated image, and the size of the kth generated image increases as k increases. As an example, the size of the first generated image is 4 × 4, the size of the second generated image is 8 × 8, the size of the third generated image is 16 × 16, and so on, until the size of the last generated image equals the size of the label image of the text to be trained.
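Under this doubling schedule, the number of stages follows directly from the label image size; a small sketch with hypothetical sizes:

```python
import math

label_size, base_size = 256, 4                   # hypothetical label image and stage-1 sizes
num_stages = int(math.log2(label_size // base_size)) + 1
stage_sizes = [base_size * 2 ** i for i in range(num_stages)]  # [4, 8, 16, 32, 64, 128, 256]
```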
Based on the description of steps C to F, each stage of the multi-stage generative adversarial network progressively learns the association between the text to be trained and its label image: the input of the first-stage generative adversarial network is the first concatenated vector formed by concatenating the text vector with the first random vector, and the input of the kth-stage generative adversarial network is the kth concatenated vector formed from the text vector and the generated image of the previous, (k-1)th, stage; see, for example, the architecture diagram of training a text image generation model shown in fig. 3. In this way, the resolution of the image generated by each stage gradually increases, and the multi-stage generative adversarial network finally produces a high-quality, high-resolution generated image.
Step G: taking the trained multi-stage generative adversarial network as the text image generation model.
When the multi-stage generative adversarial network has been trained until the training completion condition is met, its training is complete, and the trained multi-stage generative adversarial network can be used as the text image generation model. In an optional implementation manner of the embodiments of the present application, the training completion condition is convergence of the multi-stage generative adversarial network; or, the training completion condition is that the number of training iterations of the multi-stage generative adversarial network is greater than or equal to a preset number of training iterations.
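A sketch of such a stopping check (the convergence test on the recent loss history is an assumed criterion; the application leaves the concrete test open):

```python
def training_complete(iteration, max_iterations, loss_history, eps=1e-4):
    # Condition 2: the number of training iterations has reached the preset number.
    if iteration >= max_iterations:
        return True
    # Condition 1 (assumed convergence test): the loss has stopped changing noticeably.
    return len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < eps
```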
Through the various implementations provided by this embodiment, a text to be trained is input into a text encoder to obtain a text vector, and a multi-stage generative adversarial network is trained with the text vector and the label image of the text to be trained to obtain a text image generation model. In this way, each stage of the multi-stage generative adversarial network progressively learns the association between the text to be trained and its label image, so that the resolution of the image generated by each stage gradually increases, abrupt texture information is avoided, and the trained text image generation model can generate high-resolution images.
It should be noted that, in the above embodiment, the text image generation model is obtained by training the multi-stage generative adversarial network with the text vector and the label image of the text to be trained. Compared with a single-stage generative adversarial network, the multi-stage generative adversarial network can progressively learn the association between the text to be trained and its label image and avoid producing abrupt texture information when generating images, so that the generated image has higher resolution and higher image quality. Therefore, when an image corresponding to the text content of a target text needs to be generated, the text encoder is used to extract the relevant features of the text content in the target text to form a target text vector, and the target text vector is then processed with the text image generation model obtained in the above embodiment, yielding an image with higher resolution and higher image quality as the target image corresponding to the target text.
Referring to fig. 4, a flowchart of a method for generating a text image in an embodiment of the present application is shown. In this embodiment, on the basis of the text image generation model described in the above embodiment, the method may include the following steps:
step 401: and determining a target text of the image to be generated.
Step 402: and processing the target text by using a text encoder to obtain a target text vector.
Step 403: and inputting the target text vector into a text image generation model to generate a target image corresponding to the target text.
In this embodiment of the application, when step 403 is specifically implemented, a target text vector and a second random vector are spliced to obtain a first target splicing vector, where the second random vector obeys normal distribution, and the vector dimension of the second random vector is the same as the vector dimension of the text vector; inputting the first target splicing vector into a first-stage generation type countermeasure network in a text image generation model, and outputting a first target generation image; processing a (k-1) th target generation image by using an image encoder to obtain a (k-1) th target image vector, wherein k is a positive integer and is more than or equal to 2; splicing the target text vector and the kth-1 target image vector to obtain a kth target splicing vector; inputting the kth target splicing vector and the label image of the text to be trained into a kth generation type countermeasure network in a text image generation model, and outputting a kth target generation image; and finally, generating a target generation image output by the countermeasure network in the final stage of the text image generation model as a target image corresponding to the target text.
Therefore, in an optional implementation manner of this embodiment of this application, the step 403 may include, for example, the following steps:
step H: and splicing the target text vector and the second random vector to obtain a first target splicing vector.
Step I: and processing the first target splicing vector by using a first-stage generating countermeasure network in the text image generation model to generate a first target generation image.
Step J: and obtaining a kth target splicing vector based on the target text vector and a kth-1 target generation image output by a kth-1 level generation type countermeasure network in the text image generation model, wherein k is a positive integer and is more than or equal to 2.
In an optional implementation manner of the embodiment of the present application, the step J may include, for example, the following steps:
step J1: processing the (k-1) th target generation image by using an image encoder to obtain a (k-1) th target image vector;
step J2: and splicing the target text vector and the (k-1) th target image vector to obtain a k-th target splicing vector.
Step K: and processing the kth target splicing vector by using a kth generation countermeasure network in the text image generation model to generate a kth target generation image.
Step L: and taking a target generation image generated by the last-stage generation type countermeasure network in the text image generation model as a target image corresponding to the target text.
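A sketch of this inference pass, chaining steps H through L and reusing the hypothetical text and image encoders from the training sketches:

```python
import torch

@torch.no_grad()
def generate_target_image(generators, text_encoder, image_encoder, token_ids):
    text_vec = text_encoder(token_ids)                      # target text vector (step 402)
    z = torch.randn_like(text_vec)                          # second random vector ~ N(0, I)
    image = generators[0](torch.cat([text_vec, z], dim=1))  # first target generated image (steps H, I)
    for G_k in generators[1:]:                              # stages k = 2 .. K
        img_vec = image_encoder(image)                      # (k-1)th target image vector (step J1)
        image = G_k(torch.cat([text_vec, img_vec], dim=1))  # kth target generated image (steps J2, K)
    return image                                            # output of the last stage (step L)
```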
Through the various implementations provided by this embodiment, a target text for which an image is to be generated is determined; the target text is processed with a text encoder to obtain a target text vector; and the target text vector is input into the text image generation model to generate a target image corresponding to the target text. Because the text image generation model is a multi-stage generative adversarial network, it can generate high-resolution images; therefore, when the target text vector corresponding to the target text is processed with the text image generation model, the resulting target image has higher resolution, that is, higher image quality.
Exemplary devices
Referring to fig. 5, a schematic structural diagram of an apparatus for training a text image generation model in an embodiment of the present application is shown. In this embodiment, the apparatus may specifically include:
a first obtaining unit 501, configured to process a text to be trained by using a text encoder to obtain a text vector;
a second obtaining unit 502, configured to train a multi-stage generative adversarial network based on the text vector and the label image of the text to be trained to obtain a text image generation model.
In an optional implementation manner of this embodiment of this application, the second obtaining unit 502 includes:
the first obtaining subunit is used for concatenating the text vector with the first random vector to obtain a first concatenated vector;
the first training subunit is used for training a first-stage generative adversarial network based on the first concatenated vector and the label image of the text to be trained;
the second obtaining subunit is used for obtaining a kth concatenated vector based on the text vector and a (k-1)th generated image output by the (k-1)th-stage generative adversarial network, wherein k is a positive integer and k ≥ 2;
the second training subunit is used for training the kth-stage generative adversarial network based on the kth concatenated vector and the label image of the text to be trained;
and the first application subunit is configured to use the trained multi-stage generative adversarial network as the text image generation model.
In an optional implementation manner of the embodiment of the present application, the first training subunit includes:
the first generation module is used for processing the first concatenated vector with the generator network in the first-stage generative adversarial network to generate a first generated image;
the first obtaining module is used for sampling the label image of the text to be trained according to the size of the first generated image to obtain a first label image;
a second obtaining module, configured to process the first generated image and the first label image with the discriminator network in the first-stage generative adversarial network to obtain the discrimination probability of the first generated image and the discrimination probability of the first label image;
and the first adjusting module is used for adjusting the network parameters of the first-stage generative adversarial network based on the discrimination probability of the first generated image and the discrimination probability of the first label image.
In an optional implementation manner of the embodiment of the present application, the second training subunit includes:
the second generation module is used for processing the kth concatenated vector with the generator network in the kth-stage generative adversarial network to generate a kth generated image;
a third obtaining module, configured to sample the label image of the text to be trained according to the size of the kth generated image to obtain a kth label image;
a fourth obtaining module, configured to process the kth generated image and the kth label image with the discriminator network in the kth-stage generative adversarial network to obtain the discrimination probability of the kth generated image and the discrimination probability of the kth label image;
and the second adjusting module is used for adjusting the network parameters of the kth-stage generative adversarial network based on the discrimination probability of the kth generated image and the discrimination probability of the kth label image.
In an optional implementation manner of the embodiment of the present application, the second obtaining subunit includes:
a fifth obtaining module, configured to process the (k-1)th generated image with an image encoder to obtain a (k-1)th image vector;
and the sixth obtaining module is used for concatenating the text vector with the (k-1)th image vector to obtain the kth concatenated vector.
In an optional implementation manner of the embodiment of the present application, the first obtaining unit 501 includes:
a third obtaining subunit, configured to process the text to be trained by using an embedding layer of the text encoder to obtain a text embedding vector;
a fourth obtaining subunit, configured to compress the text embedding vector using a fully connected layer of the text encoder to obtain the text vector.
In an optional implementation manner of the embodiment of the present application, the training completion condition is convergence of the multi-stage generative adversarial network; or, the training completion condition is that the number of training iterations of the multi-stage generative adversarial network is greater than or equal to a preset number of training iterations.
Through the various implementations provided by this embodiment, a text to be trained is input into a text encoder to obtain a text vector, and a multi-stage generative adversarial network is trained with the text vector and the label image of the text to be trained to obtain a text image generation model. In this way, each stage of the multi-stage generative adversarial network progressively learns the association between the text to be trained and its label image, so that the resolution of the image generated by each stage gradually increases, abrupt texture information is avoided, and the trained text image generation model can generate high-resolution images.
Referring to fig. 6, a schematic structural diagram of an apparatus for generating a text image in an embodiment of the present application is shown. In this embodiment, on the basis of the text image generation model described in the above embodiment, the apparatus may specifically include:
a determining unit 601, configured to determine a target text of an image to be generated;
a third obtaining unit 602, configured to process the target text with a text encoder to obtain a target text vector;
a generating unit 603, configured to input the target text vector into a text image generation model, and generate a target image corresponding to the target text.
In an optional implementation manner of the embodiment of the present application, the generating unit 603 includes:
a fifth obtaining subunit, configured to concatenate the target text vector with the second random vector to obtain a first target concatenated vector;
the first generation subunit is used for processing the first target concatenated vector with the first-stage generative adversarial network in the text image generation model to generate a first target generated image;
a sixth obtaining subunit, configured to obtain a kth target concatenated vector based on the target text vector and a (k-1)th target generated image output by the (k-1)th-stage generative adversarial network in the text image generation model, where k is a positive integer and k ≥ 2;
the second generation subunit is used for processing the kth target concatenated vector with the kth-stage generative adversarial network in the text image generation model to generate a kth target generated image;
and the second application subunit is configured to use the target generated image generated by the last-stage generative adversarial network in the text image generation model as the target image corresponding to the target text.
In an optional implementation manner of the embodiment of the present application, the sixth obtaining subunit includes:
a seventh obtaining module, configured to process the (k-1)th target generated image with an image encoder to obtain a (k-1)th target image vector;
and the eighth obtaining module is used for concatenating the target text vector with the (k-1)th target image vector to obtain the kth target concatenated vector.
Through the various implementations provided by this embodiment, a target text for which an image is to be generated is determined; the target text is processed with a text encoder to obtain a target text vector; and the target text vector is input into the text image generation model to generate a target image corresponding to the target text. Because the text image generation model is a multi-stage generative adversarial network, it can generate high-resolution images; therefore, when the target text vector corresponding to the target text is processed with the text image generation model, the resulting target image has higher resolution, that is, higher image quality.
In addition, an embodiment of the present application further provides a terminal device, where the terminal device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute, according to the instructions in the program code, the method for training a text image generation model according to the foregoing method embodiments or the method for generating a text image according to the foregoing method embodiments.
An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a program code, and the program code is used to execute the method for training a text image generation model according to the foregoing method embodiment or the method for generating a text image according to the foregoing method embodiment.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application in any form. Although the present application has been disclosed above with reference to preferred embodiments, these are not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the present application, or amend it into equivalent embodiments, without departing from the scope of the technical solution of the present application. Therefore, any simple amendment, equivalent change, or modification made to the above embodiments according to the technical essence of the present application, without departing from the content of the technical solution of the present application, still falls within the protection scope of the technical solution of the present application.

Claims (14)

1. A method of training a text image generation model, comprising:
processing a text to be trained by using a text encoder to obtain a text vector;
and training a multi-stage generative adversarial network based on the text vector and the label image of the text to be trained, to obtain a text image generation model.
2. The method of claim 1, wherein training a multi-stage generative adversarial network based on the text vector and the label image of the text to be trained to obtain a text image generation model comprises:
concatenating the text vector and a first random vector to obtain a first concatenated vector;
training a first-stage generative adversarial network based on the first concatenated vector and the label image of the text to be trained;
obtaining a kth concatenated vector based on the text vector and a (k-1)th generated image output by a (k-1)th-stage generative adversarial network, where k is a positive integer and k ≥ 2;
training a kth-stage generative adversarial network based on the kth concatenated vector and the label image of the text to be trained;
and taking the trained multi-stage generative adversarial network as the text image generation model.
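Read procedurally, claim 2 describes a stage-by-stage training loop: the first stage consumes the text vector concatenated with noise, and each later stage consumes the text vector concatenated with an encoding of the previous stage's output (claim 5). A minimal PyTorch sketch under those assumptions follows; the `Stage` interface, `steps_per_stage`, and `noise_dim` are illustrative inventions of this sketch, not part of the claim.

```python
import torch

def train_multistage_gan(text_vec, label_image, stages, image_encoders,
                         noise_dim=100, steps_per_stage=1000):
    """Sketch of the staged training of claim 2 (hypothetical interfaces).

    stages         : list of stage objects, each assumed to expose .generator
                     and .train_step(input_vec, label_image); see the claim 3
                     sketch below for what one such step might do
    image_encoders : image_encoders[k-2] encodes the (k-1)th generated image
    """
    prev_image = None
    for k, stage in enumerate(stages, start=1):
        if k == 1:
            # First concatenated vector: [text vector; first random vector].
            noise = torch.randn(text_vec.size(0), noise_dim)
            stage_input = torch.cat([text_vec, noise], dim=1)
        else:
            # kth concatenated vector: [text vector; (k-1)th image vector].
            img_vec = image_encoders[k - 2](prev_image)
            stage_input = torch.cat([text_vec, img_vec], dim=1)
        for _ in range(steps_per_stage):
            stage.train_step(stage_input, label_image)
        prev_image = stage.generator(stage_input).detach()
    return stages
```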
3. The method of claim 2, wherein training the first-stage generative adversarial network based on the first concatenated vector and the label image of the text to be trained comprises:
processing the first concatenated vector with a generator network of the first-stage generative adversarial network to generate a first generated image;
sampling the label image of the text to be trained to the size of the first generated image to obtain a first label image;
processing the first generated image and the first label image with a discriminator network of the first-stage generative adversarial network to obtain a discrimination probability of the first generated image and a discrimination probability of the first label image;
and adjusting network parameters of the first-stage generative adversarial network based on the discrimination probability of the first generated image and the discrimination probability of the first label image.
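Claim 3 corresponds to a standard adversarial update, plus one detail specific to this method: the full-resolution label image is resampled to the first generator's output size before being shown to the discriminator. A sketch follows; it assumes the discriminator ends in a sigmoid (so its output can be read as the discrimination probability), and the bilinear resampling, BCE loss, and optimizers are choices of this sketch, since the claim fixes none of them.

```python
import torch
import torch.nn.functional as F

def first_stage_train_step(gen, disc, opt_g, opt_d,
                           first_concat_vec, label_image):
    """One adjustment of the first-stage GAN's parameters (claim 3 sketch)."""
    # Generate the first generated image from the concatenated vector.
    fake = gen(first_concat_vec)                      # e.g. (N, 3, 64, 64)
    # Sample the label image to the first generated image's size.
    real = F.interpolate(label_image, size=fake.shape[-2:],
                         mode="bilinear", align_corners=False)
    # Discrimination probabilities for both images, then discriminator update.
    p_real, p_fake = disc(real), disc(fake.detach())
    d_loss = (F.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
              F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: push the generated image toward scoring as "real".
    p_fake = disc(fake)
    g_loss = F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The kth-stage step of claim 4 has the same shape, with the kth concatenated vector as input and the label image sampled to the kth generated image's size.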
4. The method of claim 2, wherein training the kth-stage generative adversarial network based on the kth concatenated vector and the label image of the text to be trained comprises:
processing the kth concatenated vector with a generator network of the kth-stage generative adversarial network to generate a kth generated image;
sampling the label image of the text to be trained to the size of the kth generated image to obtain a kth label image;
processing the kth generated image and the kth label image with a discriminator network of the kth-stage generative adversarial network to obtain a discrimination probability of the kth generated image and a discrimination probability of the kth label image;
and adjusting network parameters of the kth-stage generative adversarial network based on the discrimination probability of the kth generated image and the discrimination probability of the kth label image.
5. The method of claim 2, wherein obtaining the kth concatenated vector based on the text vector and the (k-1)th generated image output by the (k-1)th-stage generative adversarial network comprises:
processing the (k-1)th generated image with an image encoder to obtain a (k-1)th image vector;
and concatenating the text vector and the (k-1)th image vector to obtain the kth concatenated vector.
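Claim 5 reduces to a two-line operation once an image encoder is fixed; the claim does not specify the encoder's architecture, so any network mapping an image to a flat vector would fit. A sketch:

```python
import torch

def kth_concat_vector(text_vec, prev_generated_image, image_encoder):
    """Claim 5 sketch: build the kth concatenated vector.

    `image_encoder` is any hypothetical module mapping an image to a flat
    vector; its architecture is not fixed by the claim.
    """
    prev_image_vec = image_encoder(prev_generated_image)  # (k-1)th image vector
    return torch.cat([text_vec, prev_image_vec], dim=1)   # kth concatenated vector
```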
6. The method of claim 1, wherein processing the text to be trained with a text encoder to obtain a text vector comprises:
processing the text to be trained with an embedding layer of the text encoder to obtain a text embedding vector;
compressing the text embedding vector with a fully connected layer of the text encoder to obtain the text vector.
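Claim 6 describes a two-layer text encoder: an embedding layer followed by a fully connected layer that compresses the embedded text into the text vector. A minimal sketch, with vocabulary size, sequence length, and dimensions as illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Claim 6 sketch: embedding layer + fully connected compression."""

    def __init__(self, vocab_size=10000, embed_dim=256,
                 max_len=32, text_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(max_len * embed_dim, text_dim)

    def forward(self, token_ids):
        # token_ids: (N, max_len) integer ids of the text to be trained.
        embedded = self.embedding(token_ids)            # text embedding vector
        return self.fc(embedded.flatten(start_dim=1))   # compressed text vector
```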
7. The method of claim 1, wherein a training completion condition is convergence of the multi-stage generative adversarial network; or the training completion condition is that the number of training iterations of the multi-stage generative adversarial network is greater than or equal to a preset number of training iterations.
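Claim 7 leaves open how convergence is detected. One common proxy, offered here purely as an assumption of this sketch, is to compare the mean loss over two successive windows and stop when the curve flattens, alongside the iteration cap the claim names explicitly:

```python
def training_complete(iteration, losses,
                      max_iterations=50000, window=100, tol=1e-3):
    """Claim 7 sketch: stop on an iteration cap or approximate convergence."""
    if iteration >= max_iterations:          # preset training iteration number
        return True
    if len(losses) >= 2 * window:            # flat loss as a convergence proxy
        recent = sum(losses[-window:]) / window
        earlier = sum(losses[-2 * window:-window]) / window
        return abs(recent - earlier) < tol
    return False
```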
8. A method of text image generation, comprising:
determining a target text for which an image is to be generated;
processing the target text by using a text encoder to obtain a target text vector;
inputting the target text vector into a text image generation model to generate a target image corresponding to the target text;
wherein the text image generation model is a text image generation model obtained by the method of any one of claims 1 to 7.
9. The method of claim 8, wherein inputting the target text vector into the text image generation model to generate a target image corresponding to the target text comprises:
concatenating the target text vector and a second random vector to obtain a first target concatenated vector;
processing the first target concatenated vector with a first-stage generative adversarial network in the text image generation model to generate a first target generated image;
obtaining a kth target concatenated vector based on the target text vector and a (k-1)th target generated image output by a (k-1)th-stage generative adversarial network in the text image generation model, where k is a positive integer and k ≥ 2;
processing the kth target concatenated vector with a kth-stage generative adversarial network in the text image generation model to generate a kth target generated image;
and taking the target generated image generated by the last-stage generative adversarial network in the text image generation model as the target image corresponding to the target text.
10. The method of claim 9, wherein obtaining the kth target concatenated vector based on the target text vector and the (k-1)th target generated image output by the (k-1)th-stage generative adversarial network in the text image generation model comprises:
processing the (k-1)th target generated image with an image encoder to obtain a (k-1)th target image vector;
and concatenating the target text vector and the (k-1)th target image vector to obtain the kth target concatenated vector.
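Claims 9 and 10 mirror the training-time data flow at inference: noise is concatenated only at the first stage, each later stage re-encodes the previous stage's output, and the last stage's output is kept. A sketch under the same hypothetical interfaces as the earlier sketches:

```python
import torch

@torch.no_grad()
def staged_generate(target_text_vec, generators, image_encoders, noise_dim=100):
    """Claims 9-10 sketch: run the stages in order; keep the last output.

    generators[0] is the first-stage generator; image_encoders[k-2] encodes
    the (k-1)th target generated image. All names are hypothetical.
    """
    # First target concatenated vector: [target text vector; second random vector].
    noise = torch.randn(target_text_vec.size(0), noise_dim)
    x = torch.cat([target_text_vec, noise], dim=1)
    image = generators[0](x)                             # first target generated image
    for gen, enc in zip(generators[1:], image_encoders):
        x = torch.cat([target_text_vec, enc(image)], dim=1)  # kth target concat vector
        image = gen(x)                                       # kth target generated image
    return image   # the last stage's output is the target image
```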
11. An apparatus for training a text image generation model, comprising:
a first obtaining unit, configured to process a text to be trained with a text encoder to obtain a text vector;
and a second obtaining unit, configured to train a multi-stage generative adversarial network based on the text vector and the label image of the text to be trained, to obtain a text image generation model.
12. An apparatus for text image generation, comprising:
a determining unit, configured to determine a target text for which an image is to be generated;
a third obtaining unit, configured to process the target text with a text encoder to obtain a target text vector;
and a generating unit, configured to input the target text vector into a text image generation model to generate a target image corresponding to the target text;
wherein the text image generation model is a text image generation model obtained by the method of any one of claims 1 to 7.
13. A terminal device, comprising a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to perform, according to instructions in the program code, the method of training a text image generation model of any one of claims 1 to 7 or the method of text image generation of any one of claims 8 to 10.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store program code for performing the method of training a text image generation model of any one of claims 1 to 7 or the method of text image generation of any one of claims 8 to 10.
CN202110008742.9A 2021-01-05 2021-01-05 Training text image generation model, text image generation method and device Pending CN112669215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110008742.9A CN112669215A (en) 2021-01-05 2021-01-05 Training text image generation model, text image generation method and device

Publications (1)

Publication Number Publication Date
CN112669215A true CN112669215A (en) 2021-04-16

Family

ID=75412965

Country Status (1)

Country Link
CN (1) CN112669215A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263203A * 2019-04-26 2019-09-20 Guilin University of Electronic Technology Text-to-image generation method combining Pearson reconstruction
CN112133291A * 2019-06-05 2020-12-25 iFLYTEK Co., Ltd. Language identification model training and language identification method, and related device
CN110706302A * 2019-10-11 2020-01-17 Zhongshan Yidi Technology Co., Ltd. System and method for synthesizing an image from text
CN110717555A * 2019-12-12 2020-01-21 Jiangsu Lianzhu Industrial Co., Ltd. Picture generation system and device based on natural language and a generative adversarial network
CN111353546A * 2020-03-09 2020-06-30 Tencent Technology (Shenzhen) Co., Ltd. Training method and device of image processing model, computer equipment and storage medium
CN111598051A * 2020-06-16 2020-08-28 Tencent Technology (Shenzhen) Co., Ltd. Face verification method, device and equipment, and readable storage medium
CN111968193A * 2020-07-28 2020-11-20 Xi'an Polytechnic University Text image generation method based on a StackGAN network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mo Jianwen et al., "Text-to-image generation model combining Pearson reconstruction", Journal of Guilin University of Electronic Technology, vol. 40, no. 01, pages 54-61 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091662A * 2021-11-26 2022-02-25 Guangdong Yilaite Electric Appliances Co., Ltd. Text image generation method and device, and electronic equipment
CN114091662B * 2021-11-26 2024-05-14 Guangdong Yilaite Household Electric Appliances Co., Ltd. Text image generation method and device, and electronic equipment
CN114119811A * 2022-01-28 2022-03-01 Beijing Zhipu Huazhang Technology Co., Ltd. Image generation method and device, and electronic equipment
CN114119811B * 2022-01-28 2022-04-01 Beijing Zhipu Huazhang Technology Co., Ltd. Image generation method and device, and electronic equipment
CN116778011A * 2023-05-22 2023-09-19 Alibaba (China) Co., Ltd. Image generating method
CN116778011B * 2023-05-22 2024-05-24 Alibaba (China) Co., Ltd. Image generating method
CN116721334A * 2023-08-11 2023-09-08 Tencent Technology (Shenzhen) Co., Ltd. Training method, device, equipment and storage medium of image generation model
CN116721334B * 2023-08-11 2023-11-21 Tencent Technology (Shenzhen) Co., Ltd. Training method, device, equipment and storage medium of image generation model

Similar Documents

Publication Publication Date Title
Hu et al. A novel image steganography method via deep convolutional generative adversarial networks
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN112669215A (en) Training text image generation model, text image generation method and device
CN107731228B (en) Text conversion method and device for English voice information
CN107480144B (en) Method and device for generating image natural language description with cross-language learning capability
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN109857871B (en) User relationship discovery method based on social network mass contextual data
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN113254654B (en) Model training method, text recognition method, device, equipment and medium
CN111667066A (en) Network model training and character recognition method and device and electronic equipment
CN113961736B Method, apparatus, computer device and storage medium for text-to-image generation
CN113205160B (en) Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium
CN113255763B (en) Model training method, device, terminal and storage medium based on knowledge distillation
KR20230152741A (en) Multi-modal few-shot learning using fixed language models
CN115438215A (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN111309893A Method and device for generating similar questions based on source questions
CN110363830B (en) Element image generation method, device and system
JP7056345B2 (en) Data analysis systems, methods, and programs
CN111488950B (en) Classification model information output method and device
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN109754416B (en) Image processing apparatus and method
CN111860212B Face image super-resolution method, device, equipment and storage medium
CN115705464A (en) Information processing method, device and equipment
CN113569080A (en) Word stock processing method, device, equipment and storage medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination