CN116051668B - Training method for a text-to-image diffusion model and text-based image generation method - Google Patents

Training method for a text-to-image diffusion model and text-based image generation method

Info

Publication number: CN116051668B
Authority: CN (China)
Prior art keywords: text, sample, training, graph, image
Legal status: Active (granted)
Application number: CN202211732667.1A
Other languages: Chinese (zh)
Other versions: CN116051668A
Inventors: 余欣彤, 刘佳祥, 冯仕堃
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Filing: Application filed by Beijing Baidu Netcom Science and Technology Co Ltd; priority to CN202211732667.1A; published as CN116051668A, granted as CN116051668B
Classifications

    • G06T 11/00: 2D [two-dimensional] image generation
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T 5/70: Denoising; smoothing
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Abstract

The present disclosure provides a training method for a text-to-image diffusion model and a text-based image generation method, and relates to the field of artificial intelligence, in particular to deep learning and natural language processing. A specific implementation scheme is as follows: denoise a noisy sample image according to its sample text using the text-to-image diffusion model to generate a denoised sample image; obtain a first text-image alignment score from a first representation vector of the denoised sample image and a second representation vector of the sample text, and select first training samples from the current batch of training samples based on the first text-image alignment score; determine a first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image, and adjust the text-to-image diffusion model based on the first loss function; and continue training with the next batch of training samples until training ends, yielding the target text-to-image diffusion model. This improves the training precision of the text-to-image diffusion model.

Description

Training method for a text-to-image diffusion model and text-based image generation method
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning and natural language processing, and specifically to a training method for a text-to-image diffusion model, a text-based image generation method, corresponding apparatuses, an electronic device, a storage medium, and a computer program product.
Background
With the continuous development of artificial intelligence technology, text-to-image diffusion models are widely applied in fields such as games, animation, and web design, offering high efficiency and a high degree of automation. For example, text may be input into a text-to-image diffusion model, which outputs an image. However, in the related art, training of text-to-image diffusion models suffers from low training precision.
Disclosure of Invention
The present disclosure provides a training method for a text-to-image diffusion model, a text-based image generation method, corresponding apparatuses, an electronic device, a storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a training method for a text-to-image diffusion model, including: denoising a noisy sample image according to its sample text using the text-to-image diffusion model to generate a denoised sample image, wherein the sample text and the noisy sample image form a group of training samples; obtaining a first text-image alignment score from a first representation vector of the denoised sample image and a second representation vector of the sample text, and selecting first training samples from the current batch of training samples based on the first text-image alignment score; determining a first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image, and adjusting the text-to-image diffusion model based on the first loss function; and continuing to train the adjusted text-to-image diffusion model with the next batch of training samples until training ends, yielding the final target text-to-image diffusion model.
According to another aspect of the present disclosure, there is provided a text-based image generation method, including: acquiring a target text; and inputting the target text into a target text-to-image diffusion model to output a target image corresponding to the target text, wherein the target text-to-image diffusion model is obtained by the above training method for a text-to-image diffusion model.
According to another aspect of the present disclosure, there is provided a training apparatus for a text-to-image diffusion model, including: a denoising module configured to denoise a noisy sample image according to its sample text using the text-to-image diffusion model to generate a denoised sample image, wherein the sample text and the noisy sample image form a group of training samples; a selection module configured to obtain a first text-image alignment score from a first representation vector of the denoised sample image and a second representation vector of the sample text, and to select first training samples from the current batch of training samples based on the first text-image alignment score; and a training module configured to determine a first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image, to adjust the text-to-image diffusion model based on the first loss function, and to continue training the adjusted text-to-image diffusion model with the next batch of training samples until training ends, yielding the final target text-to-image diffusion model.
According to another aspect of the present disclosure, there is provided a text-based image generating apparatus, including: an acquisition module configured to acquire a target text; and a generation module configured to input the target text into a target text-to-image diffusion model to output a target image corresponding to the target text, wherein the target text-to-image diffusion model is obtained by the above training method for a text-to-image diffusion model.
According to another aspect of the present disclosure, there is provided an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method for a text-to-image diffusion model or the text-based image generation method.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method for a text-to-image diffusion model or the text-based image generation method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the training method for a text-to-image diffusion model or of the text-based image generation method.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a training method for a text-to-image diffusion model according to a first embodiment of the present disclosure;
FIG. 2 is a flow diagram of a training method for a text-to-image diffusion model according to a second embodiment of the present disclosure;
FIG. 3 is a flow diagram of a training method for a text-to-image diffusion model according to a third embodiment of the present disclosure;
FIG. 4 is a schematic illustration of a text-to-image diffusion model and a text-image alignment model according to a fourth embodiment of the present disclosure;
FIG. 5 is a flow diagram of a training method for a text-to-image diffusion model according to a fifth embodiment of the present disclosure;
FIG. 6 is a flow diagram of a text-based image generation method according to a first embodiment of the present disclosure;
FIG. 7 is a block diagram of a training apparatus for a text-to-image diffusion model according to a first embodiment of the present disclosure;
FIG. 8 is a block diagram of a text-based image generating apparatus according to a first embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a training method for a text-to-image diffusion model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
AI (Artificial Intelligence) is the technical science of studying and developing theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. AI technology offers a high degree of automation, high accuracy, and low cost, and is widely applied.
DL (Deep Learning) is a newer research direction in the field of ML (Machine Learning): it learns the inherent rules and representation hierarchies of sample data so that a machine can, like a person, analyze and learn, recognizing data such as text, images, and sounds. It is widely applied in speech and image recognition.
NLP (Natural Language Processing) is an important direction in computer science and artificial intelligence that studies computer systems, in particular the software systems therein, capable of effective natural-language communication.
Fig. 1 is a flow diagram of a training method for a text-to-image diffusion model according to a first embodiment of the present disclosure.
As shown in fig. 1, the training method for a text-to-image diffusion model according to the first embodiment of the present disclosure includes:
S101, denoising a noisy sample image according to its sample text using the text-to-image diffusion model to generate a denoised sample image, wherein the sample text and the noisy sample image form a group of training samples.
It should be noted that, the execution body of the training method of the text-to-graph diffusion model according to the embodiments of the present disclosure may be a hardware device with data information processing capability and/or software necessary for driving the hardware device to work. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other intelligent devices. The user terminal comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like.
It should be noted that the text-to-image diffusion model, the sample text, and the noisy sample image are not unduly limited here. For example, the sample text may be Chinese, English, Japanese, etc., and the noisy sample image may be a two-dimensional image, a three-dimensional image, etc.
It will be appreciated that different sample texts may correspond to different noisy sample images. For example, the sample text may be a "landscape", the corresponding noisy sample image may be a landscape-like image, the sample text may be a "cartoon", the corresponding noisy sample image may be a cartoon-like image, the sample text may be a "person", and the corresponding noisy sample image may be a person-like image.
It should be noted that the specific denoising method is not unduly limited; any denoising method in the related art may be adopted, such as mean filtering, median filtering, or wavelet denoising.
S102, obtaining a first text-image alignment score from a first representation vector of the denoised sample image and a second representation vector of the sample text, and selecting first training samples from the current batch of training samples based on the first text-image alignment score.
In one embodiment, feature extraction may be performed on the denoised sample image to obtain the first representation vector, and/or on the sample text to obtain the second representation vector.
In one embodiment, the first text-image alignment score characterizes the correlation between the denoised sample image and the sample text. For example, the correlation is positively correlated with the first text-image alignment score: the higher the correlation between the denoised sample image and the sample text, the higher the first text-image alignment score.
In one embodiment, obtaining the first text-image alignment score from the first representation vector of the denoised sample image and the second representation vector of the sample text includes obtaining the similarity between the first representation vector and the second representation vector, and obtaining the first text-image alignment score based on that similarity.
In some examples, obtaining the similarity between the first representation vector and the second representation vector includes obtaining the cosine distance between them and basing the similarity on it; for example, the cosine distance is positively correlated with the similarity, i.e., the greater the cosine distance, the higher the similarity.
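As an illustration only (not part of the patent text), the scoring step might look like the following Python/PyTorch sketch, which reads the "cosine distance" above as cosine similarity and assumes an illustrative 0-100 score range:
```python
import torch
import torch.nn.functional as F

def alignment_score(image_vec: torch.Tensor, text_vec: torch.Tensor) -> torch.Tensor:
    """Map the similarity of two representation vectors to an alignment score.

    Assumptions: "cosine distance" is read as cosine similarity, and the
    linear mapping to a 0-100 range mirrors the example values used below;
    the text only fixes that the score rises with the similarity.
    """
    cos = F.cosine_similarity(image_vec, text_vec, dim=-1)  # in [-1, 1]
    return (cos + 1.0) * 50.0                               # rescale to [0, 100]
```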
In an embodiment of the present disclosure, the training samples of the current batch include a plurality of training samples, and the first training samples are a subset of them; the number of first training samples is not unduly limited.
In one embodiment, selecting the first training samples from the current batch based on the first text-image alignment score includes screening the current batch for training samples whose first text-image alignment score is below a set threshold, or sorting the first text-image alignment scores in ascending order and determining the training samples with the N lowest scores as the first training samples. It will be appreciated that the smaller the first text-image alignment score, the lower the correlation between the denoised sample image and the sample text, i.e., the worse the generation effect of the denoised sample image produced by the text-to-image diffusion model; the corresponding training sample is a hard sample. The method can therefore screen the current batch for first training samples with lower first text-image alignment scores, realizing hard-sample mining.
It should be noted that the set threshold is not unduly limited; for example, if the first text-image alignment score ranges from 0 to 100 points, the set threshold may be 60 points. N is a positive integer and is likewise not unduly limited; for example, N may be half the total number of training samples in the current batch.
For example, suppose the current batch includes training samples A, B, and C: training sample A includes the sample text "scenery" and noisy sample image 1, training sample B includes the sample text "cartoon" and noisy sample image 2, and training sample C includes the sample text "person" and noisy sample image 3.
The text-to-image diffusion model denoises noisy sample image 1 according to the sample text "scenery" to generate denoised sample image 4, denoises noisy sample image 2 according to the sample text "cartoon" to generate denoised sample image 5, and denoises noisy sample image 3 according to the sample text "person" to generate denoised sample image 6.
Suppose the first text-image alignment score obtained from the first representation vector of denoised sample image 4 and the second representation vector of the sample text "scenery" is 80 points, the score obtained from the first representation vector of denoised sample image 5 and the second representation vector of the sample text "cartoon" is 30 points, and the score obtained from the first representation vector of denoised sample image 6 and the second representation vector of the sample text "person" is 40 points.
If the set threshold is 60 points, training samples B and C are determined to be the first training samples.
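A minimal sketch of this selection step follows (illustrative only; the threshold of 60 points and the use of PyTorch come from the example above and are not requirements of the text):
```python
from typing import Optional

import torch

def select_hard_samples(scores: torch.Tensor,
                        threshold: float = 60.0,
                        top_n: Optional[int] = None) -> torch.Tensor:
    """Return batch indices of the first training samples.

    Either keep samples whose first text-image alignment score is below a
    set threshold, or keep the N samples with the lowest scores.
    """
    if top_n is None:
        return (scores < threshold).nonzero(as_tuple=True)[0]
    return torch.topk(scores, k=top_n, largest=False).indices

# With the example above: scores [80, 30, 40] and threshold 60 points
# select training samples B and C (indices 1 and 2).
scores = torch.tensor([80.0, 30.0, 40.0])
print(select_hard_samples(scores))  # tensor([1, 2])
```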
S103, determining a first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image, and adjusting the text-to-image diffusion model based on the first loss function.
It should be noted that the sample text, the noisy sample image, and the original sample image form a group of training samples; the noisy sample image can be obtained by adding noise to the original sample image.
It should be noted that the type of the first loss function is not unduly limited. For example, the first loss function may be CE (Cross Entropy), BCE (Binary Cross Entropy), LMSE (Least Mean Square Error), etc., where LMSE is also called the two-norm loss function.
In one embodiment, determining the first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image includes deriving the first loss function from a third representation vector of the original sample image and the first representation vector of the denoised sample image.
In some examples, feature extraction may be performed on the original sample image to obtain the third representation vector.
In some examples, deriving the first loss function from the third representation vector of the original sample image and the first representation vector of the denoised sample image includes obtaining the difference between the third representation vector and the first representation vector and obtaining the first loss function based on that difference.
It should be noted that the specific manner of adjusting the text-to-image diffusion model based on the first loss function is not unduly limited; any adjustment manner in the related art may be adopted. In some examples, gradient information of the first loss function may be obtained and the model parameters of the text-to-image diffusion model updated based on the gradient information, for example by back-propagation.
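For illustration, one training step under these choices could be sketched as follows (assumptions: PyTorch, the LMSE/two-norm option for the first loss, and pre-computed representation vectors; none of this is mandated by the text):
```python
import torch
import torch.nn.functional as F

def diffusion_training_step(optimizer: torch.optim.Optimizer,
                            original_vec: torch.Tensor,
                            denoised_vec: torch.Tensor) -> float:
    """One adjustment of the text-to-image diffusion model.

    The first loss is built from the difference between the third
    representation vector (original image) and the first representation
    vector (denoised image); the two-norm (MSE) option is assumed here,
    and denoised_vec is assumed to carry gradients from the model.
    """
    loss = F.mse_loss(denoised_vec, original_vec)  # first loss function
    optimizer.zero_grad()
    loss.backward()    # gradient information of the first loss
    optimizer.step()   # back-propagation update of the model parameters
    return loss.item()
```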
S104, continuing to train the adjusted text-to-image diffusion model with the next batch of training samples until training ends, yielding the final target text-to-image diffusion model.
In embodiments of the present disclosure, there may be multiple batches of training samples. For training the adjusted text-to-image diffusion model on the next batch, refer to steps S101-S103, which are not repeated here.
For example, there may be M batches of training samples, M being a positive integer.
First training samples A1 are selected from the training samples of the 1st batch; based on the original sample images of the sample texts in A1 and the denoised sample images, a first loss function B1 of the text-to-image diffusion model is determined, and the text-to-image diffusion model is adjusted based on B1.
First training samples A2 are selected from the training samples of the 2nd batch; based on the original sample images of the sample texts in A2 and the denoised sample images, a first loss function B2 of the text-to-image diffusion model is determined, and the text-to-image diffusion model is adjusted based on B2.
First training samples A3 are selected from the training samples of the 3rd batch; based on the original sample images of the sample texts in A3 and the denoised sample images, a first loss function B3 of the text-to-image diffusion model is determined, and the text-to-image diffusion model is adjusted based on B3.
It should be noted that the training process for the 4th to Mth batches follows that of the 1st to 3rd batches and is not repeated here.
In summary, in the training method for a text-to-image diffusion model of the embodiments of the present disclosure, the first representation vector of the denoised sample image and the second representation vector of the sample text are jointly considered to obtain the first text-image alignment score; first training samples are selected from the current batch based on that score; the first loss function is determined from the original sample images of the sample texts in the first training samples and the denoised sample images; and the text-to-image diffusion model is adjusted based on the first loss function. That is, the scheme does not obtain the first loss function from all training samples of the current batch, but only from part of them (the first training samples). This improves the model's learning on the first training samples, enables targeted training, and improves the training precision of the text-to-image diffusion model; the adjusted model then continues training on the next batch until training ends, yielding the final target text-to-image diffusion model.
Fig. 2 is a flow diagram of a training method for a text-to-image diffusion model according to a second embodiment of the present disclosure.
As shown in fig. 2, the training method for a text-to-image diffusion model according to the second embodiment of the present disclosure includes:
S201, denoising a noisy sample image according to its sample text using the text-to-image diffusion model to generate a denoised sample image, wherein the sample text and the noisy sample image form a group of training samples.
S202, obtaining a first text-image alignment score from the first representation vector of the denoised sample image and the second representation vector of the sample text.
For steps S201-S202, see the embodiments above; they are not repeated here.
S203, obtaining a second text-image alignment score between the sample text and the original sample image.
For step S203, refer to the description of the first text-image alignment score in the embodiments above; it is not repeated here.
In one embodiment, obtaining the second text-image alignment score between the sample text and the original sample image includes obtaining it from a third representation vector of the original sample image and the second representation vector of the sample text.
S204, obtaining the score difference between the first text-image alignment score and the second text-image alignment score.
It will be appreciated that the score difference characterizes the gap between the denoised sample image and the original sample image. In some examples, the score difference is positively correlated with that gap: the higher the score difference, the greater the gap between the denoised sample image and the original sample image.
In one embodiment, obtaining the score difference includes subtracting the first text-image alignment score from the second text-image alignment score.
S205, selecting the first training samples from the current batch of training samples according to the score difference.
In one embodiment, selecting the first training samples according to the score difference includes selecting, from the current batch, the training samples whose score difference is greater than a set threshold as the first training samples; or sorting the training samples of the current batch in descending order of score difference and selecting the top-ranked portion as the first training samples. In this way, first training samples with higher score differences can be screened from the current batch, realizing the mining of high-quality hard samples while avoiding the selection of low-quality training samples, which improves the model's learning on high-quality hard samples.
It will be appreciated that if the quality of a training sample is low, the correlation between the sample text and the original sample image is low, the second text-image alignment score is also low, and the resulting score difference is low. Conversely, if the quality of a training sample is high, the correlation between the sample text and the original sample image is high and the second text-image alignment score is high; if, in addition, the denoised sample image generated by the text-to-image diffusion model is poor, the first text-image alignment score is low, and the resulting score difference is high. Such a training sample is a high-quality hard sample.
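A minimal sketch of this score-gap filter (illustrative only; the tensor shapes, PyTorch, and the parameter names are assumptions):
```python
from typing import Optional

import torch

def select_by_score_gap(first_scores: torch.Tensor,
                        second_scores: torch.Tensor,
                        threshold: Optional[float] = None,
                        top_n: Optional[int] = None) -> torch.Tensor:
    """Select high-quality hard samples by the gap between the second
    text-image alignment score (original image vs. text) and the first
    (denoised image vs. text). Returns batch indices.
    """
    diff = second_scores - first_scores  # high gap: good pair, poor denoising
    if threshold is not None:
        return (diff > threshold).nonzero(as_tuple=True)[0]
    return torch.topk(diff, k=top_n).indices  # N largest gaps
```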
S206, determining a first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image, and adjusting the text-to-image diffusion model based on the first loss function.
S207, continuing to train the adjusted text-to-image diffusion model with the next batch of training samples until training ends, yielding the final target text-to-image diffusion model.
For steps S206-S207, see the embodiments above; they are not repeated here.
In summary, in the training method for a text-to-image diffusion model of this embodiment, the second text-image alignment score between the sample text and the original sample image is obtained, and the first training samples are selected from the current batch based on the difference between the first and second text-image alignment scores. The original sample image is thus taken into account when screening the first training samples, improving their accuracy.
Fig. 3 is a flow diagram of a training method for a text-to-image diffusion model according to a third embodiment of the present disclosure.
As shown in fig. 3, the training method for a text-to-image diffusion model according to the third embodiment of the present disclosure includes:
S301, denoising a noisy sample image according to its sample text using the text-to-image diffusion model to generate a denoised sample image, wherein the sample text and the noisy sample image form a group of training samples.
For step S301, see the embodiments above; it is not repeated here.
S302, image-encoding the denoised sample image to obtain the first representation vector.
It should be noted that the specific image-encoding method is not limited; any image-encoding method in the related art may be used, such as Huffman coding, predictive coding, transform coding, or block coding.
In one embodiment, as shown in fig. 4, the text-image alignment model includes an image encoder. The denoised sample image may be input into the text-image alignment model, and the image encoder therein image-encodes the denoised sample image to obtain the first representation vector.
It should be noted that the text-image alignment model and the image encoder are not unduly limited here.
S303, scoring the correlation between the denoised sample image and the sample text according to the first and second representation vectors, based on the text-image alignment model, to obtain the first text-image alignment score.
In one embodiment, as shown in fig. 4, the text-image alignment model further includes a text-image alignment layer, with the output of the image encoder connected to its input. The second representation vector may be input into the text-image alignment model, and the text-image alignment layer scores the correlation between the denoised sample image and the sample text according to the first and second representation vectors to obtain the first text-image alignment score.
S304, obtaining a second loss function of the text-image alignment model from the second text-image alignment score between the sample text and the original sample image and the first text-image alignment score.
It should be noted that the type of the second loss function is not unduly limited. For example, the second loss function may be CE (Cross Entropy), BCE (Binary Cross Entropy), LMSE (Least Mean Square Error), etc., where LMSE is also called the two-norm loss function.
In one embodiment, obtaining the second loss function from the second and first text-image alignment scores includes obtaining the score difference between the first text-image alignment score and the second text-image alignment score, and obtaining the second loss function from that score difference.
S305, adjusting the text-image alignment model based on the second loss function.
It should be noted that the specific manner of adjusting the text-image alignment model based on the second loss function is not unduly limited; any adjustment manner in the related art may be adopted. In some examples, gradient information of the second loss function may be obtained and the model parameters of the text-image alignment model updated based on the gradient information, for example by back-propagation.
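For illustration, one update of the alignment model might be sketched as follows (assumptions: PyTorch, and the two-norm/MSE option for turning the score difference into the second loss; the text fixes neither):
```python
import torch
import torch.nn.functional as F

def alignment_training_step(optimizer: torch.optim.Optimizer,
                            first_score: torch.Tensor,
                            second_score: torch.Tensor) -> float:
    """One adjustment of the text-image alignment model.

    The second loss is built from the difference between the first and
    second text-image alignment scores; squaring it (MSE) is an assumed
    instance of the LMSE option listed above. first_score is assumed to
    carry gradients from the alignment model.
    """
    loss2 = F.mse_loss(first_score, second_score)  # second loss function
    optimizer.zero_grad()
    loss2.backward()
    optimizer.step()
    return loss2.item()
```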
S306, continuing to train the adjusted text-image alignment model with the next batch of training samples.
In embodiments of the present disclosure, there may be multiple batches of training samples. For training the adjusted text-image alignment model on the next batch, refer to steps S304-S305, which are not repeated here.
It should be noted that the end-of-training condition for the text-image alignment model may be the generation of the target text-to-image diffusion model, i.e., training of the text-image alignment model ends when the target text-to-image diffusion model is generated.
S307, selecting the first training samples from the current batch of training samples based on the first text-image alignment score.
S308, determining a first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image, and adjusting the text-to-image diffusion model based on the first loss function.
S309, continuing to train the adjusted text-to-image diffusion model with the next batch of training samples until training ends, yielding the final target text-to-image diffusion model.
For steps S307-S309, see the embodiments above; they are not repeated here.
In summary, in the training method for a text-to-image diffusion model of this embodiment, the text-image alignment model may be used to score the correlation between the denoised sample image and the sample text according to the first and second representation vectors, yielding the first text-image alignment score; the second loss function is determined from the first and second text-image alignment scores, and the text-image alignment model is trained based on it. In this way, synchronous training of the text-image alignment model and the text-to-image diffusion model is realized.
Fig. 5 is a flow diagram of a training method for a text-to-image diffusion model according to a fourth embodiment of the present disclosure.
As shown in fig. 5, the training method for a text-to-image diffusion model according to the fourth embodiment of the present disclosure includes:
S501, adding Gaussian noise to the original sample image of the sample text to obtain the noisy sample image corresponding to the sample text.
It should be noted that the specific manner of adding Gaussian noise is not unduly limited; any Gaussian noising method in the related art may be used. In some examples, a sequence of values drawn from a Gaussian distribution may be obtained, and the sum of each pixel's gray value in the original sample image and the corresponding value in the sequence is taken as that pixel's gray value in the noisy sample image.
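A minimal sketch of this noising step (the float image tensor and the noise scale sigma are illustrative assumptions, not values from the text):
```python
import torch

def add_gaussian_noise(original_image: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Add element-wise Gaussian noise to an original sample image:
    each pixel value becomes its original value plus a Gaussian draw."""
    return original_image + sigma * torch.randn_like(original_image)
```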
S502, inputting the sample text and the noisy sample image into the text-to-image diffusion model, where a text encoder in the text-to-image diffusion model text-encodes the sample text to generate the second representation vector.
It should be noted that the specific text-encoding method is not limited and may be implemented by any text-encoding method in the related art, for example ASCII (American Standard Code for Information Interchange), Unicode, etc.
In an embodiment of the present disclosure, as shown in fig. 4, the text-to-image diffusion model includes a text encoder; the sample text and the noisy sample image may be input into the text-to-image diffusion model, and the text encoder text-encodes the sample text to generate the second representation vector. The output of the text encoder is connected to the input of the text-image alignment layer.
S503, denoising the noisy sample image layer by layer based on the second representation vector, using a plurality of text-to-image layers in the text-to-image diffusion model, to obtain the denoised sample image.
In an embodiment of the present disclosure, as shown in fig. 4, the text-to-image diffusion model further includes a plurality of text-to-image layers.
It should be noted that the specific manner of layer-by-layer denoising is not unduly limited; any denoising manner in the related art may be adopted. In some examples, the i-th Gaussian noise may be derived from the second representation vector by the i-th text-to-image layer and removed from the noisy sample image, until the last text-to-image layer is reached and the final denoised sample image is generated.
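For illustration, the layer-by-layer loop might be sketched as follows (assumptions: PyTorch, and that each text-to-image layer is a module mapping the current image and the second representation vector to a noise estimate):
```python
import torch

def denoise_layer_by_layer(text_to_image_layers, noisy_image: torch.Tensor,
                           text_vec: torch.Tensor) -> torch.Tensor:
    """Each i-th layer predicts the i-th Gaussian noise from the text
    representation; that noise is removed before moving to the next layer."""
    x = noisy_image
    for layer in text_to_image_layers:
        predicted_noise = layer(x, text_vec)  # i-th Gaussian noise estimate
        x = x - predicted_noise               # remove it from the image
    return x  # final denoised sample image
```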
S504, obtaining a first text-image alignment score from the first representation vector of the denoised sample image and the second representation vector of the sample text, and selecting the first training samples from the current batch of training samples based on the first text-image alignment score.
S505, determining a first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image, and adjusting the text-to-image diffusion model based on the first loss function.
S506, continuing to train the adjusted text-to-image diffusion model with the next batch of training samples until training ends, yielding the final target text-to-image diffusion model.
For steps S504-S506, see the embodiments above; they are not repeated here.
In summary, in the training method for a text-to-image diffusion model of this embodiment, Gaussian noise may be added to the original sample image to obtain the noisy sample image; the sample text and the noisy sample image are input into the text-to-image diffusion model, whose text encoder text-encodes the sample text to generate the second representation vector; and the noisy sample image is denoised layer by layer based on the second representation vector by the plurality of text-to-image layers, yielding the denoised sample image.
Fig. 6 is a flow chart of a text-based image generating method according to a sixth embodiment of the present disclosure.
As shown in fig. 6, a text-based image generating method of a sixth embodiment of the present disclosure includes:
s601, acquiring a target text.
It should be noted that, the execution subject of the text-based image generating method according to the embodiment of the present disclosure may be a hardware device having data information processing capability and/or software necessary for driving the hardware device to operate. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other intelligent devices. The user terminal comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like.
It should be noted that the target text is not unduly limited; it may include Chinese, English, Japanese, etc.
In one embodiment, taking a user terminal as the execution subject as an example, the user terminal may obtain the target text from its own storage space, obtain target text input by the user on a display interface, or crawl target text from web pages.
S602, inputting the target text into a target text-to-image diffusion model to output a target image corresponding to the target text, where the target text-to-image diffusion model is obtained by the training method for a text-to-image diffusion model.
It should be noted that the target text-to-image diffusion model may be obtained by the training method for a text-to-image diffusion model described with reference to figs. 1 to 5, which is not repeated here.
For example, the target text may be "scenery", the corresponding target image may be a scenery-like image, the target text may be "cartoon", the corresponding target image may be a cartoon-like image, the target text may be "person", and the corresponding target image may be a person-like image.
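Purely as an illustration of this usage (the model handle and call signature below are assumptions, not an API from the text):
```python
# `target_t2i_model` stands for the trained target text-to-image diffusion
# model; calling it with a target text is assumed to return the image.
target_text = "scenery"
target_image = target_t2i_model(target_text)  # e.g. a scenery-like image
```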
In one embodiment, taking the execution subject as a user terminal as an example, the method further includes displaying the target image on a display interface of the user terminal.
In summary, in the text-based image generation method of the embodiments of the present disclosure, a target text is acquired and input into a target text-to-image diffusion model to output a target image corresponding to the target text, where the target text-to-image diffusion model is obtained by the training method for a text-to-image diffusion model. The high precision of the target text-to-image diffusion model helps improve the precision of the target image.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information comply with relevant laws and regulations and do not violate public order and good customs.
According to an embodiment of the present disclosure, the present disclosure further provides a training apparatus for a text-to-image diffusion model, for implementing the above training method for a text-to-image diffusion model.
Fig. 7 is a block diagram of a training apparatus for a text-to-image diffusion model according to a first embodiment of the present disclosure.
As shown in fig. 7, a training apparatus 700 for a text-to-image diffusion model according to an embodiment of the present disclosure includes: a denoising module 701, a selection module 702, and a training module 703.
The denoising module 701 is configured to denoise a noisy sample image according to its sample text using the text-to-image diffusion model to generate a denoised sample image, where the sample text and the noisy sample image form a group of training samples;
the selection module 702 is configured to obtain a first text-image alignment score from the first representation vector of the denoised sample image and the second representation vector of the sample text, and to select first training samples from the current batch of training samples based on the first text-image alignment score;
the training module 703 is configured to determine a first loss function of the text-to-image diffusion model from the original sample image of the sample text in the first training samples and the denoised sample image, and to adjust the text-to-image diffusion model based on the first loss function;
the training module 703 is further configured to continue training the adjusted text-to-image diffusion model with the next batch of training samples until training ends, yielding the final target text-to-image diffusion model.
In one embodiment of the present disclosure, the selection module 702 is further configured to: obtain a second text-image alignment score between the sample text and the original sample image; obtain the score difference between the first and second text-image alignment scores; and select the first training samples from the current batch of training samples according to the score difference.
In one embodiment of the present disclosure, the selection module 702 is further configured to: select, from the current batch, the training samples whose score difference is greater than a set threshold as the first training samples; or sort the training samples of the current batch in descending order of score difference and select the top-ranked portion as the first training samples.
In one embodiment of the present disclosure, the selection module 702 is further configured to: image-encode the denoised sample image to obtain the first representation vector; and score the correlation between the denoised sample image and the sample text according to the first and second representation vectors, based on a text-image alignment model, to obtain the first text-image alignment score.
In one embodiment of the present disclosure, after the first text-image alignment score is obtained, the training module 703 is further configured to: obtain a second loss function of the text-image alignment model from the second text-image alignment score between the sample text and the original sample image and the first text-image alignment score; adjust the text-image alignment model based on the second loss function; and continue training the adjusted text-image alignment model with the next batch of training samples.
In one embodiment of the present disclosure, the training apparatus 700 further includes a noising module configured to add Gaussian noise to the original sample image of the sample text to obtain the noisy sample image corresponding to the sample text.
In one embodiment of the present disclosure, the denoising module 701 is further configured to: input the sample text and the noisy sample image into the text-to-image diffusion model, where a text encoder in the text-to-image diffusion model text-encodes the sample text to generate the second representation vector; and denoise the noisy sample image layer by layer based on the second representation vector using a plurality of text-to-image layers in the text-to-image diffusion model to obtain the denoised sample image.
In summary, the training apparatus for a text-to-image diffusion model of the embodiments of the present disclosure jointly considers the first representation vector of the denoised sample image and the second representation vector of the sample text to obtain the first text-image alignment score, selects the first training samples from the current batch based on that score, determines the first loss function from the original sample images of the sample texts in the first training samples and the denoised sample images, and adjusts the text-to-image diffusion model based on the first loss function. That is, the scheme obtains the first loss function from only part of the current batch (the first training samples) rather than from all of it, which improves the model's learning on the first training samples, enables targeted training, and improves the training precision of the text-to-image diffusion model; the adjusted model then continues training on the next batch until training ends, yielding the final target text-to-image diffusion model.
According to an embodiment of the present disclosure, the present disclosure further provides a text-based image generating apparatus, configured to implement the above-described text-based image generating method.
Fig. 8 is a block diagram of a text-based image generating apparatus according to a first embodiment of the present disclosure.
As shown in fig. 8, a text-based image generating apparatus 800 of an embodiment of the present disclosure includes: an acquisition module 801 and a generation module 802.
The acquisition module 801 is used for acquiring a target text;
the generation module 802 is configured to input the target text into a target text-to-image diffusion model to output a target image corresponding to the target text, where the target text-to-image diffusion model is obtained by the training method for a text-to-image diffusion model of the present disclosure.
In summary, the text-based image generating apparatus of the embodiments of the present disclosure acquires a target text and inputs it into a target text-to-image diffusion model to output a target image corresponding to the target text, where the target text-to-image diffusion model is obtained by the training method for a text-to-image diffusion model. The high precision of the target text-to-image diffusion model helps improve the precision of the target image.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the methods and processes described above, such as the training method for a text-to-image diffusion model described with reference to figs. 1 to 5 and the text-based image generation method described with reference to fig. 6. For example, in some embodiments, the training method for a text-to-image diffusion model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method for a text-to-image diffusion model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the training method for a text-to-image diffusion model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
According to an embodiment of the disclosure, the disclosure further provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the training method of the text-to-image diffusion model and the steps of the text-based image generation method described above.
It should be appreciated that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A method of training a text-to-image diffusion model, wherein the method comprises:
carrying out noise reduction processing on a noise-added sample image according to a sample text by using the text-to-image diffusion model to generate a noise-reduced sample image, wherein the sample text and the noise-added sample image form a group of training samples;
obtaining a first text-to-image alignment score according to a first representation vector of the noise-reduced sample image and a second representation vector of the sample text, and selecting a first training sample from the training samples of the current batch based on the first text-to-image alignment score, wherein the first text-to-image alignment score is used to characterize a correlation between the noise-reduced sample image and the sample text;
determining a first loss function of the text-to-image diffusion model according to an original sample image and the noise-reduced sample image of the sample text in the first training sample, and adjusting the text-to-image diffusion model based on the first loss function; and
continuing to train the adjusted text-to-image diffusion model with the training samples of the next batch until training ends, to obtain a final target text-to-image diffusion model.
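By way of illustration only, the training procedure of claim 1 can be sketched in Python roughly as follows. All identifiers (diffusion_model, alignment_model, encode_image, score, top_k) are hypothetical and not taken from the disclosure, and the top-k selection shown here is one simple instantiation; the claimed score-difference selection is sketched after claim 3 below.

    import torch
    import torch.nn.functional as F

    def train_on_batch(diffusion_model, alignment_model, optimizer, batch, top_k):
        # Each training sample pairs a sample text with a noise-added image;
        # the original image is kept for computing the first loss function.
        texts, noised_images, original_images = batch

        # Denoise the noise-added images conditioned on the sample texts; the
        # model also yields the text representation (second representation vector).
        denoised_images, text_vecs = diffusion_model(texts, noised_images)

        # First text-to-image alignment score between each denoised image and
        # its sample text (claim 4 details the scoring).
        image_vecs = alignment_model.encode_image(denoised_images)
        scores = alignment_model.score(image_vecs, text_vecs)

        # Select the first training samples; keeping the top-k scores is one
        # simple rule (claims 2-3 claim a score-difference rule instead).
        selected = scores.topk(min(top_k, scores.numel())).indices

        # First loss function on the selected samples only, then adjust the model.
        loss = F.mse_loss(denoised_images[selected], original_images[selected])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()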
2. The method of claim 1, wherein the selecting a first training sample from the training samples of the current batch based on the first text-to-image alignment score comprises:
obtaining a second text-to-image alignment score between the sample text and the original sample image, the second text-to-image alignment score being used to characterize a correlation between the sample text and the original sample image;
obtaining a score difference value between the first text-to-image alignment score and the second text-to-image alignment score; and
selecting the first training sample from the training samples of the current batch according to the score difference value.
3. The method of claim 2, wherein the selecting the first training sample from the training samples of the current batch according to the score difference value comprises:
selecting, from the training samples of the current batch, training samples whose score difference value is larger than a set threshold value as the first training samples; or
sorting the training samples of the current batch in descending order of the score difference value, and selecting the top-ranked portion of the training samples as the first training samples.
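A minimal sketch of the two selection rules of claim 3, assuming per-sample score tensors; threshold and keep_ratio are illustrative parameters, not values taken from the disclosure.

    import torch

    def select_first_training_samples(first_scores, second_scores,
                                      threshold=None, keep_ratio=0.5):
        # Score difference between the alignment score of the denoised image
        # and that of the original image (claim 2).
        diff = first_scores - second_scores
        if threshold is not None:
            # Option 1: keep samples whose difference exceeds the set threshold.
            return torch.nonzero(diff > threshold, as_tuple=True)[0]
        # Option 2: sort in descending order and keep the top-ranked portion.
        k = max(1, int(keep_ratio * diff.numel()))
        return torch.argsort(diff, descending=True)[:k]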
4. A method according to any one of claims 1-3, wherein said obtaining a first text-to-image alignment score according to the first representation vector of the noise-reduced sample image and the second representation vector of the sample text comprises:
performing image coding on the noise-reduced sample image to obtain the first representation vector; and
carrying out relevance scoring on the noise-reduced sample image and the sample text according to the first representation vector and the second representation vector based on a text-to-image alignment model, to obtain the first text-to-image alignment score.
5. The method of claim 4, wherein, after the obtaining the first text-to-image alignment score, the method further comprises:
obtaining a second loss function of the text-to-image alignment model according to a second text-to-image alignment score between the sample text and the original sample image and the first text-to-image alignment score;
adjusting the text-to-image alignment model based on the second loss function; and
continuing to train the adjusted text-to-image alignment model with the training samples of the next batch.
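A hedged sketch of the scoring of claim 4 and the alignment-model update of claim 5. The cosine-similarity scorer and the margin form of the second loss are assumptions; the claims fix neither, only that the second loss is derived from the two alignment scores.

    import torch.nn.functional as F

    def first_alignment_scores(alignment_model, denoised_images, text_vecs):
        # Claim 4: image-encode the noise-reduced images, then score their
        # relevance to the text representation, e.g. by cosine similarity.
        image_vecs = alignment_model.encode_image(denoised_images)
        return F.cosine_similarity(image_vecs, text_vecs, dim=-1)

    def second_loss(first_scores, second_scores, margin=0.0):
        # Claim 5: one plausible second loss pushes the (text, original image)
        # score above the (text, denoised image) score by at least `margin`.
        return F.relu(first_scores - second_scores + margin).mean()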
6. A method according to any one of claims 1-3, wherein the method further comprises:
adding Gaussian noise to the original sample image of the sample text, to obtain the noise-added sample image corresponding to the sample text.
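The Gaussian noising of claim 6 corresponds to the standard diffusion forward process; a sketch, assuming a precomputed cumulative noise schedule alphas_cumprod and a batch of integer timesteps t:

    import torch

    def add_gaussian_noise(original_images, alphas_cumprod, t):
        # Standard forward process:
        # x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise
        noise = torch.randn_like(original_images)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        return a_bar.sqrt() * original_images + (1.0 - a_bar).sqrt() * noise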
7. A method according to any one of claims 1-3, wherein the carrying out noise reduction processing on the noise-added sample image according to the sample text by using the text-to-image diffusion model to generate a noise-reduced sample image comprises:
inputting the sample text and the noise-added sample image into the text-to-image diffusion model, and performing text coding on the sample text by a text encoder in the text-to-image diffusion model to generate the second representation vector; and
carrying out layer-by-layer noise reduction on the noise-added sample image based on the second representation vector by a plurality of text-to-image generation layers in the text-to-image diffusion model, to obtain the noise-reduced sample image.
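A minimal sketch of the model structure of claim 7, assuming hypothetical text_encoder and text-conditioned denoising layer modules; it matches the diffusion_model(texts, noised_images) call used in the sketch after claim 1.

    import torch.nn as nn

    class TextToImageDiffusionModel(nn.Module):
        def __init__(self, text_encoder, denoising_layers):
            super().__init__()
            # Produces the second representation vector from the sample text.
            self.text_encoder = text_encoder
            self.layers = nn.ModuleList(denoising_layers)

        def forward(self, sample_texts, noised_images):
            text_vecs = self.text_encoder(sample_texts)
            x = noised_images
            for layer in self.layers:
                # Layer-by-layer noise reduction conditioned on the text.
                x = layer(x, text_vecs)
            return x, text_vecs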
8. A text-based image generation method, wherein the method comprises:
acquiring a target text; and
inputting the target text into a target text-to-image diffusion model to output a target image corresponding to the target text, wherein the target text-to-image diffusion model is a model obtained by the training method according to any one of claims 1-7.
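At inference time (claim 8), generation reduces to a single call on the trained target model; generate() is an assumed, illustrative API rather than one named in the disclosure.

    # Hypothetical usage of the trained target text-to-image diffusion model.
    target_image = target_model.generate("a watercolor painting of a lighthouse at dawn")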
9. A training apparatus for a text-to-image diffusion model, wherein the apparatus comprises:
a noise reduction module, configured to carry out noise reduction processing on a noise-added sample image according to a sample text by using the text-to-image diffusion model to generate a noise-reduced sample image, wherein the sample text and the noise-added sample image form a group of training samples;
a selecting module, configured to obtain a first text-to-image alignment score according to a first representation vector of the noise-reduced sample image and a second representation vector of the sample text, and select a first training sample from the training samples of the current batch based on the first text-to-image alignment score, wherein the first text-to-image alignment score is used to characterize a correlation between the noise-reduced sample image and the sample text; and
a training module, configured to determine a first loss function of the text-to-image diffusion model according to an original sample image and the noise-reduced sample image of the sample text in the first training sample, and adjust the text-to-image diffusion model based on the first loss function;
wherein the training module is further configured to continue training the adjusted text-to-image diffusion model with the training samples of the next batch until training ends, to obtain a final target text-to-image diffusion model.
10. The apparatus of claim 9, wherein the selecting module is further configured to:
obtain a second text-to-image alignment score between the sample text and the original sample image, the second text-to-image alignment score being used to characterize a correlation between the sample text and the original sample image;
obtain a score difference value between the first text-to-image alignment score and the second text-to-image alignment score; and
select the first training sample from the training samples of the current batch according to the score difference value.
11. The apparatus of claim 10, wherein the selecting module is further configured to:
select, from the training samples of the current batch, training samples whose score difference value is larger than a set threshold value as the first training samples; or
sort the training samples of the current batch in descending order of the score difference value, and select the top-ranked portion of the training samples as the first training samples.
12. The apparatus of any one of claims 9-11, wherein the selecting module is further configured to:
perform image coding on the noise-reduced sample image to obtain the first representation vector; and
carry out relevance scoring on the noise-reduced sample image and the sample text according to the first representation vector and the second representation vector based on a text-to-image alignment model, to obtain the first text-to-image alignment score.
13. The apparatus of claim 12, wherein, after the first text-to-image alignment score is obtained, the training module is further configured to:
obtain a second loss function of the text-to-image alignment model according to a second text-to-image alignment score between the sample text and the original sample image and the first text-to-image alignment score;
adjust the text-to-image alignment model based on the second loss function; and
continue training the adjusted text-to-image alignment model with the training samples of the next batch.
14. The apparatus of any one of claims 9-11, wherein the apparatus further comprises a noise adding module, configured to:
add Gaussian noise to the original sample image of the sample text, to obtain the noise-added sample image corresponding to the sample text.
15. The apparatus of any one of claims 9-11, wherein the noise reduction module is further configured to:
input the sample text and the noise-added sample image into the text-to-image diffusion model, and perform text coding on the sample text by a text encoder in the text-to-image diffusion model to generate the second representation vector; and
carry out layer-by-layer noise reduction on the noise-added sample image based on the second representation vector by a plurality of text-to-image generation layers in the text-to-image diffusion model, to obtain the noise-reduced sample image.
16. A text-based image generation apparatus, wherein the apparatus comprises:
an acquisition module, configured to acquire a target text; and
a generating module, configured to input the target text into a target text-to-image diffusion model to output a target image corresponding to the target text, wherein the target text-to-image diffusion model is a model obtained by the training method according to any one of claims 1-7.
17. An electronic device, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202211732667.1A 2022-12-30 2022-12-30 Training method of diffusion model of draft map and image generation method based on text Active CN116051668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211732667.1A CN116051668B (en) 2022-12-30 2022-12-30 Training method of diffusion model of draft map and image generation method based on text


Publications (2)

Publication Number Publication Date
CN116051668A CN116051668A (en) 2023-05-02
CN116051668B (en) 2023-09-19

Family

ID=86128894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211732667.1A Active CN116051668B (en) 2022-12-30 2022-12-30 Training method of diffusion model of draft map and image generation method based on text

Country Status (1)

Country Link
CN (1) CN116051668B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778011B (en) * 2023-05-22 2024-05-24 阿里巴巴(中国)有限公司 Image generating method
CN116542292B (en) * 2023-07-04 2023-09-26 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of image generation model
CN116894880A (en) * 2023-07-11 2023-10-17 北京百度网讯科技有限公司 Training method, training model, training device and electronic equipment for text-to-graphic model
CN117152283A (en) * 2023-07-28 2023-12-01 华院计算技术(上海)股份有限公司 Voice-driven face image generation method and system by using diffusion model
CN116721186B (en) * 2023-08-10 2023-12-01 北京红棉小冰科技有限公司 Drawing image generation method and device, electronic equipment and storage medium
CN116797684B (en) * 2023-08-21 2024-01-05 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium
CN117315149A (en) * 2023-09-26 2023-12-29 北京智象未来科技有限公司 Three-dimensional object generation method, device, equipment and storage medium
CN117773347A (en) * 2023-10-08 2024-03-29 深圳市创客工场科技有限公司 Picture acquisition method, machine, computing system and medium for laser processing
CN117475086A (en) * 2023-12-22 2024-01-30 知呱呱(天津)大数据技术有限公司 Scientific literature drawing generation method and system based on diffusion model
CN117474796B (en) * 2023-12-27 2024-04-05 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and computer readable storage medium
CN117953180A (en) * 2024-03-26 2024-04-30 厦门大学 Text-to-three-dimensional object generation method based on dual-mode latent variable diffusion

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590865A (en) * 2021-07-09 2021-11-02 北京百度网讯科技有限公司 Training method of image search model and image search method
CN113590858A (en) * 2021-06-30 2021-11-02 北京百度网讯科技有限公司 Target object generation method and device, electronic equipment and storage medium
CN113901907A (en) * 2021-09-30 2022-01-07 北京百度网讯科技有限公司 Image-text matching model training method, image-text matching method and device
CN114549935A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Information generation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10748314B2 (en) * 2018-02-15 2020-08-18 Microsoft Technology Licensing, Llc Controllable conditional image generation


Also Published As

Publication number Publication date
CN116051668A (en) 2023-05-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant