CN116363249A - Controllable image generation method and device and electronic equipment


Info

Publication number: CN116363249A
Application number: CN202310341978.3A
Authority: CN (China)
Prior art keywords: image, control, dimension, inference, encoder
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李国豪 (Li Guohao), 李伟 (Li Wei), 刘家辰 (Liu Jiachen), 肖欣延 (Xiao Xinyan)
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310341978.3A
Publication of CN116363249A

Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/08: Learning methods
    • G06N 5/04: Inference or reasoning models
    • G06T 5/70: Denoising; Smoothing
    • G06V 10/40: Extraction of image or video features
    • G06V 10/56: Extraction of image or video features relating to colour
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]


Abstract

The disclosure provides a controllable image generation method and device, and an electronic device, relating to the field of artificial intelligence and in particular to the technical field of image generation. A specific implementation scheme is as follows: acquiring a control image, extracting features of at least one control image, and generating a first feature vector; fusing the first feature vectors to generate a second feature vector; and inputting a noise image into a diffusion network to perform an inference operation, and inputting the second feature vector into the corresponding inference layer to participate in the inference operation, so as to generate a target image. Embodiments of the disclosure can steer the generated image through one or more control images, enabling precise control over the image generation process and improving the controllability of the generated image.

Description

Controllable image generation method and device and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the field of image generation techniques.
Background
In recent years, image generation technology has advanced rapidly; in particular, the technology of generating an image from input text has attracted wide attention and found broad application. It allows a user to automatically generate, from simple text, high-quality artwork that rivals the work of professional painters.
With the popularization of artificial intelligence (AI) authoring, generating images from text alone is gradually failing to meet users' demands for greater controllability of the generated images, and no image generation scheme with higher controllability is currently available.
Disclosure of Invention
The disclosure provides a controllable image generation method, a controllable image generation device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a controllable image generation method, comprising:
acquiring a control image, extracting features of at least one control image, and generating a first feature vector;
fusing the first feature vectors to generate a second feature vector;
and inputting a noise image into a diffusion network to perform an inference operation, and inputting the second feature vector into the corresponding inference layer to participate in the inference operation, so as to generate a target image.
Optionally, the method further comprises:
acquiring a reference image, and inputting the reference image into a control map extraction module;
and extracting control features from the reference image according to the control dimension of the control map extraction module to generate a control image.
Optionally, the control dimensions include a composition dimension, a color dimension, and a structure dimension, and the extracting control features from the reference image according to the control dimension of the control map extraction module to generate the control image includes at least one of:
classifying each pixel in the reference image in response to the control dimension being the composition dimension, and labeling each pixel according to its class to generate the control image;
in response to the control dimension being the color dimension, blurring the reference image to generate the control image, wherein the control image comprises a plurality of color blocks;
and in response to the control dimension being the structure dimension, identifying a structure in the reference image and generating the control image according to the structure.
Optionally, the extracting features of at least one control image and generating the first feature vector includes:
inputting the control image into a dimension encoder to perform an inference operation, and obtaining the first feature vector output by each inference layer, wherein the dimension encoder comprises a plurality of inference layers.
Optionally, the fusing the first feature vectors to generate the second feature vector includes:
acquiring the control weight of each dimension encoder, and multiplying the first feature vector corresponding to each encoder by the corresponding control weight to obtain intermediate feature vectors;
adding the intermediate feature vectors to obtain the second feature vector.
Optionally, the diffusion network includes an encoder and a decoder, each comprising a plurality of inference layers; the number and structure of the inference layers in the decoder and the encoder are the same as those of the dimension encoder, and the inference layers of the dimension encoder are in one-to-one correspondence with the inference layers of the decoder.
Optionally, the inputting the noise image into the diffusion network to perform the inference operation, and inputting the second feature vector into the corresponding inference layer to participate in the inference operation, so as to generate the target image includes:
inputting the noise image into the inference layers of the encoder for the inference operation and then into the inference layers of the decoder;
adding the first feature vector to the output of the corresponding inference layer in the decoder, and inputting the sum to the next inference layer in the decoder for its inference operation, so as to generate a pending image;
and inputting the pending image into the diffusion network for iterative operation so as to generate the target image.
Optionally, the inputting the pending image into the diffusion network for iterative operation to generate the target image includes:
acquiring a preset iteration-count threshold, setting the iteration count to 0, and incrementing the iteration count by 1 each time an iterative operation finishes outputting a pending image;
if the iteration count is smaller than the iteration-count threshold, continuing the iterative operation;
and if the iteration count is equal to the iteration-count threshold, stopping the iterative operation and determining the output pending image as the target image.
According to a second aspect of the present disclosure, there is provided a controllable image generation apparatus comprising:
the feature extraction module, configured to acquire a control image, extract features of at least one control image, and generate a first feature vector;
the feature fusion module, configured to fuse the first feature vectors to generate a second feature vector;
and the image generation module, configured to input a noise image into the diffusion network for an inference operation, and input the second feature vector into the corresponding inference layer to participate in the inference operation, so as to generate a target image.
Optionally, the apparatus further includes:
the reference image acquisition module, configured to acquire a reference image and input the reference image into the control map extraction module;
and the control image generation module, configured to extract control features from the reference image according to the control dimension of the control map extraction module, so as to generate a control image.
Optionally, the control dimensions include a composition dimension, a color dimension, and a structure dimension, and the control image generation module includes:
the composition extraction sub-module, configured to, in response to the control dimension being the composition dimension, classify each pixel in the reference image and label each pixel according to its class, so as to generate the control image;
the color extraction sub-module, configured to, in response to the control dimension being the color dimension, blur the reference image so as to generate the control image, wherein the control image comprises a plurality of color blocks;
and the structure extraction sub-module, configured to, in response to the control dimension being the structure dimension, identify a structure in the reference image and generate the control image according to the structure.
Optionally, the feature extraction module includes:
the feature extraction sub-module, configured to input the control image into a dimension encoder for an inference operation and obtain the first feature vector output by each inference layer, wherein the dimension encoder comprises a plurality of inference layers.
Optionally, the feature fusion module includes:
the weighting sub-module, configured to acquire the control weight of each dimension encoder and multiply the first feature vector corresponding to each encoder by the corresponding control weight to obtain intermediate feature vectors;
and the feature fusion sub-module, configured to add the intermediate feature vectors to obtain the second feature vector.
Optionally, the diffusion network includes an encoder and a decoder, each comprising a plurality of inference layers; the number and structure of the inference layers in the decoder and the encoder are the same as those of the dimension encoder, and the inference layers of the dimension encoder are in one-to-one correspondence with the inference layers of the decoder.
Optionally, the image generation module includes:
the first inference sub-module, configured to input the noise image into the inference layers of the encoder for the inference operation and then into the inference layers of the decoder;
the second inference sub-module, configured to add the first feature vector to the output of the corresponding inference layer in the decoder and input the sum to the next inference layer in the decoder for its inference operation, so as to generate a pending image;
and the iteration sub-module, configured to input the pending image into the diffusion network for iterative operation, so as to generate the target image.
Optionally, the iteration sub-module includes:
the counting unit, configured to acquire a preset iteration-count threshold, set the iteration count to 0, and increment the iteration count by 1 each time an iterative operation finishes outputting a pending image;
the first judging unit, configured to continue the iterative operation if the iteration count is smaller than the iteration-count threshold;
and the second judging unit, configured to stop the iterative operation and determine the output pending image as the target image if the iteration count is equal to the iteration-count threshold.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the first aspects.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any of the first aspects.
The technical solution of the disclosure can realize at least the following beneficial effect:
the generated image is controlled through one or more control images, so that precise control over the image generation process can be realized and the controllability of the generated image is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a controllable image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a controllable image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a controllable image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a controllable image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of control images provided in accordance with an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a controllable image generation system provided in accordance with an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a controllable image generation model provided in accordance with an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a controllable image generation apparatus provided in accordance with an embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing controllable image generation in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In recent years, image generation technology has advanced rapidly; in particular, text-to-image generation has attracted wide attention and found broad application. It allows a user to automatically generate, from simple text, high-quality artwork that rivals the work of professional painters.
With the continued popularization of AI authoring, generating artwork from text alone is gradually failing to meet users' demands for greater controllability. Some professional users need fine, multi-dimensional control over the artwork, such as composition, color, object contour, and figure pose; in creation scenarios with higher controllability requirements, controls in multiple dimensions may even need to be used in combination to achieve a satisfactory generation result.
Related-art methods of controlling a generated image through control images include the following two:
1) Retraining a diffusion generation model: the control image is concatenated directly onto the input side, and the model is trained to generate the original image from the control image.
The diffusion model is a type of generative model that applies noise to an image step by step in the forward diffusion process until the image is corrupted into pure Gaussian noise, and then learns, in the backward sampling process, to recover a real image from the Gaussian noise. Once the model is trained, a rich variety of real images can be generated simply by randomly sampling a Gaussian noise.
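For clarity, the forward and backward processes just described correspond to the standard DDPM formulation; these equations are added here for reference and are not recited in the original text:

```latex
% Forward diffusion: gradually corrupt image x_0 with Gaussian noise
% over T steps, using a variance schedule beta_1 ... beta_T.
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

% Backward sampling: a learned network predicts the parameters used to
% denoise step by step, recovering an image from pure noise x_T.
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```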
2) Adding a plug-and-play control module while keeping the capability of the existing diffusion model. The control module is responsible for processing a given control image and organically integrating its information into the base diffusion model, so as to control image generation.
Method 1) requires collecting a large amount of image-text data for large-scale training, which is costly; moreover, a separate model must be trained for each control dimension, making it difficult to combine multiple control conditions.
Method 2) largely retains the capability of the base generation model and is relatively easy to train and use. However, because the diffusion model requires many iterative operations and the control module is invoked at every iteration, the computational load of the system increases and the image generation speed drops greatly, lowering image generation efficiency; furthermore, the image can only be controlled in a single dimension, and control maps of multiple dimensions cannot be input simultaneously.
In order to solve the above problems, the disclosure provides a controllable image generation method, a controllable image generation apparatus, and an electronic device. Fig. 1 is a schematic diagram of a controllable image generation method according to an embodiment of the disclosure; as shown in fig. 1, the method includes:
Step 101: acquiring a control image, extracting features of at least one control image, and generating a first feature vector.
In this embodiment, control over the generated image is achieved through control images, and each control image can control the effect of the generated image in one dimension. Multiple control images may be input to control the generated image in multiple dimensions.
Fig. 5 is a schematic diagram of control images provided according to an embodiment of the present disclosure. As shown in fig. 5, in one possible embodiment, the effect of the generated image can be controlled in three broad dimensions: a composition dimension, a color dimension, and a structure dimension.
The scene map is used for control in the composition dimension. The scene map is a semantic segmentation map in which each pixel carries a class label for distinguishing the class of the pixel; the pixel classes can be set according to the implementer's specific requirements, for example: foreground and background; or character, pet, and environment. The scene map characterizes the spatial layout of the individual elements in the picture.
The color block map is used for controlling the effect of the generated image in the color dimension; it records the color of each pixel and characterizes the spatial distribution of colors in the picture. In one possible embodiment, the color block map contains several large color blocks, with all pixels in each block sharing the same color; in another possible embodiment, the color block map is drawn manually with a thick brush. Both kinds of color block maps represent the color distribution of the picture.
The contour map and the pose map are used for controlling the effect of the generated image in the structure dimension. The structure dimension, i.e., the specific form of the elements in the picture, includes but is not limited to: the pose of a human body and the contour of an object. Because control in the structure dimension involves more image details, it allows finer control over the generated image.
Step 102: fusing the first feature vectors to generate a second feature vector.
In this embodiment, the first feature vectors all have the same dimensionality, and only one feature vector is input into the diffusion network; therefore, when there are multiple first feature vectors, the implementer may fuse the first feature vectors extracted from the control images by vector addition to generate the second feature vector.
Step 103: inputting a noise image into a diffusion network for an inference operation, and inputting the second feature vector into the corresponding inference layer to participate in the inference operation, so as to generate a target image.
In this embodiment, the diffusion network is used to generate an image. The underlying principle is that the target image is generated from the noise image, which can be understood as denoising: the noise in the noise image is removed to restore a clean image. By incorporating the second feature vector into the image generation process of the text-to-image network, the generation effect can be controlled and a target image generated.
Fig. 2 is a schematic diagram of a controllable image generating method according to an embodiment of the disclosure, as shown in fig. 2, where the method in fig. 1 further includes:
Step 201: acquiring a reference image, and inputting the reference image into the control map extraction module.
In this embodiment, the control image may be drawn manually, or obtained by applying certain processing to an existing image. The control map extraction module provided in this embodiment is configured to process an existing reference image to obtain a control image.
Step 202: extracting control features from the reference image according to the control dimension of the control map extraction module to generate a control image.
In this embodiment, the control map extraction module includes a plurality of extraction sub-modules, each corresponding to one control dimension and configured to extract the control image corresponding to that control dimension. Control images corresponding to multiple control dimensions can thus be extracted from a reference image.
In one possible embodiment, a reference image is input to each extraction sub-module, and a plurality of control images are acquired.
In one possible embodiment, a different reference image is input to each extraction sub-module, and a plurality of control images are acquired.
In one possible embodiment, the extraction sub-modules of the control map extraction module are all pre-trained neural network models, each able to extract the control image corresponding to its control dimension.
Optionally, the control dimensions include a composition dimension, a color dimension, and a structure dimension, and the extracting control features from the reference image according to the control dimension of the control map extraction module to generate the control image includes at least one of:
classifying each pixel in the reference image in response to the control dimension being the composition dimension, and labeling each pixel according to its class to generate the control image;
In this embodiment, the composition dimension concerns the spatial layout occupied by the picture elements; therefore, to obtain a control image of the composition dimension, the class of each pixel in the reference image needs to be extracted and labeled, so as to generate the control image.
In one possible embodiment, the reference image is processed through a semantic segmentation network model, and the generated semantic segmentation map is the control image.
In one possible embodiment, the reference image is processed through a depth extraction network model, and the generated depth map is the control image.
In one possible embodiment, the reference image is binarized, and a black-and-white image containing only black and white pixels is generated as the control image.
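As an illustration of the simplest of these embodiments, the binarization variant might look like the following sketch; the file names and the threshold of 128 are hypothetical choices, not values from this disclosure:

```python
import numpy as np
from PIL import Image

def composition_control_image(reference_path: str, threshold: int = 128) -> Image.Image:
    """Binarize a reference image into a black-and-white composition map.

    Each pixel is assigned one of two classes (black or white), which is the
    simplest form of the per-pixel class labeling described above.
    """
    gray = np.asarray(Image.open(reference_path).convert("L"))
    mask = (gray >= threshold).astype(np.uint8) * 255  # white = one class, black = the other
    return Image.fromarray(mask)

composition_control_image("reference.png").save("composition_control.png")
```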
in response to the control dimension being the color dimension, blurring the reference image to generate the control image, wherein the control image comprises a plurality of color blocks;
In this embodiment, the color dimension concerns the spatial layout of colors in the picture; therefore, to obtain a control image of the color dimension, the color distribution features of the reference image need to be extracted to generate the control image.
In one possible embodiment, the reference image is converted into a color block map by blurring or mosaic processing, and the color block map is used as the control image; the implementer can adjust the parameters of the mosaic processing (such as the mosaic block size) according to actual requirements to control the effect of the color block map.
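A minimal sketch of the mosaic-style processing described above; the block size of 16 stands in for the adjustable mosaic parameter and is an arbitrary choice:

```python
from PIL import Image

def color_block_map(reference_path: str, block: int = 16) -> Image.Image:
    """Convert a reference image into a color block map by mosaic processing:
    downscale so each block collapses into one averaged color, then upscale
    with nearest-neighbor so the blocks become flat, uniform color patches.
    """
    img = Image.open(reference_path).convert("RGB")
    w, h = img.size
    small = img.resize((max(1, w // block), max(1, h // block)), Image.BILINEAR)
    return small.resize((w, h), Image.NEAREST)

color_block_map("reference.png", block=16).save("color_control.png")
```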
and in response to the control dimension being the structure dimension, identifying a structure in the reference image and generating the control image according to the structure.
In this embodiment, the structure dimension concerns the specific form of the elements in the picture and can be subdivided into multiple sub-dimensions, such as object contour and human pose; therefore, to obtain a control image of the structure dimension, the specific morphological features of one or more elements in the reference image need to be extracted to generate the control image.
In one possible embodiment, edge features in the reference image are extracted through an edge detection network to generate a line drawing containing the edge lines of an object or a human body, and the line drawing is used as the control image.
In one possible embodiment, the human pose in the reference image is extracted through a pose recognition network to generate a pose map containing markers for the limbs of the human body, and the pose map is used as the control image.
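For the contour variant, the sketch below uses a generic edge detector as a stand-in for the trained edge detection network; OpenCV's Canny detector and its thresholds are assumptions for illustration, since the disclosure describes a neural network:

```python
import cv2
import numpy as np

def contour_control_image(reference_path: str) -> np.ndarray:
    """Extract edge lines from a reference image as a structure control map."""
    gray = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 100, 200)  # thresholds are illustrative only
    return edges

cv2.imwrite("contour_control.png", contour_control_image("reference.png"))
```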
Optionally, the extracting features of at least one control image and generating the first feature vector includes:
inputting the control image into a dimension encoder to perform an inference operation, and obtaining the first feature vector output by each inference layer, wherein the dimension encoder comprises a plurality of inference layers.
In this embodiment, the dimension encoder is configured to perform upsampling to extract high-dimensional features, and comprises a plurality of inference layers arranged in a fixed order; the output obtained by the inference operation of the previous inference layer is input to the next inference layer for its inference operation. The output of each inference layer is taken as a first feature vector.
The dimension encoder is a pre-trained encoder; a corresponding encoder needs to be trained specifically for the control images of each control dimension, so that the encoder can accurately extract effective features.
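A minimal PyTorch sketch of such a dimension encoder is given below; the four-layer depth and channel widths are hypothetical, and only the interface matters: every inference layer's output is exposed as a first feature vector.

```python
import torch
import torch.nn as nn

class DimensionEncoder(nn.Module):
    """Encoder whose per-layer outputs are all returned as first feature vectors."""

    def __init__(self, channels=(3, 64, 128, 256, 512)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.SiLU(),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, control_image: torch.Tensor) -> list[torch.Tensor]:
        features = []
        x = control_image
        for layer in self.layers:  # each layer's output feeds the next layer
            x = layer(x)
            features.append(x)     # keep every inference layer's output
        return features
```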
Fig. 3 is a schematic diagram of a controllable image generation method provided according to an embodiment of the present disclosure. As shown in fig. 3, step 102 in fig. 1, fusing the first feature vectors to generate the second feature vector, specifically includes:
Step 301: acquiring the control weight of each dimension encoder, and multiplying the first feature vector corresponding to each encoder by the corresponding control weight to obtain intermediate feature vectors.
Step 302: adding the intermediate feature vectors to obtain the second feature vector.
In this embodiment, when the implementer inputs multiple reference images to control image generation, multiple first feature vectors are acquired, and these feature vectors need to be fused together. A control weight needs to be set for each dimension encoder; the control weight represents the importance, to the image generation process, of the first feature vector corresponding to that encoder.
Through this embodiment, weight control over the first feature vector of each control dimension is realized, improving the controllability of the image generation process; the implementer can customize the control weights during image generation, so the target image can be generated more accurately.
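A sketch of this weighted fusion, assuming (as stated above) that the first feature vectors of corresponding layers share the same dimensionality; normalizing the weights to sum to 1 follows the description of fig. 7 later in this text:

```python
import torch

def fuse_features(per_encoder_features: list[list[torch.Tensor]],
                  weights: list[float]) -> list[torch.Tensor]:
    """Multiply each encoder's per-layer first feature vectors by its control
    weight, then add across encoders to get per-layer second feature vectors.
    """
    total = sum(weights)
    weights = [w / total for w in weights]  # keep the weights summing to 1
    num_layers = len(per_encoder_features[0])
    fused = []
    for layer_idx in range(num_layers):
        fused.append(sum(w * feats[layer_idx]
                         for w, feats in zip(weights, per_encoder_features)))
    return fused
```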
Optionally, the diffusion network includes an encoder and a decoder, each comprising a plurality of inference layers; the number and structure of the inference layers in the decoder and the encoder are the same as those of the dimension encoder, and the inference layers of the dimension encoder are in one-to-one correspondence with the inference layers of the decoder.
In this embodiment, after training is completed, the parameters of the diffusion network are locked and cannot be adjusted, so as to guarantee the effect of the diffusion network at run time. The encoder is used for upsampling to extract high-dimensional features; the decoder is used for downsampling to restore the high-dimensional features to a two-dimensional image.
Fig. 4 is a schematic diagram of a controllable image generation method provided according to an embodiment of the present disclosure. As shown in fig. 4, step 103 in fig. 1, inputting a noise image into the diffusion network for an inference operation and inputting the second feature vector into the corresponding inference layer to participate in the inference operation so as to generate a target image, specifically includes:
Step 401: inputting the noise image into the inference layers of the encoder for the inference operation, and then into the inference layers of the decoder.
In one possible embodiment, the noise image may be generated by sampling Gaussian noise and has a size of 64×64. The noise image is input to the first inference layer of the diffusion network, and the output of each inference layer is then input in turn to the next inference layer. In the diffusion network the encoder precedes the decoder, so the noise image is input to the first inference layer of the encoder.
Step 402: adding the first feature vector to the output of the corresponding inference layer in the decoder, and inputting the sum to the next inference layer in the decoder for its inference operation, so as to generate a pending image.
In this embodiment, the inference layers of the dimension encoder are in one-to-one correspondence with the inference layers of the decoder; corresponding inference layers have the same structure and differ only in their parameters. The dimensionality of the feature vector output by each inference layer of the dimension encoder decreases layer by layer, while the dimensionality of the feature vector output by each inference layer of the decoder increases layer by layer; the first feature vector output by an inference layer of the dimension encoder has the same dimensionality as the output of the corresponding inference layer of the decoder, so the two can be added. After the inference operations of the encoder and the decoder, one iterative operation is completed and a pending image is output.
Step 403: inputting the pending image into the diffusion network for iterative operation, so as to generate the target image.
After one iterative operation is finished, the pending image it generates can be used as the input of the diffusion network for the next iterative operation. After a certain number of iterations, the implementer may stop the iterative operation once satisfied with the generated image, and the finally generated pending image is taken as the target image.
It should be noted that, over the whole iterative process, the first feature vectors are extracted and fused into the second feature vector only once, during the first iterative operation; this process does not need to be repeated in subsequent iterations.
In one possible embodiment, the input of the diffusion network further includes a text feature vector, which is obtained by extracting features from the prompt text with a pre-trained text encoder and has a size of 77×768. The text feature vector is input into every inference layer of the diffusion network to participate in the inference operation and influences the output of each layer. The input of each inference layer is thus the text feature vector, the feature vector output by the previous inference layer, and the corresponding first feature vector.
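The 77×768 shape matches the output of a CLIP ViT-L/14 text encoder, so one plausible way to obtain such a text feature vector is sketched below; the disclosure does not name the text encoder, so CLIP is an assumption:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a cat sitting on a sofa", padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")
with torch.no_grad():
    # last_hidden_state has shape (1, 77, 768), one embedding per token slot
    text_features = text_encoder(**tokens).last_hidden_state
```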
Optionally, the inputting the pending image into the diffusion network for iterative operation to generate the target image includes:
acquiring a preset iteration-count threshold, setting the iteration count to 0, and incrementing the iteration count by 1 each time an iterative operation finishes outputting a pending image;
if the iteration count is smaller than the iteration-count threshold, continuing the iterative operation;
and if the iteration count is equal to the iteration-count threshold, stopping the iterative operation and determining the output pending image as the target image.
In one possible embodiment, the iteration-count threshold is any number between 20 and 50, for example 25.
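The iteration control described above reduces to a simple counting loop; in this sketch, denoise_fn stands for one full encoder/decoder pass as above, and 25 follows the example threshold:

```python
def generate(noise_image, fused_features, denoise_fn, max_iters: int = 25):
    """Repeat the diffusion pass until the preset iteration-count threshold."""
    count = 0                    # iteration count initialized to 0
    pending = noise_image
    while count < max_iters:     # continue while count < threshold
        pending = denoise_fn(pending, fused_features)
        count += 1               # +1 each time a pending image is output
    return pending               # the final pending image is the target image
```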
Fig. 6 is a schematic structural diagram of a controllable image generation system provided according to an embodiment of the present disclosure. As shown in fig. 6, a user may input one or more images as control conditions. The user may directly input a control image (such as the contour map or scene map described above); alternatively, the control map extraction module provided herein can be applied to automatically extract a control map from a reference image as the input.
The image generation process is as follows: the text, the noise image, and the plurality of control images are input into a multi-dimensional controllable image generation model, which comprises the diffusion network and the dimension encoders, and the final target image is obtained through multiple diffusion iterations. The image conforms to the text semantics and also satisfies the control condition of each dimension.
Fig. 7 is a schematic structural diagram of a controllable image generation model provided according to an embodiment of the present disclosure; as shown on the left side of fig. 7, the model is built on a basic text-to-image diffusion network. The figure depicts the central diffusion UNet network, which, in brief, includes an encoder and a decoder that can denoise an image and restore a noise map to a sharp image. In the existing text-to-image process, a clear image can be generated from a random noise image by invoking the UNet diffusion network multiple times under text guidance.
To exploit the powerful capability of the basic text-to-image network while adding control over the image, this embodiment duplicates and fine-tunes the encoder section of the UNet to form a series of control map encoders (e.g., a scene map encoder, a contour map encoder, and so on). This series of encoders can extract features from the control images, which are effectively incorporated into the image generation process of the text-to-image network. The basic text-to-image network itself is kept fixed.
To realize controllable generation in multiple dimensions, a weighted fusion method is proposed: the encoders of the various control maps are used in parallel, their results are weighted and summed, and the final result is input to the decoder. In this process the weights must sum to 1; in general, the larger a weight, the greater the influence of the corresponding control map. The user can achieve finer control over the generated image by adjusting the corresponding weights.
To ensure generation efficiency, the feature extraction of the control map encoders is performed in advance, outside the core text-to-image UNet diffusion loop. Because control map feature extraction is performed only once, the generation time does not increase much compared with the basic text-to-image pipeline.
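Tying the sketches above together (every name here comes from those sketches, not from this disclosure), the efficiency point is that the control map encoders run once, before the diffusion loop:

```python
# One-time work, outside the diffusion loop: extract and fuse control features.
per_encoder = [enc(img) for enc, img in zip(dimension_encoders, control_images)]
fused = fuse_features(per_encoder, weights=[0.5, 0.3, 0.2])  # weights sum to 1

# The diffusion loop itself only reuses the already-fused features.
image = noise_image
for _ in range(25):  # iteration-count threshold
    image = denoise_once(encoder_layers, decoder_layers, image, fused)
```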
Fig. 8 is a schematic diagram of a controllable image generation apparatus provided according to an embodiment of the present disclosure. As shown in fig. 8, the controllable image generation apparatus 500 includes:
the feature extraction module 510, configured to acquire a control image, extract features of at least one control image, and generate a first feature vector;
the feature fusion module 520, configured to fuse the first feature vectors to generate a second feature vector;
and the image generation module 530, configured to input a noise image into the diffusion network for an inference operation, and input the second feature vector into the corresponding inference layer to participate in the inference operation, so as to generate a target image.
Optionally, the apparatus further includes:
the reference image acquisition module, configured to acquire a reference image and input the reference image into the control map extraction module;
and the control image generation module, configured to extract control features from the reference image according to the control dimension of the control map extraction module, so as to generate a control image.
Optionally, the control dimensions include a composition dimension, a color dimension, and a structure dimension, and the control image generation module includes:
the composition extraction sub-module, configured to, in response to the control dimension being the composition dimension, classify each pixel in the reference image and label each pixel according to its class, so as to generate the control image;
the color extraction sub-module, configured to, in response to the control dimension being the color dimension, blur the reference image so as to generate the control image, wherein the control image comprises a plurality of color blocks;
and the structure extraction sub-module, configured to, in response to the control dimension being the structure dimension, identify a structure in the reference image and generate the control image according to the structure.
Optionally, the feature extraction module 510 includes:
the feature extraction sub-module, configured to input the control image into a dimension encoder for an inference operation and obtain the first feature vector output by each inference layer, wherein the dimension encoder comprises a plurality of inference layers.
Optionally, the feature fusion module 520 includes:
the weighting sub-module, configured to acquire the control weight of each dimension encoder and multiply the first feature vector corresponding to each encoder by the corresponding control weight to obtain intermediate feature vectors;
and the feature fusion sub-module, configured to add the intermediate feature vectors to obtain the second feature vector.
Optionally, the diffusion network includes an encoder and a decoder, each comprising a plurality of inference layers; the number and structure of the inference layers in the decoder and the encoder are the same as those of the dimension encoder, and the inference layers of the dimension encoder are in one-to-one correspondence with the inference layers of the decoder.
Optionally, the image generation module 530 includes:
the first inference sub-module, configured to input the noise image into the inference layers of the encoder for the inference operation and then into the inference layers of the decoder;
the second inference sub-module, configured to add the first feature vector to the output of the corresponding inference layer in the decoder and input the sum to the next inference layer in the decoder for its inference operation, so as to generate a pending image;
and the iteration sub-module, configured to input the pending image into the diffusion network for iterative operation, so as to generate the target image.
Optionally, the iteration sub-module includes:
the counting unit, configured to acquire a preset iteration-count threshold, set the iteration count to 0, and increment the iteration count by 1 each time an iterative operation finishes outputting a pending image;
the first judging unit, configured to continue the iterative operation if the iteration count is smaller than the iteration-count threshold;
and the second judging unit, configured to stop the iterative operation and determine the output pending image as the target image if the iteration count is equal to the iteration-count threshold.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the method embodiments and will not be elaborated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, a controllable image generation method. For example, in some embodiments, the controllable image generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the controllable image generation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the controllable image generation method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. A controllable image generation method, comprising:
acquiring a control image, extracting features of at least one control image, and generating a first feature vector;
fusing the first feature vectors to generate a second feature vector;
and inputting a noise image into a diffusion network to perform an inference operation, and inputting the second feature vector into the corresponding inference layer to participate in the inference operation, so as to generate a target image.
2. The method of claim 1, wherein the method further comprises:
acquiring a reference image, and inputting the reference image into a control map extraction module;
and extracting control features from the reference image according to the control dimension of the control map extraction module to generate a control image.
3. The method of claim 2, wherein the control dimensions include a composition dimension, a color dimension, and a structure dimension, and the extracting control features from the reference image according to the control dimension of the control map extraction module to generate the control image comprises at least one of the following (a sketch of the three extraction modes follows the claims):
in response to the control dimension being the composition dimension, classifying each pixel in the reference image and labeling each pixel according to its class, so as to generate the control image;
in response to the control dimension being the color dimension, blurring the reference image to generate the control image, wherein the control image comprises a plurality of color blocks;
and in response to the control dimension being the structure dimension, identifying a structure in the reference image and generating the control image according to the structure.
4. The method of claim 1, wherein the performing feature extraction on at least one of the control images to generate the first feature vectors comprises:
inputting the control image into a dimension encoder to perform an inference operation, and obtaining the first feature vector output by each inference layer, wherein the dimension encoder comprises a plurality of inference layers.
5. The method of claim 4, wherein the fusing the first feature vectors to generate the second feature vector comprises (a sketch of this weighted fusion follows the claims):
acquiring a control weight of each dimension encoder, and multiplying the first feature vector corresponding to each encoder by the corresponding control weight to obtain intermediate feature vectors;
and adding the intermediate feature vectors to obtain the second feature vector.
6. The method of claim 4, wherein the diffusion network comprises an encoder and a decoder, each comprising a plurality of inference layers; the encoder and the decoder contain the same number and structure of inference layers as the dimension encoder, and the inference layers of the dimension encoder correspond one-to-one with the inference layers of the decoder.
7. The method of claim 6, wherein the inputting the noise image into the diffusion network to perform the inference operation and inputting the second feature vector into the corresponding inference layer to participate in the inference operation, so as to generate the target image, comprises (a sketch of this inference loop follows the claims):
inputting the noise image into the inference layers of the encoder for the inference operation and then into the inference layers of the decoder;
adding the first feature vector to the output of the corresponding inference layer in the decoder, and inputting the sum into the next inference layer in the decoder for the inference operation, so as to generate a pending image;
and inputting the pending image into the diffusion network for iterative operation, so as to generate the target image.
8. The method of claim 7, wherein the inputting the pending image into the diffusion network for iterative operation to generate the target image comprises:
acquiring a preset iteration-count threshold and initializing the iteration count to 0, the count being incremented by 1 each time an iteration operation outputs a pending image;
continuing the iterative operation if the iteration count is smaller than the iteration-count threshold;
and stopping the iterative operation if the iteration count equals the iteration-count threshold, and determining the pending image output last as the target image.
9. A controllable image generation apparatus, comprising:
a feature extraction module, used for acquiring control images, performing feature extraction on at least one of the control images, and generating first feature vectors;
a feature fusion module, used for fusing the first feature vectors to generate a second feature vector;
and an image generation module, used for inputting a noise image into a diffusion network to perform an inference operation, inputting the second feature vector into a corresponding inference layer to participate in the inference operation, and generating a target image.
10. The apparatus of claim 9, wherein the apparatus further comprises:
a reference image acquisition module, used for acquiring a reference image and inputting the reference image into a control map extraction module;
and a control image generation module, used for extracting control features from the reference image according to a control dimension of the control map extraction module to generate a control image.
11. The apparatus of claim 10, wherein the control dimensions include a composition dimension, a color dimension, and a structure dimension, and the control image generation module comprises:
a composition extraction sub-module, used for classifying each pixel in the reference image in response to the control dimension being the composition dimension, and labeling each pixel according to its class so as to generate the control image;
a color extraction sub-module, used for blurring the reference image in response to the control dimension being the color dimension so as to generate the control image, wherein the control image comprises a plurality of color blocks;
and a structure extraction sub-module, used for identifying a structure in the reference image in response to the control dimension being the structure dimension, and generating the control image according to the structure.
12. The apparatus of claim 9, wherein the feature extraction module comprises:
and the feature extraction sub-module is used for inputting the control image into a dimension encoder to perform reasoning operation and obtaining a first feature vector output by each reasoning layer, wherein the dimension encoder comprises a plurality of reasoning layers.
13. The apparatus of claim 12, wherein the feature fusion module comprises:
a weighting sub-module, used for acquiring the control weight of each dimension encoder and multiplying the first feature vector corresponding to each encoder by the corresponding control weight to obtain intermediate feature vectors;
and a feature fusion sub-module, used for adding the intermediate feature vectors to obtain the second feature vector.
14. The apparatus of claim 12, wherein the flooding network comprises an encoder and a decoder, the encoder and decoder comprising a plurality of inference layers therein, the decoder and the encoder having the same number of inference layers and structure therein as the dimension encoder, the dimension encoder having a one-to-one correspondence with the inference layers in the decoder.
15. The apparatus of claim 14, wherein the image generation module comprises:
a first inference sub-module, used for inputting the noise image into the inference layers of the encoder for the inference operation and then into the inference layers of the decoder;
a second inference sub-module, used for adding the first feature vector to the output of the corresponding inference layer in the decoder and inputting the sum into the next inference layer in the decoder for the inference operation, so as to generate a pending image;
and an iteration sub-module, used for inputting the pending image into the diffusion network for iterative operation, so as to generate the target image.
16. The apparatus of claim 15, wherein the iterative sub-module comprises:
a counting unit, used for acquiring a preset iteration-count threshold, initializing the iteration count to 0, and incrementing the count by 1 each time an iteration operation outputs a pending image;
a first judging unit, used for continuing the iterative operation if the iteration count is smaller than the iteration-count threshold;
and a second judging unit, used for stopping the iterative operation if the iteration count equals the iteration-count threshold, and determining the pending image output last as the target image.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-8.
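Claims 2-3 leave the concrete extraction algorithms unspecified. The following sketch shows one plausible realization of the three control-map modes; the OpenCV primitives are assumptions rather than the patented method: k-means pixel clustering stands in for the per-pixel classification of the composition dimension, Gaussian blur with block pooling produces the coarse color blocks of the color dimension, and Canny edge detection stands in for structure identification.

```python
# Hypothetical control-map extraction, one function per control dimension.
import cv2
import numpy as np

def composition_map(reference: np.ndarray, n_classes: int = 8) -> np.ndarray:
    """Classify each pixel and label it by class (stand-in: k-means clustering)."""
    pixels = reference.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, n_classes, None, criteria, 3,
                                    cv2.KMEANS_PP_CENTERS)
    # Paint every pixel with its class center so the labeling is visualizable.
    return centers[labels.flatten()].reshape(reference.shape).astype(np.uint8)

def color_map(reference: np.ndarray, block: int = 32) -> np.ndarray:
    """Blur the reference image into a grid of coarse color blocks."""
    h, w = reference.shape[:2]
    blurred = cv2.GaussianBlur(reference, (31, 31), 0)
    small = cv2.resize(blurred, (max(w // block, 1), max(h // block, 1)),
                       interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

def structure_map(reference: np.ndarray) -> np.ndarray:
    """Identify structure in the reference image (stand-in: Canny edges)."""
    gray = cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 100, 200)
```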
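Claims 4-5 describe per-layer feature extraction by dimension encoders followed by a control-weighted fusion. The sketch below assumes a toy convolutional encoder and arbitrary channel sizes; only the shape of the computation follows the claims: each inference layer emits a first feature vector, and a weighted sum across encoders yields the second feature vector for every layer.

```python
# Hypothetical dimension encoder and control-weighted feature fusion.
import torch
import torch.nn as nn

class DimensionEncoder(nn.Module):
    """Toy encoder whose forward pass returns every inference layer's output."""
    def __init__(self, channels=(3, 32, 64, 128)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.SiLU())
            for cin, cout in zip(channels[:-1], channels[1:]))

    def forward(self, control_image: torch.Tensor) -> list[torch.Tensor]:
        feats, x = [], control_image
        for layer in self.layers:
            x = layer(x)
            feats.append(x)  # "first feature vector" of this inference layer
        return feats

def fuse(first_feats: list[list[torch.Tensor]],
         control_weights: list[float]) -> list[torch.Tensor]:
    """Multiply each encoder's features by its control weight, then add them."""
    n_layers = len(first_feats[0])
    return [sum(w * enc[i] for enc, w in zip(first_feats, control_weights))
            for i in range(n_layers)]
```

For instance, with three encoders for the composition, color, and structure dimensions, weights such as (1.0, 0.5, 0.8) would let the composition features dominate the fused control signal.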
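Claims 6-8 route a noise image through the diffusion network's encoder, add the control features to the output of each corresponding decoder layer before the next layer runs, and feed the pending image back until a preset iteration count is reached. In this sketch, encoder_layers and decoder_layers are hypothetical module lists whose output shapes are assumed compatible with the injected features; the counted loop mirrors claim 8.

```python
# Hypothetical controlled diffusion inference with a counted iteration loop.
import torch

@torch.no_grad()
def generate(noise: torch.Tensor, encoder_layers, decoder_layers,
             fused_feats: list[torch.Tensor], n_iters: int = 50) -> torch.Tensor:
    pending = noise
    count = 0                              # iteration count initialized to 0
    while count < n_iters:                 # continue while below the threshold
        x = pending
        for layer in encoder_layers:       # inference through encoder layers
            x = layer(x)
        for i, layer in enumerate(decoder_layers):
            x = layer(x)                   # output of this decoder layer
            if i < len(fused_feats):
                x = x + fused_feats[i]     # add control features; the sum
                                           # feeds the next decoder layer
        pending = x                        # pending image of this iteration
        count += 1                         # increment after each output
    return pending                         # count == threshold: target image
```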
CN202310341978.3A 2023-03-31 2023-03-31 Controllable image generation method and device and electronic equipment Pending CN116363249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310341978.3A CN116363249A (en) 2023-03-31 2023-03-31 Controllable image generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310341978.3A CN116363249A (en) 2023-03-31 2023-03-31 Controllable image generation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN116363249A (en) 2023-06-30

Family

ID=86941641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310341978.3A Pending CN116363249A (en) 2023-03-31 2023-03-31 Controllable image generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116363249A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721186A (en) * 2023-08-10 2023-09-08 北京红棉小冰科技有限公司 Drawing image generation method and device, electronic equipment and storage medium
CN116721186B (en) * 2023-08-10 2023-12-01 北京红棉小冰科技有限公司 Drawing image generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Zhou et al. Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network
CN111783647B (en) Training method of face fusion model, face fusion method, device and equipment
CN112528976B (en) Text detection model generation method and text detection method
CN111768468B (en) Image filling method, device, equipment and storage medium
CN113033537A (en) Method, apparatus, device, medium and program product for training a model
CN111832745A (en) Data augmentation method and device and electronic equipment
CN111709873A (en) Training method and device of image conversion model generator
CN111640123A (en) Background-free image generation method, device, equipment and medium
CN116363249A (en) Controllable image generation method and device and electronic equipment
CN111768467A (en) Image filling method, device, equipment and storage medium
CN111862031A (en) Face synthetic image detection method and device, electronic equipment and storage medium
CN113657396B (en) Training method, translation display method, device, electronic equipment and storage medium
CN111768466A (en) Image filling method, device, equipment and storage medium
CN111784799B (en) Image filling method, device, equipment and storage medium
CN114049290A (en) Image processing method, device, equipment and storage medium
CN114120413A (en) Model training method, image synthesis method, device, equipment and program product
CN116402914B (en) Method, device and product for determining stylized image generation model
CN117710921A (en) Training method, detection method and related device of target detection model
CN105069450A (en) Quick multi-character recognition method
CN111932530A (en) Three-dimensional object detection method, device and equipment and readable storage medium
WO2023045721A1 (en) Image language identification method and related device thereof
CN113240780B (en) Method and device for generating animation
CN115311403A (en) Deep learning network training method, virtual image generation method and device
CN115019057A (en) Image feature extraction model determining method and device and image identification method and device
CN112785501A (en) Method, device and equipment for processing character image and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination