CN117975211A

CN117975211A - Image processing method and device based on multi-mode information

Info

Publication number: CN117975211A
Application number: CN202311796360.2A
Authority: CN
Inventors: 石雅洁
Original assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Current assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date: 2023-12-25
Filing date: 2023-12-25
Publication date: 2024-05-03

Abstract

The disclosure relates to the technical field of computer vision and provides an image processing method and device based on multi-modal information. The method comprises: adding random noise data corresponding to t time steps to an image to obtain a feature vector of a noise image corresponding to the image; performing feature fusion based on a feature vector of a prompt text and a feature vector of the image to obtain a multi-modal feature enhancement vector; performing t rounds of noise prediction based on the multi-modal feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image to obtain t noise feature vectors corresponding to the noise image; and generating a denoised image containing a target object based on the t noise feature vectors and the feature vector of the noise image. The method addresses the low accuracy of generated images in the prior art and improves the accuracy and efficiency of image generation.

Description

Image processing method and device based on multi-mode information

Technical Field

The disclosure relates to the technical field of computer vision, in particular to an image processing method and device based on multi-mode information.

Background

As artificial intelligence technology continues to advance, AI-generated content is being applied ever more widely, and images are an important modality of such content. Image generation technology has achieved many key breakthroughs in recent years. Existing image generation methods mainly map a given text into a text space to obtain a text feature vector, convert the text feature vector into a corresponding image feature vector, and obtain the image corresponding to the text through a decoder. However, images generated in this way are not accurate enough.

Disclosure of Invention

In view of the above, embodiments of the present disclosure provide an image processing method, apparatus, electronic device and readable storage medium based on multi-modal information, so as to solve the problem of low accuracy of generating an image in the prior art.

In a first aspect of an embodiment of the present disclosure, there is provided an image processing method based on multi-modal information, including: acquiring a prompt text and an image, wherein the prompt text and the image contain a target object; adding random noise data corresponding to t time steps to the image to obtain a feature vector of a noise image corresponding to the image, wherein t is a positive integer; extracting features of the prompt text to obtain feature vectors of the prompt text, extracting features of the image to obtain feature vectors of the image, and carrying out feature fusion processing based on the feature vectors of the prompt text and the feature vectors of the image to obtain multi-mode feature enhancement vectors; based on the multi-mode feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image, carrying out t times of noise prediction processing to obtain t noise feature vectors corresponding to the noise image; a denoised image containing the target object is generated based on the feature vectors of the t noises and the feature vectors of the noise image.

A second aspect of the embodiments of the present disclosure provides an image processing apparatus based on multi-modality information, including: the acquisition module is used for acquiring prompt texts and images, wherein the prompt texts and the images contain target objects; the noise adding module is used for adding random noise data corresponding to t time steps to the image to obtain a feature vector of a noise image corresponding to the image, wherein t is a positive integer; the feature fusion module is used for carrying out feature extraction on the prompt text to obtain a feature vector of the prompt text, carrying out feature extraction on the image to obtain a feature vector of the image, and carrying out feature fusion processing on the basis of the feature vector of the prompt text and the feature vector of the image to obtain a multi-mode feature enhancement vector; the prediction module is used for carrying out t times of noise prediction processing on the basis of the multi-mode feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image to obtain t noise feature vectors corresponding to the noise image; the image generation module is used for generating a denoising image containing a target object based on the feature vectors of the t noises and the feature vectors of the noise image.

In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.

In a fourth aspect of the disclosed embodiments, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.

Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects. Random noise data corresponding to t time steps, that is, Gaussian noise, is added to the image to obtain a feature map of a noise image containing the Gaussian noise. The feature vector of the prompt text and the feature vector of the image are fused, combining the characteristics of the prompt text with those of the image to obtain the multi-modal feature enhancement vector. This strengthens the feature expression of both the image and the prompt text, makes use of more of the available information, and helps the model better understand their content so that a more accurate and realistic image is generated. Based on the multi-modal feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image, t rounds of noise prediction are performed to obtain t predicted noise feature vectors corresponding to the noise image. A denoised image containing the target object is then generated from the t noise feature vectors and the feature vector of the noise image: the noise is suppressed and eliminated at the feature level to obtain the feature vector of the denoised image, and a decoder converts that feature vector into concrete pixel values in image space, producing a visualized, accurate, high-fidelity denoised image. This solves the problem of low accuracy of generated images in the prior art and improves the accuracy and efficiency of image generation.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.

Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;

Fig. 2 is a schematic flow chart of an image processing method based on multi-modal information according to an embodiment of the present disclosure;

Fig. 3 is a flow chart of another image processing method based on multi-modal information according to an embodiment of the present disclosure;

Fig. 4 is a flow chart of yet another image processing method based on multi-modal information according to an embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of an image processing apparatus based on multi-modal information according to an embodiment of the present disclosure;

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.

An image processing method and apparatus based on multi-modal information according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 1, 2 and 3, a server 4 and a network 5.

The terminal devices 1, 2 and 3 may be hardware or software. When the terminal devices 1, 2 and 3 are hardware, they may be various electronic devices having a display screen and supporting communication with the server 4, including but not limited to smart phones, tablet computers, laptop computers, desktop computers and the like; when the terminal devices 1, 2 and 3 are software, they may be installed in the electronic devices described above. The terminal devices 1, 2 and 3 may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not limited by the embodiments of the present disclosure. Further, various applications, such as data processing applications, instant messaging tools, social platform software, search applications, shopping applications and the like, may be installed on the terminal devices 1, 2 and 3.

The server 4 may be a server that provides various services, for example, a background server that receives requests sent by a terminal device with which it has established a communication connection; the background server may receive and analyze such a request and generate a processing result. The server 4 may be a single server, a server cluster formed by a plurality of servers, or a cloud computing service center, which is not limited by the embodiments of the present disclosure.

The server 4 may be hardware or software. When the server 4 is hardware, it may be various electronic devices that provide various services to the terminal devices 1, 2 and 3. When the server 4 is software, it may be multiple pieces of software or software modules providing various services to the terminal devices 1, 2 and 3, or a single piece of software or software module providing such services, which is not limited by the embodiments of the present disclosure.

The network 5 may be a wired network connected by coaxial cable, twisted pair or optical fiber, or a wireless network that interconnects communication devices without wiring, for example Bluetooth, Near Field Communication (NFC) or infrared, which is not limited by the embodiments of the present disclosure.

The user can establish a communication connection with the server 4 via the network 5 through the terminal devices 1, 2, and 3 to receive or transmit information or the like. Specifically, the server 4 acquires a prompt text and an image, wherein the prompt text and the image contain a target object; adding random noise data corresponding to t time steps to the image to obtain a feature vector of a noise image corresponding to the image, wherein t is a positive integer; extracting features of the prompt text to obtain feature vectors of the prompt text, extracting features of the image to obtain feature vectors of the image, and carrying out feature fusion processing based on the feature vectors of the prompt text and the feature vectors of the image to obtain multi-mode feature enhancement vectors; based on the multi-mode feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image, carrying out t times of noise prediction processing to obtain t noise feature vectors corresponding to the noise image; a denoised image containing the target object is generated based on the feature vectors of the t noises and the feature vectors of the noise image.

It should be noted that the specific types, numbers and combinations of the terminal devices 1, 2 and 3, the server 4 and the network 5 may be adjusted according to the actual requirements of the application scenario, which is not limited by the embodiments of the present disclosure.

Fig. 2 is a flowchart of an image processing method based on multi-mode information according to an embodiment of the present disclosure. The image processing method of fig. 2 based on multi-modal information may be performed by the server of fig. 1. As shown in fig. 2, the method includes:

Step 201, acquiring a prompt text and an image, wherein the prompt text and the image contain a target object.

In some embodiments, the task of generating a target image requires a multi-modal feature enhancement vector generated from a prompt text and an image. The prompt text and the image each contain at least one target object, and the generated target image also contains that target object. For example, the prompt text contains an object A and the image also contains the object A; specifically, the object A may be a "cat", that is, the prompt text may be "a picture of a cat" and the image is a specific image containing a "cat". Using both the text information and the image information enhances the model's understanding of the text and provides more information and more accurate guidance for the subsequent generation of the target image.

Step 202, adding random noise data corresponding to t time steps to the image to obtain a feature vector of a noise image corresponding to the image, wherein t is a positive integer.

In some embodiments, the time step is the number of steps in which random noise data is added to the image, and different time steps correspond to different numbers of noise-adding steps. The random noise data may be Gaussian noise. The image is input into a diffusion model, which adds the random noise data corresponding to t time steps to the image. The diffusion model includes two processes: a forward process, also known as the diffusion process, which adds noise data to an input image to obtain a noise image, and a reverse process, which predicts the noise in the noise image. When noise is added to the input image, the amount of random noise data differs between time steps: the larger the time step t, the larger the amount of noise. For example, when t is equal to 10, random noise data is added to the image z0 input to the diffusion model to obtain a noise image z1, random noise data is added to the noise image z1 to obtain a noise image z2, and so on until a noise image z10 is obtained. Specifically, the diffusion model adds to each pixel of the image a random noise value drawn from a Gaussian distribution; adding the noise value to the original pixel value changes the gray value or color value of the pixel. Feature extraction is then performed on the generated noise image z10, for example through a convolutional neural network or a residual neural network, to obtain the feature vector of the noise image, which is used for subsequent tasks such as noise prediction and image generation.
By gradually adding Gaussian noise to the image, the diffusion model simulates the degradation of the image under various influences and generates a series of noise images, which can serve as training data for learning image denoising, that is, for training the diffusion model.
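The step-by-step noising described above (z0 to z1 to ... to zt) can be sketched as follows. This is a minimal illustration that follows the text literally by adding a Gaussian value to each pixel at every step; the function name `add_noise_steps`, the fixed standard deviation `sigma` and the `seed` parameter are assumptions made for the example and do not come from the disclosure. Practical diffusion models typically also rescale the signal according to a noise schedule, which is not modeled here.

```python
import random


def add_noise_steps(pixels, t, sigma=0.1, seed=0):
    """Return [z0, z1, ..., zt], adding Gaussian noise once per time step.

    pixels: flat list of pixel values (hypothetical stand-in for an image).
    sigma:  assumed per-step noise standard deviation (not from the source).
    """
    rng = random.Random(seed)
    states = [list(pixels)]  # z0 is the clean input image
    for _ in range(t):
        previous = states[-1]
        # Each pixel receives an independent zero-mean Gaussian perturbation.
        states.append([p + rng.gauss(0.0, sigma) for p in previous])
    return states


z = add_noise_steps([0.2, 0.5, 0.8], t=10)
noisy_image = z[-1]  # corresponds to z10 in the example above
```

Because the noise accumulates from one step to the next, a larger t yields a noisier final image, which matches the statement that the noise amount grows with the time step.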

Step 203, extracting features of the prompt text to obtain feature vectors of the prompt text, extracting features of the image to obtain feature vectors of the image, and performing feature fusion processing based on the feature vectors of the prompt text and the feature vectors of the image to obtain multi-mode feature enhancement vectors.

In some embodiments, the prompt text may be vector-embedded through an embedding layer, which converts each word of the prompt text into a fixed-length vector to obtain the feature vector of the prompt text. This converts the prompt text into a form a computer can process, captures its semantic information, and improves the generalization capability and robustness of the model. Features of the input image may be extracted through an image encoder, which converts the image into a form a computer can process and captures features such as color, texture and shape. Feature fusion is then performed on the feature vector of the prompt text and the feature vector of the image to obtain the multi-modal feature enhancement vector, a richer feature representation that contains both the semantic information of the prompt text and the visual information of the image. This improves the diversity and generalization capability of the diffusion model and its image generation capability, yields more accurate noise prediction, and allows an accurate, high-fidelity denoised image containing the target object to be generated.

Step 204, performing t times of noise prediction processing on the basis of the multi-mode feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image to obtain t noise feature vectors corresponding to the noise image.

In some embodiments, the multi-modal feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image are input into the diffusion model, and the adaptive capability of the trained diffusion model is used to combine the multi-modal features and perform t rounds of noise prediction on the noise image. Each round yields one noise feature vector, and the t vectors together can be regarded as the random noise data that was added to the image z0 over the t time steps. The t noise feature vectors are used in the subsequent image generation task, which improves the accuracy of image generation and yields a highly realistic picture.

Step 205, generating a denoising image containing the target object based on the feature vectors of the t noises and the feature vectors of the noise image.

In some embodiments, the t noise feature vectors predicted by the diffusion model may be subtracted from the feature vector of the noise image to obtain a noise-free feature vector of the denoised image, and an accurate, high-fidelity denoised image containing the target object is generated from that feature vector.
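The subtraction in step 205 can be illustrated with a short sketch. It assumes, purely for illustration, that the feature vector of the noise image and each predicted noise feature vector are flat lists of equal length; the function name `denoise_features` is hypothetical, and the decoder that maps the result back to pixel values is not modeled.

```python
def denoise_features(noisy, predicted_noises):
    """Subtract every predicted noise vector from the noisy feature vector."""
    clean = list(noisy)
    for noise in predicted_noises:
        if len(noise) != len(clean):
            raise ValueError("noise and feature vectors must have equal length")
        # Element-wise removal of one predicted noise vector.
        clean = [c - n for c, n in zip(clean, noise)]
    return clean


# Two predicted noise vectors (t = 2) removed from a toy feature vector.
print(denoise_features([1.0, 2.0], [[0.25, 0.5], [0.25, 0.5]]))  # [0.5, 1.0]
```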

Based on the foregoing embodiment, random noise data corresponding to t time steps, that is, Gaussian noise, is added to the image to obtain a feature map of a noise image containing the Gaussian noise. The feature vector of the prompt text and the feature vector of the image are fused, combining the characteristics of the prompt text with those of the image to obtain the multi-modal feature enhancement vector. This strengthens the feature expression of both the image and the prompt text, makes use of more of the available information, and helps the model better understand their content so that a more accurate and realistic image is generated. Based on the multi-modal feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image, t rounds of noise prediction are performed to obtain t predicted noise feature vectors corresponding to the noise image. A denoised image containing the target object is then generated from the t noise feature vectors and the feature vector of the noise image: the noise is suppressed and eliminated at the feature level to obtain the feature vector of the denoised image, and a decoder converts that feature vector into concrete pixel values in image space, producing a visualized, accurate, high-fidelity denoised image. This solves the problem of low accuracy of generated images in the prior art and improves the accuracy and efficiency of image generation.

In some embodiments, performing feature extraction on the prompt text to obtain the feature vector of the prompt text includes: performing vector embedding on the prompt text to obtain an initial feature vector of the prompt text; and encoding the initial feature vector of the prompt text by a first text encoder to obtain the feature vector of the prompt text.

In some embodiments, the prompt text may be vector-embedded by the embedding layer to obtain the initial feature vectors of the prompt text. The first text encoder may be the text encoder of a Contrastive Language-Image Pre-training (CLIP) model. For example, if the prompt text is "a picture of a cat", vector embedding of the prompt text extracts 5 initial feature vectors, with each word of the prompt text converted into a fixed-length feature vector. The text encoder of the CLIP model encodes the 5 initial feature vectors and captures the semantic and structural information of the text to obtain the feature vector t1 of the prompt text, which can be used for subsequent tasks such as noise prediction.
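As a rough illustration of the embedding step only, the sketch below maps each token of a five-token prompt to a fixed-length vector through a lookup table. The token names, the table built from random values and the helper names `build_embedding_table` and `embed` are all assumptions for the example; a trained embedding layer would hold learned values, and the subsequent CLIP text encoding is not modeled.

```python
import random


def build_embedding_table(vocab, dim, seed=0):
    """Create a toy embedding table: one fixed-length vector per token."""
    rng = random.Random(seed)
    # sorted() keeps the token-to-vector assignment reproducible across runs.
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in sorted(vocab)}


def embed(tokens, table):
    """Look up the initial feature vector of every token, preserving order."""
    return [table[tok] for tok in tokens]


# Placeholder tokens standing in for the five tokens of the example prompt.
tokens = ["tok1", "tok2", "cat", "tok4", "tok5"]
table = build_embedding_table(set(tokens), dim=4)
initial_vectors = embed(tokens, table)  # 5 vectors, each of length 4
```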

In some embodiments, feature extraction is performed on an image to obtain a feature vector of the image, including: extracting features of the image through an image encoder to obtain an initial feature vector of the image; and carrying out feature transformation on the initial feature vector of the image through a multi-layer perceptron to obtain the feature vector of the image.

In some embodiments, the image encoder may be the image encoder of the CLIP model, and the multi-layer perceptron contains two matrices. The image encoder converts the input image into a set of feature vectors, namely the initial feature vectors of the image, which capture the visual and semantic information of the image. The pre-trained multi-layer perceptron then performs feature transformation: the initial feature vector of the image is computed with the two matrices of the multi-layer perceptron to obtain the feature vector of the image, which better represents the global and semantic information of the image for subsequent processing such as feature fusion.
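A multi-layer perceptron with two matrices can be sketched as two matrix-vector products. The text does not specify the non-linearity between them, so the ReLU used below is an assumption, as are the function names and the toy weights; bias terms are omitted because the source mentions only the two matrices.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def mlp_transform(x, w1, w2):
    """Transform an initial image feature vector with two weight matrices."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]  # ReLU is an assumed activation
    return matvec(w2, hidden)


w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3x2 matrix
w2 = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]    # toy 2x3 matrix
print(mlp_transform([1.0, -2.0], w1, w2))  # [2.0, 0.0]
```

In the described method the output of this transformation must have the same length as a text embedding, so that it can later replace one of the prompt's initial feature vectors.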

In some embodiments, feature fusion processing is performed based on feature vectors of a prompt text and feature vectors of an image to obtain a multi-modal feature enhancement vector, including: carrying out local replacement processing on the initial feature vector of the prompt text based on the feature vector of the image to obtain an initial multi-mode feature vector; encoding the initial multi-modal feature vector through a second text encoder to obtain the multi-modal feature vector; weighting the multi-mode feature vector and the feature vector of the prompt text based on a preset weight to obtain a weighted processing result; and adding the weighted processing result and the multi-modal feature vector to obtain the multi-modal feature enhancement vector.

In some embodiments, the part of the initial feature vector of the prompt text that relates to the image may be locally replaced with the feature vector of the image. This captures the association and complementarity between the image and the text, better expresses the multi-modal information, and yields the initial multi-modal feature vector. For example, if the prompt text is "a picture of a cat", vector embedding extracts 5 initial feature vectors of the prompt text, namely text_emb1, text_emb2, text_emb3, text_emb4 and text_emb5, where the initial feature vector corresponding to "cat" is text_emb3. Replacing text_emb3 with the feature vector a of the image gives the initial multi-modal feature vectors text_emb1, text_emb2, a, text_emb4 and text_emb5. The initial multi-modal feature vector contains the feature information of both the image and the text and can be used for tasks such as feature fusion and noise prediction. The second text encoder may be the text encoder of the CLIP model; it encodes and refines the initial multi-modal feature vector to obtain the multi-modal feature vector t2, further enhancing its expressive power and generalization performance. A function f is then computed on the feature vector t1 of the prompt text and the multi-modal feature vector t2 to obtain the multi-modal feature enhancement vector t3. Specifically, for the position corresponding to the object "cat" in the specific image, f uses the multi-modal feature vector t2.
Because the multi-modal feature vector t2 contains the feature information of both the image and the prompt text, it expresses the information at that position better. For the other positions, that is, all positions except the one corresponding to the "cat" in the specific image, f uses weighted features computed as w·t1 + (1 - w)·t2, so that the feature at each such position is a weighted average of the feature vector t1 of the prompt text and the multi-modal feature vector t2. The weight w is a parameter that can be learned during training and controls the balance between t1 and t2. The result is the weighted processing result, which expresses the features of the other positions. The weighted processing result and the multi-modal feature vector t2 are added to obtain the multi-modal feature enhancement vector t3, which contains the information of the position corresponding to the "cat" in the image as well as the information of the other positions and can be used in the subsequent noise prediction task.
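The local replacement and the position-wise function f can be sketched as below. This is one possible reading of the description: the object position takes t2 directly and every other position takes the weighted average of t1 and t2. The helper names, the toy vectors and the fixed weight w = 0.5 are assumptions for illustration; in the described method w is learned during training, and t1 and t2 come from the text encoders, which are not modeled here.

```python
def local_replace(text_embs, image_vec, obj_index):
    """Replace the embedding at the object position with the image feature vector."""
    replaced = [list(v) for v in text_embs]
    replaced[obj_index] = list(image_vec)
    return replaced


def fuse(t1, t2, obj_index, w):
    """Build t3: t2 at the object position, w*t1 + (1 - w)*t2 elsewhere."""
    t3 = []
    for i, (a, b) in enumerate(zip(t1, t2)):
        if i == obj_index:
            t3.append(list(b))  # object position keeps the multi-modal feature
        else:
            t3.append([w * x + (1.0 - w) * y for x, y in zip(a, b)])
    return t3


# Toy per-position features; index 2 plays the role of text_emb3 ("cat").
t1 = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
t2 = [[3.0, 3.0], [4.0, 4.0], [9.0, 9.0]]
print(fuse(t1, t2, obj_index=2, w=0.5))  # [[2.0, 2.0], [3.0, 3.0], [9.0, 9.0]]
```

Setting w closer to 1 keeps the non-object positions nearer to the pure text features, while w closer to 0 lets the image-conditioned features dominate.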

Based on the foregoing embodiment, the diffusion model may include a feature fusion module 300, which may include an embedding layer 301, a first text encoder 302, an image encoder 303, a multi-layer perceptron 304, a local replacement processing module 305, a second text encoder 306 and a weighting processing module 307, as shown in Fig. 3. The prompt text is input into the embedding layer 301 for vector embedding to obtain the initial feature vector of the prompt text, and the initial feature vector of the prompt text is input into the first text encoder 302 for encoding to obtain the feature vector of the prompt text. The image is input into the image encoder 303 for feature extraction to obtain the initial feature vector of the image, and the initial feature vector of the image is input into the multi-layer perceptron 304 for feature transformation to obtain the feature vector of the image. The feature vector of the image and the initial feature vector of the prompt text are input into the local replacement processing module 305, which performs local replacement on the initial feature vector of the prompt text based on the feature vector of the image to obtain the initial multi-modal feature vector. The initial multi-modal feature vector is input into the second text encoder 306 for encoding to obtain the multi-modal feature vector. The multi-modal feature vector and the feature vector of the prompt text are input into the weighting processing module 307, which weights them based on a preset weight to obtain the weighted processing result; the weighted processing result and the multi-modal feature vector are then added to obtain the multi-modal feature enhancement vector.
The prompt text and the image are thus input into the feature fusion module 300 for feature extraction and feature fusion, and the module outputs the feature vector of the prompt text and the multi-modal feature enhancement vector. The result is a richer feature representation: the multi-modal feature enhancement vector contains both the semantic information of the prompt text and the visual information of the image, which improves the diversity and generalization capability of the diffusion model and its image generation capability, yields more accurate noise prediction, and allows an accurate, high-fidelity denoised image containing the target object to be generated.

In some embodiments, the diffusion model includes a plurality of downsampling processing modules and a plurality of upsampling processing modules, and performing t rounds of noise prediction based on the multi-modal feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image to obtain t noise feature vectors corresponding to the noise image includes: downsampling the multi-modal feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image through the plurality of downsampling processing modules to obtain a plurality of downsampling processing results; upsampling the plurality of downsampling processing results, the multi-modal feature enhancement vector and the feature vector of the prompt text through the plurality of upsampling processing modules to obtain a noise feature vector of the noise image; and performing the noise prediction processing t times on the feature vector of the noise image to obtain the t noise feature vectors corresponding to the noise image.

In some embodiments, the diffusion model includes a plurality of downsampling process modules and a plurality of upsampling process modules, each downsampling process module including a feature process module and a downsampling network layer, each upsampling process module including an upsampling network layer and a feature process module, each feature process module including a residual network, a self-attention layer, and a cross-attention layer.
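As a rough illustration of the cross-attention layer in each feature processing module, the sketch below uses the arrangement described for fig. 4, in which the query comes from the encoded noise image, the keys from the prompt-text features, and the values from the multi-modal feature enhancement vector. Scaled dot-product attention with a softmax is an assumption here; the text does not state the exact attention formula, and the to_k and to_v parameters are left out for brevity.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Attend from each query (noise-image position) over the key/value tokens."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs
```

Keys that are more similar to a query receive larger softmax weights, so the corresponding value vectors contribute more to the output for that position.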

For example, the diffusion model may include 9 feature processing modules, with the structure shown in fig. 4: a first feature processing module 401, a first downsampling network 402, a second feature processing module 403, a second downsampling network 404, a third feature processing module 405, a third downsampling network 406, a fourth feature processing module 407, a fourth downsampling network 408, a fifth feature processing module 409, a first upsampling network 410, a sixth feature processing module 411, a second upsampling network 412, a seventh feature processing module 413, a third upsampling network 414, an eighth feature processing module 415, a fourth upsampling network 416, and a ninth feature processing module 417, where each feature processing module includes a residual network, a self-attention layer, and a cross-attention layer. The feature vector of the noise image is input into the residual network in the first feature processing module 401 for feature transformation; the pre-designed residual network captures deeper features and context information of the noise image while effectively preventing network degradation, and yields a residual processing result of the noise image. The residual processing result of the noise image is input into the self-attention layer in the first feature processing module 401 for self-attention encoding, which computes self-attention weights for all positions of the noise image and captures long-range dependencies and global information in the input data, yielding an encoding result of the noise image. The encoding result of the noise image, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the cross-attention layer in the first feature processing module 401 for cross-attention processing: the key parameter and the feature vector of the prompt text are added to obtain a key vector, the value parameter and the multi-modal feature enhancement vector are added to obtain a value vector, and the encoding result of the noise image is used as a query vector for cross-attention processing with the key vector and the value vector, yielding a first feature processing result. The key parameter is a to_k matrix and the value parameter is a to_v matrix; each can be decomposed into two low-rank matrices, and these matrices are trainable parameters during training. The rank of the to_k matrix may be 4 and the rank of the to_v matrix may be 16, because the feature vector t1 of the prompt text, which is computed with the to_k matrix, leaves less room for adjustment and optimization, while the multi-modal feature enhancement vector t3, which is computed with the to_v matrix, leaves more. During training, adjusting the two matrices of the low-rank decomposition of the to_v matrix and the two matrices of the low-rank decomposition of the to_k matrix can accelerate the training of the diffusion model.
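To make the effect of the low-rank decomposition concrete, the sketch below counts the trainable parameters of two rank-r factors and rebuilds a full matrix from them. The ranks 4 and 16 come from the text; the matrix dimensions used in the usage note are hypothetical, since the disclosure does not give the sizes of to_k and to_v.

```python
from typing import List

Matrix = List[List[float]]


def low_rank_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def compose(b: Matrix, a: Matrix) -> Matrix:
    """Multiply B (d_out x rank) by A (rank x d_in) to get a d_out x d_in matrix."""
    rank = len(a)
    return [
        [sum(b[i][k] * a[k][j] for k in range(rank)) for j in range(len(a[0]))]
        for i in range(len(b))
    ]
```

For a hypothetical 320 x 768 projection, the full matrix has 245760 entries, while `low_rank_param_count(320, 768, 4)` gives 4352 and `low_rank_param_count(320, 768, 16)` gives 17408, which illustrates why training only the factors can be faster.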
In the cross-attention processing, the key parameter and the feature vector of the prompt text are added to obtain the key vector, the value parameter and the multi-modal feature enhancement vector are added to obtain the value vector, and the encoding result of the noise image serves as the query vector. The similarity between the query vector and the key and value vectors is calculated, the weights of similar features are increased, and the weights of dissimilar features are reduced, giving a more comprehensive and accurate feature representation, namely the first feature processing result. The first feature processing result is input into the first downsampling network 402 for downsampling to obtain a first downsampling processing result. The first downsampling processing result, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the second feature processing module 403 for feature processing to obtain a second feature processing result. The second feature processing result is input into the second downsampling network 404 for downsampling to obtain a second downsampling processing result. The second downsampling processing result, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the third feature processing module 405 for feature processing to obtain a third feature processing result. The third feature processing result is input into the third downsampling network 406 for downsampling to obtain a third downsampling processing result. The third downsampling processing result, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the fourth feature processing module 407 for feature processing to obtain a fourth feature processing result.
The fourth feature processing result is input into the fourth downsampling network 408 for downsampling to obtain a fourth downsampling processing result. The fourth downsampling processing result, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the fifth feature processing module 409 for feature processing to obtain a fifth feature processing result. The fifth feature processing result is input into the first upsampling network 410 for upsampling to obtain a first upsampling processing result. The first upsampling processing result and the fourth feature processing result are spliced to obtain a first splicing result, and the first splicing result, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the sixth feature processing module 411 for feature processing to obtain a sixth feature processing result. The sixth feature processing result is input into the second upsampling network 412 for upsampling to obtain a second upsampling processing result. The second upsampling processing result and the third feature processing result are spliced to obtain a second splicing result, and the second splicing result, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the seventh feature processing module 413 for feature processing to obtain a seventh feature processing result. The seventh feature processing result is input into the third upsampling network 414 for upsampling to obtain a third upsampling processing result.
The third upsampling processing result and the second feature processing result are spliced to obtain a third splicing result, and the third splicing result, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the eighth feature processing module 415 for feature processing to obtain an eighth feature processing result. The eighth feature processing result is input into the fourth upsampling network 416 for upsampling to obtain a fourth upsampling processing result. The fourth upsampling processing result and the first feature processing result are spliced to obtain a fourth splicing result, and the fourth splicing result, the multi-modal feature enhancement vector, and the feature vector of the prompt text are input into the ninth feature processing module 417 for feature processing to obtain a ninth feature processing result. Noise prediction is performed based on the ninth feature processing result: the information in the feature processing result is analyzed to learn the feature expression and distribution of the noise, yielding a noise feature vector of the noise image. This process is repeated t times on the feature vector of the noise image to obtain t noise feature vectors corresponding to the noise image. The t noise feature vectors can be used in the subsequent image generation task, improving the accuracy of image generation and producing highly realistic images.
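The wiring of the nine feature processing modules, four downsampling networks, and four upsampling networks can be traced with a toy sketch. Everything below the wiring itself is a stand-in chosen for illustration: features are lists of one-dimensional channels, `feature_process` simply averages channels instead of applying a residual network and attention layers, and the conditioning inputs (the multi-modal feature enhancement vector and the prompt-text features) are omitted.

```python
from typing import List

Feature = List[List[float]]  # list of channels, each a 1-D signal


def feature_process(x: Feature) -> Feature:
    """Stand-in for a feature processing module: average all channels into one."""
    n = len(x)
    return [[sum(ch[i] for ch in x) / n for i in range(len(x[0]))]]


def downsample(x: Feature) -> Feature:
    """Halve the length of each channel by averaging neighbouring pairs."""
    return [[(ch[i] + ch[i + 1]) / 2 for i in range(0, len(ch), 2)] for ch in x]


def upsample(x: Feature) -> Feature:
    """Double the length of each channel by repeating every value."""
    return [[v for v in ch for _ in range(2)] for ch in x]


def splice(a: Feature, b: Feature) -> Feature:
    """Concatenate two features along the channel axis."""
    return a + b


def predict_noise_features(x: Feature, depth: int = 4) -> Feature:
    skips = []
    h = x
    for _ in range(depth):            # modules 401..408
        h = feature_process(h)
        skips.append(h)               # keep each feature processing result
        h = downsample(h)
    h = feature_process(h)            # module 409
    for _ in range(depth):            # modules 410..417
        h = upsample(h)
        h = splice(h, skips.pop())    # first upsampling result meets the fourth feature result, and so on
        h = feature_process(h)
    return h
```

The `skips.pop()` order reproduces the pairing in the text: the first upsampling result is spliced with the fourth feature processing result, and the fourth upsampling result with the first. The input length must be divisible by 2 to the power of `depth`.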

In some embodiments, generating a denoised image containing the target object based on the feature vectors of the t noise and the feature vectors of the noise image comprises: subtracting t noise feature vectors from the feature vector of the noise image to obtain a feature vector of the denoising image; and inputting the feature vector of the denoising image into a decoder for decoding processing to generate the denoising image containing the target object.

In some embodiments, the t noise feature vectors are subtracted from the feature vector of the noise image to remove the influence of the noise features, yielding a purer feature vector, namely the feature vector of the denoised image. The feature vector of the denoised image is input into a decoder for decoding. The decoder is a deep learning model that converts the feature vector of the denoised image into a pixel representation of the image. Processing by the decoder yields the denoised image containing the target object, which improves the denoising effect and the accuracy of image generation.
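The subtraction and decoding step can be sketched as follows. The decoder is represented by a hypothetical callable, since the text only says it is a deep learning model that maps feature vectors to pixels.

```python
from typing import Callable, List

Vector = List[float]


def denoise(noisy: Vector, predicted_noises: List[Vector]) -> Vector:
    """Subtract each of the t predicted noise feature vectors from the noisy features."""
    result = list(noisy)
    for noise in predicted_noises:
        result = [x - n for x, n in zip(result, noise)]
    return result


def generate_image(noisy: Vector, predicted_noises: List[Vector],
                   decoder: Callable[[Vector], Vector]) -> Vector:
    """Denoise in feature space, then decode to pixel values (decoder is hypothetical)."""
    return decoder(denoise(noisy, predicted_noises))
```

For instance, `denoise([1.0, 2.0], [[0.25, 0.5], [0.25, 0.5]])` removes two predicted noise vectors and leaves `[0.5, 1.0]` to be passed to the decoder.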

In some embodiments, before adding the random noise data corresponding to the t time steps to the image, the method further includes: acquiring a training set, wherein the training set comprises a plurality of training images and training prompt texts corresponding to the training images; inputting a plurality of training images into a diffusion model, adding random noise data corresponding to k time steps to each training image to obtain a feature vector of a noise image of each training image, wherein k is a positive integer; feature extraction is carried out on each training prompt text to obtain feature vectors of each training prompt text, feature extraction is carried out on each training image to obtain feature vectors of each training image, and feature fusion processing is carried out on the basis of the feature vectors of each training prompt text and the feature vectors of each training image to obtain each multi-mode feature enhancement vector; respectively carrying out k times of noise prediction processing based on each multi-mode feature enhancement vector, the feature vector of each training prompt text and the feature vector of the noise image of each training image to obtain k noise feature vectors corresponding to the noise image of each training image; and determining a loss value of the diffusion model according to the k noise eigenvectors corresponding to the noise images of each training image and the eigenvectors of the random noise data corresponding to the k time steps added by each training image, and updating parameters in the diffusion model based on the loss value.

In some embodiments, the diffusion model to be trained is trained on a training set that includes a plurality of training images and the training prompt texts corresponding to those training images, where each training image and its paired training prompt text contain the same object. For example, if the training image is a specific image of a dog, the corresponding training prompt text may be "a picture of a dog". The plurality of training images are input into the diffusion model, and random noise data corresponding to k time steps is added to each training image to obtain the feature vector of the noise image of each training image. The random noise data may be Gaussian noise. Adding the random noise data corresponding to the k time steps is the forward process of the diffusion model, that is, the noise-adding process, in which the noise image of a training image at each time step depends on its noise image at the previous time step. The diffusion model adds a random noise value drawn from a Gaussian distribution to each pixel of the training image; the noise value is added to the original value of the pixel, which changes its gray value or color value. Feature extraction is then performed on the resulting noise image of the training image, for example through a convolutional neural network or a residual neural network, to obtain the feature vector of the noise image of each training image.
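A minimal sketch of the forward (noise-adding) process is given below. The text only states that Gaussian noise is added at each of the k time steps and that each step builds on the previous one; the square-root scaling by a per-step variance `beta` is the convention commonly used in denoising diffusion models and is an assumption here, as are the function and parameter names.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def add_noise(x: Vector, betas: List[float], rng: random.Random) -> Tuple[Vector, List[Vector]]:
    """Apply one Gaussian noising step per entry in `betas`.

    Returns the final noisy vector and the noise sampled at every step,
    which is what the predicted noise is later compared against in training.
    """
    noises = []
    for beta in betas:
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        # Each step depends on the noisy result of the previous step.
        x = [math.sqrt(1.0 - beta) * xi + math.sqrt(beta) * e for xi, e in zip(x, eps)]
        noises.append(eps)
    return x, noises
```

Calling `add_noise(pixels, [0.01, 0.02, 0.03], random.Random(0))` would apply k = 3 steps with a reproducible noise sequence; the schedule values are illustrative only.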

Each training prompt text can be vector-embedded through an embedding layer, which converts each word of the training prompt text into a fixed-length vector to obtain the feature vector of the training prompt text; this converts the training prompt text into a form that a computer can process and captures the semantic information of the prompt text. Feature extraction can be performed on the input training images through an image encoder, which converts the training images into a form that a computer can process and captures the color, texture, shape, and other features of the images. Feature fusion processing is performed on the feature vector of each training prompt text and the feature vector of the corresponding training image to obtain multi-modal feature enhancement vectors, which contain both the semantic information of the training prompt texts and the visual information of the training images and therefore give a richer feature representation.
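The embedding step, in which each word becomes a fixed-length vector, can be illustrated with a simple lookup table. Whitespace tokenization, the randomly initialized table, and all names below are assumptions made for the sketch; a real embedding layer would be learned and would typically use a subword tokenizer.

```python
import random
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer index to every distinct lowercase word."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def make_table(vocab_size: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Create one randomly initialized vector of length `dim` per vocabulary entry."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(text: str, vocab: Dict[str, int], table: List[List[float]]) -> List[List[float]]:
    """Look up the fixed-length vector of every word in the prompt."""
    return [table[vocab[word]] for word in text.lower().split()]
```

Every occurrence of the same word maps to the same vector, and every vector has the same length regardless of the word.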

The multi-modal feature enhancement vectors, the feature vectors of the training prompt texts, and the feature vectors of the noise images of the training images are input into the diffusion model, which combines its adaptive capability with the multi-modal features to perform k times of noise prediction processing on the noise image of each training image. Each pass yields one noise feature vector of the noise image of a training image. The k noise feature vectors of the noise image of the training image can be used in the subsequent image generation task, improving the accuracy of image generation and producing highly realistic images. A squared loss function is used to calculate the loss value between the k noise feature vectors corresponding to the noise image of each training image and the feature vectors of the random noise data corresponding to the k time steps added to that training image. The parameters in the diffusion model are updated according to the loss value based on a back propagation algorithm, and the trained diffusion model is obtained when the loss value is less than or equal to a preset value.
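The loss computation and the stopping condition can be sketched as below. Averaging the squared differences over all elements is an assumption; the text names a squared loss function without stating whether it is summed or averaged, and the threshold is whatever preset value is chosen.

```python
from typing import List

Vector = List[float]


def squared_loss(predicted: List[Vector], added: List[Vector]) -> float:
    """Mean squared difference between the k predicted and the k added noise vectors."""
    total = 0.0
    count = 0
    for p_vec, a_vec in zip(predicted, added):
        for p, a in zip(p_vec, a_vec):
            total += (p - a) ** 2
            count += 1
    return total / count


def is_trained(loss: float, preset_value: float) -> bool:
    """Training stops once the loss is less than or equal to the preset value."""
    return loss <= preset_value
```

For example, `squared_loss([[1.0, 2.0]], [[1.0, 4.0]])` compares one predicted noise vector with the noise that was actually added and returns 2.0.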

Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.

The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.

Fig. 5 is a schematic diagram of an image processing apparatus based on multi-modal information according to an embodiment of the present disclosure. As shown in fig. 5, the image processing apparatus based on multi-modality information includes:

The obtaining module 501 is configured to obtain a prompt text and an image, where the prompt text and the image include a target object;

The noise adding module 502 is configured to add random noise data corresponding to t time steps to an image, obtain a feature vector of a noise image corresponding to the image, and t is a positive integer;

The feature fusion module 503 is configured to perform feature extraction on the prompt text to obtain a feature vector of the prompt text, perform feature extraction on the image to obtain a feature vector of the image, and perform feature fusion processing based on the feature vector of the prompt text and the feature vector of the image to obtain a multi-mode feature enhancement vector;

The prediction module 504 is configured to perform t times of noise prediction processing based on the multi-mode feature enhancement vector, the feature vector of the prompt text, and the feature vector of the noise image, so as to obtain t feature vectors of noise corresponding to the noise image;

The image generating module 505 is configured to generate a denoised image including the target object based on the feature vectors of the t noises and the feature vectors of the noise image.

According to the technical scheme provided by the embodiment of the disclosure, the noise adding module 502 adds random noise data corresponding to t time steps, namely Gaussian noise, to the image to obtain a feature vector of a noise image containing the Gaussian noise. The feature fusion module 503 fuses the feature vector of the prompt text with the feature vector of the image, combining the features of the prompt text and the features of the image into a multi-modal feature enhancement vector. This enhances the feature expression of the image and the prompt text and makes use of more of the available information, helping the model better understand the content of the image and the prompt text and generate a more accurate and realistic image. The prediction module 504 performs t times of noise prediction processing based on the multi-modal feature enhancement vector, the feature vector of the prompt text, and the feature vector of the noise image to obtain t predicted noise feature vectors corresponding to the noise image. The image generation module 505 generates a denoised image containing the target object according to the t noise feature vectors and the feature vector of the noise image: the noise is suppressed and eliminated at the feature level to obtain the feature vector of the denoised image, and a decoder converts the feature vector of the denoised image into specific pixel values in image space, producing a visualized, accurate, high-fidelity denoised image. This solves the problem of low accuracy of generated images in the prior art and improves the accuracy and efficiency of image generation.
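One way to picture how modules 502 to 505 cooperate is a small container that calls them in order. The class name, the callable signatures, and the simple loop that invokes the prediction module t times are all hypothetical; the sketch only mirrors the division of responsibilities described above.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class MultiModalImageProcessor:
    add_noise: Callable[[Any, int], Any]            # noise adding module 502
    fuse: Callable[[str, Any], tuple]               # feature fusion module 503
    predict_noise: Callable[[Any, Any, Any], Any]   # prediction module 504
    generate: Callable[[List[Any], Any], Any]       # image generation module 505

    def run(self, prompt: str, image: Any, t: int) -> Any:
        noisy = self.add_noise(image, t)
        text_feat, enhanced = self.fuse(prompt, image)
        # Perform the noise prediction t times, collecting one result per pass.
        noises = [self.predict_noise(enhanced, text_feat, noisy) for _ in range(t)]
        return self.generate(noises, noisy)
```

Because each module is just a callable, a stub can replace any of them when exercising the data flow in isolation.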

In some embodiments, the feature fusion module 503 is configured to perform feature extraction on the prompt text to obtain a feature vector of the prompt text, including: performing vector embedding on the prompt text to obtain an initial feature vector of the prompt text; and encoding the initial feature vector of the prompt text by a first text encoder to obtain the feature vector of the prompt text.

In some embodiments, the feature fusion module 503 is configured to perform feature extraction on the image to obtain a feature vector of the image, including: extracting features of the image through an image encoder to obtain an initial feature vector of the image; and carrying out feature transformation on the initial feature vector of the image through a multi-layer perceptron to obtain the feature vector of the image.

In some embodiments, the feature fusion module 503 is configured to perform feature fusion processing based on feature vectors of the prompt text and feature vectors of the image, where obtaining the multi-modal feature enhancement vector includes performing local replacement processing on an initial feature vector of the prompt text based on the feature vectors of the image, to obtain an initial multi-modal feature vector; encoding the initial multi-modal feature vector through a second text encoder to obtain the multi-modal feature vector; weighting the multi-mode feature vector and the feature vector of the prompt text based on a preset weight to obtain a weighted processing result; and adding the weighted processing result and the feature vector of the prompt text to obtain the multi-mode feature enhancement vector.

In some embodiments, the diffusion model includes a plurality of downsampling processing modules and a plurality of upsampling processing modules, and the prediction module 504 is configured to perform t times of noise prediction processing based on the multi-modal feature enhancement vector, the feature vector of the prompt text, and the feature vector of the noise image, and obtaining t times of noise feature vectors corresponding to the noise image includes downsampling the multi-modal feature enhancement vector, the feature vector of the prompt text, and the feature vector of the noise image by the plurality of downsampling processing modules to obtain a plurality of downsampling processing results; the method comprises the steps that up-sampling processing is carried out on a plurality of down-sampling processing results, multi-mode feature enhancement vectors and feature vectors of prompt texts through a plurality of up-sampling processing modules, so that noise feature vectors of noise images are obtained; and carrying out t times of noise prediction processing on the feature vectors of the noise image to obtain t noise feature vectors corresponding to the noise image.

In some embodiments, the image generation module 505 is configured to generate a denoised image containing the target object based on the feature vectors of the t noises and the feature vectors of the noise image, including subtracting the feature vectors of the t noises from the feature vectors of the noise image to obtain the feature vectors of the denoised image; and inputting the feature vector of the denoising image into a decoder for decoding processing to generate the denoising image containing the target object.

In some embodiments, the image processing device based on the multimodal information is configured to obtain a training set before adding random noise data corresponding to t time steps to the image, the training set including a plurality of training images and training prompt texts corresponding to the plurality of training images; inputting a plurality of training images into a diffusion model, adding random noise data corresponding to k time steps to each training image to obtain a feature vector of a noise image of each training image, wherein k is a positive integer; feature extraction is carried out on each training prompt text to obtain feature vectors of each training prompt text, feature extraction is carried out on each training image to obtain feature vectors of each training image, and feature fusion processing is carried out on the basis of the feature vectors of each training prompt text and the feature vectors of each training image to obtain each multi-mode feature enhancement vector; respectively carrying out k times of noise prediction processing based on each multi-mode feature enhancement vector, the feature vector of each training prompt text and the feature vector of the noise image of each training image to obtain k noise feature vectors corresponding to the noise image of each training image; and determining a loss value of the diffusion model according to the k noise eigenvectors corresponding to the noise images of each training image and the eigenvectors of the random noise data corresponding to the k time steps added by each training image, and updating parameters in the diffusion model based on the loss value.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation process of the embodiments of the disclosure.

Fig. 6 is a schematic diagram of an electronic device 6 provided by an embodiment of the present disclosure. As shown in fig. 6, the electronic device 6 of this embodiment includes: a processor 601, a memory 602 and a computer program 603 stored in the memory 602 and executable on the processor 601. The steps of the various method embodiments described above are implemented by the processor 601 when executing the computer program 603. Or the processor 601 when executing the computer program 603 performs the functions of the modules/units of the apparatus embodiments described above.

The electronic device 6 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 6 may include, but is not limited to, a processor 601 and a memory 602. It will be appreciated by those skilled in the art that fig. 6 is merely an example of the electronic device 6 and is not limiting of the electronic device 6 and may include more or fewer components than shown, or different components.

The processor 601 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.

The memory 602 may be an internal storage unit of the electronic device 6, for example, a hard disk or a memory of the electronic device 6. The memory 602 may also be an external storage device of the electronic device 6, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), or the like, provided on the electronic device 6. The memory 602 may also include both internal and external storage units of the electronic device 6. The memory 602 is used to store computer programs and other programs and data required by the electronic device.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium (e.g., a computer readable storage medium). Based on such understanding, all or part of the flow of the methods of the above-described embodiments of the present disclosure may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, executable file form, or some intermediate form. The computer readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.

The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims

1. An image processing method based on multi-modal information, comprising:

Acquiring a prompt text and an image, wherein the prompt text and the image contain a target object;

Adding random noise data corresponding to t time steps to the image to obtain a feature vector of a noise image corresponding to the image, wherein t is a positive integer;

Extracting features of the prompt text to obtain feature vectors of the prompt text, extracting features of the image to obtain feature vectors of the image, and carrying out feature fusion processing on the basis of the feature vectors of the prompt text and the feature vectors of the image to obtain multi-mode feature enhancement vectors;

performing t times of noise prediction processing on the basis of the multi-mode feature enhancement vector, the feature vector of the prompt text and the feature vector of the noise image to obtain t noise feature vectors corresponding to the noise image;

and generating a denoised image containing the target object based on t feature vectors of the noise and feature vectors of the noise image.

2. The method according to claim 1, wherein the feature extraction of the prompt text to obtain a feature vector of the prompt text includes:

Performing vector embedding on the prompt text to obtain an initial feature vector of the prompt text;

and encoding the initial feature vector of the prompt text through a first text encoder to obtain the feature vector of the prompt text.

3. The method of claim 1, wherein the performing feature extraction on the image to obtain a feature vector of the image comprises:

Extracting features of the image through an image encoder to obtain an initial feature vector of the image;

and carrying out feature transformation on the initial feature vector of the image through a multi-layer perceptron to obtain the feature vector of the image.

4. The method according to claim 2, wherein the feature fusion processing is performed based on the feature vector of the prompt text and the feature vector of the image to obtain a multi-modal feature enhancement vector, including:

Carrying out local replacement processing on the initial feature vector of the prompt text based on the feature vector of the image to obtain an initial multi-modal feature vector;

encoding the initial multi-modal feature vector through a second text encoder to obtain a multi-modal feature vector;

Weighting the multi-modal feature vector and the feature vector of the prompt text based on preset weights to obtain a weighted processing result;

and adding the weighted processing result and the multi-modal feature vector to obtain the multi-modal feature enhancement vector.

5. The method according to claim 1, wherein the diffusion model includes a plurality of downsampling processing modules and a plurality of upsampling processing modules, and performing noise prediction processing t times based on the multi-modal feature enhancement vector, the feature vector of the prompt text, and the feature vector of the noise image to obtain the t noise feature vectors corresponding to the noise image comprises:

downsampling the multi-modal feature enhancement vector, the feature vector of the prompt text, and the feature vector of the noise image through the plurality of downsampling processing modules to obtain a plurality of downsampling processing results;

upsampling the plurality of downsampling processing results, the multi-modal feature enhancement vector, and the feature vector of the prompt text through the plurality of upsampling processing modules to obtain a noise feature vector of the noise image;

and performing noise prediction processing t times on the feature vector of the noise image to obtain the t noise feature vectors corresponding to the noise image.
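
By way of non-limiting illustration only, one pass through the downsampling and upsampling processing modules of claim 5 can be sketched on a one-dimensional feature vector. Pair averaging, element repetition, additive skip connections, and the scalar `cond` that stands in for the conditioning on the multi-modal feature enhancement vector and the prompt-text feature vector are all assumptions of this toy sketch. A practical model would use learned convolution and attention layers, and would repeat the pass for each of the t time steps.

```python
from typing import List

Vector = List[float]


def downsample(x: Vector, cond: float) -> Vector:
    """Toy downsampling module: average neighbouring pairs, then add the conditioning."""
    return [(x[i] + x[i + 1]) / 2 + cond for i in range(0, len(x) - 1, 2)]


def upsample(x: Vector, skip: Vector, cond: float) -> Vector:
    """Toy upsampling module: repeat each element, add the skip input and the conditioning."""
    doubled = [v for v in x for _ in range(2)]
    return [d + s + cond for d, s in zip(doubled, skip)]


def predict_noise(noisy: Vector, cond: float, depth: int = 2) -> Vector:
    """One U-shaped pass; len(noisy) is assumed to be divisible by 2**depth."""
    results = []  # the plurality of downsampling processing results
    h = noisy
    for _ in range(depth):
        h = downsample(h, cond)
        results.append(h)
    # Each upsampling module receives the result kept at the next finer resolution.
    skips = [noisy] + results[:-1]
    h = results[-1]
    for skip in reversed(skips):
        h = upsample(h, skip, cond)
    return h


noise_vec = predict_noise([1.0, 3.0, 5.0, 7.0], cond=0.0)
```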

6. The method of claim 1, wherein generating the denoised image containing the target object based on the t noise feature vectors and the feature vector of the noise image comprises:

subtracting the t noise feature vectors from the feature vector of the noise image to obtain a feature vector of the denoised image;

and inputting the feature vector of the denoised image into a decoder for decoding processing to generate the denoised image containing the target object.
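
By way of non-limiting illustration only, the subtraction and decoding of claim 6 can be sketched as follows. The plain element-wise subtraction follows the wording of the claim. The function `toy_decoder` is a hypothetical stand-in, and any per-step scaling that a particular sampling schedule might apply is not shown because the claims do not describe it.

```python
from typing import Callable, List

Vector = List[float]


def remove_noise(noisy: Vector, noises: List[Vector]) -> Vector:
    """Subtract each of the t predicted noise feature vectors from the noisy feature vector."""
    latent = noisy[:]
    for noise in noises:
        latent = [x - e for x, e in zip(latent, noise)]
    return latent


def toy_decoder(latent: Vector) -> Vector:
    """Hypothetical stand-in for the decoder: scales features to pixel intensities."""
    return [2.0 * v for v in latent]


def generate(noisy: Vector, noises: List[Vector], decoder: Callable[[Vector], Vector]) -> Vector:
    """Feature vector of the denoised image, then the decoded image."""
    return decoder(remove_noise(noisy, noises))


# t = 2 predicted noise feature vectors (assumed values).
denoised_vec = remove_noise([1.0, 2.0, 3.0], [[0.25, 0.5, 0.5], [0.25, 0.5, 0.5]])
image_out = generate([1.0, 2.0, 3.0], [[0.25, 0.5, 0.5], [0.25, 0.5, 0.5]], toy_decoder)
```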

7. The method of claim 1, further comprising, before adding the random noise data corresponding to t time steps to the image:

acquiring a training set, wherein the training set comprises a plurality of training images and training prompt texts corresponding to the training images;

inputting the plurality of training images into a diffusion model, and adding random noise data corresponding to k time steps to each training image to obtain a feature vector of a noise image of each training image, wherein k is a positive integer;

extracting features from each training prompt text to obtain a feature vector of each training prompt text, extracting features from each training image to obtain a feature vector of each training image, and performing feature fusion processing based on the feature vector of each training prompt text and the feature vector of each training image to obtain multi-modal feature enhancement vectors;

performing noise prediction processing k times based on each multi-modal feature enhancement vector, the feature vector of each training prompt text, and the feature vector of the noise image of each training image, to obtain k noise feature vectors corresponding to the noise image of each training image;

and determining a loss value of the diffusion model according to the k noise feature vectors corresponding to the noise image of each training image and the feature vectors of the random noise data corresponding to the k time steps added to each training image, and updating parameters of the diffusion model based on the loss value.
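
By way of non-limiting illustration only, the loss determination and parameter update of claim 7 can be sketched with a one-parameter toy model. The mean squared error, the gradient-descent update, the learning rate, and the model `predict` (which scales its input by a single parameter `theta`) are assumptions of this sketch; the claims state only that a loss value is determined from the predicted and the added noise and that parameters are updated based on it.

```python
from typing import List

Vector = List[float]


def predict(theta: float, features: List[Vector]) -> List[Vector]:
    """Hypothetical one-parameter noise predictor: scales every input by theta."""
    return [[theta * v for v in vec] for vec in features]


def mse_loss(predicted: List[Vector], added: List[Vector]) -> float:
    """Mean squared error between the k predicted and the k added noise vectors."""
    total, count = 0.0, 0
    for p_vec, a_vec in zip(predicted, added):
        for p, a in zip(p_vec, a_vec):
            total += (p - a) ** 2
            count += 1
    return total / count


def sgd_step(theta: float, features: List[Vector], added: List[Vector], lr: float) -> float:
    """One gradient-descent update of theta using the analytic gradient of the MSE."""
    grad_sum, count = 0.0, 0
    for f_vec, a_vec in zip(features, added):
        for f, a in zip(f_vec, a_vec):
            grad_sum += 2.0 * (theta * f - a) * f
            count += 1
    return theta - lr * grad_sum / count


# k = 2 time steps; the added noise equals 2 * features, so the ideal theta is 2.
features = [[1.0, 2.0], [2.0, 1.0]]
added_noise = [[2.0, 4.0], [4.0, 2.0]]

theta = 0.0
loss_before = mse_loss(predict(theta, features), added_noise)
theta = sgd_step(theta, features, added_noise, lr=0.1)
loss_after = mse_loss(predict(theta, features), added_noise)
```

Repeating the update over many training images and time steps would move the parameter toward the value that best reproduces the added noise.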

8. An image processing apparatus based on multi-modal information, comprising:

an acquisition module configured to acquire a prompt text and an image, wherein the prompt text and the image contain a target object;

a noise adding module configured to add random noise data corresponding to t time steps to the image to obtain a feature vector of a noise image corresponding to the image, wherein t is a positive integer;

a feature fusion module configured to extract features from the prompt text to obtain a feature vector of the prompt text, extract features from the image to obtain a feature vector of the image, and perform feature fusion processing based on the feature vector of the prompt text and the feature vector of the image to obtain a multi-modal feature enhancement vector;

a prediction module configured to perform noise prediction processing t times based on the multi-modal feature enhancement vector, the feature vector of the prompt text, and the feature vector of the noise image to obtain t noise feature vectors corresponding to the noise image;

and an image generation module configured to generate a denoised image containing the target object based on the t noise feature vectors and the feature vector of the noise image.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.

10. A readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.