CN118014858A - Image fusion method and device, electronic equipment and storage medium


Publication number
CN118014858A
Authority
CN
China
Prior art keywords
image
text
fused
features
fusion
Prior art date
Legal status
Pending
Application number
CN202410116927.5A
Other languages
Chinese (zh)
Inventor
曾艳兵
李岩
高婷婷
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202410116927.5A
Publication of CN118014858A

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure relates to an image fusion method and apparatus, an electronic device and a storage medium. The method includes: acquiring at least two images to be fused, and encoding each image to be fused by an image encoder to obtain a plurality of image features to be fused; obtaining text features corresponding to each image feature to be fused according to a trained image-text conversion model and the plurality of image features to be fused; and performing diffusion processing on standard noise data according to a trained image diffusion model to obtain noise information containing the image information of each image to be fused, and generating a fused image. By adopting the method and the device, images and texts are mutually translated through the image-text conversion model to obtain the image features of a plurality of images and the corresponding text semantic features, and a fused image containing high-level semantic features is generated by denoising in the diffusion model. The fused image matches the image information of each image to be fused, so a fused image with a vivid visual effect can be generated and the quality of the generated fused image is improved.

Description

Image fusion method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of image processing, and in particular relates to an image fusion method, an image fusion device, electronic equipment and a storage medium.
Background
With the rapid development of the image processing field, users have new demands on images. For example, when massive image-text content is available, image fusion can expand production creativity for creators and improve production efficiency.
In the related art, images are fused in pixel space, for example by adjusting the transparency of two pictures and then superimposing them. Such image fusion can only operate in a low-level pixel space, so the generated fused image looks stiff and is of poor quality.
Disclosure of Invention
The disclosure provides an image fusion method, an image fusion device, an electronic device and a storage medium, so as to at least solve the problem that fused images generated in the related art look stiff. The technical scheme of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided an image fusion method, including:
acquiring at least two images to be fused, and encoding each image to be fused through an image encoder to obtain a plurality of image characteristics to be fused;
obtaining text features corresponding to the image features to be fused according to the trained image-text conversion model and the plurality of image features to be fused;
inputting the image features to be fused, text features corresponding to the image features to be fused and standard noise data into a trained image diffusion model, and performing diffusion processing on the standard noise data to obtain noise information containing image information of the images to be fused;
and performing image decoding processing on the noise information based on an image decoder to generate a fusion image.
In one embodiment, the inputting of the image features to be fused, the text features corresponding to the image features to be fused, and the standard noise data into the trained image diffusion model and performing diffusion processing on the standard noise data to obtain noise information including image information of the images to be fused includes:
splicing the image features to be fused corresponding to the images to be fused respectively to obtain image fusion features; splicing the text features corresponding to the features of the images to be fused to obtain text fusion features;
carrying out fusion processing on the image fusion characteristics and the text fusion characteristics to obtain fusion characteristics;
and performing diffusion processing on the standard noise data through the trained image diffusion model and the fusion characteristics to obtain noise information containing image information of each image to be fused.
In one embodiment, the performing diffusion processing on the standard noise data through the trained image diffusion model and the fusion feature to obtain noise information including image information of each image to be fused includes:
and carrying out noise reduction processing on the standard noise data by taking the fusion characteristics as conditions through the image diffusion model to obtain noise information containing image information of each image to be fused.
In one embodiment, the method further comprises:
Acquiring first training data, wherein the first training data comprises a sample image and a sample text;
extracting sample image features of each sample image and sample text features of each sample text;
Text conversion is carried out on the sample image characteristics through a text conversion sub-model in the image-text conversion model to be trained to obtain predicted text characteristics, and image conversion is carried out on the sample text characteristics through an image conversion sub-model in the image-text conversion model to be trained to obtain predicted image characteristics;
calculating a first loss value between the sample text feature and the predicted text feature, updating the text conversion sub-model through the first loss value, and calculating a second loss value between the sample image feature and the predicted image feature, and updating the image conversion sub-model through the second loss value until a preset training completion condition is met, so as to obtain a trained image-text conversion model.
In one embodiment, the method further comprises:
acquiring second training data, wherein the second training data comprises a sample image and a sample description text corresponding to the sample image;
coding the sample description text to obtain sample text characteristics, and obtaining sample image characteristics corresponding to the sample text characteristics through a trained image-text conversion model;
Inputting standard noise data, the sample text features and the sample image features into an image diffusion model, processing the standard noise data by taking the sample text features and the sample image features as conditions in the image diffusion model to obtain prediction noise data containing image information of a sample image, and inputting the prediction noise data into an image decoder to obtain a prediction image;
And calculating a loss function according to the predicted image and the sample image, and updating the image diffusion model through a loss value corresponding to the loss function until a preset training completion condition is met, so as to obtain a trained image diffusion model.
In one embodiment, the performing stitching processing on the image features to be fused corresponding to each image to be fused to obtain image fusion features includes:
and carrying out weighted calculation on the image features to be fused corresponding to the images to be fused respectively based on the image fusion weights corresponding to the images to be fused respectively, so as to obtain image fusion features.
In one embodiment, the performing a stitching process on the text features corresponding to the features of the images to be fused to obtain text fusion features includes:
And carrying out weighted calculation on text features corresponding to the images to be fused based on the text fusion weights corresponding to the images to be fused respectively, so as to obtain text fusion features.
According to a second aspect of embodiments of the present disclosure, there is provided an image fusion apparatus including:
the first acquisition unit is configured to acquire at least two images to be fused, and encode each image to be fused through the image encoder to obtain a plurality of image characteristics to be fused;
The determining unit is configured to obtain the text features corresponding to the image features to be fused according to the trained image-text conversion model and the image features to be fused;
the noise information generation unit is configured to perform inputting the image features to be fused, text features corresponding to the image features to be fused and standard noise data into a trained image diffusion model, and performing diffusion processing on the standard noise data to obtain noise information containing image information of the images to be fused;
And an image generation unit configured to perform image decoding processing on the noise information based on an image decoder, and generate a fusion image.
In one embodiment, the noise information generating unit includes:
The splicing subunit is configured to splice the image features to be fused corresponding to the images to be fused respectively to obtain image fusion features; splicing the text features corresponding to the features of the images to be fused to obtain text fusion features;
the fusion subunit is configured to fuse the image fusion characteristics and the text fusion characteristics to obtain fusion characteristics;
and the noise information generation subunit is configured to perform diffusion processing on the standard noise data through the trained image diffusion model and the fusion characteristics to obtain noise information containing the image information of each image to be fused.
In one embodiment, the noise information generating subunit is specifically configured to:
and carrying out noise reduction processing on the standard noise data by taking the fusion characteristics as conditions through the image diffusion model to obtain noise information containing image information of each image to be fused.
In one embodiment, the apparatus further comprises:
A second acquisition unit configured to acquire first training data including a sample image and a sample text;
an extraction unit configured to extract sample image features of each of the sample images and sample text features of each of the sample texts;
The conversion unit is configured to perform text conversion on the sample image characteristics through a text conversion sub-model in the image-text conversion model to be trained to obtain predicted text characteristics, and perform image conversion on the sample text characteristics through an image conversion sub-model in the image-text conversion model to be trained to obtain predicted image characteristics;
And the first loss value calculating unit is configured to calculate a first loss value between the sample text feature and the predicted text feature, update the text conversion sub-model through the first loss value, calculate a second loss value between the sample image feature and the predicted image feature, and update the image conversion sub-model through the second loss value until a preset training completion condition is met, so as to obtain a trained image-text conversion model.
In one embodiment, the apparatus further comprises:
A third acquisition unit configured to acquire second training data including a sample image and a sample description text corresponding to the sample image;
The coding unit is configured to code the sample description text to obtain sample text characteristics, and obtain sample image characteristics corresponding to the sample text characteristics through a trained image-text conversion model;
The training unit is configured to input standard noise data, the sample text features and the sample image features into an image diffusion model, process the standard noise data by taking the sample text features and the sample image features as conditions in the image diffusion model to obtain prediction noise data containing image information of a sample image, and input the prediction noise data into an image decoder to obtain a prediction image;
And the second loss value calculation unit is configured to calculate a loss function according to the predicted image and the sample image, and update the image diffusion model through the loss value corresponding to the loss function until a preset training completion condition is met, so as to obtain a trained image diffusion model.
In one embodiment, the splicing subunit is specifically configured to:
and carrying out weighted calculation on the image features to be fused corresponding to the images to be fused respectively based on the image fusion weights corresponding to the images to be fused respectively, so as to obtain image fusion features.
In one embodiment, the splicing subunit is specifically further configured to:
And carrying out weighted calculation on text features corresponding to the images to be fused based on the text fusion weights corresponding to the images to be fused respectively, so as to obtain text fusion features.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
A processor;
A memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image fusion method according to any one of the first aspects above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the image fusion method as described in any one of the first aspects above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product including instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the image fusion method as described in any one of the first aspects above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
By adopting the method and the device, images and texts are mutually translated through the image-text conversion model to obtain the image features of a plurality of images and the corresponding text semantic features, and a fused image containing high-level semantic features is generated by denoising in the diffusion model. The fused image matches the image information of each image to be fused, so a fused image with a vivid visual effect can be generated and the quality of the generated fused image is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is an application environment diagram illustrating an image fusion method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of image fusion according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a fusion step in one embodiment.
Fig. 4 is a flow chart of a training step in one embodiment.
Fig. 5 is a flow chart of a training step in one embodiment.
Fig. 6 is a schematic diagram of a training process of an image-text conversion model in one embodiment.
Fig. 7 is a schematic diagram of a training process of an image diffusion model in one embodiment.
Fig. 8 is a flowchart of an image fusion method according to another embodiment.
fig. 9 is a block diagram illustrating an image fusion apparatus according to an exemplary embodiment.
Fig. 10 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be further noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
The image fusion method provided by the disclosure can be applied to an application environment as shown in fig. 1. Wherein the terminal 110 interacts with the server 120 through a network. The terminal 110 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, etc., and the server 120 may be implemented by a stand-alone server or a server cluster formed by a plurality of servers.
Fig. 2 is a flowchart illustrating an image fusion method according to an exemplary embodiment, and as shown in fig. 2, the image fusion method is used in an electronic device, and includes the following steps.
In step S210, at least two images to be fused are obtained, and each image to be fused is encoded by an image encoder, so as to obtain a plurality of image features to be fused.
The images to be fused may be images containing different semantics. The image encoder may be a module for extracting image feature vectors from images, and the image features to be fused may be feature vectors obtained by performing feature extraction on the images. The image encoder may include a neural network having a plurality of layers, each of which may include an encoding block used for image feature extraction or the like, or the image encoder may use a convolutional neural network (CNN) such as EfficientNet.
Specifically, the electronic device may obtain at least two images to be fused, and perform feature extraction on each image to be fused through a pre-trained image encoder, so as to obtain features of the images to be fused, which correspond to each image to be fused. In one example, the electronic device may obtain, in response to a picture selection operation triggered by a user, at least two pictures corresponding to the picture selection operation as images to be fused; the electronic device may also obtain at least two images to be fused carried by the image fusion request after receiving the image fusion request sent by the user terminal, and the method for obtaining each image to be fused is not specifically limited in the disclosure.
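By way of a non-limiting illustration only, the encoding of step S210 could look like the following PyTorch-style sketch; the encoder architecture, the 768-dimensional feature size and the input resolution are assumptions made here for illustration and are not specified by the disclosure (in practice a pre-trained backbone such as a CLIP image tower or EfficientNet would typically fill this role).

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Illustrative stand-in for the pre-trained image encoder."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(128, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, H, W) -> (batch, feat_dim)
        return self.proj(self.backbone(x))

# Step S210: encode each of the at least two images to be fused into an image feature.
encoder = ImageEncoder()
images_to_fuse = [torch.randn(1, 3, 224, 224) for _ in range(2)]   # placeholder image tensors
image_features = [encoder(img) for img in images_to_fuse]          # one feature vector per image
```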
In step S220, text features corresponding to the features of each image to be fused are obtained according to the trained image-text conversion model and the features of the plurality of images to be fused.
The image-text conversion model may be a model trained in advance on text-image data pairs and used for converting between image features corresponding to images and text features (semantic features) corresponding to texts; the image-text conversion model may be a Prior network. For example, the image-text conversion model may comprise a plurality of sub-models, including a text conversion sub-model for converting image features into text features and an image conversion sub-model for converting text features into image features.
Specifically, the electronic device may perform conversion processing according to the image features to be fused corresponding to each image to be fused and a pre-trained image-text conversion model, so as to obtain text features corresponding to each image feature to be fused. In one example, the electronic device may perform text conversion processing on the image features to be fused of each image to be fused through a text conversion sub-model in the pre-trained image-text conversion model to obtain the text features corresponding to each image feature to be fused, where the text features describe, in the feature vector dimension, the high-level semantic information of the image to be fused.
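A minimal sketch of this text conversion sub-model (Prior-T) is given below; it assumes, for illustration only, a small feed-forward network operating on 768-dimensional features, since the real Prior network architecture is not detailed in the disclosure.

```python
import torch
import torch.nn as nn

class PriorT(nn.Module):
    """Illustrative text conversion sub-model: maps an image feature to a text feature."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, image_feature: torch.Tensor) -> torch.Tensor:
        return self.net(image_feature)

# Step S220: translate each image feature to be fused into its corresponding text feature.
prior_t = PriorT()
image_features = [torch.randn(1, 768), torch.randn(1, 768)]   # placeholder features from step S210
text_features = [prior_t(f) for f in image_features]
```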
In step S230, the image features to be fused, the text features corresponding to the image features to be fused, and the standard noise data are input into the trained image diffusion model, and the standard noise data are subjected to diffusion processing, so as to obtain noise information including the image information of the images to be fused.
The standard noise data may include noise data such as Gaussian noise. The image diffusion model is a model that performs noise reduction processing on noise data based on image features and text features to obtain noise information containing image information; for example, the image diffusion model may be a UNet model or the like.
Specifically, the electronic device may perform diffusion (diffusion) denoising processing on standard noise data through a pre-trained image diffusion model, each image feature to be fused and text features corresponding to each image feature to be fused, so as to obtain an output result of the image diffusion model, where the output result may be noise information including image information of each image to be fused.
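The conditional denoising of step S230 could be sketched as below. The latent shape, the number of steps and the simplified update rule are illustrative assumptions; a real implementation would use a proper noise scheduler (e.g. DDPM or DDIM) rather than the schematic update shown here.

```python
import torch

@torch.no_grad()
def conditional_denoise(unet, cond: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Start from standard Gaussian noise and iteratively remove noise with a UNet
    conditioned on the fused image/text features `cond`."""
    z = torch.randn(1, 4, 64, 64)                       # standard noise data (assumed latent shape)
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        predicted_noise = unet(z, t_batch, cond)        # noise prediction, conditioned on the features
        z = z - predicted_noise / steps                 # schematic update; a real scheduler is assumed
    return z                                            # noise information carrying the fused image content
```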
In step S240, the image decoder performs image decoding processing on the noise information to generate a fused image.
Among them, an image Decoder (img Decoder) may be a module that performs decoding processing on input data to output a decoded image.
Specifically, the electronic device inputs noise information including image information of each image to be fused, which is output by the image diffusion model, to the image decoder, and decodes the noise information by the image decoder to obtain a fused image corresponding to the noise information including the image information of each image to be fused, where the fused image includes features of each image to be fused.
In the image fusion method, at least two images to be fused are obtained, and each image to be fused is encoded by an image encoder to obtain a plurality of image features to be fused; text features corresponding to each image feature to be fused are obtained according to the trained image-text conversion model and the plurality of image features to be fused; the image features to be fused, the text features corresponding to the image features to be fused and standard noise data are input into the trained image diffusion model, and the standard noise data is subjected to diffusion processing to obtain noise information containing image information of the images to be fused; and image decoding processing is performed on the noise information by the image decoder to generate a fused image. Images and texts are mutually translated through the image-text conversion model, so that the image features of the plurality of images and the corresponding text semantic features are obtained; denoising is carried out in the diffusion model to generate a fused image containing high-level semantic features, so that a more realistic fused image is generated and the quality of the generated fused image is improved.
In an exemplary embodiment, as shown in Fig. 3, step S230, in which each image feature to be fused, the text feature corresponding to each image feature to be fused, and standard noise data are input into the trained image diffusion model and the standard noise data is subjected to diffusion processing to obtain noise information including image information of each image to be fused, may be specifically implemented by the following steps:
In step S310, stitching processing is performed on the image features to be fused corresponding to each image to be fused, so as to obtain image fusion features; and performing splicing processing on text features corresponding to the features of the images to be fused to obtain text fusion features.
Specifically, the electronic device may perform superposition processing on each image feature to be fused after obtaining the image features to be fused corresponding to each image to be fused, so as to obtain image fusion features; the electronic equipment can also carry out superposition processing on text features corresponding to the images to be fused respectively to obtain text fusion features. That is, after obtaining the image features to be fused corresponding to each image to be fused, the electronic device may perform text conversion processing on each image feature to be fused to obtain text features corresponding to each image feature to be fused, and perform superposition processing on each image feature to be fused to obtain image fusion features, and perform superposition processing on text features corresponding to each image feature to be fused to obtain text fusion features.
In step S320, fusion processing is performed on the image fusion feature and the text fusion feature, so as to obtain a fusion feature.
Specifically, the electronic device may perform a stitching process on the image fusion feature and the text fusion feature to obtain a fusion feature, or may perform a weighted calculation on the image fusion feature and the text fusion feature by the electronic device to obtain a fusion feature.
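Both options mentioned above (stitching/concatenation and weighted combination) can be sketched as follows; the weight `alpha` and the choice between the two variants are assumptions for illustration.

```python
import torch

def fuse_modalities(image_fusion_feature: torch.Tensor,
                    text_fusion_feature: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Combine the image fusion feature and the text fusion feature into one fusion feature."""
    stitched = torch.cat([image_fusion_feature, text_fusion_feature], dim=-1)        # concatenation variant
    weighted = alpha * image_fusion_feature + (1.0 - alpha) * text_fusion_feature    # weighted variant
    return stitched  # or `weighted`, depending on the chosen fusion strategy
```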
In step S330, the standard noise data is subjected to diffusion processing by using the trained image diffusion model and the fusion characteristics, so as to obtain noise information including the image information of each image to be fused.
Specifically, the terminal may input the fusion feature and the standard noise data into a pre-trained image diffusion model, process the fusion feature and the standard noise data through the image diffusion model, and the electronic device may perform noise adding processing on the fusion feature based on the standard noise data in the image diffusion model, that is, perform image diffusion on the fusion feature to obtain a first feature including the standard noise data, and perform denoising processing on the first feature through an image inverse diffusion process to obtain noise information including image information of each image to be fused.
In one example, the terminal may input the image fusion feature, the text fusion feature and the standard noise data into the trained image diffusion model, and perform noise adding processing and noise reducing processing on the image fusion feature and the text fusion feature in the image diffusion model to obtain noise information including image information of each image to be fused. Specifically, the electronic device may perform noise adding processing on the image fusion feature and the text fusion feature based on the standard noise data in the image diffusion model, that is, perform image diffusion on the image fusion feature and the text fusion feature to obtain a first feature containing the standard noise data, and perform denoising processing on the first feature through an image inverse diffusion process to obtain noise information containing image information of each image to be fused.
Based on the scheme, the image fusion characteristics and the text fusion characteristics are decoded through the image diffusion model, so that the image information of each image to be fused can be better learned, and an output image with better image fusion effect is obtained.
In an exemplary embodiment, in step S330, the standard noise data is subjected to diffusion processing by using the trained image diffusion model and the fusion feature, so as to obtain noise information including image information of each image to be fused, which may be specifically implemented by the following steps:
And denoising the standard noise data by taking the fusion characteristics as conditions through an image diffusion model to obtain noise information containing image information of each image to be fused.
The fusion feature may be a feature obtained by performing fusion processing on the image fusion feature and the text fusion feature, and the fusion feature may also include the image fusion feature and the text fusion feature.
Specifically, the electronic device may input standard noise data and the fusion features into a trained image diffusion model, and in the image diffusion model, the fusion features are used as conditions to perform diffusion denoising processing, so as to obtain noise information including image information of each image to be fused.
In one example, the electronic device may perform image diffusion processing on the fusion feature based on the standard noise data in the image diffusion model, that is, add the standard noise data to the fusion feature to obtain a first noise feature, and perform inverse diffusion processing corresponding to image diffusion on the first noise feature to obtain a first denoising feature corresponding to the fusion feature, that is, the first denoising feature may be noise information including image information of each image to be fused.
Based on this scheme, the image fusion features and the text fusion features are decoded through the image diffusion model, so that the image information of each image to be fused can be better learned. After the high-level semantic features of the two images to be fused are extracted, they are fused using image weights, and the resulting new features are input into the diffusion model, so that the generated image carries the high-level semantic fusion information of the two images rather than a low-level pixel-wise fusion, yielding a more realistic and more aesthetically pleasing output image.
In an exemplary embodiment, as shown in fig. 4, the image fusion method further includes:
in step S410, first training data is acquired.
Wherein the first training data comprises a sample image and a sample text.
Specifically, in the process of training the image-text conversion model to be trained, the electronic device may acquire data pairs including a plurality of pictures and texts, that is, the training data may include a plurality of data pairs, where each data pair includes a sample image and a text for describing the sample image. Alternatively, each data pair contains a sample text and a sample image consistent with the image information described by the sample text.
In step S420, sample image features of each sample image and sample text features of each sample text are extracted.
Specifically, the electronic device can encode each sample text through a text encoder to obtain sample text features corresponding to each sample text. The text encoder may be a BERT (Bidirectional Encoder Representations from Transformers), GPT-3 (Generative Pre-trained Transformer 3) or T5 (Text-to-Text Transfer Transformer) model, or the text encoder may be implemented by a bidirectional long short-term memory (LSTM) neural network.
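As an illustration of the text-feature extraction, the sketch below uses a Hugging Face BERT checkpoint; the checkpoint name, the use of the [CLS] embedding as the sample text feature and the example sentence are assumptions, not requirements of the disclosure.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
text_encoder = AutoModel.from_pretrained("bert-base-chinese")

sample_text = "a dog runs in a river at sunset"                   # illustrative sample text
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**inputs)
sample_text_feature = outputs.last_hidden_state[:, 0, :]          # [CLS] embedding as the text feature
```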
In step S430, text conversion is performed on the sample image features through the text conversion sub-model in the to-be-trained text conversion model to obtain predicted text features, and image conversion is performed on the sample text features through the image conversion sub-model in the to-be-trained text conversion model to obtain predicted image features.
The image-text conversion model may include a text conversion sub-model and an image conversion sub-model, where the text conversion sub-model may be a Prior-T for translating image features into text features and the image conversion sub-model may be a Prior-I model for translating text features into image features.
Specifically, the electronic device may input the sample image feature into a text conversion sub-model included in the image-text conversion model, and translate the sample image feature through the text conversion sub-model, where an output result of the text conversion sub-model may be a predicted text feature corresponding to the sample image feature; accordingly, the electronic device may input the sample text feature into an image conversion sub-model included in the image-text conversion model, and translate the sample text feature through the image conversion sub-model, where an output result of the image conversion sub-model may be a predicted image feature corresponding to the sample text feature.
In step S440, a first loss value between the sample text feature and the predicted text feature is calculated and the text conversion sub-model is updated by the first loss value, and a second loss value between the sample image feature and the predicted image feature is calculated and the image conversion sub-model is updated by the second loss value, until a preset training completion condition is satisfied, so as to obtain a trained image-text conversion model.
The preset training completion condition may be that the number of training iterations of the model reaches a target number of iterations, or that the calculated first loss value and second loss value of the loss function have converged, and so on. The present disclosure does not limit the specific content of the preset training completion condition, which may be determined based on the actual application scenario. The loss value may be calculated based on a preset loss function; in one example, the loss function may be a mean square error function.
Specifically, for the text conversion sub-model in the image-text conversion model, the electronic device may take the mean square error between each sample text feature and the predicted text feature corresponding to that sample text feature as the first loss value. When it is determined that the preset training completion condition is not currently met, the electronic device updates the model parameters of the text conversion sub-model to be trained to obtain an updated text conversion sub-model and re-executes the step of obtaining the training data, until it is determined that the preset training completion condition is currently met.
Correspondingly, for the image conversion sub-model in the image-text conversion model, the electronic device may take the mean square error between each sample image feature and the predicted image feature corresponding to that sample image feature as the second loss value. When the preset training completion condition is not currently met, the electronic device updates the model parameters of the image conversion sub-model to be trained to obtain an updated image conversion sub-model and re-executes the step of obtaining the training data, until the trained image conversion sub-model is obtained when the preset training completion condition is currently met.
Based on the scheme, the accuracy and the comprehensiveness of model parameter updating can be further improved by calculating the loss value through calculating the sample image features, the sample text features, the predicted image features and the predicted text features, so that the accuracy of translation between the image features and the text features is ensured.
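One training step of the image-text conversion model described in steps S410 to S440 could be sketched as below; the optimizer setup and module interfaces are illustrative assumptions, while the two MSE losses follow the description above.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def prior_training_step(prior_t: nn.Module, prior_i: nn.Module,
                        opt_t: torch.optim.Optimizer, opt_i: torch.optim.Optimizer,
                        sample_image_feature: torch.Tensor,
                        sample_text_feature: torch.Tensor):
    # First loss value: predicted text feature vs. sample text feature -> update Prior-T.
    predicted_text_feature = prior_t(sample_image_feature)
    loss_t = mse(predicted_text_feature, sample_text_feature)
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()

    # Second loss value: predicted image feature vs. sample image feature -> update Prior-I.
    predicted_image_feature = prior_i(sample_text_feature)
    loss_i = mse(predicted_image_feature, sample_image_feature)
    opt_i.zero_grad(); loss_i.backward(); opt_i.step()

    return loss_t.item(), loss_i.item()
```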
In an exemplary embodiment, as shown in fig. 5, the image fusion method further includes:
in step S510, second training data is acquired.
The second training data comprises a sample image and sample description text corresponding to the sample image, and the sample description text corresponding to the sample image is text information for describing image content in the sample image.
In particular, the electronic device may acquire second training data including a plurality of data pairs in the process of training the image diffusion model to be trained, where each data pair may include a sample image and sample description text for describing image content information of the sample image.
In step S520, the sample description text is encoded to obtain sample text features, and sample image features corresponding to the sample text features are obtained through the trained image-text conversion model.
Specifically, for each data pair, the electronic device may encode the sample description text through the text encoder to obtain a sample text feature corresponding to the sample description text, input the sample text feature into the trained image-text conversion model, and perform image conversion processing on the sample text feature through the image-text conversion model to obtain a sample image feature corresponding to the sample text feature. In this way, the electronic device can obtain the sample text feature and the sample image feature corresponding to each data pair contained in the training data.
In step S530, the standard noise data, the sample text feature and the sample image feature are input into an image diffusion model, in the image diffusion model, the standard noise data is subjected to diffusion processing using the sample text feature and the sample image feature as conditions, so as to obtain prediction noise data including image information of the sample image, and the prediction noise data is input into an image decoder, so as to obtain a prediction image.
Wherein the standard noise data may be Gaussian noise.
Specifically, the electronic device may input the standard noise data, the sample text feature and the sample image feature into the image diffusion model, and perform image diffusion processing and image inverse diffusion processing on the sample text feature and the sample image feature through the image diffusion model to obtain prediction noise data including image content of the sample image. In this way, the electronic device can input prediction noise data including the image content of the sample image to the image decoder, resulting in a denoised image, i.e., a predicted image.
In one example, the electronic device may perform image diffusion processing on the sample text feature and the sample image feature in the image diffusion model based on the standard noise data, that is, add the standard noise data to the sample text feature and the sample image feature to obtain a first noise feature, and perform inverse diffusion processing corresponding to image diffusion on the first noise feature to obtain a first denoising feature corresponding to the sample text feature and the sample image feature, that is, the first denoising feature may be prediction noise data including image information of the sample image.
In step S540, a loss function is calculated according to the predicted image and the sample image, and the image diffusion model is updated according to the loss value corresponding to the loss function until a preset training completion condition is met, so as to obtain a trained image diffusion model.
The preset training completion condition may be that the number of training iterations of the model reaches a target number of iterations, or that the calculated loss value corresponding to the loss function has converged. The present disclosure does not limit the specific content of the preset training completion condition, which may be determined based on the actual application scenario. The loss value corresponding to the loss function may be an MSE loss value, and the loss function may be an MSE loss function.
Specifically, the electronic device may calculate the loss value corresponding to the loss function based on the predicted image and the real image (the sample image) corresponding to the sample text feature and the sample image feature, update the model parameters of the image diffusion model to obtain an updated image diffusion model when determining that the preset training completion condition is not currently met, and re-execute the step of obtaining the second training data until determining that the preset training completion condition is currently met, thereby obtaining the trained image diffusion model.
Based on this scheme, the image diffusion model trained on real images can fuse two images at the level of high-level semantics and generate a brand-new image that is realistic and aesthetically pleasing, while ensuring model training efficiency and the accuracy of the features learned by the trained model.
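A hedged sketch of one training step of the image diffusion model described in steps S510 to S540 is given below; all module interfaces, tensor shapes and the timestep range are assumptions for illustration, while the conditioning on the sample text/image features and the MSE loss against the real sample image follow the description above.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def diffusion_training_step(unet, image_decoder, prior_i, optimizer,
                            sample_text_feature: torch.Tensor,
                            sample_image: torch.Tensor):
    # Obtain the sample image feature from the trained image-text conversion model (Prior-I).
    sample_image_feature = prior_i(sample_text_feature)
    cond = torch.cat([sample_text_feature, sample_image_feature], dim=-1)

    noise = torch.randn(1, 4, 64, 64)                  # standard (Gaussian) noise data, assumed shape
    t = torch.randint(0, 1000, (1,))                   # a random diffusion timestep
    prediction_noise = unet(noise, t, cond)            # prediction noise data containing image information
    predicted_image = image_decoder(prediction_noise)  # decode to a predicted image

    loss = mse(predicted_image, sample_image)          # compare with the real sample image
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```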
In an exemplary embodiment, in step S310, the stitching processing is performed on the image features to be fused corresponding to each image to be fused, so as to obtain the image fusion feature, which may be specifically implemented by the following steps:
And carrying out weighted calculation on the image features to be fused, which correspond to the images to be fused, based on the image fusion weights, which correspond to the images to be fused, respectively, so as to obtain the image fusion features.
The image fusion weight corresponding to each image to be fused can be determined based on the importance degree of the image content in each image to be fused, or based on the duty ratio degree of each image to be fused in the fused image.
Specifically, after obtaining the features of the images to be fused corresponding to the images to be fused, the electronic device may obtain the importance degree of the image content of each image to be fused when the images to be fused are fused, and determine the image fusion weight of each image to be fused based on the importance degree of the image content of each image to be fused. Based on the above, the electronic device may perform weighted calculation based on the image fusion weights of the images to be fused and the features of the images to be fused, so as to obtain the image fusion features.
Based on the scheme, the image fusion characteristics are obtained through the image fusion weights of the images to be fused, and the contribution degree of the images to be fused in generating the fusion images can be flexibly adjusted, so that the obtained fusion images more meet the requirements of users.
In an exemplary embodiment, in step S310, the text features corresponding to the features of each image to be fused are spliced, so that the text fusion features are obtained specifically by the following steps:
And weighting calculation is carried out on the text features respectively corresponding to the images to be fused based on the text fusion weights respectively corresponding to the images to be fused, so as to obtain the text fusion features.
The text fusion weight corresponding to each image to be fused may be determined based on the importance degree of each image to be fused, for example, may be determined based on the duty ratio degree of each image to be fused in the fused image.
Specifically, after obtaining text features corresponding to each image to be fused, the electronic device may obtain importance degrees of text content of each image to be fused when the images are fused, and determine text fusion weights of each image to be fused based on the importance degrees of the text content of each image to be fused. Based on the text fusion weights and text features of the images to be fused, the electronic equipment can perform weighted calculation to obtain the text fusion features.
Based on the scheme, the text fusion characteristics are obtained through the text fusion weights of the images to be fused, and the contribution degree of the images to be fused in generating the fused image can be flexibly adjusted, so that the obtained fused image meets the requirements of users.
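The weighted stitching of both the image features and the text features can be sketched as follows; the concrete weight values are illustrative assumptions and would in practice be chosen according to the importance of each image to be fused.

```python
import torch

def weighted_fusion(image_features, text_features, image_weights, text_weights):
    """Weight each image's image feature and text feature, then sum across images."""
    image_fusion_feature = sum(w * f for w, f in zip(image_weights, image_features))
    text_fusion_feature = sum(w * f for w, f in zip(text_weights, text_features))
    return image_fusion_feature, text_fusion_feature

# Example: give the first image slightly more influence than the second.
image_features = [torch.randn(1, 768), torch.randn(1, 768)]
text_features = [torch.randn(1, 768), torch.randn(1, 768)]
image_fused, text_fused = weighted_fusion(image_features, text_features,
                                          image_weights=[0.6, 0.4],
                                          text_weights=[0.6, 0.4])
```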
Hereinafter, the implementation of the above image fusion method will be described in detail with reference to a specific embodiment.
Image fusion fuses two pictures to generate a brand-new picture that has the characteristics of the original two pictures. When massive image-text content is available, a creator can expand production creativity and improve production efficiency through image fusion. A diffusion model adds Gaussian noise to the available training data (e.g., an image) in a forward diffusion process and then reverses the process (called denoising or reverse diffusion) to recover the data. By training the image-text conversion model and the image diffusion model with massive image-text data, the image fusion method provided by the disclosure can learn to remove noise step by step from pure noise and generate fused images that meet the requirements.
The training process in the image fusion method based on the image diffusion model provided by the disclosure includes: a Prior image-text translation module, which trains the image-text conversion model so that pictures and texts can be mutually translated in the latent space; and a text-to-image generation module, which trains the image diffusion model using massive text-image pair data, where text features are translated into image features through the Prior module and denoising is performed in the diffusion model to generate the corresponding image. During prediction, feature extraction is performed on the two input images; the feature vector encodes visual features such as key points and textures in the image. The vectors of the two pictures are then fused to obtain a brand-new fusion feature that carries the high-level semantic features of the two input images. Finally, the new feature vector is taken as input and sent to the diffusion model for denoising. The diffusion model can effectively eliminate random noise in the image while retaining the important information of the image. With this diffusion-model-based image fusion method, high-quality and creative fused images can be generated, providing more possibilities for various downstream applications.
In one example, training data is obtained to construct training samples; the training data may be image-text pairs, each comprising an image and a text describing the content of the image. The data is obtained from a public dataset (LAION-5B) and an internally collected dataset.
As shown in Fig. 6, which is a schematic diagram of the training process of the image-text conversion model, i.e., the process of training the Prior network (image-text conversion model): the Prior network can obtain a sample text, for example, "a Caucasian dog runs in a river, the sun sets, an arch bridge, flowing water"; the sample text is encoded by a text encoder to obtain sample text features, and the sample text features are input into Prior-I to obtain predicted image features. The Prior network may acquire a sample image corresponding to the sample text, for example, image A; the sample image is encoded through an image encoder to obtain sample image features, and the sample image features are input to Prior-T to obtain predicted text features. The parameters of Prior-T are updated based on the sample text features and the predicted text features, and the parameters of Prior-I are updated based on the sample image features and the predicted image features, so as to obtain a trained Prior network.
Specifically, the Prior network is responsible for "translating" text features and image features to each other. During training, a picture-text pair is given, corresponding features are respectively extracted, and the picture-text pair are mutually translated through a Prior-Net, wherein the Prior-I is responsible for translating the text features into image features, and the Prior-T is responsible for translating the image features into the text features. The relevant network weights are updated by calculating MSE (mean square error) between the translated and true features as a loss function.
As shown in Fig. 7, which is a schematic diagram of the training process of the image diffusion model (the text-to-image network): a sample description text is obtained, for example, "a Caucasian dog runs in a river, the sun sets, an arch bridge, flowing water", and the sample image (real image) corresponding to the sample description text may be image A. The electronic device may input the sample description text to a text encoder to obtain sample text features, input the sample text features to the Prior-I in the Prior network to obtain predicted image features, input the predicted image features, the sample text features and standard noise data (Gaussian noise) into the text-to-image model to be trained (image diffusion model, UNet) to obtain predicted noise data, and input the predicted noise data to an image decoder (img decoder) to obtain a predicted image. A loss is calculated based on the predicted image and the real image, and the model parameters of the text-to-image model are updated until a preset training completion condition is met, so as to obtain a trained text-to-image model.
Specifically, the text-to-image network is responsible for generating images that conform to the textual description. Given a text, text features are obtained through a text encoder and input into the trained Prior-Net to obtain image features, and the text features and the image features are input into the UNet at the same time. The UNet takes noise as input, takes the text-image features as the condition, and performs noise elimination; the generated image is finally obtained through an image decoder, the MSE loss against the real image is calculated, and the relevant network parameters are updated, so as to obtain the trained image diffusion model.
As shown in fig. 8, the image to be fused may include a first image and a second image, and the first image and the second image are encoded by an image encoder to obtain a first image feature and a second image feature; the electronic equipment can input the first image feature and the second image feature into a trained image-text conversion model to obtain a first text feature corresponding to the first image feature and a second text feature corresponding to the second image feature; the electronic equipment can superimpose the first image feature and the second image feature to obtain an image fusion feature, and superimpose the first text feature and the second text feature to obtain a text fusion feature; based on this, the electronic device may input the image fusion feature, the text fusion feature, and the standard noise data to a text-to-image model (image diffusion model, unet) to obtain prediction noise data, and input the prediction noise data to an image decoder (img decoder) to obtain a fused image.
Specifically, image features are extracted from the two images to be fused respectively, and text features are extracted through Prior-Net. The two pairs of text and image features are weighted and summed to obtain fused text features and fused image features, which are input into the UNet as conditions for diffusion denoising; finally, a fused image is obtained through the image decoder, and this image has the features of the two input images at the same time.
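Putting the pieces together, the prediction (inference) flow of Fig. 8 could be sketched end to end as below; the module interfaces, fusion weights and the simplified sampling loop are illustrative assumptions rather than the exact implementation of the disclosure.

```python
import torch

@torch.no_grad()
def fuse_images(image_a, image_b, image_encoder, prior_t, unet, image_decoder,
                w_a: float = 0.5, w_b: float = 0.5, steps: int = 50):
    # 1. Encode both images and translate their features into text features (Prior-T).
    f_a, f_b = image_encoder(image_a), image_encoder(image_b)
    t_a, t_b = prior_t(f_a), prior_t(f_b)

    # 2. Weighted fusion of the image features and of the text features.
    image_fusion = w_a * f_a + w_b * f_b
    text_fusion = w_a * t_a + w_b * t_b
    cond = torch.cat([image_fusion, text_fusion], dim=-1)

    # 3. Diffusion denoising of standard noise conditioned on the fused features
    #    (schematic loop; a real scheduler such as DDPM/DDIM is assumed).
    z = torch.randn(1, 4, 64, 64)
    for t in reversed(range(steps)):
        z = z - unet(z, torch.full((1,), t, dtype=torch.long), cond) / steps

    # 4. Decode the resulting noise information into the fused image.
    return image_decoder(z)
```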
The image fusion method provided by the disclosure, based on a diffusion model, can fuse two images at the level of high-level semantics and generate a brand-new image that is realistic and aesthetically pleasing. After the high-level semantic features of the two images are extracted, they are fused based on image weights and the resulting new features are input into the diffusion model, so that the generated image carries the high-level semantic fusion information of the two images rather than a low-level pixel-wise fusion. Because real images filtered by artistic-aesthetic screening are used in the model training process, the images generated by the model are realistic and artistically aesthetic. The method does not rely on naturally occurring or Internet-collected image fusion data, so the related image fusion model can be trained directly and effectively while ensuring that the generated images are realistic and aesthetically pleasing.
It should be understood that, although the steps in the flowcharts of Figs. 1 to 8 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to that order of execution and may be executed in other orders. Moreover, at least a portion of the steps of Figs. 1 to 8 may include multiple steps or stages that are not necessarily performed at the same time but may be performed at different times, and these steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the steps or stages in other steps.
It should be understood that the same or similar parts of the method embodiments described above in this specification may be referred to each other; each embodiment focuses on its differences from the other embodiments, and for the same parts reference may be made to the descriptions of the other embodiments.
Fig. 9 is a device block diagram of an image fusion device 900 according to an exemplary embodiment. Referring to fig. 9, the apparatus includes a first acquisition unit 902, a determination unit 904, a noise information generation unit 906, and an image generation unit 908.
The first obtaining unit 902 is configured to obtain at least two images to be fused, and encode each image to be fused by an image encoder to obtain a plurality of image features to be fused;
A determining unit 904, configured to obtain text features corresponding to the features of each image to be fused according to the trained image-text conversion model and the features of the multiple images to be fused;
the noise information generating unit 906 is configured to perform inputting the image features to be fused, the text features corresponding to the image features to be fused and the standard noise data into the trained image diffusion model, and perform diffusion processing on the standard noise data to obtain noise information containing the image information of the images to be fused;
The image generation unit 908 is configured to perform image decoding processing on the noise information based on the image decoder, generating a fused image.
In an exemplary embodiment, a noise information generation unit includes:
the splicing subunit is configured to splice the image features to be fused, which correspond to the images to be fused respectively, so as to obtain the image fusion features; splicing the text features corresponding to the features of the images to be fused to obtain text fusion features;
The fusion subunit is configured to fuse the image fusion characteristics and the text fusion characteristics to obtain fusion characteristics;
and the noise information generation subunit is configured to perform diffusion processing on the standard noise data through the trained image diffusion model and the fusion characteristics to obtain noise information containing image information of each image to be fused.
In an exemplary embodiment, the noise information generating subunit is specifically configured to:
And carrying out noise reduction processing on the standard noise data by taking the fusion characteristics as conditions through an image diffusion model to obtain noise information containing image information of each image to be fused.
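For illustration, a hedged sketch of how the fusion features could serve as the condition during the noise reduction processing is given below; the UNet call signature, the alphas_cumprod schedule, and the DDIM-style deterministic update are assumptions made for the sketch and are not the claimed implementation.

```python
import torch

def conditional_denoise(unet, noise, image_fusion_feat, text_fusion_feat,
                        alphas_cumprod, num_steps=50):
    """Illustrative sketch: the fusion features condition the diffusion model while
    it turns standard noise into noise information containing the image information.
    alphas_cumprod is assumed to be a (num_steps,) tensor of cumulative alphas."""
    # Fuse the spliced image features and text features into one condition vector.
    cond = torch.cat([image_fusion_feat, text_fusion_feat], dim=-1)

    x = noise
    for t in reversed(range(num_steps)):
        a_t = alphas_cumprod[t]
        timestep = torch.full((x.shape[0],), t, dtype=torch.long)
        eps = unet(x, timestep, cond)                  # noise predicted under the condition

        # DDIM-style deterministic update (eta = 0), shown only as one possible sampler.
        x0_pred = (x - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x = torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps

    return x   # noise information containing the image information of each image to be fused
```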
In an exemplary embodiment, the apparatus further comprises:
a second acquisition unit configured to acquire first training data including a sample image and a sample text;
An extraction unit configured to extract sample image features of each sample image and sample text features of each sample text;
The conversion unit is configured to perform text conversion on the sample image characteristics through a text conversion sub-model in the image-text conversion model to be trained to obtain predicted text characteristics, and perform image conversion on the sample text characteristics through an image conversion sub-model in the image-text conversion model to be trained to obtain predicted image characteristics;
The first loss value calculating unit is configured to calculate a first loss value between the sample text feature and the predicted text feature and update the text conversion sub-model through the first loss value, and to calculate a second loss value between the sample image feature and the predicted image feature and update the image conversion sub-model through the second loss value, until a preset training completion condition is met, so as to obtain a trained image-text conversion model.
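A minimal sketch of this first training stage is given below, assuming the image-text conversion model is a small two-branch network and that the first and second loss values are mean-squared errors; both the architecture and the loss choice are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    """Hypothetical image-text conversion model with one sub-model per direction."""
    def __init__(self, dim=768):
        super().__init__()
        self.text_converter = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.image_converter = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def image_to_text(self, image_feat):
        return self.text_converter(image_feat)   # predicted text features

    def text_to_image(self, text_feat):
        return self.image_converter(text_feat)   # predicted image features

def train_step(model, optimizer, sample_image_feat, sample_text_feat):
    """One update of the first training stage: each sub-model is supervised by the
    feature it should reproduce (MSE is an assumption; the disclosure only speaks
    of loss values)."""
    pred_text = model.image_to_text(sample_image_feat)
    pred_image = model.text_to_image(sample_text_feat)

    loss_text = nn.functional.mse_loss(pred_text, sample_text_feat)     # first loss value
    loss_image = nn.functional.mse_loss(pred_image, sample_image_feat)  # second loss value

    optimizer.zero_grad()
    (loss_text + loss_image).backward()
    optimizer.step()
    return loss_text.item(), loss_image.item()

# Example wiring (illustrative only):
model = PriorNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```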
In an exemplary embodiment, the apparatus further comprises:
the third acquisition unit is configured to acquire second training data, wherein the second training data comprises a sample image and a sample description text corresponding to the sample image;
the coding unit is configured to code the sample description text to obtain sample text characteristics, and obtain sample image characteristics corresponding to the sample text characteristics through a trained image-text conversion model;
the training unit is configured to input standard noise data, sample text features and sample image features into an image diffusion model, process the standard noise data in the image diffusion model by taking the sample text features and the sample image features as conditions to obtain prediction noise data containing image information of the sample image, and input the prediction noise data into an image decoder to obtain a prediction image;
The second loss value calculation unit is configured to calculate a loss function according to the predicted image and the sample image, and update the image diffusion model through the loss value corresponding to the loss function until a preset training completion condition is met, so as to obtain a trained image diffusion model.
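A hedged sketch of this second training stage follows. For brevity it uses the standard noise-prediction objective common to diffusion training rather than decoding the prediction noise data and comparing the resulting image with the sample image as described above; the noise schedule, tensor shapes, and conditioning interface are likewise assumptions.

```python
import torch
import torch.nn as nn

def diffusion_train_step(unet, optimizer, latent, text_feat, image_feat, alphas_cumprod):
    """Simplified stand-in for one update of the second training stage: the sample
    text features and sample image features condition the diffusion model while it
    learns to predict the injected standard noise."""
    b = latent.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,))
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)

    noise = torch.randn_like(latent)                          # standard noise data
    noisy_latent = torch.sqrt(a_t) * latent + torch.sqrt(1 - a_t) * noise

    cond = torch.cat([text_feat, image_feat], dim=-1)         # sample text + image features as condition
    pred_noise = unet(noisy_latent, t, cond)

    loss = nn.functional.mse_loss(pred_noise, noise)          # noise-prediction loss (assumption)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```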
In an exemplary embodiment, the stitching subunit is specifically configured to:
And carrying out weighted calculation on the image features to be fused, which correspond to the images to be fused, based on the image fusion weights, which correspond to the images to be fused, respectively, so as to obtain the image fusion features.
In an exemplary embodiment, the stitching subunit is further specifically configured to:
And weighting calculation is carried out on the text features respectively corresponding to the images to be fused based on the text fusion weights respectively corresponding to the images to be fused, so as to obtain the text fusion features.
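A short illustrative helper for this weighted splicing is sketched below; it applies equally to the image features and the text features, and the feature dimension and example weights are arbitrary assumptions.

```python
import torch

def weighted_fuse(features, weights):
    """Weighted splicing: each image to be fused contributes its feature scaled by
    its fusion weight (weights assumed non-negative and summing to 1)."""
    assert len(features) == len(weights)
    fused = torch.zeros_like(features[0])
    for feat, w in zip(features, weights):
        fused = fused + w * feat
    return fused

# Example: give the first image more influence than the second.
feat_a, feat_b = torch.randn(1, 768), torch.randn(1, 768)
image_fusion_feature = weighted_fuse([feat_a, feat_b], [0.7, 0.3])
```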
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be described in detail here.
Fig. 10 is a block diagram of an electronic device 1000 for an image fusion method, according to an example embodiment. For example, electronic device 1000 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 10, an electronic device 1000 may include one or more of the following components: a processing component 1002, a memory 1004, a power component 1006, a multimedia component 1008, an audio component 1010, an input/output (I/O) interface 1012, a sensor component 1014, and a communication component 1016.
The processing component 1002 generally controls overall operation of the electronic device 1000, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1002 can include one or more processors 1020 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1002 can include one or more modules that facilitate interaction between the processing component 1002 and other components. For example, the processing component 1002 can include a multimedia module to facilitate interaction between the multimedia component 1008 and the processing component 1002.
The memory 1004 is configured to store various types of data to support operations at the electronic device 1000. Examples of such data include instructions for any application or method operating on the electronic device 1000, contact data, phonebook data, messages, pictures, video, and so forth. The memory 1004 may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, optical disk, or graphene memory.
The power supply component 1006 provides power to the various components of the electronic device 1000. The power components 1006 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 1000.
The multimedia component 1008 includes a screen that provides an output interface between the electronic device 1000 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 1008 includes a front-facing camera and/or a rear-facing camera. When the electronic device 1000 is in an operation mode, such as a shooting mode or a video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each of the front and rear cameras may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 1010 is configured to output and/or input audio signals. For example, the audio component 1010 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1000 is in an operational mode, such as a call mode, a recording mode, and a speech recognition mode. The received audio signals may be further stored in memory 1004 or transmitted via communication component 1016. In some embodiments, the audio component 1010 further comprises a speaker for outputting audio signals.
The I/O interface 1012 provides an interface between the processing assembly 1002 and peripheral interface modules, which may be a keyboard, click wheel, buttons, and the like. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 1014 includes one or more sensors for providing status assessments of various aspects of the electronic device 1000. For example, the sensor assembly 1014 may detect the on/off state of the electronic device 1000 and the relative positioning of components, such as the display and keypad of the electronic device 1000; the sensor assembly 1014 may also detect a change in position of the electronic device 1000 or of a component of the electronic device 1000, the presence or absence of user contact with the electronic device 1000, the orientation or acceleration/deceleration of the electronic device 1000, and a change in temperature of the electronic device 1000. The sensor assembly 1014 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 1014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1016 is configured to facilitate communication between the electronic device 1000 and other devices, either wired or wireless. The electronic device 1000 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 1016 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1016 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1000 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as the memory 1004, including instructions executable by the processor 1020 of the electronic device 1000 to perform the above-described method. For example, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising instructions executable by the processor 1020 of the electronic device 1000 to perform the above-described method.
It should be noted that the descriptions of the foregoing apparatus, the electronic device, the computer readable storage medium, the computer program product, and the like according to the method embodiments may further include other implementations, and the specific implementation may refer to the descriptions of the related method embodiments and are not described herein in detail.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An image fusion method, comprising:
acquiring at least two images to be fused, and encoding each image to be fused through an image encoder to obtain a plurality of image characteristics to be fused;
obtaining text features corresponding to the image features to be fused according to the trained image-text conversion model and the plurality of image features to be fused;
inputting the image features to be fused, text features corresponding to the image features to be fused and standard noise data into a trained image diffusion model, and performing diffusion processing on the standard noise data to obtain noise information containing image information of the images to be fused;
and performing image decoding processing on the noise information based on an image decoder to generate a fusion image.
2. The image fusion method according to claim 1, wherein the inputting the image features to be fused, the text features corresponding to the image features to be fused, and the standard noise data into the trained image diffusion model, performing diffusion processing on the standard noise data to obtain noise information including the image information of the images to be fused, includes:
splicing the image features to be fused corresponding to the images to be fused respectively to obtain image fusion features; splicing the text features corresponding to the features of the images to be fused to obtain text fusion features;
carrying out fusion processing on the image fusion characteristics and the text fusion characteristics to obtain fusion characteristics;
and performing diffusion processing on the standard noise data through the trained image diffusion model and the fusion characteristics to obtain noise information containing image information of each image to be fused.
3. The image fusion method according to claim 2, wherein the performing diffusion processing on standard noise data by using the trained image diffusion model and the fusion feature to obtain noise information including image information of each image to be fused includes:
and carrying out noise reduction processing on the standard noise data by taking the fusion characteristics as conditions through the image diffusion model to obtain noise information containing image information of each image to be fused.
4. The image fusion method of claim 1, further comprising:
Acquiring first training data, wherein the first training data comprises a sample image and a sample text;
extracting sample image features of each sample image and sample text features of each sample text;
Text conversion is carried out on the sample image characteristics through a text conversion sub-model in the image-text conversion model to be trained to obtain predicted text characteristics, and image conversion is carried out on the sample text characteristics through an image conversion sub-model in the image-text conversion model to be trained to obtain predicted image characteristics;
calculating a first loss value between the sample text feature and the predicted text feature, updating the text conversion sub-model through the first loss value, and calculating a second loss value between the sample image feature and the predicted image feature, and updating the image conversion sub-model through the second loss value until a preset training completion condition is met, so as to obtain a trained image-text conversion model.
5. The image fusion method of claim 1, further comprising:
acquiring second training data, wherein the second training data comprises a sample image and a sample description text corresponding to the sample image;
coding the sample description text to obtain sample text characteristics, and obtaining sample image characteristics corresponding to the sample text characteristics through a trained image-text conversion model;
Inputting standard noise data, the sample text features and the sample image features into an image diffusion model, processing the standard noise data in the image diffusion model by taking the sample text features and the sample image features as conditions to obtain prediction noise data containing image information of a sample image, and inputting the prediction noise data into an image decoder to obtain a prediction image;
And calculating a loss function according to the predicted image and the sample image, and updating the image diffusion model through a loss value corresponding to the loss function until a preset training completion condition is met, so as to obtain a trained image diffusion model.
6. The method for image fusion according to claim 2, wherein the performing stitching on the image features to be fused corresponding to the images to be fused respectively to obtain image fusion features includes:
and carrying out weighted calculation on the image features to be fused corresponding to the images to be fused respectively based on the image fusion weights corresponding to the images to be fused respectively, so as to obtain image fusion features.
7. The image fusion method according to claim 2, wherein the performing a stitching process on text features corresponding to each image feature to be fused to obtain text fusion features includes:
And carrying out weighted calculation on text features corresponding to the images to be fused based on the text fusion weights corresponding to the images to be fused respectively, so as to obtain text fusion features.
8. An image fusion apparatus, comprising:
the first acquisition unit is configured to acquire at least two images to be fused, and encode each image to be fused through the image encoder to obtain a plurality of image characteristics to be fused;
The determining unit is configured to obtain the text features corresponding to the image features to be fused according to the trained image-text conversion model and the image features to be fused;
the noise information generation unit is configured to perform inputting the image features to be fused, text features corresponding to the image features to be fused and standard noise data into a trained image diffusion model, and performing diffusion processing on the standard noise data to obtain noise information containing image information of the images to be fused;
And an image generation unit configured to perform image decoding processing on the noise information based on an image decoder, and generate a fusion image.
9. An electronic device, comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the image fusion method of any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image fusion method of any one of claims 1 to 7.
CN202410116927.5A 2024-01-26 2024-01-26 Image fusion method and device, electronic equipment and storage medium Pending CN118014858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410116927.5A CN118014858A (en) 2024-01-26 2024-01-26 Image fusion method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410116927.5A CN118014858A (en) 2024-01-26 2024-01-26 Image fusion method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118014858A true CN118014858A (en) 2024-05-10

Family

ID=90955336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410116927.5A Pending CN118014858A (en) 2024-01-26 2024-01-26 Image fusion method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118014858A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination