CN116503504A - Image synthesis method, device, electronic equipment and computer readable storage medium - Google Patents

Image synthesis method, device, electronic equipment and computer readable storage medium

Info

Publication number
CN116503504A
Authority
CN
China
Prior art keywords
image
features
data
original
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310558639.0A
Other languages
Chinese (zh)
Inventor
陈瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202310558639.0A
Publication of CN116503504A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 11/00 2D [Two Dimensional] image generation
                • G06T 5/00 Image enhancement or restoration
                    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20081 Training; Learning
                        • G06T 2207/20084 Artificial neural networks [ANN]
                        • G06T 2207/20212 Image combination
                            • G06T 2207/20221 Image fusion; Image merging
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                                • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
                            • G06N 3/0475 Generative networks
                        • G06N 3/08 Learning methods
                            • G06N 3/088 Non-supervised learning, e.g. competitive learning
                            • G06N 3/094 Adversarial learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The present disclosure relates to an image synthesis method, an image synthesis apparatus, an electronic device, and a computer-readable storage medium. The image synthesis method includes: acquiring at least two original features in one-to-one correspondence with at least two pieces of original data, where the at least two original features include at least one original feature of a non-image domain; inputting the at least one original feature of the non-image domain into a pre-trained feature conversion model to obtain at least one image domain feature corresponding to the at least one original feature of the non-image domain, so as to obtain at least two image domain features corresponding to the at least two original features; fusing the at least two image domain features to obtain a fused image feature; and generating a composite image according to the fused image feature. The method and apparatus can realize image synthesis based on multi-modal original data, greatly expanding the application range and flexibility of image synthesis while ensuring the image quality of the composite image.

Description

Image synthesis method, device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image synthesis method, an image synthesis device, an electronic device, and a computer readable storage medium.
Background
With the development of digital technology and artificial intelligence (AI) technology, the field of content generation is gradually shifting from professionally generated content (for example, short-video bloggers with domain expertise) to artificial-intelligence-generated content (Artificial Intelligence Generated Content, AIGC). The corresponding international term for AIGC is AI-generated media or synthetic media, defined as a general term for the production, manipulation, and modification of data or media by artificial intelligence algorithms.
In the field of image processing in particular, AIGC can currently synthesize two images into one new image, for example, synthesizing an image of vehicle A and an image of vehicle B into an image of a new vehicle having the shape of vehicle A and the texture of vehicle B. This technology can fuse the features of two original images to generate a brand-new image, greatly enriching people's digital lives. However, it depends on the original images: if a user cannot provide original images that match their expectations, the corresponding image synthesis cannot be realized, which greatly limits the application range and flexibility of image synthesis.
Disclosure of Invention
The present disclosure provides an image synthesis method, an image synthesis apparatus, an electronic device, and a computer-readable storage medium, so as to at least solve the problem in the related art of how to expand the application range of image synthesis; the technical solutions of the disclosure are not, however, required to solve any of the problems described above.
According to a first aspect of the present disclosure, there is provided an image synthesis method including: acquiring at least two original features in one-to-one correspondence with at least two pieces of original data, wherein the at least two original features include at least one original feature of a non-image domain; inputting the at least one original feature of the non-image domain into a pre-trained feature conversion model to obtain at least one image domain feature corresponding to the at least one original feature of the non-image domain, so as to obtain at least two image domain features corresponding to the at least two original features; performing fusion processing on the at least two image domain features to obtain a fused image feature; and generating a composite image according to the fused image feature.
Optionally, the acquiring at least two original features corresponding to at least two original data one-to-one includes: receiving the at least two original data, wherein the original data is data for representing image content; inputting the at least two original data into a pre-trained feature extraction model to obtain the at least two original features corresponding to the at least two original data one by one.
Optionally, before the receiving the at least two raw data, the image synthesis method further includes: displaying a user interface, wherein the user interface is displayed with a synthesis button and at least two input controls; wherein said receiving said at least two raw data comprises: receiving input of the at least two original data in response to an input operation for the at least two input controls; the inputting the at least two original data into a pre-trained feature extraction model to obtain the at least two original features corresponding to the at least two original data one to one, including: and responding to the triggering operation of the synthesis button, inputting the at least two original data into the pre-trained feature extraction model to obtain the at least two original features corresponding to the at least two original data one by one.
Optionally, a ratio adjustment control is further displayed on the user interface, where the ratio adjustment control is configured to receive the respective fusion ratios of the at least two pieces of original data; and the fusing of the at least two image domain features to obtain the fused image feature includes: performing fusion processing on the at least two image domain features corresponding to the at least two pieces of original data according to their respective fusion ratios, to obtain the fused image feature.
Optionally, a display area is further displayed on the user interface, and after the generating of the composite image according to the fused image feature, the image synthesis method further includes: displaying the composite image in the display area; receiving an adjusted fusion ratio in response to an adjustment operation on the ratio adjustment control; and re-executing, according to the adjusted fusion ratio, the fusion processing of the at least two image domain features to obtain the fused image feature and the displaying of the composite image in the display area, to obtain an updated composite image.
Optionally, the image synthesis method further includes a step of training the pre-trained feature extraction model, the training step including: acquiring multiple groups of first sample image data and first sample non-image data, where the image content represented by the first sample image data is consistent with the content represented by the first sample non-image data of the same group; inputting the multiple groups of first sample image data and first sample non-image data into a feature extraction model to be trained to obtain a first sample image feature corresponding to each piece of first sample image data and a first sample non-image feature corresponding to each piece of first sample non-image data; determining contrastive loss values between the sample features; and adjusting, based on a contrastive learning method and according to the contrastive loss values, parameters of the feature extraction model to be trained, to obtain the pre-trained feature extraction model.
Optionally, the image synthesis method further includes a training step of the pre-trained feature transformation model, and the training step of the pre-trained feature transformation model includes: acquiring second sample image data and second sample non-image data, wherein the image content represented by the second sample image data is consistent with the content represented by the second sample non-image data; inputting the second sample image data and the second sample non-image data into the pre-trained feature extraction model to obtain second sample image features corresponding to the second sample image data and second sample non-image features corresponding to the second sample non-image data; inputting the second sample non-image features into a feature conversion model to be trained, and obtaining features of an image domain corresponding to the second sample non-image features as converted image features; determining a loss value according to the converted image feature and the second sample image feature; and adjusting parameters of the feature conversion model to be trained according to the loss value to obtain the pre-trained feature conversion model.
Optionally, the modalities of the at least two raw data include at least one of: image, text, voice.
Optionally, the feature extraction model is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model.
Optionally, the feature conversion model is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model.
According to a second aspect of the present disclosure, there is provided an image synthesizing apparatus including: an acquisition unit configured to perform acquisition of at least two original features in one-to-one correspondence with at least two original data, wherein the at least two original features include original features of at least one non-image domain; a conversion unit configured to perform inputting original features of the at least one non-image domain into a pre-trained feature conversion model to obtain at least one image domain feature corresponding to the original features of the at least one non-image domain, so as to obtain at least two image domain features corresponding to the at least two original features; the fusion unit is configured to perform fusion processing on the at least two image domain features to obtain fusion image features; and a generating unit configured to generate a synthetic image according to the fused image features.
Optionally, the obtaining unit is further configured to perform receiving the at least two raw data, the raw data being data for characterizing image content; inputting the at least two original data into a pre-trained feature extraction model to obtain the at least two original features corresponding to the at least two original data one by one.
Optionally, the image synthesis device further comprises a display unit configured to execute a display user interface, on which a synthesis button and at least two input controls are displayed; the acquisition unit is further configured to perform receiving the at least two raw data input in response to an input operation for the at least two input controls; and responding to the triggering operation of the synthesis button, inputting the at least two original data into the pre-trained feature extraction model to obtain the at least two original features corresponding to the at least two original data one by one.
Optionally, a ratio adjustment control is further displayed on the user interface, where the ratio adjustment control is configured to receive the respective fusion ratios of the at least two pieces of original data, and the fusion unit is further configured to perform fusion processing on the at least two image domain features corresponding to the at least two pieces of original data according to their respective fusion ratios, to obtain the fused image feature.
Optionally, a display area is further displayed on the user interface, and the display unit is further configured to display the composite image in the display area; the acquisition unit is further configured to receive an adjusted fusion ratio in response to an adjustment operation on the ratio adjustment control; the fusion unit is further configured to perform fusion processing on the at least two image domain features again according to the adjusted fusion ratio to obtain the fused image feature and to rerun the generating unit; and the display unit is further configured to display the composite image regenerated by the generating unit in the display area, to obtain an updated composite image.
Optionally, the image synthesis apparatus further includes a first training unit configured to: acquire multiple groups of first sample image data and first sample non-image data, where the image content represented by the first sample image data is consistent with the content represented by the first sample non-image data of the same group; input the multiple groups of first sample image data and first sample non-image data into a feature extraction model to be trained to obtain a first sample image feature corresponding to each piece of first sample image data and a first sample non-image feature corresponding to each piece of first sample non-image data; determine contrastive loss values between the first sample image features and the first sample non-image features; and adjust, based on a contrastive learning method and according to the contrastive loss values, parameters of the feature extraction model to be trained, to obtain the pre-trained feature extraction model.
Optionally, the image synthesis device further comprises a second training unit configured to perform acquiring second sample image data and second sample non-image data, wherein the image content represented by the second sample image data is consistent with the content represented by the second sample non-image data; inputting the second sample image data and the second sample non-image data into the pre-trained feature extraction model to obtain second sample image features corresponding to the second sample image data and second sample non-image features corresponding to the second sample non-image data; inputting the second sample non-image features into a feature conversion model to be trained, and obtaining features of an image domain corresponding to the second sample non-image features as converted image features; determining a loss value according to the converted image feature and the second sample image feature; and adjusting parameters of the feature conversion model to be trained according to the loss value to obtain the pre-trained feature conversion model.
Optionally, the modalities of the at least two raw data include at least one of: image, text, voice.
Optionally, the feature extraction model is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model.
Optionally, the feature conversion model is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model.
According to a third aspect of the present disclosure, there is provided an electronic device including: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform an image synthesis method according to the present disclosure.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform an image synthesis method according to the present disclosure.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement an image synthesis method according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the image synthesis method and the image synthesis apparatus of the present disclosure, image synthesis based on multi-modal original data can be achieved, so that a user can flexibly select original data of suitable modalities according to their own needs, which greatly expands the application range and flexibility of image synthesis. In addition, converting the original features of the non-image domain, which correspond to the original data of non-image modalities, into image domain features improves the feature expression capability of the non-image-modality original data; feature fusion is then completed using at least two image domain features to obtain the fused image feature, so that the image quality of the composite image can be guaranteed while the application range of image synthesis is expanded.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a schematic view showing an application scenario of an image synthesizing method;
fig. 2 is a flowchart illustrating an image synthesizing method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a feature extraction model according to an exemplary embodiment of the present disclosure;
FIG. 4 is a training schematic illustrating a feature extraction model according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flow diagram illustrating the generation of a composite image from two image domain features according to an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a user interface according to a specific embodiment of the present disclosure;
fig. 7a and 7b are schematic diagrams illustrating two original images according to one specific embodiment of the present disclosure;
FIG. 7c is a schematic diagram illustrating a composite image according to one particular embodiment of the present disclosure;
fig. 8a and 8b are schematic diagrams illustrating two original images according to another specific embodiment of the present disclosure;
FIG. 8c is a schematic diagram illustrating a composite image according to another specific embodiment of the present disclosure;
fig. 9 is a block diagram illustrating an image synthesizing apparatus according to an exemplary embodiment of the present disclosure;
fig. 10 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The embodiments described in the examples below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any of the items", and "all of the items". For example, "including at least one of A and B" covers three parallel cases: (1) including A; (2) including B; (3) including A and B. Similarly, "performing at least one of step one and step two" covers three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
The user information (including but not limited to user equipment information, user personal information, etc.) related to the present disclosure is information authorized by the user or sufficiently authorized by each party.
With the development of digital technology and artificial intelligence technology, the field of content generation is gradually shifting from professionally generated content (for example, short-video bloggers with domain expertise) to artificial-intelligence-generated content (AIGC). The corresponding international term for AIGC is AI-generated media or synthetic media, defined as a general term for the production, manipulation, and modification of data or media by artificial intelligence algorithms. The AIGC field is currently leading a profound revolution, reshaping and even subverting the production and consumption patterns of digital content. AIGC greatly expands the richness and imagination of digital content generation, greatly enriches people's digital lives, is an indispensable supporting force for the future move toward a new era of digital civilization, and is also an indispensable piece of Web 3.0 infrastructure.
In the field of image processing in particular, AIGC can currently synthesize two images into one new image, for example, synthesizing an image of vehicle A and an image of vehicle B into an image of a new vehicle having the shape of vehicle A and the texture of vehicle B. This technology can fuse the features of two original images to generate a brand-new image, greatly enriching people's digital lives.
Fig. 1 is a schematic view of an application scenario of an image synthesis method. Referring to fig. 1, the application scenario of image synthesis may include a server side 101 and a client 102 communicatively connected to the server side. The server side 101 may be deployed on a server and is used for receiving and managing data sent by users, performing the main computation in image synthesis, and feeding back the composite image to the client. The client 102 may be installed on a terminal device such as a smart phone, a tablet computer, or a personal computer (PC), and is configured to interact with the user, receive data sent by the user, and display the composite image. There are often multiple clients 102, so that different users can log in and the image synthesis service can be provided to them.
However, existing image synthesis often depends on original images; if a user cannot provide original images that match their expectations, the corresponding image synthesis cannot be realized, which greatly limits the application range and flexibility of image synthesis.
According to the image synthesis method and apparatus of the exemplary embodiments of the present disclosure, image synthesis based on multi-modal original data can be realized, so that a user can flexibly select original data of suitable modalities according to their own needs, which greatly expands the application range and flexibility of image synthesis. In addition, converting the original features of the non-image domain, which correspond to the original data of non-image modalities, into image domain features improves the feature expression capability of the non-image-modality original data; feature fusion is then completed using at least two image domain features to obtain the fused image feature, so that the image quality of the composite image can be guaranteed while the application range of image synthesis is expanded.
Next, an image synthesizing method and an image synthesizing apparatus according to exemplary embodiments of the present disclosure will be specifically described with reference to fig. 2 to 10.
Fig. 2 is a flowchart illustrating an image synthesizing method according to an exemplary embodiment of the present disclosure. It should be understood that the image composition method according to the exemplary embodiments of the present disclosure may be implemented in a terminal device such as a smart phone, a tablet computer, a Personal Computer (PC), i.e., as a flow of a client, or may be implemented in a device such as a server, i.e., as a flow of a server.
Referring to fig. 2, at step 201, at least two original features in one-to-one correspondence with at least two original data are acquired, wherein the at least two original features include original features of at least one non-image domain.
As an example, the number of pieces of original data is two, that is, two pieces of original data may be synthesized into one composite image. Optionally, the modalities of the at least two pieces of original data include at least one of: image, text, and speech. In other words, in step 201, original data of the same modality or original data of different modalities may be used to describe the image content. Receiving multi-modal original data allows users to select original data of suitable modalities for image synthesis as needed. For speech data, as an example, speech recognition may be performed in advance to convert the speech into text with the corresponding semantics.
It should be understood that the statement that the at least two original features include at least one original feature of a non-image domain means that at least one piece of the original data used is of a non-image modality. This is stated for convenience, in combination with the other steps, to make clear that the present disclosure is capable of image synthesis based on multi-modal original data; it does not mean that the present disclosure can only use original data of different modalities, and the disclosure is equally applicable to the case where all of the original data are image data.
Optionally, step 201 includes: receiving at least two original data, wherein the original data is data for representing image content; inputting the at least two original data into a pre-trained feature extraction model to obtain at least two original features corresponding to the at least two original data one by one. By introducing a pre-trained feature extraction model, feature extraction can be performed on at least two original data respectively, thereby providing reliable original features for image synthesis.
Specifically, an original feature is an encoding of the original data, i.e., a vector capable of characterizing the internal features of the original data. After the original data is input into the pre-trained feature extraction model, the original feature of the modality domain corresponding to that original data can be obtained. As an example, referring to fig. 3, image features may be obtained for image data, text features for text data, and speech features for speech data.
Alternatively, the feature extraction model may employ a generative model, also known as a probabilistic generative model, which learns the probability distribution of sample data and can randomly generate observable data after learning is completed. For example, for image generation, an image is represented as a random vector X, where each dimension of X represents a pixel value of the image. Assuming that images of natural scenes follow an unknown distribution pr(X), some sample data may be collected and the distribution estimated with a generative model; after learning is completed, a vector conforming to the distribution pr(X) can be generated for the image data input to the model. The feature extraction model is, for example, at least one of a Diffusion model, a variational autoencoder (Variational Autoencoder, VAE) model, and a generative adversarial network (Generative Adversarial Network, GAN) model. Using these generative models as feature extraction models facilitates the extraction of original features that can effectively characterize the original data. The idea of the diffusion model is derived from non-equilibrium thermodynamics and comprises two processes: a forward Markov chain that gradually adds Gaussian noise to an input image until it becomes a pure Gaussian noise image, and a reverse process that recovers the original image from the pure Gaussian noise image. Unlike conventional generative models, the latent variable (latent code) in the diffusion model is high-dimensional.
Optionally, the image synthesis method according to an exemplary embodiment of the present disclosure further includes a step of training the pre-trained feature extraction model. Referring to fig. 4, the training step includes: acquiring multiple groups of first sample image data and first sample non-image data, where the image content represented by the first sample image data is consistent with the content represented by the first sample non-image data of the same group; inputting the multiple groups of first sample image data and first sample non-image data into a feature extraction model to be trained to obtain a first sample image feature corresponding to each piece of first sample image data and a first sample non-image feature corresponding to each piece of first sample non-image data; determining contrastive loss values between the first sample image features and the first sample non-image features; and adjusting, based on a contrastive learning method and according to the contrastive loss values, parameters of the feature extraction model to be trained, to obtain the pre-trained feature extraction model. Specifically, the parameters are adjusted with the goal of decreasing the contrastive loss values between first sample image features and first sample non-image features of the same group, and increasing the contrastive loss values between first sample image features and first sample non-image features of different groups. It should be understood that content consistency is essentially semantic consistency; for example, an image of a puppy, the word "puppy", and a speech clip saying "puppy" have consistent content. Training the feature extraction model with a contrastive learning method allows data with the same semantics to be mapped to similar features, so that original data of different modalities are given similar feature representations, providing a data basis for multi-modal image synthesis.
As an example, the first sample non-image data includes at least one of first sample text data and first sample speech data. When the first sample non-image data includes both first sample text data and first sample speech data, contrastive loss values between the first sample image features and the first sample text features and between the first sample image features and the first sample speech features both need to be determined, and the model parameters are still adjusted with the goal of decreasing the contrastive loss values between features of the same group and increasing the contrastive loss values between features of different groups.
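For illustration only, the following is a minimal sketch of such contrastive training in PyTorch. The encoder architectures, the feature dimension, and the symmetric InfoNCE-style loss are assumptions made for the sketch; the disclosure itself only requires a feature extraction model trained with a contrastive learning method on paired image / non-image samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    # Stand-in for the image branch of the feature extraction model.
    def __init__(self, in_dim=3 * 64 * 64, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class TextEncoder(nn.Module):
    # Stand-in for the non-image (text) branch of the feature extraction model.
    def __init__(self, vocab_size=30000, dim=512):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, tokens):
        return F.normalize(self.emb(tokens), dim=-1)

def contrastive_loss(img_feat, txt_feat, temperature=0.07):
    # Symmetric InfoNCE: same-group (paired) features are pulled together,
    # different-group features are pushed apart.
    logits = img_feat @ txt_feat.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

img_enc, txt_enc = ImageEncoder(), TextEncoder()
optimizer = torch.optim.Adam(list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-4)

# Dummy batch of paired first sample image data and first sample text data.
images = torch.randn(8, 3, 64, 64)
texts = torch.randint(0, 30000, (8, 16))

loss = contrastive_loss(img_enc(images), txt_enc(texts))
loss.backward()
optimizer.step()
```

In practice each encoder would be a full backbone (for example an image backbone and a text or speech backbone), and a speech encoder can be trained against the image encoder in the same paired manner.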
In step 202, the original features of the at least one non-image domain are input into a pre-trained feature conversion model to obtain at least one image domain feature corresponding to the original features of the at least one non-image domain, so as to obtain at least two image domain features corresponding to the at least two original features. In other words, original features that are already in the image domain are kept unchanged, while original features of the non-image domain are converted into the image domain, thereby unifying the modality domains of the original features. As an example, referring to fig. 5, inputting text features or speech features into the pre-trained feature conversion model yields image features with the same semantics.
Although the pre-trained feature extraction model can ensure that, across modalities, features with the same semantics are closer to each other than features with different semantics, the applicant's analysis found that the difference between features with the same semantics but different modalities is still larger than the difference between features with different semantics but the same modality; that is, the difference between same-semantics, cross-modality features cannot be guaranteed to be minimal, which easily affects the expression accuracy of the original features. For example, the pre-trained feature extraction model can ensure that the difference between an elephant image feature and an elephant text feature is smaller than the difference between the elephant image feature and a puppy text feature, but the difference between the elephant image feature and the elephant text feature is larger than the difference between the elephant image feature and a puppy image feature, and also larger than the difference between the elephant text feature and a puppy text feature. By converting the non-image original features into the image domain, this cross-modality gap between features with the same semantics can be compensated, which improves the expression accuracy of the original features and the feature expression capability of non-image-modality original data, and thereby guarantees the image quality of the composite image obtained later.
It should be understood that step 202 converts only original features of the non-image domain; if the original data in step 201 are all data of the image modality, the corresponding original features are all already features of the image domain, and no conversion operation is needed.
Optionally, similar to the feature extraction model, the feature conversion model also employs a generative model, which is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model, and is not described in detail here. Employing these generative models as feature conversion models helps to guarantee reliable image domain conversion of the original features of the non-image domain.
Optionally, the image synthesis method according to an exemplary embodiment of the present disclosure further includes a step of training the pre-trained feature conversion model. The training step includes: acquiring second sample image data and second sample non-image data, where the second sample non-image data may include at least one of second sample text data and second sample speech data, and the image content represented by the second sample image data is consistent with the content represented by the second sample non-image data; inputting the second sample image data and the second sample non-image data into the pre-trained feature extraction model to obtain a second sample image feature corresponding to the second sample image data and a second sample non-image feature corresponding to the second sample non-image data; inputting the second sample non-image feature into a feature conversion model to be trained to obtain a feature of the image domain corresponding to the second sample non-image feature, as a converted image feature; determining a loss value according to the converted image feature and the second sample image feature; and adjusting parameters of the feature conversion model to be trained according to the loss value to obtain the pre-trained feature conversion model. With the pre-trained feature extraction model, feature extraction of the corresponding modality can be performed separately on second sample image data and second sample non-image data with the same semantics, and the obtained second sample image feature is then used as the sample label of the corresponding second sample non-image feature for supervised training of the feature conversion model. This achieves automatic labeling of the training samples, improves labeling efficiency, and guarantees the reliability of the training result of the feature conversion model.
In step 203, fusion processing is performed on at least two image domain features to obtain a fused image feature.
Alternatively, the fusion processing may be implemented by existing fusion techniques, such as, but not limited to, memory fusion networks (Memory Fusion Network, MFN), cross-attention and self-attention mechanisms, and image feature pyramids.
It should be understood that, depending on the specific processing manner, the dimension of the fused image feature may be the same as the dimension of a single image domain feature, or may be the sum of the dimensions of the at least two image domain features, i.e., equivalent in dimension to a feature obtained by concatenating the at least two image domain features; the present disclosure does not limit this.
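As a small illustrative sketch (not a detail of the disclosure), the two dimensionality cases described above can be realized, for example, by a ratio-weighted sum that keeps the single-feature dimension or by concatenation whose dimension is the sum of the input dimensions; attention-based fusion would replace these simple functions in practice.

```python
import torch

def fuse_weighted(features, ratios):
    # Weighted sum: the fused feature keeps the dimension of a single image domain feature.
    # `ratios` are the fusion ratios of the original data and are assumed to sum to 1.
    return sum(r * f for r, f in zip(ratios, features))

def fuse_concat(features):
    # Concatenation: the fused dimension is the sum of the input feature dimensions.
    return torch.cat(features, dim=-1)

# feat_a and feat_b are hypothetical image domain features of shape (batch, 512).
feat_a, feat_b = torch.randn(1, 512), torch.randn(1, 512)
fused_same_dim = fuse_weighted([feat_a, feat_b], ratios=[0.5, 0.5])   # shape (1, 512)
fused_concat = fuse_concat([feat_a, feat_b])                          # shape (1, 1024)
```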
In step 204, a composite image is generated from the fused image features.
Alternatively, the composite image may be generated using the aforementioned generative models; a trained generative model may be obtained from a legitimate channel, or model training may be performed using one's own sample data, which the present disclosure does not limit.
As an example, when the number of pieces of original data is two, two corresponding image domain features can be obtained. Referring to fig. 5, the two image domain features are fused, and the processing result (i.e., the fused image feature) is then input into the generative model to generate the composite image.
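Putting the steps together for the two-input case of fig. 5, a hedged end-to-end sketch is shown below; img_enc, txt_enc, and converter come from the earlier sketches, and `generator` stands for any pre-trained generative model that decodes a fused image feature into an image (its interface here is hypothetical).

```python
def synthesize(image, text, generator, image_ratio=0.5):
    # Step 201: extract the original features of the two pieces of original data.
    img_feat = img_enc(image)                  # already an image domain feature
    txt_feat = txt_enc(text)                   # non-image domain original feature
    # Step 202: convert the non-image original feature into the image domain.
    txt_img_feat = converter(txt_feat)
    # Step 203: fuse the two image domain features according to the fusion ratios.
    fused = image_ratio * img_feat + (1.0 - image_ratio) * txt_img_feat
    # Step 204: generate the composite image from the fused image feature.
    return generator(fused)
```

Note that when the user later adjusts the fusion ratio, only steps 203 and 204 need to be re-run; feature extraction and conversion do not need to be repeated.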
Optionally, before step 201, the image synthesis method according to an exemplary embodiment of the present disclosure further includes: displaying a user interface. By displaying the user interface, interaction with the user can be carried out through the user interface, realizing image synthesis based on the user's intention. It should be understood that when the image synthesis method according to the exemplary embodiments of the present disclosure runs as a client-side flow, the user interface may be displayed directly at the client; when it runs as a server-side flow, data of the user interface may be transmitted to the client, instructing the client to display the user interface according to the data.
Next, the content related to the user interface will be further described.
Optionally, a synthesis button and at least two input controls are displayed on the user interface. Accordingly, the step of receiving the at least two pieces of original data in step 201 includes: receiving the at least two pieces of input original data in response to input operations on the at least two input controls; and the step of extracting the at least two original features in step 201 includes: in response to a trigger operation on the synthesis button, inputting the at least two pieces of original data into the pre-trained feature extraction model to obtain the at least two original features in one-to-one correspondence with the at least two pieces of original data. With the at least two input controls displayed on the user interface, the original data to be synthesized that are input by the user can be received; with the synthesis button displayed on the user interface, the user can confirm that input of the original data is complete, which provides a clear starting point for subsequent operations such as feature extraction and reserves time for the user to input and confirm the original data, improving the practicality of the solution.
It should be appreciated that the user interface may provide a number of input controls, via which the user may input original data as desired. In one example, if the number of pieces of original data the user wants to synthesize is less than the number of input controls, input operations may be performed only on the corresponding number of input controls before the synthesis button is triggered, and the remaining non-operated input controls are by default regarded as having no original data input, so that the user can autonomously adjust the amount of original data. In another example, a quantity selection control for the original data may also be configured on the user interface, for example, but not limited to, in the form of an input box or a drop-down list, and a corresponding number of input controls is then displayed according to the quantity input by the user. In yet another example, a default number of input controls, e.g., two input controls, may be configured, together with an add control, with one input control added each time the user triggers the add control. All of the foregoing are implementations of the present disclosure and fall within its scope.
It should also be appreciated that when the image composition method according to the exemplary embodiment of the present disclosure is used as a flow of a client, the user's operation with respect to the user interface may be directly detected; when the image synthesis method according to the exemplary embodiment of the present disclosure is used as a process of the server, a client may receive an operation of a user on a user interface, and send operation corresponding data to the server in a form of a request, so that the server may perform corresponding steps in response to the client request. The same is true below, and will not be described again.
Further, a ratio adjustment control is also displayed on the user interface, and the ratio adjustment control is used for receiving the respective fusion ratios of the at least two pieces of original data. Accordingly, the image synthesis method according to an exemplary embodiment of the present disclosure further includes: receiving the respective fusion ratios of the at least two pieces of original data; and step 203 includes: performing fusion processing on the at least two image domain features corresponding to the at least two pieces of original data according to their respective fusion ratios to obtain the fused image feature. By configuring the ratio adjustment control, the fusion ratios input by the user can be received, which improves the flexibility of image synthesis, increases user participation, and helps obtain a composite image that meets the user's needs. It will be appreciated that the higher the fusion ratio of a piece of original data, the greater its influence on the composite image and the more similar the composite image is to that original data.
As an example, the ratio adjustment control may include an input box and/or a slider bar, with one ratio adjustment control configured for each piece of original data and a constraint preset that the fusion ratios sum to 1; when only one fusion ratio remains unset, it may be filled in automatically according to the sum-to-1 constraint. Further, when the number of pieces of original data is two, only one slider bar may be configured: the user adjusts the two fusion ratios by moving the tab on the slider bar, the constraint that the fusion ratios sum to 1 is naturally satisfied, and no additional constraint configuration is needed.
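As a small illustrative helper (an assumption of this sketch, not a detail of the disclosure), a single slider position can be mapped to the two fusion ratios so that the sum-to-1 constraint is satisfied automatically:

```python
def slider_to_ratios(position, minimum=0.0, maximum=100.0):
    # Map the slider tab position to the fusion ratios of the two pieces of original data.
    first_ratio = (position - minimum) / (maximum - minimum)
    return first_ratio, 1.0 - first_ratio

# The tab at the midpoint yields a 50% / 50% fusion ratio.
print(slider_to_ratios(50.0))   # (0.5, 0.5)
```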
Further, a display area is also displayed on the user interface. Accordingly, after step 204, the image synthesis method according to an exemplary embodiment of the present disclosure further includes: displaying the composite image in the display area; receiving an adjusted fusion ratio in response to an adjustment operation on the ratio adjustment control; and re-executing steps 203 and 204 and the step of displaying the composite image according to the adjusted fusion ratio, to obtain an updated composite image. By configuring the display area on the user interface, the input of original data and the display of the composite image can be realized on the same interface, which is convenient for the user to compare and check. In addition, the ratio adjustment control remains editable, so that the user can conveniently adjust the fusion ratio after viewing the composite image; on the one hand, a composite image satisfactory to the user can be obtained, and on the other hand, the steps of extracting and converting the original features do not need to be executed again, which reduces redundant computation and improves operation efficiency. Meanwhile, displaying the adjusted composite image in the display area allows the user to intuitively check the adjustment result and confirm whether further adjustment is needed, improving the convenience of the solution.
As an example, when the adjusted composite image is displayed, it may cover the original composite image, or the new and old composite images may be displayed simultaneously so that the user can compare them; alternatively, display options may be configured so that the user can select or change the display mode. The present disclosure does not limit this.
Fig. 6 is a schematic diagram illustrating a user interface according to a specific embodiment of the present disclosure.
Referring to fig. 6, two input controls are displayed on the user interface, with a synthesis button displayed between them; a ratio adjustment control in the form of a slider bar is arranged below these three elements, and a display area is arranged further below. The user can click the two input controls to input two pieces of original data, which may be pictures, text, or speech, then move the tab of the slider bar to adjust the fusion ratio of the two pieces of original data, and click the synthesis button to finish the input of information. The resulting composite image is displayed in the display area.
In one embodiment, a puppy picture as shown in fig. 7a and a coffee machine picture as shown in fig. 7b are input, and the fusion ratio is set to 50% each, resulting in a puppy style coffee machine picture as shown in fig. 7 c.
In another embodiment, a traffic icon picture with a camel pattern as shown in fig. 8a and a halftoning picture as shown in fig. 8b are input, and the blending ratio is set to 50% each, so that a traffic icon picture with a halftoning pattern as shown in fig. 8c can be obtained.
Fig. 9 is a block diagram illustrating an image synthesizing apparatus according to an exemplary embodiment of the present disclosure. It should be understood that the image composing apparatus according to the exemplary embodiments of the present disclosure may be implemented in a terminal device such as a smart phone, a tablet computer, a Personal Computer (PC), in software, hardware, or a combination of software and hardware, or may be implemented in a device such as a server.
Referring to fig. 9, an image synthesizing apparatus 900 includes an acquisition unit 901, a conversion unit 902, a fusion unit 903, and a generation unit 904.
The acquiring unit 901 may acquire at least two original features corresponding to at least two original data one by one, wherein the at least two original features include original features of at least one non-image domain.
The acquisition unit 901 may also receive at least two pieces of original data, which are data for characterizing the image content; inputting the at least two original data into a pre-trained feature extraction model to obtain at least two original features corresponding to the at least two original data one by one.
Optionally, the modalities of the at least two raw data comprise at least one of: image, text, voice.
Optionally, the feature extraction model is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model.
Optionally, the image synthesis apparatus 900 further includes a first training unit (not shown in the figure), where the first training unit is configured to: acquire multiple groups of first sample image data and first sample non-image data, where the image content represented by the first sample image data is consistent with the content represented by the first sample non-image data of the same group; input the multiple groups of first sample image data and first sample non-image data into a feature extraction model to be trained to obtain a first sample image feature corresponding to each piece of first sample image data and a first sample non-image feature corresponding to each piece of first sample non-image data; determine contrastive loss values between the first sample image features and the first sample non-image features; and adjust, based on a contrastive learning method and according to the contrastive loss values, parameters of the feature extraction model to be trained, to obtain the pre-trained feature extraction model.
The conversion unit 902 may input the original features of the at least one non-image domain into a pre-trained feature conversion model to obtain at least one image domain feature corresponding to the original features of the at least one non-image domain, so as to obtain at least two image domain features corresponding to the at least two original features.
Optionally, the feature conversion model is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model.
Optionally, the image synthesizing apparatus 900 further includes a second training unit (not shown in the figure), where the second training unit is configured to perform acquiring second sample image data and second sample non-image data, and the image content represented by the second sample image data is consistent with the content represented by the second sample non-image data; inputting the second sample image data and the second sample non-image data into a pre-trained feature extraction model to obtain second sample image features corresponding to the second sample image data and second sample non-image features corresponding to the second sample non-image data; inputting the non-image features of the second sample into a feature conversion model to be trained to obtain the features of an image domain corresponding to the non-image features of the second sample, wherein the features are used as converted image features; determining a loss value according to the converted image characteristic and the second sample image characteristic; and adjusting parameters of the feature conversion model to be trained according to the loss value to obtain a pre-trained feature conversion model.
The fusion unit 903 may perform fusion processing on the at least two image domain features to obtain a fused image feature.
The generating unit 904 may generate a composite image from the fused image feature.
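Taken together, the fusion unit 903 and the generating unit 904 can be sketched as follows, assuming an equal-weight average as the fusion processing and a small transposed-convolution decoder as the image generator; both are hypothetical placeholders rather than the models actually deployed.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(                                   # hypothetical image generator
    nn.Linear(512, 64 * 8 * 8), nn.Unflatten(1, (64, 8, 8)),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

def synthesize(image_domain_features):
    fused = torch.stack(image_domain_features).mean(dim=0)  # fused image feature
    return decoder(fused)                                   # composite image tensor

image_domain_features = [torch.randn(1, 512), torch.randn(1, 512)]
composite = synthesize(image_domain_features)               # shape (1, 3, 32, 32)
```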
Optionally, the image synthesizing apparatus 900 further includes a display unit (not shown in the figure), which may display a user interface on which a synthesis button and at least two input controls are displayed. The acquisition unit 901 may also receive the at least two pieces of input original data in response to an input operation on the at least two input controls, and, in response to a trigger operation on the synthesis button, input the at least two pieces of original data into the pre-trained feature extraction model to obtain the at least two original features in one-to-one correspondence with the at least two pieces of original data.
Optionally, a proportion adjustment control is further displayed on the user interface. The proportion adjustment control is configured to receive the respective fusion proportions of the at least two pieces of original data, and the fusion unit 903 may further perform fusion processing on the at least two image domain features corresponding to the at least two pieces of original data according to the respective fusion proportions of the at least two pieces of original data, to obtain the fused image feature.
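A minimal sketch of proportion-weighted fusion is given below, assuming that the fusion proportions received through the proportion adjustment control are normalized into weights for a weighted average of the image domain features; the exact weighting scheme is an assumption for illustration.

```python
import torch

def weighted_fuse(image_domain_features, fusion_proportions):
    """Fuse image-domain features according to user-set fusion proportions."""
    weights = torch.tensor(fusion_proportions, dtype=torch.float32)
    weights = weights / weights.sum()               # normalize the proportions
    stacked = torch.stack(image_domain_features)    # (k, batch, dim)
    return (weights.view(-1, 1, 1) * stacked).sum(dim=0)

fused = weighted_fuse([torch.randn(1, 512), torch.randn(1, 512)], [0.7, 0.3])
```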
Optionally, a display area is further displayed on the user interface, and the display unit may further display the composite image in the display area. The acquisition unit 901 may also receive an adjusted fusion proportion in response to an adjustment operation on the proportion adjustment control; the fusion unit 903 may then perform fusion processing on the at least two image domain features again according to the adjusted fusion proportion to obtain a new fused image feature and trigger the generating unit 904 again, and the display unit may further display the composite image regenerated by the generating unit 904 in the display area, thereby obtaining an updated composite image.
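The interactive update described above can be sketched as a simple callback that reuses the weighted_fuse and decoder helpers from the previous sketches; the callback signature and the display object are hypothetical placeholders for whichever UI framework renders the user interface.

```python
def on_proportion_adjusted(new_proportions, image_domain_features, display):
    # re-run the fusion processing with the adjusted fusion proportions
    fused = weighted_fuse(image_domain_features, new_proportions)
    # regenerate the composite image and refresh the display area
    display.show(decoder(fused))
```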
The specific manner in which the individual units of the apparatus in the above embodiments perform their operations has been described in detail in the embodiments of the method, and is not repeated here.
Fig. 10 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Referring to fig. 10, an electronic device 1000 includes at least one memory 1001 and at least one processor 1002. The at least one memory 1001 stores a set of computer-executable instructions that, when executed by the at least one processor 1002, perform an image synthesis method according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 1000 may be a PC, a tablet device, a personal digital assistant, a smartphone, or any other device capable of executing the above-described set of instructions. Here, the electronic device 1000 is not necessarily a single electronic device; it may also be any apparatus or collection of circuits capable of executing the above-described instructions (or instruction sets) individually or in combination. The electronic device 1000 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 1000, the processor 1002 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 1002 may execute instructions or code stored in the memory 1001, wherein the memory 1001 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 1001 may be integrated with the processor 1002, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory 1001 may include a separate device, such as an external disk drive, a storage array, or other storage device that may be used by any database system. The memory 1001 and the processor 1002 may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., so that the processor 1002 can read files stored in the memory.
In addition, the electronic device 1000 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 1000 may be connected to each other via buses and/or networks.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform an image synthesis method according to an exemplary embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (xD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the program. The computer program in the computer-readable storage medium described above can run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems so that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer program product comprising computer instructions which, when executed by at least one processor, cause the at least one processor to perform an image synthesis method according to an exemplary embodiment of the present disclosure.
According to the image synthesis method, apparatus, electronic device, and computer-readable storage medium of the exemplary embodiments of the present disclosure, image synthesis based on original data of multiple modalities can be realized, so that a user can flexibly select original data of a suitable modality according to his or her own needs, greatly expanding the application range and flexibility of image synthesis. In addition, converting the original features of the non-image domain corresponding to the original data of the non-image modality into image domain features improves the feature expression capability of the original data of the non-image modality; feature fusion is then completed using at least two image domain features to obtain the fused image feature, so that the image quality of the composite image can be ensured while the application range of image synthesis is expanded.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. An image synthesis method, characterized in that the image synthesis method comprises:
acquiring at least two original features corresponding to at least two original data one by one, wherein the at least two original features comprise original features of at least one non-image domain;
inputting the original features of the at least one non-image domain into a pre-trained feature conversion model to obtain at least one image domain feature corresponding to the original features of the at least one non-image domain so as to obtain at least two image domain features corresponding to the at least two original features;
performing fusion processing on the at least two image domain features to obtain a fused image feature;
and generating a composite image according to the fused image feature.
2. The image synthesis method according to claim 1, wherein the acquiring at least two original features in one-to-one correspondence with at least two pieces of original data comprises:
receiving the at least two pieces of original data, wherein the original data is data for representing image content;
inputting the at least two pieces of original data into a pre-trained feature extraction model to obtain the at least two original features in one-to-one correspondence with the at least two pieces of original data.
3. The image synthesis method according to claim 2, wherein before the receiving the at least two pieces of original data, the image synthesis method further comprises:
displaying a user interface, wherein a synthesis button and at least two input controls are displayed on the user interface;
wherein the receiving the at least two pieces of original data comprises:
receiving the at least two pieces of input original data in response to an input operation on the at least two input controls;
and the inputting the at least two pieces of original data into a pre-trained feature extraction model to obtain the at least two original features in one-to-one correspondence with the at least two pieces of original data comprises:
in response to a trigger operation on the synthesis button, inputting the at least two pieces of original data into the pre-trained feature extraction model to obtain the at least two original features in one-to-one correspondence with the at least two pieces of original data.
4. The image synthesis method according to claim 3, wherein a proportion adjustment control is further displayed on the user interface, the proportion adjustment control being configured to receive respective fusion proportions of the at least two pieces of original data, and the performing fusion processing on the at least two image domain features to obtain a fused image feature comprises:
performing fusion processing on the at least two image domain features corresponding to the at least two pieces of original data according to the respective fusion proportions of the at least two pieces of original data, to obtain the fused image feature.
5. The image synthesis method according to claim 4, wherein a display area is further displayed on the user interface, and wherein after the generating a composite image according to the fused image feature, the image synthesis method further comprises:
displaying the composite image in the display area;
receiving an adjusted fusion proportion in response to an adjustment operation on the proportion adjustment control;
and re-executing, according to the adjusted fusion proportion, the performing fusion processing on the at least two image domain features to obtain a fused image feature and the displaying the composite image in the display area, to obtain an updated composite image.
6. The image synthesis method of claim 2, further comprising a training step of the pre-trained feature extraction model, the training step of the pre-trained feature extraction model comprising:
acquiring a plurality of groups of first sample image data and first sample non-image data, wherein the image content represented by the first sample image data is consistent with the content represented by the first sample non-image data of the same group;
inputting the plurality of groups of first sample image data and first sample non-image data into a feature extraction model to be trained to obtain a first sample image feature corresponding to each piece of first sample image data and a first sample non-image feature corresponding to each piece of first sample non-image data;
determining a contrastive loss value between each of the first sample image features and each of the first sample non-image features;
and adjusting, based on a contrastive learning method, parameters of the feature extraction model to be trained according to the contrastive loss value to obtain the pre-trained feature extraction model.
7. The image synthesis method of claim 1, further comprising a training step of the pre-trained feature transformation model, the training step of the pre-trained feature transformation model comprising:
acquiring second sample image data and second sample non-image data, wherein the image content represented by the second sample image data is consistent with the content represented by the second sample non-image data;
inputting the second sample image data and the second sample non-image data into the pre-trained feature extraction model to obtain second sample image features corresponding to the second sample image data and second sample non-image features corresponding to the second sample non-image data;
inputting the second sample non-image features into a feature conversion model to be trained to obtain features of the image domain corresponding to the second sample non-image features as converted image features;
determining a loss value according to the converted image feature and the second sample image feature;
and adjusting parameters of the feature conversion model to be trained according to the loss value to obtain the pre-trained feature conversion model.
8. The image synthesis method according to claim 2, wherein
the modalities of the at least two pieces of original data comprise at least one of: image, text, and speech; and/or
the feature extraction model is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model; and/or
the feature conversion model is at least one of a diffusion model, a variational autoencoder model, and a generative adversarial network model.
9. An image synthesizing apparatus, characterized in that the image synthesizing apparatus comprises:
an acquisition unit configured to acquire at least two original features in one-to-one correspondence with at least two pieces of original data, wherein the at least two original features include original features of at least one non-image domain;
a conversion unit configured to input the original features of the at least one non-image domain into a pre-trained feature conversion model to obtain at least one image domain feature corresponding to the original features of the at least one non-image domain, so as to obtain at least two image domain features corresponding to the at least two original features;
a fusion unit configured to perform fusion processing on the at least two image domain features to obtain a fused image feature;
and a generating unit configured to generate a composite image according to the fused image feature.
10. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform the image synthesis method of any of claims 1 to 8.
11. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by at least one processor, cause the at least one processor to perform the image synthesis method of any of claims 1 to 8.
CN202310558639.0A 2023-05-17 2023-05-17 Image synthesis method, device, electronic equipment and computer readable storage medium Pending CN116503504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310558639.0A CN116503504A (en) 2023-05-17 2023-05-17 Image synthesis method, device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310558639.0A CN116503504A (en) 2023-05-17 2023-05-17 Image synthesis method, device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116503504A true CN116503504A (en) 2023-07-28

Family

ID=87320168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310558639.0A Pending CN116503504A (en) 2023-05-17 2023-05-17 Image synthesis method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116503504A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination